Web crawler | Getting started with the pyquery parsing library

Practical source code for web crawler development: https://github.com/MakerChen66/Python3Spider

Original work takes effort. Plagiarism and reprinting of this article are prohibited. It summarizes years of practical crawler development experience, and infringement will be pursued!

1. Introduction to pyquery

1.1 What is pyquery?

Previously, we introduced Beautiful Soup, a very powerful web page parsing library. But do you sometimes find it awkward or inconvenient to use? Do you feel that its CSS selector support is not powerful enough? pyquery is a parsing library that also uses CSS selectors. It offers not only many attribute methods and the basic CSS selectors, but also node operations and many pseudo-class selectors, so in these respects it is more powerful than Beautiful Soup.

1.2 Install pyquery

pip install pyquery -i https://pypi.doubanio.com/simple

You can also specify other mirror sources

1.3 Import pyquery

from pyquery import PyQuery as pq



2. Using pyquery

2.1 Initialization

pyquery can be initialized in several ways, for example from a string, a URL, or a file name. Let's look at each in turn.

String initialization
Let's start with an example:

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
 </div>
'''

doc = pq(html)
print(doc('li'))

First import the PyQuery class under the alias pq, then declare an HTML string and pass it to the PyQuery class as a parameter, which completes the initialization. Next, pass a CSS selector to the initialized object. Here we pass in li, so all the li nodes are selected and printed.

URL initialization
Besides strings, the initialization parameter can also be a URL:

from pyquery import PyQuery as pq

doc = pq(url='https://www.qiushibaike.com/hot/')
print(doc('title'))

The PyQuery object first requests this URL and then completes the initialization with the returned HTML content, which is equivalent to passing the source code of the web page to the PyQuery class as a string. It does the same thing as the following code:

from pyquery import PyQuery as pq
import requests

doc = pq(requests.get('https://www.qiushibaike.com/hot/').text)
print(doc('title'))

Both versions print the title node of the page.

File initialization
Besides strings and URLs, you can also pass a local file name. In this case, specify the parameter as filename:

from pyquery import PyQuery as pq

doc = pq(filename='demo.html')
print(doc('li'))

This requires a local HTML file, demo.html, whose content is the HTML to be parsed.

2.2 Basic CSS selectors

Let's use an example to see how pyquery's CSS selectors work:

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
 </div>
'''

doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))

After initializing the PyQuery object, we pass in the CSS selector #container .list li, which selects the node whose id is container, then the node inside it whose class is list, and then all the li nodes inside that. We then print the selection and its type.

The output contains the five li nodes, followed by <class 'pyquery.pyquery.PyQuery'>. As you can see, we have successfully selected the matching nodes, and the result is still of type PyQuery.

2.3 Finding Nodes

Next, we introduce some commonly used query methods. They are used in the same way as the corresponding methods in jQuery.

Child nodes
To find child nodes, use the find() method, which searches all descendant nodes. The parameter passed in is a CSS selector. To avoid repeating code, let's reuse the previous HTML:

doc = pq(html)
items = doc('.list')
# Use a separate name so that the html string is not overwritten
lis = items.find('li')
print(lis)
print(type(lis))

We select the node whose class is list, and then select all the li descendant nodes inside it. The result contains all five li nodes and is still of type PyQuery.

The find() method searches all descendants. If we only want the direct child nodes, we can use the children() method:

doc = pq(html)
items = doc('.list')
# Use a separate name so that the html string is not overwritten
lis = items.find('li')
print(lis.children())
print(type(lis.children()))

Here we select the node whose class is list, then all the li descendant nodes inside it, and finally the direct child nodes of those li nodes, which are the a nodes.

The children() method also accepts a CSS selector to filter the child nodes, as follows:

items = doc('.list')
lis = items.children('.active')
print(lis)

The output contains only the li child nodes whose class includes active, that is, the third and fourth items.
Parent nodes
We can use the parent() method to get the direct parent node of a node, as follows:

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
container = items.parent()
print(container)
print(type(container))

We first select the node whose class is list and then call the parent() method to get its direct parent node, which is still of type PyQuery. Here the parent is the div node whose id is container.

To get the ancestor nodes, use the parents() method, as follows:

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
parents = items.parents()
print(parents)
print(type(parents))

Similarly, the parents() method can also pass in CSS selectors, as follows:

wrap = items.parents('.wrap')
print(wrap,type(wrap))

Sibling nodes
To obtain sibling nodes, use the siblings() method, as follows:

doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())

The third li node is selected here. It has four sibling nodes: the first, second, fourth, and fifth li nodes, and all of them are returned.

Similarly, the siblings() method accepts a CSS selector to filter the sibling nodes, as follows:

doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))

Only the fourth li node is returned, since it is the only sibling whose class includes active.

2.4 Traversal

From the above, we can see that a pyquery selection may contain a single node or multiple nodes, but in both cases its type is PyQuery, rather than a list as in Beautiful Soup.

A single node can be printed directly or converted to a string, as follows:

doc = pq(html)
li = doc('.list .item-0.active')
print(li)
print(str(li))

Both statements print the same HTML for the node.

For a result with multiple nodes, you need to traverse it. First call the items() method to get a generator, and then iterate over each li node, as follows:

doc = pq(html)
lis = doc('li').items()
print(lis)
for li in lis:
  print(li,type(li))

Calling items() returns a generator, which must be traversed to get the li node objects one by one. Each node is also of type PyQuery.

Extension: you can also use enumerate() when traversing, so that you get the index of each node:

doc = pq(html)
lis = doc('li').items()
print(lis)
for i,li in enumerate(lis):
  print(i,li,type(li))

Each line of output now starts with the index of the node, from 0 to 4.

2.5 Access to information

After finding a node, our ultimate goal is of course to extract the information it contains. The most important kinds of information are attributes and text.

Getting attributes
Once you have a node of type PyQuery, you can call the attr() method to get an attribute, as follows:

doc = pq(html)
a = doc('.item-0.active a')
print(a,type(a))
print(a.attr('href'))

This prints the selected a node and its type, followed by the value of its href attribute, link3.html.

Attributes can also be read through the attr attribute:

print(a.attr.href)

The two approaches give exactly the same result.
Note: when the selection contains multiple nodes, calling the attr() method returns the attribute of the first node only. To get the attribute of every node, use the traversal described above.

Getting text
Call the text() method to get the text inside a node, as follows:

doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

The text() method ignores the HTML tags inside the node and returns only the plain text, here third item.

To get the HTML inside the node, use the html() method, as follows:

doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

The html() call returns the markup inside the li node, that is, the a node and its content.


3. Node operation

pyquery provides a series of methods for modifying nodes dynamically, such as adding or removing an attribute on a node or removing a node. These operations sometimes make extracting information much more convenient.

addClass and removeClass
Let's use an example to get a feel for them:

from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

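The li node is printed three times: first with the class item-0 active, then with active removed, and finally with active added back. The addClass() and removeClass() methods can therefore change the class attribute of a node dynamically.

attr, text and html
Besides the class attribute, you can use the attr() method to change other attributes, and the text() and html() methods to change the content inside a node. For example: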

html = '''
<ul class="list">
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name','link')
print(li)
li.text('changed data')
print(li)
li.html('<span>changed data</span>')
print(li)

First we select the li node and call the attr() method to modify an attribute. The first parameter of this method is the attribute name, and the second is the attribute value. Then we call the text() and html() methods to change the content inside the node, printing the node after each step.

In the output, the li node first gains the attribute name="link", its content is then replaced with the text changed data, and finally with the span node.

Note: if the attr() method is called with only the attribute name, it returns the attribute value; if a second parameter is passed in, it sets the attribute value. Likewise, the text() and html() methods return the plain text and the HTML inside the node when called without parameters, and assign new content when called with a parameter.

remove
The remove() method removes a node, which sometimes makes extracting information much more convenient. In the example below, we want only the text hello,world, without the text of the p node:

html = '''
  <div class="wrap">
    hello,world
    <p>this is a paragraph</p>
  </div>
'''
doc = pq(html)
wrap = doc('.wrap')
wrap('p').remove()
print(wrap.text())

The output is as follows:

hello,world



4. Pseudo class selector

Another important reason why CSS selectors are powerful is that they support a variety of pseudo-class selectors, such as selecting the first node, the last node, odd and even nodes, nodes containing a certain text, etc.

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
 </div>
'''
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)

Here we use CSS3 pseudo-class selectors, together with jQuery-style extensions such as :gt() and :contains(), to select in turn the first li node, the last li node, the second li node, the li nodes after the third one, the li nodes at even positions, and the li node containing the text second.
Summary
So far, we have covered the common usage of pyquery. It is very powerful, and it offers many operations that parsing libraries such as lxml and Beautiful Soup do not provide.

For more on CSS selectors, please refer to:
https://www.w3school.com.cn/css/index.asp

For more usage of the pyquery parsing library, please refer to the link:
https://pyquery.readthedocs.io

5. Author Info

Author: Xiaohong's Fishing Daily, Goal: Make programming more interesting!

Original WeChat public account: "Xiaohong Xingkong Technology", focusing on algorithms, crawlers, websites, game development, data analysis, natural language processing, AI, and more. Looking forward to your follow, and let us grow and code together!


Origin blog.csdn.net/qq_44000141/article/details/121568618