Python 3 Web Crawler in Practice (30): PyQuery

In the previous section we introduced BeautifulSoup, a very powerful web page parsing library. Did some of its methods feel awkward to use, though? Did its CSS selector support seem less powerful than you would like?

If you have done some web development, prefer CSS selectors, or know jQuery, there is a parsing library that may suit you better: PyQuery.

Let's see what PyQuery can do.

1. Preparations

Before you begin, make sure PyQuery is installed correctly. If it is not, follow the installation instructions in Chapter 1.

2. Initialization

As with BeautifulSoup, PyQuery must first be initialized with an HTML data source, which produces the object you then operate on. It can be initialized in several ways: by passing a string, a URL, or a file name. We explain each in detail below.

String initialization

Let's start with an example:

html = '''
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

Output:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

Here we first import the PyQuery class and alias it as pq. We then declare a long HTML string and pass it to PyQuery as an argument, which completes the initialization. Next, we pass a CSS selector to the initialized object. In this example we pass li, which selects all li nodes, and the printed output shows the HTML text of every li node.

URL initialization

The argument does not have to be a string. You can also pass the URL of a web page by specifying it as the url parameter:

from pyquery import PyQuery as pq
doc = pq(url='http://www.segmentfault.com')
print(doc('title'))

Output:

<title>SegmentFault 思否</title>

In this case PyQuery first requests the URL and then completes the initialization with the HTML it receives. This is equivalent to passing the page's source code to PyQuery as a string.

The following code does the same thing:

from pyquery import PyQuery as pq
import requests
doc = pq(requests.get('http://www.segmentfault.com').text)
print(doc('title'))

File initialization

Besides a URL, you can also pass a local file name by specifying it as the filename parameter:

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

This requires a local HTML file, demo.html, whose content is the HTML string to be parsed. PyQuery first reads the contents of the local file and then initializes itself with that content as a string.

All three initialization methods work. The most common one is to pass a string.

3. Basic CSS selectors

Let's use an example to see how CSS selectors work in PyQuery:

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))

Output:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>

Here, after initializing the PyQuery object, we pass the CSS selector #container .list li, which selects all li nodes inside the node whose class is list, which in turn is inside the node whose id is container. Printing the result shows that the matching nodes were selected successfully.

We then print its type, which is still PyQuery.

4. Finding nodes

Here we introduce some commonly used query functions. They are used in the same way as their jQuery counterparts.

Child nodes

To find child nodes, use the find() method, which takes a CSS selector as its argument. We continue with the HTML from the previous example:

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

Output:

<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

First we select the node whose class is list, then call find() with a CSS selector to select the li nodes inside it, and finally print the results. The find() method returns all nodes that match the selector, and the result is of type PyQuery.

In fact, find() searches all descendants of the node. If we only want direct children, we can use the children() method:

lis = items.children()
print(type(lis))
print(lis)

Output:

<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

To keep only the child nodes that meet a condition, for example the child nodes whose class is active, pass the CSS selector .active to children():

lis = items.children('.active')
print(lis)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>

The output has been filtered, leaving only the nodes whose class is active.

Parent nodes

We can use the parent() method to get the parent of a node. Here is an example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(type(container))
print(container)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Here we first select the node whose class is list with .list, then call parent() to get its parent node, whose type is still PyQuery.

The parent here is the direct parent of the node. parent() does not go on to look for the parent's parent, that is, the ancestor nodes.

But what if we want the ancestor nodes? Use the parents() method:

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
 <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>

Here we call parents(), and the output contains two nodes: one is the node whose class is wrap, the other is the node whose id is container. In other words, parents() returns all ancestor nodes.

To filter for a particular ancestor, pass a CSS selector to parents(), which then returns only the ancestors that match the selector:

parent = items.parents('.wrap')
print(parent)

Output:

<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>

The output now has one node fewer, leaving only the node whose class is wrap.

Sibling nodes

Above we covered child and parent nodes. There is one more kind, sibling nodes, which you can get with the siblings() method. We again use the HTML above as an example:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())

Here we first select the node inside the list node whose classes are item-0 and active, which is the third li node. It clearly has four siblings: the first, second, fourth, and fifth li nodes.

Output:

<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

The result is exactly the four siblings we just described.

To filter the siblings, we can again pass a CSS selector to the method, which picks the matching nodes out of all the siblings:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))

Here we keep only the siblings whose class is active. From the earlier output we can see that the only such sibling is the fourth li node, so the result should contain one node.

Output:

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

5. Traversal

As we have just seen, a PyQuery selection may contain multiple nodes or a single node. Either way its type is PyQuery, and it is not returned as a plain list the way BeautifulSoup does.

A single node can be printed directly or converted to a string:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(str(li))

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

For a result with multiple nodes, we need to traverse it to get each one. For example, to traverse every li node here, call the items() method:

from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li, type(li))

Output:

<class 'generator'>
<li class="item-0">first item</li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a href="link2.html">second item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>

After calling items() we get a generator. Iterating over it yields the li nodes one by one, each of type PyQuery, so each li node can in turn call the methods described earlier to make further selections, such as querying child nodes or finding ancestor nodes. This is very flexible.

6. Extracting information

After extracting a node, our ultimate goal is of course to extract the information it contains. There are two important kinds of information: attributes and text. We cover each below.

Getting attributes

Once we have a node of type PyQuery, we can call the attr() method to get its attributes:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a, type(a))
print(a.attr('href'))

Output:

<a href="link3.html"><span class="bold">third item</span></a> <class 'pyquery.pyquery.PyQuery'>
link3.html

Here we first select the a node inside the li node whose classes are item-0 and active. Its type is PyQuery.

We then call attr() with the attribute name to get the value of that attribute.

The attribute can also be read through the attr property, like this:

print(a.attr.href)

Output:

link3.html

The result is exactly the same. Here we do not call a method. Instead we access the attr property and then the attribute name on it, which also returns the attribute value.

What happens if we select multiple elements and then call attr()? Let's test it with an example:

a = doc('a')
print(a, type(a))
print(a.attr('href'))
print(a.attr.href)

Output:

<a href="link2.html">second item</a><a href="link3.html"><span class="bold">third item</span></a><a href="link4.html">fourth item</a><a href="link5.html">fifth item</a> <class 'pyquery.pyquery.PyQuery'>
link2.html
link2.html

We expected to select four a nodes, and the printed result does contain four, but attr() returns the attribute of only the first one.

So when the result contains multiple nodes, attr() returns only the attribute of the first node.

In that case, to get the attribute of every a node, we need the traversal described above:

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('a')
for item in a.items():
    print(item.attr('href'))

Output:

link2.html
link3.html
link4.html
link5.html

So when getting attributes, check whether the selection contains one node or several. If it contains several, traverse it to get the attribute of each node.

Getting text

The other main operation after selecting a node is getting the text inside it, which we can do with the text() method:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

Output:

<a href="link3.html"><span class="bold">third item</span></a>
third item

We first select an a node and then call text() to get the text inside it. text() ignores all the HTML inside the node and returns only the plain text.

To get the HTML inside the node instead, use the html() method:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

Here we select the third li node and call html(), which should return all the HTML inside the li node.

Output of the html() call:

<a href="link3.html"><span class="bold">third item</span></a>

The same question arises here: if the selection contains multiple nodes, what do text() and html() return?

Let's look at an example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li')
print(li.html())
print(li.text())
print(type(li.text()))

Output:

<a href="link2.html">second item</a>
second item third item fourth item fifth item
<class 'str'>

The result may be surprising. We selected all the li nodes, yet html() returns only the inner HTML of the first li node, while text() returns the plain text of all the li nodes, separated by spaces, as a single string.

This is worth noting: if the selection contains multiple nodes and you want the inner HTML of each, you need to traverse the nodes. text() needs no traversal, because it takes the text of every node and merges it into one string.

7. Node manipulation

PyQuery provides a range of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can sometimes make extracting information much more convenient.

There are too many node manipulation methods to cover them all, so we illustrate their use with a few typical examples.

addClass and removeClass

Let's start with an example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

First we select the third li node and call removeClass() to remove the active class from it. We then call addClass() to add the class back. After each operation we print the current content of the li node.

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

There are three outputs in total. In the second, the active class of the li node has been removed, and in the third it has been added back.

So addClass() and removeClass() can change a node's class attribute dynamically.

attr, text, html

Besides manipulating the class attribute, you can use attr() to manipulate a specific attribute, and text() and html() to change the content inside a node.

Here is an example:

html = '''
<ul class="list">
     <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')
print(li)
li.text('changed item')
print(li)
li.html('<span>changed item</span>')
print(li)

Here we first select the li node and call attr() to modify an attribute. The first argument is the attribute name and the second is the attribute value. We then call text() and html() to change the content inside the node. After each of the three operations we print the current li node.

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link">changed item</li>
<li class="item-0 active" name="link"><span>changed item</span></li>

After the attr() call, the li node has a new attribute, name, that did not exist before, with the value link. After the text() call, all the content inside the li node is replaced by the text we passed in. After the html() call, the inside of the li node is replaced by the HTML we passed in.

So if attr() is given only the attribute name, it gets the attribute value. If it is also given a second argument, it sets the attribute. Likewise, text() and html() called without arguments get the plain text and inner HTML of the node, and called with an argument they assign it.

remove

As the name suggests, remove() removes nodes, which can sometimes make extracting information much more convenient. Let's look at an example:

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())

We have a piece of HTML here and want to extract the string Hello, World without the text inside the p node. How do we do that?

As a first attempt we extract the content of the node whose class is wrap directly, to see whether it is what we want. The result is:

Hello, World This is a paragraph.

However, the result also contains the content of the inner p node, because text() extracts all the plain text inside the node. To get rid of the text of the p node, we could select the p node, extract its text, and then remove that substring from the full result, but this is clearly cumbersome.

This is where remove() comes in handy. We can do the following instead:

wrap.find('p').remove()
print(wrap.text())

We first select the p node and call remove() to remove it, so that only Hello, World is left inside wrap, which we can then extract with text().

So remove() lets us strip out redundant content to make extraction easier. Used at the right moment, it can greatly improve efficiency.

There are many other node manipulation methods, such as append(), empty(), and prepend(). They are used in exactly the same way as in jQuery. For details, see the official documentation: http://pyquery.readthedocs.io...

8. Pseudo-class selectors

An important reason CSS selectors are so powerful is that they support a wide variety of pseudo-class selectors, for example selecting the first node, the last node, nodes at odd or even positions, or nodes that contain certain text. Here is an example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)

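Here we use pseudo-class selectors to select, in order, the first li node, the last li node, the second li node, the li nodes after the third one, the li nodes at even positions, and the li node whose text contains second. Most of these are CSS3 pseudo-classes, while :gt() and :contains() are jQuery-style extensions that PyQuery also supports. This is very powerful.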

9. Conclusion

This concludes the introduction to the common usage of PyQuery. Thanks for reading.


Origin blog.51cto.com/14445003/2426470