Python3 [Parsing library pyquery]

1. Introduction to pyquery

The previous article covered Beautiful Soup, and you may have found its CSS selector support limited. pyquery, covered next, makes up for that with jQuery-style selectors.

Installation:

pip install pyquery

2. Basic use

html ='''
<!DOCTYPE html>
<html>
<head>
    <title>故事</title>
</head>
<body>
   <p class="title" name="dromouse"><b>这个是dromouse</b></p>
   <p class="story">Once upon a time there were three little sister;
       and their names were
       <a href="http://www.baidu.com" class="sister" id="link1"><!--GH--></a>
       <a href="http://www.baidu.com/oracle" class="sister" id="link2">Local</a>and
       <a href="http://www.baidu.com/title" class="sister" id="link3">Tillie</a>;
   and they lived at the bottom of a well.</p>
   <p class="story">...</p>

</body>
</html>

'''

from pyquery import PyQuery as pq

# Pass the HTML string to pq() to create a PyQuery object
doc = pq(html)

# Print every p tag
print(doc('p'))

You can also pass in a URL directly:

from pyquery import PyQuery as pq

# Pass the url argument and pq() fetches and parses the page
doc = pq(url="https://www.baidu.com")

print(doc('title'))

You can also fetch the page with requests and pass in the response text:

import requests

from pyquery import PyQuery as pq

url = "https://www.baidu.com"
# Pass the downloaded HTML text to pq()
doc = pq(requests.get(url).text)

print(doc('title'))

Initializing from a local file:

from pyquery import PyQuery as pq

# Pass the filename argument and pq() reads and parses the file
doc = pq(filename="demo.html")

print(doc('title'))

3. Basic CSS selectors

1. Basic use

from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")

div = doc('.card .lazyload ')

print(div)

2. Find nodes

This section introduces the query methods. They work the same way as their jQuery counterparts.

1. Child nodes

The find() method searches all descendant nodes:

from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")

div = doc('.card')

# Use find() to locate img tags at any depth below .card
img = div.find('img')

print(img)

If you only want the direct child nodes, use the children() method:

from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")

div = doc('.card')

# Use children() to find only the direct child a tags
a = div.children('a')

print(a)

2. Parent nodes

Use the parent() method to get the direct parent of a node:

from pyquery import PyQuery as pq

doc = pq(url="")

# Locate the node first
items = doc('.fa')

# Get the direct parent of the selected node
container = items.parent()

print(container)

To get the ancestor nodes (the parent, the parent's parent, and so on), use parents():

from pyquery import PyQuery as pq

doc = pq(url="https://.com")

# Locate the node first
items = doc('.fa')

# Get all ancestors of the selected node
container = items.parents()

print(container)

3. Sibling nodes

To get the sibling nodes, use the siblings() method:

from pyquery import PyQuery as pq

doc = pq(url="")

# Locate the element first
items = doc('.card-text')

# Get the sibling nodes at the same level
container = items.siblings()

print(container)

4. Traversal

A pyquery selection may contain one node or many. To process a multi-node result, iterate over it with items().

from pyquery import PyQuery as pq

doc = pq(url="")

# Select multiple nodes; items() returns a generator of PyQuery objects
cards = doc('.card').items()

print(type(cards))

# Iterate over the nodes and print each one
for div in cards:

    print(div)

5. Get information

Once the nodes are located, the next step is to extract information from them: either attributes or text.

Get attributes with attr()

from pyquery import PyQuery as pq

doc = pq(url="")

# Multiple nodes match; items() lets us visit every one of them
imgs = doc('.card .lazyload').items()

for i in imgs:
    # Get the attribute
    img_href = i.attr('data-src')
    # Equivalent: img_href = i.attr['data-src']
    print(img_href)

Get text with text() and html()

from pyquery import PyQuery as pq

doc = pq(url="")

# Multiple nodes match; items() lets us visit every one of them
infos = doc('.btn').items()

for i in infos:
    # Get the plain text
    info = i.text()
    # Get the inner HTML
    info_html = i.html()
    print(info, info_html)

When given a value, the same methods act as setters: attr(name, value), text("new text") and html("<a></a>") modify the node directly.

6. Node operations

pyquery provides a series of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can make extracting information much easier.

The addClass() and removeClass() methods dynamically change the class attribute of a node:

from pyquery import PyQuery as pq

doc = pq(url="")

# Query the img nodes
img = doc('.card a img')

print(img)
# Remove the lazyload class
img.removeClass('lazyload')

print(img)

7. remove()

The remove() method deletes the selected element:

from pyquery import PyQuery as pq
html = '''

<div class="wrap">
     Hello,world
     <p>This is a man</p>
    </div>
'''
doc = pq(html)

# Get only the "Hello,world" text
wrap = doc('.wrap')

wrap.find('p').remove()

print(wrap.text())

Other commonly used methods include append(), empty() and prepend(). They are used exactly as in jQuery.

Official documentation: http://pyquery.readthedocs.io/en/latest/api.html

8. Pseudo-class selector

from pyquery import PyQuery as pq
html = '''

<!DOCTYPE html>
<html>
<head>
    <title>故事</title>
</head>
<body>
   <p class="title" name="dromouse"><b>这个是dromouse</b></p>
   <p class="story">Once upon a time there were three little sister;
       and their names were
       <a href="http://www.baidu.com" class="sister" id="link1"><!--GH--></a>
       <a href="http://www.baidu.com/oracle" class="sister" id="link2">Local</a>and
       <a href="http://www.baidu.com/title" class="sister" id="link3">Tillie</a>;
   and they lived at the bottom of a well.</p>
   <p class="story">...</p>
   <ul>
       <li>1</li>
       <li>2</li>
       <li>3</li>
       <li>4</li>
   </ul>
    
</body>
</html>

'''
doc = pq(html)

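# First li in its parent
li_f = doc("li:first-child")
# Last li in its parent
li_l = doc("li:last-child")
# Second li
li_n = doc("li:nth-child(2)")
# Every even-numbered li
li_even = doc("li:nth-child(2n)")
# li nodes whose text contains "were"
li_text = doc("li:contains(were)")
# li nodes with a zero-based index greater than 2
li = doc("li:gt(2)")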