1. Introduction to pyquery
The previous section covered Beautiful Soup, and you may have found its CSS selector support limited. This section introduces pyquery, which makes up for that with jQuery-style CSS selectors.
Installation:
pip install pyquery
2. Basic use
from pyquery import PyQuery as pq

html = '''
<!DOCTYPE html>
<html>
<head>
<title>Story</title>
</head>
<body>
<p class="title" name="dromouse"><b>This is dromouse</b></p>
<p class="story">Once upon a time there were three little sister; and their names were
<a href="http://www.baidu.com" class="sister" id="link1"><!--GH--></a>
<a href="http://www.baidu.com/oracle" class="sister" id="link2">Local</a>and
<a href="http://www.baidu.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
'''

# Pass the HTML string to pq() to initialize a PyQuery object
doc = pq(html)
# Print all <p> tags
print(doc('p'))
You can also pass a URL directly:
from pyquery import PyQuery as pq

# Pass a URL; pyquery fetches the page and parses it
doc = pq(url="https://www.baidu.com")
print(doc('title'))
You can also fetch the page yourself with requests and pass in the response text:
import requests
from pyquery import PyQuery as pq

url = "https://www.baidu.com"
# Pass the response text to pq() to initialize a PyQuery object
doc = pq(requests.get(url).text)
print(doc('title'))
Initialize from a local file:
from pyquery import PyQuery as pq

# Pass a local file name to pq() to initialize a PyQuery object
doc = pq(filename="demo.html")
print(doc('title'))
3. Basic CSS selectors
1. Basic use
from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")
# Select elements with class "lazyload" nested inside elements with class "card"
div = doc('.card .lazyload')
print(div)
2. Find nodes
This section introduces the query methods. They are used the same way as the corresponding functions in jQuery.
1. Child Node
The find() method returns all descendant nodes that match the selector:
from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")
div = doc('.card')
# Use find() to locate matching descendants
img = div.find('img')
print(img)
If you only want direct child nodes, use the children() method:
from pyquery import PyQuery as pq

doc = pq(url="https://www.baidu.com")
div = doc('.card')
# Use children() to find direct child nodes only
links = div.children('a')
print(links)
2. The parent node
The parent() method returns the direct parent of a node:
from pyquery import PyQuery as pq

doc = pq(url="")  # URL left blank in the original
# First locate the node
items = doc('.fa')
# Get its direct parent node
container = items.parent()
print(container)
To reach ancestors further up, such as the grandparent (the parent of the parent), use parents(). It returns all ancestor nodes, and you can pass a selector to narrow the result:
from pyquery import PyQuery as pq

doc = pq(url="https://.com")  # domain elided in the original
# First locate the node
items = doc('.fa')
# Get all ancestor nodes
ancestors = items.parents()
print(ancestors)
3. Sibling nodes
To get the sibling nodes of a node, use the siblings() method:
from pyquery import PyQuery as pq

doc = pq(url="")  # URL left blank in the original
# First locate the element
items = doc('.card-text')
# Get the sibling nodes at the same level
sibling_nodes = items.siblings()
print(sibling_nodes)
4. Traverse
A pyquery selection may match a single node or several. To process several matches one at a time, call items() and iterate over the result.
from pyquery import PyQuery as pq

doc = pq(url="")  # URL left blank in the original
# items() returns a generator over the matched nodes
divs = doc('.card').items()
print(type(divs))
# Iterate and print each node
for div in divs:
    print(div)
5. Get information
Once a node has been selected, the next step is to extract information from it: either its text or its attributes.
Get an attribute with attr():
from pyquery import PyQuery as pq

doc = pq(url="")  # URL left blank in the original
# Several nodes match; items() lets us visit each one
imgs = doc('.card .lazyload').items()
for i in imgs:
    # Get the attribute by name
    img_href = i.attr('data-src')
    # i.attr.data-src is not valid Python; use the call form for hyphenated names
    print(img_href)
Get the text with text() and the inner HTML with html():
from pyquery import PyQuery as pq

doc = pq(url="")  # URL left blank in the original
# Several nodes match; items() lets us visit each one
infos = doc('.btn').items()
for i in infos:
    # Plain text of the node
    info = i.text()
    # Inner HTML of the node
    info_html = i.html()
    print(info, info_html)
When called with a value, the same methods act as setters: attr(name, value), text("new text"), and html("<a></a>") modify the selected nodes directly.
6. Node operation
pyquery provides a series of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can make it much easier to extract information.
The addClass() and removeClass() methods change the class attribute of a node:
from pyquery import PyQuery as pq

doc = pq(url="")  # URL left blank in the original
# Query the images
img = doc('.card a img')
print(img)
# Remove the "lazyload" class
img.removeClass('lazyload')
print(img)
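7. remove()
The remove() method deletes the selected elements from the document. This is useful when you want a node's text without the text of one of its children:
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
Hello,world
<p>This is a man</p>
</div>
'''
doc = pq(html)
# Goal: get only "Hello,world"
wrap = doc('.wrap')
# Remove the <p> child so its text is not included
wrap.find('p').remove()
print(wrap.text())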
Other commonly used methods, such as append(), empty(), and prepend(), work the same way as their jQuery counterparts.
Official documentation: http://pyquery.readthedocs.io/en/latest/api.html
8. Pseudo-class selector
from pyquery import PyQuery as pq

html = '''
<!DOCTYPE html>
<html>
<head>
<title>Story</title>
</head>
<body>
<p class="title" name="dromouse"><b>This is dromouse</b></p>
<p class="story">Once upon a time there were three little sister; and their names were
<a href="http://www.baidu.com" class="sister" id="link1"><!--GH--></a>
<a href="http://www.baidu.com/oracle" class="sister" id="link2">Local</a>and
<a href="http://www.baidu.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<ul>
<li>1</li>
<li>2</li>
<li>3</li>
<li>4</li>
</ul>
</body>
</html>
'''
doc = pq(html)
# First <li> child of its parent
li_f = doc("li:first-child")
# Last <li> child of its parent
li_l = doc("li:last-child")
# The second child
li_n = doc("li:nth-child(2)")
# Every even-numbered child
li_even = doc("li:nth-child(2n)")
# <li> elements whose text contains "were" (none in this document)
li_text = doc("li:contains(were)")
# <li> elements whose zero-based index is greater than 2
li = doc("li:gt(2)")