A Summary of Commonly Used BeautifulSoup4 Parsing Methods for Python Web Scraping

Today I'm sharing a summary of commonly used BeautifulSoup4 parsing methods for Python web scraping. The content is practical and has good reference value, so friends who need it can follow along and take a look.
Summary

How to parse web pages in various situations with beautifulsoup4
Using beautifulsoup4

The official documentation already covers beautifulsoup4 in great detail; here I have summarized some commonly used parsing methods for easy reference.

Loading an HTML document

The first step is to load the HTML document into beautifulsoup to create a BeautifulSoup object.

import requests
from bs4 import BeautifulSoup

url = "http://new.qq.com/omn/20180705/20180705A0920X.html"
r = requests.get(url)
htmls = r.text
# print(htmls)
soup = BeautifulSoup(htmls, 'html.parser')

When initializing the BeautifulSoup class, two arguments are needed: the first is the HTML source we crawled, and the second is the HTML parser. There are three commonly used parsers: "html.parser", "lxml", and "html5lib". The official documentation recommends lxml because it is more efficient; of course, it has to be installed first with pip install lxml.

Of course, the three parsers can return different results in some cases, for example with incomplete markup like the following (only part of a p tag):

soup = BeautifulSoup("<a></p>", "html.parser")
# Only the open a tag is completed; the stray end tag is simply ignored
# Result: <a></a>
soup = BeautifulSoup("<a></p>", "lxml")
# Result: <html><body><a></a></body></html>
soup = BeautifulSoup("<a></p>", "html5lib")
# html5lib generally autocompletes the missing tags
# Result: <html><head></head><body><a><p></p></a></body></html>

Usage

Below I introduce the methods roughly in the order of how frequently I use them, for easier lookup ~

Getting a single tag by tag name, id, class, and other information

html = '<p class="title" id="p1"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, 'lxml')
# Get the entire p tag by its class name
soup.find(class_="title")
# or
soup.find("p", class_="title", id="p1")
# Get the text content of the p tag with class "title": "The Dormouse's story"
soup.find(class_="title").get_text()
# A separator can be specified between the text of different nested tags,
# and leading/trailing whitespace can optionally be stripped.
soup = BeautifulSoup('<p class="title" id="p1"><b>The Dormouse\'s story</b></p><p class="title" id="p1"><b>The Dormouse\'s story</b></p>', "html5lib")
soup.find(class_="title").get_text("|", strip=True)
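As a quick illustration (my own example, not from the original post), the separator and strip arguments behave like this on a p tag containing several nested tags:

soup = BeautifulSoup('<p class="title"> <b>The</b> <i>story</i> </p>', "html5lib")
# strip=True drops the whitespace-only strings; "|" joins the remaining pieces
print(soup.find(class_="title").get_text("|", strip=True))  # prints: The|story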




import re
# class_ can also take a regular expression to match class names
soup.find_all(class_=re.compile("tit"))
# recursive parameter: with recursive=False, only the tag's direct children are searched
soup = BeautifulSoup('<html><head><title>abc', 'lxml')
soup.html.find_all("title", recursive=False)
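For comparison, a quick sketch (my own addition) of what the two calls return; the <title> tag lives inside <head>, so it is not a direct child of <html>:

soup = BeautifulSoup('<html><head><title>abc</title></head></html>', 'lxml')
print(soup.html.find_all("title"))                   # [<title>abc</title>]
print(soup.html.find_all("title", recursive=False))  # []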

Getting multiple tags by tag name, id, and class information

soup = BeautifulSoup('<p class="title" id="p1"><b>The Dormouse\'s story</b></p><p class="title" id="p1"><b>The Dormouse\'s story</b></p>', "html5lib")
# Get all tags whose class is "title"
for i in soup.find_all(class_="title"):
    print(i.get_text())
# Get only a limited number of tags whose class is "title"
for i in soup.find_all(class_="title", limit=2):
    print(i.get_text())

Getting a tag by its other attributes

html = '<a alog-action="qb-ask-uname" href="/usercent" rel="external nofollow" target="_blank">Snail Song</a>'
soup = BeautifulSoup(html, 'lxml')
# To get "Snail Song": the tag has neither a class nor an id,
# so the lookup rule has to be defined from its other attributes
author = soup.find('a', {"alog-action": "qb-ask-uname"}).get_text()
# or
author = soup.find(attrs={"alog-action": "qb-ask-uname"})

Finding tags before and after the current tag

soup.find_all_previous("p")
soup.find_previous("p")
soup.find_all_next("p")
soup.find_next("p")
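These methods search the parts of the document that come before or after the current tag in document order. A small sketch (my own example) of how they might be used:

html = '<p id="a">first</p><div><p id="b">second</p></div><p id="c">third</p>'
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
print(div.find_previous("p"))   # <p id="a">first</p>
print(div.find_next("p"))       # <p id="b">second</p>
print(div.find_all_next("p"))   # [<p id="b">second</p>, <p id="c">third</p>]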

Finding parent tags

soup.find_parents("div")
soup.find_parent("div")
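A small sketch (my own example) of looking up enclosing tags:

html = '<div class="outer"><div class="inner"><p><b>text</b></p></div></div>'
soup = BeautifulSoup(html, 'lxml')
b = soup.find('b')
print(b.find_parent("p"))           # the enclosing <p> tag
print(len(b.find_parents("div")))   # 2 -- both the inner and the outer div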

CSS selectors

soup.select("title")             # by tag name
soup.select("html head title")   # by a chain of nested tag names
soup.select("p > a")             # all a tags directly inside a p tag
soup.select("p > #link1")        # inside a p tag, find the tag whose id is link1
soup.select("#link1 ~ .sister")  # find later siblings with the given class
soup.select("#link1 + .sister")  # find the immediately following sibling with the given class
soup.select(".sister")           # search by class name
soup.select("#sister")           # search by id
soup.select('a[href="http://example.com/elsie"]')  # search by tag attribute
soup.select('a[href$="tillie"]')                   # attribute value ends-with match
soup.select_one(".sister")       # return only the first match
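To make the selectors above concrete, here is a short sketch (my own example, loosely modeled on the "three sisters" markup used in the official documentation):

html = '''
<p class="story">Once upon a time there were three sisters:
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.
</p>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select("p > a"))                   # all three <a> tags
print(soup.select("#link1 ~ .sister"))        # link2 and link3
print(soup.select('a[href$="tillie"]'))       # link3 only
print(soup.select_one(".sister").get_text())  # Elsie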

Finally, note a few errors that may come up; try to catch and handle them during the scraping process.

UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or another kind of UnicodeEncodeError)
The text needs to be re-encoded.
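A minimal sketch (my own suggestion, not from the original post) of two common workarounds: write the text out with an explicit UTF-8 encoding instead of printing it, or drop the characters the console codec cannot represent.

text = soup.find(class_="title").get_text()
# Option 1: save to a file with an explicit encoding
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
# Option 2: drop characters the console codec (here assumed to be gbk) cannot encode
print(text.encode("gbk", errors="ignore").decode("gbk"))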

AttributeError: 'NoneType' object has no attribute 'foo'
The tag does not have this attribute; this usually means the tag was not found and find() returned None.
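A minimal sketch (my own suggestion, not from the original post) of guarding against this during scraping: check the result of find() before using it, or catch the AttributeError.

tag = soup.find(class_="title")
if tag is not None:
    print(tag.get_text())
else:
    print("tag not found on this page")
# or, equivalently, wrap the access:
try:
    title = soup.find(class_="title").get_text()
except AttributeError:
    title = ""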
