Beautiful Soup parses anything you give it, and does the tree traversal stuff for you.
The BeautifulSoup library parses, traverses, and maintains a "tag tree". (Traversal means following some search route through the tree so that every node is visited exactly once.) https://www.crummy.com/software/BeautifulSoup
The BeautifulSoup library is usually referred to as bs4. Import it with: from bs4 import BeautifulSoup. This imports the BeautifulSoup class, which is the main class of the bs4 package.
Parsers used by the bs4 library
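As a rough sketch of how a parser is selected: the second argument to BeautifulSoup names the parser. The HTML string below is made up for illustration, and "html.parser" is used because it ships with Python; the other parser names listed in the comments only work if the corresponding package is installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from any real page.
html = "<p>Hello, <b>world</b></p>"

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")
print(soup.b.string)

# Other parser names accepted as the second argument
# (each requires a separately installed package):
#   "lxml"     - HTML parser from the lxml package
#   "xml"      - XML parser, also provided by lxml
#   "html5lib" - parser from the html5lib package
```

The rest of the calls are the same whichever parser is chosen, although different parsers may repair malformed markup in different ways.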
Basic elements of the BeautifulSoup class
import requests
from bs4 import BeautifulSoup

res = requests.get('http://www.pmcaff.com/site/selection')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.a)
# Any tag that exists in the HTML can be accessed as soup.<tag>. When the document contains several identical tags, soup.<tag> returns the first one.

print(soup.a.name)
# Every tag has a name, available as <tag>.name; it is a string.

print(soup.a.attrs)
print(soup.a.attrs['class'])
# A tag may have one or more attributes; <tag>.attrs is a dictionary.

print(soup.a.string)
# <tag>.string returns the non-attribute string inside the tag.

soup1 = BeautifulSoup('<p><!--Here is a comment--></p>', 'lxml')
print(soup1.p.string)
print(type(soup1.p.string))
# A comment is a special type; it can also be retrieved through <tag>.string.
Output:
<a class="no-login" href="">登录</a>
a
{'href': '', 'class': ['no-login']}
['no-login']
登录
Here is a comment
<class 'bs4.element.Comment'>
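The same elements can be tried without a network request. The following is a minimal sketch that assumes an inline HTML string (made up for illustration, loosely modeled on the output above) and the built-in "html.parser" instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative markup; the real page content will differ.
html = '<div><a class="no-login" href="">Log in</a><p><!--Here is a comment--></p></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a                # first <a> tag in the document
print(link.name)             # the tag name, a string
print(link.attrs)            # all attributes, as a dictionary
print(link.attrs["class"])   # class can hold several values, so it is a list
print(link.string)           # the text inside the tag

comment = soup.p.string      # .string also returns comments
print(type(comment))
print(isinstance(comment, Comment))
```

Because .string returns both ordinary text and comments, checking isinstance(value, Comment) is one way to tell them apart.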
Traversing HTML content with the bs4 library
The basic structure of HTML
Traversing down the tag tree
The BeautifulSoup object is the root node of the tag tree.
# Iterate over the child nodes
for child in soup.body.children:
    print(child.name)

# Iterate over the descendant nodes
for child in soup.body.descendants:
    print(child.name)
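To make the difference between the two concrete, here is a small sketch on an inline document (the markup is an assumption for illustration). .children yields only the direct children, while .descendants walks the whole subtree, including the text nodes, so the sketch separates tags from strings with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup with no whitespace between tags.
html = "<body><div><a>one</a></div><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body>: only <div> and <p>.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# All descendants of <body>: tags at every depth, plus the text nodes.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]
text_nodes = [str(d) for d in soup.body.descendants if not isinstance(d, Tag)]

print(child_tags)       # ['div', 'p']
print(descendant_tags)  # ['div', 'a', 'p']
print(text_nodes)       # ['one', 'two']
```

On a real page, the whitespace between tags also appears as text nodes, which is why filtering by type is often useful.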
Traversing up the tag tree
# Iterating over .parents visits every ancestor, including soup itself, hence the if ... else check
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Output:
div
div
body
html
[document]
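The upward traversal can be reproduced offline. This sketch assumes an inline document whose nesting mirrors the output above (two div elements around the link); the markup itself is invented for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: <a> nested in two <div> elements.
html = "<html><body><div><div><a href=''>link</a></div></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields each ancestor in turn, ending with the BeautifulSoup
# object itself, whose name is "[document]".
names = [parent.name for parent in soup.a.parents]
print(names)  # ['div', 'div', 'body', 'html', '[document]']

# .parent (singular) returns only the immediate parent.
print(soup.a.parent.name)  # div
```

The last item is the BeautifulSoup object, which is the root of the tree; its own .parent is None, and that is where the iteration stops.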
Traversing the tag tree sideways (sibling nodes)
# Iterate over the following sibling nodes
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the preceding sibling nodes
for sibling in soup.a.previous_siblings:
    print(sibling)
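Note the singular and plural forms: .next_sibling and .previous_sibling each return a single node, while .next_siblings and .previous_siblings are iterators over all siblings in that direction. A minimal sketch, assuming an invented inline document with no whitespace between tags:

```python
from bs4 import BeautifulSoup

# Illustrative markup: four sibling tags inside one <p>.
html = "<p><b>1</b><a>2</a><i>3</i><u>4</u></p>"
soup = BeautifulSoup(html, "html.parser")

# Plural forms iterate over every sibling after or before the tag.
after = [s.name for s in soup.a.next_siblings]
before = [s.name for s in soup.a.previous_siblings]
print(after)   # ['i', 'u']
print(before)  # ['b']

# Singular forms return just the adjacent node.
print(soup.a.next_sibling.name)      # i
print(soup.a.previous_sibling.name)  # b
```

Siblings are nodes that share the same parent, and on a real page they may be text nodes (for example, line breaks between tags) rather than tags.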
The prettify() method of the bs4 library
The prettify() method tidies up the formatting of the markup; call it as soup.prettify(). In PyCharm, use print(soup.prettify()) to display the result.
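A short sketch of prettify() on an inline string (the markup is made up for illustration). The method returns a string in which each tag and each piece of text sits on its own line, indented by nesting depth.

```python
from bs4 import BeautifulSoup

# Illustrative markup written on a single line.
soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

pretty = soup.prettify()  # returns a str; it does not modify soup
print(pretty)

# Each tag and each string ends up on a separate line.
lines = [line.strip() for line in pretty.splitlines()]
print(lines)
```

Since prettify() adds whitespace, it is best suited to inspecting a document by eye rather than to producing markup that will be parsed again.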