(4) Getting started with Beautiful Soup library
BeautifulSoup library official documentation
(1) Basic elements of Beautiful Soup library
- The Beautiful Soup library is a library of functions for parsing, traversing, and maintaining the "tag tree"
<p class="title">...</p>
<p>..</p> : Tag
p : Name (appears in pairs, in the opening and the closing tag)
class='title' : Attributes (zero or more)
- A BeautifulSoup object corresponds to the entire content of an HTML / XML document
from bs4 import BeautifulSoup
soup1 = BeautifulSoup("<html>data</html>", "html.parser")  # parse from a string
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")  # parse from an open file object
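The two ways of building a BeautifulSoup object shown above can be tried out with a small sketch like the following. The sample markup and the temporary file are assumptions made only for illustration; the temporary file stands in for a real page saved on disk.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = "<html><body><p class='title'>data</p></body></html>"

# 1) Parse directly from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Parse from an open file handle. A temporary file stands in for a saved page.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(html)
    path = tmp.name

# Using "with" makes sure the file handle is closed after parsing.
with open(path, encoding="utf-8") as fh:
    soup_from_file = BeautifulSoup(fh, "html.parser")
os.remove(path)

print(soup_from_string.p.string)  # data
print(soup_from_file.p.string)    # data
```

Opening the file in a with block is a small design choice: the handle is closed as soon as parsing finishes, which a bare open(...) call does not guarantee.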
(2) BeautifulSoup library parser
Parser | Usage | Requirement |
---|---|---|
bs4's HTML parser | BeautifulSoup("mk", "html.parser") | Install the bs4 library |
lxml's HTML parser | BeautifulSoup("mk", "lxml") | pip install lxml |
lxml's XML parser | BeautifulSoup("mk", "xml") | pip install lxml |
html5lib's parser | BeautifulSoup("mk", "html5lib") | pip install html5lib |
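Since three of the four parsers in the table need an extra install, one possible pattern is to ask for a preferred parser and fall back to the built-in one when it is missing. The helper name make_soup and the sample markup below are hypothetical; bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper for this sketch, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser library is not installed on this machine.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'>hello</p>")
print(soup.p.string)    # hello
print(soup.p["class"])  # ['title']
```

Different parsers may repair incomplete markup differently (for example by adding html and body wrappers), so the fallback is only safe when the code does not depend on such details.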
(3) The basic elements of the BeautifulSoup class
Example: <p class="title">...</p>
Basic element | Explanation |
---|---|
Tag | A tag, the most basic unit of information organization; <> and </> mark its beginning and end |
Name | The name of the tag; for <p>…</p> the name is 'p'. Format: <tag>.name |
Attributes | The attributes of the tag, organized as a dictionary. Format: <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string |
Comment | The comment portion of a string inside a tag; Comment is a special type |
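from bs4 import BeautifulSoup
# demo is assumed to hold the HTML text of the page being parsed
soup = BeautifulSoup(demo, "html.parser")
soup.title  # get the <title> tag
soup.a  # format: soup.<tag>; if the document contains several identical <tag>s, only the first is returned
soup.find_all('a')  # find all <a> tags in the document
soup.get_text()  # get all the text content of the document
# Get the name of a <tag>; format: <tag>.name, a string
soup.a.name
soup.a.parent.name
soup.a.parent.parent.name
# A <tag> can have zero or more attributes, stored as a dictionary
tag = soup.a
tag.attrs
tag.attrs['class']
# A NavigableString can span multiple levels of nesting
soup.a.string  # type: bs4.element.NavigableString
# Comment is a special type
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string  # type: bs4.element.Comment
newsoup.p.string  # type: bs4.element.NavigableString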
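The basic elements listed above can be checked against a small inline document. The markup below is a made-up sample, not the demo page used in these notes. It also shows why the Comment type matters: a comment prints like ordinary text, so only a type check tells the two apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, chosen only for illustration.
html = '<p class="title" id="top"><b>Heading</b></p><i><!--a comment--></i>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(isinstance(p, Tag))  # True
print(p.name)              # p
print(p["id"])             # top
print(p["class"])          # ['title']  (class is multi-valued, so it is a list)

# .string reaches through the single nested <b>, i.e. it spans levels.
text = p.string
print(text)                    # Heading
print(type(text).__name__)     # NavigableString

# A comment looks like plain text when printed; check the type instead.
comment = soup.i.string
print(comment)                 # a comment
print(type(comment).__name__)  # Comment
print(isinstance(comment, NavigableString))  # True, Comment is a subclass
```

When extracting visible text, filtering with isinstance(node, Comment) is one way to keep comments out of the result.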
(4) HTML content traversal method based on bs4 library
1. Downward traversal of the tag tree
- The BeautifulSoup object is the root node of the tag tree
Attribute | Explanation |
---|---|
.contents | A list of the tag's child nodes; all direct children are stored in the list |
.children | An iterator over the child nodes; like .contents, but used to loop over the direct children |
.descendants | An iterator over all descendant nodes, used to loop over every node below the tag |
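soup = BeautifulSoup(demo, "html.parser")
soup.head  # get the <head> tag
soup.head.contents  # list of the child nodes of <head>
soup.body.contents  # list of the child nodes of <body>
len(soup.body.contents)  # number of child nodes of <body>
soup.body.contents[1]  # the second child node of <body>
# Downward traversal of the child nodes
for child in soup.body.children:
    print(child)
# Downward traversal of all descendant nodes
for child in soup.body.descendants:
    print(child)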
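The difference between .contents, .children and .descendants is easier to see on a tiny tree. The following is a minimal sketch on made-up markup; note that text nodes count as nodes too, so the example filters on Tag when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, chosen only for illustration.
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a real list, so len() and indexing work.
print(len(body.contents))  # 2

# .children iterates over direct children only.
child_names = [c.name for c in body.children if isinstance(c, Tag)]
print(child_names)  # ['p', 'p']

# .descendants walks the whole subtree, text nodes included.
all_nodes = list(body.descendants)
descendant_names = [d.name for d in all_nodes if isinstance(d, Tag)]
print(len(all_nodes))    # 6  (3 tags + 3 text nodes)
print(descendant_names)  # ['p', 'b', 'p']
```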
2. Upward traversal of the tag tree
Attribute | Explanation |
---|---|
.parent | The node's parent tag |
.parents | An iterator over the node's ancestor tags, used to loop over the ancestors |
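# Traverse all ancestor nodes; the last one is the soup object itself, so it is handled separately
soup = BeautifulSoup(demo, "html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)  # the soup object itself reports the name '[document]'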
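As a quick check of the upward traversal, the sketch below collects the ancestor names of an a tag in a made-up document. The walk ends at the BeautifulSoup object, whose name is '[document]' and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, chosen only for illustration.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from <a> up to the root of the tag tree.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The root object has no parent of its own.
print(soup.parent)  # None
```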
3. Parallel traversal of the tag tree
Attribute | Explanation |
---|---|
.next_sibling | Returns the next sibling node in HTML text order |
.previous_sibling | Returns the previous sibling node in HTML text order |
.next_siblings | An iterator over all following sibling nodes in HTML text order |
.previous_siblings | An iterator over all preceding sibling nodes, nearest first |
# Traverse the following sibling nodes
for sibling in soup.a.next_siblings:
    print(sibling)
# Traverse the preceding sibling nodes
for sibling in soup.a.previous_siblings:
    print(sibling)
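One detail worth seeing in practice is that siblings are nodes, not only tags: the text between two tags is itself a sibling. The sketch below uses made-up markup and filters on Tag when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, chosen only for illustration.
html = "<p><a id='1'>A</a> and <a id='2'>B</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The very next sibling is a text node, not the second <a>.
print(repr(str(first.next_sibling)))  # ' and '

# All following siblings, in document order.
following = [str(s) for s in first.next_siblings]
print(following)  # [' and ', '<a id="2">B</a>', '.']

# Keep only the siblings that are tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
second = tag_siblings[0]
print(second.string)  # B

# Preceding siblings are yielded nearest first.
preceding = [str(s) for s in second.previous_siblings]
print(preceding)  # [' and ', '<a id="1">A</a>']
```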
(5) HTML format output based on bs4 library
# Pretty-print the output
# .prettify() adds '\n' after each HTML tag and its content
soup.prettify()
# .prettify() can also be called on a single tag
soup.a.prettify()
- The bs4 library converts any HTML input to Unicode internally and produces UTF-8 when it encodes its output
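The formatting and encoding behaviour can be tried out on a short made-up snippet containing non-ASCII text. This is only a sketch: prettify() returns a str with one node per line, while encode() returns bytes in the requested encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with non-ASCII text, chosen only for illustration.
soup = BeautifulSoup("<p>中文 <a href='#'>link</a></p>", "html.parser")

# prettify() returns a str with line breaks and indentation added.
pretty = soup.prettify()
print(pretty)

# prettify() also works on a single tag.
pretty_link = soup.a.prettify()
print(pretty_link)

# encode() serializes the tree to bytes, here in UTF-8.
encoded = soup.encode("utf-8")
print(type(encoded).__name__)  # bytes
```

Because prettify() inserts whitespace around text nodes, it is best kept for display and debugging; str(soup) or encode() leave the content unchanged.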