(4) Getting started with Beautiful Soup library

BeautifulSoup library official documentation

(1) Basic elements of Beautiful Soup library

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" of a document.

<p class="title">...</p>

<p>...</p>	: Tag
p	: Name (appears as a pair, in the opening and closing tags)
class="title"	: Attributes (zero or more)

A BeautifulSoup object corresponds to the entire content of an HTML or XML document.

from bs4 import BeautifulSoup
soup1 = BeautifulSoup("<html>data</html>","html.parser")
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
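The two construction styles above can be tried out with a minimal sketch like the following. The HTML snippet is made up for illustration, and a temporary file stands in for D://demo.html, so neither is part of the original example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p class='title'>Hello</p></body></html>"

# 1) Build the tag tree directly from a string.
soup_from_str = BeautifulSoup(html, "html.parser")

# 2) Build it from an open file handle (a temporary file stands in for demo.html).
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as f:
    f.write(html)
    path = f.name
with open(path, encoding="utf-8") as fh:
    soup_from_file = BeautifulSoup(fh, "html.parser")
os.remove(path)

print(soup_from_str.title.string)   # text of the <title> tag
print(soup_from_file.p["class"])    # class is multi-valued, so this is a list
```

Using a `with` block closes the file handle once the document has been parsed, which the one-line `open(...)` form does not do explicitly.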

(2) BeautifulSoup library parser

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk, "html.parser")	Install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, "lxml")	pip install lxml
lxml's XML parser	BeautifulSoup(mk, "xml")	pip install lxml
html5lib's parser	BeautifulSoup(mk, "html5lib")	pip install html5lib
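As a rough sketch of how the parser name is passed, the loop below feeds the same markup to each parser in the table and skips any whose library is not installed. The sample markup and the `results` dictionary are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately unclosed markup.
markup = "<p>unclosed paragraph"

results = {}
for parser in ("html.parser", "lxml", "xml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # The optional parser library (lxml or html5lib) is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

Different parsers may repair broken markup differently, so it is usually worth naming the parser explicitly rather than relying on whichever one happens to be installed.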

(3) The basic elements of the BeautifulSoup class

Example: <p class="title">...</p>

Basic element	Explanation
Tag	A tag, the most basic unit of information organization; <> and </> mark its beginning and end
Name	The name of the tag; the name of <p>…</p> is 'p'; format: <tag>.name
Attributes	The attributes of the tag, organized as a dictionary; format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>; format: <tag>.string
Comment	The comment part of the string inside a tag; a special type, Comment

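from bs4 import BeautifulSoup
# demo holds the HTML source of the page as a string
soup = BeautifulSoup(demo, "html.parser")
soup.title		# get the <title> tag
soup.a			# format: soup.<tag>; if the document contains several <tag> elements, only the first is returned
soup.find_all('a')	# find all <a> tags in the document
soup.get_text()		# get all the text content of the document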

# Get the name of a <tag>; format: <tag>.name, a string
soup.a.name		
soup.a.parent.name
soup.a.parent.parent.name

# A <tag> can have zero or more attributes, stored as a dictionary
tag = soup.a
tag.attrs
tag.attrs['class']

# A NavigableString can span multiple levels of tags
soup.a.string		# type: bs4.element.NavigableString

# Comment is a special type
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string	# type: bs4.element.Comment
newsoup.p.string	# type: bs4.element.NavigableString
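Because .string returns a comment's text without the <!-- --> markers, a comment and ordinary text can look identical when printed. A small sketch, using made-up markup, shows how the five basic elements can be told apart by type:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup for illustration only.
html = '<p class="title" id="intro"><b>Bold text</b></p><div><!--a note--></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(isinstance(tag, Tag), tag.name)   # True p
print(tag.attrs)                        # {'class': ['title'], 'id': 'intro'}

# .string reaches through the nested <b> to the single string inside it.
text = tag.string
print(type(text).__name__, text)        # NavigableString Bold text

# A Comment is a subclass of NavigableString, so check for Comment first.
note = soup.div.string
is_comment = isinstance(note, Comment)
print(is_comment, isinstance(note, NavigableString))  # True True
```

Checking `isinstance(node, Comment)` before treating a string as page text is one way to keep comments out of extracted content.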

(4) HTML content traversal method based on bs4 library

HTML basic format
Traverse the tree

1. Downward traversal of the tag tree

The BeautifulSoup object is the root node of the tag tree.

Attribute	Explanation
.contents	A list of the child nodes; all direct children of the tag are stored in the list
.children	An iterator over the child nodes; like .contents, but intended for looping over the direct children
.descendants	An iterator over all descendant nodes, intended for looping over every node below the tag

soup = BeautifulSoup(demo, "html.parser")
soup.head				# get the <head> tag
soup.head.contents		# get the list of <head>'s child nodes
soup.body.contents		# get the list of <body>'s child nodes
len(soup.body.contents)	# get the number of <body>'s child nodes
soup.body.contents[1]

# Iterate downward over the child nodes
for child in soup.body.children:
    print(child)

# Iterate downward over all descendant nodes
for descendant in soup.body.descendants:
    print(descendant)
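The difference between the three downward attributes is easiest to see on a small tree. The following is a minimal sketch with made-up markup; note that strings count as nodes too, so descendants include both tags and text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <body> has two <p> children; the first contains a nested <b>.
html = "<body><p>One <b>bold</b></p><p>Two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

children = list(body.children)        # direct children only
descendants = list(body.descendants)  # every node below <body>, tags and strings

print(len(body.contents), len(children), len(descendants))  # 2 2 6

# Keep only the Tag nodes among the descendants.
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]
print(descendant_tags)                # ['p', 'b', 'p']
```

In real pages the whitespace between tags also appears as string nodes, which is why .contents is often longer than the number of visible child tags.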

2. Upward traversal of the tag tree

Attribute	Explanation
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, intended for looping over the ancestors

# Iterate over all ancestor nodes; this includes soup itself, so that case is handled separately
soup = BeautifulSoup(demo, "html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
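A short sketch with made-up markup shows what the upward iteration yields in practice: the last ancestor is the BeautifulSoup object itself, whose name is '[document]' and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the names of every ancestor of the first <a>, innermost first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)      # ['p', 'body', 'html', '[document]']

print(soup.a.parent.name)  # p
print(soup.parent)         # None: the root has no parent
```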

3. Parallel traversal of the tag tree

Attribute	Explanation
.next_sibling	Returns the next sibling node in HTML text order
.previous_sibling	Returns the previous sibling node in HTML text order
.next_siblings	An iterator that returns all following sibling nodes in HTML text order
.previous_siblings	An iterator that returns all preceding sibling nodes in HTML text order

Parallel traversal

# Iterate over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
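Siblings are the nodes that share the same parent, and they are not necessarily tags: the text between two tags is a sibling as well. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two <a> tags separated by plain text inside one <p>.
html = "<p>before <a id='first'>A</a> and <a id='second'>B</a> after</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))           # ' and ' (a string, not a tag)

following = list(first.next_siblings)     # [' and ', <a id="second">B</a>, ' after']
preceding = list(first.previous_siblings) # ['before ']

# Filter by type when only sibling tags are wanted.
sibling_tags = [node for node in following if isinstance(node, Tag)]
print(sibling_tags[0]["id"])              # second
```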

(5) HTML format output based on bs4 library

# Pretty-print the output
# .prettify() adds '\n' after each tag and its content in the HTML text
soup.prettify()
# .prettify() can also be called on a single tag
soup.a.prettify()
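As a small sketch of what .prettify() produces, the following uses made-up markup and prints both the whole document and a single tag. The result is an ordinary string with one node per line.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = "<html><body><p>Hello <a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

pretty_doc = soup.prettify()    # the whole document, one node per line
pretty_tag = soup.a.prettify()  # just the first <a> tag

print(pretty_doc)
print(pretty_tag)

# Split the single-tag output into its lines, ignoring indentation.
tag_lines = [line.strip() for line in pretty_tag.strip().splitlines()]
```

Since the added line breaks and indentation change the whitespace of the text, .prettify() is better suited to inspecting a document than to producing output for further parsing.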

The bs4 library converts any HTML input to Unicode and encodes its output as UTF-8 by default.
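A minimal sketch of this behaviour, assuming a made-up GBK-encoded snippet as input: the strings in the tree are ordinary Python str objects, and encoding a tag yields UTF-8 bytes.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in a non-UTF-8 encoding (GBK).
raw = "<p>中文</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                 # decoded text inside the tree
utf8_bytes = soup.p.encode("utf-8")  # the tag serialized as UTF-8 bytes

print(text)
print(utf8_bytes)
```

Passing `from_encoding` is only needed when the encoding is known in advance; without it, Beautiful Soup tries to detect the encoding of byte input itself.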

Haoran
