Python crawler learning (4): Introduction to the Beautiful Soup library

(4) Getting started with the Beautiful Soup library

BeautifulSoup library official documentation

(1) Basic elements of Beautiful Soup library

  • Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree"
<p class="title">...</p>

<p>...</p>      : Tag
p               : Name (appears in pairs, in the opening and closing tag)
class="title"   : Attributes (zero or more)
  • A BeautifulSoup object corresponds to the entire content of an HTML/XML document
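from bs4 import BeautifulSoup

# Build a soup from an HTML string
soup1 = BeautifulSoup("<html>data</html>", "html.parser")
# Build a soup from an open file object (pass the file itself, not a quoted string)
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")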
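
The two construction styles above can be tried without a real D://demo.html. The following is a minimal sketch, assuming a made-up HTML snippet and a temporary file (both are illustrative, not part of the original notes); it also closes the file with a context manager:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = "<html><head><title>Demo</title></head><body><p class='title'>Hi</p></body></html>"

# 1) Build the soup directly from a string
soup_from_str = BeautifulSoup(html, "html.parser")

# 2) Build the soup from a file object; the temporary file stands in for demo.html
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.title.string)   # text of the <title> tag
print(soup_from_file.p["class"])    # class is multi-valued, so a list is returned
```

Both objects describe the same tag tree, so either style works; the file form simply lets Beautiful Soup read the markup for you.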

(2) BeautifulSoup library parser

| Parser | Usage | Requirement |
| --- | --- | --- |
| Built-in HTML parser (Python's html.parser) | BeautifulSoup(mk, "html.parser") | Install the bs4 library |
| lxml's HTML parser | BeautifulSoup(mk, "lxml") | pip install lxml |
| lxml's XML parser | BeautifulSoup(mk, "xml") | pip install lxml |
| html5lib's parser | BeautifulSoup(mk, "html5lib") | pip install html5lib |

Here mk stands for the markup to parse (a string or an open file).
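
Since lxml and html5lib are optional installs, it can be handy to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that; the helper name make_soup and the preference order are assumptions for illustration, while FeatureNotFound is the exception bs4 raises when a requested parser is not available:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p class='title'>Hello</p>")
print(used, soup.p.string)
```

Because html.parser ships with Python, the loop always finds at least one working parser.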

(3) The basic elements of the BeautifulSoup class

Example: <p class="title">...</p>

| Basic element | Description |
| --- | --- |
| Tag | A tag, the most basic unit of information organization; <> and </> mark its beginning and end |
| Name | The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name |
| Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string |
| Comment | The comment part of a string inside a tag; a special type (a subclass of NavigableString) |
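from bs4 import BeautifulSoup
# demo is assumed to hold the HTML source of a page as a string
soup = BeautifulSoup(demo, "html.parser")
soup.title              # get the <title> tag
soup.a                  # format: soup.<tag>; if the document has several identical <tag>s, only the first is returned
soup.find_all('a')      # find every <a> tag in the document
soup.get_text()         # get all text content from the document

# Get the name of a <tag>; format: <tag>.name, a string
soup.a.name
soup.a.parent.name
soup.a.parent.parent.name

# A <tag> can have zero or more attributes, stored as a dictionary
tag = soup.a
tag.attrs
tag.attrs['class']      # assumes the tag has a class attribute

# .string can reach a NavigableString across several nesting levels
soup.a.string           # type: bs4.element.NavigableString

# Comment is a special type
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string        # type: bs4.element.Comment
newsoup.p.string        # type: bs4.element.NavigableString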
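
Because .string returns a comment and ordinary text in the same way, code that processes text usually has to check the type. The following sketch uses a made-up snippet and a hypothetical helper named describe to tell the basic element types apart:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup used only for this illustration
html = '<p class="title"><b>Bold text</b></p><div><!--a comment--></div>'
soup = BeautifulSoup(html, "html.parser")


def describe(node):
    """Classify a node; Comment is checked first because it subclasses NavigableString."""
    if isinstance(node, Comment):
        return "Comment"
    if isinstance(node, NavigableString):
        return "NavigableString"
    if isinstance(node, Tag):
        return "Tag"
    return "other"


p = soup.p
print(p.name, p.attrs)              # the tag's name and its attribute dictionary
print(p.string)                     # reaches through the single <b> child
print(describe(p.string))           # ordinary text
print(describe(soup.div.string))    # the comment inside <div>
```

Checking for Comment before NavigableString matters: with the order reversed, every comment would be reported as plain text.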

(4) HTML content traversal method based on bs4 library

HTML basic format
Traverse the tree

1. Downward traversal of the tag tree

  • The BeautifulSoup object is the root node of the tag tree

| Attribute | Description |
| --- | --- |
| .contents | List of child nodes; stores all direct children of the tag in a list |
| .children | Iterator over child nodes; like .contents, used to loop over the direct children |
| .descendants | Iterator over descendant nodes; covers all descendants, used for loop traversal |
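soup = BeautifulSoup(demo, "html.parser")
soup.head                   # get the <head> tag
soup.head.contents          # list of <head>'s child nodes
soup.body.contents          # list of <body>'s child nodes
len(soup.body.contents)     # number of <body>'s child nodes
soup.body.contents[1]       # second child node of <body>

# Downward traversal over the direct children
for child in soup.body.children:
    print(child)

# Downward traversal over all descendants
for child in soup.body.descendants:
    print(child)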
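
The difference between .children and .descendants is easiest to see by counting nodes. This is a small sketch on a made-up snippet; note that text nodes count as nodes too, so the lists hold more than just tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup used only for this illustration
html = "<body><p>One <b>bold</b></p><p>Two</p></body>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.body.children)        # direct children only
descendants = list(soup.body.descendants)  # children, grandchildren, and so on

# Keep only Tag nodes to skip the text nodes
child_tags = [node.name for node in children if isinstance(node, Tag)]
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), child_tags)            # 2 ['p', 'p']
print(len(descendants), descendant_tags)    # 6 ['p', 'b', 'p']
```

The six descendants are the two <p> tags, the <b> tag, and the three text nodes "One ", "bold", and "Two".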

2. Upward traversal of the tag tree

| Attribute | Description |
| --- | --- |
| .parent | The node's parent tag |
| .parents | Iterator over the node's ancestor tags, used to loop over the ancestors |
# Traverse all ancestors; this includes the soup object itself, so that case is checked separately
soup = BeautifulSoup(demo, "html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
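
To see what the upward loop actually visits, the ancestor names can be collected into a list. The sketch below uses a made-up snippet; the last ancestor is the BeautifulSoup object itself, whose name is the string '[document]', and the soup's own .parent is None:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from <a> up to the root of the tree
ancestor_names = [parent.name for parent in soup.a.parents]

print(ancestor_names)   # ['p', 'body', 'html', '[document]']
print(soup.parent)      # None: the root has no parent
```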

3. Parallel traversal of the tag tree

| Attribute | Description |
| --- | --- |
| .next_sibling | Returns the next sibling node in HTML text order |
| .previous_sibling | Returns the previous sibling node in HTML text order |
| .next_siblings | Iterator; returns all following sibling nodes in HTML text order |
| .previous_siblings | Iterator; returns all preceding sibling nodes in HTML text order |

Parallel traversal

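# Iterate over the following siblings (note the plural .next_siblings)
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the preceding siblings (note the plural .previous_siblings)
for sibling in soup.a.previous_siblings:
    print(sibling)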
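
Siblings are nodes, not necessarily tags: the text between two tags is a sibling as well. The following sketch, on a made-up snippet, shows this and filters the siblings down to tags only:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup used only for this illustration
html = "<p>Start <a id='one'>A</a> and <a id='two'>B</a> end</p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a                           # the <a id="one"> tag
following = list(first.next_siblings)    # text " and ", <a id="two">, text " end"
preceding = list(first.previous_siblings)

# Keep only Tag siblings and read their id attribute
following_tags = [s["id"] for s in following if isinstance(s, Tag)]

print(repr(first.next_sibling))   # ' and ' (a text node, not a tag)
print(following_tags)             # ['two']
print(preceding)                  # ['Start ']
```

Filtering with isinstance(s, Tag) avoids errors from treating a text node as if it had tag attributes.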

(5) HTML format output based on bs4 library

# Pretty-print the output
# .prettify() adds '\n' after each HTML tag and its content
soup.prettify()
# .prettify() can also be called on a single tag
soup.a.prettify()
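
A quick way to see the effect of .prettify() is to compare it with the plain string form of the same soup. This is a minimal sketch on a made-up one-line snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

compact = str(soup)              # everything on a single line
pretty = soup.prettify()         # one tag or text node per line, indented
pretty_tag = soup.p.prettify()   # works on a single tag as well

print(compact)
print(pretty)
print(pretty_tag)
```

The pretty form is meant for reading; the added whitespace becomes part of the text, so it is usually better not to parse prettified output again.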

  • The bs4 library converts any HTML input to Unicode, and uses UTF-8 by default when encoding output to bytes
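
The encoding behavior can be checked with input that is deliberately not UTF-8. The sketch below assumes a made-up GBK-encoded snippet and passes from_encoding so the source encoding is not left to detection:

```python
from bs4 import BeautifulSoup

# Hypothetical GBK-encoded bytes used only for this illustration
raw = "<p>中文</p>".encode("gbk")

soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string      # a Unicode str, regardless of the input encoding
encoded = soup.encode()   # bytes; the default output encoding is UTF-8

print(text)
print(encoded)
```

Inside the tree everything is ordinary Python text, so the original encoding only matters at the moment of parsing and again when bytes are written out.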
Origin blog.csdn.net/qq_39419113/article/details/105519006