Full name: Beautiful Soup. This module is a third-party library, so it needs to be installed separately.
Installation method:
pip install bs4
BS4 relies on a document parser when parsing pages. Python's built-in html.parser works without any extra installation; lxml is a faster, commonly used alternative and can be installed as a parsing library:
pip install lxml
In addition, when we use the bs4 module, we need to create a BeautifulSoup object. Commonly used parsers for this object are:
- html.parser
- lxml
- xml
- html5lib
BeautifulSoup supports Python's standard HTML parsing library by default, but it also supports some third-party parsing libraries.
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, moderate execution speed, and strong document fault tolerance | Document fault tolerance is poor in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast speed and strong document fault tolerance | Requires a C language library |
lxml XML parser | BeautifulSoup(markup, "xml"); BeautifulSoup(markup, "lxml-xml") | Fast, the only parser that supports XML | Requires a C language library |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance, parses pages the same way a browser does | Slow, requires an extra Python library |
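Because lxml and html5lib are optional installs, a script can fall back to the built-in parser when they are missing. The following is a minimal sketch, not part of the original text: the helper name make_soup and its preferred argument are hypothetical, and it assumes bs4 raises FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is available.

    make_soup is a hypothetical helper written for this example only.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The parser library is not installed; try the next candidate.
            continue
    raise RuntimeError("no usable parser found")


soup, used_parser = make_soup("<p>Hello</p>")
print(used_parser)        # "lxml" if it is installed, otherwise "html.parser"
print(soup.p.get_text())
```

Since html.parser ships with Python, the fallback should always succeed as long as it stays in the preferred list.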
BeautifulSoup
● Usage examples:
BeautifulSoup(url.text, 'html.parser')  # "url.text" is the content to be parsed, "html.parser" is the parser
Among them, the first parameter is the content that needs to be parsed, and the second parameter is the parser that needs to be used.
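As a rough illustration of the two parameters, the sketch below parses a literal HTML string instead of a downloaded page, so it runs without network access. The variable names page and title_text are made up for this example; in the usage above, url.text presumably holds the HTML text of an HTTP response.

```python
from bs4 import BeautifulSoup

# A small HTML document standing in for downloaded page content (assumed sample data).
page = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# First argument: the content to parse. Second argument: the parser to use.
soup = BeautifulSoup(page, "html.parser")

title_text = soup.title.string
print(title_text)
print(soup.p.name)
```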
Find node
There are two ways to search: find returns the first matching tag (or None if nothing matches), and find_all returns a list of every matching tag.
● Writing format: find("tag name to be found", attrs={"attribute name": "attribute value"}) Usage example:
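# Find a single tag (the first match); html is a BeautifulSoup object
html.find("div", attrs={"class": "v7W49e"})
# Find every tag that matches
html.find_all("div", attrs={"class": "v7W49e"})
# Find several kinds of tags at once (tag names must be strings)
html.find_all(["a", "h3"])
# Find all a tags under the ul tag
html.find("ul", attrs={"class": "v7W49e"}).find_all("a")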
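The calls above can be tried end to end on a small document. This is only a sketch: the markup in doc is invented sample data that reuses the class name from the examples, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Invented sample markup that reuses the class name from the examples above.
doc = """
<div class="v7W49e"><h3>First</h3><a href="/one">One</a></div>
<div class="v7W49e"><h3>Second</h3><a href="/two">Two</a></div>
<ul class="v7W49e"><li><a href="/three">Three</a></li></ul>
"""
html = BeautifulSoup(doc, "html.parser")

first_div = html.find("div", attrs={"class": "v7W49e"})      # first match only
all_divs = html.find_all("div", attrs={"class": "v7W49e"})   # list of all matches
links_and_headings = html.find_all(["a", "h3"])              # several tag names at once
ul_links = html.find("ul", attrs={"class": "v7W49e"}).find_all("a")
missing = html.find("table")                                 # no match

print(first_div.h3.get_text())          # First
print(len(all_divs))                    # 2
print(len(links_and_headings))          # 5
print([a["href"] for a in ul_links])    # ['/three']
print(missing)                          # None
```

Because find returns None when nothing matches, chaining .find_all onto it is only safe when the first lookup is known to succeed.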
Get text
There are two commonly used ways to get the text in tags in bs4:
method | explain |
---|---|
get_text() | A method of the tag object that collects all text content in the specified tag and its subtags and merges it into one string. By default the pieces are joined with no separator and whitespace is kept as it appears in the document; get_text(separator, strip=True) changes both. It is called as p.get_text(), where p is the specific tag found by the Beautiful Soup object. |
text | A property of the tag object that returns the same string as get_text() called with default arguments, so it also includes the text content of the subtags. It is accessed as p.text, where p is the specific tag found by the Beautiful Soup object. |
Example:
<p> This is a paragraph. <strong>Bold text</strong> <em>Italic text</em> </p>
Using p.get_text() gives the merged text content, whitespace included: " This is a paragraph. Bold text Italic text "
p.text returns exactly the same string; it does not exclude the text content within the subtags.
p.get_text(strip=True) strips each piece before joining them: "This is a paragraph.Bold textItalic text"
To get only the text content directly contained in the paragraph tag ("This is a paragraph."), iterate over p.children and keep only the string nodes.
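The behaviour of get_text() and .text can be checked directly on a paragraph shaped like the example above (English sample text is used here). This is a minimal sketch; the variable names are illustrative, and the last line shows one possible way to collect only the text that sits directly inside the tag.

```python
from bs4 import BeautifulSoup, NavigableString

snippet = "<p> This is a paragraph. <strong>Bold text</strong> <em>Italic text</em> </p>"
p = BeautifulSoup(snippet, "html.parser").p

full = p.get_text()                    # text of the tag and all of its subtags
same = p.text                          # property form; same result as get_text()
stripped = p.get_text(strip=True)      # strip each piece, join with no separator
spaced = p.get_text(" ", strip=True)   # strip each piece, join with a space

# Only the strings that are direct children of <p>, skipping <strong> and <em>.
direct = "".join(c for c in p.children if isinstance(c, NavigableString)).strip()

print(repr(full))      # ' This is a paragraph. Bold text Italic text '
print(full == same)    # True
print(stripped)        # This is a paragraph.Bold textItalic text
print(spaced)          # This is a paragraph. Bold text Italic text
print(direct)          # This is a paragraph.
```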
Get attribute value:
In BeautifulSoup4 (bs4), you can use the following two methods to obtain the attribute value of a tag:
- Use the tag.get(attribute) method: this method obtains the specified attribute value of the tag. If the attribute does not exist, it returns None.
- Use tag['attribute']: get the attribute value of the tag directly through square brackets. If the attribute does not exist, this raises a KeyError.
Dot notation (tag.attribute) does not read attributes. It looks up a child tag with that name, so tag.href returns None unless the tag contains a child tag named href.
- Example:
<a class="link" href="xxx.xxx.xxx" target="_blank">Example Website</a>
Use Python to extract the href attribute of the a tag:
from bs4 import BeautifulSoup
data = '<a class="link" href="xxx.xxx.xxx" target="_blank">Example Website</a>'
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
# Get the attribute value with the get method
tag = soup.find('a')
value = tag.get('href')
print(value)
# Get the attribute value with square brackets
brackets = tag['href']
print(brackets)
# Dot notation looks up a child tag named href, not the attribute, so this prints None
dot = tag.href
print(dot)
Note: If the specified attribute does not exist in the tag, using square brackets raises a KeyError, while the get() method returns None. Dot notation searches for a child tag rather than an attribute, so it returns None here whether or not the attribute exists. Therefore, before using square brackets to get the attribute value, it is best to ensure that the attribute exists, or use the get() method to get the attribute value safely.
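The difference between the safe and unsafe lookups can be sketched on the same a tag. The attribute name title and the fallback string are made up for illustration; the calls used (get with a default, has_attr, attrs) are part of the bs4 Tag interface.

```python
from bs4 import BeautifulSoup

data = '<a class="link" href="xxx.xxx.xxx" target="_blank">Example Website</a>'
tag = BeautifulSoup(data, "html.parser").find("a")

href = tag.get("href")                  # existing attribute
missing = tag.get("title")              # absent attribute, returns None
fallback = tag.get("title", "no title") # absent attribute with a default value
classes = tag["class"]                  # class is multi-valued, so this is a list
has_href = tag.has_attr("href")         # check before using square brackets
all_attrs = tag.attrs                   # dict of every attribute on the tag

try:
    tag["title"]                        # square brackets on an absent attribute
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, classes, has_href, raised)
```

Checking has_attr first, or using get with a default, avoids the KeyError when a page does not always carry the attribute.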