BeautifulSoup4 Basics: a Python Crawler Tutorial

BeautifulSoup is a very useful third-party Python library for parsing HTML.

Installation

pip install beautifulsoup4  
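To check that the install worked, you can import the package and print its version (the `__version__` attribute is provided by bs4):

```python
# Verify that beautifulsoup4 is importable and show which version is installed.
import bs4

print(bs4.__version__)
```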

Import

from bs4 import BeautifulSoup

Parsing libraries

BeautifulSoup uses Python's standard-library HTML parser (html.parser) by default, but it also supports several third-party parsers:

- html.parser: built into the standard library, no extra install needed
- lxml: very fast; install with pip install lxml
- html5lib: parses pages the way a browser does; install with pip install html5lib
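The parser is chosen with the second argument to the BeautifulSoup constructor. A minimal sketch (the third-party parsers only work after installing them with pip):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>hello</p></body></html>"

# Built-in parser; no extra dependency required:
soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)  # hello

# Third-party parsers (require `pip install lxml` / `pip install html5lib`):
# soup = BeautifulSoup(html, "lxml")
# soup = BeautifulSoup(html, "html5lib")
```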

Basic usage

Suppose we have the following web page HTML:

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下,你就知道 </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
       新闻 </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">
       hao123 </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
       地图 </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
       视频 </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
       贴吧 </a>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
       更多产品 </a>
     </div>
    </div>
   </div>
  </div>
 </body>
</html>
# Create a BeautifulSoup object (html is the HTML string shown above)
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

# Pretty-print the document with indentation
print(bs.prettify())

# Get the entire <title> tag
print(bs.title)

# Get the name of the <title> tag
print(bs.title.name)

# Get the text content of the <title> tag
print(bs.title.string)

# Get the entire <head> tag
print(bs.head)

# Get the first <div> tag and everything inside it
print(bs.div)

# Get the value of the id attribute of the first <div> tag
print(bs.div["id"])

# Get the first <a> tag and everything inside it
print(bs.a)

# Get all <a> tags
print(bs.find_all("a"))

# Get the tag with id="u1"
print(bs.find(id="u1"))

# Get all <a> tags and print each one's href value
for item in bs.find_all("a"):
    print(item.get("href"))

# Get all <a> tags and print each one's text
for item in bs.find_all("a"):
    print(item.get_text())  # similar to print(item.string)

The four object types of BeautifulSoup4

BeautifulSoup4 converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:

Tag, NavigableString, BeautifulSoup, Comment

Tag: in plain terms, an HTML tag

from bs4 import BeautifulSoup
bs = BeautifulSoup(html,"html.parser")

# Get the entire <title> tag
print(bs.title)

# Get the entire <head> tag
print(bs.head)

# Get the first <a> tag and everything inside it
print(bs.a)

# Type of the object
print(type(bs.a))

We can conveniently fetch these tags by accessing the tag name as an attribute of the soup object; the type of each result is bs4.element.Tag. Note, however, that this returns only the first tag in the document that matches.

A Tag has two important attributes, name and attrs:

# [document]
# The bs object itself is special: its name is [document]
print(bs.name)
# head
# For any other tag, the value is simply the tag's own name
print(bs.head.name)

# Here we print all attributes of the <a> tag; the result is a dictionary.
print(bs.a.attrs)  # commonly used

# You can also use the get method and pass in the attribute name; the two are equivalent
print(bs.a['class'])  # bs.a.get('class')

# These attributes and contents can be modified
bs.a['class'] = "newClass"
print(bs.a)

# An attribute can also be deleted
del bs.a['class']
print(bs.a)

NavigableString: the text inside a tag, obtained with .string (commonly used)

print(bs.title.string)
print(type(bs.title.string))
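Note that .string and .get_text() behave differently when a tag has more than one child. A small sketch (using a fresh soup named demo so it does not disturb the bs object above):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

# .string is None because <div> has two children rather than a single string.
print(demo.div.string)      # None

# .get_text() concatenates the text of all descendants.
print(demo.div.get_text())  # onetwo
```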

BeautifulSoup: represents the document as a whole

Most of the time it can be treated as a Tag object (a special kind of Tag). For example, we can obtain its type, name, and attributes:

print(type(bs.name))
print(bs.name)
print(bs.attrs)

Comment: a special kind of NavigableString; its output omits the comment markers

print(bs.a)  # assume the <a> tag contains nothing but a comment (no text or whitespace), like this:
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新闻--></a>

print(bs.a.string)  # 新闻
print(type(bs.a.string))  # <class 'bs4.element.Comment'>
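Because .string can silently return a Comment, it is safer to check the type before treating the value as visible text. A sketch:

```python
from bs4 import BeautifulSoup, Comment

demo = BeautifulSoup(
    '<a class="mnav" href="http://news.baidu.com"><!--新闻--></a>',
    "html.parser",
)

value = demo.a.string
# Guard against a comment being mistaken for visible text.
if isinstance(value, Comment):
    print("comment:", value)
else:
    print("text:", value)
```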

Traverse the document tree

.contents: gets all of a tag's child nodes and returns them as a list

# The .contents attribute returns a tag's child nodes as a list
print(bs.head.contents)

# Use a list index to get a particular element
print(bs.head.contents[1])

.children: gets all of a tag's child nodes and returns an iterator over them

for child in bs.body.children:
    print(child)
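.children yields only direct children; the related .descendants attribute walks the whole subtree recursively, including text nodes. A sketch:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<div><p><b>deep</b></p></div>", "html.parser")

# Direct children of <div>: just the <p> tag.
print([child.name for child in demo.div.children])   # ['p']

# All descendants of <div>: <p>, <b>, and the text node "deep".
print(len(list(demo.div.descendants)))               # 3
```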

Search the document tree

find_all(name, attrs, recursive, text, **kwargs)

name parameter:

String filter: finds tags whose name exactly matches the string, and returns those tags

a_list = bs.find_all("a")
print(a_list)

Regular-expression filter: if a regular expression is passed in, BeautifulSoup4 matches tag names against it using search():

import re

t_list = bs.find_all(re.compile("a"))
for item in t_list:
    print(item)

List: if a list is passed in, BeautifulSoup4 returns every node that matches any element of the list

t_list = bs.find_all(["meta","link"])
for item in t_list:
    print(item)

Function: pass in a function; a tag matches when the function returns True for it

def name_is_exists(tag):
    return tag.has_attr("name")

t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)

kwargs parameter

# Find the tag with id="head"
t_list = bs.find_all(id="head")
print(t_list)

# Find tags whose href attribute matches http://news.baidu.com
t_list = bs.find_all(href=re.compile("http://news.baidu.com"))
print(t_list)

# Find all tags that have a class attribute (note: class is a Python keyword, hence the trailing underscore)
t_list = bs.find_all(class_=True)
for item in t_list:
    print(item)

attrs parameter

Not all attributes can be searched this way; HTML data-* attributes, for example, are not valid keyword arguments. For those we can pass a dictionary via the attrs parameter to search for tags with such attributes:

t_list = bs.find_all(attrs={"data-foo": "value"})
for item in t_list:
    print(item)

text parameter

The text parameter searches the string content of the document, and it accepts the same kinds of values as the name parameter: strings, regular expressions, and lists.

t_list = bs.find_all(text="hao123")
for item in t_list:
    print(item)

t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
for item in t_list:
    print(item)

t_list = bs.find_all(text=re.compile(r"\d"))
for item in t_list:
    print(item)
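In bs4 4.4.0 and later the same parameter is also available under the name string, with text kept as an alias. A runnable sketch of both the exact-match and regex forms (assuming a reasonably recent bs4):

```python
import re
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>hao123</p><p>abc</p><p>42</p>", "html.parser")

# Exact string match:
print(demo.find_all(string="hao123"))

# Regex match: every string that contains a digit.
print(demo.find_all(string=re.compile(r"\d")))  # ['hao123', '42']
```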

limit parameter

Pass in a limit parameter to cap the number of results returned.
For example, the following returns at most 2 matches:

t_list = bs.find_all("a",limit=2)
for item in t_list:
    print(item)

find()

Returns the first tag that meets the conditions; useful when we only need a single result.

t = bs.div.div

# equivalent to
t = bs.find("div").find("div")
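One caveat: find() returns None when nothing matches (find_all() returns an empty list), so chained lookups should be guarded. A sketch:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<div><a href="http://example.com">link</a></div>', "html.parser")

# No <span> in the document, so find() returns None.
print(demo.find("span"))          # None

# Guard before indexing, otherwise a chained lookup raises an exception.
a = demo.find("a")
if a is not None:
    print(a["href"])              # http://example.com
```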

CSS selector

# Find by tag name
print(bs.select('title'))
print(bs.select('a'))

# Find by class name
print(bs.select('.mnav'))

# Find by id
print(bs.select('#u1'))

# Combined (descendant) lookup
print(bs.select('div .bri'))

# Find by attribute
print(bs.select('a[class="bri"]'))
print(bs.select('a[href="http://tieba.baidu.com"]'))

# Get the text content
t_list = bs.select("title")
print(t_list[0].get_text())
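select() always returns a list, while the companion method select_one() returns the first match or None, which is handy for single lookups. A sketch:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<div id="u1"><a class="mnav" href="http://news.baidu.com">news</a></div>',
    "html.parser",
)

# select() -> list of all matches; select_one() -> first match or None.
print(demo.select("#u1 a.mnav"))
print(demo.select_one("#u1 a.mnav")["href"])   # http://news.baidu.com
print(demo.select_one(".missing"))             # None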

Origin: blog.csdn.net/xp178171640/article/details/106141830