Python crawler - use BeautifulSoup4 to parse HTML documents


1. Introduction to BeautifulSoup4

1.1 BS4 and lxml

Let's get straight to the point: what is BS4 and what can it do? BS4 (BeautifulSoup4) is a Python library for extracting data from HTML and XML files. It converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four categories, which we will cover shortly.

BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8 encoding.

lxml, like BS4, is an excellent Python parsing library, but the difference between the two explains why we choose BeautifulSoup here: lxml can parse a document incrementally and traverse it locally, while BS4 loads the whole document and builds the entire DOM tree. Because of that extra work, BS4 has higher time and space overhead, and its overall performance is lower than lxml's.

Still, we use BeautifulSoup4 for good reasons: BS4 makes parsing HTML relatively simple, its API is user-friendly, and it supports CSS selectors, which is excellent for a library that parses the entire DOM tree. Even better, it can work with the HTML parser from the Python standard library as well as the lxml parser.
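The parser choice mentioned above is just the second argument to the BeautifulSoup constructor. A minimal sketch (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny made-up document, just to illustrate parser selection
html = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# The HTML parser from the Python standard library: no extra install needed
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)   # Demo

# If lxml is installed (pip install lxml), the same call can use it instead:
# soup = BeautifulSoup(html, "lxml")
```

Both parsers produce the same kind of tree; lxml is simply faster on large documents.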

1.2 Four types of objects of BeautifulSoup

As we said earlier, BS4 "converts a complex HTML document into a tree structure in which each node is a Python object". These Python objects can be roughly divided into four categories: Tag, NavigableString, BeautifulSoup and Comment. Let's use a few examples to get familiar with these four object types:

  • 1. Tag: a tag and its contents; by default only the first match is returned
  • 2. NavigableString: the string content inside a tag
  • 3. BeautifulSoup: represents the entire document
  • 4. Comment: a special NavigableString whose output does not include the comment markers

The first step is to import the package: from bs4 import BeautifulSoup. Then use a file operation to open an HTML file from our project and read it into a variable. To have BeautifulSoup parse it, we also need to pass in a parser: html.parser.

from bs4 import BeautifulSoup
import re
file = open("./Demo.html", "rb")
html = file.read().decode("utf-8")
# html.parser parses our HTML into a tree
bs = BeautifulSoup(html, "html.parser")
# 1.Tag
print("1. Tag example: get the title")
print(bs.title)
# 2.NavigableString
print("2. NavigableString example: get the title's string and the div's attributes")
print(bs.title.string)
print(bs.div.attrs)     # get all attributes of the tag as a dictionary
# 3.BeautifulSoup
print("3. BeautifulSoup example: get the name of the whole HTML document")
print(bs.name)
# 4.Comment
print("4. Comment example: get the string of the a tag")
print(bs.a.string)

Following the plan above, let's walk through the code. Starting at line 7 we demonstrate the four object types with examples. For the Tag type we print bs.title, which is all the information of this HTML document's title tag. On lines 12 and 13, bs.title.string and bs.div.attrs print what is inside the tags: the title's string and the attributes of the div box. The rest of the code follows the same pattern.
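The Comment type is easiest to see with a tag whose only content is an HTML comment. A small self-contained sketch (the snippet here is made up, so it does not depend on Demo.html):

```python
from bs4 import BeautifulSoup, Comment

# An <a> tag whose only content is an HTML comment (made-up snippet)
html = '<a href="baidu.html"><!--this is a comment--></a>'
bs = BeautifulSoup(html, "html.parser")

s = bs.a.string
print(type(s))   # <class 'bs4.element.Comment'>
print(s)         # the comment text, without the <!-- --> markers
```

This is why bs.a.string must be checked: the same .string access can return a NavigableString or a Comment depending on the tag's content.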

2. Document search methods

The raw source we fetch from a web page's URL cannot be used directly, so after obtaining the HTML we need to search and filter it before the information is usable. For document searching, BeautifulSoup provides several methods: searching directly with find_all(), searching with kwargs to specify parameters, searching by string (often combined with regular expressions), and limiting the number of results with the limit parameter. As before, we will master these search methods through some hands-on practice.

2.1 Search using find_all()

  • String filtering: looks for content that exactly matches the given string (note: it must be an exact match)
  • Regular-expression search: uses the pattern's search() method to match content; what it matches is still a whole tag, not a fragment
t_list = bs.find_all("a")
t_list02 = bs.find_all(re.compile("a"))

For the string search, we pass the string "a" into the find_all() method to filter the whole document; only tags whose name is exactly "a" are returned. For the regular-expression search, we pass a pattern such as re.compile("a") into find_all(), and it returns every tag whose name contains an "a".
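To make the difference concrete, here is a self-contained sketch with a made-up snippet: the exact string "a" matches only &lt;a&gt; tags, while re.compile("a") also matches other tags whose names contain an "a", such as &lt;article&gt;:

```python
import re
from bs4 import BeautifulSoup

html = "<body><a href='#'>link</a><article>text</article><p>para</p></body>"
bs = BeautifulSoup(html, "html.parser")

# Exact string: only tags named exactly "a"
exact = [t.name for t in bs.find_all("a")]
print(exact)   # ['a']

# Regex: every tag whose name contains an "a" ("a" and "article")
fuzzy = [t.name for t in bs.find_all(re.compile("a"))]
print(fuzzy)   # ['a', 'article']
```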

2.2 Use kwargs to specify parameter search

Document searching still uses the find_all() method, but the parameters we pass in are completely different. As examples, let's search for tags with id="update", tags that have a class attribute, and tags with href="baidu.html".

# 2. kwargs: search by specifying named parameters
print('-------(1) show tags with id="update"')
t_list03 = bs.find_all(id="update")
for item in t_list03:
    print(item)
print("-------(2) show any tag that has a class attribute")
t_list04 = bs.find_all(class_=True)
for item in t_list04:
    print(item)
print('-------(3) find tags with href="baidu.html"')
t_list05 = bs.find_all(href="baidu.html")
for item in t_list05:
    print(item)

2.3 text parameter search

Searching with the text parameter is a little easier. When searching by text, we generally pass a list or a regular expression, which lets us filter data in a targeted way.
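A minimal sketch of text search, with a made-up HTML snippet: a list gives exact string matches, while a regular expression matches by pattern (newer bs4 releases also accept string= as an alias for text=):

```python
import re
from bs4 import BeautifulSoup

html = "<ul><li>Python</li><li>Java</li><li>Python3</li></ul>"
bs = BeautifulSoup(html, "html.parser")

# A list gives exact string matches
exact = bs.find_all(text=["Python", "Java"])
print(exact)    # ['Python', 'Java']

# A regular expression matches any string containing a digit
digits = bs.find_all(text=re.compile(r"\d"))
print(digits)   # ['Python3']
```

Note that text search returns the matched NavigableString objects themselves, not the tags that contain them.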

2.4 Set limit parameter search

If we use code like bs.find_all(class_=True) to filter out every tag with a class attribute, there can be many, many results. If we only want 3 of them, we set the parameter limit=3 to meet that need:

t_text10 = bs.find_all("a", limit=3)
for item in t_text10:
    print(item)

2.5 CSS selectors

Document searching through CSS selectors is comparatively rich: you can search by tag, class name, id, tag attribute, child tag, and so on. The difference is that here we use the bs.select() method to filter.

# CSS selectors
print("-------(1) select by tag")
t_css = bs.select('title')  # select by tag name
for item in t_css:
    print(item)
print("-------(2) select by class name")
t_css2 = bs.select(".col-sm-3")  # select by class name
for item in t_css2:
    print(item)
print("-------(3) select by id")
t_css3 = bs.select("#btn_add")
for item in t_css3:
    print(item)
print("-------(4) select by tag attribute")
t_css4 = bs.select("button[type='submit']")
for item in t_css4:
    print(item)
# note: no spaces inside the attribute selector
print("-------(5) select by child tag")
t_css5 = bs.select("div > button")
for item in t_css5:
    print(item)
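These selector forms can also be combined in one expression, and .get_text() pulls the text out of whatever select() returns. A small self-contained sketch (the HTML is made up to match the class and attribute names used above):

```python
from bs4 import BeautifulSoup

html = '<div class="col-sm-3"><button type="submit">OK</button></div>'
bs2 = BeautifulSoup(html, "html.parser")

# Combine class, child and attribute selectors in one expression
matches = bs2.select("div.col-sm-3 > button[type='submit']")
for tag in matches:
    print(tag.get_text())   # OK
```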

Origin blog.csdn.net/qq_50587771/article/details/123870433