Crawler (3)--Practical and detailed explanation of Beautiful Soup 4 library

01 Preface

After studying the first two articles, things are heating up:
Crawler (1) -- Understand the most basic concepts of crawlers, with hands-on practice in a single article
Crawler (2) -- Regular expressions
To use a crawler well, you need to analyze the pages you fetch. With Beautiful Soup 4, commonly known as BS4, this is easy to accomplish.

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/latest/

02 BeautifulSoup Library

2.1 Concept and overall output

BeautifulSoup is a Python library for extracting data from HTML or XML files: a parser that analyzes HTML or XML documents. It is usually used to analyze the structure of a web page and grab the corresponding document content.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"html.parser")
print(soup.prettify())

soup.prettify() formats the soup content for output. When BeautifulSoup parses an HTML document, it treats it much like a DOM document tree.

Note: even if a tag pair in the HTML source is missing its end tag, the output of prettify() auto-completes the end tag. This is one of the advantages of BeautifulSoup: even when given damaged markup, it generates a converted DOM tree that stays as consistent as possible with the content of your original document. This measure usually helps you collect data more correctly.
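A minimal sketch of this auto-completion, using a deliberately broken snippet (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Broken markup: </head> and </html> are missing
broken_html = "<html><head><title>BeautifulSoup技术</title>"

soup = BeautifulSoup(broken_html, "html.parser")
print(soup.prettify())
# The output contains </head> and </html>, which the source lacked
```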

2.2 BeautifulSoup object

The official BeautifulSoup document summarizes all objects into the following four types:

  • Tag: includes a Name and Attributes.
#Name
print(tag.name)    # print the tag name
print(tag.string)  # print the tag's text content

#Attributes
print(tag.attrs)     # print the tag's attributes (e.g. class and id) as key-value pairs
print(tag['class'])  # get the value of the class attribute
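A self-contained sketch of these accessors (the markup and class name are invented for illustration). Note that class is a multi-valued attribute in HTML, so tag['class'] returns a list:

```python
from bs4 import BeautifulSoup

# Assumed sample markup
html = '<a href="http://example.com" class="poet1" id="link1">杜甫</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(tag.name)      # a
print(tag.string)    # 杜甫
print(tag.attrs)     # {'href': 'http://example.com', 'class': ['poet1'], 'id': 'link1'}
print(tag['class'])  # ['poet1'] -- class is multi-valued, so a list comes back
```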

Get the initial label content:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"html.parser")

# Get the HTML head, printed together with the tags themselves
head = soup.head
print('Head:', head)

# Head: <head><title>BeautifulSoup技术</title></head>

Note that this direct .head access can only get the first matching tag. To get all matching tags, you need the find_all() function covered later.

  • NavigableString
    A NavigableString wraps the text inside a tag. Example: soup.a.string gets the text content of the first <a> hyperlink tag encountered while traversing.
  • BeautifulSoup
    The BeautifulSoup object represents the entire content of a document.
print(type(soup))
# <class 'bs4.BeautifulSoup'>
  • Comment
    The Comment object is a special type of NavigableString object that is used to handle comment objects.

Example:

markup = "<b><!-- This is a comment code. --></b>"  
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string 
print(comment) 
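Running this, the comment markers are stripped and the returned object really is a Comment (a subclass of NavigableString), which you can confirm with isinstance:

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!-- This is a comment code. --></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string

print(comment)                       # the text, without the <!-- --> markers
print(isinstance(comment, Comment))  # True
```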

2.3 How to get the content inside the html tag

1. Know the specific location of the information you want to get, and only get one content

soup = BeautifulSoup(html, "html.parser")
# Method 1: the tag's position is known -- select() returns a list, so index 1 is the second <a>
print(soup.select('a')[1].string)
# Method 2: the tag's position is known -- find_all() also returns an indexable list
print(soup.find_all('a')[1].string)
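A runnable version of both approaches, with a small invented page (the poet links are assumptions):

```python
from bs4 import BeautifulSoup

# Assumed sample page with two poet links
html = """
<div>
  <a href="/dufu" class="poet1">杜甫</a>
  <a href="/libai" class="poet1">李白</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Method 1: select() returns a list, so index 1 is the second <a>
print(soup.select('a')[1].string)   # 李白

# Method 2: find() returns only the first match
print(soup.find('a').string)        # 杜甫
```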

2. The location information is unknown, and only one content is obtained

# Alternatively, when the tag's position is unknown and you only know the class
# you are looking for, extract the text with a regular expression
import re

dufu_text = soup.find('a', {'class': 'poet1'}).text
dufu_two_chars = re.search(r'(\S\S)', dufu_text).group(1)

print(dufu_two_chars)  # prints the two characters "杜甫"

# Or use the class_ keyword shortcut
dufu_text = soup.find('a', class_='poet1')
print(dufu_text.text)
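The same technique as a self-contained example (the markup and the class name poet1 are invented for illustration):

```python
import re
from bs4 import BeautifulSoup

# Assumed markup; 'poet1' is a hypothetical class name
html = '<a href="/dufu" class="poet1">杜甫</a>'
soup = BeautifulSoup(html, "html.parser")

# Find by class, then pull two non-space characters out with a regex
dufu_text = soup.find('a', {'class': 'poet1'}).text
dufu_two_chars = re.search(r'(\S\S)', dufu_text).group(1)
print(dufu_two_chars)  # 杜甫

# The class_ keyword is an equivalent shortcut
print(soup.find('a', class_='poet1').text)  # 杜甫
```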

3. Unknown location information, get all content

# When the tag's position is unknown and you want every match, use find_all()
# and iterate over the returned list
dufu_text = soup.find_all('a', class_='poet1')

for dufu in dufu_text:
    print(dufu.text)

4. Want to get the url under the hyperlink tag

links = soup.find_all('a', class_='poet1')

# Take the hrefs out of the list one by one
for link in links:
    print(link.get('href'))

# Or with an index
for index, link in enumerate(links):
    href = link['href']
    print(f"Index: {index}, Href: {href}")
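A self-contained version of the href extraction (the markup and URLs are invented for illustration):

```python
from bs4 import BeautifulSoup

# Assumed sample markup with two 'poet1' links
html = """
<a href="http://example.com/dufu" class="poet1">杜甫</a>
<a href="http://example.com/libai" class="poet1">李白</a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', class_='poet1')

# link.get('href') returns None when the attribute is missing,
# while link['href'] would raise KeyError
for index, link in enumerate(links):
    print(f"Index: {index}, Href: {link.get('href')}")
```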

2.4 Detailed explanation of the most convenient statement find_all()

  The find_all() function is the BeautifulSoup function for finding all elements in the document that match given criteria. It accepts multiple parameters specifying the attributes and attribute values of the elements to search for. The parameters of find_all() are explained below:

  • name parameter: specifies the tag name, or list of tag names, to search for. Can be a string or a regular expression object. If this parameter is not specified, all tags in the document are returned.

Regular expression: find_all(re.compile("regular expression"))

  • attrs parameter: specifies the attributes and attribute values of the elements to search for. Can be a dictionary or keyword arguments. In the dictionary form, the key is the attribute name and the value is the attribute value; in the keyword form, the attribute name is the keyword and the attribute value is its value.

dufu_text = soup.find_all('a', attrs={'class': 'poet1'})

  • recursive parameter: Specifies whether to search recursively in the descendant nodes of the document. The default is True, which means recursive search. If set to False, only immediate children of the document are searched.

  • string parameter: specifies the text content of the elements to search for. Can be a string or a regular expression object. If specified, only elements containing the matching text are returned.

soup.find_all(string=["Tillie", "Elsie", "Lacie"])

  • limit parameter: Specifies the limit on the number of results returned. The default value is None, which means return all results. If a limit is set, returns up to the specified number of results.

  • **kwargs: any other keyword argument is treated as an attribute filter (class_ is used for class because class is a Python reserved word), for example:

soup.find_all("a", class_="poet", id="link1")
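The parameters above can be exercised together on a small invented document (the tags, classes, and ids here are assumptions for illustration):

```python
import re
from bs4 import BeautifulSoup

# Assumed sample document for exercising find_all() parameters
html = """
<body>
  <a href="/1" class="poet" id="link1">Elsie</a>
  <a href="/2" class="poet" id="link2">Lacie</a>
  <b>bold</b>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# name as a regular expression: tags named exactly 'a' or 'b'
print([t.name for t in soup.find_all(re.compile("^[ab]$"))])    # ['a', 'a', 'b']

# attrs as a dictionary
print(len(soup.find_all('a', attrs={'class': 'poet'})))         # 2

# keyword arguments as attribute filters
print(soup.find_all('a', class_='poet', id='link1')[0].string)  # Elsie

# string: match text nodes exactly
print(soup.find_all(string=["Elsie", "Lacie"]))                 # ['Elsie', 'Lacie']

# limit: stop after the first match
print(len(soup.find_all('a', limit=1)))                         # 1
```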

2.5 Traversing the document tree

Refer to the following documents:
1. https://blog.csdn.net/Eastmount/article/details/109497225
2. https://blog.csdn.net/qq_42554007/article/details/90675142?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522168005957816800186557025%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fall.%2522%257D&request_id=168005957816800186557025&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2allfirst_rank_ecpm_v1~hot_rank-5-90675142-null-null.142v76insert_down38,201v4add_ask,239v2insert_chatgpt&utm_term=beautifulsoup&spm=1018.2226.3001.4187
3. https://blog.csdn.net/xuebiaojun/article/details/119654841?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522168009930316800192237083%2522%252C%2522scm%2522%253A%252220140713.130102334…%2522%257D&request_id=168009930316800192237083&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2alltop_click~default-1-119654841-null-null.142v77insert_down38,201v4add_ask,239v2insert_chatgpt&utm_term=soup.find_all%E7%94%A8%E6%B3%95&spm=1018.2226.3001.4187

  • Get child nodes: the children or contents attributes.

In actual use, find_all() is generally preferred over these navigation attributes; they mainly help make the document structure clearer.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"html.parser")
print(soup.head.contents)  # print the list of direct children of the head tag
for contents in soup.head.contents:
    print(contents)
  • If you need to get the text of multiple nodes, use the strings attribute:
for content in soup.head.strings:
    print(content)
  • The output strings may contain extra spaces or newlines; use the stripped_strings attribute to remove them, as follows:
for content in soup.stripped_strings:
    print(content)
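The difference between contents, strings, and stripped_strings can be seen on a small invented document with deliberate stray whitespace:

```python
from bs4 import BeautifulSoup

# Assumed small document; note the stray whitespace around the title text
html = "<head><title>\n  BeautifulSoup技术\n</title></head>"
soup = BeautifulSoup(html, "html.parser")

print(soup.head.contents)           # direct children of <head>: just the <title> tag

# .strings yields the raw text nodes, whitespace and all
print(list(soup.strings))

# .stripped_strings strips surrounding whitespace and skips blank nodes
print(list(soup.stripped_strings))  # ['BeautifulSoup技术']
```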

For other parts of the document tree, searching works much like traversing, covering parent nodes, child nodes, sibling nodes, and so on; the official documentation is a good place to learn them. In practice, find_all() is what is generally used.

03 BOM and DOM

To use BS4 well, it helps to understand the BOM and DOM. I originally wanted to cover them here, but later found they deserve separate study, so this article ends here for now. The next article will cover BOM and DOM, and I will continue teaching you how to write crawlers.~

Origin blog.csdn.net/qq_54015136/article/details/129843517