Python—bs4 module analysis

1. Introduction

Beautiful Soup is a Python library that extracts data from HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying a parsed document through your favorite parser, and it can save you hours or even days of work.

Beautiful Soup supports the HTML parser in the Python standard library and also supports several third-party parsers, one of which is lxml. lxml is the parser most often used together with Beautiful Soup.

Install and use:

  • Install bs4: python -m pip install bs4
  • Install the lxml parser: python -m pip install lxml
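
A quick way to confirm both installs succeeded is to import the two packages and print their versions (a minimal check; the exact version numbers will vary):

# Confirm that bs4 and the lxml parser are importable
import bs4
import lxml

print(bs4.__version__)       # e.g. 4.12.2
print(lxml.__version__)      # e.g. 4.9.3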

2. Basic use

Import the module and create a BeautifulSoup object:

from bs4 import BeautifulSoup

html_doc = '<b class="boldest">hello, world!!!</b>'   # any string of HTML
soup = BeautifulSoup(html_doc, 'lxml')

BeautifulSoup accepts either a string of HTML or an open file handle, for example:

soup = BeautifulSoup(open('index.html'),'lxml')

3. Types of objects

1. Tag

A tag has many methods and properties; the two most important properties are name and attributes.

name:

Each tag has its own name, accessible through the .name attribute; assigning to .name renames the tag. For example:

soup = BeautifulSoup('<b class="boldest">hello, world!!!</b>', 'lxml')
tag = soup.b
print(tag.name)         # b
tag.name = 'testTag'
print(tag)              # <testTag class="boldest">hello, world!!!</testTag>

Attributes:

A tag may have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest". Tag attributes are read and modified just like a dictionary:

print(tag['class'])            # ['boldest']  (class is multi-valued, so a list comes back)
print(tag.attrs)               # all attributes as a dictionary: {'class': ['boldest']}
tag['id'] = 1                  # adds the attribute if it is missing, otherwise updates its value
del tag['id']                  # deletes an existing attribute

Multi-valued attributes:

HTML 4 defines several attributes that can contain multiple values; HTML5 removes some of them but adds others. The most common multi-valued attribute is class (a tag can have more than one CSS class); others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns multi-valued attributes as lists:

soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
print(soup.p['class'])           # ['body', 'strikeout']

soup1 = BeautifulSoup('<p class="body"></p>', 'lxml')
print(soup1.p['class'])          # ['body']

If an attribute looks like it has multiple values but is not defined as a multi-valued attribute in any version of the HTML standard, Beautiful Soup returns it as a single string. For example:

soup = BeautifulSoup('<p id="is my id"></p>', 'lxml')
print(soup.p['id'])                 # 'is my id'

If the document is parsed as XML, there are no multi-valued attributes and every attribute value is treated as a string.
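
For example (a quick sketch, with BeautifulSoup already imported as above; the 'xml' parser requires lxml to be installed):

# Parsed as XML, class stays a single string instead of becoming a list
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
print(xml_soup.p['class'])        # body strikeout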

2. Traversable strings

Strings are often included in tags. Beautiful Soup uses the NavigableString class to wrap strings in tags:

soup = BeautifulSoup('<b class="boldest">hello, world!!!</b>', 'lxml')
tag = soup.b
print(tag.string)               # hello, world!!!
print(type(tag.string))         # <class 'bs4.element.NavigableString'>

A NavigableString behaves like a Python Unicode string and also supports some of the features described in Traversing the document tree and Searching the document tree. A NavigableString can be converted to a plain string with str() (unicode() in Python 2):

soup = BeautifulSoup('<b class="boldest">hello, world!!!</b>', 'lxml')
tag = soup.b
print(type(str(tag.string)))      # <class 'str'>

4. Traversing the document tree

Sample code: fetch a personal navigation page and parse it.

import requests
from bs4 import BeautifulSoup

url = 'http://xx.xx.xx.128'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
req = requests.get(url, headers=headers)
# Create a BeautifulSoup object from the decoded response body
soup = BeautifulSoup(req.content.decode('utf-8'), 'lxml')
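
If the internal address above is not reachable from your machine, the same soup can be built from a saved copy of the page instead (the file name here is only a placeholder):

# Alternative: parse a locally saved copy of the page (hypothetical file name)
with open('nav.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')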

1. Traversing child nodes

1) Traversing by tag name

A Tag may contain multiple strings or other Tags; these are its child nodes. Beautiful Soup provides many attributes for operating on and traversing child nodes.

The simplest way to navigate the parse tree is to use the name of the tag you want. To get the <head> tag, just use soup.head:

soup.head       
# <head>
# <title>个性化安全导航</title>
# <meta charset="utf-8"/>
# <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>
# </head>
soup.title
# <title>个性化安全导航</title>

To reach a tag nested deeper in the tree, just chain the tag names, for example getting the h2 tag under body:

soup.body.h2    
# <h2 class="title">论坛社区</h2>

When several child tags share the same name, dotted access only returns the first one. To get all of them, use the methods described in Searching the document tree, such as find_all():

soup.body.find_all('h2') 
#[<h2 class="title">论坛社区</h2>,  <h2 class="title">杂七杂八</h2>]

2) Traversing with .contents and .children

The .contents attribute of a tag can output the child nodes of the tag as a list:

soup.head.contents
# [<title>个性化安全导航</title>, <meta charset="utf-8"/>, <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>]

The tag's .children generator lets you loop over the child nodes:

for children in soup.head.children:
    print(children)
# <title>个性化安全导航</title>
# <meta charset="utf-8"/>
# <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>

3) .descendants traverses all descendant nodes

The .contents and .children attributes only include a tag's direct children. In the example above, <head> has three direct children: <title>, <meta>, and <link>.

But the <title> tag itself contains a child node, the string "个性化安全导航", so that string is also a descendant of <head>. The .descendants attribute recursively iterates over all of a tag's children and grandchildren:

for children in soup.head.descendants:
    print(children)
# <title>个性化安全导航</title>
# 个性化安全导航
# <meta charset="utf-8"/>
# <link href="./css/back_ground.css" rel="stylesheet" type="text/css"/>

4) .strings and .stripped_strings traverse strings

If a tag contains more than one string, you can loop over them with .strings, but .strings also yields the newlines and whitespace between tags. Use .stripped_strings to skip the extra whitespace (strings consisting entirely of whitespace are ignored, and leading and trailing whitespace is removed):

for children in soup.body.div.find_all('div')[1].stripped_strings:
    print(children)
# FreBuff 安全论坛
# 先知安全社区
# 安全客
# CSDN博客
# 百度超级链

2. Traversing the parent node

1) .parent gets the parent node

The .parent attribute returns an element's parent node, for example the parent of the first h2 tag:

soup_h2 = soup.h2
print(soup_h2)              
#<h2 class="title">论坛社区</h2>
print(soup_h2.parent) 
#<div class="title_card"><h2 class="title">论坛社区</h2></div>

2) .parents gets all ancestor nodes

The .parents attribute iterates recursively over all of an element's ancestors. Continuing with the h2 tag as an example:

soup_h2 = soup.h2
print(soup_h2)
for parent in soup_h2.parents:
    print(parent.name)

'''
<h2 class="title">论坛社区</h2>
div
div
body
html
[document]
'''

3. Sibling nodes

1) Query the previous sibling nodes

(1) .previous_sibling

Get the previous sibling node.

soup_span = soup.find(class_ = 'card-title')
print(soup_span)
#<span class="card-title">FreBuff 安全论坛</span>
print(soup_span.previous_sibling)
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>

(2) .previous_siblings

Get all previous sibling nodes.

soup_span = soup.find(class_ = 'card-title')
print(soup_span)
#<span class="card-title">FreBuff 安全论坛</span>
for span in soup_span.previous_siblings:
    print(span)
'''There is only one previous sibling, so only one node is found:
<span class="card-icon"><img src="./img/frebuff.ico"/></span>
'''

2) Query the following sibling nodes

(1) .next_sibling

Get the next sibling node.

soup_span = soup.find(class_ = 'card-icon')
print(soup_span)
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>
print(soup_span.next_sibling)
#<span class="card-title">FreBuff 安全论坛</span>

(2) .next_siblings

Get all subsequent sibling nodes.

soup_span = soup.find(class_ = 'card-icon')
print(soup_span)
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>
for span in soup_span.next_siblings:
    print(span)
'''There is only one following sibling, so only one node is printed:
<span class="card-title">FreBuff 安全论坛</span>
'''

5. Searching the document tree

There are two main methods for searching the document tree: find() and find_all(). find() returns the first match in the document, while find_all() searches the whole document and returns a list of matches.
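
A quick comparison, sketched against the navigation page parsed above:

# find() returns the first matching Tag (or None if nothing matches)
print(soup.find('h2'))              # <h2 class="title">论坛社区</h2>
# find_all() returns a list of every match
print(len(soup.find_all('h2')))     # 6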

1. find()

1) Search by string filter

The simplest filter is a string. Pass a string to the search method and Beautiful Soup finds content that matches that string exactly, for example finding the first <img> tag in the document:

tag = soup.find('img')
print(tag)
#<img src="./img/frebuff.ico"/>

2) Search by regular expression

If a regular expression is passed in, Beautiful Soup filters tag names with the regular expression's match() method. The following example finds the first tag in the document whose name starts with t:

import re

tag = soup.find(re.compile("^t"))
print(tag)
# <title>个性化安全导航</title>

2. find_all()

1) Filter by string

Example: Get all h2 tags.

tag = soup.find_all('h2')
print(tag)
'''
[<h2 class="title">论坛社区</h2>, <h2 class="title">区 块 链</h2>, <h2 class="title">在线办公</h2>, <h2 class="title">知识
学习</h2>, <h2 class="title">编码解码</h2>, <h2 class="title">杂七杂八</h2>]
'''

2) Filter by regular expression

Example: get all tags whose names start with h.

for tag in soup.find_all(re.compile("^h")):
    print(tag.name,end=",")
#html,head,h2,h2,h2,h2,h2,h2,

3) Filter with a list

If a list is passed in, Beautiful Soup returns every tag that matches any item in the list, in the order they appear in the document:

for tag in soup.find_all(['body','h2']):
    print(tag.name,end=",")
#body,h2,h2,h2,h2,h2,h2,

4) Search by keyword

Example: search by CSS class. Because class is a reserved word in Python, the keyword argument is written class_:

for tag in soup.find_all(class_ = 'card-title'):
    print(tag.text,end=',')
#FreBuff 安全论坛,先知安全社区,安全客 ,CSDN博客,百度超级链 , 巴 比 特 ,金 色 财 经,火 币 网,欧易交易所, 非 小 号 ,Ethplorer交易浏览器......

Along with a keyword, you can also pass a compiled regular expression from the re module:

for tag in soup.find_all(class_ = re.compile('^card')):
    print(tag,end=',')
#<span class="card-icon"><img src="./img/frebuff.ico"/></span>,<span class="card-title">FreBuff 安全论坛</span>,<span class="card-icon"><img src="./img/xz.ico"/></span>,<span class="card-title">先知安全社区</span>,.......

A single keyword is specified above; multiple keyword arguments can also be combined in one search.

for tag in soup.find_all(class_="link-tooltip", title="https://www.okx.com/"):
    print(tag,end=',')
'''
<a class="link-tooltip" href="https://www.okx.com/" target="_blank" title="https://www.okx.com/">
<span class="card-icon"><img src="./img/ouyi.ico"/></span>
<span class="card-title">欧易交易所</span>
</a>,
'''

5) Use limit to limit the number of returned results

The find_all() method returns every match it finds, which can be slow on a very large document tree. If you do not need all the results, the limit parameter caps the number of results returned, much like the LIMIT keyword in SQL: once the number of matches reaches the limit, the search stops and the results are returned.

tag = soup.find_all('span', limit=4)
print(tag)
#[<span class="card-icon"><img src="./img/frebuff.ico"/></span>, <span class="card-title">FreBuff 安全论坛</span>, <span class="card-icon"><img src="./img/xz.ico"/></span>, <span class="card-title">先知安全社区</span>]

