Python crawler: the bs4 module

Full name: Beautiful Soup. This module is a third-party library, so it has to be installed separately.

Installation method:

pip install bs4

bs4 relies on a document parser to parse pages. Python's built-in html.parser needs no extra installation, but to use lxml as the parsing library it must be installed as well:

pip install lxml

In addition, when using the bs4 module we need to create a BeautifulSoup object. Commonly used parsers for this object are:

  • html.parser
  • lxml
  • xml
  • html5lib

BeautifulSoup supports Python's standard HTML parsing library by default, but it also supports some third-party parsing libraries.

| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, moderate execution speed, strong document fault tolerance | Document fault tolerance is poor in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast, strong document fault tolerance | Requires a C library |
| lxml XML parser | BeautifulSoup(markup, "xml"); BeautifulSoup(markup, "lxml-xml") | Fast, the only parser that supports XML | Requires a C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance | Slow |
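
Which of these parsers can be used depends on what is installed, so a script may want to try a preferred parser and fall back to the built-in one. The following is a minimal sketch of that idea; the helper name make_soup and the sample markup are made up for illustration and are not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is available.

    make_soup is a hypothetical helper, not a bs4 function.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello, <b>parser</b></p>")
print(soup.p.get_text())  # Hello, parser
```

Putting html.parser last means the call still succeeds on a machine where lxml has not been installed.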

BeautifulSoup

● Usage examples:

BeautifulSoup(url.text, 'html.parser')    # "url.text" is the content to be parsed, "html.parser" is the parser

Here the first parameter is the content to be parsed, and the second parameter is the parser to use.
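
As a self-contained illustration of that call, the sketch below parses a literal string instead of a downloaded page. The variable page_text and its HTML are invented for the example; in a real crawler the string would typically come from an HTTP response.

```python
from bs4 import BeautifulSoup

# Stand-in for the text of a downloaded page (for example the .text of a response)
page_text = """
<html>
  <head><title>Demo page</title></head>
  <body><h1 id="top">Hello</h1></body>
</html>
"""

# First argument: the content to parse; second argument: the parser name
soup = BeautifulSoup(page_text, "html.parser")

print(soup.title.string)  # Demo page
print(soup.h1["id"])      # top
```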

Find node

There are two ways to search: find returns the first matching tag (or None if nothing matches), and find_all returns all matching tags.
● Writing format: find("tag name to be found", attrs={"attribute name": "attribute value"}). Usage example:

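# Find a single tag (the first match)
html.find("div", attrs={"class": "v7W49e"})
# Find every tag that matches
html.find_all("div", attrs={"class": "v7W49e"})

# Find several kinds of tags at once (tag names are passed as strings)
html.find_all(["a", "h3"])

# Find all a tags under a ul tag
html.find("ul", attrs={"class": "v7W49e"}).find_all("a")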

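The snippets above assume an existing html object. A runnable sketch with made-up markup (the class names menu and nav are placeholders) shows what each call returns:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="menu">
  <h3>Links</h3>
  <ul class="nav">
    <li><a href="/home">Home</a></li>
    <li><a href="/about">About</a></li>
  </ul>
</div>
<a href="/outside">Outside</a>
"""

html = BeautifulSoup(html_doc, "html.parser")

# find returns the first match, or None when nothing matches
first_link = html.find("a")
missing = html.find("table")

# find_all returns a list-like result (empty when nothing matches)
all_links = html.find_all("a")
links_and_headings = html.find_all(["a", "h3"])

# Chaining restricts the second search to the descendants of the ul
nav_links = html.find("ul", attrs={"class": "nav"}).find_all("a")

print(first_link["href"])                        # /home
print(missing)                                   # None
print(len(all_links), len(links_and_headings))   # 3 4
print([a.get_text() for a in nav_links])         # ['Home', 'About']
```

Because find can return None, chaining another call onto it raises AttributeError when the first search fails, so it is worth checking the result before chaining on pages whose structure is not guaranteed.
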
Get text

  • There are two commonly used ways to get the text inside a tag in bs4:

    | Method | Explanation |
    |---|---|
    | a.get_text() | A method of the tag object. It collects all text in the tag and its sub-tags and merges it into a single string. By default the pieces are joined with no separator and the whitespace already present in the document is kept; the optional separator and strip arguments change this, e.g. p.get_text(" ", strip=True). It is called as p.get_text(), where p is a tag found by Beautiful Soup. |
    | a.text | A property of the tag object. It returns the same string as get_text() called with default arguments, so it also includes the text of sub-tags. It is called as p.text. |

  • Example:

    <p>
        This is a paragraph.
        <strong>Bold text</strong>
        <em>Italic text</em>
    </p>

    Using p.get_text(" ", strip=True) returns the merged text content: This is a paragraph. Bold text Italic text

    p.text returns the same merged text as p.get_text(), with the line breaks and indentation of the source kept.

    To get only the text directly contained in the paragraph tag, without the text of the sub-tags, use p.find(string=True, recursive=False). It returns the first direct string, which after stripping whitespace is: This is a paragraph.
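
The behaviour of these calls is easy to check directly. The sketch below uses a one-line version of the paragraph; the variable names are arbitrary, and filtering p.children by type is just one possible way to pick out the direct text.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = "<p>This is a paragraph. <strong>Bold text</strong> <em>Italic text</em></p>"
p = BeautifulSoup(html_doc, "html.parser").p

# .text and .get_text() return the same string, including text inside sub-tags
print(p.get_text())            # This is a paragraph. Bold text Italic text
print(p.text == p.get_text())  # True

# separator and strip control how the individual pieces are joined
print(p.get_text(separator="|", strip=True))  # This is a paragraph.|Bold text|Italic text

# Keep only the strings that are direct children of <p>, skipping whitespace-only ones
direct_text = [
    child.strip()
    for child in p.children
    if isinstance(child, NavigableString) and child.strip()
]
print(direct_text)  # ['This is a paragraph.']
```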

Get attribute value:

In BeautifulSoup4 (bs4), you can use the following two methods to obtain the attribute value of a tag:

  1. Use the tag.get('attribute') method: returns the value of the specified attribute, or None if the attribute does not exist.
  2. Use tag['attribute']: gets the attribute value directly through square brackets. If the attribute does not exist, this raises a KeyError. Dot notation (tag.attribute) does not read attributes: bs4 treats it as a search for a child tag with that name and returns None when there is no such child.
  • Example:
<a class="link" href="xxx.xxx.xxx" target="_blank">Example Website</a>

Use Python to extract the href attribute of the a tag:

from bs4 import BeautifulSoup

data = '<a class="link" href="xxx.xxx.xxx" target="_blank">Example Website</a>'

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')

# Get the attribute value with the get method
tag = soup.find('a')
value = tag.get('href')
print(value)

# Get the attribute value with square brackets
brackets = tag['href']
print(brackets)

# Dot notation looks for a child tag named href, not the attribute, so this prints None
dot = tag.href
print(dot)

Note: If the specified attribute does not exist in the tag, square brackets raise a KeyError, while the get() method returns None. Dot notation is not a way to read attributes at all, since it searches for a child tag. Therefore, before using square brackets it is best to make sure the attribute exists, or to use get() to read the value safely.
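
A few related calls help when an attribute may be missing. The sketch below is illustrative only: the second class value, external, is added to the sample tag to show how multi-valued attributes come back.

```python
from bs4 import BeautifulSoup

data = '<a class="link external" href="xxx.xxx.xxx" target="_blank">Example Website</a>'
tag = BeautifulSoup(data, "html.parser").find("a")

# All attributes of the tag as a dictionary
print(tag.attrs)

# get() accepts a fallback value for attributes that are missing
print(tag.get("title", "no title"))  # no title

# has_attr() checks for existence before square brackets are used
if tag.has_attr("href"):
    print(tag["href"])               # xxx.xxx.xxx

# class can hold several values, so bs4 returns it as a list
print(tag["class"])                  # ['link', 'external']

# Square brackets on a missing attribute raise KeyError
try:
    tag["title"]
except KeyError:
    print("title attribute is missing")
```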

Origin blog.csdn.net/m0_55994898/article/details/132147519