Hardcore! A detailed guide to BeautifulSoup, the web-page parsing powerhouse for Python crawlers

1. Introduction to BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. It works on top of parsers such as lxml and html5lib, letting users choose among different parsing strategies or trade flexibility for speed.
BeautifulSoup Official Documentation: BeautifulSoup


2. Installation

pip install bs4	   # install the BeautifulSoup package
pip install lxml	# install the lxml package

How to use each parser, with its advantages and disadvantages:

# Python standard library
BeautifulSoup(html, 'html.parser')
# Pros: built into the standard library; moderate speed; good fault tolerance
# Cons: poor fault tolerance in versions before Python 3.2

# lxml HTML parser
BeautifulSoup(html, 'lxml')
# Pros: fast; good fault tolerance
# Cons: requires a C library to be installed

# lxml XML parser
BeautifulSoup(html, 'xml')
# Pros: fast; the only parser that supports XML
# Cons: requires a C library to be installed

# html5lib parser
BeautifulSoup(html, 'html5lib')
# Pros: best fault tolerance; generates valid HTML5
# Cons: slow; does not rely on external extensions

Summary of crawler parsers

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(html, "html.parser") | Built-in, moderate speed, good fault tolerance | Poor fault tolerance before Python 3.2 |
| lxml HTML parser | BeautifulSoup(html, "lxml") | Fast, good fault tolerance | Requires a C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast, the only parser that supports XML | Requires a C library |
| html5lib | BeautifulSoup(html, "html5lib") | Best fault tolerance, generates HTML5-format documents | Slow; does not rely on external extensions |

3. The principle of bs4 data analysis

  1. Instantiate a BeautifulSoup object, and load the page source code data into the object.
  2. Label positioning and data extraction are performed by calling related properties or methods in the BeautifulSoup object.
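A minimal sketch of these two steps (the HTML fragment and tag names here are illustrative):

```python
from bs4 import BeautifulSoup

# Step 1: instantiate a BeautifulSoup object and load the page source into it
page_source = "<html><body><h1 id='headline'>Hello, soup</h1></body></html>"
soup = BeautifulSoup(page_source, "html.parser")

# Step 2: locate tags and extract data via the object's properties and methods
headline = soup.find("h1", id="headline")
print(headline.text)
```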

4. Commonly used methods and properties of bs4

1. BeautifulSoup construction

1.1 Build from String

from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Title</title>
</head>
<body>
<div id="container">
  <span class="title">
    <h3>Python爬虫网页解析神器BeautifulSoup详细讲解</h3>
  </span>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
# print the soup object's content, pretty-printed
print(soup.prettify())

prettify() formats and prints the HTML content; you will use this function frequently.

1.2 Load from file

from bs4 import BeautifulSoup

with open(r"D:\index.html") as fp:
    soup = BeautifulSoup(fp, "lxml")
print(soup.prettify())

2. Four objects of BeautifulSoup

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup and Comment.
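A quick way to see all four types at once (the fragment is illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p class='intro'>text<!-- a comment --></p>", "html.parser")

print(type(soup))                # BeautifulSoup: the whole document
print(type(soup.p))              # Tag: the <p> element
print(type(soup.p.contents[0]))  # NavigableString: the text node 'text'
print(type(soup.p.contents[1]))  # Comment: the comment node
```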

2.1 Tag object

A Tag object also exposes string, strings and stripped_strings.
If a node contains only text, you can access that text directly through .string, for example:

from bs4 import BeautifulSoup

html = """
<title>The Kevin's story house</title>
<span>这里是王菜鸟的Python系列文章</span>
<p class="link">
<a href="https://token.blog.csdn.net/">王菜鸟的博客</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)
print(soup.span.text)
print(soup.a['href'])

# Output
The Kevin's story house
这里是王菜鸟的Python系列文章
https://token.blog.csdn.net/

The methods above return the first tag in the document that matches. A Tag has two important attributes: name and attrs.

print(soup.p.attrs)	# all attributes of the p tag, returned as a dictionary
print(soup.p['class'])	# fetch a single attribute
print(soup.p.get('class'))	# same as above: fetch a single attribute

# Output
{'class': ['link']}
['link']
['link']

Make changes to these attributes and content:

soup.p['class'] = "newClass"
print(soup)

# Output
<title>The Kevin's story house</title>
<span>这里是王菜鸟的Python系列文章</span>
<p class="newClass">
<a href="https://token.blog.csdn.net/">王菜鸟的博客</a>
</p>

Additionally, a property can be removed:

del soup.p['class']
print(soup)

# Output
<title>The Kevin's story house</title>
<span>这里是王菜鸟的Python系列文章</span>
<p>
<a href="https://token.blog.csdn.net/">王菜鸟的博客</a>
</p>

tag.attrs is a dictionary, so an attribute can be read in two ways: tag.get('id') or tag.get('class') return None if the attribute does not exist, while subscript access (tag['id']) may raise a KeyError.
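The difference between the two access styles, on a throwaway fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister" href="https://example.com">link</a>', "html.parser")
tag = soup.a

print(tag.get("class"))   # ['sister']
print(tag.get("id"))      # None -- a missing attribute returns None with get()
try:
    tag["id"]             # subscript access raises KeyError instead
except KeyError as e:
    print("KeyError:", e)
```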

Second, you can use get_text() to retrieve text nodes:

# get all text content
soup.get_text()
# separate the text of different nodes with |
soup.get_text("|")
# also strip the whitespace around each piece of text
soup.get_text("|", strip=True)
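A self-contained sketch of the three calls (the fragment is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> one </p><p> two </p>", "html.parser")

print(repr(soup.get_text()))                  # ' one  two '
print(repr(soup.get_text("|")))               # ' one | two '
print(repr(soup.get_text("|", strip=True)))   # 'one|two'
```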

2.2 NavigableString object

If you want to get the content inside a tag, you can use .string:

print(soup.a.string)
print(type(soup.a.string))

# Output
王菜鸟的博客
<class 'bs4.element.NavigableString'>

2.3 BeautifulSoup object

The BeautifulSoup object represents the entire document. Most of the time it can be treated as a Tag; it is a special Tag whose name can be retrieved separately:

print(soup.name)
print(type(soup.name))
print(soup.attrs)

# Output
[document]
<class 'str'>
{}

2.4 Comment object

The Comment object is a special type of NavigableString; when printed, the output does not include the comment delimiters.
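For example (the fragment is illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<b><!-- This is a comment --></b>", "html.parser")
comment = soup.b.string

print(comment)        # printed without the <!-- --> markers
print(type(comment))  # <class 'bs4.element.Comment'>
```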

5. contents, children and descendants

contents, children and descendants all give access to a node's children, but:

  • contents is a list
  • children is a generator

Note: contents and children contain only direct children; descendants is also a generator, but it yields all descendants of the node.
Example of child nodes:

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
print(type(soup.p.contents))

# Output
['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
<class 'list'>
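For comparison, .children yields the same direct children lazily; a sketch on a smaller fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

# .children is a lazy iterator over the direct children only
for child in soup.p.children:
    print(child.name)   # b, then i
```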

Examples of descendant nodes:

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

6. parent and parents

  • parent: the parent node
  • parents: all ancestors, recursively

Parent node example:
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.span.parent)

Example of recursive parent nodes:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))

7. next_sibling and previous_sibling

  • next_sibling: the next sibling node
  • previous_sibling: the previous sibling node

Example of sibling nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

8. next_element and previous_element

  • next_element: the next parsed node
  • previous_element: the previous parsed node

The difference between next_element and next_sibling:

  1. next_sibling continues parsing from the current tag's end tag
  2. next_element continues parsing from the current tag's start tag
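The difference is easy to see on a small fragment (the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b><i>italic</i></p>", "html.parser")
b = soup.b

# next_sibling continues after the current tag's *closing* tag
print(b.next_sibling)   # <i>italic</i>
# next_element continues after the current tag's *opening* tag,
# so it descends into the tag's own content first
print(b.next_element)   # bold
```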

9. find() and find_all()

9.1 Method

find_parent: find the parent node
find_parents: recursively find ancestors
find_next_siblings: find all following siblings that match
find_next_sibling: find the first following sibling that matches
find_all_next: find all following nodes that match
find_next: find the first following node that matches
find_all_previous: find all preceding nodes that match
find_previous: find the first preceding node that matches
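A few of these methods in action (the markup and the id value are illustrative):

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>first</p><p>second</p><p>third</p></div>'
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

print(first_p.find_parent("div")["id"])                   # box
print(first_p.find_next_sibling("p").text)                # second
print([p.text for p in first_p.find_next_siblings("p")])  # ['second', 'third']
```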

9.2 tag name

# find all p nodes
soup.find_all('p')
# find title nodes without recursing into children
soup.find_all("title", recursive=False)
# find p nodes and span nodes
soup.find_all(["p", "span"])
# find the first a node; equivalent to the find() below
soup.find_all("a", limit=1)
soup.find('a')

9.3 Properties

# find nodes whose id is id1
soup.find_all(id='id1')
# find nodes whose name attribute is tim; the name keyword is
# reserved for tag names, so the attribute must go through attrs
soup.find_all(attrs={"name": "tim"})
# find p nodes whose class is clazz
soup.find_all("p", "clazz")
soup.find_all("p", class_="clazz")
soup.find_all("p", class_="body strikeout")

9.4 Regular expressions

import re
# find nodes whose class starts with p
soup.find_all(class_=re.compile("^p"))

9.5 Functions

# find nodes that have a class attribute but no id attribute
def has_class_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_no_id)

9.6 Text

# find nodes by their text content
soup.find_all(string="Elsie")
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
soup.find_all(string=re.compile("sisters"))

(In Beautiful Soup versions before 4.4, this argument is called text rather than string.)

10. select() and select_one()

select() returns all elements that match, while select_one() returns only the first match.
The focus of select() is the CSS selector. CSS selectors include id selectors and class selectors: a bare tag name takes no prefix, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax is used here to filter elements.

10.1 Select by tag

Selecting by tag is very simple: write the tag names in hierarchy order, separated by spaces.

# select title nodes
soup.select("title")
# select all a nodes under body
soup.select("body a")
# select the title node under head under html
soup.select("html head title")

10.2 id and class selectors

The id and class selectors are also relatively simple. The class selector starts with . and the id selector starts with #.

# select nodes with class article
soup.select(".article")
# select the a node whose id is id1
soup.select("a#id1")
# select the node whose id is id1
soup.select("#id1")
# select the nodes whose id is id1 or id2
soup.select("#id1,#id2")

10.3 Attribute selectors

# select a nodes that have an href attribute
soup.select('a[href]')
# select a nodes whose href is exactly http://mycollege.vip/tim
soup.select('a[href="http://mycollege.vip/tim"]')
# select a nodes whose href starts with http://mycollege.vip/
soup.select('a[href^="http://mycollege.vip/"]')
# select a nodes whose href ends with png
soup.select('a[href$="png"]')
# select a nodes whose href contains china
soup.select('a[href*="china"]')
# select a nodes whose href contains china as a whitespace-separated word
soup.select("a[href~=china]")

10.4 Other selectors

# p nodes whose parent is a div node
soup.select("div > p")
# p nodes immediately preceded by a div sibling
soup.select("div + p")
# ul nodes that follow a p sibling (p and ul share a parent)
soup.select("p~ul")
# the third p node within its parent
soup.select("p:nth-of-type(3)")
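These combinators can be sanity-checked on a tiny fragment (the markup is illustrative):

```python
from bs4 import BeautifulSoup

html = "<div><p>a</p></div><p>b</p><ul><li>c</li></ul><p>d</p>"
soup = BeautifulSoup(html, "html.parser")

print([t.text for t in soup.select("div > p")])  # ['a']  direct child of div
print([t.text for t in soup.select("div + p")])  # ['b']  immediately after div
print([t.text for t in soup.select("p ~ ul")])   # ['c']  ul preceded by a p sibling
```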

11. Putting it into practice

This case study walks through the usage of find(), find_all(), select() and select_one().

from bs4 import BeautifulSoup

text = '''
<li class="subject-item">
    <div class="pic">
      <a class="nbg" href="https://mycollege.vip/subject/25862578/">
        <img class="" src="https://mycollege.vip/s27264181.jpg" width="90">
      </a>
    </div>
    <div class="info">
      <h2 class=""><a href="https://mycollege.vip/subject/25862578/" title="解忧杂货店">解忧杂货店</a></h2>
      <div class="pub">[日] 东野圭吾 / 李盈春 / 南海出版公司 / 2014-5 / 39.50元</div>
      <div class="star clearfix">
        <span class="allstar45"></span>
        <span class="rating_nums">8.5</span>
        <span class="pl">
            (537322人评价)
        </span>
      </div>
      <p>现代人内心流失的东西,这家杂货店能帮你找回——僻静的街道旁有一家杂货店,只要写下烦恼投进卷帘门的投信口,
      第二天就会在店后的牛奶箱里得到回答。因男友身患绝... </p>
    </div>
</li>
'''

soup = BeautifulSoup(text, 'lxml')

print(soup.select_one("a.nbg").get("href"))
print(soup.find("img").get("src"))
title = soup.select_one("h2 a")
print(title.get("href"))
print(title.get("title"))

print(soup.find("div", class_="pub").string)
print(soup.find("span", class_="rating_nums").string)
print(soup.find("span", class_="pl").string.strip())
print(soup.find("p").string)

12. CSS selectors

12.1 Common selectors

(Selector reference image: https://img-blog.csdnimg.cn/20191030084013872.png)


13. Usage summary

  • The lxml parser is recommended; fall back to html.parser when necessary
  • Navigating by tag name (e.g. soup.p) is fast but weak at filtering
  • find() and find_all() are recommended for matching a single result or multiple results
  • If you are familiar with CSS selectors, select() and select_one() are recommended
  • Memorize the common ways to get attribute and text values


Origin blog.csdn.net/qq_44723773/article/details/128762205