Web page parsing with BeautifulSoup

Introduction and Installation

Beautiful Soup is an HTML/XML parser whose main job is to parse HTML/XML documents and extract data from them.
Parsing HTML with BeautifulSoup is relatively simple and the API is very user-friendly. It supports CSS selectors, the HTML parser in the Python standard library, and the XML parser from lxml.
Beautiful Soup 3 is no longer under development, so Beautiful Soup 4 is recommended for current projects. Just install it with pip: pip install beautifulsoup4
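
A minimal sketch of building a soup object (the sample markup and variable names here are made up for illustration; html.parser ships with Python, while lxml needs pip install lxml):

from bs4 import BeautifulSoup

html = "<html><body><p class='title'>Hello, soup</p></body></html>"

# Parse with the HTML parser from the Python standard library
soup = BeautifulSoup(html, 'html.parser')
# Or, if lxml is installed, use the faster lxml parser instead:
# soup = BeautifulSoup(html, 'lxml')

print(soup.p.get_text())
# Hello, soup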

Four types of objects

Beautiful Soup converts a complex HTML document into a tree of nodes, where each node is a Python object. All objects can be summarized into four types:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

1. Tag

In layman's terms, a Tag is simply a tag in HTML.
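
The Tag examples below assume a soup object built from a small sample document; the markup here is a made-up snippet, chosen only so that the outputs shown below can be reproduced:

from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title" name="dromouse"><b>The Dormouse's story</b></p></body></html>"""

soup = BeautifulSoup(html, 'html.parser')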

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))
# <class 'bs4.element.Tag'>

We can easily get these tags by accessing the soup object by tag name; the type of these objects is bs4.element.Tag.
A Tag has two important attributes: name and attrs.

print(soup.name)
# [document]  The soup object itself is special; its name is [document]
print(soup.head.name)
# head  For other tags, the output is simply the tag's own name
print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
print(soup.p['class'])
# ['title']

2. NavigableString

A NavigableString is simply a string that can be navigated as part of the parse tree.
For example:

print(soup.p.string)
# The Dormouse's story
print(type(soup.p.string))
#  <class 'bs4.element.NavigableString'>
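
Since NavigableString is a subclass of Python's str, ordinary string operations work on it, and str() converts it into a plain string; a small sketch using the same soup:

text = soup.p.string
print(str(text))
# The Dormouse's story
print(text.upper())
# THE DORMOUSE'S STORY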

Searching the document

Beautiful Soup defines many search methods; here we focus on two of them: find() and find_all(). The other methods take similar parameters and are used in much the same way.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
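
find() takes the same arguments as find_all() but returns only the first matching tag (or None when nothing matches) instead of a list; a quick comparison using the soup built above:

print(soup.find('a'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(len(soup.find_all('a')))
# 3
print(soup.find('table'))
# None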

Use find_all and similar methods to find the desired content in the document.
Before introducing the find_all method, let's look at the kinds of filters it accepts.

string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup will look for tags whose name exactly matches that string.
For example:

# Find all b tags
soup.find_all('b')
# [<b>The Dormouse's story</b>]

regular expression

The find_all method can also accept a regular expression object as a parameter; Beautiful Soup then matches tag names against it using the regular expression's search method.

import re

# Match tags whose names start with b
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
# body  b

# Match tags whose names contain t
for tag in soup.find_all(re.compile('t')):
    print(tag.name)
# html  title
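
A regular expression can also be passed as the value of an attribute keyword argument, in which case it is matched against that attribute's value; for example, finding links whose href contains elsie (a small sketch using the same soup):

import re

for tag in soup.find_all(href=re.compile('elsie')):
    print(tag['href'])
# http://example.com/elsie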

list

The find_all method can also accept a list as a parameter; Beautiful Soup will return all tags that match any element of the list.

# Find both a tags and b tags
for tag in soup.find_all(['a','b']):
    print(tag.name)
# b a a a

method

If none of the filters above fit, we can also define a filter function ourselves. The function takes a single tag as its only argument and should return True when the tag matches.

# Match tags that have a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print([tag.name for tag in soup.find_all(has_class_but_no_id)])
# ['p', 'p', 'p']
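
A function can also be passed as the value of a specific attribute; it then receives the attribute value (or None) rather than the whole tag. A small sketch filtering on href, with a made-up helper name:

def not_elsie(href):
    # Keep only tags whose href exists and does not contain 'elsie'
    return href and 'elsie' not in href

print([a['href'] for a in soup.find_all(href=not_elsie)])
# ['http://example.com/lacie', 'http://example.com/tillie']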

css selector

This is another search method that is similar to the find_all method.

  • When writing CSS selectors, tag names are used as-is, class names are prefixed with ., and id names are prefixed with #

  • Here we can filter elements in a similar way using soup.select(), which returns a list

1. Find by tag name

print(soup.select('title'))
#[<title>The Dormouse's story</title>]

print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
#[<b>The Dormouse's story</b>]

2. Find by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3. Find by id name

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

4. Combination search

A combined search follows the same rules as writing ordinary CSS: tag names, class names, and id names can be mixed in a single selector. For example, to search inside p tags for the element whose id is link1, separate the two parts with a space (a descendant selector):

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
To search only for direct child tags, separate the selectors with > instead:

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]

5. Property lookup

Attribute conditions can also be added when searching; the attribute is enclosed in square brackets. Note that the attribute and the tag name belong to the same node, so no space may be inserted between them, otherwise nothing will match.

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
Likewise, attribute selectors can be combined with the lookups above: parts that refer to different nodes are separated by spaces, while conditions on the same node are written without a space:

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

6. Get Content

All of the select() calls above return lists, which can be iterated over in a loop; calling get_text() on each element then gives its text content.

soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
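
The tags returned by select() are ordinary Tag objects, so their attributes can also be read with dictionary-style access; a small sketch printing each sister link's text and href:

for a in soup.select('a.sister'):
    print(a.get_text(), a['href'])
# Elsie http://example.com/elsie
# Lacie http://example.com/lacie
# Tillie http://example.com/tillie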

A quick summary of common select() usages:

# Search for tags nested anywhere under another tag:
soup.select("body a")
# Find the direct children of a tag:
soup.select("head > title")
# Search by CSS class name:
soup.select(".sister")
# Search by tag id:
soup.select("#link1")
# Search by the presence of an attribute:
soup.select('a[href]')
# Search by an attribute's value:
soup.select('a[href="http://example.com/elsie"]')
