Web Scraping Tutorial (3): Using Beautiful Soup

Table of contents

1. Introduction to Beautiful Soup

2. Parsers

3. Installing Beautiful Soup

4. Basic usage

5. Node selectors

6. Extracting information

7. Associated selection

8. Method selectors

9. CSS selectors


1. Introduction to Beautiful Soup

When extracting web-page information with regular expressions, any error in the pattern means the desired results cannot be extracted. Since web pages have a well-defined hierarchical structure, a more powerful analysis tool, Beautiful Soup, can use that structure and the nodes' attributes to parse the page. Compared with regular expressions, it extracts web-page content with far simpler statements.

To put it simply, Beautiful Soup is an HTML and XML parsing library for Python that makes it easy to extract data from web pages.

2. Parsers

Beautiful Soup supports several parsers: Python's built-in html.parser (no extra dependency), lxml's HTML and XML parsers (fast and fault-tolerant, but they require the third-party lxml library), and html5lib (parses exactly as a browser would, but slowly). Of these, the lxml parser can handle both HTML and XML, is fast, and is fault-tolerant, so it is the recommended choice. To use it, simply pass 'lxml' as the second argument when initializing Beautiful Soup:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>hello</p>','lxml')
print(soup.p.string)

Output:

hello

3. Installing Beautiful Soup

Before use, make sure the beautifulsoup4 and lxml libraries are installed. Both can be installed with pip from the command line:

pip install beautifulsoup4

pip install lxml

4. Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # pretty-prints and fixes up the incomplete markup
print(soup.title.string)  # the text content of the title node

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

First, the variable html is declared as a string. Note that it is not a complete HTML document: the closing body and html tags are missing. It is then passed to BeautifulSoup as the first argument, with the parser type ('lxml') as the second. This completes the initialization of the BeautifulSoup object, which is assigned to the variable soup. After that, soup's methods and attributes can be called to parse this HTML string.

① Call the prettify method. It outputs the parse tree in standard indented form and automatically corrects the non-standard HTML string (the missing closing tags are filled in).

② Call soup.title.string. It outputs the text content of the title node in the HTML.

5. Node selectors

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.title)         # the selection result for the title node
print(type(soup.title))   # its type
print(soup.title.string)  # the text inside the title node
print(soup.head)          # the head node
print(soup.p)             # the first p node

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

[Note] bs4.element.Tag is an important data structure in Beautiful Soup; the results returned by the node selector are of this Tag type.
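
One detail worth making explicit: the attribute-style node selector only ever returns the first match. A minimal sketch with invented markup (the tag names and ids are purely illustrative; html.parser is used so the snippet runs without lxml installed, but 'lxml' behaves the same here):

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, just for illustration.
html = '<p id="first">one</p><p id="second">two</p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p)                   # only the FIRST p node is returned
print(soup.p['id'])             # first
print(len(soup.find_all('p')))  # 2 -- find_all (covered later) returns every match
```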

6. Extracting information

# The examples below all use this html string:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")

  • get name

Use the name attribute to get a node's name: select the node first, then call its name attribute:

print(soup.title.name)

Output:

title

  • get attribute

A node may have multiple attributes, such as class and id. After selecting a node, call attrs to get all of its attributes:

print(soup.p.attrs)

Output:

{'class': ['title'], 'name': 'dromouse'}

Calling attrs returns a dictionary of attribute names and values. To get a single attribute value, index into the dictionary:

print(soup.p.attrs['name']) 

Output:

dromouse

There is also a more convenient shorthand for getting an attribute value:

print(soup.p['class'])
print(soup.p['name'])

Output:

['title'] 
dromouse

Note that the class attribute returns a list while the name attribute returns a string. Because an attribute like name carries a single value, a string is returned; a node element may carry several classes, so class comes back as a list. This distinction matters in real processing.
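
Since the two return types are easy to mix up, here is a small sketch with a hypothetical tag carrying two classes; joining the list rebuilds the original attribute text:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes and an id, to contrast the return types.
soup = BeautifulSoup('<p class="title big" id="intro">hi</p>', 'html.parser')

print(soup.p['class'])            # ['title', 'big'] -- class is multi-valued: a list
print(soup.p['id'])               # intro            -- id is single-valued: a string
print(' '.join(soup.p['class']))  # title big        -- rebuilds the attribute text
```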

  • get content

As already used earlier, the string attribute retrieves the text content contained in a node element:

print(soup.p.string)

Output:

The Dormouse's story

  • nested selection

The return type is bs4.element.Tag, and a Tag object can itself be used to make the next, nested selection:

html = """<html><head><title>The Dormouse's story</title></head>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

7. Associated selection

  • Child and descendant nodes

①Call the contents attribute

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

Output:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

As the results show, the return value is a list: the p node contains both text and element nodes, and all of them are returned together in the list.

Note, however, that each element in the list is a direct child of the p node. The span inside the first a node is a descendant, but it is not listed as a separate entry: the contents attribute yields only the direct children.

② Call the children attribute

The children attribute returns the same direct children, but as an iterator:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

Output:

<list_iterator object at 0x0000024B05CE8AC0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.

③ Call the descendants attribute

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

Output:

<generator object Tag.descendants at 0x0000024B064982E0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

As with children, the return value is iterable rather than a list (here, a generator). Traversing it with a for loop shows that the output now includes the span node and its text, because descendants recursively queries all children and returns every descendant node.

  • parent and ancestor nodes

Call the parent attribute to get a node's direct parent, and the parents attribute to get all of its ancestors:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parents)))

Output:

[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>)]

  • sibling nodes

Call the next_sibling, previous_sibling, next_siblings, and previous_siblings attributes:

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('next_sibling:',soup.a.next_sibling)
print('previous_sibling:',soup.a.previous_sibling)
print('next_siblings:',list(enumerate(soup.a.next_siblings)))
print('previous_siblings:',list(enumerate(soup.a.previous_siblings)))

Output:

next_sibling: 
            Hello
            
previous_sibling: 
            Once upon a time there were three little sisters; and their names were
            
next_siblings: [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
previous_siblings: [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

Four attributes are called here. next_sibling and previous_sibling return the node's next and previous sibling respectively, while next_siblings and previous_siblings return all following and all preceding siblings.
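
One practical note: because next_sibling often lands on the whitespace text between tags rather than the next element, the find_next_sibling and find_previous_sibling methods, which skip text nodes, are frequently what you actually want. A sketch with simplified markup:

```python
from bs4 import BeautifulSoup

# Simplified markup: two a tags separated by whitespace text.
html = '''
<p>
    <a id="link1">Elsie</a>
    <a id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

print(repr(soup.a.next_sibling))      # the whitespace text node between the two a tags
print(soup.a.find_next_sibling('a'))  # skips text nodes and returns the next a tag
```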

  • Putting it together

Combining the associated selections above, we can extract exactly the information we want:

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('-----------------------------')
print('parent:')
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Output:

Next Sibling:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
-----------------------------
parent:
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']

8. Method selectors

  • find_all

As the name implies, find_all queries for all elements that meet the given conditions. Its signature is:

find_all(name, attrs, recursive, text, **kwargs)

  • name

Elements can be queried by node name via the name parameter:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

The returned result is a list of length 2, and each element in it is of type bs4.element.Tag. Since Tag objects support further queries, we can traverse each ul, find its li nodes, and print their text:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

  • attrs

In addition to querying by node name, we can also pass in some attributes for querying:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={"id":"list-1"}))
print(soup.find_all(attrs={"name":"elements"}))

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

The attrs parameter passed to the query is a dictionary of attribute names and values.

For commonly used attributes such as id and class, there is a shorter form that does not require attrs:

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so the parameter is class_ with a trailing underscore

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

  • text

The text parameter matches against the text content of nodes; it accepts either a string or a compiled regular expression object:

import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
print(soup.find_all(text=re.compile('Hello')))

The return result is a list of all node text strings that match the pattern.
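
In newer Beautiful Soup releases (4.4 and later) the same argument is also available under the name string. A sketch of both forms, and of the difference between a regex and a plain value (which must match the node text exactly):

```python
import re
from bs4 import BeautifulSoup

html = '<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# string= is the newer name for the text= argument; both match node text.
print(soup.find_all(string=re.compile('link')))       # both strings match the regex
print(soup.find_all(string='Hello, this is a link'))  # exact match: only the first string
```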

  • find

In addition to find_all there is the find method, which also queries for matching elements but returns only the first match as a single element, whereas find_all returns a list of all matches. Otherwise find is used exactly like find_all, so its variants are not demonstrated one by one here.
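
A short sketch of find using markup similar to the find_all examples above; note that find returns None when nothing matches, so the result should be checked before chaining:

```python
from bs4 import BeautifulSoup

# Markup similar to the ul/li examples used with find_all above.
html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find(name='li'))                # only the first matching li
print(soup.find(class_='element').string)  # Foo
print(soup.find(name='span'))              # None -- no match, so guard before chaining
```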

9. CSS selectors

CSS selectors are used through the select method: just pass in the corresponding CSS selector string.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # the panel-heading div inside panel
print(soup.select('ul li'))                  # all li nodes under ul nodes
print(soup.select('#list-2 .element'))       # nodes with class element under id list-2
print(type(soup.select("ul")[0]))            # type of the elements in the returned list

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

  • nested selection

The select method supports nested selection, examples are as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select("ul"):
    print(ul.select("li"))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

For each ul node, this outputs the list of all li nodes under it.

  • get attribute

Since each node is of Tag type, the methods used earlier can still be used to get attributes. Here we get the id attribute of each ul node:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select("ul"):
    print(ul["id"])
    print(ul.attrs['id'])

Output:

list-1
list-1
list-2
list-2

As the output shows, an attribute can be retrieved either by putting the attribute name in square brackets or by going through the attrs dictionary.

  • get text

To get the text, the string attribute used above still works; another way is the get_text method. In this example the two produce identical results:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select("li"):
    print("get_text:",li.get_text())
    print("string:",li.string)

Output:

get_text: Foo
string: Foo
get_text: Bar
string: Bar
get_text: Jay
string: Jay
get_text: Foo
string: Foo
get_text: Bar
string: Bar
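
The two are not interchangeable in general, though: when a node has more than one child, string returns None while get_text still concatenates all descendant text. A small sketch of the difference:

```python
from bs4 import BeautifulSoup

# A p node with mixed content: text, a b tag, then more text.
soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')

print(soup.p.get_text())  # Hello world! -- all descendant text concatenated
print(soup.p.string)      # None -- the p node has several children, so string gives up
print(soup.b.string)      # world -- exactly one text child, so string works
```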


Origin blog.csdn.net/weixin_52024937/article/details/126250533