The select function works like find and find_all: it is used to select specific tags. Its selection rules are based on CSS, so we call it a CSS selector. If you have worked with jquery before, you will find that the selection rules of select are somewhat similar to those of jquery.
Search by tag name
When filtering by tag name, write the tag name as-is, with no prefix or other decoration, as follows:
from bs4 import BeautifulSoup
import re
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.select('p'))
The returned results are as follows:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>]
The results show that select returns a list. Let's look at what type the elements of that list are:
print(type(soup.select('p')[0]))
The result is:
<class 'bs4.element.Tag'>
Clearly, each returned element is a bs4.element.Tag. Just like find_all, select('p') returns all tags named p.
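As a quick check of the claim above, the following sketch compares select('p') with find_all('p') on a shortened copy of the sample document. It assumes the standard-library "html.parser" instead of "lxml" so that no extra parser has to be installed; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body>
</html>
"""

# Assumption: "html.parser" is used here; the article itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

by_select = soup.select('p')
by_find_all = soup.find_all('p')

# Every element of the result is a Tag object.
all_tags = all(isinstance(tag, Tag) for tag in by_select)
# Both calls find the same tags in the same order.
same_tags = list(by_select) == list(by_find_all)

print(len(by_select), all_tags, same_tags)  # 2 True True
```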
Search by class name and id
When filtering, add a dot before the class name and # before the id name.
print(soup.select('.title'))
print(soup.select('#link2'))
The result returned is:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
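One point worth illustrating: select always returns a list, even when an id matches only one tag. The sketch below assumes the same sample markup, shortened, and the standard-library "html.parser"; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body>
"""

# Assumption: "html.parser" is used here; the article itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# A class selector can match several tags.
sisters = soup.select('.sister')
sister_ids = [a['id'] for a in sisters]

# An id selector matches one tag here, but the result is still a list.
link3 = soup.select('#link3')
link3_text = [a.get_text() for a in link3]

print(sister_ids)   # ['link2', 'link3']
print(link3_text)   # ['Tillie']
```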
Find by attribute
What if the thing you want to filter on is neither an id nor a class name? In that case, write the attribute and its value in square brackets:
print(soup.select('[href="http://example.com/lacie"]'))
This selects the tag whose href is http://example.com/lacie.
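The bracket syntax is not limited to href. The sketch below, which assumes the shortened sample markup and the standard-library "html.parser", filters on the name attribute, on an exact href value, and on the mere presence of an id attribute (the [id] form is standard CSS).

```python
from bs4 import BeautifulSoup

html = """
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body>
"""

# Assumption: "html.parser" is used here; the article itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# Exact match on an arbitrary attribute.
named = [tag.name for tag in soup.select('[name="dromouse"]')]

# Exact match on href.
tillie = [a.get_text() for a in soup.select('[href="http://example.com/tillie"]')]

# Presence test: any tag that has an id attribute at all.
with_id = [tag['id'] for tag in soup.select('[id]')]

print(named)    # ['p']
print(tillie)   # ['Tillie']
print(with_id)  # ['link2', 'link3']
```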
Combination search
Combination search comes in two forms: one applies two conditions to a single tag, and the other searches layer by layer down the tree.
The first case is as follows:
print(soup.select('a#link2'))
This selects the tag whose name is a and whose id is link2.
The output results are as follows:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
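Conditions on the same tag can be chained further, and a tag must satisfy all of them to match. A minimal sketch, assuming the shortened sample markup and the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

html = """
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body>
"""

# Assumption: "html.parser" is used here; the article itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# Tag name + class.
title_text = [p.get_text() for p in soup.select('p.title')]

# Tag name + class + id, all on the same tag.
tillie = [a.get_text() for a in soup.select('a.sister#link3')]

# link2 is an <a>, not a <p>, so nothing satisfies both conditions.
no_match = soup.select('p#link2')

print(title_text)      # ["The Dormouse's story"]
print(tillie)          # ['Tillie']
print(len(no_match))   # 0
```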
The second case is as follows:
Starting from body, search for all p tags inside body, then search inside all of those p tags for the a tag whose id is link2. This kind of tree-like, layer-by-layer search is very common when analyzing html structure. Layers are separated by spaces.
print(soup.select('body p a#link2'))
The result is as follows:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
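To see how each space-separated layer narrows the search, the sketch below runs a few layered selectors against a shortened copy of the sample document. It assumes the standard-library "html.parser"; a selector whose outer layer does not contain the inner tag simply returns an empty list.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body>
</html>
"""

# Assumption: "html.parser" is used here; the article itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# Every <a> inside a <p> inside <body>.
all_links = [a['id'] for a in soup.select('body p a')]

# Layers can carry their own conditions: <a id="link3"> inside <p class="story">.
tillie = [a.get_text() for a in soup.select('p.story a#link3')]

# <p class="title"> contains only a <b>, so there is no <a> to find.
in_title = soup.select('p.title a')

# <head> contains no <a> at all.
in_head = soup.select('head a')

print(all_links)                     # ['link2', 'link3']
print(tillie)                        # ['Tillie']
print(len(in_title), len(in_head))   # 0 0
```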