Using Beautiful Soup (5): The select Method

The select function works like find and find_all: it is used to select specific tags. Its selection rules are based on CSS, so we call them CSS selectors. If you have used jQuery before, you will find that the selection rules of select are somewhat similar to jQuery's.

Search by tag name

When filtering by tag name, write the tag name as-is, without any prefix, as follows:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

# Parse the document with the lxml parser
soup = BeautifulSoup(html, "lxml")
# Select every tag named p
print(soup.select('p'))

The returned results are as follows:

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>]

As the result shows, select returns a list. Next, let's check the type of the elements in that list:

print(type(soup.select('p')[0]))

The result is:

<class 'bs4.element.Tag'>

Clearly, each element is a bs4.element.Tag, the same as with find_all; select('p') returns all tags named p.
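The equivalence with find_all can be checked directly. The following is a minimal sketch that reuses the document above; it assumes the built-in "html.parser" instead of lxml so that no extra parser needs to be installed, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

# Assumption: the built-in parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")

selected = soup.select("p")    # CSS selector
found = soup.find_all("p")     # classic search by tag name

# Both calls yield the same Tag objects in the same order
same_tags = list(selected) == list(found)
print(same_tags)

# class is a multi-valued attribute, so each value is itself a list
classes = [p["class"] for p in selected]
print(classes)
```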

Search by class name and id

When filtering, prefix a class name with a dot (.) and an id with a hash sign (#).

print(soup.select('.title'))
print(soup.select('#link2'))

The result returned is:

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
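As a further illustration, here is a small sketch of working with the matches from class and id selectors. It assumes the same document as above, parsed with the built-in "html.parser"; select_one returns only the first match (or None), which can be handy for ids. The id link1 is deliberately one that does not occur in the document.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# All tags carrying class="sister"
sisters = soup.select(".sister")
names = [a.get_text() for a in sisters]
print(names)

# select_one returns the first matching Tag instead of a list
link3 = soup.select_one("#link3")
href = link3["href"]
print(href)

# An id that is absent from the document yields an empty result
missing = soup.select("#link1")
print(len(missing))
```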

Search by attribute

What if the attribute is neither an id nor a class name? It can still be used for filtering: write the attribute and its value inside square brackets.

print(soup.select('[href="http://example.com/lacie"]'))

This selects the tags whose href is http://example.com/lacie.
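Attribute selectors are not limited to exact matches. The sketch below shows a few standard CSS variants (attribute present, suffix match, substring match) on the same document; it assumes the built-in "html.parser", and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# [attr]: any tag that has the attribute at all
with_href = [a["id"] for a in soup.select("[href]")]
print(with_href)

# [attr$="..."]: attribute value ends with the given text
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]
print(ends_with)

# [attr*="..."]: attribute value contains the given text
contains = [a["id"] for a in soup.select('a[href*="example.com"]')]
print(contains)

# Any attribute works, not only href
by_name = soup.select('p[name="dromouse"]')
print(by_name[0]["class"])
```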

Combination search

Combined searches come in two kinds: one applies two conditions to a single tag, and the other searches layer by layer down the tree.

The first case is as follows:

print(soup.select('a#link2'))

This selects the tag named a whose id is link2.

The output results are as follows:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
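More than two conditions can be chained on one tag in the same way. The following sketch, again assuming the built-in "html.parser" and illustrative variable names, combines a tag name, a class and an id, and shows that a tag must satisfy every condition to match.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name + class + id, written without spaces, all apply to one tag
ids = [a["id"] for a in soup.select("a.sister#link3")]
print(ids)

# Tag name + class + attribute
title_p = soup.select('p.title[name="dromouse"]')
print(len(title_p))

# No a tag has class "title", so nothing matches
no_match = soup.select("a.title")
print(len(no_match))
```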

The second case is as follows:

Starting from body, search for all p tags inside it, and then within those p tags search for the a tag whose id is link2. This kind of layer-by-layer tree search is very common when analyzing HTML structure. Layers are separated by spaces.

print(soup.select('body p a#link2'))

The result is as follows:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
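A space matches descendants at any depth. As a contrast, the sketch below also uses the standard CSS child combinator (>), which matches direct children only, and calls select on a Tag to search just that subtree. It assumes the built-in "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Space: a tags anywhere below body
descendants = [a["id"] for a in soup.select("body a")]
print(descendants)

# ">": only direct children; the a tags sit inside p, not directly in body
direct = soup.select("body > a")
print(len(direct))

# The a tags are direct children of the story paragraph
children = [a["id"] for a in soup.select("p.story > a")]
print(children)

# select can also be called on a Tag to search only within it
story = soup.select_one("p.story")
nested = [a.get_text() for a in story.select("a#link2")]
print(nested)
```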