BeautifulSoup

A flexible and convenient web page parsing library with efficient processing and support for multiple parsers.

With it, you can easily extract information from web pages without writing regular expressions.

Quick start

The following example gives a first taste of bs4 and shows what it can do:

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))

Parsing this code with BeautifulSoup yields a BeautifulSoup object; prettify() then outputs it in a standard indented format (bs4 even completes the missing </body> and </html> tags), and the remaining print calls show direct access to tags, names, strings, and attributes.

At the same time, we can get all the links and text content separately through the following code:

for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())

Parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If none is installed, Python's default parser is used. The lxml parser is more powerful and faster, so installing it is recommended. In Python 2 versions before 2.7.3, and in Python 3 before 3.2.2, lxml or html5lib must be installed, because the HTML parser built into the standard library in those versions is not stable enough.

The following are common parsers:
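The parser table itself did not survive extraction here. As a sketch based on the official bs4 documentation, the usual choices are html.parser (standard library), lxml, lxml's XML mode, and html5lib, each selected by passing its name as the second argument to BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>Hello</b></p>"

# The standard-library parser needs no extra installation:
soup = BeautifulSoup(html, "html.parser")
print(soup.b.string)  # Hello

# Third-party parsers are chosen the same way (when installed):
#   BeautifulSoup(html, "lxml")      - lxml's fast HTML parser
#   BeautifulSoup(html, "xml")       - lxml's XML parser
#   BeautifulSoup(html, "html5lib")  - parses the way a browser does
```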

 

1. CSS selectors

(1) Selecting content

  • Selection is done by passing a CSS selector directly to select()
  • You can also select directly by tag name
  • In a selector, . denotes a class and # denotes an id
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

# parse with lxml
soup = BeautifulSoup(html, 'lxml')

# select by class: the outer class="panel" and, inside it, class="panel-heading".
# The space between the two selectors means "descendant of".
print(soup.select('.panel .panel-heading'))

# select directly by tag name
print(soup.select('ul li'))

# select by id, then by class inside it
print(soup.select('#list-2 .element'))

print(type(soup.select('ul')[0]))

 

(2) Getting content

The text content can be obtained by get_text()

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

When getting attributes, you can use either [attribute name] or attrs[attribute name]

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])


Basic use

Tag selector

In the quick-start example above, we can add the following code:
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Through soup.<tag name>, we can get the contents of that tag.
One thing to note: if the document contains multiple such tags, this returns only the first one. Above, we get the p tag through soup.p; there are several p tags in the document, but only the content of the first is returned.
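A minimal sketch of this first-match behavior (hypothetical HTML, standard-library parser):

```python
from bs4 import BeautifulSoup

html = '<p class="title">First</p><p class="story">Second</p>'
soup = BeautifulSoup(html, "html.parser")

# soup.p returns only the first matching tag...
print(soup.p.get_text())  # First

# ...while find_all returns every match:
print(len(soup.find_all("p")))  # 2
```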

get name

soup.title.name gets the name of the title tag, i.e. title

get attribute

print(soup.p.attrs['name'])
print(soup.p['name'])
Both methods get the value of the p tag's name attribute. (Note that the p tags in the sample HTML above have no name attribute, so these lines would fail there; the same syntax works for any attribute the tag actually has, e.g. soup.p['class'].)

get content

print(soup.p.string)
gets the content of the first p tag:
The Dormouse's story
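A small sketch (hypothetical HTML, standard-library parser) of how .string differs from get_text() once a tag has more than one child:

```python
from bs4 import BeautifulSoup

html = "<p>The Dormouse's story</p><p>One <b>bold</b> word</p>"
soup = BeautifulSoup(html, "html.parser")

# .string returns the text only when the tag has a single child node...
print(soup.p.string)      # The Dormouse's story

# ...while a tag with mixed children has no single .string:
second = soup.find_all("p")[1]
print(second.string)      # None
print(second.get_text())  # One bold word
```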

nested selection

Selections can also be nested in the following way:

print(soup.head.title.string)

child nodes and descendant nodes

The use of contents is demonstrated by the following example:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

The result: all child nodes of the p tag, including the text fragments between the a tags, are returned in a list.

use of children

The child nodes of the p tag can also be obtained via children. The results are the same as with contents, but soup.p.children is an iterator rather than a list, so its items can only be obtained by looping over it:

print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

Both contents and children return direct child nodes. If you want all descendant nodes, use descendants; soup.p.descendants is likewise an iterator, not a list.
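A short sketch (made-up HTML, standard-library parser) contrasting contents with descendants:

```python
from bs4 import BeautifulSoup

html = "<div><ul><li>Foo</li><li>Bar</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

# contents is a list of direct children only:
print([tag.name for tag in soup.div.contents])  # ['ul']

# descendants walks the whole subtree, tags and strings alike;
# NavigableStrings have name == None, so we filter them out here:
names = [node.name for node in soup.div.descendants if node.name]
print(names)  # ['ul', 'li', 'li']
```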

parent node and ancestor node

You can get the information of the parent node through soup.a.parent

Ancestor nodes can be obtained through soup.a.parents; wrapping it as list(enumerate(soup.a.parents)) shows each level in turn: first the direct parent of the a tag, then the parent's parent, and so on, with the last two entries being the html element and the entire document object.
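A minimal sketch of parent versus parents (hypothetical HTML, standard-library parser):

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# parent is the direct parent node:
print(soup.a.parent.name)  # p

# parents walks every ancestor up to the document object itself:
print([p.name for p in soup.a.parents])  # ['p', 'body', 'html', '[document]']
```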

sibling node

soup.a.next_siblings gets all following sibling nodes (an iterator)
soup.a.previous_siblings gets all preceding sibling nodes (an iterator)
soup.a.next_sibling gets the next sibling node
soup.a.previous_sibling gets the previous sibling node
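A short sketch (made-up HTML, standard-library parser); note that whitespace and punctuation between tags count as sibling text nodes:

```python
from bs4 import BeautifulSoup

html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
        '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a
# next_sibling is the very next node - here the text ", ":
print(repr(first.next_sibling))

# next_siblings iterates over everything after the tag;
# filtering on name == "a" keeps only the tag siblings:
ids = [tag.get("id") for tag in first.next_siblings if tag.name == "a"]
print(ids)  # ['link2', 'link3']
```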

Standard selector

find_all

find_all(name, attrs, recursive, text, **kwargs)
find_all searches the document by tag name, attribute, or text content

usage of name

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

The result is returned as a list

We can also call find_all again on each result to get all the li tags inside it:

for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

attrs

Examples are as follows:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

attrs takes a dictionary of attributes to match. class needs special handling because it is a reserved word in Python: either pass it as the keyword argument class_ (soup.find_all(class_='element')) or put the real attribute name in the dictionary (soup.find_all(attrs={'class': 'element'})). Common attributes such as id can be passed directly as keyword arguments without attrs, e.g. soup.find_all(id='list-1').
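A compact sketch of the three styles side by side (hypothetical HTML, standard-library parser):

```python
from bs4 import BeautifulSoup

html = ('<ul id="list-1"><li class="element">Foo</li>'
        '<li class="element">Bar</li></ul>')
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so find_all takes class_ instead:
print(len(soup.find_all(class_="element")))  # 2

# attrs accepts the real attribute name, including "class":
print(len(soup.find_all(attrs={"class": "element"})))  # 2

# common attributes such as id can be passed directly:
print(soup.find_all(id="list-1")[0].name)  # ul
```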

text

Examples are as follows:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

The result is a list of all text nodes matching text='Foo'
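As a sketch (made-up HTML, standard-library parser): text= returns the matching strings themselves rather than tags, and it also accepts a regular expression for partial matches:

```python
import re

from bs4 import BeautifulSoup

html = "<li>Foo</li><li>Bar</li><li>Foobar</li>"
soup = BeautifulSoup(html, "html.parser")

# an exact string matches whole text nodes only:
print(soup.find_all(text="Foo"))  # ['Foo']

# a regular expression matches any text node containing the pattern:
print(soup.find_all(text=re.compile("Foo")))  # ['Foo', 'Foobar']
```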

find

find(name, attrs, recursive, text, **kwargs)
find returns the first element of the matching results, or None if there is no match

Some other similar usages:
find_parents() returns all ancestor nodes; find_parent() returns the immediate parent node.
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after the current node; find_next() returns the first.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first.
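A small sketch of a few of these relatives in action (hypothetical HTML, standard-library parser):

```python
from bs4 import BeautifulSoup

html = ('<div><p id="first">One</p><p id="second">Two</p>'
        '<p id="third">Three</p></div>')
soup = BeautifulSoup(html, "html.parser")

second = soup.find(id="second")
print(second.find_parent().name)            # div
print(second.find_next_sibling()["id"])     # third
print(second.find_previous_sibling()["id"]) # first
```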

 

Summary

Use the lxml parsing library where possible, falling back to html.parser when necessary.
Tag selection (soup.tag) is fast but weak at filtering.
Use find() and find_all() to query for a single result or multiple results.
If you are familiar with CSS selectors, select() is recommended.
Remember the common methods for getting attribute and text values.
