Python 3 Web Crawler in Action - 29: Using the Parsing Library BeautifulSoup

Earlier we covered the usage of regular expressions, but as soon as there is a problem with how a regex is written, we may not get the result we want. Moreover, a web page has a definite structure and hierarchy, and many nodes are distinguished by id or class attributes, so couldn't we extract content by relying on that structure and those attributes instead?

In this section, then, we introduce a powerful parsing tool called BeautifulSoup, which parses pages by making use of the structure and attributes of the web page. With it we no longer need to write complex regular expressions; a few simple statements are enough to extract an element from the page.

Without further ado, let's get a feel for the power of BeautifulSoup.

1. Introduction to BeautifulSoup

Simply put, BeautifulSoup is an HTML/XML parsing library for Python with which we can conveniently extract data from web pages. The official description is as follows:

BeautifulSoup provides some simple, Python-style functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses documents and extracts the data the user needs; because it is simple, a complete application doesn't take much code to write. BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8. You don't need to think about encodings unless the document doesn't specify one, in which case you only have to state the original encoding. BeautifulSoup has become an excellent Python parser on a par with lxml and html5lib, giving users the flexibility of different parsing strategies or greater speed.

With it, we can save a lot of tedious extraction work and improve parsing efficiency.

2. Preparation

Before you begin, make sure you have BeautifulSoup and lxml properly installed. If not, refer to the installation instructions in Chapter 1.

3. Parsers

When parsing, BeautifulSoup actually depends on a parser. Besides the HTML parser in the Python standard library, it also supports a number of third-party parsers such as lxml. Let's make a simple comparison of the parsers BeautifulSoup supports and their respective advantages and disadvantages.

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python's standard library; moderate speed; good fault tolerance | Poor fault tolerance in versions before Python 2.7.3 and 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; good fault tolerance | Requires the C libraries to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the C libraries to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; generates valid HTML5 documents | Very slow; external Python dependency |

As the comparison shows, the lxml parser can parse both HTML and XML and is fast and fault-tolerant, so using it is recommended.
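As a small illustration of the fault-tolerance point, both the standard-library parser and lxml will repair a snippet with missing closing tags (this is only a sketch; the repaired trees can differ between parsers on messier input):

```python
from bs4 import BeautifulSoup

broken = '<p>unclosed <b>bold'

# Each parser closes the dangling tags in its own way, but for this
# simple snippet both expose the same element tree.
for parser in ('html.parser', 'lxml'):
    soup = BeautifulSoup(broken, parser)
    print(parser, '->', soup.b.string)
```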

To use the lxml parser, we just pass 'lxml' as the second argument when initializing BeautifulSoup, as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)

The BeautifulSoup usage examples that follow will all use this parser for demonstration.

4. Basic Usage

Let's first get a feel for the basic usage of BeautifulSoup with an example:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

First we declare a variable html holding an HTML string. Note that it is not a complete HTML document: the body and html tags are not closed. We pass it as the first argument to the BeautifulSoup constructor, and as the second argument we pass the parser type, here lxml. This completes the initialization of the BeautifulSoup object, which we assign to the variable soup.

We can then call the various methods and attributes of soup to parse this HTML string.

First we call the prettify() method, which outputs the parsed string in standard indented format. Notice that the output contains the closing body and html tags; in other words, BeautifulSoup automatically corrects a non-standard HTML string. This correction is not done by the prettify() method; it actually happens when the BeautifulSoup object is initialized.

Then we call soup.title.string, which outputs the text content of the HTML title node. In other words, soup.title selects the title node of the HTML, and calling its string attribute gets the text inside. So we can extract text just by accessing a few attributes. Isn't that convenient?

5. Node Selector

Above, we selected an element simply by referring to the node name, then obtained the text inside the node with the string attribute. This way of selecting is very fast; when the structure is simple and the hierarchy of nodes is clear, it is a good way to parse.

Select elements

Let's look at this selection method in detail with an example:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Here we again use the HTML code from before. We first print the result of selecting the title node: the output is the title node together with the text inside it. Next we print its type, which is bs4.element.Tag, an important data structure in BeautifulSoup. After a selection, every result is of this Tag type. A Tag has a number of attributes, such as string: calling the string attribute of a Tag gets the text content of the node, which is what the next line outputs.

Then we try selecting the head node; the result is that node plus everything inside it. Next we select the p node, and here things are a little special: the result is only the content of the first p node, and the p nodes that follow are not selected. In other words, when there are multiple nodes, this selection method only selects the first matching node and ignores the rest.

Extracting information

Above we demonstrated calling the string attribute to get the text. But what if we want the value of a node's attribute, or the node's name? Let's go over a unified way of extracting information.

Getting the name

The name attribute can be used to get the name of a node. Still with the example above, we select the title node and then call its name attribute to get the node name:

print(soup.title.name)

Output:

title

Getting attributes

Each node may have multiple attributes, such as id, class, and so on. After selecting a node element, we can call attrs to get all of its attributes:

print(soup.p.attrs)
print(soup.p.attrs['name'])

Output:

{'class': ['title'], 'name': 'dromouse'}
dromouse

As we can see, attrs returns a dictionary that combines all the attributes and attribute values of the selected node. Getting one of them, say the name attribute, is then just like getting a value from a dictionary by key: put the attribute name in square brackets. For example, attrs['name'] gives the value of the name attribute.

This is actually a little verbose. There is a simpler way: we can omit attrs and put the attribute name in square brackets directly after the node element, like this:

print(soup.p['name'])
print(soup.p['class'])

Output:

dromouse
['title']

Note that some results here are strings while others are lists of strings. The name attribute has a single value, so the result is a plain string; but class can hold multiple values for one element, so a list is returned. Watch the type carefully in real processing.
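Because of this, a type check is often useful in real code. A minimal sketch (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<p class="title story" name="dromouse">text</p>'
soup = BeautifulSoup(html, 'lxml')

name = soup.p['name']   # single-valued attribute -> plain string
cls = soup.p['class']   # class is multi-valued -> list of strings

# Normalise both cases to one string before further processing.
cls_text = ' '.join(cls) if isinstance(cls, list) else cls
print(name)
print(cls_text)
```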

Getting content

The string attribute can be used to get the text contained in a node element. For example, to get the text of the first p node in the HTML above:

print(soup.p.string)

Output:

The Dormouse's story

Note again that the p selected here is the first p node, so the text obtained is the text inside the first p node.
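One caveat worth adding: string only returns the text when the node has exactly one child. With mixed content it returns None, while the get_text() method concatenates all the descendant text instead. A small sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<p>Hello <b>world</b></p>'
soup = BeautifulSoup(html, 'lxml')

# The p node has two children (a text node and a b node), so string
# cannot decide which one to return and gives None instead.
print(soup.p.string)
print(soup.p.get_text())
```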

Nested selection

From the examples above we know that every returned result is of type bs4.element.Tag, and a Tag can itself be used to continue selecting. For example, after obtaining the head node element, we can keep calling it to select the title node inside head:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

The first line of the output is the result of selecting the title node after first selecting the head node. We then print its type: it is still bs4.element.Tag. In other words, selecting on the basis of a Tag returns another Tag each time, so we can nest selections this way.

Finally we output its string attribute, which is the text content of the node.

Associated selection

Sometimes we cannot select the node element we want in a single step; we may need to select one node element first and then, using it as a base, select its children, parent, siblings, and so on. This section introduces how to select these related node elements.

Children and descendants

After selecting a node element, if we want to get its direct children we can call the contents attribute. Let's see an example:

print(soup.p.contents)

Output:

[<b>The Dormouse's story</b>]

The result obtained from the contents attribute is a list of the direct child nodes.

Similarly, we can call the children attribute to get the corresponding result:

print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(child)

Output:

<list_iterator object at 0x10529eef0>
<b>The Dormouse's story</b>

Still with the same HTML, here we make the selection by calling the children attribute. As the output shows, the return value is an iterator, so we use a for loop to print the corresponding content. The content is the same as before, but children returns an iterator while contents returns a list.

If we want to get all the descendant nodes, we can call the descendants attribute:

print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(child)

Output:

<generator object Tag.descendants at 0x103fa5a20>
<b>The Dormouse's story</b>
The Dormouse's story

The return value is again a generator. As the traversal output shows, descendants recursively queries all child nodes and returns every descendant node.

Parent and ancestor nodes

If we want to get the parent node of a node element, we can call the parent attribute:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output:

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>

Here we select the parent node of the first a node. Its parent is obviously the p node, so the output is the p node and its contents.

Note that the output here is only the direct parent node; it does not search further outward for the ancestors of the parent. If we want all the ancestor nodes, we can call the parents attribute:

html = """
<html>
    <body>
        <p class="story">
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

Output:

<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]

The return value is of generator type. Here we convert it to a list to print its indices and contents; the list elements are the ancestor nodes of the a node.

Siblings

Above we explained how to get child and parent nodes. What if we want sibling nodes at the same level? Let's see an example:

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

Output:

Next Sibling 
            Hello

Prev Sibling 
            Once upon a time there were three little sisters; and their names were

Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

Here we call four different attributes: next_sibling and previous_sibling get the next and previous sibling of a node, while next_siblings and previous_siblings return generators over all following and all preceding siblings respectively.

Extracting information

Above we explained how to select associated node elements. If we want to get information from them, such as text and attributes, the method is exactly the same:

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Output:

Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']

If the result is a single node, we can directly call string, attrs, and so on to get its text and attributes. If the result is a generator over multiple nodes, we can convert it to a list, take out the element we need, and then call string, attrs, etc. on it to get the text and attributes of the corresponding node.

6. Method selectors

The selection method discussed so far works by attribute access on nodes. It is very fast, but for more complex selections it becomes unwieldy and is not flexible enough. For this, BeautifulSoup also provides query methods, such as find_all() and find(): we call the method, pass in the appropriate parameters, and can query flexibly.

The most commonly used query methods are find_all() and find(); below we describe their usage in detail.

find_all()

find_all, as the name implies, queries all elements matching the conditions. We can pass in attributes or text and get back all the qualifying elements; it is very powerful.

Its API is as follows:

find_all(name , attrs , recursive , text , **kwargs)

name

We can query elements by node name. Let's see an example:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

Here we call the find_all() method and pass the name parameter with the value ul, meaning we want to query all ul nodes. The result is of list type with length 2; each element is still of type bs4.element.Tag.

Because everything is of Tag type, we can still make nested queries. With the same HTML, here we first query all the ul nodes and then query the li nodes inside each ul:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

The result is of list type, and each element in the list is still of Tag type.

We can then traverse each li and get its text:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

attrs

Besides querying by node name, we can also pass in attributes to query by. Let's see an example:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

Here we query by passing in the attrs parameter, whose type is dictionary. For example, to query the node whose id is list-1, we can pass attrs={'id': 'list-1'} as the query condition. The result is a list containing all nodes whose id attribute is list-1; in the example above, one element matches, so the returned list has length 1.

For some common attributes, such as id and class, we can skip attrs. For example, to query the node whose id is list-1, we can pass the id parameter directly. Using the same HTML, let's query it another way:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

Here we pass id='list-1' directly to query the element whose id is list-1. As for class: because class is a Python keyword, it needs a trailing underscore, i.e. class_='element'; the returned result is still a list of Tags.

text

The text parameter can be used to match the text of nodes. It can be passed as a string or as a regular expression object. Let's see an example:

import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))

Output:

['Hello, this is a link', 'Hello, this is a link, too']

There are two a nodes here, each containing text. We call the find_all() method, passing the text parameter as a regular expression object. The result is a list of the text of every node whose text matches the regular expression.

find()

Besides the find_all() method there is also find(). The difference is that find() returns a single element, namely the first matching one, while find_all() returns a list of all matching elements.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))

Output:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

The returned result is no longer a list but the first matching node element; the type is still Tag.

Besides find_all() and find() there are several other query methods. Their usage is identical to what was described above; only the query scope differs. Here is a brief description.

  • find_parents() and find_parent(): find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
  • find_next_siblings() and find_next_sibling(): find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
  • find_previous_siblings() and find_previous_sibling(): find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
  • find_all_next() and find_next(): find_all_next() returns all qualifying nodes after the current node; find_next() returns the first qualifying node after it.
  • find_all_previous() and find_previous(): find_all_previous() returns all qualifying nodes before the current node; find_previous() returns the first qualifying node before it.
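A quick sketch of a few of these methods (the HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div class="wrap">
    <p class="first">one</p>
    <p class="second">two</p>
    <p class="third">three</p>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
second = soup.find(class_='second')

# Walk outward and sideways from the middle p node.
print(second.find_parent('div')['class'])
print(second.find_next_sibling('p').string)
print(second.find_previous_sibling('p').string)
```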

7. CSS selectors

BeautifulSoup also provides another kind of selector: CSS selectors. If you are familiar with web development, CSS selectors will be nothing new to you; if not, you can refer to: http://www.w3school.com.cn/cs...

To use CSS selectors, simply call the select() method and pass in the appropriate CSS selector. Let's see an example:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

Here we use three CSS selectors, and each returns a list of the nodes matching the selector. For example, select('ul li') selects all li nodes under all ul nodes, so the result is a list of all those li nodes.

Finally we print the type of one element of the list; as we can see, it is still the Tag type.

Nested selection

The select() method also supports nested selection. For example, we can first select all ul nodes and then traverse each ul, selecting its li nodes. An example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

As we can see, this traverses each ul node and outputs the list of all li nodes under it, as expected.

Getting attributes

We know the nodes are of Tag type, so attributes can still be obtained in the same way as before. Still with the same HTML, let's try to get the id attribute of each ul node:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Output:

list-1
list-1
list-2
list-2

As we can see, both putting the attribute name directly in square brackets and going through the attrs attribute succeed in getting the attribute value.

Getting the text

To get the text we can of course use the string attribute described earlier, but there is also a method, get_text(), which likewise retrieves the text value:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)

Output:

Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar

The effect of the two is exactly the same; both obtain the text value of the node.

8. Conclusion

That concludes this basic introduction to BeautifulSoup. Finally, a brief summary:

  • The lxml parsing library is recommended; use html.parser when necessary.
  • Node-attribute selection is fast but weak at filtering.
  • The find() and find_all() methods are recommended for querying a single match or multiple matches.
  • If you are familiar with CSS selectors, you can use the select() method.


Origin blog.51cto.com/14445003/2426468