Python3 crawler (6): using the Beautiful Soup parsing library

 Infi-chu:

http://www.cnblogs.com/Infi-chu/

 

Beautiful Soup

Beautiful Soup parses a page through the structure and attributes of the document itself, so complex regular expressions can be avoided.

Beautiful Soup is an HTML or XML parsing library for Python.

1. Parser

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup,"html.parser") | moderate speed, strong document fault tolerance | poor fault tolerance in Python versions before 2.7.3 and 3.2.2
lxml HTML parser | BeautifulSoup(markup,"lxml") | fast, strong document fault tolerance | requires the C library to be installed
lxml XML parser | BeautifulSoup(markup,"xml") | fast, the only parser that supports XML | requires the C library to be installed
html5lib | BeautifulSoup(markup,"html5lib") | best fault tolerance, parses documents the way a browser does, generates valid HTML5 | slow, does not rely on external extensions

In summary, the lxml HTML parser is recommended:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello World</p>','lxml')
print(soup.p.string)

2. Basic usage:

html = '''
<html>
<head><title>Infi-chu example</title></head>
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1">elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>,
<a href="http://example.com/tillie" class="sister" id="link3">tillie</a>,
last sentence</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify()) # completes the broken HTML and pretty-prints it
print(soup.title.string) # Output the string content of the title node

3. Node selector:

Select an element

Use soup.element (for example soup.title or soup.p) to get a node; only the first match is returned.

(1) Get the name

Use soup.element.name to get the node name

(2) Get attributes

Use soup.element.attrs to get all attributes as a dict

Use soup.element.attrs['name'] (or the shorthand soup.element['name']) to get a single attribute

(3) Get the content

Use soup.element.string to get the text content
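A minimal sketch combining the three accessors on the first paragraph of the example document above:

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="dr"><b>title example</b></p>'
soup = BeautifulSoup(html, 'lxml')

print(soup.p.name)           # node name: 'p'
print(soup.p.attrs)          # all attributes as a dict
print(soup.p.attrs['name'])  # a single attribute: 'dr'
print(soup.p.string)         # text content (descends through the lone <b> child)
```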

 

nested selection

Each selected node is itself a Tag, so selections can be nested, e.g. soup.head.title.string
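For example, using the head/title part of the document from section 2:

```python
from bs4 import BeautifulSoup

html = '<html><head><title>Infi-chu example</title></head><body></body></html>'
soup = BeautifulSoup(html, 'lxml')

# head is a Tag, so the selection can continue from it
print(soup.head.title.string)
```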

 

Association selection

(1) Child nodes and descendant nodes

html = '''
<html>
<head><title>Infi-chu example</title></head>
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1"><span>elsie</span></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><span>lacie</span></a>,
<a href="http://example.com/tillie" class="sister" id="link3"><span>tillie</span></a>,
last sentence</p>
'''
from bs4 import BeautifulSoup
# Get direct child nodes, children property
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i,child)

# Get all descendant nodes, descendants property
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i,child)

(2) Parent node and ancestor node

Get the direct parent node with the parent property

Get all ancestor nodes with the parents property
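A short sketch of both properties on a hypothetical nested fragment:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a href="http://example.com/elsie">elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'lxml')

print(soup.a.parent.name)  # direct parent: 'p'

# parents walks all the way up, ending at the BeautifulSoup document itself
for i, ancestor in enumerate(soup.a.parents):
    print(i, ancestor.name)
```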

(3) Sibling node

next_sibling: the next sibling node

previous_sibling: the previous sibling node

next_siblings: all following siblings

previous_siblings: all preceding siblings

(4) Extract information
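Sibling properties return nodes, so the same name/attrs/string accessors extract their information. A sketch with a hypothetical list of links (no whitespace between the tags, so the siblings are the <a> nodes themselves rather than text nodes):

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">elsie</a><a id="link2">lacie</a><a id="link3">tillie</a></p>'
soup = BeautifulSoup(html, 'lxml')

first = soup.a
print(first.next_sibling.string)  # the sibling right after link1

for sib in first.next_siblings:   # all siblings after link1
    print(sib.attrs['id'], sib.string)
```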

 

4. Method selector:

find_all()

find_all(name, attrs, recursive, text, **kwargs)

(1)name

soup.find_all(name='ul')
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='ul'))
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
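The loops above assume a page with nested ul/li lists that is not shown here; a self-contained version with a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical document, invented for illustration
html = '''
<ul id="list1" name="elements">
  <li>foo</li>
  <li>bar</li>
</ul>
<ul id="list2">
  <li>baz</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')

for ul in soup.find_all(name='ul'):
    for li in ul.find_all(name='li'):
        print(li.string)
```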

(2) attrs

# Query by attribute
print(soup.find_all(attrs={'id':'list1'}))
print(soup.find_all(attrs={'name':'elements'}))

# can also be written as keyword arguments
print(soup.find_all(id='list1'))
print(soup.find_all(class_='elements'))  # class is a Python keyword, so bs4 uses class_

(3)text

The text parameter matches against the text of nodes; it accepts either a string or a compiled regular expression object

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text=re.compile('link')))

 

find()

Returns only the first matching element (or None if there is no match); find_all() returns every match as a list
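A quick comparison on a small hypothetical list:

```python
from bs4 import BeautifulSoup

html = '<ul><li>first</li><li>second</li></ul>'
soup = BeautifulSoup(html, 'lxml')

print(soup.find(name='li').string)    # only the first match
print(len(soup.find_all(name='li')))  # find_all returns every match
```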

Note: related method pairs

find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

find_all_next() and find_next()

find_all_previous() and find_previous()

The plural form of each pair returns all matches; the singular form returns only the first
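Two of these methods in action, on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><p><a id="link1">elsie</a><a id="link2">lacie</a></p></div>'
soup = BeautifulSoup(html, 'lxml')

a = soup.a
print(a.find_parent('div')['id'])      # nearest <div> ancestor
print(a.find_next_sibling('a')['id'])  # first following <a> sibling
```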

 

5. CSS selectors:

Pass a CSS selector string to the select() method

nested selection

for ul in soup.select('ul'):
    print(ul.select('li'))

get attribute

for ul in soup.select('ul'):
    print(ul['id'])
    # Equivalent to
    print(ul.attrs['id'])

get text

In addition to the string attribute, the get_text() method can also retrieve text

for li in soup.select('li'):
    # same result here, because each <li> has a single text child
    print(li.get_text())
    print(li.string)
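The two are only interchangeable for nodes with a single text child. With mixed content, string returns None while get_text() concatenates all descendant text:

```python
from bs4 import BeautifulSoup

html = '<p>link <a>elsie</a></p>'
soup = BeautifulSoup(html, 'lxml')

print(soup.p.get_text())  # 'link elsie'
print(soup.p.string)      # None: <p> has more than one child
```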

 

 
