A Brief Introduction to BeautifulSoup

Overview

When processing data, we constantly face HTML and XML documents. BeautifulSoup is a Python library that can extract data from HTML or XML. It is powerful, easy to use, and one of the best simple, friendly data-processing tools.

Installation

Ever since pip came along, installation is no longer a problem. BeautifulSoup supports the HTML parser in the Python standard library and also supports other parsers. I recommend the more capable third-party parser lxml: I have used it to process a single XML file of several hundred megabytes, and it responded quickly with no noticeable lag. Of course, the parsers that come with the system are fine too; apart from speed and efficiency, they basically cause no problems.

$ pip install beautifulsoup4
$ pip install lxml

Getting started

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<html>data</html>", "html.parser")  # Python's built-in standard library; moderate speed, reasonably tolerant
>>> soup = BeautifulSoup("<html>data</html>", "html5lib")     # parses the document the way a browser does; best fault tolerance
>>> soup = BeautifulSoup("<html>data</html>", ["lxml-xml"])   # lxml's XML parser; fast
>>> soup = BeautifulSoup("<html>data</html>", "lxml")         # lxml's HTML parser; fast and fault-tolerant

If you do not specify a parser, BeautifulSoup will automatically find an available parser on the system.


Trying it out

All of the examples below use the following HTML document.

html_doc = """
<html>
<div id="My gift">
<p class="intro short-text" align="left">One</p>
<p class="intro short-text" align="center">Two</p>
<p class="intro short-text" align="right">Three</p>
</div>
<img class="photo" src="demo.jpg">
<div class="photo">
<a href="sdysit.com"><img src="logo.png"></a>
<p class="subject">山东远思信息科技有限公司</p>
</div>
</html>
"""
  • Text is also a node; we call it a text-type node, for example the One, Two, Three inside the p tags
  • A node usually has more children than we can see: besides the visible element children, text-type children such as line breaks, spaces, and tabs are nodes too (the sketch below makes this concrete)
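A minimal check of the second point, assuming the html_doc string above and the lxml parser (the soup object is built exactly as in the next section): the first div contains the newline text nodes between the p tags as well as the tags themselves.

>>> soup = BeautifulSoup(html_doc, 'lxml')
>>> [type(child).__name__ for child in soup.div.children]
['NavigableString', 'Tag', 'NavigableString', 'Tag', 'NavigableString', 'Tag', 'NavigableString']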


Node objects: name and attributes

Build a BeautifulSoup object soup with the lxml parser; a tag name can then be used to obtain the corresponding node object:

>>> soup = BeautifulSoup(html_doc, 'lxml')
>>> tag = soup.html
>>> tag.name
'html'
>>> tag.p.name
'p'

In practice we usually do not care which tag is a node's parent, and can get node objects directly from the soup:

>>> soup.p.name
'p'
>>> soup.img['src']
'demo.jpg'
>>> soup.img.attrs
{'class': ['photo'], 'src': 'demo.jpg'}
>>> soup.p['class']
['intro', 'short-text']
>>> soup.div['id']
'My gift'

Obviously, a node obtained this way is always the first tag of that type in the HTML. The example above also shows how to get all of a node object's attributes (.attrs) and how to get a specific attribute. Because the class attribute can hold multiple values, it is returned as a list, while the id attribute is treated as a single value.
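Indexing a missing attribute with tag['...'] raises a KeyError. As a small aside (this is standard Tag behaviour, not something specific to this article), tags also support a dict-style get() with an optional default:

>>> soup.p.get('id')              # the first <p> has no id: returns None instead of raising
>>> soup.p.get('id', 'missing')   # or a default of your choice
'missing'
>>> soup.p.get('align')
'left'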

Text content of a node

To obtain the text content of a node, there are several ways, for example:

>>> soup.p.text
'One'
>>> soup.p.getText()
'One'
>>> soup.p.get_text()
'One'
>>> soup.p.string
'One'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

When the only child is a text-type node, the first three methods behave exactly the same; the fourth looks the same, but the returned type is NavigableString (a traversable string).

When the node also contains element-type children, the result may no longer be what we need. In that case, use .strings or .stripped_strings (which strips blank lines and surrounding whitespace) to get an iterator and pick out what we want while traversing it.

>>> soup.div.text
'\nOne\nTwo\nThree\n'
>>> soup.html.text
'\n\nOne\nTwo\nThree\n\n\n\n\n山东远思信息科技有限公司\n\n'
>>> for item in soup.div.stripped_strings:
        print(item)

One
Two
Three
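It is also worth noting (standard BeautifulSoup behaviour) that .string, unlike .text, never concatenates anything: when a node has more than one child it simply returns None.

>>> soup.div.string is None       # several children, so .string cannot decide which one to return
True
>>> soup.p.string                 # exactly one text child
'One'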

Child nodes

.contents, .children, and .descendants can all get a node's children, but they differ in usage:

  • .contents and .children only get the direct children, while .descendants recursively gets all descendant nodes (see the sketch after this list)
  • .contents returns a list of child nodes, while .children and .descendants return iterators
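A brief sketch of the difference on the example document; the counts include the whitespace text nodes mentioned earlier:

>>> soup.div.contents[:3]               # a plain list of direct children, whitespace included
['\n', <p align="left" class="intro short-text">One</p>, '\n']
>>> len(soup.div.contents)              # 3 <p> tags plus 4 newline text nodes
7
>>> len(list(soup.div.descendants))     # recursive: also includes the text inside each <p>
10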

Parent nodes

The .parent property gets an element's parent node:

>>> soup.p.parent.name
'div'

The .parents property recursively gets all of an element's ancestor nodes:

>>> for parent in soup.p.parents:
        print(parent.name)

div
body
html
[document]

Sibling nodes

  • The .next_sibling and .previous_sibling properties can be used to look up the sibling immediately after or before a node. Be aware that besides the visible sibling tags, line breaks, spaces, tabs, and other text-type nodes may be mixed in among the siblings (see the sketch after this list)
  • The .next_siblings and .previous_siblings attributes can be used to iterate over all of the siblings that follow or precede the current node
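On the example document, for instance, the node right after the first p is a newline rather than the second p, so it takes two hops, or find_next_sibling(), which skips over strings, to reach it:

>>> soup.p.next_sibling
'\n'
>>> soup.p.next_sibling.next_sibling
<p align="center" class="intro short-text">Two</p>
>>> soup.p.find_next_sibling('p')       # skips the whitespace text nodes
<p align="center" class="intro short-text">Two</p>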

Searching for nodes

In general, use find() to get the first qualifying node and find_all() to get a list of all qualifying nodes.

>>> soup.find('p')
<p align="left" class="intro short-text">One</p>
>>> soup.find_all('img')
[<img class="photo" src="demo.jpg"/>, <img src="logo.png"/>]


Use a regular expression to match tag names

Search for tags whose names begin with d:

>>> import re
>>> for tag in soup.find_all(re.compile("^d")):
        print(tag.name)

div
div
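Besides a compiled regular expression, find_all() also accepts a plain list of tag names (standard BeautifulSoup behaviour, not covered in the original article):

>>> [tag.name for tag in soup.find_all(['a', 'img'])]
['img', 'a', 'img']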


Search by attribute

>>> soup.find_all(id='My gift')[0].name                 # find the node with id="My gift"
'div'
>>> soup.find_all(id=True)[0].name                      # find nodes that have an id attribute at all
'div'
>>> soup.find_all(attrs={"id": "My gift"})[0].name      # the same search, written with attrs
'div'
>>> soup.find_all(attrs={"class": "intro short-text", "align": "right"})[0].text   # several attributes at once
'Three'
>>> soup.find_all(attrs={"align": "right"})[0].text     # search with attrs
'Three'
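The attrs form is more than an alternative spelling: it is the way to search on attributes whose names are not valid Python keyword arguments, such as hyphenated data-* attributes. A small sketch (the data-lang attribute here is a made-up example, not part of the document above):

>>> demo = BeautifulSoup('<p data-lang="zh">hello</p>', 'lxml')
>>> demo.find_all(attrs={'data-lang': 'zh'})            # data-lang="zh" as a keyword would be a SyntaxError
[<p data-lang="zh">hello</p>]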

Search by CSS class

>>> soup.find_all("p", class_="intro")
[<p align="left" class="intro short-text">One</p>, <p align="center" class="intro short-text">Two</p>, <p align="right" class="intro short-text">Three</p>]
>>> soup.find_all("p", class_="intro short-text")
[<p align="left" class="intro short-text">One</p>, <p align="center" class="intro short-text">Two</p>, <p align="right" class="intro short-text">Three</p>]
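If you prefer writing genuine CSS selectors, BeautifulSoup also provides select() and select_one(); a brief sketch on the same document:

>>> soup.select('p.intro')
[<p align="left" class="intro short-text">One</p>, <p align="center" class="intro short-text">Two</p>, <p align="right" class="intro short-text">Three</p>]
>>> soup.select_one('div.photo img')['src']
'logo.png'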

Search by text

>>> soup.find_all(string="Two")
['Two']
>>> soup.find_all(string=re.compile("Th"))
['Three']

Filter with a function

>>> def justdoit(tag):
        # keep tags whose parent has an id attribute and that are centered
        return tag.parent.has_attr('id') and tag['align'] == 'center'

>>> soup.find_all(justdoit)
[<p align="center" class="intro short-text">Two</p>]
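The function above relies on every tag it checks having an align attribute, which happens to hold for this document. A slightly more defensive variant of the same idea (a sketch, with a name of my own choosing):

>>> def centered(tag):
        # guard with has_attr so tags without align never raise KeyError
        return tag.has_attr('align') and tag['align'] == 'center'

>>> soup.find_all(centered)
[<p align="center" class="intro short-text">Two</p>]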



Source: www.cnblogs.com/yangmaosen/p/Mr_Y13.html