The Beautiful Soup parsing library

One: Introduction

Beautiful Soup is a Python library that extracts data from HTML or XML files. Working with your favorite parser, it provides idiomatic ways to navigate, search, and modify the document, and it can save you hours or even days of work. You may come across the Beautiful Soup 3 documentation. Beautiful Soup 3 is no longer developed, and the official website recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in Python's standard library, and it also supports a number of third-party parsers, one of which is lxml. Depending on your operating system, you can install lxml with one of the following commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative parser is html5lib, a pure-Python implementation that parses HTML the same way a browser does. You can install html5lib with one of the following commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

The following table lists the main parsers with their advantages and disadvantages. The official website recommends lxml because it is more efficient. In Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.

Parser: Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages:
  • Built into Python's standard library
  • Moderate speed
  • Good tolerance of malformed documents
Disadvantages:
  • Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages:
  • Fast
  • Good tolerance of malformed documents
Disadvantages:
  • Requires the C library to be installed

Parser: lxml XML parser
Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
Advantages:
  • Fast
  • The only parser that supports XML
Disadvantages:
  • Requires the C library to be installed

Parser: html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages:
  • The best fault tolerance
  • Parses the document the way a browser does
  • Produces documents in HTML5 format
  • Does not depend on external extensions (pure Python)
Disadvantages:
  • Slow
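Because the third-party parsers are optional installs, a script may want to fall back to whichever one is available. The following is a minimal sketch, not part of the original article: the helper name make_soup and the preference order are assumptions chosen for illustration. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns the soup together with the parser name used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# The same broken fragment is repaired differently by each parser,
# so printing the result shows which repair strategy was applied.
soup, parser_used = make_soup("<a></p>")
print(parser_used, "->", soup)
```

Since "html.parser" ships with Python, keeping it last in the list means the helper should always find something to use.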

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Two: Basic usage

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Basic usage: fault tolerance. Fault tolerance means that when the HTML is incomplete
# (above, <body> and <html> are never closed), the parser can still recognize and repair it.
# Parsing the document above gives a BeautifulSoup object, which can be printed with standard indentation.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # fault-tolerant parsing
res = soup.prettify()  # indent the output so the structure is visible
print(res)
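To make the fault-tolerance point concrete, here is a small sketch on a deliberately broken fragment. It uses the built-in "html.parser" rather than lxml so that it runs without extra installs; that choice, and the sample string, are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this fragment.
broken = "<p>unclosed <b>bold"

soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the tags that Beautiful Soup closed for us.
repaired = str(soup)
print(repaired)

# prettify() prints the same tree with one node per line, indented.
print(soup.prettify())
```

Other parsers may repair the same input differently, for example by wrapping it in <html> and <body>.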

Three: Traversing the document tree

# Traversing the document tree: select nodes directly by tag name. This is fast, but if several tags share the same name, only the first one is returned
# 1. Basic usage
# 2. Get the tag's name
# 3. Get the tag's attributes
# 4. Get the tag's content
# 5. Nested selection
# 6. Child nodes and descendant nodes
# 7. Parent node and ancestor nodes
# 8. Sibling nodes
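The outline above can be sketched as follows. This is an illustrative example rather than the author's original code: it reuses the sample document from the previous section (with the closing tags added) and uses the built-in "html.parser" so no extra install is needed.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# 1. Basic usage: attribute access returns the FIRST tag with that name
first_p = soup.p

# 2. Get the tag's name
p_name = first_p.name

# 3. Get the tag's attributes (class is multi-valued, so it is a list)
a_attrs = soup.a.attrs
a_href = soup.a["href"]

# 4. Get the tag's content
p_text = soup.p.string

# 5. Nested selection
title_text = soup.head.title.string

# 6. Child nodes (direct) and descendant nodes (all levels)
children = list(soup.p.children)
descendants = list(soup.p.descendants)

# 7. Parent node and ancestor nodes
parent_name = soup.a.parent.name
ancestor_names = [tag.name for tag in soup.a.parents]

# 8. Sibling nodes: next_sibling may be a text node, so
#    find_next_sibling() is used to skip ahead to the next <a> tag
next_node = soup.a.next_sibling
next_link_id = soup.a.find_next_sibling("a")["id"]

print(p_name, a_href, title_text, parent_name, ancestor_names, next_link_id)
```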

Four: Searching the document tree

1. The five filters

# Searching the document tree: BeautifulSoup defines many search methods. Here we focus on two, find() and find_all(); the other methods take similar parameters and are used in the same way
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

# 1. Five kinds of filters: string, regular expression, list, True, function
# 1.1 String: the tag name
print(soup.find_all('b'))

# 1.2 Regular expression
import re
print(soup.find_all(re.compile('^b')))  # find tags whose names start with b: the <body> and <b> tags

# 1.3 List: if a list is passed, Beautiful Soup returns everything that matches any item in the list. The following code returns all <a> and <b> tags in the document:
print(soup.find_all(['a', 'b']))

# 1.4 True: matches any value. The following code finds every tag but does not return string nodes
print(soup.find_all(True))
for tag in soup.find_all(True):
    print(tag.name)

# 1.5 Function: if no other filter fits, you can define a function that takes a single element as its argument. It returns True if the element matches and False otherwise
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
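The same filter types also work on attribute values, not only on tag names. The sketch below is an illustration under assumptions: the sample markup and the function name points_to_lacie are invented for the example. When a function is passed for a specific attribute, it receives the attribute value (or None when the tag lacks that attribute) instead of the whole tag.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a href="http://example.com/lacie" id="link2">Lacie</a> '
    '<a id="nolink">Nobody</a></p>'
)
soup = BeautifulSoup(html, "html.parser")


def points_to_lacie(href):
    # href is None for tags that have no href attribute
    return href is not None and "lacie" in href


by_function = soup.find_all(href=points_to_lacie)   # function filter
by_regex = soup.find_all(href=re.compile("lacie"))  # regular expression filter
by_list = soup.find_all(id=["link1", "link2"])      # list filter

print([tag["id"] for tag in by_function])
print([tag["id"] for tag in by_regex])
print([tag["id"] for tag in by_list])
```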

2. find_all(name, attrs, recursive, text, **kwargs)

# 2. find_all(name, attrs, recursive, text, **kwargs)
# 2.1 name: searches by tag name. The value can be any type of filter: a string, a regular expression, a list, a function, or True
print(soup.find_all(name=re.compile('^t')))

# 2.2 Keyword arguments: key=value form. The value can be a string, a regular expression, a list, or True
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'), id=re.compile(r'\d')))  # note: for class, use class_
print(soup.find_all(id=True))  # find tags that have an id attribute

# Some tag attributes cannot be used as keyword arguments, for example the HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(data-foo="value")  # SyntaxError: keyword can't be an expression
# But you can pass a dictionary through the attrs parameter of find_all() to search for such attributes:
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

# 2.3 Searching by class name. Note that the keyword is class_: class_=value, where value can be any of the five filters
print(soup.find_all('a', class_='sister'))  # find <a> tags whose class is sister
print(soup.find_all('a', class_='sister ssss'))  # find <a> tags whose class is "sister ssss"; the classes must appear in this order or nothing matches
print(soup.find_all(class_=re.compile('^sis')))  # find all tags whose class starts with sis

# 2.4 attrs
print(soup.find_all('p',attrs={'class':'story'}))

# 2.5 text: the value can be a string, a list, True, or a regular expression (newer versions of Beautiful Soup also accept this argument under the name string)
print(soup.find_all(text='Elsie'))
print(soup.find_all('a', text='Elsie'))

# 2.6 limit: if the document tree is large, the search can be slow. If we do not need all the results, limit caps the number returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns
print(soup.find_all('a', limit=2))

# 2.7 recursive: when you call find_all() on a tag, Beautiful Soup searches all descendants of that tag. If you only want to search its direct children, pass recursive=False
print(soup.html.find_all('a'))
print(soup.html.find_all('a', recursive=False))

'''
Calling a tag is like calling find_all()
find_all() is probably the most commonly used search method in Beautiful Soup, so there is a shorthand for it. BeautifulSoup objects and tag objects can be called like functions, and the result is the same as calling find_all() on that object. The following two lines of code are equivalent:
soup.find_all("a")
soup("a")
These two lines of code are also equivalent:
soup.title.find_all(text=True)
soup.title(text=True)
'''
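The note in 2.3 about class order is easy to trip over, so here is a short sketch of how class_ behaves on a multi-valued class attribute. The markup is an assumption chosen for the example, and "html.parser" is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<ul class="list list-small" id="list-2"><li class="element xxx">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that carries that class.
single = soup.find_all("ul", class_="list-small")

# A string with several classes is compared with the exact attribute value...
exact = soup.find_all("ul", class_="list list-small")

# ...so the same classes in a different order do not match.
swapped = soup.find_all("ul", class_="list-small list")

# A CSS selector matches both classes regardless of order.
css = soup.select("ul.list-small.list")

print(len(single), len(exact), len(swapped), len(css))
```

When the order of classes in the markup is not guaranteed, the CSS selector form is usually the safer choice.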

3. find(name, attrs, recursive, text, **kwargs)

# 3. find(name, attrs, recursive, text, **kwargs)
The find_all() method returns every tag in the document that matches, but sometimes we only want one result. For example, if the document has only one <body> tag, searching for it with find_all() is unnecessary. Instead of calling find_all() with limit=1, you can call find() directly. The following two lines of code are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, whereas find() returns the result itself.
When nothing matches, find_all() returns an empty list and find() returns None.
print(soup.find("nosuchtag"))
# None

soup.head.title is shorthand for navigating by tag name. The shorthand works by calling find() repeatedly on the current tag:

soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
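Because find() returns None when nothing matches, chaining another call onto a missing result raises AttributeError. A minimal sketch of guarding against that follows; the helper name text_of and its default parameter are hypothetical, introduced only for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)


def text_of(soup, name, default=""):
    """Return the text of the first <name> tag, or `default` if there is none."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else default


title = text_of(soup, "title")
missing = text_of(soup, "nosuchtag", default="(none)")

print(title)
print(missing)
```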

4. CSS selectors

# This module provides the select() method to support CSS selectors. See the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id37
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
    <b>The Dormouse's story</b>
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <div class='panel-1'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'><h1 class='yyyy'>Foo</h1></li>
            <li class='element xxx'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
    </div>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

# 1. CSS selectors
print(soup.p.select('.sister'))
print(soup.select('.sister span'))

print(soup.select('#link1'))
print(soup.select('#link1 span'))

print(soup.select('#list-2 .element.xxx'))

print(soup.select('#list-2')[0].select('.element'))  # select() calls can be chained, but this is rarely necessary; a single select() is usually enough

# 2. Get attributes
print(soup.select('#list-2 h1')[0].attrs)

# 3. Get content
print(soup.select('#list-2 h1')[0].get_text())
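A few more selector forms, sketched on a trimmed-down version of the sample document. The markup here is an assumption for illustration. select_one() returns the first match (or None) instead of a list, which avoids the [0] indexing used above.

```python
from bs4 import BeautifulSoup

html = """<div class="panel-1">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# First match only, or None when nothing matches
first_li = soup.select_one("#list-1 .element")

# Child combinator: <li> tags that are direct children of a <ul>
direct_items = soup.select("ul > li")

# Attribute selector: href values ending in "elsie"
elsie_links = soup.select('a[href$="elsie"]')

# Collect text and attributes from the selected tags
texts = [li.get_text() for li in soup.select("#list-1 li")]
href = soup.select_one("a.sister")["href"]

print(first_li.get_text(), len(direct_items), texts, href)
```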

 

5. For other methods, see the official website

Five: Modifying the document tree

Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id40
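The article only links to the official documentation for this topic, so the following is a brief sketch of a few common modifications rather than the author's own example. The sample markup is an assumption, and "html.parser" is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title"><b>The Dormouse\'s story</b><i>draft</i></p>',
    "html.parser",
)

tag = soup.b

# Rename the tag and edit its attributes like dictionary entries
tag.name = "strong"
tag["class"] = "boldest"
tag["id"] = "headline"
del tag["id"]

# Replace the tag's text content
tag.string = "A new story"

# Create a new tag and append it to an existing one
new_link = soup.new_tag("a", href="http://example.com/")
new_link.string = "link"
soup.p.append(new_link)

# Remove a tag and its contents from the tree
soup.i.decompose()

result = str(soup)
print(result)
```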

Six: Summary

# Summary:
# 1. lxml is the recommended parser
# 2. Three kinds of selectors were covered: tag selectors, find and find_all, and CSS selectors
#    1. Tag selectors have weak filtering, but they are fast
#    2. find and find_all are recommended for matching a single result or multiple results
#    3. If you are very familiar with CSS selectors, select is recommended
# 3. Remember the common ways of getting attributes and text: attrs and get_text()

 


Origin www.cnblogs.com/KrisYzy/p/11937948.html