One: Introduction
Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work. You may come across the Beautiful Soup 3 documentation: Beautiful Soup 3 is no longer developed, and the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.
Install Beautiful Soup:

    $ pip install beautifulsoup4

Install a parser. Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. One of them is lxml. Depending on your operating system, you can install lxml with one of the following commands:

    $ apt-get install python-lxml
    $ easy_install lxml
    $ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. You can install html5lib with one of the following commands:

    $ apt-get install python-html5lib
    $ easy_install html5lib
    $ pip install html5lib
The following table lists the main parsers with their advantages and disadvantages. The official documentation recommends lxml because it is faster. In Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.
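Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; decent speed; lenient (as of Python 2.7.3 and 3.2.2) | Slower than lxml; less lenient than html5lib |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast; the only XML parser currently supported | External C dependency |
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; produces valid HTML5 | Very slow; external Python dependency |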
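Since the parsers in the table are optional installs, a script may want to fall back to whichever one is available. The following is only a sketch under that assumption: `make_soup` is a hypothetical helper, not part of Beautiful Soup, and it relies on the library raising `FeatureNotFound` when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try each parser in order, return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Deliberately incomplete markup: every parser closes the open tags.
soup, used = make_soup("<p>Hello<b>world")
print(used, soup.b.get_text())
```

Because html.parser ships with Python, the last entry in the tuple should always succeed; the order of the tuple is just one possible preference.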
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Two: Basic usage
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    # Basic use. Beautiful Soup is fault tolerant: when the HTML is incomplete
    # (this document never closes <body> or <html>), the parser recognizes and
    # repairs the problem. Parsing the markup above yields a BeautifulSoup
    # object, which can be printed with standard indentation.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'lxml')  # fault-tolerant parsing
    res = soup.prettify()  # indent the output to show the structure
    print(res)
Three: Traversing the document tree
Traversing the document tree means selecting elements directly by tag name. Selection is fast, but when several tags share the same name only the first one is returned. Traversal covers:

1. Basic usage
2. Getting the tag name
3. Getting the tag attributes
4. Getting the tag content
5. Nested selection
6. Child and descendant nodes
7. Parent and ancestor nodes
8. Sibling nodes
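The eight items above can be sketched as follows. This is an illustration rather than the original post's code: it assumes the sample document from the basic-usage section (with the closing tags added) and uses the built-in html.parser so that no extra install is needed.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                        # 1. usage: only the first <p> is returned
tag_name = first_p.name                 # 2. tag name
tag_attrs = first_p.attrs               # 3. attributes as a dict
tag_text = first_p.get_text()           # 4. text content
nested_title = soup.head.title.string   # 5. nested selection
children = soup.p.contents              # 6. direct children (a list)
descendant_count = len(list(soup.body.descendants))  # 6. all descendants
parent_name = soup.a.parent.name        # 7. parent node
ancestor_names = [t.name for t in soup.a.parents]    # 7. all ancestors
next_node = soup.a.next_sibling         # 8. next sibling (here a text node)
previous_node = soup.a.previous_sibling  # 8. previous sibling

print(tag_name, tag_attrs, tag_text)
print(nested_title, children, descendant_count)
print(parent_name, ancestor_names)
print(repr(next_node), repr(previous_node))
```

Note that siblings are often text nodes (whitespace or punctuation between tags), not tags.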
Four: Searching the document tree
1. The five filters
Beautiful Soup defines many search methods. This section focuses on two of them, find() and find_all(); the other methods take similar parameters and are used in similar ways.

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
    </p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'lxml')

    # 1. Five kinds of filter: string, regular expression, list, True, function

    # 1.1 String: matches the tag name exactly
    print(soup.find_all('b'))

    # 1.2 Regular expression: matched against the tag name
    print(soup.find_all(re.compile('^b')))  # tags whose names start with "b": <body> and <b>

    # 1.3 List: matches any item in the list; this returns all <a> and <b> tags
    print(soup.find_all(['a', 'b']))

    # 1.4 True: matches every tag, but returns no string nodes
    print(soup.find_all(True))
    for tag in soup.find_all(True):
        print(tag.name)

    # 1.5 Function: if no other filter fits, pass a function that takes a tag
    # as its only argument and returns True for a match, False otherwise
    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

    print(soup.find_all(has_class_but_no_id))
2. find_all(name, attrs, recursive, text, **kwargs)
    # 2. find_all(name, attrs, recursive, text, **kwargs)

    # 2.1 name: the tag name to search for. The value can be any filter:
    # a string, a regular expression, a list, a function, or True.
    print(soup.find_all(name=re.compile('^t')))

    # 2.2 Keyword arguments: key=value, where the value can be a string,
    # a regular expression, a list, or True
    print(soup.find_all(id=re.compile('my')))
    print(soup.find_all(href=re.compile('lacie'), id=re.compile(r'\d')))  # for class, use class_
    print(soup.find_all(id=True))  # tags that have an id attribute

    # Some attribute names cannot be used as keyword arguments,
    # for example HTML5 data-* attributes:
    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
    # data_soup.find_all(data-foo="value")  # SyntaxError: keyword can't be an expression

    # Pass such attributes in a dictionary through the attrs parameter instead:
    print(data_soup.find_all(attrs={"data-foo": "value"}))
    # [<div data-foo="value">foo!</div>]
    # 2.3 Search by class name. The keyword is class_ (class is a reserved
    # word in Python); its value can be any of the five filters.
    print(soup.find_all('a', class_='sister'))  # <a> tags whose class is sister
    print(soup.find_all('a', class_='sister ssss'))  # <a> tags whose class attribute is exactly "sister ssss"; a different order does not match
    print(soup.find_all(class_=re.compile('^sis')))  # all tags with a class starting with "sis"

    # 2.4 attrs
    print(soup.find_all('p', attrs={'class': 'story'}))

    # 2.5 text: the value can be a string, a list, True, or a regular expression
    print(soup.find_all(text='Elsie'))
    print(soup.find_all('a', text='Elsie'))

    # 2.6 limit: searching a large document tree is slow. If you do not need
    # every result, use limit to cap the number returned. It works like the
    # LIMIT keyword in SQL: the search stops once the cap is reached.
    print(soup.find_all('a', limit=2))

    # 2.7 recursive: find_all() on a tag searches all of its descendants.
    # To search only the direct children, pass recursive=False.
    print(soup.html.find_all('a'))
    print(soup.html.find_all('a', recursive=False))

Calling a tag is like calling find_all(). Because find_all() is the most commonly used search method in Beautiful Soup, it has a shortcut: calling a BeautifulSoup object or a Tag object as if it were a function gives the same result as calling find_all() on that object. The following two lines are equivalent:

    soup.find_all("a")
    soup("a")

So are these two:

    soup.title.find_all(text=True)
    soup.title(text=True)
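A note on item 2.5: since Beautiful Soup 4.4.0 the text argument is also available under the name string, and recent releases treat text as deprecated. The sketch below shows the same kinds of query with string. It assumes a shortened version of the sample markup and the built-in html.parser.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch).
html = (
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>, '
    '<a class="sister" id="link2">Lacie</a> and '
    '<a class="sister" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the result is a list of matching strings.
exact_strings = soup.find_all(string="Elsie")

# With a tag name, the result is the tags whose .string matches.
tags_with_string = soup.find_all("a", string="Elsie")

# A regular expression or a list works as a filter too.
l_names = soup.find_all(string=re.compile("^L"))
several = soup.find_all(string=["Elsie", "Tillie"])

print(exact_strings, tags_with_string, l_names, several)
```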
3. find(name, attrs, recursive, text, **kwargs)
find_all() returns every tag in the document that matches, but sometimes only one result is wanted. For example, a document has only one <body> tag, so searching for it with find_all() is unnecessary. Instead of calling find_all() with limit=1, call find() directly. The following two lines are nearly equivalent:

    soup.find_all('title', limit=1)
    # [<title>The Dormouse's story</title>]

    soup.find('title')
    # <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing one element, while find() returns the element itself. When nothing matches, find_all() returns an empty list and find() returns None.

    print(soup.find("nosuchtag"))
    # None

soup.head.title uses the tag-name shortcut. That shortcut works by calling find() repeatedly on the current tag:

    soup.head.title
    # <title>The Dormouse's story</title>

    soup.find("head").find("title")
    # <title>The Dormouse's story</title>
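Because find() returns None when nothing matches, chaining another call onto its result can raise AttributeError. One way to guard against that is sketched below. `text_of` is a hypothetical helper written for this example, and the markup is a shortened stand-in for the sample document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser",
)


def text_of(root, name, default=""):
    """Hypothetical helper: text of the first <name> tag, or default if absent."""
    tag = root.find(name)
    return tag.get_text() if tag is not None else default


found = text_of(soup, "title")
missing = text_of(soup, "nosuchtag")
empty_list = soup.find_all("nosuchtag")   # empty list, safe to iterate over
one_item = soup.find_all("title", limit=1)

print(repr(found), repr(missing), empty_list, one_item)
```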
4. CSS selectors
    # This module provides the select() method for CSS selectors. See the official documentation:
    # https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id37
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title">
        <b>The Dormouse's story</b>
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        <div class='panel-1'>
            <ul class='list' id='list-1'>
                <li class='element'>Foo</li>
                <li class='element'>Bar</li>
                <li class='element'>Jay</li>
            </ul>
            <ul class='list list-small' id='list-2'>
                <li class='element'><h1 class='yyyy'>Foo</h1></li>
                <li class='element xxx'>Bar</li>
                <li class='element'>Jay</li>
            </ul>
        </div>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'lxml')

    # 1. CSS selectors
    print(soup.p.select('.sister'))
    print(soup.select('.sister span'))
    print(soup.select('#link1'))
    print(soup.select('#link1 span'))
    print(soup.select('#list-2 .element.xxx'))
    print(soup.select('#list-2')[0].select('.element'))  # select() calls can be chained, but that is rarely necessary: one selector usually does the job

    # 2. Get attributes
    print(soup.select('#list-2 h1')[0].attrs)

    # 3. Get text content
    print(soup.select('#list-2 h1')[0].get_text())
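When only the first match is needed, indexing the result of select() with [0] can be replaced by select_one(), which returns the first matching tag or None. The sketch below assumes a reduced copy of the list-2 markup and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Reduced copy of the #list-2 fragment (an assumption for this sketch).
html = (
    '<ul class="list list-small" id="list-2">'
    '<li class="element"><h1 class="yyyy">Foo</h1></li>'
    '<li class="element xxx">Bar</li>'
    '<li class="element">Jay</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

heading = soup.select_one("#list-2 h1")    # first match, not a list
heading_attrs = heading.attrs
heading_text = heading.get_text()

# The child combinator (>) limits the match to direct children.
items = [li.get_text() for li in soup.select("#list-2 > li.element")]

missing = soup.select_one("#no-such-id")   # None when nothing matches

print(heading_attrs, heading_text, items, missing)
```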
5. Other methods: see the official documentation
Five: Modifying the document tree
See the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id40
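As a taste of what the linked documentation covers, here is a small sketch of common modifications: renaming a tag, setting and deleting attributes, replacing text, and appending a new tag. The markup and the new values are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser"
)

tag = soup.b
tag.name = "strong"            # rename the tag
tag["id"] = "headline"         # add or change an attribute
tag.string = "A new story"     # replace the tag's text

# Create a new tag and append it to the paragraph.
new_link = soup.new_tag("a", href="http://example.com/elsie")
new_link.string = "Elsie"
soup.p.append(new_link)

del soup.p["class"]            # remove an attribute

print(soup)
```

Changes are made in place on the parsed tree, so printing the soup afterwards shows the modified markup.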
Six: Summary
1. The lxml parser is recommended.
2. There are three ways to select elements: tag-name traversal, find() and find_all(), and CSS selectors.
    1. Tag-name traversal is fast, but its filtering ability is weak.
    2. find() and find_all() are recommended for matching a single result or multiple results.
    3. If you are comfortable with CSS selectors, select() is recommended.
3. Remember the common accessors: attrs for a tag's attributes and get_text() for its text.