Beautiful Soup is a flexible and convenient web-page parsing library that processes documents efficiently and supports multiple parsers. With it, you can easily extract information from web pages without writing regular expressions.
Quick start
The following example gives a first taste of bs4 and a sense of its power:
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
Parsing this code with BeautifulSoup produces a BeautifulSoup object, which prettify() can output in a standard indented format.
We can also get all the links and the full text content with the following code:
for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())
parser
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Python's default parser is used. The lxml parser is more powerful and faster, so installing it is recommended. For versions of Python before 2.7.3, or Python 3 before 3.2.2, installing lxml or html5lib is a must, because the standard-library HTML parser in those versions is not stable enough.
The following are the common parsers: the standard library's html.parser, the lxml HTML parser ('lxml'), the lxml XML parser ('lxml-xml' / 'xml'), and html5lib.
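The parser is chosen by the second argument to the BeautifulSoup constructor. A minimal sketch, using html.parser since it requires no extra installation:

```python
from bs4 import BeautifulSoup

html = "<p>Hello</p>"

# The second argument selects the parser. "html.parser" ships with
# Python; "lxml" and "html5lib" must be installed separately.
soup_std = BeautifulSoup(html, "html.parser")
print(soup_std.p.string)
```

Swapping "html.parser" for "lxml" keeps the rest of the code unchanged, which makes it easy to switch parsers later.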
1. CSS selectors
(1) Select the content
- Selection is done by passing a CSS selector string to select()
- You can select directly via tag names
- In a selector, . denotes a class and # denotes an id
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

# parse with lxml
soup = BeautifulSoup(html, 'lxml')
# Class selector: the outer class="panel" and the inner class="panel-heading";
# note the space between the two, which is the descendant combinator
print(soup.select('.panel .panel-heading'))
# Select directly by tag name
print(soup.select('ul li'))
# id selector
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
(2) Access to content
The text content can be obtained by get_text()
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
(3) Get attributes
An attribute value can be obtained with [attribute name] or with attrs[attribute name]
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
basic use
Tag selector
To the quick-start example above, add the following code:
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Through soup.<tag name> we can get the content of that tag.
One caveat: if the document contains multiple tags with the same name, only the first match is returned. Above, we get a p tag via soup.p; there are several p tags in the document, but only the content of the first one comes back.
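A minimal sketch of this first-match behavior (html.parser is used here so no extra install is needed; the two-paragraph HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

# When a document has several <p> tags, soup.p returns only the first.
html = '<p class="title">first</p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)            # only the first <p>
print(len(soup.find_all("p")))  # find_all returns every match
```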
get name
Through soup.title.name we can get the name of the title tag, namely title.
get attribute
print(soup.p.attrs['name'])
print(soup.p['name'])
Either of the above lines obtains the value of the p tag's name attribute (assuming the tag actually has one).
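A runnable sketch, using a hypothetical <p> tag that carries a name attribute (the quick-start HTML above does not have one):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a <p> tag with a name attribute for illustration.
html = '<p name="dromouse" class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.attrs["name"])  # dict-style access via attrs
print(soup.p["name"])        # shorthand indexing
# Note: multi-valued attributes such as class come back as a list.
print(soup.p["class"])
```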
get content
print(soup.p.string)
returns the text content of the first p tag:
The Dormouse's story
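A self-contained sketch of .string, using just the first paragraph of the quick-start HTML:

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(html, "html.parser")

# .string descends through a single-child chain (<p> -> <b> -> text),
# so the p tag's .string is the b tag's text.
print(soup.p.string)
```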
nested selection
Selections can also be nested directly:
print(soup.head.title.string)
child nodes and descendant nodes
The use of contents is demonstrated by the following example:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
The result is a list holding all the direct children of the p tag, tags and text nodes alike.
use of children
You can also get all child nodes of the p tag in the following way. The results are the same as those from contents, but soup.p.children is an iterator rather than a list, so its items can only be retrieved by looping over it.
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Both contents and children get child nodes. To get descendant nodes, use descendants:
print(soup.descendants)
The result obtained this way is likewise an iterator.
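A small sketch contrasting descendants with children, on a made-up nested snippet:

```python
from bs4 import BeautifulSoup

html = '<p><a href="#"><span>Elsie</span></a></p>'
soup = BeautifulSoup(html, "html.parser")

# .descendants yields every node below the tag, depth-first, including
# text nodes - unlike .children, which stays one level deep.
names = [d.name for d in soup.p.descendants if d.name is not None]
print(names)
```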
parent node and ancestor node
soup.a.parent gives the direct parent node of the a tag.
Ancestor nodes can be obtained with list(enumerate(soup.a.parents)). This returns a list containing the a tag's parent, then the parent's parent, and so on upward; the final entries cover the whole document, so the last and second-to-last elements hold the entire document's content.
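A compact sketch of parent versus parents, on a minimal made-up document:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a href="#">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parent is the direct parent; .parents walks all the way up,
# ending at the BeautifulSoup object itself (named "[document]").
print(soup.a.parent.name)
chain = [p.name for p in soup.a.parents]
print(chain)
```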
sibling node
soup.a.next_siblings — iterator over all following sibling nodes
soup.a.previous_siblings — iterator over all preceding sibling nodes
soup.a.next_sibling — the next sibling node
soup.a.previous_sibling — the previous sibling node
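A sketch of the four sibling attributes. The three <a> tags below are written with no whitespace between them, so the siblings are tags rather than stray text nodes:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

second = soup.find(id="link2")
print(second.next_sibling["id"])      # immediate next sibling
print(second.previous_sibling["id"])  # immediate previous sibling
# The plural forms are generators over all following/preceding siblings.
print([s["id"] for s in second.next_siblings])
```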
Standard selector
find_all
find_all(name,attrs,recursive,text,**kwargs)
find_all searches the document by tag name, attributes, or text content.
usage of name
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
The result is returned as a list
We can also call find_all again on each result to get all the li tags:
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
attrs
Examples are as follows:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
attrs takes a dictionary of attribute/value pairs to match. class needs special treatment because class is a reserved word in Python: use the keyword argument class_, as in soup.find_all(class_='element'), or pass it inside the dictionary, as in soup.find_all('', {'class': 'element'}). Special attributes such as id can be passed as plain keyword arguments without attrs.
text
Examples are as follows:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
The result is a list of all the text strings that match text='Foo'.
find
find(name,attrs,recursive,text,**kwargs)
find returns the first element matching the criteria, or None if there is no match.
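A minimal sketch of find's behavior on a made-up list snippet:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, and None (rather than raising)
# when nothing matches - handy for optional elements.
print(soup.find("li").string)
print(soup.find("span"))
```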
Some other similar usages:
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all qualifying nodes after the current node; find_next() returns the first of them.
find_all_previous() returns all qualifying nodes before the current node; find_previous() returns the first of them.
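A sketch contrasting the sibling-level and document-order variants, on a made-up three-paragraph snippet:

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="a")
# find_next_sibling stays at the same level; find_all_next scans every
# node that appears later in the document, regardless of level.
print(first.find_next_sibling("p")["id"])
print([p["id"] for p in first.find_all_next("p")])
# find_previous searches backwards through the document from the node.
print(soup.find(id="c").find_previous("p")["id"])
```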
Summary
The lxml parsing library is recommended; fall back to html.parser when necessary.
Tag selection is weak in filtering capability but fast.
It is recommended to use find() and find_all() to query single or multiple results.
If you are familiar with CSS selectors, it is recommended to use select()
Remember the commonly used methods for getting attribute values and text content.
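As a closing recap, a small sketch of those attribute- and text-access methods on a single made-up link:

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")

a = soup.a
# Attribute access: indexing, the attrs dict, or get() with a default.
print(a["href"])
print(a.attrs["id"])
print(a.get("title", "n/a"))  # get() avoids a KeyError for missing attributes
# Text access: .string for a single text child, get_text() for all text.
print(a.string)
print(a.get_text())
```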