Concept
Installation:
Install from the command line with pip install beautifulsoup4.
BeautifulSoup supports several parsers, such as the built-in html.parser, lxml, and html5lib.
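As a minimal sketch of how the parser is chosen (using the built-in html.parser so no extra install is needed; swapping in 'lxml' or 'html5lib' works the same way once those are installed):

```python
from bs4 import BeautifulSoup

# The second constructor argument selects the parser.
snippet = "<p class='title'><b>Hello</b></p>"
soup = BeautifulSoup(snippet, "html.parser")

print(soup.b.string)    # Hello
print(soup.p["class"])  # ['title'] -- class attributes are returned as lists
```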
Basic Usage
from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="drimouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)
Note that html is not a complete HTML string; BeautifulSoup completes it during parsing when the object is initialized with soup = BeautifulSoup(html, 'lxml'). soup.prettify() returns the parsed document as a neatly indented string,
and soup.title.string prints the text content of the title node.
Tag selectors
Selecting elements:
# html is the same string as above
soup=BeautifulSoup(html,'lxml')
print(soup.title)# prints the <title> tag and its content
print(type(soup.title))#<class 'bs4.element.Tag'>
print(soup.head)# prints the <head> tag and its content
print(soup.p)# prints only the first <p> node and its content
Getting the node name
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.title.name)
# prints the node's name: title
Getting attributes
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.attrs)#{'class': ['title'], 'name': 'drimouse'}
print(soup.p.attrs['name'])#drimouse
print(soup.p['name'])#drimouse
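When an attribute may be absent, indexing with tag['name'] raises KeyError; a small side sketch (self-contained, using html.parser and a one-line snippet mirroring the html above) of the safer tag.get() accessor:

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="drimouse"><b>text</b></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.get("name"))  # drimouse
print(soup.p.get("id"))    # None -- a missing attribute returns None instead of raising
```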
Getting content
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.title.string)
Nested selection:
print(soup.title.string)
print(soup.head.title.string)
print(soup.head.title)
print(type(soup.head.title))
print(type(soup.head.title.string))
# Output, in order:
The Dormouse's story
The Dormouse's story
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
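A caveat worth sketching (self-contained, with a hypothetical snippet and html.parser): .string only yields a NavigableString when the node has exactly one child string; for mixed content it returns None, while get_text() still concatenates all descendant text:

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

# <p> has two children ("Hello " and <b>), so .string is ambiguous
print(soup.p.string)      # None
# get_text() joins every descendant string
print(soup.p.get_text())  # Hello world
```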
Associated selection:
When making a selection, you sometimes cannot reach the target node element in a single step. In that case you first select a node element, then use it as a starting point to select its child, parent, or sibling nodes.
(1) Child and descendant nodes:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.contents)# gets the child nodes as a list
# [<b>The Dormouse's story</b>]
Method 2:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.children)# an iterator
for i,child in enumerate(soup.p.children):
    print(i,child)
Output:
<list_iterator object at 0x000001BABACB9EF0>
0 <b>The Dormouse's story</b>
Descendant nodes:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.descendants)# a generator over all descendant nodes
for i,child in enumerate(soup.p.descendants):
    print(i,child)
(2) Getting parent and ancestor nodes:
soup=BeautifulSoup(html,'lxml')
print(soup.a.parent)# gets the direct parent node
print(soup.a.parents)# returns a generator
print(list(enumerate(soup.a.parents)))# gets all ancestor nodes
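To make the difference concrete, a small sketch (self-contained, with a hypothetical nested snippet and html.parser) collecting just the tag name of each ancestor as .parents walks up to the document root:

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p><a href='#'>link</a></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields the direct parent first, then each ancestor up to
# the BeautifulSoup document object itself (named '[document]')
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'div', 'body', 'html', '[document]']
```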
(3) Sibling nodes:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))# gets the following siblings
print(list(enumerate(soup.a.previous_siblings)))# gets the preceding siblings
Output:
[(0, ',\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' and\n'), (3, <a class="sister" href="http://example.com/title" id="link3">Tillie</a>), (4, ';\nand they lived at the bottom of a well.')]
[(0, 'Once upon a time there were three little sisters; and their names were\n')]
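Besides the plural iterators, next_sibling and previous_sibling step to a single adjacent node; a self-contained sketch with a hypothetical list snippet (html.parser, no whitespace between the tags so the siblings are the <li> elements themselves):

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.li
print(first.next_sibling)               # <li>two</li>
print(first.next_sibling.next_sibling)  # <li>three</li>
```

Note that in documents with whitespace between tags (like the story html above), next_sibling is often a newline string rather than the next tag.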
Method selectors:
Selecting by tag attributes as above is fast, but for more complicated selections it becomes cumbersome and inflexible. For such cases the BeautifulSoup library also provides the find_all() and find() methods.
find_all(name,attrs,recursive,text,**kwargs)
Finds elements by tag name, attributes, or text content.
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
The attrs argument:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
Equivalent to
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))# class is a Python keyword, so class_ is used instead
The text argument:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))
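The text argument also accepts a compiled regular expression for partial matches; a self-contained sketch (hypothetical list snippet, html.parser):

```python
import re
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# text= matches against the string content of nodes, returning the strings
print(soup.find_all(text="Foo"))              # exact match: ['Foo']
print(soup.find_all(text=re.compile("Foo")))  # regex match: ['Foo', 'Foobar']
```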
The find method
find(name,attrs,recursive,text,**kwargs)
find() returns the first matching element (or None); find_all() returns all matching elements.
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
CSS selectors
Passing a CSS selector directly to select() completes the selection.
(1) Getting attributes:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
(2) Getting text content:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())
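select() also accepts nested and id/class selectors; a self-contained sketch against a trimmed copy of the panel html above (html.parser):

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# descendant combinator: <li> elements inside the node with id="list-1"
print([li.get_text() for li in soup.select("#list-1 li")])  # ['Foo', 'Bar']
# tag.class selector
print(len(soup.select("li.element")))  # 2
```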
Summary:
The lxml parsing library is recommended; fall back to html.parser if necessary.
Tag selection is fast, but its filtering ability is weak.
find() and find_all() are recommended for matching a single result or multiple results.
If you are familiar with CSS selectors, select() is recommended.
Remember the commonly used methods for getting attribute values and text.