Python Web Crawler (6): The BeautifulSoup Library

Concept

[Image missing from source]

Installation

Run the following on the command line:

pip install beautifulsoup4

The examples below use the lxml parser, which is a separate package: pip install lxml

Parsers supported by BeautifulSoup

[Image missing from source: table of supported parsers]
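
As a minimal sketch of how a parser is chosen: the second argument of the BeautifulSoup constructor names it. The snippet below uses html.parser because it ships with Python. The sample markup is hypothetical, and the commented-out alternatives assume the corresponding packages are installed.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello</p>"  # hypothetical sample markup

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(markup, "html.parser")
first_p_text = soup.p.string
print(first_p_text)

# Other parser names work only if the package is installed (assumption):
#   BeautifulSoup(markup, "lxml")      # requires: pip install lxml
#   BeautifulSoup(markup, "html5lib")  # requires: pip install html5lib
```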

Basic Usage

from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormousae's story</title></head>
<body>
<p class="title" name="drimouse"><b>The Dormousae's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the boottom of a well.</p>
<p class="story">...</p>
'''
soup=BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

Note that the html string above is not a complete HTML document: the body and html tags are never closed. Passing it to BeautifulSoup(html, 'lxml') initializes a BeautifulSoup object, and the parser repairs the markup automatically. The soup.prettify() method outputs the parsed document as a string with standard indentation, and soup.title.string prints the text content of the title node.
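
The automatic repair can be seen on a smaller input. This is only a sketch: the fragment is hypothetical, and it uses the built-in html.parser rather than lxml so that it runs without extra packages (lxml would additionally wrap the result in html and body tags).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is closed.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree writes the missing closing tags.
repaired = str(soup)
print(repaired)

# prettify() renders the same tree with one node per line, indented.
print(soup.prettify())
```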

Tag selector

Select elements:
# html is the same as above
soup = BeautifulSoup(html, 'lxml')
print(soup.title)  # prints the title tag and its contents
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)  # prints the head tag and its contents
print(soup.p)  # prints only the first p node and its contents

Get the name:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
# prints the node name: title

Get attributes:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)  # {'class': ['title'], 'name': 'drimouse'}
print(soup.p.attrs['name'])  # drimouse
print(soup.p['name'])  # drimouse
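
Two details of attribute access are easy to miss: class is returned as a list because it can hold several values, and indexing a missing attribute raises KeyError while get() returns None. A small sketch with hypothetical markup, using the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a two-valued class attribute.
soup = BeautifulSoup('<p class="title main" name="dormouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]     # multi-valued attribute, returned as a list
name = p.get("name")     # single-valued attribute, returned as a string
missing = p.get("id")    # get() returns None when the attribute is absent

try:
    p["id"]              # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, name, missing, raised)
```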

Get content:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

Nested selection:
print(soup.title.string)
print(soup.head.title.string)
print(soup.head.title)
print(type(soup.head.title))
print(type(soup.head.title.string))
# Output, in order:
# The Dormousae's story
# The Dormousae's story
# <title>The Dormousae's story</title>
# <class 'bs4.element.Tag'>
# <class 'bs4.element.NavigableString'>
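
The first link in the sample document wraps its text in an HTML comment, and in that case .string returns a Comment object rather than a plain NavigableString. The sketch below shows the difference on a shortened, hypothetical fragment, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical fragment: <a> holds only a comment, <b> holds plain text.
html = '<p><a id="link1"><!--Elsie--></a><b>Lacie</b></p>'
soup = BeautifulSoup(html, "html.parser")

comment_text = soup.a.string  # a Comment (a NavigableString subclass)
plain_text = soup.b.string    # an ordinary NavigableString

print(type(comment_text))
print(type(plain_text))

# Check for Comment explicitly if commented-out text should be skipped.
is_comment = isinstance(comment_text, Comment)
```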
Relational selection:

Sometimes you cannot reach the node you want in a single step. In that case, select one node first and then navigate from it to its children, parent, siblings, and so on.
(1) Child nodes and descendant nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # get the direct child nodes as a list
# [<b>The Dormousae's story</b>]

Method 2:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator, not a list
for i, child in enumerate(soup.p.children):
	print(i, child)

Output:
<list_iterator object at 0x000001BABACB9EF0>
0 <b>The Dormousae's story</b>

Descendant nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator over all descendant nodes
for i, child in enumerate(soup.p.descendants):
	print(i, child)
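
The difference between children and descendants shows up as soon as tags are nested. A minimal sketch with hypothetical markup, using the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <i> is nested inside <b>, which is nested inside <p>.
html = "<p><b>The <i>story</i></b> begins</p>"
soup = BeautifulSoup(html, "html.parser")

# children: only the nodes directly under <p>.
children = list(soup.p.children)

# descendants: every node under <p>, depth-first, including text nodes.
descendants = list(soup.p.descendants)

for i, node in enumerate(descendants):
    print(i, repr(str(node)))
```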

(2) Parent and ancestor nodes:

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # get the direct parent node
print(soup.a.parents)  # returns an iterator, not a list
print(list(enumerate(soup.a.parents)))  # get all ancestor nodes
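
Printing whole ancestor nodes produces a lot of output, so it can help to print only their names. A sketch with hypothetical markup, using the built-in html.parser; the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical, fully closed document.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name

# parents walks upward from the direct parent to the document root.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]
print(parent_name, ancestor_names)
```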

(3) Sibling nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))  # get the siblings after the node
print(list(enumerate(soup.a.previous_siblings)))  # get the siblings before the node

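Output:
[(0, ',\n'), (1, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>), (2, ' and\n'), (3, <a href="http://example.com/title" class="sister" id="link3">Tillie</a>), (4, ';\nand they lived at the boottom of a well.')]

[(0, 'Once upon a time there were three little sisters;and their names were\n')]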
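
Sibling iteration returns text nodes as well as tags, as the output shows. If only the tags are wanted, one option is to filter by type or to use find_next_siblings(). A sketch with shortened, hypothetical markup, using the built-in html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup modeled on the sample paragraph.
html = ('<p>intro <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
        '<a id="link3">Tillie</a>; end</p>')
soup = BeautifulSoup(html, "html.parser")

# Option 1: keep only Tag objects from the sibling iterator.
tag_siblings = [s for s in soup.a.next_siblings if isinstance(s, Tag)]
sibling_ids = [t["id"] for t in tag_siblings]

# Option 2: let find_next_siblings() do the filtering by tag name.
found_ids = [t["id"] for t in soup.a.find_next_siblings("a")]

print(sibling_ids, found_ids)
```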

Method selectors:

The attribute-based selection described above is fast, but it becomes cumbersome and inflexible for more complex queries. For those cases, BeautifulSoup also provides the find_all() and find() methods.

find_all(name, attrs, recursive, text, **kwargs)

Finds elements in the document by tag name, attributes, or text content.

html='''
<div class="panel">
	<div class="panel-heading">
		<h4>Hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
for ul in soup.find_all('ul'):
	print(ul.find_all('li'))

Output:
[Image missing from source: screenshot of the printed results]
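
A reduced version of the example makes the nested find_all() calls easier to follow. This is a sketch, not the original output: the markup is a shortened copy of the lists above, and it uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
  <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all("ul")  # a list-like result set of Tag objects

# For each <ul>, search again inside it for its <li> children.
items_per_list = {
    ul["id"]: [str(li.string) for li in ul.find_all("li")]
    for ul in uls
}
print(items_per_list)
```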
The attrs parameter:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'class': 'element'}))

Equivalent to

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_ instead
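
The equivalence of the two forms can be checked directly. The sketch also shows that matching on class tests each class value separately, so class_='list' matches an element whose class is 'list list-small'. The markup is a shortened, hypothetical copy of the lists above, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = ('<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
        '<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>')
soup = BeautifulSoup(html, "html.parser")

# Dictionary form and keyword form return the same elements.
by_attrs = soup.find_all(attrs={"class": "element"})
by_kwarg = soup.find_all(class_="element")

# A single class value matches any element that carries it.
list_ids = [ul["id"] for ul in soup.find_all(class_="list")]

print(by_attrs == by_kwarg, list_ids)
```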

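The text parameter:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Matches text nodes, not tags; since Beautiful Soup 4.4.0 this argument is named string
print(soup.find_all(text='Foo'))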
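
Text matching returns the matching strings themselves rather than the tags that contain them. The sketch below uses the string argument, which is the name Beautiful Soup 4.4.0 and later use for text, and also shows a regular-expression match. The markup is hypothetical and parsed with the built-in html.parser.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Baz</li><li>Foo</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Exact match on the text of a node.
exact = soup.find_all(string="Foo")

# A compiled pattern matches every text node the regex finds.
by_pattern = soup.find_all(string=re.compile(r"^Ba"))

# Each result is a text node; .parent leads back to the enclosing tag.
parent_names = [s.parent.name for s in exact]

print(exact, by_pattern, parent_names)
```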

The find method

find(name, attrs, recursive, text, **kwargs)
find() returns a single element (the first match), while find_all() returns all matching elements.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))  # None, because the document has no page tag
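
Because find() returns None when nothing matches, chaining an attribute access onto the result can raise AttributeError. One way to guard against that, sketched with hypothetical markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul id='list-1'><li>Foo</li></ul>", "html.parser")

missing = soup.find("page")   # no such tag, so the result is None
first_li = soup.find("li")    # the first matching Tag

# Check for None before using the result.
missing_text = missing.get_text() if missing is not None else ""
first_text = first_li.get_text() if first_li is not None else ""

print(repr(missing_text), repr(first_text))
```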

CSS selectors

Pass a CSS selector directly to select() to make the selection.
(1) Get attributes

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
	print(ul['id'])
	print(ul.attrs['id'])

(2) Get content

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
for li in soup.select('li'):
	print(li.get_text())
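
select() also accepts compound selectors, and select_one() returns only the first match. A minimal sketch on a shortened copy of the lists above, parsed with the built-in html.parser; it assumes a Beautiful Soup version that provides select_one().

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
  <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: .element nodes inside the node with id list-2.
small_items = [li.get_text() for li in soup.select("#list-2 .element")]

# select_one() returns the first match, or None if nothing matches.
first_item = soup.select_one("#list-1 li").get_text()

all_items = soup.select("ul li")
print(small_items, first_item, len(all_items))
```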

Summary

The lxml parser is recommended; fall back to html.parser if necessary.
Tag selection has weak filtering ability but is fast.
Use find() and find_all() to query for a single result or for multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways to get attribute values and text.

Origin blog.csdn.net/qq_45353823/article/details/104215426