Python Web Crawler 7 - The BeautifulSoup Parsing Library

1. Introduction to BeautifulSoup

BeautifulSoup is a parsing library for HTML and XML. The supported parsers are:

  • Python standard library: BeautifulSoup(markup, 'html.parser'); moderate speed and reasonable fault tolerance; fault tolerance is poor in versions before Python 2.7.3 and Python 3.2.2;
  • lxml HTML parser: BeautifulSoup(markup, 'lxml'); fast, with strong fault tolerance; recommended;
  • lxml XML parser: BeautifulSoup(markup, 'xml'); fast; the only parser that supports XML;
  • html5lib: BeautifulSoup(markup, 'html5lib'); the best fault tolerance, parses a document the way a browser does and produces an HTML5 document, but it is slow.
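Since lxml and html5lib are separate installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; the helper name make_soup and the preference order are assumptions made for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper written for this example.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is unavailable.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately broken markup: neither <p> nor <b> is closed.
soup, parser_name = make_soup("<p>Hello <b>world")
print(parser_name)
print(soup.b.string)
```

Because 'html.parser' ships with Python, the last entry in the tuple always succeeds, so the RuntimeError is only a safeguard.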

 

2. Initialization and parsing

2.1 Parsing HTML text:

  from bs4 import BeautifulSoup

  # res is assumed to be an HTTP response object obtained earlier (for example with requests)
  soup = BeautifulSoup(res.text, 'lxml')

  print(soup.prettify())  # prettify() returns the parsed document as a string with standard indentation
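To try this without a network request, a literal string can stand in for res.text. This is only a sketch: the variable html_text and its content are made up, and 'html.parser' is used instead of 'lxml' so that nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

# Stand-in for res.text (assumed to be the body of an HTTP response).
html_text = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html_text, "html.parser")

# prettify() returns a str with one tag or text node per line,
# indented according to nesting depth.
pretty = soup.prettify()
print(pretty)
```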

2.2 Parsing a local file:

  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open('./test.html', encoding='utf-8'), 'lxml')

  print(soup.prettify())  # prettify() returns the parsed document as a string with standard indentation
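Passing open(...) straight to BeautifulSoup leaves the file handle to be closed by the garbage collector. A possible variant, sketched here with a throwaway temporary file so that it runs anywhere, wraps the call in a with block. The file content is invented, and 'html.parser' is used in place of 'lxml' only to avoid an extra dependency.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a small throwaway file so the example does not depend on ./test.html.
path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Local file</title></head><body></body></html>")

# The with block closes the file as soon as parsing has finished.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string)
```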

 

 

3. Node selectors

Local HTML file (test.html):
<head>
<meta charset="UTF-8">
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">there name were
<a href="http://elsie" class="sister" id="link1"><span>Elsie</span></a>,
<a href="http://lacie" class="sister" id="link2">Lacie</a>and
<a href="http://tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

from bs4 import BeautifulSoup
import lxml
soup = BeautifulSoup(open('./test.html', encoding='utf-8'), 'lxml')

3.1 Selecting elements
print(soup.title)  # the title tag together with its text
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.title.string)  # use the string attribute of the Tag object to get its text
print(soup.p)  # when several nodes match, only the first one is returned
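One detail worth seeing in isolation: attribute-style access returns the first matching tag, and returns None when no such tag exists. The snippet below is a small sketch with made-up markup, parsed with 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = "<title>The Dormouse's story</title><p>first</p><p>second</p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p)   # only the first <p> is returned
print(soup.h1)  # there is no <h1>, so this is None

# Guard against None before reading .string, otherwise AttributeError is raised.
h1 = soup.h1
heading = h1.string if h1 is not None else "(no h1)"
print(heading)
```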

3.2 Extracting information
3.2.1 Getting the node name
print(soup.title.name)

3.2.2 Getting attributes
print(soup.p.attrs)  # all attributes and values of the p node, returned as a dictionary
print(soup.p.attrs['name'])  # the value of the name attribute of the p node
A more concise way:
print(soup.p['class'])  # a node may have several classes, so a list is returned
print(soup.p['name'])
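The difference between the tag name, an HTML attribute that happens to be called name, and a missing attribute is easy to mix up. The sketch below uses a shortened version of the first p node; the extra class "big" is added here purely to show the list result.

```python
from bs4 import BeautifulSoup

html = """<p class="title big" name="dormouse"><b>The Dormouse's story</b></p>"""
p = BeautifulSoup(html, "html.parser").p

print(p.name)       # the tag name
print(p["name"])    # the HTML attribute called name
print(p["class"])   # class is multi-valued, so a list comes back
print(p.get("id"))  # get() returns None for a missing attribute; p["id"] would raise KeyError
```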

3.3 Getting content
print(soup.p.string)
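Note that string only works when a node has a single child chain leading to one piece of text. A short sketch of the difference, using a simplified two-paragraph document written for this example:

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a href="http://elsie">Elsie</a> lived here.</p>"""
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")

print(first.string)       # p has one child (b) with one text node, so the text is returned
print(second.string)      # p has several children, so string is None
print(second.get_text())  # get_text() joins all text inside the node
```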

3.1 Nested selection
print(soup.head.title.string)  # if the returned object is a Tag, node selection can be chained on it

3.2.1 Associated selection: child and descendant nodes
print(soup.p.contents)  # returns a list of the direct child nodes
print(soup.p.children)  # returns an iterator over the direct child nodes; loop over it to see the output
for i, child in enumerate(soup.p.children):
    print(i, child)
print(soup.p.descendants)  # returns a generator over all descendant nodes; loop over it to see the output
for i, child in enumerate(soup.p.descendants):
    print(i, child)
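To make the difference between children and descendants concrete, the following sketch counts both on a shortened story paragraph (the markup is trimmed for this example). Text nodes count as nodes too, which is why the numbers are larger than the number of tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p class="story">names: <a id="link1"><span>Elsie</span></a>, <a id="link2">Lacie</a></p>'
p = BeautifulSoup(html, "html.parser").p

children = list(p.children)        # direct children only, text nodes included
descendants = list(p.descendants)  # children, grandchildren, and so on

# Keep only Tag objects to see which elements each view reaches.
child_tags = [c.name for c in children if isinstance(c, Tag)]
descendant_tags = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children), child_tags)
print(len(descendants), descendant_tags)
```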

3.2.2 Associated selection: parent and ancestor nodes
print(soup.a.parent)  # the direct parent node
print(list(enumerate(soup.a.parents)))  # all ancestor nodes; parents returns a generator
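Printing whole ancestor nodes produces a lot of output, so it can help to look at their names only. A minimal sketch on a reduced document (written for this example, parsed with 'html.parser'):

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)  # the direct parent

# parents walks upwards from the nearest ancestor to the BeautifulSoup object itself,
# whose name is '[document]'.
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)
```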

3.2.3 Associated selection: sibling nodes
print(soup.a.next_sibling)  # the next sibling node (this may be a text node)
print(soup.a.previous_sibling)  # the previous sibling node
print(list(enumerate(soup.a.next_siblings)))  # generator over all following sibling nodes
print(soup.a.previous_siblings)  # generator over all preceding sibling nodes
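A common surprise is that next_sibling often returns the text between two tags rather than the next tag. The sketch below, on a trimmed copy of the link list, contrasts it with find_next_sibling and find_next_siblings, which skip text nodes when a tag name is given.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))            # the comma and newline after the first link
print(first.find_next_sibling("a")["id"])  # the next <a> tag

following_ids = [a["id"] for a in first.find_next_siblings("a")]
print(following_ids)
```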

3.2.4 Associated selection: extracting information
print(list(soup.a.parents)[0])  # convert to a list and index into it, then use the extraction methods above

 

4. Method selectors

Local HTML code:

<body>
  <ul class="list">
<li class="first" name="dennisz" id="one">桃花影落飞神剑</li>
<li class="hehe">我心已向大海</li>
<li title="third line">碧海潮生按玉箫</li>
<li>笑书神侠倚碧鸳</li>
<a href="http://www.baidu.com">金庸</a>
</ul>
<div class="xixi">
<ul>
<!-- 这是一个注释 -->
<li class="second" id="two">谁管世间满风浪</li>
<li class="haha">仗剑匹马走天涯</li>
<li id="yoyoyo">笑傲江湖成绝响</li>
<li class="ok" id="no">人间再无侠客行</li>
<div>
<li>令狐</li>
<li>任我行</li>
<li>盈盈</li>
<li>东方不败</li>
<li>左冷禅</li>
<li>岳不群</li>
</div>
</ul>
<d>
<a href="http://www.taobao.com">绝世秘籍</a>
</d>
</div>
</body>
 
4.1 Querying by node name
print(soup.find_all('li'))  # returns all li tags as a list
print(soup.find_all('li', limit=3))  # limit the number of results
print(soup.find_all(['li', 'a']))  # returns all li or a tags
4.2 Querying by attribute
print(soup.find_all('li', class_='second'))  # returns the li tags with class='second'
print(soup.find_all('li', class_='second')[0].string)
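The calls above can be checked on a reduced version of the list. This is a sketch: the markup is shortened, the item texts A, B and C are placeholders, and the attrs dictionary form is shown as an additional way to filter on arbitrary attributes.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list">
  <li class="first" id="one">A</li>
  <li class="hehe">B</li>
  <li title="third line">C</li>
</ul>
<a href="http://www.baidu.com">link</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_li = soup.find_all("li")                # every li tag
first_two = soup.find_all("li", limit=2)    # stop after two matches
li_or_a = soup.find_all(["li", "a"])        # a list of names means "any of these"
by_class = soup.find_all("li", class_="hehe")
by_title = soup.find_all(attrs={"title": "third line"})  # filter on any attribute

print(len(all_li), len(first_two), len(li_or_a))
print(by_class[0].string, by_title[0].string)
```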

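4.3 Other methods

  • find(): similar to find_all(), but returns only the first match
  • find_parents() and find_parent(): all ancestor nodes, and the direct parent node
  • find_next_siblings() and find_next_sibling(): all following sibling nodes, and the first following sibling node
  • find_previous_siblings() and find_previous_sibling(): all preceding sibling nodes, and the first preceding sibling node
  • find_all_next() and find_next(): all matching nodes after the current node, and the first one after it
  • find_all_previous() and find_previous(): all matching nodes before the current node, and the first one before it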
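The methods in this list can be tried from a single starting node. The following is a small sketch on markup invented for the purpose (the ids box, a, b and c are not from the sample file).

```python
from bs4 import BeautifulSoup

html = '<div id="box"><ul><li id="a">1</li><li id="b">2</li><li id="c">3</li></ul><p>after</p></div>'
soup = BeautifulSoup(html, "html.parser")

b = soup.find("li", id="b")  # find() returns the first match or None

parent_div = b.find_parent("div")
ancestors = [t.name for t in b.find_parents(["ul", "div"])]  # nearest ancestor first
next_li = b.find_next_sibling("li")
prev_li = b.find_previous_sibling("li")
next_p = b.find_next("p")                                    # searches everything after b
earlier_li = [t["id"] for t in b.find_all_previous("li")]    # searches everything before b

print(parent_div["id"], ancestors)
print(prev_li["id"], next_li["id"], next_p.string, earlier_li)
```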

5. CSS selectors

Local HTML code:

<body>
  <ul class="list1">
<li class="first" name="dennisz" id="one">桃花影落飞神剑</li>
<li class="hehe">我心已向大海</li>
<li title="third line">碧海潮生按玉箫</li>
<li>笑书神侠倚碧鸳</li>
<a href="http://www.baidu.com">金庸</a>
</ul>
<div class="xixi">
<ul class="list2">
<!-- 这是一个注释 -->
<li class="second" id="two">谁管世间满风浪</li>
<li class="haha">仗剑匹马走天涯</li>
<li id="yoyoyo">笑傲江湖成绝响</li>
<li class="ok" id="no">人间再无侠客行</li>
<div>
<li>令狐</li>
<li>任我行</li>
<li>盈盈</li>
<li>东方不败</li>
<li>左冷禅</li>
<li>岳不群</li>
</div>
</ul>
<d>
<a href="http://www.taobao.com">绝世秘籍</a>
</d>
</div>
</body>
5.1 Finding by tag
print(soup.select('li'))  # all li tags
Finding by class attribute
print(soup.select('.list1'))  # elements with class list1
print(soup.select('li[class="hehe"]'))
Finding by id
print(soup.select('#two'))
print(soup.select('li[id="no"]'))
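The selector forms above can be verified on a cut-down version of the two lists. In this sketch the item texts x, y and z are placeholders, and select_one with a child combinator is added as one more form that may be handy when only the first match is needed.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list1"><li class="hehe">x</li><li class="second" id="two">y</li></ul>'
    '<ul class="list2"><li id="no">z</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

all_li = soup.select("li")                     # by tag name
list1 = soup.select(".list1")                  # by class
hehe = soup.select('li[class="hehe"]')         # by attribute value
two = soup.select("#two")                      # by id
first_in_list2 = soup.select_one("ul.list2 > li")  # first match only, or None

print(len(all_li), list1[0].name, hehe[0].string, two[0].string, first_in_list2["id"])
```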

5.2 Nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))

5.3 Getting attributes
for ul in soup.select('ul'):
    print(ul['class'])  # both forms work
    print(ul.attrs['class'])
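Indexing with ul['class'] raises KeyError for a ul that has no class attribute, as is the case for the inner list in the section 4 sample. A possible defensive variant, sketched on invented markup, combines nested selection with get() and a default value:

```python
from bs4 import BeautifulSoup

html = '<ul class="list1"><li>a</li><li>b</li></ul><div><ul><li>c</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

summary = []
for ul in soup.select("ul"):
    # get() with a default avoids KeyError for the second ul, which has no class.
    classes = ul.get("class", [])
    # select() on a Tag searches only inside that tag.
    items = [li.string for li in ul.select("li")]
    summary.append((classes, items))

print(summary)
```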

5.4 Getting text
for li in soup.select('li'):
    print(li.text)  # all three forms work here, because each li contains only text
    print(li.string)
    print(li.get_text())
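The three forms stop being interchangeable once an li contains nested tags. A short sketch with two made-up list items shows where string differs from text and get_text():

```python
from bs4 import BeautifulSoup

html = "<ul><li>plain</li><li>mixed <b>bold</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")

rows = []
for li in soup.select("li"):
    # string is None when the node has more than one child;
    # text and get_text() always join all nested text.
    rows.append((li.string, li.text, li.get_text()))

for row in rows:
    print(row)
```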


Origin www.cnblogs.com/rong1111/p/12159436.html