This article explains in detail how to traverse the document tree and work with tags using BeautifulSoup, a Python library widely used in web crawlers.
The following example uses BeautifulSoup to traverse a document tree and operate on its tags; these are the most basic features of the library.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')
First, child nodes
A Tag may contain multiple strings or other Tags; these are all child nodes of that Tag. BeautifulSoup provides many attributes for accessing and traversing child nodes.
1. Getting a Tag by its tag name
print(soup.head)
print(soup.title)
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
Accessing a Tag by name returns only the first Tag with that name. To get all Tags with that name, use a search method such as find_all
soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
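To make the difference concrete, here is a minimal sketch. It assumes a shortened copy of the document above and uses Python's built-in html.parser instead of lxml so that it runs without extra dependencies; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the "three sisters" paragraph (an assumption for this sketch).
html = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first matching tag.
first_link = soup.a

# find_all returns every matching tag in document order.
all_links = soup.find_all("a")
link_ids = [a["id"] for a in all_links]

# A tag name that does not occur in the document yields None.
missing = soup.table

print(first_link["id"])  # link1
print(link_ids)          # ['link1', 'link2', 'link3']
print(missing)           # None
```

Because attribute-style access can return None, it is usually safer to check the result before chaining further lookups on it.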
2. contents: returns the Tag's child nodes as a list
head_tag = soup.head
head_tag.contents
[<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
<title>The Dormouse's story</title>
title_tag.contents
["The Dormouse's story"]
3. children: iterate over a Tag's child nodes through this attribute
for child in title_tag.children:
    print(child)
The Dormouse's story
4. descendants: both contents and children return only direct child nodes, whereas descendants recursively iterates over all of a tag's descendant nodes
for child in head_tag.children:
    print(child)
<title>The Dormouse's story</title>
for child in head_tag.descendants:
    print(child)
<title>The Dormouse's story</title>
The Dormouse's story
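The contrast between direct children and all descendants can be checked with a small sketch. It assumes a stand-alone head fragment and the built-in html.parser; the names direct and nested are hypothetical.

```python
from bs4 import BeautifulSoup

# Stand-alone fragment (an assumption for this sketch), parsed with html.parser.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .children yields only direct children: here, the <title> tag.
direct = list(head_tag.children)

# .descendants walks the whole subtree: the <title> tag, then the string inside it.
nested = list(head_tag.descendants)

print(len(direct))  # 1
print(len(nested))  # 2
```

Note that .contents is a list, while .children and .descendants are iterators, so wrap them in list() when you need a length or an index.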
5. string: if a tag has only one child node and that child is a NavigableString, the child can be obtained with .string
title_tag.string
"The Dormouse's story"
If a tag's only child node is another tag, .string returns that child tag's single NavigableString.
head_tag.string
"The Dormouse's story"
If a tag contains multiple child nodes, it is ambiguous which child .string should refer to, so .string returns None
print(soup.html.string)
None
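The three cases for .string can be sketched side by side. The markup below is a made-up example, parsed with the built-in html.parser; get_text() is shown only as one possible fallback when .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> with a single nested tag, one <p> with mixed content.
soup = BeautifulSoup(
    "<div><p><b>only</b></p><p>one <i>two</i></p></div>", "html.parser"
)
first_p, second_p = soup.find_all("p")

# Single NavigableString child: .string returns it.
bold_text = first_p.b.string

# Single child that is a tag: .string is taken from that child.
inherited_text = first_p.string

# Several children: .string is ambiguous, so it is None.
ambiguous = second_p.string

# get_text() concatenates all strings in the subtree instead.
joined = second_p.get_text()

print(bold_text, inherited_text, ambiguous, joined)
```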
6. strings and stripped_strings
If a tag contains multiple strings, you can iterate over them with .strings
for string in soup.strings:
    print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
The output of .strings can contain many spaces and blank lines; use stripped_strings to skip whitespace-only strings and strip leading and trailing whitespace from the rest
for string in soup.stripped_strings:
    print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
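Printing the strings directly hides the whitespace, so the following sketch collects them into lists and uses repr-style output instead. It assumes a small made-up fragment and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and padding around the text.
html = "<div>\n  <p> Elsie </p>\n  <p>Lacie</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# .strings keeps every string node, including whitespace-only ones.
raw = list(soup.strings)

# .stripped_strings drops whitespace-only nodes and strips the others.
clean = list(soup.stripped_strings)

print(raw)    # includes entries such as '\n  ' and ' Elsie '
print(clean)  # ['Elsie', 'Lacie']
```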
Second, parent nodes
1. parent: get the parent node of an element
title_tag = soup.title
title_tag.parent
<head><title>The Dormouse's story</title></head>
Strings also have a parent node
title_tag.string.parent
<title>The Dormouse's story</title>
2. parents: recursively get all of an element's ancestor nodes
link = soup.a
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
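The same ancestor chain can be collected into a list, which is handy for building a path to an element. This is a minimal sketch on a made-up document with the built-in html.parser; find_parent is shown as one way to jump straight to a specific ancestor.

```python
from bs4 import BeautifulSoup

# Hypothetical document with explicit <html> and <body> tags.
soup = BeautifulSoup(
    "<html><body><p><a id='link1'>Elsie</a></p></body></html>", "html.parser"
)
link = soup.a

# Walk upwards from the link; the BeautifulSoup object itself is named "[document]".
ancestor_names = [parent.name for parent in link.parents]

# find_parent searches the same chain for the first ancestor with a given name.
body = link.find_parent("body")

print(ancestor_names)  # ['p', 'body', 'html', '[document]']
print(body.name)       # body
```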
Third, sibling nodes
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(sibling_soup.prettify())
<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>
1. next_sibling and previous_sibling
sibling_soup.b.next_sibling
<c>text2</c>
sibling_soup.c.previous_sibling
<b>text1</b>
In a real document, the .next_sibling or .previous_sibling of a tag is usually a string or whitespace
soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.a.next_sibling  # the next_sibling of the first <a> tag is the string ',\n'
',\n'
soup.a.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
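When only tag siblings matter, chaining .next_sibling twice is fragile. The sketch below shows two alternatives on a shortened copy of the paragraph, assuming the built-in html.parser: find_next_sibling, and filtering .next_siblings by type.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the "three sisters" paragraph (an assumption for this sketch).
html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the text between the links, not the next <a>.
text_sibling = first.next_sibling

# find_next_sibling skips over strings and returns the next matching tag.
next_link = first.find_next_sibling("a")

# Filtering by type keeps only Tag objects among the following siblings.
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(text_sibling))  # ',\n'
print(next_link["id"])     # link2
print(later_ids)           # ['link2', 'link3']
```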
2. next_siblings and previous_siblings
for sibling in soup.a.next_siblings:
    print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
Fourth, going back and forth
1. next_element and previous_element
These attributes point to the object (a tag or a string) that was parsed immediately after or before the current one, that is, the next and previous nodes in depth-first traversal order
last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)
;
and they lived at the bottom of a well.
Tillie
last_a_tag.previous_element
' and\n'
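The difference between .next_sibling and .next_element is easiest to see on a tiny fragment. This sketch assumes a made-up paragraph and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link followed by trailing text in the same paragraph.
soup = BeautifulSoup('<p><a id="link3">Tillie</a>; the end</p>', "html.parser")
a_tag = soup.a

# next_sibling stays on the same level: the text after the closing </a>.
sibling = a_tag.next_sibling

# next_element follows parse order: the string inside the <a> comes first.
element = a_tag.next_element

# Moving one step further in parse order reaches the trailing text.
after_element = element.next_element

# previous_element is whatever was parsed just before <a>: the enclosing <p>.
before = a_tag.previous_element

print(repr(sibling))        # '; the end'
print(repr(element))        # 'Tillie'
print(repr(after_element))  # '; the end'
print(before.name)          # p
```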
2. next_elements and previous_elements
With .next_elements and .previous_elements you can move forward or backward through the parsed content of the document, as if the document were being parsed again
for element in last_a_tag.next_elements:
    print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'