Traversing the Document Tree and Operating on Tags with BeautifulSoup, Python's Web-Scraping Library

Today I will introduce how to traverse the document tree and operate on tags with BeautifulSoup, a Python web-scraping library, with a detailed walkthrough of each method.
The following examples use BeautifulSoup to traverse a document tree and operate on its tags; this is the most fundamental material for working with the library.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
 
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""
 
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
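The example above uses the third-party lxml parser, which must be installed separately. As a minimal sketch (assuming only that bs4 itself is available), you can fall back to Python's built-in html.parser when lxml is missing:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"

# Prefer lxml for speed, but fall back to the stdlib parser if it is not installed.
try:
    soup = BeautifulSoup(html_doc, "lxml")
except Exception:  # bs4 raises FeatureNotFound when the requested parser is absent
    soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)  # The Dormouse's story
```

Both parsers produce the same tree for well-formed documents; they can differ on badly broken HTML.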

First, child nodes

A Tag may contain multiple strings or other Tags, all of which are its child nodes. BeautifulSoup provides many attributes for operating on and traversing child nodes.

1. Getting a Tag by its name

print(soup.head)
print(soup.title)
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

Accessing a tag by name only returns the first match; to get all matching Tags, use a method such as find_all:

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

2. The .contents attribute: returns a tag's child nodes as a list

head_tag = soup.head
head_tag.contents
[<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
<title>The Dormouse's story</title>
title_tag.contents
["The Dormouse's story"]

3. .children: iterates over the child nodes

for child in title_tag.children:
  print(child)
The Dormouse's story
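As a small sketch of the difference between the two: .contents materializes the children as a list you can index, while .children is an iterator over the same nodes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .contents gives the child nodes as a list ...
as_list = head_tag.contents
# ... while .children is a generator over the same nodes.
as_iter = list(head_tag.children)

assert as_list == as_iter
print(as_list[0].name)  # title
```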

4. .descendants: both .contents and .children return only the direct children, while .descendants recursively iterates over all of a tag's descendants

for child in head_tag.children:
  print(child)

The Dormouse's story

for child in head_tag.descendants:
  print(child)
<title>The Dormouse's story</title>
The Dormouse's story
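The difference can be made concrete by counting the nodes each attribute yields, a sketch of the example above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .children yields only the direct child (the <title> tag) ...
direct = list(head_tag.children)
# ... while .descendants also recurses into it, yielding the string inside.
deep = list(head_tag.descendants)

print(len(direct), len(deep))  # 1 2
```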

5. .string: if a tag has only one child node and it is of type NavigableString, the tag's .string attribute returns that child

title_tag.string
"The Dormouse's story"

If a tag's only child is another tag, and that child tag has a .string, then the parent's .string is the same value:

head_tag.string
"The Dormouse's story"

If a tag has more than one child node, .string cannot tell which child it should refer to, so it returns None:

print(soup.html.string)

None
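The three cases can be checked side by side in one small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>"
    "<p><b>one</b><i>two</i></p>",
    "html.parser",
)

# Single string child: .string returns it directly.
assert soup.title.string == "The Dormouse's story"
# Single tag child: .string is delegated to that child tag.
assert soup.head.string == "The Dormouse's story"
# Multiple children: ambiguous, so .string is None.
assert soup.p.string is None
```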

6. .strings and .stripped_strings

If a tag contains multiple strings, use .strings to iterate over them:

for string in soup.strings:
  print(string)
The Dormouse's story
 
 
The Dormouse's story
 
 
Once upon a time there were three little sisters; and their names were
 
Elsie
,
 
Lacie
 and
 
Tillie
;
and they lived at the bottom of a well.
 
 
...

The .strings output contains many spaces and blank lines; use .stripped_strings to strip the blank content:

for string in soup.stripped_strings:
  print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
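A related shortcut worth knowing here is get_text(), which concatenates all of a tag's strings into one value; a small sketch comparing the two:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once upon a time there were\n <b>three</b> little sisters</p>",
                     "html.parser")

# .stripped_strings yields each text fragment with surrounding whitespace removed.
fragments = list(soup.stripped_strings)
# get_text() joins all the strings into a single value; passing a separator
# and strip=True keeps the stripped fragments from running together.
text = soup.get_text(" ", strip=True)

print(fragments)  # ['Once upon a time there were', 'three', 'little sisters']
print(text)       # Once upon a time there were three little sisters
```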

Second, parent nodes

1. .parent: gets the parent node of an element

title_tag = soup.title
title_tag.parent
<head><title>The Dormouse's story</title></head>

A string also has a parent node:

title_tag.string.parent
<title>The Dormouse's story</title>

2. .parents: recursively iterates over all ancestor nodes

link = soup.a
for parent in link.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
p
body
html
[document]
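The loop above can be condensed into a list comprehension; note that the last ancestor is the BeautifulSoup object itself, whose .name is the special value '[document]'. A self-contained sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>',
    "html.parser",
)

link = soup.a
# Walk upward and collect each ancestor's name; the final entry is the
# BeautifulSoup object itself, named '[document]'.
names = [parent.name for parent in link.parents]

print(names)  # ['p', 'body', 'html', '[document]']
```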

Third, sibling nodes

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')
print(sibling_soup.prettify())
<html>
 <body>
 <a>
  <b>
  text1
  </b>
  <c>
  text2
  </c>
 </a>
 </body>
</html>

1. .next_sibling and .previous_sibling

sibling_soup.b.next_sibling
<c>text2</c>
sibling_soup.c.previous_sibling
<b>text1</b>

In a real document, a tag's .next_sibling or .previous_sibling is usually a string or whitespace:

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
soup.a.next_sibling # the first <a>'s next_sibling is ',\n'

',\n'
soup.a.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" rel="external nofollow" id="link2">Lacie</a>

2. .next_siblings and .previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))
' and\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
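When you want to skip over the whitespace-and-punctuation strings between tags, the find_next_sibling method accepts the same filters as find. A small sketch on a reduced document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>',
                     "html.parser")

first = soup.find("a", id="link1")
# .next_sibling is the raw ',\n' string between the two tags ...
raw = first.next_sibling
# ... while find_next_sibling('a') skips over it to the next <a> tag.
next_tag = first.find_next_sibling("a")

print(repr(raw))       # ',\n'
print(next_tag["id"])  # link2
```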

Fourth, moving backward and forward

1. .next_element and .previous_element

These point to the next or previous parsed object (a tag or a string), i.e., the next and previous nodes in depth-first parse order:

last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)

;

and they lived at the bottom of a well.
Tillie
last_a_tag.previous_element
' and\n'
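The difference between .next_sibling and .next_element is exactly the tree level: the sibling stays at the same level, while the element follows parse order into the tag's own contents. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link3">Tillie</a>; and they lived</p>', "html.parser")

tag = soup.find("a", id="link3")
# next_sibling is the node at the same tree level ...
sib = tag.next_sibling
# ... but next_element is whatever was parsed immediately after the <a>
# start tag, which is the string *inside* it.
elem = tag.next_element

print(repr(sib))   # '; and they lived'
print(repr(elem))  # 'Tillie'
```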

2. .next_elements and .previous_elements

Use .next_elements and .previous_elements to move forward or backward through the document's content, in the same order it was parsed:

for element in last_a_tag.next_elements:
  print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
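Because .next_elements is a generator in parse order, it freely crosses out of one tag's subtree into the next; a sketch on a reduced document, taking only the first few elements with itertools.islice:

```python
from bs4 import BeautifulSoup
from itertools import islice

soup = BeautifulSoup('<p><a id="link3">Tillie</a>;</p><p class="story">...</p>',
                     "html.parser")

tag = soup.find("a", id="link3")
# Parse order continues past the <a> tag's own string and the ';' into
# the following <p> element and its contents.
following = [repr(e) for e in islice(tag.next_elements, 4)]

print(following)
```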

Origin: blog.csdn.net/haoxun05/article/details/104506265