Python crawler tools | Beautiful Soup 4: traversing documents

Description of Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying a parse tree. Beautiful Soup can save you hours or even days of work.
The above description is taken from the official website.

Beautiful Soup installation

$ easy_install beautifulsoup4
# or
$ pip install beautifulsoup4

Install the parser

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, such as lxml and html5lib. You can install lxml in any of the following three ways:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative is html5lib, a parser implemented in pure Python that parses documents the same way a web browser does. You can install html5lib in any of the following ways:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

Each parser has its advantages and disadvantages (the official documentation includes a comparison table). lxml is recommended as the parser because it is more efficient. In Python 2 versions before 2.7.3, and in Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parsing built into those versions of the standard library is not stable enough.

Note: if an HTML or XML document is not well-formed, different parsers may return different results.
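A quick illustration of this point, using only the stdlib html.parser (lxml and html5lib would each repair the same markup differently, as the official comparison table shows):

```python
from bs4 import BeautifulSoup

# "<a></p>" is malformed: the </p> close tag has no matching open tag.
# html.parser simply drops the stray close tag; lxml and html5lib
# would each produce a different repaired tree.
soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)
```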

Using Beautiful Soup

Beautiful Soup is simple to use: pass a document (a string or an open file handle) to the BeautifulSoup constructor to get a document object, which you can then operate on. In practice, most documents are fetched by a crawler, so Beautiful Soup pairs well with the requests library.

from bs4 import BeautifulSoup

# pass either an open file handle or a string; naming a parser explicitly
# avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(open("index.html"), "html.parser")

soup = BeautifulSoup("<html>data</html>", "html.parser")
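As a sketch of the crawler pairing mentioned above (the URL and the helper name extract_links are illustrative, not part of the original):

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Parse an HTML string and return the href of every <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]

# Typical crawler usage (needs the third-party requests package and
# network access, so it is shown commented out):
# import requests
# html = requests.get("http://example.com").text
# print(extract_links(html))

# The same function works on any HTML string:
sample = '<p><a href="/one">one</a> <a href="/two">two</a></p>'
print(extract_links(sample))  # ['/one', '/two']
```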

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
<html><head></head><body>Sacré bleu!</body></html>

Then, Beautiful Soup chooses the most suitable parser to parse the document. If you specify a parser manually, Beautiful Soup uses that parser instead.

Types of objects

Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four classes: Tag, NavigableString, BeautifulSoup, and Comment.
Specifically:

  • Tag corresponds to an HTML tag, such as div, p, or h1–h6. It is the most frequently used object.

  • NavigableString is the text inside a tag; the name literally means a navigable (traversable) string.

  • BeautifulSoup represents the entire document and can mostly be treated as a Tag.

  • Comment is a special kind of NavigableString; its output does not include the comment markers.
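A small document is enough to see all four object types at once (this example is an added illustration, not from the original):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!--note--></p>", "html.parser")
tag = soup.p                 # Tag: the <p> element
string = tag.contents[0]     # NavigableString: the text "text"
comment = tag.contents[1]    # Comment: the "<!--note-->" part

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(string).__name__)   # NavigableString
print(type(comment).__name__)  # Comment
```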

Tag

A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Important attributes of a tag

The most important attributes of a tag are its name and attributes.

Name

Every tag has a name, which can be accessed via .name:

tag.name
# u'b'

If you change a tag's name, the change is reflected in all HTML generated by the current Beautiful Soup object:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest". Tag attributes are accessed the same way as dictionary keys, and can be added, deleted, or modified:

tag['class']
# u'boldest'

# tag attributes can be added, deleted, or modified
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

You can also fetch all of a tag's attributes at once as a dictionary via .attrs:

tag.attrs
# {u'class': u'boldest'}

A string usually sits inside a tag. Beautiful Soup uses the NavigableString class to wrap such strings:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

A NavigableString behaves like a Python Unicode string, and additionally supports some of the features for traversing and searching the document tree. You can convert a NavigableString directly to a plain Unicode string with unicode():

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

BeautifulSoup

The BeautifulSoup object represents the entire document. Most of the time you can treat it as a Tag object: it supports most of the methods for traversing and searching the document tree.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name or attributes. But since it is sometimes convenient to look at .name, the BeautifulSoup object is given a special .name property whose value is "[document]".
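This is easy to verify, and it also shows the object supporting a Tag-style search (an added illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")
print(soup.name)       # [document]
print(soup.find("p"))  # <p>hi</p>, found with a Tag-style search
```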

Comments and special strings:
Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but there are a few special objects, such as the comment sections of a document. These require the Comment object, which outputs the comment section in a special format:

markup = "<b><!--This is a comment--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>


# the Comment object outputs the comment section in a special format
comment
# u'This is a comment'

Traverse the document tree

First, define a string of HTML text; the analysis below uses it:

html_doc = """
<html><head><title>index</title></head>

<p class="title"><b>商城</b></p>

<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a> 
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>

<p class="welcome">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")

Child nodes

A Tag may contain multiple strings or other tags, these are the child nodes of this Tag. Beautiful Soup provides many operations and attributes to traverse the child nodes.

Note: string nodes in Beautiful Soup do not support these attributes, because a string has no child nodes.

Getting a tag by name

The easiest way to navigate the document tree is to say the name of the tag you want. To get the <head> tag, just use soup.head:

soup.head
# <head><title>index</title></head>

soup.title
# <title>index</title>

You can chain these lookups to drill into the document tree. For example, to get the first <b> tag inside the <body> tag:

soup.body.b
# <b>商城</b>

Dotted attribute access only returns the first tag with the given name:

soup.a
# <a href="http://cityShop.com/elsie" class="city" id="home">home</a>

If you want all the <a> tags, or anything more than the first tag with a given name, use find_all():

soup.find_all('a')
# [<a href="http://cityShop.com/elsie" class="city" id="home">home</a>,
#  <a href="http://cityShop.com/lacie" class="city" id="design">design</a>,
#  <a href="http://cityShop.com/tillie" class="city" id="products">products</a>]

.contents and .children

A tag's .contents attribute returns its child nodes as a list:

head_tag = soup.head
head_tag
# <head><title>index</title></head>

head_tag.contents
# [<title>index</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>index</title>
title_tag.contents
# [u'index']

The .children attribute is a generator that yields a tag's direct children, which can be iterated with a loop:

for child in title_tag.children:
    print(child)
    # index

.children and .contents only include a tag's direct children, not its descendants. For example, the <head> tag has a single direct child, <title>, but <title> itself contains a string child, "index"; that string is also a descendant of <head>. The .descendants attribute recursively iterates over all of a tag's descendants:

for child in head_tag.descendants:
    print(child)
    # <title>index</title>
    # index
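Counting children versus descendants makes the difference concrete (an added illustration; html.parser is used so no extra tags are inserted during parsing):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>index</title></head>", "html.parser")
head = soup.head

print(len(list(head.children)))     # 1: the single direct child, <title>
print(len(list(head.descendants)))  # 2: <title> plus its string "index"
```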

Parent node

Every tag or string has a parent node: the tag that contains it.

.parent

The .parent attribute gets an element's parent. In the example document, the parent of <title> is <head>, the parent of <html> is the BeautifulSoup object, and the .parent of the BeautifulSoup object itself is None.

title_tag = soup.title
title_tag
# <title>index</title>
title_tag.parent
# <head><title>index</title></head>


# the document's title string also has a parent: the <title> tag
title_tag.string.parent
# <title>index</title>


# the parent of a top-level node such as <html> is the BeautifulSoup object:
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>


# the .parent of the BeautifulSoup object is None:
print(soup.parent)
# None

.parents

The .parents attribute recursively iterates over all of an element's ancestors. The following example uses .parents to walk from an <a> tag up to the root of the document:

link = soup.a
link
# <a href="http://cityShop.com/elsie" class="city" id="home">home</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

# Result:
# p
# body
# html
# [document]
# None

Sibling nodes

.next_sibling and .previous_sibling

Sibling nodes are elements at the same level of the tree. In the document tree, use the .next_sibling and .previous_sibling attributes to move between them:

soup = BeautifulSoup(html_doc, "lxml")
p_tag = soup.p

print(p_tag.next_sibling)
print(p_tag.next_sibling.next_sibling)

# Output:

<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a> 
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>
The first next_sibling of p is the newline character between the two p tags; calling .next_sibling twice skips over it.
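The whitespace-as-sibling behavior is easy to demonstrate on a two-paragraph snippet (an added illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>first</p>\n<p>second</p>", "html.parser")
first = soup.p

print(repr(first.next_sibling))         # '\n', the newline between the tags
print(first.next_sibling.next_sibling)  # <p>second</p>
```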

.next_siblings and .previous_siblings

The .next_siblings and .previous_siblings attributes let you iterate over all of the current node's siblings:

soup = BeautifulSoup(html_doc, "lxml")
# start from the last <p> so there are previous siblings to visit
last_p = soup.find_all("p")[-1]

for sibling in last_p.previous_siblings:
    print(sibling)

# Output (newline nodes between the tags are printed as blank lines):

<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a> 
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>

<p class="title"><b>商城</b></p>

Forward and backward

The .next_element and .previous_element attributes return the object that was parsed immediately after or before a given tag. Note that this differs from siblings: siblings are nodes that share the same parent, while next/previous elements follow the order in which the document was parsed.
For example, in html_doc the sibling of head is body (ignoring the newline between them), because both share the parent html, but the element after head is title. That is, ignoring whitespace, soup.head.next_sibling is body, while soup.head.next_element is title.

soup = BeautifulSoup(html_doc, "lxml")

head_tag=soup.head
print(head_tag.next_element)

title_tag=soup.title
print(title_tag.next_element)

# Output:
<title>index</title>
index

Also note that the element after title is not body but the content inside the title tag, because the parse order is: open the title tag, then parse its content, and finally close the title tag.
The .previous_element attribute is the exact opposite of .next_element: it points to the object parsed immediately before the current one:

# find_all("a")[-1] gives the last <a> tag in the document
last_a_tag = soup.find_all("a")[-1]
last_a_tag.previous_element
# u' \n'
last_a_tag.previous_element.next_element
# <a href="http://cityShop.com/tillie" class="city" id="products">products</a>

Similarly, .next_elements and .previous_elements iterate through the document in parse order. Newline characters occupy positions in the parse order too, just as they do when iterating over siblings.

for element in last_a_tag.next_elements:
    print(repr(element))
# u'products'
# u'\n欢迎来参观.'
# u'\n\n'
# <p class="welcome">...</p>
# u'...'
# u'\n'
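To make the sibling-versus-parse-order distinction concrete, here is an added illustration on a whitespace-free snippet, so .next_sibling is a tag rather than a newline:

```python
from bs4 import BeautifulSoup

# no whitespace between tags, so .next_sibling is a tag rather than a newline
doc = "<html><head><title>t</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

print(soup.head.next_sibling.name)  # body: siblings share the parent <html>
print(soup.head.next_element.name)  # title: the next object in parse order
```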

Search the document tree

Beautiful Soup defines many search methods; the most important and commonly used are find() and find_all().
Define a document instance:

html_doc = """
<html><head><title>index</title></head>

<p class="title"><b>商城首页</b></p>

<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a> 
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")

Use a method such as find_all() to search for the content you want:

soup.find_all('b')

# Result
# [<b>商城首页</b>]

If you pass in a regular expression, Beautiful Soup uses its match() method to filter tag names. The following example finds all tags whose names start with b, which means both the <body> and <b> tags are found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

Find all tags whose names contain "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
    
# Result
# html
# title

If you pass in a list, Beautiful Soup returns content matching any element of the list. The following code finds all the <a> and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>商城首页</b>,
#  <a href="http://cityShop.com/elsie" class="city" id="home">home</a>,
#  <a href="http://cityShop.com/lacie" class="city" id="design">design</a>,
#  <a href="http://cityShop.com/tillie" class="city" id="products">products</a>]
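Besides names, regular expressions, and lists, find_all() also accepts attribute filters as keyword arguments, covered in the full documentation linked below. A brief sketch on a small snippet (class is a reserved word in Python, so the keyword is spelled class_):

```python
import re
from bs4 import BeautifulSoup

html = ('<a class="city" href="http://cityShop.com/elsie">home</a>'
        '<a class="other" href="/x">x</a>')
soup = BeautifulSoup(html, "html.parser")

# filter by CSS class (class_ because class is a reserved word)
print(soup.find_all("a", class_="city"))
# attribute values can also be matched with a regular expression
print(soup.find_all(href=re.compile("cityShop")))
```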

More detailed documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#

Origin blog.csdn.net/weixin_43853746/article/details/108015080