Beautiful Soup library

Beautiful Soup: delicious soup

Beautiful Soup is an excellent third-party Python library.

It parses HTML and XML documents and extracts the relevant information from them.

Beautiful Soup can parse any markup you give it into a tree that you can then crawl.

Principle of use: treat any document you hand it as a pot of soup, and let Beautiful Soup cook (parse) that pot for you.

First, the installation:

pip3 install beautifulsoup4

An HTML page is a body of information wrapped in tags, which are written in angle brackets.

>>> import requests
>>> r=requests.get("https://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text

>>> from bs4 import BeautifulSoup  # bs4 is the short package name of beautifulsoup4; this imports the BeautifulSoup class from it

# The soup variable represents the demo page after parsing

>>> soup = BeautifulSoup(demo, "html.parser")  # The first argument is the HTML to parse; it can be a literal such as '<p>data</p>' or any variable. The second argument names the parser to use (here html.parser parses demo as HTML)

>>> print(soup.prettify())

The BeautifulSoup library has successfully parsed the demo page we gave it.
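To try the same steps without network access, a minimal sketch can parse an inline string instead of r.text. The HTML string below is a shortened, hypothetical stand-in for the demo page, not its real content.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the demo page (no network needed).
html = (
    "<html><head><title>This is a python demo page</title></head>"
    '<body><p class="title"><b>Demo</b></p></body></html>'
)

# Parse the string with the parser bundled with Python.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
formatted = soup.prettify()      # the tree rendered with one node per line

print(title_text)
print(formatted)
```

Any string containing markup can be passed the same way, which makes small experiments easy to repeat.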

Second, the basic elements of the Beautiful Soup library

Importing the BeautifulSoup library

The BeautifulSoup library is also called the beautifulsoup4 library or the bs4 library.

from bs4 import BeautifulSoup (imports the BeautifulSoup type from bs4)

import bs4 (used when you need to check the types of values produced by the library)

The BeautifulSoup library parses HTML and XML documents. Each such document corresponds to a tag tree, and the BeautifulSoup class represents that tag tree: a tag tree (which can be thought of as a string) is converted into a BeautifulSoup object. In practice the three can be treated as equivalent: HTML document <----------> tag tree <----------> BeautifulSoup class

The BeautifulSoup class turns the tag tree into a variable, so processing that variable is processing the tag tree.

Simply put, a BeautifulSoup object corresponds to the entire content of one HTML/XML document.
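As a small illustration of why both import forms are useful, the sketch below uses import bs4 for type checks on the parsed objects. The markup string is a made-up example.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
soup = BeautifulSoup("<p class='title'><b>Demo</b></p>", "html.parser")

# The parsed document is a BeautifulSoup object ...
is_soup = isinstance(soup, bs4.BeautifulSoup)
# ... and each element inside it is a Tag.
is_tag = isinstance(soup.p, bs4.element.Tag)
# BeautifulSoup is itself a subclass of Tag, so the whole document
# can be navigated like one big tag.
soup_is_tag = isinstance(soup, bs4.element.Tag)

print(is_soup, is_tag, soup_is_tag)
```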

 

Parsers of the Beautiful Soup library

Parser              Usage                              Requirement
bs4 HTML parser     BeautifulSoup(mk, 'html.parser')   install the bs4 library
lxml HTML parser    BeautifulSoup(mk, 'lxml')          pip install lxml
lxml XML parser     BeautifulSoup(mk, 'xml')           pip install lxml
html5lib parser     BeautifulSoup(mk, 'html5lib')      pip install html5lib
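Because lxml and html5lib are optional installs, one possible pattern is to try the preferred parser and fall back to html.parser. The helper name make_soup and the preference order are assumptions made for this sketch; bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Different parsers can build slightly different trees from the same broken markup, so fixing one parser per project keeps results reproducible.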

 

Basic elements of the BeautifulSoup class

Basic element     Description
Tag               A tag, the basic unit for organizing information; <> and </> mark its beginning and end
Name              The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes        The attributes of the tag, organized as a dictionary. Format: <tag>.attrs; a tag with no attributes returns an empty dictionary
NavigableString   The non-attribute string inside the tag, i.e. the string between <> and </>. Format: <tag>.string
Comment           The comment part of a string inside a tag; a special Comment type

View the page title:

>>> soup.title
<title>This is a python demo page</title>

>>> tag = soup.a  # the page has several a tags; soup.a returns only the first one
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

>>> soup.a.name  # get the name of the a tag; it is a string
'a'

>>> soup.a.parent.name  # get the name of the a tag's parent tag
'p'

>>> soup.a.parent.parent.name
'body'

>>> tag = soup.a
>>> tag.attrs  # get the tag's attributes as a dictionary
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
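Besides tag.attrs, attributes can also be read directly from the tag. The sketch below re-creates the first a tag from an inline string so it runs on its own; the 'title' attribute it looks up is deliberately absent, to show the missing-attribute case.

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]                 # dictionary-style access; KeyError if missing
classes = tag["class"]             # class is multi-valued, so this is a list
missing = tag.get("title")         # .get returns None when the attribute is absent
fallback = tag.get("title", "")    # or a default of your choice
has_id = tag.has_attr("id")        # membership test

print(href, classes, missing, has_id)
```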

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'

>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string  # the b tag is not printed, which shows that .string can reach across nested tag levels
'The demo python introduces several python courses.'
>>> type(soup.p.string)  # a type defined in the bs4 library
<class 'bs4.element.NavigableString'>

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")  # <!-- marks the start of a comment
>>> newsoup.b.string  # .string gives no hint that this text came from a comment, so check the type when it matters
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
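Since .string returns comment text and ordinary text in the same way, a type check is one way to tell them apart. The helper visible_text below is a hypothetical name introduced for this sketch.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_text(tag):
    """Hypothetical helper: return tag.string unless it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


b_text = visible_text(newsoup.b)   # the string is a Comment, so it is skipped
p_text = visible_text(newsoup.p)   # ordinary NavigableString

print(b_text, p_text)
```

Comment is a subclass of NavigableString, so the check has to test for Comment specifically rather than for NavigableString.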

Third, traversing HTML content with the bs4 library

Basic HTML format

Downward traversal:

  Attribute      Description
  .contents      List of child nodes; all of <tag>'s child nodes are stored in a list
  .children      Iterator over child nodes; like .contents, used in a for loop over the children
  .descendants   Iterator over descendant nodes; covers all descendants, used in a for loop

 

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents  # a tag's child nodes include not only tag nodes but also string nodes such as the newline '\n', which is also a child of the body tag
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

 

for child in soup.body.children:  # iterate over the direct children
    print(child)

for child in soup.body.descendants:  # iterate over all descendants
    print(child)
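Because child nodes mix tags and strings, a common need is to keep only the tags. The sketch below does this for both .children and .descendants on a shortened, hypothetical version of the demo body.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the demo page body.
html = (
    '<body>\n<p class="title"><b>The demo</b></p>\n'
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>\n</body>'
)
soup = BeautifulSoup(html, "html.parser")

n_children = len(soup.body.contents)  # tags plus the '\n' strings between them

# Keep only Tag nodes, skipping NavigableString nodes such as '\n'.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(n_children, child_tags, descendant_tags)
```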

Upward traversal:

  Attribute   Description
  .parent     The parent tag of a node
  .parents    Iterator over a node's ancestor tags, used in a for loop over the ancestors

>>> soup = BeautifulSoup(demo, 'html.parser')
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]

# Traversing all the ancestors of the a tag eventually reaches the soup object itself, whose name is '[document]'. soup has no parent of its own (soup.parent is None), so there is no .name to read beyond it; the check for None guards against that case.
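The ancestor chain can also be collected into a list in one step. This is a minimal sketch on a made-up document containing a single a tag.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration.
soup = BeautifulSoup(
    '<html><body><p><a id="link1">Basic Python</a></p></body></html>',
    "html.parser",
)

# Names of every ancestor of the a tag, from nearest to farthest.
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object is the root of the tree and has no parent.
root_parent = soup.parent

print(ancestor_names, root_parent)
```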

Parallel traversal:

  Attribute            Description
  .next_sibling        The next sibling node in HTML text order
  .previous_sibling    The previous sibling node in HTML text order
  .next_siblings       Iterator over all following sibling nodes in HTML text order
  .previous_siblings   Iterator over all preceding sibling nodes in HTML text order

# Note: although the tag tree is organized around tags, the NavigableString objects between tags are also nodes of the tree. Any node's siblings and children may therefore be NavigableString objects, so do not assume that a node reached by parallel traversal is a tag.
>>> soup.a.next_sibling  # the next sibling of the a tag is the string ' and '
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling  # no output: there is no earlier sibling, so the result is None
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

# Iterate over the following siblings
>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.

# Iterate over the preceding siblings
>>> for sibling in soup.a.previous_siblings:
...     print(sibling)
...
...
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

>>>
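Since siblings can be strings as well as tags, the sketch below shows two ways to reach only the sibling tags: filtering by type, and bs4's find_next_sibling method. The paragraph is a shortened, hypothetical version of the demo text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the "course" paragraph.
html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling  # the string ' and ', not a tag

# Option 1: filter the sibling iterator by type.
next_tags = [s for s in first.next_siblings if isinstance(s, Tag)]

# Option 2: let bs4 skip the strings and find the next a tag.
next_a = first.find_next_sibling("a")

print(repr(raw_next), [t["id"] for t in next_tags], next_a["id"])
```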

Fourth, HTML formatting and encoding with the bs4 library

>>> soup.prettify()  # a newline \n is added after every tag
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())  # every tag and its content is shown on its own line
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>>

The prettify method adds line breaks around the tags and content of the HTML text. It can also be applied to a single tag:

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

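The bs4 library converts every HTML file or string it reads to Unicode and writes output as UTF-8 by default. UTF-8 is an internationally used standard encoding that supports Chinese and other non-English languages well. Because Python 3.x uses UTF-8 as its default encoding, parsing with the bs4 library raises no encoding obstacles.

>>> soup = BeautifulSoup("<p>中文</p>","html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>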
>>>
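The same conversion can be seen with byte input. In the sketch below the markup is first encoded as GBK (an assumption chosen for illustration), then decoded by Beautiful Soup through its from_encoding argument and written back out as UTF-8.

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding; GBK is an assumption for this example.
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                  # a Unicode string inside the tree
utf8_bytes = soup.p.encode("utf-8")   # the tag serialized as UTF-8 bytes

print(text)
print(utf8_bytes)
```

When from_encoding is omitted, Beautiful Soup tries to detect the encoding itself, which can guess wrong on short inputs.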

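Summary: BeautifulSoup is a library for parsing HTML and XML documents. Use from bs4 import BeautifulSoup to import the BeautifulSoup type, then construct it with a parser to parse markup into a variable. That variable is a BeautifulSoup object, which is used to extract and traverse information.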

Origin www.cnblogs.com/suitcases/p/11200898.html