Python web crawler (4) - The Beautiful Soup library

1. Install

In a command-line window, run the following command to download and install the library:

pip install beautifulsoup4
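
To confirm that the install worked, a quick check along these lines can help. It is only a sketch: it assumes the package was installed into the same Python environment that runs the script, and the variable name installed_version is just for illustration.

```python
# Quick check that the package installed by pip is importable.
# Note: the pip package is called beautifulsoup4, but the import name is bs4.
import bs4

# __version__ holds the version string of the installed package.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package for a different interpreter.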

2. Exercise

>>> import requests

>>> r = requests.get("http://python123.io/ws/demo.html")

>>> r.text

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

>>> demo = r.text

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(demo, "html.parser")  # parse the HTML

>>> print(soup.prettify())

<html>

 <head>

  <title>

   This is a python demo page

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The demo python introduces several python courses.

   </b>

  </p>

  <p class="course">

   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">

    Basic Python

   </a>

   and

   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">

    Advanced Python

   </a>

  </p>

 </body>

</html>

3. Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree"

The Beautiful Soup library is also called beautifulsoup4 or bs4. It is usually imported like this:

from bs4 import BeautifulSoup

Parsing a local file

>>> from bs4 import BeautifulSoup

>>> soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
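
The one-liner above leaves the file handle open. A slightly more careful variant, sketched below, uses a with-block so the file is closed after parsing. To stay self-contained it writes a small stand-in page to a temporary directory instead of reading D://demo.html; the temporary path and the shortened HTML are assumptions made for this example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# A shortened stand-in for the demo page (assumption for this example).
html = "<html><head><title>This is a python demo page</title></head><body></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")

    # Write a throwaway copy of the page to disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # Pass the open file object to BeautifulSoup; the with-block
    # closes the handle as soon as parsing is finished.
    with open(path, encoding="utf-8") as f:
        soup2 = BeautifulSoup(f, "html.parser")

# The parsed tree stays usable after the file is closed.
print(soup2.title.string)
```

Stating the encoding explicitly avoids depending on the platform default, which matters for pages that contain non-ASCII text.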

Parser	Usage	Requirement
bs4 HTML parser	BeautifulSoup(mk, 'html.parser')	Install the bs4 library
lxml HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
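
Since the lxml and html5lib parsers need an extra install, a script that should run everywhere may want a fallback. The sketch below is one way to do that, not something from the original post: the helper name make_soup is hypothetical, and it relies on Beautiful Soup raising FeatureNotFound when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><head><title>This is a python demo page</title></head></html>")
print(soup.title.string)
```

Different parsers can build slightly different trees from malformed HTML, so a fallback like this is safest on well-formed pages.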

Basic element classes in Beautiful Soup

Basic element	Description
Tag	A tag, the basic unit for organizing information; <> and </> mark its start and end
Name	The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The attributes of the tag, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside the tag, that is, the text between <> and </>. Format: <tag>.string
Comment	The comment part of a string inside a tag; a special Comment type
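
The demo page contains no HTML comments, so the Comment type never shows up in the session that follows. The sketch below uses a small made-up snippet (an assumption, not part of the demo page) to show how a Comment differs from an ordinary NavigableString.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup: one tag holding a comment, one holding plain text.
markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string  # the only child of <b> is a comment
text = soup.p.string     # the only child of <p> is plain text

print(type(soup.p).__name__, soup.p.name)
print(type(comment).__name__, repr(str(comment)))
print(type(text).__name__, repr(str(text)))

# .string drops the <!-- --> markers, so both values look like plain
# text when printed. Checking the type is the way to tell them apart.
is_comment = isinstance(comment, Comment)
```

Comment is a subclass of NavigableString, so code that extracts visible text usually needs an explicit isinstance check to skip comments.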

>>> import requests

>>> r = requests.get("http://python123.io/ws/demo.html")

>>> demo = r.text

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(demo , "html.parser")

>>> soup.title

<title>This is a python demo page</title>

>>> tag = soup.a

>>> tag

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

>>> tag.name

'a'

>>> tag.parent.name

'p'

>>> tag.parent.parent.name

'body'

>>> tag.attrs

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']

['py1']

>>> type(tag.attrs)

<class 'dict'>

>>> type(tag)

<class 'bs4.element.Tag'>

>>> tag.string

'Basic Python'

>>> soup.p

<p class="title"><b>The demo python introduces several python courses.</b></p>

>>> soup.p.string

'The demo python introduces several python courses.'

>>> type(soup.p.string)

<class 'bs4.element.NavigableString'>
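
In the session above, .string works because each tag holds a single piece of text. When a tag has several children, .string returns None and get_text() is one way to collect the text instead. The sketch below illustrates this with a shortened version of the demo page's second paragraph (the shortening is an assumption for brevity), together with two ways to read an attribute.

```python
from bs4 import BeautifulSoup

# Shortened version of the "course" paragraph from the demo page.
markup = (
    '<p class="course">Learn '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(markup, "html.parser")

paragraph = soup.p  # first <p> in the document
link = soup.a       # first <a> in the document

# <p> has several children (text, <a>, text, <a>, text), so .string is None.
print(paragraph.string)

# get_text() joins the text of all descendants.
print(paragraph.get_text())

# tag["name"] raises KeyError for a missing attribute; tag.get("name") returns None.
print(link["href"])
print(link.get("title"))
```

Note that soup.p and soup.a only return the first matching tag, which is why the session above always shows the "Basic Python" link.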

Rookie ambition
