Python web crawler (4): the Beautiful Soup library

1. Installation

In a command-line window, run the following command to download and install the library:

pip install beautifulsoup4
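To confirm that the install worked, a quick check like the one below can be run. It is only a sanity-check sketch: it imports the package (installed as beautifulsoup4, imported as bs4) and prints its version string. The variable name version is just for illustration.

```python
# The package is installed under the name "beautifulsoup4" but imported as "bs4".
import bs4

# __version__ is a plain string describing the installed release.
version = bs4.__version__
print("Beautiful Soup version:", version)
```

If the import fails with ModuleNotFoundError, pip most likely installed into a different Python environment than the one running the script.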

2. Exercise

>>> import requests

>>> r = requests.get("http://python123.io/ws/demo.html")

>>> r.text

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(demo, "html.parser")  # parse the HTML

>>> print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
  </p>
 </body>
</html>

3. Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree"

The Beautiful Soup library is also called beautifulsoup4 or bs4. It is imported like this:

from bs4 import BeautifulSoup

Parsing a local file

>>> from bs4 import BeautifulSoup

>>> soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
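Passing a bare open() call works in the interpreter, but it leaves the file unclosed and relies on the platform's default encoding. A minimal sketch of a more careful version follows, assuming the file is UTF-8. The temporary directory and the short HTML string are stand-ins for illustration only, so the example does not depend on D://demo.html existing.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><head><title>This is a python demo page</title></head><body></body></html>"

# Write a throwaway demo.html so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# BeautifulSoup accepts an open file handle; the with-block closes it afterwards.
with open(path, encoding="utf-8") as f:
    soup2 = BeautifulSoup(f, "html.parser")

print(soup2.title.string)
```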

Parser             Usage                               Requirement
bs4 HTML parser    BeautifulSoup(mk, 'html.parser')    install the bs4 library
lxml HTML parser   BeautifulSoup(mk, 'lxml')           pip install lxml
lxml XML parser    BeautifulSoup(mk, 'xml')            pip install lxml
html5lib parser    BeautifulSoup(mk, 'html5lib')       pip install html5lib
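Because lxml and html5lib are separate installs, a script that asks for them can fail on a machine that only has bs4. One possible way to handle that is sketched below: try the preferred parser and fall back to the built-in html.parser. The helper name make_soup and its preferred parameter are hypothetical, not part of bs4. FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser library is not installed on this machine.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>demo</b></p>")
print(soup.b.string)
```

Note that different parsers can build slightly different trees from the same broken or partial markup, so a fallback like this suits simple lookups better than code that depends on exact tree structure.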

Basic elements of the Beautiful Soup classes

Basic element     Description
Tag               A tag, the basic unit for organizing information; <> and </> mark its start and end
Name              The tag's name; for <p>...</p> the name is 'p'. Format: <tag>.name
Attributes        The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString   The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment           The comment part of a string inside a tag; a special Comment type

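The demo page contains no HTML comments, so the Comment type is easiest to see on a small hand-written snippet. The sketch below uses a made-up two-tag string (not from the demo page) to show that .string returns a Comment for a comment and a NavigableString for ordinary text. The variable names newsoup, comment and text are chosen only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

comment = newsoup.b.string  # the comment inside <b>
text = newsoup.p.string     # the ordinary text inside <p>

# The comment markers are stripped from .string, so the type is the only
# reliable way to tell a comment apart from normal text.
print(type(comment).__name__, str(comment))
print(type(text).__name__, str(text))
print(isinstance(comment, Comment), isinstance(text, Comment))
```

Since .string drops the <!-- --> markers, code that must skip comments should check the type with isinstance rather than inspect the text.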
>>> import requests

>>> r = requests.get("http://python123.io/ws/demo.html")

>>> demo = r.text

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(demo , "html.parser")

>>> soup.title

<title>This is a python demo page</title>

>>> tag = soup.a

>>> tag

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

>>> tag.name

'a'

>>> tag.parent.name

'p'

>>> tag.parent.parent.name

'body'

>>> tag.attrs

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

>>> tag.attrs['class']

['py1']

>>> type(tag.attrs)

<class 'dict'>

>>> type(tag)

<class 'bs4.element.Tag'>

>>> tag.string

'Basic Python'

>>> soup.p

<p class="title"><b>The demo python introduces several python courses.</b></p>

>>> soup.p.string

'The demo python introduces several python courses.'

>>> type(soup.p.string)

<class 'bs4.element.NavigableString'>
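One caveat about .string is worth seeing in code: it only returns text when a tag has a single child. The sketch below uses a shortened copy of the demo page's second paragraph (the middle sentence is omitted for brevity) to show .string returning None for a tag with several children, with get_text() as an alternative. It also shows tag.get() as a lookup that does not raise for a missing attribute.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "course" paragraph from the demo page.
demo = (
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(demo, "html.parser")

course = soup.p
# .string is None here because this <p> has several children (text and <a> tags).
print(course.string)
# get_text() concatenates all text found beneath the tag.
print(course.get_text())

tag = soup.a
# tag["href"] raises KeyError if the attribute is missing; tag.get() returns None instead.
print(tag["href"])
print(tag.get("title"))
```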

 


Origin blog.csdn.net/weixin_40431584/article/details/89066394