3-3 Introduction to BeautifulSoup

Download: https://www.crummy.com/software/BeautifulSoup/#Download

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

3-4 Using BeautifulSoup

pip install beautifulsoup4
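
The package is installed under the name beautifulsoup4 but imported as bs4. A quick way to confirm the install worked might look like the sketch below; the one-tag snippet is made up for the check and is not part of the tutorial's document.

```python
import bs4
from bs4 import BeautifulSoup

# The distribution is called "beautifulsoup4", but the module name is "bs4".
print(bs4.__version__)

# Parse a tiny, made-up snippet to confirm the library is usable.
check = BeautifulSoup("<b>ok</b>", "html.parser").b.string
print(check)  # ok
```

If the import fails, the package was most likely installed into a different interpreter or virtual environment than the one running the script.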

C:\Users\Administrator\PycharmProjects\python_data_collection\python_data_collection\beautiful.py

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


from bs4 import BeautifulSoup as bs
soup = bs(html_doc)
print(soup.prettify())

Output

C:/Users/Administrator/PycharmProjects/python_data_collection/python_data_collection/beautiful.py:17: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 17 of the file C:/Users/Administrator/PycharmProjects/python_data_collection/python_data_collection/beautiful.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

  soup = bs(html_doc)

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

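The GuessedAtParserWarning in the output above appears only when no parser is named. The sketch below is one way to see that for yourself, assuming a bs4 version recent enough to emit this warning class; the helper name parser_warnings and the short test markup are made up for illustration.

```python
import warnings
from bs4 import BeautifulSoup

html = "<p>hello</p>"  # made-up markup, just enough to build a soup


def parser_warnings(**kwargs):
    """Build a soup with the given keyword arguments and return the
    names of any warning categories raised while doing so."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        BeautifulSoup(html, **kwargs)
    return [w.category.__name__ for w in caught]


implicit = parser_warnings()                        # no parser named
explicit = parser_warnings(features="html.parser")  # parser named

print(implicit)
print(explicit)
```

Naming the parser also keeps the result consistent across machines, since the parser bs4 would otherwise guess depends on what is installed.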

C:\Users\Administrator\PycharmProjects\python_data_collection\python_data_collection\beautiful.py

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


from bs4 import BeautifulSoup as bs

# Naming the parser explicitly avoids GuessedAtParserWarning.
soup = bs(html_doc, "html.parser")
print(soup.prettify())

Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


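from bs4 import BeautifulSoup as bs
import re

soup = bs(html_doc, "html.parser")
# print(soup.prettify())

# Uncomment any of the lines below to try other lookups.
# print(soup.title)                        # the whole <title> tag
# print(soup.title.string)                 # only the text inside <title>
# print(soup.a)                            # the first <a> tag
# print(soup.find(id="link2"))             # the tag whose id is "link2"
# print(soup.find(id="link2").string)      # its text via .string
# print(soup.find(id="link2").get_text())  # its text via get_text()
# print(soup.find_all("a"))                # a list of every <a> tag

# for link in soup.find_all("a"):
#     print(link.string)

# print(soup.find("p", {"class": "story"}))             # first <p class="story">
# print(soup.find("p", {"class": "story"}).get_text())  # its text content

# Tags whose name starts with "b" (here: body and b).
# for tag in soup.find_all(re.compile("^b")):
#     print(tag.name)

# Every <a> tag whose href starts with http://example.com/
# (find_all is the current name of the legacy findAll alias).
data = soup.find_all("a", href=re.compile(r"^http://example\.com/"))
print(data)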

Output

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
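
The list above holds whole Tag objects. When scraping, the attribute values are usually what is wanted, so a natural next step might be to pull out each href and link text. The sketch below uses a shortened copy of the same document; the variable names hrefs and names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the tutorial's document, keeping only the three links.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all("a", href=re.compile(r"^http://example\.com/"))

# A Tag supports dictionary-style access to its attributes.
hrefs = [a["href"] for a in links]
# get_text() returns the text inside each tag.
names = [a.get_text() for a in links]

print(hrefs)
print(names)
```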

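The commented-out lookups in the script use both .string and get_text(). They agree for a tag that contains a single piece of text, but differ once a tag has several children. A minimal sketch of the difference, using a made-up one-line document rather than the tutorial's html_doc:

```python
from bs4 import BeautifulSoup

# Made-up markup: a paragraph that mixes plain text and two links.
html = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

# class_="story" is the keyword form of {"class": "story"}.
p = soup.find("p", class_="story")
link = soup.find(id="link2")

print(link.string)   # Lacie
print(p.string)      # None, because <p> has more than one child
print(p.get_text())  # Once Elsie and Lacie.
```

So get_text() is the safer choice when a tag may contain nested tags, while .string suits leaf tags such as title or a single link.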
Reposted from: blog.csdn.net/huanglianggu/article/details/114949859