What is BeautifulSoup
In web crawling, we need a tool that can read HTML and pick out the data we want. This is data parsing.
Extracting data means selecting the pieces we need from everything that was parsed.
Parsing and extracting data are both the focus of crawler work and its main difficulty.
How to use BeautifulSoup
Install BeautifulSoup:
pip install BeautifulSoup4
BeautifulSoup() takes two arguments. The first is the text to be parsed; note that it must be a string.
The second argument identifies the parser. Here we use Python's built-in one:
html.parser
import requests  # import the requests library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# fetch the page source; res is a Response object
print(res.status_code)  # check that the request got a proper response
html = res.text  # return the content of res as a string
print(html)  # print the html
import requests
from bs4 import BeautifulSoup
# import the BS library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
print(type(soup))  # <class 'bs4.BeautifulSoup'>
Printing soup gives exactly the same source code we printed earlier with response.text. But although response.text and soup look identical on the surface, they belong to different classes: <class 'str'> and <class 'bs4.BeautifulSoup'>.
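A small illustration of this difference, using a made-up HTML fragment rather than the tutorial's page:

```python
from bs4 import BeautifulSoup

# The printed text is identical, but the two objects belong to different classes.
html = '<html><body><div>hello</div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(type(html))  # <class 'str'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(str(soup) == html)  # True: the text content looks the same
```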
Extract data
We still use BeautifulSoup to extract the data. The knowledge for this step splits into two parts: find() and find_all(), and the Tag object.
find() and find_all() are two methods of the BeautifulSoup object. They match HTML tags and attributes and extract the matching data from the BeautifulSoup object.
The difference is that find() extracts only the first piece of data that meets the requirements, while find_all() extracts all of them.
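The difference can be shown on a minimal fragment with three <div> elements (the tutorial's page has three as well; the fragment here is made up for the example):

```python
from bs4 import BeautifulSoup

html = '<div>first</div><div>second</div><div>third</div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div')         # only the first match
all_divs = soup.find_all('div')  # a ResultSet (list-like) of all matches

print(first.text)     # first
print(len(all_divs))  # 3
```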
localprod.pandateacher.com/python-manu…
In this page's HTML code there are three <div> elements; find() extracts the first one, while find_all() extracts all three.
import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # use find() to extract the first <div> element and store it in item
print(type(item))  # print item's data type
print(item)  # print item
import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')  # use find_all() to extract all the matching data and store it in items
print(type(items))  # print items' data type
print(items)  # print items
Note the class_ in the brackets: it has a trailing underscore to distinguish it from Python's keyword class and avoid conflicts.
Small exercise: crawl the page's three book titles, links, and reviews.
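A quick illustration of the class_ keyword argument, on a made-up fragment:

```python
from bs4 import BeautifulSoup

# class_ has a trailing underscore because class is a Python keyword
html = '<p class="info">A</p><p class="other">B</p><p class="info">C</p>'
soup = BeautifulSoup(html, 'html.parser')

infos = soup.find_all('p', class_='info')
print([p.text for p in infos])  # ['A', 'C']
```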
localprod.pandateacher.com/python-manu…
Another knowledge point: the Tag object.
We usually use the type() function to check a data type. Python is an object-oriented programming language, and only by knowing what an object is can we call the right properties and methods on it.
The data type extracted with find() is the same as before: a Tag object.
We can use Tag.text to get a Tag's text, and Tag['href'] to extract its URL.
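For example, on a made-up <a> tag in the same shape as the tutorial page's book links:

```python
from bs4 import BeautifulSoup

html = '<a class="title" href="https://example.com/book1">Book One</a>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('a', class_='title')
print(type(tag))    # <class 'bs4.element.Tag'>
print(tag.text)     # Book One
print(tag['href'])  # https://example.com/book1
```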
import requests  # import the requests library
from bs4 import BeautifulSoup
# fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# res.status_code  status code
# res.content  binary content
# res.text  html source
# res.encoding  encoding
# parse the data
# soup is a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# soup.find(tag_name, attribute=value)
# soup.find_all(tag_name, attribute=value)
# extract the data: a list of Tag objects
item = soup.find_all('div', class_='books')
for i in item:
    # i is a Tag object, so it also supports find() and find_all(),
    # and the results are Tag objects too: i.find().find().find()
    # lets you drill down level by level
    print(i.find('a', class_='title').text)  # get the tag's text content
    print(i.find('a', class_='title')['href'])  # get the tag's attribute (href)
    print(i.find('p', class_='info').text)  # get the tag's text content
The retrieval process is a bit like walking through a supermarket's aisles, level by level, to reach the snacks you want to buy.
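The level-by-level search can be sketched on a made-up nested fragment: each find() returns a Tag, so you can chain find() calls to walk down the tree.

```python
from bs4 import BeautifulSoup

html = '<div class="books"><h2><a href="#">Title</a></h2></div>'
soup = BeautifulSoup(html, 'html.parser')

# each find() returns a Tag, so the calls can be chained
link = soup.find('div', class_='books').find('h2').find('a')
print(link.text)  # Title
```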
The object being operated on changes along the way: first requests acquires the data, then BeautifulSoup parses it, and then BeautifulSoup extracts the data. The common thread throughout is that we are always operating on objects.
Full version
To recap, the full code:
import requests  # import the requests library
from bs4 import BeautifulSoup
# fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# res.status_code  status code
# res.content  binary content
# res.text  html source
# res.encoding  encoding
# parse the data
# soup is a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# soup.find(tag_name, attribute=value)
# soup.find_all(tag_name, attribute=value)
# extract the data: a list of Tag objects
item = soup.find_all('div', class_='books')
for i in item:
    # i is a Tag object, so it also supports find() and find_all(),
    # and the results are Tag objects too: i.find().find().find()
    # lets you drill down level by level
    print(i.find('a', class_='title').text)  # get the tag's text content
    print(i.find('a', class_='title')['href'])  # get the tag's attribute (href)
    print(i.find('p', class_='info').text)  # get the tag's text content
To sum up
BeautifulSoup parsers

Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(text, "html.parser") | Built into Python; moderate speed; reasonable tolerance of malformed documents | Poor tolerance of malformed documents on Python versions before 2.7.3 and 3.2.2 |
lxml HTML parser | BeautifulSoup(text, "lxml") | Fast; strong tolerance of malformed documents | Requires the C library to be installed |
lxml XML parser | BeautifulSoup(text, "xml") | The only supported XML parser; fast | Requires the C library to be installed |
html5lib | BeautifulSoup(text, "html5lib") | Best fault tolerance; parses documents into valid HTML5 | Slow; requires an external Python dependency |
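Fault tolerance means the parser copes with sloppy markup. A small sketch with the built-in html.parser, on a made-up fragment whose <li> tags are never closed:

```python
from bs4 import BeautifulSoup

# the <li> tags are never closed, yet both list items are still found
broken = '<ul><li>one<li>two</ul>'
soup = BeautifulSoup(broken, 'html.parser')

items = soup.find_all('li')
print(len(items))  # 2
```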
Job 1: crawl the articles and save them (one HTML file per article)
wordpress-edu-3autumn.localprod.forc.work
Job 2: crawl the book names and corresponding prices under each category, and save them to books.txt
books.toscrape.com
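A minimal sketch of the parsing step for Job 2. The selectors assume books.toscrape.com's listing markup (each book in an article with class product_pod, the title in an a tag's title attribute inside h3, the price in a p with class price_color); an inline fragment in that shape stands in for the real page so the logic can be shown without a network request.

```python
from bs4 import BeautifulSoup

# stand-in fragment shaped like a books.toscrape.com listing entry
html = '''
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="#">A Light in ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
'''
soup = BeautifulSoup(html, 'html.parser')

lines = []
for book in soup.find_all('article', class_='product_pod'):
    title = book.find('h3').find('a')['title']  # the full title lives in the title attribute
    price = book.find('p', class_='price_color').text
    lines.append(title + '\t' + price)

print(lines)
```

To finish the job, fetch each category page with requests first, then write the collected lines to books.txt with open('books.txt', 'w', encoding='utf-8').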
The final effect ...
Quick Jump:
Cat brother teaches you to write a crawler 000 -- getting started.md
Cat brother teaches you to write a crawler 001 -- the print() function and variables.md
Cat brother teaches you to write a crawler 002 -- homework - print Pikachu.md
Cat brother teaches you to write a crawler 003 -- data type conversion.md
Cat brother teaches you to write a crawler 004 -- data type conversion - small exercise.md
Cat brother teaches you to write a crawler 005 -- data type conversion - small homework.md
Cat brother teaches you to write a crawler 006 -- conditionals and nested conditionals.md
Cat brother teaches you to write a crawler 007 -- conditionals and nested conditionals - small homework.md
Cat brother teaches you to write a crawler 008 -- the input() function.md
Cat brother teaches you to write a crawler 009 -- the input() function - AI Xiao Ai.md
Cat brother teaches you to write a crawler 010 -- lists, dictionaries, loops.md
Cat brother teaches you to write a crawler 011 -- lists, dictionaries, loops - small homework.md
Cat brother teaches you to write a crawler 012 -- booleans and the four kinds of statements.md
Cat brother teaches you to write a crawler 013 -- booleans and the four kinds of statements - small homework.md
Cat brother teaches you to write a crawler 014 -- the PK game.md
Cat brother teaches you to write a crawler 015 -- the PK game (new revision).md
Cat brother teaches you to write a crawler 016 -- functions.md
Cat brother teaches you to write a crawler 017 -- functions - small homework.md
Cat brother teaches you to write a crawler 018 -- debug.md
Cat brother teaches you to write a crawler 019 -- debug - homework.md
Cat brother teaches you to write a crawler 020 -- classes and objects (part 1).md
Cat brother teaches you to write a crawler 021 -- classes and objects (part 1) - homework.md
Cat brother teaches you to write a crawler 022 -- classes and objects (part 2).md
Cat brother teaches you to write a crawler 023 -- classes and objects (part 2) - homework.md
Cat brother teaches you to write a crawler 024 -- encoding && decoding.md
Cat brother teaches you to write a crawler 025 -- encoding && decoding - small homework.md
Cat brother teaches you to write a crawler 026 -- modules.md
Cat brother teaches you to write a crawler 027 -- module introduction.md
Cat brother teaches you to write a crawler 028 -- module introduction - small homework - billboard.md
Cat brother teaches you to write a crawler 029 -- first steps with crawlers - requests.md
Cat brother teaches you to write a crawler 030 -- first steps with crawlers - requests - homework.md
Cat brother teaches you to write a crawler 031 -- crawler basics - html.md
Cat brother teaches you to write a crawler 032 -- first crawler experience - BeautifulSoup.md
Cat brother teaches you to write a crawler 033 -- first crawler experience - BeautifulSoup - homework.md
Cat brother teaches you to write a crawler 034 -- crawler - BeautifulSoup practice.md
Cat brother teaches you to write a crawler 035 -- crawler - BeautifulSoup practice - homework - movie top250.md
Cat brother teaches you to write a crawler 036 -- crawler - BeautifulSoup practice - homework - movie top250 solution.md
Cat brother teaches you to write a crawler 037 -- crawler - listening to songs.md
Cat brother teaches you to write a crawler 038 -- request parameters.md
Cat brother teaches you to write a crawler 039 -- storing data.md
Cat brother teaches you to write a crawler 040 -- storing data - homework.md
Cat brother teaches you to write a crawler 041 -- simulated login - cookie.md
Cat brother teaches you to write a crawler 042 -- session usage.md
Cat brother teaches you to write a crawler 043 -- simulating a browser.md
Cat brother teaches you to write a crawler 044 -- simulating a browser - homework.md
Cat brother teaches you to write a crawler 045 -- coroutines.md
Cat brother teaches you to write a crawler 046 -- coroutines - practice - eat without getting fat.md
Cat brother teaches you to write a crawler 047 -- the scrapy framework.md
Cat brother teaches you to write a crawler 048 -- crawlers and anti-crawling.md
Cat brother teaches you to write a crawler 049 -- the finale.md
Reproduced from: https://juejin.im/post/5cfc4adb6fb9a07eee5ec09a