Brother Cat Teaches You to Write a Crawler 032 -- First Crawler Experience: BeautifulSoup

What is BeautifulSoup?

In web crawling, we need a tool that can read HTML and pull out the data we want. This is called parsing the data.

"Extracting data" means picking out the data we need from a larger set of data.

Parsing and extracting data is both a key focus and a key difficulty in crawler development.

How to use BeautifulSoup

Install BeautifulSoup: pip install BeautifulSoup4

BeautifulSoup() takes two arguments in its parentheses. The first is the text to be parsed; note that it must be a string.

The second argument specifies the parser; here we use Python's built-in one: html.parser

import requests  # import the requests library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html') 
# fetch the page source; res is a Response object
print(res.status_code)  # check that the request got a proper response
html = res.text  # return the content of res as a string
print(html)  # print the html
import requests
from bs4 import BeautifulSoup
# import the BS library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html') 
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
print(type(soup))  # <class 'bs4.BeautifulSoup'>

If we print soup, the output is exactly the same as the source code we printed earlier with response.text.

But although the printed contents of response.text and soup look identical on the surface,

they belong to different classes: <class 'str'> and <class 'bs4.BeautifulSoup'>
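You can check this difference without fetching a page, by parsing a small inline HTML string (the snippet below is invented for illustration, not the tutorial's page):

```python
from bs4 import BeautifulSoup

# a made-up snippet, used only to illustrate the type difference
html = '<html><body><p>hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(type(html))  # <class 'str'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(str(soup) == html)  # for this simple snippet the printed content matches
```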

Extracting data

We still use BeautifulSoup to extract the data.

The knowledge for this step can be divided into two parts: find() and find_all(), plus the Tag object.

find() and find_all() are two methods of the BeautifulSoup object.

They match HTML tags and attributes and extract the matching data from the BeautifulSoup object.

The difference is that find() extracts only the first piece of data that meets the requirements, while find_all() extracts all the data that meets them.
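The difference can be checked offline; the three-element snippet below is invented for illustration:

```python
from bs4 import BeautifulSoup

# an invented snippet with three <div> elements, for illustration only
html = '<div>one</div><div>two</div><div>three</div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div')      # only the first match, a Tag object
every = soup.find_all('div')  # all matches, in a list-like ResultSet

print(first.text)               # one
print(len(every))               # 3
print([d.text for d in every])  # ['one', 'two', 'three']
```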

localprod.pandateacher.com/python-manu…

In the HTML code there are three <div> elements; with find() we can extract the first one, and with find_all() we can extract all of them.

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # use find() to extract the first <div> element into the variable item
print(type(item))  # print item's data type
print(item)        # print item
import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')  # use find_all() to extract all matching data into the variable items
print(type(items))  # print items' data type
print(items)        # print items

Note the underscore in class_ inside the parentheses: it distinguishes the attribute name from Python's class keyword, avoiding a conflict in the program.
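For example, to match only tags with a given class attribute ('books' is a hypothetical class name used just for this sketch), you pass class_ with the trailing underscore:

```python
from bs4 import BeautifulSoup

# 'books' is a hypothetical class name used only for this sketch
html = '<div class="books">A</div><div class="ads">B</div><div class="books">C</div>'
soup = BeautifulSoup(html, 'html.parser')

# class_ (with underscore) avoids clashing with the 'class' keyword
items = soup.find_all('div', class_='books')
print([tag.text for tag in items])  # ['A', 'C']
```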

Small exercise: crawl the titles, links, and reviews of the three books on this page

localprod.pandateacher.com/python-manu…

Another knowledge point: the Tag object.

We usually use the type() function to check data types,

because Python is an object-oriented programming language, and only when we know what an object is can we call its properties and methods.

The data extracted with find() is of the same type as before: a Tag object.

We can use Tag.text to get the text of a Tag object, and Tag['href'] to extract the URL.
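A minimal sketch of both operations on an invented link tag (the title, class name, and URL below are placeholders, not from the tutorial's page):

```python
from bs4 import BeautifulSoup

# an invented <a> tag; the URL and title are placeholders
html = '<a class="title" href="https://example.com/book1">Three Kingdoms</a>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('a', class_='title')  # a Tag object
print(tag.text)     # Three Kingdoms  -- the text inside the tag
print(tag['href'])  # https://example.com/book1  -- the attribute value
```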

import requests  # import the requests library
from bs4 import BeautifulSoup
# Fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html') 
# res.status_code  status code
# res.content  binary content
# res.text  html source
# res.encoding  encoding
# Parse the data
# soup is a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# soup.find(tag_name, attribute=value)
# soup.find_all(tag_name, attribute=value)
# Extract the data: a list of Tag objects
item = soup.find_all('div', class_='books')
for i in item:
    # i.find().find().find()  # Tag objects can be searched level by level
    # i.find_all()
    # i is a Tag object, so find and find_all also work on it; the results are still Tag objects
    print(i.find('a', class_='title').text)     # get the tag's text
    print(i.find('a', class_='title')['href'])  # get the tag's attribute (href)
    print(i.find('p', class_='info').text)      # get the tag's text

This layer-by-layer retrieval is a bit like finding a snack in a supermarket: first the right aisle, then the right shelf, then the item itself.

How the objects change

At first requests fetches the data, then BeautifulSoup parses the data, and then BeautifulSoup extracts the data.

What runs through the whole process is operating on objects.

Full version

To recap, here is the full code:

import requests  # import the requests library
from bs4 import BeautifulSoup
# Fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html') 
# res.status_code  status code
# res.content  binary content
# res.text  html source
# res.encoding  encoding
# Parse the data
# soup is a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# soup.find(tag_name, attribute=value)
# soup.find_all(tag_name, attribute=value)
# Extract the data: a list of Tag objects
item = soup.find_all('div', class_='books')
for i in item:
    # i.find().find().find()  # Tag objects can be searched level by level
    # i.find_all()
    # i is a Tag object, so find and find_all also work on it; the results are still Tag objects
    print(i.find('a', class_='title').text)     # get the tag's text
    print(i.find('a', class_='title')['href'])  # get the tag's attribute (href)
    print(i.find('p', class_='info').text)      # get the tag's text

Summary

BeautifulSoup parsers

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(text, "html.parser") | Built into Python; moderate speed; moderately fault-tolerant | Poor fault tolerance on Python versions before 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(text, "lxml") | Fast; strongly fault-tolerant | Requires installing a C library |
| lxml XML parser | BeautifulSoup(text, "xml") | Fast; the only parser that supports XML | Requires installing a C library |
| html5lib | BeautifulSoup(text, "html5lib") | Best fault tolerance; generates documents in HTML5 format | Very slow; an external Python dependency |
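Switching parsers only changes the second argument. A sketch that falls back to the built-in parser when lxml is not installed (the fallback logic is my own addition, not part of the tutorial):

```python
from bs4 import BeautifulSoup

html = '<p>hello</p>'

try:
    import lxml  # noqa: F401  # third-party, C-backed parser; may not be installed
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'  # slower, but always available in the standard library

soup = BeautifulSoup(html, parser)
print(soup.p.text)  # hello
```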

Job 1: crawl the articles and save them (one HTML file per article)

wordpress-edu-3autumn.localprod.forc.work

Job 2: crawl the book names and corresponding prices under the category, and save them to books.txt

books.toscrape.com
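One possible approach to Job 2, sketched against an inline snippet that imitates the page's structure. The class names (product_pod, price_color) and layout below are assumptions about books.toscrape.com; fetch the real page with requests.get and inspect its actual markup yourself:

```python
from bs4 import BeautifulSoup

# invented HTML imitating a category page; real class names on the site may differ
html = '''
<article class="product_pod">
  <h3><a title="Book One">Book One</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a title="Book Two">Book Two</a></h3>
  <p class="price_color">£12.50</p>
</article>
'''
soup = BeautifulSoup(html, 'html.parser')

lines = []
for pod in soup.find_all('article', class_='product_pod'):
    name = pod.find('h3').find('a')['title']          # book title from the link's title attribute
    price = pod.find('p', class_='price_color').text  # price text
    lines.append(f'{name}\t{price}')

# write to books.txt as the job asks
with open('books.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines))
print(lines)
```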

Final result...

Quick Jump:

Brother Cat Teaches You to Write a Crawler 000 -- Getting Started.md
Brother Cat Teaches You to Write a Crawler 001 -- The print() Function and Variables.md
Brother Cat Teaches You to Write a Crawler 002 -- Job: Printing Pikachu.md
Brother Cat Teaches You to Write a Crawler 003 -- Data Type Conversion.md
Brother Cat Teaches You to Write a Crawler 004 -- Data Type Conversion: Small Practice.md
Brother Cat Teaches You to Write a Crawler 005 -- Data Type Conversion: Small Job.md
Brother Cat Teaches You to Write a Crawler 006 -- Conditionals and Nested Conditionals.md
Brother Cat Teaches You to Write a Crawler 007 -- Conditionals and Nested Conditionals: Small Job.md
Brother Cat Teaches You to Write a Crawler 008 -- The input() Function.md
Brother Cat Teaches You to Write a Crawler 009 -- The input() Function: AI Xiao Ai.md
Brother Cat Teaches You to Write a Crawler 010 -- Lists, Dictionaries, and Loops.md
Brother Cat Teaches You to Write a Crawler 011 -- Lists, Dictionaries, and Loops: Small Job.md
Brother Cat Teaches You to Write a Crawler 012 -- Booleans and the Four Kinds of Statements.md
Brother Cat Teaches You to Write a Crawler 013 -- Booleans and the Four Kinds of Statements: Small Job.md
Brother Cat Teaches You to Write a Crawler 014 -- PK Game.md
Brother Cat Teaches You to Write a Crawler 015 -- PK Game (New Revision).md
Brother Cat Teaches You to Write a Crawler 016 -- Functions.md
Brother Cat Teaches You to Write a Crawler 017 -- Functions: Small Job.md
Brother Cat Teaches You to Write a Crawler 018 -- debug.md
Brother Cat Teaches You to Write a Crawler 019 -- debug: Job.md
Brother Cat Teaches You to Write a Crawler 020 -- Classes and Objects (Part 1).md
Brother Cat Teaches You to Write a Crawler 021 -- Classes and Objects (Part 1): Job.md
Brother Cat Teaches You to Write a Crawler 022 -- Classes and Objects (Part 2).md
Brother Cat Teaches You to Write a Crawler 023 -- Classes and Objects (Part 2): Job.md
Brother Cat Teaches You to Write a Crawler 024 -- Encoding && Decoding.md
Brother Cat Teaches You to Write a Crawler 025 -- Encoding && Decoding: Small Job.md
Brother Cat Teaches You to Write a Crawler 026 -- Modules.md
Brother Cat Teaches You to Write a Crawler 027 -- Module Introduction.md
Brother Cat Teaches You to Write a Crawler 028 -- Module Introduction: Small Job: Billboard.md
Brother Cat Teaches You to Write a Crawler 029 -- First Look at Crawlers: requests.md
Brother Cat Teaches You to Write a Crawler 030 -- First Look at Crawlers: requests: Job.md
Brother Cat Teaches You to Write a Crawler 031 -- Crawler Basics: html.md
Brother Cat Teaches You to Write a Crawler 032 -- First Crawler Experience: BeautifulSoup.md
Brother Cat Teaches You to Write a Crawler 033 -- First Crawler Experience: BeautifulSoup: Job.md
Brother Cat Teaches You to Write a Crawler 034 -- Crawler: BeautifulSoup Practice.md
Brother Cat Teaches You to Write a Crawler 035 -- Crawler: BeautifulSoup Practice: Job: Movie Top 250.md
Brother Cat Teaches You to Write a Crawler 036 -- Crawler: BeautifulSoup Practice: Job: Movie Top 250 Solution.md
Brother Cat Teaches You to Write a Crawler 037 -- Crawler: Listening to Songs.md
Brother Cat Teaches You to Write a Crawler 038 -- Request Parameters.md
Brother Cat Teaches You to Write a Crawler 039 -- Storing Data.md
Brother Cat Teaches You to Write a Crawler 040 -- Storing Data: Job.md
Brother Cat Teaches You to Write a Crawler 041 -- Simulated Login: cookie.md
Brother Cat Teaches You to Write a Crawler 042 -- Session Usage.md
Brother Cat Teaches You to Write a Crawler 043 -- Simulating a Browser.md
Brother Cat Teaches You to Write a Crawler 044 -- Simulating a Browser: Job.md
Brother Cat Teaches You to Write a Crawler 045 -- Coroutines.md
Brother Cat Teaches You to Write a Crawler 046 -- Coroutines: Practice: What to Eat Without Getting Fat.md
Brother Cat Teaches You to Write a Crawler 047 -- The Scrapy Framework.md
Brother Cat Teaches You to Write a Crawler 048 -- Crawlers and Anti-Crawling.md
Brother Cat Teaches You to Write a Crawler 049 -- The Finale.md

Reproduced from: https://juejin.im/post/5cfc4adb6fb9a07eee5ec09a


Origin: blog.csdn.net/weixin_34161083/article/details/91416929