This content is for learning and reference. If you find any mistakes, please correct me. Thank you!
Author: rookiequ
Python crawler basics 01
We can use crawlers to fetch the data we want from websites. My environment is PyCharm + Anaconda. To crawl data you first need a Python environment and a suitable IDE; I won't go into the setup details here.
Principles of Python crawlers
How a Python crawler works, in four steps (a code sketch of all four follows this list):
- Retrieve data. The crawler takes the URL we give it, sends a request to the server, and receives the data the server returns.
- Parse data. The crawler converts the data returned by the server into a format humans can read.
- Extract data. The crawler picks out the specific data we need from the parsed result.
- Store data. The crawler saves the data in whatever form we choose, ready for the next step of processing.
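As a preview, here is a minimal sketch of those four steps in code (the URL and the use of BeautifulSoup, which is covered later in this post, are just illustrative):
import requests
from bs4 import BeautifulSoup

# Step 1: retrieve - send a request and receive the server's response
res = requests.get('http://www.baidu.com')
res.encoding = 'utf-8'

# Step 2: parse - turn the raw HTML into a navigable object
soup = BeautifulSoup(res.text, 'html.parser')

# Step 3: extract - pick out only the pieces we need (here, link texts)
links = [tag.text for tag in soup.find_all('a')]

# Step 4: store - save the data for the next step
with open('links.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(links))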
Use of Requests library
To install the Requests library, open a cmd window as administrator and run:
pip install requests
As a first look at crawlers, here is a small example:
import requests  # import the requests module
r = requests.get('http://www.baidu.com')
print(r.status_code)  # status code
r.encoding = 'utf-8'  # set the encoding
print(r.text)  # the page content that was read
HTTP protocol: a stateless, application-layer protocol based on the request-response model.
URL format: http://host[:port][path], for example http://www.rookiequ.top/admin. A URL is the Internet path to a resource accessed via HTTP; one URL corresponds to one data resource.
- host: a legal host domain name or IP address
- port: the port number; the default for HTTP is 80
- path: the path of the requested resource
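To see these components programmatically, Python's built-in urllib.parse module can take a URL apart (a quick illustration; the explicit :80 is added only to show the port field):
from urllib.parse import urlparse

parts = urlparse('http://www.rookiequ.top:80/admin')
print(parts.scheme)    # http
print(parts.hostname)  # www.rookiequ.top
print(parts.port)      # 80
print(parts.path)      # /admin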
The 7 main methods of the Requests library
Method | Description |
---|---|
requests.request() | Constructs a request; the foundation underlying each of the methods below |
requests.get() | The main method for fetching an HTML page; corresponds to HTTP GET |
requests.head() | Fetches only a page's header information; corresponds to HTTP HEAD |
requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
requests.patch() | Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH |
requests.delete() | Submits a delete request to an HTML page; corresponds to HTTP DELETE |
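As a quick illustration of two of these methods, the sketch below uses httpbin.org, a public request-echo service (any endpoint you are allowed to query works the same way):
import requests

# HEAD: fetch only the response headers, no body
r = requests.head('http://httpbin.org/get')
print(r.headers)
print(r.text)  # empty string - HEAD responses carry no body

# POST: submit form data; httpbin echoes it back under "form"
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)
print(r.text)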
requests.request(method, url, **kwargs)
**kwargs: optional parameters that control access; there are 13 in total:
params, data, json, headers, cookies, auth, files, timeout, proxies, allow_redirects, stream, verify, cert
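A sketch combining a few of these control parameters in a single requests.request() call (the values are arbitrary examples):
import requests

r = requests.request(
    'GET',
    'http://httpbin.org/get',
    params={'q': 'python'},                 # appended to the URL: ?q=python
    headers={'user-agent': 'Mozilla/5.0'},  # custom request headers
    timeout=10,                             # seconds before a Timeout is raised
    allow_redirects=True,                   # follow 3xx redirects (the default)
)
print(r.url)  # http://httpbin.org/get?q=python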
In a call such as r = requests.get(url), the right-hand side constructs a Request object and sends it to the server, while the left-hand side receives a Response object containing the server's resources.
# calling the request method
r = requests.get(url, params=None, **kwargs)
# url: the URL of the page to fetch
# params: extra parameters appended to the URL, as a dict or byte stream; optional
# **kwargs: the 12 remaining access-control parameters
The main attributes of the Response object are:
Attribute | Description |
---|---|
response.status_code | The HTTP status code; tells you whether the request succeeded |
response.content | The response body as binary data |
response.text | The response body as a string |
response.encoding | The encoding used to decode the response body |
import requests  # import the requests module
r = requests.get('http://www.baidu.com')
print(r.status_code)  # HTTP status of the request: 200 means success, 404 failure
r.encoding = 'utf-8'  # response encoding as guessed from the header
type(r)  # check the returned type: a Response object
print(r.text)  # the page content, i.e. the response body as a string
print(r.content)  # the response body in binary; typically used for images and audio
print(r.apparent_encoding)  # encoding inferred from the content itself (the fallback)
print(r.headers)  # the response headers (note: r.headers, not r.header)
Note: for r.encoding, if charset is missing from the response header, the encoding is assumed to be ISO-8859-1 (which cannot decode Chinese).
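A common pattern, then, is to fall back to apparent_encoding whenever that ISO-8859-1 default appears (a small sketch):
import requests

r = requests.get('http://www.baidu.com')
print(r.encoding)  # ISO-8859-1 when the header carries no charset
if r.encoding == 'ISO-8859-1':
    r.encoding = r.apparent_encoding  # guess the encoding from the body instead
print(r.text[:200])  # Chinese text now decodes correctly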
Exceptions in the Requests library
Exception | Description |
---|---|
requests.ConnectionError | Network connection errors, such as DNS lookup failure or a refused connection |
requests.HTTPError | An HTTP error occurred |
requests.URLRequired | A valid URL is missing |
requests.TooManyRedirects | The maximum number of redirects was exceeded |
requests.ConnectTimeout | Timed out while connecting to the remote server |
requests.Timeout | The request timed out |
In addition, response.raise_for_status() raises requests.HTTPError whenever the status code is not 200.
A common code framework for crawling web pages
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'an exception occurred'

if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(getHTMLText(url))
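The bare except above swallows every error indiscriminately; a variant that catches the specific exceptions from the table is slightly more informative (a sketch with the same overall behavior):
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError if status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except requests.Timeout:
        return 'request timed out'
    except requests.ConnectionError:
        return 'network connection error'
    except requests.RequestException:  # base class of all Requests exceptions
        return 'request failed'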
Robots protocol: the convention by which a website tells web crawlers which content may and may not be crawled. It consists mainly of Allow and Disallow rules: Allow marks paths that may be accessed, Disallow marks paths where access is forbidden.
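The rules live in a plain-text robots.txt file at the site root, which you can inspect with Requests itself (a quick sketch):
import requests

r = requests.get('http://www.baidu.com/robots.txt')
print(r.text[:300])  # User-agent / Allow / Disallow rules for crawlers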
Examples
import requests

keyword = 'python'
try:
    kv = {'wd': keyword}
    r = requests.get('http://www.baidu.com/s', params=kv)
    print(r.request.url)  # the final URL, with ?wd=python appended
    r.raise_for_status()
    print(len(r.text))
except:
    print('crawl failed')
import requests

url = 'https://item.jd.com/2967929.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print('crawl failed')
import requests

url = 'https://www.amazon.cn/gp/product/B01M8L5Z3Y'
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # pose as a browser; some sites reject the default UA
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print('crawl failed')
Use of BeautifulSoup
BeautifulSoup is used to parse web pages and extract data from them.
- Parsing data: translating the HTML source returned by the server into something we can work with
- Extracting data: selecting, in a targeted way, only the data we need from the parsed source
First prepare the learning environment:
pip install bs4
soup = BeautifulSoup(text_to_parse, parser): the constructor takes two arguments. The first must be a string (or a variable holding one); the second is the parser. We use html.parser, a parser built into Python, but it is not the only one and others can be used.
Here is a simple example:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/index.html')
soup = BeautifulSoup(res.text, 'html.parser')
# parse the page into a BeautifulSoup object
print(type(soup))  # check the type of soup
print(soup)  # print soup
The type of soup is <class 'bs4.BeautifulSoup'>, which shows that soup is a BeautifulSoup object. Its type differs from that of response.text, which is <class 'str'>, but the printed content is the same.
To extract data with BeautifulSoup, here are two of its methods: find() and find_all(). Both match HTML tags and attributes and extract every piece of data in the BeautifulSoup object that satisfies the criteria, and their usage is essentially the same. The difference is that find() extracts only the first match and returns it as <class 'bs4.element.Tag'>, while find_all() extracts all matches and returns <class 'bs4.element.ResultSet'>, which stores the Tag objects together in a list-like structure that we can treat as a list. There is also a select() method, which you can learn about later.
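A short check of those return types, using the same course page as the example below:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/index.html')
soup = BeautifulSoup(res.text, 'html.parser')

first = soup.find('div')      # first match only
every = soup.find_all('div')  # every match

print(type(first))  # <class 'bs4.element.Tag'>
print(type(every))  # <class 'bs4.element.ResultSet'>
print(len(every))   # a ResultSet behaves like a list of Tag objects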
Tag objects were mentioned above, so what properties and methods do they have?
import requests
from bs4 import BeautifulSoup

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/index.html')
html = res.text
soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all(class_='show-list-item')
for item in items:
    title = item.find(class_='desc-title')        # in each list element, match class_='desc-title' to extract the title
    material = item.find(class_='desc-material')  # match class_='desc-material' to extract the ingredients
    step = item.find(class_='desc-step')          # match class_='desc-step' to extract the steps
    print(title.text, material.text, step.text)
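Besides find() and .text, a Tag object exposes its name and attributes. A brief sketch against the same page (assuming the class used above exists on at least one tag):
import requests
from bs4 import BeautifulSoup

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/index.html')
soup = BeautifulSoup(res.text, 'html.parser')
tag = soup.find(class_='show-list-item')  # a single Tag object

print(tag.name)       # the tag's name, e.g. 'div'
print(tag.attrs)      # dict of all its attributes
print(tag['class'])   # shorthand access to one attribute
print(tag.text[:50])  # all text inside the tag, markup stripped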
(Figure: the overall workflow of the crawler.)
Now let's look at an example that crawls an image:
import requests
from kkb_tools import open_file  # course helper that opens a saved file

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/images/banner.png')
pic = res.content  # the image as binary data
photo = open('banner.jpg', 'wb')  # 'wb' = write in binary mode
photo.write(pic)
photo.close()
open_file('banner.jpg')
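kkb_tools appears to be a helper shipped with the course environment; without it, the same download works with only the standard with-open idiom (viewing the saved file is then left to your OS):
import requests

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/images/banner.png')
with open('banner.png', 'wb') as photo:  # 'wb' = write in binary mode
    photo.write(res.content)             # res.content holds the raw image bytes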