Python crawler 2: requests library - principle


Foreword

Implementing a web crawler in Python is quite simple: you only need some basic knowledge and familiarity with a few libraries. The purpose of this series is to organize the relevant knowledge points for future review.


1. Overview

Python actually ships with a built-in request library, urllib, but it is not very convenient to use, so most people turn to the third-party library requests instead.

requests supports HTTP keep-alive and connection pooling, session persistence with cookies, file uploads, automatic decoding of response content, and automatic encoding of internationalized URLs and POST data.

It is a high-level wrapper around Python's built-in modules that makes network requests feel much more human-friendly. With requests you can easily perform any operation a browser can.

In addition, installing requests is very simple; all you need is pip:

pip install requests

If your network is slow, you can also specify a mirror source for a faster download:

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple

​ You can use the following code to check whether the installation is successful:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the decoded response body
print(response.content.decode('utf-8'))

​ The returned results are as follows:

(Screenshot: the HTML source of the Baidu homepage returned by the request.)

2. response object

When we send a request with get, post, or another method, we get back a response object (such as the one returned by the get call in the test code above). Let's go through this object's common properties and methods.

2.1 encoding property

​ Role: Returns the encoding format of the web page.

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the encoding
print(response.encoding)

​ The printed result is:

ISO-8859-1

ISO-8859-1 (Latin-1) is simply one kind of encoding; requests falls back to it when the response headers do not declare a charset. Other common encodings include utf-8 and gbk.

2.2 url attribute

Function: returns the URL of the final response.

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the final URL
print(response.url)

​ The printed result is:

https://www.baidu.com/

Don't underestimate this property. Sometimes we request http://www.test.com, but because the site redirects, what we actually end up visiting is http://www.good.com, so we use this attribute to determine the URL we really reached.
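
A quick, hedged illustration of this (http://github.com is used only because it is a well-known address that redirects to HTTPS): requests follows redirects by default, and response.history keeps the intermediate responses, so you can compare the URL you asked for with the one you finally reached.

import requests

# http://github.com redirects to https://github.com/
response = requests.get('http://github.com')

# Final URL after all redirects
print(response.url)        # https://github.com/
# Intermediate responses, e.g. [<Response [301]>]
print(response.history)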

2.3 status_code attribute

​ Function: Return the status code of the response.

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the status code
print(response.status_code)

​ The printed result is:

200

What are status codes?

Simply put, the status code tells you how the visit went: whether it completed normally, the server had an error, there was a network problem, and so on.

​ Here are the common corresponding status codes:

2xx | Success
3xx | Redirection
4xx | Client error
5xx | Server error
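
In practice you usually either compare status_code with 200 or call raise_for_status(), which turns 4xx/5xx responses into an exception. A minimal sketch:

import requests

response = requests.get('https://www.baidu.com')

# Simple manual check
if response.status_code == 200:
    print('request succeeded')

# Or let requests raise requests.exceptions.HTTPError for 4xx/5xx codes
response.raise_for_status()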

2.4 cookies attribute

Function: returns the cookies of the response (a RequestsCookieJar object).

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the cookies
print(response.cookies)

​ Print result:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

What are cookies?

Here is a brief explanation of what a cookie is. It can be understood as a small temporary identity file stored locally on your computer. For example, the first time you visit a website you log in; you then close the site and open it again, and find there is no need to log in a second time. That is the cookie at work: when you logged in, a temporary record was created locally and is presented again on later visits.
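
Since response.cookies is a RequestsCookieJar, it can be handier to turn it into a plain dictionary for inspection. A small sketch using requests.utils.dict_from_cookiejar:

import requests

response = requests.get('https://www.baidu.com')

# Convert the cookie jar into an ordinary dict
cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)
print(cookie_dict)  # e.g. {'BDORZ': '27315'}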

2.5 request.headers property

​ Role: Return the request header.

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the request headers
print(response.request.headers)

​ The printed value is:

{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

What are request headers?

To put it simply, when you visit a website, for example running a Baidu search, the request you send carries information about the client along with the thing you searched for. This accompanying information is what we call the request headers: it describes the request you are making.

From the return value above we can see one thing: the Baidu server knows you are a Python script, which is revealed by 'User-Agent': 'python-requests/2.31.0'. Therefore, when writing crawlers we must do a certain amount of disguising, otherwise we will be recognized as crawlers straight away.

2.6 headers property

​ Role: Return the response header.

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the response headers
print(response.headers)

​ Print result:

{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Fri, 04 Aug 2023 05:36:28 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:23:55 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

What are response headers?

The response headers are metadata the server/website sends back along with the response body, such as the content type, caching rules, and cookies to set.
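
response.headers behaves like a case-insensitive dictionary, so individual fields can be read regardless of capitalization, for example:

import requests

response = requests.get('https://www.baidu.com')

# Header lookup is case-insensitive
print(response.headers['Content-Type'])      # e.g. text/html
print(response.headers.get('content-type'))  # same value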

2.7 text attribute

Function: returns the page source as text, decoded with the encoding that requests inferred (which is sometimes inaccurate).

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the text attribute
print(response.text)

​ Return result:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> <body link=#0000cc> <div id=wrapper> 
....

# (partial result)

Note: text returns the page source, but the decoding is guessed and sometimes wrong. For that reason we more often use the content attribute and decode it ourselves, which is more reliable.
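
If you still want to use text, one option is to set response.encoding yourself before reading it, either explicitly or from response.apparent_encoding (the encoding detected from the response body). A small sketch:

import requests

response = requests.get('https://www.baidu.com')

# Tell requests which encoding to use before accessing .text
response.encoding = response.apparent_encoding   # or simply 'utf-8'
print(response.text)  # now decoded correctly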

2.8 content attribute

Function: returns the page source as raw bytes.

​ Code:

import requests
# Target URL
url = 'https://www.baidu.com'
# Send the request
response = requests.get(url)
# Print the decoded response body
print(response.content.decode('utf-8'))

​ Partial results printed:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> 
....

# (partial result)

How to judge the coding form of the web page?

This is simple: open any web page (Baidu, for example), right-click, choose "View page source", and look for the content shown in the picture below; the charset declared in the <meta> tag is the page's encoding (the overall skeleton of an HTML page is fixed, so you can find similar content on any page):

(Screenshot: the charset declared in the page's <meta> tag, e.g. charset=utf-8.)

3. GET request

3.1 Method overview

The GET request is one of the most commonly used request methods. The requests module provides a very convenient get method:

import requests
url = 'http://www.baidu.com'
response = requests.get(url)
print(response.status_code)

3.2 Common parameters

parameter | effect
url | the address to request
headers | request header fields (dictionary)
params | query-string parameters (dictionary)
cookies | cookies to send with the request (dictionary)
timeout | timeout in seconds

3.3 Example usage of parameters

Here is a simple example to illustrate how these parameters are used.

As mentioned earlier, if you access the Baidu website directly from a script, it will be recognized as a Python script, so we can use the headers parameter to apply a simple disguise:

import requests
# Target URL
url = 'https://www.baidu.com'
# Forged request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
}
# Send the request
response = requests.get(url, headers=headers)
# Print the request headers actually sent
print(response.request.headers)

Note: the argument passed in must be a dictionary, and its contents must match the format of real request headers.

​ The printed results are as follows:

{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

As you can see, the disguise was indeed successful.

How to correctly write the request header format?

This is very important and very simple. Open any web page, right-click it, select "Inspect" (using Chrome as an example), and then follow the steps shown below to view the request headers:

(Screenshot: viewing a request's headers in the browser developer tools.)

Here are the most common request header fields:

User-Agent: version information of the client browser
Host: the host of the server
Referer: the page from which you navigated to the current page
Cookie: records the client's identity, for example logging in to a website via cookies
X-Forwarded-For: the client's IP address, commonly called the "XFF" header; the server uses this field to learn the client's real or proxy IP

(The params parameter will be illustrated with a separate case study later.)
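
As a quick preview before that case study, here is a minimal sketch of the params and timeout parameters; the httpbin.org endpoint is used purely as a neutral test target, not as the case referred to above:

import requests

# params is a dict of query-string arguments; requests URL-encodes them for you
payload = {'wd': 'python', 'page': 1}
response = requests.get('https://httpbin.org/get', params=payload, timeout=5)

print(response.url)  # https://httpbin.org/get?wd=python&page=1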

4. POST request

The POST request is also one of the most commonly used request methods; almost all form submissions are POST requests. In the requests module, the difference between post and get lies mainly in the parameters (params becomes data, while headers and the other parameters are the same):

data: accepts a dictionary containing the data the crawler submits
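
A minimal sketch of a form-style POST, again using httpbin.org only as an example endpoint (the field names are made up for illustration):

import requests

# data is a dict of form fields; requests sends it as application/x-www-form-urlencoded
form = {'username': 'test', 'password': '123456'}
response = requests.post('https://httpbin.org/post', data=form)

print(response.status_code)
print(response.json()['form'])  # httpbin echoes back the submitted form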

5. Proxy settings

Using a proxy reduces the chance of exposing our own real IP.

To put it simply, a proxy means using someone else's IP to access a website: if the site runs any detection, it only sees the proxy's IP address and does not know that you are the one actually visiting it.

Proxy IPs are generally classified by their degree of anonymity:

  • High anonymity: the target site cannot see your real IP
  • Transparent: the target site can see your real IP

In the requests library, the get and post methods both accept a proxies parameter for setting the proxy.

​ The method of use is as follows:

import requests

url = 'https://www.baidu.com'
# One entry per scheme; 'ip' and 'port' are placeholders for a real proxy address
proxies = {
    'http': 'http://ip:port',
    'https': 'https://ip:port',
}
requests.get(url, proxies=proxies)

When we need to pursue efficiency, besides multitasking, another approach is to use proxies. For example, suppose we need to download pictures. To avoid being detected as a crawler we would normally have to sacrifice efficiency, say pausing for a second after every picture, which is obviously very slow. But if we crawl hundreds of pictures in one second, the website can easily tell the visitor is not a single human, because a real person cannot download hundreds of pictures in a second, so it will temporarily block our IP, we can no longer visit, and the crawler fails.

This is where proxies come in: they let us use different IP addresses while crawling.

So, to sum up, why do we need a proxy?

  • It reduces the chance of exposing your real IP
  • Multiple proxy IPs can be used to achieve fast access (see the sketch below)

This is because almost every modern website blocks access that is too fast: dozens or hundreds of requests per second from one IP obviously looks abnormal, so the site owner will throttle or ban that IP. With multiple proxy IPs, say 100 of them, we can make 100 requests per second without any single IP raising suspicion.
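
One simple way to rotate through several proxy IPs is to pick one at random for each request. The sketch below assumes you already have working proxies; the addresses shown are placeholders only:

import random
import requests

# Placeholder proxy addresses; replace with real, working proxies
proxy_pool = [
    {'http': 'http://10.0.0.1:8888', 'https': 'http://10.0.0.1:8888'},
    {'http': 'http://10.0.0.2:8888', 'https': 'http://10.0.0.2:8888'},
]

for _ in range(3):
    proxies = random.choice(proxy_pool)  # a different proxy may be used each time
    response = requests.get('https://www.baidu.com', proxies=proxies, timeout=5)
    print(response.status_code)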

6. Session maintenance

We sometimes encounter websites that force you to log in before they hand over any data. You know you need a post request to log in, but after it succeeds, when you use a get request to access the site you find you are rejected again. This is because your get request is a brand-new request, not a continuation of the successful post. If that is hard to follow, think of it this way: each plain request is like opening a separate browser, and the two browsers know nothing about each other, so your second request is effectively a freshly opened browser and naturally cannot see the logged-in content.

Therefore, we need some way to solve this problem.

​ The reference code is as follows:

# The following shows the idea rather than complete code; a concrete case will be covered separately
import requests

# Create a session object
session = requests.session()
# Use the session object to send the login post request
session.post(.....)     # post here is used exactly like requests.post()
# After the login succeeds, use the same session to request pages that require login
response = session.get(....)   # get here is used exactly like requests.get()
# Then carry on with the rest of your processing

(This will also be covered in a separate case study.)
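
As a concrete (if artificial) illustration of cookie persistence across a session, httpbin.org can set a cookie on one request and show on the next that the same session sent it back; the endpoint is only a stand-in for a real login flow:

import requests

session = requests.session()

# First request: the server sets a cookie (and redirects to /cookies)
session.get('https://httpbin.org/cookies/set?stamp=123')

# Second request: the same session automatically sends the cookie back
response = session.get('https://httpbin.org/cookies')
print(response.json())  # {'cookies': {'stamp': '123'}}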

7. SSL certificate

Sometimes our crawler needs to ignore the SSL certificate, that is, for sites which, when visited manually, show prompts such as "the web page is not secure" or "the SSL certificate has expired".

Ignoring it is very simple; just pass verify=False, as below:

import requests

response = requests.get('https://www.baidu.com',verify=False)
print(response.status_code)
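
With verify=False, urllib3 normally prints an InsecureRequestWarning on every request. If the noise bothers you it can be silenced as below, though keep in mind you are hiding a genuine security warning:

import urllib3
import requests

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('https://www.baidu.com', verify=False)
print(response.status_code)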

8. Summary

This article mainly organizes the common methods of the requests library; the following articles will illustrate some of the methods covered here with concrete examples.


Origin blog.csdn.net/weixin_46676835/article/details/132127874