Basic knowledge of the Requests module

Introduction to the Requests Module

We can inspect the content of these requests and responses in the browser, so can we "forge" a request? That is, instead of sending the data through a browser, can we simulate the browser and send the request from Python? The answer is yes, and the Requests module does exactly this.
The Requests module is a simple, easy-to-use HTTP library implemented in Python.

Are there other libraries? Yes, for example urllib, urllib2, httplib, httplib2 and other modules. But the Requests module is currently the most popular, and it is also a very good module.

Sending a request:

Sending network requests with Requests is very simple. First, import the requests module:

import requests

Then, try to fetch a webpage. In this example, let's get the Sogou homepage:

r=requests.get('https://www.sogou.com/')

Now we have a Response object named r, from which we can get the information we want. For example, to print the returned content:

print(r.text)

In fact, if we open this URL in a browser, right-click, and select "View page source", we will find it is exactly the same as what we just printed (provided there is no anti-crawling measure and the site is static). In other words, these few lines of code have crawled the entire source code of Sogou's homepage for us.

Splicing URLs:

1. Concatenate the query string onto the URL yourself:

import requests

url = 'https://www.sogou.com/web?query=' + '挖掘机小王子'
response = requests.get(url)
print(response.text)

2. Or let Requests build the query string via the params argument:

import requests

url = 'https://www.sogou.com/web'
param = {'query': '挖掘机小王子'}
response = requests.get(url, params=param)
print(response.text)

Note: any dictionary key whose value is None will not be added to the URL's query string.
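
A quick sketch of this behavior against httpbin.org, which this article also uses elsewhere (the key names are placeholders):

import requests

payload = {'key1': 'value1', 'key2': None}   # key2 is None, so it is dropped
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)   # http://httpbin.org/get?key1=value1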

Extended knowledge:
You can also pass a list in as a value:

import requests

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.url)

The printed result is: http://httpbin.org/get?key1=value1&key2=value2&key2=value3

Text response content

We can read the content of the server's response. Above, we used r.text to access the content the server returned, and we could read it without doing anything about encoding and decoding ourselves. In fact:

Requests automatically decodes the content from the server, and most Unicode character sets are decoded seamlessly.
After the request is sent, Requests makes an educated guess about the encoding based on the HTTP headers of the response. When you access r.text, Requests uses its guessed text encoding. You can find out what encoding Requests is using through the r.encoding property, and you can change it:

r.encoding
# 'utf-8'

r.encoding = 'gb2312'

If you change the encoding, Requests will use the new value of r.encoding every time you access r.text.

An HTML page can declare its encoding (for example in a meta tag). You can look up that encoding information and then set r.encoding accordingly, so that r.text is parsed with the correct encoding. The page's encoding can also be viewed in the browser.
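
A minimal sketch of this workflow; r.apparent_encoding (Requests' guess based on the response body itself) is often a handy value to switch to:

import requests

r = requests.get('https://www.sogou.com/')
print(r.encoding)                    # encoding guessed from the HTTP headers
print(r.apparent_encoding)           # encoding guessed from the body itself
r.encoding = r.apparent_encoding     # override the header-based guess
print(r.text[:100])                  # r.text is now decoded with the new encoding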

Binary response content:

Requests automatically decodes gzip- and deflate-encoded response data for you.

For example, image data can easily be saved to a file. For images, mp3s, videos and similar data, you usually need to read the content as binary:

import requests

url = 'https://pic.sogou.com/pics/recompic/detail.jsp?category=%E7%BE%8E%E5%A5%B3&tag=%E5%86%99%E7%9C%9F#1%263976741'
r = requests.get(url)
print(r.content)
with open('baidu.png', 'wb') as f:
    f.write(r.content)

1. r.content is the binary content of the response.
2. open() opens a file object; with is a context manager, which automatically closes the file for us, replacing the manual pattern:
   f = open()
   f.write()
   f.close()
3. as binds the file object to an alias (a nickname).
4. File modes: w opens a file for writing (overwriting existing content), b means binary mode, r is for reading, and a is for appending without overwriting existing data.
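
A quick sketch of the with/open pattern and two of these modes (file names and contents here are placeholders):

# 'wb': write binary — creates/overwrites the file with bytes
with open('image.png', 'wb') as f:
    f.write(b'...')   # placeholder bytes

# 'a': append text — keeps existing content and adds to the end
with open('log.txt', 'a') as f:
    f.write('one more line\n')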


Custom request headers:

Websites are ultimately built for people to visit, and in fact almost no website welcomes crawlers. More and more sites will now refuse to serve a visitor once they detect it is a crawler program: they return no information at all, and some return a prompt such as "You are currently visiting illegally!"

How does the web server know that we are a crawler? There are many ways to tell, and the most common is to check the request headers.

When a browser sends an HTTP request, it includes request header information by default (a program does not). If our program omits the headers, or the header information is wrong, the server may refuse to respond and reject the request.
Checking request headers is also the simplest anti-crawler strategy.

Adding request headers

Our solution is to imitate the browser's behavior as closely as possible. Since the browser sends request headers, our program should naturally add them too. In Requests, we add request header information through the headers parameter:

import requests

url = ''   # the target URL
headers = {
    'User-Agent': ''   # a browser User-Agent string, e.g. copied from your browser
}
r = requests.get(url, headers=headers)

A POST request is mainly used to submit form data; the server then parses the form data and decides what data to return to the client.

Submitting form data

It is also relatively easy to perform a POST request in Requests. To achieve this, simply pass a dictionary to the data parameter, and your data dictionary will be automatically encoded as a form when making a request:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)

The post method in Requests simply has one more data parameter than the get method; the other parameters are similar. For example, we can also add a query string to the URL with the params parameter in a POST, or add the headers parameter just as with get.
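
A small sketch combining these, again posting to httpbin.org (all parameter values here are placeholders):

import requests

payload = {'key1': 'value1'}               # form data
params = {'page': '1'}                     # query string, just as with GET
headers = {'User-Agent': 'Mozilla/5.0'}    # request headers, just as with GET
r = requests.post('http://httpbin.org/post',
                  data=payload, params=params, headers=headers)
print(r.url)   # http://httpbin.org/post?page=1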

POST data

Content-Type indicates the type of the data being transferred:

1. Content-Type: application/x-www-form-urlencoded (form data)
   Use the data parameter, which accepts a dictionary, to submit a form.

2. Content-Type: application/json (data in JSON format)
   Use the json parameter, which accepts a dictionary, to transfer JSON.
   You can also use data=json.dumps(...) to convert the dictionary to a string yourself.
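
A brief sketch of the difference, posting to httpbin.org; the echoed request shows which Content-Type each variant produces:

import requests
import json

payload = {'key1': 'value1'}

# Form submission: Content-Type: application/x-www-form-urlencoded
r1 = requests.post('http://httpbin.org/post', data=payload)

# JSON body: Content-Type: application/json
r2 = requests.post('http://httpbin.org/post', json=payload)

# Serializing the dict yourself; note this does not set a JSON
# Content-Type header automatically
r3 = requests.post('http://httpbin.org/post', data=json.dumps(payload))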

Response status code

The status code lets us easily check the state of our response:

r = requests.get('https://httpbin.org/get')
r.status_code   # 200

If we made a bad request (a 4XX client error or a 5XX server error response), we can raise an exception with Response.raise_for_status():

r = requests.get('https://httpbin.org/status/404')
r.status_code   # 404
r.raise_for_status()   # raises an exception

# Exception:
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error

If the status_code of r had been 200, then calling raise_for_status() would give us:

r.raise_for_status() #200
None

Checking the status code with r.status_code == 200 is a common way to judge whether a request succeeded.
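
A minimal sketch of that check (requests.codes.ok is Requests' built-in alias for 200):

import requests

r = requests.get('https://httpbin.org/get')
if r.status_code == requests.codes.ok:   # i.e. 200
    print('request succeeded')
else:
    print('request failed with status', r.status_code)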

Response headers

We can view the server's response headers as a Python dictionary:

r.headers
# {'content-encoding': 'gzip', 'transfer-encoding': 'chunked', 'connection': 'close'}

This dictionary is special, though: it is made just for HTTP headers. According to RFC 2616 (the HTTP/1.1 specification), HTTP header names are case-insensitive.

Therefore, we can access these response header fields using any capitalization:
r.headers['Content-Type']#'application/json'
r.headers.get('Content-Type')#'application/json'

In short, r.headers returns the response headers in dictionary form, and they can be accessed like a dictionary.

Cookie

Many current websites require users to register and log in before they can be visited, or keep private data inaccessible without logging in, such as Weibo and WeChat.

Websites record user information through cookie values on the client. For example, when we save an account and password in the browser, the browser stores our user information on our computer, and the next time we visit the page it automatically loads the cookie information for us.

On sites that require login, the browser sends the cookie information and the server verifies it to confirm the login. Since the browser carries cookies when sending requests, our program should carry cookie information too.

A cookie is a piece of text stored on your computer when you visit a site or a specific page. It is used to track and record data about the visitor, such as search preferences, clicks, account names, passwords, and so on.

Usually you can copy the cookie value from the browser and place it in the headers:

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'xxxxxxxxxxxxxxxxxxxxxx',   # copied from the browser
    # ……
}

This way the cookie is sent along with the request headers. Of course, Requests also provides a cookies parameter for submitting cookie information:
import requests

url = 'xxx'   # the target URL
cookies = {'cookie_name': 'your cookie value'}   # cookie names mapped to values
r = requests.get(url, cookies=cookies)
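
Beyond passing cookies by hand, Requests also provides a Session object that keeps cookies across requests automatically; a minimal sketch against httpbin.org:

import requests

s = requests.Session()
# this httpbin endpoint sets a cookie on our session
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# the cookie is then carried along automatically on the next request
r = s.get('http://httpbin.org/cookies')
print(r.text)   # {"cookies": {"sessioncookie": "123456789"}}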

Bilibili visit example:

import requests

url = 'https://account.bilibili.com/home/userInfo'
r = requests.get(url)
print(r.json())   # parse the JSON response body into a dict
# Note: without login cookies this typically returns an error payload
# rather than your user info; pass your cookies as shown above.

Redirection and request history

Redirect

Redirection means a network request is forwarded to another location by some means. One possible reason is that some URLs are now deprecated and no longer meant to be used, and so on.

Handle redirection

By default, for the commonly used GET and POST requests, Requests automatically handles all redirects. For example, GitHub redirects all HTTP requests to HTTPS:

r=requests.get('http://github.com')
r.url #'https://github.com'
r.status_code #200

If you are using GET, POST, etc., then you can disable redirection processing via the allow_redirects parameter:

r = requests.get('http://github.com', allow_redirects=False)
r.status_code #301

You can use the history attribute of the Response object to track redirects. response.history is a list of the Response objects that were created in order to complete the request, sorted from the oldest to the most recent response.

r=requests.get('http://github.com')
r.history #[<Response[301]>]

Timeouts

Sometimes, because of our own time constraints or the other site's slowness, we don't want to wait a long time for a response. We can add a timeout parameter: if the set time is exceeded and no response has come back, we stop waiting.

You can tell Requests to stop waiting for a response after the number of seconds given by the timeout parameter. It is recommended that all production code use this parameter:

requests.get('http://github.com', timeout=0.001)

Errors and exceptions

When encountering network problems (such as a DNS failure or a refused connection), Requests raises a ConnectionError exception.
1. If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.
2. If the request times out, a Timeout exception is raised.
3. If the request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.
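
A hedged sketch that exercises these exception types together; the httpbin.org 404 endpoint used earlier triggers the HTTPError branch:

import requests
from requests.exceptions import (ConnectionError, HTTPError, Timeout,
                                 TooManyRedirects, RequestException)

try:
    r = requests.get('https://httpbin.org/status/404', timeout=5)
    r.raise_for_status()               # raises HTTPError for 4XX/5XX
except ConnectionError:
    print('network problem (DNS failure, refused connection, ...)')
except Timeout:
    print('the request timed out')
except TooManyRedirects:
    print('too many redirects')
except HTTPError as e:
    print('bad status code:', e)
except RequestException as e:
    print('some other Requests error:', e)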

Characters are the general term for all kinds of letters and symbols, including national scripts, punctuation marks, graphic symbols, digits, and so on. A character set is a collection of characters.
Character sets include the ASCII character set, the GB2312 character set, the GB18030 character set, the Unicode character set, and so on.
An ASCII character takes 1 byte, while a Unicode character usually takes 2 bytes (for example, in UTF-16 encoding).
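
A tiny illustration of character encodings in Python, reusing the Chinese query string from earlier:

s = '挖掘机小王子'
b_utf8 = s.encode('utf-8')       # str -> bytes: 3 bytes per Chinese character in UTF-8
b_gbk = s.encode('gbk')          # 2 bytes per Chinese character in GBK
print(len(b_utf8), len(b_gbk))   # 18 12
print(b_gbk.decode('gbk'))       # bytes -> str round trip: 挖掘机小王子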


User-Agent:
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3866.400 QQBrowser/10.8.4379.400

Windows XP corresponds to Windows NT 5.1
Windows 7 corresponds to Windows NT 6.1
Windows 8 corresponds to Windows NT 6.2 (and Windows 8.1 to Windows NT 6.3)

GET request refers to requesting data from the server


Modifying headers

Modified via the headers parameter of requests:

import requests

url = 'https://www.baidu.com/'
heads = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3866.400 QQBrowser/10.8.4379.400'
}
r = requests.get(url, headers=heads)
print(r.status_code)


The headers can also be set when building a urllib request, by passing them to urllib.request.Request. Note the headers keyword: the second positional argument of Request is the request body (data), not the headers.

import urllib.request

url = 'https://www.baidu.com/'
heads = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3866.400 QQBrowser/10.8.4379.400'
}
req = urllib.request.Request(url, headers=heads)

print(req)   # a Request object; it has not been sent yet


Modified via the add_header() method of urllib.request.Request:

import urllib.request

url = 'https://www.baidu.com/'
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3866.400 QQBrowser/10.8.4379.400')
print(req)   # a Request object; it has not been sent yet
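
To actually send a urllib Request built either way, pass it to urllib.request.urlopen; a minimal sketch:

import urllib.request

req = urllib.request.Request('https://www.baidu.com/')
req.add_header('User-Agent', 'Mozilla/5.0')     # shortened UA string for brevity
with urllib.request.urlopen(req) as resp:       # send the request
    print(resp.status)                          # HTTP status code, e.g. 200
    html = resp.read().decode('utf-8')          # raw bytes -> str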



Origin blog.csdn.net/CSNN2019/article/details/114537245