Get a webpage
import requests
r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)
Various requests
# Send an HTTP POST request:
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
# Send an HTTP DELETE request:
r = requests.delete('http://httpbin.org/delete')
# Send an HTTP HEAD request:
r = requests.head('http://httpbin.org/get')
# Send an HTTP OPTIONS request:
r = requests.options('http://httpbin.org/get')
Construct a GET request that passes parameters
For GET requests, pass the query parameters with the params argument
import requests
data = {
    'key1': 'value1',
    'key2': 'value2'
}
r = requests.get('http://httpbin.org/get', params=data)
print(r.url)
#http://httpbin.org/get?key1=value1&key2=value2
You can also pass in a list as a value
import requests
data = {
    'key1': 'value1',
    'key2': ['value2', 'value3']
}
r = requests.get('http://httpbin.org/get', params=data)
print(r.url)
#http://httpbin.org/get?key1=value1&key2=value2&key2=value3
Note: any key in the dictionary whose value is None will not be added to the URL's query string.
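This can be checked without sending any request, by using requests' own URL-preparation machinery (PreparedRequest is internal and is used here purely for illustration):

```python
from requests.models import PreparedRequest

req = PreparedRequest()
# key2 has the value None, so it is omitted from the query string
req.prepare_url('http://httpbin.org/get', {'key1': 'value1', 'key2': None})
print(req.url)
```

Only key1 appears in the resulting URL.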
In addition, the body of the response is of type str; but when it happens to be in JSON format, you can call the json() method to parse it directly into a dictionary. Examples are as follows:
import requests
r = requests.get('http://httpbin.org/get')
print(r.json())
'''
{'args': {},
'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.23.0', 'X-Amzn-Trace-Id': 'Root=1-5e9b0b15-4d6629f8460bc48037fa4244'}, 'origin': '124.164.123.240', 'url': 'http://httpbin.org/get'}
'''
Crawl the web
Taking the Zhihu Daily page as an example, a request header needs to be constructed; the User-Agent value can be copied from the browser's developer tools.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0'
}
r = requests.get('https://daily.zhihu.com/', headers=headers)
print(r.text)
Of course, we can add other field information in the headers parameter.
Grab binary data
In the example above, we grabbed a page and what came back was an HTML document. What if we want to capture images, audio, or video?
Such files are essentially composed of binary data; it is only because of their specific storage formats and the corresponding parsers that we see them as multimedia. Therefore, to grab them, we have to fetch their binary content.
import requests
r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)
The open() method is used here: its first argument is the file name, and the second ('wb') opens the file for binary writing, so binary data can be written to it. After running this, a file named favicon.ico appears in the working directory.
You could also wrap this in a loop to grab data continuously.
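For large files, reading the whole body through r.content holds everything in memory at once. A sketch of a chunked download loop, assuming requests' stream=True / iter_content API (the commented URL is just an example):

```python
import requests

def download(url, path, chunk_size=8192):
    """Stream a binary file to disk chunk by chunk instead of
    holding the whole body in memory."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)

# download('https://github.com/favicon.ico', 'favicon.ico')  # requires network access
```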
POST request
import requests
data = {'name': 'germey', 'age': '22'}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)
# Partial output:
# "form": {
#   "age": "22",
#   "name": "germey"
# }
The form part is the submitted data, which proves that the POST request was successfully sent.
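Besides form data, requests can also send a JSON body via the json parameter. Preparing the request locally (without sending it) shows the headers and body it would produce:

```python
import requests

# json= serializes the dict and sets Content-Type automatically,
# unlike data=, which sends form-encoded key/value pairs
req = requests.Request('POST', 'http://httpbin.org/post',
                       json={'name': 'germey', 'age': '22'}).prepare()
print(req.headers['Content-Type'])
print(req.body)
```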
Response
After sending a request, we naturally get a response. In the examples above, we used text and content to get the response body; there are many other attributes and methods for obtaining information such as the status code, response headers, and cookies.
import requests
r = requests.get('http://www.baidu.com')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)
Here we print out the status_code attribute to get the status code, the headers attribute to get the response header, the cookies attribute to get the cookies, the url attribute to get the URL, and the history attribute to get the request history.
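A common companion to status_code is raise_for_status(), which raises an HTTPError for 4xx/5xx responses. Here a bare Response object is constructed by hand purely to illustrate; no request is actually sent:

```python
import requests

resp = requests.models.Response()
resp.status_code = 404  # pretend the server returned Not Found

try:
    resp.raise_for_status()  # no-op for 2xx, raises for 4xx/5xx
except requests.exceptions.HTTPError as exc:
    print('bad response:', exc)
```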
File Upload
import requests
files = {'file': open('favicon.ico', 'rb')}
r = requests.post('http://httpbin.org/post', files=files)
print(r.text)
Note that favicon.ico needs to be in the same directory as the script; to upload a different file, just change the file name in the code.
The website returns a response containing a files field, while the form field is empty, which shows that file uploads are carried in a separate files field.
Cookies
First look at getting cookies
import requests
r = requests.get('https://www.baidu.com')
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)
#<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
#BDORZ=27315
Here we first read the cookies attribute and successfully obtain the cookies, which are of type RequestsCookieJar. The items() method converts them into a list of name/value tuples, so we can traverse and print each cookie.
We can also directly use cookies to maintain the login status. The following uses Zhihu as an example.
import requests
header = {
'Host': 'www.zhihu.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0',
'Cookie':'_zap=4f14e95a-0cea-4c5e-b2f7-45cfd43f9809; d_c0="AJCZtnMxyBCPTgURbVjB11p6-JAwsTtJB4E=|1581092643"; _xsrf=VaBV0QQwGFjz01Q9n2AmjAilhHrJXypa; tst=h; q_c1=516c72a5ff954c66b6563ff42e63387d|1585814979000|1582104705000; tshl=; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1587179989,1587216876,1587216886,1587267050; capsion_ticket="2|1:0|10:1587267052|14:capsion_ticket|44:YTUwMDY3MGYyNmJlNDU0ZTgxNjlhNjMwNWNkYzAxNmQ=|7b8a5ebd3649fb076617379e734a79cd7ef54d1242ddd1841aba6844c9d14727"; l_cap_id="YjJiNjc1MzY0ZmEzNGNlYjlkYThkODEyYmEzOWRiOTk=|1587222516|5a01a93ea68209c1116647750ed2131efa309a3d"; r_cap_id="N2EwMjY0N2NlNTM1NGZlMjliNGNhMGJmOTkyMDc1OTE=|1587222516|238b677c781f1ef90a7ad343d6cdd3871aff3269"; cap_id="OTVhNjZiMDQ3MDkzNGVjY2I5ZTUyNTlhOTcxNzk3Njg=|1587222516|6dd1ed77526aa949bccd4146ef218d8164804a6e"; KLBRSID=031b5396d5ab406499e2ac6fe1bb1a43|1587267062|1587267049; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1587267062; SESSIONID=wopWDVALc4X3RJObFrIWNChoNDJpogYSdBPicuRm7vV; JOID=WlgXBkLsoG-SjPrGduF5tDN1xettk80YycmkhT2OnDWm0rGBFgxg_8GF8MN9HDmwsdmzwZheWKVLuonghNnDleo=; osd=V1gXB0vhoG-ThffGduBwuTN1xOJgk80ZwMSkhTyHkTWm07iMFgxh9syF8MJ0ETmwsNC-wZhfUahLuojpidnDlOM=; z_c0="2|1:0|10:1587267060|4:z_c0|92:Mi4xT2JORUJnQUFBQUFBa0ptMmN6SElFQ1lBQUFCZ0FsVk45Qk9KWHdEa0NUcXVheUJDdnJtRzRUVEFHNjFqQThvd013|bb30373e1f13c8b751a3ffc09e8ab4c98780350f77989d93b20be7eb3a0b2fad"'
}
r = requests.get('https://www.zhihu.com/hot', headers=header)
print(r.text)
The result includes content that is only visible after login. You can also set cookies through the cookies parameter, but then you need to construct a RequestsCookieJar object and split the cookie string yourself, which is comparatively cumbersome.
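A minimal sketch of that cookies-parameter route, splitting a cookie string into a RequestsCookieJar (the names and values below are shortened, hypothetical examples):

```python
import requests

cookie_str = '_zap=4f14e95a; BDORZ=27315'  # shortened example values
jar = requests.cookies.RequestsCookieJar()
for item in cookie_str.split('; '):
    key, _, value = item.partition('=')
    jar.set(key, value, domain='.zhihu.com')

# the jar can then be passed directly, e.g.:
# r = requests.get('https://www.zhihu.com/hot', cookies=jar, headers=header)
```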
Session maintenance
In requests, if you call methods such as get() or post() directly, each call is effectively a separate session, as if you had opened two different pages in two different browsers.
The way to solve this is to maintain the same session, which is equivalent to opening a new browser tab rather than a new browser. But if you don't want to set cookies by hand every time, there is a better tool: the Session object.
import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# {
#   "cookies": {
#     "number": "123456789"
#   }
# }
With Session, we can maintain the same session without worrying about cookies; it is usually used to simulate subsequent operations after a successful login.
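Beyond cookies, a Session can also carry default headers applied to every request it sends. A small sketch (the User-Agent string here is just an example):

```python
import requests

s = requests.Session()
# Headers and cookies set on the session apply to every request it sends
s.headers.update({'User-Agent': 'my-crawler/0.1'})  # example UA string
s.cookies.set('number', '123456789')  # cookies can also be seeded by hand

print(s.headers['User-Agent'])
print(s.cookies.get('number'))
```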
SSL certificate verification
In addition, requests also provides certificate verification. When sending an HTTPS request, it checks the SSL certificate; the verify parameter controls whether this check happens. If verify is not given, it defaults to True and the certificate is verified automatically.
For example, at one time the certificate of the 12306 website was not trusted by the default CA bundle.
import requests
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
Proxy settings
Frequent visits to a site may trigger a captcha or a redirect to a login authentication page. To avoid this, we can set up a proxy, which requires the proxies parameter. It can be set like this:
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('https://www.taobao.com', proxies=proxies)
# These proxies are placeholders; substitute your own.
requests also supports SOCKS proxy.
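Using a SOCKS proxy requires an extra dependency (pip install "requests[socks]"); the proxies dict then uses socks5:// URLs. The addresses below are placeholders:

```python
proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port',
}
# requests.get('https://www.taobao.com', proxies=proxies)  # needs requests[socks]
```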
Timeout setting
When the local network is poor, or the server is slow to respond or not responding at all, we may wait a long time before a response arrives, or eventually get an error with no response at all. To guard against this, we can set a timeout: if no response arrives within that period, an error is raised. This uses the timeout parameter. Note that the timeout is not a limit on the total download time; an exception is raised if the server has not responded within the given period. Examples are as follows:
import requests
r=requests.get('https://www.taobao.com', timeout=1)
print(r.status_code)
If you want to wait indefinitely, set timeout to None, or simply omit the parameter, since None is the default.
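The timeout can also be a (connect, read) tuple to bound the two phases separately. A sketch wrapping that in error handling (the URL and the numbers are just examples):

```python
import requests

def fetch(url):
    """Return the status code, or None if the request times out."""
    try:
        # 3.05 s to establish the connection, 27 s for the server to respond
        r = requests.get(url, timeout=(3.05, 27))
        return r.status_code
    except requests.exceptions.Timeout:
        return None

# fetch('https://www.taobao.com')  # requires network access
```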
Authentication
requests provides a simple shorthand: pass a tuple of username and password, and it will use the HTTPBasicAuth class for authentication by default.
import requests
r = requests.get('https://localhost:5000',
                 auth=('username', 'password'))
requests also provides other authentication methods, such as OAuth authentication.
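The tuple shorthand is equivalent to building the HTTPBasicAuth object yourself. Preparing a request locally (without sending it) shows the Authorization header it produces:

```python
import requests
from requests.auth import HTTPBasicAuth

req = requests.Request('GET', 'https://localhost:5000',
                       auth=HTTPBasicAuth('username', 'password')).prepare()
# Basic auth is just the base64 encoding of "username:password"
print(req.headers['Authorization'])
```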