A, The Requests library
1, GET requests
Fetching a web page (set or modify the request headers so the site does not block the request)
# Fetch a page from Zhihu
import re
import requests

# Browser identification (User-Agent)
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
# Extract the question titles; re.S lets '.' match newlines as well
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)
Fetching binary data (images, audio, video, ...)
# Fetch binary data: download the GitHub favicon into the current directory
import requests

r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)  # r.content holds the raw bytes of the response
2, POST requests (submitting form data)
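A form submission with POST might look like the following minimal sketch. The httpbin.org echo endpoint, the form field names, and the helper name submit_form are assumptions chosen for illustration; they do not come from the original notes.

```python
import requests


def submit_form(url, form, timeout=10):
    """POST `form` as URL-encoded form data and return the parsed JSON reply.

    submit_form is a hypothetical helper name used only in this sketch.
    """
    r = requests.post(url, data=form, timeout=timeout)
    r.raise_for_status()  # raise requests.HTTPError on a 4xx/5xx status
    return r.json()


if __name__ == "__main__":
    # Illustrative form fields; httpbin.org echoes them back under "form"
    reply = submit_form("http://httpbin.org/post", {"name": "germey", "age": "22"})
    print(reply["form"])
```

Passing a dict through data= sends it as an ordinary HTML form body; passing it through json= would send a JSON body instead.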
3, Responses (sending a request returns a Response object)
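The most commonly used attributes of a Response can be collected as in the sketch below. The helper name summarize and the target URL are assumptions for illustration; the attributes themselves (status_code, ok, url, headers, cookies, history, content) are part of the Requests Response object.

```python
import requests


def summarize(response):
    """Collect commonly inspected Response attributes into a dict.

    summarize is a hypothetical helper name used only in this sketch.
    """
    return {
        "status_code": response.status_code,                  # e.g. 200, 404
        "ok": response.ok,                                    # True when status is below 400
        "url": response.url,                                  # final URL after redirects
        "content_type": response.headers.get("Content-Type"),
        "cookies": response.cookies.get_dict(),               # cookies set by the server
        "redirects": [h.status_code for h in response.history],
        "size": len(response.content),                        # body length in bytes
    }


if __name__ == "__main__":
    r = requests.get("http://httpbin.org/get", timeout=10)
    print(summarize(r))
    # requests.codes maps readable names to status codes
    print(r.status_code == requests.codes.ok)
```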
4, File upload
# File upload
import requests

# Open the file in binary mode; the with block closes it after the upload
with open('favicon.ico', 'rb') as f:
    files = {'file': f}
    r = requests.post('http://httpbin.org/post', files=files)
print(r.text)
5, Getting and setting Cookies, and staying logged in
# Get cookies
import requests

r = requests.get('https://baidu.com')
print(r.cookies)  # a RequestsCookieJar
# items() converts the cookies into a list of (name, value) tuples,
# so each cookie's name and value can be read in a loop
for key, value in r.cookies.items():
    print(key + '=' + value)
# Stay logged in with cookies: copy the Cookie header from a page where you
# are already logged in and send it in the request headers
import requests

headers = {
    'cookie': '_zap=bf241714-d6f9-4e5f-9608-fa7b85f32db6; _xsrf=79ff86e9-5e76-4fa6-a384-4f528af88eb9; d_c0="AHBWI_Li4RCPTpXmzvEr1EkNgFDaBMtY-nA=|1582816893"; __guid=74140564.2608088362801457700.1582816893817.983; _ga=GA1.2.1547480939.1582816896; _gid=GA1.2.1931834251.1582816896; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1582816896; capsion_ticket="2|1:0|10:1582818182|14:capsion_ticket|44:MDhiMmFkNmY0YjI1NGRkYzgxMGZkY2Q3Mzk3YWYxZjU=|5bfdca13743bf8cb5de50f1c152f7d51120a4bf811eb2bfafdfc1079d69ffa9d"; z_c0="2|1:0|10:1582818209|4:z_c0|92:Mi4xSU00SERnQUFBQUFBY0ZZajh1TGhFQ2NBQUFDRUFsVk5vSEJfWGdDU2JjQkRxS3JNdElMNmZ3UjIzUVZ1WThyWWFn|61d7ba8d2dca14b10c7004277e43687cc4ef25116720ae3649d656dcc8cfef26"; monitor_count=3; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1582818210; KLBRSID=e42bab774ac0012482937540873c03cf|1582818280|1582816893',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
r = requests.get('https://www.zhihu.com/people/kuluma-59', headers=headers)
print(r.text)
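Cookies can also be set through the cookies parameter instead of the headers. The sketch below builds a RequestsCookieJar from a cookie string of the kind copied from a browser; the helper name jar_from_cookie_string, the sample cookie values, and the httpbin.org URL are hypothetical and serve only as an illustration.

```python
import requests
from requests.cookies import RequestsCookieJar


def jar_from_cookie_string(cookie_string):
    """Turn 'name=value; name2=value2' into a RequestsCookieJar.

    jar_from_cookie_string is a hypothetical helper name used only in this sketch.
    """
    jar = RequestsCookieJar()
    for pair in cookie_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments such as a trailing ';'
        # Split on the first '=' only, since values may contain '=' themselves
        name, _, value = pair.partition("=")
        jar.set(name, value)
    return jar


if __name__ == "__main__":
    # Sample values for illustration only
    jar = jar_from_cookie_string("number=123456789; theme=dark")
    r = requests.get("http://httpbin.org/cookies", cookies=jar, timeout=10)
    print(r.text)
```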
6, Maintaining a session: the Session object
Each call to get() or post() is a separate session, as if each request were sent from a different browser, so cookies set by one request are not sent with the next.
A Session object keeps one session across requests and handles cookies automatically.
It is typically used to simulate a login and then request pages that require it.
import requests

# First request: the server sets a cookie
requests.get('http://httpbin.org/cookies/set/number/123456789')
# Second request: a separate session, so the cookie is not sent
r = requests.get('http://httpbin.org/cookies')
print(r.text)  # "cookies": {}

# A Session object keeps the same session across requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)  # "cookies": {"number": "123456789"}
7, SSL certificate verification: the verify parameter
If a request fails with an SSLError, certificate verification failed, for example because the site's certificate was not issued by a trusted CA.
verify defaults to True; setting it to False skips verification so that the request can succeed.
import requests
import urllib3

urllib3.disable_warnings()  # suppress the InsecureRequestWarning emitted when verify=False
r = requests.get('https://www.12306.cn', verify=False)
print(r.status_code)
8, Proxy settings: the proxies parameter
Sending frequent requests to a site at scale may trigger a CAPTCHA, a redirect to the login page, or an IP ban; routing the requests through a proxy is a common way around this.
※ HTTP proxy
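import requests

# user, password, host, and port are placeholders for a real proxy
proxies = {
    'http': 'http://user:password@host:port',
    'https': 'http://user:password@host:port',  # needed for the https:// URL below
}
requests.get('https://www.taobao.com', proxies=proxies)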
※ SOCKS proxy
Install SOCKS support: pip install 'requests[socks]'
import requests

# user, password, host, and port are placeholders for a real proxy
proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port',
}
requests.get('https://www.taobao.com', proxies=proxies)
9, Timeouts: the timeout parameter
If no response arrives within the set time, an exception is raised.
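A minimal sketch of the timeout parameter follows. The helper name fetch_status and the choice to return None on a timeout are assumptions for illustration; requests.exceptions.Timeout is the exception Requests raises when the limit is exceeded.

```python
import requests


def fetch_status(url, timeout=1):
    """Return the HTTP status code, or None if the request timed out.

    fetch_status is a hypothetical helper name used only in this sketch.
    """
    try:
        r = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts
        return None
    return r.status_code


if __name__ == "__main__":
    # A single number applies to the connect phase and the read phase alike
    print(fetch_status("https://www.taobao.com", timeout=1))
    # A (connect, read) tuple sets the two limits separately
    print(fetch_status("https://www.taobao.com", timeout=(3, 10)))
```

Without a timeout argument, Requests waits indefinitely, so setting one explicitly is usually worthwhile in a crawler.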
10, Authentication
import requests

r = requests.get('http://localhost:5000', auth=('username', 'password'))
print(r.status_code)  # 200 if the username and password are correct, otherwise 401
11, Prepared Request
Requests can represent a request as a data structure: a Request object, which is prepared into a PreparedRequest before it is sent.
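Building and sending a prepared request might look like the sketch below. The helper name build_prepared, the httpbin.org URL, and the form and header values are assumptions for illustration; Request, Session.prepare_request, and Session.send are part of the Requests API.

```python
from requests import Request, Session


def build_prepared(url, data, headers):
    """Build a POST request as a data structure without sending it.

    build_prepared is a hypothetical helper name used only in this sketch.
    Returns the session together with the PreparedRequest so the caller
    can send it later through the same session.
    """
    session = Session()
    request = Request("POST", url, data=data, headers=headers)
    # prepare_request merges session-level settings (headers, cookies) in
    prepared = session.prepare_request(request)
    return session, prepared


if __name__ == "__main__":
    session, prepared = build_prepared(
        "http://httpbin.org/post",
        {"name": "germey"},            # illustrative form data
        {"User-Agent": "demo-agent"},  # illustrative header value
    )
    # The prepared request can be inspected (or queued) before it is sent
    print(prepared.method, prepared.url)
    print(prepared.body)
    response = session.send(prepared, timeout=10)
    print(response.status_code)
```

Treating each request as an independent object makes it straightforward to put requests in a queue and schedule them, which is useful when building a crawler.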