Introduction
Browser requests can be simulated with the requests module, which offers a more convenient API than Python's built-in urllib (it essentially wraps urllib3).
Note: after sending a request, the requests library only downloads the raw page content; it does not execute any JavaScript. If the data you need is loaded by JS, you must analyze the target site yourself and send additional requests for it.
Installation

```shell
pip3 install requests
```
Usage

A helper exists for each request method; the most commonly used are requests.get() and requests.post():

```python
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
```
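Each helper corresponds to one HTTP verb and returns a Response object. To see what a call would send without touching the network, you can prepare the request locally (a minimal sketch; httpbin.org is just a placeholder host, nothing is actually sent):

```python
import requests

# build the request that requests.post(...) would send, without sending it
req = requests.Request('POST', 'http://httpbin.org/post', data={'key': 'value'}).prepare()
print(req.method)  # POST
print(req.url)     # http://httpbin.org/post
print(req.body)    # key=value (the data dict is form-encoded)
```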
GET-based requests
Basic request
```python
import requests

response = requests.get(url='https://www.example.com')  # placeholder target URL
response.encoding = 'utf-8'
# print the response content as text
print(response.text)
# write the text to a file
with open('xxx.html', 'w') as f:
    f.write(response.text)
```
GET requests with parameters

GET is HTTP's default request method:
- a GET request has no request body
- the data must stay within 1K
- GET request data is exposed in the browser's address bar

Common operations that issue GET requests:
1. typing a URL directly into the browser's address bar
2. clicking a hyperlink on a page
3. submitting a form: forms use GET by default, but can be set to POST
Request parameters take the form key=value.

Method 1 for carrying parameters: splice them into the URL.

```python
import requests

response = requests.get(
    url='https://www.baidu.com/s?wd=动物图片',
    # request headers
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
)
response.encoding = 'utf-8'
print(response.text)
with open('动物图片1.html', 'w') as f:
    f.write(response.text)
```

Method 2 for carrying parameters: the params argument.

```python
import requests

response = requests.get(
    url='https://www.baidu.com/s',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    },
    params={'wd': '动物图片'}
)
response.encoding = 'utf-8'
print(response.text)
with open('动物图片2.html', 'w') as f:
    f.write(response.text)
```
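Both forms produce the same final URL: the params dict is URL-encoded and appended to the query string for you. A quick offline check using a prepared request (a sketch; nothing is actually sent):

```python
import requests

# params are URL-encoded and spliced onto the query string automatically
req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': '动物图片'}).prepare()
print(req.url)  # https://www.baidu.com/s?wd=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87
```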
POST-based requests

POST requests:
(1) the data does not appear in the address bar
(2) there is no upper limit on the size of the data
(3) there is a request body
(4) if the request body contains Chinese characters, they are URL-encoded

!!! requests.post() is used exactly like requests.get(); the only difference is that requests.post() has a data parameter for carrying the request body.
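Point (4) can be verified offline: preparing a POST with a Chinese character in the data shows the URL-encoded body (a sketch; httpbin.org is a placeholder host):

```python
import requests

req = requests.Request('POST', 'http://httpbin.org/post', data={'key': '值'}).prepare()
print(req.body)                     # key=%E5%80%BC (the Chinese character is URL-encoded)
print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
```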
Simulating browser login
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
    'Referer': 'http://www.aa7a.cn/user.php?&ref=http%3A%2F%2Fwww.aa7a.cn%2F',
}
res = requests.post(
    'http://www.aa7a.cn/user.php',
    headers=headers,
    data={
        'username': '[email protected]',
        'password': 'xxx',
        'captcha': 'codes',
        'remember': 1,
        'ref': 'http://www.aa7a.cn/',
        'act': 'act_login'
    }
)
# if the login succeeded, the cookie is present in the res object
cookie = res.cookies.get_dict()

# carry the cookie and request the home page
res = requests.get('http://www.aa7a.cn/', headers=headers, cookies=cookie)
if '[email protected]' in res.text:
    print('login successful')
else:
    print('not logged in')
```
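The cookies= argument is folded into the Cookie request header for you; this can be checked without contacting the site (a sketch; the cookie name is made up):

```python
import requests

# a cookies dict becomes the Cookie request header
req = requests.Request('GET', 'http://www.aa7a.cn/', cookies={'ECS_ID': 'abc123'}).prepare()
print(req.headers.get('Cookie'))  # ECS_ID=abc123
```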
```python
'''
1. Analyze the target site
Open https://github.com/login in a browser, enter a wrong account/password,
and capture the traffic. The login is a POST to https://github.com/session;
the request headers contain a cookie and the request body contains:
    commit: Sign in
    utf8: ✓
    authenticity_token: lbI8IJCwGslZS8qJPnof5e7ZkCoSoMn6jmDTsL1r/m06NLyIbw7vCrpwrFAPzHMep3Tmf/TSJVoXWrvDZaVwxQ==
    login: egonlin
    password: 123

2. Flow analysis
GET https://github.com/login: obtain the initial cookie and authenticity_token from the response.
POST https://github.com/session: carry the initial cookie and a request body
(authenticity_token, username, password, etc.); the cookie returned by this
request is the login cookie.

PS: if the password is sent as ciphertext, deliberately submit a wrong
account/password and grab the encrypted password from the browser;
github sends the password in plaintext.
'''
import requests
import re

# first request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # initial (unauthorized) cookie
# extract the CSRF token from the page
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]

# second request: POST to the login URL with the initial cookie and the token,
# carrying the account and password
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '[email protected]',
    'password': 'alex3714'
}
r2 = requests.post('https://github.com/session', data=data, cookies=r1_cookie)
login_cookie = r2.cookies.get_dict()

# third request: once logged in, carry login_cookie to access pages such as
# the personal settings
r3 = requests.get('https://github.com/settings/emails', cookies=login_cookie)
print('[email protected]' in r3.text)  # True: logged in to github automatically
```
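An alternative worth knowing: requests.Session keeps cookies across requests automatically, which removes the manual r1_cookie/login_cookie bookkeeping. The sketch below fakes the cookie a login response would set (the name user_session and its value are made up) and shows the session attaching it to a later request:

```python
import requests

session = requests.Session()
# pretend an earlier login response set this cookie (name/value are made up);
# in real use, session.get()/session.post() store response cookies here
session.cookies.set('user_session', 'abc123', domain='github.com')

# the session merges its stored cookies into every request it prepares
prepared = session.prepare_request(requests.Request('GET', 'https://github.com/settings/emails'))
print(prepared.headers.get('Cookie'))  # user_session=abc123
```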
Supplement

Request headers and the User-Agent
Add headers. The server inspects the request headers, and without them access may be denied, so we usually need to send requests with headers attached; the User-Agent header is the key to disguising the request as coming from a browser. Commonly useful request headers:

- Host
- Referer  (large sites usually use this to determine where the request came from)
- User-Agent  (identifies the client)
- Cookie  (although cookies belong in the request headers, requests handles them with a separate parameter, so do not put them in headers={})
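For comparison, requests' own default User-Agent identifies itself as python-requests, which is exactly what many sites reject; a headers dict passed to the request replaces it (a minimal offline sketch):

```python
import requests

# the default User-Agent gives the crawler away
print(requests.utils.default_headers()['User-Agent'])  # python-requests/x.y.z

# a headers dict passed to the request overrides it
req = requests.Request(
    'GET', 'https://www.baidu.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'},
).prepare()
print(req.headers['User-Agent'])  # Mozilla/5.0 (Windows NT 10.0; WOW64)
```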
Get the User-Agent from your browser
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
response = requests.get(
    url='https://www.baidu.com/s',
    headers=headers,
    params={'wd': '动物图片'}
)
response.encoding = 'utf-8'
print(response.text)
print(response.status_code)  # print the response status code, e.g. 200
```
Example:
```python
# crawl videos
# https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=48&mrd=0.9993282952193101&filterIds=1625835,1625642,1625837,1625841,1625870,1625869,1625813,1625844,1625801,1625856,1625857,1625847,1625838,1625827,1625787
# https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0
import re
import requests

# fetch the video list page
res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0')
reg_text = '<a href="(.*?)" class="vervideo-lilink actplay">'
obj = re.findall(reg_text, res.text)
print(obj)
for url in obj:
    url = 'https://www.pearvideo.com/' + url
    res1 = requests.get(url)
    # the real video address is in the srcUrl variable of the detail page
    obj1 = re.findall('srcUrl="(.*?)"', res1.text)
    print(obj1[0])
    name = obj1[0].rsplit('/', 1)[1]
    print(name)
    res2 = requests.get(obj1[0])
    with open(name, 'wb') as f:
        for line in res2.iter_content():
            f.write(line)
```
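One refinement for the download step above: without stream=True, requests loads the whole body into memory before iter_content() yields anything, and the default chunk size is tiny. A hedged sketch of a download helper under that assumption (the function name and chunk size are my own choices):

```python
import requests

def download(url, name, chunk_size=1024 * 64):
    """Stream a large file to disk without holding it all in memory."""
    # stream=True defers the body download until iter_content() is consumed
    res = requests.get(url, stream=True)
    with open(name, 'wb') as f:
        for chunk in res.iter_content(chunk_size=chunk_size):
            if chunk:  # skip keep-alive chunks
                f.write(chunk)
```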