Web crawler - requests

Introduction

The requests library can simulate the requests a browser sends. Compared with Python's built-in urllib, its API is more convenient (it is essentially a wrapper around urllib3).

Note: after sending a request, the requests library only downloads the page content and does not execute any JavaScript. For content that is loaded by JavaScript, we have to analyze the target site ourselves and then send a new request.

Installation

>: pip3 install requests

Usage

Request methods: the most commonly used are requests.get() and requests.post()
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

GET-based requests

  • Basic request

import requests

response = requests.get(
    url='https://www.example.com'  # placeholder for the destination URL
)
response.encoding = 'utf-8'
# print the contents of the response as text
print(response.text)
# write the text to a file
with open('xxx.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

  • GET request with parameters

GET is the default HTTP request method
      * no request body
      * the amount of data is limited (often quoted as 1K; the actual limit depends on the browser and the server)
      * GET request data is exposed in the browser's address bar

Common operations that send a GET request:
        1. Typing a URL directly into the browser's address bar always sends a GET request
        2. Clicking a hyperlink on a page always sends a GET request
        3. Submitting a form uses GET by default, but the form can be set to POST

Request parameters take the form key=value
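
As a minimal sketch of the key=value form, the query string of a URL can be split into its pairs with the standard library's urllib.parse (which is not part of requests). The URL is the pearvideo listing address that appears in the example at the end of this post; the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qsl

url = 'https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0'

# Everything after "?" is the query string: key=value pairs joined by "&".
query = urlsplit(url).query

# parse_qsl splits the query string into (key, value) tuples.
pairs = dict(parse_qsl(query))

print(query)
print(pairs)
```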

Method 1 for carrying parameters: splice them into the URL
import requests
response = requests.get(
    url='https://www.baidu.com/s?wd=animal pictures',
    # request headers
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
)
response.encoding = 'utf-8'
print(response.text)
with open('animal pictures 1.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

Method 2 for carrying parameters: the params argument
import requests
response = requests.get(
    url='https://www.baidu.com/s',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    },
    params={
        'wd': 'animal pictures'
    }
)
response.encoding = 'utf-8'
print(response.text)
with open('animal pictures 2.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
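
With the params argument, requests encodes the dictionary into the query string for us. The sketch below checks this without touching the network by building the request with requests.Request(...).prepare(), which returns the PreparedRequest that would be sent. The search term is only an example value.

```python
import requests
from urllib.parse import urlsplit, parse_qs

# Let requests encode the params dict into the query string (nothing is sent).
prepared = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': 'animal pictures'},
).prepare()

# The final URL carries the parameter after "?", with the space encoded.
print(prepared.url)

# Decoding the query string gives back the original key and value.
decoded = parse_qs(urlsplit(prepared.url).query)
print(decoded)
```

This is also a convenient way to inspect what a request will look like before sending it.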

POST-based requests

POST request
(1) The data does not appear in the address bar
(2) There is no upper limit on the size of the data
(3) There is a request body
(4) If the request body contains Chinese characters, they are URL-encoded!
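
A small offline sketch of points (3) and (4): preparing a POST request, without sending it, shows that the data dictionary ends up in the request body and that non-ASCII characters are URL-encoded. The URL is the httpbin endpoint used earlier in this post; the field names and values are made up for illustration.

```python
import requests
from urllib.parse import parse_qs

# Build the POST request without sending it.
prepared = requests.Request(
    'POST',
    'http://httpbin.org/post',
    data={'key': 'value', 'name': '动物图片'},  # illustrative fields
).prepare()

print(prepared.url)                      # the data does not appear in the URL
print(prepared.headers['Content-Type'])  # form encoding chosen by requests
print(prepared.body)                     # URL-encoded request body

# Decoding the body recovers the original values.
decoded = parse_qs(prepared.body)
print(decoded)
```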

# !!! requests.post() is used in the same way as requests.get(); the notable difference is the data parameter of requests.post(), which holds the request body data

  • Simulating browser login behavior

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
    'Referer': 'http://www.aa7a.cn/user.php?&ref=http%3A%2F%2Fwww.aa7a.cn%2F',
}
res = requests.post('http://www.aa7a.cn/user.php',
                    headers=headers,
                    data={
                        'username': '[email protected]',
                        'password': 'XXX',
                        'captcha': 'codes',
                        'remember': 1,
                        'ref': 'http://www.aa7a.cn/',
                        'act': 'act_login'
                    }
                    )
# If the login succeeds, the cookie is available on the res object
cookie = res.cookies.get_dict()

# Carry the cookie when requesting the home page
res = requests.get('http://www.aa7a.cn/', headers=headers,
                   cookies=cookie,
                   )

if '[email protected]' in res.text:
    print("Login successful")
else:
    print("Not logged in")

 

'''
1. Target site analysis
    Open https://github.com/login in a browser,
    then enter a wrong account and password and capture the traffic.
    The login turns out to be a POST submitted to https://github.com/session,
    and the request headers contain a cookie.
    The request body contains:
        commit: Sign in
        utf8: ✓
        authenticity_token: lbI8IJCwGslZS8qJPnof5e7ZkCoSoMn6jmDTsL1r/m06NLyIbw7vCrpwrFAPzHMep3Tmf/TSJVoXWrvDZaVwxQ==
        login: egonlin
        password: 123

2. Flow analysis
    First GET https://github.com/login to obtain the initial cookie and the authenticity_token.
    Then POST to https://github.com/session, carrying the initial cookie and the request body (authenticity_token, user name, password, etc.).
    Finally, obtain the login cookie.

PS: if the password were sent as ciphertext, you could enter a wrong account with the correct password and read the encrypted password from the captured request in the browser. GitHub sends the password in plaintext.
'''

import requests
import re

# First request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # get the initial cookie (not yet authorized)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # get the CSRF token from the page

# Second request: POST to the login URL with the initial cookie, the token, and the account and password
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '[email protected]',
    'password': 'alex3714'
}
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie
                   )

login_cookie = r2.cookies.get_dict()

# Third request: once logged in, login_cookie can be used to access pages such as personal settings
r3 = requests.get('https://github.com/settings/emails',
                  cookies=login_cookie)

print('[email protected]' in r3.text)  # True

Automatically logging in to GitHub (handling the cookie information ourselves)
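
As an alternative to passing cookies by hand, requests also provides requests.Session, which stores the cookies it receives and resends them on later requests. The sketch below restates the same flow under that assumption. The names github_login and extract_token are hypothetical, GitHub's login page may no longer match this pattern, and github_login is only defined here, not called. The last lines run offline, with a made-up HTML snippet and a hand-set cookie, to show the token extraction and that a session attaches its stored cookies.

```python
import re
import requests

TOKEN_PATTERN = re.compile(r'name="authenticity_token".*?value="(.*?)"', re.S)


def extract_token(html):
    """Return the first authenticity_token value found in the page, or None."""
    found = TOKEN_PATTERN.findall(html)
    return found[0] if found else None


def github_login(session, login, password):
    """Hypothetical helper: the same three requests, with cookies kept by the session."""
    r1 = session.get('https://github.com/login', timeout=10)
    data = {
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': extract_token(r1.text),
        'login': login,
        'password': password,
    }
    session.post('https://github.com/session', data=data, timeout=10)
    return session.get('https://github.com/settings/emails', timeout=10)


# Offline illustration: the sample HTML and the cookie are made up.
sample_html = '<input type="hidden" name="authenticity_token" value="abc123" />'
print(extract_token(sample_html))

session = requests.Session()
session.cookies.set('demo', '1')  # stands in for a cookie a server would set
prepared = session.prepare_request(
    requests.Request('GET', 'https://github.com/settings/emails')
)
print(prepared.headers.get('Cookie'))
```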

Supplement

  • Request headers and getting the User-Agent

Adding headers: the server checks the request headers, and a request without them may be denied access.
When sending a request we usually need to include request headers, because they are the key to disguising the script as a browser. Commonly useful request headers are:
Host
Referer  # large sites often use this header to determine where a request came from
User-Agent  # identifies the client
Cookie  # cookies are part of the request headers, but the requests module has a separate cookies parameter for them, so do not put them in headers={}
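
A minimal offline sketch of that last point: headers go in headers=, cookies go in cookies=, and requests builds the Cookie header itself. The cookie name and value are made up; the User-Agent and site address are taken from examples in this post. Nothing is sent over the network.

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Referer': 'http://www.aa7a.cn/',
}
cookies = {'sessionid': 'xxx'}  # hypothetical cookie

# Prepare the request without sending it, then inspect the final headers.
prepared = requests.Request(
    'GET', 'http://www.aa7a.cn/', headers=headers, cookies=cookies
).prepare()

for name, value in prepared.headers.items():
    print(name, ':', value)
```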

Getting the browser's User-Agent

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
response = requests.get(
    url='https://www.baidu.com/s',
    headers=headers,
    params={
        'wd': 'animal pictures'
    }
)
response.encoding = 'utf-8'
print(response.text)
print(response.status_code)  # prints the response status code, e.g. 200
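
Printing the status code shows whether the request worked, but a script usually needs to react to it. A hedged sketch: fetch_text is a hypothetical helper that adds a timeout and calls raise_for_status(), which raises requests.exceptions.HTTPError for 4xx and 5xx responses. The Response object at the end is built by hand only so that the sketch runs without network access.

```python
import requests


def fetch_text(url, headers=None, params=None):
    """Hypothetical helper: GET a page and return its text, or raise on an HTTP error."""
    response = requests.get(url, headers=headers, params=params, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    response.encoding = 'utf-8'
    return response.text


# Offline illustration of raise_for_status() on a hand-built response.
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    outcome = 'ok'
except requests.exceptions.HTTPError:
    outcome = 'failed'
print(outcome)
```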

 

 

Example:

# Crawl videos
#https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=48&mrd=0.9993282952193101&filterIds=1625835,1625642,1625837,1625841,1625870,1625869,1625813,1625844,1625801,1625856,1625857,1625847,1625838,1625827,1625787
#https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0
# Fetch the videos
import re
import requests

# request the listing page
res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=0')

# each match is the relative link to one video's detail page
reg_text = '<a href="(.*?)" class="vervideo-lilink actplay">'

obj = re.findall(reg_text, res.text)
print(obj)
for url in obj:
    url = 'https://www.pearvideo.com/' + url
    res1 = requests.get(url)
    # the detail page embeds the real video address in srcUrl="..."
    obj1 = re.findall('srcUrl="(.*?)"', res1.text)
    print(obj1[0])
    # use the last path segment as the file name
    name = obj1[0].rsplit('/', 1)[1]
    print(name)
    res2 = requests.get(obj1[0])
    with open(name, 'wb') as f:
        # write the video to disk chunk by chunk
        for chunk in res2.iter_content(chunk_size=8192):
            f.write(chunk)
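
Two details of the loop above can be tried offline: building the absolute page URL and deriving the file name from the video address. The HTML snippets and the video address are invented samples shaped to match the regular expressions in the code; urljoin from the standard library is a swapped-in alternative to plain string concatenation.

```python
import re
from urllib.parse import urljoin

# Invented samples that mimic the listing page and the detail page.
listing_html = '<a href="video_1625835" class="vervideo-lilink actplay">'
detail_html = 'srcUrl="https://video.example.com/mp4/sample-clip.mp4"'

# Step 1: pull the relative link out of the listing and make it absolute.
links = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', listing_html)
page_url = urljoin('https://www.pearvideo.com/', links[0])

# Step 2: pull the video address out of the detail page
# and keep the last path segment as the file name.
src_url = re.findall('srcUrl="(.*?)"', detail_html)[0]
name = src_url.rsplit('/', 1)[1]

print(page_url)
print(name)
```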

Origin www.cnblogs.com/waller/p/11928802.html