Crawler basics: the requests module, simulated login, crawling a video site, cookie pools and proxy pools, forward and reverse proxies

1. Crawler definition:

  A program that sends requests to a website, fetches the returned resources, and then parses out and extracts the useful data.

2. The basic workflow of a crawler (a minimal end-to-end sketch follows this list):

  (1) Send a request:

    Use an HTTP library to send an HTTP request to the target site. The Request includes the request headers, request body, etc.

  (2) Get the response content:

    If the server responds normally, you get a Response. The Response body may contain HTML, JSON, images, video, etc.

  (3) Parse the content:

    HTML data: regular expressions, or a third-party parsing library such as BeautifulSoup or pyquery. JSON data: the json module. Binary data: write it to a file in "wb" mode.

  (4) Store the data:

    In files or a database.
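A minimal sketch of the four steps chained together, assuming requests is installed; the test endpoint https://httpbin.org/get is a stand-in for this sketch, not a site from the original post:

import json
import requests

# (1) Send a request (the target URL is a stand-in test endpoint)
res = requests.get('https://httpbin.org/get',
                   headers={'User-Agent': 'Mozilla/5.0'})

# (2) Get the response content
print(res.status_code)

# (3) Parse the content; this endpoint happens to return JSON
data = json.loads(res.text)

# (4) Store the data in a file
with open('data.json', 'w') as f:
    json.dump(data, f)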

3. Cookie pool and proxy pool:

  Cookie pool: some sites limit how often a single user may visit, so a pool of cookies is kept in order to simulate visits from different logged-in users.

  Proxy pool: a pool of IP addresses used to simulate visits from different clients (see the rotation sketch below).
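A minimal sketch of drawing from both pools on each request; the cookie values and proxy addresses here are made-up placeholders:

import random
import requests

# Made-up placeholder pools; in practice these come from stored
# logins and from verified proxy servers.
cookie_pool = [{'ECS_ID': 'aaa'}, {'ECS_ID': 'bbb'}]
proxy_pool = ['http://10.0.0.1:8888', 'http://10.0.0.2:8888']

# Each request goes out with a random identity and a random exit IP
res = requests.get(
    'http://www.aa7a.cn/',
    cookies=random.choice(cookie_pool),
    proxies={'http': random.choice(proxy_pool)},
)
print(res.status_code)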

4. Forward proxy and reverse proxy:

  Forward proxy: acts for the client. For example, getting over the firewall to reach Google: you send the request to a proxy server, the proxy server makes the request to Google, and the response data comes back to you through the proxy (see the requests sketch below).

  Reverse proxy: acts for the server. You send a request to one address (one IP); it reaches nginx (the reverse proxy server), which forwards it to a back-end server (there may be many of them, but to the client it looks like a single server).
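From the crawler's side, a forward proxy is simply the proxies parameter of requests; a minimal sketch, assuming a proxy is listening at the made-up address 127.0.0.1:7890:

import requests

# All traffic goes through the forward proxy (made-up address);
# the target site sees the proxy's IP instead of ours.
proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890',
}
res = requests.get('https://www.google.com', proxies=proxies)
print(res.status_code)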

5. Simulated login:

import requests

# Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}
# POST the login request with the form parameters
res = requests.post(
    'http://www.aa7a.cn/user.php',
    headers=headers,
    data={
        'username': '[email protected]',
        'password': '147258369..',
        'captcha': 'ubw3',
        'remember': 1,
        'ref': 'http://www.aa7a.cn/',
        'act': 'act_login'
    })
# After a successful login the response carries the session cookies
cookie = res.cookies.get_dict()
# Verify the login: request the home page with the cookies and check
# that the page contains the logged-in username
res = requests.get('http://www.aa7a.cn/', headers=headers, cookies=cookie)
if '[email protected]' in res.text:
    print('Login succeeded!')
else:
    print('Login failed!')
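The same login can also be written with requests.Session(), which records the cookies from the login response and sends them on later requests automatically; a minimal sketch of that variant:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
session = requests.Session()
# The session stores the cookies set by the login response...
session.post('http://www.aa7a.cn/user.php', headers=headers, data={
    'username': '[email protected]',
    'password': '147258369..',
    'captcha': 'ubw3',
    'remember': 1,
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
})
# ...and attaches them automatically here, no cookies= needed
res = session.get('http://www.aa7a.cn/', headers=headers)
print('[email protected]' in res.text)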


6. Case study: crawling Pear Video:

import re
import requests

# Steps for crawling Pear Video:
# 1. First get the target url whose content will be crawled, e.g.:
# https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=3&start=36&mrd=0.762001536578264&filterIds=1625871,1625877,1625918,1625694,1625682,1625684,1625687,1625674,1625664,1625661,1625648,1625645,1625634,1625614,1625604
# 2. Request the list page that holds the links we need
res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=3&start=36')
# 3. Use a regular expression to match the wanted content, i.e. lines like:
# <a href="video_1625614" class="vervideo-lilink actplay">
reg_txt = '<a href="(.*?)" class="vervideo-lilink actplay">'
obj = re.findall(reg_txt, res.text)
# print(obj)
# 4. Join the path to visit, e.g.: https://www.pearvideo.com/video_1625614
for url in obj:
    res_url = 'https://www.pearvideo.com/' + url
    # Fetch the video page
    res_v = requests.get(res_url)
    # 5. Match the final video address out of the page (a list)
    obj_v = re.findall('srcUrl="(.*?)"', res_v.text)
    # print(obj_v)
    # stream=True so the video is not pulled into memory in one piece
    obj1 = requests.get(obj_v[0], stream=True)
    # The file name is the last path segment of the video url
    name = obj_v[0].rsplit('/', 1)[1]
    print(name)
    # Write to the file in a loop; iter_content() yields the binary stream
    with open(name, 'wb') as f:
        for line in obj1.iter_content(chunk_size=1024):
            f.write(line)
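A note on the download loop: passing stream=True to requests.get and iterating with iter_content(chunk_size=1024) keeps memory use flat, because each video is written out chunk by chunk instead of being held in memory all at once.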