05 - requests crawler basics (2)

Focus today:

    1. Proxy server settings

    2. Simulated login with a captcha (static captcha)

    3. Cookies and sessions

    4. Thread pools



1. Proxy server settings

  Sometimes, when we keep crawling the same site from the same IP, the server will block us after a while. So how should we deal with this problem?

  Solution:

      If the target server sees someone else's IP address when we crawl, then even if it bans that IP it does not matter: we can simply switch to another IP and keep crawling.

So using a proxy server solves the problem.

      There are many proxy sites online; paid proxies are generally safer, though of course you still have to verify whether an IP is safe.

  Proxy types:

      HTTP: forwards only HTTP traffic

      HTTPS: forwards HTTPS traffic

  Anonymity levels:

      Transparent: the target server knows you are using a proxy and knows your real IP address

      Anonymous: the target server knows you are using a proxy, but does not know your real IP

      Elite (high anonymity): the target server does not even know you are using a proxy.

  Proxy websites:

      https://www.kuaidaili.com/free/ (fast free proxies, but still risky)

      The rest you can find with a quick search; there are many proxy servers online.

        Check your own IP first, then test the proxy's IP.

The proxy code is shown below:

import requests

url = "https://www.baidu.com/s?wd=ip"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
}
# the proxies parameter is a dict of protocol/address key-value pairs, here for http
response = requests.get(url=url, headers=headers, proxies={"http": "115.233.210.218:808"}).text
# write the page into a file in the current directory
with open("./ip.html", "w", encoding="utf-8") as fp:
    fp.write(response)
print(response)

Looking at the result, the IP address shown on the page is the proxy's address, not my own.
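Building on this, here is a minimal sketch of rotating through a pool of proxies, which is the "switch to another IP" idea described above. The proxy IPs below are placeholders you would replace with live, verified ones:

import random
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
}
# placeholder proxy pool; in practice fill this with proxies you have tested
proxy_pool = [
    {"http": "115.233.210.218:808"},
    {"http": "110.52.235.61:808"},
]

for _ in range(3):
    proxy = random.choice(proxy_pool)  # pick a (possibly different) proxy each request
    try:
        response = requests.get("https://www.baidu.com/s?wd=ip",
                                headers=headers, proxies=proxy, timeout=5)
        print(proxy, response.status_code)
    except requests.exceptions.RequestException:
        print(proxy, "failed, try another proxy")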


2. Simulated login with a captcha (static captcha)

  With proxies covered, suppose the next site we want to crawl requires login, and most login pages make the user enter a captcha; so we need our code to simulate a human entering the captcha.

  When facing a captcha in a real project, there are three ways to handle it:

    1) Manual entry. (For small projects we can simply type it in by hand.)

    2) Automatic recognition by our own program. (This is inefficient; it means machine-learning-based captcha recognition.)

    3) A captcha-solving API, where someone else enters the captcha for us through the interface, for a small fee.

  Whichever way we choose, we first have to work out where the captcha image comes from and whether the captcha request carries a cookie (cookies are covered later). Once that analysis is done, the recognized captcha text is passed as a dynamic parameter in the form data to be submitted, which completes the static (text-input) captcha verification.


One large platform you can use is Yundama (http://www.yundama.com/): top up an account, register your software, then copy the relevant Python developer sample code and log in with your account. Fill in the type of captcha to recognize, and the recognition result is returned; pass it in as a dynamic parameter.

   That solves the login problem, and it leads us to cookies.
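As a minimal sketch of method 1 (manual entry): the captcha URL, login URL, and form field names below are hypothetical stand-ins for whatever your own analysis of the target site finds. A Session is used so the cookie issued with the captcha is carried along with the login:

import requests

session = requests.Session()  # keeps the cookie set when the captcha is issued

# hypothetical captcha URL; replace with the one found by analysis
captcha_img = session.get("http://example.com/captcha.jpg").content
with open("captcha.jpg", "wb") as fp:
    fp.write(captcha_img)

code = input("Open captcha.jpg and type the captcha: ")  # manual entry (method 1)

# pass the recognized captcha as a dynamic parameter in the submitted form data
formdata = {"email": "user@example.com", "password": "***", "icode": code}
response = session.post("http://example.com/login", data=formdata)
print(response.status_code)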

3. Cookies and sessions

   When writing a crawler, if login or similar operations are involved, cookies are often needed. So what exactly is a cookie?

      Simply put, every web page we visit is fetched over HTTP, and HTTP is a stateless protocol; "stateless" means it cannot maintain state between requests.

      For example, with the HTTP protocol alone, if we log in to a site successfully and then visit another page of the same site, the login state disappears and we have to log in again. Any page update would require logging in repeatedly, which is very inconvenient.

      So we need some way to preserve session information such as the logged-in state. Two approaches are common: saving session information via cookies, or saving it via a server-side session. Let's discuss each in turn.

      If session information is saved via cookies, all of it is stored on the client. When we visit other pages of the same site, the session state is read from the corresponding cookie to determine, for example, whether we are logged in. Obviously, this approach uses cookies.

      If session information is saved via a server-side session, it is stored on the server, but the server sends the client a SessionID and similar data, which usually live in the client's cookie. (If the client disables cookies it can be stored by other means, but in most cases today this information still ends up in a cookie.) When the user visits other pages of the site, the SessionID is read from the cookie, the server uses it to look up that client's full session information in its session store, and session control proceeds from there. Clearly, even when session information is saved server-side, cookies are still used most of the time.

      From this analysis we can see that whichever mechanism handles session control, cookies are involved most of the time. For crawler logins: without a cookie, the login itself may succeed, but when we crawl other pages of the site we are still treated as not logged in; with a cookie, once we log in successfully, the logged-in state persists while we crawl the site's other pages.

In conclusion:

      1) Every web page we visit is fetched over HTTP, and HTTP is a stateless protocol, i.e. it cannot maintain state between requests.
      2) Two approaches to session control are common: storing session information in cookies, or storing it in a server-side session.
      3) With a server-side session, the session information lives on the server, but the server sends the client a SessionID and similar data, which usually live in the client's cookie (other storage is possible if cookies are disabled, but in most cases this information still ends up in a cookie). On later visits to the site's pages, the SessionID is read from the cookie and used to retrieve the client's full session information from the server for session control. So even the session approach usually relies on cookies.
      4) To find the real login address, we need to analyze the traffic. There are two main methods: press F12 to open the browser's debugging tools, or use an analysis tool such as Fiddler (a local debugging proxy that the client's traffic passes through).
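To see the difference in practice, here is a small sketch against httpbin.org (a public echo service, used here only for illustration): a plain requests.get forgets cookies between calls, while a requests.Session keeps them:

import requests

# plain requests: each call is independent, so the cookie is not kept
requests.get("https://httpbin.org/cookies/set/sessionid/12345")
print(requests.get("https://httpbin.org/cookies").text)   # {"cookies": {}}

# Session: the cookie set by the first response is carried by later requests
session = requests.Session()
session.get("https://httpbin.org/cookies/set/sessionid/12345")
print(session.get("https://httpbin.org/cookies").text)    # {"cookies": {"sessionid": "12345"}}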

Logging in to Renren (renren.com):

import requests

if __name__ == "__main__":

    # login request url (found with a packet-capture tool)
    post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201873958471'
    # create a session object; it stores cookies and carries them on later requests automatically
    session = requests.Session()
    # disguise the UA
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }
    formdata = {
        'email': '15675334817',
        'icode': '',
        'origURL': 'http://www.renren.com/home',
        'domain': 'renren.com',
        'key_id': '1',
        'captcha_type': 'web_login',
        'password': '02eb37d2b5474eb9b086c7cdea2adea2f3af747d674cb598c1fab6732a62e555',
        'rkey': '3c7c4d177b8f7edfa59e7f629fcee3bc',
        'f': 'http%3A%2F%2Fwww.renren.com%2F972358414',
    }
    # send the login request with the session so its cookie is saved for later requests
    session.post(url=post_url, data=formdata, headers=headers)

    get_url = 'http://www.renren.com/972358414/profile'
    # send the next request with the same session; it already carries the cookie
    response = session.get(url=get_url, headers=headers)
    # write the response content to a file
    with open('./renren.html', 'w', encoding="utf-8") as fp:
        fp.write(response.text)
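To confirm the session actually captured the login cookie, a quick check (not part of the original code) is to inspect its cookie jar:

# the Session object exposes the cookies it has stored so far
print(session.cookies.get_dict())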

4. Thread pools

    Crawling data with a thread pool based on multiprocessing.dummy

Exercise: crawl video data from Pear Video (pearvideo.com)


import requests
from lxml import etree
import re
import random
from multiprocessing.dummy import Pool


# downloading and saving videos means long IO waits; this IO-bound work
# suits multithreading, so we bring in a thread pool
pool = Pool()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
}

url = "https://www.pearvideo.com/category_1"

vedio_page = requests.get(url=url, headers=headers).text

tree = etree.HTML(vedio_page)
name_url = tree.xpath('//*[@id="listvideoList"]/ul/li/div/a/@href')

vedio_urls = []  # store the video urls

for i in name_url:
    vedios_url = "https://www.pearvideo.com/" + i
    res = requests.get(url=vedios_url, headers=headers).text
    # the real video address is set by js, so match it with a regular expression
    ex = 'srcUrl="(.*?)",vdoUrl'
    vedio = re.findall(ex, res, re.S)[0]
    vedio_urls.append(vedio)
print(vedio_urls)

# download the video data with the thread pool
def load(link):
    result = requests.get(url=link, headers=headers).content
    return result

video_data_list = pool.map(load, vedio_urls)

# save the video data with the thread pool
def save(vedio_data):
    name = str(random.randint(0, 9999)) + ".mp4"
    with open(name, "wb") as fp:
        fp.write(vedio_data)
        print(name, ": download success")

pool.map(save, video_data_list)


pool.close()  # stop accepting new tasks
pool.join()   # wait for all worker threads to finish

Summary of the Pear Video crawl: first find the tag that holds each video and follow it to the video's own page. The video address requested there comes back empty at first; careful searching shows the displayed video address is controlled by js.

bs4 and lxml both parse tags, but what we need here lives in js code, so we use a regular expression to match the video URL ending in .mp4. Then the thread pool downloads the data, and the thread pool saves the binary data. (Analyze the idea carefully before you start writing code, or problems are likely.)
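One last note on sizing: Pool() defaults to the machine's CPU count, but for IO-bound downloads a larger pool often helps, since each thread mostly waits on the network. A minimal sketch with an explicit pool size (the URL list is a placeholder):

import requests
from multiprocessing.dummy import Pool

urls = ["https://www.pearvideo.com/category_1"] * 4  # placeholder task list

def fetch(url):
    # each worker thread blocks on network IO independently
    return len(requests.get(url).content)

pool = Pool(8)  # 8 worker threads for IO-bound work
print(pool.map(fetch, urls))
pool.close()
pool.join()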

 
