Crawler proxy (Proxy) settings

By default, the crawler uses the http_proxy environment variable to set its HTTP proxy. If a website detects too many requests from a single IP within a certain period, it will ban that IP from visiting. So you can set up some proxy servers to help with the job and switch to a different proxy from time to time, so you don't have to worry about suddenly being blocked while crawling large amounts of data.
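As a quick illustration of that environment-variable route (a minimal sketch, not from the original post; the proxy address is a placeholder): requests reads http_proxy/https_proxy automatically, so a proxy can be applied without touching the request code.

import os
import requests

# requests picks these up automatically (trust_env is on by default)
os.environ['http_proxy'] = 'http://114.97.184.251:808'   # placeholder proxy address
os.environ['https_proxy'] = 'http://114.97.184.251:808'  # placeholder proxy address

print(requests.get('http://httpbin.org/ip').text)  # should report the proxy's IP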
The proxy IPs in this article come from the free domestic high-anonymity HTTP proxy list (page 1) at http://www.xicidaili.com/nn/.
This post mainly covers how to set up the proxy. On to the code!

from bs4 import BeautifulSoup
import requests
import random


def get_ip_list(url, headers):  # collect all proxy IPs listed on the page
    web_data = requests.get(url, headers=headers)  # fetch the page response
    soup = BeautifulSoup(web_data.text, 'lxml')  # parse the page
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):  # skip the table header row
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        httptype = str.lower(tds[5].text)  # lowercase the protocol type
        ip_list.append(httptype + '://' + tds[1].text + ':' + tds[2].text)
    return ip_list


# pick one IP at random from the many collected
def get_random_ip(ip_list):
    proxy_ip = random.choice(ip_list)
    return proxy_ip

if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    # set a request header to simulate a real browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
    ip_list = get_ip_list(url, headers=headers)  # call get_ip_list to obtain the list of proxy IPs
    proxies = get_random_ip(ip_list)  # call get_random_ip to pick one at random from the list
    print(proxies)
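Free proxies go stale quickly, so it can be worth checking that a scraped proxy actually answers before relying on it. Here is a minimal sketch of such a check; the helper name check_proxy and the httpbin.org test URL are my own choices, not part of the original post.

def check_proxy(proxy_url, timeout=3):
    # return True if the proxy relays a simple request within the timeout
    scheme = proxy_url.split('://')[0]  # 'http' or 'https'
    try:
        requests.get('http://httpbin.org/ip', proxies={scheme: proxy_url}, timeout=timeout)
        return True
    except requests.RequestException:  # dead, slow, or refusing connections
        return False

working_ips = [p for p in ip_list if check_proxy(p)]  # keep only live proxies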

Then add the IP you obtained to the request's proxies parameter, and you're all set!

proxies = {'http': 'http://114.97.184.251:808',
           'https': 'https://119.135.85.253:808'}
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
headers = {'User-Agent': user_agent}
html_text = requests.get(url, headers=headers, timeout=3, proxies=proxies).text
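Note that get_random_ip() returns a single string such as 'http://114.97.184.251:808', while requests expects the dict form shown above. A small sketch bridging the two, assuming the httptype://ip:port format built by get_ip_list:

proxy_ip = get_random_ip(ip_list)  # e.g. 'http://114.97.184.251:808'
scheme = proxy_ip.split('://')[0]  # 'http' or 'https'
proxies = {scheme: proxy_ip}       # dict form that requests expects
html_text = requests.get(url, headers=headers, timeout=3, proxies=proxies).text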
