By default, a crawler picks up its HTTP proxy from the http_proxy environment variable. If a website detects too many requests from one IP within a certain period of time, it may ban that IP. To avoid this, you can route requests through proxy servers and switch proxies from time to time, so you don't have to worry about your IP suddenly getting blocked while scraping large amounts of data.
The proxy IPs used in this article come from the free domestic high-anonymity HTTP proxy list (page 1): http://www.xicidaili.com/nn/
This post mainly covers how to set up a proxy. On to the code!
```python
from bs4 import BeautifulSoup
import requests
import random

def get_ip_list(url, headers):
    # Fetch the page that lists all the proxy IPs
    web_data = requests.get(url, headers=headers)  # get the page response
    soup = BeautifulSoup(web_data.text, 'lxml')    # parse the page
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        httptype = str.lower(tds[5].text)  # make the type (HTTP/HTTPS) lowercase
        ip_list.append(httptype + '://' + tds[1].text + ':' + tds[2].text)
    return ip_list

def get_random_ip(ip_list):
    # Pick one proxy at random from the many IPs collected
    proxy_ip = random.choice(ip_list)
    return proxy_ip

if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    # Set a User-Agent request header to simulate a browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
    ip_list = get_ip_list(url, headers=headers)  # call get_ip_list to obtain the list of proxy IPs
    proxies = get_random_ip(ip_list)             # call get_random_ip to pick one at random
    print(proxies)
```
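Free proxies die often, so in practice it helps to wrap the request in a retry loop that picks a fresh proxy from the list on each failure. This is a sketch, not part of the original script; the helper name `fetch_with_retry` and the `tries` parameter are my own assumptions:

```python
import random
import requests

def fetch_with_retry(url, ip_list, headers, tries=3):
    # Sketch (hypothetical helper): free proxies fail frequently, so try up to
    # `tries` random picks from ip_list until one request succeeds.
    for _ in range(tries):
        proxy = random.choice(ip_list)          # e.g. 'http://1.2.3.4:808'
        scheme = proxy.split('://', 1)[0]       # key for the proxies dict
        try:
            return requests.get(url, headers=headers,
                                proxies={scheme: proxy}, timeout=3).text
        except requests.exceptions.RequestException:
            continue  # this proxy is dead or slow, try another one
    return None  # every attempt failed
```

If all attempts fail it returns None instead of raising, so the caller can decide whether to refresh the proxy list and try again.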
Then add the IP you obtained to your request via the proxies parameter, and you're all set!
```python
proxies = {
    'http': 'http://114.97.184.251:808',
    'https': 'https://119.135.85.253:808',
}
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
headers = {'User-Agent': user_agent}
htmlText = requests.get(url, headers=headers, timeout=3, proxies=proxies).text
```
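Note that requests expects proxies as a dict keyed by scheme, while get_random_ip returns a single string like 'http://114.97.184.251:808'. A tiny helper (the name `make_proxies` is my own, not from the original post) can bridge the two:

```python
def make_proxies(proxy_url):
    # Turn one proxy URL string into the dict that requests' `proxies`
    # parameter expects, keyed by its scheme ('http' or 'https').
    scheme = proxy_url.split('://', 1)[0]
    return {scheme: proxy_url}

print(make_proxies('http://114.97.184.251:808'))
# {'http': 'http://114.97.184.251:808'}
```

With this, `requests.get(url, headers=headers, proxies=make_proxies(get_random_ip(ip_list)), timeout=3)` plugs the randomly chosen proxy straight into the request.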