Crawler (2): Building a proxy IP pool

As mentioned before, a common anti-crawler technique is to detect the requesting IP and restrict its access frequency. We can get around this restriction by using proxy IPs. Many websites provide free proxy IPs, such as https://www.xicidaili.com/nt/ , and we can collect a large number of proxies from them. However, not all of these IPs work; in fact, only a few are usable.


We can use BeautifulSoup to parse the page and extract a list of proxy IPs, or use a regular expression, which is faster. Below, ip_url is https://www.xicidaili.com/nt/ , and random_header is a function that returns a random request header.
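The post never shows random_header itself. A minimal sketch of such a helper, assuming all it needs to do is rotate the User-Agent (the USER_AGENTS list below is purely illustrative):

```python
import random

# Illustrative pool of desktop User-Agent strings; any realistic set will do.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0',
]

def random_header():
    # Pick a different User-Agent each call so consecutive requests
    # do not all carry the same fingerprint.
    return {'User-Agent': random.choice(USER_AGENTS)}
```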

import re
import time
import requests


def download_page(url):
    headers = random_header()
    data = requests.get(url, headers=headers)
    return data


def get_proxies(page_num, ip_url):
    available_ip = []
    for page in range(1, page_num):
        print('Crawling proxy IPs on page %d' % page)
        url = ip_url + str(page)
        r = download_page(url)
        r.encoding = 'utf-8'
        pattern = re.compile('<td class="country">.*?alt="Cn" />.*?</td>.*?'
                             '<td>(.*?)</td>.*?<td>(.*?)</td>', re.S)
        ip_list = re.findall(pattern, r.text)
        for ip in ip_list:
            if test_ip(ip):
                print('%s:%s passed the test, added to the list of available proxies' % (ip[0], ip[1]))
                available_ip.append(ip)
        time.sleep(10)
    print('Crawling finished')
    return available_ip

After collecting the IPs, we still need to test them to determine whether they can actually be used. How? We can use the proxy to visit a website that displays the visitor's IP, then check the result of the request.

def test_ip(ip, test_url='http://ip.tool.chinaz.com/'):
    proxies = {'http': ip[0] + ':' + ip[1]}
    try_ip = ip[0]
    try:
        r = requests.get(test_url, headers=random_header(), proxies=proxies)
        if r.status_code == 200:
            r.encoding = 'gbk'
            result = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', r.text)
            result = result.group()
            print(result)
            if result[:9] == try_ip[:9]:
                print('%s:%s passed the test' % (ip[0], ip[1]))
                return True
            else:
                print('%s:%s proxy failed, local IP was used instead' % (ip[0], ip[1]))
                return False
        else:
            print('%s:%s status code is not 200' % (ip[0], ip[1]))
            return False
    except Exception as e:
        print(e)
        print('%s:%s error' % (ip[0], ip[1]))
        return False

Some tutorials treat any HTTP 200 status code as a success, but that is not reliable. If the proxy connection does not actually go through, the request may fall back to your own IP, and a request from your own IP will of course succeed. That is why test_ip compares the IP echoed back by the page against the proxy's IP.


Finally, test each IP again right before using it, because you never know when it will stop working. It is therefore wise to hoard more proxy IPs than you need, so that you are never left without a usable one.
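One way to follow this advice is a small helper that re-tests pooled proxies on demand and discards dead ones. Note that pick_proxy and its test parameter are my own illustrative names, not from the original post:

```python
import random

def pick_proxy(pool, test):
    """Return a proxy from `pool` that still passes `test`,
    removing any that fail; None when the pool is exhausted."""
    while pool:
        ip = random.choice(pool)
        if test(ip):          # re-check right before use
            return ip
        pool.remove(ip)       # drop dead proxies from the hoard
    return None
```

Called as pick_proxy(available_ip, test_ip), this re-validates each candidate with the same check that was used when the pool was built.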


The code in this article is based on https://blog.csdn.net/XRRRICK/article/details/78650764 , with a few small changes of my own.


Source: www.cnblogs.com/rain-poi/p/11517110.html