How to improve crawler efficiency by screening high-quality proxy IPs?

Foreword

For engineers doing data crawling, maintaining a stable and efficient pool of proxy IPs plays a decisive role. The maintenance of a proxy IP pool can be approached from the following aspects:

1. Verify the availability of proxy IPs

You can send a request to a target website through the requests library to determine whether a proxy IP returns a response successfully. If the request succeeds, the proxy is available; otherwise it is considered invalid. Set a timeout in the code so you do not wait indefinitely for an unresponsive proxy.

import requests

def check_proxy(proxy, url):
    # Send a test request through the proxy; a 200 response counts as available
    try:
        response = requests.get(url, proxies=proxy, timeout=3)
        if response.status_code == 200:
            return True
    except requests.RequestException:
        pass
    return False
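requests expects the proxies argument to be a dict that maps each scheme to a proxy URL. A minimal usage sketch (the proxy address and target URL below are placeholders, not working values):

proxy = {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'}
if check_proxy(proxy, 'https://www.example.com'):
    print('proxy is usable')
else:
    print('proxy is invalid and can be removed from the pool')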

2. Update the proxy IP pool

New proxy IPs can be obtained by regularly scraping proxy listing websites or by purchasing a paid proxy service. You can use the requests library to fetch the HTML page of a proxy website and parse it with the BeautifulSoup library to extract the proxy information. After applying some screening rules, the newly obtained proxies can be added to your own pool (a sketch of such a screening step follows the code below).

import requests
from bs4 import BeautifulSoup

def get_proxies():
    url = 'http://jshk.com.cn/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'id': 'ip_list'})
    tr_list = table.find_all('tr')
    proxies = []
    for tr in tr_list[1:]:  # skip the table header row
        td_list = tr.find_all('td')
        ip = td_list[1].text
        port = td_list[2].text
        protocol = td_list[5].text.lower()
        proxy = '{}://{}:{}'.format(protocol, ip, port)
        proxies.append(proxy)
    return proxies
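The screening rules mentioned above can be as simple as running each newly scraped proxy through the availability check from step 1 before adding it to the pool. A minimal sketch that reuses check_proxy and get_proxies (the test URL is a placeholder, and the pool is just a plain list here):

def refresh_pool(pool, test_url='https://www.example.com'):
    # Fetch new proxies and keep only the ones that pass the availability check
    for proxy_url in get_proxies():
        proxy = {'http': proxy_url, 'https': proxy_url}
        if check_proxy(proxy, test_url) and proxy_url not in pool:
            pool.append(proxy_url)
    return pool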

3. Maintain the quality of proxy IPs

Several metrics can be used to measure the quality of a proxy IP, such as connection speed, response time, and request success rate. Evaluate the proxies regularly, keep the ones with good quality, and delete the poor ones from the pool (a simple screening sketch follows the code below).

import requests
from multiprocessing import Pool
from functools import partial

def check_proxy_quality(proxy, url):
    # Return (is_valid, response_time_in_seconds) for a single proxy
    try:
        response = requests.get(url, proxies=proxy, timeout=3)
        if response.status_code == 200:
            return True, response.elapsed.total_seconds()
    except requests.RequestException:
        pass
    return False, None

def evaluate_proxies(proxies, url):
    # Check all proxies in parallel and keep only the ones that respond
    pool = Pool(processes=8)
    results = pool.map(partial(check_proxy_quality, url=url), proxies)
    pool.close()
    pool.join()
    quality_proxies = []
    for proxy, result in zip(proxies, results):
        is_valid, response_time = result
        if is_valid:
            quality_proxies.append((proxy, response_time))
    return quality_proxies
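Once evaluate_proxies has produced (proxy, response_time) pairs, one simple screening rule is to sort by response time and keep only the fastest entries; the cut-off of 20 below is an arbitrary example value:

def select_best_proxies(quality_proxies, keep=20):
    # Sort valid proxies by response time and keep the fastest ones
    ranked = sorted(quality_proxies, key=lambda item: item[1])
    return [proxy for proxy, _ in ranked[:keep]]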

4. Monitor proxy IP usage

A relatively simple way to monitor proxy usage is to record how many times each proxy has been used and its success rate, so that you can quickly spot which proxies are no longer available or are of poor quality.

Python's built-in shelve module can save the proxy usage statistics to a local file. The shelve module provides a dictionary-like storage interface that makes reading and writing data convenient and fast.

import shelve

class ProxyManager:
    def __init__(self, filename='proxies.db'):
        self.filename = filename
        # writeback=True so that updates to the nested dicts are persisted on close()
        self.proxies = shelve.open(filename, writeback=True)
        if not self.proxies.get('used_proxies'):
            self.proxies['used_proxies'] = {}

    def mark_as_used(self, proxy):
        # Record one more attempt with this proxy and refresh its success rate
        if proxy in self.proxies:
            self.proxies[proxy]['used_times'] += 1
            self.proxies[proxy]['success_rate'] = self.proxies[proxy]['success_times'] / self.proxies[proxy]['used_times']
        else:
            self.proxies[proxy] = {'used_times': 1, 'success_times': 0, 'success_rate': 0}
        self.proxies['used_proxies'][proxy] = True

    def mark_as_success(self, proxy):
        # Record one successful request with this proxy and refresh its success rate
        if proxy in self.proxies:
            self.proxies[proxy]['success_times'] += 1
            self.proxies[proxy]['success_rate'] = self.proxies[proxy]['success_times'] / self.proxies[proxy]['used_times']
        else:
            self.proxies[proxy] = {'used_times': 1, 'success_times': 1, 'success_rate': 1}
        self.proxies['used_proxies'][proxy] = True

    def is_used(self, proxy):
        return self.proxies['used_proxies'].get(proxy)

    def close(self):
        self.proxies.close()

Before making a network request with a proxy, first check whether it has already been used. If it has, skip it; if it has not, use it for the request and update its usage statistics after the request succeeds or fails.

def get_page(url, proxy_manager):
    # Try up to 3 different proxies before giving up
    for i in range(3):
        proxy = get_proxy(proxy_manager)
        if proxy:
            try:
                response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=3)
                if response.status_code == 200:
                    proxy_manager.mark_as_success(proxy)
                    return response.text
            except requests.RequestException:
                pass
            # The request failed, so count this attempt against the proxy
            proxy_manager.mark_as_used(proxy)
    return None

def get_proxy(proxy_manager):
    # Return the first proxy that has not been used yet
    proxies = list(proxy_manager.proxies.keys())
    for proxy in proxies:
        if proxy == 'used_proxies':  # skip the bookkeeping entry stored in the same shelf
            continue
        if not proxy_manager.is_used(proxy):
            return proxy
    return None
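Putting the pieces together, a typical call sequence might look like the following sketch (the target URL is a placeholder, and the shelf is assumed to already contain some proxies):

manager = ProxyManager('proxies.db')
try:
    html = get_page('https://www.example.com', manager)
    if html:
        print('page fetched successfully')
finally:
    manager.close()  # flush the shelve file to disk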

Note that write operations with the shelve module can be time-consuming. If the proxy pool is large, consider saving the usage statistics to the local file at regular intervals instead of on every update to improve performance. Also, if the pool contains many invalid proxies, its availability rate is already extremely low, and it is worth switching to proxies from a high-quality provider.
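One way to batch the writes, as suggested above, is to keep the counters in an in-memory dictionary and only flush them to the shelve file every N updates. A minimal sketch along those lines (the flush interval of 100 is an arbitrary example):

import shelve

class BufferedProxyStats:
    # Keep proxy usage counters in memory and flush them to disk periodically
    def __init__(self, filename='proxies.db', flush_every=100):
        self.filename = filename
        self.flush_every = flush_every
        self.stats = {}
        self.pending = 0

    def record(self, proxy, success):
        entry = self.stats.setdefault(proxy, {'used_times': 0, 'success_times': 0})
        entry['used_times'] += 1
        if success:
            entry['success_times'] += 1
        self.pending += 1
        if self.pending >= self.flush_every:
            self.flush()

    def flush(self):
        # Write all accumulated counters to the shelve file in one pass
        with shelve.open(self.filename) as db:
            for proxy, entry in self.stats.items():
                db[proxy] = entry
        self.pending = 0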

Many people will say that with the economy the way it is, anything that works is good enough, never mind the poor connectivity of free proxies. In fact, as long as you choose the right proxy provider, the purchase cost will stay well within a reasonable budget.

Origin blog.csdn.net/weixin_45841831/article/details/130427057