Using HTTP proxies in a crawler

1. Each process fetches a random batch of IPs from the API and reuses them; when requests fail, it calls the API for a new batch.

The general logic is as follows:

(1) Each process pulls a random subset of IPs from the API and cycles through the list to fetch data;

(2) If the request succeeds, it moves on and crawls the next page;

(3) If it fails, it pulls a fresh batch of IPs from the API and keeps trying (see the sketch below).

Drawbacks of this scheme: every IP has an expiry time. If you extract 100 IPs and are only on the 20th, most of the rest may already have expired by the time you reach them. And if the HTTP request is configured with a 3-second connect timeout and a 5-second read timeout, each failed attempt can burn 3-8 seconds, time in which hundreds of pages could have been crawled with a working IP.
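A minimal sketch of this batch-and-retry loop. The extraction API URL, its one-"ip:port"-per-line response format, and the helper names are assumptions for illustration, not any particular provider's interface:

    import random
    import requests

    # Hypothetical extraction API -- replace with your provider's endpoint.
    API_URL = "http://api.example.com/get_ips?num=100"

    def fetch_ip_batch():
        # Assumption: the API returns one "ip:port" per line.
        text = requests.get(API_URL, timeout=5).text
        return [line.strip() for line in text.splitlines() if line.strip()]

    ip_pool = fetch_ip_batch()

    def crawl(url):
        global ip_pool
        while True:
            if not ip_pool:
                ip_pool = fetch_ip_batch()  # batch exhausted: refill from the API
            proxy = random.choice(ip_pool)
            proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
            try:
                # 3 s connect / 5 s read timeout, matching the drawback above
                return requests.get(url, proxies=proxies, timeout=(3, 5))
            except requests.RequestException:
                ip_pool.remove(proxy)  # dead IP: drop it and try another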

2. Each process fetches one random IP from the API at a time; when a request fails, it calls the API for another IP.

The general logic is as follows:

(1) Each process pulls one random IP from the API and uses it to fetch the resource;

(2) If the request succeeds, it moves on and crawls the next page;

(3) If it fails, it pulls another random IP from the API and keeps trying (sketched below).

Drawbacks of this scheme: the API is called very frequently, which puts heavy load on the proxy provider's servers, hurts the stability of the API, and may trigger extraction rate limits. This scheme is unsuitable and cannot run stably for long.
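For contrast, a minimal sketch of this one-IP-at-a-time variant, under the same hypothetical API as before; note that every failed request triggers another API call, which is exactly the pressure problem just described:

    import requests

    # Hypothetical API returning a single "ip:port" per call.
    API_URL = "http://api.example.com/get_ip?num=1"

    def fetch_with_fresh_ip(url):
        while True:
            proxy = requests.get(API_URL, timeout=5).text.strip()
            proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
            try:
                return requests.get(url, proxies=proxies, timeout=(3, 5))
            except requests.RequestException:
                continue  # every retry hits the API again -- the pressure point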

3. Extract a large batch of IPs up front, import them into a local database, and draw IPs from the database.

The general logic is as follows:

(1) Create a table in the database and write an import script that calls the API at the required rate, based on how many IPs you need per minute (ask your proxy provider for a sensible rate), and imports the IP list into the database;

(2) Record fields such as import time, IP, port, expiry time, and availability status;

(3) Write a crawl script that reads available IPs from the database; each process takes an IP from the database for its own use.

The script then crawls, checks the result, handles cookies, and so on; whenever a CAPTCHA appears or a request fails, it gives up that IP and switches to a new one (see the sketch below).
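A minimal sketch of such a pool, using SQLite as the local database. The table columns follow the fields listed in step (2); the API URL, the five-minute expiry, and the helper names are illustrative assumptions:

    import sqlite3
    import time
    import requests

    # Hypothetical extraction API -- replace with your provider's endpoint.
    API_URL = "http://api.example.com/get_ips?num=20"

    db = sqlite3.connect("proxy_pool.db")
    db.execute("""CREATE TABLE IF NOT EXISTS proxy (
                      ip TEXT, port TEXT,
                      imported_at REAL, expires_at REAL,
                      usable INTEGER DEFAULT 1)""")

    def import_batch(ttl_seconds=300):
        # Importer: run at the rate agreed with your provider (e.g. once a minute).
        now = time.time()
        for line in requests.get(API_URL, timeout=5).text.splitlines():
            ip, _, port = line.strip().partition(":")
            if ip and port:
                db.execute("INSERT INTO proxy VALUES (?, ?, ?, ?, 1)",
                           (ip, port, now, now + ttl_seconds))
        db.commit()

    def get_ip():
        # Crawl-script side: pick any unexpired, still-usable IP from the pool.
        row = db.execute("SELECT ip, port FROM proxy "
                         "WHERE usable = 1 AND expires_at > ?",
                         (time.time(),)).fetchone()
        return "%s:%s" % row if row else None

    def mark_bad(ip, port):
        # Called when a CAPTCHA appears or a request fails: give up this IP.
        db.execute("UPDATE proxy SET usable = 0 WHERE ip = ? AND port = ?",
                   (ip, port))
        db.commit()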

    # -*- coding: utf-8 -*-

    import requests

    # Target page to fetch
    targetUrl = "http://ip.hahado.cn/ip"

    # Proxy server
    proxyHost = "ip.hahado.cn"
    proxyPort = "39010"

    # Proxy tunnel authentication
    proxyUser = "username"
    proxyPass = "password"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host" : proxyHost,
        "port" : proxyPort,
        "user" : proxyUser,
        "pass" : proxyPass,
    }

    proxies = {
        "http"  : proxyMeta,
        "https" : proxyMeta,
    }

    resp = requests.get(targetUrl, proxies=proxies)

    print(resp.status_code)
    print(resp.text)
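One note on this example: to apply the connect/read timeouts discussed under scheme 1, requests accepts a (connect, read) tuple:

    resp = requests.get(targetUrl, proxies=proxies, timeout=(3, 5))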
