A crawler's gospel: ProxyPool, an open source IP proxy pool with 14K+ GitHub stars

Why use a proxy?

Have you ever run into this while writing a crawler? During testing it works fine, but after running for a while it starts throwing errors or returning no data, and the page may respond with a message like "IP access too frequent". This means the website has IP-based anti-crawling measures: if the number or rate of requests from a single IP exceeds a certain threshold within a given period, the site simply denies service, which is commonly called "blocking the IP".

This is where proxy IPs come in. A proxy is simply a proxy server, and its working principle is straightforward. Normally, we send a request directly to the web server, and the web server returns the response to us. Using a proxy IP builds a "bridge" between the local machine and the web server: the local machine first sends the request to the proxy server, which forwards it to the web server, and the response travels back the same way, relayed through the proxy. As a result, the web server cannot easily identify the local IP.
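As a quick illustration, here is a minimal sketch of routing a request through a proxy with the requests library; the proxy addresses below are placeholders, not real servers:

import requests

# Route the request through a proxy server instead of connecting directly.
# The addresses are placeholders; substitute a working proxy.
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

resp = requests.get("http://www.example.com", proxies=proxies)
print(resp.status_code)  # the target server sees the proxy's IP, not ours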

Not all proxy IPs are equal, though. They come in three types: transparent proxies, which still pass your real IP along to the target site; anonymous proxies, which hide your IP but reveal that a proxy is being used; and high-anonymity (elite) proxies, which hide both, so the request looks like an ordinary direct visit.

Next, let's take a look at ProxyPool, a popular open source project on GitHub, and see how to use it.

Introduction to ProxyPool

ProxyPool is a crawler proxy IP pool. It periodically collects free proxies published on the Internet, verifies and stores them, and regularly re-checks the stored proxies for availability. It can be used through an API or the CLI, and you can add your own proxy sources to improve the quality and quantity of the pool.

Download the code

Via git clone:

git clone git@github.com:jhao104/proxy_pool.git

Or download the corresponding zip file directly.

Install the dependencies

pip install -r requirements.txt

Modify the configuration file

Open setting.py in the code directory and modify the project configuration to suit your needs.

# API service configuration
HOST = "0.0.0.0"               # IP to bind
PORT = 5000                    # listening port


# database configuration
DB_CONN = 'redis://:pwd@127.0.0.1:8888/0'
# without a password
DB_CONN = 'redis://:@127.0.0.1:8888/0'

# proxy table name (create your own)
TABLE_NAME = 'use_proxy'

# ProxyFetcher configuration
PROXY_FETCHER = [
    "freeProxy01",      # names of the enabled fetch methods; all fetch methods live in fetcher/proxyFetcher.py
    "freeProxy02",
    # ....
]
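As the introduction mentioned, you can also extend the proxy sources. Below is a minimal sketch of a custom fetch method, assuming the project's convention that fetch methods belong to the ProxyFetcher class in fetcher/proxyFetcher.py and yield proxies as host:port strings; the source URL and regex scraping here are hypothetical placeholders:

import re
import requests

class ProxyFetcher(object):
    # In practice, add this static method to the existing ProxyFetcher
    # class in fetcher/proxyFetcher.py rather than redefining the class.

    @staticmethod
    def freeProxyCustom01():
        # placeholder proxy-list page; replace with a real source
        resp = requests.get("http://www.example.com/proxy-list", timeout=10)
        # naively pick out anything on the page that looks like host:port
        for proxy in re.findall(r"\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}", resp.text):
            yield proxy

To enable it, add "freeProxyCustom01" to the PROXY_FETCHER list above.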

Run the project

Start the Redis service: redis-server.exe (the executable lives in the Redis installation path).

1. Start the scheduler

To start the scheduler, open a cmd window in the proxy_pool project directory and enter:

python proxyPool.py schedule

You can then read the proxies from the database:

import redis

# connect to the local Redis instance and read the proxy hash table
r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)
result = r.hgetall('use_proxy')  # TABLE_NAME configured in setting.py
print(result.keys())             # the stored proxies

2. Start the web API service

python proxyPool.py server

After the web service starts, the API endpoints below are available by default:

api       method   description                 params
/         GET      API introduction            None
/get      GET      get a random proxy          optional: ?type=https filters proxies that support HTTPS
/pop      GET      get and remove a proxy      optional: ?type=https filters proxies that support HTTPS
/all      GET      get all proxies             optional: ?type=https filters proxies that support HTTPS
/count    GET      view the proxy count        None
/delete   GET      remove a proxy              ?proxy=host:ip
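For example, the endpoints can be queried directly with requests; this sketch assumes the service is listening on 127.0.0.1:5010, the address used in the crawler example below:

import requests

BASE = "http://127.0.0.1:5010"

print(requests.get(BASE + "/count").json())           # number of proxies in the pool
print(requests.get(BASE + "/get?type=https").json())  # one random HTTPS-capable proxy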

Use in a crawler

import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

# your spider code

def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
            # fetch the page via the proxy
            return html
        except Exception:
            retry_count -= 1
    # the proxy failed repeatedly; remove it from the pool
    delete_proxy(proxy)
    return None
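Note that the example above only routes plain HTTP traffic through the proxy; to crawl HTTPS pages through it as well, fetch a proxy with ?type=https and pass it under the "https" key of the proxies dict too.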

That said, the proxies in the pool are free proxies scraped from the web, so IP quality is hit-or-miss, but it is good enough for everyday development and testing.
