[Crawler] An easy-to-use IP proxy pool

An easy-to-use IP proxy pool - stand

When writing crawlers you often run into various anti-crawling measures, and IP banning is one of the more common ones.

In that situation you need proxy IPs. Reliable proxies usually cost money, while free proxies tend to fail often, so it makes sense to build your own IP proxy pool to get proxy IPs that are both free and usable. Here is a proxy pool I wrote myself; stars are welcome.

Installation

pip install stand

Start up

stand

After startup, the crawlers scrape IPs from proxy sites and store the data in a SQLite database named stand.db. Once the crawlers have collected a certain number of IPs, you can start using the proxies.

Usage
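To check whether the crawlers have collected anything yet, you can peek into the database file. The schema of stand.db is not described here, so this minimal sketch only lists whatever tables exist and how many rows each holds. The helper name `table_row_counts` is hypothetical and not part of stand.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of tables and indexes.
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        # Table names come from the catalog itself, so quoting them is enough.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `table_row_counts("stand.db")` assumes the file is in the current directory; pass the full path if the database lives somewhere else.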

>>> from stand import get_proxy
>>> proxy = get_proxy()
>>> print(proxy)
103.133.222.151:8080
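Outside Scrapy, the returned string can be plugged into any HTTP client. The sketch below uses only the standard library and assumes that the proxy is a plain HTTP proxy in `host:port` form, as in the output above. The helper names `build_proxies` and `fetch_via_proxy` are made up for this example and are not part of stand.

```python
import urllib.request


def build_proxies(proxy):
    """Turn 'host:port' into the mapping that urllib's ProxyHandler expects."""
    # Assumption: the same HTTP proxy is used for both http and https URLs.
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


def fetch_via_proxy(url, proxy, timeout=10):
    """Fetch url through the given proxy and return the body as text."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler(build_proxies(proxy)))
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `fetch_via_proxy('https://api.ip.sb/ip', get_proxy())` should return the proxy's IP if the proxy is alive. Free proxies fail often, so in practice the call is worth wrapping in a try/except and retrying with a fresh proxy.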

Use stand as a proxy in Scrapy

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://api.ip.sb/ip']

    def parse(self, response):
        print(response.meta['proxy'])
        print(response.text)


DOWNLOADER_MIDDLEWARES = {
    'stand.UserAgentMiddleware': 543,
    'stand.ProxyMiddleware': 600,
}
settings = dict(
    LOG_ENABLED=False,
    DOWNLOAD_TIMEOUT=30,
    DOWNLOADER_MIDDLEWARES=DOWNLOADER_MIDDLEWARES,
)


def run():
    process = CrawlerProcess(settings)
    process.crawl(TestSpider)
    process.start()


if __name__ == "__main__":
    run()

Project description

  1. When stand starts, it runs the crawl function to scrape proxy IPs from proxy websites and stores the results in a SQLite database named stand.db (the save directory can be set with the STAND_DIR environment variable). Each IP starts with a score of 2.
  2. It then runs the validate function to check whether each proxy IP works. If validation passes, the score is set to the maximum value of 3; if it fails, the score is reduced by 1; when the score reaches 0, the IP is deleted.
  3. After that, the crawl and validate functions run on a schedule: IPs are crawled every 20 minutes and validated every 60 minutes.
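The scoring rule described above can be sketched in a few lines. This only illustrates the rule as stated (initial score 2, reset to 3 on success, minus 1 on failure, removal at 0) and is not stand's actual implementation. The names `update_score` and `validate_pool`, and the in-memory dict standing in for the database, are assumptions.

```python
INITIAL_SCORE = 2  # score given to a freshly crawled IP
MAX_SCORE = 3      # score assigned when validation succeeds


def update_score(score, passed):
    """Return the new score after one validation attempt."""
    return MAX_SCORE if passed else score - 1


def validate_pool(pool, check):
    """Apply one validation round to {ip: score}; drop IPs whose score hits 0.

    `check` is any callable that takes an IP string and returns True or False.
    """
    updated = {}
    for ip, score in pool.items():
        new_score = update_score(score, check(ip))
        if new_score > 0:
            updated[ip] = new_score
    return updated
```

Under this rule a new IP survives one failed validation but is removed after two failures in a row, while a proxy that has passed at least once tolerates two consecutive failures before being deleted on the third.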

Origin www.cnblogs.com/lin-zone/p/12054288.html