Scrapy crawls available proxy IPs

On the road of crawling, how do we get around anti-crawling measures? As the saying goes, "for every anti-crawling strategy there is a ladder over the wall", so below we use Scrapy to crawl free proxy IPs, and we need never again be afraid of having our IP blocked. There are many free proxy sites; here the XiCi (West Thorn) proxy is used as the example.
The relevant code has been uploaded to GitHub.
GitHub address: https://github.com/stormdony/scarpydemo . It contains several Scrapy demos, welcome to fork and star.

1. Create a scrapy project

scrapy startproject get_ip_demo

2. Create spider

scrapy genspider get_ip www.xicidaili.com

Inspect the page to find the data that needs to be extracted; XPath is mainly used to locate the fields.
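Before writing the spider, the selectors can be checked interactively in scrapy shell. The following is just a quick sanity check, assuming the proxy rows are <tr class="odd"> elements as they were on the page at the time of writing:

scrapy shell "http://www.xicidaili.com/"
>>> rows = response.css('.odd')
>>> rows[0].xpath('td[2]/text()').extract_first()   # should print an IP address
>>> rows[0].xpath('td[3]/text()').extract_first()   # should print a port number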

# -*- coding: utf-8 -*-
import telnetlib

import scrapy

from get_ip_demo.items import GetIpDemoItem

class GetIpSpider(scrapy.Spider):
    name = 'get_ip'
    allowed_domains = ['xicidaili.com']
    start_urls = ['http://www.xicidaili.com/']

    def parse(self, response):
        ip_list = response.css('.odd')
        for each_ip in ip_list:
            ip = each_ip.xpath('td[2]/text()').extract_first()
            port = each_ip.xpath('td[3]/text()').extract_first()
            province = each_ip.xpath('td[4]/text()').extract_first()
            alive = each_ip.xpath('td[7]/text()').extract_first()
            active = each_ip.xpath('td[8]/text()').extract_first()
            item = GetIpDemoItem()
            item['ip'] = ip
            item['port'] = port
            item['province'] = province
            item['alive'] = alive
            item['active'] = active
            try:
                telnetlib.Telnet(ip, port=port, timeout=20)  # verify that the proxy is reachable
            except Exception:
                print('connect failed')
            else:
                print('success')
                yield item
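
The telnet check above only verifies that the proxy's port accepts a connection; it does not prove that the host actually forwards HTTP traffic. A stricter (and slower) check is to send a real request through the proxy, for example with the requests library. The sketch below is not part of the original repository; http://httpbin.org/ip is just an arbitrary test endpoint.

import requests

def is_working_proxy(ip, port, timeout=10):
    """Return True if an HTTP request routed through the proxy succeeds."""
    proxy_url = 'http://{}:{}'.format(ip, port)
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False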
3. Write the item (items.py)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GetIpDemoItem(scrapy.Item):
    # define the fields for your item here like:
    ip = scrapy.Field()
    port = scrapy.Field()
    alive = scrapy.Field()     # survival time
    province = scrapy.Field()  # location
    active = scrapy.Field()    # last verification time

4. Set up storage (pipelines.py)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings


class GetIpDemoPipeline(object):
    def __init__(self):
        port = settings['MONGODB_PORT']
        host = settings['MONGODB_HOST']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        ip_info = dict(item)
        self.post.insert(ip_info)
        return item
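
Note: from scrapy.conf import settings is deprecated and has been removed in newer Scrapy releases, and pymongo's insert() has been superseded by insert_one(). On a recent Scrapy/pymongo a variant along the following lines should work; it is a sketch under those assumptions, not the code from the repository:

import pymongo


class GetIpDemoPipeline(object):
    def __init__(self, host, port, db_name, doc_name):
        self.client = pymongo.MongoClient(host=host, port=port)
        self.post = self.client[db_name][doc_name]

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB settings from settings.py via the crawler object
        s = crawler.settings
        return cls(s.get('MONGODB_HOST'), s.getint('MONGODB_PORT'),
                   s.get('MONGODB_DBNAME'), s.get('MONGODB_DOCNAME'))

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item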
5. Modify settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for get_ip_demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'get_ip_demo'

SPIDER_MODULES = ['get_ip_demo.spiders']
NEWSPIDER_MODULE = 'get_ip_demo.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'get_ip_demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
# This must be changed to False, otherwise the crawl will fail
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Set the request headers
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'get_ip_demo.middlewares.GetIpDemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'get_ip_demo.middlewares.GetIpDemoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the storage pipeline
ITEM_PIPELINES = {
   'get_ip_demo.pipelines.GetIpDemoPipeline': 300,
}
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'XiCiDaiLi'
MONGODB_DOCNAME = 'ip_item'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
6. Run

Run the following command in the terminal:

scrapy crawl get_ip
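
For a quick look at the results without MongoDB, Scrapy can also export the scraped items straight to a file:

scrapy crawl get_ip -o ips.json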

Open Robomongo and you can see the data we crawled.

7. Problems encountered
  1. Only 20+ records ended up in the database, while the website lists 100+, so something went wrong in the middle.
    It turned out that only the rows with class='odd' were captured; the alternating rows are <tr> elements with no class attribute, so they were skipped and the job was only half done. A possible fix is sketched below.
    If you have a better solution, please let me know, thanks ^_^. If this helped you, please give it a like.
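
A possible fix (sketched, not verified against the live page): stop filtering on the odd class and instead iterate over every data row of the proxy table, skipping the header. Assuming the rows live in a table with id="ip_list", the parse method could start like this:

    def parse(self, response):
        # take every <tr> in the proxy table except the header row
        ip_list = response.xpath('//table[@id="ip_list"]//tr[position() > 1]')
        for each_ip in ip_list:
            ip = each_ip.xpath('td[2]/text()').extract_first()
            # ...the rest of the field extraction stays the same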
