How do we get around anti-crawling measures while crawling? As the saying goes, "for every anti-crawling strategy there is a way over the wall." Below we use Scrapy to crawl free proxy IPs, so we never need to fear an IP ban again. There are many free proxy sites; here the Xici (西刺) proxy serves as the example.
The code has been uploaded to GitHub: https://github.com/stormdony/scarpydemo, which contains several Scrapy demos; forks and stars are welcome.
1. Create a scrapy project
```shell
scrapy startproject get_ip_demo
```
2. Create the spider

```shell
scrapy genspider get_ip www.xicidaili.com
```

By inspecting the page, find the data we need to extract; XPath is used to locate it.
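The XPath expressions used in the spider can be tried out on a single sample table row. The sketch below uses only the standard library (the real spider uses Scrapy's selectors, which follow the same expressions); the row contents are made-up sample data laid out like a Xici table row.

```python
# Try the spider's XPath expressions on one sample <tr>.
# The values below are illustrative, not real data from the site.
import xml.etree.ElementTree as ET

row_html = (
    "<tr class='odd'>"
    "<td><img src='cn.png'/></td>"   # td[1]: country flag
    "<td>110.73.0.57</td>"           # td[2]: IP
    "<td>8123</td>"                  # td[3]: port
    "<td>Guangxi</td>"               # td[4]: location
    "<td>HTTP</td>"
    "<td>anonymous</td>"
    "<td>12 days</td>"               # td[7]: alive time
    "<td>18-05-20 10:22</td>"        # td[8]: last verified
    "</tr>"
)

row = ET.fromstring(row_html)
ip = row.find('td[2]').text    # second <td> holds the IP
port = row.find('td[3]').text  # third <td> holds the port
print(ip, port)                # → 110.73.0.57 8123
```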
```python
# -*- coding: utf-8 -*-
import telnetlib

import scrapy

from get_ip_demo.items import GetIpDemoItem


class GetIpSpider(scrapy.Spider):
    name = 'get_ip'
    allowed_domains = ['xicidaili.com']
    start_urls = ['http://www.xicidaili.com/']

    def parse(self, response):
        ip_list = response.css('.odd')
        for each_ip in ip_list:
            ip = each_ip.xpath('td[2]/text()').extract_first()
            port = each_ip.xpath('td[3]/text()').extract_first()
            province = each_ip.xpath('td[4]/text()').extract_first()
            alive = each_ip.xpath('td[7]/text()').extract_first()
            active = each_ip.xpath('td[8]/text()').extract_first()

            item = GetIpDemoItem()
            item['ip'] = ip
            item['port'] = port
            item['province'] = province
            item['alive'] = alive
            item['active'] = active

            try:
                # Check whether the proxy is reachable
                telnetlib.Telnet(ip, port=port, timeout=20)
            except Exception:
                print('connect failed')
            else:
                print('success')
                yield item
```
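The Telnet probe above only verifies that the port accepts a connection. A stricter check is to actually send an HTTP request through the proxy; a minimal sketch, where the function name and test URL are illustrative assumptions, not part of the original project:

```python
# Sketch: validate a proxy by fetching a page through it,
# instead of only opening a TCP connection with telnetlib.
import urllib.request


def proxy_works(ip, port, test_url='http://httpbin.org/ip', timeout=5):
    """Return True if an HTTP request routed through ip:port succeeds."""
    handler = urllib.request.ProxyHandler({'http': 'http://{}:{}'.format(ip, port)})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(test_url, timeout=timeout)
        return True
    except Exception:
        # Connection refused, timeout, bad gateway, etc.
        return False
```

In the spider, this could replace the `telnetlib.Telnet(...)` call, yielding the item only when `proxy_works(ip, port)` returns True.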
3. Write the item
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class GetIpDemoItem(scrapy.Item):
    # define the fields for your item here like:
    ip = scrapy.Field()
    port = scrapy.Field()
    alive = scrapy.Field()     # how long the proxy has been alive
    province = scrapy.Field()  # location
    active = scrapy.Field()    # last verification time
```
4. Set storage—pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings


class GetIpDemoPipeline(object):
    def __init__(self):
        port = settings['MONGODB_PORT']
        host = settings['MONGODB_HOST']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        ip_info = dict(item)
        self.post.insert(ip_info)
        return item
```
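Once stored, the documents can be turned into the `http://ip:port` form that Scrapy's built-in HttpProxyMiddleware accepts via `request.meta['proxy']`. A minimal sketch, with made-up sample documents standing in for a MongoDB query result:

```python
# Sketch: format stored ip/port documents as proxy URLs for Scrapy.
# The sample dicts mimic what the pipeline stores; real code would
# read them back from MongoDB instead.
def to_proxy_url(doc):
    return 'http://{}:{}'.format(doc['ip'], doc['port'])

docs = [
    {'ip': '110.73.0.57', 'port': '8123'},
    {'ip': '121.31.102.215', 'port': '8123'},
]
proxies = [to_proxy_url(d) for d in docs]
print(proxies[0])  # → http://110.73.0.57:8123
```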
5. Modify settings.py
```python
# -*- coding: utf-8 -*-

# Scrapy settings for get_ip_demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'get_ip_demo'

SPIDER_MODULES = ['get_ip_demo.spiders']
NEWSPIDER_MODULE = 'get_ip_demo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'get_ip_demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
# This must be changed to False here, otherwise the crawl will fail
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Set the request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'get_ip_demo.middlewares.GetIpDemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'get_ip_demo.middlewares.GetIpDemoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the storage pipeline
ITEM_PIPELINES = {
    'get_ip_demo.pipelines.GetIpDemoPipeline': 300,
}

MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'XiCiDaiLi'
MONGODB_DOCNAME = 'ip_item'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
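Beyond a fixed User-Agent, a common next anti-crawling countermeasure is rotating the User-Agent on every request with a small downloader middleware. A sketch (the class name and agent list are illustrative; it would be enabled via `DOWNLOADER_MIDDLEWARES`):

```python
# Sketch of a user-agent rotating downloader middleware.
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
]


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request;
        # overwrite the User-Agent header with a random choice.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```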
6. Run
Run the following command in a terminal:

```shell
scrapy crawl get_ip
```

Open Robomongo, and you can see the data we crawled.
7. Problems encountered
- Only 20+ records landed in the database, but the site lists 100+, so something went wrong in between.
It turned out that only rows with class="odd" were captured, while some proxy rows' tr elements carry no class attribute at all, so the task was only half completed.
If you have a solution, please let me know. Thanks ^_^ If this helped you, please give it a like.
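One possible fix: select every tr in the table instead of only the class="odd" rows, then skip the header row. In the spider this might look like `response.xpath('//table[@id="ip_list"]/tr')` (the table id is an assumption about the page). The idea, sketched with the stdlib on a made-up mini table:

```python
# Sketch of the fix: take all <tr> rows, drop the header, then extract.
# Sample data is illustrative; only the first <td> is read here,
# whereas the real table puts the IP in td[2] after a flag column.
import xml.etree.ElementTree as ET

table = ET.fromstring(
    "<table>"
    "<tr><th>IP</th><th>Port</th></tr>"              # header row
    "<tr class='odd'><td>110.73.0.57</td><td>8123</td></tr>"
    "<tr><td>121.31.102.215</td><td>8123</td></tr>"  # no class attribute
    "</table>"
)

rows = table.findall('tr')[1:]  # every row, minus the header
ips = [r.find('td[1]').text for r in rows]
print(ips)  # → ['110.73.0.57', '121.31.102.215']
```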