Scrapy
I. Installing Scrapy
Install Twisted first.
Twisted is an event-driven networking engine for Python; Scrapy depends on it.
Download and install a Twisted wheel matching your Python version from the following URL.
Then install Scrapy itself. Open cmd and run:
pip install scrapy
After installation, type scrapy in cmd; if the command's help output appears, the installation succeeded.
II. Understanding Scrapy
Scrapy components:
Engine: processes the data flow of the whole system and triggers events.
Scheduler: receives requests from the engine, pushes them into a queue, and returns them when the engine asks for the next request.
Downloader: downloads web content and returns it to the spider.
Spider: where the main work happens; you write the rules here that parse pages for a specific domain or page type.
Item pipeline: processes the items the spiders extract from the web; its main tasks are cleaning, validating, and storing the data. Once a spider has parsed a page, the items are sent to the pipeline and pass through its processing steps in a specific order.
Downloader middleware: a hook framework between the Scrapy engine and the downloader that processes the requests and responses passing between them.
Spider middleware: a hook framework between the Scrapy engine and the spiders that processes the spiders' input (responses) and output (requests and items).
Scheduler middleware: sits between the Scrapy engine and the scheduler and processes the requests and responses passing between them.
The processing flow is:
1. The engine opens a domain, finds the spider that handles it, and asks the spider for the first URLs to crawl.
2. The engine takes the first URL from the spider and schedules it as a request in the scheduler.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL, and the engine sends it to the downloader through the downloader middleware.
5. When the page has been downloaded, the downloader sends a response back to the engine through the downloader middleware.
6. The engine receives the response and sends it to the spider through the spider middleware for processing.
7. The spider processes the response and returns scraped items and new requests to the engine.
8. The engine sends the scraped items to the item pipeline and the new requests to the scheduler.
9. The process repeats from step 2 until there are no more requests in the scheduler.
III. Project analysis
IV. Creating the project
Create a new folder for the weather-site crawler.
cd into that folder, open cmd, and run:
scrapy startproject weather_spider
Then create a spider:
scrapy genspider weather www.aqistudy.cn/historydata
Here weather is the spider's name.
The directory layout created is as follows:
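The original shows the layout as a screenshot; for reference, scrapy startproject weather_spider followed by the genspider command produces a structure like this:

```
weather_spider/
├── scrapy.cfg
└── weather_spider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── weather.py
```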
V. Coding
items.py
To disguise the crawler well, a User-Agent is required. Define MY_USER_AGENT in settings.py to hold a list of UA strings; note that setting names in settings.py must be uppercase.
settings.py
MY_USER_AGENT = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]
Next, write a RandomUserAgentMiddleware class in middlewares.py that picks a random UA from that list for each request.
middlewares.py
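The middleware body is not reproduced in the original; the sketch below follows a common pattern rather than the author's exact code. It reads MY_USER_AGENT from the settings via from_crawler and sets a random UA on every outgoing request.

```python
import random


class RandomUserAgentMiddleware(object):
    """Downloader middleware that assigns a random User-Agent to each request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # MY_USER_AGENT is the list defined in settings.py
        return cls(crawler.settings.getlist('MY_USER_AGENT'))

    def process_request(self, request, spider):
        # pick a random UA for this request
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```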
Note: to take effect, the middleware must be activated in settings.py with priority 900, so that it runs after Scrapy's built-in UserAgentMiddleware (priority 500) and replaces the default UA.
settings.py
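The activation might look like this (the module path weather_spider.middlewares follows from the startproject name; 900 is the priority mentioned above):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'weather_spider.middlewares.RandomUserAgentMiddleware': 900,
    # optionally disable Scrapy's built-in UA middleware entirely
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```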
Now start writing the most important part, weather.py; debugging step by step with scrapy shell is recommended.
First, get all of the city URLs.
weather.py
def parse(self, response):
    # only one city is taken here ([16:17]) to keep the crawl small
    city_urls = response.xpath('//div[@class="all"]/div[@class="bottom"]//li/a/@href').extract()[16:17]
    city_names = response.xpath('//div[@class="all"]/div[@class="bottom"]//li/a/text()').extract()[16:17]
    self.logger.info('Crawling the URL of city {}'.format(city_names[0]))
    for city_url, city_name in zip(city_urls, city_names):
        # response.follow is a shortcut that joins relative URLs automatically
        yield response.follow(url=city_url, meta={'city': city_name}, callback=self.parse_month)
weather.py
def parse_month(self, response):
    """Parse the month URLs for one city."""
    city_name = response.meta['city']
    self.logger.info('Crawling month URLs for city {}'.format(city_name))
    # the full data set is very large, so only the first 5 months are crawled
    month_urls = response.xpath('//ul[@class="unstyled1"]/li/a/@href').extract()[0:5]
    for month_url in month_urls:
        # selenium=True tells the downloader middleware to render this page
        yield response.follow(url=month_url, meta={'city': city_name, 'selenium': True}, callback=self.parse_day_data)
Then give the WeatherSpiderDownloaderMiddleware downloader middleware a process_request method that renders the page with selenium when the request asks for it.
middlewares.py
import time
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class WeatherSpiderDownloaderMiddleware(object):
    def process_request(self, request, spider):
        if request.meta.get('selenium'):
            # run Chrome headless so no browser window is opened
            chrome_options = Options()
            chrome_options.add_argument('--headless')
            driver = webdriver.Chrome(chrome_options=chrome_options)
            # visit the URL with the browser
            driver.get(request.url)
            time.sleep(1.5)  # give the browser time to load and render the page
            html = driver.page_source
            driver.quit()
            # return the rendered HTML so Scrapy skips its own download
            return scrapy.http.HtmlResponse(url=request.url, body=html, encoding='utf-8', request=request)
        return None
Activate WeatherSpiderDownloaderMiddleware.
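As with the UA middleware, register it in the DOWNLOADER_MIDDLEWARES setting (the priority 543 is Scrapy's template default, an assumption here):

```python
# settings.py -- merge with any middlewares already in this dict
DOWNLOADER_MIDDLEWARES = {
    'weather_spider.middlewares.WeatherSpiderDownloaderMiddleware': 543,
}
```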
Finally, write the remaining code in weather.py.
VI. Running the project
Check that the spider exists:
scrapy list
Then run it, exporting the results in whichever format you need:
scrapy crawl weather -o spider.json
scrapy crawl weather -o spider.jl
scrapy crawl weather -o spider.csv
scrapy crawl weather -o spider.xml
To keep non-ASCII text readable in the exported files, set FEED_EXPORT_ENCODING = 'utf-8' in settings.py.
VII. Storing the data
For beginners, storage is mainly handled in the item pipelines.
pipelines.py
Activate the pipeline in settings.py.
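The pipeline code is not shown in the original; here is a minimal sketch that appends each item to a JSON-lines file (the class name WeatherPipeline and the output filename are assumptions), together with the matching ITEM_PIPELINES entry:

```python
# pipelines.py
import json


class WeatherPipeline(object):
    """Write each scraped item to a JSON-lines file."""

    def open_spider(self, spider):
        self.file = open('weather.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()


# settings.py -- activate the pipeline (300 is an arbitrary priority)
ITEM_PIPELINES = {
    'weather_spider.pipelines.WeatherPipeline': 300,
}
```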
The results are as follows: