Note that ITEM_PIPELINES in settings.py is a dictionary: by adding further pipeline classes (in their own Python files) to it, you can process the scraped results in several ways at once.
weather/pipelines2json.py
import time
import json
import codecs

class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.json'
        # Append each item as one JSON line to a date-stamped file
        with codecs.open(fileName, 'a', encoding='utf8') as fp:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            fp.write(line)
        return item
weather/settings.py
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
}
With the code above there are two pipelines processing the crawl results; one of them saves the results as JSON data.
Why the codecs module is used here: this is a character-encoding issue. With plain Python file I/O you would have to convert the encoding manually on every read, which is tedious:
with open('/path/to/file', 'r') as f:
    f.read().decode('gbk')
Fortunately, Python also provides the codecs module, which converts the encoding for us automatically when reading a file:
with codecs.open('/path/to/file', 'r', encoding='gbk') as f:
    f.read()
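The manual-decode problem above mainly affects Python 2; in Python 3 the built-in open() accepts an encoding argument and decodes transparently, so codecs.open is rarely needed. A minimal sketch (the file name demo.txt is just an illustration):

```python
# Python 3: built-in open() converts encodings automatically.
with open('demo.txt', 'w', encoding='gbk') as f:
    f.write('天气预报')          # stored on disk as GBK bytes

with open('demo.txt', 'r', encoding='gbk') as f:
    text = f.read()              # decoded back to a str transparently
```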
About the json module:
- json.dumps() converts dict data to a str. Writing a dict object to a JSON file directly raises an error, so this function is needed when writing the data.
- json.loads() converts str data back to a dict.
- json.dump() converts dict data to a str and writes it to a JSON file.
- json.load() reads data back from a JSON file.
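A quick sketch of the four functions (the item dict here is made up for illustration):

```python
import json

item = {'city': 'Guangzhou', 'weather': '晴', 'high': '28'}

# dumps: dict -> JSON string; ensure_ascii=False keeps Chinese readable
s = json.dumps(item, ensure_ascii=False)

# loads: JSON string -> dict (round-trips back to the original)
restored = json.loads(s)

# dump / load: the same conversions, but against a file object
with open('demo.json', 'w', encoding='utf8') as fp:
    json.dump(item, fp, ensure_ascii=False)
with open('demo.json', encoding='utf8') as fp:
    loaded = json.load(fp)
```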
Scrapy has its own default headers, which differ from a real browser's. Some sites inspect the headers, so you can give Scrapy the headers of a common browser by adding the following line to settings.py.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1'
Headers can also be added through a Scrapy middleware: write the middleware file weather/customMiddlewares.py, then add the related configuration to settings.py. (Note: in Scrapy 1.0+ the module path is scrapy.downloadermiddlewares.useragent; the older scrapy.contrib path is deprecated.) Official middleware docs: https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class CustomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1'
        request.headers.setdefault('User-Agent', ua)
DOWNLOADER_MIDDLEWARES = {
    'weather.customMiddlewares.CustomUserAgent': 3,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
Below are the user-agent strings of some common browsers (you can view your own browser's with window.navigator.userAgent in the console):
1) Chrome
   Win7: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1
2) Firefox
   Win7: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0
3) Safari
   Win7: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
4) Opera
   Win7: Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50
5) IE
   Win7+ie9: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)
   Win7+ie8: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)
   WinXP+ie8: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)
   WinXP+ie7: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
   WinXP+ie6: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
6) Maxthon
   Maxthon 3.1.7 on Win7+ie9, high-speed mode: Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12
   Maxthon 3.1.7 on Win7+ie9, IE-kernel compatibility mode: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)
7) Sogou
   Sogou 3.0 on Win7+ie9, IE-kernel compatibility mode: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)
   Sogou 3.0 on Win7+ie9, high-speed mode: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0
8) 360
   360 Browser 3.0 on Win7+ie9: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2
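One common use of such a list is a downloader middleware that picks a user agent at random for each request, making the crawler harder to fingerprint. A minimal sketch (the class name RandomUserAgent and the shortened list are my own, not part of the project's code):

```python
import random

# A few of the user-agent strings listed above
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
    'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',
]

class RandomUserAgent(object):
    """Downloader middleware that sets a random User-Agent per request."""
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(USER_AGENTS))
```

Registered in DOWNLOADER_MIDDLEWARES like the CustomUserAgent example above, this would vary the User-Agent header across requests.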