Part 1: Overall Approach
First build a regular Scrapy project, then integrate Scrapy-redis into that project, and finally deploy it as a distributed crawler.

The distributed deployment consists of:
- On the central (master) node, install redis (and mysql, if it is used).
- On every worker node, install python, scrapy, scrapy-redis and the Python redis module (plus pymysql, if mysql is used).
- Deploy the modified distributed crawler project to every worker node.
- Run the distributed crawler project on each worker node; a minimal sketch of the scrapy-redis settings this relies on follows the list.
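As a concrete reference for the integration and deployment steps above, this is a minimal sketch of what the scrapy-redis part of a worker node's settings.py usually looks like; the redis address is a placeholder for the master node, and the exact values depend on the project:

    # settings.py: scrapy-redis integration (sketch; adjust to your project)

    # use scrapy-redis' scheduler and dupefilter so all nodes share one request queue
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER_PERSIST = True          # keep the queue and fingerprints when a spider stops

    # the master node that runs redis (placeholder address)
    REDIS_HOST = '192.168.1.100'
    REDIS_PORT = 6379

    # optional: push scraped items into redis so the master can consume them
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300,
    }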
Part 2: Detailed Implementation and Code
Details:

1. start_requests
   A spider's start_requests can either yield Request objects one by one or return a list of
   them, because Scrapy internally converts the return value into an iterator:

       def start_requests(self):
           for url in self.start_urls:
               yield Request(url=url, callback=self.parse2)

       def start_requests(self):
           req_list = []
           for url in self.start_urls:
               req_list.append(Request(url=url, callback=self.parse2))
           return req_list

2. Selectors
   Turning the response text into selector objects:
   - Option 1:
         response.xpath("//div[@id='content-list']/div[@class='item']")
   - Option 2 (HtmlXPathSelector is the legacy API; newer Scrapy versions use Selector):
         hxs = HtmlXPathSelector(response=response)
         items = hxs.xpath("//div[@id='content-list']/div[@class='item']")

   Lookup rules:
       //a
       //div/a
       //a[re:test(@id, "i\d+")]

       items = hxs.xpath("//div[@id='content-list']/div[@class='item']")
       for item in items:
           item.xpath('.//div')

   Extracting results:
       selector objects:  xpath('/html/body/ul/li/a/@href')
       list of strings:   xpath('/html/body/ul/li/a/@href').extract()
       first value:       xpath('//body/ul/li/a/@href').extract_first()

   PS: stand-alone usage, outside a running spider:

       from scrapy.selector import Selector, HtmlXPathSelector
       from scrapy.http import HtmlResponse

       html = """<!DOCTYPE html>
       <html>
       <head lang="en">
           <meta charset="UTF-8">
           <title></title>
       </head>
       <body>
           <ul>
               <li class="item-"><a id='i1' href="link.html">first item</a></li>
               <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
               <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
           </ul>
           <div><a href="llink2.html">second item</a></div>
       </body>
       </html>
       """
       response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
       obj = response.xpath('//a[@id="i1"]/text()').extract_first()
       print(obj)

   PS: Chrome's developer tools can copy an XPath expression for any element you inspect.

3. pipelines
   - Basic pipeline:

         class FilePipeline(object):
             def process_item(self, item, spider):
                 print('write to file', item['href'])
                 return item

             def open_spider(self, spider):
                 """Called when the spider starts."""
                 print('open file')

             def close_spider(self, spider):
                 """Called when the spider closes."""
                 print('close file')

   - Multiple pipelines: the smaller the number in ITEM_PIPELINES, the higher the priority.
   - With multiple pipelines, the return value of process_item is handed to the next
     pipeline's process_item.
     PS: to drop an item so that later pipelines never see it, raise DropItem:

         from scrapy.exceptions import DropItem

         class FilePipeline(object):
             def process_item(self, item, spider):
                 print('write to file', item['href'])
                 # return item
                 raise DropItem()

   - Reading values from the settings file and then processing in the pipeline
     (the supporting settings and item definitions are sketched right after this section):

         class FilePipeline(object):
             def __init__(self, path):
                 self.path = path
                 self.f = None

             @classmethod
             def from_crawler(cls, crawler):
                 """Called once at start-up to create the pipeline object."""
                 path = crawler.settings.get('XL_FILE_PATH')
                 return cls(path)

             def process_item(self, item, spider):
                 self.f.write(item['href'] + '\n')
                 return item

             def open_spider(self, spider):
                 """Called when the spider starts."""
                 self.f = open(self.path, 'w')

             def close_spider(self, spider):
                 """Called when the spider closes."""
                 self.f.close()
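The configuration-driven FilePipeline above relies on a few pieces that live outside pipelines.py. A rough sketch of how they could fit together, assuming the project is named xianglong and the item only carries the href field used above; the item class name, the file path and the callback body are invented for this sketch:

    # settings.py (sketch)
    XL_FILE_PATH = 'hrefs.txt'                      # read by FilePipeline.from_crawler
    ITEM_PIPELINES = {
        'xianglong.pipelines.FilePipeline': 300,    # smaller number = higher priority
    }

    # items.py (sketch)
    import scrapy

    class XianglongItem(scrapy.Item):
        href = scrapy.Field()

    # spider callback (sketch), reusing the selectors from section 2
    def parse2(self, response):
        items = response.xpath("//div[@id='content-list']/div[@class='item']")
        for sel in items:
            href = sel.xpath('.//a/@href').extract_first()
            if href:
                yield XianglongItem(href=href)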
4. POST / request headers / cookies: automatically log in to Chouti (dig.chouti.com) and upvote a post

   POST + request headers:

       from scrapy.http import Request

       req = Request(
           url='http://dig.chouti.com/login',
           method='POST',
           body='phone=8613121758648&password=woshiniba&oneMonth=1',
           headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
           cookies={},
           callback=self.parse_check_login,
       )

   Cookies, handled manually:

       from scrapy.http.cookies import CookieJar

       cookie_dict = {}
       cookie_jar = CookieJar()
       cookie_jar.extract_cookies(response, response.request)
       for k, v in cookie_jar._cookies.items():
           for i, j in v.items():
               for m, n in j.items():
                   cookie_dict[m] = n.value

       req = Request(
           url='http://dig.chouti.com/login',
           method='POST',
           headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
           body='phone=8615131255089&password=pppppppp&oneMonth=1',
           cookies=cookie_dict,  # carry the cookies by hand
           callback=self.check_login
       )
       yield req

   Cookies, handled automatically via meta={'cookiejar': True}:

       import scrapy
       from scrapy.http import Request

       class ChoutiSpider(scrapy.Spider):
           name = 'chouti'
           allowed_domains = ['chouti.com']
           start_urls = ['http://dig.chouti.com/', ]

           def start_requests(self):
               for url in self.start_urls:
                   yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})

           def parse_index(self, response):
               req = Request(
                   url='http://dig.chouti.com/login',
                   method='POST',
                   headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                   body='phone=8613121758648&password=woshiniba&oneMonth=1',
                   callback=self.parse_check_login,
                   meta={'cookiejar': True}
               )
               yield req

           def parse_check_login(self, response):
               # print(response.text)
               yield Request(
                   url='https://dig.chouti.com/link/vote?linksId=19440976',
                   method='POST',
                   callback=self.parse_show_result,
                   meta={'cookiejar': True}
               )

           def parse_show_result(self, response):
               print(response.text)

   The settings file controls whether cookie handling is enabled at all:

       # Disable cookies (enabled by default)
       # COOKIES_ENABLED = False

5. De-duplication rules

   Configuration:

       DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter'

   Writing the class:

       from scrapy.dupefilters import BaseDupeFilter

       class MyDupeFilter(BaseDupeFilter):
           def __init__(self):
               self.record = set()

           @classmethod
           def from_settings(cls, settings):
               return cls()

           def request_seen(self, request):
               if request.url in self.record:
                   print('already visited', request.url)
                   return True
               self.record.add(request.url)

           def open(self):  # can return deferred
               pass

           def close(self, reason):  # can return a deferred
               pass

   Problem: how do we build a unique identifier for a request? These two URLs describe the
   same request, so comparing raw URLs is not enough:

       http://www.oldboyedu.com?id=1&age=2
       http://www.oldboyedu.com?age=2&id=1

       from scrapy.utils.request import request_fingerprint
       from scrapy.http import Request

       u1 = Request(url='http://www.oldboyedu.com?id=1&age=2')
       u2 = Request(url='http://www.oldboyedu.com?age=2&id=1')

       result1 = request_fingerprint(u1)
       result2 = request_fingerprint(u2)
       print(result1, result2)   # the two fingerprints are identical

   Problem: should the visit records go into a database? Use a redis set: visit records can
   be kept in redis, and a sketch of such a redis-backed dupefilter follows at the end of
   this section.

   Extra: where exactly does dont_filter take effect? In the scheduler:

       from scrapy.core.scheduler import Scheduler

       def enqueue_request(self, request):
           # request.dont_filter=False:
           #     self.df.request_seen(request):
           #     - True: already visited, drop the request
           #     - False: not visited yet
           # request.dont_filter=True: every request goes into the scheduler
           if not request.dont_filter and self.df.request_seen(request):
               self.df.log(request, self.spider)
               return False
           # if we get this far, the request is pushed into the scheduler queue
           dqok = self._dqpush(request)
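As suggested above, the visit records can live in a redis set instead of an in-memory set, so every node shares the same de-duplication state. Below is a rough sketch using request_fingerprint and the redis-py client; the key name 'dupefilter:visited' and the default connection values are invented for this example, and the class would be wired up through DUPEFILTER_CLASS just like MyDupeFilter:

    import redis
    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint


    class RedisDupeFilter(BaseDupeFilter):
        def __init__(self, conn, key):
            self.conn = conn
            self.key = key

        @classmethod
        def from_settings(cls, settings):
            # connection details come from settings; the defaults here are only examples
            conn = redis.Redis(
                host=settings.get('REDIS_HOST', '127.0.0.1'),
                port=settings.getint('REDIS_PORT', 6379),
            )
            return cls(conn, key='dupefilter:visited')

        def request_seen(self, request):
            fp = request_fingerprint(request)
            # sadd returns 0 when the fingerprint was already in the set
            added = self.conn.sadd(self.key, fp)
            return added == 0

        def open(self):  # can return deferred
            pass

        def close(self, reason):  # can return a deferred
            pass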
6. Middleware

   Problem: how do we attach a request header to every request the spider sends?
   - Option 1: add the header to every Request object by hand.
   - Option 2: a downloader middleware.

     Configuration:

         DOWNLOADER_MIDDLEWARES = {
             'xianglong.middlewares.UserAgentDownloaderMiddleware': 543,
         }

     Writing the class:

         class UserAgentDownloaderMiddleware(object):

             @classmethod
             def from_crawler(cls, crawler):
                 # This method is used by Scrapy to create your spiders.
                 s = cls()
                 return s

             def process_request(self, request, spider):
                 # Called for each request that goes through the downloader middleware.
                 # Must either:
                 # - return None: continue processing this request
                 # - or return a Response object
                 # - or return a Request object
                 # - or raise IgnoreRequest: process_exception() methods of
                 #   installed downloader middleware will be called
                 request.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

                 # return None                           # continue with the remaining process_request methods
                 # from scrapy.http import Request
                 # return Request(url='www.baidu.com')   # put a new request back into the scheduler; the current request is not processed further
                 # from scrapy.http import HtmlResponse  # run every process_response, starting from the last middleware
                 # return HtmlResponse(url='www.baidu.com', body=b'asdfuowjelrjaspdoifualskdjf;lajsdf')

             def process_response(self, request, response, spider):
                 # Called with the response returned from the downloader.
                 # Must either:
                 # - return a Response object
                 # - return a Request object
                 # - or raise IgnoreRequest
                 return response

             def process_exception(self, request, exception, spider):
                 # Called when a download handler or a process_request()
                 # (from other downloader middleware) raises an exception.
                 # Must either:
                 # - return None: continue processing this exception
                 # - return a Response object: stops process_exception() chain
                 # - return a Request object: stops process_exception() chain
                 pass

   - Option 3: the built-in downloader middleware. In the settings file:

         USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

Summary:

1. What are stored procedures, trigger functions and the like used for?
2. Additional optimization notes:
   - Read/write splitting, using the database's master/slave replication: the master handles
     deletes, updates and inserts; the slaves handle reads.
         Raw SQL:  select * from db.tb
         ORM:      model.User.objects.all().using("default")
         PS: database routing (db router)
   - Splitting across databases: when one database holds too many tables (e.g. 10,000 tables),
     move tables into separate databases.
   - Splitting individual tables.
   - Caching with redis or memcache.
3. Creating indexes: to index a text column you must specify a prefix length.
4. start_requests
   - may return an iterable
   - or a generator
5. pipelines
   - configuration:

         ITEM_PIPELINES = {
             'xianglong.pipelines.FilePipeline': 300,
         }

   - write the class:

         class FilePipeline(object):
             def __init__(self, path):
                 pass

             @classmethod
             def from_crawler(cls, crawler):
                 pass

             def process_item(self, item, spider):
                 return item

             def open_spider(self, spider):
                 pass

             def close_spider(self, spider):
                 pass

6. De-duplication
   - configuration:

         DUPEFILTER_CLASS = 'xianglong.dupe.MyDupeFilter'

   - write the class:

         class MyDupeFilter(BaseDupeFilter):
             def __init__(self):
                 pass

             @classmethod
             def from_settings(cls, settings):
                 pass

             def request_seen(self, request):
                 pass

             def open(self):  # can return deferred
                 pass

             def close(self, reason):  # can return a deferred
                 pass

7. Downloader middleware
   - configuration:

         DOWNLOADER_MIDDLEWARES = {
             'xianglong.middlewares.UserAgentDownloaderMiddleware': 543,
         }

   - the class:

         class UserAgentDownloaderMiddleware(object):
             @classmethod
             def from_crawler(cls, crawler):
                 pass

             def process_request(self, request, spider):
                 pass

             def process_response(self, request, response, spider):
                 pass

             def process_exception(self, request, exception, spider):
                 pass

8. POST / request headers / cookies
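Tying this back to Part 1: once the single-node spider works, switching to scrapy-redis mostly means inheriting from scrapy_redis.spiders.RedisSpider and feeding start URLs through redis instead of start_urls, on top of the settings sketched in Part 1. The spider name, redis_key and callback body below are made-up example values, not taken from the original notes:

    from scrapy_redis.spiders import RedisSpider


    class ChoutiRedisSpider(RedisSpider):
        name = 'chouti_redis'               # hypothetical spider name for this sketch
        redis_key = 'chouti:start_urls'     # redis list this spider pops start URLs from

        def parse(self, response):
            # reuse the selectors from section 2 of Part 2
            for sel in response.xpath("//div[@id='content-list']/div[@class='item']"):
                yield {'href': sel.xpath('.//a/@href').extract_first()}

Each worker node then runs "scrapy crawl chouti_redis", and work is seeded once against the master node's redis, e.g. with redis-cli: lpush chouti:start_urls http://dig.chouti.com/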