Using the Scrapy framework
Coding workflow for pipeline-based persistent storage
1. Parse the data in the spider file.
2. Encapsulate the parsed data into an Item object.
3. Submit the item object to the pipeline.
4. The pipeline's process_item method is called to receive the item and carry out some form of persistent storage.
5. Enable the pipeline in the settings file:
    ITEM_PIPELINES = {
        'frist_scrapy.pipelines.FristScrapyPipeline': 300,
    }
    # uncomment this block in settings.py; 300 is the priority (lower numbers run earlier)
Notes:
1. When do you need more than one pipeline class? Each pipeline class corresponds to one form of persistent storage.
2. return item in process_item: it hands the item to the next pipeline class about to be executed.
3. If writing a dict directly to Redis raises an error: pip install redis==2.10.6
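A minimal sketch of notes 1 and 2 (the class names are illustrative, not taken from a real project): two pipeline classes, each responsible for one form of storage, chained together by return item:

    class FilePipeline(object):
        def process_item(self, item, spider):
            # first form of persistent storage (e.g. a local file) would go here
            return item  # hand the same item to the next pipeline class

    class DatabasePipeline(object):
        def process_item(self, item, spider):
            # second form of persistent storage (e.g. a database) would go here
            return item

Both classes still have to be enabled in ITEM_PIPELINES, where the priority number decides the order in which they receive the item.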
Full-site data crawling
Manually sending requests:

    yield scrapy.Request(url=new_url, callback=self.parse)
    # the meta parameter can also be passed in:
    yield scrapy.Request(url=new_url, callback=self.parse, meta={'item': item})
Summary: When to use yield
1. When submitting an item to the pipeline.
2. When manually sending a request.
How to send a POST request:

    yield scrapy.FormRequest(url=new_url, callback=self.parse, formdata={})
Why the URLs in the start_urls list are sent as GET requests:
The parent class's original implementation of start_requests:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)
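To send POST requests from the very start, start_requests can be overridden to yield FormRequest objects instead of Request objects. A minimal sketch, where the URL and form fields are placeholders:

    import scrapy

    class PostSpider(scrapy.Spider):
        name = 'post_demo'
        # placeholder endpoint; replace with the real URL that expects POST data
        start_urls = ['https://example.com/post']

        def start_requests(self):
            # override the default GET behaviour of the parent class
            for url in self.start_urls:
                yield scrapy.FormRequest(url, callback=self.parse, formdata={'key': 'value'})

        def parse(self, response):
            print(response.text)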
Five core components (objects)
Gain a certain level of understanding of how Scrapy works asynchronously
The call flow of the related methods and object instantiation
The role of each component:
- Engine (Scrapy Engine): handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. It can be pictured as a priority queue of URLs (the addresses of the pages to crawl); it decides which URL to crawl next and removes duplicate URLs.
- Downloader: downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).
- Spiders: do the main work, extracting the information you need, i.e. the entities (Items), from specific pages. Links can also be extracted from them so that Scrapy keeps crawling the next page.
- Item Pipeline: processes the entities extracted by the spiders; its main jobs are persisting entities, validating them, and discarding unneeded information. After a page has been parsed by a spider, the result is sent to the item pipeline and processed through several specific steps in order.
How to appropriately improve Scrapy's crawling efficiency
- Increase concurrency: Scrapy runs 16 concurrent requests by default, which can be increased as appropriate. In the settings file set CONCURRENT_REQUESTS = 100 to raise the concurrency to 100.
- Lower the log level: running Scrapy produces a large amount of log output; to reduce CPU usage, restrict logging to INFO or ERROR. In the settings file write: LOG_LEVEL = 'ERROR'
- Disable cookies: if cookies are not really needed, disable them while crawling to reduce CPU usage and improve efficiency. In the settings file write: COOKIES_ENABLED = False
- Disable retries: re-requesting (retrying) failed HTTP requests slows crawling down, so retries can be disabled. In the settings file write: RETRY_ENABLED = False
- Reduce the download timeout: when crawling a very slow link, a smaller timeout lets stuck links be abandoned quickly and improves efficiency. In the settings file write: DOWNLOAD_TIMEOUT = 10 (a timeout of 10 s)
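Collected in one place, these options go into the project's settings.py (the values are the ones given above; a sketch to be adjusted per target site):

    # settings.py -- efficiency-related options
    CONCURRENT_REQUESTS = 100   # default is 16
    LOG_LEVEL = 'ERROR'         # only log errors, cutting CPU spent on log output
    COOKIES_ENABLED = False     # skip cookie handling when it is not needed
    RETRY_ENABLED = False       # do not retry failed requests
    DOWNLOAD_TIMEOUT = 10       # give up on a slow link after 10 seconds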
Request parameter passing
Purpose: helps Scrapy implement depth crawling
- Depth crawling: the data to be crawled is not all located on the same page
Requirement: crawl the movie name and synopsis from https://www.4567tv.tv/frim/index1.html
Implementation process
Passing the parameter:

    yield scrapy.Request(url, callback, meta)  # the meta dict is handed to the callback

Receiving the parameter:

    response.meta
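A minimal sketch of the round trip before the full spider below (the field values, detail_url and the callback name are illustrative):

    def parse(self, response):
        item = MvItem()                  # the item class defined in items.py below
        item['title'] = '...'            # data taken from the current (list) page
        # carry the partially filled item over to the detail-page callback
        yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']     # pick the item back up
        item['content'] = '...'          # fill in the data found on this page
        yield item                       # submit the completed item to the pipeline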
Code:
    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import MvItem  # the item class defined in items.py below


    class MvspidersSpider(scrapy.Spider):
        name = 'mvspiders'
        # allowed_domains = ['https://www.4567tv.tv/frim/index1.html']
        start_urls = ['https://www.4567tv.tv/frim/index1.html']
        url = "https://www.4567tv.tv/index.php/vod/show/id/5/page/%s.html"
        pageNum = 1

        def parse(self, response):
            li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
            for li in li_list:
                a_href = li.xpath('./div/a/@href').extract_first()
                url = 'https://www.4567tv.tv/' + a_href
                # manually send a request for the detail page
                # parameter passing:
                # meta is a dict; the dict is handed to the callback
                yield scrapy.Request(url, callback=self.infoparse)

            # used for full-site crawling
            if self.pageNum < 5:
                self.pageNum += 1
                new_url = self.url % self.pageNum
                # recursively call parse on the next page
                yield scrapy.Request(new_url, callback=self.parse)

        def infoparse(self, response):
            title = response.xpath("/html/body/div[1]/div/div/div/div[2]/h1/text()").extract_first()
            content = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
            # package the parsed data into an item and submit it to the pipeline
            item = MvItem()
            item['title'] = title
            item['content'] = content
            yield item
items.py
This file defines the item that your data will be packaged into. To package a field, add the corresponding field name to this class followed by scrapy.Field(). Field is simply an alias for the built-in dictionary class (dict); it provides no extra methods or attributes and is only used to support the item declaration syntax based on class attributes.

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class MvItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        content = scrapy.Field()
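Because an Item behaves like a dict, its fields are read and written with subscript syntax; a short illustrative usage:

    item = MvItem()
    item['title'] = 'some title'      # assign by field name
    item['content'] = 'some synopsis'
    print(item['title'])              # read back like a dict
    print(dict(item))                 # convert to a plain dict, e.g. before writing it to Redis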
pipelines.py
This file holds the various ways of storing the data.

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import pymysql
    from redis import Redis


    # write the data to a text file
    class DuanziproPipeline(object):
        fp = None

        def open_spider(self, spider):
            print('spider started......')
            self.fp = open('./duanzi.txt', 'w', encoding='utf-8')

        # called once for every item this pipeline receives
        def process_item(self, item, spider):
            # print(item)  # item behaves like a dict
            self.fp.write(item['title'] + ':' + item['content'] + '\n')
            return item  # hand the item to the next pipeline class to be executed

        def close_spider(self, spider):
            self.fp.close()
            print('spider finished!!!')


    # write the data to MySQL
    class MysqlPipeLine(object):
        conn = None
        cursor = None

        def open_spider(self, spider):
            self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                        password='222', db='spider', charset='utf8')
            print(self.conn)

        def process_item(self, item, spider):
            sql = 'insert into duanzi values ("%s","%s")' % (item['title'], item['content'])
            self.cursor = self.conn.cursor()
            try:
                self.cursor.execute(sql)
                self.conn.commit()
            except Exception as e:
                print(e)
                self.conn.rollback()
            return item

        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()


    # write the data to Redis
    class RedisPileLine(object):
        conn = None

        def open_spider(self, spider):
            self.conn = Redis(host='127.0.0.1', port=6379)
            print(self.conn)

        def process_item(self, item, spider):
            self.conn.lpush('duanziData', item)
            return item
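All three pipeline classes must be registered in ITEM_PIPELINES before they will run. The module path duanziPro below is an assumption based on the class names, and the priority numbers are illustrative (the lower the number, the earlier that pipeline receives the item):

    # settings.py -- adjust 'duanziPro' to the actual project name
    ITEM_PIPELINES = {
        'duanziPro.pipelines.DuanziproPipeline': 300,  # runs first (lowest number)
        'duanziPro.pipelines.MysqlPipeLine': 301,
        'duanziPro.pipelines.RedisPileLine': 302,
    }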