Original: http://106.13.73.98/__/138/
Persistent storage based on terminal commands
Prerequisite: make sure the parse method of the spider (application) file returns an iterable (usually a list or dict).
That return value can then be written, via a terminal command, to a file in the specified format for persistent storage.
Execute the following command to store the data persistently:
scrapy crawl <application_name> -o xx.<file_format>
The supported file formats are: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
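For reference, here is a minimal sketch (not the full example further below) of a parse method whose yielded dicts the -o export can serialize; the spider name and XPath simply mirror the example later in this post:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blog'  # spider name used with "scrapy crawl blog"
    start_urls = ['https://blog.csdn.net/qq_41964425/article/list/1']

    def parse(self, response):
        # Yield plain dicts; Scrapy's feed export serializes them
        # into the file format given after -o on the command line.
        for a in response.xpath('//div[@class="article-list"]/div/h4/a'):
            yield {
                'title': a.xpath('.//text()')[-1].extract().strip(),
                'url': a.xpath('.//@href').extract_first(),
            }

With this, running for example scrapy crawl blog -o blog.csv writes the yielded dicts to blog.csv.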
Persistent storage based on pipelines
The Scrapy framework provides efficient and convenient persistence functionality that we can use directly.
Before using it, let's get to know these two files:
- items.py : the data structure template file, used to define the data attributes (fields).
- pipelines.py : the pipeline file, which receives the data (items) and performs the persistence operations.
--------------------------- ↓
The persistence process:
- After the spider (application) file has crawled the data, the data is encapsulated into item objects.
- The item objects are submitted to the pipeline with the yield keyword so the pipeline can perform the persistence operation.
- The process_item method of the pipeline class receives the item objects submitted by the spider, and the persistence code written there stores the item data in the persistent store.
Note:
- Enable the pipeline in the ITEM_PIPELINES dict of the configuration file settings.py.
- Define the fields to be persisted in the data structure template file items.py.
---------------------------↑
Example
Let's crawl the titles and corresponding URLs of all the articles in your CSDN blog, and store them persistently.
First step: write the spider (application) file blog.py :

import scrapy
from Test.items import TestItem


class BlogSpider(scrapy.Spider):
    name = 'blog'  # application (spider) name
    start_urls = ['https://blog.csdn.net/qq_41964425/article/list/1']  # start url
    page = 1  # used to keep track of the page number
    max_page = 7  # how many pages of articles your CSDN blog has in total
    url = 'https://blog.csdn.net/qq_41964425/article/list/%d'  # url template used to locate a page by number

    def parse(self, response):
        """
        Callback invoked after a start url has been visited and its result obtained.
        Normally it is called once for every start url you define.
        :param response: the response object obtained after sending the request to the start url
        :return: the return value must be an iterable (usually list/dict) or None
        """
        # Locate the links (a tags) of all articles on the current page
        a_list = response.xpath('//div[@class="article-list"]/div/h4/a')  # type: list
        # Every blog shows this article on every page; open the page source and you will see it
        a_list.pop(0)  # 帝都的凛冬 https://blog.csdn.net/yoyo_liyy/article/details/82762601
        # Start parsing the blog data
        for a in a_list:
            # Prepare an item object
            item = TestItem()
            # Store the parsed data in the item object
            item['title'] = a.xpath('.//text()')[-1].extract().strip()  # article title
            item['url'] = a.xpath('.//@href').extract_first()  # article url
            # Finally, submit the item object to the pipeline for persistent storage
            yield item
        # Crawl all the remaining pages
        if self.page < self.max_page:
            self.page += 1
            current_page = format(self.url % self.page)  # the url of the next page to crawl
            # Send the request manually
            yield scrapy.Request(url=current_page, callback=self.parse)
            # callback specifies the callback function, i.e. the parse method to use:
            # 1. If the page has the same structure as the start url, the same parse method (self.parse) can be reused
            # 2. If the page content differs, a new parse method has to be created manually, see below

    def dem(self, response):
        """
        2. If the page content differs, a new parse method has to be created manually.
        Note: a parse method must accept the response object.
        """
        pass
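A side note: the extract() / extract_first() calls above still work, but newer Scrapy versions also offer getall() / get() as the preferred equivalents, for example:

# Equivalent to a.xpath('.//@href').extract_first()
item['url'] = a.xpath('.//@href').get()
# Equivalent to a.xpath('.//text()').extract()
texts = a.xpath('.//text()').getall()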
Second step: define the fields to be saved in the data structure template file items.py :
import scrapy


class TestItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()  # article title
    url = scrapy.Field()  # article url
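A TestItem behaves much like a dict, except that only the declared fields are allowed; a small illustrative sketch (the values are made up):

from Test.items import TestItem

item = TestItem()
item['title'] = 'some title'          # OK: 'title' is a declared field
item['url'] = 'https://example.com'   # OK: 'url' is a declared field
# item['author'] = 'x'                # KeyError: 'author' is not declared in TestItem
print(dict(item))                     # {'title': 'some title', 'url': 'https://example.com'}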
Third step: write the persistence logic in the pipeline file pipelines.py :
class TestPipeline(object):

    def process_item(self, item, spider):
        """This method is called once for every item the spider (application) file submits."""
        # Save the data to the file
        self.fp.write(item['title'] + '\t' + item['url'] + '\n')
        # Important:
        # this method must return the item object so it can be passed on to the same method
        # of the lower-priority pipeline classes that follow.
        # The priorities are defined in the ITEM_PIPELINES dict in settings.py.
        return item

    # Override the parent-class method, used to open the file
    def open_spider(self, spider):
        """Executed when the spider starts. Note: it is executed only once."""
        print('Crawl started')
        self.fp = open('text.txt', 'w', encoding='utf-8')

    # Override the parent-class method, used to close the file
    def close_spider(self, spider):
        """Executed when the spider finishes. Note: it is executed only once."""
        print('Crawl finished')
        self.fp.close()
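To see why returning the item matters, here is a hedged sketch of a second, purely hypothetical pipeline that could be registered with a lower priority (a larger number) in ITEM_PIPELINES; it receives exactly the item object returned by TestPipeline.process_item:

class PrintPipeline(object):
    """Hypothetical extra pipeline, e.g. registered as 'Test.pipelines.PrintPipeline': 500."""

    def process_item(self, item, spider):
        # Only items returned by the higher-priority pipelines arrive here
        print('PrintPipeline received:', item['title'])
        return item  # keep passing the item along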
Fourth step: enable the following configuration items in the configuration file settings.py :
# Disguise the identity of the request carrier (User-Agent)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Whether to obey the robots.txt protocol
ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
    'Test.pipelines.TestPipeline': 300,  # 300 is the priority; the smaller the number, the higher the priority, and higher-priority pipelines run first
    # A custom pipeline class is only called after it has been added to this dict
}
Fifth step: run the command in a terminal: scrapy crawl blog . After it finishes successfully, open the file and take a look:
See, it looks just like this, doesn't it.
Ha ha ha.
Don't panic: we can persist data not only to a file, but also to a database (for example MySQL or Redis, as shown below).
--------------------------------------- ↓
Persistent storage to MySQL

# pipelines.py
import pymysql


# Custom pipeline class used to write the data to MySQL
# Note:
# a custom pipeline class is only called after it has been registered in the
# ITEM_PIPELINES dict in settings.py, and the calling order depends on the
# priority specified when registering.
class TestPipelineByMySQL(object):

    # Override the parent-class method, used to open the MySQL connection and create a cursor
    def open_spider(self, spider):
        """Executed when the spider starts. Note: it is executed only once."""
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='zyk',
            password='user@zyk',
            db='blog',  # the database to use
            charset='utf8',  # the database encoding
        )  # open the MySQL connection
        self.cursor = self.conn.cursor()  # create a cursor
        # In my own tests, using this single cursor concurrently caused no problems

    def process_item(self, item, spider):
        sql = 'insert into info(title, url) values (%s, %s)'  # the SQL statement to execute
        # Execute the transaction
        try:
            self.cursor.execute(sql, (item['title'], item['url']))  # insert
            self.conn.commit()  # commit
        except Exception as e:
            print(e)
            self.conn.rollback()  # roll back
        return item
        # Important:
        # this method must return the item object so it can be passed on to the same method
        # of the lower-priority pipeline classes that follow.
        # The priorities are defined in the ITEM_PIPELINES dict in settings.py.

    # Override the parent-class method, used to close the MySQL connection
    def close_spider(self, spider):
        """Executed when the spider finishes. Note: it is executed only once."""
        self.cursor.close()
        self.conn.close()
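Note that the insert statement above assumes the blog database already contains an info table with title and url columns; here is a one-off sketch for creating it with pymysql (the id column and the column sizes are just assumptions):

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='zyk',
                       password='user@zyk', db='blog', charset='utf8')
with conn.cursor() as cursor:
    # Schema assumed by the insert statement in TestPipelineByMySQL
    cursor.execute(
        'create table if not exists info ('
        ' id int primary key auto_increment,'
        ' title varchar(255),'
        ' url varchar(255))'
    )
conn.commit()
conn.close()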
After writing it, don't forget to add this pipeline class in the configuration file settings.py :
ITEM_PIPELINES = {
    'Test.pipelines.TestPipeline': 300,  # 300 is the priority; the smaller the number, the higher the priority, and higher-priority pipelines run first
    'Test.pipelines.TestPipelineByMySQL': 400,  # a custom pipeline class is only called after it has been added to this dict
}
--------------------------------------- ↓
Persistent storage to Redis

# pipelines.py
import redis
import json


# Custom pipeline class used to write the data to Redis
# Note:
# a custom pipeline class is only called after it has been registered in the
# ITEM_PIPELINES dict in settings.py, and the calling order depends on the
# priority specified when registering.
class TestPipelineByRedis(object):

    # Override the parent-class method, used to create the Redis connection instance
    def open_spider(self, spider):
        """Executed when the spider starts. Note: it is executed only once."""
        # Create the Redis connection instance
        self.conn = redis.Redis(host='localhost', port=6379, password='', decode_responses=True, db=15)
        # decode_responses=True: values of the stored key-value pairs are str instead of bytes
        # db=15: use database number 15

    def process_item(self, item, spider):
        dct = {
            'title': item['title'],
            'url': item['url']
        }
        # Persist the data to Redis
        self.conn.lpush('blog_info', json.dumps(dct, ensure_ascii=False))
        # ensure_ascii=False: so the Chinese text is shown properly when the data is read back
        return item
        # Important:
        # this method must return the item object so it can be passed on to the same method
        # of the lower-priority pipeline classes that follow.
        # The priorities are defined in the ITEM_PIPELINES dict in settings.py.
In the same way, add this pipeline class in the configuration file settings.py :
ITEM_PIPELINES = {
    'Test.pipelines.TestPipeline': 300,  # 300 is the priority; the smaller the number, the higher the priority, and higher-priority pipelines run first
    'Test.pipelines.TestPipelineByMySQL': 400,  # a custom pipeline class is only called after it has been added to this dict
    'Test.pipelines.TestPipelineByRedis': 500,  # a custom pipeline class is only called after it has been added to this dict
}
Query the persisted data in Redis:
lrange blog_info 0 -1
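Alternatively, a minimal Python sketch that reads the persisted list back with redis-py (the connection parameters are simply copied from the pipeline above):

import json
import redis

conn = redis.Redis(host='localhost', port=6379, password='', decode_responses=True, db=15)
for raw in conn.lrange('blog_info', 0, -1):
    record = json.loads(raw)  # each element is a JSON string written by TestPipelineByRedis
    print(record['title'], record['url'])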