I. Basic Concepts
- scrapy: a web crawling (spider) framework.
  It provides asynchronous, high-performance crawling together with data parsing and persistent storage,
  and integrates many features (high-performance asynchronous downloading, queuing, distribution, parsing, persistence, etc.) into a highly versatile project template.
- Framework: a package that integrates many features and provides a highly versatile project template
- How to learn a framework: learn how to use each of its specific functional modules.
II. Installation Environment
Windows:
  a. pip3 install wheel
  b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  c. Go to the download directory and install the wheel, e.g. pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
  d. pip3 install pywin32
  e. pip3 install scrapy
Linux:
pip3 install scrapy
III. Usage Workflow
- ① create a project: scrapy startproject firstBlood
- ② cd firstBlood
- ③ create a spider file: scrapy genspider first www.xxx.com (a skeleton of the generated file is sketched after this list)
- ④ execution: scrapy crawl first
scrapy crawl <spider name>: run the spider and display the log output
scrapy crawl <spider name> --nolog: run the spider without displaying the log output
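For reference, `scrapy genspider first www.xxx.com` creates a spider file under the project's spiders/ directory whose skeleton looks roughly like this (the exact template text varies between Scrapy versions):

```python
# first.py: skeleton produced by "scrapy genspider first www.xxx.com"
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'                        # unique spider name, used by "scrapy crawl first"
    allowed_domains = ['www.xxx.com']     # only URLs under these domains are crawled
    start_urls = ['http://www.xxx.com/']  # the initial requests are sent to these URLs

    def parse(self, response):
        # parsing rules go here
        pass
```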
Project structure:
```
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

scrapy.cfg     the project's main configuration information (the real crawler-related settings live in settings.py)
items.py       defines the data model for structured data, similar to Django's Model
pipelines.py   persistence processing for the crawled data
settings.py    configuration file, e.g. recursion depth, concurrency, download delay
spiders/       spiders directory, e.g. create spider files and write parsing rules here
```
IV. Basic Structure of a Spider File
```python
# -*- coding: utf-8 -*-
import scrapy


class QiubaiSpider(scrapy.Spider):
    # spider (application) name
    name = 'qiubai'
    # domains allowed for crawling (URLs outside these domains are not crawled)
    allowed_domains = ['https://www.qiushibaike.com/']
    # starting URLs to crawl
    start_urls = ['https://www.qiushibaike.com/']

    # callback invoked with the response obtained after a request is sent to each start URL;
    # its return value must be an iterable object or None
    def parse(self, response):
        print(response.text)  # get the response content as a string
        print(response.body)  # get the response content as bytes
```
Spider file
Example:
# Qiushibaike: authors and content
```python
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            print(author, content)
```
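A side note on the two extraction methods used above: `extract_first()` returns the first matched string (or None if nothing matches), while `extract()` returns a list of all matched strings. A minimal, self-contained illustration, using an invented HTML snippet purely for demonstration:

```python
from scrapy.selector import Selector

# Build a Selector from an inline HTML snippet (demonstration data only).
sel = Selector(text='<div><h2>alice</h2><span>line 1</span><span>line 2</span></div>')

print(sel.xpath('//h2/text()').extract_first())  # 'alice' (first match only)
print(sel.xpath('//span/text()').extract())      # ['line 1', 'line 2'] (all matches as a list)
```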
V. Persistent Storage
- Persistent storage:
  - Based on a terminal command: scrapy crawl qiubai -o filePath.csv
    - Advantage: convenient
    - Disadvantage: very limited (data can only be written to a local file, and only specific file extensions are accepted)
  - Based on pipelines:
    - All persistent-storage operations must be written in the pipeline file (pipelines.py)
1. Storage based on a terminal command
The spider's parse method must return the data structured as a list of dicts, i.e. [{}, {}, ...].
Specify the output format when storing, so the crawled data can be written to files of different formats:
    scrapy crawl <spider name> -o xxx.json
    scrapy crawl <spider name> -o xxx.xml
    scrapy crawl <spider name> -o xxx.csv
# Example:
```python
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        all_data = []
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            # print(author, content)
            dic = {
                'author': author,
                'content': content,
                '---': "\n" + "----------------------------------------"
            }
            all_data.append(dic)
        return all_data
```
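To actually write the returned data to a file, run the spider with one of the terminal commands listed above, for example `scrapy crawl first -o qiubai.csv`.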
2. Pipeline-based persistent storage
# In the spider file
```python
# -*- coding: utf-8 -*-
import scrapy
from qiubaiPro.items import QiubaiproItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            content = ''.join(content)
            # print(content)
            # instantiate an item object
            item = QiubaiproItem()
            # access the item object's fields with bracket notation
            item['author'] = author
            item['content'] = content
            # submit the item to the pipeline
            yield item
```
# In items.py
```python
import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # scrapy.Field() is a universal field type that can hold any kind of data
    author = scrapy.Field()
    content = scrapy.Field()
```
# In pipelines.py (the pipeline file)
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Each pipeline class represents storing the parsed/crawled data to one platform.
import json

import pymysql
from redis import Redis
```
```python
# Store the data in a local file
class QiubaiproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Starting the spider......')
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    # persist the data stored in the item object
    def process_item(self, item, spider):
        author = item['author']
        print(author, type(author))
        content = item['content']
        self.fp.write(author + ":" + content)
        return item  # pass the item on to the next pipeline class to be executed

    def close_spider(self, spider):
        print('Spider finished!!!')
        self.fp.close()
```
```python
# Store the data in a MySQL database
class MysqlPipeLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='', db='qiubai', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into qiubai values("%s","%s")'
                                % (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Store the data in a Redis database
class RedisPipeLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # recent versions of redis-py do not accept dict values directly,
        # so serialize the dict to JSON before pushing it onto the list
        self.conn.lpush('qiubai', json.dumps(dic))
        return item
```
# In the settings.py configuration file
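As the comment at the top of pipelines.py notes, each pipeline class must be registered in the ITEM_PIPELINES setting before it will run. A minimal sketch, assuming the class names defined above (the priority numbers are illustrative; lower values run earlier):

```python
# settings.py (excerpt): enable the pipeline classes defined in pipelines.py.
# Items flow through the pipelines in ascending order of these numbers,
# i.e. local file -> MySQL -> Redis in this sketch.
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
    'qiubaiPro.pipelines.MysqlPipeLine': 301,
    'qiubaiPro.pipelines.RedisPipeLine': 302,
}
```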