Crawler --- 06. scrapy: first look at the framework

I. Basic Concepts

- scrapy: a crawler framework.
      It performs asynchronous, high-performance crawling, data parsing, and persistent storage,
      and integrates many features (high-performance asynchronous downloading, queues, distribution, parsing, persistence, etc.) into a highly versatile project template.
- Framework: integrates many features and provides a highly versatile project template.
- How to learn a framework: learn how to use the framework's specific functional modules.

 

II. Installation Environment

Windows:

    a. pip3 install wheel
    b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    c. Enter the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    d. pip3 install pywin32
    e. pip3 install scrapy

  

  Linux:

 
 

      pip3 install scrapy
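A quick way to verify that the installation succeeded (a supplementary sketch, not part of the original post):

# supplementary check: if scrapy installed correctly, this prints its version number
import scrapy
print(scrapy.__version__)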

 

 

III. Usage workflow

    - ① Create a project: scrapy startproject firstBlood
    - ② cd firstBlood
    - ③ Create a spider file: scrapy genspider first www.xxx.com
    - ④ Run it: scrapy crawl first

 

    scrapy crawl <spiderName>: run the spider and display the log output
    scrapy crawl <spiderName> --nolog: run the spider without displaying the log output
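As a supplementary note (a standard Scrapy setting, not part of the original post), an alternative to --nolog is to keep logging on but only print errors, by adding one line to settings.py:

# settings.py -- supplementary sketch: only error-level log output is printed
LOG_LEVEL = 'ERROR'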

 

 

Project structure:

project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

scrapy.cfg     the project's main configuration information (the configuration the spider actually uses lives in settings.py)
items.py       templates for the structured data to store, similar to Django's Model
pipelines.py   persistent storage of the data
settings.py    configuration file, e.g. recursion depth, concurrency, download delay
spiders/       spiders directory, e.g. create spider files here and write the parsing rules

 

IV. Basic structure of a spider file:

 

# -*- coding: utf-8 -*-
import scrapy

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'  # application (spider) name
    # domains allowed for crawling (URLs outside these domains are not crawled)
    allowed_domains = ['https://www.qiushibaike.com/']
    # starting URLs to crawl
    start_urls = ['https://www.qiushibaike.com/']

    # callback invoked with the response obtained after requests are sent to the start URLs;
    # its return value must be an iterable object or None
    def parse(self, response):
        print(response.text)  # get the response content as a string
        print(response.body)  # get the response content as bytes

 

 The spider file

 

 

 

 

 

 Example:

# Qiushibaike: scrape authors and content

# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            print(author, content)
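A supplementary note (not in the original post) on the two extraction calls used above: xpath() returns a list of selector objects, extract() converts all of them to strings, and extract_first() returns only the first string (or None when nothing matches). A minimal sketch using a stand-in HTML snippet:

from scrapy.selector import Selector

# stand-in HTML used only to illustrate extract() vs extract_first()
html = '<div><h2>author-1</h2><h2>author-2</h2></div>'
sel = Selector(text=html)

print(sel.xpath('//h2/text()').extract())        # ['author-1', 'author-2'] -- every match as a string
print(sel.xpath('//h2/text()').extract_first())  # 'author-1' -- only the first match, None if no match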

 

 V. Persistent storage

- Persistent storage:
     - Based on terminal commands: scrapy crawl qiubai -o filePath.csv
         - Advantage: convenient
         - Disadvantage: strong limitations (data can only be written to a local file, and the file extension must be one of the supported formats)
     - Based on pipelines:
         - All persistent-storage operations must be written in the pipeline file (pipelines.py)

 

  1. Terminal-command-based storage

The data returned by parse must be structured as a list of dicts, i.e. [{}, {}, ...].

Specify the output format when running the spider, so the crawled data is written to files in different formats:
    scrapy crawl <spiderName> -o xxx.json
    scrapy crawl <spiderName> -o xxx.xml
    scrapy crawl <spiderName> -o xxx.csv
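A supplementary note (a standard Scrapy setting, not mentioned in the original post): when Chinese text is exported with -o xxx.json, it is ASCII-escaped by default; setting the export encoding in settings.py keeps the output readable:

# settings.py -- supplementary sketch: export files as UTF-8 instead of ASCII-escaped text
FEED_EXPORT_ENCODING = 'utf-8'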

 

# Example:

# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        all_data = []
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            # print(author, content)
            dic = {
                'author': author,
                'content': content,
                '---': "\n" + "----------------------------------------"
            }
            all_data.append(dic)
        return all_data

 

 

 

 

   2. Pipeline-based persistent storage

 

 

# In the spider file

# -*- coding: utf-8 -*-
import scrapy
from qiubaiPro.items import QiubaiproItem

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        all_data = []
        for div in div_list:
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            content = ''.join(content)
            # print(content)
            # instantiate an item object
            item = QiubaiproItem()
            # access the item object's fields with bracket notation
            item['author'] = author
            item['content'] = content
            # submit the item to the pipeline
            yield item

 

 

# In items.py

import scrapy

class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # scrapy.Field() is a universal field type
    author = scrapy.Field()
    content = scrapy.Field()
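A supplementary sketch (behaviour of scrapy.Item, not shown in the original post): an item is read and written like a dict, but only the declared fields are valid keys:

from qiubaiPro.items import QiubaiproItem  # the item class defined above

item = QiubaiproItem()
item['author'] = 'some author'    # declared field, OK
item['content'] = 'some content'
print(dict(item))                 # {'author': 'some author', 'content': 'some content'}
# item['title'] = 'x'             # would raise KeyError: 'title' is not a declared field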

 

 

# In pipelines.py (the pipeline file)


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# each class stores the parsed/crawled data on one storage platform
import pymysql
from redis import Redis

# Store in a local file
class QiubaiproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    # persist the data stored in the item object
    def process_item(self, item, spider):
        author = item['author']
        print(author, type(author))
        content = item['content']
        self.fp.write(author + ":" + content)
        return item  # passed on to the next pipeline class to be executed

    def close_spider(self, spider):
        print('Spider finished!!!')
        self.fp.close()
# Store in a MySQL database
class MysqlPipeLine(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='', db='qiubai', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into qiubai values("%s","%s")' % (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

# Store in a Redis database
class RedisPipeLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        self.conn.lpush('qiubai', dic)
        return item
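One caveat, as a supplementary note (this concerns newer redis-py releases and is not covered in the original post): redis-py 3.x and later no longer accept a dict as an lpush value, so the dict is usually serialized to a JSON string first. A minimal sketch of the adjusted Redis pipeline:

import json
from redis import Redis

class RedisPipeLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {'author': item['author'], 'content': item['content']}
        # newer redis-py versions reject raw dicts, so push a JSON string instead
        self.conn.lpush('qiubai', json.dumps(dic))
        return item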

 

In the settings.py configuration file
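The original post leaves this section empty; the following is a minimal sketch of what typically goes here, registering the three pipeline classes defined above so Scrapy actually runs them (the priority numbers and UA string are illustrative, and the module path assumes the qiubaiPro project layout used in the spider file):

# settings.py -- minimal sketch; lower priority number = executed earlier
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
    'qiubaiPro.pipelines.MysqlPipeLine': 301,
    'qiubaiPro.pipelines.RedisPipeLine': 302,
}

# commonly adjusted alongside the pipelines (illustrative values)
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'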

 

 
