Crawlers (5): The Scrapy Framework


Introduction to the Scrapy framework:

  To write a crawler, you need to handle a lot of things yourself. For example: sending network requests, parsing data, storing data, countering anti-crawler mechanisms (switching IP proxies, setting request headers), making asynchronous requests, and so on.

  If you had to write all of this from scratch every time, it would waste a lot of time. Scrapy packages these basics up well, so writing crawlers on top of it becomes more efficient (in both crawling efficiency and development efficiency).

  So in real company work, crawlers of any significant scale are handled with the Scrapy framework.

Scrapy framework modules and their functions:  

  1. Scrapy Engine(引擎): the core of the Scrapy framework. It is responsible for communication and data transfer between the Spider, ItemPipeline, Downloader, Scheduler, and so on.
  2. Spider(爬虫): sends the links that need to be crawled to the engine; when the data requested from the other modules comes back, the engine sends it to the spider, which then parses out the data it wants. This is the part that we developers write ourselves, because which links to crawl and which data on the page we need is decided by the programmer.
  3. Scheduler(调度器): receives the requests sent from the engine, arranges and organizes them in a certain way, and is responsible for scheduling the order of requests, and so on.
  4. Downloader(下载器): receives the download requests passed from the engine, downloads the corresponding data from the network, and then returns it to the engine.
  5. Item Pipeline(管道): responsible for saving the data passed from the Spider(爬虫). Where exactly to save it is up to the developer's own needs.
  6. Downloader Middlewares(下载中间件): middleware that can extend the communication between the engine and the downloader.
  7. Spider Middlewares(Spider中间件): middleware that can extend the communication between the engine and the spider.

Scrapy architecture diagram:

  (image: Scrapy architecture / data-flow diagram)

Scrapy Quick Start

  Installation and documentation:    

    1. Installation: install it via pip install scrapy.
    2. Scrapy official document: http://doc.scrapy.org/en/latest
    3. Scrapy Chinese document: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

    note:     

    1. On Ubuntu, before installing scrapy you need to install the following dependencies:
      sudo apt-get install python3-dev build-essential python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev, and then install via pip install scrapy.
    2. On Windows, if the system reports the error ModuleNotFoundError: No module named 'win32api', it can be resolved with the following command: pip install pypiwin32.

  Getting Started:

    Create a project:      

      To create a project with the Scrapy framework, you need to use a command. First, cd into the directory where you want to store the project. Then create it with the following command:

        scrapy startproject [项目名称]

    Directory Structure Description:
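
      For reference, a project created with scrapy startproject follows Scrapy's standard template, and its layout looks roughly like this:

        项目名称/
            scrapy.cfg
            项目名称/
                __init__.py
                items.py
                middlewares.py
                pipelines.py
                settings.py
                spiders/
                    __init__.py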

 

      The roles of the key files are as follows:       

      • items.py: used to define the models that store the data crawled by the spider.
      • middlewares.py: used to store the various middlewares.
      • pipelines.py: used to save the items models to local disk (or other storage).
      • settings.py: stores the crawler's configuration information (such as request headers, how frequently to send requests, the IP proxy pool, etc.).
      • scrapy.cfg: the project configuration file.
      • spiders package: all the spiders written later are stored in here.

 

    Using the Scrapy framework to crawl Qiushibaike (糗事百科) jokes:

      1. Create a spider with the command:  scrapy genspider qsbk "qiushibaike.com"

        This creates a spider named qsbk and restricts the pages it may crawl to the qiushibaike.com domain.

        Spider code analysis:         

          
import scrapy

class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        pass

 

        In fact, we could write this code by hand ourselves rather than using the command; it is just that writing it yourself is more troublesome than using the command.

        To create a Spider, you must define a custom class that inherits from scrapy.Spider, and then define three attributes and one method in that class.           

      1. name: the name of the spider; this name must be unique.
      2. allowed_domains: the allowed domain names. The spider will only crawl pages under these domains; pages outside these domains are ignored automatically.
      3. start_urls: the spider starts crawling from the URLs in this variable.
      4. parse: the engine throws the data that the downloader has downloaded back to the spider for parsing, and the spider passes that data to this parse method. This is a fixed convention. The method has two roles: the first is to extract the desired data, the second is to generate the URLs for the next requests (a minimal sketch of both roles is shown right after this list).
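
      For illustration only, here is a minimal sketch of the two roles of parse (extracting data and generating the next request). The site, XPath expressions, and field name below are made-up placeholders, not part of this tutorial's target site:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/list']

    def parse(self, response):
        # role 1: extract the desired data and hand it on (e.g. to a pipeline)
        for row in response.xpath("//div[@class='row']"):
            yield {'title': row.xpath("./h2/text()").extract_first()}
        # role 2: generate the URL of the next request
        next_page = response.xpath("//a[@class='next']/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)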

      2. Modify the settings.py code:

        Before writing a crawler, you must remember to modify the settings in settings.py. Two settings are strongly recommended; a sample snippet follows the list below.        

      1. Set ROBOTSTXT_OBEY to False. The default is True, which means obeying the robots protocol: when crawling, Scrapy first fetches the site's robots.txt file, and pages that the file disallows are simply not crawled.
      2. Add a User-Agent to DEFAULT_REQUEST_HEADERS. This tells the server that the request looks like a normal browser request rather than a crawler.
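
        For reference, a minimal settings.py sketch of these two changes (the User-Agent string below is just an example browser string; any normal browser's User-Agent will do):

# settings.py -- the two recommended changes
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
}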

      3. Complete spider code:

        
import scrapy
from abcspider.items import QsbkItem

class QsbkSpider(scrapy.Spider):
    name = 'qsbk'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # each div under #content-left is one joke entry
        outerbox = response.xpath("//div[@id='content-left']/div")
        items = []
        for box in outerbox:
            # author name and joke text inside the entry
            author = box.xpath(".//div[contains(@class,'author')]//h2/text()").extract_first().strip()
            content = box.xpath(".//div[@class='content']/span/text()").extract_first().strip()
            item = QsbkItem()
            item["author"] = author
            item["content"] = content
            items.append(item)
        return items
Spider code

 

        
import scrapy

class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
items.py code

 

        
import json

class AbcspiderPipeline(object):
    def __init__(self):
        # collect all items passed in by the spider
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        print("=" * 40)
        return item

    def close_spider(self, spider):
        # when the spider finishes, dump everything to a JSON file
        with open('qsbk.json', 'w', encoding='utf-8') as fp:
            json.dump(self.items, fp, ensure_ascii=False)
pipeline code
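
        Note: a pipeline is only called if it is enabled in settings.py. Assuming the project package is named abcspider (matching the import in the spider code above), the entry would look something like this:

# settings.py -- enable the pipeline so process_item() gets called;
# the number (300) is the pipeline's priority (lower values run earlier)
ITEM_PIPELINES = {
    'abcspider.pipelines.AbcspiderPipeline': 300,
}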

 

      4. Run the Scrapy project:

        To run the Scrapy project, enter the directory where the project is located in the terminal, and then run scrapy crawl [爬虫名字] to run the specified spider.

        If you do not want to type the command on the command line every time, you can put the command into a file, and then simply run that file from PyCharm.

        For example, create a new file called start.py, and put the following code in it:         

          from scrapy import cmdline

          cmdline.execute("scrapy crawl qsbk".split())

 

Origin www.cnblogs.com/jjb1997/p/11243547.html