A brief introduction to Scrapy, with a worked example

Scrapy is an application framework for crawling websites and extracting structured data. It can be used for a wide range of tasks such as data mining, information processing, or archiving historical data. Although originally designed for web page scraping, it can also be used to fetch data returned by APIs (e.g. Amazon Associates Web Services) or as a general-purpose web crawler, and it is commonly applied to data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication; its overall architecture is described below.

 

Scrapy mainly includes the following components:

    • Engine (Scrapy Engine)
      Handles the data flow of the whole system and triggers events; it is the core of the framework.
    • Scheduler
      Accepts requests from the engine and pushes them into a queue, returning them when the engine asks for them again. It can be thought of as a priority queue of URLs (the pages or links to be crawled) that decides what to crawl next while removing duplicate URLs.
    • Downloader
      Downloads web page content and returns it to the spiders. (The Scrapy downloader is built on Twisted's efficient asynchronous model.)
    • Spiders
      Do the main work: they extract the information they need, the so-called items (Item), from specific web pages. A spider can also extract links and let Scrapy continue to crawl the next page.
    • Item Pipeline
      Processes the items extracted from web pages by the spiders. Its main tasks are persisting items, validating them, and removing unneeded data. Once a page has been parsed by a spider, its items are sent to the item pipeline and processed in a specific order.
    • Downloader Middlewares
      A hook framework between the Scrapy engine and the downloader; it mainly processes the requests and responses passing between them (see the sketch after this list).
    • Spider Middlewares
      A hook framework between the Scrapy engine and the spiders; its main job is to process the spiders' response input and request output.
    • Scheduler Middlewares
      Middleware between the Scrapy engine and the scheduler; it handles the requests sent from the Scrapy engine to the scheduler.
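As an illustration of the middleware idea, a downloader middleware only needs to implement hooks such as process_request()/process_response(). The following is a minimal sketch, not part of the example project below: the class name and header value are made up, and it would still have to be registered in DOWNLOADER_MIDDLEWARES before Scrapy would use it.

class CustomHeaderDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Runs for every request on its way from the engine to the downloader.
        request.headers.setdefault('User-Agent', 'my-scrapy-bot/1.0')
        return None  # None means: continue processing this request normally

    def process_response(self, request, response, spider):
        # Runs for every response on its way back from the downloader to the engine.
        return response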

The Scrapy running process is roughly as follows (a code sketch follows the list):

  1. The engine takes a URL from the scheduler for the next crawl.
  2. The engine wraps the URL in a Request and sends it to the downloader.
  3. The downloader fetches the resource and wraps it in a Response.
  4. The spider parses the Response.
  5. Any parsed-out items (Item) are handed to the item pipeline for further processing.
  6. Any parsed-out links (URL) are handed to the scheduler to wait to be crawled.
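In code, steps 4-6 correspond to a spider's parse() callback, which may yield both items (step 5) and new requests (step 6). A minimal sketch follows; the spider name, URL, and selectors are illustrative only and do not belong to the example project further down, and the items are plain dicts for brevity.

import scrapy

class FlowSketchSpider(scrapy.Spider):
    name = "flow_sketch"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Steps 4-5: parse the Response and hand an item to the item pipeline.
        yield {"title": response.xpath("//title/text()").extract_first()}
        # Step 6: parsed links go back to the scheduler as new Requests.
        for href in response.xpath("//a/@href").extract():
            yield response.follow(href, callback=self.parse)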

 

1. Installation

    1. Install wheel
        pip install wheel
    2. Install lxml
        https://pypi.python.org/pypi/lxml/4.1.0
    3. Install pyOpenSSL
        https://pypi.python.org/pypi/pyOpenSSL/17.5.0
    4. Install Twisted
        https://www.lfd.uci.edu/~gohlke/pythonlibs/
    5. Install pywin32
        https://sourceforge.net/projects/pywin32/files/
    6. Install Scrapy
        pip install scrapy

 

 Note: On Windows, Scrapy depends on pywin32; download the installer that matches your system (32-bit or 64-bit) from https://sourceforge.net/projects/pywin32/
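A quick way to verify that the installation succeeded is to import Scrapy from a Python shell and print its version (the version number shown will of course depend on what was installed):

import scrapy
print(scrapy.__version__)  # prints the installed Scrapy version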

 

A crawler example

 

Target: the 100 newest American TV shows listed on Meijutt ("American Drama Paradise", http://www.meijutt.com/new100.html)

1. Create a project

scrapy startproject movie

 

2. Create a crawler program

cd movie
scrapy genspider meiju meijutt.com

 

3. The directories and files are created automatically
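The generated layout looks roughly like this (the exact files vary slightly with the Scrapy version; middlewares.py, for example, only appears in newer releases):

movie/
    scrapy.cfg            # project configuration for the command-line tool
    movie/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            meiju.py      # created by "scrapy genspider meiju meijutt.com"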

 

4. Description of the generated files:

  • scrapy.cfg: the project's configuration file; it mainly provides basic configuration for the Scrapy command-line tool. (The configuration that actually affects crawling lives in settings.py.)
  • items.py: defines the data storage template for structured data, similar to Django's Model.
  • pipelines.py: data processing behavior, e.g. persisting the structured data.
  • settings.py: the configuration file, e.g. recursion depth, concurrency, download delay, and so on.
  • spiders/: the spider directory, where the crawler files and their crawling rules are written.

Note: a spider file is generally named after the domain of the website it crawls.

 

5. Set the data storage template

  items.py

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()

 

6. Write a crawler

  meiju.py

# -*- coding: utf-8 -*-
import scrapy
from movie.items import MovieItem


class MeijuSpider(scrapy.Spider):
    name = "meiju"
    allowed_domains = ["meijutt.com"]
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        movies = response.xpath('//ul[@class="top-list  fn-clear"]/li')
        for each_movie in movies:
            item = MovieItem()
            item['name'] = each_movie.xpath('./h5/a/@title').extract()[0]
            yield item
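One fragile spot: extract()[0] raises an IndexError if a list entry has no title attribute. A slightly more defensive variant of the extraction, sketched here with Scrapy's extract_first() (which returns None instead of raising), could replace the last three lines of the loop:

# Inside the for loop of parse(): skip entries without a title instead of crashing.
title = each_movie.xpath('./h5/a/@title').extract_first()
if title:
    item = MovieItem()
    item['name'] = title
    yield item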

 

7. Set the configuration file

  settings.py add the following content

ITEM_PIPELINES = {'movie.pipelines.MoviePipeline': 100}
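The number 100 is the pipeline's order value: items pass through pipelines in ascending order of this value, and the convention is to keep it in the 0-1000 range. If the project later gained a second pipeline, the registration might look like the sketch below (the second class name is hypothetical):

ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,       # runs first (lower value)
    # 'movie.pipelines.SaveToDbPipeline': 300,  # hypothetical: would run after MoviePipeline
}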

 

8. Write data processing scripts

  pipelines.py

class MoviePipeline(object):
    def process_item(self, item, spider):
        # Append each show name to a local text file (Python 3: text mode, UTF-8).
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(item['name'] + '\n')
        return item
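Opening the file for every item works, but for larger crawls a pipeline can open it once per run using Scrapy's optional open_spider/close_spider hooks. A variant sketch (same output file assumed):

class MoviePipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file a single time.
        self.fp = open("my_meiju.txt", 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['name'] + '\n')
        return item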

 

9. Execute the crawler

1
2
cd movie
scrapy crawl meiju --nolog

 

10. Results

When the crawl finishes, the extracted show names can be found in my_meiju.txt, one per line.

 
