Scrapy Introduction and Dangdang Product Information Crawling in Practice


I. How to create a Scrapy crawler project

(1) Press Win + R and open cmd. If I want to create the Scrapy project on the F drive, navigate there with the commands below. (cd <folder> goes down one level, cd .. goes back up one level, and cd \ returns to the root of the drive.)
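For example, a session that moves onto the F drive might look like this (the folder name is just a placeholder):

C:\Users\me> F:
F:\> cd scrapy_demo
F:\scrapy_demo> cd ..
F:\>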


(2) Use the command scrapy startproject dangdang to create a project named dangdang.
 
 
  (3) Open the core directory of the project; it contains quite a few files, as follows.
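For reference, the layout that scrapy startproject generates is the standard one below (the inner dangdang folder is the core directory the following steps refer to):

dangdang/
    scrapy.cfg            # project deployment/configuration file
    dangdang/             # the project's Python module, i.e. the core directory
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py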
 

The spiders folder is where our crawlers live. You can use scrapy genspider -t basic <filename> <domain> to create a crawler file; basic is only a template name and can be replaced with one of the templates listed below (a concrete genspider example follows the list):

Available templates:   # template descriptions
  basic      create a basic crawler file
  crawl      create a crawler file that crawls pages automatically
  csvfeed    create a crawler file for crawling CSV data
  xmlfeed    create a crawler file for crawling XML data
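For the project in this article, the crawler file used later can be generated from the basic template like this (run inside the project directory; dd and dangdang.com are the spider name and domain used below):

F:\dangdang> scrapy genspider -t basic dd dangdang.com

This creates dangdang/spiders/dd.py.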
What to write in each file, and how the files relate to the Scrapy architecture:

A spider does two things: (1) defines the crawling actions for the website, and (2) parses the pages that have been crawled.
  __init__.py: initialization file of the crawler project, used for the project's initial setup.
  items.py: data container file of the crawler project, used to define the data to be collected.
  middlewares.py: middleware file of the crawler project.
  pipelines.py: pipeline file of the crawler project, used for further processing of the item data.
  settings.py: settings file of the crawler project, containing the project's configuration.

 

(4) Run a crawler file inside a Scrapy project with scrapy crawl dd.
NOTE: a crawler file inside a Scrapy project is not run with scrapy runspider <filename>.py.

II. Some Scrapy commands

(1) Enter scrapy -l to view the global commands:

  bench         run a crawl performance benchmark on the current machine
  fetch         download the content of a web page and print what is returned in the terminal
  genspider     create a crawler .py file in the spiders folder
  runspider     run a crawler file directly, without depending on a Scrapy project
  settings      view the project's configuration
  shell         start the Scrapy interactive shell
  startproject  create a crawler project
  version       print the Scrapy version
                scrapy version [-v]; adding -v also prints the versions of Scrapy's dependencies
  view          download a web page and then view it in the browser
 
(2) Enter the crawler project you created and run scrapy -l again; you can now see both the global commands and the project commands.
(3) scrapy genspider -l lists the crawler templates that Scrapy provides.
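The shell command is especially handy for testing XPath expressions before putting them into a spider. A quick sketch using the category page and XPath from the next section (the exact output depends on the live page):

F:\dangdang> scrapy shell http://category.dangdang.com/cid4008149.html
>>> response.xpath('//a[@name="itemlist-title"]/@title').extract()
>>> exit()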

III. Dangdang product crawling in practice

Goal: crawl the names, links, and comment counts of the dresses on Dangdang, starting from the first page of the category, and write them to a database.
(1) Create the project and the crawler file under the spiders folder from the cmd command line, then open the project with an editor.
(2) Find the items.py file and define the data container for the data to be collected.


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# items.py: data container file of the crawler project, used to define the data to be collected.
# Crawl the product names, links, and comment counts of Dangdang dresses.

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()
 

(3) Find the custom crawler file dd.py and write the spider.

# -*- coding: utf-8 -*-
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request

class DdSpider(scrapy.Spider):
    # the crawler name, usually the same as the file name
    name = 'dd'
    # domain names the crawler is allowed to crawl
    allowed_domains = ['dangdang.com']
    # the URL crawled first
    start_urls = ['http://category.dangdang.com/cid4008149.html']

    # response is the information returned from the website after crawling
    def parse(self, response):
        item = DangdangItem()
        # item can be used like a dictionary; each key below is assigned a list as its value
        # response.xpath('//a[@name="itemlist-title"]/@title').extract() returns a list
        item["title"] = response.xpath('//a[@name="itemlist-title"]/@title').extract()
        item["link"] = response.xpath('//a[@name="itemlist-title"]/@href').extract()
        item["comment"] = response.xpath('//a[@name="itemlist-review"]/text()').extract()

        yield item
        # The code above crawls the information on the first page; how do we crawl the pages after it?
        # Construct requests with scrapy.Request(url, callback)
        # url: the link to request
        # callback: the callback invoked when the request completes and a response is obtained;
        # the engine passes the response to the callback as a parameter. The callback parses the
        # response and/or generates the next request, as parse does below.
        for i in range(2, 4):
            url = 'http://category.dangdang.com/pg' + str(i) + '-cid4008149.html'
            yield Request(url, callback=self.parse)
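
The goal at the start of this section also says the data should be written to a database; that is the job of pipelines.py, which receives every item yielded by parse through its process_item method. The original post does not show that file, so the following is only a minimal sketch, assuming a local MySQL database named dd with a table goods(title, link, comment) and the pymysql package installed:

# pipelines.py -- a hypothetical sketch, not part of the original article
import pymysql

class DangdangPipeline(object):
    def open_spider(self, spider):
        # open one connection when the spider starts (host/user/password/db are assumptions)
        self.conn = pymysql.connect(host='127.0.0.1', user='root', password='root',
                                    db='dd', charset='utf8mb4')

    def process_item(self, item, spider):
        # title, link and comment are parallel lists, so write them out row by row
        cursor = self.conn.cursor()
        for title, link, comment in zip(item['title'], item['link'], item['comment']):
            cursor.execute('INSERT INTO goods(title, link, comment) VALUES (%s, %s, %s)',
                           (title, link, comment))
        self.conn.commit()
        cursor.close()
        return item

    def close_spider(self, spider):
        self.conn.close()

For Scrapy to call the pipeline it must be enabled in settings.py, e.g. ITEM_PIPELINES = {'dangdang.pipelines.DangdangPipeline': 300}. Also note that projects generated by recent Scrapy versions set ROBOTSTXT_OBEY = True by default; if the spider returns nothing, try setting it to False in settings.py.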

 

(4) Open a command line and type scrapy crawl dd --nolog to run the crawler file, that is, to run the crawler project.
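If you just want a quick look at the scraped items before wiring up the database, Scrapy's built-in feed export can also dump them to a file, for example:

scrapy crawl dd -o result.json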
The result (partial): [output screenshot omitted]

Origin: www.cnblogs.com/BlogSsun/p/11627061.html