Article Directory
I. How to create a Scrapy crawler project
(1) Press Win + R and open cmd. If you want to create the Scrapy project on the F: drive, navigate there with the commands below (cd enters a directory, cd .. returns to the parent directory, cd \ returns to the drive root).
(2) Use the command scrapy startproject dangdang to create a project named dangdang.
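In cmd, the sequence of commands might look like this (a sketch, assuming the project goes in the root of the F: drive):

```shell
F:                              :: switch to the F: drive
cd \                            :: go to the drive root
scrapy startproject dangdang    :: create a project named dangdang
```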
(3) Click into the project's core directory; you will find quite a few files there, as follows.
The spiders folder is where our spider files go. You can create a spider file with scrapy genspider -t basic <filename> <domain>. Here basic is a template name and can be replaced by any of the following:
Available templates (template: description):
- basic: create a basic spider file
- crawl: create a spider file that follows links automatically
- csvfeed: create a spider file for crawling CSV feed data
- xmlfeed: create a spider file for crawling XML feed data
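For example, the spider file dd.py used later in this article could be created like this (a sketch, run from inside the project directory):

```shell
cd dangdang
scrapy genspider -t basic dd dangdang.com
:: generates dangdang/spiders/dd.py from the "basic" template
```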
A spider does two things: (1) define the actions for crawling the website; (2) parse the pages it crawls.
__init__.py: the project's initialization file, used for the project's initial setup.
items.py: the project's data container file, used to define the data to be scraped.
middlewares.py: the project's middleware file.
pipelines.py: the project's pipeline file, used for further processing of the scraped items.
settings.py: the project's settings file, containing the project's configuration.
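For reference, the directory layout produced by scrapy startproject dangdang looks roughly like this (a sketch; details may vary slightly between Scrapy versions):

```text
dangdang/
├── scrapy.cfg            # deployment configuration
└── dangdang/             # the project's Python module
    ├── __init__.py       # project initialization
    ├── items.py          # data containers: define the fields to scrape
    ├── middlewares.py    # project middleware
    ├── pipelines.py      # post-processing of scraped items
    ├── settings.py       # project settings
    └── spiders/          # put your spider files here
        └── __init__.py
```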
(4) Run the spider from inside the Scrapy project directory: scrapy crawl dd
NOTE: spider files inside a Scrapy project are not run with scrapy runspider <filename>.py.
II. Some Scrapy commands
(1) Enter scrapy -l to view the global commands:
- bench: run a quick benchmark testing the machine's current crawling performance
- fetch: download a web page and print the returned content in the terminal
- genspider: create a .py spider file in the spiders folder
- runspider: run a single spider file directly, without depending on a Scrapy project
- settings: view the project's configuration
- shell: start the Scrapy interactive shell
- startproject: create a crawler project
- version: print the Scrapy version; scrapy version -v also shows the versions of its dependencies
- view: download a web page and then view it in a browser
(2) Enter the crawler project you created, then enter scrapy -l; you can now view both the global commands and the project commands.
(3) scrapy genspider -l lists the spider templates Scrapy can create files from.
III. Dangdang product crawling in practice
Goal: crawl the product names, links, and comment counts of the dresses on the first page of Dangdang, and write them to a database.
(1) Use the cmd command line to create the project and the spider file, then open the folder with an editor.
(2) Find the items.py file and define the data containers to be scraped.
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# items.py: the project's data container file, used to define the data to be scraped.
# Crawl the dress product names, links, and comment counts from Dangdang.
class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()
```
(3) Find the custom spider file dd.py.
```python
# -*- coding: utf-8 -*-
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request

class DdSpider(scrapy.Spider):
    # the spider's name, usually the same as the file name
    name = 'dd'
    # the domains the spider is allowed to crawl
    allowed_domains = ['dangdang.com']
    # the URLs crawled first
    start_urls = ['http://category.dangdang.com/cid4008149.html']

    # response holds the information returned by the crawled website
    def parse(self, response):
        item = DangdangItem()
        # item can be understood as a dictionary; each key below is assigned a list
        # response.xpath('//a[@name="itemlist-title"]/@title').extract() returns a list
        item["title"] = response.xpath('//a[@name="itemlist-title"]/@title').extract()
        item["link"] = response.xpath('//a[@name="itemlist-title"]/@href').extract()
        item["comment"] = response.xpath('//a[@name="itemlist-review"]/text()').extract()
        yield item
        # The code above crawls the first page; how do we crawl the following pages?
        # Construct requests with scrapy.Request(url, callback)
        #   url: the link to request
        #   callback: called when the request completes and the response is received;
        #   the engine passes the response to the callback as a parameter. The callback
        #   parses the response and/or generates the next request, as parse does here.
        for i in range(2, 4):
            url = 'http://category.dangdang.com/pg' + str(i) + '-cid4008149.html'
            yield Request(url, callback=self.parse)
```
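The goal above also calls for writing the items to a database, but the article does not show that step. Below is a minimal sketch of a pipelines.py that stores the items in SQLite (chosen here only because it is in the standard library; the DangdangPipeline class name, the dangdang.db file, and the goods table are assumptions, not the article's actual code):

```python
# pipelines.py -- a hypothetical pipeline writing crawled items to SQLite.
# The database file name "dangdang.db" and table name "goods" are assumptions.
import sqlite3

class DangdangPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the connection, create the table
        self.conn = sqlite3.connect("dangdang.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS goods (title TEXT, link TEXT, comment TEXT)"
        )

    def process_item(self, item, spider):
        # item["title"], item["link"], item["comment"] are parallel lists
        # (one entry per product on the page), so zip them into one row per product
        rows = zip(item["title"], item["link"], item["comment"])
        self.conn.executemany("INSERT INTO goods VALUES (?, ?, ?)", rows)
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # called once when the spider finishes: close the connection
        self.conn.close()
```

To take effect, the pipeline must be registered in settings.py, e.g. ITEM_PIPELINES = {'dangdang.pipelines.DangdangPipeline': 300}.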
(4) Open a command line and type scrapy crawl dd --nolog to run the spider file, i.e. run the project's spider.
The result: (a partial screenshot of the crawled output, omitted here)