Web Crawler Series (1): The Scrapy Framework

Scrapy installation

There are two ways to install Scrapy:

  • Using pip: pip install Scrapy
  • Using the Douban mirror (hosted in China): pip install -i https://pypi.douban.com/simple/ scrapy

The second way is recommended because the mirror makes the download faster.
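
After installing, you can confirm that Scrapy is available with the scrapy version command (the version number shown here is just an example; yours will vary):

```
$ scrapy version
Scrapy 2.11.0
```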

Scrapy commands

Typing scrapy on the command line with no arguments directly displays the commonly used commands:

1. scrapy startproject Demo (Demo is the project name): Create a new project.

2. scrapy genspider name domain: name is the spider's name, domain is the domain of the site to crawl.

3. scrapy crawl <spider>: Start the spider.

4. scrapy list: List all spiders in the project.

5. scrapy fetch <url>: Fetch a URL and print the response.

6. scrapy shell [url]: Open an interactive debugging shell.

We will come back to these commands for debugging as the system design progresses.
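
As a quick illustration, here is a typical session that creates a project, generates a spider, and runs it (the project and spider names are just placeholders):

```
$ scrapy startproject demo
$ cd demo
$ scrapy genspider example example.com
$ scrapy list
example
$ scrapy crawl example
```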

File Structure

1. scrapy.cfg: the project configuration file

2. Spider/spiders/: the spider code files

3. Spider/items.py: containers for the scraped data

4. Spider/pipelines.py: operations for processing and storing the data

5. Spider/settings.py: the project settings file

6. Spider/middlewares.py: middleware

You will need to modify these files as you write the code.
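
For reference, this is roughly the layout that scrapy startproject generates, assuming the project is named Spider as above:

```
Spider/
├── scrapy.cfg            # project configuration file
└── Spider/
    ├── __init__.py
    ├── items.py          # containers for scraped data
    ├── middlewares.py    # spider/downloader middleware
    ├── pipelines.py      # data processing and storage
    ├── settings.py       # project settings
    └── spiders/          # spider code goes here
        └── __init__.py
```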

Spider class

The Spider class defines the crawler's methods and attributes. The common methods and attributes are listed below:

Class properties:

  • name: the spider's name; it must be unique within the project.
  • allowed_domains: the domains the spider is allowed to crawl.
  • start_urls: the list of start URLs; multiple URLs are allowed.
  • custom_settings: spider-specific settings that override the global settings.
  • settings: the settings configuration the crawler runs with.
  • logger: a Python logger created with the spider's name; it can be used to send log messages.

Class Methods:

  • from_crawler(cls, crawler, *args, **kwargs): a class method used to instantiate the spider object; it binds the crawler to the spider.
  • start_requests(self): a generator that yields Request objects built from the start URLs; it is the crawl's entry point and is called automatically when the crawler runs.
  • parse(self, response): the parsing function; it returns Items or further Requests, looping until all the data has been processed.
  • close(self, reason): called automatically when the spider closes.
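
Putting these together, a minimal spider might look like the sketch below (the name, domain, URL, and CSS selector are placeholders for illustration):

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"                            # must be unique within the project
    allowed_domains = ["example.com"]        # off-domain requests are filtered out
    start_urls = ["https://example.com/"]
    custom_settings = {"DOWNLOAD_DELAY": 1}  # overrides the global settings

    def start_requests(self):
        # Entry point: yield the initial Requests when the crawl starts.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Parsing function: yield items and/or follow-up Requests.
        self.logger.info("Parsing %s", response.url)
        yield {"title": response.css("title::text").get()}
```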

Request object

Scrapy uses the built-in scrapy.http.Request and Response objects to handle network requests and responses. The common parameters of the Request object are:

  • url: the URL of the requested page
  • callback: the callback function, i.e. the page parsing function
  • method: the HTTP request method, 'GET' by default
  • headers: the HTTP request headers
  • body: the body of the HTTP request
  • cookies: cookies
  • encoding: the encoding, utf-8 by default
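
Here is a short sketch of how these parameters are typically passed when constructing a follow-up request (the URL, header, and cookie values are hypothetical):

```python
import scrapy


class DetailSpider(scrapy.Spider):
    name = "detail"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        yield scrapy.Request(
            url="https://example.com/detail",       # page to request
            callback=self.parse_detail,             # parsing function for the response
            method="GET",                           # the default method
            headers={"User-Agent": "my-crawler"},   # custom request headers
            cookies={"session": "abc123"},          # hypothetical cookie
        )

    def parse_detail(self, response):
        yield {"url": response.url}
```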

Response object

The Response class carries the information returned by an HTTP download. It is only a base class and has several subclasses:

  • TextResponse
  • HtmlResponse
  • XmlResponse

When a page finishes downloading, the downloader creates an object of the appropriate Response subclass according to the Content-Type field of the HTTP response headers. The Response object's attributes and methods:

  • url: the URL of the response
  • status: the HTTP response status code
  • body: the body of the response
  • request: the Request object that produced this response
  • meta: metadata (carried over from the Request's meta)
  • copy(): returns a new Response that is a copy of this one
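
These attributes are typically read inside a parse callback, as in this sketch (the site is just an example):

```python
import scrapy


class StatusSpider(scrapy.Spider):
    name = "status"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # response is an HtmlResponse here, chosen by Content-Type.
        self.logger.info("%s returned status %d", response.url, response.status)
        self.logger.info("Body is %d bytes", len(response.body))
        self.logger.info("Came from request: %s", response.request.url)
        self.logger.info("Request metadata: %r", response.meta)
        snapshot = response.copy()   # an independent copy of this Response
        yield {"url": response.url, "status": response.status}
```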

That covers the crawler basics; the code that follows will be completed step by step over the course of the graduation project.
