Scrapy installation
There are two ways to install Scrapy:
- Install directly with pip:
pip install Scrapy
- Install from the Douban (douban.com) PyPI mirror:
pip install -i https://pypi.douban.com/simple/ scrapy
The second method is recommended because it downloads faster (the mirror is hosted in China).
Scrapy commands
Entering scrapy on the command line with no arguments displays the commonly used commands:
1. scrapy startproject <project_name>
: Create a new project.
2. scrapy genspider <name> <domain>
: <name> is the spider's name, <domain> is the domain of the site to crawl.
3. scrapy crawl <spider>
: Start a spider.
4. scrapy list
: List all spiders in the project.
5. scrapy fetch <url>
: Download and print the response for a URL.
6. scrapy shell [url]
: Open an interactive debugging shell.
The individual commands will be used for debugging, step by step, during the later system design.
Running scrapy startproject generates the following files:
1. scrapy.cfg: project configuration file
2. Spider/spiders: directory holding the spider code
3. Spider/items.py: containers for the scraped data
4. Spider/pipelines.py: operations performed to store the data
5. Spider/settings.py: project settings file
6. Spider/middlewares.py: middleware definitions
You will modify the contents of these files as you write the code.
Spider class
The Spider class defines the crawler's methods and attributes. The common ones are listed below:
Class attributes:
name
: The spider's name; it must be unique within the project.
allowed_domains
: The domains the spider is allowed to crawl.
start_urls
: The list of starting URLs; multiple URLs are allowed.
custom_settings
: Spider-specific settings that override the global settings.
settings
: The settings the crawler runs with.
logger
: A Python logger created with the spider's name; it can be used to emit log messages.
Class methods:
from_crawler(cls, crawler, *args, **kwargs)
: Class method that creates the spider instance and binds the crawler object to it.
start_requests(self)
: Generator that yields Request objects built from the start URLs; it is the entry point and is called automatically when the spider starts.
parse(self, response)
: Parsing callback; it returns Items or further Requests, looping until all the data has been processed.
close(self, reason)
: Called automatically when the spider closes.
Request object
Scrapy uses the built-in scrapy.http.Request and Response objects to handle network requests and responses. Common parameters of the Request object:
url
: The URL of the page to request.
callback
: The callback function, i.e. the function that parses the page.
method
: The HTTP request method, 'GET' by default.
headers
: The HTTP request headers.
body
: The HTTP request body.
cookies
: The request cookies.
encoding
: The encoding, utf-8 by default.
Response object
The Response class holds the information returned by an HTTP download. It is only a base class; it has several subclasses:
- TextResponse
- HtmlResponse
- XmlResponse
When a page download completes, the downloader creates an object of the appropriate subclass according to the Content-Type field of the HTTP response header. Response object attributes and methods:
url
: The URL string of the response.
status
: The HTTP response status code.
body
: The body of the response.
request
: The Request object that produced this response.
meta
: Metadata carried over from the request.
copy()
: Returns a new Response that is a copy of this one.
That covers the basics of Scrapy spiders; the code that follows will be written step by step during the graduation project.