from scrapy import Spider
The Spider class defines how a website is crawled: the crawling actions to perform and how to extract structured data from page content. In short, a spider defines how to crawl a site and how to parse its pages.
scrapy.Spider is the simplest spider, and every other spider must inherit from it (both the spiders that ship with Scrapy and any spider you write yourself). It provides no special functionality, only a default implementation of start_requests() that reads the spider's start_urls attribute and issues a request for each URL, calling the spider's parse method with each resulting response.
Workflow:
- The spider starts with initial Requests built from the start URLs, each with a callback attached; when a request completes, the resulting Response is passed to that callback as a parameter. By default, the parent class's start_requests() iterates over start_urls and builds a Request for each URL (historically via make_requests_from_url()), with parse as the callback.
- Inside the callback you parse the returned page content and can return Item objects, dicts, Requests, or an iterable containing any of the three. Returned Request objects are scheduled by Scrapy, their content is downloaded, and the callbacks set on them are invoked in turn.
- Within the callback you can extract the content you want into items using lxml, BeautifulSoup, XPath, CSS selectors, and similar tools.
- Finally, the items are handed to the Item Pipeline for processing.
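The workflow above can be sketched in plain Python. The Request, Response, and Spider classes below are simplified stand-ins for Scrapy's own (the real ones live in scrapy.http and scrapy), and the URL is hypothetical:

```python
# Simplified stand-ins for Scrapy's Request/Response/Spider, to show the
# start_requests() -> download -> callback flow without running a crawler.

class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback  # invoked with the Response once downloaded

class Response:
    def __init__(self, url, text=""):
        self.url = url
        self.text = text

class Spider:
    start_urls = []

    def start_requests(self):
        # default behavior: one Request per start URL, parse() as callback
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        raise NotImplementedError

class MySpider(Spider):
    start_urls = ["http://example.com/page1"]  # hypothetical URL

    def parse(self, response):
        # callbacks may yield items (dicts), Item objects, or further Requests
        yield {"url": response.url}

spider = MySpider()
requests = list(spider.start_requests())
# simulate the engine downloading the page and invoking the callback
items = list(requests[0].callback(Response(requests[0].url)))
print(items)
```

In a real project, Scrapy's engine performs the download step and feeds each Response to the callback for you.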
Common Spider attributes:
All the spiders we write ourselves inherit from scrapy.Spider.
name
Defines the spider's name. This is the name we use when launching the crawl from the command line, and it must be unique within the project.
allowed_domains
A list of the domains the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domains are not in this list will not be followed.
Accordingly, every Request the spider generates is checked against these domains.
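The domain check can be sketched as follows. This is a simplified version of what OffsiteMiddleware does (the real middleware compiles the allowed domains into a regex); the URLs are examples:

```python
from urllib.parse import urlparse

# Simplified version of the domain check OffsiteMiddleware applies to each
# Request's URL: the host must equal an allowed domain or be a subdomain of one.

def is_offsite(url, allowed_domains):
    host = urlparse(url).netloc
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_offsite("http://quotes.toscrape.com/page/2/", ["toscrape.com"]))  # False: allowed
print(is_offsite("http://ads.example.net/banner", ["toscrape.com"]))       # True: filtered out
```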
start_urls
The list of starting URLs.
start_requests(), inherited from scrapy.Spider, iterates over this list and issues a request for each address.
custom_settings
Custom configuration that overrides the project settings for this spider; mainly used when a particular spider has special requirements.
It is provided as a dictionary: custom_settings = {}
Example:
custom_settings = {
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 0,
    'COOKIES_ENABLED': False,  # enabled by default
    'DOWNLOADER_MIDDLEWARES': {
        # proxy middleware
        'mySpider.middlewares.ProxiesMiddleware': 400,
        # Selenium middleware
        'mySpider.middlewares.SeleniumMiddleware': 543,
        # disable Scrapy's default user-agent middleware
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    },
}
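The effect of the override can be illustrated with a plain dict merge. This is a simplification (Scrapy's Settings object resolves overrides by priority), and the values are hypothetical:

```python
# Simplified illustration of settings precedence: spider-level
# custom_settings override the project-level defaults.

project_settings = {'LOG_LEVEL': 'DEBUG', 'DOWNLOAD_DELAY': 2}
custom_settings = {'LOG_LEVEL': 'INFO'}

effective = {**project_settings, **custom_settings}
print(effective['LOG_LEVEL'])       # INFO -- the spider-level value wins
print(effective['DOWNLOAD_DELAY'])  # 2    -- project default preserved
```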
from_crawler
A class method used when the spider is instantiated.
By defining it we can read values from the project configuration via crawler.settings.get(); the same pattern can also be used in pipelines.
Example:
def __init__(self, mongo_uri, mongo_db):  # constructor: store the two parameters
    self.mongo_uri = mongo_uri
    self.mongo_db = mongo_db

@classmethod
def from_crawler(cls, crawler):  # read the two values from settings and pass them to the constructor
    return cls(
        mongo_uri=crawler.settings.get('MONGO_URI'),
        mongo_db=crawler.settings.get('MONGO_DATABASE')
    )
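The example above can be exercised without Scrapy by standing in a minimal Crawler whose settings attribute is a plain dict (the real crawler.settings is a scrapy.settings.Settings object; the MONGO_* values are hypothetical):

```python
# Minimal stand-in for Scrapy's Crawler: settings is a plain dict, which
# supports .get() just like the real Settings object does.

class Crawler:
    def __init__(self, settings):
        self.settings = settings

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the two values from the settings and pass them to the constructor
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

crawler = Crawler({'MONGO_URI': 'mongodb://localhost:27017',
                   'MONGO_DATABASE': 'scrapy_db'})
pipeline = MongoPipeline.from_crawler(crawler)
print(pipeline.mongo_uri, pipeline.mongo_db)
```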
start_requests()
This method must return an iterable containing the first Requests for the spider to crawl.
It is implemented in the parent class scrapy.Spider, and the default sends GET requests; if we need to change the initial requests, for example to send POST requests instead, we can override this method.
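Overriding start_requests() to send POST requests can be sketched as follows, again with a stand-in Request class (in real Scrapy you would yield scrapy.FormRequest or scrapy.Request with method="POST"; the login URL and form body are hypothetical):

```python
# Stand-in Request recording the HTTP method, to show how an overridden
# start_requests() switches the initial requests from GET to POST.

class Request:
    def __init__(self, url, method="GET", body=None, callback=None):
        self.url = url
        self.method = method
        self.body = body
        self.callback = callback

class LoginSpider:
    start_urls = ["http://example.com/login"]  # hypothetical URL

    def start_requests(self):
        # override the default GET behavior: POST a login form instead
        for url in self.start_urls:
            yield Request(url, method="POST",
                          body="user=alice&pass=secret",
                          callback=self.parse)

    def parse(self, response):
        pass

reqs = list(LoginSpider().start_requests())
print(reqs[0].method, reqs[0].url)
```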
make_requests_from_url(url)
This too is called by the parent class's start_requests() for each URL, and of course it can also be overridden. (Note that make_requests_from_url() is deprecated in newer Scrapy versions, which build Requests directly in start_requests().)
parse(response)
This is, in effect, the default callback function.
It handles the response and returns both the extracted data and follow-up URLs.
Like every other callback, it must return an iterable containing Requests and/or items.
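A typical parse() callback yields both items and a follow-up Request for the next page. A sketch with stand-in classes and hypothetical URLs:

```python
# Stand-ins showing a parse() callback that mixes items (dicts) with a
# follow-up Request, which Scrapy would then schedule and download.

class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

class Response:
    def __init__(self, url, quotes, next_page=None):
        self.url = url
        self.quotes = quotes       # pretend these were extracted via css()/xpath()
        self.next_page = next_page

def parse(response):
    for q in response.quotes:
        yield {"quote": q}                                 # items go to the pipeline
    if response.next_page:
        yield Request(response.next_page, callback=parse)  # follow-up crawl

out = list(parse(Response("http://example.com/p1",
                          quotes=["a", "b"],
                          next_page="http://example.com/p2")))
print(len(out))  # 2 items + 1 Request
```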