Scrapy: Spider

import scrapy  # spiders subclass scrapy.Spider

A Spider class defines how a certain site (or group of sites) will be crawled, including how to perform the crawling actions and how to extract structured data from the page content. In other words, a spider is where you define the crawling and parsing behaviour for a site.

 

Spider is the simplest spider. Every other spider must inherit from this class (both the spiders that ship with Scrapy and the spiders we write ourselves). Spider provides no special functionality: it only supplies a default implementation of start_requests(), which reads the spider's start_urls attribute and sends a request for each URL, and it calls the spider's parse() method with each resulting response.

 

Workflow:

  1. Crawling starts by initializing a Request for each initial URL, with a callback function attached. When a request finishes downloading, a Response is generated and passed as an argument to that callback. The spider obtains these initial requests from the parent class's start_requests() method, which reads the URLs in start_urls, calls make_requests_from_url() for each one, and sets parse() as the callback.
  2. Inside the callback, the downloaded page content is parsed; you can return Item objects, dicts, Requests, or an iterable containing any of these three. Returned Request objects are handled by Scrapy, which downloads the corresponding content and then calls the callback set on them.
  3. Within the callback you can extract the content you want with lxml, bs4, XPath, CSS selectors, and other methods, and generate items from it.
  4. Finally, the items are handed to the item Pipeline for processing. (A minimal spider illustrating these steps follows this list.)
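
A minimal sketch of this workflow (the site quotes.toscrape.com and the CSS selectors are illustrative, and response.follow() and .get() assume a reasonably recent Scrapy):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                               # unique spider name
    start_urls = ['http://quotes.toscrape.com/']  # step 1: initial URLs

    def parse(self, response):
        # step 2/3: extract structured data with CSS selectors
        for quote in response.css('div.quote'):
            yield {                               # step 4: dicts go to the pipeline
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # step 2: follow-up Requests go back through the engine
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)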

 

Some common Spider attributes and methods:

 

All the crawlers we write ourselves inherit from this class, scrapy.Spider.

name

Defines the crawler's name. This is the name we use when launching it from the command line (scrapy crawl <name>), and it must be unique.

allowed_domains

A list of the domain names that the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domain is not in this list will not be visited.

Therefore, every Request generated in the crawler has its domain checked against this list.
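
A minimal sketch of the effect (the spider name and domains are illustrative):

import scrapy

class ToScrapeSpider(scrapy.Spider):
    name = 'toscrape'
    allowed_domains = ['toscrape.com']            # only these domains are followed
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # with OffsiteMiddleware enabled, this off-list request is filtered
        # out (and logged) instead of being downloaded
        yield scrapy.Request('http://www.example.com/', callback=self.parse)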

start_urls

The list of starting URLs.

The start_requests() method of scrapy.Spider iterates over this list and issues a request for each address.
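
Roughly speaking, the inherited default behaves like this simplified sketch (pre-Scrapy-1.4 style, where make_requests_from_url() still exists; both are methods of scrapy.Spider, shown out of class context):

def start_requests(self):
    # one GET Request per start URL; parse() is the default callback
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return scrapy.Request(url, dont_filter=True)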

custom_settings

Custom settings that override the project-wide configuration; mainly used when a particular crawler has specific requirements.

It is provided as a dictionary: custom_settings = {}

Example:

 

custom_settings = {
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 0,
    'COOKIES_ENABLED': False,  # enabled by default
    'DOWNLOADER_MIDDLEWARES': {
        # proxy middleware
        'mySpider.middlewares.ProxiesMiddleware': 400,
        # Selenium middleware
        'mySpider.middlewares.SeleniumMiddleware': 543,
        # disable Scrapy's default user-agent middleware
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    },
}

 

from_crawler

We will run into this when working with spiders.

This is a class method. By defining it on our class we can obtain values from the settings file via crawler.settings.get(); the same approach can also be used in a pipeline (see the sketch after the example).

Example:

def __init__(self, mongo_uri, mongo_db):
    # constructor: store the two parameters
    self.mongo_uri = mongo_uri
    self.mongo_db = mongo_db

@classmethod
def from_crawler(cls, crawler):
    # read the two values from settings and pass them to the constructor
    return cls(
        mongo_uri=crawler.settings.get('MONGO_URI'),
        mongo_db=crawler.settings.get('MONGO_DATABASE')
    )
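
Put into a pipeline, the same pattern looks like this sketch (the class name, the 'items' collection name, and the use of pymongo are illustrative assumptions):

import pymongo

class MongoPipeline:

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull connection details out of the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # store each scraped item as a document ('items' collection is illustrative)
        self.db['items'].insert_one(dict(item))
        return item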

start_requests()

This method must return an iterable containing the first Requests for the spider to crawl.

A default implementation is inherited from the parent class scrapy.Spider, and it issues GET requests. If we need to change the initial requests, we can override this method, for example to send POST requests instead, as sketched below.
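
A minimal sketch of such an override (the login URL and form fields are illustrative; FormRequest sends a POST by default):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        # replaces the default GET requests with a single POST
        yield scrapy.FormRequest(
            'http://quotes.toscrape.com/login',
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('logged in, landed on %s', response.url)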

make_requests_from_url(url)

This is also called by the parent class's start_requests(), and we can of course override it as well (note that it has been deprecated since Scrapy 1.4). For example:
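
A sketch of an override that attaches a header to every initial request (the header value is illustrative; defined on the spider class, which needs `import scrapy`):

def make_requests_from_url(self, url):
    # mirror the default (dont_filter=True) but add a custom header
    return scrapy.Request(url, dont_filter=True,
                          headers={'User-Agent': 'my-crawler/0.1'})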

parse(response)

This is in fact the default callback function.

It handles the response and returns the extracted data as well as follow-up URLs to crawl.

Like any other Request callback, this method must return an iterable containing Requests and/or Items (or dicts).

 

 
