Scrapy Spider primary methods

The Spider class is one of the core classes in Scrapy; it defines the rules for crawling a site. A Spider works in a crawling cycle, whose steps are as follows (a minimal spider illustrating the cycle is sketched after the list):

  1. start_requests builds Request objects from the URLs in start_urls, and the Response returned by each request is passed as a parameter to parse;
  2. parse is a callback function that analyzes the Response passed to it and extracts Item objects, dicts, or Request objects (or an iterable containing these); any Requests are handed back to Scrapy to continue the next cycle;
  3. inside parse, selectors are used to analyze the Response and extract the required data.
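The sketch below shows this cycle in a minimal spider. The site URL and the CSS expressions (h2::text, a.next) are hypothetical placeholders, not part of the original article:

import scrapy

class CycleSpider(scrapy.Spider):
    name = "cycle_demo"
    # the default start_requests turns these URLs into Requests (step 1)
    start_urls = ["http://example.com/page/1"]

    def parse(self, response):
        # step 2: the Response arrives here; yield dicts (or Items) as scraped data
        for title in response.css("h2::text").extract():
            yield {"title": title}
        # step 3: yielding a Request hands it back to Scrapy for the next cycle
        next_page = response.css("a.next::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)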

Zero, Spider base class

All crawlers must inherit from the Spider class. It provides a default implementation of the start_requests method, which reads and requests the URLs in start_urls and then calls the parse method on the returned results. Its common attributes are as follows (a short example follows the list):

  1. name: the spider's unique name; Scrapy locates and initializes the spider by this name;
  2. allowed_domains: an optional attribute used together with the OffsiteMiddleware middleware, which will not follow links to domains outside this list;
  3. start_urls: when no other URLs are specified, crawling begins with the pages in the start_urls list;
  4. custom_settings: an optional attribute of type dict that overrides the project settings; it must be defined as a class attribute.
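A minimal sketch showing how these attributes are declared; the name, domain, URLs, and setting values are made up for illustration:

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"                              # unique name used by Scrapy to locate the spider
    allowed_domains = ["example.com"]          # OffsiteMiddleware drops requests to other domains
    start_urls = ["http://example.com/list"]   # crawling starts from these pages
    custom_settings = {                        # class-level dict that overrides project settings
        "DOWNLOAD_DELAY": 1,
    }

    def parse(self, response):
        pass  # extraction logic goes here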

One, start_requests

When the project starts, Scrapy calls the start_requests method, which takes the URLs from the start_urls list in turn, generates a Request for each, and sets parse as the callback. This method is called only once, so we can write it as a generator.
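If we need to customize the initial requests, we can override start_requests ourselves. The sketch below assumes two hypothetical page URLs; the method is written as a generator, matching the point above:

import scrapy

class StartRequestsSpider(scrapy.Spider):
    name = "start_requests_demo"

    def start_requests(self):
        # called only once at startup; yields one Request per URL
        urls = ["http://example.com/page/1", "http://example.com/page/2"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url}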

Two, parse

parse is Scrapy's default callback method. It is responsible for handling the Response, returning the scraped data, and yielding Requests for any follow-up URLs.
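A sketch of a parse method that both returns scraped data and follows detail links; the CSS expressions and the parse_detail callback are assumptions about the page, not from the original article:

import scrapy

def parse(self, response):
    # return scraped data as dicts (Item objects would work the same way)
    for row in response.css("div.item"):
        yield {
            "title": row.css("a::text").extract_first(),
            "link": row.css("a::attr(href)").extract_first(),
        }
    # return follow-up Requests; Scrapy schedules them for the next cycle
    for href in response.css("a.detail::attr(href)").extract():
        yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

def parse_detail(self, response):
    yield {"detail": response.css("p.desc::text").extract_first()}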

Three, Selector

Selector is responsible for extracting page content. It is a selection mechanism built on top of lxml that extracts data mainly through XPath and CSS expressions. Commonly used methods are as follows:

  1. xpath: takes an XPath expression and returns a list of the matching nodes;
  2. css: takes a CSS expression and returns a list of the matching nodes;
  3. extract: returns the selected elements as a list of strings;
  4. re: extracts strings through a regular expression.

Tip: selectors may be nested, for example:

image = response.css("#image")  # select the element with id="image"
image_new = image.css("[href*='baidu.com']").extract()  # within it, match nodes whose href contains "baidu.com"
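The sketch below exercises the four methods listed above on a response; the expressions and class names are assumptions about the page structure, used only for illustration:

# xpath: takes an XPath expression and returns the matching nodes, here as strings
titles = response.xpath("//h1/text()").extract()

# css: equivalent selection with a CSS expression
prices = response.css("span.price::text").extract()

# extract() turns the selection into a list of strings (as above);
# re() applies a regular expression to the selected text and returns the matches
numbers = response.css("span.price::text").re(r"\d+\.\d+")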

Four, Summary

The above is a brief explanation of the main methods of Spider; these methods are used frequently in our development.

