Python Scrapy Framework Tutorial (3): scrapy.Spider

The Spider class defines how a particular site (or group of sites) will be crawled, including the crawling actions (for example, whether to follow links) and how structured data (items) is extracted from page content. In other words, a Spider is where you define the crawling behaviour and the parsing logic for a page (or group of pages).

For spiders, the crawling cycle looks roughly like this:

  1. Initialize Requests with the initial URLs and set a callback function. When a request has been downloaded, a Response is generated and passed to the callback as a parameter. The initial requests are obtained by calling start_requests(), which by default reads the URLs in start_urls and generates a Request for each of them with parse as the callback.
  2. In the callback, parse the returned (web page) content and return Item objects, Requests, or an iterable containing both. Any returned Requests are then processed by Scrapy: their content is downloaded and the assigned callback (which may be the same function) is called.
  3. Inside the callback you can use Selectors (or BeautifulSoup, lxml, or any other parser you prefer) to analyze the page content and build items from the extracted data.
  4. Finally, the items returned by the spider are stored in a database (handled by an Item Pipeline) or written to a file using Feed exports.

Although this cycle applies (more or less) to any kind of spider, Scrapy provides several default spider classes for different needs; they will be discussed later.
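As a minimal sketch of the cycle above (the class name, selectors, and next-page link are illustrative assumptions, not taken from this article's example):

import scrapy


class CycleDemoSpider(scrapy.Spider):
    # Hypothetical spider illustrating the request/parse cycle described above.
    name = 'cycle_demo'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Steps 2-3: analyze the response with Selectors and yield items.
        for quote in response.css('.quote'):
            yield {'text': quote.css('.text::text').get()}

        # Step 2: also yield follow-up Requests; Scrapy downloads them and
        # calls the callback (here the same parse method) with the response.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)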


Spider

scrapy.Spider is the simplest spider, and every other spider must inherit from it (including the spiders bundled with Scrapy as well as the spiders you write yourself). It simply requests the given start_urls / start_requests and calls the spider's parse method on each resulting response.

name

A string that defines the spider's name. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. However, nothing prevents you from creating multiple instances of the same spider. The name is the most important spider attribute, and it is required.

If the spider crawls a single domain, a common practice is to name the spider after the domain, with or without the TLD. For example, a spider that crawls mywebsite.com would usually be named mywebsite.

allowed_domains

Optional. A list of the domains the spider is allowed to crawl. When OffsiteMiddleware is enabled, requests for URLs whose domains are not in this list will not be followed.

start_urls

A list of URLs where the spider starts crawling when no particular URLs are specified. The first pages retrieved will therefore be those in this list; subsequent URLs are generated from the data extracted from them.
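Taken together, a minimal spider declaring these three attributes might look like the following sketch (the names and URL are illustrative assumptions):

import scrapy


class MywebsiteSpider(scrapy.Spider):
    # 'name' uniquely identifies the spider to Scrapy (e.g. scrapy crawl mywebsite).
    name = 'mywebsite'
    # With OffsiteMiddleware enabled, only URLs under these domains are followed.
    allowed_domains = ['mywebsite.com']
    # The first pages to download when no explicit URLs are given.
    start_urls = ['http://www.mywebsite.com/']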

start_requests()

This method must return an iterable object containing the first Requests the spider will use to crawl.

This method is called when the spider starts crawling and no particular URLs are specified. (When specific URLs are given, make_requests_from_url() is called to create the Request objects instead.) Scrapy calls this method only once, so it is safe to implement it as a generator.

The default implementation generates a Request for each URL in start_urls.

If you want to change the Requests used to start crawling a site, override this method. For example, if you need to log in with a POST request at startup, you could write:

def start_requests(self):
    return [scrapy.FormRequest("http://www.example.com/login",
                               formdata={'user': 'john', 'pass': 'secret'},
                               callback=self.logged_in)]

def logged_in(self, response):
    # here you would extract links to follow and return Requests for
    # each of them, with another callback
    pass

parse

When a Request does not specify a callback, this is the default method Scrapy uses to process its downloaded response.

parse is responsible for processing the response and returning scraped data and/or follow-up URLs. It is subject to the same requirements as any other Request callback.

This method and other Request callback functions must return an iterable object containing Request and/or Item.

Parameters: response – the response to parse.

closed(reason)

This method is called when the spider is closed.
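A small sketch of such a hook (the log message is illustrative):

def closed(self, reason):
    # Called once when the spider finishes; 'reason' is a string such as
    # 'finished', 'cancelled' or 'shutdown'.
    self.logger.info('Spider closed: %s', reason)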

Start methods

start_urls

start_urls is simply a list of URLs; Scrapy generates the initial Requests from it.

start_requests

Override start_requests() instead of relying on start_urls, and issue the requests yourself with scrapy.Request():

def start_requests(self):
    """Override the start_urls rule."""
    yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', callback=self.parse)

scrapy.Request

scrapy.Request is a request object; a callback is normally specified when it is created (if no callback is given, the response is handled by the default parse method).
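As a brief sketch of both cases (the URLs and the parse_detail callback name are illustrative assumptions):

def parse(self, response):
    # Explicit callback: the response for this Request goes to parse_detail.
    yield scrapy.Request('http://quotes.toscrape.com/page/2/', callback=self.parse_detail)
    # No callback given: the response is handled by the default parse method.
    yield scrapy.Request('http://quotes.toscrape.com/page/3/')

def parse_detail(self, response):
    # Hypothetical callback for the explicitly routed request.
    yield {'url': response.url}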

Data preservation

You can use the -o option of scrapy crawl to save the scraped data in a common format (chosen by the file extension), as shown in the example after the list below.

The supported formats are as follows:

  • json
  • jsonlines
  • jl
  • csv
  • xml
  • marshal
  • pickle
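For example, with the quotes2 spider from the case below, the following commands write the scraped items to a JSON or CSV file (the output file names are arbitrary):

scrapy crawl quotes2 -o quotes.json
scrapy crawl quotes2 -o quotes.csv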

Case: Spider sample

Let's look at an example:

# -*- coding: utf-8 -*-
import scrapy


class Quotes2Spider(scrapy.Spider):
    name = 'quotes2'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').extract_first()
            auth = quote.css('.author::text').extract_first()
            tags = quote.css('.tags a::text').extract()
            yield dict(text=text, auth=auth, tags=tags)

URL joining

>>> import urllib.parse
>>> urllib.parse.urljoin('http://quotes.toscrape.com/', '/page/2/')
'http://quotes.toscrape.com/page/2/'
>>> urllib.parse.urljoin('http://quotes.toscrape.com/page/2/', '/page/3/')
'http://quotes.toscrape.com/page/3/'
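In a spider, this is typically used to turn a relative next-page link into an absolute URL before requesting it. A minimal sketch, assuming the quotes site's li.next a link (response.urljoin is Scrapy's shortcut for the same call against the current response URL):

def parse(self, response):
    # ... yield items as in the example above ...
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        # response.urljoin applies urllib.parse.urljoin against response.url.
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)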


Origin blog.csdn.net/m0_48405781/article/details/114581454