Python Scrapy framework tutorial (1): the first Scrapy crawler

Preface

Video tutorials on Python crawlers, data analysis, website development, and other case studies are free to watch online:

https://space.bilibili.com/523606542

Python learning exchange group: 1039645993

Project requirements

Crawl the famous quotes from http://quotes.toscrape.com, a website built specifically for crawler beginners to practice their scraping techniques.

Create project

Before you start crawling, you must create a new Scrapy project. Go to the directory where you plan to store the code and run the following command:

(base) λ scrapy startproject quotes 
New Scrapy project 'quotes', using template directory 'd:\anaconda3\lib\site-packages\scrapy\templates\project', created in: 
  D:\课程-爬虫课程\02 框架爬虫\备课代码-框架爬虫\quotes 
You can start your first spider with: 
  cd quotes 
  scrapy genspider example example.com

First switch into the newly created project directory, i.e. the /quotes directory, and then run the command that creates a spider:

D:\课程-爬虫课程\02 框架爬虫\备课代码-框架爬虫 (master) 
(base) λ cd quotes\ 
D:\课程-爬虫课程\02 框架爬虫\备课代码-框架爬虫\quotes (master) 
(base) λ scrapy genspider quotes quotes.com 
Cannot create a spider with the same name as your project 

D:\课程-爬虫课程\02 框架爬虫\备课代码-框架爬虫\quotes (master) 
(base) λ scrapy genspider quote quotes.com 
Created spider 'quote' using template 'basic' in module: 
  quotes.spiders.quote

At this point the quotes directory has the following contents:

quotes
│  items.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
├─spiders
│    quote.py
│    __init__.py

robots.txt

The robots protocol, also known as robots.txt (always lowercase), is an ASCII-encoded text file stored in the root directory of a website. It usually tells the web spiders of search engines which content on the site must not be fetched by crawlers and which content may be fetched.

The robots protocol is not a formal specification, only a convention.
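
As an illustration (the rules below are hypothetical, not taken from any particular site), a robots.txt file is just a set of per-agent rules:

# hypothetical robots.txt in the website root
User-agent: *        # applies to every crawler
Disallow: /admin/    # do not crawl anything under /admin/
Allow: /             # everything else may be crawled

Scrapy checks this file by default (the project template sets ROBOTSTXT_OBEY = True); this tutorial turns the check off in settings.py: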

# filename:settings.py 

# Obey robots.txt rules 
ROBOTSTXT_OBEY = False

Analyze the page

Before writing the crawler, you first need to analyze the page to be crawled. All mainstream browsers have tools or plug-ins for analyzing pages; here we use Chrome's developer tools (Tools → Developer tools) to analyze the page.

Data information

Open the page http://quotes.toscrape.com in Chrome, right-click and select "Inspect" to view its HTML code.

You can see that every quote on the page is wrapped in a <div class="quote"> element, which contains the quote text, the author, and the list of tags.


 

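You can also verify this structure with Scrapy's own selectors before writing any spider code. The snippet below is a minimal, self-contained sketch: the sample markup is an illustrative excerpt of how one quote block is laid out, not copied verbatim from the site.

# A minimal sketch; the HTML string is illustrative sample markup.
from scrapy.selector import Selector

sample_html = """
<div class="quote">
  <span class="text">"A sample quote."</span>
  <small class="author">Sample Author</small>
  <div class="tags">
    <a class="tag" href="/tag/sample/">sample</a>
  </div>
</div>
"""

sel = Selector(text=sample_html)
quote = sel.css('.quote')
print(quote.css('.text::text').extract_first())    # the quote text
print(quote.css('.author::text').extract_first())  # the author name
print(quote.css('.tags a::text').extract())        # the list of tags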

 

Write spider

After analyzing the page, the next step is to write the crawler. In Scrapy, a crawler is written as a subclass of scrapy.Spider. A Spider is a class written by the user to crawl data from a single website (or a group of websites).

It defines the initial URLs to download, how to follow the links found in the pages, and how to parse page content to extract items.

To create a Spider, you must subclass scrapy.Spider and define the following three attributes:

  • name: Used to distinguish Spiders. The name must be unique; you cannot give the same name to different Spiders.
  • start_urls: The list of URLs the Spider crawls when it starts, so the first pages to be downloaded come from this list. Subsequent URLs are extracted from the data returned by these initial URLs.
  • parse(): A method of the Spider. When called, the Response object generated for each downloaded initial URL is passed to this method as its only argument. The method is responsible for parsing the returned data (the response body), extracting the data (generating items), and generating Request objects for the URLs that need further processing.
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

The following is a brief description of how the quote spider is implemented.

Focus:

  • name is the name of the crawler; it is the name specified when running genspider.
  • allowed_domains lists the domains the crawler is allowed to crawl; only pages under these domains will be crawled. It is optional and can be omitted.
  • start_urls contains the URLs Scrapy starts crawling from; it is an iterable. If there are several pages, multiple URLs can be written in the list, and a list comprehension is often used to build it.
  • parse is called a callback function; the response passed to this method is the response obtained after requesting the URLs in start_urls. Of course, you can also designate other functions to receive responses. A page parsing function usually has two tasks to complete:
    extract the data from the page (with re, XPath, or CSS selectors), and extract the links in the page and generate download requests for the linked pages.

The page parsing function is usually implemented as a generator function: every piece of data extracted from the page and every download request for a linked page is submitted to the Scrapy engine with a yield statement.
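
As a sketch of that pattern (the li.next a selector for the next-page link is an assumption about the site's pagination markup, not something shown above), a parse function that yields both items and a follow-up request could look like this:

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item (here a plain dict) per quote block on the current page.
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
                'tags': quote.css('.tags a::text').extract(),
            }
        # Yield a request for the next page, if there is one; Scrapy will call
        # parse() again with that page's response.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

The parse written in the next section only does the first task (data extraction); following links is included here purely for illustration.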

Extract the data

import scrapy 

... 

def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags a::text').extract()
        yield dict(text=text, author=author, tags=tags)

Focus:

  • response.css() lets you use CSS selector syntax directly to extract data from the response.
  • Multiple URLs can be written in start_urls; just separate them as items of the list.
  • extract() extracts the data from a selector object and returns a list; without it you still have a selector object. extract_first() extracts only the first matching result.
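
To make the difference concrete, here is a minimal, self-contained sketch (the HTML string is a hypothetical example, not taken from the site):

from scrapy.selector import Selector

# Hypothetical markup with two matching elements.
sel = Selector(text='<p class="text">one</p><p class="text">two</p>')

print(sel.css('.text::text').extract())          # ['one', 'two'] -- always a list
print(sel.css('.text::text').extract_first())    # 'one'          -- first match only
print(sel.css('.missing::text').extract_first()) # None when nothing matches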

Run crawler

Run scrapy crawl quote in the /quotes directory to run the crawler project.

What happened after running the crawler?

Scrapy creates a scrapy.Request object for each URL in the Spider's start_urls attribute and assigns the parse method to the Request as its callback function.

After a Request object is scheduled and executed, a scrapy.http.Response object is generated and sent back to the spider's parse() method for processing.
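
Roughly speaking (this is a simplified sketch, not Scrapy's actual source code), the default behaviour is equivalent to the Spider defining:

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/']

    def start_requests(self):
        # Simplified sketch of the default behaviour: one Request per start URL,
        # with parse() registered as the callback.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Each downloaded Response for those Requests ends up here.
        pass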

After completing the code, run the crawler to collect the data. Execute the scrapy crawl <SPIDER_NAME> command in the shell to run the spider 'quote' and store the crawled data in a CSV file:

(base) λ scrapy crawl quote -o quotes.csv 
2020-01-08 20:48:44 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: quotes) 
....

After the crawler finishes running, a quotes.csv file will be generated in the current directory, with the data stored in CSV format.

-o supports saving in multiple formats, and saving is very simple: just give the output file the corresponding extension (csv, json, pickle, etc.).
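
For example (these commands are assumed for illustration and were not part of the original run), the same spider can also export JSON or pickle output:

(base) λ scrapy crawl quote -o quotes.json
(base) λ scrapy crawl quote -o quotes.pickle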
