07_Scrapy application - getting movie data (a template for saving a static-page Scrapy crawler's data to Excel or a database)

0. Foreword:

  • Generally, for a Python project we create ourselves, we create a virtual environment into which many packages, also called dependencies, are downloaded. When we share the project with others, we cannot simply package up the virtual environment and send it, because everyone's computer system is different. Instead, we can export the dependencies as a dependency list; anyone who has our dependency list can then run a single command to download our dependencies into their own project environment, so the Python project can be run and deployed quickly.
  • Command to generate a dependency list in the terminal: pip freeze > requirements.txt (a sample of the resulting file is shown after this list)
  • With someone else's dependency list, the command to install their dependencies is: pip install -r requirements.txt (Note: when executing this command, their dependency list must be placed in your project path.)
  • The code and dependency list of this project will be packaged and uploaded together
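For concreteness, requirements.txt is just a plain-text list of pinned packages, one per line. A minimal illustrative sample for a project like this one (the version numbers are placeholders, not the ones the author actually exported):

Scrapy==2.8.0
openpyxl==3.1.2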

1. Project overview:



2. Create a project:

  • Prerequisite: Scrapy has already been installed in the project environment
  • Execute the creation command in the terminal: scrapy startproject get_news (Note: get_news is the project name)
  • Execute the command to switch to the project directory in the terminal: cd get_news
  • Execute the command to create a crawler Python file in the project at the terminal: scrapy genspider <crawler name> <URL of the page to crawl> (a concrete example follows this list)
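For this project, given the spider name and domain that appear in the code below, the command would presumably have been:

scrapy genspider spider_news movie.douban.com

This generates spiders/spider_news.py pre-filled with the name, allowed_domains, and start_urls attributes seen in the next section.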

3. Code:

  • The code of the crawler Python file:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector, Request

from ..items import GetNewsItem


class SpiderNewsSpider(scrapy.Spider):
    name = "spider_news"
    allowed_domains = ["movie.douban.com"]
    # start_urls = ['http://movie.douban.com/top250']  # replaced by the start_requests method below
    
    # start_requests yields the URLs of all the pages to crawl
    def start_requests(self):
        for i in range(10):
            # scrapy's Request sends the request for the URL you want to crawl;
            # note that it is not the third-party requests library
            yield Request(url=f'https://movie.douban.com/top250?start={i * 25}&filter=')

    def parse(self, response):
        response_s = Selector(response)
        # First get the list of movie entries on the page
        # (selector reconstructed from the truncated original; it targets the
        # standard Top 250 list items)
        li_list = response_s.css('#content > div > div.article > ol > li')
        for li in li_list:
            item = GetNewsItem()
            # NOTE: the field names below are assumptions -- match them to items.py
            item['title'] = li.css('span.title::text').extract_first()
            item['score'] = li.css('span.rating_num::text').extract_first()
            item['quote'] = li.css('span.inq::text').extract_first()
            yield item
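The original post is truncated at this point, and items.py is not shown. A minimal sketch of what GetNewsItem might look like, assuming the three fields used in parse above:

# items.py -- a minimal sketch; the field names are assumptions, not the author's original code
import scrapy


class GetNewsItem(scrapy.Item):
    title = scrapy.Field()  # movie title
    score = scrapy.Field()  # Douban rating
    quote = scrapy.Field()  # the one-line blurb under each entry

The title of this post promises saving the scraped data to Excel or a database, but that pipeline code is also missing from the excerpt. A hedged sketch of an Excel pipeline using openpyxl (the class name, output file name, and column layout are illustrative):

# pipelines.py -- an illustrative Excel pipeline, not the author's original code
import openpyxl


class ExcelPipeline:
    def open_spider(self, spider):
        # Create a workbook and write a header row when the spider starts
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        self.ws.append(['title', 'score', 'quote'])

    def process_item(self, item, spider):
        # Append one row per scraped movie
        self.ws.append([item.get('title'), item.get('score'), item.get('quote')])
        return item

    def close_spider(self, spider):
        # Persist the workbook when the spider finishes
        self.wb.save('movies.xlsx')

For the pipeline to run, it must be registered in settings.py, e.g. ITEM_PIPELINES = {'get_news.pipelines.ExcelPipeline': 300}, and the spider is then started with scrapy crawl spider_news from the project directory. (Note: movie.douban.com typically rejects Scrapy's default User-Agent, so the original project presumably also set USER_AGENT in settings.py.)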

Origin: blog.csdn.net/sz1125218970/article/details/131176397