Using Scrapy to crawl Douban movies

1. Concept

Scrapy is an application framework written for crawling website data and extracting structured data. It can be used in a range of applications, including data mining, information processing, and storing historical data.

You can easily install Scrapy through pip, the Python package management tool. If the installation reports an error about missing dependency packages, install the missing packages with pip as well:

pip install scrapy

Scrapy is composed of the following parts:

Scrapy Engine: the engine relays the signals and data transferred between the other components.

Scheduler: a queue that stores Requests. The engine sends request links to the Scheduler, which queues them and returns the request at the head of the queue to the engine when it is needed.

Downloader: after the engine hands a Request link to the Downloader, it downloads the corresponding data from the Internet and returns the resulting Responses to the engine.

Spiders: the engine passes the downloaded Responses to the Spiders for parsing, so the web page information we need can be extracted. If new required URL links are found while parsing, the Spiders hand them to the engine, which stores them in the Scheduler.

Item Pipeline: the spider passes the data extracted from the page to the pipeline through the engine for further processing, filtering, storage, and other operations.

Downloader Middlewares: custom extension components used to wrap page requests with proxies, HTTP request headers, and other settings.

Spider Middlewares: used to modify the Responses entering the Spiders and the Requests going out of them.

The workflow of Scrapy: first we give the entry URL to the Spider. The Spider sends the URL to the Scheduler through the engine; once the URL has been queued, the Scheduler returns the first request, which the engine forwards to the Downloader for downloading. The downloaded data is handed to the Spider for parsing. Part of the parsed result is the data we need, which is passed to the Item Pipeline for cleaning and storage; any new URL links are handed back to the Scheduler, and the cycle repeats until there are no more pages to crawl.

2. New Scrapy project

First open a command line in the folder where the project will be stored and run scrapy startproject <project name>; the Python files required for the project are created automatically in the current folder. As an example we create a project named douban that crawls Douban movies.
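The corresponding command (run in the folder that should hold the project) is:

scrapy startproject douban

Its directory structure is as follows: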

Db_Project/
    scrapy.cfg                -- the project's configuration file
    douban/                   -- the project's Python module directory, where the Python code is written
        __init__.py           -- the package's initialization file
        items.py              -- defines the Item data structures
        pipelines.py          -- the project's pipelines file
        settings.py           -- defines global project settings such as download delay and concurrency
        spiders/              -- package directory that holds the spider code
            __init__.py
            ...

Then run scrapy genspider <spider name> <domain name>, and a spider file movie.py will be generated under the spiders directory; this file is where the spider's crawling logic, extraction rules, and so on are defined:

scrapy genspider movie movie.douban.com

3. Define the data

The URL of the Douban movie list to be crawled is https://movie.douban.com/top250.

We want to crawl key information for each movie, such as the ranking, name, introduction, star rating, number of comments, and description, so we first define these fields in the items.py file, similar to an ORM model: each field is declared with scrapy.Field().

import scrapy


class DoubanItem(scrapy.Item):
    ranking = scrapy.Field()    # ranking
    name = scrapy.Field()       # movie name
    introduce = scrapy.Field()  # introduction
    star = scrapy.Field()       # star rating
    comments = scrapy.Field()   # number of comments
    describe = scrapy.Field()   # description
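Items behave like dictionaries; as a small sketch (standard Scrapy Item behaviour, the values below are only placeholders), the spider later fills one in like this:

item = DoubanItem()
item['name'] = 'Some Movie'    # fields are read and written with dict-style keys
# item['year'] = 2020          # assigning to a field not declared above raises a KeyError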

4. Data crawling

Open the spider file movie.py created under the spiders folder; as shown below, it is generated with three variables and one method. The parse method processes the response data that is returned, and we need to provide the crawler's entry address in start_urls. Note that the spider automatically filters out domains other than those in allowed_domains, so pay attention to how this variable is assigned.

# spiders/movie.py
import scrapy


class MovieSpider(scrapy.Spider):
    # spider name
    name = 'movie'
    # domains allowed to be crawled
    allowed_domains = ['movie.douban.com']
    # entry url
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        pass

Before crawling data, you first need to disguise the requests as coming from a normal browser: find the USER_AGENT variable in the settings.py file and modify it as follows:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
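As noted in the directory overview, settings.py also holds other global options such as the download delay and concurrency; a hedged sketch with illustrative values (they are not required for this project):

# settings.py (illustrative values)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
DOWNLOAD_DELAY = 1          # pause one second between requests
CONCURRENT_REQUESTS = 16    # maximum number of concurrent requests (Scrapy's default)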

You can start the spider named movie from the command line with scrapy crawl movie, or you can write a startup file run.py as follows and simply run it:

from scrapy import cmdline
cmdline.execute('scrapy crawl movie'.split())

Next, we need to extract the data we want from the crawled pages. XPath rules make it easy to select specified elements in a web page: each movie entry is wrapped in a <li> tag under <ol class="grid_view">, so all movie entries on the page can be selected with the XPath expression //ol[@class='grid_view']/li. You can obtain XPath values with the XPath plug-in for Google Chrome or ChroPath for Firefox: right-click an element in the browser and inspect it, and the developer tools open with the ChroPath panel on the right, which gives a value such as //div[@id='wrapper']//li.
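These selectors can be tried out interactively in the Scrapy shell before writing any spider code (a sketch; the USER_AGENT override is passed on the command line because Douban tends to reject the default Scrapy user agent):

scrapy shell -s USER_AGENT='Mozilla/5.0' 'https://movie.douban.com/top250'

>>> movie_list = response.xpath("//ol[@class='grid_view']/li")
>>> len(movie_list)                                               # 25 entries on the first page
>>> movie_list[0].xpath(".//span[@class='title']/text()").extract_first()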
 

The xpath() method of the response object processes an XPath rule string directly and returns the matching page content as Selector objects, which can themselves be refined with further selections. Through XPath we select the movie name, introduction, number of comments, star rating, and so on, i.e. the fields of the DoubanItem data structure defined in items.py. We loop over each movie entry, extract the exact movie information from it, save it into a DoubanItem object item, and finally yield the item from the Spider to the Item Pipeline.

In addition to extracting Item data from the page, the spider also extracts the URL link used to build the Request for the next page. At the bottom of the Douban page there is a "next page" link; the link for the second page carries the parameters ?start=25&filter=, which, concatenated with the site address https://movie.douban.com/top250, gives the address of the next page. We extract this value with XPath as above, and if it is not empty, the Request built from the concatenated URL is yielded and submitted to the scheduler.

The final crawler movie.py file is as follows

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class MovieSpider(scrapy.Spider):
    # spider name
    name = 'movie'
    # domains allowed to be crawled
    allowed_domains = ['movie.douban.com']
    # entry url
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # first grab the list of movies
        movie_list = response.xpath("//ol[@class='grid_view']/li")
        for selector in movie_list:
            # walk through each movie entry, extract the required information and save it as an item object
            item = DoubanItem()
            item['ranking'] = selector.xpath(".//div[@class='pic']/em/text()").extract_first()
            item['name'] = selector.xpath(".//span[@class='title']/text()").extract_first()
            text = selector.xpath(".//div[@class='bd']/p[1]/text()").extract()
            intro = ""
            for s in text:  # join the introduction into a single string
                intro += "".join(s.split())  # strip whitespace
            item['introduce'] = intro
            item['star'] = selector.css('.rating_num::text').extract_first()
            item['comments'] = selector.xpath(".//div[@class='star']/span[4]/text()").extract_first()
            item['describe'] = selector.xpath(".//span[@class='inq']/text()").extract_first()
            # print(item)
            yield item  # hand the item object over to the Item Pipeline
        # extract the url of the next page
        next_link = response.xpath("//span[@class='next']/a[1]/@href").extract_first()
        if next_link:
            next_link = "https://movie.douban.com/top250" + next_link
            print(next_link)
            # submit the Request to the scheduler
            yield scrapy.Request(next_link, callback=self.parse)

XPath selectors

/ means to search one level below the current position; // means to search at any depth below the current position.

By default the search starts from the root. A leading . means to search from the current node, @ is followed by a tag attribute, and the text() function extracts the text content.

//div[@id='wrapper']//li means: starting from the root, find the div tag whose id is wrapper and take all li tags anywhere under it.

.//div[@class='pic']/em[1]/text() means: starting from the current selector, find the first em tag under each div whose class is pic and extract its text content.

string(//div[@id='endText']/p[position()>1]) extracts the text content of the p tags after the first one under the div whose id is endText.

/bookstore/book[last()-2] selects the third-to-last book element among the children of bookstore.
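As a quick check, these rules can be exercised with Scrapy's Selector class on a small hand-written HTML snippet (the HTML below is made up purely for illustration):

from scrapy.selector import Selector

html = """
<div id="wrapper">
  <ol class="grid_view">
    <li><div class="pic"><em>1</em></div><span class="title">Movie A</span></li>
    <li><div class="pic"><em>2</em></div><span class="title">Movie B</span></li>
  </ol>
</div>
"""
sel = Selector(text=html)
print(sel.xpath("//div[@id='wrapper']//li").extract())                   # both li elements
first = sel.xpath("//ol[@class='grid_view']/li")[0]
print(first.xpath(".//div[@class='pic']/em[1]/text()").extract_first())  # '1'
print(sel.xpath("//li/span[@class='title']/text()").extract())           # ['Movie A', 'Movie B']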

CSS selectors

You can also use CSS selectors to pick out elements in the page; CSS pseudo-classes express which element and which part of it to select, as follows:

# select the text in the p tags under the div whose class is left
response.css('div.left p::text').extract_first()

# select the text in the element whose class is star under the element whose id is tag
response.css('#tag .star::text').extract_first()

5. Data storage

When running the spider, you can specify where the output file is saved through the -o parameter; depending on the file extension, the data is saved as a JSON or CSV file, for example:

scrapy crawl movie -o data.csv

You can also process the extracted Item data further in the pipelines.py file, for example to save it into a database from Python.
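As a minimal sketch (assuming SQLite is used; the database file and table name below are made up for illustration), a pipeline that stores each DoubanItem could look like this:

# pipelines.py -- illustrative sketch
import sqlite3


class DoubanPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts: open the database and create the table
        self.conn = sqlite3.connect('douban.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS movies '
                          '(ranking TEXT, name TEXT, introduce TEXT, star TEXT, comments TEXT, describe TEXT)')

    def process_item(self, item, spider):
        # called for every item yielded by the spider: insert one row
        self.conn.execute('INSERT INTO movies VALUES (?, ?, ?, ?, ?, ?)',
                          (item['ranking'], item['name'], item['introduce'],
                           item['star'], item['comments'], item['describe']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: close the connection
        self.conn.close()

The pipeline then has to be enabled in settings.py, e.g. ITEM_PIPELINES = {'douban.pipelines.DoubanPipeline': 300}.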

6. Middleware settings

Sometimes, to cope with a website's anti-crawler mechanisms, the downloader middleware needs some disguise settings, such as using IP proxies and rotating the user agent. Create a user_agent class in the middlewares.py file that adds a user agent to the request headers: collect some commonly used user agents from the Internet into the USER_AGENT_LIST list, then use the random module to pick one at random and set it as the User-Agent field of the request headers.

import random


class user_agent(object):
    def process_request(self, request, spider):
        # list of user agents
        USER_AGENT_LIST = [
            'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
            'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
            'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
            'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
            'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
            'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
            'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
            'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
        ]
        agent = random.choice(USER_AGENT_LIST)  # randomly pick one user agent from the list above
        request.headers['User-Agent'] = agent   # set the user agent of the request headers

Finally, enable the downloader middleware in the settings.py file by uncommenting the DOWNLOADER_MIDDLEWARES lines, registering the user_agent class and setting its priority; the lower the number, the higher the priority.
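A sketch of what that registration could look like, assuming the project module is named douban:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.user_agent': 543,   # lower values run closer to the engine
}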

 
