A Scrapy framework crawler case study

Operating environment

 1. win10-64bit
 2. python 3.6(E:\ProgramData\Anaconda3\python.exe)

The part to be crawled is the movie list on the Douban Top 250 page.
[Screenshot: the Douban Top 250 movie list]

Looking at the page source, the HTML that needs to be parsed looks like this:

<li>
  <div class="item">
    <div class="pic">
      <em class="">1</em>
        <a href="https://movie.douban.com/subject/1292052/">
        <img alt="肖申克的救赎" src="https://img3.doubanio.com/view/movie_poster_cover/ipst/public/p480747492.webp" class="">
        </a>
    </div>
    <div class="info">
      <div class="hd">
        <a href="https://movie.douban.com/subject/1292052/" class="">
          <span class="title">肖申克的救赎</span>
          <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
          <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
        </a>
        <span class="playable">[可播放]</span>
      </div>
      <div class="bd">
        <p class="">
                          导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                        </p>
        <div class="star">
          <span class="rating5-t"></span>
          <span class="rating_num" property="v:average">9.6</span>
          <span property="v:best" content="10.0"></span>
          <span>854341人评价</span>
         </div>
         <p class="quote">
                                <span class="inq">希望让人自由。</span>
         </p>
       </div>
     </div>
   </div>
 </li>

Create project

First create the project. In cmd, enter the command:

scrapy startproject doubanmovie

The project is created successfully; the directory structure is shown below.
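
A freshly generated Scrapy project typically has this layout (names as generated by scrapy startproject; the exact layout may differ slightly across Scrapy versions):

doubanmovie/
    scrapy.cfg                # deploy configuration
    doubanmovie/
        __init__.py
        items.py              # item containers
        middlewares.py        # spider/downloader middlewares
        pipelines.py          # data-processing pipelines
        settings.py           # project settings
        spiders/              # spider code goes here
            __init__.py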

 1. Write your own spider in the spiders folder
 2. Define containers in items to hold the crawled data
 3. Process the data in pipelines
 4. Configure the project in settings

Spider definition

Create a file MySpider.py in the spiders folder.
In MySpider.py, create a class DoubanMovie that inherits from scrapy.Spider, and define the following attributes and methods:

  1. name : the unique identifier of the crawler
  2. start_urls : list of initial crawl urls
  3. parse() : the Response object generated for each initial URL access is passed to this method as its only argument. The method parses the Response, extracts the data to build items, and generates Request objects for further URLs to process.

The parse() method uses the Selector from the scrapy framework to parse the Response object. A first version of the code is as follows:

import scrapy

class DoubanMovie(scrapy.Spider):
    # unique identifier of the spider
    name = 'doubanMovie'
    # domain the spider is restricted to
    allowed_domains = ['movie.douban.com']
    # initial page to crawl
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        print(response.body)
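
The spider can be run from the project root with the following command (using the name defined above):

scrapy crawl doubanMovie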

Running it, however, produces a 403 error:


This means the crawler has been blocked, so a request header should be added to make the request look like it comes from a browser.
Add the following line of code to the settings file

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'

Then it can run successfully.
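
As a side note, the user agent can also be set for just one spider through the custom_settings class attribute instead of the project-wide settings file; a minimal sketch:

class DoubanMovie(scrapy.Spider):
    name = 'doubanMovie'
    # per-spider settings that override settings.py
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0',
    }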

Items definition

Create a file MovieItems.py in the doubanmovie folder and write a container in it to store the crawled data.
Create a class MovieItem that inherits from scrapy.Item and define its attributes. Each declaration looks like the following:

name = scrapy.Field()

The written code is as follows

import scrapy


class MovieItem(scrapy.Item):
    # movie name
    name = scrapy.Field()
    # movie information
    info = scrapy.Field()
    # rating
    rating = scrapy.Field()
    # number of ratings
    num = scrapy.Field()
    # one-line quote
    quote = scrapy.Field()
    # movie poster URL
    img_url = scrapy.Field()
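
Items expose a dict-like interface, which is what the spider code below relies on; a quick illustration:

item = MovieItem()
item['name'] = 'The Shawshank Redemption'
print(item['name'])   # fields are read and written like dictionary keys
print(dict(item))     # and an item converts cleanly to a plain dict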

Data parsing

So far the MySpider.py file only obtains the Response object. To extract the various pieces of information, the Response object has to be parsed. Here I choose the Selector from the scrapy framework.

First initialize the selector

selector = scrapy.Selector(response)

Based on the page source, parse out each movie block:

movies = selector.xpath('//div[@class="item"]')
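
Note that recent Scrapy versions expose the selector directly on the response, so the explicit construction above can be skipped; this shortcut should be equivalent:

movies = response.xpath('//div[@class="item"]')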

Then declare an item to store the movie information:

item = MovieItem()

Next, each movie block is parsed and the required information is extracted into the item. Note that the movie name appears in several languages, and the movie info spans more than one text node, as shown below:

<span class="title">肖申克的救赎</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
<p class="">
                          导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
</p>

Extracting only the first text node, as in the statements below, would lose information:

titles = movie.xpath('.//span[@class="title"]/text()').extract()[0].strip()
infos = movie.xpath('.//div[@class="bd"]/p/text()').extract()[0].strip()

Here, all the information is collected by traversing the list:

# list of the movie's names in the different languages
titles = movie.xpath('.//span[@class="title"]/text()').extract()

name = ''
for title in titles:
    name += title.strip()
item['name'] = name
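
The same concatenation can be written more compactly with str.join:

item['name'] = ''.join(title.strip() for title in titles)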

At this point parse() extracts the information for every movie, but only on the first page. Looking at the page source reveals the following:

<span class="next">
  <link rel="next" href="?start=25&amp;filter="/>
  <a href="?start=25&amp;filter=" >后页&gt;</a>
</span>

From this, the URL of the next page can be extracted and yielded as a new request so that crawling continues on the following page. Since the next-page link is empty on the last page, a check is added here:

next_page = selector.xpath('//span[@class="next"]/a/@href').extract()
if next_page:
    url = 'https://movie.douban.com/top250' + next_page[0]
    yield scrapy.Request(url, callback=self.parse)

The complete parse() method is as follows:

    def parse(self, response):
        selector = scrapy.Selector(response)
        # parse out each movie block
        movies = selector.xpath('//div[@class="item"]')
        # container for the movie information
        item = MovieItem()

        for movie in movies:

            # list of the movie's names in the different languages
            titles = movie.xpath('.//span[@class="title"]/text()').extract()
            # join the Chinese and English names into one string
            name = ''
            for title in titles:
                name += title.strip()
            item['name'] = name

            # list of movie info strings
            infos = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            # join the movie info into one string
            fullInfo = ''
            for info in infos:
                fullInfo += info.strip()
            item['info'] = fullInfo
            # extract the rating
            item['rating'] = movie.xpath('.//span[@class="rating_num"]/text()').extract()[0].strip()
            # extract the number of ratings (dropping the trailing "人评价")
            item['num'] = movie.xpath('.//div[@class="star"]/span[last()]/text()').extract()[0].strip()[:-3]
            # extract the one-line quote, which may be missing
            quote = movie.xpath('.//span[@class="inq"]/text()').extract()
            if quote:
                quote = quote[0].strip()
            item['quote'] = quote
            # extract the movie poster URL
            item['img_url'] = movie.xpath('.//img/@src').extract()[0]

            yield item

        # follow the next-page link if there is one
        next_page = selector.xpath('//span[@class="next"]/a/@href').extract()
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0]
            yield scrapy.Request(url, callback=self.parse)
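
As an aside, Scrapy responses provide a urljoin() helper that resolves the relative link against the current page, which avoids hard-coding the base URL:

next_page = selector.xpath('//span[@class="next"]/a/@href').extract()
if next_page:
    # urljoin builds the absolute URL from the relative href
    yield scrapy.Request(response.urljoin(next_page[0]), callback=self.parse)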

Data storage

For now the data is stored in a JSON file. Create a file MoviePipelines.py in the doubanmovie folder, write a class MoviePipeline, and override the method process_item(self, item, spider) to process the data.

import json

class MoviePipeline(object):
    def __init__(self):
        # open the output file
        self.file = open('data.json', 'w', encoding='utf-8')

    # this method processes each item
    def process_item(self, item, spider):
        # serialize the item's data as one JSON line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # write it to the file
        self.file.write(line)
        # return the item for any later pipelines
        return item

    # called when the spider is opened
    def open_spider(self, spider):
        pass

    # called when the spider is closed
    def close_spider(self, spider):
        self.file.close()
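
With this pipeline, each movie ends up as one JSON object per line in data.json. Based on the HTML fragment at the top of this article, the first record should look roughly like this (info abbreviated):

{"name": "肖申克的救赎/The Shawshank Redemption", "info": "导演: 弗兰克·德拉邦特 Frank Darabont...", "rating": "9.6", "num": "854341", "quote": "希望让人自由。", "img_url": "https://img3.doubanio.com/view/movie_poster_cover/ipst/public/p480747492.webp"}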

The pipeline must also be registered in the settings file:

ITEM_PIPELINES = {
    'doubanmovie.MoviePipelines.MoviePipeline': 1,
}

The number 1 is the priority; the lower the number, the earlier the pipeline runs.

After storing the movie information, the next step is to store the movie posters.
Storing images uses the ImagesPipeline provided by the scrapy framework.
Create a new file ImgPipelines.py in doubanmovie, write a class ImgPipeline that inherits from ImagesPipeline, and override the following methods:

 1. get_media_requests(self, item, info)
 2. item_completed(self, results, item, info)

The first method takes the URL from the item and downloads the image, returning a Request object. When the download finishes, the result is passed to the second method as a tuple (success, image_info_or_failure), where success is a bool indicating whether the download succeeded, and image_info_or_failure contains the url, path, and checksum. The path is the image's path relative to IMAGES_STORE, including the file name.
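
Concretely, for one successfully downloaded image, the results argument passed to item_completed should look roughly like this (values illustrative):

results = [
    (True, {
        'url': 'https://img3.doubanio.com/view/movie_poster_cover/ipst/public/p480747492.webp',
        'path': 'full/0a79c1....webp',   # relative to IMAGES_STORE, including the file name
        'checksum': 'b9628...',          # hash of the downloaded image
    }),
]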

The full file looks like this:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # request the download of the poster image
        yield scrapy.Request(item['img_url'])

    def item_completed(self, results, item, info):
        # collect the storage paths of the successfully downloaded images
        image_paths = [x['path'] for ok, x in results if ok]

        if not image_paths:
            raise DropItem("Item contains no images")

        item['img_url'] = image_paths
        return item

Also register it in the settings file and set the image download directory:

ITEM_PIPELINES = {
    'doubanmovie.MoviePipelines.MoviePipeline': 1,
    'doubanmovie.ImgPipelines.ImgPipeline': 100,
}
IMAGES_STORE = 'E:\\img\\'

However, another problem appears during crawling: Forbidden by robots.txt.
Set ROBOTSTXT_OBEY to False in the settings file so that scrapy does not obey the robots protocol; the images then download normally.
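
That is, in the settings file:

ROBOTSTXT_OBEY = False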

This completes my first small Scrapy project. Personally, I feel the difficulty lies in extracting the different kinds of information and working around the various anti-crawling measures of the target website.
So far this is only a small exploration of the scrapy framework; its more powerful features remain to be discovered and studied.
