Crawling the Douban Top 250 with Scrapy and storing the data in MongoDB

Crawler step one: create the project

  • In a suitable directory, run the command: scrapy startproject xxxx (my project is named douban)

Crawler step two: define the targets

  • Douban Top 250 URL: https://movie.douban.com/top250?start=0
    Analysis shows that the number after start= increases in steps of 25, up to 225, so this pattern can be used to send the requests (a short sketch of the URL pattern follows the items.py code below)
  • This article extracts only three fields: the movie name, the rating, and the short introduction; of course you can take more fields if you want
    • item["name"]: movie name
    • item["rating_num"]: rating
    • item["inq"]: introduction
  • Use XPath to extract the data
# Movie name; extract() converts the selector result into unicode strings
item["name"] = each.xpath('.//span[@class="title"][1]/text()').extract()[0]
# Rating
item["rating_num"] = each.xpath('.//span[@class="rating_num"]/text()').extract()[0]
# Introduction
item["inq"] = each.xpath('.//span[@class="inq"]/text()').extract()[0]
  • Write the items.py file
import scrapy
class DoubanItem(scrapy.Item):
    # Movie name
    name = scrapy.Field()
    # Rating
    rating_num = scrapy.Field()
    # Introduction
    inq = scrapy.Field()
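  • Since start only ever takes the values 0, 25, …, 225, the ten page URLs described above can also be built up front instead of incrementing an offset. A minimal sketch (the names base_url and page_urls are just illustrative):
# Sketch: generate all ten Top 250 page URLs in one go.
# range(0, 250, 25) yields exactly 0, 25, ..., 225.
base_url = "https://movie.douban.com/top250?start="
page_urls = [base_url + str(n) for n in range(0, 250, 25)]
# page_urls[0]  -> https://movie.douban.com/top250?start=0
# page_urls[-1] -> https://movie.douban.com/top250?start=225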

Crawler step three: write the spider file

  • A basic Spider class is used here; run the command: scrapy genspider doubanMovie "movie.douban.com" (the spider name must not be the same as the project name)
import scrapy
# Import the DoubanItem class from items.py
from douban.items import DoubanItem
class DoubanmovieSpider(scrapy.Spider):
    # Spider name
    name = 'doubanMovie'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # Base URL; the trailing number changes, so the full URL is built dynamically
    url = "https://movie.douban.com/top250?start="
    offset = 0
    start_urls = [url + str(offset)]
    # Page parsing function
    def parse(self, response):
        # Use XPath to locate the root node of each movie entry
        datas = response.xpath('//div[@class="item"]//div[@class="info"]')
        for each in datas:
            # Instantiate an item object
            item = DoubanItem()
            # Movie name; extract() converts the selector result into unicode strings
            item["name"] = each.xpath('.//span[@class="title"][1]/text()').extract()[0]
            # Rating
            item["rating_num"] = each.xpath('.//span[@class="rating_num"]/text()').extract()[0]
            # Introduction
            item["inq"] = each.xpath('.//span[@class="inq"]/text()').extract()[0]
            yield item
        # Keep sending requests while the number after start is less than 225
        if self.offset < 225:
            self.offset += 25
            # The callback is still this same method
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
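  • One caveat about parse() above: .extract()[0] raises an IndexError if a field is missing, and some entries may have no inq quote at all. A more defensive sketch of the loop body, using extract_first() with a default value (an assumption on my part; the original code indexes directly):
# Sketch: defensive extraction inside parse(); extract_first() returns the
# first match or the given default instead of raising IndexError.
for each in datas:
    item = DoubanItem()
    item["name"] = each.xpath('.//span[@class="title"][1]/text()').extract_first(default="")
    item["rating_num"] = each.xpath('.//span[@class="rating_num"]/text()').extract_first(default="")
    # Some movies may lack the one-line quote, so fall back to an empty string
    item["inq"] = each.xpath('.//span[@class="inq"]/text()').extract_first(default="")
    yield item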

Crawler step four: store the data by writing the pipeline file pipelines.py

  • To store the data in MongoDB, configure four values in the settings file: host, port, database name, and collection name (you could also hard-code them in the pipelines file)
# MongoDB is accessed from Python through the pymongo module, so import it
import pymongo
# Import the relevant values from settings.py
from scrapy.utils.project import get_project_settings
class DoubanPipeline(object):
    def __init__(self):
        settings = get_project_settings()
        # Host IP
        host = settings["MONGODB_HOST"]
        # Port
        port = settings["MONGODB_PORT"]
        # Database name
        dbname = settings['MONGODB_DBNAME']
        # Collection name
        sheetname = settings['MONGODB_SHEETNAME']
        # Create the database connection
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Select the collection
        self.sheet = mydb[sheetname]
    def process_item(self, item, spider):
        # Convert the item to a dict
        data = dict(item)
        # Insert the document (insert_one replaces the deprecated insert)
        self.sheet.insert_one(data)
        return item
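  • The pipeline above opens the MongoClient in __init__ and never closes it. A common alternative is to open the connection in open_spider() and close it in close_spider(); a minimal sketch under the same setting names (the class name DoubanMongoPipeline is hypothetical):
import pymongo
from scrapy.utils.project import get_project_settings

class DoubanMongoPipeline(object):
    # Sketch only: open the MongoDB connection when the spider starts ...
    def open_spider(self, spider):
        settings = get_project_settings()
        self.client = pymongo.MongoClient(host=settings["MONGODB_HOST"],
                                          port=settings["MONGODB_PORT"])
        self.sheet = self.client[settings["MONGODB_DBNAME"]][settings["MONGODB_SHEETNAME"]]

    def process_item(self, item, spider):
        self.sheet.insert_one(dict(item))
        return item

    # ... and close it when the spider finishes
    def close_spider(self, spider):
        self.client.close()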

Finally, finish the configuration in settings.py; after that the spider can be run with scrapy crawl doubanMovie

  • Only the lines that need to be added or modified are shown below, not the full file
# Uncomment this yourself, otherwise our pipeline method will not be executed
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}
# MongoDB host
MONGODB_HOST = "127.0.0.1"
# MongoDB port
MONGODB_PORT = 27017
# MongoDB database name
MONGODB_DBNAME = "Douban"
# MongoDB collection name
MONGODB_SHEETNAME = "doubanmovies"
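  • Depending on how the site responds to Scrapy's default user agent, the crawl may also need a browser-like User-Agent; the original article does not mention this, so treat it as an optional, assumed tweak to settings.py:
# Optional (assumption): send a browser-like User-Agent; the string below is just an example
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36")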

Run results

  • Viewed with the Robo 3T data visualization tool:

    The figure shows that 250 records were retrieved, the first being The Shawshank Redemption.
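  • If you prefer checking from Python rather than Robo 3T, here is a quick verification sketch with pymongo (using the host, port, database and collection names configured above; count_documents needs pymongo 3.7+):
import pymongo

# Connect with the same host/port configured in settings.py
client = pymongo.MongoClient(host="127.0.0.1", port=27017)
collection = client["Douban"]["doubanmovies"]

# Number of stored movies and a peek at the first document
print(collection.count_documents({}))
print(collection.find_one())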


Origin www.cnblogs.com/TSOSTSOS/p/12173908.html