一、Objective:
Scrape the movies currently showing on the Maoyan site, together with their ratings, and save the results to a MongoDB database.
二、Steps:
(1)Create the project:
scrapy startproject maoyan
cd maoyan
scrapy genspider maoyan_movies "maoyan.com"
(2)Inspect the page:
Use Chrome's XPath helper extension to work out the XPath expression for each movie's title:
and the expressions for its rating:
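Those XPath expressions can be sanity-checked before writing the spider. A minimal sketch, run against a hypothetical simplified fragment of the listing markup (the real page wraps these nodes in more layers), using the stdlib ElementTree in place of Scrapy's selectors:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the listing page: each movie is a <dd>,
# the title is a @title attribute, and the rating is split across two <i>
# nodes with classes "integer" and "fraction".
FRAGMENT = """<dl>
  <dd>
    <div title="悲伤逆流成河">
      <i class="integer">9.</i><i class="fraction">1</i>
    </div>
  </dd>
</dl>"""

def extract_movies(xml_text):
    movies = []
    root = ET.fromstring(xml_text)
    for dd in root.findall("dd"):                         # //dd
        div = dd.find("div")
        title = div.get("title")                          # ./div/@title
        integer = div.find("i[@class='integer']").text    # integer part, "9."
        fraction = div.find("i[@class='fraction']").text  # fractional part, "1"
        movies.append((title, integer + fraction))
    return movies

print(extract_movies(FRAGMENT))
```

ElementTree's XPath support is limited, but it is enough to mirror the attribute and class predicates used here.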
(3)Write the code:
a、Define the item holding the fields to collect:
# items.py
import scrapy

class MaoyanItem(scrapy.Item):
    # movie title, e.g. "悲伤逆流成河"
    title = scrapy.Field()
    # rating, e.g. "9.1"
    score = scrapy.Field()
b、The main spider:
# -*- coding: utf-8 -*-
import scrapy
import re
from maoyan.items import MaoyanItem


class MaoyanMoviesSpider(scrapy.Spider):
    name = 'maoyan_movies'
    allowed_domains = ['maoyan.com']
    start_urls = ["http://maoyan.com/films?showType=1&offset=0"]
    # headers & cookies the site expects
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36"}
    cookies = {
        "email": "***********",
        "password": "********************************************",
        "origin": "account-login",
        "fingerprint": "0-1-1-p%7C3",
        "csrf": "DjhrVKMi-yPygaTCvIejqhD7WY5kvQQv6_ig"
    }

    # override start_requests so every request carries the headers and
    # cookies (no form data is posted, so a plain Request is enough;
    # FormRequest is only needed when actually submitting a form via POST)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers=self.headers,
                cookies=self.cookies,
                callback=self.parse_page)

    # parse one listing page
    def parse_page(self, response):
        # debugging aid: dump the raw response to a file
        # with open("test.html", "wb+") as filename:
        #     filename.write(response.body)
        movies = response.xpath("//dd")
        for each in movies:
            item = MaoyanItem()
            # movie title
            title = each.xpath("./div/@title").extract()
            print("title= " + title[0])
            # rating: integer and fractional parts live in separate <i> nodes
            score_integer = each.xpath('./div/i[@class="integer"]/text()').extract()
            score_fraction = each.xpath('./div/i[@class="fraction"]/text()').extract()
            # handle both rated movies and those marked "暂无评分" (not yet rated)
            if not score_integer:
                score = each.xpath('./div[@class="channel-detail channel-detail-orange"]/text()').extract()[0]
                print(score)
            else:
                # join the two fragments, e.g. "9." + "1" -> "9.1"
                score = "".join(score_integer + score_fraction)
                print("score=" + score)
            # store the fields and hand the item to the pipeline
            item['title'] = title[0]
            item['score'] = score
            yield item
        # read the current offset from the trailing digits of the URL
        page = re.findall(r'\d+\Z', response.url)[0]
        if int(page) < 61:
            next_page = int(page) + 30
            print(next_page)
            # build the next page's URL to paginate,
            # e.g. http://maoyan.com/films?showType=1&offset=30
            next_url = re.sub(r'\d+\Z', str(next_page), response.url, count=1)
            yield scrapy.Request(next_url, headers=self.headers, cookies=self.cookies, callback=self.parse_page)
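The offset arithmetic at the end of parse_page can be exercised on its own. A small sketch using only the stdlib re module:

```python
import re

# Standalone sketch of the pagination logic in parse_page: the trailing
# digits of the URL are the page offset, and each request advances it by 30
# until the offset passes 60 (so offsets 0, 30, 60 and 90 get crawled).
def next_page_url(url, step=30):
    offset = int(re.findall(r'\d+\Z', url)[0])  # trailing digits of the URL
    if offset < 61:
        # swap the old offset for the new one, keeping the rest of the URL
        return re.sub(r'\d+\Z', str(offset + step), url, count=1)
    return None  # past the last page

url = "http://maoyan.com/films?showType=1&offset=0"
while url:
    print(url)
    url = next_page_url(url)
```

Note that the regex must be a raw string (r'\d+\Z'); without the r prefix, newer Python versions warn about the invalid escape sequence.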
c、The pipeline:
# pipelines.py
# -*- coding: utf-8 -*-
import pymongo
# note: scrapy.conf is deprecated and removed in newer Scrapy releases
from scrapy.conf import settings


class MaoyanPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        # pymongo.MongoClient(host, port) opens the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mdb = client[dbname]
        # the same setting is reused as the collection name, so both the
        # database and the collection end up called 'MaoYanMovies'
        self.post = mdb[settings['MONGODB_DBNAME']]

    def process_item(self, item, spider):
        data = dict(item)
        # insert the document into the collection
        # (insert() is deprecated in pymongo 3.x; use insert_one())
        self.post.insert_one(data)
        return item
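Since `from scrapy.conf import settings` only works on older Scrapy releases, it may help to see the same pipeline written against the supported settings mechanism, the `from_crawler` classmethod. A sketch under those assumptions; the class name `MaoyanMongoPipeline` and the connection deferred to `open_spider` are choices made here, not part of the original post:

```python
# Same MongoDB pipeline, but reading settings via from_crawler instead of
# the removed global scrapy.conf.settings.
class MaoyanMongoPipeline(object):
    def __init__(self, host, port, dbname):
        self.host = host
        self.port = port
        self.dbname = dbname

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings replaces the old global settings object
        s = crawler.settings
        return cls(s['MONGODB_HOST'], s['MONGODB_PORT'], s['MONGODB_DBNAME'])

    def open_spider(self, spider):
        import pymongo  # third-party; only needed once the spider starts
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        # same convention as above: database and collection share the name
        self.post = self.client[self.dbname][self.dbname]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item
```

Opening the connection in `open_spider` and closing it in `close_spider` also avoids holding a client open while the crawler is idle.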
d、settings.py changes:
# -*- coding: utf-8 -*-
BOT_NAME = 'maoyan'

SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'

# the site answered with 400 earlier, so let that status through to the spider
HTTPERROR_ALLOWED_CODES = [400]

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# MongoDB host (loopback address)
MONGODB_HOST = '127.0.0.1'
# MongoDB port, 27017 by default
MONGODB_PORT = 27017
# name the pipeline uses for both the database and the collection
MONGODB_DBNAME = 'MaoYanMovies'

# Configure item pipelines
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
三、Running it:
1、Start the MongoDB service first: sudo mongod --dbpath ~/learn/data/db --rest
2、Open a new terminal and run the Scrapy spider:
cd maoyan
scrapy crawl maoyan_movies
After this command runs, the scraped data is written into the MaoYanMovies database files under ~/learn/data/db;
3、The data can then be inspected from the mongo shell:
# show the current database
> db
# list all databases
> show dbs
# switch to the MaoYanMovies database
> use MaoYanMovies
# list all collections
> show collections
# dump the documents in the collection
> db.MaoYanMovies.find()
四、Problems encountered and lessons learned:
Writing this code triggered quite a few errors; the fixes found for them are summarized below:
1、TypeError: must be str, not bytes
filehandle = open(WAV_FILE, 'wb+')  # writing binary data requires the 'b' flag
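A quick illustration of this fix (the file name and contents are placeholders): bytes such as a Scrapy response.body can only be written to a file opened in binary mode, and text mode raises exactly this TypeError.

```python
import os
import tempfile

# Writing bytes requires binary mode; open(path, "w+") would raise
# "TypeError: must be str, not bytes" on fh.write(body).
body = b"<html><body>test</body></html>"
path = os.path.join(tempfile.gettempdir(), "test.html")
with open(path, "wb+") as fh:  # the 'b' flag makes write() accept bytes
    fh.write(body)
with open(path, "rb") as fh:
    print(fh.read() == body)  # prints True
```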
2、<403 https://passport.meituan.com/account/unitivelogin>: HTTP status code is not handled or not allowed
Add headers=self.headers to the request; the header values can be captured with the browser's developer tools or a packet sniffer
3、TypeError: Can't convert 'list' object to str implicitly
score = score_integer + score_fraction
score = "".join(score)  # the two rating fragments are lists; join them into one string
print("score=" + score)
4、builtins.ImportError: No module named 'pymongo'
Install it with pip install pymongo
5、builtins.TypeError: name must be an instance of str
Caused by MONGODB_DBNAME not matching exactly between settings.py and the pipeline
6、but this version of PyMongo requires at least 2 (MongoDB 2.6).
sudo pip install pymongo==3.2  # install a PyMongo release that supports the MongoDB 2.6+ server
7、ERROR: dbpath (/data/db/) does not exist
sudo mongod --dbpath /data/db --rest  # create /data/db first, or point --dbpath at an existing directory
8、ERROR: listen(): bind() failed errno:98 Address already in use for socket: 0.0.0.0:27017
sudo ps aux | grep mongod   # find the mongod instance that is already running
sudo kill -9 <pid of that mongod>   # kill it, then start mongod again
Author: frank_zyp
Your support is the greatest encouragement to the author; thank you for reading carefully.
This article claims no copyright; feel free to repost it.