Scraping the 糗妹妹 (qiumeimei.com) jokes front page with Python's Scrapy framework

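Disclaimer: this article is for learning web scraping only. Please do not use it commercially or to attack the website maliciously. The author reserves all rights of interpretation.
This article saves the scraped jokes locally in two ways: as a txt file and as a JSON file.
The txt version is simple: once the dictionaries are built, running the crawl command is enough. The JSON version takes a little more work, and the code below is commented in detail to explain it.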

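# -*- coding: utf-8 -*-
# texts.py
import scrapy
# Import the item class defined in items.py
from first.items import FirstItem

class TextsSpider(scrapy.Spider):
    # Spider name; `scrapy list` prints the names of all spiders in the project
    name = 'texts'
    # allowed_domains restricts the crawl to the listed domains; requests to any
    # other domain are filtered out. Resources hosted elsewhere (images on another
    # server, for example) would not be fetched, so it is usually commented out.
    # Entries must be bare domain names, not URLs.
    # allowed_domains = ['www.qiumeimei.com']
    # The first URL the spider requests
    start_urls = ['http://www.qiumeimei.com/text']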

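    # Parsing logic goes here
    def parse(self, response):
        div_list = response.xpath('//div[@class="home_main_wrap"]/div[@class="panel clearfix"]')
        # Only used by the commented-out single-file variant below;
        # a returned list of dicts can be exported to csv or json with -o
        contents = []
        for div in div_list:
            author = div.xpath('./div[@class="top clearfix"]/h2/a/text()').extract_first()
            content = div.xpath('./div[@class="main"]/p/text()').extract()
            # extract() returns a list of strings. Some posts hold only a
            # non-breaking space in the outer <p>; their text sits in a nested
            # <div>, so fall back to that path.
            if content == ['\xa0']:
                content = div.xpath('./div[@class="main"]/div/p/text()').extract()
            content = "".join(content)
            # Commented out: single-file variant that stores plain dicts
            # dict1 = {
            #     "author":author,
            #     "content":content
            # }
            # contents.append(dict1)
            # Multi-module variant: several project files cooperate to store the data
            # Instantiate the item class
            items = FirstItem()
            items["author"] = author
            items["content"] = content
            # print(items)
            # yield hands each item to the item pipeline (pipelines.py); no return is needed
            yield items
        # Commented out: single-file variant that stores plain dicts
        # return contents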

The code above is the main spider.
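
The fallback for posts whose outer paragraph holds only a non-breaking space can be isolated in a small helper, which makes it easy to check without running a crawl. This is only a sketch: `pick_content` and its parameter names are hypothetical and do not appear in the original project. It mirrors the comparison and join used in `parse`.

```python
def pick_content(primary, fallback):
    """Join the text fragments of a post into one string.

    `primary` stands for the list extracted from the outer <p>,
    `fallback` for the list extracted from the nested <div>/<p>.
    Both names are illustrative assumptions.
    """
    # A lone non-breaking space means the real text is in the nested path
    if primary == ['\xa0']:
        primary = fallback
    return "".join(primary)


# Normal post: the outer paragraph already holds the text
normal = pick_content(['first line', 'second line'], [])
# Placeholder post: only '\xa0' in the outer paragraph
nested = pick_content(['\xa0'], ['text from ', 'the nested div'])
```

In the spider, the fallback list would only be extracted when needed, as the original code does; the helper takes both lists purely to keep the example self-contained.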

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
# items.py
import scrapy
# Holds the data parsed from the page

class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

Next is the pipeline file, which mainly shows how to generate the JSON file.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# pipelines.py
import codecs
import json
import os
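# Persistence logic. Pipelines handle persistent storage: txt, json, csv, MySQL, Redis
class FirstPipeline(object):
    f = None
    # Called once when the spider starts
    def open_spider(self, spider):
        # Open the output file
        # self.f = open("qmm.txt","w",encoding="utf_8")
        # codecs.open opens the underlying file in binary mode, which permits the
        # end-relative seek in close_spider; a text-mode open() file rejects it
        self.f = codecs.open("qmm.json", "w", encoding="utf_8")
        # Count written items so close_spider knows whether a trailing comma exists
        self.count = 0
        # Start a JSON object holding a single list
        self.f.write('{"list":[')
    # Called for every item the spider yields
    def process_item(self, item, spider):
        # print("writing...")
        author = item["author"]
        content = item["content"]
        # Plain-text variant: write straight to the txt file
        # self.f.write(author + ":" + "\n" + content + "\n\n\n")
        # To store JSON, first convert the item to a dict
        res = dict(item)
        # json.dumps escapes non-ASCII characters by default; pass
        # ensure_ascii=False to keep the Chinese text readable.
        # f.write() needs a string, so each dict is serialized and written
        # as one element of the list, followed by a comma and a newline
        line = json.dumps(res, ensure_ascii=False)
        self.f.write(line + "," + "\n")
        self.count += 1
        return item
    # Called once when the spider closes
    def close_spider(self, spider):
        if self.count:
            # Move to the end of the file, then back two bytes
            self.f.seek(-2, os.SEEK_END)
            # Drop everything from there on: the trailing comma and the newline
            self.f.truncate()
        # Close the list and the enclosing object
        self.f.write("]}")
        self.f.close()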
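
The trailing-comma trim can be tried outside Scrapy to confirm that the result parses as JSON. The following is a minimal sketch under a few assumptions: `write_json_list`, the temporary path, and the sample records are all made up for illustration, and the list is wrapped in braces so that `json.load` accepts the file.

```python
import codecs
import json
import os
import tempfile


def write_json_list(path, records):
    """Write records as {"list": [...]}, trimming the last ",\n" by seeking."""
    # codecs.open uses a binary file underneath, so seeking relative to
    # the end of the file is allowed
    f = codecs.open(path, "w", encoding="utf_8")
    f.write('{"list":[')
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + ",\n")
    if records:
        # The trailing comma and newline are two single-byte characters
        f.seek(-2, os.SEEK_END)
        f.truncate()
    f.write("]}")
    f.close()


# Hypothetical sample data standing in for scraped items
samples = [
    {"author": "a", "content": "段子一"},
    {"author": "b", "content": "段子二"},
]
path = os.path.join(tempfile.mkdtemp(), "qmm_demo.json")
write_json_list(path, samples)

with open(path, encoding="utf-8") as fh:
    data = json.load(fh)
```

The guard on `records` matters: with nothing written, seeking back two bytes would cut into the opening `{"list":[` instead of removing a comma.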

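The last piece is the settings file, which mainly sets a spoofed User-Agent and disables robots.txt compliance. The key lines are the following: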

# Item pipelines
ITEM_PIPELINES = {
   'first.pipelines.FirstPipeline': 300,
}
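
The User-Agent and robots.txt settings mentioned above are not reproduced in the post. A minimal sketch of what they might look like in settings.py follows; `USER_AGENT` and `ROBOTSTXT_OBEY` are standard Scrapy setting names, but the User-Agent string is a placeholder assumption, not the value the author used.

```python
# settings.py (fragment)

# Placeholder browser-like User-Agent; substitute a real one as needed
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# Do not fetch or obey robots.txt before crawling
ROBOTSTXT_OBEY = False
```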

# Note:
# When running `scrapy crawl texts -o qiumeimei.json --nolog`, the JSON feed
# escapes non-ASCII characters as \uXXXX sequences by default. With the
# setting below, the file is written as readable UTF-8 text.
FEED_EXPORT_ENCODING = 'UTF8'  # Same as: scrapy crawl texts -o qiumeimei.json -s FEED_EXPORT_ENCODING=UTF8 --nolog
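
The same escaping behavior can be seen with the standard `json` module, which is also why the pipeline passes `ensure_ascii=False`. A small sketch, with a made-up record:

```python
import json

# Hypothetical record standing in for one scraped item
record = {"author": "qmm", "content": "段子"}

# Default: non-ASCII characters become \uXXXX escape sequences
escaped = json.dumps(record)

# ensure_ascii=False keeps the original characters in the output
readable = json.dumps(record, ensure_ascii=False)

# Both strings decode back to the same data
same = json.loads(escaped) == json.loads(readable)
```

Either form is valid JSON; the difference only affects how the file looks when opened in an editor.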

Python 键盘上的舞者
