Chinese Output and Storage in Scrapy

1. Chinese output

In Python 3.x, Chinese text can be printed and processed directly.

In Python 2.x, encode the Chinese text before output, using encode("gbk") or encode("utf-8").
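As a small illustration of the two encodings mentioned above, the following Python 3 sketch encodes the same string both ways. The variable names are only for illustration, and the snippet assumes the source file is saved as UTF-8.

```python
# Python 3: str is Unicode text, so it can be printed directly.
text = "中文"
print(text)

# encode() turns the text into bytes; the bytes differ per encoding.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)   # b'\xd6\xd0\xce\xc4'

# Decoding with the matching encoding restores the original text.
round_trip_ok = gbk_bytes.decode("gbk") == text
print(round_trip_ok)  # True
```

UTF-8 uses three bytes for each of these characters and GBK uses two, which is why a file must be read back with the same encoding it was written in.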

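2. Storing Chinese text

In Scrapy, scraped data is processed in the pipelines.py file. First open the project settings file, settings.py, and configure the pipelines. The default entry is commented out: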

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'firstpjt.pipelines.FirstpjtPipeline': 300,
#}

In the code above, 'firstpjt.pipelines.FirstpjtPipeline' stands for "core directory name.pipelines file name.class name". Uncomment the setting so that it reads:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'firstpjt.pipelines.FirstpjtPipeline': 300,
}
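To make the "directory.file.class" reading of that string concrete, here is a minimal sketch of how such a dotted path can be split and resolved. The helper load_by_path is hypothetical and is not Scrapy's own loader; it is demonstrated on a standard-library class so that it runs without a Scrapy project.

```python
import importlib


def load_by_path(dotted_path):
    """Hypothetical helper: resolve 'package.module.ClassName' to a class."""
    # Everything before the last dot is the module, the rest is the class name.
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Splitting the path from the settings file (no import is attempted here).
module_path, _, class_name = "firstpjt.pipelines.FirstpjtPipeline".rpartition(".")
print(module_path)  # firstpjt.pipelines
print(class_name)   # FirstpjtPipeline

# Resolving a path that exists in the standard library.
decoder_cls = load_by_path("json.decoder.JSONDecoder")
print(decoder_cls.__name__)  # JSONDecoder
```

The number 300 is the pipeline's order value: when several pipelines are enabled, items pass through them from the lowest value to the highest.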

Then write the pipelines.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Import the codecs module so the file can be opened with an explicit encoding
import codecs


class FirstpjtPipeline(object):

    def __init__(self):
        # Create or open a plain text file for writing, to store the scraped data
        self.file = codecs.open("E:/SteveWorkspace/firstpjt/mydata/mydata1.txt", "wb", encoding="utf-8")

    def process_item(self, item, spider):
        # Build the line to write for this item
        line = str(item) + '\n'
        # Print it as well, which makes debugging easier
        print(line)
        # Write the line to the file
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
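The same idea can be tried outside a crawl. The sketch below is a hypothetical variant, not the pipeline above: the class name TextFilePipeline, the path parameter and the sample item are all assumptions, it uses the built-in open() with encoding="utf-8" instead of codecs.open(), and it opens the file in open_spider() rather than __init__(). The methods are called by hand with spider=None only to show the round trip.

```python
import os
import tempfile


class TextFilePipeline:
    """Hypothetical pipeline that stores each item as one UTF-8 text line."""

    def __init__(self, path="mydata1.txt"):
        # The path argument is an assumption made so the demo can use a temp dir.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write the item's string form followed by a newline.
        self.file.write(str(item) + "\n")
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.file.close()


# Drive the pipeline by hand with an invented sample item.
demo_path = os.path.join(tempfile.mkdtemp(), "mydata1.txt")
pipeline = TextFilePipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "中文标题"}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Reading the file back with the same encoding returns the Chinese text unchanged, which is a quick way to confirm the encoding setting before running a full crawl.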

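3. Writing Chinese to a JSON file

JSON data has two common basic storage structures: arrays and objects.

Array form: ["苹果","梨子","葡萄"]

Object form, written as key-value pairs: {"姓名":"小明","身高":"173"}

Modify the pipelines.py file along these lines: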
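As a quick check of the two structures, this sketch parses the array and the object with Python's json module. Note that JSON requires the ASCII comma between members; the variable names are only illustrative.

```python
import json

# A JSON array becomes a Python list.
fruits = json.loads('["苹果","梨子","葡萄"]')
print(type(fruits).__name__, len(fruits))  # list 3

# A JSON object becomes a Python dict of key-value pairs.
person = json.loads('{"姓名":"小明","身高":"173"}')
print(type(person).__name__, person["姓名"])  # dict 小明
```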

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json


class ScrapytestPipeline(object):
    def __init__(self):
        # Create or open a JSON file in append mode
        self.file = codecs.open("E:/PycharmWorkspace/ScrapyTest/mydata/datayamaxun.json", "ab", encoding="utf-8")
        print("File opened---------------")

    def process_item(self, item, spider):
        print("Writing started---------------")
        for j in range(0,len(item["bookname"])):
            bookname = item["bookname"][j]
            # author=item["author"][j]
            price = item["price"][j]
            book = {"bookname": bookname, "price": price}
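            # book is already a plain dict, so json.dumps() can serialize it directly
            # By default json.dumps() escapes non-ASCII characters as \uXXXX sequences;
            # set ensure_ascii=False to keep the Chinese text readable
            json_text = json.dumps(book, ensure_ascii=False)
            # Append "\n" to form one line of output
            line = json_text + '\n'
            print("Writing to file---------------")
            self.file.write(line)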
        return item

    def close_spider(self, spider):
        self.file.close()
        print("File closed---------------")
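The effect of ensure_ascii can be seen in isolation with a short sketch. The item below is invented sample data shaped like the one the pipeline expects (parallel lists under "bookname" and "price"), and zip() is used here as one possible alternative to indexing with range(len(...)).

```python
import json

# Invented sample item: parallel lists, as in the pipeline above.
item = {"bookname": ["三体", "活着"], "price": ["23.00", "20.00"]}

# Default behaviour: non-ASCII characters are escaped.
escaped = json.dumps({"bookname": "三体"})
print(escaped)   # {"bookname": "\u4e09\u4f53"}

# With ensure_ascii=False the Chinese characters are written as-is.
readable = json.dumps({"bookname": "三体"}, ensure_ascii=False)
print(readable)  # {"bookname": "三体"}

# Pair each book name with its price and build one JSON line per book.
lines = [
    json.dumps({"bookname": name, "price": price}, ensure_ascii=False)
    for name, price in zip(item["bookname"], item["price"])
]
for line in lines:
    print(line)
```

Both forms load back to the same data with json.loads(); ensure_ascii=False only changes how the file looks to a human reader, so the output file must be opened with an encoding such as UTF-8 that can represent the characters.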
