Similarities and differences between JsonItemExporter and JsonLinesItemExporter for saving data

Foreword

To persist data in a pipeline of the Scrapy crawler framework, the JsonItemExporter and JsonLinesItemExporter classes from scrapy.exporters are commonly used. The similarities and differences in how the two are used are as follows:

JsonItemExporter usage

JsonItemExporter: exports all items as a single JSON array, which can be memory-intensive for large data sets

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonItemExporter

class QsbkPipeline(object):

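    def __init__(self):
        # Note: the exporter writes bytes, so the file is opened in binary mode and
        # no encoding is passed to open(); a file opened in text mode would need one
        self.fp = open('test.json', 'wb')

        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('start...')
        # Write the opening "[" of the JSON array
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Write the closing "]" so the file is valid JSON, then close the file
        self.exporter.finish_exporting()
        self.fp.close()
        print('end...')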

JsonLinesItemExporter usage

JsonLinesItemExporter: writes one dictionary (JSON object) per line, so the file as a whole is not a single valid JSON document; each item is written to the file as soon as it is exported, and memory usage is small.

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonLinesItemExporter

class QsbkPipeline(object):

    def __init__(self):
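        # JsonLinesItemExporter requires the file to be opened in binary mode
        # Note: a file opened in binary mode needs no encoding passed to open();
        # a file opened in text mode would need one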
        self.fp = open('test.json', 'wb')

        self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')

    def open_spider(self, spider):
        print('start...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('end...')

Differences

  • JsonItemExporter: All items go into a single JSON array, which becomes valid JSON only after finish_exporting() writes the closing bracket. The advantage is that the stored file is a well-formed JSON document. The disadvantage is that with a relatively large amount of data it consumes more memory, because most JSON parsers have to load the entire array at once to read it back.

  • JsonLinesItemExporter: Each call to export_item writes one item to the file. The disadvantage is that each dictionary occupies its own line, so the file as a whole is not a valid JSON document. The advantage is that each item is written out directly, so memory usage stays low, and the data is relatively safe: if the crawl is interrupted, the lines already written remain readable.
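
The trade-off in the two points above also shows up when the files are read back. Here is a minimal sketch using only the standard library; the sample bytes are written by hand to imitate the two layouts and are an assumption, not real crawl output.

```python
import io
import json

# Hand-written sample data in the two layouts (hypothetical content).
json_array_bytes = b'[{"id": 1, "text": "first"},{"id": 2, "text": "second"}]'
json_lines_bytes = b'{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n'


def read_json_array(fp):
    """Parse the whole array at once; all items are in memory together."""
    return json.load(fp)


def iter_json_lines(fp):
    """Yield one item per line, so only one item is decoded at a time."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


array_items = read_json_array(io.BytesIO(json_array_bytes))
line_items = list(iter_json_lines(io.BytesIO(json_lines_bytes)))

# Simulate an interrupted crawl: the array never received its closing "]".
truncated_array = json_array_bytes[:-1]
try:
    json.loads(truncated_array)
    array_recovered = True
except json.JSONDecodeError:
    array_recovered = False

# The same interruption in JSON Lines only loses the unfinished last line.
truncated_lines = json_lines_bytes[:-10]
recovered = []
for raw in truncated_lines.splitlines():
    try:
        recovered.append(json.loads(raw))
    except json.JSONDecodeError:
        break  # stop at the first incomplete line
```

In this sketch the truncated array cannot be parsed at all, while the truncated JSON Lines data still yields every complete line.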

Origin blog.csdn.net/weekdawn/article/details/126402356