[python] Python data serialization module pickle usage notes

pickle is a built-in Python module for serializing and deserializing Python object structures. Serialization converts a Python object hierarchy into a byte stream that can be stored on disk or transmitted over a network; deserialization restores such a byte stream back into a Python object hierarchy.

Put simply, serialization lets you persist data that cannot be written to disk directly, extending the life cycle of objects beyond a single program run. Python has two commonly used serialization libraries, json and pickle, and they differ in two main ways:

  • Pickle can serialize almost all Python data types, including classes and functions, and the result is generally stored as a binary file. json can only serialize Python's basic data types, but its output is easy to read (see the sketch after this list).
  • Pickle can only be used in Python, while json can exchange data between different languages.
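A minimal sketch of the first difference (my own illustration, not from the original article): pickle serializes a set and an instance of a user-defined class without complaint, while json rejects both.

import json
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

data = {"ids": {1, 2, 3}, "origin": Point(0, 0)}

# pickle handles the set and the custom class instance
blob = pickle.dumps(data)
print(type(blob), len(blob))

# json does not: dumping the same dict raises TypeError
try:
    json.dumps(data)
except TypeError as e:
    print("json failed:", e)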

Pickle is generally slower than json, especially when the amount of data is large. Both libraries expose the same four basic methods:

method   effect
dump     serialize an object and write it to a file
load     read a file and deserialize it back into an object
dumps    serialize an object and return the result as a bytes object
loads    deserialize a bytes object back into an object

1 Using pickle

The pickle.dump() function serializes a Python structure and saves it as a binary file. It accepts three arguments: the first is the object to be stored, the second is a file object opened in binary write mode, and the third is the serialization protocol.

As for the choice of protocol, there are currently five different protocols available (see Python object serialization in the official documentation). The higher the protocol used, the newer the Python version required to read the resulting pickle. The protocols are:

  • Protocol version 0 is the original "human-readable" protocol, backward compatible with earlier versions of Python.
  • Protocol version 1 is an old binary format that is also compatible with earlier versions of Python.
  • Protocol version 2 was introduced in Python 2.3, providing a more efficient serialization method.
  • Protocol version 3 was introduced in Python 3.0. It explicitly supports bytes objects, which is also the default protocol for Python and the recommended protocol when compatibility with other Python 3 versions is required.
  • Protocol version 4 was introduced in Python 3.4. It adds support for very large objects, serializes more types of objects, and optimizes some data formats.

A protocol is chosen by passing an integer from 0 to 4. The protocol parameter defaults to None, which means the default protocol of the current Python version is used; passing -1 selects the highest available protocol. The protocol can also be set through two module constants:

  • pickle.HIGHEST_PROTOCOL: Indicates the highest protocol.
  • pickle.DEFAULT_PROTOCOL: Indicates the default protocol.
import pickle
print("Highest pickle protocol in this Python environment: {}".format(pickle.HIGHEST_PROTOCOL))
print("Default pickle protocol in this Python environment: {}".format(pickle.DEFAULT_PROTOCOL))
Highest pickle protocol in this Python environment: 4
Default pickle protocol in this Python environment: 3
# serialization example
import pickle
import numpy as np

data = {
    "name": "data struct",
    "number": 123.456,
    "tuple": ("first", False, 10.01),
    "numpy_data": np.ones((9, 9), np.uint8)
}

# save to a local file; the file name and extension can be anything, since it is just a binary file
with open('data.bin', 'wb') as f:
    # use the lowest protocol (0)
    pickle.dump(data, f, 0)

# check the file size
!du -h data.bin
print('---divider---')
# look at the first five lines of the file; note the human-readable text
!cat data.bin | head -n 5
4.0K	data.bin
---divider---
(dp0
Vname
p1
Vdata struct
p2
# save to a local file; the file name and extension can be anything, since it is just a binary file
with open('data.bin', 'wb') as f:
    # use protocol 1
    pickle.dump(data, f, 1)

# check the file size
!du -h data.bin
print('---divider---')
# look at the first two lines of the file
!cat data.bin | head -n 2
4.0K	data.bin
---divider---
}q (X   nameqX   data structqX   numberqG@^�/��wX   tupleq(X   firstqI00
G@$�Q�tqX
# save to a local file; the file name and extension can be anything, since it is just a binary file
with open('data.bin', 'wb') as f:
    # use the default protocol
    pickle.dump(data, f, pickle.DEFAULT_PROTOCOL)

# check the file size
!du -h data.bin
print('---divider---')
# look at the first two lines of the file
!cat data.bin | head -n 2
4.0K	data.bin
---divider---
�}q (X   nameqX   data structqX   numberqG@^�/��wX   tupleqX   firstq�G@$�Q녇qX
   numpy_dataqcnumpy.core.multiarray
# save to a local file; the file name and extension can be anything, since it is just a binary file
with open('data.bin', 'wb') as f:
    # use protocol 4
    pickle.dump(data, f, 4)

# check the file size
!du -h data.bin
print('---divider---')
# look at the first two lines of the file
!cat data.bin | head -n 2
4.0K	data.bin
---divider---
��/      }�(�name��data struct��number�G@^�/��w�tuple��first��G@$�Q녇��
numpy_data��numpy.core.multiarray��_reconstruct����numpy��ndarray���K ��Cb���R�(KK	K	��h�dtype����u1�����R�(K�|�NNNJ����J����K t�b�CQ�t�bu.
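Note that du -h rounds file sizes up to the filesystem block size, which is why every file above reports 4.0K. To compare how the protocol actually affects the serialized size, one option (my own sketch, reusing the data dictionary defined above) is to measure len(pickle.dumps(...)) for each protocol:

import pickle

# compare the serialized size of the same object under each available protocol
# (assumes the `data` dictionary from the example above is still defined)
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    size = len(pickle.dumps(data, protocol))
    print("protocol {}: {} bytes".format(protocol, size))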

To read the file back, just deserialize it with the pickle.load() function. The serialization protocol is detected automatically and does not need to be specified. pickle.load() also accepts encoding and errors parameters that tell pickle how to decode streams written by older Python versions (in particular Python 2); the defaults are fine in most cases.

import pickle

with open('data.bin', 'rb') as f:
    data = pickle.load(f)
    print(type(data))
    print(data['name'])
    print(data.keys())
<class 'dict'>
data struct
dict_keys(['name', 'number', 'tuple', 'numpy_data'])
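As a hedged illustration of the encoding parameter mentioned above (the file name here is hypothetical, not from the original article): when loading a pickle written by Python 2, passing encoding='latin1' is a common way to decode its 8-bit string data, for example NumPy arrays pickled under Python 2.

import pickle

# hypothetical file produced by a Python 2 program
with open('legacy_py2.pkl', 'rb') as f:
    # encoding='latin1' decodes Python 2 str data without raising UnicodeDecodeError
    legacy = pickle.load(f, encoding='latin1')
print(type(legacy))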

The dumps function returns the serialized representation of an object as a bytes object instead of writing it to a file, and the loads function deserializes such a bytes object. Note that bytes is a distinct type in Python 3 that stores data in binary form.

data = [1,2,3]

# serialize, returning a bytes object
dumped = pickle.dumps(data)
print(dumped)
print(type(dumped))
print(len(dumped))

# deserialize
loaded = pickle.loads(dumped)
print(loaded)
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
<class 'bytes'>
14
[1, 2, 3]

The serialization and deserialization process can be customized through the __getstate__ and __setstate__ methods: __getstate__ is called during serialization, and __setstate__ is called during deserialization.

In the example below, __getstate__ selects which attributes are serialized, and __setstate__ restores the missing attribute during deserialization.

import pickle

class MyData:

    def __init__(self, x):

        self.x = x
        self.y = self.sqrt(x)

    def sqrt(self, x):
        # note: despite its name, this computes x**x rather than a square root
        return x**x

    def __getstate__(self):
        self.state = "ok"
        print("enter getstate")
        # self.__dict__ holds the instance attributes (self.xxx)
        odict = self.__dict__.copy()
        del odict['y']
        print(odict)
        return odict

    def __setstate__(self, input):
        print("enter setstate")
        print(input)
        self.x = input['x']
        self.y = self.sqrt(self.x)

obj = MyData(3)
# serialize
print("Serializing")
dumped = pickle.dumps(obj)
# deserialize
print("Deserializing")
loaded = pickle.loads(dumped)
print("Deserialization result:", loaded.y)
Serializing
enter getstate
{'x': 3, 'state': 'ok'}
Deserializing
enter setstate
{'x': 3, 'state': 'ok'}
Deserialization result: 27

2 Accelerating pickle

When the object to be serialized is very large, pickling and unpickling can become a performance bottleneck. There are generally three ways to speed up the process:

  • Use a higher protocol version
  • Use cPickle instead of pickle
  • Disable the garbage collector

The examples below show how to apply each technique. With this relatively small amount of data the speed-up is not dramatic, but the code serves as a reference.

Use pickle directly

import time
import pickle
import numpy as np
import os

def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} seconds'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data,filepath):
    with open(filepath, 'wb') as f:
        pickle.dump(data, f)


@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = pickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data,filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)
pickle_dump took: 1.7647628784179688 seconds
pickle_load took: 1.7913622856140137 seconds

Use the highest pickle protocol

You can pass -1 as the protocol parameter to select the highest protocol; how much this helps depends on the data.

import time
import pickle
import numpy as np
import os

def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} seconds'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data,filepath):
    with open(filepath, 'wb') as f:
        # use the highest protocol version
        pickle.dump(data, f, -1)


@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = pickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data,filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)
pickle_dump took: 1.731525182723999 seconds
pickle_load took: 1.7664134502410889 seconds

Use cPickle instead of pickle

The simplest change is to use cPickle instead of pickle. cPickle offers the same interface as pickle, with the same functions and parameters; the difference is that it is implemented in C, which made it much faster than the pure-Python pickle in Python 2. In Python 3 the C implementation lives in the _pickle module, and the standard pickle module already uses it automatically when it is available, which is why the timings below are essentially the same as in the previous runs.

import time
# how to import cPickle in Python 3
import _pickle as cPickle
import numpy as np
import os

def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} seconds'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data,filepath):
    with open(filepath, 'wb') as f:
        # use the highest protocol version
        cPickle.dump(data, f, -1)


@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = cPickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data,filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)
pickle_dump took: 1.7443737983703613 seconds
pickle_load took: 1.7894999980926514 seconds
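To check whether the standard pickle module in your interpreter is already backed by the C implementation, a quick test (my own addition, not from the original article) is to compare the classes the two modules expose:

import pickle
import _pickle

# True means pickle.dump/load already dispatch to the C implementation,
# so importing _pickle directly gains little
print(pickle.Pickler is _pickle.Pickler)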

Disable garbage collection

The garbage collector can slow down the handling of large object graphs, so temporarily disabling it during dump and load may improve performance.

import time
import pickle
import numpy as np
import os
import gc

# disable garbage collection
gc.disable()

def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} seconds'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data,filepath):
    with open(filepath, 'wb') as f:
        # use the highest protocol version
        pickle.dump(data, f, -1)


@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = pickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data,filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)

# re-enable garbage collection
gc.enable()
pickle_dump took: 1.8271889686584473 seconds
pickle_load took: 1.7800366878509521 seconds
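Since forgetting to re-enable the collector would affect the rest of the program, it can be safer to wrap the disable/enable pair in a context manager. The helper below is my own sketch, not part of the original article.

import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    # temporarily disable garbage collection, restoring it even if an error occurs
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# usage:
# with gc_disabled():
#     pickle.dump(data, f, -1)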

3 Reference

  • Python object serialization: https://docs.python.org/3/library/pickle.html
