Pickle is a built-in Python module for serializing and deserializing Python object structures. Serialization is the process of converting a Python object hierarchy into a byte stream that can be stored locally or transmitted over a network, and deserialization restores that byte stream back into a Python object hierarchy.
Put simply, serialization lets you persist data that cannot be written to disk directly, extending the lifetime of objects beyond the running process. Python has two commonly used serialization libraries, json and pickle, and they differ in two main ways:
- Pickle can serialize almost all Python data types, including classes and functions, and the result is generally stored as a binary file. Json can only serialize basic Python data types, but its dump results are easy to read.
- Pickle streams can only be used between Python programs, while json can exchange data between different languages.

Pickle is generally slower than json, especially when the amount of data is large. Both pickle and json provide the same four basic methods:
method | effect |
---|---|
dump | serialize and write to a file object |
load | read from a file object and deserialize |
dumps | serialize and return the result as an object |
loads | deserialize from an in-memory object |
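The first difference above can be seen in a short sketch (a minimal example I added, not from the original article): pickle round-trips Python-specific types exactly, while json either converts them or refuses.

```python
import json
import pickle

data = {"point": (1.0, 2.0)}

# json has no tuple type, so the tuple silently becomes a list
json_roundtrip = json.loads(json.dumps(data))
print(json_roundtrip["point"])      # [1.0, 2.0]

# pickle restores the exact Python object, tuple included
pickle_roundtrip = pickle.loads(pickle.dumps(data))
print(pickle_roundtrip["point"])    # (1.0, 2.0)

# types such as set are not JSON-serializable at all
try:
    json.dumps({1, 2, 3})
except TypeError as e:
    print("json failed:", e)
```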
1 Using pickle
The pickle.dump() function serializes a Python object and saves it as a binary file. It takes three arguments: the first is the object to be serialized, the second is a file object opened in binary write mode, and the third specifies the serialization protocol.
Pickle currently offers five different protocols (see the official Python object serialization documentation). The higher the protocol used, the newer the Python version required to read the generated pickle. The protocols are:
- Protocol version 0 is the original "human-readable" protocol, backward compatible with earlier versions of Python.
- Protocol version 1 is an old binary format that is also compatible with earlier versions of Python.
- Protocol version 2 was introduced in Python 2.3, providing a more efficient serialization method.
- Protocol version 3 was introduced in Python 3.0. It explicitly supports bytes objects; it is the default protocol in Python 3 (up to 3.7) and the recommended protocol when compatibility with other Python 3 versions is required.
- Protocol version 4 was introduced in Python 3.4. It adds support for very large objects, serializes more types of objects, and optimizes some data formats.
A protocol can be selected by passing a value from 0 to 4. The protocol parameter defaults to None, which means the default protocol of the running Python version is used; passing -1 selects the highest available protocol. The protocol can also be set through two module constants:
- pickle.HIGHEST_PROTOCOL: the highest protocol available.
- pickle.DEFAULT_PROTOCOL: the default protocol.
import pickle
print("Highest pickle protocol in this Python environment: {}".format(pickle.HIGHEST_PROTOCOL))
print("Default pickle protocol in this Python environment: {}".format(pickle.DEFAULT_PROTOCOL))
Highest pickle protocol in this Python environment: 4
Default pickle protocol in this Python environment: 3
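A quick way to see what the protocol choice costs in practice (a sketch I added, not part of the original article) is to compare the size of the serialized bytes under each protocol; higher protocols generally produce more compact output:

```python
import pickle

data = list(range(1000))
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    # dumps returns the serialized bytes, so len() gives the payload size
    size = len(pickle.dumps(data, protocol=proto))
    print("protocol {}: {} bytes".format(proto, size))
```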
# serialization example
import pickle
import numpy as np
data = {
    "name": "data struct",
    "number": 123.456,
    "tuple": ("first", False, 10.01),
    "numpy_data": np.ones((9, 9), np.uint8)
}
# save locally; the file name and extension are arbitrary since this is a binary file
with open('data.bin', 'wb') as f:
    # use the lowest protocol, 0
    pickle.dump(data, f, 0)
# check the file size
!du -h data.bin
print('--- separator ---')
# show the first 5 lines of the file; note the human-readable text
!cat data.bin | head -n 5
4.0K data.bin
--- separator ---
(dp0
Vname
p1
Vdata struct
p2
# save locally again; the file name and extension are still arbitrary
with open('data.bin', 'wb') as f:
    # use the old binary protocol, 1
    pickle.dump(data, f, 1)
# check the file size
!du -h data.bin
print('--- separator ---')
# show the first 2 lines of the file
!cat data.bin | head -n 2
4.0K data.bin
--- separator ---
}q (X nameqX data structqX numberqG@^�/��wX tupleq(X firstqI00
G@$�Q�tqX
# save locally with the default protocol
with open('data.bin', 'wb') as f:
    # use the default protocol
    pickle.dump(data, f, pickle.DEFAULT_PROTOCOL)
# check the file size
!du -h data.bin
print('--- separator ---')
# show the first 2 lines of the file
!cat data.bin | head -n 2
4.0K data.bin
--- separator ---
�}q (X nameqX data structqX numberqG@^�/��wX tupleqX firstq�G@$�Q녇qX
numpy_dataqcnumpy.core.multiarray
# save locally with protocol 4
with open('data.bin', 'wb') as f:
    # use protocol 4, the highest in this environment
    pickle.dump(data, f, 4)
# check the file size
!du -h data.bin
print('--- separator ---')
# show the first 2 lines of the file
!cat data.bin | head -n 2
4.0K data.bin
--- separator ---
��/ }�(�name��data struct��number�G@^�/��w�tuple��first��G@$�Q녇��
numpy_data��numpy.core.multiarray��_reconstruct����numpy��ndarray���K ��Cb���R�(KK K ��h�dtype����u1�����R�(K�|�NNNJ����J����K t�b�CQ�t�bu.
To deserialize and read the file back, use the pickle.load function. The serialization protocol is detected automatically from the stream, so it does not need to be specified. There are also two parameters, encoding and errors, which tell pickle how to decode streams pickled by older Python versions; the default values are fine in most cases.
import pickle
with open('data.bin', 'rb') as f:
    data = pickle.load(f)
print(type(data))
print(data['name'])
print(data.keys())
<class 'dict'>
data struct
dict_keys(['name', 'number', 'tuple', 'numpy_data'])
The dumps function returns the serialized representation of the object as a bytes object instead of writing it to a file, and the loads function deserializes such a bytes object. Note that bytes is a built-in Python 3 type responsible for storing raw binary data.
data = [1, 2, 3]
# serialize, returning a bytes object
dumped = pickle.dumps(data)
print(dumped)
print(type(dumped))
print(len(dumped))
# deserialize
loaded = pickle.loads(dumped)
print(loaded)
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
<class 'bytes'>
14
[1, 2, 3]
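The auto-detection mentioned earlier works because, from protocol 2 onward, the stream begins with a PROTO opcode (\x80) followed by the protocol number, which is exactly the \x80\x03 visible at the start of the output above. The standard pickletools module can disassemble the stream to show this (a small sketch I added):

```python
import pickle
import pickletools

payload = pickle.dumps([1, 2, 3], protocol=2)
print(payload[:2])        # b'\x80\x02' — PROTO opcode plus protocol number
pickletools.dis(payload)  # human-readable opcode listing, starting with PROTO 2
```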
The serialization and deserialization process can be customized through the __getstate__ and __setstate__ methods: __getstate__ is called during serialization to decide what state gets pickled, and __setstate__ is called during deserialization to restore it.
In the following example, __getstate__ selects which attributes to serialize, and __setstate__ rebuilds the missing attribute when deserializing.
import pickle

class MyData:
    def __init__(self, x):
        self.x = x
        self.y = self.sqrt(x)

    # note: despite the name, this computes x ** x
    def sqrt(self, x):
        return x ** x

    def __getstate__(self):
        self.state = "ok"
        print("enter getstate")
        # self.__dict__ holds the instance attributes (self.xxx)
        odict = self.__dict__.copy()
        del odict['y']
        print(odict)
        return odict

    def __setstate__(self, input):
        print("enter setstate")
        print(input)
        self.x = input['x']
        self.y = self.sqrt(self.x)

obj = MyData(3)
# serialize
print("serializing")
dumped = pickle.dumps(obj)
# deserialize
print("deserializing")
loaded = pickle.loads(dumped)
print("deserialized result:", loaded.y)
serializing
enter getstate
{'x': 3, 'state': 'ok'}
deserializing
enter setstate
{'x': 3, 'state': 'ok'}
deserialized result: 27
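A common real-world use of this pair (my own illustrative example, not from the article) is dropping an attribute that cannot be pickled, such as a lock or an open file handle, and recreating it on load:

```python
import pickle
import threading

class Worker:
    def __init__(self):
        self.data = [1, 2, 3]
        self.lock = threading.Lock()   # lock objects cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['lock']              # exclude the unpicklable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = threading.Lock()   # recreate it after loading

restored = pickle.loads(pickle.dumps(Worker()))
print(restored.data)                   # [1, 2, 3]
```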
2 Accelerating pickle
When the object to be serialized is very large, pickling and unpickling can become a performance bottleneck for the code. There are three common ways to speed up the pickle serialization process:
- use a higher protocol version
- use cPickle instead of pickle
- disable the garbage collector

The following examples show each approach. The acceleration is not obvious here because the amount of data is small, but the code is recorded for reference.
Use pickle directly
import time
import pickle
import numpy as np
import os

# decorator that reports how long a function call takes
def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} s'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data, filepath):
    with open(filepath, 'wb') as f:
        pickle.dump(data, f)

@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = pickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data, filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)
pickle_dump took: 1.7647628784179688 s
pickle_load took: 1.7913622856140137 s
Use the highest pickle protocol
You can pass protocol=-1 to select the highest protocol, but the speedup may not be obvious; it depends on the data.
import time
import pickle
import numpy as np
import os

# decorator that reports how long a function call takes
def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} s'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data, filepath):
    with open(filepath, 'wb') as f:
        # use the highest protocol
        pickle.dump(data, f, -1)

@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = pickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data, filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)
pickle_dump took: 1.731525182723999 s
pickle_load took: 1.7664134502410889 s
Use cPickle instead of pickle
In Python 2, the easiest speedup was to replace pickle with cPickle: a module with exactly the same interface and parameters, but written in C and therefore much faster. In Python 3 the C implementation lives in the _pickle module, and the standard pickle module already uses it automatically when available, which is why the timings below are essentially identical to the plain pickle run.
import time
# how to import cPickle in Python 3
import _pickle as cPickle
import numpy as np
import os

# decorator that reports how long a function call takes
def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} s'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data, filepath):
    with open(filepath, 'wb') as f:
        # use the highest protocol
        cPickle.dump(data, f, -1)

@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = cPickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data, filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)
pickle_dump took: 1.7443737983703613 s
pickle_load took: 1.7894999980926514 s
Disable garbage collection
The garbage collector can slow down unpickling, since loading creates many objects in a short time; temporarily disabling it can improve performance.
import time
import pickle
import numpy as np
import os
import gc

# disable garbage collection
gc.disable()

# decorator that reports how long a function call takes
def time_count(func):
    def inner(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print('{} took: {} s'.format(func.__name__, end - start))
        return result
    return inner

@time_count
def pickle_dump(data, filepath):
    with open(filepath, 'wb') as f:
        # use the highest protocol
        pickle.dump(data, f, -1)

@time_count
def pickle_load(filepath):
    with open(filepath, 'rb') as f:
        data = pickle.load(f)
    return data

data = np.ones((10000, 10000))
filepath = "file.dat"
pickle_dump(data, filepath)
pickle_load(filepath)
os.remove(filepath)
time.sleep(2)
# re-enable garbage collection
gc.enable()
pickle_dump took: 1.8271889686584473 s
pickle_load took: 1.7800366878509521 s
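If you use this trick in real code, it is safer to wrap the disable/enable pair in try/finally so the collector is re-enabled even when loading fails. A small sketch, assuming a hypothetical helper name load_fast:

```python
import gc
import pickle

def load_fast(filepath):
    """Unpickle a file with the garbage collector temporarily disabled."""
    gc.disable()       # skip GC passes while load() creates many objects
    try:
        with open(filepath, 'rb') as f:
            return pickle.load(f)
    finally:
        gc.enable()    # always restore the collector, even on error
```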