Pandas supports multiple storage formats. In this article, we will test and compare the read speed, write speed, and file size of a Pandas DataFrame across these formats.
Create a test DataFrame
First, create a test Pandas DataFrame containing different types of data.
import pandas as pd
import string
import numpy as np

# Config
df_length = 10**6
start_date = '2023-01-01'
all_string = list(string.ascii_letters + string.digits)
string_length = 10
min_number = 0
max_number = 10**3

# Create columns
date_col = pd.date_range(start=start_date, periods=df_length, freq='H')
str_col = [''.join(np.random.choice(all_string, string_length)) for i in range(df_length)]
float_col = np.random.rand(df_length)
int_col = np.random.randint(min_number, max_number, size=df_length)

# Create DataFrame
df = pd.DataFrame({'date_col': date_col,
                   'str_col': str_col,
                   'float_col': float_col,
                   'int_col': int_col})
df.info()
df.head()
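As a quick sanity check (a sketch that is not part of the benchmark, shrunk to 100 rows so it runs instantly), the frame should contain one datetime, one string, one float, and one integer column:

```python
import string
import numpy as np
import pandas as pd

# Same recipe as above, with a tiny row count
n = 100
df_small = pd.DataFrame({
    'date_col': pd.date_range(start='2023-01-01', periods=n, freq='h'),
    'str_col': [''.join(np.random.choice(list(string.ascii_letters + string.digits), 10))
                for _ in range(n)],
    'float_col': np.random.rand(n),
    'int_col': np.random.randint(0, 10**3, size=n),
})
print(df_small.dtypes)  # datetime64[ns], object, float64, int
```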
Store in different formats
Next, create a test function that writes and reads the DataFrame in each format and measures the results.
import time
import os

def check_read_write_size(df, file_name, compression=None):
    fmt = file_name.split('.')[-1]
    # Write
    begin = time.time()
    if file_name.endswith('.csv'): df.to_csv(file_name, index=False, compression=compression)
    elif file_name.endswith('.parquet'): df.to_parquet(file_name, compression=compression)
    elif file_name.endswith('.pickle'): df.to_pickle(file_name, compression=compression)
    elif file_name.endswith('.orc'): df.to_orc(file_name)
    elif file_name.endswith('.feather'): df.to_feather(file_name)
    elif file_name.endswith('.h5'): df.to_hdf(file_name, key='df')
    write_time = time.time() - begin
    # Read
    begin = time.time()
    if file_name.endswith('.csv'): pd.read_csv(file_name, compression=compression)
    elif file_name.endswith('.parquet'): pd.read_parquet(file_name)
    elif file_name.endswith('.pickle'): pd.read_pickle(file_name, compression=compression)
    elif file_name.endswith('.orc'): pd.read_orc(file_name)
    elif file_name.endswith('.feather'): pd.read_feather(file_name)  # without this, feather's read was never timed
    elif file_name.endswith('.h5'): pd.read_hdf(file_name)
    read_time = time.time() - begin
    # File size
    file_size_mb = os.path.getsize(file_name) / (1024 * 1024)
    return [fmt, compression, read_time, write_time, file_size_mb]
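A small refinement worth knowing about (a sketch, not part of the original benchmark): time.time() follows the system clock and can jump if it is adjusted, while time.perf_counter() is monotonic and higher resolution, so it is the safer choice for interval timing:

```python
import time

def timed(fn, *args, **kwargs):
    # perf_counter is monotonic and high-resolution, so the measurement
    # cannot be skewed by system clock adjustments the way time.time() can
    begin = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - begin

# Example: time an arbitrary call
total, seconds = timed(sum, range(10**5))
```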
Then run the function and store the results in another Pandas DataFrame.
test_case = [
    ['df.csv', 'infer'],
    ['df.csv', 'gzip'],
    ['df.pickle', 'infer'],
    ['df.pickle', 'gzip'],
    ['df.parquet', 'snappy'],
    ['df.parquet', 'gzip'],
    ['df.orc', 'default'],
    ['df.feather', 'default'],
    ['df.h5', 'default'],
]
result = []
for i in test_case:
    result.append(check_read_write_size(df, i[0], compression=i[1]))
result_df = pd.DataFrame(result, columns=['format', 'compression', 'read_time', 'write_time', 'file_size'])
result_df
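Once the loop finishes, sorting the result frame is enough to rank the formats. A sketch using the same column names as above (the numbers here are illustrative placeholders, not measured results):

```python
import pandas as pd

# Illustrative numbers only -- the real values come from the benchmark above
result_df = pd.DataFrame(
    [['csv',     'infer',   2.1, 4.0, 260.0],
     ['parquet', 'snappy',  0.3, 0.5, 110.0],
     ['feather', 'default', 0.2, 0.3, 200.0]],
    columns=['format', 'compression', 'read_time', 'write_time', 'file_size'])

fastest_read = result_df.sort_values('read_time').iloc[0]['format']
smallest = result_df.sort_values('file_size').iloc[0]['format']
```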
Test Results
The graphs and tables below show the test results.
Let's walk through a simple analysis of them.
CSV
- The largest uncompressed file size
- The compressed size is small, but not the smallest
- The slowest read and write speeds of all formats
Pickle
- Average overall
- But its compressed write speed is the slowest
Feather
- The fastest read and write speeds; the file size is middling
ORC
- The smallest file size of all formats
- Nearly the fastest read and write speeds
Parquet
- Overall fast and small, but neither the fastest nor the smallest
Summary
Judging by the results, should we always use ORC or Feather instead of CSV?
"It depends on your system."
If you're working on a solo project, it definitely makes sense to use the fastest or smallest format.
But most of the time we have to work with others, so there are more factors to weigh than just speed and size.
Uncompressed CSV may be slow and large, but when data needs to be sent to another system, it's the easiest format to exchange.
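That interoperability point is easy to demonstrate: a CSV written by Pandas can be consumed with nothing but the standard library. A sketch with made-up values shaped like the test frame's columns:

```python
import csv
import io

# A payload shaped like the output of df.to_csv(file_name, index=False)
payload = 'date_col,str_col,float_col,int_col\n2023-01-01 00:00:00,aB3xY9kQ2z,0.42,7\n'
rows = list(csv.DictReader(io.StringIO(payload)))
# Every field arrives as plain text; the receiving system decides the types
```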
As a traditional big data processing format (originating in Hive), ORC is the best for speed and size optimization. Parquet is larger and slower than ORC, but it strikes the best balance between speed and size, and many ecosystems support it, so Parquet is a good first choice when you need to process large files.
https://avoid.overfit.cn/post/387acc48c7dd42a49f7bec90cc6d09ae
By Chanon Krittapholchai