Pandas DataFrame data storage format comparison

Pandas supports multiple storage formats. In this article, we will test and compare the read speed, write speed, and file size of a Pandas DataFrame stored in different formats.

Create a test DataFrame

First, create a test Pandas DataFrame containing different types of data.

 import pandas as pd
 import numpy as np
 import string
 
 # Config DF
 df_length = 10**6
 start_date = '2023-01-01'
 all_string = list(string.ascii_letters + string.digits)
 string_length = 10
 min_number = 0
 max_number = 10**3
 
 # Create Columns
 date_col = pd.date_range(start=start_date, periods=df_length, freq='H')
 str_col = [''.join(np.random.choice(all_string, string_length)) for _ in range(df_length)]
 float_col = np.random.rand(df_length)
 int_col = np.random.randint(min_number, max_number, size=df_length)
 
 # Create DataFrame
 df = pd.DataFrame({'date_col': date_col,
                    'str_col': str_col,
                    'float_col': float_col,
                    'int_col': int_col})
 df.info()
 df.head()
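
For reference, the expected dtypes of the four columns are shown below as a quick check (the integer column may come out as int32 on Windows, since NumPy uses the platform default there):

 print(df.dtypes)
 # date_col     datetime64[ns]
 # str_col      object
 # float_col    float64
 # int_col      int64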

Store in different formats

Next create test functions to read and write in different formats.

 import time
 import os
 
 def check_read_write_size(df, file_name, compression=None):
     fmt = file_name.split('.')[-1]
     # Write
     begin = time.time()
     if file_name.endswith('.csv'): df.to_csv(file_name, index=False, compression=compression)
     elif file_name.endswith('.parquet'): df.to_parquet(file_name, compression=compression)
     elif file_name.endswith('.pickle'): df.to_pickle(file_name, compression=compression)
     elif file_name.endswith('.orc'): df.to_orc(file_name)
     elif file_name.endswith('.feather'): df.to_feather(file_name)
     elif file_name.endswith('.h5'): df.to_hdf(file_name, key='df')
     write_time = time.time() - begin
     # Read
     begin = time.time()
     if file_name.endswith('.csv'): pd.read_csv(file_name, compression=compression)
     elif file_name.endswith('.parquet'): pd.read_parquet(file_name)
     elif file_name.endswith('.pickle'): pd.read_pickle(file_name, compression=compression)
     elif file_name.endswith('.orc'): pd.read_orc(file_name)
     elif file_name.endswith('.feather'): pd.read_feather(file_name)
     elif file_name.endswith('.h5'): pd.read_hdf(file_name)
     read_time = time.time() - begin
     # File Size
     file_size_mb = os.path.getsize(file_name) / (1024 * 1024)
     return [fmt, compression, read_time, write_time, file_size_mb]
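
Note that pandas hands several of these formats off to optional backends: Parquet, ORC, and Feather go through pyarrow by default, and HDF5 needs tables (PyTables), so install those packages first. With that in place, a single call sanity-checks the helper:

 # One CSV round-trip; returns
 # [format, compression, read_time, write_time, file_size_mb]
 check_read_write_size(df, 'df.csv', compression='infer')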

Then run the function for every test case and store the results in another Pandas DataFrame.

 test_cases = [
     ['df.csv', 'infer'],
     ['df.csv', 'gzip'],
     ['df.pickle', 'infer'],
     ['df.pickle', 'gzip'],
     ['df.parquet', 'snappy'],
     ['df.parquet', 'gzip'],
     ['df.orc', 'default'],      # 'default' is only a label in the results;
     ['df.feather', 'default'],  # the orc/feather/h5 writers above use their
     ['df.h5', 'default'],       # libraries' built-in defaults
 ]
 
 result = []
 for file_name, compression in test_cases:
     result.append(check_read_write_size(df, file_name, compression=compression))
 
 result_df = pd.DataFrame(result, columns=['format', 'compression', 'read_time', 'write_time', 'file_size_mb'])
 result_df

Test Results

The results in result_df can be charted for comparison; a simple analysis of each format follows.
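A minimal sketch for drawing those comparison charts, assuming matplotlib is installed:

 import matplotlib.pyplot as plt
 
 # One bar chart per metric, labeled by format + compression
 labels = result_df['format'] + '-' + result_df['compression']
 fig, axes = plt.subplots(1, 3, figsize=(15, 4))
 for ax, metric in zip(axes, ['read_time', 'write_time', 'file_size_mb']):
     ax.bar(labels, result_df[metric])
     ax.set_title(metric)
     ax.tick_params(axis='x', rotation=90)
 plt.tight_layout()
 plt.show()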

CSV

  • The largest uncompressed file size
  • Compressed size is small, but not the smallest
  • The slowest read and write speeds of all formats

Pickle

  • Average performance overall
  • But its write speed with compression is the slowest

Feather

  • The fastest read and write speeds
  • File size is middling

ORC

  • The smallest file size of all formats
  • Very fast read and write speeds, nearly the fastest

Parquet

  • Fast and small overall, but neither the fastest nor the smallest

Summary

Judging by these results, should we use ORC or Feather instead of CSV?

"It depends on your system."

If you're working on a solo project, it definitely makes sense to use the fastest or smallest format.

But most of the time we have to work with others, so there are more factors to consider than just speed and size.

Uncompressed CSV can be slow and large, but when the data needs to be sent to another system, it is the easiest format to exchange.

As a traditional big data processing format (originating in Hive), ORC is the best optimized for speed and size. Parquet is slightly larger and slower than ORC, but it achieves the best balance between speed and size, and many ecosystems support it, so Parquet is a good first choice when you need to process large files.
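
As one illustration of that ecosystem support, the df.parquet file written above can be read directly with pyarrow, with no pandas on the consuming side; a minimal sketch, assuming pyarrow is installed:

 import pyarrow.parquet as pq
 
 # Load the Parquet file written earlier as an Arrow table
 table = pq.read_table('df.parquet')
 print(table.schema)    # column names and types
 print(table.num_rows)  # 1,000,000

Engines such as Spark, Hive, and DuckDB can read the same file without any conversion.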

https://avoid.overfit.cn/post/387acc48c7dd42a49f7bec90cc6d09ae

By Chanon Krittapholchai
