I saw a problem on StackOverflow a few years ago, but I can't find it now; I jotted down a few notes in my drafts at the time.
It is a good learning case, and it also points out a few things to watch out for.
When the dataset being loaded grows large, the system may freeze or even crash. In this case, 33 GB of data is compressed down to 3.7 GB.
Dependent libraries
import pandas as pd
import gc
import glob
import os
pandas is one of the commonly used libraries for data analysis and processing
The gc (garbage collector) module frees memory when processing large amounts of data by removing objects that are no longer needed
The glob library finds files on the system that match a given pattern
The os library interacts with the operating system and handles files and their paths
Chunk
chunk_size = 500000  # rows per chunk; adjust to your task and machine
num = 1
for chunk in pd.read_csv('test_data.csv', chunksize=chunk_size):
    chunk.to_csv('chunk' + str(num) + '.csv', index=False)  # write each chunk to its own file
    gc.collect()  # release memory before reading the next chunk
    num += 1
Here you cannot load the data file directly as you would with a small dataset; doing so will crash the system
You can use "chunksize" to split the file; 500,000 rows per chunk is chosen here, and the value can be tuned to your task and machine performance
The call to 'gc.collect()' here is simple but crucial for avoiding memory errors
After running, you get several chunk files
Next, check that a chunk can be read and inspect its information, as sketched below
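A minimal sketch, assuming the first chunk was written as 'chunk1.csv' by the loop above:
chunk1 = pd.read_csv('chunk1.csv')
chunk1.info(memory_usage='deep')  # prints each column's dtype and the total memory usage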
As the printout shows, 185 of the 190 columns are of float64 type, which reflects a common issue: pandas loads floating-point data as float64 by default
Optimizing this part can cut down a good share of the dataset's memory
So convert it to 'float16' or 'float32' to minimize memory usage; here it is converted to 'float16'
Key point: be sure to check whether this conversion loses precision on your dataset!!!
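A rough sketch of the downcast, assuming 'chunk1' is the dataframe loaded above:
float_cols = chunk1.select_dtypes(include='float64').columns  # the 185 float64 columns
chunk1[float_cols] = chunk1[float_cols].astype('float16')  # downcast to float16
chunk1.info(memory_usage='deep')  # memory usage should drop sharply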
You can see that after converting to "float16", the memory usage is greatly reduced
Optimize and stitch chunk files
Previously we read and optimized a single chunk file; now all 23 chunk files will be concatenated and their memory optimized
Use 'glob' and 'os' to locate the corresponding files, i.e. ("*.csv")
Then read all the files in a loop, converting them from 'float64' to 'float16' during the iteration and storing them in a list
Then concatenate them all into a new dataframe, as sketched below
Data processing can then be done quickly on this dataframe
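A minimal sketch, assuming the chunk files written above sit in the current working directory and that the pattern 'chunk*.csv' matches only those files:
files = glob.glob(os.path.join(os.getcwd(), 'chunk*.csv'))  # locate all chunk files
dfs = []
for f in files:
    df = pd.read_csv(f)
    float_cols = df.select_dtypes(include='float64').columns
    df[float_cols] = df[float_cols].astype('float16')  # downcast while iterating
    dfs.append(df)
data = pd.concat(dfs, ignore_index=True)  # all chunks in one dataframe
gc.collect()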
Convert dataframe to file format
The optimized dataframe can be converted to any file format
Feather is recommended because it is lighter, as shown below
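A one-line sketch; pandas' to_feather requires the pyarrow package, and 'test_data.feather' is just an example filename:
data.to_feather('test_data.feather')  # write the optimized dataframe in the Feather format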
Read the optimized file
Reading the optimized feather format file again will not cause any memory errors
Data processing operations can be performed normally
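For example, again a sketch using the filename assumed above:
data = pd.read_feather('test_data.feather')
data.info(memory_usage='deep')  # loads without memory errors; ready for normal processing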
More useful content is available on my official account [Graduate student who knows everything]