Converting a 33G dataset to 3G: a case worth learning

I saw a question on StackOverflow a few years ago, but I can't find it now; I had jotted down some notes about it in my drafts at the time.

It is a good learning case, and it also highlights a few points that need attention.

When the dataset being loaded gets large, the system may freeze or crash. In this case, 33G of data is compressed down to 3.7G.

Dependent libraries

import pandas as pd
import gc
import glob
import os

pandas is one of the most commonly used libraries for data analysis and processing.

gc is the garbage collector interface; when processing large amounts of data it frees memory by removing objects that are no longer needed.

The glob library finds files on the system that match a given pattern.

The os library interacts with the operating system and handles files and their paths

Chunk

chunk_size = 500000   # number of rows per chunk
num = 1
for chunk in pd.read_csv('test_data.csv', chunksize=chunk_size):
    chunk.to_csv('chunk' + str(num) + '.csv', index=False)   # write each chunk to its own file
    gc.collect()                                             # release the chunk's memory
    num += 1

Here the data file cannot be loaded directly the way a small dataset can; doing so will cause a crash.

The "chunksize" parameter splits the file; here each chunk is 500,000 rows, a value that can be tuned to the task and the machine's performance.

The call to 'gc.collect()' here is crucial: it frees memory between chunks and helps avoid memory errors.

After running this, we get a number of chunk files, which can be checked as sketched below.
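A quick way to confirm that the chunk files were written is to list them with glob (a minimal sketch; the file names follow the chunkN.csv pattern used in the loop above):

chunk_files = glob.glob('chunk*.csv')
print(len(chunk_files), 'chunk files found')   # e.g. chunk1.csv, chunk2.csv, ...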

Next, check that a single chunk can be read back and inspect its information, as sketched below.
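A minimal sketch of that check, assuming the first chunk file is named chunk1.csv as in the loop above:

df_chunk = pd.read_csv('chunk1.csv')
df_chunk.info()                         # column count, dtypes and approximate memory usage
print(df_chunk.dtypes.value_counts())   # how many columns of each dtype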

As the printout shows, 185 of the 190 columns are of float64 type, which reflects a common issue: pandas loads floating-point data as float64 by default.

Optimizing these columns can cut a sizeable part of the dataset's memory usage.

So convert these columns to 'float16' or 'float32' to minimize memory usage; here they are converted to 'float16', as in the sketch below.
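A sketch of the conversion, assuming df_chunk is the single chunk read above; only the float64 columns are downcast:

print(df_chunk.memory_usage(deep=True).sum() / 1024 ** 2)      # memory in MB before

float_cols = df_chunk.select_dtypes(include='float64').columns
df_chunk[float_cols] = df_chunk[float_cols].astype('float16')  # float64 -> float16

print(df_chunk.memory_usage(deep=True).sum() / 1024 ** 2)      # memory in MB after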

Key point: check whether this conversion loses precision on your own dataset before applying it!
You can see that after converting to "float16", the memory usage is greatly reduced

Optimize and stitch chunk files

Having read and optimized a single chunk file, we will now splice together all 23 chunk files and optimize their memory.


Use 'glob' and 'os' to locate the corresponding files, i.e. those matching ("*.csv").

Then read all the files in a loop, converting 'float64' columns to 'float16' during the iteration and collecting the results in a list.

Finally, concatenate everything into a new dataframe, as in the sketch below.
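A sketch of that loop, assuming the chunk files from the previous step sit in the current working directory:

optimized = []
for path in glob.glob(os.path.join(os.getcwd(), 'chunk*.csv')):
    df = pd.read_csv(path)
    float_cols = df.select_dtypes(include='float64').columns
    df[float_cols] = df[float_cols].astype('float16')   # downcast during the iteration
    optimized.append(df)
    gc.collect()                                        # free the intermediate memory

df_all = pd.concat(optimized, ignore_index=True)        # splice all chunks into one dataframe
df_all.info()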


Data processing can now be done quickly on this dataframe.

Convert dataframe to file format

The optimized dataframe can be converted to any file format

Feather is recommended because it is lighter, as shown below.
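A sketch of the export using pandas' built-in to_feather, which needs the pyarrow package installed; the output file name here is just an example:

df_all.to_feather('test_data.feather')   # write the optimized dataframe in Feather format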


Read the optimized file

Reading the optimized Feather-format file back in does not cause any memory errors.

Data processing operations can be performed normally
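A minimal sketch of reading the file back, reusing the example file name from above:

df = pd.read_feather('test_data.feather')
print(df.shape)   # rows x columns
df.head()         # normal processing works from here on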

More practical content is available on the official account [Graduate student who knows everything].

Origin: blog.csdn.net/zzh516451964zzh/article/details/129334725