PyTorch torch.save() produces huge files for small tensors from MNIST

santiagonasar :

I'm working with the MNIST dataset from the Kaggle challenge and am having trouble preprocessing the data. I also don't know what the best practices are and was wondering if you could advise me on that.

Disclaimer: I can't just use torchvision.datasets.mnist because I need to use Kaggle's data for training and submission.

In this tutorial, it was advised to create a Dataset object that loads .pt tensors from files, to fully utilize the GPU. To do that, I needed to load the CSV data provided by Kaggle and save it as .pt files:

import pandas as pd
import torch
import numpy as np

# import data
digits_train = pd.read_csv('data/train.csv')

# 'label' is the name of the label column in Kaggle's train.csv
train_tensor = torch.tensor(digits_train.drop('label', axis=1).to_numpy(), dtype=torch.int)
labels_tensor = torch.tensor(digits_train['label'].to_numpy())

for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i], "data/train-" + str(i) + ".pt")

Each train_tensor[i].shape is torch.Size([1, 784])

However, each such .pt file is about 130 MB. A tensor of the same size, filled with randomly generated integers, takes 6.6 kB. Why are these files so huge, and how can I reduce their size?

The dataset has 42,000 samples. Should I even bother with batching this data? Should I bother with saving tensors to separate files, rather than loading them all into RAM and then slicing into batches? What is the best approach here?

The Floe :

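As explained in this discussion, train_tensor[i] is a view that shares its storage with the full tensor, and torch.save() serializes that whole underlying storage, not just the slice. That is why every file is as large as the entire dataset. To save only the slice, copy it explicitly with clone() first, e.g. torch.save(train_tensor[i].clone(), ...).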

Don't worry, at runtime the data is only allocated once unless you explicitly create copies.
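To illustrate the difference between saving a view and saving a clone, here is a minimal sketch. It uses random integers as a stand-in for the Kaggle data (1000 rows instead of 42,000), and saved_size is a hypothetical helper introduced only for this example; it writes to an in-memory buffer rather than to disk.

```python
import io

import torch


def saved_size(t: torch.Tensor) -> int:
    """Return the number of bytes torch.save() writes for tensor t."""
    buf = io.BytesIO()
    torch.save(t, buf)
    return buf.getbuffer().nbytes


# Stand-in for the Kaggle data: 1000 rows of 784 int32 pixels (random values).
data = torch.randint(0, 256, (1000, 784), dtype=torch.int32)

row_view = data[0]          # a view that shares storage with `data`
row_copy = data[0].clone()  # an independent copy with its own 784-element storage

view_bytes = saved_size(row_view)  # includes the storage of all 1000 rows
copy_bytes = saved_size(row_copy)  # includes only the single row
print(view_bytes, copy_bytes)
```

The view should come out at roughly the size of the whole tensor (1000 * 784 * 4 bytes plus some overhead), while the clone should only be a few kilobytes.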

As general advice: if the data easily fits into memory, just load it all at once. For MNIST at about 130 MB, that's certainly the case.

However, I would still train on mini-batches, because training usually converges faster that way. Look up the advantages of SGD for more details.
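One possible way to combine both suggestions is to keep everything in memory and let a DataLoader do the batching. The sketch below assumes random stand-in tensors with MNIST-like shapes; the batch size of 64 and the scaling to [0, 1] are illustrative choices, not something prescribed above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the tensors built from train.csv (random values, assumed shapes).
pixels = torch.randint(0, 256, (1000, 784), dtype=torch.uint8)
labels = torch.randint(0, 10, (1000,), dtype=torch.int64)

# Convert once to float and scale to [0, 1]; the whole set stays in RAM.
features = pixels.float() / 255.0

# TensorDataset pairs each row with its label; DataLoader shuffles and batches.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Fetch a single mini-batch to inspect its shape.
xb, yb = next(iter(loader))
print(xb.shape, yb.shape, len(loader))
```

With this setup there is no need to write one file per sample; each epoch simply iterates over loader.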
