How to package your own dataset into a DataLoader? Four ways to do it!

Preface

We often run into this situation: the data we want to use is not one of the library's built-in datasets. In that case, whether we are training or testing, we have to index the data ourselves and pull out batch_size samples each time. My first thought (assuming I have 100 samples) was to build an index array myself, for example idx = np.arange(0, 100, 1), give it a shuffle(idx) first, and then take batch_size indices at a time to slice out a batch, but that always felt a bit unnatural.
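A minimal sketch of that manual approach (the tensors and names here are placeholders I made up purely for illustration) might look like this:

import numpy as np
import torch

train_X = torch.randn(100, 28, 28)   # 100 fake samples
train_Y = torch.randn(100)
batch_size = 10

idx = np.arange(100)                 # one index per sample
np.random.shuffle(idx)               # shuffle once per epoch
for start in range(0, len(idx), batch_size):
    batch_idx = torch.from_numpy(idx[start:start + batch_size])  # take batch_size indices each time
    xb, yb = train_X[batch_idx], train_Y[batch_idx]
    # ... feed xb, yb to the model

It works, but DataLoader already handles the shuffling and batching for us, which is what the methods below rely on.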

Method 1: Inherit Dataset & DataLoader

from torch.utils.data import Dataset,DataLoader
import torch
train_X = torch.randn(100,28,28)
train_Y = torch.randn(100)
class Dataset(Dataset):
    def __init__(self,x,y):
        self.x = x
        # self.x = x.unsqueeze(dim=1)  # optionally add a channel dimension here instead
        self.y = y
    
    def __len__(self):
        return len(self.x)
    
    def __getitem__(self,idx):
        return self.x[idx],self.y[idx]
train_DL = DataLoader(Dataset(train_X,train_Y),batch_size=10,shuffle=True)
for idx,(xb,yb) in enumerate(train_DL):
    print(idx,xb.shape)

The official documentation explains it as follows (screenshot omitted):

  • In the example above, I created a subclass with the same name as Dataset
  • As required, I overrode __getitem__ and __len__
  • Then I passed an instance of this class to DataLoader and set the parameters batch_size and shuffle

Method 2: Inherit Dataset & Encapsulate DataLoader

class DataLoader:
    def __init__(self, dl, func):
        self.dl = dl      # the DataLoader we are wrapping
        self.func = func  # the preprocessing function applied to every batch
    def __len__(self):
        return len(self.dl)
    def __iter__(self):
        batches = iter(self.dl)
        for b in batches:
            yield (self.func(*b))

Building on Method 1, here we wrap DataLoader ourselves, where func can be a custom function used to process the data as it is loaded, for example a function like def preprocess(x,y): return x.view(...), y.
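A rough sketch of how this wrapper might be used together with the Dataset and tensors from Method 1 (preprocess and the flattening it does are just illustrative; TorchDataLoader is only an alias to avoid the name clash with the wrapper class above):

def preprocess(x, y):
    # example transform: flatten each 28x28 image into a 784-dim vector
    return x.view(x.size(0), -1), y

from torch.utils.data import DataLoader as TorchDataLoader  # avoid clashing with the wrapper above
base_DL = TorchDataLoader(Dataset(train_X, train_Y), batch_size=10, shuffle=True)
train_DL = DataLoader(base_DL, preprocess)  # the wrapper class defined above
for xb, yb in train_DL:
    print(xb.shape, yb.shape)  # torch.Size([10, 784]) torch.Size([10])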

Method 3: TensorDataset & DataLoader

I basically use this method, just two lines of code:

from torch.utils.data import TensorDataset,DataLoader
Train_DS = TensorDataset(train_x,train_y)
Train_DL = DataLoader(Train_DS,shuffle=True,batch_size = 10)
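Iterating over it works exactly the same way as before; assuming train_x and train_y are tensors like the train_X/train_Y from Method 1, a quick sanity check might look like:

for idx, (xb, yb) in enumerate(Train_DL):
    print(idx, xb.shape, yb.shape)  # e.g. 0 torch.Size([10, 28, 28]) torch.Size([10])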

Advanced: DataLoader's collate_fn parameter setting

The examples I had seen were all about images, so let me talk about the problem I ran into in sentiment classification. Just as pictures can differ in size, sentences differ in their number of words. Within a batch, how do we make sure every sample has the same dimensions? A sentence's shape is number_of_words * word_vector_dimension; the latter is fixed, so we don't need to worry about it. What we have to do is make the number of words the same for every sentence in the batch. The strategy is to take the sentence with the most words in the batch as the benchmark and pad all the shorter sentences up to it (just append all-zero vectors). The collate_fn parameter of DataLoader can implement exactly this kind of "padding" process.

Since I'm using a somewhat naive method for now, let me express the general idea in code first and add the complete code later:

size = 100  # the dimension of the vector that represents each word
def collate_fn(batch):
    batch.sort(key = lambda x: len(x[0]), reverse=True)  # sort by sentence length, descending
    x, y = zip(*batch)
    pad_x = []
    nums = []
    max_num = len(x[0])  # after sorting, the first sentence is the longest
    for i in range(len(x)):
        temp = x[i]  # num_of_words * size
        nums.append(len(temp))  # record the original length before padding
        for j in range(max_num - len(temp)):
            temp.append([0] * size)  # pad with all-zero word vectors
        pad_x.append(temp)
    return pad_x, y, nums
# Then
Train_DS = TensorDataset(train_x,train_y)
Train_DL = DataLoader(Train_DS,shuffle=True,batch_size = 10,collate_fn = collate_fn)

Do you understand? If not, I'll try my best to explain (after all, my level is limited):

First of all, the batch passed into collate_fn is a list! Its length is batch_size, i.e. it contains batch_size elements. Each element is itself a pair: a sample from train_x and the corresponding label from train_y. The problem we want to solve is that the train_x sample's 0th dimension (the number of words) differs from sample to sample.
Then, we sort the batch by the 0th dimension of the first element of each pair (i.e. the train_x sample), in descending order.
The rest is easy to follow: max_num is the maximum number of words; for every train_x sample with fewer words than that, we compute max_num - len(temp), i.e. how many words (n) are missing, and then append n zero vectors of length size.
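To make that concrete, here is a tiny hand-made check of the collate_fn above (the toy sentences and labels are made up purely for illustration, and size is temporarily shrunk to 3 so the output stays readable):

size = 3  # temporarily shrink the word-vector dimension for this toy example
batch = [
    ([[1] * size, [2] * size], 0),             # a 2-word sentence, label 0
    ([[5] * size], 1),                         # a 1-word sentence, label 1
    ([[7] * size, [8] * size, [9] * size], 2), # a 3-word sentence, label 2
]
pad_x, y, nums = collate_fn(batch)
print([len(s) for s in pad_x])  # [3, 3, 3] -- every sentence is padded to 3 words
print(nums)                     # [3, 2, 1] -- original lengths, in the sorted (descending) order
print(y)                        # (2, 0, 1) -- labels follow the same reordering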

Reference for the above: the blog post "PyTorch implements flexible data reading".

In addition, the collate_fn parameter can also be used to handle the samples left over when the dataset size is not evenly divisible by batch_size;
see this blog for details: Pytorch Tip 1: DataLoader's collate_fn parameter
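For a quick sense of that problem (this little sketch is mine, not from the blog above): with 100 samples and batch_size=32 the last batch simply comes out smaller, and DataLoader also has a drop_last parameter if you just want to discard it:

import torch
from torch.utils.data import TensorDataset, DataLoader
ds = TensorDataset(torch.randn(100, 8), torch.randn(100))
print([xb.shape[0] for xb, yb in DataLoader(ds, batch_size=32)])                  # [32, 32, 32, 4]
print([xb.shape[0] for xb, yb in DataLoader(ds, batch_size=32, drop_last=True)])  # [32, 32, 32]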

Update

Well, once you actually get used to DataLoader, collate_fn turns out not to be "advanced" at all, but basic (especially in NLP).

Summary

In general, Method 3 plus the collate_fn parameter should be enough. Although Method 2 lets you customize func, how to read batch_size samples together and how to implement shuffling then becomes another problem to solve. By contrast, Method 1 also lets you customize the preprocessing: in the step where you define the dataset, you can perform operations such as adding a dimension with unsqueeze(dim=1).

If anything above is wrong or unclear, please point it out in the comment section. Thank you!

Origin blog.csdn.net/jokerxsy/article/details/106504932