Understand the PyTorch data reading mechanism in one article!

Anyone familiar with deep learning knows that model training consists of five modules: data, model, loss function, optimizer, and iterative training. As shown in the figure below, the PyTorch data reading mechanism is the main branch of the data module.

PyTorch data reading is handled by Dataset + DataLoader, where:

  • Dataset: defines the dataset. It maps the raw data samples and their corresponding labels into the Dataset so that data can later be read by index. Preprocessing operations such as data format conversion and data augmentation can also be performed in the Dataset.

  • DataLoader: iteratively reads the dataset. It batches samples, shuffles their order, and so on, making it convenient to read data iteratively during training.

Dataset

Dataset solves the problem of where and how to read data. The Dataset provided by PyTorch is an abstract class: every custom dataset must inherit from Dataset and override the __init__(), __getitem__(), and __len__() methods so that they can be called directly by the DataLoader class.

  • __init__: initializes the dataset.

  • __getitem__: defines how to fetch the sample at a given index and returns the sample pair (sample data x, label y) for that index.

  • __len__: returns the total number of samples in the dataset.

The following code implements a custom Dataset, taking the CIFAR-10 dataset as an example.

from torch.utils.data import Dataset  
from PIL import Image  
import os  
  
class MyData(Dataset):  
    """  
    Step 1: inherit from the torch.utils.data.Dataset class  
    """  
    def __init__(self, data_dir, label_dir):  
        """  
        Step 2: implement __init__ to initialize the dataset and  
        map the samples and labels into a list  
        """  
        self.data_dir = data_dir  
        self.label_dir = label_dir  
        # Joining the paths with os.path.join avoids errors caused by "/"  
        self.path = os.path.join(self.data_dir, self.label_dir)  
        # Turn all files under this path into a list  
        self.img_path = os.listdir(self.path)  
  
    def __getitem__(self, idx):  
        """  
        Step 3: implement __getitem__ to define how to fetch data for a  
        given index and return a single sample (data, corresponding label)  
        """  
        # Take the image for this index from the list;  
        # each element of img_path is an image file name  
        img_name = self.img_path[idx]  
        # Build the full path of the image  
        img_item_path = os.path.join(self.data_dir, self.label_dir, img_name)  
        # Open the image at that path with PIL's Image tool  
        img = Image.open(img_item_path)  
        label = self.label_dir  
        # Return the image and its label  
        return img, label  
  
    def __len__(self):  
        """  
        Step 4: implement __len__ to return the total number of samples  
        """  
        return len(self.img_path)  
  
# data_dir and label_dir are the directories of your own dataset  
train_custom_dataset = MyData(data_dir, label_dir)  
test_custom_dataset = MyData(data_dir, label_dir)  
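
With this class, samples can be fetched by index exactly the way DataLoader does internally. A minimal usage sketch, assuming a hypothetical layout in which "dataset/train/ants" holds all images labeled "ants":

# Hypothetical paths; replace with your own dataset layout  
train_dataset = MyData("dataset/train", "ants")  
  
print(len(train_dataset))        # total sample count, via __len__  
img, label = train_dataset[0]    # one sample pair, via __getitem__  
print(img.size, label)           # PIL image size and the label string  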
  



DataLoader

In real projects, when the amount of data is large, limited memory and I/O speed make it impossible to load all of the data into memory at once, or to load it with a single process during training. What is needed instead is multi-process, iterative loading, and that is exactly what DataLoader provides.

DataLoader is an iterable data loader. It combines a dataset and a sampler, provides an iterable over the given dataset, and takes care of assembling individual samples into batches.

The DataLoader module in PyTorch's data reading mechanism contains two sub-modules, Sampler and Dataset: the Sampler module generates indices, and the Dataset module reads the data for those indices. The data reading process of DataLoader is shown in the figure below.

  • DataLoader: enter the DataLoader module.

  • DataLoaderIter: enter the __iter__ function, decide whether multiple worker processes are used, and branch into the corresponding reading mechanism.

  • Sampler: select, by sampling, the data to read for each batch and return the indices of that data.

  • index: a list of batch_size indices.

  • DatasetFetcher: fetch the data corresponding to the indices.

  • Dataset: call dataset[idx] to obtain each piece of data and collect the results into a list.

  • getitem: the core of Dataset, which retrieves data by index.

  • img, label: the data that has been read.

  • collate_fn: convert the read data from a list into batch form (a minimal sketch follows this list).

  • Batch Data: data in batch form; the first element is the images and the second the labels.
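
To make the role of collate_fn concrete, here is a minimal hand-written collate function for (tensor, numeric label) sample pairs. It is only a sketch reproducing what PyTorch's default collation already does for this simple case; a custom function is really only needed for samples with non-standard structure:

import torch  
  
def simple_collate_fn(batch):  
    # batch is a list of (sample, label) pairs, one per dataset[idx] call  
    xs = torch.stack([sample for sample, _ in batch])  # -> [batch_size, ...]  
    ys = torch.tensor([label for _, label in batch])   # -> [batch_size]  
    return xs, ys  
  
# Passed to the loader as: DataLoader(dataset, batch_size=4, collate_fn=simple_collate_fn)  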

The DataLoader class in PyTorch is defined as follows:

# Builds an iterable data loader: during training, each for-loop  
# iteration fetches one batch of batch_size samples from the DataLoader.  
class torch.utils.data.DataLoader(  
     dataset,  
     batch_size=1,  
     shuffle=False,  
     sampler=None,  
     batch_sampler=None,  
     num_workers=0,  
     collate_fn=None,  
     pin_memory=False,  
     drop_last=False,  
)  



  • dataset: the dataset to load; a Dataset object.

  • batch_size: the number of samples read per batch. For example, batch_size=16 means each batch reads 16 samples.

  • shuffle: whether to reshuffle the samples at every epoch. shuffle=True shuffles the order in which samples are drawn, which helps reduce overfitting.

  • sampler: the strategy for drawing sample indices from the dataset; it returns one index at a time.

  • batch_sampler: wraps a sampler that returns single indices and yields a group of indices at a time, according to the configured batch_size.

  • num_workers: synchronous/asynchronous data reading. num_workers=0 means data loading is synchronous and done in the main process; setting it greater than 0 enables multi-process asynchronous loading, which can speed up data reading.

  • pin_memory: whether to copy tensors into page-locked (pinned) memory before returning them, which speeds up host-to-GPU transfer.

  • collate_fn: merges a list of samples into a mini-batch. If not specified, PyTorch's internal default function is used.

  • drop_last: whether to discard the last incomplete batch. drop_last=True means that when the number of samples is not divisible by batch_size, the final incomplete batch is dropped, as illustrated below.
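
As a small illustration of how batch_size and drop_last interact, the sketch below builds a toy TensorDataset of 10 samples; with batch_size=3 the loader yields ceil(10/3) = 4 batches by default, but only floor(10/3) = 3 when drop_last=True:

import torch  
from torch.utils.data import DataLoader, TensorDataset  
  
dataset = TensorDataset(torch.arange(10).float())  # 10 toy samples  
  
print(len(DataLoader(dataset, batch_size=3, drop_last=False)))  # 4: last batch has only 1 sample  
print(len(DataLoader(dataset, batch_size=3, drop_last=True)))   # 3: the incomplete batch is dropped  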

Additional information

Epoch: one complete pass in which all training samples have been fed into the model once.

Iteration: one step in which a single batch of batch_size samples is fed into the model.

Batchsize: the number of samples in one batch; it determines how many iterations make up an epoch. For example, 80 training samples with batch_size=8 give 10 iterations per epoch.

A complete code example follows.

import torch  
import torch.utils.data as Data  
  
BATCH_SIZE = 5  
  
# 10 toy samples: x counts up from 1 to 10, y counts down from 10 to 1  
x = torch.linspace(1, 10, 10)  
y = torch.linspace(10, 1, 10)  
  
# Wrap the tensors in a TensorDataset so torch can read them by index  
torch_dataset = Data.TensorDataset(x, y)  
  
loader = Data.DataLoader(  
    dataset=torch_dataset,  
    batch_size=BATCH_SIZE,  
    shuffle=True,  
    num_workers=0  
)  
  
for epoch in range(3):  
    for step, (batch_x, batch_y) in enumerate(loader):  
        print('epoch:', epoch,  
              '| step:', step,  
              '| batch_x:', batch_x.numpy(),  
              '| batch_y:', batch_y.numpy())  



Through the above steps, a data loader named loader is initialized and iterates over the training dataset torch_dataset batch by batch. Note that if num_workers is set greater than 0, the iteration code should be guarded by if __name__ == '__main__': on platforms that spawn worker processes (such as Windows).
