[Python] PyTorch Dataset and DataLoader: introduction, usage scenarios, and use cases

What is a Dataset?

PyTorch's Dataset is an abstract class used to represent datasets. It provides common methods such as __len__() and __getitem__(), which return the size of the dataset and the data sample at a given index, respectively. Users can define custom datasets by inheriting from the Dataset class and implementing these methods.

Dataset class:
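A minimal sketch of a custom Dataset subclass (the class name and the synthetic data here are illustrative assumptions, not part of the PyTorch API):

import torch
from torch.utils.data import Dataset

# Subclass Dataset and implement __len__ and __getitem__.
class SquaresDataset(Dataset):
    def __init__(self, n=100):
        # Pre-generate n (x, x^2) pairs as the underlying data.
        self.x = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.y = self.x ** 2

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.x)

    def __getitem__(self, idx):
        # Return the sample at the given index.
        return self.x[idx], self.y[idx]

ds = SquaresDataset()
print(len(ds))  # 100
print(ds[3])    # (tensor([3.]), tensor([9.]))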

TensorDataset class:
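TensorDataset wraps one or more tensors that share the same first dimension; indexing it returns a tuple with one element per tensor. A small sketch (the toy tensors are assumptions for the example):

import torch
from torch.utils.data import TensorDataset

features = torch.randn(10, 3)        # 10 samples, 3 features each
labels = torch.randint(0, 2, (10,))  # 10 binary labels
ds = TensorDataset(features, labels)
print(len(ds))  # 10
print(ds[0])    # (feature tensor of shape [3], scalar label tensor)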

In addition, for visual tasks the torchvision library provides the VisionDataset class, which inherits from Dataset and serves as the base class for computer vision datasets. Currently there are 74 implemented subclasses of VisionDataset (such as CIFAR* and MNIST).
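A hedged sketch of loading one of these built-in subclasses (assuming torchvision is installed, './data' is writable, and network access is available for the first download):

from torchvision import datasets, transforms

mnist = datasets.MNIST(
    root='./data',
    train=True,
    download=True,                    # fetch the files on first use
    transform=transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
)
image, label = mnist[0]
print(image.shape, label)  # torch.Size([1, 28, 28]) and an integer label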


Dataset usage scenarios

Typical usage scenarios for Dataset include:

  1. Organizing and managing large amounts of data in a machine learning project.
  2. Applying per-sample preprocessing, augmentation, or normalization (see the sketch after this list).
  3. Loading a dataset into memory with efficient reading and caching.
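For scenario 2, a minimal sketch of normalization applied inside __getitem__ (the class name and the mean/std values are illustrative assumptions):

import torch
from torch.utils.data import Dataset

class NormalizedDataset(Dataset):
    def __init__(self, data, mean=0.5, std=0.25):
        self.data = data
        self.mean = mean
        self.std = std

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Normalize lazily, sample by sample, at access time.
        return (self.data[idx] - self.mean) / self.std

ds = NormalizedDataset(torch.rand(8, 4))
print(ds[0])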

What is a data loader (DataLoader)?

PyTorch's DataLoader is a tool for loading data: it draws samples from a dataset in batches and supports parallel loading with multiple worker processes. With DataLoader it is easy to implement mini-batch training, distributed training, data augmentation, and similar operations.
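A minimal sketch of batched iteration (the toy dataset is an assumption for the example):

import torch
from torch.utils.data import TensorDataset, DataLoader

ds = TensorDataset(torch.arange(10).float().unsqueeze(1))
loader = DataLoader(ds, batch_size=4, shuffle=True)
for (batch,) in loader:
    print(batch.shape)  # [4, 1], [4, 1], then [2, 1] for the remainder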

Parameter Description:

  • dataset: The dataset object to load; must be an instance of a torch.utils.data.Dataset subclass.
  • batch_size: The number of samples per batch; default is 1.
  • shuffle: Whether to reshuffle the data at the start of each epoch; default is False.
  • sampler: Strategy for drawing individual samples from the dataset; an instance of torch.utils.data.Sampler or a subclass.
  • batch_sampler: Like sampler, but yields whole batches of indices at a time; an instance of torch.utils.data.BatchSampler or a subclass.
  • num_workers: Number of subprocesses used for data loading. The default is 0, meaning data is loaded in the main process.
  • collate_fn: Function that merges a list of samples into a batch; defaults to torch.utils.data.dataloader.default_collate.
  • pin_memory: Whether to copy tensors into page-locked (pinned) memory before returning them; default is False.
  • drop_last: If True, drop the last batch when it is incomplete; default is False.
  • timeout: Timeout for collecting a batch from a worker process. The default of 0 means wait indefinitely.
  • worker_init_fn: Function used to initialize each worker process; defaults to None.
  • multiprocessing_context: The multiprocessing context to use; the default is None.
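A sketch combining several of the parameters above (the values are arbitrary examples, not recommendations):

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(103, 2), torch.randn(103, 1))

loader = DataLoader(
    ds,
    batch_size=16,
    shuffle=True,     # reshuffle at the start of every epoch
    num_workers=2,    # load data in 2 worker subprocesses
    pin_memory=True,  # page-locked host memory for faster GPU transfer
    drop_last=True,   # 103 % 16 == 7 leftover samples are dropped
)
print(len(loader))  # 6 full batches (103 // 16)

Note that with num_workers > 0 the loader spawns subprocesses, so on some platforms the code that iterates it must live under an if __name__ == '__main__': guard.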


DataLoader usage scenarios

Typical usage scenarios for DataLoader include:

  1. Organizing and managing large amounts of data in a machine learning project.
  2. Applying preprocessing, augmentation, or normalization while the data is loaded.
  3. Loading a dataset into memory with efficient reading and caching.
  4. Mini-batch training, by loading the data in small batches.
  5. Distributed training, by loading and processing the data across processes (see the sketch after this list).
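For scenario 5, a sketch built on torch.utils.data.distributed.DistributedSampler. In real distributed training the sampler reads num_replicas and rank from the initialized process group; they are passed explicitly here only so the snippet runs standalone:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

ds = TensorDataset(torch.randn(100, 2))

sampler = DistributedSampler(ds, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(ds, batch_size=10, sampler=sampler)  # no shuffle= together with a sampler
for epoch in range(2):
    sampler.set_epoch(epoch)  # makes each epoch use a different shuffle
    for (batch,) in loader:
        pass
print(len(sampler))  # 50: this rank sees half of the 100 samples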

Dataset and DataLoader use cases

import torch
import torch.nn as nn
import numpy as np
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Define the neural network model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = Net()
print('model:', model)

# Create the dataset
x = torch.randn(1000, 1)  # 1000 samples with 1 feature each
print('x.size:', x.size())
noise = torch.from_numpy(np.random.normal(0, 0.1, (1000, 1))).float()  # Gaussian noise, mean 0, std 0.1
y = 3 * x + 2 + noise  # 1000 noisy labels from y = 3x + 2
print('y.size:', y.size())
dataset = TensorDataset(x, y)  # wrap data and labels in a TensorDataset
print('dataset:', dataset)
# Create the data loader
batch_size = 100  # size of each batch
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)  # DataLoader for batched loading
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent, learning rate 0.01

# Train the model
num_epochs = 1000
for epoch in range(num_epochs):
    # Iterate over the dataset in batches
    for i, (inputs, labels) in enumerate(dataloader):
        # Forward pass
        outputs = model(inputs)
        
        # Compute the loss
        loss = criterion(outputs, labels)
        
        # Backward pass and optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (epoch+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.6f}'.format(epoch+1, num_epochs, i+1, len(dataloader), loss.item()))
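As a quick sanity check continuing the script above: since the data follows y = 3x + 2, the trained network's output at x = 1.0 should be close to 5.

# Continuation of the training script above.
model.eval()
with torch.no_grad():
    test_x = torch.tensor([[1.0]])
    print('model(1.0):', model(test_x).item())  # expected to be near 5.0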
