Deep Learning -- PyTorch Dataset API: Dataset and DataLoader, with a Custom Iris Dataset

Foreword

In model training, the data pipeline is critically important. It breaks down into four stages: data collection, data splitting, data loading, and data preprocessing.

Data collection gathers raw samples together with their labels, i.e. (Img, label) pairs.

The dataset must be split into a training set, a validation set, and a test set.
The training set is used to train the model, the validation set checks whether the model is overfitting, and the test set measures final performance.

Data loading is handled mainly by DataLoader

  • DataLoader contains two sub-modules, Sampler and Dataset: the Sampler generates indices (Index), and the Dataset reads samples according to those indices (a small demonstration follows below).
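As a quick illustration of the Sampler's index-generating role, here is a minimal sketch using the two built-in samplers that DataLoader picks between depending on the shuffle flag:

from torch.utils.data import SequentialSampler, RandomSampler

data = ['a', 'b', 'c', 'd', 'e']      # any object with __len__ works as a data source

print(list(SequentialSampler(data)))  # [0, 1, 2, 3, 4] -- used when shuffle=False
print(list(RandomSampler(data)))      # e.g. [2, 0, 4, 1, 3] -- used when shuffle=True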

Data preprocessing is implemented with transforms

Custom Dataset Class

Custom datasets in PyTorch can be defined with the Dataset class or the IterableDataset class. The former implements map-style datasets, and the latter implements iterable-style datasets.
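The rest of this post focuses on map-style datasets, but for contrast, here is a minimal iterable-style sketch (the SquaresStream class is hypothetical, purely for illustration):

import torch
from torch.utils.data import IterableDataset, DataLoader

class SquaresStream(IterableDataset):
    '''Hypothetical iterable-style dataset: yields squares one by one.'''
    def __init__(self, n):
        super().__init__()
        self.n = n

    def __iter__(self):
        # an iterable-style dataset overrides __iter__ instead of __getitem__
        return (torch.tensor(i * i) for i in range(self.n))

loader = DataLoader(SquaresStream(5), batch_size=2)
for batch in loader:
    print(batch)  # tensor([0, 1]), then tensor([4, 9]), then tensor([16])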

DataLoader and Dataset

DataLoader and Dataset are the core of PyTorch data loading

Dataset

torch.utils.data.Dataset
Purpose: Dataset is the abstract base class for datasets; every custom dataset must inherit from it and override __getitem__()

__getitem__:
receives an index and returns the corresponding sample


Using the Dataset class to implement the iris dataset

import numpy as np
import torch
from torch.utils.data import Dataset

class IrisDataset(Dataset):
    '''Iris dataset'''
    def __init__(self):
        super().__init__()
        # placeholder path; the CSV must contain only numeric columns
        # (features plus a numerically encoded label in the last column)
        data = np.loadtxt("path_to_iris_dataset.csv", delimiter=',', dtype=np.float32)
        self.x = torch.from_numpy(data[:, 0:-1])  # every column except the last: features
        self.y = torch.from_numpy(data[:, [-1]])  # last column, kept 2-D: labels
        self.len = data.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.len
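A quick sanity check (assuming the placeholder path above points at a real numeric CSV):

iris = IrisDataset()
print(len(iris))  # number of samples, via __len__
x, y = iris[0]    # indexing calls __getitem__
print(x, y)       # feature tensor and label tensor of the first sample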



DataLoader

torch.utils.data.DataLoader

After implementing the custom dataset, samples can be retrieved, but fetching samples one at a time by index is fairly primitive: the dataset cannot deliver a batch of data at a time, cannot shuffle the data, and cannot accelerate loading in parallel. For this, PyTorch provides the DataLoader class.

The DataLoader class is a data loader that combines a dataset with a sampler and provides an iterable over the given dataset.

Purpose: build an iterable data loader


  • dataset: the Dataset object, which determines where and how the data is read
  • batch_size: the batch size
  • num_workers: the number of subprocesses used to load data (0 means loading in the main process)
  • shuffle: whether the data is reshuffled at every epoch
  • drop_last: when the number of samples is not divisible by batch_size, whether to drop the last incomplete batch

Epoch: one complete pass of all training samples through the model
Iteration: one batch of samples passed through the model
Batch size: the number of samples per batch, which determines how many iterations one epoch contains (see the sketch below)
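A small sketch of how batch_size and drop_last together determine the number of iterations per epoch (the numbers are illustrative):

import math

num_samples = 150  # e.g. the full iris dataset
batch_size = 32    # hypothetical batch size

# drop_last=False: the final incomplete batch is kept
print(math.ceil(num_samples / batch_size))  # 5 iterations per epoch
# drop_last=True: the final incomplete batch is discarded
print(num_samples // batch_size)            # 4 iterations per epoch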

Simple example of DataLoader class

As an example, let's read the iris dataset constructed above

import torch
from torch.utils.data import DataLoader
from iris_dataset import IrisDataset  # assumes the IrisDataset class above was saved as iris_dataset.py

# instantiate the dataset and wrap it in a DataLoader
iris = IrisDataset()
iris_loader = DataLoader(dataset=iris, batch_size=10, shuffle=True)

for epoch in range(2):
    # enumerate yields (batch index, batch) pairs
    for i, data in enumerate(iris_loader):
        # each batch from iris_loader is an (inputs, labels) pair
        inputs, labels = data
        # print the shape of each batch
        print(inputs.size())
        print(labels.size())
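Assuming the CSV holds the standard iris data (150 samples, 4 features), each iteration prints torch.Size([10, 4]) for the inputs and torch.Size([10, 1]) for the labels, 15 times per epoch.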

Three questions about data reading

1. What data do we read?

The training data.

2. Where do we read it from?

From the dataset directory on the hard disk.

3. How do we read it?

By building file paths with the os module and reading the files from disk, as in the snippet below.

import os
import random

if __name__ == '__main__':
    random.seed(1)

    dataset_dir = os.path.join('..', 'data')
    split_dir = os.path.join('..', 'split')
    train_dir = os.path.join(split_dir, 'train')
    valid_dir = os.path.join(split_dir, 'valid')
    test_dir = os.path.join(split_dir, 'test')

    # split ratios: 80% train, 10% validation, 10% test
    train_pct = 0.8
    valid_pct = 0.1
    test_pct = 0.1
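A minimal sketch of how these percentages might be applied to split a list of files (hypothetical: it assumes dataset_dir contains the sample files directly):

imgs = os.listdir(dataset_dir)
random.shuffle(imgs)
n = len(imgs)
train_files = imgs[:int(n * train_pct)]
valid_files = imgs[int(n * train_pct):int(n * (train_pct + valid_pct))]
test_files = imgs[int(n * (train_pct + valid_pct)):]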

Build MyDataset instance

# MyDataset is a custom Dataset subclass (defined elsewhere); the
# transforms are the preprocessing pipelines for each split
train_data = MyDataset(data_dir=train_dir, transform=train_transform)
valid_data = MyDataset(data_dir=valid_dir, transform=valid_transform)

Build DataLoader

# batch_size must be an int; the validation loader is typically not shuffled
train_loader = DataLoader(dataset=train_data, batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset=valid_data, batch_size=32)


Origin: blog.csdn.net/fuhao6363/article/details/130416343