Foreword
In model training, the data pipeline is critical. It consists of four stages: data collection, data splitting, data reading, and data preprocessing.
Data collection gathers the raw samples and their labels (Img, Label).
Data splitting divides the dataset into a training set, a validation set, and a test set.
The training set is used to train the model, the validation set is used to check whether the model is overfitting, and the test set is used to measure its final performance.
Data reading is handled mainly by DataLoader.
- DataLoader relies on two sub-modules: Sampler and Dataset.
- Sampler generates the indices of the samples to read.
- Dataset reads the data at those indices.
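The division of labor between the two sub-modules can be sketched as follows. This is a minimal illustration, not DataLoader's actual internals: the ToyDataset class and its contents are made up for the example, while SequentialSampler, RandomSampler, and BatchSampler are sampler classes from torch.utils.data.

```python
from torch.utils.data import Dataset, SequentialSampler, RandomSampler, BatchSampler


class ToyDataset(Dataset):
    """Hypothetical dataset: sample i is the pair (i * 10, i % 2)."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        return index * 10, index % 2

    def __len__(self):
        return self.n


dataset = ToyDataset(6)

# A sampler only produces indices; it never touches the data itself.
sequential_indices = list(SequentialSampler(dataset))  # indices in order
shuffled_indices = list(RandomSampler(dataset))        # some permutation of 0..5

# A batch sampler groups the indices into batches.
index_batches = list(
    BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
)

# The dataset turns each index of a batch into a sample.
first_batch = [dataset[i] for i in index_batches[0]]
print(sequential_indices, index_batches, first_batch)
```

Swapping SequentialSampler for RandomSampler is roughly what shuffle=True does inside DataLoader: only the order of the indices changes, while the Dataset stays the same.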
Data preprocessing is implemented with transforms.
Custom Dataset Class
PyTorch custom datasets are defined by subclassing either Dataset or IterableDataset. The former implements map-style datasets, which return a sample for a given index; the latter implements iterable-style datasets, which yield samples one after another.
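The difference between the two styles can be sketched with a toy example. The class names SquaresMapStyle and SquaresIterable are hypothetical and exist only for this illustration; both produce the squares of 0..n-1.

```python
from torch.utils.data import Dataset, IterableDataset, DataLoader


class SquaresMapStyle(Dataset):
    """Map-style: random access through __getitem__ and a known length."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        return index * index

    def __len__(self):
        return self.n


class SquaresIterable(IterableDataset):
    """Iterable-style: samples are produced one after another by __iter__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return (i * i for i in range(self.n))


map_ds = SquaresMapStyle(5)
iter_ds = SquaresIterable(5)

third_sample = map_ds[2]   # random access works for the map-style dataset
streamed = list(iter_ds)   # the iterable-style dataset is consumed in order

# Both kinds can be passed to a DataLoader; batches come back as tensors.
map_batches = [b.tolist() for b in DataLoader(map_ds, batch_size=2)]
iter_batches = [b.tolist() for b in DataLoader(iter_ds, batch_size=2)]
print(third_sample, streamed, map_batches, iter_batches)
```

Map-style datasets are the usual choice when every sample can be addressed by an index, as with files on disk. Iterable-style datasets suit streams where random access is expensive or impossible.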
DataLoader and Dataset
DataLoader and Dataset are the core of data reading in PyTorch.
Dataset
torch.utils.data.Dataset
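Function: Dataset is an abstract class. Every custom map-style dataset must inherit from it and override __getitem__(); overriding __len__() as well lets DataLoader and its default samplers know the dataset size.
__getitem__():
Receives an index and returns one sample.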
Wrapping the iris dataset in a Dataset subclass
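import numpy as np
import torch
from torch.utils.data import Dataset


class IrisDataset(Dataset):
    '''Iris dataset'''

    def __init__(self):
        super().__init__()
        # Placeholder file name: replace it with the actual path to the iris CSV file.
        # The file is assumed to be purely numeric, with the label in the last column.
        data = np.loadtxt("iris_dataset_path.csv", delimiter=',', dtype=np.float32)
        self.x = torch.from_numpy(data[:, 0:-1])  # features: every column except the last
        self.y = torch.from_numpy(data[:, [-1]])  # labels: last column, kept 2-D
        self.len = data.shape[0]                  # number of samples

    def __getitem__(self, index):
        # Return the (features, label) pair at the given index
        return self.x[index], self.y[index]

    def __len__(self):
        return self.len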
DataLoader
torch.utils.data.DataLoader
A custom dataset can return individual samples, but fetching samples directly by index is fairly primitive: the dataset cannot deliver a batch of data at a time, shuffle the data, or load it in parallel. PyTorch provides the DataLoader class for this purpose.
DataLoader combines a dataset with a sampler and provides an iterable over the given dataset.
Function: builds an iterable data loader.
- dataset: a Dataset object, which determines where the data is read from and how
- batch_size: the number of samples per batch
- num_workers: the number of subprocesses used to load data (0 loads the data in the main process)
- shuffle: whether to reshuffle the data at every epoch
- drop_last: whether to discard the last batch when the number of samples is not divisible by batch_size
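Epoch: one pass in which all training samples are fed to the model
Iteration: one step in which a single batch of samples is fed to the model
Batch size: the number of samples per batch; together with the number of samples, it determines how many iterations make up an epoch. For example, with 80 samples and a batch size of 8, one epoch takes 10 iterations.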
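When the number of samples is not divisible by the batch size, drop_last decides whether the leftover samples form one extra iteration. The sketch below checks this on made-up data (87 random samples with 4 features each, chosen only for illustration), using TensorDataset from torch.utils.data as a stand-in for a real dataset.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

num_samples, batch_size = 87, 8

# Hypothetical data: 87 samples with 4 features each, plus integer labels.
dataset = TensorDataset(
    torch.randn(num_samples, 4),
    torch.zeros(num_samples, dtype=torch.long),
)

# drop_last=False keeps the final, smaller batch: ceil(87 / 8) iterations per epoch.
keep_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=False)
# drop_last=True discards it: floor(87 / 8) iterations per epoch.
drop_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

iters_keep = len(keep_loader)
iters_drop = len(drop_loader)
expected_keep = math.ceil(num_samples / batch_size)
expected_drop = num_samples // batch_size

# Size of the last batch when it is kept: 87 - 10 * 8 samples.
last_batch_size = [x.shape[0] for x, _ in keep_loader][-1]
print(iters_keep, iters_drop, last_batch_size)
```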
Simple example of DataLoader class
The example below reads the iris dataset defined above.
import torch
from torch.utils.data import DataLoader
# Module (file) that contains the IrisDataset class defined above
from Dataset类重载鸢尾花数据集 import IrisDataset

# Instantiate the dataset and wrap it in a DataLoader
iris = IrisDataset()
iris_loader = DataLoader(dataset=iris, batch_size=10, shuffle=True)

for epoch in range(2):
    # enumerate yields (batch index, batch) pairs
    for i, data in enumerate(iris_loader):
        # Unpack one batch read from iris_loader
        inputs, labels = data
        # Print the shape of the batch
        print(inputs.size())
        print(labels.size())
Three questions about data reading
1. Which data is read?
The training data.
2. Where is the data read from?
The dataset.
3. How is the data read?
Files on the hard disk are read through the os library.
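import os
import random

if __name__ == '__main__':
    random.seed(1)  # fix the seed so the split is reproducible

    dataset_dir = os.path.join('..', 'data')   # source data directory
    split_dir = os.path.join('..', 'split')    # root directory of the split dataset
    train_dir = os.path.join(split_dir, 'train')
    valid_dir = os.path.join(split_dir, 'valid')
    test_dir = os.path.join(split_dir, 'test')

    # Proportion of samples assigned to each subset
    train_pct = 0.8
    valid_pct = 0.1
    test_pct = 0.1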
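The snippet above only defines the directories and the split ratios; how the split itself is carried out is not shown. One possible way to apply such ratios to a list of files is sketched below. The helper split_samples and the file names are hypothetical, introduced only for this illustration.

```python
import random


def split_samples(samples, train_pct=0.8, valid_pct=0.1, seed=1):
    """Shuffle a list of samples and cut it into train/valid/test parts.

    Whatever remains after the train and valid parts goes to the test set.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # local RNG, reproducible for a fixed seed

    train_end = int(len(samples) * train_pct)
    valid_end = train_end + int(len(samples) * valid_pct)
    return samples[:train_end], samples[train_end:valid_end], samples[valid_end:]


# Hypothetical file names standing in for the files found under dataset_dir.
files = [f"img_{i:03d}.jpg" for i in range(100)]
train_files, valid_files, test_files = split_samples(files)
print(len(train_files), len(valid_files), len(test_files))
```

Each part would then be copied into train_dir, valid_dir, and test_dir respectively. Shuffling before cutting keeps any ordering in the file names from leaking into the subsets.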
Build MyDataset instance
train_data = MyDataset(data_dir=train_dir, transform=train_transform)
valid_data = MyDataset(data_dir=valid_dir, transform=valid_transform)  # validation data comes from valid_dir
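The definition of MyDataset is not given here. A minimal sketch of what such a class might look like follows, under several assumptions that are not from the source: each class has its own subfolder under data_dir, every file is one sample, and a loader argument (hypothetical) decides how a file is decoded. The transform is applied per sample inside __getitem__.

```python
import os
import tempfile

from torch.utils.data import Dataset


def read_bytes(path):
    """Default loader for this sketch: return the raw bytes of a file."""
    with open(path, "rb") as f:
        return f.read()


class MyDataset(Dataset):
    """Hypothetical sketch: data_dir/<class_name>/<file>, one sample per file."""

    def __init__(self, data_dir, transform=None, loader=read_bytes):
        self.transform = transform
        self.loader = loader  # assumption: for images this would decode the file
        self.classes = sorted(
            d for d in os.listdir(data_dir)
            if os.path.isdir(os.path.join(data_dir, d))
        )
        # Collect (file path, integer label) pairs once, up front.
        self.samples = [
            (os.path.join(data_dir, cls, name), label)
            for label, cls in enumerate(self.classes)
            for name in sorted(os.listdir(os.path.join(data_dir, cls)))
        ]

    def __getitem__(self, index):
        path, label = self.samples[index]
        sample = self.loader(path)
        if self.transform is not None:
            sample = self.transform(sample)  # preprocessing happens per sample
        return sample, label

    def __len__(self):
        return len(self.samples)


# Tiny throwaway directory so the sketch runs without real data.
root = tempfile.mkdtemp()
for cls, contents in {"cat": [b"aa", b"bbb"], "dog": [b"c"]}.items():
    os.makedirs(os.path.join(root, cls))
    for i, payload in enumerate(contents):
        with open(os.path.join(root, cls, f"{i}.bin"), "wb") as f:
            f.write(payload)

# The "transform" here simply measures the byte length of each file.
demo = MyDataset(data_dir=root, transform=len)
print(len(demo), demo[0], demo[2])
```

Scanning the directory once in __init__ and deferring the actual file read to __getitem__ keeps construction cheap and lets DataLoader workers load samples on demand.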
Build DataLoader
# batch_size must be an integer number of samples per batch
train_loader = DataLoader(dataset=train_data, batch_size=32, shuffle=True)
valid_loader = DataLoader(dataset=valid_data, batch_size=32)