The dataset pipeline in PyTorch (how does PyTorch read data from a dataset?)

The dataset is stored on the computer's hard disk. PyTorch needs to read this data from disk and turn it into a Dataset object that PyTorch can understand; a DataLoader then reads from the Dataset batch by batch and passes the batches into the model. This article summarizes how to build datasets in PyTorch and how to read them with DataLoader.

Build a Dataset from tensors

The previous article introduced how to build tensors, so how do we build a Dataset from tensors?

In supervised learning, one tensor stores the features of the data and another tensor stores the labels. For example:

import torch

t_x = torch.rand([6, 5], dtype=torch.float32)
t_y = torch.arange(6)

t_x stores 6 samples, each with 5 features.

t_y stores the labels of these 6 samples.

Next, build the dataset so that PyTorch can retrieve each sample by index, and so that t_x and t_y are paired one by one: even if the order is shuffled, each sample in t_x still corresponds to its label in t_y.

PyTorch provides a Dataset class; we just need to build a subclass of Dataset:

from torch.utils.data import Dataset 

class JointDataset(Dataset):
    def __init__(self, x, y):
        self.x = x 
        self.y = y 
        
    def __getitem__(self, idx):
        # return the feature/label pair at the given index
        return self.x[idx], self.y[idx]
    
    def __len__(self):
        # number of samples in the dataset
        return len(self.x)

Then pass t_x and t_y into the custom class above to build a dataset that PyTorch can understand:

joint_dataset = JointDataset(t_x, t_y)

for example in joint_dataset:
    print(f"x:{example[0]}, y:{example[1]}")

x:tensor([0.3030, 0.3913, 0.1098, 0.1247, 0.1747]), y:0
x:tensor([0.6247, 0.4709, 0.7010, 0.3407, 0.2678]), y:1
x:tensor([0.1844, 0.7371, 0.8012, 0.9095, 0.6837]), y:2
x:tensor([0.8457, 0.1382, 0.6116, 0.7448, 0.4173]), y:3
x:tensor([0.4306, 0.2952, 0.8508, 0.7258, 0.5765]), y:4
x:tensor([0.4122, 0.2141, 0.5772, 0.9119, 0.8334]), y:5
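
Because JointDataset implements __getitem__, a single sample can also be fetched by index. For this simple tensor-pairing case, PyTorch also provides a built-in equivalent, TensorDataset; a minimal sketch:

from torch.utils.data import TensorDataset

# Fetch one sample through __getitem__
x2, y2 = joint_dataset[2]

# Built-in equivalent of JointDataset: pairs tensors along their first dimension
tensor_dataset = TensorDataset(t_x, t_y)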

Use DataLoader to read the constructed Dataset:

from torch.utils.data import DataLoader 

data_loader = DataLoader(dataset=joint_dataset, batch_size=3, shuffle=True)

DataLoader reads from the constructed dataset batch by batch; here batch_size=3 is set. With shuffle=True, the data are shuffled before being read.
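
Before running a full training loop, it can help to peek at a single batch to confirm the shapes. A quick check (the shapes follow from the tensors built above):

x_batch, y_batch = next(iter(data_loader))
print(x_batch.shape, y_batch.shape)  # torch.Size([3, 5]) torch.Size([3])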

When training a model, it usually takes N epochs, where one epoch is one pass over all of the data. In each epoch, iterate over data_loader:

for epoch in range(2):
    print('\n')
    print(f'epoch {epoch+1}')
    for i, batch in enumerate(data_loader, start=1):
        print(f'batch {i}: x: {batch[0]},'
              f'\n         y: {batch[1]}')


epoch 1
batch 1: x: tensor([[0.4122, 0.2141, 0.5772, 0.9119, 0.8334],
        [0.1844, 0.7371, 0.8012, 0.9095, 0.6837],
        [0.8457, 0.1382, 0.6116, 0.7448, 0.4173]]),
         y: tensor([5, 2, 3])
batch 2: x: tensor([[0.3030, 0.3913, 0.1098, 0.1247, 0.1747],
        [0.6247, 0.4709, 0.7010, 0.3407, 0.2678],
        [0.4306, 0.2952, 0.8508, 0.7258, 0.5765]]),
         y: tensor([0, 1, 4])


epoch 2
batch 1: x: tensor([[0.3030, 0.3913, 0.1098, 0.1247, 0.1747],
        [0.4306, 0.2952, 0.8508, 0.7258, 0.5765],
        [0.1844, 0.7371, 0.8012, 0.9095, 0.6837]]),
         y: tensor([0, 4, 2])
batch 2: x: tensor([[0.4122, 0.2141, 0.5772, 0.9119, 0.8334],
        [0.8457, 0.1382, 0.6116, 0.7448, 0.4173],
        [0.6247, 0.4709, 0.7010, 0.3407, 0.2678]]),
         y: tensor([5, 3, 1])
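
If the dataset size is not divisible by batch_size, the last batch comes out smaller. When a fixed batch size is required, DataLoader can drop the remainder; a small variation on the loader above:

# drop_last=True discards the final incomplete batch: 6 samples with
# batch_size=4 yield one batch of 4, and the leftover 2 are dropped
data_loader = DataLoader(dataset=joint_dataset, batch_size=4,
                         shuffle=True, drop_last=True)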

Build a Dataset from disk

For example, suppose we have an image dataset that needs to be classified, such as the BMW-10 dataset. This dataset has 11 kinds of BMW cars, stored in 11 folders. How do we read these images and their corresponding labels from the hard disk and build a dataset that PyTorch can understand?

First, let's collect the image paths with pathlib and visualize a few pictures:

import pathlib 

imgdir_path = pathlib.Path('bmw10_ims')

# rglob searches all subfolders recursively
image_list = sorted([str(path) for path in imgdir_path.rglob('*.jpg')])
print(image_list)


['bmw10_ims/1/150079887.jpg', 'bmw10_ims/1/150080038.jpg', 'bmw10_ims/1/150080476.jpg', 
...,'bmw10_ims/8/149389446.jpg', 'bmw10_ims/8/149389742.jpg', 'bmw10_ims/8/149389834.jpg']

import matplotlib.pyplot as plt 
from PIL import Image 
import numpy as np

fig = plt.figure(figsize=(10, 5)) 
for i, file in enumerate(image_list[:6]): 
    img = Image.open(file)
    print('Image shape:', np.array(img).shape)
    ax = fig.add_subplot(2, 3, i+1)
    ax.set_xticks([]); ax.set_yticks([])
    ax.imshow(img)
    ax.set_title(pathlib.Path(file).name, size=15) 
plt.tight_layout()
plt.show()


Image shape: (480, 640, 3)
Image shape: (360, 424, 3)
Image shape: (768, 1024, 3)
Image shape: (768, 1024, 3)
Image shape: (360, 480, 3)
Image shape: (183, 275, 3)
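
Note that the images come in different shapes, so they will have to be resized to a common size before they can be stacked into batches.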

Build the labels of the images:

# pathlib.Path("bmw10_ims/7/149461474.jpg").parts == ('bmw10_ims', '7', '149461474.jpg')
labels = [pathlib.Path(path).parts[-2] for path in image_list]

print(labels)


['1', '1', '1', '1',... '8', '8', '8', '8']
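
These labels are still strings. For training, they are usually mapped to integer class ids; a minimal sketch (label_to_idx is an illustrative name, not from the original code):

# Map folder-name strings to integer class ids
label_to_idx = {name: idx for idx, name in enumerate(sorted(set(labels)))}
int_labels = [label_to_idx[name] for name in labels]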

Now to build the dataset: 

class ImageDataset(Dataset):
    def __init__(self, file_list, labels):
        self.file_list = file_list 
        self.labels = labels 
        
    def __getitem__(self, index):
        # this version returns the file path itself, not the loaded image
        file = self.file_list[index]
        label = self.labels[index]
        return file, label  
    
    def __len__(self):
        return len(self.labels)
    
image_dataset = ImageDataset(image_list, labels)
for file, label in image_dataset:
    print(file, label)


bmw10_ims/1/150079887.jpg 1
bmw10_ims/1/150080038.jpg 1
...
bmw10_ims/5/149124761.thumb.jpg 5
bmw10_ims/5/149124940.jpg 5
...
bmw10_ims/8/149389742.jpg 8
bmw10_ims/8/149389834.jpg 8

Generally, it is necessary to pre-process the input images, e.g. normalization, resizing, cropping, etc.:

import torchvision.transforms as transforms 
img_height, img_width = 128, 128 
transform = transforms.Compose([transforms.ToTensor(), 
                                transforms.Resize((img_height, img_width)),
                               ])
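
The Compose above only converts to a tensor and resizes. A fuller pipeline could also crop and normalize; the sketch below is illustrative only (the crop size is arbitrary and the normalization statistics are the common ImageNet values, not computed from this dataset), and the rest of the article keeps the simpler transform:

# Illustrative alternative (not used below): resize, crop, then normalize
train_transform = transforms.Compose([
    transforms.Resize((img_height, img_width)),
    transforms.CenterCrop(112),                        # assumed crop size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])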

The preprocessing is usually placed inside the dataset:

class ImageDataset(Dataset):
    def __init__(self, file_list, labels, transform=None):
        self.file_list = file_list 
        self.labels = labels 
        self.transform = transform 
        
    def __getitem__(self, index):
        img = Image.open(self.file_list[index])
        if self.transform is not None:
            # apply the preprocessing pipeline when the sample is loaded
            img = self.transform(img)
        label = self.labels[index]
        return img, label 
    
    def __len__(self):
        return len(self.labels)

image_dataset = ImageDataset(image_list, labels, transform)

Visualize this Dataset:

fig = plt.figure(figsize=(10, 6))
for i, example in enumerate(image_dataset):
    if i == 6:
        break
    ax = fig.add_subplot(2, 3, i+1)
    ax.set_xticks([]); ax.set_yticks([])
    print(example[0].numpy().shape)
    ax.imshow(example[0].numpy().transpose((1, 2, 0)))
    ax.set_title(f'{example[1]}', size=15)
    
plt.tight_layout()
plt.show()


(3, 128, 128)
(3, 128, 128)
(3, 128, 128)
(3, 128, 128)
(3, 128, 128)
(3, 128, 128)
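
Finally, the image dataset can be fed to a DataLoader exactly like the tensor dataset in the first part. Note also that torchvision ships ImageFolder, which builds this kind of folder-per-class dataset directly; a minimal sketch assuming the same directory layout:

from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Batch the custom dataset just like the tensor example above
image_loader = DataLoader(image_dataset, batch_size=4, shuffle=True)

# Built-in equivalent: infers integer labels from the subfolder names
folder_dataset = ImageFolder('bmw10_ims', transform=transform)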

Reference: Machine Learning with PyTorch and Scikit-Learn (book)

Original post: blog.csdn.net/bo17244504/article/details/124842911