Data Preparation for Deep Learning

The four steps required to complete a deep learning project:

Prepare the data
Build the model
Train the model
Analyze the model's results

Prepare the data

Preparing the data follows the ETL process:

Extract
Transform
Load

We extract data from the data source, convert the data into an ideal format, and then load the data into a suitable structure for query and analysis. The packages that need to be imported are as follows:

import torch
import torchvision
import torchvision.transforms as transforms

torch is the top-level PyTorch package and tensor library.
torchvision is a package that provides access to popular datasets, model architectures, and image transformations for computer vision.
torchvision.transforms gives us access to common transformations for image processing.

The torchvision package gives us access to the following resources:

Datasets (like MNIST and Fashion-MNIST)
Models
Transforms
Utils

For this project, the ETL steps are:

Extract: get the Fashion-MNIST image data from the source
Transform: convert the image data into PyTorch tensors
Load: put the data into an object that makes it easy to access

To carry out these steps, PyTorch provides two classes:

Dataset
DataLoader

Dataset is an abstract class representing a dataset, and DataLoader wraps a dataset and provides access to the underlying data.

A custom dataset is written as a new class that inherits from Dataset and implements two methods, __getitem__ and __len__. This is not required when a ready-made dataset class exists, but it looks like this:

import torch
import torchvision
import torchvision.transforms as transforms
import torch.utils.data as data
import pandas as pd
import numpy as np

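class OHLC(data.Dataset):
    def __init__(self, csv_file):
        # Expects the columns: open, high, low, close, is_up_day
        self.data = pd.read_csv(csv_file)

    def normalize(self, sample):
        # Assumed scheme (the original helper is not shown): min-max scale each row
        return (sample - sample.min()) / (sample.max() - sample.min() + 1e-8)

    def __getitem__(self, index):
        # Return one (sample, label) pair for the given row index
        r = self.data.iloc[index]
        label = torch.tensor(r.is_up_day, dtype=torch.long)
        sample = self.normalize(torch.tensor([r.open, r.high, r.low, r.close], dtype=torch.float32))
        return sample, label

    def __len__(self):
        # Number of rows (samples) in the dataset
        return len(self.data)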
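
To see how these two methods are used, here is a minimal sketch with a hypothetical in-memory dataset. The class name ToyOHLC and the price values are made up for illustration and are not part of the original example; the point is only that indexing calls __getitem__, len() calls __len__, and DataLoader relies on both to build batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyOHLC(Dataset):
    """Hypothetical in-memory stand-in for a CSV-backed dataset."""

    def __init__(self, rows, labels):
        # rows: list of [open, high, low, close]; labels: list of 0/1 flags
        self.rows = torch.tensor(rows, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __getitem__(self, index):
        # Called for dataset[index]; returns one (sample, label) pair
        return self.rows[index], self.labels[index]

    def __len__(self):
        # Called for len(dataset); tells the loader how many samples exist
        return len(self.rows)

# Made-up prices, for illustration only
toy_set = ToyOHLC(
    rows=[[1.0, 2.0, 0.5, 1.5], [1.5, 1.8, 1.0, 1.2], [1.2, 2.2, 1.1, 2.0]],
    labels=[1, 0, 1],
)

sample, label = toy_set[0]                       # uses __getitem__
toy_loader = DataLoader(toy_set, batch_size=2)   # stacks samples into batches
first_batch_samples, first_batch_labels = next(iter(toy_loader))

print(len(toy_set))                  # 3, via __len__
print(sample.shape)                  # torch.Size([4])
print(first_batch_samples.shape)     # torch.Size([2, 4])
```

The loader never needs to know where the data comes from; any object with these two methods can be batched the same way.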

torchvision's FashionMNIST dataset class already inherits from Dataset and implements these methods. In practice, therefore, we simply use the FashionMNIST class directly.

What is MNIST

MNIST stands for Modified National Institute of Standards and Technology database. It is a well-known dataset of handwritten digits, often used to train image processing systems in machine learning.

The M stands for "modified": MNIST was created by modifying the original NIST digit dataset.


MNIST is so widely used for two main reasons:

Beginners use it because it is easy to get started with.
Researchers use it to benchmark (compare) different models.

The dataset consists of 70,000 images of handwritten digits, split as follows:

60,000 training images
10,000 test images

The digits were originally written by U.S. Census Bureau employees and U.S. high school students.

What is Fashion-MNIST

Fashion-MNIST, as the name suggests, is a dataset of fashion items. Specifically, the dataset contains the following ten categories of fashion items, indexed by label:

0: T-shirt/top
1: Trouser
2: Pullover
3: Dress
4: Coat
5: Sandal
6: Shirt
7: Sneaker
8: Bag
9: Ankle boot

[Image: example images from the Fashion-MNIST dataset]

The relationship between MNIST and Fashion-MNIST

The Fashion-MNIST dataset has MNIST in its name because its creators intended it as a replacement for MNIST. MNIST has become so widely used, and image recognition technology has improved so much, that the dataset is now considered too easy. This is why Fashion-MNIST was created. Because Fashion-MNIST mirrors the format of MNIST (28×28 grayscale images, ten classes, 60,000 training and 10,000 test images), frameworks like PyTorch can add the Fashion-MNIST dataset simply by changing the URL used to fetch the data. In effect, PyTorch's FashionMNIST class just extends the MNIST class and overrides its URL.

Code demo

(1) Extract and transform data

import torch
import torchvision
import torchvision.transforms as transforms

train_set = torchvision.datasets.FashionMNIST(
    root='./data/FashionMNIST',                 # extract
    train=True,
    download=True,
    transform=transforms.Compose([              # transform
        transforms.ToTensor()
    ])
)

The parameters are as follows. root is the path on disk where the data is stored. train=True selects the training set. download=True tells the class to download the data if it is not already present at the specified path. transform takes a composition of transformations to apply to each element of the dataset; because we want to convert the images into tensors, we use the built-in ToTensor transformation.

(2) Encapsulate data in the data loader object

train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=10                    # load
)

Here, batch_size sets how many samples the loader returns in each batch (10 in this case).
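
As a rough illustration of how batch_size divides a dataset, the sketch below uses 25 synthetic samples shaped like Fashion-MNIST images (all zeros, purely a stand-in so that nothing needs to be downloaded). With the default settings, the number of batches is the dataset size divided by the batch size, rounded up, and the last batch holds whatever is left over.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# 25 synthetic samples shaped like Fashion-MNIST images (zeros, not real data)
fake_images = torch.zeros(25, 1, 28, 28)
fake_labels = torch.arange(25) % 10
fake_set = TensorDataset(fake_images, fake_labels)

loader = DataLoader(fake_set, batch_size=10)
batch_sizes = [images.shape[0] for images, _ in loader]

print(len(loader))      # 3 batches, i.e. ceil(25 / 10)
print(batch_sizes)      # [10, 10, 5]: the final batch holds the leftover samples
```

DataLoader also accepts shuffle=True to randomize the sample order each epoch and drop_last=True to discard an incomplete final batch.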

(3) Access the data in the training set

import matplotlib.pyplot as plt
sample = next(iter(train_set))
image, label = sample
print(image.shape)
plt.imshow(image.squeeze(), cmap='gray')

Output:

torch.Size([1, 28, 28])

[Image: the sample displayed as a 28×28 grayscale picture]

If calling plt.imshow in a Jupyter notebook makes the kernel die so that the picture cannot be displayed, a common cause is that two copies of the OpenMP runtime have been loaded. A widely used workaround is to add the following code at the top of the notebook:

import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

The code above handles a single sample. The next step is to use the data loader to work with a batch of data:

batch = next(iter(train_loader))
print(len(batch))
print(type(batch))
images, labels = batch
print(images.shape)
print(labels.shape)

Output:

2
<class 'list'>
torch.Size([10, 1, 28, 28])
torch.Size([10])
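
The labels in a batch are integer class indices. A small helper like the one below can turn them into readable names. The function name label_names and the example tensor are hypothetical; the class names follow the label order given in the Fashion-MNIST README.

```python
import torch

# Class names in label order (index 0 to 9), as listed in the Fashion-MNIST README
CLASS_NAMES = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot',
]

def label_names(labels):
    """Map a 1-D tensor of class indices to their human-readable names."""
    return [CLASS_NAMES[i] for i in labels.tolist()]

# Hypothetical label batch, standing in for the `labels` tensor from the loader
example_labels = torch.tensor([9, 0, 3])
print(label_names(example_labels))   # ['Ankle boot', 'T-shirt/top', 'Dress']
```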

Then, you can use the torchvision.utils.make_grid function to draw the entire batch of images at once:

grid = torchvision.utils.make_grid(images, nrow=10)   # nrow sets the number of images per row (choose it to match the batch size)

plt.figure(figsize=(15, 15))
plt.imshow(grid.permute(1, 2, 0).numpy())   # reorder (C, H, W) to (H, W, C) for matplotlib

print('labels:', labels)

Output:

[Image: the ten images of the batch displayed in a single row]
