[pytorch, learning]-5.6 Deep Convolutional Neural Network (AlexNet)

5.6 Deep Convolutional Neural Network (AlexNet)

For nearly two decades after LeNet was proposed, neural networks were overshadowed by other machine learning methods, such as support vector machines. Although LeNet achieved good results on early small data sets, its performance on larger, more realistic data sets was not satisfactory. On the one hand, neural networks are computationally expensive. Although some hardware accelerators for neural networks existed in the 1990s, they never became as widespread as GPUs later did. As a result, training a convolutional neural network with many channels, many layers, and a large number of parameters was hard to complete. On the other hand, researchers at the time had not yet studied parameter initialization, non-convex optimization algorithms, and many related topics in depth, which often made complex neural networks difficult to train.

For a long time, the more popular approach relied on hand-crafted features that researchers designed through considerable effort and ingenuity. The main workflow of this type of image classification research was:

  1. Obtain an image data set;
  2. Use existing feature extraction functions to generate image features;
  3. Use machine learning models to classify image features.
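
The three steps above can be sketched with a toy example. This is only an illustration under stated assumptions: the striped images, the two difference-based features, and the nearest-centroid classifier (`make_image`, `extract_features`, `fit_centroids`, `predict`) are all hypothetical stand-ins, not the feature extractors or models that were actually used in that era.

```python
import random

def make_image(kind, rng, size=8):
    """Step 1 (toy data): rows alternate for "horizontal", columns for "vertical"."""
    img = []
    for r in range(size):
        row = []
        for c in range(size):
            stripe = r if kind == "horizontal" else c
            row.append((stripe % 2) + rng.uniform(-0.1, 0.1))
        img.append(row)
    return img

def extract_features(img):
    """Step 2: hand-designed features, the mean absolute difference
    between horizontal neighbours (dx) and vertical neighbours (dy)."""
    n = len(img)
    dx = sum(abs(img[r][c + 1] - img[r][c]) for r in range(n) for c in range(n - 1))
    dy = sum(abs(img[r + 1][c] - img[r][c]) for r in range(n - 1) for c in range(n))
    count = n * (n - 1)
    return (dx / count, dy / count)

def fit_centroids(features, labels):
    """Step 3a: "train" a nearest-centroid classifier on the features."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        s = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, f):
    """Step 3b: choose the class whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], f)))

rng = random.Random(0)
kinds = ["horizontal", "vertical"]
train = [(make_image(k, rng), k) for k in kinds for _ in range(20)]
test = [(make_image(k, rng), k) for k in kinds for _ in range(10)]

centroids = fit_centroids([extract_features(x) for x, _ in train], [y for _, y in train])
accuracy = sum(predict(centroids, extract_features(x)) == y for x, y in test) / len(test)
print(accuracy)
```

Note that the classifier never sees the pixels. Everything it knows comes from `extract_features`, which is why the quality of the features mattered so much in this workflow.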

At the time, what was considered the machine learning part was limited to this last step. If you had talked to machine learning researchers of that era, they would have told you that machine learning was both important and beautiful: elegant theorems proved the properties of many classifiers, and the field was vibrant, rigorous, and extremely useful. If you had talked to a computer vision researcher, however, you would have heard a different story. They would have told you the "unspoken" reality of image recognition: what really matters in a computer vision pipeline is data and features. In other words, cleaner data sets and more effective features affect image classification results more than the choice of machine learning model does.

5.6.1 Learning feature representation

Some researchers believed that multi-layer neural networks might be able to learn multi-level representations of data, representing increasingly abstract concepts level by level. Take image classification as an example. In a multi-layer neural network, the first-level representation of an image might indicate whether an edge is present at a specific position and angle; the second level might combine these edges into interesting patterns, such as textures; and the third level might further combine the patterns from the previous level into patterns that correspond to specific parts of an object. The representation continues to build up in this way until, finally, the model can easily complete the classification task based on the last level. It should be emphasized that this level-by-level representation of the input is determined by the parameters of the multi-layer model, and that these parameters are all learned.

5.6.1.1 Missing element one: data

A deep model with many features requires a large amount of labeled data to outperform other classical methods. Constrained by the limited storage of early computers and the tight research budgets of the 1990s, most research was based on small public data sets. For example, many research papers relied on a handful of public data sets provided by the University of California, Irvine (UCI), many of which contained only a few hundred to a few thousand images. This situation improved with the wave of big data that emerged around 2010. In particular, the ImageNet data set, released in 2009, contains 1,000 categories of objects, each with thousands of different images. No other public data set at the time came close to this scale. ImageNet pushed both computer vision and machine learning research into a new stage, in which the earlier traditional methods no longer held an advantage.

5.6.1.2 Missing element two: hardware

Deep learning demands substantial computing resources. Early hardware had limited computing power, which made it difficult to train more complex neural networks. The arrival of general-purpose GPUs changed this situation. GPUs had long been designed for image processing and computer games, and in particular for high-throughput matrix and vector multiplications that serve basic graphics transformations. Fortunately, this mathematical form is similar to that of the convolutional layers in deep networks. The concept of the general-purpose GPU began to emerge in 2001, and programming frameworks such as OpenCL and CUDA followed. As a result, the machine learning community began to use GPUs around 2010.
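
To see why hardware built for matrix multiplication suits convolutional layers, here is a small sketch, assuming PyTorch is installed, that computes the same convolution two ways: with the built-in `F.conv2d`, and by unfolding the input into patch columns with `F.unfold` and doing a single matrix multiplication. The tensor sizes are arbitrary choices for illustration, and this is not meant to describe how any particular GPU library implements convolution internally.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8)        # (batch, channels, height, width)
weight = torch.randn(4, 3, 3, 3)   # (out_channels, in_channels, kernel_h, kernel_w)

# Reference result: built-in convolution, stride 1, no padding -> (2, 4, 6, 6)
ref = F.conv2d(x, weight)

# Unfold every 3x3 patch into a column: (2, 3*3*3, 6*6) = (2, 27, 36)
cols = F.unfold(x, kernel_size=3)

# One batched matrix multiplication covers all output positions:
# (4, 27) @ (2, 27, 36) -> (2, 4, 36)
out = weight.reshape(4, -1) @ cols
out = out.reshape(2, 4, 6, 6)

print(torch.allclose(ref, out, atol=1e-4))
```

Both paths produce the same numbers, so a device that multiplies matrices quickly can also evaluate convolutional layers quickly.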

5.6.2 AlexNet

The following implements a simplified AlexNet:

import time
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

import sys
sys.path.append("..")
import d2lzh_pytorch as d2l


device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')

class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()
        self.conv = nn.Sequential(
            # Output size: N = floor((W - F + 2P) / S) + 1; PyTorch rounds down when the division is not exact.
            # In the shape comments below, B is the batch size.
            nn.Conv2d(1, 96, 11, 4),   # in_channels, out_channels, kernel_size, stride: (B, 1, 224, 224) -> (B, 96, 54, 54)
            nn.ReLU(),
            nn.MaxPool2d(3, 2),   # kernel_size, stride: (B, 96, 54, 54) -> (B, 96, 26, 26)
            # Use a smaller convolution window with padding 2 so that the output height and width
            # match the input, and increase the number of output channels
            nn.Conv2d(96, 256, 5, 1, 2),  # (B, 96, 26, 26) -> (B, 256, 26, 26)
            nn.ReLU(),
            nn.MaxPool2d(3, 2),  # (B, 256, 26, 26) -> (B, 256, 12, 12)
            # Three consecutive convolutional layers with an even smaller window. Except for the
            # last one, they further increase the number of output channels.
            # No pooling layer follows the first two, so the height and width are not reduced
            nn.Conv2d(256, 384, 3, 1, 1),   # (B, 256, 12, 12) -> (B, 384, 12, 12)
            nn.ReLU(),
            nn.Conv2d(384, 384, 3, 1, 1),  # (B, 384, 12, 12) -> (B, 384, 12, 12)
            nn.ReLU(),
            nn.Conv2d(384, 256, 3, 1, 1),  # (B, 384, 12, 12) -> (B, 256, 12, 12)
            nn.ReLU(),
            nn.MaxPool2d(3, 2)  # (B, 256, 12, 12) -> (B, 256, 5, 5)
        )
        # The fully connected layers here have several times more outputs than those in LeNet. Dropout layers are used to mitigate overfitting
        self.fc = nn.Sequential(
            nn.Linear(256*5*5, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            # Output layer. Fashion-MNIST is used here, so the number of classes is 10
            nn.Linear(4096, 10),
        )
    
    def forward(self, img):
        feature = self.conv(img)  # img: (batch, 1, 224, 224) -> feature: (batch, 256, 5, 5)
        output = self.fc(feature.view(img.shape[0], -1))
        return output
net = AlexNet()
print(net)

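The shape comments in the network can be checked by hand with the output-size formula floor((W - F + 2P) / S) + 1. The following is a small sketch in plain Python; the helper `out_size` and the layer labels (`conv1`, `pool1`, and so on) are names made up for this illustration and do not appear in the model code.

```python
def out_size(w, f, s=1, p=0):
    """Output height/width for input size w, window f, stride s, padding p (rounded down)."""
    return (w - f + 2 * p) // s + 1

# (label, window, stride, padding) for each conv/pool layer of self.conv, in order
layers = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 224          # Fashion-MNIST images after resizing to 224 x 224
sizes = []
for name, f, s, p in layers:
    size = out_size(size, f, s, p)
    sizes.append(size)
    print(f"{name}: {size} x {size}")

# 256 output channels times the final 5 x 5 map gives the input width of the first nn.Linear
flat_features = 256 * size * size
print(flat_features)  # 6400
```

The final value, 6400, matches the `256*5*5` used for the first fully connected layer.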

5.6.3 Read data

def load_data_fashion_mnist(batch_size, resize=None, root="~/Datasets/FashionMNIST"):
    """Download Fashion-MNIST (if needed) and return training and test data loaders."""
    trans = []
    if resize:
        # Enlarge the 28x28 images, e.g. to the 224x224 input that this AlexNet expects
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())

    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)

    # Shuffle only the training set; num_workers=4 loads batches in four worker processes
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=4)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=4)

    return train_iter, test_iter

batch_size = 128
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize = 224)

5.6.4 Training

lr, num_epochs = 0.001, 5
optimizer = optim.Adam(net.parameters(), lr =lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

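The training call relies on `d2l.train_ch5` from the `d2lzh_pytorch` helper package, whose body is not shown in this post. As a rough idea of what such a helper does, here is a minimal sketch of a standard PyTorch training loop. It is an assumption about the general pattern, not the actual implementation: `train_sketch` and `evaluate_accuracy` are hypothetical names, and the tiny random data set and two-layer model are stand-ins chosen only so the sketch runs in seconds on a CPU.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def evaluate_accuracy(net, data_iter, device):
    """Fraction of correctly classified examples, evaluated without gradients."""
    net.eval()  # disables dropout during evaluation
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            correct += (net(X).argmax(dim=1) == y).sum().item()
            total += y.shape[0]
    return correct / total

def train_sketch(net, train_iter, test_iter, optimizer, device, num_epochs):
    """Hypothetical stand-in for a train_ch5-style helper; returns (loss, test acc) per epoch."""
    net = net.to(device)
    loss_fn = nn.CrossEntropyLoss()
    history = []
    for epoch in range(num_epochs):
        net.train()  # enables dropout during training
        loss_sum, n = 0.0, 0
        for X, y in train_iter:
            X, y = X.to(device), y.to(device)
            loss = loss_fn(net(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            loss_sum += loss.item() * y.shape[0]
            n += y.shape[0]
        test_acc = evaluate_accuracy(net, test_iter, device)
        history.append((loss_sum / n, test_acc))
        print(f"epoch {epoch + 1}, loss {loss_sum / n:.4f}, test acc {test_acc:.3f}")
    return history

# Random stand-in data and a tiny model (assumptions, not Fashion-MNIST or AlexNet)
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X = torch.randn(64, 1, 28, 28)
y = torch.randint(0, 10, (64,))
train_iter = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
test_iter = DataLoader(TensorDataset(X, y), batch_size=16)
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32), nn.ReLU(), nn.Linear(32, 10))
optimizer = optim.Adam(net.parameters(), lr=0.001)
history = train_sketch(net, train_iter, test_iter, optimizer, device, num_epochs=2)
```

Swapping in the `AlexNet` instance and the Fashion-MNIST loaders from the earlier sections should follow the same pattern, although each epoch will take far longer, especially without a GPU.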

Origin blog.csdn.net/piano9425/article/details/107199826