Deep Convolutional Neural Network (ALEXNET)

Part of my study notes for "Practice Deep Learning pytorch", intended only for my own review.


For nearly 20 years after LeNet was proposed, neural networks were surpassed by other machine learning methods, such as support vector machines. Although LeNet achieved good results on early small datasets, its performance on larger, real-world datasets was not as good as expected.

On the one hand, neural networks are computationally complex. Although some hardware for accelerating neural networks existed in the 1990s, it was never as widespread as GPUs became later. Therefore, training a multi-channel, multi-layer convolutional neural network with a large number of parameters was difficult to accomplish in those years. On the other hand, researchers had not yet studied areas such as parameter initialization and non-convex optimization algorithms in depth, so training complex neural networks was usually difficult.

A neural network can classify an image directly from its raw pixels. This approach, called end-to-end, saves many intermediate steps. For a long time, however, what prevailed were hand-designed features produced through researchers' effort and ingenuity. The main workflow of this type of image classification research was:

  • 1. Obtain the image data set;
  • 2. Use existing feature extraction functions to generate image features;
  • 3. Use machine learning models to classify image features.

In this workflow, what really mattered for computer vision were the data and the features. In other words, using cleaner datasets and more effective features had a greater impact on image classification results than the choice of machine learning model.
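
To make this workflow concrete, here is a minimal sketch of such a pipeline; the specific choices of HOG features (via scikit-image) and a linear SVM (via scikit-learn) are only illustrative examples, not something prescribed by these notes.

import numpy as np
from skimage.feature import hog    # a hand-designed feature extraction function (example choice)
from sklearn.svm import LinearSVC  # a classic machine learning classifier (example choice)

# 1. Obtain an image dataset (synthetic grayscale images stand in for a real dataset here)
images = np.random.rand(20, 64, 64)
labels = np.random.randint(0, 2, size=20)

# 2. Use an existing feature extraction function to generate image features
features = np.array([hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for img in images])

# 3. Use a machine learning model to classify the image features
clf = LinearSVC().fit(features, labels)
print(clf.predict(features[:5]))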

Learning feature representation

For a considerable period of time, features were extracted from data using a variety of hand-designed functions. In fact, many researchers kept improving image classification results by proposing new feature extraction functions, which made important contributions to the development of computer vision.

Some researchers disagreed. They believed that the features themselves should also be learned. Moreover, they believed that, in order to represent sufficiently complex inputs, the features themselves should be represented hierarchically. Researchers holding this view believed that multi-layer neural networks might be able to learn multi-level representations of data and represent increasingly abstract concepts or patterns level by level. Take image classification as an example, and recall the object edge detection example from the two-dimensional convolutional layer. In a multi-layer neural network, the first-level representation of an image can be whether an edge appears at a particular position and angle; the second-level representation may be able to combine these edges into interesting patterns, such as textures; at the third level, perhaps the patterns of the previous level can be further combined into patterns corresponding to specific parts of objects. Representations continue to build level by level in this way, and finally the model can easily complete the classification task based on the last level's representation. It should be emphasized that the level-by-level representation of the input is determined by the parameters in the multi-layer model, and these parameters are all learned.

Although a group of persistent researchers kept delving into this and attempting to learn hierarchical representations of visual data, for a long time these ambitions were not realized. There are many contributing factors worth analyzing.

Missing element one: data

A deep model with many features needs a large amount of labeled data to outperform other classic methods. Constrained by the limited storage of early computers and the limited research budgets of the 1990s, most research was based only on small public datasets. This situation improved with the wave of big data that emerged around 2010. In particular, the ImageNet dataset, created in 2009, contains 1,000 categories of objects, each with thousands of images. The ImageNet dataset pushed both computer vision and machine learning research into a new stage, where the previous traditional methods no longer held an advantage.

Missing element two: hardware

Deep learning demands substantial computing resources. Early hardware had limited computing power, which made it difficult to train more complex neural networks. The arrival of general-purpose GPUs changed this picture. GPUs had long been designed for image processing and computer games, in particular for high-throughput matrix and vector multiplication used in basic graphics transformations. Fortunately, this mathematical expression is similar to that of the convolutional layers in deep networks. The concept of the general-purpose GPU emerged in 2001, and programming frameworks such as OpenCL and CUDA followed. As a result, GPUs began to be adopted by the machine learning community around 2010.

ALEXNET

AlexNet was born in 2012. The model is named after Alex Krizhevsky, the first author of the paper [1]. AlexNet uses an 8-layer convolutional neural network and won the ImageNet 2012 image recognition challenge by a large margin. It proved for the first time that learned features can surpass hand-designed features, breaking the previous state of computer vision research.

The design principles of AlexNet and LeNet are very similar, but there are also significant differences.
First, in contrast to the relatively small LeNet, AlexNet consists of 8 layers of transformations: 5 convolutional layers, 2 fully connected hidden layers, and a fully connected output layer.

The convolution window in AlexNet's first layer has shape 11×11. Because the heights and widths of most ImageNet images are more than 10 times those of MNIST images, objects in ImageNet images occupy more pixels, so a larger convolution window is needed to capture them. The convolution window in the second layer is reduced to 5×5, and 3×3 windows are used afterwards. In addition, after the first, second, and fifth convolutional layers, max pooling layers with a window shape of 3×3 and a stride of 2 are used. Moreover, AlexNet uses dozens of times more convolution channels than LeNet.
Following the last convolutional layer are two fully connected layers, each with 4096 outputs. These two huge fully connected layers bring nearly 1 GB of model parameters. Due to the limited video memory of early GPUs, the earliest AlexNet used a dual data stream design so that each GPU only needed to process half of the model. Fortunately, GPU memory has developed considerably in the past few years, so we usually no longer need such a special design.

Second, AlexNet changed the sigmoid activation function to a simpler ReLU activation function.

On the one hand, the ReLU activation function is simpler to compute, since it does not involve the exponentiation operation found in the sigmoid activation function.

On the other hand, the ReLU activation function makes the model easier to train under different parameter initialization methods. This is because when the output of the sigmoid activation function is very close to 0 or 1, the gradient in these regions is almost 0, which prevents backpropagation from continuing to update some of the model parameters; in contrast, the gradient of the ReLU activation function in the positive interval is always 1. Therefore, if the model parameters are not initialized properly, the sigmoid function may produce gradients of almost 0 in the positive interval, preventing the model from being trained effectively.
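
A quick sketch of this gradient difference (my own check with autograd, not part of the original notes):

import torch

x = torch.tensor([-10., 0., 10.], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # near 0 for large |x|: roughly tensor([4.5e-05, 2.5e-01, 4.5e-05])

y = torch.tensor([-10., 0., 10.], requires_grad=True)
torch.relu(y).sum().backward()
print(y.grad)  # 0 in the negative interval, 1 in the positive interval: tensor([0., 0., 1.])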

Third, AlexNet uses dropout to control the model complexity of the fully connected layers, whereas LeNet does not use dropout.
Fourth, AlexNet introduces a large amount of image augmentation, such as flipping, cropping, and color changes, to further enlarge the dataset and alleviate overfitting.
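
As a rough illustration of such augmentation (these particular transforms are example choices; the simplified implementation below does not use augmentation), a torchvision pipeline might look like this:

import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),                                     # random cropping
    T.RandomHorizontalFlip(),                                     # random flipping
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # random color changes
    T.ToTensor(),
])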

Below we implement a slightly simplified AlexNet.

import time
import torch
from torch import nn, optim
import torchvision

import sys
sys.path.append("..") 
import d2lzh_pytorch as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(torch.__version__)
print(torchvision.__version__)
print(device)
class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()
        self.conv = nn.Sequential(
            # in_channels, out_channels, kernel_size, stride, padding
            nn.Conv2d(1, 96, 11, 4), 
            nn.ReLU(),
            # kernel_size, stride
            nn.MaxPool2d(3, 2), 
            # Reduce the convolution window; use padding of 2 to keep the input and output heights/widths equal, and increase the number of output channels
            nn.Conv2d(96, 256, 5, 1, 2),
            nn.ReLU(),
            nn.MaxPool2d(3, 2),
            # Three consecutive convolutional layers with even smaller convolution windows. Except for the last one, the number of output channels is increased further.
            # No pooling layers after the first two of these convolutional layers, so the input height and width are not reduced
            nn.Conv2d(256, 384, 3, 1, 1),
            nn.ReLU(),
            nn.Conv2d(384, 384, 3, 1, 1),
            nn.ReLU(),
            nn.Conv2d(384, 256, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(3, 2)
        )
        # The number of outputs of the fully connected layers here is several times larger than in LeNet. Dropout layers are used to mitigate overfitting
        self.fc = nn.Sequential(
            nn.Linear(256*5*5, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            # Output layer. Since Fashion-MNIST is used here, the number of classes is 10 rather than the 1000 in the paper
            nn.Linear(4096, 10),
        )

    def forward(self, img):
        feature = self.conv(img)
        output = self.fc(feature.view(img.shape[0], -1))
        return output

Instantiate the network and print it to see the network structure.
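
net = AlexNet()
print(net)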

AlexNet(
  (conv): Sequential(
    (0): Conv2d(1, 96, kernel_size=(11, 11), stride=(4, 4))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(96, 256, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(256, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU()
    (8): Conv2d(384, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU()
    (10): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU()
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Sequential(
    (0): Linear(in_features=6400, out_features=4096, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.5)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.5)
    (6): Linear(in_features=4096, out_features=10, bias=True)
  )
)
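
To see where the in_features=6400 of the first fully connected layer comes from, we can pass a dummy single-channel 224×224 image through the convolutional part (a quick check added here, not in the original notes): the result is a 256×5×5 feature map, and 256*5*5 = 6400.

X = torch.rand(1, 1, 224, 224)
print(net.conv(X).shape)  # torch.Size([1, 256, 5, 5])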

Read data

Although AlexNet uses the ImageNet dataset in the paper, ImageNet takes a long time to train on, so we still use the Fashion-MNIST dataset from earlier to demonstrate AlexNet. When reading the data, we add an extra step to enlarge the image height and width to 224, the size used by AlexNet. This can be done with a torchvision.transforms.Resize instance. In other words, we apply Resize before the ToTensor instance, and then use a Compose instance to chain these two transformations.

# This function is saved in the d2lzh_pytorch package for later use
def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    """Download the fashion mnist dataset and then load into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)

    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=4)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=4)

    return train_iter, test_iter
batch_size = 128
# If an "out of memory" error occurs, reduce batch_size or resize
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=224)
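
As a quick check (my own addition), a single batch from the resulting iterator should have the shape expected by AlexNet:

X, y = next(iter(train_iter))
print(X.shape, y.shape)  # torch.Size([128, 1, 224, 224]) torch.Size([128])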

Training

Now we can start training AlexNet. Compared with LeNet, because the images are larger and the model is bigger, training requires more video memory and more time.

lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

Output:

training on  cuda
epoch 1, loss 0.0047, train acc 0.770, test acc 0.865, time 128.3 sec
epoch 2, loss 0.0025, train acc 0.879, test acc 0.889, time 128.8 sec
epoch 3, loss 0.0022, train acc 0.898, test acc 0.901, time 130.4 sec
epoch 4, loss 0.0019, train acc 0.908, test acc 0.900, time 131.4 sec
epoch 5, loss 0.0018, train acc 0.913, test acc 0.902, time 129.9 sec

Summary

  • AlexNet is similar in structure to LeNet, but uses more convolutional layers and a larger parameter space to fit the large-scale ImageNet dataset. It is the dividing line between shallow and deep neural networks.
  • Although the implementation of AlexNet takes only a few more lines of code than that of LeNet, it took the academic community many years to embrace this conceptual change and to produce truly excellent experimental results.
