From AlexNet to MobileNet: A Tour of Deep Neural Networks

Abstract: On March 13, 2018, Shen Junnan from Harbin Institute of Technology gave a talk in the Yunqi Community introducing typical deep neural network models. This article traces the development of deep neural networks and describes in detail the structure and characteristics of the representative models at each stage.


Here are the highlights of the video content:

Starting from Questions

Starting from questions is a good way to learn, so this article focuses on the following three:
1. What is the difference between a DNN and a CNN? What is the relationship between them? How are they defined?
2. Why are DNNs so popular now, and how did they develop?
3. DNN structures are very complex; how can you actually try one yourself?
The mind map of this article is as follows:


Development Path

DNN - Definition and Concepts

In a convolutional neural network, convolution and pooling operations are stacked together to form the backbone of the CNN.
Inspired in part by the multi-layer pathway between the macaque retina and visual cortex, the deep neural network came into being and achieved good performance. A DNN is best thought of as an architectural style: a neural network whose depth goes well beyond a few similar layers, typically reaching dozens of layers or being composed of complex modules.


The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) leaderboard has been dominated by deep learning year after year. As models get deeper, the top-5 error rate keeps dropping and is now around 3.5%, while the human recognition error rate on the ImageNet dataset is about 5.1%. In other words, the recognition ability of current deep learning models has surpassed that of humans on this task.

From AlexNet to MobileNet

AlexNet

AlexNet was the model that first brought convolutional neural networks to the forefront of computer vision with breakthrough results.
AlexNet, proposed by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton, won the ILSVRC 2012 championship with a top-5 error rate of only 15.3%, a dramatic improvement over the 26.2% achieved by the runner-up using traditional methods.
Compared with the earlier LeNet, AlexNet stacked convolutional layers to make the model deeper and wider, and used GPUs to obtain results within an acceptable amount of time, which propelled the development of convolutional neural networks and of deep learning as a whole.
Below is the architecture of AlexNet:


The features of AlexNet are:
1. It was trained with the help of the ImageNet dataset, about 15 million labeled images in roughly 22,000 categories, which is close to the complex scenes of the real world.
2. It uses a deeper and wider CNN to increase learning capacity.
3. It uses ReLU as the activation function, which greatly improves training speed compared to Sigmoid.
4. It uses multiple GPUs to increase the capacity of the model.
5. It introduces competition between neurons through LRN (Local Response Normalization) to help generalization and improve performance.
6. It randomly drops some neurons through Dropout to avoid overfitting.
7. It avoids overfitting through data augmentation such as scaling, flipping, and cropping.
These became typical techniques used by deep neural networks.
When AlexNet was developed, the GTX 580 GPUs it used had only 3 GB of memory, so the model was creatively split across two graphics cards. The architecture is as follows (a code sketch follows the list):
1. The first layer is a convolutional layer operating on the 224x224x3 input image. Its parameters are: 96 convolution kernels of size 11x11x3 with stride 4, followed by LRN normalization and 2x2 max pooling.
2. The second layer is a convolutional layer that convolves only with the first layer's output on the same GPU. Its parameters are: 256 convolution kernels of size 5x5x48, followed by LRN normalization and 2x2 max pooling.
3. The third layer is a convolutional layer that convolves with all of the second layer's outputs. Its parameters are: 384 convolution kernels of size 3x3x256.
4. The fourth layer is a convolutional layer that convolves only with the third layer's output on the same GPU. Its parameters are: 384 convolution kernels of size 3x3x192.
5. The fifth layer is a convolutional layer that convolves only with the fourth layer's output on the same GPU. Its parameters are: 256 convolution kernels of size 3x3x192, followed by 2x2 max pooling.
6. The sixth layer is a fully connected layer with 4096 neurons.
7. The seventh layer is a fully connected layer with 4096 neurons.
8. The eighth layer is a fully connected layer producing a SoftMax over 1000 categories.
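To make the layer-by-layer description above concrete, here is a minimal single-GPU PyTorch sketch of an AlexNet-style network. It is an approximation for illustration, not the original two-GPU implementation: the split across two cards is dropped, padding values are chosen so the sizes work out, and a 227x227 input is used so that the first convolution yields 55x55 feature maps.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Single-GPU approximation of the eight-layer AlexNet described above."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Layer 1: 96 kernels of 11x11, stride 4, then LRN and 2x2 max pooling
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5), nn.MaxPool2d(kernel_size=2, stride=2),
            # Layer 2: 256 kernels of 5x5, then LRN and 2x2 max pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5), nn.MaxPool2d(kernel_size=2, stride=2),
            # Layers 3-5: 3x3 kernels with 384, 384 and 256 output channels
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            # Layers 6-8: two 4096-unit fully connected layers with Dropout,
            # then a linear layer giving 1000 class scores (SoftMax is applied in the loss)
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
print(model(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```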
VGGNet
VGGNet is a CNN model proposed by the Visual Geometry Group at Oxford. It won the ILSVRC 2014 localization task with an error rate of 25.3%, and in the classification task it was second only to GoogLeNet, with a top-5 error rate of 7.32%.
VGGNet and GoogLeNet independently explored deeper network structures, each with its own strengths in design. VGGNet inherits the design of AlexNet but makes further optimizations:
1. Deeper networks: the commonly used configurations have 16 and 19 layers and achieve good performance.
2. Simpler structure: only 3x3 convolution kernels and 2x2 max pooling are used, making it easier to explore the relationship between depth and performance (see the sketch below).
3. Influenced by Network in Network, some VGGNet configurations also use 1x1 convolution kernels.
4. Multiple GPUs are used for parallel training.
5. Local Response Normalization was abandoned because it proved ineffective.
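The design in point 2 above, stacking 3x3 convolutions and then applying 2x2 max pooling, can be sketched as a small reusable helper. The function below is hypothetical and only illustrates the pattern; the channel counts follow the common VGG-16 configuration.

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs):
    """A VGG-style block: several stacked 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Convolutional trunk of a VGG-16-like network: 2 + 2 + 3 + 3 + 3 = 13 conv layers;
# three fully connected layers (not shown) bring the total to 16 weight layers.
features = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2),
    vgg_block(128, 256, 3), vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
```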
The network structure is roughly as follows:


In deep learning, we often use techniques such as mean-centering, rotation, horizontal and vertical shifts, and horizontal flips to reduce overfitting through data augmentation.
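A minimal sketch of such an augmentation pipeline using torchvision transforms; the specific operations and parameter values here are illustrative choices, not ones prescribed by the talk.

```python
from torchvision import transforms

# Random crops, small rotations/shifts and horizontal flips enlarge the effective
# training set; normalization subtracts the per-channel mean (centering).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                           # random scale and crop
    transforms.RandomHorizontalFlip(),                           # horizontal flip
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),   # rotation and h/v shifts
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],             # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),
])
```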
ResNet
ResNet (Residual Neural Network) was proposed by Kaiming He and colleagues at Microsoft Research Asia. By using residual units it successfully trained a 152-layer deep neural network and won the ILSVRC 2015 classification competition with a top-5 error rate of 3.57%, while using far fewer parameters than VGGNet.
ResNet was motivated by the following problem: earlier work showed that depth is critical to model performance, yet accuracy degrades as depth keeps increasing. Surprisingly, this degradation is not caused by overfitting, because accuracy on the training set also drops. In the extreme case, if the additional layers were all identity mappings, the extra depth should at least not increase the error on the training set.
The solution is to introduce residual learning: let the input of a group of layers be x and the desired output be H(x). If we pass the input x directly to the output as an identity mapping, the nonlinear layers in between only need to learn the residual F(x) = H(x) - x. The conjecture is that optimizing this residual mapping is easier than optimizing the original mapping; in the extreme case the residual F(x) can simply be pushed to 0. As the figure shows:


The figure above shows the residual unit of ResNet. Its advantage is that during backpropagation the gradient can flow directly to earlier layers through the shortcut, which alleviates the vanishing-gradient problem and makes much deeper networks trainable. ResNet also uses Batch Normalization, so the residual unit is easier to train and generalizes better than before.
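A minimal PyTorch sketch of such a residual unit, assuming the simple two-convolution ("basic block") form with Batch Normalization; the shortcut adds the input x directly to the learned residual F(x).

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Basic residual unit: output = ReLU(F(x) + x), where F is two 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # F(x) = H(x) - x
        # The identity shortcut: gradients can flow through "+ x" unchanged during backprop.
        return self.relu(residual + x)
```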
GoogLeNet
GoogLeNet was proposed by Christian Szegedy and colleagues. The main idea is to use a deeper network to achieve better performance while keeping the computational cost under control through careful design.
GoogLeNet builds on the idea of Network in Network. The convolutional layer in AlexNet uses a linear convolution kernel to take inner products over image patches, each local output is passed through a nonlinear activation function, and the result is called a feature map. Such a convolution kernel is a generalized linear model, which implicitly assumes that the features are linearly separable during feature extraction, but this is often not true in practice. To address this, Network in Network proposes using a multilayer perceptron to implement nonlinear convolution, which in practice is equivalent to inserting 1x1 convolutions while keeping the size of the feature map unchanged.


The advantages of 1x1 convolutions are: they increase the ability to abstract local features through nonlinear transformations, they avoid fully connected layers and thus reduce overfitting, and they reduce dimensionality so that fewer parameters are needed. Network in Network confirms, in a sense, that deeper networks perform better.
GoogLeNet stacks Inception modules, building a deeper network out of a sparse structure. This keeps the amount of computation under control while preserving model performance, making it better suited to prediction in resource-limited scenarios.
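A simplified Inception-style module is sketched below for illustration; the channel counts loosely follow the first Inception module of GoogLeNet, and activations and other details are omitted for brevity. The 1x1 convolutions reduce the channel dimension before the more expensive 3x3 and 5x5 branches, and the branch outputs are concatenated along the channel axis.

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Simplified Inception module: parallel 1x1, 3x3, 5x5 and pooling branches."""
    def __init__(self, in_channels):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=1),        # 1x1 reduces channels first
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

out = InceptionSketch(192)(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28]) -> 64 + 128 + 32 + 32 channels
```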
MobileNet
Traditional CNN models tend to focus on accuracy but are impractical in mobile and embedded application scenarios. In response, Google proposed a new model architecture called MobileNet.
MobileNet is a small but high-performance CNN model that lets users run computer vision on mobile or embedded devices without relying on cloud computing power. As the computing power of mobile devices keeps growing, MobileNet helps bring AI onto the device itself.
MobileNet has the following characteristics: it reduces the number of parameters and the computational cost with depthwise separable convolutions; it introduces two global hyperparameters, a width multiplier and a resolution multiplier, which allow a trade-off between latency and accuracy and make it suitable for mobile and embedded devices; it achieves competitive performance, verified on tasks such as ImageNet classification; and it has proven practical in mobile applications such as object detection, fine-grained recognition, face attribute analysis, and large-scale geo-localization.
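The core building block, depthwise separable convolution, factors a standard convolution into a per-channel (depthwise) 3x3 convolution followed by a 1x1 pointwise convolution. The helper below is a minimal sketch for illustration, not Google's released implementation.

```python
import torch.nn as nn

def depthwise_separable_conv(in_channels, out_channels, stride=1):
    """Depthwise 3x3 convolution (one filter per input channel) + 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride,
                  padding=1, groups=in_channels, bias=False),   # depthwise: groups == channels
        nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # pointwise 1x1
        nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
    )

# A standard 3x3 convolution needs in*out*3*3 weights; this factorization needs
# in*3*3 + in*out, roughly an 8-9x reduction when out_channels is large.
block = depthwise_separable_conv(64, 128)
```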

Understanding through Implementation - VGGNet Style Transfer
Style transfer is one of the most interesting applications of deep learning. We can use this method to "transfer" the style of one image to another image to generate a new image.
The impact of deep learning is especially visible in computer vision: image classification, recognition, localization, super-resolution, transformation, transfer, captioning and so on can all be implemented with deep learning. The technology behind them can be summed up in one sentence: deep convolutional neural networks have superior image feature extraction capabilities.
The success of the style transfer algorithm rests mainly on two observations: 1. After passing through a pre-trained classification network, the closer the high-level features of two images are, the more similar the two images are in content. 2. After passing through a pre-trained classification network, the closer the correlations (Gram matrices) of their lower-level features are, the more similar the two images are in style. Based on these two points, a suitable loss function can be designed to optimize the network.
Deep convolutional classification networks have strong feature extraction capabilities, and the features extracted by different layers carry different meanings. Every trained network can be regarded as a good feature extractor. Moreover, a deep network is composed of layer upon layer of nonlinear functions, so it can be viewed as a complex multivariate nonlinear function mapping the input image to an output. A trained deep network can therefore be used as a loss function calculator.


The model structure is shown in the figure. The framework is divided into two parts: an image transformation network T (image transform net) and a pre-trained loss network, VGG-16. The image transformation network T takes the content image x as input and outputs the style-transferred image y'; the content image y_c, the style image y_s, and y' are then fed into VGG-16 to compute features.
In this deep neural network, the loss function for the output image y' has two parts: one for content and one for style.
The content loss is the perceptual loss computed on VGG-16 feature maps, L_content = ||φ_j(y') - φ_j(y_c)||², where φ_j denotes the features extracted at layer j. The style loss compares Gram matrices of the feature maps, L_style = ||G(φ_j(y')) - G(φ_j(y_s))||², where G is the Gram matrix, computed by flattening a feature map into a C x (H·W) matrix and multiplying it by its own transpose (normalized by C·H·W).
The total loss is a weighted sum of the two, L_total = λ_c·L_content + λ_s·L_style, which is minimized to train the transformation network T.
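A minimal sketch of these two losses in PyTorch, assuming the VGG-16 feature maps (shape: batch x channels x height x width) have already been extracted; the weighting of the two terms is left as hyperparameters.

```python
import torch

def gram_matrix(phi):
    """Gram matrix of a feature map: channel-to-channel correlations, normalized by size."""
    b, c, h, w = phi.shape
    feats = phi.view(b, c, h * w)
    return feats @ feats.transpose(1, 2) / (c * h * w)

def content_loss(phi_y, phi_yc):
    """Perceptual (content) loss: squared distance between feature maps."""
    return torch.mean((phi_y - phi_yc) ** 2)

def style_loss(phi_y, phi_ys):
    """Style loss: squared distance between Gram matrices of the feature maps."""
    return torch.mean((gram_matrix(phi_y) - gram_matrix(phi_ys)) ** 2)

# total loss = lambda_c * content_loss(...) + lambda_s * style_loss(...)
```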


This article is the original content of Yunqi Community and may not be reproduced without permission.
