[ZJU-Machine Learning] Convolutional Neural Networks: The Most Popular Network Architecture of Recent Years

VGGNet

[Figure: VGGNet architecture]
Why stack two 3×3 convolution kernels together?

Two 3×3 convolution kernels stacked together have a 5×5 receptive field, so they can roughly replace a single 5×5 kernel; likewise, three stacked 3×3 kernels have a 7×7 receptive field and can replace a 7×7 kernel. The replacement needs fewer parameters: the ratio is 18:25 in the first case and 27:49 in the second.

In short: replace large convolution kernels with stacks of small ones to reduce the parameter count (at the cost of a deeper network and longer training time).
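
A quick sanity check of these ratios, as a plain-Python sketch (the channel count C is an arbitrary choice; biases are ignored, and each layer is assumed to have C input and C output channels):

```python
def conv_params(kernel_size: int, channels: int) -> int:
    """Weights in one conv layer: k * k * C_in * C_out (biases ignored)."""
    return kernel_size * kernel_size * channels * channels

C = 64  # arbitrary; the ratios below are independent of C

two_3x3   = 2 * conv_params(3, C)  # 18 * C^2, receptive field 5x5
one_5x5   = conv_params(5, C)      # 25 * C^2
three_3x3 = 3 * conv_params(3, C)  # 27 * C^2, receptive field 7x7
one_7x7   = conv_params(7, C)      # 49 * C^2

print(two_3x3, one_5x5)    # ratio 18:25
print(three_3x3, one_7x7)  # ratio 27:49
```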

GoogLeNet

[Figure: GoogLeNet architecture]

(1) 22 layers.
(2) The Inception structure: small 1×1, 3×3 and 5×5 convolution kernels are combined in a fixed pattern to replace a single large kernel, which increases the receptive field while reducing the number of parameters (see the sketch after the figures below).
(3) 5 million parameters, 12 times fewer than AlexNet.
(4) Winner of ILSVRC'14 (6.7% top-5 error).

Original Inception structure
[Figure: original Inception module]
Improved Inception structure
[Figure: improved Inception module with 1×1 dimension-reduction convolutions]
The improved Inception structure serves as the basic unit and is stacked repeatedly to form the whole network.
[Figure: GoogLeNet assembled from stacked Inception modules]
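
A minimal sketch of such a module, written in PyTorch for illustration (the structure follows the improved Inception design; the branch channel counts are the commonly cited inception(3a) values, but treat them as an assumption):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Improved Inception module: four parallel branches whose outputs are
    concatenated along the channel dimension. The 1x1 convolutions in front
    of the 3x3 and 5x5 branches reduce the channel count first."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # every branch preserves the spatial size, so the outputs can be
        # concatenated channel-wise
        return self.relu(torch.cat(
            [self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1))

x = torch.randn(1, 192, 28, 28)
y = Inception(192, 64, 96, 128, 16, 32, 32)(x)
print(y.shape)  # torch.Size([1, 256, 28, 28]) — 64 + 128 + 32 + 32 channels
```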

ResNet

[Figure: ResNet architecture]

(1) 152 layers.
(2) Winner of ILSVRC'15 (3.57% top-5 error).
(3) Shortcut connections are added: the feature map produced by an earlier layer is fed forward and added directly to the output of a later layer. This mechanism allows the training of very deep networks to converge.
[Figure: training/test error of a 20-layer vs. a 56-layer plain network]

(1) The authors first observed that a shallow network can outperform a deep one on both the training set and the test set, and that it stays ahead at every stage of training.
(2) In their example the 20-layer network beats the 56-layer one, so the extra 36 layers are contributing nothing at all. This suggests adding the output of the shallow layers directly to the later layers: even though the effective network becomes shallower, reducing the structure's space and time complexity, the loss does not get noticeably worse. Feeding the shallow output straight into the deeper layers therefore lets the deep network perform at least as well as the shallow one, and ideally better (a sketch follows the figure below).
[Figure: residual block with identity shortcut]
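
A minimal PyTorch sketch of the idea, for the same-shape case only (real ResNets also need a projection shortcut whenever the channel count or stride changes):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x. If the two conv layers learn
    nothing useful, F(x) can shrink toward zero and the block degenerates to
    the identity, so extra depth does not have to hurt."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the shortcut: add the input back in
```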

Training tricks (see the sketch after this list):
– Batch Normalization
– Xavier initialization
– SGD + Momentum (0.9)
– Learning rate: 0.1
– Batch size 256
– Weight decay 1e-5
– No dropout
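
Expressed in PyTorch for concreteness (a sketch under the settings above; the tiny stand-in model only exists to have parameters to initialize and optimize):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(               # stand-in model; any network works here
    nn.Conv2d(3, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64),              # Batch Normalization
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 1000),             # note: no dropout anywhere
)

# Xavier initialization of the weight tensors
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)

# SGD + momentum 0.9, learning rate 0.1, weight decay 1e-5
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-5)
# the batch size of 256 would be set in the DataLoader (not shown)
```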

Face verification based on ResNet

The key to face verification is to map face images into a feature space in which images of the same person lie close together and images of different people lie far apart. Here, classification supervision is used to learn such a feature space.
We use Caffe to implement face verification: first train a face classification network, then take its penultimate fully connected layer (512 dimensions) as the feature layer.
(1) Network structure
First define a 28-layer ResNet.

[Figure: definition of the 28-layer ResNet (train_val.prototxt)]
(2) Optimization objective
Next we design the objective function. Following the literature, we minimize:

L = L_S + λ·L_C = L_S + (λ/2) Σ_i ‖x_i − c_{y_i}‖²

The first term L_S is the softmax loss; the second term L_C is the center loss, which pulls each sample's feature x_i toward the center c_{y_i} of its class.

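A sketch of the combined objective in PyTorch (in the actual project this is implemented as a custom Caffe layer; the λ value and tensor shapes here are illustrative, and the class centers are updated by plain autograd rather than the paper's dedicated update rule):

```python
import torch
import torch.nn.functional as F

def center_loss(features, labels, centers):
    """L_C = 1/2 * mean_i ||x_i - c_{y_i}||^2, one learnable center per class."""
    return 0.5 * ((features - centers[labels]) ** 2).sum(dim=1).mean()

# toy batch: 4 samples, 512-D features (the feature layer above), 10 classes
feats   = torch.randn(4, 512)
logits  = torch.randn(4, 10)
labels  = torch.tensor([0, 2, 1, 0])
centers = torch.zeros(10, 512, requires_grad=True)

lam  = 0.01  # hypothetical weight on the center-loss term
loss = F.cross_entropy(logits, labels) + lam * center_loss(feats, labels, centers)
```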

Next, write a solver file that specifies the network to train and the necessary hyperparameters:

[Figure: example solver file]
Run the training script to start training. Training can be stopped at any time, the learning rate adjusted by hand, and training then resumed. With this we have essentially covered the whole flow of training a convolutional neural network with Caffe.
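
The same flow can also be driven from pycaffe; a sketch (the solver and snapshot paths, and the loss blob name, are assumptions):

```python
import caffe

caffe.set_mode_gpu()
solver = caffe.SGDSolver('face_solver.prototxt')  # hypothetical path

# to resume after editing the learning rate in the solver file:
# solver.restore('snapshots/face_iter_10000.solverstate')

solver.step(1000)                     # run 1000 training iterations
print(solver.net.blobs['loss'].data)  # current loss (blob name assumed)
```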

Modifying Caffe and adding new layers
Sometimes we need special layers or objective functions that Caffe does not provide, and then we have to implement them ourselves. Caffe builds on a number of libraries, which keeps its own code very concise and easy to modify.
(3) Training set and preprocessing
WebFace is used as the training set; it contains about 10,000 different people. Image preprocessing consists of face detection, facial keypoint detection, and alignment cropping. After detecting the facial keypoints, we estimate a similarity transformation matrix that aligns the face with a predefined standard shape (sketched below).
The final result is as follows:
[Figure: aligned and cropped face images]
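
The alignment step might look like this with OpenCV (a sketch; the landmark coordinates, image path, and output size are all hypothetical):

```python
import cv2
import numpy as np

# five detected facial keypoints (eyes, nose, mouth corners) — made-up values
detected = np.array([[75, 110], [125, 108], [100, 140],
                     [80, 170], [120, 168]], dtype=np.float32)
# the predefined standard shape they should map onto — also made-up
standard = np.array([[70, 92], [130, 92], [100, 125],
                     [78, 155], [122, 155]], dtype=np.float32)

# estimateAffinePartial2D restricts the fit to rotation + uniform scale +
# translation, i.e. a similarity transform
M, _ = cv2.estimateAffinePartial2D(detected, standard)

img = cv2.imread('face.jpg')                  # placeholder path
aligned = cv2.warpAffine(img, M, (200, 200))  # warp and crop to a fixed size
```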

Finally, part of the processed training set is split off as a validation set.

(4) Training
  1. Network structure: the improved ResNet (train_val.prototxt).
  2. Training uses SGD, lowering the learning rate every fixed number of steps. During training Caffe writes the loss to a log file; by parsing the log we can plot the loss curve (see the parsing sketch after the figure):

[Figure: training loss curve]
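
The log parsing itself can be a few lines of Python (a sketch; the log filename is a placeholder, and the regular expression assumes Caffe's usual "Iteration N ... loss = X" solver lines):

```python
import re
import matplotlib.pyplot as plt

iters, losses = [], []
pattern = re.compile(r'Iteration (\d+).*?loss = ([\d.eE+-]+)')
with open('caffe_train.log') as f:  # placeholder filename
    for line in f:
        m = pattern.search(line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()
```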

(5) Experimental results
The test set is the LFW dataset; the evaluation protocol is: given two pictures, decide whether they show the same person.
We decide whether two faces belong to the same person from the cosine distance between their features (see the sketch below). The final 10-fold cross-validation accuracy on this database is 99.18%.
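
The decision rule is simple enough to sketch directly (the threshold value is illustrative; in practice it would be tuned on the validation folds):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(feat1, feat2, threshold=0.5):
    # threshold=0.5 is illustrative, not the value used in the experiments
    return cosine_similarity(feat1, feat2) >= threshold
```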

[Figure: LFW verification results]

Deploy to hardware
[Figure: hardware deployment]

Transfer learning

There are two ways to apply it (a sketch of both follows):
1. Fine-tune the trained model on the dataset to be recognized, which may be very small; the gap between source and target domains can be large or small.
2. Use the trained model as a feature extractor: take its output features as the input to a new neural network and train that network.
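
In PyTorch the two methods might look like this (a sketch; the backbone choice, class count, and learning rates are illustrative):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights='IMAGENET1K_V1')  # pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 10)    # new head: 10 target classes

# Method 1: fine-tune the whole network on the (possibly small) target set,
# usually with a reduced learning rate.
finetune_opt = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Method 2: treat the pretrained network as a fixed feature extractor and
# train only the newly attached head.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith('fc')
extractor_opt = optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
```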

Origin: blog.csdn.net/qq_45654306/article/details/113419797