VGGNet
Why stack two 3 * 3 convolution kernels together?
Two 3 × 3 convolution kernels stacked together have an effective receptive field of 5 × 5 (three stacked give 7 × 7), so a stack of small kernels can roughly replace one large kernel. Doing so uses fewer parameters: per input/output channel pair, two 3 × 3 kernels cost 18 weights versus 25 for a single 5 × 5, and three cost 27 versus 49 for a single 7 × 7. The extra non-linearity between the stacked layers also adds discriminative power.
Use small convolution kernels in place of large ones to keep the same receptive field with fewer parameters (at the cost of a deeper network and longer training time).
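The receptive-field and parameter arithmetic above can be checked with a short sketch (function names here are illustrative, not from any framework):

```python
# Minimal sketch: effective receptive field of stacked k x k convolutions
# (stride 1, no dilation) and the corresponding weight counts.

def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked kernel x kernel convs, stride 1."""
    rf = 1
    for _ in range(layers):
        rf += kernel - 1  # each layer widens the field by (k - 1)
    return rf

def conv_params(kernel: int, channels: int) -> int:
    """Weights of one k x k conv with `channels` in and out (biases ignored)."""
    return kernel * kernel * channels * channels

print(stacked_receptive_field(3, 2))             # 5: two 3x3 convs see 5x5
print(stacked_receptive_field(3, 3))             # 7: three 3x3 convs see 7x7
print(2 * conv_params(3, 1), conv_params(5, 1))  # 18 vs 25 weights
print(3 * conv_params(3, 1), conv_params(7, 1))  # 27 vs 49 weights
```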
GoogLeNet
(1) 22 layers
(2) Inception structure: 1 × 1, 3 × 3 and 5 × 5 small convolution kernels are combined in a fixed pattern to replace a single large kernel, increasing the receptive field while reducing parameters.
(3) About 5 million parameters, roughly 12× fewer than AlexNet.
(4) ILSVRC'14 classification winner (6.7% top-5 error).
Initial Inception structure
Improved Inception structure
The improved Inception structure serves as the basic building block and is stacked repeatedly to form the entire network.
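The parameter saving from the improved Inception module's 1 × 1 "bottleneck" can be illustrated with a quick count. The channel sizes below follow the 5 × 5 branch of inception(3a) in the GoogLeNet paper; treat them as illustrative:

```python
# Sketch: why a 1x1 bottleneck before a 5x5 conv cuts parameters.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out  # biases ignored

direct = conv_params(5, 192, 32)  # one 5x5 conv mapping 192 -> 32 channels
bottleneck = conv_params(1, 192, 16) + conv_params(5, 16, 32)  # 1x1 then 5x5
print(direct, bottleneck)  # 153600 vs 15872: roughly 10x fewer weights
```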
ResNet
(1) 152 layers
(2) ILSVRC'15 winner (3.57% top-5 error)
(3) Shortcut (skip) connections are added: the feature map from an earlier layer is added directly to the output of a later layer. This identity shortcut is what allows very deep networks to converge during training.
(1) The authors first observed the degradation problem: a plain 20-layer network outperforms a plain 56-layer network on both the training set and the test set, and does so at every stage of training.
(2) Since the 20-layer network already does better, the extra 36 layers should at worst learn to do nothing (the identity mapping). This motivated adding the output of earlier layers directly to later layers.
With identity shortcuts, the extra layers only need to learn a residual on top of the input; if the residual is zero, the deep network behaves like the shallower one, so adding depth no longer hurts, and in practice the deep network performs better.
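The residual idea above can be sketched in a few lines of numpy (shapes and the two-layer structure here are illustrative):

```python
import numpy as np

# Minimal sketch of a residual block: output = relu(F(x) + x).
# If the block's weights are zero, F(x) = 0 and the block reduces to the
# identity mapping (up to the final ReLU) -- extra depth cannot hurt.

rng = np.random.default_rng(0)

def residual_block(x, w1, w2):
    h = np.maximum(0, x @ w1)      # first layer + ReLU
    fx = h @ w2                    # second layer: the residual F(x)
    return np.maximum(0, fx + x)   # identity shortcut added before ReLU

x = rng.standard_normal(8)
w_zero = np.zeros((8, 8))
out = residual_block(x, w_zero, w_zero)
# With zero weights the block is just relu(x):
assert np.allclose(out, np.maximum(0, x))
```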
Training tricks:
– Batch Normalization
– Xavier initialization
– SGD + Momentum (0.9)
– Learning rate: 0.1
– Batch size 256
– Weight decay 1e-5
– No dropout
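The update rule implied by this recipe (SGD with momentum 0.9 and weight decay 1e-5) can be sketched as follows; the gradient here is a made-up placeholder, since in practice it comes from backpropagation:

```python
import numpy as np

# Sketch of SGD + momentum with L2 weight decay folded into the gradient.
lr, momentum, weight_decay = 0.1, 0.9, 1e-5

w = np.ones(4)                             # toy parameters
v = np.zeros_like(w)                       # velocity buffer
grad = np.array([0.5, -0.5, 1.0, 0.0])     # fake gradient (stands in for backprop)

for _ in range(3):
    g = grad + weight_decay * w            # weight decay = L2 penalty gradient
    v = momentum * v - lr * g              # velocity update
    w = w + v                              # parameter step
```

Note that `w[3]`, whose data gradient is zero, still shrinks slightly each step: that drift is the weight-decay term acting alone.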
Face verification based on ResNet
The key to face verification is to map face images into a feature space in which the same person is closer and different people are further apart. Here we use classification supervision to learn such a feature space.
Use Caffe to implement face verification. First train a face classification network, and then use the penultimate fully connected layer of the network as the feature layer (512 dimensions).
(1) Network structure
First define a 28-layer ResNet.
(2) Optimization objective
Design the objective function. Following the center-loss literature, we minimize (the formula below is reconstructed from the description of the two terms):
L = L_softmax + λ · L_center, where L_center = ½ Σ_i ||x_i − c_{y_i}||²
The first term is the softmax loss; the second is the center loss, which pulls each feature x_i toward the center c_{y_i} of its class.
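The center-loss term can be sketched directly in numpy (feature values and centers below are made up for illustration; in training the centers are updated alongside the network):

```python
import numpy as np

# Sketch of the center-loss term: L_c = 0.5 * sum_i ||x_i - c_{y_i}||^2,
# where c_{y_i} is the feature-space center of sample i's class.

def center_loss(features, labels, centers):
    diffs = features - centers[labels]   # x_i - c_{y_i}, per sample
    return 0.5 * np.sum(diffs ** 2)

feats = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
centers = np.array([[1.0, 0.0], [0.0, 0.0]])  # class-1 center is off by 1
print(center_loss(feats, labels, centers))    # 0.5
```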
Write a solver file that specifies the network to be trained and the necessary training parameters.
Run the training script to start training. Training can be interrupted at any time to adjust the learning rate manually, then resumed. With this, the basic workflow of training a convolutional neural network with Caffe is complete.
Modifying Caffe and adding new layers
Some tasks require special layers or objective functions that Caffe does not provide out of the box; in that case we need to implement them ourselves. Caffe delegates much of the heavy lifting to external libraries, which keeps its own code concise and easy to modify.
(3) Training set and preprocessing
WebFace is used as the training set, containing about 10,000 different identities. Image preprocessing includes face detection, facial keypoint detection, and alignment cropping. After detecting the facial keypoints, we estimate a similarity transformation matrix that aligns the face with a predefined standard shape.
The final result is as follows:
Finally, a part of the processed training set is divided as a validation set.
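The alignment step described above can be sketched with the standard Umeyama least-squares estimate of a 2D similarity transform (the landmark coordinates below are made up; real pipelines use detected eye/nose/mouth points):

```python
import numpy as np

# Sketch: estimate a 2D similarity transform (scale, rotation, translation)
# mapping source landmarks onto a target template, via the Umeyama method.

def similarity_transform(src, dst):
    """Return a 2x3 matrix M so that dst ~= src @ M[:, :2].T + M[:, 2]."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))      # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt                          # optimal rotation
    scale = np.trace(np.diag(S) @ D) / src_c.var(0).sum()
    t = mu_d - scale * R @ mu_s
    return np.hstack([scale * R, t[:, None]])

# Sanity check: aligning a template to itself yields the identity transform.
template = np.array([[30.0, 50.0], [70.0, 50.0], [50.0, 80.0]])
M = similarity_transform(template, template)
assert np.allclose(M[:, :2], np.eye(2)) and np.allclose(M[:, 2], 0)
```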
(4) Training
- Network structure: improved ResNet (train_val.prototxt)
- Training uses SGD, reducing the learning rate every fixed number of steps. During training, Caffe records the loss in a log file; parsing the log file lets us plot the loss curve:
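Extracting (iteration, loss) pairs from the log is a small regex job. The log line format below is assumed from typical Caffe solver output; adjust the pattern to match the actual log:

```python
import re

# Sketch: pull (iteration, loss) pairs out of a Caffe-style training log.
LOG = """\
I0101 12:00:00  1234 solver.cpp:218] Iteration 100 (2.5 iter/s), loss = 0.693
I0101 12:00:40  1234 solver.cpp:218] Iteration 200 (2.5 iter/s), loss = 0.512
"""

pattern = re.compile(r"Iteration (\d+).*?loss = ([\d.]+)")
points = [(int(m.group(1)), float(m.group(2))) for m in pattern.finditer(LOG)]
print(points)  # [(100, 0.693), (200, 0.512)]
# `points` can then be plotted (e.g. with matplotlib) to get the loss curve.
```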
(5) Experimental results
The test set is the LFW dataset; the evaluation protocol is: given two images, decide whether they show the same person.
We determine whether two faces belong to the same person based on the cosine distance between features. The final 10-fold cross-validation accuracy on this database was: 99.18%.
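The verification rule can be sketched as a thresholded cosine similarity between feature vectors; the 512-dim random features and the threshold value below are illustrative only (a real threshold is tuned on validation pairs):

```python
import numpy as np

# Sketch: "same person" iff cosine similarity of the two features exceeds
# a tuned threshold.

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(f1, f2, threshold=0.5):
    return cosine_similarity(f1, f2) > threshold

rng = np.random.default_rng(0)
f = rng.standard_normal(512)
print(same_person(f, f))  # True: identical features have similarity ~1
# For two independent random 512-dim vectors the similarity is near 0,
# so a random pair is very likely classified as different people.
```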
Deploy to hardware
Transfer learning
There are two common ways to apply it:
1. Fine-tune the trained model on the target dataset (which may be very small); the gap between source and target domains can be large or small.
2. Use the trained model as a fixed feature extractor: take its output features as inputs to a new network that is then trained for the target task.