cs231n : Transfer Learning

In practice, very few people train an entire convolutional neural network from scratch (with random initialization), because it is rare to have a dataset of sufficient size. Instead, it is common to pre-train a network on a large dataset such as ImageNet (1.2 million images in 1000 categories), and then use that network either as an initialization or as a fixed feature extractor for the task of interest. The three main transfer learning scenarios are as follows:

1. Convolutional neural network as feature extractor
Take a convolutional neural network pre-trained on ImageNet, remove its final fully connected layer (the layer whose outputs are the 1000 class scores), and treat the remaining network as a fixed feature extractor for the new dataset. In AlexNet, this extractor computes a 4096-dimensional vector for each image: the activations of the hidden layer immediately before the classifier. We call these features CNN codes. If these activations were thresholded during training on ImageNet (i.e. passed through a ReLU, as is usually the case), it is important for performance to apply the ReLU here as well. Once the 4096-dimensional codes have been extracted for all images, we can train a linear classifier on the new dataset.

2. Finetune Convolutional Neural Network
The second strategy is not only to replace and retrain the classifier on top of the network for the new dataset, but also to fine-tune the weights of the pretrained network by continuing backpropagation. We can fine-tune all layers of the network, or keep the earlier layers fixed (to guard against overfitting) and fine-tune only the later layers. The motivation is the observation that the earlier layers of a network learn generic features (such as edge detectors or color blob detectors) that are useful for many tasks, while the later layers become progressively more specific to the classes in the original dataset. In the case of ImageNet, which contains many breeds of dog, a significant portion of the network's representational power may be devoted to distinguishing between kinds of dogs.

3. Pre-trained model
Since training on ImageNet can take 2-3 weeks across multiple GPUs, a common practice is to release final network checkpoints so that others can use them for fine-tuning. For example, the Caffe library has a Model Zoo where people share their network weights.


When and how to fine-tune?
How do you decide which type of transfer learning to perform on a new dataset? Two of the most important determinants are the size of the new dataset and its similarity to the original dataset. Here are some common rules of thumb:
1) The new dataset is small and similar to the original dataset: fine-tuning would risk overfitting, but since the data is similar the higher-level features remain relevant, so train a linear classifier on the CNN codes.
2) The new dataset is large and similar to the original dataset: with more data we can be more confident that we won't overfit, so fine-tune through the full network.
3) The new dataset is small and very different from the original dataset: only train a linear classifier, but on activations from somewhere earlier in the network, since the topmost features are too specific to the original dataset.
4) The new dataset is large and very different from the original dataset: we have enough data and confidence to fine-tune through the entire network.


Practical advice
1. If you use a pretrained model, your architecture choices for the new task are constrained. For example, you can't arbitrarily remove convolutional layers from the pretrained network.
2. The learning rate is generally set relatively small, because we don't want to distort the pretrained convolutional weights too quickly or too much, while the new, randomly initialized classifier on top can tolerate larger updates.
