【Data Science】Transfer Learning with Differential Learning Rates

Transfer Learning using differential learning rates

In this post, I will share how one can adapt popular deep learning models to a specific task using transfer learning. We will cover concepts like differential learning rates, which are not yet implemented in some deep learning libraries. I learned about these as an international fellow of the fast.ai deep learning course 2017. The course content will be made available to the general public as a MOOC in early 2018.

So what is transfer learning?

It is the process of taking the knowledge learned in one task/activity and applying it to a different task. Let us take a small example: a player who is good at carroms can apply that knowledge to learn how to play a game of pool.

So coming to our machine learning/deep learning perspective…

The same applies to our machine learning world. Let’s have a look at the picture below.

Transfer Learning Example

Let’s say model A’s task is to identify 1,000 kinds of objects like hats, cats, and mats, and we have such a trained model at hand. Now suppose we want to create a model B that is a cat/dog classifier. Even if we have a small dataset, we can use the knowledge of model A during the training of model B and produce state-of-the-art results.

But why should one use transfer learning?

Whenever we want to solve a unique problem using machine learning, the chances are high that we will not find enough data for our model. Training with little data tends to give poor results. And even if we have a large dataset, we may not have enough resources, such as GPUs, to obtain high-quality results. Transfer learning addresses these problems by reusing the knowledge captured in a pre-trained model that someone else has created with large datasets and resources.

Ok, So how do you do it?

Let’s understand it using a sample network diagram of a CNN, although in practice the networks are larger, more complex, and contain various other units.

Sample CNN diagram

The most common way of doing it is to modify the dense layers so that the network output suits our task at hand and to train only the newly added layers. This works decently when the tasks are related to each other and you have a small amount of data. For example, if we are using a pre-trained model which already knows how to detect cats, this approach will work if we want to create a cat/dog classifier with a small amount of data.
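A minimal PyTorch sketch of this first approach (the ResNet-34 backbone, the two-class head, and the learning rate are illustrative assumptions on my part, not choices from the course):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet (1000 classes of hats, cats, mats, ...).
model = models.resnet34(pretrained=True)

# Freeze all pre-trained weights so only the new layers will be updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final dense layer with one that suits our task (cat vs. dog).
model.fc = nn.Linear(model.fc.in_features, 2)

# Pass only the new layer's parameters to the optimizer.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
```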

The second approach is to also include the convolution blocks closer to the dense layers (the blue ones in the diagram) in training. This is more suitable if we have a medium-sized dataset and the tasks are not so closely related.
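The second approach only differs in which parameters stay frozen. Here is one hedged way to also unfreeze the last convolutional block, again assuming a torchvision ResNet-34 (where that block is called `layer4`):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet34(pretrained=True)

# Freeze everything first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze the convolution block closest to the dense layers.
for param in model.layer4.parameters():
    param.requires_grad = True

# The new head is trainable by default.
model.fc = nn.Linear(model.fc.in_features, 2)

# Train the unfrozen conv block together with the new head.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9,
)
```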

The third approach is to start from the pre-trained model but train all of its layers on our dataset. This works well but requires a relatively large dataset and GPU resources. There are a couple of useful tricks for this option, which we cover below.

Mixing the first and third approaches:

If we decide to use the third approach, it is better to first apply the first approach for a few epochs of training to bring the newly added layers’ weights to a better point. Then we can unfreeze all the layers and follow the third approach.
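A rough sketch of this two-phase recipe follows; the epoch counts, learning rates, and the `train_loader` are placeholders I made up for illustration, not values from the course:

```python
import torch
import torch.nn as nn
from torchvision import models

def run_epochs(model, optimizer, loader, epochs):
    # Plain cross-entropy training loop, kept minimal on purpose.
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()

model = models.resnet34(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)

# Phase 1 (first approach): train only the new head for a few epochs.
head_optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
run_epochs(model, head_optimizer, train_loader, epochs=2)  # train_loader: your own DataLoader

# Phase 2 (third approach): unfreeze everything and fine-tune the whole network.
for param in model.parameters():
    param.requires_grad = True
full_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
run_epochs(model, full_optimizer, train_loader, epochs=5)
```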

Differential learning rates:

The phrase “differential learning rates” means having different learning rates for different parts of the network during training. The idea is to divide the layers into layer groups and set a different learning rate for each group so that we get the best results. In simple terms, we control the rate at which the weights change in each part of our network during training.

Why? How does it help?

If we consider the third approach above, there is a small but significant point to be noticed. To understand it let’s go back to our sample CNN figure.

Sample CNN with differential learning rates

In general, the layers in red learn generic features like edges and shapes, while the middle blue layers learn details specific to the dataset on which the model was trained.

Given the above, it’s not a good idea to change the learned weights of the initial layers too much, because they are already good at what they are supposed to do (detecting features like edges, etc.). The middle layers have knowledge of complex features that might help our task to some extent if we modify them slightly, so we want to fine-tune them a little.

Differential learning rates help us in this regard. We can now split the sample network into three layer groups (red, blue and green) and set a different learning rate for each. The initial red layers get a small learning rate, as we don’t want to disturb them much; the middle blue layers get a learning rate higher than the initial layers; and the final green layers get the highest learning rate that works well.
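Although, as noted later in the post, most libraries do not ship this as a built-in feature, the idea can be expressed in plain PyTorch through optimizer parameter groups. The split points and the actual learning rate values below are illustrative assumptions, not values from the course:

```python
from itertools import chain

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

# Mirror the red / blue / green split with three parameter groups.
early_params = chain(model.conv1.parameters(), model.bn1.parameters(),
                     model.layer1.parameters(), model.layer2.parameters())
middle_params = chain(model.layer3.parameters(), model.layer4.parameters())
head_params = model.fc.parameters()

# One learning rate per layer group: small for the generic early layers,
# medium for the middle layers, largest for the newly added head.
optimizer = torch.optim.SGD([
    {"params": early_params,  "lr": 1e-4},
    {"params": middle_params, "lr": 1e-3},
    {"params": head_params,   "lr": 1e-2},
], momentum=0.9)
```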

How small or large the learning rates for the initial and middle layers should be depends on how closely the data of the pre-trained model correlates with our target task. For example, if the task is to create a dog/cat classifier and our pre-trained model is already good at recognizing cats, then we can use smaller learning rates. But if our task is to build a model on satellite or medical imagery, we will use slightly higher learning rates.

Please note that most of the deep learning libraries currently do not support differential learning rates.

Ok so show me the code….

I will write up a detailed post on the coding part of everything we have discussed.

Conclusion:

Depending on the task at hand and the available resources, one should choose the appropriate transfer learning approach. In general, if we have a good amount of data and resources, transfer learning with differential learning rates will yield better results.

Reposted from blog.csdn.net/debug_snail/article/details/79289768