Transfer Learning Application

Transfer Learning:

1. Fine-tuning and feature extraction

In *fine-tuning*, we start with a pre-trained model and update all model parameters for our new task, essentially retraining the entire model.
In *feature extraction*, we start with a pretrained model and only update the final layer weights from which predictions are derived. It is called feature extraction because we use a pretrained CNN as a fixed feature extractor and only change the output layer.

Both transfer learning methods follow the same basic steps (a minimal sketch follows this list):

  1. Initialize the pre-trained model
  2. Reshape the last layer to have the same number of outputs as the number of classes in the new dataset
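
To make these two steps concrete, here is a minimal sketch using torchvision's resnet18; it is only an illustration, and the full, model-specific version is the initialize_model function later in this post.

import torch.nn as nn
from torchvision import models

num_classes = 2  # e.g. ants vs. bees

# 1. Initialize the pre-trained model
model = models.resnet18(pretrained=True)

# 2. Reshape the last layer to output num_classes scores instead of the 1000 ImageNet classes
model.fc = nn.Linear(model.fc.in_features, num_classes)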

Import packages:

from __future__ import print_function
from __future__ import division
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
print("PyTorch Version: ",torch.__version__)
print("Torchvision Version: ",torchvision.__version__)

Dataset: https://download.pytorch.org/tutorial/hymenoptera_data.zip

Below are all of the parameters to change for this run. We will use the hymenoptera_data dataset, available from the link above. This
dataset contains two classes, *bees* and *ants*, and is structured so that we can use the ImageFolder dataset class without writing our
own custom dataset. Download the data and set data_dir to the root directory of the dataset. model_name is the name of the model you want to use
and must be selected from this list:

[resnet, alexnet, vgg, squeezenet, densenet, inception]

Other inputs are as follows:

num_classes is the number of categories in the dataset,

batch_size is the training batch size, which can be adjusted according to the computing power of your machine,

num_epochs is the number of training epochs we want to run, and
feature_extract is a boolean that defines whether we are fine-tuning or feature extracting. If feature_extract = False, the model is fine-tuned and all model parameters are updated. If feature_extract = True, only the parameters of the last layer are updated and the other parameters remain unchanged.

# Top level data directory. Here we assume the directory format conforms to the ImageFolder structure
data_dir = "./data/hymenoptera_data"
# Model to choose from [resnet, alexnet, vgg, squeezenet, densenet, inception]
model_name = "squeezenet"
# Number of classes in the dataset
num_classes = 2
# Batch size for training (change depending on how much memory you have)
batch_size = 8
# Number of epochs to train for
num_epochs = 15
# Flag for feature extraction. When False, we fine-tune the whole model;
# when True we only update the reshaped layer parameters
feature_extract = True

 2. Helper functions

Before writing the code to tune the model, let's define some helper functions.

Model Training and Validation

The train_model function handles the training and validation of a given model. As input, it takes a PyTorch model,
a dictionary of dataloaders, a loss function, an optimizer, the number of epochs to train and validate for, and a boolean flag indicating whether the model is an Inception model.
The is_inception flag is used to accommodate the Inception v3 model, since that architecture uses an auxiliary output and the overall model loss
is computed from both the auxiliary and the final output, as described here.
The function trains for the specified number of epochs and runs a full validation step after each epoch. It also keeps track of the best performing model (in terms of validation accuracy) and returns the
best performing model at the end of training. After each epoch, the training and validation accuracies are printed.

def train_model(model, dataloaders, criterion, optimizer, num_epochs=25, is_inception=False):
    since = time.time()
    val_acc_history = []
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        # Each epoch has a training and a validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0
            # Iterate over data
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                # Zero the parameter gradients
                optimizer.zero_grad()
                # Forward pass
                # Track history only in train
                with torch.set_grad_enabled(phase == 'train'):
                    # Get model outputs and calculate the loss.
                    # Special case for Inception because in training it has an auxiliary output.
                    # In train mode we calculate the loss by summing the final output and the
                    # auxiliary output, but in test we only consider the final output.
                    if is_inception and phase == 'train':
                        # From https://discuss.pytorch.org/t/how-to-optimize-inception-model-with-auxiliary-classifiers/7958
                        outputs, aux_outputs = model(inputs)
                        loss1 = criterion(outputs, labels)
                        loss2 = criterion(aux_outputs, labels)
                        loss = loss1 + 0.4 * loss2
                    else:
                        outputs = model(inputs)
                        loss = criterion(outputs, labels)

                    _, preds = torch.max(outputs, 1)
                    # Backward + optimize only in the training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # Statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))
            # Deep copy the model if it is the best seen so far
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
            if phase == 'val':
                val_acc_history.append(epoch_acc)

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))
    # Load best model weights
    model.load_state_dict(best_model_wts)
    return model, val_acc_history

Set the .requires_grad property of the model parameters
This helper function sets the .requires_grad property of the parameters in the model to False when we do feature extraction.

By default, when we load a pre-trained model all of the parameters have .requires_grad=True, which is fine if we are training from scratch or fine-tuning.

However, if we are doing feature extraction and only want to compute gradients for the newly initialized layer, then we want all of the other parameters to not require gradients.

def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False

3. Initialize and reshape the network

Now we reshape each network.

Note that this is not an automatic procedure and is unique to each model. Recall that the last layer of a CNN model (usually a fully connected layer)
has the same number of nodes as the number of output classes in the dataset. Since all of the models have been pre-trained on ImageNet, they all have an output layer of size 1000, with one node per class. The goal here is to reshape the last layer so that it has the same number of inputs as before and the same number of outputs as the number of classes in the dataset. In the following sections, we discuss how to change the architecture of each model.

But first, an important detail about the difference between fine-tuning and feature extraction.
When doing feature extraction, we only want to update the parameters of the last layer; in other words, we only want to update the parameters of the layer we are reshaping.
Therefore, we do not need to compute gradients for the parameters that we are not changing, so for efficiency we set the .requires_grad attribute of the other layers
to False. This is important because by default this attribute is set to True. Then, when
we initialize the new layer, its parameters have .requires_grad=True by default, so only the new layer's parameters are updated. When we
fine-tune, we can leave all of the .requires_grad attributes at their default value of True.
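
The small sketch below (not part of the tutorial flow, using resnet18 as a stand-in) illustrates this behavior: after freezing all parameters and replacing the final layer, only the new layer's parameters report requires_grad=True.

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                      # freeze everything (feature extraction)
model.fc = nn.Linear(model.fc.in_features, 2)        # newly created layer: requires_grad=True by default

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']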

 

4. Network selection

4.1 Resnet

The paper Deep Residual Learning for Image Recognition introduced the Resnet model.

There are several variants of different sizes, including Resnet18, Resnet34, Resnet50, Resnet101, and Resnet152, all of which are available from torchvision.models. Since our dataset is small with only two classes, we use Resnet18. When we print this model, we see that the last layer is a fully connected layer, as shown below:

(fc): Linear(in_features=512, out_features=1000, bias=True)

Therefore, we must reinitialize model.fc as a linear layer with 512 input features and 2 output features:

model.fc = nn.Linear(512, num_classes)

4.2 Alexnet

Alexnet was introduced in the paper ImageNet Classification with Deep Convolutional Neural Networks and was the first very successful CNN on the ImageNet dataset. When we print the model architecture, we see that the model output comes from layer 6 of the classifier:

(classifier): Sequential(
...
Linear(in_features=4096, out_features=1000, bias=True))

To use this model with our dataset, we reinitialize this layer as:

model.classifier[6] = nn.Linear(4096, num_classes)

4.3 VGG

VGG was introduced in the paper Very Deep Convolutional Networks for Large-Scale Image Recognition.
Torchvision provides 8 versions of VGG of different lengths, some of which have batch normalization layers. Here we use
VGG-11 with batch normalization. The output layer is similar to Alexnet, namely:

(classifier): Sequential(
...
Linear(in_features=4096, out_features=1000, bias=True))

Therefore, we modify the output layer in the same way:

model.classifier[6] = nn.Linear(4096, num_classes)

4.4 Squeezenet

The paper SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size describes the Squeezenet architecture, which uses a different output structure than any of the other models shown here. Torchvision has two versions of Squeezenet; we use version 1.0. The output comes from a 1x1 convolutional layer, which is the first layer of the classifier:

(classifier): Sequential(
(0): Dropout(p=0.5)
(1): Conv2d(512, 1000, kernel_size=(1, 1), stride=(1, 1))
(2): ReLU(inplace)
(3): AvgPool2d(kernel_size=13, stride=1, padding=0))

To modify the network, we reinitialize the Conv2d layer so that the output feature map has depth 2:

model.classifier[1] = nn.Conv2d(512, num_classes, kernel_size=(1,1), stride=(1,1))

4.5 Densenet

The paper Densely Connected Convolutional Networks introduced the Densenet model. Torchvision has four
Densenet variants, but here we only use Densenet-121. The output layer is a linear layer with 1024 input features:

(classifier): Linear(in_features=1024, out_features=1000, bias=True)

To reshape this network, we reinitialize the classifier's linear layer as:

model.classifier = nn.Linear(1024, num_classes)

4.6 Inception v3

Inception v3 was first described in the paper Rethinking the Inception Architecture for Computer Vision. This network is unique in that it has two output layers when training. The second output is known as the auxiliary output and is contained in the AuxLogits part of the network. The primary output is the linear layer at the end of the network. Note that when testing we only consider the primary output.

To fine-tune this model, we must reshape both of these layers. This can be done as shown below. Note that many models have a similar output structure, but each handles it slightly differently. Also, check the printed architecture of the reshaped network and make sure that the number of output features matches the number of classes in the dataset.
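
Concretely, the two reshaped layers look like this, following the same pattern as the initialize_model function in the next section (768 and 2048 are the input feature sizes of torchvision's Inception v3 auxiliary and final classifiers):

model.AuxLogits.fc = nn.Linear(768, num_classes)
model.fc = nn.Linear(2048, num_classes)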

 

5. Reshaping code

def initialize_model(model_name, num_classes, feature_extract, use_pretrained=True):
    # Initialize these variables, which will be set in the if statement below.
    # Each of these variables is model specific.
    model_ft = None
    input_size = 0

    if model_name == "resnet":
        """ Resnet18 """
        model_ft = models.resnet18(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        num_ftrs = model_ft.fc.in_features
        model_ft.fc = nn.Linear(num_ftrs, num_classes)
        input_size = 224
    elif model_name == "alexnet":
        """ Alexnet """
        model_ft = models.alexnet(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        num_ftrs = model_ft.classifier[6].in_features
        model_ft.classifier[6] = nn.Linear(num_ftrs, num_classes)
        input_size = 224
    elif model_name == "vgg":
        """ VGG11_bn """
        model_ft = models.vgg11_bn(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        num_ftrs = model_ft.classifier[6].in_features
        model_ft.classifier[6] = nn.Linear(num_ftrs, num_classes)
        input_size = 224
    elif model_name == "squeezenet":
        """ Squeezenet """
        model_ft = models.squeezenet1_0(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        model_ft.classifier[1] = nn.Conv2d(512, num_classes, kernel_size=(1,1), stride=(1,1))
        model_ft.num_classes = num_classes
        input_size = 224
    elif model_name == "densenet":
        """ Densenet """
        model_ft = models.densenet121(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        num_ftrs = model_ft.classifier.in_features
        model_ft.classifier = nn.Linear(num_ftrs, num_classes)
        input_size = 224
    elif model_name == "inception":
        """ Inception v3
        Be careful, expects (299,299) sized images and has auxiliary output """
        model_ft = models.inception_v3(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        # Handle the auxiliary network
        num_ftrs = model_ft.AuxLogits.fc.in_features
        model_ft.AuxLogits.fc = nn.Linear(num_ftrs, num_classes)
        # Handle the primary network
        num_ftrs = model_ft.fc.in_features
        model_ft.fc = nn.Linear(num_ftrs, num_classes)
        input_size = 299
    else:
        print("Invalid model name, exiting...")
        exit()

    return model_ft, input_size

# Initialize the model for this run
model_ft, input_size = initialize_model(model_name, num_classes, feature_extract, use_pretrained=True)

# Print the model we just instantiated
print(model_ft)

6. Data loading

Now that we know what the input size must be, we can initialize the data transforms, image datasets, and dataloaders. Note that the models were pre-trained with hard-coded normalization values, as described here.

# Data augmentation and normalization for training
# Just normalization for validation
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(input_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(input_size),
        transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

print("Initializing Datasets and Dataloaders...")

# Create training and validation datasets
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x])
                  for x in ['train', 'val']}

# Create training and validation dataloaders
dataloaders_dict = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size,
                                                   shuffle=True, num_workers=4)
                    for x in ['train', 'val']}

# Detect if we have a GPU available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

7. Create an optimizer

Now that the model structure is correct, the final step of fine-tuning and feature extraction is to create an optimizer that only updates the required parameters.

Recall that after loading the pretrained model, but before reshaping, we manually set the .requires_grad attribute of all parameters to False if feature_extract=True. The reinitialized layer's parameters then have .requires_grad=True by default, so we know that all parameters with .requires_grad=True should be optimized.

Next, we build a list of these parameters and pass this list to the SGD constructor.

To verify this, look at the printed parameters to learn. When fine-tuning, this list should be long and include all of the model parameters. When doing feature extraction, the list should be short and only include the weights and biases of the reshaped layer.

 

# Send the model to the GPU
model_ft = model_ft.to(device)

# Gather the parameters to be optimized/updated in this run.
# If we are fine-tuning we will be updating all parameters.
# However, if we are doing the feature extraction method, we will only update
# the parameters that we have just initialized, i.e. the parameters with requires_grad=True.
params_to_update = model_ft.parameters()
print("Params to learn:")
if feature_extract:
	params_to_update = []
	for name,param in model_ft.named_parameters():
	    if param.requires_grad == True:
	        params_to_update.append(param)
	        print("\t",name)
else:
	for name,param in model_ft.named_parameters():
		if param.requires_grad == True:
			print("\t",name)
# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)

# Output:
#Params to learn:
#	 classifier.1.weight
#	 classifier.1.bias

 

8. Run training and validation

The final step is to set up the loss function for the model and then run the training and validation function for the set number of epochs.

Note that this step may take a while on the CPU depending on the number of epochs. Also, the default learning rate is not optimal for all models, so it is necessary to tune each model individually for maximum accuracy.

# Set up the loss function
criterion = nn.CrossEntropyLoss()

# Train and evaluate
model_ft, hist = train_model(model_ft, dataloaders_dict, criterion,optimizer_ft,
num_epochs=num_epochs, is_inception=(model_name=="inception"))

9. Comparison with a model trained from scratch

How well would the model learn if we did not use transfer learning? The performance of fine-tuning and feature extraction depends heavily on the dataset, but in general both transfer learning methods give good results relative to a model trained from scratch, both in training time and in overall accuracy.

# Initialize the non-pretrained version of the model used for this run
scratch_model,_ = initialize_model(model_name, num_classes,
feature_extract=False, use_pretrained=False)
scratch_model = scratch_model.to(device)
scratch_optimizer = optim.SGD(scratch_model.parameters(), lr=0.001, momentum=0.9)
scratch_criterion = nn.CrossEntropyLoss()
_,scratch_hist = train_model(scratch_model, dataloaders_dict, scratch_criterion,
scratch_optimizer, num_epochs=num_epochs, is_inception=(model_name=="inception"))
# Plot the validation accuracy training curves of the transfer learning method
# and the model trained from scratch, versus the number of training epochs
ohist = []
shist = []
ohist = [h.cpu().numpy() for h in hist]
shist = [h.cpu().numpy() for h in scratch_hist]
plt.title("Validation Accuracy vs. Number of Training Epochs")
plt.xlabel("Training Epochs")
plt.ylabel("Validation Accuracy")
plt.plot(range(1,num_epochs+1),ohist,label="Pretrained")
plt.plot(range(1,num_epochs+1),shist,label="Scratch")
plt.ylim((0,1.))
plt.xticks(np.arange(1, num_epochs+1, 1.0))
plt.legend()
plt.show()

Result:

Epoch 0/14 ----------

train Loss: 0.7131  Acc: 0.4959

val Loss: 0.6931  Acc: 0.4575

Epoch 1/14 ----------

train Loss: 0.6930  Acc: 0.5041

val Loss: 0.6931  Acc: 0.4575

Epoch 2/14 ----------

train Loss: 0.6932  Acc: 0.5041

val Loss: 0.6931  Acc: 0.4575

Epoch 3/14 ----------

train Loss: 0.6932  Acc: 0.5041

val Loss: 0.6931  Acc: 0.4575

...

10. Summary and outlook

Try running some of the other models and see how good the accuracy gets. Also, notice that feature extraction takes less time because in the backward pass we do not have to compute most of the gradients. There are many more things to try.

For example: run this code on a harder dataset and see more of the benefits of transfer learning; use transfer learning to update different models in new domains (such as NLP, audio, etc.) with the method described here; once you are happy with a model, you can export it as an ONNX model, or trace it with the hybrid frontend for more speed and optimization opportunities.
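
As a minimal sketch of the ONNX export step, assuming the model_ft, input_size, and device variables defined above (the file name finetuned_model.onnx and the batch size of 1 are only illustrative choices):

# Sketch: export the fine-tuned model to ONNX
model_ft.eval()  # switch to evaluation mode before exporting
dummy_input = torch.randn(1, 3, input_size, input_size, device=device)
torch.onnx.export(model_ft, dummy_input, "finetuned_model.onnx", export_params=True)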


Source: blog.csdn.net/Turbo_Come/article/details/105748369