Machine learning 14 - transfer learning

1 Overview

The goal of transfer learning is to use data that is not directly related to the target task to improve performance on that task. "Not directly related" mainly covers two cases:

  1. The tasks are different. For example, one task classifies cats vs. dogs, while the other classifies tigers vs. lions.
  2. The data distributions are different. For example, both tasks classify cats vs. dogs, but one uses real photos while the other uses cartoon images.

Transfer learning involves two kinds of data:

  1. Source data: not directly related to the target task. Whether labeled or unlabeled, it is usually easy to obtain in large quantities, for example from public datasets such as ImageNet. In machine translation, Chinese-English parallel data is abundant and can serve as source data.
  2. Target data: directly related to the target task. Whether labeled or unlabeled, it is usually scarce. For example, Chinese-Portuguese parallel data for machine translation is relatively limited.

Depending on whether the source data and the target data are labeled, transfer learning can be divided into four categories:

  1. Source labeled, target labeled: fine-tuning, multi-task learning
  2. Source labeled, target unlabeled: domain-adversarial training, zero-shot learning
  3. Source unlabeled, target labeled: self-supervised pre-training, then fine-tuning
  4. Source unlabeled, target unlabeled: clustering

Each of the four cases and its corresponding methods are discussed below.

 

2 Both source and target have labels

In this setting both datasets are labeled, but the source data is plentiful while the target data is scarce. (If the target data were already large enough, we could simply train on it directly and would not need the source data at all.) Two methods are commonly used:

  1. Pre-train on the source data, then fine-tune on the target data
  2. Combine the source and target tasks and train them jointly with multi-task learning (MTL)

 

2.1 Fine-tuning

The idea of fine-tuning is to first train the model on the source data and then fine-tune it on the target data. This way the model can learn a great deal of general knowledge from the source data and then adapt to the specific target task. Concretely: train a model on the source data, use its parameters to initialize a new model, and continue training that model on the target data. When the target data is very small, care must be taken to prevent fine-tuning from overfitting.

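As a concrete illustration, here is a minimal PyTorch sketch of this recipe, using ImageNet-pretrained weights as a stand-in for "a model trained on source data"; the target dataset, the number of target classes, and all hyper-parameters below are hypothetical placeholders.

```python
import torch
import torchvision
from torch.utils.data import DataLoader, TensorDataset

# Model pre-trained on a large source dataset (here: ImageNet weights).
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

# Replace the classification head to match the (hypothetical) number of target classes.
num_target_classes = 2
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)

# Hypothetical tiny labeled target dataset: 32 RGB images, 2 classes.
target_data = TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,)))
target_loader = DataLoader(target_data, batch_size=8, shuffle=True)

# Fine-tune the whole network on the target data with a small learning rate
# to reduce the risk of overfitting the small target set.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for x, y in target_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```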

Layer transfer

When the target data is extremely small, fine-tuning the whole model may still overfit. Layer transfer can be used in this case:

  1. First train a model on the source data.
  2. Copy some of its layers directly into the target model.
  3. Train only the remaining layers of the target model on the target data, keeping the copied layers frozen.

Since only a few layers need to be trained, overfitting becomes much less likely.
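Below is a minimal sketch of layer transfer under the same hypothetical setup as above: the layers copied from the source model are frozen, and only the newly added final layer is trained on the target data.

```python
import torch
import torchvision

# Source model: pre-trained weights stand in for "a model trained on source data".
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

# Freeze every copied layer so the small target dataset cannot disturb them.
for param in model.parameters():
    param.requires_grad = False

# Replace (and thereby unfreeze) only the final layer; it is the part
# that will actually be trained on the target data.
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # 2 = hypothetical number of target classes

# Pass only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2, momentum=0.9
)
```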


The question then becomes which layers should be copied (and frozen) and which should be trained. This depends on the task:

  1. In speech recognition, the last few layers are usually copied and the first few layers are retrained. Because speakers differ in vocal-tract structure and pronunciation, the low-level acoustic features vary a lot from person to person, while high-level features such as semantics and the language model are similar.
  2. In image tasks, the first few layers are usually copied and the last few layers are retrained. Low-level image features such as edges, lighting, and shadows differ little across tasks, while high-level features (such as an elephant's trunk) differ greatly.


 

2.2 Multi-task learning

Fine-tuning only cares about the model's performance on the target data, whereas multi-task learning requires the model to perform well on both the source and the target tasks.

  1. If the source and target inputs have similar features, the first few layers can be shared and the later layers handled separately for each task.
  2. If the source and target inputs are quite different, each task can keep its own first few and last few layers while sharing the middle layers.

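Here is a minimal PyTorch sketch of the first variant (shared bottom layers, task-specific heads), assuming two classification tasks over the same kind of input; all layer sizes, class counts, and the random batches are hypothetical.

```python
import torch
import torch.nn as nn

class SharedBottomMTL(nn.Module):
    """Shared first layers, plus a separate output head for the source and target tasks."""
    def __init__(self, in_dim=128, hidden=64, source_classes=10, target_classes=2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.source_head = nn.Linear(hidden, source_classes)
        self.target_head = nn.Linear(hidden, target_classes)

    def forward(self, x, task):
        h = self.shared(x)
        return self.source_head(h) if task == "source" else self.target_head(h)

model = SharedBottomMTL()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step: hypothetical batches from both tasks are combined so the
# shared layers receive gradients from the source and the target simultaneously.
x_src, y_src = torch.randn(16, 128), torch.randint(0, 10, (16,))
x_tgt, y_tgt = torch.randn(16, 128), torch.randint(0, 2, (16,))

optimizer.zero_grad()
loss = criterion(model(x_src, "source"), y_src) + criterion(model(x_tgt, "target"), y_tgt)
loss.backward()
optimizer.step()
```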

A typical application of multi-task learning is machine translation, where models for different language pairs can share part of the network.


Experimental results show that, with the same amount of data, multi-task learning greatly reduces the error rate, and with less than half of the data it can still match the performance of single-task training. It thus greatly reduces the model's dependence on data while improving performance.


 

3 The target has no label, but the source has a label

In this case, domain-adversarial training and zero-shot learning can be used.

3.1 Domain-adversarial Training

The first few layers of a neural network are generally used for feature extraction, while the later layers carry out the actual task, such as classification. Our goal is for the feature extractor to be insensitive to which domain the data comes from: it should discard domain-specific information and keep as much shared information as possible.

For example, consider handwritten digit recognition on a black-and-white background versus on a colored background. The two domains differ substantially, so directly applying the source model to the target data works poorly, mainly because of the different background colors. We need a feature extractor that is insensitive to the background and captures the information the digits truly have in common.


Domain-adversarial training borrows the idea of GANs. The whole network consists of three parts:

  1. Feature extractor: extracts features from data of both domains.
  2. Label predictor: predicts the labels of the source data.
  3. Domain classifier: distinguishes whether the data comes from the source or the target domain.


There are two objectives:

  1. Maximize the accuracy of the label predictor, so that the model still performs well on the actual task.
  2. Minimize the accuracy of the domain classifier from the feature extractor's point of view, so that it becomes impossible to tell which domain the features come from. The domain classifier itself tries to be accurate, while the feature extractor tries to fool it, just as in a GAN. This forces the feature extractor to be domain-insensitive: it should not extract domain-private features but instead extract the features shared across domains.
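One common way to implement these two competing objectives is a gradient-reversal layer, as in DANN: the domain classifier is trained normally, but the gradient flowing back into the feature extractor is flipped, so the extractor learns to fool it. Below is a minimal PyTorch sketch; all network sizes and the random batches are hypothetical.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

feature_extractor = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
label_predictor = nn.Linear(64, 10)    # predicts class labels (source data only)
domain_classifier = nn.Linear(64, 2)   # predicts source vs. target domain
criterion = nn.CrossEntropyLoss()

# One training step on a mixed batch: labeled source data and unlabeled target data.
x_src, y_src = torch.randn(16, 128), torch.randint(0, 10, (16,))
x_tgt = torch.randn(16, 128)

f_src, f_tgt = feature_extractor(x_src), feature_extractor(x_tgt)

# Objective 1: the label predictor should be accurate on the source data.
task_loss = criterion(label_predictor(f_src), y_src)

# Objective 2: the domain classifier tries to tell source from target, while the
# reversed gradient pushes the feature extractor toward domain-invariant features.
features = torch.cat([f_src, f_tgt])
domains = torch.cat([torch.zeros(16, dtype=torch.long), torch.ones(16, dtype=torch.long)])
adv_loss = criterion(domain_classifier(GradReverse.apply(features, 1.0)), domains)

(task_loss + adv_loss).backward()
```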

 

3.2 Zero-shot learning

For example, suppose the source task classifies cats and dogs, but monkeys appear in the target data. Directly using the source model is obviously useless, since the "monkey" label does not even exist. In this case we can use zero-shot learning: instead of learning the categories directly, we learn the attributes of the categories. For example, we can build a table of attributes such as the number of legs, whether there is a tail, whether there are horns, whether there is fur, and so on, such that each category (cat, dog, monkey) is determined by its attributes. We train the model on the source data to predict these attributes, and at test time look up the attribute table to decide which category an example belongs to.

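A minimal sketch of this attribute-based approach: a network is trained (on the source classes only) to predict attributes, and at test time the predicted attribute vector is matched against a lookup table that also contains classes never seen during training, such as "monkey". The attribute table, network sizes, and inputs below are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical attribute table: [four legs, has tail, has horns, has fur, climbs trees]
attribute_table = {
    "dog":    torch.tensor([1., 1., 0., 1., 0.]),
    "cat":    torch.tensor([1., 1., 0., 1., 1.]),
    "monkey": torch.tensor([0., 1., 0., 1., 1.]),  # never appears in the training labels
}

# Attribute predictor; in practice it would be trained on source data (cats and dogs only).
attribute_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 5), nn.Sigmoid())

def classify(x):
    """Predict attributes, then return the class whose attribute vector is closest."""
    names = list(attribute_table)
    table = torch.stack([attribute_table[n] for n in names])
    pred = attribute_net(x)              # (batch, 5) predicted attribute values
    dists = torch.cdist(pred, table)     # distance to every class's attribute vector
    return [names[i] for i in dists.argmin(dim=1)]

print(classify(torch.randn(4, 128)))  # can output "monkey" although it was never a training label
```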

 

4 The source has no label, but the target has a label

This setting resembles semi-supervised learning, but there is still an important difference: in semi-supervised learning the labeled and unlabeled data come from essentially the same domain, whereas here the source and target domains differ to some extent. The large amount of source data can be used to construct self-supervised learning tasks that learn feature representations; the various pre-trained models in NLP are typical examples. For instance, build an auto-encoder (or another self-supervised task), pre-train it on the source data, and then fine-tune on the target task (a minimal sketch follows the references below). See also:

Machine Learning 10 - Semi-supervised Learning

Machine learning 13 - self-supervised unsupervised learning
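Returning to the pre-train-then-fine-tune recipe above, here is a minimal sketch that uses a simple auto-encoder as the self-supervised pre-training task on unlabeled source data and then fine-tunes the encoder with a new classification head on labeled target data; all sizes, learning rates, and batches are hypothetical.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 32), nn.ReLU())
decoder = nn.Linear(32, 128)

# Stage 1: self-supervised pre-training on unlabeled source data (reconstruction loss).
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x_src = torch.randn(64, 128)              # hypothetical unlabeled source batch
opt.zero_grad()
nn.MSELoss()(decoder(encoder(x_src)), x_src).backward()
opt.step()

# Stage 2: fine-tune the pre-trained encoder plus a new head on labeled target data.
classifier = nn.Sequential(encoder, nn.Linear(32, 2))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)  # smaller learning rate for fine-tuning
x_tgt, y_tgt = torch.randn(16, 128), torch.randint(0, 2, (16,))
opt.zero_grad()
nn.CrossEntropyLoss()(classifier(x_tgt), y_tgt).backward()
opt.step()
```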

 

5 Neither source nor target has a label

This case mainly falls under clustering. It is rarely encountered in practice, so it is not covered here.

 

 

Origin: blog.csdn.net/u013510838/article/details/108566050