1. Definition
1. Source domain and target domain
The source domain (Source) and the target domain (Target) are different but related. The role of transfer learning is to learn knowledge from the source domain and apply it so that it achieves better results in the target domain.
Transfer learning can be divided into positive transfer and negative transfer; the division is based on whether the transfer helps or hurts performance on the target task.
2. Advantages of transfer learning
①When large amounts of (labeled) data or computing resources are lacking
②When personalized models need to be trained quickly
③For cold-start services (e.g., product recommendation for a new user, which can rely on associations with existing users)
2. Related symbols
Domain : $\mathcal{D} = \{\mathcal{X}, P(X)\}$, i.e., a feature space together with a marginal distribution over it
Source Domain: $\mathcal{D}_s$, Target Domain: $\mathcal{D}_t$
Tasks : $\mathcal{T} = \{\mathcal{Y}, P(Y \mid X)\}$, i.e., a label space together with a predictive function
Conditions : Transfer learning needs to meet one of the following two conditions:
①Different domains: $\mathcal{D}_s \neq \mathcal{D}_t$
②Different tasks: $\mathcal{T}_s \neq \mathcal{T}_t$
3. Transfer Learning
1. The domain is different
After Bayesian expansion: $P(X, Y) = P(Y \mid X)\,P(X)$
If $P(Y \mid X)$ is the same, the domains have different marginal distributions: $P_s(X) \neq P_t(X)$
If $P(X)$ is the same, the domains have different conditional distributions: $P_s(Y \mid X) \neq P_t(Y \mid X)$
2. Loss function
Empirical Risk Minimization (ERM): $\theta^* = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L\big(f_\theta(x_i), y_i\big)$; where $L$ is the loss function
The above formula is the standard objective in general machine learning. In transfer learning, a transfer regularization term (Transfer regularization) is generally added, so the objective can be expressed as follows:
$\theta^* = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L\big(f_\theta(x_i), y_i\big) + \lambda R(\cdot)$; where $R$ is the transfer regularization term and $\theta$ are the parameters to be learned. Generally, $R$ is handled in one of the following situations:
①When the target data lie (approximately) within the support of the source distribution (a subset), source instances can be reweighted directly, in which case an explicit $R$ is not required
②$R$ can be written as a distance between the source and target distributions, e.g. $R = D(\mathcal{D}_s, \mathcal{D}_t)$
③When the two tasks are similar ($\mathcal{T}_s \approx \mathcal{T}_t$), the optimization of $R$ can be skipped and parameters shared directly
The above three learning methods correspond to:
①Instance-based TL : select a subset of source samples and reweight them so that they resemble the target domain. This family of methods is less used now, and can be divided into the following:
1. Instance selection : design an instance selector to filter data close to the target domain out of the source domain and adjust their weights (increase the weight of well-scoring samples and reduce the weight of poorly-scoring ones). It consists of an instance selector (Instance Selector) and a performance evaluator (Performance Evaluator), which are executed cyclically as shown in the figure below. The general idea is close to reinforcement learning.
2. Instance reweighting : the premise of this method is that $P_s(Y \mid X) = P_t(Y \mid X)$, while $P_s(X) \neq P_t(X)$. At this point, the cost function is rewritten as:
$\mathbb{E}_{(x,y) \sim P_t}\big[L(f(x), y)\big] = \mathbb{E}_{(x,y) \sim P_s}\!\left[\frac{P_t(x)}{P_s(x)}\, L(f(x), y)\right]$, so after simplification each source sample receives the importance weight $w(x) = \frac{P_t(x)}{P_s(x)}$.
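The reweighting identity above can be checked numerically. The sketch below is a toy example with two 1-D Gaussian marginals whose density ratio is computable in closed form (the Gaussians themselves are illustrative choices, not from the original notes): weighting source samples by $w(x) = P_t(x)/P_s(x)$ lets expectations under the target marginal be estimated from source data alone.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Source marginal: N(-0.5, 1); target marginal: N(0.5, 1). P(y|x) is shared.
xs = rng.normal(-0.5, 1.0, 2000)

# Importance weights w(x) = P_t(x) / P_s(x), here available in closed form
w = gaussian_pdf(xs, 0.5, 1.0) / gaussian_pdf(xs, -0.5, 1.0)

# Self-normalized weighted average under P_s approximates E_{P_t}[x]
est_target_mean = np.sum(w * xs) / np.sum(w)
print(est_target_mean)   # close to the target mean 0.5
```

In practice the densities are unknown, so the ratio is usually estimated, e.g. with a classifier that separates source from target samples.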
②Feature-based TL : the transfer regularization term $R$ is explicitly expressed and minimized, generally as the distance between the two domains. It can be divided into two categories according to the feature spaces of the source and target domains: homogeneous feature space (for example, both source and target are images) and heterogeneous feature space (for example, one of them is text and the other is images).
The premise of this method is that the source and target domains share some common features. What we need to do is map both domains into the same feature space and reduce the distance between them. Two approaches exist:
1. Explicit distance (Explicit distance) : use mathematical tools to measure the distance between the two domains directly. Common choices are:
① Kernel-based : MMD, KL divergence, cosine similarity
② Geometry-based : Geodesic Flow Kernel (GFK), covariance alignment, subspace alignment, Riemannian manifold methods
The most widely used is MMD (Maximum Mean Discrepancy); see Chapter 4 for details.
2. Implicit distance (Implicit distance) : measure separability instead. When an explicit spatial distance is hard to choose, this is generally achieved with adversarial networks (GANs).
3. Combination of both (explicit + implicit distance) : e.g., the MMD-AAE and DAAN networks.
③Parameter-based TL : reuse the model parameters trained on the source domain. The representative method is pre-training followed by fine-tuning.
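Parameter-based transfer can be illustrated with a toy linear model (the regression tasks and step counts below are illustrative assumptions, not from the notes): weights pre-trained on a data-rich source task serve as the initialization for a related target task with few samples, so a few fine-tuning steps suffice.

```python
import numpy as np

rng = np.random.default_rng(0)

def gd(w, X, y, steps, lr=0.1):
    """Plain gradient descent on mean-squared error for a linear model."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return np.mean((X @ w - y)**2)

def with_bias(x):
    return np.column_stack([x, np.ones_like(x)])

# Source task: plenty of data, y = 2x + 1
xs = rng.normal(0, 1, 500)
Xs, ys = with_bias(xs), 2 * xs + 1 + rng.normal(0, 0.1, 500)
w_src = gd(np.zeros(2), Xs, ys, steps=500)   # "pre-training"

# Related target task: only 10 samples, y = 2x + 1.5
xt = rng.normal(0, 1, 10)
Xt, yt = with_bias(xt), 2 * xt + 1.5 + rng.normal(0, 0.1, 10)

# Fine-tune from the pre-trained weights vs. train from scratch, same budget
w_ft = gd(w_src.copy(), Xt, yt, steps=5)
w_scratch = gd(np.zeros(2), Xt, yt, steps=5)
print(mse(w_ft, Xt, yt), mse(w_scratch, Xt, yt))  # fine-tuned is lower
```

In deep learning the same idea appears as freezing early layers of a pre-trained network and fine-tuning the rest on the target data.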
4. MMD
1. Definition
MMD stands for Maximum Mean Discrepancy. It is a value used to measure the difference between domains. Given two data distributions $P$ and $Q$ (with $x \sim P$, $y \sim Q$) and a function $\phi$ that maps samples into a Hilbert space $\mathcal{H}$, MMD is the maximum difference between the expectations of the two domains after mapping, and its mathematical formula can be written as:
$\mathrm{MMD}(P, Q) = \sup_{\|\phi\|_{\mathcal{H}} \le 1} \big( \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)] \big)$
In actual calculation, a finite random sample is drawn from each distribution, and the difference between the sample means after mapping is computed; this empirical estimate is generally written as:
$\widehat{\mathrm{MMD}}(X, Y) = \left\| \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) - \frac{1}{m}\sum_{j=1}^{m} \phi(y_j) \right\|_{\mathcal{H}}$
Based on statistics, when the value of MMD is very close to 0, the two domains can be considered approximately identically distributed (that is, the goal of domain alignment is achieved).
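The empirical estimate above has a closed kernel form, since inner products of feature means expand into kernel evaluations. A minimal sketch of the (biased) squared-MMD estimate with a Gaussian kernel (the sample sizes and bandwidth are illustrative choices):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Biased empirical estimate of squared MMD between samples X ~ P, Y ~ Q."""
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
same = rng.normal(0, 1, (200, 2))
near = rng.normal(0, 1, (200, 2))   # drawn from the same distribution
far  = rng.normal(3, 1, (200, 2))   # shifted distribution
print(mmd2(same, near))  # close to 0: domains approximately aligned
print(mmd2(same, far))   # clearly larger: domains differ
```

The contrast between the two printed values is exactly the alignment criterion stated above.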
2. Classification
①Marginal dist
This method uses MMD to measure the difference between the marginal distributions of the two domains, the original formula being:
$\mathrm{dist}(\mathcal{D}_s, \mathcal{D}_t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} \phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t} \phi(x_j^t) \right\|_{\mathcal{H}}^2$
After a certain calculation, it can be written in kernel form as $\mathrm{tr}(KM)$, where $K$ is the kernel matrix over all source and target samples and $M$ is the MMD matrix with $M_{ij} = \frac{1}{n_s^2}$ if $x_i, x_j \in \mathcal{D}_s$, $M_{ij} = \frac{1}{n_t^2}$ if $x_i, x_j \in \mathcal{D}_t$, and $M_{ij} = -\frac{1}{n_s n_t}$ otherwise.
This method is usually called TCA (Transfer Component Analysis).
As can be seen from the figure above, the distribution of the two domains is not equal after PCA (Principal Component Analysis), but the distribution tends to be consistent after TCA processing.
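The $\mathrm{tr}(KM)$ kernel form can be verified directly: with a linear kernel ($\phi(x) = x$), it must equal the squared distance between the two domain means. A short sketch (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Xs = rng.normal(0, 1, (5, 3))   # source samples
Xt = rng.normal(1, 1, (4, 3))   # target samples
ns, nt = len(Xs), len(Xt)

# MMD matrix M: 1/ns^2 within source, 1/nt^2 within target, -1/(ns*nt) across
M = np.zeros((ns + nt, ns + nt))
M[:ns, :ns] = 1.0 / ns**2
M[ns:, ns:] = 1.0 / nt**2
M[:ns, ns:] = M[ns:, :ns] = -1.0 / (ns * nt)

X = np.vstack([Xs, Xt])
K = X @ X.T                      # linear kernel, phi(x) = x

# tr(KM) equals the squared distance between the two domain means
direct = np.sum((Xs.mean(0) - Xt.mean(0))**2)
print(np.trace(K @ M), direct)   # the two values agree
```

TCA then seeks a projection that minimizes this quantity while preserving data variance.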
②Conditional dist
The formula for this method can be written as:
$\mathrm{dist} = \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} \sum_{x_i \in \mathcal{D}_s^{(c)}} \phi(x_i) - \frac{1}{n_t^{(c)}} \sum_{x_j \in \mathcal{D}_t^{(c)}} \phi(x_j) \right\|_{\mathcal{H}}^2$
After simplification this becomes $\sum_{c=1}^{C} \mathrm{tr}(K M_c)$. The structure is similar to the TCA formula above; the difference lies in the class index $c$ in the formula, which is equivalent to applying TCA per category.
Combining the marginal and conditional terms yields a method called JDA (Joint Distribution Adaptation), written as:
$\mathrm{dist} = \mathrm{tr}(K M_0) + \sum_{c=1}^{C} \mathrm{tr}(K M_c)$
Compared with TCA, JDA has better performance and a shorter distribution distance. At the same time, because JDA can iterate (refining pseudo-labels for the target domain each round), it can better learn the differences between distributions.
③Dynamic dist
This method can be abbreviated as DDA (Dynamic Distribution Adaptation), which unifies TCA and JDA in one general formula:
$D(\mathcal{D}_s, \mathcal{D}_t) = (1 - \mu)\, D\big(P_s(X), P_t(X)\big) + \mu\, D\big(P_s(Y \mid X), P_t(Y \mid X)\big), \quad \mu \in [0, 1]$
When $\mu = 0$, only the marginal term remains, which is TCA.
When $\mu = 0.5$, the marginal and conditional terms are weighted equally, which is JDA.
The difficulty of this method lies in how to estimate the parameter $\mu$; the A-distance estimation method is generally used. The specific method can be written as:
$d_A(\mathcal{D}_s, \mathcal{D}_t) = 2\big(1 - 2\,\epsilon(h)\big)$; where $h$ is a linear classifier trained to separate the two domains and $\epsilon(h)$ is its error.
$\mu$ can then be estimated using the above formula as $\hat{\mu} = 1 - \frac{d_M}{d_M + \sum_{c=1}^{C} d_c}$; where $d_M$ is the A-distance of the marginal distributions and $d_c$ is the per-class A-distance of the conditional distributions.
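The estimation of $\mu$ can be sketched end to end. For self-containedness the sketch below substitutes a nearest-centroid domain classifier for the linear classifier $h$, and the two-class Gaussian data are illustrative assumptions:

```python
import numpy as np

def a_distance(Xs, Xt):
    """Proxy A-distance d_A = 2(1 - 2*err), where err is the error of a
    domain classifier (nearest-centroid here, standing in for a linear one)."""
    X = np.vstack([Xs, Xt])
    y = np.r_[np.zeros(len(Xs)), np.ones(len(Xt))]   # 0 = source, 1 = target
    c0, c1 = Xs.mean(0), Xt.mean(0)
    pred = np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)
    err = np.mean(pred != y)
    return 2 * (1 - 2 * err)

rng = np.random.default_rng(0)
# Classes separate along dim 1; the domain shift is along dim 2
Xs = {0: rng.normal([0, 0], 1, (100, 2)), 1: rng.normal([4, 0], 1, (100, 2))}
Xt = {0: rng.normal([0, 2], 1, (100, 2)), 1: rng.normal([4, 2], 1, (100, 2))}

d_m = a_distance(np.vstack(list(Xs.values())), np.vstack(list(Xt.values())))
d_c = sum(a_distance(Xs[c], Xt[c]) for c in (0, 1))   # per-class A-distances
mu = 1 - d_m / (d_m + d_c)
print(d_m, d_c, mu)   # mu lies in [0, 1]
```

A larger class-wise distance relative to the marginal distance pushes $\hat{\mu}$ toward 1, i.e., toward emphasizing conditional alignment.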
3. The application of MMD in deep learning
The above-mentioned TCA, JDA, and DDA can be incorporated into neural networks by means of Deep Domain Confusion (DDC) or the Dynamic Distribution Adaptation Network (DDAN). The improved network structure is as follows:
The loss function of the network is $\mathcal{L} = \mathcal{L}_{cls} + \lambda \cdot \mathrm{Distance}(\mathcal{D}_s, \mathcal{D}_t)$; where Distance can be the TCA, JDA, or DDA distance. The network can be trained by stochastic gradient descent and is an end-to-end learning method.
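The combined loss can be sketched in a framework-free way. The sketch below assumes hypothetical encoder features `Fs`/`Ft` (in DDC these come from a shared bottleneck layer) and uses a linear-kernel MMD as the distance term for brevity:

```python
import numpy as np

def mmd2_linear(Fs, Ft):
    """Squared MMD with a linear kernel: distance between feature means."""
    return np.sum((Fs.mean(0) - Ft.mean(0))**2)

def cross_entropy(logits, labels):
    """Numerically stable softmax cross-entropy on the labeled source batch."""
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def ddc_loss(Fs, Ft, logits_s, labels_s, lam=1.0):
    """DDC-style objective: source classification loss + lam * domain distance.
    The distance term could equally be the TCA/JDA/DDA variants."""
    return cross_entropy(logits_s, labels_s) + lam * mmd2_linear(Fs, Ft)

rng = np.random.default_rng(0)
Fs = rng.normal(0, 1, (32, 8))       # source features from the shared encoder
Ft = rng.normal(0.5, 1, (32, 8))     # target features (no labels needed)
labels = rng.integers(0, 3, 32)
logits = rng.normal(0, 1, (32, 3))
print(ddc_loss(Fs, Ft, logits, labels, lam=1.0))
```

Because both terms are differentiable in the encoder parameters, the whole objective can be minimized end to end with stochastic gradient descent, as the notes state.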
5. Popular directions of transfer learning
1. Low-resource learning : training with only a small amount of labeled data, e.g., via self-training.
2. Safe transfer : preventing targeted attacks that exploit vulnerabilities inherited from public pre-trained models.
3. Domain adaptation : adapting a model to a target domain whose distribution differs from the source, with the task unchanged.
4. Domain generalization : learning from one or more source domains so that the model generalizes to unseen target domains without any target data.