[Transfer Learning] Domain Adaptation

1. Definition

        1. Source domain and target domain

        The source domain (Source) and the target domain (Target) are different but related. The role of transfer learning is to learn knowledge from the source domain so that the model achieves better results on the target domain.

        Transfer learning can be divided into positive transfer and negative transfer, depending on whether the transfer helps or hurts performance on the target task.

        2. When transfer learning is useful

        ① Lack of large labeled datasets or computing resources

        ② The need to quickly train personalized models

        ③ Cold-start services (e.g., product recommendation for a new user, which can rely on associations with similar users)

2. Related symbols

        Domain: D = \{(x_i,y_i)\}^N_{i=1}\sim P(x,y)

                Source domain: D_s; target domain: D_t

        Tasks :y=f(x)

        Conditions: transfer learning applies when at least one of the following two conditions holds:

                ① Different domains: P(x,y) \neq Q(x,y) (P is the source distribution, Q the target distribution)

                ② Different tasks: T_s \neq T_t

3. Transfer Learning

        1. The domain is different

        P(x,y) \neq Q(x,y). By the chain rule of probability, P(x,y)=P(y|x)P(x), so the mismatch can come from either factor.

        If P(y|x) is the same, the domains differ in their marginal distributions (covariate shift):

                x_s \sim P_s(X),\ x_t\sim P_t(X)\rightarrow P_s(X) \neq P_t(X)

        If P(x) is the same, the domains differ in their conditional distributions:

                P_s(y|x) \neq P_t(y|x)
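The covariate-shift case can be illustrated with a minimal numpy sketch (all numbers here are illustrative assumptions): the labeling rule P(y|x) is identical in both domains, but the input marginals are two differently centered Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same labeling rule P(y|x) in both domains: y = 1 if x > 0.
def label(x):
    return (x > 0).astype(int)

# Different marginals P_s(x) != P_t(x): shifted Gaussians (covariate shift).
x_s = rng.normal(loc=-1.0, scale=1.0, size=1000)   # source centered at -1
x_t = rng.normal(loc=+1.0, scale=1.0, size=1000)   # target centered at +1

y_s, y_t = label(x_s), label(x_t)

# The conditional is identical, but because the inputs are distributed
# differently, the class proportions (and a model fit on x_s) shift on x_t.
print(x_s.mean(), x_t.mean())   # means are roughly -1 and +1
```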

         2. Loss function

        Empirical Risk Minimization (ERM): f^*=argmin_f \frac{1}{m}\sum _{i=1}^m L(f(x_i),y_i); where L is the loss function

        The above is the standard objective of ordinary machine learning. Transfer learning generally adds a transfer regularization term (Transfer regularization), giving:

                f^*=argmin_f \frac{1}{m}\sum _{i=1}^m L(f(x_i),y_i) + \lambda R(x_i,y_i); where R is the regularization term and \lambda its weight. R is generally chosen according to one of the following situations:

                ① D'_s\subseteq D_s (a subset of the source data) with P(x,y) \approx Q(x,y) on that subset, in which case an explicit R is not required

                ② R can be written as Distance(D_s,D_t) or Separability(D_s,D_t)

                ③ When the two tasks are similar ( f_s \approx f_t ), the optimization of R can be skipped and the parameters reused directly

        The three situations above correspond to three families of methods:

                ①Instance-based TL : select a subset of source samples whose distribution is close to the target domain. This family is less used now, and it can be divided into the following methods:

                        1. Instance selection : design an instance selector that filters data close to the target domain out of the source domain and adjusts sample weights (increasing the weight of well-scored samples and reducing the weight of poorly-scored ones). It consists of an instance selector (Instance Selector) f and a performance evaluator (Performance Evaluator) g, which are executed in a loop. The general idea is close to reinforcement learning .

                         2. Instance reweighting : the premise of this method is D'_s \subseteq D_s, with P_s(x) \neq P_t(x) and P(y|x) the same. The objective is then rewritten as:

                                \theta^*_t=argmax_\theta \int _x P_t(x) \sum_{y \in Y}P_t(y|x)\log P(y|x;\theta)dx, which after simplification gives

                                \theta^*_t \approx argmax_\theta \frac{1}{N_s}\sum^{N_s}_{i=1}\frac{P_t(x_i^S)}{P_s(x_i^S)}\log P(y_i^S|x_i^S;\theta)
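The density ratio P_t(x)/P_s(x) in the formula above acts as an importance weight on each source sample. A minimal numpy sketch, assuming for illustration that both densities are known Gaussians (in practice the ratio itself must be estimated, e.g. by kernel mean matching):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Source samples x ~ P_s = N(0, 1); target distribution P_t = N(0.5, 1).
x_s = rng.normal(0.0, 1.0, size=5000)
w = gauss_pdf(x_s, 0.5, 1.0) / gauss_pdf(x_s, 0.0, 1.0)   # weights P_t(x)/P_s(x)

# The reweighted source average approximates the target expectation E_{P_t}[x] = 0.5,
# even though only source samples were used.
target_mean_est = np.sum(w * x_s) / np.sum(w)
print(target_mean_est)   # approximately 0.5
```

The same weights would multiply each source sample's log-likelihood term in the reweighted objective above.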

                ②Feature-based TL : the transfer regularization term R is written explicitly and minimized, generally as a distance between the two domains. It can be divided into two categories according to the feature spaces of the source and target domains: homogeneous feature space (for example, both domains are images) and heterogeneous feature space (for example, one domain is text and the other is images).

                         The premise of this method is that the source domain and the target domain share some common features . What we need to do is to map both domains into the same feature space and reduce the distance between them. There are two approaches:

                        1. Explicit distance (Explicit distance) : R=Distance(D_s,D_t); that is, using a mathematical tool to measure the distance between the two domains. Common choices are:

                                ① Kernel-based : MMD, KL divergence, cosine similarity

                                ② Geometry-based : geodesic flow kernel (GFK), covariance alignment, subspace alignment, Riemannian manifolds

                                The most widely used is MMD (Maximum Mean Discrepancy); see Chapter 4 for details .

                        2. Implicit distance (Implicit distance) : R=Separability(D_s,D_t); when an explicit spatial distance is hard to define, separability is generally reduced using an adversarial network (GAN-style domain discriminator).

                        3. Combination of both (explicit + implicit distance): for example, the MMD-AAE and DAAN networks.

                ③Parameter-based TL : reuse the model parameters trained on the source domain. The representative method is pre-training followed by fine-tuning.

4. MMD

        1. Definition

        MMD stands for Maximum Mean Discrepancy. It is a value used to measure the difference between domains. Given two data distributions P and Q with x \sim P, y \sim Q, and a function f that maps samples into a Hilbert space H, MMD is the largest expected difference between the two domains after mapping, and its mathematical formula can be written as:

        MMD(P,Q,F) = sup_{f \in F}\,(E_P [f(x)]-E_Q[f(y)])

        In actual calculation, a finite random sample is drawn from each distribution and the difference of sample means is computed; the supremum of these mean differences over f is the MMD, generally written as:

        MMD(P,Q,F)=sup_{f \in F}\,[\frac{1}{m}\sum_{i=1}^mf(x_i)-\frac{1}{n}\sum^n_{j=1}f(y_j)]

        Statistically, when the value of MMD is very close to 0, the two distributions can be considered approximately equal (that is, the goal of domain alignment is achieved).
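An empirical (biased) estimate of squared MMD can be computed from kernel evaluations alone. A minimal numpy sketch using an RBF kernel (the bandwidth gamma and the sample sizes are illustrative choices):

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD with RBF kernel k(a,b)=exp(-gamma*||a-b||^2)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd2_rbf(rng.normal(0, 1, (500, 2)), rng.normal(0, 1, (500, 2)))
shifted = mmd2_rbf(rng.normal(0, 1, (500, 2)), rng.normal(2, 1, (500, 2)))
print(same, shifted)   # same-distribution MMD is near 0; shifted is clearly larger
```

This matches the statement above: near-zero MMD indicates the two distributions are approximately aligned.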

        2. Classification

                ①Marginal dist

                This method uses MMD to measure the difference between the marginal distributions of the two domains. The original formula

                Distance(D_s,D_t)\approx MMD(P,Q,F) = sup_{f \in F}\,(E_P [f(x)]-E_Q[f(y)]), after derivation, can be written as tr(A^TXMX^TA), where X=[X_s,X_t]\in R^{d\times(m+n)} and A is the projection matrix; the kernel form can be written as tr(KM), where K is the kernel matrix and M is the MMD matrix, with M_{ij}=1/m^2 for source–source pairs, 1/n^2 for target–target pairs, and -1/(mn) otherwise.

                 This method is usually called TCA (Transfer Component Analysis), whose objective is:

                         min\, tr(KM)-\lambda\, tr(K)

                 Visually, the distributions of the two domains remain unequal after PCA (Principal Component Analysis), but tend to coincide after TCA processing.
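The kernel form tr(KM) can be verified numerically. A small numpy sketch (with a linear kernel, tr(KM) reduces to the squared distance between the two domains' mean vectors; sample sizes and shifts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 40, 60, 3
Xs = rng.normal(0, 1, (m, d))     # source samples
Xt = rng.normal(1, 1, (n, d))     # target samples, shifted mean

# MMD matrix M: 1/m^2 on source-source, 1/n^2 on target-target, -1/(mn) across.
M = np.block([[np.full((m, m), 1 / m**2), np.full((m, n), -1 / (m * n))],
              [np.full((n, m), -1 / (m * n)), np.full((n, n), 1 / n**2)]])

X = np.vstack([Xs, Xt])
K = X @ X.T                        # linear kernel, for illustration

# With a linear kernel, tr(KM) equals the squared distance between domain means.
trKM = np.trace(K @ M)
direct = np.sum((Xs.mean(0) - Xt.mean(0)) ** 2)
print(np.isclose(trKM, direct))    # True
```

TCA then seeks a projection that makes this quantity small while preserving data variance.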

                ②conditional dist

                The formula for this method can be written as: Distance(D_s,D_t)\approx MMD(P_s(y|x),P_t(y|x),f)

                After simplification: Distance(D_s,D_t)=\sum_{c=1}^C tr(A^TXM_cX^TA). The structure is similar to the TCA formula above; the difference is the per-class MMD matrix M_c, which is equivalent to applying TCA class by class.

                 Through transformation, you can get a method called JDA (Joint Distribution Adaptation), written as:

                        min \sum^C_{c=0} tr(A^TXM_cX^TA)+\lambda ||A||^2

                Compared with TCA, JDA performs better and achieves a smaller distribution distance. Because JDA can iterate (refining the target pseudo-labels each round), it learns the differences between distributions better.

                ③dynamic dist

                This method can be abbreviated as DDA (Dynamic Distribution Adaptation). It unifies TCA and JDA in one general formula that balances the marginal and conditional terms with a factor \mu:

                        Distance(D_s,D_t)\approx (1-\mu)\,tr(A^TXMX^TA)+\mu \sum^C_{c=1} tr(A^TXM_cX^TA)

                 When \mu=0 , only the marginal term tr(A^TXMX^TA) remains, which is TCA

                When \mu=0.5 , the marginal and conditional terms are weighted equally, which is JDA

                 The difficulty of this method lies in estimating the parameter \mu, generally via the A-distance. The specific formula is:

                        d_A(D_s,D_t)=2(1-2\epsilon (h)); where h is a linear classifier trained to distinguish source from target and \epsilon (h) is its error

                It can then be estimated as \hat{\mu}\approx 1-\frac{d_M}{d_M+\sum_{c=1}^Cd_c}; where d_M=d_A(D_s,D_t) is the marginal A-distance and d_c=d_A(D_s^{(c)},D_t^{(c)}) is the per-class (conditional) A-distance.
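The estimate of \mu is a short computation once the domain classifier errors are known. A sketch with hypothetical, purely illustrative error numbers:

```python
# Hypothetical errors of a linear classifier h trained to separate source from target.
def a_distance(err):
    """A-distance: d_A = 2 * (1 - 2 * err), where err is the domain classifier's error."""
    return 2.0 * (1.0 - 2.0 * err)

# Illustrative numbers: marginal classifier error 0.05; per-class errors 0.3 and 0.4.
d_M = a_distance(0.05)                      # 1.8: marginals are easy to tell apart
d_c = [a_distance(0.3), a_distance(0.4)]    # 0.8 and 0.4: conditionals are closer

mu_hat = 1.0 - d_M / (d_M + sum(d_c))
print(round(mu_hat, 2))   # 0.4
```

Here the marginal gap dominates, so the estimated \mu leans toward the marginal (TCA-like) term.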

        3. The application of MMD in deep learning

        The above-mentioned TCA, JDA, and DDA can be incorporated into a neural network by means of Deep Domain Confusion (DDC) or a Dynamic Distribution Adaptation Network (DDAN). (The figure of the improved network structure is omitted here.)

        The loss function of the network is: L=L_c(x_i,y_i)+\lambda \cdot Distance(D_s,D_t); where Distance can be the TCA, JDA, or DDA term. The network is trained end-to-end with stochastic gradient descent.
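A minimal numpy sketch of this combined objective, using a linear-kernel MMD as the Distance term. The features, predictions, and \lambda here are dummy placeholders standing in for a real network's bottleneck outputs:

```python
import numpy as np

def cross_entropy(probs, labels):
    # Average negative log-likelihood of the true class.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def mmd2_linear(Fs, Ft):
    """Squared MMD with a linear kernel: distance between feature means."""
    return np.sum((Fs.mean(0) - Ft.mean(0)) ** 2)

rng = np.random.default_rng(0)
# Hypothetical bottleneck features for a source batch and a target batch,
# plus dummy softmax outputs and labels for the source batch.
Fs, Ft = rng.normal(0, 1, (32, 8)), rng.normal(0.5, 1, (32, 8))
probs = np.full((32, 2), 0.5)
labels = rng.integers(0, 2, size=32)

lam = 0.5
loss = cross_entropy(probs, labels) + lam * mmd2_linear(Fs, Ft)
print(loss)   # task loss plus weighted domain distance
```

In a real DDC/DDAN setup both terms are differentiable in the network weights, so SGD jointly minimizes task error and domain distance.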

5. Popular directions of transfer learning

         1. Low-resource learning : training with only a small amount of labeled data, e.g., via self-training .

         2. Safe transfer : preventing targeted attacks that exploit vulnerabilities inherited from public pre-trained models

         3. Domain adaptation

         4. Domain generalization


Origin blog.csdn.net/weixin_37878740/article/details/131153569