A Plain-Language Explanation of Meta-Learning

In layman's terms, meta-learning means learning how to learn (learning to learn): mastering the method of learning. Sometimes mastering how to learn matters more than studying hard!

Let's go through it in detail.

1. From traditional machine learning to meta-learning

In traditional machine learning, we choose an algorithm F, feed the data into it, learn a set of parameters θ, apply them to the test data, and obtain the result. As shown in the picture:

Based on this idea, can we also learn F itself?

Of course! Now our goal is to learn F.

As shown in the figure above, A is the meta-learning algorithm, and ω denotes the learnable parameters within the algorithm, called meta-knowledge. After obtaining the most suitable F, we feed the data into it to get f, and finally output the result.

Now a problem arises.

There is only one task in the picture above: distinguishing cats from dogs. What about multiple tasks? That means letting the algorithm classify not only cats and dogs, but also apples and oranges, and bicycles and cars. As shown below:

If F can do all of this, then ω at this point is a good algorithm for all classification tasks. After obtaining it, you can give it a new task, say separating phones from computers, and obtain the model f_θ; this model can then distinguish phones from computers.

The goal of single-task meta-learning is to find the algorithm best suited to that task, while multi-task meta-learning looks for the algorithm best suited to all tasks, one that can also handle new tasks.

1.1 How to learn the algorithm parameters ω?

How does traditional machine learning learn the parameters θ?

As shown in the figure below, we first build a model, feed in a "cat", and keep giving it feedback;

The second step is to define the loss function; as shown in the figure below, it is defined by cross entropy. That is, the predicted probability is compared with the true label: the lower the predicted probability of the correct class, the more wrong the classification and the higher the penalty, and the higher the penalty, the lower the probability of making the same wrong classification next time. The model learns in this way.

Finally, sum the losses to obtain the total loss and compute its gradient; the aim is to stop the model from making the same mistakes next time. Through continual optimization iterations we obtain θ*, the learned model parameters.
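
As a concrete toy illustration of this loop, here is a minimal 1-D logistic-regression classifier trained with cross-entropy and gradient descent; the data, names, and hyperparameters are all made up for illustration:

```python
import math

# Toy "cat vs dog" classifier: 1-D logistic regression trained with
# cross-entropy loss and plain gradient descent.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, steps=500):
    theta = 0.0                              # the model parameter θ
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            p = sigmoid(theta * x)           # predicted P(class = 1)
            # d/dθ of cross-entropy −[y log p + (1−y) log(1−p)]
            grad += (p - y) * x
        theta -= lr * grad / len(data)       # θ ← θ − lr · ∇L(θ)
    return theta

# Negative features → class 0 ("cat"), positive → class 1 ("dog").
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)]
theta_star = train(data)                     # the learned θ*
acc = sum((sigmoid(theta_star * x) > 0.5) == (y == 1)
          for x, y in data) / len(data)
print(acc)                                   # → 1.0 on this toy data
```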

The steps above find the parameters of the model; so how do we learn the parameters of the algorithm?

In fact, it works exactly the same way as for the model parameters; there is no essential difference.

Just as the model parameters are learned so that the model generalizes across data, in meta-learning we simply replace data with tasks, so that the algorithm performs well on tasks it has never seen before. How is that done? As shown below:

Given task 1 and task 2, we obtain the initialized algorithm F_ω; testing F_ω on each task yields a loss, and adding these together gives the final loss L(ω).

Compared with traditional machine learning: traditional machine learning computes its loss on training samples, while meta-learning computes it on test samples. But machine learning must not use test data for training, so what should we do?

Divide the training data into a support set (which optimizes the model parameters θ) and a query set (which optimizes the algorithm parameters ω; if ω is bad, θ is bound to be bad too). The test data is likewise divided into a support set and a query set, but the query set of the test data does not participate in learning.
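
The split just described can be sketched as follows (a hypothetical helper with made-up example data, not code from the original post):

```python
import random

# Divide one task's training examples into a support set (used to fit the
# model parameters θ) and a query set (used to evaluate, and hence update,
# the algorithm parameters ω).

def split_task(examples, n_support, seed=0):
    shuffled = examples[:]                 # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_support], shuffled[n_support:]

task = [("img%02d" % i, i % 2) for i in range(10)]   # 10 labelled examples
support, query = split_task(task, n_support=6)
print(len(support), len(query))            # → 6 4
```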

So what does this process look like? As follows:

First, there is a meta-learning algorithm. Given a support set, it produces a general model; that model is then evaluated on the query set to see how good it is. This eventually yields ω. Then, at test time, the algorithm is updated on the test support set to obtain a model, and that model makes the final prediction on the test query set.

In meta-learning, these training and test sets are called meta-training and meta-testing. Expressed in formulas:

Meta-training:

As shown in the figure above, optimization is divided into an inner layer and an outer layer: the outer layer learns the algorithm parameters, while the inner layer learns the model parameters. The inner layer takes ω and learns θ*; θ* is then checked on the query set. How is the learning verified? If the result is poor, it means ω is poor, so the loss is used to update ω, iterating continuously.
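
To make the inner/outer structure concrete, here is a tiny numeric sketch in which the meta-knowledge ω is taken to be the inner-loop learning rate (one possible choice of meta-knowledge); the task losses are made-up quadratics and the outer gradient is taken by finite differences, purely for illustration:

```python
# Inner loop: fit θ on each task's support loss, starting from θ = 0.
# Outer loop: update ω so that the adapted θ does well on the query loss.
# Task n's (support and query) loss is the toy quadratic (θ − c_n)².

tasks = [1.0, 2.0, 3.0]              # task n's optimum sits at c_n

def inner(omega, c):
    theta = 0.0                      # fixed initialisation
    grad = 2 * (theta - c)           # ∇θ of the support loss (θ − c)²
    return theta - omega * grad      # one inner step gives θ*

def meta_loss(omega):
    # Outer objective: total query loss of the adapted parameters.
    return sum((inner(omega, c) - c) ** 2 for c in tasks)

omega, eta, h = 0.1, 0.005, 1e-4
for _ in range(100):                 # outer loop: ω ← ω − η ∇L(ω)
    g = (meta_loss(omega + h) - meta_loss(omega - h)) / (2 * h)
    omega -= eta * g
print(round(omega, 3))               # → 0.5: one exact gradient step
```

With ω = 0.5, a single inner step lands exactly on each task's optimum, so the outer loop drives ω there.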

Meta-test:

After the best algorithm has been learned, a model is learned on the support set of the test data; the resulting θ* is the test model.

What could this meta-knowledge be? Many things, for example: hyperparameters, initialized model parameters, embeddings, the model architecture, the loss function, and so on.

2. Taxonomy of meta-learning

From a methodological perspective, meta-learning can be divided into three categories: optimization-based, model-based, and metric-based.


2.1 Optimization-based meta-learning

ω plays its role in the optimization process: it guides the optimization and tells you which optimizer to use. A related paper:

[1606.04474] Learning to learn by gradient descent by gradient descent (arxiv.org)

The title of the paper reads: learning, by gradient descent, how to learn by gradient descent.

In meta-learning, an appropriate optimization function is learned through the meta-knowledge ω; the formula is as follows:
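
As a toy illustration of the idea (not the paper's actual method): the update applied to θ is produced by a parametric function g_ω of the gradient, θ_{t+1} = θ_t + g_ω(∇l(θ_t)). In the paper g_ω is a learned RNN; in this hypothetical sketch it is just a hand-set scalar gain:

```python
# Minimise l(θ) = θ² using an "optimizer" g_ω that maps the gradient to
# an update. Here g_ω(grad) = ω · grad with ω fixed by hand; learning ω
# itself by gradient descent is what the paper's title refers to.

def g(grad, omega=-0.2):
    return omega * grad              # the parametric update rule g_ω

theta = 5.0                          # start far from the optimum θ = 0
for _ in range(50):
    grad = 2 * theta                 # ∇l(θ) for l(θ) = θ²
    theta = theta + g(grad)          # θ ← θ + g_ω(∇l(θ))
print(abs(theta) < 1e-6)             # → True: θ has reached ~0
```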

A typical optimization-based method is MAML (Model-Agnostic Meta-Learning):

We define the initialization parameter as ϕ, and the model parameters after training on the n-th test task as θ̂ⁿ. The total loss function is then L(ϕ) = Σₙ₌₁ᴺ lⁿ(θ̂ⁿ), while the pre-training loss is L(ϕ) = Σₙ₌₁ᴺ lⁿ(ϕ). The intuitive reading: the loss MAML evaluates is the test loss after training on each task, whereas pre-training computes the loss directly on ϕ, without any task training.

Teacher Li Hongyi gave a very vivid example. Suppose the model-parameter vectors ϕ and θ are both one-dimensional. The original intention of MAML is to find an unbiased ϕ such that, whether on the loss curve l¹ of task 1 or the loss curve l² of task 2, it can descend to each curve's own global optimum.

The original intention of model pre-training, by contrast, is to find from the start a ϕ that minimizes the sum of the losses of all tasks; it does not guarantee that every task can be trained to its best, for example l² may only converge to a local optimum. Mr. Li then made a very down-to-earth analogy: MAML is like choosing to study for a Ph.D., investing in your future ability to learn, while pre-training is like going straight to work, immediately cashing in the skills you have already learned and caring only about how you perform right now.

To sum up, the framework of the MAML algorithm is actually very simple. It is worth noting that the two learning rates ε and η are used differently:

  • For each sampled task n, compute the gradient on the support set and update the parameters: θⁿ = ϕ − ε∇ϕ lⁿ(ϕ)
  • Compute the sum of the losses of all tasks on the query set: L(ϕ) = Σₙ₌₁ᴺ lⁿ(θⁿ)
  • Update the initialization parameters: ϕ ⟵ ϕ − η∇ϕ L(ϕ)

This is the flow of the training process. Every parameter-update step is limited to a single gradient step (one-step); however, when the algorithm is applied to a new task at test time, more update steps can be taken.

Any model can use MAML.

The pseudo code is as follows:
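
As a runnable stand-in for the pseudo-code, here is a toy 1-D MAML sketch following the three steps above, with made-up quadratic task losses lₙ(θ) = (θ − cₙ)² and the meta-gradient taken by finite differences (the paper backpropagates through the inner step instead):

```python
tasks = [0.0, 2.0, 4.0]                  # task n's optimum is at c_n
eps_lr, eta = 0.1, 0.05                  # the two learning rates ε and η

def inner_step(phi, c):
    # Step 1: θ_n = ϕ − ε ∇l_n(ϕ), one gradient step on the support set.
    return phi - eps_lr * 2 * (phi - c)

def meta_loss(phi):
    # Step 2: L(ϕ) = Σ_n l_n(θ_n), the query loss AFTER adaptation.
    return sum((inner_step(phi, c) - c) ** 2 for c in tasks)

phi, h = 10.0, 1e-5
for _ in range(300):                     # Step 3: ϕ ← ϕ − η ∇L(ϕ)
    g = (meta_loss(phi + h) - meta_loss(phi - h)) / (2 * h)
    phi -= eta * g
print(round(phi, 3))                     # → 2.0, the mean of the c_n
```

On these symmetric quadratics the learned initialization ϕ sits midway between the task optima, so one inner step brings each task close to its own minimum.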

Application : few-shot learning

The difference lies in how the loss function is computed: MAML does not require the initialized model to do well, but requires the model after one update step to do well, whereas the pre-trained model requires ϕ itself to be good.

2.2 Model-based meta-learning

Learn a model directly: the meta-knowledge is used to generate the model itself.

There are the following models:

  • Memory-Augmented Neural Network (MANN)
  • Meta Networks (MetaNet)
  • Task-Agnostic Meta-Learning (TAML)
  • Simple Neural Attentive Meta-Learner (SNAIL)

Advantages:

The internal dynamics of the system are flexible, giving these methods wider applicability than metric-based ones.

Disadvantages:

  • With large amounts of data, performance is poor
  • On supervised tasks, it is inferior to metric-based meta-learning
  • When tasks are far apart, it is weaker than optimization-based meta-learning; it also depends heavily on the network structure, whose design depends on the characteristics of the task at hand, so tasks that differ greatly require the network structure to be redesigned

2.3 Metric-based meta-learning

Learn an effective metric space that represents the similarity between two sets of samples, then quickly update and adapt to new tasks based on that metric space.

There are the following models:

  • Siamese Network (SiameseNet)
  • Matching Network (MatchingNet)
  • Attentive Recurrent Comparator (ARC)
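
To illustrate the metric-based idea, here is a minimal nearest-prototype sketch; it is a hypothetical simplification in the spirit of the networks above, where the "embedding" is the identity function, whereas a real method would learn it:

```python
import math

# Classify a query by its Euclidean distance to each class's "prototype",
# the mean of that class's support embeddings.

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def prototypes(support_x, support_y):
    # One prototype per class: the mean of its support examples.
    return {c: mean([x for x, y in zip(support_x, support_y) if y == c])
            for c in set(support_y)}

def classify(query, protos):
    # Nearest prototype under Euclidean distance.
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return min(protos, key=lambda c: dist(query, protos[c]))

support_x = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]]
support_y = [0, 0, 1, 1]
protos = prototypes(support_x, support_y)
print(classify([0.1, 0.2], protos))   # → 0
print(classify([5.2, 4.9], protos))   # → 1
```

Note that nothing task-specific is trained at prediction time: a new class only needs a few support examples to form its prototype, which is why such methods suit few-shot settings.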

Advantages:

When the number of tasks is small, the network needs no task-specific adjustment and prediction is fast; similarity-based prediction is also conceptually simple.

Disadvantages:

  • When the training and test tasks are far apart, the method cannot absorb new task information into the network weights, and the encoding process must be retrained
  • When tasks are large, pairwise comparison is computationally expensive; the method also depends heavily on labels, so it only suits supervised settings
  • The encoded samples are difficult to interpret
  • Simply using distance to express similarity may be unreasonable

Origin blog.csdn.net/bruce__ray/article/details/131144371