One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning

Copyright notice: copyright reserved to Hazekiah Wang ([email protected]) https://blog.csdn.net/u010909964/article/details/84946878

Introduction

The goal is to enable a robot to learn a new task from a single raw video of a human demonstration, with the help of prior knowledge gained from a set of old tasks for which both human and robot demonstrations are provided.

Learning from video demonstrations presents two major challenges:

  1. the correspondence problem, sometimes called domain shift, referring to the difference in appearance and morphology between humans and robots;
  2. the difficulty of obtaining a substantial amount of data.

For the domain-shift problem, this work takes a data-driven approach, which is claimed to work better than the manually specified correspondences used in prior work.

For the data problem, this work resorts to meta-learning, which can learn from very little data by drawing on a rich prior over old tasks acquired during meta-training. I think of this as trading a large amount of data on one task for data spread across many similar tasks.

Related Work

The correspondence problem, or domain shift, has previously been addressed by: 1) manually specifying a correspondence; 2) resorting to domain adaptation, where an invariant representation, or a mapping from one domain to the other, is learned.

Q5

The authors claim that the physical correspondence between human and robot calls for neither an invariant representation nor a direct mapping between the domains, but they do not provide sufficient evidence to make this claim compelling.

From a reinforcement-learning perspective, prior efforts to learn from human demonstrations include: 1) explicitly inferring the goal or reward underlying the human behavior, e.g. inverse RL; 2) assuming a known model and performing trajectory optimization to reach the inferred goal. These approaches treat tasks in isolation and hence require a large amount of data, here in the form of human demonstrations.

Preliminaries

This work revolves around meta-learning, which naturally sidesteps the problem of costly data; what remains to deal with is the domain shift, which they handle by extending MAML.

A typical meta-learning process can be formalized as learning a prior over functions (discovering structure shared across tasks), with the subsequent fine-tuning process acting as inference under the learned prior (using the learned structure to adapt quickly to a new task). MAML optimizes for a set of model parameters such that one or a few gradient steps on a task's meta-train set produce good performance on its meta-validation set; at meta-test time, fine-tuning starts from these optimized parameters to adapt to the given new task. This approach is reported to learn from very little data because the prior is acquired during meta-training, whose data is not counted against the new task.

For convenience, call the inner loss the adaptation objective, since it simulates the fast-adaptation step starting from the current model parameters (what we obtain after meta-training), and call the outer loss the meta-objective, since it evaluates the fast-adapted parameters on a held-out set; the meta-objective is what we are actually optimizing.
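
To make the two objectives concrete, below is a minimal sketch of one MAML meta-training step in PyTorch. This is my own illustration, not the paper's code: `policy`, `adaptation_loss`, and `meta_loss` are assumed placeholders, and `torch.func.functional_call` (PyTorch 2.x) is used to evaluate the policy under the adapted parameters.

```python
import torch
from torch import nn
from torch.func import functional_call

def maml_step(policy: nn.Module, train_batch, val_batch,
              adaptation_loss, meta_loss, inner_lr: float = 0.01):
    """One MAML meta-training step (illustrative sketch, not the paper's code)."""
    x_tr, y_tr = train_batch      # task's meta-train data
    x_val, y_val = val_batch      # task's held-out meta-val data
    names, params = zip(*policy.named_parameters())

    # Adaptation objective (inner loss): one gradient step on meta-train data,
    # kept differentiable with create_graph=True so we can backprop through it.
    inner = adaptation_loss(policy(x_tr), y_tr)
    grads = torch.autograd.grad(inner, params, create_graph=True)
    adapted = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}

    # Meta-objective (outer loss): evaluate the fast-adapted parameters on the
    # held-out data; its gradient w.r.t. the original parameters updates the prior.
    outer = meta_loss(functional_call(policy, adapted, (x_val,)), y_val)
    return outer
```

Calling `outer.backward()` and stepping an ordinary optimizer over `policy.parameters()` would complete the meta-update.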

Prior work (paper, note) has applied MAML to robot imitation learning, but there the demonstrations provided at meta-test time are robot demonstrations. This work feeds in only human demonstrations at meta-test time, and further proposes to handle the domain shift between human demonstrations and robot ones. Note, however, that this does not mean robot demonstrations are no longer needed – they are still required during meta-training.

Learning from humans

Formally, human demonstrations are sequences of images, while robot demonstrations comprise image observations, robot states, and robot actions. The approach is two-phased: 1) acquire a prior over policies from both human and robot demonstrations of several tasks; 2) use the prior to quickly learn to imitate a new task from only a human demonstration.
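
As a concrete picture of the two kinds of data, here is an assumed representation; the field names and shapes are mine, not the paper's.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HumanDemo:
    """Human demonstration: raw video only, with no states or actions."""
    frames: np.ndarray   # (T, H, W, 3) video frames

@dataclass
class RobotDemo:
    """Robot demonstration: images plus robot state and the actions taken."""
    frames: np.ndarray   # (T, H, W, 3) camera observations
    states: np.ndarray   # (T, state_dim) proprioceptive state, e.g. end-effector pose
    actions: np.ndarray  # (T, action_dim) commanded actions
```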

During meta-training, besides maximizing fast-adaptation performance, the approach must also learn the correspondence between human demonstrations and robot demonstrations, since at meta-test time only human demonstrations are provided, yet the actions must be executed by a robot.

Domain-Adaptive Meta-Learning

The logic presented in the paper is a bit confusing, because the authors do not explain why they formulate it this way. They want the policy to extract the behavior and the relevant objects from either human or robot demonstrations, meaning that both forms of data should somehow be involved in meta-training. They propose to use human demonstrations in the adaptation objective and robot demonstrations in the meta-objective. It is straightforward to use the supervised behavioral cloning objective as the meta-objective, maximizing the likelihood of the robot demonstrations, since observations, states, and actions are all provided; this is not the case for the human demonstrations in the adaptation objective. Human demonstrations come only as video, with no explicit states or actions, and hence a behavioral cloning loss will not work there.
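
The meta-objective side is then ordinary behavioral cloning on the robot demonstration. A minimal sketch, assuming a policy whose forward pass takes an image observation and a robot state and returns a predicted action, and using mean squared error as a stand-in for the likelihood term:

```python
import torch.nn.functional as F

def behavioral_cloning_loss(policy, robot_demo):
    """Meta-objective sketch: supervised regression onto the robot's actions.

    The robot demonstration provides observations, states, and ground-truth
    actions, so an ordinary supervised loss is available here (unlike for
    human video, which has no action labels).
    """
    obs, state, action = robot_demo       # tensors shaped (T, ...)
    predicted = policy(obs, state)        # (T, action_dim)
    return F.mse_loss(predicted, action)  # stands in for -log p(action | obs, state)
```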

Q1

What if I use robot demonstrations in the inner loss and use human ones in the outer loss?

To solve this, they meta-learn a loss function for the adaptation objective: a learned function that can ingest human demonstrations and generate gradients that help supervise the policy's actions. This is reasonable given that human observations and robot observations play corresponding roles as inputs to the policy network. If improvement on the learned inner loss also promises good performance on the outer loss, I think the learned loss is a good fit.
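
Putting the two halves together, one meta-training step of this domain-adaptive scheme might look like the sketch below. This is my reading of the method rather than the authors' code: `learned_loss` stands for the meta-learned adaptation objective (parameterized by ψ), `policy.features` is an assumed hook returning per-frame activations, and the policy's forward pass is assumed to take an observation and a state.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def daml_step(policy, learned_loss, human_video, robot_demo, inner_lr=0.01):
    """One domain-adaptive meta-training step (illustrative sketch).

    Inner loop: adapt the policy to the task using ONLY the human video,
                through the meta-learned loss (no action or state labels).
    Outer loop: score the adapted policy with behavioral cloning on the
                paired robot demonstration of the same task.
    """
    names, params = zip(*policy.named_parameters())

    # Adaptation objective: learned loss over policy activations on human frames.
    features = policy.features(human_video)   # assumed per-frame feature extractor
    inner = learned_loss(features)            # scalar, parameterized by psi
    grads = torch.autograd.grad(inner, params, create_graph=True, allow_unused=True)
    adapted = {n: (p - inner_lr * g) if g is not None else p
               for n, p, g in zip(names, params, grads)}

    # Meta-objective: behavioral cloning under the adapted parameters.
    obs, state, action = robot_demo
    predicted = functional_call(policy, adapted, (obs, state))
    outer = F.mse_loss(predicted, action)
    return outer   # backprop updates both the policy prior (theta) and the loss (psi)
```

Because the outer gradient flows through the inner step, the learned loss ψ is trained for exactly the property questioned above: whatever it rewards on human video must translate into good behavioral cloning on the robot side.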

Q2

I have some vague misgivings that I cannot explain clearly. If we view different $\theta$ as representing a set of functions, we are maximizing the meta-objective to find an optimal function. But in MAML, the search space seems to be limited to parameters of the form $\theta - \alpha \nabla_{\theta} L$, so the solution might well be sub-optimal.

Q3

I still don’t understand meta-learning a loss.

If we consider the learned loss as responsible for providing gradients that supervise the policy's actions, this gradient information should draw on evidence from multiple time steps, given the nature of human motion. So this work uses temporal convolutions rather than an LSTM.
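
A minimal sketch of what a temporally convolved learned loss could look like; the layer sizes, pooling, and final squared head are my guesses, not the paper's exact architecture.

```python
import torch
from torch import nn

class TemporalConvLoss(nn.Module):
    """Meta-learned adaptation loss: temporal convolutions over per-frame
    policy features, reduced to a single non-negative scalar."""

    def __init__(self, feature_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feature_dim, hidden, kernel_size=3, padding=1),  # mixes neighboring time steps
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (T, feature_dim) per-frame activations from the policy
        x = self.net(features.t().unsqueeze(0))  # (1, hidden, T)
        x = x.mean(dim=2).squeeze(0)             # pool over time -> (hidden,)
        return self.head(x).pow(2).sum()         # non-negative scalar loss
```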

The learned loss here takes a more expressive form than the one in the prior work: the authors argue that the earlier two-head structure amounts to a linear loss function.

Q4

I don’t understand the probabilistic interpretations.
