A general-purpose few-shot learner for dense prediction tasks

In this work, the researchers design and implement Visual Token Matching (VTM), a few-shot learner that can be used for arbitrary dense prediction tasks. It is the first few-shot learner suitable for all dense prediction tasks.

Dense prediction tasks are an important class of tasks in computer vision, including semantic segmentation, depth estimation, edge detection, and keypoint detection. For such tasks, manually annotating pixel-level labels is prohibitively expensive. How to learn from a small amount of labeled data and still make accurate predictions, that is, few-shot learning, is therefore a topic of great interest in the field. In recent years, research on few-shot learning has made continuous breakthroughs, especially methods based on meta-learning and adversarial learning, which have attracted wide attention in the academic community.

However, existing few-shot learning methods in computer vision generally target a specific class of tasks, such as classification or semantic segmentation. They typically exploit prior knowledge and assumptions specific to those tasks when designing the model architecture and training procedure, and therefore do not generalize to arbitrary dense prediction tasks. Researchers at Microsoft Research Asia wanted to explore a core question: is there a general few-shot learner that can learn an arbitrary, previously unseen dense prediction task from a small number of labeled images?

The goal of a dense prediction task T is to learn a mapping from an input image to a pixel-wise label, which can be defined as:

T: R^(H×W×3) → R^(H×W×C_T)
Here H and W are the height and width of the image, the input image has three RGB channels, and C_T is the number of output channels. Different dense prediction tasks can involve different numbers of output channels and channel attributes; for example, the output of semantic segmentation is multi-channel and binary, while the output of depth estimation is single-channel and continuous. A general few-shot learner F should, for any such task T, given a small support set S_T of labeled examples (containing N pairs of images X^i and labels Y^i), produce a prediction for an unseen query image X^q, namely:

Ŷ^q = F(X^q; S_T), where S_T = {(X^i, Y^i)}, i = 1, …, N
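To make the definition concrete, below is a minimal sketch of how a support set, a query image, and the learner's interface look as tensors. The shapes, sizes, and names are purely illustrative assumptions, not the authors' code.

```python
import torch

# Hypothetical shapes only; sizes and names are illustrative assumptions.
H, W, N = 256, 256, 10          # image resolution and number of support examples

# Semantic segmentation: multi-channel binary labels (here C_T = 21 classes, assumed).
support_images = torch.rand(N, 3, H, W)
support_labels_seg = (torch.rand(N, 21, H, W) > 0.5).float()

# Depth estimation: single-channel continuous labels (C_T = 1).
support_labels_depth = torch.rand(N, 1, H, W)

query_image = torch.rand(1, 3, H, W)

# A universal few-shot learner F takes the support set S_T and the query image X^q
# and returns a prediction with that task's number of output channels:
#   Y_hat = F((support_images, support_labels), query_image)   # shape (1, C_T, H, W)
```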
If there exists a general few-shot learner suitable for arbitrary dense prediction tasks, then the following expectations must be met:

  • First, it must have a unified architecture. The architecture must be able to handle arbitrary dense prediction tasks and share most of its parameters across tasks, so that it acquires generalizable knowledge and can learn any unseen task from a small number of samples.

  • Second, the learner should flexibly adjust its prediction mechanism to address unseen tasks with diverse semantics, while remaining efficient enough to prevent overfitting.

Therefore, researchers at Microsoft Research Asia designed and implemented Visual Token Matching (VTM), a few-shot learner that can be used for arbitrary dense prediction tasks. It is the first few-shot learner applicable to all dense prediction tasks, and it opens up new ideas for handling dense prediction tasks and few-shot learning in computer vision. This work won the ICLR 2023 Outstanding Paper Award.

Paper: Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching

Link: https://arxiv.org/abs/2303.14969

The design of VTM is inspired by an analogy with the human thinking process: given a few examples of a new task, humans can quickly assign similar outputs to similar inputs based on the similarity between examples, while flexibly adjusting, according to the given context, at what level inputs and outputs are considered similar. The researchers implement an analogous process for dense prediction using non-parametric matching at the patch level. Through training, the model is encouraged to capture the similarity between image patches.

Given a small number of labeled examples for a new task, VTM first adjusts its notion of similarity according to the given examples and their labels, finds the example patches that are similar to the patch to be predicted, and combines their labels to predict the label of the unseen patch.

VTM employs a hierarchical encoder-decoder architecture that implements patch-level non-parametric matching at multiple levels. It consists of four modules: an image encoder f_T, a label encoder g, a matching module, and a label decoder h. Given a query image and a support set, the image encoder first extracts patch-level tokens from the query image and each support image independently, and the label encoder similarly extracts tokens from each support label. Given the tokens at each level, the matching module performs non-parametric matching, and finally the label decoder infers the label of the query image.
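The core step can be sketched as follows: tokens of the query image are matched against tokens of the support images, and the resulting similarities are used to aggregate the corresponding support label tokens. This is a minimal sketch of the non-parametric matching at a single level; the paper implements matching with attention across several levels of the hierarchy, and the function name, shapes, and hyperparameters below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def token_matching(query_img_tokens, support_img_tokens, support_lbl_tokens, temperature=1.0):
    """Patch-level non-parametric matching at one level of the hierarchy (sketch).

    query_img_tokens:   (M, D)   tokens of the query image from the image encoder f_T
    support_img_tokens: (N*M, D) tokens of all N support images
    support_lbl_tokens: (N*M, D) tokens of the corresponding support labels from g
    Returns predicted label tokens for the query image, shape (M, D).
    """
    # Similarity between every query patch and every support patch.
    sim = query_img_tokens @ support_img_tokens.t() / temperature   # (M, N*M)
    attn = F.softmax(sim, dim=-1)
    # Aggregate the support *label* tokens according to image-token similarity.
    return attn @ support_lbl_tokens                                 # (M, D)

# Toy usage with random tokens (D = 64, 16x16 = 256 patches per image, N = 10 supports).
M, D, N = 256, 64, 10
pred_lbl_tokens = token_matching(torch.randn(M, D),
                                 torch.randn(N * M, D),
                                 torch.randn(N * M, D))
print(pred_lbl_tokens.shape)  # torch.Size([256, 64])
```

In the full model, this matching is repeated at multiple feature levels, and the aggregated label tokens are decoded back into a dense label map by the label decoder h.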

The essence of VTM is meta-learning. Its training consists of multiple episodes, each of which simulates one few-shot learning problem. VTM is trained on a meta-training dataset D_train that contains labeled examples of various dense prediction tasks. Each training episode simulates the few-shot learning scenario of a specific task T_train from the dataset, and the goal is to produce the correct label for a query image given the support set. Through the experience of many such few-shot learning episodes, the model acquires general knowledge that allows it to quickly and flexibly adapt to new tasks. At test time, the model performs few-shot learning on arbitrary tasks T_test that are not contained in the training dataset D_train.
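The episodic training described above can be sketched roughly as the loop below. Here `model`, `sample_fn`, the loss, and all hyperparameters are placeholders under assumption, not the authors' implementation.

```python
import random
import torch.nn.functional as F

def meta_train(model, train_tasks, sample_fn, optimizer, episodes=1000, shots=10, queries=4):
    """Episodic meta-training sketch: each episode simulates one few-shot problem.

    model:      a callable implementing F(support_set, query_images) -> predicted labels
    sample_fn:  placeholder sampler; sample_fn(task, k) returns (images, labels)
                tensors with k labeled examples of the given task from D_train.
    """
    for _ in range(episodes):
        task = random.choice(train_tasks)              # pick one task T_train for this episode
        support = sample_fn(task, shots)               # support set S_T = {(X^i, Y^i)}
        query_x, query_y = sample_fn(task, queries)    # held-out query images and labels

        pred = model(support, query_x)                 # predict query labels from the support set
        loss = F.mse_loss(pred, query_y)               # a task-appropriate loss is used in practice

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```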

When handling arbitrary tasks, the output dimension C_T differs between the tasks used in meta-training and testing, which makes it challenging to design a single set of model parameters shared by all tasks. As a simple and general solution, the researchers decompose a task into C_T single-channel subtasks, learn each channel separately, and model each subtask independently with a shared model F.
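A rough sketch of this channel-wise decomposition is shown below; `shared_model` is a hypothetical callable standing in for the shared single-channel learner, and the function name and signature are assumptions for illustration.

```python
import torch

def predict_with_channel_decomposition(shared_model, support_images, support_labels, query_image):
    """Decompose a C_T-channel task into C_T single-channel subtasks (sketch).

    support_labels: (N, C_T, H, W). Each channel is treated as an independent
    single-channel task and predicted with the same shared model.
    """
    predictions = []
    for c in range(support_labels.shape[1]):
        channel_labels = support_labels[:, c:c + 1]                       # (N, 1, H, W)
        predictions.append(shared_model(support_images, channel_labels, query_image))
    return torch.cat(predictions, dim=1)                                  # (1, C_T, H, W)
```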

To test VTM, the researchers also constructed a variant of the Taskonomy dataset to simulate few-shot learning of unseen dense prediction tasks. Taskonomy contains a variety of annotated indoor images; from it the researchers selected ten dense prediction tasks with different semantics and output dimensions and divided them into five partitions for cross-validation. In each split, two tasks are used for few-shot evaluation (T_test) and the remaining eight are used for training (T_train). The partitions were carefully constructed so that training and test tasks are sufficiently different from each other, for example by grouping the edge tasks (TE, OE) into the test tasks, so that tasks with new semantics can be evaluated.

Table 1: Quantitative comparison on the Taskonomy dataset. Few-shot baselines are trained on the tasks of the other partitions and then perform 10-shot learning on the test-partition tasks, while the fully supervised baselines are trained and evaluated on the tasks of each fold individually (DPT) or on all folds jointly (InvPT).

Table 1 and Figure 2 show, quantitatively and qualitatively, the few-shot learning performance of VTM and two groups of baseline models on the ten dense prediction tasks. Among them, DPT and InvPT are two state-of-the-art supervised learning methods: DPT is trained independently on each single task, while InvPT is trained jointly on all tasks. Since no dedicated few-shot method for general dense prediction tasks existed before VTM, the researchers compared VTM with three state-of-the-art few-shot segmentation methods, DGPNet, HSNet, and VAT, extending them to handle the general label spaces of dense prediction tasks. VTM had no access to the test tasks T_test during training and used only a small number (10) of labeled images at test time, yet it performed best among all few-shot baselines and was competitive with the fully supervised baselines on many tasks.

Figure 2: Qualitative comparison of few-shot learning methods on the ten dense prediction tasks of Taskonomy, given only ten labeled images of a new task. Where other methods fail, VTM successfully learns all new tasks with different semantics and different label representations.

In Figure 2, the rows above the dotted line are the ground-truth labels and the two supervised learning methods DPT and InvPT; the rows below the dotted line are the few-shot learning methods. Notably, the other few-shot baselines underfit catastrophically on new tasks, while VTM successfully learns all of them. The experiments show that VTM can perform comparably to fully supervised baselines with a very small number of labeled examples (<0.004% of full supervision), and that with relatively little additional data (0.1% of full supervision) it can close the gap with the supervised methods and even surpass them.

In summary, although the underlying idea of VTM is very simple, it has a unified architecture that can be used for arbitrary dense prediction tasks, since the matching algorithm naturally encompasses all tasks and label structures (e.g., continuous or discrete). Moreover, VTM introduces only a small number of task-specific parameters, which makes it both resistant to overfitting and flexible. In the future, the researchers hope to further explore how the task types, data volume, and data distribution used during pre-training affect the generalization performance of the model, so as to help build a truly universal few-shot learner.
