Zero-Shot Learning

Zero-shot learning is one of the most challenging problems in machine recognition. In 2009, Lampert et al. introduced the Animals with Attributes dataset together with a classic attribute-based learning algorithm, and the problem began to attract widespread attention.

Zero-shot learning involves three main kinds of data:

  1. Known (seen) classes: classes whose labeled samples are used in model training.
  2. Unknown (unseen) classes: classes with no labeled samples during training; the model meets them only at test time.
  3. Auxiliary information: descriptions, semantic attributes, or word embeddings of both known and unknown classes. This information acts as a bridge between known and unknown classes.
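
As a rough illustration of these three kinds of data, the sketch below separates a toy labeled collection by class. The file names, class lists, and text descriptions are all hypothetical and chosen only to show the structure.

```python
# Hypothetical toy data: (sample identifier, class label) pairs.
samples = [
    ("img_001.jpg", "cat"),
    ("img_002.jpg", "dog"),
    ("img_003.jpg", "bird"),
    ("img_004.jpg", "horse"),
]

known_classes = {"cat", "dog", "bird"}  # labeled samples available for training
unknown_classes = {"horse"}             # no labeled samples during training

# Auxiliary information covers BOTH known and unknown classes.
# These descriptions are made up for illustration.
auxiliary = {
    "cat": "a small four-legged animal with whiskers",
    "dog": "a four-legged animal that barks",
    "bird": "an animal with feathers that can fly",
    "horse": "an animal with four legs and a long tail",
}

def split_by_class(samples, known):
    """Return (training samples, test-only samples) based on class membership."""
    train = [s for s in samples if s[1] in known]
    test_only = [s for s in samples if s[1] not in known]
    return train, test_only

train_set, test_only_set = split_by_class(samples, known_classes)
print(len(train_set), "training samples;", len(test_only_set), "test-only sample")
```

The key property is that the two class sets do not overlap, while the auxiliary information spans both.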

Zero-Shot Learning (ZSL) is a machine learning strategy whose goal is to enable a model to understand and recognize categories it did not encounter during the training phase.

  • The concept was introduced to address a real-world problem: there are too many categories and too few samples per category.
  • A model that can recognize new categories without labeled examples of them would therefore be extremely useful.

Principle:
In ZSL, the model usually receives some additional information that helps it understand new categories.

  • This information is usually provided in a human-understandable form, such as a textual description, a list of attributes, or a semantic vector.
  • The model learns the relationship between the known categories and this additional information, and then applies that relationship to new categories.


Example

For example, an image classification model might have seen images of only three categories during training: "cat", "dog", and "bird".

  • In the ZSL setting, we might want the model to recognize images of "horse" as well.
  • To achieve this, we give the model a description of "horse".
    • For example, "A horse is an animal with four legs and a long tail."
    • The model then uses this description to recognize the category "horse".
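
The idea in this example can be sketched by writing each class description as a binary attribute vector and matching an image's attribute scores against those vectors. Everything below is an assumption made for illustration: the attribute vocabulary, the class vectors, and the scores, which stand in for the output of some attribute predictor.

```python
import numpy as np

# Hypothetical attribute vocabulary and class descriptions (1 = has the attribute).
ATTRIBUTES = ["four_legs", "long_tail", "feathers", "can_fly", "hooves", "barks"]
class_descriptions = {
    "cat":   np.array([1, 1, 0, 0, 0, 0], dtype=float),
    "dog":   np.array([1, 1, 0, 0, 0, 1], dtype=float),
    "bird":  np.array([0, 0, 1, 1, 0, 0], dtype=float),
    "horse": np.array([1, 1, 0, 0, 1, 0], dtype=float),  # described, never seen in training
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_class(attribute_scores, descriptions):
    """Return the class whose description is most similar to the scores."""
    return max(
        descriptions,
        key=lambda name: cosine_similarity(attribute_scores, descriptions[name]),
    )

# Made-up scores that an attribute predictor might output for a photo of a horse.
scores = np.array([0.9, 0.8, 0.1, 0.0, 0.7, 0.1])
prediction = closest_class(scores, class_descriptions)
print(prediction)
```

Note that a description as short as "four legs and a long tail" would also fit a cat or a dog, which is why the sketch adds further attributes to tell the classes apart.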


Current research methods

As mentioned above, realizing ZSL requires solving two problems:

  1. Obtaining a suitable class description.
  2. Building an appropriate classification model.

Most current work focuses on the second problem, while research on the first has progressed relatively slowly.


Development and Status

ZSL research has expanded from its initial focus on cross-modal learning between text and images to many other fields, such as video analysis and speech recognition.

  • ZSL methods have also improved gradually, for example moving from the original attribute-label learning to deep learning models that learn higher-level semantic information.

Difficulty

The main difficulty of ZSL is establishing an effective mapping from known categories to unknown categories.

  • A second difficulty is that the data distribution the model faces differs between the training and testing phases (this is known as the domain shift or domain gap problem).

Innovation

The main innovation of ZSL is a new learning paradigm that allows the model to handle categories it has not seen during the training phase.

  • In addition, ZSL uses human-understandable semantic information to help the model understand new categories, which traditional supervised learning methods do not use.


General process

The basic principle of Zero-Shot Learning (ZSL) is to understand and recognize new, unseen categories by learning the mapping between known categories and semantic descriptions.

  • In many ZSL models, we need a way to represent the semantic information of categories, which is usually achieved through so-called attribute vectors or category embedding vectors.

The following is a simplified version of the general process of ZSL:

  1. Category embedding: First, we need a vector that represents the semantic information of each category. It can be built from a set of predefined attributes (e.g. "has feathers", "can fly"), or obtained in other ways, such as feeding the category name into a pre-trained word embedding model (such as Word2Vec or GloVe) and using the resulting vector.

  2. Feature extraction: Then, we need to extract features from the input data (for example, images or text). This can often be done with a pretrained deep neural network.

  3. Mapping learning: Next, we need to learn a mapping from the feature space of the input data to the category embedding space. This can be done in many ways. For example, a neural network can be trained so that, in the class embedding space, features of samples from the same class lie as close together as possible while features of samples from different classes lie as far apart as possible. Training minimizes a distance or loss function, such as cosine distance or cross-entropy loss.

The above is the basic process of ZSL.
Of course, an actual ZSL model may be more complicated: it may take the relationships between categories into account or use more complex mapping functions.
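
Step 3 mentions cosine distance and cross-entropy as possible objectives. One way to combine them, shown here purely as a sketch, is to treat scaled cosine similarities between mapped features and class embeddings as logits and apply cross-entropy to them. The function names, the `scale` value, and the toy vectors are assumptions and are not taken from any specific ZSL method.

```python
import numpy as np

def cosine_logits(mapped, class_embeddings, scale=10.0):
    """Scaled cosine similarity between each mapped feature and each class embedding."""
    mapped_n = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    classes_n = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return scale * (mapped_n @ classes_n.T)

def cross_entropy(logits, labels):
    """Mean cross-entropy, using a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Two toy class embeddings and two mapped features (one sample per class).
class_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
mapped = np.array([[2.0, 0.1], [0.1, 3.0]])
labels = np.array([0, 1])

logits = cosine_logits(mapped, class_embeddings)
loss_aligned = cross_entropy(logits, labels)      # features point at their own class
loss_swapped = cross_entropy(logits, 1 - labels)  # labels deliberately swapped
print(loss_aligned, loss_swapped)
```

The loss is small when each mapped feature points toward its own class embedding and large when the labels are swapped, which is the behavior a training procedure would exploit.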



Implementation

For the specific implementation of Zero-Shot Learning (ZSL), one of the common methods is to use visual-semantic mapping.
Let's walk through the process in detail:

1. Visual feature extraction: We first extract visual features from the input samples. Assuming the input samples are images, we can use a Convolutional Neural Network (CNN) to extract features. We use φ(·) to denote the feature extraction function, so for an input image x we obtain its visual features φ(x).

2. Semantic embeddings: We need a way to represent the semantic information of categories. This can be done with word embedding models such as Word2Vec or GloVe. We use ψ(·) to denote the word embedding function, so for a category c we obtain its semantic embedding ψ(c).

3. Visual-semantic mapping: Next, we need to learn a mapping function f(·) that maps the visual feature space to the semantic embedding space. Suppose we have a data set D = {(x1, c1), (x2, c2), ..., (xn, cn)}, in which xi is an image and ci is the corresponding category. Our goal is to learn f(·) such that, for all samples, f(φ(xi)) is as close as possible to ψ(ci). This can be achieved by minimizing the following loss function:

L = Σ ||f(φ(xi)) - ψ(ci)||^2

Here ||·||^2 denotes the squared L2 norm, i.e. the squared Euclidean distance, and Σ means summing over all samples.

During training, we learn the mapping function f(·) by minimizing the loss function with optimization methods such as backpropagation and gradient descent.

4. Prediction: In the test phase, given a new image x, we first compute f(φ(x)), its representation in the semantic embedding space, and then find the closest semantic embedding ψ(c); the corresponding category c is our prediction.

This is the application of the visual-semantic mapping method in ZSL. Note that this is just one approach, and there are actually many ways to implement ZSL, each with its advantages and limitations.
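
The four steps can be sketched end to end on synthetic data. This is a minimal sketch under loud assumptions: the embeddings ψ(c) are arbitrary toy vectors rather than Word2Vec or GloVe output, φ(x) is simulated as a noisy linear function of ψ(c) instead of coming from a CNN, f is restricted to a linear map, and the extra classes "fox" and "zebra" are hypothetical additions that make the toy problem non-trivial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: toy semantic embeddings psi(c) (arbitrary vectors, for illustration only).
psi = {
    "cat":   np.array([1.0, 0.0, 0.0]),
    "dog":   np.array([0.0, 1.0, 0.0]),
    "bird":  np.array([0.0, 0.0, 1.0]),
    "fox":   np.array([1.0, 1.0, 0.0]),
    "horse": np.array([1.0, 0.0, 1.0]),
    "zebra": np.array([0.0, 1.0, 1.0]),
}
seen_classes = ["cat", "dog", "bird", "fox"]
unseen_classes = ["horse", "zebra"]

# Step 1: stand-in for CNN features phi(x), a fixed linear function of psi(c) plus noise.
MIXING = np.hstack([np.eye(3), np.eye(3)])  # shape (3, 6)

def phi_sample(class_name):
    """Simulate the visual features of one image of the given class."""
    return psi[class_name] @ MIXING + 0.01 * rng.normal(size=6)

# The training set contains seen classes only.
train_labels = [c for c in seen_classes for _ in range(20)]
X = np.stack([phi_sample(c) for c in train_labels])  # (80, 6) visual features
Y = np.stack([psi[c] for c in train_labels])         # (80, 3) target embeddings

# Step 3: linear mapping f(v) = v @ W, trained by gradient descent on the
# squared L2 loss (averaged over samples rather than summed).
W = np.zeros((6, 3))
learning_rate = 0.1
for _ in range(2000):
    gradient = (2.0 / len(X)) * X.T @ (X @ W - Y)
    W -= learning_rate * gradient

# Step 4: predict the class whose embedding is closest to f(phi(x)).
def predict(features, candidates):
    mapped = features @ W
    return min(candidates, key=lambda c: np.sum((mapped - psi[c]) ** 2))

print(predict(phi_sample("horse"), unseen_classes))
print(predict(phi_sample("zebra"), unseen_classes))
```

The sketch works only because the unseen embeddings lie in the span of the seen ones, so the learned map generalizes to them; real data gives no such guarantee, which is one face of the domain shift problem.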


Simplified implementation process

In Zero-Shot Learning (ZSL), a common framework is to directly learn the mapping from visual features to the semantic embedding space. Assume we have a function f(·) that extracts the features of an input sample and a function g(·) that represents the semantic information of a category. Our goal is to learn a mapping function h(·) so that h(f(x)) can be used to predict the category of sample x. (Here f, g, and h play the roles of φ, ψ, and f in the previous section.)

To give a more concrete mathematical form, suppose we have a data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi is a sample and yi is the corresponding class label. Each category y corresponds to a semantic embedding g(y). Our goal is to learn a mapping function h(·) such that, for all samples, h(f(xi)) is as close as possible to g(yi). This can be achieved by minimizing the following loss function:

L = Σ ||h(f(xi)) - g(yi)||^2

Here ||·||^2 denotes the squared L2 norm, i.e. the squared Euclidean distance, and Σ means summing over all samples.

In the test phase, given a new sample x, we first compute h(f(x)), its representation in the semantic embedding space, and then find the closest semantic embedding g(y); the corresponding category y is our prediction.

Please note that this is a very simplified version of ZSL; an actual ZSL model may handle more complex situations, such as relationships between categories, or use more complex mapping functions.
In addition, the formula above has no regularization term; in practice, regularization terms are usually added to prevent overfitting.
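
As one possible illustration of such a regularization term, the loss can be extended with an L2 (ridge) penalty. The block below assumes that h is linear, h(v) = Wv, and introduces a hypothetical weight λ > 0 along with the matrices F and G; none of these symbols appear in the original formula.

```latex
% Assumption: h is linear, h(v) = W v; \lambda > 0 is a hypothetical regularization weight.
L_{\mathrm{reg}} = \sum_{i=1}^{n} \lVert W f(x_i) - g(y_i) \rVert^2 + \lambda \lVert W \rVert_F^2

% Stacking the f(x_i) as columns of F and the g(y_i) as columns of G,
% the minimizer has the closed form:
W^{*} = G F^{\top} \left( F F^{\top} + \lambda I \right)^{-1}
```

A larger λ shrinks W toward zero, trading a worse fit on the known classes for a smoother mapping.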



Research Paper References

Zero-Shot Learning (ZSL) is a very active research area with many important and influential papers. Here are some classic and important papers:

  1. Attribute-based Classification for Zero-Shot Visual Object Categorization

    • Paper link: [https://ieeexplore.ieee.org/abstract/document/6247951]
    • Overview: This paper is an early work in the field of ZSL, and the authors propose an attribute-based classification method that can handle categories that do not appear in the training set.
  2. Zero-Shot Learning through Cross-Modal Transfer

    • Paper link: [https://papers.nips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html]
    • Overview: This paper presents a method for ZSL via cross-modal transfer. An important contribution of this paper is to address the issue of using different features in the training and testing phases.
  3. Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly

    • Paper link: [https://arxiv.org/abs/1707.00600]
    • Overview: This paper presents a comprehensive evaluation of ZSL. The authors point out some issues with existing methods and propose a new evaluation protocol.

For multimodal ZSL, there are also some important works, such as:

  1. Zero-Shot Learning of Class Semantics via Temporal Attention

    • Paper link: [https://arxiv.org/abs/1809.00116]
    • Overview: This paper studies how to exploit dynamic information in videos for ZSL. The authors propose a model based on a temporal attention mechanism, which can learn the semantic information of categories.
  2. Learning Semantic Models for Cross-Modal Zero-Shot Sketch Data Retrieval

    • Paper link: [https://www.sciencedirect.com/science/article/abs/pii/S0031320318303701]
    • Overview: This paper investigates how to perform zero-shot learning across modalities, especially in the task of sketch data retrieval.

The above papers are just the tip of the iceberg in the field of zero-shot learning, and which paper to choose depends on your specific needs and interests. I suggest that while reading these papers, you also look up their citations for more related work.

Origin blog.csdn.net/weixin_43338969/article/details/130852524