(Paper Reading 46-50) Image Description 2

46. Literature reading notes

Introduction

Topic

Learning a Recurrent Visual Representation for Image Caption Generation

Author

Xinlei Chen, C. Lawrence Zitnick, arXiv:1411.5654.

Original link

http://www.cs.cmu.edu/~xinleic/papers/cvpr15_rnn.pdf

Key words

2014; RNN; image features and text features can be generated from each other

Research problem

Bidirectional mapping between images and sentence-based descriptions.

Sentence generation, sentence retrieval and image retrieval.

Target:

First, the model should be able to generate sentences from a set of visual observations or features, i.e., compute the probability that word wt is generated at time t given the previously generated words Wt-1 = w1, ..., wt-1 and the observed visual features V.

Second, it is desirable to be able to compute the likelihood of a visual feature V given a set of spoken or read words Wt, thereby generating a visual representation of a scene or performing an image search.
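Written out, the two goals above are (a plain restatement in the note's own notation, not quoted from the paper):

```latex
% Goal 1: probability of the next word given the image features V
% and the previously generated words
P(w_t \mid V, w_1, \ldots, w_{t-1})

% Goal 2: likelihood of the visual features given the words so far,
% used to reconstruct a visual representation or to retrieve images
P(V \mid w_1, \ldots, w_t)
```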

Research methods

The paper proposes using a recurrent neural network to learn this mapping. Unlike previous methods that map sentences and images into a common embedding, the model can generate novel sentences given an image. With the same model, it can also reconstruct the visual features associated with an image given its sentence description.

Using a novel recurrent visual memory, the model automatically learns to remember long-term visual concepts that aid both sentence generation and visual feature reconstruction.

RNN: generates image features from sentences and sentences from image features (a rough sketch follows).
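A minimal NumPy sketch of one way to read this idea: a recurrent hidden state drives next-word prediction while a second recurrent state (the "visual memory") reconstructs the image features V. Layer sizes, update equations, and names are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative dimensions: word embedding, hidden state, visual features, vocabulary.
D_w, D_h, D_v, V_size = 64, 128, 256, 1000

rng = np.random.default_rng(0)
# Randomly initialized parameters; in a real model these are learned.
W_hx = rng.normal(scale=0.01, size=(D_h, D_w))      # word -> hidden
W_hh = rng.normal(scale=0.01, size=(D_h, D_h))      # hidden recurrence
W_hv = rng.normal(scale=0.01, size=(D_h, D_v))      # visual input -> hidden
W_uh = rng.normal(scale=0.01, size=(D_v, D_h))      # hidden -> visual memory
W_uu = rng.normal(scale=0.01, size=(D_v, D_v))      # visual-memory recurrence
W_out = rng.normal(scale=0.01, size=(V_size, D_h))  # hidden -> vocabulary scores

def step(x_t, h_prev, u_prev, v=None):
    """One recurrent step.
    x_t: embedding of the current word; h_prev: hidden state;
    u_prev: recurrent visual memory; v: observed image features
    (None when reconstructing visual features from text alone)."""
    vis = W_hv @ (v if v is not None else u_prev)
    h = np.tanh(W_hx @ x_t + W_hh @ h_prev + vis)
    u = np.tanh(W_uh @ h + W_uu @ u_prev)   # reconstructed visual features
    p_next = softmax(W_out @ h)             # P(w_t | previous words, V)
    return h, u, p_next

# Example: one step with a random word embedding and observed image features.
h, u, p = step(rng.normal(size=D_w), np.zeros(D_h), np.zeros(D_v), v=rng.normal(size=D_v))
```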

Analysis conclusion

A recurrent visual memory that captures long-term interactions is learned and used to reconstruct the visual features.

Shortcomings / innovations

None

Additional knowledge

None

47. Literature reading notes

Introduction

Topic

From Captions to Visual Concepts and Back

Author

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig, CVPR, 2015.

Original link

http://arxiv.org/pdf/1411.4952

Key words

Automatically generate image descriptions

Research problem

Learn from existing image captions in order to generate descriptions for new images.

Research methods

Learn visual detectors, language models, and multimodal similarity models directly from image caption datasets.

The system is trained on images and their corresponding captions and learns to extract nouns, verbs, and adjectives from regions of the images. These detected words then guide a language model to generate text that reads well and contains the detected words. Finally, the candidate captions are reranked using the global deep multimodal similarity model (DMSM) introduced in the paper.

CNN: AlexNet or VGG.

The DMSM learns two neural networks that map images and text fragments into a common vector representation. The similarity between an image and a text is then measured by the cosine similarity of their corresponding vectors.
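A small sketch of the similarity computation described here; the vectors are made-up stand-ins for the outputs of the two DMSM networks:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between an image vector and a text vector."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up stand-ins for the outputs of the two DMSM networks, which map a CNN
# image feature and a text fragment into the same vector space.
image_vec = np.array([0.2, 0.7, 0.1, 0.5])
text_vec = np.array([0.3, 0.6, 0.0, 0.4])
print(cosine_similarity(image_vec, text_vec))  # higher = better image/text match
```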

Analysis conclusion

Faster than human writing

Shortcomings / innovations

Hard to comment

Additional knowledge

Image caption: the descriptive text of an image.

48. Literature reading notes

Introduction

Topic

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Author

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, arXiv:1502.03044/ICML 2015

Original link

http://www.cs.toronto.edu/~zemel/documents/captionAttn.pdf

Key words

Image description

Research problem

The paper describes how to train the model deterministically using standard backpropagation and stochastically by maximizing a variational lower bound. It also shows through visualizations how the model automatically learns to fix its gaze on salient objects while generating the corresponding words of the output sequence.

Research methods

CNN + LSTM + attention mechanism

An attention-based model is introduced that automatically learns to describe the content of images.

Attention:

Both "soft" and "hard" attention are considered: deterministic soft attention, trained with standard backpropagation, and stochastic hard attention, trained by maximizing a variational lower bound. A minimal sketch of the soft variant follows.
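A minimal NumPy sketch of the deterministic soft variant: a relevance score is computed for every spatial location of the CNN feature map, and the context vector is the weighted average of the features under those scores (the hard variant samples a single location instead). Shapes and the scoring function are simplified assumptions rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(features, h_prev, W_f, W_h, w_e):
    """Deterministic 'soft' attention over CNN feature locations.
    features: (L, D) annotation vectors from a convolutional layer;
    h_prev: previous LSTM hidden state of size H."""
    scores = np.tanh(features @ W_f + h_prev @ W_h) @ w_e  # (L,) relevance scores
    alpha = softmax(scores)                                # attention weights over locations
    context = alpha @ features                             # (D,) expected context vector
    return context, alpha

# Toy shapes: a 14x14 grid of 512-d features, 256-d hidden state.
L_loc, D, H = 196, 512, 256
rng = np.random.default_rng(0)
features = rng.normal(size=(L_loc, D))
h_prev = rng.normal(size=H)
W_f, W_h, w_e = rng.normal(size=(D, H)), rng.normal(size=(H, H)), rng.normal(size=H)
context, alpha = soft_attention(features, h_prev, W_f, W_h, w_e)
```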

Analysis conclusion

The proposed attention framework does not explicitly use object detectors but instead learns latent alignments from scratch. This allows the model to go beyond "objectness" and learn to attend to abstract concepts.

The learned attention makes the generation process more interpretable, and the learned alignments are shown to correspond well with human intuition.

Shortcomings / innovations

Additional knowledge

Caption: description text

Attention: attention does not compress the entire image into a single static representation; instead it allows salient features to come to the fore dynamically as they are needed. This matters especially when there is a lot of clutter in the image. Using representations from the top layer of a convolutional network to distill the image into its most salient objects is one effective solution, but it has a potential drawback: information that could be useful for richer, more descriptive captions is lost. Using lower-level representations helps preserve this information, but it requires a powerful mechanism to steer the model toward the information that is important for the task at hand.

Attention mechanism:

With an attention mechanism, the output of each neuron depends not only on the outputs of all neurons in the previous layer; different parts of the input are also weighted differently, i.e., different weights are assigned to different parts. This lets the model focus on the key information in the input sequence, improving both its accuracy and its efficiency (see the sketch below).
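Reduced to code, the idea is: score each part of the input against a query, normalize the scores, and take a weighted sum. A tiny dot-product version as an illustration (not tied to any specific post linked below):

```python
import numpy as np

def attention_pool(query, keys, values):
    """query: (d,); keys: (n, d); values: (n, d_v).
    Scores each part of the input against the query, normalizes the scores,
    and returns the weighted sum of the values plus the weights."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights
```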

[Deep Learning] Attention Mechanism - CSDN Blog

[Deep Learning] (1) Attention mechanisms in CNNs (SE, ECA, CBAM), with complete PyTorch code - CSDN Blog

49. Literature reading notes (based on phrases rather than words)

Introduction

Topic

Phrase-based Image Captioning

Author

Remi Lebret, Pedro O. Pinheiro, Ronan Collobert, arXiv:1502.03671 / ICML 2015

Original link

http://arxiv.org/pdf/1502.03671

Key words

Generating novel textual descriptions of images

Research problem

Generate descriptive sentences for a given image, with a strong focus on the syntax of the descriptions.

Research methods

A simple model is proposed that is able to infer different phrases from image samples. From the predicted phrases, the model is able to automatically generate sentences using statistical language models.

CNN obtains image features.

Phrase initialization with word vector representations: by leveraging the fact that these word vectors compose under simple summation, the representation of a phrase can be computed by element-wise addition of its word vectors (see the toy example below).
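A toy illustration of composing a phrase vector by element-wise addition; the word vectors are invented here, whereas in the paper they come from embeddings pre-trained on a large corpus:

```python
import numpy as np

# Hypothetical pre-trained word vectors; in the paper these come from
# embeddings trained on a large text corpus.
word_vectors = {
    "a":     np.array([0.1, 0.0, 0.2]),
    "black": np.array([0.4, 0.3, 0.1]),
    "dog":   np.array([0.2, 0.5, 0.7]),
}

def phrase_vector(phrase):
    """Represent a phrase as the element-wise sum of its word vectors."""
    return np.sum([word_vectors[w] for w in phrase.split()], axis=0)

print(phrase_vector("a black dog"))
```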

From phrases to sentences: after identifying the L most likely constituent phrases for an image, sentences are generated from these constituents, and the likelihood of each sentence is computed with a statistical language model (a toy scoring sketch follows).
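A toy sketch of scoring a candidate phrase sequence with a smoothed bigram model over phrases; the paper's statistical language model is more elaborate, and the counts below are invented purely for illustration:

```python
import math
from collections import defaultdict

# Invented phrase-level counts standing in for statistics estimated from a
# corpus of training captions.
bigram_counts = defaultdict(int, {
    ("a black dog", "is running"): 8,
    ("is running", "on the grass"): 5,
})
unigram_counts = defaultdict(int, {"a black dog": 10, "is running": 9, "on the grass": 6})

def log_likelihood(phrases, alpha=1.0, vocab_size=1000):
    """Add-alpha smoothed log-likelihood of a sequence of phrases."""
    ll = 0.0
    for prev, cur in zip(phrases, phrases[1:]):
        p = (bigram_counts[(prev, cur)] + alpha) / (unigram_counts[prev] + alpha * vocab_size)
        ll += math.log(p)
    return ll

print(log_likelihood(["a black dog", "is running", "on the grass"]))
```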

Decoding sentences: pruning is applied, each phrase may appear only once, and syntactic constraints are enforced.

The generated sentences are sorted to select the one that best matches the image.

Analysis conclusion

The sentence generation problem can be implemented efficiently without using complex recurrent networks. Our algorithm, although simpler than state-of-the-art models, achieves similar results on this task. Furthermore, our model generates new sentences that typically do not exist in the training set.

Shortcomings / innovations

Future research directions will be towards leveraging unsupervised data and more complex language models

Additional knowledge

None

50. Literature reading notes (generalization)

Introduction

Topic

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

Author

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan L. Yuille, arXiv:1504.06692

Original link

http://arxiv.org/pdf/1504.06692

Key words

The problem of learning new object categories from a small number of examples (there is sometimes not enough data to recognize new concepts, so knowledge needs to be transferred from previously learned categories). It is not desirable to retrain the entire model every time a few images with new concepts are added, especially when the amount of data or the number of model parameters is very large.

Research problem

The task of learning novel visual concepts from a few images with sentence descriptions and their interactions with other concepts.

Recognizing, learning, and using new concepts is one of the most important cognitive functions of humans. From a young age, we learn new concepts by observing the visual world and hearing sentence descriptions from our parents. This process is slow at first, but becomes faster as we accumulate more learned concepts.

Figure 1: Illustration of the novel visual concept learning from sentences (NVCS) task. We start from a base model trained on images that do not contain the concept "Quidditch". Using a few "Quidditch" images with sentence descriptions, the method can learn that "Quidditch" is played by people with a ball.

Research methods

A method is proposed that allows a model to augment its word dictionary using a small number of examples to describe new concepts without requiring extensive retraining. In particular, there is no need to retrain the model from scratch on all data (all previously learned concepts and new concepts).

Basic model: m-RNN

First, a transposed weight sharing strategy is proposed, which greatly reduces the number of parameters in the model. Second, the recurrent layer of the original m-RNN is replaced with a long short-term memory (LSTM) layer; the LSTM is a recurrent network specifically designed to address the exploding and vanishing gradient problems.
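A minimal sketch of the parameter-saving idea behind transposed weight sharing: the word-embedding matrix used for encoding is reused, transposed, when decoding back to the dictionary, instead of learning a separate output matrix of the same size. The paper's exact formulation may differ (for example by inserting an intermediate layer), so treat this as an illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: a 1000-word dictionary with 256-d embeddings.
V_size, D = 1000, 256
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(V_size, D))  # shared word-embedding matrix

def embed(word_index):
    return U[word_index]              # encoding: look up a row of U

def predict_next_word(hidden):
    # Decoding reuses U transposed instead of learning a separate
    # (V_size x D) output matrix, saving those parameters.
    return softmax(U @ hidden)
```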

The model consists of three parts: language part, visual part and multimodal part.

The language component consists of two word embedding layers and an LSTM layer. It maps word indices in the dictionary into a semantically dense word embedding space and stores word context information in the LSTM layer.

The vision component is a 16-layer deep convolutional neural network (CNN) pre-trained on the ImageNet classification task. The final SoftMax layer of the CNN is removed and its top fully connected layer (4096-dimensional) is connected to the model. The activations of this 4096-dimensional layer can be viewed as image features that contain rich visual attributes of objects and scenes.

The multimodal component is a single-layer representation in which information from the language and vision parts is fused. A SoftMax layer is built on top of the multimodal layer to predict the index of the next word.
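A compressed sketch of how the three parts might meet at one time step: the word embedding, the LSTM hidden state, and the CNN image feature are projected into a shared multimodal layer, followed by a SoftMax over the dictionary. All shapes, names, and the fusion nonlinearity are illustrative assumptions.

```python
import numpy as np

def multimodal_step(word_emb, lstm_hidden, image_feat, W_w, W_r, W_v, W_out):
    """Fuse the language and vision parts in a single multimodal layer,
    then predict a probability distribution over the next word index."""
    m = np.tanh(W_w @ word_emb + W_r @ lstm_hidden + W_v @ image_feat)  # multimodal layer
    logits = W_out @ m                                                  # dictionary scores
    e = np.exp(logits - logits.max())
    return e / e.sum()                                                  # SoftMax
```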

The per-word sub-networks within a sentence share weights. As in the m-RNN model, a start symbol wstart and an end symbol wend are added to each training sentence.

In the test (caption generation) phase, the start symbol wstart is fed into the model and the K words with the highest SoftMax probability are kept. This process is repeated until the model generates the end symbol wend (a simplified decoding sketch follows).
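A sketch of this generation loop, simplified to greedy decoding (K = 1); `model.step` is a hypothetical one-step interface, not an API from the paper's code.

```python
def generate_caption(model, image_feat, start_id, end_id, max_len=20):
    """Greedy decoding: a simplified version of the K-best search described
    above (K = 1). `model.step` is a hypothetical one-step interface that
    returns next-word probabilities and the updated recurrent state."""
    words, state = [start_id], None
    for _ in range(max_len):
        probs, state = model.step(words[-1], image_feat, state)
        next_id = int(probs.argmax())
        words.append(next_id)
        if next_id == end_id:
            break
    return words
```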

Analysis conclusion

The Novel Visual Concept Learning from Sentences (NVCS) task is proposed. In this task, methods need to learn novel concepts from sentence descriptions of a small number of images. We describe an approach that allows us to train our model on a small number of images containing new concepts. This is comparable to the performance of a model retrained from scratch on all data if the number of novel concept images is large, and performs better when only a few training images of novel concepts are available.

Shortcomings / innovations

Additional knowledge

Zero-shot and one-shot learning:

Zero-shot learning: Zero Shot | Learn about zero-shot learning in one article - CSDN Blog

One-shot learning: One-Shot Learning - CSDN Blog

Introduction to the concepts of Zero-Shot, One-Shot, and Few-Shot Learning - CSDN Blog
