46. Literature reading notes |
||
Introduction |
topic |
Learning a Recurrent Visual Representation for Image Caption Generation |
author |
Xinlei Chen, C. Lawrence Zitnick, arXiv:1411.5654. |
|
Original link |
||
Key words |
2014; RNN; image features and text features describing each other (bidirectional image-text mapping) |
|
research problem |
Bidirectional mapping between images and their sentence-based descriptions, covering sentence generation, sentence retrieval, and image retrieval. The first goal is to generate sentences from a set of visual observations or features: compute the probability that word w_t is generated at time t, given the previously generated words W_{t-1} = w_1, ..., w_{t-1} and the observed visual features V. The second goal is to compute the likelihood of the visual features V given a set of spoken or read words W_t, which makes it possible to generate a visual representation of a scene or to perform image search. |
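The two quantities described above can be written compactly as follows. This is only a sketch of the notation: the chain-rule factorization over a sentence of T words is the standard decomposition and is an assumption here, not a formula copied from the paper.

```latex
% Sentence generation: probability of the next word given the earlier words and the visual features
P(w_t \mid V, W_{t-1}), \qquad W_{t-1} = \{w_1, \dots, w_{t-1}\}

% Chain rule over a whole sentence of T words (assumed standard factorization)
P(w_1, \dots, w_T \mid V) = \prod_{t=1}^{T} P(w_t \mid V, W_{t-1})

% Reverse direction: likelihood of the visual features given the words seen so far
P(V \mid W_t)
```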
|
Research methods |
The paper proposes using a recurrent neural network (RNN) to learn this mapping. Unlike previous methods that map sentences and images to a common embedding, the model can generate new sentences given an image. The same model can also reconstruct the visual features associated with an image from its description. A novel recurrent visual memory automatically learns to remember long-term visual concepts, which aids both sentence generation and visual feature reconstruction. RNN: generates image features from sentences and sentences from image features. |
|
Analysis conclusion |
The model learns long-term interactions through a recurrent visual memory, and this memory is what lets it learn to reconstruct visual features. |
|
Insufficient innovation |
None |
|
additional knowledge |
None |
47. Literature reading notes |
||
Introduction |
topic |
From Captions to Visual Concepts and Back |
author |
Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig, CVPR, 2015. |
|
Original link |
http://arxiv.org/pdf/1411.4952 |
|
Key words |
Automatically generate image descriptions |
|
research problem |
Learn image descriptions to generate new image descriptions |
|
Research methods |
Visual detectors, language models, and multimodal similarity models are learned directly from a dataset of image captions. The system is trained on images and their corresponding captions and learns to extract nouns, verbs, and adjectives from regions in the image. The detected words then guide a language model to generate text that reads well and contains the detected words. Finally, the candidate captions are reranked using the global deep multimodal similarity model (DMSM) introduced in the paper. CNN: AlexNet or VGG. DMSM: learns two neural networks that map images and text fragments to a common vector representation; the similarity between an image and a text is measured by the cosine similarity between their corresponding vectors. |
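The cosine-similarity comparison described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 3-dimensional vectors are made up, the function names are hypothetical, and ranking purely by cosine similarity is a simplifying assumption (the notes only say the DMSM is used for reranking).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, to avoid division by zero
    return dot / (norm_u * norm_v)

def rerank_captions(image_vec, caption_vecs):
    """Sort candidate captions by similarity to the image vector, best first."""
    scored = [(cosine_similarity(image_vec, vec), caption)
              for caption, vec in caption_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored]

# Hypothetical embeddings; in the paper they would come from the two DMSM networks.
image_vec = [0.9, 0.1, 0.3]
caption_vecs = {
    "a plate of food": [0.1, 0.9, 0.2],
    "a dog runs on grass": [0.8, 0.2, 0.4],
}
ranked = rerank_captions(image_vec, caption_vecs)
```

With these toy vectors the caption whose embedding points in nearly the same direction as the image embedding is ranked first.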
|
Analysis conclusion |
Faster than human writing |
|
Insufficient innovation |
Hard to comment |
|
additional knowledge |
Image captions: image descriptions |
48. Literature reading notes |
||
Introduction |
topic |
Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention |
author |
Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, arXiv:1502.03044/ICML 2015 |
|
Original link |
||
Key words |
Image description |
|
research problem |
The paper describes how to train the model deterministically using standard backpropagation and stochastically by maximizing a variational lower bound. Visualizations show how the model automatically learns to fix its gaze on salient objects while generating the corresponding words in the output sequence. |
|
Research methods |
CNN + LSTM + attention mechanism. An attention-based model is introduced that automatically learns to describe the content of images. Attention: both "soft" and "hard" attention are used. |
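The difference between "soft" and "hard" attention can be sketched in a few lines. This is a toy illustration under stated assumptions: the annotation vectors and relevance scores are made up, and in the actual model the scores would be produced by a learned network conditioned on the decoder state. All function names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(annotations, weights):
    """Soft attention: the context is the weighted sum of all annotation vectors."""
    dim = len(annotations[0])
    return [sum(w * a[d] for w, a in zip(weights, annotations)) for d in range(dim)]

def hard_attention(annotations, weights, rng):
    """Hard attention: sample one location according to the weights."""
    index = rng.choices(range(len(annotations)), weights=weights, k=1)[0]
    return annotations[index]

# Hypothetical example: three image locations with 2-d features.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]
weights = softmax(scores)
soft_context = soft_attention(annotations, weights)
hard_context = hard_attention(annotations, weights, random.Random(0))
```

The soft variant is differentiable and can be trained with backpropagation, while the hard variant involves sampling, which is why a stochastic training objective is needed for it.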
|
Analysis conclusion |
The proposed attention framework does not explicitly use object detectors but instead learns latent alignments from scratch. This lets the model go beyond "objectness" and learn to attend to abstract concepts. The learned attention is exploited to make the generation process more interpretable, and the learned alignments are shown to agree closely with human intuition. |
|
Insufficient innovation |
||
additional knowledge |
Caption: descriptive text. Attention: rather than compressing the entire image into a static representation, attention lets salient features come to the forefront dynamically as needed. This is especially important when the image contains a lot of clutter. Using representations such as those from the top layer of a convolutional network, which distill the information in an image down to the most salient objects, is an effective solution. A potential drawback is that information useful for richer, more descriptive captions may be lost. Using a lower-level representation helps preserve this information, but working with these features requires a powerful mechanism to steer the model toward the information that matters for the current task. Attention mechanism: the output of each neuron depends not only on the outputs of all neurons in the previous layer but can also be weighted by different parts of the input data, that is, different parts receive different weights. This lets the model pay more attention to the key information in the input sequence, improving its accuracy and efficiency. Source: "[Deep Learning] Attention Mechanism_Efficient Attention Mechanism", CSDN Blog. |
49. Literature reading notes (based on phrases rather than words) |
||
Introduction |
topic |
Phrase-based Image Captioning |
author |
Remi Lebret, Peter O. Pinheiro, Ronan Collobert, arXiv:1502.03671/ICML 2015 |
|
Original link |
http://arxiv.org/pdf/1502.03671 |
|
Key words |
Generating novel textual descriptions of images |
|
research problem |
Generate descriptive sentences given a sample image, with a strong focus on the grammar of description |
|
Research methods |
A simple model is proposed that can infer different phrases from an image sample. From the predicted phrases, the model automatically generates sentences using a statistical language model. Image features: obtained from a CNN. Phrase initialization: word vector representations are used; because these representations can be composed by simple summation, the representation of a phrase is computed by element-wise addition of its word vectors. Forming sentences from phrases: after identifying the L most likely constituent phrases for the image, sentences are generated from these constituents, and the likelihood of a sentence is given by a statistical language model. Decoding sentences: pruning, each phrase may appear only once, and syntactic constraints. The generated sentences are then ranked to select the one that best matches the image. |
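The phrase representation by element-wise addition can be sketched as follows. The 2-dimensional word vectors and the function name are made up for illustration; real word vectors would be pretrained and much higher-dimensional.

```python
def phrase_vector(phrase, word_vectors):
    """Represent a phrase as the element-wise sum of its word vectors."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in phrase.split():
        for i, value in enumerate(word_vectors[word]):
            total[i] += value
    return total

# Hypothetical toy word vectors.
word_vectors = {
    "a": [0.1, 0.0],
    "black": [0.2, 0.5],
    "dog": [0.7, 0.3],
}
vec = phrase_vector("a black dog", word_vectors)  # approximately [1.0, 0.8]
```

Because the composition is a plain sum, a phrase vector has the same dimensionality as a word vector regardless of how many words the phrase contains.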
|
Analysis conclusion |
The sentence generation problem can be implemented efficiently without using complex recurrent networks. Our algorithm, although simpler than state-of-the-art models, achieves similar results on this task. Furthermore, our model generates new sentences that typically do not exist in the training set. |
|
Insufficient innovation |
Future research directions will be towards leveraging unsupervised data and more complex language models |
|
additional knowledge |
None |
50. Literature reading notes (generalization) |
||
Introduction |
topic |
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images |
author |
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan L. Yuille, arXiv:1504.06692 |
|
Original link |
http://arxiv.org/pdf/1504.06692 |
|
Key words |
Learning new object categories from a small number of examples (sometimes there is not enough data to identify a new concept, so knowledge must be transferred from previously learned categories). Retraining the entire model every time a few images with new concepts are added is undesirable, especially when the amount of data or the number of model parameters is very large. |
|
research problem |
The task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Recognizing, learning, and using new concepts is one of the most important cognitive functions of humans. From a young age, we learn new concepts by observing the visual world and hearing sentence descriptions from our parents. This process is slow at first but becomes faster once we have accumulated enough learned concepts. Figure 1: Illustration of the Novel Visual Concept learning from Sentences (NVCS) task. We start from a model (i.e. the base model) trained on images that do not contain the concept "Quidditch". Using a few "Quidditch" images with sentence descriptions, the method can learn that "Quidditch" is played by humans with balls. |
|
Research methods |
A method is proposed that lets a model extend its word dictionary to describe new concepts from a small number of examples, without extensive retraining. In particular, the model does not need to be retrained from scratch on all data (all previously learned concepts plus the new ones). Base model: m-RNN, with two changes. First, a transposed weight sharing strategy is proposed, which greatly reduces the number of parameters in the model. Second, the recurrent layer in m-RNN is replaced with a long short-term memory (LSTM) layer. LSTM is a recurrent neural network designed to address the exploding and vanishing gradient problem. The model consists of three components: language, vision, and multimodal. The language component consists of two word embedding layers and an LSTM layer. It maps word indices in the dictionary into a semantically dense word embedding space and stores word context information in the LSTM layer. The vision component consists of a 16-layer deep convolutional neural network (CNN) pre-trained on the ImageNet classification task. The final SoftMax layer of the CNN is removed, and the top fully connected layer (a 4096-dimensional layer) is connected to the model. The activations of this 4096-dimensional layer can be viewed as image features that contain rich visual attributes of objects and scenes. The multimodal component consists of a single layer in which information from the language and vision components is fused. A SoftMax layer is built after the multimodal layer to predict the index of the next word. The sub-models for the words in a sentence share weights. As in the m-RNN model, a start symbol w_start and an end symbol w_end are added to each training sentence. At test time for image description, the start symbol w_start is fed into the model and the K best words with the highest probability according to the SoftMax layer are selected. This process is repeated until the model generates the end symbol w_end. |
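The test-time decoding loop described above (start from w_start, keep the K best continuations, stop at w_end) can be sketched as a small beam search. This is a minimal sketch under stated assumptions: `TOY_MODEL` is a hypothetical stand-in for the SoftMax output that looks only at the previous word, whereas the real model also conditions on the image features and the LSTM state.

```python
import math

START, END = "<start>", "<end>"

# Hypothetical next-word probabilities keyed by the previous word only.
TOY_MODEL = {
    START: {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {END: 1.0},
    "cat": {END: 1.0},
}

def beam_search(next_word_probs, beam_size=2, max_len=10):
    """Keep the beam_size most probable partial sentences at every step."""
    beams = [(0.0, [START])]  # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for log_p, words in beams:
            for word, p in next_word_probs(words).items():
                candidates.append((log_p + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for log_p, words in candidates[:beam_size]:
            # A sentence is complete once the end symbol has been generated.
            (finished if words[-1] == END else beams).append((log_p, words))
        if not beams:
            break
    if not finished:
        finished = beams  # max_len reached without generating the end symbol
    best_words = max(finished, key=lambda c: c[0])[1]
    return [w for w in best_words if w not in (START, END)]

caption = beam_search(lambda words: TOY_MODEL[words[-1]])
print(" ".join(caption))  # prints "a dog"
```

Setting `beam_size=1` reduces this to greedy decoding, which picks only the single most probable word at each step.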
|
Analysis conclusion |
The Novel Visual Concept learning from Sentences (NVCS) task is proposed, in which methods must learn novel concepts from sentence descriptions of a small number of images. The paper describes an approach for training the model on a small number of images containing new concepts. Its performance is comparable to that of a model retrained from scratch on all data when the number of novel-concept images is large, and it performs better when only a few training images of the novel concepts are available. |
|
Insufficient innovation |
||
additional knowledge |
Zero-shot and one-shot learning. Zero-shot learning: "[Selected] Zero Shot: Learn about zero-shot learning in one article" (CSDN Blog). One-shot learning: "One-Shot learning/One-shot learning" (CSDN Blog). See also: "Introduction to the concepts of Zero-Shot, One-Shot, and Few-Shot Learning" (CSDN Blog). |