[Paper Reading] arXiv papers in the Image Captioning direction - continually updated

Web link: https://arxiv.org/search/?searchtype=all&query=image+captioning&abstracts=show&size=50&order=announced_date_first

  • This blog collects papers related to the Image Captioning direction published on arXiv.
  • For each paper, the blog summarizes the main focus of its abstract; papers worth focusing on are marked with "★" or "☆" (degree of importance: ★ > ☆).

Papers list

  • arXiv: 1409.2329  <Recurrent Neural Network Regularization> ★ : proposes a regularization (dropout) method for RNNs/LSTMs.

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper, we show how to correctly apply dropout to LSTMs, and show that it substantially reduces overfitting on a variety of tasks. These tasks include language modeling, speech recognition, image caption generation, and machine translation.
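
The recipe behind "how to correctly apply dropout to LSTMs" is to apply dropout only to the non-recurrent connections (the layer inputs and the layer-to-layer outputs), never to the hidden-to-hidden recurrence. A minimal sketch of that idea, assuming PyTorch; layer sizes and the dropout rate are illustrative, not the paper's exact configuration:

```python
# Sketch (assumed PyTorch): dropout only on non-recurrent connections of a stacked LSTM.
import torch
import torch.nn as nn

class RegularizedLSTM(nn.Module):
    def __init__(self, vocab_size, hidden_size=650, num_layers=2, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.drop = nn.Dropout(p_drop)  # used only on non-recurrent paths
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden_size, hidden_size, batch_first=True) for _ in range(num_layers)]
        )
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        x = self.drop(self.embed(tokens))   # dropout on the input connection
        for lstm in self.layers:
            x, _ = lstm(x)                  # hidden-to-hidden recurrence left intact
            x = self.drop(x)                # dropout between stacked layers / before the softmax
        return self.proj(x)                 # per-step logits over the vocabulary
```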

  • arXiv: 1411.4555  <Show and Tell: A Neural Image Caption Generator> ★ : solves the Image Captioning task with a deep neural network; introduces the encoder-decoder (Encoder-Decoder) framework into Image Captioning.

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
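
The encoder-decoder idea can be summarized in a short sketch: a pretrained CNN feature is projected into the word-embedding space, fed to an LSTM decoder as its first input, and training maximizes the likelihood of the reference caption. This is a hedged illustration assuming PyTorch, not the authors' implementation; class names and dimensions are invented for the example:

```python
# Sketch (assumed PyTorch): CNN feature -> LSTM decoder trained with maximum likelihood.
import torch
import torch.nn as nn

class ShowAndTell(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # CNN feature -> word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, img_feats, captions):
        # Image is fed once as the first "word"; ground-truth tokens follow (teacher forcing).
        img_token = self.img_proj(img_feats).unsqueeze(1)          # (B, 1, E)
        words = self.embed(captions[:, :-1])                       # all tokens except the last
        hidden, _ = self.lstm(torch.cat([img_token, words], dim=1))
        return self.out(hidden)                                    # step t predicts captions[:, t]

# Training maximizes log p(caption | image) with cross-entropy, e.g.:
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), captions.reshape(-1))
```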

  • arXiv: 1411.4952  <From Captions to Visual Concepts and Back> ★ : uses multiple instance learning (Multiple Instance Learning) to train word detectors (Word Detectors) from images; models language with a traditional (maximum-entropy) language model; re-ranks the generated descriptions using sentence-level features and a deep multimodal similarity model.

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.
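
The multiple-instance-learning word detectors can be illustrated with a noisy-OR formulation: each image is a bag of region features, and a word is considered present in the image if at least one region detects it. The sketch below assumes PyTorch with invented names and dimensions; it is a simplified stand-in for the paper's pipeline:

```python
# Sketch (assumed PyTorch): noisy-OR multiple instance learning for word detection.
import torch
import torch.nn as nn

class NoisyORWordDetector(nn.Module):
    def __init__(self, region_dim=4096, vocab_size=1000):
        super().__init__()
        self.scorer = nn.Linear(region_dim, vocab_size)    # region feature -> per-word logits

    def forward(self, regions):                            # regions: (B, R, region_dim)
        p_region = torch.sigmoid(self.scorer(regions))     # p(word | region), shape (B, R, V)
        # noisy-OR: a word fires for the image if it fires in at least one region
        p_image = 1.0 - torch.prod(1.0 - p_region, dim=1)  # shape (B, V)
        return p_image

# Trained with a binary target per word (1 if the word appears in any reference caption):
# loss = nn.BCELoss()(p_image, word_labels.float())
```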

  • arXiv: 1411.5654  <Learning a Recurrent Visual Representation for Image Caption Generation> : explores the bi-directional mapping between images and their descriptions.

In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a common embedding, we enable the generation of novel sentences given an image. Using the same model, we can also reconstruct the visual features associated with an image given its visual description. We use a novel recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction. We evaluate our approach on several tasks. These include sentence generation, sentence retrieval and image retrieval. State-of-the-art results are shown for the task of generating novel image descriptions. When compared to human generated captions, our automatically generated captions are preferred by humans over 19.8% of the time. Results are better than or comparable to state-of-the-art results on the image and sentence retrieval tasks for methods using similar visual features.
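
A loose way to picture the bi-directional mapping: one recurrent state is read by two heads, one generating the next word and one reconstructing the image's visual feature vector, so a single model supports both caption generation and visual feature reconstruction. The sketch below assumes PyTorch and is only an approximation of the idea, not the paper's recurrent visual memory architecture:

```python
# Sketch (assumed PyTorch): shared recurrent state with a word head and a visual head.
import torch
import torch.nn as nn

class BiDirectionalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, hidden=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for the recurrent visual memory
        self.word_head = nn.Linear(hidden, vocab_size)        # sentence-generation direction
        self.visual_head = nn.Linear(hidden, feat_dim)        # visual-reconstruction direction

    def forward(self, tokens):
        states, _ = self.rnn(self.embed(tokens))
        next_word_logits = self.word_head(states)             # predicts the following word at each step
        reconstructed_feat = self.visual_head(states[:, -1])   # reconstructs the image feature from the text
        return next_word_logits, reconstructed_feat
```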

  • arXiv: 1412.6632  <Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)> : proposes a multimodal recurrent neural network (Multimodal Recurrent Neural Network, m-RNN) in which a multimodal layer connects the language-model part and the vision part.

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html .
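
The core of the m-RNN is the multimodal layer, where the word embedding, the recurrent state, and the CNN image feature are each projected and merged before the softmax over the vocabulary. A minimal sketch of that layer, assuming PyTorch; the dimensions and the tanh nonlinearity are illustrative choices, not the paper's exact settings:

```python
# Sketch (assumed PyTorch): the multimodal layer that fuses word, recurrent, and image inputs.
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    def __init__(self, embed_dim=256, rnn_dim=256, img_dim=4096, mm_dim=512, vocab_size=10000):
        super().__init__()
        self.from_word = nn.Linear(embed_dim, mm_dim)
        self.from_rnn = nn.Linear(rnn_dim, mm_dim)
        self.from_img = nn.Linear(img_dim, mm_dim)
        self.out = nn.Linear(mm_dim, vocab_size)

    def forward(self, word_embed, rnn_state, img_feat):
        # the three modalities meet in one shared layer: g(V_w w + V_r r + V_I I)
        mm = torch.tanh(self.from_word(word_embed)
                        + self.from_rnn(rnn_state)
                        + self.from_img(img_feat))
        return self.out(mm)  # per-step logits over the vocabulary
```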

  • arXiv: 1412.8419  <Simple Image Description Generator via a Linear Phrase-Based Approach> : learns a common space between image features and phrase representations.

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representation (generated from a previously trained Convolutional Neural Network) and phrases that are used to describe them. The system is then able to infer phrases from a given image sample. Based on caption syntax statistics, we propose a simple language model that can produce relevant descriptions for a given test image using the phrases inferred. Our approach, which is considerably simpler than state-of-the-art models, achieves comparable results on the recently released Microsoft COCO dataset.
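
The bilinear metric can be sketched as a single learned matrix that maps CNN image features into the phrase-embedding space, where candidate phrases are scored by dot product. The example below assumes PyTorch; names and dimensions are invented for illustration:

```python
# Sketch (assumed PyTorch): bilinear scoring between an image feature and phrase embeddings.
import torch
import torch.nn as nn

class BilinearPhraseScorer(nn.Module):
    def __init__(self, img_dim=4096, phrase_dim=400):
        super().__init__()
        self.U = nn.Linear(img_dim, phrase_dim, bias=False)  # the bilinear map

    def forward(self, img_feat, phrase_embeds):
        # img_feat: (B, img_dim); phrase_embeds: (P, phrase_dim)
        projected = self.U(img_feat)                          # (B, phrase_dim)
        return projected @ phrase_embeds.t()                  # (B, P) scores; top phrases feed the language model
```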

  • arXiv: 1502.03044  <Show, Attend and Tell: Neural Image Caption Generation with Visual Attention> : introduces the attention mechanism (Attention Mechanism) to the Image Captioning area, and proposes Soft Attention and Hard Attention.

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
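
Soft attention, the deterministic variant, computes a weight for every spatial image feature from the current decoder state and takes the weighted sum as the context vector; hard attention instead samples a single location and is trained by maximizing a variational lower bound, which is omitted here. A minimal sketch assuming PyTorch, with illustrative layer names and sizes:

```python
# Sketch (assumed PyTorch): additive soft attention over spatial image features.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) annotation vectors; hidden: (B, hidden_dim) decoder state
        energy = self.score(torch.tanh(self.feat_proj(feats)
                                       + self.hidden_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(energy, dim=1)                                       # attention weights
        context = (alpha * feats).sum(dim=1)                                       # (B, feat_dim)
        return context, alpha.squeeze(-1)
```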

 

Origin www.cnblogs.com/zlian2016/p/11038179.html