Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

  • The authors propose a multimodal Recurrent Neural Network (a CNN such as AlexNet/VGGNet connected to an RNN through a multimodal layer): the CNN extracts the image feature, the sentence is fed word by word through two embedding layers and a recurrent layer, and finally the word embedding, the recurrent state, and the image feature are combined in a multimodal layer, from which a softmax generates the probability distribution over the next word. The RNN's main role is to preserve the context of the preceding words (a rough sketch in code follows this list).
    • Using two embedding layers learns a dense word representation more effectively than a single layer.
    • The recurrent layer is not required to store the visual information.
    • In m-RNN, the image feature is fed into the model at every word of the sentence.
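
A minimal PyTorch sketch of this layout, assuming the CNN feature is computed separately; layer names and sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRNN(nn.Module):
    """Sketch of the m-RNN decoder; the CNN image feature is assumed precomputed."""
    def __init__(self, vocab_size, emb_dim=256, rnn_dim=256, mm_dim=512, img_dim=4096):
        super().__init__()
        self.embed1 = nn.Embedding(vocab_size, emb_dim)      # first word-embedding layer
        self.embed2 = nn.Linear(emb_dim, rnn_dim)            # second (dense) embedding layer
        self.U_r = nn.Linear(rnn_dim, rnn_dim, bias=False)   # recurrent weights
        self.V_w = nn.Linear(rnn_dim, mm_dim)                # word embedding -> multimodal
        self.V_r = nn.Linear(rnn_dim, mm_dim)                # recurrent state -> multimodal
        self.V_i = nn.Linear(img_dim, mm_dim)                # image feature -> multimodal
        self.out = nn.Linear(mm_dim, vocab_size)             # projection before the softmax

    def forward(self, words, img):
        # words: (batch, T) word ids; img: (batch, img_dim) CNN feature vector
        r = words.new_zeros(words.size(0), self.U_r.out_features, dtype=torch.float)
        logits = []
        for t in range(words.size(1)):
            w = self.embed2(self.embed1(words[:, t]))        # dense word vector w(t)
            r = F.relu(self.U_r(r) + w)                      # recurrent state r(t)
            m = 1.7159 * torch.tanh(2.0 / 3.0 *
                (self.V_w(w) + self.V_r(r) + self.V_i(img))) # multimodal layer: image enters every step
            logits.append(self.out(m))                       # scores over the next word
        return torch.stack(logits, dim=1)                    # (batch, T, vocab_size)
```

During training, these per-step scores would typically be compared against the next ground-truth word with a cross-entropy loss.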

Key Points

  • Most sentence-image multimodal models initialize their word-embedding layer with pre-computed word vectors; in contrast, this model randomly initializes its embedding layers and learns the word embeddings from the training data, which works better in the experiments.
  • Hyperparameters (such as layer sizes and the choice of nonlinear activation function) are tuned by cross-validation on the Flickr8K dataset and then fixed across all experiments.
  • Previous work treated image description as a retrieval task: sentence and image features are extracted and embedded into a common semantic space, the distance between images and sentences is computed, and the sentence in the database with the smallest distance to a given image is retrieved as its description (a toy sketch follows this list). This approach cannot automatically generate rich descriptions.
  • Benchmark datasets for image captioning: IAPR TC-12 (Grubinger et al., 2006), Flickr8K (Rashtchian et al., 2010), Flickr30K (Young et al., 2014), and MS COCO (Lin et al., 2014).
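
A toy sketch of that retrieval formulation (the embeddings and the cosine distance here are placeholders; real systems learn the joint embedding space):

```python
import numpy as np

def retrieve_caption(image_vec, sentence_vecs, sentences):
    """Return the database sentence whose embedding lies closest to the image embedding.

    Both inputs are assumed to already live in a shared semantic space; the
    distance is plain cosine distance, purely for illustration.
    """
    img = image_vec / np.linalg.norm(image_vec)
    sents = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    distances = 1.0 - sents @ img                  # cosine distance to every stored sentence
    return sentences[int(np.argmin(distances))]    # closest sentence; limited to the fixed database

# Toy usage: three candidate sentences embedded in a 4-dimensional shared space.
rng = np.random.default_rng(0)
sentences = ["a dog runs on grass", "a man rides a bike", "two cats on a sofa"]
print(retrieve_caption(rng.random(4), rng.random((3, 4)), sentences))
```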

Model

  1. Each input word is passed through two word-embedding layers, producing a dense vector representation \(w(t)\); \(w(t)\) is sent both to the recurrent layer and to the multimodal layer.
  2. The recurrent layer computes \(r(t) = f_2(U_r \cdot r(t-1) + w(t))\), where \(r(t)\) is the output of the recurrent layer at time step \(t\) and \(f_2\) is the ReLU function.
  3. On the vision side (the green box in the model figure), the input image is passed through the CNN to produce a feature vector \(I\), which is fed into the multimodal layer together with the other inputs. The multimodal layer computes \(m(t) = g_2(V_w \cdot w(t) + V_r \cdot r(t) + V_I \cdot I)\), where \(g_2(x) = 1.7159 \cdot \tanh(\frac{2}{3} x)\).
  4. \(m(t)\) is fed to the softmax layer, which produces the probability distribution over the next word; in this way the current input word generates the next word (a numerical walk-through follows this list).
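
A self-contained numerical walk-through of one time step under these equations; the dimensions and random weights are arbitrary, only meant to make the shapes concrete:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, E, M, D = 1000, 256, 512, 4096   # vocab size, embedding/recurrent dim, multimodal dim, CNN feature dim

# Random matrices standing in for the learned parameters U_r, V_w, V_r, V_I and the softmax weights.
U_r = torch.randn(E, E) * 0.01
V_w = torch.randn(M, E) * 0.01
V_r = torch.randn(M, E) * 0.01
V_I = torch.randn(M, D) * 0.01
W_out = torch.randn(V, M) * 0.01

w_t = torch.randn(E)                # w(t): dense embedding of the current word
r_prev = torch.zeros(E)             # r(t-1): previous recurrent state
I = torch.randn(D)                  # I: CNN image feature

r_t = F.relu(U_r @ r_prev + w_t)                                           # step 2: r(t) = f2(U_r·r(t-1) + w(t))
m_t = 1.7159 * torch.tanh(2.0 / 3.0 * (V_w @ w_t + V_r @ r_t + V_I @ I))  # step 3: m(t)
p_next = F.softmax(W_out @ m_t, dim=0)                                    # step 4: distribution over the next word
print(p_next.shape, float(p_next.sum()))                                  # torch.Size([1000]) 1.0
```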
