Link of the Paper: https://arxiv.org/abs/1412.2306
Main Points:
- An Alignment Model: Convolutional Neural Networks over image regions, Bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding ( CNN - Structured Objective - BiRNN ).
- A Multimodal Recurrent Neural Network architecture.
Other Key Points:
- The