Research Progress and Trends of Automatic Text Generation: Image-to-Text Generation

Image-to-Text Generation

1 Current Status of International Research

Image-to-text generation technology refers to generating natural-language text that describes the content of a given image, such as the title attached to a news photo, the description attached to a medical image, the "describe the picture" exercises common in children's education, or the descriptive text users provide when uploading pictures to Internet applications such as microblogs. Depending on the level of detail and the length of the generated text, the task can be divided into automatic generation of image titles and automatic generation of image descriptions. The former must highlight the core content of the image according to the application scenario; for example, a title generated for a news photo needs to foreground the news event most closely related to the image content and seek novel phrasing to attract readers' attention. The latter usually needs to describe the main content of the image in detail; for example, concise yet thorough picture descriptions provided for people with visual impairments should present the image content comprehensively and in an orderly manner, with no specific requirements on the particular mode of expression.

For automatic image-to-text generation, humans can easily understand image content and express it as natural-language sentences according to specific needs; for computers, however, the task requires the combined use of research results from several fields, including image processing, computer vision, and natural language processing. As a landmark cross-disciplinary research task, automatic image-to-text generation has attracted researchers from different communities. Since 2010, relevant papers have appeared at well-known international conferences and in journals in natural language processing such as ACL, TACL, and EMNLP; since 2013, IEEE TPAMI, the top international journal in pattern recognition and artificial intelligence, and IJCV, the international journal in computer vision, have also begun publishing related work. By 2015, nearly 10 related papers had been published at CVPR, a well-known international conference in computer vision, and 2 related papers had appeared at ICML, a well-known international conference in machine learning. Automatic image-to-text generation has come to be recognized as a fundamental challenge in the field of artificial intelligence.

Similar to the general text generation problem, solving automatic image-to-text generation also follows the three-stage pipeline model [76], with some adjustments for the characteristics of image content understanding:

•       In terms of content extraction, concepts such as objects, orientations, actions, and scenes need to be extracted from the image. Objects can be localized to specific regions of the image, while the other concepts require semantic indexing. This step mainly relies on pattern recognition and computer vision techniques.
•       In terms of sentence content selection, the concepts most important for the application scenario (for example, the most visually prominent, or the most relevant to the scenario) must be selected so that they can be expressed coherently. This step requires the combined use of computer vision and natural language processing techniques.

•       Finally, in sentence realization, an appropriate mode of expression is selected according to the characteristics of the actual application, and the selected concepts are organized into natural-language sentences that conform to grammatical convention. This step mainly relies on natural language processing techniques. A toy sketch of the whole pipeline follows this list.
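As a concrete illustration, here is a minimal, purely illustrative Python sketch of the three-stage pipeline; all function names and the toy concepts are assumptions made for the example, not drawn from any cited system.

```python
# A minimal sketch of the three-stage pipeline described above.
# All function names and the toy data are illustrative assumptions.

def extract_content(image):
    """Stage 1: detect objects, actions, and the scene (vision side)."""
    # A real system would run detectors/classifiers here; we return
    # hand-written concepts for a single example image.
    return {
        "objects": ["dog", "frisbee"],
        "actions": ["catch"],
        "scene": "park",
    }

def select_content(concepts, max_concepts=3):
    """Stage 2: keep only the most salient, coherent concepts."""
    ranked = concepts["objects"] + concepts["actions"]
    return ranked[:max_concepts], concepts["scene"]

def realize_sentence(selected, scene):
    """Stage 3: order the concepts into a grammatical sentence."""
    subject, obj = selected[0], selected[1]
    verb = selected[2]
    return f"A {subject} is trying to {verb} a {obj} in the {scene}."

concepts = extract_content(image=None)        # stage 1
selected, scene = select_content(concepts)    # stage 2
print(realize_sentence(selected, scene))      # stage 3
# -> "A dog is trying to catch a frisbee in the park."
```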

Early work mostly followed the three-stage pipeline described above. For example, in the work of Yao et al. [88], the image is carefully segmented and labeled with objects and their components, as well as with the scene the image depicts; on this basis, a scene-specific description template is selected and the object recognition results are filled into the template to produce the image's description text. Feng and Lapata [89][90] used probabilistic graphical models to jointly model text and image information, selected suitable keywords from the news report in which a picture appeared to reflect the image content, and then used a language model to link the selected content words and the necessary function words into image titles that largely conform to grammar. Other work [91][92][93][94][95] relies on existing object recognition techniques from computer vision to extract objects (common categories such as people, animals, flowers, cars, and tables) from images and localize them to obtain the spatial relationships between objects, and then relies on probabilistic graphical models and language models to choose an appropriate description order that concatenates these object concepts and prepositional phrases into complete sentences. Hodosh et al. [96] used Kernel Canonical Correlation Analysis (KCCA) to capture the correlation between text and images and ranked candidate sentences against the image information to obtain the best description sentence. It is worth noting that neither the work of Hodosh et al. [96] nor that of Feng and Lapata [89][90] relies on existing object recognition techniques.
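The following toy sketch illustrates the template-filling idea: detected object labels are slotted into a scene-specific template. The templates and detection results below are invented for illustration and are not taken from [88].

```python
# A toy illustration of template-based description: recognized objects
# are inserted into a template chosen according to the detected scene.
# Templates and detections are invented assumptions for this example.

SCENE_TEMPLATES = {
    "street": "A photo taken on a street: {objects} can be seen.",
    "kitchen": "An indoor kitchen scene showing {objects}.",
}

def describe(detections, scene):
    # Join the recognized object labels into a natural enumeration.
    *head, tail = detections
    objects = ", ".join(head) + f" and {tail}" if head else tail
    return SCENE_TEMPLATES[scene].format(objects=objects)

print(describe(["a car", "two pedestrians", "a bicycle"], "street"))
# -> "A photo taken on a street: a car, two pedestrians and a bicycle can be seen."
```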

With the wide application of deep learning in pattern recognition, computer vision, and natural language processing, large-scale image classification and semantic annotation based on massive data developed rapidly; at the same time, technologies related to natural language generation, such as statistical machine translation, also improved significantly. This gave rise to a series of works that jointly model image semantic annotation and natural-language sentence generation: on the image side, a multi-layer deep convolutional neural network (DCNN) is used to extract object concepts from the image; on the text side, a recurrent neural network (RNN) or a recursive neural network is used to model the generation process of natural-language sentences [97]. Traditional image semantic annotation focuses mainly on recognizing specific objects and the relative positional relationships between them, paying less attention to abstract concepts such as actions. Socher et al. [98] proposed using recursive neural networks over syntactic parse trees to model sentences, highlighting the modeling of actions (verbs), and then jointly optimizing the image side and the text side, which better captures the relationships between objects and actions. To unify data of two different modalities in one framework, Chen and Zitnick [99] fused text and image information in the same recurrent neural network, using the image information as a memory module to guide the generation of text sentences; with the help of a reconstructed image-information layer, they achieved bidirectional representations from image to text and from text to image. Mao et al. [100] fused the image information obtained by a DCNN and the text information in the same recurrent neural network (m-RNN), integrating the image information into the sequential process of natural-language sentence generation, and achieved good results; a sketch of this idea appears below. Donahue et al. [101] applied a similar idea to action recognition and video description generation. However, m-RNN imposes no explicit constraints from the image side during sentence generation: for example, when the word "man" is generated, it has no direct or indirect association with the corresponding person annotation in the image.
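The following PyTorch sketch conveys the gist of the m-RNN idea, re-injecting a fixed CNN image feature at every word-generation step through a multimodal layer. All dimensions, layer choices, and the random inputs are illustrative assumptions, not the exact configuration of [100].

```python
# A minimal sketch of an m-RNN-style decoder: a CNN image feature is
# fused into the recurrent state at every word-generation step.
import torch
import torch.nn as nn

class MRNNDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRUCell(embed_dim, hidden_dim)
        # Multimodal layer: fuses the word state and the image feature.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.multimodal = nn.Linear(hidden_dim * 2, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, tokens):
        # img_feat: (B, img_dim) from a pretrained CNN; tokens: (B, T)
        B, T = tokens.shape
        h = torch.zeros(B, self.rnn.hidden_size)
        img = self.img_proj(img_feat)              # (B, hidden_dim)
        logits = []
        for t in range(T):
            h = self.rnn(self.embed(tokens[:, t]), h)
            m = torch.tanh(self.multimodal(torch.cat([h, img], dim=1)))
            logits.append(self.out(m))             # image re-injected each step
        return torch.stack(logits, dim=1)          # (B, T, vocab_size)

decoder = MRNNDecoder(vocab_size=1000)
img_feat = torch.randn(2, 2048)                    # stand-in CNN features
tokens = torch.randint(0, 1000, (2, 5))
print(decoder(img_feat, tokens).shape)             # torch.Size([2, 5, 1000])
```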

Researchers from Google, and from the University of Montreal together with the University of Toronto, respectively drew on the latest advances in statistical machine translation to advance the joint modeling of automatic image-to-text generation [102][103]. The former uses a deep convolutional neural network (DCNN) to model the image: once the image information has been "encoded", it is directly "decoded" into natural-language sentences by a connected long short-term memory network (LSTM), without traditional sub-steps such as image-word alignment and ordering. The latter, within the framework of neural machine translation, proposed using an "attention" mechanism from computer vision to promote alignment between words and image regions, thereby simulating human visual attention during sentence generation; the sketch below illustrates the attention step.
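As an illustration, the following sketch computes soft attention weights over a grid of CNN region features given the decoder's hidden state, and returns the weighted visual context used to predict the next word. The layer sizes and random inputs are assumptions for the example, not the exact formulation of [103].

```python
# A minimal sketch of soft attention for captioning: the decoder state
# scores each image region, and the weighted sum of region features
# becomes the visual context for generating the next word.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_lin = nn.Linear(feat_dim, attn_dim)
        self.hidden_lin = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (B, L, feat_dim) -- L image regions (e.g. a 14x14 grid)
        # h:     (B, hidden_dim)  -- current decoder hidden state
        e = self.score(torch.tanh(self.feat_lin(feats)
                                  + self.hidden_lin(h).unsqueeze(1)))  # (B, L, 1)
        alpha = F.softmax(e.squeeze(-1), dim=1)             # weights over regions
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
        return context, alpha    # alpha "aligns" the next word to image regions

attn = SoftAttention()
feats = torch.randn(2, 196, 512)   # stand-in for a 14x14 CNN feature map
h = torch.randn(2, 512)
context, alpha = attn(feats, h)
print(context.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 196])
```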

In addition, Microsoft researchers [104] used a convolutional neural network (CNN) with Multiple Instance Learning (MIL) to model images, used a discriminative language model to generate candidate sentences, and then applied Minimum Error Rate Training (MERT), a classic technique from statistical machine translation research, to exploit text-level and image-level features for ranking the candidate sentences.
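To make the reranking step concrete, here is a toy sketch in which each candidate caption carries a few feature scores, and a linear combination of those features selects the best candidate; in the cited work the weights would be tuned with MERT against a caption-quality metric. All features, scores, and weights here are invented for illustration.

```python
# A toy sketch of linear reranking over candidate captions. Feature
# names, values, and weights are invented assumptions for this example.

candidates = {
    "a dog catches a frisbee":       {"lm": -4.1, "word_detect": 0.9, "length": 5},
    "a frisbee a dog catch catches": {"lm": -9.7, "word_detect": 0.9, "length": 6},
    "a park with grass":             {"lm": -3.8, "word_detect": 0.3, "length": 4},
}

# Feature weights; MERT would tune these on held-out data.
weights = {"lm": 1.0, "word_detect": 5.0, "length": 0.1}

def score(features):
    return sum(weights[name] * value for name, value in features.items())

best = max(candidates, key=lambda c: score(candidates[c]))
print(best)  # -> "a dog catches a frisbee"
```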

Although image-to-text generation technology is still in an exploratory stage and some distance from practical industrial application, industry has begun to take note of its theoretical research value and potential application prospects, and is actively cooperating with academia to expand this research direction. The LSUN Challenge (Large-scale Scene Understanding), held at CVPR 2015, a well-known international conference in computer vision, included an evaluation task for automatically generating image titles. In the final overall ranking, Google [102] and Microsoft Research [104] tied for first place, the Montreal-Toronto joint team [103] and another Microsoft Research team [105] tied for third place, and the University of California, Berkeley [101] placed fifth.

2 Domestic Research Status

Research on image-to-text generation in domestic academic circles started relatively late, and most research institutions focus on tasks such as semantic annotation and retrieval of cross-media data. Nevertheless, in the European ImageCLEF evaluation held in 2015, Renmin University of China and Tencent won first place in the Image Sentence Generation task.

In industry, companies such as Baidu and Tencent are also drawing on their own research strengths in cross-media semantic annotation, classification, and retrieval to gradually carry out research in related directions, and they have achieved good results in the image-caption generation task of the LSUN evaluation.

3 Development Trends and Prospects

Image-to-text generation requires integrating research results from pattern recognition and machine learning, computer vision, natural language processing, and even cognitive science, and thus has very high theoretical research value and practical prospects. To some extent, this technology, together with tasks such as image semantic annotation, has become an arena in which major top research institutions compete in comprehensive artificial-intelligence research strength, which is bound to drive its rapid development.

For the task itself, the larger challenge still lies in how to correctly extract the content of the image while choosing a mode of expression appropriate to human language habits for converting that content into natural-language sentences. It should be pointed out that current research still focuses on whether the object concepts in the image are fully extracted, whether the right words are chosen, and whether the generated sentences conform to grammatical convention. It is foreseeable that, in the near future, constraints such as practical application scenarios and context will further drive progress in related technologies, which will then find wide application in fields such as news communication, online education, and smart homes.
