Shanghai AI Lab proposes VideoChat: now you can chat with videos


Xi Xiaoyao Technology Talk (original) | Authors: Xiaoxi, ZenMoore


Compared with language and images, video is a more complex and higher-level modality for representing the world, and video understanding is correspondingly harder than most mainstream tasks in natural language processing and computer vision. In the current wave of large models, a natural question arises: can large language models (LLMs), with the strong understanding and reasoning abilities they acquire from training on language, also handle video understanding? The answer is now here. Shanghai AI Lab has proposed VideoChat, a chat-centric, end-to-end video understanding system that integrates video foundation models with LLMs and performs very well on spatial and temporal reasoning, event localization, causal inference, and more.


Unlike existing multimodal large-model approaches to video input, which first textualize the video content and then feed the text to a large model to exploit its natural language understanding, this paper integrates the video foundation model and the language foundation model in a learnable way: it builds an interface between the two and aligns video with language by training that interface. This approach avoids losing visual information and spatio-temporal structure, and for the first time yields an efficient, learnable video understanding system with which users can communicate effectively about video content through VideoChat.

Paper title:
VideoChat: Chat-Centric Video Understanding

Paper link:
https://arxiv.org/pdf/2305.06355.pdf

Code address:
https://github.com/OpenGVLab/Ask-Anything

If you ask what capabilities a large model has, we could eloquently list many, from understanding and reasoning to computation and judgment; but how a large model should be used in different scenarios is still something of an art. In VideoChat, the authors treat the large model as a video task decoder: it converts video-related descriptions or embeddings into human-readable text. The process can be formalized as:

E = f(I)  or  E = f(V)

Here f is a model that takes an image or a video: feeding an image I or a video V into f yields the embedded representation E of that image or video. The decoding process is then

A_t = LLM(E, {(Q_i, A_i)}_{i<t}, Q_t)

where A_t is the LLM's answer in round t and {(Q_i, A_i)}_{i<t} are all the questions and answers exchanged with the user before round t.
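To make the formulation concrete, here is a minimal sketch of the chat-centric decoding loop, where `visual_encoder` and `llm_decode` are hypothetical stand-ins for f and the LLM rather than the authors' actual modules:

```python
# Minimal sketch of the chat-centric decoding loop formalized above.
# `visual_encoder` and `llm_decode` are hypothetical stand-ins for f(.)
# and the LLM; they are not the authors' actual modules.

from typing import List, Tuple

def visual_encoder(video_path: str) -> List[float]:
    """Stand-in for f(V): returns the video embedding E."""
    return [0.0] * 768  # dummy embedding

def llm_decode(embedding: List[float],
               history: List[Tuple[str, str]],
               question: str) -> str:
    """Stand-in for A_t = LLM(E, {(Q_i, A_i)}_{i<t}, Q_t)."""
    return f"(answer to: {question})"

def chat(video_path: str, questions: List[str]) -> List[str]:
    E = visual_encoder(video_path)        # embed the video once
    history: List[Tuple[str, str]] = []   # (Q_i, A_i) pairs before round t
    answers = []
    for q in questions:
        a = llm_decode(E, history, q)     # A_t = LLM(E, history, Q_t)
        history.append((q, a))
        answers.append(a)
    return answers

print(chat("demo.mp4", ["What is he doing?", "Is he dancing?"]))
```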

Traditionally, multimodal large models deal with video by textualizing it: the video is serialized into a textual Video Description that is then fed into the large model. Such a text stream handles understanding tasks easily, but it performs poorly on tasks that require temporal or spatial perception, because this kind of fine-grained information is almost inevitably lost once the video is turned into text. This paper therefore attempts an end-to-end approach that extracts the video's embedded information directly, as shown in the comparison in the following figure:

[Figure: comparison between the textualization pipeline (VideoChat-Text) and the end-to-end embedding pipeline (VideoChat-Embed)]

By integrating these two video pathways, that is, feeding the Video Context produced by VideoChat-Text and VideoChat-Embed into the large model, the system obtains a more comprehensive understanding of the video. For example, in the task above the user asks "Is he singing, dancing and rapping?", and VideoChat replies "No, he's playing basketball (and dancing)".
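At a high level, the combination can be pictured as follows; this is a hedged sketch with placeholder function names (`textualize`, `embed`, `llm`), not the authors' actual interfaces:

```python
# Hedged sketch of the idea in the figure above: the LLM receives both a
# textualized description (from VideoChat-Text) and learned video embeddings
# (from VideoChat-Embed) as its "video context". All names are placeholders.

def answer_question(video, question, textualize, embed, llm):
    video_context = {
        "text_stream": textualize(video),   # serialized description of the video
        "embedding_stream": embed(video),   # learnable video tokens
    }
    return llm(video_context, question)

# toy usage with stand-in callables
print(answer_question(
    "demo.mp4",
    "Is he singing, dancing and rapping?",
    textualize=lambda v: "a man dribbles and shoots a basketball",
    embed=lambda v: [0.0] * 768,
    llm=lambda ctx, q: "No, he's playing basketball.",
))
```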

For the VideoChat-Text part, the authors deconstruct the contents of a video in detail, covering actions, speech, objects, objects with location annotations, and so on. Based on this analysis, the VideoChat-Text module uses a range of video and image models to obtain representations of these contents, then uses T5 to integrate the outputs into a textualized version of the video, which is filled into the template shown in the following figure before being fed to the LLM:

[Figure: the prompt template used by VideoChat-Text]
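To make the text branch concrete, here is a rough sketch of how perception outputs might be serialized into an LLM prompt; the field names and template wording are illustrative assumptions, not the paper's exact template:

```python
# Rough sketch of the VideoChat-Text idea: outputs of off-the-shelf
# perception models (objects, actions, speech, timestamps) are serialized
# into a textual video description and slotted into a prompt for the LLM.
# The perception results and template wording below are illustrative only.

perception = {  # hypothetical perception-model outputs for a short clip
    "objects": ["person (center)", "basketball (lower left)", "hoop (top right)"],
    "actions": ["a man dribbles a basketball", "the man shoots a layup"],
    "clip_timestamps": ["00:00-00:05", "00:05-00:10"],
    "speech_transcript": "",
}

def build_text_prompt(p: dict, question: str) -> str:
    timeline = "\n".join(
        f"[{t}] {a}" for t, a in zip(p["clip_timestamps"], p["actions"])
    )
    lines = [
        "Objects in the video: " + ", ".join(p["objects"]) + ".",
        "Timestamped actions:",
        timeline,
        "Speech transcript: " + (p["speech_transcript"] or "(none)"),
        f"Question: {question}",
        "Answer:",
    ]
    return "\n".join(lines)

print(build_text_prompt(perception, "Is he singing, dancing and rapping?"))
```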

For VideoChat-Embed, the architecture below combines the video model and the large model through a learnable Video-Language Token Interface (VLTF), and VideoChat-Embed is built on BLIP-2 and StableVicuna. Specifically, the input video is first processed with GMHRA (Global Multi-Head Relation Aggregator) for temporal aggregation, image data is introduced for joint training, and a pre-trained Q-Former is attached to produce the video embedding.

[Figure: the VideoChat-Embed architecture]
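For intuition, here is a much-simplified sketch of what such a learnable token interface could look like: learnable queries cross-attend to the spatio-temporal tokens of a frozen frame encoder, are refined by a Q-Former-style layer, and are projected into the LLM's embedding space. Dimensions, module choices, and names are assumptions for illustration, not the paper's implementation:

```python
# Much-simplified, assumed sketch of a video-language token interface:
# learnable queries attend to frozen frame features, a Q-Former-style layer
# refines them, and a linear projection maps them into the LLM input space.

import torch
import torch.nn as nn

class TinyVideoLanguageInterface(nn.Module):
    def __init__(self, vis_dim=1024, n_queries=32, llm_dim=4096):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(vis_dim, 8, batch_first=True)
        self.queries = nn.Parameter(torch.randn(1, n_queries, vis_dim))
        self.qformer_layer = nn.TransformerDecoderLayer(
            d_model=vis_dim, nhead=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)   # into the LLM embedding space

    def forward(self, frame_feats):
        # frame_feats: (B, T, N_patches, vis_dim) from a frozen image encoder
        b, t, n, d = frame_feats.shape
        x = frame_feats.reshape(b, t * n, d)       # flatten space and time
        q = self.queries.expand(b, -1, -1)
        q, _ = self.cross_attn(q, x, x)            # aggregate spatio-temporal tokens
        q = self.qformer_layer(q, x)               # Q-Former-style refinement
        return self.proj(q)                        # (B, n_queries, llm_dim)

feats = torch.randn(2, 8, 196, 1024)   # 2 videos, 8 frames, 196 patches each
tokens = TinyVideoLanguageInterface()(feats)
print(tokens.shape)                    # torch.Size([2, 32, 4096])
```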

The whole training process is divided into two stages: alignment and fine-tuning. In the alignment stage, the authors use 25M visual-text pairs to fine-tune the interface, with the overall input prompt shown below:

[Figure: the input prompt used in the alignment stage]
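In both stages the compressed video tokens presumably reach the LLM by being spliced between embedded text segments of the prompt. A minimal sketch of that splicing, with assumed dimensions and placeholder prompt pieces:

```python
# Minimal sketch (with assumed dimensions) of splicing visual query tokens
# between embedded text segments before feeding the LLM. The prompt wording
# implied by the comments is an assumption, not the paper's exact format.

import torch

def build_llm_inputs(prefix_emb, video_tokens, suffix_emb):
    # sequence layout: [text prefix][video query tokens][text suffix]
    return torch.cat([prefix_emb, video_tokens, suffix_emb], dim=1)

prefix_emb = torch.randn(1, 5, 4096)     # e.g. an embedded "###Human: <Video>"
video_tokens = torch.randn(1, 32, 4096)  # output of the token interface
suffix_emb = torch.randn(1, 7, 4096)     # e.g. "</Video> Describe the video. ###Assistant:"

inputs = build_llm_inputs(prefix_emb, video_tokens, suffix_emb)
print(inputs.shape)  # torch.Size([1, 44, 4096])
```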

In the fine-tuning stage, the paper builds and open-sources an instruction dataset containing 7k detailed video descriptions, 4k video dialogues, 3k image descriptions, 2k image dialogues, and 2k image reasoning samples, and uses it to fine-tune VideoChat.

[Figure: the instruction data used for fine-tuning]
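For reference, the fine-tuning mixture described above can be summarized in a few lines; the counts come from the text, while the dictionary keys are just descriptive labels:

```python
# Instruction-tuning data mixture as described in the text above;
# the keys are descriptive labels, not official dataset names.

instruction_mix = {
    "detailed_video_descriptions": 7_000,
    "video_conversations": 4_000,
    "image_descriptions": 3_000,
    "image_conversations": 2_000,
    "image_reasoning": 2_000,
}

print(sum(instruction_mix.values()), "instruction samples in total")  # 18000
```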

Comparing against LLaVA, MiniGPT-4, and mPLUG-Owl, the paper conducts a qualitative study of VideoChat's capabilities. In spatial perception and analysis, VideoChat can recognize Japanese clothing and infer the corresponding music, and can determine the number of people in the video, which demonstrates its ability to recognize, capture, and analyze visual elements.

[Figure: spatial perception and analysis example]

In temporal perception and analysis, VideoChat can recognize the yoga movements in a video, judge the possibility of the person falling, and even give a safety reminder.

[Figure: temporal perception and analysis example]

In informal reasoning, VideoChat can also explain why a video is funny, and its explanation matches some of our abstract intuitions about what makes videos funny, such as incongruity and suddenness.

[Figure: informal reasoning example]

Compared with recent image-based multimodal dialogue systems, VideoChat correctly identifies the scene, while the other systems mistakenly treat the dialogue environment as indoors, which clearly shows VideoChat's strong comparative advantage in spatial perception.

[Figure: scene recognition compared with image-based multimodal dialogue systems]

Such an open-source video understanding framework can pave the way for video understanding, a problem that still lacks very mature solutions. Clearly, once video information is aligned with textual information, the strong abilities of large language models allow them to understand it. If we regard the large model as a black box with reasoning and comprehension capabilities, then video understanding becomes the problem of how to decode the video and align it with text, and this change in how the question is posed is what large models bring to the field.

However, measured against the mature video comprehension we hope for, this work still has limitations. For example, VideoChat still struggles with long videos of more than one minute. This is mainly due to the limited context length of the large model, but how to better compress video information within that limited context is itself a hard problem. As videos get longer, the system's response time also increases, which hurts the user experience. In addition, the datasets used in this paper are still not large, so VideoChat's reasoning remains at the level of simple inference and cannot yet handle complex reasoning tasks. In short, although VideoChat is not yet a perfect solution, it already adds an important stroke to today's video understanding systems. Let us look forward to more mature work built on top of it!


Source: blog.csdn.net/qq_27590277/article/details/130676109