The first dataset of funny "silly" videos! FunQA: Let machines become the king of comedy


4,000+ funny videos and 310,000+ free-text commentary annotations help AI models learn more accurate video understanding, counter-intuitive reasoning, humor comprehension, and precise free-text generation.


People easily derive pleasure from counter-intuitive videos (videos that are humorous, creative, or full of striking visual effects). This appeal comes not only from the visual stimulation such videos provide, but also from humans' innate ability to understand and find joy in unexpected, counter-intuitive moments. However, despite significant advances in today's computer vision models, one question remains: can video models "understand" the humor or creativity in a video?

Current video question answering (VideoQA) datasets still focus on common, unsurprising videos and simple task formats (multiple-choice or short open-ended answers). Simply answering basic questions about the people and objects in a video (what, who, how many, etc.) is clearly not enough for genuine video understanding. Commonly used VideoQA datasets include YouCook2 (2K cooking videos) and HowTo100M (instructional videos only). Some datasets (such as UR-FUNNY) introduce humorous clips from TV shows and set tasks such as predicting laughter, but these tasks rely heavily on audio and narrative cues, leaving visual cues with only a minor role.

To address this gap and evaluate the ability of computer vision models to understand counter-intuitive videos, scholars from Beijing University of Posts and Telecommunications, Nanyang Technological University in Singapore, and the Allen Institute for Artificial Intelligence proposed FunQA, a comprehensive, high-quality video question answering dataset consisting of 4.3K interesting videos and 312K human-annotated free-text question-answer pairs. FunQA includes three subsets: HumorQA, CreativeQA, and MagicQA. Each subset covers different sources and video content, but they share one property: the videos are surprising, e.g., unexpected contrasts in humorous videos, intriguing disguises in creative videos, and seemingly impossible feats in magic videos.


Paper: https://arxiv.org/abs/2306.14899

Dataset: github.com/Jingkang50/FunQA


In FunQA, the researchers also define three rigorous tasks to measure a model's understanding of counter-intuitive videos. These tasks push video reasoning beyond superficial description and require deeper understanding and insight. The specific tasks are:

1) Counter-intuitive timestamp localization: the model must identify the specific time span in the video during which the unexpected event occurs;

2) Detailed video description: the model must generate a coherent and objective description of the video content, demonstrating its basic video understanding ability;

3) Counter-intuitive reasoning: the model must give a concrete explanation of why the video is surprising, which requires deep reasoning about the counter-intuitive events in the video.

These tasks progressively assess a model's ability to perceive, express, and reason about the counter-intuitive elements that appear in videos. In addition, the researchers propose more challenging auxiliary tasks, such as giving the video an apt and vivid title.
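For the counter-intuitive timestamp localization task, the natural way to score a predicted time span against the annotated one is temporal intersection-over-union (IoU), the metric reported later for H1/C1/M1. The snippet below is a minimal sketch of that computation; the function name and the (start, end) interval format in seconds are illustrative choices of this write-up, not code from the FunQA repository.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Length of the overlapping segment (zero if the intervals are disjoint).
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # Union = total length covered by either interval.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Example: the model localizes the surprising moment at 12.0-15.5 s,
# while the annotation marks 13.0-16.0 s.
print(temporal_iou((12.0, 15.5), (13.0, 16.0)))  # 0.625
```

A higher temporal IoU means the predicted span overlaps the annotated counter-intuitive moment more tightly; an IoU of 0 means the model pointed at an entirely unrelated part of the video.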

The figure below shows examples from the three FunQA subsets, illustrating the question-answer pairs designed for the different video types.

[Figure: example question-answer pairs from the three FunQA subsets]

The FunQA dataset

When constructing the dataset, the researchers adhered to three principles for tackling the challenge of video understanding: focusing on visual cues, emphasizing counter-intuitive reasoning, and emphasizing spatiotemporal reasoning. Based on these principles, FunQA comprises 4,365 videos and 311,950 question-answer pairs drawn from three different video genres. The videos total 23.9 hours, with an average clip length of 19 seconds. The FunQA dataset consists of three subsets: HumorQA, CreativeQA, and MagicQA. Detailed statistics are shown in Figure 2.

[Figure 2: FunQA dataset statistics]

Figure 2(h) shows the timestamp heatmaps for the three video types, indicating the high-frequency time spans of the answers. For the description and reasoning tasks, the average length of the free-text answers reaches 34.24 words, far exceeding existing VideoQA datasets (e.g., 8.7 in ActivityNet-QA and 11.6 in NExT-QA). The annotation consistency results are shown in Figure 2(i): for each video category, more than 90% of the annotations show high consistency, only 1% show low consistency, and roughly 8% show moderate consistency, demonstrating the objectivity of the FunQA annotations.
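To give a feel for how one might work with these annotations, here is a minimal sketch in Python that loads a question-answer file and recomputes the average-answer-length statistic quoted above. The file name, field names (`task`, `answer`), and JSON layout are hypothetical placeholders; the actual annotation schema is defined in the GitHub repository linked above.

```python
import json

# Hypothetical layout: a list of QA records, each with a task id
# (e.g. "H2" for humor description, "H3" for humor reasoning) and a
# free-text answer. The real FunQA schema may differ.
with open("funqa_humor_annotations.json", "r", encoding="utf-8") as f:
    records = json.load(f)

# Keep only description and reasoning answers, then average their word counts.
desc_and_reason = [r for r in records if r["task"] in {"H2", "H3"}]
lengths = [len(r["answer"].split()) for r in desc_and_reason]
avg_len = sum(lengths) / len(lengths) if lengths else 0.0
print(f"average free-text answer length: {avg_len:.2f} words")
```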

Comparison of FunQA with other existing benchmarks

Compared with other benchmarks, FunQA focuses on the domain of interesting, counter-intuitive videos. Its tasks are designed to challenge a model's visual capabilities, requiring in-depth description, explanation, and spatiotemporal reasoning. The table below details how FunQA compares with other benchmarks.

[Table: comparison of FunQA with existing benchmarks]

Performance trends on one benchmark are often similar to those on another (note, for example, the correlation between VQA and MSCOCO). Compared with other datasets, however, FunQA not only provides evaluation in a new domain but also challenges models in ways other datasets may not. Its features include:

1) Deep spatiotemporal reasoning: FunQA focuses on counter-intuitive content, requiring the model to first understand the typical scenario (common sense) and then identify the deviation that makes it humorous. This kind of deep reasoning remains a challenging and largely unexplored area.

2) Rich annotations: unlike many datasets that rely on multiple-choice questions or short open-ended answers, FunQA provides free-text annotations with an average length of 34 words (previously, the most richly annotated VideoQA dataset was NExT-QA, with an average answer length of 11.6 words). This detailed annotation style allows for richer model responses and tests the ability to generate more nuanced answers.

3) Exploring humor: a detailed understanding of the principles of humor may be crucial for models to truly grasp the content of some videos. (No VideoQA dataset has focused on this aspect before; only related new datasets such as The New Yorker Caption Contest have appeared in VisualQA.) Determining how to equip models with this knowledge of humor, and what other kinds of knowledge might be "valuable", are exciting research directions.

Experimental Results and Conclusion

The researchers tested 7 VideoQA models, divided into caption-based and instruction-based models. The table below shows the main experimental results. In the FunQA benchmark, H1, C1, and M1 denote the counter-intuitive timestamp localization task on the three subsets, measured by IoU. H2, C2, and M2 denote the detailed video description task, and H3, C3, and M3 the counter-intuitive reasoning task. For the higher-level tasks, H4 and C4 denote giving the video an apt and vivid title. The answers to all of these tasks are free text and are measured with BLEU-4, ROUGE-L, CIDEr, BLEURT, and GPT-4. C5 denotes scoring the creativity of a creative video, evaluated by the difference between the predicted score and the official score.

[Table 2: main results of the 7 VideoQA models on the FunQA benchmark]
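As a concrete example of how the traditional metrics above are computed, the sketch below scores a free-text answer against a reference with BLEU-4 via NLTK; ROUGE-L, CIDEr, and BLEURT follow the same reference-versus-hypothesis pattern through their respective packages. This is only an illustration of the metric with invented sentences, not the paper's evaluation script. Smoothing is applied because short free-text answers often share no exact 4-grams with the reference, which is also one reason such metrics score near zero on free text.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ("the man slips on a banana peel and lands in the cake, "
             "which is unexpected because he was showing off").split()
hypothesis = "a man slips and falls into a cake while showing off".split()

# BLEU-4: uniform weights over 1- to 4-grams; smoothing avoids a hard zero
# when some n-gram order has no overlap with the reference.
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```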

The researchers also show example responses from different models on FunQA. Figure 3 shows the responses given by VideoChat, Video-ChatGPT, and Otter on a humorous video. VideoChat performs best on tasks H2 and H3, while Video-ChatGPT and Otter answer better on task H4, consistent with the results in Table 2. However, all models' answers are still far from the ground truth, especially for detailed description and counter-intuitive explanation.

[Figure 3: example responses from VideoChat, Video-ChatGPT, and Otter on a humorous video]

Overall, model performance on the FunQA dataset is generally unsatisfactory. Key findings include:

1) The timestamp localization task is the most challenging.

Caption-based models usually ignore temporal information, while instruction-based models such as Otter only take visual information from specific frames and do not incorporate temporal context. As a result, no current VLM can solve H1, C1, or M1.

2) There is no clear winner across all tasks.

Caption-based models do well at providing detailed descriptions but poorly on tasks that require inference, leading to a significant gap between description tasks (e.g., H2) and reasoning tasks (e.g., H3). Instruction-based models, on the other hand, show stronger reasoning ability but perform worse on description. One possible explanation is that instruction-based models include too much redundant information in their answers, degrading description performance.

3) Performance varies greatly across video types.

Most models give relatively accurate answers for humor and magic videos but struggle with creative videos. This may be because humor and magic videos usually depict everyday scenes the model has encountered before, whereas creative videos contain content the model has never seen; the model therefore struggles to generate new ideas, leading to irrelevant or wrong answers.

4) Evaluation metrics for free-text tasks are insufficient.

Traditional metrics score close to zero on free-text answers because they only capture surface-level textual similarity. The researchers found that GPT-4 shows some ability to assess deep understanding of free text, but it is unstable: the same content can receive different scores.
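One pragmatic way to soften this instability is to query the judge model several times and average the scores. The sketch below assumes the OpenAI Python client (v1.x) and an invented scoring prompt; it is not the evaluation pipeline used by the FunQA authors.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Invented judging prompt, not the one used in the paper.
PROMPT = (
    "You are grading a video question answering response.\n"
    "Reference answer: {ref}\n"
    "Model answer: {hyp}\n"
    "Reply with a single integer from 0 to 100 reflecting how well the model "
    "answer captures the counter-intuitive point. Number only."
)

def judge_once(ref: str, hyp: str, model: str = "gpt-4") -> int:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(ref=ref, hyp=hyp)}],
    )
    return int(resp.choices[0].message.content.strip())

def judge_mean(ref: str, hyp: str, n: int = 3) -> float:
    # Averaging repeated judgments smooths run-to-run variance in the score.
    return sum(judge_once(ref, hyp) for _ in range(n)) / n
```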

5) The fine-tuned Otter performs well on traditional metrics but lags behind on GPT-4 scores.

The researchers fine-tuned Otter on Dense Captions and on FunQA; Otter (FunQA) shows a clear advantage over Otter (DC). Although Otter performs better than other instruction-based models on traditional metrics such as ROUGE-L, it performs poorly on GPT-4 scores. One possible reason is that Otter's input is only 128 frames sampled from the video, which is not enough for comprehensive reasoning. The gap between Otter's scores on traditional metrics and on GPT-4 echoes the earlier finding that evaluation metrics for free text are insufficient.
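On the frame-budget point, the sketch below shows one common way to draw a fixed number of frames evenly across a clip with OpenCV. The 128-frame budget mirrors the number mentioned above, but the uniform-sampling strategy and function name are illustrative assumptions of this write-up, not Otter's documented preprocessing.

```python
import cv2
import numpy as np

def sample_frames_uniform(video_path: str, num_frames: int = 128):
    """Draw up to `num_frames` frames spread evenly across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the full clip.
    indices = np.linspace(0, max(total - 1, 0),
                          num=min(num_frames, total), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR array of shape (H, W, 3)
    cap.release()
    return frames
```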

Discussion

As mentioned above, compared with existing VideoQA datasets, FunQA is characterized by deep spatiotemporal reasoning and an exploration of humor, which poses new challenges for models:

1) Accurately understanding information and long videos: through analysis of failure cases, the researchers found that many models have difficulty describing videos accurately. While they may be good at detecting objects, they tend to falter when relating consecutive events in context. This shows the need for further exploration in this area, and FunQA can serve as a valuable dataset for in-depth work on video description.

2) Logical reasoning: the defining property of FunQA videos is content that is counter-intuitive and contradicts common sense. To understand this, a model must grasp the notion of "common sense", infer what would happen under normal circumstances, and then use that perspective to interpret the video's humor. This demands strong reasoning ability, and how to inject common sense into models remains an important research question.

3) Additional knowledge (a sense of humor): to interpret the humor in a video, it is crucial to understand the basic principles of humor. This kind of knowledge, along with other common sense and auxiliary information, may improve model performance. Deciding how to integrate valuable knowledge, and discerning what counts as "valuable", are topics worth further exploration.

In response to these challenges, the researchers propose some possible directions:

1) Model size: increasing the number of parameters is a natural way to improve performance, but it brings its own engineering challenges and requires improvements in optimization and deployment. The relationship between parameter count and performance on the FunQA benchmark deserves further exploration, and FunQA can serve as an excellent test bed.

2) Data quality: the researchers believe the focus should be on data collection. Current trends in large models suggest that large amounts of low-quality data are far less effective than small amounts of high-quality data. The researchers therefore hope the community will identify the kinds of data that genuinely help models understand counter-intuitive videos; this is a crucial research direction.

3) Training strategies: studying training strategies is also important, for example deciding what type of data to learn from first and understanding the role of curriculum learning (see the sketch after this list).

4) Model collaboration: the researchers believe that having multiple models collaborate to process examples in an elegant way may improve performance, though this approach requires more attention to the overall efficiency of the implementation.
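As a toy illustration of the curriculum idea in point 3), one could order training samples from description (easier) to reasoning and title generation (harder) and feed them to the model stage by stage. The task ids reuse the H/C/M numbering from the benchmark, but the staging, field names, and `train_one_epoch` placeholder are illustrative assumptions, not the authors' training recipe.

```python
# Hypothetical curriculum: objective description first, counter-intuitive
# reasoning second, then titles and creativity scoring.
CURRICULUM_STAGES = [
    {"H2", "C2", "M2"},   # stage 1: detailed video description
    {"H3", "C3", "M3"},   # stage 2: counter-intuitive reasoning
    {"H4", "C4", "C5"},   # stage 3: titles and creativity scores
]

def build_curriculum(samples):
    """Group QA samples (dicts with a 'task' field) into easy-to-hard stages."""
    return [[s for s in samples if s["task"] in stage]
            for stage in CURRICULUM_STAGES]

# Training would then walk the stages in order, e.g.:
# for stage_samples in build_curriculum(all_samples):
#     train_one_epoch(model, stage_samples)  # train_one_epoch is a placeholder
```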

Limitations of the current work:

1) The current FunQA dataset mainly contains video-level data and annotations. Deeper annotations could be introduced to explore the possibilities of video reasoning, such as detailed spatial and temporal annotations, i.e., captions and object-level annotations aligned to specific points on the timeline.

2) The original annotations were written in Chinese. When translating them into English, the researchers first used GPT to polish and complete the Chinese annotations to make the text as complete as possible; nevertheless, discrepancies between the annotations may remain due to cultural differences between the two languages.

Future work:

The researchers hope to extend FunQA with deeper and more diverse annotations. They also plan to explore new metrics that better evaluate model performance, especially for open-ended questions where deep metrics are lacking. Finally, they hope FunQA points the way toward deeper video reasoning.

