VidSitu dataset

To support follow-up research, this post gives a short overview of the VidSitu dataset.

The dataset was introduced in the paper "Visual Semantic Role Labeling for Video Understanding":

Paper: https://arxiv.org/pdf/2104.00990.pdf

Dataset website (VidSitu Dataset: Situation Recognition in Videos): https://vidsitu.org/

What follows is only a brief introduction to the dataset; for more details, see the paper or the official website.

Summary

The paper proposes a new framework for understanding and representing the salient events in a video using visual semantic role labeling. The authors represent a video as a set of related events, where each event consists of a verb and multiple entities that fill various roles with respect to that event. To study this challenging task of semantic role labeling in videos, or VidSRL, the paper introduces the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles for every 2-second interval. Entities are co-referenced across events within a movie clip, and events are connected to each other through event relations. The VidSitu clips are drawn from a large collection of movies (3K) and chosen to be both complex (4.2 unique verbs per video) and diverse (200 verbs with more than 100 annotations each). The paper analyzes the dataset comprehensively in comparison with other publicly available video understanding benchmarks, and evaluates several illustrative baselines as well as a range of standard video recognition models.

Introduction

VidSitu is a large video understanding dataset containing over 29K videos drawn from a diverse set of 3K movies. Videos in VidSitu are exactly 10 seconds long and are annotated with 5 verbs, corresponding to the most salient event occurring in each of the five 2-second intervals of the video. Each verb annotation is accompanied by a set of roles whose values are annotated with free-form text. Unlike the verb annotations, which are drawn from a fixed vocabulary, the free-form role annotations allow referring expressions (e.g., "boy in robe") to disambiguate entities in a video. An entity appearing in any of the five segments of a video is referred to consistently by the same expression. Finally, the dataset also contains event relation annotations capturing causality (event Y is caused by / a reaction to event X) and contingency (event X is a precondition of event Y). Key highlights of VidSitu include (a minimal sketch of one annotation record follows the list below):

        Diverse situations: VidSitu has a large vocabulary of verbs (1,500 unique verbs, with 200 verbs annotated in at least 100 events) and entities (5,600 unique nouns, with 350 nouns appearing in at least 100 videos);

        Complexity: Each video is annotated with 5 interrelated events, with an average of 4.2 unique verbs and 6.5 unique entities per video;

        Rich annotations: VidSitu provides a structured event representation (3.8 roles per event) with entity co-references and event relationship tags.
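
To make this structure concrete, below is a minimal sketch of what one annotation record could look like. The class and field names are illustrative assumptions chosen for readability, not the official JSON schema of the release at vidsitu.org.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    """One salient event inside a 2-second interval of the clip."""
    start_sec: int                 # interval start within the 10 s clip
    end_sec: int                   # interval end
    verb: str                      # verb drawn from a fixed vocabulary, e.g. "punch"
    # Semantic roles map role names to free-form text values, e.g.
    # {"Arg0 (agent)": "man in black suit", "Arg1 (patient)": "boy in robe"}
    roles: Dict[str, str] = field(default_factory=dict)


@dataclass
class ClipAnnotation:
    """Annotation for one 10-second movie clip."""
    clip_id: str
    events: List[Event]            # exactly 5 events, one per 2-second interval
    # Relations between event pairs, e.g. {("Ev1", "Ev3"): "Ev3 is caused by Ev1"}
    relations: Dict[Tuple[str, str], str] = field(default_factory=dict)
```

Note that entity co-reference needs no extra machinery in this sketch: an entity is co-referenced simply by reusing the same free-form phrase (e.g., "boy in robe") in every event in which it appears.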

Motivation: to facilitate further research on VidSRL and to provide a comprehensive benchmark that supports evaluating, separately, each of the capabilities required to address VidSRL.

Main contributions

  • Formulate the VidSRL task for understanding complex situations in videos;
  • Curate the richly annotated VidSitu dataset, consisting of diverse and complex situations, for studying VidSRL;
  • Establish evaluation methodology for the key capabilities required by VidSRL, and provide baselines for each built from state-of-the-art components. The dataset and code are publicly available at vidsitu.org.

Dataset annotation example

The time scale of salient events. In video, what constitutes a salient event is often ambiguous and subjective. For example, given the 10-second clip in Figure 1, one could define fine-grained events around atomic actions such as "turn" (third frame of event 2), or take a more holistic view of the whole sequence as a "fight". Without constraints on the time scale of events, this ambiguity makes annotation and evaluation challenging. The paper resolves it by restricting the choice of salient events to one event per fixed time interval. Prior work on atomic actions [21] relies on 1-second intervals. A good choice of interval length is one that enables a rich description of complex videos while avoiding incidental atomic actions. The authors qualitatively observe that a 2-second interval strikes a good balance between obtaining descriptive events and the objectivity required for system evaluation. Therefore, each 10-second clip is annotated with 5 events $\left\{ E_i \right\}_{i=1}^{5}$, one per 2-second interval. Section 4 of the paper covers the curation, analysis, and statistics of the dataset, which are not discussed further here.
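
As a small illustration of this fixed grid (using assumed constants, not code from the paper), the five event slots $E_1, \dots, E_5$ of a 10-second clip are simply non-overlapping 2-second intervals:

```python
# Fixed event grid described above: a 10-second clip is split into five
# non-overlapping 2-second slots, and exactly one salient event E_i is
# annotated per slot.
CLIP_LEN_SEC = 10
EVENT_LEN_SEC = 2


def event_intervals(clip_len: int = CLIP_LEN_SEC, event_len: int = EVENT_LEN_SEC):
    """Return the (start, end) second boundaries of the event slots E_1..E_5."""
    return [(s, s + event_len) for s in range(0, clip_len, event_len)]


print(event_intervals())  # [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]
```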

Evaluation metrics

1. Verb prediction

2. Semantic role prediction and co-referencing

3. Event relationship prediction accuracy.
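
As a hedged illustration only (the official benchmark defines its own evaluation protocol, and the free-form role values are scored with captioning-style metrics such as CIDEr rather than exact match), the two simpler metrics can be sketched as follows:

```python
from typing import List, Sequence


def verb_topk_accuracy(predicted_verbs: List[Sequence[str]],
                       gold_verbs: List[str],
                       k: int = 5) -> float:
    """Fraction of events whose gold verb appears among the top-k predictions."""
    hits = sum(gold in preds[:k] for preds, gold in zip(predicted_verbs, gold_verbs))
    return hits / len(gold_verbs)


def relation_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Plain accuracy over event-pair relation labels (e.g. "caused by")."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy usage with made-up predictions:
print(verb_topk_accuracy([["run", "walk", "jump"]], ["walk"], k=2))  # 1.0
print(relation_accuracy(["caused by"], ["enabled by"]))              # 0.0
```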

Experimental results

