Detailed explanation of the major datasets used in papers

MSR-VTT

Paper name: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

MSR-VTT is a large-scale benchmark for video understanding, especially for the emerging task of translating video to text.

The dataset was built by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs, covering the most comprehensive categories and diverse visual content, and at the time of release it had the most sentences and the largest vocabulary. Each clip was annotated with approximately 20 natural-language sentences by 1,327 AMT workers.

1. The official partition uses 6,513 clips for training, 497 clips for validation, and the remaining 2,990 clips for testing.
2. A second partition uses 6,656 clips for training and 1,000 clips for testing.
3. A third partition uses 7,010 clips for training and 1,000 clips for testing. Since the last two partitions provide no validation set, a validation set is constructed by randomly sampling 1,000 clips from MSR-VTT (see the sketch below).
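
Since the second and third partitions ship without a validation set, the held-out clips can simply be sampled at random. A minimal sketch of that step, assuming clip IDs are kept in a plain Python list (the IDs below are made up, not from the official release):

```python
import random

def make_validation_split(train_clip_ids, num_val=1000, seed=42):
    """Randomly hold out `num_val` clips from the training pool as a validation set."""
    rng = random.Random(seed)                   # fixed seed keeps the split reproducible
    clips = list(train_clip_ids)
    rng.shuffle(clips)
    return clips[num_val:], clips[:num_val]     # (remaining train IDs, validation IDs)

# Example with the 7,010-clip training partition (clip IDs are hypothetical).
all_train = [f"video{i}" for i in range(7010)]
train_ids, val_ids = make_validation_split(all_train)
print(len(train_ids), len(val_ids))             # 6010 1000
```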

Dataset introduction

MSR-VTT dataset: this dataset comes from the Microsoft Research - Video to Text (MSR-VTT) Challenge at ACM Multimedia 2016. The dataset contains 10,000 video clips, divided into training, validation, and test sets. Each video clip is annotated with about 20 English sentences. In addition, MSR-VTT provides category information for each video (20 categories in total); this category information is treated as prior knowledge and is also known for the test set. All videos also include audio. The benchmark uses four machine-translation evaluation metrics: METEOR, BLEU@1-4, ROUGE-L, and CIDEr.
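
As a small illustration of how one of these metrics is computed, here is a minimal BLEU@4 sketch using NLTK's corpus_bleu, where a generated caption is scored against all reference sentences of its clip; the captions below are invented, and the official challenge uses the standard COCO caption evaluation toolkit rather than this simplified scoring.

```python
from nltk.translate.bleu_score import corpus_bleu

# One entry per clip: a list of reference token lists and one hypothesis token list.
references = [
    [["a", "man", "is", "playing", "a", "guitar", "on", "stage"],   # reference 1
     ["someone", "plays", "the", "guitar"]],                        # reference 2
]
hypotheses = [
    ["a", "man", "plays", "a", "guitar", "on", "stage"],            # generated caption
]

# BLEU@4: uniform weights over 1- to 4-gram precision, with a brevity penalty.
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU@4 = {bleu4:.3f}")
```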

First, our dataset has the largest number of clip-sentence pairs, where each video clip has multiple sentence annotations. This allows for better training of the RNN, resulting in more natural and diverse sentences. Second, our dataset contains the most comprehensive yet representative video content, collecting 257 popular video queries in 20 representative categories (including cooking and movies) from a real video search engine. This would be useful for verifying the generalization ability of any method from video to language. Third, the video content in our dataset is more complex than any existing dataset because the videos are collected from the Web. This is a fundamental challenge for this particular field of research. Finally, in addition to video content, we retain audio channels for each clip, which opens the door to related fields. Figure 1 shows some examples of videos and annotation sentences.

Classification: [figure omitted: the 20 video categories of MSR-VTT]
Split

To split the dataset into training, validation, and test sets, we separate video clips according to their corresponding search queries. Clips from the same video or the same query never appear in both the training and test sets, which avoids overfitting. We split the data 65%:30%:5%, corresponding to 6,513, 2,990, and 497 clips in the training, test, and validation sets, respectively.
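
A rough sketch of such query-grouped splitting, assuming each clip record carries a query_id field (the record layout is illustrative, not the official annotation format):

```python
import random
from collections import defaultdict

def split_by_query(clips, ratios=(0.65, 0.30, 0.05), seed=0):
    """Split clips into train/test/val so that all clips of a query stay together."""
    by_query = defaultdict(list)
    for clip in clips:
        by_query[clip["query_id"]].append(clip)

    queries = list(by_query)
    random.Random(seed).shuffle(queries)

    train, test, val = [], [], []
    total = len(clips)
    for q in queries:
        # Assign the whole query to whichever split is still under its quota.
        if len(train) < ratios[0] * total:
            train.extend(by_query[q])
        elif len(test) < ratios[1] * total:
            test.extend(by_query[q])
        else:
            val.extend(by_query[q])
    return train, test, val

# Hypothetical records: 257 queries with a few dozen clips each.
clips = [{"clip_id": f"video{i}", "query_id": i % 257} for i in range(10000)]
train, test, val = split_by_query(clips)
print(len(train), len(test), len(val))   # roughly 6500 / 3000 / 500
```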

MSR-VTT is derived from a wide variety of video categories (7,180 videos from 20 general domains/categories), it has the largest vocabulary, and each clip is annotated with 20 different sentences.

TRECVID AVS 2016-2018

Task - Ad Hoc Video Search (AVS)
IACC.3 Dataset

The IACC.3 dataset consists of approximately 4,600 Internet Archive videos (144 GB, 600 hours) with Creative Commons licenses, in MPEG-4/H.264 format, ranging in duration from 6.5 to 9.5 minutes, with a mean duration of nearly 7.8 minutes. Most videos have some donor-supplied metadata available, such as title, keywords, and description.

The task provides no training data, so a joint training set built from MSR-VTT and TGIF is used.

VATEX

Paper title: VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

It contains more than 41,250 videos and 825,000 Chinese and English captions. Among the captions, there are more than 206,000 English-Chinese parallel translation pairs.

We use 25,991 video clips for training, 1,500 for validation, and 1,500 for testing, where the validation and test sets are obtained by randomly splitting the official validation set of 3,000 clips into two equal halves.

Multilingual video captions

VATEX, a large-scale multilingual video-and-language research dataset, contains more than 41,250 unique videos and 825,000 high-quality captions. It covers 600 human activities and diverse video content. Each video is accompanied by 10 English and 10 Chinese captions from 20 individual annotators.

  • It contains large-scale descriptions in both English and Chinese, and can support many multilingual studies that are constrained by monolingual datasets.
  • Second, VATEX has the largest number of clip-sentence pairs; each video clip has multiple unique sentence annotations, and each caption is unique within the entire corpus.
  • Third, the video content of VATEX is more comprehensive and representative, covering a total of 600 human activities.


An example from the VATEX dataset. The video has 10 English and 10 Chinese descriptions. All of them describe the same video, so they are distantly parallel to each other, while the last five are paired translations of each other.


MPII Movie Description Dataset (MPII-MD)

MPII-MD contains approximately 68,000 video clips extracted from 94 Hollywood movies. Each clip is accompanied by a sentence description derived from the movie script and from audio description (AD) data. Audio description, also known as Descriptive Video Service (DVS), is an additional audio track added to a movie that describes explicit visual elements for visually impaired viewers. Although the film clips are manually aligned with the descriptions, the data is very challenging due to the high diversity of visual and textual content and the fact that most clips have only a single reference sentence. We use the train/validation/test split provided by the authors and extract every fifth frame (the clips are shorter than those in MSVD, 94 frames on average).

We use the official data partition, i.e., 56,828, 4,929, and 6,580 movie clips for training, validation, and testing, respectively. Each movie clip is associated with one or two text descriptions.


MS-COCO

MS-COCO contains 123,287 images, each described with five sentences. We adopt the standard split used in [2], [9]: 113,287 images for training, 5,000 images for validation, and the remaining 5,000 images for testing. Final results are reported either by averaging over five folds of 1,000 test images each or by testing on the full 5,000 test images.
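
A minimal sketch of the 5 × 1,000 protocol mentioned above: split the 5,000 test images into five folds of 1,000 and average a metric over the folds. The evaluate_fold callback is a placeholder for whatever retrieval or captioning metric is actually being reported.

```python
def evaluate_on_1k_folds(test_image_ids, evaluate_fold, fold_size=1000):
    """Average a metric over consecutive folds of `fold_size` test images."""
    assert len(test_image_ids) % fold_size == 0
    scores = []
    for start in range(0, len(test_image_ids), fold_size):
        fold = test_image_ids[start:start + fold_size]
        scores.append(evaluate_fold(fold))
    return sum(scores) / len(scores)

# Usage sketch with a dummy metric; real code would run retrieval or captioning here.
test_ids = list(range(5000))
mean_score = evaluate_on_1k_folds(test_ids, lambda fold: len(fold) / 1000.0)
print(mean_score)   # 1.0
```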

Related references: classification statistics of the COCO2017 dataset; correspondence between the 80 category names and ID numbers of COCO2017; COCO2017 dataset structure description.

Flickr30k

Flickr30k collects 31,000 images, each with 5 text annotations. We adopt its standard split: 29,000 images for training, 1,000 images for validation, and 1,000 images for testing.

What is the Flickr30k dataset?
At its core, this dataset consists of two things: the images, and the textual descriptions paired with each image.
The first image: [figure omitted]

The corresponding annotations in the token file:
667626.jpg#0 A girl wearing a red and multicolored bikini is laying on her back in shallow water .
667626.jpg#1 Girl wearing a bikini lying on her back in a shallow pool of clear blue water .
667626.jpg#2 A young girl is lying in the sand , while ocean water is surrounding her .
667626.jpg#3 A little girl in a red swimsuit is laying on her back in shallow water .
667626.jpg#4 A girl is stretched out in shallow water

It can be seen that each image is paired with 5 descriptions, and the meanings of the five descriptions are basically the same.

Our goal is to train a model so that, given an input image, it outputs a correct and natural description of that image.

Each image corresponds to 5 descriptions, and there are 158,915 descriptions in total.
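
A small sketch of how such a token file can be parsed into a dict mapping each image to its five captions, assuming the name.jpg#index layout shown above with whitespace between the key and the caption (the file name in the usage comment is the one commonly distributed with Flickr30k, but treat it as an assumption):

```python
from collections import defaultdict

def load_flickr30k_tokens(path):
    """Parse lines like '667626.jpg#0  A girl ...' into {image_name: [caption, ...]}."""
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, caption = line.split(maxsplit=1)   # key: '667626.jpg#0'
            image_name = key.split("#")[0]          # drop the per-image caption index
            captions[image_name].append(caption)
    return captions

# captions = load_flickr30k_tokens("results_20130124.token")
# print(len(captions))                  # ~31,000 images
# print(captions["667626.jpg"][0])      # first of the five descriptions
```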

MSVD

Youtube2Text (MSVD)

MSVD contains 1,970 videos, and each video clip has about 40 sentence annotations. We use the standard split of 1,200 videos for training, 100 videos for validation, and 670 videos for testing.

  • This dataset contains 1,970 short videos, 10-25 s each, with an average duration of about 9 s. The videos cover different people, animals, actions, scenes, etc.
  • Each video has multiple sentences annotated by different people, about 41 sentences per clip, 80,839 sentences in total, with an average of 8 words per sentence. Altogether these sentences contain nearly 16,000 unique words.
  • The captions include descriptions in multiple languages; many papers use only the captions whose language field is English for training and testing (see the sketch below).
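
A sketch of that filtering step, assuming the MSVD annotations come in the commonly distributed CSV ("MSR Video Description Corpus") with Language, Description, VideoID, Start, and End columns; treat the file name and column names as assumptions:

```python
import csv
from collections import defaultdict

def load_english_captions(csv_path):
    """Keep only rows whose Language field is English; group captions per clip."""
    captions = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["Language"].strip().lower() != "english":
                continue
            # Clips are commonly identified as <VideoID>_<Start>_<End>.
            clip_id = f'{row["VideoID"]}_{row["Start"]}_{row["End"]}'
            captions[clip_id].append(row["Description"])
    return captions

# captions = load_english_captions("MSR_Video_Description_Corpus.csv")
# print(sum(len(v) for v in captions.values()))   # roughly 80K English sentences
```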


TGIF

The Tumblr GIF (TGIF) dataset contains 100K animated GIFs and 120K sentences describing the visual content of GIFs.

TGIF contains videos in GIF format: 79,451 clips for training, 10,651 for validation, and the remaining 11,310 for testing.

Example sentences:

A man glared, and a man wearing sunglasses appeared.

A cat tries to catch a mouse on a tablet

A man in red clothes is dancing

One animal near another in the jungle

A man in a hat adjusted his tie and made a strange face.

Someone puts a cat in wrapping paper and wraps it up with a bow

A brunette woman is looking at the man, and a cyclist is jumping over a fence

A group of men stood staring in the same direction.

a boy is happy parking and see another boy


ActivityNet Captions

The ActivityNet Captions dataset consists of 20,000 videos. Each video is densely annotated with multiple sentence descriptions.

The ActivityNet Captions dataset associates each video with a series of temporally annotated sentences. Each sentence covers a specific segment of the video and describes the event that occurs in it. These events can be long or short, are unrestricted in type, and may occur simultaneously. ActivityNet Captions contains 20,000 videos with an average of 3.65 temporally localized description sentences per video, for a total of 100,000 descriptions. We found that the number of sentences per video follows a roughly normal distribution. In addition, as the duration of the video increases, the number of descriptive sentences also increases. The average sentence length is 13.48 words, which also roughly follows a normal distribution. On average, each sentence describes an event lasting 36 seconds, approximately 31% of the corresponding video. However, taken together, the sentences of each video cover about 94.6% of its content, which shows that the annotations essentially cover the main activities in each video. We also found about 10% overlap between descriptions, indicating that events occurring at the same time are each described.

An example looks like this: [figure omitted: an example of ActivityNet Captions dense annotations]
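
The statistics above can be reproduced from the released JSON annotations. A rough sketch, assuming the commonly used format in which each video ID maps to a duration, a list of timestamps, and a list of sentences (treat the exact keys and the file name as assumptions):

```python
import json

def caption_stats(json_path):
    """Compute average sentences per video and average described span per sentence."""
    with open(json_path, encoding="utf-8") as f:
        anns = json.load(f)

    num_videos = len(anns)
    num_sentences = 0
    total_event_seconds = 0.0
    for video in anns.values():
        num_sentences += len(video["sentences"])
        for start, end in video["timestamps"]:
            total_event_seconds += end - start

    return {
        "sentences_per_video": num_sentences / num_videos,
        "avg_event_length_s": total_event_seconds / num_sentences,
    }

# print(caption_stats("train.json"))   # e.g. sentences_per_video ≈ 3.65
```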


LSMDC

It consists of 118,081 short clips excerpted from 202 feature-length films.

Source: blog.csdn.net/missgrass/article/details/121158046