Holistic Video Understanding (HVU) Dataset: A Detailed Explanation


0. Preface

  • One-sentence summary: HVU describes videos with a more comprehensive set of tags: scenes/environments, objects, actions, events, attributes, and concepts.

  • As the name ("holistic video understanding") suggests, the dataset is meant to explore what video understanding can cover as a whole, rather than serving a single task.

  • Official resources: website, paper (ECCV 2020), supplementary material, GitHub

  • Reference blog

  • Obtaining the data (500+ GB in total):

    • Download it yourself with youtube-dl: follow the README in the official repo; the label files are available from a separate repo. A minimal download sketch follows this list.
    • Alternatively, request the already-downloaded data from the authors.
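A minimal sketch of the self-download route, assuming the label CSV layout described in section 2 below and that youtube-dl is installed and on PATH; the output path template is an illustrative choice, not something prescribed by the official repo.

```python
# Minimal download sketch. Assumes HVU_Train_V1.0.csv (from the label repo)
# with a youtube_id column, and youtube-dl available on PATH.
import csv
import subprocess

with open("HVU_Train_V1.0.csv", newline="") as f:
    for row in csv.DictReader(f):
        url = "https://www.youtube.com/watch?v=" + row["youtube_id"]
        # Downloads the full video; trimming to [time_start, time_end]
        # would be a separate ffmpeg step.
        subprocess.run(
            ["youtube-dl", "-f", "mp4",
             "-o", "videos/%s.mp4" % row["youtube_id"], url],
            check=False,  # keep going past videos that are no longer available
        )
```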
  • Problems with previous datasets:

    • Existing video datasets focus mainly on human actions or sports events, which are only very specific problems within video understanding.
    • Video understanding actually involves recognizing many aspects: scenes/environments, objects, actions, events, attributes, and concepts; current work concentrates almost exclusively on actions.
    • Attributes are like adjectives and adverbs: they describe the other scenes/actions/objects/events.
    • Concept is harder to pin down; the paper defines it as follows:

    The concept category refers to any noun and label which present a grouping definition or related higher level in the taxonomy tree for labels of other categories.

1. Overview

  • The paper's description of the dataset is best quoted directly; the keywords are hierarchical, multi-label, and multi-task:

HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene

  • HVU mainly targets three tasks:
    • Video classification
    • Video Captioning
    • Video Clustering
  • The amount of data:
    • Number of videos: the training/validation/test sets contain 476k/31k/65k samples respectively, 572k videos in total.
    • Number of labels: the training/validation/test sets contain 7.5M/600k/1.3M labels respectively.
    • Number of categories: 3,142 in total, with on average 2,112 labeled samples per category. By type: 248 scenes, 1,678 objects, 739 actions, 69 events, 117 attributes, and 291 concepts.
      • The full category list is too long to include in the text; see it directly on GitHub.
      • The relationships between the different tag types are shown in the figure and table below.
      • [Figure and table omitted: per-category tag and sample statistics]
  • Comparison with other common video understanding datasets
    • [Table omitted: comparison with other video datasets]

2. Details

  • Label file introduction

    • HVU_Tags_Categories_V1.0.csv: the category file, with two columns (Tag and Category); the former is the specific tag name, the latter is one of six values: action, attribute, concept, event, object, scene.
    • Label files: HVU_Train_V1.0.csv and HVU_Val_V1.0.csv
      • Each has four columns: Tags, youtube_id, time_start, time_end
      • Tags holds multiple labels separated by |; youtube_id is the sample's YouTube ID (used when downloading); time_start and time_end are the start and end times within the original, uncut video. A parsing sketch follows below.
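A minimal parsing sketch for the files just described, assuming the column names above; pandas is one convenient choice, not something the dataset requires.

```python
# Parse the HVU label files described above (column names per this section).
import pandas as pd

labels = pd.read_csv("HVU_Train_V1.0.csv")
# Tags is a '|'-separated multi-label string; split it into a list per video.
labels["Tags"] = labels["Tags"].str.split("|")

# Map each tag to its category (action/attribute/concept/event/object/scene).
cats = pd.read_csv("HVU_Tags_Categories_V1.0.csv")
tag2cat = dict(zip(cats["Tag"], cats["Category"]))

row = labels.iloc[0]
print(row["youtube_id"], row["time_start"], row["time_end"])
print([(t, tag2cat.get(t)) for t in row["Tags"]])
```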
  • Data collection and labeling process

    • Building a video dataset generally involves two steps: data collection and data annotation.
    • HVU data collection: existing action recognition datasets serve as the data sources, e.g. YouTube-8M, Kinetics-600, HACS.
      • Reusing existing datasets has clear advantages: copyright and privacy issues have already been dealt with, and the training and test sets will not overlap.
    • HVU data annotation
      • Annotating action recognition datasets has two main problems: manual labeling is error-prone (with so many tags, annotators can hardly attend to every detail), and it is time-consuming and labor-intensive.
      • To alleviate this, HVU first runs the Google Vision API and the Sensifai Video Tagging API to generate about 30 tags per video; the results are then verified by human annotators. A sketch of the machine-tagging step follows this list.
    • The paper's supplementary material describes the human annotation process (Human Annotation Details).
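The paper does not release its tagging pipeline, so the sketch below only illustrates what the machine-tagging step could look like using Google's public Cloud Video Intelligence client (the Sensifai API is commercial and omitted); the bucket URI is a placeholder.

```python
# Illustrative machine tagging with Google's Cloud Video Intelligence API
# (google-cloud-videointelligence >= 2.0). The input URI is a placeholder.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://my-bucket/some_video.mp4",
    }
)
result = operation.result(timeout=600)

# Collect video-level label descriptions, analogous to HVU's raw tags.
for label in result.annotation_results[0].segment_label_annotations:
    print(label.entity.description)
```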
  • Taxonomy: in short, how the set of categories was derived.

    • The Google and Sensifai APIs initially yield about 8,000 tags.
    • Tags with unbalanced samples are removed (presumably tags with too few samples; the paper is not explicit). A minimal sketch of this idea follows.
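Since the paper gives no exact filtering rule, the sketch below shows one plausible interpretation (dropping tags below a sample-count threshold); the threshold is an arbitrary illustrative choice.

```python
# Hypothetical tag filtering: drop tags with too few samples.
# The min_samples threshold is an illustrative assumption, not from the paper.
from collections import Counter

def filter_rare_tags(tag_lists, min_samples=50):
    """tag_lists: one list of tags per video."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    keep = {tag for tag, n in counts.items() if n >= min_samples}
    return [[t for t in tags if t in keep] for tags in tag_lists]
```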
