Charades & CharadesEgo & Action Genome data set and paper summary

0. Preface

  • This article introduces the Charades series of data sets, including:
    • Charades: ECCV 2016, the first daily-activity recognition data set set in indoor home scenes, collected through crowdsourcing.
      • The collection method is quite interesting: a user first writes a script (making a sentence from given keywords), then records the video themselves, and finally other workers annotate it.
    • CharadesEgo: CVPR 2018, the first paired behavior recognition data set.
      • "Paired" means that for the same sequence of actions there is both a first-person video and a third-person video.
      • The paper aims to jointly model the first-person and third-person data.
    • Action Genome: CVPR 2020, a re-annotation of Charades that adds relationships between people and objects.
      • The new annotations include person bboxes, object bboxes, and person-object relationships.

1. Charades

  • basic information:

  • Data acquisition: download directly from the official website; the Thunder (Xunlei) download manager can be used.

  • Data set overview:

    • This data set is crowdsourced (Amazon Mechanical Turk platform).
    • The videos are shot in indoor home scenes.
    • There are 157 behavior categories (listed in the appendix), all everyday behaviors; in the words of the paper, "boring" household activities.
    • In addition to behavior labels with start and end times, the data set includes many free-text descriptions.
  • Data volume: the training/test sets contain 7,985/1,863 videos and 49,809/16,691 annotated behavior clips respectively


1.1. Introduction to the data set

  • The README on the official website gives a detailed introduction to the data set.
    • It covers which files exist and describes each one in detail (videos, frames, labels).
    • The content below is largely a translation of parts of that README; see the link above for more details.
  • The main files are (all of them can be found and downloaded on the official website; Thunder works):
    • README.txt: detailed description of the data set
    • liscence.txt: free for academic institutions; commercial use is not permitted
    • Charades.zip: Label file, introduced separately later
    • Charades_v1.zip: Original video, all videos are in one folder.
    • Charades_v1_480.zip: Compressed video, all videos are in one folder.
    • Charades_caption.zip: Caption information, and the corresponding evaluation code.
    • Charades_v1_rgb.tar: RGB frames extracted from the videos
    • Charades_v1_flow.tar: optical flow frames
    • Charades_v1_features_rgb.tar.gz: features extracted from the RGB frames (I'm not sure how they were extracted)
    • Charades_v1_features_flow.tar.gz: features extracted from the optical flow frames (I'm not sure how they were extracted)
  • Detailed labeling:
    • Category file:
      • Charades_v1_classes.txt: behavior number and the corresponding behavior, such as c013 Washing a table
      • Charades_v1_objectclasses.txt: noun number and the corresponding noun, such as o007 chair
      • Charades_v1_verbclasses.txt: verb number and the corresponding verb, such as v001 close
      • Charades_v1_mapping.txt: behavior number and the corresponding verb and noun numbers, such as c012 o033 v026
    • Sample label:
      • Mainly Charades_v1_train.csv and Charades_v1_test.csv (a minimal parsing sketch in Python follows this list).
      • Each row contains 11 fields:
        • id: the video id, i.e. the video file name (without extension)
        • subject: I am not sure what this field means
        • scene: the scene, one of 15 indoor scene types such as living room, kitchen, bathroom, etc.
        • quality: video quality as judged by the annotator, on a 7-point scale (7 is best)
        • relevance: how relevant the annotator considers the video to be to the script, on a 7-point scale (7 is most relevant)
        • verified: whether another annotator agrees that the video matches the script
        • script: the script (see the data collection process in the next section)
        • objects: related objects (names, not numbers)
        • descriptions: a description of the video written by someone other than the person who filmed it (an annotator)
        • actions: behaviors, each given as a behavior number plus start and end times (in seconds from the start of the video, to 2 decimal places), for example c092 11.90 21.20
        • length: Video length, in seconds, accurate to 2 decimal places
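  • A minimal Python sketch of parsing these sample labels. The column names follow the list above; the assumption that multiple actions are separated by semicolons matches the example format in the official README, and the CSV is assumed to sit in the working directory.

import csv

# Read Charades_v1_train.csv and expand the "actions" field
# (e.g. "c092 11.90 21.20;c147 0.00 12.60") into (class_id, start_s, end_s) tuples.
with open("Charades_v1_train.csv", newline="") as f:
    for row in csv.DictReader(f):
        video_id = row["id"]
        length = float(row["length"])       # video length in seconds
        actions = []
        if row["actions"]:                  # some rows have no annotated actions
            for item in row["actions"].split(";"):
                cls, start, end = item.split()
                actions.append((cls, float(start), float(end)))
        print(video_id, length, actions)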

1.2. Introduction to the paper

  • Why create this data set?

    • Previous data sets were all collected and annotated from raw YouTube videos, which contain very few indoor home scenes.
    • Many everyday behaviors are hard to find in YouTube videos, movies, or TV shows.
  • The paper introduces the "Hollywood in Homes" approach to data collection and releases Charades v1.0.

  • Comparison with other data sets


  • Data collection process:

    • Step 1: generate indoor scripts. The key is to generate diverse scripts while still getting enough samples for each category.
      • The focus is on indoor scenes, so 15 types of scenes found in residences were selected, such as bedrooms and living rooms.
      • Writing a script requires choosing a set of everyday behaviors and everyday objects. To select them, the scripts of 549 movies were analyzed with several methods to find the behaviors and objects appearing in indoor movie scenes; in the end, 40 object types and 30 behavior types were selected.
      • Concretely, the user is randomly given 5 nouns and 5 verbs, picks 2 of each, and writes a sentence with them; the script must involve only 1-2 people and some interaction.
    • Step 2: Take a video based on the script.
      • The worker who wrote the script is asked to record a roughly 30-second video.
      • The paper also discusses how to recruit AMT workers to record videos economically.
    • Step 3: have other workers describe the recorded video and verify that it matches the script.
      • For each recorded video, other workers are asked to describe it in one sentence.
      • For each video, related objects are detected automatically, and workers then check whether the detections are accurate.
      • For each type of object there are 4-5 related behaviors, and workers judge whether these behaviors appear in the video.
      • A crowdsourcing algorithm is also used to refine the results.
      • Another group of annotators then marks the start and end times of each behavior, given the video and the corresponding behavior labels.


1.3. Introduction to performance indicators

  • The official website provides the evaluation code, though unfortunately it is written in MATLAB.
  • There are two tasks: behavior recognition and temporal behavior detection.
  • For behavior recognition tasks:
    • The output form is video_id, vector
    • Each video outputs a vector of length num_classes, giving the predicted probability of each behavior class.
    • This is a multi-label behavior recognition task.
    • The final performance metric is mAP, with AP computed per class (a minimal sketch follows the code snippet below).
  • For temporal behavior detection:
    • In Charades, this essentially amounts to classifying individual frames.
    • The output should be video_id, frame_number, vector
    • To keep the number of outputs manageable, the evaluation code samples 25 equally spaced time points per video and evaluates one frame at each:
for j = 1:frames_per_video
    timepoint(j) = (j-1)*time/frames_per_video;
end

% That is: 0, time/25, 2*time/25, ..., 24*time/25.
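  • A minimal sketch of the per-class mAP metric described above, using scikit-learn as a stand-in for the official MATLAB evaluation code; scores and labels are hypothetical (num_videos, num_classes) arrays of predictions and binary ground truth.

import numpy as np
from sklearn.metrics import average_precision_score

# Multi-label mAP: average precision per class, then the mean over classes.
# Illustrative stand-in only, not the official Charades evaluation code.
def charades_map(scores: np.ndarray, labels: np.ndarray) -> float:
    aps = []
    for c in range(labels.shape[1]):
        if labels[:, c].sum() == 0:          # skip classes with no positive samples
            continue
        aps.append(average_precision_score(labels[:, c], scores[:, c]))
    return float(np.mean(aps))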

2. CharadesEgo

  • Basic information: paper, official website
  • Overview of the data set:
    • The overall acquisition process is roughly the same as Charades
    • Performance indicators are the same as Charades
  • The main research of this data set is:
    • First-person behavior recognition (no such data set existed before, or the existing ones were too small)
    • Joint modeling of the third-person and first-person views

2.1. Differences from Charades

  • The data source has changed: Charades contains only third-person videos, while CharadesEgo consists of paired first-person and third-person videos.

    • Note that the two videos in a pair are shot separately but act out the same actions.
    • I haven't watched the videos themselves yet, so I don't know exactly how the pairing works or whether the two videos are time-aligned.
    • There are two ways of recording the first-person video; one is to hold the phone in one hand and perform the actions with the other. The authors prefer the second method, so videos submitted that way receive an extra bonus.
  • The data format is slightly different: the sample labels have thirteen columns, two more than Charades (a small sketch using these columns follows this list):

    • egocentric: whether the video is first-person
    • charades_video: the corresponding sample in Charades
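  • A minimal pandas sketch of splitting and pairing the CharadesEgo annotations using these two columns. The file name CharadesEgo_v1_train.csv and the exact encoding of the egocentric column (boolean vs. yes/no strings) are assumptions; adjust them to the files you download.

import pandas as pd

# Split the annotations into first- and third-person rows, then pair them
# on `charades_video`, which links each recording back to its Charades sample.
df = pd.read_csv("CharadesEgo_v1_train.csv")
is_ego = df["egocentric"].astype(str).str.lower().isin(["true", "yes", "1"])
ego, third = df[is_ego], df[~is_ego]
pairs = ego.merge(third, on="charades_video", suffixes=("_ego", "_3rd"))
print(len(ego), len(third), len(pairs))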

2.2. Paper details

  • The title of the paper is Actor and Observer: Joint Modeling of First and Third-Person Videos
    • As the title says, the goal is to jointly model paired first-person and third-person videos.
    • The data set is proposed to support this problem, and much of the paper is devoted to the model itself.
  • The paper proposes a structure similar to a triplet loss.


  • With this structure, the paper realizes first/third-person video matching, temporal alignment, and zero-shot first-person action recognition.
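  • To make the triplet idea concrete, here is a minimal NumPy sketch of a generic triplet margin loss over clip embeddings. This is a generic formulation for illustration only, not the exact loss used in the paper; the choice of anchor/positive/negative below is an assumption.

import numpy as np

# Generic triplet margin loss over clip embeddings.
# anchor: a third-person clip embedding, positive: its paired first-person clip,
# negative: a first-person clip from a different video.
def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive)   # distance to the matching clip
    d_neg = np.linalg.norm(anchor - negative)   # distance to a non-matching clip
    return max(0.0, d_pos - d_neg + margin)

# Toy usage with random 128-d embeddings.
rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=128) for _ in range(3))
print(triplet_margin_loss(a, p, n))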

3. Action Genome

  • basic information:
  • Data set download: the videos can be downloaded directly from the Charades official website (Thunder works); the additional Action Genome annotation files are on Google Drive.
  • Basic information of the paper
    • Domain: video understanding (I haven't read it carefully yet, so I don't know whether it can serve as a spatio-temporal behavior detection data set)
    • Affiliation: Stanford University
    • Venue: CVPR 2020
  • Data set overview: in essence it re-annotates Charades, adding the relationships between people and objects

3.1. Data set introduction

  • Overview: a subset of video frames is selected, and person/object bboxes plus person-object relationships are annotated.
    • The data set consists of 5 files.
  • Person information
    • Stored in person_bbox.pkl; loading it yields a dict.
    • The dict keys have the form filename/frame number, for example 001YG.mp4/000089.png
    • The dict values hold the person information: candidate boxes and human keypoints.
    • The candidate-box fields are bbox/bbox_score/bbox_size/bbox_mode
    • The keypoint fields are keypoints/keypoints_logits
  • Object information
    • Stored in object_bbox_and_relationship.pkl; loading it yields a dict.
    • The dict keys have the form filename/frame number, for example 001YG.mp4/000089.png
    • The dict values hold the object information and the object's relationships with the person: a list with one element per object.
    • Each object entry contains class/bbox/attention_relationship/spatial_relationship/contacting_relationship/metadata/visible, for example (a minimal loading sketch follows this example):
{...
    'VIDEO_ID/FRAME_ID':
        [...
            {
                'class': 'book',
                'bbox': (x, y, w, h),
                'attention_relationship': ['looking_at'],
                'spatial_relationship': ['in_front_of'],
                'contacting_relationship': ['holding', 'touching'],
                'visible': True,
                'metadata': 
                    {
                        'tag': 'VIDEO_ID/FRAME_ID',
                        'set': 'train'
                    }
            }
        ...]
...}
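  • A minimal Python sketch of loading the two annotation files described above and looking up one annotated frame. The file names and field names follow the lists above; the example key is the one quoted earlier.

import pickle

# Load the Action Genome annotations and inspect a single frame.
with open("person_bbox.pkl", "rb") as f:
    person_anno = pickle.load(f)
with open("object_bbox_and_relationship.pkl", "rb") as f:
    object_anno = pickle.load(f)

key = "001YG.mp4/000089.png"                 # 'filename/frame number' key
print(person_anno[key]["bbox"])              # person candidate box(es)
for obj in object_anno[key]:                 # one entry per object in the frame
    if obj["visible"]:
        print(obj["class"], obj["bbox"], obj["contacting_relationship"])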
  • Others include a list of object types (see the appendix), a list of labeled frames, and a list of all relationships (see the figure below).

    (figure: the full list of relationship types, grouped into attention / spatial / contacting relationships)

3.2. Paper details

  • What problem to solve
    • In computer vision, actions or activities are usually treated as a single, monolithic whole.
    • But cognitive science and neuroscience research has found that humans encode behavior as hierarchical part structures.
  • What method was used
    • A data set is proposed (in fact, it is relabeled on the basis of Charades):
      • Behavior is viewed as an "Action Genome"
      • That is, each behavior is decomposed into a spatio-temporal scene graph.
      • This captures the relationships between people and objects while the behavior occurs.
    • Comparison with other common data sets (see the comparison table in the paper)
    • What "relationship" means here is shown in the figure below:
      • In the original Charades, relations are defined at the clip level
      • In Action Genome, relations are defined at the image (frame) level
      • (figure: clip-level vs. frame-level relationship annotation)
    • People are surrounded by many objects; see the figure below
      • (figure: the many objects around people in annotated frames)
    • The paper proposes a structure called SGFB to handle this task, but I haven't looked at it closely.
  • Results
    • Using Action Genome, the paper evaluates standard Charades action classification, few-shot action recognition, and spatio-temporal scene graph prediction (result tables are in the paper).
  • Open questions & takeaways
    • The data set is good, but it seems cumbersome to use in practical applications; follow-up work is needed to sort out the details.

Appendix

Charades behavior tags

c000 Holding some clothes
c001 Putting clothes somewhere
c002 Taking some clothes from somewhere
c003 Throwing clothes somewhere
c004 Tidying some clothes
c005 Washing some clothes
c006 Closing a door
c007 Fixing a door
c008 Opening a door
c009 Putting something on a table
c010 Sitting on a table
c011 Sitting at a table
c012 Tidying up a table
c013 Washing a table
c014 Working at a table
c015 Holding a phone/camera
c016 Playing with a phone/camera
c017 Putting a phone/camera somewhere
c018 Taking a phone/camera from somewhere
c019 Talking on a phone/camera
c020 Holding a bag
c021 Opening a bag
c022 Putting a bag somewhere
c023 Taking a bag from somewhere
c024 Throwing a bag somewhere
c025 Closing a book
c026 Holding a book
c027 Opening a book
c028 Putting a book somewhere
c029 Smiling at a book
c030 Taking a book from somewhere
c031 Throwing a book somewhere
c032 Watching/Reading/Looking at a book
c033 Holding a towel/s
c034 Putting a towel/s somewhere
c035 Taking a towel/s from somewhere
c036 Throwing a towel/s somewhere
c037 Tidying up a towel/s
c038 Washing something with a towel
c039 Closing a box
c040 Holding a box
c041 Opening a box
c042 Putting a box somewhere
c043 Taking a box from somewhere
c044 Taking something from a box
c045 Throwing a box somewhere
c046 Closing a laptop
c047 Holding a laptop
c048 Opening a laptop
c049 Putting a laptop somewhere
c050 Taking a laptop from somewhere
c051 Watching a laptop or something on a laptop
c052 Working/Playing on a laptop
c053 Holding a shoe/shoes
c054 Putting shoes somewhere
c055 Putting on shoe/shoes
c056 Taking shoes from somewhere
c057 Taking off some shoes
c058 Throwing shoes somewhere
c059 Sitting in a chair
c060 Standing on a chair
c061 Holding some food
c062 Putting some food somewhere
c063 Taking food from somewhere
c064 Throwing food somewhere
c065 Eating a sandwich
c066 Making a sandwich
c067 Holding a sandwich
c068 Putting a sandwich somewhere
c069 Taking a sandwich from somewhere
c070 Holding a blanket
c071 Putting a blanket somewhere
c072 Snuggling with a blanket
c073 Taking a blanket from somewhere
c074 Throwing a blanket somewhere
c075 Tidying up a blanket/s
c076 Holding a pillow
c077 Putting a pillow somewhere
c078 Snuggling with a pillow
c079 Taking a pillow from somewhere
c080 Throwing a pillow somewhere
c081 Putting something on a shelf
c082 Tidying a shelf or something on a shelf
c083 Reaching for and grabbing a picture
c084 Holding a picture
c085 Laughing at a picture
c086 Putting a picture somewhere
c087 Taking a picture of something
c088 Watching/looking at a picture
c089 Closing a window
c090 Opening a window
c091 Washing a window
c092 Watching/Looking outside of a window
c093 Holding a mirror
c094 Smiling in a mirror
c095 Washing a mirror
c096 Watching something/someone/themselves in a mirror
c097 Walking through a doorway
c098 Holding a broom
c099 Putting a broom somewhere
c100 Taking a broom from somewhere
c101 Throwing a broom somewhere
c102 Tidying up with a broom
c103 Fixing a light
c104 Turning on a light
c105 Turning off a light
c106 Drinking from a cup/glass/bottle
c107 Holding a cup/glass/bottle of something
c108 Pouring something into a cup/glass/bottle
c109 Putting a cup/glass/bottle somewhere
c110 Taking a cup/glass/bottle from somewhere
c111 Washing a cup/glass/bottle
c112 Closing a closet/cabinet
c113 Opening a closet/cabinet
c114 Tidying up a closet/cabinet
c115 Someone is holding a paper/notebook
c116 Putting their paper/notebook somewhere
c117 Taking paper/notebook from somewhere
c118 Holding a dish
c119 Putting a dish/es somewhere
c120 Taking a dish/es from somewhere
c121 Wash a dish/dishes
c122 Lying on a sofa/couch
c123 Sitting on sofa/couch
c124 Lying on the floor
c125 Sitting on the floor
c126 Throwing something on the floor
c127 Tidying something on the floor
c128 Holding some medicine
c129 Taking/consuming some medicine
c130 Putting groceries somewhere
c131 Laughing at television
c132 Watching television
c133 Someone is awakening in bed
c134 Lying on a bed
c135 Sitting in a bed
c136 Fixing a vacuum
c137 Holding a vacuum
c138 Taking a vacuum from somewhere
c139 Washing their hands
c140 Fixing a doorknob
c141 Grasping onto a doorknob
c142 Closing a refrigerator
c143 Opening a refrigerator
c144 Fixing their hair
c145 Working on paper/notebook
c146 Someone is awakening somewhere
c147 Someone is cooking something
c148 Someone is dressing
c149 Someone is laughing
c150 Someone is running somewhere
c151 Someone is going from standing to sitting
c152 Someone is smiling
c153 Someone is sneezing
c154 Someone is standing up from somewhere
c155 Someone is undressing
c156 Someone is eating something

List of object types in Action Genome

person
bag
bed
blanket
book
box
broom
chair
closetcabinet
clothes
cupglassbottle
dish
door
doorknob
doorway
floor
food
groceries
laptop
light
medicine
mirror
papernotebook
phonecamera
picture
pillow
refrigerator
sandwich
shelf
shoe
sofacouch
table
television
towel
vacuum
window
