Charades & CharadesEgo & Action Genome: Dataset and Paper Summary

0. Preface

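This post covers the Charades family of datasets:
- Charades: ECCV 2016. The first dataset for recognizing everyday activities in home indoor scenes, collected through crowdsourcing.
  - The collection procedure is unusual: workers first write scripts (sentences built from given keywords), workers then film the videos themselves at home, and other workers annotate the results.
- CharadesEgo: CVPR 2018. The first action recognition dataset with paired videos.
  - "Paired" means that the same sequence of actions is available both as a first-person video and as a third-person video.
  - The paper aims to model first-person and third-person videos jointly.
- Action Genome: CVPR 2020. A second round of annotation on top of Charades that adds human-object relationships.
  - The new annotations are bounding boxes for the person and for objects, plus the relationships between the person and each object.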

1. Charades

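Basic information:
- Paper, official website
Getting the data: download directly from the official website; a download manager such as Xunlei (迅雷) works.
Dataset at a glance:
- The dataset was collected through crowdsourcing on Amazon Mechanical Turk.
- All videos are filmed indoors, in people's homes.
- There are 157 action classes (c000 to c156, listed in the appendix), all everyday activities, or in the paper's words, boring household activities.
- Besides action labels and their start and end times, the annotations include a large amount of free-text description.
Size: the training/test sets contain 7,985/1,863 videos and 49,809/16,691 action segments.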

1.1. Dataset overview

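The README on the official website describes the dataset in detail.
- It lists every file and explains each one (videos, frames, labels).
- What follows is only a translated excerpt of that README; refer to the README itself for the full details.
Main files (all of them can be found and downloaded on the official website; Xunlei works):
- README.txt: detailed description of the dataset
- license.txt: free for academic institutions, no commercial use
- Charades.zip: annotation files, described separately below
- Charades_v1.zip: original videos, all in a single folder.
- Charades_v1_480.zip: compressed videos, all in a single folder.
- Charades_caption.zip: caption data and the corresponding evaluation code.
- Charades_v1_rgb.tar: RGB frames extracted from the videos
- Charades_v1_flow.tar: optical flow
- Charades_v1_features_rgb.tar.gz: features of the RGB frames; I have not checked how they were extracted
- Charades_v1_features_flow.tar.gz: features of the optical-flow frames; I have not checked how they were extracted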
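Annotation details:
- Class files:
  - Charades_v1_classes.txt: action id and action name, e.g. c013 Washing a table
  - Charades_v1_objectclasses.txt: object (noun) id and name, e.g. o007 chair
  - Charades_v1_verbclasses.txt: verb id and verb, e.g. v001 close
  - Charades_v1_mapping: action id with the object id and verb id it is composed of, e.g. c012 o033 v026
- Per-video labels:
  - Stored in Charades_v1_train.csv and Charades_v1_test.csv.
  - Each row has 11 fields:
    - id: video id, i.e. the video file name without its extension
    - subject: identifier of the subject (the person who filmed and appears in the video)
    - scene: the scene, one of 15 indoor scenes such as living room, kitchen, or bathroom
    - quality: video quality as judged by an annotator, on a 7-point scale (7 is best)
    - relevance: how well the video matches the script field, as judged by an annotator, on a 7-point scale (7 is most relevant)
    - verified: whether a second annotator agreed that the video matches the script
    - script: the script (see the data collection procedure in the next section)
    - objects: the objects involved, given by name rather than by id
    - descriptions: descriptions of the video written by annotators who did not film it
    - actions: the actions, each given as an action id plus start and end time in seconds from the start of the video, with two decimal places, e.g. c092 11.90 21.20
    - length: video length in seconds, with two decimal places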
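The class-file lines and the `actions` field described above are easy to parse by hand. The following is a minimal sketch, not code from the dataset authors. It assumes that several actions in one cell are separated by semicolons (check this against the README), and the second triplet in the example string is made up for illustration. The names `ActionSegment`, `parse_class_line`, and `parse_actions` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class ActionSegment(NamedTuple):
    class_id: str  # e.g. "c092"
    start: float   # seconds from the start of the video
    end: float     # seconds from the start of the video


def parse_class_line(line: str) -> Tuple[str, str]:
    """Split a class-file line such as 'c013 Washing a table' into (id, name)."""
    class_id, name = line.strip().split(" ", 1)
    return class_id, name


def parse_actions(field: str) -> List[ActionSegment]:
    """Parse an `actions` cell; an empty cell yields an empty list.

    Assumption: triplets are separated by ';' and written as 'id start end'.
    """
    segments = []
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue
        class_id, start, end = item.split()
        segments.append(ActionSegment(class_id, float(start), float(end)))
    return segments


# The second triplet is an invented example value.
segments = parse_actions("c092 11.90 21.20;c147 0.00 12.60")
print(segments)
print(parse_class_line("c013 Washing a table"))
```

Keeping the start and end times as floats makes it straightforward to turn segments into per-frame labels later, which is what the detection task in section 1.3 needs.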

1.2. The paper

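Why build this dataset?
- Earlier datasets were mostly built by collecting and annotating YouTube videos, which rarely show home indoor scenes.
- Many everyday activities are hard to find in YouTube videos, films, or TV shows.
The paper introduces a data collection approach called Hollywood in Homes and releases Charades v1.0.
Comparison with other datasets: [figure omitted]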

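Data collection procedure:
- Step 1: generate indoor scripts. The key is to produce a wide variety of scripts while making sure each kind has enough samples.
  - The focus is on indoor scenes, so 15 home-related scene types were chosen, such as bedroom and living room.
  - Writing scripts requires a vocabulary of everyday actions and objects. To choose it, the authors analyzed the scripts of 549 movies, studied which actions and objects occur in home indoor scenes, and settled on 40 object classes and 30 action classes.
  - In practice, each worker is shown 5 random nouns and 5 random verbs, picks 2 of each, and writes a short script with them. The script must involve only 1-2 people and include some interaction.
- Step 2: film a video that follows the script.
  - Workers are asked to act out a script in a 30-second video.
  - The paper also discusses how to recruit people on AMT to film videos more cheaply.
- Step 3: have other workers describe the video and judge whether it matches the script.
  - For each video, other workers are asked to describe it in one sentence.
  - For each video, an algorithm proposes the objects likely to be involved, and workers check whether the proposals are correct.
  - Each object class has 4-5 associated actions, and workers judge whether each of them occurs in the video.
  - Crowdsourcing algorithms are used to refine the results.
  - A separate group of annotators then marks the start and end time of each action, given the video and its actions.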

1.3. Evaluation metrics

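The official website provides evaluation code, unfortunately only in MATLAB.
There are two tasks: action recognition and temporal action detection.
Action recognition:
- The output format is video_id, vector
- For each video, output a vector of length num_classes giving the probability that each action class occurs.
- This is a multi-label recognition task.
- The final metric is mAP, where AP is computed per class and then averaged over classes.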
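To make the metric concrete, here is a minimal sketch of per-class AP and mAP for a multi-label task. It uses the common ranking definition of AP and is not a port of the official MATLAB evaluation, so its numbers may differ slightly from the official script. The function names and the toy scores are made up for illustration.

```python
from typing import List, Optional, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> Optional[float]:
    """AP for one class: mean precision at the rank of each positive video."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positive video has no defined AP.
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(scores: List[List[float]], labels: List[List[int]]) -> float:
    """scores and labels are indexed as [video][class]; classes without positives are skipped."""
    num_classes = len(scores[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision([row[c] for row in scores], [row[c] for row in labels])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)


# Toy example: 4 videos, 2 classes, multi-label ground truth.
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
print(mean_average_precision(scores, labels))
```

In the toy example, class 0 has AP (1 + 2/3) / 2 and class 1 has AP 1, so the mAP is 11/12.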
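Temporal action detection:
- In Charades this essentially amounts to classifying every frame.
- The output should have the form video_id, framenumber, vector
- To keep the output size manageable, the evaluation code splits each video into 25 equal parts and scores one frame from each part: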

for j = 1:frames_per_video
    % Evenly spaced timestamps: 0, time/25, 2*time/25, ..., 24*time/25 (frames_per_video = 25)
    timepoint(j) = (j-1)*time/frames_per_video;
end
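For readers without MATLAB, the same sampling rule can be written in Python. This is a direct transcription of the loop above under the assumption that `time` is the video length in seconds; the function name `sample_timepoints` is hypothetical.

```python
from typing import List


def sample_timepoints(length: float, frames_per_video: int = 25) -> List[float]:
    """Return evenly spaced timestamps in seconds: 0, length/n, ..., (n-1)*length/n."""
    return [j * length / frames_per_video for j in range(frames_per_video)]


timepoints = sample_timepoints(30.0)
print(len(timepoints), timepoints[:3], timepoints[-1])
```

Note that the last timestamp is one step before the end of the video, so a prediction is never requested for the final instant.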

2. CharadesEgo

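Basic information: paper, official website
Dataset overview:
- The collection procedure is broadly the same as for Charades.
- The evaluation metrics are the same as for Charades.
The dataset targets two problems:
- First-person action recognition (no comparable dataset existed before, or the existing ones were too small)
- Combining third-person and first-person views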

2.1. Differences from Charades

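The data source changed: Charades only has third-person videos, whereas here every sample is a pair of first-person and third-person videos.
- Note that the two videos are filmed separately, with the same actions performed in both.
- I have not watched the videos yet, so I do not know how well the pairs match or whether they are aligned in time.
- First-person videos can be recorded in two ways: holding the phone in one hand and acting with the other, or fixing the camera to the head and acting with both hands. The authors prefer the second, so submissions recorded that way earn a bonus.
The label format differs slightly: each row has 13 columns, two more than in Charades:
- egocentric: whether the video is first-person
- charades_video: which Charades training sample the video corresponds to.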
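The two extra columns are enough to separate the views with the standard `csv` module. The sketch below makes several assumptions that should be checked against the real CSV: that `egocentric` is stored as "Yes"/"No", and that both recordings of a pair carry the same `charades_video` value. The sample rows and ids are invented, and only three of the 13 columns are included.

```python
import csv
import io
from typing import Dict, List

# Invented rows; real files have 13 columns and different ids.
SAMPLE_CSV = """id,egocentric,charades_video
VID01EGO,Yes,SRC01
VID01,No,SRC01
VID02,No,SRC02
"""


def group_views(csv_text: str) -> Dict[str, Dict[str, List[str]]]:
    """Group video ids by charades_video, split into first- and third-person lists."""
    groups: Dict[str, Dict[str, List[str]]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        views = groups.setdefault(
            row["charades_video"], {"first_person": [], "third_person": []}
        )
        # Assumption: the flag is the literal string "Yes" for first-person videos.
        key = "first_person" if row["egocentric"] == "Yes" else "third_person"
        views[key].append(row["id"])
    return groups


print(group_views(SAMPLE_CSV))
```

Groups with an empty `first_person` list are third-person-only entries under these assumptions and can be filtered out when paired data is required.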

2.2. Paper details

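The paper is titled Actor and Observer: Joint Modeling of First and Third-Person Videos.
- As the title says, the goal is to model paired first-person and third-person videos jointly.
- The dataset was introduced only to make that possible; most of the paper is about the model.
The paper proposes a structure similar to a triplet loss: [figure omitted]

This structure is used for matching first-person and third-person videos, aligning them in time, and zero-shot first-person action recognition.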
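For readers unfamiliar with the idea, the following sketch shows the standard triplet margin loss that the structure is compared to. It is not the paper's actual objective; the embeddings, the margin value, and the function names are made up for illustration.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Invented 2-D embeddings: a third-person frame (anchor), the matching
# first-person frame (positive), and a non-matching first-person frame (negative).
third_person = [0.0, 0.0]
matching_ego = [0.0, 1.0]
other_ego = [3.0, 4.0]
print(triplet_margin_loss(third_person, matching_ego, other_ego))  # 0.0
print(triplet_margin_loss(third_person, other_ego, matching_ego))  # 5.0
```

The loss is zero once the matching pair is closer than the non-matching pair by at least the margin, which is the property that lets such an embedding be used for cross-view matching.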

3. Action Genome

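Basic information:
- Paper, official website, GitHub
- Paper walkthrough
Download: the videos come directly from the Charades website (Xunlei works); the new Action Genome annotation files are hosted on Google Drive.
Paper facts:
- Field: video understanding (I have not read it closely enough to tell whether it can serve as a spatio-temporal action detection dataset)
- Affiliation: Stanford University
- Venue: CVPR 2020
Dataset in one sentence: a second round of annotation on Charades that adds human-object relationships.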

3.1. Dataset overview

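Overview: a subset of video frames is selected and annotated with bounding boxes for the person and for objects, and with the relationships between them.
- The dataset consists of 5 files.
Person annotations
- Stored in person_bbox.pkl, which loads as a dict
- Each key is a video file name plus a frame file name, e.g. 001YG.mp4/000089.png
- Each value holds the person annotations: bounding boxes and person keypoints.
- The bounding-box fields are bbox, bbox_score, bbox_size, and bbox_mode
- The keypoint fields are keypoints and keypoints_logits
Object annotations
- Stored in object_bbox_and_relationship.pkl, which loads as a dict
- Keys have the same form, e.g. 001YG.mp4/000089.png
- Each value is a list with one element per object, holding the object's annotations and its relationships to the person.
- Each object has the fields class, bbox, attention_relationship, spatial_relationship, contacting_relationship, metadata, and visible: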

{...
    'VIDEO_ID/FRAME_ID':
        [...
            {
                'class': 'book',
                'bbox': (x, y, w, h),
                'attention_relationship': ['looking_at'],
                'spatial_relationship': ['in_front_of'],
                'contacting_relationship': ['holding', 'touching'],
                'visible': True,
                'metadata': 
                    {
                        'tag': 'VIDEO_ID/FRAME_ID',
                        'set': 'train'
                    }
            }
        ...]
...}
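Given the structure above, one frame's annotations can be flattened into (person, relation, object) triplets, which is the per-frame scene graph. The sketch below works on an invented sample dict rather than the real file. It assumes that objects marked as not visible may carry `None` for their box and relationships, and that boxes are stored as (x, y, w, h) as shown. The helper names are hypothetical.

```python
import pickle

# Invented annotations in the layout shown above; the real data lives in
# object_bbox_and_relationship.pkl.
sample = {
    "001YG.mp4/000089.png": [
        {
            "class": "book",
            "bbox": (10.0, 20.0, 30.0, 40.0),
            "attention_relationship": ["looking_at"],
            "spatial_relationship": ["in_front_of"],
            "contacting_relationship": ["holding", "touching"],
            "visible": True,
            "metadata": {"tag": "001YG.mp4/000089.png", "set": "train"},
        },
        {
            "class": "chair",
            "bbox": None,
            "attention_relationship": None,
            "spatial_relationship": None,
            "contacting_relationship": None,
            "visible": False,
            "metadata": {"tag": "001YG.mp4/000089.png", "set": "train"},
        },
    ]
}

RELATION_KEYS = ("attention_relationship", "spatial_relationship", "contacting_relationship")


def xywh_to_xyxy(bbox):
    """Convert (x, y, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def frame_triplets(objects):
    """Return (person, relation, object class, box) tuples for the visible objects of one frame."""
    triplets = []
    for obj in objects:
        if not obj["visible"]:
            continue
        for key in RELATION_KEYS:
            for relation in obj[key] or []:
                triplets.append(("person", relation, obj["class"], xywh_to_xyxy(obj["bbox"])))
    return triplets


# Round-trip through pickle to mimic loading the annotation file from disk.
annotations = pickle.loads(pickle.dumps(sample))
for frame_key, objects in annotations.items():
    video_id, frame_id = frame_key.split("/")
    print(video_id, frame_id, frame_triplets(objects))
```

With the real file, the `pickle.loads(pickle.dumps(sample))` line would be replaced by opening the `.pkl` file in binary mode and calling `pickle.load` on it.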

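The remaining files are the list of object classes (see the appendix), the list of annotated frames, and the list of all relationship classes (figure omitted).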

3.2. Paper details

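What problem does it address?
- Computer vision usually treats an action (or activity) as a single, indivisible unit.
- Research in cognitive science and neuroscience, however, finds that people encode actions as hierarchical part structures.
What is the approach?
- A new dataset, built by re-annotating Charades:
  - Actions are viewed through an Action Genome.
  - That is, each action is decomposed into spatio-temporal scene graphs.
  - The graphs capture the relationships between the person and objects while the action takes place.
- Comparison with common datasets: [figure omitted]
- What "relationship" means: [figure omitted]
  - In the original Charades, relations are annotated at the clip level
  - In Action Genome, relations are annotated at the image (frame) level
- A person is surrounded by many objects: [figure omitted]
- The paper proposes a model called SGFB (Scene Graph Feature Banks) to exploit these annotations; I have not studied it in detail
How well does it work?
- With Action Genome one can do standard Charades classification, few-shot recognition, and spatio-temporal scene graph prediction
Open problems and takeaways
- The dataset is good, but it looks cumbersome to use in a real application; the details need further study.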

Appendix

Charades action labels

c000 Holding some clothes
c001 Putting clothes somewhere
c002 Taking some clothes from somewhere
c003 Throwing clothes somewhere
c004 Tidying some clothes
c005 Washing some clothes
c006 Closing a door
c007 Fixing a door
c008 Opening a door
c009 Putting something on a table
c010 Sitting on a table
c011 Sitting at a table
c012 Tidying up a table
c013 Washing a table
c014 Working at a table
c015 Holding a phone/camera
c016 Playing with a phone/camera
c017 Putting a phone/camera somewhere
c018 Taking a phone/camera from somewhere
c019 Talking on a phone/camera
c020 Holding a bag
c021 Opening a bag
c022 Putting a bag somewhere
c023 Taking a bag from somewhere
c024 Throwing a bag somewhere
c025 Closing a book
c026 Holding a book
c027 Opening a book
c028 Putting a book somewhere
c029 Smiling at a book
c030 Taking a book from somewhere
c031 Throwing a book somewhere
c032 Watching/Reading/Looking at a book
c033 Holding a towel/s
c034 Putting a towel/s somewhere
c035 Taking a towel/s from somewhere
c036 Throwing a towel/s somewhere
c037 Tidying up a towel/s
c038 Washing something with a towel
c039 Closing a box
c040 Holding a box
c041 Opening a box
c042 Putting a box somewhere
c043 Taking a box from somewhere
c044 Taking something from a box
c045 Throwing a box somewhere
c046 Closing a laptop
c047 Holding a laptop
c048 Opening a laptop
c049 Putting a laptop somewhere
c050 Taking a laptop from somewhere
c051 Watching a laptop or something on a laptop
c052 Working/Playing on a laptop
c053 Holding a shoe/shoes
c054 Putting shoes somewhere
c055 Putting on shoe/shoes
c056 Taking shoes from somewhere
c057 Taking off some shoes
c058 Throwing shoes somewhere
c059 Sitting in a chair
c060 Standing on a chair
c061 Holding some food
c062 Putting some food somewhere
c063 Taking food from somewhere
c064 Throwing food somewhere
c065 Eating a sandwich
c066 Making a sandwich
c067 Holding a sandwich
c068 Putting a sandwich somewhere
c069 Taking a sandwich from somewhere
c070 Holding a blanket
c071 Putting a blanket somewhere
c072 Snuggling with a blanket
c073 Taking a blanket from somewhere
c074 Throwing a blanket somewhere
c075 Tidying up a blanket/s
c076 Holding a pillow
c077 Putting a pillow somewhere
c078 Snuggling with a pillow
c079 Taking a pillow from somewhere
c080 Throwing a pillow somewhere
c081 Putting something on a shelf
c082 Tidying a shelf or something on a shelf
c083 Reaching for and grabbing a picture
c084 Holding a picture
c085 Laughing at a picture
c086 Putting a picture somewhere
c087 Taking a picture of something
c088 Watching/looking at a picture
c089 Closing a window
c090 Opening a window
c091 Washing a window
c092 Watching/Looking outside of a window
c093 Holding a mirror
c094 Smiling in a mirror
c095 Washing a mirror
c096 Watching something/someone/themselves in a mirror
c097 Walking through a doorway
c098 Holding a broom
c099 Putting a broom somewhere
c100 Taking a broom from somewhere
c101 Throwing a broom somewhere
c102 Tidying up with a broom
c103 Fixing a light
c104 Turning on a light
c105 Turning off a light
c106 Drinking from a cup/glass/bottle
c107 Holding a cup/glass/bottle of something
c108 Pouring something into a cup/glass/bottle
c109 Putting a cup/glass/bottle somewhere
c110 Taking a cup/glass/bottle from somewhere
c111 Washing a cup/glass/bottle
c112 Closing a closet/cabinet
c113 Opening a closet/cabinet
c114 Tidying up a closet/cabinet
c115 Someone is holding a paper/notebook
c116 Putting their paper/notebook somewhere
c117 Taking paper/notebook from somewhere
c118 Holding a dish
c119 Putting a dish/es somewhere
c120 Taking a dish/es from somewhere
c121 Wash a dish/dishes
c122 Lying on a sofa/couch
c123 Sitting on sofa/couch
c124 Lying on the floor
c125 Sitting on the floor
c126 Throwing something on the floor
c127 Tidying something on the floor
c128 Holding some medicine
c129 Taking/consuming some medicine
c130 Putting groceries somewhere
c131 Laughing at television
c132 Watching television
c133 Someone is awakening in bed
c134 Lying on a bed
c135 Sitting in a bed
c136 Fixing a vacuum
c137 Holding a vacuum
c138 Taking a vacuum from somewhere
c139 Washing their hands
c140 Fixing a doorknob
c141 Grasping onto a doorknob
c142 Closing a refrigerator
c143 Opening a refrigerator
c144 Fixing their hair
c145 Working on paper/notebook
c146 Someone is awakening somewhere
c147 Someone is cooking something
c148 Someone is dressing
c149 Someone is laughing
c150 Someone is running somewhere
c151 Someone is going from standing to sitting
c152 Someone is smiling
c153 Someone is sneezing
c154 Someone is standing up from somewhere
c155 Someone is undressing
c156 Someone is eating something

Action Genome object classes

person
bag
bed
blanket
book
box
broom
chair
closetcabinet
clothes
cupglassbottle
dish
door
doorknob
doorway
floor
food
groceries
laptop
light
medicine
mirror
papernotebook
phonecamera
picture
pillow
refrigerator
sandwich
shelf
shoe
sofacouch
table
television
towel
vacuum
window
