Hot off the press! 107 open-source datasets from ECCV 2022, covering global AI research hotspots

The biennial ECCV 2022 has finally arrived, and many of you are no doubt curious about the new directions, new algorithms, and new datasets it brings. Today we look at the conference from the dataset angle and have selected 8 datasets released at ECCV 2022, ranging from huge labeled collections to novel and interesting tasks. Come take a look!

 

This article also comes with a bonus: a full ECCV 2022 dataset collection. We combed through the more than 1,600 accepted ECCV 2022 papers, identified 107 datasets, and organized them by task type. Come and see whether any of them fits your research!

OpenDataLab will also be curating and publishing these open-source dataset resources on its platform over time, so stay tuned~

This article consists of two parts:

1. Introduction to selected ECCV 2022 datasets

Detailed introductions to 8 datasets, chosen by task type and citation rate, covering annotation types, application scenarios, and key result figures from the papers.

2. ECCV 2022 dataset collection

A list of all 107 datasets, grouped into coarse- and fine-grained task types, with links to the corresponding papers and brief explanatory notes.

(For the complete information, visit the OpenDataLab dataset homepage on GitHub, or contact our assistant to obtain it.)

1. Introduction to Selected Datasets

This section includes detailed introductions for 8 datasets across multiple task domains:

● HuMMan (4D Human Perception and Modeling)

● OpenLane/Waymo (autonomous driving)

● PartImageNet (part segmentation)

● Sherlock (visual abductive reasoning)

● MovieCuts (cut type recognition)

● DexMV (Imitation Learning for Robots)

● ViCo (listener feedback video generation)

No.1  HuMMan

4D human perception and modeling are fundamental tasks in vision and graphics with diverse applications. With the advancement of new sensors and algorithms, there is an increasing need for more versatile datasets. HuMMan [1] is a large-scale multimodal 4D human dataset comprising 1,000 human subjects, 400k sequences, and 60M frames.

The features of HuMMan are:

1) Multimodal data and annotations, including color images, point clouds, keypoints, SMPL parameters, and textured meshes;

2) The sensor suite includes mainstream mobile devices;

3) A set of 500 actions covering the basic movements of everyday human activity;

4) Support for multiple tasks such as action recognition, pose estimation, parametric human recovery, and textured mesh reconstruction.

Figure 1. HuMMan provides multiple data modalities and annotation forms: a) color images, b) point clouds, c) keypoints, d) SMPL parameters, e) meshes, f) textures. Each sequence is also annotated with one of 500 action labels, and each subject additionally has two high-resolution scans, in natural clothing and minimal clothing [1]

Table 1. Comparison of HuMMan with published datasets [1]

HuMMan is competitive in the number of subjects (#Subj), actions (#Act), sequences (#Seq), and frames (#Frame). Data sources are characterized as: Video, i.e. continuous data, not limited to RGB sequences; and Mobile, i.e. whether mobile devices are included in the sensor suite.

In addition, HuMMan offers annotations across many modalities to support multiple tasks: RGB (three-channel color images); D/PC (depth images or point clouds, counting only real point clouds captured by depth sensors); Act (action labels); K2D (2D keypoints); K3D (3D keypoints); Param (parameters of a statistical body model such as SMPL); Mesh (meshes); and Txtr (textures).
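To make the modality list concrete, here is a minimal sketch of how a single HuMMan frame and its annotations could be organized in code. The class, field names, and the `load_frame` helper are illustrative assumptions, not the dataset's actual API; consult the official toolchain for the real file formats.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HuMManFrame:
    """Hypothetical container for one frame of HuMMan's multimodal data."""
    rgb: np.ndarray           # (H, W, 3) color image
    point_cloud: np.ndarray   # (N, 3) real points from the depth sensor
    keypoints_2d: np.ndarray  # (K, 2) 2D keypoints
    keypoints_3d: np.ndarray  # (K, 3) 3D keypoints
    smpl_params: dict         # e.g. {"betas": ..., "body_pose": ..., "global_orient": ...}
    action_label: int         # one of the 500 action classes (annotated per sequence)

def load_frame(sequence_dir: str, frame_idx: int) -> HuMManFrame:
    """Hypothetical loader stub; real file names and formats follow the official release."""
    ...
```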

Extensive experiments on HuMMan show that many challenges remain in research areas such as fine-grained action recognition, dynamic human mesh reconstruction, point-cloud-based parametric human recovery, and cross-device domain gaps, leaving ample room for future research.

HuMMan dataset link: https://caizhonggang.github.io/projects/HuMMan/

For the convenience of AI researchers, OpenDataLab has indexed this dataset; open the link ( https://opendatalab.org.cn/OpenXD-HuMMan ) to download it for free at high speed.

No.2&3  OpenLane/Waymo

OpenLane [2] is the first real-world 3D lane dataset and the largest so far, with 200K frames and 880K carefully annotated lanes.

The dataset builds on the public perception dataset Waymo Open Dataset [3] and provides lane and closest-in-path object (CIPO) annotations for 1,000 segments.

Lane annotations include:

● Lane shape. Each 2D/3D lane is represented as a set of 2D/3D points.

● Lane category. Each lane has a category, such as double yellow lines or curbs.

● Lane attribute. Some lanes carry attributes such as being the left or right lane.

● Lane tracking ID. With the exception of curbs, each lane has a unique ID.

● Parking lines and curbs.

CIPO/scene annotations include:

● 2D bounding box. Its category indicates the object's level of importance.

● Scene label. It describes the scene in which the frame was collected.

● Weather label. It describes the weather in which the frame was collected.

● Time-of-day label. It records when the frame was collected.

(See: https://github.com/OpenPerceptionX/OpenLane/blob/main/anno_criterion/CIPO/README.md )
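To illustrate how these fields fit together, below is a hedged sketch that walks through a hypothetical per-frame annotation record containing lane and CIPO entries. The key names are assumptions for illustration and may not match the released JSON schema, which is documented in the OpenLane repository.

```python
# Hypothetical per-frame OpenLane-style record; real key names may differ.
example_record = {
    "lane_lines": [
        {
            "xyz": [[1.2, 0.1, 0.0], [5.0, 0.1, 0.02]],  # 3D lane as a list of points
            "category": "double_yellow",
            "attribute": "left_lane",
            "track_id": 17,
        }
    ],
    "cipo": [
        {"bbox_2d": [410, 220, 120, 90], "importance_level": 1}
    ],
    "scene": "highway",
    "weather": "sunny",
    "time_of_day": "daytime",
}

def summarize(record: dict) -> None:
    """Print a short summary of lanes and CIPO objects in one frame."""
    for lane in record["lane_lines"]:
        print(f"lane {lane['track_id']}: {lane['category']}, {len(lane['xyz'])} points")
    for obj in record["cipo"]:
        print(f"CIPO importance {obj['importance_level']} at bbox {obj['bbox_2d']}")
    print(record["scene"], record["weather"], record["time_of_day"])

summarize(example_record)
```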

Figure 2. Annotation example of OpenLane [2]

The Waymo Open Dataset [3] also appears in an ECCV 2022 accepted paper. It provides independent annotations for two sensors, lidar and camera, including 3D lidar detection and segmentation of vehicles and pedestrians, 2D camera detection and segmentation (for both images and videos), 2D-to-3D correspondences, and human keypoint labels.

Figure 3. Annotation example from Waymo (https://waymo.com/open/data/perception/)

OpenLane dataset link: https://github.com/OpenPerceptionX/OpenLane

Waymo dataset link: https://waymo.com/open/

For the convenience of AI researchers, OpenDataLab has indexed the OpenLane dataset; open the link ( https://opendatalab.org.cn/OpenLane ) to download it for free at high speed.

No.4  PartImageNet

PartImageNet [4] is a large-scale, high-quality dataset with part segmentation annotations.

It covers 158 classes from ImageNet [5] with about 24,000 images. These classes are grouped into 11 super-categories, and the part vocabulary is defined per super-category, as follows (the number in parentheses after each super-category name is the number of classes it contains):

Table 2. Annotation categories of PartImageNet [4]

PartImageNet has broad potential across research areas including part discovery, few-shot learning, and semantic segmentation. Below are some test images from the dataset with predictions from different models:

Figure 4. PartImageNet annotation examples and model prediction examples (source: the dataset's official website)
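If you want to experiment with the part masks, the sketch below assumes the annotations are distributed as COCO-style JSON and loads them with pycocotools; the file path is a placeholder, so check the project repository for the actual released files.

```python
from pycocotools.coco import COCO

# Placeholder path; see https://github.com/TACJu/PartImageNet for the released files.
coco = COCO("partimagenet_train.json")

# Part categories grouped under the 11 super-categories.
cats = coco.loadCats(coco.getCatIds())
supercats = sorted({c["supercategory"] for c in cats})
print(f"{len(cats)} part categories under {len(supercats)} super-categories")

# Binary part masks for the first image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
masks = [coco.annToMask(a) for a in anns]  # one (H, W) mask per part instance
```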

PartImageNet dataset link: https://github.com/TACJu/PartImageNet

No.5  Sherlock

Sherlock [6] is a dataset that combines instance-level bounding-box annotations with text annotations. The task: given a region of an image and a guiding clue, the machine must infer what that region implies or what is happening there, i.e. perform abductive reasoning.

The corpus includes 363K inferences across 103K images. Each image contains multiple bounding boxes, each annotated with a textual clue and an inference, yielding 363K (clue, inference) pairs in total and forming the first large-scale visual abductive reasoning dataset of its kind.
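For a sense of what one annotation looks like, here is a hypothetical (clue, inference) entry and a helper that turns it into a training pair. The field names and image path are illustrative assumptions, not the corpus's actual schema.

```python
# Hypothetical Sherlock-style annotation: a region (bounding box), a textual clue
# observed in that region, and the abductive inference a human drew from it.
example = {
    "image": "images/0001.jpg",       # placeholder path
    "bbox": [135, 80, 220, 310],      # x, y, width, height
    "clue": "a stethoscope hanging around the person's neck",
    "inference": "the person is a doctor or nurse on duty",
}

def as_training_pair(entry: dict) -> tuple:
    """Turn one entry into the (region + clue, inference) pair a model would be trained on."""
    prompt = (entry["image"], entry["bbox"], entry["clue"])
    target = entry["inference"]
    return prompt, target

print(as_training_pair(example))
```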

An example prediction from one of the best performing models is given below, along with human annotations:

Figure 5. Sherlock annotation example and model prediction example (source: the dataset's official website)

Sherlock dataset link: http://visualabduction.com/

No.6  MovieCuts

MovieCuts [7] is a large-scale dataset of annotated video cuts, containing 173,967 video clips labeled with 10 different cut types. Each sample consists of the two video shots joined by a cut, together with their accompanying audio.

The dataset targets a new task: cut type recognition. A cut splices two shots together so that the resulting video feels continuous rather than abrupt; cut type recognition is the task of identifying how the two shots have been joined.

Figure 6. Cut types in the MovieCuts dataset [7]. Cut types fall into two broad categories: visually driven (first row) and audio-visually driven (second row)

The paper benchmarks a range of audio-visual methods, including some that address the multimodal and multilabel nature of the problem. The best model achieves only 45.7% mAP, showing that high-accuracy cut type recognition remains an open and challenging problem.
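Since the task is evaluated as multilabel classification with mAP, the minimal sketch below shows how such a score can be computed with scikit-learn; the labels and scores are random stand-ins rather than real model outputs.

```python
import numpy as np
from sklearn.metrics import average_precision_score

NUM_CUT_TYPES = 10
rng = np.random.default_rng(0)

# Stand-in predictions: per-clip scores for each of the 10 cut types,
# and multi-hot ground-truth labels (a cut can carry more than one type).
y_true = rng.integers(0, 2, size=(200, NUM_CUT_TYPES))
y_score = rng.random(size=(200, NUM_CUT_TYPES))

# mAP = mean of per-class average precision, the metric reported in the paper.
per_class_ap = average_precision_score(y_true, y_score, average=None)
print("mAP:", per_class_ap.mean())
```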

MovieCuts dataset link: https://www.alejandropardo.net/publication/moviecuts/

No.7  DexMV

Although computer vision has made significant progress in understanding hand-object interaction, complex dexterous manipulation remains very challenging for robots. DexMV [8] is a dataset built for imitation learning of dexterous manipulation from human videos, aimed at helping robots perform such tasks.

DexMV comprises a platform and pipeline consisting of: (a) a simulation system for complex dexterous manipulation tasks with a multi-fingered robotic hand, and (b) a computer vision system for recording large-scale demonstrations of human hands performing the same tasks. From the videos, 3D hand and object poses are estimated, and the human motion is then translated into robot demonstrations. The paper applies and compares various imitation learning algorithms with these demonstrations; the results show that demonstrations substantially improve robot learning and enable solving complex tasks that reinforcement learning alone cannot.

Figure 7. DexMV platform and pipeline [8]. The platform consists of a computer vision system (yellow), a simulation system (blue), and a demonstration translation module (green). The computer vision system records videos of human manipulation; the same task is set up for a robotic hand in the simulation system; 3D hand and object poses are estimated from the videos, and the demonstration translation module then generates robot demonstrations for imitation learning
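The three stages can be summarized as pseudocode. Every function below is a hypothetical placeholder for the corresponding component in Figure 7, not DexMV's actual API.

```python
# High-level sketch of the DexMV pipeline described above.
# All function bodies are placeholders, not DexMV's actual code.

def estimate_hand_object_poses(videos):
    """Stage 1 (vision): 3D hand and object pose estimation from human videos."""
    raise NotImplementedError("placeholder for the pose-estimation component")

def translate_demonstrations(hand_poses, object_poses, robot):
    """Stage 2: retarget human hand motion into robot-hand demonstrations."""
    raise NotImplementedError("placeholder for the demonstration-translation module")

def imitation_learning(task, demos, iterations):
    """Stage 3: demonstration-augmented policy learning in simulation."""
    raise NotImplementedError("placeholder for the policy-learning component")

def dexmv_pipeline(human_videos, task):
    hand_poses, object_poses = estimate_hand_object_poses(human_videos)
    demos = translate_demonstrations(hand_poses, object_poses, robot="multi-fingered hand")
    return imitation_learning(task, demos, iterations=1000)
```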


DexMV dataset link: https://yzqin.github.io/dexmv/

No.8  ViCo

The ViCo [9] dataset targets context-aware facial video generation; the application scenario is generating listener feedback (such as nodding or smiling) in face-to-face conversations.

ViCo covers 92 identities (67 speakers and 76 listeners) and 483 paired video-audio clips in a "speak-listen" setup, where the listener produces real-time reaction feedback with different attitudes (positive, neutral, negative) according to the speaker's voice and video. Unlike traditional speech-to-gesture or talking-head generation, listener head generation takes the speaker's audio and video signals as input and produces non-verbal feedback (e.g. head motion, facial expressions) in real time.

This dataset supports a wide range of applications, such as human-human interaction, video-to-video conversion, cross-modal understanding and generation.

Figure 8. Listener feedback from ground-truth annotations and from the model (source: the dataset's official website)

The dataset consists of three parts:

●  videos/*.mp4 : all videos (excluding audio)

●  audios/*.wav : all audios

●  *.csv : metadata (see Table 3 for the specific fields)

Table 3. Fields included in the metadata (source: the dataset's official website)
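A minimal sketch of pairing the three parts might look like the following, assuming the metadata CSV names each clip's video and audio files along with the listener attitude. The column names used here are guesses; check the released *.csv (Table 3) for the real fields.

```python
import pandas as pd

# Hypothetical column names; see Table 3 / the released *.csv for the real schema.
meta = pd.read_csv("vico_metadata.csv")

for _, row in meta.iterrows():
    video_path = f"videos/{row['video_id']}.mp4"   # clip video (no audio track)
    audio_path = f"audios/{row['audio_id']}.wav"   # corresponding audio
    attitude = row.get("attitude", "unknown")      # e.g. positive / neutral / negative
    print(video_path, audio_path, attitude)
```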

ViCo dataset link: https://project.mhzhou.com/vico

2. ECCV 2022 dataset collection

Based on the tasks covered by the 107 ECCV 2022 datasets, we group them into several broad directions (click a title to view the detailed list):

●  Classification, detection, tracking, segmentation, keypoints, and recognition (40)

●  Image processing and generation (21)

●  Multimodality (20)

●  Others (26)

For complete information such as source papers and data download links, visit the OpenDataLab dataset homepage on GitHub, or contact our assistant to obtain it.

References

[1] Cai Z, Ren D, Zeng A, et al. HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling. arXiv preprint arXiv:2204.13686, 2022.
[2] Chen L, Sima C, Li Y, et al. PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark. arXiv preprint arXiv:2203.11089, 2022.
[3] Mei J, Zhu A Z, Yan X, et al. Waymo Open Dataset: Panoramic Video Panoptic Segmentation. arXiv preprint arXiv:2206.07704, 2022.
[4] He J, Yang S, Yang S, et al. PartImageNet: A Large, High-Quality Dataset of Parts. arXiv preprint arXiv:2112.00933, 2021.
[5] Deng J, Dong W, Socher R, et al. ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009: 248-255.
[6] Hessel J, Hwang J D, Park J S, et al. The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning. arXiv preprint arXiv:2202.04800, 2022.
[7] Pardo A, Heilbron F C, Alcázar J L, et al. MovieCuts: A New Dataset and Benchmark for Cut Type Recognition. arXiv preprint arXiv:2109.05569, 2021.
[8] Qin Y, Wu Y H, Liu S, et al. DexMV: Imitation Learning for Dexterous Manipulation from Human Videos. arXiv preprint arXiv:2108.05877, 2021.
[9] Zhou M, Bai Y, Zhang W, et al. Responsive Listening Head Generation: A Benchmark Dataset and Baseline. arXiv preprint arXiv:2112.13548, 2021.

- End -

That's all for this post. For a wealth of dataset resources, visit the OpenDataLab official website; for more open-source tools and projects, visit the OpenDataLab GitHub space. If there is anything else you would like to see, let our assistant know. More datasets coming online, more in-depth dataset interpretations, responsive online Q&A, and an active community of peers await: add WeChat opendatalab_yunying to join the official OpenDataLab community group.
