Academic Newsletter | CN-Celeb-AV: Release of a Multi-Scene Audio-Visual Multimodal Dataset

Recently, the speech and language technology team at Tsinghua University, together with Beijing University of Posts and Telecommunications, released CN-Celeb-AV, a multi-scene audio-visual multimodal dataset of Chinese celebrities intended for researchers working on audio-visual person recognition (AVPR). The dataset contains more than 419,000 video clips from 1,136 Chinese celebrities, covers 11 different scenarios, and provides two standard evaluation sets, one with complete and one with incomplete modality information. Researchers can find CN-Celeb-AV on the resource-sharing website http://cnceleb.org and apply for a free download.

Background

Biometric recognition automatically measures and analyzes human biological characteristics to authenticate a person's identity. Voiceprint and face are two of the most popular biometric traits, mainly because they can be collected remotely and without physical contact. In the past few years, with the rise of deep learning and the accumulation of large-scale data, the performance of the corresponding technologies, speaker recognition and face recognition, has improved significantly, and a wide range of applications have emerged.

Despite this impressive progress, voiceprint recognition and face recognition each face their own practical difficulties. For audio-based voiceprint recognition, the challenges include content variation, channel differences, background noise, speaking style, and even changes in the speaker's physiological state. For video-based face recognition, the challenges come from illumination changes, pose variation, unknown occlusions, and so on.

To overcome the performance ceiling of a single modality, an intuitive idea is to combine the complementary information of the audio and visual modalities and build an audio-visual person recognition (AVPR) system, which should be more robust, especially in complex practical scenarios. Following this idea, NIST introduced an audio-visual multimodal track in SRE 2019 [1] and continued it in SRE 2021 [2]. Existing AVPR research mostly adopts two approaches: representation fusion and joint modeling. Although these studies have achieved good results, their training and evaluation data come from single, relatively limited scenarios and hardly reflect the complexity of real applications, where part of the modal information is often corrupted or missing.
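As a rough illustration of these two strategies (a sketch only, not the implementation of any specific system cited above), the snippet below contrasts score-level fusion, which weights per-modality cosine scores, with a simple form of representation fusion, which concatenates normalized audio and face embeddings before scoring; the embedding dimensions and the fusion weight are arbitrary placeholders.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score_fusion(enr_audio, tst_audio, enr_face, tst_face, w=0.5):
    """Score-level fusion: weighted sum of per-modality cosine scores."""
    return w * cosine(enr_audio, tst_audio) + (1.0 - w) * cosine(enr_face, tst_face)

def embedding_fusion(audio_emb, face_emb):
    """Representation fusion: concatenate length-normalized modality embeddings."""
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    f = face_emb / (np.linalg.norm(face_emb) + 1e-8)
    return np.concatenate([a, f])

# Toy trial with random 192-dim speaker and 512-dim face embeddings.
rng = np.random.default_rng(0)
enr_a, tst_a = rng.normal(size=192), rng.normal(size=192)
enr_f, tst_f = rng.normal(size=512), rng.normal(size=512)
print("score-level fusion:", score_fusion(enr_a, tst_a, enr_f, tst_f))
print("embedding-level fusion:", cosine(embedding_fusion(enr_a, enr_f), embedding_fusion(tst_a, tst_f)))
```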

To facilitate AVPR research in complex application scenarios, we release a new AVPR dataset named CN-Celeb-AV. Its collection process follows the principles of CN-Celeb [3,4] and covers both the audio and visual modalities. The dataset consists of a "complete modality" part and an "incomplete modality" part, covers 11 real-world scenarios, and contains more than 419,000 video clips from 1,136 individuals (Chinese celebrities, vloggers, and amateurs). We hope that CN-Celeb-AV will serve as a suitable AVPR benchmark with real-world complexity.

Data characteristics

CN-Celeb-AV possesses several desirable properties that make it suitable for AVPR research addressing real-world challenges.

1. Real-world uncertainty: almost all video clips contain real-world uncertainty, e.g., variation in audio content, noise, channel, number of speakers, and speaking style on the audio side, and variation in pose, lighting, expression, resolution, and occlusion on the visual side.

2. Multi-scenario single speaker: the dataset contains a large amount of data in which a single speaker appears in multiple scenarios, enabling cross-scenario and cross-session tests that are closer to real-world applications.

3. Modality incompleteness: in some video clips only part of the modality information is complete and observable, i.e., one modality may be missing. This makes the dataset suitable for evaluating AVPR systems under realistic complex conditions, which is also where multimodal technology is expected to provide the greatest value.

Table 1 Overview of CN-Celeb-AV data


Table 2 CN-Celeb-AV scene segmentation


CN-Celeb-AV has two benchmark evaluation sets:

1. "Complete mode" evaluation set CNC-AV-Eval-F: Most audio and video clips contain complete audio information and video information.

2. "Incomplete Mode" evaluation set CNC-AV-Eval-P: contains a large number of audio and video clips where audio or video information is damaged or completely lost. For example, the target person's face and/or voice may disappear briefly, be corrupted by noise, or be completely unusable.

Preliminary verification

We used the open-source voiceprint recognition model ECAPA-TDNN, the face detection model RetinaFace, and the face recognition model InsightFace to run a series of comparative experiments on the MOBIO [5], VoxCeleb [6], and CN-Celeb-AV evaluation sets. The experimental results are shown in Table 3 below.

Table 3 Experimental results

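As a minimal sketch of how such unimodal front-ends can be assembled from off-the-shelf tools, the snippet below extracts a speaker embedding with SpeechBrain's pretrained ECAPA-TDNN and a face embedding with the insightface FaceAnalysis pipeline (RetinaFace detection plus a face recognizer); the model source, file paths, and pre-processing are assumptions, not the exact configuration used for Table 3.

```python
import cv2
import numpy as np
import torchaudio
from insightface.app import FaceAnalysis              # RetinaFace detection + face embedding
from speechbrain.pretrained import EncoderClassifier  # ECAPA-TDNN speaker encoder

# Pretrained ECAPA-TDNN speaker encoder (VoxCeleb checkpoint released by SpeechBrain).
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# RetinaFace detector and face recognition model bundled in insightface.
face_app = FaceAnalysis()
face_app.prepare(ctx_id=0, det_size=(640, 640))

def audio_embedding(wav_path: str) -> np.ndarray:
    """Speaker embedding for one utterance (expects 16 kHz mono audio)."""
    signal, _ = torchaudio.load(wav_path)
    return spk_encoder.encode_batch(signal).squeeze().cpu().numpy()  # 192-dim

def face_embedding(image_path: str):
    """Face embedding for one frame, or None when no face is detected."""
    faces = face_app.get(cv2.imread(image_path))
    if not faces:
        return None                       # visual modality missing for this frame
    return faces[0].normed_embedding      # 512-dim, L2-normalized

# Cosine scoring of enrollment vs. test embeddings, together with the fusion
# helpers sketched earlier, then yields the audio, visual, and fused scores.
```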

First, both unimodal and multimodal systems achieve good performance on the MOBIO and VoxCeleb1 evaluation sets. This is expected, since the modality information in both datasets is almost complete. In contrast, on the two CNC-AV-Eval sets, both the audio and visual modalities perform much worse, mainly because the CNC-AV-Eval data are far more complex. This shows that current mainstream recognition technology, whether audio or visual, still cannot cope with the complexity of the real world.

Second, the multimodal system consistently outperforms the unimodal systems on all evaluation sets, highlighting the benefit of multimodal information. Even so, the performance of the multimodal system on the two CNC-AV-Eval sets is still poor, suggesting that multimodal person recognition in complex scenarios requires further research.
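For reference, verification performance in comparisons like these is commonly summarized by the equal error rate (EER); a minimal way to compute it from trial scores and target/non-target labels (assuming higher scores mean more likely same identity) is sketched below.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """Approximate EER: the point where false-accept and false-reject rates meet.

    scores: similarity score per trial; labels: 1 for target trials, 0 otherwise.
    """
    order = np.argsort(-scores)                 # sweep thresholds from high to low
    labels = labels[order].astype(float)
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    frr = 1.0 - np.cumsum(labels) / n_tar       # targets rejected at each threshold
    far = np.cumsum(1.0 - labels) / n_non       # non-targets accepted at each threshold
    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2.0)

# Example: random scores give an EER near 50%.
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(size=1000), rng.integers(0, 2, size=1000)))
```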

Download

  • Paper

    • https://arxiv.org/abs/2305.16049

  • Data application

    • http://cnceleb.org/

  • Collection tool

    • https://github.com/smile-struggler/CN-Celeb3_collector

  • Baseline system

    • https://gitlab.com/csltstu/sunine/-/tree/cncav/

References

[1] S. O. Sadjadi, C. S. Greenberg, E. Singer, D. A. Reynolds et al., “The 2019 NIST audio-visual speaker recognition evaluation,” in Odyssey, 2020, pp. 259–265.

[2] S. O. Sadjadi, C. Greenberg, E. Singer, L. Mason, and D. Reynolds, “The 2021 NIST speaker recognition evaluation,” arXiv preprint arXiv:2204.10242, 2022. 

[3] L. Li, R. Liu, J. Kang, Y. Fan, H. Cui, Y. Cai, R. Vipperla, T. F. Zheng, and D. Wang, “CN-Celeb: multi-genre speaker recognition,” Speech Communication, vol. 137, pp. 77–91, 2022.

[4] Y. Fan, J. Kang, L. Li, D. Wang et al., “CN-Celeb: a challenging Chinese speaker recognition dataset,” in ICASSP. IEEE, 2020, pp. 7604–7608.

[5] C. McCool, S. Marcel, A. Hadid, M. Pietikäinen et al., “Bi-modal person recognition on a mobile phone: using mobile phone data,” in ICMEW. IEEE, 2012, pp. 635–640.

[6] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017, pp. 2616–2620.
