ICASSP 2023 Speaker Recognition Paper Collection

Among the papers accepted at ICASSP 2023 this year, about 64 fall under speaker recognition (voiceprint recognition). We have initially divided them into five categories: Speaker Verification (31 papers), Speaker Recognition (9 papers), Speaker Diarization (17 papers), Anti-Spoofing (4 papers), and Others (3 papers).


This article is the final installment of the ICASSP 2023 Speaker Recognition Paper Collection series. It covers a total of 16 papers in brief: Speaker Recognition (9 papers), Anti-Spoofing (4 papers), and Others (3 papers).

Speaker Recognition

1.A Study on Bias and Fairness in Deep Speaker Recognition

Title: Research on Bias and Fairness in Deep Speaker Recognition

Authors: Amirhossein Hajavi, Ali Etemad

Unit: Dept. ECE and Ingenuity Labs Research Institute, Queen’s University, Canada

Link: https://ieeexplore.ieee.org/document/10095572

Abstract: With the popularity of smart devices that use speaker recognition (SR) systems for personal authentication and personalized services, the fairness of SR systems has become an important focus. In this paper, we investigate the notion of fairness in recent SR systems based on 3 popular related definitions, namely statistical parity, equalized odds, and equal opportunity. We study 5 popular neural architectures and 5 commonly used loss functions for training SR systems, while evaluating their fairness with respect to gender and nationality groups. Our detailed experiments shed light on this concept and demonstrate that more complex encoder architectures better satisfy the definitions of fairness. Furthermore, we find that the choice of loss function significantly affects the bias of SR models.

With the ubiquity of smart devices that use speaker recognition (SR) systems as a means of authenticating individuals and personalizing their services, fairness of SR systems has become an important point of focus. In this paper we study the notion of fairness in recent SR systems based on 3 popular and relevant definitions, namely Statistical Parity, Equalized Odds, and Equal Opportunity. We examine 5 popular neural architectures and 5 commonly used loss functions in training SR systems, while evaluating their fairness against gender and nationality groups. Our detailed experiments shed light on this concept and demonstrate that more sophisticated encoder architectures better align with the definitions of fairness. Additionally, we find that the choice of loss functions can significantly impact the bias of SR models.
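As a concrete illustration of the three definitions named above, the sketch below computes group-level gaps from binary verification decisions. The array names and the use of simple accept/reject decisions are assumptions for illustration, not the paper's evaluation code.

```python
# Illustrative sketch (not the paper's code): group-wise gaps for the three
# fairness definitions, given binary accept/reject decisions, ground-truth
# trial labels, and a binary group attribute (e.g. gender).
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    g0, g1 = (group == 0), (group == 1)

    def rates(mask):
        tpr = y_pred[mask & (y_true == 1)].mean()   # true-positive rate
        fpr = y_pred[mask & (y_true == 0)].mean()   # false-positive rate
        return tpr, fpr

    tpr0, fpr0 = rates(g0)
    tpr1, fpr1 = rates(g1)
    return {
        # Statistical parity: equal overall acceptance rates across groups.
        "statistical_parity": abs(y_pred[g0].mean() - y_pred[g1].mean()),
        # Equal opportunity: equal true-positive rates across groups.
        "equal_opportunity": abs(tpr0 - tpr1),
        # Equalized odds: equal TPR and FPR across groups.
        "equalized_odds": max(abs(tpr0 - tpr1), abs(fpr0 - fpr1)),
    }
```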

2.An Improved Optimal Transport Kernel Embedding Method with Gating Mechanism for Singing Voice Separation and Speaker Identification

Title: An Improved Optimal Transport Kernel Embedding Method with Gating Mechanism for Singing Voice Separation and Speaker Identification

Authors: 1 Weitao Yuan, 1 Yuren Bian, 1 Shengbei Wang, 2 Masashi Unoki, 3 Wenwu Wang

Unit: 1 Tianjin Key Laboratory of Autonomous Intelligence Technology and Systems, Tiangong University, China

2 Japan Advanced Institute of Science and Technology, Japan

3 Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Link: https://ieeexplore.ieee.org/document/10096651

Abstract: Singing voice separation (SVS) and speaker identification (SI) are two classic problems in speech signal processing. Deep neural networks (DNNs) address both problems by extracting effective representations of the target signal from the input mixture. Since the essential features of a signal are well reflected in the underlying geometric structure of its feature distribution, extracting geometry-aware, distribution-related features of the target signal is a natural way to solve SVS/SI. To this end, this work introduces the concept of optimal transport (OT) to SVS/SI and proposes an improved optimal transport kernel embedding (iOTKE) to extract target-distribution-related features. iOTKE learns the OT from the input signal to the target signal based on a reference set learned from all training data. This not only maintains feature diversity but also preserves the underlying geometric structure of the target signal distribution. To further improve the feature selection capability, we extend the proposed iOTKE to a gated version, gated iOTKE (G-iOTKE), by incorporating a lightweight gating mechanism. The gating mechanism controls the effective information flow and enables the proposed method to select important features for a specific input signal. We evaluate the proposed G-iOTKE on SVS/SI. Experimental results show that this method provides better results than other models.

Singing voice separation (SVS) and speaker identification (SI) are two classic problems in speech signal processing. Deep neural networks (DNNs) solve these two problems by extracting effective representations of the target signal from the input mixture. Since essential features of a signal can be well reflected on its latent geometric structure of the feature distribution, a natural way to address SVS/SI is to extract the geometry-aware and distribution-related features of the target signal. To do this, this work introduces the concept of optimal transport (OT) to SVS/SI and proposes an improved optimal transport kernel embedding (iOTKE) to extract the target-distribution-related features. The iOTKE learns an OT from the input signal to the target signal on the basis of a reference set learned from all training data. Thus it can maintain the feature diversity and preserve the latent geometric structure of the distribution for the target signal. To further improve the feature selection ability, we extend the proposed iOTKE to a gated version, i.e., gated iOTKE (G-iOTKE), by incorporating a lightweight gating mechanism. The gating mechanism controls effective information flow and enables the proposed method to select important features for a specific input signal. We evaluated the proposed G-iOTKE on SVS/SI. Experimental results showed that the proposed method provided better results than other models.
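For readers unfamiliar with optimal transport, the following minimal sketch shows the general idea of transporting input features onto a learned reference set via Sinkhorn iterations. It is an assumption-laden illustration of the OT concept, not the authors' iOTKE module.

```python
# Minimal optimal-transport sketch (not the paper's implementation): compute a
# Sinkhorn transport plan from input frame features to a small "reference set",
# then use the barycentric mapping as a distribution-aware embedding.
import numpy as np

def sinkhorn_plan(X, R, eps=0.1, iters=100):
    """X: (n, d) input features, R: (m, d) reference set. Returns (n, m) plan."""
    C = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)        # pairwise squared cost
    K = np.exp(-C / eps)                                       # Gibbs kernel
    a, b = np.full(len(X), 1 / len(X)), np.full(len(R), 1 / len(R))
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):                                     # Sinkhorn scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_embedding(X, R):
    P = sinkhorn_plan(X, R)
    # Barycentric projection of each input frame onto the reference set.
    return (P @ R) / P.sum(axis=1, keepdims=True)
```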

3.HiSSNet: Sound Event Detection and Speaker Identification via Hierarchical Prototypical Networks for Low-Resource Headphones

Title: HiSSNet: A Hierarchical Prototype Network for Sound Event Detection and Speaker Recognition in Low-Resource Headphones

Authors: N Shashaank 1,2, Berker Banar 1,3, Mohammad Rasool Izadi 1, Jeremy Kemmerer 1, Shuo Zhang 1, Chuan-Che (Jeff) Huang 1

Unit: 1 Bose Corporation, MA, USA

2 Department of Computer Science, Columbia University, NY, USA

3 School of Electronic Engineering and Computer Science, Queen Mary University of London, UK

Link: https://ieeexplore.ieee.org/document/10094788

Abstract: Modern noise-canceling headphones greatly improve the user's listening experience by eliminating unwanted background noise, but they can also block out sounds that are important to the user. Machine learning (ML) models for sound event detection (SED) and speaker identification (SID) can enable headphones to selectively pass important sounds; however, implementing these models for a user-centric experience brings some unique challenges. First, most people spend a limited amount of time customizing their headphones, so sound detection should work reasonably well out of the box. Second, over time, the model should be able to learn the specific sounds that are important to the user based on their implicit and explicit interactions. Finally, such a model should have a small memory footprint to run on low-power headphones with limited on-chip memory. In this paper, we propose HiSSNet (Hierarchical SED and SID Network) to address these challenges. HiSSNet is an SEID (SED and SID) model that uses a hierarchical prototypical network to detect general and specific sounds of interest and characterize both alarm-like sounds and speech. We show that HiSSNet outperforms SEID models trained with non-hierarchical prototypical networks by 6.9-8.6%. Compared with state-of-the-art (SOTA) models trained specifically for SED or SID alone, HiSSNet achieves similar or better performance while reducing the memory footprint required to support multiple functions on-device.

Modern noise-cancelling headphones have significantly improved users’ auditory experiences by removing unwanted background noise, but they can also block out sounds that matter to users. Machine learning (ML) models for sound event detection (SED) and speaker identification (SID) can enable headphones to selectively pass through important sounds; however, implementing these models for a user-centric experience presents several unique challenges. First, most people spend limited time customizing their headphones, so the sound detection should work reasonably well out of the box. Second, the models should be able to learn over time the specific sounds that are important to users based on their implicit and explicit interactions. Finally, such models should have a small memory footprint to run on low-power headphones with limited on-chip memory. In this paper, we propose addressing these challenges using HiSSNet (Hierarchical SED and SID Network). HiSSNet is an SEID (SED and SID) model that uses a hierarchical prototypical network to detect both general and specific sounds of interest and characterize both alarm-like and speech sounds. We show that HiSSNet outperforms an SEID model trained using non-hierarchical prototypical networks by 6.9 – 8.6%. When compared to state-of-the-art (SOTA) models trained specifically for SED or SID alone, HiSSNet achieves similar or better performance while reducing the memory footprint required to support multiple capabilities on-device.
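The hierarchical prototypical idea can be illustrated with a standard prototypical-network step: class prototypes are mean embeddings of the support examples, and queries are scored by distance to the prototypes, first at a coarse level (e.g. alarm-like vs. speech) and then at a finer level. The shapes and two-level routing below are illustrative assumptions, not HiSSNet itself.

```python
# Illustrative prototypical-network step (hypothetical shapes, not HiSSNet):
# prototypes are per-class mean embeddings; queries are classified by their
# (negative squared) distance to each prototype.
import torch

def prototypes(support_emb, support_labels, num_classes):
    # support_emb: (N, D); support_labels: (N,) with values in [0, num_classes)
    return torch.stack([support_emb[support_labels == c].mean(0)
                        for c in range(num_classes)])

def classify(query_emb, protos):
    # Negative squared Euclidean distance as logits, as in prototypical nets.
    d = torch.cdist(query_emb, protos) ** 2
    return (-d).softmax(dim=-1)

# Hierarchical use (sketch): first route with coarse prototypes
# (e.g. alarm vs. speech), then refine with the fine-grained prototypes
# belonging to the selected coarse group.
```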

4.Jeffreys Divergence-Based Regularization of Neural Network Output Distribution Applied to Speaker Recognition

Title: Jeffreys Divergence-Based Regularization of Neural Network Output Distribution in Speaker Recognition

Author: Pierre-Michel Bousquet, Mickael Rouvier

Unit: LIA - Avignon University

Link: https://ieeexplore.ieee.org/document/10094702

Abstract: A new Jeffreys Divergence-based loss function for speaker recognition in deep neural networks is proposed. Adding this divergence to the cross-entropy loss function maximizes the target value of the output distribution while smoothing out non-target values. This objective function provides highly discriminative features. Beyond this effect, we present a theoretical proof of its effectiveness and try to understand how this loss function affects the model, especially on the type of dataset (i.e. in-domain or out-of-domain w.r.t training corpus). Our experiments show that Jeffreys loss consistently outperforms state-of-the-art speaker recognition, especially on out-of-domain data, and helps limit false positives.

A new loss function for speaker recognition with deep neural network is proposed, based on Jeffreys Divergence. Adding this divergence to the cross-entropy loss function allows to maximize the target value of the output distribution while smoothing the non-target values. This objective function provides highly discriminative features. Beyond this effect, we propose a theoretical justification of its effectiveness and try to understand how this loss function affects the model, in particular the impact on dataset types (i.e. in-domain or out-of-domain w.r.t the training corpus). Our experiments show that Jeffreys loss consistently outperforms the state-of-the-art for speaker recognition, especially on out-of-domain data, and helps limit false alarms.
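A minimal sketch of the kind of objective described above, assuming the Jeffreys term J(p, q) = KL(p||q) + KL(q||p) is taken between the softmax output and a smoothed one-hot reference; the exact form used in the paper may differ.

```python
# Assumed form (illustration only): cross-entropy plus a Jeffreys-divergence
# regularizer between the model's output distribution and a smoothed one-hot
# target, which sharpens the target probability while smoothing non-targets.
import torch
import torch.nn.functional as F

def jeffreys_regularized_loss(logits, target, smoothing=0.1, lam=1.0):
    n_class = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    # Smoothed one-hot reference distribution q (strictly positive everywhere).
    q = torch.full_like(p, smoothing / (n_class - 1))
    q.scatter_(-1, target.unsqueeze(-1), 1.0 - smoothing)
    ce = F.nll_loss(log_p, target)
    kl_pq = (p * (log_p - q.log())).sum(-1).mean()   # KL(p || q)
    kl_qp = (q * (q.log() - log_p)).sum(-1).mean()   # KL(q || p)
    return ce + lam * (kl_pq + kl_qp)
```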

5.Privacy-Preserving Occupancy Estimation

Title: Privacy-preserving occupancy estimation

Authors: Jennifer Williams, Vahid Yazdanpanah, Sebastian Stein

Unit: School of Electronics and Computer Science, University of Southampton, UK

Link: https://ieeexplore.ieee.org/document/10095340

Abstract: In this paper, we introduce an audio-based occupancy estimation framework, including a new public dataset, and evaluate occupancy in a "cocktail party" scenario in which audio is mixed to produce overlapping speech from 1-10 people to simulate a party. To estimate the number of speakers in an audio clip, we explore five different types of speech signal features and train several versions of our model using convolutional neural networks (CNNs). Furthermore, we make the framework privacy-preserving by applying random perturbations to audio frames to hide speech content and speaker identity. We demonstrate that some of our privacy-preserving features perform better than raw waveforms for occupancy estimation. We further analyze privacy using two adversarial tasks: speaker recognition and speech recognition. Our privacy-preserving model can estimate the number of speakers in a simulated cocktail party clip to within 1-2 people, with a mean squared error (MSE) of 0.9-1.6, and we achieve classification accuracy of up to 34.9% while preserving speech content privacy. However, it is still possible for an attacker to identify individual speakers, motivating further work in this area.

In this paper, we introduce an audio-based framework for occupancy estimation, including a new public dataset, and evaluate occupancy in a ‘cocktail party’ scenario where the party is simulated by mixing audio to produce speech with overlapping talkers (1-10 people). To estimate the number of speakers in an audio clip, we explored five different types of speech signal features and trained several versions of our model using convolutional neural networks (CNNs). Further, we adapted the framework to be privacy-preserving by making random perturbations of audio frames in order to conceal speech content and speaker identity. We show that some of our privacy-preserving features perform better at occupancy estimation than original waveforms. We analyse privacy further using two adversarial tasks: speaker recognition and speech recognition. Our privacy-preserving models can estimate the number of speakers in the simulated cocktail party clips within 1-2 persons based on a mean-square error (MSE) of 0.9-1.6 and we achieve up to 34.9% classification accuracy while preserving speech content privacy. However, it is still possible for an attacker to identify individual speakers, which motivates further work in this area.
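The frame-perturbation idea can be sketched as below. The specific operations (frame shuffling plus additive noise) and parameters are assumptions for illustration, not the authors' exact perturbations.

```python
# Illustrative privacy perturbation (assumed details): randomly permute short
# audio frames and add noise so that content and identity are obscured, while
# frame-level statistics useful for counting speakers are largely retained.
import numpy as np

def perturb_frames(signal, frame_len=400, noise_std=0.01, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames[rng.permutation(n_frames)]                   # shuffle temporal order
    frames = frames + rng.normal(0.0, noise_std, frames.shape)   # mask fine content
    return frames.reshape(-1)
```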

6.Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

Title: Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

Authors: Xiaoyu Liu, Xu Li, Joan Serrà

Unit: Dolby Laboratories

Link: https://ieeexplore.ieee.org/document/10096478

Abstract: Single-channel target speaker separation (TSS) aims to extract a speaker's voice from a mixture of multiple speakers, given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model and a downstream model: the upstream model obtains the enrollment speaker embeddings, and the downstream model performs the separation conditioned on those embeddings. In this paper, we investigate several important but overlooked aspects of enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of log-mel filterbank and self-supervised embeddings, and the cross-dataset generalization ability of the embeddings. Our results suggest that speaker identification embeddings may lose relevant information due to a sub-optimal metric, training objective, or common preprocessing. In contrast, both filterbank and self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.

Single channel target speaker separation (TSS) aims at extracting a speaker’s voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings’ cross-dataset generalization capability. Our results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in a cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.
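The upstream/downstream split described above can be sketched as follows, with a hypothetical downstream separator conditioned on the enrollment embedding by gating its hidden features. Layer sizes and the gating choice are illustrative, not the paper's models.

```python
# Schematic of a conditioned downstream separator (hypothetical architecture):
# the enrollment embedding from an upstream model gates the mixture's hidden
# features, and a mask over the target speaker's features is predicted.
import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=192, hidden=256):
        super().__init__()
        self.encode = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.cond = nn.Linear(emb_dim, hidden)    # project enrollment embedding
        self.mask = nn.Conv1d(hidden, feat_dim, kernel_size=1)

    def forward(self, mixture_feats, enroll_emb):
        # mixture_feats: (B, feat_dim, T); enroll_emb: (B, emb_dim)
        h = torch.relu(self.encode(mixture_feats))
        h = h * torch.sigmoid(self.cond(enroll_emb)).unsqueeze(-1)   # condition on speaker
        return torch.sigmoid(self.mask(h)) * mixture_feats           # masked target estimate
```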

7.Representation of Vocal Tract Length Transformation Based on Group Theory

Title: Group theory-based representation of vocal tract length transformations

Author: Atsushi Miyashita, Tomoki Toda

Unit: Nagoya University, Japan

Link: https://ieeexplore.ieee.org/document/10095239

Abstract: The acoustic properties of a phoneme depend on the vocal tract length (VTL) of the individual speaker. Separating this speaker information from the linguistic information is important for tasks such as automatic speech recognition and speaker recognition. This paper focuses on the property of the vocal tract length transformation (VTLT) that it forms a group, and derives a new speech representation, the VTL spectrum, based on a group-theoretic analysis, in which VTLT only changes the phase of the VTL spectrum, i.e., a simple linear shift. Furthermore, we propose a method to analytically disentangle the effect of VTL on the VTL spectrum. We conducted experiments using the TIMIT dataset to clarify the nature of this feature, and show that: 1) for ASR, normalization of the VTL spectrum reduces the phoneme recognition error rate by 1.9 under random VTLT; and 2) for speaker recognition, removing the amplitude component of the VTL spectrum improves speaker classification performance.

The acoustic characteristics of phonemes vary depending on the vocal tract length (VTL) of the individual speakers. It is important to disentangle such speaker information from the linguistic information for various tasks, such as automatic speech recognition (ASR) and speaker recognition. In this paper, we focus on the property of vocal tract length transformation (VTLT) that forms a group, and derive the novel speech representation VTL spectrum based on group theory analysis, where only the phase of the VTL spectrum is changed by VTLT, which is a simple linear shift. Moreover, we propose a method to analytically disentangle the VTL effects on the VTL spectrum. We conducted experiments with the TIMIT dataset to clarify the property of this feature, demonstrating that 1) for ASR, normalization of the VTL spectrum reduced the phoneme recognition error rate by 1.9 under random VTLT, and 2) for speaker recognition, removal of the amplitude component of the VTL spectrum improved speaker classification performance.
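Under the common linear-warping approximation of VTLT (an assumption made here for illustration; the paper's exact transformation may differ), the warps form a one-parameter group, which is the property the group-theoretic analysis builds on:

```latex
% Illustrative only: linear frequency warping T_alpha of a spectrum S, which
% satisfies closure, identity at alpha = 1, and inverses, i.e. a group.
\[
  (T_\alpha S)(\omega) = S(\alpha\,\omega), \qquad
  T_{\alpha_1} \circ T_{\alpha_2} = T_{\alpha_1 \alpha_2}, \qquad
  T_1 = \mathrm{id}, \qquad
  T_\alpha^{-1} = T_{1/\alpha}.
\]
```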

8.Speaker Recognition with Two-Step Multi-Modal Deep Cleansing

Title: Speaker Recognition Based on Two-Step Multimodal Deep Cleaning

Authors: Ruijie Tao 1, Kong Aik Lee 2, Zhan Shi 3 and Haizhou Li 1,3,4,5

Unit: 1 National University of Singapore, Singapore

2 Institute for Infocomm Research, A⋆STAR, Singapore

3 The Chinese University of Hong Kong, Shenzhen, China

4 University of Bremen, Germany

5 Shenzhen Research Institute of Big Data, Shenzhen, China

Link: https://ieeexplore.ieee.org/document/10096814

Abstract: In recent years, neural network-based speaker recognition has made remarkable progress. A robust speaker representation learns meaningful knowledge from both hard and easy samples in the training set to achieve good performance. However, noisy samples (i.e., samples with wrong labels) in the training set can cause confusion and lead the network to learn incorrect representations. In this paper, we propose a two-step audio-visual deep cleansing framework to remove the influence of noisy labels in speaker representation learning. The framework consists of a coarse-grained cleansing step to search for complex samples, followed by a fine-grained cleansing step to filter out noisy labels. Our study starts from an efficient audio-visual speaker recognition system that achieves near-perfect equal error rates (EER) of 0.01%, 0.07%, and 0.13% on the Vox-O, E, and H test sets, respectively. With the proposed multi-modal cleansing mechanism, four different speaker recognition networks improve by an average of 5.9%.

Neural network-based speaker recognition has achieved significant improvement in recent years. A robust speaker representation learns meaningful knowledge from both hard and easy samples in the training set to achieve good performance. However, noisy samples (i.e., with wrong labels) in the training set induce confusion and cause the network to learn the incorrect representation. In this paper, we propose a two-step audio-visual deep cleansing framework to eliminate the effect of noisy labels in speaker representation learning. This framework contains a coarse-grained cleansing step to search for the complex samples, followed by a fine-grained cleansing step to filter out the noisy labels. Our study starts from an efficient audio-visual speaker recognition system, which achieves a close to perfect equal-error-rate (EER) of 0.01%, 0.07% and 0.13% on the Vox-O, E and H test sets. With the proposed multi-modal cleansing mechanism, four different speaker recognition networks achieve an average improvement of 5.9%. Code has been made available at: https://github.com/TaoRuijie/AVCleanse.
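The two-step idea can be sketched as a coarse centroid-distance screen followed by a finer cross-modal agreement check. The thresholds and criteria below are hypothetical, not the authors' exact rules.

```python
# Illustrative two-step cleansing sketch (assumed criteria): a coarse pass
# flags samples far from their labelled speaker's audio centroid, and a fine
# pass removes only those flagged samples on which the visual modality also
# disagrees with the label.
import numpy as np

def cosine(a, b):
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def cleanse(audio_emb, visual_emb, labels, coarse_th=0.4, fine_th=0.2):
    keep = np.ones(len(labels), dtype=bool)
    for spk in np.unique(labels):
        idx = np.where(labels == spk)[0]
        a_c, v_c = audio_emb[idx].mean(0), visual_emb[idx].mean(0)
        a_sim = cosine(audio_emb[idx], a_c)        # coarse: audio distance to centroid
        v_sim = cosine(visual_emb[idx], v_c)       # fine: visual agreement with label
        suspect = a_sim < coarse_th
        keep[idx[suspect & (v_sim < fine_th)]] = False
    return keep
```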

9.Universal Speaker Recognition Encoders for Different Speech Segments Duration

Title: Universal Speaker Recognition Encoders for Different Speech Segment Durations

Authors: Sergey Novoselov 1,2, Vladimir Volokhov 1, Galina Lavrentyeva 1

Unit: 1 ITMO University, St.Petersburg, Russia

2 STC Ltd., St.Petersburg, Russia

Link: https://ieeexplore.ieee.org/document/10096081

Abstract: Creating a universal speaker encoder that is robust to different acoustic and speech duration conditions is a major challenge today. Based on our observations, systems trained on short speech segments are optimal for short-phrase speaker verification, while systems trained on long segments are superior for long-segment verification. Systems trained simultaneously on pooled short and long speech segments do not give optimal verification results and often degrade for both short and long segments. This paper addresses the problem of creating a universal speaker encoder for different speech segment durations. We describe our simple recipe for training a universal speaker encoder for any type of chosen neural network architecture. According to the evaluation results of our wav2vec-TDNN based systems obtained for the NIST SRE and VoxCeleb1 benchmarks, the proposed universal encoder provides speaker verification improvements under different enrollment and test speech segment durations. A key feature of the proposed encoder is that it has the same inference time as the chosen neural network architecture.

Creating universal speaker encoders which are robust for different acoustic and speech duration conditions is a big challenge today. According to our observations systems trained on short speech segments are optimal for short phrase speaker verification and systems trained on long segments are superior for long segments verification. A system trained simultaneously on pooled short and long speech segments does not give optimal verification results and usually degrades both for short and long segments. This paper addresses the problem of creating universal speaker encoders for different speech segments duration. We describe our simple recipe for training universal speaker encoder for any type of selected neural network architecture. According to our evaluation results of wav2vec-TDNN based systems obtained for NIST SRE and VoxCeleb1 benchmarks the proposed universal encoder provides speaker verification improvements in case of different enrollment and test speech segment duration. The key feature of the proposed encoder is that it has the same inference time as the selected neural network architecture.
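The abstract does not detail the recipe itself, so the snippet below only illustrates a generic ingredient of duration-robust training: cropping each training utterance to a randomly chosen duration so the encoder sees both short phrases and long segments.

```python
# Generic illustration only (not the paper's recipe): random-duration cropping
# so one encoder is exposed to both short and long training segments.
import random
import torch

def random_duration_crop(wave, sr=16000, min_s=2.0, max_s=20.0):
    target = int(sr * random.uniform(min_s, max_s))
    if wave.size(-1) <= target:
        return wave                                   # keep short utterances whole
    start = random.randint(0, wave.size(-1) - target)
    return wave[..., start:start + target]
```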

Anti-Spoofing

1.Identifying Source Speakers for Voice Conversion Based Spoofing Attacks on Speaker Verification Systems

Title: Identifying Source Speakers for Voice Conversion-Based Spoofing Attacks on Speaker Verification Systems

Authors: Danwei Cai 1, Zexin Cai 1, Ming Li 1,2

Unit: 1 Department of Electrical and Computer Engineering, Duke University, Durham, USA

2 Data Science Research Center, Duke Kunshan University, Kunshan, China

Link: https://ieeexplore.ieee.org/document/10096733

Abstract: An automatic speaker verification system aims to verify the speaker identity of a speech signal. However, a voice conversion system can manipulate a person's speech signal to sound like another speaker's voice and fool the speaker verification system. Most countermeasures against voice conversion-based spoofing attacks are designed to distinguish real speech from spoofed speech for speaker verification systems. In this paper, we study the problem of source speaker identification: inferring the identity of the source speaker given the voice-converted speech. For source speaker identification, we simply add voice-converted speech data labeled with the source speaker identity to the genuine speech dataset when training the speaker embedding network. Experimental results show that source speaker identification is feasible when training and testing with converted speech from the same voice conversion model(s). Furthermore, our results show that training with more utterances converted from various voice conversion models helps improve source speaker identification performance on utterances converted by unseen voice conversion models.

An automatic speaker verification system aims to verify the speaker identity of a speech signal. However, a voice conversion system could manipulate a person’s speech signal to make it sound like another speaker’s voice and deceive the speaker verification system. Most countermeasures for voice conversion-based spoofing attacks are designed to discriminate bona fide speech from spoofed speech for speaker verification systems. In this paper, we investigate the problem of source speaker identification – inferring the identity of the source speaker given the voice converted speech. To perform source speaker identification, we simply add voice-converted speech data with the label of source speaker identity to the genuine speech dataset during speaker embedding network training. Experimental results show the feasibility of source speaker identification when training and testing with converted speeches from the same voice conversion model(s). In addition, our results demonstrate that having more converted utterances from various voice conversion models for training helps improve the source speaker identification performance on converted utterances from unseen voice conversion models.
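The data preparation described above is simple enough to sketch directly. The tuple fields are hypothetical, but the key point from the abstract is that converted utterances carry their source speaker's label.

```python
# Sketch of the training-data construction (hypothetical fields): add
# voice-converted utterances to the genuine pool, labelled by their *source*
# speaker, so the embedding network learns source characteristics that
# survive conversion.
def build_training_list(genuine, converted):
    """genuine: [(wav_path, speaker_id)]; converted: [(wav_path, source_id, target_id)]."""
    train = list(genuine)
    for wav_path, source_id, _target_id in converted:
        train.append((wav_path, source_id))   # label converted speech by its source speaker
    return train
```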

2.Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection

Title: Synthetic Speech Detection Using Position-Dependent Local-Global Dependencies

Authors: Xiaohui Liu 1, Meng Liu 1, Longbiao Wang 1, Kong Aik Lee 2, Hanyi Zhang 1, Jianwu Dang 1

Unit: 1 Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China

2 Institute for Infocomm Research, A⋆STAR, Singapore

Link: https://ieeexplore.ieee.org/document/10096278

Abstract: Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. Since synthetic speech exhibits both local and global artifacts compared to natural speech, incorporating local-global dependency leads to better anti-spoofing performance. To this end, we propose Rawformer, which leverages positional-related local-global dependency for synthetic speech detection. Two-dimensional convolution and the Transformer are used in our method to capture local and global dependency, respectively. Specifically, we design a novel positional aggregator that integrates local-global dependency with less information loss by adding positional information and a flattening strategy. Furthermore, we propose the squeeze-and-excitation Rawformer (SE-Rawformer), which introduces the squeeze-and-excitation operation to better capture local dependency. The results show that our proposed SE-Rawformer achieves a 37% relative improvement over the state-of-the-art single system on ASVspoof 2019 LA and generalizes well on ASVspoof 2021 LA. In particular, using the positional aggregator in SE-Rawformer brings a 43% improvement on average.

Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. As synthetic speech exhibits local and global artifacts compared to natural speech, incorporating local-global dependency would lead to better anti-spoofing performance. To this end, we propose the Rawformer that leverages positional-related local-global dependency for synthetic speech detection. The two-dimensional convolution and Transformer are used in our method to capture local and global dependency, respectively. Specifically, we design a novel positional aggregator that integrates local-global dependency by adding positional information and flattening strategy with less information loss. Furthermore, we propose the squeeze-and-excitation Rawformer (SE-Rawformer), which introduces squeeze-and-excitation operation to acquire local dependency better. The results demonstrate that our proposed SE-Rawformer leads to 37% relative improvement compared to the single state-of-the-art system on ASVspoof 2019 LA and generalizes well on ASVspoof 2021 LA. Especially, using the positional aggregator in the SE-Rawformer brings a 43% improvement on average.
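For reference, a standard squeeze-and-excitation block is sketched below; SE-Rawformer builds on this general operation, though its exact placement and dimensions are not specified here.

```python
# A standard squeeze-and-excitation (SE) block over a (B, C, T) feature map:
# squeeze across time, then excite (reweight) each channel.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (B, C, T)
        w = self.fc(x.mean(dim=-1))        # squeeze over time -> per-channel weights
        return x * w.unsqueeze(-1)         # excite: reweight channels
```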

3.SAMO: Speaker Attractor Multi-Center One-Class Learning For Voice Anti-Spoofing

Title: SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

Authors: Siwen Ding 1, You Zhang 2, Zhiyao Duan 2

Unit: 1 Columbia University, New York, NY, USA

2 University of Rochester, Rochester, NY, USA

Link: https://ieeexplore.ieee.org/document/10094704

Abstract: Voice anti-spoofing systems are an important auxiliary to automatic speaker verification (ASV) systems. A major challenge is the unseen attacks enabled by advanced speech synthesis techniques. Our previous work on one-class learning improves generalization to unseen attacks by compacting real speech in the embedding space. However, this compactness does not take speaker diversity into account. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes spoofing attacks away from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the joint optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose an anti-spoofing strategy for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a 38% relative improvement in equal error rate (EER) on the ASVspoof 2019 LA evaluation set.

Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consideration of the diversity of speakers. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes away spoofing attacks from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the co-optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose strategies to enable anti-spoofing for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA evaluation set.
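A minimal sketch of the multi-center one-class idea, assuming a margin-based loss over cosine similarities to per-speaker attractors; the authors' exact formulation may differ.

```python
# Illustrative multi-center one-class loss (assumed form): bona fide embeddings
# are pulled toward their speaker's attractor, spoofed embeddings are pushed
# below a similarity margin with respect to every attractor.
import torch
import torch.nn.functional as F

def samo_style_loss(emb, spk_ids, is_bonafide, attractors, m_real=0.7, m_spoof=0.3):
    emb = F.normalize(emb, dim=-1)
    att = F.normalize(attractors, dim=-1)
    sims = emb @ att.T                                      # (B, n_speakers) cosine similarities
    own = sims.gather(1, spk_ids.unsqueeze(1)).squeeze(1)   # similarity to own attractor
    loss_real = F.relu(m_real - own)[is_bonafide].mean()                    # pull bona fide in
    loss_spoof = F.relu(sims.max(dim=1).values - m_spoof)[~is_bonafide].mean()  # push spoof away
    return loss_real + loss_spoof
```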

4.Shift to Your Device: Data Augmentation for Device-Independent Speaker Verification Anti-Spoofing

Title: Shift to Your Device: Data Augmentation for Device-Independent Speaker Verification Anti-Spoofing

Authors: Junhao Wang, Li Lu, Zhongjie Ba, Feng Lin, Kui Ren

Unit: School of Cyber Science and Technology, Zhejiang University, China

ZJU-Hangzhou Global Scientific and Technological Innovation Center, China

Link: https://ieeexplore.ieee.org/document/10097168

Abstract: This paper presents DeAug, a novel deconvolution-enhanced data augmentation method for ultrasound-based speaker verification anti-spoofing systems that detect the liveness of voice sources in physical access, aiming to improve liveness detection on unseen devices for which no data has yet been collected. Specifically, DeAug first preprocesses the available collected data with Wiener deconvolution to generate enhanced clean signal samples. The generated samples are then convolved with different device impulse responses, giving the signals the channel characteristics of unseen devices. Experiments on cross-domain datasets show that the proposed augmentation method improves the relative performance of ultrasound-based anti-spoofing systems by 97.8%, and performance can be further improved by 43.4% after domain adversarial training on multi-device augmented data.

This paper proposes a novel Deconvolution-enhanced data Augmentation method, DeAug, for ultrasonic-based speaker verification anti-spoofing systems to detect the liveness of voice sources in physical access, which aims to improve the performance of liveness detection on unseen devices where no data is collected yet. Specifically, DeAug first employs a Wiener deconvolution pre-processing on available collected data to generate enhanced clean signal samples. Then, the generated samples are convolved with different device impulse responses, which gives the signals the channel characteristics of unseen devices. Experiments on cross-domain datasets show that the proposed augmentation improves the relative performance of the ultrasonic-based anti-spoofing system by 97.8%, with a further 43.4% improvement from domain adversarial training on the multi-device augmented data.
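The two DeAug stages described above, Wiener deconvolution followed by re-convolution with a different device impulse response, can be sketched as follows; the SNR parameter and array handling are illustrative assumptions.

```python
# Sketch of the two augmentation stages (illustrative parameters): frequency-
# domain Wiener deconvolution to undo the recording device's response, then
# convolution with another device's impulse response to imitate that device.
import numpy as np

def wiener_deconvolve(recorded, device_ir, snr=1e2):
    n = len(recorded)
    H = np.fft.rfft(device_ir, n)
    Y = np.fft.rfft(recorded, n)
    G = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)   # Wiener deconvolution filter
    return np.fft.irfft(G * Y, n)

def shift_to_device(recorded, source_ir, target_ir):
    clean = wiener_deconvolve(recorded, source_ir)               # remove source device channel
    return np.convolve(clean, target_ir, mode="full")[:len(recorded)]  # apply target device channel
```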

Others

1.Conditional Conformer: Improving Speaker Modulation For Single And Multi-User Speech Enhancement

Title: Conditional Conformer: Improving Speaker Modulation for Single-User and Multi-User Speech Enhancement

Authors: Tom O'Malley, Shaojin Ding, Arun Narayanan, Quan Wang, Rajeev Rikhye, Qiao Liang, Yanzhang He, Ian McGraw

Unit: Google LLC, USA

Link: https://ieeexplore.ieee.org/document/10095369

Abstract: Recently, Feature-wise Linear Modulation (FiLM) has outperformed other methods for incorporating speaker embeddings into speech separation and VoiceFilter models. We propose an improved method for incorporating such embeddings into a VoiceFilter front-end for automatic speech recognition (ASR) and text-independent speaker verification (TI-SV). We extend the widely used Conformer architecture to construct a FiLM Block with additional feature processing before and after the FiLM layers. In addition to its application to single-user VoiceFilter, we show that our system can be easily extended to multi-user VoiceFilter models via element-wise max pooling of the speaker embeddings in a projected space. The final architecture, which we call Conditional Conformer, tightly integrates the speaker embeddings into a Conformer backbone. Compared with previous multi-user VoiceFilter models, we improve the TI-SV equal error rate by as much as 56%, and our element-wise max pooling reduces relative WER by as much as 10% compared to an attention mechanism.

Recently, Feature-wise Linear Modulation (FiLM) has been shown to outperform other approaches to incorporate speaker embedding into speech separation and VoiceFilter models. We propose an improved method of incorporating such embeddings into a VoiceFilter frontend for automatic speech recognition (ASR) and text-independent speaker verification (TI-SV). We extend the widely-used Conformer architecture to construct a FiLM Block with additional feature processing before and after the FiLM layers. Apart from its application to single-user VoiceFilter, we show that our system can be easily extended to multi-user VoiceFilter models via element-wise max pooling of the speaker embeddings in a projected space. The final architecture, which we call Conditional Conformer, tightly integrates the speaker embeddings into a Conformer backbone. We improve TI-SV equal error rates by as much as 56% over prior multi-user VoiceFilter models, and our element-wise max pooling reduces relative WER compared to an attention mechanism by as much as 10%.
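A minimal sketch of FiLM conditioning with element-wise max pooling over multiple speaker embeddings in a projected space; the layer sizes are assumptions and this is not Google's implementation.

```python
# Illustrative FiLM conditioning for a multi-user front-end: project each
# speaker embedding, element-wise max-pool across users, then predict the
# per-channel scale (gamma) and shift (beta) applied to the features.
import torch
import torch.nn as nn

class FiLMCondition(nn.Module):
    def __init__(self, emb_dim=256, feat_dim=512, proj_dim=128):
        super().__init__()
        self.project = nn.Linear(emb_dim, proj_dim)
        self.to_gamma = nn.Linear(proj_dim, feat_dim)
        self.to_beta = nn.Linear(proj_dim, feat_dim)

    def forward(self, feats, spk_embs):
        # feats: (B, T, feat_dim); spk_embs: (B, n_users, emb_dim)
        pooled = self.project(spk_embs).max(dim=1).values     # element-wise max pool over users
        gamma, beta = self.to_gamma(pooled), self.to_beta(pooled)
        return gamma.unsqueeze(1) * feats + beta.unsqueeze(1)  # FiLM: scale and shift
```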

2.Exploiting Speaker Embeddings for Improved Microphone Clustering and Speech Separation in ad-hoc Microphone Arrays

Title: Utilizing speaker embeddings to improve microphone clustering and speech separation in ad hoc microphone arrays

Authors: Stijn Kindt, Jenthe Thienpondt, Nilesh Madhu

Unit: IDLab, Department of Electronics and Information Systems, Ghent University - imec, Ghent, Belgium

Link: https://ieeexplore.ieee.org/document/10094862

Abstract: To separate sources captured by ad hoc distributed microphones, a critical first step is to assign the microphones to appropriate source-dominated clusters. The features used for this (blind) clustering are based on fixed-length embeddings of the audio signals in a high-dimensional latent space. In previous work, the embeddings were hand-engineered from Mel-frequency cepstral coefficients and their modulation spectra. This paper argues that an embedding framework explicitly designed to reliably discriminate between speakers will yield more appropriate features. We propose features generated by the state-of-the-art ECAPA-TDNN speaker verification model for the clustering. We benchmark these features in terms of the subsequent signal enhancement as well as the quality of the clustering, and further, we introduce 3 intuitive metrics for the latter. The results show that, compared to hand-engineered features, the ECAPA-TDNN-based features lead to more logical clusters and better performance in the subsequent enhancement stage, thus validating our hypothesis.

For separating sources captured by ad hoc distributed microphones a key first step is assigning the microphones to the appropriate source-dominated clusters. The features used for such (blind) clustering are based on a fixed length embedding of the audio signals in a high-dimensional latent space. In previous work, the embedding was hand-engineered from the Mel frequency cepstral coefficients and their modulation-spectra. This paper argues that embedding frameworks designed explicitly for the purpose of reliably discriminating between speakers would produce more appropriate features. We propose features generated by the state-of-the-art ECAPA-TDNN speaker verification model for the clustering. We benchmark these features in terms of the subsequent signal enhancement as well as on the quality of the clustering where, further, we introduce 3 intuitive metrics for the latter. Results indicate that in contrast to the hand-engineered features, the ECAPA-TDNN-based features lead to more logical clusters and better performance in the subsequent enhancement stages - thus validating our hypothesis.
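The clustering step can be sketched as below, assuming speaker embeddings (e.g. from an ECAPA-TDNN verification model) have already been extracted for each microphone signal; the use of k-means here is an illustrative choice, not necessarily the paper's clustering algorithm.

```python
# Illustrative microphone clustering: microphones whose signals yield similar
# speaker embeddings are grouped into the same source-dominated cluster.
import numpy as np
from sklearn.cluster import KMeans

def cluster_microphones(mic_embeddings, n_sources):
    """mic_embeddings: (n_mics, emb_dim), one embedding per microphone signal."""
    emb = mic_embeddings / np.linalg.norm(mic_embeddings, axis=1, keepdims=True)  # length-normalize
    return KMeans(n_clusters=n_sources, n_init=10).fit_predict(emb)
```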

3.Perceptual Analysis of Speaker Embeddings for Voice Discrimination between Machine And Human Listening

Title: Perceptual Analysis of Speaker Embeddings for Voice Discrimination Between Machine and Human Listening

Authors: Jordanis Thoides, Clement Gaultier, Tobias Goehring

Unit: School of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece;

Cambridge Hearing Group, MRC Cognition and Brain Sciences Unit, University of Cambridge, UK

Link: https://ieeexplore.ieee.org/document/10094782

Abstract: This study investigates the information captured by speaker embeddings that is relevant to human speech perception. Convolutional neural networks are trained to perform one-shot speaker verification in both clean and noisy conditions, encoding high-level abstractions of speaker-specific features into latent embedding vectors. We demonstrate that robust and discriminative speaker embeddings can be obtained by using a training loss function that optimizes the embeddings for similarity scoring during inference. Computational analysis shows that such speaker embeddings predict a variety of hand-crafted acoustic features, without any single feature explaining substantial variance of the embeddings. Furthermore, relative distances in the speaker embedding space are moderately consistent with voice similarity as inferred by human listeners. These findings confirm the overlap between machine and human listening when discriminating voices and motivate further research into the remaining differences to improve model performance.

This study investigates the information captured by speaker embeddings with relevance to human speech perception. A Convolutional Neural Network was trained to perform one-shot speaker verification under clean and noisy conditions, such that high-level abstractions of speaker-specific features were encoded in a latent embedding vector. We demonstrate that robust and discriminative speaker embeddings can be obtained by using a training loss function that optimizes the embeddings for similarity scoring during inference. Computational analysis showed that such speaker embeddings predicted various hand-crafted acoustic features, while no single feature explained substantial variance of the embeddings. Moreover, the relative distances in the speaker embedding space moderately coincided with voice similarity, as inferred by human listeners. These findings confirm the overlap between machine and human listening when discriminating voices and motivate further research on the remaining disparities for improving model performance.
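The comparison with human listeners can be illustrated with a simple rank-correlation analysis between embedding distances and perceived similarity; the data layout below is hypothetical.

```python
# Illustrative analysis step (hypothetical data): correlate pairwise distances
# in the speaker-embedding space with human similarity judgements for the
# same voice pairs, using Spearman rank correlation.
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

def embedding_vs_perception(emb_a, emb_b, human_similarity):
    """emb_a, emb_b: (n_pairs, D) embeddings of each pair; human_similarity: (n_pairs,)."""
    dist = np.array([cosine(a, b) for a, b in zip(emb_a, emb_b)])
    rho, p = spearmanr(-dist, human_similarity)   # larger distance should mean lower similarity
    return rho, p
```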
