12 papers from the Audiovisual Cognitive Computing Team of Tianjin Key Laboratory of Cognitive Computing and Applications, Tianjin University, were accepted by Interspeech 2023, the top conference on speech processing

Twelve papers from the Audiovisual Cognitive Computing Team of the Tianjin Key Laboratory of Cognitive Computing and Applications, Tianjin University, were accepted by Interspeech 2023, the top speech technology conference. The papers cover research directions including intent recognition, spoken language understanding, acoustic features, speech recognition, speech separation, and emotion recognition. Brief introductions to the papers follow.

01. Rethinking the Visual Cues in Audio-Visual Speaker Extraction

Authors: Li Junjie, Ge Meng, Pan Zexu, Cao Rui, Wang Longbiao, Dang Jianwu, Zhang Shiliang

Affiliations: Tianjin University, Alibaba DAMO Academy

Audio-visual speaker extraction uses visual information to extract the target speaker's signal from mixed speech. However, current methods rely on a single visual encoder to extract all visual information. This paper proposes using two visual encoders to separately extract speaker identity information and audio-visual synchronization information from the visual signal. The experimental results show that explicitly exploiting identity and synchronization cues significantly improves extraction performance compared with implicitly extracting them through a single encoder.
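
To make the two-branch idea concrete, here is a minimal PyTorch sketch of a visual front-end with separate identity and synchronization encoders feeding a mask-based extractor (module names, dimensions, and the fusion scheme are illustrative assumptions, not the authors' implementation):

```python
# Minimal sketch of a two-branch visual front-end for speaker extraction.
# All module/dimension names are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class DualVisualCues(nn.Module):
    def __init__(self, vis_dim=512, emb_dim=256):
        super().__init__()
        # Identity branch: pooled over time into one speaker-identity vector.
        self.id_enc = nn.Sequential(nn.Linear(vis_dim, emb_dim), nn.ReLU())
        # Synchronization branch: frame-wise lip-motion features.
        self.sync_enc = nn.GRU(vis_dim, emb_dim, batch_first=True)

    def forward(self, lip_frames):                            # (B, T, vis_dim)
        identity = self.id_enc(lip_frames).mean(dim=1)        # (B, emb_dim)
        sync, _ = self.sync_enc(lip_frames)                   # (B, T, emb_dim)
        return identity, sync

class Extractor(nn.Module):
    def __init__(self, audio_dim=256, emb_dim=256):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + 2 * emb_dim, audio_dim)
        self.mask = nn.Sequential(nn.Linear(audio_dim, audio_dim), nn.Sigmoid())

    def forward(self, mix_feat, identity, sync):              # mix_feat: (B, T, audio_dim)
        identity = identity.unsqueeze(1).expand(-1, mix_feat.size(1), -1)
        fused = self.fuse(torch.cat([mix_feat, identity, sync], dim=-1))
        return mix_feat * self.mask(fused)                    # masked target-speaker features

vis, ext = DualVisualCues(), Extractor()
identity, sync = vis(torch.randn(2, 50, 512))
print(ext(torch.randn(2, 50, 256), identity, sync).shape)    # torch.Size([2, 50, 256])
```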

02. Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation

Authors: Fu Yanjie, Ge Meng, Wang Honglong, Li Nan, Yin Haoran, Wang Longbiao, Zhang Gaoyan, Dang Jianwu, Deng Chengyun, Wang Fei

Affiliations: Tianjin University, National University of Singapore, Beijing Xiaoju Technology Co., Ltd.

Paper: https://arxiv.org/abs/2305.10821

In recent years, neural beamforming has made impressive progress in multi-channel speech separation. However, most methods ignore the speakers' 2D position cues contained in the mixed signal. This paper proposes an end-to-end beamforming network guided by 2D positional information estimated from the mixed signal alone. The network first estimates discriminative direction and 2D position cues, namely the directions of arrival of the sources relative to multiple reference microphones and their 2D position coordinates. These cues are then fed into a position-aware neural beamforming module, which accurately reconstructs the speech signals of both sources. Experiments show that, compared with the baseline system, the proposed model not only improves overall speech separation metrics but also avoids performance degradation when the sources spatially overlap.
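
The geometric intuition behind the 2D position cue can be shown with a small sketch that triangulates a source position from direction-of-arrival estimates at two reference microphones and derives near-field delay-and-sum steering delays from it (a simplified stand-in for illustration, not the paper's all-neural beamformer):

```python
# Illustrative sketch (not the paper's network): triangulate a 2-D source
# position from DOA estimates at two reference microphones, then derive
# near-field delay-and-sum steering delays from that position.
import numpy as np

def triangulate(ref1, theta1, ref2, theta2):
    """Intersect two bearing lines ref + t*[cos(theta), sin(theta)] to get (x, y)."""
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    A = np.stack([d1, -d2], axis=1)          # solve ref1 + t1*d1 = ref2 + t2*d2
    t = np.linalg.solve(A, np.asarray(ref2) - np.asarray(ref1))
    return np.asarray(ref1) + t[0] * d1

def steering_delays(src_xy, mic_xy, c=343.0):
    """Relative propagation delays (s) from the estimated 2-D position to each mic."""
    dist = np.linalg.norm(mic_xy - src_xy, axis=1)
    return (dist - dist.min()) / c

mics = np.array([[0.0, 0.0], [0.05, 0.0], [0.10, 0.0], [1.0, 0.0]])
pos = triangulate(mics[0], np.deg2rad(60), mics[3], np.deg2rad(120))
print(pos, steering_delays(pos, mics))       # source near (0.5, 0.87) m
```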

03. SDNet: Stream-attention and Dual-feature Learning Network for Ad-hoc Array Speech Separation

Authors: Wang Honglong, Deng Chengyun, Fu Yanjie, Ge Meng, Wang Longbiao, Zhang Gaoyan, Dang Jianwu, Wang Fei

Affiliations: Tianjin University, Beijing Xiaoju Technology, National University of Singapore

Multi-channel speech separation with fixed microphone arrays has made good progress. This paper proposes a robust system for ad-hoc (distributed) microphone arrays that copes with uncertainty in the positions and number of microphones. Previous studies usually average over the distributed microphone signals, ignoring the diversity of microphones at different locations, while other work has shown that microphones with high signal-to-noise ratio (SNR) contribute more to improving speech quality. Inspired by this, we propose a stream-attention and dual-feature learning network named SDNet. The main contributions are: 1) a dual-feature learning block with fewer parameters that better captures long-term dependencies; 2) a stream-attention mechanism, built on this high-quality speech representation, that handles varying microphone positions and numbers and assigns more attention to high-SNR microphones. Experiments demonstrate that the proposed model outperforms the baseline models.
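
As a rough illustration of the stream-attention idea (not the actual SDNet code), the sketch below pools features from a variable number of microphone channels with learned attention weights, so more informative channels can receive more weight:

```python
# Minimal sketch of attention pooling across a variable number of microphone
# channels (illustrative only; module names are not the paper's SDNet).
import torch
import torch.nn as nn

class StreamAttention(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)              # scalar score per channel and frame

    def forward(self, ch_feats):                         # (B, C, T, F): C microphones
        w = torch.softmax(self.score(ch_feats), dim=1)   # attention over channels
        return (w * ch_feats).sum(dim=1)                 # (B, T, F) fused stream

attn = StreamAttention()
for n_mics in (3, 6):                                    # works for any number of microphones
    x = torch.randn(2, n_mics, 100, 256)
    print(attn(x).shape)                                 # torch.Size([2, 100, 256])
```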

04. Discrimination of the Different Intents Carried by the Same Text Through Integrating Multimodal Information

Authors: Li Zhongjie, Zhang Gaoyan, Wang Longbiao, Dang Jianwu

Affiliations: Tianjin University

With the development of artificial intelligence and the popularization of smart devices, human-machine dialogue technology has received extensive attention. Spoken intent understanding is the core module of a dialogue system. A key issue is therefore how to accurately capture the speaker's full intent, including the linguistic intent carried by the text and the paralinguistic intent carried by the acoustics.

Many current intent understanding studies ignore paralinguistic information, which leads to misunderstandings during spoken interaction, especially when the same text conveys different intents through different paralinguistic cues. To address this, we first build a Chinese multimodal spoken intent understanding dataset containing utterances with the same text but different intents. We then propose an attention-based BiLSTM model that integrates text and acoustic features and introduces an acoustic gate mechanism to complement or revise the linguistic intent. Experimental results show that the multimodal fusion model improves intent recognition accuracy by 11.0% over a model using only linguistic information, demonstrating its effectiveness, especially when the same text carries different intents.
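
A hedged sketch of this kind of gated multimodal fusion is shown below; the feature dimensions, gate formulation, and module names are assumptions for illustration rather than the paper's exact model:

```python
# Sketch of attention-based BiLSTM text encoding with an acoustic gate
# (illustrative names and dimensions; not the paper's exact architecture).
import torch
import torch.nn as nn

class GatedIntentModel(nn.Module):
    def __init__(self, text_dim=300, acou_dim=88, hid=128, n_intents=10):
        super().__init__()
        self.text_lstm = nn.LSTM(text_dim, hid, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid, 1)
        self.acou_proj = nn.Linear(acou_dim, 2 * hid)
        self.gate = nn.Linear(4 * hid, 2 * hid)          # how much the acoustics revise the text
        self.out = nn.Linear(2 * hid, n_intents)

    def forward(self, text_seq, acoustic):               # (B, T, text_dim), (B, acou_dim)
        h, _ = self.text_lstm(text_seq)                  # (B, T, 2*hid)
        a = torch.softmax(self.attn(h), dim=1)           # attention over words
        text_vec = (a * h).sum(dim=1)                    # (B, 2*hid)
        acou_vec = torch.tanh(self.acou_proj(acoustic))
        g = torch.sigmoid(self.gate(torch.cat([text_vec, acou_vec], dim=-1)))
        fused = text_vec + g * acou_vec                  # acoustics complement/revise linguistic intent
        return self.out(fused)

model = GatedIntentModel()
print(model(torch.randn(4, 20, 300), torch.randn(4, 88)).shape)   # torch.Size([4, 10])
```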

05. Frequency Patterns of Individual Speaker Characteristics at Higher and Lower Spectral Ranges

Authors: Zhang Zhao, Zhang Ju, Zhu Ziyu, Chi Yujie, Honda Kiyoshi, Wei Jianguo

Affiliations: Tianjin University, Huiyan Technology (Tianjin) Co., Ltd.

The acoustic characteristics of speech vary among individuals while preserving the shared linguistic information that listeners perceive and distinguish through the auditory system. This paper discusses general time-frequency patterns of individual speaker characteristics, focusing on speaker-specific frequency-domain features in the higher and lower spectral ranges, respectively. To explore this underexplored phenomenon, we performed two experiments. First, acoustic simulations based on a transmission-line model were used to examine how high-frequency resonances vary with different hypopharyngeal cavity shapes. Second, speech signals radiated from the mouth and from the nostrils were recorded separately to observe potential individual factors behind low-frequency spectral irregularities. Combining these findings with previous studies, we propose a time-frequency model of individual speaker characteristics that gives an approximate distribution of speaker-specific information across the speech spectrogram.

06. Improving Zero-shot Cross-domain Slot Filling via Transformer-based Slot Semantics Fusion

Authors: Li Yuhang, Wei Xiao, Si Yuke, Wang Longbiao, Wang Xiaobao, Dang Jianwu

Affiliations: Tianjin University

Slot filling is an important component of spoken language understanding in task-oriented dialogue. In real-world scenarios, labeled data are scarce, so zero-shot slot filling, which transfers knowledge from a source domain to a target domain, has been widely studied. Previous methods use textual descriptions of slots or questions as slot semantic information: they either compute similarity scores against the slot descriptions or recast the task as machine reading comprehension. However, these methods do not fully exploit slot semantics or the word-level dependencies between sentences. In this study, we propose a Transformer-based slot semantics fusion method (TSSF). A weight-shared encoder first produces representations for the sentence and the slot semantics; a Transformer-based fusion module then effectively fuses the slot semantics into the sentence representations. Experimental results on the public SNIPS dataset show that our model significantly outperforms the state-of-the-art model by 6.09% on the slot-F1 metric.
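
As an illustration of the fusion step (the real TSSF architecture may differ), the sketch below encodes the utterance and the slot descriptions with one weight-shared Transformer encoder and lets each sentence token attend to the slot semantics through cross-attention:

```python
# Minimal sketch of fusing slot-description semantics into sentence tokens via
# cross-attention (illustrative; not the paper's exact TSSF module).
import torch
import torch.nn as nn

class SlotSemanticsFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, sent_emb, slot_emb):
        # One weight-shared encoder for both the utterance and the slot descriptions.
        sent = self.shared_encoder(sent_emb)                      # (B, T, d)
        slots = self.shared_encoder(slot_emb)                     # (B, S, d)
        # Each sentence token attends to all slot descriptions.
        fused, _ = self.cross_attn(query=sent, key=slots, value=slots)
        return sent + fused                                       # slot-aware token representations

m = SlotSemanticsFusion()
print(m(torch.randn(2, 16, 256), torch.randn(2, 8, 256)).shape)   # torch.Size([2, 16, 256])
```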

07. Auditory Attention Detection in Real-Life Scenarios Using Common Spatial Patterns from EEG

Authors: Yang Kai, Xie Zhuang, Zhou Di, Wang Longbiao, Zhang Gaoyan

Affiliations: Tianjin University; School of Software, Henan University; Japan Advanced Institute of Science and Technology

Electroencephalography (EEG)-based auditory attention detection could be used in neurally steered hearing devices to help people with hearing loss. However, most previous studies collected EEG data in laboratory settings, which limits the practical application of such devices. In this study, we apply the Common Spatial Patterns (CSP) algorithm, widely used in the brain-computer interface field, to detect auditory attention from EEG recorded in real-life scenes while distinguishing the subjects' behavioral states (sitting and walking). The results show that with decision windows from 1 s to 30 s, the CSP method achieves detection accuracies of 81.3% to 87.5%, surpassing previous methods based on linear mapping and conventional CNNs, which demonstrates that CSP can effectively decode attention in real-life scenarios. An analysis of EEG sub-bands shows that the delta and beta bands are more active in the attention task, supporting the findings of previous studies.
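
For reference, a compact implementation of standard CSP (the textbook formulation via a generalized eigenproblem, not the paper's full pipeline) looks like this; the resulting log-variance features would then be fed to a simple classifier such as LDA or an SVM:

```python
# Compact sketch of Common Spatial Patterns for two attention classes
# (standard CSP via a generalized eigenproblem; not the paper's full pipeline).
import numpy as np
from scipy.linalg import eigh

def csp_filters(epochs_a, epochs_b, n_pairs=3):
    """epochs_*: (n_trials, n_channels, n_samples) EEG for each attended-speaker class."""
    cov = lambda e: np.mean([x @ x.T / np.trace(x @ x.T) for x in e], axis=0)
    Ca, Cb = cov(epochs_a), cov(epochs_b)
    # Generalized eigenvectors of Ca w.r.t. Ca + Cb; the extremes maximize the variance ratio.
    vals, vecs = eigh(Ca, Ca + Cb)
    order = np.argsort(vals)
    picks = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return vecs[:, picks].T                                  # (2*n_pairs, n_channels) spatial filters

def csp_features(epochs, W):
    proj = np.einsum('fc,ncs->nfs', W, epochs)               # apply the spatial filters
    var = proj.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))      # log-variance features

rng = np.random.default_rng(0)
a, b = rng.standard_normal((40, 32, 256)), rng.standard_normal((40, 32, 256))
W = csp_filters(a, b)
print(csp_features(a, W).shape)                              # (40, 6), ready for an LDA/SVM classifier
```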

08. Transvelar Nasal Coupling Contributing to Speaker Characteristics in Non-nasal Vowels

Authors: Zhu Ziyu, Chi Yujie, Zhang Zhao, Honda Kiyoshi, Wei Jianguo

Affiliations: Tianjin University

The structure of the nasal cavity remains stable during speech and differs between speakers, so nasal resonance plays an important role in forming individual speaker characteristics. Studies of the acoustic role of the nasal cavity have mainly addressed nasalized vowels, where the nasal cavity couples to the main vocal tract through the velopharyngeal port (VPO) and changes the vowels' acoustic characteristics. However, researchers have found that nasal resonance also occurs in non-nasal vowels through transvelar nasal coupling and has a non-negligible effect on their acoustic characteristics. In this paper, a set of experimental devices was designed to record the lip-radiated and nostril-radiated sounds separately, using a corpus of non-nasalized vowels, and spectral analysis was applied to explore the nasal resonance characteristics and their acoustic influence in non-nasal vowels. The results show that speaker differences in nasal resonance are distributed below 2 kHz: at lower frequencies they appear as two peaks with a null in between, and at higher frequencies as subtle, unevenly distributed nulls. In addition, mixing in the nostril-radiated sound lowers the first resonance peak of the lip-radiated vowel, and the degree of reduction varies across speakers.
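
The kind of spectral comparison described above can be sketched as follows, using Welch spectra of separately recorded lip- and nostril-radiated signals and inspecting the region below 2 kHz (synthetic signals stand in for the real recordings; this is only an illustration of the analysis, not the paper's procedure):

```python
# Illustrative spectral comparison of lip- vs nostril-radiated signals below 2 kHz.
import numpy as np
from scipy.signal import welch

fs = 16000
t = np.arange(fs) / fs
# Synthetic stand-ins for the two separately recorded channels.
lip = np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 1500 * t) + 0.01 * np.random.randn(fs)
nostril = 0.2 * np.sin(2 * np.pi * 900 * t) + 0.05 * np.random.randn(fs)

f, P_lip = welch(lip, fs=fs, nperseg=1024)
_, P_nose = welch(nostril, fs=fs, nperseg=1024)

low = f < 2000                                   # region where speaker differences were found
ratio_db = 10 * np.log10(P_nose[low] / P_lip[low])
print("peak nostril/lip level difference below 2 kHz: %.1f dB" % ratio_db.max())
```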

09. Effects of Tonal Coarticulation and Prosodic Positions on Tonal Contours of Low Rising Tones: In the Case of Xiamen Dialect

Authors: Hu Yiying, Feng Hui, Zhao Qinghua, Li Aijun

Affiliations: Tianjin University, Chinese Academy of Social Sciences

Paper: http://arxiv.org/abs/2306.02251

This paper studies the influence of tonal coarticulation and prosodic position on the low rising tone in Xiamen Hokkien and proposes a quantitative method, Tonal Contour Analysis in the Tonal Triangle (TCATT), for measuring the degree of inflection of a tone contour. The experimental results show that the low rising tone T2 in Xiamen Hokkien tends to become a falling-rising tone, and that tonal coarticulation and prosodic position significantly affect its tortuosity. The coarticulation effect is that when the syllable preceding T2 carries a high level tone, T2 is realized as a falling-rising tone with the greatest degree of inflection, whereas when the preceding syllable carries a low level or low falling tone, T2 is realized as a low rising tone. The prosodic effect is that when T2 occurs sentence-initially, the tortuosity of its contour is significantly greater than in sentence-medial or sentence-final position, and syllable duration is positively correlated with contour tortuosity.
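
The paper defines its inflection measure within the tonal triangle (TCATT); as a rough, generic illustration of how "bent" a contour is, one can compare the arc length of a normalized F0 curve with its chord length (an illustrative metric, not the paper's formula):

```python
# Rough illustration of quantifying the "tortuosity" of a tone contour:
# arc length of the normalized F0 curve divided by its chord length.
# This is an illustrative metric, not the paper's TCATT definition.
import numpy as np

def contour_tortuosity(f0_hz):
    t = np.linspace(0.0, 1.0, len(f0_hz))                       # normalized time
    z = (f0_hz - f0_hz.min()) / (f0_hz.max() - f0_hz.min())     # normalized pitch
    arc = np.sum(np.hypot(np.diff(t), np.diff(z)))
    chord = np.hypot(t[-1] - t[0], z[-1] - z[0])
    return arc / chord

t = np.linspace(0, 1, 50)
low_rising = 180 + 60 * t                       # monotone rise -> tortuosity close to 1
falling_rising = 200 - 80 * t + 160 * t**2      # dips then rises -> tortuosity above 1
print(contour_tortuosity(low_rising), contour_tortuosity(falling_rising))
```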

10. Improving Chinese Mandarin Speech Recognition using Semantic Graph Embedding Regularization

Authors: Lin Yangshi, Lu Wenhuan, Jia Yongzhe, Ma Guoning, Wei Jianguo

Affiliations: Tianjin University

In this paper, we investigate the role of semantic graph embeddings in end-to-end automatic speech recognition (ASR). We first describe how the semantic graph of Chinese characters is constructed: characters are the nodes, and edge weights are determined by how often character pairs co-occur in sentences and by the weights of the words those pairs form. Once the graph is built, a graph embedding method converts it into embedding vectors, which are used to regularize the decoder weights of the end-to-end ASR model. We believe the semantic information in these vectors helps the model capture the semantics encoded in the graph and the rules of the manually constructed word graph. On the AISHELL-1 dataset, the model achieves a character error rate of 4.36%, which drops to 4.25% after adding a language model. The experimental results show that this method can significantly reduce the character error rate of end-to-end ASR.
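
The regularization idea can be sketched as an auxiliary loss that pulls the ASR decoder's character embeddings toward pre-computed graph embeddings; the projection layer, dimensions, and loss weight below are illustrative assumptions, not the paper's recipe:

```python
# Hedged sketch: regularize ASR decoder character embeddings toward
# pre-computed semantic-graph embeddings via an auxiliary MSE loss.
import torch
import torch.nn as nn

vocab, d_model, d_graph = 4000, 256, 128
decoder_emb = nn.Embedding(vocab, d_model)          # part of the end-to-end ASR decoder
graph_emb = torch.randn(vocab, d_graph)             # stand-in for pre-trained graph embeddings
proj = nn.Linear(d_model, d_graph)                  # maps decoder space to graph-embedding space

def graph_regularizer(lam=0.1):
    return lam * nn.functional.mse_loss(proj(decoder_emb.weight), graph_emb)

asr_loss = torch.tensor(1.0, requires_grad=True)    # placeholder for the usual ASR training loss
total = asr_loss + graph_regularizer()
total.backward()                                    # gradients also flow into the decoder embeddings
print(decoder_emb.weight.grad.abs().mean())
```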

11. SOT: Self-supervised Learning-Assisted Optimal Transport for Unsupervised Adaptive Speech Emotion Recognition

Authors: Zhang Ruiteng, Wei Jianguo, Lu Xugang, Li Yongwei, Xu Junhai, Jin Di, Tao Jianhua

Affiliations: Tianjin University; Qinghai University for Nationalities; National Institute of Information and Communications Technology (NICT), Japan; Institute of Automation, Chinese Academy of Sciences; Tsinghua University

In cross-domain speech emotion recognition (SER), reducing the global probability distribution distance (GPDD) between domains plays a crucial role in unsupervised domain adaptation (UDA), and this distance can be naturally measured by optimal transport (OT). However, because emotion categories have large intra-class variance, samples from overlapping distributions may induce negative transfer. Furthermore, OT considers only the GPDD and therefore cannot effectively transfer indistinguishable samples without exploiting the local structure of the intra-class distribution. We propose a self-supervised learning (SSL)-assisted optimal transport (SOT) algorithm for cross-domain SER. First, we normalize the transport coupling of OT to alleviate negative transfer; then we design an SSL module that emphasizes local intra-class structure to help OT capture the knowledge that is otherwise hard to transfer. Experimental results on cross-domain SER show that SOT significantly outperforms state-of-the-art UDA algorithms.
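
For readers unfamiliar with OT, the sketch below computes an entropy-regularized transport plan between source- and target-domain feature sets with plain Sinkhorn iterations; it illustrates only the OT component, not the coupling normalization or the SSL module proposed in the paper:

```python
# Minimal Sinkhorn optimal-transport sketch between source- and target-domain features.
import numpy as np

def sinkhorn(Xs, Xt, reg=0.1, n_iter=200):
    a = np.full(len(Xs), 1.0 / len(Xs))                        # uniform source weights
    b = np.full(len(Xt), 1.0 / len(Xt))                        # uniform target weights
    M = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)       # squared-Euclidean cost matrix
    M /= M.max()                                               # scale for numerical stability
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):                                    # Sinkhorn-Knopp iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                         # transport coupling (plan)

rng = np.random.default_rng(0)
Xs, Xt = rng.standard_normal((30, 8)), rng.standard_normal((40, 8)) + 0.5
P = sinkhorn(Xs, Xt)
print(P.shape, P.sum())                                        # (30, 40), total mass close to 1
```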

12. Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions

Authors: Liu Yang, Sun Haoqin, Chen Geng, Wang Qingyue, Zhao Zhen, Lu Xugang, Wang Longbiao

Affiliations: Qingdao University of Science and Technology; National Institute of Information and Communications Technology (NICT), Japan; Tianjin University

In recent years, speech emotion recognition (SER) performance has improved remarkably. However, most systems are trained and tested on clean speech, and achieving good SER performance under noisy conditions remains challenging. To this end, we propose a multi-level knowledge distillation (MLKD) method that transfers knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. Specifically, we use clean-speech features extracted by wav2vec 2.0 as learning targets and train a distilled wav2vec 2.0 under noisy conditions to approach the feature extraction ability of the original model; in addition, the multi-level knowledge of the original wav2vec 2.0 supervises the output of each intermediate layer of the distilled model. Experiments on the IEMOCAP corpus with the NOISEX-92 noise library show that, compared with the baseline system, the proposed method improves unweighted accuracy (UA) by 18.23% on average across all noise types, a competitive result.
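
The multi-level supervision can be sketched as a layer-wise distillation loss between teacher and student hidden states (a schematic sketch; the wav2vec 2.0 specifics and layer mapping are assumptions, not the paper's code):

```python
# Schematic multi-level distillation loss: each student intermediate layer is
# supervised by the corresponding teacher layer via MSE.
import torch
import torch.nn as nn

def multi_level_kd_loss(teacher_layers, student_layers, weights=None):
    """Both arguments: lists of (B, T, D) hidden states; the teacher sees clean speech,
    the student sees the noisy version of the same utterances."""
    weights = weights or [1.0] * len(student_layers)
    loss = 0.0
    for w, t, s in zip(weights, teacher_layers, student_layers):
        loss = loss + w * nn.functional.mse_loss(s, t.detach())   # teacher is frozen
    return loss / len(student_layers)

# Toy tensors standing in for wav2vec 2.0 / distilled wav2vec 2.0 hidden states.
teacher = [torch.randn(2, 100, 768) for _ in range(4)]
student = [torch.randn(2, 100, 768, requires_grad=True) for _ in range(4)]
print(multi_level_kd_loss(teacher, student))
```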
