INTERSPEECH 2023 Paper | Accent Recognition with Persistent Accent Memory Based on Self-Supervised Learning Representations

Paper title:

Self-supervised Learning Representation based Accent Recognition with Persistent Accent Memory

Author list:

Rui Li, Zhiwei Xie, Haihua Xu, Yizhou Peng, Hexin Liu, Hao Huang, Eng Siong Chng

Research Background

Accent recognition (AR) is an important yet challenging task: an accent carries not only speaker-specific speech characteristics but also regional information, which can be crucial for speaker recognition [1] and speech recognition [2]. However, large-scale accent-labeled data are hard to obtain, making AR a low-resource task. To build a strong AR system, it is therefore necessary to make full use of both the available data and the model's capacity.

Overview of this paper

This paper improves AR performance from two perspectives. First, to alleviate the data-scarcity problem, we use self-supervised learning representations (SSLRs) extracted from the pre-trained WavLM model [3] to build the AR model. With SSLRs, the model achieves a significant performance boost over traditional acoustic features. Second, we propose a Persistent Accent Memory (PAM) that serves as contextual knowledge to bias the AR model: accent embeddings extracted from all training data by the AR model's encoder are clustered to form an accent codebook, the PAM. We further study multiple attention mechanisms to find the optimal use of PAM, and observe that the best performance is achieved by selecting only the most relevant accent embeddings.

1. To alleviate the problem of insufficient data, we use self-supervised learning representations (SSLRs) extracted from pre-trained models to build AR models.

Figure 1 Multi-task backbone model

Table 1 The accuracy of SSLRs extracted using WavLM on the test set

First, we train the model with SSLRs extracted by WavLM instead of the traditional acoustic feature Fbank. Systems 1-5 in Table 1 show that WavLM SSLRs significantly improve AR performance compared to systems trained from scratch on Fbank features. Second, models trained with SSLRs from the upper-middle encoder layers outperform those using SSLRs from the lower layers, with the best results achieved at layer 20. Finally, the per-accent accuracies in Table 1 show that SSLRs extracted by different encoder layers provide different useful information for different accents. This raises a question: how can we combine the useful information that different SSLR layers provide for different accents, so as to improve the accuracy on all accents?
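A common recipe for combining SSLRs from multiple encoder layers is a learnable softmax-weighted sum over all hidden states. The numpy sketch below illustrates the idea only; the layer count, feature dimension, and random features are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# Hypothetical hidden states from 24 WavLM encoder layers for one
# utterance, each of shape (frames, dim).
rng = np.random.default_rng(0)
layers = np.stack([rng.normal(size=(100, 1024)) for _ in range(24)])  # (24, T, D)

# Per-layer scalar weights (learnable in practice; initialized uniformly
# here). The weighted sum lets a downstream AR model decide which layers
# carry the most accent-relevant information.
w = np.zeros(24)
combined = np.tensordot(softmax(w), layers, axes=1)  # (T, D)
print(combined.shape)  # (100, 1024)
```

With uniform weights this reduces to a plain average over layers; during training the weights would be optimized jointly with the AR model.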

2. We propose a Persistent Accent Memory (PAM) as contextual knowledge to bias AR models.

Specifically, PAM is a codebook of 256 embeddings, obtained by clustering the encoder outputs of the AR model (trained with WavLM SSLRs) over the training set. The training set contains 8 accents; for each accent, we aggregate its utterance embeddings into 32 cluster centroids using k-means, yielding 256 embeddings in total, which form the PAM. Here, "persistent" means that these 256 embeddings are not updated during training.
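The PAM construction described above can be sketched as follows. This is a minimal numpy-only k-means (in practice a library implementation such as scikit-learn's KMeans would be used); the embedding dimension and toy data are assumptions for illustration.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means returning k centroids of the rows of x."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def build_pam(embeddings_by_accent, per_accent=32):
    """Cluster each accent's utterance embeddings; stack all centroids."""
    codebook = [kmeans(e, per_accent) for e in embeddings_by_accent]
    return np.concatenate(codebook, axis=0)  # (num_accents * per_accent, dim)

# Toy data: 8 accents, 500 utterance embeddings each, dim 128 (assumed).
rng = np.random.default_rng(0)
data = [rng.normal(size=(500, 128)) for _ in range(8)]
pam = build_pam(data)
print(pam.shape)  # (256, 128): 8 accents x 32 centroids
```

Once built, the codebook is frozen ("persistent"): it is used as attention memory but receives no gradient updates.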

3. In order to exploit the accent context information, we experimented with various attention mechanisms.

Figure 2 Different attention mechanisms

(1) Frame-level cross-attention fusion: the encoder output serves as the query, and PAM serves as the key and value; attention operates at the frame level.

(2) Utterance-level cross-attention fusion: the encoder output is pooled to the utterance level, since PAM is also at the utterance level. Query, key, and value are thus all at the same utterance level, giving the attention clear semantics.

(3) Concatenated-PAM self-attention fusion: the encoder output and PAM are concatenated along the time dimension, and self-attention is applied over the entire sequence. The motivation is to bias the encoder output with the accent context to improve AR performance.
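The two cross-attention variants above can be sketched with single-head scaled dot-product attention in numpy. Shapes and random inputs are illustrative assumptions; the actual model would use learned query/key/value projections and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, memory):
    """Single-head scaled dot-product attention of query over memory.

    query:  (T, D) encoder frames (frame level), or (1, D) after pooling
            (utterance level).
    memory: (M, D) PAM codebook, used as both key and value.
    """
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)   # (T, M) similarity scores
    return softmax(scores) @ memory          # (T, D) accent context

rng = np.random.default_rng(0)
enc = rng.normal(size=(100, 128))   # hypothetical encoder output
pam = rng.normal(size=(256, 128))   # hypothetical PAM codebook

frame_ctx = cross_attention(enc, pam)                        # variant (1)
utt_ctx = cross_attention(enc.mean(0, keepdims=True), pam)   # variant (2)
print(frame_ctx.shape, utt_ctx.shape)  # (100, 128) (1, 128)
```

Variant (3) instead concatenates `enc` and `pam` along the time axis, e.g. `np.concatenate([enc, pam], axis=0)`, and runs self-attention over the joint sequence.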

4. In order to make better use of accent context information, we propose the N-best persistent accent memory selection method.

A limitation of the attention mechanisms above is that they consider all embeddings in PAM, which introduces redundancy: during training, the model should only need the embeddings that match or are similar to the current accent. We therefore propose the N-best persistent accent memory selection method, where N denotes the number of embeddings selected from PAM according to the similarity scores between the PAM embeddings and the encoder output. The architecture is shown in Figure 3.
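A minimal sketch of the N-best selection step, assuming cosine similarity between the pooled encoder output and each codebook entry as the scoring function (the paper's exact similarity score may differ):

```python
import numpy as np

def n_best_pam(encoder_out, pam, n=64):
    """Keep only the n PAM embeddings most similar to this utterance.

    encoder_out: (T, D) frame-level encoder output.
    pam:         (M, D) frozen accent codebook.
    Returns the (n, D) subset passed on to the attention fusion.
    """
    query = encoder_out.mean(axis=0)  # pooled utterance embedding
    sims = pam @ query / (np.linalg.norm(pam, axis=1)
                          * np.linalg.norm(query) + 1e-8)
    top = np.argsort(sims)[-n:]       # indices of the n highest scores
    return pam[top]

rng = np.random.default_rng(0)
enc = rng.normal(size=(100, 128))   # hypothetical encoder output
pam = rng.normal(size=(256, 128))   # hypothetical PAM codebook
selected = n_best_pam(enc, pam, n=64)
print(selected.shape)  # (64, 128)
```

The selected subset then replaces the full PAM as key/value in the attention fusion, reducing both redundancy and computation.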

Figure 3 N-best persistent accent memory selection method

Experimental results

Table 2 shows the results of all attention-based methods. To verify the effectiveness and generality of the proposed method, we construct PAM from three sources: "Oracle" builds PAM from embeddings extracted, for each accent, by the best-performing AR model on that accent; the other two use the last-layer output and the weighted sum over all layers, denoted "layer-24" and "layers:1-24" respectively. All methods improve over the baseline, and the N-best selection method achieves the best performance.

Table 2 Accuracy on the test set using PAM

Table 3 The role of N in the N-best PAM selection method

Furthermore, we investigate the effect of N on the N-best selection method, as shown in Table 3. The model reaches its highest accuracy at N = 64; a larger N does not necessarily yield better performance, while it does increase computational complexity.

Conclusion

In this work, we incorporate self-supervised learning representations (SSLRs) into our proposed Persistent Accent Memory (PAM) method to improve AR. We use SSLRs extracted from a pre-trained WavLM model to address the data insufficiency of the accent recognition task. SSLRs yield significant performance gains over traditional acoustic features, demonstrating their effectiveness for accent recognition. Furthermore, we propose a PAM approach with various attention mechanisms to improve accent recognition. We demonstrate the effectiveness of the proposed method on a public accent benchmark dataset, and the best-performing system, which selects the N most relevant embeddings from the persistent accent memory, achieves further improvements.

References

[1] S. Shon, H. Tang, and J. Glass, “Frame-level speaker embeddings for text-independent speaker recognition and analysis of end-to-end model,” in Proc. SLT 2018. IEEE, 2018, pp. 1007–1013.

[2] X. Gong, Y. Lu, Z. Zhou, and Y. Qian, “Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition,” in Proc. INTERSPEECH 2021, 2021, pp. 1274–1278.

[3] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
