ICASSP 2023 paper code open-sourced | TOLD: a speaker diarization framework capable of modeling overlapping speech

The goal of the speaker diarization (SD) task is to detect the time intervals of speech activity for each speaker, that is, to answer the question "who spoke when".

Traditional speaker diarization systems are usually built on clustering algorithms and generally involve the following steps: (1) use voice activity detection to segment the original audio into speech segments; (2) extract a speaker embedding from each segment with a speaker embedding model; (3) use a clustering method such as K-means to group segments belonging to the same speaker. However, these clustering methods are often unsupervised and cannot directly minimize the diarization error, leading to suboptimal results. Although supervised clustering methods have since been proposed, both unsupervised and supervised clustering assume that each speech segment corresponds to a single speaker, so they cannot handle overlapping speech.
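
As a concrete illustration, here is a minimal sketch of such a clustering-based pipeline. The helpers `vad_segments` and `extract_embedding` are hypothetical placeholders for a VAD module and a pretrained speaker-embedding model, not APIs of any specific library; only the K-means step is a real scikit-learn call.

```python
# Minimal sketch of a conventional clustering-based diarization pipeline.
# `vad_segments` and `extract_embedding` are hypothetical helpers standing in
# for any VAD module and any pretrained speaker-embedding model.
import numpy as np
from sklearn.cluster import KMeans

def cluster_diarize(wav, sample_rate, num_speakers):
    # (1) VAD: split the recording into speech segments (start, end) in samples
    segments = vad_segments(wav, sample_rate)
    # (2) one speaker embedding per segment
    embeddings = np.stack([extract_embedding(wav[s:e]) for s, e in segments])
    # (3) group segments by speaker with K-means
    labels = KMeans(n_clusters=num_speakers, n_init=10).fit_predict(embeddings)
    # Each segment receives exactly one speaker label, which is why this
    # pipeline cannot represent overlapping speech.
    return [(s, e, int(spk)) for (s, e), spk in zip(segments, labels)]
```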

To handle overlapping speech, end-to-end neural diarization (EEND) reformulates speaker diarization as a multi-label classification problem, which allows the diarization error to be optimized directly and overlapping speech to be handled. Building on this, the encoder-decoder based attractor (EDA) module was introduced into EEND to cope with an unknown number of speakers, and several two-stage methods have since been proposed to further improve diarization performance.
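
For reference, the multi-label formulation mentioned above can be written roughly as follows (the notation is ours, not taken from the paper): given acoustic features $X = (x_1, \dots, x_T)$, EEND predicts, for every frame $t$ and speaker $s$,

$$\hat{y}_{t,s} = P\left(y_{t,s} = 1 \mid X\right), \qquad y_{t,s} \in \{0, 1\},$$

and is trained with a permutation-invariant binary cross-entropy over all frames and speakers. Overlap is represented simply by allowing several $y_{t,s}$ to equal 1 at the same frame $t$.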

Recently, the paper "TOLD: A NOVEL TWO-STAGE OVERLAP-AWARE FRAMEWORK FOR SPEAKER DIARIZATION" from the Speech Lab of Alibaba DAMO Academy was accepted by ICASSP 2023. It is the lab's latest work on speaker diarization and explores how to explicitly model overlapping speech.

The code accompanying the paper has been open-sourced in FunASR, the code repository of the Speech Lab of DAMO Academy.

  Paper title: TOLD: A NOVEL TWO-STAGE OVERLAP-AWARE FRAMEWORK FOR SPEAKER DIARIZATION

  Authors: Wang Jiaming, Du Zhihao, Zhang Shiliang

  Paper address: https://arxiv.org/abs/2303.05397

  Code repository: https://github.com/alibaba-damo-academy/FunASR/tree/main/egs/callhome/TOLD

In this paper, we first propose an overlap-aware end-to-end model, EEND-OLA, that explicitly models overlapping speech: power set encoding (PSE) converts speaker diarization from a multi-label classification problem into a single-label classification problem, explicitly modeling the dependencies and overlaps between speakers. On this basis, we introduce the speaker overlap-aware post-processing model SOAP to further refine the results of EEND-OLA, which yields TOLD (Two-stage OverLap-aware Diarization framework), a two-stage framework that explicitly models overlapping speech.

Specifically, in the first stage, EDA is used to extract speaker embeddings, and the speaker order is determined by minimizing the permutation-invariant training (PIT) loss; on this basis, a single PSE label replaces the multi-dimensional 0/1 labels, which allows overlapping speech to be modeled better. In the second stage, based on the diarization results of the first stage, speaker embeddings are extracted from the non-overlapping speech segments and fed, together with the acoustic features, into the post-processing model SOAP to obtain more accurate diarization results.

▎ Two-stage overlap-aware speaker diarization framework TOLD

[Figure: overall framework of TOLD, with the first-stage EEND-OLA on the left and the second-stage SOAP on the right]

>>> Stage 1: the overlap-aware end-to-end model EEND-OLA

In the first stage of TOLD (the left half of the framework figure above), the input features are first passed through a Transformer encoder to obtain acoustic representations, which are then fed into an EDA module composed of two LSTM layers to predict the number of speakers in the input audio and one speaker embedding per speaker (called attractors in the EDA paper). The similarity between each attractor and the acoustic representations is then computed, and the speaker order is determined by minimizing the PIT loss (see the paper for details). Once the speaker order is fixed, the original multi-dimensional 0/1 labels can be converted into single class labels through power set encoding (PSE):

[Figure: converting a frame's multi-dimensional 0/1 labels into a single PSE class label]
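
Although the original figure is not reproduced here, the idea of the conversion can be shown with a small sketch: treating each frame's ordered 0/1 speaker-activity vector as a binary number yields a single class index. This illustrates the principle only; the exact label set used in the paper (for example, any cap on the number of simultaneously active speakers) may differ.

```python
# Minimal sketch of power set encoding (PSE): a frame's multi-label 0/1
# speaker-activity vector is mapped to one class index and back.
def pse_encode(activity):
    """activity: list of 0/1 flags, one per (ordered) speaker, for one frame."""
    label = 0
    for s, active in enumerate(activity):
        label += active << s  # speaker s contributes bit s
    return label

def pse_decode(label, num_speakers):
    """Inverse mapping: recover the 0/1 activity vector from a PSE label."""
    return [(label >> s) & 1 for s in range(num_speakers)]

# Example: with 3 speakers, a frame where speakers 0 and 2 overlap gets the
# single label 0b101 = 5 instead of the multi-label vector [1, 0, 1].
assert pse_encode([1, 0, 1]) == 5
assert pse_decode(5, 3) == [1, 0, 1]
```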

To predict the PSE labels, we use an additional LSTM layer to better exploit contextual information. Since the LSTM requires a fixed input feature dimension, the speaker embeddings produced by EDA are first padded with zero vectors up to a fixed maximum number of speakers and then, frame by frame and in speaker order, concatenated with the acoustic features; the resulting sequence is fed into the LSTM, and a linear layer produces the final PSE classification. EEND-OLA is optimized by minimizing the cross-entropy loss between the predicted and ground-truth PSE labels.
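
The following PyTorch-style sketch shows one plausible wiring of this prediction head. The dimensions (`D`, `S_MAX`), layer sizes, and the exact way frame features and padded attractors are combined are illustrative assumptions; the released FunASR code is the authoritative implementation.

```python
# Rough sketch of a PSE prediction head: padded attractors are concatenated to
# every frame feature, an LSTM adds context, and a linear layer outputs one
# logit per PSE class. All sizes are assumptions for illustration.
import torch
import torch.nn as nn

D = 256                # frame-feature / attractor dimension (assumed)
S_MAX = 4              # attractors are zero-padded to this fixed speaker count
NUM_PSE = 2 ** S_MAX   # one class per subset of active speakers

class PSEHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=D * (1 + S_MAX), hidden_size=256,
                            batch_first=True)
        self.out = nn.Linear(256, NUM_PSE)

    def forward(self, frames, attractors):
        # frames: (B, T, D) acoustic features from the Transformer encoder
        # attractors: (B, S, D) ordered speaker embeddings from EDA, S <= S_MAX
        B, T, _ = frames.shape
        pad = torch.zeros(B, S_MAX - attractors.size(1), D,
                          dtype=attractors.dtype, device=attractors.device)
        att = torch.cat([attractors, pad], dim=1)            # (B, S_MAX, D)
        att = att.reshape(B, 1, S_MAX * D).expand(B, T, S_MAX * D)
        x = torch.cat([frames, att], dim=-1)                 # (B, T, D*(1+S_MAX))
        h, _ = self.lstm(x)
        return self.out(h)  # (B, T, NUM_PSE) logits, trained with cross-entropy
```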

>>> Stage 2: the overlap-aware post-processing model SOAP

As shown in the right half of the framework figure, non-overlapping speech segments are first selected according to the first-stage results and used to extract each speaker's voiceprint embedding. We use a ResNet34 speaker-verification model pre-trained on Switchboard and CALLHOME as the voiceprint extractor, and also use it to initialize the network parameters of the SOAP encoder. Although the SOAP encoder has the same network structure as the voiceprint extractor, its input is not the entire recording but short segments with a window length of 1.6 s and a window shift of 0.8 s, so as to obtain finer-grained speaker acoustic features at each moment.
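
As a small worked example of this windowing (nothing beyond arithmetic on the 1.6 s / 0.8 s figures above), the helper below is a sketch with parameter names of our own choosing:

```python
# Sliding-window segmentation used to feed the SOAP encoder: 1.6 s windows
# shifted by 0.8 s. Names and defaults here are illustrative, not from FunASR.
def sliding_windows(num_samples, sample_rate=8000, win_sec=1.6, shift_sec=0.8):
    win, shift = int(win_sec * sample_rate), int(shift_sec * sample_rate)
    starts = range(0, max(num_samples - win, 0) + 1, shift)
    return [(s, s + win) for s in starts]

# Example: a 4-second recording at 8 kHz (telephone speech such as CALLHOME)
# yields [(0, 12800), (6400, 19200), (12800, 25600), (19200, 32000)].
print(sliding_windows(4 * 8000))
```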

Given the acoustic representations and each speaker's voiceprint features at every moment, the context-dependent (CD) and context-independent (CI) scorers are used to model local and global speaker characteristics, respectively, and their outputs are used to predict speaker activity at each moment. The CI scorer is a three-layer fully connected network (DNN). For the CD scorer, we use multi-head self-attention (MHSA) to model each speaker's context-dependent speech probability at each moment. Based on the speaking probabilities predicted by the two scorers, an LSTM network models the dependencies and overlaps between speakers and predicts the posterior probabilities of the PSE labels.
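
The sketch below illustrates the general shape of the two scorers in PyTorch. The dimensions, layer sizes, and the way voiceprints enter each scorer are assumptions made for illustration (the LSTM that combines the scores into PSE posteriors is omitted); the actual SOAP implementation lives in the FunASR repository.

```python
# Illustrative CI/CD scorers: the CI scorer judges each (frame, speaker) pair
# independently with a small DNN; the CD scorer adds temporal context with
# multi-head self-attention. Sizes and wiring are assumptions.
import torch
import torch.nn as nn

D = 256  # frame-feature / voiceprint dimension (assumed)

class CIScorer(nn.Module):
    """Context-independent scorer: a three-layer DNN applied per frame and speaker."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * D, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, frames, spk_emb):
        # frames: (T, D) SOAP-encoder features; spk_emb: (S, D) voiceprints
        T, S = frames.size(0), spk_emb.size(0)
        f = frames.unsqueeze(1).expand(T, S, D)
        e = spk_emb.unsqueeze(0).expand(T, S, D)
        return self.net(torch.cat([f, e], dim=-1)).squeeze(-1)  # (T, S) scores

class CDScorer(nn.Module):
    """Context-dependent scorer: multi-head self-attention over frames, per speaker."""
    def __init__(self):
        super().__init__()
        self.inp = nn.Linear(2 * D, D)
        self.attn = nn.MultiheadAttention(embed_dim=D, num_heads=4,
                                          batch_first=True)
        self.out = nn.Linear(D, 1)

    def forward(self, frames, spk_emb):
        # frames: (T, D); spk_emb: (S, D)
        T = frames.size(0)
        scores = []
        for s in range(spk_emb.size(0)):
            e = spk_emb[s].unsqueeze(0).expand(T, D)
            x = self.inp(torch.cat([frames, e], dim=-1)).unsqueeze(0)  # (1, T, D)
            ctx, _ = self.attn(x, x, x)
            scores.append(self.out(ctx).squeeze(0))                    # (T, 1)
        return torch.cat(scores, dim=-1)  # (T, S) context-aware scores
```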

We adopt a multi-task learning strategy to train SOAP. The main objective is to minimize the cross-entropy loss between the posterior probabilities predicted by the network and the ground-truth PSE labels; the auxiliary objective encourages the CI and CD scorers to predict each speaker's individual speech activity as accurately as possible, and is computed as the binary cross-entropy between the CI/CD scores and the multi-dimensional 0/1 labels. The balance factor between the two objectives is set to 0.1.
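
Written out in our own notation, the objective described in this paragraph takes roughly the form

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}\left(\hat{z}, z\right) + 0.1 \left( \mathcal{L}_{\mathrm{BCE}}\left(\hat{y}^{\mathrm{CI}}, y\right) + \mathcal{L}_{\mathrm{BCE}}\left(\hat{y}^{\mathrm{CD}}, y\right) \right),$$

where $z$ denotes the ground-truth PSE labels, $\hat{z}$ the predicted posteriors, $y$ the multi-dimensional 0/1 speaker-activity labels, $\hat{y}^{\mathrm{CI}}$ and $\hat{y}^{\mathrm{CD}}$ the outputs of the two scorers, and 0.1 the balance factor.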

▎Experimental results

>>> Comparison of EEND-OLA with other end-to-end diarization models

We verify the effectiveness of TOLD on the widely used public dataset CALLHOME. First, to verify the effect of PSE, we compare the first-stage model EEND-OLA with other end-to-end models; the results are shown in Table 1.

In the two-speaker case, EEND-OLA performs on par with CB-EEND, a model that can only handle a fixed number of speakers. When all speaker counts are considered, EEND-OLA achieves a relative improvement of 13.49% in DER over EEND-EDA, demonstrating the effectiveness of explicitly modeling overlapping speech.

Table 1 Performance comparison of different one-stage models on the CALLHOME test set

>>> Comparison of TOLD with other two-stage diarization models

Table 2 compares TOLD with other two-stage models on the CALLHOME dataset. Apart from TOLD, the other two-stage models all combine an EEND model with a clustering method; in contrast, both stages of TOLD are neural networks, which opens up the possibility of later unifying them into a single end-to-end system. Overall, TOLD achieves a DER of 10.14% on the CALLHOME test set, a new state-of-the-art result on this benchmark.

Table 2 Performance comparison of different two-stage models on the CALLHOME test set

>>> Performance comparison of different model combinations across the two stages

To further explore how the choice of model in each stage affects performance, we designed ablation experiments; the results are shown in Table 3. For the first stage we compare VBx, EEND-EDA, and EEND-OLA, and for the second stage we compare TS-VAD and SOAP. From the results in Tables 1 and 3, we find that, with the same second-stage model, a lower first-stage DER leads to better overall system performance. Meanwhile, no matter which second-stage model is used, the proposed EEND-OLA consistently outperforms VBx and EEND-EDA. Furthermore, no matter which model serves as the first stage, SOAP yields a lower DER than TS-VAD, suggesting that SOAP may be better suited for post-processing.

Table 3 Performance comparison on the CALLHOME test set using different methods at different stages

▎Future Work

In this paper, we first proposed EEND-OLA, which uses power set encoding to recast speaker diarization from a multi-label prediction problem into a single-label classification problem, thereby explicitly modeling speaker dependencies and overlapping speech. Inspired by recently proposed two-stage hybrid systems, we further proposed the TOLD framework, which refines the initial diarization results of EEND-OLA with the overlap-aware post-processing model SOAP and achieves a new state-of-the-art result. By comparing different methods across the two stages, we also demonstrated the advantages of EEND-OLA and SOAP over other models. In the future, we will try to further improve TOLD by adding adaptive-granularity modeling so that it can handle recordings of different durations.

References:

[1] Zhihao Du, Shiliang Zhang, Siqi Zheng, and Zhijie Yan, “Speaker overlap-aware neural diarization for multi-party meeting analysis,” in EMNLP, 2022.

[2] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, and Shinji Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” in INTERSPEECH, 2019, pp. 4300–4304.

[3] Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Kenji Nagamatsu, “End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors,” in INTERSPEECH, 2020, pp. 269–273.

[4] Chunlei Zhang, Jiatong Shi, Chao Weng, Meng Yu, and Dong Yu, “Towards end-to-end speaker diarization with generalized neural speaker clustering,” in ICASSP, 2022, pp. 8372–8376.

[5] Shota Horiguchi, Shinji Watanabe, Paola García, Yawen Xue, Yuki Takashima, and Yohei Kawaguchi, “Towards neural diarization for unlimited numbers of speakers using global and local attractors,” in ASRU, 2021, pp. 98–105.
