A Review of Speaker Diarization: Recent Advances with Deep Learning

Paper: A Review of Speaker Diarization: Recent Advances with Deep Learning

Speaker Diarization (SD)

  1. Definition: The goal of SD is to label audio or video recordings with classes corresponding to speaker identity; in short, it is the task of identifying "who spoke when"
  2. This paper:
    1. Reviews the historical development of SD technology and the latest neural-network-based advances
    2. Discusses the integration of SD with downstream tasks such as ASR

Introduction

During diarization, the audio data is segmented and clustered into groups of speech segments that share the same speaker identity/label; this process usually requires no prior knowledge about the speakers.

A traditional SD system consists of several independent sub-modules:

Front-end processing includes speech enhancement, dereverberation, speech separation, target-speaker extraction, etc. VAD then removes non-speech segments, and the raw audio is converted into acoustic features or embedding vectors. In the clustering stage, the converted speech segments are grouped and labeled, followed by a refinement (re-segmentation) step. Each of the above sub-modules is trained and optimized independently.

Historical development of SD

  1. Early systems used the generalized likelihood ratio (GLR) and the Bayesian Information Criterion (BIC) for speaker change point detection and as clustering distances between speech segments
  2. Improved methods include beamforming, information bottleneck clustering (IBC), variational Bayes (VB), joint factor analysis (JFA), etc.
  3. After the i-vector was proposed, it successfully replaced MFCC and other raw features and was combined with PCA, VB-GMM, PLDA, etc.
  4. Later, neural networks were used to extract speaker embeddings such as the d-vector and x-vector
  5. End-to-end neural diarization, which replaces every module of the traditional SD pipeline with a neural network, is very promising

Motivation

The paper An overview of automatic speaker diarization systems (2006) gives an overview of different SD systems and subtasks for broadcast news and CTS data, covering the 1990s and early 2000s.

The paper Speaker diarization: A review of recent research (2012) focuses more on SD for meeting recordings and its specific challenges.

It focuses more on techniques for mitigating the problems of the meeting environment: in meeting settings there are often more participants than in broadcast news or CTS data, and multimodal data is often available.

SD Overview and Classification

SD systems can be divided into four categories based on two criteria:

Criterion 1: Whether the SD model is trained based on the SD objective function
Criterion 2: Whether multiple modules are jointly optimized for an objective function

Diarization evaluation metrics

  1. Diarization Error Rate (DER): $\mathrm{DER}=\frac{\text{FA}+\text{Missed}+\text{Speaker-Confusion}}{\text{Total Duration of Time}}$. To obtain the one-to-one mapping between reference and hypothesis speakers, the Hungarian algorithm is adopted. (A frame-level computation sketch is given after this list.)
  2. Jaccard Error Rate (JER): $\mathrm{JER}=\frac{1}{N_{\text{ref}}}\sum_i^{N_{\text{ref}}}\frac{\mathrm{FA}_i+\mathrm{MISS}_i}{\mathrm{TOTAL}_i}$. The error rate of each reference speaker is computed first and then averaged to obtain the JER.

DER may exceed 100%; JER cannot.

  3. Word-level Diarization Error Rate (WDER): measures diarization errors at the word level (i.e., words attributed to the wrong speaker)
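As a rough illustration of the DER definition above, here is a minimal frame-level scoring sketch (a toy implementation, not the NIST md-eval or dscore tool; collars and time weighting are ignored, and `frame_der`, `ref`, `hyp` are hypothetical names):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def frame_der(ref, hyp):
    """Toy frame-level DER. ref/hyp: (T, n_spk) binary speech-activity
    matrices; multiple 1s per frame encode overlapping speech."""
    # Hungarian algorithm: one-to-one speaker mapping maximizing co-activity
    overlap = ref.T @ hyp
    ri, hi = linear_sum_assignment(-overlap)
    n_correct = np.zeros(len(ref), dtype=int)
    for r, h in zip(ri, hi):
        n_correct += ref[:, r] & hyp[:, h]         # correctly attributed speakers per frame
    n_ref, n_hyp = ref.sum(axis=1), hyp.sum(axis=1)
    # FA + Missed + Speaker-Confusion, accumulated per frame
    errors = np.maximum(n_ref, n_hyp) - n_correct
    return errors.sum() / n_ref.sum()              # normalized by total reference speech

ref = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])   # 4 frames, 2 reference speakers
hyp = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])   # labels permuted, overlap partly missed
print(frame_der(ref, hyp))                         # 0.2
```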

Individual modules of the SD system

This section gives an overview of SD algorithms whose modules are trained with objectives other than the SD objective, i.e., the principles behind each independent module.

Front-end processing

Let $s_{i,f,t} \in \mathbb{C}$ denote the STFT feature of speaker $i$ at frequency bin $f$ and time $t$. The noisy observed signal can then be expressed as a combination of the source signals, the impulse responses $h_{i,f,\tau} \in \mathbb{C}$, and additive noise $n_{t,f} \in \mathbb{C}$:

$$x_{t,f}=\sum_{i=1}^{K}\sum_{\tau} h_{i,f,\tau}\, s_{i,f,t-\tau}+n_{t,f}$$

where $K$ denotes the number of speakers in the audio.

The goal of front-end processing is, given the observed signal $\mathbf{X}=\left(\{x_{t,f}\}_f\right)_t$, to estimate the original signals $\hat{\mathbf{x}}_{i,t}$:

$$\hat{\mathbf{x}}_{i,t}=\operatorname{FrontEnd}(\mathbf{X}), \quad i=1,\ldots,K$$

where $\hat{\mathbf{x}}_{i,t} \in \mathbb{C}^{D}$ denotes the estimated STFT features of speaker $i$.

Speech Enhancement and Noise Reduction

Speech enhancement suppresses the noise component of noisy speech. An LSTM-based speech enhancement method is $\hat{\mathbf{x}}_t=\operatorname{LSTM}(\mathbf{X})$. This is a regression-based method trained by minimizing $\mathcal{L}_{\mathrm{MSE}}=\left\|\mathbf{s}_t-\hat{\mathbf{x}}_t\right\|^{2}$, where $\mathbf{s}_t$ is the clean target spectral feature (typically a log-power spectrum).
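A minimal sketch of such a regression-based enhancer, assuming PyTorch and toy spectral features (layer sizes and the training snippet are illustrative choices, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    """Map noisy spectral frames X to an estimate of the clean frames s_t."""
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_bins)

    def forward(self, noisy):                 # noisy: (batch, T, n_bins)
        h, _ = self.lstm(noisy)
        return self.proj(h)                   # enhanced estimate x_hat_t

model = LSTMEnhancer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.randn(4, 100, 257)              # toy batch of noisy spectra
clean = torch.randn(4, 100, 257)              # toy clean targets s_t
loss = nn.functional.mse_loss(model(noisy), clean)   # L_MSE = ||s_t - x_hat_t||^2
optimizer.zero_grad(); loss.backward(); optimizer.step()
```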

Dereverberation

Dereverberation is typically implemented with statistical signal processing methods. The most widely used is dereverberation based on the weighted prediction error (WPE), which decomposes the reverberant signal into an early response and late reverberation:

$$x_{t,f}=\sum_{\tau} h_{f,\tau} s_{f,t-\tau}=x_{t,f}^{\text{early}}+x_{t,f}^{\text{late}}$$

WPE estimates filter coefficients $\hat{h}_{f,\tau}^{\mathrm{wpe}} \in \mathbb{C}$ by maximum likelihood estimation so as to preserve the early response while suppressing the late reverberation:

$$\hat{x}_{t,f}^{\mathrm{early}}=x_{t,f}-\sum_{\tau=\Delta}^{L} \hat{h}_{f,\tau}^{\mathrm{wpe}}\, x_{f,t-\tau}$$

where $\Delta$ denotes the number of skipped frames (the prediction delay) and $L$ is the filter length.

The advantage of WPE is that it is based on linear filtering, does not introduce signal distortion, and can be safely used in downstream tasks.
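To make the subtraction formula concrete, here is a minimal single-channel sketch that only applies pre-estimated filter coefficients; the MLE-based estimation of the coefficients, which is the core of WPE, is not shown (array shapes and the default delay are assumptions for illustration):

```python
import numpy as np

def apply_wpe_filter(X, H, delta=3):
    """X: complex STFT of shape (T, F).
    H: filter coefficients h^wpe_{f,tau} of shape (L - delta + 1, F),
       covering the taps tau = delta, ..., L (delta = prediction delay)."""
    T, _ = X.shape
    X_early = X.copy()
    for k, tau in enumerate(range(delta, delta + H.shape[0])):
        # x_early_{t,f} = x_{t,f} - sum_tau h^wpe_{f,tau} * x_{t-tau,f}
        X_early[tau:] -= H[k] * X[:T - tau]
    return X_early
```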

Speech separation

The effectiveness of beamforming-based multi-channel speech separation has been extensively demonstrated. For example, in the CHiME-6 challenge, the guided source separation (GSS) based multi-channel speech extraction technique achieves the best results.

In contrast, single-channel speech separation usually does not show significant gains in realistic multi-speaker scenarios. Single-channel separation systems typically produce spurious non-speech or duplicated speech signals in non-overlapping regions, and this audio "leakage" leads to many false alarms.

Some papers proposed a leakage filtering method to solve this problem, and after adopting this method, the performance of SD was observed to improve significantly.

Voice Activity Detection (VAD)

VAD: Distinguishes speech from non-speech such as background noise.

A VAD (also called SAD) system consists of two main parts. The first is front-end feature extraction, using acoustic features such as zero-crossing rate, pitch, signal energy, higher-order statistics of the linear predictive coding residual, or MFCCs. The second is a classifier that predicts whether an input frame contains speech. Classifiers include traditional GMM- and HMM-based models as well as DNN-, CNN-, and LSTM-based models.

The performance of VAD greatly affects the performance of SD, because it may produce many false alarms (silence segments mistakenly labeled as speech) or miss some speech segments.

It is common practice in SD tasks to report DER under an "oracle VAD" setting, meaning that the system uses the ground-truth speech/non-speech segmentation instead of its own VAD output.

Segmentation

In the context of SD, speech segmentation splits the input audio into speaker-homogeneous segments; the output unit of the SD system is therefore determined by the segmentation process. Speech segmentation for SD falls into two categories:

  • Segmentation based on speaker change point detection
  • Uniform segmentation

Segmentation based on speaker change point detection was standard in early SD systems. Speaker change points are detected by testing two hypotheses: $H_0$ assumes that the left and right speech windows come from the same speaker, while $H_1$ assumes that they come from different speakers. A metric-based method is used for hypothesis testing. Assuming the speech features follow a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, the two hypotheses can be expressed as:

$$\begin{aligned} H_0: & \ \mathbf{x}_1 \cdots \mathbf{x}_N \sim \mathcal{N}(\mu, \Sigma) \\ H_1: & \ \mathbf{x}_1 \cdots \mathbf{x}_i \sim \mathcal{N}\left(\mu_1, \Sigma_1\right) \\ & \ \mathbf{x}_{i+1} \cdots \mathbf{x}_N \sim \mathcal{N}\left(\mu_2, \Sigma_2\right) \end{aligned}$$

where $\left(\mathbf{x}_i \mid i=1, \cdots, N\right)$ is the speech feature sequence. Available metrics include the Kullback-Leibler (KL) distance, the generalized likelihood ratio (GLR), and the BIC.

BIC is the most commonly used; the BIC value between the two models corresponding to the two hypotheses is:

$$\mathrm{BIC}(i)=N \log |\Sigma|-N_1 \log \left|\Sigma_1\right|-N_2 \log \left|\Sigma_2\right|-\lambda P$$

where $P$ is the penalty term, $P=\frac{1}{2}\left(d+\frac{1}{2} d(d+1)\right) \log N$, $d$ is the feature dimension, $N=N_1+N_2$ are the frame counts of the full window and the two sub-windows, and $\lambda=1$. Position $i$ is considered a change point when

$$\left\{\max_i \mathrm{BIC}(i)\right\}>0$$

In general, the segments produced by speaker-change-point-based segmentation have inconsistent lengths.
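A minimal sketch of BIC-based change point detection over a window of features, assuming full-covariance Gaussians and $\lambda=1$ (the margin parameter and the numerical jitter are practical choices of this sketch, not part of the original formulation):

```python
import numpy as np

def delta_bic(X, i, lam=1.0):
    """BIC(i) for splitting window X (N, d) into X[:i] and X[i:];
    positive values favour H1 (a speaker change at position i)."""
    N, d = X.shape
    X1, X2 = X[:i], X[i:]
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False) + 1e-6 * np.eye(d))[1]
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)      # penalty term
    return N * logdet(X) - len(X1) * logdet(X1) - len(X2) * logdet(X2) - lam * P

def detect_change_point(X, margin=10):
    """Scan candidate positions; return argmax_i BIC(i) if the max is > 0."""
    scores = {i: delta_bic(X, i) for i in range(margin, len(X) - margin)}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```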

With the introduction of the x-vector and other DNN-based embeddings, uniform segmentation has gradually been adopted, since segments of varying length may reduce the "fidelity" of the extracted speaker representations.

In uniform segmentation, the audio input is split with a fixed window length and overlap. The choice of segment length involves a trade-off (a minimal sketch follows the list below):

  • Segments need to be short enough not to contain multiple speakers
  • but long enough to capture sufficient acoustic information to extract reliable speaker representations
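A minimal uniform-segmentation sketch over VAD output (the 1.5 s window and 0.75 s hop are common but arbitrary choices, and keeping a short tail segment is a heuristic of this sketch):

```python
def uniform_segments(speech_regions, win=1.5, hop=0.75):
    """speech_regions: list of (start, end) speech intervals in seconds.
    Returns fixed-length, overlapping sub-segments for embedding extraction."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t + win <= end:
            segments.append((t, t + win))
            t += hop
        if end - t > 0.5 * win:            # keep a shorter tail if it is long enough
            segments.append((t, end))
    return segments

# uniform_segments([(0.0, 4.0)]) ->
# [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.0)]
```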

Speaker Representation and Similarity Measurement

Speaker representation plays a crucial role for SD systems to measure similarity between speech segments

Measure-based similarity measure

The methods used for speech segmentation are also used to measure the similarity of speech segments, such as KL distance, GLR, BIC, etc.

As with segmentation, the BIC-based method, in which the similarity between two segments is computed by the equation above, is one of the most widely used measures due to its effectiveness and ease of implementation.

Measure-based similarity measures are often used together with segmentation methods based on speaker change point detection.

Joint Factor Analysis (JFA), i-vector and PLDA

The GMM-UBM has previously been applied successfully to speaker verification. The UBM is a large GMM trained to represent the speaker-independent distribution of acoustic features. However, GMM-UBM-based speaker verification suffers from inter-session variability.

JFA compensates for this variability by modeling inter-speaker variability and channel variability separately. JFA uses the GMM supervector, which is the concatenation of the means of the adapted GMM. For example, with speaker-independent GMM mean vectors $m_c \in \mathbb{R}^{F \times 1}$, the supervector $\mathbf{M}$ is:

$$\mathbf{M}=\left[m_1^t, m_2^t, \ldots, m_C^t\right]^t$$

In JFA, the GMM supervector is decomposed into speaker-independent, speaker-dependent, channel-dependent and residual components:

$$\mathbf{M}_J=\mathbf{m}_J+\mathbf{V}\mathbf{y}+\mathbf{U}\mathbf{x}+\mathbf{D}\mathbf{z}$$

where $\mathbf{m}_J$ is the speaker-independent supervector, $\mathbf{V}$ is the speaker variability matrix, $\mathbf{U}$ is the channel variability matrix, $\mathbf{D}$ is the speaker-specific residual matrix, $\mathbf{y}$ represents the speaker factors, $\mathbf{x}$ the channel factors, and $\mathbf{z}$ the speaker-specific residual factors; all of these latent vectors have a standard normal $\mathcal{N}(0, \mathbf{I})$ prior.

Dehak et al. proposed combining the channel and speaker spaces into a single total variability space through the total variability matrix $\mathbf{T}$; the corresponding weight vector $\mathbf{w}$ is called the i-vector and is used as the speaker representation. In this case $\mathbf{M}$ can be expressed as:

$$\mathbf{M}_I=\mathbf{m}_I+\mathbf{T}\mathbf{w}$$

where $\mathbf{m}_I$ is the speaker- and channel-independent supervector.

Extracting the i-vector is a MAP estimation problem that uses, as parameters, the Baum-Welch statistics extracted with the UBM and mean supervector together with the total variability matrix $\mathbf{T}$ trained by the EM algorithm.

i-vector has been used not only in speaker recognition research but also in many SD studies and has shown superior performance compared to measure-based methods such as BIC, GLR and KL.

The inter-session variability of the i-vector is further compensated with back-end processing such as linear discriminant analysis (LDA) and within-class covariance normalization (WCCN), followed by cosine similarity scoring; this scoring was later replaced by the probabilistic LDA (PLDA) model.

G-PLDA (Gaussian PLDA) assumes Gaussianized i-vectors and Gaussian priors in PLDA, which was originally used for speaker verification. The PLDA representation $\phi_{ij}$ of speaker $i$ in speech (conversation) $j$ is defined as:

$$\phi_{ij}=\mu+\mathbf{F}\mathbf{h}_i+\mathbf{G}\mathbf{w}_{ij}+\epsilon_{ij}$$

where $\mu$ is the mean vector, $\mathbf{F}$ is the speaker variability matrix, $\mathbf{G}$ is the channel variability matrix, and $\epsilon_{ij}$ is the residual component. $\mathbf{h}_i, \mathbf{w}_{ij}$ are latent variables with Gaussian priors. During training, $\mu, \boldsymbol{\Sigma}, \mathbf{F}, \mathbf{G}$ are estimated with the EM algorithm; at test time two hypotheses are compared:

  • $H_0$: both samples come from the same speaker
  • $H_1$: the two samples come from different speakers

Under hypothesis $H_0$, the speaker representations $\phi_1, \phi_2$ of the two utterances are modeled through a shared latent variable $\mathbf{h}_{12}$:

$$\left[\begin{array}{l} \phi_1 \\ \phi_2 \end{array}\right]=\left[\begin{array}{l} \boldsymbol{\mu} \\ \boldsymbol{\mu} \end{array}\right]+\left[\begin{array}{ccc} \mathbf{F} & \mathbf{G} & \mathbf{0} \\ \mathbf{F} & \mathbf{0} & \mathbf{G} \end{array}\right]\left[\begin{array}{l} \mathbf{h}_{12} \\ \mathbf{w}_1 \\ \mathbf{w}_2 \end{array}\right]+\left[\begin{array}{l} \epsilon_1 \\ \epsilon_2 \end{array}\right]$$

Under hypothesis $H_1$, they are modeled through distinct latent variables $\mathbf{h}_1, \mathbf{h}_2$:

$$\left[\begin{array}{l} \phi_1 \\ \phi_2 \end{array}\right]=\left[\begin{array}{l} \boldsymbol{\mu} \\ \boldsymbol{\mu} \end{array}\right]+\left[\begin{array}{cccc} \mathbf{F} & \mathbf{G} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{F} & \mathbf{G} \end{array}\right]\left[\begin{array}{c} \mathbf{h}_1 \\ \mathbf{w}_1 \\ \mathbf{h}_2 \\ \mathbf{w}_2 \end{array}\right]+\left[\begin{array}{c} \epsilon_1 \\ \epsilon_2 \end{array}\right]$$

In G-PLDA, $\phi$ is assumed to follow a Gaussian distribution, so that:

$$p(\phi \mid \mathbf{h}, \mathbf{w})=\mathcal{N}(\phi \mid \boldsymbol{\mu}+\mathbf{F}\mathbf{h}+\mathbf{G}\mathbf{w}, \boldsymbol{\Sigma})$$

Based on the above, the log-likelihood ratio is:

$$s\left(\phi_1, \phi_2\right)=\log p\left(\phi_1, \phi_2 \mid H_0\right)-\log p\left(\phi_1, \phi_2 \mid H_1\right)$$

In speaker verification, this score is used to decide between $H_0$ and $H_1$; in SD, $s\left(\phi_1, \phi_2\right)$ is used to measure the similarity between two clusters.

Speaker Representation Based on Neural Network

A DNN learns the mapping automatically without explicitly defining factors, which makes it less interpretable than JFA.

Extracting speaker representations based on DNN does not require a predefined model (such as GMM-UBM).

It is more efficient than JFA in the inference stage and does not involve matrix inversion operations.

The most classic is the d-vector, which takes stacked filterbank features including context frames as input, trains several fully connected layers with a cross-entropy loss over the training speakers, and takes the d-vector from the last fully connected layer.

The x-vector further improves the speaker representation. A time-delay (TDNN) architecture and a statistics pooling layer distinguish the x-vector from the d-vector. The statistics pooling layer aggregates the frame-level outputs of the previous layer and computes their mean and standard deviation, which are passed to the subsequent layers. This allows x-vectors to be extracted from variable-length inputs.
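A minimal PyTorch sketch of the statistics pooling layer (the surrounding TDNN layers and the classification head are omitted; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Aggregate frame-level features (batch, T, D) into a fixed-size
    utterance-level vector by concatenating mean and standard deviation."""
    def forward(self, h):
        return torch.cat([h.mean(dim=1), h.std(dim=1)], dim=1)   # (batch, 2*D)

pool = StatsPooling()
print(pool(torch.randn(2, 137, 512)).shape)    # torch.Size([2, 1024]), regardless of T
```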

Clustering

The clustering algorithm clusters speech segments based on the speaker representations and similarity measures obtained in the previous section

Agglomerative Hierarchical Clustering (AHC)

AHC is a bottom-up clustering method: it starts with each segment as its own cluster and iteratively merges the two clusters with the highest similarity.

For the SD task, the stopping criterion of AHC can be a similarity threshold or a target number of clusters.

If the PLDA score is used as the similarity measure, AHC should stop merging at $s(\phi_1, \phi_2)=0$; when the number of speakers is known, AHC instead stops when the number of clusters equals that number of speakers.
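A minimal AHC sketch over segment embeddings using SciPy (cosine distance and average linkage are common choices here, not the only options; with a PLDA similarity one would instead keep merging while $s(\phi_1,\phi_2)>0$):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def ahc_labels(embeddings, threshold=None, n_speakers=None):
    """embeddings: (n_segments, dim). Stop either at a cosine-distance
    threshold or at a known number of speakers."""
    Z = linkage(pdist(embeddings, metric="cosine"), method="average")
    if n_speakers is not None:
        return fcluster(Z, t=n_speakers, criterion="maxclust")
    return fcluster(Z, t=threshold, criterion="distance")

emb = np.random.randn(20, 128)               # toy segment embeddings
print(ahc_labels(emb, n_speakers=3))         # one cluster label per segment
```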

Spectral clustering

Spectral clustering follows these steps:

  1. Compute the similarity matrix: the raw similarity values $d$ are passed through a kernel $\exp\left(-d^2 / \sigma^2\right)$, where $\sigma$ is a scaling parameter.
  2. Compute the graph Laplacian: there are normalized and unnormalized variants. The degree matrix $\mathbf{D}$ is diagonal with $d_i=\sum_{j=1}^{n} a_{ij}$, where $a_{ij}$ are the elements of the similarity matrix $\mathbf{A}$.
    1. Normalized: $\mathbf{L}=\mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$
    2. Unnormalized: $\mathbf{L}=\mathbf{D}-\mathbf{A}$
  3. Eigendecomposition: $\mathbf{L}=\mathbf{X} \boldsymbol{\Lambda} \mathbf{X}^{\top}$
  4. Re-normalization (optional): row-normalize $\mathbf{X}$ so that $y_{ij}=x_{ij} /\left(\sum_j x_{ij}^2\right)^{1/2}$
  5. Number of speakers: estimate the number of speakers by finding the largest eigengap
  6. Clustering of spectral embeddings: the eigenvectors corresponding to the $k$ smallest eigenvalues are stacked to build a matrix $\mathbf{U} \in \mathbb{R}^{n \times k}$; the row vectors of $\mathbf{U}$ are called $k$-dimensional spectral embeddings and are finally clustered with a conventional algorithm (e.g., k-means).

There are many variants of spectral clustering; the NJW algorithm is often used for SD tasks. Unlike AHC, spectral clustering is mostly used with the cosine distance. LSTM-based similarity measures combined with spectral clustering are also competitive. Depending on the dataset, cosine distance + spectral clustering can outperform PLDA + AHC.
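A minimal sketch of these steps using the unnormalized Laplacian, eigengap-based speaker counting, and k-means (the affinity matrix A is assumed symmetric and non-negative, e.g., produced by the kernel in step 1; `max_speakers` is a practical cap of this sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_diarization(A, max_speakers=8):
    """A: (n_seg, n_seg) affinity matrix. Returns a cluster label per segment."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                        # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    gaps = np.diff(eigvals[:max_speakers + 1])
    k = int(np.argmax(gaps)) + 1                     # largest eigengap -> #speakers
    U = eigvecs[:, :k]                               # k smallest eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-10)  # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```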

Other clustering algorithms

  1. k-means
  2. mean-shift clustering, used with KL distance in the SD task

Post-processing

Re-segmentation

Re-segmentation is a process of refining speaker boundaries.

Viterbi re-segmentation based on the Baum-Welch algorithm: alternately estimate a GMM for each speaker and apply Viterbi-based re-segmentation using the estimated speaker GMMs.

Variational Bayesian hidden Markov model (VB-HMM): performs better than Viterbi re-segmentation; see the later sections for details.

System fusion

Fusion of multiple SD systems can improve accuracy.

In SD system fusion, there are some special problems:

  1. Speaker labels are not consistent across different SD systems (i.e., the same speaker may be labeled differently)

  2. The estimated number of speakers may differ

  3. The estimated time boundaries may also differ across the SD systems

Possible solutions are:

  1. Treat the output of each SD system as an object to be clustered: cluster the set of outputs with AHC into two clusters, and within the larger cluster take the output closest to the other outputs as the final result.

  2. Combine two SD systems by finding matches between their speaker clusters and then re-segmenting based on the matching result.

  3. The DOVER method: combine the results of multiple SD systems in a voting-based scheme while aligning the labels.

  4. DOVER implicitly assumes that there is no overlapping speech (at most one speaker per time index). To combine results that contain overlapping speakers, there are two approaches:

    • Align the speaker labels of the different diarization results with a root hypothesis and estimate each speaker's speech activity from the weighted vote scores for each sub-segment.
    • Align the speakers of the multiple hypotheses by weighted k-partite graph matching, estimate the number of speakers k for each sub-segment from the weighted average over the systems, and select the top-k voted speaker labels.

Joint Optimization of Segmentation and Clustering

This section presents a VB-HMM-based SD technique that can be viewed as a joint optimization of segmentation and clustering.

VB-HMM is an extension of VB-based speaker clustering that introduces an HMM to constrain speaker transitions. Let $\mathbf{X} = (\mathbf{x}_t \mid t=1,\dots,T)$ be the speech feature sequence, and let each HMM state correspond to one of the $K$ possible speakers. Latent variables $\mathbf{Z}=(\mathbf{z}_t \mid t=1,\dots,T)$ are introduced such that the $k$-th element of $\mathbf{z}_t$ is 1 (and the rest are 0) when speaker $k$ is speaking at time $t$. The distribution of $\mathbf{x}_t$ is modeled with latent variables $\mathbf{Y} = (\mathbf{y}_k \mid k=1,\dots,K)$, where $\mathbf{y}_k$ is a low-dimensional vector representing speaker $k$. The joint distribution of $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ is:

$$P(\mathbf{X}, \mathbf{Z}, \mathbf{Y})=P(\mathbf{X} \mid \mathbf{Z}, \mathbf{Y})\, P(\mathbf{Z})\, P(\mathbf{Y})$$

where $P(\mathbf{X} \mid \mathbf{Z}, \mathbf{Y})$ is the emission probability modeled by a GMM (i.e., the HMM observation probability $B$) whose mean vectors are determined by $\mathbf{Y}$, $P(\mathbf{Z})$ is the HMM transition probability, and $P(\mathbf{Y})$ is the prior distribution of $\mathbf{Y}$.

The SD problem can then be expressed as maximizing the posterior $P(\mathbf{Z} \mid \mathbf{X})=\int P(\mathbf{Z}, \mathbf{Y} \mid \mathbf{X})\, d\mathbf{Y}$.

Since this problem is difficult to solve directly, variational Bayes is used to approximate $P(\mathbf{Z}, \mathbf{Y} \mid \mathbf{X})$.

A simplified x-vector-based VB-HMM, called VBx, uses a PLDA model on x-vectors to compute $P(\mathbf{X} \mid \mathbf{Z}, \mathbf{Y})$. The original VB-HMM operates at the frame level, whereas VBx operates on x-vectors, so it can be regarded as a clustering method over x-vectors.

VB-HMM is usually applied as the last step of an SD system.

Advances in Deep Learning-Based SD

This part first introduces DNN-based methods for some of the previously independent modules, and then methods that unify several parts of SD into a single network.

Single Module Optimization

Speaker clustering enhancement based on deep learning

The paper Speaker diarization with session-level speaker embedding refinement using graph neural networks proposes a GNN-based method.

This method refines the similarity matrix used in spectral clustering. Given a speaker embedding sequence $\left\{\mathbf{e}_1, \ldots, \mathbf{e}_N\right\}$, where $N$ is the sequence length, the input of the GNN is $\left\{\mathbf{x}_i^0=\mathbf{e}_i \mid i=1, \ldots, N\right\}$, and the output of the $p$-th layer is:

$$x_i^{(p)}=\sigma\left(\mathbf{W} \sum_j \mathbf{L}_{i,j}\, x_j^{(p-1)}\right)$$

where $\mathbf{L}$ is the normalized similarity matrix (with self-loops), $\mathbf{W}$ is the trainable weight matrix of the $p$-th layer, and $\sigma(\cdot)$ is a nonlinear function. The network is trained by minimizing the distance between the reference and estimated similarity matrices, using a combination of a histogram loss and the nuclear norm.

The paper Self-attentive similarity measurement strategies in speaker diarization introduces a self-attention-based neural network model to generate similarity matrices directly from speaker embedding sequences. Multi-scale speaker diarization with neural affinity score fusion fuses several similarity matrices with different temporal resolutions into a single similarity matrix based on neural networks.

The paper Unsupervised deep embedding for clustering analysis proposes deep embedding clustering (DEC), whose goal is to transform the embeddings so that they become easier to separate. To make the clusters separable, for each embedding $i$ the probability $q_{ij}$ of belonging to the $j$-th speaker cluster is computed, together with a sharpened target distribution $p_{ij}$:

$$q_{ij}=\frac{\left(1+\left\|z_i-\mu_j\right\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_l\left(1+\left\|z_i-\mu_l\right\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}, \quad p_{ij}=\frac{q_{ij}^2 / f_j}{\sum_l q_{il}^2 / f_l}$$

where $z_i$ is the bottleneck feature, $\alpha$ is the degrees of freedom of the Student's t-distribution, $\mu_j$ is the centroid of the $j$-th cluster, and $f_j=\sum_i q_{ij}$. An autoencoder is used to estimate the bottleneck features, and training iterates against the target distribution.
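A minimal NumPy sketch of the two distributions above, following the original DEC formulation (the autoencoder that produces the bottleneck features $z_i$ and the KL-divergence training loop are omitted):

```python
import numpy as np

def dec_assignments(Z, centroids, alpha=1.0):
    """Z: (n, d) bottleneck features; centroids: (k, d) cluster centres.
    Returns the soft assignments q_ij and the sharpened targets p_ij."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # ||z_i - mu_j||^2
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)              # Student's t kernel
    q /= q.sum(axis=1, keepdims=True)
    f = q.sum(axis=0)                                             # soft cluster sizes f_j
    p = q ** 2 / f
    p /= p.sum(axis=1, keepdims=True)
    return q, p
```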

Improved DEC (IDEC) improves the accuracy of SD by adding a reconstruction loss between the output of the autoencoder and the input features to preserve the local structure of the data. IDEC's loss function consists of four parts:

  • The clustering loss $L_c$ from the original DEC
  • The reconstruction loss $L_r$
  • A uniform "speaker airtime" distribution loss $L_u$
  • A loss $L_{MSE}$ measuring the distance of the bottleneck features from the centroids

The total loss is $L=\alpha L_c+\beta L_r+\gamma L_u+\delta L_{MSE}$, where the coefficients are the respective weights.

Learning distance estimator

This section introduces a new approach using a trainable distance function

A relational RNN (RRNN) learns the relations (e.g., distances) among a series of input features. SD can roughly be regarded as this type of problem, because the final diarization result depends on the distance between each speech segment and the speaker centroids.

Problems that limit the accuracy of the SD system:

  • The duration of the speaker embedding segment, which requires a trade-off between temporal resolution and robustness
  • The speaker embedding extractor is not explicitly trained to provide the best representation for SD
  • Distance metrics are often based on heuristics and/or rely on certain assumptions that do not necessarily hold
  • Context information is ignored during audio processing.
    The above problems can be attributed to the distance metric function, and most of the problems can be solved with RRNN.

One paper proposes learning the relation between speaker cluster centroids and embeddings. SD is treated as a classification task over the segmented audio: each extracted embedding $x_j$ is compared with the centroids of all speakers, and a distance-based loss is minimized to assign a label to that segment.

Post-processing based on deep learning

Medennikov et al. proposed Target-Speaker VAD (TS-VAD) to achieve accurate SD even under noisy conditions with many overlapping speakers.

TS-VAD assumes that an i-vector for each speaker is available, $\mathcal{E}=\left\{\mathbf{e}_k \in \mathbb{R}^f \mid k=1,\dots,K\right\}$, where $f$ is the i-vector dimension and $K$ is the number of speakers.

The input of TS-VAD is the MFCC features $\mathbf{X}$ together with the i-vectors $\mathcal{E}$; the output is a sequence of $K$-dimensional vectors $\mathbf{O}=\left(\mathbf{o}_t \in \mathbb{R}^K \mid t=1, \ldots, T\right)$, where the $k$-th element of $\mathbf{o}_t$ represents the probability of speech activity at time frame $t$ for the speaker corresponding to $\mathbf{e}_k$ (i.e., if speaker $\mathbf{e}_k$ speaks at time $t$, the $k$-th element of the target $\mathbf{o}_t$ is 1, otherwise 0).

The flow of the whole model is:

  1. Implement cluster-based diarization
  2. Given the result of diarization above, estimate the i-vector for each speaker
  3. Repeat the following:
    1. Run TS-VAD with the current i-vectors
    2. Update the i-vectors using the TS-VAD output

The drawback of TS-VAD is that the maximum number of speakers the model can handle is limited by the dimensionality of its output vector.
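Not part of the TS-VAD recipe itself, but as a sketch of how the frame-level activity probabilities $\mathbf{o}_t$ can be turned into diarization segments (the threshold and frame shift are illustrative values):

```python
import numpy as np

def probs_to_segments(O, frame_shift=0.01, threshold=0.5):
    """O: (T, K) per-frame speaker activity probabilities.
    Returns (speaker_index, start_sec, end_sec) tuples."""
    active = O >= threshold                    # binarize each speaker track
    segments = []
    for k in range(O.shape[1]):
        t = 0
        while t < len(active):
            if active[t, k]:
                start = t
                while t < len(active) and active[t, k]:
                    t += 1
                segments.append((k, start * frame_shift, t * frame_shift))
            else:
                t += 1
    return sorted(segments, key=lambda s: s[1])
```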

A different approach was proposed by Horiguchi et al., employing the EEND model to update cluster-based SD results. Clustering-based SD methods can handle a large number of speakers, but cannot handle overlapping speech. EEND has the opposite property. The two can be complementary: first adopt traditional clustering methods, and then iteratively apply the EEND model to each pair of detected speakers to refine the temporal boundaries of overlapping regions.

SD joint optimization

Joint Segmentation and Clustering

The unbounded interleaved-state recurrent neural network (UIS-RNN) replaces the segmentation and clustering modules with a single trainable model. Given an input embedding sequence $\mathbf{X}=\left(\mathbf{x}_t \in \mathbb{R}^d \mid t=1, \ldots, T\right)$, UIS-RNN produces the diarization result $\mathbf{Y}=\left(y_t \in \mathbb{N} \mid t=1, \ldots, T\right)$ as the speaker index of each time frame. The joint probability of $\mathbf{X}, \mathbf{Y}$ can be factorized by the chain rule:

$$P(\mathbf{X}, \mathbf{Y})=P\left(\mathbf{x}_1, y_1\right) \prod_{t=2}^{T} P\left(\mathbf{x}_t, y_t \mid \mathbf{x}_{1:t-1}, y_{1:t-1}\right)$$

To model the speaker-change distribution, UIS-RNN introduces latent variables $\mathbf{Z}=\left(z_t \in\{0,1\} \mid t=2, \ldots, T\right)$, where $z_t=1$ if the speakers at times $t-1$ and $t$ differ and $z_t=0$ otherwise:

$$P(\mathbf{X}, \mathbf{Y}, \mathbf{Z})=P\left(\mathbf{x}_1, y_1\right) \prod_{t=2}^{T} P\left(\mathbf{x}_t, y_t, z_t \mid \mathbf{x}_{1:t-1}, y_{1:t-1}, z_{1:t-1}\right)$$

Each factor can be further decomposed into three parts:

$$P\left(\mathbf{x}_t, y_t, z_t \mid \mathbf{x}_{1:t-1}, y_{1:t-1}, z_{1:t-1}\right)=P\left(\mathbf{x}_t \mid \mathbf{x}_{1:t-1}, y_{1:t}\right) P\left(y_t \mid z_t, y_{1:t-1}\right) P\left(z_t \mid z_{1:t-1}\right)$$

Here $P\left(\mathbf{x}_t \mid \mathbf{x}_{1:t-1}, y_{1:t}\right)$ is the sequence generation probability, modeled by a GRU-based RNN; $P\left(y_t \mid z_t, y_{1:t-1}\right)$ is the speaker assignment probability, modeled by a distance-dependent Chinese restaurant process; and $P\left(z_t \mid z_{1:t-1}\right)$ is the speaker change probability, modeled by a Bernoulli distribution. UIS-RNN is trained by maximizing $\log P(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$. Inference finds, via a beam-search-based method, the $\mathbf{Y}$ that maximizes $\log P(\mathbf{X}, \mathbf{Y})$ for a given $\mathbf{X}$.

Joint segmentation, embedding extraction and re-segmentation

Region-proposal-network (RPN) based methods can jointly perform segmentation, speaker embedding extraction, and re-segmentation.


In the RPN approach, the STFT features are first converted into a feature map (i.e., a channel dimension is added), and sliding windows (anchors) of different sizes are applied along the time axis. For each anchor, three neural networks perform VAD, embedding extraction, and region refinement, respectively.

  • VAD estimates the probability of speech activity within each anchor
  • Embedding extraction produces speaker features for each anchor
  • Region refinement estimates the duration and center position of each anchor

At inference time, RPN first keeps the anchors whose speech-activity probability exceeds a threshold, then computes a speaker embedding for each anchor and clusters the anchors with a conventional clustering method; finally, anchors that overlap too much after region refinement are removed.

RPN-based SD systems have the advantage of handling overlapping speech that may have an arbitrary number of speakers.

Joint Speech Separation and Diarization

Kounades-Bastian et al. proposed an NMF-based spatial covariance model that incorporates a voice activity model into speech separation, using the EM algorithm to estimate the separated speech and each speaker's speech activity from multi-channel overlapping speech. This method is built entirely on statistical modeling.

Neumann et al. proposed the online Recurrent Selective Attention Network (online RSAN), which combines speech separation, speaker counting, and SD in a single model. The network takes as input a spectrogram block $\mathbf{X}_b \in \mathbb{R}^{T \times F}$, a residual mask matrix $\mathbf{R}_{b, i-1} \in \mathbb{R}^{T \times F}$, and a speaker embedding $\mathbf{e}_{b-1, i} \in \mathbb{R}^d$, where $b, i, T, F$ are the audio block index, the speaker index, the block length, and the number of frequency bins, respectively. It outputs a speech mask $\mathbf{M}_{b, i} \in \mathbb{R}^{T \times F}$ and an updated embedding $\mathbf{e}_{b,i}$ of the corresponding speaker. For each audio block $b$ and speaker $i$, the network iterates as follows (a Python sketch of this iteration is given below):
Repeat (a) and (b) for $b=1,2,\ldots$
(a) $\mathbf{R}_{b,0}=\mathbf{1}$
(b) Repeat (i)-(iii) for $i=1,2,\ldots$ until stopped at (iii):
  i. $\mathbf{M}_{b,i}, \mathbf{e}_{b,i}=\mathrm{NN}\left(\mathbf{X}_b, \mathbf{R}_{b,i-1}, \mathbf{e}_{b-1,i}\right)$
     ($\mathbf{e}_{b-1,i}$ is set to $\mathbf{0}$ if it was not computed previously)
  ii. $\mathbf{R}_{b,i}=\max\left(\mathbf{R}_{b,i-1}-\mathbf{M}_{b,i}, \mathbf{0}\right)$
  iii. If $\frac{1}{TF}\sum_{t,f} \mathbf{R}_{b,i}(t,f) <$ threshold, stop the iteration.

The separated speech of speaker $i$ in block $b$ is obtained as $\mathbf{M}_{b,i} \odot \mathbf{X}_{b}$, where $\odot$ denotes element-wise multiplication. The speaker embedding $\mathbf{e}_{b,i}$ is used to track the same speaker across adjacent blocks.

Thanks to this iterative approach, the network can jointly perform speech separation and SD while handling a variable number of speakers.
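A minimal sketch of the per-block iteration above, with `rsan_step` standing in for the trained network $\mathrm{NN}(\mathbf{X}_b, \mathbf{R}_{b,i-1}, \mathbf{e}_{b-1,i})$; the embedding dimension, threshold, and the safety cap on the number of speakers are assumptions of this sketch:

```python
import numpy as np

def process_block(X, rsan_step, prev_embs, emb_dim=64, threshold=0.05, max_spk=8):
    """X: (T, F) magnitude spectrogram of one block. Returns the separated
    spectrograms and the updated speaker embeddings for this block."""
    R = np.ones_like(X)                              # residual mask R_{b,0} = 1
    masks, embs = [], []
    i = 0
    while True:
        e_prev = prev_embs[i] if i < len(prev_embs) else np.zeros(emb_dim)
        M, e = rsan_step(X, R, e_prev)               # M_{b,i}, e_{b,i}
        R = np.maximum(R - M, 0.0)                   # R_{b,i} = max(R_{b,i-1} - M_{b,i}, 0)
        masks.append(M); embs.append(e)
        i += 1
        if R.mean() < threshold or i >= max_spk:     # stop when the residual is nearly empty
            break
    separated = [M * X for M in masks]               # speaker i's speech: M_{b,i} * X_b
    return separated, embs
```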

Fully end-to-end neural diarization (EEND)

The EEND framework implements the entire SD process with a single network.

The input of EEND is an acoustic feature sequence of length $T$, $\mathbf{X}=\left(\mathbf{x}_t \in \mathbb{R}^F \mid t=1, \ldots, T\right)$, and the target is the speaker label sequence $\mathbf{Y}=\left(\mathbf{y}_t \mid t=1, \ldots, T\right)$, where $\mathbf{y}_t=\left[y_{t,k} \in\{0,1\} \mid k=1, \ldots, K\right]$. $y_{t,k}=1$ means that speaker $k$ is active at time frame $t$; for different $k, k^{\prime}$, both $y_{t,k}$ and $y_{t,k^{\prime}}$ can be 1, which represents two speakers talking at the same time (i.e., overlap).

Assuming the outputs $y_{t,k}$ are conditionally independent, the network is trained by maximizing the conditional likelihood $\log P(\mathbf{Y} \mid \mathbf{X}) \sim \sum_t \sum_k \log P\left(y_{t,k} \mid \mathbf{X}\right)$. Swapping speaker indices produces multiple equivalent label candidates, so the loss is computed for all possible permutations of the reference labels, and the permutation with the smallest loss is used for back-propagation (permutation-invariant training).
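A minimal PyTorch sketch of this permutation-free loss (the EEND encoder producing the logits is omitted; brute-force enumeration over permutations is only practical for small K):

```python
import itertools
import torch
import torch.nn.functional as F

def pit_bce_loss(logits, labels):
    """logits, labels: (T, K). Frame-wise binary cross-entropy, minimized
    over all permutations of the speaker (output) columns."""
    K = labels.shape[1]
    losses = [F.binary_cross_entropy_with_logits(logits, labels[:, list(p)])
              for p in itertools.permutations(range(K))]
    return torch.stack(losses).min()

logits = torch.randn(50, 2, requires_grad=True)      # toy EEND output, T=50, K=2
labels = (torch.rand(50, 2) > 0.7).float()           # toy multi-label activity targets
pit_bce_loss(logits, labels).backward()
```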

EEND initially used a BLSTM and was later extended with a self-attention network.

The advantages of EEND are:

  • It can naturally handle overlapping speech
  • It is optimized directly by maximizing diarization accuracy
  • It can be retrained (adapted) with real annotated data

The constraints of EEND are:

  • The model architecture limits the maximum number of speakers it can handle
  • It cannot process audio online
  • It is prone to overfitting

Extensions of EEND include:

  • Horiguchi et al. proposed extending EEND with encoder-decoder based attractors (EDA). The method applies an LSTM-based encoder-decoder on top of the EEND embeddings to generate attractors until an attractor's existence probability falls below a threshold. Each attractor is then multiplied with the EEND-generated embeddings to compute each speaker's speech activity.

Fujita et al. output speech activities using a conditional speaker chain rule. A neural network produces the posterior $P\left(\mathbf{y}_k \mid \mathbf{y}_1, \ldots, \mathbf{y}_{k-1}, \mathbf{X}\right)$, where $\mathbf{y}_k=\left(y_{t,k} \in\{0,1\} \mid t=1, \ldots, T\right)$ is the speech activity vector of the $k$-th speaker. The joint probability is then computed with the chain rule:

$$P\left(\mathbf{y}_1, \ldots, \mathbf{y}_K \mid \mathbf{X}\right)=\prod_{k=1}^{K} P\left(\mathbf{y}_k \mid \mathbf{y}_1, \ldots, \mathbf{y}_{k-1}, \mathbf{X}\right)$$

At inference time, $\mathbf{y}_k$ is estimated repeatedly until the final estimated $\mathbf{y}_k$ is close to $\mathbf{0}$.

Kinoshita et al. proposed a method combining EEND and speaker clustering: the network is trained to produce both speaker embeddings and speech activity probabilities, and clustering constrained by the EEND-estimated speech activity is applied across processing blocks to align the estimated speakers.

There are also some ways to extend EEND to realize online processing.

SD in the context of ASR

It is generally believed that SD is a preprocessing step of ASR. This section discusses how to develop SD systems in the context of ASR.

Early work

The lexical information from ASR can be made available to SD systems in the following ways:

  1. RT03 Evaluation: Segmentation Using Word Boundary Information (First attempt to use the output of ASR to improve SD performance)
  2. RT07 Evaluation: Use the results of ASR to improve VAD to reduce false alarms and improve the clustering performance of SD systems.
  3. Silovsky et al.: used the word alignments from ASR during segmentation, since otherwise the decoded words could be truncated by segment boundaries that are not aligned with the word sequence
  4. Canseco-Rodriguez et al: create a dictionary (for broadcast news data) whose phrases provide the identity of who is speaking, who will speak, and who has spoken in a broadcast news scene

Although the early SD research did not make full use of lexical information to greatly improve DER, many studies integrated the output information of ASR to improve the output of SD.

Use lexical information from ASR

In more recent studies, SD systems have used DNNs to exploit the ASR output.

  1. Flemotomos et al.: propose a method for using linguistic information in SD tasks.

The system employs neural text-based speaker change detection and text-based role recognition, using both linguistic and acoustic information to improve DER.

  2. Park et al.: use a Seq2Seq model to output speaker-turn tokens, so that the lexical information from ASR is used for speaker segmentation; the audio is then segmented at the estimated speaker turns
  3. Park et al.: use an integrated adjacency matrix to combine lexical information with speech-segment clustering; the adjacency matrix is the max of the acoustic and lexical affinity matrices

Joint ASR and SD based on deep learning

Method 1: introduce speaker labels into end-to-end ASR.

  1. Shafey et al. insert speaker role labels in the output of an RNN-T ASR system
  2. Mao et al. propose to insert speaker identity labels into the output of an attention-based encoder-decoder ASR system

These two studies show that inserting speaker labels is a simple and promising approach to joint ASR and SD. However, the speaker roles or speaker identities must be determined and fixed during training, so the approach has difficulty handling an arbitrary number of speakers.

Method 2: MAP-based joint decoding.

Kanda et al. jointly decode ASR and SD. Suppose the observation sequence is $X=\left\{\mathbf{X}_1, \ldots, \mathbf{X}_U\right\}$, where $U$ is the number of segments and $\mathbf{X}_u$ is the acoustic feature of segment $u$. The word hypotheses with time boundaries are $\mathcal{W}=\left\{\mathbf{W}_1, \ldots, \mathbf{W}_U\right\}$, where $\mathbf{W}_u$ is the recognition hypothesis for segment $u$, and $\mathbf{W}_u=\left(\mathbf{W}_{1,u}, \ldots, \mathbf{W}_{K,u}\right)$ contains the hypotheses of all speakers in segment $u$ ($K$ is the number of speakers). The speaker embeddings are $\mathcal{E}=\left(\mathbf{e}_1, \ldots, \mathbf{e}_{K}\right)$, where $\mathbf{e}_k \in \mathbb{R}^d$ is a $d$-dimensional vector for speaker $k$. The joint decoding problem for multi-speaker ASR and SD is then to find the most probable $\hat{\mathcal{W}}$:

$$\begin{aligned} \hat{\mathcal{W}} &=\underset{\mathcal{W}}{\operatorname{argmax}}\, P(\mathcal{W} \mid X) \\ &=\underset{\mathcal{W}}{\operatorname{argmax}}\left\{\sum_{\mathcal{E}} P(\mathcal{W}, \mathcal{E} \mid X)\right\} \\ & \approx \underset{\mathcal{W}}{\operatorname{argmax}}\left\{\max_{\mathcal{E}} P(\mathcal{W}, \mathcal{E} \mid X)\right\} \end{aligned}$$

where the last step uses the Viterbi approximation. The problem can be further decomposed into two alternating sub-problems:

$$\hat{\mathcal{W}}^{(i)}=\underset{\mathcal{W}}{\operatorname{argmax}}\, P\left(\mathcal{W} \mid \hat{\mathcal{E}}^{(i-1)}, X\right), \qquad \hat{\mathcal{E}}^{(i)}=\underset{\mathcal{E}}{\operatorname{argmax}}\, P\left(\mathcal{E} \mid \hat{\mathcal{W}}^{(i)}, X\right)$$

where $i$ is the iteration index.

Method 3: end-to-end speaker-attributed ASR (SA-ASR) for joint speaker counting, multi-speaker ASR, and speaker identification. Unlike the previous two methods, the end-to-end SA-ASR model additionally takes speaker profiles as input and identifies the speaker of each utterance with an attention mechanism over the profiles.

Because of the attention mechanism trained on serialized outputs, there is no limit on the number of speakers the model can handle. At inference time, given the relevant speaker profiles, end-to-end SA-ASR automatically transcribes the speech and identifies the speaker of each utterance.

Evaluation Series and Datasets

This section describes the evaluation series and datasets used for SD evaluation; they are summarized below:

  • CALLHOME: the most widely used dataset in SD papers
  • AMI: a suitable dataset for evaluating SD systems integrated with ASR modules
  • ICSI: meeting audio
  • CHiME-5/6 challenge and dataset: an ASR challenge for multi-speaker everyday conversation
  • VoxSRC challenge and VoxConverse corpus: a speaker recognition challenge; contains overlapping speech
  • LibriCSS: designed for studying speech separation, ASR and SD, including overlapping speech
  • DIHARD challenge and dataset: focuses on challenging conditions where state-of-the-art SD systems still show a performance gap
  • Rich Transcription (RT) evaluation series: aims at a deeper study of SD in connection with ASR
  • Other datasets: omitted here

Application Scenarios

  1. Meeting minutes (meeting transcription); some of the technical challenges to overcome include:
    1. Overlapped speech ASR, background noise, reverberation, etc.
    2. Modular framework capable of handling multi-modal and variable-channel-count scenarios without loss of performance
    3. When implementing online or streaming ASR, many audio preprocessing operations are required, resulting in low efficiency of the entire process
    4. Multi-device audio capture to improve meeting transcription quality
  2. Dialogue interaction analysis and behavior modeling: mainly requires extracting the speech of a specific speaker
  3. Audio indexing: content-based audio indexing, understanding speech content with ASR transcripts, speaker summarization, and more
  4. Conversational AI

Challenges and future directions of SD

  1. Online SD processing: the focus is on real-time operation
  2. Domain Mismatch: A model trained on data in one domain does not perform well on data from another domain
  3. Speaker Overlap: Overlap is unavoidable in multi-speaker scenarios, but traditional methods (especially cluster-based systems) only focus on non-overlapping regions.
  4. Integration with ASR: Many applications require ASR results as well as SD results, and determining the optimal system architecture for both SD and ASR tasks remains an open problem
  5. Audiovisual modeling: Visual information provides powerful cues for speaker recognition. For example, video captured by a fisheye camera is used to improve the accuracy of SD in a meeting transcription task


Source: blog.csdn.net/weixin_43335465/article/details/127797800