In-depth explanation of WebRTC AEC (Acoustic Echo Cancellation)

Foreword: In recent years, audio and video conferencing products have made work collaboration more efficient, online education products have broken through the limitations of traditional teaching, and entertainment and interactive live-streaming products have enriched everyday life and social interaction. Behind all of them is continuous optimization and innovation in audio and video communication technology, where the smoothness, completeness, and intelligibility of the transmitted audio directly determine the quality of communication between users. Since WebRTC was open-sourced in 2011, both its technical architecture and its rich set of algorithm modules have been well worth studying. The well-known audio 3A algorithms (AGC: Automatic Gain Control; ANS: Adaptive Noise Suppression; AEC: Acoustic Echo Cancellation) are the shining pearls among them. This article analyzes the basic framework and principles of WebRTC AEC with concrete examples, and explores the fundamentals, technical difficulties, and optimization directions of echo cancellation.

Author: Luo Shen, senior development engineer at Alibaba Cloud, responsible for audio R&D for Alibaba Cloud RTC

The formation of echo

The uplink and downlink audio signal processing flow in the WebRTC architecture is shown in Figure 1. The audio 3A modules are concentrated mainly on the uplink (sending) side, where the captured signal undergoes echo cancellation, noise suppression, and volume equalization (only the AEC processing flow is discussed here; in the AECM flow, ANS runs before AEC). On the receiving side, AGC acts as a compressor to limit the audio signal before playback.

Figure 1 Block diagram of the audio signal uplink and downlink processing flow in WebRTC

So how is the echo formed?

As shown in Figure 2, during the communication between A and B, we have the following definitions:

  • x(n): the far-end reference signal, i.e. the audio stream from end B that end A subscribes to, usually used as the reference signal;
  • y(n): the echo signal, i.e. the signal captured by the microphone after the loudspeaker plays x(n). Because of room reverberation, the captured y(n) is no longer equal to x(n); we denote its linearly superimposed part as y'(n) and its nonlinear part as y''(n), so y(n) = y'(n) + y''(n);
  • s(n): the near-end speaker's voice captured by the microphone, i.e. the signal we actually want to extract and send to the far end;
  • v(n): environmental noise; this component is attenuated by ANS;
  • d(n): the near-end signal, i.e. the raw microphone signal before 3A processing, which can be expressed as d(n) = s(n) + y(n) + v(n);
  • s'(n): the audio signal after 3A processing, i.e. the signal that is encoded and sent to the far end.

The only signals the WebRTC audio engine can observe are the near-end signal d(n) and the far-end reference signal x(n).

Figure 2 Echo signal generation model

If, after passing through end A's audio engine, the outgoing signal s'(n) still contains residual echo y(n), end B will hear its own echo or a residual tail (the residue left by incomplete echo suppression). In practice, AEC quality can be roughly graded into the following situations (professionals can subdivide further by application scenario, device, and single-talk versus double-talk):

[Table: AEC effect evaluation categories omitted]

The essence of echo cancellation

Before analyzing the WebRTC AEC architecture, we need to understand the essence of echo cancellation. In audio and video calls, sound is the main carrier of information, so the task of signal processing is to extract, from a complex recorded signal, the information we want to transmit: high fidelity, low latency, and clear intelligibility are the goals we keep pursuing. In my view, echo cancellation, noise suppression, and source separation all belong to the field of speech enhancement. If noise is understood in a generalized sense, the relationship between the three is as follows:

Figure 3 The relationship between speech enhancement and echo cancellation

Noise suppression requires an accurate estimate of the noise signal. For stationary noise, voice activity detection can distinguish speech segments from non-speech segments, so that the noise estimate is updated dynamically and then used for noise reduction. The common approach is spectral subtraction (that is, subtracting the estimated noise component from the original signal) and its many improved variants, whose effectiveness depends on how accurately the noise is estimated. For non-stationary noise, deep-learning methods based on recurrent neural networks are currently used more often, and many Windows devices ship with noise reduction based on multi-microphone arrays. In terms of effect, noise suppression is allowed to leave some residual noise in order to preserve sound quality: as long as the signal-to-noise ratio of the original signal is raised and the listener perceives neither the noise nor any audible distortion, the goal is met.
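As a toy illustration of the spectral-subtraction idea mentioned above (not the actual WebRTC ANS implementation), the MATLAB sketch below removes an assumed noise magnitude spectrum from one windowed frame; noisy_frame and noise_mag are assumptions for the example, with noise_mag standing for a noise estimate obtained during non-speech frames.

% Minimal spectral-subtraction sketch (illustrative only).
N     = 256;                                         % FFT size
win   = 0.5 - 0.5 * cos(2 * pi * (0:N-1)' / (N-1));  % Hann window
beta  = 0.02;                                        % spectral floor against musical noise
frame = noisy_frame(1:N);
X     = fft(frame(:) .* win, N);                     % windowed analysis frame
mag   = abs(X);
ph    = angle(X);
clean = max(mag - noise_mag, beta * mag);            % subtract the estimated noise magnitude
out   = real(ifft(clean .* exp(1i * ph), N));        % one enhanced frame; overlap-add omitted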

Monaural source separation originates from the famous cocktail-party effect, which refers to human selective hearing: attention is focused on one person's conversation while other conversations and background noise are ignored. This effect reveals a remarkable ability of the human auditory system, namely that we can converse in noise. Scientists have long tried to separate the individual components of a monaural recording by technical means, and this has always been difficult; with the application of machine learning it has gradually become feasible, but because of its high computational complexity and latency it is still some distance from commercial use in low-latency RTC systems.

Both noise suppression and source separation have a single-source input: only the near-end signal is captured. Echo cancellation is more demanding and requires both the near-end signal and the far-end reference signal as inputs. Some readers may ask: since the far-end reference signal is known, why can't it be handled with a noise-suppression approach, simply subtracting the spectrum of the far-end signal in the frequency domain?

[Figure: near-end recording with the aligned far-end segments marked]

The first row in the figure above is the near-end signal d(n), in which the near-end voice is already mixed with the far-end signal played by the loudspeaker. The aligned far-end segments are marked with yellow boxes: the speech content is the same, but the spectrum and amplitude differ (obviously the energy is much higher after amplification by the loudspeaker). In other words, the far-end reference signal and the far-end signal actually played by the loudspeaker are already "alike in form but different in substance", and directly applying a noise-reduction style subtraction would clearly cause severe echo residue and double-talk suppression. Next, let's see how the WebRTC engineers approached it.

Signal processing flow

In essence, the WebRTC AEC algorithm consists of three parts: the delay adjustment strategy, linear echo estimation, and nonlinear echo suppression. Echo cancellation is by nature closer to source separation: we expect to remove the unwanted far-end signal from the mixed near-end signal and keep the near-end voice to send to the far end. The WebRTC engineers, however, preferred to model a two-party conversation as alternating question and answer, in which continuous simultaneous speech at both ends (i.e., double talk as opposed to single talk) is comparatively rare.

Therefore, it is enough to distinguish the far-end and near-end talk regions and remove most of the far-end echo by some means. For double-talk recovery, the WebRTC AEC algorithm provides three modes, {kAecNlpConservative, kAecNlpModerate, kAecNlpAggressive}, representing increasing degrees of suppression. The far-end and near-end signal processing flow is shown in Figure 4:

Figure 4 Block diagram of WebRTC AEC algorithm structure

The NLMS adaptive filter (the orange part of the figure above) removes as much of the linear echo in d(n) as possible, and the residual nonlinear echo is then filtered out in the nonlinear suppression stage (the purple part of the figure above). These two modules are the core of WebRTC AEC, and the later one depends on the earlier one. In a real scene, the far-end signal x(n) is played by the loudspeaker and picked up by the microphone, so the near-end signal contains both the linear and nonlinear parts of the echo y(n) on top of the near-end speech. The purpose of removing the linear echo is to widen the gap between the near-end signal D(ω) and the filtered output E(ω): the larger the gap, the more distinct the coherence values (close to 1 for near-end frames and close to 0 for far-end frames), and the easier it is to separate near-end frames from far-end frames with a simple threshold. The nonlinear stage then only needs to adjust the suppression coefficient according to the detected frame type and filter out the remaining echo. Below we analyze the linear and nonlinear parts of this architecture with examples.

Linear filter

The linear echo y'(n) can be understood as the result of the far-end reference signal x(n) passing through the room impulse response. The essence of linear filtering is to estimate a set of filter coefficients whose output approximates y'(n) as closely as possible; the index of the partition with the largest amplitude in the filter bank indicates which far-end frame is aligned with the current near-end frame, and that frame then takes part in subsequent modules such as the coherence calculation.

Note that if the index keeps landing at either end of the filter order, it only means that the current far/near-end delay seen by the linear part is too small or too large; the filter is then unstable, and a fixed delay adjustment or large-delay adjustment is needed to bring the index to an ideal position. The linear part can be regarded as a fixed-step NLMS algorithm; you can read the source code for the details. This section focuses on the role of linear filtering in the overall framework.
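For intuition, here is a minimal time-domain fixed-step NLMS sketch. It is only a conceptual stand-in: the real WebRTC linear stage is a partitioned-block frequency-domain adaptive filter updated per 64-sample block. The column vectors x (far-end reference) and d (near-end signal) are assumed inputs.

% Simplified fixed-step NLMS (conceptual sketch, not the WebRTC implementation).
M     = 12 * 64;                 % filter length, roughly 12 partitions of 64 taps
mu    = 0.5;                     % fixed normalized step size
delta = 1e-6;                    % regularization term
w     = zeros(M, 1);             % estimate of the linear echo path
e     = zeros(size(d));          % output after linear echo cancellation
for n = M:length(d)
    x_vec = x(n:-1:n-M+1);                                     % latest M far-end samples
    y_hat = w' * x_vec;                                        % estimated linear echo y'(n)
    e(n)  = d(n) - y_hat;                                      % near-end speech + residual echo
    w     = w + mu * e(n) * x_vec / (x_vec' * x_vec + delta);  % NLMS update
end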

In my understanding, the purpose of the linear part is to eliminate the linear echo as far as possible, which in turn makes the coherence values used for the near/far-end frame decision (between 0 and 1; the larger the value, the higher the coherence) as reliable as possible.

We denote the signal after linear echo cancellation as the estimated echo signal e(n), with e(n) = s(n) + y''(n) + v(n), where y''(n) is the nonlinear echo; the linear echo is y'(n), and y(n) = y'(n) + y''(n). The coherence is computed as follows (MATLAB code):

% WebRtcAec_UpdateCoherenceSpectra →_→ UpdateCoherenceSpectra
Sd = Sd * ptrGCoh(1) + abs(wined_fft_near) .* abs(wined_fft_near)*ptrGCoh(2);
Se = Se * ptrGCoh(1) + abs(wined_fft_echo) .* abs(wined_fft_echo)*ptrGCoh(2);
Sx = Sx * ptrGCoh(1) + max(abs(wined_fft_far) .* abs(wined_fft_far),ones(N+1,1)*MinFarendPSD)*ptrGCoh(2);
Sde = Sde * ptrGCoh(1) + (wined_fft_near .* conj(wined_fft_echo)) *ptrGCoh(2);
Sxd = Sxd * ptrGCoh(1) + (wined_fft_near .* conj(wined_fft_far)) *ptrGCoh(2);     

% WebRtcAec_ComputeCoherence →_→ ComputeCoherence
cohde = (abs(Sde).*abs(Sde))./(Sd.*Se + 1.0e-10);
cohdx = (abs(Sxd).*abs(Sxd))./(Sx.*Sd + 1.0e-10);

Two experiments

(1) Compute the coherence cohxd between the near-end signal d(n) and the far-end reference signal x(n). During far-end (echo) frames d(n) should in theory be strongly coherent with x(n); to make it comparable with cohde, WebRTC reverses it as 1 - cohxd, so values close to 1 indicate near-end speech. As shown in Figure 5(a), the first row is the near-end signal d(n), the second row is the far-end reference signal x(n), and the third row is the coherence curve 1 - cohxd. The curve fluctuates noticeably over the echo regions, reaching at most about 0.7, while the near-end regions stay close to 1.0 overall but still fluctuate. Separating far-end from near-end frames with a fixed threshold on this curve therefore causes misjudgments to varying degrees, which are heard either as echo (a far-end frame judged as near end) or as dropped words (a near-end frame judged as far end).

 (a) Coherence between near-end signal and far-end reference signal

 (b) The coherence between the near-end signal and the estimated echo signal

Figure 5 Signal coherence

(2) Compute the coherence cohde between the near-end signal d(n) and the estimated echo signal e(n), as shown in Figure 5(b): the second row is the estimated echo signal e(n), and the third row is the coherence cohde between the two. The near-end regions are almost all very close to 1.0, so WebRTC can pick out most near-end frames with a rather strict threshold (>= 0.98), and the probability of misjudgment is relatively small. Presumably the WebRTC engineers chose such a strict threshold because they would rather sacrifice some double-talk performance than accept echo residue.

Figure 5 shows that linear filtering further widens the coherence gap between near-end frames and far-end frames, thereby improving the reliability of the far/near-end frame-state decision.
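To make the decision logic concrete, below is an illustrative threshold test built on the cohde and cohxd values from the MATLAB code above. The 0.98 near-end threshold follows the text; the band selection, the far-end threshold, and the use of simple band averages are assumptions for the example, not the exact WebRTC conditions.

% Illustrative far/near-end frame decision (band and thresholds are assumptions).
prefBand  = 4:24;                        % assumed low/mid-frequency bins
cohde_avg = mean(cohde(prefBand));       % coherence between d(n) and e(n)
cohxd_avg = mean(1 - cohxd(prefBand));   % reversed coherence between d(n) and x(n)
if cohde_avg >= 0.98
    echo_state = false;                  % near-end frame: suppress gently
elseif cohxd_avg < 0.3
    echo_state = true;                   % far-end (echo) frame: suppress aggressively
end                                      % otherwise keep the previous echo_state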

Existing problems and improvements

Ideally, if the far-end signal were played by the loudspeaker without nonlinear distortion, we would have e(n) = s(n) + v(n). In reality, however, e(n) remains very similar to d(n), with only some amplitude reduction in the far-end regions, which shows that the linear part of WebRTC AEC does not perform well in this case. As shown in Figure 6(a), the low-frequency band is visibly attenuated in the spectrum, but the mid and high frequencies are almost unchanged. With a variable-step dual-filter structure the improvement is very noticeable: as shown in Figure 6(b), both the time-domain waveform and the spectrum differ greatly from the near-end signal d(n). AEC3 and Speex already use this kind of structure, so there is clearly still plenty of room to optimize the linear part of WebRTC AEC.

(a) WebRTC AEC linear part output

 (b) Improved linear part output

Figure 6 Comparison of near-end signal and estimated echo signal

How do we measure the effect of the improved linear part?

Here we compare the existing fixed-step NLMS with a variable-step NLMS. The near-end signal d(n) is the reverberated far-end reference signal x(n) plus the near-end speech signal s(n). In theory, when NLMS processes such a purely linear mixture it should be able to remove the far-end echo directly, without help from the nonlinear stage. In Figure 7(a), the first row is the near-end signal d(n), the second row is the far-end reference signal x(n), and the third row is the output of the linear part, with the far-end regions marked by yellow boxes. The fixed-step NLMS in WebRTC AEC converges slowly and leaves some residual echo, whereas the variable-step NLMS converges faster and suppresses the echo noticeably better, as shown in Figure 7(b).

(A) NLMS with fixed step

(B) NLMS with variable step size

Figure 7 Comparison of the effects of the two NLMS algorithms
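For reference, a variable step size can be as simple as tying the step to the error power (a Kwong–Johnston style VSS-NLMS, shown only to illustrate the idea; it is not the dual-filter structure used in the improved version above). It reuses the assumed x, d, and M from the fixed-step sketch in the linear filter section.

% Variable-step NLMS sketch: the step grows while the error power is large
% (filter not yet converged) and shrinks once the echo is well modeled.
alpha  = 0.97;  gamma = 0.005;            % step-size smoothing and drive factors
mu_min = 0.01;  mu_max = 1.0;             % step-size bounds
mu     = mu_max;  delta = 1e-6;
w      = zeros(M, 1);  e = zeros(size(d));
for n = M:length(d)
    x_vec = x(n:-1:n-M+1);
    e(n)  = d(n) - w' * x_vec;
    mu    = min(max(alpha * mu + gamma * e(n)^2, mu_min), mu_max);  % adapt the step size
    w     = w + mu * e(n) * x_vec / (x_vec' * x_vec + delta);       % NLMS update
end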

Linear filter parameter setting

#define FRAME_LEN 80
#define PART_LEN 64
enum { kExtendedNumPartitions = 32 };
static const int kNormalNumPartitions = 12;

FRAME_LEN is the number of samples passed to the audio 3A modules each time, 80 samples by default. Because WebRTC AEC uses a 128-point FFT, the internal framing logic takes out PART_LEN = 64 samples, joins them with the data carried over from the previous frame to form the 128 points for the FFT, and leaves the remaining 16 samples for the next round, so each processing step actually handles PART_LEN samples (4 ms of data).
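A sketch of this framing arithmetic is shown below (assumptions: a single band at 16 kHz, and frames is an 80-by-num_frames matrix of input frames; this mirrors the bookkeeping described above, not the actual WebRTC buffering code).

% Framing sketch: 80-sample frames in, 64-sample blocks processed with a 128-point FFT.
FRAME_LEN = 80;  PART_LEN = 64;  FFT_LEN = 2 * PART_LEN;
fifo      = [];                          % samples carried over between frames
prev_blk  = zeros(PART_LEN, 1);          % previous block, joined for the 128-point FFT
for k = 1:num_frames
    fifo = [fifo; frames(1:FRAME_LEN, k)];      % append the new 80-sample (5 ms) frame
    while numel(fifo) >= PART_LEN
        blk  = fifo(1:PART_LEN);                % 64 samples = 4 ms at 16 kHz
        fifo = fifo(PART_LEN+1:end);            % leftover samples wait for the next frame
        spec = fft([prev_blk; blk], FFT_LEN);   % current block joined with the previous one
        prev_blk = blk;
        % ... per-block linear filtering, coherence, and suppression happen here ...
    end
end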

The default filter order is only kNormalNumPartitions = 12, which covers kNormalNumPartitions * 4 ms = 48 ms of data. If the extended filter mode is enabled (extended_filter_enabled set to true), the order becomes kExtendedNumPartitions = 32, covering kExtendedNumPartitions * 4 ms = 128 ms. As chip processing power grows, this extended filter mode is being enabled by default, and the order may even be raised further to cope with most mobile devices on the market. In addition, although the linear filter cannot adjust the delay itself, the estimated index reflects the current delay state of the signal, with a range of [0, number of partitions]. If the index sits at either end of this range, the real delay is too small or too large, which degrades the linear echo estimate and in severe cases produces audible echo; it then has to be corrected by the fixed delay adjustment and large-delay detection.

Non-linear filtering

The nonlinear part does two things in total, both aimed at removing the far-end signal as thoroughly as possible.

(1) According to the estimated echo signal provided by the linear part, the coherence between the signals is calculated and the state of the far and near end frames is judged.

(2) Adjust the suppression coefficient and calculate the nonlinear filtering parameters.

The nonlinear suppression coefficient is hNl, which roughly represents, per frequency band, the ratio of the desired near-end component to the residual nonlinear echo y''(n) within the estimated echo signal e(n). hNl is consistent with the coherence values and lies in the range [0, 1.0]. As Figure 5(b) shows, the values over the far-end regions that need to be removed are generally around 0.5; filtering directly with this hNl would leave a large amount of echo residue.

Therefore, the WebRTC engineers applied the following scaling to hNl. over_drive is derived from nlp_mode and represents different levels of suppression aggressiveness; drive_curve is a monotonically increasing convex curve with range [1.0, 2.0]. Since mid- and high-frequency tails are more audible, the curve is designed to suppress high-frequency tail sounds more strongly. We denote the resulting exponent α = over_drive_scaling * drive_curve; with nlp_mode = kAecNlpAggressive, α is around 30.

% The MATLAB code is as follows:
over_drive = min_override(nlp_mode+1);
if (over_drive < over_drive_scaling)
  over_drive_scaling = 0.99*over_drive_scaling + 0.01*over_drive;  % default 0.99 0.01
else
  over_drive_scaling = 0.9*over_drive_scaling + 0.1*over_drive; % default 0.9 0.1
end

% WebRtcAec_Overdrive →_→ Overdrive
hNl(index) = weight_curve(index).*hNlFb + (1-weight_curve(index)).* hNl(index);
hNl = hNl.^(over_drive_scaling * drive_curve);

% WebRtcAec_Suppress →_→ Suppress
wined_fft_echo = wined_fft_echo .*hNl;
wined_fft_echo = conj(wined_fft_echo);

If the current frame is a near-end frame (i.e. echo_state = false), suppose the k-th band has hNl(k) = 0.99994; then hNl(k)^α = 0.99994^30 = 0.9982, and the loss introduced by the filtering is almost imperceptible. As shown in Figure 8(a), the amplitude of hNl is still very close to 1.0 after modulation by α.

If the current frame is a far-end frame (i.e. echo_state = true), suppose the k-th band has hNl(k) = 0.6676; then hNl(k)^α = 0.6676^30 = 5.4386e-06, and the far-end energy becomes so small that it is practically inaudible. As shown in Figure 8(b), hNl is essentially 0 after modulation by α.

(A) The suppression coefficient corresponding to the near-end frame

(B) The suppression coefficient corresponding to the far-end frame

Figure 8 The change of the far and near end signal suppression coefficient before and after modulation

This comparison explains why WebRTC AEC sets such a strict threshold in the far/near-end frame-state decision: it ensures that the desired near-end signal is barely distorted after modulation, while the far-end echo is suppressed below audibility.

In addition, if the modulation coefficient α is too aggressive, it causes double-talk suppression: in the first row of Figure 9 the near-end speaker's voice is visibly lost, and it can be recovered by adjusting α, as shown in the second row. Therefore, optimizing the estimation of α on top of the existing WebRTC AEC strategy can alleviate the problem of severe double-talk suppression.

Figure 9 Dual talk effect

Delay adjustment strategy

The effect of echo cancellation is strongly related to the delay between the far-end and near-end data, and improper adjustment can even make the algorithm unusable. Before the far-end and near-end data enter the linear part, the delay must be brought within the range covered by the designed filter order; otherwise a delay that is too large exceeds what the linear filter can estimate, or an adjustment that makes the far and near ends non-causal prevents convergence, and echo results. Two common questions:

(1) Why is there a delay?

First, the echo in the near-end signal d(n) is formed when the loudspeaker plays the far-end reference x(n) and the microphone picks it up again, which means that by the time the near-end data is captured, there are already N frames of x(n) in the far-end buffer. This natural delay is approximately the time from when an audio frame is ready to be rendered until it is captured back by the microphone, and it varies from device to device: Apple devices have relatively small delays, typically around 120 ms, while Android devices are generally around 200 ms, and low-end models can reach 300 ms or more.

(2) Why does far/near-end non-causality cause echo?

Following from (1), under normal circumstances the far-end frame aligned with the current near-end frame is found by searching forward from the write pointer in the far-end buffer. If the capture side drops data at this point, the far-end data is consumed too quickly, and a new near-end frame can no longer find the far-end reference frame aligned with it when searching forward, so the subsequent modules cannot work properly. Figure 10(a) shows the normal delay situation, and (b) shows the non-causal case.

(A) Normal delay at the far and near ends

(B) Non-causal far and near

Figure 10 Normal far and near end delay and non-causality

The delay adjustment strategy in WebRTC AEC is critical and complex, involving fixed delay adjustment, large delay detection, and linear filter delay estimation. The relationship between the three is as follows:

① Fixed delay adjustment happens only before the AEC algorithm starts processing, and it is applied only once. For fixed hardware such as conference boxes the delay is essentially constant, so directly subtracting a fixed delay narrows the range that delay estimation has to cover and quickly brings the delay within the range covered by the filter.
Let's look at the fixed delay adjustment process together with the code below:

int32_t WebRtcAec_Process(void* aecInst,
                          const float* const* nearend,
                          size_t num_bands,
                          float* const* out,
                          size_t nrOfSamples,
                          int16_t reported_delay_ms,
                          int32_t skew);

The WebRtcAec_Process interface is shown above; the parameter reported_delay_ms is the target delay that the current device should compensate for. If an Android device has a fixed delay of about 400 ms, that already exceeds the delay range covered by the filter, and at least 300 ms of it needs to be removed up front for echo cancellation to work without echo. The fixed delay adjustment takes effect only once, at the start of the WebRTC AEC algorithm:

if (self->startup_phase) {
  int startup_size_ms = reported_delay_ms < kFixedDelayMs ? kFixedDelayMs : reported_delay_ms;
  int target_delay = startup_size_ms * self->rate_factor * 8;
  int overhead_elements = (WebRtcAec_system_delay_aliyun(self->aec) - target_delay) / PART_LEN;
  printf("[audio] target_delay = %d, startup_size_ms = %d, self->rate_factor = %d, sysdelay = %d, overhead_elements = %d\n",
         target_delay, startup_size_ms, self->rate_factor, WebRtcAec_system_delay(self->aec), overhead_elements);
  WebRtcAec_AdjustFarendBufferSizeAndSystemDelay_aliyun(self->aec, overhead_elements);
  self->startup_phase = 0;
}

Why is target_delay calculated like this?

int target_delay = startup_size_ms * self->rate_factor * 8;
startup_size_ms is essentially the reported_delay_ms that was passed in. This step converts the time in milliseconds into a number of samples: at 16000 Hz sampling, 10 ms corresponds to 160 samples, so target_delay is actually the number of samples by which the delay needs to be adjusted (aecpc->rate_factor = aecpc->splitSampFreq / 8000 = 2).
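A worked example of this conversion at 16 kHz (a sketch of the same arithmetic, not the WebRTC source; system_delay here is an assumed value):

reported_delay_ms = 240;                       % delay reported for the device
rate_factor       = 16000 / 8000;              % splitSampFreq / 8000 = 2
target_delay      = reported_delay_ms * rate_factor * 8;              % 240 * 2 * 8 = 3840 samples
PART_LEN          = 64;
system_delay      = 0;                         % assumed number of far-end samples already buffered
overhead_elements = floor((system_delay - target_delay) / PART_LEN);  % -60 blocks = -240 ms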

We test with data that has a 330 ms delay: with the default delay set to 240 ms, overhead_elements is adjusted by -60 blocks the first time (a negative value means searching forward), which is exactly 60 * 4 = 240 ms; the linear filter then settles at index = 24, i.e. 24 * 4 = 96 ms of delay, and the sum of the two is approximately 330 ms. The log prints as follows:

[Log screenshot omitted]

② Large-delay detection searches the large far-end buffer for the most similar frame based on the similarity between far-end and near-end data; the principle is somewhat like feature matching in audio fingerprinting. Large-delay adjustment supplements the capabilities of fixed delay adjustment and the linear filter, but it must be used carefully: the adjustment frequency needs to be controlled, and the risk of becoming non-causal must be kept in check.

The WebRTC AEC algorithm allocates a large buffer of 250 blocks, each PART_LEN = 64 samples long, which holds the most recent 1 s of data. That is also the theoretical upper bound on the delay that can be estimated, which is more than enough in practice.

static const size_t kBufferSizeBlocks = 250;
buffer_ = WebRtc_CreateBuffer(kBufferSizeBlocks, sizeof(float) * PART_LEN);
aec->delay_agnostic_enabled = 1;
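A toy version of this similarity search is sketched below. The real delay-agnostic estimator in WebRTC matches binarized spectra of the far-end and near-end signals; this sketch simply cross-correlates per-block energy envelopes, with far_signal, near_signal, num_blocks, and max_lag_blocks as assumed inputs.

% Per-block (4 ms) energy envelopes of both signals.
far_env  = sum(reshape(far_signal(1:num_blocks*64),  64, []).^2, 1);
near_env = sum(reshape(near_signal(1:num_blocks*64), 64, []).^2, 1);
% Cross-correlate the mean-removed envelopes and pick the best lag.
[c, lags]    = xcorr(near_env - mean(near_env), far_env - mean(far_env), max_lag_blocks);
[~, idx]     = max(c);
delay_blocks = lags(idx);              % positive: the near end lags the far end by this many blocks
delay_ms     = delay_blocks * 4;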

We test with data that has a 610 ms delay (to enable large-delay adjustment, set delay_agnostic_enabled = 1): the default delay is still 240 ms, so -60 blocks are adjusted at the start; after the large-delay adjustment kicks in, another -88 blocks are adjusted, for a total of (60 + 88) * 4 = 592 ms; the linear filter then settles at index = 4, meaning the final residual delay is 16 ms, which matches expectations.


③ The linear filter's delay estimate is the most direct feedback on the current far/near-end delay after the fixed delay adjustment and the large-delay adjustment. If either of the first two is adjusted improperly, the delay may become too small or even non-causal, or too large and beyond the filter's coverage, so that the echo cannot converge. The first two mechanisms therefore need to take the filter's capability into account during adjustment and make sure the residual delay stays within the range the filter can cover; small jitters in delay can then be absorbed adaptively by the linear part.

Summary and optimization direction

Problems with WebRTC AEC:

(1) The linear part converges slowly, and the fixed-step NLMS algorithm does not estimate the linear echo well;
(2) The filter order of the linear part is 32 by default, covering about 128 ms of delay, which is not enough for mobile devices with larger delays; the large-delay detection kicks in slowly and carries a risk of misadjustment leading to non-causal echo;
(3) The coherence-based frame state relies on strict fixed thresholds and produces a certain degree of misjudgment; using it to drive the nonlinear suppression coefficient then leads to more serious double-talk suppression.

Optimization direction:
(1) On the algorithm side, the current linear filtering can be improved by learning from the linear parts of Speex and AEC3;
(2) The delay adjustment strategy can be optimized, and on the engineering side parameter configuration and delivery can be added to handle the delay problems of specific devices;
(3) In addition, some new ideas are worth trying. As mentioned at the beginning, since echo can also be regarded as noise, can we use noise-reduction ideas to do echo cancellation? The answer is yes.

"Video Cloud Technology" Your most noteworthy audio and video technology official account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges ideas with first-class engineers in the audio and video field.


Origin blog.51cto.com/14968479/2562432