Echo Cancellation (AEC) Principles, Algorithms, and Practice: LMS (Least Mean Square)

Echo cancellation is an important part of front-end processing in voice communication, because in real-time audio and video calls the sound played by the loudspeaker is picked up again by the microphone.

Instant messaging applications need real-time voice communication between two or more parties. In more demanding settings, the sound is usually played through external loudspeakers, which inevitably produces echo: after one party speaks, the voice is played by the other party's loudspeaker, picked up by the other party's microphone, and sent back to the original speaker (as shown in the figure below). If the echo is left unprocessed, call quality and user experience suffer, and in more serious cases feedback oscillation and howling occur.

Acoustic echo arises when the sound played by the loudspeaker, while being heard by the listener, is also picked up by the microphone through various paths. Multipath reflection produces echoes with different delays, which fall into direct echoes and indirect echoes.

Direct echo is sound that travels from the loudspeaker straight to the microphone without any reflection. This echo has the shortest delay and is directly related to the voice energy of the far-end talker, the distance and angle between the loudspeaker and the microphone, the playback volume of the loudspeaker, and the pickup sensitivity of the microphone.

Indirect echo is the set of echoes that reach the microphone after the loudspeaker sound has been reflected one or more times along different paths (for example, off the room itself or any object in it). Any movement of any object in the room changes the echo path, so this kind of echo is multipath and time-varying.
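
To make the multipath picture concrete, the sketch below models the echo path as a sparse FIR impulse response: one strong early tap for the direct echo and a few weaker, later taps for reflections. The tap delays and gains are made-up illustrative values rather than measurements, and `make_echo_path` is a hypothetical helper that does not appear elsewhere in this article.

```python
import numpy as np

def make_echo_path(length=64):
    """Toy echo-path impulse response (all tap values are illustrative assumptions)."""
    h = np.zeros(length)
    h[2] = 0.8      # direct echo: shortest delay, strongest tap
    h[15] = 0.3     # first reflection
    h[31] = -0.15   # later reflection
    h[50] = 0.05    # weak late reflection
    return h

rng = np.random.default_rng(0)
far_end = rng.standard_normal(1000)             # stand-in for the loudspeaker signal
h = make_echo_path()
echo = np.convolve(far_end, h)[:len(far_end)]   # echo component seen by the microphone
```

Moving an object in the room would correspond to changing the later taps of `h`, which is why the canceller has to keep adapting instead of estimating the path once.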

The basic idea of adaptive echo cancellation is to estimate the characteristic parameters of the echo path, build a simulated echo path from them, produce a simulated echo signal, and subtract that signal from the received signal to cancel the echo. The key is to obtain an estimate \hat{h}(n) of the impulse response of the echo path. Since the echo path is usually unknown and time-varying, an adaptive filter is generally used to model it. The notable feature of adaptive echo cancellation is that it tracks the echo path in real time.

 

In the figure, y(n) is the signal from the far end, r(n) is the echo produced by the echo path, and x(n) is the near-end speech signal. Terminal D is the near-end microphone, which picks up the echo in the room superimposed on the voice of the near-end talker. The echo canceller takes the received far-end signal as its reference signal: the adaptive filter generates an echo estimate \hat{r}(n) from the reference signal, and this estimate is subtracted from the near-end speech signal with echo to obtain the signal transmitted from the near end. In the ideal single-talk case, the residual echo error e(n) = r(n) − \hat{r}(n) after the echo canceller is 0, so the echo is cancelled. In the double-talk case (the near end and the far end speak at the same time, and echo is present), the error e(n) should ideally equal the near-end speech signal x(n).
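
The relations described above can be written out as follows. This is a sketch in the notation of the figure, with some symbols added as assumptions: d(n) denotes the microphone signal at terminal D, N the number of filter taps, \mathbf{y}(n) the vector of the N most recent far-end samples, and \mu the step size. The last line is the standard LMS update, with the usual factor of 2 from the gradient absorbed into \mu.

```latex
\begin{aligned}
\mathbf{y}(n) &= [\,y(n),\ y(n-1),\ \dots,\ y(n-N+1)\,]^{T} \\
d(n) &= r(n) + x(n) \\
\hat{r}(n) &= \hat{\mathbf{h}}^{T}(n)\,\mathbf{y}(n) \\
e(n) &= d(n) - \hat{r}(n) \\
\hat{\mathbf{h}}(n+1) &= \hat{\mathbf{h}}(n) + \mu\, e(n)\,\mathbf{y}(n)
\end{aligned}
```

With x(n) = 0 (single talk) the error reduces to e(n) = r(n) − \hat{r}(n), and with a perfect echo estimate during double talk it reduces to e(n) = x(n).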

Performance indicators:

  • Convergence speed: The faster the filter converges, the better, so that the caller does not notice an obvious echo shortly after the call starts.
  • Steady-state residual echo (stability): This is the echo that remains at the output after the filter has converged to a steady state. In practice, the smaller this value, the better.
  • Algorithm complexity: A good algorithm should keep computational complexity low while maintaining convergence speed, which also reduces power consumption.

ITU-T G.168 specifies the requirements that echo canceller products must meet for various indicators, including the main indicators listed above.
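
One common way to put a number on the residual echo is the echo return loss enhancement (ERLE), the ratio of microphone power to residual power in dB. The article does not define this metric, so the sketch below is an added illustration: `erle_db` is a hypothetical helper, and the value is only meaningful over segments where just the far end is talking.

```python
import numpy as np

def erle_db(d, e, eps=1e-12):
    """Echo return loss enhancement in dB: microphone power over residual power."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    return 10.0 * np.log10((np.mean(d ** 2) + eps) / (np.mean(e ** 2) + eps))

# Toy check: a residual at one tenth of the echo amplitude means 20 dB of attenuation.
rng = np.random.default_rng(0)
echo = rng.standard_normal(8000)
residual = 0.1 * echo
print(round(erle_db(echo, residual), 1))  # prints 20.0
```

Computing this over short sliding windows instead of the whole signal would also show how quickly the filter converges.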

The basic principle of AEC technology is shown in the figure below:

 

The code is as follows:

import numpy as np
import librosa
import soundfile as sf
import pyroomacoustics as pra
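# x:  reference (far-end) signal
# d:  microphone signal
# N:  filter order (number of taps)
# mu: step size
def lms(x, d, N = 4, mu = 0.1):
    nIters = min(len(x),len(d)) - N
    u = np.zeros(N)   # delay line holding the N most recent reference samples
    w = np.zeros(N)   # adaptive filter weights (echo-path estimate)
    e = np.zeros(nIters)
    for n in range(nIters):
        # shift the delay line by one sample and insert the newest one
        u[1:] = u[:-1]
        u[0] = x[n]
        # error = microphone sample minus estimated echo
        e_n = d[n] - np.dot(u, w)
        # LMS weight update
        w = w + mu * e_n * u
        e[n] = e_n
    return e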

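# x: original reference signal
# v: ideal microphone signal (near-end speech without echo)
# Generates the simulated microphone signal and the reference signal
# Note: sr (sample rate) is read from the module-level variable set in the __main__ block
def creat_sim_sound(x,v):
    rt60_tgt = 0.08
    room_dim = [2, 2, 2]

    e_absorption, max_order = pra.inverse_sabine(rt60_tgt, room_dim)
    room = pra.ShoeBox(room_dim, fs=sr, materials=pra.Material(e_absorption), max_order=max_order)
    room.add_source([1.5, 1.5, 1.5])
    room.add_microphone([0.1, 0.5, 0.1])
    room.compute_rir()
    rir = room.rir[0][0]
    rir = rir[np.argmax(rir):]  # drop everything before the strongest tap
    # passing x through the room response gives y
    y = np.convolve(x,rir)
    scale = np.sqrt(np.mean(x**2)) /  np.sqrt(np.mean(y**2))
    # y is the sound that reaches the microphone after reflection, rescaled to the RMS level of x
    y = y*scale

    L = max(len(y),len(v))
    y = np.pad(y,[0,L-len(y)]) # zero-pad at the end so the signal lengths match
    v = np.pad(v,[L-len(v),0]) # zero-pad at the front so the signal lengths match
    x = np.pad(x,[0,L-len(x)]) # zero-pad at the end so the signal lengths match
    d = v + y
    return x,d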

if __name__ == "__main__":
    x_org, sr  = librosa.load('female.wav',sr=8000)
    v_org, sr  = librosa.load('male.wav',sr=8000)  # sample rate is 8000; without sr, librosa defaults to about 22 kHz

    x,d = creat_sim_sound(x_org,v_org)

    e =  lms(x, d,N=256,mu=0.1)
    sf.write('x.wav', x, sr, subtype='PCM_16')
    sf.write('d.wav', d, sr, subtype='PCM_16')
    sf.write('lms.wav', e, sr, subtype='PCM_16')
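
The script above needs two audio files and pyroomacoustics to run. As a quick sanity check that needs only NumPy, the sketch below runs the same LMS recursion on synthetic data: white noise stands in for the far-end signal and a short, made-up impulse response `h_true` stands in for the echo path. These inputs are assumptions for illustration, and this variant of `lms` additionally returns the final weights so they can be compared with the known path.

```python
import numpy as np

def lms(x, d, N=4, mu=0.1):
    """LMS recursion as in the listing above; also returns the final weights."""
    n_iters = min(len(x), len(d)) - N
    u = np.zeros(N)                  # N most recent reference samples, newest first
    w = np.zeros(N)                  # adaptive filter weights (echo-path estimate)
    e = np.zeros(n_iters)
    for n in range(n_iters):
        u[1:] = u[:-1].copy()        # shift the delay line by one sample
        u[0] = x[n]
        e[n] = d[n] - np.dot(u, w)   # error = mic sample minus echo estimate
        w = w + mu * e[n] * u        # LMS weight update
    return e, w

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)                                    # synthetic far-end signal
h_true = np.array([0.0, 0.8, 0.0, 0.3, 0.0, -0.15, 0.0, 0.05])   # assumed echo path
d = np.convolve(x, h_true)[:len(x)]                              # echo only (single talk)

e, w = lms(x, d, N=len(h_true), mu=0.01)
print("early error power:", np.mean(e[:200] ** 2))
print("late error power: ", np.mean(e[-200:] ** 2))
print("max weight error: ", np.max(np.abs(w - h_true)))
```

Because there is no near-end speech or noise in this setup, the error should decay toward zero and the weights should approach `h_true`. With real speech the behaviour depends much more strongly on the choice of `mu` and on the signal level.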


Origin blog.csdn.net/qq_42233059/article/details/130153002