Android Audio Subsystem (11) ------ Headphone Monitoring (Ear Return): Principle and Implementation

Hello! This is Kite's blog,

you're welcome to discuss with me.


Ear return, i.e. headphone monitoring, is commonly used in live concerts, mobile karaoke, KTV, and similar scenarios.
For example, in a noisy singing environment, a singer wearing in-ear monitors can clearly hear both the accompaniment and their own voice, and so can tell whether they are out of tune.
Likewise, most karaoke apps on the phone offer an ear-return feature: you monitor your own voice in real time and keep adjusting how you sing to present the best result.

There are two common ways to implement ear return:

  • In the HAL layer: grab the mic data in the HAL, spawn a thread that buffers it in a ring buffer, mix it with the downlink music data, and write the mixed data down to the kernel for playback (a sketch of the capture side follows below).
  • In AF: the recording data is reported up to AudioFlinger, which plays it together with the music as soon as it arrives.

Each method has its pros and cons: the HAL implementation has to deal with mixing streams of different sample rates, while the AF implementation adds an extra layer to the pipeline and therefore a larger ear-return delay (150 ms-300 ms).
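To make the HAL approach concrete, here is a minimal sketch of the capture side, assuming tinyalsa; the card/device numbers and the ringbuf_push() helper are hypothetical placeholders, not a real vendor HAL:

```c
/* Sketch of the HAL-side capture thread (assumption: tinyalsa).
 * Card/device numbers and ringbuf_push() are hypothetical. */
#include <stddef.h>
#include <tinyalsa/asoundlib.h>

extern void ringbuf_push(const void *data, size_t bytes); /* hypothetical */

static volatile int g_running = 1;

static void *ear_return_capture_thread(void *arg)
{
    (void)arg;
    struct pcm_config cfg = {
        .channels     = 2,
        .rate         = 48000,
        .period_size  = 240,  /* small periods keep the mic path responsive */
        .period_count = 2,
        .format       = PCM_FORMAT_S16_LE,
    };
    struct pcm *mic = pcm_open(0 /* card */, 0 /* device */, PCM_IN, &cfg);
    if (mic == NULL || !pcm_is_ready(mic))
        return NULL;

    char buf[240 * 2 * 2];    /* one period: frames * channels * 2 bytes */
    while (g_running) {
        /* tinyalsa's classic pcm_read() returns 0 on success */
        if (pcm_read(mic, buf, sizeof(buf)) == 0)
            ringbuf_push(buf, sizeof(buf)); /* hand off to the mixing side */
    }
    pcm_close(mic);
    return NULL;
}
```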

Personally, I prefer to implement it in the HAL, because delay is a critical metric for ear return. (Ear-return delay is the round-trip delay of sound on the device: the audio signal enters at the mic, is processed by the application, and is then played out by the output device; the time this whole path takes is the complete round-trip time of the sound.)

Back when I worked at a chip vendor, we got exactly such a request: a customer needed a low-latency ear-return feature with the delay kept within 30 ms, mainly for KTV scenarios, and asked us, the chip vendor, to provide a solution.

Generally, the human ear cannot perceive a difference below 30 ms of delay. Beyond 30 ms the monitored voice starts to feel "out of place", and beyond 50 ms the delay becomes clearly noticeable.

My former employer's chip supported hardware mixing, which saves a little time. I still tried to do it in the HAL layer: although the hardware could mix audio, the chip had no way to loop the mic back to the speaker internally, so a thread in the HAL had to push the mic data into the playback path, where the bottom layer did the mixing.

The measured delay at the time was about 100 ms. As we know, delay is tied to buffer size: a large buffer inevitably means high delay.
The default period size was 1024 with a period count of 4. After changing playback to 480 x 4 and record to 240 x 2 (roughly a 2:1 relationship; I don't remember the exact parameters, but it was something like this), the delay dropped to 25 ms and finally met the requirement!
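As a rough sanity check on those numbers, here is what the tuned tinyalsa configs would look like; the 48 kHz rate is my assumption, and the figures above are from memory:

```c
#include <tinyalsa/asoundlib.h>

/* Worst-case buffering delay = period_size * period_count / rate. */
struct pcm_config playback_cfg = {
    .channels     = 2,
    .rate         = 48000,
    .period_size  = 480,   /* default was 1024 x 4: ~85 ms of depth */
    .period_count = 4,     /* 480 * 4 / 48000 = 40 ms worst case */
    .format       = PCM_FORMAT_S16_LE,
};

struct pcm_config record_cfg = {
    .channels     = 2,
    .rate         = 48000,
    .period_size  = 240,   /* 240 * 2 / 48000 = 10 ms worst case */
    .period_count = 2,
    .format       = PCM_FORMAT_S16_LE,
};
/* The measured ~25 ms round trip sits below the 50 ms worst case
 * because neither buffer runs full in steady state. */
```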

But this creates a fatal problem: if the buffer size is too small, there is a risk of xrun, i.e. underrun on playback and overrun on recording, causing stutter and glitches...

The solution was crude, but there was no better option at the time.

After moving to a phone manufacturer, though (where karaoke over headphones is the main use case), I came across a far more powerful implementation. It is seriously impressive!

Let's take WeSing (全民K歌) as the example here. When WeSing plays through headphones, it sets the FAST flag, so a relatively small buffer size is used for data transfer, which gives good real-time behavior and low latency.
But, as noted above, the downside is that it is prone to xrun!

On Qualcomm platforms the channel is built inside the ADSP; I don't understand that part well enough to comment on it.

Here we mainly discuss the solution without an ADSP: find an entry point somewhere along the audio path from Android down to the Linux bottom layer, mix the uplink vocal with the downlink music there, establish a dedicated ear-return channel, and shrink that channel's buffer size to meet the real-time requirement.

[Figure: the HAL audio path]

If we mix in the HAL exactly as in the earlier solution, there is still, after the mix point, the ring buffer managed by ALSA on the downlink path, which serves as the cache between the audio-out thread and the chip's underlying interrupt. This is the playback path's own buffer, and the sound duration corresponding to its size is close to 120 ms. Such a delay clearly cannot meet the low-latency requirement, yet the buffer size cannot be reduced either; otherwise ordinary music playback on the phone would risk stutter and glitches, which no manufacturer can afford.

At this point the difficulty emerges: the ear-return vocal must use a small buffer size to guarantee real-time behavior, while music playback uses a large buffer size that cannot be changed at will. The ear-return solution has to satisfy both requirements at once.

Therefore, we must find a small buffer further "downstream" in the playback path to serve as the entry point for mixing.
But from the pipeline's point of view, the ALSA-managed ring buffer is already the bottom-most buffer, and its size corresponds to 120 ms of sound. If we want to cut in with a small buffer for mixing, we have to carve up this ALSA-managed ring buffer itself.

There is nothing special about this ring buffer's read/write mechanism. During playback, the AudioOut thread acts as the producer, continuously writing data; its write pointer keeps advancing, and once it has advanced by the total size of the buffer it wraps around and starts overwriting the old data. Meanwhile, the underlying chip interrupt acts as the consumer, continuously draining data; its read pointer likewise keeps advancing and wrapping, always trailing behind the write pointer and chasing it.
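A toy model of those two pointers (illustrative only; the real ALSA buffer tracks hw_ptr/appl_ptr in frames inside the kernel):

```c
#include <stdint.h>

#define RB_SIZE 8192u   /* total size in bytes (illustrative) */

struct ring {
    uint8_t  data[RB_SIZE];
    uint32_t wr;        /* producer: the AudioOut thread */
    uint32_t rd;        /* consumer: the chip's interrupt handler */
};

/* Producer: the write pointer keeps advancing and wraps around,
 * overwriting the oldest data once it has covered the whole buffer. */
static void rb_write(struct ring *rb, const uint8_t *src, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        rb->data[(rb->wr + i) % RB_SIZE] = src[i];
    rb->wr = (rb->wr + len) % RB_SIZE;
}

/* Consumer: the read pointer advances and wraps the same way,
 * always trailing behind and chasing the write pointer. */
static void rb_read(struct ring *rb, uint8_t *dst, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        dst[i] = rb->data[(rb->rd + i) % RB_SIZE];
    rb->rd = (rb->rd + len) % RB_SIZE;
}
```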

This suggests a bold idea: mix the vocal into the region of the buffer between the write pointer and the read pointer. The closer the mix point is to the underlying interrupt's read pointer, the better the real-time performance. The principle of the scheme is shown in the figure below:
[Figure: mixing the vocal into the region between the write pointer and the read pointer]
Straightening the ring buffer out:
[Figure: the ring buffer laid out flat, showing the offset between the mix point and the read pointer]
The offset is the gap between the mix point and the underlying interrupt's read pointer, i.e. the amount of buffered data that determines the ear-return delay; from the mix point onward, the buffer holds the mixed audio (vocal + background music).
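A minimal sketch of that mix-in step, assuming 16-bit samples and treating the buffer as already "straightened"; all names here are hypothetical:

```c
#include <stdint.h>

/* Saturating add so that vocal + music cannot wrap around and crackle. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* music:  the ALSA ring buffer viewed as int16 samples ("straightened").
 * rd:     the underlying interrupt's current read position, in samples.
 * offset: how far ahead of rd we dare to mix; this is the only
 *         remaining ear-return delay. */
static void mix_vocal(int16_t *music, uint32_t total,
                      uint32_t rd, uint32_t offset,
                      const int16_t *vocal, uint32_t n)
{
    uint32_t mix_pos = (rd + offset) % total;   /* mix point near rd */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t p = (mix_pos + i) % total;
        music[p] = sat16((int32_t)music[p] + (int32_t)vocal[i]);
    }
}
```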

With that, the complete solution is in place: it reconciles the two requirements of playing the low-latency vocal with a small buffer size and playing the music with the original large buffer size.

In fact, Android O introduced AAudio, designed specifically for low-latency, high-performance audio applications (earlier system versions only had OpenSL ES). Here is the official introduction:
https://developer.android.com/ndk/guides/audio/aaudio/aaudio
It is mainly implemented via mmap in the HAL. I have not actually used it, but I saw a post reporting that calling the AAudio API on a Pixel 2 XL running Android 9.0 measured about 14 ms.
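For reference, opening a low-latency AAudio output stream looks roughly like this (based on the official NDK API linked above; I haven't measured this myself, and error handling is trimmed):

```c
#include <aaudio/AAudio.h>
#include <stddef.h>

static AAudioStream *open_low_latency_out(void)
{
    AAudioStreamBuilder *builder = NULL;
    AAudioStream *stream = NULL;

    if (AAudio_createStreamBuilder(&builder) != AAUDIO_OK)
        return NULL;

    AAudioStreamBuilder_setDirection(builder, AAUDIO_DIRECTION_OUTPUT);
    AAudioStreamBuilder_setPerformanceMode(builder,
            AAUDIO_PERFORMANCE_MODE_LOW_LATENCY); /* may use the mmap path */
    AAudioStreamBuilder_setSharingMode(builder, AAUDIO_SHARING_MODE_EXCLUSIVE);
    AAudioStreamBuilder_setFormat(builder, AAUDIO_FORMAT_PCM_I16);
    AAudioStreamBuilder_setSampleRate(builder, 48000);
    AAudioStreamBuilder_setChannelCount(builder, 2);

    AAudioStreamBuilder_openStream(builder, &stream);
    AAudioStreamBuilder_delete(builder);
    return stream;  /* caller then does AAudioStream_requestStart() + writes */
}
```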

However, there are two difficulties:
Difficulty 1: the ear-return function built into the Android system is fairly basic and cannot meet users' diverse needs.
Difficulty 2: the Android ecosystem is complex, with a huge variety of models. Most phone manufacturers use the stock Android interfaces, and hardware support varies, so the ear-return effect differs across devices; on some models the ear-return delay (from capture by the system to the vocal coming out of the headphones) is quite high.

Therefore, to achieve the best possible delay on different Android phones, each manufacturer optimizes the system's underlying logic itself, tuning the Android ear-return path to the most ideal effect and implementing the ear-return feature in its own way.

I haven't seen write-ups online of how other manufacturers implement this, apart from one blogger's post with a picture:
[Figure: Huawei's karaoke dual-stream loopback path]
Huawei's karaoke ear-return solution provides a dual-stream bottom-layer loopback path. After the recording data reaches the HAL layer, one part of it is sent into the HiFi module, where mixing and effects are applied before it is played directly to the user, while the other part is sent up to the upper-layer APP; the APP then only needs to handle playback of the accompaniment. On newer platforms the bottom layer has also been deeply optimized, and the bottom-layer loopback delay (the green arrow in the figure above) can reach below 30 ms.

I don't know whether this is accurate; if you do, feel free to leave a comment and discuss~


For the AF implementation, see: Implementation of Android Ear Return Function
