Best Practices for RTC Sound Quality in Different Business Scenarios

Background

WebRTC is currently the most popular open-source framework in the field of real-time audio and video. After Google acquired GIPS in 2010, it integrated the engine into Chrome and open-sourced it under the name "WebRTC". WebRTC is supported by the major browser vendors and has been incorporated into W3C standards, which has driven the adoption of real-time audio and video in mobile Internet applications. In January 2021, the W3C and IETF, the two major standards bodies, announced that WebRTC had become an official standard, so users can run real-time audio and video communication in the browser without downloading extra components or separate applications. Although WebRTC is free and open source, it is large and complex, has a steep learning curve, and lacks a server-side design and deployment solution, which leaves room for commercial solutions built on top of it. Third-party RTC PaaS vendors, with their economies of scale and technical advantages, have become the first choice of developers, pushing the real-time audio and video industry into the fast lane.

615254d6d74c0490384df679f0583717.png

This article focuses on the audio engine architecture of WebRTC, analyzes the key technical factors behind it that affect sound quality, introduces the considerations behind the differentiated technical choices mainstream RTC vendors make for different business scenarios, and presents the best full-link sound quality improvement solutions.

1. Introduction to the WebRTC audio architecture

2b5b81cdbf1ea7b33cba4ce82c3249dc.png

The audio architecture of the entire RTC is shown in the figure above:

  • Uplink: after the audio signal is captured and called back by the device, it is processed by software 3A (acoustic echo cancellation (AEC), automatic noise suppression (ANS), automatic gain control (AGC)), then compressed by the audio encoder, packaged into RTP and sent to the SFU. The sending side is driven by the capture thread: after the capture device starts, it calls back audio data every 10 ms, and the data must be taken away promptly without blocking the capture thread; this is a push model. Software 3A is usually processed on the capture thread, while audio encoding runs on a separate, asynchronous thread;

  • Downlink: because of network packet loss, delay, jitter and out-of-order arrival, the receiving side puts each received audio RTP packet into NetEQ for analysis and reordering. NetEQ then calls the audio decoder (Decoder) to decode it and, depending on its decision logic, applies post-processing such as packet loss concealment (PLC) or accelerated playback to the decoded PCM data. The processed audio streams of the different channels are mixed by the mixer (Mixer) into a format suitable for the renderer. After mixing and before being sent to the renderer, the data is also fed to the software AEC module as the far-end reference signal, so that echoes are cancelled and the remote user does not hear their own voice. Unlike the capture thread, the downlink is driven by the render (playback) thread: once started, it requests data from the playback buffer frame by frame every 10 ms; this is a pull model.
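To make the push/pull distinction concrete, here is a minimal sketch with hypothetical callback names; it only illustrates the threading model described above and is not WebRTC's actual AudioTransport interface.

```cpp
// Hypothetical callbacks illustrating the 10 ms push (capture) and pull (render)
// models described above; not WebRTC's actual interface.
#include <cstddef>
#include <cstdint>

constexpr int kChannels = 1;

// Uplink (push): the capture driver delivers a 10 ms frame (480 samples per channel
// at 48 kHz); take it away quickly (queue it for 3A + encoding on another thread)
// so the capture thread never blocks.
void OnRecordedFrame(const int16_t* pcm, size_t samples_per_channel) {
  (void)pcm;
  (void)samples_per_channel;  // hand off to the 3A/encoding pipeline here
}

// Downlink (pull): the render driver asks for a 10 ms frame; fill it from the
// playback buffer fed by NetEQ, the decoder and the mixer. The same mixed frame is
// also passed to the software AEC as the far-end reference before playback.
void OnNeedPlayoutFrame(int16_t* pcm_out, size_t samples_per_channel) {
  for (size_t i = 0; i < samples_per_channel * kChannels; ++i) {
    pcm_out[i] = 0;  // placeholder: real code copies mixed samples here
  }
}
```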

2. Factors affecting RTC sound quality

Based on the audio architecture described in Section 1, the full-link factors that affect the end-to-end RTC sound quality experience fall into four areas: audio devices, audio 3A, the audio encoder, and NetEQ.

2.1 Audio equipment

Audio equipment is where sound quality starts, and the audio devices on different platforms have different characteristics and behaviors. The first step in improving sound quality is therefore to understand the differences between the audio devices on each platform;

2.1.1 Android

  • Audio driver: Android provides several audio capture and playback drivers. The current mainstream options are the Java-based AudioRecord/AudioTrack APIs and the C++-based OpenSL ES driver. In addition, Google introduced a new Android C API, AAudio, in Android O (see https://developer.android.com/ndk/guides/audio/aaudio/aaudio); the official documentation states that this API is designed for high-performance audio applications that require low latency;

        Java: AudioRecord/AudioTrack

        OpenSL ES: C++ (low latency, the current industry mainstream)

        AAudio: still has quite a few issues, to be improved (a minimal capture sketch follows the figure below)

2ab868eb8f884e518dd1966beb29f907.png
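As a reference for the AAudio path mentioned above, here is a minimal low-latency capture sketch using the NDK AAudio C API; the sample rate, channel count and the omitted error handling are assumptions and should be adapted per device.

```cpp
// A minimal AAudio capture sketch (assumes Android O+ and the NDK AAudio library).
// Error handling is trimmed; in production every aaudio_result_t should be checked.
#include <aaudio/AAudio.h>
#include <cstdint>

aaudio_data_callback_result_t OnInput(AAudioStream* /*stream*/, void* /*userData*/,
                                      void* audioData, int32_t numFrames) {
  // 'audioData' holds 'numFrames' frames of 16-bit PCM; hand them to 3A/encoding.
  (void)audioData;
  (void)numFrames;
  return AAUDIO_CALLBACK_RESULT_CONTINUE;
}

AAudioStream* OpenLowLatencyCapture() {
  AAudioStreamBuilder* builder = nullptr;
  AAudio_createStreamBuilder(&builder);
  AAudioStreamBuilder_setDirection(builder, AAUDIO_DIRECTION_INPUT);
  AAudioStreamBuilder_setSampleRate(builder, 48000);
  AAudioStreamBuilder_setChannelCount(builder, 1);
  AAudioStreamBuilder_setFormat(builder, AAUDIO_FORMAT_PCM_I16);
  AAudioStreamBuilder_setPerformanceMode(builder, AAUDIO_PERFORMANCE_MODE_LOW_LATENCY);
  AAudioStreamBuilder_setDataCallback(builder, OnInput, nullptr);

  AAudioStream* stream = nullptr;
  AAudioStreamBuilder_openStream(builder, &stream);
  AAudioStreamBuilder_delete(builder);
  AAudioStream_requestStart(stream);
  return stream;
}
```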

  • Audio parameters:

        AudioMode:

ae684a7d43cd0f3a4338a9af2154b329.png

        AudioSource:

d097c3ec8171f6aa26b10b262bcc6100.png

       StreamType: Affects the volume bar

9991eb2e2b489116dfd31ce7fc69b20e.png

  • RTC hardware and software 3A parameter settings: AudioMode/AudioSource/StreamType

  • Hardware 3A: audioMode = MODE_IN_COMMUNICATION; audioSource = VOICE_COMMUNICATION; streamType = STREAM_VOICE_CALL;

  • Software 3A: audioMode = MODE_NORMAL; audioSource = MIC; streamType = STREAM_MUSIC;

15f5b8c9585b8a8f8c731fc66b0e5a48.jpeg

6975999d2ac49e641aa9711f4949cbfd.jpeg

  •  The difference between hardware 3A mode and software 3A mode:

361ada87c70a017ca204bffb5d43ce67.png

074aa66359aea3c06908fb985e36a9ae.png

d14cf8265e09864ec125f96b3ff64d8d.png

Common practice in the industry: because of Android fragmentation, different phone manufacturers support the various audio drivers to different degrees, and compatibility varies widely; even the most common Java mode and OpenSL ES mode have adaptation problems. On Android, RTC vendors therefore usually deliver configurations of the parameters above per device model to solve stability and usability problems. A typical delivered configuration looks like this:

{"audioMode":3,"audioSampleRate":48000,"audioSource":7,"query":"oneplus/default ","useHardwareAEC":true,"useJavaAudioClass":true}

At present, OpenSL ES is widely used in the RTC field. One reason is architectural: the ADM (AudioDeviceModule) is managed uniformly in the C++ layer, and having the data callback in the C++ layer avoids the cost of passing data through Java -> JNI -> C++;

Impact of channel count on capture quality: generally speaking, stereo (two-channel) capture preserves richer spectral content than mono capture.

Android hardware 3A: native Android allows hardware AEC and ANS to be toggled on and off, but because domestic Android phone manufacturers customize their ROMs, on most Android phones hardware AEC and ANS are forcibly enabled in hardware 3A mode and cannot truly be turned off;

  • https://developer.android.com/reference/android/media/audiofx/AcousticEchoCanceler

  • https://developer.android.com/reference/android/media/audiofx/NoiseSuppressor

  • Android AudioKit

  • Huawei AudioKit:

    https://developer.huawei.com/consumer/cn/codelab/HMSAudioKit/#0;https://developer.huawei.com/consumer/cn/doc/development/Media-Guides/introduction_services-0000001053333356;

  • Integration benefit: it can solve the loss of capture bandwidth seen on some phones of manufacturer A

  • Limiting factors:

  1. The HarmonyOS (Hongmeng) system

  2. A Kirin chipset

  3. The playback thread must start before the capture thread

  • Besides the bandwidth problem, the capture sample rate of Android devices also varies greatly across models. 44100 Hz and 48000 Hz are usually the most compatible, but different devices support these two sample rates differently; for example, hardware 3A echo cancellation may perform poorly, or the capture callback data may be unstable, so the sample rate must also be adapted per device model;

2.1.2 iOS

  • Volume bars:

  • iOS also has both a call volume bar and a media volume bar, but the system UI does not distinguish between them and the icon looks the same. The way to tell them apart is shown in the figure below: the call volume bar cannot be adjusted down to 0, while the media volume bar can be adjusted to 0;

3ab52260d112722fe7f82ec07444d17b.png

  • Software and hardware 3A:

  • On iOS, the situation is similar to Android:

54742947c62274f7290e23b92fa99b66.jpeg

  • In RTC scenarios, the iOS software/hardware 3A parameter combinations are:

  • Hardware 3A: kAudioUnitSubType_VoiceProcessingIO + AVAudioSessionModeVoiceChat

  • Software 3A: kAudioUnitSubType_RemoteIO + AVAudioSessionModeDefault

  • The related iOS system interfaces are described below:

d0085115c1c72877a5026de1d999b35e.png

e8da1ef9d246ec65987470fd3a0ee153.png

a7e2dd690a261e78f421d36230d738ed.png

  • WebRTC uses an AudioUnit for capture on iOS; the relevant code is shown below:

aaf262980e07ad8312454c9545425563.png

  • According to Apple's API documentation, iOS provides three I/O units, of which the Remote I/O unit is the most commonly used. It connects to the input and output audio hardware, provides low-latency access to the incoming and outgoing sample values, and converts between the hardware audio format and the application audio format. The Voice-Processing I/O unit is an extension of the Remote I/O unit that adds echo cancellation for voice chat, as well as automatic gain correction, voice-quality tuning and muting. The Generic Output unit does not connect to audio hardware; instead it provides a mechanism for sending the output of a processing chain back to the application, and is usually used for offline audio processing.

2.1.3 Windows

  • On Windows there are usually three sets of audio device APIs: DirectSound (dsound), CoreAudio, and Wave

  • DirectSound: the best compatibility, the most used on desktop

  • CoreAudio: second in compatibility

  • Wave: rarely used

  • At present, many Windows devices have a built-in microphone array at the top of the screen that provides an "audio enhancement" function, enabled as shown in the figure below. By default this function treats the area in front of the screen as the pickup zone: the microphone-array processing effectively enhances the speaker's voice inside the zone and "isolates" the "noise" outside it. The main drawback is that once this function is enabled, only an 8 kHz spectrum is supported; moreover, each manufacturer's enhancement algorithm is different and the results are uneven. The software therefore needs the ability to bypass the hardware audio enhancement to guarantee high sound quality.

003adf6058c88c5e4872feef713065a0.png

38602f548c2537b3f38bbe5b58207d74.png

<Enhanced feature switch in audio settings>

d31127269730e6d6cffd945275303cf9.png

<The loss of frequency spectrum after audio enhancement is turned on>

  • In terms of volume, PC devices support analog gain adjustment, and most Windows devices with microphone arrays additionally expose a microphone boost (shown below). The software algorithm layer (the AGC in 3A) needs to adjust these adaptively to keep the capture volume stable and control the capture noise floor. Improper initial values or poor adaptive adjustment lead to problems such as low volume or clipping ("popping"), which seriously degrade echo cancellation and noise reduction and put usability at risk.

d6841605fbc21c8578374f0cca7b3503.png

<Analog Gain and Mic Boost>

2.1.4 Mac

  • The Mac platform is used less often, so it is not covered in detail here;

  • Like Windows, Mac supports dynamic analog volume adjustment;

2.1.5 Common audio device problems and solutions

  • Because hardware manufacturers differ, capture quality is uneven across platforms. The quality of the captured audio directly affects how usable the raw material fed to the 3A algorithms is, and it also sets the upper limit of the audio quality the receiving user can get. Based on problems encountered in practice, device-capture issues basically fall into the following categories:

31c2a28ed4d4b66d8ec97b25f43657eb.png

To give a few examples:

(1) Capture anomalies:

Capture anomalies mainly show up as a "smeared" spectrum, which can make speech unintelligible and prevent normal communication. The spectrogram looks like this:

8f0d24d80e339583cf64e48f2d146dcc.png

In addition, when capture is abnormal, the playback signal picked up again by the microphone is also abnormal, which introduces severe nonlinear distortion and hurts echo cancellation, as shown below.

a6fe10d02b03222dff0c49d48ccbaddb.png

(2) Capture jitter

The most common symptom is dropped capture data, which is heard as bursts of high-frequency noise (the second figure below is a zoomed-in view of the noise in the first). This seriously degrades the delay-estimation accuracy and the far-end/near-end causality assumptions in the AEC algorithm, and in severe cases leads to echo leakage.

82985a43c0369a4d97a09d4556e9fde1.png

039ab679d70fdb45286212d35a60fd96.png

(3) Clipping ("popping") and low volume

Capture clipping mainly occurs on PCs and is a problem PC devices must avoid, because its impact is large: besides the spectral distortion caused by truncation, the severe nonlinear distortion degrades echo cancellation. Solving the clipping problem requires the AGC algorithm to adaptively adjust the analog gain and microphone boost on the PC side.

97051f4be5f2127e2b676900373dada3.png

(4) Missing spectrum

Missing spectrum on the capture side mainly means that the actual spectral content does not match the audio sample rate reported by the hardware callback; even if the encoder is given a high bitrate, the result does not sound like high-quality audio. This was introduced above;

(5) Solutions:

To keep devices usable across business scenarios and improve the accuracy of the captured data, the common strategies are as follows:

  • Device adaptation:

According to the device model, Android adapts the capture sample rate, audioMode, audioSource and the choice between Java and OpenSL ES through configuration delivery;

  • Business-layer cooperation strategy:

On Windows, DingTalk Meeting lets users choose between dsound and CoreAudio:

6016a507e8fedee37c8022c926536f13.png

  • SDK self-adaptive strategy: a device monitor watches whether the device is available or abnormal. Once an anomaly occurs, the SDK can adopt the following strategies (a minimal sketch follows):

Automatically restart the audio capture/playback device;
automatically downgrade and switch the audio driver;
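A minimal sketch of such a self-adaptive strategy is shown below; the class and driver names are hypothetical and the restart/downgrade thresholds are assumptions.

```cpp
// Illustrative sketch only (hypothetical types, not a specific SDK's API): a device
// monitor that restarts the device on transient errors and downgrades the audio
// driver when restarts keep failing.
#include <cstdio>

enum class AudioDriver { kAAudio, kOpenSLES, kJavaAudioRecord };

class AudioDeviceMonitor {
 public:
  // Called by the capture/playback path when no data arrives or an error is reported.
  void OnDeviceError() {
    if (++restart_attempts_ <= kMaxRestarts) {
      std::printf("restarting audio device (attempt %d)\n", restart_attempts_);
      // reopen the stream with the same driver (omitted)
      return;
    }
    DowngradeDriver();  // repeated failures: fall back to a more compatible driver
    restart_attempts_ = 0;
  }

 private:
  void DowngradeDriver() {
    if (driver_ == AudioDriver::kAAudio)        driver_ = AudioDriver::kOpenSLES;
    else if (driver_ == AudioDriver::kOpenSLES) driver_ = AudioDriver::kJavaAudioRecord;
    std::printf("switched audio driver, reopening device\n");
  }

  static constexpr int kMaxRestarts = 3;
  int restart_attempts_ = 0;
  AudioDriver driver_ = AudioDriver::kAAudio;
};
```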

2.2 Audio 3A

Audio pre-processing is the key step in the whole audio processing chain. The raw audio captured by the microphone suffers from all kinds of problems such as noise and echo. For example, in a multi-party video conference, several devices in the same room opening their microphones at the same time can cause strong howling, and a speaker far from the microphone sounds weak. To improve audio quality, the sending side needs to apply echo cancellation, noise reduction and volume equalization to the signal in sequence, i.e. the 3A processing of AEC (echo cancellation), ANS (noise suppression) and AGC (automatic gain control). For different scenarios such as calls, chat, teaching and games, real-time audio/video vendors need to tune the 3A algorithms to the actual needs of the scenario to achieve a good audio result.

In the open-source WebRTC code, taking Android as an example, WebRTC checks whether the device offers hardware ANS and hardware AEC capabilities; if it does, the corresponding software modules are not enabled, and if it does not, software 3A is enabled as the fallback. The main issue is that hardware 3A capability differs greatly between devices, so relying on hardware 3A alone cannot give a fully reliable and stable result. The following sections analyze the principles of 3A and its impact on sound quality;

(1) AEC (Acoustic Echo Cancellation)

  • Principle: as shown in the figure below, a user in Room 1 speaks to the remote Room 2 over RTC; after the far-end loudspeaker plays the voice, the far-end microphone picks it up again, encodes it and sends it back to Room 1, so the user in Room 1 hears their own voice, the so-called "echo". The problem AEC solves is removing this echo signal so that the other side does not hear their own voice. In audio/video calls, when a user complains "I can hear my own voice", the cause is therefore usually not a problem with "my" phone, but that echo cancellation on the devices of the people talking to "me" is not working well.

a305314d8720633625a03ac1aef3885e.png

39201832963cb5177b7caa9c1bec4228.png

  • The basic principle of echo cancellation is to exploit the correlation between the echo-bearing signal captured by the near-end microphone and the far-end reference signal (the received far-end voice before playback). Based on the correlation between the loudspeaker signal and the echo it produces, a model of the echo path is built and an adaptive filter is adjusted so that its impulse response approaches the real echo path; the estimated echo is then subtracted from the microphone signal to achieve echo cancellation (a simplified sketch follows the list of difficulties below).

  • Several difficulties of echo cancellation in RTC scenarios:

  • The effectiveness of hardware AEC varies widely between devices;

  • the double-talk problem;

  • delay estimation;

  • echo leakage, etc.;

  • Reference article: https://developer.aliyun.com/article/781449?spm=a2c6h.14164896.0.0.70a21f36aoEDj7
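The sketch below illustrates the adaptive-filter principle with a plain time-domain NLMS filter; real AECs (including WebRTC's AEC3) work block-wise in the frequency domain and add delay estimation, double-talk control and nonlinear suppression, so this is only a teaching illustration.

```cpp
// A simplified NLMS adaptive-filter sketch of the echo-cancellation principle.
#include <vector>

class NlmsEchoCanceller {
 public:
  explicit NlmsEchoCanceller(size_t taps) : w_(taps, 0.0f), x_(taps, 0.0f) {}

  // far: far-end reference sample (what the loudspeaker plays)
  // mic: near-end microphone sample (near speech + echo)
  // returns the error signal, i.e. the mic signal with the estimated echo removed
  float Process(float far, float mic) {
    // shift the far-end history
    for (size_t i = x_.size() - 1; i > 0; --i) x_[i] = x_[i - 1];
    x_[0] = far;

    // estimate the echo through the current filter
    float echo_hat = 0.0f, energy = 1e-6f;
    for (size_t i = 0; i < w_.size(); ++i) {
      echo_hat += w_[i] * x_[i];
      energy += x_[i] * x_[i];
    }
    const float err = mic - echo_hat;

    // normalized LMS update: move the filter toward the real echo path
    const float mu = 0.5f;
    for (size_t i = 0; i < w_.size(); ++i) w_[i] += mu * err * x_[i] / energy;
    return err;
  }

 private:
  std::vector<float> w_;  // adaptive filter taps (echo-path estimate)
  std::vector<float> x_;  // far-end reference history
};
```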

(2) ANS (Automatic Noise Suppression)

  • The ANS in WebRTC reduces noise based on Wiener filtering. For each received frame of noisy speech, it starts from an initial noise estimate for that frame and defines a speech probability function; it measures classification features of each frame's signal, uses those features to compute a multi-feature speech probability per frame, and weights that probability with dynamic factors (signal classification features and threshold parameters). The computed feature-based speech probability then modifies the per-frame speech probability function, and the modified function is used to update the initial noise estimate (the quantile noise of each frame over consecutive frames) in every frame (a minimal gain sketch follows the figure below).

7ba82d68dc4cabf82c4413f4319c006d.png
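To make the Wiener-filter idea concrete, here is a minimal per-frequency-bin gain sketch; the SNR estimate, noise-update rule and minimum gain are simplified assumptions, and this is not WebRTC's actual noise-suppression implementation.

```cpp
// A minimal per-bin Wiener-gain sketch illustrating spectral noise suppression.
#include <algorithm>
#include <cmath>
#include <vector>

// Apply a Wiener-style gain to one frame of power spectrum, given a running noise
// estimate per frequency bin. Returns the suppressed magnitude spectrum.
std::vector<float> SuppressNoise(const std::vector<float>& signal_power,
                                 std::vector<float>& noise_power,
                                 float min_gain = 0.1f) {
  std::vector<float> out(signal_power.size());
  for (size_t k = 0; k < signal_power.size(); ++k) {
    // a-priori SNR estimate and the corresponding Wiener gain  G = SNR / (1 + SNR)
    const float snr =
        std::max(signal_power[k] / (noise_power[k] + 1e-10f) - 1.0f, 0.0f);
    const float gain = std::max(snr / (1.0f + snr), min_gain);

    // slowly update the noise estimate in bins the gain treats as "mostly noise"
    if (gain <= min_gain) {
      noise_power[k] = 0.95f * noise_power[k] + 0.05f * signal_power[k];
    }
    out[k] = gain * std::sqrt(signal_power[k]);  // suppressed magnitude
  }
  return out;
}
```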

  • The impact of noise reduction on sound quality:

Noise reduction affects signal quality more than the echo cancellation module does. This is because the noise reduction algorithm is designed around the a-priori assumption that the noise floor is stationary (at least short-term stationary), and under this assumption the separation between music and background noise is clearly weaker than the separation between speech and background noise. Pleasant music has rich detail in every band (especially the high frequencies), and losing any band can hurt the listening experience. Yet the energy of music in the mid-to-high frequencies (especially the highs) is often low, so after noise is superimposed the signal-to-noise ratio there is small, which makes ANS processing hard: musical detail in the mid and high frequencies is easily mistaken for noise and suppressed, causing damage. In contrast, the human voice is concentrated in the low and mid frequencies with higher energy and SNR, and suffers relatively less damage from ANS.

In short, music is more easily damaged by ANS. In music scenarios with high sound-quality requirements, it is recommended to lower the noise-reduction level or even turn noise reduction off, and instead reduce noise at the environmental level as much as possible.

  • Recorded audio:

f6650962fdcdeec9b33caa4dce1d1d74.png

  • Audio after ANS:

9e1cda03d2389c925a46ee5b928640be.png

(3) AGC (Automatic Gain Control)

  • Not every signal should receive gain. Just as ANS/AEC suppress only noise/far-end echo and preserve near-end speech, AGC must also pick out the near-end speech in the captured signal and avoid amplifying noise, echo and other unrelated signals. From this point of view it makes sense to place the AGC module after AEC and ANS, because noise and echo have already been greatly reduced by then. However, the "advantage" of this position does not mean AGC can work without care: to avoid anything slipping through, voice activity detection (VAD) is usually still needed to further distinguish speech segments from non-speech segments.

28d7d8f40ee4250a772567c015bb8232.png

AGC has several key parameters:

Target level (targetLevelDbfs): the target value of volume equalization; if set to 1, the output level target is -1 dBFS;

Gain capability (compressionGaindB): the maximum gain that can be applied; if set to 12 dB, the level can be raised by at most 12 dB;

894368321b9d1e628aa61253b97c28be.png

  • Three modes of AGC:

enum {
  kAgcModeUnchanged,
  kAgcModeAdaptiveAnalog,   // adaptive analog mode
  kAgcModeAdaptiveDigital,  // adaptive digital gain mode
  kAgcModeFixedDigital      // fixed digital gain mode
};

The PC side supports analog gain adjustment, so kAgcModeAdaptiveAnalog is usually used there, meaning the analog gain and the digital gain are adjusted together to control the volume (a configuration sketch follows);
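For reference, the sketch below configures these AGC parameters through WebRTC's AudioProcessing module; the field names follow the AudioProcessing::Config style of recent WebRTC versions and may differ between releases, and the numeric values are only illustrative.

```cpp
// Sketch of configuring 3A/AGC via WebRTC's AudioProcessing (APM). Field names may
// differ across WebRTC releases; the values are illustrative, not recommendations.
#include "modules/audio_processing/include/audio_processing.h"

rtc::scoped_refptr<webrtc::AudioProcessing> CreateApm() {
  rtc::scoped_refptr<webrtc::AudioProcessing> apm =
      webrtc::AudioProcessingBuilder().Create();

  webrtc::AudioProcessing::Config config;
  config.echo_canceller.enabled = true;     // software AEC
  config.noise_suppression.enabled = true;  // software ANS
  config.gain_controller1.enabled = true;   // software AGC
  // kAdaptiveDigital for mobile; on PC, kAdaptiveAnalog is typically used so the
  // analog microphone gain is adjusted as well.
  config.gain_controller1.mode =
      webrtc::AudioProcessing::Config::GainController1::kAdaptiveDigital;
  config.gain_controller1.target_level_dbfs = 3;    // target output level: -3 dBFS
  config.gain_controller1.compression_gain_db = 9;  // at most +9 dB of gain
  config.gain_controller1.enable_limiter = true;    // avoid clipping after gain
  apm->ApplyConfig(config);
  return apm;
}
```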

2.3 Encoder

Opus encoder:

Opus is a codec that combines SILK and CELT. The academic name for this approach is USAC, Unified Speech and Audio Coding: a codec that does not distinguish between music and speech. The codec contains a music detector that judges whether the current frame is speech or music; speech is encoded as SILK frames and music as CELT frames. It is generally recommended not to force the encoder into either mode.

Currently WebRTC uses kVoip as the Opus application, and the hybrid encoding mode is enabled by default;

How the encoder chooses between the music and speech coding algorithms in hybrid mode:

62853e29196fbf0d1eb994b141da732c.png

Opus coding has the following characteristics:

  • Bitrates from 6 kb/s to 510 kb/s

  • Sampling rates from 8 kHz (narrowband) to 48 kHz (fullband)

  • Frame size from 2.5ms to 60ms

  • Supports constant bit rate (CBR) and variable bit rate (VBR)

  • Audio bandwidth from narrow to full band

  • Support voice and music

  • Support mono and stereo

  • Supports up to 255 channels (multi-streamed frames)

  • Dynamically adjustable bitrate, audio bandwidth and frame size

  • Good robustness against packet loss, with packet loss concealment (PLC)

b4be192d4d13e346935961d951513e16.png

  • Sweet-spot bitrates:

71d3a7322038d1ea3d1ece77fa9e1eb2.png

dd76a692eca6c90453859b2c8513c4c0.png

  • Summary: in music scenes and other scenes with high sound-quality requirements, CELT coding at higher complexity and higher bitrate can be used for a high-quality music experience; in voice scenes, SILK coding can be used. The encoder bitrate also matters: when the bitrate is too low, Opus automatically downgrades from FB (fullband) coding to WB or even NB coding, producing a "spectrum truncation" effect on the encoder side; AAC behaves similarly. This can be read together with the case study in Section 3.6.
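The sketch below shows how such scene-dependent choices might be expressed with the libopus API; the bitrates, complexity values and the music/voice split are illustrative assumptions, not settings mandated by WebRTC.

```cpp
// A hedged libopus sketch of scene-dependent encoder settings.
#include <opus.h>

OpusEncoder* CreateRtcEncoder(bool music_scene) {
  int err = 0;
  // 48 kHz stereo for music, mono for voice; OPUS_APPLICATION_VOIP matches the
  // kVoip-style usage above, OPUS_APPLICATION_AUDIO favors music quality.
  OpusEncoder* enc = opus_encoder_create(
      48000, music_scene ? 2 : 1,
      music_scene ? OPUS_APPLICATION_AUDIO : OPUS_APPLICATION_VOIP, &err);
  if (err != OPUS_OK) return nullptr;

  opus_encoder_ctl(enc, OPUS_SET_BITRATE(music_scene ? 128000 : 32000));
  opus_encoder_ctl(enc, OPUS_SET_COMPLEXITY(music_scene ? 10 : 7));
  // Hint the signal type instead of forcing SILK/CELT; OPUS_AUTO lets the built-in
  // music detector decide per frame.
  opus_encoder_ctl(enc, OPUS_SET_SIGNAL(music_scene ? OPUS_SIGNAL_MUSIC : OPUS_AUTO));
  // A sufficient bitrate keeps fullband coding and avoids spectrum truncation.
  opus_encoder_ctl(enc, OPUS_SET_MAX_BANDWIDTH(OPUS_BANDWIDTH_FULLBAND));
  return enc;
}
```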

2.4 NetEQ

NetEQ is a key WebRTC technology, mainly for resisting network jitter and packet loss. Jitter means that, because of the network, the amount of data arriving at the receiver is uneven across time periods; in other words, the inter-arrival intervals of packets vary. Packet loss means packets are lost in transit for various reasons; a packet that finally arrives after several retransmissions counts as recovered, while one whose retransmissions also fail or arrive too late becomes a real loss, which the packet loss concealment (PLC) algorithm must compensate by synthesizing data. Packet loss and jitter can be unified along the time dimension: what arrives within the waiting window is jitter, what arrives late is retransmission, and what never arrives within the window is a "true loss". One of the goals of NetEQ optimization is to minimize the probability that a packet becomes a "true loss".

6444b304e4c58aae137f6f1d5fdaf514.png

The core of NetEQ consists of an MCU module and a DSP module. The MCU module is responsible for inserting packets into and fetching packets from the jitter buffer, and for issuing operations to the DSP; the DSP module is responsible for processing the voice signal, including decoding, acceleration, deceleration, merging, PLC, etc. How the MCU fetches packets from the jitter buffer is also influenced by feedback from the DSP module. The MCU's tasks include putting audio packets into the buffer, taking them out, estimating the network delay from packet inter-arrival times, and deciding which operation the DSP should perform (accelerate, decelerate, PLC, merge); the DSP's tasks include decoding, applying those operations to the decoded PCM data, estimating the network jitter level, and delivering playable data to the playout buffer.
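The following is a deliberately simplified sketch of the kind of decision the MCU makes for each 10 ms of audio; it only illustrates the idea of choosing between accelerate, decelerate (preemptive expand), PLC and merge based on buffer level, and is not NetEQ's actual decision logic.

```cpp
// Much-simplified illustration of an MCU-style operation decision.
enum class NetEqOp { kNormal, kAccelerate, kPreemptiveExpand, kExpandPlc, kMerge };

NetEqOp DecideOperation(int buffered_ms, int target_delay_ms, bool packet_available,
                        bool last_op_was_expand) {
  if (!packet_available) return NetEqOp::kExpandPlc;  // nothing to decode: conceal
  if (last_op_was_expand) return NetEqOp::kMerge;     // smooth concealed -> real audio
  if (buffered_ms > target_delay_ms * 2)
    return NetEqOp::kAccelerate;                      // too much buffered: time-compress
  if (buffered_ms < target_delay_ms / 2)
    return NetEqOp::kPreemptiveExpand;                // too little buffered: time-stretch
  return NetEqOp::kNormal;
}
```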

When a real loss occurs, NetEQ invokes the packet loss concealment (PLC) algorithm. PLC uses all the information available to compensate for the lost audio packets so that the gap is hard to notice, preserving the clarity and fluency of the received audio and giving users a better call experience.

Across the whole RTC link, packet loss is a common phenomenon of network transmission and one of the main causes of voice-quality degradation in VoIP (Voice over Internet Protocol) calls. Traditional PLC is based mainly on signal-processing principles: it reconstructs the lost speech using the decoder-parameter information from before the loss. Its biggest advantages are low computation cost and the ability to compensate online; its drawback is limited capability, as it can only effectively conceal losses of roughly 40 ms. For long bursts of consecutive loss, traditional algorithms produce artifacts such as mechanical sound and rapid waveform decay that cannot be effectively compensated;

As shown below, with a fixed 120 ms loss, the neteq_plc algorithm conceals the loss by simply repeating and attenuating pitch periods to rebuild the waveform of the lost segment, while the opus_plc algorithm's capability is limited: it can only conceal about 40 ms effectively, and losses beyond 40 ms decay into silence.

e0fb2e9c414a092f6af95afaa8265e62.png

Therefore, to improve full-link sound quality, there are two technical directions to optimize on the network and NetEQ side:

  • Improve weak-network resistance through RED/FEC + NACK;

  • Improve the listening experience under "true packet loss" by optimizing the receiver-side PLC algorithm (see the sketch below);
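As a reference for the first direction, the sketch below uses the libopus API to enable in-band FEC on the sender and to fall back to FEC or PLC decoding on the receiver; the loss percentage and frame size are illustrative assumptions.

```cpp
// A hedged libopus sketch: sender-side in-band FEC plus receiver-side FEC/PLC decode.
#include <opus.h>

void EnableSenderFec(OpusEncoder* enc, int expected_loss_percent) {
  opus_encoder_ctl(enc, OPUS_SET_INBAND_FEC(1));  // add in-band redundancy
  opus_encoder_ctl(enc, OPUS_SET_PACKET_LOSS_PERC(expected_loss_percent));
}

// Decode one 20 ms frame (48 kHz mono => 960 samples). If a packet was lost but the
// *following* packet arrived, pass that following packet with decode_fec = 1 to
// recover the lost frame from its in-band FEC; if nothing is available, passing a
// null payload triggers the decoder's built-in PLC.
int DecodeWithRecovery(OpusDecoder* dec, const unsigned char* payload, int bytes,
                       opus_int16* pcm, bool recover_from_fec) {
  const int frame_samples = 960;
  if (payload == nullptr)
    return opus_decode(dec, nullptr, 0, pcm, frame_samples, 0);  // PLC
  return opus_decode(dec, payload, bytes, pcm, frame_samples,
                     recover_from_fec ? 1 : 0);                  // FEC or normal
}
```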

3. Different business scenarios and the technology choices behind them

3.1 Live streaming scenario:

  • Live streaming is usually an entertainment scenario: the host typically plays music on the phone through an external sound card. Such scenarios demand high sound quality, especially for music. The usual technical choice is therefore software 3A + the media volume bar + music encoding; moreover, when an external sound card is connected, the RTC SDK usually applies no 3A processing to the sound at all, and without a sound card it uses weak noise reduction or turns noise reduction off entirely;

747ee5df2f80f76b17a6329a453dfb21.png

3.2 Conference scenario:

  • The conference scenario is a typical voice-call scenario, so Tencent Meeting and DingTalk Meeting generally adopt hardware 3A mode, the call volume bar and speech coding. The screenshot below is from the DingTalk Meeting mobile client; what its "true music mode" actually does is turn off the software noise reduction (ANS) algorithm. The drawback of this approach is obvious: the effective bandwidth of the captured source is already very low and has already been processed by hardware 3A, so turning off software 3A afterwards brings little benefit;

6864004036706040f11b4bf357eddfb8.png

97b8b3d87f37d288fb13c53e7d309913.jpeg

The Tencent Meeting client uses the same solution:

b1e697f68f3171d43fa04c918aa4c760.png

3.3 Communication scenarios:

  • Take WeChat video calls as a typical communication scenario: its sound-quality technology choice is the same as in the conference scenario, i.e. hardware 3A + the call volume bar, but WeChat calls do not provide a music-mode setting like the meeting apps do;

776c1421de64673b97ebba712ff65901.jpeg

3.4 Online Education Scenario:

Online education can be subdivided into general online education and music education. The former is mainly voice lectures, while the latter focuses on instrument performance, such as piano teaching;

  • General Education Scenario:

In general education scenarios the main sound-quality goal is clear, intelligible speech, so a common choice is a hardware 3A solution similar to the one used in communication scenarios;

f7d8ef552838ee5b4303578c68fcff98.jpeg

  • Music education scene:

ad9e9e5ea36f2d95df98e35bb09a1f86.png

The music education scenario is essentially similar to live streaming; it is also a pan-entertainment scenario, so the technology choices behind it likewise need to put sound quality first, i.e. software 3A capture + music-grade noise reduction + music encoding;

3.5 "Watching together" scene

The "watching together" scene refers to the scene where multiple people watch the same video together in real-time voice chat in one room. This kind of application scene and its extended scene include "watching movies together", or playing video courseware in the education classroom teacher It can be watched by online classmates together, or in scenes such as script killing games interspersed with video plots online.

The screenshots below are from "Shimmer", a typical top app in this scenario:

a02b480708c1b62115e00aaab0562934.jpeg

77629178cee59545c49cb4ab12c612c6.png

The technical difficulty in this scenario is that the player SDK and the RTC SDK are two separate SDKs: how should the echo of the player's sound be cancelled? Otherwise everyone in the room hears the movie twice. And because of that echo, the voices in the movie trigger voice activation while the player is playing, so the sound-wave indicator in the UI vibrates along with the movie characters' voices even when the user is not speaking. The difficulty in this scenario is therefore eliminating the movie echo. There are currently three solutions:

  • Solution 1: hardware 3A. This solution is simple to implement and relies on hardware 3A to cancel the movie echo; the current online version of Shimmer uses it. But it has several problems:

  1. Android ends up with two volume bars: the player uses the media volume bar while the voice-chat RTC SDK uses the call volume bar, so the experience is poor; the user on mic adjusts the call volume by default and has to adjust the media volume separately;

  2. The AEC performance of hardware 3A differs across Android models, and some models leak echo because their hardware echo cancellation is poor; this cannot be fixed and the experience is bad;

  3. Although iOS can also use the hardware 3A solution, on iOS 14 and above the system prioritizes the call volume when hardware 3A and media volume coexist, so even with the player volume at maximum the actual movie sound is still very quiet, which seriously hurts the user experience;

  • Solution 2: the core idea of Solutions 2 and 3 is to obtain the pre-render audio reference signal from the player and feed it to the software AEC as the far-end reference for echo cancellation, thereby removing the echo. Solutions 2 and 3 can also use the media volume bar uniformly, with RTC running a pure software 3A pipeline, which solves the three experience problems of Solution 1 very well (a sketch of this idea follows below).

e5e250c47148ed67458f29a354635b0f.png

  • Solution 3:

6a528e09e5127b888ab87c8cee59a64d.png

Solution 3 is a further evolution: the host's playback progress can also be synchronized to the co-watching clients through the RTC signaling channel, so playback stays in sync across participants; the business-layer logic is simplified and the performance is the best.
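The sketch below illustrates the shared idea of Solutions 2 and 3: the player mirrors its pre-render PCM into the RTC SDK as the AEC far-end reference. The class name, the method PushExternalEchoReference and the callback signature are hypothetical; real RTC SDKs expose similar interfaces under different names.

```cpp
// Illustrative sketch only: hypothetical API for feeding the player's pre-render
// audio to the RTC SDK as the software-AEC far-end reference.
#include <cstddef>
#include <cstdint>

class RtcEngine {
 public:
  // Hypothetical: must be called with the same timing and format used for playback.
  void PushExternalEchoReference(const int16_t* pcm, size_t samples_per_channel,
                                 int sample_rate_hz, int channels) {
    // A real SDK would buffer/resample this frame and hand it to the software AEC
    // as the far-end reference; omitted here.
    (void)pcm; (void)samples_per_channel; (void)sample_rate_hz; (void)channels;
  }
};

// Player-side render callback (hypothetical signature): right before the movie
// audio is written to the speaker, mirror it into the RTC SDK.
void OnPlayerAudioFrame(RtcEngine* rtc, const int16_t* pcm,
                        size_t samples_per_channel, int sample_rate_hz, int channels) {
  rtc->PushExternalEchoReference(pcm, samples_per_channel, sample_rate_hz, channels);
  // ...then hand 'pcm' to the audio device for playback as usual.
}
```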

3.6 Case study:

  • Background: in a leading entertainment app, after integrating the RTC SDK, hosts reported that when playing music through an external sound card the sound quality was poor, and the quality heard by the audience was poor as well.

fa554a21fe39daa1817125bdc9bb4850.png

  • Cause Analysis:

  • SDK side: when host A used an external sound card to push background music, the music heard by host B sounded poor. The main reason was that the ANS (automatic noise suppression) in the 3A algorithm library damaged the music in some scenarios (reproduced in internal testing); it did not cause spectrum truncation, and the issue was resolved after optimization;

  • Server side: when the MPU pulled the two hosts' streams, mixed them and transcoded them into an AAC stream, the AAC AudioProfile parameters were set to LC, 64 kbps, 1 channel. At this bitrate the effective AAC bandwidth is only 16 kHz, i.e. spectrum truncation. With mono 48000 Hz input at 64 kbps, AAC-LC is inferior to AAC-HEv1/HEv2 in both bandwidth and listening quality.

d537f5e015d12d6c094e2139ec2d5fc4.png

7075a429aafef3cc7fa4ce4a0414451c.png

3.7 Industry evolution trend:

Because hardware 3A is simple to implement and relatively power-efficient, it was for a long time the mainstream technical solution in the RTC field. However, software 3A technology has matured: today's leading software 3A solutions can already adapt their algorithm strategy to whether the content is voice or music. Add to this the "fatal flaws" of hardware 3A (low capture quality, and large differences in behavior across Android devices that bring heavy adaptation costs), and the cutting edge of RTC is gradually evolving toward pure software 3A. Pure software 3A gives users a more consistent, high-quality sound experience across scenarios and greatly reduces RTC vendors' adaptation manpower and technical cost, but it also requires sufficiently strong 3A algorithm capability.

To meet the different sound-quality requirements of different business scenarios with a single SDK, RTC vendors provide different AudioProfiles for users to choose from; see Appendix II for details.

4. Summary:

Based on the common audio engine architecture of RTC scenarios, this article has analyzed the factors that affect sound quality one by one along the technical dimensions of device capture, 3A processing, the encoder and NetEQ. Combining the different sound-quality requirements of conferences, education, entertainment and other scenarios, it has explained the audio strategies and solution-selection considerations for each business scenario, and finally offered a personal analysis and prediction of the industry's evolution.

Appendix I: Exploring why the capture bandwidth is only 8 kHz in VoIP mode

Phenomenon:

  • When the app layer creates an AudioRecord for capture, the effective captured bandwidth differs by audioMode even with the same sample-rate setting: in VoIP mode the effective bandwidth is only 8 kHz, while in normal mode the full band is captured (the same phenomenon also exists on iOS). It is unrelated to the phone model or chipset;

756157f986869a823e8f9d42bb115393.png

3c1d48847a8c190f112298cda4627cb5.png

Reason

  • Where is the root cause of the 8 kHz bandwidth in VoIP mode? (It is suspected that the hardware 3A only supports 16 kHz, with resampling before and after it.)

865c17234d0ad43bddce1944b7bd52f8.png

  • Reason: taking Qualcomm chips as an example, the SRC (sample-rate converter) sits inside the audio DSP; in VoIP mode the encoder stage is optional, the front SRC downsamples to 16 kHz, and the audio DSP then upsamples back to 48 kHz and hands 48 kHz data to the AP layer. On MTK chips the SRC logic sits outside the audio DSP. The core reason is that the hardware 3A only supports 16 kHz: partly because of DSP power consumption (the AEC algorithm is relatively complex), and partly because VoIP speech coding usually only supports 8 kHz/16 kHz;

  • Relevant information:

  1. https://www.cnblogs.com/talkaudiodev/p/8996338.html

  2. https://www.cnblogs.com/talkaudiodev/p/8733968.html

Solution exploration

  • Solution 1: consider introducing a bypass mode similar to iOS's kAUVoiceIOProperty_BypassVoiceProcessing

  • https://developer.apple.com/documentation/audiotoolbox/1534007-voice-processing_i_o_audio_unit_proper

  • The problem the iOS bypass mode solves: in VoIP mode, under the call volume bar, hardware 3A is not applied;

  • Solution 2: adapt to the native Android system's control logic for the hardware 3A switch

Benefits

  • The hardware 3A strategy can be controlled independently, achieving the best sound-quality configuration under a single volume bar (the call volume bar) and giving app developers more room to play;

  • Typical application scenarios: those that need the call volume bar (which cannot be turned down to 0) together with high sound quality, for example music education;

Device differences

  • Huawei Mate 30:

  • With audioMode = MODE_IN_COMMUNICATION + audioSource = VOICE_COMMUNICATION, the recording bandwidth is 8 kHz (possibly for power-consumption reasons);

  • OnePlus 10R 5G:

  • With audioMode = MODE_IN_COMMUNICATION + audioSource = VOICE_COMMUNICATION and hardware AEC enabled, the bandwidth is not harmed; but if an AudioTrack playback thread is started, the AEC logic is triggered and the bandwidth drops;

Appendix II: Related definitions of AudioProfile

Refer to the relevant API documentation of Shengwang (Agora): https://docportal.shengwang.cn/cn/All/API%20Reference/java_ng/API/toc_audio_process.html#ariaid-title36

public abstract int setAudioProfile(int profile, int scenario);

Encoding mode (profile):

  • The corresponding instantaneous sending bitrates are only approximate ranges and are for reference only; usually QoS dynamically adjusts the number of redundant packets and the encoder's target bitrate according to network conditions.

5d415fcbd80bf121ef58100ad9d3649a.png

Scenario mode (scenario):

1964c6b27d6e6cfd5faef49cf369f592.png

The scenario mode affects the audio device parameters, the software 3A strategy and the related QoS strategies, and has nothing to do with the encoding bitrate. Different scenario modes emphasize different aspects of audio processing and correspond to different volume bars; choose according to the business scenario and its sound-quality and related technical requirements.

Appendix III: References:

  1. Simple WebRTC AEC (acoustic echo cancellation): https://developer.aliyun.com/article/781449?spm=a2c6h.14164896.0.0.70a21f36aoEDj7

  2. A plain-language interpretation of WebRTC audio NetEQ and optimization practice: https://developer.aliyun.com/article/782756

  3. Improving the RTC audio experience: start from understanding the hardware: https://developer.aliyun.com/article/808257

  4. Low latency and high sound quality explained | Echo cancellation and noise reduction: https://www.rtcdeveloper.cn/cn/community/blog/21147
