WebRTC series: all about audio | Audio and video development

Today I will talk about audio in WebRTC. WebRTC consists of three major modules: the voice engine, the video engine, and network transmission. The voice engine is one of the most valuable technologies in WebRTC; it implements the full processing chain of capture, pre-processing, encoding, sending, receiving, decoding, mixing, post-processing, and playback.

The audio engine mainly includes: the audio device module (ADM), the audio encoder factory, the audio decoder factory, the audio mixer (Mixer), and the audio pre-processing module (APM).

How audio works

If you want to understand the audio engine systematically, you first need to understand its core implementation classes and how audio data flows through them. Let's briefly walk through both.

 Audio engine core class diagram 

 

The audio engine class WebrtcVoiceEngine mainly contains the audio device module AudioDeviceModule, the audio mixer AudioMixer, the audio 3A processor AudioProcessing, the audio management class AudioState, the audio encoder factory AudioEncodeFactory, the audio decoder factory AudioDecodeFactory, and the voice media channels for sending and receiving. A sketch of how these pieces can be wired together follows the list below.

1. The audio device module AudioDeviceModule is mainly responsible for the hardware layer: capturing and playing out audio data and handling operations on the audio hardware devices.

2. The audio mixer AudioMixer is mainly responsible for mixing the audio data to be sent (device capture mixed with accompaniment audio) and mixing the audio data to be played out (multiple received audio streams mixed with accompaniment audio).

3. The audio 3A processor AudioProcessing is mainly responsible for pre-processing captured audio data, including acoustic echo cancellation (AEC), automatic gain control (AGC), and noise suppression (NS). The APM handles two streams, a near-end stream and a far-end stream. The near-end stream is the data coming in from the microphone; the far-end stream is the received data.

4. The audio management class AudioState holds the audio device module (ADM), the audio pre-processing module (APM), the audio mixer (Mixer), and the data flow hub AudioTransportImpl.

5. The audio encoder factory AudioEncodeFactory contains codecs such as Opus, iSAC, G711, G722, iLBC, and L16.

6. The audio decoder factory AudioDecodeFactory contains the matching decoders: Opus, iSAC, G711, G722, iLBC, L16, and so on.
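As a concrete illustration, here is a minimal sketch of how these components are typically handed to WebRTC when creating a PeerConnectionFactory with the public C++ API (CreatePeerConnectionFactory, AudioProcessingBuilder, the built-in codec factories, AudioMixerImpl). It is not taken from any particular project; the threads, the ADM argument, and the null video factories are placeholders, and exact signatures vary slightly between WebRTC versions.

```cpp
// Sketch: wiring the ADM, codec factories, mixer and APM into a PeerConnectionFactory.
// Error handling and thread setup are omitted.
#include "api/create_peerconnection_factory.h"
#include "api/audio_codecs/builtin_audio_decoder_factory.h"
#include "api/audio_codecs/builtin_audio_encoder_factory.h"
#include "modules/audio_mixer/audio_mixer_impl.h"
#include "modules/audio_processing/include/audio_processing.h"

rtc::scoped_refptr<webrtc::PeerConnectionFactoryInterface> CreateAudioOnlyFactory(
    rtc::Thread* network_thread,
    rtc::Thread* worker_thread,
    rtc::Thread* signaling_thread,
    rtc::scoped_refptr<webrtc::AudioDeviceModule> adm) {
  // 3A configuration: echo cancellation (AEC), noise suppression (NS), gain control (AGC).
  rtc::scoped_refptr<webrtc::AudioProcessing> apm = webrtc::AudioProcessingBuilder().Create();
  webrtc::AudioProcessing::Config apm_config;
  apm_config.echo_canceller.enabled = true;
  apm_config.noise_suppression.enabled = true;
  apm_config.gain_controller1.enabled = true;
  apm->ApplyConfig(apm_config);

  return webrtc::CreatePeerConnectionFactory(
      network_thread, worker_thread, signaling_thread,
      adm,                                         // audio device module (capture/playout)
      webrtc::CreateBuiltinAudioEncoderFactory(),  // Opus, G711, G722, iLBC, L16, ...
      webrtc::CreateBuiltinAudioDecoderFactory(),
      /*video_encoder_factory=*/nullptr,
      /*video_decoder_factory=*/nullptr,
      webrtc::AudioMixerImpl::Create(),            // playback-side audio mixer
      apm);                                        // 3A processor
}
```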

  Audio workflow 

 

1. The initiator collects sound through the microphone

2. The initiator sends the captured sound signal to the APM module for echo cancellation (AEC), noise suppression (NS), and automatic gain control (AGC)

3. The initiator sends the processed data to the encoder for voice compression and encoding

4. The initiator sends the encoded data through the RtpRtcp transport module, which transmits it over the Internet to the receiver

5. The receiver receives the audio data from the network and first passes it to the NetEQ module for jitter removal, packet loss concealment, decoding, and so on

6. The receiver sends the processed audio data to the sound card device for playback

The NetEQ module is the core module of the WebRTC voice engine

 

The NetEQ module is roughly divided into an MCU module and a DSP module.

The MCU module is mainly responsible for calculating delay and jitter statistics and generating the corresponding control commands.

The DSP module is responsible for receiving and processing the corresponding data packets according to the MCU's control commands, and for passing the result on to the next stage.
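To make the MCU/DSP split concrete, here is a deliberately simplified, hypothetical sketch of the decision loop. The operation names mirror NetEQ's time-scaling and concealment operations (normal, accelerate, preemptive expand, expand/PLC, merge), but the types, thresholds, and functions below are illustrative only, not the actual NetEQ source.

```cpp
// Illustrative-only sketch of the NetEQ MCU/DSP split (not the real implementation).
#include <cstdint>
#include <vector>

enum class Operation { kNormal, kAccelerate, kPreemptiveExpand, kExpand, kMerge };

// "MCU" side: compare the buffered audio against the target delay and pick an operation.
Operation DecideOperation(int buffered_ms, int target_delay_ms, bool packet_available) {
  if (!packet_available) return Operation::kExpand;  // loss or late packet: conceal (PLC)
  if (buffered_ms > target_delay_ms + 20) return Operation::kAccelerate;        // shrink delay
  if (buffered_ms < target_delay_ms - 20) return Operation::kPreemptiveExpand;  // stretch audio
  return Operation::kNormal;                         // just decode and play
}

// "DSP" side: execute the chosen operation on decoded PCM (details omitted).
void ExecuteOperation(Operation op, std::vector<int16_t>& pcm_frame) {
  switch (op) {
    case Operation::kNormal:           /* decode the next packet as-is          */ break;
    case Operation::kAccelerate:       /* time-compress without changing pitch  */ break;
    case Operation::kPreemptiveExpand: /* time-stretch without changing pitch   */ break;
    case Operation::kExpand:           /* generate concealment audio (PLC)      */ break;
    case Operation::kMerge:            /* smooth concealment back to real audio */ break;
  }
}
```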

Audio data flow

Following the audio workflow described above, let's refine the picture into the actual audio data flow, focusing on the important role that the data flow hub AudioTransportImpl plays across the whole pipeline.

The data flow hub AudioTransportImpl implements the capture-side data processing interface RecordedDataIsAvailable and the playback-side data processing interface NeedMorePlayData.

RecordedDataIsAvailable is responsible for processing captured audio data and distributing it to all sending Streams.

NeedMorePlayData is responsible for mixing all received Streams, feeding the mix to the APM as a reference signal for processing, and finally resampling it to the requested output sample rate.
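For reference, these two callbacks live on WebRTC's webrtc::AudioTransport interface (declared in modules/audio_device/include/audio_device_defines.h). The sketch below shows roughly what their signatures look like; the exact parameter lists vary between WebRTC versions, so treat it as an approximation rather than the authoritative declaration.

```cpp
// Approximate shape of the webrtc::AudioTransport callbacks implemented by
// AudioTransportImpl (parameter lists vary slightly across WebRTC versions).
#include <cstddef>
#include <cstdint>

class AudioTransport {
 public:
  // Called by the ADM when a 10 ms block of captured audio is ready.
  virtual int32_t RecordedDataIsAvailable(const void* audio_samples,
                                          size_t n_samples,
                                          size_t n_bytes_per_sample,
                                          size_t n_channels,
                                          uint32_t samples_per_sec,
                                          uint32_t total_delay_ms,
                                          int32_t clock_drift,
                                          uint32_t current_mic_level,
                                          bool key_pressed,
                                          uint32_t& new_mic_level) = 0;

  // Called by the ADM when the playout device needs the next 10 ms of audio.
  virtual int32_t NeedMorePlayData(size_t n_samples,
                                   size_t n_bytes_per_sample,
                                   size_t n_channels,
                                   uint32_t samples_per_sec,
                                   void* audio_samples,
                                   size_t& n_samples_out,
                                   int64_t* elapsed_time_ms,
                                   int64_t* ntp_time_ms) = 0;

  virtual ~AudioTransport() = default;
};
```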

The main internal steps of RecordedDataIsAvailable (an illustrative sketch follows the list):

  1. Resample the audio data captured by the hardware directly to the sending sample rate

  2. Run audio pre-processing (3A) on the resampled audio data

  3. VAD processing

  4. Apply digital gain to adjust the capture volume

  5. Pass the audio data to an external callback for external pre-processing

  6. Mix all the audio data that the sender needs to send, including the captured data and the accompaniment audio

  7. Calculate the energy value of the audio data

  8. Distribute the result to all sending Streams
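The following self-contained sketch mirrors those eight steps. Every helper in it is a trivial stand-in for the corresponding WebRTC component; it is meant to show the order of operations, not the literal AudioTransportImpl source.

```cpp
// Illustrative capture-path sketch; all helpers are placeholder stubs.
#include <cstdint>
#include <functional>
#include <vector>

struct AudioFrame {
  std::vector<int16_t> samples;   // one block of capture data
  int sample_rate_hz = 48000;
  double energy = 0.0;
};

void ResampleToSendRate(AudioFrame&) {}  // 1. resample to the sending sample rate (stub)
void Run3A(AudioFrame&) {}               // 2. APM: AEC / NS / AGC (stub)
void RunVad(AudioFrame&) {}              // 3. voice activity detection (stub)
void ExternalPreProcess(AudioFrame&) {}  // 5. external pre-processing callback (stub)
void MixAccompaniment(AudioFrame&) {}    // 6. mix capture data with accompaniment (stub)

void ApplyDigitalGain(AudioFrame& f, float gain) {  // 4. digital capture-volume gain
  for (auto& s : f.samples) s = static_cast<int16_t>(s * gain);
}

double ComputeEnergy(const AudioFrame& f) {         // 7. energy value of the frame
  double e = 0.0;
  for (int16_t s : f.samples) e += static_cast<double>(s) * s;
  return e;
}

// Rough order of operations inside the capture-side callback.
void ProcessCapturedFrame(
    AudioFrame& frame,
    const std::vector<std::function<void(const AudioFrame&)>>& send_streams) {
  ResampleToSendRate(frame);
  Run3A(frame);
  RunVad(frame);
  ApplyDigitalGain(frame, 1.0f);
  ExternalPreProcess(frame);
  MixAccompaniment(frame);
  frame.energy = ComputeEnergy(frame);
  for (const auto& send : send_streams) send(frame);  // 8. distribute to all sending streams
}
```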

The main internal steps of NeedMorePlayData (a mixing sketch follows the list):

  1. Mix the audio data from all received Streams

    1.1 Calculate the output sample rate: CalculateOutputFrequency()

    1.2 Collect audio data from each Source via GetAudioFromSources(), selecting the three non-muted channels with the highest energy for mixing

    1.3 Execute the mixing operation: FrameCombiner::Combine()

  2. Under certain conditions, inject noise as a reference signal for the capture side

  3. Mix in local sound

  4. Apply digital gain to adjust the playback volume

  5. Pass the audio data to an external callback for external pre-processing

  6. Calculate the energy value of the audio data

  7. Resample the audio to the requested output sample rate

  8. Send the audio data to the APM as a reference signal for processing
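The selection-and-combine step (1.2 and 1.3 above) can be illustrated with the minimal sketch below: keep the loudest three non-muted sources and sum them with clamping. This is an assumption-laden stand-in for GetAudioFromSources() plus FrameCombiner::Combine(), not the real mixer code, which additionally applies a limiter and handles channel layouts.

```cpp
// Illustrative playback-side mixing: pick up to three loudest non-muted sources and combine.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Source {
  std::vector<int16_t> frame_10ms;  // decoded 10 ms frame at the output sample rate
  double energy = 0.0;
  bool muted = false;
};

std::vector<int16_t> MixLoudestThree(std::vector<Source> sources, size_t frame_len) {
  // Keep only non-muted sources, sorted by descending energy.
  sources.erase(std::remove_if(sources.begin(), sources.end(),
                               [](const Source& s) { return s.muted; }),
                sources.end());
  std::sort(sources.begin(), sources.end(),
            [](const Source& a, const Source& b) { return a.energy > b.energy; });
  if (sources.size() > 3) sources.resize(3);

  // Sum samples with clamping to the int16 range (a real mixer also runs a limiter).
  std::vector<int16_t> mixed(frame_len, 0);
  for (const Source& s : sources) {
    for (size_t i = 0; i < frame_len && i < s.frame_10ms.size(); ++i) {
      int32_t sum = static_cast<int32_t>(mixed[i]) + s.frame_10ms[i];
      mixed[i] = static_cast<int16_t>(std::clamp(sum, -32768, 32767));
    }
  }
  return mixed;
}
```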

 

Looking at the data flow in the figure above, why do we need FineAudioBuffer and AudioDeviceBuffer?

Because WebRTC's audio pipeline only processes data in 10 ms units, while different operating system platforms deliver capture and playback audio in callbacks of different durations, and different sample rates also yield different durations per callback.

For example, on iOS a 16 kHz sample rate delivers 128 samples per callback, i.e. 8 ms of audio; an 8 kHz sample rate delivers 128 samples, i.e. 16 ms; a 48 kHz sample rate delivers 512 samples, i.e. about 10.67 ms.

The data that AudioDeviceModule plays out and captures is always taken in and handed out through AudioDeviceBuffer in 10 ms blocks of audio.

For platforms that cannot capture and play audio in 10 ms blocks, a FineAudioBuffer is inserted between the platform's AudioDeviceModule and the AudioDeviceBuffer to convert the platform's audio block size into the 10 ms audio frames that WebRTC can process.
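The core idea can be sketched in a few lines: accumulate whatever block size the device delivers and hand out exact 10 ms frames. The class below is a hypothetical helper for illustration, not the real webrtc::FineAudioBuffer API.

```cpp
// Minimal sketch of a FineAudioBuffer-style adapter (hypothetical helper).
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

class TenMsChunker {
 public:
  explicit TenMsChunker(int sample_rate_hz)
      : samples_per_10ms_(static_cast<size_t>(sample_rate_hz / 100)) {}  // 480 at 48 kHz

  // Called from the platform capture callback with whatever block size it delivers
  // (e.g. 128 samples per callback on iOS at 16 kHz, i.e. 8 ms).
  void Push(const int16_t* data, size_t count) {
    buffer_.insert(buffer_.end(), data, data + count);
  }

  // Returns true and fills `frame` once a full 10 ms block has accumulated.
  bool Pop10ms(std::vector<int16_t>& frame) {
    if (buffer_.size() < samples_per_10ms_) return false;
    frame.assign(buffer_.begin(), buffer_.begin() + samples_per_10ms_);
    buffer_.erase(buffer_.begin(), buffer_.begin() + samples_per_10ms_);
    return true;
  }

 private:
  const size_t samples_per_10ms_;
  std::deque<int16_t> buffer_;
};
```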

Every 10 seconds, AudioDeviceBuffer logs statistics on the number of samples and the sample rate of the audio coming from the current hardware device, which can be used to check whether the hardware is working properly.

Audio related changes

1. Audio profiles were implemented, supporting the VoIP and Music scenarios with a complete strategy covering sample rate, encoding bit rate, encoding mode, and number of channels. On iOS, capture and playback run on separate threads, and dual-channel (stereo) playback is supported.

2. Audio 3A parameters can be delivered via configuration and adapted for device compatibility.

3. Headset-scenario adaptation: support for Bluetooth headsets and ordinary wired headsets, with dynamic 3A switching.

4. The Noise_Injection noise-injection algorithm, used as a reference signal, is particularly effective for echo cancellation in headphone scenarios.

5. Support for local audio files and for network audio files over HTTP and HTTPS.

6. Audio NACK was implemented, improving resistance to audio packet loss; in-band FEC is currently in progress.

7. Audio processing is optimized separately for single-talk and double-talk scenarios.

8. Research on the iOS built-in AGC:

(1) The built-in AGC is effective for speech and music, but has no effect on noise and the ambient noise floor.

(2) Microphone hardware gain differs between models: iPhone 7 Plus > iPhone 8 > iPhone X. Therefore, when both software AGC and hardware AGC are turned off, the level heard at the far end differs between devices.

(3) In addition to the switchable AGC that iOS exposes, there is another AGC that is always active and fine-tunes the signal level. My guess is that this always-on AGC is an analog AGC built into iOS, probably tied to the hardware and without any API to switch it off, while the switchable AGC is a digital AGC.

(4) On most iOS models, in speaker (external) mode the input volume drops after the headset is plugged in again. The current workaround is to add a preGain that brings the input volume back to normal after the headset is re-plugged; see the sketch below.
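A minimal sketch of such a preGain, assuming 16-bit PCM capture data; the function name and the 2.0f factor used in the usage example are arbitrary illustrations, not tuned values from the project.

```cpp
// Hypothetical preGain workaround: boost captured samples by a fixed linear gain,
// clamping to the int16 range to avoid overflow wrap-around.
#include <algorithm>
#include <cstddef>
#include <cstdint>

void ApplyPreGain(int16_t* samples, size_t count, float pre_gain) {
  for (size_t i = 0; i < count; ++i) {
    const float boosted = static_cast<float>(samples[i]) * pre_gain;
    samples[i] = static_cast<int16_t>(std::clamp(boosted, -32768.0f, 32767.0f));
  }
}

// Usage example: ApplyPreGain(capture_buffer, capture_len, 2.0f);
```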

Audio Troubleshooting

Let me share some of the most common audio problems and their causes:

That concludes my sharing about the voice engine. Feel free to leave a comment and discuss with me.

 
