Talking about several voice communication solutions I have developed

I have been doing audio software development for more than ten years. I have worked on both voice and music, but most of my work has been voice-related. So far I have developed five voice solutions on communication terminals. Some are for wired communication and some for wireless; some are implemented at the upper layer and some at the bottom layer; some run on ARM and some on DSP. In short, each has its own characteristics. Because they are all voice communication solutions, however, they share the same core elements: voice capture and playback, encoding and decoding, pre- and post-processing, and transmission. Today I will pick three representative solutions and describe how they are implemented.

 

1. A wired communication voice solution developed on embedded Linux

This solution was developed on embedded Linux. The audio path is based on ALSA, and the voice communication is done in user space, so it is an upper-layer solution. Because it is wired communication, the network environment is not as bad as in wireless communication, and few packet-loss compensation measures are needed, mainly PLC, RFC 2198 (redundant audio data) and the like. I described in detail how this solution works in an earlier article (How to Develop a Voice Communication Solution on Embedded Linux); take a look if you are interested.
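For concreteness, here is a minimal sketch of what the user-space capture side of such an ALSA-based solution can look like: open the capture device, read 20 ms frames, and hand them to the rest of the pipeline. The device name, sampling rate, and the process_frame()/g_running symbols are illustrative assumptions, not details from the actual project.

```c
/* Minimal sketch of the user-space capture side of an ALSA-based voice
 * solution: open the capture device, read 20 ms frames, and pass them on.
 * Device name, rate and the external symbols are illustrative only. */

#include <alsa/asoundlib.h>
#include <stdint.h>

#define SAMPLE_RATE   16000
#define FRAME_SAMPLES (SAMPLE_RATE / 50)   /* 20 ms = 320 samples */

extern volatile int g_running;                           /* cleared elsewhere to end the call */
extern void process_frame(const int16_t *pcm, int n);    /* pre-process + encode + send       */

int capture_loop(void)
{
    snd_pcm_t *pcm;
    int16_t    buf[FRAME_SAMPLES];

    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_CAPTURE, 0) < 0)
        return -1;

    /* 16 kHz, mono, 16-bit, interleaved, ~100 ms of internal buffering. */
    if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           1, SAMPLE_RATE, 1, 100000) < 0) {
        snd_pcm_close(pcm);
        return -1;
    }

    while (g_running) {
        snd_pcm_sframes_t n = snd_pcm_readi(pcm, buf, FRAME_SAMPLES);
        if (n < 0)
            n = snd_pcm_recover(pcm, n, 0);     /* handle xruns          */
        if (n == FRAME_SAMPLES)
            process_frame(buf, FRAME_SAMPLES);  /* hand off one frame    */
    }

    snd_pcm_close(pcm);
    return 0;
}
```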

 

2. Traditional wireless communication voice solution developed on Android phones

This solution was developed on an Android phone and is the traditional voice-call solution on a mobile phone (as opposed to APP voice communication). Android is based on Linux, so ALSA is also used, but mainly for control, such as configuring the codec chip. The parts that handle the audio data (driver, codec, and pre- and post-processing) are developed on the Audio DSP, and the network-side parts are developed on the CP (communication processor), so it is a bottom-layer solution. The software block diagram is as follows:

As can be seen from the figure above, the AP plays a controlling role in this solution: it selects the audio path on the codec chip (done by configuring registers), and it starts/stops the audio stream on the Audio DSP. The real audio data processing is implemented on the Audio DSP and the CP. (The Audio DSP part and the CP part were done by two different departments; I did the audio development on the Audio DSP and only have a general understanding of the CP side, so I will not describe the CP implementation in detail.)

Voice communication is divided into uplink and downlink. Look at the uplink first. The voice data captured by the codec chip is sent to the Audio DSP over I2S. On the Audio DSP, a DMA IN interrupt fires every 5 ms; it fetches the captured voice data and resamples it to 8 kHz/16 kHz etc. (the codec chip in this scheme samples at 48 kHz while the speech codec works at 8 kHz/16 kHz, so resampling is needed), then saves the result in a buffer. After four interrupts, 20 ms of voice data has accumulated (20 ms is used because one frame of AMR/EVS is 20 ms). The 20 ms of data is pre-processed (AEC/ANS/AGC, etc.) and then encoded to obtain a bitstream, which is sent to the CP through IPC. The CP does the network-side processing and finally sends the data to the other party over the air interface.

Now the downlink. After receiving voice data from the air interface, the CP does the network-side processing (jitter buffer, etc.) and sends the bitstream to the Audio DSP. The Audio DSP first decodes it into PCM data and does post-processing (ANS/AGC, etc.), then resamples it to 48 kHz; it also has to mix in other audio (mainly system sounds, which must be played out at the same time). The processed data is put into a buffer. In the downlink there is also a DMA OUT interrupt, again every 5 ms. One 20 ms frame is sent to the codec chip in four pieces, i.e. 5 ms of data is taken from the buffer each time; after four times the buffer is emptied and the next frame is fetched for playback. The data sent to the codec chip is then played out through the peripherals.
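As an illustration of the uplink accumulation logic just described (four 5 ms DMA IN chunks forming one 20 ms frame that is pre-processed, encoded, and handed to the CP over IPC), here is a minimal C sketch. All function names (resample_to_16k, preprocess_20ms, amr_encode_frame, ipc_send_to_cp) and buffer sizes are hypothetical placeholders, not the real DSP code.

```c
/* Minimal sketch of the uplink path: a 5 ms DMA IN interrupt accumulates
 * four chunks into one 20 ms frame, which is then pre-processed, encoded
 * and handed to the CP over IPC. All external functions are placeholders. */

#include <stdint.h>

#define CHUNK_5MS_SAMPLES   80          /* 5 ms at 16 kHz after resampling   */
#define FRAME_20MS_SAMPLES  (4 * CHUNK_5MS_SAMPLES)

extern void resample_to_16k(const int16_t *in_48k, int n_in, int16_t *out_16k);
extern void preprocess_20ms(int16_t *frame);                   /* AEC/ANS/AGC */
extern int  amr_encode_frame(const int16_t *frame, uint8_t *out);
extern void ipc_send_to_cp(const uint8_t *buf, int len);

static int16_t frame_buf[FRAME_20MS_SAMPLES];
static int     chunk_count = 0;

/* Called by the DMA IN interrupt every 5 ms with 48 kHz PCM from the codec chip. */
void dma_in_isr(const int16_t *pcm_48k, int num_samples_48k)
{
    /* 48 kHz -> 16 kHz, append the 5 ms chunk to the frame buffer. */
    resample_to_16k(pcm_48k, num_samples_48k,
                    &frame_buf[chunk_count * CHUNK_5MS_SAMPLES]);
    chunk_count++;

    if (chunk_count == 4) {             /* a full 20 ms frame is ready       */
        uint8_t bitstream[64];
        int     len;

        preprocess_20ms(frame_buf);     /* AEC / ANS / AGC                   */
        len = amr_encode_frame(frame_buf, bitstream);
        ipc_send_to_cp(bitstream, len); /* CP sends it over the air interface */

        chunk_count = 0;
    }
}
```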

 

Because the development was on a DSP, hardware resources (DSP frequency and memory) were the bottleneck, and a lot of time went into load and memory optimization. The DSP clock is only a bit over 300 MHz, while the uplink/downlink processing, encoding/decoding and resampling are all fairly load-heavy; without optimization it could not run smoothly at all. After C-level optimization some scenarios still could not run smoothly, and in the end assembly-level optimization, the ultimate weapon, was used in many places so that it ran smoothly in all scenarios.

Memory is divided into internal memory, namely DTCM (Data Tightly Coupled Memory, for data) and PTCM (Program Tightly Coupled Memory, for code), and external memory (DDR). To run fast, data and code are best placed in internal memory, but internal memory is very small: both DTCM and PTCM offer only tens of kilowords (the basic unit on this DSP is the word, and one word is two bytes). So memory cannot be used freely; you have to save memory at all times while writing code and keep optimizing it. A typical situation: a new feature needs to be developed, memory is not enough, so memory is optimized first and then the feature is developed, squeezing memory out bit by bit. What if, in the end, no more memory can be squeezed out? Use the overlay mechanism, which is, bluntly, memory reuse across scenarios that cannot occur at the same time. For example, it is impossible to play music and make a call simultaneously, so part of the memory used by the two can be shared. Another example: during a call only one codec is working, although the system supports several, so part of the memory used by these codecs can be shared.
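To make the overlay idea concrete, here is a minimal sketch of how mutually exclusive scenarios can share one statically allocated scratch region via a union. The struct layout and sizes are purely illustrative, not taken from the real project.

```c
/* Minimal sketch of the overlay idea: scratch memory for scenarios that
 * can never be active at the same time (music playback vs. a voice call,
 * or the different speech codecs) is placed in one shared region instead
 * of separate static buffers. Sizes and names are illustrative. */

#include <stdint.h>

typedef union {
    /* Music playback and a voice call never run simultaneously,
     * so their working buffers can overlay each other. */
    struct {
        int16_t pcm_work[2048];
        int32_t eq_state[256];
    } music;

    /* Only one speech codec is active during a call, so the codec
     * scratch areas can also overlay each other. */
    union {
        uint8_t amr_scratch[6144];
        uint8_t evs_scratch[8192];
    } codec;
} overlay_region_t;

/* One statically allocated region, reused per scenario. Placement in
 * DTCM would normally be done via a dedicated linker section. */
static overlay_region_t g_overlay;
```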

 

3. APP voice solution developed on an Android phone

This solution is also developed on Android phones, but it is APP voice communication, similar to WeChat voice. It is done at the native layer, calling the APIs that Android provides through JNI, so it is an upper-layer solution. The software block diagram is as follows:

This solution is implemented on the AP (application processor). Voice capture and playback do not call the system APIs (AudioTrack/AudioRecord) directly; instead the OpenSL ES library is used, and OpenSL ES in turn calls the system APIs. We register two callback functions with OpenSL ES, one for capturing voice and one for playing voice. These two callbacks are invoked every 20 ms, to fetch the captured voice and to hand over the received voice for playback, respectively. The voice data obtained from and sent to the lower layer is all PCM, configured as mono at a 16 kHz sampling rate.

In the uplink direction, the PCM data captured by the codec chip is sent to the Audio DSP over I2S. After the Audio DSP processes it, the PCM data is sent up to the AP and finally reaches the upper layer through the registered capture callback. In the upper layer, pre-processing (AEC/AGC/ANS, etc.) is done first, using the webRTC implementation (nowadays pre- and post-processing in APP voice basically all uses webRTC). Whether resampling is needed depends on the speech codec: if the codec runs at an 8 kHz sampling rate, resampling from 16 kHz to 8 kHz is needed; if it runs at 16 kHz, no resampling is needed. The data is then encoded into a bitstream, packed into RTP packets, and sent to the other side over a UDP socket.

In the downlink direction, the voice RTP packets are first received on the UDP socket; the header is stripped and the payload is put into the jitter buffer. Every 20 ms one frame is taken from the jitter buffer and decoded into PCM. PLC and resampling may be needed, and then post-processing (ANS/AGC) is done. Finally the PCM data is sent down, layer by layer, through the playback callback; after the Audio DSP processes it, the PCM data is sent to the codec chip and played out.

Voice communication in an APP is OTT (Over The Top) voice; unlike traditional voice calls it has no QoS guarantee. To ensure voice quality, more compensation measures have to be taken (the wireless network environment keeps changing and is often fairly bad, causing packet loss, reordering, etc.). Common methods are FEC (forward error correction), retransmission, PLC (packet loss concealment) and so on. The compensation measures are the hard part of this solution: they improve voice quality, but they increase latency and traffic.
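To illustrate the uplink packetization step (encoded frame → RTP → UDP), here is a minimal C sketch that wraps one 20 ms encoded frame in a 12-byte RTP header as defined in RFC 3550 and sends it over a UDP socket. The payload type, SSRC and buffer sizes are illustrative placeholders, not the values used in the actual solution.

```c
/* Minimal sketch of uplink packetization: one 20 ms encoded frame is
 * wrapped in a 12-byte RTP header (RFC 3550) and sent over UDP.
 * Payload type, SSRC and sizes are illustrative placeholders. */

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define RTP_HDR_LEN   12
#define MAX_PAYLOAD   1500
#define PT_DYNAMIC    96            /* dynamic payload type for the codec */

static uint16_t g_seq;              /* RTP sequence number                */
static uint32_t g_ts;               /* RTP timestamp, in samples          */

/* Pack one encoded frame and send it to the peer over the UDP socket. */
int send_rtp_frame(int sock, const struct sockaddr_in *peer,
                   const uint8_t *payload, int payload_len,
                   int samples_per_frame /* e.g. 320 for 20 ms @ 16 kHz */)
{
    uint8_t pkt[RTP_HDR_LEN + MAX_PAYLOAD];

    if (payload_len > MAX_PAYLOAD)
        return -1;

    pkt[0] = 0x80;                              /* V=2, P=0, X=0, CC=0    */
    pkt[1] = PT_DYNAMIC & 0x7F;                 /* M=0, payload type      */
    pkt[2] = g_seq >> 8;          pkt[3] = g_seq & 0xFF;
    pkt[4] = g_ts >> 24;          pkt[5] = (g_ts >> 16) & 0xFF;
    pkt[6] = (g_ts >> 8) & 0xFF;  pkt[7] = g_ts & 0xFF;
    /* SSRC: a fixed illustrative value; normally chosen randomly. */
    pkt[8] = 0x12; pkt[9] = 0x34; pkt[10] = 0x56; pkt[11] = 0x78;

    memcpy(pkt + RTP_HDR_LEN, payload, payload_len);

    g_seq++;
    g_ts += samples_per_frame;                  /* advance by 20 ms worth */

    return sendto(sock, pkt, RTP_HDR_LEN + payload_len, 0,
                  (const struct sockaddr *)peer, sizeof(*peer));
}
```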

 

What this solution implements is the part above the gray dashed line in the figure above. The reason I also drew the part below the gray dashed line is to show the whole voice data flow. The lower part is a black box to APP developers; all they can see is the API provided by the system. I did APP voice communication first and traditional phone voice communication later. When doing APP voice I did not know how the lower layers were implemented; I really wanted to know, but there was no material to learn from. I think many people doing APP voice have the same confusion: what does the underlying implementation look like? Later, when I implemented traditional voice calls on the phone, I understood how the bottom layer works, and the whole data flow of APP voice suddenly became clear to me. This article draws the block diagram of the underlying implementation precisely so that people doing APP voice communication can also understand it.

The underlying implementation of APP voice differs between phone platforms (mainly the DSP part; the implementation on the Android Audio Framework is basically the same). For example, some have pre- and post-processing in the lower layers (usually mid- to high-end phones) and some do not (usually low-end phones). Mid- to high-end phones have no echo even without echo cancellation in the upper layer, because it is done in the lower layers. The same echo cancellation algorithm works better the closer it is to the hardware, mainly because the closer to the hardware, the more accurate and stable the computed delay between the near end and the far end is, and the more accurate the delay, the better the echo cancellation performs. Calculating the delay in the upper layer brings in software-induced delay, and the delay introduced by software can vary, which makes the delay calculated in the upper layer less accurate than in the lower layer. When echo cancellation is done in the upper layer of an APP, the delay differs from model to model (different hardware and different underlying software implementations), and the difference can amount to a great many milliseconds. As for how to measure it, if you are interested you can read an article I wrote earlier (Echo cancellation and debugging experience in audio processing).

When echo cancellation is done on the underlying DSP, which is the closest to the hardware, I measured the delay between DMA OUT and DMA IN to be only a little over five milliseconds, and the delay was very stable across repeated measurements, with an error of no more than a few sample points, which guarantees high echo cancellation performance. At the time, the speaker and the MIC were connected in hardware to form a loopback, so the DMA OUT data went entirely into DMA IN; a special waveform (such as a sine wave) was played, the audio data at DMA OUT and DMA IN was dumped at the same time and inspected with CoolEdit, and in this way the exact latency value was obtained. Low-end phones must do echo cancellation in the upper layer, because the lower layers do not do it.
The APP voice communication solution has to accommodate all models, so echo cancellation is a must in an APP voice solution, and of course the other pre- and post-processing modules, such as ANS and AGC, are needed as well.
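The latency measurement above was done by dumping the DMA OUT and DMA IN data and inspecting it in CoolEdit. As an illustrative alternative (not what the project actually did), the same delay can also be estimated programmatically by cross-correlating the two dumps; a brute-force sketch:

```c
/* Illustrative alternative to the CoolEdit inspection described above:
 * estimate the loopback delay in samples by cross-correlating the dumped
 * DMA OUT and DMA IN buffers. Brute-force search; fine for short dumps. */

#include <stdint.h>

/* Return the lag (in samples) at which dma_in best matches dma_out. */
int estimate_delay(const int16_t *dma_out, const int16_t *dma_in,
                   int n, int max_lag)
{
    int     best_lag  = 0;
    int64_t best_corr = 0;

    for (int lag = 0; lag <= max_lag; lag++) {
        int64_t corr = 0;
        for (int i = 0; i + lag < n; i++)
            corr += (int64_t)dma_out[i] * dma_in[i + lag];
        if (corr > best_corr) {
            best_corr = corr;
            best_lag  = lag;
        }
    }
    return best_lag;   /* divide by the sample rate to get seconds */
}
```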

 

The above are the three typical voice communication solutions I have done, covering wired communication, wireless (cellular) communication and APP voice communication. In my view they cover the main voice solutions on today's communication terminals. Because they are developed on different platforms and at different levels, they differ greatly, but the core modules of voice communication are the same.

 
