Speech coding technology: a summary of AMR (AMR-NB), AMR-WB, EVS

Recently I became interested in real-time speech coding, so I did some reading.
At first I had only heard of AMR-NB narrowband coding; only after searching around did I discover the other codecs. Here is a summary for future reference.

1. What is AMR?
AMR stands for Adaptive Multi-Rate, and AMR-WB for Adaptive Multi-Rate Wideband. They are mainly used for audio on mobile devices. The compression ratio is high, but the quality is relatively poor compared with other compression formats. Since they are mostly used for the human voice in calls, the results are still very good.

AMR: also known as AMR-NB (narrowband), as opposed to the WB variant below
Voice bandwidth range: 300–3400 Hz
8 kHz sampling rate

AMR-WB: AMR WideBand
Voice bandwidth range: 50–7000 Hz
16 kHz sampling rate
AMR-WB stands for "Adaptive Multi-Rate Wideband", i.e. adaptive multi-rate wideband coding, with a 16 kHz sampling frequency. It is a wideband speech coding standard adopted by both ITU-T and 3GPP, also known as the G.722.2 standard. AMR-WB provides a voice bandwidth of 50–7000 Hz, so users subjectively find the speech more natural, comfortable and intelligible than before.


Reference sources:

1. AMR-NB, AMR-WB, EVS speech coding comparison
https://www.txrjy.com/thread-1030405-1-1.html

2. Audio AMR-NB, AMR-WB
https://blog.csdn.net/weixin_45312249/article/details/120508280


3. The most advanced audio codec EVS for mobile communication and the work to be done to use it
https://www.cnblogs.com/talkaudiodev/p/9074554.html
(This is the most detailed; its full text follows.)

Voice communication began as wired-only and later became a contest between wired and wireless (mobile) communication. As the price of mobile voice calls fell, wired voice communication went into obvious decline. Today the competitor of mobile voice is OTT (Over The Top) voice, a service offered by Internet companies and generally free of charge, such as WeChat voice. Voice communication technology is thus divided into two camps, the traditional telecom camp and the Internet camp, which compete with each other and drive the technology forward. On the codec side, the Internet camp produced OPUS, an audio codec covering both speech and music (OPUS was developed jointly by the non-profit Xiph.Org Foundation, Skype, Mozilla and others; it covers the full band (8 kHz to 48 kHz), supports both speech and music (SILK for speech, CELT for music), and has been adopted by the IETF as the Internet audio codec standard, RFC 6716). Most OTT voice apps support it, and it looks set to unify the Internet camp. In response to this competition, the mobile communication standards body 3GPP proposed EVS (Enhanced Voice Services), an audio codec that likewise covers both speech and music. I have successfully added EVS to the mobile phone platform I work on and passed tests on China Mobile's live network. The rest of this article discusses this codec and the work needed to put it to use.

3GPP standardized the EVS codec in September 2014, in Release 12. It is mainly intended for VoLTE, but also applies to VoWiFi and fixed-network VoIP. The EVS codec was developed jointly by operators, terminal, infrastructure and chip vendors, and speech/audio coding experts, including Ericsson, the Fraunhofer Institute for Integrated Circuits, Huawei, Nokia, NTT, NTT DOCOMO, Orange, Panasonic, Qualcomm, Samsung, VoiceAge and ZTE. It is the best speech and audio codec in 3GPP so far: it covers the full band (8 kHz to 48 kHz), works at bit rates from 5.9 kbps to 128 kbps, provides very high quality for both speech and music signals, and has strong resilience to frame loss and delay jitter, giving users a brand-new experience.

The figure below lists the 3GPP specs related to EVS, from TS26.441 to TS26.451.

I have marked the key ones with red boxes. TS26.441 is the overview, and TS26.442 is the fixed-point implementation (reference code) written in C, which is the most important piece for the later work of putting EVS to use. TS26.444 contains the test sequences. While optimizing the reference code, I saved an optimized version almost every day and ran it against the test sequences every day; if the output differed, the optimization had introduced a problem, and I had to go back to the previous version and find which optimization step went wrong. TS26.445 is the detailed description of the EVS algorithm, nearly 700 pages, and frankly a headache to read. If you are not working on the algorithm itself you can skim the algorithm parts, but you must read the feature descriptions carefully.
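As a minimal sketch of that daily regression check (the file names are placeholders, not from the spec), comparing the optimized build's decoded output byte-for-byte against the reference output might look like this:

```c
/* Minimal sketch of a bit-exactness regression check: compare the
 * decoder output of the optimized build against the reference output.
 * File names are placeholders; any mismatching byte means the last
 * optimization step broke bit-exactness. */
#include <stdio.h>

int main(void)
{
    FILE *ref = fopen("ref_out.pcm", "rb");   /* reference decoder output */
    FILE *opt = fopen("opt_out.pcm", "rb");   /* optimized decoder output */
    if (!ref || !opt) { fprintf(stderr, "cannot open files\n"); return 1; }

    long pos = 0;
    int a, b;
    do {
        a = fgetc(ref);
        b = fgetc(opt);
        if (a != b) {
            printf("mismatch at byte %ld\n", pos);
            return 1;
        }
        pos++;
    } while (a != EOF);

    printf("outputs are bit-exact (%ld bytes)\n", pos - 1);
    return 0;
}
```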

EVS uses different coders for speech and for music. The speech coder is an improved Algebraic Code-Excited Linear Prediction (ACELP) coder, with linear-prediction modes tailored to different speech classes. Music is coded in the frequency domain (MDCT), with special attention to frequency-domain coding efficiency at low delay and low bit rate, so that switching between the speech and audio processors is seamless and reliable. The following figure is a block diagram of the EVS codec:

When encoding, the input PCM signal is first preprocessed and classified as either speech or audio. A speech signal is encoded with the speech encoder, an audio signal with the perceptual encoder; either way a bitstream is produced. When decoding, information in the bitstream indicates whether a frame is speech or audio. A speech frame is decoded with the speech decoder and the voice bandwidth is then extended; an audio frame is decoded with the perceptual decoder and its bandwidth is then extended as well. Finally, post-processing produces the EVS decoder output.
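To make the routing concrete, here is a rough C sketch of the encode-side dispatch; the classifier and encoder functions are hypothetical stubs, not the reference code's API:

```c
/* Rough sketch of the encode-side dispatch described above. All names
 * are hypothetical placeholders, not the TS26.442 reference API; the
 * stubs only illustrate the control flow. */
typedef enum { SIGNAL_SPEECH, SIGNAL_AUDIO } SignalClass;

/* stub classifier: a real one inspects spectral/temporal features */
static SignalClass classify(const short *pcm, int n) { (void)pcm; (void)n; return SIGNAL_SPEECH; }
static int encode_acelp(const short *pcm, int n, unsigned char *bs) { (void)pcm; (void)n; (void)bs; return 0; }
static int encode_mdct (const short *pcm, int n, unsigned char *bs) { (void)pcm; (void)n; (void)bs; return 0; }

/* route a 20 ms frame to the speech (ACELP) or audio (MDCT) encoder */
int evs_encode_frame(const short *pcm, int n, unsigned char *bitstream)
{
    if (classify(pcm, n) == SIGNAL_SPEECH)
        return encode_acelp(pcm, n, bitstream);  /* returns bytes written */
    return encode_mdct(pcm, n, bitstream);
}
```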

Let's talk about the key features of EVS.

1. EVS supports the full audio band (8 kHz–48 kHz) at bit rates from 5.9 kbps to 128 kbps, and each frame is 20 ms long. The following figure shows the distribution of audio bandwidths:

Narrowband (NB) covers 300 Hz–3400 Hz at an 8 kHz sampling rate, the rate used by AMR-NB. Wideband (WB) covers 50 Hz–7000 Hz at 16 kHz, the rate used by AMR-WB. Super-wideband (SWB) covers 20 Hz–14000 Hz at 32 kHz. Fullband (FB) covers 20 Hz–20000 Hz at 48 kHz. Since EVS supports the full band, it supports all four sampling rates: 8 kHz, 16 kHz, 32 kHz and 48 kHz.
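For quick reference, the band-to-sampling-rate mapping above can be captured in a small lookup table; this is only a sketch, since the real codec takes these settings through its own configuration interface:

```c
/* Audio bandwidth classes and their sampling rates, as listed above.
 * A sketch for reference only. */
typedef enum { BAND_NB, BAND_WB, BAND_SWB, BAND_FB } Bandwidth;

static const struct {
    Bandwidth band;
    int low_hz, high_hz;   /* voice/audio bandwidth */
    int sample_rate_hz;    /* corresponding sampling rate */
} band_table[] = {
    { BAND_NB,   300,  3400,  8000 },
    { BAND_WB,    50,  7000, 16000 },
    { BAND_SWB,   20, 14000, 32000 },
    { BAND_FB,    20, 20000, 48000 },
};
```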

The following figure shows the supported bit rates at various sampling rates:

As the figure shows, the full set of bit rates is supported only at WB; the other sampling rates each support a subset. Note that EVS is backward compatible with AMR-WB (via its AMR-WB interoperable mode), so it also supports all AMR-WB bit rates.

2. EVS supports DTX/VAD/CNG/SID, as AMR-WB does. In a call you typically talk about half the time and listen the rest; in the listening state there is no need to send voice packets, hence DTX (Discontinuous Transmission). A VAD (Voice Activity Detection) algorithm decides whether the input is speech or silence: speech frames are sent as voice packets, silence as SID (silence descriptor) packets. On receiving a SID packet, the other side uses a CNG (Comfort Noise Generation) algorithm to synthesize comfort noise. EVS has two CNG algorithms: linear-prediction-domain CNG and frequency-domain CNG.

EVS differs from AMR-WB in how SID packets are sent. In AMR-WB, when VAD detects silence it sends one SID packet, a second one 40 ms later, and then one every 160 ms; as soon as VAD detects speech it sends voice packets again. In EVS the SID sending scheme is configurable: either a SID packet every fixed number of frames (from 3 to 100), or adaptively based on the SNR, with a sending period of 8 to 50 frames. The SID payload size also differs: 40 bits for AMR-WB (50 × 40 = 2000 bps) versus 48 bits for EVS (50 × 48 = 2400 bps).

DTX therefore has two benefits: it saves bandwidth and increases capacity, and it skips encoding/decoding during silence, reducing computation and power consumption and extending battery life.
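Here is a minimal sketch of the fixed-interval DTX send decision described above; the names and logic are illustrative, not the reference implementation:

```c
/* Minimal sketch of a fixed-interval DTX send decision, as described
 * above. Names are illustrative, not the reference implementation. */
typedef enum { FRAME_SPEECH, FRAME_SID, FRAME_NONE } FrameType;

/* vad_active: VAD result for this 20 ms frame
 * sid_interval: configurable SID period in frames (3..100 in EVS) */
FrameType dtx_decide(int vad_active, int sid_interval)
{
    static int frames_since_sid = 0;

    if (vad_active) {                 /* speech: always send a voice packet */
        frames_since_sid = 0;
        return FRAME_SPEECH;
    }
    if (frames_since_sid++ % sid_interval == 0)
        return FRAME_SID;             /* periodic silence descriptor */
    return FRAME_NONE;                /* silence: send nothing */
}
```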

3. EVS also supports PLC (packet loss concealment), as AMR-WB does. In addition, EVS includes a Jitter Buffer Module (JBM), something not seen in earlier codecs. I did not use the JBM in my port, and with the tight schedule I had no time to study it; it deserves careful study later. The jitter buffer is one of the hard problems in voice communication and one of the bottlenecks for voice quality.

The algorithm delay of EVS depends on the sampling rate. For WB/SWB/FB the total delay is 32 ms: 20 ms of framing delay, 0.9375 ms of input resampling delay plus 8.75 ms of look-ahead on the encoder side, and 2.3125 ms of time-domain bandwidth-extension delay on the decoder side (20 + 0.9375 + 8.75 + 2.3125 = 32). For NB the total delay drops to 30.9375 ms, 1.0625 ms lower than WB/SWB/FB, with the saving mostly on the decoder side.

Compared with AMR-NB/AMR-WB, EVS improves voice quality (MOS) significantly. The figure below compares the MOS values of these codecs:

As the figure shows, at NB the MOS of EVS-NB is clearly higher than that of AMR-NB at every bit rate; at WB the MOS of EVS-WB is likewise clearly higher than that of AMR-WB; and at SWB, at bit rates above 15 kbps, the MOS of EVS-SWB comes close to that of unencoded PCM. The voice quality of EVS is clearly very good.

The work needed to make good use of EVS differs from platform to platform. I use it in the audio DSP of a mobile phone platform for voice calls; below is what I did to support EVS on the phone.

1. Study the EVS specs. You should read all the specs listed above; the algorithm-related ones can be skimmed if you are not implementing the algorithm itself, but the feature descriptions must be read in detail, since they matter for the work that follows.

2. Build the encoder/decoder applications on a PC. I did this on Ubuntu: feed a PCM file to the encoder to generate bitstream files under different configurations, then feed each bitstream file to the decoder to restore it to PCM. If the decoded PCM sounds the same as the original, the algorithm implementation can be trusted (reference implementations from standards bodies are trustworthy; if something is off, it means the application wrapper was not done properly). Building these applications prepares for the later optimization work and also helps in understanding the surrounding plumbing, for example how the encoded values are turned into a bitstream. The encoded values are stored in indices (up to 1953 of them per frame); each index has two member variables, nb_bits, the number of bits in the index, and value, the value of the index, as sketched below.
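A sketch of that per-index record, with the two member variables named as above (the exact types in the TS26.442 reference code may differ):

```c
/* Sketch of the per-index record described above: each encoded index
 * carries its bit width and its value. Exact types in the TS26.442
 * reference code may differ. */
#define MAX_NUM_INDICES 1953   /* up to 1953 indices per frame */

typedef struct {
    short nb_bits;             /* number of bits in this index */
    unsigned short value;      /* the index value itself */
} Indice;

typedef struct {
    Indice ind[MAX_NUM_INDICES];
    int    num_ind;            /* indices actually used this frame */
} FrameIndices;
```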

Indices can be stored in two formats: G192 (ITU-T G.192) and MIME (Multipurpose Internet Mail Extensions). Look at G192 first; its per-frame storage format is shown in the figure below. The first word is the synchronization value, either good frame (0x6B21) or bad frame (0x6B20); the second word is the length in bits; then each bit follows as one word, a 1 bit stored as 0x0081 and a 0 bit as 0x007F. The bits of the indices are written out this way in binary. The figure below gives an example: the sampling rate is 16000 Hz and the bit rate 8000 bps, so one frame has 160 bits (160 = 8000 / 50) and takes 160 words in G192 format. In the figure the header is 0x6B21 (good frame), the length is 0x00A0 (160 decimal), and the following 160 words are the content.
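A minimal sketch of writing one frame in that G192 layout (host endianness and error handling are glossed over):

```c
/* Minimal sketch of writing one frame in G.192 format as described
 * above: sync word, bit count, then one 16-bit word per bit. */
#include <stdio.h>

#define G192_GOOD_FRAME 0x6B21
#define G192_BAD_FRAME  0x6B20
#define G192_BIT_1      0x0081
#define G192_BIT_0      0x007F

static void put_word(FILE *f, unsigned short w)
{
    fwrite(&w, sizeof w, 1, f);    /* assumes a little-endian host */
}

/* bits[i] holds one bit (0 or 1); nbits = 160 for 8000 bps at 50 fps */
void g192_write_frame(FILE *f, const unsigned char *bits, int nbits)
{
    put_word(f, G192_GOOD_FRAME);
    put_word(f, (unsigned short)nbits);
    for (int i = 0; i < nbits; i++)
        put_word(f, bits[i] ? G192_BIT_1 : G192_BIT_0);
}
```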

Now look at the MIME format. Here the values of the indices are packed into a contiguous bitstream; see the pack_bit() function in the reference code for how the packing is done. In MIME format the first byte of each frame is a header (the lower 4 bits hold the bit-rate index; bits 5 and 6 must be set to 1 for AMR-WB IO mode and are not used for EVS), followed by the packed bitstream. Taking the same example, a sampling rate of 16000 Hz and a bit rate of 8000 bps, a frame has 160 bits and needs 20 bytes (20 = 160 / 8), as shown in the figure below:

In the figure, the first 16 bytes are the MIME file header, the 17th byte is the frame header, where 0x02 means EVS coding at 8 kbps (the 8 kbps bit-rate index is 2), and the 20 bytes that follow are the packed payload.
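For illustration, here is an MSB-first bit packer in the spirit of the reference code's pack_bit(); this is a rewrite for clarity, not the actual TS26.442 function:

```c
/* Illustrative MSB-first bit packer in the spirit of the reference
 * code's pack_bit(); not the actual TS26.442 function. Packs each
 * index's nb_bits lowest bits of value into a contiguous byte stream. */
#include <string.h>

typedef struct { short nb_bits; unsigned short value; } Indice;

/* returns the number of bits written into out[] */
int pack_indices(const Indice *ind, int num_ind,
                 unsigned char *out, int out_bytes)
{
    int bitpos = 0;
    memset(out, 0, (size_t)out_bytes);       /* e.g. 20 bytes for 160 bits */
    for (int i = 0; i < num_ind; i++) {
        for (int b = ind[i].nb_bits - 1; b >= 0; b--) {   /* MSB first */
            if ((ind[i].value >> b) & 1)
                out[bitpos >> 3] |= (unsigned char)(0x80 >> (bitpos & 7));
            bitpos++;
        }
    }
    return bitpos;
}
```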

In voice communication, the values of the indices are packed into a contiguous bitstream and sent to the peer as the payload; the receiver first unpacks it and then decodes it to recover the PCM samples.

3. The original reference code is usually not directly usable and needs to be optimized. For how to optimize, see an earlier article of mine (on audio codecs and methods and experience of optimizing them), which describes the general approach. I needed to run it on a DSP with a fairly low clock, only a bit over 300 MHz, which cannot be done without assembly optimization. I had never written assembly for this DSP, and optimizing it well in a short time would have been very difficult, so after weighing the options my boss decided to use the optimized library provided by the DSP IP vendor, who are more professional at the assembly level.

4. Adapt the reference-code applications so they can later serve as tools for debugging and verification. The original reference code saves files in units of bytes, while the DSP works in words (two bytes), so the pack/unpack functions in the reference code must be modified to suit the DSP.
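A sketch of that byte-to-word repacking; the byte order within each word is an assumption here and must match the actual platform:

```c
/* Sketch of repacking a byte-oriented stream into 16-bit words for a
 * word-addressed DSP, as described above. Big-endian pairing within
 * the word is an assumption, not taken from the reference code. */
#include <stddef.h>

void bytes_to_words(const unsigned char *in, size_t nbytes, unsigned short *out)
{
    for (size_t i = 0; i + 1 < nbytes; i += 2)
        out[i / 2] = (unsigned short)((in[i] << 8) | in[i + 1]);
    if (nbytes & 1)                        /* pad a trailing odd byte */
        out[nbytes / 2] = (unsigned short)(in[nbytes - 1] << 8);
}
```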

5. When the call codec is EVS, corresponding code must be added to the audio DSP and the CP: first write and self-test your own side, then do joint debugging. For self-testing I used the AMR-WB call flow as a shell (an EVS frame and an AMR-WB frame are both 20 ms long); that is, the processing path stays AMR-WB's, but the codec is swapped from AMR-WB to EVS. The goal is to verify that the encoder, pack, unpack and decoder are all OK; encoder and pack form the uplink, unpack and decoder the downlink. Their sequence relationship is as follows:

First tune the uplink. Save the encoded data in G192 format, decode it to PCM with the decoder tool and listen to it with CoolEdit; if it plays back what was spoken, the encoder is OK. Next tune pack: save the packed bitstream in MIME format, decode it to PCM with the decoder tool and listen with CoolEdit; if it matches what was spoken, pack is OK. Then tune the downlink. Since the CP could not yet deliver a correct EVS bitstream to the audio DSP, I debugged with a loopback: the packed bitstream is fed into unpack, the unpacked bitstream is saved in G192 format, decoded to PCM with the decoder tool and checked with CoolEdit; if it matches what was spoken, unpack is OK. Finally tune the decoder and listen to its PCM output with CoolEdit; if it matches what was spoken, the decoder is OK. That completes the self-test.
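The chain just described, as a sketch; all four stage functions are hypothetical stubs standing in for the DSP-side modules, and the point is the order encoder → pack → unpack → decoder:

```c
/* Sketch of the loopback self-test chain described above. All four
 * stage functions are hypothetical stubs for the DSP modules. */
#define MAX_BYTES 64   /* e.g. 20 bytes of payload at 8 kbps */

static int evs_encode(const short *pcm, unsigned char *bits)        { (void)pcm; (void)bits; return 0; }
static int evs_pack  (const unsigned char *bits, unsigned char *pl) { (void)bits; (void)pl;  return 0; }
static int evs_unpack(const unsigned char *pl, unsigned char *bits) { (void)pl;  (void)bits; return 0; }
static int evs_decode(const unsigned char *bits, short *pcm)        { (void)bits; (void)pcm; return 0; }

/* loop one 20 ms frame through the whole chain */
static void loopback_one_frame(const short *pcm_in, short *pcm_out)
{
    unsigned char bits[MAX_BYTES], payload[MAX_BYTES];
    evs_encode(pcm_in, bits);    /* uplink: verify via G192 dump    */
    evs_pack(bits, payload);     /* uplink: verify via MIME dump    */
    evs_unpack(payload, bits);   /* downlink: feed pack output back */
    evs_decode(bits, pcm_out);   /* downlink: listen with CoolEdit  */
}
```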

6. Joint debugging with the CP. Since all the key modules had been verified during the self-test, joint debugging went smoothly and was finished within a few days. With that done, calls can enjoy the high sound quality that EVS brings.


Source: https://blog.csdn.net/u013209189/article/details/129734025