Audio and video basics: an introduction to the principles of audio coding

One: Redundant signals

Transmitting a digital audio signal directly, without compression, takes up a great deal of bandwidth. For example, for two-channel digital audio sampled at 44.1 kHz with each sample quantized to 16 bits, the bit rate is:
2 × 44.1 kHz × 16 bit = 1.411 Mbit/s

Such a high bit rate makes signal transmission and processing difficult and costly (for example, once an Alibaba Cloud server's bandwidth exceeds 5 Mbit/s, each additional Mbit/s costs 100 yuan per month), so audio compression must be applied to the audio data in order to transmit it efficiently.
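
The uncompressed bit rate is simply the product of channel count, sampling frequency, and bits per sample. A minimal Python sketch of that arithmetic follows; the function name `pcm_bitrate_bps` is chosen here for illustration and does not come from any library.

```python
def pcm_bitrate_bps(channels, sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return channels * sample_rate_hz * bits_per_sample


# Two channels, 44.1 kHz sampling, 16 bits per sample (the example in the text).
rate = pcm_bitrate_bps(2, 44_100, 16)
print(rate)                        # 1411200
print(f"{rate / 1e6:.3f} Mbit/s")  # 1.411 Mbit/s

# The same stream expressed as storage: bytes needed for one minute of audio.
bytes_per_minute = rate // 8 * 60
print(bytes_per_minute)            # 10584000
```

In other words, one minute of uncompressed two-channel audio at these settings takes roughly 10.6 MB.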

Digital audio compression coding reduces the amount of audio data as far as possible without introducing audible distortion. It does so by removing the redundant components of the sound signal. Redundant components are the parts of the audio that the human ear cannot perceive; they contribute nothing to the perceived timbre, pitch, or other characteristics of the sound.

Redundant signals include audio signals outside the range of human hearing and audio signals that are masked. For example, the human ear can detect sound in the frequency range from 20 Hz to 20 kHz, so frequencies outside this range can be regarded as redundant.

In addition, according to the physiology and psychoacoustics of human hearing, when a strong sound signal and a weak sound signal exist at the same time, the weak signal is masked by the strong one and cannot be heard, so the weak signal can be regarded as redundant and need not be transmitted. This is the masking effect of human hearing, which appears mainly as frequency-domain masking and time-domain masking.

A sound at a given frequency cannot be heard by the human ear once its energy falls below a certain threshold. When another sound with higher energy appears, the threshold at nearby frequencies rises considerably; this is the so-called masking effect. As shown in the figure below:
[Figure: hearing threshold versus frequency, with the threshold raised around a 0.2 kHz, 60 dB masking tone]
From the figure, we can see that the human ear is most sensitive to sounds between 2 kHz and 5 kHz and much less sensitive to sounds at very low or very high frequencies. When a sound signal with a frequency of 0.2 kHz and an intensity of 60 dB is present, the threshold around it rises considerably. The regions below 0.1 kHz and above 1 kHz are far enough from the strong 0.2 kHz signal that their thresholds are unaffected, whereas in the range from 0.1 kHz to 1 kHz the strong sound raises the threshold substantially, so the minimum sound intensity the human ear can perceive in this range is much higher. Any sound signal in the 0.1 kHz to 1 kHz range whose level lies below the raised threshold curve is masked by the strong 0.2 kHz signal: the ear hears only the strong 0.2 kHz signal and not the weaker ones. These weak signals, present at the same time as the strong 0.2 kHz signal, can be regarded as redundant and need not be transmitted.
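
The shape of the threshold curve and the way a strong tone raises it can be approximated numerically. The sketch below is a toy model and not the psychoacoustic model of any particular codec. It uses Terhardt's published approximation of the threshold in quiet and the Zwicker/Terhardt approximation of the Bark scale; neither comes from this article. The triangular spreading function, its slopes (27 dB/Bark below the masker, 10 dB/Bark above), and the 10 dB offset are assumptions picked only for illustration, so the exact frequency range it marks as masked will differ from the figure.

```python
import math


def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation of the absolute hearing threshold, in dB SPL."""
    k = f_hz / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)


def hz_to_bark(f_hz):
    """Zwicker/Terhardt approximation of the Bark (critical-band) scale."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def masked_threshold_db(f_hz, masker_hz=200.0, masker_db=60.0):
    """Threshold at f_hz when a single strong tone (the masker) is present.

    Toy triangular spreading in the Bark domain; the slopes and the 10 dB
    offset are illustrative assumptions, not values from a standard.
    """
    dz = hz_to_bark(f_hz) - hz_to_bark(masker_hz)
    slope = 27.0 if dz < 0 else 10.0            # steeper below the masker
    raised = masker_db - 10.0 - slope * abs(dz)
    return max(threshold_in_quiet_db(f_hz), raised)


def is_audible(f_hz, level_db, masker_hz=200.0, masker_db=60.0):
    """True if a tone at f_hz and level_db lies above the masked threshold."""
    return level_db > masked_threshold_db(f_hz, masker_hz, masker_db)


# A 30 dB tone at 300 Hz is above the threshold in quiet, but it is masked
# once the 60 dB tone at 200 Hz is present.
print(30 > threshold_in_quiet_db(300))  # True
print(is_audible(300, 30))              # False
```

Far from the masker the raised curve falls below the threshold in quiet, so `masked_threshold_db` simply returns the quiet threshold there, which mirrors the unaffected regions described above.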

Besides masking in the frequency domain, strong and weak signals are also subject to a temporal masking effect: masking also occurs when the two signals are very close together in time, even if they do not overlap. The time-domain masking curve is shown in the figure.
[Figure: time-domain masking curve showing pre-masking, simultaneous masking, and post-masking]
Temporal masking can be divided into three types: pre-masking, simultaneous masking, and post-masking. Pre-masking means that a weak signal present shortly before the ear hears a strong signal is masked and cannot be heard. Simultaneous masking means that when a strong signal and a weak signal exist at the same time, the weak signal is masked by the strong one and cannot be heard. Post-masking means that after a strong signal disappears, some time must pass, longer than the pre-masking interval, before a weak signal can be heard again. These masked weak signals can be regarded as redundant signals.
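
The three cases can be told apart purely by when the weak signal occurs relative to the strong one. The following sketch classifies a weak signal's time position; the window lengths (20 ms before the masker, 200 ms after it) are rough, order-of-magnitude assumptions, not values from this article, and real masking also depends on the levels of both signals. The function and constant names are hypothetical.

```python
PRE_MASKING_MS = 20.0    # assumed pre-masking window before the strong signal
POST_MASKING_MS = 200.0  # assumed post-masking window after the strong signal


def temporal_masking_kind(weak_time_ms, masker_start_ms, masker_end_ms):
    """Return 'pre', 'simultaneous', 'post', or None for a weak signal.

    weak_time_ms is the instant of the weak signal; the strong signal
    (the masker) lasts from masker_start_ms to masker_end_ms.
    """
    if masker_start_ms <= weak_time_ms <= masker_end_ms:
        return "simultaneous"
    if masker_start_ms - PRE_MASKING_MS <= weak_time_ms < masker_start_ms:
        return "pre"
    if masker_end_ms < weak_time_ms <= masker_end_ms + POST_MASKING_MS:
        return "post"
    return None  # too far away in time to be masked


# Strong signal from 100 ms to 300 ms.
print(temporal_masking_kind(95, 100, 300))   # pre
print(temporal_masking_kind(200, 100, 300))  # simultaneous
print(temporal_masking_kind(400, 100, 300))  # post
print(temporal_masking_kind(600, 100, 300))  # None
```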

Two: Audio coding – compression coding method

There are different coding schemes and implementation methods in the current digital audio coding field, but the basic coding ideas are similar, as shown in the figure.
[Figure: block diagram of the basic audio encoding process]
For the audio sample signals in each audio channel:
1. Map them into the frequency domain; the time-to-frequency mapping can be implemented with sub-band filters. For the block of audio samples in each channel, a masking threshold is first calculated from the psychoacoustic model;
2. The calculated masking threshold determines how many bits from the shared bit pool are allocated to each frequency band of the channel, after which the samples are quantized and encoded;
3. Add control parameters and auxiliary data to the data to generate the encoded data stream.
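
Step 2 can be illustrated with a simple greedy loop. This is a simplified sketch and not the exact procedure of any specific codec: it assumes each band's signal level and masking threshold are already known in dB, and it relies on the common rule of thumb that every extra bit of uniform quantization lowers the quantization noise by about 6.02 dB. The function name `allocate_bits` and its parameters are hypothetical.

```python
def allocate_bits(signal_db, mask_db, bit_pool, max_bits=15):
    """Greedy bit allocation across frequency bands from a shared bit pool.

    signal_db[i] and mask_db[i] are the signal level and masking threshold
    of band i. Bits go, one at a time, to the band whose quantization noise
    is currently furthest above its masking threshold.
    """
    # Signal-to-mask ratio: how far the signal sits above the threshold.
    smr = [s - m for s, m in zip(signal_db, mask_db)]
    bits = [0] * len(smr)
    while bit_pool > 0:
        # Noise-to-mask ratio after the bits granted so far (about 6.02 dB per bit).
        nmr = [smr[i] - 6.02 * bits[i] if bits[i] < max_bits else float("-inf")
               for i in range(len(smr))]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # noise is at or below the mask in every band
        bits[worst] += 1
        bit_pool -= 1
    return bits


# Three bands; the middle one lies below its masking threshold.
print(allocate_bits([60, 30, 20], [40, 35, 10], bit_pool=10))  # [4, 0, 2]
print(allocate_bits([60, 30, 20], [40, 35, 10], bit_pool=3))   # [2, 0, 1]
```

The masked band receives no bits at all, which is how the redundant signal is dropped, and when the pool is too small the bands with the most audible noise are served first.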

Origin blog.csdn.net/weixin_43907175/article/details/129127551