AAC Audio Fundamentals and Bitstream Analysis

Table of Contents

  1. Introduction to AAC
  2. Introduction to AAC Specifications
  3. AAC features
  4. AAC audio file format
  5. AAC element information
  6. AAC file processing flow
  7. AAC decoding process
  8. Technical analysis

1. Introduction to AAC

  1. AAC stands for Advanced Audio Coding. It appeared in 1997, originally as an audio coding technology based on MPEG-2, jointly developed by Fraunhofer IIS, Dolby Laboratories, AT&T, Sony, and other companies to replace the MP3 format. After the MPEG-4 standard appeared in 2000, AAC was extended with SBR and PS technology; to distinguish it from the traditional MPEG-2 AAC, this version is also called MPEG-4 AAC.

  2. AAC is a modern lossy audio compression technology. With additional coding tools (such as SBR and PS) it is deployed in three main variants: LC-AAC, HE-AAC, and HE-AACv2. LC-AAC is the traditional AAC, mainly used at medium and high bit rates (>= 80 kbps); HE-AAC (equivalent to AAC + SBR) is mainly used at medium and low bit rates (<= 80 kbps); and the newer HE-AACv2 (equivalent to AAC + SBR + PS) is mainly used at low bit rates (<= 48 kbps). In practice, most encoders enable PS automatically at <= 48 kbps; above 48 kbps PS is disabled and the result is ordinary HE-AAC.

  3. The position of the audio stream in a video player's processing pipeline is shown below.
    [Figure: position of the audio decoder in a video player's pipeline]


2. Introduction to AAC Specifications

  1. AAC has a total of 9 specifications to meet the needs of different use cases. At present the most widely used are LC and HE (suitable for low bit rates).

    1. MPEG-4 AAC LC (Low Complexity): the profile commonly found in the MP4 files on mobile phones today.
    2. MPEG-4 AAC HE (High Efficiency): suitable for low-bit-rate encoding; supported by the Nero AAC encoder.
  2. The popular Nero AAC encoder supports only the LC, HE, and HEv2 profiles, and the core of the encoded audio is always LC: HE is actually AAC (LC) + SBR, and HEv2 is AAC (LC) + SBR + PS.

  3. HE: "High Efficiency". HE-AAC v1 (also known as AACPlus v1) combines AAC (LC) with SBR technology in a container fashion. SBR stands for Spectral Band Replication. Briefly: the main spectral energy of music is concentrated in the low band, while the high band has small amplitude but is very important, since it determines the sound quality. If the entire band is encoded uniformly, protecting the high frequencies makes the low-frequency encoding overly detailed and the file huge; keeping only the main low-frequency components and dropping the high frequencies sacrifices quality. SBR splits the spectrum: the low band is encoded on its own to preserve the main components, and the high band is encoded separately to preserve the quality, reducing file size while maintaining sound quality and thus resolving this contradiction.

  4. HEv2: combines HE-AAC v1 with PS technology in a container fashion. PS stands for Parametric Stereo. A stereo file is twice the size of a mono file, yet there is considerable similarity between the sound of the two channels. According to Shannon's entropy coding theorem, this correlation should be removed to reduce the file size. PS technology therefore stores the full information of one channel and then spends only a few bytes of parameters to describe how the other channel differs from it.


3. AAC features

  1. AAC is an audio compression algorithm with a high compression ratio, far exceeding that of older audio compression algorithms such as AC-3 and MP3, while its quality is comparable to uncompressed CD audio.

  2. Like other similar audio coding algorithms, AAC uses transform coding, but it uses a higher-resolution filter bank, so it can achieve a higher compression ratio.

  3. AAC uses up-to-date techniques such as temporal noise shaping, backward-adaptive linear prediction, joint stereo coding, and quantized Huffman coding, which further improve the compression ratio.

  4. AAC supports a wider variety of sample rates and bit rates, 1 to 48 audio channels, up to 15 low-frequency (LFE) channels, multi-language programs, and up to 15 embedded data streams.

  5. AAC supports a wider range of sampling frequencies, from 8 kHz up to 96 kHz, much wider than MP3's 16 kHz to 48 kHz range.

  6. Unlike MP3 and WMA, AAC loses almost none of the very high and very low frequency components of the audio, and its spectral structure is closer to the original than WMA's, so its fidelity is better. Professional listening evaluations show that AAC sounds clearer than WMA and closer to the original.

  7. AAC uses optimized algorithms to achieve higher decoding efficiency and requires less processing power during decoding.


4. AAC audio file format

1. AAC has two audio file formats, ADIF and ADTS:

  1. ADIF: Audio Data Interchange Format. The feature of this format is that the beginning of the audio data can be located deterministically; decoding cannot begin in the middle of the data stream, but must start at its clearly defined beginning. This format is therefore commonly used for disk files.
  2. The ADIF format of AAC is shown below:
    [Figure: ADIF format structure]
  3. ADTS: Audio Data Transport Stream. The characteristic of this format is that it is a bitstream with sync words, so decoding can start at any position within the stream. In this respect it resembles the MP3 data stream format.
  4. Simply put, ADTS can be decoded from any frame, because every frame carries header information; ADIF has only one unified header, so decoding can begin only after all the data has been obtained. The formats of the two headers also differ. Currently, audio streams are generally encoded and extracted in ADTS format.
  5. Sometimes when you capture a raw AAC stream, the resulting AAC file cannot be played back; very likely every frame of the file is missing its ADTS header and the frames need to be wrapped accordingly.
  6. The fix is simply to add the ADTS header: an AAC raw data block is variable in size, and prepending an ADTS header to a raw frame and encapsulating it yields an ADTS frame (a minimal sketch follows this list).
  7. The general format of AAC's ADTS is shown in the figure below:
    [Figure: general structure of an ADTS stream]
  8. The figure shows the simplified structure of one ADTS frame; the purple rectangles on both sides represent the data before and after this frame.
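
As noted in point 6, the fix is to prepend a header to every raw frame. Below is a minimal sketch in C of building the 7-byte ADTS header (protection_absent = 1, i.e. no CRC); the function name is illustrative, and the profile, sampling-frequency index, and channel values shown are assumptions that must match the encoder's real parameters (the field layout is detailed in the next section):

#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: write a 7-byte ADTS header (no CRC) for one raw
 * AAC frame of raw_len bytes. */
void write_adts_header(uint8_t hdr[7], size_t raw_len,
                       int profile,   /* e.g. 1 = AAC LC (profile = AOT - 1) */
                       int sf_index,  /* e.g. 3 = 48000 Hz */
                       int channels)  /* e.g. 2 = stereo */
{
    size_t frame_len = raw_len + 7;   /* the header counts toward frame_length */
    hdr[0] = 0xFF;                    /* syncword, high 8 bits */
    hdr[1] = 0xF1;                    /* syncword low 4 bits, ID = 0 (MPEG-4),
                                         layer = 00, protection_absent = 1 */
    hdr[2] = (uint8_t)((profile << 6) | (sf_index << 2) | ((channels >> 2) & 1));
    hdr[3] = (uint8_t)(((channels & 3) << 6) | ((frame_len >> 11) & 3));
    hdr[4] = (uint8_t)((frame_len >> 3) & 0xFF);
    hdr[5] = (uint8_t)(((frame_len & 7) << 5) | 0x1F); /* + buffer fullness (0x7FF = VBR) */
    hdr[6] = 0xFC;                    /* fullness cont.; 0 extra raw data blocks */
}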

2. ADIF and ADTS header

1. ADIF header information:

[Figure: ADIF header layout]

  1. The ADIF header information is located at the beginning of the AAC file, followed by continuous raw data blocks.
  2. The fields that make up the ADIF header information are as follows:
    [Table: ADIF header fields]
2. Each frame of an ADTS AAC file is composed of an ADTS header and AAC audio data. The structure is as follows:

[Figure: ADTS frame structure: ADTS header + AAC data]

  1. The ADTS header of each frame contains the audio sampling rate, channel count, frame length, and other information, so that the decoder can parse and read the frame.
  2. In general, the ADTS header information is 7 bytes, divided into 2 parts:
    1. adts_fixed_header();
    2. adts_variable_header();
  3. The first part is the fixed header information, followed by the variable header information. The data in the fixed header is the same for every frame, while the variable header changes from frame to frame.
1. ADTS fixed header information:

[Figure: adts_fixed_header() syntax]

  1. syncword: always 0xFFF; all bits must be 1. It marks the beginning of an ADTS frame.
  2. ID: MPEG identifier; 0 for MPEG-4, 1 for MPEG-2.
  3. layer: always '00'.
  4. protection_absent: indicates whether a CRC check is present; set to 1 if there is no CRC, 0 if there is one.
  5. profile: indicates which AAC profile is used, e.g. 01 = Low Complexity (LC). Some decoder cores only support AAC LC.
  6. MPEG-2 AAC defines three profiles: 00 = Main, 01 = LC (Low Complexity), 10 = SSR (Scalable Sampling Rate).
  7. The value of profile equals the MPEG-4 Audio Object Type minus 1: profile = MPEG-4 Audio Object Type - 1.
    (For example, Audio Object Type 1 = AAC Main, 2 = AAC LC, 3 = AAC SSR, 4 = AAC LTP.)
  8. sampling_frequency_index: the index of the sampling rate in use; the actual rate is looked up in the Sampling Frequencies[] table:
    0: 96000 Hz, 1: 88200 Hz, 2: 64000 Hz, 3: 48000 Hz, 4: 44100 Hz, 5: 32000 Hz, 6: 24000 Hz, 7: 22050 Hz,
    8: 16000 Hz, 9: 12000 Hz, 10: 11025 Hz, 11: 8000 Hz, 12: 7350 Hz, 13-14: reserved, 15: escape value
  9. channel_configuration: indicates the number of channels, e.g. 2 for stereo:
0: Defined in AOT Specific Config
1: 1 channel: front-center
2: 2 channels: front-left, front-right
3: 3 channels: front-center, front-left, front-right
4: 4 channels: front-center, front-left, front-right, back-center
5: 5 channels: front-center, front-left, front-right, back-left, back-right
6: 6 channels: front-center, front-left, front-right, back-left, back-right, LFE-channel
7: 8 channels: front-center, front-left, front-right, side-left, side-right, back-left, back-right, LFE-channel
8-15: Reserved
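
Putting the fixed-header fields together, the following is a minimal parsing sketch in C; the struct and function names are illustrative, not a standard API:

#include <stdint.h>

/* Illustrative struct holding the ADTS fixed-header fields described above. */
typedef struct {
    int id;                 /* 0 = MPEG-4, 1 = MPEG-2 */
    int protection_absent;  /* 1 = no CRC, 0 = CRC present */
    int profile;            /* 0 = Main, 1 = LC, 2 = SSR */
    int sf_index;           /* sampling_frequency_index */
    int channel_config;     /* channel_configuration */
} AdtsFixedHeader;

static const int kSamplingFrequencies[16] = {
    96000, 88200, 64000, 48000, 44100, 32000, 24000, 22050,
    16000, 12000, 11025,  8000,  7350,     0,     0,     0
};

/* Returns 0 on success, -1 if the sync word is missing. */
int parse_adts_fixed_header(const uint8_t *p, AdtsFixedHeader *h)
{
    if (p[0] != 0xFF || (p[1] & 0xF0) != 0xF0)      /* sync word 0xFFF */
        return -1;
    h->id                = (p[1] >> 3) & 1;
    h->protection_absent =  p[1]       & 1;
    h->profile           = (p[2] >> 6) & 3;
    h->sf_index          = (p[2] >> 2) & 0xF;
    h->channel_config    = ((p[2] & 1) << 2) | ((p[3] >> 6) & 3);
    return 0;               /* sample rate: kSamplingFrequencies[h->sf_index] */
}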
2. ADTS variable header information:


  1. frame_length: the length of one ADTS frame, including the ADTS header and the raw AAC stream. This value must include the 7- or 9-byte header length:

aac_frame_length = (protection_absent == 1 ? 7 : 9) + size(AACFrame)
when protection_absent == 0, the header length is 9 bytes
when protection_absent == 1, the header length is 7 bytes

  2. adts_buffer_fullness: the value 0x7FF indicates a variable-bit-rate stream.

  3. number_of_raw_data_blocks_in_frame: the ADTS frame contains number_of_raw_data_blocks_in_frame + 1 raw AAC frames; number_of_raw_data_blocks_in_frame == 0 therefore means the ADTS frame contains one AAC data block.

  4. The following is part of an ADTS AAC file:
    [Figure: hex view of an ADTS AAC file]

  5. The 7 header bytes of the first frame are: 0xFF 0xF1 0x4C 0x40 0x20 0xFF 0xFC. Parsing the key fields: syncword = 0xFFF; ID = 0 (MPEG-4); layer = 00; protection_absent = 1 (no CRC); profile = 01 (AAC LC); sampling_frequency_index = 3 (48000 Hz); channel_configuration = 1; adts_buffer_fullness = 0x7FF (VBR); number_of_raw_data_blocks_in_frame = 0.

  6. Calculating the frame length: the 13-bit binary value 0000100000111 equals 263 in decimal, and the length of the first frame is indeed 263 bytes. Calculation method (the 13-bit frame size is stored in an unsigned int):

unsigned int getFrameLength(unsigned char* str)
{
	if (!str)
	{
		return 0;
	}
	unsigned int len = 0;
	int f_bit = str[3];	/* byte 3: its lowest 2 bits are frame_length bits 12..11 */
	int m_bit = str[4];	/* byte 4: all 8 bits are frame_length bits 10..3 */
	int b_bit = str[5];	/* byte 5: its highest 3 bits are frame_length bits 2..0 */
	len += (b_bit >> 5);
	len += (m_bit << 3);
	len += ((f_bit & 3) << 11);
	return len;
}
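
Applied to the example header above, this returns 263 (a quick sanity check using the bytes shown earlier):

unsigned char hdr[7] = {0xFF, 0xF1, 0x4C, 0x40, 0x20, 0xFF, 0xFC};
unsigned int len = getFrameLength(hdr);	/* len == 263 */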
  1. The purpose of frame synchronization is to locate the position of the frame header in the bitstream. ISO/IEC 13818-7 specifies that the frame-header sync word in the AAC ADTS format is the 12-bit string "1111 1111 1111".

  2. The ADTS header information consists of two parts: the fixed header first, followed by the variable header. The data in the fixed header is the same in every frame, while the variable header data changes from frame to frame.


5. AAC element information

1. In AAC, a raw data block may be composed of seven different kinds of elements:

  1. SCE: Single Channel Element. A single-channel element basically consists of one ICS (Individual Channel Stream). A raw data block may contain up to 16 SCEs.

  2. CPE: Channel Pair Element. A two-channel element composed of two ICSs that may share side information, plus some joint-stereo coding information.

  3. CCE: Coupling Channel Element. Represents a block of multi-channel joint-stereo information, or dialogue information for a multilingual program.

  4. LFE: Low Frequency Element. Contains a low-sampling-frequency channel for low-frequency enhancement.

  5. DSE: Data Stream Element. Contains additional information that is not audio.

  6. PCE: Program Config Element. Contains channel configuration information. It may appear in the ADIF header.

  7. FIL: Fill Element. Contains extension information, such as SBR or dynamic range control data.


6. AAC file processing flow

  1. Determine the file format: ADIF or ADTS.

  2. If it is ADIF, parse the ADIF header information and skip to step 6.

  3. If it is ADTS, search for the sync word.

  4. Parse the ADTS frame header information.

  5. If error protection is present, perform the error check.

  6. Decode the block information.

  7. Decode the element information. (A minimal sketch of step 1 follows.)
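
A minimal sketch of step 1 in C, assuming the first bytes of the file are available in a buffer (an ADIF file begins with the ASCII string "ADIF", while an ADTS stream begins with the 0xFFF sync word):

#include <string.h>

/* Illustrative format probe: returns 1 for ADIF, 2 for ADTS, 0 if
 * neither signature is found at the start of the buffer. */
int probe_aac_format(const unsigned char *buf, size_t n)
{
    if (n >= 4 && memcmp(buf, "ADIF", 4) == 0)
        return 1;                /* ADIF header */
    if (n >= 2 && buf[0] == 0xFF && (buf[1] & 0xF0) == 0xF0)
        return 2;                /* ADTS sync word 0xFFF */
    return 0;
}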


7. AAC decoding process

[Figure: AAC decoding flow]

  1. After the main control module starts running, it reads part of the AAC bitstream into the input buffer and locates the start of a frame by searching for the sync word. Once found, decoding proceeds according to the syntax described in ISO/IEC 13818-7: noiseless decoding first (in effect Huffman decoding), then inverse quantization (dequantize), joint stereo, perceptual noise substitution (PNS), temporal noise shaping (TNS), the inverse modified discrete cosine transform (IMDCT), and spectral band replication (SBR). These modules yield the PCM streams of the left and right channels, which the main control module puts into the output buffer and sends to the sound playback device.
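
As a structural sketch of this decoding loop (every type and helper below is a placeholder standing in for the module named above, stubbed out so the sketch compiles; none of this is a real decoder API):

typedef struct { int todo; } Frame;                  /* placeholder state */

/* Each stub stands in for the corresponding module described above. */
static int  next_frame(Frame *f)       { (void)f; return 0; } /* sync word + ADTS header */
static void noiseless_decode(Frame *f) { (void)f; } /* Huffman decoding */
static void dequantize(Frame *f)       { (void)f; } /* inverse quantization */
static void joint_stereo(Frame *f)     { (void)f; } /* M/S, intensity stereo */
static void pns(Frame *f)              { (void)f; } /* perceptual noise substitution */
static void tns(Frame *f)              { (void)f; } /* temporal noise shaping */
static void imdct(Frame *f)            { (void)f; } /* frequency -> time */
static void sbr(Frame *f)              { (void)f; } /* high-band reconstruction */
static void output_pcm(Frame *f)       { (void)f; } /* PCM to the output buffer */

void decode_stream(void)
{
    Frame f;
    while (next_frame(&f)) {     /* find the sync word, parse the header */
        noiseless_decode(&f);
        dequantize(&f);
        joint_stereo(&f);
        pns(&f);
        tns(&f);
        imdct(&f);
        sbr(&f);
        output_pcm(&f);
    }
}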

8. Technical Analysis

1. Main control module:

  1. The main task of the main control module is to operate the input and output buffers and coordinate the other modules.
  2. The interfaces to the input and output buffers are provided by the DSP control module.
  3. The data stored in the output buffer is decoded PCM data, which represents the amplitude of the sound. It is organized as fixed-length buffers whose head pointers are obtained by calling the interface functions of the DSP control module. When an output buffer has been filled, an interrupt handler sends it over the I2S interface to the audio DAC chip (a stereo audio DAC with a DirectDrive headphone amplifier), which outputs the analog sound.

2. Noiseless Decoding:

  1. Noiseless coding is Huffman coding; its function is to further reduce the redundancy of the scale factors and the quantized spectrum, i.e. both the scale factors and the quantized spectrum information are Huffman-coded.
  2. The global gain is coded as an 8-bit unsigned integer. The first scale factor is differentially coded against the global gain value and then Huffman-coded using the scale factor codebook (see the sketch after this list).
  3. All subsequent scale factors are differentially coded against the previous scale factor.
  4. The noiseless coding of the quantized spectrum uses two kinds of partitioning of the spectral coefficients.
  5. One is the division into 4-tuples and 2-tuples, which determines whether one value looked up in a Huffman codebook covers 4 or 2 coefficients; the other is section division, which determines which Huffman codebook is used: one section contains several scale factor bands, and only one Huffman codebook is used within each section.
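
A sketch of the differential decoding described in points 2 and 3, assuming a placeholder huff_decode_sf() that decodes one codeword from the scale factor codebook into the range 0..120 (the offset of 60, so that 60 means "no change", follows common implementations and is an assumption here):

/* Placeholder: decode one scale-factor codeword (assumed range 0..120). */
extern int huff_decode_sf(void);

/* Sketch: reconstruct the scale factors of one channel by differential
 * decoding, starting from the 8-bit global gain. */
void decode_scalefactors(int sf[], int num_bands, int global_gain)
{
    int value = global_gain;            /* first difference is against the global gain */
    for (int band = 0; band < num_bands; band++) {
        value += huff_decode_sf() - 60; /* 60 encodes a difference of zero */
        sf[band] = value;
    }
}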
2.1 Segmentation
  1. Noiseless coding divides the 1024 input quantized spectral coefficients into several sections; every coefficient within a section is coded with the same Huffman codebook. For coding efficiency, section boundaries are best aligned with scale factor band boundaries. Each section must therefore transmit: its length, the scale factor bands it covers, and the Huffman codebook it uses.
2.2 Grouping and interleaving
  1. Grouping means that, ignoring which window the spectral coefficients belong to, consecutive spectral coefficients sharing the same scale factor band are grouped together and share one scale factor, which gives better coding efficiency.
  2. This inevitably causes interleaving: the coefficients are originally ordered as c[group][window][scale factor band][coefficient index]; putting coefficients with the same scale factor together reorders them as c[group][scale factor band][window][coefficient index], which interleaves the coefficients of the same window.
2.3 Handling large quantized values
  1. AAC has two ways to handle quantized values of large magnitude: the escape flag in the Huffman code table, or the pulse escape method.
  2. The former is similar to MP3's approach: when many large values appear, a special Huffman codebook is used, which implies that each of its codewords is followed by an escape value and the sign bits of the coded value.
  3. In the pulse escape method, a large value is reduced by a difference to a smaller value, which is then coded with a Huffman codebook, followed by a pulse structure that helps restore the difference.

3. Scale factor decoding and inverse quantization

  1. In AAC coding, the spectral coefficients are quantized by a non-uniform quantizer, which must be inverted during decoding: keep the sign and raise the magnitude to the 4/3 power.

  2. The basic method of adjusting the quantization noise in the frequency domain is noise shaping with scale factors. A scale factor is an amplitude gain applied to all spectral coefficients in one scale factor band. The scale factor mechanism uses the non-uniform quantizer to change the bit allocation of the quantization noise across the frequency domain. (A sketch follows.)
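
A minimal sketch of this inverse quantization, combining the 4/3-power rule with the scale factor gain; the gain formula 2^(0.25 * (sf - 100)) follows common implementations, and the offset of 100 is an assumption here:

#include <math.h>

/* Sketch: invert the non-uniform quantizer for one spectral coefficient
 * q, then apply the scale factor gain of its band. */
double inverse_quantize(int q, int sf)
{
    double mag  = pow(fabs((double)q), 4.0 / 3.0); /* keep magnitude^(4/3) */
    double gain = pow(2.0, 0.25 * (sf - 100));     /* assumed offset of 100 */
    return (q < 0 ? -mag : mag) * gain;            /* restore the sign */
}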

3.1 Scale factor band (scalefactor-band)
  1. The frequency lines are divided into groups according to the auditory characteristics of the human ear, each group corresponding to a scale factor; these groups are called scale factor bands. To reduce the side information for short windows, consecutive short windows may be grouped, i.e. several short windows are transmitted together as a single window, and the scale factors are then applied to all grouped windows.

4. Joint Stereo

  1. Joint stereo performs certain processing on the original samples to represent the stereo signal more efficiently while keeping the sound properly "stereo". (A sketch of one such tool follows.)
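
One concrete joint-stereo tool, M/S (mid/side) coding, is referenced in the PNS section below; a minimal reconstruction sketch, as an illustration rather than the full AAC procedure:

/* Sketch: M/S reconstruction for one pair of spectral values. The
 * encoder sends mid = (l + r) / 2 and side = (l - r) / 2; the decoder
 * recovers left and right. */
void ms_reconstruct(double mid, double side, double *left, double *right)
{
    *left  = mid + side;
    *right = mid - side;
}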

5. Perceptual Noise Substitution (PNS)

  1. The perceptual noise substitution module simulates noise by means of parametric coding. Once noise is identified in the audio, it is not quantized and coded; instead, a few parameters tell the decoder that this is noise of a certain kind, and the decoder then reconstructs that noise from randomly generated values.

  2. Concretely, the PNS module examines signal components below 4 kHz in each scale factor band. If such a component is neither tonal nor strongly changing in energy over time, it is considered a noise signal. The tonality and energy change of the signal are computed in the psychoacoustic model.

  3. In decoding, if Huffman codebook 13 (NOISE_HCB) is found to be in use, it indicates that PNS is used. Since M/S stereo decoding and PNS decoding are mutually exclusive, the ms_used parameter can be reused to indicate whether the two channels use the same PNS: if ms_used is 1, both channels use the same random vector to generate the noise signal. The PNS energy is signalled by noise_nrg; if PNS is used, this energy value is transmitted in place of the band's scale factor. Noise energy is coded like the scale factors, using differential coding, and its first value is likewise the global gain value. It is interleaved with the intensity-stereo position values and the scale factors, but ignored in their differential decoding; that is, each noise energy value is differentially decoded against the previous noise energy value, not against an intervening intensity-stereo position or scale factor. The random values produce, within a scale factor band, an average energy distribution equal to that computed from noise_nrg. This technique is used only in MPEG-4 AAC.
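
A hedged sketch of the decoder-side substitution described above: fill a scale factor band with random values, then scale them so the band's average energy matches the signalled target (rand() and the uniform distribution are used purely for illustration):

#include <stdlib.h>
#include <math.h>

/* Sketch: fill one scale factor band with noise whose average energy
 * per coefficient matches target_nrg (derived from noise_nrg). */
void pns_fill_band(double *spec, int n, double target_nrg)
{
    double energy = 0.0;
    for (int i = 0; i < n; i++) {
        spec[i] = (double)rand() / RAND_MAX - 0.5; /* random vector */
        energy += spec[i] * spec[i];
    }
    if (energy > 0.0) {
        double scale = sqrt(n * target_nrg / energy);
        for (int i = 0; i < n; i++)
            spec[i] *= scale;                      /* match the target energy */
    }
}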

6. Temporal Noise Shaping (TNS)

  1. This remarkable technique shapes the distribution of quantization noise in the time domain through prediction in the frequency domain. For certain speech signals and other drastically changing signals, TNS contributes greatly to the improvement of sound quality.

  2. TNS (temporal noise shaping) is used to control the shape of the quantization noise over time within one transform window, and is implemented as a filtering process on a single channel. Traditional transform coding schemes often struggle with signals that change drastically in the time domain, especially speech: although the distribution of the quantization noise is controlled in the frequency domain, it is spread out uniformly in time over the whole transform block. If the signal changes drastically within the block but does not trigger a switch to short blocks, this uniformly spread noise becomes audible.

  3. TNS exploits the duality of the time and frequency domains and the time-frequency symmetry of LPC (linear predictive coding): coding in either domain is equivalent to predictive coding in the other, so predictive coding in one domain increases the resolution in the other. The quantization noise is generated in the frequency domain, which reduces the resolution in the time domain, so here the predictive coding is done in the frequency domain. In aacPlus, which is built on the AAC LC profile, the TNS filter order is limited to 12. (A sketch follows.)
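
A hedged sketch of the decoder side: the inverse (all-pole) filter is run along the spectral coefficients; the real syntax restricts the filtered range and sign conventions vary, so this only shows the shape of the operation:

/* Sketch: apply a TNS synthesis (all-pole) filter of order `order`
 * along n spectral coefficients; lpc[1..order] holds the coefficients.
 * This inverts the encoder's FIR prediction over the spectrum. */
void tns_synthesis(double *spec, int n, const double *lpc, int order)
{
    for (int i = 0; i < n; i++)
        for (int j = 1; j <= order && j <= i; j++)
            spec[i] -= lpc[j] * spec[i - j];
}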

7. Inverse Modified Discrete Cosine Transform (IMDCT)

  1. Converting the audio data from the frequency domain back to the time domain is done mainly by feeding the frequency-domain data into a bank of IMDCT filters. After the IMDCT, the output values are windowed and overlap-added to obtain the time-domain samples. (A sketch follows.)
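
A minimal sketch of the windowing and overlap-add step that follows the IMDCT, for N output samples per frame (the IMDCT itself is treated as a given transform here):

/* Sketch: window the 2N-sample IMDCT output, overlap-add the first half
 * with the half saved from the previous frame, and keep the new second
 * half for the next frame. */
void window_overlap_add(const double *imdct_out, const double *window,
                        double *overlap, double *pcm, int N)
{
    for (int i = 0; i < N; i++) {
        pcm[i]     = imdct_out[i] * window[i] + overlap[i];
        overlap[i] = imdct_out[N + i] * window[N + i];
    }
}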

8. Spectral Band Replication (SBR)

  1. Briefly: the main spectral energy of music is concentrated in the low band, while the high band has small amplitude but is very important, since it determines the sound quality.
  2. If the entire band is encoded uniformly, protecting the high frequencies makes the low-frequency encoding overly detailed and the file huge, while keeping only the main low-frequency components and dropping the high frequencies sacrifices quality.
  3. SBR splits the spectrum: the low band is encoded on its own to preserve the main components, and the high band is encoded separately to preserve the quality, reducing file size without sacrificing sound quality and thus resolving this contradiction.

9. Parametric stereo (PS)

  1. A stereo file is twice the size of a mono file, yet the sound of the two channels is quite similar. According to Shannon's information-entropy coding theorem, the correlation should be removed to reduce the file size. PS technology therefore stores the full information of one channel and then spends only a few bytes of parameters to describe how the other channel differs from it.



Reference blog: AAC file format analysis


Source: blog.csdn.net/weixin_41910694/article/details/107735932