H.264 encoding and AAC encoding basics


Preface

This section focuses on H.264 encoding and AAC encoding. Before explaining them, we first introduce the implementation principle of video encoding.


1. Implementation Principle of Video Coding

1. Basic principles of video coding technology

Encoding is for compression. To achieve compression, various algorithms must be designed to remove redundant information from video data.

| Type | Content | Compression method |
| --- | --- | --- |
| Spatial redundancy | Correlation between neighboring pixels | Transform coding, predictive coding |
| Temporal redundancy | Correlation along the time axis | Inter-frame prediction, motion compensation |
| Structural redundancy | The inherent structure of the image itself | Contour coding, region segmentation |
| Knowledge redundancy | Knowledge shared by both sender and receiver | Knowledge-based coding |
| Visual redundancy | Characteristics of human vision | Nonlinear quantization, bit allocation |
| Other | Uncertainty | |

The primary targets of video coding technology are the elimination of spatial redundancy and temporal redundancy.
Insert image description here

2. Implementation method of video coding technology

Video is formed by playing a sequence of frames in succession. These frames fall mainly into three categories, namely:

  • I frame
    • It is an independent frame that carries all its own information . It is the most complete picture (occupies the largest space) and can be decoded independently without referring to other images. The first frame in a video sequence is always an I-frame.
  • B frame
    • "Bidirectional predictive coding frame" uses the previous frame and the following frame as the reference frame. It not only refers to the previous frame, but also refers to the subsequent frame. Therefore, its compression rate is the highest, which can reach 200:1. However, because it relies on subsequent frames, it is not suitable for real-time transmission (such as video conferencing).
      Insert image description here
  • P frame
    • The "inter-frame predictive coding frame" must reference different parts of a previous I frame and/or P frame in order to be encoded, so P frames depend on earlier I and P reference frames. Compared with I frames, P frames achieve a higher compression rate and take up less space.
      Insert image description here
      By classifying frames, the size of the video can be greatly compressed.

After all, the amount of data to be processed is greatly reduced (from the entire image down to a region within the image).
Insert image description here

3. Motion estimation and compensation

①. Block and MacroBlock

If the calculation is always based on pixels, the amount of data will be relatively large. Therefore, the image is generally cut into different " Blocks " or " MacroBlocks " and calculated on them.

A macroblock is generally 16 pixels × 16 pixels.
Insert image description here

②, Summary of I frame, P frame and B frame

  • The I-frame is processed using intra-frame coding , which only uses the spatial correlation within the image of this frame.
  • For the processing of P frames , inter-frame coding (forward motion estimation) is used , while utilizing spatial and temporal correlations.

Simply put, a motion compensation algorithm is used to remove redundant information.
Insert image description here
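
To make motion estimation more concrete, here is a minimal sketch of exhaustive block matching with the sum of absolute differences (SAD) criterion. It is an illustrative Python/NumPy example under simplifying assumptions (full search, one 16×16 luma block, integer-pixel accuracy), not the search algorithm any particular encoder uses.

```python
import numpy as np

def find_motion_vector(ref, cur, block_y, block_x, block=16, search=8):
    """Exhaustive block matching: find the displacement in the reference frame
    that best matches the block of the current frame at (block_y, block_x),
    using the sum of absolute differences (SAD) as the matching cost."""
    target = cur[block_y:block_y + block, block_x:block_x + block].astype(np.int32)
    best_mv, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = block_y + dy, block_x + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(target - cand).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# Toy usage: the current frame is the reference shifted 3 pixels to the right,
# so the best motion vector for an interior block should be (0, -3) with SAD 0.
ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, 3, axis=1)
print(find_motion_vector(ref, cur, 16, 16))
```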

③, I frame (intra-frame coding)

It is important to note that although I-frame (intra-frame) coding only exploits spatial correlation, the overall coding process is still not simple.
Insert image description here
As shown in the figure above, intra-frame coding goes through multiple stages such as DCT (discrete cosine transform), quantization, and entropy coding; the main steps are listed below (a code sketch of the transform steps follows the list).

  • RGB to YUV: fixed conversion formula
  • Macroblock partitioning of the picture: 16×16 macroblocks
  • DCT: discrete cosine transform
  • Quantization: reducing the precision of the transform coefficients (the lossy step)
  • ZigZag scan: reordering the 2-D coefficients into a 1-D sequence from low to high frequency
  • DPCM: differential pulse code modulation
  • RLE: run-length encoding
  • Arithmetic coding: entropy coding of the resulting symbols
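
As promised above, here is a minimal sketch of the DCT, quantization and zigzag-scan steps on a single 8×8 block, using SciPy's DCT. The flat quantization step `qstep` is a made-up simplification; real encoders use standard-defined quantization parameters and matrices.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # 2-D type-II DCT (orthonormal), applied along rows and then columns.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def zigzag_indices(n=8):
    # Traverse anti-diagonals, alternating direction, as in the classic zigzag scan.
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

block = np.random.randint(0, 256, (8, 8)).astype(np.float64) - 128  # 8x8 luma samples
coeffs = dct2(block)                  # spatial domain -> frequency domain
qstep = 16                            # flat quantization step (the lossy part)
quantized = np.round(coeffs / qstep)
scanned = [int(quantized[i, j]) for i, j in zigzag_indices()]
print(scanned)                        # high-frequency tail is mostly zeros -> easy to run-length encode
```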

④. How to measure and evaluate the effect of encoding and decoding

Generally speaking, it is divided into objective evaluation and subjective evaluation .

  • Objective evaluation is to use numbers to speak. For example, calculate "Signal to Noise Ratio/Peak Signal to Noise Ratio".
  • Subjective evaluation is measured directly using people’s subjective perception.
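
The "peak signal-to-noise ratio" mentioned above is easy to compute. A minimal sketch for 8-bit images (peak value 255):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between an original and a reconstructed image."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy usage: compare an image against a slightly noisy reconstruction of itself.
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
noisy = np.clip(img.astype(np.int32) + np.random.randint(-5, 6, img.shape), 0, 255).astype(np.uint8)
print("PSNR: %.2f dB" % psnr(img, noisy))
```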

2. H.264 encoding basics

1. H.264 Quick Start

①. Video coding standardization organization

There are two main standardization organizations engaged in video coding algorithms, ITU-T and ISO.

  • ITU-T, the Telecommunication Standardization Sector of the International Telecommunication Union, mainly formulates standards such as H.261/H.263/H.263+/H.263++
  • ISO, the International Organization for Standardization, mainly formulates MPEG-1/MPEG-4, etc.

Some standards, such as MPEG-2, H.264/AVC and H.265/HEVC, were developed jointly by the two organizations. The development of the video coding standards formulated by the different standards organizations is shown in the figure below:
Insert image description here

②. Basic technology of video compression and coding

The reason why video information has a large amount of space that can be compressed is because there is a large amount of data redundancy in it.
Its main types are:

  • Temporal redundancy : The content between two adjacent frames of the video is similar and there is a motion relationship.
  • Spatial redundancy : There is similarity between adjacent pixels within a certain frame of the video
  • Coding redundancy : different data in the video appear with different probabilities
  • Visual redundancy : The viewer’s visual system is differently sensitive to different parts of the video

For these different types of redundant information, there are different technologies in various video coding standard algorithms to specifically deal with them to improve the compression ratio from different angles.

<1>, Predictive coding

Predictive coding can be used to handle redundancy in the temporal and spatial domains in videos .

Predictive coding in video processing is mainly divided into two categories: intra prediction and inter prediction .

  • Intra-frame prediction : the predicted value and the actual value are located in the same frame; it is used to eliminate the spatial redundancy of the image. Intra-frame prediction has a relatively low compression rate, but it can be decoded independently without relying on data from other frames; the key frames in a video usually use intra prediction.
  • Inter-frame prediction : the actual value is located in the current frame while the predicted value is located in a reference frame; it is used to eliminate the temporal redundancy of the image. The compression rate of inter-frame prediction is higher than that of intra-frame prediction, but it cannot be decoded independently: the current frame can only be reconstructed after the reference frame data has been received.

Usually in video streams, all I frames use intra-frame coding, and the data in P frames/B frames may use intra-frame or inter-frame coding.

<2>, Transform coding

The so-called transformation coding refers to transforming a given image into another data domain such as the frequency domain, so that a large amount of information can be represented by less data, thereby achieving the purpose of compression.

The current mainstream video coding algorithms are all lossy coding , which achieve relatively higher coding efficiency by causing limited and tolerable losses to the video.

The part that causes information loss lies in the transformation and quantization part.

Before quantization, the image information first needs to be transformed from the spatial domain to the frequency domain through transformation coding, and its transformation coefficients are calculated for subsequent coding.

In video coding algorithms, orthogonal transform is usually used for transform coding. Commonly used orthogonal transform methods include: discrete cosine transform (DCT), discrete sine transform (DST), KL transform, etc.

<3>, entropy coding

The entropy coding method in video coding is mainly used to eliminate statistical redundancy in video information.

Since the probability of each symbol appearing in the source is not consistent, it would be wasteful to use codewords of the same length to represent all symbols.

Through entropy coding, symbols of different lengths are allocated to different syntax elements, which can effectively eliminate redundancy caused by symbol probability in video information.

Commonly used entropy coding methods in video coding algorithms include variable length coding and arithmetic coding. Specifically, they mainly include:

  • UVLC (Universal Variable Length Coding): mainly uses Exponential-Golomb coding
  • CAVLC (Context Adaptive Variable Length Coding): Context-adaptive variable length coding
  • CABAC (Context Adaptive Binary Arithmetic Coding): Context-adaptive binary arithmetic coding

Different encoding methods are specified for different syntax element types. Through these two classes of entropy coding, a balance between coding efficiency and computational complexity is achieved.
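
As a small illustration of the variable-length side, here is a sketch of unsigned Exponential-Golomb (ue(v)) encoding and decoding, the code that UVLC uses for many H.264 syntax elements. Bit strings are used instead of packed bytes purely for readability.

```python
def ue_encode(value: int) -> str:
    """Unsigned Exp-Golomb ue(v): write (value + 1) in binary, prefixed by
    as many '0' bits as that binary string has bits minus one."""
    code = bin(value + 1)[2:]
    return '0' * (len(code) - 1) + code

def ue_decode(bits: str):
    """Decode one ue(v) codeword from the front of a bit string.
    Returns (value, remaining_bits)."""
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    value = int(bits[zeros:2 * zeros + 1], 2) - 1
    return value, bits[2 * zeros + 1:]

for v in range(6):
    print(v, '->', ue_encode(v))   # 1, 010, 011, 00100, 00101, 00110
print(ue_decode('00100' + '010')) # (3, '010')
```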

③、VCL and NAL

Coding tools such as predictive coding, transform and quantization, and entropy coding used in video coding mainly work at the slice layer or below. This layer is usually called the "Video Coding Layer" (VCL).

In contrast, the data and algorithms at the slice level and above are usually called the "Network Abstraction Layer" (NAL).

The main significance of designing and defining the NAL layer is to improve the affinity of H.264 format video for network transmission and data storage .

④, Profiles and levels

In order to adapt to different application scenarios, H.264 defines three different profiles:

  • Baseline Profile : mainly used in low-latency real-time communication such as video conferencing and video calls; supports I and P slices, and entropy coding supports the CAVLC algorithm.
  • Main Profile : mainly used for digital TV broadcasting, digital video storage, etc.; supports interlaced (field) coding, B slices with bidirectional prediction, and weighted prediction; entropy coding supports both CAVLC and CABAC.
  • Extended Profile : mainly used for online video streaming and video on demand, etc.; supports all features of the Baseline profile, plus SI and SP slices, data partitioning for improved error resilience, B slices and weighted prediction, but does not support CABAC or field coding.

CAVLC is supported in all H.264 profiles, while CABAC is not supported in the Baseline and Extended profiles.
Insert image description here

⑤. Common encoders

H.264 is a video compression standard that only specifies the format of the code stream that conforms to the standard and the parsing method of each syntax element in the code stream. The H.264 standard does not specify the implementation or process of the encoder, which gives different manufacturers or organizations great freedom in encoding implementation, and has produced some well-known open source H.264 codec projects.

Among them, the two most famous H.264 encoders are JM and X264 , both of which are implementation forms of the H.264 encoding standard.

The JM reference encoder is very slow; x264 is considerably faster.

2. H.264 encoding principle and implementation

①、Foreword

In the video capture and output pipeline, H.264 data sits at the encoding/decoding level, as shown in the figure below: it is the data produced when the captured raw data is compressed according to the coding standard.
Insert image description here

②、H264 related concepts

<1>, sequence

The theoretical basis that the H.264 encoding standard follows can be understood (my personal understanding) as: within a short period of time, adjacent images differ very little in pixel values, brightness and chrominance.

Therefore, for the images within such a period we do not need to encode every image as a complete frame. Instead, the first image of the period is encoded completely, and each subsequent image only records its differences in pixels, brightness, chrominance and so on relative to that completely encoded frame, and so forth.

What is a sequence? The set of images that change little over such a period can be called a sequence . A sequence can be understood as a segment of data with the same characteristics.

<2>, frame type

In the H264 structure, the encoded data of one video image is called a frame. A frame is composed of one slice or multiple slices, a slice is composed of one or more macroblocks (MB) , and a macroblock is composed of a 16×16 block of YUV data. The macroblock is the basic unit of H264 encoding.

Three types of frames are defined in the H264 protocol, namely I frame, B frame and P frame .

<3>, GOP (Group of Pictures)

I personally understand that GOP means something similar to a sequence, which is a set of images that have not changed much over a period of time.

The GOP structure generally has two numbers, such as M=3, N=12. M specifies the distance between I frames and P frames, and N specifies the distance between two I frames.

The above M=3, N=12, and the GOP structure is: IBBPBBPBBPBBI.

Within a GOP, an I frame does not depend on any other frame for decoding; a P frame depends on the preceding I frame or P frame; and a B frame depends on the nearest preceding I or P frame and the nearest following P frame.
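
A small sketch of how the display-order frame-type pattern follows from M and N as defined above (a simplified illustration; real encoders may vary the structure adaptively):

```python
def gop_pattern(m: int, n: int) -> str:
    """Display-order frame types for one GOP.
    m: distance between anchor frames (I/P), n: distance between two I frames."""
    types = []
    for i in range(n):
        if i == 0:
            types.append('I')        # every GOP starts with an I frame
        elif i % m == 0:
            types.append('P')        # anchor frame every m pictures
        else:
            types.append('B')        # bidirectionally predicted frames in between
    return ''.join(types)

print(gop_pattern(3, 12))  # IBBPBBPBBPBB -- the next GOP then starts with a new I frame
```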

<4>, IDR frame (key frame)

For convenience in encoding and decoding, the first I frame in a GOP is distinguished from other I frames and is called an IDR frame. This makes it easier to control the encoding and decoding process. An IDR frame is therefore always an I frame, but an I frame is not necessarily an IDR frame. The function of an IDR frame is to refresh immediately so that errors do not propagate, and a new sequence starts encoding from the IDR frame. Frames may still be referenced across an ordinary I frame, but never across an IDR frame.

The I frame does not need to refer to any frame, but subsequent P frames and B frames may refer to the frame before this I frame.

IDR does not allow this, for example:

IDR1 P4 B2 B3 P7 B5 B6 I10 B8 B9 P13 B11 B12 P16 B14 B15

Here B8 may reference P7 across I10 (this is allowed for an ordinary I frame).

IDR1 P4 B2 B3 P7 B5 B6 IDR8 P11 B9 B10 P14 B11 B12

Here B9 may only reference IDR8 and P11; it cannot reference any frame before IDR8.

Function:
H.264 introduces IDR pictures for decoder resynchronization . When the decoder decodes an IDR picture, it immediately clears the reference frame queue, outputs or discards all decoded data, searches for the parameter sets again, and starts a new sequence . In this way, if a serious error occurred in the previous sequence, there is an opportunity to resynchronize here. Pictures after an IDR picture are never decoded using data from pictures before the IDR.

③, H264 compression method

The core algorithms used by H264 are intra-frame compression and inter-frame compression . Intra-frame compression is the algorithm for generating I frames, and inter-frame compression is the algorithm for generating B frames and P frames.

  • Intraframe compression is also called spatial compression . When compressing a frame of image, only the data of this frame is considered without considering the redundant information between adjacent frames. This is actually similar to static image compression. Intra-frame generally uses a lossy compression algorithm. Since intra-frame compression encodes a complete image, it can be decoded and displayed independently. Intra-frame compression generally does not achieve very high compression, similar to encoding jpeg.
  • The principle of inter-frame compression is that the data of adjacent frames are strongly correlated , in other words the information in consecutive frames changes very little. Continuous video therefore contains redundant information between adjacent frames, and compressing away this redundancy further increases the amount of compression achieved and reduces the resulting data size. Inter-frame compression is also called temporal compression: it compresses data by comparing frames along the time axis. Inter-frame compression is generally lossless. The frame differencing algorithm is a typical temporal compression method: it compares a frame with its adjacent frames and records only the differences between them, which can greatly reduce the amount of data.

Compression method description

  • Step1: Grouping , that is, classifying a series of images with little change into a group, that is, a sequence, which can also be called GOP (Group of Pictures);
  • Step2: Define frames and classify each group of image frames into three types: I frame, P frame and B frame;
  • Step3: Predict frames , use I frame as the basic frame, use I frame to predict P frame, and then use I frame and P frame to predict B frame;
  • Step4: Data transmission , finally store and transmit the I frame data together with the difference information obtained from prediction.

④. H264 layered structure

The H.264 raw code stream (elementary stream) is composed of one NALU after another. Functionally it is divided into two layers, the VCL (Video Coding Layer) and the NAL (Network Abstraction Layer).

The main goals of H264 are a high video compression ratio and good network friendliness. To achieve these two goals, the H264 design divides the framework into two levels, namely the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL), as shown below:
Insert image description here

  • The VCL layer (Video Coding Layer) covers the core algorithm engine and the syntax levels of blocks, macroblocks and slices . It is responsible for efficiently representing the content of the video data, and its final output is the encoded data, the SODB.
    • It compresses the original video data.
  • The NAL layer (Network Abstraction Layer) defines the syntax levels above the slice level (such as the sequence parameter set and the picture parameter set, described later for network transmission), and is responsible for formatting the data in the way required by the network and providing header information, so that the data is suitable for transmission over various channels and storage media. The NAL layer packages the SODB into an RBSP and then adds the NAL header to form a NALU unit .
    • Because H264 data is ultimately transmitted over a network, and the maximum transmission unit of a network packet is typically 1500 bytes, an H264 frame is often larger than 1500 bytes and must be split across multiple packets for transmission. This unpacking and packaging work is handled at the NAL layer.

<1>, Association between SODB and RBSP

The specific structure is shown in the figure below:

  • SODB: Data bit string, which is the encoded original data;
  • RBSP: Raw byte sequence payload, which appends trailing bits (a '1' bit followed by zero or more '0' bits) after the original encoded data for byte alignment.
    Insert image description here

<2>, the difference between SODB, RBSP and EBSP

  • SODB (String of Data Bits, data bit string)
    • The most raw, unprocessed encoded data
    • Because it is a bit stream, its length is not necessarily a multiple of 8; it is generated by the VCL layer. Since computers process data in units of 8 bits, the RBSP is needed when a computer handles H264.
  • RBSP (Raw Byte Sequence Payload, original byte sequence payload)
    • A trailing bit (RBSP trailing bits, one bit '1') and several bits '0' are added after the SODB to facilitate byte alignment.
    • Because it is a compressed stream, there is no way to tell where an SODB ends, so a '1' bit is appended after its last bit, followed by '0' bits if byte alignment is required.
  • EBSP (Encapsulated Byte Sequence Payload, extended byte sequence payload)
    • The start code of a NALU is 0x000001 or 0x00000001 (there are two kinds of start code: 3-byte (0x000001) and 4-byte (0x00000001). SPS, PPS and the first NALU in an Access Unit use the 4-byte start code; in all other cases the 3-byte start code is used.)
    • After the compressed stream is generated, a start code is added at the beginning of each NALU, generally 00 00 00 01 or 00 00 01. To avoid emulating this pattern inside the payload, the H264 code stream stipulates that whenever two consecutive 0x00 bytes are followed by a byte of 0x03 or less, an extra 0x03 byte is inserted between them.
  • Emulation prevention byte (0x03)
    • At the same time, H264 stipulates that detecting 0x000000 can also indicate the end of the current NALU. This raises a problem: what should be done if 0x000001 or 0x000000 appears inside the NALU payload?
    • The reason for adding the emulation prevention byte (0x03) on top of the RBSP is as follows: when NALUs are written in Annex B format, a start code prefix must be placed before each NALU. If the slice carried by the NALU is the start of a frame, the 4-byte start code 0x00000001 is used; otherwise the 3-byte start code 0x000001 is used. To prevent the NALU body from containing byte patterns that conflict with the start code, the encoder inserts a 0x03 byte whenever it would otherwise emit two consecutive zero bytes followed by such a pattern; the decoder removes the 0x03 bytes again. This removal is also called the de-emulation ("shelling") operation.

relation chart:
Insert image description here

Why is an EBSP needed?
Compared with the RBSP, the EBSP contains extra emulation prevention bytes: 0x03.

We know that the start code of NALU is 0x000001 or 0x00000001, and H264 stipulates that when 0x000000 is detected, it can also indicate the end of the current NALU. This will cause a problem, that is, what should we do if 0x000001 or 0x000000 appears inside NALU?

Therefore, H264 defines an "emulation prevention" mechanism. When the encoder finishes encoding a NAL unit, it checks whether any of the conflicting byte sequences shown below appear inside the NALU; wherever one is detected, the encoder inserts an emulation prevention byte 0x03 before the last byte of that sequence.
Insert image description here
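
A minimal sketch of the emulation prevention mechanism just described: inserting 0x03 when going from RBSP to EBSP, and removing it again when decoding (the de-emulation step). It follows the simplified rule described above.

```python
def rbsp_to_ebsp(rbsp: bytes) -> bytes:
    """Insert an emulation prevention byte 0x03 whenever two consecutive 0x00
    bytes are followed by a byte of value 0x00..0x03."""
    out, zeros = bytearray(), 0
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)                 # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)

def ebsp_to_rbsp(ebsp: bytes) -> bytes:
    """Remove emulation prevention bytes again (de-emulation / 'shelling')."""
    out, zeros = bytearray(), 0
    for b in ebsp:
        if zeros >= 2 and b == 0x03:
            zeros = 0                        # drop the inserted 0x03
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)

rbsp = bytes([0x11, 0x00, 0x00, 0x01, 0x22, 0x00, 0x00, 0x00, 0x33])
ebsp = rbsp_to_ebsp(rbsp)
print(ebsp.hex())                            # 0x03 has been inserted twice
assert ebsp_to_rbsp(ebsp) == rbsp
```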

⑤. H264 code stream structure

Before describing the NAL unit in detail, it is necessary to first understand the H264 code stream structure. The encoded H264 code stream is shown in the figure below. The key concept to take from the figure is that the H264 code stream is composed of NAL units, where SPS, PPS, IDR and SLICE are particular types of data carried in NAL units.
Insert image description here
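
As a concrete way to look at such a stream, the sketch below splits an Annex B byte stream into its NAL units by scanning for the 3-/4-byte start codes. It is a simplified illustration that assumes a well-formed stream, not a full demuxer, and the payload bytes in the example are illustrative only.

```python
def split_annexb_nalus(stream: bytes):
    """Yield the NAL units of an Annex B stream (payloads between start codes)."""
    starts, i = [], 0
    while i + 3 <= len(stream):
        if stream[i:i + 3] == b'\x00\x00\x01':
            starts.append(i + 3)              # payload begins right after 00 00 01
            i += 3
        else:
            i += 1
    for k, start in enumerate(starts):
        end = len(stream) if k + 1 == len(starts) else starts[k + 1] - 3
        if end < len(stream) and end > 0 and stream[end - 1] == 0x00:
            end -= 1                           # the next start code was the 4-byte form
        yield stream[start:end]

# Toy stream: an SPS-like NAL unit (4-byte start code) followed by a PPS-like one.
data = b'\x00\x00\x00\x01\x67\x42\x00\x1f' + b'\x00\x00\x01\x68\xce\x3c\x80'
for nalu in split_annexb_nalus(data):
    print(nalu.hex())   # 6742001f, then 68ce3c80
```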

⑥. NAL unit of H264

<1>, NAL structure of H264

In the actual network data transmission process, the data structure of H264 is transmitted in NALU (NAL unit). The transmission data structure consists of [NALU Header]+[RBSP] , as shown in the following figure:
Insert image description here
The video frame data encoded by the VCL layer may consist of I/B/P frames, and these frames may belong to different sequences; each sequence also has its corresponding sequence parameter set and picture parameter set. To decode the video correctly, the sequence parameter sets and picture parameter sets must therefore be transmitted in addition to the video frame data encoded by the VCL layer.

As we know above, the NAL unit is the basic unit for actual video data transmission. The NALU header is used to identify what type of data the subsequent RBSP is. It also records whether the RBSP data will be referenced by other frames and whether there are errors in network transmission.

<2>, NAL header

The header of the NAL unit is composed of three parts : forbidden_bit(1bit) , nal_reference_bit(2bits) (priority), nal_unit_type(5bits) (type), as shown in the figure below:

  • F (forbidden): forbidden bit, occupying the first bit of the NAL header. A value of 1 indicates a syntax error.
  • NRI (nal_ref_idc): reference level, occupying the second and third bits of the NAL header; the larger the value, the more important the NAL unit is.
  • Type (nal_unit_type): the data type of the NAL unit, identifying what kind of data follows, occupying the fourth to eighth bits of the NAL header.

Insert image description here
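
A minimal sketch of splitting that header byte into the three fields (F, NRI, Type):

```python
NAL_TYPE_NAMES = {1: "non-IDR slice", 5: "IDR slice", 6: "SEI", 7: "SPS", 8: "PPS"}

def parse_nal_header(first_byte: int) -> dict:
    """Split the one-byte NAL header into forbidden_zero_bit (1 bit),
    nal_ref_idc (2 bits) and nal_unit_type (5 bits)."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x01,
        "nal_ref_idc":        (first_byte >> 5) & 0x03,
        "nal_unit_type":      first_byte & 0x1F,
    }

hdr = parse_nal_header(0x67)   # 0x67 = 0_11_00111: NRI = 3, type = 7 (an SPS)
print(hdr, NAL_TYPE_NAMES.get(hdr["nal_unit_type"], "other"))
```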

<3>, NAL unit data type

The NAL types are mainly the types in the figure below. Each type has a special function;
Insert image description here
Insert image description here
Before introducing the NAL data types in detail, it is necessary to know that NAL units are divided into VCL and non-VCL NAL units .

Another concept that needs to be understood is the parameter set. Parameter sets are NAL units that carry decoding parameters, and they are very important for correct decoding. In a lossy transmission scenario, bit strings or packets may be lost or damaged; in such a network environment, parameter sets can be sent over a higher-quality service, for example using a forward error correction mechanism or a priority mechanism.
Insert image description here

  • Non-VCL NAL data types:

    • SPS (Sequence Parameter Set) : the SPS records sequence-level decoding parameters such as the identifier, the number of frames and reference frames, the decoded picture size and the frame/field coding mode.
    • PPS (Picture Parameter Set) : the PPS records picture-level decoding parameters such as the entropy coding type, the number of valid reference pictures and initialization parameters.
    • SEI (Supplemental Enhancement Information) : these parameters can be carried in the H264 bitstream, and each SEI message is encapsulated in its own NAL unit. SEI may be useful to a decoder, but it is not required for the basic decoding process.
  • NAL data type for VCL:

    • Header information block , including macroblock types, quantization parameters, and motion vectors. This information is the most important, because without it the syntax elements of the other data blocks cannot be used. This block is called a type A data block.
    • Intra-frame coded information blocks are called type B data blocks. They contain intra-coded macroblock types and intra-coded coefficients. For the corresponding slice, the availability of type B data blocks depends on the type A data block. Unlike inter-coded information, intra-coded information can stop further drift and is therefore more important than inter-coded information.
    • Inter-frame coded information blocks are called type C data blocks. They contain inter-coded macroblock types and inter-coded coefficients, and usually form the largest part of the slice. The inter-coded information block is the least important part; the information it contains does not provide resynchronization between encoder and decoder. The availability of type C data blocks also depends on the type A data block, but is independent of type B data blocks.

Each partition of the above three data blocks is stored separately in a NAL unit and can therefore be transmitted separately.

<4>、NAL Unit

An H264 frame consists of at least one slice; it may contain one or more slices.

During network transmission, an H264 frame is often too large to be sent in a single packet, so it is split for transmission along slice boundaries.

Each slice forms a NAL Unit.

Insert image description here

<5>, the relationship between slices and macroblocks

Insert image description here
The slice data contains several macroblocks.
A macroblock also contains macroblock type, macroblock prediction, and residual data.
Insert image description here

<6>, the relationship between the H264 NAL unit, slices and macroblocks

  • 1 frame (one picture) = 1~N slices (it can also be said that one or more slices form a slice group)
  • 1 slice = 1~N macroblocks (Macroblock)
  • 1 macroblock = 16×16 YUV data (original video capture data)

From the perspective of data hierarchy, an original picture can broadly be counted as one frame. A frame contains slice groups and slices: a slice group is composed of slices, and a slice is composed of macroblocks. Each macroblock can be 4×4, 8×8 or 16×16 pixels in size; the relationship between them is shown in the figure.
Insert image description here
At the same time, there are a few points that need to be explained to make it easier to understand the NAL unit:

  • If the FMO (Flexible Macroblock Ordering) mechanism is not used , an image will have only one slice group;
  • If multiple slices are not used, a slice group has only one slice;
  • If the DP (data partitioning) mechanism is not used , a slice is one NALU and one NALU is one slice. Otherwise, a slice is made up of three NALUs, namely the type A, B and C data blocks mentioned above.

With this data layering in mind, the following figure gives a clearer picture of the overall code stream structure.
Insert image description here
As we can see, each slice also contains two parts, a header and data. The slice header carries the slice type, the macroblock types in the slice, the slice frame number, and the corresponding frame settings and parameters; the slice data contains the macroblocks, which is where the pixel data is stored. The macroblock is the main carrier of video information because it contains the luminance and chrominance information of each pixel.

The main task of video decoding is to provide an efficient way to obtain the pixel array in the macroblock from the code stream.

The composition of macroblock data is shown in the figure below:
Insert image description here
From the figure above, you can see that the macroblock contains information such as macroblock type, prediction type, Coded Block Pattern, Quantization Parameter, pixel brightness and chrominance data set, etc.

A few points to note:

  • The H.264/AVC standard has strict requirements on the order of NAL units sent to the decoder . If the order of the NAL units is scrambled, they must be reorganized according to the specification before being sent to the decoder, otherwise the decoder cannot decode correctly.
  • A sequence parameter set (SPS) NAL unit must be transmitted before all other NAL units that reference it , but duplicate sequence parameter set NAL units are allowed among those NAL units. "Duplicate" here means: sequence parameter set NAL units carry their own identifier, and if two sequence parameter set NAL units have the same identifier, the later one can be regarded as merely a copy of the earlier one rather than a new sequence parameter set.
  • A picture parameter set (PPS) NAL unit must likewise be transmitted before all other NAL units that reference it, but duplicate picture parameter set NAL units are allowed to appear among those NAL units. This is the same rule as for the sequence parameter set NAL units described above.

3. AAC encoding basics

AAC is the abbreviation of Advanced Audio Coding. AAC is a new generation of audio lossy compression technology.

1. Characteristics of AAC encoding

  • AAC is a high-compression-ratio audio compression algorithm; its compression ratio far exceeds that of older audio compression algorithms such as AC-3 and MP3, while its quality is comparable to uncompressed CD sound quality.
  • Like other similar audio coding algorithms, AAC also uses a transform coding algorithm , but AAC uses a higher-resolution filter bank, so it can achieve a higher compression ratio.
  • AAC uses the latest technologies such as temporal noise shaping, backward adaptive linear prediction, joint stereo coding and quantized Huffman coding . The use of these new technologies further improves the compression ratio.
  • AAC supports a wider range of sample rates and bit rates, 1 to 48 audio channels, up to 15 low-frequency (LFE) channels , multi-language compatibility, and up to 15 embedded data streams.
  • AAC supports a wider sampling rate range, from 8 kHz up to 96 kHz , which is much wider than the 16 kHz–48 kHz range of MP3.
  • Unlike MP3 and WMA, AAC almost does not lose the very high and very low frequency components in the sound frequency , and is closer to the original audio in spectral structure than WMA, so the sound fidelity is better. Professional evaluation shows that AAC sounds clearer and closer to the original sound than WMA
  • AAC uses optimized algorithms to achieve higher decoding efficiency, requiring less processing power when decoding.

2. AAC audio file format

①. AAC audio file format types

AAC audio file formats are: ADIF, ADTS

  • ADIF: Audio Data Interchange Format. The characteristic of this format is that the start of the audio data can be located unambiguously, but decoding cannot start in the middle of the audio data stream; it must begin at the clearly defined start. This format is therefore commonly used for disk files.
  • ADTS: Audio Data Transport Stream audio data transport stream. The characteristic of this format is that it is a bit stream with synchronization words, and decoding can start at any position in this stream. Its characteristics are similar to the mp3 data stream format.

Simply put, ADTS can be decoded starting from any frame, because every frame carries its own header information, whereas ADIF has only a single header for the whole stream, so decoding can only begin once all the data is available. The two header formats are also different; the audio streams produced by encoders and extracted from containers are generally in ADTS format.

The ADIF file format of AAC is as follows:
Insert image description here
The format of a frame in the ADTS file of AAC is as follows:
Insert image description here

②. Header structure of ADIF

The ADIF header information is as shown below:
Insert image description here
The ADIF header information is located at the beginning of the AAC file, followed by continuous Raw Data Blocks.

The fields that make up the ADIF header information are as follows:
Insert image description here

③. Header structure of ADTS

Fixed header information of ADTS:
Insert image description here
Variable header information of ADTS:
Insert image description here

  • The purpose of frame synchronization is to locate the position of the frame header in the bit stream. ISO/IEC 13818-7 stipulates that the frame header syncword in the AAC ADTS format is the 12-bit pattern "1111 1111 1111".
  • The header information of ADTS consists of two parts, one is fixed header information, followed by variable header information. The data in fixed header information is the same from frame to frame, while variable header information changes from frame to frame.
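
To make the header layout concrete, here is a sketch that parses a 7-byte ADTS header (fixed plus variable part, CRC absent) into its main fields. The example header bytes are hypothetical, constructed to describe a 44.1 kHz stereo AAC-LC frame of 371 bytes.

```python
SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000, 24000, 22050,
                16000, 12000, 11025, 8000, 7350]

def parse_adts_header(h: bytes) -> dict:
    """Parse the main fields of a 7-byte ADTS frame header (no CRC)."""
    if len(h) < 7 or h[0] != 0xFF or (h[1] & 0xF0) != 0xF0:
        raise ValueError("syncword 0xFFF not found: not an ADTS frame header")
    profile      = (h[2] >> 6) & 0x03                  # 0=Main, 1=LC, 2=SSR
    sf_index     = (h[2] >> 2) & 0x0F                  # sampling frequency index
    channel_cfg  = ((h[2] & 0x01) << 2) | (h[3] >> 6)  # channel configuration
    frame_length = ((h[3] & 0x03) << 11) | (h[4] << 3) | (h[5] >> 5)
    return {
        "mpeg_version": "MPEG-2" if (h[1] >> 3) & 0x01 else "MPEG-4",
        "crc_present": not (h[1] & 0x01),
        "object_type": profile + 1,                    # 2 = AAC LC
        "sample_rate": SAMPLE_RATES[sf_index],
        "channels": channel_cfg,
        "frame_length": frame_length,                  # header + raw data, in bytes
    }

# Hypothetical header: 44.1 kHz, 2 channels, AAC LC, frame length 371 bytes.
print(parse_adts_header(bytes([0xFF, 0xF1, 0x50, 0x80, 0x2E, 0x7F, 0xFC])))
```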

④.AAC element information

In AAC, a raw data block may be composed of the following different elements:

  • SCE: Single Channel Element. A single channel element basically consists of just one ICS (individual channel stream). A raw data block may contain up to 16 SCEs.
  • CPE: Channel Pair Element. A channel pair element consists of two ICSs that may share side information and some joint stereo coding information. A raw data block may contain up to 16 CPEs.
  • CCE: Coupling Channel Element. Represents a block of multi-channel joint stereo information, or dialogue information for a multilingual program.
  • LFE: Low Frequency Element. Contains a low-frequency enhancement channel.
  • DSE: Data Stream Element Data stream element contains some additional information that does not belong to audio.
  • PCE: Program Config Element program configuration element. Contains channel configuration information. It may appear in the ADIF header information.
  • FIL: Fill Element Fill element. Contains some extended information. Such as SBR, dynamic range control information, etc.

⑤. AAC file processing process

  • (1) Determine the file format: ADIF or ADTS
  • (2) If it is ADIF, decode the ADIF header information and jump to step (6)
  • (3) If it is ADTS, search for the syncword
  • (4) Decode the ADTS frame header information
  • (5) If error detection is present, perform error checking
  • (6) Decode the block information
  • (7) Decode the element information
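
Step (1) above boils down to checking the first bytes of the file: an ADIF file starts with the ASCII tag "ADIF", while an ADTS stream starts with the 12-bit syncword 0xFFF. A minimal sketch (the file name in the usage comment is hypothetical):

```python
def detect_aac_format(path: str) -> str:
    """Decide whether an AAC file is ADIF or ADTS from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head[:4] == b"ADIF":
        return "ADIF"   # ADIF magic at the very start of the file
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xF0) == 0xF0:
        return "ADTS"   # 12-bit frame syncword 0xFFF
    return "unknown"

# print(detect_aac_format("audio.aac"))   # hypothetical file name
```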

3. Coding algorithm processing flow

First, the input PCM signal is segmented into frames of 1024 samples per channel; with 50% overlap, adjacent frames are combined into blocks of 2048 samples.

After windowing, a modified discrete cosine transform (MDCT) is performed, producing 1024 spectral coefficients, which are grouped into scale factor bands of different bandwidths according to the sampling rate and transform block type. The transform block type is determined through analysis by the psychoacoustic model, which also outputs the signal-to-masking ratio used by subsequent modules.
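
A small sketch of the segmentation step just described: each frame contributes 1024 new samples, and consecutive frames are combined with 50% overlap into 2048-sample windowed blocks before the MDCT stage. The sine window here is only for illustration; the real window choice is driven by the psychoacoustic model.

```python
import numpy as np

FRAME = 1024  # new samples per channel per frame

def windowed_blocks(pcm: np.ndarray):
    """Yield 2048-sample analysis blocks with 50% (1024-sample) overlap,
    each multiplied by a sine window before the MDCT stage."""
    n = 2 * FRAME
    window = np.sin(np.pi * (np.arange(n) + 0.5) / n)   # sine window
    for start in range(0, len(pcm) - n + 1, FRAME):
        yield pcm[start:start + n] * window

pcm = np.random.uniform(-1.0, 1.0, 4096)                 # toy mono PCM signal
blocks = list(windowed_blocks(pcm))
print(len(blocks), "blocks of", blocks[0].size, "samples")  # 3 blocks of 2048 samples
```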

AAC also uses a new technology called temporal noise shaping , or TNS for short . The mechanism of TNS is to take advantage of the duality of time domain and frequency domain signals.

Quantization and encoding use a two-level nested loop algorithm to trade off bit rate against distortion.

Finally, the bit stream is encapsulated to obtain the compressed code stream.
Insert image description here


My qq: 2442391036, welcome to communicate!

