AVC/H.264 video compression coding principles

1. H.264/AVC framework and flow chart

1.1 H.264/AVC framework diagram

[Figure: H.264's two layers, VCL and NAL]
H.264's functionality is divided into two layers.
VCL (Video Coding Layer): responsible for efficiently representing the video content, including:
inter-frame and intra-frame prediction,
transform and inverse transform,
quantization and inverse quantization,
the loop filter,
entropy coding.

NAL (Network Abstraction Layer): responsible for packaging the coded data and transmitting it in whatever form the network requires.
Network encapsulation formats (a sketch of NALU extraction follows this list):
RTP: 12-byte RTP header + NALU, usually carried over UDP.
RTMP: RTMP header + NALU, usually carried over TCP; the RTMP client pushes the stream to the server, the form commonly used for live broadcast.
RTSP: a control protocol that sends control commands and coordinates the video transport, with the video itself usually carried in RTP packets over UDP.
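In every case the payload is a sequence of NAL units (NALUs). To make the packaging concrete, here is a minimal sketch, with illustrative names and sample bytes only, that splits an Annex-B H.264 byte stream into NAL units at its 0x000001 / 0x00000001 start codes:

```python
# A minimal sketch, not a production parser: split an H.264 Annex-B byte
# stream into NAL units by scanning for the 3- or 4-byte start codes.
def split_nalus(stream: bytes):
    def is_start(b, i):
        return b[i:i+3] == b"\x00\x00\x01" or b[i:i+4] == b"\x00\x00\x00\x01"
    nalus, i, n = [], 0, len(stream)
    while i < n:
        if stream[i:i+3] == b"\x00\x00\x01":
            start = i + 3
        elif stream[i:i+4] == b"\x00\x00\x00\x01":
            start = i + 4
        else:
            i += 1
            continue
        j = start
        while j < n and not is_start(stream, j):
            j += 1
        if start < n:
            nal_type = stream[start] & 0x1F   # low 5 bits of the header: 5 = IDR slice, 7 = SPS, 8 = PPS
            nalus.append((nal_type, stream[start:j]))
        i = j
    return nalus

# Example: an SPS, a PPS, and an IDR slice (payloads truncated for brevity).
data = (b"\x00\x00\x00\x01\x67\x64"
        + b"\x00\x00\x00\x01\x68\xee"
        + b"\x00\x00\x01\x65\x88")
print([t for t, _ in split_nalus(data)])   # [7, 8, 5]
```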

1.2 H.264/AVC flow chart

[Figure: H.264/AVC encoder block diagram]
The figure above is the encoding flow chart of H.264/AVC. The H.264 encoder uses a hybrid coding approach that combines prediction and transform coding. As shown in the figure, the input frame or field Fn is processed by the encoder in units of macroblocks.

First, each macroblock is coded with either intra-frame or inter-frame prediction. If inter-frame predictive coding is used, the predicted value PRED (denoted P in the figure) is obtained by motion compensation (MC) from previously coded reference pictures, the reference picture being denoted F'n-1. To improve the prediction accuracy, and with it the compression ratio, the actual reference picture can be chosen from past or future (in display order) frames that have already been coded, decoded, reconstructed, and filtered. Intra prediction is very similar to JPEG: after a picture is divided into macroblocks, each 4x4 block can be predicted with one of nine modes, and the encoder picks the prediction that comes closest to the original image.

Next, the prediction PRED is subtracted from the current block to produce a residual block Dn. The residual is block-transformed and quantized to give a set of quantized transform coefficients X, which are then entropy coded. Together with the side information needed for decoding (prediction modes, quantization parameters, motion vectors, and so on), they form the compressed bit stream, which is passed through the NAL (Network Abstraction Layer) for transmission and storage.
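To make this data flow concrete, here is a heavily simplified sketch of the hybrid loop for one 4x4 block. The orthonormal DCT and the flat quantizer step are stand-ins for H.264's integer transform and QP-driven quantizer, and every name is illustrative:

```python
import numpy as np

N = 4
# Orthonormal 4x4 DCT basis -- a stand-in for H.264's integer transform.
C = np.array([[np.sqrt((1 if k == 0 else 2) / N) * np.cos((2 * i + 1) * k * np.pi / (2 * N))
               for i in range(N)]
              for k in range(N)])

def encode_block(current, pred, qstep=8.0):
    """Sketch of the hybrid coding loop for one 4x4 block (not real H.264)."""
    residual = current.astype(float) - pred        # Dn
    coeffs = C @ residual @ C.T                    # block transform
    X = np.round(coeffs / qstep)                   # quantized coefficients X
    # Reconstruct exactly as the decoder will, so that later predictions
    # are based on decoded pixels rather than the originals.
    recon = pred + C.T @ (X * qstep) @ C
    return X, recon                                # X then goes on to entropy coding
```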

That is the H.264 encoder; now let's look at the decoder.
[Figure: H.264/AVC decoder block diagram]
As the figure shows, the NAL of the encoder outputs a compressed H.264 bit stream. Entropy decoding recovers the set of quantized transform coefficients X, and inverse quantization followed by the inverse transform yields the residual Dn'. Using the header information decoded from the bit stream, the decoder generates a prediction PRED identical to the PRED produced in the encoder. Adding PRED to the residual Dn' gives the unfiltered reconstruction uFn', and after filtering, the final decoded output picture Fn' is obtained.
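The decoder's reconstruction mirrors the encoder's internal loop. A matching sketch, reusing the stand-in matrix C and encode_block from the encoder sketch above:

```python
def decode_block(X, pred, qstep=8.0):
    """Sketch of decoder-side reconstruction for one 4x4 block."""
    residual = C.T @ (X * qstep) @ C   # inverse quantization + inverse transform -> Dn'
    return pred + residual             # uFn' (the deblocking filter would run afterwards)

cur = np.arange(16, dtype=float).reshape(4, 4)
prd = np.full((4, 4), 7.0)
X, _ = encode_block(cur, prd)
print(np.abs(decode_block(X, prd) - cur).max())  # small but nonzero: quantization is lossy
```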

2. H.264 encoding principles

Below we briefly describe how H.264 compresses the data. The video frames captured by the camera (say at 30 frames per second) are sent into the H.264 encoder's buffer, and the encoder first divides each picture into macroblocks.

2.1 Divide macroblocks

A macroblock is the basic processing unit of the coding standard, and it is usually 16x16 pixels in size. A 16x16 macroblock can be divided into smaller sub-blocks of size 16x8, 8x16, 8x8, 8x4, 4x8, or 4x4.

H.264 uses a 16x16 area as the macroblock by default, and it can also divide a macroblock down into 8x8 blocks.
After the macroblocks are divided, the pixel values of each macroblock are computed.
[Figure: pixel values computed for one macroblock]
In the same way, the pixel values of every macroblock in the image are computed, so that all macroblocks end up processed as follows.
[Figure: the full image divided into macroblocks]
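As a minimal sketch of this step (the helper below is illustrative; it pads the picture edges so the dimensions need not be multiples of 16):

```python
import numpy as np

def split_into_macroblocks(frame, mb_size=16):
    """Tile a luma plane into mb_size x mb_size macroblocks."""
    h, w = frame.shape
    pad_h, pad_w = (-h) % mb_size, (-w) % mb_size
    frame = np.pad(frame, ((0, pad_h), (0, pad_w)), mode="edge")  # repeat edge pixels
    blocks = []
    for y in range(0, frame.shape[0], mb_size):
        for x in range(0, frame.shape[1], mb_size):
            blocks.append(((y, x), frame[y:y+mb_size, x:x+mb_size]))
    return blocks

luma = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
print(len(split_into_macroblocks(luma)))   # (720/16) * (1280/16) = 3600 macroblocks
```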

2.2 Divide sub-blocks

H.264 uses plain 16x16 macroblocks for relatively flat parts of the image. To achieve a higher compression rate, however, a 16x16 macroblock can be divided further into smaller sub-blocks.

A sub-block can be 16x8, 8x16, 8x8, 8x4, 4x8, or 4x4, which is very flexible.
[Figure: a 16x16 macroblock covering blue background and parts of three eagles]
In the figure above, most of the 16x16 macroblock inside the red frame is blue background, but parts of the three eagles also fall within the macroblock. To handle the eagle details better, H.264 divides the 16x16 macroblock into multiple sub-blocks.
[Figure: the macroblock divided into sub-blocks of various sizes]
Divided this way, intra-frame compression produces a more compact result.
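A toy illustration of the idea, using variance as the flatness measure; the real H.264 mode decision minimizes a rate-distortion cost and also allows the rectangular partitions listed above:

```python
import numpy as np

def partition(mb, size=16, flat_threshold=50.0):
    """Keep flat blocks whole, recursively split detailed ones (16 -> 8 -> 4)."""
    if size == 4 or mb.var() < flat_threshold:
        return [(size, mb)]
    half = size // 2
    parts = []
    for y in (0, half):
        for x in (0, half):
            parts.extend(partition(mb[y:y+half, x:x+half], half, flat_threshold))
    return parts
```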
The figure below shows the result of compressing the macroblock above with MPEG-2 and with H.264: the left half is the result of MPEG-2's sub-block division and the right half is H.264's. H.264's division method is clearly the more effective of the two.
[Figure: MPEG-2 vs. H.264 sub-block division results]
Once the macroblocks of every picture have been divided, all the pictures in the H.264 encoder's buffer can be grouped.

2.3 Frame Grouping

Video data contains two main kinds of redundancy: temporal redundancy and spatial redundancy. Of the two, temporal redundancy is the larger, so let's talk about it first.

Why is the temporal redundancy the greatest? Suppose the camera captures 30 frames per second; in most cases those 30 frames are correlated with one another, and the correlation may extend well beyond 30 frames, with dozens or even hundreds of frames all closely related.

For these closely related frames we actually only need to store one frame in full; the other frames can be predicted from it according to certain rules. This is why video data has the most redundancy in time.

To compress the data by predicting related frames, the video frames must first be grouped. So how do we determine that certain frames are closely related and can be placed in one group?

Let's look at an example. Below is a set of captured video frames of a moving billiard ball; the ball rolls from the upper right corner to the lower left corner.
[Figure: captured frames of the rolling billiard ball]
[Figure: two adjacent frames from the sequence]
Through macroblock scanning and searching, it can be seen that the correlation between these two frames is very high, and in fact the whole group of frames is highly correlated. Such frames can therefore be placed in one group, called a sequence or GOP (group of pictures) in H.264. The rule of thumb is: if, across several adjacent pictures, no more than 10% of the pixels differ, the luma changes by no more than 2%, and the chroma changes by no more than 1%, then those pictures can be grouped together.
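A sketch of that rule, with the thresholds taken from the text above (real encoders decide GOP boundaries with scene-cut detection and rate-distortion analysis rather than fixed thresholds):

```python
import numpy as np

def same_group(luma_a, luma_b, chroma_a, chroma_b, pix_tol=2):
    """Heuristic: do two adjacent frames belong in the same group?"""
    diff = np.abs(luma_a.astype(int) - luma_b.astype(int))
    changed = np.mean(diff > pix_tol)                               # fraction of differing pixels
    luma_delta = abs(float(luma_a.mean()) - float(luma_b.mean())) / 255.0
    chroma_delta = abs(float(chroma_a.mean()) - float(chroma_b.mean())) / 255.0
    # <= 10% of pixels differ, luma changes <= 2%, chroma changes <= 1%
    return changed <= 0.10 and luma_delta <= 0.02 and chroma_delta <= 0.01
```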

Within such a group, after encoding we keep the complete data of only the first frame; every other frame is computed with reference to a previous frame. The first frame is called the IDR/I-frame, the other frames are called P/B-frames, and the encoded group of frames is called a GOP.

So if the scene rarely changes, a video sequence contains few I-frames; if the scene changes frequently and drastically, new I-frames keep appearing wherever the change is large.

2.4 Inter prediction and motion compensation

After the H.264 encoder has grouped the frames, it must compute the motion vectors of the objects within the group. Taking the moving billiard ball above as the example again, let's see how the motion vector is computed.

The H.264 encoder first fetches two frames in sequence from the head of the buffer and scans their macroblocks. When an object is found in one picture, a search is carried out in the corresponding neighborhood (the search window) of the other picture; if the object is found there, its motion vector can be computed.
[Figure: block search locating the ball in the adjacent frame]
From the difference in the ball's position between the two pictures, the direction and distance of its motion can be computed. H.264 records the distance and direction the ball moves in each frame in turn, producing the following:
[Figure: per-frame motion vectors of the ball]
Once the motion vectors are computed, the identical parts (that is, the green background) are subtracted, leaving the compensation data. In the end we only need to compress and store the motion vectors and this compensation data, and the original picture can be restored at decoding time. The compensation data amounts to only a small quantity of data, as shown below:
[Figure: residual left after motion compensation]
Motion vectors plus compensation are together called inter-frame compression, and this is the technique that removes the temporal redundancy of video frames. This step produces the P/B-frames.
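A sketch of the block matching behind this, using an exhaustive search over the window with SAD (sum of absolute differences) as the cost; production encoders use much faster search patterns and refine the vector to sub-pixel accuracy:

```python
import numpy as np

def motion_search(ref, cur, by, bx, block=16, radius=8):
    """Motion vector for the block of cur at (by, bx), searched for in ref."""
    target = cur[by:by+block, bx:bx+block].astype(int)
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue                       # candidate falls outside the picture
            sad = int(np.abs(ref[y:y+block, x:x+block].astype(int) - target).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad   # the residual is target minus the matched block
```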

Besides inter-frame compression, intra-frame compression is also needed; it removes the spatial redundancy within a frame. Let's introduce intra-frame compression next.

2.5 Intra prediction

The human eye perceives images selectively: it is very sensitive to low-frequency luminance but much less sensitive to high-frequency luminance. Research on this made it possible to discard the image data the eye barely notices, and the intra prediction technique was developed on this basis.

H.264's intra-frame compression is very similar to JPEG's. After a picture is divided into macroblocks, each 4x4 block can be predicted with one of nine modes, and the encoder selects the mode whose prediction comes closest to the original image.

A comparison of the intra-predicted image with the original image is shown below:
[Figure: intra-predicted image vs. original image]
Then the residual is obtained by subtracting the intra prediction from the original image:
[Figure: residual between the original and intra-predicted images]

The prediction-mode information recorded earlier is stored along with the residual, so that the original image can be restored at decoding time. The effect is as follows:
[Figure: image reconstructed from the prediction modes plus the residual]
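A sketch of 4x4 intra prediction using three of the nine modes (vertical, horizontal, and DC, with the standard DC rounding); mode selection here is by SAD, whereas a real encoder minimizes a rate-distortion cost:

```python
import numpy as np

def intra_predict_4x4(block, top, left):
    """Pick the best of three intra modes from the reconstructed neighbors."""
    candidates = {
        "vertical": np.tile(top, (4, 1)),                   # copy the row above downward
        "horizontal": np.tile(left.reshape(4, 1), (1, 4)),  # copy the left column rightward
        "dc": np.full((4, 4), (top.sum() + left.sum() + 4) // 8),
    }
    best = min(candidates, key=lambda m: np.abs(block - candidates[m]).sum())
    return best, block - candidates[best]                   # chosen mode + residual

top = np.array([100, 102, 101, 99])     # pixels above the block
left = np.array([100, 101, 103, 102])   # pixels to its left
block = np.full((4, 4), 101)
mode, residual = intra_predict_4x4(block, top, left)
print(mode, np.abs(residual).sum())     # a flat block is predicted almost perfectly
```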
After intra-frame and inter-frame compression the data is greatly reduced, but there is still room for optimization. This intra step is mainly what compresses the I-frame.

2.6 DCT transform of the residual data

An integer discrete cosine transform (DCT) can be applied to the residual data to remove its remaining correlation and compress it further.

As shown in the figure below, the left side is the original macroblock and the right side is the computed residual macroblock.
[Figure: original macroblock (left) and residual macroblock (right)]
Written out as numbers, the residual macroblock looks like this:
[Figure: residual macroblock as a matrix of values]
Applying the DCT to the residual macroblock gives:
[Figure: residual macroblock after the DCT transform]
With the correlation removed, the data is compressed further still. But the DCT alone is not enough; CABAC is needed next for lossless compression.
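The 4x4 forward core transform H.264 uses is small enough to show directly; the sketch below applies the standard matrix Cf but omits the scaling that real encoders fold into quantization:

```python
import numpy as np

# H.264's 4x4 integer core transform matrix (an integer approximation of the DCT).
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_core(residual_4x4):
    return Cf @ residual_4x4 @ Cf.T   # integer arithmetic only

# A flat residual block: all its energy collapses into the top-left (DC)
# coefficient, which is what makes the later entropy coding so effective.
d = np.full((4, 4), 3)
print(forward_core(d))   # only position [0, 0] is nonzero
```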

2.7 CABAC Compression

The compression described above is lossy; that is, once the image has been compressed it cannot be restored exactly (the loss comes from quantization). CABAC, by contrast, is a lossless compression technique.

The lossless compression technique most people know best is probably Huffman coding, which assigns short codes to frequent symbols and long codes to rare ones to achieve compression.
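To make that concrete, here is a minimal Huffman coder; it illustrates VLC-style coding only, since CABAC itself is an adaptive binary arithmetic coder rather than a Huffman coder:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table: frequent symbols get shorter codes."""
    heap = [[freq, [sym, ""]] for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]   # left branch
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]   # right branch
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

text = "AAAAAAAABBBBCCZ"
codes = huffman_codes(text)
print(codes)   # 'A' (frequent) gets the shortest code, 'Z' (rare) the longest
bits = sum(len(codes[c]) for c in text)
print(bits, "bits vs", 8 * len(text), "bits uncompressed")
```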

The VLC (variable-length coding) used in MPEG-2 is this kind of algorithm. Take A and Z as examples, where A is high-frequency (common) data and Z is low-frequency (rare) data; the figure shows how they are coded:
[Figure: VLC coding of the symbols A and Z]
CABAC likewise gives short codes to high-frequency data and long codes to low-frequency data, but in addition it compresses according to context, which makes it considerably more efficient than VLC. The effect is as follows:
[Figure: CABAC coding of the same symbols]
Now replace A and Z with video frames, and it becomes the following:
[Figure: VLC vs. CABAC compression of video frames]
From the figure above it is obvious that the lossless compression scheme using CABAC is much more efficient than VLC.

That concludes the encoding principles of AVC/H.264; follow-up articles will continue with the encoding processes of HEVC and VVC.
