Audio and video - compression principle

The H264 video compression algorithm is undoubtedly the most widely used of all video compression techniques today. With open source libraries such as x264/openh264 and ffmpeg, most users no longer need to study the details of H264, which greatly lowers the barrier to using it.

But to make good use of H264, we still need to understand its basic principles. Today we will take a look at them.

H264 overview


H264 mainly uses the following techniques to compress video data:

  1. Intra-frame prediction, which removes spatial redundancy.
  2. Inter-frame prediction (motion estimation and compensation), which removes temporal redundancy.
  3. An integer discrete cosine transform (DCT), which converts spatially correlated data into largely uncorrelated frequency-domain coefficients that are then quantized.
  4. CABAC entropy coding.

The compressed frames are divided into I frames, P frames, and B frames:

  • I frame: key frame, compressed using intra-frame techniques only.
  • P frame: forward-predicted frame; during compression it references only frames already processed before it, using inter-frame compression.
  • B frame: bidirectionally predicted frame; during compression it references both a preceding frame and a following frame, using inter-frame compression.

In addition to I/P/B frames, there is also the picture group, the GOP.

GOP: the sequence of pictures between two I frames; each such sequence contains exactly one I frame.
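To make the layout concrete, here is a minimal Python sketch (illustrative only: `gop_size` and `b_frames` are made-up parameters, not H264 defaults) that prints the frame types of one GOP in display order:

```python
# Sketch: lay out frame types for one GOP in display order.
# gop_size and b_frames are illustrative parameters, not H264 defaults.
def gop_pattern(gop_size=12, b_frames=2):
    frames = ["I"]  # every GOP starts with exactly one I frame
    while len(frames) < gop_size:
        # insert up to b_frames B frames, then an anchor P frame
        for _ in range(b_frames):
            if len(frames) >= gop_size - 1:
                break
            frames.append("B")
        if len(frames) < gop_size:
            frames.append("P")
    return frames

print("".join(gop_pattern()))  # e.g. IBBPBBPBBPBP
```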

Let's describe the H264 compression technology in detail.

H264 compression technology

The basic principle of H264 is actually quite simple; let's briefly walk through how H264 compresses data. Video frames captured by the camera (say, at 30 frames per second) are sent to the H264 encoder's buffer, and the encoder first divides each picture into macroblocks.

Take a picture of three eagles against a blue sky as an example.


dividing macroblocks

By default, H264 uses a 16x16 pixel area as a macroblock; it can also be divided down to 8x8 blocks.

After the macroblocks are divided, the pixel values of each macroblock are calculated.


In the same way, the pixel values of every macroblock in the image are calculated, and all macroblocks are processed in turn.

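As a rough sketch of this step (assuming a grayscale frame whose sides are multiples of 16, with numpy standing in for the encoder's internals):

```python
import numpy as np

def macroblock_means(img, mb=16):
    """Split a grayscale image into mb x mb macroblocks and
    return the mean pixel value of each block."""
    h, w = img.shape
    assert h % mb == 0 and w % mb == 0, "pad the image to a multiple of mb first"
    # reshape into a (rows, mb, cols, mb) grid of blocks
    blocks = img.reshape(h // mb, mb, w // mb, mb)
    return blocks.mean(axis=(1, 3))

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # stand-in for a camera frame
print(macroblock_means(img).shape)  # (4, 4): one value per 16x16 macroblock
```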

sub-blocks

H264 uses 16x16 macroblocks for relatively flat regions of an image. For a higher compression rate, however, a 16x16 macroblock can be further divided into smaller sub-blocks of size 16x8, 8x16, 8x8, 8x4, 4x8, or 4x4, which is very flexible.


In the example picture, most of the 16x16 macroblock outlined in red is blue background, while parts of the three eagles fall inside it. To handle those partial images of the eagles better, H264 divides that 16x16 macroblock into multiple sub-blocks.


In this way, intra-frame compression yields more compact data. Compressing the same macroblock with MPEG-2 and with H264 shows that H264's more flexible sub-block division produces the better result; a toy sketch of the split decision follows.

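The encoder's real mode decision is rate-distortion based, but a toy sketch conveys the intuition: split a 16x16 macroblock into four 8x8 sub-blocks only when four independent predictions fit the content better than one. The `predict` function here is a crude stand-in for real intra/inter prediction:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences: a simple distortion measure."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def should_split(block, predict):
    """Toy mode decision: keep 16x16 if its prediction is good enough,
    otherwise split into four 8x8 sub-blocks. Real encoders also weigh
    the bit cost of signaling each choice."""
    whole_cost = sad(block, predict(block))
    split_cost = sum(
        sad(sub, predict(sub))
        for sub in (block[:8, :8], block[:8, 8:], block[8:, :8], block[8:, 8:])
    )
    return split_cost < whole_cost

flat_predict = lambda b: np.full_like(b, int(b.mean()))  # crude DC-style predictor
block = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
print("split 16x16 into 8x8:", should_split(block, flat_predict))
```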

After the macroblocks are divided, the pictures in the H264 encoder's buffer can be grouped.

frame grouping

Video data has two main kinds of redundancy: temporal redundancy and spatial redundancy, of which temporal redundancy is the largest. Let's start with temporal redundancy.

Why is temporal redundancy the greatest? Suppose the camera captures 30 frames per second; in most cases those 30 frames are highly correlated, and the correlation may extend well beyond 30 frames - dozens or even hundreds of frames can be closely related.

For such closely related frames we really only need to store one frame completely; the others can be predicted from it according to certain rules. That is why temporal redundancy dominates in video data.

To compress data by predicting related frames, the video frames must first be grouped. How do we determine that certain frames are closely related and belong in one group? Consider an example: a captured sequence of a moving billiard ball, rolling from the upper right corner to the lower left corner.



The H264 encoder takes out two adjacent frames at a time, compares them macroblock by macroblock, and calculates their similarity.

Through macroblock scanning and searching, we find that the correlation between these two frames is very high, and indeed that the whole run of frames is highly correlated, so they can be placed in one group. The rule of thumb: across several adjacent frames, if no more than about 10% of the pixels differ, the luma changes by no more than 2%, and the chroma changes by no more than 1%, we consider the frames similar enough to form one group.
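A hedged sketch of such a grouping test, using the thresholds quoted above (real encoders implement this differently; the YUV planes are assumed to be numpy arrays):

```python
import numpy as np

# Thresholds taken from the rule of thumb described above; the exact test
# an encoder uses is an implementation detail, so treat this as a sketch.
def same_group(y1, u1, v1, y2, u2, v2):
    """Decide whether two YUV frames are similar enough to share a GOP."""
    changed = np.mean(y1 != y2)  # fraction of differing pixels
    luma_diff = np.mean(np.abs(y1.astype(int) - y2.astype(int))) / 255
    chroma_diff = max(
        np.mean(np.abs(u1.astype(int) - u2.astype(int))),
        np.mean(np.abs(v1.astype(int) - v2.astype(int))),
    ) / 255
    return changed <= 0.10 and luma_diff <= 0.02 and chroma_diff <= 0.01

y = np.random.randint(0, 256, (120, 160)); u = v = np.full((60, 80), 128)
print(same_group(y, u, v, y, u, v))  # identical frames trivially share a group
```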

Within such a group, after encoding we keep the complete data only for the first frame; the other frames are computed by reference to it. The first frame is called the IDR/I frame, the others are called P/B frames, and the encoded group of frames is called a GOP.

Motion Estimation and Compensation

After the frames are grouped, the H264 encoder needs to calculate the motion vectors of the objects within the group. Taking the moving billiard ball above as an example again, let's see how the motion vector is calculated.

The H264 encoder first fetches two frames of video data from the head of its buffer and scans them macroblock by macroblock. When an object is found in one picture, a search is performed in the neighborhood (the search window) of the corresponding position in the other picture. If the object is found there, its motion vector can be calculated.

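A minimal full-search sketch of this step: for a macroblock in the current frame, scan a search window in the reference frame and return the offset with the smallest sum of absolute differences (SAD) as the motion vector. Real encoders use faster search patterns, but the idea is the same:

```python
import numpy as np

def motion_search(ref, cur, y, x, mb=16, radius=8):
    """Full search: find the motion vector (dy, dx) that best matches the
    mb x mb block of `cur` at (y, x) inside a search window of `ref`."""
    block = cur[y:y + mb, x:x + mb].astype(int)
    best, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + mb > ref.shape[0] or rx + mb > ref.shape[1]:
                continue  # candidate window falls outside the reference frame
            cand = ref[ry:ry + mb, rx:rx + mb].astype(int)
            cost = np.abs(block - cand).sum()
            if best is None or cost < best:
                best, best_mv = cost, (dy, dx)
    return best_mv, best

# toy frames: the "ball" (a bright square) moves 3 px down, 2 px right
ref = np.zeros((64, 64), dtype=np.uint8); ref[10:20, 10:20] = 255
cur = np.zeros((64, 64), dtype=np.uint8); cur[13:23, 12:22] = 255
print(motion_search(ref, cur, 16, 16))  # ((-3, -2), 0): vector back to the ball in ref
```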


The direction and distance of the ball's movement can be calculated from the difference in the billiard ball's position between the two pictures. H264 records, frame by frame, the distance and direction the ball moves.


After the motion vectors are calculated, the identical parts (that is, the green background) are subtracted to obtain the compensation data. In the end only the compensation data needs to be compressed and saved, and the original image can be restored from it during decoding. The compressed compensation data takes only a small amount of space.


Motion estimation and compensation together form inter-frame compression, which removes the temporal redundancy of video frames. Besides inter-frame compression, intra-frame compression is also needed; it removes spatial redundancy. Let's look at intra-frame compression next.

intra prediction

The human eye is very sensitive to low-frequency luminance but far less sensitive to high-frequency detail. Based on this observation, data the eye is insensitive to can be removed from an image; intra prediction builds on this idea.

H264's intra-frame compression is very similar to JPEG's. After an image is divided into macroblocks, each block can be predicted in one of several modes (nine modes for 4x4 luma blocks), and the encoder chooses the prediction that comes closest to the original image.
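As an illustration, a sketch of three of the simpler 4x4 luma prediction modes (vertical, horizontal, DC), built from the neighboring pixels above and to the left of the block; the encoder keeps whichever mode yields the smallest residual:

```python
import numpy as np

def intra_4x4_predictions(top, left):
    """Build three of H264's 4x4 intra predictions from the pixels just
    above (`top`, 4 values) and to the left (`left`, 4 values) of the block.
    The standard defines 9 modes for 4x4 luma blocks; this sketch keeps 3."""
    return {
        "vertical":   np.tile(top, (4, 1)),                 # copy the row above downwards
        "horizontal": np.tile(left.reshape(4, 1), (1, 4)),  # copy the left column rightwards
        "DC":         np.full((4, 4), (top.sum() + left.sum() + 4) // 8),
    }

def best_mode(block, top, left):
    preds = intra_4x4_predictions(top, left)
    costs = {m: np.abs(block.astype(int) - p.astype(int)).sum() for m, p in preds.items()}
    mode = min(costs, key=costs.get)
    return mode, block.astype(int) - preds[mode].astype(int)  # mode + residual

top = np.array([90, 120, 100, 80]); left = np.array([100, 100, 100, 100])
block = np.tile(top, (4, 1)) + np.random.randint(-2, 3, (4, 4))  # strong vertical texture
mode, residual = best_mode(block, top, left)
print(mode)  # "vertical"; only the mode and the small residual need to be coded
```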

Prediction proceeds this way for every macroblock in the whole picture.


Comparing the intra-predicted image with the original shows they are already very close.


The residual is then obtained by subtracting the intra-predicted image from the original image.


Saving the prediction mode information obtained earlier together with the residual allows the original image to be restored at decode time.


After intra-frame and inter-frame compression the data has been greatly reduced, but there is still room for optimization.

DCT on the residual data

An integer discrete cosine transform can be applied to the residual data to remove its remaining correlation and compress it further.


The residual macroblock is then written out as numeric values.


The integer DCT is applied to the residual macroblock.
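H264's 4x4 core transform is an integer approximation of the DCT, computed as Y = C·X·Cᵀ with a small integer matrix. A sketch (the quantization step is simplified to plain division here; the standard folds scaling factors into quantization instead):

```python
import numpy as np

# H264's 4x4 forward core transform matrix (integer approximation of the DCT)
C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def core_transform(residual_4x4):
    """Y = C * X * C^T : concentrates the block's energy into few coefficients."""
    return C @ residual_4x4 @ C.T

residual = np.array([[5, 4, 4, 5],
                     [4, 3, 3, 4],
                     [4, 3, 3, 4],
                     [5, 4, 4, 5]])
coeffs = core_transform(residual)
qstep = 16                        # toy quantization step; H264 derives it from QP
quantized = np.round(coeffs / qstep).astype(int)
print(quantized)                  # mostly zeros: the small values quantize away
```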

With the correlation removed, the energy concentrates into a few coefficients and the data is visibly further compressed.


The DCT alone is not enough, though: CABAC lossless compression is still required.

The principle of motion coding in plain language

Here is the first frame, P1 (our reference frame):

This is the second frame: P2 (the frame that needs to be encoded)


These two pictures were taken from a video about 1-2 seconds apart, which matches real-world conditions. Let's run a few motion searches.

Search Demo 1: in a demonstration program, select any 16x16 block on P2 with the mouse and search for the BestMatch macroblock on P1. Although the vehicle is moving from far to near, the closest macroblock is still found.



Search Demo 2: Aerial wire crossing location (P1 above, P2 below)



Again, the macroblock position in P1 closest to the selected area (the poster) in P2 is found successfully.

Full-image search: P2' is restored from P1 plus the motion vector data. That is, for each macroblock of P2 we find the most similar position in P1, and from the P1 macroblocks at those positions we piece together the picture P2' that best approximates P2.


Look closely: it's a bit fragmented, right? Of course, since we pieced it together. Now subtract P2' and P2 pixel by pixel to get the difference image D2 = (P2' - P2) / 2 + 0x80:

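A numpy sketch of this formula and its inverse (assuming P2' and P2 are grayscale frames; the /2 + 0x80 mapping packs signed differences into a displayable 0-255 image at the cost of one bit of precision):

```python
import numpy as np

def diff_image(p2_pred, p2):
    """D2 = (P2' - P2) / 2 + 0x80 : signed residual mapped into [0, 255]."""
    return ((p2_pred.astype(int) - p2.astype(int)) // 2 + 0x80).clip(0, 255).astype(np.uint8)

def restore(p2_pred, d2):
    """Invert the mapping: P2 ~= P2' - (D2 - 0x80) * 2."""
    return (p2_pred.astype(int) - (d2.astype(int) - 0x80) * 2).clip(0, 255).astype(np.uint8)

p2 = np.random.randint(0, 256, (32, 32), dtype=np.uint8)        # the real frame
p2_pred = (p2.astype(int) + np.random.randint(-6, 7, (32, 32)))  # pieced-together P2'
p2_pred = p2_pred.clip(0, 255).astype(np.uint8)
d2 = diff_image(p2_pred, p2)
print(np.abs(restore(p2_pred, d2).astype(int) - p2.astype(int)).max())  # <= 1 (halving loses a bit)
```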

Adding the error D2 back onto the previously fragmented P2' makes it clearly visible again, essentially restoring the original picture P2.


Since the compressed D2 occupies only about 5 KB and, together with the compressed motion vectors, totals roughly 7 KB, we need just 7 KB of extra data to represent P2 fully by reference to P1. Compressing P2 independently with a lossy codec at acceptable quality would take at least 50-60 KB, so this saves almost 8x the space. That is the basic principle of so-called motion coding.

In practice the reference frame is not necessarily the previous frame, nor necessarily the I frame of the same GOP: when the GOP is long, later pictures may differ greatly from the I frame. A common practice is to choose, among the last 15 frames, the frame with the smallest error as the reference. And although a color image has three components (Y, U, V), most prediction work and best-match selection is judged on the grayscale frame of the Y component alone.

Furthermore, the error we saved above is (P2' - P2)/2 + 0x80. In practice a more efficient scheme is used: for example, store differences in [-64, 64] with a precision of 1, and differences in [-255, -64] and [64, 255] with a precision of 2-3, which matches perception better.
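A sketch of that nonuniform precision idea, using a step of 1 inside [-64, 64] and a step of 3 outside (the bands and steps are just the figures quoted above; real encoders express this through the quantizer):

```python
import numpy as np

def quantize_error(d):
    """Store small differences exactly (step 1 in [-64, 64]) and large ones
    more coarsely (step 3 outside), since the eye barely notices the latter."""
    d = np.asarray(d, dtype=int)
    coarse = np.abs(d) > 64
    q = d.copy()
    q[coarse] = np.sign(d[coarse]) * (64 + ((np.abs(d[coarse]) - 64) // 3) * 3)
    return q

errors = np.array([-200, -70, -10, 0, 3, 64, 65, 100, 255])
print(quantize_error(errors))  # small values pass through, large ones snap to a coarser grid
```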

Also, in several places above plain lzma2 was used for simple storage. In practice entropy coding is introduced instead, and the data is reordered to some degree before compression, which performs much better.

CABAC

The intra-frame compression described above is lossy: once the image is compressed, it cannot be restored exactly. CABAC, by contrast, is a lossless compression technique.

The lossless compression technique most people know is probably Huffman coding, which gives short codes to high-frequency symbols and long codes to low-frequency ones. The VLC used in MPEG-2 is such an algorithm. Take the letters A through Z as an example, where A is high-frequency data and Z is low-frequency, and see how it is done.

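For reference, a compact Huffman sketch using Python's heapq, showing how the frequent symbol A ends up with the shortest code:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman table: frequent symbols receive the shortest codes."""
    counts = Counter(data)
    # heap entries: (frequency, unique tiebreaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # pop the two rarest subtrees...
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))  # ...merge and push back
    return heap[0][2]

text = "AAAAAAAABBBCCZ"             # A is frequent, Z is rare
for sym, code in sorted(huffman_codes(text).items()):
    print(sym, code)                # A gets the shortest code, Z among the longest
```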

CABAC also gives short codes to high-frequency data and long codes to low-frequency data, but in addition it compresses according to context, which makes it much more efficient than VLC.


Now replace the letters A through Z with video frames and the same picture emerges.


It is clear that lossless compression with CABAC is much more efficient than with VLC.


Origin: blog.csdn.net/qq_39431405/article/details/131922302