Understanding Video Codec Principles in One Article

Video encoding and decoding algorithms fall into two families: traditional algorithms and deep-learning-based methods. This article mainly introduces the principles of traditional video encoding and decoding technology; some content and pictures are taken from online technical blogs (links are at the end of the article).

1. Basic terms

For the definition and understanding of digital images, see: Notes on Digital Image Processing | Understanding the Basics of Digital Images.

  1. Color depth: storing the color intensity of each pixel requires a certain amount of data; this amount is the color depth. For the RGB color model, the color depth is 24 bits (8 bits * 3 channels).
  2. Image resolution: the number of pixels in an image, usually expressed as width * height.
  3. Image/video aspect ratio: the proportional relationship between the width and height of an image or pixel.
  4. Bit rate: the amount of data required to play one second of video: bit rate = width * height * color depth * frames per second. For example, a video with 30 frames per second, 24 bits per pixel, and a resolution of 480x240 would require 82,944,000 bits per second, or about 82.944 Mbps (480 x 240 x 24 x 30), if we did no compression at all (see the sketch below). When the bit rate is nearly constant it is called constant bit rate (CBR); when it varies it is called variable bit rate (VBR).
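
To make the bit rate formula concrete, here is a minimal sketch in Python that reproduces the numbers from the example above:

```python
# Uncompressed bit rate = width * height * color depth * frames per second
width, height = 480, 240   # resolution
color_depth = 24           # bits per pixel (8 bits per RGB channel)
fps = 30                   # frames per second

bitrate = width * height * color_depth * fps
print(f"{bitrate} bits/s = {bitrate / 1e6} Mbps")
# 82944000 bits/s = 82.944 Mbps
```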

The graphic below shows a constrained VBR that does not spend much data while the frames are black.
[Figure: bit rate over time for a constrained VBR, dropping during black frames]

1.1 Color, brightness, and our eyes

Because the human eye has many more rods (which sense brightness) than cones (which sense color), it is reasonable to infer that we are better at distinguishing dark from light than at distinguishing colors.

Our eyes are more sensitive to brightness than to color; take a look at the picture below to test it.
[Figure: checker shadow illusion test image]

If you can't see that square A and square B on the left are the same color, that's because our brain plays a trick that makes us pay more attention to light and dark than to color. On the right, a connector of the same color joins the two squares, so we (our brain) can easily tell that they are in fact the same color.

2. The implementation principle of video coding

2.1 Overview of video coding technology

The purpose of encoding is compression. An encoding algorithm finds patterns in order to build an efficient model that removes redundant information from video data.

Common video redundancy information and corresponding compression methods are as follows:

| Type | Content | Compression method |
| --- | --- | --- |
| Spatial redundancy | Correlation between neighboring pixels | Transform coding, predictive coding |
| Temporal redundancy | Correlation along the time axis | Inter-frame prediction, motion compensation |
| Structural redundancy | Regular structure of the image itself | Contour coding, region segmentation |
| Knowledge redundancy | Knowledge about the depicted content shared by sender and receiver | Knowledge-based coding |
| Visual redundancy | Characteristics of the human visual system | Nonlinear quantization, bit allocation |
| Other | Uncertainty factors | — |

An example of video frame redundancy information is shown in the following figure:
[Figure: example of redundant information across video frames]

2.2 Frame types

We know that a video is formed by playing different frames in sequence. Video frames are mainly divided into three types: (1) I-frames; (2) P-frames; (3) B-frames.

  • I-frame (key frame, intra-frame coded): an independent frame that carries all of its own information. It is the most complete picture (and takes up the most space) and can be decoded on its own without referring to other frames. The first frame in a video sequence is always an I-frame.
  • P-frame (predicted): an "inter-frame predictive coded frame" that must reference parts of a previous I-frame and/or P-frame in order to be decoded. P-frames therefore depend on earlier I- and P-frames, but they achieve a relatively high compression rate and take up less space.
  • B-frame (bidirectionally predicted): a "bidirectional predictive coded frame" that references both preceding and subsequent frames. Because it predicts in both directions, its compression ratio is the highest, reaching up to 200:1. However, the dependence on subsequent frames makes it unsuitable for real-time transmission (such as video conferencing).

I-frames are processed with intra-frame coding (intra-frame prediction), which exploits only the spatial correlation within the frame itself.

P-frames are processed with inter-frame coding (forward motion estimation), which exploits spatial and temporal correlation at the same time. Simply put, a motion compensation algorithm is used to remove the redundant information.
[Figure: processing of I-frames and P-frames]

2.3 Intra-frame coding (intra-frame prediction)

Intra-frame coding/prediction addresses the spatial redundancy within a single frame. If we analyze each frame of a video, we find that many regions are correlated.
[Figure: correlated regions within a single frame]

As an example of intra-frame coding, consider the picture below: most areas of the picture have the same color. Suppose this is an I-frame and we are about to encode the red region. Assume that colors are vertically consistent within the frame, i.e. each unknown pixel has the same color as the pixel above it.
[Figure: frame with large uniformly colored areas; the red region is to be encoded]

Such a prediction will not be exact, but we can apply this technique (intra-frame prediction) first and then subtract the predicted values from the actual values to obtain the residual. The resulting residual matrix is much easier to compress than the original data.
[Figure: residual after subtracting the intra prediction from the actual values]
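
The following minimal sketch shows this idea with NumPy (the pixel values are made up for illustration): a 4x4 block is predicted by copying the row of reference pixels above it downwards, so only the small residual would need to be encoded.

```python
import numpy as np

# Reconstructed pixels in the row directly above the block (hypothetical values)
reference_row = np.array([100, 101, 99, 98])

# The actual 4x4 block we want to encode (hypothetical values)
actual_block = np.array([[101, 101, 100, 99],
                         [102, 100, 100, 98],
                         [103, 101,  99, 97],
                         [104, 102,  98, 96]])

prediction = np.tile(reference_row, (4, 1))  # "vertical" prediction: copy row down
residual = actual_block - prediction         # small numbers, easy to compress
print(residual)
```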

2.4 Inter-frame coding (inter-frame prediction)

Video frames also repeat over time; inter-frame coding/prediction is the technique that removes this kind of redundancy.

Suppose we want to encode two temporally consecutive frames, frame 0 and frame 1, with a small amount of data. For example, we can simply subtract frame 0 from frame 1 to obtain a residual, so we only need to encode that residual, as sketched in the code after the figures below.
[Figure: frame 0 and frame 1]
[Figure: residual of frame 1 minus frame 0]
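
A minimal sketch of this frame-differencing idea (with synthetic stand-in frames, not real video data):

```python
import numpy as np

frame0 = np.random.randint(0, 256, (240, 480)).astype(np.int16)  # reference frame
frame1 = frame0.copy()
frame1[100:110, 200:210] += 5        # only a small region changes

residual = frame1 - frame0           # mostly zeros -> compresses very well
reconstructed = frame0 + residual    # the decoder reverses the subtraction
assert np.array_equal(reconstructed, frame1)
```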

Plain subtraction is simple and crude, and the result is not very good; there are better ways to save data. First, we treat frame 0 as a collection of blocks, and then we try to match each block of frame 1 against a block of frame 0. We can think of this as motion estimation.

Motion compensation is a way of describing the difference between adjacent frames ("adjacent" here means adjacent in encoding order; the two frames may not be adjacent in playback order). Specifically, it describes how each small block of the previous frame moves to a certain position in the current frame.

insert image description here

As shown above, we expect the ball to move from x=0, y=25 to x=6, y=26; these x and y values form the motion vector. A further way to save data is to encode only the difference between the two positions, so the final motion vector is x=6 (6-0), y=1 (26-25). Motion estimation will usually not find a perfectly matching block, but the amount of encoded data is still smaller than with the plain residual-frame technique. The comparison is shown in the following figure:
[Figure: comparison of motion prediction vs. plain residual encoding]
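
Below is a minimal sketch of exhaustive block matching; the function name and parameters are illustrative, not taken from any codec API. For each block it searches a small window in the reference frame for the position with the lowest sum of absolute differences (SAD); the offset of that position is the motion vector.

```python
import numpy as np

def best_motion_vector(block, ref_frame, top, left, search=7):
    """Find the (dy, dx) motion vector minimizing SAD within +/- search pixels."""
    h, w = block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref_frame.shape[0] or x + w > ref_frame.shape[1]:
                continue  # candidate block falls outside the reference frame
            candidate = ref_frame[y:y + h, x:x + w].astype(np.int16)
            sad = np.abs(block.astype(np.int16) - candidate).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```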

3. How an actual video encoder works

3.1 Video containers (video data encapsulation)

First of all, video encoders and video containers are different things. The common video file extensions .mp4, .mkv, .avi, and .mpeg actually denote video containers. Video container definition: a container packs the encoded, compressed video track and audio track into one file according to a specific format; that specific file type is the video container.

3.2 Encoder development history

The development history of the video encoder is shown in the figure below:
[Figure: timeline of video encoder development]

3.3 General encoder workflow

Although video encoders have evolved over several decades, they still share the same main working mechanism.

3.3.1 Step 1: picture partitioning

The first step is to divide the frame into several partitions, sub-partitions, and possibly further levels. Partitioning allows predictions to be handled more precisely: small partitions for small moving parts and larger partitions for static backgrounds.

Typically, codecs organize these partitions into slices (or tiles), macroblocks (or coding tree units), and many sub-partitions. The maximum partition size differs between encoders: HEVC uses 64x64 coding tree units, while AVC uses 16x16 macroblocks, but sub-partitions can be as small as 4x4.
[Figure: frame divided into slices, macroblocks/CTUs, and sub-partitions]
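
A minimal sketch of fixed-size tiling (AVC-style 16x16 macroblocks; real encoders choose partition sizes adaptively):

```python
def partition(frame_h, frame_w, block=16):
    """Yield (top, left, height, width) for each block in a fixed grid."""
    for top in range(0, frame_h, block):
        for left in range(0, frame_w, block):
            yield top, left, min(block, frame_h - top), min(block, frame_w - left)

print(len(list(partition(240, 480))))  # 15 * 30 = 450 macroblocks
```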

3.3.2 Step 2: prediction

With partitions in place, we can make predictions on them. For inter-frame prediction, we need to send the motion vector and the residual; for intra-frame prediction, we need to send the prediction direction and the residual.

3.3.3 Step 3: transform

After we obtain the residual block (predicted partition minus real partition), we can transform it in a way that tells us which pixels we can discard while still maintaining overall quality. Several transforms provide this behavior; here we only introduce the discrete cosine transform (DCT). Its main properties are:

  • It converts a block of pixels into a block of frequency coefficients of the same size.
  • It compacts the energy, making it easier to eliminate spatial redundancy.
  • It is reversible, meaning you can restore the original pixels.

We know that in an image most of the energy is concentrated in the low-frequency part, so if we convert the image into frequency coefficients and discard the high-frequency coefficients, we can reduce the amount of data required to describe the image without sacrificing too much image quality. The DCT converts the original image into blocks of frequency coefficients so that the least significant coefficients can be discarded.
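
A minimal sketch of this energy compaction using SciPy's DCT (a smooth gradient block stands in for real image data):

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.arange(64, dtype=float).reshape(8, 8)  # smooth 8x8 stand-in block

coeffs = dctn(block, norm="ortho")   # pixels -> frequency coefficients

mask = np.zeros_like(coeffs)
mask[:4, :4] = 1                     # keep only the low-frequency 4x4 corner

approx = idctn(coeffs * mask, norm="ortho")  # reconstruct from 25% of the data
print(np.abs(block - approx).max())          # small error despite dropping 75%
```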

If we reconstruct the image from the coefficient block after dropping the unimportant coefficients and compare it with the original image, we get the result shown below.
[Figure: image reconstructed from the reduced coefficients vs. the original image]

The result is very similar to the original image even though we discarded 67.1875% of the coefficients. How to intelligently select which coefficients to discard is the question addressed by the next step.

3.3.4 Step 4: quantization

When we dropped some of the (frequency) coefficients in the last step (transform), we already applied a form of quantization. In this step we selectively remove information (the lossy part); put simply, we quantize the coefficients to achieve compression.

How do we quantize a block of coefficients? A simple approach is uniform quantization: we take the block, divide every coefficient by a single value (say, 10), and round the result.
[Figure: coefficient block divided by 10 and rounded]

How do we reverse (dequantize) this coefficient block? By multiplying each value by the same number (10) that we divided by earlier.
[Figure: dequantized coefficient block after multiplying by 10]
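
A minimal sketch of this quantize/dequantize round trip (the coefficient values are made up for illustration):

```python
import numpy as np

coeffs = np.array([[-305, 21, 5],
                   [  18, -7, 1],
                   [   4,  1, 0]])

step = 10
quantized = np.round(coeffs / step).astype(int)  # lossy: fractions are discarded
dequantized = quantized * step                   # approximate reconstruction

print(quantized)     # what actually gets entropy-coded
print(dequantized)   # close to the original coefficients, but not identical
```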

Uniform quantization is not the best quantization scheme because it does not take the importance of each coefficient into account. Instead of a single value we can use a quantization matrix, which exploits the property of the DCT by quantizing the bottom-right (high-frequency) part more heavily and the top-left (low-frequency) part more lightly. JPEG uses a similar approach; you can see the matrix in its source code.

3.3.5 Step 5: entropy coding

After we quantize the data (blocks/slices/frames), we can still compress it losslessly. Many algorithms are available for this, for example:

  1. Variable-length coding (VLC)
  2. Arithmetic coding
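
As a concrete VLC example, here is a minimal sketch of unsigned exponential-Golomb coding, a variable-length code that H.264/AVC uses for many syntax elements; small (frequent) values get short codewords:

```python
def exp_golomb_encode(n: int) -> str:
    """Unsigned exp-Golomb: (len-1 zeros) + binary(n + 1)."""
    b = bin(n + 1)[2:]               # binary of n+1 without the '0b' prefix
    return "0" * (len(b) - 1) + b

for n in range(5):
    print(n, "->", exp_golomb_encode(n))
# 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101
```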

3.3.6 Step 6: bitstream format

After completing the steps above, the video data is encoded and compressed; we then need to pack the compressed frames and their context into the bitstream. The decoder must be explicitly informed of the encoding parameters: color depth, color space, resolution, prediction information (motion vectors, intra prediction direction), profile, level, frame rate, frame type, frame number, and more.

References

  1. digital_video_introduction
  2. Zero foundation, introduction to the most popular video coding technology in history

Source: blog.csdn.net/qq_20986663/article/details/126893727