Briefly describe the video encoding process of HEVC and VVC

H.265/HEVC video encoding

The purpose of video coding is to compress the original video. The main idea of compression is to remove redundant information along several axes: spatial, temporal, statistical (coding), and visual redundancy. Thanks to its excellent compression ratio and video quality, H.264 has become the most widely used codec standard on the market today. H.265 builds on H.264 and, at the same video quality, can reduce the bit rate of the stream by about 50%. As the H.265 encoding format becomes more and more popular, it is worth walking through its encoding framework; the flow chart is shown below:
[Figure: H.265/HEVC encoding framework flow chart]
As shown in the figure, a video frame is first partitioned into input image blocks, which then pass through three main coding stages: predictive coding, transform coding, and statistical coding. The general flow is: predictive coding first compresses the video frame; transform coding is then applied to the prediction residual, converting the image data from a spatial-domain matrix into a frequency-domain matrix of transform coefficients; the quantized coefficients are finally passed to entropy coding, which produces the binary bitstream. The encoder also contains an internal branch that performs inverse quantization and inverse transformation, reconstructing pixels that serve as references for subsequent predictions, while statistical coding compresses the quantized frequency-domain data. These three processes are described in detail below.
Recently I found a more rigorous and detailed description in a paper, which can be summarized as follows:
Let IV and PV denote the source value and the predicted value of the YUV video, let (·)/Q denote the quantization operation with quantization step Q, and let DT(·) denote the DCT/DST transform.
The quantized transform coefficients are then D(1) = [DT(IV(1) − PV(1)) / Q], where [·] denotes rounding.
The quantized coefficients are decoded in reverse for frame reconstruction, giving the reconstructed residual RES_rt. Defining RT(·) as the rounding and clipping operation and IDT(·) as the inverse DCT/DST transform, we have
RES_rt = RT(IDT(D(1) · Q)),
where multiplication by Q is the inverse quantization. After the inverse transformation, loop filtering is applied; the reconstructed frame can be expressed as
IV'(1) = LoopFilter(PV(1) + RT(IDT(D(1) · Q))),
and the reconstructed frame is then stored in the buffer for prediction of subsequent frames.
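To make these formulas concrete, here is a minimal sketch of the transform/quantize/reconstruct path, assuming a floating-point DCT and a single scalar quantization step; the real codec uses integer transforms, block-based processing, and rate-distortion-optimized quantization:

```python
import numpy as np
from scipy.fft import dctn, idctn  # 2-D DCT as a stand-in for the integer core transform

def encode_block(iv, pv, q):
    """Toy version of D(1) = [DT(IV(1) - PV(1)) / Q]."""
    residual = iv - pv
    coeffs = dctn(residual, norm="ortho")            # DT(.)
    return np.round(coeffs / q)                      # [ . / Q ]  (quantization)

def reconstruct_block(d, pv, q):
    """Toy version of RES_rt = RT(IDT(D(1) * Q)) and IV'(1) = PV(1) + RES_rt."""
    dequant = d * q                                   # inverse quantization
    res_rt = np.round(idctn(dequant, norm="ortho"))   # RT(IDT(.)): inverse transform + rounding
    return np.clip(pv + res_rt, 0, 255)               # reconstructed pixels (before loop filtering)

# Tiny usage example with a random 8x8 luma block and a flat predictor
rng = np.random.default_rng(0)
iv = rng.integers(100, 140, size=(8, 8)).astype(float)   # "source" block IV
pv = np.full((8, 8), 120.0)                               # "predicted" block PV
d = encode_block(iv, pv, q=10)
rec = reconstruct_block(d, pv, q=10)
print("max reconstruction error:", np.abs(rec - iv).max())
```

The reconstruction is close to, but not identical to, the source: quantization is where the loss happens.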

1. Predictive coding

A video is essentially a sequence of continuous frames, and there is a large amount of redundancy both within a single frame and across frames. Spatially, neighboring pixels within a single frame tend to have very similar values. Temporally, many pixels are nearly identical between two consecutive frames. Predictive coding is a data compression method based on the statistical characteristics of images: it exploits the spatial and temporal correlation of images and predicts the pixels currently being encoded from already reconstructed pixel data. Predictive coding can be further divided into intra prediction and inter prediction:
[Figure: classification of predictive coding into intra prediction and inter prediction]

1.1 Intra prediction

Intra-frame prediction means that the pixels used for prediction and the pixels currently being encoded lie in the same video frame, generally in adjacent areas. Because adjacent pixels are strongly correlated, their values are usually very close, abrupt changes are rare, and the difference is 0 or a very small number. Therefore, after intra-frame prediction, only the difference between the predicted value and the true value is transmitted, i.e., a value near 0, called the prediction error or residual, so that fewer bits are needed and compression is achieved.
H.265 intra-frame predictive coding operates block by block, using the reconstructed values of adjacent, already reconstructed blocks to predict the block currently being coded. Prediction is performed separately for the luma and chroma components; the corresponding prediction blocks are the luma prediction block and the chroma prediction block. To adapt to the content characteristics of high-definition video and improve prediction accuracy, H.265 adopts a richer set of prediction block sizes and prediction modes.
The size of an H.265 luma prediction block ranges from 4×4 to 32×32, and all sizes share the same 35 prediction modes, which fall into 3 categories: planar mode, DC (direct current) mode, and angular mode.
1. Planar mode: luma mode 0, suitable for areas where pixel values change slowly, such as gradients. A different predictor is used for each pixel in the block: the predicted value is the average of linear interpolations of the pixel in the horizontal and vertical directions (a code sketch of the planar and DC predictors follows this list).
2. DC mode: luma mode 1, suitable for large flat areas of the image; this mode uses the same prediction value for all pixels in the block. If the prediction block is square, the prediction value is the average of the reference pixels on the left and above; if the prediction block is rectangular, the prediction value is the average of the reference pixels along the long side.
3. Angular mode: luma modes 2–34, a total of 33 prediction directions, where mode 10 is the horizontal direction and mode 26 is the vertical direction. The predicted value of each pixel is obtained by projecting it, at the horizontal or vertical angular offset of the chosen direction, onto the previously reconstructed reference samples.
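As a rough sketch (not the normative HEVC reference-sample handling) of the two simplest luma modes, the DC and planar predictors could be written as:

```python
import numpy as np

def dc_prediction(left, top):
    """DC mode: every pixel in the block gets the mean of the left and top reference samples."""
    dc = np.round((left.sum() + top.sum()) / (len(left) + len(top)))
    return np.full((len(left), len(top)), dc)

def planar_prediction(left, top, top_right, bottom_left):
    """Planar mode: average of a horizontal and a vertical linear interpolation per pixel."""
    n = len(top)
    pred = np.zeros((n, n))
    for y in range(n):
        for x in range(n):
            horiz = (n - 1 - x) * left[y] + (x + 1) * top_right    # interpolate toward top-right
            vert = (n - 1 - y) * top[x] + (y + 1) * bottom_left    # interpolate toward bottom-left
            pred[y, x] = (horiz + vert + n) // (2 * n)
    return pred

# Usage: predict a 4x4 block from its reconstructed neighbours
left = np.array([100, 102, 104, 106])   # reconstructed column to the left
top = np.array([101, 103, 105, 107])    # reconstructed row above
print(dc_prediction(left, top))
print(planar_prediction(left, top, top_right=109, bottom_left=108))
```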
In color video, the chroma signal and the luma signal at the same position have similar characteristics, so the prediction modes of the chroma prediction block are similar to those of the luma prediction block. There are five prediction modes for chroma prediction blocks in H.265: planar mode, vertical mode, horizontal mode, DC mode, and derived mode:
1. Planar mode: chroma mode 0, which is the same as luma mode 0.
2. Vertical mode: chroma mode 1, same as luma mode 26.
3. Horizontal mode: chroma mode 2, same as luma mode 10.
4. DC mode: chroma mode 3, same as luma mode 1.
5. Derived mode: chroma mode 4, uses the same prediction mode as the corresponding luma prediction block. If the luma mode is one of 0, 1, 10, or 26 (i.e., it duplicates one of the first four chroma modes), that duplicate is replaced by angular mode 34.
The flow chart of intra-frame prediction is shown below; since this is only a brief overview, I will not go into the details:
[Figure: intra-frame prediction flow chart]

1.2 Inter prediction

Inter-frame prediction means that the pixels used for prediction and the pixels currently being encoded are not in the same video frame, but in adjacent or nearby frames. In general, inter-frame predictive coding compresses better than intra prediction, mainly because the correlation between video frames is very strong: if the moving objects in the video change slowly, the pixel differences between frames are very small and the temporal redundancy is large.
Inter-frame prediction evaluates the motion of moving objects through motion estimation. The main idea is to search a given range of the reference frame for a block that matches the block being predicted, and to compute the relative displacement between the matching block and the current block; this displacement is the motion vector. Once the motion vector is obtained, the prediction must be corrected, which is motion compensation: the motion vector is fed into the motion compensation module, which "compensates" the reference frame to produce the predicted frame for the current coded frame. The difference between the predicted frame and the current frame is the inter-frame prediction error.
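A minimal full-search block-matching sketch of motion estimation and compensation (illustrative only; real encoders use fast search patterns, sub-pixel interpolation, and rate-distortion-aware costs) might look like this:

```python
import numpy as np

def motion_estimation(cur_block, ref_frame, block_pos, search_range=8):
    """Full search: find the displacement in the reference frame that minimizes the SAD."""
    by, bx = block_pos
    n = cur_block.shape[0]
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + n > ref_frame.shape[0] or x + n > ref_frame.shape[1]:
                continue
            candidate = ref_frame[y:y + n, x:x + n]
            sad = np.abs(cur_block.astype(int) - candidate.astype(int)).sum()  # sum of absolute differences
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv

def motion_compensation(ref_frame, block_pos, mv, n):
    """Fetch the matching block the motion vector points to; this is the prediction for the block."""
    y, x = block_pos[0] + mv[0], block_pos[1] + mv[1]
    return ref_frame[y:y + n, x:x + n]
```

The residual passed on to transform coding is then `cur_block - motion_compensation(...)`.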
If inter-frame prediction uses only a previous frame, it is called forward inter prediction or unidirectional prediction. The predicted frame is a P frame, and a P frame can reference a previous I frame or P frame.
If inter-frame prediction uses not only a previous frame but also a subsequent frame to predict the current block, it is bidirectional prediction. The predicted frame is a B frame, and a B frame can reference a previous I frame or P frame as well as a subsequent P frame.
Since a P frame needs to reference a previous I or P frame, and a B frame needs to reference both a previous I or P frame and a subsequent P frame, what would happen if a B frame arrived in the stream before the I and P frames it depends on, so that it could not be decoded immediately? How is the playback order guaranteed? In fact, a PTS and a DTS are generated for each frame during encoding. Typically, after the encoder produces an I frame, it skips several frames ahead and encodes a P frame using that I frame as reference; the frames between the I frame and the P frame are then encoded as B frames. The video frames in the stream are already arranged according to the dependency order of I, P, and B frames during encoding, so the data can be decoded directly as it is received; it is therefore impossible to receive a B frame before the I and P frames it depends on.
PTS: Presentation Time Stamp, the display timestamp; it tells the player when to display this frame.
DTS: Decoding Time Stamp, the decoding timestamp; it tells the player when to decode this frame.
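A small illustration of the difference, with hypothetical timestamps for a short group of I, P, and B frames:

```python
# Display (PTS) order:  I0  B1  B2  P3
# Decode  (DTS) order:  I0  P3  B1  B2   <- P3 must be decoded before the B frames that reference it
frames = [
    {"type": "I", "dts": 0, "pts": 0},
    {"type": "P", "dts": 1, "pts": 3},
    {"type": "B", "dts": 2, "pts": 1},
    {"type": "B", "dts": 3, "pts": 2},
]
decode_order = [f["type"] for f in sorted(frames, key=lambda f: f["dts"])]
display_order = [f["type"] for f in sorted(frames, key=lambda f: f["pts"])]
print(decode_order)   # ['I', 'P', 'B', 'B']  -> order in the stream
print(display_order)  # ['I', 'B', 'B', 'P']  -> order on the screen
```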
The flow chart of inter-frame prediction is as follows:
[Figure: inter-frame prediction flow chart]
Overall, the inputs and outputs of predictive coding are as follows:
[Figure: inputs and outputs of predictive coding]

2. Transform coding

Transform coding refers to mapping the spatial-domain signal of the image into the frequency domain and then encoding the resulting transform coefficients. In the spatial domain the correlation between samples is relatively high: the residual after predictive coding varies little and still contains a large amount of data redundancy, especially in flat areas where the brightness changes slowly. After transforming into the frequency domain, the residual data that is scattered in the spatial domain becomes concentrated, which reduces correlation and data redundancy and thus removes spatial redundancy.
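As a tiny illustration of this energy-compaction effect (using a floating-point DCT from SciPy rather than the integer transforms actually specified in H.265):

```python
import numpy as np
from scipy.fft import dctn

# A smooth 4x4 residual block: after the 2-D DCT, most of the energy
# collapses into the low-frequency (top-left) coefficients.
residual = np.array([[1, 2, 3, 4],
                     [2, 3, 4, 5],
                     [3, 4, 5, 6],
                     [4, 5, 6, 7]], dtype=float)
coeffs = dctn(residual, norm="ortho")
print(np.round(coeffs, 1))
# Only a handful of coefficients are significantly non-zero; the rest quantize to 0,
# which is what makes the subsequent quantization and entropy coding effective.
```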
In H.265, a coding block (CB) can be divided by a quadtree into several prediction blocks (PB) and transform blocks (TB). Since the quadtree partitioning from CB to TB mainly serves the transformation of the residual, this quadtree is also called the residual quadtree (RQT). The figure below shows an example of RQT partitioning that divides a 32×32 residual CB into 13 TBs of different sizes.
[Figure: example of RQT partitioning of a 32×32 CB into 13 TBs]
A TB can take one of four sizes, 4×4, 8×8, 16×16, or 32×32, and each TB corresponds to an integer transform coefficient matrix. Large TBs are suitable for flat areas where the brightness changes slowly, and small TBs for complex areas where it changes rapidly. All sizes can be transformed using the discrete cosine transform (DCT). In addition, for 4×4 intra-predicted luma residual blocks, the discrete sine transform (DST) can also be used.

Since intra-frame predictive coding is based on the data of the already encoded blocks to the left and above, the closer a predicted sample is to those reference samples, the stronger the correlation and the smaller the prediction error; the farther away, the weaker the correlation and the larger the prediction error. This distribution of the prediction error closely matches the sine basis functions of the DST, which start at their smallest value and then gradually increase. However, because the DST is more expensive to compute than the DCT and would require additional transform-type signaling, it is only used for 4×4 intra-predicted luma residual blocks.

2.1 Quantization

Since the transform coding only converts the image data from the spatial domain matrix to the frequency domain transform coefficient matrix, neither the number of coefficients nor the amount of data in the matrix is ​​reduced. To compress data, quantization and entropy coding of statistical features in the frequency domain are also required.
Common quantization methods can be divided into two categories: **Scalar Quantization (SQ) and Vector Quantization (VQ)**:
1. Scalar quantization: divide the range of values in the image into several intervals, and represent all sample values that fall into an interval by a single representative value for that interval.
2. Vector quantization: divide the data in the image into several groups of vectors, and use a representative vector in each group to represent all the vectors in that group.
Because vector quantization exploits the correlation between multiple pixels and uses probabilistic methods, its compression rate is generally higher than that of scalar quantization. However, due to its high computational complexity, the quantization method widely used in practice is scalar quantization.
The quantization process at the encoder can be simply understood as dividing each DCT transform coefficient by the quantization step to obtain a quantized level. The corresponding inverse quantization at the decoder multiplies the quantized level by the quantization step to recover an approximation of the DCT transform coefficient.
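A toy illustration of this divide/round/multiply view of scalar quantization (the real HEVC quantizer uses integer arithmetic, a QP-dependent step size, and scaling lists):

```python
import numpy as np

def quantize(coeffs, step):
    return np.round(coeffs / step)          # encoder side: divide by the step and round

def dequantize(levels, step):
    return levels * step                    # decoder side: multiply back (information is lost)

coeffs = np.array([52.7, -13.4, 3.2, 0.8])
levels = quantize(coeffs, step=10)          # -> [ 5., -1.,  0.,  0.]
print(dequantize(levels, step=10))          # -> [50., -10., 0., 0.]  close to, but not equal to, the input
```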

3. Statistical coding (entropy coding)

Entropy coding refers to coding that, following the principle of entropy, loses no information in the process. Quantization is a lossy compression step, whereas entropy coding maps the data to a more compact representation without loss. Common entropy codes include Shannon coding, Huffman coding, arithmetic coding, and run-length coding.

3.1 Huffman coding

Huffman coding is a variable-length code: different symbols are assigned codewords of different lengths. It uses the occurrence probabilities of the symbols to build a Huffman binary tree, so that symbols with high probability get short codewords (closer to the root) and symbols with low probability get long codewords (far from the root), which minimizes the average codeword length.
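A textbook sketch of building a Huffman code table (illustrative only; H.265 itself does not use Huffman coding):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table from symbol frequencies."""
    heap = [[freq, i, [sym, ""]] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)            # two least frequent subtrees
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]         # left branch
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]         # right branch
        heapq.heappush(heap, [lo[0] + hi[0], counter] + lo[2:] + hi[2:])
        counter += 1
    return {sym: code for sym, code in heap[0][2:]}

codes = huffman_codes("aaaabbbccd")
print(codes)  # frequent symbols ('a') get short codes, rare symbols ('d') get long ones
```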

3.2 Arithmetic coding

Although Huffman coding is optimal in theory, in practice the smallest unit a computer can process is 1 bit, so fractional codeword lengths have to be rounded to integers and the actual coding efficiency is often slightly inferior to the theoretical one. In image compression, arithmetic coding is therefore usually used instead of Huffman coding. The underlying principle is the same as for Huffman coding: symbols with high probability get short codes, and symbols with low probability get long codes.
Arithmetic coding comes in several flavors: fixed-model arithmetic coding, adaptive arithmetic coding, binary arithmetic coding, context-adaptive binary arithmetic coding (CABAC), and so on. CABAC is used in H.265. Here we only introduce the fixed-model arithmetic coding process (a code sketch follows the steps):
1. Count the symbols in the input sequence and their occurrence probabilities;
2. According to the probability distribution, divide the interval [0, 1) into sub-intervals, one per symbol, where the width of each sub-interval equals the symbol's probability (the widths sum to 1); denote the sub-interval of a symbol by [L, H);
3. Initialize low = 0, high = 1. Read the symbols of the sequence one by one; for each symbol, look up its interval [L, H) and update low and high:
   range = high − low
   high = low + range × H
   low = low + range × L
4. After the whole sequence has been processed, take the final interval [low, high), pick a value inside it, convert it to binary, and output it as the encoded data.
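A minimal sketch of this fixed-model encoding loop, using exact fractions for clarity (real coders use finite-precision integer arithmetic with renormalization):

```python
from fractions import Fraction

def arithmetic_encode(message, probabilities):
    """Fixed-model arithmetic coding: shrink [low, high) once per symbol."""
    # Build the cumulative sub-intervals [L, H) for each symbol
    intervals, cum = {}, Fraction(0)
    for sym, p in probabilities.items():
        intervals[sym] = (cum, cum + Fraction(p))
        cum += Fraction(p)

    low, high = Fraction(0), Fraction(1)
    for sym in message:
        sym_low, sym_high = intervals[sym]
        rng = high - low
        high = low + rng * sym_high
        low = low + rng * sym_low
    return low, high   # any number in [low, high) identifies the message

low, high = arithmetic_encode("aab", {"a": Fraction(2, 3), "b": Fraction(1, 3)})
print(float(low), float(high))   # ~0.296 .. ~0.444
```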

4. Other technologies

4.1 Loop Filtering

Since H.265 uses block-based coding, distortion effects such as blocking artifacts and ringing appear when the image is dequantized, inverse transformed, and reconstructed. To address these problems, H.265 adopts in-loop filtering, consisting of deblocking filtering (DBF) and sample adaptive offset (SAO).
DBF acts on boundary pixels to reduce blocking artifacts. Blocking artifacts are visible discontinuities in the gray values at the boundaries of adjacent coding blocks. They have two main causes: (1) the DCT transform and quantization of the residual are performed block by block, ignoring the correlation between blocks, so neighboring blocks are processed inconsistently; (2) imperfect matching of motion-compensated blocks in inter prediction introduces errors, and since the prediction references during encoding come from these reconstructed images, the distortion is carried over into the frames being predicted.
For each boundary, DBF applies strong filtering, weak filtering, or no filtering at all; the boundary type is determined from the pixel gradient across the boundary and the quantization parameters of the neighboring blocks. In DBF processing, horizontal filtering is first applied to all vertical edges of the picture, and then vertical filtering is applied to all horizontal edges. Filtering is essentially a correction of the pixel values so that the block boundaries become less visible. DBF also exists in H.264, where it operates on a 4×4 grid, whereas in H.265 it operates on an 8×8 grid.
SAO is an error compensation mechanism for the reconstructed image newly introduced in H.265, used to mitigate the ringing effect. Ringing refers to oscillations caused by sharp changes in the gray values of the image, mainly due to the loss of high-frequency information after DCT transform and quantization. The principle of SAO is to reduce this distortion by adding a negative offset to local peaks and a positive offset to local valleys of the reconstructed signal. Unlike DBF, which only works on boundary pixels, SAO works on all pixels in the block.
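A much simplified, one-dimensional sketch of this peak/valley compensation idea (real SAO classifies samples into edge or band categories per CTB and signals the offsets in the bitstream):

```python
import numpy as np

def sao_edge_offset_1d(rec, peak_offset=-1, valley_offset=+1):
    """Toy 1-D edge offset: pull local peaks down and push local valleys up."""
    out = rec.astype(float).copy()
    for i in range(1, len(rec) - 1):
        if rec[i] > rec[i - 1] and rec[i] > rec[i + 1]:      # local peak -> negative offset
            out[i] += peak_offset
        elif rec[i] < rec[i - 1] and rec[i] < rec[i + 1]:    # local valley -> positive offset
            out[i] += valley_offset
    return out

reconstructed = np.array([100, 108, 99, 107, 100])   # ringing oscillation around an edge
print(sao_edge_offset_1d(reconstructed))              # [100. 107. 100. 106. 100.]
```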

Supplementary understanding

When a CU is encoded, the optimal PU partitioning and intra prediction mode are selected according to the rate-distortion (RD) cost:
[Figure: RD cost formulas (4) and (5)]
SATD(s, p) is the sum of absolute transformed differences between the original PB s and the prediction block p produced by a candidate mode;
SSD(s, c) is the sum of squared errors between s and the reconstructed block c;
R_mode is the bit rate required to encode the current mode;
R_all is the total bit rate required to encode all information (partition mode, prediction mode index, residual coefficients).
Formula (4) is first used to pre-select several likely optimal modes, and formula (5) is then used to choose the optimal mode among them (a schematic sketch follows).
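A schematic sketch of this two-stage selection, with hypothetical cost functions and candidate fields (the actual formulas (4) and (5) use a Hadamard-transformed difference for SATD and Lagrange multipliers derived from the quantization parameter):

```python
import numpy as np

def satd_cost(s, p, lambda_pred, r_mode):
    """Rough-pass cost in the spirit of formula (4): distortion between source and prediction + lambda * mode bits."""
    # A Hadamard transform of the difference would normally be used; plain SAD keeps the sketch short.
    return np.abs(s - p).sum() + lambda_pred * r_mode

def rd_cost(s, c, lambda_mode, r_all):
    """Full RD cost in the spirit of formula (5): SSD between source and reconstruction + lambda * total bits."""
    return ((s - c) ** 2).sum() + lambda_mode * r_all

def choose_mode(s, candidates, lambda_pred, lambda_mode, n_preselect=3):
    """candidates: list of dicts with prediction 'p', reconstruction 'c', and bit costs (hypothetical fields)."""
    rough = sorted(candidates, key=lambda m: satd_cost(s, m["p"], lambda_pred, m["r_mode"]))
    shortlist = rough[:n_preselect]                                    # pre-select with the cheap cost
    return min(shortlist, key=lambda m: rd_cost(s, m["c"], lambda_mode, m["r_all"]))  # final RD decision
```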

H.266/VVC video encoding

1. Intra prediction

For intra prediction, VVC supports 67 intra prediction modes (compared with 35 in HEVC) and adjusts the angular prediction directions for non-square blocks. Prediction sample interpolation uses two types of four-tap interpolation filters (HEVC uses a lower-precision linear interpolation). The position-dependent intra prediction combination (PDPC) technique combines the prediction signal with the filtered and unfiltered boundary reference samples to further improve intra prediction accuracy. Multi-reference-line intra prediction can use not only the nearest line of reconstructed pixels but also farther reconstructed lines. Matrix-based intra prediction uses matrix-vector multiplication to generate the prediction. Cross-component linear model intra prediction uses the pixel values of the luma component to predict the chroma components of the same picture. In the sub-block mode, the different sub-blocks of a luma CU share the same coding mode information.

2. Inter prediction

For inter-frame prediction, VVC inherits from HEVC motion vector difference (MVD) coding and the motion-information inheritance modes based on the whole coding unit, namely AMVP (Advanced Motion Vector Prediction) and Skip/Merge mode, and extends each of them. For AMVP mode, VVC introduces block-level adaptive motion vector resolution, and a symmetric MVD coding mode (Symmetric Motion Vector Difference signaling) for bidirectional prediction in which only the MVD of one of the reference pictures needs to be coded. For Skip/Merge mode, VVC introduces history-based motion vector prediction (HMVP) and pairwise-average merge candidates. In addition to the above motion vector coding/inheritance based on the whole coding unit, VVC also introduces subblock-based temporal motion vector prediction (SbTMVP), in which the current coding unit is divided into sub-blocks of equal size (8×8 luma sub-blocks) and a motion vector is derived separately for each sub-block. VVC also introduces an affine motion model to represent higher-order motion such as zooming and rotation more accurately and thus improve the efficiency of motion information coding. The motion vector accuracy is increased from 1/4 luma pixel in HEVC to 1/16 luma pixel. In addition, VVC introduces several new inter prediction coding tools, such as: merge mode with MVD (MMVD), which combines AMVP and merge mode by adding an extra motion vector difference on top of merge mode; the geometric partitioning mode, whose partition shapes follow the boundaries of moving objects in the video content more closely; and a combined inter/intra prediction mode, which reduces temporal and spatial redundancy at the same time to achieve higher compression performance. Another important improvement of VVC is the introduction of two tools, decoder-side motion vector refinement and bi-directional optical flow, which further improve the efficiency of motion compensation without increasing the bit-rate overhead.

3. Transformation and quantization

For transforms, VVC introduces non-square transforms, multiple transform selection (for the primary transform), the low-frequency non-separable transform, and the sub-block transform. In addition, the maximum transform size in VVC is increased to 64×64 (32×32 in HEVC). Non-square transforms are used on non-square blocks and apply transform kernels of different lengths horizontally and vertically. With multiple transform selection, the encoder can choose from a predefined set of integer sine/cosine transforms and transform skip, and signals the chosen transform in the bitstream. The low-frequency non-separable transform applies a secondary transform to the low-frequency components of the primary-transform result of the intra prediction residual, to better exploit the directionality of the block content and further improve compression performance. The sub-block transform is used when only part of an inter prediction residual block is coded while the remaining parts are set to zero.
VVC introduces three new coding tools for quantization: adaptive chroma quantization parameter offset, dependent quantization, and joint coding of chroma residuals. With the adaptive chroma quantization parameter offset, for a specific quantization group the chroma quantization parameters are not coded directly but derived from the luma quantization parameter and a predefined, transmitted lookup table. In dependent quantization, the set of admissible reconstruction values for one transform coefficient depends on the reconstructed values of the transform coefficients that precede it in scan order, which reduces the average distortion between the input vector and the closest reconstructible vector. Joint coding of chroma residuals means coding the residuals of the two chroma components together instead of separately, which improves coding efficiency when the two chroma residuals are similar.

4. Entropy coding

Like HEVC, VVC uses context-adaptive binary arithmetic coding (CABAC) for entropy coding, but improves both the CABAC engine and transform coefficient coding. The improvement in the CABAC engine is a multi-hypothesis probability update model whose adaptation rates are bound to the context model (i.e., the probability update speed depends on the context model): each context model keeps two probability estimates, P0 and P1, which are updated independently of each other at their own adaptation rates, and the probability estimate P used for interval subdivision in the binary arithmetic coder is the mean of P0 and P1. For transform coefficient coding, in addition to the 4×4 coefficient group, VVC also allows coefficient groups of size 1×16, 16×1, 2×8, 8×2, 2×4, and 4×2. Furthermore, a flag is added for the state transitions of dependent quantization, and an improved probability-model selection mechanism is used for coding the syntax elements related to the absolute values of the transform coefficients.
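A rough sketch of the two-rate probability update idea, assuming floating-point probabilities and illustrative shift values (the actual VVC engine uses integer probability states with context-dependent adaptation rates):

```python
def update_probabilities(p0, p1, bin_value, shift0=4, shift1=7):
    """Each context keeps two estimates updated at a fast (shift0) and a slow (shift1) rate."""
    target = 1.0 if bin_value else 0.0
    p0 += (target - p0) / (1 << shift0)    # fast-adapting estimate
    p1 += (target - p1) / (1 << shift1)    # slow-adapting estimate
    return p0, p1

p0 = p1 = 0.5
for b in [1, 1, 1, 0, 1, 1]:               # a short sequence of bins coded with this context
    p0, p1 = update_probabilities(p0, p1, b)
    p = (p0 + p1) / 2                       # probability actually used for interval subdivision
print(round(p, 3))
```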

5. Loop Filtering

In addition to the deblocking filter (DBF) and sample adaptive offset (SAO) already present in HEVC, VVC also supports luma mapping with chroma scaling (LMCS) and the adaptive loop filter (ALF). In DBF, longer filters and a luma-adaptive filtering mode specially designed for high-dynamic-range video are added. SAO is the same as in HEVC. With LMCS, the encoder applies a piecewise-linear remapping of the dynamic range of the input signal's amplitude distribution before encoding to improve coding efficiency, and the mapping is inverted at the decoder. ALF in VVC has two modes: (1) block-based ALF for luma and chroma samples; (2) the cross-component adaptive loop filter (CC-ALF) for chroma samples. ALF uses 7×7 and 5×5 diamond-shaped filters for luma and chroma respectively; each 4×4 block is classified into one of 25 classes and one of 4 transpose states according to its directionality and gradient activity, and one filter from the signaled filter sets is selected accordingly. CC-ALF uses a diamond-shaped linear high-pass filter on the ALF-filtered luma samples to further refine the chroma samples.

6. Screen Content Encoding

VVC retains block-based differential pulse code modulation from HEVC, but only for intra-predicted coding units. Transform-skip residual coding improves on HEVC in three ways: (1) the position of the first non-zero coefficient is no longer coded, and the scan direction is reversed; (2) context models are used to improve the coding efficiency of the sign flags; (3) the coding of absolute values is improved. Intra block copy (IBC) and palette mode, two tools already present in HEVC, are also retained and improved. In HEVC, IBC is defined as an inter prediction mode in which the reference frame is the current frame itself and the motion vector must point to an area of the current frame that has already been decoded but not yet loop-filtered. In VVC, IBC is decoupled from inter prediction, and the management of the reference buffer is simplified compared with HEVC: reference samples are stored in a small local buffer. How palettes are coded in VVC depends on whether luma and chroma share a single coding tree: if they do, the palettes of the three color components are coded jointly; otherwise the luma and chroma palettes are coded separately. For a coding unit that uses a palette, individual pixels can also code their quantized values directly without using the palette entries. Finally, adaptive color transform, another screen-content coding tool in VVC, remains unchanged from HEVC.

7. 360-degree video encoding

360-degree video became popular around 2014–2015, while the first version of HEVC was finalized in early 2013, so VVC became the first international video coding standard to include 360-degree video coding tools. Since 360-degree video coding mostly reuses traditional video coding techniques, VVC contains only two 360-degree video "compression" tools; most of the support for 360-degree video lies in the design of the system and transport interfaces (see the next section of this article). The first tool is motion vector wrap-around: when a motion vector points to a position beyond the right (left) boundary of the picture, the reference pixels actually used for motion compensation are the pixels (or sub-pixels obtained by interpolation filtering) just inside the left (right) boundary. This is because, in the equirectangular projection (ERP) commonly used for 360-degree video, the left and right picture boundaries correspond to the same positions on the sphere of the physical world, just as the left and right edges of a world map correspond to the same meridian connecting the North and South Poles. Motion vector wrap-around therefore improves the coding efficiency of 360-degree video using ERP. The second tool is the loop-filter virtual boundary: when it is used, loop filtering is not applied across certain horizontal or vertical lines in the picture (the so-called virtual boundaries). This tool targets another projection commonly used for 360-degree video, the cube map projection (CMP). Reference [13] gives a detailed introduction to 360-degree video, ERP, and CMP.
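A minimal sketch of the horizontal wrap-around idea for ERP content (the actual VVC syntax signals a wrap-around offset and also covers sub-pixel interpolation):

```python
def wrap_reference_x(x, picture_width):
    """Horizontal wrap-around for ERP content: positions past the right edge re-enter from the left."""
    return x % picture_width

# A motion vector pointing 10 samples past the right edge of a 3840-wide ERP picture
print(wrap_reference_x(3840 + 10, 3840))   # -> 10: reference samples are taken near the left edge
```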

Some of the information above references: https://mp.weixin.qq.com/s/tQo3_EffwUNOph4DnobFzg
https://juejin.cn/post/6940078108787769357
