Video coding knowledge

Notes from my process of learning video coding, along with my own understanding.

Video

A digital image is represented in a computer as a two-dimensional matrix, or a three-dimensional matrix for color.

Each point in the matrix is a pixel, and its value reflects the intensity of the color. The color depth determines how much storage a pixel needs: each color plane (R, G, B) is usually 8 bits, ranging from 0 to 255, so the color depth is 24 (8 x 3) bits.
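As a minimal sketch of this representation (assuming Python with NumPy, which is not part of the original post):

```python
import numpy as np

# A 2 x 2 color image: height x width x 3 planes (R, G, B), 8 bits each
image = np.array(
    [[[255, 0, 0],   [0, 255, 0]],        # red pixel,  green pixel
     [[0, 0, 255],   [255, 255, 255]]],   # blue pixel, white pixel
    dtype=np.uint8,
)
print(image.shape)  # (2, 2, 3): a three-dimensional matrix, 24-bit color depth
```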

Resolution: the number of pixels in an image, expressed as width x height. Common resolutions include: 320 x 240, 640 x 400, 640 x 480, 800 x 600, 1024 x 768, 1280 x 720 (720p HD), 1600 x 1200, 1920 x 1080 (1080p Full HD)

A video is n consecutive frames per unit time; n is the frame rate, measured in frames per second (fps).

The amount of data required to play one second of video is the bit rate (code rate), in kb/s or Mb/s.

bit rate = width x height x color depth x frames per second

For example, for RGB video at 30 frames per second and a resolution of 480 x 240, with no compression the bit rate is 82,944,000 (480 x 240 x 24 x 30) bits per second.
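A quick check of that arithmetic (a throwaway Python sketch; the function name is mine):

```python
def raw_bitrate(width: int, height: int, color_depth: int, fps: int) -> int:
    """Uncompressed bit rate in bits per second."""
    return width * height * color_depth * fps

print(raw_bitrate(480, 240, 24, 30))  # 82944000 bits/s, roughly 83 Mb/s
```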

Generally speaking, at the same resolution, a higher bit rate means more bits per pixel, and more bits can represent finer color and picture detail. So the lower the compression ratio of a video, the higher its picture quality.

This huge amount of data is why we need to compress video as much as possible while preserving what the human eye actually perceives.

Visual characteristics of the human eye: more sensitive to luminance than to chrominance.

The A and B areas in the picture below illustrate this visual characteristic.

[Image: visual characteristics of the human eye]

In addition to the RGB color model, there are other models. YCbCr separates luminance (luma) from chrominance (chroma): Y is luma, while Cb and Cr are the blue-difference and red-difference chroma components, respectively.

[Image: luma and chroma separation]
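For reference, one common set of conversion equations is the BT.601 full-range form (the variant JPEG uses); this Python sketch assumes that variant:

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """BT.601 full-range RGB -> YCbCr: Y is luma, Cb/Cr are chroma offsets around 128."""
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

print(rgb_to_ycbcr(255, 0, 0))  # pure red: modest luma, Cr well above 128
```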

After separating luma and chroma, the chroma part can be compressed. This is called chroma subsampling, denoted a:x:y, where:

  • a: horizontal sampling reference, generally 4
  • x: number of chroma samples in the first row
  • y: number of chroma samples in the second row
    Commonly used schemes are: 4:4:4 (no subsampling), 4:2:2, 4:2:0, etc. Using 4:2:0 sampling cuts the video size in half, as the sketch below shows.
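A back-of-the-envelope check of that claim (a sketch at 8 bits per sample; real formats also pad and align planes):

```python
def frame_bytes(width: int, height: int, scheme: str) -> float:
    """Bytes per frame for common subsampling schemes, 8 bits per sample."""
    chroma_fraction = {"4:4:4": 1.0, "4:2:2": 0.5, "4:2:0": 0.25}[scheme]
    luma = width * height
    return luma + 2 * chroma_fraction * luma  # one Y plane plus Cb and Cr

print(frame_bytes(1920, 1080, "4:2:0") / frame_bytes(1920, 1080, "4:4:4"))  # 0.5
```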

There is a strong correlation between consecutive frames in a video: much of the current frame overlaps with the previous frame. This redundancy is temporal redundancy. Eliminating it has a precondition: among several consecutive frames there must be key frames, and the remaining frames can reference the key frames to eliminate redundancy. So we need to classify frames.

Frame types

I frame (key frame): an I frame is self-contained and needs no other frame to be rendered. It is similar to a static image, so the first frame is usually an I frame, and I frames are generally inserted into the frame sequence at regular intervals.
P frame (predicted frame): a P frame can be rendered using previous frames. For example, if only a small part of two consecutive frames contains motion, we only need the difference between the current frame and the previous one to reconstruct the current frame.
B frame (bi-directional predicted frame): a B frame follows the same prediction principle as a P frame, except that it can reference both earlier and later frames.

In terms of the number of bits occupied, normally: I frame > P frame > B frame.
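As an illustration, a hypothetical GOP (group of pictures) layout with an I frame inserted at a regular interval might look like this in display order (the pattern and function are my own, not from any particular encoder):

```python
def gop_pattern(gop_size: int, b_frames: int) -> str:
    """Display-order frame types for one GOP: an I frame, then runs of B frames closed by a P frame."""
    frames = ["I"]
    while len(frames) < gop_size:
        frames += ["B"] * b_frames + ["P"]
    return "".join(frames[:gop_size])

print(gop_pattern(12, 2))  # IBBPBBPBBPBB
```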

Inter prediction

Inter prediction is used to remove temporal redundancy.
Suppose we have two frames, frame 0 and frame 1:
[Image: two consecutive frames]
Subtracting frame 0 from frame 1 gives the residual, and transmitting the residual takes much less data than transmitting the image itself.

But the residual over the whole frame is still relatively large, so a better approach is to divide the image into blocks and then try to match the blocks between frame 0 and frame 1.
[Image: block matching between the two frames]
Suppose we find a matching block: the ball moves from (0, 25) to (7, 26). That displacement (x, y) is called the motion vector. As with the whole-frame residual, we can transmit just the motion vector x = 7 - 0 = 7, y = 26 - 25 = 1 plus the block residual, compressing further.
[Image: motion vector example]
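A naive full-search motion estimation sketch (real encoders use far faster search strategies; SAD, the sum of absolute differences, is one common matching cost):

```python
import numpy as np

def best_motion_vector(ref: np.ndarray, block: np.ndarray,
                       top: int, left: int, search: int = 8) -> tuple[int, int]:
    """Exhaustively search a window in the reference frame for the block,
    returning the (dy, dx) displacement that minimizes SAD."""
    h, w = block.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y+h, x:x+w].astype(int) - block.astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv  # transmit this vector plus the residual of the matched block
```

Only the vector and the small residual of the matched block are transmitted, instead of the block's raw pixels.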

Intra prediction

Redundancy between frames is now eliminated, but redundant information still exists within a frame, for example areas of uniform color or repeated content in the image.

Intra prediction can use the already-decoded pixels above and to the left of the block as predictors.
[Image: intra prediction example]

Suppose we predict that the colors in the block are consistent in the vertical direction; then we can infer the pixel values of the whole block from the pixels in its top row. This prediction may be wrong, so we again transmit residual information to reduce the amount of data: actual value - predicted value = residual.
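A sketch of vertical intra prediction (the 4-wide block and its values are made up for illustration):

```python
import numpy as np

def predict_vertical(top_row: np.ndarray, height: int) -> np.ndarray:
    """Vertical intra prediction: repeat the reconstructed row above the block downward."""
    return np.tile(top_row, (height, 1))

top = np.array([10, 12, 200, 202])       # pixels in the row above the block
block = np.array([[11, 12, 199, 201],    # the actual 2 x 4 block
                  [10, 13, 201, 203]])
residual = block - predict_vertical(top, 2)   # actual - predicted
print(residual)  # small values, cheap to encode: [[1 0 -1 -1] [0 1 1 1]]
```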

Codec

In layman's terms, a codec is the software or hardware used to compress or decompress video and audio. We need one because we want the best possible video quality within limited bandwidth or storage space.

A standard is the set of rules for converting raw video into another format, so that digital video streams can be transmitted in a uniform way.

General Workflow

1. Partitioning

First, the frame is divided into many macroblocks (Macro Blocks). The partitioning is adaptive: larger blocks are used for static backgrounds, and finer subdivisions for complex regions.

Usually the encoder organizes the partitions into slices (Slice), macroblocks (MB), and sub-partitions, for example:
[Image: slice, macroblock, and sub-partition hierarchy]
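A minimal sketch of the first step, cutting a frame into fixed 16 x 16 macroblocks (real encoders also handle frame edges and sub-partitions):

```python
import numpy as np

def split_into_macroblocks(frame: np.ndarray, size: int = 16) -> list[np.ndarray]:
    """Split a frame into size x size blocks; assumes dimensions divide evenly."""
    h, w = frame.shape
    return [frame[y:y+size, x:x+size]
            for y in range(0, h, size)
            for x in range(0, w, size)]

frame = np.zeros((48, 64), dtype=np.uint8)
print(len(split_into_macroblocks(frame)))  # (48/16) * (64/16) = 12 macroblocks
```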

2. Prediction

Following the two compression methods described above, inter prediction transmits the motion vector and the residual, while intra prediction transmits the prediction direction and the residual.

3. Transform

After obtaining the residual, more data can be removed without affecting overall quality. The most common method is the discrete cosine transform (DCT), which separates the low-frequency and high-frequency parts of the data. Most of the energy is concentrated in the low frequencies, so discarding some high-frequency information reduces the amount of data without sacrificing much image quality.
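A minimal 2-D DCT sketch using the orthonormal DCT-II basis matrix (built by hand so the example needs only NumPy):

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    x = np.arange(n).reshape(1, -1)   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)        # DC row
    return c

C = dct_matrix(8)
block = np.full((8, 8), 100.0)        # a flat residual block: pure low frequency
coeffs = C @ block @ C.T              # 2-D DCT: transform rows, then columns
print(np.round(coeffs[0, 0], 3), np.abs(coeffs[1:, 1:]).max())  # all energy in the DC term: 800.0, ~0
```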

4. Quantization

Quantization converts continuous values into a finite set of discrete values. One of the simplest examples is uniform quantization: take a macroblock and divide all its values by 10. The values in the block become smaller, and smaller numbers can be represented with fewer bits. But quantization is not lossless; rounding after the division introduces quantization noise.

This is just the simplest example and ignores the relative importance of the coefficients; quantization in real encoders is generally expressed as a quantization matrix.
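The divide-by-10 example from above, as a sketch (a real encoder would use a per-coefficient matrix here instead of one step; the coefficient values are made up):

```python
import numpy as np

step = 10
coeffs = np.array([[812., 41.], [-33., 7.]])   # toy DCT coefficients

quantized = np.round(coeffs / step)            # what actually gets encoded
reconstructed = quantized * step               # what the decoder can recover
print(quantized)                # [[ 81.   4.] [ -3.   1.]]
print(coeffs - reconstructed)   # quantization noise from rounding: [[ 2.  1.] [-3. -3.]]
```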

5. Entropy coding

Once we have the quantized block, we can continue encoding. At this point the non-zero coefficients of the block are concentrated in the low-frequency part, and most of the high-frequency coefficients are 0. Zig-zag scanning converts the two-dimensional data in the block to one dimension, so that the non-zero values cluster at the front of the string and the zeros trail behind (a sketch of the scan follows the list below). There are two encoding methods:

  1. CAVLC (context-adaptive variable-length coding)

  2. CABAC (context-adaptive binary arithmetic coding)
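A sketch of the zig-zag scan on a toy 4 x 4 quantized block (real codecs use fixed 8 x 8 or larger scan tables):

```python
import numpy as np

def zigzag(block: np.ndarray) -> np.ndarray:
    """Read the block along anti-diagonals, alternating direction each diagonal."""
    n = block.shape[0]
    order = sorted(((y, x) for y in range(n) for x in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([block[y, x] for y, x in order])

block = np.array([[9, 8, 5, 0],
                  [7, 6, 0, 0],
                  [4, 0, 0, 0],
                  [0, 0, 0, 0]])
print(zigzag(block))  # non-zeros first, zeros trail: [9 8 7 4 6 5 0 0 0 0 0 0 0 0 0 0]
```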

6. Bitstream format

After the above steps, the compressed data can be transmitted. It carries important information such as color depth, resolution, block partitioning, motion vectors, intra prediction directions, profile, level, frame rate, frame type, and frame number. H.264 specifies that this information is carried in NAL units (the Network Abstraction Layer). At this point the data is a long run of bytes, usually viewed in hexadecimal, and each part represents information about the original video produced during compression.
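As a small sketch of reading such a stream: H.264 Annex B separates NAL units with 00 00 01 (or 00 00 00 01) start codes, and the low five bits of the byte after a start code give the NAL unit type. The toy bytes below are hand-made, not a real encoded stream:

```python
def nal_unit_types(stream: bytes) -> list[int]:
    """Scan for Annex-B start codes and collect each NAL unit's type field."""
    types, i = [], 0
    while (i := stream.find(b"\x00\x00\x01", i)) != -1:
        types.append(stream[i + 3] & 0x1F)  # nal_unit_type: low 5 bits of the header byte
        i += 3
    return types

# Toy stream: SPS (type 7), PPS (type 8), IDR slice (type 5)
toy = b"\x00\x00\x00\x01\x67" + b"\x00\x00\x00\x01\x68" + b"\x00\x00\x01\x65"
print(nal_unit_types(toy))  # [7, 8, 5]
```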

Summary

The purpose of video coding is to deliver the best video quality within a limited bandwidth. To achieve this, the encoder analyzes the original video and removes its redundancy through intra prediction and inter prediction; the principle is to transmit residuals, so that the original image can be reconstructed from the information in the key frames. The DCT then converts the data into the frequency domain so the high-frequency part can be removed. At this point the video has become two-dimensional matrices of decimal values containing zeros and non-zeros. Finally, entropy coding produces a bit stream, viewed as hexadecimal. The bit stream is transmitted together with the compression settings the encoder used, and the decoder reconstructs the original video from the bit stream according to those modes.

Source: blog.csdn.net/qq_41285115/article/details/125824309