Notes recording my process of learning video coding, along with my own understanding of it.
Video
Representation of a digital image in a computer: a two-dimensional matrix (grayscale), or a three-dimensional matrix (color).
Each element of the matrix is a pixel, and the magnitude of its value reflects the intensity of the color. Storing a pixel's color requires a certain amount of space, called the color depth. Each color plane (R, G, B) is usually 8 bits, with a range of 0 to 255, so the color depth is 24 (8 × 3) bits.
Resolution: the number of pixels in an image, expressed as width × height. Common resolutions include: 320 x 240, 640 x 400, 640 x 480, 800 x 600, 1024 x 768, 1280 x 720 (720p HD), 1600 x 1200, 1920 x 1080 (1080p Full HD)
A video is n consecutive frames per unit time; n is the frame rate, measured in frames per second (fps).
The amount of data required to play one second of video is the bit rate (code rate), in kb/s or Mb/s:
bit rate = width x height x color depth x frames per second
For example, a 30 fps RGB video with a resolution of 480 x 240 has, without compression, a bit rate of 82,944,000 (480 × 240 × 24 × 30) bits per second.
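The formula and the example above can be checked with a small script (a sketch; the function name is just for illustration):

```python
# Illustrative helper: uncompressed bit rate for raw RGB video.
def raw_bitrate(width, height, color_depth_bits, fps):
    """Bits per second needed to play raw (uncompressed) video."""
    return width * height * color_depth_bits * fps

# The 480 x 240, 24-bit, 30 fps example from the text:
bps = raw_bitrate(480, 240, 24, 30)
print(bps)  # 82944000 bits per second (~82.9 Mb/s)
```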
Generally speaking, at the same resolution, a higher bit rate means more bits per pixel, and more bits can represent finer color and picture information. Therefore, the smaller a video's compression ratio, the better the picture quality.
This huge amount of data means we need to find ways to compress video as much as possible while preserving what the human eye perceives.
Visual characteristics of the human eye: it is more sensitive to luminance (brightness) than to chrominance (color).
The A and B areas in this picture illustrate this characteristic of human vision.
Besides the RGB color model there are others. YCbCr separates brightness (luma) from color (chroma): Y is luma, and Cb and Cr are the blue and red chroma components, respectively.
Separating luma and chroma
After separating luma and chroma, the chroma components can be compressed. This is called chroma subsampling, written a:x:y:
- a: horizontal sampling reference, usually 4
- x: number of chroma samples in the first row
- y: number of chroma samples in the second row
Common schemes are 4:4:4 (no subsampling), 4:2:2, 4:2:0, etc. Using 4:2:0 sampling cuts the video size in half.
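That halving can be sanity-checked by counting samples in a 2x2 pixel block (a sketch assuming 8 bits per sample; the function name is illustrative):

```python
# Sketch: bits needed for a 2x2 pixel block under different chroma
# subsampling schemes, assuming 8 bits per sample.
def bits_per_4_pixels(scheme, bits_per_sample=8):
    # Every scheme keeps 4 luma samples; they differ in how many
    # chroma samples (Cb + Cr combined) survive in the 2x2 block.
    chroma_samples = {"4:4:4": 8, "4:2:2": 4, "4:2:0": 2}[scheme]
    return (4 + chroma_samples) * bits_per_sample

full = bits_per_4_pixels("4:4:4")  # 96 bits
sub = bits_per_4_pixels("4:2:0")   # 48 bits: half the size
print(full, sub)
```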
There is a strong correlation between consecutive frames of a video: much of the current frame overlaps with the previous frame. This redundancy is temporal redundancy. Eliminating it is conditional: among several consecutive frames there must be key frames, and the remaining frames can reference the key frames to eliminate redundancy. This is why frames are classified into types.
Frame types
I frame (key frame): a self-contained frame that needs no other information to be rendered; it is like a still image. The first frame is usually an I frame, and I frames are generally inserted into the frame sequence at regular intervals.
P frame (predicted frame): a P frame can be rendered using previous frames. For example, if only a small part of two consecutive frames contains motion, we only need the difference between the current frame and the previous one to reconstruct the current frame.
B frame (bidirectionally predicted frame): the same prediction principle as a P frame, except that a B frame can reference both forward and backward frames.
In general, the bits occupied satisfy: I frame > P frame > B frame.
Inter prediction
Inter prediction is used to remove temporal redundancy.
Suppose we have two frames, 0 and 1.
Subtracting frame 0 from frame 1 gives the residual; transmitting the residual takes much less data than transmitting the image itself.
But the whole-frame residual is still relatively large. A better approach is to divide the image
into blocks and then try to match the blocks between frames 0 and 1.
Suppose we find a matching block: the ball moves from (0, 25) to (7, 26). The displacement (x, y) is called the motion vector. As with the whole-frame residual, we only need to transmit the motion vector x = 7 - 0 = 7, y = 26 - 25 = 1, which compresses the data further.
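A toy sketch of this idea, with a made-up 8x8 frame and an exhaustive block search (the coordinates and values are illustrative, not taken from the figures):

```python
import numpy as np

# Toy block matching: find where a 2x2 block from the previous frame
# best matches in the current frame, by exhaustive SAD search.
prev = np.zeros((8, 8), dtype=int)
prev[2:4, 1:3] = 9  # "ball" at (row=2, col=1) in frame 0
curr = np.zeros((8, 8), dtype=int)
curr[3:5, 4:6] = 9  # ball moved to (row=3, col=4) in frame 1

block = prev[2:4, 1:3]
best = min(
    ((r, c) for r in range(7) for c in range(7)),
    key=lambda rc: np.abs(curr[rc[0]:rc[0]+2, rc[1]:rc[1]+2] - block).sum(),
)
motion_vector = (best[0] - 2, best[1] - 1)
# Only this small vector (plus any residual) needs to be transmitted,
# instead of the whole block of pixels.
print(motion_vector)  # (1, 3)
```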
example
Intra prediction
Inter prediction eliminates redundancy between frames, but there is still redundant information within a frame, for example areas of uniform color or repeated content in the image.
Intra prediction can use the already-decoded pixels above and to the left of a block to predict it.
examples
Suppose we predict that the colors within a block are consistent in the vertical direction; then we can infer every pixel in the block from the pixels in its top row. This prediction can be wrong, so we again transmit residual information to reduce the amount of data: actual value - predicted value = residual.
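A minimal sketch of vertical intra prediction with made-up pixel values:

```python
import numpy as np

# Vertical intra prediction: copy the row of pixels above the block
# downwards, then encode only the residual (actual - predicted).
top_row = np.array([100, 101, 103, 104])  # reconstructed neighbors above
actual = np.array([[100, 101, 103, 104],
                   [ 99, 101, 104, 104],
                   [100, 102, 103, 105],
                   [100, 101, 103, 104]])

predicted = np.tile(top_row, (4, 1))  # repeat top_row down the block
residual = actual - predicted         # small values, cheap to encode
print(residual)
```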
Codec
In plain terms, a codec is the software or hardware used to compress or decompress video and audio; we need it to get the best possible video quality out of limited bandwidth or storage space.
A standard defines the rules for converting raw video into another video format, so that digital video streams can be transmitted in a uniform way.
General Workflow
1. Partitioning
First, each frame is divided into macroblocks (MB). Partitioning is adaptive: larger blocks are used in static backgrounds, and smaller partitions in complex regions.
Usually the encoder organizes the partitions into slices, macroblocks (MB), and sub-partitions.
examples
2. Prediction
Following the two compression methods above: inter prediction transmits the motion vector and the residual, and intra prediction transmits the prediction direction and the residual.
3. Transform
After obtaining the residual, some information can be removed without affecting overall quality. The most common method is the discrete cosine transform (DCT), which separates the high-frequency and low-frequency parts of the data; most of the energy concentrates in the low frequencies, so discarding some high-frequency information reduces the data volume without sacrificing much image quality.
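A minimal sketch of a 2D DCT on a small block, using a hand-built orthonormal DCT-II matrix (the block values are made up):

```python
import numpy as np

# 2D DCT-II on a 4x4 block: energy concentrates in the low-frequency
# (top-left) coefficients.
def dct_matrix(n):
    """Orthonormal DCT-II basis matrix: M[k, i] = c_k * cos(pi*(2i+1)*k/(2n))."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

block = np.array([[52, 55, 61, 66],
                  [63, 59, 55, 90],
                  [62, 59, 68, 113],
                  [63, 58, 71, 122]], dtype=float)

D = dct_matrix(4)
coeffs = D @ block @ D.T  # transform rows, then columns
# coeffs[0, 0] (the DC term) holds most of the energy; the
# high-frequency terms toward the bottom-right are much smaller.
print(np.round(coeffs, 1))
```

Because D is orthonormal, the inverse transform is simply `D.T @ coeffs @ D`, which is what makes the step reversible before quantization.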
4. Quantization
Quantization converts continuous values into a finite set of discrete values. The simplest example is uniform quantization: take an MB and divide all its pixels by 10, and the values in the block become smaller; smaller numbers can be represented with fewer bits. But quantization is not lossless: the rounding after division introduces quantization noise.
This is only the simplest example and ignores the relative importance of coefficients; quantization in real encoders is generally expressed as a quantization matrix.
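The divide-by-10 example above can be sketched directly (the values are illustrative):

```python
import numpy as np

# Uniform quantization with step 10: values shrink so they need fewer
# bits, but rounding makes the process lossy.
block = np.array([150, 153, 68, 7])
q = 10  # quantization step

quantized = np.round(block / q).astype(int)  # [15, 15, 7, 1]
reconstructed = quantized * q                # [150, 150, 70, 10]
error = block - reconstructed                # quantization noise
print(quantized, reconstructed, error)
```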
5. Entropy coding
After obtaining the quantized block, we can continue to encode it. At this point the non-zero coefficients are concentrated in the low-frequency part of the block, and most of the high-frequency coefficients are 0. A zig-zag scan converts the two-dimensional data in the block to one dimension: a string with the non-zero values concentrated at the front and the zeros concentrated at the back. There are two common encoding methods:
- CAVLC (context-adaptive variable-length coding)
- CABAC (context-adaptive binary arithmetic coding)
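A sketch of the zig-zag scan on a made-up quantized 4x4 block:

```python
import numpy as np

# Zig-zag scan: reorder a 2D block into 1D so the non-zero
# low-frequency coefficients come first, followed by a long run of
# zeros that entropy coding compresses well.
def zigzag(block):
    n = block.shape[0]
    # Walk the anti-diagonals (constant r+c), alternating direction.
    order = sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )
    return [int(block[r, c]) for r, c in order]

quantized = np.array([[28, 7, 2, 0],
                      [ 5, 1, 0, 0],
                      [ 1, 0, 0, 0],
                      [ 0, 0, 0, 0]])
print(zigzag(quantized))  # [28, 7, 5, 1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```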
6. Bitstream format
After completing the above steps, the compressed data can be transmitted. It contains important information such as color depth, resolution, block partitioning, motion vectors, intra prediction directions, profile, level, frame rate, frame type, and frame number. H.264 specifies that this information is carried in NAL (Network Abstraction Layer) units. At this point the data is a long run of hexadecimal numbers, each part representing information about the original video produced during compression.
Summary
The purpose of video coding is to get better video quality out of limited bandwidth. To achieve this, the encoder analyzes the raw video and removes its redundancy through intra prediction and inter prediction; the principle is that the original image can be restored from the residual plus the key-frame information. The DCT then converts the data to the frequency domain so the high-frequency part can be removed. At this point the video has become a two-dimensional matrix of decimal values containing zero and non-zero coefficients. Finally, entropy coding produces a hexadecimal bit stream. The bit stream is transmitted together with the compression modes the encoder used, and the decoder reconstructs the original video from the bit stream and those modes.