[Android Audio and Video Development] Video Coding Principles

Version | Changes                               | Date       | Remark
0.0.1   | Created                               | 2022/9/30  | First edition
0.0.2   | Added video coding principles content | 2022/10/10 | First edition completed

1 Introduction

As stated at the start of this series on Android audio and video development, it is worth introducing the principles of audio and video encoding first, since they help in understanding the concepts that appear later. My previous article covered the principles of audio encoding; readers who have not seen it can click here to read it. In this article, I introduce the principles of video encoding in the same way.

2. Main Content

2.1 The nature of images

Visible light reflected by an object is received by the human eye and imaged on the retina, forming the "image" we see. The human eye can be regarded as a "super camera" of up to roughly 576 million pixels, and for this "super camera" an image is simply a combination of a large number of pixels (colored points); this is the essence of an image.

2.1.1 Concepts related to images

  • Pixel (point): An image is composed of many "colored points"; each such colored point is called a pixel, the basic unit of an image. For example, a 1080p image consists of 1920x1080 = 2,073,600 pixels, i.e. roughly 2 million pixels; 1920x1080 is also called the resolution.
  • Resolution: the number of pixels in the horizontal and vertical directions of the image. The higher the resolution, the clearer and more detailed the image.

2.1.2 Color representation of pixels

  • Three primary colors: every color can be produced by mixing red (Red), green (Green), and blue (Blue), so these are called the three primary colors, represented in computers as RGB.
  • Primary color component: R/G/B are also called "primary color components". In a computer each primary color component occupies 8 bit (1 byte), so its value ranges from 0 (0b0000 0000) to 255 (0b1111 1111), i.e. 256 values (a small packing/unpacking sketch follows this list).
    The combination of the three primary color components can represent 256x256x256 = 16,777,216 colors, also referred to as 16 million colors or 24-bit color. This range already exceeds all colors distinguishable by the human eye, so it is also called true color; going beyond it is meaningless to the human eye, which cannot tell the difference.
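As a small aside, here is a minimal Kotlin sketch of how the three 8-bit primary color components can be packed into and extracted from a single 32-bit pixel value. The 0xAARRGGBB layout assumed here matches Android's ARGB_8888 Bitmap config, but the snippet itself is just bit arithmetic:

```kotlin
// Minimal sketch: packing/unpacking 8-bit R, G, B components in a 0xAARRGGBB int.
fun packArgb(r: Int, g: Int, b: Int, a: Int = 0xFF): Int =
    (a shl 24) or (r shl 16) or (g shl 8) or b

fun unpackRgb(pixel: Int): Triple<Int, Int, Int> {
    val r = (pixel shr 16) and 0xFF   // red component, 0..255
    val g = (pixel shr 8) and 0xFF    // green component, 0..255
    val b = pixel and 0xFF            // blue component, 0..255
    return Triple(r, g, b)
}

fun main() {
    val pixel = packArgb(255, 128, 0)     // an orange pixel
    println(Integer.toHexString(pixel))   // ffff8000
    println(unpackRgb(pixel))             // (255, 128, 0)
}
```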

Besides the RGB color model, another commonly used color model is the YUV color model, which will be discussed later.

2.2 The nature of video

A video is, in essence, a sequence of images played back continuously. When the images switch at or above a certain speed, the human eye perceives what we recognize as video. That "switching speed" is what we usually call the frame rate. A relaxed, comfortable human eye perceives roughly 24 frames per second (a frame rate of 24), no more than about 30 frames when concentrating, and reportedly somewhat more than 30 frames at the instant the eyes reopen after a blink.

2.2.1 Basic concepts in the video

(Video) frame: a basic concept in video; it represents a single image, like one page of a flip book. A video is made up of many frames.

Frame rate: the number of video frames displayed per unit time (1 second), measured in frames per second (fps). The higher the frame rate, the smoother the picture and the more natural the motion.
Some typical frame rate values:

  • 24/25 fps: 24/25 frames per second, the typical frame rate of movies;
  • 30/60 fps: 30/60 frames per second, typical of games; 30 fps is acceptable, while 60 fps feels noticeably smoother and more realistic;
  • above roughly 85 fps, extra frames are essentially imperceptible to the human eye, so a higher frame rate adds little value for video.

Color space: the value space of pixel color values. Two are commonly used, RGB and YUV; they can also be called color models.
RGB was introduced above; the following mainly introduces the YUV color model (it will come up repeatedly in later articles on video encoding and decoding).

2.2.1.1 YUV color model

Early TVs were black and white, carrying only the luminance value Y. When color TV arrived, two chrominance components, U and V, were added; together they form what we now call YUV, also referred to as YCbCr:

  • Y: luminance, i.e. the gray value; besides the brightness signal it also carries a large share of the green channel;
  • U: the difference between the blue channel and the luminance;
  • V: the difference between the red channel and the luminance.
Advantages of the YUV color model:

The human eye is sensitive to luminance but far less sensitive to chrominance, so discarding part of the UV data is imperceptible. By subsampling the resolution of the U and V components, the size of the video can be reduced without visibly affecting quality.

The conversion formulas between YUV and RGB:

Y = 0.299R + 0.587G + 0.114B
U = -0.147R - 0.289G + 0.436B
V = 0.615R - 0.515G - 0.100B
R = Y + 1.14V
G = Y - 0.39U - 0.58V
B = Y + 2.03U
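
For illustration, here is a minimal Kotlin sketch that applies the formulas above to a single pixel. It is a direct transcription of the coefficients, not a production converter; real pipelines usually operate on subsampled planar data with fixed-point arithmetic.

```kotlin
import kotlin.math.roundToInt

// Minimal sketch: convert one RGB pixel to YUV and back using the formulas above.
fun rgbToYuv(r: Int, g: Int, b: Int): Triple<Double, Double, Double> {
    val y = 0.299 * r + 0.587 * g + 0.114 * b
    val u = -0.147 * r - 0.289 * g + 0.436 * b
    val v = 0.615 * r - 0.515 * g - 0.100 * b
    return Triple(y, u, v)
}

fun yuvToRgb(y: Double, u: Double, v: Double): Triple<Int, Int, Int> {
    val r = (y + 1.14 * v).roundToInt().coerceIn(0, 255)
    val g = (y - 0.39 * u - 0.58 * v).roundToInt().coerceIn(0, 255)
    val b = (y + 2.03 * u).roundToInt().coerceIn(0, 255)
    return Triple(r, g, b)
}

fun main() {
    val (y, u, v) = rgbToYuv(200, 120, 50)
    println(yuvToRgb(y, u, v))   // (200, 120, 50): the round trip recovers the pixel
}
```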

Commonly used YUV pixel formats:
  • YU12(I420/YUV420P)
  • YV12
  • I422(YUV422P)
  • NV12(YUV420SP)
  • NV21 (Android camera (Camera) default output format)

For a detailed introduction to the YUV formats, you can read this article; I won't go into details here.
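
As a small illustration of why 4:2:0 subsampling saves space, here is a hedged Kotlin sketch comparing the buffer size of a 24-bit RGB frame with that of a YUV420 frame. It assumes even width and height and tightly packed planes with no row padding, which real camera buffers do not always guarantee.

```kotlin
// Minimal sketch: frame buffer sizes for RGB24 vs. YUV420 (I420/NV12/NV21 layouts).
// Assumes even width/height and no per-row stride padding.
fun rgb24FrameBytes(width: Int, height: Int): Int =
    width * height * 3                       // 3 bytes (R, G, B) per pixel

fun yuv420FrameBytes(width: Int, height: Int): Int {
    val ySize = width * height               // full-resolution luma plane
    val uvSize = (width / 2) * (height / 2)  // each chroma plane is subsampled 2x2
    return ySize + 2 * uvSize                // Y + U + V = 1.5 bytes per pixel
}

fun main() {
    println(rgb24FrameBytes(1920, 1080))     // 6,220,800 bytes
    println(yuv420FrameBytes(1920, 1080))    // 3,110,400 bytes: half the RGB size
}
```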

2.3 Principles of Video Coding

Coding (encoding): converting information from one form (format) into another form (format) according to a specified method.
Video encoding: converting video from one format into another; in practice this means compressing raw, uncompressed video data into a compressed bitstream.

Coding diagram
The principle of video coding is similar to that of audio coding: both compress by removing redundancy. The difference is that video coding additionally exploits the redundancy in how the human eye perceives moving images.

2.3.1 How big is the unencoded video

Take a video with a resolution of 1920x1080 and a frame rate of 30 as an example. Each frame contains 1920x1080 = 2,073,600 pixels, and each pixel occupies 24 bit (RGB color model, 1 byte = 8 bit per color component), so one frame is 2,073,600 x 24 = 49,766,400 bit = 6,220,800 byte ≈ 6.22 MB. That is the raw size of a single 1920x1080 image; multiply by the frame rate of 30 to get the size per second.
That is to say: the raw video is about 186.6 MB per second, roughly 11 GB per minute, and a 90-minute movie is on the order of 1000 GB (about 1 TB).
Such a huge amount of data has huge requirements on the technology and cost of transmission and storage, so it is necessary to encode and compress the video.
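
To make the arithmetic above easy to reproduce, here is a small Kotlin sketch (using decimal units, 1 MB = 10^6 bytes, to match the figures above):

```kotlin
// Minimal sketch: raw (uncompressed) video size for 24-bit RGB frames.
fun rawVideoBytes(width: Int, height: Int, fps: Int, seconds: Int): Long =
    width.toLong() * height * 3 * fps * seconds   // 3 bytes per pixel (24-bit RGB)

fun main() {
    val perSecond = rawVideoBytes(1920, 1080, fps = 30, seconds = 1)
    val movie = rawVideoBytes(1920, 1080, fps = 30, seconds = 90 * 60)
    println("%.1f MB per second".format(perSecond / 1e6))      // ~186.6 MB
    println("%.0f GB per 90-minute movie".format(movie / 1e9)) // ~1008 GB
}
```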

2.3.2 Redundancy in video images

Video Redundancy Type
Schematic diagram of video redundancy
Redundancy in video images mainly falls into the types shown above: spatial redundancy (neighboring pixels within a frame are strongly correlated), temporal redundancy (consecutive frames are very similar), visual redundancy (details the human eye cannot perceive), and statistical redundancy in the coded symbols. The goal of video coding is to remove as much of this redundancy as possible while preserving image quality as far as possible.
Among these, video coding technology targets spatial redundancy and temporal redundancy first. Next, let's look at how they are removed.

2.3.3 How coding technology works

A video is formed by playing different frames back continuously.
These frames fall mainly into three categories, namely:

  1. I-frames;
  2. B-frames;
  3. P-frames.

I-frame: an independent frame that carries all of its own information; it is the most complete picture (and takes the most space) and can be decoded on its own without referencing any other frame. The first frame of a video sequence is always an I-frame.
B-frame: "bidirectionally predictive coded frame". It is encoded with reference to both the preceding and the following frames, so it achieves the highest compression ratio, which can reach about 200:1. However, because it depends on later frames, it is less suitable for low-latency, real-time transmission (such as video conferencing).
B frame diagram
P frame: "Inter-frame predictive coding frame", which needs to refer to different parts of the previous I frame and/or P frame for coding. P frames have dependencies on previous P and I reference frames. However, the P frame has a relatively high compression rate and takes up less space (compared to the I frame).

P frame diagram

By classifying frames in this way, the size of the video can be compressed dramatically, because what needs to be encoded shrinks from an entire image to regions within an image (a small Android-side sketch of detecting I-frames follows the figure below).

Schematic diagram of frame classification compression
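
On Android, the frame type matters in practice, for example when deciding where playback can seek to. Below is a hedged Kotlin sketch using the MediaExtractor and MediaCodec flag constants; the surrounding extraction/decoding loop is assumed to exist elsewhere.

```kotlin
import android.media.MediaCodec
import android.media.MediaExtractor

// Minimal sketch: checking whether a sample / output buffer is a key frame (I-frame).
// Assumes the extractor and codec have already been configured elsewhere.
fun isSyncSample(extractor: MediaExtractor): Boolean =
    extractor.sampleFlags and MediaExtractor.SAMPLE_FLAG_SYNC != 0

fun isKeyFrame(info: MediaCodec.BufferInfo): Boolean =
    info.flags and MediaCodec.BUFFER_FLAG_KEY_FRAME != 0
```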

Let's look at an example.
Here are two frames:

Comparison of two frames

They look identical, don't they?
Not quite. Turn the two images into a GIF animation and you can see that they differ:

Image Frame Schematic Diagram

The person is moving while the background stays still. The difference between the two frames is as follows:

difference between two frames

In other words, from Figure 1 to Figure 2 only some pixels have moved. Their motion trajectories are as follows:

Pixel movement track

Given the information in Figure 1, you only need these motion trajectories to reconstruct Figure 2. This is motion estimation and compensation.

Motion Estimation and Compensation

Of course, if everything were computed pixel by pixel, the amount of data would still be large. Therefore the image is generally divided into "blocks" or "macroblocks" and the computation is done block by block; a macroblock is typically 16x16 pixels (a sketch of the simplest block-matching search follows the figure below).

macroblock division
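
To make the idea concrete, here is a hedged Kotlin sketch of the simplest form of block matching: an exhaustive (full) search that compares a 16x16 macroblock of the current frame against candidate positions in the previous frame using the sum of absolute differences (SAD). Real encoders use much smarter search strategies; this is only the textbook version, and the frame layout (row-major luma values) is an assumption.

```kotlin
import kotlin.math.abs

const val BLOCK = 16  // macroblock size in pixels

// SAD between the macroblock at (bx, by) in `cur` and the one at (px, py) in `prev`.
// Frames are row-major luma planes of size width x height.
fun sad(cur: IntArray, prev: IntArray, width: Int,
        bx: Int, by: Int, px: Int, py: Int): Int {
    var sum = 0
    for (dy in 0 until BLOCK)
        for (dx in 0 until BLOCK)
            sum += abs(cur[(by + dy) * width + bx + dx] -
                       prev[(py + dy) * width + px + dx])
    return sum
}

// Full search: find the motion vector (dx, dy) within +/-range that minimizes SAD.
fun searchMotionVector(cur: IntArray, prev: IntArray, width: Int, height: Int,
                       bx: Int, by: Int, range: Int = 8): Pair<Int, Int> {
    var best = Int.MAX_VALUE
    var mv = 0 to 0
    for (dy in -range..range) for (dx in -range..range) {
        val px = bx + dx
        val py = by + dy
        if (px < 0 || py < 0 || px + BLOCK > width || py + BLOCK > height) continue
        val cost = sad(cur, prev, width, bx, by, px, py)
        if (cost < best) { best = cost; mv = dx to dy }
    }
    return mv
}
```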

OK, let me summarize. I-frames are processed with intra-frame coding, which exploits only the spatial correlation within that single frame. P-frames are processed with inter-frame coding (forward motion estimation), which exploits spatial and temporal correlation at the same time; put simply, a motion estimation and compensation algorithm is used to remove the redundant information.

Schematic diagram of I/P frame encoding

Note in particular that although intra-frame coding of an I-frame exploits only spatial correlation, the overall coding process is still far from simple.

Schematic diagram of I frame encoding

As shown in the figure above, intra-frame coding still goes through multiple stages such as the DCT (discrete cosine transform), quantization, and entropy coding. Due to space and complexity constraints, I won't explain them in depth here.
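
For readers who want a feel for the transform step, here is a hedged Kotlin sketch of the naive 2D DCT-II on an 8x8 block. This is the textbook O(N^4) formulation; real codecs use fast, integer-approximated transforms.

```kotlin
import kotlin.math.PI
import kotlin.math.cos
import kotlin.math.sqrt

const val N = 8  // transform block size

// Naive 2D DCT-II of an 8x8 block (inputs are usually centered, e.g. pixel - 128).
fun dct8x8(block: Array<DoubleArray>): Array<DoubleArray> {
    fun c(k: Int) = if (k == 0) 1.0 / sqrt(2.0) else 1.0
    val out = Array(N) { DoubleArray(N) }
    for (u in 0 until N) for (v in 0 until N) {
        var sum = 0.0
        for (x in 0 until N) for (y in 0 until N) {
            sum += block[x][y] *
                cos((2 * x + 1) * u * PI / (2.0 * N)) *
                cos((2 * y + 1) * v * PI / (2.0 * N))
        }
        out[u][v] = 0.25 * c(u) * c(v) * sum  // 2/N = 0.25 for N = 8
    }
    return out
}
```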
So, once a video has been encoded and decoded, how do we measure and evaluate the quality of the result?
Generally speaking, evaluation is either objective or subjective. Objective evaluation speaks with numbers, for example by computing the signal-to-noise ratio (SNR) or peak signal-to-noise ratio (PSNR).

SNR comparison chart

I won't walk through the SNR calculation here; I'll just leave the formula for you to study when you have time.

SNR formula
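
For reference, here is a hedged Kotlin sketch of the standard PSNR computation between an original and a reconstructed frame, using the usual definition PSNR = 10 * log10(MAX^2 / MSE) with MAX = 255 for 8-bit samples (this reproduces the common textbook formula, not necessarily the exact one in the image above):

```kotlin
import kotlin.math.log10

// Minimal sketch: PSNR between an original and a reconstructed 8-bit image (same size).
fun psnr(original: IntArray, reconstructed: IntArray): Double {
    require(original.size == reconstructed.size)
    var mse = 0.0
    for (i in original.indices) {
        val d = (original[i] - reconstructed[i]).toDouble()
        mse += d * d
    }
    mse /= original.size
    if (mse == 0.0) return Double.POSITIVE_INFINITY  // identical images
    return 10 * log10(255.0 * 255.0 / mse)           // in dB; higher is better
}
```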

Besides objective evaluation there is subjective evaluation, which relies directly on human perception; in plain terms, "whether it looks good is up to me."

Subjective Compression Evaluation

2.4 Video encoding format

Common video encoding formats and their compression ratios

Development History:

History of Video Coding

The last entry in the development history diagram, HEVC, is the H.265 format we often talk about today.
As a newer coding standard, H.265 offers a large performance improvement over H.264 and has become a standard component of current video coding systems.

Schematic diagram of H265 encoding format
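
On Android, the choice of coding standard shows up directly when configuring a MediaCodec encoder. Here is a hedged Kotlin sketch; the bitrate, frame rate, and I-frame interval values are purely illustrative.

```kotlin
import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Minimal sketch: configuring a hardware video encoder for H.264 (AVC) or H.265 (HEVC).
fun createEncoder(useHevc: Boolean, width: Int, height: Int): MediaCodec {
    val mime = if (useHevc) MediaFormat.MIMETYPE_VIDEO_HEVC
               else MediaFormat.MIMETYPE_VIDEO_AVC
    val format = MediaFormat.createVideoFormat(mime, width, height).apply {
        setInteger(MediaFormat.KEY_COLOR_FORMAT,
                   MediaCodecInfo.CodecCapabilities.COLOR_FormatYUV420Flexible)
        setInteger(MediaFormat.KEY_BIT_RATE, 4_000_000)   // 4 Mbps, illustrative value
        setInteger(MediaFormat.KEY_FRAME_RATE, 30)        // 30 fps
        setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 2)   // an I-frame every 2 seconds
    }
    return MediaCodec.createEncoderByType(mime).apply {
        configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    }
}
```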

2.5 (Audio) Video Encapsulation Format

A video with only images and no sound is clearly not enough, so after video encoding, the encoded audio must be added and the two packaged together.
Encapsulation refers to the container format: simply put, the compressed video track and audio track are placed into a single file according to a defined format. Put even more simply, the video track is the rice, the audio track is the dishes, and the container format is the lunch box that holds the meal.

Currently the main video container formats include MPG, VOB, MP4, 3GP, ASF, RMVB, WMV, MOV, DivX, MKV, FLV, and TS/PS.
The encapsulated file can then be transmitted, or decoded and watched in a video player.
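
On Android, this packaging step is what MediaMuxer does. Below is a hedged Kotlin sketch of muxing already-encoded video and audio into an MP4 file; the encoders that produce videoFormat, audioFormat, and the encoded samples are assumed to exist elsewhere.

```kotlin
import android.media.MediaFormat
import android.media.MediaMuxer

// Minimal sketch: packaging encoded video + audio tracks into an MP4 container.
// videoFormat/audioFormat come from the encoders' output formats (assumed available).
fun muxToMp4(path: String, videoFormat: MediaFormat, audioFormat: MediaFormat,
             writeSamples: (muxer: MediaMuxer, videoTrack: Int, audioTrack: Int) -> Unit) {
    val muxer = MediaMuxer(path, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)
    val videoTrack = muxer.addTrack(videoFormat)   // add compressed video track
    val audioTrack = muxer.addTrack(audioFormat)   // add compressed audio track
    muxer.start()
    writeSamples(muxer, videoTrack, audioTrack)    // caller feeds encoded samples via
                                                   // muxer.writeSampleData(track, buffer, bufferInfo)
    muxer.stop()
    muxer.release()
}
```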

References

  1. What is the resolution of the human eye
  2. How many frames can the human eye see?
  3. Basic Concepts of Video
  4. Detailed explanation of YUV format (I420/YUV420/NV12/NV12/YUV422)
  5. Detailed Video Coding Technology
  6. Talking about the Principle of Video Coding


Source: blog.csdn.net/yonghuming_jesse/article/details/127123394