[FFmpeg learning] A detailed summary of the H.264 video encoding format

1. Explanation of some audio and video terms

1. Bit rate

The bit rate is the amount of data a video stream consumes per unit of time. The higher the bit rate, the more data per unit of time is available to describe the picture and the more accurately the data stream represents it; the visible effect is a clearer picture and higher image quality.

It is usually expressed per second. For example, 128 kbps means that 128 kilobits of data are transmitted over the network every second.
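As a quick sanity check, here is a minimal C sketch (the 128 kbps figure is just the example above) converting a bit rate into bytes per second and into the size of a one-minute clip:

```c
#include <stdio.h>

int main(void) {
    /* Example value from the text: a 128 kbps stream. */
    double bitrate_kbps    = 128.0;
    double bytes_per_second = bitrate_kbps * 1000.0 / 8.0; /* kbit -> bit -> byte */
    double one_minute_kb    = bytes_per_second * 60.0 / 1000.0;

    printf("128 kbps = %.0f bytes/s, ~%.0f KB per minute\n",
           bytes_per_second, one_minute_kb);
    return 0;
}
```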

2. Frame rate

The frame rate is the number of frames a video displays per second, usually measured in fps (frames per second).

The frame rate describes how many image frames are shown per unit of time. Ordinary video files generally run between 25 fps and 30 fps, that is, 25-30 images per second; frame rates related to games are usually higher, often above 60 fps. With other parameters fixed, a higher frame rate makes a video or game feel smoother, while a lower frame rate makes it less fluent; below roughly 15 fps the human eye generally perceives noticeable stuttering.

3. Resolution

The resolution is the length and width of the (rectangular) image, i.e. its dimensions in pixels. Resolution is directly proportional to image size: the higher the resolution, the larger the image; the lower the resolution, the smaller the image.

2. The H.264 video encoding format

Why encode and compress:

For video data, the main purpose of video coding is data compression. Dynamic images in raw pixel form represent a huge amount of data, far beyond what available storage space and transmission bandwidth can handle. For example, if each of a pixel's three RGB color components is represented by one byte, each pixel needs at least 3 bytes, and a single image with a resolution of 1280×720 takes about 2.76 MB.

At the same resolution, a frame rate of 25 frames per second already requires a transmission bit rate of about 553 Mb/s! For higher-definition video such as 1080p, 4K, and 8K, the required bit rate is even more staggering. Neither storage nor transmission can afford such a volume of data, so compressing video data is an inevitable choice.
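The arithmetic above can be verified with a small C sketch (the resolution, bytes per pixel, and frame rate are the example values from the text):

```c
#include <stdio.h>

int main(void) {
    /* One 1280x720 RGB image, 3 bytes per pixel. */
    long width = 1280, height = 720, bytes_per_pixel = 3;
    long frame_bytes = width * height * bytes_per_pixel;   /* 2,764,800 bytes */
    double frame_mb  = frame_bytes / 1e6;                  /* ~2.76 MB */

    /* Raw bit rate at 25 frames per second. */
    double bitrate_mbps = frame_bytes * 25.0 * 8.0 / 1e6;  /* ~553 Mb/s */

    printf("frame size: %ld bytes (~%.2f MB)\n", frame_bytes, frame_mb);
    printf("raw bit rate at 25 fps: ~%.0f Mb/s\n", bitrate_mbps);
    return 0;
}
```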

Why video information can be compressed

The reason video information leaves so much room for compression is that it contains a great deal of data redundancy. The main types are:

  1. Temporal redundancy: the content of two adjacent frames is similar and related by motion
  2. Spatial redundancy: adjacent pixels within a single frame are similar to one another
  3. Coding redundancy: different symbols occur in the video data with different probabilities
  4. Visual redundancy: the human visual system is more sensitive to some parts of the video than to others

1. Basic techniques of video compression coding

1). Predictive coding

Predictive coding is used to remove temporal and spatial redundancy in video. In video processing it is mainly divided into two categories: intra-frame prediction and inter-frame prediction.

  • Intra-frame prediction: the predicted value and the actual value are located in the same frame; it is used to remove the spatial redundancy of the image. Its characteristic is a relatively low compression rate, but it can be decoded independently, without relying on data from other frames. The key frames in a video are usually all intra-frame predicted.
  • Inter-frame prediction: the actual value is located in the current frame while the predicted value is located in a reference frame; it is used to remove the temporal redundancy of the image. Its compression rate is higher than that of intra-frame prediction, but it cannot be decoded independently: the current frame can only be reconstructed after the reference frame data has been obtained.

Usually, in a video code stream, all I frames use intra-frame coding, while the data in P and B frames may use either intra-frame or inter-frame coding.
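To make the idea concrete, here is a toy C sketch of DC-style intra prediction on a 4×4 block: the block is predicted from the average of its already-decoded neighbors, and only the small residual would then be transformed and coded. All pixel values are made-up examples, and this is a simplified illustration, not the exact H.264 mode-decision logic:

```c
#include <stdio.h>

/* Toy DC intra prediction: predict a 4x4 block as the rounded mean of the
 * reconstructed pixels above it and to its left, then keep the residual. */
int main(void) {
    int above[4] = {100, 102, 101, 103};  /* decoded pixels above (example) */
    int left[4]  = { 99, 100, 102, 101};  /* decoded pixels to the left (example) */
    int block[4][4] = {                   /* actual pixels to encode (example) */
        {101, 102, 103, 104},
        {100, 101, 102, 103},
        {101, 103, 102, 104},
        {102, 101, 103, 105},
    };

    int sum = 0;
    for (int i = 0; i < 4; i++) sum += above[i] + left[i];
    int dc = (sum + 4) / 8;               /* rounded mean = DC predictor */

    printf("DC predictor: %d, residual:\n", dc);
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++)
            printf("%3d ", block[r][c] - dc);  /* small values, cheap to code */
        printf("\n");
    }
    return 0;
}
```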

2). Transform coding

The current mainstream video coding algorithms are all lossy: by introducing a limited, tolerable loss to the video, they achieve considerably higher coding efficiency. The information loss occurs in the transform and quantization stages. Before quantization, transform coding converts the image information from the spatial domain into the frequency domain.
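H.264 uses a 4×4 integer approximation of the DCT for exactly this step. Below is a minimal C sketch of the forward core transform W = Cf · X · CfT; the normalization scale factors, which the standard folds into quantization, are omitted here, and the residual block values are made-up examples:

```c
#include <stdio.h>

/* Forward 4x4 integer core transform from H.264: W = Cf * X * Cf^T.
 * The scaling that normalizes it is absorbed into quantization (omitted). */
static const int Cf[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

int main(void) {
    int X[4][4] = {   /* a 4x4 prediction residual block (example values) */
        { 5, 4, 4, 3 },
        { 4, 4, 3, 3 },
        { 4, 3, 3, 2 },
        { 3, 3, 2, 2 },
    };
    int T[4][4] = {0}, W[4][4] = {0};

    for (int i = 0; i < 4; i++)          /* T = Cf * X */
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                T[i][j] += Cf[i][k] * X[k][j];

    for (int i = 0; i < 4; i++)          /* W = T * Cf^T */
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                W[i][j] += T[i][k] * Cf[j][k];

    for (int i = 0; i < 4; i++) {        /* energy concentrates at W[0][0] */
        for (int j = 0; j < 4; j++) printf("%5d ", W[i][j]);
        printf("\n");
    }
    return 0;
}
```

Because the transform uses only integer additions, subtractions, and shifts/multiplies, the encoder and decoder compute bit-identical results, which is what eliminates the transform/inverse-transform mismatch discussed below.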

2. H.264 video coding structure

An H.264 codec is mainly divided into five parts: inter-frame and intra-frame prediction (Estimation), transform (Transform) and inverse transform, quantization (Quantization) and inverse quantization, the loop filter (Loop Filter), and entropy coding (Entropy Coding).

During H.264 encoding, each frame image is divided into one or more slices for encoding. Each slice contains multiple macroblocks (MB, Macroblock). The macroblock is the basic coding unit in the H.264 standard; its basic structure contains one 16×16 block of luma pixels and two 8×8 blocks of chroma pixels, plus some other macroblock header information. When a macroblock is encoded, it may be divided into multiple sub-blocks of different sizes for prediction.

The block size used in intra prediction can be 16×16 or 4×4, while the blocks used in inter prediction/motion compensation come in 7 different shapes: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. Compared with earlier standards, which could only perform motion compensation on whole or half macroblocks, the finer macroblock partitioning adopted by H.264 provides higher prediction accuracy and coding efficiency. For transform coding, the transform block size applied to the prediction residual is 4×4 (or 8×8, supported only in the FRExt extension). Compared with earlier standards, which only supported 8×8 transform blocks, H.264's small integer transform avoids the mismatch problem that often occurs between the transform and the inverse transform.

 

H.264 is structured in five levels: sequence, picture, slice, macroblock, and sub-macroblock.

After image encoding

After a frame of picture passes through the H.264 encoder, it is encoded into one or more slices, and the carrier that holds these slices is the NALU.

What is a NALU?

The H.264 raw code stream (also known as a bare or elementary stream) is composed of one NALU after another. Functionally it is divided into two layers: the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL). The VCL data is the output of the encoding process and represents the compressed, encoded video data sequence. The NAL encapsulates the VCL data into individual NALUs. The structure of a NALU is: NAL header + RBSP.
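Here is a minimal C sketch of how the NAL header can be read from an Annex-B byte stream: scan for the 0x000001 start code, then unpack the one-byte NAL header into forbidden_zero_bit, nal_ref_idc, and nal_unit_type. The stream bytes are hand-made dummies, and a real demuxer also handles emulation-prevention bytes, which are skipped here:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Scan an Annex-B H.264 byte stream for start codes (00 00 01) and
 * unpack the one-byte NAL header that follows each of them. */
static void list_nalus(const uint8_t *buf, size_t len) {
    for (size_t i = 0; i + 3 < len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            uint8_t hdr = buf[i + 3];
            printf("NALU at %zu: forbidden_zero_bit=%u nal_ref_idc=%u type=%u\n",
                   i, hdr >> 7, (hdr >> 5) & 0x3, hdr & 0x1F);
            i += 3; /* skip past the start code */
        }
    }
}

int main(void) {
    /* A tiny hand-made example: an SPS (type 7), a PPS (type 8) and an
     * IDR slice (type 5) header, each followed by a dummy RBSP byte. */
    uint8_t stream[] = {
        0, 0, 0, 1, 0x67, 0x42,   /* 4-byte start code + SPS header */
        0, 0, 1, 0x68, 0xCE,      /* 3-byte start code + PPS header */
        0, 0, 1, 0x65, 0x88,      /* IDR slice header */
    };
    list_nalus(stream, sizeof(stream));
    return 0;
}
```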

The main function of a slice is to serve as the carrier of macroblocks (Macroblock).

Slices were created to limit the spread and transmission of bit errors.
How do slices limit the spread of bit errors?
Each slice is transmitted independently of the others, and the prediction within one slice (both its intra-frame and inter-frame prediction) must not use macroblocks (Macroblock) in other slices as references.

 NALU structure

A slice can be divided into a slice header and slice data [as in the figure above], and the slice data is in turn divided into a number of macroblocks [as in the figure below].

The slice data is where the macroblocks live; the macroblocks, in turn, are where the actual pixel data is stored.

A macroblock is the main carrier of video information, because it contains the luminance and chrominance information of the pixels. The main job of video decoding is to provide an efficient way to obtain the pixel arrays of the macroblocks from the code stream.
Components: a macroblock consists of a 16×16 block of luminance pixels plus one additional 8×8 Cb block and one 8×8 Cr block of chrominance pixels. Within each picture, the macroblocks are arranged in the form of slices.

A macroblock thus contains information such as the macroblock type, the prediction type, the Coded Block Pattern, the Quantization Parameter, and the luma and chroma pixel data.

For slices, there are several types:

I slice: contains only I macroblocks. I macroblocks use already-decoded pixels within the current slice as references for intra-frame prediction (decoded pixels in other slices may not be used as intra-prediction references).

P slice: may contain P and I macroblocks. P macroblocks use a previously encoded picture as a reference picture for inter-frame prediction. An inter-coded macroblock can be further divided into macroblock partitions of 16×16, 16×8, 8×16, or 8×8 luma pixels (with the accompanying chroma pixels); if the 8×8 partition is selected, it can be subdivided into sub-macroblock partitions of 8×8, 8×4, 4×8, or 4×4 luma pixels (with the accompanying chroma pixels).

B slice: may contain B and I macroblocks. B macroblocks use bidirectional reference pictures (previously and subsequently coded picture frames) for inter-frame prediction.

SP slice (switching P): used for switching between different coded streams; contains P and/or I macroblocks.

SI slice: a switching slice required in the extended profile; it contains a special type of coded macroblock called the SI macroblock.
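For reference, the slice_type field in the slice header maps onto these five types (Table 7-6 of the H.264 spec); values 5-9 repeat 0-4 and additionally signal that every slice in the picture has the same type. A small C sketch of that mapping:

```c
#include <stdio.h>

/* slice_type values from the H.264 slice header (Table 7-6 of the spec).
 * Values 5-9 mean the same as 0-4 but additionally promise that all
 * slices in the current picture share this type. */
static const char *slice_type_name(int slice_type) {
    static const char *names[] = { "P", "B", "I", "SP", "SI" };
    if (slice_type < 0 || slice_type > 9) return "invalid";
    return names[slice_type % 5];
}

int main(void) {
    for (int t = 0; t <= 9; t++)
        printf("slice_type %d -> %s slice\n", t, slice_type_name(t));
    return 0;
}
```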

With this, the overall structure of the NALU emerges; the figure below is taken from the H.264 specification.

3. I frames, B frames, and P frames

DTS, PTS, GOP

  • DTS: decoding timestamp, i.e. when a frame is decoded
  • PTS: presentation timestamp, i.e. when a frame is displayed
  • GOP: a complete group of I/B/P frame pictures

A GOP (Group of Pictures) is a group of consecutive pictures, as shown in the figure below:

A GOP is generally described by two numbers, for example M = 3, N = 12: M specifies the distance between an I frame and the next P frame, and N specifies the distance between two I frames. The GOP structure is then:

I B B P B B P B B P B B I

Increasing the GOP length can effectively reduce the size of the encoded video, but it also lowers the video quality; the right choice depends on the requirements.

As the figure above shows, the DTS order and the PTS order are not the same. Each GOP starts with an I frame, followed by B and P frames. If the image quality of the leading I frame is relatively poor, it also affects the image quality of the subsequent B and P frames in that GOP.
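A tiny C sketch of why DTS and PTS differ: for a GOP with M = 3 (two B frames per anchor), the display order I B B P ... is reordered into the decode order I P B B ..., because a B frame cannot be decoded until its future anchor frame has arrived. This is a simplified illustration of the reordering on a short made-up GOP, not an actual encoder:

```c
#include <stdio.h>

int main(void) {
    /* Display order of a short example GOP with M = 3. */
    const char display[] = {'I', 'B', 'B', 'P', 'B', 'B', 'P'};
    const int n = sizeof(display);

    printf("PTS order: ");
    for (int i = 0; i < n; i++) printf("%c%d ", display[i], i);

    /* Decode order: each anchor (I/P) is moved ahead of the B frames
     * that reference it, since a B frame needs both neighbors decoded. */
    printf("\nDTS order: ");
    int last_anchor = -1;
    for (int i = 0; i < n; i++) {
        if (display[i] != 'B') {                       /* anchor: emit it now */
            printf("%c%d ", display[i], i);
            for (int j = last_anchor + 1; j < i; j++)  /* then its pending Bs */
                printf("%c%d ", display[j], j);
            last_anchor = i;
        }
    }
    printf("\n");
    return 0;
}
```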

  • I frame (intra picture): an intra-coded frame, which the video decompression algorithm can decode into a single complete picture on its own;
  • B frame (bidirectional): a bidirectionally predicted frame, reconstructed from the data of the previous and following frames plus this frame's own changes;
  • P frame (predictive): a forward-predicted frame, reconstructed by referencing the data of the previous frame.
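Tying this back to FFmpeg: these GOP parameters map directly onto libavcodec encoder settings. Below is a minimal sketch, assuming an FFmpeg 4.x/5.x build with an H.264 encoder such as libx264 available (the bit rate and resolution are example values, and error handling is trimmed), that opens an encoder with N = 12 and two B frames between anchors (M = 3):

```c
#include <stdio.h>
#include <libavcodec/avcodec.h>

int main(void) {
    const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
    if (!codec) { fprintf(stderr, "H.264 encoder not available\n"); return 1; }

    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    ctx->width        = 1280;
    ctx->height       = 720;
    ctx->time_base    = (AVRational){1, 25};   /* 25 fps */
    ctx->framerate    = (AVRational){25, 1};
    ctx->pix_fmt      = AV_PIX_FMT_YUV420P;
    ctx->bit_rate     = 2 * 1000 * 1000;       /* example: 2 Mb/s target */
    ctx->gop_size     = 12;                    /* N = 12: I-frame interval */
    ctx->max_b_frames = 2;                     /* M = 3: two Bs between anchors */

    if (avcodec_open2(ctx, codec, NULL) < 0) {
        fprintf(stderr, "failed to open encoder\n");
        avcodec_free_context(&ctx);
        return 1;
    }
    printf("encoder opened: gop_size=%d max_b_frames=%d\n",
           ctx->gop_size, ctx->max_b_frames);

    avcodec_free_context(&ctx);
    return 0;
}
```

Compile with something like `gcc demo.c -lavcodec -lavutil`; from here, frames would be fed in with avcodec_send_frame() and packets drained with avcodec_receive_packet().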


Source: blog.csdn.net/qq_40587575/article/details/123561005