H.264 coding: basic knowledge and terminology

As someone who has just started learning audio and video development, I found the material scattered online very confusing: reading the H.264 encoding process directly is hard, and much of the terminology is poorly explained. This chapter therefore collects some basic knowledge and terminology of coding, as a foundation for further study later.

Preface

Why is encoding needed? Suppose the screen is 1280*720 at 24 pictures per second. Then one second of raw video data is:

1280*720 (pixels, counted in bits) * 24 (pictures) / 8 (8 bits per byte -> B) / 1024 (-> KB) / 1024 (-> MB) = 2.64 MB

One minute is over 150 MB (2.64 MB * 60 is about 158 MB), so a compression method is needed to reduce the size of the data.
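To make the arithmetic concrete, here is a minimal Python sketch of the same calculation (the variable names are mine; as in the formula above, each pixel is counted as one bit):

# Raw size of one second of 1280x720 video at 24 pictures per second,
# counting each pixel as one bit, as in the formula above.
width, height, fps = 1280, 720, 24

bits_per_second = width * height * fps           # total bits per second
bytes_per_second = bits_per_second / 8           # -> bytes
mb_per_second = bytes_per_second / 1024 / 1024   # -> MB

print(f"{mb_per_second:.2f} MB per second")       # 2.64 MB
print(f"{mb_per_second * 60:.1f} MB per minute")  # about 158 MB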
On to H.264: it is a current-generation coding standard, known for high compression with high quality, and it supports streaming transmission over many kinds of networks. Its theoretical basis, as I understand it, comes from statistics gathered over runs of images: among several adjacent images, generally no more than 10% of the pixels differ, luminance differs by no more than 2%, and chrominance changes by no more than 1%. Therefore, for pictures that change little, we can first encode a complete image frame A, and then for the following frame B encode not the whole image but only the difference from frame A, so that frame B may be 1/10 or less the size of a complete frame. If frame C after B also changes little, we can continue to encode C with reference to B, and so the cycle continues.

Such a run of images is called a sequence (a sequence is a stretch of data with the same characteristics). When an image differs greatly from the previous one and cannot be generated by referring to previous frames, we end the current sequence and start the next: a complete frame A1 is encoded for this image, and subsequent images are generated with reference to A1, writing only their differences from A1.
This section walks through these concepts. From largest to smallest, the order is: sequence, image, slice group, slice, NALU, macroblock, sub-macroblock, block, pixel.

Principle

An H.264 raw stream (bare stream) is composed of one NALU after another. Functionally it is divided into two layers, the VCL (Video Coding Layer) and the NAL (Network Abstraction Layer):

VCL(Video Coding Layer) + NAL(Network Abstraction Layer).

(1) VCL: contains the core compression engine and the syntax-level definitions for blocks, macroblocks, and slices; its design goal is efficient encoding, as independent of the network as possible;
(2) NAL: responsible for adapting the bit strings generated by the VCL to a variety of networks and environments; it covers all syntactic levels above the slice level.
Before transmission or storage, the encoded VCL data is mapped or encapsulated into NAL units (NALUs).

One NALU = NALU header information (describing the coded video data that follows) + one Raw Byte Sequence Payload (RBSP).

A NALU header plus an RBSP makes up one NALU (NAL Unit), and each unit is transmitted as an independent NALU. The entire structure of H.264 is built on NALUs: if you understand the NALU, you understand the structure of H.264.
An original H.264 NALU is composed of three parts: [StartCode] [NALU Header] [NALU Payload]. NALUs are separated by a start code, which comes in two forms: 0x000001 (3 bytes) or 0x00000001 (4 bytes). If the slice carried by the NALU is the beginning of a frame, 0x00000001 is used; otherwise 0x000001. The first steps of H.264 stream analysis are therefore to search the stream for 0x000001 and 0x00000001 to separate the NALUs, and then to parse the fields of each NALU.
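As a rough illustration of that first step, here is a minimal Python sketch (split_nalus is my own illustrative function, and it ignores real-world details such as emulation prevention bytes) that scans a buffer for both start codes and yields the NALUs between them:

def split_nalus(stream: bytes):
    # Find every 0x000001 / 0x00000001 start code, then yield the bytes
    # between consecutive start codes (the NALU header + payload).
    positions = []
    i = 0
    while i < len(stream) - 2:
        if stream[i:i + 3] == b"\x00\x00\x01":
            # Fold in the extra leading zero of a 4-byte start code.
            start = i - 1 if i > 0 and stream[i - 1] == 0 else i
            positions.append((start, i + 3))
            i += 3
        else:
            i += 1
    for n, (_, data_begin) in enumerate(positions):
        end = positions[n + 1][0] if n + 1 < len(positions) else len(stream)
        yield stream[data_begin:end]

# SPS, PPS and IDR slice back to back (payloads truncated for the example):
buf = (b"\x00\x00\x00\x01\x67\x64"
       b"\x00\x00\x00\x01\x68\xee"
       b"\x00\x00\x01\x65\x88")
for nalu in split_nalus(buf):
    print(hex(nalu[0]))  # NALU header bytes: 0x67 (SPS), 0x68 (PPS), 0x65 (IDR)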
1. NAL Header
The NAL header consists of three fields: forbidden_bit (1 bit), nal_reference_bit (2 bits, the reference priority), and nal_unit_type (5 bits, the type). The VCL layer produces encoded video frame data; these frames may be I, B, or P frames, and they may belong to different sequences, each sequence having its own sequence parameter set, picture parameter set, and so on. To decode the video, therefore, it is not enough to transmit the VCL frame data: the sequence parameter sets, picture parameter sets, and similar data must be transmitted as well. The NALU header identifies what type of data the following RBSP is, according to the nal_unit_type table (not reproduced here). For example:

00 00 00 01 06:  SEI information
00 00 00 01 67:  0x67 & 0x1f = 0x07 : SPS
00 00 00 01 68:  0x68 & 0x1f = 0x08 : PPS
00 00 00 01 65:  0x65 & 0x1f = 0x05 : IDR slice   // an IDR slice serves as a reference frame
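The bit masking shown above can be written out as a small sketch (parse_nal_header and NAL_TYPES are my own illustrative names, and the type table is deliberately reduced to the four values used here):

# A few nal_unit_type values from the table (not the full table).
NAL_TYPES = {5: "IDR slice", 6: "SEI", 7: "SPS", 8: "PPS"}

def parse_nal_header(byte: int):
    # Split the one-byte NALU header into its three fields.
    forbidden_bit = (byte >> 7) & 0x01       # must be 0 in a valid stream
    nal_reference_bit = (byte >> 5) & 0x03   # priority: is this a reference?
    nal_unit_type = byte & 0x1F              # what kind of RBSP follows (5 bits)
    return forbidden_bit, nal_reference_bit, nal_unit_type

for b in (0x06, 0x67, 0x68, 0x65):
    _, ref, t = parse_nal_header(b)
    print(f"0x{b:02x}: nal_unit_type={t} ({NAL_TYPES.get(t, '?')}), nal_reference_bit={ref}")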

2. RBSP
The NALU header identifies what type of data the following RBSP is, whether it will be referenced by other frames, and whether a network transmission error has occurred. The RBSP then stores one of the corresponding payload types, such as a parameter set, SEI, or slice data (the original table is not reproduced here).
3. SODB and RBSP
SODB (String Of Data Bits): the raw encoded data itself.
RBSP (Raw Byte Sequence Payload): the SODB with trailing bits appended after the raw encoded data: one "1" bit followed by as many "0" bits as needed for byte alignment.
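A minimal sketch of that trailing-bit rule, assuming the SODB is given as a string of '0'/'1' characters (add_rbsp_trailing_bits is my own illustrative name):

def add_rbsp_trailing_bits(sodb_bits: str) -> bytes:
    # Append one '1' bit, then as many '0' bits as needed so the
    # total length is a whole number of bytes.
    bits = sodb_bits + "1"
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(add_rbsp_trailing_bits("10110").hex())  # '10110' + '1' + '00' gives b4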

Learning H.264 terminology by way of the NALU


1 frame = n slices
1 slice = n macroblocks
1 macroblock = 16x16 of YUV data

1. Slice
The main body of a NALU carries the slice:

One slice = Slice Header + Slice Data

The slice is a concept introduced by H.264: after a picture is encoded, it is partitioned in an efficient way into slices. A picture has one or more slices, and slices are carried in NALUs and transmitted over the network. However, a NALU is not necessarily a slice (sufficient, but not necessary), because a NALU may also carry other information that describes the video.

So why have slices at all?
The purpose of slices is to limit the spread of bit errors in transmission; coded slices must be independent of one another. Prediction within one slice may not use macroblocks of another slice as references, so that a prediction error in one slice cannot propagate to the others.

Within each image, a number of macroblocks (Macroblock) are arranged into slices. A video image can be coded as one or more slices, each containing an integer number of macroblocks (MB), with at least one macroblock per slice.
There are five types of slices: I slice, P slice, B slice, SP slice, and SI slice.
2. Macroblock
Concept: the macroblock is the main carrier of video information; a coded image is usually divided into many macroblocks. A macroblock contains the luminance and chrominance information of its pixels. The main job of video decoding is to recover, efficiently, the pixel arrays of the macroblocks from the code stream.

One macroblock = one 16*16 block of luma pixels + one 8x8 Cb block + one 8x8 Cr block. (YCbCr is a member of the YUV family: Y is the luma component, Cb the blue-difference chroma component, and Cr the red-difference chroma component.)
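Under this 4:2:0 sampling, the numbers work out as in the following small Python sketch (just the arithmetic, not any real API):

# One macroblock: 16x16 luma samples + 8x8 Cb + 8x8 Cr (4:2:0 sampling).
MB_LUMA = 16 * 16               # 256 Y samples
MB_CHROMA = 8 * 8 * 2           # 64 Cb + 64 Cr samples
MB_BYTES = MB_LUMA + MB_CHROMA  # 384 bytes at 8 bits per sample

width, height = 1280, 720
macroblocks = (width // 16) * (height // 16)  # 80 * 45 = 3600 per picture
print(macroblocks, "macroblocks,", macroblocks * MB_BYTES // 1024, "KB of raw samples per picture")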

In H.264, syntactic elements are organized into five levels: sequence, image, slice, macroblock, and sub-macroblock. This hierarchical structure helps to save bits in the stream. For example, within an image the slices often share the same data; if every slice carried that data itself, bits would inevitably be wasted. It is more effective to extract the information common to the whole image into image-level syntactic elements, and to carry at the slice level only the syntactic elements specific to each slice.
(Figures omitted: the hierarchy of syntactic elements, the syntactic units of the macroblock, and the fill data within a macroblock.)
3. Image, field, and frame
"Image" is a collective concept: a top field, a bottom field, or a frame can all be called an image. The names we are most familiar with in H.264, such as I frame, P frame, and B frame, are concrete refinements of the image concept. In H.264, "frame" usually refers to an image that is not split into fields;

One field or one frame of video can be used to produce one coded image. A frame is usually a complete image. When the video signal is captured with interlaced scanning (alternating odd and even lines), each scanned image is split into two parts, each called a field: in scanning order, the top field and the bottom field.

Coding method    Applicable scope
Frame coding     Still images, or images with little motion
Field coding     Moving images with a lot of motion


4. I, P, B frames and PTS/DTS

Frame type and meaning:

I frame (intra-coded frame, also called an intra picture): the I frame is the key frame; think of it as a complete, self-contained copy of this picture. Decoding needs only this frame's data, since it contains the whole picture.

P frame (forward-predicted frame, also called a predicted frame): it represents the difference between this frame and the previous key frame (or P frame). When decoding, the difference defined by this frame is superimposed on the previously buffered picture to produce the final picture. (In other words, a difference frame: a P frame carries no complete picture, only what differs from the previous frame's picture.)

B frame (bi-directional interpolated prediction frame): a B frame uses the preceding I or P frame and the following P frame as reference frames. The encoder "finds" the predicted value and two motion vectors for each point of the B frame, and transmits the prediction difference and the motion vectors. Using the motion vectors, the receiver "finds (calculates)" the predicted value in the two reference frames and adds the difference to obtain each sample of the B frame, thereby recovering the complete B frame. To decode a B frame, then, both the previously buffered picture and the picture decoded after it are needed; the final picture comes from combining the preceding and following pictures with the current frame's data. B frames compress very well, but the CPU works harder when decoding them.

Features of I frames:
1. It is a full-frame compression-coded frame, compressing and transmitting the complete image information in the manner of JPEG;
2. A complete image can be reconstructed from the I frame's data alone when decoding;
3. The I frame describes the details of the image background and of the moving subject;
4. An I frame is generated without reference to other pictures;
5. The I frame is the reference frame for the following P and B frames (its quality directly affects the quality of the subsequent frames in the same group);
6. The I frame is the base frame of the frame group (GOP), i.e. the first frame; there is only one I frame per group;
7. I frames need no motion vectors;
8. I frames occupy a relatively large amount of data.
Features of P frames:
1. A P frame is a coded frame 1 to 2 frames after an I frame;
2. A P frame transmits, via motion compensation, its difference from the preceding I or P frame together with the motion vectors (the prediction error);
3. During decoding, the predicted value from the I frame (or P frame) is summed with the prediction error to reconstruct the complete P frame image;
4. P frames use forward-predicted inter-frame coding: a P frame refers only to the nearest preceding I or P frame;
5. A P frame may serve as the reference frame for the P frame after it, and for the B frames before and after it;
6. Because a P frame is a reference frame, it can cause decoding errors to spread;
7. Because only differences are transmitted, P frames compress relatively well.
Features of B frames:
1. A B frame is predicted from the preceding I or P frame and the following P frame;
2. A B frame transmits the prediction error relative to both its preceding I or P frame and its following P frame, together with the motion vectors;
3. B frames are bi-directionally predicted coded frames;
4. B frames have the highest compression ratio, because they only reflect how the moving subject changes between the two reference frames, making the prediction more accurate;
5. A B frame is not a reference frame, so it does not cause decoding errors to spread.

Name and meaning:
PTS (Presentation Time Stamp): states when the decoded video frame should be displayed.
DTS (Decode Time Stamp): states when the bit stream in memory should be fed to the decoder for decoding.

The difference between DTS and PTS: DTS is used for video decoding, in the decoding stage; PTS is used for video synchronization and output, in the display stage. When there are no B frames, the decode order and the output order are the same.
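The reordering is easy to see in a small sketch: with B frames, the anchor frame they depend on must be decoded (DTS) before them even though it is displayed (PTS) after them. The frame labels below are illustrative:

# Display (PTS) order of a short run of frames:
display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]

# Decode (DTS) order: each pair of B frames needs the following anchor
# frame first, so the stream carries the anchor ahead of its B frames.
decode_order = ["I0", "P3", "B1", "B2", "P6", "B4", "B5"]

for dts, frame in enumerate(decode_order):
    pts = display_order.index(frame)
    print(f"{frame}: DTS={dts}, PTS={pts}")  # the orders match only without B frames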

5. GOP
GOP stands for group of pictures: a GOP is a group of consecutive pictures, spanning the interval between two I frames. If B frames are used, the last frame of a GOP must be a P frame.
A GOP is usually described by two numbers, for example M=3, N=12: M specifies the distance between an I frame and the following P frame, and N specifies the distance between two I frames. The GOP structure is then:

I BBP BBP BBP BB I

Enlarging the group of pictures effectively reduces the size of the encoded video, but it also lowers video quality; the choice depends on your needs.
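A small sketch of how M and N determine the pattern (gop_pattern is my own illustrative helper, reading M as the anchor spacing and N as the GOP length):

def gop_pattern(m: int, n: int) -> str:
    # One GOP in display order: an I frame, then a P frame at every m-th
    # position, with B frames filling the gaps, for n frames in total.
    return "".join("I" if i == 0 else "P" if i % m == 0 else "B"
                   for i in range(n))

print(gop_pattern(3, 12))  # IBBPBBPBBPBB, and the next GOP starts with I again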

6. IDR
The first image of a sequence is called the IDR image (Instantaneous Decoding Refresh image), and IDR images are always I frames.
Both I and IDR frames use intra prediction. The difference is that frames following an ordinary I frame (P and B frames) may still reference frames before that I frame, whereas after an IDR frame this is not allowed.
For example, in this case:
IDR1 P4 B2 B3 P7 B5 B6 I10 B8 B9 P13 B11 B12 P16 B14 B15, where B8 may reach across I10 and reference P7.

Core function:
H.264 introduced the IDR image so that decoding can resynchronize. When the decoder reaches an IDR image, it immediately clears the reference frame queue, outputs or discards all decoded data, looks up the parameter sets again, and starts a new sequence. That way, if a serious error occurred in the previous sequence, there is an opportunity to resynchronize here. Images after an IDR image are never decoded using data from images before it.
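A toy decoder loop makes this behaviour visible; it is only a sketch of the idea (the Decoder class and frame labels are mine, and real reference management is far more involved):

NAL_IDR = 5  # nal_unit_type of an IDR slice

class Decoder:
    def __init__(self):
        self.reference_frames = []  # simplified reference frame queue

    def on_frame(self, name: str, nal_unit_type: int, nal_reference_bit: int):
        if nal_unit_type == NAL_IDR:
            # IDR: nothing decoded later may reference anything before it,
            # so the whole reference queue is flushed, giving a resync point.
            self.reference_frames.clear()
        if nal_reference_bit > 0:   # only reference frames are kept
            self.reference_frames.append(name)

dec = Decoder()
for name, ntype, ref in [("IDR1", 5, 3), ("P4", 1, 2), ("B2", 1, 0), ("IDR10", 5, 3)]:
    dec.on_frame(name, ntype, ref)
    print(name, "-> reference queue:", dec.reference_frames)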

7. Compression algorithms: intra-frame and inter-frame compression
The core algorithms H.264 uses are intra-frame compression and inter-frame compression. Intra-frame compression is the algorithm that generates I frames; inter-frame compression is the algorithm that generates B and P frames. The encoding flow is:
1. Grouping: several frames of images are gathered into one group (a GOP, i.e. a sequence). To limit the effect of motion, the number of frames in a group should not be too large.
2. Defining frames: each frame in the group is classified as one of three types: I frame, B frame, or P frame.
3. Predicting frames: the I frame serves as the base frame; P frames are predicted from the I frame, and B frames are then predicted from the I frame and the P frames.
4. Data transmission: finally, the I frame data and the prediction differences are stored and transmitted.
Intra-frame compression is also called spatial compression. It compresses a frame using only the data of that frame, without considering redundancy between adjacent frames, and is thus essentially similar to still-image compression. Intra-frame coding generally uses a lossy algorithm. Since an intra-coded frame is a complete image, it can be decoded and displayed on its own. Intra-frame compression does not achieve very high ratios; it is comparable to JPEG encoding.
Inter-frame compression relies on the strong correlation between adjacent frames, in other words on the fact that consecutive frames change little: continuous video carries redundant information between adjacent frames. Compressing away this inter-frame redundancy further increases the overall compression and reduces the data rate. Inter-frame compression is also called temporal compression: it compresses by comparing data between frames along the time axis. Inter-frame compression is generally lossless. The frame differencing algorithm is a typical temporal compression method: it compares the current frame with its adjacent frames and records only the differences between them, which greatly reduces the amount of data.
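A minimal sketch of frame differencing on plain lists of sample values (frame_diff and apply_diff are my own illustrative names):

def frame_diff(prev, curr):
    # Record only the samples of `curr` that differ from `prev`,
    # as (index, new_value) pairs, instead of the whole frame.
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]

def apply_diff(prev, diff):
    # Rebuild the current frame from the previous frame plus the diff.
    frame = list(prev)
    for i, value in diff:
        frame[i] = value
    return frame

prev = [10, 10, 10, 10, 10, 10, 10, 10]
curr = [10, 10, 12, 10, 10, 10, 9, 10]
d = frame_diff(prev, curr)
print(d)  # [(2, 12), (6, 9)], far less data than the full frame
assert apply_diff(prev, d) == curr  # the diff reconstructs the frame losslessly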
Lossy versus lossless compression: lossless compression means that the data after decompression is exactly identical to the data before compression; most lossless compression uses run-length encoding (RLE). Lossy compression means that the decompressed data differs from the original: during compression, image or audio information to which human eyes and ears are not sensitive is discarded, and the lost information cannot be recovered. Almost all high-compression algorithms use lossy compression, which is how low data rates are reached. The amount lost is tied to the compression ratio: the stronger the compression, the more data is lost, and the worse the decompressed result generally looks. In addition, some lossy algorithms lose further data when compression is applied repeatedly.

8. Sequences
In H.264, images are organized in units of sequences. A sequence is the data stream produced by encoding a stretch of images, starting at one I frame and running to the next. The first image in a sequence is called the IDR image.
A sequence is generated by encoding a run of images whose content does not differ too much. When there is little motion, a sequence can be very long, since little motion means the picture content changes very little: one I frame can be encoded, followed by a long run of P and B frames. When there is a lot of motion, a sequence may be short, for example one I frame plus 3 or 4 P frames.

