Audio and Video Interview Notes: H.264 Annex B

NALU (Network Abstraction Layer Unit)

Audio and video encoding plays a central role in streaming media and networking; the streaming encode/decode pipeline is roughly shown in the figure below:

H264 Introduction

Work on H.264 started in 1999; a draft took shape in 2003, and the standard was subsequently finalized and ratified. In the ITU-T standards it is called H.264; in the MPEG standards it is a part of MPEG-4, namely MPEG-4 Part 10, also known as Advanced Video Coding, so it is often called MPEG-4 AVC or simply AVC.


H264 codec analysis

After a frame of video passes through the H.264 encoder, it is coded into one or more slices, and the carrier of those slices is the NALU. Let's look at the relationship between NALUs and slices.

A slice is not the same concept as a frame. A frame describes a picture, and one frame corresponds to one picture; the slice is a concept newly introduced in H.264, produced by encoding a picture and then partitioning it for efficient transport. A picture has at least one slice, and may have several.

As the figure above shows, slices are carried in NALUs and transmitted over the network. This does not mean, however, that every NALU contains a slice; that is a sufficient but not necessary condition, because a NALU may instead carry other information describing the video.

What is a slice?

The main function of a slice is to serve as a carrier for macroblocks (the concept of a macroblock is introduced below). Slices were created primarily to limit the spread of bit errors.

How to limit the spread and transmission of bit errors?

Each slice should be transmitted independently of the others: the prediction of one slice (both intra prediction and inter prediction within the slice) must not use macroblocks in other slices as a reference.

Let's use a diagram to illustrate the concrete structure of a slice:

To summarize: a picture (frame) can contain one or more slices, and each slice contains an integer number of macroblocks; that is, every slice holds at least one macroblock, and at most the macroblocks of the entire picture.

In the structure above it is easy to see that each slice consists of two parts, a header and data:

  1. The slice header contains the slice type, the macroblock types within the slice, the slice's frame number, which picture the slice belongs to, and the settings and parameters of the corresponding frame.
  2. The slice data contains the macroblocks; this is where the pixel data we are after is stored.

What are macroblocks?

The macroblock is the main carrier of video information, because it contains the luminance and chrominance information of each pixel. The main task of video decoding is to provide an efficient way to recover the macroblocks' pixel arrays from the bitstream.

Components: a macroblock consists of a 16×16 block of luma samples plus one additional 8×8 Cb and one 8×8 Cr chroma block. Within each picture, a number of macroblocks are arranged into slices.
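As a quick sanity check on these dimensions, here is a small hypothetical helper (not from the original article) that counts how many 16×16 macroblocks are needed to cover a frame; note that 1080 is not a multiple of 16, so the encoder pads to 68 macroblock rows:

```c
/* Number of 16x16 macroblocks needed to cover a frame; dimensions that
   are not multiples of 16 are rounded up (the encoder pads the edges). */
int macroblock_count(int width, int height)
{
    int mb_cols = (width  + 15) / 16;
    int mb_rows = (height + 15) / 16;
    return mb_cols * mb_rows;
}
```

For 1920×1080 this gives 120 × 68 = 8160 macroblocks per frame.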

From the figure above you can see that a macroblock contains the macroblock type, prediction type, Coded Block Pattern, Quantization Parameter, and the pixel luma and chroma data sets, among other information.

H264 encoding principle

In audio/video transmission, moving video files around is a huge problem. Consider a video with a resolution of 1920×1080, 3 bytes per pixel for RGB, at a frame rate of 25; its required transmission bandwidth is:

1920 × 1080 × 3 × 25 / 1024 / 1024 ≈ 148.315 MB/s. Converted to bits, that is about 1186.523 Mbps of video bandwidth per second, which is unacceptable for network transmission and storage. Video compression and encoding technology therefore came into being.
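The arithmetic above can be captured in a small illustrative helper (names are my own, not from the article):

```c
/* Raw (uncompressed) bandwidth of an RGB video stream, reproducing the
   calculation above. Returns MiB per second. */
double raw_mbytes_per_sec(int width, int height, int bytes_per_pixel, int fps)
{
    double bytes = (double)width * height * bytes_per_pixel * fps;
    return bytes / 1024.0 / 1024.0;
}

/* The same figure expressed in megabits per second. */
double raw_mbits_per_sec(int width, int height, int bytes_per_pixel, int fps)
{
    return raw_mbytes_per_sec(width, height, bytes_per_pixel, fps) * 8.0;
}
```

Calling `raw_mbytes_per_sec(1920, 1080, 3, 25)` reproduces the ~148.3 MB/s figure quoted above.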

A video file is composed of individual picture frames, for example 25 frames per second; but there are similarities between the pixel blocks of those frames, so the frame images can be compressed. H.264 compares and compresses the frame images for similarity using 16×16 blocks, as shown below:

I-frame, P-frame and B-frame in H264

H.264 uses both intra-frame and inter-frame compression to raise the compression ratio; it uses its distinctive I-frame, P-frame, and B-frame strategy to achieve compression across consecutive frames.

As shown in the figure:

Compression ratio B > P > I

H264 encoding structure analysis

Besides compressing the video, H.264 also provides a corresponding encoding and fragmentation strategy to ease network transmission, similar to encapsulating network data into IP packets. In H.264, the group of pictures (GOP), slices, and macroblocks together form the hierarchical structure of the bitstream; H.264 organizes it into five levels: sequence (GOP), picture, slice, macroblock, and sub-block. The GOP (group of pictures) mainly describes how many frames lie between one IDR frame and the next.

H.264 splits video into consecutive frames for transmission, using I, P, and B frames between consecutive frames. At the same time, within a frame, the image is divided into slices, macroblocks, and sub-blocks for segmented transmission; through this process the video file is compressed and packaged.

IDR (Instantaneous Decoding Refresh, instant decoding refresh)

The first image of a sequence is called an IDR image (immediate refresh image), and IDR images are all I-frame images.

Both I and IDR frames use intra prediction. An I frame does not reference any other frame, but P and B frames that follow an ordinary I frame may still reference frames before that I frame; an IDR frame does not allow this. For example (decoding order):

IDR1 P4 B2 B3 P7 B5 B6 I10 B8 B9 P13 B11 B12 P16 B14 B15. Here B8 may cross over I10 and reference the original picture of P7. Display order: IDR1 B2 B3 P4 B5 B6 P7 B8 B9 I10 …

IDR1 P4 B2 B3 P7 B5 B6 IDR8 P11 B9 B10 P14 B12 B13. Here B9 may only reference IDR8 and P11; it cannot reference any frame before IDR8.

Its core function is decoding resynchronization. When the decoder decodes an IDR picture, it immediately clears the reference frame queue, outputs or discards all decoded data, re-reads the parameter sets, and starts a new sequence. Thus, if a serious error occurred in the previous sequence, the decoder gets a chance to resynchronize here. Pictures after an IDR picture are never decoded using data from pictures before it.
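The flush behaviour just described can be sketched as a toy model (entirely my own illustration, far simpler than a real decoder): when an IDR NALU arrives, the reference queue is cleared before the new frame enters it.

```c
#define MAX_REFS 16

typedef struct {
    int count;
    int frame_nums[MAX_REFS];
} RefQueue;

/* Toy model of IDR resynchronization: an IDR slice (nal_unit_type 5)
   flushes the decoder's reference queue before the new frame is added;
   other slice types simply append. Real decoders are far subtler. */
void on_nalu(RefQueue *q, int nal_unit_type, int frame_num)
{
    if (nal_unit_type == 5)
        q->count = 0;                      /* IDR: clear all references */
    if (q->count < MAX_REFS)
        q->frame_nums[q->count++] = frame_num;
}
```

After an IDR, nothing decoded earlier can leak into later predictions, which is exactly the resynchronization guarantee.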

The following is an example of an H.264 stream (frame analysis of this stream shows that its B frames are not used as reference frames):

I0 B40 B80 B120 P160  I0 B160

SPS, PPS, and frame types

SPS: sequence parameter set. The SPS stores a set of global parameters for a coded video sequence.

PPS: picture parameter set, holding parameters that apply to one picture or to several pictures within a sequence.

I-frame: Intra-coded frame that can be independently decoded to produce a complete picture.

P frame: forward predictive coded frame; it needs to reference a preceding I or P frame to produce a complete picture.

B frame: bidirectional predictive interpolation coded frame; it references the preceding I or P frame and the following P frame to produce a complete picture.

Before an I frame is sent, the SPS and PPS must have been sent at least once.
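That ordering rule can be checked mechanically. Here is a hypothetical validator (my own sketch, not part of any standard API) that takes a list of NALU type codes and verifies that an SPS (type 7) and a PPS (type 8) both appear before the first IDR slice (type 5):

```c
/* Return 1 if SPS (7) and PPS (8) both precede the first IDR slice (5)
   in the list of NALU types, 0 otherwise. */
int sps_pps_before_idr(const int *nalu_types, int n)
{
    int seen_sps = 0, seen_pps = 0;
    for (int i = 0; i < n; i++) {
        if (nalu_types[i] == 7)      seen_sps = 1;
        else if (nalu_types[i] == 8) seen_pps = 1;
        else if (nalu_types[i] == 5) return seen_sps && seen_pps;
    }
    return 1; /* no IDR found, nothing to violate */
}
```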


NALU structure

The raw H.264 bitstream (elementary stream) is composed of one NALU after another. Functionally it is divided into two layers, the VCL (Video Coding Layer) and the NAL (Network Abstraction Layer):

VCL: contains the core compression engine and the syntax-level definitions of blocks, macroblocks, and slices. Its design goal is efficient encoding that is as independent of the network as possible;

NAL: responsible for adapting the bit strings produced by the VCL to various networks and environments; it covers all syntax levels at and above the slice level.

Before VCL data is transmitted or stored, the encoded VCL data is mapped or encapsulated into NAL units.


The structure of a NALU is as follows: an original H.264 NALU usually consists of three parts, [StartCode] [NALU Header] [NALU Payload]. The start code indicates the beginning of a NALU and must be "00 00 00 01" or "00 00 01"; apart from the start code, a NALU is essentially a NAL header plus the RBSP.

(After FFmpeg demultiplexes, packets read from an MP4 file carry no start code, while packets read from a TS file do.)

Parse NALU

Each NAL unit is a variable-length byte string containing syntax elements: a one-byte header (indicating the data type) followed by an integer number of payload bytes.

NALU header information (one byte):

where:

T is the payload data type (nal_unit_type), occupying 5 bits: the type of this NALU; types 1~12 are used by H.264, and 24~31 are used by applications outside H.264.

R is the importance indicator (nal_ref_idc), occupying 2 bits: values 0~3; the larger the value, the more important the current NAL, and the more it needs to be protected. A NALU with nal_ref_idc 0 may be discarded by the decoder without affecting playback of the picture; if the current NAL belongs to a reference frame, or is a sequence parameter set or picture parameter set, this field must be greater than 0.

The last F is the forbidden bit (forbidden_zero_bit), occupying 1 bit: the H.264 specification requires this bit to be 0.

The H.264 standard specifies that when the stream is stored on media, a start code 0x000001 or 0x00000001 is added before each NALU to mark the start and end positions of NALUs:

Under this mechanism, a start code detected in the stream marks the start of a NALU; when the next start code is detected, the current NALU ends.

The 3-byte start code 0x000001 is used in only one situation: when a complete frame is coded into multiple slices, the NALUs containing those slices use the 3-byte start code. In all other cases the 4-byte 0x00000001 is used.
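The start-code scan described above can be written as a short helper (an illustrative sketch of my own, not from the article; real parsers also handle emulation-prevention bytes):

```c
#include <stddef.h>

/* Scan buf for the next Annex B start code (00 00 01 or 00 00 00 01)
   at or after pos. Returns its byte offset and stores the start-code
   length in *sc_len, or returns -1 if no start code is found. */
long find_start_code(const unsigned char *buf, size_t size,
                     size_t pos, int *sc_len)
{
    for (size_t i = pos; i + 3 <= size; i++) {
        if (buf[i] != 0 || buf[i + 1] != 0)
            continue;
        /* Check the 4-byte form first so 00 00 00 01 is not mis-read
           as a 3-byte code one position later. */
        if (i + 4 <= size && buf[i + 2] == 0 && buf[i + 3] == 1) {
            *sc_len = 4;
            return (long)i;
        }
        if (buf[i + 2] == 1) {
            *sc_len = 3;
            return (long)i;
        }
    }
    return -1;
}
```

Calling this in a loop, resuming each scan just past the previous NALU, splits an Annex B elementary stream into NALUs.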

Example: 0x00 00 00 01 67 … 0x00 00 00 01 68 … 0x00 00 00 01 65 … Take 0x67: in binary it is 0110 0111, and the low 5 bits are 00111 = 7 (decimal), i.e. an SPS. Likewise 0x68 gives type 8 (PPS) and 0x65 gives type 5 (IDR slice).
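The bit extraction in that example is easy to write out (a minimal sketch; the struct and function names are my own):

```c
typedef struct {
    int forbidden_zero_bit; /* F: must be 0        */
    int nal_ref_idc;        /* R: importance, 0..3 */
    int nal_unit_type;      /* T: 1..31            */
} NaluHeader;

/* Split the one-byte NALU header into its three fields
   (bit layout, MSB first: F | RR | TTTTT). */
NaluHeader parse_nalu_header(unsigned char b)
{
    NaluHeader h;
    h.forbidden_zero_bit = (b >> 7) & 0x01;
    h.nal_ref_idc        = (b >> 5) & 0x03;
    h.nal_unit_type      = b & 0x1F;
    return h;
}
```

For 0x67 this yields forbidden_zero_bit 0, nal_ref_idc 3, nal_unit_type 7 (SPS), matching the worked example above.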

For NALU parsing, this article mainly focuses on four types: 5 (IDR slice), 6 (SEI), 7 (SPS), and 8 (PPS).

H264 annexb mode

H.264 has two packaging modes:

One is Annex B mode: the traditional mode, with start codes, and with SPS and PPS carried in the elementary stream.

The other is MP4 mode: MP4 and MKV containers generally use it. There is no start code; the SPS, PPS, and other information are stored in the container, and each NALU is prefixed with its length (typically the first 4 bytes).

Many decoders only support Annex B mode, so MP4 streams need conversion: use the h264_mp4toannexb bitstream filter in FFmpeg.

Implementation:

const AVBitStreamFilter *bsfilter = av_bsf_get_by_name("h264_mp4toannexb");
AVBSFContext *bsf_ctx = NULL;
// 2. Initialize the filter context
av_bsf_alloc(bsfilter, &bsf_ctx); // allocates an AVBSFContext
// 3. Copy in the decoder parameters
avcodec_parameters_copy(bsf_ctx->par_in, ifmt_ctx->streams[videoindex]->codecpar);
av_bsf_init(bsf_ctx);
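To make the conversion concrete, here is a minimal dependency-free sketch of the transformation h264_mp4toannexb performs (my own illustration, assuming 4-byte big-endian length prefixes; the real filter also injects SPS/PPS from the container):

```c
#include <string.h>
#include <stddef.h>

/* Convert an MP4/AVCC-style buffer (each NALU prefixed with a 4-byte
   big-endian length) to Annex B (each NALU prefixed with 00 00 00 01).
   out must have room for in_size bytes. Returns bytes written, or -1
   on a truncated NALU. */
long avcc_to_annexb(const unsigned char *in, size_t in_size,
                    unsigned char *out)
{
    static const unsigned char start_code[4] = {0, 0, 0, 1};
    size_t ip = 0, op = 0;
    while (ip + 4 <= in_size) {
        size_t nalu_len = ((size_t)in[ip]     << 24) |
                          ((size_t)in[ip + 1] << 16) |
                          ((size_t)in[ip + 2] <<  8) |
                           (size_t)in[ip + 3];
        ip += 4;
        if (ip + nalu_len > in_size)
            return -1;                     /* truncated NALU */
        memcpy(out + op, start_code, 4);   /* length prefix -> start code */
        memcpy(out + op + 4, in + ip, nalu_len);
        op += 4 + nalu_len;
        ip += nalu_len;
    }
    return (long)op;
}
```

Because the length prefix here is also 4 bytes, the output is the same size as the input; with 3-byte start codes or 2-byte lengths the sizes would differ.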

Supplementary explanation

GOP group of pictures

GOP refers to the interval between two I frames. For example, with a GOP of 120 at 720p60, there is one I frame every 2 seconds. A video coding sequence mainly contains three kinds of coded frames: I frames, P frames, and B frames, as shown below:

  1. An I frame is an Intra-coded picture: it references no other frame and is encoded using only its own information.
  2. A P frame is a Predictive-coded picture: it uses the preceding I or P frame for inter-frame predictive coding by way of motion prediction.
  3. A B frame is a Bidirectionally predicted picture and provides the highest compression ratio. It requires both a preceding image frame (I or P) and a following image frame (P), and performs bidirectional inter-frame predictive coding using motion prediction.

In a video coding sequence, the GOP (group of pictures) is the distance between two I frames, and Reference (the reference period) is the distance between two P frames. An I frame occupies more bytes than a P frame, and a P frame more than a B frame.

Therefore, with the bitrate held constant, the larger the GOP, the more P and B frames there are and the more bytes each I, P, and B frame can occupy on average, making it easier to obtain better image quality; the larger the Reference, the more B frames there are, and likewise better image quality is easier to obtain.

Note, however, that raising the GOP value to improve image quality has limits. When a scene change occurs, the H.264 encoder automatically inserts an I frame, shortening the actual GOP. On the other hand, within a GOP the P and B frames are predicted from the I frame; if the I frame's quality is poor, it degrades the quality of the following P and B frames in that GOP, with no recovery until the next GOP begins, so the GOP value should not be set too large. Also, since P and B frames are more complex than I frames, too many of them hurt coding efficiency. Finally, an overly long GOP slows down Seek operations: because P and B frames are predicted from the preceding I or P frames, seeking directly to a given P or B frame requires first decoding the I frame of its GOP and the N predicted frames before the target. The longer the GOP, the more predicted frames must be decoded and the longer the seek takes.
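The seek-cost argument above can be made concrete with a simplified model (my own illustration, assuming closed GOPs of fixed length that start with an I frame and contain no B-frame reordering; real streams need not satisfy this):

```c
/* Frames that must be decoded to display frame `target` (0-based,
   decode order): everything from the opening I frame of its GOP
   through the target itself. */
int frames_to_decode_for_seek(int target, int gop)
{
    int gop_start = (target / gop) * gop; /* the GOP's opening I frame */
    return target - gop_start + 1;        /* I frame through target    */
}
```

With gop = 25, seeking to frame 10 costs 11 decodes, while seeking to frame 25 (a GOP boundary) costs only 1; the worst case grows linearly with the GOP length.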

I-frame, B-frame and P-frame in H.264

Images in H.264 are organized in units of sequences. A sequence is the data stream produced by encoding a run of images; it begins with an I frame and ends at the next I frame.

IDR image: The first image of a sequence is called an IDR image (immediate refresh image), and IDR images are all I-frame images.

H.264 introduces IDR pictures for decoder resynchronization. When the decoder decodes an IDR picture, it immediately clears the reference frame queue, outputs or discards all decoded data, re-reads the parameter sets, and starts a new sequence. Thus, if a serious error occurred in the previous sequence, the decoder gets a chance to resynchronize here. Pictures after an IDR picture are never decoded using data from pictures before it.

A sequence is the data stream produced by encoding images whose content differs little. When motion is small, a sequence can be very long: little motion means the image content changes very little, so one I frame can be encoded, followed by a long run of P and B frames. When motion is large, a sequence may be short, for example one I frame plus 3 or 4 P frames.

Description of three types of IPB frames

1. I frame: intra-coded frame. The I frame is the key frame; you can think of it as a fully retained picture. Only this frame's own data is needed to decode it (because it contains the complete picture).

I-frame features:

  1. It is a full-frame compression-coded frame, which compresses and transmits the complete image information (JPEG-like);
  2. During decoding, the complete image can be reconstructed using only the I frame's data;
  3. I frames describe the details of the image background and of moving subjects;
  4. I frames are generated without reference to other frames;
  5. The I frame is the reference frame for P and B frames (its quality directly affects the quality of subsequent frames in the same group);
  6. The I frame is the base frame of the GOP (if it is an IDR, it is the first frame). A group has exactly one IDR frame and one or more I frames (IDR frames included);
  7. I frames do not need to consider motion vectors;
  8. The data an I frame occupies carries a relatively large amount of information.

2. P frame

P frame: forward predictive coded frame. A P frame represents the difference between this frame and the preceding key frame (or P frame). To decode, the difference defined by this frame is superimposed on the previously cached picture to produce the final picture. (That is, a difference frame: a P frame does not carry complete picture data, only the changes relative to the previous frame.)

Prediction and reconstruction of a P frame: the P frame takes an I frame as its reference, finds the prediction value and motion vector of "some point" of the P frame in the I frame, and transmits the prediction difference together with the motion vector. At the receiving end, the prediction value of that point is located in the I frame using the motion vector and added to the difference to obtain the sample value of that point in the P frame, thereby recovering the complete P frame.
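The reconstruction step can be sketched in its simplest one-dimensional form (a toy of my own; real H.264 does this per block, in two dimensions, with sub-pel interpolation):

```c
/* Toy P-frame reconstruction of one sample: fetch the prediction from
   the reference row at the motion-compensated position, then add the
   transmitted residual. */
int predict_p_sample(const int *ref_row, int x, int mv_x, int residual)
{
    return ref_row[x + mv_x] + residual;
}
```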

P-frame features:

  1. A P frame is a coded frame that follows an I frame, 1~2 frames apart;
  2. A P frame uses motion compensation to transmit the difference and the motion vector (prediction error) between it and the preceding I or P frame;
  3. During decoding, the prediction value from the reference frame and the prediction error must be summed to reconstruct the complete P frame image;
  4. P frames use forward-predictive inter coding; a P frame only references the closest preceding I or P frame;
  5. A P frame can be the reference frame of a following P frame, or of the B frames before and after it;
  6. Because a P frame is a reference frame, it can cause decoding errors to propagate;
  7. Thanks to differential transmission, the compression of P frames is relatively high.

3. B frame

B frame: bidirectional predictive interpolation coded frame. A B frame is a bidirectional difference frame; that is, it records the differences between this frame and both the previous and the following frame (the specifics are more complicated, with four cases, but putting it this way keeps it simple). In other words, decoding a B frame requires not only the previously cached picture but also the decoded following picture; the final picture is obtained by combining the previous and following pictures with this frame's data. B frames compress very well, but decoding them is CPU-intensive.

B-frame prediction and reconstruction

A B frame takes the preceding I or P frame and the following P frame as its reference frames, "finds" the prediction value and two motion vectors of "some point" of the B frame, and transmits the prediction difference together with the motion vectors. The receiving end "finds (computes)" the prediction value in the two reference frames using the motion vectors and sums it with the difference to obtain the sample value of that point, thereby recovering the complete B frame.
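The bidirectional combination can likewise be sketched for a single sample (a toy of my own; real H.264 also supports weighted prediction and per-block reference choices):

```c
/* Toy bidirectional reconstruction: average the forward and backward
   motion-compensated predictions (with rounding), then add the
   transmitted residual. */
int predict_b_sample(int fwd_pred, int bwd_pred, int residual)
{
    return (fwd_pred + bwd_pred + 1) / 2 + residual;
}
```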

B-frame features

1) The B frame is predicted from the previous I or P frame and the following P frame;

2) The B frame transmits the prediction error and motion vector between it and the previous I or P frame and the following P frame;

3) B frame is a bidirectional predictive coding frame;

4) The B frame has the highest compression ratio, because it only reflects the changes of the moving subject between the two reference frames, and the prediction is more accurate;

5) The B frame is not a reference frame and will not cause the spread of decoding errors.

Note: the I, B, and P frames are defined by the needs of the compression algorithm, but they are all real physical frames. Generally speaking, the compression ratio of an I frame is about 7:1 (similar to JPEG), a P frame about 20:1, and a B frame can reach 50:1. Clearly, using B frames saves a lot of space, and the saved space can be used to store more I frames, providing better image quality at the same bitrate.
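Using those ballpark ratios, a back-of-the-envelope GOP size estimate looks like this (purely illustrative arithmetic, not a rate-control model):

```c
/* Rough size of one GOP given a raw frame size and frame counts,
   using the ballpark compression ratios quoted above
   (I ~7:1, P ~20:1, B ~50:1). */
double gop_bytes(double raw_frame_bytes, int n_i, int n_p, int n_b)
{
    return n_i * raw_frame_bytes / 7.0
         + n_p * raw_frame_bytes / 20.0
         + n_b * raw_frame_bytes / 50.0;
}
```

For example, swapping a P frame for a B frame in a GOP saves raw/20 - raw/50 = 3% of a raw frame, which is where the "saved space" in the paragraph above comes from.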

The six-layer structure of the H.264 bitstream (very important)

I originally wanted to draw this diagram myself, but it would be largely the same as the versions already online, so I quote one directly.

The first, second, and third layers in the figure below are very important; you will encounter them during development.

Each group of pictures (GOP) of an H.264-encoded video is transmitted together with its sequence parameters (SPS) and picture parameters (PPS). So the overall structure looks like this:

In H.264, syntax elements are organized into five levels: sequence, image, slice, macroblock, and sub-macroblock.



Origin blog.csdn.net/m0_60259116/article/details/132741804