H.264 encoding analysis--NALU, frames and GOPs, video sequences

Why encode video?

Video encoding is used to compress video. An uncompressed 1-hour 1080p video at 25 frames per second occupies:
1920 x 1080 x 4 x 3600 x 25 = 746,496,000,000 bytes, which is about 700 GB.
A single movie that large would overwhelm both network bandwidth and disk capacity, which is why large media such as video and images are compressed.
H.264 is a classic video coding (compression) standard.
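The arithmetic above is easy to reproduce (Python here; the 4 bytes per pixel assumes an RGBA-style raw format, as in the estimate above):

```python
# Size of 1 hour of uncompressed 1080p video at 25 fps,
# assuming 4 bytes per pixel (e.g. RGBA).
width, height = 1920, 1080
bytes_per_pixel = 4
fps = 25
seconds = 3600

size_bytes = width * height * bytes_per_pixel * fps * seconds
print(size_bytes)                    # 746496000000
print(round(size_bytes / 1024**3))  # 695 -> roughly "700G"
```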

Video compression principle

Video can be compressed so heavily because it contains a large amount of data redundancy. The main types are:
Temporal redundancy: the content of two adjacent frames is similar, and the frames are related by motion
Spatial redundancy: adjacent pixels within a single frame are similar
Coding redundancy: different symbols in the video data occur with different probabilities
Visual redundancy: the human visual system is more sensitive to some components of the video than to others

NALU

From the perspective of storage structure, an H.264 stream is composed of NALUs (Network Abstraction Layer Units).
NALU can be seen as consisting of three parts:

  1. start_code
  2. NALU header
  3. RBSP (Raw Byte Sequence Payload)


start_code

The start code is either three bytes (0x000001) or four bytes (0x00000001) and delimits NALUs. When a start code is read, it marks the end of the previous NALU and the beginning of the next one.
To prevent the payload data (RBSP) from accidentally containing a byte pattern that collides with the start code, H.264 uses emulation_prevention_three_byte.
emulation_prevention_three_byte is a byte equal to 0x03. When it appears inside a NAL unit, the decoding process discards it.
When the raw data contains one of the following three-byte sequences, 0x03 is inserted as the new third byte:
0x000000 -> 0x00000300
0x000001 -> 0x00000301
0x000002 -> 0x00000302
(The standard also escapes 0x000003 the same way, to protect the escape byte itself.) Decoding does the opposite: every 0x03 that follows 0x00 0x00 is removed.
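A minimal sketch of both directions of the escaping logic (the function names are mine, not from the standard; a real encoder would work on a bit writer rather than whole byte strings):

```python
def insert_emulation_prevention(rbsp: bytes) -> bytes:
    """Encoder side: escape 0x000000..0x000003 by inserting 0x03 as the new third byte."""
    out = bytearray()
    zeros = 0
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)            # emulation_prevention_three_byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)

def remove_emulation_prevention(ebsp: bytes) -> bytes:
    """Decoder side: drop the 0x03 of every 0x000003 sequence."""
    out = bytearray()
    zeros = 0
    for b in ebsp:
        if zeros >= 2 and b == 0x03:
            zeros = 0                   # discard the escape byte
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)

print(insert_emulation_prevention(b"\x00\x00\x01").hex())  # 00000301
```

Round-tripping any payload through the two functions returns the original bytes.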

NALU header

The header occupies 1 byte and contains three fields.


forbidden_zero_bit

Occupies 1 bit; H.264 specifies that it must be 0.

nal_ref_idc

Occupies 2 bits and indicates the importance of the NALU: the larger the value, the more important the unit. For SPS, PPS, and IDR NALUs it must not be 0.

nal_unit_type

Occupies 5 bits and specifies the data type of the RBSP. The types of most interest are 1, 5, 6, 7, and 8.
Some common header byte values:
0x41: non-IDR slice, i.e. a P frame or B frame (type 1)
0x65: IDR frame (type 5)
0x67: SPS (type 7)
0x68: PPS (type 8)
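The three fields fall out of the header byte with a few shifts and masks (a sketch; the type-name table below covers only the types discussed here):

```python
NALU_TYPES = {1: "non-IDR slice", 5: "IDR slice", 6: "SEI", 7: "SPS", 8: "PPS"}

def parse_nalu_header(header_byte: int):
    """Split the 1-byte NALU header into its three fields."""
    forbidden_zero_bit = (header_byte >> 7) & 0x01   # must be 0
    nal_ref_idc        = (header_byte >> 5) & 0x03   # importance, 0..3
    nal_unit_type      = header_byte & 0x1F          # RBSP type, 0..31
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type

for byte in (0x41, 0x65, 0x67, 0x68):
    f, ref, typ = parse_nalu_header(byte)
    print(f"0x{byte:02X}: ref_idc={ref} type={typ} ({NALU_TYPES.get(typ, '?')})")
```

Running this prints type 1 for 0x41, type 5 for 0x65, type 7 for 0x67, and type 8 for 0x68, matching the list above.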

RBSP

RBSP (Raw Byte Sequence Payload) is the byte stream obtained by byte-aligning the SODB.
SODB (String Of Data Bits) is the raw payload bit stream.
When the last byte of the SODB uses fewer than 8 bits, a single 1 bit is appended, followed by 0 bits until the byte is full; the result is the RBSP.
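The trailing-bits rule can be illustrated on a bit string (a toy sketch; real encoders operate on bit writers, not strings):

```python
def sodb_to_rbsp(sodb_bits: str) -> bytes:
    """Append a stop bit (1), then 0s until byte-aligned."""
    bits = sodb_bits + "1"               # rbsp_stop_one_bit
    bits += "0" * (-len(bits) % 8)       # alignment zero bits
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(sodb_to_rbsp("10110").hex())  # b4  (10110 -> 10110100)
```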

SPS

nal_unit_type=7
SPS is Sequence parameter set – sequence parameter set.
sequence parameter set: a syntax structure containing syntax elements that apply to zero or more complete coded video sequences. The picture parameter set in use is selected by the syntax element pic_parameter_set_id in the slice header, and the sequence parameter set in use is selected by the syntax element seq_parameter_set_id in that picture parameter set.
[Figure: SPS information of a sample .h264 file]
The common SPS parameters are as follows:
profile_idc and level_idc indicate the profile and level that the bitstream conforms to.

profile_idc

profile_idc indicates the profile used by the encoder, i.e. the set of coding tools the bitstream may rely on. The H.264 standard defines several profile_idc values, each corresponding to a profile such as Baseline (66), Main (77), or High (100). The chosen profile affects the coding efficiency, picture quality, and decoder complexity of the video.

level_idc

level_idc indicates the coding level used by the encoder. The H.264 standard defines multiple level_idc values, each corresponding to a level such as Level 1, Level 1.1, or Level 1.2. Each level constrains parameters such as the maximum resolution, maximum frame rate, maximum bit rate, and maximum buffer size that the encoded video may use. The level chosen by the encoder must fall within the range supported by the given profile_idc.

seq_parameter_set_id

seq_parameter_set_id identifies this sequence parameter set, so that picture parameter sets can refer to it.

max_num_ref_frames

Specifies the maximum number of reference frames used for inter prediction.

pic_width_in_mbs_minus1

pic_width_in_mbs_minus1 plus 1 refers to the width of each decoded picture in macroblock units.

pic_height_in_map_units_minus1

pic_height_in_map_units_minus1 plus 1 gives the height of a decoded frame or field in units of slice group map units.

Calculate video resolution and frame rate

num_units_in_tick and time_scale can be used to calculate the frame rate
https://www.jianshu.com/p/9d7e2a2629ee
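Putting the SPS fields above together (a sketch assuming progressive 4:2:0 video, where frame cropping offsets count 2 luma samples each and one frame spans two clock ticks; function names are mine):

```python
def sps_resolution(pic_width_in_mbs_minus1, pic_height_in_map_units_minus1,
                   frame_mbs_only_flag=1, crop_left=0, crop_right=0,
                   crop_top=0, crop_bottom=0):
    # Macroblocks are 16x16 luma samples; interlaced streams
    # (frame_mbs_only_flag=0) count map units per field pair.
    width  = (pic_width_in_mbs_minus1 + 1) * 16
    height = (2 - frame_mbs_only_flag) * (pic_height_in_map_units_minus1 + 1) * 16
    # Assuming 4:2:0 progressive: crop offsets are in units of 2 luma samples.
    width  -= (crop_left + crop_right) * 2
    height -= (crop_top + crop_bottom) * 2
    return width, height

def sps_frame_rate(num_units_in_tick, time_scale):
    # One frame spans two ticks (the timing clock counts fields).
    return time_scale / (2 * num_units_in_tick)

# 1080p: 120x68 macroblocks, 8 lines cropped off the bottom.
print(sps_resolution(119, 67, crop_bottom=4))  # (1920, 1080)
print(sps_frame_rate(1, 50))                   # 25.0
```

Note that 1080 is not a multiple of 16, which is why 1080p streams code 68 macroblock rows (1088 lines) and crop the bottom 8 lines.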

PPS

nal_unit_type=8
picture parameter set: a syntax structure containing syntax elements that apply to zero or more coded pictures, as selected by the syntax element pic_parameter_set_id found in each slice header.
In H.264/AVC, the Picture Parameter Set (PPS) describes the coding parameters of pictures in a video sequence, such as slice configuration, reference frame selection, and quantization parameters. The PPS tells the decoder how to decode each picture, ensuring correctness and consistency. PPSs are encoded as their own NAL units and can be transmitted and stored independently of other parameter sets; during decoding, the decoder must decode a PPS before it can correctly decode the picture data that refers to it.
[Figure: PPS information of a sample .h264 file]

IDR

nal_unit_type=5
Coded slice of an IDR picture.
The slice syntax structure of IDR and non-IDR pictures is identical.
[Figure: slice syntax structure]

Non-IDR

nal_unit_type=1
Coded slice of a non-IDR picture.

Frames and GOP

I frame

An I frame is an intra-coded frame, also called a key frame (intra-coded (I) frames/slices).
It can be decoded independently, without reference to other pictures. The first frame in a video sequence is always an I frame. It is a full-frame compressed picture: the entire frame's image information is encoded and transmitted, using intra-frame techniques similar in spirit to JPEG.
The I frame is usually the first frame of each GOP. After moderate compression it serves as a reference point for random access and can be treated as a still image. I-frame compression can reach a ratio of about 6:1 without noticeable blurring; it removes the spatial redundancy of the video. The P frames and B frames introduced below remove temporal redundancy.

P frame

A P frame is a forward-predicted frame (predicted (P) frames/slices); decoding it requires a preceding I frame or P frame as a reference.
P frames and B frames are both inter-predicted. A P frame encodes the difference between this frame and a previous key frame (or P frame). When decoding, the difference carried in this frame is superimposed onto the previously cached picture to produce the final picture. In other words, P frames are difference frames: they do not carry complete picture data, only the data that differs from the previous frame's picture.

B frame

A B frame is bi-directionally predicted (bi-directional predicted (B) frames/slices/macroblocks). Decoding it requires both a preceding I or P frame and a following I or P frame.
A B frame is a bidirectional difference frame: it records the differences between this frame and both the previous and following frames. To decode a B frame, the decoder must have both the previous and following reference pictures available, and it superimposes this frame's data on them to obtain the final picture. B frames achieve high compression ratios, but decoding them is more CPU-intensive.
B frames are avoided in scenarios with strict real-time requirements, such as live streaming and real-time calls, because referencing a future frame adds latency.

IDR frame

IDR stands for Instantaneous Decoding Refresh.
An IDR frame is a special I frame. The first picture of a sequence is an IDR picture, and every IDR picture is an I-frame picture. H.264 introduces IDR pictures for decoder resynchronization: when the decoder encounters an IDR picture, it immediately clears the reference frame queue, outputs or discards all decoded data, reads the parameter sets again, and starts a new sequence. This gives the stream a chance to resynchronize after a serious error in the previous sequence. Pictures after an IDR picture are never decoded using data from pictures before it.

The difference between IDR frame and I frame

All IDR frames are I-frames, but not all I-frames are IDR frames. IDR (Instantaneous Decoder Refresh) frames are special I-frames that not only contain a complete picture but also guarantee that no P/B-frame after the IDR references any frame before it.

GOP

A GOP (Group Of Pictures) is a group of consecutive pictures in the encoded video stream. The GOP size is the number of frames between two I frames; for example, in the sequence IBBPBBPBBPBBPBBI, the GOP is 15.
In H.264, pictures are organized into sequences: a sequence is the encoded data stream for a run of pictures whose content differs relatively little. When there is little motion, a sequence can be very long, because low motion means the picture content changes very little: one I frame can be encoded, followed by a long run of P and B frames. When motion is heavy, a sequence may be short, for example one I frame followed by only 3 or 4 P frames.
In a coded video sequence, the GOP is the distance between two I frames, and Reference is the distance between two P frames.
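Counting the GOP size from a frame-type sequence (a toy example; the frame types are written here in display order):

```python
def gop_size(frame_types: str) -> int:
    """Number of frames from one I frame up to (not including) the next."""
    first = frame_types.index("I")
    second = frame_types.index("I", first + 1)
    return second - first

print(gop_size("IBBPBBPBBPBBPBBI"))  # 15
```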

Application of frame types

The video previewed in a surveillance system is real-time and demands smooth playback. Transmitting only I frames and P frames improves network adaptability and reduces decoding cost, so current surveillance video encoding typically uses only I and P frames. For example, Hikvision camera encoding uses an I-frame interval of 50, i.e. each GOP contains one I frame and 49 P frames.

Partitioning of video sequences

coded video sequence: a sequence of access units consisting, in decoding order, of an IDR access unit followed by zero or more non-IDR access units, including all subsequent access units up to but not including the next IDR access unit.
The term coded video sequence appears many times in the H.264 standard.

You can see that the coded video sequence is similar in concept to the GOP.
A video sequence can be viewed as a series of consecutive pictures, or as a series of GOPs.
A video sequence is divided as follows: the sequence is composed of consecutive pictures, and one picture corresponds to one frame.

Slice

One picture can be divided into one or more slices; usually one coded picture corresponds to one slice.
Syntactically, a slice consists of a header and data; the data contains n macroblocks.

Macroblocks

Video decoding ultimately comes down to locating macroblocks in the bitstream and recovering the pixel colors from their luminance and chrominance components.

Subblock

H.264 divides a picture into a series of macroblocks and sub-blocks for compression:
Macroblock: the basic coding unit in H.264, a block of 16x16 pixels. Each macroblock covers a complete image area and carries all three components: Y (luminance), Cb (blue color difference), and Cr (red color difference).
Sub-block: a smaller block obtained by further partitioning a macroblock. H.264 supports several partition sizes, including 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4; the partition size can be adapted to the local complexity of the image for better compression.
By decomposing the picture into macroblocks and sub-blocks, H.264 can better exploit the spatial correlation in the video and produce higher-quality compressed video.
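For example, the number of macroblocks in a picture follows directly from the 16x16 definition (a sketch; heights that are not multiples of 16, such as 1080, are padded up to whole macroblocks and cropped at display time):

```python
import math

def macroblock_count(width: int, height: int) -> int:
    # The coded picture is padded to whole 16x16 macroblocks.
    return math.ceil(width / 16) * math.ceil(height / 16)

print(macroblock_count(1920, 1080))  # 8160  (120 x 68 macroblocks)
```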

reference

  • T-REC-H.264-202108-I!!PDF-E.pdf

Origin blog.csdn.net/ET_Endeavoring/article/details/129812378