Audio and video: an introduction to the H.264 format

Terminology:

  • GOP (Group of Pictures) describes the number of frames from one I frame to the next I frame.
    Enlarging the GOP usually reduces the size of the encoded video, because fewer I frames are needed, but it can also reduce the video quality.
  • Encoders typically place the sequence parameter set (SPS) and the picture parameter set (PPS) in front of each GOP, ahead of its first frame.
  • The SPS (Sequence Parameter Set) and the PPS (Picture Parameter Set) contain the parameters used for image coding. The decoder needs them to initialize.
  • An IDR (Instantaneous Decoding Refresh) frame tells the decoder to refresh its reference frame buffer. Frames after an IDR frame never reference frames before it, so the decoder can start decoding from any IDR frame.
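
As a rough illustration of the GOP definition above, the following sketch measures GOP lengths in a toy sequence of frame types. The function name `gop_lengths` and the one-letter frame labels are assumptions made for this example; a real tool would read frame types from the bitstream.

```python
def gop_lengths(frame_types):
    """Return the length of each GOP, treating every I frame as a GOP start.

    frame_types is a sequence of one-letter labels such as "I", "P", "B".
    Frames that appear before the first I frame are ignored.
    """
    starts = [i for i, t in enumerate(frame_types) if t == "I"]
    ends = starts[1:] + [len(frame_types)]
    return [end - start for start, end in zip(starts, ends)]


if __name__ == "__main__":
    # Two GOPs of seven frames each.
    print(gop_lengths("IBBPBBPIBBPBBP"))
```

A longer GOP means fewer I frames per second of video, which is where the size reduction comes from.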

H.264 syntax and semantics

In H.264, syntactic elements are organized into five levels: sequence, image, slice, macroblock, and sub-macroblock.
Each frame is carried in one or more NALUs. If you understand the NALU, you can understand the structure of H.264.

  • Syntactic elements are organized into a hierarchical structure, describing each level of information separately.
    Hierarchical structure of syntactic elements

The hierarchical structure of syntactic elements in previous standards

In such a structure, the header of each layer and its data part form a strong dependency: the header manages the data. The syntax elements of the header are the core of the layer's data, and once the header is lost, the information in the data part can almost never be decoded correctly. This matters most in the sequence layer and the image layer. Because of the MTU (Maximum Transmission Unit) limit in the network, it is impossible to put all the syntax elements of an entire layer into the same packet. If the packet that holds the header is lost, the other packets of that layer cannot be decoded even when they are received correctly, which wastes resources.

The hierarchical structure of syntactic elements in H.264

In H.264, the biggest difference in the hierarchical structure is that the sequence layer and the image layer are eliminated. Most of the syntax elements that originally belonged to the sequence header and the image header are moved out to form two levels of parameter sets, the sequence parameter set and the picture parameter set, and the rest are placed in the slices. A parameter set is an independent data unit and does not depend on any syntax element outside the parameter set.

Streaming model

A simplified code stream structure model

The first image in a sequence is called an IDR image (instantaneous decoding refresh image). Every IDR image is an I image.
H.264 introduces the IDR image to resynchronize decoding. When the decoder reaches an IDR image, it immediately clears the reference frame queue, outputs or discards all decoded data, looks up the parameter sets again, and starts a new sequence.
In this way, if a major error occurred in the previous sequence, such as severe packet loss or data misalignment from some other cause, the decoder can resynchronize here.
An image after the IDR image never references an image before the IDR image.
The difference between an IDR image and an I image: every IDR image is an I image, but an I image is not necessarily an IDR image. A sequence can contain many I images, and images that follow an ordinary I image may still use images that precede it as motion references.
Data unit in H.264 stream

H.264 elementary stream NALU

The elementary stream (ES) structure of H.264 is divided into two layers: the video coding layer (VCL) and the network abstraction layer (NAL).
The VCL is responsible for representing the video content efficiently. It defines the core algorithm engine and the block, macroblock, and slice syntax levels, and it outputs the encoded data as an SODB.
The NAL is responsible for packaging and transmitting the data in the manner the network requires. It wraps the SODB into an RBSP and then adds the NAL header to form a NALU (NAL unit).

  • The benefits of introducing NAL and separating it from VCL include two aspects:

    1. Signal processing is separated from network transmission, so the VCL and the NAL can be implemented on different processing platforms;
    2. Because the VCL and the NAL are designed separately, a gateway between different network environments does not need to reconstruct and re-encode the VCL bitstream.
  • Term introduction
    SODB (String Of Data Bits): the raw data bit stream. Its length is not necessarily a multiple of 8, so it needs padding.
    RBSP (Raw Byte Sequence Payload): the raw data byte stream. RBSP = SODB + RBSP trailing bits. The trailing bits (a single 1 bit followed by as many 0 bits as needed) make the RBSP an integer number of bytes, that is, byte aligned.
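
The relation RBSP = SODB + trailing bits can be sketched in a few lines. This is a minimal illustration only: the SODB is modeled as a string of '0' and '1' characters, and the function name `sodb_to_rbsp` is an assumption made for this example, not part of any library.

```python
def sodb_to_rbsp(sodb_bits: str) -> bytes:
    """Append RBSP trailing bits to an SODB given as a string of '0'/'1'.

    The trailing bits are one stop bit set to 1, followed by zero bits
    until the total length is a multiple of 8.
    """
    if any(ch not in "01" for ch in sodb_bits):
        raise ValueError("sodb_bits must contain only '0' and '1'")
    padded = sodb_bits + "1"              # stop bit
    padded += "0" * (-len(padded) % 8)    # alignment zero bits
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


if __name__ == "__main__":
    # 5 data bits + stop bit + 2 alignment bits = 1 byte
    print(sodb_to_rbsp("10110").hex())
```

Note that an SODB whose length is already a multiple of 8 still gains a full extra byte, because the stop bit is always appended.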

  • The elementary stream of H.264 is composed of a series of NALUs (Network Abstraction Layer Units). Different NALUs carry different amounts of data. A NALU may contain a slice of an IDR image, an SPS, a PPS, a slice of a non-IDR image, and so on.

  • The H.264 standard specifies that when the data stream is stored on a medium, a start code, 0x000001 or 0x00000001, is added in front of each NALU to mark where the NALU begins and ends. Under this mechanism, the decoder scans the code stream for a start code to find the beginning of a NALU. When it finds the next start code, the current NALU ends.
    The 3 or 4 bytes at the beginning of each NALU in the H.264 bitstream are therefore the start code, 0x00000001 or 0x000001.
    Encoders commonly use the 3-byte 0x000001 in only one situation: when a complete frame is encoded into multiple slices, the NALUs that carry the second and later slices use the 3-byte start code. In other words, if the slice in a NALU is the beginning of a frame, the start code is 0x00000001, otherwise it is 0x000001. A parser should accept both forms.

  • A video frame may contain multiple NALUs. In that case the frame can be called a multi-slice video frame (each NALU contains one slice of the frame).
    H.264 stream
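
The start-code mechanism described above can be sketched as a small splitter. This is a simplified illustration, not a full parser: the function name `split_nalus` is an assumption, and the sketch relies on the assumption that a NALU never ends in a zero byte, so zero bytes in front of a start code belong to the start code rather than to the preceding NALU.

```python
START_CODE = b"\x00\x00\x01"


def split_nalus(stream: bytes) -> list:
    """Split a start-code-delimited byte stream into NALUs.

    Both 0x000001 and 0x00000001 are handled: a 4-byte start code shows up
    as a 3-byte start code preceded by one extra zero byte.
    """
    nalus = []
    pos = stream.find(START_CODE)
    while pos != -1:
        begin = pos + len(START_CODE)
        nxt = stream.find(START_CODE, begin)
        end = len(stream) if nxt == -1 else nxt
        # Assumption: trailing zero bytes are start-code padding, not payload.
        nalu = stream[begin:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
        pos = nxt
    return nalus


if __name__ == "__main__":
    data = (b"\x00\x00\x00\x01\x67\x42"   # 4-byte start code
            b"\x00\x00\x00\x01\x68\xce"   # 4-byte start code
            b"\x00\x00\x01\x65\x88")      # 3-byte start code
    for nalu in split_nalus(data):
        print(nalu.hex())
```

The payload bytes in the usage example are made up for illustration and are not decodable parameter sets or slices.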

NALU structure

  • A NALU is a NAL header followed by an RBSP
    NALU

NALU header structure

  • Length: 1 byte = forbidden_zero_bit (1 bit) + nal_ref_idc (2 bits) + nal_unit_type (5 bits)
    F (forbidden_zero_bit): 1 bit, normally 0. When the network detects a bit error in this unit, it can set the bit to 1 so that the receiver discards the unit.
    NRI (nal_ref_idc): 2 bits, indicates the importance of the NALU. The larger the value, the more important the NALU. A value of 0 means the NALU is not used as a reference; for IDR slices, SPS, and PPS the value must be greater than 0.
    Type (nal_unit_type): 5 bits, indicates the type of the NALU.
    nal_unit_type semantics

  • NALU decoding process
    NALU decoding

  • nal_unit_type=5: the current NALU is a slice of an IDR image. Every slice of an IDR image must have nal_unit_type equal to 5. Note that IDR images cannot use data partitioning.

  • nal_unit_type=7 or 8: the NALU is an SPS (7) or a PPS (8). Each SPS or PPS occupies exactly one NALU.
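
The one-byte header layout above can be unpacked with shifts and masks. The sketch below is illustrative only: the function name `parse_nal_header` and the small name table are assumptions for this example, and the table lists only the types mentioned in this article (1, 5, 7, 8).

```python
# Only the types discussed in this article; the full table has more entries.
NAL_TYPE_NAMES = {
    1: "slice of a non-IDR image",
    5: "slice of an IDR image",
    7: "SPS",
    8: "PPS",
}


def parse_nal_header(first_byte: int) -> dict:
    """Split the first byte of a NALU into its three header fields."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must fit in one byte")
    nal_unit_type = first_byte & 0x1F                   # low 5 bits
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,  # top bit
        "nal_ref_idc": (first_byte >> 5) & 0x3,         # next 2 bits
        "nal_unit_type": nal_unit_type,
        "name": NAL_TYPE_NAMES.get(nal_unit_type, "other"),
    }


if __name__ == "__main__":
    for byte in (0x67, 0x68, 0x65, 0x41):
        print(hex(byte), parse_nal_header(byte))
```

For example, 0x67 is 0110 0111 in binary, which gives F = 0, NRI = 3, and type = 7 (SPS).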

RBSP structure

  • A typical sequence of RBSP units is shown in the figure. Each unit is transmitted as an independent NAL unit.
    The header (one byte) of the NAL unit defines the type of the RBSP unit, and the rest of the NAL unit is the RBSP data.
    The figure shows the RBSP sequence only; for transmission each unit is packed into a NAL unit with a NAL header.
    RBSP sequence example

 RBSP description

Slice

  • Slice:
    A frame can be encoded into one or more slices. Each slice contains an integer number of macroblocks: at least one macroblock, and at most all the macroblocks of the image.
    The purpose of slices is to limit the propagation of bit errors by keeping coded slices independent of each other.

  • Syntax structure of the slice
    The slice header specifies the type of the slice, which picture the slice belongs to, the related reference pictures, and so on. The slice data contains a series of coded macroblocks and/or skipped (uncoded) macroblocks. Each macroblock contains a header unit and residual data.
    Grammatical structure

  • Slice type
    For IDR images, slice_type is equal to 2, 4, 7, or 9.
    slice_type semantics
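
The slice_type values can be mapped to names with a short lookup. The sketch assumes the usual reading of the slice_type semantics table, in which values 0 to 4 mean P, B, I, SP, SI and values 5 to 9 repeat the same five types (with the added meaning that all slices of the picture share that type). The function names are assumptions for this example.

```python
SLICE_TYPE_NAMES = ("P", "B", "I", "SP", "SI")


def slice_type_name(slice_type: int) -> str:
    """Map a slice_type value (0..9) to its slice type name."""
    if not 0 <= slice_type <= 9:
        raise ValueError("slice_type must be in the range 0..9")
    return SLICE_TYPE_NAMES[slice_type % 5]


def allowed_in_idr(slice_type: int) -> bool:
    """Return True if slice_type may appear in an IDR image."""
    return slice_type in (2, 4, 7, 9)


if __name__ == "__main__":
    for value in range(10):
        print(value, slice_type_name(value), allowed_in_idr(value))
```

Under this mapping the four values allowed in an IDR image are exactly the I and SI types, which matches the rule that an IDR image contains no inter-predicted slices.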

Macroblock

  • Macroblock:
    A coded image is divided into macroblocks. A macroblock consists of one 16×16 block of luminance pixels plus one 8×8 Cb and one 8×8 Cr block of color pixels. Within each image, the macroblocks are grouped into slices.
    I slices contain only I macroblocks, P slices can contain P and I macroblocks, and B slices can contain B and I macroblocks.
    An I macroblock uses decoded pixels in the current slice as the reference for intra prediction (it cannot use decoded pixels from other slices).
    A P macroblock uses a previously coded image as the reference image for inter-frame prediction. An inter-coded macroblock can be further partitioned into 16×16, 16×8, 8×16, or 8×8 luminance pixel blocks (with the accompanying color pixels). If the 8×8 partition is selected, each 8×8 sub-macroblock can be divided again into 8×8, 8×4, 4×8, or 4×4 luminance pixel blocks (with the accompanying color pixels).
    A B macroblock uses reference images in two directions (previously coded past and future image frames) for inter-frame prediction.
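
To give a feel for the numbers, the following sketch counts the 16×16 macroblocks needed to cover a frame. It assumes that the coded frame is padded up to a whole number of macroblocks in each direction; the function name `macroblock_grid` is an assumption for this example.

```python
MB_SIZE = 16  # a macroblock covers 16x16 luminance pixels


def macroblock_grid(width: int, height: int) -> tuple:
    """Return (columns, rows, total) of macroblocks covering a frame.

    Dimensions that are not multiples of 16 are rounded up, on the
    assumption that the encoder pads the frame to whole macroblocks.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    cols = -(-width // MB_SIZE)   # ceiling division
    rows = -(-height // MB_SIZE)
    return cols, rows, cols * rows


# Pixels per macroblock: 16x16 luminance + 8x8 Cb + 8x8 Cr
SAMPLES_PER_MB = 16 * 16 + 8 * 8 + 8 * 8

if __name__ == "__main__":
    print(macroblock_grid(1920, 1080))
    print(macroblock_grid(1280, 720))
    print(SAMPLES_PER_MB)
```

A 1920×1080 frame needs 68 macroblock rows because 1080 is not a multiple of 16, so the covered height is 1088.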

Slice type and macro block type

Summary

The elementary stream of H.264 is composed of a series of NALUs (Network Abstraction Layer Units). Different NALUs carry different amounts of data. A NALU may contain a slice of an IDR image, an SPS, a PPS, a slice of a non-IDR image, and so on.

In H.264, syntactic elements are organized into five levels: sequence, image, slice, macroblock, and sub-macroblock.
Each frame is carried in one or more NALUs. If you understand the NALU, you can understand the structure of H.264.



Origin blog.csdn.net/u014099894/article/details/108442613