H.264 basic knowledge and video stream analysis

Table of Contents

  1. H.264 overview
  2. H.264 related concepts
  3. H.264 compression method
  4. H.264 layered structure
  5. H.264 stream structure
  6. H.264 NAL unit
  7. H.264 video stream analysis

1. H.264 overview

  1. Encoding compresses data so that bandwidth and storage are not wasted during transmission.
  2. A simple example shows why encoding is necessary: suppose the monitor is playing a 1280*720 video at a frame rate of 25. Stored uncompressed as YUV 4:2:0, the data produced in one second is: 1280*720 (pixels) * 12 (bits per pixel) * 25 (frames) / 8 (bits per byte) (result: B) / 1024 (result: KB) / 1024 (result: MB) ≈ 32.96 MB. Obviously such a volume of data every second is unacceptable, so the data must be compressed (a quick check in code follows this list).
  3. H.264 sits at the codec level of the video capture-and-output chain. As the figure below illustrates, it is the form the data takes after being compressed by the coding standard between collection and presentation.
    (figure: position of encoding in the capture-to-display pipeline)
  4. A video is composed of individual image frames, for example 25 frames per second, and there are strong similarities between the pixel blocks of neighboring frames, so the frame images can be compressed; H.264 uses a 16*16 block size to compare similar blocks across video frames and compress them. As shown below:
    (figure: 16*16 block comparison between frames)
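A back-of-the-envelope check of the arithmetic above, as a minimal Python sketch (the YUV 4:2:0 / 12-bits-per-pixel figure is our assumption; the original example used a different pixel size):

```python
# Raw data rate of uncompressed 1280x720 video at 25 fps, assuming
# YUV 4:2:0 sampling (12 bits per pixel on average).
bits_per_second = 1280 * 720 * 12 * 25
mb_per_second = bits_per_second / 8 / 1024 / 1024
print(f"{mb_per_second:.2f} MB per second")  # -> 32.96 MB per second
```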

2. H.264 related concepts

1. Sequence

  1. The intuition behind the H.264 coding standard (as I understand it) is this: over a short period of time, adjacent images differ very little in pixels, brightness and color. Therefore we do not need to encode every image of that period as a complete frame. Instead, we can encode the first image of the period completely, and for each following image record only its differences (in pixels, brightness and color) from a previously encoded frame, and so on.
  2. What is a sequence? The set of images that changes little over such a period can be called a sequence. A sequence can be understood as a segment of data with the same characteristics. If some image differs greatly from the previous one, so that it is hard to generate it by referring to previous frames, the current sequence is ended and the next sequence begins: that image is encoded completely, and a new sequence is generated from it.

2. Frame Type

  1. In the H.264 structure, the encoded data of one video image is called a frame. A frame consists of one or more slices, a slice consists of one or more macroblocks (MB), and a macroblock consists of 16x16 pixels of YUV data. The macroblock is the basic unit of H.264 encoding.
  2. H.264 uses both intra-frame compression and inter-frame compression to improve the compression rate;
  3. H.264 uses a characteristic I-frame, P-frame and B-frame strategy to achieve compression across consecutive frames;
    (figure: I, P and B frames in a sequence)
1. I frame
  1. I frame: intra-coded frame. The I frame is a key frame: you can think of it as a complete preservation of the picture; decoding needs only this frame's data (because it contains the complete picture).
  2. I frame features:
    1. It is a full-frame compression-coded frame: the full frame image information is compressed (JPEG-like) and transmitted;
    2. A complete image can be reconstructed from the I frame data alone during decoding;
    3. The I frame describes the details of the image background and the moving subject;
    4. I frames are generated without reference to other pictures;
    5. The I frame is the reference frame for P frames and B frames (so its quality directly affects the quality of subsequent frames in the same group);
    6. The I frame is the base frame (the first frame) of a GOP; there is only one I frame per group;
    7. I frames do not need motion vectors;
    8. An I frame occupies a relatively large amount of data.
2. P frame
  1. P frame: forward predictive coded frame. A P frame records the difference between this frame and a previous key frame (or P frame). When decoding, the difference defined by this frame is superimposed on a previously buffered picture to generate the final picture. (In other words, it is a difference frame: a P frame carries no complete picture data, only the differences from the picture of the previous frame.)
  2. Prediction and reconstruction of a P frame: the P frame takes an I frame as its reference frame; for "a certain point" of the P frame, the encoder finds its predicted value and motion vector in the I frame, and transmits the prediction difference together with the motion vector. At the receiving end, the predicted value of that point is located in the I frame according to the motion vector and added to the difference to obtain the sample value of that point, thereby reconstructing the complete P frame.
  3. P frame features:
    1. A P frame is an encoded frame that follows the I frame by 1 to 2 frames;
    2. The P frame uses motion compensation to transmit its difference from the preceding I or P frame together with the motion vector (prediction error);
    3. During decoding, the predicted value from the reference frame must be summed with the prediction error to reconstruct the complete P frame image;
    4. P frames use forward-predictive inter-frame coding: a P frame refers only to the nearest preceding I frame or P frame;
    5. A P frame can be the reference frame of the P frame after it, or of the B frames before and after it;
    6. Because a P frame is a reference frame, it may cause decoding errors to propagate;
    7. Because only differences are transmitted, P frames compress relatively well.
3. B frame
  1. B frame: bidirectional predictive interpolation coded frame. A B frame is a two-way difference frame: it records the differences between the current frame and both the previous and the following frames (the details are more complicated; there are 4 cases). In other words, to decode a B frame you must obtain both the previously buffered picture and the decoded following picture, and superimpose them with the current frame's data to obtain the final picture. B frames compress very well, but decoding them costs more CPU.
  2. A B frame takes the preceding I or P frame and the following P frame as reference frames: the encoder "finds" the predicted value and two motion vectors for "a certain point" of the B frame, and transmits the prediction difference together with the motion vectors. The receiving end "finds (computes)" the predicted value in the two reference frames according to the motion vectors and sums it with the difference to obtain the sample value of that point, thereby reconstructing the complete B frame.
  3. B frame features:
    1. A B frame is predicted from the preceding I or P frame and the following P frame;
    2. The B frame transmits the prediction error and motion vectors between itself and the preceding I or P frame and the following P frame;
    3. The B frame is a bidirectionally predictive coded frame;
    4. The B frame has the highest compression ratio, because it only reflects the changes of the moving subject between the two reference frames, so the prediction is more accurate;
    5. A B frame is not a reference frame, so it does not propagate decoding errors.

3. GOP (Group of Pictures)

  1. GOP stands for Group of Pictures and refers to the distance between two I frames (the "video sequence" in the figure below is one GOP); Reference (the reference period) refers to the distance between two P frames. A GOP is essentially what the previous section called a sequence: a set of images that changes little over a period of time. For example, with a GOP of 120 at 720p60 there is an I frame every 2 s. An I frame occupies more bytes than a P frame, and a P frame more than a B frame. Therefore, at the same bit rate, the larger the GOP, the more P and B frames there are and the more bytes each I, P and B frame can be given, which makes it easier to obtain good image quality; the larger the Reference, the more B frames, which similarly makes it easier to obtain better image quality.
  2. A GOP structure is usually described by two numbers, for example M=3, N=12. M specifies the distance from an I frame (or P frame) to the next P frame, and N specifies the distance between two I frames. With M=3, N=12 the GOP structure is IBBPBBPBBPBB (the next GOP then begins with a new I frame; see the sketch after this list). Within a GOP, decoding the I frame depends on no other frame, decoding a P frame depends on the preceding I or P frame, and decoding a B frame depends on the preceding I or P frame and the nearest following P frame.
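A minimal Python sketch of the M/N convention just described (the helper gop_pattern is our own illustrative name, not anything from the standard):

```python
def gop_pattern(m: int, n: int) -> str:
    """Display-order frame types of one GOP.

    m: anchor spacing (distance from an I/P frame to the next P frame)
    n: GOP length (distance between two I frames)
    """
    types = []
    for i in range(n):
        if i == 0:
            types.append("I")      # every GOP starts with an I frame
        elif i % m == 0:
            types.append("P")      # every m-th frame is an anchor P frame
        else:
            types.append("B")      # the frames in between are B frames
    return "".join(types)

print(gop_pattern(3, 12))  # -> IBBPBBPBBPBB
```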

4. IDR frame (key frame)

  1. IDR (Instantaneous Decoding Refresh). For convenience in encoding and decoding, the first I frame of a GOP is distinguished from other I frames and is called an IDR frame, which makes it easier to control the encoding and decoding process. So an IDR frame must be an I frame, but an I frame is not necessarily an IDR frame. The function of the IDR frame is to refresh the decoder immediately so that errors cannot propagate; coding restarts from the IDR frame as a new sequence. An ordinary I frame may still be referenced across; an IDR frame may not.
  2. The I frame itself references no other frame, but P and B frames after an ordinary I frame may still reference frames before that I frame. An IDR frame does not allow this, for example:
    (figure: reference relationships around an IDR frame)
  3. Its core function is to resynchronize decoding. When the decoder encounters an IDR picture, it clears the reference frame queue, outputs or discards all already-decoded data, searches for the parameter sets again, and starts a new sequence. This way, if a serious error occurred in the previous sequence, decoding gets a chance to resynchronize here. Pictures after an IDR picture are never decoded using data from pictures before it.
    (figure: decoding resynchronization at an IDR frame)
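A toy sketch of the resynchronization rule above, assuming a simplified decoder that keeps a reference-frame list (the class and method names are ours, purely illustrative):

```python
class ToyDecoder:
    """Illustrates only the IDR rule: nothing before an IDR may be referenced after it."""

    def __init__(self):
        self.reference_frames = []

    def on_frame(self, frame_type: str, frame) -> None:
        if frame_type == "IDR":
            self.reference_frames.clear()        # flush the queue: a new sequence starts here
        if frame_type in ("IDR", "I", "P"):
            self.reference_frames.append(frame)  # frames that later frames may reference
```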

3. H.264 compression method

  1. The core algorithms H.264 uses are intra-frame compression and inter-frame compression. Intra-frame compression is the algorithm that generates I frames; inter-frame compression is the algorithm that generates B and P frames.
  2. Intra-frame compression is also called spatial compression. When compressing a frame, only the data of the current frame is considered, without exploiting the redundancy between adjacent frames, which makes it quite similar to still-image compression. Intra-frame coding generally uses a lossy compression algorithm. Since it encodes a complete image, the result can be decoded and displayed independently. Intra-frame compression generally does not achieve a very high compression ratio, comparable to JPEG encoding.
  3. The principle of inter-frame compression is that adjacent frames are highly correlated, i.e. the information in consecutive frames changes little: continuous video contains redundancy between adjacent frames. Compressing away this redundancy further raises the compression ratio and reduces the data volume. Inter-frame compression is also called temporal compression: it compresses by comparing data between different frames along the time axis. Inter-frame compression can be lossless. The frame differencing algorithm is a typical temporal compression method: it compares the current frame with adjacent frames and records only the differences between them, which can greatly reduce the amount of data. A sketch follows.
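A minimal sketch of the frame-differencing idea, using numpy arrays as frames (note that real H.264 inter coding uses motion-compensated block prediction, not raw per-pixel deltas):

```python
import numpy as np

def encode(frames):
    """Keep the first frame whole; store each later frame as a per-pixel delta."""
    key = frames[0]
    deltas = [cur.astype(np.int16) - prev.astype(np.int16)
              for prev, cur in zip(frames, frames[1:])]
    return key, deltas

def decode(key, deltas):
    """Rebuild every frame by re-applying the deltas in order."""
    frames = [key]
    for d in deltas:
        frames.append((frames[-1].astype(np.int16) + d).astype(np.uint8))
    return frames
```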

1. Description of compression method

  1. Grouping: a series of images with little change is grouped into one set, i.e. a sequence, also called a GOP (Group of Pictures);
  2. Defining frames: the image frames in each group are classified into three types, I frame, P frame and B frame;
  3. Predicting frames: the I frame is used as the base frame, P frames are predicted from the I frame, and B frames are predicted from I and P frames;
  4. Data transmission: finally, the I frame data and the predicted difference information are stored and transmitted.

4. H.264 layered structure

  1. The main goals of H.264 are a high video compression ratio and good network friendliness. To achieve these two goals, H.264 divides the system framework into two layers: the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL). A raw H.264 stream (bare stream) consists of NALUs one after another, as shown below:
    (figure: a raw H.264 stream as a series of NALUs)
  2. The VCL layer defines the core algorithm engine and the syntax at and below the slice level (block, macroblock, slice). It is responsible for representing the video content efficiently, and its final output is the encoded data, the SODB;
  3. The NAL layer defines the syntax above the slice level (such as the sequence parameter set and picture parameter set, described later, which serve network transmission). It is responsible for formatting the data and adding the header information required by the network, in a way that suits transmission over various channels and storage media. The NAL layer packs the SODB into an RBSP and then prepends a NAL header to form a NALU; the exact composition of the NAL unit is described in detail later.
  4. Before the VCL data is transmitted or stored, the encoded VCL data is mapped or encapsulated into NAL units (NALUs).
  5. One NALU = one set of NAL header information corresponding to the video encoding + one Raw Byte Sequence Payload (RBSP).
  6. The basic NALU structure is shown below. A raw H.264 NALU usually consists of three parts: [StartCode] [NALU Header] [NALU Payload]. The start code marks the beginning of a NALU and must be "00 00 00 01" or "00 00 01"; without it, a NALU is essentially just a NAL header + RBSP;
    (figure: [StartCode] [NALU Header] [NALU Payload])
  7. The association between SODB and RBSP:
    1. SODB (String Of Data Bits): the data bit string, i.e. the raw data after encoding;
    2. RBSP (Raw Byte Sequence Payload): the raw byte sequence payload, i.e. the SODB with trailing bits added at the end: one bit of 1 and then several bits of 0 for byte alignment;
  8. The formation process of the RBSP:
    1. If the SODB is empty, the RBSP is empty as well. Otherwise, the first byte of the RBSP is taken from the first eight bits of the SODB (RBSP bytes are filled from the highest bit to the lowest, left to right), and so on: each byte of the RBSP is taken directly from the corresponding bits of the SODB. The last byte of the RBSP contains the final bits of the SODB plus the trailing bits, whose first bit is 1 and remaining bits are 0, guaranteeing byte alignment. In other words, the RBSP is the SODB with a 1 bit appended after its last bit, followed by as many 0 bits as needed to complete the byte (a sketch follows below).
      (figure: SODB to RBSP)
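A minimal Python sketch of the trailing-bits rule just described, representing the SODB as a string of '0'/'1' characters (the representation is our choice):

```python
def sodb_to_rbsp(sodb_bits: str) -> bytes:
    """Append rbsp_stop_one_bit (1) and zero padding to byte-align the SODB."""
    if not sodb_bits:
        return b""                            # empty SODB -> empty RBSP
    bits = sodb_bits + "1"                    # rbsp_stop_one_bit
    bits += "0" * (-len(bits) % 8)            # rbsp_alignment_zero_bit padding
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(sodb_to_rbsp("10110").hex())  # "10110" + "1" + "00" -> b4
```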

5. H.264 stream structure

  1. Before describing the NAL unit in detail, it helps to understand the bitstream structure of H.264. The encoded H.264 stream is shown in the figure below; the concept to take away is that an H.264 stream is composed of NAL units, and SPS, PPS, IDR and SLICE are simply NAL units carrying particular types of data.
    (figure: H.264 stream composed of SPS, PPS, IDR and SLICE NAL units)

6. H.264 NAL unit

1. H.264 NAL structure

  1. In actual network transmission, H.264 data is transmitted as NALUs (NAL units), and each transmitted unit is composed of [NALU Header] + [RBSP], as shown below:
    (figure: [NALU Header] + [RBSP])
  2. From the previous analysis we know that the video frame data encoded by the VCL layer may be I/B/P frames, and those frames may belong to different sequences; each sequence also has its corresponding sequence parameter set and picture parameter set. In summary, to decode a video accurately and without errors, the sequence parameter sets and picture parameter sets must be transmitted in addition to the VCL-encoded video frame data. Thus the RBSP does not hold only the encoded data of I/B/P frames; other kinds of information appear in it as well.
  3. From the above we know that the NAL unit is the basic unit of actual video data transmission. The NALU header identifies what type of data the following RBSP is, and also records whether the RBSP data may be referenced by other frames and whether errors occurred in network transmission. The function and structure of the RBSP, and the data it carries, deserve at least a basic understanding.

2. NAL header

1. The composition of the NAL header

The header of a NAL unit is composed of three fields: forbidden_bit (1 bit), nal_reference_bit (2 bits, priority) and nal_unit_type (5 bits, type), as shown in the figure below:

  1. F (forbidden): the forbidden bit, occupying the first bit of the NAL header. A value of 1 indicates a syntax error;
  2. NRI: takes the values 00~11 (0~3) and indicates the importance of this NALU. For example, a NALU with NRI 00 may be discarded by the decoder without affecting playback; the larger the value, the more important the current NALU and the more it needs to be protected first. If the current NALU belongs to a reference frame, or to a sequence parameter set or picture parameter set, this field must be greater than 0;
  3. Type: the NAL unit data type, identifying the kind of data the NAL unit carries and occupying the fourth to eighth bits (the low five bits) of the NAL header;
    (figure: NAL header bit layout)
2. NAL unit data type
  1. The NAL unit types are listed below; each type plays a particular role;
  2. Example: for the bytes 0x00 00 00 01 67, the header byte 0x67 is 0110 0111 in binary, and its low five bits 00111 = 7 (decimal), i.e. a sequence parameter set (a parsing sketch follows the list below).

The nal_unit_type values (per the H.264 specification):
  0: unspecified
  1: coded slice of a non-IDR picture
  2: coded slice data partition A
  3: coded slice data partition B
  4: coded slice data partition C
  5: coded slice of an IDR picture
  6: supplemental enhancement information (SEI)
  7: sequence parameter set (SPS)
  8: picture parameter set (PPS)
  9: access unit delimiter
  10: end of sequence
  11: end of stream
  12: filler data
  13~23: reserved
  24~31: unspecified
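A minimal sketch of decoding the three header fields from the byte that follows the start code, using the 0x67 example above:

```python
def parse_nal_header(header_byte: int):
    forbidden = (header_byte >> 7) & 0x01  # forbidden bit, must be 0
    nri       = (header_byte >> 5) & 0x03  # importance, 0..3
    nal_type  =  header_byte       & 0x1F  # nal_unit_type, low five bits
    return forbidden, nri, nal_type

print(parse_nal_header(0x67))  # -> (0, 3, 7): NRI 3, type 7 = SPS
```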
  3. Before introducing the NAL data types in detail, note that NAL units are divided into VCL NAL units and non-VCL NAL units.
  4. NAL units whose nal_unit_type is 1, 2, 3, 4, 5 or 12 are called VCL NAL units; all other types are non-VCL NAL units.
  5. Another concept to understand is the parameter set. Parameter sets are NAL units that carry decoding parameters, and they are very important for correct decoding. In lossy transmission scenarios, bits or packets may be lost or corrupted in transit; in such network environments, parameter sets can be sent through a higher-quality service, for example with forward error correction or a priority mechanism. The relationship between parameter sets and the other syntax elements is shown in the figure below:
(figure: relationship between parameter sets and slices)
  6. The more important data types are briefly introduced below:

1. Non-VCL NAL data types:
  1. SPS (Sequence Parameter Set): contains common parameters such as the Profile and Level, the size of the video frames, and the maximum number of reference frames. These parameters are shared by an entire video sequence (program).
  2. PPS (Picture Parameter Set): contains general parameters such as the entropy coding type, the number of valid reference pictures, and initialization parameters. These parameters can apply to a video sequence or to a part of its coded frames.
  3. SEI (Supplemental Enhancement Information): these parameters are transmitted as part of the H.264 bitstream, each SEI message encapsulated in its own NAL unit. SEI may be useful to the decoder but is not required for the basic decoding process.
  4. A parameter set is initially inactive, until it is activated. A PPS is transmitted to the decoder in advance; it is activated when a slice header refers to it, and remains active until a different PPS is activated. An SPS, in turn, is activated when a PPS refers to it. For a coded video sequence that starts with an IDR access unit, one SPS stays active for the whole sequence; an SPS is thus effectively activated by the IDR slice (see the toy sketch below).
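A toy sketch of this activation rule (the class and method names are ours, purely illustrative):

```python
class ParamSetTracker:
    """Toy model of parameter-set activation, not a real decoder component."""

    def __init__(self):
        self.sps = {}        # sps_id -> SPS payload (stored, not yet active)
        self.pps = {}        # pps_id -> (sps_id it references, PPS payload)
        self.active_sps_id = None
        self.active_pps_id = None

    def on_sps(self, sps_id, payload):
        self.sps[sps_id] = payload       # receiving an SPS does not activate it

    def on_pps(self, pps_id, sps_id, payload):
        self.pps[pps_id] = (sps_id, payload)

    def on_slice_header(self, pps_id):
        self.active_pps_id = pps_id                # the slice header activates this PPS
        self.active_sps_id = self.pps[pps_id][0]   # which in turn activates its SPS
```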
2. VCL NAL data types:
  1. The header information partition contains the macroblock types, quantization parameters and motion vectors. This information is the most important, because without it the symbols of the other partitions cannot be used. This partition is called data partition A.
  2. The intra-coded information partition is called data partition B. It contains the intra-coded macroblock types and intra-coded coefficients. For its slice, the usability of partition B depends on partition A. Unlike inter-coded information, intra-coded information can stop error drift from spreading further, so it is more important than the inter-coded information.
  3. The inter-coded information partition is called data partition C. It contains the inter-coded macroblock types and inter-coded coefficients, and it is usually the largest partition of a slice. It is the least important partition: the information it contains does not provide resynchronization between encoder and decoder. The usability of partition C also depends on partition A, but not on partition B.
  4. Each of these three partitions is placed in a separate NAL unit and can therefore be transmitted separately.

3. The relationship between the H.264 NAL unit, slices and macroblocks

  1. Why does the NAL unit carry so many data types, and what exactly is a SLICE? Why not simply encode the raw byte sequence payload directly? To answer this, the concepts of slices and macroblocks, the units into which a frame is subdivided, are best explained here; seeing where these concepts sit also makes the surrounding material easier to place. In short:
    1. 1 frame (one image) = 1~N slices (one can also say that 1 or more slices form a slice group)
    2. 1 slice = 1~N macroblocks (Macroblock)
    3. 1 macroblock = 16*16 pixels of YUV data (raw captured video data); for example, a 1280*720 frame contains (1280/16) * (720/16) = 80 * 45 = 3600 macroblocks
  2. From the data point of view, one original picture can broadly be counted as one frame. A frame contains one or more slice groups, a slice group is composed of slices, and a slice is composed of macroblocks. The blocks within a macroblock can be at the 4*4, 8*8 or 16*16 pixel scale; the relationship between them is shown in the figure below. Each slice is an independent coding unit.
    (figure: frame, slice group, slice, macroblock and sub-block relationship)
  3. From the point of view of what it carries, the NAL unit can hold not only the slice-encoded bitstream but other data as well. That is why data such as SPS and PPS exist, and they play an indispensable role in transmitting the H.264 stream, as described above.
  4. The concepts can therefore be ordered by size: sequence > image (frame) > slice > macroblock > pixel. (There are also concepts such as slice groups and sub-macroblocks; a first pass need not go that deep, and they can be studied later.)
  5. A few points worth spelling out make NAL units easier to understand:
    1. If the FMO (flexible macroblock ordering) mechanism is not used, an image has only one slice group;
    2. If multiple slices are not used, a slice group contains only one slice;
    3. If the DP (data partitioning) mechanism is not used, one slice is one NALU and one NALU is one slice.
      Otherwise, a slice is composed of three NALUs: the data partitions A, B and C mentioned above.
  6. At this point, the overall structure of the stream can be understood from the stream data layering diagram below.
    (figure: bitstream data layering)
  7. As the diagram shows, each slice contains two parts, a header and data. The slice header holds information such as the slice type, the macroblock types within the slice, the slice frame number, and the settings and parameters of the corresponding frame; the slice data consists of the macroblocks, and that is where the pixel data we are after is stored. The macroblock is the main carrier of video information, because it contains the luminance and chrominance information of every pixel. The main job of video decoding is to provide an efficient way of obtaining the pixel arrays of the macroblocks from the bitstream. The composition of macroblock data is shown in the figure below:
    (figure: macroblock data composition)
  8. As the figure above shows, a macroblock contains the macroblock type, the prediction type, the Coded Block Pattern, the Quantization Parameter, the pixel luminance and chrominance data sets, and so on.
  9. At this point, we should have a general understanding of the H.264 stream data structure.
Points to note:
  1. The H.264/AVC standard has strict requirements on the order of NAL units sent to the decoder. If the order of the NAL units is scrambled, they must be reordered according to the specification before being fed to the decoder; otherwise the decoder cannot decode correctly.
  2. A sequence parameter set NAL unit must be transmitted before all other NAL units that reference it, although repeated sequence parameter set NAL units are allowed among those NAL units. "Repeated" means the following: each sequence parameter set NAL unit carries its own identifier, and if two sequence parameter set NAL units have the same identifier, the latter is considered a copy of the former rather than a new sequence parameter set.
  3. A picture parameter set NAL unit must likewise be transmitted before all other NAL units that reference it, and repeated picture parameter set NAL units are allowed in the same way as for sequence parameter set NAL units.

7. H.264 video stream analysis

Introduction to video and audio data processing: H.264 video stream analysis.
A raw H.264 stream (also called a "naked" or elementary stream) is composed of NALUs, one after another. Their structure is shown in the figure below.
(figure: structure of a raw H.264 stream)

Each NALU is separated by a start code, which has two forms: 0x000001 (3 bytes) or 0x00000001 (4 bytes). If the slice carried by the NALU is the beginning of a frame, 0x00000001 is used; otherwise 0x000001.
Analyzing an H.264 stream therefore takes two steps: first search the stream for 0x000001 and 0x00000001 to separate the NALUs, then parse the fields of each NALU. The program in the referenced article implements exactly these two steps; a minimal sketch follows.
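A minimal Python sketch of both steps, assuming a raw Annex-B file named test.h264 (the file name and helper names are ours; the linked article implements the same idea in C):

```python
NAL_TYPE_NAMES = {1: "non-IDR slice", 5: "IDR slice", 6: "SEI",
                  7: "SPS", 8: "PPS", 9: "AUD"}

def iter_nalus(data: bytes):
    """Yield NALU payloads found between 0x000001 / 0x00000001 start codes."""
    start = None
    i = 0
    while i + 2 < len(data):
        if data[i] == 0x00 and data[i + 1] == 0x00 and data[i + 2] == 0x01:
            if start is not None:
                end = i - 1 if data[i - 1] == 0x00 else i  # drop the extra zero of a 4-byte code
                yield data[start:end]
            i += 3
            start = i               # payload begins right after the start code
        else:
            i += 1
    if start is not None:
        yield data[start:]          # the last NALU runs to the end of the file

with open("test.h264", "rb") as f:  # hypothetical input file
    for nalu in iter_nalus(f.read()):
        header = nalu[0]
        nal_type = header & 0x1F
        print(f"NRI={(header >> 5) & 0x03} type={nal_type:2d} "
              f"({NAL_TYPE_NAMES.get(nal_type, 'other')}) size={len(nalu)}")
```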


Reference blog: Getting started with H.264 encoding
Video and audio data processing: H.264 video stream analysis

Origin: https://blog.csdn.net/weixin_41910694/article/details/107661624