Notes: Principles, Standards and Implementation of New Generation High Efficiency Video Coding H.265/HEVC

Chapter 1 Introduction

A video image is described by three basic color components, or equivalently by luma and chroma components.

The number of frames played per second is called the frame rate, measured in fps. For the human eye to perceive smooth, continuous motion, the frame rate of the video needs to be above 25-30 fps.

H.265/HEVC introduces new coding techniques in almost every module:

1. Intra prediction

2. Inter prediction

3. Transform and quantization

4. Deblocking filtering

5. Sample adaptive offset (SAO): applied after the deblocking filter. By analyzing the statistical characteristics of the deblocked pixels and adding a corresponding offset value to each pixel, it can weaken the ringing effect to a certain extent.

6. Entropy coding

Featured Coding Techniques:

1. Coding units: the coding tree unit (CTU) and the coding tree block (CTB)

2. Improved intra prediction

3. Advanced inter prediction

4. RQT (residual quad-tree transform): a quadtree-based adaptive transform

5. ACS (adaptive coefficient scanning): diagonal, horizontal, and vertical scans

6. SAO (sample adaptive offset)

7. IBDI (internal bit depth increase): at the encoder input, the pixel depth of the uncompressed image is increased from P bits to Q bits (Q > P), and at the output the pixel depth of the decompressed image is restored from Q bits back to P bits. IBDI improves the internal arithmetic accuracy of the encoder and reduces the error of intra/inter prediction.
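
As an illustration of the P-to-Q-bit expansion and restoration, here is a minimal Python sketch; the shift-based mapping and the function names are assumptions for illustration, not the normative HEVC procedure.

```python
import numpy as np

def ibdi_expand(pixels, p_bits=8, q_bits=10):
    """Widen P-bit samples to Q-bit samples (Q > P) by a left shift."""
    return pixels.astype(np.int32) << (q_bits - p_bits)

def ibdi_restore(pixels, p_bits=8, q_bits=10):
    """Restore Q-bit samples back to P bits with rounding."""
    shift = q_bits - p_bits
    return (pixels + (1 << (shift - 1))) >> shift

frame = np.array([[16, 128, 235]], dtype=np.uint8)
internal = ibdi_expand(frame)      # 10-bit values used inside the encoder
restored = ibdi_restore(internal)  # back to 8 bits at the output
print(internal, restored)
```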

Chapter 2 Digital Video Formats

Temporal resolution: the number of image frames per second, i.e., the frame rate.

Spatial resolution: the number of pixel rows in the image and the number of pixels per row. The higher the spatial resolution, the clearer the image details.

Color space: Also known as a color model.

(1) RGB color space: all three RGB components are closely tied to brightness; whenever the brightness changes, the three components all change accordingly, which makes RGB inconvenient for image processing.

(2) YUV color space: Y represents luminance, i.e., the grayscale value, while U and V represent chrominance and specify the color of the pixel. The luminance is formed by weighting the R, G and B input signals in specific proportions; the chrominance U reflects the difference between the blue component of the RGB input and the luminance, and the chrominance V reflects the difference between the red component of the RGB input and the luminance.

(3) YCbCr color space: the digital, scaled-and-offset form of YUV used in video coding; Cb and Cr are the blue-difference and red-difference chroma components.
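
As a concrete example of the luma/chroma-difference relationship in (2) and (3), here is a minimal sketch converting RGB to YCbCr with full-range BT.601 coefficients (the coefficient choice is an assumption of this sketch; BT.709 and limited-range variants use different values):

```python
def rgb_to_ycbcr_bt601(r, g, b):
    """Full-range BT.601 RGB -> YCbCr conversion (for illustration)."""
    y  = 0.299 * r + 0.587 * g + 0.114 * b   # luma: weighted sum of R, G, B
    cb = 0.564 * (b - y) + 128               # blue-difference chroma
    cr = 0.713 * (r - y) + 128               # red-difference chroma
    return y, cb, cr

print(rgb_to_ycbcr_bt601(255, 0, 0))   # pure red: large Cr, small Cb
```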

Chapter 3 Coding Structure

3.1 Coding Structure Description

In video coding there are two types of GOP: closed GOP and open GOP. In a closed GOP, each GOP starts with an IDR picture and each GOP is encoded and decoded independently. In an open GOP, the first intra-coded picture of the first GOP is an IDR picture, while the first intra-coded picture of subsequent GOPs is a non-IDR picture; coded pictures that follow it may skip over the non-IDR picture and use coded pictures of the previous GOP as reference pictures.

Each GOP is divided into multiple slices, and slices are encoded and decoded independently of each other. One main purpose of slices is resynchronization in case of data loss. Each slice consists of one or more slice segments (SS).

Each CTU includes one luma coding tree block and two chroma coding tree blocks. During coding, an SS is first divided into equal-sized CTUs, and each CTU is then divided into coding units (CUs) of different types by quadtree partitioning. The CTU is thus the coding structure between the slice level and the CU level.

Hierarchy: GOP -> SS -> CTU -> one luma CTB + two chroma CTBs

Bitstream structure: most of the syntax elements shared by the GOP layer and the slice layer are separated out to form the sequence parameter set (SPS) and the picture parameter set (PPS). The SPS contains information shared by all pictures in a CVS (coded video sequence).

The SPS roughly contains decoding-related information, such as profile and level, resolution, switches and design parameters of the coding tools allowed in a given profile, temporal scalability information, and so on.

The PPS contains the common parameters used by one picture; that is, all SSs in a picture refer to the same PPS.

The syntax structure of H.265 adds the video parameter set (VPS), whose content roughly includes syntax elements shared by multiple sub-layers and other specific information that does not belong in the SPS.

An SS references the PPS it uses, the PPS references its corresponding SPS, and the SPS references the VPS it uses; following this chain yields all the shared information for the SS.
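
A minimal data-structure sketch of this reference chain; the class and field names are hypothetical and only illustrate the SS -> PPS -> SPS -> VPS lookup:

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified parameter-set records (real VPS/SPS/PPS carry many more fields).
@dataclass
class VPS:
    vps_id: int

@dataclass
class SPS:
    sps_id: int
    vps_id: int

@dataclass
class PPS:
    pps_id: int
    sps_id: int

vps_table = {0: VPS(vps_id=0)}
sps_table = {0: SPS(sps_id=0, vps_id=0)}
pps_table = {0: PPS(pps_id=0, sps_id=0)}

def resolve_parameter_sets(ss_pps_id: int):
    """Follow the SS -> PPS -> SPS -> VPS reference chain to collect shared parameters."""
    pps = pps_table[ss_pps_id]       # the SS header carries the PPS id
    sps = sps_table[pps.sps_id]      # the PPS references its SPS
    vps = vps_table[sps.vps_id]      # the SPS references its VPS
    return pps, sps, vps

print(resolve_parameter_sets(0))
```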

A parameter set is an independent data unit that contains information shared by coding units at different levels of the video, and it takes effect only when it is directly or indirectly referenced by an SS. A parameter set is not tied to a specific picture or CVS: the same VPS or SPS can be referenced by multiple CVSs, and the same PPS can be referenced by multiple pictures.

NAL units are divided into VCLUs (video coding layer NAL units) and non-VCLUs according to whether they carry coded video data. Parameter sets, which are not coded picture data, are transmitted as non-VCLUs, which provides a highly robust mechanism for conveying critical data.

The independence of parameter sets allows them to be sent in advance, and a new parameter set can also be sent when needed. A parameter set can be retransmitted multiple times, protected by special techniques, or even sent out-of-band.

The slice segment (SS) is the basic unit of coded video data, and the compressed data of one SS generates one VCLU for transmission. Ultimately, the coded bitstream of a video sequence consists of the VCLUs generated by a series of SSs, with delimiter data and parameter set data interspersed among them. The delimiter data is used to distinguish which picture and which CVS an SS belongs to.

3.2 Video parameter set

3.2.1 Video layer description

The VPS is introduced in H.265/HEVC to overcome deficiencies of H.264/AVC; its design is simple and it makes extension to multi-layer video coding convenient. The VPS mainly carries layering information about the video, which facilitates compatible standard extensions such as scalable video coding and multi-view video coding.

For a given video sequence, all layers refer to the same VPS regardless of whether their SPSs are the same. The information contained in the VPS is:

1) Syntax elements shared by multiple sub-layers and operation points.

2) Key operation-point information required for session negotiation, such as profile and level.

3) Other operation-point characteristics that do not belong in the SPS, such as hypothetical reference decoder (HRD) parameters related to multiple layers or sub-layers. The key information of each operation point is coded without variable-length coding, which helps lighten the load on most network elements.

3.3 Sequence parameter set

A CVS starts from a random access point, and its first picture can be an IDR picture or a non-IDR picture; a non-IDR picture can be a BLA (broken link access) picture or a CRA (clean random access) picture. A video bitstream may contain one or more coded video sequences (CVSs). The sequence parameter set (SPS) contains the coding parameters shared by all coded pictures in a CVS; the SPS acts on coded pictures by being referenced by the PPS, and all PPSs used within a CVS must refer to the same SPS.

3.4 Picture parameter set

In the coded video stream, a CVS includes multiple pictures, each picture may include one or more SSs, and each SS header carries the identifier of the PPS it references, from which the shared information in the corresponding PPS is obtained. All SSs within the same picture use the same PPS. Note that some parameters appear in both the PPS and the SPS; for these, the values in the PPS take precedence, i.e., the SS decodes using the values given in the PPS. At the start of decoding all PPSs are inactive, and at any moment during decoding at most one PPS can be active.

3.5 Slice Layer

3.5.1 Slices and Slice Segments

A picture can be divided into one or more slices, and the compressed data of each slice can be decoded independently; the header information of a slice cannot be derived from the header of a previous slice. This requires that intra or inter prediction not cross slice boundaries, and that entropy coding be reinitialized at the start of each slice. However, loop filtering is allowed to cross slice boundaries. Apart from the slice boundaries possibly being affected by the loop filter, the decoding of a slice is not influenced by other slices, which facilitates parallel processing. The main purpose of slices is to keep decoding synchronized when data is lost.

According to the coding type, slices are divided into the following kinds:

I slice: all CUs in the slice are coded with intra prediction.

P slice: in addition to what is allowed in an I slice, CUs in this slice may also use inter prediction, and each prediction block (PB) uses at most one set of motion-compensated prediction information. A P slice uses only reference picture list list0.

B slice: on top of a P slice, CUs may use both reference picture lists, list0 and list1.

3.6 Tile unit

A picture can be divided not only into several slices but also into several tiles: the picture is split by horizontal and vertical boundaries into rectangular regions, and each rectangular region is a tile. Each tile contains an integer number of CTUs and can be decoded independently.

3.6.2 Slices, slice segments, and tiles

Both slice and tile partitioning are intended to enable independent decoding.

A slice is composed of a series of SSs, and an SS is composed of a series of CTUs. A tile consists directly of a series of CTUs.

Some basic rules must be followed between slices/SSs and tiles; for each slice/SS and tile, one or both of the following conditions must hold:

1. All CTUs in a Slice/SS belong to the same tile.

2. All CTUs in a tile belong to the same slice/SS.

3.7 Tree coding blocks

The H.265/HEVC standard introduces the coding tree unit (CTU), whose size is specified by the encoder and can be larger than a macroblock. One luma CTB and two chroma CTBs at the same location, together with the corresponding syntax elements, form a CTU.

Coding Unit CU

Prediction Unit PU

Transform Unit (TU)
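
To make the CTU-to-CU quadtree recursion concrete, here is a minimal sketch that splits a CTU into CUs; the should_split rule stands in for the encoder's rate-distortion decision and is purely an assumption:

```python
def split_ctu(x, y, size, min_cu=8, should_split=lambda x, y, s: s > 32):
    """Recursively quadtree-split a CTU at (x, y) into leaf CUs.

    should_split is a placeholder for the encoder's real split decision.
    Returns a list of (x, y, size) tuples, one per CU.
    """
    if size <= min_cu or not should_split(x, y, size):
        return [(x, y, size)]
    half = size // 2
    cus = []
    for dy in (0, half):
        for dx in (0, half):
            cus.extend(split_ctu(x + dx, y + dy, half, min_cu, should_split))
    return cus

# A 64x64 CTU split down to four 32x32 CUs with the toy decision rule above.
print(split_ctu(0, 0, 64))
```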

Chapter 4 Predictive Coding

Advanced video coding usually adopts intra prediction and inter prediction, using already-coded pixels within the picture to predict adjacent pixels, or using already-coded pictures to predict the picture to be coded, thereby effectively removing the spatial and temporal correlation of the video. The video encoder transforms, quantizes, and entropy-codes the prediction residuals instead of the original pixel values, which greatly improves coding efficiency.
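
A toy sketch of the predict-then-code-the-residual idea; the DC-style prediction here is an illustrative assumption, not one of the actual HEVC intra modes:

```python
import numpy as np

def dc_predict(pixels_above, pixels_left):
    """Predict a block as the mean of neighbouring reconstructed pixels (toy DC prediction)."""
    return int(round(np.concatenate([pixels_above, pixels_left]).mean()))

original = np.array([[52, 54], [53, 55]])
pred = dc_predict(np.array([50, 51]), np.array([49, 50]))   # predicted value, e.g. 50
residual = original - pred        # only this small-valued residual is transformed, quantized, coded
reconstructed = residual + pred   # the decoder adds the same prediction back
print(pred, residual, reconstructed, sep="\n")
```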

Chapter 5 Transform Coding

Image transform coding refers to transforming an image described in the spatial (pixel) domain into a transform domain and representing it with transform coefficients. Most images contain many flat regions and regions whose content changes slowly; a suitable transform turns the image energy, which is dispersed in the spatial domain, into a relatively concentrated distribution in the transform domain, thus removing spatial redundancy.

5.1 Discrete Cosine Transform

5.1.1 Principle and characteristics of DCT

Fourier analysis shows that any signal can be expressed as a superposition of sine or cosine signals of different amplitudes and frequencies. If only cosine functions are used, the decomposition is called the cosine transform; if the input signal is discrete, it is the discrete cosine transform (DCT). Mathematically there are 8 types of DCT.
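
A minimal sketch of the orthonormal type-II DCT, the variant most used in image and video coding; the floating-point form below is for illustration and is not the integer transform actually specified in H.265/HEVC:

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal N-point DCT-II matrix."""
    c = np.zeros((n, n))
    for k in range(n):
        scale = np.sqrt(1.0 / n) if k == 0 else np.sqrt(2.0 / n)
        for i in range(n):
            c[k, i] = scale * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return c

def dct2_2d(block):
    """Separable 2-D DCT of a square block: C @ X @ C^T."""
    c = dct2_matrix(block.shape[0])
    return c @ block @ c.T

flat = np.full((4, 4), 100.0)        # a flat block...
print(np.round(dct2_2d(flat), 2))    # ...concentrates all its energy in the DC coefficient
```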

5.2 Discrete Sine Transform: There are 8 types of DST.

5.4 Hadamard Transform

Chapter 6 Quantization

Quantization refers to the process of mapping the continuous values of a signal (or a large number of possible discrete values) into a finite number of discrete amplitudes, realizing a many-to-one mapping of signal values. In video coding, after the residual signal undergoes the discrete cosine transform (DCT), the transform coefficients often have a large dynamic range, so quantizing them effectively reduces the signal value space and yields better compression. At the same time, because of the many-to-one mapping, quantization inevitably introduces distortion, which is the fundamental source of distortion in video coding. Since quantization affects both the quality and the bit rate of the video, it is a very important part of video coding.
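
A minimal sketch of a uniform scalar quantizer with a rounding offset f; the offset value and the floating-point formulation are assumptions for illustration (HEVC's normative process uses integer scaling tied to the quantization parameter):

```python
def quantize(coeff, qstep, f=0.5):
    """Uniform scalar quantization: many coefficient values map to one level (many-to-one)."""
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / qstep + f)   # a smaller f would enlarge the dead zone

def dequantize(level, qstep):
    """Inverse quantization: reconstruct an approximation of the coefficient."""
    return level * qstep

for c in (0.4, 7.3, -12.8):
    level = quantize(c, qstep=4)
    print(c, "->", level, "->", dequantize(level, 4))   # the difference is the quantization distortion
```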

6.1 Standard quantization

6.2 Uniform quantization

6.3 Lloyd-Max quantizer

6.4 Entropy-coded quantizer

Chapter 7 Loop Post-processing

In-loop filtering includes two modules: deblocking filtering and pixel (sample) adaptive compensation. The deblocking filter is used to reduce blocking artifacts, and pixel adaptive compensation is used to alleviate the ringing effect.

Deblocking filtering includes two steps: filtering decision and filtering operation.

Pixel adaptive compensation (SAO): for strong edges in the image, quantization distortion of the high-frequency AC coefficients produces ripples around the edges after decoding; this distortion is called the ringing effect.
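
A minimal sketch of the SAO idea of adding per-category offsets to reconstructed samples; the band-offset style classification and the offset values here are illustrative assumptions rather than the normative SAO derivation:

```python
import numpy as np

def sao_band_offset(recon, offsets, num_bands=32, max_val=255):
    """Add an offset to each sample according to the intensity band it falls in."""
    band_width = (max_val + 1) // num_bands
    out = recon.astype(np.int32).copy()
    for band, offset in offsets.items():          # offsets: {band_index: offset}
        mask = (out // band_width) == band
        out[mask] += offset
    return np.clip(out, 0, max_val)

recon = np.array([[30, 33, 95, 97]])
print(sao_band_offset(recon, offsets={3: +2, 11: -1}))   # samples in bands 3 and 11 are corrected
```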

Chapter 8 Entropy Coding

Entropy coding refers to lossless coding performed according to the principle of information entropy. Entropy coding effectively removes the statistical redundancy of the syntax element symbols and is one of the most important tools for ensuring the compression efficiency of video coding.

Shannon gave the definition of information: Information is the description of the uncertainty of the state of motion or the way of existence of things.

Variable length coding: Huffman coding

        Exponential Golomb coding

The lack of a regular codeword structure is a deficiency of Huffman coding; among codes with a regular structure, the exponential Golomb code is widely used.

Zero-order exponential Golomb coding is simple to encode and decode and also achieves high compression for generalized Gaussian sources. In H.265/HEVC, zero-order exponential Golomb coding is used for most of the syntax elements in the video parameter set (VPS), sequence parameter set (SPS), picture parameter set (PPS), slice header information, and so on.
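
A minimal sketch of zero-order exponential Golomb (ue(v)) encoding and decoding as used for such syntax elements, following the standard prefix-of-zeros plus info-bits construction:

```python
def exp_golomb_encode(value):
    """Zero-order exp-Golomb codeword of a non-negative integer, as a bit string."""
    code_num = value + 1
    info_bits = bin(code_num)[2:]                 # binary form of value + 1
    return "0" * (len(info_bits) - 1) + info_bits # leading zeros give the codeword length

def exp_golomb_decode(bits):
    """Decode one zero-order exp-Golomb codeword from the front of a bit string."""
    leading_zeros = len(bits) - len(bits.lstrip("0"))
    code_num = int(bits[leading_zeros:2 * leading_zeros + 1], 2)
    return code_num - 1, bits[2 * leading_zeros + 1:]

for v in range(5):
    print(v, exp_golomb_encode(v))   # 0->'1', 1->'010', 2->'011', 3->'00100', 4->'00101'
```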

Chapter 9 Network Adaptation Layer

H.265 also adopts a two-layer architecture of a video coding layer (VCL) and a network adaptation layer (NAL) to adapt to different networks and video applications. The main task of the NAL is to partition and encapsulate the compressed video data and to add the necessary identification, so that it can adapt well to complex and changing network environments.

For network devices, it is difficult to understand the coding principles and analyze the compressed video stream. Therefore the NAL encapsulates and labels the compressed video stream according to the characteristics of the video data, so that the network can recognize it and optimize transmission accordingly.

According to the content characteristics of the compressed video bitstream, the NAL divides it into multiple data segments, encapsulates each data segment and labels its content characteristics, generating NALUs; the content characteristic information is stored in the NALU header.

NALUs can be packetized in several combinations: one network packet carries one NALU; one network packet carries multiple NALUs (several NALUs combined into one packet); or multiple network packets carry one NALU (one NALU split across several packets). During transmission, a network device can obtain the content characteristics of the video data carried by a NALU directly from the NALU header and optimize the transmission of the video stream on that basis. When packets must be dropped because of network congestion, the device needs to know the priority of each packet, which can also be obtained directly from the NALU header.

Picture types:

All compressed video data is encapsulated in the payload parts of different NALUs. Besides carrying the VPS, SPS and PPS, NALUs mainly carry the compressed data of video slices. A NALU that carries slice compressed data is called a VCLU (VCL NALU), and NALUs that carry other information are called non-VCLUs. In H.265/HEVC, each VCLU contains the compressed data of one slice segment (SS); the SS is the output unit of compressed data from the VCL.

The decoding order is the order of the pictures in the compressed bitstream; pictures are played back in display (time) order.

A video stream usually has random access points (intra random access point, IRAP) at intervals. Starting from an IRAP, the subsequent part of the stream (pictures whose playback order is after the IRAP) can be decoded correctly and independently, without referring to information before the IRAP. The first picture decoded at a random access point is the IRAP picture; pictures whose decoding order is after the IRAP picture but whose playback order is before it are called leading pictures of the IRAP picture, and pictures whose playback order is after the IRAP picture (their decoding order is necessarily after it as well) are called trailing pictures of the IRAP picture.

Leading pictures are divided into RADL (random access decodable leading) pictures and RASL (random access skipped leading) pictures. A leading picture that depends on bitstream information preceding the IRAP is a RASL picture: when decoding starts (random access) from that IRAP picture, its RASL pictures cannot be decoded correctly and are skipped.

There are three types of IRAP pictures: IDR, CRA, and BLA. IDR pictures are also used as random access pictures in H.264. An IDR picture requires all of its leading pictures to be RADL pictures; that is, the IDR picture and the bitstream that follows it do not depend on any video stream information before the IDR picture and can be decoded independently. A CRA picture allows its leading pictures to be RASL pictures, permitting references to the video stream before the CRA picture so that the RASL pictures achieve higher coding efficiency.

A CVS is defined as the video stream between two adjacent IRAP pictures. The group of coded pictures contained in a CVS is like the GOP in earlier standards: a relatively independently coded unit. During encoding, a fixed temporal coding structure is usually set up in units of CVSs. CVSs formed by dividing the stream at IDR pictures have a certain independence; the compressed video data in different CVSs do not refer to each other, which is commonly called a closed GOP. CVSs formed by dividing the stream at CRA pictures are less independent, and pictures in different CVSs may refer to each other; this is called an open GOP.

9.3 Network Adaptation Layer Unit

Each NALU consists of two parts: the NALU header and the NALU payload. The NALU header has a fixed length of two bytes and reflects the content characteristics of the NALU. The NALU payload is an integer number of bytes and carries a segment of the compressed video bitstream.
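
A minimal sketch that parses the two-byte HEVC NALU header into its fields (forbidden_zero_bit, nal_unit_type, nuh_layer_id, nuh_temporal_id_plus1), following the field widths defined by the standard:

```python
def parse_nalu_header(b0: int, b1: int):
    """Split the 2-byte HEVC NALU header into its syntax fields."""
    forbidden_zero_bit    = (b0 >> 7) & 0x01
    nal_unit_type         = (b0 >> 1) & 0x3F                        # 6 bits: slice types, VPS/SPS/PPS, SEI, ...
    nuh_layer_id          = ((b0 & 0x01) << 5) | ((b1 >> 3) & 0x1F) # 6 bits
    nuh_temporal_id_plus1 = b1 & 0x07                               # 3 bits
    return forbidden_zero_bit, nal_unit_type, nuh_layer_id, nuh_temporal_id_plus1

# 0x40 0x01 is a typical VPS NALU header: nal_unit_type = 32, layer 0, temporal_id_plus1 = 1.
print(parse_nalu_header(0x40, 0x01))
```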

9.3.1 NALU payload

The RBSP (raw byte sequence payload) trailer consists of a single 1 bit, the RBSP stop bit, followed by zero or more 0 bits that pad the data to a byte boundary. The RBSP is thus the SODB (string of data bits) padded to an integer number of bytes, and the data type of an RBSP is the data type of its SODB.

An RBSP cannot be used directly as the NALU payload, because in the byte-stream application environment 0x000001 is the NALU start code and 0x000000 is the end code. To prevent byte sequences inside the NALU payload from colliding with the NALU start and end codes, the following conflict-avoidance (emulation-prevention) processing is applied to the RBSP byte stream:

0x000000->0x00000300

0x000001->0x00000301

0x000002->0x00000302

0x000003->0x00000303

When the last byte of the RBSP data equals 0x00 (this occurs only for cabac_zero_word at the end of the RBSP), a 0x03 byte is appended to the end of the data.
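
A minimal sketch of this emulation-prevention processing and its inverse; the handling of a trailing 0x00 follows the rule above, and the rest is a straightforward scan over the byte stream:

```python
def rbsp_to_payload(rbsp: bytes) -> bytes:
    """Insert emulation-prevention bytes: 0x00 0x00 {00,01,02,03} -> 0x00 0x00 0x03 {..}."""
    out = bytearray()
    zeros = 0
    for b in rbsp:
        if zeros == 2 and b <= 0x03:
            out.append(0x03)               # emulation-prevention byte breaks the zero run
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    if out and out[-1] == 0x00:            # trailing 0x00 (e.g. cabac_zero_word) also gets 0x03
        out.append(0x03)
    return bytes(out)

def payload_to_rbsp(payload: bytes) -> bytes:
    """Remove emulation-prevention bytes at the decoder."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros == 2 and b == 0x03:
            zeros = 0                      # drop the inserted 0x03
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)

print(rbsp_to_payload(bytes([0x00, 0x00, 0x01, 0x42])).hex())   # -> '0000030142'
```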

9.4 NALU in video bitstream

The compressed video bitstream consists of consecutively arranged NALUs, and their order must be consistent with the decoding order. H.265/HEVC introduces the concept of an access unit (AU), defined as a set of NALUs arranged in decoding order that, when decoded, produce exactly one picture. The AU can be regarded as the basic unit of the compressed video bitstream, and the compressed video stream is composed of multiple AUs arranged in order. Each NALU belongs to exactly one AU, and the first NALU of the compressed video stream is the first NALU of the first AU.

9.4.2 The parameter sets VPS, SPS, and PPS are each carried as independent data units in their own NALUs, and different parameter sets follow different usage rules. The VPS and SPS contain the common parameter information of a CVS, and a CVS can only use one and the same VPS and SPS. The PPS contains the common parameter information of one picture, and only one PPS can be used for a picture.

The parameter information contained in a PPS can be used by NALUs belonging to one or more pictures. No PPS is active at the start of the decoding process; at most one PPS is active at any time during decoding, and activating a PPS deactivates the previously active PPS.

The SPS contains parameter information common to a CVS, which can be used by one or more PPSs or by one or more SEI NALUs. No SPS is active at the start of the decoding process, and at most one SPS is active at any time during decoding.

The VPS contains parameter information common to a CVS, which can be used by one or more SPSs or by one or more SEI NALUs. No VPS is active at the start of the decoding process; at most one VPS is active at any time during decoding, and activating a VPS deactivates the previously active VPS.

9.5 Application of Network Adaptation Layer Units

An H.265/HEVC video stream may be transmitted in a single RTP session (single-session transmission, SST) or in multiple RTP sessions (multi-session transmission, MST).

The compressed video service is divided into two application scenarios:

1. Packet stream: in packet-stream applications the NALUs output by the encoder are used directly as the payloads of network packets, and the decoder at the receiving end can obtain compressed video data in NALU form directly from the packets, e.g. real-time video communication over RTP/UDP/IP.

According to the number of NALUs carried, there are three kinds of packets:

    One: single NALU packet: a packet carries exactly one NALU.

    Two: aggregation packet (AP): a packet carries multiple NALUs.

    Three: fragmentation packet (FP): a packet carries only part of one NALU; this is the most commonly used.

    The RTP payload structure depends on the RTP packet type. The first two bytes are called the payload header; its structure is the same as that of the NALU header, and the value of each field is determined from the header(s) of the carried NALU(s).

2. Byte stream: in byte-stream applications the NALUs are organized in decoding order into a continuous byte or bit stream for transmission and processing. To ensure that the decoder can recover the video data in NALU form, a synchronization mark (start code) must be inserted at each NALU boundary. Stream-based transmission systems include H.320 and MPEG-2 systems.
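
A minimal sketch of byte-stream framing: each NALU is prefixed with a start code so the decoder can resynchronize at NALU boundaries (the 4-byte start code used here is a common convention and an assumption of this sketch; 3-byte start codes also occur in practice):

```python
START_CODE = b"\x00\x00\x00\x01"

def to_byte_stream(nalus):
    """Concatenate NALUs in decoding order, prefixing each with a start code."""
    return b"".join(START_CODE + nalu for nalu in nalus)

def split_byte_stream(stream):
    """Recover the NALUs by splitting on the start code."""
    return [chunk for chunk in stream.split(START_CODE) if chunk]

nalus = [bytes([0x40, 0x01, 0x0C]), bytes([0x42, 0x01, 0x02])]   # toy NALU contents
assert split_byte_stream(to_byte_stream(nalus)) == nalus
```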
