A full analysis of H.265 video coding technology

1. Overview

H.265, a new-generation video codec format, is seeing increasingly wide use. Not long ago, Apple released the much-anticipated iPhone 6. Compared with previous iPhones, it not only enlarges the screen but also adopts H.265 video encoding and decoding, enabling it to stream higher-quality web video. Although H.265 is still unfamiliar to most people, Ericsson launched the first H.265 codec as early as two years before that. A major advantage of H.265 is that it can deliver the same video quality at roughly half the bit rate of today's mainstream codec, H.264.

Next, I will introduce the advantages of H.265 video coding technology in detail.

A new generation of video compression standard: H.265

Video codec standards mainly come from two bodies: ITU-T and ISO. ITU-T's H.26x series evolved through H.261, H.263, and H.263+, while ISO's MPEG series ran from MPEG-1 and MPEG-2 to MPEG-4. The two standardization groups then joined forces, established the Joint Video Team (JVT), and produced what was for years the most effective video compression standard: H.264/AVC.

Subsequently, H.264's scope of application kept expanding and its maturity kept increasing. In 2013, the next-generation video compression standard, HEVC (High Efficiency Video Coding), was finalized and published, adopted by ITU-T and ISO/IEC as an international standard and also known as H.265. Building on the existing H.264 standard, H.265 further improves compression efficiency, strengthens robustness and error resilience, reduces real-time latency, shortens channel acquisition time and random-access delay, and keeps complexity manageable.

High definition, low bit rate

Today, with 4K ultra-high-definition video becoming popular, H.264 faces bottlenecks. In terms of coding units, each macroblock (MB) in H.264 is fixed at 16x16 pixels, so at higher resolutions the image content carried by a single macroblock shrinks dramatically. After the integer transform, the low-frequency coefficients of neighboring macroblocks become highly similar, producing a large amount of redundancy and significantly reducing H.264's compression efficiency on high-definition video. In addition, the explosive growth in the number of macroblocks means that the per-macroblock parameters — prediction mode, motion vectors, reference frame index, quantization level — consume an ever larger share of the bitstream, further lowering the compression ratio.

Compared with H.264, H.265 provides more tools to reduce the bit rate. Its coding units range from 8x8 up to 64x64: areas carrying little information are covered by larger blocks, which yield fewer codewords after encoding, while detail-rich areas are covered by correspondingly smaller and more numerous blocks, yielding more codewords. The image is thus encoded with its "focus" on the important details, reducing the overall bit rate and improving coding efficiency. H.265 performs this block-size selection adaptively.

The latest H.265 standard largely inherits the framework of H.264, but thanks to algorithmic optimization the difference is stark: at bit rates below about 1 Mbps, H.264 can only deliver standard-definition video, whereas H.265 can deliver ordinary 720p (1280x720) high-definition audio and video. In other words, H.265 is designed to transmit higher-quality network video over limited bandwidth — or to play video of the same quality using only about half the original bandwidth.


Application of H.265 technology

Coding technology is mainly used in video playback devices, software applications, and video capture equipment. The example most familiar to users is the PPS network video player. On its PC client, PPS launched H.265-based high-definition video in early 2013 under the "Zhen HD" brand, and has since kept investing in H.265, bringing it to its mobile client and continuously optimizing performance.

On playback devices, the rise of 4K ultra-HD TV is also closely tied to H.265 encoding. The FMP-X10, the first domestic media player offering high-quality 4K content, released by Sony in September of that year, as well as Sony's overseas S900B and S990B series, all adopt the H.265 format. For Blu-ray, HEVC compression is efficient enough to fit a huge 4K movie onto a single Blu-ray disc, saving storage space.

In terms of camera equipment, manufacturers such as Sony, Hikvision, and ZTE have released H.265-capable cameras, used not only for everyday photography and video but also for area monitoring and security. Ultra-high-definition surveillance with H.265 can cover a larger area, reducing the number of cameras that must be installed and, most importantly, improving the quality of surveillance images — avoiding the redundant data produced by installing multiple cameras at a single intersection.

Of course, H.265's applications do not stop there. Sony's PS4, which will soon enter China, reportedly also supports H.265. And with the arrival of 4G networks, H.265 may gradually be deployed at scale on mobile clients, making high definition ubiquitous and letting more people feel and enjoy the impact of this technology.

It must be pointed out that while H.265 brings much higher compression efficiency than H.264, it also brings several times the decoding complexity, with the computational load soaring to 400-500 GOPS, and broad H.265 hardware support is not yet in place. Moreover, facing problems such as lacking hardware support and late standardization, H.265 is also challenged by Google's VP9: compared with H.265, which requires licensing fees, VP9's free BSD-style license is undeniably attractive to the industry.

2. Terminology

CTU: Coding Tree Unit, the root of the coding quadtree.
CU: Coding Unit, obtained by quadtree splitting of the CTU.
PU: Prediction Unit, obtained by partitioning a CU. A PU comprises one luma prediction block (PB) and two chroma PBs.
TU: Transform Unit, also partitioned from the CU (independently of the PU) using quadtree splitting; the exact partition is decided by rate-distortion cost. The following figure shows the structure of a CU divided into TUs.
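The recursive quadtree splitting described above can be sketched in a few lines. This is a hedged illustration, not code from any real encoder: `should_split()` is a hypothetical placeholder for a genuine rate-distortion decision.

```c
/* Stand-in for a real rate-distortion test: pretend that "detailed"
 * regions (flagged by the caller) always prefer smaller blocks. */
static int should_split(int size, int detailed)
{
    return detailed && size > 8;   /* split down to the 8x8 minimum */
}

/* Recursively split the block at (x, y); returns how many leaf CUs
 * the block is divided into. */
static int split_cu(int x, int y, int size, int detailed)
{
    if (!should_split(size, detailed))
        return 1;                  /* leaf CU */
    int half = size / 2;
    return split_cu(x, y, half, detailed)
         + split_cu(x + half, y, half, detailed)
         + split_cu(x, y + half, half, detailed)
         + split_cu(x + half, y + half, half, detailed);
}
```

For example, `split_cu(0, 0, 64, 0)` keeps a flat CTU as one 64x64 CU, while `split_cu(0, 0, 64, 1)` splits a detailed CTU all the way down into 64 leaves of 8x8.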

1. Introduction to H.265 NALU types

```c
NAL_TRAIL_N    = 0,   NAL_TRAIL_R    = 1,
NAL_TSA_N      = 2,   NAL_TSA_R      = 3,
NAL_STSA_N     = 4,   NAL_STSA_R     = 5,
NAL_RADL_N     = 6,   NAL_RADL_R     = 7,
NAL_RASL_N     = 8,   NAL_RASL_R     = 9,
NAL_BLA_W_LP   = 16,  NAL_BLA_W_RADL = 17,  NAL_BLA_N_LP = 18,
NAL_IDR_W_RADL = 19,  NAL_IDR_N_LP   = 20,  NAL_CRA_NUT  = 21,
NAL_VPS        = 32,  NAL_SPS        = 33,  NAL_PPS      = 34,
NAL_AUD        = 35,  NAL_EOS_NUT    = 36,  NAL_EOB_NUT  = 37,
NAL_FD_NUT     = 38,  NAL_SEI_PREFIX = 39,  NAL_SEI_SUFFIX = 40,
```

2. Extracting the frame/NALU type: H.264 vs. H.265

In H.264 the NALU type is obtained as code & 0x1F, taking the lower 5 bits. For example, given the bytes 00 00 00 01 65, 0x65 & 0x1F = 5, indicating an IDR frame.

In H.265 it is (code & 0x7E) >> 1. For example, given 00 00 00 01 40, (0x40 & 0x7E) >> 1 = 32, indicating a VPS.
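The two extraction rules above fit into two tiny helpers (`b` is the first byte after the start code; the function names are mine, not from any library):

```c
/* H.264: NALU type lives in the low 5 bits of the header byte. */
static int h264_nal_type(unsigned char b) { return b & 0x1F; }

/* H.265: NALU type lives in bits 1..6 of the header byte. */
static int h265_nal_type(unsigned char b) { return (b & 0x7E) >> 1; }
```

So `h264_nal_type(0x65)` yields 5 (IDR slice) and `h265_nal_type(0x40)` yields 32 (VPS), matching the worked examples above.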

3. Frame types in the H.265 GOP

One difference between H.265 and H.264 is that H.265 introduces more frame coding types. Another is that H.265's GOP is not necessarily closed as in H.264: H.265 supports both closed and open GOPs. In H.264's closed GOP, frames inside one GOP never depend on other GOPs, so each GOP can be decoded independently; the forced refresh at the IDR frame acts as a barrier that stops erroneous frames from propagating. Therefore the first frame of an H.264 GOP must be an IDR keyframe. An IDR is an I frame — intra-coded, referencing nothing — but an I frame is not necessarily an IDR: an ordinary I frame does not reset the reference buffer, so later frames may still reference pictures before it. In RTC, IDR keyframes are used most.

About the GOP.

GOP stands for Group of Pictures: the encoded video sequence is divided into ordered groups of frames for encoding. Each GOP must start with an I frame, but the GOP length is not necessarily the distance between two I frames, because a GOP may contain several I frames, of which only the first is the keyframe. In an encoder's configuration, the GOP length and the I-frame interval are typically set by two different parameters (such as IntraPeriod and GOPSize, or similar). The distance between two I frames therefore cannot exceed the GOP length, and is generally smaller.

About IDRs.

IDR stands for Instantaneous Decoding Refresh, a structure defined in H.264. An IDR frame must be an I frame and must begin a GOP — it is the keyframe of an H.264 GOP — but the reverse does not hold: an I frame is not necessarily an IDR frame. The GOP length is not constant: if the H.264 encoder detects a scene change before the current GOP would naturally end, it inserts an IDR at that position as the start of a new GOP, shortening the current one.

Closed GOP vs. open GOP, and CRA

The closed GOP is the H.264 format: every GOP is decoded independently, with no reference to other GOPs — they are all "closed". In HEVC the GOP structure changed to an "open" form, where decoding may reference data from other GOPs. The starting frame of such a GOP is called a CRA (Clean Random Access) picture. It is still intra-coded, but inter-coded frames in its GOP may "skip over" the CRA and reference data from the previous GOP — this is what makes the GOP open.

About BLA

A BLA (Broken Link Access) picture is simply a special case of a CRA arising from stream switching. If decoding of one stream must switch at some random access point (RAP) to continue in another stream, the CRA that is spliced in from the other stream becomes a BLA. Since the previously decoded pictures in the buffer belong to the old stream and are irrelevant to the new one, the BLA behaves like a CRA accessed directly at a random-access point.

RASL and RADL

These are picture types that sit between two GOPs. If the decoder randomly accesses the stream at a CRA, the next few pictures in display order cannot be decoded for lack of reference frames and are discarded — skipped leading pictures (RASL). When decoding does not start at that CRA, such leading pictures can be decoded and displayed normally; leading pictures guaranteed decodable even after random access are the decodable leading pictures (RADL). Because leading pictures may be discarded, trailing pictures must not reference them — otherwise discarding a leading picture would render additional pictures undecodable.

3. Basic structure

HEVC Encoder overall framework:

The CU is the basic unit for both inter and intra coding. It is always square, with sizes ranging from a minimum of 8×8 to a maximum of 64×64. The LCU (largest CU) is 64x64 and is split recursively as a quadtree. Large CUs suit the smoother parts of an image, while small CUs suit areas rich in edges and texture. The quadtree split is described by two variables: the split depth (Depth) and the split flag (Split_flag).

With the CTU size set to 64x64, a luma CB is at most 64x64 — i.e., a whole CTB used directly as a CB — and at least 8x8; a chroma CB is at most 32x32 and at least 4x4. Each CU has prediction units (PUs) and transform units (TUs) associated with it.

Z scan order:


The PU is the basic unit of prediction and is partitioned from the CU. HEVC has skip, intra, and inter modes.
Intra prediction has 2 partition modes; PART_NxN may be used only when the CU size is 8x8.
Inter prediction has 8 partition modes; a PU may be square or rectangular, and its partitioning is not recursive, unlike CU splitting. Sizes range from a maximum of 64×64 down to a minimum of 4×4.

The TU is also split as a quadtree, with the CU as root; a TU may be larger than a PU, but never larger than its CU.
In intra coding, the TU is no larger than the PU;
in inter coding, the TU is not necessarily smaller than the PU, but must fit within its corresponding CU.

A Slice can contain one independent Slice Segment (SS) and multiple dependent SSs. The SSs within a Slice may depend on each other, but not on other Slices. In the figure, dotted lines separate SSs and solid lines separate Slices.


A Tile is a rectangular block, while a Slice is a strip.
Tiles and Slices must satisfy at least one of the following two conditions:

1. All CTUs in any Slice belong to the same Tile.

2. All CTUs in any Tile belong to the same Slice.
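The two conditions can be checked mechanically. The sketch below is a hypothetical illustration (not from any decoder): `ctu_tile[i]` and `ctu_slice[i]` give the tile and slice that CTU `i` belongs to.

```c
/* Returns 1 if, for every group (0..groups-1), all CTUs of that group
 * share a single value in the other partition. */
static int all_in_one(const int *group, const int *other, int n, int groups)
{
    for (int g = 0; g < groups; g++) {
        int seen = -1;
        for (int i = 0; i < n; i++) {
            if (group[i] != g) continue;
            if (seen < 0) seen = other[i];      /* first CTU of group g */
            else if (other[i] != seen) return 0; /* group g spans two */
        }
    }
    return 1;
}

/* The tile/slice constraint: condition 1 OR condition 2 must hold. */
static int tiles_slices_ok(const int *ctu_tile, const int *ctu_slice,
                           int n, int ntiles, int nslices)
{
    return all_in_one(ctu_slice, ctu_tile, n, nslices)   /* slices inside tiles  */
        || all_in_one(ctu_tile, ctu_slice, n, ntiles);   /* tiles inside slices */
}
```

For instance, one slice covering two whole tiles satisfies condition 2, whereas a slice and a tile that partially overlap each other satisfy neither.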

for example:

Assume the display order of a video sequence — one complete GOP — is ①, and the decoding order is ②:

①I B B P B B P B B P

②I P B B P B B P B B

In H.264, each GOP's first I frame is an IDR and the GOP is closed, so two consecutive GOPs look like:

IBBPBBPBBP IBBPBBPBBP (display order)

IPBBPBBPBB IPBBPBBPBB (decoding order)

In HEVC, the two I frames are CRAs and the GOP is open, so the structure is:

IBBPBBPBBPBB IBBPBBPB... (display order)

IPBBPBBPBB IBBPBBPBB... (decoding order)

The two B frames marked in red in the original figure are those that follow the CRA in decoding order yet reference pictures coded in the previous GOP. It is then easy to see that if random access starts at the second CRA, those two B frames must be discarded: their reference frames are missing and they cannot be decoded — they are RASL pictures. If that CRA is not chosen as the random access point, the two B frames decode normally — they become decodable (RADL-like) leading pictures.
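The discard decision described above reduces to a tiny rule. This is a hedged sketch, not actual decoder logic, reusing the NAL type values listed earlier in the article:

```c
/* Leading-picture rule: when decoding starts at a given CRA, RASL
 * pictures associated with it lack their references and are dropped;
 * RADL pictures can still be decoded. */
enum { NAL_RADL_N = 6, NAL_RADL_R = 7, NAL_RASL_N = 8, NAL_RASL_R = 9 };

static int drop_on_random_access(int nal_type, int started_at_this_cra)
{
    int is_rasl = (nal_type == NAL_RASL_N || nal_type == NAL_RASL_R);
    return is_rasl && started_at_this_cra;
}
```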

For a BLA the situation is similar: after stream splicing, the B frames following the second stream's CRA are likewise discarded for lack of reference frames. This is easy to understand — the reference pictures still in the buffer belong to the previous stream and have nothing to do with the current one, so of course they cannot serve as references for those B frames.

As for why HEVC was designed this way, the motivation is coding efficiency: B frames achieve the highest compression, so this design raises the proportion of B frames as much as possible without hurting random-access performance, improving overall compression.

New structures introduced in H.265:

Random access point pictures:

IDR (Instantaneous Decoder Refresh)

BLA (Broken Link Access)

CRA (Clean Random Access)

Leading pictures (the general term covering RASL and RADL):

RADL (Random Access Decodable Leading): leading pictures after a random access point that can still be decoded in order

RASL (Random Access Skipped Leading): leading pictures that cannot be decoded because their inter-frame references are missing — they are discarded

Temporal sub-layer access pictures (switch-up points for temporal sub-layers, in decoding and output order):

TSA (Temporal Sub-layer Access)

STSA (Step-wise Temporal Sub-layer Access)

It can therefore be said that a random access point picture must be an I frame. In H.265 the I-frame types are IDR, BLA, and CRA — that is, NALU types 16 through 21 are all I frames. For reference, ffmpeg makes the corresponding judgment like this:

```c
switch (type) {
case NAL_VPS:
    vps++;
    break;
case NAL_SPS:
    sps++;
    break;
case NAL_PPS:
    pps++;
    break;
case NAL_BLA_N_LP:
case NAL_BLA_W_LP:
case NAL_BLA_W_RADL:
case NAL_CRA_NUT:
case NAL_IDR_N_LP:
case NAL_IDR_W_RADL:
    irap++;
    break;
}
```
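The "types 16 through 21 are all I frames" observation can be condensed into a one-line helper (the function name is mine, not ffmpeg's):

```c
/* H.265 NAL unit types 16..21 (BLA_W_LP through CRA_NUT) are the
 * random-access I-frame (IRAP) types: BLA, IDR, and CRA. */
static int h265_is_irap(int nal_type)
{
    return nal_type >= 16 && nal_type <= 21;
}
```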

4. Intra prediction mode

There are 35 intra prediction modes in total (H.264 has 9): Planar, DC, and 33 angular modes.

In addition to Intra_Angular prediction, HEVC also supports the Intra_Planar and Intra_DC modes found in H.264/MPEG-4 AVC:
Intra_DC predicts with the mean of the reference pixels;
Intra_Planar predicts with the mean of two linear interpolations derived from the four corner reference pixels.

Partition modes: only PART_2Nx2N and PART_NxN may be used for intra coding.
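The Intra_DC idea above can be shown in miniature. This is a simplified sketch — it averages one left column and one top row, without the reference-sample substitution and filtering a real HEVC encoder performs:

```c
/* Intra_DC sketch: predict every pixel of an n x n block with the
 * rounded mean of 2n neighbouring reference pixels — n from the left
 * column and n from the top row. */
static int intra_dc_value(const unsigned char *left,
                          const unsigned char *top, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += left[i] + top[i];
    return (sum + n) / (2 * n);   /* +n gives rounding to nearest */
}
```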

5. Inter-frame prediction

Skip mode: an inter prediction mode that transmits neither an MV difference nor residual information.

For motion vector prediction, H.265 maintains two reference lists, L0 and L1. Each can hold 16 references, but the maximum number of unique pictures is 8. H.265 motion estimation is more complex than H.264's: it uses list indices and has two main prediction modes, Merge and AMVP (Advanced Motion Vector Prediction).

1. Motion Estimation Criteria

Minimum mean square error (MSE)
Minimum mean absolute difference (MAD)
Maximum matching-pixel count (MPC)
Minimum sum of absolute differences (SAD)
Minimum sum of absolute transformed differences (SATD)

SAD or SATD is generally used. SAD involves no multiplication or division and is easy to implement in hardware, so it is the most widely used; in practice, further computations are layered on top of SAD to control the rate-distortion trade-off.
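SAD itself is just an accumulation of absolute pixel differences, which is exactly why it maps so well to hardware — a minimal version over two flattened blocks:

```c
/* Sum of Absolute Differences between two blocks of n samples:
 * no multiplication or division, only subtract/compare/add. */
static int sad(const unsigned char *a, const unsigned char *b, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        int d = a[i] - b[i];
        s += d < 0 ? -d : d;
    }
    return s;
}
```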

2. Search algorithm

· dia: diamond search

· hex (default): hexagon search

· umh: uneven multi-hexagon search (asymmetric cross multi-hexagon grid search)

· star: star search

· full: exhaustive full search

Full search: computes the matching error at every possible position, equivalent to sliding the original block pixel by pixel across the search window.
Diamond search: in x265 this is actually a cross search, checking only the blocks on the diagonals of the diamond.
HM uses full search and TZSearch, along with an optimized variant of TZSearch.
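The full-search strategy above is simple enough to sketch end to end. The frame size, block size, and layout here are made up for illustration:

```c
#include <limits.h>

#define W 16   /* reference frame width  (illustrative) */
#define H 16   /* reference frame height (illustrative) */
#define B 4    /* block size             (illustrative) */

/* SAD of the 4x4 block blk against the reference frame at (rx, ry). */
static int sad4x4(const unsigned char ref[H][W], int rx, int ry,
                  const unsigned char blk[B][B])
{
    int s = 0;
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++) {
            int d = ref[ry + y][rx + x] - blk[y][x];
            s += d < 0 ? -d : d;
        }
    return s;
}

/* Full search: try every candidate position in the window and keep
 * the one with the smallest SAD. */
static void full_search(const unsigned char ref[H][W],
                        const unsigned char blk[B][B],
                        int *best_x, int *best_y)
{
    int best = INT_MAX;
    for (int y = 0; y + B <= H; y++)
        for (int x = 0; x + B <= W; x++) {
            int s = sad4x4(ref, x, y, blk);
            if (s < best) { best = s; *best_x = x; *best_y = y; }
        }
}
```

Faster patterns like dia/hex/umh visit only a subset of these positions, trading a small risk of missing the true minimum for far fewer SAD evaluations.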

3. MV prediction

In terms of MV prediction, HEVC introduces two new techniques: Merge and AMVP (Advanced Motion Vector Prediction). Both exploit spatial and temporal MV correlation: they build a candidate MV list and select the best-performing candidate as the prediction for the current PU's MV. The differences between the two:

· Merge can be seen as a coding mode: the current PU's MV is taken directly from a spatially or temporally adjacent PU, so no MVD (motion vector difference) is transmitted. AMVP, by contrast, is an MV prediction technique: the encoder codes only the difference between the actual MV and the predicted MV, so an MVD is transmitted.

· The candidate MV lists differ both in length and in how they are constructed.

Merge: the motion information of the current block is derived from the motion information of adjacent PUs; only a merge flag and a merge index need to be transmitted, not the motion information itself.

Spatial merge candidates: up to 4 merge candidates are chosen from 5 candidate positions.

The figure shows five candidate PUs, but the standard allows at most four. The list is built in the order A1 -> B1 -> B0 -> A0 -> (B2); B2 is a substitute, used only when one or more of the other positions is unavailable.

Temporal merge candidates: 1 candidate is selected from 2 positions, C3 and H.

AMVP builds a candidate list of spatial and temporal PU motion vectors; the current PU traverses the list and selects the optimal predicted motion vector by SAD.

Spatial motion vector candidates: one from the left and one from above, out of 5 positions — 2 candidates in total.

On the left, the AMVP selection order is A0 -> A1 -> scaled A0 -> scaled A1, where "scaled A0" means A0's MV after scaling.
Above, the order is B0 -> B1 -> B2 -> (scaled B0 -> scaled B1 -> scaled B2).

x265, however, favors speed over strict standard order: in its code, AMVP selection is implemented with simple if-else chains over the left and upper neighbors, from which two candidates are chosen.

Temporal motion vector candidates: 1 candidate is selected from 2 different positions.

C0 (bottom right) represents the bottom right neighbor and C1 (center) represents the center block.
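AMVP's final step — traversing the candidate list and keeping the cheapest predictor — can be sketched as below. This is a hypothetical illustration: the cost here is just the distance to a stand-in "true" MV, standing in for a real SAD-based cost.

```c
typedef struct { int x, y; } MV;

/* Stand-in cost: L1 distance between a candidate and the target MV. */
static int mv_cost(MV cand, MV target)
{
    int dx = cand.x - target.x, dy = cand.y - target.y;
    return (dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy);
}

/* Traverse the candidate list and return the index of the cheapest
 * predictor; this index is what gets signalled in the bitstream,
 * alongside the MVD. */
static int pick_best_amvp(const MV *cands, int n, MV target)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (mv_cost(cands[i], target) < mv_cost(cands[best], target))
            best = i;
    return best;
}
```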

Skip vs Merge:

Fractional pixel interpolation:
used to generate prediction samples at non-integer sample positions.

6. Transform and quantization

7. Others

Entropy coding
HEVC currently specifies CABAC arithmetic coding only.

Deblocking filter
Removes the blocking artifacts — pixel-value jumps at block edges — caused by prediction error after inverse quantization and inverse transform.

Sample adaptive offset (SAO)
Classifies the reconstructed samples and applies a small offset to each class, reducing distortion, improving compression, and shrinking the bitstream.

SAO currently has two modes: band offset and edge offset:

1. Band offset: sample values are divided by intensity into 32 bands; the 16 middle bands are compensated, with their offsets written into the bitstream, while the remaining 16 bands are left uncompensated to save bits.

2. Edge offset: a template is selected to classify the current sample, e.g., as a local maximum, a local minimum, or an image edge.
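For band offset, classifying a sample into one of the 32 bands is a single shift. A minimal sketch for 8-bit samples (256 values / 32 bands = 8 values per band):

```c
/* Band index of an 8-bit sample: the top five bits select one of 32
 * equal-width bands. */
static int sao_band(unsigned char sample)
{
    return sample >> 3;
}
```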

Wavefront Parallel Processing (WPP)
WPP parallelizes at the granularity of one row of LCUs, without completely cutting the dependencies between rows. As the figure below shows, the CABAC state after the second LCU of Thread 1 is saved and used as the initial CABAC state for Thread 2, and so on, so rows can be encoded or decoded in parallel while retaining substantial inter-row dependency. This method usually compresses better than tiles.



Origin blog.csdn.net/m0_60259116/article/details/128792336