HEVC Parallel Processing Technology Introduction

The complexity of h265 compared to h264

  1. Complexity Reflection
    ○ h265 has more intra prediction modes: 35 in total (33 angular modes plus DC and planar), far more than the 17 modes of h264, so intra mode selection becomes much more complex;
    ○ h265 partitions the picture in more varied ways: it introduces a tree-structured partitioning with a wider range of partition sizes and asymmetric partitions, which makes motion compensation more complex;
    ○ h265 adds the concept of the transform unit, and the maximum transform size grows from 8x8 in h264 to 32x32, with a corresponding increase in computation;
    ○ Overall, the complexity of h265 may be dozens of times higher than that of previous encoders;
  2. Solution
    ○ Parallel processing in a multi-core environment can multiply encoding and decoding speed, which makes it an effective solution.

Video Codec Parallel Processing Technology

  1. Basic concepts of parallel processing
    ○ Parallel processing generally refers to a processing mode in which many instructions are carried out at the same time. It usually decomposes the work into small parts, which multiple computing units then solve concurrently;
  2. Functional Parallelism
    ○ Functional parallelism divides the application into mutually independent functional modules. The modules exchange data and communicate through streams, and the units are connected in series.
    ○ Examples in h265 decoding are entropy decoding, inverse quantization, and inverse transformation. These modules are connected to one another, and their work can be re-divided and recombined to achieve functional parallelism.
    ○ It makes full use of temporal parallelism (pipelining) to obtain a speedup, and is better suited to hardware implementation.
    ○ The disadvantages are that it easily causes load imbalance, that the computing units must communicate and need additional storage for the data they exchange, and that functional parallelism scales poorly.
  3. Data parallelism
    ○ Data parallelism divides the data into mutually independent parts and hands each part to a different computing unit, thereby realizing parallel processing. The computing units all run the same program on independent data, so no communication between them is required.
    ○ h265 provides structural units suited to data parallel processing, such as slices and tiles. The data in different slices and tiles are independent of each other, which makes it easy to assign them to different computing units.
    ○ For units smaller than slices and tiles, h265 supports Wavefront Parallel Processing (WPP), a data parallel method for picture units that depend on each other.
    ○ Data parallelism achieves load balancing easily, scales very well, and is easy to implement in software.
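
The two styles can be contrasted with a small sketch. The three stage functions are toy stand-ins with made-up arithmetic, not real decoder stages, and all names are assumptions for illustration. The sketch only shows the structure: a pipeline of stage threads connected by queues versus a pool of workers that each run the whole program on independent data. Python threads are used for brevity and do not by themselves give a CPU-bound speedup.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


# Toy stand-ins for decoder stages; the arithmetic is invented for illustration.
def entropy_decode(x):
    return x + 1


def inverse_quantize(x):
    return x * 2


def inverse_transform(x):
    return x - 3


STAGES = [entropy_decode, inverse_quantize, inverse_transform]
_DONE = object()  # sentinel that tells a stage to shut down


def functional_parallel(items):
    """One thread per stage, connected in series by queues (a pipeline)."""
    queues = [queue.Queue() for _ in range(len(STAGES) + 1)]

    def run_stage(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # pass the shutdown signal downstream
                return
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=run_stage, args=(s, queues[i], queues[i + 1]))
        for i, s in enumerate(STAGES)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


def data_parallel(items, workers=4):
    """Every worker runs the same full program on its own independent item."""
    def decode(x):
        for stage in STAGES:
            x = stage(x)
        return x

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode, items))


if __name__ == "__main__":
    data = list(range(8))
    print(functional_parallel(data))
    print(data_parallel(data))
```

Both functions return the same values. The pipeline needs queues between the stages, which corresponds to the communication and storage cost of functional parallelism, while the data parallel version needs no communication between workers.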

h265 parallel processing method

  1. In the h265 standard, entropy decoding may reinitialize its probability model at the start of a Slice, a Tile, or a CTB row, and the initialization positions differ from picture to picture;

  2. The dependencies of entropy decoding differ from those of the later decoding steps. If parallel decoding does not treat the two separately, it becomes much harder to organize and decoding efficiency drops sharply;

  3. The h265 decoding process can therefore be divided into two functional modules that run in series: entropy decoding and parallel decoding. The entropy decoding part performs entropy decoding only, while the parallel decoding part performs prediction, inverse transformation, inverse quantization, deblocking filtering, and so on.

  4. The h265 parallel decoding model mixes functional parallelism and data parallelism.

  5. h265 coding unit data dependencies
    ○ Before coding units can be processed in parallel, the data dependencies between them must be resolved;
    ○ Wavefront parallel processing must respect the dependencies between CTBs;
    ○ Data dependencies between h265 coding units arise mainly from intra prediction, inter prediction, deblocking filtering, and sample adaptive offset (SAO);
    ○ Intra prediction: the current CTB may depend on the pixel and mode information of the left, upper left, upper, and upper right CTBs;
    ○ Inter prediction: the motion vector of the current CTB may need to be predicted from the left, upper left, upper, and upper right CTBs;
    ○ Deblocking filtering: filtering the boundaries of the current CTB needs the four rightmost columns of pixels of the left CTB and the four bottom rows of pixels of the upper CTB;
    ○ Sample adaptive offset (SAO): SAO edge offset has four directions, 0, 45, 90, and 135 degrees, so any of the 8 surrounding CTBs may be referenced when the current CTB is processed;
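
The prediction dependencies listed above can be written down as a small check. This is a minimal sketch that covers only the left, upper left, upper, and upper right neighbors used by intra and inter prediction. The function names are assumptions for illustration, and the deblocking and SAO dependencies are deliberately left out.

```python
def prediction_neighbors(x, y, cols, rows):
    """CTBs that CTB (x, y) may reference for intra/inter prediction:
    left, upper left, upper, and upper right, clipped to the picture."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < cols and 0 <= cy < rows]


def order_respects_dependencies(order, cols, rows):
    """True if every CTB appears after all the neighbors it may reference."""
    position = {ctb: i for i, ctb in enumerate(order)}
    return all(
        position[dep] < position[ctb]
        for ctb in order
        for dep in prediction_neighbors(ctb[0], ctb[1], cols, rows)
    )


if __name__ == "__main__":
    cols, rows = 4, 3
    raster = [(x, y) for y in range(rows) for x in range(cols)]
    column_major = [(x, y) for x in range(cols) for y in range(rows)]
    print(order_respects_dependencies(raster, cols, rows))        # raster order
    print(order_respects_dependencies(column_major, cols, rows))  # column by column
```

Raster order satisfies the check, while a column-by-column order does not, because CTB (0, 1) would be processed before its upper right neighbor (1, 0). Any parallel schedule has to pass the same test.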

h265 parallel processing technology

  1. Tile
    ■ Tile is a data unit newly added in h265. It divides the picture into rectangular regions that are encoded independently, without referencing each other;
    ■ Tiles are introduced mainly for parallelism rather than for resynchronization or error resilience. They change the original scan order: CTBs are scanned in raster order within each Tile;
    ■ A Slice may contain multiple Tiles, and a Tile may contain multiple Slices, but a Slice inside a Tile may not cross the Tile boundary, and a Tile inside a Slice must lie entirely within that Slice;
    ■ When multiple Tiles are located in the same Slice, they can share one Slice header, which saves bit rate;
    ■ The cost of Tiles is that rate-distortion performance decreases as the number of Tiles grows, because the Tile partition breaks the correlation between data on either side of a Tile boundary, and CABAC also resets its probability model at Tile boundaries;
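
The change of scan order can be illustrated with a short sketch. It assumes uniformly spaced tiles and computes the column widths and row heights with the integer-division rule that, as far as I recall, the standard uses for uniform spacing; treat that rule and the function names as assumptions. The output lists raster-scan CTB addresses in the order a tile-based scan would visit them.

```python
def uniform_sizes(total_ctbs, parts):
    """Split total_ctbs into `parts` nearly equal sizes (assumed uniform spacing)."""
    return [
        ((i + 1) * total_ctbs) // parts - (i * total_ctbs) // parts
        for i in range(parts)
    ]


def tile_scan_order(width_ctbs, height_ctbs, tile_cols, tile_rows):
    """Raster-scan CTB addresses listed in tile scan order.

    Tiles are visited in raster order over the picture, and the CTBs inside
    each tile are visited in raster order within that tile.
    """
    col_widths = uniform_sizes(width_ctbs, tile_cols)
    row_heights = uniform_sizes(height_ctbs, tile_rows)
    order = []
    y0 = 0
    for h in row_heights:
        x0 = 0
        for w in col_widths:
            for y in range(y0, y0 + h):
                for x in range(x0, x0 + w):
                    order.append(y * width_ctbs + x)
            x0 += w
        y0 += h
    return order


if __name__ == "__main__":
    # A picture 4 CTBs wide and 2 CTBs high, split into 2 tile columns.
    print(tile_scan_order(4, 2, 2, 1))
```

With two tile columns, the left tile (addresses 0, 1, 4, 5) is scanned completely before the right tile (2, 3, 6, 7), so each tile can be handed to its own computing unit as one contiguous run of CTBs.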

  2. Wavefront Parallel Processing WPP
    ■ Tiles and Slices break correlation and therefore cost some coding performance. The WPP technology introduced by h265 instead allows multiple CTB rows to be processed at the same time, with each row lagging two CTBs behind the row above it, so that parallel decoding is possible without breaking the normal dependencies and coding performance is largely preserved;
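
The two-CTB lag can be turned into a simple schedule. This is a sketch under the simplifying assumptions that every CTB takes one time step and that one thread is available per CTB row; the function name is made up for illustration. CTB (x, y) is placed at step x + 2y, which puts every left, upper left, upper, and upper right neighbor in an earlier step.

```python
from collections import defaultdict


def wavefront_schedule(cols, rows, lag=2):
    """Map each time step to the CTBs that can be processed in that step.

    CTB (x, y) is placed at step x + lag * y, so each row starts `lag`
    CTBs behind the row above it.
    """
    steps = defaultdict(list)
    for y in range(rows):
        for x in range(cols):
            steps[x + lag * y].append((x, y))
    return dict(steps)


if __name__ == "__main__":
    cols, rows = 6, 3
    schedule = wavefront_schedule(cols, rows)
    for step in sorted(schedule):
        print(step, schedule[step])
    print("wavefront steps:", len(schedule), "serial steps:", cols * rows)
    print("peak parallelism:", max(len(ctbs) for ctbs in schedule.values()))
```

For a picture of 6 by 3 CTBs the wavefront needs 10 steps instead of 18, and at most 3 CTBs are active at once. The first and last steps have only one active CTB, which is the ramp-up and ramp-down that limits WPP efficiency.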

  3. Dependent slice
    h265 adds the dependent slice data unit, which divides a complete slice into several segments and encapsulates each one in its own NAL unit; unlike ordinary slices, these segments may depend on each other. Dependent slices facilitate parallel processing, especially at the decoder;
    ■ A CTB row used for wavefront parallel processing, or a Tile, can be packed into a separate NAL unit; such a unit is called a dependent slice;

h265 parallel strategy

  1. GOP-level parallelism
    GOP-level parallelism is feasible in h265, but because a GOP contains a large amount of data, the frequent memory reads and writes are time-consuming and resource-intensive, and the latency is also high;

  2. Image-level parallelism
    At the picture level, motion prediction creates temporal dependencies between pictures, but these dependencies are not unavoidable. A video sequence usually contains three picture types, I, P, and B. Both I and P pictures may be used as references, whereas B pictures can be organized into hierarchical levels, and B pictures of the same level can be independent of each other and processed in parallel;
    ■ This approach is strongly limited by the number of B pictures;
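
To make the idea of B-picture levels concrete, here is a sketch that assumes a dyadic hierarchical-B structure with a power-of-two GOP size. That structure is a common encoder configuration and an assumption of this example, not something the h265 standard mandates, and the function names are hypothetical. Pictures are identified by their position in display order.

```python
def temporal_levels(gop_size):
    """Assign a hierarchy level to each picture of one dyadic GOP.

    Level 0 holds the two key pictures at the GOP ends; each further level
    holds the B pictures halfway between pictures of the lower levels.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    levels = {0: 0, gop_size: 0}
    step, level = gop_size, 1
    while step > 1:
        half = step // 2
        for poc in range(half, gop_size, step):
            levels[poc] = level
        step = half
        level += 1
    return levels


def parallel_groups(gop_size):
    """Group pictures by level; B pictures in levels >= 1 share no references
    with each other under the assumed structure, so each such group can be
    processed in parallel once the lower levels are done."""
    groups = {}
    for poc, level in sorted(temporal_levels(gop_size).items()):
        groups.setdefault(level, []).append(poc)
    return groups


if __name__ == "__main__":
    for level, pictures in parallel_groups(8).items():
        print("level", level, "pictures", pictures)
```

For a GOP of 8 the groups are [0, 8], [4], [2, 6], and [1, 3, 5, 7], so at most four pictures can run in parallel and only at the highest level. This is the sense in which the approach is limited by the number of B pictures.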

  3. Slice-level parallelism
    CABAC in h265 resets the context model at slice boundaries, which allows the entropy coding of different slices to be performed in parallel; beyond entropy coding, the data of different slices can also be sent to different computing units without considering the correlation between them;
    ■ Slice parallelism has obvious disadvantages: the number of Slices in each picture is chosen by the encoder, so the decoder cannot determine the degree of parallelism in advance; deblocking filtering may cross Slice boundaries, which reduces parallel scalability; Slices can cause load imbalance; and they may increase the bit rate substantially;

  4. Tile-level parallelism
    For entropy decoding and prediction, the data of different Tiles are independent of each other, and CABAC also resets the context model at Tile boundaries, which makes Tile parallelism straightforward;
    ■ Deblocking filtering and sample adaptive offset may operate across Tile boundaries and require additional processing to remove the dependencies;
    ■ The disadvantages are possible load imbalance and a sharp increase in bit rate;

  5. CTB-level parallelism
    ■ CTB-level parallelism uses the wavefront parallel processing (WPP) algorithm;
    ■ The disadvantage is that, in this decoding model, WPP is not applied to the entropy decoding part, so entropy decoding and parallel processing must be performed separately. In addition, a stable output bit rate cannot be guaranteed;
    ■ To address these shortcomings, overlapped wavefront parallelism has been proposed. It reduces the loss of parallel efficiency caused by the number of active threads ramping up and down during wavefront processing, but it presupposes that motion vectors are small enough; otherwise the following picture may lack reference information when it is decoded;
    ■ From the point of view of the data structures, Tile-level and CTB-level parallelism can be used at the same time, but their shortcomings are then magnified, and h265 offers no corresponding solution.
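
The motion vector condition can be sketched as a readiness test. Everything here is a hypothetical illustration rather than part of any decoder API: the function names, the `margin` parameter (extra sample rows assumed for interpolation and in-loop filtering), and the assumption that the reference picture is completed one CTB row at a time from the top.

```python
def ref_rows_required(ctb_row, ctb_size, max_mv_down, margin, total_rows):
    """Number of CTB rows of the reference picture that must be complete
    before CTB row `ctb_row` of the next picture can be decoded.

    max_mv_down is the largest downward motion vector in samples (assumed known).
    """
    bottom_sample = (ctb_row + 1) * ctb_size - 1 + max_mv_down + margin
    return min(bottom_sample // ctb_size + 1, total_rows)


def can_start_row(ctb_row, ref_rows_done, ctb_size=64, max_mv_down=16,
                  margin=4, total_rows=17):
    """True if enough of the reference picture is decoded to start this row."""
    needed = ref_rows_required(ctb_row, ctb_size, max_mv_down, margin, total_rows)
    return ref_rows_done >= needed


if __name__ == "__main__":
    # Small motion: the next picture can start after 2 reference rows.
    print(ref_rows_required(0, 64, 16, 4, 17))
    # Large motion: 5 reference rows are needed, so little overlap remains.
    print(ref_rows_required(0, 64, 200, 4, 17))
```

When the downward motion is small, threads that run out of work at the end of one picture can start on the next picture almost immediately. With large motion vectors the required reference rows approach the whole picture and the overlap disappears.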

Origin blog.csdn.net/yanceyxin/article/details/131962164