Overview of the AVS Perceptually Lossless Compression Standard: Shallow Compression for Visually Lossless Video


Shallow compression, also known as mezzanine compression, is a level of video compression that effectively reduces video bandwidth while preserving overall video quality. Its compression ratio is usually 2:1 to 8:1. At this ratio, both 4K and 8K programs can be carried over a 10G interface, which greatly reduces the cost of network equipment. LiveVideoStackCon 2023 Shanghai invited Yang Haitao to introduce the practices and explorations of the AVS standards group and hardware manufacturers such as Shanghai HiSilicon in shallow compression of visually lossless video.

Text/Yang Haitao

Editor/LiveVideoStack

I am very honored to have the opportunity to talk with you about the latest video compression standard developed by AVS: perceptually lossless compression. As the name suggests, "perceptually lossless" emphasizes that the quality of the compressed images reaches a lossless level. The standard was originally called light compression, as opposed to heavy compression, a name that mainly emphasizes the relatively low computational complexity of encoding and decoding. Later, considering the effect, it was also called shallow compression; compared with deep compression, shallow compression emphasizes a lower compression ratio. When the standard was about to be finalized, the AVS standards group reached a consensus on the name PLC, perceptually lossless compression, which emphasizes the quality level of the compressed video.

Today I will introduce it in six parts: applications and requirements, an overview of the AVS PLC standard, the high-performance parallel processing mechanism, the low-level coding tools, CBR rate control and quality optimization, and the future outlook.

-01-

Applications and requirements


The application scenarios for shallow compression are display interfaces and content production. These two scenarios are usually not covered by H.265 or the AVS series of compression standards. Display interfaces include HDMI, DP, and other interfaces, both wired and wireless. Their common feature is that channel bandwidth is relatively abundant and lossless image quality is required: the content transmitted over these interfaces is at a digitally lossless quality level, without any distortion.

If the quality is so good, why compress it?

Taking the DP1.4 standard as an example, its bandwidth is 32 Gbps. At this bandwidth, if the signal is transmitted directly without any compression, one channel of 4K 60 fps data can be carried; but an 8K 60 fps signal exceeds what such a physical channel can bear. One way to deal with this is to keep widening the physical channel and thickening the cables, but that is not particularly convenient; thicker cables may no longer bend. The other way is to compress the signal to reduce the physical bandwidth requirement: this is interface compression. In content production, front-end professional cameras usually capture signals in YUV or RGB format and pass them to a media workstation before the content is edited. All editing work is performed on disk files, and disk reading and writing is a major bottleneck. The current solution is to convert the signal, once it reaches the media workstation, into a format that is very convenient to edit. This format requires each picture to be encoded and decoded independently, without predictive coding between pictures, so that each picture can be stored directly after it has been edited.
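As a rough sanity check of these numbers, the sketch below computes raw, uncompressed pixel bandwidth (assuming 10-bit 4:4:4/RGB and ignoring blanking and link-coding overhead, which are my simplifications) and compares it with a nominal 32 Gbps DP1.4 link.

```c
#include <stdio.h>

/* Raw pixel bandwidth in Gbps, ignoring blanking and link overhead. */
static double raw_gbps(int w, int h, int fps, int bit_depth, int components)
{
    return (double)w * h * fps * bit_depth * components / 1e9;
}

int main(void)
{
    const double link = 32.0;                          /* nominal DP1.4 rate from the talk */
    double uhd4k = raw_gbps(3840, 2160, 60, 10, 3);    /* ~14.9 Gbps: fits         */
    double uhd8k = raw_gbps(7680, 4320, 60, 10, 3);    /* ~59.7 Gbps: does not fit */

    printf("4K60 raw: %.1f Gbps (%s 32 Gbps)\n", uhd4k, uhd4k < link ? "fits in" : "exceeds");
    printf("8K60 raw: %.1f Gbps (%s 32 Gbps)\n", uhd8k, uhd8k < link ? "fits in" : "exceeds");
    printf("8K60 needs at least %.1f:1 compression\n", uhd8k / link);
    return 0;
}
```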

Shallow compression and deep compression differ greatly in technical requirements. Shallowly compressed content is not only displayed directly on a monitor; it also serves as the source (master) for distribution-domain encoding performed in the background. Its color formats are usually very high-quality formats such as YUV444 and RGB.

In the standard, shallow compression supports color bit depths of 8 to 16 bits, and supports both mathematically lossless and visually lossless operation. Shallow compression typically has a compression ratio of 3:1 to 10:1, which is significantly different from video distribution. When encoding 1080p with H.265, a typical bitrate is between 2 Mbps and 4 Mbps, which already gives very high-quality video; typical compression ratios there are 200:1 or even 500:1, the so-called deep compression or heavy compression. From this it is clear that with shallow compression, even after compression, the bitrate is on the order of hundreds of Mbps or even Gbps.
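To put the two regimes side by side, here is a small order-of-magnitude illustration; the source frame rates and chroma formats are my own assumptions, so the exact ratios are only indicative.

```c
#include <stdio.h>

static double raw_mbps(int w, int h, int fps, int bit_depth, double components)
{
    return (double)w * h * fps * bit_depth * components / 1e6;
}

int main(void)
{
    /* Deep compression: assume a 1080p30, 8-bit 4:2:0 source distributed at 2-4 Mbps. */
    double dist_raw = raw_mbps(1920, 1080, 30, 8, 1.5);     /* ~746 Mbps raw */
    printf("deep   : roughly %.0f:1 to %.0f:1\n", dist_raw / 4.0, dist_raw / 2.0);

    /* Shallow compression: assume a 1080p60, 10-bit 4:4:4 source at 3:1 to 10:1. */
    double mezz_raw = raw_mbps(1920, 1080, 60, 10, 3.0);    /* ~3733 Mbps raw */
    printf("shallow: roughly %.0f to %.0f Mbps after compression\n",
           mezz_raw / 10.0, mezz_raw / 3.0);
    return 0;
}
```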

Another very big difference is that shallow compression requires very low latency; interface compression in particular must reach line-level latency. Shallow compression also requires a high degree of parallelism: the signal specifications are very high, and real-time processing of such content must be done in parallel. Another point worth mentioning is that shallow compression specifies its own rate control algorithm as part of the standard, because cost and practical usability in specific scenarios have to be considered. As just mentioned, whether in the production domain or on an interface, random access to a single image is required. Low complexity means slightly different things for the display interface and for content production. Content production has relatively loose cost requirements, because many of its codecs are implemented in software on workstations. The interface standard, however, is ultimately implemented in chips that are embedded widely in all kinds of consumer devices; the cost constraints are very tight, and the complexity of the algorithm needs to be controlled in the standard specification. Content production also has a special requirement: multiple generations of encoding must not introduce significant quality degradation.

-02-

AVS PLC standard overview


At the August 2021 meeting, we formally proposed the standard requirements. The AVS standards group had not formulated a shallow compression standard before, and there was no corresponding standard in China. Therefore, when formulating the standard, we carefully discussed the different needs of different applications and proposed four technical requirements: fixed-bpp compression rather than fixed-ratio compression, pixel-line-level delay, a prescribed rate control model, and differentiated requirements for the display interface and the production domain.

After that, we received two CfP technical proposals. Both were good attempts, but some problems were discovered during evaluation. First, the requirements we proposed are tightly coupled with rate control, but neither proposal included a rate control part. In addition, the solution needed to be hardware-constrained so that it could be implemented in a low-cost interface chip. After discovering these problems, we established a shallow compression task force to work through "cloud-based closed development".

The formulation of standards can be divided into the following stages:

In March 2022, the focus was on controlling decoder-side cost. In June, we improved the basic rate control algorithm and provided a CBR rate control model. After that, the hardware pipeline solution was further optimized.

At the end of 2022, we converged once more toward low cost, aiming to be the best low-cost solution in the industry.

In March 2023, another round of technical discussion and convergence was conducted on the quality issues discovered during the extended testing. Finally, in April, the standard was completed and the FCD version was released.


The quality assessment used very high-quality images, including RGB and YUV444 formats. The main bit depths covered were 8-bit and 10-bit; 16-bit was also fully evaluated in a later extension.

The content is divided into two categories: natural content captured by cameras and computer-generated content. The evaluation follows the ISO 29170-2 standard, which contains two parts: the alternating flicker method and the side-by-side comparison method.

The alternating flicker method plays the pre-encoding and post-encoding images alternately at 8 Hz. If any flicker can be seen, the image quality does not meet the standard. This is a very strict criterion. In actual use, the image before encoding is usually not available, so the side-by-side comparison method is more commonly used: the same region is displayed on two screens, or on one screen split into two halves, and the location of the distortion (the place that flickers under the alternating flicker method) is pointed out to the viewers. If no one can see it, the test passes. The test equipment has certain requirements. First, make sure that the monitor and playback device support high bit depth, and ensure that the display interface does not perform any processing on the transmitted image. If you are not sure, you can use a grayscale test image to verify whether the display has high-bit-depth capability.

In actual operation, it was found that whether on an 8-bit display or a 10-bit display, when distortion occurs, the distortion intensity measured by the alternating flicker method is consistent. Therefore, in the subsequent standard-setting process, to simplify testing and allow more organizations to participate, all tests were conducted on 8-bit monitors.


In April this year, the AVS standards group organized a very detailed test at Pengcheng Laboratory: 27 test pictures were used, including camera-captured and computer-generated pictures, covering RGB, YUV444, and other formats. After data screening and analysis, 25 of the 27 pieces of content passed the flicker test, and all 27 passed the side-by-side comparison test.


In addition to the pass/fail image quality evaluation of the PLC standard solution, a comparative evaluation was also conducted against the existing DSC specification in the industry. The evaluation results are shown in the table in the lower-left corner of the slide.


This is the standard's architecture diagram. After the image is received, it is divided into slices. Since the height of a block determines the number of line buffers, the slices are further divided into CUs of 16×2 size. In addition, although a DCT transform module is included in the standard text, DCT is turned off at both the interface and frame-storage levels in order to meet the low-cost goal.

-03-

High performance parallel processing mechanism


Low-level parallelism refers to parallel entropy encoding and entropy decoding of the three components.

Entropy coding is the bottleneck in video encoding and decoding. To remove this bottleneck and truly achieve real-time encoding and decoding of 8K at 60 or 120 frames, each component needs to be processed in parallel. The compressed bitstream of each component is packed independently, so that each component has its own independent entropy encoder and entropy decoder. A substream interleaving operation then ensures that the three components stay synchronized.

The upper right corner of the slide shows the substream segment format. The segment size is related to the bit depth of the processed image: taking a 16×2 CU as an example, for 10-bit content the original values amount to 10 multiplied by 32, i.e., 320 bits, plus 16 bits of header overhead. A two-bit data header is added before the data payload to indicate which YUV component the segment belongs to, so the components can be distinguished during decoding. At the encoder, after the image is input, syntax elements for the three YUV components are produced; when they are ready for entropy coding, they are distributed to their respective entropy encoders, and after encoding they are divided into substream segments and placed into the bitstream buffer of each of the three components. Under a unified mechanism, the substream segments are interleaved into a single bitstream and passed to the decoder, which guarantees that the components stay synchronized without error. After receiving the single bitstream, the decoder first de-interleaves it; the substream segments of the three components are then placed into their respective buffers for decoding to obtain the final reconstructed pixels.

The signal complexity of the three components of YUV (or of RGB converted to YCoCg) differs. The Y component is harder to encode, and the numbers of substream segments produced for the Y component and for the U and V components of a coding block may be extremely unbalanced. The substream interleaving mechanism at the encoder guarantees that the decoder obtains the encoded data of all three components of a coding block at the same time: the encoder buffers according to the complexity of the three components, so the decoder does not need any additional buffering.
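Below is a minimal sketch of the substream multiplexing idea: fixed-size segments, each preceded by a small header carrying the component ID, are pulled from per-component bitstream buffers and concatenated into one stream. The segment size, scheduling policy, and header layout here are simplified assumptions for illustration, not the normative format.

```c
#include <stdio.h>
#include <string.h>

#define PAYLOAD_BYTES 40   /* 320-bit payload, matching the 10-bit example above */
#define NUM_COMP 3         /* Y, U, V (or Y, Co, Cg) */

typedef struct {
    unsigned char data[4096];   /* encoded bits of one component awaiting transmission */
    int len;
} comp_buf;

/* Emit one substream segment: a header byte carrying the 2-bit component ID,
 * followed by a fixed-size payload taken from that component's buffer. */
static int emit_segment(unsigned char *out, int out_len, int comp, comp_buf *b)
{
    out[out_len++] = (unsigned char)comp;                  /* simplified header */
    int n = b->len < PAYLOAD_BYTES ? b->len : PAYLOAD_BYTES;
    memcpy(out + out_len, b->data, n);
    memset(out + out_len + n, 0, PAYLOAD_BYTES - n);       /* pad a short final segment */
    out_len += PAYLOAD_BYTES;
    memmove(b->data, b->data + n, b->len - n);             /* consume from the buffer */
    b->len -= n;
    return out_len;
}

/* Interleave: whenever a component has a full segment ready, send it, so the
 * decoder can feed three parallel entropy decoders in lockstep; leftovers are
 * flushed at the end. Returns the length of the multiplexed stream. */
static int interleave(unsigned char *out, comp_buf bufs[NUM_COMP])
{
    int out_len = 0, progress = 1;
    while (progress) {
        progress = 0;
        for (int c = 0; c < NUM_COMP; c++)
            if (bufs[c].len >= PAYLOAD_BYTES) {
                out_len = emit_segment(out, out_len, c, &bufs[c]);
                progress = 1;
            }
    }
    for (int c = 0; c < NUM_COMP; c++)                     /* flush partial segments */
        while (bufs[c].len > 0)
            out_len = emit_segment(out, out_len, c, &bufs[c]);
    return out_len;
}

int main(void)
{
    static comp_buf bufs[NUM_COMP];
    static unsigned char mux[16384];
    bufs[0].len = 100;  /* pretend Y produced more coded bytes than U and V */
    bufs[1].len = 30;
    bufs[2].len = 25;
    printf("multiplexed stream: %d bytes\n", interleave(mux, bufs));
    return 0;
}
```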


High-level parallelism is easier to understand: after the image is obtained, it is divided into rectangular strips, that is, slices. Slices can be encoded in parallel, which is essentially a scalable architecture. As video specifications increase, say from 4K to 8K or from 30 frames to 60 frames, supporting the higher specification only requires adding more processing units or hardware cores to the design.

It should be noted that only slices arranged side by side in the horizontal direction can be processed in parallel. The core reason is that hardware processes an image line by line of pixels, and the decoder must output each line (or pair of lines) immediately after decoding it. A single line may span different slices, such as slice 0 and slice 1, and this requires some special handling. There is also a simple approach: one slice per block row. This is of course possible, but then there is no spatial prediction between slices and the compression efficiency is very poor, so larger slices are still needed. Each time slice 0 and slice 1 finish encoding a block row of height 2, their bitstreams are interleaved. The receiver needs a bitstream buffer holding one block row of slice data, so that the first-row data of slice 0 and slice 1 can be sent to different hardware cores for decoding at the same time. This is a design that is very tightly coupled to the hardware.

-04-

Low-level coding tools


The low-level tools can be divided into two categories: conventional coding tools and exception handling tools.

The conventional coding tools mainly provide the basic compression efficiency. For cost reasons, we selected three tools. The first is block prediction, which mainly relies on the reconstructed pixels in the row above and to the left for directional angular prediction. Its advantage is a very high degree of parallelism: all pixels in the block can obtain their predicted values at the same time. However, it cannot adapt well in areas where the texture changes. The point prediction shown in the upper right corner of the slide handles such complex textures well: each pixel is independently predicted, residual-coded, and reconstructed, and the reconstruction of one pixel is used as the prediction of the adjacent pixel. Its prediction efficiency is the highest, but it has a fatal problem: its hardware performance is very poor. To address this, we added constraints that guarantee that, when all pixels in a block are processed, at most 3 pixels ever need to be processed serially. First, the even-numbered columns are predicted: the reconstructed pixels above the block are used to predict the pixels in the first row of the block, and once the first row is obtained, prediction continues downward to obtain the pixels in the second row. After these two steps, horizontal prediction from the left and right is used to predict and encode all odd-numbered-column pixels. The last tool is block copy, which is mainly used for screen content. Much of the content processed by display-oriented compression is computer-generated, such as text and table borders in office documents like Word and Excel. Such content typically has very sharp edges, very rich high frequencies, and a lot of repetition. Block copy in 2×2 block units achieves good results on it, with compression efficiency much higher than block prediction and point prediction on typical content. Together, the three conventional tools provide very solid basic compression efficiency.
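The constrained point-prediction order can be sketched as below: even columns are predicted vertically (from the row above, then downward), and odd columns are then predicted horizontally from their already reconstructed neighbors, so no pixel sits more than 3 serial steps deep. Quantization and residual coding are omitted, and the exact interpolation used for odd columns is my assumption.

```c
#define CU_W 16
#define CU_H 2

/* Reconstruct one 16x2 CU in the constrained point-prediction order.
 * above[]  : reconstructed pixels of the row directly above the CU
 * resid[][]: decoded residuals (quantization omitted in this sketch)
 * rec[][]  : output reconstruction
 */
void point_predict_cu(const int above[CU_W],
                      const int resid[CU_H][CU_W],
                      int rec[CU_H][CU_W])
{
    /* Step 1: even columns of the first row, predicted from the pixel above. */
    for (int x = 0; x < CU_W; x += 2)
        rec[0][x] = above[x] + resid[0][x];

    /* Step 2: even columns of the second row, predicted downward from row 0. */
    for (int x = 0; x < CU_W; x += 2)
        rec[1][x] = rec[0][x] + resid[1][x];

    /* Step 3: odd columns, predicted horizontally from the reconstructed
     * even-column neighbours (average of left and right; left only at the
     * block edge). Every pixel is at most 3 serial steps from the input. */
    for (int y = 0; y < CU_H; y++)
        for (int x = 1; x < CU_W; x += 2) {
            int left = rec[y][x - 1];
            int pred = (x + 1 < CU_W) ? (left + rec[y][x + 1] + 1) >> 1 : left;
            rec[y][x] = pred + resid[y][x];
        }
}
```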

There are two exception handling tools. The first is original-value mode, i.e., PCM mode, whose main purpose is to prevent coding expansion. In some special cases, predictive coding produces more bits than directly coding the original values; once this is detected, the encoder falls back to coding the original values. The second is fallback mode. Because the codec is coupled with CBR rate control, the core of rate control is to determine the QP, and once the QP is determined the actual number of coded bits still fluctuates around the expected target; rate control cannot achieve bit-level accuracy. A mechanism is therefore needed that can forcibly keep the compressed bits below a threshold to avoid buffer overflow. Fallback mode is this last-resort safeguard.
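A hedged sketch of the two safeguards as a mode decision: if the best predictive mode would cost more bits than writing the raw samples, use PCM; if even that exceeds the hard bit budget derived from the CBR buffer state, force fallback mode. The bit accounting is schematic and the thresholds are placeholders.

```c
typedef enum { MODE_PREDICTIVE, MODE_PCM, MODE_FALLBACK } cu_mode;

/* Pick the coding mode for one CU.
 * best_pred_bits : bit cost of the best predictive mode found by RDO
 * budget_bits    : hard per-CU cap derived from the CBR buffer state
 */
cu_mode choose_cu_mode(int best_pred_bits, int pixels_in_cu, int bit_depth,
                       int budget_bits)
{
    int pcm_bits = pixels_in_cu * bit_depth;          /* cost of raw samples */
    int chosen   = best_pred_bits < pcm_bits ? best_pred_bits : pcm_bits;

    if (chosen > budget_bits)
        return MODE_FALLBACK;                         /* forcibly stay under the cap */
    return best_pred_bits < pcm_bits ? MODE_PREDICTIVE : MODE_PCM;
}
```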


In practice, we found that 16×2 block-level prediction does not adapt well to content with sudden texture changes. Through continued exploration, we found that making predictions at the sub-block level effectively solves this problem.

To this end, we developed several corresponding algorithms. The first is to directly divide the block into smaller sub-blocks, each of which independently performs DC prediction; this extension does reduce subjective distortion. The second is sub-block DC compensation: after a block has been coded, if the coding of its flat areas is not ideal, the difference between the original values and the current reconstruction can additionally be transmitted at the 4×2 or 8×1 level as a remedy. After compensation, the result is significantly improved and the coding quality is very good.
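A sketch of the sub-block DC compensation idea: after normal reconstruction, the rounded mean error of each 4×2 (or 8×1) sub-block is transmitted as an extra offset and added back. How the offset is quantized and signalled is omitted here.

```c
#define SUB_W 4
#define SUB_H 2

/* Compute and apply a DC offset for one 4x2 sub-block. The encoder would
 * transmit the returned value; encoder and decoder both add it to the
 * reconstruction so flat areas land closer to the original. */
int dc_compensate_subblock(const int orig[SUB_H][SUB_W], int rec[SUB_H][SUB_W])
{
    int sum = 0;
    for (int y = 0; y < SUB_H; y++)
        for (int x = 0; x < SUB_W; x++)
            sum += orig[y][x] - rec[y][x];

    int n  = SUB_W * SUB_H;
    int dc = (sum >= 0) ? (sum + n / 2) / n : -((-sum + n / 2) / n);  /* rounded mean */

    for (int y = 0; y < SUB_H; y++)
        for (int x = 0; x < SUB_W; x++)
            rec[y][x] += dc;

    return dc;
}
```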

In many typical images, the background between characters of text is flat, while the texture inside the text is very complex and usually handled by block copy. If the spacing between characters can use spatial prediction, such as prediction in the vertical direction, the horizontal-stripe distortion between characters can be significantly improved. The sub-block interpolation prediction shown in the upper right corner of the slide handles a situation that is relatively rare but has a great impact on image quality: all prediction modes of the current coding block have failed, for example the current block is flat but both the upper and left references are noise, so no effective prediction value can be obtained, and coding distortion perceptible to the human eye may appear in this flat area. This problem can be solved by directly coding the DC value of the entire block and then coding each pixel's difference relative to that DC value. In this mode, the current block is coded completely independently and does not depend on the pixels to the left or above. Combined, all of these modes give very good results in small, easily overlooked yet particularly sensitive flat areas.


No matter how poor the prediction is, if the residual can be coded with very fine quantization, the subjective result will be acceptable at the cost of only a few more bits. But this places a requirement on the quantization mechanism: it must support very fine adjustment.

Unlike general standards, our standard uses right-shift-based quantization. Taking AVS as an example, AVS2 and AVS3 perform fractional quantization: every time QP increases by 8, the quantization step size doubles. Such finely adjustable quantization usually consists of a multiplication and a right shift, and frequent high-throughput multiplication is a big problem for hardware. So we removed the multiplication and simplified quantization to a pure right shift; without multiplication, the quantization step size simply doubles with each step. Quantization operates at several levels. CU-level quantization is analyzed at the encoder based on block complexity: luma and chroma are analyzed separately to classify their complexity level, and the corresponding QP is then derived. The QP derivation is carried out identically at the encoder and the decoder, which is different from traditional codec standards: in traditional distribution-domain heavy compression, the QP is derived at the encoder and then transmitted to the decoder. In addition to the basic CU QP, a further adjustment can be made in each 2×2 sub-block. Based on a texture complexity analysis of the current block's references, if the current sub-block is judged to be relatively flat, 1 or 2 is subtracted from the CU QP to encode it at higher quality; if the sub-block is relatively complex, the CU QP is kept unchanged.
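A minimal sketch of multiplier-free right-shift quantization together with the sub-block QP tweak; the rounding offset, the clamping, and the flatness inputs are placeholders rather than the normative derivation.

```c
/* Right-shift quantization: the step size doubles with each QP increment, so
 * forward quantization is a shift with a rounding offset and inverse
 * quantization is a shift back - no multipliers in the hardware path. */
int quantize(int residual, int qp)
{
    if (qp <= 0) return residual;
    int off = 1 << (qp - 1);  /* round to nearest */
    return residual >= 0 ? (residual + off) >> qp : -((-residual + off) >> qp);
}

int dequantize(int level, int qp)
{
    if (qp <= 0) return level;
    return level >= 0 ? level << qp : -((-level) << qp);
}

/* Per-2x2-sub-block QP, derived the same way at encoder and decoder from the
 * reference texture; the two flatness flags stand in for that analysis. */
int subblock_qp(int cu_qp, int ref_is_flat, int ref_is_very_flat)
{
    if (ref_is_very_flat) return cu_qp >= 2 ? cu_qp - 2 : 0;
    if (ref_is_flat)      return cu_qp >= 1 ? cu_qp - 1 : 0;
    return cu_qp;
}
```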

In addition to sub-block quantization, there is point-by-point quantization, which is used in combination with the point prediction described above. If the residual of a point is relatively large, that area is hard to predict and a relatively larger QP is allocated; otherwise the area is easy to predict and flat, and since flat areas need stronger protection, the QP is reduced further.


The core concept of coefficient coding is semi-fixed-length coding of each coefficient group. Variable-length coding cannot be used in high-throughput processing, as the performance would not meet the requirements. Semi-fixed-length coding finds a common code length for the residuals contained in a coefficient group and transmits it in the header information of the coding block; all residuals in the group are then coded with that same fixed length, which greatly improves entropy coding throughput.
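A sketch of the idea: find the smallest bit width that covers every residual in the group, signal that width once, then write each residual with exactly that many bits so the decoder can slice them out in parallel. The sign handling and the coding of the width itself are my assumptions.

```c
#include <stdlib.h>

/* Bits needed for a signed value in a simple sign+magnitude form. */
static int bits_for(int v)
{
    int m = abs(v), n = 1;   /* at least 1 bit (the sign) */
    while (m) { n++; m >>= 1; }
    return n;
}

/* Semi-fixed-length coding of one coefficient group: the common width is the
 * maximum over the group and is written once in the CU header; every residual
 * is then emitted with that width. Returns the payload size in bits. */
int code_coeff_group(const int *resid, int count, int *common_width_out)
{
    int width = 1;
    for (int i = 0; i < count; i++) {
        int b = bits_for(resid[i]);
        if (b > width) width = b;
    }
    *common_width_out = width;   /* signalled in the block header */
    return count * width;        /* fixed-length payload, header cost excluded */
}
```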


RDO calculation is an encoder-side operation. It should be pointed out that, since CBR rate control is used to ensure that the buffer does not overflow, lower bit counts are preferred during RDO mode selection. RGB content is converted to YCoCg and then encoded.
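The talk only states that RGB is converted to YCoCg before coding and does not specify the exact variant, so the sketch below uses the common reversible (lifting-based) YCoCg-R formulation purely as an illustration.

```c
/* Reversible RGB <-> YCoCg-R transform (one common lifting formulation).
 * Whether AVS PLC uses exactly this variant is an assumption. */
void rgb_to_ycocg_r(int r, int g, int b, int *y, int *co, int *cg)
{
    *co = r - b;
    int t = b + (*co >> 1);
    *cg = g - t;
    *y  = t + (*cg >> 1);
}

void ycocg_r_to_rgb(int y, int co, int cg, int *r, int *g, int *b)
{
    int t = y - (cg >> 1);
    *g = cg + t;
    *b = t - (co >> 1);
    *r = *b + co;
}
```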

For high-level parallelism, rectangular slices need to be processed in parallel, and the width and height of the image must be aligned to the 16×2 CU size. A common alignment method is the familiar padding. There are two padding methods. One is to pad on the right side of the image; its advantage is lower complexity, but the coding load across slices may be unbalanced. The other is to pad on the right side of each slice, which solves the load-balancing problem but means that each hardware core needs to perform padding, which adds cost. The standard supports both padding methods, and you can choose according to your needs.

-05-

CBR rate control and quality optimization


This slide shows the leaky bucket model described in the H.264 standard. When a block is encoded, its coded bits are output into the rate control buffer, which is drained at a constant bit rate for transmission. Since block complexity constantly changes, with complex noisy areas and very simple flat areas, the bit rate entering the buffer fluctuates even though the blocks are processed at a constant rate.
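A schematic of the leaky-bucket accounting: coded bits enter per block, a constant number of bits drains per block interval, and the resulting fullness is what rate control reads. The reactions to overflow and underflow here are illustrative only.

```c
typedef struct {
    long fullness;      /* bits currently in the buffer */
    long capacity;      /* physical buffer size in bits */
    long drain_per_cu;  /* constant bits removed per CU interval (the CBR link rate) */
} leaky_bucket;

/* Add the bits of one coded CU, then drain at the constant channel rate.
 * Returns >0 on overflow (fallback mode should have prevented this),
 * <0 on underflow (stuffing bits are inserted), 0 otherwise. */
int bucket_update(leaky_bucket *lb, long coded_bits)
{
    lb->fullness += coded_bits;
    if (lb->fullness > lb->capacity)
        return 1;

    lb->fullness -= lb->drain_per_cu;
    if (lb->fullness < 0) {
        lb->fullness = 0;   /* underflow covered by the stuffing mechanism */
        return -1;
    }
    return 0;
}
```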


Rate control has two main inputs: first, whether the current block is complex or simple; second, whether the buffer is currently close to empty or close to full. Its output is the QP of the luma component and of the chroma components. The core idea is to judge whether, taken as a whole, the blocks coded before the current block were easy or hard to encode. This information alone is not enough: after all preceding blocks have been processed, their complexity is classified into 5 complexity levels, and each level has corresponding predicted coding bits. With this richer information, a better bit-rate plan and bit allocation can be made for the current block.
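A toy version of that decision: classify the block into one of five complexity levels with per-level predicted bits, then nudge the QP according to the buffer fullness and the per-CU budget. The tables, thresholds, and the luma/chroma offset are invented for illustration.

```c
/* Predicted coding bits per complexity level for a 16x2 CU (illustrative). */
static const int predicted_bits[5] = { 64, 128, 256, 384, 512 };

typedef struct { int luma_qp; int chroma_qp; } qp_pair;

/* Inputs: complexity level (0 = simple .. 4 = complex) and buffer fullness as
 * a fraction of capacity. Output: luma and chroma QP for the current CU. */
qp_pair decide_qp(int complexity_level, double buffer_fullness, int budget_bits)
{
    qp_pair qp = { 0, 0 };

    /* More complex blocks tolerate a larger QP (distortion is better masked). */
    qp.luma_qp = complexity_level;

    /* If this level is predicted to overshoot the per-CU budget, or the buffer
     * is close to full, raise QP; if the buffer is nearly empty, spend more. */
    if (predicted_bits[complexity_level] > budget_bits || buffer_fullness > 0.85)
        qp.luma_qp += 2;
    else if (buffer_fullness < 0.15 && qp.luma_qp > 0)
        qp.luma_qp -= 1;

    /* Chroma is usually smoother; track luma with a small offset (assumption). */
    qp.chroma_qp = qp.luma_qp > 0 ? qp.luma_qp - 1 : 0;
    return qp;
}
```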

The impact of rate control on final compression efficiency can exceed 10% or even 20%, so rate control also needs careful tuning. When a block is first analyzed, the decision is whether the current block is complex or simple. How complexity is defined is not just a matter of texture: "complex" and "simple" are really synonyms for hard and easy to encode. Even a block that looks complex is still a simple block if it can be coded with very few bits. For this reason we made a distinction: if a piece of screen content can be compressed well by block copy, it is also treated as a simple block. This change brings a very obvious quality improvement.

In the overall rate control, special attention must be paid to flat areas; a flat area in the middle of a complex area is the hardest place to handle well. When processing such places, the rate control first checks whether the buffer is nearly full. If the buffer is not very full, more bits are allocated even to relatively simple content to guarantee the quality of the area. After adding this rate control optimization, the quality of flat areas in the middle of complex areas improved significantly.


At the start of rate control there is a delay whose purpose is to accumulate a certain number of initial bits in the buffer; this avoids buffer underflow during subsequent CBR transmission. Even if underflow does occur, the standard has an underflow stuffing mechanism to cover it. Rate control uses buffer fullness as an input, but at the beginning and end of a slice the actual fill level does not match what rate control needs, so a virtual buffer fullness is used at these two places for rate control adjustment. The compression relies heavily on the reference pixel row above; if the upper pixel row is not available, the coding pressure increases, so more bits need to be allocated to the pixels of the first row to obtain better quality and avoid distortion at slice boundaries. A similar problem exists for the first column: pixels to the left of the first column are also unavailable, which affects block copy. When block copy cannot search to the left, it searches upward instead, to improve the quality of the reconstructed image.

-06-

Future outlook


Currently, AVS422 and AVS444 compression for the production domain is being launched. This work has two parts. One part is the camera side, or acquisition domain, which performs video compression: the AVS3 standard, which currently supports only 4:2:0, can be extended to support the 4:2:2 and 4:4:4 color formats, so that professional cameras can capture and compress video content with higher color fidelity. The production domain itself has slightly different compression requirements, and the AVS standards group is also planning a new production-domain standard to meet its need for highly parallel, low-complexity, very software-friendly single-frame editing. In addition, three-dimensional medical image coding requires mathematically lossless or subjectively lossless quality and can technically be classified as shallow compression; here, too, the AVS standards group has carried out unified standard planning.

Thank you all!


