Encoding Optimization and Application of Tencent Cloud V265/TXAV1 in Live Streaming Scenarios


Editor's Note: As live video continues to develop toward ultra-high definition, low latency, and high bit rates, and with the launch of Apple Vision Pro further expanding encoding requirements toward 3D and 8K 120fps, video encoding optimization has become ever more important and ever more challenging. LiveVideoStackCon 2023 Shanghai invited Mr. Jiang Aojie of Tencent Cloud to share the encoding optimization and application of Tencent Cloud V265/TXAV1 in live streaming scenarios, leading us to explore the infinite possibilities of audio and video technology.

Text/Jiang Aojie

Edit/LiveVideoStack

Hello everyone, I am Jiang Aojie from Tencent Cloud, mainly responsible for codec development and optimization. Today's topic is the encoding optimization and application of Tencent Cloud V265/TXAV1 in live streaming scenarios.


The talk has three parts: 1. Introduction to V265/TXAV1 live streaming capabilities; 2. Typical V265/TXAV1 live streaming business practices; 3. Key points of Tencent Cloud's encoding optimization technology for live streaming scenarios.

-Part 1-

Introduction to V265/TXAV1 live streaming capabilities

In today's Internet era, live video streaming has become a popular and widely used form of media because it connects service providers and consumers more directly. People can watch all kinds of content in real time through live streaming platforms and immediately share or obtain authentic insights and experiences. From online education to sports events, from game streaming to live-stream e-commerce, live streaming applications keep expanding their influence, and the industry shows a trend of diversification and rapid growth. As users demand ever higher video quality, video coding technology has become especially important in the live streaming field.

At present, AV1 and HEVC (H.265) encoders are widely used in live streaming thanks to their efficient compression and mature ecosystems. First, they transmit high-quality video content with greater compression efficiency: at the same bandwidth, a live streaming platform can deliver clearer, more detailed images and a more realistic viewing experience. Second, they need a lower bit rate for the same quality, which reduces the burden on network transmission and improves the stability and reliability of live streams.

To further optimize live streaming encoder performance and strengthen Tencent Cloud's live streaming services, the Shannon Lab of Tencent Cloud's Architecture Platform Department has spent more than a year optimizing the AV1/H.265 encoders for live streaming. The work targets the key challenges of the field: balancing encoding speed and quality under ultra-high-definition, high-resolution, high-bit-rate conditions; low-latency live streaming; and compression performance for 3D live streaming, so that Tencent Cloud can provide a higher-quality, more stable, and more interactive live experience.


In the recently concluded MSU 2022 codec comparison, our TXAV1 and V265 encoders achieved excellent results in the AV1 and HEVC tracks respectively, taking first place on most metrics. In the 30fps categories most relevant to live streaming, V265 saves more than 30% bit rate over x265 on average, and TXAV1 saves about 40%. In the cloud transcoding (480p, 720p, 1080p) categories, V265/TXAV1 also took the top two places.


The following are the results of our internal iterative tests of V265/TXAV1 in live streaming scenarios:

V265 vs. x265 medium: while encoding 20% faster, V265 saves more than 36% bit rate; even at 6x speed, the bit rate savings remain substantial.

TXAV1 vs. x265 medium: at the same speed, TXAV1 saves more than 40% bit rate; at 1.5x speed it still saves more than 35%.

TXAV1 vs. V265: at similar speeds, TXAV1 still compresses about 10% better.

-Part 2-

Typical V265/TXAV1 live streaming business practices

2.1 8K live streaming


For 8K live streaming, our capabilities can be summarized in three points: full-featured, low latency, and high performance.

Full-featured: supports 8K, 60fps, 10-bit, 150Mbps, 4:2:2, HDR, and ABR live streaming, essentially covering all functional requirements currently on the market.

Low latency: a single device can handle up to 8K 60fps, with no need to distribute the work across multiple machines, minimizing encoding delay.

High performance: even at 8K 60fps, compression performance remains significantly better than the x265 medium preset, at nearly 9x its speed.

2.2 Fast Live Streaming


Why is fast live streaming needed?

Fast live streaming is mainly used in scenarios that demand real-time interaction with the audience, such as e-commerce live streams, show live streams, and online education. Real-time interaction places very strict demands on latency: fast live streaming must stay within 500-1000ms, far tighter than standard live streaming.

However, low latency poses great challenges to encoder performance, mainly in three respects: shorter pre-analysis (less information can be gathered in advance), smaller GOPs (more pressure on compression performance), and less frame-level parallelism (which hurts multi-threaded encoding speed). The focus of optimization is therefore to improve compression performance while preserving speed. After targeted optimization, the fast live streaming scenario now saves 5%-7% bit rate compared with before.
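To make the latency pressure concrete, here is a back-of-the-envelope sketch (assumed numbers, not actual V265/TXAV1 defaults) of how a latency budget bounds the lookahead depth:

```cpp
#include <algorithm>

// At a given frame rate, every frame the encoder holds (for lookahead,
// reordering, or batching) adds one frame interval of end-to-end delay.
int maxLookaheadFrames(double encoderBudgetMs, double fps) {
    double frameMs = 1000.0 / fps;  // ~33 ms at 30 fps
    return std::max(1, static_cast<int>(encoderBudgetMs / frameMs));
}
// e.g. if ~200 ms of a 500 ms end-to-end target is left for the encoder,
// maxLookaheadFrames(200, 30) == 6 -- far less pre-analysis than the
// dozens of frames a standard live pipeline can afford.
```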

2.3 MV-HEVC


When Apple Vision Pro was officially unveiled at Apple's Worldwide Developers Conference (WWDC), Apple noted that its hardware codec support for the MV-HEVC coding standard significantly improves the subjective and objective experience of 3D video. Before that, Tencent had already completed support for MV-HEVC encoding, helping compress 3D video and achieve better subjective 3D quality. On the left of the slide is the common 3D compression method: the left- and right-view images are stitched side by side, compressed with a general-purpose encoder, then decoded and split back into two videos. This is why a downloaded 3D video opened without a 3D player shows two side-by-side images.

On the right is the MV-HEVC approach. Instead of stitching the two views, the views at the same instant are grouped into one access unit for encoding. The advantage is that the left-eye and right-eye pictures are no longer independent of each other; they can reference each other, which greatly improves compression efficiency.


The principle of MV-HEVC is to exploit the redundancy between the left- and right-view images to further improve compression. For example, with the common side-by-side method, both views of the first frame are effectively intra-coded, so compression is relatively poor. With multi-view coding, the left view is an I frame and the right view is a P frame that can fully reference the left view, improving compression significantly. The smaller the disparity between the two eyes, the more pronounced the gain.
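A minimal sketch of this reference arrangement (illustrative C++ types, not the actual V265 API):

```cpp
#include <vector>

enum class FrameType { I, P };

struct Picture {
    int poc;                          // display instant
    int view;                         // 0 = left/base, 1 = right/dependent
    FrameType type;
    std::vector<const Picture*> refs; // reference pictures
};

// For the first access unit the left view is an I frame, but the right
// view can already be a P frame predicting from it; with small disparity
// that prediction is nearly free, which is where the gain comes from.
void buildAccessUnit(Picture& left, Picture& right, bool keyFrame) {
    left.view = 0;  right.view = 1;
    left.type  = keyFrame ? FrameType::I : FrameType::P;
    right.type = FrameType::P;        // never needs to be intra-only
    right.refs.push_back(&left);      // inter-view reference
}
```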

Our current tests cover 8 JCT3V test sequences and 5 3D movies, with a final average compression gain exceeding 20%. As the results show, the 3D movies benefit more: they contain many distant shots, so the disparity between the eyes is small, and the gain from MV-HEVC grows as the motion becomes more intense or the left-right disparity becomes smaller.

-Part 3-

Key points of Tencent Cloud's encoding optimization technology in live streaming scenarios

9cb90527b6991048248ebe5a7ad0f335.png

3.1.1 Engineering Optimization: Data Structure


Video at large resolutions, high bit rates, and high frame rates has stringent speed requirements, so it needs streamlined core data structures and an optimized pipeline: every repeated calculation or copy can cause a significant slowdown. Since V265 and TXAV1 were built from scratch, we set a clear goal at the very start of the design: make the core data structures as efficient and concise as possible:

1. TreeNode: makes node attribute information easy to obtain and avoids repeated calculation. For example, whether a node can be split, which modes are available, and its width, height, and position can all be computed in advance.

2. CoreUnit: the core storage structure holds the essential coding information, which both saves algorithms the cost of frequent accesses and makes it efficient to fetch information about neighboring blocks.

3. IdenticalCu: reuses the results of identical CUs to cut computation. Since some blocks end up partitioned the same way under different parent nodes, IdenticalCu stores their information once and avoids splitting them again, enabling reuse.

4. SwapBuffer: reduces copying and recalculation by alternating between two memory buffers (see the sketch below). Comparing performance before and after SwapBuffer: on the general test set the speedup is about 5%, and at 8K it exceeds 20%. In other words, at 8K, repeated calculations, copies, and large memory accesses can all cause severe slowdowns.
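A minimal double-buffer sketch of the SwapBuffer idea (illustrative, not the actual V265/TXAV1 code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Keep two buffers and promote a winning candidate by swapping pointers
// instead of copying reconstruction data.
template <typename T>
class SwapBuffer {
public:
    explicit SwapBuffer(std::size_t n) : a_(n), b_(n), cur_(&a_), next_(&b_) {}

    T* current() { return cur_->data(); }  // best result so far
    T* scratch() { return next_->data(); } // candidate being evaluated

    // The candidate beat the current best: a pointer swap replaces what
    // would otherwise be a full copy of the block's pixels/coefficients.
    void commit() { std::swap(cur_, next_); }

private:
    std::vector<T> a_, b_;
    std::vector<T> *cur_, *next_;
};
```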

3.1.2 Engineering Optimization: Process Optimization


The pipeline itself was further optimized for ultra-high-definition, high-bit-rate, high-frame-rate encoding. Take AV1 as an example: in the original pipeline, the analysis, filtering, and entropy coding of a given CTU cannot be completed together, because filtering depends on frame-level parameters that are not yet available when the current block finishes analysis. Certain algorithms can derive the whole frame's filter parameters earlier to improve parallelism, but that approach adds data copies, hurting both speed and cache hit rate.

Analysis showed that for ultra-high-definition, high-bit-rate, high-frame-rate video, filtering contributes little to compression performance yet weighs heavily on overall speed, a poor cost-performance trade-off, so some filtering operations can reasonably be cut. Moreover, at high frame rates the filter parameters of adjacent frames are highly similar, and high-layer frames tend not to be filtered at all. Could filtering be skipped, or parameters reused from other frames? We therefore designed a set of adaptive filtering algorithms that reduce both filtering itself and parameter derivation.

This opens up a lot of room in the pipeline: for frames that need no parameter derivation, each CTU can be filtered and entropy-coded immediately after analysis, eliminating the copy and reload and significantly improving encoding speed. V265 had in fact always used this fused pipeline, but AV1's filter-parameter derivation initially made it impossible to implement in TXAV1.
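The contrast between the two flows, as a C++ sketch (hypothetical types and stage functions, not the real encoder loop):

```cpp
#include <vector>

struct FilterParams { /* deblock/CDEF/loop-restoration settings */ };
struct CTU { /* one superblock's coding state */ };

struct Frame {
    std::vector<CTU> ctus;
    FilterParams params;
    bool reuseFilterParams;  // set per frame by the adaptive algorithm
};

// Hypothetical per-CTU stages; real signatures differ.
void analyze(CTU&);
void filter(CTU&, const FilterParams&);
void entropyCode(CTU&);
void deriveFrameFilterParams(Frame&);  // needs whole-frame statistics

// When filter parameters are reused, each CTU is analyzed, filtered, and
// coded in one pass so its data stays hot in cache; otherwise fall back
// to the original multi-pass flow with its copy-and-reload cost.
void encodeFrame(Frame& f) {
    if (f.reuseFilterParams) {
        for (CTU& ctu : f.ctus) {
            analyze(ctu);
            filter(ctu, f.params);   // immediate in-loop filtering
            entropyCode(ctu);        // ...and immediate entropy coding
        }
    } else {
        for (CTU& ctu : f.ctus) analyze(ctu);
        deriveFrameFilterParams(f);  // the frame-level dependency
        for (CTU& ctu : f.ctus) {
            filter(ctu, f.params);
            entropyCode(ctu);
        }
    }
}
```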

After the final changes, tests on 8K sequences show a speedup of more than 5%. On the general test set, since filtering is skipped less often, the speed impact is relatively small.

3.1.3 Engineering Optimization: Multithreading


This is the overall flowchart of our multithreading. We divide the work into many parts, including pre-analysis, frame level, slice/tile level, macroblock level, and post-processing, and design parallelism-improving algorithms for each part.

For example: at the frame level, parallelism across non-reference frames, high-concurrency reference-frame optimization, and frame-level priority adjustment; at the macroblock level, WPP analysis parallelism and WPP-like parallelism; in post-processing, staggered filter reference derivation, macroblock-level filter parallelism, running multiple filters in parallel, and so on.

We also have a set of adaptive parallel controls (sketched below). In many cases parallelism is not lossless, so we must contain the losses while improving speed. For example, given the thread count, the number of device cores, the image width and height, and the image complexity, the encoder can adaptively adjust WPP parallelism, CTU size, GOP length, and so on.
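The general shape of such a decision, with invented thresholds purely for illustration:

```cpp
#include <algorithm>

struct ParallelCfg { int ctuSize; int wppRows; };

// If the machine offers more threads than there are 128x128 CTU rows,
// a smaller CTU doubles the number of rows WPP can schedule.
ParallelCfg pickParallelism(int threads, int heightPx) {
    ParallelCfg c{};
    c.ctuSize = (threads * 128 > heightPx) ? 64 : 128;
    c.wppRows = std::min(threads, heightPx / c.ctuSize);
    return c;
}
```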


Although we already had a complete set of parallel solutions, the 8K scenario raised new problems:

The first problem: high first-frame delay and low CPU usage. Normal intra-frame parallelism relies mainly on WPP, in which each CTU depends on its left and top-right neighbors, so CTU rows can only run in a staggered pattern; in the example on the left of the slide, at most 4 blocks are processed in parallel. To quantify the degree of parallelism we summarized a formula, where w and h are the numbers of CTU columns and rows: the ideal parallelism is P = (w × h) / (w + 2(h − 1)). Even this is only an ideal ceiling, because WPP parallelism gradually ramps up at the start of a frame and drains away at the end.

Therefore, reducing first-frame delay requires raising the maximum parallelism. Tiles have no dependencies on one another, so tile parallelism is a feasible solution. Taking 4K as an example, the maximum WPP parallelism is 16.2, so in theory any tile count above 16 should yield a speed benefit. Measurements confirm this: a 4×4 tile split already gains 2.2% speed, and 4×8 gains 25.59%, outperforming WPP.
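The formula is easy to sanity-check; this small program (assuming 64×64 CTUs) reproduces the 16.2 figure for 4K:

```cpp
#include <cstdio>

// Ideal WPP parallelism for a frame of w x h CTUs: total CTUs divided by
// the length of the wavefront schedule (each row starts 2 CTUs after the
// row above it).
double wppParallelism(int w, int h) {
    return static_cast<double>(w) * h / (w + 2.0 * (h - 1));
}

int main() {
    // 3840x2160 with 64x64 CTUs -> 60 columns x 34 rows -> ~16.2,
    // matching the 4K figure quoted above.
    std::printf("4K WPP parallelism: %.1f\n", wppParallelism(60, 34));
    return 0;
}
```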

But how can the compression loss in this process be reduced or avoided? If the goal is only to reduce first-frame delay, not every picture needs multi-tile encoding, and an adaptive method can be used. The adaptation works in two directions: first, adaptively computing how many tiles to add; second, applying multi-tile encoding only to key frames while other frames use the original scheme. This combination effectively reduces latency while keeping the compression loss under control.


The second problem: on machines with very many cores, 8K encoding speed stops improving. Overall profiling showed that pre-analysis had become the bottleneck of the entire encoding pipeline, capping the speed. Multiple optimizations apply here, from input through pre-analysis: CUTREE parallel optimization, multi-slice parallel input, SLICE&BATCH parallel optimization, and multi-slice load balancing, which avoids the excessive delays caused by unbalanced loads. Multi-slice load balancing partitions the slices according to the picture so that no slice is disproportionately heavy; because the cost depends on content, in practice we record the time of each slice and keep adjusting the boundaries until the split converges (see the sketch below).
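One simple feedback scheme in that spirit (illustrative, not the production algorithm): move CTU rows away from slices that measured slow in the previous frame.

```cpp
#include <cstddef>
#include <vector>

// Move one CTU row per frame from a slice that measured slow toward its
// faster neighbor; over a few frames the per-slice times converge even
// when content complexity is very uneven across the picture.
void rebalanceSlices(std::vector<int>& sliceRows,           // rows per slice
                     const std::vector<double>& timesMs) {  // last frame
    double avg = 0;
    for (double t : timesMs) avg += t;
    avg /= timesMs.size();
    for (std::size_t i = 0; i + 1 < sliceRows.size(); ++i) {
        if (timesMs[i] > 1.1 * avg && sliceRows[i] > 1) {
            --sliceRows[i]; ++sliceRows[i + 1];   // shrink a slow slice
        } else if (timesMs[i + 1] > 1.1 * avg && sliceRows[i + 1] > 1) {
            ++sliceRows[i]; --sliceRows[i + 1];   // or its slow neighbor
        }
    }
}
```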

The diagram on the right shows SLICE&BATCH parallelism. BATCH mode is multi-frame parallelism, while SLICE mode is intra-frame slice parallelism. We found that at 8K neither alone is sufficient; combining the two achieves a much higher degree of parallelism and a better overall result.

After this series of acceleration measures, the overall effect is a 101% speedup for a 0.1% compression-performance loss, a speedup-to-loss ratio of 1001:1.

3.2 Algorithm optimization

c654f2d0ed8856921ecfc25683a8b5b2.png

Some acceleration algorithms also run into new problems at 8K. For example, at 8K the transform stage takes a much larger share of encoding time, dragging down speed, so we tried a non-standard DCT to simplify the transform. For a 64x64 block, only the first 16 rows of the column transform are computed; after transposition, again only 16 rows are transformed, so only a small set of low-frequency positions is actually computed and the remaining positions are filled with zeros. This cuts forward-DCT time by 50%-60%, and since the DCT accounts for a large share of 8K encoding time, the overall saving is very noticeable.

However, during rollout we found that this acceleration causes a relatively large quality loss on blocks with sharp boundaries, such as complex textures, the edges of text, and human eyes. So we added CTU-level scene detection, reusing the image-complexity information already produced by pre-analysis (avoiding extra computation) to decide whether the current block is smooth. Only the 64x64/32x32 TUs of smooth blocks use the non-standard DCT. The end result is a 6% speedup for a 0.5% compression-performance loss.
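A sketch of the low-frequency-only transform (a textbook floating-point DCT-II stands in for the integer transform a real encoder uses; normalization is omitted):

```cpp
#include <cmath>
#include <cstring>

constexpr int N = 64;     // TU size
constexpr int KEEP = 16;  // low-frequency rows/columns actually computed

// Only the first KEEP frequency rows of the column pass are produced,
// then only the first KEEP frequency columns of the row pass; all other
// coefficients are simply left at zero.
void partialDct64(const float in[N][N], float out[N][N]) {
    float tmp[KEEP][N];
    std::memset(out, 0, N * N * sizeof(float));
    const float pi = 3.14159265358979f;
    for (int u = 0; u < KEEP; ++u)          // vertical (column) pass
        for (int x = 0; x < N; ++x) {
            float s = 0;
            for (int y = 0; y < N; ++y)
                s += in[y][x] * std::cos((2 * y + 1) * u * pi / (2 * N));
            tmp[u][x] = s;
        }
    for (int u = 0; u < KEEP; ++u)          // horizontal (row) pass
        for (int v = 0; v < KEEP; ++v) {
            float s = 0;
            for (int x = 0; x < N; ++x)
                s += tmp[u][x] * std::cos((2 * x + 1) * v * pi / (2 * N));
            out[u][v] = s;                  // 16x16 low-frequency corner
        }
}
```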


Low-latency live streaming also needs optimization to avoid a large compression loss. The slide shows the IPPP structure commonly used in low-latency scenarios. Analyzing its performance, we found that because each frame simply references the previous one, every frame receives a similar QP offset: all frames are equally important, all sit in the same layer, and there is no hierarchy. Keeping every frame in one layer is bad for the encoder; the usual hierarchical structures give low-layer frames a smaller QP, which clearly improves compression performance.

The optimization we proposed introduces a miniGOP structure of 4 frames per group, adjusts the reference relationships, optimizes cutree propagation for this low-delay miniGOP, and strengthens references to the low-layer frames, so that layering emerges naturally while overall error resilience also improves. Furthermore, to protect low-layer frame quality when the lookahead is short, the output was changed to a frame-by-frame push structure, ensuring that the backward temporal dependencies needed for a low-layer frame's QP calculation are available when that frame is pushed out. Together, the structural and algorithmic changes distribute QP across layers much more effectively; in low-latency live streaming, the performance improvement is very clear.
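One plausible arrangement of such a 4-frame low-delay miniGOP (the exact V265/TXAV1 structure and offsets are not published; these values are illustrative):

```cpp
#include <array>

// poc = display order (identical to coding order in low delay);
// layer 0 is the anchor the rest of the group leans on.
struct FrameCfg {
    int poc;
    int layer;     // 0 = low-layer anchor, 1 = ordinary P frame
    int qpOffset;  // applied on top of the rate-control QP
};

std::array<FrameCfg, 4> makeMiniGop(int startPoc) {
    return {{
        {startPoc + 0, 0, -2},  // anchor: spend more bits here...
        {startPoc + 1, 1, +1},  // ...and slightly fewer on the frames
        {startPoc + 2, 1, +1},  //    that reference it
        {startPoc + 3, 1, +1},
    }};
}
```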

3.3 Subjective optimization


To further improve subjective quality in live streaming, we added support for ROI (Region of Interest) encoding, raising quality in the areas the human eye attends to. But the ROI feature also introduced two problems:

Problem 1: large bit rate fluctuations. Because ROI reshapes the QP distribution of the image, the actual bit rate fluctuates considerably around the target. To solve this we optimized in two respects, intra-frame and inter-frame. First, intra-frame: to find better QP offsets for the ROI and the ROU (the non-ROI region), we fitted functions of the strength, area, and complexity of the two regions and adjusted the QP calculation:

QP_roi = QP − func1(QP_frame, roi_cplx × roi_area, rou_cplx × rou_area, roi_strength, rou_strength);

QP_rou = QP + func2(QP_frame, roi_cplx × roi_area, rou_cplx × rou_area, roi_strength, rou_strength).

The complexity here extends the block-complexity evaluation the encoder already performs, so no extra computation is needed. With this change, the bit rate fluctuation dropped from 32% to 15%.

Second, inter-frame: to further reduce fluctuation from frame to frame, we adjusted the rate reservoir model so it can fine-tune fluctuation via the ROI strength. When the bit rate rises, the ROU strength is increased to bring the rate down; if the rate keeps rising past a threshold, the ROI strength is also reduced to cut further, and vice versa (see the sketch below). With this series of measures, the fluctuation fell from 15% to 5%, meeting our rate-control expectations.
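That control loop might look roughly like this (thresholds and step sizes invented for illustration):

```cpp
#include <algorithm>

struct RoiStrength { double roi; double rou; };

// As the rate reservoir fills, squeeze the ROU first; past a second
// threshold, also back off the ROI boost. The reverse applies when
// there are bits to spare.
RoiStrength adjustStrength(RoiStrength s, double reservoirFullness) {
    if (reservoirFullness > 0.6) s.rou += 0.1;  // cheapen the background
    if (reservoirFullness > 0.8) s.roi -= 0.1;  // then sacrifice some ROI
    if (reservoirFullness < 0.4) s.rou -= 0.1;  // bits to spare: relax,
    if (reservoirFullness < 0.2) s.roi += 0.1;  // restore ROI quality
    s.roi = std::clamp(s.roi, 0.0, 1.0);        // keep QP offsets bounded
    s.rou = std::clamp(s.rou, 0.0, 1.0);
    return s;
}
```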


Problem 2: an obvious speed drop. We again optimized in two ways. First, fast algorithms for the coding partition decision: regions of interest tend to split into small blocks, while non-ROI regions tend to stay as large blocks. Since the frame is already divided into ROI and ROU, the partition-decision algorithm can be tuned per region (sketched below), improving encoding speed.
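For instance, the search depth range could be constrained per region (illustrative depths):

```cpp
// depth 0 = 64x64 ... depth 3 = 8x8 (HEVC-style CU depths)
struct SearchRange { int minDepth; int maxDepth; };

SearchRange partitionRange(bool inRoi) {
    return inRoi ? SearchRange{1, 3}   // ROI: skip trying very large CUs
                 : SearchRange{0, 1};  // ROU: skip trying very small CUs
}
```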

We also made corresponding engineering optimizations, such as accelerating the inference framework, upgrading and pruning the model, and dynamically adjusting the algorithms and parameters used to downsample the input images.

The final result: average encoding time increases by only 5%, while subjective image quality in manual evaluation improves by 32.3%, a clear subjective gain.

3.4 Others


The above covers only part of the optimization work; there is much more. For example:

Algorithms: adaptive Intra skip, pre-analysis MVP skip, pre-analysis sub-pixel ME skip, pre-analysis Intra mode search optimization, Inter search mode skip optimization, Intra chroma mode RD search optimization, Inter compound mode skip, reference frame selection optimization, hierarchical filter skipping, fast filtering based on reference-block information, and so on.

Engineering: CTU coding optimization, CTU reference-mode information copy optimization, Intra reference-pixel copy optimization, Tile syntax update optimization, CTU residual-information copy optimization, coding-information statistics optimization, cost-table calculation optimization, edge-extension optimization, reconstruction copy optimization, and so on.


Finally, one addition: beyond server-side encoding, our team also delivers real-time terminal encoding for live streaming through R265, output via Tencent Cloud and already widely used on Tencent Cloud terminals. It supports zero-latency compression well on both x86 and ARM platforms, and R265 saves 30% bit rate at a speed similar to x264 veryfast.

My sharing ends here.


