Lower cost and a better viewing experience: an analysis of the self-developed S265 codec

Bandwidth is the biggest cost in live-streaming operations. Qianzhan.com estimates that industry-wide CDN spending will exceed 30 billion yuan in 2020 and approach 100 billion yuan in 2025 (https://bg.qianzhan.com/trends/detail/506/200715-ec767b9b.html). Reducing bandwidth is therefore a crucial part of cost control.

At the same time, bandwidth must not be reduced at the expense of the viewing experience. When we watch a live stream, the first requirement is a clear picture with no blocking artifacts ("mosaic"), so that the host and the products in the live room are clearly visible. The core technology for delivering high-definition picture quality at a controllable cost is the video codec.

The digital video signal captured by a camera is usually in YUV format, where each pixel takes 1.5 bytes (this corresponds to 8-bit YUV 4:2:0). Taking 720p at 25 fps as an example, the raw bandwidth is 263.67 Mbps, and one hour of live streaming amounts to 124.4 GB of traffic. If 1 million people watched this stream at the same time, the CDN cost would exceed 100 million yuan. Fortunately, video is highly correlated both within a frame and between frames. Once video compression removes this correlation, the bandwidth can be reduced to 1/100-1/400 of the original.
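The raw-video figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name `raw_video_rate` is ours. The quoted numbers only come out if megabits are counted as 2**20 bits and gigabytes as 10**9 bytes, which is an assumption inferred from the figures themselves.

```python
def raw_video_rate(width, height, fps, bytes_per_pixel=1.5):
    """Return (bits per second, bytes per hour) for uncompressed video."""
    bytes_per_second = width * height * bytes_per_pixel * fps
    return bytes_per_second * 8, bytes_per_second * 3600


bits_per_second, bytes_per_hour = raw_video_rate(1280, 720, 25)

# Assumed unit conventions: binary mega (2**20) for Mbps, decimal giga (1e9) for GB.
print(f"{bits_per_second / 2**20:.2f} Mbps")      # 263.67 Mbps
print(f"{bytes_per_hour / 1e9:.1f} GB per hour")  # 124.4 GB per hour
```

Dividing the raw rate by 100-400, as the paragraph suggests, gives roughly 0.7-2.8 Mbps in decimal units for this 720p example.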

Video compression standards mainly comprise the MPEG series from ISO (International Organization for Standardization) and the H.26x series from the ITU (International Telecommunication Union). Since the 1990s, the industry has gone through three major generations of standards: H.262/MPEG-2, H.264/AVC, and H.265/HEVC. Roughly every ten years, a new standard doubles the compression ratio. HEVC (H.265), the generation after AVC (H.264), offers a more flexible coding structure and block partitioning, and brings many improvements to motion compensation, motion vector prediction, intra prediction, transform, deblocking filtering, and entropy coding. Thanks to these new coding tools, it can save up to half the bit rate compared with H.264 at the same image quality.

Existing HEVC encoders have many problems and cannot meet our needs. For example, the hardware encoders in mobile phone chips produce poor image quality, and the open-source software encoder X265 is slow and its compression performance is not ideal. In the well-known MSU encoder competition of 2019, X265 saved 23.4% in bit rate relative to X264, while the top-ranked encoder on the PSNR metric (Tencent V265) saved 52.3% relative to X264.

S265 is a high-performance H.265 encoder developed jointly by the Taobao technology team and Alibaba Cloud. It has three characteristics: high compression, high speed, and wide applicability across scenarios. Using the test set of the 2019 competition (100 1080p videos at 10 bit-rate points) as the benchmark, S265 saves 56% of the bit rate compared with X264, about 4 percentage points ahead of Tencent V265. S265 is now in production in Taobao Live, Youku video, Alibaba Cloud MTS, VMate, DingTalk meetings, and other services. With speed presets aligned, S265 achieves a BD-rate gain of more than 20% over the open-source X265 while encoding 100%-600% faster. Compared with last year, Taobao Live has more than doubled its online scale while its total annual cost has barely increased, and S265 played a vital role in that.

S265 on JCTVC class B~F sequences

Speed grade | Ali265 vs X265 (RC=ABR): BitSaving @ same quality | Ali265 vs X265 (RC=ABR): SpeedRatio @ same bitrate | Ali265 vs X264 (RC=ABR): BitSaving @ same quality | Ali265 vs X264 (RC=ABR): SpeedRatio @ same bitrate
Veryfast | -20.2% | 210% | -40.7% | 55%
Medium | -18% | 396% | -42.3% | 66%
Veryslow | -21.5% | 620% | -50.4% | 62%

Taobao Live now fully supports S265, delivering high-quality 720p video at 800 kbps. On this year's Double Eleven, 1080p HD live streaming was also launched on the PC client. To meet the needs of real-time traffic control and low-latency live streaming, S265 also supports second-level bit-rate control and an extremely low-latency compression mode. For Youku video on demand, S265 implements 10-bit HDR and non-real-time extreme compression, which saved Youku a great deal of bandwidth cost during Double Eleven.

S265 optimizes encoding quality from two directions, rate control and coding tools, and optimizes speed from two directions, fast algorithms and engineering optimization. Together these greatly improve encoding quality, encoding speed, and encoding delay.

Coding quality optimization based on perceptual model


The goal of rate control is to allocate bits where they are most valuable, so as to minimize coding distortion at a target bit rate, or to minimize the bit rate at a fixed distortion.
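This goal is commonly written as a constrained optimization and solved with a Lagrange multiplier. The formulation below is the textbook version, not S265's actual model: the symbols $D_i$, $R_i$ (distortion and bits of block $i$), $R_T$ (bit budget), $N$ and $\lambda$ are our notation, and the blocks are assumed to be independent, which ignores the inter-block references that cutree and MB-tree are designed to account for.

```latex
\min_{\{R_i\}} \; \sum_{i=1}^{N} D_i(R_i)
\quad \text{s.t.} \quad \sum_{i=1}^{N} R_i \le R_T

J = \sum_{i=1}^{N} D_i(R_i) + \lambda \sum_{i=1}^{N} R_i

\frac{\partial J}{\partial R_i} = 0
\;\Longrightarrow\;
\frac{\partial D_i}{\partial R_i} = -\lambda \quad \text{for all } i
```

At the optimum every block operates at the same rate-distortion slope, so one extra bit buys the same distortion reduction wherever it is spent.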

In frame-level rate control, the traditional method tracks the long-term complexity of all coded frames, and a QP computed from this single factor cannot respond to changes in the picture in time. Based on cutree theory, we accurately estimate the rate of the I, B, and P frames within the lookahead (pre-analysis) window and their expected encoded size, so that more accurate quantization parameters are available before encoding.

ABR is a commonly used rate control mode. However, the rate control in HM, which is based on the R-λ model, does not consider the reference relationships between image blocks, and its coding quality is relatively poor. The frame-level QP derivation of the MB-tree method used in X265 and X264 is not well founded, and its rate control accuracy is relatively poor, averaging only 89%. Starting from the principle that "a bit allocated to any CU yields the same marginal value", we reworked the theory behind the MB-tree method, which raised rate control accuracy to 97% and coding quality by 0.65 dB, corresponding to a bit-rate saving of 17%.

First, the QP derivation for I frames. X265 uses an empirical value that ignores the characteristics of the video itself, which is unreasonable. We pre-analyze the complexity of a low- or medium-resolution version of the image and, together with the target bit rate, run a second iterative search to obtain an accurate QP.

Second, as time goes by, the complexity weight of historical frames keeps growing and the weight of newly generated frames keeps shrinking, so the encoder cannot respond quickly to changes in complexity. We compute a Qlambda from the reference intensity of the newly generated frames and weight it with the original Q to obtain the actual Q, which reflects the complexity of the new frame and its subsequent frames in time.

Third, the original algorithm uses a Viterbi-based P-frame decision algorithm. Each frame has to be compared with historical frames, so the complexity is very high, and the influence of QP is not considered when deciding P frames, so the accuracy is low. Our algorithm only needs the rate of change between adjacent frames and introduces QP as the decision threshold, which greatly reduces computational complexity and improves accuracy.

Block-level bit allocation is governed by cutree in the temporal domain and AQ in the spatial domain. In the temporal domain, S265 models how noise energy, motion intensity, texture edge intensity, and encoding configuration parameters relate to the information propagation ratio. In the spatial domain, starting from the human visual system, it calculates the value of different blocks under a perceptual model and allocates more bits to the areas that attract more visual attention.

We have published a CCF-A paper on rate control optimization:

An Exploration of Lookahead in Frame Bit Allocation and Slice Type Decision,

DOI 10.1109/TIP.2018.2887200, Zhenyu Liu, Libo Wang, Xiaobo Li, and Xiangyang Ji

In terms of coding tools, S265 improves the algorithms behind traditional tools such as scene-change detection, frame type decision, SAO, deblocking, two-pass encoding, and RDOQ, and implements a batch of additional tools such as long-term reference frames.

Intelligent rate control is our self-developed rate control algorithm. To hit the target bit rate, ordinary ABR or CBR rate control wastes many bits on low-complexity scenes. According to subjective quality models of human vision, once PSNR exceeds a certain threshold the human eye cannot perceive further improvement, and the extra quality only consumes bits. We use machine learning to estimate, from 17 kinds of historical encoding information and the complexity of the frame to be encoded, the quantization parameter that keeps the frame just above the quality threshold. The result is capped by the ABR target bit rate, so that each frame is encoded at the most suitable bit rate.

Online verification in Taobao Live showed a bit-rate saving of 15%. In DingTalk live streaming it saved 52% of bandwidth and reduced stalls on the stream-pushing side by 62%.
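The idea of capping quality at a perceptual threshold while never exceeding the ABR budget can be sketched as follows. This is only an illustration: `choose_frame_qp` and `toy_predictor` are hypothetical names, and the linear PSNR predictor stands in for the machine-learning model, whose inputs and form the article does not describe.

```python
def choose_frame_qp(qp_abr, predict_psnr, psnr_threshold, qp_max=51):
    """Raise QP above the ABR choice while predicted PSNR stays above the threshold.

    The returned QP is never lower than qp_abr, so the frame never spends
    more bits than ABR alone would have allowed.
    """
    qp = qp_abr
    while qp < qp_max and predict_psnr(qp + 1) >= psnr_threshold:
        qp += 1
    return qp


def toy_predictor(qp):
    """Hypothetical stand-in for the learned quality model (PSNR in dB)."""
    return 58.0 - 0.6 * qp


print(choose_frame_qp(20, toy_predictor, 42.0))  # 26: easy scene, bits are saved
print(choose_frame_qp(30, toy_predictor, 42.0))  # 30: already below threshold, ABR decides
```

On an easy scene the QP rises until quality sits just above the threshold; on a hard scene the ABR decision is left untouched.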

Taobao Live content is now very diverse, and texture, lighting, background, and motion differ from scene to scene. Outdoor hosts often move around, so the picture changes frequently. Beauty hosts mostly sit indoors under bright lighting. Jewelry hosts mainly shoot objects, so the picture is mostly static. A single encoder configuration cannot meet the needs of Taobao Live, and selecting the best parameters for the content has become a direction of industry research.

To meet this need, we proposed scene-based encoding parameter configuration strategies. First, we trained several deep learning and machine learning models to classify tens of thousands of live videos by content. We then used a large server cluster to search automatically and efficiently for the combination of encoding parameters that reduces bit-rate consumption as much as possible while improving image quality. Finally, the resulting parameter sets were clustered into several configuration profiles. With this method we obtained a BD-rate gain of 7-10% in Taobao Live and more than 10% in Taobao production.

Encoding speed optimization to achieve full H265


In terms of speed, S265 adds optimizations at two levels: fast algorithms and engineering.

HEVC can partition image blocks from 64x64 down to 4x4, and the number of block types has increased sharply: the number of candidate coding modes is several times that of H.264. Block partitioning and mode decision have therefore become a major bottleneck. S265 obtains prior information from reference blocks and picture texture and cuts a large share of the computation in three steps: hierarchical prediction, early termination, and CNN-assisted decision.

We divide the CU partition decision into two steps. The first is a texture-strength decision: flat blocks and complex blocks are distinguished by calculating the texture gradient of the CU. A flat block terminates partitioning immediately, and a complex block continues to be split. This first step resolves most partition decisions. For ambiguous blocks, the second step relies on a CNN model to assist the decision.

We used a small 5-layer network model to raise decision accuracy from 72% to 96%.

We presented this work at the Data Compression Conference (DCC):

Enhance the HEVC Fast Intra CU Mode Decision Based on Convolutional Neural Network by Corner Power Estimation

Liangliang Chang, Zhenyu Liu, Libo Wang, Xiaobo Li, 2018 Data Compression Conference
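The first, gradient-based step of the CU partition decision described above can be sketched as follows. The gradient measure (mean absolute difference between neighbouring pixels), the function names, and both thresholds are assumptions made for illustration; the article gives neither the exact gradient definition nor the threshold values.

```python
def texture_gradient(block):
    """Mean absolute difference between horizontally and vertically adjacent pixels."""
    height, width = len(block), len(block[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += abs(block[y][x + 1] - block[y][x])
                count += 1
            if y + 1 < height:
                total += abs(block[y + 1][x] - block[y][x])
                count += 1
    return total / count


def cu_split_decision(block, flat_threshold=1.0, complex_threshold=8.0):
    """Return 'no_split', 'split', or 'ask_cnn' (thresholds are hypothetical)."""
    gradient = texture_gradient(block)
    if gradient < flat_threshold:
        return "no_split"   # flat block: stop partitioning early
    if gradient > complex_threshold:
        return "split"      # clearly complex block: keep dividing
    return "ask_cnn"        # ambiguous block: defer to the learned model


flat = [[128] * 8 for _ in range(8)]
ramp = [[100 + 3 * x for x in range(8)] for _ in range(8)]
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
print(cu_split_decision(flat), cu_split_decision(ramp), cu_split_decision(checker))
```

Only the blocks that fall between the two thresholds would reach the CNN, so the network runs on a small part of the CUs.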

RDO (rate-distortion optimization) involves two quantities: distortion and bit rate.

Distortion is calculated as the sum of squared errors (SSE) between the original image and the reconstructed image. The reconstructed image has to be obtained first, which is a relatively long process: motion estimation, transform, quantization, inverse transform, inverse quantization, reconstruction, and other steps, with a large amount of computation.

We noticed that the DCT preserves energy, so the distortion can be calculated directly in the transform (frequency) domain. The subsequent inverse transform, inverse quantization, reconstruction, and SSE steps can then be bypassed, saving a lot of computation. With this method the speed of the whole distortion estimation module is doubled, while the BD-PSNR loss is only 0.023 dB.
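The energy-preservation property can be checked numerically. The sketch below uses a floating-point orthonormal DCT-II on an arbitrary 4x4 example, for which the pixel-domain and transform-domain SSE agree up to rounding. HEVC itself uses a scaled integer approximation of the DCT, so in a real encoder the equality is only approximate; the sample blocks and helper names here are ours.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2(block):
    """2-D DCT: C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def sse(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


original = [[52, 55, 61, 66], [63, 59, 55, 90], [62, 59, 68, 113], [63, 58, 71, 122]]
reconstructed = [[50, 56, 60, 68], [64, 58, 57, 88], [60, 60, 66, 115], [65, 57, 70, 120]]

pixel_sse = sse(original, reconstructed)
freq_sse = sse(dct2(original), dct2(reconstructed))
print(pixel_sse, freq_sse)  # the two values agree up to floating-point rounding
```

Because the two values match, an encoder can compare the transform coefficients with their dequantized values and skip the inverse transform when estimating distortion.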

The other part of RDO is the bit rate.

After quantization, H.265 divides the residual coefficients into 4x4 sub-blocks. Each sub-block contains 16 coefficients that are coded together. The coded content has two parts. The first describes the coefficient distribution, including SCF (significant coefficient flag), GTR1, and GTR2 (greater-than-1 and greater-than-2 flags). For this part we built a linear estimation model from statistics and estimate its size from the coefficient characteristics. The second part is the Golomb codewords of the non-zero coefficients, where the order k of each codeword varies. Here we assume that each codeword is coded with its best order k and then add a tail-coefficient compensation according to the distribution of the largest codewords.

With this technique the rate estimation module is accelerated by 35.6%, while the BD-PSNR loss is only 0.057 dB.
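The "best order k" assumption can be illustrated with code lengths of k-th order Exp-Golomb codes. This is a simplified sketch: HEVC actually binarizes the remaining coefficient level with a Golomb-Rice prefix plus an Exp-Golomb escape and adapts k as it goes, and the tail compensation mentioned above is not modelled. The function names and the `max_k` limit are ours.

```python
def exp_golomb_bits(value, k):
    """Length in bits of the k-th order Exp-Golomb codeword for value >= 0."""
    floor_log2 = (value + (1 << k)).bit_length() - 1
    return 2 * floor_log2 - k + 1


def best_order_bits(value, max_k=4):
    """Optimistic length: assume the codeword uses whichever order k is shortest."""
    return min(exp_golomb_bits(value, k) for k in range(max_k + 1))


def estimate_level_bits(levels, max_k=4):
    """Estimated bits for a list of remaining absolute levels in one sub-block."""
    return sum(best_order_bits(level, max_k) for level in levels)


print(exp_golomb_bits(1, 0))             # 3  (codeword "010")
print(best_order_bits(10))               # 5  (reached with k = 2 or k = 4)
print(estimate_level_bits([0, 1, 10]))   # 1 + 2 + 5 = 8
```

Taking the minimum over k underestimates the true cost, which is presumably why the text adds a compensation term afterwards.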

AZB is short for all-zero block.

After intra or inter prediction, a block with very small distortion has all of its residual coefficients quantized to 0. But because this is not known in advance, the block still goes through the complicated RDO calculations.

Can an all-zero block be detected in advance? Yes. According to rate-distortion theory, the distortion is proportional to the square of the quantization step Q, and SATD can be used to estimate the distortion after prediction. When the SATD is below a certain threshold, the block can be judged to be an all-zero block.
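A minimal sketch of such an early all-zero-block test on a 4x4 residual is shown below. It computes SATD with an unnormalized Hadamard transform and compares it with a threshold that scales with the quantization step, using the usual approximation Qstep = 2^((QP-4)/6). The scaling constant `alpha` and the function names are hypothetical; the article does not give the actual threshold.

```python
HADAMARD_4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def satd_4x4(residual):
    """Sum of absolute values of H * R * H^T (no normalization)."""
    tmp = [[sum(HADAMARD_4[i][k] * residual[k][j] for k in range(4))
            for j in range(4)] for i in range(4)]
    out = [[sum(tmp[i][k] * HADAMARD_4[j][k] for k in range(4))
            for j in range(4)] for i in range(4)]
    return sum(abs(v) for row in out for v in row)


def qstep(qp):
    """Approximate quantization step size for a given QP."""
    return 2 ** ((qp - 4) / 6)


def is_all_zero_block(residual, qp, alpha=8.0):
    """Predict that every coefficient quantizes to zero (alpha is a tuning guess)."""
    return satd_4x4(residual) < alpha * qstep(qp)


small = [[1] * 4 for _ in range(4)]
large = [[50] * 4 for _ in range(4)]
print(is_all_zero_block(small, 30), is_all_zero_block(large, 30))  # True False
```

A block flagged this way can skip transform, quantization, and RDO entirely. A wrong flag costs some quality, so in practice the threshold would be tuned conservatively.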

Motion search is the process of finding the best matching block in the reference frame, and consists of integer-pixel search and sub-pixel search. Sub-pixel search requires 7-tap or 8-tap interpolation filtering, which is computationally expensive. There are many fast algorithms for integer-pixel search, but sub-pixel search has lacked a good one. Around the integer pixel (the rectangle in the figure) there are 60 sub-pixel positions. We set up a two-dimensional quadratic error surface, use the prediction errors at 9 integer pixels to solve for the 5 coefficients of the equation, and then take its partial derivatives to obtain the best sub-pixel position. Only 1 quarter-pixel position needs to be computed, which avoids computing the remaining 59 sub-pixel positions.
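One way to realise this with exactly 5 coefficients is the separable surface E(x, y) = a·x² + b·y² + c·x + d·y + e, fitted by least squares to the 3x3 grid of integer-pixel costs. The sketch below assumes that model (the article does not spell out the equation) and uses the closed-form least-squares solution for a 3x3 grid. Setting the partial derivatives to zero gives the minimum, which is then snapped to quarter-pixel precision. All names are ours.

```python
def best_subpel_offset(errors):
    """errors[dy + 1][dx + 1] is the cost at integer offset (dx, dy), dx, dy in {-1, 0, 1}.

    Returns the (x, y) offset of the fitted minimum, rounded to quarter-pixel steps.
    """
    points = [(dx, dy, errors[dy + 1][dx + 1])
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    total = sum(e for _, _, e in points)
    sum_x = sum(dx * e for dx, _, e in points)
    sum_y = sum(dy * e for _, dy, e in points)
    sum_xx = sum(dx * dx * e for dx, _, e in points)
    sum_yy = sum(dy * dy * e for _, dy, e in points)

    # Closed-form least-squares coefficients for the 3x3 grid.
    a = sum_xx / 2 - total / 3
    b = sum_yy / 2 - total / 3
    c = sum_x / 6
    d = sum_y / 6

    if a <= 0 or b <= 0:
        return 0.0, 0.0  # surface is not bowl-shaped; stay on the integer position

    def snap(v):
        return max(-0.75, min(0.75, round(v * 4) / 4))

    # dE/dx = 2ax + c = 0 and dE/dy = 2by + d = 0
    return snap(-c / (2 * a)), snap(-d / (2 * b))


# Synthetic cost surface with its true minimum at (0.25, -0.5).
errors = [[2 * (dx - 0.25) ** 2 + 3 * (dy + 0.5) ** 2 + 10 for dx in (-1, 0, 1)]
          for dy in (-1, 0, 1)]
print(best_subpel_offset(errors))  # (0.25, -0.5)
```

Only the predicted position then needs to be interpolated and checked, instead of searching the sub-pixel positions one by one.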

Having covered motion search, let's look at intra prediction. Intra prediction uses the row above and the column to the left of the current block to predict the pixel values of the current block.

H.265 has 35 intra prediction modes, 33 of which are angular. Evaluating all 33 is very expensive.

We adopt a fast decision method based on a Bayesian model. We first calculate the cost in three directions, modes 10, 18, and 26, and use the cost distribution over these three to determine whether the optimal angle is on the horizontal or the vertical side. This directly halves the amount of computation. The remaining 17 directions are then searched with a conventional fast 5-step method to obtain the best prediction angle.

With this method we reduced the 33 angle evaluations to 9, which increased the speed of the module by 300%.
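The coarse-to-fine structure of such a search can be sketched as below. This is a simplification under stated assumptions, not the method described above: the Bayesian classifier is replaced by simply starting from the cheapest of the three probe modes, and a 4/2/1 step refinement stands in for the 5-step method. The `cost` callback, which would be a SATD-based mode cost in an encoder, and the function name are hypothetical, and the cost is assumed to be roughly unimodal over the angle.

```python
def fast_angular_search(cost, probes=(10, 18, 26), steps=(4, 2, 1)):
    """Return (best_mode, number_of_cost_evaluations) over angular modes 2..34."""
    cache = {}

    def evaluate(mode):
        if mode not in cache:
            cache[mode] = cost(mode)
        return cache[mode]

    # Coarse stage: probe horizontal (10), diagonal (18) and vertical (26).
    best = min(probes, key=evaluate)

    # Refinement stage: test both neighbours of the current best at shrinking steps.
    for step in steps:
        centre = best
        for candidate in (centre - step, centre + step):
            if 2 <= candidate <= 34 and evaluate(candidate) < evaluate(best):
                best = candidate
    return best, len(cache)


# Toy cost with a single minimum at mode 23.
print(fast_angular_search(lambda mode: abs(mode - 23)))  # (23, 9)
```

With this particular step schedule the sketch evaluates at most 3 + 6 = 9 modes, which happens to match the count quoted above, though the actual algorithm may reach that number differently.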

 

In addition to RDO, we also improved the slice type decision algorithm, the dynamic Lagrangian multiplier adjustment algorithm, and the fast deblocking and SAO decisions.

In terms of engineering optimization, we also made a number of improvements. The first is C function optimization: by optimizing control flow, splitting out special-case paths, merging branches, using lookup tables, and optimizing loops, we made modules such as RDOQ, coefficient parsing, and deblocking nearly twice as fast. The second is SIMD: we hand-optimized the assembly code of compute-intensive functions for execution speed.

On mobile devices we also wrote a large amount of ARMv7 and ARM64 assembly for S265, which is at least twice as fast as the C version, and achieved real-time encoding of 720p at 30 frames per second on a low-end phone, the iPhone 6S.


Low-latency encoding: better interactivity without loss of picture quality


In live streaming, low latency means efficient communication and a better experience, so reducing latency matters a great deal. Encoding delay accounts for a large share of the end-to-end delay. However, encoding delay and compression efficiency pull against each other: the lower the delay, the lower the compression efficiency. In our tests on X264 and X265, coding efficiency in the low-delay (300 ms) mode is lower than with unlimited delay, and in the zero-delay mode it is more than 50% lower than with unlimited delay.

To improve coding efficiency in the low-latency and zero-latency modes, we adopted the following optimizations:

  1. Cu-tree forward short-range propagation: by modeling the lookahead buffer length, S265 obtains a model of the relationship between buffer length and temporal propagation, which allows a very short buffer while retaining the quality advantage of a long one. In tests, the optimized lookahead=4 configuration saves 13.5% of the bit rate compared with the unoptimized one, effectively reducing encoding delay.

  2. Cu-tree backward propagation: in zero-delay mode no forward reference frames are available, but we can borrow backward frames to predict the propagation cost and improve compression efficiency with a technique similar to short-range propagation.

  3. GPB: in zero-delay mode there is no backward reference frame, so traditional B frames are no longer available. GPB (generalized P and B) frames can be used instead to improve compression efficiency.

  4. WPP parallelism: frame-level parallelism greatly improves parallel efficiency but increases latency. Intra-frame WPP (wavefront parallel processing) makes full use of multi-core CPUs while still achieving zero latency.

With these methods combined, compression efficiency in S265's low-latency (300 ms) mode is only 4% lower than with unlimited delay, and in zero-delay mode it is only 15% lower.

High-performance decoder to solve client compatibility issues


The biggest problems in decoding are compatibility and performance. To solve them, we first adapted to hardware decoders, raising the H.265 hardware decoding support rate on Android and iOS to 95% and 75% respectively. What about the remaining 5% and 25%? And what about the web, which does not support hardware decoding at all?

This is where our high-performance decoder comes in.

We developed a heavily optimized H.265 software decoder with 25,000 lines of hand-written assembly, raising decoding speed to 240% of FFmpeg's. On a Xiaomi 5 phone, 1 Mbps 720p H.265 decodes at more than 240 fps, and CPU usage is kept below 20%.

Many people think optimization is just a matter of writing C and assembly, but in fact it requires a good understanding of computer architecture.

Take a block filtering routine as an example: after SIMD optimization alone it runs 3 times faster.

But with aligned memory loads, merged computation, cache-miss optimization, data prefetching, multi-core parallelism, branch optimization, register optimization, and delay-slot optimization, we eventually reached 63 times the speed.


Summary


With S265 fully deployed, Taobao Live has cut the bit rate by more than 50% with no degradation in image quality, which directly reduces bandwidth cost. Stall and instant-start metrics have also improved: stall VV dropped by 25%, and the instant-start rate rose by 1 percentage point in absolute terms.

Bandwidth compression is the constant pursuit of encoders. Beyond Taobao Live, S265 also covers live streaming, on-demand, and conferencing services across the Alibaba video ecosystem, saving a great deal of bandwidth cost.

  1. Youku long-form video on demand: launching S265 brought large bit-rate savings. Videos produced by S265 are currently played more than 100 million times a day. S265 also provides 10-bit HDR support for Youku.

  2. Alibaba Cloud MTS transcoding: S265 reduces the bit rate of MTS short-video transcoding at the same speed and image quality, and also powers RTC real-time communication services.

  3. DingTalk meetings and group live streaming: compared with the scc mode of open264, S265 cuts the bit rate by a further half.

The next-generation compression standard VVC (H.266) has been released, and we have already invested in building a VVC encoder. We expect to launch a self-developed VVC-based encoder on Double Eleven next year, reducing bandwidth cost by a further 40%. By then, everyone will be able to enjoy live video at a lower bit rate and higher quality.

Author| Tao Department Audio and Video Technical Team

Edit| Orange

Produced| Alibaba's new retail technology

Origin blog.csdn.net/Taobaojishu/article/details/110913995