Intelligent Perceptual Coding: Optimization and Implementation Practice

Author | XHF

Introduction

Perceptual coding optimization, which exploits the characteristics of the human visual system, has become a key optimization technique for UGC scenarios such as Internet short video and OTT. It can reduce video bit rate while improving the viewing experience.

Today's content covers four aspects. First, the technical background of perceptual coding; second, its core technologies; third, some practical applications; and finally, a brief look at overall trends in codec technology.

The full text is 8,207 words; expected reading time is 20 minutes.

01 Background of Intelligent Perceptual Coding Technology

We are now in the era of big video, and video traffic keeps growing. 4G gave birth to the explosion of the short-video industry, which has not slowed to this day; 5G is now widely deployed, and its high bandwidth and low latency enable an even richer ultra-high-definition video experience.

From beginning to end there has been a tension between user experience and cost: growing video traffic brings higher bandwidth costs. Saving bandwidth without degrading the experience, achieving a win-win, is the goal we as engineers keep working toward.

[Figure]

How do we compress bandwidth?

First of all, you need a good encoder. Coding standards have evolved over decades, through many generations of codecs, each adding substantial encoding optimizations on top of the previous one. This is the basis for building a competitive self-developed encoder.

Because a coding standard only specifies the bitstream format, we are free to add new tools and algorithms to achieve better compression efficiency at the same quality. This is the ongoing goal of our algorithm engineers.

Today, HEVC (H.265) already accounts for more than 90% of distribution in Internet UGC scenarios. Many vendors are weighing whether the next generation will be AV1 or H.266, and there is much discussion in the industry. So beyond optimization based on the codec standard, do we have other optimization methods? The answer is yes.

AI-assisted perceptual coding is an important optimization direction. It can bring further bandwidth savings, and the industry has proposed related concepts such as narrow-band HD and low-bit-rate HD. The "perceptual encoder" in Baidu Smart Cloud's "Intelligent Sensing Super Clear" technology is essentially an application of perceptual coding optimization. Today we will offer some new interpretations of this technology in light of practice.

[Figure]

Let's first look at the basic principles of perceptual optimization.

The most fundamental goal of video compression is better perceptual quality for the human eye. Traditional compression pipelines, such as those used in PGC scenarios, generally use PSNR as the quality metric, which mainly measures how close the compressed video is to the original at the pixel level. Viewed from the perspective of the human visual system, we can instead use metrics closer to human perception, such as SSIM or the more recent VMAF, or even better, AI-based no-reference video quality assessment methods.

A better video quality assessment algorithm is itself an ongoing goal. Starting from the characteristics of the human eye, we can identify several visual models. One is visual sensitivity: whether the eye is more sensitive to texture quality or to the quality of flat areas. Another is the just-noticeable-difference (JND) characteristic: distortion below a certain threshold simply cannot be perceived, while beyond it the eye becomes sensitive. Finally, there is the visual attention mechanism: the eye focuses on the parts it finds interesting. We can exploit all of these characteristics for further optimization.

In summary, perceptual coding optimization includes the following:

  1. First, perceive the content and enhance the image quality;

  2. Then, on top of image quality enhancement, optimize bit rate allocation — for example, identify ROI regions and allocate bits to them more effectively;

  3. Finally, combine these techniques with core encoder optimization to achieve further bit rate savings.

In terms of goals, we want not only to compress bandwidth but also to improve user experience — and the experience improvement can in turn be converted back into bit rate savings.

Perceptual optimization is a double-edged sword. When converted into bit rate savings through pre-processing, it saves bits if applied well, but can introduce negative artifacts if applied poorly. That is why we keep emphasizing that this is an exercise in comprehensive technical capability.

[Figure]

Next, let's look at how it is done. In recent years, Baidu Smart Cloud's intelligent video cloud team has been building the "Intelligent Sensing Super Clear" technology brand, of which the perceptual encoder and perceptual optimization technology are a very important part.

These parts include content-aware coding, perceptual processing, and core encoder optimization. Today's talk revolves around these core technologies.

[Figure]

Based on the "Intelligent Sensing Super Clear" technology, Baidu Smart Cloud's video cloud team has built a series of solutions for ToB customers.

We turn algorithms into actual products — public cloud, private deployment, and all-in-one appliances — and, with acceleration on various hardware platforms, empower customers' businesses end-to-end from video production to playback.

The figure below shows the basic architecture of our product solution.

[Figure]

The picture below shows our product portfolio.

As a cloud product vendor, we cannot deeply couple technology with each business the way a consumer-facing (C-side) team can. We must abstract the technology into standardized products at multiple levels, so that it can serve more B-end users in product form.

In addition to the public cloud, private deployment, and all-in-one appliances just mentioned, we also offer the perceptual encoder as an SDK.

[Figure]

GEEK TALK

02 Core Technologies of Intelligent Perceptual Coding

After introducing the basic product capabilities, let's look at the core technologies of intelligent perceptual coding.

As mentioned at the beginning, a good encoder is the foundation of all this work.

Over the past two or three years we have invested continuously in the R&D of our core encoder, BD265. So how do you build a competitive encoder on top of the coding standard?

We approach it from two directions:

  • On the one hand, forward technology-driven optimization: on top of the coding standard, improve the encoder's compression capability with new coding tools and algorithms. Better rate control is one example — bit rate allocation is a critical optimization point, and how to allocate bits to where they matter most will be discussed later. We have also added many algorithms not mandated by the standard, such as pyramid B-frames, GPB, and various pre-processing-related algorithms aimed at quality optimization.

    At the same time, engineering optimization mainly means encoding efficiency: improving encoding speed as much as possible without losing quality. The encoder has dedicated optimizations for the ARM platform, covering both mobile and server, and supports a variety of fast algorithms. These are the forward-driven technologies. We also rely on scenario feedback to some extent — for example, we have made many special optimizations for live-communication scenarios.

  • On the other hand, scenario-feedback-driven optimization. In the video-on-demand scenario, under continuous pressure to cut costs and improve efficiency, we have made many optimizations for extreme compression, including extensive subjective tuning of the whole compression pipeline. As the chart in the upper right shows, we have iterated continuously from version 1.0 to 5.0. Compared with open-source x265, our encoder achieves more than 40% bit rate savings on objective metrics including PSNR, SSIM, and VMAF. This encoder is the foundation for all further perceptual optimization work.

[Figure]

We keep talking about perception — so what exactly do we perceive? Content characteristics.

Which features relate to video quality? In terms of subjective, human-perceived quality: higher bit rate generally means better quality, higher resolution means better quality, and higher frame rate means smoother playback. Quality is also codec-dependent — different codecs achieve different compression ratios. But one more factor is crucial: content characteristics. Each piece of content has its own rate-distortion (RD) curve, and that curve is closely tied to the characteristics of the content.

So how do we determine a better bit rate allocation — or even an optimal bit rate configuration — for each video, or even each frame?

More concretely, how do we choose the optimal bit rate and resolution combination for a video in production? Since businesses now use ABR (adaptive bit rate) ladders, choosing the optimal combination requires in-depth analysis of content characteristics.

Content-adaptive encoding is essentially a rate-control problem: based on content characteristics, we must find the relationship between video quality, content, bit rate, and resolution. Finding it is not easy. The naive way is to exhaustively encode every combination, but that obviously cannot meet the latency requirements of the business.

Artificial intelligence gives us the means to analyze and understand the content, and on that basis to predict optimal encoding parameters quickly.

[Figure]

This is work we did in 2019; the related paper was published at PCS 2019.

We built an AI-based model: input a video, segment it into scenes, and run each scene segment through a parameter prediction model.

How is a segment characterized? First by the video complexity we keep mentioning — temporal and spatial complexity — from which spatio-temporal features are extracted; further features come from a large pre-trained CNN. A TSN network then fuses these features, and a prediction model on top of the fused features outputs better encoding parameters.
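
The shape of that pipeline — hand-crafted spatio-temporal complexity features fused with learned features, feeding a parameter predictor — can be sketched roughly as follows. Everything here is a simplified placeholder: the complexity proxies are crude, and a dot product stands in for the real TSN fusion and prediction network:

```python
import numpy as np

def spatial_complexity(frame):
    """Mean gradient magnitude as a crude spatial-complexity proxy."""
    gx = np.abs(np.diff(frame.astype(float), axis=1)).mean()
    gy = np.abs(np.diff(frame.astype(float), axis=0)).mean()
    return gx + gy

def temporal_complexity(frames):
    """Mean absolute frame difference as a temporal-complexity proxy."""
    diffs = [np.abs(frames[i + 1].astype(float) - frames[i]).mean()
             for i in range(len(frames) - 1)]
    return float(np.mean(diffs))

def predict_bitrate_kbps(frames, cnn_embedding, weights, bias):
    """Fuse hand-crafted and CNN features, then apply a toy linear head.
    In the real system the fusion is a TSN-style network, not a dot product."""
    feats = np.concatenate([[spatial_complexity(frames[0]),
                             temporal_complexity(frames)],
                            cnn_embedding])
    return float(feats @ weights + bias)

rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, (36, 64)) for _ in range(8)]
emb = np.zeros(4)                       # stand-in for a pre-trained CNN feature
w = np.array([10.0, 10.0, 0, 0, 0, 0])  # toy weights
print(predict_bitrate_kbps(frames, emb, w, 200.0))
```

The point of the sketch is only the structure: per-segment features in, one encoding parameter out, with no trial encodes required.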

Our advantage is a training set of millions of video scenes built from online videos, with additional training per resolution. The model has been running online for more than three years with very stable results.

We have iterated further on this basis, including finer-grained data labeling per resolution. The algorithm now also supports the full engineering pipeline: the same model supports ToB output, performing coding parameter prediction in the form of an FFmpeg filter.

It does have one shortcoming: it currently targets on-demand scenarios, because it was originally designed within a transcoding pipeline. We will improve this later.

[Figure]

Building on this requirement, we later developed CQE, a constant-quality-encoding approach to rate control that is much more lightweight.

In principle, it reuses pre-analysis features already computed inside the encoder, feeds them to a model, and outputs the encoding parameter decision. In on-demand scenarios, any quality problem can be fixed by re-transcoding. We now focus CQE support on live-streaming scenarios, where it achieves almost zero added delay.
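
As a sketch of the idea only — the real CQE model is learned from the encoder's pre-analysis features, whereas the mapping and thresholds below are invented for illustration — constant-quality encoding amounts to choosing a per-segment quality factor and letting the bit rate float:

```python
def choose_crf(spatial_cx, temporal_cx, base_crf=23.0):
    """Map normalized pre-analysis complexity (0..1) to a CRF-like factor.
    Busy content hides artifacts, so it tolerates a higher CRF;
    flat, static content needs a lower one. Thresholds are illustrative."""
    crf = base_crf
    crf += 2.0 if spatial_cx > 0.6 else 0.0   # textured -> save bits
    crf -= 2.0 if spatial_cx < 0.2 else 0.0   # flat -> protect quality
    crf += 1.0 if temporal_cx > 0.6 else 0.0  # fast motion -> save bits
    return max(18.0, min(32.0, crf))          # clamp to a sane range

print(choose_crf(0.7, 0.8))  # → 26.0 (busy scene: save bits)
print(choose_crf(0.1, 0.1))  # → 21.0 (flat scene: protect quality)
```

Because the decision uses only features the encoder already computes during pre-analysis, this style of rate control adds essentially no delay, which is what makes it viable for live streaming.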

[Figure]

Having covered perceiving content characteristics and choosing optimal encoding parameters, we can now do more in rate control.

First is ROI-based coding, which people have pursued for a long time. ROI means region of interest. As mentioned, the human eye has an attention mechanism: there are areas it is more sensitive to when viewing an image or video. Attention is very popular these days — it is the basic principle behind large models, as in the Transformer and "Attention Is All You Need".

For an input video, the areas of greatest interest are, first, the human body, then the face; subtitles are also highly salient. Once we detect the regions of interest, we can pre-process them — edge enhancement and sharpening for the ROIs, and different processing for flat areas.

UGC input quality is uneven — some videos are good, some poor — so the algorithm needs targeted handling: assess the quality level first, then choose the appropriate processing.

Beyond pre-processing there is bit rate allocation. Pre-processing alone is not enough; the allocation stage needs attention too. After pre-processing, the content characteristics of the video have changed, so the bit rate allocation strategy must be adjusted by a dedicated algorithm.

We also have a very fast detection method, reaching a detection speed of about 1 ms, which matters a great deal in production.
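
One common way to turn detected ROIs into an encoder-side allocation is a per-block QP-offset map: negative offsets (more bits, higher quality) inside the regions of interest, a small positive offset elsewhere. A minimal sketch, assuming 16×16 blocks and pixel-coordinate ROI boxes — the offset values are illustrative, not the production tuning:

```python
def roi_qp_map(width, height, roi_boxes, block=16,
               roi_offset=-3, bg_offset=1):
    """Build a per-block QP-offset map from ROI boxes (x, y, w, h).
    Blocks overlapping any ROI get a negative offset (higher quality);
    the rest get a small positive offset (bit savings)."""
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block
    qmap = [[bg_offset] * cols for _ in range(rows)]
    for x, y, w, h in roi_boxes:
        for r in range(y // block, min(rows, (y + h - 1) // block + 1)):
            for c in range(x // block, min(cols, (x + w - 1) // block + 1)):
                qmap[r][c] = roi_offset
    return qmap

# 64x32 frame, one face box at (16, 0, 32, 32): middle blocks get -3.
for row in roi_qp_map(64, 32, [(16, 0, 32, 32)]):
    print(row)  # → [1, -3, -3, 1] (twice)
```

Such a map can then be fed to an encoder's adaptive-quantization interface so the rate-control loop spends the saved background bits on the ROI.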

[Figure]

Now the results. The video on the left is the original 1080p source at 8.9 Mbps. The one on the right is 720p, after ROI processing and encoding, at 514 Kbps.

Comparing the two, Baidu Smart Cloud's ROI-aware optimization visibly improves perceived video quality while compressing the bit rate by a factor of more than 17. This demonstrates the advantage of ROI-aware coding.

[Figure]

Let's look more closely at ROI-based encoding of face-region video.

The human face is the most important and most common region of interest in short-video products. We look at faces every day — we meet all kinds of people on the street, and short videos are full of them. Because the eye has been "trained" on faces for a lifetime, it is extremely sensitive to face quality: even slight noise or blocking on a face is noticed immediately.

So pre-processing cannot be generic; the face region must be treated according to its own characteristics. The eye is also particularly sensitive to skin tone — whether it skews red, green, or yellow — and to differences across skin types.

The strategy therefore needs finer detail. In extremely compressed scenarios, artifacts and blocking must be suppressed as much as possible — a problem we must keep solving in subjective optimization — and we apply many special-purpose algorithms here. In comparisons with competing products, our BD perceptual encoder retains more facial detail rather than blurring it.

[Figure]

This is the comparison with a competing product; more detail is visible on the right.

[Figure]

Having covered faces, let's look at color.

Color enhancement is another important aspect of subjective optimization. A heavyweight approach can use AI; a lightweight one can use traditional methods.

On the left is a compressed video at 606 Kbps without color enhancement; on the right, a compressed video at 485 Kbps with color enhancement. With the color processing applied, the bit rate drops by 20% while the subjective quality visibly improves.

But there are pitfalls: color enhancement is also a double-edged sword and easy to get wrong. This is part of our experience from face-specific optimization.
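
A lightweight, traditional color enhancement of the kind mentioned above can be as simple as pushing each pixel away from its gray value to boost saturation. A sketch — the 1.2 gain is an arbitrary illustrative choice, and overdoing it is exactly the double-edged-sword risk:

```python
import numpy as np

def enhance_saturation(rgb, gain=1.2):
    """Push each pixel away from its luma (gray) value to boost saturation.
    gain > 1 enhances color; too large a gain causes unnatural skin tones."""
    rgb = rgb.astype(np.float32)
    gray = (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1]
            + 0.114 * rgb[..., 2])[..., None]   # Rec. 601 luma per pixel
    out = gray + gain * (rgb - gray)            # interpolate away from gray
    return np.clip(out, 0, 255).astype(np.uint8)

pixel = np.array([[[120, 80, 80]]], dtype=np.uint8)  # a dull red
print(enhance_saturation(pixel)[0, 0])  # → [125  77  77]
```

Because the gray value is preserved, brightness stays roughly constant while chroma increases — which is also why a too-aggressive gain shows up first on skin tones.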

[Figure]

We just discussed face processing in the short-video scenario. There are many harder cases, common across broadcast TV and other OTT scenarios. How do we handle them?

We can crop out the face, run it through a dedicated face super-resolution model, and process the regions separately.

We adopt a model based on a generative adversarial network (GAN). The main work is in engineering deployment, data processing, and loss function design, where we built a face-super-resolution-specific pipeline. The loss function includes an adversarial loss, an ROI loss, and an identity loss, so the super-resolved face still looks like the same person as the original.

We also did a lot of optimization work on the data. Finally, a face-region fusion algorithm blends the face super-resolution result seamlessly with the super-resolution result for the rest of the video.
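
The loss design described above — adversarial plus ROI plus identity — combines into a single weighted objective. A framework-free sketch, where the three component losses are stand-ins for the real network terms and the weights are purely illustrative:

```python
def face_sr_loss(adv_loss, roi_loss, id_loss,
                 w_adv=0.01, w_roi=1.0, w_id=0.1):
    """Weighted total loss for face super-resolution GAN training:
    - adv_loss: discriminator-fooling term (realistic texture)
    - roi_loss: pixel/perceptual error restricted to the face region
    - id_loss:  face-embedding distance, keeps the identity unchanged
    Weights are illustrative; tuning them trades realism against fidelity."""
    return w_adv * adv_loss + w_roi * roi_loss + w_id * id_loss

print(round(face_sr_loss(adv_loss=2.0, roi_loss=0.5, id_loss=1.0), 2))  # → 0.62
```

The identity term is what enforces the "still looks like the same person" requirement: without it, a GAN is free to hallucinate a plausible but different face.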

[Figure]

We can see the effect.

These are real online customer cases. After the face-specific optimization model, picture clarity is greatly restored. The restoration is based on a generative-model approach: it looks natural to the human eye while being clearly sharper — exactly the goal of face-specific optimization.

[Figure]

This is a case from an education scenario. After processing, both the teacher and the subtitles behind her are markedly clearer.

[Figure]

In many scenes there are regions of interest other than faces. An image may contain no face at all — perhaps a dog on the left, which is also an important region of interest. Saliency can thus be understood as a generalization of the region of interest.

The basic ideas here are similar, with some specific optimizations. When doing saliency detection, we extend beyond human bodies to include foreground objects, while still prioritizing protection of the face region. That is, when faces, non-face human regions, and non-human regions coexist, we prioritize processing and bit rate allocation by saliency to achieve the final goal.
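
The priority scheme described above — face first, then body, then other foreground, then background — can be sketched as a bit-budget split proportional to priority weight times region area. All numbers below are invented for illustration:

```python
def allocate_bits(total_kbps, regions):
    """Split a bit budget across regions proportionally to
    priority_weight * area_fraction. regions: {name: (weight, area_frac)}."""
    scores = {name: w * a for name, (w, a) in regions.items()}
    norm = sum(scores.values())
    return {name: round(total_kbps * s / norm) for name, s in scores.items()}

regions = {
    "face":       (8.0, 0.05),   # small area, highest priority
    "body":       (4.0, 0.15),
    "foreground": (2.0, 0.20),
    "background": (1.0, 0.60),   # large area, lowest priority
}
print(allocate_bits(1000, regions))
# → {'face': 200, 'body': 300, 'foreground': 200, 'background': 300}
```

Note what the weighting achieves: the face, at 5% of the frame, receives 20% of the bits, while the background, at 60% of the frame, receives only 30%.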

[Figure]

Our saliency detection is based on the classic U-shaped network U2-Net, with a lot of engineering optimization on top: model pruning, a new face branch, and other efficiency improvements.

We now achieve very fast detection on the CPU — fast enough to almost replace a standalone face detection model, which was one of our optimization goals.

[Figure]

This video has mountains and a river in the background. The trees on the mountain consume many bits, yet the eye is not very sensitive there; it focuses instead on whether the little girl's clothes and face are sharp.

Top left is the compressed video without saliency; top right, with saliency; bottom left, the 20 Mbps original; bottom right, the saliency map. Saliency-based optimization clearly achieves better perceptual quality.

[Figure]

Those were the core technical points of perceptual optimization.

As noted at the beginning, you cannot simply drop perceptual optimization techniques into an encoder and expect them to work — that is unrealistic. What we do is deeply integrate all the tools and methods, including pre-analysis and detection, with the underlying BD265 encoder.

Rate allocation is then combined with CAE technology. Because enhancement affects subsequent allocation, we implemented a bit rate balancing strategy to keep the post-pre-processing bit rate within budget, alongside allocation optimizations such as ROI. Heavyweight AI optimization, including face super-resolution, is available as an option.

Depending on the assessed quality of the source video, AI-based processing — including enhancement, HDR, and more — can be applied selectively. Taken together, this is what we call the comprehensive application of perceptual coding technology; only in this way can it be production-ready online.
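
The bit rate balancing mentioned above — keeping the post-pre-processing bit rate inside a budget — can be sketched as clamping the encoder target after measuring how much enhancement changed the content complexity. The 10% tolerance and the linear complexity-to-rate scaling are invented for illustration:

```python
def balanced_target(base_kbps, complexity_before, complexity_after,
                    tolerance=0.10):
    """Sharpening/enhancement raises content complexity and would inflate
    the bit rate; clamp the scaled target so it stays within the budget."""
    ratio = complexity_after / complexity_before
    target = base_kbps * ratio
    ceiling = base_kbps * (1 + tolerance)
    floor = base_kbps * (1 - tolerance)
    return round(min(ceiling, max(floor, target)))

print(balanced_target(1000, 1.0, 1.35))  # → 1100 (clamped to +10%)
print(balanced_target(1000, 1.0, 0.95))  # → 950  (within tolerance)
```

The clamp is what prevents a well-meaning enhancement pass from silently eating the bandwidth savings the whole pipeline exists to deliver.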

[Figure]


03 Technology Landing in Practice

Within Baidu, Baidu Feed short video is a high-volume service, so let me first share the end-to-end process: image quality work, evaluation, and the basic launch process.

First comes core algorithm R&D: objective compression algorithms, perceptual optimization algorithms, and evaluation metrics for subjective optimization. After passing subjective evaluation and self-testing, the encoder goes through efficiency and stability testing.

On the engineering side, we first transcode a batch of videos and ask PM or operations colleagues to perform subjective GSB evaluation. GSB (Good, Same, Bad) works as follows: G means A is better than B, S means A and B are comparable, and B means A is worse than B. The resulting scores determine whether the GSB test passes and whether the change meets the bar for launch.
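
For concreteness, GSB evaluation reduces to counting votes and forming a net preference score. A minimal sketch — the launch-bar threshold shown is an invented example, not the team's actual criterion:

```python
def gsb_score(ratings):
    """ratings: list of 'G' (A better), 'S' (same), 'B' (A worse).
    Returns the net preference for A as a fraction of all votes."""
    g = ratings.count("G")
    b = ratings.count("B")
    return (g - b) / len(ratings)

votes = ["G", "G", "S", "S", "S", "B", "G", "S", "S", "G"]
score = gsb_score(votes)
print(score)           # → 0.3 (net 30% preference for A)
print(score >= 0.0)    # example launch bar: A must not be net worse than B
```

A score near zero with many S votes means the change is subjectively neutral, which can still pass if the goal is pure bit rate savings.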

We also run online experiments — many of them — on Baidu's A/B experiment platform. For the subjective evaluation just mentioned, Baidu has an in-house platform called "Spiritual Mirror" that supports multiple device types and collects human visual ratings. The A/B experiments then provide data-driven evidence: whether the change actually saves bandwidth, and how user-behavior metrics (UBS) fare — distribution, watch duration, playback experience, startup time, stalling, loading rate, and so on. All of these must pass very strict evaluations before launch; every one of these hurdles has to be cleared.

Finally comes the full rollout. After the experiments, we verify that results match expectations — that overall bandwidth savings meet the target and no other metric has regressed. That is the entire launch process.

[Figure]

With this process in place, we have continued to deliver gains online from last year through this year.

First, core encoder algorithm optimization alone achieves 35% to 40% bit rate savings on objective metrics.

Content-adaptive encoding raises the savings to 40%-50%. With deep integration of perceptual coding techniques, the final perceptual encoder delivers 50% to 60% bit rate savings. User-behavior metrics (UBS) improved significantly as well — total distribution, total watch duration, and the final business metrics all rose markedly.

This answers the question posed at the beginning: how to save further bandwidth without degrading — indeed while improving — user experience. Through perceptual coding optimization, we achieved exactly that goal.

[Figure]


04 Intelligent Codec Technology Trends

Finally, with the little time remaining, a brief look at trends in intelligent codec technology.

The following material mainly draws on recent papers. The first is a review article on deep-learning-based coding.

Why reference it? Because it surveys, module by module within the classic codec framework, where AI tools can assist coding. In next-generation codecs such as AV1 and H.266, more and more AI-assisted modular tools are being used to accelerate encoding or make better decisions. Early CAE technology likewise used AI-assisted rate control. Other directions, such as end-to-end AI pre-processing, move from open loop to closed loop — that is our next piece of work.

[Figure]

The next topic is video quality assessment, which is also very important.

Why call it a "topic"? Because everyone keeps working on it — there is no best, only better. A recent line of work shared by YouTube uses multiple AI-based features — content, distortion (compression loss), and compression-specific features — fused across multiple networks for regression, achieving better agreement with human visual experience; it is claimed to be state of the art (SOTA).

Why mention this? Because our entire perceptual coding effort revolves around quality. What is perceived quality, and how do we evaluate it better? Objective metrics alone cannot solve this. We therefore need to push the technology forward by building quality evaluation models tied to the business. Even CAE and subjective processing strategies can be guided by such a better subjective quality model to achieve better results.

[Figure]

Finally, a few modest thoughts, still related to perceptual coding: perceptual coding will rely on AI methods more and more.

Here are a few summary points:

  1. AI is the foundation, and it must be closely combined with application needs to build comprehensive R&D capability. A single-point capability is not enough; all the capabilities must be strung together and applied comprehensively.

  2. AI is a tool, and it must closely align with customer needs and solve the industry's pain points. Being a tool means the problem determines the tool.

  3. AI-assisted coding will yield even more benefit in next-generation codecs, because their tools are increasingly complex and AI can accelerate more of them.

  4. AI video processing must be polished continuously, scenario by scenario and step by step, solving both quality and efficiency problems. This is a lesson from years of ToB work, especially in landing AI processing scenarios. AI is not perfect today, but by subdividing scenarios and polishing each one, we solve customers' actual problems and meet production requirements for both effect and efficiency.

  5. AI still has great room to grow in video production, processing, and encoding. This sounds general, but as everyone knows, large models such as ChatGPT and Wenxin Yiyan (ERNIE Bot) will also help drive video production and processing forward.

[Figure]

That concludes my sharing.

——END——



Origin my.oschina.net/u/4939618/blog/8797447