What new changes has the development of RTE brought to video codec standards? | Dev for Dev column

This article is part of the "Dev for Dev column" series. The author is Dai Wei, senior video algorithm lead at Agora.

01 History and present of video codec standards

The formulation of the H.261 standard around 1990 began the standardization of video coding. After more than 30 years of effort, video coding efficiency has improved enormously. The figure below roughly lists the standards bodies and release dates of the major video coding standards.

[Figure: organizations and release timeline of the major video coding standards]

We can see that, to date, three major organizations still develop standards in the video coding field: the joint ITU-T/MPEG group behind the H.26x series of codec standards, the AOM alliance headed by Google behind the AVx series, and the Chinese expert group behind the AVS series.

Among these, the most widely used standards worldwide are H.264 and H.265, developed jointly by ITU-T and MPEG, together with Google's VP9 and its successor AV1. H.264 is now nearly 20 years old, and its coding efficiency is far below that of the latest standards such as H.266; nevertheless, thanks to its extremely broad device support, it remains the most widely supported codec standard.

The chart below shows a codec usage survey conducted by Bitmovin over four consecutive years. Besides H.264, H.265 also enjoys relatively broad support, and VP9 and AV1 show clear growth momentum. Generally speaking, the H.26x standards, with their broad corporate participation and support, are more readily implemented in hardware chips. Since H.265, however, unclear patent licensing fees have become a major obstacle to their adoption. AV1, by contrast, set royalty-free licensing as a goal from the early stages of its development and has attracted growing company support and participation along the way. We believe support for the AVx series will continue to grow at an astonishing pace.

[Figure: Bitmovin developer survey on codec usage over four consecutive years]

The primary goal of a video codec standard is to compress the video as much as possible while preserving image quality, saving storage and transmission costs. This goal fit the early uses of video well: first VCD and DVD, and later video-on-demand websites after the rise of the Internet. With the continued development of RTE, however, many new requirements have emerged in the video field, and these new requirements in turn drive codecs further forward. So what new demands has the RTE era produced?

02 New requirements for codec standards in the RTE era

The most obvious difference between video in RTE and in common application scenarios lies in the end-to-end latency requirement. Ordinary video on demand involves no interaction and so pays little attention to latency. RTE scenarios, by contrast, usually involve heavy real-time interaction and therefore demand very low latency, generally at most 300 ms end to end. Under this ultra-low-latency constraint, the encoder's output bit rate is also far more sensitive to changes in network bandwidth.

In addition, because RTE is a real-time application, every generated video can be considered encoded for immediate consumption, another major difference from ordinary video scenarios. In ordinary video scenarios, a video is played far more times than it is compressed, which is why all codec standards try hard to keep decoder complexity low. In RTE, setting aside special requirements such as recording, every video is encoded once and played only a limited number of times.

Under these two premises, video encoding and transmission raise several important issues in the RTE field:

1. How to keep image quality stable when the bit rate fluctuates

In traditional video coding scenarios, a fixed bit rate is generally set first, and the entire video is encoded with that constant bit rate as the target. In this setting the encoder can keep the video quality stable relatively easily. In a real-time transmission system, however, network bandwidth changes from moment to moment: the network may be smooth one second, and the target bit rate may plunge the next because of congestion.

For example, suppose we are encoding a 720p 15 fps video at 1.2 Mbps when, due to network conditions, the available bandwidth suddenly drops to 300 kbps. If the encoder keeps encoding the stream at 720p 15 fps, the subjective quality will degrade severely.
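The fallback decision described above can be sketched as a simple lookup against a bitrate ladder. This is a minimal illustration under assumed numbers; the configurations and bitrate floors below are hypothetical, not any real encoder's policy.

```python
# Hypothetical configuration ladder: (width, height, fps, minimum bitrate
# in kbps), ordered from best to worst quality. Values are illustrative.
LADDER = [
    (1280, 720, 15, 900),
    (960, 540, 15, 500),
    (640, 360, 15, 250),
    (640, 360, 10, 120),
]

def pick_config(bandwidth_kbps):
    """Return the highest-quality config whose bitrate floor fits the estimate."""
    for cfg in LADDER:
        if bandwidth_kbps >= cfg[3]:
            return cfg
    return LADDER[-1]  # below every floor: fall back to the lowest rung

# 1.2 Mbps sustains 720p15; a drop to 300 kbps forces a switch to 360p15
print(pick_config(1200))  # -> (1280, 720, 15, 900)
print(pick_config(300))   # -> (640, 360, 15, 250)
```

In practice the bandwidth estimate would come from the congestion controller, and hysteresis would be added so the encoder does not oscillate between rungs on every estimate update.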

Moreover, ensuring a smooth transition in image quality while the bandwidth fluctuates is itself an important problem. When the bandwidth suddenly drops from 1.2 Mbps to 300 kbps and then slowly climbs back to 1.2 Mbps, keeping the picture quality stable throughout is a major test of the user experience.

2. How to ensure the smoothness of the video in case of packet loss

In traditional Internet on-demand scenarios, there is generally a playback buffer of 2-3 seconds, and network packet loss can push the delay beyond 5 seconds. Apart from the moment the delay grows, these delays are essentially imperceptible in the actual viewing experience. In the RTE scenario, however, real-time interaction imposes very strict end-to-end latency requirements, and excessive delay directly degrades the user experience.

Under the low-latency premise, the decoder has little time to wait for packets lost in transit. Under a high packet loss rate, it is therefore easy for individual frames, or even several consecutive frames, to arrive incomplete at the receiver. Without a good loss-resilience strategy, subsequent video frames are likely to become undecodable because their reference frames are missing.

To let the video resume decoding, the encoder generally has to send a new IDR frame. Because an IDR frame can be decoded independently, without relying on any previous frame being decodable, it is typically 2-3 times the size of an ordinary P frame. Under network conditions with a high packet loss rate, such a large frame is even more likely to arrive incomplete, which further worsens the freeze rate and leads to a poor user experience.
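A common RTC alternative to resending a full IDR frame is long-term reference (LTR) recovery: the encoder keeps a frame the receiver has acknowledged and, after loss, encodes a much smaller P frame predicted from it. The sketch below is a hypothetical decision rule, not the article's or any specific encoder's implementation.

```python
# Hypothetical sketch of LTR-based recovery: prefer a P frame predicted
# from an acknowledged long-term reference over a full IDR refresh.

def choose_recovery(last_acked_ltr, ltr_still_in_buffer):
    """Decide how to resynchronize the decoder after reported loss.

    last_acked_ltr: id of the newest frame the receiver acknowledged, or None.
    ltr_still_in_buffer: whether the encoder still holds that frame.
    """
    if last_acked_ltr is not None and ltr_still_in_buffer:
        # P frame from an acknowledged reference: far smaller than an IDR
        return ("P_from_LTR", last_acked_ltr)
    # No usable acknowledged reference: fall back to a full IDR refresh
    return ("IDR", None)

print(choose_recovery(42, True))     # -> ('P_from_LTR', 42)
print(choose_recovery(None, False))  # -> ('IDR', None)
```

The smaller recovery frame is itself less likely to be lost, which is exactly the failure mode the paragraph above describes for large IDR frames.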

3. How to improve video quality at low bit rates

**The biggest challenge in RTE scenarios is ensuring a good user experience under weak network conditions.** With traditional codec standards, when the given bit rate is very low, the only way to hit it is to use a large quantization step and discard as much image information as possible. This approach, however, introduces severe blocking artifacts that greatly harm the subjective experience.

In RTE scenarios, we can improve subjective picture quality through a series of methods. At the sending end, these include reducing the frame rate and reducing the resolution; reducing the frame rate does not require restarting the encoder, while reducing the resolution generally requires a full encoder restart, discarding all previous encoding state. At the receiving end, image quality can be improved with super-resolution and similar methods.

However, when the bit rate is extremely low and the QP extremely large, these methods cannot recover the lost image quality well and may introduce additional subjective problems. For example, if the input to super-resolution contains severe blocking artifacts, the upscaled picture will largely preserve the block boundaries and may even produce worse artifacts along them.

03 Technical Prospect of New Codec Standard

Having covered these new requirements for video coding and decoding in RTE scenarios, let us look at some of the new technical directions they point to.

1. Dynamic resolution adaptation

In traditional codec standards, once encoding has started the resolution cannot change mid-stream. To change the resolution, there are a few options:

① Encode multiple independent streams at different resolutions and switch between them at the appropriate points to change resolution.

② Use SVC (scalable video coding) to switch resolutions dynamically.

Comparing the two, method ① has a much simpler architecture than method ②, and its overall business logic is comparatively simple and clear.

However, because the streams in method ① are mutually independent, they consume more uplink bit rate when sent; in method ②, because the layers depend on one another, the uplink bit rate is somewhat lower than in method ①. For the receiving end, if it watches only the low-resolution video, the two solutions are indistinguishable; but if it watches the high-resolution video, the overall bit rate of method ② is roughly 10% higher than method ①'s single stream, due to redundancy between layers.
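The trade-off above can be made concrete with a small worked example. The bitrates and the ~10% overhead figure below are assumptions taken from the text for illustration, not measured data.

```python
# Illustrative arithmetic: uplink cost of simulcast (method 1) vs SVC
# (method 2). All numbers are assumed for illustration.

high = 1200          # kbps, high-resolution stream on its own
low = 300            # kbps, independent low-resolution stream
svc_overhead = 0.10  # ~10% inter-layer overhead on the SVC stream

simulcast_uplink = high + low           # independent streams simply add up
svc_uplink = high * (1 + svc_overhead)  # one layered stream with overhead

print(simulcast_uplink)   # 1500 kbps of uplink for simulcast
print(round(svc_uplink))  # 1320 kbps: less uplink than simulcast,
                          # but 10% more than the single high stream
```

This shows both sides of the trade-off: SVC saves uplink bandwidth at the sender, while a high-resolution viewer downloads about 10% more than with a plain single stream.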

Each of the two methods therefore has its own advantages and disadvantages. Dynamic resolution adaptation changes the encoding resolution on the fly during encoding, and at decode time upsamples every frame back to a common resolution, combining the advantages of the two methods above.

AV1 already includes a super-resolution mode along these lines: when encoding a 1280x720 video, if bandwidth problems force a lower encoding resolution, the encoder can encode the current frame at 640x720, and the decoder then upscales the picture back to 1280x720.

[Figure: AV1 horizontal-only super-resolution coding pipeline]

AV1's design took the hardware buffer constraints of the time into account, so it scales only in the horizontal direction, not vertically.

However, scaling only horizontally, and only down to half width, does not reduce the bit rate very much. Future encoders should support scaling in more directions and by larger factors to achieve dynamic bit rate reduction.
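The horizontal-only scheme can be sketched on a single row of pixels: halve the width before encoding, then interpolate back at the decoder. This toy uses plain pair-averaging and linear interpolation, which is an assumption for illustration; AV1's actual superres filter uses different, standardized filter taps.

```python
# Toy sketch of AV1-style horizontal-only scaling on one row of pixels.
# Not AV1's actual filter: pair-averaging down, linear interpolation up.

def downscale_row(row):
    """Halve a row's width by averaging adjacent pixel pairs."""
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]

def upscale_row(row, out_width):
    """Linearly interpolate a row back up to out_width samples."""
    n = len(row)
    out = []
    for x in range(out_width):
        pos = x * (n - 1) / (out_width - 1)  # position in the small row
        i = int(pos)
        frac = pos - i
        nxt = row[min(i + 1, n - 1)]
        out.append(row[i] * (1 - frac) + nxt * frac)
    return out

row = [10, 20, 30, 40, 50, 60, 70, 80]  # one 8-pixel row of a frame
half = downscale_row(row)               # "encoded" at half width
restored = upscale_row(half, len(row))  # decoder upsamples back to full width
print(half)           # [15.0, 35.0, 55.0, 75.0]
print(len(restored))  # 8
```

Note that `restored` is not identical to `row`: the high-frequency detail discarded by downscaling is gone for good, which is exactly why the text argues the bit-rate savings of half-width-only scaling are limited relative to the quality cost.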

2. Configurable in-loop filtering

A traditional codec standard specifies the entire decoding process, including every coefficient used along the way, explicitly in a single document. The main purpose is to guarantee that any bitstream conforming to the standard, whenever and on whatever device it was encoded, can be correctly decoded anytime, anywhere by a decoder built to that standard. It is, in essence, designed for encoding once and decoding countless times. This design has an obvious drawback: the in-loop filters have too few adjustable parameters and cannot achieve optimal results in every scenario.

In the RTE scenario, we do not really need to worry about someone wanting to watch the stream an hour later; what matters is delivering the best possible experience during the interaction itself. We can therefore make these filter coefficients configurable: a model trained specifically for outdoor scenes, another trained for indoor scenes, one optimized for fonts when sharing a screen, and, when a new scene type appears in the future, a model trained for that new scene.
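One way to realize this is a registry of per-scene filter parameter sets, with the chosen set's key signalled to the decoder. Everything below, including names and coefficient values, is hypothetical, sketched purely to illustrate the configuration idea.

```python
# Hypothetical registry of in-loop filter parameter sets keyed by scene
# type. Names and values are made up for illustration only.

FILTER_SETS = {
    "outdoor":      {"strength": 0.8, "sharpen": 0.1},
    "indoor":       {"strength": 0.5, "sharpen": 0.2},
    "screen_share": {"strength": 0.2, "sharpen": 0.6},  # preserve text edges
}

def select_filter(scene, default="indoor"):
    """Encoder picks a filter set; the key would be signalled in-stream."""
    return scene if scene in FILTER_SETS else default

key = select_filter("screen_share")
print(key, FILTER_SETS[key])  # screen_share {'strength': 0.2, 'sharpen': 0.6}
print(select_filter("underwater"))  # unknown scene falls back to 'indoor'
```

Because both ends are controlled in an RTE system, new parameter sets can be rolled out without waiting for a standards revision, which is the flexibility the paragraph above argues for.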

Furthermore, codec-aware post-processing can be brought into the codec loop as a new in-loop filtering module. This further improves the quality of the reference frames and, with it, compression efficiency.

In this way, we can get a better user experience in various scenarios.

3. Better adaptation to network loss and errors

Traditional video codec standards were not formulated with much thought for recovering from network packet loss; they specify no error-recovery mechanisms, leaving that capability entirely to individual developers. To some extent this gives developers great flexibility in their error-recovery schemes, but the price is that all error recovery can only operate after decoding, which in turn limits the encoder's room to achieve higher coding efficiency. In the RTE scenario, we therefore need a set of in-loop error concealment methods to maximize coding efficiency.
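A minimal sketch of the decoder-side concealment the text calls for: when a frame's reference is missing, substitute the most recent correctly decoded frame instead of stalling. The rule below is a hypothetical simplification; real concealment operates on blocks and motion fields, not whole frames.

```python
# Hypothetical frame-level concealment rule: if a frame's reference was
# lost, reuse the last correctly decoded frame as a stand-in.

def decode_frame(frame_id, ref_id, decoded_ok):
    """decoded_ok: set of frame ids received and decoded without loss."""
    if ref_id in decoded_ok:
        decoded_ok.add(frame_id)
        return ("decoded", frame_id)
    # Reference missing: conceal with the most recent good frame
    last_good = max(decoded_ok)
    return ("concealed_from", last_good)

ok = {0, 1, 2}                 # frames 0-2 arrived intact; frame 3 was lost
print(decode_frame(4, 3, ok))  # -> ('concealed_from', 2)
```

Specifying behavior like this inside the standard, rather than after the decoder, is what would let the encoder count on it and trade it off against coding efficiency.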

In addition, traditional video codec syntax has no special handling for loss. At the syntax level, we need to adapt further to network packet loss, fully accounting for the various loss patterns, in order to reach the highest coding efficiency.

04 Summary

Today, as RTE sees ever wider use, the special requirements of its scenarios mean that traditional codec standards can no longer serve them well. We believe it is inevitable that new codec standards will take RTE's requirements into account, leading to a new video codec standard built for RTE.


About Dev for Dev

The full name of the Dev for Dev column is Developer for Developer, which is an interactive innovation practice activity for developers jointly initiated by Agora and the RTC developer community.

Through technology sharing, communication and collision, project co-construction and other forms from the perspective of engineers, it gathers the power of developers, excavates and delivers the most valuable technical content and projects, and fully releases the creativity of technology.


Origin blog.csdn.net/agora_cloud/article/details/128653749