"Multiple" Dimensional Evolution: Research and Practice of Intelligent Coding Architecture


Evolve towards smarter and more compatible.

Speaker: Chen Gaoxing, Alibaba Cloud Video Cloud

Hello everyone, I am Chen Gaoxing from Alibaba Cloud Video Cloud. The topic I am sharing with you today is "Multi-dimensional Evolution: Research and Practice of Intelligent Coding Architecture".

This talk is divided into four parts: first, industry trends in video coding and enhancement; second, the intelligent coding architecture that Alibaba Cloud Video Cloud has built against this background; third, the technical details of its "multi-dimensional" evolution; and finally, some thoughts and explorations on intelligent coding.

01

Industry Trends in the Direction of Video Coding and Enhancement

[Figure]

First, let me introduce the industry trends in video coding and enhancement. Video technology has always been evolving toward higher definition, lower latency, richer interaction, lower cost, and greater intelligence.

On the "high definition" front, in the years leading up to 2022 the buzz around AR/VR and immersive 8K had cooled somewhat, but with the launch of Apple Vision Pro in the first half of 2023, interest in VR has risen again. Beyond the conceptual hype, "high definition" is also a very real trend: comparing the 2022 World Cup live streams with those of 2018, both bit rate and resolution improved significantly, and the next World Cup is expected to push further.

Around this trend, major companies have in recent years launched self-developed next-generation encoders, including H.266, AV1, and even proprietary-standard encoders. At the same time, there is strong demand for intelligent coding and enhancement. To relieve the cost pressure that "high definition" brings, heterogeneous software-hardware coding solutions have become a hot spot, including Alibaba Cloud's Yitian 710 ARM platform and the ASIC hardware transcoding solutions being rolled out by several other vendors.

From the "low latency" perspective, with the roll-out of 5G infrastructure, millisecond-level latency technology has gradually matured and been deployed in multiple scenarios. The CCTV cloud archaeology program "Fantasy Journey to Sanxingdui" in June 2022 and the CCTV "New Year Cloud Temple Fair" during the 2023 Spring Festival, both supported by Alibaba Cloud Video Cloud, used this ultra-low-latency cloud rendering technology. During the 2022 World Cup, ultra-low-latency RTS live streaming also kept growing. That said, ultra-low-latency live streaming is today a hard requirement only in certain fields and scenarios; a real breakout still depends on more practical applications.

In terms of "intelligence", we have observed that on the basis of the coding core, the industry continues to focus on using AI capabilities to improve video coding compression rates, including the combination of video coding and processing, the combination of video coding and quality evaluation, the combination of video coding and The combination of AI generation and joint optimization of the device and cloud continuously improves the subjective and objective compression ratio of video encoding. In the fields of "video enhancement" and "content adaptive coding" that everyone has paid attention to in recent years, we can also see the continuous implementation of GAN-based detail restoration generation technology.

With the explosion of ChatGPT and large language models in 2023, AIGC has become the technology hotspot of the moment. The popularity of image-generation tools such as Midjourney and the rapid development of open-source models such as Stable Diffusion show the strength of AIGC in the image field, and text-to-video technology is gradually emerging as well.

[Figure]

Alongside the demand for higher-definition, lower-latency, more efficient, and smarter encoding, we also face contradictions between today's technology and these needs.

With the advent of the AR/VR era, video resolution, frame rate, and color gamut will keep expanding, and the amount of information in a single video will grow exponentially. Low latency means higher demands on encoding speed, while CPU performance no longer follows Moore's law. The tension between clarity, bandwidth, computing cost, and encoding speed will therefore become increasingly serious, mainly in the following four respects:

First, coding standards are upgraded far more slowly than video information expands. A decade of coding-standard development has brought only about a 50% gain in compression rate, far behind the traffic growth driven by video and experience upgrades.

Second, the compression gains of new coding standards grow much more slowly than video frame rate and resolution. From 720p 30 fps to 8K 60 fps, the amount of raw video information grows by a factor of 72 (a quick check of this number follows the fourth point below), which is in stark contradiction with the pace of coding-standard development.

Third, the complexity of new coding standards grows much faster than CPU performance. From H.264 to H.266, each generation of coding standard is more than 10x as complex as the previous one, far outpacing the growth in CPU processing power.

Fourth, a single coding standard can hardly cover all application requirements. As video expands into more scenarios, such as the immersive coding standards required by VR and the VCM (Video Coding for Machines) standard for machine-vision tasks, coding increasingly needs to be optimized for specific scenarios.
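As a rough check of the factor of 72 mentioned in the second point (holding bit depth and chroma format fixed):

$$
\frac{7680 \times 4320 \times 60}{1280 \times 720 \times 30}
= \frac{33{,}177{,}600}{921{,}600} \times \frac{60}{30}
= 36 \times 2
= 72
$$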

[Figure]

Against the background of these seemingly irreconcilable contradictions, if we want to "have it both ways", the following five questions are worth discussing.

First, beyond bit rate and quality, what other indicators should video coding pay attention to? For example, quality stability across different content: keeping quality stable at the level of whole sequences, sequence segments, or even individual GOPs, from both objective and subjective perspectives; and, from the perspective of resource consumption, the stability of encoding complexity.

Second, how do we make good use of existing coding standards? The existing standards, including the VR immersive standards and VCM mentioned above, all have open-source implementations, but the results of the MSU competitions over the years show that they still leave a lot of room for optimization. Developing multi-standard encoders is therefore a direction the industry keeps pursuing.

Third, which dimensions can the video coding standard itself not cover? In every generation of standards, the goal of video coding is to stay as faithful to the "source" as possible, so in most cases a pure encoder optimizes against full-reference objective metrics; but this approach does not work for low-quality sources.

Given that encoded video is ultimately watched by human eyes, subjective evaluation, however time-consuming and laborious, is a direction that genuinely brings value to customers. Introducing human visual evaluation into intelligent video enhancement to improve picture quality is therefore one of our main research directions.

Fourth, in terms of coding standards, the existing ones are still weak at mining visual redundancy and adapting to scenes. A standard really only defines a rough set of tools and the decoder; if multi-level adaptive coding is introduced and the "coupling" between modules is further exploited, the quality ceiling of the encoder can be raised further.

Fifth, how do we break the inertia of improving compression efficiency simply by stacking resources? On the complexity side, we do not need to think purely in hardware terms, for example achieving affordable, inclusive coding only by piling up hardware. Through multi-platform support, such as deep coupling with the underlying architecture or hardening certain modules, we can combine the "flexibility" of software with the "efficiency" of hardware and make the algorithms themselves affordable at scale.

Alibaba Cloud Video Cloud's answer to these five questions is the five "multi" dimensions shown on the right.

02

Introduction to Smart Coding Architecture

[Figure]

As shown in the figure, our intelligent coding architecture is mainly reflected in five dimensions.

In a traditional encoding architecture, the pipeline starts from the video source, passes through an optional video processing module into the rate control and encoding kernel, and finally outputs the bitstream.

The most notable feature of the intelligent coding architecture is its multi-level adaptive coding capability. It analyzes the video source, evaluates how each link of the pipeline (processing, rate control, kernel, and so on) affects the final output for that source, and adapts the internal parameters and tool combinations of each decision module accordingly.

At the same time, to achieve multi-level adaptive coding, we provide coding tools and capabilities in multiple directions across video processing, rate control, and the kernel. Finally, the architecture needs to be adaptive and modular, so that it can be ported from software encoding to different hardware encoding platforms.

[Figure]

The specific atomic capabilities of the five dimensions are shown in the figure above. Besides classification by business scenario and video popularity, multi-level adaptive coding includes semantic-level adaptation based on scene content and source quality; content adaptation includes pre-encoding for different encoding targets and adaptation based on human-vision cues such as JND and ROI; tool adaptation combines the various encoding modules, including rate control and the kernel.

In video processing, "multi-dimensional" intelligent video enhancement includes image quality enhancement, video denoising, detail restoration and generation, compression-artifact removal, and spatial/temporal super-resolution (SR) and frame-rate conversion (FRC).

In encoding and rate control, multi-target encoding covers not only bit rate and quality but also targets such as encoding complexity, quality fluctuation, and certain CV tasks.

On the kernel side, we have self-developed multi-standard encoders covering H.264, H.265, H.266, AVS3, AV1, and VCM.

In terms of multi-platform support, the architecture covers software encoding on x86 and ARM as well as jointly optimized platforms that use hardware encoding.

03

"Multi-dimensional" Evolution of Intelligent Coding Architecture

[Figure]

Next, the "multi-dimensional" evolution of the intelligent coding architecture will be introduced in detail. The first is multi-level adaptive coding, the key of which is content adaptation based on the quality of the film source , because the quality of the film source is a very important decision-making feature for video processing and coding .

Based on a large amount of data from customer scenarios, we classify sources along multiple dimensions. In addition to semantics and source quality, we evaluate the temporal and spatial complexity shown in the figure above, perform R-D slope analysis that takes the effect of encoding into account, and allocate bit rate intelligently across different sequences at the sequence level.

The quality analysis module is very important. Fully understanding whether a video contains noise, compression artifacts, or quality loss introduced in transmission plays a key role in guiding the subsequent processing and enhancement. Especially when low-cost enhancement and coding schemes must be used, it is hard for a single module to adaptively handle every kind of quality degradation, so adding a quality analysis module helps us get closer to the upper limit of encoding quality: good sources receive little or moderate enhancement, while poor sources can be boosted more aggressively.

In addition, source quality also affects encoding decisions. If a segment of the source is complex, heavy "blocking" may appear at low bit rates, so in such cases we tend to allocate more bit rate.
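To make the idea concrete, here is a minimal sketch of sequence-level bit-rate allocation driven by such features; the feature names, weights, and thresholds are illustrative assumptions, not the production logic:

```python
def allocate_sequence_bitrate(segments, total_kbps):
    """Toy content-adaptive allocation: weight each segment by its spatio-temporal
    complexity and protect poor-quality (easily blocking) sources a bit more.
    Feature names and coefficients here are hypothetical."""
    weights = []
    for seg in segments:
        w = 1.0 + 0.5 * seg["spatial_complexity"] + 0.8 * seg["temporal_complexity"]
        if seg["source_quality"] == "poor":   # complex or noisy sources block easily
            w *= 1.2                          # so give them extra bit rate
        weights.append(w)
    total_w = sum(weights)
    return [round(total_kbps * w / total_w) for w in weights]

segments = [
    {"spatial_complexity": 0.2, "temporal_complexity": 0.1, "source_quality": "good"},
    {"spatial_complexity": 0.8, "temporal_complexity": 0.9, "source_quality": "poor"},
]
print(allocate_sequence_bitrate(segments, total_kbps=4000))  # complex segment gets the larger share
```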

[Figure]

The other part of content adaptation is JND and saliency maps based on the human visual system. JND (just-noticeable difference) is a very important direction for the industry. Traditional video coding is built on information theory: it reduces temporal, spatial, and statistical redundancy through the prediction structure to achieve compression, but its exploitation of visual redundancy is still far from sufficient.

The basic principle of JND is shown in the figure above. The R-D curve used in traditional video coding is a continuous convex curve, but what the human eye actually perceives is a discontinuous staircase. If the convex curve is replaced by the staircase, less bit rate can be spent for the same perceived distortion.

Traditional JND schemes fall into "top-down" and "bottom-up" approaches. We mostly choose "bottom-up" methods that model visual-cortex features such as color, luminance, contrast, and motion: luminance masking and contrast masking are considered in the spatial domain, and motion-based masking in the temporal domain.

We introduce a deep learning model to predict the subjective impact of the JND module on the human eye, and then combine it with the rate control module inside the codec to compute how much further the quantization step of each block can be enlarged. At present, our JND module saves more than 30% of bit rate at the same subjective quality in general scenarios, and more than 50% in some vertical scenarios.
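As a rough illustration of how a predicted per-block JND threshold could be turned into a larger quantization step, here is a toy sketch; the linear mapping and offsets are assumptions for illustration, not our actual model:

```python
import numpy as np

def jnd_qp_offsets(jnd_map, base_qp, max_offset=6):
    """Toy mapping from a predicted JND threshold per block (0..1, higher =
    stronger masking) to a positive QP offset: where distortion is well hidden
    by the human visual system, the quantization step can be coarser."""
    offsets = np.clip(np.round(jnd_map * max_offset), 0, max_offset).astype(int)
    return np.clip(base_qp + offsets, 0, 51)   # keep QP in the usual 0..51 range

jnd_map = np.array([[0.1, 0.9],   # e.g. flat sky (little masking) vs. grass (strong masking)
                    [0.4, 0.7]])
print(jnd_qp_offsets(jnd_map, base_qp=30))
```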

[Figure]

Besides JND, another important technology for exploiting visual redundancy is the saliency map. We have invested in two directions here. The first is a low-cost face-based ROI. To make it usable in cost-sensitive live broadcast and ultra-low-latency scenarios, we built this tool around faces: it combines detection and tracking to adjust the JND of the detected face region and the surrounding blocks, improving subjective picture quality while keeping the boundary between ROI and non-ROI regions unobtrusive.

The second is saliency-map technology, used for content such as the sports and UGC scenes shown above. Using an eye tracker to collect temporal attention data, we built a human-attention model from more than 2,000 videos and over 1 billion gaze points.

The highlighted area in the image above marks the main focus of the human eye, which shifts somewhat over time. The model is combined with the encoder to allocate bit rate across regions, improving subjective picture quality under continuous viewing.
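A minimal sketch of how such an attention model could drive region-level QP offsets while softening the ROI boundary (parameter values are illustrative only):

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # blur the ROI border to avoid a visible seam

def saliency_qp_deltas(saliency, base_qp, roi_gain=4, bg_penalty=3):
    """Toy region-level rate allocation: lower QP where the saliency model says
    viewers look, raise it slightly elsewhere."""
    s = gaussian_filter(saliency.astype(float), sigma=1.0)
    s = s / s.max() if s.max() > 0 else s                       # keep the ROI peak at 1 after smoothing
    delta = np.round(bg_penalty - (roi_gain + bg_penalty) * s)  # s=1 -> -roi_gain, s=0 -> +bg_penalty
    return np.clip(base_qp + delta, 0, 51).astype(int)

saliency = np.zeros((6, 6))
saliency[2:4, 2:4] = 1.0   # say, a tracked face in the middle of the frame
print(saliency_qp_deltas(saliency, base_qp=32))
```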

[Figure]

Next, the tool adaptation applied inside the encoder. We believe that traditional rate-distortion theory is objective in nature and, in most low-bit-rate cases, amplifies blocking artifacts: if skip or DC mode is chosen at a low bit rate, blocking is very likely to appear.

Coding standards do include tools such as the deblocking filter, but they are not strong enough to compensate for the blocking artifacts actually produced. Subjectively, adding a little noise and blur to flat areas gives a better viewing experience.

We use two methods for subjective optimization. The first is one-pass: based on the source content and the information available after encoding, we predict whether an area is likely to show blocking later, apply targeted bit-rate protection to those areas, and also distinguish whether the blocking comes from the source or from the encoding itself. The second is 2-pass encoding for VOD scenarios, where the second pass reprocesses regions based on the actual results of the first pass.
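A toy sketch of the 2-pass idea, assuming a hypothetical per-block blockiness statistic collected in the first pass:

```python
def second_pass_qp(first_pass_stats, base_qp, blocky_thresh=0.6, protect=3):
    """Toy 2-pass flow: the first pass records a blockiness score per block;
    the second pass lowers QP only where blocking was actually observed and
    did not already exist in the source. Metric and thresholds are hypothetical."""
    qps = []
    for blk in first_pass_stats:
        qp = base_qp
        if blk["blockiness"] > blocky_thresh and not blk["blocky_in_source"]:
            qp -= protect   # spend extra bits where encoding introduced artifacts
        qps.append(qp)
    return qps

stats = [{"blockiness": 0.8, "blocky_in_source": False},
         {"blockiness": 0.7, "blocky_in_source": True},
         {"blockiness": 0.2, "blocky_in_source": False}]
print(second_pass_qp(stats, base_qp=34))   # -> [31, 34, 34]
```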

[Figure]

The figure above shows the subjective comparison; the right side of the side-by-side image is the result with the tool enabled. Blocking in the face region is clearly reduced, at a cost of roughly 5% more bits for this frame. From the rate-control perspective, the allowed increase is capped so that the overall bit rate of the sequence stays unchanged, which is why the arm region in the picture is still relatively blurred.

[Figure]

For multi-dimensional video enhancement, we mainly introduce our self-developed Narrowband HD brand. As early as 2015, Alibaba Cloud proposed the concept of "Narrowband HD", and in 2016 it officially launched and commercialized the Narrowband HD technology brand. After multiple rounds of screening and discussion, it has settled into two directions: Narrowband HD 1.0 and Narrowband HD 2.0.

[Figure]

Narrowband HD 1.0 is the balanced version. Its goal is content-adaptive processing and encoding at minimal cost, improving picture quality while saving bit rate. It makes full use of information inside the encoder to guide video processing, i.e., low-cost pre-processing that achieves low-cost content adaptation.

In video processing, Narrowband HD 1.0 has two sub-levels: one is indiscriminate sharpening with relatively low computational complexity; the other is adaptive de-artifacting, deblurring, and sharpening based on source quality, where poorer sources give the de-artifacting component greater weight.
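A toy sketch of that quality-adaptive weighting; the 0..1 quality score and the linear mapping are assumptions rather than the shipped tuning:

```python
def enhancement_weights(source_quality):
    """Toy quality-adaptive processing: the worse the source (score in 0..1,
    higher = better), the more weight goes to de-artifacting vs. sharpening."""
    deartifact = 1.0 - source_quality
    sharpen = 0.3 + 0.5 * source_quality   # clean sources mostly get mild sharpening
    return {"deartifact": round(deartifact, 2), "sharpen": round(sharpen, 2)}

print(enhancement_weights(0.9))   # clean source: little de-artifacting
print(enhancement_weights(0.3))   # compressed/noisy source: heavier repair
```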

Narrowband HD 2.0 went through several rounds of technology selection and is finally defined as restoring spatial detail, that is, repairing the picture-quality loss introduced in the video production chain by repeated encoding and compression. More adaptive capabilities are added on the encoding side, including JND, ROI, and SDR+.

[Figure]

The image above shows the enhancement effect of Narrowband HD 2.0. A conventional CNN model smooths away compression artifacts such as blocking, edge aliasing, and jagged edges, which makes the whole picture look cleaner but also over-smooths it. Narrowband HD 2.0 instead uses GAN-based detail enhancement to improve picture quality in areas such as the corners of the eyes and the lips.

[Figure]

The core detail-restoration and generation technology of Narrowband HD 2.0 comprises the following seven aspects:

First, diversity of training samples: we build a rich, high-quality video library as training data. The samples contain a wide variety of textures, which greatly helps the realism of the textures the GAN generates;

Second, refined modeling by continuously optimizing the training data: based on in-depth analysis of the picture-quality problems in each business scenario, the training samples are continuously refined per scenario to achieve a refined modeling effect;

Third, exploring more effective training strategies, including tuning the training loss configuration. For example, which feature layers are used for the perceptual loss affects the granularity of the generated textures, and the relative weights of the different losses also affect texture generation (a toy example of such a loss mix follows this list). During training we use a NoGAN/progressive strategy, which both improves the model's processing effect and stabilizes its generation results.

Fourth, to improve the model's adaptability to source quality, we did a lot of work on the diversity of input-sample quality and on the training process. The end result is significant enhancement for low- and medium-quality sources and moderate enhancement for high-quality sources.

Fifth, experience in academia shows that the clearer the prior information about the processing target, the stronger the GAN's generative ability. To improve the effect across different scenes, we adopt a "1+N" mode: one general-scene model with conservative generation plus N vertical scene models with more aggressive generation, e.g., grass detail in football, edge lines in animation, portraits in variety shows.

Sixth, efficient and controllable model inference. After distillation and lightweighting, running on the Alibaba Cloud Shenlong HRT GPU inference framework, the GAN detail-generation model reaches 1080p 60 fps on a single V100 card.

Seventh, to ensure inter-frame consistency of the GAN output and avoid the visual flicker and extra encoding burden caused by inter-frame discontinuity, Alibaba Cloud Video Cloud, in cooperation with universities, proposed a plug-and-play inter-frame consistency enhancement model.
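As a rough illustration of the loss mix mentioned in point three, here is a minimal PyTorch sketch; the VGG layer cut-off and loss weights are illustrative, and a real setup would load pretrained VGG weights rather than random ones:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Feature extractor for the perceptual loss; which layer you stop at changes the
# granularity of the generated textures. Untrained weights here only keep the sketch
# self-contained (a real setup would use pretrained weights).
vgg_features = vgg19(weights=None).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def generator_loss(fake, real, disc_score_on_fake,
                   w_pix=1.0, w_perc=0.1, w_adv=0.005):
    pixel = F.l1_loss(fake, real)                                   # reconstruction term
    perceptual = F.l1_loss(vgg_features(fake), vgg_features(real))  # feature-space term
    adversarial = F.binary_cross_entropy_with_logits(
        disc_score_on_fake, torch.ones_like(disc_score_on_fake))    # push fakes toward "real"
    return w_pix * pixel + w_perc * perceptual + w_adv * adversarial

fake, real = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(generator_loss(fake, real, torch.randn(1, 1)))
```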

[Figure]

Next, a few specific customer cases. The first is Jiangsu Mobile's 2022 World Cup transcoding, which mainly used the detail restoration and generation capability described above. The left side of the comparison is the result after restoration, generation, and encoding, and the right side is the source. Zooming in, the detail of hair and the sharpness of text edges are clearly improved.

[Figure]

A similar effect was achieved on BesTV's NBA live transcoding. Comparing the Narrowband HD encoded picture with the source, the text areas, jersey details, and floor texture are visibly richer.

[Figure]

Beyond sports, we also supported the "Ideal Road" concert, which is characterized by poor source quality (a dark stage with frequent switching of lights, smoke, and shots) and obvious blocking in the picture. For this scenario, besides Narrowband HD 2.0, we also used a custom portrait template and semantic-segmentation-guided restoration.

[Figure]

The image above compares the transcoded picture with the original. The blocking in the smoke behind the performers is improved, and facial and hair details are also improved. On the right is feedback from viewers, who spoke highly of the clarity of the live broadcast.

[Figure]

In addition to the sports live broadcast and concert scenes above, we have also optimized immersive scenes, for example applying JND and saliency-map techniques based on the VR viewing angle and latitude/longitude for Narrowband HD in VR scenarios.

To further optimize the immersive experience, we also provide spatial audio that conveys the spatial position of the sound source, so that users can feel the sound source moving as they listen, turning real-time interaction from "online" into "presence".

[Figure]

Next, the compatibility of multi-target encoding capabilities. Besides the usual bit rate and quality, we also consider target-complexity and target-quality encoding.

[Figure]

The first is target-complexity encoding. A traditional encoder's speed and machine-resource consumption vary with the video content, so in most cases the encoding load is hard to control. In practice we therefore also need to constrain the encoder's complexity.

Complexity is allocated and fed back from the sequence level to the GOP, frame, and block levels; the feedback includes encoding quality, speed, and some of the analysis described earlier. This makes it possible to spend more computing resources on simple scenes in exchange for subjective and objective quality gains, while in complex scenes, similar in spirit to VBV in rate control, it caps encoding complexity while avoiding a drop in subjective and objective quality.
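A toy sketch of such a complexity controller, budgeting encode time per frame in the spirit of VBV; the effort levels and all numbers are illustrative:

```python
def adjust_effort(effort, frame_ms, target_ms, buffer_ms, max_debt_ms=20):
    """Charge each frame's encode time against a budget and move the encoder's
    effort level (tool/preset index, 0 = fastest, 9 = slowest) to keep the
    "complexity buffer" from over- or under-flowing."""
    buffer_ms += frame_ms - target_ms          # positive buffer = running behind budget
    if buffer_ms > max_debt_ms and effort > 0:
        effort -= 1                            # too slow: switch to cheaper tools
    elif buffer_ms < -max_debt_ms and effort < 9:
        effort += 1                            # ahead of budget: spend effort on quality
    return effort, buffer_ms

effort, buf = 5, 0.0
for ms in [30, 45, 50, 20, 15, 60]:            # measured per-frame encode times
    effort, buf = adjust_effort(effort, ms, target_ms=33.3, buffer_ms=buf)
    print(effort, round(buf, 1))
```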

The second is target-quality encoding, taking VMAF as an example. Traditional ABR/CRF rate control cannot guarantee a constant VMAF score across different sequences under the same settings, nor can it quickly derive the bit rate or CRF that should be set from a target VMAF score.

Although CRF is a rate-control mode with relatively stable quality, for any specific metric the scores of different sequences still fluctuate considerably. Against this background we developed the target-quality encoding tool. The lower-right figure compares before and after enabling it: with the tool on (the orange line), the variance of quality scores across sequences is significantly smaller.
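A minimal sketch of the underlying idea, searching for the CRF that meets a target VMAF with a hypothetical fast encode-and-measure callback (VMAF is assumed to fall monotonically as CRF rises):

```python
def crf_for_target_vmaf(encode_and_score, target_vmaf, lo=18, hi=40, iters=5):
    """Toy search for the CRF that hits a target VMAF on one sequence.
    `encode_and_score(crf) -> vmaf` stands in for a fast pre-encode plus a
    VMAF measurement."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if encode_and_score(mid) >= target_vmaf:
            lo = mid    # quality still above target: CRF can go higher (save bits)
        else:
            hi = mid    # below target: CRF must come down
    return round(lo)

# Stand-in quality model purely for the demo: ~1.8 VMAF points lost per CRF step.
print(crf_for_target_vmaf(lambda crf: 99 - 1.8 * (crf - 18), target_vmaf=90))  # -> 23
```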

[Figure]

Next, the multi-standard self-developed coding kernels in the architecture. First, the three self-developed encoders: S264, S265, and Ali266. For each we have developed 100+ algorithms targeting objective quality, subjective quality, and scene constraints, covering live broadcast, VOD, and RTC scenarios, cloud and terminal deployments, and natural-content as well as SCC (screen content) scenes.

[Figure]

In terms of performance, compared with open-source encoders, S264 and S265 improve compression by 20% to 60% across all scenarios, with special optimization for ultra-high-definition and low-latency scenarios. The optimizations include pre-processing (MCTF, scene detection, SCC detection, adaptive GOP size), fast algorithms (block partitioning, mode decision, motion estimation, SAO/ALF), rate control (CUTree, AQ, lambda optimization, CTU-level rate control), and engineering work (multi-threaded parallelism, code refactoring, memory-access optimization, SIMD optimization).

S264 and S265 took part in the 2022 world encoder competition (cloud track) and won 19 first places in total. Compared with the AWS benchmark encoder designated by the competition, they save 63% of the bit rate, and in transcoding efficiency they also hold a 2x to 6x advantage over competitors.

[Figure]

Ali266 participated in the world codec competition for the first time in 2021 and won 8 first places in the objective track; compared with the reference encoder x265, it saves 51% of the bit rate at the same PSNR. It also ranked first in the subjective track.

As for deployment, Alibaba Cloud Video Cloud has worked closely with DAMO Academy to bring Ali266 into commercial use in media processing, live transcoding, and other products. In January 2022, Ali266 was officially launched on Youku, bringing significant benefits in both cost and user experience.

[Figure]

To improve and promote the Ali266 ecosystem, we have also optimized the Ali266 decoder, including multi-threaded acceleration, assembly optimization, and memory & cache optimization. After optimization, decoding performance is 40% to 105% higher than the open-source decoder, memory usage is reduced by more than 30%, and real-time HD decoding is supported on more than 90% of mobile devices.

[Figure]

Next, we will introduce the support for multiple platforms.

[Figure]

First is the cooperation between Alibaba Cloud Video Cloud and the T-Head (Pingtouge) solution team on optimization for the Yitian ARM server. On the Yitian 710 we carried out deep architecture-specific optimization of S264 and S265 in three main directions: assembly optimization of compute functions, which improved overall performance by 40%; parallel optimization of compute functions, which brought roughly another 40%; and optimization of some control functions combined with algorithm redesign, which brought 20%. The end result is that S264 and S265 perform more than 30% better on Yitian than on C7, and this has been commercialized at scale in video cloud VOD scenarios.

[Figure]

The figure shows a cloud-rendering case: the Yangbo "New Year Cloud Temple Fair". It requires low latency and uses NVIDIA's built-in hardware encoder (NVENC). By taking over the encoder's rate control module, integrating our self-developed JND and an AQ algorithm that allocates bit rate based on spatial characteristics, and adding pre-processing enhancement, we finally brought Narrowband HD to the cloud-rendering scenario. The right side of the side-by-side figure is the result after Narrowband HD optimization, showing clearly richer detail.

04

Thinking and Exploration of Intelligent Coding

[Figure]

Finally, some practices and thoughts on intelligent coding. First, when facing the contradiction between subjective and objective optimization, how do we define "good"? Coding has been moving from "objective" toward "subjective": whether we center on "people" or start from the end-user experience, video should focus on subjective experience.

In R&D, if we consider encoder optimization alone, we usually rely on full-reference objective metrics such as PSNR, SSIM, and VMAF-NEG. But when the optimization goal, as with Narrowband HD, is to improve subjective quality, an improvement in objective scores may not show up in subjective quality.

Furthermore, measuring video quality with a single objective metric is itself problematic. From the standpoint of coding standards, enabling the SAO and deblocking tools built into the standard has little effect on PSNR and SSIM but lowers VMAF scores. In open-source software, the psy tools of the x265 encoder add some high-frequency detail subjectively yet hurt objective metrics. Our own subjective optimization based on encoding feedback also scores poorly on objective metrics, and the JND work described earlier behaves the same way: its objective numbers are clearly unimpressive.

On the pre-processing and enhancement side, it is obvious that the sharp but "wrong" textures produced by SRGAN look subjectively better than blurred detail, yet score worse on PSNR and SSIM.

This is our current dilemma in coding optimization.

[Figure]

On the other hand, there is our practice in AI for coding. We continuously track the development of AI codecs: they can indeed keep improving objective video quality, and generative technologies such as GANs and diffusion models can improve subjective quality in pre-processing and encoding. This is also an important direction of our research.

As for immersive coding standards, we are currently following the point-cloud-based coding standards and the immersive MIV coding standard, and will add them to our multi-standard self-developed encoders as deployment requires.

[Figure]

Finally, regarding Coding for AI, we are currently focusing on VCM. For the same amount of information, its compression rate can be 2 to 3 times that of traditional coding; it can feed structured bitstreams directly to vision tasks and supports multiple multimedia tasks at the same time. In terms of applications, we are carrying out practice and exploration in directions such as "bright kitchen" food-safety monitoring, autonomous driving, and AI exam proctoring.

That’s all for today’s sharing, thank you all!

