"I do technology in Taotian" Audio and video technology and its application in Taobao content business

Author: Li Kai

1. Introduction

In recent years, content e-commerce seems to have been fully integrated into people's lives: in our spare time, we have become accustomed to taking out our mobile phones and placing orders for our favorite products from the live broadcast rooms of e-commerce platforms or short video links.

Although high-quality goods, affordable prices, exquisite scenery, and interesting content output are all very critical influencing factors, content e-commerce must also be based on two premises: the picture quality must be high-definition and the playback must be smooth. In the past, many businesses and anchors were troubled by poor image quality in live broadcast rooms and did not understand how to achieve high-quality broadcasts.

With the support of a series of cutting-edge audio and video technologies, Taobao's audio and video technology team solved this problem.

The picture below is a case of perfecting the picture quality experience in the live broadcast room. The anchor achieved an ultra-low bit rate 1080p high-definition live broadcast through a series of audio and video technologies self-developed by the team, including video encoding, video enhancement processing, video quality evaluation, etc.:

Left: 720p live broadcast; Right: ultra-low bit rate 1080p live broadcast

There are also cases of perfecting the short video image quality experience. Through the above-mentioned audio and video technology self-developed by the team, the video clarity and texture details have been greatly improved:

Left: Before enhancement; Right: After enhancement

It is not difficult to find that in the above cases, the texture of the image after the transformation has become better, leaping from "standard definition" to "ultra definition", the skin color of the portrait has become more natural, and even the color of the product has become more accurate. This improvement that can be recognized by the naked eye comes from the audio and video technical capabilities provided by the team to create an industry-leading audio and video experience, especially video quality and smoothness.

But from a technical perspective, how to analyze and locate problems in video content and find targeted transformation methods is still a complex process. This starts with the past and present of audio and video technology.

2. Internet video trend

Today, digital TV technology satisfies our audio-visual needs well. As digital TV upgraded the viewing experience, storage media moved from audio and video tapes to VCD, DVD, and now Blu-ray. In parallel, successive video coding and compression standards, namely MPEG-2 (H.262), H.264/AVC, H.265/HEVC, and H.266/VVC, have been used to improve image quality while saving storage and bandwidth costs.

The video technology of radio and television is highly professional, and its production costs and cycles are correspondingly high. It relies on a complete and mature industrial chain that covers set design, shooting, processing, editing, encoding, transmission, and distribution. Years of high-quality viewing have left a strong impression on consumers: to a large extent, radio and television stands for professionalism and a high-quality experience, especially in terms of picture quality.

In the 2010s, video clearly began moving to the Internet, and the production and sharing of video migrated on a large scale from traditional radio and television to the Internet and OTT. Long video, medium video, live broadcast, and short video businesses are all booming online. For an Internet company, from a technical perspective, the better it handles this shift to video, the better the experience it offers and the more merchants and C-end users it can attract. Turning content into e-commerce, or e-commerce into content, has also become a key battleground for many leading Internet companies.

The cost of producing and sharing Internet video is very low; for C-end users it is almost zero. To support this shift well, a huge engineering effort is needed to reproduce the capabilities of the broadcast chain on the Internet platform and thereby deliver a broadcast-grade video playback experience.

3. Audio and video technology in Taobao

In Taobao's actual content business, the entire video business life cycle, including the supply and distribution of video content, requires comprehensive end-to-end full-link capabilities in video production, video processing, video transmission, video presentation, and audio. Only in this way can the high image quality and smoothness of the overall video be ensured. Consumers' demands for video quality are getting higher and higher. For example, they must take into account higher definition and smooth playback, and they must also control the overall cost from production to distribution.

This means that the evolution of the platform's video processing technology must keep up with ever-changing market demands and with the challenges brought by explosive growth in business volume. To this end, the team provides an overall solution for Taobao Live, Tab2 (Shopping), the homepage information flow, and other content businesses, and maintains continuous, high-speed iteration.

Through targeted in-house development of these technologies, including video encoders, video enhancement solutions, portrait beautification (skin beautification, face shaping, and makeup), no-reference video quality evaluation models, and media processing systems, and through access to the low-latency transmission network GRTN, the team provides the underlying core technologies for content services such as live broadcasts and the homepage, creating an industry-leading audio and video experience, especially in video quality and smoothness.

Through continuous technology polishing and algorithm innovation, we strive to empower Taobao's content business with high quality and low cost, and support Taobao's content strategy. The accumulated platform technology and product capabilities can also be reused by other businesses of the group. These technical capabilities accumulated over the years have also won prizes in many authoritative international audio and video core technology competitions.

4. Technical picture

The technical domain included in audio and video technology involves the entire life cycle of all audio and video streams on the platform, from production to distribution to final consumption. As shown in the following technical map, it includes several core technology modules:

Audio and video technology big picture

Note: This technology map lists many related technical solutions, but not all technologies have been adopted in business.

4.1 Video production

To improve the quality of video content, content producers inevitably "edit" the content itself. Editing methods include, but are not limited to, making the people in the video look better with portrait beautification, improving the clarity of the content with pre-processing, improving the atmosphere of the video with stylized filters, and generating videos in different styles from predefined editing templates. Improving editing effects, enriching editing capabilities, and lowering the barrier to use are the directions in which major video production tools continue to optimize.

Among these, portrait beautification exposes five major sub-functions to users: skin beautification, face shaping, whitening, makeup, and body shaping. The underlying operators that support these effects include, but are not limited to, 2D to 3D vision algorithms for faces and human bodies, and all of them must process video in real time on mobile devices.

While improving and optimizing the effect of the underlying operator, some scenarios must take into account factors such as real-time performance and heat generation that are closely related to performance. The arrangement and joint optimization of operators, and the adaptation of different terminals (iOS, Android, PC) and different computing power platforms (NPU, GPU, CPU) are also the focus of portrait beautification.

With the explosion of generative AI technology, AIGC has become a new content production method after PGC and UGC. However, how to balance the richness and stability of generated content is a major challenge for the implementation of AIGC technology-assisted video production.

4.2 Video processing

After the video content produced by content producers, including short videos and live broadcasts, is uploaded to the server, it goes through a series of processing steps that improve the image quality and reduce the bit rate before it is finally distributed to end users. We call this system TMPS (Taobao Media Processing System).

TMPS mainly consists of three steps:

First, the source content is decoded. This step needs to be compatible with various media formats and audio and video coding standards, including image formats.

Second, the decoded content is enhanced with the STaoVideo video enhancement solution developed by the team to improve the picture quality experience. The methods include noise removal, color, detail, and brightness enhancement, super-resolution, frame interpolation, and HDR, and they cover both traditional and deep learning approaches. STaoVideo automatically selects different enhancement operators based on the characteristics and popularity of the source video to maximize the image quality gain under a limited computing power budget.

Finally, the content is re-encoded with more efficient encoders, including the S265 and S266 encoders developed by Taobao's content technology team, to improve compression efficiency and reduce traffic costs with no apparent loss of image quality. TMPS supports simultaneous transcoding of multiple streams at different resolutions. High-resolution, high-bitrate streams ensure the image quality experience of mainstream users, while low-resolution, low-bitrate streams serve low-end devices and weak networks.
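The enhancement and multi-stream steps above can be pictured with a minimal Python sketch. It is not TMPS code: the SourceInfo fields, the operator names, the popularity threshold, and the bitrate ladder are all hypothetical values chosen only to illustrate "select operators by content characteristics and popularity, then emit several renditions".

```python
from dataclasses import dataclass


@dataclass
class SourceInfo:
    height: int          # decoded source height in pixels
    noise_level: float   # 0.0 (clean) to 1.0 (very noisy); hypothetical metric
    daily_views: int     # hypothetical popularity signal


def select_operators(src: SourceInfo, hot_views: int = 100_000) -> list[str]:
    """Choose enhancement operators from content traits and popularity."""
    ops = []
    if src.noise_level > 0.3:
        ops.append("denoise")
    ops.append("color_enhance")
    if src.height < 720:
        ops.append("super_resolution_2x")
    # Spend the expensive deep-learning operator only on popular videos.
    if src.daily_views >= hot_views:
        ops.append("dl_detail_enhance")
    return ops


def plan_renditions(src: SourceInfo, ops: list[str]) -> list[tuple[int, int]]:
    """Return (height, kbps) output streams no taller than the enhanced video."""
    enhanced_height = src.height * 2 if "super_resolution_2x" in ops else src.height
    ladder = [(1080, 2500), (720, 1500), (480, 800)]  # illustrative ladder
    kept = [r for r in ladder if r[0] <= enhanced_height]
    return kept or [ladder[-1]]


src = SourceInfo(height=480, noise_level=0.5, daily_views=200_000)
ops = select_operators(src)
print(ops, plan_renditions(src, ops))
```

The design point is that the costly operators are gated by popularity, so the computing power budget goes where the most viewers benefit.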

4.3 Video transmission

From the moment live content is produced until users see it in the live broadcast room, it passes through a complex CDN transmission network. The traditional RTMP and HLS protocols have high latency. With the rise of 5G, low-latency content forms such as interactive live streaming and live quizzes have appeared, and the traditional protocols can no longer fully meet business demands. Therefore, Taobao and Alibaba Cloud jointly built GRTN, a low-latency transmission network that merges communication and live broadcasting, to achieve full-link RTC transmission. On top of the CDN infrastructure, Taobao built its RTC streaming media transmission protocol from scratch and took the lead in deploying GRTN on the anchor push side and the mobile Taobao playback side. GRTN was successfully launched on Taobao Live and has achieved full coverage.

The video transmission of Taobao Live has achieved an end-to-end delay of less than 1 second. It can also quickly meet the audio and video transmission needs of Taobao's emerging business forms, such as the multi-person "Lianmai PK" activity in the "China New Anchor 2023" competition.

GRTN architecture diagram

To optimize the live broadcast and short video experience, the team developed a bandwidth prediction algorithm based on weak-network classification and combined it with real-time image quality evaluation to decide the upstream push resolution. The downstream ABR algorithm was optimized to achieve adaptive streaming for low-latency live broadcast and on-demand services, and QoS was optimized through congestion control, preloading, and other algorithms, reducing the first-frame time by 200 ms and reducing stuttering by more than 50%.
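As a rough illustration of "classify the network first, then predict bandwidth and decide the push resolution", the sketch below uses a more conservative estimator for worse network classes. The class names, thresholds, ladder, and safety factor are assumptions for illustration, and the real-time image quality signal mentioned above is left out.

```python
from statistics import harmonic_mean, mean


def classify_network(samples_kbps, loss_rate):
    """Label the uplink 'good', 'fluctuating', or 'weak' (illustrative thresholds)."""
    avg = mean(samples_kbps)
    spread = (max(samples_kbps) - min(samples_kbps)) / avg
    if loss_rate > 0.05 or avg < 800:
        return "weak"
    if spread > 0.5:
        return "fluctuating"
    return "good"


def predict_bandwidth(samples_kbps, network_class):
    """Use a more pessimistic estimator as the network class gets worse."""
    if network_class == "good":
        return mean(samples_kbps)
    if network_class == "fluctuating":
        return harmonic_mean(samples_kbps)  # damps the effect of short bursts
    return min(samples_kbps)


def choose_push_height(predicted_kbps, safety=0.8):
    """Pick the tallest rendition whose bitrate fits within the safety budget."""
    ladder = [(1080, 2500), (720, 1500), (540, 1000), (360, 600)]  # hypothetical
    budget = predicted_kbps * safety
    for height, kbps in ladder:
        if kbps <= budget:
            return height
    return ladder[-1][0]


samples = [1000, 3000, 1000, 3000]
net = classify_network(samples, loss_rate=0.01)
print(net, choose_push_height(predict_bandwidth(samples, net)))
```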

Algorithms such as error concealment, packet loss retransmission, smooth transmission, and time domain layering are also being explored to improve user experience.

4.4 Video presentation

As Taobao's contentization deepens and users pursue "high-definition, good-looking, and fun" content, Taobao is also exploring new media forms. With the emergence of new content forms such as live broadcasts, online quizzes, voice broadcasts, and game live broadcasts, the platform needs to support new capabilities and new users with a good experience.

First, the player architecture was upgraded to optimize playback logic, increase hardware decoding coverage, and establish adaptive stream selection and switching based on device performance and network conditions, which reduces stuttering and heating on mid- to low-end mobile phones. Super-resolution and post-processing enhancement on the playback side effectively improve perceived clarity under weak networks.

Second, support for VR/AR and HDR video playback further improves how video is presented. Client-side interactive capabilities are also being built out, with more props and interactive gameplay such as face-covering effects and co-shooting, which make the experience more fun and users more willing to participate.

4.5 Audio end-to-end

Sound is an important medium for conveying information, but Taobao live broadcast environments and equipment are diverse. Various types of noise are often mixed into the live broadcast room and affect what users hear. In mic-linking (Lianmai) scenarios, problems such as echo and howling are likely to occur. Anchors often also want background music, voice changing, sound effects, and other features, and content anchors hope to achieve concert-like sound quality. How to use technical means to give users an immersive audio-visual experience with good sound has become an important task.

The content technology team optimizes live broadcast sound quality across the entire link of audio capture, pre-processing, encoding, decoding, transmission, and playback. It developed its own 3A algorithm SDK (echo cancellation, adaptive noise reduction, automatic gain control) to better meet pre-processing requirements. The audio subsystem supports mic-linking and provides weak-network resistance, audio and video synchronization, and other capabilities. To meet the need for no-reference audio quality evaluation, the team used machine learning methods to build the MD-AQA sound quality evaluation model, which is used for platform-wide sound quality monitoring and closes the loop between sound quality processing and evaluation.
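Of the three 3A components, automatic gain control is the simplest to illustrate. The following is a textbook-style sketch, not the team's SDK: it measures the RMS level of each frame and moves a smoothed gain toward the value that would bring the frame to a target level. The target level, gain cap, and smoothing factor are assumed values.

```python
import math


def rms(frame):
    """Root-mean-square level of a frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def agc(frames, target_rms=0.1, max_gain=10.0, smoothing=0.9):
    """Apply a slowly adapting gain so frame levels approach target_rms."""
    gain = 1.0
    out = []
    for frame in frames:
        level = rms(frame)
        if level > 1e-6:  # leave the gain untouched on silence
            desired = min(target_rms / level, max_gain)
            gain = smoothing * gain + (1.0 - smoothing) * desired
        # Clip to the valid sample range after applying the gain.
        out.append([max(-1.0, min(1.0, s * gain)) for s in frame])
    return out


# A loud 160-sample sine frame repeated 100 times settles near the target level.
loud = [0.9 * math.sin(2 * math.pi * i / 16) for i in range(160)]
levelled = agc([loud] * 100)
print(round(rms(loud), 3), round(rms(levelled[-1]), 3))
```

Smoothing the gain rather than jumping to the desired value avoids audible pumping when the speaker's level changes abruptly.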

Live broadcast room noise example

5. The practice of audio and video technology in Taotian

As the industry's investment in audio and video expands and the overall technical level improves, and as Taobao places increasing emphasis on user experience, including image quality, we have carried out in-depth in-house development and continuous iteration on several core technical modules. In the important scenarios of Taobao Live and short videos (including Shopping) in particular, this work has delivered good results in improving experience and reducing costs.

As the schematic diagram below shows, both live broadcast and short video processing depend on video enhancement and video encoding. The difference is that the two scenarios have different real-time requirements. At the same time, to deliver a high-quality presentation experience, the distortion introduced at every stage of the end-to-end link, and even the low image quality of the source itself, needs to be considered and quantified. No-reference quality evaluation is therefore also crucial for measuring the picture quality experience. Video enhancement, video coding, and no-reference video quality evaluation are the three important technical directions for ensuring video quality.

5.1 Video enhancement

In both live broadcasts and short videos, we pay close attention to image quality and strive to give users the best image quality experience. We built the STaoVideo video enhancement solution to improve image quality in a targeted way through different operators.

For live streaming, the focus is on compensating for the shortcomings of camera imaging. To address the high noise of mobile cameras, we launched a noise removal operator; to address insufficient color on low-end devices, we provide color enhancement algorithms. Short videos are mainly enhanced by cloud operators during transcoding, including the differentiated Zhimei HD and Puhui HD operators, which improve the image quality of hot videos and of the broad mass of videos respectively while reducing the computing power cost of transcoding. For low-resolution videos, super-resolution algorithms are further used to raise the resolution.

The team focuses not only on the business and on subjective visual experience, but also follows the progress of the industry and actively explores methods that improve objective indicators. In the course of daily business research and development, team members explored a new method: a two-stage video restoration approach with progressive training. In the CVPR NTIRE 2022 competition, we won two championships and one second place across the three tracks of the video super-resolution and quality enhancement challenge. CVPR NTIRE (New Trends in Image Restoration and Enhancement workshop and challenges on image and video processing) is a top international competition in image and video enhancement. After winning the MSU World Video Encoder Competition, the team thus won another prestigious competition in a core audio and video direction.

The competition attracted more than a dozen teams from China and abroad, including well-known technology companies such as Tencent, ByteDance, and Huawei, as well as research institutions such as the Chinese Academy of Sciences, Peking University, The Chinese University of Hong Kong, and ETH. Many of the contestants have taken part for years. After fierce competition, the team finished with two championships and one second place.

CVPR NTIRE 2022 Video Super-Resolution and Quality Enhancement Competition Rankings

Left: Source video, right: Video enhancement focusing on the ability to generate portrait areas (video comes from purchasing video at the end of May)

Facing the future, we will provide more segmented and differentiated video enhancement methods for videos in different businesses and scenarios:

For medium and low quality videos, blur is a common problem. We therefore need to provide a strong deblurring model that is linked with the MD-VQA image quality score and adaptively selects the strength and region of deblurring, achieving general-purpose deblurring across multiple scenarios;
For videos that mainly feature portraits, an attention mechanism for the portrait region is added to guide the model to strengthen its generation ability there, while the face region is constrained to keep a natural look, turning low-quality portrait videos into high-quality ones (see the figure above);
For videos with good image quality but insufficient color and brightness, we provide customized color and brightness enhancement to further improve the look and feel of the image;
For live broadcast scenarios, we will provide richer image quality enhancement capabilities according to the device model, including the ability to improve color, brightness, and image transparency.

5.2 Video encoding and transmission

With the rise of Internet content, especially the popularity of video and live broadcast, video encoding has become one of the core basic technologies of the business. Uncompressed high-definition video is huge in size and is not conducive to network transmission and storage.
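To make "huge in size" concrete, the short calculation below computes the raw bit rate of 1080p video at 30 fps with 8-bit 4:2:0 sampling. The 2 Mbit/s delivery target used for the ratio is a hypothetical figure for illustration, not a number from Taobao.

```python
def raw_bitrate_bps(width, height, fps, bits_per_sample=8, chroma="420"):
    """Bit rate of uncompressed video for a given chroma subsampling."""
    # Samples per pixel: one luma sample plus the subsampled chroma planes.
    samples_per_pixel = {"420": 1.5, "422": 2.0, "444": 3.0}[chroma]
    return int(width * height * samples_per_pixel * bits_per_sample * fps)


raw = raw_bitrate_bps(1920, 1080, 30)
target = 2_000_000  # hypothetical delivery bit rate in bit/s
print(f"raw: {raw / 1e6:.1f} Mbit/s, needed compression: about {raw / target:.0f}:1")
```

Roughly 746.5 Mbit/s of raw data has to be squeezed by a factor of several hundred before it can be streamed over a consumer connection, which is why encoder efficiency translates so directly into bandwidth cost.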

Since the early 1990s, two major organizations, ITU-T VCEG of the International Telecommunication Union and ISO/IEC MPEG of the International Organization for Standardization, have separately or jointly released several generations of video codec standards. The most commonly used in the industry today are H.264/AVC and H.265/HEVC. The former is widely used in digital TV, Internet, video conferencing, and other services, while the latter has made an important contribution to the popularization of high-definition and ultra-high-definition video and HDR video.

H.266 (VVC) is the latest international video coding standard. Its first version was finalized in July 2020. Compared with the previous-generation standard, it can reduce video bandwidth by 40% at the same subjective quality, and it has huge application prospects.

【Business】S265 Application

The S265 encoder developed by the Taobao content technology team is an efficient implementation of the H.265/HEVC standard. After years of productization and polishing, it is fully deployed in Taobao content businesses including Taobao Live, the homepage information flow, and Taobao Shopping, and delivers high-definition encoding with lower bandwidth and resource consumption. Compared with the previous-generation standard, it reduces the bit rate by more than 40% at the same image quality. With S265 compression, ordinary mobile phones can watch 1080p high-definition video smoothly even on 3G networks, and the latest mobile phones can support 4K 30 fps ultra-high-definition live broadcast.

【Competition】S265, S266

Based on the core technology of S265, the team also developed the H.266/VVC standard encoder S266. The two encoders participated in two consecutive competitions in MSU 2020 and 2021 respectively, and achieved first place on multiple tracks.

On the MSU 2020 Full HD objective performance track, S265 won first place on two PSNR indicators. On the MSU 2021 Full HD objective performance track, S266 took first place on 8 of the 14 evaluation indicators. On the subjective performance track, S266 won first place by a large margin among the 16 participating encoders: at the same subjective quality as x265, the benchmark encoder officially designated by MSU, it saved 71% of the bandwidth. S266 also became the only encoder to place in the top three on all indicators in both competitions.

The MSU (Moscow State University) World Video Encoder Competition is the most authoritative top-level global competition in the field of video encoding. It has been held by MSU's Graphics & Media Lab for 18 consecutive years. Its evaluation reports are widely recognized by the industry, and it attracts well-known technology companies from China and abroad, including Google, Netflix, Intel, Nvidia, Tencent, ByteDance, and Huawei, serving as a bellwether of industry development.

MSU 2020 Main FullHD 1 fps YUV-PSNR Ranking

MSU 2021 Main FullHD 1 fps YUV-PSNR Ranking

The S265 encoder innovates in rate control, fast algorithms, coding tool implementation, and engineering acceleration. It surpasses the x265 encoder, leading by 35% on the YUV-PSNR indicator at the 1 fps speed preset.

Building on the S265 encoder, S266 is further optimized to comply with the VVC standard. The main work includes adapting to the new tool set, for example extending many optimization methods in S265 to VVC's larger coding tree units (CTUs), more complex and flexible block partitioning structures, different motion vector estimation methods, and other new coding tools. Technologies such as pre-analysis, adaptive quantization, and temporal motion filtering are introduced to improve coding efficiency. More fast algorithms are used in the coding process to reduce overall computational complexity, compute-intensive modules are accelerated through assembly optimization, and frame-level, CTU-row-level, and block-level parallelism reduce the overall encoding time. As a result, the S266 encoder is significantly faster than VTM11, the H.266/VVC reference software, and can run at the 1 fps speed preset, which makes large-scale offline VVC encoding feasible.

Compared with the veryslow preset of x265, the open source H.265 encoder, S266 improves coding efficiency by 50% (a 50% lower bit rate at the same image quality), and it won first place on multiple indicators such as PSNR in this MSU competition.
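For readers unfamiliar with the metric behind these rankings, PSNR is defined as 10·log10(MAX² / MSE) between the reference and the decoded picture. The sketch below computes it per plane and then combines the planes with 6:1:1 weights for Y, U, and V; that weighting is one common convention, and whether it matches the competition's exact YUV-PSNR definition is an assumption here.

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


def yuv_psnr(psnr_y, psnr_u, psnr_v, weights=(6, 1, 1)):
    """Weighted average of per-plane PSNR values (weights are an assumption)."""
    wy, wu, wv = weights
    return (wy * psnr_y + wu * psnr_u + wv * psnr_v) / (wy + wu + wv)


ref = [100, 100, 100, 100]
dec = [101, 99, 100, 100]
print(f"{psnr(ref, dec):.2f} dB")
```

A codec "leading by 35%" on such an indicator is normally read as needing 35% less bit rate to reach the same PSNR, rather than a 35% higher PSNR value.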
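The benefit of CTU-row parallelism can be estimated with a small scheduling model. The sketch below simulates generic wavefront processing, in which a CTU may start once its left neighbour and a CTU some distance ahead in the row above are done. It assumes one time unit per CTU and unlimited threads, and it is not a description of how S266 actually schedules work; the lag of 2 corresponds to HEVC-style wavefronts and is a parameter.

```python
def wavefront_finish_times(rows, cols, lag=2):
    """finish[r][c] is the time step at which CTU (r, c) completes."""
    finish = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = finish[r][c - 1] if c > 0 else 0
            # Row r trails row r-1 by `lag` CTUs (clamped at the row end).
            above = finish[r - 1][min(c + lag - 1, cols - 1)] if r > 0 else 0
            finish[r][c] = max(left, above) + 1
    return finish


def wavefront_speedup(width, height, ctu=128, lag=2):
    """Ideal speedup of wavefront scheduling over purely serial CTU coding."""
    cols = -(-width // ctu)   # ceiling division
    rows = -(-height // ctu)
    total = wavefront_finish_times(rows, cols, lag)[-1][-1]
    return rows * cols / total


print(f"{wavefront_speedup(1920, 1080):.2f}x for 1080p with 128x128 CTUs")
```

With 128x128 CTUs a 1080p frame has only 9 rows of 15 CTUs, so the ideal gain from rows alone is limited (about 4.35x in this model). That limit is one reason to combine it with frame-level and block-level parallelism, as the text describes.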

【Business】S266 landing

Passing MSU's authoritative evaluation demonstrates the strong compression efficiency of S266, but there is still a long way to go before the VVC standard reaches commercial use. As the successor to HEVC, VVC introduces many new coding tools. These tools improve compression efficiency but also demand more computing power. At the same time, because current mobile phone chips do not support H.266 hardware decoding, deployment is greatly constrained by software decoding problems such as heating and stuttering. The Taobao content technology team has therefore been committed to optimizing the computational cost of the S266 codec.

In view of the characteristics of mobile phone chips, the team has optimized multiple dimensions, including multi-core parallelism, ARM assembly, memory access efficiency, memory footprint, etc. Low-end mobile phones can decode 720p videos using only 2 cores, and mid-to-high-end mobile phones can support 1080p real-time decoding.

To meet mobile Taobao's requirements for stability, memory usage, and package size, the team has rigorously tested tens of thousands of abnormal bitstreams to ensure stability. Fixed memory management is adopted to avoid repeated allocation and release, and the reference frame management strategy is optimized together with the encoder to reduce the number of cached reference frames and lower memory usage. The package has also been trimmed aggressively, so that the increase in mobile Taobao's package size stays within 800 KB.
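"Fixed memory management" of this kind is often implemented as a buffer pool: all frame buffers are allocated once up front and then recycled. The class below is a generic sketch of that pattern with hypothetical names; a real decoder would do this in native code, but the lifecycle is the same.

```python
class FramePool:
    """Pre-allocates a fixed number of frame buffers and recycles them."""

    def __init__(self, frame_bytes, capacity):
        # All allocation happens here, once, instead of per decoded frame.
        self._free = [bytearray(frame_bytes) for _ in range(capacity)]
        self.capacity = capacity

    @property
    def available(self):
        return len(self._free)

    def acquire(self):
        """Hand out a recycled buffer; fail loudly instead of growing."""
        if not self._free:
            raise RuntimeError("frame pool exhausted")
        return self._free.pop()

    def release(self, buf):
        """Return a buffer to the pool once no frame references it."""
        self._free.append(buf)


# 720p 8-bit 4:2:0 frame: 1280 * 720 * 3 / 2 bytes; 4 buffers is illustrative.
pool = FramePool(frame_bytes=1280 * 720 * 3 // 2, capacity=4)
frame = pool.acquire()
pool.release(frame)
print(pool.available)
```

Because the pool never grows, peak memory is known in advance, and reducing the number of cached reference frames directly lowers the capacity the pool needs.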

As the optimization of codecs gradually matures and the computing power of equipment gradually increases, the team will start the implementation of VVC on Taobao in 2023.

First of all, Taobao media processing system TMPS embeds the S266 encoding plug-in and supports the encapsulation and decapsulation of ISO/IEC MP4 containers. It supports transcoding templates combined with Zhimei HD to achieve a powerful combination of encoding and enhancement.

Second, the Taobao player is adapted to the S266 decoding plug-in, with optimizations for scenarios such as seeking, scrolling up and down, and preloading. It is also compatible with the playback degradation logic, supports stream selection across multiple formats and resolutions, and decouples the memory of playback and decoding. On the content bus and business side, multi-stream transcoding and broadcast control delivery logic are also implemented.
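One way to picture stream selection with degradation is as a filter followed by a ranking: drop every stream the device cannot decode or the network cannot carry, then prefer the highest resolution and, at equal resolution, the more efficient codec. The stream records, codec labels, and ranking rule below are assumptions for illustration, not the player's actual logic.

```python
CODEC_RANK = {"h266": 0, "h265": 1, "h264": 2}  # lower means more efficient


def select_stream(streams, supported_codecs, max_height, bandwidth_kbps):
    """Return the best playable stream, or None if nothing fits."""
    playable = [
        s for s in streams
        if s["codec"] in supported_codecs
        and s["height"] <= max_height
        and s["kbps"] <= bandwidth_kbps
    ]
    if not playable:
        return None
    # Highest resolution first; ties broken by the more efficient codec.
    return max(playable, key=lambda s: (s["height"], -CODEC_RANK[s["codec"]]))


streams = [
    {"codec": "h266", "height": 1080, "kbps": 1200},
    {"codec": "h265", "height": 1080, "kbps": 1800},
    {"codec": "h264", "height": 1080, "kbps": 3000},
    {"codec": "h264", "height": 720, "kbps": 1500},
]
# A device without H.266 support degrades to the H.265 stream.
print(select_stream(streams, {"h265", "h264"}, 1080, 2000))
```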

During the upcoming Double Eleven, mobile Taobao users will be able to watch VVC high-definition videos based on S266 technology and enjoy a smooth playback experience.

To meet Taobao Live's demand for real-time encoding, the team also developed a fast preset for S266. By selecting cost-effective tools, optimizing algorithms such as block partitioning, mode decision, and filtering, introducing the AVX512 instruction set, and further improving frame-level and row-level parallelism, S266 achieves 1080p real-time encoding on a personal PC. The full live broadcast link will also support VVC over RTMP/RTP for push streaming, transmission, and playback. Users will soon be able to watch live broadcasts based on VVC technology on Taobao Live.

【Transmission】

On the video transmission side, the adaptive bit rate (ABR) algorithm adjusts the playback resolution according to the user's network conditions, buffer, and other information to balance image quality against stuttering in the overall QoE. To suit the low-latency characteristics of live broadcast, the Taobao content technology team added a source-side bit rate transmission channel to obtain accurate bitstream information in real time, obtained user bandwidth information in real time through bandwidth probing, improved the ABR network structure and QoE state model, and accounted for the reward alignment problem caused by frame skipping and by faster or slower playback in live broadcast. On this basis we proposed our own ABR algorithm, which for the first time achieved adaptive stream switching under low-latency live broadcast and reduced live broadcast stalls by 27%.

In short video stream selection, bandwidth is estimated from the download time of historical slices, transport layer information, and network type. The optimal parameters were determined through a large number of A/B experiments, and the mismatch between quality and bit rate was resolved, which significantly reduces the degradation rate of 1080p playback.

1080p ratio vs. freeze rate per 100 seconds; exit rate vs. freeze duration
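The trade-off that an ABR reward has to encode can be shown with the linear QoE form widely used in the ABR literature: reward quality, penalize stalls, and penalize abrupt quality switches. This is a generic formulation with illustrative weights; the team's own QoE state model and reward are not described in enough detail here to reproduce.

```python
def session_qoe(bitrates_mbps, stall_seconds, stall_penalty=4.3, switch_penalty=1.0):
    """Linear QoE: total quality minus stall and quality-switch penalties."""
    quality = sum(bitrates_mbps)
    stalls = stall_penalty * sum(stall_seconds)
    switches = switch_penalty * sum(
        abs(b - a) for a, b in zip(bitrates_mbps, bitrates_mbps[1:])
    )
    return quality - stalls - switches


# Three segments: one upward switch and a 0.5 s stall on the second segment.
print(round(session_qoe([1.0, 2.0, 2.0], [0.0, 0.5, 0.0]), 2))
```

In low-latency live streaming the player may skip frames or speed up playback to catch up, so the stall term observed for a segment no longer lines up neatly with the decision that caused it; this is the alignment issue the paragraph above refers to.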
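A minimal sketch of estimating bandwidth from historical slice downloads is shown below. It uses the harmonic mean of per-slice throughput, a common choice because it is less easily inflated by a single fast download, and caps the result with a prior based on network type. The cap values are hypothetical, and the transport layer information mentioned above is omitted.

```python
from statistics import harmonic_mean

# Hypothetical upper bounds per network type, in kbit/s.
NETWORK_CAP_KBPS = {"wifi": 20000, "5g": 20000, "4g": 8000, "3g": 1500}


def slice_throughputs_kbps(slices):
    """slices: list of (size_bytes, download_seconds) for past downloads."""
    return [size * 8 / 1000 / secs for size, secs in slices if secs > 0]


def estimate_bandwidth_kbps(slices, network_type):
    """Harmonic-mean throughput, capped by a prior for the network type."""
    rates = slice_throughputs_kbps(slices)
    if not rates:
        raise ValueError("no usable download history")
    estimate = harmonic_mean(rates)
    return min(estimate, NETWORK_CAP_KBPS.get(network_type, estimate))


history = [(250_000, 1.0), (250_000, 0.5)]  # 2000 and 4000 kbit/s
print(round(estimate_bandwidth_kbps(history, "wifi"), 1))
```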

5.3 No-reference video quality evaluation

In recent years, video has become an inseparable part of the Internet. From daily life and entertainment to learning, video has become the primary medium through which many people obtain information. UGC video accounts for almost 70% to 80% of all Internet video traffic. People not only consume UGC videos but also create their own "works". Anyone can use a mobile phone to shoot and upload short videos, or open a live broadcast account to share their life.

But the quality of UGC videos varies widely. Quality depends on factors such as shooting equipment, shooting environment, and shooting skills. Moreover, even if the producer is very experienced and the original video quality is very high, once the video has gone through the platform's various processing and distribution stages or has been re-uploaded by other users, the result that consumers see at the other end may be compromised.

As evaluation scenarios that lack an ideal reference video gradually become mainstream, no-reference video quality evaluation, as the main technical means in such scenarios, has received increasingly wide attention over the past few years. However, the field lacks a credible baseline comparable to the established indicators of traditional broadcasting such as PSNR, SSIM, and VMAF. Moreover, academic research on UGC video quality evaluation is still in its infancy, and there is no consensus on authoritative directions and standards that the industry can apply directly.

Therefore, based on Taobao Live, Tab2, the homepage information flow, and other content businesses, the team developed a no-reference video quality assessment model for UGC videos, MD-VQA (Multi-Dimensional Video Quality Assessment), which fuses multi-dimensional information such as semantics, distortion, and motion in the spatio-temporal domain to measure the absolute quality of a video. On the public video quality evaluation datasets LIVE-WC and YT-UGC+, as well as on TaoLive (derived from Taobao's video business, containing 3,762 videos that cover different content, distortions, and quality levels, with professional subjective annotation), MD-VQA surpasses the SOTA (state-of-the-art) methods on both mainstream video quality evaluation indicators, SRCC and PLCC.

At present, MD-VQA is fully deployed in Taobao content businesses including Taobao Live, Taobao Information Streaming, and Taobao Shopping. It is used to quantify and monitor overall image quality changes across the video business and to quickly and accurately screen live broadcasts and short videos by image quality level, helping to improve the quality of platform content. Taking Taobao Live as an example, MD-VQA provides minute-level online quality monitoring, which can quickly and accurately screen live broadcast rooms by image quality level, assist in mining and analyzing low-quality bad cases online, and alert anchors to image quality bottlenecks in real time. Together with the improvement measures in the "Guidelines for Launching E-commerce Live Broadcasts with High Image Quality", this has significantly improved anchors' satisfaction with image quality on Taobao Live: among the anchors who have received reminders, more than 75% want the real-time reminder service to be maintained and improved.
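SRCC and PLCC both compare a model's predicted scores with human opinion scores: PLCC measures linear agreement, while SRCC measures agreement in ranking only. The sketch below implements both in plain Python with standard definitions. In published VQA evaluations PLCC is usually computed after fitting a nonlinear mapping to the predictions; that step is omitted here for brevity.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SRCC)."""
    return pearson(ranks(x), ranks(y))


mos = [1, 2, 3, 4, 5]           # subjective scores (ground truth)
predicted = [1, 4, 9, 16, 25]   # monotonic but nonlinear predictions
print(round(spearman(mos, predicted), 3), round(pearson(mos, predicted), 3))
```

In this example the predictions preserve the ranking perfectly, so SRCC is 1.0, while PLCC is lower because the relationship is not linear.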
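Minute-level monitoring of this kind can be sketched as bucketing per-room quality scores by minute and flagging rooms whose average stays low for several consecutive minutes. The 0 to 100 score scale, the threshold, and the two-minute rule below are assumptions for illustration, not the production alerting policy.

```python
from collections import defaultdict


def minute_averages(samples):
    """samples: iterable of (room_id, unix_seconds, score) tuples."""
    buckets = defaultdict(list)
    for room, ts, score in samples:
        buckets[(room, int(ts // 60))].append(score)
    return {key: sum(v) / len(v) for key, v in buckets.items()}


def rooms_to_alert(samples, threshold=60.0, consecutive=2):
    """Rooms whose per-minute mean is below threshold for N minutes in a row."""
    per_room = defaultdict(dict)
    for (room, minute), avg in minute_averages(samples).items():
        per_room[room][minute] = avg
    alerts = []
    for room, series in per_room.items():
        run, prev = 0, None
        for minute in sorted(series):
            if series[minute] < threshold:
                # Extend the run only if this minute directly follows the last.
                run = run + 1 if run > 0 and prev == minute - 1 else 1
            else:
                run = 0
            prev = minute
            if run >= consecutive:
                alerts.append(room)
                break
    return sorted(alerts)


samples = [("A", 0, 50), ("A", 30, 55), ("A", 70, 40), ("B", 0, 80), ("B", 60, 50)]
print(rooms_to_alert(samples))
```

Requiring consecutive low minutes keeps a single noisy measurement from triggering a reminder to the anchor.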

In addition, MD-VQA is also supporting more and more image quality evaluation-related businesses within the entire Alibaba Group, such as DingTalk Live, ICBU Live and Alipay Live, assisting in monitoring the image quality experience of video-related businesses. Relevant papers were successfully included in the IEEE/CVF Computer Vision and Pattern Recognition Conference 2023 (CVPR 2023), the top conference in the field of computer vision.

At the same time, drawing on experience accumulated in daily business, the team developed the no-reference video quality evaluation model TB-VQA on the basis of MD-VQA, and took part in the CVPR NTIRE 2023 video quality evaluation competition, winning the championship of its only track.

This competition attracted dozens of top teams from China and abroad, including well-known technology companies such as ByteDance, Kuaishou, NetEase, Xiaomi, and Shopee, as well as universities such as Beihang University and Nanyang Technological University in Singapore. TB-VQA stood out from 37 teams and ranked first on all three indicators: Main Score, SRCC (Spearman Rank-order Correlation Coefficient), and PLCC (Pearson Linear Correlation Coefficient). The higher the SRCC and PLCC, the closer the predictions are to the ground truth (GT).

CVPR NTIRE 2023 Video Quality Evaluation Competition Ranking

Beauty picture quality: FACE-VQA & Audio quality evaluation: MD-AQA

In addition to the MD-VQA model for general-scene video quality evaluation, we also developed the FACE-VQA model for beauty quality evaluation and the MD-AQA model for audio quality. FACE-VQA first detects the face in the video, and then conducts a multi-dimensional comprehensive evaluation of the skin texture, color, and shape of the face based on people's aesthetic standards. FACE-VQA has been used to iterate the beauty algorithm and to monitor the beauty effect of Taobao Live. In the future, we will continue to improve the accuracy of FACE-VQA and to better account for the impact of makeup on the beauty effect.

In response to the clear need for reference-free audio quality evaluation, MD-AQA starts from multiple dimensions and uses a deep CNN self-attention model to score the four dimensions of noise, speech continuity, loudness, and timbre, and predict the MOS score at the same time. Currently, MD-AQA has been used to monitor the sound quality of Taobao live broadcasts, helping to discover and improve live broadcast rooms with better/poorer sound quality.

6. Like-minded

If you are interested in the audio and video business, you can directly submit your resume to this email: [email protected]. You are welcome to join us.