Alibaba Cloud Video Cloud's latest research on face generation accepted by CVPR 2022

CVPR (the IEEE Conference on Computer Vision and Pattern Recognition) is the top conference in computer vision and pattern recognition and carries great authority worldwide. In the China Computer Federation's list of recommended international academic conferences, CVPR is ranked as a Category A conference in the field of artificial intelligence.

Building on solid accumulation and cutting-edge innovation in the field of face generation, the latest joint research by Alibaba Cloud Video Cloud and the Hong Kong University of Science and Technology, "Depth-Aware Generative Adversarial Network for Talking Head Video Generation", has been accepted by CVPR 2022.

CVPR 2022 will be held in New Orleans, Louisiana, USA, from June 19 to 24, 2022.


Face reenactment (talking-head generation) has drawn increasing attention in recent years. Existing face reenactment methods rely heavily on 2D representations learned from the input images and rarely introduce 3D geometric information as guidance and constraint. As a result, the structure, pose, and expression of the generated faces are often inaccurate, and the models generalize poorly, which makes them hard to deploy at scale in practical scenarios.

The Alibaba Cloud Video Cloud technology team and the Hong Kong University of Science and Technology jointly proposed a depth-aware face reenactment algorithm. This algorithm is a major innovation in the field of face reenactment, and both its academic and application value are worth looking forward to. In the video cloud domain in particular, it is expected to bring a significant breakthrough in audio and video codec efficiency.

The algorithm uses a self-supervised depth estimation model to obtain pixel-level depth maps from videos without any 3D annotations; these depth maps in turn guide facial keypoint detection and motion-field synthesis. In the face generation stage, a cross-modal attention map is learned from the depth map to capture finer motion details and correct the facial structure.
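To make the pipeline concrete, below is a minimal PyTorch sketch of the two ideas in this paragraph: a keypoint detector that consumes an RGB frame together with its estimated depth map, and a cross-modal attention module in which depth features query appearance features. The layer sizes, module structure, and the choice of 15 keypoints are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch of depth-guided keypoint detection and cross-modal
# attention. Shapes, layer sizes, and num_kp=15 are illustrative
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthGuidedKeypointDetector(nn.Module):
    """Regress K facial keypoints from an RGB frame concatenated with
    its self-supervised depth map, via soft-argmax over heatmaps."""
    def __init__(self, num_kp=15):
        super().__init__()
        self.encoder = nn.Sequential(  # toy encoder; real models use hourglass nets
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_heatmaps = nn.Conv2d(64, num_kp, 1)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)        # (B, 3+1, H, W)
        heat = self.to_heatmaps(self.encoder(x))  # (B, K, H/4, W/4)
        b, k, h, w = heat.shape
        probs = F.softmax(heat.view(b, k, -1), dim=-1).view(b, k, h, w)
        ys = torch.linspace(-1, 1, h, device=heat.device)
        xs = torch.linspace(-1, 1, w, device=heat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        kp_x = (probs * gx).sum(dim=(2, 3))       # soft-argmax over x
        kp_y = (probs * gy).sum(dim=(2, 3))       # soft-argmax over y
        return torch.stack([kp_x, kp_y], dim=-1)  # (B, K, 2) in [-1, 1]

class CrossModalAttention(nn.Module):
    """Depth features act as queries over appearance features, producing
    an attention-refined feature map for the generator (residual form)."""
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)  # queries from depth features
        self.k = nn.Conv2d(dim, dim, 1)  # keys from appearance features
        self.v = nn.Conv2d(dim, dim, 1)  # values from appearance features

    def forward(self, depth_feat, app_feat):
        b, c, h, w = app_feat.shape
        q = self.q(depth_feat).flatten(2).transpose(1, 2)  # (B, HW, C)
        k = self.k(app_feat).flatten(2)                    # (B, C, HW)
        v = self.v(app_feat).flatten(2).transpose(1, 2)    # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)     # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).view(b, c, h, w)
        return out + app_feat

# Example usage with random tensors standing in for a real frame and depth map:
rgb = torch.randn(2, 3, 256, 256)
depth = torch.rand(2, 1, 256, 256)  # would come from a self-supervised depth network
print(DepthGuidedKeypointDetector()(rgb, depth).shape)  # torch.Size([2, 15, 2])
```

The depth map enters the pipeline twice in this sketch: once as an extra input channel for keypoint detection, and once as the query source for the attention that reweights appearance features, mirroring the two roles described above.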

This technology therefore offers a new solution for video encoding and decoding in specific scenarios. In a video conferencing scenario, for example, the model learns to synthesize a talking-head video from a source image containing the target person's appearance and a driving video. Motion is encoded as a compact set of learned keypoints, and this compact keypoint representation enables a video conferencing system to achieve visual quality on par with the commercial H.264 standard while using only about one tenth of the bandwidth. In other words, bandwidth requirements drop sharply while high image quality and low latency are preserved.
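As a rough illustration of the bandwidth claim, the arithmetic below compares the bitrate of transmitting per-frame keypoints against a typical H.264 conferencing bitrate. The specific constants (15 keypoints with 2x2 local motion Jacobians, 30 fps, a 1 Mbps H.264 baseline) are assumptions chosen for this sketch, not figures from the paper.

```python
# Back-of-the-envelope bandwidth comparison for keypoint-based transmission.
# All constants below are illustrative assumptions, not the paper's numbers.
NUM_KP = 15             # keypoints sent per frame (assumed)
FLOATS_PER_KP = 2 + 4   # (x, y) plus a 2x2 local motion Jacobian, FOMM-style
BYTES_PER_FLOAT = 4     # uncompressed float32
FPS = 30

kp_bps = NUM_KP * FLOATS_PER_KP * BYTES_PER_FLOAT * 8 * FPS
h264_bps = 1_000_000    # a common 720p conferencing bitrate (assumed)

print(f"keypoint stream: {kp_bps / 1000:.1f} kbps")  # 86.4 kbps
print(f"H.264 baseline:  {h264_bps / 1000:.0f} kbps")
print(f"bandwidth ratio: {h264_bps / kp_bps:.1f}x")  # ~11.6x
```

Even without any entropy coding of the keypoint stream, the payload is roughly an order of magnitude smaller than the video bitstream, consistent with the "one tenth of the bandwidth" figure above.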

In addition, this technology can be widely applied to conferencing and live-streaming scenarios, as well as interactive entertainment scenarios such as the metaverse and virtual humans, meeting the demand for driven video imagery across many settings: given the desired motions, face images of various styles can be driven to produce the corresponding videos. Flexibly applying this breakthrough along the business paths of fast-growing industries promises an immeasurable boost.


"Video Cloud Technology", your most noteworthy public account of audio and video technology, pushes practical technical articles from the frontline of Alibaba Cloud every week, where you can communicate with first-class engineers in the audio and video field. Reply to [Technology] in the background of the official account, you can join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and obtain more latest industry information.


Source: my.oschina.net/u/4713941/blog/5514434