The distance between digital life and us in "The Wandering Earth 2"

In "The Wandering Earth 2", we meet the immortal digital human Yaya. She not only has the real Yaya's appearance and memories, but also the same consciousness and personality. With the continued popularity of the metaverse, digital human technology has made great progress in recent years, but how far is current technology from the digital Yaya shown in the film? Below, we briefly survey the current state of the art, its bottlenecks, and its future prospects from two angles: the reproduction and expression of digital twins, and the perception, cognition, consciousness, and personality of digital humans.

Reproduction and expression of twin digital humans

The reproduction of a twin digital human generally refers to duplicating and displaying a real person in 2D or 3D, while expression refers to the digital human's ability to perform various body and facial movements. A 2D twin digital human can be reproduced easily by taking photos or recording video. For a 2D digital human to perform specified actions, however, an excellent generative model is required. Current research on the popular talking-face task can already generate audio-driven lip movements that align with speech, but generating high-definition, accurate 2D body movements still faces great challenges. On the other hand, although popular generation techniques such as GANs, NeRF, and diffusion models can complete basic generation tasks, and research on naturally and smoothly generating digital humans with rich expressions has made some progress, applying these techniques in real products still runs into problems: the generated video is not high-definition, the expressions are not natural enough, and the movements are not rich enough.

△ Image from paper [1]

For example, recent popular research on style-transferred digital human synthesis takes as input a piece of audio, a style reference video, and a photo of a person, and synthesizes a video that carries the reference style with mouth movements aligned to the audio. Below we take the paper StyleTalk [1] as an example and introduce this kind of algorithm in depth.

Overall structure of the algorithm:

First, sequential 3DMM expression coefficients are extracted from the style reference video and fed into a style encoder to obtain a style code. Meanwhile, an audio encoder converts phoneme labels into audio features. The style-controlled dynamic decoder designed in the paper then generates stylized expression coefficients that carry both the style and the audio characteristics. Finally, the stylized expression coefficients and a reference image of the person are fed into an image-rendering model, producing a video of that person that carries the reference video's style and matches the input audio.
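The dataflow above can be sketched as follows. This is a minimal toy sketch, not StyleTalk's actual implementation: the module names, the phoneme vocabulary size, and the feature dimensions are illustrative assumptions, and the learned networks are replaced by random linear maps so that only the shapes and the flow of data are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def style_encoder(expr_coeffs):
    # (T, 64) 3DMM expression coefficients from the reference video -> style code
    return expr_coeffs.mean(axis=0) @ rng.standard_normal((64, 128))

def audio_encoder(phoneme_labels):
    # (T,) phoneme ids -> (T, 256) audio features; vocabulary size 41 is an assumption
    table = rng.standard_normal((41, 256))
    return table[phoneme_labels]

def style_controlled_decoder(audio_feat, style_code):
    # Fuse audio features with the style code to predict per-frame coefficients.
    fused = np.concatenate(
        [audio_feat, np.tile(style_code, (len(audio_feat), 1))], axis=1)
    return fused @ rng.standard_normal((256 + 128, 64))  # (T, 64) stylized coeffs

ref_expr = rng.standard_normal((100, 64))  # coefficients from the style reference video
phonemes = rng.integers(0, 41, size=50)    # phoneme labels extracted from the audio

style_code = style_encoder(ref_expr)
audio_feat = audio_encoder(phonemes)
stylized_coeffs = style_controlled_decoder(audio_feat, style_code)
print(stylized_coeffs.shape)  # (50, 64): one expression-coefficient vector per frame
```

The stylized coefficients would then be handed, together with the reference photo, to an image-rendering model to produce the final video frames.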

△ Image from paper [1]

Decoupling the upper and lower face: The upper and lower halves of the face have different motion patterns: the upper face (eyes, eyebrows) moves at low frequency, while the lower face (mouth) moves at high frequency. The paper therefore splits the expression coefficients into an upper-face group and a lower-face group and trains two parallel style-controlled dynamic decoders. Of the 64 expression coefficients, the 13 most highly correlated with mouth movement form the lower-face group, and the remaining coefficients form the upper-face group. The two generated groups of coefficients are then spliced back together to obtain the final expression coefficients.
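The split-and-splice step is simple index bookkeeping. A minimal sketch, assuming (hypothetically) that the 13 mouth-correlated coefficients are the first 13 indices; in the paper they are selected by their correlation with mouth motion:

```python
import numpy as np

expr = np.arange(64, dtype=float)  # one frame of 64 3DMM expression coefficients

# Hypothetical index set for the 13 mouth-related coefficients.
mouth_idx = np.arange(13)
upper_idx = np.setdiff1d(np.arange(64), mouth_idx)

lower_group = expr[mouth_idx]  # handled by the lower-face decoder (13 coeffs)
upper_group = expr[upper_idx]  # handled by the upper-face decoder (51 coeffs)

# After the two parallel decoders produce their outputs, the groups are
# scattered back to their original positions to form the final coefficients.
merged = np.empty(64)
merged[mouth_idx] = lower_group
merged[upper_idx] = upper_group

print(len(lower_group), len(upper_group))  # 13 51
```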

Objective function design: To train a strong model, the paper uses a mouth-shape alignment discriminator, a temporal discriminator, a style discriminator, and a triplet constraint.

  • Mouth alignment discriminator: cosine similarity between the mouth embedding e_m and the audio embedding e_a measures their alignment; the synchronization probability is then maximized by computing a per-frame synchronization loss over the generated video slice.

  • Style discriminator: similar in structure to PatchGAN [3], a style loss guides the model to generate a vivid speaking style.

  • Temporal discriminator: learns to distinguish real from generated 3DMM [2] expression-coefficient sequences; it also follows the PatchGAN [3] structure and uses a GAN hinge loss L_tem.

  • Triplet constraint: given a video slice with a certain speaking style, two more slices are randomly sampled, one with the same style and one with a different style. The style codes s_c, s_pc, and s_nc of the three slices are extracted, and a triplet loss constrains their distances in the style space.

Finally, the overall loss combines the four losses above with a reconstruction loss (L1 loss).
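Two of these losses are easy to illustrate concretely. The sketch below is an assumption-laden simplification, not the paper's exact formulation: the mapping of cosine similarity to a synchronization probability and the margin value are illustrative choices.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sync_loss(mouth_emb, audio_emb):
    # Map cosine similarity in [-1, 1] to a probability in [0, 1], then
    # maximize it by minimizing the negative log probability per frame.
    p = (cosine_sim(mouth_emb, audio_emb) + 1) / 2
    return -np.log(p + 1e-8)

def triplet_loss(s_c, s_pc, s_nc, margin=1.0):
    # Pull the same-style code s_pc toward s_c; push the different-style
    # code s_nc at least `margin` farther away in the style space.
    d_pos = np.linalg.norm(s_c - s_pc)
    d_neg = np.linalg.norm(s_c - s_nc)
    return max(d_pos - d_neg + margin, 0.0)

rng = np.random.default_rng(0)
e_m, e_a = rng.standard_normal(128), rng.standard_normal(128)  # mouth/audio embeddings
s_c = rng.standard_normal(128)
s_pc = s_c + 0.1 * rng.standard_normal(128)  # same-style slice: nearby code
s_nc = rng.standard_normal(128)              # different-style slice: far-away code

print(sync_loss(e_m, e_a) > 0)
print(triplet_loss(s_c, s_pc, s_nc))  # already well separated: loss is zero
```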

A 3D twin digital human involves the same two tasks: reproduction and expression. Many studies already address how to reproduce a 3D digital human from a real person: some film a full circle around the subject with a camera and reconstruct a 3D virtual character from the video, while others rely on a professional studio with multiple cameras, synthesizing a 3D model from photos taken at many angles. Although existing techniques can reproduce a basic 3D character, a large gap to the real person remains, especially in fine details such as clothing and skin.

Compared with 2D expression, 3D expression is easier to realize. The main reason is that 2D expression depends almost entirely on generation techniques, which are prone to frame-sequence inconsistency, distortion, blur, and deformation, whereas 3D expression relies more on driving techniques: given key points of the face and body, a 3D character can be driven to perform specified actions and expressions. In particular, motion-capture technology can accurately drive facial expressions and body movements in real time. Even so, how to remove the human operator and automatically drive every key point of a 3D character to move accurately remains a hot research direction.

Cognition, Consciousness and Personalization of Digital Humans

Human cognition and consciousness are complex puzzles that have not been fully solved. Enabling digital humans to perceive, to possess consciousness, and even to have their own personality is a major research direction, and because this work touches on brain science, it is especially complex and challenging. When a digital human does have some degree of consciousness or individuality, how to express it is itself a cross-disciplinary research topic: what form of language, what voice and intonation, what facial expressions and body movements, and how to coordinate these forms of expression with one another. These problems, on the one hand, drive rapid development and iteration in the individual disciplines involved, and on the other hand place higher demands on interdisciplinary integration.

On the theory side of conscious intelligence, some progress has already been made. For example, Turing Award winner Manuel Blum proposed a formal theoretical computer-science model of consciousness in 2022: the Conscious Turing Machine (CTM) [4]. The CTM aims to establish a mathematical model of consciousness, help us understand its mechanisms, and ultimately enable conscious AI.

The idea of the Conscious Turing Machine originates from the Global Workspace Theory (GWT) of cognitive neuroscientist Bernard Baars. The theory posits that consciousness is linked to a global "broadcast system" in the brain that broadcasts information brain-wide. When the brain processes incoming information, under normal circumstances the various dedicated processors (visual, auditory, tactile, and so on) handle it automatically, and no consciousness arises. But when the brain faces new or unusual stimuli, these dedicated processors cooperate or compete to give the novel stimuli "focused" analysis in the global workspace, and it is precisely in this process that consciousness is produced. The theory is often explained with a "theater metaphor": consciousness is like actors performing in the spotlight on stage, watched by an audience (the unconscious processors) sitting in the dark.

In Blum's Conscious Turing Machine, the stage of the global workspace is represented by a short-term memory, and the audience by processors, each with its own expertise, which together make up the CTM's long-term memory.
These long-term memory processors automatically process and predict information and receive feedback from the CTM's world. A learning algorithm inside each processor improves its behavior based on this feedback. At the same time, the processors compete with one another; the winner places its information, in the form of a chunk, on the stage, and the result is then broadcast to all processors through the broadcast mechanism. In the Conscious Turing Machine, it is this process of the long-term memory processors receiving the broadcast information that gives rise to conscious awareness.

Overall, a Conscious Turing Machine is organized as a tree running from the conscious to the unconscious: the root node is the conscious short-term memory, and the child nodes are a large number of unconscious long-term memory processors. Short-term memory plays the role of the stage in the global workspace model, and the information it holds is the information in consciousness.
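The compete-then-broadcast loop described above can be sketched as a toy program. This is purely an illustration of the mechanism, not Blum's formal model: the scoring rule, the chunk format, and the processor names are simplifications invented here.

```python
class Processor:
    """A long-term-memory processor with one area of expertise."""
    def __init__(self, name, expertise):
        self.name, self.expertise = name, expertise

    def propose(self, stimulus):
        # A chunk: (weight, content). A stimulus matching this processor's
        # expertise gets a high weight, i.e. it is salient to this processor.
        weight = stimulus["intensity"] if stimulus["kind"] == self.expertise else 0.1
        return (weight, f"{self.name}: processed {stimulus['kind']} stimulus")

def ctm_step(processors, stimulus):
    # 1. Every long-term-memory processor proposes a chunk.
    chunks = [p.propose(stimulus) for p in processors]
    # 2. Competition: the heaviest chunk wins the short-term-memory stage.
    winner = max(chunks, key=lambda c: c[0])
    # 3. Broadcast: the winning chunk is sent back to all processors.
    return winner[1]

procs = [Processor("vision", "visual"),
         Processor("hearing", "auditory"),
         Processor("touch", "tactile")]

print(ctm_step(procs, {"kind": "auditory", "intensity": 0.9}))
# The auditory processor's chunk wins the stage and is broadcast.
```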

Left: Global Workspace Theory framework; right: Conscious Turing Machine framework

As the first mathematical model of consciousness, the Conscious Turing Machine helps us understand the mechanisms of consciousness and offers a theoretical direction for designing truly conscious AI. As far as existing technology goes, however, there is still a long way to go before digital humans have cognition, consciousness, and personalized expression comparable to Yaya's.

Summary

Yaya in "The Wandering Earth 2" is a concrete embodiment of where digital human research is headed. Although current technology still faces many challenges in reproduction, expression, cognition, consciousness, and personalization, we believe that with the rapid iteration of AI technology, ever more intelligent digital human products will reach everyone.

About Huayuan Computing

Huayuan Computing Technology (Shanghai) Co., Ltd. ("Huayuan Computing"), established in 2002, focuses on algorithm research and innovative applications, with an emphasis on the research, application, and development of cognitive intelligence technology. Building on applied mathematics and computing, the company develops its own underlying algorithms for cognitive intelligence and, based on its cognitive intelligence engine platform, provides AI solutions for digital governance, intelligent manufacturing, digital cultural tourism, retail finance, and other industries, empowering these industries to transform and upgrade intelligently and making the world smarter.


References:

[1] Ma Y, Wang S, Hu Z, et al. StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles[J]. arXiv preprint arXiv:2301.01081, 2023.

[2] Blanz V, Vetter T. A morphable model for the synthesis of 3D faces[C]//Proceedings of the 26th annual conference on Computer graphics and interactive techniques. 1999: 187-194.

[3] Isola P, Zhu J Y, Zhou T, et al. Image-to-image translation with conditional adversarial networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 1125-1134.

[4] Blum L, Blum M. A theory of consciousness from a theoretical computer science perspective: Insights from the Conscious Turing Machine[J]. Proceedings of the National Academy of Sciences, 2022, 119(21): e2115934119.

Authors: Wang Xiaomei, Shen Weilin

Origin blog.csdn.net/winnieg/article/details/129844129