One Article to Understand the Huawei ML Kit Digital Human: Super-Simple Integration

1. Introduction to Digital Humans

A virtual digital human is a comprehensive multi-modal AI capability that combines technologies such as computer vision, emotion generation, voice cloning, and semantic understanding. It is widely used in scenarios such as news anchoring, financial customer service, and virtual gaming.

Applications of digital humans across the industry:

[Image: industry applications of digital humans]

2. HMS ML Kit digital human

The HMS ML Kit digital human is a new comprehensive multi-modal AI capability built on Huawei's core AI technologies, including image processing, speech synthesis, voice cloning, and semantic understanding. For education, news, and multimedia production companies, it offers a high-quality, low-cost, and innovative model for content creation. Compared with digital humans from other vendors, its advantages are clear:

Ultra-HD 4K cinematic quality

  • Supports large-screen display; full-body details and textures retain the same definition

  • Blends seamlessly with real background images, with no visible seams even at HD resolution

  • Lip details are preserved: lipstick highlights are bright and the texture is clear

  • Teeth are clearly visible, with realistic texture between them

High-fidelity synthesis

  • Faithfully reproduces the reflective details of teeth (not just a texture map), lips, and even lipstick

  • Faithfully reproduces facial lighting, contrast, shadows, dimples, and other details

  • The generated mouth-area skin texture connects seamlessly with the real skin texture

  • No animation stiffness, unlike 3D anchors


3. HMS ML Kit digital human video demo

[Image: HMS ML Kit digital human video demo]

As the demo above shows, the HMS ML Kit digital human achieves an ultra-high-definition, live-action video effect. Its speech is clear, and it also handles fine details well: lip detail, lipstick highlights, realistic facial movement during pronunciation, and detailed facial lighting.

4. HMS ML Kit digital human service integration

4.1 Service integration process

4.1.1 Submit the text information to be generated

Call the [Customized Text to Virtual Digital Human Video Interface], which passes the configuration (config) and the text to be converted (data) to the back end for processing. The interface first validates the length of the submitted text: Chinese text must not exceed 1000 characters, and English text must not exceed 3000 characters. It then checks that the config is not empty, and finally submits the config and data so that the text can be converted into an audio file.
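
A minimal sketch of this submission step, assuming the interface accepts a JSON body with the config and data fields described above; the field names and response shape are illustrative, since the exact schema lives in the request parameter table of the official documentation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitTextDemo {
    private static final String SUBMIT_URL =
            "http://10.33.219.58:8888/v1/vup/text2vedio/submit";

    // Client-side check mirroring the documented limits:
    // at most 1000 characters for Chinese text, 3000 for English text.
    static void validate(String text, boolean isChinese) {
        int limit = isChinese ? 1000 : 3000;
        if (text.isEmpty() || text.length() > limit) {
            throw new IllegalArgumentException("text length must be 1.." + limit);
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "Hello from the HMS ML Kit digital human.";
        validate(text, false);

        // "config" and "data" are the field names used in this article;
        // the concrete config keys are deliberately left empty here.
        String body = "{\"config\":{},\"data\":\"" + text + "\"}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(SUBMIT_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The response is expected to carry a text ID for later status queries.
        System.out.println("submit response: " + response.body());
    }
}
```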

4.1.2 Asynchronous scheduled task

A scheduled task runs asynchronously to process the submitted data: it calls the algorithm provided by TTS to generate a video file from the text, then merges the audio file obtained in the previous step with that video file.

4.1.3 Query whether the text conversion succeeded

Call the [Text to Virtual Digital Human Video Result Query Interface] to check in real time whether the asynchronous text-to-video task has finished; once it completes, a link to the generated video is returned.
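
Since the conversion runs asynchronously, the client has to poll. A minimal polling sketch, assuming the query interface takes a JSON body with the textIds array named in section 4.2.2 and eventually returns a playback URL:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryStatusDemo {
    private static final String QUERY_URL =
            "http://10.33.219.58:8888/v1/vup/text2vedio/query";

    public static void main(String[] args) throws Exception {
        // Use the text ID returned by the submit call; the value here is a placeholder.
        String body = "{\"textIds\":[\"<your-text-id>\"]}";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(QUERY_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Poll until the response carries a video link or we give up.
        for (int attempt = 0; attempt < 30; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("query response: " + response.body());
            // Crude completion check; a real client would parse the status field.
            if (response.body().contains("http")) {
                break;
            }
            Thread.sleep(5_000);
        }
    }
}
```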

4.1.4 Access the video file via the returned link

Access the generated video file through the video link returned by the [Text to Virtual Digital Human Video Result Query Interface].
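
A short sketch of this final step: downloading the generated video from the playback URL returned by the query interface (passed in on the command line here):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class DownloadVideoDemo {
    public static void main(String[] args) throws Exception {
        // Pass the playback URL returned by the query interface as the first argument.
        String videoUrl = args[0];

        HttpRequest request = HttpRequest.newBuilder(URI.create(videoUrl)).GET().build();

        // Stream the response body straight into a local file.
        HttpResponse<Path> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("digital_human.mp4")));
        System.out.println("video saved to " + response.body());
    }
}
```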

4.2 Main interfaces for service integration

4.2.1 Custom text to virtual digital human video interface

URL
http://10.33.219.58:8888/v1/vup/text2vedio/submit

Request parameters:

[Image: request parameter table]

Main function:
Converts input text into a virtual digital human video. This is an asynchronous interface: conversion currently takes some time and runs offline, so the final result must be queried through the [Text to Virtual Digital Human Video Result Query Interface]. If the submitted text has already been synthesized, the playback URL is returned directly.

Main logic:
Takes the text to be synthesized (data) from the front-end page and, according to the settings provided in config, converts the text into an audio file. A multithreaded task then runs asynchronously, generating a video file whose mouth movements match the pronunciation based on the provided algorithm model, and merges it with the audio file to produce the required digital human video. If the submitted text has already been synthesized, the playback URL is returned directly.

4.2.2 Text to virtual digital human video result query interface

URL
http://10.33.219.58:8888/v1/vup/text2vedio/query

Request parameters:

[Image: request parameter table]

Main function:

Checks the conversion status in batches, based on the submitted text IDs.

Main logic:
Takes the list of synthesized text IDs submitted by the front-end page (the textIds field), queries the status of each video-synthesis task, stores the resulting statuses in a collection, and adds that collection to the response as a return parameter. If a requested text has already been synthesized, the playback URL is returned directly.

4.2.3 Batch offline interface for converting text to virtual digital human video

URL
http://10.33.219.58:8888/v1/vup/text2vedio/offline

Request parameters:

[Image: request parameter table]

Main function:
Takes videos offline in batches, based on the submitted text IDs.

Main logic:
Takes the array of synthesized text IDs submitted by the front-end page (the textIds field), sets the video corresponding to each ID to offline, changes its status accordingly, and deletes the video file; a video that has been taken offline can no longer be played or watched.
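
A sketch of calling this batch offline interface, assuming it accepts the same textIds array as the query interface; the field name comes from this article rather than a verified schema:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OfflineVideosDemo {
    private static final String OFFLINE_URL =
            "http://10.33.219.58:8888/v1/vup/text2vedio/offline";

    public static void main(String[] args) throws Exception {
        // IDs of the texts whose videos should be taken offline (placeholders).
        String body = "{\"textIds\":[\"<text-id-1>\",\"<text-id-2>\"]}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(OFFLINE_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Once taken offline, the corresponding video files are deleted
        // server-side and the videos can no longer be played.
        System.out.println("offline response: " + response.body());
    }
}
```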

4.3 Main features of the HMS ML Kit digital human service

The HMS ML Kit digital human service is very powerful:

  1. Bilingual pronunciation: the system currently supports Chinese and English pronunciation, so either Chinese or English text can be submitted as pronunciation data.
  2. Multiple virtual anchors: different virtual anchor voices are supported. The system currently ships with four: a Chinese female voice, a Shanghainese voice, an English female voice, and an English male voice.
  3. Picture-in-picture video playback: besides the virtual anchor settings, playback supports picture-in-picture, i.e. playing the video in a small window. In this mode the video window follows the screen as you scroll, so you can read the text while the video plays; the window can also be dragged anywhere so that it does not block the text.
  4. Adjustable speed, volume, and tone: pronunciation speed, volume, and tone can all be tuned to different needs (see the config sketch after this list).
  5. Multiple backgrounds: different virtual anchor backgrounds can be set. The system currently has three built in: transparent, green screen, and a technology theme. You can also upload a picture to use as a custom background.
  6. Subtitle settings: the system can configure subtitles automatically, as Chinese subtitles, English subtitles, or bilingual subtitles.
  7. Multiple layouts: the virtual anchor's on-screen position can be adjusted by parameter: left, right, or center. The anchor's size is also adjustable, as is whether the full body or half body is shown. When the anchor is placed on the left or right, you can additionally set where the station logo appears and embed the video file to be played within the frame, achieving a picture-in-picture effect that recreates a real news broadcast scene.
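
To tie the features above together, here is a purely illustrative sketch of what a submission config could look like. Every key below (anchor, speed, volume, pitch, background, subtitle, layout) is a hypothetical name; the real keys are defined in the request parameter tables of the official documentation:

```java
import java.util.Map;

public class ConfigSketch {
    public static void main(String[] args) {
        // Hypothetical config assembling the documented capabilities:
        Map<String, Object> config = Map.of(
                "anchor", "english_female",   // one of the four built-in anchor voices
                "speed", 1.0,                 // pronunciation speed
                "volume", 1.0,                // pronunciation volume
                "pitch", 1.0,                 // pronunciation tone
                "background", "green_screen", // transparent, green screen, tech theme, or custom image
                "subtitle", "bilingual",      // Chinese, English, or bilingual subtitles
                "layout", "left_full_body"    // anchor position and size on screen
        );
        System.out.println(config);
    }
}
```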

Video picture-in-picture display:

[Image: picture-in-picture video playback]

5. Afterword

As a developer, I was amazed after using the HMS ML Kit digital human to generate a video, especially with the picture-in-picture feature: it convincingly recreates the broadcast scene of a real anchor. One cannot help wondering whether a perfected digital human could one day completely replace real-person broadcasts.

For more detailed development guidance, please refer to the official website of the Huawei Developer Alliance:

https://developer.huawei.com/consumer/cn/hms/huawei-mlkit


Original link: https://developer.huawei.com/consumer/cn/forum/topicview?tid=0202351501845870559&fid=18
Author: say hi
