Amazon cloud technology generative AI technology assisted teaching field, near real-time intelligent response 2D digital human construction

 Long before the rise of large language models such as GPT-3.5 and their widespread adoption, the education industry had made various attempts in the field of AI-assisted teaching. In the education industry, the adoption of artificial intelligence technology helps the education industry to better achieve teaching goals, improve teaching quality, learning efficiency, learning experience, and learning outcomes. For example, artificial intelligence technology can help teachers better manage classrooms, better identify students' learning needs, better provide personalized learning content, better evaluate students' learning outcomes, and better provide learning support. In addition, artificial intelligence technology can also help the education industry to better automate and improve the efficiency and effectiveness of the education industry. In short, the adoption of artificial intelligence technology in the education industry will bring about huge changes and bring more development opportunities for the education industry.

 Amazon Cloud Technology has also been committed to providing more convenient, faster, and more powerful AI services to support technological innovation and business innovation of customers in the education industry. In particular, products such as Amazon Transcribe, Amazon Polly, Amazon Textract, Amazon Translate, Amazon Personalize, Amazon Rekognition, and Amazon SageMaker have provided strong technical support for the education industry in terms of natural language processing, graphics and image processing, and model development and deployment.

 This article combines Amazon Transcribe, Amazon Polly, and OpenAI's large language model and D-ID.com's 2D digital human generation technology to introduce the service and specific implementation process of an intelligent 2D digital human design with voice dialogue for demonstration .

 Solution Architecture

 In order to present voice input, voice output, and the overall effect of 2D digital human video playback in a unified user interface, this solution chooses the Gradio framework to realize the WebUI function. The rendered WebUI is as follows:

 Users can directly input text content or use a microphone to input voice. The text content will use Langchain to add a certain context and then send it to OpenAI’s GPT interface call. Voice input will first call the Amazon Transcribe service for voice-to-text conversion. The text content returned through the GPT interface will call AWS Polly to form a voice file, and the voice file will be used as an API provided by D-ID.com to render a 2D dynamic video and automatically display and play it on the front end.

 In this solution, the functions of voice input, voice output, text response generation, and digital human video generation can be combined and replaced freely. In particular, the call to the OpenAI interface can be replaced with a call to the self-deployed large language model. At the same time, the generation of 2D digital human video can also consider other similar services, such as Heygen, etc.

 Implementation

 voice input section

 Amazon Transcribe supports transcribing speech in real time (streaming) or from speech files in an Amazon S3 bucket (batch processing). Transcribe supports up to dozens of languages ​​in different countries.

 Transcribe's real-time transcription capability is very powerful. While processing streaming data, it continuously uses the previous context to correct the results in real time. You can see the effect of Transcribe's real-time transcription output through the screenshot below:

 Response content generation part

 In this solution, the answer content is generated by using the open source framework of Langchain, calling the coversation interface based on OpenAI, and using the memory library to save the dialogue context for 5 rounds. In actual customer scenarios, richer ways can be considered to regulate the validity and objectivity of the content of the reply.

 For example, you can use Langchain's dialogue templates to preset the roles of large models, or use knowledge base construction and retrieval engines such as Amazon Kendra and Amazon Opensearch to further limit the content of large model responses.

 voice output section

 Amazon Polly turns text into lifelike speech. It supports multiple languages ​​and contains a variety of realistic voice simulations, including Mandarin Chinese voice simulations.

 You can build voice-enabled applications that can be used in a variety of locations and choose the voice that suits your customers. Amazon Polly also supports Speech Synthesis Markup Language (SSML), an XML-based W3C standard markup language for speech synthesis applications that supports the use of common SSML tags for sentence segmentation, accent, and intonation. Custom Amazon SSML tags provide unique options, such as being able to make certain sounds in the style of a newscaster speaking. This flexibility helps you create lifelike voices that grab and hold your audience's attention.

 In this solution, you can use Polly’s real-time voice generation interface, use VoiceID: Zhiyu pronounced in Mandarin Chinese, and customize the pronunciation of specific characters, which is also a very useful function of Polly (Lexion).

 Generation part of 2D digital human video

 An external third-party SaaS service can be used here. The service is provided by D-ID.com, and the corresponding API can directly receive text input and a face picture to generate a corresponding dynamic broadcast video, and can also accept voice files plus pictures as input.

 When you enter text, the API interface can choose to specify different Voice IDs in the AWS Polly service to automatically synthesize speech for you.

 In this solution, I want to reflect the effect of Chinese voice output, but the API interface of D-ID cannot directly specify Chinese Voice ID for Chinese text. So I chose to use Polly's API to generate voice first, and then send the voice and pictures to D-ID's interface to generate video.

 Summarize

 This year is the year when AIGC broke out, and it is also a year when customers in the education industry see the turning point of the industry. At this critical historical point, Amazon Cloud Technology is willing to face these new opportunities and challenges together with customers, and is guided by customer needs to help customers seize the dividends brought by the wave of AI.

 At present, in addition to the 2D digital human solution shown in this article, Amazon Cloud Technology can also help customers provide solutions such as live broadcast and interaction based on 3D digital human or other 3D digital images. At the same time, Amazon Cloud Technology will also introduce more technical partners such as Leap Engine to enrich the entire digital human, solutions for digital image live broadcast, on-demand, interactive and other scenarios, and help more customers in the education industry accelerate the adoption and implementation of AI technology.

 Original title: Near real-time intelligent response 2D digital human construction

 Original link: https://aws.amazon.com/cn/blogs/china/near-real-time-intelligent-answering-2d-digital-human-construction/

Supongo que te gusta

Origin blog.csdn.net/MJ0705/article/details/132580450
Recomendado
Clasificación