Building a 2D digital human with near real-time intelligent responses


Background

Long before the rise and widespread adoption of large language models such as GPT-3.5, the education industry had already been experimenting with AI-assisted teaching. Artificial intelligence helps the industry achieve its teaching goals: it can improve teaching quality, learning efficiency, the learning experience, and learning outcomes. For example, AI can help teachers manage classrooms, identify students' learning needs, provide personalized learning content, evaluate learning outcomes, and offer learning support. It can also automate routine work, improving the efficiency and effectiveness of education as a whole. In short, the adoption of artificial intelligence will bring sweeping changes and new development opportunities to the education industry.

Amazon Cloud Technology has long been committed to providing more convenient, faster, and more powerful AI services to support the technological and business innovation of customers in the education industry. In particular, products such as Amazon Transcribe, Amazon Polly, Amazon Textract, Amazon Translate, Amazon Personalize, Amazon Rekognition, and Amazon SageMaker provide strong technical support for natural language processing, image processing, and model development and deployment.

This article combines Amazon Transcribe, Amazon Polly, and OpenAI's large language model with D-ID.com's 2D digital human generation technology to build a demonstration of an intelligent 2D digital human that holds voice conversations, and walks through the implementation.

Solution Architecture

[Figure: solution architecture diagram]

To present voice input, voice output, and 2D digital human video playback in a single user interface, this solution uses the Gradio framework for the WebUI. The rendered WebUI looks like this:

[Screenshot: the rendered Gradio WebUI]

Users can type text directly or speak into a microphone. Text input is given additional context via LangChain and then sent to OpenAI's GPT interface; voice input is first transcribed to text by the Amazon Transcribe service. The text returned by the GPT interface is converted into an audio file by Amazon Polly, and that audio file is passed to the API provided by D-ID.com to render a 2D animated video, which is automatically displayed and played on the front end.
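
For clarity, below is a minimal sketch of how these pieces might be wired together in Gradio (assuming Gradio 3.x; respond() is hypothetical glue code, and transcribe_func_old, conversation, polly_text_to_audio, and get_mp4_video are defined in the sections that follow):

import gradio as gr

def respond(audio_path, text_input):
    # Hypothetical glue code chaining the components described in this article
    question = text_input or transcribe_func_old(audio_path)
    answer = conversation.predict(input=question)
    polly_text_to_audio("answer.mp3", answer, "mp3")
    audio_url = "https://example.com/answer.mp3"  # placeholder: D-ID needs a public URL
    video_url = get_mp4_video(input=audio_url)
    return answer, video_url

with gr.Blocks() as demo:
    with gr.Row():
        audio_in = gr.Audio(source="microphone", type="filepath", label="Voice input")
        text_in = gr.Textbox(label="Text input")
    answer_out = gr.Textbox(label="Answer")
    video_out = gr.Video(label="2D digital human")
    gr.Button("Ask").click(respond, [audio_in, text_in], [answer_out, video_out])

demo.launch()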

In this solution, the voice input, voice output, response generation, and digital human video generation components can be freely combined and replaced. In particular, the OpenAI call can be replaced with a call to a self-deployed large language model, and other similar services, such as HeyGen, can be considered for generating the 2D digital human video.

Implementation

Voice input

Amazon Transcribe can transcribe speech in real time (streaming) or from audio files stored in an Amazon S3 bucket (batch processing). It supports dozens of languages.

Transcribe's real-time transcription capability is very powerful: while processing streaming data, it continuously uses the preceding context to revise its results in real time. The screenshot below shows Transcribe's real-time transcription output:

[Screenshot: Amazon Transcribe real-time transcription output]
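
Although this solution uses batch transcription (below), the streaming behavior shown above can be reproduced with the amazon-transcribe streaming SDK. A minimal sketch, assuming 16 kHz PCM audio chunks arriving from an async source:

import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class PrintHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Partial results are revised as more context arrives
        for result in transcript_event.transcript.results:
            for alt in result.alternatives:
                print(alt.transcript)


async def stream_audio(chunks):
    client = TranscribeStreamingClient(region="us-west-2")
    stream = await client.start_stream_transcription(
        language_code="zh-CN",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def write_chunks():
        async for chunk in chunks:  # chunks: an async iterator of raw PCM bytes
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = PrintHandler(stream.output_stream)
    await asyncio.gather(write_chunks(), handler.handle_events())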

In this solution, we use batch processing to transcribe the input speech. The code is as follows:

import json
import time

import boto3

s3 = boto3.client('s3')
transcribe = boto3.client('transcribe')


def transcribe_func_old(audio):
    # Gradio passes the recorded audio as a local file path
    file_name = audio
    print("audio_file: " + file_name)

    # Set up the job parameters
    job_name = "ai-bot-demo"
    text_output_bucket = 'ai-bot-text-material'  # this bucket is in us-west-1
    text_output_key = 'transcriptions/' + job_name + '.json'
    language_code = 'zh-CN'


    # Upload the file to an S3 bucket
    audio_input_bucket_name = "ai-bot-audio-material"
    audio_input_s3_key = "questions/tmp-question-from-huggingface.wav"
    
    s3.upload_file(file_name, audio_input_bucket_name, audio_input_s3_key)
    
    # Construct the S3 bucket URI
    s3_uri = f"s3://{audio_input_bucket_name}/{audio_input_s3_key}"


    # Transcription job names must be unique, so delete any previous job
    # that used the same name before starting a new one
    response = transcribe.list_transcription_jobs()
    for job in response['TranscriptionJobSummaries']:
        print(job['TranscriptionJobName'])
        if job['TranscriptionJobName'] == job_name:
            response = transcribe.delete_transcription_job(TranscriptionJobName=job_name)
            print("delete transcribe job response:" + str(response))


    # Create the transcription job
    response = transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': s3_uri},
        MediaFormat='wav',
        LanguageCode=language_code,
        OutputBucketName=text_output_bucket,
        OutputKey=text_output_key
    )
    
    print("start transcribe job response:"+str(response))
    job_name = response["TranscriptionJob"]["TranscriptionJobName"]
    
    # Wait for the transcription job to complete
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job_name)['TranscriptionJob']['TranscriptionJobStatus']
        if status in ['COMPLETED', 'FAILED']:
            break
        print("Transcription job still in progress...")
        time.sleep(1)
    
    # Get the URI of the transcript produced by the completed job
    transcript_uri = transcribe.get_transcription_job(TranscriptionJobName=job_name)['TranscriptionJob']['Transcript']['TranscriptFileUri']
    print("transcript uri: " + str(transcript_uri))
    
    transcript_file_content = s3.get_object(Bucket=text_output_bucket, Key=text_output_key)['Body'].read().decode('utf-8')
    print(transcript_file_content)
    json_data = json.loads(transcript_file_content)


    # Extract the transcript value
    transcript_text = json_data['results']['transcripts'][0]['transcript']
    return transcript_text


The code above performs the following steps:

  1. Upload the speech file to be processed to Amazon S3

  2. Create an Amazon Transcribe transcription job and poll its status

  3. Retrieve the transcription results of the completed job from Amazon S3

Response generation

In this solution, responses are generated with the open-source LangChain framework, calling a conversation chain backed by OpenAI and using a windowed memory that keeps the last 5 rounds of dialogue as context. In real customer scenarios, richer mechanisms can be used to control the validity and objectivity of the responses.

For example, you can use LangChain's dialogue templates to preset the role of the large model, or use knowledge bases and search services such as Amazon Kendra and Amazon OpenSearch to further constrain the model's responses. The relevant code is as follows:

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationChain
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferWindowMemory

# Keep the last 5 rounds of dialogue as conversational context
memory = ConversationBufferWindowMemory(k=5)
conversation = ConversationChain(
    llm=OpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()],
               max_tokens=2048, temperature=0.5),
    memory=memory,
)


Voice output

Amazon Polly turns text into lifelike speech. It supports multiple languages and offers a variety of realistic voices, including Mandarin Chinese ones.

You can build voice-enabled applications for a variety of locales and choose the voice that suits your customers. Amazon Polly also supports Speech Synthesis Markup Language (SSML), an XML-based W3C standard markup language for speech synthesis applications, with common SSML tags for sentence breaks, emphasis, and intonation. Amazon's custom SSML tags provide unique options, such as rendering certain text in a newscaster style. This flexibility helps you create lifelike voices that grab and hold your audience's attention.
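
As a brief illustration (not part of this solution's code), an SSML request in newscaster style might look like the sketch below. Note that the amazon:domain name="news" tag is only available for a few neural voices such as Joanna and Matthew, not for Zhiyu:

import boto3

polly = boto3.client('polly')

ssml = """
<speak>
  <amazon:domain name="news">
    Here is a quick update before we begin today's lesson.
    <break time="300ms"/> Please take your seats.
  </amazon:domain>
</speak>
"""

polly.synthesize_speech(
    Text=ssml,
    TextType='ssml',   # tell Polly the input is SSML rather than plain text
    OutputFormat='mp3',
    VoiceId='Joanna',  # the newscaster domain supports select neural voices only
    Engine='neural',
)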

In this solution, we use Polly's real-time speech synthesis interface with the Mandarin Chinese VoiceId Zhiyu, and customize the pronunciation of specific words via a lexicon, another very useful Polly feature (Lexicon).

import os
import sys
from contextlib import closing

import boto3

polly = boto3.client('polly')


def polly_text_to_audio(audio_file_name, text, audio_format):
    # Remove any stale audio file left over from a previous round
    if os.path.exists(audio_file_name):
        os.remove(audio_file_name)
        print("output audio file deleted successfully.")
    else:
        print("output audio file does not exist.")

    polly_response = polly.synthesize_speech(
        Text=text,
        OutputFormat=audio_format,
        SampleRate='16000',
        VoiceId='Zhiyu',
        LanguageCode='cmn-CN',
        Engine='neural',
        LexiconNames=['xxxxCN']  # custom lexicon for specific pronunciations
    )

    # Access the audio stream from the response
    if "AudioStream" in polly_response:
        # Note: closing the stream is important because the service throttles
        # on the number of parallel connections. contextlib.closing ensures the
        # stream's close method is called automatically at the end of the
        # with statement's scope.
        with closing(polly_response["AudioStream"]) as stream:
            try:
                # Open a file for writing the output as a binary stream
                with open(audio_file_name, "wb") as file:
                    file.write(stream.read())
            except IOError as error:
                # Could not write to file, exit gracefully
                print(error)
                sys.exit(-1)
    else:
        # The response didn't contain audio data, exit gracefully
        print("Could not stream audio")
        sys.exit(-1)
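
The custom lexicon referenced above (LexiconNames=['xxxxCN']) must already be registered in the account. A minimal sketch of registering one with boto3; the alias entry here is a hypothetical example:

import boto3

polly = boto3.client('polly')

# Hypothetical PLS lexicon: make Polly read "AWS" as its Chinese brand name
lexicon_xml = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="cmn-CN">
  <lexeme>
    <grapheme>AWS</grapheme>
    <alias>亚马逊云科技</alias>
  </lexeme>
</lexicon>"""

polly.put_lexicon(Name='xxxxCN', Content=lexicon_xml)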


2D digital human video generation

Here we use a third-party SaaS service provided by D-ID.com. Its API can take text input plus a face picture and generate a corresponding talking-head video; it can also accept an audio file plus a picture as input.

When you send text, the API can be told to use one of Amazon Polly's Voice IDs and will synthesize the speech for you automatically.
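
For reference, a text-script request of that kind (not used in this solution) might look roughly like the payload below, based on D-ID's documented text scripts; the exact field names should be checked against D-ID's current API reference:

# Hypothetical text-mode payload: D-ID synthesizes the speech itself
payload = {
    "script": {
        "type": "text",
        "input": "Hello, welcome to class!",
        "provider": {"type": "amazon", "voice_id": "Joanna"},
    },
    "source_url": avatar_url,
}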

In this solution we want Chinese voice output, but D-ID's API cannot directly specify a Chinese Voice ID for Chinese text. So we chose to generate the audio first with Polly's API, and then send the audio and the picture to D-ID's interface to generate the video. The code is as follows:

import os
import time

import requests

did_api_key = os.environ.get('DID_API_KEY')  # base64-encoded D-ID API key
avatar_url = "https://example.com/avatar.jpeg"  # placeholder: public URL of the avatar image


def generate_talk_with_audio(input, avatar_url, api_key=did_api_key):
    url = "https://api.d-id.com/talks"
    payload = {
        "script": {
            "type": "audio",
            "audio_url": input
        },
        "config": {
            "auto_match": "true",
            "result_format": "mp4"
        },
        "source_url": avatar_url
    }
    headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "authorization": "Basic " + api_key
    }


    response = requests.post(url, json=payload, headers=headers)
    return response.json()




def get_a_talk(id, api_key = os.environ.get('DID_API_KEY')):
    url = "https://api.d-id.com/talks/" + id
    headers = {
        "accept": "application/json",
        "authorization": "Basic "+api_key
    }
    response = requests.get(url, headers=headers)
    return response.json()


def get_mp4_video(input, avatar_url=avatar_url):
    # Create the talk, then poll for up to 30 seconds until the video is ready
    response = generate_talk_with_audio(input=input, avatar_url=avatar_url)
    print("DID response: " + str(response))
    talk = get_a_talk(response['id'])
    video_url = ""
    index = 0
    while index < 30:
        index += 1
        if 'result_url' in talk:
            video_url = talk['result_url']
            return video_url
        else:
            time.sleep(1)
            talk = get_a_talk(response['id'])
    return video_url
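
One practical detail: the audio_url passed to D-ID must be publicly reachable, so the local Polly output has to be published somewhere first. A minimal sketch (an assumption, not shown in the original) using an S3 presigned URL; the bucket and key names are placeholders:

import boto3

s3 = boto3.client('s3')

def publish_audio(local_path, bucket, key, expires=3600):
    # Upload the Polly output and return a presigned URL that D-ID can fetch
    s3.upload_file(local_path, bucket, key)
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expires,
    )

# audio_url = publish_audio('answer.mp3', 'ai-bot-audio-material', 'answers/answer.mp3')
# video_url = get_mp4_video(input=audio_url)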


In practical applications, if you want Amazon Polly to convert text to speech in near real time, you can process the large language model's streaming output on the fly. Sample code:

response = generate_response(prompt)  # streaming completion from the LLM
# create variables to collect the stream of events
collected_events = []
completion_text = ''
sentence_to_polly = ''
separators = ['?', '。', ',', '!']
already_polly_processed = ''
# iterate through the stream of events
for event in response:
    collected_events.append(event)  # save the event response
    event_text = event['choices'][0]['text']  # extract the text
    if event_text in separators:
        # synthesize only the text that has not been sent to Polly yet
        sentence_to_polly = completion_text[len(already_polly_processed):]
        # response_audio_filename: path of the audio file played by the front end
        polly_text_to_audio(response_audio_filename, sentence_to_polly, 'mp3')
        already_polly_processed = completion_text
    completion_text += event_text  # append the text
    print(event_text, end='', flush=True)  # print the text as it streams in


The code above takes advantage of Amazon Polly's real-time processing capability. As the streamed text arrives, whenever one of the separators '?', '。', ',', '!' is found, it immediately calls Amazon Polly to synthesize the newest text segment, which is then appended to the end of the audio being played. By the time the text stream has been fully received, the speech conversion is essentially complete.

This solution is currently hosted in a Space on Hugging Face: https://huggingface.co/spaces/xdstone1/ai-bot-demo. The corresponding code can also be obtained as follows:

git lfs install
git clone https://huggingface.co/spaces/xdstone1/ai-bot-demo


A demonstration video of the solution can be watched at the following link:

https://d3g7d7eldf7i0r.cloudfront.net/

Summary

This year has seen an explosion of AIGC, and customers in the education industry see an industry turning point. At this critical historical juncture, Amazon Cloud Technology is ready to face these new opportunities and challenges together with its customers, guided by customer needs, to help them seize the dividends of the AI wave.

At present, in addition to the 2D digital human solution shown in this article, Amazon Cloud Technology can also provide solutions such as live streaming and interaction based on 3D digital humans or other 3D avatars. We will also bring in more technology partners, such as Warp Engine (https://www.warpengine.cc/), to enrich the solutions for digital human and avatar live-streaming, on-demand, and interactive scenarios, helping more education-industry customers accelerate the adoption and implementation of AI technology.

About the author

[Photo: Xue Dong]

Xue Dong

Solutions architect at Amazon Cloud Technology, responsible for solution consulting and design on the Amazon cloud platform, currently serving education-industry customers in Amazon Cloud Technology's Greater China region. Focused on technical areas such as serverless and security.



Source: blog.csdn.net/u012365585/article/details/132400311