Meta AI's large speech translation model Seamless debuts, focusing on seamless AI simultaneous interpretation

Paper title: Seamless: Multilingual Expressive and Streaming Speech Translation
Paper link: https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/
Code link: GitHub - facebookresearch/seamless_communication: Foundational Models for State-of-the-Art Speech and Text Translation
Project page: https://ai.meta.com/research/seamless-communication/

Since the start of this year, the large language models and vision-language multimodal models represented by ChatGPT and GPT-4 have led the development of artificial intelligence, and domain-specific large models have also emerged in vertical fields such as finance, transportation, and remote sensing. Among the three basic input modalities, the importance of the speech signal to AI is self-evident. Recently, Meta AI's research team released Seamless ("seamless communication"), a full-pipeline speech translation model. Seamless focuses on smooth and efficient multilingual translation: building on a traditional translation system, it quickly models the user's speaking style so that the translated speech fully retains key information such as the user's tone, pauses, and emphasis, helping us better convey emotion and intent. It should be noted that Seamless is composed of three base models:

(1) SeamlessExpressive: a model designed to preserve expressiveness and nuance across languages, currently supporting English, Spanish, German, French, Italian, and Chinese.

(2) SeamlessStreaming: an efficient streaming translation model that performs speech and text translation with roughly two seconds of latency.

(3) SeamlessM4T v2: an upgraded version of the SeamlessM4T released by Meta in August this year. It is a foundational multilingual, multitask model trained on nearly 4.5 million hours of speech data, and it improves performance on a variety of baseline tasks such as automatic speech recognition, speech-to-speech, speech-to-text, and text-to-speech translation.

Seamless has attracted widespread attention since its release. LeCun, Meta's chief artificial intelligence scientist, immediately promoted it.

In addition, open-source heavyweight Georgi Gerganov has begun rewriting Seamless in C++ and accelerating its inference. Georgi Gerganov previously developed C++ versions of star models such as Meta's LLaMA and OpenAI's Whisper; among them, llama.cpp has exceeded 65,000 stars on GitHub.

01. Multitask base model SeamlessM4T v2

The multitask pre-training paradigm can be considered the underlying technology of the GPT-series models, and Seamless, as a unified system for speech translation, draws on the same construction logic. SeamlessM4T was pre-trained at large scale across a wide range of languages and speech translation tasks. When building SeamlessM4T v2, the author team focused on upgrading its multitask prediction unit, UnitY. SeamlessM4T v2 splits speech translation into two stages: speech-to-text translation (S2TT) and text-to-unit conversion (T2U). Because the previous UnitY hallucinated when the lengths of speech sequences and text sequences mismatched, the authors propose a new two-stage unit, UnitY2. UnitY2 adopts a non-autoregressive (NAR) unit decoder architecture that models discrete units better. The overall architecture of the SeamlessM4T v2 model built on the UnitY2 prediction unit is shown in the figure below.

[Figure: overall architecture of SeamlessM4T v2 built on the UnitY2 prediction unit]

The UnitY2 update improves SeamlessM4T v2's translation quality across tasks. SeamlessM4T v2 currently achieves SOTA performance on speech-to-speech and speech-to-text translation in 100 languages.
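As a minimal sketch of the two-stage idea above — translate to text first, then expand the text into discrete speech units — the toy code below contrasts autoregressive decoding with the duration-based non-autoregressive style that UnitY2 adopts. The tiny `predict` callbacks are illustrative assumptions, not Meta's implementation.

```python
# Toy contrast between autoregressive and non-autoregressive (NAR) unit
# decoding. The real UnitY2 uses learned duration and unit models over
# discrete acoustic units; the callbacks here are stand-ins.

def ar_decode(text_tokens, predict_next, frames_per_token=2):
    """Autoregressive: each unit is generated conditioned on all
    previously emitted units, one sequential step at a time."""
    units = []
    for _ in range(len(text_tokens) * frames_per_token):
        units.append(predict_next(units))
    return units

def nar_decode(text_tokens, predict_duration, predict_unit):
    """Non-autoregressive: first predict an explicit duration per text
    token, upsample the text to the target length, then predict all
    units independently (parallelizable). Modeling length explicitly
    sidesteps the speech/text length mismatch that made the earlier
    UnitY prone to hallucination."""
    expanded = []
    for tok in text_tokens:
        expanded.extend([tok] * predict_duration(tok))  # length model
    return [predict_unit(tok, i) for i, tok in enumerate(expanded)]  # parallel
```

With `predict_duration=len`, the two-token input `["he", "llo"]` is upsampled to five frames before any unit is emitted, so the output length is fixed up front rather than decided step by step.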

02. SeamlessExpressive tackles the challenge of preserving intonation in translation

Prosody plays an important role in human communication: it conveys the speaker's emotional state, attitude, and intention. However, this important factor was ignored in previous speech translation models and systems. We typically use changes in pitch (high or low), loudness (strong or weak), and duration (fast or slow) to express our true intent in different situations. SeamlessExpressive accurately captures information such as the speaker's speaking rate and pauses while retaining the semantic content, and reproduces it in the target language.

The following figure shows the overall framework of SeamlessExpressive. In terms of implementation, SeamlessExpressive is built mainly on the SeamlessM4T v2 model, inheriting its high-quality semantic translation capability. The author team proposed a prosody-aware unit, Prosody UnitY2, based on UnitY2, along with a text-free acoustic model, PRETSSEL; the two complement each other in conveying the expressiveness of the source speech. Specifically, Prosody UnitY2 focuses on phrase-level prosody such as speech rate and pauses, while PRETSSEL focuses on translating utterance-level expressiveness such as overall vocal style.

[Figure: overall framework of SeamlessExpressive]

To achieve prosody alignment across multiple languages, the authors constructed a large-scale prosody- and speech-aligned dataset through data curation, automatic alignment, and synthesis, supporting six languages: English, French, German, Italian, Mandarin, and Spanish.
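As a rough, self-contained illustration of the phrase-level cues Prosody UnitY2 attends to, the sketch below finds pauses in a sequence of per-frame energies. The threshold and minimum pause length are assumed values for illustration, not part of the Seamless pipeline.

```python
# Toy pause detector over per-frame energies, illustrating one of the
# phrase-level prosody cues (pauses, speech rate) that a prosody-aware
# translator would capture and reproduce in the target speech.

def pause_spans(frame_energy, threshold=0.1, min_frames=3):
    """Return (start, end) frame-index pairs for pauses: runs of at
    least `min_frames` consecutive frames with energy below `threshold`."""
    spans, start = [], None
    for i, e in enumerate(frame_energy):
        if e < threshold:
            if start is None:
                start = i          # a low-energy run begins
        else:
            if start is not None and i - start >= min_frames:
                spans.append((start, i))
            start = None
    if start is not None and len(frame_energy) - start >= min_frames:
        spans.append((start, len(frame_energy)))  # pause runs to the end
    return spans
```

For the energy sequence `[0.5]*5 + [0.0]*4 + [0.6]*3` it reports a single pause spanning frames 5 to 9, which a prosody-preserving system could reproduce as a matching pause in the translated speech.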

03. Simultaneous interpretation: SeamlessStreaming

In international conferences, simultaneous interpretation is a critical task. Human interpreters must quickly understand the speaker's meaning and, drawing on their own experience and knowledge, find an appropriate balance between low latency and accurate translation, while also attending to signals such as the speaker's intonation, pauses, and attitude. Overall, the task is very difficult, yet SeamlessStreaming handles the key aspects of simultaneous interpretation listed above remarkably well.

Compared with traditional translation systems, SeamlessStreaming does not wait for the speaker to finish before translating; it translates at nearly the same pace as the speaker, achieving an effect close to real-time translation. SeamlessStreaming currently supports automatic speech recognition and speech-to-text translation for nearly 100 input and output languages.

SeamlessStreaming is initialized directly from SeamlessM4T v2 and inherits its multitask real-time translation capability. SeamlessStreaming's efficient streaming inference comes mainly from EMMA (Efficient Monotonic Multihead Attention), a new multi-head attention module proposed by the research team. EMMA is a monotonic attention approach in which each attention head implements a separate synchronization policy. This lets the model intelligently decide whether the current state holds enough information to generate the next speech segment or target text, which is critical for low-latency speech translation, especially on long input sequences.
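The read/write behavior described above can be sketched as follows. The per-head confidence function and thresholds here are stand-in assumptions, since the real EMMA learns its monotonic alignment policies end to end.

```python
# Toy sketch of the read/write decision behind monotonic attention in the
# spirit of EMMA: each head keeps its own synchronization policy, and the
# decoder only writes output once its heads judge that the source prefix
# consumed so far carries enough information.

def streaming_translate(source_stream, translate_prefix, head_thresholds,
                        confidence):
    """Interleave READ (consume one source token) and WRITE (emit a
    target token) actions. A WRITE happens when every head's confidence
    in the current source prefix meets that head's threshold."""
    consumed, outputs = [], []
    for token in source_stream:
        consumed.append(token)                           # READ
        if all(confidence(h, consumed) >= t
               for h, t in enumerate(head_thresholds)):
            outputs.append(translate_prefix(consumed, len(outputs)))  # WRITE
    return outputs
```

With two heads, thresholds `[2, 2]`, and `confidence(h, prefix) = len(prefix) - h`, the policy reads a couple of tokens before its first write — a wait-k-like schedule that falls out of the per-head thresholds rather than being hard-coded.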

04. Audio watermarking technology

Although today's large models can help us in production and daily life, it is equally important to take measures to prevent these technologies from being abused in harmful scenarios. The Meta AI research team therefore developed an audio watermarking technique for Seamless. The watermark is based on signals imperceptible to the human ear, yet a detector model can still detect it in the audio signal.

In addition to authenticating the provenance of generated audio, the Seamless watermark also resists attacks. For example, an adversary may add noise or echo, or filter certain frequencies, to modify the audio and dilute the watermark in order to bypass detection. The Seamless watermark is robust to a variety of such attacks and supports frame-accurate localization within audio clips. The authors also note that the watermark model is very cheap to run and can be fine-tuned separately without affecting the translation quality of SeamlessExpressive and SeamlessStreaming.
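A minimal sketch of the underlying idea — an inaudible additive pattern recovered by correlation — is shown below. The fixed pseudo-random pattern, amplitude, and threshold are assumptions chosen only to illustrate the concept; Seamless itself uses a learned embedder/detector pair.

```python
# Toy watermark: add a low-amplitude pseudo-random pattern the ear cannot
# perceive, then detect it by correlating against the known pattern.
import random

def make_pattern(length, seed=42):
    """Deterministic +/-1 pseudo-random pattern shared by embedder/detector."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(audio, pattern, amplitude=0.01):
    """Mix the pattern into the audio at an inaudible amplitude."""
    return [a + amplitude * p for a, p in zip(audio, pattern)]

def detect(audio, pattern, threshold=0.005):
    """Mean correlation with the pattern: a genuine watermark scores near
    the embedding amplitude, far above chance alignment."""
    score = sum(a * p for a, p in zip(audio, pattern)) / len(audio)
    return score > threshold
```

Embedding into 1,000 silent samples yields a correlation score of about 0.01, well above the detection threshold, while unmarked audio scores near zero; moderate added noise shrinks but does not erase that margin, which is the intuition behind the watermark's attack robustness.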

05. Summary

The Seamless model released by Meta AI demonstrates impressive simultaneous interpretation and supports nearly 100 languages. Among its components, the multitask base model SeamlessM4T v2 achieves SOTA performance on multiple speech benchmarks; SeamlessExpressive preserves the speaker's rhythm and vocal style during translation; and the efficient multi-head attention EMMA in SeamlessStreaming enables parallel low-latency translation without waiting for the current utterance to end. As a next-generation large speech model, the Seamless series demonstrates end-to-end multilingual, expressive, and low-latency streaming translation, marking a new breakthrough for artificial intelligence in the field of speech translation.

Origin blog.csdn.net/hanseywho/article/details/135051134