GitHub Stars shot past 10,000 in no time: Meta open-sources a large speech model that recognizes over 4,000 spoken languages and generates speech in more than 1,000!


Compiled by | Tu Min

Produced by | CSDN (ID: CSDNnews)

Taking a different path from OpenAI and Google, Meta has been pushing ever deeper into open-source large models.

Today, Meta once again open-sourced a new AI speech model on GitHub: Massively Multilingual Speech (MMS). Unlike ChatGPT, this model can recognize more than 4,000 spoken languages and generate speech (text-to-speech) in over 1,100 of them.

Within just a few hours of going live, the GitHub repository had gained 23k Stars and 5.5k Forks.


GitHub address: https://github.com/facebookresearch/fairseq/tree/main/examples/mms


Motivation

Regarding the development of the MMS model, Meta stated that "equipping devices with the ability to recognize and generate speech can make information accessible to more people."

However, while there are more than 7,000 known languages in the world, existing AI speech recognition models cover only about 100 of them.

Meanwhile, speech recognition and text-to-speech models typically require training on thousands of hours of audio, which simply doesn't exist for most languages. And even as technology advances, many of the world's languages are at risk of disappearing within the next few decades.

To help preserve the world's linguistic diversity and to contribute something to languages on the verge of extinction, the Meta research team developed and open-sourced MMS: "We are sharing our models and code publicly so that others in the research community can build upon our work."

Of course, collecting audio data in thousands of languages was the first difficulty to overcome in developing this large model.

For MMS, Meta took an unconventional approach to collecting audio data: using religious texts such as the Bible.

Meta explains: "We use religious texts because these texts have been translated into many different languages and their translations have been extensively studied for text-based language translation research. There are public recordings documenting people reading these texts in different languages."

As part of the MMS project, Meta created datasets covering more than 1,100 languages, with an average of 32 hours of data per language.

Additionally, by combining unlabeled audio recordings of the Bible and similar texts, Meta researchers increased the number of languages the model supports to more than 4,000.


Single Speech Model Supporting Thousands of Languages

Of course, given this data source, many people assumed the resulting AI model would be biased toward religion. Meta says that is not the case.

Meta wrote in the announcement: "While the content of the recordings was religious, our analysis showed that this did not bias the model unduly towards producing more religious language. We believe this is because we used a connectionist temporal classification (CTC) approach, which is much more constrained for speech recognition than large language models (LLMs) or sequence-to-sequence models."

To train the model, Meta combined this data with its own self-supervised speech representation learning model, wav2vec 2.0, which can be trained on unlabeled data. Combining unconventional data sources with self-supervised speech models produced promising results.
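For readers who want to try the recognition side, here is a minimal inference sketch, assuming the MMS checkpoints Meta later published on Hugging Face; the checkpoint name facebook/mms-1b-all and the 16 kHz placeholder input are illustrative assumptions, not details taken from this article:

```python
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # assumed MMS ASR checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# 16 kHz mono audio as a float array; a one-second placeholder of silence here.
audio = torch.zeros(16000).numpy()

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, time, vocab)

# CTC decoding: greedy argmax per frame, then collapse repeats and blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```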

According to the official test data, Meta trained a multilingual speech recognition model covering more than 1,100 languages on top of a 1B-parameter wav2vec 2.0 model and compared it with existing models. Performance does drop as the number of languages increases, but only slightly: going from 61 to 1,107 languages, the character error rate increases by only about 0.4%, while language coverage grows more than 18-fold.
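To make the metric being quoted concrete: character error rate (CER) is the character-level edit distance between the reference transcript and the model output, normalized by the reference length. A minimal sketch (not Meta's evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(cer("hello world", "helo world"))  # 1 edit / 11 chars ≈ 0.09
```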


Compared with OpenAI's Whisper, the model trained on Massively Multilingual Speech data achieved half the word error rate while covering 11 times as many languages.


There is no doubt that the arrival of the open-source MMS model not only expands the language coverage of text-to-speech but also greatly improves accuracy.
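On the text-to-speech side, a minimal usage sketch, assuming the per-language MMS-TTS checkpoints published on Hugging Face; the VITS-based loading code and the checkpoint name facebook/mms-tts-eng are illustrative assumptions:

```python
import torch
from transformers import VitsModel, AutoTokenizer

model_id = "facebook/mms-tts-eng"  # assumed English MMS-TTS checkpoint
model = VitsModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Massively multilingual speech", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # (batch, samples)

print(waveform.shape, model.config.sampling_rate)
```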


Limitations

However, Meta cautions that its new model isn't perfect. "For example, there is a risk that the speech-to-text model may mistranscribe selected words or phrases," the company wrote.

"Depending on the output, this could lead to offensive or inaccurate language. We continue to believe that collaboration across the AI ​​community is critical to the responsible development of AI technologies."

As for the future of large speech models, Meta has laid out a vision: a single model that handles multiple speech tasks across all languages. "While we trained separate models for speech recognition, speech synthesis, and language identification, we believe that in the future, one model will be able to do all of these tasks and more, leading to better overall performance," Meta said.

Of course, we also hope that this day will come soon.

More content can be found in the MMS paper: https://scontent-lcy1-1.xx.fbcdn.net/v/t39.8562-6/348836647_265923086001014_6878005808275791319_n.pdf

 GitHub address: https://github.com/facebookresearch/fairseq/tree/main/examples/mms

 Announcement: https://ai.facebook.com/blog/multilingual-model-speech-recognition/

