The "voice" never stops: a new "voice" reports for duty

Charming Voice Family Group (10)


@Da Ling, I just met the new Sichuan baby, Cherry. So cute!


No way, a girl like this is a rare find. I'm pulling her into our big group right now.


"Da Ling" invites "Cherry" to join the group chat

@Everyone Our big family is welcoming its first little one. Everybody, please welcome @Cherry


Hello, brothers and sisters! I am Cherry from Chengdu, Sichuan, and I am 7 years old. I am very happy to be family with you!


Welcome, cute little Cherry! If you are interested, you can come to the recording studio and play with us.


Yes, please! Cherry has just learned a nursery rhyme and will sing it for her big sister in a little while.


Our kid is multi-talented. No need to wait for the recording studio; let her sing you a song first.


Tintin the cat in a red dress

Miss Gao, a matchmaker

pockmarked empress drummer

Banzhu Yaya lifts up and walks away

Car lifter, lift slowly

look trip the girl

Embroidered shoes worn by girls...



How about that? Full of hot-pot flavor, isn't it?

The Sichuan-dialect voice Cherry, launched by Xiaoai in December last year, is the industry's first dialect voice built from real children's recordings. We also applied "super anthropomorphic technology" to this voice, reproducing a variety of modal-particle expressions so that Xiaoai's replies sound more human and natural.

As the only child in the "Voice Family", Cherry has a voice that is babyish yet bossy and cute. Who made such authentic Sichuan dialect? Let's get to the point and have the engineers on the technical team walk us through the implementation~

01

"Voice" that wins hearts: where does the authentic hometown accent come from?

To let users enjoy voices with different timbres, the Xiaoai Sound Store keeps launching new voices for users to choose from. The store currently offers 10 voices in 6 series: dialect, girlfriend, translation, ancient style, IP, and technology.


As a major art form of local culture, dialect is both the soil that nourishes folk culture and a living fossil passed on by word of mouth. For this reason, we developed the "dialect voice" series, which currently covers Northeast dialect and Sichuan dialect. Launching the dialect series was a carefully considered decision for the voice team.

On the one hand, the wide variation in dialect pronunciation makes accurate dialect synthesis very challenging. Modern Standard Chinese and the dialects differ in pronunciation, vocabulary, and grammar, and the differences in pronunciation are the most prominent. Unlike Mandarin, which has a unified national standard, dialects have no unified pronunciation. Not only are there many dialects (the eight with the most speakers are Northeast dialect, Beijing dialect, Jilu dialect, Jiaoliao dialect, Central Plains dialect, Lanyin dialect, Jianghuai dialect, and Southwest Mandarin), but each dialect is also not unified internally: it splits into several sub-dialects and many local varieties.

On the other hand, under the influence of Mandarin, some words in some dialects are pronounced differently by different age groups, which makes it even harder for dialect synthesis to sound "authentic". After all, a dialect voice only works if it is "authentic and colloquial", and this is one of the key criteria we use to judge whether a voice is up to standard.


To protect this "culture" and help spread the local culture that dialects carry, Xiaomi engineers keep exploring and taking on harder voice technologies.

02 

A heavenly "child's voice": a look at the hard-core technology

The recently added Sichuan-dialect child voice has been widely praised by users since its launch, and seems to have won over all the brothers, sisters, uncles, and aunts!

At present, most dialect voice libraries on the market are adult female and male voices, and children's voices are scarce. Constrained by many factors, the Sichuan child voice Cherry is the most difficult voice in the current dialect series.


On the one hand, Sichuan dialect has a different phoneme set from Mandarin and is pronounced differently. Technically, directly training a voice model generally requires more than 3,000 recordings, but we had a corpus of only a little over 500 sentences and no large body of children's Sichuan dialect data on which to build a base model. Training a Sichuan dialect voice model directly was therefore not feasible, which made the implementation difficult.

To overcome the difficulty of dialect synthesis on small-scale data, the engineers proposed a transfer-learning approach that splits the synthesis of a child's dialect voice into two stages. First, low-resource (cross-lingual) dialect synthesis technology is used to obtain a model of the child's timbre; then, on the basis of the 500-plus-sentence corpus, a base model dedicated to Sichuan dialect is trained iteratively.

Because Mandarin and dialect pronunciation differ, it is hard to use a Mandarin base model directly. Instead, the more easily obtained adult dialect pronunciation model is used for timbre transfer: its pronunciation categories match better and it adapts more easily in training. This step augments the child voice data, which is then used to train a model of the child's speaking rate and style. When training data is reconstructed and augmented in this way, a small amount of TTS background noise may be introduced; a noise-reduction coder can be used to reduce the noise in the synthesized speech as appropriate.
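The two-stage idea described above (pretrain where data is plentiful, then adapt on the small corpus) can be illustrated with a toy sketch. This is not the production TTS pipeline: the 1-D linear model, the data, and names such as `train`, `source`, and `target` are all hypothetical stand-ins chosen only to show why starting from a related base model helps when the target data is tiny.

```python
# Toy illustration of two-stage transfer learning (NOT the real TTS system).
# The model, data, and hyperparameters are hypothetical stand-ins.

def mse(params, data):
    """Mean squared error of the linear model y = w*x + b on a dataset."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.05):
    """Plain batch gradient descent on a 1-D linear model y = w*x + b."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Stage 1 data: a large, easy-to-obtain "source" corpus
# (stand-in for adult dialect speech).
source = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(-300, 300)]

# Stage 2 data: a tiny "target" corpus from a related but different task
# (stand-in for the 500-plus child sentences; only 5 points here).
target = [(x, 2.2 * x + 1.3) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

base = train((0.0, 0.0), source, steps=500)         # pretrain on the large corpus
fine_tuned = train(base, target, steps=20)          # adapt briefly on the small corpus
from_scratch = train((0.0, 0.0), target, steps=20)  # same budget, no pretraining

loss_fine_tuned = mse(fine_tuned, target)
loss_from_scratch = mse(from_scratch, target)
print(f"fine-tuned loss:   {loss_fine_tuned:.4f}")
print(f"from-scratch loss: {loss_from_scratch:.4f}")
```

With the same small training budget, the fine-tuned model ends up much closer to the target than the one trained from scratch, which is the intuition behind reusing an adult dialect model rather than starting from nothing.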


On the other hand, dialect recording is difficult. A child is still at the stage of learning and imitating, so her recordings were collected by having an adult read each sentence first for her to repeat, which greatly limited the number of recordings and made recording harder. In the end, just over 500 sentences were recorded. Although the amount of data can be expanded through technical means, the original recordings still have the greatest impact on synthesis quality. Moreover, a child's pronunciation is less stable, her speech is slower, and her pitch is higher. The same word is pronounced very differently from sentence to sentence, and the expression is very free, so modeling becomes correspondingly harder.

To further speed up inference, knowledge distillation is introduced: the knowledge of the teacher model is transferred to a student model, which runs faster with almost no loss in sound quality. The overall process of model training is shown in the figure below:

f313e87348bc4b4d3681981e7816e8ff.png
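As a rough illustration of the distillation step mentioned above, here is a minimal sketch in which a small student learns to imitate a larger teacher. Everything in it is an assumption made for illustration: the article does not describe the real architectures, so the "teacher" is just an ensemble of three linear predictors and the "student" a single one.

```python
# Toy knowledge distillation: a small student imitates a larger teacher.
# Hypothetical stand-ins only; the real models are not described in the article.

def teacher(x):
    """'Large' model: an ensemble of three linear predictors (three evaluations per input)."""
    members = [(1.8, 0.9), (2.0, 1.0), (2.2, 1.1)]
    return sum(w * x + b for w, b in members) / len(members)

def distill(inputs, steps=300, lr=0.1):
    """Fit a single linear student to the teacher's outputs by gradient descent."""
    w, b = 0.0, 0.0
    soft_targets = [teacher(x) for x in inputs]  # teacher outputs act as training labels
    n = len(inputs)
    for _ in range(steps):
        errors = [w * x + b - t for x, t in zip(inputs, soft_targets)]
        w -= lr * sum(2 * e * x for e, x in zip(errors, inputs)) / n
        b -= lr * sum(2 * e for e in errors) / n
    return w, b

# Distillation only needs inputs; the teacher supplies the targets.
unlabeled_inputs = [i / 10 for i in range(-20, 21)]
student_w, student_b = distill(unlabeled_inputs)

def student(x):
    """'Small' model: one evaluation per input instead of three."""
    return student_w * x + student_b

max_gap = max(abs(student(x) - teacher(x)) for x in unlabeled_inputs)
print(f"largest student/teacher gap: {max_gap:.2e}")
```

In this toy case the student can match the teacher almost exactly while doing a third of the work per input. A real speech model will not compress this cleanly, but the trade-off is the same one the paragraph describes: faster inference for a small loss in quality.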

After obtaining the model's initial synthesized voice, the engineers applied "super anthropomorphic technology" on top of it to further improve naturalness. As a result, the synthesized voice (Cherry's voice) sounds more like a real person in its intonation, pauses, and changes in speaking rate. It also reproduces a variety of modal particles, which softens the stiffness of a mechanical electronic voice and makes it sound more natural.

With the support of AI technology, speech synthesis is no longer a static voice pack but a dynamic voice assistant with a "brain", more human and more realistic. Synthesizing a child's dialect voice is a breakthrough in intelligent voice technology and brings users a better interactive experience.

03

"Voice" to the ear, technology back to basics

A voice is a special medium. Su Shi wrote "hometown accent without accompaniment and hard thinking", and later Gao Qi wrote "hometown accent is true return to the ear". A voice can carry an intimacy that cannot be seen or touched, just like "the local people do not know the local accent", conveying emotion to the listener.

In the future, Xiaomi engineers will continue their research and development, covering more dialect voices and focusing on "authentic and colloquial" voice technology to give users the best possible experience in dialect mode. The Xiaoai Sound Store will also keep launching new voices so that more people can experience the charm of dialects. What will the next voice sound like? Stay tuned!

Finally, Cherry invites you to come to Sichuan to have fun!


Origin blog.csdn.net/pengzhouzhou/article/details/129700679