Number one! Xiaomi’s self-developed audio algorithm has made important progress

Recently, good news came from Xiaomi!

Xiaomi's self-developed sound recognition algorithm has made important progress on the audio tagging task. Trained on audio from the public AudioSet-2M dataset, Xiaomi's audio tagging model is the first in the industry to break through 50 mAP, the best result reported in any audio tagging paper to date.

By pushing the AudioSet mAP metric into the 50+ era, this result shows that Xiaomi's sound recognition algorithm now ranks first internationally in performance.


Audio tagging algorithms can identify a wide range of sounds, helping to express sounds in the environment in other modalities such as text, so that sounds can be "seen".

This important progress in Xiaomi's self-developed sound recognition algorithm can bring users a more efficient and accurate sound recognition experience across a rich range of device scenarios, including Xiaomi phones, Xiaomi speakers, Xiaomi bands and watches, and CyberOne/CyberDog.

01

Topping the "ImageNet of sound"

Setting a new record for audio tagging

The goal of the audio tagging task is multi-label classification of audio, so that computers can understand audio content. It can be applied to a wide range of scenarios such as audio search, dangerous-event detection, machine failure monitoring, and accessibility. For example, for visually impaired users, audio tagging can help identify and understand sounds in the surrounding environment, compensating for limited visual ability.
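In code, "multi-label" means each clip gets an independent per-class score rather than a single class. Below is a minimal, illustrative sketch of such a tagging head in PyTorch; only the 527-class label space comes from AudioSet, everything else (names, dimensions) is assumed for illustration:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 527  # size of the AudioSet label vocabulary

class ToyTaggingHead(nn.Module):
    """Maps a pooled audio embedding to independent per-class scores."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, NUM_CLASSES)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # Sigmoid (not softmax): several sound events may be present at once.
        return torch.sigmoid(self.classifier(embedding))

head = ToyTaggingHead()
embedding = torch.randn(4, 768)           # a batch of 4 clip embeddings
targets = torch.zeros(4, NUM_CLASSES)
targets[0, [0, 72]] = 1.0                 # e.g. two sound events active in one clip
probs = head(embedding)
loss = nn.functional.binary_cross_entropy(probs, targets)
```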


Audio Tagging task (Image source: DCASE Challenge official website)

In addition, Large Language Models (LLMs) have recently received widespread attention, and academia and industry are actively pursuing research on Large Audio Models (LAMs). Almost all known results use a pre-trained audio tagging model as the audio encoder, providing the key audio information extraction capability for the whole large model. Audio tagging therefore not only has broad application value, but also lays a technical foundation for the development of future large audio models.

AudioSet, released by Google, is the most influential dataset for audio tagging and is considered the ImageNet of the sound field (ImageNet, released under the leadership of the famous scholar Fei-Fei Li and colleagues, is the best-known dataset in computer vision). AudioSet consists of more than 2 million 10-second audio clips from YouTube, roughly 5,800 hours in total, annotated by human listeners with a hierarchy of 527 category labels.

Google divides AudioSet into three subsets: balanced train, unbalanced train, and evaluation. The first two are used for training and the last for testing. The balanced train subset is commonly called "AudioSet-20K", and the combination of balanced and unbalanced train is called "AudioSet-2M". When research institutions publish results, they usually report numbers obtained by training on AudioSet-2M and testing on the evaluation subset, so that algorithm performance can be compared fairly across institutions. Mean Average Precision (mAP) is the most common metric for evaluating sound recognition performance: Average Precision (AP) is the area under the precision-recall curve for a given class, and mAP is the mean AP over all classes; larger is better.
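As a concrete illustration of the metric, the sketch below computes per-class AP and then the macro average with scikit-learn on random toy data (the data and class counts here are made up; only the definition of mAP follows the text above):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
num_clips, num_classes = 100, 527
y_true = (rng.random((num_clips, num_classes)) > 0.98).astype(int)  # sparse multi-hot labels
y_score = rng.random((num_clips, num_classes))                      # model probabilities

# Skip classes with no positive example in this toy set (AP is undefined for them).
valid = y_true.sum(axis=0) > 0
per_class_ap = average_precision_score(y_true[:, valid], y_score[:, valid], average=None)
mAP = per_class_ap.mean()
print(f"mAP = {mAP:.3f}")
```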

Trained on AudioSet-2M, Xiaomi's model is the first in the industry to break through 50 mAP, setting a new record for audio tagging and becoming the best-performing model to date. In addition, Xiaomi released a mini version of the model for resource-constrained scenarios. Its parameter count is compressed to about one-ninth of the original model, far smaller than other institutions' models (see the leaderboard for details), yet it still outperforms the models of all other institutions.

(Leaderboard: mAP scores and parameter counts of audio tagging models from various institutions)

02

The technical solution explained

More efficient knowledge distillation

The key to Xiaomi pushing the AudioSet mAP metric to 50 this time lies in: 1. using a larger model; 2. creating more data. (*"Creating data" here refers to data augmentation, i.e., generating more training data through means such as adding random noise, without using any additional dataset, so the horizontal comparison remains fair.)
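The sketch below shows the kind of on-the-fly waveform augmentation this refers to; the exact augmentations in Xiaomi's recipe are not specified here, so random gain and additive noise are only illustrative examples:

```python
import torch

def augment_waveform(wave: torch.Tensor, noise_std: float = 0.005) -> torch.Tensor:
    """Return a randomly perturbed copy of a mono waveform tensor."""
    gain = 10 ** (torch.empty(1).uniform_(-6.0, 6.0) / 20)   # random gain in [-6, +6] dB
    noisy = wave * gain + noise_std * torch.randn_like(wave)
    return noisy.clamp(-1.0, 1.0)

clip = torch.randn(16000 * 10) * 0.1   # a fake 10-second clip at 16 kHz
augmented = augment_waveform(clip)      # a "new" training sample derived from the same clip
```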

It may seem obvious that larger models and more data improve algorithm performance. However, a model that is too large is hard to deploy and slow at inference, and thus loses practical value. As the leaderboard shows, although Xiaomi's model ranks first in performance, its parameter count is not larger than that of other models. In particular, Xiaomi's top-ranked model still exceeds all other institutions' models even when its parameter count is compressed to one-seventh of the best models from other institutions. This is mainly thanks to Knowledge Distillation (KD).

▍What is Knowledge Distillation (KD) technology?

KD is not a new technique. It was proposed as early as 2015 by Hinton (one of the founding figures of neural networks and a Turing Award winner). The main idea is to use a teacher model that performs well but is too large to deploy conveniently to guide a student model with far fewer parameters, in the hope that the student reaches performance similar to the teacher's, yielding a model that is both high-performing and small.
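A minimal sketch of this idea, adapted to a multi-label setting: the student is trained to match the teacher's per-class probabilities, optionally blended with the ground-truth labels. The weighting and exact loss used in Xiaomi's models are not specified in this article; this is only the generic recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor,
                      targets: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a 'soft' loss against teacher outputs with a 'hard' loss against labels."""
    soft = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```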

The Xiaomi acoustics and speech team published the KD-based PSL training framework at ICASSP 2022, the top speech conference, and open-sourced it (*paper: https://arxiv.org/pdf/2204.13430.pdf). With only about 1/30 of the parameters of the then-SOTA model, and trained on AudioSet-20K, whose data volume is far smaller than AudioSet-2M, it achieved performance close to SOTA (PSL 35 mAP vs. SOTA 38 mAP).

The PSL framework of 2022 was not stronger mainly because no way had yet been found to use a larger model as the teacher for KD. KD can be done online or offline: online KD computes the teacher's output in every training iteration, whereas offline KD precomputes the teacher's output and saves it to disk. Clearly, offline KD's space-for-time trade-off makes training much faster and greatly reduces GPU usage per experiment. However, it is difficult to combine offline KD with data augmentation.

Training with data augmentation in KD is called Consistent Teaching (CT). When the augmentation is strong, the disk usage of offline KD multiplies many times over; even with today's low disk prices, the cost is still too high to bear. Yet CT matters a great deal: our experiments show that when CT is applied successfully, the performance of a ViT-Base model improves by 16% compared with not using it, a very large gain.
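A back-of-the-envelope illustration of why this storage blows up (the clip count roughly matches AudioSet-2M; the number of distinct augmented views per clip is an assumed example, not a figure from the article):

```python
num_clips = 2_000_000             # roughly AudioSet-2M
num_classes = 527
bytes_per_score = 4               # float32
distinct_augmented_views = 100    # CT: each epoch may see a differently augmented copy

plain_offline_kd = num_clips * num_classes * bytes_per_score
offline_kd_with_ct = plain_offline_kd * distinct_augmented_views
print(f"offline KD, one view per clip: {plain_offline_kd / 1e9:.1f} GB")
print(f"offline KD + CT              : {offline_kd_with_ct / 1e9:.1f} GB")
```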

▍What improvements have we made?

We propose the Consistent Ensemble Distillation (CED) training framework to improve KD. Specifically, CED makes three major improvements to traditional KD (a rough sketch follows the list):

  1. Random seeds, rather than output vectors, are stored as the offline KD cache, which greatly reduces storage without significantly increasing computation.

  2. We experimented with storing only the Top-K class scores instead of all class scores, together with several smoothing schemes, so that keeping only the Top-K results does not cause significant performance degradation, further reducing storage.

  3. Some engineering techniques change data reading from random access to sequential block reads, allowing data access on mechanical disks to approach solid-state-drive speeds during training.
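
A hedged sketch of the first two ideas: store the RNG seed that produced each augmented view so the view (and hence the teacher output) can be reproduced on demand, and keep only the Top-K class scores when outputs are cached. Function and file-layout details here are illustrative, not taken from the released code:

```python
import torch

def make_augmented_view(wave: torch.Tensor, seed: int) -> torch.Tensor:
    """Deterministically re-create one augmented view from a stored seed."""
    gen = torch.Generator().manual_seed(seed)
    noise = torch.randn(wave.shape, generator=gen) * 0.005
    return wave + noise

def top_k_cache(teacher_probs: torch.Tensor, k: int = 20):
    """Keep only the K largest class scores and their indices, instead of all 527."""
    values, indices = torch.topk(teacher_probs, k)
    return indices.to(torch.int16), values.to(torch.float16)

# During training, a (clip_id, seed) pair is enough to rebuild the exact input the
# teacher saw, and the compact Top-K cache stands in for the full probability vector.
```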


The CED training framework: traditional online and offline KD on the left, Xiaomi's improvements to KD on the right

These improvements greatly increase the efficiency of KD, allowing us to train audio tagging models with CT and teacher models far larger than before (more than 1B parameters). This is the key to Xiaomi being the first in the industry to break through 50 mAP on this task.

In addition, unlike traditional KD schemes that require labels, the CED framework needs no manual labels at all when training the student model. Although we did not exploit this feature to add more data, so as to keep the comparison with other algorithms fair, it is extremely valuable and opens the door to using massive amounts of unlabeled data. With the CED framework, the student model trails the teacher model by only 0.1 mAP. The method is not limited to this task; it can also be extended to fields such as image and natural language processing, where we believe the CED framework can be of great help.

03

Using technology to break down barriers to perception

The audio tagging algorithm can identify a wide range of sounds, such as a baby crying, animal sounds, car engines, explosions, smoke alarms, doorbells, and running water, helping sounds in the environment be expressed in other modalities such as text, so that sounds can be "seen". It powers features such as Xiaomi Wensheng on Xiaomi phones, environmental sound monitoring on Xiaomi speakers, sleep snoring detection in Xiaomi Health, and environmental semantic recognition on CyberOne/CyberDog, bringing users a more efficient and accurate sound recognition experience across a rich range of device scenarios.

▍Xiaomi Wensheng

Besides conversations with people, the various sounds in the environment also carry a great deal of information. The sound recognition function in Xiaomi Wensheng can monitor 14 important environmental sounds, including fire alarms, a baby crying, and a kettle boiling, and push them to the phone's notification bar. The notifications can also be mirrored on a Xiaomi band, so important information is never missed, anytime and anywhere.

Xiaomi Wensheng lets a phone or tablet help hearing-impaired users "see" other people talking; it can also help them "see" sounds in the surrounding environment, such as alarms and knocks on the door, giving hearing-impaired users equal access to sound perception.


The user interface of Xiaomi Wensheng (conversation mode on the left, subtitle mode on the right)

▍Smart home devices

Beyond phones, environmental sound technology has also been launched on more smart home devices. For example, the baby cry monitoring function of the Mijia camera pushes a notification to the user's phone in real time when a baby's cry is detected. The Xiaomi Sound speaker is likewise equipped with sound recognition and can identify six sounds users care about in the home: household alarms, a baby crying, fire alarms, running water, cats meowing, and dogs barking. The sleep snoring detection in the Xiaomi Health app can help detect snoring and sleep talking while we sleep.


Snoring and sleep-talking monitoring in the Xiaomi Health app

Mijia camera baby cry monitoring function

For home scenarios, the speaker's environmental sound monitoring has been specially adapted. For example, for running-water recognition, to avoid bothering users with a notification the moment a faucet is turned on, engineers changed the reminder condition to require the sound of running water to be detected multiple times within one minute, avoiding false reminders.
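A minimal sketch of that "multiple detections within one minute" idea, purely for illustration (the actual thresholds and window used on the speaker are not given in the article):

```python
import time
from collections import deque

class DebouncedAlert:
    """Fire a notification only after enough detections land inside a time window."""
    def __init__(self, min_hits: int = 3, window_s: float = 60.0):
        self.min_hits = min_hits
        self.window_s = window_s
        self.hits = deque()

    def on_detection(self, now=None) -> bool:
        """Record one detection; return True only when enough recent hits accumulate."""
        now = time.monotonic() if now is None else now
        self.hits.append(now)
        while self.hits and now - self.hits[0] > self.window_s:
            self.hits.popleft()   # drop detections older than the window
        return len(self.hits) >= self.min_hits

alert = DebouncedAlert()
# if alert.on_detection(): push a "running water" notification to the user
```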


Environmental event monitoring on Xiaomi Sound speakers

▍Full-size humanoid robot CyberOne and bionic quadruped robot CyberDog

Xiaomi's first-generation full-size humanoid robot CyberOne, developed from zero to one in 10 months, can recognize 85 kinds of environmental sounds and perceive 45 kinds of human emotion in 6 categories through hearing. The second-generation Xiaomi CyberDog quadruped robot can recognize 38 kinds of environmental sounds.


04

Bringing technology to fruition for a smarter life

Xiaomi has been committed to bringing AI technology into more scenarios and helping technology deliver a better life for everyone, and has built the world's leading consumer-grade AIoT platform. As of June 30, the number of IoT devices (excluding phones, tablets, and laptops) connected to Xiaomi's AIoT platform reached 655 million, a record high.


Xiaomi's acoustics and speech team has applied its self-developed acoustic and speech technology across 79 categories of products, including Xiaomi phones, speakers, TVs, headphones, watches, and robots, for a total of 5,312 smart products. With 115 million monthly active users, Xiaoai is one of the busiest voice assistants in the world. The team handles an average of 1.26 billion requests per day across Xiaomi's phone × AIoT devices, and has delivered a cumulative 215.8 billion voice interactions for 459 million devices.

In the future, Xiaomi will continue to scale new technological heights and use its strong technical capabilities to advance its mission of "letting everyone in the world enjoy a better life through technology", so that users can enjoy smart living and feel the convenience and fun that technology brings.

-

In the spirit of openness and sharing, we have open-sourced the model training code and pre-trained models for peer research:

  1. Paper address: https://arxiv.org/abs/2308.11957

  2. Training code: https://github.com/RicherMans/CED

  3. Pre-trained model: https://zenodo.org/record/8275347

