The smart speaker war is in full swing, which raises the question: how do you become a full-stack speech recognition engineer?

On November 16, Baidu released the Raven smart speaker and the DuerOS development board SoundPi, becoming the latest domestic giant to join the smart speaker war. The players on the domestic battlefield now include Alibaba, JD.com, Tencent, Baidu, Xiaomi, and iFlytek, while abroad there are Apple, Microsoft, Amazon, Google, Facebook, and Samsung. These giants dominate the global market-capitalization rankings, and all of them are fighting for the voice entry point of the coming artificial intelligence era; Amazon and Alibaba have even launched subsidy wars at any cost. The fierce competition among these global giants will shape the next decade, and it also opens a new wave of career opportunities.

The core problems of speech intelligence today are acoustics and semantic understanding. As market demand explodes, full-stack speech recognition engineers who can cover either of these technical gaps will be in high demand. Such talent is very costly to cultivate and will remain the core talent that both giants and startups compete for over at least the next ten years.

So, how do you become a full-stack speech recognition engineer? Chen Xiaoliang, former associate researcher at the Institute of Acoustics of the Chinese Academy of Sciences and founder of Sound Intelligence Technology, accepted our invitation to write an article on exactly this topic. It connects knowledge both vertically and horizontally, combines it with practice, and explains it in plain language, making it very helpful for a comprehensive understanding of speech recognition. Afterwards, AI Technology Basecamp briefly followed up on several questions, which we hope will also be useful to you.

**Fundamentals of Speech Recognition**

【Mathematics and Statistics】

Mathematics is the foundation of all disciplines. Courses such as advanced mathematics, equations of mathematical physics, and functional analysis provide necessary background, and probability theory and mathematical statistics are likewise fundamental to speech recognition.

【Acoustics and Linguistics】

Fundamentals of acoustics, theoretical acoustics, acoustic measurement, and similar courses form the acoustics foundation and will deepen your understanding of the field. Topics such as introduction to linguistics, philosophy of language, semantic minimalism and pragmatic pluralism, and grammaticalization and semantic maps are very helpful for understanding language models and for designing voice-interaction UIs.

【Computer Science】

Courses such as signals and systems, digital signal processing, speech signal processing, discrete mathematics, data structures, introduction to algorithms, parallel computing, introduction to the C language, Python, speech recognition, and deep learning are also necessary background.

**Professional Knowledge of Speech Recognition**

The knowledge system of speech recognition can be divided into three parts: professional basics, supporting skills, and application skills. The professional basics include algorithm basics, data knowledge, and open-source platforms. The algorithm basics are the core of a speech recognition system, covering acoustic mechanisms, signal processing, acoustic models, language models, and decoding/search.

[Figure: the knowledge system of speech recognition]

【Professional Basics】

Algorithm basics

Acoustic mechanisms: include the mechanisms of speech production, hearing, and language. The speech-production mechanism covers the human vocal organs and their roles in producing sound; the auditory mechanism covers the human hearing organs, the auditory nerve, and how they distinguish and process sound; the language mechanism covers how human languages are distributed and organized. This knowledge matters both for theoretical breakthroughs and for building models.

Signal processing: covers speech enhancement, noise suppression, echo cancellation, reverberation suppression, beamforming, sound source localization, sound source separation, sound source tracking, and so on. In more detail:

1. Speech enhancement: used here in the narrow sense, referring to automatic gain or array gain, which mainly addresses pickup distance. Automatic gain raises the energy of the whole signal, whereas speech enhancement raises only the energy of the valid speech signal.

2. Noise suppression: speech recognition does not require noise to be removed completely; by contrast, communication systems do. The noise here generally refers to environmental noise, such as air-conditioner noise, which usually has no spatial directivity and limited energy; it does not mask normal speech but only degrades its clarity and intelligibility. These methods are not suited to strong-noise environments, but they are sufficient for everyday voice-interaction scenarios.

3. Reverberation suppression: the quality of reverberation suppression strongly affects speech recognition. When a sound source stops emitting, the sound waves keep reflecting off and being absorbed by the room's surfaces, so for a while several delayed copies of the sound appear mixed together; this phenomenon is called reverberation. Reverberation seriously degrades speech signal processing and reduces direction-finding accuracy.

4. Echo cancellation: strictly speaking, this should not be called echo but "self-noise". Echo is an extension of reverberation; the difference is that an echo has a longer delay. Roughly, humans clearly perceive a reflection delayed by more than 100 milliseconds as the same sound occurring twice, which is what we call an echo. What is actually meant here is the sound emitted by the voice-interaction device itself. For example, when a user calls "Alexa" while an Echo speaker is playing music, the microphone array picks up both the music being played and the user's voice, and speech recognition obviously cannot handle both. Echo cancellation removes the music and keeps only the user's voice. The term "echo cancellation" merely follows convention and is, strictly speaking, a misnomer.

5. Sound source direction finding: "localization" is deliberately avoided here, because direction finding and localization are different: consumer-grade microphone arrays can do direction finding, whereas localization requires considerably more cost. The main purpose of direction finding is to detect the voice of the person talking to the device so that beamforming can follow. It can be based on energy methods or spectral estimation, and TDOA techniques are also commonly used with arrays. Direction finding is generally performed during the voice wake-up phase. VAD can also be placed in this category, and it is a key factor for reducing power consumption in the future.

6. Beamforming: beamforming is a general signal processing technique. Here it refers to processing the output signals of the microphones of an array arranged in a given geometry (for example weighting, delaying, and summing them) so as to form spatial directivity. Beamforming mainly suppresses sound interference outside the main lobe, including other human voices: when several people talk around an Echo, it recognizes only one of them. A minimal delay-and-sum sketch follows this list.
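
As a concrete illustration of the weight-delay-sum idea above, here is a minimal delay-and-sum beamformer for a uniform linear array, written in NumPy. The microphone spacing, sampling rate, and steering angle are assumed example values; real systems add steering-vector estimation, windowing, and adaptive weighting on top of this.

```python
# Minimal delay-and-sum beamformer sketch for a uniform linear array.
# Geometry, sampling rate, and steering angle below are assumed values.
import numpy as np

def delay_and_sum(mic_signals, mic_spacing=0.05, fs=16000, angle_deg=30.0, c=343.0):
    """mic_signals: array of shape (num_mics, num_samples)."""
    num_mics, num_samples = mic_signals.shape
    # Far-field time delays at each microphone for a source at angle_deg
    # (measured from the array broadside).
    delays = np.arange(num_mics) * mic_spacing * np.sin(np.deg2rad(angle_deg)) / c
    # Compensate the delays as phase shifts in the frequency domain.
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(mic_signals, axis=1)
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering
    # Sum the aligned channels and return to the time domain.
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)

if __name__ == "__main__":
    # Fake 4-channel input: the same tone plus independent noise per channel.
    fs, n = 16000, 16000
    t = np.arange(n) / fs
    x = np.stack([np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(n) for _ in range(4)])
    y = delay_and_sum(x)
    print(y.shape)  # (16000,)
```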

Endpoint detection: endpoint detection, Voice Activity Detection (VAD) in English, distinguishes valid speech from non-speech segments. VAD is the main way speech recognition detects the pauses between utterances, and it is also an important consideration for low power consumption. It has traditionally been implemented with signal processing methods; it is listed separately here because its role has become increasingly important, and it is now also commonly built with machine-learning methods. A minimal energy-based sketch follows.
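
Below is a minimal sketch of the classic signal-processing approach to VAD: short-time energy compared against an estimated noise floor. The frame length, hop size, and threshold factor are assumed values; production VAD typically adds smoothing, hangover logic, or a learned classifier.

```python
# Minimal energy-based VAD sketch: frames whose short-time energy exceeds a
# threshold relative to the estimated noise floor are marked as speech.
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_factor=3.0):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energies = np.array([np.mean(f ** 2) for f in frames])
    # Estimate the noise floor from the quietest 10% of frames.
    noise_floor = np.percentile(energies, 10)
    return energies > threshold_factor * noise_floor  # True = speech frame

if __name__ == "__main__":
    # Half a second of low-level noise followed by half a second of a tone.
    sig = np.concatenate([0.01 * np.random.randn(8000),
                          np.sin(2 * np.pi * 300 * np.arange(8000) / 16000)])
    print(energy_vad(sig).astype(int))
```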

Feature extraction: acoustic models usually cannot consume raw audio directly, so a fixed feature sequence must first be extracted from the time-domain signal by some method and then fed into the acoustic model. Models trained with deep learning do not escape the laws of physics either; they simply extract richer features, such as amplitude, phase, frequency, and correlations across dimensions. A minimal MFCC extraction sketch follows.
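
As an example of such a fixed feature sequence, the sketch below extracts MFCC features with librosa (assumed to be installed); the filename and window settings are placeholders chosen for illustration.

```python
# Minimal MFCC feature-extraction sketch; "audio.wav" is a placeholder file.
import librosa

y, sr = librosa.load("audio.wav", sr=16000)
# 13 MFCCs per frame with 25 ms windows and 10 ms hops, a common setup.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, num_frames)
```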

Acoustic model: the acoustic model is the most critical part of speech recognition. It combines acoustics and computer science, takes the features produced by feature extraction as input, and produces acoustic scores for variable-length feature sequences. Its core difficulties are the variable length of feature sequences and the variability of the sound signal; in fact, nearly every reported advance in speech recognition refers to an advance in acoustic models. Acoustic models have been iterated for many years and there are now many of them; the most widely used at each stage are introduced below, followed by a minimal neural sketch. In practice, many models are now combined so that the strengths of each can make the system more robust across scenarios.
1. GMM (Gaussian Mixture Model): a statistical model built on spectral (Fourier-based) speech features. The mixture weights and the mean and variance of each Gaussian are obtained by iterative optimization. GMMs train quickly and have few parameters, which suits offline, on-device applications. Before deep learning was applied to speech recognition, the GMM-HMM hybrid was long the dominant acoustic model. However, GMMs cannot effectively model nonlinear or near-nonlinear data, have difficulty exploiting contextual information, and are hard to extend.
2. HMM (Hidden Markov Model): a model of a Markov process with hidden, unknown parameters; the hidden parameters are inferred from the observable ones and then used for further analysis. HMMs can model the statistical distribution of acoustic sequences, especially their temporal structure, but that structure rests on the HMM's independence assumptions, which make it hard to account for factors such as speaking rate, accent, and other acoustic variability. There are many HMM extensions, but most suit only small-vocabulary recognition; large-vocabulary recognition remains very difficult with them alone.
3. DNN (Deep Neural Network): an early neural network used in acoustic models. DNNs represent data more efficiently than Gaussian mixtures; in particular, the DNN-HMM hybrid greatly improved recognition accuracy. Because DNN-HMM achieves high accuracy at limited training cost, it is still a commonly used acoustic model in industry.
4. RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network): these two architectures are applied to speech recognition mainly to exploit variable-length context, and they are more robust to speaking-rate variation than DNNs. RNN variants include LSTM (long short-term memory networks with multiple hidden layers), highway LSTM, residual LSTM, and bidirectional LSTM; CNN variants include the time-delay neural network (TDNN), CNN-DNN, CNN-LSTM-DNN (CLDNN), CNN-DNN-LSTM, and deep CNN. Some of these perform similarly but are used differently: bidirectional LSTM and deep CNN achieve comparable accuracy, for example, but a bidirectional LSTM must wait until the end of a sentence before recognizing it, whereas a deep CNN adds no such latency and is better suited to real-time recognition.
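
The sketch below, referenced in the acoustic-model paragraph above, is a minimal frame-level neural acoustic model in PyTorch: a small feed-forward network that maps each feature frame to scores over HMM states. The layer sizes, feature dimension, and number of states are assumptions for illustration, not a reference configuration; real DNN-HMM systems also splice context frames and train against forced alignments.

```python
# Minimal frame-level DNN acoustic-model sketch (assumed sizes, toy input).
import torch
import torch.nn as nn

class FrameDNN(nn.Module):
    def __init__(self, feat_dim=39, num_states=3000, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_states),
        )

    def forward(self, frames):            # frames: (batch, feat_dim)
        return self.net(frames)           # unnormalized state scores

model = FrameDNN()
dummy = torch.randn(8, 39)                # a batch of 8 feature frames
log_post = torch.log_softmax(model(dummy), dim=-1)
print(log_post.shape)                     # torch.Size([8, 3000])
```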

Language model: a language model estimates the probability of a word sequence by learning the relationships between words from a training corpus. The most common language model is the N-gram model; in recent years, neural modeling approaches such as CNN- and RNN-based language models have also been applied. A toy bigram sketch follows.
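
The toy sketch below shows the counting idea behind an N-gram language model, using bigrams with add-one smoothing on a three-sentence corpus; real language models are trained on far larger corpora with better smoothing such as Kneser-Ney.

```python
# Toy bigram language model with add-one smoothing on a made-up corpus.
from collections import Counter

corpus = ["turn on the light", "turn off the light", "play the song"]
tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothing over the vocabulary.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigram_prob(a, b)
    return p

print(sentence_prob("turn on the light"))
```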

Decoding/search: decoding largely determines the speed of speech recognition. The decoding process usually compiles the acoustic model, the lexicon, and the language model into a single network and, following the maximum a posteriori criterion, selects one or more best paths through it as the recognition result. Decoding can be dynamically or statically compiled, and run in synchronous or asynchronous mode; a currently popular approach is frame-synchronous decoding based on tree copying. A toy beam-search sketch follows.
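
To make the decoding idea concrete, here is a toy frame-synchronous beam search over randomly generated per-frame acoustic scores and a toy transition matrix. The beam size and scores are made-up values; a real decoder searches a compiled network built from the acoustic model, lexicon, and language model.

```python
# Toy frame-synchronous beam-search decoder over random log-probabilities.
import numpy as np

def beam_search(acoustic_logp, transition_logp, beam=3):
    """acoustic_logp: (num_frames, num_states); transition_logp: (num_states, num_states)."""
    num_frames, num_states = acoustic_logp.shape
    # Each hypothesis is (score, state_sequence); start from frame 0.
    hyps = sorted(((acoustic_logp[0, s], [s]) for s in range(num_states)),
                  reverse=True)[:beam]
    for t in range(1, num_frames):
        expanded = []
        for score, path in hyps:
            prev = path[-1]
            for s in range(num_states):
                new_score = score + transition_logp[prev, s] + acoustic_logp[t, s]
                expanded.append((new_score, path + [s]))
        hyps = sorted(expanded, reverse=True)[:beam]   # prune to the beam
    return hyps[0]

rng = np.random.default_rng(0)
am = np.log(rng.dirichlet(np.ones(4), size=10))        # 10 frames, 4 states
lm = np.log(rng.dirichlet(np.ones(4), size=4))         # toy transition scores
best_score, best_path = beam_search(am, lm)
print(best_score, best_path)
```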

**Data Knowledge for Speech Recognition**

Data collection: collecting the audio of conversations between users and machines, generally divided into near-field and far-field. Near-field collection can usually be done with a mobile phone, while far-field collection generally requires a microphone array. The collection environment also matters: different uses of the data impose very different requirements on the recordings, such as the age, gender, and regional distribution of the speakers.

Data cleaning: preprocessing the collected data to remove unsatisfactory or invalid speech, so that accurate data can be provided to the subsequent annotation step.

Data annotation: transcribing the audio into the corresponding text so that an acoustic model can be trained, which usually requires tens of thousands of hours of annotated speech. Because speech is a time-series signal, annotation takes a great deal of manpower, and factors such as annotator fatigue keep the error rate relatively high. Improving the accuracy of data annotation is therefore also a key issue in speech recognition.

Data management: the classification, management, and organization of annotated data, making it easier to manage and reuse the data effectively.

Data security: handling audio data securely and conveniently, for example through encryption, to avoid leaking sensitive information.

**Open-Source Platforms for Speech Recognition**

Current mainstream open-source platforms include CMU Sphinx, HTK, Kaldi, Julius, iATROS, CNTK, and TensorFlow. CMU Sphinx is an offline speech recognition toolkit that supports low-power offline scenarios such as DSPs. Because deep learning contributes significantly to reducing the word error rate (WER) of speech recognition, toolkits that support deep learning, such as Kaldi, CNTK, and TensorFlow, are currently more popular. Kaldi's advantage is that it integrates many speech recognition tools, including decoding and search. Table 1 summarizes the specific open-source platforms.
[Table 1: summary of open-source speech recognition platforms]

【Supporting Skills】

Acoustic devices

• Microphone: a microphone is a transducer that converts sound into an electrical signal, i.e. an acoustic signal into an electrical one. Its core parameters are sensitivity, directivity, frequency response, impedance, dynamic range, signal-to-noise ratio, maximum sound pressure level (or AOP, acoustic overload point), and consistency. The microphone is the core device of speech recognition and determines the basic quality of the speech data.

• Loudspeaker: a loudspeaker, usually just called a speaker, is a transducer that converts electrical signals into acoustic signals. Its performance strongly affects sound quality, and its core specification is the Thiele/Small (TS) parameters. Because echo cancellation is involved in speech recognition, the requirements on the loudspeaker's total harmonic distortion are somewhat higher.

• Laser pickup: an active pickup method that captures distant vibration information by means such as laser reflection and reconstructs the sound from it. It was mainly used for eavesdropping in the past, and it is still difficult to apply to speech recognition at present.

• Microwave pickup: microwaves are electromagnetic waves with wavelengths between infrared and radio waves, roughly in the 300 MHz to 300 GHz range. The principle is similar to laser pickup, but microwaves pass through glass, plastic, and porcelain almost without being absorbed.

• High-speed camera pickup: uses a high-speed camera to capture vibrations and reconstruct the sound. It requires line of sight and expensive high-speed cameras, so it is used only in specific scenarios.

Computing chips

• DSP (Digital Signal Processor): generally uses the Harvard architecture, with the advantages of low power consumption and fast computation; it is mainly used for low-power speech recognition.

• ARM (Acorn RISC Machine): a family of RISC processor architectures designed by a British company, characterized by low power consumption and high performance and widely used in the mobile Internet. ARM processors are also common in IoT devices such as smart speakers.

• FPGA (Field-Programmable Gate Array): a semi-custom circuit in the ASIC domain that both avoids the inflexibility of fully custom circuits and overcomes the limited gate count of earlier programmable devices. FPGAs are also important for parallel computing, and large-scale deep learning can be run on FPGA-based computation.

• GPU (Graphics Processing Unit): the most popular computing architecture in deep learning today; in fact it is GPGPU that is used, mainly to accelerate large-scale computation. The usual drawback of GPUs is high power consumption, so they are generally deployed in cloud server clusters.

• In addition, there are emerging processor architectures such as NPUs and TPUs, optimized mainly for deep learning algorithms. Since they have not yet been deployed at scale, they are not discussed in detail here.

Acoustic structure

Array design mainly refers to the structural design of microphone arrays. Microphone arrays are usually described loosely as linear, circular, or spherical; strictly speaking, they should be described as in-line, cross-shaped, planar, spiral, spherical, or irregular arrays. The number of elements, i.e. the number of microphones, can range from two to thousands. Array design therefore has to determine the array geometry and element count for the target scenario, ensuring performance while keeping cost under control.

Acoustic design mainly refers to the design of the speaker cavity. A voice-interaction system must not only capture sound but also produce it, and the quality of the produced sound matters: when playing music or video, sound quality is an important criterion, and the acoustic design also affects speech recognition performance. Acoustic design is therefore another key factor in an intelligent voice-interaction system.

【Application Skills】

• Applications of speech recognition will be the most anticipated innovations of the voice-interaction era; as in the mobile Internet era, it is ultimately the applications that retain users. For now, however, artificial intelligence is still mostly in the infrastructure-building phase, and it will take time for AI applications to spread. Although Amazon's Alexa already has tens of thousands of skills, user feedback suggests that current applications are still mainly built on the following core capabilities.

• Voice control is actually the most important application at present, covering alarm clocks, music, maps, shopping, smart-appliance control, and similar functions. It is relatively demanding because it requires more accurate and faster speech recognition.

• Speech transcription has dedicated applications in fields such as conferencing systems, smart courts, and smart healthcare. It transcribes the user's speech into text in real time to produce meeting minutes, trial records, or electronic medical records.

• Language translation mainly involves converting between different languages; it adds real-time translation on top of speech transcription and places even higher demands on speech recognition.

The following three kinds of recognition can be placed within speech recognition or treated as separate categories. Here they are broadly grouped into the larger speech recognition system, which makes them easier to understand as capabilities of speech recognition.

• Voiceprint recognition. Its theoretical basis is that every voice has unique characteristics through which different speakers can be distinguished. These characteristics are determined mainly by two factors. The first is the geometry of the vocal tract, including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of voice frequencies. The second is how the articulators are used: the lips, teeth, tongue, soft palate, and palatal muscles interact to produce intelligible speech, and the way they cooperate is learned gradually through interaction with the people around us. Commonly used methods for voiceprint recognition include template matching, nearest-neighbor methods, neural networks, and VQ clustering; a naive template-matching sketch follows this list.

• Emotion recognition mainly extracts acoustic features that express emotion from the collected speech and maps them to human emotional states. It currently relies mainly on deep learning, which requires defining an emotion space and building a sufficiently large emotional corpus. Emotion recognition is where human-computer interaction shows its intelligence, but so far the technology has not reached the level required for products.

• Humming recognition (query by humming) lets the user hum a melody, compares that melody in detail against a music library, and finally returns the matching song information. The technique is currently used in music search, with recognition rates of around 80%.
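
The sketch below, referenced in the voiceprint item above, illustrates the template-matching idea in its most naive form: each speaker's template is the mean MFCC vector of an enrollment recording, and a test recording is matched by cosine similarity. The file names are placeholders and librosa is assumed to be installed; real voiceprint systems use much stronger speaker embeddings.

```python
# Naive voiceprint template-matching sketch (placeholder file names).
import numpy as np
import librosa

def voice_template(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)                      # one vector per recording

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

templates = {"alice": voice_template("alice_enroll.wav"),
             "bob": voice_template("bob_enroll.wav")}
test = voice_template("unknown.wav")
best = max(templates, key=lambda name: cosine(templates[name], test))
print("closest speaker:", best)
```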

**Current Status and Trends of Speech Recognition**

At present, the accuracy and speed of speech recognition depend on the actual application environment. In quiet environments, with standard accents and common vocabulary, recognition accuracy exceeds 95% and is fully usable, which is why speech recognition is such a hot topic right now. With the progress of technology, recognition with accents, dialects, and moderate noise has also reached a usable state, but scenarios with strong noise, ultra-far-field pickup, strong interference, multiple languages, or very large vocabularies still need major improvement. Multi-speaker recognition and offline recognition are also problems that remain to be solved.

The academic community has discussed many trends in speech recognition technology. Two ideas deserve particular attention: end-to-end speech recognition systems, and the capsule theory recently proposed by Geoffrey Hinton. Capsule theory is still quite controversial academically, and whether it can show advantages in speech recognition remains to be explored.

End-to-end speech recognition systems have no large-scale deployments yet. In theory, since speech recognition is essentially a sequence recognition problem, jointly optimizing all of its models should yield better recognition accuracy, which is the appeal of end-to-end systems. However, truly end-to-end modeling of the whole chain, from audio capture, signal processing, and feature extraction to the acoustic model, language model, and decoding/search, is very difficult, so the "end-to-end" models commonly referred to today are basically confined to the acoustic model, such as end-to-end optimization of DNN-HMM or CNN/RNN-HMM models, and methods such as the CTC criterion and attention-based models. End-to-end training can, in effect, learn real-world noise and reverberation as additional features, reducing the dependence on signal processing, but it still faces problems with training performance, convergence speed, network bandwidth, and so on, and has not yet shown a clear advantage over mainstream speech recognition approaches. A minimal CTC sketch follows.
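
As a concrete example of the CTC criterion mentioned above, here is a minimal PyTorch training sketch: a small bidirectional LSTM produces per-frame character posteriors and is trained with torch.nn.CTCLoss on random data. The shapes, vocabulary size, and data are illustrative assumptions, not a working recognizer.

```python
# Minimal CTC training sketch on random data (assumed shapes and vocabulary).
import torch
import torch.nn as nn

num_chars = 28                 # 26 letters + space + CTC blank (index 0)

class TinyCTCModel(nn.Module):
    def __init__(self, feat_dim=13, hidden=128, num_out=num_chars):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_out)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(-1)  # per-frame log-posteriors

model = TinyCTCModel()
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, 100, 13)                      # 4 utterances, 100 frames each
targets = torch.randint(1, num_chars, (4, 12))       # padded label sequences (no blanks)
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)

log_probs = model(feats).transpose(0, 1)             # CTCLoss expects (T, N, C)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```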

This article is mainly intended as popular science. I am very grateful to all colleagues in the domestic speech recognition field for their support. If there are any shortcomings in the article, I look forward to your corrections!

【References】

1. Li Deng and Dong Yu, Deep Learning: Methods and Applications
2. Joseph Keshet and Samy Bengio, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods
3. Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, Spoken Language Processing
4. Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition
5. Dan Jurafsky and James H. Martin, Speech and Language Processing
6. Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton, Dynamic Routing Between Capsules
7. https://en.wikipedia.org/wiki/Speech_perception
8. http://www.speech.cs.cmu.edu
9. http://htk.eng.cam.ac.uk/
10. http://kaldi-asr.org/
11. https://www.microsoft.com/en-us/cognitive-toolkit/
12. http://www.soundpi.org/
