Several Volcano Speech papers accepted at ICASSP 2023, effectively solving a range of practical problems

Recently, ICASSP 2023, sponsored by the IEEE and known as the world's largest and most comprehensive top academic conference on signal processing and its applications, was held in Greece. The conference is authoritative, has broad academic and industrial influence, and is closely followed by the AI community. At the conference, a number of papers from the Volcano Speech team were accepted and published, covering technological innovations in several cutting-edge areas and effectively solving practical problems such as grapheme-to-phoneme conversion and language confusion.

Image source: https://2023.ieeeicassp.org/

LiteG2P: A fast, lightweight, high-accuracy grapheme-to-phoneme conversion model (LiteG2P: A Fast, Light and High Accuracy Model for Grapheme-to-Phoneme Conversion)

Research background: Grapheme-to-phoneme conversion (G2P) aims to convert words into their corresponding pronunciation representations and is widely used in speech tasks such as speech recognition (ASR) and speech synthesis (TTS). Existing rule-based methods, however, often have poor prediction accuracy and require a large amount of expert knowledge, while data-driven deep models achieve high accuracy but tend to be large and computationally inefficient. To address this, the Volcano Speech team proposed an efficient, fast, lightweight, and high-accuracy grapheme-to-phoneme conversion model that can further be applied to a variety of devices.

Method analysis: Combining the advantages of data-driven and knowledge-driven approaches, LiteG2P achieves high accuracy while keeping the model small. At the model level, instead of the traditional attention-based sequence-to-sequence prediction model, it uses a CTC loss to align graphemes with phonemes, which also allows the model to predict phoneme sequences in parallel. In addition, the Volcano Speech team introduced a linguistic knowledge dictionary to guide the expansion length of each letter and to reduce the set of candidate phonemes to predict.

The architecture of LiteG2P
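
To make the parallel, CTC-based prediction concrete, below is a minimal PyTorch sketch of a non-autoregressive G2P model in the spirit of LiteG2P (not the team's actual implementation): characters are embedded, passed through a lightweight convolutional encoder, each character position is expanded to a fixed number of phoneme slots (standing in for the dictionary-guided letter-length expansion described above), and a CTC loss aligns the slot-level predictions with the target phoneme sequence. All layer sizes, the expansion factor, and vocabulary sizes are illustrative assumptions.

```python
# Minimal, hypothetical sketch of a CTC-based non-autoregressive G2P model in
# the spirit of LiteG2P; all sizes and the expansion factor are assumptions.
import torch
import torch.nn as nn

class LiteG2PSketch(nn.Module):
    def __init__(self, num_chars, num_phones, dim=128, expand=3):
        super().__init__()
        self.expand = expand                          # phoneme slots per letter (dictionary-guided in the paper)
        self.embed = nn.Embedding(num_chars, dim)
        self.encoder = nn.Sequential(                 # lightweight convolutional encoder instead of attention
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(dim, num_phones + 1)    # +1 for the CTC blank symbol

    def forward(self, chars):                         # chars: (batch, num_letters)
        x = self.embed(chars)                         # (B, L, D)
        x = self.encoder(x.transpose(1, 2)).transpose(1, 2)
        x = x.repeat_interleave(self.expand, dim=1)   # expand each letter to several phoneme slots
        return self.proj(x)                           # (B, L * expand, num_phones + 1)

# One toy training step with CTC loss; targets are phoneme-ID sequences.
model = LiteG2PSketch(num_chars=30, num_phones=40)
chars = torch.randint(1, 30, (2, 8))                  # a batch of two 8-letter words
logits = model(chars)
log_probs = logits.log_softmax(-1).transpose(0, 1)    # (T, B, C), as nn.CTCLoss expects
targets = torch.randint(1, 41, (2, 6))                # two 6-phoneme pronunciations
input_lens = torch.full((2,), logits.size(1), dtype=torch.long)
target_lens = torch.full((2,), 6, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
```

At inference time, a greedy CTC decode (per-slot argmax, collapsing repeats and removing blanks) produces the whole phoneme sequence in one parallel pass, which is where the speed advantage over autoregressive decoders comes from.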

Effect presentation: Compared with mainstream baseline models, the final LiteG2P model is accurate, parallelizable, lightweight, and fast. It matches the accuracy of the mainstream baselines while being more than 30 times faster with more than 10 times fewer parameters. A single model architecture can be deployed simultaneously on multiple types of devices, on-device and in the cloud: the inference latency for a single word is within 5 ms on end-side devices and within 2 ms on cloud devices.

Improving speech recognition performance with speech-text multimodal training based on a bidirectional attention mechanism

Research background: Although end-to-end models simplify the training process by combining the acoustic model, lexicon, and language model into a unified model, they rely heavily on large amounts of labeled training data. Unpaired data, such as audio-only or text-only data, is much easier to obtain than labeled data, so it is often brought into training to alleviate data sparsity and to train end-to-end speech recognition models that perform well in low-resource scenarios. This paper uses text-only data in training the end-to-end model's decoder so that the decoder can learn more semantic information and thus improve model performance. This requires a text encoder whose output approximates the output of the audio encoder, which removes the decoder's training-time dependence on the audio encoder. Because audio and text have inconsistent lengths, the paper proposes a speech-text multimodal training method based on a bidirectional attention mechanism to automatically learn the alignment between speech and text.

Method analysis: Specifically, after bidirectional attention is computed between the output of the speech encoder and the output of the text encoder, the speech encoder's output is shortened to the text length and the text encoder's output is stretched to the audio length. The outputs of the bidirectional attention mechanism are trained with a cosine distance loss, an MLM loss, and a grapheme CTC loss. During training, the model learns the alignment between speech and text, and the speech encoder and text encoder learn consistent representations.

The speech-text multimodal learning framework based on a bidirectional attention mechanism

As shown in the figure, the modules and loss functions added during training are inside the dotted box; they are not involved in computation during decoding, so decoding speed is unaffected. The grapheme CTC loss classifies graphemes on the resampled speech and text embeddings, the MLM loss enables the text encoder to learn semantic information, and the cosine embedding loss narrows the distance between the speech embedding and the text embedding. All three loss functions are computed on the speech and text embeddings aligned by the bidirectional attention mechanism, thereby implicitly aligning the two embeddings. After speech-text multimodal training, the text encoder can generate features close to the output of the speech encoder. The Volcano Speech team feeds text-only data into the text encoder, repeating each token twice to reduce the length difference between speech and text, and trains the decoder on top of this so that it learns more semantic information.
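
As an illustration of how the bidirectional attention and the cosine alignment loss fit together, here is a hedged PyTorch sketch; module names, dimensions, and batch shapes are assumptions, and the MLM and grapheme CTC losses are omitted for brevity.

```python
# Hypothetical sketch of the bidirectional speech-text attention and the cosine
# alignment loss; dimensions and shapes are assumptions, and the MLM and
# grapheme CTC losses are omitted for brevity.
import torch
import torch.nn as nn

dim = 256
speech_to_text = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
text_to_speech = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

speech_emb = torch.randn(2, 200, dim)    # speech encoder output: (batch, audio_frames, dim)
text_emb = torch.randn(2, 20, dim)       # text encoder output:   (batch, text_tokens, dim)

# Text queries attend over speech: the speech sequence is "shortened" to the text length.
speech_at_text_len, _ = speech_to_text(query=text_emb, key=speech_emb, value=speech_emb)
# Speech queries attend over text: the text sequence is "stretched" to the audio length.
text_at_speech_len, _ = text_to_speech(query=speech_emb, key=text_emb, value=text_emb)

# The cosine embedding loss pulls each aligned speech/text embedding pair together.
cos_loss = nn.CosineEmbeddingLoss()
target = torch.ones(2 * 20)              # +1 targets: every pair should be similar
loss = cos_loss(speech_at_text_len.reshape(-1, dim), text_emb.reshape(-1, dim), target)
```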

Effect presentation: With the speech-text multimodal training method proposed in this paper, performance improves on the LibriSpeech public dataset: when training with labeled data only, a relative word error rate reduction of 6.15% is achieved; when more unpaired text data is used, the relative word error rate reduction reaches 9.23%.

Reducing language confusion in code-switching speech recognition with token-level language diarization (Reducing Language Confusion for Code-switching Speech Recognition with Token-level Language Diarization)

Research background: Language switching within a speech signal often leads to language confusion in code-switching speech recognition. The Volcano Speech team addresses this problem from two perspectives, fusing language information and decoupling language information, thereby improving the performance of code-switching speech recognition.

Method analysis: For the fusion of language information, the team uses a sequence-to-sequence token-level language diarization sub-task to generate token-level language posterior probabilities, which are then used to dynamically bias the code-switching speech recognition model. Conversely, the decoupling approach reduces the differences between languages through adversarial learning so as to normalize the different languages. The two architectures implementing these methods are shown in the figure below:

The hybrid CTC/attention model (a) incorporating language information using language posterior bias, and (b) disentangling language via adversarial learning
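
For the first of the two methods, a hypothetical sketch of the language posterior bias idea is given below: a token-level language diarization head predicts per-token language posteriors, and a posterior-weighted mixture of language embeddings is added to the decoder representation. The exact injection point, shapes, and names are assumptions rather than the paper's exact design.

```python
# Hypothetical sketch of the language posterior bias: a token-level language
# diarization head predicts per-token language posteriors, and a posterior-
# weighted mix of language embeddings is added to the decoder representation.
# Shapes, names, and the injection point are illustrative assumptions.
import torch
import torch.nn as nn

num_langs, dim = 2, 256                        # e.g. a Mandarin/English code-switching setup
lang_head = nn.Linear(dim, num_langs)          # language diarization sub-task head
lang_embed = nn.Embedding(num_langs, dim)      # one learned embedding per language

decoder_state = torch.randn(2, 20, dim)        # (batch, target_tokens, dim)

lang_logits = lang_head(decoder_state)                      # (B, T, num_langs)
lang_posterior = lang_logits.softmax(-1)
lang_bias = lang_posterior @ lang_embed.weight              # posterior-weighted language embedding
biased_decoder_state = decoder_state + lang_bias            # fed onward to the ASR decoder layers

# Multi-task training: the diarization head is supervised with token-level language IDs.
lang_ids = torch.randint(0, num_langs, (2, 20))
lang_loss = nn.CrossEntropyLoss()(lang_logits.transpose(1, 2), lang_ids)
```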

Effect presentation: The proposed methods were verified on the SEAME dataset. Compared with the baseline model, both the multi-task training with the language diarization task and the language posterior probability bias method proposed by the team achieved performance improvements. "We also compared the two approaches of fusing and decoupling language information, and the results show that fusing language information improves code-switching speech recognition performance more effectively," the team emphasized.

An ASR-free fluency scoring method based on self-supervised learning (An ASR-free Fluency Scoring Approach with Self-supervised Learning)

Research background: Oral fluency, that is, the speed of pronunciation and whether there are abnormal pauses, is one of the important indicators of proficiency in a second language. Most previous approaches rely on an ASR system to obtain the time alignment of speech units (such as words, syllables, and phonemes), from which speech fluency features are then computed or represented. However, an ASR system for the target language is not always available to provide this information easily, and inevitable recognition errors also arise in the process. The Volcano Speech team therefore proposed a new self-supervised-learning-based fluency scoring method that requires no ASR system: it takes the frame-level speech representations produced by the self-supervised pre-trained speech model wav2vec 2.0, together with frame-level pseudo-labels generated by a clustering algorithm, as input to a subsequent sequence model that predicts the fluency score.

The proposed ASR-free fluency scoring framework
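
A minimal sketch of this ASR-free pipeline, assuming frame-level SSL features, k-means cluster IDs as pseudo-labels, and a small BLSTM regressor for the utterance-level score, might look as follows; random tensors stand in for real wav2vec 2.0 features, and the cluster count and layer sizes are assumptions.

```python
# Hypothetical sketch of the ASR-free pipeline: frame-level SSL features ->
# k-means cluster IDs as pseudo-labels -> a small BLSTM regresses the fluency
# score. Random tensors stand in for wav2vec 2.0 features; the cluster count
# and layer sizes are assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

frames, feat_dim, n_clusters = 300, 768, 50
ssl_feats = torch.randn(frames, feat_dim)          # placeholder for wav2vec 2.0 frame features

# Cluster frames into pseudo-labels that mimic phoneme-level time alignment.
pseudo_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(ssl_feats.numpy())

class FluencyScorer(nn.Module):
    def __init__(self, feat_dim, n_clusters, hidden=128):
        super().__init__()
        self.label_embed = nn.Embedding(n_clusters, 32)
        self.blstm = nn.LSTM(feat_dim + 32, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)        # one utterance-level fluency score

    def forward(self, feats, labels):               # feats: (B, T, D), labels: (B, T)
        x = torch.cat([feats, self.label_embed(labels)], dim=-1)
        out, _ = self.blstm(x)
        return self.head(out.mean(dim=1))           # pool over time, then regress the score

scorer = FluencyScorer(feat_dim, n_clusters)
score = scorer(ssl_feats.unsqueeze(0), torch.as_tensor(pseudo_labels).long().unsqueeze(0))
```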

Effect presentation: Experimental results show that the correlation between the machine predictions and human expert scores reaches 0.797, significantly better than the 0.759 achieved by the previous ASR-dependent method. The scheme leverages the strong phoneme discrimination ability of self-supervised speech features and uses frame-level clustering pseudo-label sequences to simulate ASR-based phoneme time alignment, which removes the dependence on ASR while delivering more reliable scoring performance.

Leveraging Phone-level Linguistic-Acoustic Similarity for Utterance-level Pronunciation Scoring

Research background: An automatic pronunciation scoring system typically needs to measure the deviation between a learner's actual pronunciation and a reference pronunciation in order to estimate overall pronunciation accuracy, but most previous methods do this implicitly, for example by summing or concatenating acoustic embeddings and phoneme embeddings. The Volcano Speech team instead proposed an utterance-level pronunciation scoring method based on phone-level linguistic-acoustic similarity: the deviation of the actual pronunciation from the reference pronunciation is described explicitly by the cosine similarity between the acoustic embedding and the phoneme embedding, and this similarity is used as an additional feature, fed together with the two original embedding sequences into a subsequent sequence model to produce the final pronunciation accuracy score.

The hierarchical architecture of the pronunciation scoring network, where phone-level features can be calculated by using add_phone, concat_phone or our proposed method
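
The explicit similarity feature is straightforward to express in code. Below is a hypothetical sketch in which the cosine similarity between each phone's acoustic embedding and its canonical phoneme embedding is appended to the two embeddings before an utterance-level sequence model; the dimensions, the GRU scorer, and all names are assumptions rather than the paper's exact architecture.

```python
# Hypothetical sketch of the explicit linguistic-acoustic similarity feature:
# the cosine similarity between each phone's acoustic embedding and its
# canonical phoneme embedding is appended to the two embeddings before the
# utterance-level scorer. All dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_phones, dim = 40, 64
phone_embed = nn.Embedding(num_phones, dim)

acoustic_emb = torch.randn(1, 25, dim)                 # phone-level acoustic embeddings (B, P, D)
canonical_ids = torch.randint(0, num_phones, (1, 25))  # canonical phoneme IDs for the utterance
linguistic_emb = phone_embed(canonical_ids)            # phoneme (linguistic) embeddings (B, P, D)

# Explicit pronunciation-deviation feature: one cosine similarity per phone.
similarity = F.cosine_similarity(acoustic_emb, linguistic_emb, dim=-1).unsqueeze(-1)  # (B, P, 1)
phone_feats = torch.cat([acoustic_emb, linguistic_emb, similarity], dim=-1)           # (B, P, 2D+1)

# A small sequence model pools the phone-level features into one utterance score.
gru = nn.GRU(2 * dim + 1, 64, batch_first=True)
head = nn.Linear(64, 1)
out, _ = gru(phone_feats)
utterance_score = head(out.mean(dim=1))                # pronunciation-accuracy score per utterance
```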

Effect presentation: This explicit measurement has been shown, on both internal and public datasets, to be significantly better than the previous implicit measures based on summation or concatenation; moreover, pre-training based on phone-level GOP brings large improvements across all measurement methods. The scoring system that combines the explicit linguistic-acoustic similarity measure with GOP pre-training achieves the best scoring performance, with a correlation of 0.858 between machine predictions and human expert scores, significantly higher than the multiple baseline systems reported in the paper.

Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation

Research background: Internal language model fusion can significantly improve end-to-end speech recognition performance as long as there is enough text in the general domain or a specific target domain. However, when a general-domain commercial speech recognition system is deployed, users often only have text data from their own specific target domain because of data access restrictions. As a result, an automatic speech recognition system using internal language model fusion improves performance only in the user's specific domain while suffering a significant performance degradation in the general domain. Under the premise that the user only has text data from a specific target domain, this paper therefore proposes an adaptive language model fusion method which, compared with the traditional internal language model estimation fusion, achieves a significant performance improvement in the specific domain while still maintaining good performance in the general domain.

Method analysis: The method is based on internal language model estimation. The premise is that when a speech recognition system is delivered online, the subsystems available to the user include the end-to-end speech recognition system and its internal language model. Users only need to provide a language model for their own specific domain, and can obtain a significant performance improvement in that domain with little loss in the general domain. Specifically, during language model fusion the recognition system compares the score of each subword under the internal language model and under the user's domain-specific language model, and decides whether to perform internal language model fusion according to which score is larger, thus realizing adaptive fusion.
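
A hedged sketch of this per-token decision might look like the following: at each decoding step the candidate token's score under the internal LM is compared with its score under the user's target-domain LM, and internal-LM subtraction (ILME-style fusion) is applied only where the target-domain LM scores higher; otherwise plain shallow fusion is kept. The fusion weights and the exact comparison rule are illustrative assumptions, not the paper's precise formulation.

```python
# Hypothetical sketch of the adaptive fusion decision: internal-LM subtraction
# (ILME-style fusion) is applied only for tokens where the target-domain LM
# scores higher than the internal LM; otherwise plain shallow fusion is kept.
# Weights and the comparison rule are illustrative assumptions.
import torch

def adaptive_lm_fusion(asr_logp, internal_lm_logp, target_lm_logp,
                       lm_weight=0.3, ilm_weight=0.3):
    """All inputs are per-token log-probabilities over the vocabulary, shape (vocab,)."""
    fused = asr_logp + lm_weight * target_lm_logp - ilm_weight * internal_lm_logp
    plain = asr_logp + lm_weight * target_lm_logp
    use_ilme = target_lm_logp > internal_lm_logp       # per-token decision
    return torch.where(use_ilme, fused, plain)

vocab = 5000
scores = adaptive_lm_fusion(torch.randn(vocab).log_softmax(-1),
                            torch.randn(vocab).log_softmax(-1),
                            torch.randn(vocab).log_softmax(-1))
next_token = scores.argmax()                            # greedy pick for one decoding step
```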

Effect presentation: To verify the effectiveness of the method, the Volcano Speech team used a Chinese speech recognition system trained on 100,000 hours of data as the general-domain recognition system, and defined medical and novel search as the specific domains. The results show an 18.6% relative word error rate reduction in the specific domains, with only a 2.4% relative word error rate increase in the general domain.

For a long time, the Volcano Speech team has provided high-quality voice AI capabilities and full-stack voice product solutions for ByteDance's internal business lines, and has offered them externally through Volcano Engine. Since its establishment in 2017, the team has focused on research and development of industry-leading AI speech technology and has continuously explored efficient combinations of AI and business scenarios to achieve greater user value.
