Speech Recognition: A First Look (Speech Signal Processing Learning Notes, Part 2)

Starting from Part 2 of these notes, I am working through "Li Hongyi: Deep Learning for Human Language Processing (Mandarin course, 2020)".

Viewing link (within mainland China):

Speech Recognition (Part 1) - bilibili

Speech recognition model: that is, converting sound into text.

Text: a sequence of tokens. Length: N; number of distinct token types (vocabulary size): V.

Sound: a sequence of vectors. Length: T; dimension: d.
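To pin down this notation, here is a minimal sketch (Python, with made-up sizes; every name in it is mine, not from the course) of the two sides of this mapping:

```python
import numpy as np

# Hypothetical shapes for one utterance, matching the notation above
T, d = 98, 80        # T acoustic frames, each a d-dimensional vector
N, V = 12, 28        # N output tokens drawn from a V-way token inventory

acoustic_features = np.zeros((T, d))  # the "sound" side: a T x d array
token_ids = np.zeros(N, dtype=int)    # the "text" side: N ids in [0, V)
```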

1. Text Token

Token types
  1. Phoneme: a basic unit of sound, roughly the phonetic symbols used to transcribe pronunciation

  2. Grapheme: the smallest unit of a writing system; for English, the 26 letters plus the space and punctuation marks

  3. Word: whole words of the language (the vocabulary size V can become very large)

  4. Morpheme: the smallest meaningful unit, such as the roots and affixes of English words

  5. Bytes: represent the text directly as a sequence of bytes in a common encoding such as UTF-8, so the token inventory is fixed at V = 256
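As a tiny illustration of byte tokens (my own example, not from the course): any string becomes a sequence of UTF-8 bytes, so V is always 256.

```python
text = "语音识别"                      # "speech recognition" in Chinese
tokens = list(text.encode("utf-8"))   # each byte is an integer in 0..255
print(tokens)       # [232, 175, 173, 233, 159, 179, ...]
print(len(tokens))  # 12: these 4 characters take 3 bytes each in UTF-8
```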

What people actually use (the general trend)

The most common choice is the grapheme, which is simple and direct.

2. Functions of the (Speech Recognition) Model

  1. Output word embeddings

  2. Add translation to the model, so it outputs translated text after recognition.

  3. Add intent classification to the model, so it outputs the class of the utterance and captures the speaker's intention.

  4. Add slot filling to the model, so it outputs the key slots of a sentence, such as times and locations (see the sketch below).
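A hypothetical slot-filling result might look like the following; the sentence and slot names are made up for illustration:

```python
# Recognized sentence and the slots extracted from it (hypothetical)
sentence = "I would like to fly to Taipei on November 2nd"
slots = {
    "destination": "Taipei",    # location slot
    "date": "November 2nd",     # time slot
}
```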

3. Acoustic Features

Usually a 25 ms window is used: the signal inside the window is converted into one vector (a frame), and the window then shifts forward by 10 ms. So 1 s of audio yields roughly 100 (overlapping) frames.
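As a quick sanity check of this arithmetic, here is a minimal sketch assuming the 16 kHz sampling rate mentioned below:

```python
# Frame/hop arithmetic for a 25 ms window and a 10 ms hop at 16 kHz
sample_rate = 16_000                  # Hz
win_len = int(0.025 * sample_rate)    # 400 samples per 25 ms window
hop_len = int(0.010 * sample_rate)    # 160 samples per 10 ms hop

one_second = sample_rate              # 16000 samples in 1 s of audio
num_frames = 1 + (one_second - win_len) // hop_len
print(win_len, hop_len, num_frames)   # 400 160 98, i.e. ~100 frames per second
```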

Ways to represent a frame
  1. Sample points: at a 16 kHz sampling rate there are 400 sample points in 25 ms; these 400 raw values can be used directly as the frame

  2. 39-dim MFCC: 39 dimensions in total (typically 13 cepstral coefficients plus their first- and second-order deltas)

  3. 80-dim filter bank output: a total of 80 dimensions

Feature extraction pipeline
  1. First, the waveform is transformed by the DFT, window by window, into a spectrogram (spectrum over time), which can itself be used for training

    When different people speak the same word, the waveforms can look very different, but the spectrograms are basically similar. Some experts can even read off the content of speech from the spectrogram alone.

    DFT (Discrete Fourier Transform) is the key operation for moving the sampled audio into the frequency domain. It is a mathematical transform that converts a discrete time-domain signal (such as a digitized audio waveform) into a discrete frequency-domain representation; it is the discrete counterpart of the continuous Fourier transform and applies to discrete time series.

    The main purpose of the DFT is to decompose a time-domain signal into the amplitude and phase of its different frequency components, which is exactly what is needed to analyze the frequency content of an audio signal.
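As a concrete, deliberately simple sketch of this step, the following NumPy code computes a magnitude spectrogram by applying a windowed DFT to each 25 ms frame (the function name and defaults are mine):

```python
import numpy as np

def spectrogram(wave, sample_rate=16_000, win_ms=25, hop_ms=10):
    """Magnitude spectrogram: DFT of each Hamming-windowed frame."""
    win = int(win_ms / 1000 * sample_rate)   # 400 samples
    hop = int(hop_ms / 1000 * sample_rate)   # 160 samples
    window = np.hamming(win)
    frames = [wave[i:i + win] * window
              for i in range(0, len(wave) - win + 1, hop)]
    # rfft keeps the non-negative frequency bins of the real-input DFT
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: [n_frames, win//2 + 1]

# Toy input: 1 s of a 440 Hz sine wave
t = np.arange(16_000) / 16_000
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)   # (98, 201)
```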

  2. Then the spectrogram is passed through a filter bank, which turns each frame into a single vector

    A filter bank is a set of filters used to perform frequency analysis on the input signal. In acoustic feature extraction, the usual choice is the mel filter bank, a nonlinear filter bank whose design follows the mel scale, a psychoacoustic scale built around how the human ear perceives frequency.
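Here is a minimal NumPy sketch of building such a mel filter bank, using the common mel formula m = 2595 * log10(1 + f / 700); the parameter choices (80 filters, 400-point DFT) are assumptions chosen to match the numbers above:

```python
import numpy as np

def mel_filter_bank(n_mels=80, n_fft=400, sample_rate=16_000):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edges: evenly spaced in mel, converted back to Hz, then to DFT bins
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)

    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        for b in range(lo, mid):
            fbank[i, b] = (b - lo) / max(mid - lo, 1)   # rising slope
        for b in range(mid, hi):
            fbank[i, b] = (hi - b) / max(hi - mid, 1)   # falling slope
    return fbank

fbank = mel_filter_bank()
print(fbank.shape)   # (80, 201): 80 mel channels over 201 spectrogram bins
```

Applying it to the spectrogram from the earlier sketch is just a matrix product, `mel = spec @ fbank.T`, which turns each frame into an 80-dim vector (the "80-dim filter bank output" above).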

  3. Finally, take the log of the filter-bank outputs, then apply a DCT; the result is the MFCC

    DCT stands for Discrete Cosine Transform, a widely used mathematical transform in audio signal and image processing.

    In acoustic feature extraction, the mel filter bank gives the energy in each filter channel. These filter-bank energies still carry a lot of frequency detail, and adjacent channels (and adjacent frames) are highly correlated, so the representation is redundant.

    To reduce the dimensionality and decorrelate the channels while keeping the main information, the log filter-bank energies are converted into cepstral coefficients via the DCT. Cepstral coefficients behave differently from raw spectral values: they have better properties as a representation and are well suited to modeling and analyzing speech and audio signals.

    The coefficients obtained this way are the MFCCs (Mel-Frequency Cepstral Coefficients), a standard acoustic feature in audio signal processing and speech recognition: the continuous audio signal is cut into short frames, and each frame is represented by this small set of coefficients for subsequent analysis.

  4. Summary: waveform → (DFT) → spectrogram → (filter bank) → vectors → (log) → (DCT) → MFCC
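To make the last step concrete, here is a minimal sketch of the log + DCT stage that produces the MFCCs, assuming SciPy and using random numbers to stand in for real filter-bank energies:

```python
import numpy as np
from scipy.fft import dct

# Hypothetical mel filter-bank energies: 98 frames x 80 channels
mel_energies = np.random.rand(98, 80) + 1e-8          # epsilon avoids log(0)

log_mel = np.log(mel_energies)                        # step 1: take the log
cepstra = dct(log_mel, type=2, norm="ortho", axis=1)  # step 2: DCT per frame
mfcc = cepstra[:, :13]                                # keep the first 13 coefficients
print(mfcc.shape)   # (98, 13); deltas and delta-deltas extend this to 39 dims
```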

What people actually use

Many people use the filter-bank output.

4. Introduction to Common Speech Datasets

  • Note: this refers to a table of corpora in the course slides (not reproduced here); the quoted annotations there express each dataset's size as an equivalent duration of audio.

5. Introduction to Commonly Used Speech Models (essentially all seq2seq models)

  • Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS'15]

  • Connectionist Temporal Classification (CTC) [Graves, et al., ICML'06]

  • RNN Transducer (RNN-T) [Graves, ICML workshop'12]

  • Neural Transducer [Jaitly, et al., NIPS'16]

  • Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR'18]

What people actually use: the course presents usage statistics for these models here (not reproduced in this post).


Origin: blog.csdn.net/m0_56942491/article/details/133984621