Speech Recognition - Decoder (WFST, Lattice)

        Decoding: given an acoustic observation sequence O=\left \{ o_{1},o_{2},...,o_{T} \right \}, find the most likely word sequence W=\left \{ w_{1},w_{2},...,w_{N} \right \}. By Bayes' rule:

        W^{*}=\arg\max_{W}P\left ( W|O \right )=\arg\max_{W}P\left ( O|W \right )P\left ( W \right )
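
        Since P(O) does not depend on W, it is dropped from the maximization. In practice the acoustic and language-model probabilities live on very different scales, so decoders work with log probabilities and weight the language model with a scale factor \lambda (the language-model weight):

        W^{*}=\arg\max_{W}\left [ \log P\left ( O|W \right )+\lambda \log P\left ( W \right ) \right ]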

        The purpose of decoding is to find one or more optimal paths from the initial state to the final state in the decoding space.

        The decoder is an important part of the speech recognition system. The main decoding methods are as follows:

        1) Dynamic decoders: dynamic decoders use breadth-first search to generate multiple hypotheses simultaneously in the original search network, relying on pruning to keep the active network from growing too large.

        2) Weighted finite-state transducers (WFSTs): WFST decoders use finite-state automata algorithms to represent and optimize the state-level network structure, and search the graph with shortest-path algorithms.

        3) Multi-pass search: a first pass uses simpler models (e.g., within-word acoustic models and a bigram language model) to generate several hypotheses; these are then rescored with more accurate cross-word models on the N-best list or word lattice obtained in the first pass.

Original dynamic decoder based on Viterbi:

        The original Viterbi-based dynamic decoder uses breadth-first search to generate multiple hypotheses simultaneously in the original search network, relying on pruning to keep the active network from growing too large.

        The dynamic decoding network compiles only the pronunciation dictionary into a state network, and this network constitutes the search space.

        Take a dictionary containing four words as an example.

        The resulting search space can be organized in one of two ways: as a linear dictionary or as a tree dictionary.

1) Linear dictionary

        First, every word in the dictionary is replaced by its corresponding phoneme-state sequence, and these sequences are placed side by side to form a parallel network; the language model then determines how the network loops back on itself. (Figure: schematic of the decoding network using a 1-gram language model.)

        Using 2-gram or 3-gram language models eliminates some rare word combinations, which assists the pruning computation during decoding. (Figures: schematics of the corresponding decoding networks.)

        where sp denotes the short pause (silence) between words.

        Once the parallel (looped) network has been built from the dictionary, it is expanded along the time axis according to the observation sequence (the speech frames), giving a trellis whose x-axis is time and whose y-axis is state.

        In the figure, each group of three states represents one phoneme (the typical 3-state HMM topology).

        Following the Viterbi algorithm, as the frames advance in time, the cumulative probabilities of the competing paths are compared step by step; the path with the highest cumulative probability at the last state of the last word is the optimal path, i.e., the decoded sentence, and decoding ends.

        (See: HMM-decoding problem (Viterbi, A*, beam search), weixin_43284996's blog, CSDN blog)

        The decoding process can be accelerated by pruning, commonly organized around the token-passing (Token Passing) mechanism, as in the sketch below.
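
        As an illustration, here is a minimal sketch of frame-synchronous Viterbi decoding with beam pruning over a generic state network. The inputs (obs_loglikes, transitions) are simplified stand-ins for the real decoding network, not an actual recognizer:

def viterbi_beam(obs_loglikes, transitions, start_states, final_states, beam=10.0):
    """Frame-synchronous Viterbi search with beam pruning.

    obs_loglikes: list over frames; obs_loglikes[t][s] = log p(o_t | state s)
    transitions:  dict mapping state -> list of (next_state, log_prob) pairs
    beam:         hypotheses scoring below (best - beam) are discarded
    """
    # Each active hypothesis: state -> (cumulative log score, state path)
    active = {s: (obs_loglikes[0][s], [s]) for s in start_states}
    for t in range(1, len(obs_loglikes)):
        new_active = {}
        for state, (score, path) in active.items():
            for nxt, logp in transitions.get(state, []):
                cand = score + logp + obs_loglikes[t][nxt]
                # Viterbi recombination: keep only the best path per state.
                if nxt not in new_active or cand > new_active[nxt][0]:
                    new_active[nxt] = (cand, path + [nxt])
        # Beam pruning: drop paths far below the current best score.
        best = max(score for score, _ in new_active.values())
        active = {s: v for s, v in new_active.items() if v[0] >= best - beam}
    # The answer is the best hypothesis that ends in a final state.
    finals = {s: v for s, v in active.items() if s in final_states}
    return max(finals.values(), key=lambda v: v[0])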

        However, this decoding network is formed by connecting the phoneme-state sequences of all words in parallel; when the vocabulary is large, both the storage and the computational cost become very high.

2) Tree dictionary

        To address this problem with linear dictionaries, the tree dictionary was proposed.

        The words are organized into a tree over their phoneme (state) sequences, so that words sharing the same prefix share the same branch. (Figure: tree-structured dictionary.)

        In the tree structure, the word is obtained by following a path of phonemes, whereas in the linear dictionary the word is hypothesized first and its phoneme path is then entered. (A side effect is that the word identity, and hence the language model score, is only available at the end of the phone path; this can be mitigated by introducing the language model score earlier, e.g., via language-model look-ahead.)

        Since many nodes with the same state are merged, the size of the search space is significantly reduced, and with it the computational load of decoding.
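
        A minimal sketch of the idea, assuming each word is given as a list of phonemes (the four example entries below are hypothetical); words sharing a phoneme prefix reuse the same nodes, so the node count drops compared with a linear layout:

def build_prefix_tree(lexicon):
    """Build a phoneme prefix tree; lexicon maps word -> list of phonemes."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})   # shared prefixes reuse nodes
        node["#word"] = word                 # word identity known only at the leaf
    return root

def count_nodes(node):
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key != "#word")

# Hypothetical 4-word lexicon: "start"/"stars" share the prefix s-t-aa-r.
lexicon = {
    "start": ["s", "t", "aa", "r", "t"],
    "stars": ["s", "t", "aa", "r", "z"],
    "stop":  ["s", "t", "aa", "p"],
    "top":   ["t", "aa", "p"],
}
tree = build_prefix_tree(lexicon)
print(count_nodes(tree))   # 10 tree nodes vs. 17 states in a linear layout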

Summary:

        For large vocabularies, the original Viterbi-based dynamic decoder requires a large amount of computation and decodes slowly.

        To speed up decoding:

        On the one hand, pruning can be applied: during decoding, paths whose scores fall outside the pruning threshold are discarded immediately, with no further processing.

        On the other hand, the knowledge sources can be pre-compiled into a static network (a WFST) and used directly during decoding.

Viterbi static decoder based on WFST:

        To speed up decoding, the dynamic knowledge sources can be compiled in advance into a static network that is consulted directly during decoding: given an HMM state sequence as input, the network directly yields the word sequence and its associated score.

        Let H, C, L, and G denote the WFST forms of the HMM topology, the triphone context-dependency, the pronunciation dictionary, and the language model, respectively.
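
        These four transducers are composed offline into a single static decoding graph, with determinization and minimization interleaved to keep the result compact; a commonly cited construction (following Mohri et al.) is:

        HCLG=\min\left ( \det\left ( H\circ \det\left ( C\circ \det\left ( L\circ G \right ) \right ) \right ) \right )

        where \circ denotes composition, \det determinization, and \min minimization.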

(For an introduction to WFSTs and the network construction process, see: Weighted Finite-State Transducer/WFST, weixin_43284996's blog, CSDN blog)

        In this decoder, the acoustic part (mapping the observation sequence to HMM states) must still be computed separately from the input features; all other knowledge is already encoded in the static HCLG network, expressed through the input labels, output labels, and weights of its transition arcs.

        Since the static network fully expands the search space in advance, decoding only needs to combine the acoustic probabilities with the cumulative path scores along the transition weights, so it is very fast.

        The decoding process uses the token-passing (Token Passing) mechanism, which is in fact a generalized form of Viterbi decoding.
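
        A minimal sketch of token passing over such a static network, under a simplified arc structure (each arc carries an input HMM-state id, an output word label, and a graph cost stored as a negative log probability); epsilon (non-emitting) arcs and final-state weights are ignored for brevity:

from dataclasses import dataclass

@dataclass
class Arc:
    next_state: int
    ilabel: int      # HMM-state (pdf) id consumed from the acoustics
    olabel: str      # word label emitted ("" for none)
    weight: float    # graph cost (negative log probability)

@dataclass
class Token:
    cost: float      # accumulated graph + acoustic cost
    words: list      # word labels emitted so far

def token_passing(arcs_from, start, acoustic_cost, num_frames, beam=12.0):
    """arcs_from: dict state -> [Arc]; acoustic_cost(t, ilabel) -> float."""
    tokens = {start: Token(0.0, [])}
    for t in range(num_frames):
        new_tokens = {}
        for state, tok in tokens.items():
            for arc in arcs_from.get(state, []):
                cost = tok.cost + arc.weight + acoustic_cost(t, arc.ilabel)
                words = tok.words + [arc.olabel] if arc.olabel else tok.words
                old = new_tokens.get(arc.next_state)
                if old is None or cost < old.cost:   # Viterbi recombination
                    new_tokens[arc.next_state] = Token(cost, words)
        best = min(tok.cost for tok in new_tokens.values())
        tokens = {s: tok for s, tok in new_tokens.items()
                  if tok.cost <= best + beam}        # beam pruning
    best_token = min(tokens.values(), key=lambda tok: tok.cost)
    return best_token.cost, best_token.words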

        In actual use, however, it is hard to guarantee that the single optimal path from Viterbi decoding is the most suitable output (for example, when the language scenario changes), so a lattice is often used to preserve multiple candidate recognition results for subsequent processing.

Lattice static decoder based on WFST:

        In the traditional speech recognition pipeline, acoustic model training consumes substantial resources, so the acoustic model is not updated frequently; this makes it hard to quickly optimize the recognizer for a specific scenario. On the other hand, to keep decoding efficient, pruning of the decoding graph is a necessary step, and plain beam search is commonly used. The lattice was introduced with both of these issues in mind.

        During decoding, pruning is applied so that the N best paths are retained; the WFST obtained after determinizing these retained paths is the lattice.

        Like HCLG, a lattice takes HMM states as input and produces word sequences as output. In Kaldi, however, lattice generation and storage are optimized (the CompactLattice form) so that both input and output are word labels, with the HMM-state (transition) information stored on the arcs. For this reason the lattice is usually called a word graph or word lattice.

        In this way, the decoding process of speech recognition is divided into two passes: the first pass decodes the input to produce a lattice, and the second pass runs a shortest-path search over the lattice to obtain the 1-best decoding result.
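
        A minimal sketch of the second pass, assuming the lattice is stored as an acyclic graph of word arcs, each carrying separate acoustic and graph (language model) costs in the spirit of Kaldi's CompactLattice; the 1-best result is the shortest path under the combined cost:

from dataclasses import dataclass

@dataclass
class LatArc:
    dest: int
    word: str
    acoustic_cost: float
    graph_cost: float    # language-model + pronunciation cost

def shortest_path(arcs_from, start, final, topo_order, ac_scale=1.0):
    """1-best over a lattice DAG, given a topological ordering of its states."""
    INF = float("inf")
    best = {s: (INF, None, None) for s in topo_order}   # (cost, prev state, arc)
    best[start] = (0.0, None, None)
    for s in topo_order:
        cost = best[s][0]
        if cost == INF:
            continue
        for arc in arcs_from.get(s, []):
            c = cost + ac_scale * arc.acoustic_cost + arc.graph_cost
            if c < best[arc.dest][0]:
                best[arc.dest] = (c, s, arc)
    # Trace back the word sequence from the final state.
    words, s = [], final
    while best[s][1] is not None:
        _, prev, arc = best[s]
        if arc.word:
            words.append(arc.word)
        s = prev
    return best[final][0], words[::-1]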

        The specific steps are:

        1) N-best pruning: when the original token-passing algorithm creates a new word link record (WLR), only the single token with the highest score (likelihood) is saved; the improved algorithm instead saves the N highest-scoring tokens. The lattice is then generated through the ForwardLink mechanism (as in Kaldi).

        There is one token list per frame, which can be associated with the frame index. ForwardLinks record how tokens propagate along emitting arcs, and a linked list chains together all tokens across all states of the current frame.

        2) Lattice construction: at the end of the utterance, all of the recorded history is traced back to form a word lattice. This lattice contains the acoustic model and language model scores, as well as the recognized words and their corresponding time steps. The N best paths (hypotheses) can then be found in this lattice.

        3) 1-best rescoring: the original paths in the lattice are rescored and the optimal one is selected. Rescoring can use a larger language model, or one more relevant to the target domain, so the decoding result is improved while the acoustic model remains unchanged.
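
        A minimal sketch of this step, rescoring an N-best list extracted from the lattice; new_lm_logprob stands in for whatever replacement language model is used:

def rescore_nbest(nbest, new_lm_logprob, lm_weight=1.0):
    """nbest: list of (words, acoustic_logprob, old_lm_logprob) entries.

    The old language model score is replaced by the new model's score;
    the acoustic score is reused as-is, so no re-decoding is required.
    """
    rescored = []
    for words, ac_logprob, _old_lm_logprob in nbest:
        total = ac_logprob + lm_weight * new_lm_logprob(words)
        rescored.append((total, words))
    return max(rescored)   # (best combined score, best word sequence)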

        On top of the above decoders, hybrid static/dynamic decoding techniques are sometimes used for further optimization:

        1) Lattice Pruning: the lattice can be very large when first generated, since each state stores multiple tokens, yet a large part of it contributes little to the optimal path, so the lattice can be pruned without reducing accuracy. Specifically, first find the optimal path and its likelihood; then, for each node or arc, compute the maximum forward and backward likelihoods (forward and backward scores) and add them to obtain the likelihood of the best path passing through that arc, which serves as the arc's posterior probability; arcs with very low posterior probability are then deleted, achieving the pruning (see the sketch after this list).

        2) Acoustic Model Rescoring: a word lattice can be used as a finite-state constrained grammar for rescoring with a more complex acoustic model. During such rescoring, arcs that were duplicated only because of differing times and phoneme contexts are merged, yielding a smaller lattice. (Figure: the arcs for the word "TO" before and after merging.)

        3) Language Model Rescoring: the original word lattice can also be rescored with a new language model, without needing to consider the acoustic model. For example, a lattice generated with a lower-order n-gram model can be rescored with a higher-order one, which introduces some additional paths. (Figure: the lattice before and after rescoring.)

        4) Confusion Network: a confusion network is generated by first finding the optimal path and then aligning the remaining arcs to it one by one. If adding a certain arc would exceed the length of the existing network, the arc is still added to the confusion network, and a !NULL arc is inserted into the previous optimal path to keep the alignment consistent.
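
        A minimal sketch of the lattice pruning step above, using a Viterbi-style forward-backward computation; it assumes the lattice is an acyclic graph whose arcs are listed in topological order of their source states, with scores as log probabilities:

import math

def prune_lattice(states, arcs, start, final, threshold=8.0):
    """Drop lattice arcs whose best path falls far below the global best.

    states: all state ids; arcs: list of (src, dst, logprob) triples,
    topologically sorted by src. Arcs whose best through-path is more
    than `threshold` below the best total path (log domain) are removed.
    """
    alpha = {s: -math.inf for s in states}   # best forward score to each state
    beta = {s: -math.inf for s in states}    # best backward score from each state
    alpha[start] = 0.0
    beta[final] = 0.0
    for src, dst, logprob in arcs:           # forward (Viterbi) pass
        alpha[dst] = max(alpha[dst], alpha[src] + logprob)
    for src, dst, logprob in reversed(arcs): # backward (Viterbi) pass
        beta[src] = max(beta[src], logprob + beta[dst])
    best_total = alpha[final]
    # An arc's "posterior" score: best path through it vs. the global best.
    return [(src, dst, logprob) for src, dst, logprob in arcs
            if alpha[src] + logprob + beta[dst] >= best_total - threshold]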

Summary:

         The purpose of decoding is to find one or more optimal paths from the initial state to the final state in the decoding space. 

        The original Viterbi-based dynamic decoder uses breadth-first search to generate multiple hypotheses simultaneously in the original search network, relying on pruning to keep the active network from growing too large. Tree dictionaries can be used instead of linear dictionaries to reduce the space complexity.

        To further improve decoding speed, the knowledge sources can be pre-compiled into a static network that is used directly during decoding; that is, a WFST is generated.

        On the WFST, the token-passing form of the Viterbi algorithm can be used to find the best path.

        Because scenarios differ, the optimal path computed by Viterbi is not necessarily the most reasonable one, so the lattice is introduced: the N best paths are retained, and the most reasonable one is selected after rescoring.

        The constructed lattice can be further optimized with the hybrid static/dynamic decoding techniques described above.

Origin: blog.csdn.net/weixin_43284996/article/details/127465939