NLP - Machine Translation

Why translation tasks are difficult

(figure)

Statistical Machine Translation

(figure)
Statistics-based machine translation usually builds a stronger overall system by combining a translation model (Translation Model) with a language model (Language Model). This combination is known as statistical machine translation (SMT).

  • Translation Model: The translation model focuses on how to translate source language sentences into target language sentences. It is trained on bilingual corpora to learn translation probabilities between the source and target languages. The phrase-based translation model is one of the most commonly used models in SMT: it segments source and target sentences into phrases and estimates translation probabilities between phrase pairs. Syntax-based translation models additionally use grammatical structure information for translation.

  • Language Model: The language model focuses on the fluency and naturalness of sentences in the target language. It is trained on a monolingual corpus of the target language and learns a probability distribution over target-language sentences. The language model helps the translation model choose more reasonable translation candidates and thus improves translation quality.

In SMT, the combination of the translation model and the language model is modeled as a joint probability. Given a source language sentence, the translation model computes the probability of translating it into a candidate target language sentence, the language model computes the probability of that target sentence, and the two are combined to obtain the final translation. The model parameters are usually trained with maximum likelihood estimation or maximum a posteriori estimation.
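
In formula form, this is the standard noisy-channel decomposition, writing f for the source sentence and e for the target sentence (consistent with the P(f|e) notation used below):

$$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \; \underbrace{P(e)}_{\text{language model}}$$
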
(figure)

  • So how do we learn P(f|e) from corpora?
    (figure)
    • These given sentence pairs are not aligned

Alignment

(figure)

  • Parallel Corpora: A parallel corpus is a collection of sentences or phrases with a correspondence between the source language and the target language. Such corpora are usually produced by human translation or alignment and are the ideal data for machine translation. With an existing parallel corpus, sentences or phrases can be aligned directly between the source and target languages.

  • Word Alignment: Word-alignment-based methods try to map words in the source language to words in the target language in order to align sentences or phrases. Common word alignment algorithms include the IBM models and statistical phrase alignment models. These models use statistical methods to learn word alignment probabilities between the source and target languages (see the IBM Model 1 sketch after this list).

  • Phrase Alignment: Phrase-alignment-based methods match phrases in the source and target languages to achieve sentence- or paragraph-level alignment. Common phrase-alignment methods likewise include the IBM models and statistical phrase alignment models; they are based on statistical and probabilistic models that use bilingual corpora to learn phrase alignment probabilities between the source and target languages.

  • Syntax-based Alignment: Syntax-based methods attempt to leverage the syntactic structure of the source and target languages for alignment. They usually use a syntactic parser to analyze the syntactic structure of both languages and align based on correspondences between those structures.

  • Neural Network-Based Alignment: In recent years, with the development of neural networks, neural methods have made progress on alignment tasks. These methods use neural network models to learn correspondences between the source and target languages, and can use attention mechanisms or encoder-decoder models to perform alignment.
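
As a concrete illustration of the word-alignment idea mentioned above, here is a minimal sketch of IBM Model 1 trained with EM on a toy corpus. The corpus, the iteration count, and the omission of NULL alignment and smoothing are simplifications for illustration, not a production implementation:

```python
from collections import defaultdict

# Toy parallel corpus of (source, target) sentence pairs -- purely illustrative.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# Initialize t(f|e) uniformly over the source vocabulary.
src_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(float)
for fs, es in corpus:
    for f in fs:
        for e in es:
            t[(f, e)] = 1.0 / len(src_vocab)

# A few EM iterations of IBM Model 1.
for _ in range(10):
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)        # normalization over this sentence pair
            for e in es:
                delta = t[(f, e)] / z
                count[(f, e)] += delta
                total[e] += delta
    for (f, e) in count:                           # M-step: re-estimate t(f|e)
        t[(f, e)] = count[(f, e)] / total[e]

# After training, pairs like ("haus", "house") should receive high probability.
print(sorted(t.items(), key=lambda kv: -kv[1])[:5])
```
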

The following situations may occur during alignment:
(figure)

Summary

(figure)

Neural Machine Translation

(figure)
Encoder-decoder structure: Neural network-based translation models usually adopt an encoder-decoder structure. The encoder encodes the source language sentence into a fixed-length vector (a semantic representation), and the decoder uses this vector to generate the translation in the target language. Encoders and decoders typically use models such as recurrent neural networks (RNNs) or Transformers to model context and sequence information.
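
As a rough sketch of this structure, the following PyTorch code builds a GRU encoder that compresses the source sentence into a final hidden state and a GRU decoder that generates from it. The module names, sizes, and the fake batch are illustrative assumptions, not the exact model from the figures:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                   # hidden: (1, batch, hid_dim) "semantic vector"

class Decoder(nn.Module):
    def __init__(self, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, tgt_in, hidden):           # tgt_in: (batch, tgt_len)
        outputs, hidden = self.rnn(self.embed(tgt_in), hidden)
        return self.out(outputs), hidden         # logits: (batch, tgt_len, tgt_vocab)

# The encoder's final hidden state initializes the decoder, as described above.
enc, dec = Encoder(src_vocab=1000), Decoder(tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))             # a fake batch of source sentences
tgt_in = torch.randint(0, 1200, (2, 5))          # decoder inputs (shifted target tokens)
_, hidden = enc(src)
logits, _ = dec(tgt_in, hidden)
print(logits.shape)                              # torch.Size([2, 5, 1200])
```
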

(figure)

  • First, an encoder encodes the source sentence into a hidden vector. You can think of this vector as containing all the semantic information of the original sentence

  • Then a decoder network is built to decode this hidden vector
    (figure)

  • The model responsible for decoding can be called a conditional language model

  • A Conditional Language Model is a variant of the language model that considers not only the preceding context but also a given condition when generating text.

  • Traditional language models (such as n-gram models) predict the next word from the probability distribution given the previous n-1 words. A conditional language model introduces additional conditioning information so that the generated text depends on that condition. The condition can take many forms, such as a context sentence, a specific topic, or a given input; in translation it is the source sentence (see the factorization below).

    (figure)
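
In translation, the condition x is the source sentence, and the conditional language model factorizes the probability of the target sentence y autoregressively (standard notation):

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$$
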

How to train Neural MT

(figure)

  • Still need parallel datasets
  • Train in the same way as training a language model

Loss setting

  • At decoding time, we can compute the loss at each step from the gap between the output at each time step and the corresponding label, and then sum these to obtain the decoder's total loss
    (figure)
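
Written out, the total loss is the sum of the per-step cross-entropy terms between the predicted distribution and the ground-truth token y_t^* (with teacher forcing at training time):

$$\mathcal{L} = -\sum_{t=1}^{T} \log P\left(y_t^{*} \mid y_{<t}^{*},\, x\right)$$
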

Training

(figure)

  • The training process is straightforward because we have the target sentence: the ground-truth tokens are fed to the decoder as inputs (teacher forcing), and we just follow the steps above (a minimal training step is sketched below)
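
A minimal teacher-forcing training step, reusing the hypothetical `enc` and `dec` modules from the sketch above; the optimizer, learning rate, and batch shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Reuses the hypothetical `enc` and `dec` modules from the encoder-decoder sketch above.
optimizer = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def train_step(src, tgt):
    # tgt: (batch, tgt_len) gold token ids, including <start> ... <end>
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]    # teacher forcing: gold tokens as inputs
    _, hidden = enc(src)
    logits, _ = dec(tgt_in, hidden)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a fake batch, just to show the shapes involved.
print(train_step(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1200, (2, 6))))
```
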

Decoding at test time

  • But during evaluation the situation is a bit more complicated, because there is no target sentence
    (figure)
  • Therefore, when predicting, we can only pick the output at each time step according to the highest probability
    (figure)
  • This greedy algorithm leads to a problem: exposure bias
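
A sketch of this greedy decoding loop, again reusing the hypothetical `enc` and `dec` modules from above; the <start>/<end> token ids and the maximum length are assumed values:

```python
import torch

# Reuses the hypothetical `enc` and `dec` modules from the encoder-decoder sketch.
@torch.no_grad()
def greedy_decode(src, start_id=1, end_id=2, max_len=20):
    _, hidden = enc(src)                         # src: (1, src_len), a batch of one
    token = torch.tensor([[start_id]])
    result = []
    for _ in range(max_len):
        logits, hidden = dec(token, hidden)      # logits: (1, 1, vocab)
        token = logits[:, -1].argmax(dim=-1, keepdim=True)   # pick the locally best token
        if token.item() == end_id:
            break
        result.append(token.item())
    return result

print(greedy_decode(torch.randint(0, 1000, (1, 7))))
```
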

Exposure Bias

Exposure bias: The model generates target language sentences autoregressively, that is, each time step takes the previous words as input. During training, those previous words are the ground-truth target words, but in the evaluation stage the model must rely on its own previously generated words to generate subsequent words. This leads to exposure bias: a gap between what the model sees during training and what it sees during evaluation. When the decoder conditions on its own generated words, its errors can gradually accumulate and lead to inaccurate generation. Even if the model predicts well when given the real target words during the training phase, the same quality cannot be guaranteed during the inference phase.

(figure)

  • An important reason for this problem is that the model uses a greedy decoding method
  • To mitigate it, a decoding approach that takes a more global view can be adopted

Exhaustive Search Decoding

(figure)

  • However, exhaustive search is clearly impractical to implement
  • Assuming the length of a sentence is n, the complexity of decoding is O(V^n), where V is the size of the vocabulary

Beam Search Decoding

(figure)
In sequence generation tasks, a neural network model produces an output sequence, such as a target language sentence. Beam search decoding progressively expands the search and looks for the best sequence by retaining the top k most likely candidate sequences at each time step (a code sketch is given after the example below).

  • When k=1, this is the greedy algorithm
  • When k=V, it is exhaustive search

Here is an example:

(figure)

  • Suppose our model has finished training and is now in the decoding stage at evaluation time

  • First, at the first time step, the logits of cow and of tiger are as shown in the figure

  • At the second time step, we compute the probabilities of the subsequent tokens for cow and for tiger, taking each in turn as the decoding result of the first step
    (figure)

  • At the end of the second time step, we keep the k paths with the highest probability as the current best decoding results; since k=2 here, we keep only cow eats and tiger bites for subsequent decoding

  • The same process repeats:
    (figure)

  • At the third step, cow eats grass and cow eats carrot become the two highest-scoring paths

  • Generation then continues:
    (figure)

  • When the <end> symbol is encountered, the generation process ends, and the path with the highest score is taken as the final decoding result

When does decoding stop

  • When <end> is generated
  • Or when a preset maximum generation length is reached
    (figure)
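
Putting the pieces together, here is a minimal beam-search sketch over the same hypothetical decoder: it keeps the k highest-scoring partial hypotheses, expands each by its top-k next tokens, and stops a hypothesis when <end> is produced or the maximum length is reached. The token ids, k, and the lack of length normalization are simplifications:

```python
import torch
import torch.nn.functional as F

# Reuses the hypothetical `enc` and `dec` modules from the encoder-decoder sketch.
@torch.no_grad()
def beam_search(src, k=2, start_id=1, end_id=2, max_len=20):
    _, hidden = enc(src)                                     # src: (1, src_len)
    beams = [([start_id], 0.0, hidden)]                      # (tokens, total log prob, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, h in beams:
            if tokens[-1] == end_id:                         # stop expanding finished beams
                finished.append((tokens, score))
                continue
            logits, h_new = dec(torch.tensor([[tokens[-1]]]), h)
            log_probs = F.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_ids = log_probs.topk(k)              # expand by the k best next tokens
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp, h_new))
        if not candidates:                                   # every beam has finished
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]  # keep top k
    finished.extend((tokens, score) for tokens, score, _ in beams)
    return max(finished, key=lambda c: c[1])[0]              # highest-scoring hypothesis

print(beam_search(torch.randint(0, 1000, (1, 7)), k=2))
```
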

Summary

(figure)

Attention Mechanism

(figure)

  • The final encoder vector produced by this RNN model is not of particularly high quality (relatively long-distance relationships are not captured well)
  • The quality of this encoded vector directly affects the result at the decoding end, so this phenomenon is called the information bottleneck
  • The attention mechanism is used to solve this problem

Attention

(figure)
The attention mechanism solves this problem by dynamically weighting the hidden states of the encoder to provide more comprehensive contextual information to the decoder. It allows the decoder to adaptively focus on different parts of the input sequence when generating each target language word.

The following are the basic steps of an encoder-decoder model with an attention mechanism:

  • Encoder: The input sequence (the source language sentence) is encoded by the encoder, a recurrent neural network (RNN), Transformer, or similar model, to obtain a series of hidden states.

  • Attention calculation: At each decoder time step, attention weights are computed from the decoder's current hidden state and the encoder's hidden states. These weights represent how much the decoder attends to each encoder hidden state.

  • Context Vector: A weighted sum of the encoder hidden states, using the attention weights, yields a context vector. This context vector aggregates the information in the encoder hidden states according to the attention weights.

  • Decoder generation: The context vector is fed into the decoder together with the decoder's current input (usually a previously generated target language word) to generate a probability distribution for the next target language word.

Repeat steps 2 to 4 until a complete target language sequence is generated.

Through the attention mechanism, the decoder can adaptively focus on different parts of the input sequence and thus better capture the relevant information in it. Applying attention to translation tasks improves translation quality and the handling of long-distance dependencies.
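
A minimal sketch of one decoder step of dot-product attention over the encoder hidden states (shapes and names are illustrative; real systems often add learned projections or other scoring functions):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(dec_state, enc_outputs):
    """
    dec_state:   (batch, hid)          current decoder hidden state
    enc_outputs: (batch, src_len, hid) all encoder hidden states
    Returns the context vector (batch, hid) and attention weights (batch, src_len).
    """
    # Step 2: scores = similarity between the decoder state and each encoder state.
    scores = torch.bmm(enc_outputs, dec_state.unsqueeze(-1)).squeeze(-1)  # (batch, src_len)
    weights = F.softmax(scores, dim=-1)
    # Step 3: context vector = attention-weighted sum of encoder hidden states.
    context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)     # (batch, hid)
    return context, weights

# Step 4 (not shown): concatenate `context` with the decoder state or input and
# pass it through the output layer to predict the next target word.
ctx, w = dot_product_attention(torch.randn(2, 128), torch.randn(2, 7, 128))
print(ctx.shape, w.shape)                        # torch.Size([2, 128]) torch.Size([2, 7])
```
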

(figure)

  • The core step of decoding with attention is to compute, at each decoder time step, the similarity between the decoder hidden state and each encoder hidden state, and to use these similarities to weight the encoder states into a dynamic vector for decoding (replacing the original scheme, which produced only a single vector at the last encoder time step); the corresponding equations are given below
    (figure)
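
In equations, writing s_t for the decoder hidden state at step t and h_i for the i-th encoder hidden state (dot-product scoring shown here; other scoring functions exist):

$$e_{t,i} = s_t^{\top} h_i, \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}, \qquad c_t = \sum_{i} \alpha_{t,i}\, h_i$$
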

Attention summary

(figure)

Evaluation

BLEU (Bilingual Evaluation Understudy): BLEU is a commonly used automatic evaluation metric for measuring the similarity between machine translation output and reference translations. It computes a score from the n-gram overlap between a candidate translation and one or more reference translations, combined with a brevity penalty.
(figure)
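
As a rough illustration of the computation, here is a single-sentence BLEU sketch with clipped n-gram precisions up to 4-grams and a brevity penalty. Real evaluations use corpus-level BLEU from tools such as sacrebleu; the lack of smoothing here is a simplification:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    # Clipped (modified) n-gram precisions for n = 1..max_n.
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0                                # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(sentence_bleu("there is a cat on the mat".split(),
                    "there is a cat on the table".split()))
```
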

Original article: blog.csdn.net/qq_42902997/article/details/131215305