Why translation tasks are difficult
Statistical Machine Translation
Statistical machine translation (SMT) learns a stronger overall translation system by combining two components: a translation model and a language model.
- Translation Model: The translation model focuses on how source-language sentences are translated into target-language sentences. It is trained on bilingual corpora to learn translation probabilities between the source and target languages. Phrase-based models, among the most common choices in SMT, segment source and target sentences into phrases and estimate translation probabilities between phrase pairs; syntax-based models additionally exploit grammatical structure.
- Language Model: The language model focuses on the fluency and naturalness of sentences in the target language. It is trained on a monolingual corpus of the target language and learns a probability distribution over target-language sentences. The language model helps the translation model choose more natural translation candidates and improves translation quality.
In SMT, the two models are combined through a joint probability. Given a source sentence, the translation model scores how well each candidate target sentence explains the source, the language model scores how fluent each candidate is, and the product of the two determines the final translation. The model parameters are usually trained with maximum likelihood estimation or maximum a posteriori estimation.
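This combination is the classic noisy-channel formulation. Writing f for the source sentence and e for the target sentence (matching the P(f|e) notation used below), the best translation is:

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \, \underbrace{P(e)}_{\text{language model}}
```

The constant denominator P(f) from Bayes' rule is dropped because it does not affect the argmax.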
- So how do we learn P(f|e) from parallel corpora?
- The sentence pairs we are given are not word-aligned, so we first need to solve the alignment problem
Alignment
- Parallel Corpora: A parallel corpus is a collection of sentences or phrases with known correspondences between the source and target languages. Such corpora are usually produced by human translation or alignment and are the ideal data for machine translation. With an existing parallel corpus, sentences or phrases can be aligned directly between the two languages.
- Word Alignment: Word-alignment methods try to put the words of the source and target languages into correspondence in order to align sentences or phrases. The most common word-alignment algorithms are the IBM models, which use statistical methods to learn word-alignment probabilities between the source and target languages.
- Phrase Alignment: Phrase-alignment methods match phrases in the source and target languages to achieve sentence- or paragraph-level alignment. Statistical phrase-alignment models use probabilistic models trained on bilingual corpora to learn phrase-alignment probabilities between the two languages.
- Syntax-based Alignment: Syntax-based methods attempt to exploit the syntactic structure of the source and target languages for alignment. They typically run a syntactic parser over both sides and align based on correspondences between the resulting structures.
- Neural Network-Based Alignment: In recent years, with the development of neural networks, neural methods have made progress on alignment tasks. They use neural models to learn correspondences between the source and target languages, for example via attention mechanisms or encoder-decoder models.
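As a concrete illustration of how word-alignment probabilities can be learned from a parallel corpus, here is a minimal sketch of IBM Model 1 trained with EM on a toy corpus (the sentence pairs and variable names are invented for illustration):

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=20):
    """Learn word translation probabilities t(f|e) with EM (IBM Model 1)."""
    src_vocab = {f for fs, es in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm  # posterior that e generated f
                    count[(f, e)] += delta
                    total[e] += delta
        for f, e in count:  # M-step: re-estimate t(f|e)
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy parallel corpus (German -> English), purely illustrative
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
t = ibm_model1(pairs)
```

Because "das" co-occurs with "the" in two sentence pairs while the other source words are explained by their own counterparts, EM gradually concentrates t(das|the) at the expense of the spurious pairings.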
The following situations may occur during alignment: one-to-one, one-to-many (a single source word producing several target words), many-to-one, many-to-many, and words on either side that align to nothing at all.
Summary
Neural Machine Translation
Encoder-decoder structure: Neural network-based translation models usually adopt an encoder-decoder structure. The encoder is responsible for encoding the source language sentence into a fixed-length vector (semantic representation), and the decoder uses this vector to generate the translation result in the target language. Encoders and decoders typically use models such as Recurrent Neural Networks (RNNs) or Transformers to model context and sequence information.
- First, an encoder network encodes the source sentence into a hidden vector; you can think of this vector as containing all the semantic information of the original sentence
- Then a decoder network is built to decode this hidden vector
- The model responsible for decoding can be called a conditional language model
- A Conditional Language Model is a variant of a language model that, when generating text, conditions not only on the preceding context but also on a given input.
- Traditional language models (such as n-gram models) predict the next word from a probability distribution conditioned on the previous n-1 words. A conditional language model introduces additional conditioning information so that the generated text depends on it; the condition can be a context sentence, a specific topic, or a given input such as the source sentence.
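In the translation setting, writing x for the source sentence and y = (y_1, …, y_T) for the target sentence, the conditional language model factorizes the translation probability autoregressively:

```latex
P(y \mid x) = \prod_{t=1}^{T} P\left(y_t \mid y_1, \ldots, y_{t-1},\, x\right)
```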
How to train Neural MT
- Still need parallel datasets
- Train in the same way as training a language model
Loss setting
- At decoding time, we compute a loss at each step from the gap between the output of that time step and its label, and sum these to obtain the total loss at the decoder
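This per-step loss can be sketched as softmax cross-entropy against the label at each time step, summed over the sequence (the shapes and names here are illustrative, not a specific framework's API):

```python
import numpy as np

def sequence_loss(step_logits, target_ids):
    """Sum of per-step cross-entropy losses.

    step_logits: array of shape (T, V), unnormalized decoder outputs
    target_ids:  length-T list of label token ids
    """
    total = 0.0
    for logits, y in zip(step_logits, target_ids):
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        total += -np.log(probs[y])  # cross-entropy at this time step
    return total
```

The better the decoder's output at a step matches its label, the smaller that step's contribution to the total loss.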
Training
- The training process is straightforward: since we have the target sentence, we simply feed its tokens in step by step (teacher forcing)
Decoding at Test Time
- During evaluation, however, the situation is more complicated, because there is no target sentence
- Therefore, at prediction time, we can only pick the output of each time step by taking a high-probability token
- This greedy algorithm leads to a problem: exposure bias
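A sketch of this greedy procedure over a toy next-token distribution (the `table` of probabilities and the names are invented; they are chosen so that the locally best first token leads to a globally worse sentence):

```python
def greedy_decode(step_fn, start="<s>", end="<end>", max_len=10):
    """Pick the single most probable token at every step."""
    seq = [start]
    for _ in range(max_len):
        dist = step_fn(seq)                  # dict: token -> probability
        seq.append(max(dist, key=dist.get))  # commit to the local best
        if seq[-1] == end:
            break
    return seq

# Toy distributions: "tiger" wins the first step (0.55 vs 0.45),
# but the "cow" path has higher total probability (0.45*0.9 > 0.55*0.6).
table = {
    ("<s>",): {"tiger": 0.55, "cow": 0.45},
    ("<s>", "tiger"): {"<end>": 0.6, "bites": 0.4},
    ("<s>", "cow"): {"eats": 0.9, "<end>": 0.1},
    ("<s>", "cow", "eats"): {"grass": 1.0},
    ("<s>", "cow", "eats", "grass"): {"<end>": 1.0},
}
step_fn = lambda seq: table.get(tuple(seq), {"<end>": 1.0})
```

Here greedy decoding commits to "tiger" and immediately ends, even though "cow eats grass" has the higher overall probability.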
Exposure Bias
Exposure Bias: During training, the model generates target-language sentences autoregressively, with each time step fed the ground-truth previous words. At inference time, however, the model must rely on its own previously generated words to produce the next word. This mismatch between what the model sees during training and what it sees during inference is called exposure bias. When the decoder conditions on its own earlier outputs, its errors can accumulate step by step and degrade the generation. Even if the model predicts well when given the real target words during training, the same quality is not guaranteed at inference time.
- An important cause of this problem is that the model decodes greedily
- To mitigate it, a decoding approach that takes a more global view can be adopted
Exhaustive Search Decoding
- However, exhaustive search is clearly impractical to implement
- If a sentence has length n, decoding must consider O(V^n) candidate sequences, where V is the vocabulary size; even a modest V = 50,000 and n = 10 gives on the order of 10^47 sequences
Beam Search Decoding
In sequence generation tasks, a neural network model generates an output sequence, such as a target-language sentence. Beam search decoding progressively expands and searches for the optimal sequence by retaining the top k most likely candidate sequences at each time step.
- When k=1, beam search reduces to the greedy algorithm
- When k=V, it reduces to exhaustive search
Here is a worked example:
- Suppose our model has finished training and is now decoding at evaluation time
- At the first time step, the logits for cow and tiger are as shown in the figure
- At the second time step, we compute the follow-on probabilities with cow and with tiger taken as the first-step decoding result
- At the end of the second time step, we keep the k paths with the highest probability as candidates for the global optimum; here k=2, so we keep only cow eats and tiger bites for subsequent decoding
- The same process repeats:
- At the third step, cow eats grass and cow eats carrot become the two highest-scoring paths
- Generation then continues:
- When the <end> symbol is produced, the generation process ends, and the final decoding result is the highest-scoring path
When does decoding stop
- When the <end> symbol is generated
- Or when a preset maximum number of generated tokens is reached
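The whole procedure, including both stopping criteria, can be sketched in a few lines. The toy `table` of next-token probabilities is invented to mirror the cow/tiger example above:

```python
import math

def beam_search(step_fn, start="<s>", end="<end>", k=2, max_len=10):
    """Keep the k highest log-probability prefixes at every step."""
    beams = [([start], 0.0)]       # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):       # stop criterion 2: maximum length
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            if seq[-1] == end:     # stop criterion 1: <end> generated
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    best_seq, _ = max(finished + beams, key=lambda c: c[1])
    return best_seq

table = {
    ("<s>",): {"cow": 0.5, "tiger": 0.4, "grass": 0.1},
    ("<s>", "cow"): {"eats": 0.9, "<end>": 0.1},
    ("<s>", "tiger"): {"bites": 0.9, "<end>": 0.1},
    ("<s>", "cow", "eats"): {"grass": 0.8, "carrot": 0.2},
    ("<s>", "tiger", "bites"): {"<end>": 0.7, "man": 0.3},
    ("<s>", "cow", "eats", "grass"): {"<end>": 1.0},
}
step_fn = lambda seq: table.get(tuple(seq), {"<end>": 1.0})
```

A real system would typically also length-normalize the scores, since shorter finished hypotheses otherwise have an unfair advantage over longer ones.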
Summary
Attention Mechanism
- Because an RNN encoder is not particularly good at capturing long-distance dependencies, the quality of the final encoder vector is limited
- The quality of the encoding directly affects the decoder's output, so this phenomenon is called the information bottleneck
- The attention mechanism was introduced to solve this problem
The attention mechanism solves this problem by dynamically weighting the encoder's hidden states to provide richer contextual information to the decoder. It allows the decoder to adaptively focus on different parts of the input sequence when generating each target-language word.
The basic steps of an encoder-decoder architecture with attention are:
1. Encoding: the input sequence (the source sentence) is encoded by a recurrent neural network (RNN), a Transformer, or a similar model, yielding a series of hidden states.
2. Attention computation: at each decoder time step, attention weights are computed from the decoder's current hidden state and the encoder's hidden states. These weights express how much the decoder attends to each encoder hidden state.
3. Context vector: a weighted sum of the encoder hidden states, using the attention weights, yields a context vector that aggregates the encoder information according to those weights.
4. Decoder generation: the context vector is fed into the decoder together with its current input (usually the previously generated target-language word) to produce a probability distribution over the next target-language word.
Repeat steps 2 to 4 until a complete target language sequence is generated.
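Steps 2 and 3, using a simple dot-product score, can be sketched in NumPy (the shapes and names are illustrative):

```python
import numpy as np

def attention_step(dec_hidden, enc_hiddens):
    """One attention step: score, normalize, then take a weighted sum.

    dec_hidden:  shape (d,),   current decoder hidden state
    enc_hiddens: shape (T, d), encoder hidden states, one per source position
    """
    scores = enc_hiddens @ dec_hidden  # step 2: dot-product similarity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax -> attention weights
    context = weights @ enc_hiddens    # step 3: context vector
    return context, weights
```

Encoder states most similar to the current decoder state receive the largest weights, so the context vector changes at every decoding step.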
Through the attention mechanism, the decoder can focus adaptively according to different parts of the input sequence, so as to better capture the relevant information of the input sequence. The application of attention mechanism in translation tasks can improve the quality of translation and the ability to handle long-distance dependencies.
- The core step of attention-based decoding is to compute the similarity between the decoder's hidden state at each time step and the encoder's hidden state at every time step, and use those similarities as weights to obtain a dynamic context vector for decoding (replacing the original design, which produced only a single vector at the encoder's last time step)
Attention summary
Evaluation
BLEU (Bilingual Evaluation Understudy): BLEU is a widely used automatic evaluation metric that measures the similarity between machine translation output and reference translations. It computes a score from the n-gram overlap between a candidate translation and one or more reference translations.
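A minimal single-reference BLEU sketch (real implementations such as sacreBLEU additionally handle multiple references, tokenization, and smoothing; this version is only illustrative):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of modified n-gram
    precisions, times a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = math.exp(min(0.0, 1.0 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Clipping each candidate n-gram count by its count in the reference prevents a candidate from inflating its score by repeating a common word.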