A Plain Understanding of seq2seq: the Encoder and Decoder (Implemented in TensorFlow)

1. What is seq2seq

In many natural language processing applications, both the input and the output can be sequences of variable length. Take machine translation as an example: the input may be an English text sequence of arbitrary length, and the output may be a French text sequence of arbitrary length, for example:

English input: "They", "are", "watching", "."

French output: "Ils", "regardent", "."

When both the input and the output are variable-length sequences, we can use the encoder-decoder architecture, also known as the sequence-to-sequence (seq2seq) model. Both names refer to essentially the same design: two recurrent neural networks, called the encoder and the decoder. The encoder analyzes the input sequence, and the decoder generates the output sequence. The two recurrent neural networks are trained jointly.

The figure describes one way of using the encoder-decoder to translate the English sentence above into French. In the training dataset, we can append the special symbol "<eos>" (end of sequence) to every sentence to mark the end of the sequence. At each time step, the encoder takes as input, in order, the words, punctuation, and the special symbol "<eos>" of the English sentence. The figure uses the encoder's hidden state at the final time step as the representation, or encoding, of the input sentence. At each time step, the decoder takes as input the encoding of the input sentence together with the output and hidden state of the previous time step. We hope the decoder can correctly output, step by step, the translated French words, punctuation, and the special symbol "<eos>". Note that the decoder's input at the very first time step is a special symbol "<bos>" (beginning of sequence) that marks the start of the sequence.
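As a concrete illustration of this bookkeeping, below is a minimal preprocessing sketch in plain Python. The toy vocabulary, padding scheme, and helper names (`build_vocab`, `encode`) are assumptions made for illustration, not part of any particular library: the sketch simply appends "<eos>" to every sentence and prepends "<bos>" to the decoder's input.

```python
# Hypothetical preprocessing sketch: append "<eos>" to every sentence and
# prepend "<bos>" to the decoder input, as described above.
def build_vocab(sentences):
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}   # reserve special symbols
    for sent in sentences:
        for tok in sent:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(sent, vocab, max_len):
    ids = [vocab[tok] for tok in sent] + [vocab["<eos>"]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))   # pad to a fixed length

src = [["They", "are", "watching", "."]]
tgt = [["Ils", "regardent", "."]]
src_vocab, tgt_vocab = build_vocab(src), build_vocab(tgt)
enc_input = encode(src[0], src_vocab, max_len=6)                         # encoder input ends with <eos>
dec_input = [tgt_vocab["<bos>"]] + encode(tgt[0], tgt_vocab, max_len=5)  # decoder input starts with <bos>
print(enc_input, dec_input)
```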

2. Encoder

The encoder transforms a variable-length input sequence into a fixed-length context variable c, and encodes the information of the input sequence in that context variable. A recurrent neural network is the most common choice of encoder.

Let us consider a minibatch of time-series samples with batch size 1. Suppose the input sequence is x1, ..., xT, where xi is the i-th word of the input sentence. At time step t, the recurrent neural network transforms the feature vector xt of the input and the hidden state \(h_{t-1}\) of the previous time step into the hidden state \(h_t\) of the current time step. We can express the transformation of the recurrent neural network's hidden layer with a function f:

\[h_t=f(x_t,h_{t-1})\]

Next, the encoder transforms the hidden states of all the time steps into the context variable through a custom function q:

\[c=q(h_1,...,h_T)\]

For example, when we choose \(q(h_1,...,h_T)=h_T\), the context variable is the hidden state \(h_T\) of the input sequence at the final time step.

The encoder described above is a unidirectional recurrent neural network, so the hidden state at each time step depends only on that time step and the preceding subsequence of the input. We can also build the encoder with a bidirectional recurrent neural network. In that case, the hidden state at each time step depends on the subsequences both before and after that time step (including the input at the current time step) and encodes the information of the entire sequence.
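As a sketch of the idea (not a reference implementation), the encoder can be written with the TensorFlow Keras API as below. The hyper-parameters `vocab_size`, `embed_dim`, and `num_hiddens` are arbitrary illustrative choices, and the GRU's final hidden state plays the role of the context variable, i.e. \(q(h_1,...,h_T)=h_T\).

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embed_dim, num_hiddens):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        # return_state=True exposes the final hidden state h_T.
        self.rnn = tf.keras.layers.GRU(num_hiddens, return_sequences=True,
                                       return_state=True)

    def call(self, x):
        emb = self.embedding(x)            # (batch, T, embed_dim)
        outputs, state = self.rnn(emb)     # outputs: (batch, T, H); state: (batch, H)
        return outputs, state              # state serves as the context variable c

encoder = Encoder(vocab_size=10000, embed_dim=32, num_hiddens=64)
outputs, c = encoder(tf.zeros((4, 7), dtype=tf.int32))   # a batch of 4 sequences of length 7
print(outputs.shape, c.shape)                            # (4, 7, 64) (4, 64)
```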

3. Decoder

As just explained, the context variable c output by the encoder encodes the information of the entire input sequence x1, ..., xT. Given the output sequence y1, y2, ..., yT' of a training sample, for each time step t' (the symbol differs from the time step t of the input sequence or encoder), the conditional probability of the decoder output yt' is based on the previous output sequence \(y_1,...,y_{t^{′}-1}\) and the context variable c, namely:

\[P(y_{t^{′}}|y_1,...,y_{t^{′}-1},c)\]

To this end, we can use another recurrent neural network as the decoder. At time step t' of the output sequence, the decoder takes the output \(y_{t^{′}-1}\) of the previous time step and the context variable c as input, and transforms them together with the previous hidden state \(s_{t^{′}-1}\) into the hidden state \(s_{t^{′}}\) of the current time step. We can therefore express the transformation of the decoder's hidden layer with a function g:

\[s_{t^{′}}=g(y_{t^{′}-1},c,s_{t^{′}-1})\]

With the decoder's hidden state, we can use a custom output layer and the softmax operation to compute \(P(y_{t^{′}}|y_1,...,y_{t^{′}-1},c)\); for example, the probability distribution of the current time step's output \(y_{t^{′}}\) can be computed from the decoder's hidden state \(s_{t^{′}}\) at the current time step, the output \(y_{t^{′}-1}\) of the previous time step, and the context variable c.
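Continuing the sketch above under the same assumptions, a minimal decoder can concatenate the context variable c to the embedding of the previous output at every time step and map the GRU output to vocabulary logits with a dense layer (the softmax is applied later, inside the loss):

```python
import tensorflow as tf

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embed_dim, num_hiddens):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.rnn = tf.keras.layers.GRU(num_hiddens, return_sequences=True,
                                       return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)   # logits over the output vocabulary

    def call(self, y, context, state):
        emb = self.embedding(y)                                       # (batch, T', embed_dim)
        # Broadcast the context variable c to every decoder time step.
        c = tf.repeat(tf.expand_dims(context, 1), tf.shape(emb)[1], axis=1)
        outputs, state = self.rnn(tf.concat([emb, c], axis=-1),
                                  initial_state=state)
        return self.dense(outputs), state                             # logits, new hidden state
```

Here the encoder's hidden size is assumed to equal the decoder's, so the context variable can also be used directly as the decoder's initial hidden state.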

4. Training the Model

According to maximum likelihood estimation, we can maximize the conditional probability of the output sequence given the input sequence:

\[P(y_1,...,y_{T^{′}}|x_1,...,x_T)=\prod_{t^{′}=1}^{T^{′}}P(y_{t^{′}}|y_1,...,y_{t^{′}-1},x_1,...,x_T)\]

\[=\prod_{t^{′}=1}^{T^{′}}P(y_{t^{′}}|y_1,...,y_{t^{′}-1},c)\]

and obtain the loss of the output sequence:

\[-\log P(y_1,...,y_{T^{′}}|x_1,...,x_T)=-\sum_{t^{′}=1}^{T^{′}}\log P(y_{t^{′}}|y_1,...,y_{t^{′}-1},c)\]

In model training, the mean of the losses over all output sequences is usually used as the loss function to be minimized. In the model prediction described in the figure above, we need to feed the decoder's output at the previous time step as its input at the current time step. In contrast, during training we can also feed the label sequence (the ground-truth output sequence in the training set) at the previous time step as the decoder's input at the current time step. This is called teacher forcing.
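A sketch of a single teacher-forcing training step, assuming the hypothetical `Encoder` and `Decoder` classes from the sketches above and int32-encoded source/target batches (padding masks and other practical details are omitted for brevity):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

def train_step(encoder, decoder, src_batch, tgt_batch, bos_id):
    with tf.GradientTape() as tape:
        _, c = encoder(src_batch)                         # context variable c
        # Teacher forcing: feed <bos> plus the ground-truth tokens, shifted right.
        bos = tf.fill([tf.shape(tgt_batch)[0], 1], bos_id)
        dec_input = tf.concat([bos, tgt_batch[:, :-1]], axis=1)
        logits, _ = decoder(dec_input, c, state=c)        # predict every target token
        loss = loss_fn(tgt_batch, logits)                 # mean negative log-likelihood
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```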

5. Prediction with the seq2seq Model

The sections above describe how to train an encoder-decoder whose input and output are both variable-length sequences. In this section we describe how to use the encoder-decoder to predict a variable-length sequence.

When preparing the training dataset, we often append the special symbol "<eos>" to both the input sequence and the output sequence of every sample to mark the end of the sequence. The discussion below continues to use the mathematical notation of the previous sections. For ease of discussion, assume that the decoder's output is a text sequence. Let the size of the output vocabulary Y (which contains the special symbol "<eos>") be |Y|, and let the maximum length of the output sequence be T'. There are then \(O(|Y|^{T^{′}})\) possible output sequences in total. In each of these output sequences, the subsequence after the special symbol "<eos>" is discarded.

5.1 Greedy Search

In greedy search, at every time step t' of the output sequence we search among the |Y| words for the word with the largest conditional probability:

\[y_{t^{′}}=\mathop{\arg\max}_{y\in Y}P(y|y_1,...,y_{t^{′}-1},c)\]

and take it as the output. Once "<eos>" is searched out, or the output sequence has reached its maximum length T', the output is complete. As we mentioned when describing the decoder, the conditional probability of generating an output sequence from an input sequence is \(\prod_{t^{′}=1}^{T^{′}}P(y_{t^{′}}|y_1,...,y_{t^{′}-1},c)\). We call the output sequence with the largest conditional probability the optimal output sequence. The main problem with greedy search is that it cannot guarantee finding the optimal output sequence.

Consider the following example. Suppose the output vocabulary contains the four words "A", "B", "C", and "<eos>". The four numbers under each time step in the figure represent the conditional probabilities of generating "A", "B", "C", and "<eos>" at that time step. At each time step, greedy search selects the word with the largest conditional probability. Therefore, the figure generates the output sequence "A" "B" "C" "<eos>". The conditional probability of this output sequence is 0.5 × 0.4 × 0.4 × 0.6 = 0.048.

Next, look at the example demonstrated below. Unlike the figure above, at time step 2 we select the word "C", which has the second-largest conditional probability. Since the output subsequence of time steps 1 and 2, on which time step 3 is based, changes from "A" "B" in the figure above to "A" "C" in the figure below, the conditional probabilities of the words generated at time step 3 also change. We choose the word "B", which has the largest conditional probability. The output subsequence of the first three time steps, on which time step 4 is based, is now "A" "C" "B", which differs from "A" "B" "C" in the figure above. Therefore, the conditional probabilities of the words generated at time step 4 in the figure below also differ from those in the figure above. We find that the conditional probability of the output sequence "A" "C" "B" "<eos>" is now 0.5 × 0.3 × 0.6 × 0.6 = 0.054, which is larger than the conditional probability of the output sequence obtained by greedy search. Therefore, the output sequence "A" "B" "C" "<eos>" obtained by greedy search is not the optimal output sequence.
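Greedy search is straightforward to implement on top of the decoder sketch above. The snippet below is a sketch under the same assumptions, with hypothetical `bos_id` and `eos_id` token indices and a batch size of 1 for the returned sequence:

```python
import tensorflow as tf

def greedy_decode(encoder, decoder, src, bos_id, eos_id, max_len):
    _, c = encoder(src)                                   # context variable c
    token = tf.fill([tf.shape(src)[0], 1], bos_id)        # start from <bos>
    state, output = c, []
    for _ in range(max_len):
        logits, state = decoder(token, c, state)          # logits: (batch, 1, vocab)
        token = tf.argmax(logits, axis=-1, output_type=tf.int32)   # most probable word
        output.append(int(token[0, 0]))
        if output[-1] == eos_id:                          # stop once <eos> is emitted
            break
    return output
```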

5.2 Exhaustive Search

If the goal is to obtain the optimal output sequence, we can consider exhaustive search: enumerate all possible output sequences and output the one with the largest conditional probability.

Although exhaustive search can find the optimal output sequence, its computational cost \(O(|Y|^{T^{′}})\) easily becomes too large. For example, when |Y| = 10000 and T' = 10, we would have to evaluate \(10000^{10}=10^{40}\) sequences: this is almost impossible. The computational cost of greedy search, on the other hand, is \(O(|Y|T^{′})\), which is usually significantly smaller than that of exhaustive search. For example, when |Y| = 10000 and T' = 10, we only need to evaluate \(10000\times10=10^5\) sequences.

5.3 Beam Search

Beam search is an improvement over greedy search. It has a hyper-parameter called the beam width (beam size), which we denote by k. At time step 1, we select the k words with the largest conditional probabilities at the current time step, each becoming the first word of one of k candidate output sequences. At every subsequent time step, based on the k candidate output sequences of the previous time step, we select the k sequences with the largest conditional probabilities out of k|Y| possible output sequences as the candidate output sequences of that time step. Finally, from the candidate output sequences of all time steps we screen out those containing the special symbol "<eos>", discard the subsequence after "<eos>" in each of them, and obtain the set of final candidate output sequences.

Suppose the beam width is 2 and the maximum length of the output sequence is 3. The candidate output sequences are A, C, AB, CE, ABD, and CED. We obtain the set of final candidate output sequences from these six sequences. From the set of final candidate output sequences, we take the sequence with the highest value of the following score as the output sequence:

\[\frac{1}{L^{\alpha}}\log P(y_1,...,y_L)=\frac{1}{L^{\alpha}}\sum_{t^{′}=1}^{L}\log P(y_{t^{′}}|y_1,...,y_{t^{′}-1},c)\]

Here L is the length of the final candidate sequence and α is usually chosen as 0.75. The \(L^{\alpha}\) in the denominator penalizes longer sequences, which contribute more log-probability terms to the sum above. It can be shown that the computational cost of beam search is \(O(k|Y|T^{′})\), which lies between the costs of greedy search and exhaustive search. Moreover, greedy search can be regarded as beam search with a beam width of 1. Through a flexible beam width k, beam search trades off computational cost against search quality.
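The following is a compact beam-search sketch in plain Python. It abstracts the decoder behind a hypothetical `next_log_probs(prefix)` function that returns a dictionary mapping each candidate next word to its log conditional probability, keeps k beams per step, and ranks the surviving candidates with the length-normalized score above:

```python
def beam_search(next_log_probs, bos, eos, k=2, max_len=3, alpha=0.75):
    beams = [([bos], 0.0)]                                # (sequence, sum of log-probabilities)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                            # finished beams are carried over unchanged
                candidates.append((seq, score))
                continue
            for word, logp in next_log_probs(seq).items():
                candidates.append((seq + [word], score + logp))
        # Keep the k candidates with the largest conditional probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    # Rank by (1 / L^alpha) * log P, where L excludes the leading <bos>.
    return max(beams, key=lambda c: c[1] / (len(c[0]) - 1) ** alpha)
```

Setting k = 1 recovers greedy search, while a very large k approaches exhaustive search.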

6. BLEU Score

Machine translation results are usually evaluated with BLEU (Bilingual Evaluation Understudy). For any subsequence of the model's predicted sequence, BLEU checks whether that subsequence appears in the label sequence.

Specifically, let pn denote the precision of subsequences with n words (n-grams). It is the ratio of the number of n-grams in the predicted sequence that match the label sequence to the total number of n-grams in the predicted sequence. For example, suppose the label sequence is A, B, C, D, E, F and the predicted sequence is A, B, B, C, D. Then:

\[p_1=\frac{\text{number of 1-grams of the predicted sequence that appear in the label sequence}}{\text{total number of 1-grams of the predicted sequence}}\]

The 1-grams of the predicted sequence are A, B, B, C, D; four of them (A, B, C, D) are matched in the label sequence, so p1 = 4/5. Similarly, p2 = 3/4, p3 = 1/3, p4 = 0. Let \(len_{label}\) and \(len_{pred}\) be the numbers of words in the label sequence and the predicted sequence, respectively. Then BLEU is defined as:

\[\exp(\min(0,1-\frac{len_{label}}{len_{pred}}))\prod_{n=1}^{k}p_n^{\frac{1}{2^n}}\]

Here k is the maximum number of words of the subsequences we wish to match. It is easy to see that BLEU equals 1 when the predicted sequence is identical to the label sequence.

Because matching longer subsequences is harder than matching shorter ones, BLEU assigns a larger weight to the precision of longer subsequences. For example, when pn is fixed at 0.5, as n grows we have \(0.5^{\frac{1}{2}}\approx0.7,0.5^{\frac{1}{4}}\approx0.84,0.5^{\frac{1}{8}}\approx0.92,0.5^{\frac{1}{16}}\approx0.96\). In addition, predicting a shorter sequence tends to yield higher pn values. Therefore, the coefficient in front of the product term in the formula above penalizes shorter outputs. For example, when k = 2, suppose the label sequence is A, B, C, D, E, F and the predicted sequence is A, B. Although p1 = p2 = 1, the penalty factor is exp(1 - 6/2) ≈ 0.14, so BLEU is also close to 0.14.
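A direct sketch of the BLEU formula above, written in plain Python with clipped n-gram precision and the brevity penalty; the final line evaluates the A, B, B, C, D example with k = 2:

```python
import collections
import math

def bleu(pred_tokens, label_tokens, k=4):
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0.0, 1 - len_label / len_pred))        # brevity penalty
    for n in range(1, k + 1):
        label_counts = collections.Counter(
            tuple(label_tokens[i:i + n]) for i in range(len_label - n + 1))
        num_matches = 0
        for i in range(len_pred - n + 1):
            ngram = tuple(pred_tokens[i:i + n])
            if label_counts[ngram] > 0:                         # clipped matching
                num_matches += 1
                label_counts[ngram] -= 1
        # p_n raised to the power 1 / 2^n; p_n = 0 drives the whole score to 0.
        score *= (num_matches / max(len_pred - n + 1, 1)) ** (0.5 ** n)
    return score

print(bleu(["A", "B", "B", "C", "D"], ["A", "B", "C", "D", "E", "F"], k=2))
```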

7. Code Implementation

A basic implementation of seq2seq in TensorFlow

Part of the "machine learning in plain language" article series


8. References

Dive into Deep Learning (动手学深度学习)


Author: @mantchs

GitHub:https://github.com/NLP-LOVE/ML-NLP

Everyone is welcome to join the discussion and help improve this project! QQ group: 541954936 (NLP interview study group)
