Learning the Transformer (The Illustrated Transformer)

In the previous post, we looked at attention – a method that is ubiquitous in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. In fact, Google Cloud recommends The Transformer as a reference model for their Cloud TPU offering. So let's try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make them easier to understand for people without in-depth knowledge of the subject matter.

A High-Level Look

Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.
[figure]
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.

[figure]
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there's nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
[figure]
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
[figure]
The encoder's inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We'll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
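To make "applied independently to each position" concrete, here is a minimal PyTorch sketch of such a position-wise feed-forward block. The 512/2048 dimensions follow the paper; the class name and the random input are only for illustration.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    # the same two linear layers are applied to every position independently
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):              # x: (seq_len, d_model)
        return self.net(x)

ffn = PositionWiseFFN()
out = ffn(torch.randn(5, 512))         # 5 positions in, 5 positions out, each of size 512
```

Because the same weights are reused at every position, positions only interact with each other inside the self-attention layer.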

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
[figure]

Bringing The Tensors Into The Picture

Now that we've seen the major components of the model, let's start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
[figure]
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
[figure]
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we'll switch up the example to a shorter sentence and we'll look at what happens in each sub-layer of the encoder.

Now We’re Encoding!

As we've mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
[figure]

Self-Attention at a High Level

Don't be fooled by me throwing around the word "self-attention" like it's a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

”The animal didn't cross the street because it was too tired”

What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm.

When the model is processing the word "it", self-attention allows it to associate "it" with "animal".

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you're familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it's processing. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.
[figure]
Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

Self-Attention in Detail

Let's first look at how to calculate self-attention using vectors, then proceed to look at how it's actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don't HAVE to be smaller; this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
[figure]
What are the "query", "key", and "value" vectors?

They're abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you'll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we're calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we're scoring. So if we're processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
[figure]
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.
[figure]
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
[figure]
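Here is a minimal PyTorch sketch of those six steps for a single position. The embeddings and weight matrices are random stand-ins for the trained values; the dimensions (512 and 64) match the ones used above.

```python
import torch
import torch.nn.functional as F

d_model, d_k = 512, 64
x = torch.randn(2, d_model)                 # embeddings for "Thinking" and "Machines"
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))

q, k, v = x @ W_Q, x @ W_K, x @ W_V         # step 1: query, key, value vectors

# Steps 2-4 for position #1 ("Thinking"): score against every position,
# divide by sqrt(d_k) = 8, then softmax.
scores = q[0] @ k.T / d_k ** 0.5            # dot products q1·k1, q1·k2, scaled
weights = F.softmax(scores, dim=-1)

# Steps 5-6: weight the value vectors and sum them up.
z1 = weights @ v                            # self-attention output for position #1
```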

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we've trained (WQ, WK, WV).
[figure]
Finally, since we're dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
[figure]
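A sketch of that condensed formula – softmax(Q·Kᵀ / √d_k)·V – with random matrices standing in for the trained weights:

```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # steps 2-3 for all positions at once
    return F.softmax(scores, dim=-1) @ V            # steps 4-6: softmax, weight, sum

X = torch.randn(3, 512)                              # one row per input word
W_Q, W_K, W_V = (torch.randn(512, 64) for _ in range(3))
Z = self_attention(X @ W_Q, X @ W_K, X @ W_V)        # (3, 64): one output row per word
```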

The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called "multi-headed" attention. This improves the performance of the attention layer in two ways:

  1. It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we're translating a sentence like "The animal didn't cross the street because it was too tired", it would be useful to know which word "it" refers to.

  2. It gives the attention layer multiple "representation subspaces". As we'll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

[figure]
If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.
[figure]
This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it's expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.
[figure]
That's pretty much all there is to multi-headed self-attention. It's quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.
[figure]
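Putting the pieces together in code, a rough sketch of multi-headed self-attention with eight heads (random weights in place of trained ones) might look like this:

```python
import torch
import torch.nn.functional as F

num_heads, d_model, d_k = 8, 512, 64
X = torch.randn(3, d_model)                                   # 3 input words

def attention(Q, K, V):
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# One (W_Q, W_K, W_V) set per head, each randomly initialized here.
heads = []
for _ in range(num_heads):
    W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))        # each head: (3, 64)

W_O = torch.randn(num_heads * d_k, d_model)
Z = torch.cat(heads, dim=-1) @ W_O                            # concat, then project back to (3, 512)
```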
Now that we have touched upon attention heads, let's revisit our example from before to see where the different attention heads are focusing as we encode the word "it" in our example sentence:
[figure]
If we add all the attention heads to the picture, however, things can be harder to interpret:
[figure]

Representing The Order of The Sequence Using Positional Encoding

One thing that's missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they're projected into Q/K/V vectors and during dot-product attention.
[figure]
If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:
[figure]
What might this pattern look like?

In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we'd add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We've color-coded them so the pattern is visible.
[figure]
The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
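As a sketch, the sine/cosine formula from section 3.5 of the paper can be generated like this (the Tensor2Tensor get_timing_signal_1d() implementation differs in some layout details, but the idea is the same):

```python
import torch

def positional_encoding(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()                   # even dimension indices
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                                  # added to the input embeddings

pe = positional_encoding(max_len=50, d_model=512)              # every value lies between -1 and 1
```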

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
[figure]
If we're to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:
[figure]
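In code, the add-and-normalize step can be sketched as wrapping any sub-layer; here a plain linear layer stands in for the self-attention or feed-forward block:

```python
import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)        # stand-in for self-attention or the FFN

def add_and_norm(x):
    return layer_norm(x + sublayer(x))        # residual connection, then layer-normalization

out = add_and_norm(torch.randn(3, d_model))   # same shape in and out: (3, 512)
```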
This goes for the sub-layers of the decoder as well. If we're to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
[figure]

The Decoder Side

Now that we've covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let's take a look at how they work together.
[figure]
[figure]
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
[figure]

The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
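A sketch of that masking step: scores for future positions are set to -inf, so after the softmax their attention weights become zero. The random queries and keys are placeholders.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 64
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)

scores = q @ k.T / d_k ** 0.5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()   # strictly upper triangle = future
scores = scores.masked_fill(mask, float("-inf"))                     # future positions get -inf
weights = F.softmax(scores, dim=-1)                                  # ...and therefore zero weight
```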

The "Encoder-Decoder Attention" layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That's the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector.

Let's assume that our model knows 10,000 unique English words (our model's "output vocabulary") that it's learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

[figure]
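A minimal sketch of that last stage, with an untrained (randomly initialized) linear layer and a random decoder output standing in for the real model:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
final_linear = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(d_model)              # vector produced by the decoder stack
logits = final_linear(decoder_output)              # one score per word in the vocabulary
probs = torch.softmax(logits, dim=-1)              # all positive, summing to 1.0
predicted_word_id = probs.argmax().item()          # index of the most probable word
```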

Recap Of Training

Now that we've covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let's assume our output vocabulary only contains six words ("a", "am", "i", "thanks", "student", and "<eos>" (short for 'end of sentence')).
[figure]
Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word "am" using the following vector:
[figure]
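With the six-word vocabulary above stored in that order, a tiny sketch of one-hot encoding gives "am" a 1 in its own cell and 0 everywhere else:

```python
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word):
    vec = [0.0] * len(vocab)          # one cell per word in the vocabulary
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("am"))                  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```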
Following this recap, let's discuss the model's loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.

The Loss Function

Say we are training our model. Say it's our first step in the training phase, and we're training it on a simple example – translating "merci" into "thanks".

What this means, is that we want the output to be a probability distribution indicating the word "thanks". But since this model is not yet trained, that's unlikely to happen just yet.
[figure]
How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
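In practice the comparison is done with a loss such as cross-entropy. A sketch using the toy vocabulary, comparing the model's predicted distribution against the target word "thanks" (random logits stand in for an untrained model):

```python
import torch
import torch.nn.functional as F

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
target = torch.tensor(vocab.index("thanks"))       # index of the correct word
logits = torch.randn(len(vocab))                   # untrained model: essentially random scores

loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
# the loss shrinks as softmax(logits) puts more probability on "thanks"
```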

But note that this is an oversimplified example. More realistically, we'll use a sentence longer than one word. For example – input: "je suis étudiant" and expected output: "i am a student". What this really means, is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
  • The first probability distribution has the highest probability at the cell associated with the word "i"
  • The second probability distribution has the highest probability at the cell associated with the word "am"
  • And so on, until the fifth output distribution indicates the '<end of sentence>' symbol, which also has a cell associated with it from the 10,000 element vocabulary.

[figure]
After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:
[figure]
Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That's one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, 'I' and 'a' for example), then in the next step, run the model twice: once assuming the first output position was the word 'I', and another time assuming the first output position was the word 'me', and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called "beam search", where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
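A rough sketch of the greedy-decoding loop just described, with a hypothetical model(...) call that returns the next-word probability distribution (beam search would instead keep the top beam_size candidates at each step):

```python
import torch

def greedy_decode(model, encoder_output, eos_id, max_len=50):
    output_ids = []
    for _ in range(max_len):
        probs = model(encoder_output, output_ids)      # hypothetical: distribution over the vocabulary
        next_id = int(torch.argmax(probs))             # keep only the single most probable word
        output_ids.append(next_id)
        if next_id == eos_id:                          # stop at the end-of-sentence symbol
            break
    return output_ids
```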

Go Forth And Transform

I hope you've found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I'd suggest these next steps:


Reposted from blog.csdn.net/czp_374/article/details/88786776