Transformer
Transformer is a deep learning model introduced in 2017, mainly used in the field of natural language processing. Like recurrent neural networks, Transformers are designed to process sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not need to process sequential data in order.
Why do we need the Transformer?
- RNN
- It is the classic model for processing sequences, e.g., unidirectional or bidirectional RNNs.
- The problem with RNNs: they are difficult to parallelize.
Therefore, it was proposed to replace RNN with CNN
- CNN instead of RNN
  - CNN filters: each triangle represents a filter that takes a small segment of the sequence as input and outputs a value (obtained by an inner product). Different filters correspond to different parts of the sequence.
  - Each CNN filter can only consider a limited window of content, whereas an RNN can consider the whole sentence.
  - To consider very long sentences, stack many layers of CNN: an upper-layer filter can consider more information because it takes the outputs of lower-layer filters as input.
  - Problem: many layers must be stacked before a long sentence can be covered, which motivates the self-attention mechanism.
Self-attention Layer
For example, the inputs $x_1, x_2, x_3, x_4$ pass through an embedding layer to become $a_1, a_2, a_3, a_4$; each of these is multiplied by three different transformation matrices to obtain three different vectors $q, k, v$.
- $q$: query (to match others), $q^i = W^q a^i$
- $k$: key (to be matched), $k^i = W^k a^i$
- $v$: information to be extracted, $v^i = W^v a^i$
Each query $q$ attends to each key $k$; the paper uses Scaled Dot-Product Attention:

$$\alpha_{1,i} = q^1 \cdot k^i / \sqrt{d}$$

where $\cdot$ denotes the dot product and $d$ is the dimension of $q$; dividing by $\sqrt{d}$ offsets the scale imbalance caused by the dimension.
Then apply a softmax:

$$\hat{\alpha}_{1,i} = \frac{\exp\left(\alpha_{1,i}\right)}{\sum\limits_j \exp\left(\alpha_{1,j}\right)}$$
Finally, weight each value $v^i$ by the attention and sum:

$$b^1 = \sum\limits_i \hat{\alpha}_{1,i} v^i$$
Thus, $b^1, b^2, b^3, b^4$ can be calculated in parallel.
Next, use matrix operations to illustrate the process of the self-attention layer:
- Step 1: compute $Q$, $K$, $V$ from the stacked inputs via $W^q$, $W^k$, $W^v$.
- Step 2: compute all attention scores at once from $Q$ and $K$, scaled by $\sqrt{d}$.
- Step 3: apply the softmax to obtain $\hat{\alpha}$.
- Step 4: multiply the normalized weights by $V$ to obtain all outputs $b^i$.
Therefore,

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

as shown in the figure below:
In the end, it is all matrix multiplications, which can be accelerated on a GPU.
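As a minimal sketch of the matrix form above (the function name, shapes, and random values are my own illustrations, not the paper's notation), single-head self-attention can be written in a few lines of NumPy:

import numpy as np

def self_attention(a, W_q, W_k, W_v):
    # a: (seq_len, d_model) holds a^1 ... a^N as rows
    # W_q, W_k, W_v: (d_model, d) projection matrices
    Q = a @ W_q                                   # queries q^i
    K = a @ W_k                                   # keys k^i
    V = a @ W_v                                   # values v^i
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # alpha_{i,j} = q^i . k^j / sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over j
    return weights @ V                            # b^i = sum_j alpha_hat_{i,j} v^j

# toy example: 4 tokens, d_model = 8, head dimension d = 4
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
W_q, W_k, W_v = [rng.normal(size=(8, 4)) for _ in range(3)]
b = self_attention(a, W_q, W_k, W_v)   # shape (4, 4); all b^i computed at once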
Multi-head Self-attention
It allows the model to jointly attend to information from different representation subspaces at different positions. Different heads can play different roles and learn features with different meanings (such as local or global).
$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\cdots,\text{head}_h) W^O$$

where

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
The following figure shows the attention maps of 2 heads.
The resulting $b^{i,1}$ and $b^{i,2}$ can be concatenated and multiplied by a matrix $W$ to change the dimension.
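A rough illustration of the concatenate-and-project step using the built-in Keras layer (the head count and dimensions below are arbitrary choices for the example, not values from the paper):

import tensorflow as tf
from tensorflow.keras import layers

mha = layers.MultiHeadAttention(num_heads=2, key_dim=4)   # 2 heads, each of dimension 4
a = tf.random.normal((1, 4, 8))                           # (batch, seq_len, d_model)
b, attn = mha(query=a, value=a, key=a, return_attention_scores=True)
print(b.shape)      # (1, 4, 8): heads concatenated and projected by W^O
print(attn.shape)   # (1, 2, 4, 4): one 4x4 attention map per head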
Positional Encoding
- The original self-attention does not take position information into account, so a position vector $e^i$ is introduced; it is not learned but set by hand (a sketch follows this list).
- Another method: use a one-hot encoding $p^i$ to represent the position of $x^i$. Specifically, $x^i$ can be concatenated with $p^i$, or $e^i$ can be added to $a^i$.
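One common choice for the hand-set position vector $e^i$ is the sinusoidal encoding from the paper; a minimal sketch (the function name and shapes are my own) is shown below. Note that the Keras layer that follows instead learns the position embedding, which is a valid alternative.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
    # PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
    pos = np.arange(max_len)[:, None]                        # (max_len, 1)
    dim = np.arange(d_model)[None, :]                        # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe                                                # added to a^i, not learned

pe = sinusoidal_positional_encoding(max_len=200, d_model=32)
print(pe.shape)   # (200, 32)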
The embedding layer from the official Keras example is implemented as follows (the imports are shared by all the code below):
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        # learned token embedding and learned position embedding
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
Seq2seq Architecture
Seq2seq model
Original Encoder and Decoder are composed of two RNNs, which can be applied to machine translation.
In the figure above, the original Encoder is a bidirectional RNN and the Decoder is a unidirectional RNN. In the figure below, both are replaced by self-attention layers, which achieve the same purpose and can be computed in parallel.
An animation vividly depicts this process
Transformer
The following figure is the architecture diagram of the Transformer:
- Encoder
  - Input: the input passes through the Input Embedding; to take position information into account, a manually set Positional Encoding is added; the result then enters a block that is repeated $N$ times.
  - Multi-head Attention: inside the Encoder this is multi-head attention, i.e. there are multiple sets of $q, k, v$; each of $q$, $k$, $v$ is obtained by multiplying $a$ by its own matrix, the attention weights $\alpha$ are computed, and finally $b$ is obtained.
  - Add & Norm: the input $a$ of the multi-head attention and its output $b$ are added to get $b'$, and then Layer Normalization is applied.
  - After that, the result is fed into the feed-forward network, followed by another Add & Norm.
- Decoder
  - Input: the output of the previous time step passes through the Output Embedding; position information is added via a manually set Positional Encoding; the result then enters a block that is repeated $N$ times.
  - Masked Multi-head Attention: attention in which "masked" means it only attends to the part of the sequence that has already been generated (see the sketch after this list), followed by an Add & Norm layer.
  - Another Multi-head Attention layer then attends to the output of the Encoder, followed by another Add & Norm layer.
  - Finally, the result goes through the Feed Forward network, then a Linear layer and a Softmax to generate the final output.
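The "attend only to what has been generated" behaviour can be sketched with a lower-triangular attention mask; this decoder detail is not used in the classifier example below, and the variable names here are only illustrative:

import tensorflow as tf
from tensorflow.keras import layers

seq_len, d_model = 4, 8
x = tf.random.normal((1, seq_len, d_model))

# lower-triangular mask: position i may only attend to positions j <= i
causal_mask = tf.cast(tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0), tf.bool)

masked_mha = layers.MultiHeadAttention(num_heads=2, key_dim=4)
out = masked_mha(query=x, value=x, attention_mask=causal_mask[tf.newaxis, ...])
print(out.shape)   # (1, 4, 8)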
The Transformer block is implemented as follows:
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        # multi-head self-attention sub-layer
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # position-wise feed-forward sub-layer
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)                        # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)                  # Add & Norm
        ffn_output = self.ffn(out1)                                   # Feed Forward
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)                     # Add & Norm
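A quick shape check of the block defined above (the dummy tensor and its values are just for illustration):

# continuing from the class definition above
block = TransformerBlock(embed_dim=32, num_heads=2, ff_dim=32)
dummy = tf.random.normal((1, 200, 32))       # (batch, seq_len, embed_dim)
print(block(dummy, training=False).shape)    # (1, 200, 32): the block preserves the shape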
Download and prepare the dataset:
vocab_size = 20000 # Only consider the top 20k words
maxlen = 200 # Only consider the first 200 words of each movie review
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)
Create a classifier model using transformer
embed_dim = 32 # Embedding size for each token
num_heads = 2 # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer
inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
Training and evaluation
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val)
)
The print information is as follows:
Epoch 1/2
782/782 [==============================] - 15s 18ms/step - loss: 0.5112 - accuracy: 0.7070 - val_loss: 0.3598 - val_accuracy: 0.8444
Epoch 2/2
782/782 [==============================] - 13s 17ms/step - loss: 0.1942 - accuracy: 0.9297 - val_loss: 0.2977 - val_accuracy: 0.8745
Attention Visualization
single head
multi-head
Example Application
- Summarizer By Google
The input is a set of documents and the output is an article (a summary).
(Figure from "Generating Wikipedia by Summarizing Long Sequences", https://arxiv.org/abs/1801.10198)
- Universal Transformer
In the horizontal (time) direction it is a Transformer; in the vertical direction it is an RNN.
- Self-Attention GAN
Can also be used for image generation