[Deep Learning] Li Hongyi: Graphical Transformer

Transformer

Transformer is a deep learning model introduced in 2017, mainly used in the field of natural language processing. Like recurrent neural networks, Transformers are designed to process sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not need to process sequential data in order.

Attention Is All You Need: the title means that nothing else (no recurrence, no convolution) is needed; attention alone is enough.

Why do we need the Transformer?

  • RNN
    • It is the most classic model for processing sequences, e.g., a unidirectional or bidirectional RNN.
    • The problem with RNNs: they are difficult to parallelize.

Hard to parallelize

Therefore, it was proposed to replace RNN with CNN

Using CNN to replace RNN

  • CNN instead of RNN

    • CNN filters: each triangle represents a filter whose input is a small segment of the sequence and whose output is a single value (obtained by an inner product). Different filters correspond to different parts of the sequence.

    Using CNN to replace RNN (CNN can be parallelized)

    • Each CNN filter can only consider a limited window of content, whereas an RNN can consider the whole sentence.
    • To consider very long sentences, many CNN layers must be stacked: a filter in an upper layer can consider more information, because it takes the outputs of the lower-layer filters as its input.
    • Problem: many layers have to be stacked before longer sentences can be covered, which is why the self-attention mechanism was introduced.

    You can try to replace anything an RNN does with self-attention.

self-attention layer

For example, inputs $x_1, x_2, x_3, x_4$ are passed through an embedding to obtain $a_1, a_2, a_3, a_4$; each of these is then multiplied by three different transformation matrices to obtain three different vectors $q, k, v$:

  • $q$: query (to match others)

$$q^i = W^q a^i$$

  • $k$: key (to be matched)

$$k^i = W^k a^i$$

  • $v$: value (information to be extracted)

$$v^i = W^v a^i$$

Self attention layer

Each query $q$ then attends to every key $k$; the paper uses Scaled Dot-Product Attention:

$$\alpha_{1,i} = \frac{q^1 \cdot k^i}{\sqrt{d}}$$

where $\cdot$ denotes the dot product and $d$ is the dimension of $q$; dividing by $\sqrt{d}$ offsets the scale imbalance caused by the dimension.

attention

Then a softmax is applied:

$$\hat{\alpha}_{1,i} = \frac{\exp\left(\alpha_{1,i}\right)}{\sum_j \exp\left(\alpha_{1,j}\right)}$$
softmax

Each value is then weighted by its attention weight and summed:

$$b^1 = \sum_i \hat{\alpha}_{1,i} v^i$$

Multiply and sum

Thus, $b^1, b^2, b^3, b^4$ can be calculated in parallel.
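As a minimal NumPy sketch (toy values; my own illustration, not the lecture's code), the computation of $b^1$ follows the formulas above directly:

import numpy as np

d = 4
a = np.random.randn(4, d)                       # embedded inputs a^1..a^4 as rows
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
q, k, v = a @ W_q, a @ W_k, a @ W_v             # q^i, k^i, v^i (row-vector convention)

alpha_1 = q[0] @ k.T / np.sqrt(d)               # alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1_hat = np.exp(alpha_1) / np.exp(alpha_1).sum()   # softmax over i
b_1 = alpha_1_hat @ v                           # b^1 = sum_i alpha_hat_{1,i} v^i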

Next, the self-attention layer is expressed with matrix operations:

  • Step 1:

Step 1

  • Step 2:

Step 2

  • Step 3:

Step 3

  • Step 4:

Step 4

Therefore,

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

as shown in the figure below:

Transformer matrix calculation

In short, the whole computation is just a series of matrix multiplications, which can be accelerated on a GPU.
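The matrix form above can be written as a small function; the following is a minimal NumPy sketch under toy-shape assumptions (again my own illustration, not the lecture's code):

import numpy as np

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

seq_len, d_model = 4, 8
A = np.random.randn(seq_len, d_model)                # rows are a^1..a^4
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = A @ W_q, A @ W_k, A @ W_v
B = attention(Q, K, V)                               # rows are b^1..b^4, computed at once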

Multi-head Self-attention

Multi-head self-attention allows the model to jointly attend to information from different representation subspaces at different positions. Different heads can each play their own role and learn features with different meanings (e.g., local or global):

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\cdots,\text{head}_h)W^O$$

where

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
The following figure is the attention map of 2 heads

2 head self-attention example

The outputs $b^{i,1}, b^{i,2}$ can be concatenated and multiplied by a matrix $W$ to transform the dimension.

Perform dimension transformation
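In Keras this corresponds to layers.MultiHeadAttention, the same layer used in the TransformerBlock later in this post. A minimal usage sketch with toy shapes, just to show the per-head attention maps and the projection back to the model dimension:

import tensorflow as tf
from tensorflow.keras import layers

mha = layers.MultiHeadAttention(num_heads=2, key_dim=4)   # 2 heads, each with key dimension 4
x = tf.random.normal((1, 10, 8))                          # (batch, seq_len, embed_dim)
out, attn_scores = mha(x, x, return_attention_scores=True)
print(out.shape)           # (1, 10, 8): Concat(head_1, head_2) W^O, back to embed_dim
print(attn_scores.shape)   # (1, 2, 10, 10): one attention map per head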

Positional Encoding

  • The original self-attention does not take position information into account, so a position vector $e^i$ is introduced; it is not learned but set by hand.
  • Another way to see it: use a one-hot encoding $p^i$ to indicate the position of $x^i$.

Introduce a position vector

Specifically, one can either concatenate $x^i$ with $p^i$, or add $e^i$ to $a^i$.

pos_2
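For reference, the fixed (hand-set) encoding in the original "Attention Is All You Need" paper uses sines and cosines of different frequencies; the sketch below is a minimal NumPy version of that standard formula (my own illustration, not from the lecture). Note that the Keras example that follows learns the position embeddings instead.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

e = sinusoidal_positional_encoding(max_len=50, d_model=32)   # one row e^i per position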

The official Keras implementation of this embedding layer is as follows:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        # token embedding: maps each word id to a dense vector
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        # position embedding: maps each position 0..maxlen-1 to a dense vector (learned)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        # broadcast-add position embeddings to token embeddings
        return x + positions
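A quick shape check (a hypothetical usage example, relying on the imports above):

dummy_ids = tf.random.uniform((2, 200), maxval=20000, dtype=tf.int32)  # batch of 2 padded reviews
emb = TokenAndPositionEmbedding(maxlen=200, vocab_size=20000, embed_dim=32)
print(emb(dummy_ids).shape)  # (2, 200, 32): token embedding + position embedding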

Seq2seq Architecture

In the original seq2seq model, the Encoder and Decoder are two RNNs; this setup can be applied to machine translation.

Encoder-decoder-orignal

In the above figure, the original Encoder is a bidirectional RNN and the Decoder is a unidirectional RNN. In the figure below, both are replaced by self-attention layers, which achieves the same purpose and can be computed in parallel.

Encoder-decoder

An animation vividly depicts this process

Transformer: A Novel Neural Network Architecture for Language Understanding

Transformer

The following figure is the architecture diagram of transformer:

Transformer Architecture

  • Encoder

    • The input passes through the Input Embedding; to take position information into account, the hand-set Positional Encoding is added, and the result enters a block that is repeated $N$ times:

      • Multi-head Attention: inside the Encoder this is Multi-head Attention, i.e., there are multiple sets of $q, k, v$; each of $q, k, v$ is obtained by multiplying $a$ by its own matrix, the weights $\alpha$ are computed, and finally $b$ is obtained.

      • Add & Norm: the input $a$ of the Multi-head Attention and its output $b$ are added to give $b'$, followed by Layer Normalization.
      • After this computation, the result is fed into the feed-forward network, followed by another Add & Norm.
  • Decoder

    • The input is the output of the previous time step; it passes through the Output Embedding, position information is added via the hand-set Positional Encoding, and the result enters a block that is repeated $N$ times:
      • Masked Multi-head Attention: attention is computed, where "Masked" means it only attends to the part of the sequence that has already been generated (see the mask sketch after this list), followed by an Add & Norm layer.
      • Another Multi-head Attention layer then attends to the output of the Encoder, followed by an Add & Norm layer.
      • After that, the result goes through the Feed Forward network, and finally a Linear layer and a Softmax generate the final output.
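The "masked" part of the decoder can be pictured as a lower-triangular look-ahead mask. The sketch below (an assumption about one common way to build such a mask, not code from the lecture) marks with 1 the positions each step is allowed to attend to; a boolean version of such a mask can be passed as the attention_mask argument of layers.MultiHeadAttention.

import tensorflow as tf

def causal_mask(seq_len):
    # position i may only attend to positions 0..i (the tokens generated so far)
    return tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

print(causal_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]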

The Transformer block is implemented as follows:

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        # multi-head self-attention
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # position-wise feed-forward network
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        # self-attention: query, key and value are all the inputs
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)      # Add & Norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)         # Add & Norm

Download and prepare the dataset:

vocab_size = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)

Create a classifier model using the Transformer block:

embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

Training and Evaluation

model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val)
)

The training output is as follows:

Epoch 1/2
782/782 [==============================] - 15s 18ms/step - loss: 0.5112 - accuracy: 0.7070 - val_loss: 0.3598 - val_accuracy: 0.8444
Epoch 2/2
782/782 [==============================] - 13s 17ms/step - loss: 0.1942 - accuracy: 0.9297 - val_loss: 0.2977 - val_accuracy: 0.8745

Attention Visualization

  • single head

The lines show the relationships between pairs of words: the thicker the line, the stronger the attention.

  • multi-head

Different heads produce different matching results, which means that different heads capture different information (local in the lower example, global in the upper one).

Example Application

  • Summarizer By Google

The input is a set of documents and the output is a single summary article.

Figure: example from "Generating Wikipedia by Summarizing Long Sequences" (https://arxiv.org/abs/1801.10198)

  • Universal Transformer

In the horizontal (time) direction it behaves like a Transformer, and in the vertical (depth) direction like an RNN.

Universal Transformer: https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html

  • Self-Attention GAN

Can also be used for image generation

Self-Attention GAN


Origin blog.csdn.net/qq_38904659/article/details/113705027