[Deep Learning Series (6)]: RNN Series (5): Flexible Attention Mechanisms (tricks of the RNN model)

There are three magic weapons for solving NLP tasks: the attention mechanism, convolutional neural networks, and recurrent neural networks. Clearly the attention mechanism is very important for NLP, so this article focuses on the attention mechanism and on how to use it flexibly in actual projects.

As we all know, the attention mechanism is usually used in seq2seq models. The attention we commonly use is content-based: we select only some key pieces of the input information for processing. Sometimes, however, we also need to pay attention to other information, such as position information. In some settings, for example formula-related or speech-related tasks, the position information between characters is also very important. In this article we therefore introduce how to flexibly modify the attention mechanism of seq2seq and apply it to the actual requirements of our tasks.


Table of Contents

1. Implementation of the BahdanauAttention attention mechanism

2. Implementation of the multi-head attention mechanism and the self-attention mechanism

2.1 The basic idea of the attention mechanism

2.2 The multi-head attention mechanism and the self-attention mechanism

2.3 Multi-head attention in practice: analyzing whether a reviewer is satisfied

3. Implementation of the hybrid attention mechanism

3.1 Concrete implementation of the hybrid attention mechanism


1. Implementation of the BahdanauAttention attention mechanism

Not much introduction here. For details, please refer to my previous article: [Deep Learning Series (6)]: RNN Series (4): seq2seq model with attention mechanism in practice (2): adding content descriptions to pictures.

2. Implementation of the multi-head attention mechanism and the self-attention mechanism

2.1 The basic idea of the attention mechanism

From the first section we should already be quite familiar with the attention mechanism, so here is only a brief recap. The idea is actually very simple: we use a query to perform a lookup, and according to the keys we select the part of the values we care about. Below we abbreviate query, key and value as q, k and v respectively. The specific formulation is as follows:

                                                Attention(q_{t},K,V)=softmax\left(\frac{\langle q_{t},K\rangle}{\sqrt{d_{k}}}\right)V=\sum_{s=1}^{m}\frac{1}{Z}\exp\left(\frac{\langle q_{t},k_{s}\rangle}{\sqrt{d_{k}}}\right)v_{s}

The specific calculation process is as follows:

  • Compute the inner product of q and k, and divide it by \sqrt{d_{k}} to remove the effect of dimensionality (\sqrt{d_{k}} keeps the inner-product values from growing too large);
  • Apply softmax to the result of the first step; this gives the attention scores;
  • Weight v by the scores from the previous step;
  • Finally, sum the weighted values to obtain the corresponding output.

This model can be used in a translation task. For example, suppose the input has m words with word-vector dimension d_{k}, and the translation has n words with word-vector dimension d_{v}. In terms of shapes, the attention computation above can be summarised as [m,d_{k}]\times [n,d_{k}]^{T}\times[n,d_{v}], finally giving [m,d_{v}] (see the sketch below). Of course, this model can also be used for other tasks, such as reading comprehension, where the article is q, the reading-comprehension question is k, and the answer is v.
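To make the shape bookkeeping above concrete, here is a minimal sketch of scaled dot-product attention (my own illustration, not code from the original post), assuming TensorFlow 2.x and single unbatched inputs:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """q: [m, d_k], k: [n, d_k], v: [n, d_v]  ->  output: [m, d_v]"""
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # [m, n], scaled inner products
    weights = tf.nn.softmax(scores, axis=-1)                   # attention scores for each query
    return tf.matmul(weights, v)                               # weighted sum of the values

# Toy example: m=4 query vectors, n=6 key/value pairs, d_k=8, d_v=5.
q = tf.random.normal([4, 8])
k = tf.random.normal([6, 8])
v = tf.random.normal([6, 5])
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 5)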

2.2 The multi-head attention mechanism and the self-attention mechanism

The multi-head attention mechanism was proposed by Google in the 2017 paper "Attention is All You Need". It is an improvement on the original attention mechanism that uses a multi-head technique to strengthen it. In deep-learning approaches to NLP, we usually convert sentences into embeddings and then process them.

(Note: part of the explanation in this section is borrowed from another author's article, which I think is very good; see the reference link at the end of this section.)

Formally, the multi-head attention mechanism could not be simpler: map Q, K and V through parameter matrices, then do Attention; repeat this process h times and concatenate the results. Truly a case of "the great way is simplicity". The specific formulation is as follows:

                                                          head_{i}=Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})

                                                        MultiHead(Q,K,V)=Concat(head_{1},...,head_{h})

The specific calculation process is as follows:

  • Map Q, K and V through parameter matrices (a fully connected transformation);
  • Perform the dot-product (Attention) operation on the three results of the first step;
  • Repeat steps one and two h times, using a new set of weight matrices each time;
  • Concatenate the h results with the concat function.

The intuition behind this computation is as follows:

Each Attention operation lets the model attend to a particular aspect of the data (a local feature). With multiple Attention operations, local attention features in several directions are obtained; after all the local attention features are concatenated, they are transformed by the neural network into overall features, which are then used for fitting or classification.
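As a quick sanity check of this idea (not part of the original code), recent versions of tf.keras ship a built-in layer that packages exactly these steps. A minimal sketch, assuming TensorFlow 2.4 or later:

import tensorflow as tf

# Toy batch: 2 sentences, 10 tokens, 128-dimensional embeddings.
x = tf.random.normal([2, 10, 128])

# 8 heads with 16 dimensions per head, mirroring the Attention(8, 16) layer built later on.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=16)
y = mha(query=x, value=x, key=x)   # self-attention: Q = K = V = x
print(y.shape)                     # (2, 10, 128)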

The self-attention mechanism (Self Attention, also called internal attention) is mainly used to discover features inside a single sequence. Its structure is similar to the multi-head attention mechanism, with one difference: Q, K and V are all set to the same input X, i.e. Attention(X,X,X). One of the main contributions of the Google paper is showing that self-attention is very important on the sequence-encoding side of machine translation (and of general Seq2Seq tasks), whereas previous Seq2Seq research had basically only used the attention mechanism on the decoding side. Similarly, R-Net, at the time the top model on the SQuAD reading-comprehension leaderboard, also added a self-attention mechanism and improved its results. The general expression of the self-attention mechanism is as follows:

                                                                              Y=Attention(X,X,X)

Reference link: "Attention is All You Need": a brief reading (introduction + code)

2.3 Multi-head attention in practice: analyzing whether a reviewer is satisfied

2.3.1 Data loading

The dataset used in this example is the movie-review dataset released by Cornell University. It can be downloaded from the following link:

Link:  https://pan.baidu.com/s/1QBYjRjcO8MP3XFCwUPkz1g   Password: 5d40

The dataset contains two files, rt-polarity.neg and rt-polarity.pos, which contain 5,331 negative reviews and 5,331 positive reviews respectively.

The tf.keras.preprocessing.text.Tokenizer() module is used directly here to read and preprocess the data; the details are not described further. The code is as follows:

import tensorflow as tf

def load_data(positive_data_file,negative_data_file):
    '''Load the data'''
    # Read the raw text files
    file_list=[positive_data_file,negative_data_file]
    train_data=[]
    train_labels=[]
    for index,file in enumerate(file_list):
        with open(file,'r',encoding='utf-8') as fp:
            for line in fp.readlines():
                train_data.append(line.strip())
                train_labels.append(index)

    # Text/label preprocessing: (1) filter the text; (2) build the dictionary; (3) vectorize and align the text
    # Text filtering: remove invalid characters
    tokenizer=tf.keras.preprocessing.text.Tokenizer(oov_token="<unk>",
                                          filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')

    tokenizer.fit_on_texts(train_data)

    # Build the dictionary and construct the forward and reverse dictionaries
    tokenizer.word_index={key:value for key,value in tokenizer.word_index.items()}
    # Add the <unk> token to the dictionary
    tokenizer.word_index[tokenizer.oov_token] = len(tokenizer.word_index) + 1
    # Add the <pad> token to the dictionary
    tokenizer.word_index['<pad>'] = 0

    index_word = {value: key for key, value in tokenizer.word_index.items()}

    # Vectorize and align the text: convert each text into the dictionary's integer ids,
    # then pad/truncate to a fixed length (extra tokens are cut off, short sequences are zero-padded)
    train_seq=tokenizer.texts_to_sequences(train_data)
    len_seq=[len(l) for l in train_seq]
    cap_vector=tf.keras.preprocessing.sequence.pad_sequences(train_seq,padding='post')
    max_length = len(cap_vector[0])  # maximum sequence length

    return cap_vector, train_labels,max_length, len_seq,tokenizer.word_index, index_word

def dataset(positive_data_file,negative_data_file,batch_size=64):
    cap_vector, train_labels,max_length, len_seq, word_index, index_word = load_data(positive_data_file, negative_data_file)

    dataset=tf.data.Dataset.from_tensor_slices(((cap_vector,len_seq),train_labels))
    dataset=dataset.shuffle(len(cap_vector))
    dataset=dataset.batch(batch_size,drop_remainder=True)
    return dataset,max_length, word_index, index_word

2.3.2 Model building

  • Implementation of a word-embedding layer with position information

Although multi-head attention is essentially a key-value lookup mechanism, such a model cannot capture the order of the sequence! In other words, if you shuffle the rows of K and V (equivalent to shuffling the word order in a sentence), the result of Attention stays the same. For time series, and especially for NLP tasks, order is very important information: it represents local and even global structure. If the order information cannot be learned, the effect will be greatly reduced (in machine translation, for example, each word might be translated correctly yet not be organised into a reasonable sentence).

So Google offered another trick: Position Embedding, or the "position vector". Each position is numbered, and each number corresponds to a vector. By combining the position vector with the word vector, a certain amount of position information is introduced into each word, so that Attention can distinguish words at different positions.

In earlier work, the Position Embedding was basically a vector trained as part of the task. Google instead gives a formula for constructing the Position Embedding directly:

                                                \left\{\begin{matrix} PE_{2i}(p)=\sin\left(p/10000^{\frac{2i}{d_{pos}}}\right)\\ PE_{2i+1}(p)=\cos\left(p/10000^{\frac{2i}{d_{pos}}}\right) \end{matrix}\right.

The meaning here is to map the position with id p to a d_{pos}-dimensional position vector whose i-th element has the value PE_{i}(p). Google reports in the paper that they compared directly trained position vectors with position vectors computed by the above formula and found the results to be close, so naturally we prefer the Position Embedding constructed by the formula.

Position Embedding by itself encodes absolute position information, but in language relative position is also very important. An important reason Google chose the above position-vector formula is the identities sin(α+β)=sin(α)cos(β)+cos(α)sin(β) and cos(α+β)=cos(α)cos(β)−sin(α)sin(β), which show that the vector at position p+k can be expressed as a linear transformation of the vector at position p; this provides the possibility of expressing relative position information.
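To make the "linear transformation" claim concrete (this short derivation is mine, not in the original post): write \omega_{i}=1/10000^{2i/d_{pos}}, so that PE_{2i}(p)=\sin(\omega_{i}p) and PE_{2i+1}(p)=\cos(\omega_{i}p). Then

                                                \begin{pmatrix} PE_{2i}(p+k)\\ PE_{2i+1}(p+k) \end{pmatrix}=\begin{pmatrix} \cos(\omega_{i}k) & \sin(\omega_{i}k)\\ -\sin(\omega_{i}k) & \cos(\omega_{i}k) \end{pmatrix}\begin{pmatrix} PE_{2i}(p)\\ PE_{2i+1}(p) \end{pmatrix}

The matrix depends only on the offset k, so the encoding of position p+k is indeed a linear transformation (a rotation) of the encoding of position p.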

The main implementation steps of Position Embedding are as follows:

1. Use the sin and cos formulas to calculate each element of the position embedding;

2. Use the concat function to join the results of the first step into the final position information;

3. Concatenate the resulting position information with the word embedding, or add it to the embedding directly.

Here we implement Position Embedding as a custom layer. Customizing a layer in tf.keras mainly involves the following steps:

1. Inherit tf.keras.layers.Layer;

2. Implement the __init__ method in the class to initialize the layer;

3. Implement the build method to define the weights of this layer;

4. Implement the call method, which defines the layer's computation on the input data; it also needs to support masking (computing according to the actual sequence lengths);

5. Implement the compute_output_shape method in the class to specify the output shape of the layer.

The specific implementation is as follows:

import tensorflow as tf
from tensorflow.keras import backend as K  # the layer code below uses the Keras backend ops

class Position_Embedding(tf.keras.layers.Layer):
    '''Word embedding with position information'''
    def __init__(self,size=None,mode='sum',**kwargs):
        super(Position_Embedding,self).__init__(**kwargs)
        self.size=size  # must be an even number
        self.mode=mode
    def call(self, inputs, **kwargs):
        if self.size==None or self.mode=='sum':
            self.size=int(inputs.shape[-1])

        position_j=1./K.pow(10000.,2*K.arange(self.size/2,dtype='float32')/self.size)
        position_j=K.expand_dims(position_j,0)

        # Cumulative sum along dimension 1 of the input: like arange, this generates a position sequence, but it follows the actual length of x
        position_i=tf.cumsum(K.ones_like(inputs[:,:,0]),1)-1
        position_i=K.expand_dims(position_i,2)
        position_ij=K.dot(position_i,position_j)
        position_ij=K.concatenate([K.cos(position_ij),K.sin(position_ij)],2)
        if self.mode=='sum':
            return position_ij+inputs
        elif self.mode=='concat':
            return K.concatenate([position_ij,inputs],2)
        
    def compute_output_shape(self, input_shape):
        if self.mode == 'sum':
            return input_shape
        elif self.mode == 'concat':
            return (input_shape[0], input_shape[1], input_shape[2]+self.size)
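A quick smoke test of this layer (my own check, not from the original post), assuming eager-mode TensorFlow 2.x:

# Toy batch: 2 sentences, 10 tokens, 128-dimensional embeddings (the dimension must be even).
dummy = tf.random.normal([2, 10, 128])
out = Position_Embedding()(dummy)      # default mode='sum' adds position vectors of the same size
print(out.shape)                       # (2, 10, 128); mode='concat' would give (2, 10, 256)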
  • Implementation of the multi-head attention layer

Although the meaning of Multi-Head is very simple (do attention several times and then splice the results), you cannot write the program literally that way, or it will be very slow. Instead, the Multi-Head operations must be merged into a single tensor computation, because the multiplication of one large tensor is automatically parallelised internally. The approach here folds the per-head linear transformations into the inputs Q, K and V, expanding them by the specified number of heads, so that everything is computed directly in matrix form. The specific implementation steps are as follows:

1. Apply a linear transformation to each of the three roles Q, K and V in the attention mechanism;

2. Call batch_dot to perform a batched matrix multiplication between the transformed Q and K;

3. Perform a batched matrix multiplication between the result of the second step and V.

The specific implementation is as follows:

class Attention(tf.keras.layers.Layer):
    '''Multi-head attention layer built on the self-attention mechanism'''
    def __init__(self,nb_head,size_per_head,**kwargs):
        super(Attention,self).__init__(**kwargs)
        self.nb_head=nb_head                    # number of attention heads (how many times attention is computed)
        self.size_per_head=size_per_head        # dimensionality of each head's linear transformation
        self.output_dim=nb_head*size_per_head   # total output dimensionality

    def build(self,input_shape):
        '''Define the weight matrices for q, k and v'''
        super(Attention,self).build(input_shape)
        self.WQ=self.add_weight(name='WQ',
                                shape=(int(input_shape[0][-1]),self.output_dim),
                                initializer='glorot_uniform',trainable=True)
        self.WK = self.add_weight(name='WK',
                                  shape=(int(input_shape[1][-1]), self.output_dim),
                                  initializer='glorot_uniform', trainable=True)
        self.WV = self.add_weight(name='WV',
                                  shape=(int(input_shape[2][-1]), self.output_dim),
                                  initializer='glorot_uniform', trainable=True)

    def Mask(self,inputs,seq_len,mode='mul'):
        '''Mask method: restrict the computation on inputs to the actual lengths given by seq_len'''
        if seq_len==None:
            return inputs
        else:
            mask=K.one_hot(seq_len[:,0],K.shape(inputs)[1])
            mask=1-K.cumsum(mask,1)
            for _ in range(len(inputs.shape)-2):
                mask=K.expand_dims(mask,2)

            if mode=='mul':
                return inputs*mask
            if mode=='add':
                return inputs-(1-mask)*1e12

    def call(self, inputs, **kwargs):
        # Unpack the incoming Q_seq, K_seq, V_seq
        if len(inputs)==3:
            Q_seq,K_seq,V_seq=inputs
            Q_len,V_len=None,None
        elif len(inputs)==5:
            Q_seq,K_seq,V_seq,Q_len,V_len=inputs

        # Apply the nb_head linear transformations to Q_seq, K_seq and V_seq, reshaping to size_per_head per head
        Q_seq=K.dot(Q_seq,self.WQ)
        Q_seq=K.reshape(Q_seq,(-1,K.shape(Q_seq)[1],self.nb_head,self.size_per_head))
        Q_seq=K.permute_dimensions(Q_seq,(0,2,1,3))

        K_seq = K.dot(K_seq, self.WK)
        K_seq = K.reshape(K_seq, (-1, K.shape(K_seq)[1], self.nb_head, self.size_per_head))
        K_seq = K.permute_dimensions(K_seq, (0, 2, 1, 3))

        V_seq = K.dot(V_seq, self.WV)
        V_seq = K.reshape(V_seq, (-1, K.shape(V_seq)[1], self.nb_head, self.size_per_head))
        V_seq = K.permute_dimensions(V_seq, (0, 2, 1, 3))

        # Compute the inner products, then mask, then softmax
#         A=tf.compat.v1.keras.backend.batch_dot(Q_seq, K_seq, axes=[3,3])/ self.size_per_head**0.5
        A = K.batch_dot(Q_seq, K_seq, axes=[3,3]) / self.size_per_head**0.5
        A=K.permute_dimensions(A,(0,3,2,1))
        A=self.Mask(A,V_len,'add')
        A=K.permute_dimensions(A,(0,3,2,1))
        A=K.softmax(A)

        # Compute the output and apply the mask
        O_seq=K.batch_dot(A,V_seq,axes=[3,2])
        O_seq=K.permute_dimensions(O_seq,(0,2,1,3))
        O_seq=K.reshape(O_seq,(-1,K.shape(O_seq)[1],self.output_dim))
        O_seq=self.Mask(O_seq,Q_len,'mul')

        return O_seq

    def compute_output_shape(self, input_shape):
        return (input_shape[0][0], input_shape[0][1], self.output_dim)
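A quick shape check of this layer (my own toy example, not from the original post), assuming the same TensorFlow 2.x / tf.keras backend used by the rest of the code:

# Toy batch: 2 sentences, 10 tokens, 128-dimensional embeddings.
q = tf.random.normal([2, 10, 128])
att = Attention(nb_head=8, size_per_head=16)
out = att([q, q, q])          # self-attention: Q = K = V, no length masks
print(out.shape)              # (2, 10, 128), i.e. nb_head * size_per_head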
  • Model building

There is nothing to say here, just look at the code:

def RNN_Attention(embedding_size,vocab_size,max_len):
    input=tf.keras.layers.Input([max_len])

    # Generate word embeddings carrying position information
    embeddings=tf.keras.layers.Embedding(vocab_size,embedding_size)(input)
    embeddings=Position_Embedding()(embeddings) # by default, position vectors of the same dimensionality are added

    # Attention mechanism (8 heads, 16 dimensions per head)
    x=Attention(8,16)([embeddings,embeddings,embeddings])

    # Global average pooling over the time dimension
    x=tf.keras.layers.GlobalAveragePooling1D()(x)

    # Dropout
    x=tf.keras.layers.Dropout(rate=0.5)(x)
    # x=TargetedDropout(drop_rate=0.5, target_rate=0.5)(x)

    x=tf.keras.layers.Dense(1,activation='sigmoid')(x)

    model=tf.keras.Model(inputs=input,outputs=x)

    return model

2.3.3 Model training

positive_data_file="./rt-polaritydata/rt-polarity.pos"
negative_data_file="./rt-polaritydata/rt-polarity.neg"

train_dataset,max_length, word_index, index_word=dataset(positive_data_file,negative_data_file)
# Note: each dataset element is ((token_ids, seq_len), label), but the model defined above only
# takes the token ids as input, so keep just the first element of the feature tuple here.
train_dataset=train_dataset.map(lambda x,y:(x[0],y))

embedding_size=128
vocab_size=len(word_index)
max_len=max_length

model=RNN_Attention(embedding_size,vocab_size,max_len)
model.summary()

# Configure the loss, optimizer and metrics (this adds the backpropagation / training ops)
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy'])

# Start training
print('Train...')
model.fit(train_dataset,epochs=5)

The final training results are as follows:

3. Implementation of the hybrid attention mechanism

The hybrid attention mechanism was analysed in detail in a previous article; for details, please refer to: [Deep Learning Series (6)]: RNN Series (4): seq2seq model with attention mechanism in practice (1). Here we mainly discuss the difference between the hybrid attention mechanism and the general attention mechanism. In general, the hybrid attention mechanism has the following form:

                                                    a_{i}=Attention(s_{i-1},a_{i-1},h_{i})

That is, the hybrid attention mechanism depends on the decoder output s at the previous step, the attention (position) information a at the previous step, and the encoder content h at the current step. The general attention mechanism, which carries no position information, has the following form:

                                                   a_{i}=Attention(s_{i-1},h_{i})

The difference from the hybrid attention mechanism is that the latter additionally uses the position information a.

3.1 Concrete implementation of the hybrid attention mechanism

The specific implementation code is as follows (note that, unlike the earlier sections, this part is written against the TensorFlow 1.x API, e.g. tf.contrib.seq2seq):

import tensorflow as tf
from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import BahdanauAttention
#from tensorflow.python.layers import core as layers_core
#from tensorflow.python.ops import array_ops, math_ops, nn_ops, variable_scope
from tensorflow.python.ops import array_ops, variable_scope

def _location_sensitive_score(processed_query, processed_location, keys):
    # Get the attention depth (the number of fully connected units)
    dtype = processed_query.dtype
    num_units = keys.shape[-1].value or array_ops.shape(keys)[-1]

    # Define the final fully connected vector v_a
    v_a = tf.get_variable('attention_variable', shape=[num_units], dtype=dtype,
        initializer=tf.contrib.layers.xavier_initializer())

    # Define the bias b_a
    b_a = tf.get_variable('attention_bias', shape=[num_units], dtype=dtype,
        initializer=tf.zeros_initializer())

    # Compute the attention scores
    return tf.reduce_sum(v_a * tf.tanh(keys + processed_query + processed_location + b_a), [2])

def _smoothing_normalization(e):
    # Smoothing normalization function; returns [batch_size, max_time] and is used in place of softmax
    return tf.nn.sigmoid(e) / tf.reduce_sum(tf.nn.sigmoid(e), axis=-1, keepdims=True)


class LocationSensitiveAttention(BahdanauAttention):  # location-sensitive (hybrid) attention

    def __init__(self,                    # initialization
            num_units,                    # number of fully connected units used in the implementation
            memory,                       # the encoder's outputs
            smoothing=False,              # whether to use the smoothing normalization function instead of softmax
            cumulate_weights=True,        # whether to accumulate the attention results
            name='LocationSensitiveAttention'):

        # If smoothing is True, use _smoothing_normalization; otherwise use softmax
        normalization_function = _smoothing_normalization if (smoothing == True) else None
        super(LocationSensitiveAttention, self).__init__(
                num_units=num_units,
                memory=memory,
                memory_sequence_length=None,
                probability_fn=normalization_function,  # when None, the base class falls back to softmax
                name=name)

        self.location_convolution = tf.layers.Conv1D(filters=32,
            kernel_size=(31, ), padding='same', use_bias=True,
            bias_initializer=tf.zeros_initializer(), name='location_features_convolution')
        self.location_layer = tf.layers.Dense(units=num_units, use_bias=False,
            dtype=tf.float32, name='location_features_layer')
        self._cumulate = cumulate_weights

    def __call__(self, query,  # query: the decoder's intermediate state, [batch_size, query_depth]
                 state):       # state: the previous attention result, [batch_size, alignments_size]
        with variable_scope.variable_scope(None, "Location_Sensitive_Attention", [query]):

            # Process the query features with a fully connected layer: [batch_size, query_depth] -> [batch_size, attention_dim]
            processed_query = self.query_layer(query) if self.query_layer else query
            # Expand dimensions -> [batch_size, 1, attention_dim]
            processed_query = tf.expand_dims(processed_query, 1)

            # Expand dimensions: [batch_size, max_time] -> [batch_size, max_time, 1]
            expanded_alignments = tf.expand_dims(state, axis=2)
            # Extract location features with a convolution: [batch_size, max_time, filters]
            f = self.location_convolution(expanded_alignments)
            # Transform with a fully connected layer: [batch_size, max_time, attention_dim]
            processed_location_features = self.location_layer(f)

            # Compute the attention scores: [batch_size, max_time]
            energy = _location_sensitive_score(processed_query, processed_location_features, self.keys)

        # Compute the final attention result: [batch_size, max_time]
        alignments = self._probability_fn(energy, state)

        # Accumulate or not
        if self._cumulate:
            next_state = alignments + state
        else:
            next_state = alignments  # [batch_size, alignments_size], where alignments_size is the memory's max sequence length max_time

        return alignments, next_state
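For context, here is a hedged sketch of how such an attention mechanism is typically plugged into a TF 1.x decoder cell; the name encoder_outputs is a placeholder of my own, not from the original post:

# TF 1.x only: tf.contrib was removed in TensorFlow 2.x.
# encoder_outputs is assumed to be the encoder result with shape [batch_size, max_time, depth].
decoder_cell = tf.nn.rnn_cell.LSTMCell(num_units=256)
attention_mechanism = LocationSensitiveAttention(num_units=128,
                                                 memory=encoder_outputs,
                                                 smoothing=False,
                                                 cumulate_weights=True)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(decoder_cell,
                                                attention_mechanism,
                                                alignment_history=True)
# attn_cell can now be used like an ordinary RNN cell inside a seq2seq decoder.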

 

Origin blog.csdn.net/wxplol/article/details/104484101