[Deep Learning Series (6)]: RNN Series (5): Dynamic routing in RNN models

Dynamic routing serves a purpose similar to the attention mechanism: it assigns a corresponding weight coefficient c to each element of a sequence. In practice, compared with the attention mechanism, the dynamic routing algorithm has been found to improve accuracy. Unlike the attention mechanism, which computes its weights from a similarity function, this article assigns weights with the dynamic routing algorithm originally used in the capsule network; the idea is borrowed from there and applied to an RNN. Experience shows that algorithms from CNNs and RNNs can often be borrowed from one another, sometimes with surprisingly good results. The practical details follow below.


Table of Contents

1. The dynamic routing algorithm

2. Applying dynamic routing to an RNN

3. An RNN model with dynamic routing in practice: classifying Reuters news

3.1 Data loading

3.2 Building the RNN model with IndyLSTM cells


1. The dynamic routing algorithm

Geoffrey Hinton, one of the pioneers of deep learning and one of the inventors of classic neural network algorithms such as backpropagation, proposed the capsule network, a new type of neural network built from capsules and trained with a dynamic routing algorithm between capsules. Hinton et al.'s early report includes a schematic of capsules and the routing between them.

Here is a brief introduction to the routing algorithm in the capsule network. Dynamic routing divides capsules into lower-level capsules and parent (higher-level) capsules and computes the output of each parent capsule. How is it computed? Each input capsule's vector is transformed by a transformation matrix to form a "vote"; votes that agree are grouped together, and these votes ultimately determine the output vector of the parent capsule. The specific calculation is as follows:

  • Calculate the similarity weights (coupling coefficients) c_{ij}

                                                   c_{ij}=\frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})}

Here b_{ij} is the similarity between lower-level capsule i and upper-level capsule j, initialized to 0 before the first iteration. Softmax normalizes these similarities to produce the similarity weights. Using Softmax guarantees that all weights c_{ij} are non-negative and that, for each lower-level capsule, they sum to one; in essence, softmax enforces the probabilistic nature of the coefficients c_{ij} described above. Conceptually, the similarity weight c_{ij} measures how likely capsule i is to activate capsule j.
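
As a quick numerical check of this property (a minimal NumPy sketch, not part of the original article), softmax over the logits of one lower-level capsule produces non-negative weights that sum to one:

import numpy as np

# hypothetical logits b_ij of one lower-level capsule i against 3 upper-level capsules j
b_i = np.array([0.0, 1.2, -0.5])

# softmax over the upper-level capsules gives the coupling coefficients c_ij
c_i = np.exp(b_i) / np.sum(np.exp(b_i))
print(c_i, c_i.sum())  # approx. [0.203 0.674 0.123], sum = 1.0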

  • Calculate the output of the capsule (activation vector) v_{j}

First, the output (prediction vector) of a lower-level capsule is u_{i}. We multiply it by a transformation matrix to change its dimensionality and obtain a new prediction vector \widehat{u}_{j|i}:

                                                   \widehat{u}_{j|i}=W_{ji}\,u_{i}

Then a weighted sum is computed with the similarity weights (coupling coefficients) c_{ij}, giving the total input of the capsule, s_{j}:

                                                  s_{j}=\sum_{i} c_{ij}\,\widehat{u}_{j|i}

Finally, s_{j} is passed through the squash nonlinear function, which preserves the direction of the output vector while limiting its length to less than 1: a short vector is compressed toward zero, and a long vector is squashed toward a unit vector.

                                                  v_{j}=\frac{\left \| s_{j} \right \|^{2}}{1+\left \| s_{j} \right \|^{2}}\frac{s_{j}}{\left \| s_{j} \right \|}
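
A minimal NumPy sketch of the squash function (the TensorFlow version used in this article appears below as _squash()), illustrating that short vectors are compressed toward zero while long vectors approach unit length:

import numpy as np

def squash(s, eps=1e-9):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    sq_norm = np.sum(np.square(s))
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

print(np.linalg.norm(squash(np.array([0.01, 0.01]))))  # approx. 2e-4: a short vector shrinks toward 0
print(np.linalg.norm(squash(np.array([10.0, 10.0]))))  # approx. 0.995: a long vector approaches length 1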

  • Update the similarity weights b_{ij}

Intuitively, the prediction vector \widehat{u}_{j|i} is the vote cast by lower-level capsule i, and it influences the output of upper-level capsule j. If the activation vector v_{j} and the prediction vector have a high degree of similarity, we can conclude that the two capsules are strongly related. This similarity is measured by the scalar product of the prediction vector and the activation vector. The update of b_{ij} is computed as follows:

                                                  b_{ij}\leftarrow b_{ij}+\widehat{u}_{j|i}\cdot v_{j}

Therefore, the similarity score takes both likelihood and feature attributes into account, instead of considering likelihood alone as an ordinary neuron does. The complete routing procedure is summarized in the sketch below.
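
The original post shows the final pseudocode as an image, which is not reproduced here. The following is a minimal NumPy reconstruction of the routing loop from the equations above (an illustrative sketch only, not the TensorFlow implementation used later in this article):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-9):
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, iter_num=3):
    """u_hat: prediction vectors u_hat_{j|i}, shape (num_lower, num_upper, out_dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                # similarity logits, initialized to 0
    for _ in range(iter_num):
        c = softmax(b, axis=1)                          # coupling coefficients c_ij
        s = np.sum(c[:, :, None] * u_hat, axis=0)       # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                                   # v_j = squash(s_j)
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # b_ij += u_hat_{j|i} . v_j
    return v

v = dynamic_routing(np.random.randn(6, 3, 4))  # 6 lower-level capsules -> 3 upper-level capsules of dim 4
print(v.shape)  # (3, 4)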

Reference links:

CapsNet Getting Started Series Three: Dynamic Routing Algorithm Between Capsules

Understanding of dynamic routing between capsules (based on Hinton's capsule network)

How to treat Hinton's paper "Dynamic Routing Between Capsules"?

Slowly learn NLP / Capsule Net

2. Applying dynamic routing to an RNN

To apply the dynamic routing algorithm from the capsule network to an RNN model, a few changes are needed:

(1) Use a fully connected layer to convert the RNN outputs into the prediction vectors \widehat{u}_{j|i}; see the shared_routing_uhat() function.

(2) Mask by the input sequence length so that the dynamic routing algorithm supports variable-length sequence input; see the masked_routing_iter() function.

(3) Aggregate the RNN outputs and apply dropout to the result of the dynamic routing computation, for stronger regularization; see the routing_masked() function.

The specific implementation code is as follows:

def mkMask(input_tensor, maxLen):
    '''
    Build the mask for a variable-length RNN model from the sequence lengths.
    :param input_tensor: tensor of sequence lengths, shape (batch_size,)
    :param maxLen: maximum sequence length
    :return: boolean mask of shape (batch_size, maxLen)
    '''
    shape_of_input = tf.shape(input_tensor)
    shape_of_output = tf.concat(axis=0, values=[shape_of_input, [maxLen]])

    oneDtensor = tf.reshape(input_tensor, shape=(-1,))
    flat_mask = tf.sequence_mask(oneDtensor, maxlen=maxLen)

    return tf.reshape(flat_mask, shape_of_output)

def shared_routing_uhat(caps, out_caps_num, out_caps_dim, scope=None):
    '''
    Convert the input capsules (the RNN outputs) into the prediction vectors u_hat.
    :param caps: input tensor, (batch_size, max_len, cap_dims)
    :param out_caps_num: number of output capsules
    :param out_caps_dim: dimension of each output capsule
    :return: u_hat, (batch_size, max_len, out_caps_num, out_caps_dim)
    '''
    batch_size, max_len = tf.shape(caps)[0], tf.shape(caps)[1]

    with tf.variable_scope(scope or 'shared_routing_uhat'):
        caps_uhat = tf.keras.layers.Dense(out_caps_num * out_caps_dim, activation='tanh')(caps)
    caps_uhat = tf.reshape(caps_uhat, [batch_size, max_len, out_caps_num, out_caps_dim])

    return caps_uhat

def _squash(in_caps, axes):
    '''
    The squash nonlinearity: keeps the direction of the vector while limiting its length to (0, 1).
    :param in_caps: input tensor
    :param axes: axes over which the vector norm is computed
    :return: squashed tensor, same shape as in_caps
    '''
    _EPSILON = 1e-9
    vec_squared_norm = tf.reduce_sum(tf.square(in_caps), axis=axes, keepdims=True)
    scalar_factor = vec_squared_norm / (1 + vec_squared_norm) / tf.sqrt(vec_squared_norm + _EPSILON)
    vec_squashed = scalar_factor * in_caps
    return vec_squashed

def masked_routing_iter(caps_uhat, seqLen, iter_num):
    '''
    Dynamic routing computation.
    :param caps_uhat: prediction vectors, (batch_size, max_len, out_caps_num, out_caps_dim)
    :param seqLen: true length of each sequence, (batch_size,)
    :param iter_num: number of routing iterations
    :return: v (activation vectors) and s (pre-squash weighted sums)
    '''
    assert iter_num > 0

    # Batch size and maximum sequence length
    batch_size, max_len = tf.shape(caps_uhat)[0], tf.shape(caps_uhat)[1]
    # Number of output capsules (statically known)
    out_caps_num = int(caps_uhat.get_shape()[2])
    seqLen = tf.where(tf.equal(seqLen, 0), tf.ones_like(seqLen), seqLen)
    mask = mkMask(seqLen, max_len)  # (batch_size, max_len)
    float_mask = tf.cast(tf.expand_dims(mask, axis=-1), dtype=tf.float32)  # (batch_size, max_len, 1)

    # Initialize the similarity logits b
    B = tf.zeros([batch_size, max_len, out_caps_num], dtype=tf.float32)

    # Iteratively update the similarity logits b
    for i in range(iter_num):
        # Compute the similarity weights (coupling coefficients) c
        c = tf.keras.layers.Softmax(axis=2)(B)  # (batch_size, max_len, out_caps_num)
        c = tf.expand_dims(c * float_mask, axis=-1)  # (batch_size, max_len, out_caps_num, 1)

        # Compute the capsule outputs (activation vectors) v
        weighted_uhat = c * caps_uhat  # (batch_size, max_len, out_caps_num, out_caps_dim)
        s = tf.reduce_sum(weighted_uhat, axis=1)  # (batch_size, out_caps_num, out_caps_dim)
        # The squash nonlinearity
        v = _squash(s, axes=[2])  # (batch_size, out_caps_num, out_caps_dim)
        v = tf.expand_dims(v, axis=1)  # (batch_size, 1, out_caps_num, out_caps_dim)

        # Update the similarity logits b
        B = tf.reduce_sum(caps_uhat * v, axis=-1) + B  # (batch_size, max_len, out_caps_num)

    v_ret = tf.squeeze(v, axis=[1])  # (batch_size, out_caps_num, out_caps_dim)
    s_ret = s
    return v_ret, s_ret

# Aggregate the RNN outputs with dynamic routing
def routing_masked(in_x, xLen, out_caps_dim, out_caps_num, iter_num=3,
                   dropout=None, is_train=False, scope=None):
    assert len(in_x.get_shape()) == 3 and in_x.get_shape()[-1].value is not None
    b_sz = tf.shape(in_x)[0]
    with tf.variable_scope(scope or 'routing'):
        caps_uhat = shared_routing_uhat(in_x, out_caps_num, out_caps_dim, scope='rnn_caps_uhat')
        attn_ctx, S = masked_routing_iter(caps_uhat, xLen, iter_num)
        attn_ctx = tf.reshape(attn_ctx, shape=[b_sz, out_caps_num * out_caps_dim])
        if dropout is not None:
            attn_ctx = tf.layers.dropout(attn_ctx, rate=dropout, training=is_train)
    return attn_ctx
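
A quick shape check of routing_masked() (an illustrative usage sketch with made-up sizes, assuming a TensorFlow 1.x graph):

# Illustrative usage with made-up sizes: 4 sequences of up to 10 steps, 32-dim RNN outputs,
# aggregated into 5 capsules of dimension 32, i.e. a context vector of 5*32 = 160 features.
demo_in = tf.random_normal([4, 10, 32])
demo_len = tf.constant([10, 7, 3, 5], dtype=tf.int32)
demo_out = routing_masked(demo_in, demo_len, out_caps_dim=32, out_caps_num=5, iter_num=3)
print(demo_out.get_shape())  # (?, 160) -- the batch dimension is dynamic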

3. An RNN model with dynamic routing in practice: classifying Reuters news

3.1 Data loading

The dataset used here is the Reuters newswire dataset provided through the tf.keras interface. It contains 11,228 newswires labeled over 46 topics. The specific interface is:

tf.keras.datasets.reuters

The main steps are as follows:

  • Load the data with the tf.keras.datasets.reuters.load_data function
  • Pad the sequences to the same length with the tf.keras.preprocessing.sequence.pad_sequences function

The specific code implementation is as follows:

# Parameters
NUM_WORDS = 2000  # maximum vocabulary size
MAXLEN = 80       # maximum sentence length

def load_data(num_words=NUM_WORDS, maxlen=MAXLEN):
    '''Load the Reuters dataset'''
    print("load datasets ...")
    (x_train, y_train), (x_test, y_test) = \
        tf.keras.datasets.reuters.load_data(path='./reuters.npz', num_words=num_words)

    # Preprocessing: pad the sequences and compute their true lengths.
    # Sentences longer than maxlen are truncated from the front (the default truncating='pre'),
    # keeping the last maxlen tokens; shorter sentences are zero-padded at the end (padding='post').
    x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen, padding='post')
    x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen, padding='post')
    print('Pad sequences x_train shape:', x_train.shape)

    # True (non-padded) length of each sentence
    len_train = np.count_nonzero(x_train, axis=1)
    len_test = np.count_nonzero(x_test, axis=1)

    return (x_train, y_train, len_train), (x_test, y_test, len_test)

def dataset(batch_size):
    (x_train, y_train, len_train), _ = load_data()

    dataset = tf.data.Dataset.from_tensor_slices(((x_train, len_train), y_train))

    dataset = dataset.shuffle(1000).batch(batch_size, drop_remainder=True)  # drop the last incomplete batch

    return dataset

3.2 Building the RNN model with IndyLSTM cells

The main steps are as follows:

(1) Stack three IndyLSTM cells with MultiRNNCell and pass the result to the tf.nn.dynamic_rnn() function to build a dynamic RNN model;

(2) Aggregate the outputs of the RNN model with the dynamic-routing-based routing_masked() function;

(3) Compute the loss from the classification result and define the optimizer used for training.

x = tf.placeholder("float", [None, MAXLEN])  # input sequences
x_len = tf.placeholder(tf.int32, [None, ])   # true sequence lengths
y = tf.placeholder(tf.int32, [None, ])       # class labels

nb_features = 128  # word-embedding dimension
embeddings = tf.keras.layers.Embedding(NUM_WORDS, nb_features)(x)

# Build the RNN with IndyLSTMCell units
hidden = [100, 50, 30]  # number of units per layer
stacked_rnn = []
for i in range(3):
    cell = tf.contrib.rnn.IndyLSTMCell(hidden[i])
    stacked_rnn.append(tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.8))
mcell = tf.nn.rnn_cell.MultiRNNCell(stacked_rnn)

rnnoutputs, _ = tf.nn.dynamic_rnn(mcell, embeddings, dtype=tf.float32)
out_caps_num = 5  # number of output capsules
n_classes = 46    # number of classes

outputs = routing_masked(rnnoutputs, x_len, int(rnnoutputs.get_shape()[-1]), out_caps_num, iter_num=3)
print(outputs.get_shape())
pred = tf.layers.dense(outputs, n_classes, activation=tf.nn.relu)


# Define the loss and the optimizer
learning_rate = 0.001
cost = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

dataset = dataset(batch_size=32)  # build the tf.data pipeline (the batch size of 32 is an assumed value)
iterator1 = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
one_element1 = iterator1.get_next()  # fetch one batch


# Train the network
with tf.Session() as sess:
    sess.run(iterator1.make_initializer(dataset))  # initialize the iterator
    sess.run(tf.global_variables_initializer())
    EPOCHS = 20
    for ii in range(EPOCHS):
        alloss = []   # collect the losses of one epoch
        while True:   # loop until the dataset is exhausted
            try:
                inp, target = sess.run(one_element1)
                _, loss = sess.run([optimizer, cost], feed_dict={x: inp[0], x_len: inp[1], y: target})
                alloss.append(loss)

            except tf.errors.OutOfRangeError:
                print("step", ii + 1, ": loss=", np.mean(alloss))
                sess.run(iterator1.make_initializer(dataset))  # re-initialize for the next epoch
                break

During training, the mean loss is printed after each epoch.
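
As a possible follow-up (not part of the original code), a sketch of how the trained model could be evaluated on the held-out test set, assuming an accuracy op is added to the graph before the session is created:

# Hypothetical evaluation sketch (not in the original code): add an accuracy op to the
# graph, then feed the padded test set after the training loop has finished.
correct = tf.equal(tf.cast(tf.argmax(pred, axis=1), tf.int32), y)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

# Inside the training session, after the epoch loop:
#   _, (x_test, y_test, len_test) = load_data()
#   acc = sess.run(accuracy, feed_dict={x: x_test, x_len: len_test, y: y_test})
#   print("test accuracy:", acc)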

Source: blog.csdn.net/wxplol/article/details/104484067