Implementing InferSent

I recently found time to finish a new GitHub project: an implementation of InferSent. InferSent was introduced in an earlier post on this blog. I implemented it for two reasons: first, the algorithm itself is simple; second, its performance across a range of NLP tasks is competitive with other state-of-the-art models.

The structure of the InferSent model is as follows:
[Figure: InferSent model architecture]

InferSent trains sentence embeddings on the NLI task, using the SNLI dataset, which was covered in an earlier post, so I won't repeat it here. The premise and hypothesis sentences share a single sentence encoder. The paper experiments with LSTM and GRU, BiLSTM with mean/max pooling, a self-attentive network, and a hierarchical ConvNet, and concludes that BiLSTM with max pooling gives the best overall transfer performance.

I implemented two encoders in the code: a DAN (deep averaging network) and a BiLSTM with max pooling. I implemented the latter because it performs best as an encoder, and the DAN simply because it is simple enough to serve as a baseline.
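The DAN code isn't reproduced in this post; as a rough illustration, here is a minimal one-layer sketch (my own, following the same interface as bilstm_as_encoder below; the repo's actual DAN, and DANs in general, may stack several dense layers with nonlinearities):

import tensorflow as tf

def dan_as_encoder(sent_padded_as_tensor, word_embeddings,
                   layer_size, embedding_size=300):
    # Look up word vectors and average them over the sentence
    # (the "deep averaging" starting point).
    embed_input = tf.nn.embedding_lookup(word_embeddings,
                                         sent_padded_as_tensor)
    averaged = tf.reduce_mean(embed_input, axis=1)  # [batch, embedding_size]

    # A single dense projection to the encoding size; a deeper DAN
    # would stack more of these.
    w1 = tf.get_variable(name="w1", dtype=tf.float32,
                         shape=[embedding_size, layer_size[0]])
    b1 = tf.get_variable(name="b1", dtype=tf.float32,
                         shape=[layer_size[0]])
    return tf.matmul(averaged, w1) + b1

The BiLSTM-with-max-pooling encoder follows: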

import numpy as np
import tensorflow as tf

def bilstm_as_encoder(sent_padded_as_tensor, word_embeddings,
                      layer_size, hidden_size=100, sent_length=50,
                      embedding_size=300):
    # Look up word vectors: [batch, sent_length, embedding_size].
    embed_input = tf.nn.embedding_lookup(word_embeddings,
                                         sent_padded_as_tensor)
    print("sent_padded_as_tensor: "+str(sent_padded_as_tensor))
    print("embed_input: "+str(embed_input))

    # One LSTM cell per direction.
    cell_fw = tf.nn.rnn_cell.LSTMCell(hidden_size)
    print('build fw cell: '+str(cell_fw))
    cell_bw = tf.nn.rnn_cell.LSTMCell(hidden_size)
    print('build bw cell: '+str(cell_bw))

    # Run the BiLSTM; rnn_outputs is a (forward, backward) pair of
    # [batch, sent_length, hidden_size] tensors.
    rnn_outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,
                                                     inputs=embed_input,
                                                     dtype=tf.float32)
    print('rnn outputs: '+str(rnn_outputs))

    # Concatenate both directions: [batch, sent_length, 2*hidden_size].
    concatenated_rnn_outputs = tf.concat(rnn_outputs, 2)
    print('concatenated rnn outputs: '+str(concatenated_rnn_outputs))

    # Max-pool over the time axis to get one vector per sentence.
    max_pooled = tf.layers.max_pooling1d(concatenated_rnn_outputs,
                                         sent_length, strides=1)
    print('max_pooled: '+str(max_pooled))

    max_pooled_formated = tf.reshape(max_pooled, [-1, 2*hidden_size])
    print('max_pooled_formated: '+str(max_pooled_formated))

    # Project down to the requested sentence-encoding size.
    w1 = tf.get_variable(name="w1", dtype=tf.float32,
                         shape=[2*hidden_size, layer_size[0]])
    b1 = tf.get_variable(name="b1", dtype=tf.float32,
                         shape=[layer_size[0]])
    encoded = tf.matmul(max_pooled_formated, w1) + b1

    return encoded

The above is the encoder implementation for BiLSTM with max pooling. The drawback of a BiLSTM is that it is slow; an easy speedup is to swap cell_fw and cell_bw for GRU cells, as sketched below. I tried this: it was faster, and the accuracy gap was small. The word_embeddings here are the 300-dimensional GloVe vectors trained on 840B tokens, the same ones used in the paper. I used a hidden size of 100 for the LSTM cells; a larger value would arguably be better, but resource constraints forced me to compromise. If you have the resources, I strongly recommend experimenting with larger hidden vectors. layer_size in the code is the dimensionality of the final encoded sentence; in my experiments I used 512, as suggested by the paper.
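The GRU swap amounts to two lines (tf.nn.rnn_cell.GRUCell is the drop-in replacement in TF 1.x):

    # In bilstm_as_encoder, replace the two LSTMCell lines with:
    cell_fw = tf.nn.rnn_cell.GRUCell(hidden_size)
    cell_bw = tf.nn.rnn_cell.GRUCell(hidden_size)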

def build_graph(
    inputs1,
    inputs2,
    emb_matrix,
    encoder,
    embedding_size=300,
    layer_size=None,
    nclasses=3
    ):

    print(" input1 shape: "+str(inputs1.shape))
    print(" input2 shape: "+str(inputs2.shape))

    word_embeddings = tf.convert_to_tensor(emb_matrix, np.float32)
    print("word_embeddings shape: "+str(word_embeddings.shape))
    print(word_embeddings)

    # the encoders
    with tf.variable_scope("encoder_vars") as encoder_scope:
        encoded_input1 = encoder(inputs1, word_embeddings, layer_size)
        encoder_scope.reuse_variables()
        encoded_input2 = encoder(inputs2, word_embeddings, layer_size)
        print("encoded inputs1 shape: "+str(encoded_input1.shape))
        print("encoded inputs2 shape: "+str(encoded_input2.shape))

    # InferSent feature vector for the classifier: (u, v, |u-v|, u*v).
    abs_diffed = tf.abs(tf.subtract(encoded_input1, encoded_input2))
    print(abs_diffed)

    multiplied = tf.multiply(encoded_input1, encoded_input2)
    print(multiplied)
    concatenated = tf.concat([encoded_input1, encoded_input2,
                              abs_diffed, multiplied], 1)
    print(concatenated)
    concatenated_dim = concatenated.shape.as_list()[1]

    # The fully-connected layer, fixed at 512 units.
    fully_connected_layer_size = 512
    with tf.variable_scope("dnn_vars") as dnn_scope:
        wd = tf.get_variable(name="wd", dtype=tf.float32,
                             shape=[concatenated_dim,
                                    fully_connected_layer_size])
        bd = tf.get_variable(name="bd", dtype=tf.float32,
                             shape=[fully_connected_layer_size])
    dnned = tf.matmul(concatenated, wd) + bd
    print(dnned)

    # The output layer producing nclasses logits.
    with tf.variable_scope("out") as out:
        w_out = tf.get_variable(name="w_out", dtype=tf.float32,
                                shape=[fully_connected_layer_size,
                                       nclasses])
        b_out = tf.get_variable(name="b_out", dtype=tf.float32,
                                shape=[nclasses])
    logits = tf.matmul(dnned, w_out) + b_out

    return logits

The code above builds the computation graph: it takes an encoder from the first snippet and completes the rest of the model.
Note this part in particular:

    # the encoders
    with tf.variable_scope("encoder_vars") as encoder_scope:
        encoded_input1 = encoder(inputs1, word_embeddings, layer_size)
        encoder_scope.reuse_variables()
        encoded_input2 = encoder(inputs2, word_embeddings, layer_size)
        print("encoded inputs1 shape: "+str(encoded_input1.shape))
        print("encoded inputs2 shape: "+str(encoded_input2.shape))

For encoded_input1 and encoded_input2, the encoder parameters are shared: encoder_scope.reuse_variables() makes the second call to encoder reuse the variables created by the first call, rather than create new ones.
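For context, here is a minimal sketch of how build_graph could be wired into a training objective. The placeholder names, loss, and optimizer below are my assumptions for illustration, not the repo's actual training script:

import numpy as np
import tensorflow as tf

# Assumed setup: emb_matrix is the GloVe embedding matrix (a numpy array),
# sentences are padded to 50 word ids, and labels are 0/1/2 for
# entailment/neutral/contradiction.
inputs1 = tf.placeholder(tf.int32, shape=[None, 50], name="premise")
inputs2 = tf.placeholder(tf.int32, shape=[None, 50], name="hypothesis")
labels = tf.placeholder(tf.int32, shape=[None], name="labels")

logits = build_graph(inputs1, inputs2, emb_matrix,
                     encoder=bilstm_as_encoder, layer_size=[512])

# Standard softmax cross-entropy over the three NLI classes.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                   logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)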

The model is saved to the logs folder at the end of every epoch. I also provide a sentence_encoder.py that loads a saved model and encodes input sentences, producing sentence embeddings.
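sentence_encoder.py itself is not reproduced in this post; the general pattern for such a loader, sketched with assumed checkpoint filenames and tensor names (the actual names in the repo will differ), is roughly:

import tensorflow as tf

with tf.Session() as sess:
    # Restore the graph and weights from the latest checkpoint in logs/.
    saver = tf.train.import_meta_graph("logs/model.meta")  # assumed filename
    saver.restore(sess, tf.train.latest_checkpoint("logs"))

    graph = tf.get_default_graph()
    # Assumed tensor names (matching the training sketch above); in
    # practice, inspect the saved graph to find the input placeholder
    # and the encoder output tensor.
    sentence_input = graph.get_tensor_by_name("premise:0")
    sentence_encoding = graph.get_tensor_by_name("encoder_vars/add:0")

    # padded_ids: the input sentence as a [1, 50] array of word ids
    # (tokenization and padding not shown).
    embedding = sess.run(sentence_encoding,
                         feed_dict={sentence_input: padded_ids})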

Were it not for resource constraints, I believe the paper's accuracy could be fully reproduced. I haven't had time to try the downstream NLP tasks; if anyone is interested in trying them, please share your results.

Source: blog.csdn.net/triplemeng/article/details/82106615