Pointer-Generator Summarization and a First Attempt at a Reinforcement-Learning Policy-Gradient Version

Paper links:

https://arxiv.org/abs/1704.04368

https://arxiv.org/abs/1805.09461

Motivation:

Reinforcement learning is currently quite an interesting topic. At its core it models a decision process. By analogy with traditional data-mining problems, the states of the decision process correspond to abstractions of an individual's features (the independent variables), the actions correspond to the values of interest (the dependent variables), and the reward the environment returns for a decision is analogous to a "negative loss". This is the general recipe for recasting an ordinary data-mining or deep-learning problem as a reinforcement-learning problem.

Sometimes the recasting goes beyond analogy, beyond merely solving a problem in a different way: it targets points that traditional deep-learning methods cannot handle, such as the mismatch between the loss and the evaluation metric. Optimization is driven by the loss, yet the eval metric is usually the more "on-target" objective (i.e., closer to a human's intuitive value judgment), and it is hard to optimize directly (in deep-learning settings the eval metric is typically non-differentiable and awkward to handle).

Reinforcement learning addresses exactly this point: optimizing directly against the eval metric. One straightforward RL solution is the policy gradient, which injects the reward derived from the eval metric into the optimization as a direction for the gradient. Applied to a concrete problem, this may yield a more "on-target" solution.
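As a toy illustration of the idea (mine, not from either paper; all names and shapes below are made up), the core of the policy gradient fits in a few lines of TF 1.x: the non-differentiable reward enters the graph only as a scalar weight on the differentiable log-probability of the sampled actions.

import tensorflow as tf

n_actions = 10
state = tf.placeholder(tf.float32, [None, 4], name="state")         # per-step features
actions = tf.placeholder(tf.int32, [None], name="sampled_actions")  # the action taken at each step
reward = tf.placeholder(tf.float32, [], name="episode_reward")      # e.g. the ROUGE of the episode

logits = tf.layers.dense(state, n_actions)  # a tiny policy network
# sparse_softmax_cross_entropy_with_logits returns -log pi(a_t | s_t)
neg_log_probs = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=actions, logits=logits)
# REINFORCE surrogate loss: reward-weighted negative log-likelihood, so the
# gradient raises the probability of action sequences that earned high reward
loss = reward * tf.reduce_sum(neg_log_probs)
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)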

The paper cited above, Deep Reinforcement Learning For Sequence to Sequence Models, is the authors' recasting of the basic seq2seq + pointer-generator sequence-to-sequence framework in this reinforcement-learning direction.

The authors "recast" the seq2seq + pointer generator model with several reinforcement-learning methods. Since I have only tried Q-learning and policy gradients, I will only briefly introduce the policy-gradient recasting. (Unless otherwise noted, everything below is in the summarization setting.)

Seq2seq + pointer generator sequence encoder-decoder model (see the paper Get To The Point: Summarization with Pointer-Generator Networks)

The traditional seq2seq model is shown in the figure below:

[Figure: a traditional attentional seq2seq model]

Its basic advantage is the attention mechanism, which provides a summarized abstraction of the encoded input.

The modeling involves the following formulas:

Attention computation:

$$e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t)$$

The encoder hidden states are attention-weighted to produce the context vector:

$$h_t^* = \sum_i a_i^t h_i$$

Decoding probability distribution:

$$P_{vocab} = \mathrm{softmax}\big(V'\,(V\,[s_t, h_t^*] + b) + b'\big)$$
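To make the shapes concrete, here is a minimal numpy sketch of the three formulas above for a single decoder step (dimensions and variable names are illustrative, and only one projection layer of the two-layer output is shown):

import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

enc_len, hid, v_dim, vocab = 5, 8, 6, 20
rng = np.random.RandomState(0)

h = rng.randn(enc_len, hid)   # encoder hidden states h_i
s_t = rng.randn(hid)          # decoder state s_t
Wh, Ws = rng.randn(v_dim, hid), rng.randn(v_dim, hid)
v, b_attn = rng.randn(v_dim), rng.randn(v_dim)

# attention logits and distribution over the input positions
e_t = np.array([v @ np.tanh(Wh @ h_i + Ws @ s_t + b_attn) for h_i in h])
a_t = softmax(e_t)
# attention-weighted context vector h_t^*
h_star = a_t @ h
# vocabulary distribution from [s_t, h_t^*] (single projection for brevity)
V = rng.randn(vocab, 2 * hid)
p_vocab = softmax(V @ np.concatenate([s_t, h_star]))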

One problem is that the generation process is itself "limited" by the vocabulary: when some tokens of a new input are out of vocabulary (OOV), the generated summary cannot make use of that information. A common way to exploit it is to copy input words directly into the appropriate place in the output. That is, at every decoding step the model should also emit a probability distribution over the input positions; this "pointer" distribution (pointing at input token positions) is then probability-weighted and merged with the original seq2seq vocabulary distribution into a mixture distribution, and the decoded token is drawn from that mixture.

This idea also exists in statistics. From an optimization standpoint, one motivation for data transformations in statistics is the following: when the independent and dependent variables belong to the same distribution family, the residuals tend to be normal, the loss resembles l2, and optimization generally avoids skewed sample weighting. Adding the distribution over the input's words into the decoding distribution enlarges the "overlap" between the two and strengthens the distributional match, which helps produce a more balanced optimization result. (Mixed linear models, e.g., those that add non-parametric components, largely do the same thing.)

The concrete form with the pointer distribution added is given below:

Definition of the mixture weight:

$$p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr})$$

Mixture distribution:

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i^t$$
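Continuing the sketch, the mixing step can be checked on a toy extended vocabulary (sizes and ids below are made up): the pointer mass is appended after the vocabulary axis, exactly as the implementation later does.

import numpy as np

vocab, enc_len = 20, 5
rng = np.random.RandomState(1)

p_vocab = rng.dirichlet(np.ones(vocab))  # generator distribution over the fixed vocab
a_t = rng.dirichlet(np.ones(enc_len))    # attention over the input positions
src_ids = np.array([3, 50, 7, 50, 2])    # input token ids; 50 is an OOV id (>= vocab)

p_gen = 0.8
# extended distribution over vocab + encoder positions
p_ext = np.concatenate([p_gen * p_vocab, (1 - p_gen) * a_t])

# probability of the OOV word 50 = total pointer mass on the positions holding it
p_oov = p_ext[vocab:][src_ids == 50].sum()
print(p_ext.sum(), p_oov)  # p_ext sums to 1; p_oov > 0 although 50 is outside the vocab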

This yields the model shown in the figure below:


[Figure: the pointer-generator network]

To address repetition within the generated text, the authors add an accumulated-attention penalty: the element-wise minimum of the attention accumulated over previous decoding steps and the current step's attention is added to the loss, thereby controlling the spread of accumulated information.

The authors also feed this accumulated value into the construction of the current step's attention, as shown below:

$$c^t = \sum_{t'=0}^{t-1} a^{t'}$$

$$e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})$$

The loss is assembled as follows:


$$\mathrm{covloss}_t = \sum_i \min(a_i^t, c_i^t)$$

$$\mathrm{loss}_t = -\log P(w_t^*) + \lambda\, \mathrm{covloss}_t$$
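A short numpy sketch of the coverage bookkeeping (shapes illustrative): the coverage vector is a running sum of past attention, the penalty is the element-wise minimum against the current attention, and the one-shot upper-triangular trick used in the code below reproduces the same running sum.

import numpy as np

rng = np.random.RandomState(2)
dec_steps, enc_len = 4, 5
a = rng.dirichlet(np.ones(enc_len), size=dec_steps)  # a^t per decoder step, rows sum to 1

cov_loss, c = 0.0, np.zeros(enc_len)                 # c^0 = 0
for t in range(dec_steps):
    cov_loss += np.minimum(a[t], c).sum()            # covloss_t = sum_i min(a_i^t, c_i^t)
    c = c + a[t]                                     # c^{t+1} = c^t + a^t

# the batched TF code below gets c^t for all t at once by multiplying the
# attention matrix with a strictly upper-triangular matrix
c_all = a.T @ (np.ones((dec_steps, dec_steps)) - np.tri(dec_steps))  # [enc_len, dec_steps]
for t in range(dec_steps):
    assert np.allclose(c_all[:, t], a[:t].sum(axis=0))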


Seq2seq + pointer generator: a simple policy-gradient version

Adding beam-search decoding and ROUGE-style evaluation to the above process, the whole pipeline can be summarized by the figure below:


[Figure: the full pipeline with beam-search decoding and ROUGE evaluation]

Deep Reinforcement Learning For Sequence to Sequence Models proposes several schemes for expressing the above process as reinforcement learning; one of the simplest is the policy gradient.

From a Q-table perspective: treat the decoding steps as states, treat the token chosen at each decoding step from the aforementioned mixture distribution as the action, and define the ROUGE between a decoded sequence and the target sequence as the reward. This directly yields the following algorithm in the policy-gradient setting:

[Figure: the policy-gradient training algorithm]

For the characteristics of this algorithm, see the discussion at the beginning of this post (e.g., ROUGE being non-differentiable) and the original paper's motivation. The loss version given below corresponds to Eq. (18) of the original paper, i.e. (in the notation used here):

$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} r(\hat{y}^{(n)}) \sum_{t} \log p\big(\hat{y}_t^{(n)} \mid \hat{y}_{<t}^{(n)}, x^{(n)}\big)$$
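Since the reward is a ROUGE score, the sketch below gives a minimal sentence-level ROUGE-L (LCS-based F-measure) over token lists, just to pin down what the reward measures; the implementation further down delegates this to tensor2tensor's rouge_l_sentence_level.

def lcs_len(a, b):
    # classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(decoded, reference, beta=1.2):
    lcs = lcs_len(decoded, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(decoded), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

# reward of one sampled summary against its target
reward = rouge_l_f("the cat sat".split(), "the cat sat down".split())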



Below is an attempt at implementing both models.

Data: https://www.kaggle.com/sunnysai12345/news-summary

The dataset is rather small; this is just a rough check of how the models behave.

Data processing:

import pandas as pd
from collections import defaultdict, Counter
from nltk.tokenize import word_tokenize
import pickle

def load():
    # use text as summary and ctext as complete source text.
    summary_path = r"E:\Coding\python\pointerDIY\data\news_summary.csv"
    summary_df = pd.read_csv(summary_path, encoding="cp437", index_col=None)
    token_dict = defaultdict(list)
    word_cnt = Counter()

    req_df = summary_df[["text", "ctext"]]
    for idx, r in req_df.iterrows():
        text = r["text"]
        ctext = r["ctext"]
        try:
            text_pos = word_tokenize(text.lower())
            ctext_pos = word_tokenize(ctext.lower())
        except:
            print("nan ctext : call continue")
            continue

        token_dict["text"].append(text_pos)
        token_dict["ctext"].append(ctext_pos)
        word_cnt.update(text_pos)
        word_cnt.update(ctext_pos)

    token_df = pd.DataFrame.from_dict(token_dict)

    idx_token_dict = defaultdict(list)
    word2idx = dict((w, idx) for idx, w in enumerate(list(word_cnt.keys())))
    for idx, r in token_df.iterrows():
        text = r["text"]
        ctext = r["ctext"]
        idx_text = list(map(lambda w: word2idx[w], text))
        idx_ctext = list(map(lambda w: word2idx[w], ctext))
        idx_token_dict["text"].append(idx_text)
        idx_token_dict["ctext"].append(idx_ctext)

    idx_token_df = pd.DataFrame.from_dict(idx_token_dict)

    with open("new_data.pkl", "wb") as f:
        pickle.dump({
            "idx_token_df": idx_token_df,
            "word_cnt": word_cnt,
            "word2idx": word2idx
        }, f)
    print("data dump end")

if __name__ == "__main__":
    load()

Seq2seq + pointer generator, data export and model implementation:

(Corners are cut here in OOV selection and in the construction of the coverage mechanism, but it is roughly usable.)

To handle OOV properly one cannot use the full vocabulary, and after fixing the vocabulary, index checks are required.

The coverage mechanism should be recomputed at every decoding step (re-softmax, then concatenate); here the order is swapped.
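For reference, a faithful per-step version would look roughly like the numpy sketch below (illustrative only, not the code used here): coverage feeds into the attention logits of the next step, so attention must be re-softmaxed step by step instead of in one batched pass.

import numpy as np

def per_step_coverage_attention(h, s, Wh, Ws, wc, v, b_attn):
    # h: [enc_len, hid], s: [dec_steps, hid]; wc, v, b_attn: [v_dim]
    enc_len, dec_steps = h.shape[0], s.shape[0]
    coverage, cov_loss, attentions = np.zeros(enc_len), 0.0, []
    for t in range(dec_steps):
        # coverage enters the logits of the current step, as in the paper
        e_t = np.array([v @ np.tanh(Wh @ h_i + Ws @ s[t] + wc * c_i + b_attn)
                        for h_i, c_i in zip(h, coverage)])
        a_t = np.exp(e_t - e_t.max()); a_t /= a_t.sum()  # re-softmax every step
        cov_loss += np.minimum(a_t, coverage).sum()
        coverage = coverage + a_t
        attentions.append(a_t)
    return np.stack(attentions), cov_loss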

import tensorflow as tf
import pickle
import numpy as np
import pandas as pd
import pause
import os

with open(r"data_process\new_data.pkl", "rb") as f:
    obj_dict = pickle.load(f)

idx_token_df = obj_dict["idx_token_df"]
word2idx = obj_dict["word2idx"]
# add <START>, <STOP>, <PAD>; <UNK> is not added at this step because all tokens are known.
word2idx = dict([(k, v) for k, v in word2idx.items()] + [("<START>", len(word2idx)),
                                                         ("<STOP>", len(word2idx) + 1),
                                                         ("<PAD>", len(word2idx) + 2)])
max_text_len = max(map(len ,idx_token_df["text"]))
ctext_upper_bound = 1000
max_ctext_len = min(max(map(len, idx_token_df["ctext"])), ctext_upper_bound)
print("max_text_len: {}, max_ctext_len: {}".format(max_text_len, max_ctext_len))
# in actual use, <START> and <STOP> are added to the head and tail of text, so
max_text_len += 2

vocab_size = len(word2idx)
print("vocab_size : {}".format(vocab_size))

# random split dataframe into train and valid
train_ratio = 0.9
total_size = idx_token_df.shape[0]

# use random seed
np.random.seed(0)
msk = np.random.rand(total_size) < train_ratio
train_idx_token_df = idx_token_df[msk]
valid_idx_token_df = idx_token_df[~msk]
print("train valid split end.")

def data_generator(type = "train", max_encoder_len = max_ctext_len, max_decoder_len = max_text_len,
                   batch_size = 16, padding_idx = word2idx["<PAD>"]):
    print("init data_generator : {}".format(type))
    used_idx_token_df = None
    assert type in ["train", "valid"]
    if type == "train":
        used_idx_token_df = train_idx_token_df
    else:
        used_idx_token_df = valid_idx_token_df

    # init step
    start_idx = 0
    encoder_input = np.full(shape=[batch_size, max_encoder_len], fill_value=padding_idx, dtype=np.int32)
    encoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
    decoder_input = np.full(shape=[batch_size, max_decoder_len], fill_value=padding_idx, dtype=np.int32)
    decoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
    oov_indices_input = np.full(shape=[batch_size, max_encoder_len, max_decoder_len - 1],
                                fill_value= max_encoder_len,
                                dtype=np.int32)
    input_y = np.full(shape=[batch_size, max_decoder_len - 1], fill_value=padding_idx,
                      dtype=np.int32)

    print("begin iter rows ")
    for idx, r in used_idx_token_df.iterrows():
        summary = r["text"]
        summary = [word2idx["<START>"]] + summary + [word2idx["<STOP>"]]
        body = r["ctext"][:max_encoder_len]

        for s_idx, s_w in enumerate(summary):
            decoder_input[start_idx][s_idx] = s_w
            for b_idx, b_w in enumerate(body):
                if s_w == b_w:
                    oov_indices_input[start_idx][b_idx][s_idx] = b_idx
        decoder_mask[start_idx] = len(summary)

        for b_idx, b_w in enumerate(body):
            encoder_input[start_idx][b_idx] = b_w
        encoder_mask[start_idx] = len(body)

        for y_idx, y_w in enumerate(summary[1:]):
            if y_w in encoder_input[start_idx]:
                encoder_index = encoder_input[start_idx].tolist().index(y_w)
                input_y[start_idx][y_idx] = vocab_size + encoder_index
                continue
            input_y[start_idx][y_idx] = y_w

        start_idx += 1
        if start_idx == batch_size:
            yield (encoder_input, encoder_mask, decoder_input, decoder_mask,
                   oov_indices_input, input_y)

            start_idx = 0
            encoder_input = np.full(shape=[batch_size, max_encoder_len], fill_value=padding_idx, dtype=np.int32)
            encoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
            decoder_input = np.full(shape=[batch_size, max_decoder_len], fill_value=padding_idx, dtype=np.int32)
            decoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
            oov_indices_input = np.full(shape=[batch_size, max_encoder_len, max_decoder_len - 1],
                                        fill_value= max_encoder_len,
                                        dtype=np.int32)
            input_y = np.full(shape=[batch_size, max_decoder_len - 1], fill_value=padding_idx,
                              dtype=np.int32)

class PointerGenerator(object):
    def __init__(self, vocab_size = vocab_size, word_embedding_dim = 100,
                 max_encoder_len = max_ctext_len, max_decoder_len = max_text_len,
                 encoder_single_hidden_size = 50, decoder_hidden_size = 100,
                 v_dim = 100, V_dim = 1000, batch_size = 16,
                 ):

        self.encoder_single_hidden_size = encoder_single_hidden_size
        self.decoder_hidden_size = decoder_hidden_size
        self.max_encoder_len = max_encoder_len
        self.max_decoder_len = max_decoder_len
        self.v_dim = v_dim
        self.V_dim = V_dim
        self.vocab_size = vocab_size
        self.word_embedding_dim = word_embedding_dim
        self.batch_size = batch_size

        self.Word_Embed = tf.Variable(
            tf.random_normal(shape=[vocab_size, word_embedding_dim]), name="Word_Embed"
        )

        self.encoder_input = tf.placeholder(dtype=tf.int32, shape=[None, max_encoder_len],
                                            name="encoder_input")
        self.encoder_mask = tf.placeholder(dtype=tf.int32, shape=[None],
                                           name="encoder_mask")

        # the full decoder input for one sample, e.g.: [<START>, token0, token1, ... <STOP>, <PAD>, <PAD>...]
        self.decoder_input = tf.placeholder(dtype=tf.int32, shape=[None, max_decoder_len],
                                            name="decoder_input")
        self.decoder_mask = tf.placeholder(dtype=tf.int32, shape=[None],
                                           name="decoder_mask")

        # to use OOV input, a placeholder must be set for it;
        # the attention matrix can be indexed through it.
        # note: max_encoder_len is used to mean "no OOV at this position"
        # (the position is not treated as an OOV for the current decoder step);
        # when a position is an OOV, the corresponding index marks its location.
        self.oov_indices_input = tf.placeholder(dtype=tf.int32,
                                                shape=[None, max_encoder_len, max_decoder_len - 1],
                                                name="oov_indices_input")

        self.encoder_lookup = tf.nn.embedding_lookup(self.Word_Embed, self.encoder_input,
                                                     name="encoder_lookup")
        self.decoder_lookup = tf.nn.embedding_lookup(self.Word_Embed, self.decoder_input,
                                                     name = "decoder_lookup")

        # every element is generated from tf.range(vocab_size + max_encoder_len)
        self.input_y = tf.placeholder(tf.int32, [None, max_decoder_len - 1],
                                      name="input_y")

        # keep prob for rnn cell dropout
        self.keep_prob = tf.placeholder(tf.float32, [], name="keep_prob")
        # l2 regular param
        self.l2_param = tf.placeholder(tf.float32, [], name="l2_param")
        # conv lambda param
        self.lambda_val = tf.placeholder(tf.float32, [], name="lambda_val")

        self.Wh = tf.Variable(tf.random_normal(
            shape=[v_dim, self.encoder_single_hidden_size * 2]
        ), name="Wh")
        self.Ws = tf.Variable(tf.random_normal(
            shape=[v_dim, self.decoder_hidden_size]
        ), name="Ws")
        self.battn = tf.Variable(tf.constant([0.1] * v_dim), name="battn")
        self.v = tf.Variable(tf.random_normal(shape=[v_dim, 1]), name="v")

        self.V = tf.Variable(
            tf.random_normal(shape=[self.encoder_single_hidden_size * 2 + self.decoder_hidden_size, self.V_dim],
                             ),name="V"
        )
        self.b = tf.Variable(tf.constant([1.0] * self.V_dim), name="b")

        self.V_ = tf.Variable(tf.random_normal(
            shape=[self.V_dim, vocab_size]
        ), name="V_")
        self.b_ = tf.Variable(tf.constant([1.0] * vocab_size), name="b_")

        self.wh = tf.Variable(tf.random_normal(shape=[self.encoder_single_hidden_size * 2, 1]),
                              name = "wh")
        self.ws = tf.Variable(tf.random_normal(shape=[self.decoder_hidden_size, 1]),
                              name="ws")
        self.wx = tf.Variable(tf.random_normal(shape=[word_embedding_dim, 1]),
                              name="wx")
        self.bptr = tf.Variable(tf.constant([0.1]), name="bptr")

        # conv_ weight
        self.wc = tf.Variable(tf.constant(np.full(shape=[1, self.v_dim], fill_value=0.1, dtype=np.float32), dtype=tf.float32),
                              name="wc")

        self.encoder_decoder_lstm_layer()
        # construct opt in the final step
        self.opt_construct()

    def context_vector_layer(self, encoder_outputs, decoder_outputs):
        assert int(encoder_outputs.get_shape()[-1]) == int(decoder_outputs.get_shape()[-1])
        encoder_len = self.max_encoder_len
        total_input = tf.concat([encoder_outputs, decoder_outputs], axis = 1, name="total_input")

        def generate_attention(fuse_input):
            # [max_encoder_len, encoder_hidden_size * 2]
            encoder_part = tf.slice(fuse_input, [0, 0], [encoder_len, -1], name="encoder_part")
            # [max_decoder_len - 1, decoder_hidden_size] && encoder_hidden_size * 2 == decoder_hidden_size
            decoder_part = tf.slice(fuse_input, [encoder_len, 0], [-1, -1], name="decoder_part")

            encoder_W_part = tf.matmul(encoder_part, tf.transpose(self.Wh, [1, 0]), name="encoder_W_part")
            decoder_W_part = \
                tf.nn.xw_plus_b(decoder_part,  tf.transpose(self.Ws, [1, 0]), self.battn, name="decoder_W_part")

            # [max_encoder_len, v_dim * (self.max_decoder_len - 1)]
            encoder_tiled = tf.tile(encoder_W_part, [1, self.max_decoder_len - 1])
            decoder_list = tf.unstack(decoder_W_part, axis=0)
            # [1, v_dim * (self.max_decoder_len - 1)]
            decoder_before_tiled = tf.expand_dims(tf.concat(decoder_list, axis=0), axis=0)
            # [max_encoder_len, v_dim * (self.max_decoder_len - 1)]
            decoder_tiled = tf.tile(decoder_before_tiled, [self.max_encoder_len, 1])

            # [ self.v_dim, self.max_encoder_len,  self.max_decoder_len - 1,]
            tiled = tf.transpose(tf.reshape(tf.nn.tanh(encoder_tiled + decoder_tiled), [self.max_encoder_len, self.v_dim, self.max_decoder_len - 1]), [1, 0, 2])
            # [self.max_encoder_len,  self.max_decoder_len - 1]
            e = tf.reshape(tf.squeeze(tf.matmul(tf.transpose(self.v, [1, 0]) ,tf.reshape(tiled, [self.v_dim, -1]))), [self.max_encoder_len, self.max_decoder_len - 1],
                           name="e")
            # softmax along the column (over encoder positions)
            softmax_e = tf.nn.softmax(e, name="softmax_e", axis=0)

            # add coverage mechanism by multiplying with a strictly upper-triangular matrix
            m_dim = self.max_decoder_len - 1
            up_tri_matrix = tf.convert_to_tensor(np.ones([m_dim, m_dim]) - np.tri(m_dim, m_dim),
                                                 name="up_tri_matrix", dtype=tf.float32)

            # [self.max_encoder_len,  self.max_decoder_len - 1]
            conv_mech_part = tf.matmul(softmax_e, up_tri_matrix, name="conv_mech_part")
            # use reshape to perform Kronecker product
            # [self.max_encoder_len,  self.max_decoder_len - 1, v_dim]
            conv_mech_part_v = tf.reshape(tf.matmul(tf.reshape(conv_mech_part, [-1, 1]), self.wc), [self.max_encoder_len,  self.max_decoder_len - 1, self.v_dim],
                                          name="conv_mech_part_v")
            # [max_encoder_len, v_dim * (self.max_decoder_len - 1)]
            conv_mech_part_v_final = tf.reshape(tf.transpose(conv_mech_part_v, [0, 2, 1]), [self.max_encoder_len, -1],
                                                name="conv_mech_part_v_final")
            # [ self.v_dim, self.max_encoder_len,  self.max_decoder_len - 1,]
            conv_tiled = tf.transpose(tf.reshape(tf.nn.tanh(encoder_tiled + decoder_tiled + conv_mech_part_v_final),
                                                 [self.max_encoder_len, self.v_dim, self.max_decoder_len - 1]),
                                      [1, 0, 2])
            conv_e = tf.reshape(tf.squeeze(tf.matmul(tf.transpose(self.v, [1, 0]) ,tf.reshape(conv_tiled, [self.v_dim, -1]))), [self.max_encoder_len, self.max_decoder_len - 1],
                                name="conv_e")
            # softmax along the column (over encoder positions)
            conv_softmax_e = tf.nn.softmax(conv_e, name="conv_softmax_e", axis=0)

            # every decoder step has this context_vector
            # [self.max_decoder_len - 1, encoder_hidden_size * 2]
            #context_vector = tf.matmul(tf.transpose(softmax_e, [1, 0]), encoder_part)
            context_vector = tf.matmul(tf.transpose(conv_softmax_e, [1, 0]), encoder_part)

            # for a convenient return value, fuse here and split afterwards
            # [self.max_decoder_len - 1, encoder_hidden_size * 2 + self.max_encoder_len + self.max_encoder_len]
            fuse_output = tf.concat([context_vector, tf.transpose(conv_softmax_e, [1, 0]),
                                     tf.transpose(conv_mech_part, [1, 0])],
                                    axis=-1, name="fuse_output")
            return fuse_output

        # [batch, self.max_decoder_len - 1, encoder_hidden_size * 2 + self.max_encoder_len + self.max_encoder_len]
        batch_fuse_output = tf.map_fn(generate_attention, total_input, name="batch_fuse_output")
        # [batch, self.max_decoder_len - 1, encoder_hidden_size * 2]
        batch_context_vector = tf.slice(batch_fuse_output, [0, 0, 0], [-1, -1, self.encoder_single_hidden_size * 2], name="batch_context_vector")
        # [batch, self.max_decoder_len - 1, self.max_encoder_len]
        batch_softmax_e = tf.slice(batch_fuse_output, [0, 0, self.encoder_single_hidden_size * 2],
                                   [-1, -1, self.max_encoder_len], name="batch_softmax_e")
        # [batch, self.max_decoder_len - 1, self.max_encoder_len]
        batch_conv_mech_part = tf.slice(batch_fuse_output, [0, 0, self.encoder_single_hidden_size * 2 + self.max_encoder_len],
                                        [-1, -1, -1], name="batch_conv_mech_part")
        self.batch_softmax_e = batch_softmax_e
        self.batch_conv_mech_part = batch_conv_mech_part

        return batch_context_vector

    # this layer completes a commonly used attentional seq2seq model.
    def encoder_decoder_lstm_layer(self):
        fw_encoder_cell = tf.contrib.rnn.BasicLSTMCell(num_units = self.encoder_single_hidden_size
                                                       , name = "fw_encoder_cell")
        bw_encoder_cell = tf.contrib.rnn.BasicLSTMCell(num_units = self.encoder_single_hidden_size
                                                       , name = "bw_encoder_cell")

        fw_encoder_cell = tf.nn.rnn_cell.DropoutWrapper(fw_encoder_cell, input_keep_prob=self.keep_prob,
                                                        output_keep_prob=self.keep_prob)
        bw_encoder_cell = tf.nn.rnn_cell.DropoutWrapper(bw_encoder_cell, input_keep_prob=self.keep_prob,
                                                        output_keep_prob=self.keep_prob)

        encoder_outputs, encoder_output_states = tf.nn.bidirectional_dynamic_rnn(cell_fw=fw_encoder_cell,
                                                                                 cell_bw=bw_encoder_cell,
                                                                                 inputs=self.encoder_lookup,
                                                                                 dtype=tf.float32,
                                                                                 sequence_length=self.encoder_mask,
                                                                                 )
        # [batch, max_encoder_len, encoder_single_hidden_size * 2]
        encoder_outputs = tf.concat([encoder_outputs[0], encoder_outputs[1]], axis=-1, name="encoder_outputs")
        # [batch, encoder_single_hidden_size * 2]

        encoder_output_states_h = tf.concat([encoder_output_states[0].h, encoder_output_states[1].h],
                                            axis=-1, name="encoder_output_states_h")
        encoder_output_states_c = tf.concat([encoder_output_states[0].c, encoder_output_states[1].c],
                                            axis=-1, name="encoder_output_states_c")

        # need LSTMStateTuple object for decoder
        encoder_output_states = tf.contrib.rnn.LSTMStateTuple(h =encoder_output_states_h , c = encoder_output_states_c,
                                                              )

        decoder_cell = tf.contrib.rnn.BasicLSTMCell(num_units = self.decoder_hidden_size
                                                    , name = "decoder_cell")
        # cell for beam search decoder
        self.decoder_cell = decoder_cell

        decoder_cell = tf.nn.rnn_cell.DropoutWrapper(decoder_cell, input_keep_prob=self.keep_prob,
                                                     output_keep_prob=self.keep_prob)

        decoder_inputs_slice = tf.slice(self.decoder_lookup, begin=[0, 0, 0], size=[-1, self.max_decoder_len - 1, -1],
                                        name="decoder_inputs_slice")

        # decoder_outputs: [batch, max_decoder_len - 1, decoder_hidden_size]
        decoder_outputs, decoder_output_states = tf.nn.dynamic_rnn(cell=decoder_cell, inputs=decoder_inputs_slice,
                                                                   initial_state=encoder_output_states,
                                                                   sequence_length=self.decoder_mask - 1)
        self.decoder_inputs_slice = decoder_inputs_slice
        self.decoder_outputs = decoder_outputs

        # [batch, self.max_decoder_len - 1, encoder_hidden_size * 2]
        batch_context_vector = self.context_vector_layer(encoder_outputs, decoder_outputs)
        self.batch_context_vector = batch_context_vector

        # [batch, max_decoder_len - 1, decoder_hidden_size + encoder_hidden_size * 2]
        fused_decoder_context = tf.concat([decoder_outputs, batch_context_vector], axis=-1,
                                          name="fused_decoder_context")
        fused_decoder_context_reshape = tf.reshape(fused_decoder_context,
                                                   shape=[-1, self.encoder_single_hidden_size * 2 + self.decoder_hidden_size],
                                                   name="fused_decoder_context_reshape")
        # [batch, max_decoder_len - 1, vocab_size]
        word_distribute_before_softmax = tf.reshape(tf.nn.xw_plus_b(tf.nn.xw_plus_b(fused_decoder_context_reshape, self.V, self.b), self.V_, self.b_),
                                                    [-1, self.max_decoder_len -1, self.vocab_size], name="word_distribute_before_softmax")
        self.word_distribute_before_softmax = word_distribute_before_softmax

    def generate_oov_attension(self):
        # [batch, self.max_encoder_len, self.max_decoder_len - 1, ]
        batch_softmax_e_t = tf.slice(tf.transpose(self.batch_softmax_e, [0, 2, 1]), [0, 0, 0],
                                     [self.batch_size, -1, -1])
        # [batch, self.max_encoder_len + 1, self.max_decoder_len - 1]
        batch_softmax_e_t_app = tf.concat([batch_softmax_e_t, tf.zeros(
            [self.batch_size, 1, int(batch_softmax_e_t.get_shape()[2])], dtype=tf.float32)], axis=1)
        #[max_decoder_len - 1, batch ,max_encoder_len, ]
        indice_input = tf.cast(tf.transpose(self.oov_indices_input, [2, 0, 1]), tf.float32)
        batch_size = self.batch_size
        # [max_decoder_len - 1, batch ,max_encoder_len + (max_encoder_len + 1), ]
        fuse_input = tf.concat([indice_input, tf.transpose(batch_softmax_e_t_app, [2, 0, 1])], axis=-1)

        # fuse_input is produced by concatenating indice_input and batch_softmax_e_t_app
        # within the same decoder step, for convenient indexing
        # fuse_input = tf.concat([indice_input, se_input], axis = -1)
        def indice_single_decoder_step(fuse_input):
            indice_input = tf.cast(tf.slice(fuse_input, [0, 0], [-1, self.max_encoder_len]), tf.int32)
            se_input = tf.slice(fuse_input, [0 ,self.max_encoder_len], [-1, -1])

            indices_output_list = []
            tail_dim = int(indice_input.get_shape()[-1])
            for col_idx in range(tail_dim):
                indices = tf.concat([tf.expand_dims(tf.range(batch_size), axis=-1), tf.slice(indice_input, [0, col_idx], [-1, 1])], axis=-1)
                indices_output_list.append(tf.expand_dims(indices, 1))

            indices_final = tf.concat(indices_output_list, axis=1)
            gather_ext = tf.gather_nd(se_input, indices_final)

            # [batch, max_encoder_len]
            return gather_ext

        # [max_decoder_len - 1 ,batch, max_encoder_len]
        batch_gathered = tf.map_fn(indice_single_decoder_step, fuse_input, dtype=tf.float32)
        # [batch, max_encoder_len, max_decoder_len - 1]
        return tf.transpose(batch_gathered, [1, 2, 0])

    def pointer_generator_layer(self):
        ht_part = tf.reshape(self.batch_context_vector, [-1, 2 * self.encoder_single_hidden_size],
                             name="ht_part")
        st_part = tf.reshape(self.decoder_outputs, [-1, self.decoder_hidden_size],
                             name="st_part")
        xt_part = tf.reshape(self.decoder_inputs_slice, [-1, self.word_embedding_dim],
                             name = "xt_part")

        # [batch, max_decoder_len - 1]
        pgen = tf.reshape(tf.nn.sigmoid(tf.matmul(ht_part, self.wh) + tf.matmul(st_part, self.ws) + tf.matmul(xt_part, self.wx) + self.bptr),
                          [-1, self.max_decoder_len - 1])
        total_times = self.vocab_size
        pgen_expand = tf.expand_dims(pgen, axis=-1, name="pgen_expand")

        # pgen_final has the same shape as word_distribute_before_softmax,
        # so it can be applied by a simple element-wise multiply
        # [batch, max_decoder_len - 1, vocab]
        # used for concat encoder_attention step in the last dim : i.e. vocab
        # expand vocab to vocab + max_encoder_dim
        pgen_final = tf.tile(pgen_expand, [1, 1, total_times], name="pgen_final")
        # [batch, max_decoder_len - 1, vocab]
        seq2seq_part = self.word_distribute_before_softmax * pgen_final

        # [batch, max_decoder_len - 1, max_encoder_len]
        one_subtract_pgen = tf.tile(tf.expand_dims(tf.subtract(1.0, pgen), -1), [1, 1, self.max_encoder_len])
        # [batch, max_decoder_len - 1, max_encoder_len]
        oov_attension = self.generate_oov_attension()
        pointer_generator_part = one_subtract_pgen * tf.transpose(self.generate_oov_attension(), [0, 2, 1])

        # append max_encoder_len pointer generator to vocab to have final dist
        # [batch, max_decoder_len - 1, vocab + max_encoder_len]
        final_dist = tf.concat([seq2seq_part, pointer_generator_part], axis=-1, name="final_dist")

        return final_dist

    def opt_construct(self):
        final_dist = self.pointer_generator_layer()
        self.prediction = tf.nn.softmax(final_dist, axis=-1, name = "prediction")

        logits = final_dist
        targets = self.input_y
        weights = tf.sequence_mask(self.decoder_mask - 1, maxlen=self.max_decoder_len - 1, dtype=tf.float32)
        self.loss = tf.reduce_mean(tf.contrib.seq2seq.sequence_loss(logits = logits, targets = targets,
                                                                    weights = weights))
        # add conv loss part to it
        self.conv_loss = tf.reduce_mean(tf.reduce_min(tf.concat([tf.expand_dims(self.batch_softmax_e, -1), tf.expand_dims(self.batch_conv_mech_part, -1)],
                                                                axis=-1), axis=-1), name="conv_loss")
        self.loss += self.lambda_val * self.conv_loss

        self.l2_loss = None
        for train_able_var in tf.trainable_variables():
            if self.l2_loss is None:
                self.l2_loss = tf.nn.l2_loss(train_able_var)
            else:
                self.l2_loss += tf.nn.l2_loss(train_able_var)
        self.loss = self.loss + self.l2_param * self.l2_loss
        self.train_op = tf.train.AdamOptimizer(0.001).minimize(self.loss)

    # returns top_k nested lists, scored by -log prob as the beam-search likelihood
    @staticmethod
    # since there is no external OOV handling, decoding is only examined on the valid-data inputs.
    def beam_search_decoder_step(data, top_k = 3):
        from math import log
        # this decode step can be implemented as a numpy version, following:
        # https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/

        def beam_search_decoder(data, k = top_k):
            sequences = [[list(), 1.0]]
            for row in data:
                all_candidates = list()
                for i in range(len(sequences)):
                    seq, score = sequences[i]
                    for j in range(len(row)):
                        # j index row[j] the prob
                        candidate = [seq + [j], score * -log(row[j])]
                        all_candidates.append(candidate)
                ordered = sorted(all_candidates, key = lambda tup: tup[1])
                sequences = ordered[:k]
            return sequences
        return beam_search_decoder(data)

    @staticmethod
    def predict(epsilon = 1e-10):
        idx2word = dict((i, w) for w, i in word2idx.items())

        def process_single_prediction_list_to_word_list(single_encoder_input ,input_list):
            # single_encoder_input [max_encoder_len]
            req_word_list = []
            for word_idx in input_list:
                if idx2word.get(word_idx):
                    word = idx2word[word_idx]
                else:
                    assert word_idx >= vocab_size
                    word_idx = single_encoder_input[word_idx - vocab_size]
                    word = idx2word[word_idx]
                req_word_list.append(word)
            return req_word_list

        valid_gen = data_generator(type = "valid", batch_size=1)
        pointerGenerator_ext = PointerGenerator(batch_size=1)
        print("model construct end")

        saver = tf.train.Saver()
        with tf.Session() as sess:
            if os.path.exists(r"E:\Coding\python\pointerDIY\model.ckpt.index"):
                saver.restore(sess, save_path=r"E:\Coding\python\pointerDIY\model.ckpt")
            else:
                print("model not exists return")
                return

            while True:
                try:
                    encoder_input, encoder_mask, decoder_input, decoder_mask, \
                    oov_indices_input, input_y = valid_gen.__next__()
                except:
                    print("valid epoch end, will return")
                    return

                # the prediction have shape [1 ,max_decoder_len - 1, vocab + max_encoder_len]
                prediction = sess.run(pointerGenerator_ext.prediction,
                                feed_dict={
                                    pointerGenerator_ext.encoder_input: encoder_input,
                                    pointerGenerator_ext.encoder_mask: encoder_mask,
                                    pointerGenerator_ext.decoder_input: decoder_input,
                                    pointerGenerator_ext.decoder_mask: decoder_mask,
                                    pointerGenerator_ext.oov_indices_input: oov_indices_input,
                                    #pointerGenerator_ext.input_y: input_y,

                                    pointerGenerator_ext.keep_prob: 1.0,
                                    pointerGenerator_ext.l2_param: 0.0,
                                    pointerGenerator_ext.lambda_val: 0.0
                                })
                prediction += epsilon
                # [max_decoder_len - 1, vocab + max_encoder_len]
                prediction_array = np.squeeze(prediction)
                # [top_k ,max_decoder_len - 1]
                beam_search_array = PointerGenerator.beam_search_decoder_step(prediction_array, top_k=3)

                single_encoder_input = np.squeeze(encoder_input)
                single_encoder_input_words = process_single_prediction_list_to_word_list(single_encoder_input=single_encoder_input,
                                                                                         input_list=single_encoder_input)

                # visualization procedure
                print("single_encoder_input_words :")
                print(" ".join(single_encoder_input_words))
                for beam_search_ele in beam_search_array:
                    beam_search_seq = process_single_prediction_list_to_word_list(single_encoder_input ,beam_search_ele[0])
                    beam_search_log_prob = beam_search_ele[-1]
                    print("seq :")
                    print(" ".join(beam_search_seq))
                    print("log_prob :")
                    print(beam_search_log_prob)
                print("*" * 100)

    @staticmethod
    def train():
        train_gen = data_generator(type = "train")
        valid_gen = data_generator(type = "valid")

        pointerGenerator_ext = PointerGenerator()
        print("model construct end")
        pause.seconds(1)

        total_step = 0
        epoch = 0

        config = tf.ConfigProto()
        #config.gpu_options.allow_growth=True
        #config.gpu_options.per_process_gpu_memory_fraction=0.4
        saver = tf.train.Saver()
        with tf.Session(config=config) as sess:
            if os.path.exists(r"E:\Coding\python\pointerDIY\model.ckpt.index"):
                print("restore exists")
                saver.restore(sess, save_path=r"E:\Coding\python\pointerDIY\model.ckpt")
            else:
                print("init global")
                sess.run(tf.global_variables_initializer())

            #sess.run(tf.global_variables_initializer())
            print("model init end")
            pause.seconds(1)

            while True:
                try:
                    encoder_input, encoder_mask, decoder_input, decoder_mask,\
                    oov_indices_input, input_y = train_gen.__next__()
                except:
                    # train for at least 30 epochs, as the paper describes
                    print("train epoch {} end".format(epoch))
                    epoch += 1
                    if epoch == 50:
                        print("will return in 50 epoch")
                        return

                    train_gen = data_generator(type = "train")
                    continue

                _, loss = sess.run([pointerGenerator_ext.train_op, pointerGenerator_ext.loss],
                         feed_dict={
                             pointerGenerator_ext.encoder_input: encoder_input,
                             pointerGenerator_ext.encoder_mask: encoder_mask,
                             pointerGenerator_ext.decoder_input: decoder_input,
                             pointerGenerator_ext.decoder_mask: decoder_mask,
                             pointerGenerator_ext.oov_indices_input: oov_indices_input,
                             pointerGenerator_ext.input_y: input_y,

                             pointerGenerator_ext.keep_prob: 0.7,
                             pointerGenerator_ext.l2_param: 0.00001,
                             pointerGenerator_ext.lambda_val: 0.001
                         })
                print("train loss : {}".format(loss))

                total_step += 1
                if total_step % 5 == 0:
                    #print("train loss : {}".format(loss))
                    try:
                        encoder_input, encoder_mask, decoder_input, decoder_mask, \
                        oov_indices_input, input_y = valid_gen.__next__()
                    except:
                        print("valid epoch end, re init")
                        valid_gen = data_generator(type = "valid")
                        encoder_input, encoder_mask, decoder_input, decoder_mask, \
                        oov_indices_input, input_y = valid_gen.__next__()

                    loss = sess.run(pointerGenerator_ext.loss,
                                    feed_dict={
                                        pointerGenerator_ext.encoder_input: encoder_input,
                                        pointerGenerator_ext.encoder_mask: encoder_mask,
                                        pointerGenerator_ext.decoder_input: decoder_input,
                                        pointerGenerator_ext.decoder_mask: decoder_mask,
                                        pointerGenerator_ext.oov_indices_input: oov_indices_input,
                                        pointerGenerator_ext.input_y: input_y,

                                        pointerGenerator_ext.keep_prob: 1.0,
                                        pointerGenerator_ext.l2_param: 0.0,
                                        pointerGenerator_ext.lambda_val: 0.0
                                    })
                    print("valid loss : {}".format(loss))
                    saver.save(sess, save_path=r"E:\Coding\python\pointerDIY\model.ckpt")

if __name__ == "__main__":
    PointerGenerator.train()
    #PointerGenerator.predict()


With a simple modification of the loss, the code above turns into the policy-gradient model:

Agent code:

import tensorflow as tf
import numpy as np

class Seq2SeqPolicyGradient(object):
    def __init__(self, vocab_size, word_embedding_dim,
                 max_encoder_len, max_decoder_len,
                 encoder_single_hidden_size, decoder_hidden_size,
                 v_dim , V_dim, batch_size,
                 ):

        # params for seq2seq_pointer_generator model
        self.encoder_single_hidden_size = encoder_single_hidden_size
        self.decoder_hidden_size = decoder_hidden_size
        self.max_encoder_len = max_encoder_len
        self.max_decoder_len = max_decoder_len
        self.v_dim = v_dim
        self.V_dim = V_dim
        self.vocab_size = vocab_size
        self.word_embedding_dim = word_embedding_dim
        self.batch_size = batch_size

        self.Word_Embed = tf.Variable(
            tf.random_normal(shape=[vocab_size, word_embedding_dim]), name="Word_Embed"
        )

        self.encoder_input = tf.placeholder(dtype=tf.int32, shape=[None, max_encoder_len],
                                            name="encoder_input")
        self.encoder_mask = tf.placeholder(dtype=tf.int32, shape=[None],
                                           name="encoder_mask")

        # the full decoder input for one sample, e.g.: [<START>, token0, token1, ... <STOP>, <PAD>, <PAD>...]
        self.decoder_input = tf.placeholder(dtype=tf.int32, shape=[None, max_decoder_len],
                                            name="decoder_input")
        self.decoder_mask = tf.placeholder(dtype=tf.int32, shape=[None],
                                           name="decoder_mask")

        # to use OOV input, a placeholder must be set for it;
        # the attention matrix can be indexed through it.
        # note: max_encoder_len is used to mean "no OOV at this position"
        # (the position is not treated as an OOV for the current decoder step);
        # when a position is an OOV, the corresponding index marks its location.
        self.oov_indices_input = tf.placeholder(dtype=tf.int32,
                                                shape=[None, max_encoder_len, max_decoder_len - 1],
                                                name="oov_indices_input")

        self.encoder_lookup = tf.nn.embedding_lookup(self.Word_Embed, self.encoder_input,
                                                     name="encoder_lookup")
        self.decoder_lookup = tf.nn.embedding_lookup(self.Word_Embed, self.decoder_input,
                                                     name = "decoder_lookup")

        self.reward = tf.placeholder(tf.float32, [None], name="reward")

        # keep prob for rnn cell dropout
        self.keep_prob = tf.placeholder(tf.float32, [], name="keep_prob")
        # l2 regular param
        self.l2_param = tf.placeholder(tf.float32, [], name="l2_param")
        # conv lambda param
        self.lambda_val = tf.placeholder(tf.float32, [], name="lambda_val")

        self.Wh = tf.Variable(tf.random_normal(
            shape=[v_dim, self.encoder_single_hidden_size * 2]
        ), name="Wh")
        self.Ws = tf.Variable(tf.random_normal(
            shape=[v_dim, self.decoder_hidden_size]
        ), name="Ws")
        self.battn = tf.Variable(tf.constant([0.1] * v_dim), name="battn")
        self.v = tf.Variable(tf.random_normal(shape=[v_dim, 1]), name="v")

        self.V = tf.Variable(
            tf.random_normal(shape=[self.encoder_single_hidden_size * 2 + self.decoder_hidden_size, self.V_dim],
                             ),name="V"
        )
        self.b = tf.Variable(tf.constant([1.0] * self.V_dim), name="b")

        self.V_ = tf.Variable(tf.random_normal(
            shape=[self.V_dim, vocab_size]
        ), name="V_")
        self.b_ = tf.Variable(tf.constant([1.0] * vocab_size), name="b_")

        self.wh = tf.Variable(tf.random_normal(shape=[self.encoder_single_hidden_size * 2, 1]),
                              name = "wh")
        self.ws = tf.Variable(tf.random_normal(shape=[self.decoder_hidden_size, 1]),
                              name="ws")
        self.wx = tf.Variable(tf.random_normal(shape=[word_embedding_dim, 1]),
                              name="wx")
        self.bptr = tf.Variable(tf.constant([0.1]), name="bptr")

        # conv_ weight
        self.wc = tf.Variable(tf.constant(np.full(shape=[1, self.v_dim], fill_value=0.1, dtype=np.float32), dtype=tf.float32),
                              name="wc")

        self.encoder_decoder_lstm_layer()
        # construct opt in the final step
        self.opt_construct()

    def context_vector_layer(self, encoder_outputs, decoder_outputs):
        assert int(encoder_outputs.get_shape()[-1]) == int(decoder_outputs.get_shape()[-1])
        encoder_len = self.max_encoder_len
        total_input = tf.concat([encoder_outputs, decoder_outputs], axis = 1, name="total_input")

        def generate_attention(fuse_input):
            # [max_encoder_len, encoder_hidden_size * 2]
            encoder_part = tf.slice(fuse_input, [0, 0], [encoder_len, -1], name="encoder_part")
            # [max_decoder_len - 1, decoder_hidden_size] && encoder_hidden_size * 2 == decoder_hidden_size
            decoder_part = tf.slice(fuse_input, [encoder_len, 0], [-1, -1], name="decoder_part")

            encoder_W_part = tf.matmul(encoder_part, tf.transpose(self.Wh, [1, 0]), name="encoder_W_part")
            decoder_W_part = \
                tf.nn.xw_plus_b(decoder_part,  tf.transpose(self.Ws, [1, 0]), self.battn, name="decoder_W_part")

            # [max_encoder_len, v_dim * (self.max_decoder_len - 1)]
            encoder_tiled = tf.tile(encoder_W_part, [1, self.max_decoder_len - 1])
            decoder_list = tf.unstack(decoder_W_part, axis=0)
            # [1, v_dim * (self.max_decoder_len - 1)]
            decoder_before_tiled = tf.expand_dims(tf.concat(decoder_list, axis=0), axis=0)
            # [max_encoder_len, v_dim * (self.max_decoder_len - 1)]
            decoder_tiled = tf.tile(decoder_before_tiled, [self.max_encoder_len, 1])

            # [ self.v_dim, self.max_encoder_len,  self.max_decoder_len - 1,]
            tiled = tf.transpose(tf.reshape(tf.nn.tanh(encoder_tiled + decoder_tiled), [self.max_encoder_len, self.v_dim, self.max_decoder_len - 1]), [1, 0, 2])
            # [self.max_encoder_len,  self.max_decoder_len - 1]
            e = tf.reshape(tf.squeeze(tf.matmul(tf.transpose(self.v, [1, 0]) ,tf.reshape(tiled, [self.v_dim, -1]))), [self.max_encoder_len, self.max_decoder_len - 1],
                           name="e")
            # softmax along the column (over encoder positions)
            softmax_e = tf.nn.softmax(e, name="softmax_e", axis=0)

            # add coverage mechanism by multiplying with a strictly upper-triangular matrix
            m_dim = self.max_decoder_len - 1
            up_tri_matrix = tf.convert_to_tensor(np.ones([m_dim, m_dim]) - np.tri(m_dim, m_dim),
                                                 name="up_tri_matrix", dtype=tf.float32)

            # [self.max_encoder_len,  self.max_decoder_len - 1]
            conv_mech_part = tf.matmul(softmax_e, up_tri_matrix, name="conv_mech_part")
            # use reshape to perform Kronecker product
            # [self.max_encoder_len,  self.max_decoder_len - 1, v_dim]
            conv_mech_part_v = tf.reshape(tf.matmul(tf.reshape(conv_mech_part, [-1, 1]), self.wc), [self.max_encoder_len,  self.max_decoder_len - 1, self.v_dim],
                                          name="conv_mech_part_v")
            # [max_encoder_len, v_dim * (self.max_decoder_len - 1)]
            conv_mech_part_v_final = tf.reshape(tf.transpose(conv_mech_part_v, [0, 2, 1]), [self.max_encoder_len, -1],
                                                name="conv_mech_part_v_final")
            # [ self.v_dim, self.max_encoder_len,  self.max_decoder_len - 1,]
            conv_tiled = tf.transpose(tf.reshape(tf.nn.tanh(encoder_tiled + decoder_tiled + conv_mech_part_v_final),
                                                 [self.max_encoder_len, self.v_dim, self.max_decoder_len - 1]),
                                      [1, 0, 2])
            conv_e = tf.reshape(tf.squeeze(tf.matmul(tf.transpose(self.v, [1, 0]) ,tf.reshape(conv_tiled, [self.v_dim, -1]))), [self.max_encoder_len, self.max_decoder_len - 1],
                                name="conv_e")
            # softmax along the column (over encoder positions)
            conv_softmax_e = tf.nn.softmax(conv_e, name="conv_softmax_e", axis=0)

            # every decoder step has this context_vector
            # [self.max_decoder_len - 1, encoder_hidden_size * 2]
            #context_vector = tf.matmul(tf.transpose(softmax_e, [1, 0]), encoder_part)
            context_vector = tf.matmul(tf.transpose(conv_softmax_e, [1, 0]), encoder_part)

            # for a convenient return value, fuse here and split afterwards
            # [self.max_decoder_len - 1, encoder_hidden_size * 2 + self.max_encoder_len + self.max_encoder_len]
            fuse_output = tf.concat([context_vector, tf.transpose(conv_softmax_e, [1, 0]),
                                     tf.transpose(conv_mech_part, [1, 0])],
                                    axis=-1, name="fuse_output")
            return fuse_output

        # [batch, self.max_decoder_len - 1, encoder_hidden_size * 2 + self.max_encoder_len + self.max_encoder_len]
        batch_fuse_output = tf.map_fn(generate_attention, total_input, name="batch_fuse_output")
        # [batch, self.max_decoder_len - 1, encoder_hidden_size * 2]
        batch_context_vector = tf.slice(batch_fuse_output, [0, 0, 0], [-1, -1, self.encoder_single_hidden_size * 2], name="batch_context_vector")
        # [batch, self.max_decoder_len - 1, self.max_encoder_len]
        batch_softmax_e = tf.slice(batch_fuse_output, [0, 0, self.encoder_single_hidden_size * 2],
                                   [-1, -1, self.max_encoder_len], name="batch_softmax_e")
        # [batch, self.max_decoder_len - 1, self.max_encoder_len]
        batch_conv_mech_part = tf.slice(batch_fuse_output, [0, 0, self.encoder_single_hidden_size * 2 + self.max_encoder_len],
                                        [-1, -1, -1], name="batch_conv_mech_part")
        self.batch_softmax_e = batch_softmax_e
        self.batch_conv_mech_part = batch_conv_mech_part

        return batch_context_vector

    # this layer completes a commonly used attentional seq2seq model.
    def encoder_decoder_lstm_layer(self):

        fw_encoder_cell = tf.contrib.rnn.BasicLSTMCell(num_units = self.encoder_single_hidden_size
                                                       , name = "fw_encoder_cell")
        bw_encoder_cell = tf.contrib.rnn.BasicLSTMCell(num_units = self.encoder_single_hidden_size
                                                       , name = "bw_encoder_cell")


        fw_encoder_cell = tf.nn.rnn_cell.DropoutWrapper(fw_encoder_cell, input_keep_prob=self.keep_prob,
                                                        output_keep_prob=self.keep_prob)
        bw_encoder_cell = tf.nn.rnn_cell.DropoutWrapper(bw_encoder_cell, input_keep_prob=self.keep_prob,
                                                        output_keep_prob=self.keep_prob)

        encoder_outputs, encoder_output_states = tf.nn.bidirectional_dynamic_rnn(cell_fw=fw_encoder_cell,
                                                                                 cell_bw=bw_encoder_cell,
                                                                                 inputs=self.encoder_lookup,
                                                                                 dtype=tf.float32,
                                                                                 sequence_length=self.encoder_mask,
                                                                                 )
        # [batch, max_encoder_len, encoder_single_hidden_size * 2]
        encoder_outputs = tf.concat([encoder_outputs[0], encoder_outputs[1]], axis=-1, name="encoder_outputs")
        # [batch, encoder_single_hidden_size * 2]

        encoder_output_states_h = tf.concat([encoder_output_states[0].h, encoder_output_states[1].h],
                                            axis=-1, name="encoder_output_states_h")
        encoder_output_states_c = tf.concat([encoder_output_states[0].c, encoder_output_states[1].c],
                                            axis=-1, name="encoder_output_states_c")

        # need LSTMStateTuple object for decoder
        encoder_output_states = tf.contrib.rnn.LSTMStateTuple(h =encoder_output_states_h , c = encoder_output_states_c,
                                                              )


        decoder_cell = tf.contrib.rnn.BasicLSTMCell(num_units = self.decoder_hidden_size
                                                    , name = "decoder_cell")

        # cell for beam search decoder
        self.decoder_cell = decoder_cell

        decoder_cell = tf.nn.rnn_cell.DropoutWrapper(decoder_cell, input_keep_prob=self.keep_prob,
                                                     output_keep_prob=self.keep_prob)

        decoder_inputs_slice = tf.slice(self.decoder_lookup, begin=[0, 0, 0], size=[-1, self.max_decoder_len - 1, -1],
                                        name="decoder_inputs_slice")

        # decoder_outputs: [batch, max_decoder_len - 1, decoder_hidden_size]
        decoder_outputs, decoder_output_states = tf.nn.dynamic_rnn(cell=decoder_cell, inputs=decoder_inputs_slice,
                                                                   initial_state=encoder_output_states,
                                                                   sequence_length=self.decoder_mask - 1)
        self.decoder_inputs_slice = decoder_inputs_slice
        self.decoder_outputs = decoder_outputs

        # [batch, self.max_decoder_len - 1, encoder_hidden_size * 2]
        batch_context_vector = self.context_vector_layer(encoder_outputs, decoder_outputs)
        self.batch_context_vector = batch_context_vector

        # [batch, max_decoder_len - 1, decoder_hidden_size + encoder_hidden_size * 2]
        fused_decoder_context = tf.concat([decoder_outputs, batch_context_vector], axis=-1,
                                          name="fused_decoder_context")
        fused_decoder_context_reshape = tf.reshape(fused_decoder_context,
                                                   shape=[-1, self.encoder_single_hidden_size * 2 + self.decoder_hidden_size],
                                                   name="fused_decoder_context_reshape")
        # [batch, max_decoder_len - 1, vocab_size]
        word_distribute_before_softmax = tf.reshape(tf.nn.xw_plus_b(tf.nn.xw_plus_b(fused_decoder_context_reshape, self.V, self.b), self.V_, self.b_),
                                                    [-1, self.max_decoder_len -1, self.vocab_size], name="word_distribute_before_softmax")
        self.word_distribute_before_softmax = word_distribute_before_softmax

    def generate_oov_attension(self):
        # [batch, self.max_encoder_len, self.max_decoder_len - 1, ]
        batch_softmax_e_t = tf.slice(tf.transpose(self.batch_softmax_e, [0, 2, 1]), [0, 0, 0],
                                     [self.batch_size, -1, -1])
        # [batch, self.max_encoder_len + 1, self.max_decoder_len - 1]
        batch_softmax_e_t_app = tf.concat([batch_softmax_e_t, tf.zeros(
            [self.batch_size, 1, int(batch_softmax_e_t.get_shape()[2])], dtype=tf.float32)], axis=1)
        #[max_decoder_len - 1, batch ,max_encoder_len, ]
        indice_input = tf.cast(tf.transpose(self.oov_indices_input, [2, 0, 1]), tf.float32)
        batch_size = self.batch_size
        # [max_decoder_len - 1, batch ,max_encoder_len + (max_encoder_len + 1), ]
        fuse_input = tf.concat([indice_input, tf.transpose(batch_softmax_e_t_app, [2, 0, 1])], axis=-1)

        # fuse_input is produced by concatenating indice_input and batch_softmax_e_t_app
        # within the same decoder step, for convenient indexing
        # fuse_input = tf.concat([indice_input, se_input], axis = -1)
        def indice_single_decoder_step(fuse_input):
            indice_input = tf.cast(tf.slice(fuse_input, [0, 0], [-1, self.max_encoder_len]), tf.int32)
            se_input = tf.slice(fuse_input, [0 ,self.max_encoder_len], [-1, -1])

            indices_output_list = []
            tail_dim = int(indice_input.get_shape()[-1])
            for col_idx in range(tail_dim):
                indices = tf.concat([tf.expand_dims(tf.range(batch_size), axis=-1), tf.slice(indice_input, [0, col_idx], [-1, 1])], axis=-1)
                indices_output_list.append(tf.expand_dims(indices, 1))

            indices_final = tf.concat(indices_output_list, axis=1)
            gather_ext = tf.gather_nd(se_input, indices_final)

            # [batch, max_encoder_len]
            return gather_ext

        # [max_decoder_len - 1 ,batch, max_encoder_len]
        batch_gathered = tf.map_fn(indice_single_decoder_step, fuse_input, dtype=tf.float32)
        # [batch, max_encoder_len, max_decoder_len - 1]
        return tf.transpose(batch_gathered, [1, 2, 0])

    def pointer_generator_layer(self):
        ht_part = tf.reshape(self.batch_context_vector, [-1, 2 * self.encoder_single_hidden_size],
                             name="ht_part")
        st_part = tf.reshape(self.decoder_outputs, [-1, self.decoder_hidden_size],
                             name="st_part")
        xt_part = tf.reshape(self.decoder_inputs_slice, [-1, self.word_embedding_dim],
                             name = "xt_part")

        # [batch, max_decoder_len - 1]
        pgen = tf.reshape(tf.nn.sigmoid(tf.matmul(ht_part, self.wh) + tf.matmul(st_part, self.ws) + tf.matmul(xt_part, self.wx) + self.bptr),
                          [-1, self.max_decoder_len - 1])
        total_times = self.vocab_size
        pgen_expand = tf.expand_dims(pgen, axis=-1, name="pgen_expand")

        # pgen_final has the same shape as word_distribute_before_softmax,
        # so it can be applied by a simple element-wise multiply
        # [batch, max_decoder_len - 1, vocab]
        # used for concat encoder_attention step in the last dim : i.e. vocab
        # expand vocab to vocab + max_encoder_dim
        pgen_final = tf.tile(pgen_expand, [1, 1, total_times], name="pgen_final")
        # [batch, max_decoder_len - 1, vocab]
        seq2seq_part = self.word_distribute_before_softmax * pgen_final

        # [batch, max_decoder_len - 1, max_encoder_len]
        one_subtract_pgen = tf.tile(tf.expand_dims(tf.subtract(1.0, pgen), -1), [1, 1, self.max_encoder_len])
        # [batch, max_decoder_len - 1, max_encoder_len]
        oov_attension = self.generate_oov_attension()
        pointer_generator_part = one_subtract_pgen * tf.transpose(self.generate_oov_attension(), [0, 2, 1])

        # append max_encoder_len pointer generator to vocab to have final dist
        # [batch, max_decoder_len - 1, vocab + max_encoder_len]
        final_dist = tf.concat([seq2seq_part, pointer_generator_part], axis=-1, name="final_dist")

        return final_dist
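
    # Note: the paper forms the copy mixture over probabilities,
    #   P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t,
    # whereas here p_gen-weighted pre-softmax scores are concatenated with the
    # weighted attention and renormalised by a single softmax in opt_construct,
    # a simplification rather than the paper's exact formulation.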

    # this part is adapted for the RL (policy gradient) objective
    def opt_construct(self):
        final_dist = self.pointer_generator_layer()
        # [batch, max_decoder_len - 1, vocab + max_encoder_len]
        # decoder step as state, decoder word as the chosen action
        self.q_table = tf.nn.softmax(final_dist, axis=-1, name="q_table")
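        # despite the name, q_table holds the softmax policy probabilities
        # pi(action | state) rather than learned Q-values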

        # [batch, max_decoder_len - 1]
        self.prediction = tf.cast(tf.argmax(self.q_table,
                                    axis=-1), tf.int32, name="prediction")

        # [batch] : sum over decode steps of the log-probability of the
        # greedily decoded (argmax) token, i.e. log pi of the decoded sequence
        self.policy_log_part = tf.reduce_sum(tf.log(tf.reduce_max(self.q_table, axis=-1)), axis=-1,
                                             name="policy_log_part")
        self.loss = tf.reduce_mean(-1 * self.policy_log_part * self.reward, name="loss")

        # coverage loss: elementwise min of the current attention and the
        # accumulated coverage vector, as in the pointer-generator paper
        self.conv_loss = tf.reduce_mean(tf.reduce_min(tf.concat([tf.expand_dims(self.batch_softmax_e, -1), tf.expand_dims(self.batch_conv_mech_part, -1)],
                                                                axis=-1), axis=-1), name="conv_loss")
        self.loss += self.lambda_val * self.conv_loss

        self.l2_loss = None
        for train_able_var in tf.trainable_variables():
            if self.l2_loss is None:
                self.l2_loss = tf.nn.l2_loss(train_able_var)
            else:
                self.l2_loss += tf.nn.l2_loss(train_able_var)
        self.loss = self.loss + self.l2_param * self.l2_loss
        self.train_op = tf.train.AdamOptimizer(0.001).minimize(self.loss)
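
        # Taken together this is the REINFORCE estimator
        #   loss = -E[ sum_t log pi(a_t | s_t) * R ] + lambda * conv_loss + l2_param * l2,
        # with R the mean-centred ROUGE reward fed in from the environment;
        # since reduce_max picks the greedy token, the gradient only flows
        # through the probabilities of the greedily decoded sequence.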

    # returns the top_k [sequence, score] pairs found by beam search, where
    # score is the accumulated negative log probability (smaller is better).
    # Since there is no external OOV handling, decoding here is only run on
    # inputs from the valid data.
    @staticmethod
    def beam_search_decoder_step(data, top_k = 3):
        from math import log
        # a numpy-flavoured reference implementation is discussed at
        # https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/

        def beam_search_decoder(data, k = top_k):
            sequences = [[list(), 0.0]]
            for row in data:
                all_candidates = list()
                for i in range(len(sequences)):
                    seq, score = sequences[i]
                    for j in range(len(row)):
                        # j indexes the token, row[j] is its probability;
                        # accumulate -log(prob) additively so the score is a
                        # proper sequence negative log likelihood
                        candidate = [seq + [j], score - log(row[j])]
                        all_candidates.append(candidate)
                ordered = sorted(all_candidates, key = lambda tup: tup[1])
                sequences = ordered[:k]
            return sequences
        return beam_search_decoder(data)
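
As a quick sanity check, the beam search helper can be exercised on a toy distribution (a hypothetical decode of 4 steps over a 5-token vocabulary; scores are accumulated negative log probabilities, so smaller is better):

import numpy as np

np.random.seed(0)
toy_dist = np.random.rand(4, 5)
toy_dist /= toy_dist.sum(axis=1, keepdims=True)
for seq, score in Seq2SeqPolicyGradient.beam_search_decoder_step(toy_dist.tolist(), top_k=3):
    print(seq, score)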

Environment code:

import tensorflow as tf
import pickle
import numpy as np
import pandas as pd
import pause
import os
from RL_brain import Seq2SeqPolicyGradient

# the inputs to the rouge function are lists of token strings for
# eval_sentences and ref_sentences, so the sentence reward can be computed
# directly, outside the tf graph
from tensor2tensor.utils.rouge import rouge_l_sentence_level

with open(r"data_process\new_data.pkl", "rb") as f:
    obj_dict = pickle.load(f)

idx_token_df = obj_dict["idx_token_df"]
word2idx = obj_dict["word2idx"]
# add <START>, <STOP>, <PAD>; no <UNK> is needed at this step since the whole vocabulary is known.
word2idx = dict([(k, v) for k, v in word2idx.items()] + [("<START>", len(word2idx)),
                                                         ("<STOP>", len(word2idx) + 1),
                                                         ("<PAD>", len(word2idx) + 2)])
max_text_len = max(map(len, idx_token_df["text"]))
ctext_upper_bound = 1000
max_ctext_len = min(max(map(len, idx_token_df["ctext"])), ctext_upper_bound)
print("max_text_len: {}, max_ctext_len: {}".format(max_text_len, max_ctext_len))
# the decoder input actually adds <START> and <STOP> at the text head and tail, so
max_text_len += 2

vocab_size = len(word2idx)
print("vocab_size : {}".format(vocab_size))

# random split dataframe into train and valid
train_ratio = 0.9
total_size = idx_token_df.shape[0]

# use a fixed random seed for a reproducible split
np.random.seed(0)
msk = np.random.rand(total_size) < train_ratio
train_idx_token_df = idx_token_df[msk]
valid_idx_token_df = idx_token_df[~msk]
print("train valid split end.")

def data_generator(type = "train", max_encoder_len = max_ctext_len, max_decoder_len = max_text_len,
                   batch_size = 16, padding_idx = word2idx["<PAD>"]):
    print("init data_generator : {}".format(type))
    used_idx_token_df = None
    assert type in ["train", "valid"]
    if type == "train":
        used_idx_token_df = train_idx_token_df
    else:
        used_idx_token_df = valid_idx_token_df

    # init step
    start_idx = 0
    encoder_input = np.full(shape=[batch_size, max_encoder_len], fill_value=padding_idx, dtype=np.int32)
    encoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
    decoder_input = np.full(shape=[batch_size, max_decoder_len], fill_value=padding_idx, dtype=np.int32)
    decoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
    oov_indices_input = np.full(shape=[batch_size, max_encoder_len, max_decoder_len - 1],
                                fill_value= max_encoder_len,
                                dtype=np.int32)
    input_y = np.full(shape=[batch_size, max_decoder_len - 1], fill_value=padding_idx,
                      dtype=np.int32)

    print("begin iter rows ")
    for idx, r in used_idx_token_df.iterrows():
        summary = r["text"]
        summary = [word2idx["<START>"]] + summary + [word2idx["<STOP>"]]
        body = r["ctext"][:max_encoder_len]

        for s_idx, s_w in enumerate(summary):
            decoder_input[start_idx][s_idx] = s_w
            for b_idx, b_w in enumerate(body):
                if s_w == b_w:
                    oov_indices_input[start_idx][b_idx][s_idx] = b_idx
        decoder_mask[start_idx] = len(summary)

        for b_idx, b_w in enumerate(body):
            encoder_input[start_idx][b_idx] = b_w
        encoder_mask[start_idx] = len(body)

        for y_idx, y_w in enumerate(summary[1:]):
            if y_w in encoder_input[start_idx]:
                encoder_index = encoder_input[start_idx].tolist().index(y_w)
                input_y[start_idx][y_idx] = vocab_size + encoder_index
                continue
            input_y[start_idx][y_idx] = y_w

        start_idx += 1
        if start_idx == batch_size:
            yield (encoder_input, encoder_mask, decoder_input, decoder_mask,
                   oov_indices_input, input_y)

            start_idx = 0
            encoder_input = np.full(shape=[batch_size, max_encoder_len], fill_value=padding_idx, dtype=np.int32)
            encoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
            decoder_input = np.full(shape=[batch_size, max_decoder_len], fill_value=padding_idx, dtype=np.int32)
            decoder_mask = np.zeros(shape=[batch_size], dtype=np.int32)
            oov_indices_input = np.full(shape=[batch_size, max_encoder_len, max_decoder_len - 1],
                                        fill_value= max_encoder_len,
                                        dtype=np.int32)
            input_y = np.full(shape=[batch_size, max_decoder_len - 1], fill_value=padding_idx,
                              dtype=np.int32)
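
# A quick shape check of one generated batch (illustrative, run manually):
#   _batch = next(data_generator(type="train", batch_size=2))
# expected shapes: encoder_input (2, max_ctext_len), encoder_mask (2,),
#                  decoder_input (2, max_text_len), decoder_mask (2,),
#                  oov_indices_input (2, max_ctext_len, max_text_len - 1),
#                  input_y (2, max_text_len - 1)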

tf.set_random_seed(1)

def rltrain_func():
    train_gen = data_generator(type = "train")
    valid_gen = data_generator(type = "valid")

    pointerGenerator_ext = Seq2SeqPolicyGradient(
        vocab_size = vocab_size, word_embedding_dim = 100,
        max_encoder_len = max_ctext_len, max_decoder_len = max_text_len,
        encoder_single_hidden_size = 50, decoder_hidden_size = 100,
        v_dim = 100, V_dim = 1000, batch_size = 16,
    )
    print("model construct end")
    pause.seconds(1)

    total_step = 0
    epoch = 0

    config = tf.ConfigProto()
    saver = tf.train.Saver()
    with tf.Session(config=config) as sess:
        if os.path.exists(r"E:\Coding\python\pointerDiyRLTrained\model.ckpt.index"):
            print("restore exists")
            saver.restore(sess, save_path=r"E:\Coding\python\pointerDiyRLTrained\model.ckpt")
        else:
            print("init global")
            sess.run(tf.global_variables_initializer())

        #sess.run(tf.global_variables_initializer())
        print("model init end")
        pause.seconds(1)

        while True:
            try:
                encoder_input, encoder_mask, decoder_input, decoder_mask, \
                oov_indices_input, input_y = train_gen.__next__()
            except StopIteration:
                # train for at least 30 epochs as the paper describes
                print("train epoch {} end".format(epoch))
                epoch += 1
                if epoch == 100:
                    print("epoch 100 end will return")
                    return

                train_gen = data_generator(type = "train")
                continue

            prediction = sess.run(pointerGenerator_ext.prediction,
                               feed_dict={
                                   pointerGenerator_ext.encoder_input: encoder_input,
                                   pointerGenerator_ext.encoder_mask: encoder_mask,
                                   pointerGenerator_ext.decoder_input: decoder_input,
                                   pointerGenerator_ext.decoder_mask: decoder_mask,
                                   pointerGenerator_ext.oov_indices_input: oov_indices_input,

                                   pointerGenerator_ext.keep_prob: 1.0,
                               })
            # [batch, max_decoder_len - 1]
            str_prediction = np.array(prediction, dtype=str)
            str_input_y = np.array(input_y, dtype=str)
            total_rouge_l_list = []
            for i in range(str_input_y.shape[0]):
                # score one (prediction, reference) token-sequence pair per call
                rouge_l_val = rouge_l_sentence_level([str_prediction[i].tolist()], [str_input_y[i].tolist()])
                total_rouge_l_list.append(rouge_l_val)
            total_rouge_l_array = np.array(total_rouge_l_list, dtype=np.float32)
            # subtract the batch mean as a simple baseline to reduce gradient variance
            total_rouge_l_array -= total_rouge_l_array.mean()

            # [batch]
            reward_array = total_rouge_l_array

            _, loss = sess.run([pointerGenerator_ext.train_op, pointerGenerator_ext.loss],
                     feed_dict={
                         pointerGenerator_ext.encoder_input: encoder_input,
                         pointerGenerator_ext.encoder_mask: encoder_mask,
                         pointerGenerator_ext.decoder_input: decoder_input,
                         pointerGenerator_ext.decoder_mask: decoder_mask,
                         pointerGenerator_ext.oov_indices_input: oov_indices_input,
                         pointerGenerator_ext.reward: reward_array,

                         pointerGenerator_ext.keep_prob: 1.0,
                         pointerGenerator_ext.l2_param: 0.00001,
                         pointerGenerator_ext.lambda_val: 0.001
                     })
            print("train loss : {}".format(loss))
            total_step += 1
            if total_step % 5 == 0:
                try:
                    encoder_input, encoder_mask, decoder_input, decoder_mask, \
                    oov_indices_input, input_y = valid_gen.__next__()
                except StopIteration:
                    print("one valid pass end, restart valid generator.")
                    valid_gen = data_generator(type = "valid")
                    continue

                prediction = sess.run(pointerGenerator_ext.prediction,
                                      feed_dict={
                                          pointerGenerator_ext.encoder_input: encoder_input,
                                          pointerGenerator_ext.encoder_mask: encoder_mask,
                                          pointerGenerator_ext.decoder_input: decoder_input,
                                          pointerGenerator_ext.decoder_mask: decoder_mask,
                                          pointerGenerator_ext.oov_indices_input: oov_indices_input,

                                          pointerGenerator_ext.keep_prob: 1.0,
                                      })
                # [batch, max_decoder_len - 1]
                str_prediction = np.array(prediction, dtype=str)
                str_input_y = np.array(input_y, dtype=str)
                total_rouge_l_list = []
                for i in range(str_input_y.shape[0]):
                    rouge_l_val = rouge_l_sentence_level([str_prediction[i].tolist()], [str_input_y[i].tolist()])
                    total_rouge_l_list.append(rouge_l_val)
                total_rouge_l_array = np.array(total_rouge_l_list, dtype=np.float32)
                total_rouge_l_array -= total_rouge_l_array.mean()

                # [batch]
                reward_array = total_rouge_l_array

                loss = sess.run(pointerGenerator_ext.loss,
                                   feed_dict={
                                       pointerGenerator_ext.encoder_input: encoder_input,
                                       pointerGenerator_ext.encoder_mask: encoder_mask,
                                       pointerGenerator_ext.decoder_input: decoder_input,
                                       pointerGenerator_ext.decoder_mask: decoder_mask,
                                       pointerGenerator_ext.oov_indices_input: oov_indices_input,
                                       pointerGenerator_ext.reward: reward_array,

                                       pointerGenerator_ext.keep_prob: 1.0,
                                       pointerGenerator_ext.l2_param: 0.0,
                                       pointerGenerator_ext.lambda_val: 0.001
                                   })
                print("valid loss : {}".format(loss))
                saver.save(sess, save_path=r"E:\Coding\python\pointerDiyRLTrained\model.ckpt")

def rlpred_func(epsilon = 1e-10):
    idx2word = dict((i, w) for w, i in word2idx.items())

    def process_single_prediction_list_to_word_list(single_encoder_input ,input_list):
        # single_encoder_input [max_encoder_len]
        req_word_list = []
        for word_idx in input_list:
            if word_idx in idx2word:
                word = idx2word[word_idx]
            else:
                assert word_idx >= vocab_size
                word_idx = single_encoder_input[word_idx - vocab_size]
                word = idx2word[word_idx]
            req_word_list.append(word)
        return req_word_list

    valid_gen = data_generator(type = "valid", batch_size=1)
    pointerGenerator_ext = Seq2SeqPolicyGradient(
        vocab_size = vocab_size, word_embedding_dim = 100,
        max_encoder_len = max_ctext_len, max_decoder_len = max_text_len,
        encoder_single_hidden_size = 50, decoder_hidden_size = 100,
        v_dim = 100, V_dim = 1000,
        batch_size=1)
    print("model construct end")

    saver = tf.train.Saver()
    with tf.Session() as sess:
        if os.path.exists(r"E:\Coding\python\pointerDiyRLTrained\model.ckpt.index"):
            saver.restore(sess, save_path=r"E:\Coding\python\pointerDiyRLTrained\model.ckpt")
        else:
            print("model not exists return")
            return

        while True:
            try:
                encoder_input, encoder_mask, decoder_input, decoder_mask, \
                oov_indices_input, input_y = valid_gen.__next__()
            except StopIteration:
                print("valid epoch end, will return")
                return

            # the q_table output has shape [1, max_decoder_len - 1, vocab + max_encoder_len]
            prediction = sess.run(pointerGenerator_ext.q_table,
                                  feed_dict={
                                      pointerGenerator_ext.encoder_input: encoder_input,
                                      pointerGenerator_ext.encoder_mask: encoder_mask,
                                      pointerGenerator_ext.decoder_input: decoder_input,
                                      pointerGenerator_ext.decoder_mask: decoder_mask,
                                      pointerGenerator_ext.oov_indices_input: oov_indices_input,

                                      pointerGenerator_ext.keep_prob: 1.0,
                                      pointerGenerator_ext.l2_param: 0.0,
                                      pointerGenerator_ext.lambda_val: 0.0
                                  })

            # add a small epsilon so log is defined for zero-probability entries
            prediction += epsilon
            # [max_decoder_len - 1, vocab + max_encoder_len]
            prediction_array = np.squeeze(prediction)
            # list of top_k [token-index sequence, accumulated -log prob] pairs
            beam_search_array = Seq2SeqPolicyGradient.beam_search_decoder_step(prediction_array, top_k=3)

            single_encoder_input = np.squeeze(encoder_input)
            single_encoder_input_words = process_single_prediction_list_to_word_list(single_encoder_input=single_encoder_input,
                                                                                     input_list=single_encoder_input)

            # visualise the decoded sequences
            print("single_encoder_input_words :")
            print(" ".join(single_encoder_input_words))
            for beam_search_ele in beam_search_array:
                beam_search_seq = process_single_prediction_list_to_word_list(single_encoder_input ,beam_search_ele[0])
                beam_search_log_prob = beam_search_ele[-1]
                print("seq :")
                print(" ".join(beam_search_seq))
                print("log_prob :")
                print(beam_search_log_prob)
            print("*" * 100)


if __name__ == "__main__":
    #rltrain_func()
    rlpred_func()
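
For reference, the reward used above is ROUGE-L, an LCS-based F-measure between the predicted and reference token sequences. A minimal pure-Python sketch of the idea (a simplified F1 variant of my own; tensor2tensor's rouge_l_sentence_level uses a slightly different beta weighting) looks like this:

def rouge_l_f1(pred_tokens, ref_tokens):
    # longest common subsequence length via dynamic programming
    m, n = len(ref_tokens), len(pred_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == pred_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)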


Below are examples from the two models on the valid set. (The training data is small, only about 4,000 samples in total, and there was not much time to train, but the rough behavior is still visible.)

Source text:

a woman parliamentarian of imran khan ? s pakistan tehreek-e-insaf ( pti ) has stirred up a major row by accusing the cricketer-turned-politician of sending her ? obscene ? text messages and harassing women leaders of the party. ayesha gulalai , elected to the national assembly in 2013 from a reserved seat for women in the tribal areas bordering afghanistan , also announced her decision to quit the pti because she could not compromise on her honour and dignity. addressing a news conference in islamabad on tuesday , gulalai said the ? honour ? of women in the pti is not safe because of 64-year-old khan , whom she called a ? fake pathan ? . gulalai , who hails from the conservative south waziristan tribal region and is the sister of leading squash player maria toorpakai , said khan is a ? characterless person ? who considers himself an angel but his conduct is ? highly indecent ? . there was no response to the allegations from khan though pti spokesman fawad chaudhry dismissed gulalai ? s allegations and claimed she had ? sold her soul ? for money. the lawmaker said she had received numerous text messages from khan , and the first one was sent from his blackberry in october 2013. she declined to read out the messages but asked journalists to contact the telecom authority in this context . she urged the supreme court to take notice of the matter for the sake of the honour of women members of the pti as there is speculation that khan might one day become the prime minister . gulalai said her decision to quit the pti had nothing to do with a case regarding khan ? s assets that is being heard by the supreme court . ? the criterion for awarding party tickets is entirely different in the pti , ? she said. she said she had the courage to come out against khan and his clan as she is a brave pathan and honour was more important to her . the lawmaker went on to say that khan was more impressed by western culture and wanted to replicate it in pakistan . ? their change is confined to social media while workers like me work on the ground , whom they call inferior workers . when i used to attend meetings , imran khan gave tips on how to ridicule rivals and target them , ? she said. gulalai also accused the pti government in khyber-pakhtunkhwa province of corruption and said chief minister pervez khattak was acting like a ? mafia boss ? . she rejected reports that she was joining the pml-n party of former premier nawaz sharif but praised him for respecting women .



1. Seq2seq + pointer generator:

Model summary:

the a woman lawmaker of pakistan the the pti the party , the gulalai , has the party the imran khan of the women leaders in the the and sending them obscene text messages . the the the the to me the the and i the the compromise the it the to the honour and the , the gulalai said the her the from the party . 53.07 spectre.there rebel iran-fresh rebel barker rebel the disregarding rebel criminalising all.the deplored rebel rebel rebel corporations deplored north-western establishing átelugu all.the rebel horrible




The first part is roughly on topic, but the tail degenerates (too little data?). The result resembles neural-network poetry generation, diverging as the decoding steps accumulate; my poetry network and its write-up were poor enough that I deleted them from the site and from my hard drive.


2. Policy gradient version:

For the same source text, the policy gradient model produces the following version:

Model summary:

the a woman lawmaker of pakistan tehreek-e-insaf ( pti ) the , the gulalai , has accused party the imran khan of harassing women leaders in the party and sending them obscene the messages the the the the the to me the the and i the not compromise when it the to the honour and the , the gulalai said the her the from the party . no-entries arbitrary no-entries chokers erskine-directed no-entries chokers no-entries chokers no-entries slight abuse.as notallstarkids the no-entries statements.vishwas abuse.as no-entries no-entries slight the no-entries no-entries chokers




Reinforcement learning is a useful tool; the results above give a rough sense of what it does here.













(Reposted from blog.csdn.net/sinat_30665603/article/details/80622746)