Fear the REAPER: A System for Automatic Multi-Document Summarization with Reinforcement Learning

Cody Rioux, Sadid A. Hasan, Yllias Chali

##Abstract

Achieve the largest coverage of the documents' content.
Concentrate distributed information into hidden units layer by layer.
The whole deep architecture is fine-tuned by minimizing the information loss of reconstruction validation.
According to the concentrated information, dynamic programming (DP) is used to seek the most informative set of sentences as the summary.
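The dynamic-programming selection step can be illustrated with a 0/1 knapsack over sentences. This is only a minimal sketch: the per-sentence informativeness `scores`, the integer `lengths`, and the length `budget` are hypothetical inputs, and the notes do not say that the original system uses exactly this formulation.

```python
def select_sentences(scores, lengths, budget):
    """0/1 knapsack: maximize total score with total length <= budget.

    scores, lengths and budget are illustrative inputs (assumptions).
    """
    n = len(scores)
    # best[i][b] = best total score using the first i sentences within length b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip sentence i-1
            if lengths[i - 1] <= b:
                # option 2: take sentence i-1 if it fits
                take = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if take > best[i][b]:
                    best[i][b] = take
    # Backtrack to recover which sentences were taken
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen), best[n][budget]
```

For example, `select_sentences([3.0, 4.0, 5.0], [2, 3, 4], 5)` picks the first two sentences, since together they fit the budget and score higher than the third sentence alone.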
##Related Work
We explore the use of SARSA, a derivative of TD(λ) that models the action space in addition to the state space modelled by TD(λ). Furthermore, we explore the use of an algorithm based not on temporal difference methods but on policy iteration techniques.
REAPER (Relatedness-focused Extractive Automatic summary Preparation Exploiting Reinforcement learning)
##Motivation
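TD(λ) is relatively old as far as reinforcement learning (RL) algorithms are concerned, and the optimal ILP did not outperform ASRL using the same reward function.
Reinforcement learning still has considerable room for improvement.
Query-focused summarization has received wide attention.
The effect of sentence compression is not explored further.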
##Model
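TD(λ)
Temporal difference (TD) learning is a prediction-based machine learning method. It is mainly used for reinforcement learning problems and is said to be "a combination of Monte Carlo ideas and dynamic programming (DP) ideas." [1] TD resembles Monte Carlo methods because it learns by sampling the environment according to some policy, and it is related to dynamic programming techniques because it approximates its current estimate from previously learned estimates (known as bootstrapping). The TD learning algorithm is related to the temporal difference model of animal learning. [2]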
temporal difference methods-wiki
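A minimal tabular sketch of TD(λ) value prediction with accumulating eligibility traces may make the bootstrapping idea concrete. The function name `td_lambda_episode`, the episode format, and the default step sizes are assumptions for illustration, not the setup used in the paper.

```python
def td_lambda_episode(values, episode, alpha=0.1, gamma=1.0, lam=0.8):
    """Update `values` in place from one episode.

    episode: list of (state, reward, next_state); next_state is None
    when the episode terminates. All names here are illustrative.
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        v_next = 0.0 if next_state is None else values[next_state]
        # TD error: bootstrapped target minus the current estimate
        delta = reward + gamma * v_next - values[state]
        traces[state] += 1.0  # accumulating eligibility trace
        for s in values:
            values[s] += alpha * delta * traces[s]
            traces[s] *= gamma * lam  # decay every trace after the update
    return values
```

With `lam=0` only the state just visited is updated (one-step TD); with `lam=1` and `gamma=1` earlier states receive the full credit, which is closer to a Monte Carlo update.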
Approximate Policy Iteration
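Approximate policy iteration (API) follows a different paradigm: it iteratively improves the policy of a Markov decision process until the policy converges.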
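The SARSA algorithm
Q-learning picks the best option when choosing the next step (the action with the largest Q-value), whereas SARSA evaluates the next step with the same policy it used for the previous step; in both cases the estimate for the previous step is then updated, which is how learning takes place.
Comparison of on-policy SARSA and off-policy Q-learning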
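The difference between the two methods comes down to the bootstrapping target, which a short sketch can show side by side. The nested-dictionary Q-table `Q[state][action]` and the default `alpha` and `gamma` values are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually chosen in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy (max-Q) action in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```

The two updates coincide whenever the behaviour policy happens to pick the greedy action in `s_next`; they differ when it explores.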
##Experiment
The feature space depends on the presence of top bigrams rather than tf*idf words.
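A bigram-presence feature vector might look roughly like the sketch below. Whitespace tokenization, lowercasing, the cutoff `k`, and the function names are all assumptions; the notes do not record how the paper selects or counts its top bigrams.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs of a token list."""
    return list(zip(tokens, tokens[1:]))

def top_bigrams(documents, k):
    """The k most frequent bigrams across the document cluster."""
    counts = Counter()
    for doc in documents:
        counts.update(bigrams(doc.lower().split()))
    return [bg for bg, _ in counts.most_common(k)]

def summary_features(summary_sentences, top):
    """Binary vector: 1 if the summary contains the i-th top bigram, else 0."""
    present = set()
    for sent in summary_sentences:
        present.update(bigrams(sent.lower().split()))
    return [1 if bg in present else 0 for bg in top]
```

The state of a partial summary is then a fixed-length 0/1 vector whose size is set by `k` rather than by the vocabulary.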
Reward Function
Based on the n-gram concurrence score metric and the longest-common-subsequence recall metric.
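The two ingredients can be sketched as recall-style scores over token lists. This is a simplified illustration in the spirit of ROUGE-N and ROUGE-L, not the exact reward used by REAPER; how the two scores are weighted or combined is not stated in these notes, so the sketch leaves them separate.

```python
from collections import Counter

def ngram_recall(candidate, reference, n=2):
    """Fraction of reference n-grams that also occur in the candidate (clipped counts)."""
    def grams(tokens):
        return Counter(zip(*(tokens[i:] for i in range(n))))
    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0
```

Both functions expect pre-tokenized lists, for example `"the cat sat".split()`.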

Immediate Rewards
Query Focused Rewards
