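This is another model I used a long time ago. Coming back to organize my notes today, I noticed how simple classification models tend to be. Seeing that this one is built on GRUs, I was a little reluctant to read it, because recurrent models that step through a sequence are slow to train. I have not followed new classification models closely lately, but I think combining CNNs with Transformer-style structures (self-attention) is probably where classification is heading. That said, the attention visualizations later in this paper are quite nice.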

Table of Contents

1.Model Overview
2.Model Details

2.1.Word Encoder
2.2.Word Attention layer
2.3.Sentence Encoder
2.4.Sentence Attention
2.5.Document Classification
2.6.Experimental Results and Further Reading
2.7.Source Code

References

This model was proposed specifically for classification, so I will skip the background, which is simple, and go straight to the model.

The paper, with my annotations

1.Model Overview

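A document has a hierarchical structure: a document is made up of sentences, and a sentence is made up of words. Words and sentences are both highly context dependent: the same word or sentence can differ in importance depending on its context. The paper therefore uses two attention mechanisms, one at the word level and one at the sentence level, to express how important each word or sentence is once its context is taken into account. (The context-aware word or sentence here is the hidden state produced by the RNN.)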

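The structure of the hierarchical attention network is shown in Figure 1. The network can be viewed as two parts: a word attention part and a sentence attention part. The text is first split into parts (the paper splits a document into sentences; for a single long sentence, one could for example split it into clauses at commas). For each part, a bidirectional RNN combined with an attention mechanism maps the word sequence to a single vector. The resulting sequence of vectors is then passed through another bidirectional RNN with attention, and the output is used to classify the text.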
[Figure 1: structure of the hierarchical attention network]
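
To make the document → sentences → words hierarchy concrete, here is a minimal sketch of the preprocessing step. The splitting rules (sentence boundaries at `.`, `!`, `?`, and a lowercase alphanumeric word tokenizer) and the function name `split_document` are assumptions chosen for illustration; the paper and real implementations may use a proper tokenizer instead.

```python
import re

def split_document(text):
    """Split a document into sentences, and each sentence into words."""
    # Naive rule (an assumption for illustration): a sentence ends at ., ! or ?
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenizer: lowercase runs of letters, digits and apostrophes.
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

doc = "The food was great. Service was slow, but the staff were friendly!"
nested = split_document(doc)
for i, words in enumerate(nested):
    print(i, words)
```

The result is a list of sentences, each a list of words. This nested shape is what the word-level and sentence-level parts of the network consume in turn.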

2.Model Details

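The paper spends a surprising amount of space on the GRU, which at first made me suspect padding, though that is not really CMU's style. Perhaps earlier papers simply tended to spell out the lower-level basics.
I covered the GRU in an earlier post, so I will not repeat it here.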

2.1.Word Encoder

$x_{it}=W_ew_{it}, t\in [1, T]$
$\overrightarrow h_{it}=\overrightarrow {GRU}(x_{it}),t\in[1,T]$
$\overleftarrow h_{it}=\overleftarrow {GRU}(x_{it}),t\in [T,1]$
$h_{it} = [\overrightarrow h_{it},\overleftarrow h_{it}]$

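where:

$i$ indexes the $i^{th}$ sentence in the document, and $t$ indexes the $t^{th}$ word in that sentence.
Here $x_{it}=W_ew_{it}, t\in [1, T]$ should simply be the word embedding lookup: if $w_{it}$ is treated as a one-hot vector, multiplying by $W_e$ selects the corresponding column of the embedding matrix. The paper is not very explicit about this, but every implementation I have seen does a direct lookup.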
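
The word encoder can be sketched in plain Python as follows. It first checks that multiplying $W_e$ by a one-hot vector gives the same result as a direct column lookup, then runs a forward and a backward GRU and concatenates their states. The sizes, the random initialization, and all function names are assumptions for illustration, and the GRU step uses one common formulation that may differ in detail from the variant in the paper.

```python
import math
import random

random.seed(0)
VOCAB, EMBED, HIDDEN = 6, 4, 3  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# W_e is stored as EMBED x VOCAB so that x = W_e w for a one-hot w.
W_e = rand_matrix(EMBED, VOCAB)

def embed_onehot(word_id):
    onehot = [1.0 if j == word_id else 0.0 for j in range(VOCAB)]
    return matvec(W_e, onehot)

def embed_lookup(word_id):
    # Equivalent shortcut: read column word_id of W_e directly.
    return [row[word_id] for row in W_e]

def make_gru_params():
    # One (W, U, b) triple per gate: update z, reset r, candidate h.
    return {g: (rand_matrix(HIDDEN, EMBED), rand_matrix(HIDDEN, HIDDEN), [0.0] * HIDDEN)
            for g in ("z", "r", "h")}

def gru_step(p, x, h_prev):
    def affine(gate, h):
        W, U, b = p[gate]
        return [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]
    z = [sigmoid(a) for a in affine("z", h_prev)]
    r = [sigmoid(a) for a in affine("r", h_prev)]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # reset gate applied to h_prev
    cand = [math.tanh(a) for a in affine("h", gated)]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def run_gru(p, xs):
    h, states = [0.0] * HIDDEN, []
    for x in xs:
        h = gru_step(p, x, h)
        states.append(h)
    return states

def encode_words(word_ids, fwd, bwd):
    xs = [embed_lookup(w) for w in word_ids]
    forward = run_gru(fwd, xs)                # reads t = 1..T
    backward = run_gru(bwd, xs[::-1])[::-1]   # reads t = T..1, then re-aligned
    return [f + b for f, b in zip(forward, backward)]  # h_it = [forward, backward]

fwd_params, bwd_params = make_gru_params(), make_gru_params()
sentence = [2, 0, 5, 1]  # hypothetical word ids
h = encode_words(sentence, fwd_params, bwd_params)
print(len(h), len(h[0]))  # 4 6
```

Each word ends up with an annotation of size `2 * HIDDEN`, which summarizes the sentence around that word from both directions.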

2.2.Word Attention layer

Not all words contribute equally to the representation of the sentence meaning. Hence, we introduce attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.

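At its core, the attention mechanism assigns a weight to each context-aware word in the sentence. The key question is how that weight is determined.
$u_{it}=\tanh(W_wh_{it}+b_w)$
$\alpha_{it}=\dfrac{\exp(u_{it}^\top u_w)}{\sum_{t'=1}^{T}\exp(u_{it'}^\top u_w)}$
$s_i=\sum_{t=1}^{T}\alpha_{it}h_{it}$
First, $h_{it}$ is passed through **a fully connected layer to obtain the hidden representation $u_{it}$, and then the similarity between $u_{it}$ and $u_w$ is computed.** A softmax normalizes these similarities into a probability for each word. Here $u_w$ is a word-level context vector that is learned jointly with the rest of the model. The more similar a word is to $u_w$, the larger its weight and the greater its influence on the sentence vector $s_i$.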
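
The three equations above can be sketched directly in plain Python. The toy dimensions, the random values standing in for the GRU outputs $h_{it}$, and the function name `attention_pool` are assumptions for illustration; in a real model $W_w$, $b_w$ and $u_w$ would be trained parameters.

```python
import math
import random

random.seed(1)

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hs, W, b, context):
    """Return (weights, pooled vector) for a list of annotations hs.

    u_t   = tanh(W h_t + b)
    alpha = softmax over t of (u_t . context)
    s     = sum over t of alpha_t * h_t
    """
    us = [[math.tanh(a + bi) for a, bi in zip(matvec(W, h), b)] for h in hs]
    scores = [sum(ui * ci for ui, ci in zip(u, context)) for u in us]
    alphas = softmax(scores)
    dim = len(hs[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, hs)) for k in range(dim)]
    return alphas, pooled

DIM, ATT = 6, 5  # annotation size and attention size (toy values)
h_it = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]  # 4 words
W_w = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(ATT)]
b_w = [0.0] * ATT
u_w = [random.uniform(-0.5, 0.5) for _ in range(ATT)]

alpha, s_i = attention_pool(h_it, W_w, b_w, u_w)
print([round(a, 3) for a in alpha], len(s_i))
```

The weights `alpha` are positive and sum to 1, so `s_i` is a convex combination of the word annotations. These weights are also what an attention visualization would display per word.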

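My feeling is that this attention is already a form of "self-attention": it predates the Transformer and is neither as elaborate nor as formalized, but the idea is recognizably there. One difference is that the query here is a single learned vector $u_w$ shared across all positions, rather than a query computed from each token.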

2.3.Sentence Encoder

$\overrightarrow h_{i}=\overrightarrow {GRU}(s_{i}),i\in[1,L]$
$\overleftarrow h_{i}=\overleftarrow {GRU}(s_{i}),i\in [L,1]$
$h_i=[\overrightarrow h_{i}, \overleftarrow h_{i}]$
$h_i$ summarizes the neighbor sentences around sentence $i$ but still focuses on sentence $i$.

2.4.Sentence Attention

$u_i=\tanh(W_sh_i+b_s)$
$\alpha_i=\dfrac{\exp(u_i^\top u_s)}{\sum_{i'=1}^{L}\exp(u_{i'}^\top u_s)}$
$v = \sum_{i=1}^{L}\alpha_ih_i$
Likewise, $u_s$ is a sentence-level context vector: it plays the same role at the sentence level that $u_w$ plays at the word level.
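
The two attention levels compose as sketched below: the same pooling operation is applied once per sentence with the word-level parameters, then once over the resulting sentence vectors with a separate set of sentence-level parameters. To keep the sketch short, both GRU encoders are omitted and random vectors stand in for the word annotations; that simplification, the toy sizes, and the helper names `attend` and `init` are assumptions for illustration only.

```python
import math
import random

random.seed(2)

def attend(vectors, W, b, context):
    """Attention pooling: tanh projection, softmax over scores, weighted sum."""
    scores = []
    for h in vectors:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, vectors)) for k in range(dim)]
    return alphas, pooled

def init(dim, att):
    W = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(att)]
    return W, [0.0] * att, [random.uniform(-0.5, 0.5) for _ in range(att)]

DIM, ATT = 4, 3
W_w, b_w, u_w = init(DIM, ATT)  # word-level parameters
W_s, b_s, u_s = init(DIM, ATT)  # sentence-level parameters (separate from word level)

# Toy document: 3 sentences holding 2, 4 and 3 word annotations.
doc = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]
       for n in (2, 4, 3)]

# Word attention: one vector s_i per sentence.
sentence_vectors = [attend(words, W_w, b_w, u_w)[1] for words in doc]
# In the full model a sentence-level bidirectional GRU would run here (skipped).
alpha_sent, v = attend(sentence_vectors, W_s, b_s, u_s)
print(len(sentence_vectors), len(alpha_sent), len(v))  # 3 3 4
```

Sentences of different lengths all collapse to vectors of the same size, which is what lets the sentence level treat them as an ordinary sequence.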

2.5.Document Classification

The document vector v is a high level representation of the document and can be used as features for document classification:
$p=\mathrm{softmax}(W_cv+b_c)$
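
A minimal sketch of this final layer, together with the negative log-likelihood of the correct label, which is the training loss used in the paper (it is not discussed in this post). The two-class weights and the document vector below are made-up values for illustration, not anything taken from the paper.

```python
import math

def softmax(z):
    top = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify(v, W_c, b_c):
    """p = softmax(W_c v + b_c)"""
    logits = [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(W_c, b_c)]
    return softmax(logits)

def nll(p, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(p[label])

v = [0.2, -0.1, 0.4]                      # hypothetical document vector
W_c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical 2-class weights
b_c = [0.0, 0.0]

p = classify(v, W_c, b_c)
print([round(x, 3) for x in p], round(nll(p, 0), 3))
```

The loss is small when the model puts high probability on the correct class and grows as that probability shrinks.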

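2.6.Experimental Results and Further Reading

About the experimental results

I came across this while looking up related material; I will look into it later.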
Multilingual Hierarchical Attention Networks for Document Classification

2.7.Source Code

A fairly concise implementation: tf-hierarchical-rnn

paper:Hierarchical Attention Networks for Document Classification