From Word Embeddings To Document Distances

M. J. Kusner, Y. Sun, N. I. Kolkin, K. Q. Weinberger, From Word Embeddings To Document Distances, ICML (2015)

Abstract

Word embedding: learns semantically meaningful representations for words from their local co-occurrences in sentences.

Word Mover's Distance (WMD): a distance function between text documents, built on word embeddings. WMD measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document.

The WMD metric has no hyperparameters.

1 Introduction

The two most common ways to represent documents:

bag of words (BOW);
term frequency-inverse document frequency (TF-IDF).

Because the BOW (TF-IDF) representations of different documents are frequently near-orthogonal, neither is well suited to measuring document distance. In addition, neither captures the distance between individual words.
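The near-orthogonality can be seen in a minimal sketch. The two token lists below are illustrative (stop words already removed), and the helper name `bow` is an assumption made for this example, not something defined in the paper.

```python
from collections import Counter

def bow(tokens, vocab):
    """Bag-of-words count vector of `tokens` over the ordered list `vocab`."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Two sentences with similar meaning but no word in common.
doc_a = "obama speaks media illinois".split()
doc_b = "president greets press chicago".split()
vocab = sorted(set(doc_a) | set(doc_b))

a, b = bow(doc_a, vocab), bow(doc_b, vocab)
dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 0: the two BOW vectors are orthogonal
```

Since the vectors share no nonzero coordinate, any BOW-based similarity treats the two documents as unrelated, no matter how close the individual words are in meaning.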

Latent low-dimensional representations of documents:

Latent Semantic Indexing (LSI): eigendecomposes the BOW feature space;
Latent Dirichlet Allocation (LDA), a topic model: probabilistically groups similar words into topics and represents documents as distributions over these topics.

Semantic relationships are often preserved in vector operations on word vectors, i.e. distances between embedded word vectors are to some degree semantically meaningful. This paper represents a text document as a weighted point cloud of embedded words. The Word Mover's Distance (WMD) between two text documents $A$ and $B$ is defined as the minimum cumulative distance that words from document $A$ need to travel to match exactly the point cloud of document $B$ (Fig. 1).

The WMD optimization problem is a special case of the Earth Mover's Distance (EMD) transportation problem. The paper gives several lower bounds that can be used as approximations of WMD or to prune away documents that are provably not among the $k$-nearest neighbors of a query.

Properties of WMD: (1) hyperparameter free; (2) highly interpretable: the distance between two documents can be broken down and explained as the sparse distances between a few individual words; (3) high retrieval accuracy.

2 Related Work

Okapi BM25

LDA

LSI

TextTiling-EMD

Stacked Denoising Autoencoders (SDA), mSDA

Componential Counting Grid

3 Word2Vec Embedding

word2vec: a word-embedding procedure that learns a vector representation for each word using a (shallow) neural network language model.

Skip-gram model: consists of an input layer, a projection layer, and an output layer, and is trained to predict nearby words. The word vectors are trained to maximize the log probability of neighboring words in a corpus, i.e. given a sequence of words $w_{1}, \cdots, w_{T}$, the objective is:

$\frac{1}{T} \sum_{t = 1}^{T} \sum_{j \in nb(t)} \log p(w_{j} | w_{t})$

where $nb(t)$ is the set of neighboring words of word $w_{t}$ and $p(w_{j} | w_{t})$ is the hierarchical softmax of the associated word vectors $\mathbf{v}_{w_{j}}$ and $\mathbf{v}_{w_{t}}$. Thanks to its simple architecture and the hierarchical softmax, the skip-gram model can be trained on a single conventional desktop computer at a rate of billions of words per hour, which lets it learn complex word relationships.
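To make the objective concrete, here is a rough sketch that evaluates it for a toy sequence. It deliberately swaps the hierarchical softmax for a plain (full) softmax, which is simpler to write but far slower on a real vocabulary. The names `skipgram_objective`, `V_in`, `V_out`, and `window` are assumptions for illustration, and $nb(t)$ is assumed to be a symmetric window around position $t$.

```python
import numpy as np

def skipgram_objective(seq, V_in, V_out, window):
    """Average log-probability of neighbors under a full softmax.

    seq    : list of word ids w_1..w_T
    V_in   : (V, dim) input vectors, one row per word (assumed layout)
    V_out  : (V, dim) output vectors, one row per word (assumed layout)
    window : nb(t) is taken to be positions t-window..t+window, excluding t
    """
    T = len(seq)
    total = 0.0
    for t in range(T):
        scores = V_out @ V_in[seq[t]]            # one score per vocabulary word
        m = scores.max()                         # stabilised log-softmax
        log_p = scores - (m + np.log(np.exp(scores - m).sum()))
        for j in range(max(0, t - window), min(T, t + window + 1)):
            if j != t:
                total += log_p[seq[j]]           # log p(w_j | w_t)
    return total / T

rng = np.random.default_rng(0)
V_in = rng.normal(size=(5, 3))
V_out = rng.normal(size=(5, 3))
print(skipgram_objective([0, 3, 1, 4, 2], V_in, V_out, window=2))
```

Training would adjust `V_in` and `V_out` to increase this value; the sketch only evaluates it.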

4 Word Mover's Distance

Let $\mathbf{X} \in \R^{d \times n}$ be a word2vec embedding matrix for a vocabulary of $n$ words, whose $i$-th column $\mathbf{x}_{i} \in \R^{d}$ is the embedding of the $i$-th word in $d$-dimensional space. Text documents are assumed to be represented as normalized bag-of-words (nBOW) vectors $\mathbf{d} \in \R^{n}$: if word $i$ appears $c_{i}$ times in the document, then $d_{i} = \frac{c_{i}}{\sum_{j = 1}^{n} c_{j}}$. An nBOW vector $\mathbf{d}$ is typically very sparse.
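The nBOW definition translates directly into a few lines of code. This is a small sketch; the function name `nbow` and the toy vocabulary are assumptions for illustration, and tokens outside the vocabulary are simply ignored here.

```python
from collections import Counter

def nbow(tokens, vocab):
    """Normalized bag-of-words vector: d_i = c_i / sum_j c_j over `vocab`."""
    counts = Counter(tokens)
    in_vocab = [counts[w] for w in vocab]
    total = sum(in_vocab)          # assumes at least one token is in vocab
    return [c / total for c in in_vocab]

vocab = ["chicago", "greets", "president", "press"]
d = nbow("president greets press press".split(), vocab)
print(d)  # [0.0, 0.25, 0.25, 0.5]
```

The entries are nonnegative and sum to 1, and words that do not occur in the document get weight 0, which is why the vector is sparse for a large vocabulary.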

nBOW representation

The vector $\mathbf{d}$ is a point on the $(n - 1)$-dimensional simplex. Two documents with different unique words lie in different regions of the simplex, yet they may still be semantically close.

Word travel cost

The paper incorporates the semantic similarity between individual word pairs into the document distance metric. Word dissimilarity is measured by the Euclidean distance in the word2vec embedding space: the distance between word $i$ and word $j$ is $c(i, j) = \| \mathbf{x}_{i} - \mathbf{x}_{j} \|_{2}$, the cost associated with "traveling" from one word to another.
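The full cost matrix can be computed in one step with NumPy broadcasting. This is a sketch that assumes the embedding matrix is small enough to hold all $n^{2}$ pairwise differences in memory; the function name `travel_costs` is made up for this example.

```python
import numpy as np

def travel_costs(X):
    """Pairwise word travel costs.

    X : (d, n) embedding matrix, one column per word.
    Returns the (n, n) matrix C with C[i, j] = ||x_i - x_j||_2.
    """
    diff = X[:, :, None] - X[:, None, :]   # shape (d, n, n): x_i - x_j
    return np.linalg.norm(diff, axis=0)

# Three toy words in a 2-dimensional embedding space.
X = np.array([[0.0, 3.0, 0.0],
              [0.0, 4.0, 1.0]])
C = travel_costs(X)
print(C[0, 1])  # 5.0, the distance between (0, 0) and (3, 4)
```

For a realistic vocabulary one would only compute the rows and columns of words that actually occur in the two documents being compared.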

Document distance

(1) Let $\mathbf{d}$ and $\mathbf{d}^{\prime}$ be the nBOW representations of two documents on the $(n - 1)$-dimensional simplex.

(2) Each word $i$ in $\mathbf{d}$ is allowed to be transformed into any word in $\mathbf{d}^{\prime}$, in total or in part.

(3) Let $\mathbf{T} \in \R^{n \times n}$ be a (sparse) flow matrix, where $\mathbf{T}_{ij} \geq 0$ denotes how much of word $i$ in $\mathbf{d}$ travels to word $j$ in $\mathbf{d}^{\prime}$.

(4) To transform $\mathbf{d}$ entirely into $\mathbf{d}^{\prime}$, the total outgoing flow from word $i$ in $\mathbf{d}$ must equal $d_{i}$, i.e. $\sum_{j} \mathbf{T}_{ij} = d_{i}$, and the total incoming flow to word $j$ in $\mathbf{d}^{\prime}$ must equal $d_{j}^{\prime}$, i.e. $\sum_{i} \mathbf{T}_{ij} = d_{j}^{\prime}$.

The distance between the two documents is then defined as the minimum (weighted) cumulative cost required to move all words from $\mathbf{d}$ to $\mathbf{d}^{\prime}$, where the cost of a flow $\mathbf{T}$ is:

$\sum_{i, j} \mathbf{T}_{ij} c(i, j)$

Transportation problem

Given the constraints above, the minimum cumulative cost of moving $\mathbf{d}$ to $\mathbf{d}^{\prime}$ is the solution of the following linear program:

$\begin{aligned} & \min_{\mathbf{T} \geq 0} \sum_{i, j = 1}^{n} \mathbf{T}_{ij} c(i, j) \\ \text{subject to:} & \\ & \sum_{j = 1}^{n} \mathbf{T}_{ij} = d_{i}, \ \forall i \in \{ 1, \cdots, n \} \\ & \sum_{i = 1}^{n} \mathbf{T}_{ij} = d_{j}^{\prime}, \ \forall j \in \{ 1, \cdots, n \} \\ \end{aligned} \tag {1}$

Note: $\mathbf{T}_{ij} \geq 0$ for all $i, j$.

The Word Mover's Distance is the optimal value of Eq. (1). Because $c(i, j)$ is a metric, it can be shown that WMD is a metric as well.
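The linear program can be handed to a general-purpose LP solver. The sketch below uses `scipy.optimize.linprog` with the HiGHS backend and flattens $\mathbf{T}$ row by row; it is meant for tiny toy inputs, not as the specialized transportation solver a real implementation would use. The function name `wmd` and the one-dimensional toy embedding are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(d1, d2, C):
    """Solve the WMD transportation LP for nBOW vectors d1, d2 and cost matrix C.

    Returns (optimal cost, optimal flow matrix T).
    """
    n = len(d1)
    # T is flattened row-major: variable i * n + j stands for T_ij.
    A_out = np.kron(np.eye(n), np.ones((1, n)))   # row i: sum_j T_ij = d1[i]
    A_in = np.kron(np.ones((1, n)), np.eye(n))    # row j: sum_i T_ij = d2[j]
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_out, A_in]),
                  b_eq=np.concatenate([d1, d2]),
                  bounds=(0, None),               # T_ij >= 0
                  method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun), res.x.reshape(n, n)

# Toy setup: three words embedded at 0, 1, 2 on a line (d = 1, n = 3).
X = np.array([[0.0, 1.0, 2.0]])
C = np.abs(X.T - X)                # c(i, j) = |x_i - x_j|
d1 = np.array([0.5, 0.5, 0.0])
d2 = np.array([0.0, 0.5, 0.5])
dist, T = wmd(d1, d2, C)
print(dist)  # approximately 1.0
```

In this toy case half of the mass has to cross each of the two unit gaps, so the optimal cost is 1. The returned `T` need not be unique: moving word 0 to word 2 directly, or shifting both words one step to the right, costs the same.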

Visualization


4.1 Fast Distance Computation

The best average time complexity of solving the WMD optimization problem is $\mathcal{O} (p^{3} \log p)$, where $p$ is the number of unique words in the documents. Note: $p$ is the number of nonzero entries in the nBOW vectors, not the vocabulary size $n$.

Lower bounds for the WMD transportation problem:

Word centroid distance

By the triangle inequality, the centroid distance $\| \mathbf{X} \mathbf{d} - \mathbf{X} \mathbf{d}^{\prime} \|_{2}$ between documents $\mathbf{d}$ and $\mathbf{d}^{\prime}$ is a lower bound on their WMD:

$\sum_{i, j = 1}^{n} \mathbf{T}_{ij} c(i, j) \geq \| \mathbf{X} \mathbf{d} - \mathbf{X} \mathbf{d}^{\prime} \|_{2}$

Derivation (the second equality uses $\mathbf{T}_{ij} \geq 0$, the inequality is the triangle inequality, and the last two steps use the flow constraints):

$\begin{aligned} \sum_{i, j = 1}^{n} \mathbf{T}_{ij} c(i, j) & = \sum_{i, j = 1}^{n} \mathbf{T}_{ij} \| \mathbf{x}_{i} - \mathbf{x}_{j} \|_{2} \\ & = \sum_{i, j = 1}^{n} \| \mathbf{T}_{ij} (\mathbf{x}_{i} - \mathbf{x}_{j}) \|_{2} \\ & \geq \left\| \sum_{i, j = 1}^{n} \mathbf{T}_{ij} (\mathbf{x}_{i} - \mathbf{x}_{j}) \right\|_{2} \\ & = \left\| \sum_{i = 1}^{n} \left( \sum_{j = 1}^{n} \mathbf{T}_{ij} \right) \mathbf{x}_{i} - \sum_{j = 1}^{n} \left( \sum_{i = 1}^{n} \mathbf{T}_{ij} \right) \mathbf{x}_{j} \right\|_{2} \\ & = \left\| \sum_{i = 1}^{n} d_{i} \mathbf{x}_{i} - \sum_{j = 1}^{n} d_{j}^{\prime} \mathbf{x}_{j} \right\|_{2} \\ & = \| \mathbf{X} \mathbf{d} - \mathbf{X} \mathbf{d}^{\prime} \|_{2} \\ \end{aligned}$

Because each document is represented by its weighted average word vector, the paper calls this the Word Centroid Distance (WCD). It is very fast to compute via a few matrix operations and scales as $\mathcal{O} (dp)$.

For nearest-neighbor search, WCD can narrow down the promising candidates and thereby speed up the WMD search.

WCD is cheap to compute but not very tight.
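A short sketch of WCD, with a toy case that shows how loose the bound can be. The function name `wcd` and the one-dimensional embedding are assumptions for illustration.

```python
import numpy as np

def wcd(d1, d2, X):
    """Word Centroid Distance: ||X d1 - X d2||_2 for nBOW vectors d1, d2."""
    return float(np.linalg.norm(X @ d1 - X @ d2))

# Three words embedded at 0, 1, 2 on a line (d = 1, n = 3).
X = np.array([[0.0, 1.0, 2.0]])

# Document 1 uses the two outer words, document 2 only the middle word.
d1 = np.array([0.5, 0.0, 0.5])
d2 = np.array([0.0, 1.0, 0.0])
print(wcd(d1, d2, X))  # 0.0
```

Both documents have their centroid at 1, so WCD is 0. The WMD of this pair is 1, because each half of the mass in the first document has to travel a distance of 1 to reach the only word of the second document. The bound holds but carries no information here.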

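Relaxed word moving distance

Much tighter bounds can be obtained by relaxing the WMD optimization problem, removing one of the two constraints at a time.

If the second constraint is removed, the optimization problem becomes:

$\begin{aligned} & \min_{\mathbf{T} \geq 0} \sum_{i, j = 1}^{n} \mathbf{T}_{ij} c(i, j) \\ \text{subject to:} & \\ & \sum_{j = 1}^{n} \mathbf{T}_{ij} = d_{i}, \ \forall i \in \{ 1, \cdots, n \} \\ \end{aligned}$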

The optimal solution of the WMD problem must satisfy both constraints. Removing one of them enlarges the feasible region, so every WMD solution remains feasible for the relaxed problem. The relaxed optimum is therefore a lower bound on the WMD.

The optimal flow matrix $\mathbf{T}^{\ast}$ of the relaxed problem is:

$\mathbf{T}_{ij}^{\ast} = \begin{cases} d_{i}, & \text{if } j = \argmin_{j} c(i, j) \\ 0, & \text{otherwise} \end{cases} \tag {2}$

Let $\mathbf{T}$ be any feasible solution of the relaxed problem. For each word $i$, let its closest word be $j^{\ast} = \argmin_{j} c(i, j)$; then

$\sum_{j} \mathbf{T}_{ij} c(i, j) \geq \sum_{j} \mathbf{T}_{ij} c(i, j^{\ast}) = c(i, j^{\ast}) \sum_{j} \mathbf{T}_{ij} = c(i, j^{\ast}) d_{i} = \sum_{j} \mathbf{T}_{ij}^{\ast} c(i, j)$

Hence $\mathbf{T}^{\ast}$ attains the minimum objective value. Computing this solution only requires identifying $j^{\ast} = \argmin_{j} c(i, j)$, which is a nearest-neighbor search in the Euclidean word2vec space: for each word vector $\mathbf{x}_{i}$ in document $D$, find the most similar word vector $\mathbf{x}_{j}$ in document $D^{\prime}$.

If the first constraint is removed instead, the nearest-neighbor search is reversed: for each word vector $\mathbf{x}_{j}$ in document $D^{\prime}$, find the most similar word vector $\mathbf{x}_{i}$ in document $D$.

Let the optimal values of the two relaxed problems be $l_{1} (\mathbf{d}, \mathbf{d}^{\prime})$ and $l_{2} (\mathbf{d}, \mathbf{d}^{\prime})$. Taking the maximum of the two gives an even tighter lower bound, called the Relaxed WMD (RWMD):

$l_{r} (\mathbf{d}, \mathbf{d}^{\prime}) = \max \left( l_{1} (\mathbf{d}, \mathbf{d}^{\prime}), l_{2} (\mathbf{d}, \mathbf{d}^{\prime}) \right)$
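The two relaxed bounds reduce to nearest-neighbor lookups, so RWMD needs no LP solver. The sketch below assumes, as in the nearest-neighbor description above, that the search for each word is restricted to words that actually occur in the other document (otherwise every word would match itself at cost 0). The function name `rwmd` and the toy embedding are assumptions for illustration.

```python
import numpy as np

def rwmd(d1, d2, C):
    """Relaxed WMD: max of the two one-sided relaxations."""
    s1 = np.flatnonzero(d1)            # words occurring in document 1
    s2 = np.flatnonzero(d2)            # words occurring in document 2
    sub = C[np.ix_(s1, s2)]            # costs between occurring words only
    # l1: inflow constraint dropped, each word of d1 moves to its nearest word in d2.
    l1 = float(d1[s1] @ sub.min(axis=1))
    # l2: outflow constraint dropped, each word of d2 is served by its nearest word in d1.
    l2 = float(d2[s2] @ sub.min(axis=0))
    return max(l1, l2)

# Three words embedded at 0, 1, 2 on a line.
X = np.array([[0.0, 1.0, 2.0]])
C = np.abs(X.T - X)

print(rwmd(np.array([0.5, 0.0, 0.5]), np.array([0.0, 1.0, 0.0]), C))  # 1.0
print(rwmd(np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5]), C))  # 0.5
```

In the first pair RWMD equals the exact WMD of 1, whereas the centroid distance of that pair is 0. In the second pair RWMD is 0.5 while the exact WMD is 1, so the bound is valid but not always attained.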

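Prefetch and prune

To find the $k$ nearest neighbors of a query document:

(1) Sort all documents in increasing order of their WCD to the query document, and compute the exact WMD to the first $k$ documents;

(2) Traverse the remaining documents. For each one, first check whether its RWMD lower bound exceeds the WMD of the current $k$-th closest document; if so, prune it. Otherwise compute its exact WMD and update the $k$ nearest neighbors if necessary.

Because the RWMD approximation is very tight, $95\%$ of the documents can be pruned on some datasets.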
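The two steps above can be sketched as follows. This is an illustrative toy version, not the authors' implementation: the exact WMD is solved with `scipy.optimize.linprog` on the words that occur in each pair of documents, and all function names (`support_costs`, `wmd`, `rwmd`, `knn_prefetch_prune`) are assumptions made for this example.

```python
import heapq
import numpy as np
from scipy.optimize import linprog

def support_costs(d1, d2, C):
    """Weights and cost sub-matrix restricted to words occurring in each document."""
    s1, s2 = np.flatnonzero(d1), np.flatnonzero(d2)
    return d1[s1], d2[s2], C[np.ix_(s1, s2)]

def wmd(d1, d2, C):
    """Exact WMD via the transportation LP (T flattened row-major)."""
    a, b, sub = support_costs(d1, d2, C)
    m, k = sub.shape
    A_eq = np.vstack([np.kron(np.eye(m), np.ones((1, k))),    # outgoing flow
                      np.kron(np.ones((1, m)), np.eye(k))])   # incoming flow
    res = linprog(sub.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return float(res.fun)

def rwmd(d1, d2, C):
    """Relaxed WMD lower bound (max of the two one-sided relaxations)."""
    a, b, sub = support_costs(d1, d2, C)
    return float(max(a @ sub.min(axis=1), b @ sub.min(axis=0)))

def knn_prefetch_prune(query, docs, X, k):
    """Return ([(wmd, doc index), ...] for the k nearest docs, number pruned)."""
    C = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    # Step 1: sort by WCD and compute exact WMD for the first k documents.
    wcd = [np.linalg.norm(X @ query - X @ d) for d in docs]
    order = [int(i) for i in np.argsort(wcd)]
    heap = [(-wmd(query, docs[i], C), i) for i in order[:k]]  # max-heap on distance
    heapq.heapify(heap)
    pruned = 0
    # Step 2: traverse the rest, pruning whenever RWMD exceeds the k-th best WMD.
    for i in order[k:]:
        kth = -heap[0][0]
        if rwmd(query, docs[i], C) > kth:
            pruned += 1
            continue
        dist = wmd(query, docs[i], C)
        if dist < kth:
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-neg, i) for neg, i in heap), pruned

# Toy data: five words on a line, each document consists of a single word.
X = np.arange(5.0).reshape(1, 5)
query = np.eye(5)[0]
docs = [np.eye(5)[j] for j in (3, 1, 4, 2)]
neighbors, pruned = knn_prefetch_prune(query, docs, X, k=1)
print(neighbors, pruned)
```

Pruning is safe because RWMD never exceeds the true WMD: a document whose lower bound is already worse than the current $k$-th neighbor cannot enter the result. In the toy run the nearest document is found after one exact WMD computation and the other three are pruned.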

5 Experiments

5.1 Datasets

Stop word list: SMART.

Seven document-representation baselines are compared: bag-of-words (BOW), TF-IDF (term frequency-inverse document frequency), Okapi BM25, LSI (Latent Semantic Indexing), LDA (Latent Dirichlet Allocation), mSDA (Marginalized Stacked Denoising Autoencoder), and CCG (Componential Counting Grid).

$k$-nearest neighbors with Euclidean distance; hyperparameters are set by Bayesian optimization.