Some Thoughts on Word Embeddings

How can they be explained mathematically?

There are many explanations. A relatively simple one comes from the paper "Neural Word Embedding as Implicit Matrix Factorization".

After rearranging the loss function of skip-gram with negative sampling, the paper finds that, at the optimum,
$\vec{w} \cdot \vec{c} = \log\left(\frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} \cdot \frac{1}{k}\right) = \log\left(\frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)}\right) - \log k$
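To unpack this: $(w, c)$ denotes a word-context pair, $\vec{w}$ and $\vec{c}$ are the corresponding word and context vectors, $\#(\cdot)$ denotes a count, $|D|$ is the total number of $(w, c)$ pairs, and $k$ is the number of negative samples. The first term on the right-hand side is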
$\log\left(\frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)}\right) = \log\left(\frac{\frac{\#(w,c)}{|D|}}{\frac{\#(w)}{|D|} \cdot \frac{\#(c)}{|D|}}\right) = \log \frac{P(w,c)}{P(w)P(c)} = PMI(w; c)$
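PMI (pointwise mutual information) is one way to measure association: a value > 0 means positive association, < 0 negative association.
So what skip-gram with negative sampling does can be understood as factorizing the shifted PMI matrix (note the $\log k$ shift), that is, $\text{shifted PMI} = WC^T$.
The paper also tries SVD as the factorization, applied to the shifted positive PMI matrix $SPPMI(w, c) = \max(PMI(w; c) - \log k, 0)$, and gets better results on some tasks.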
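
The pipeline described above (counts to PMI, shift by $\log k$, clip at zero, then SVD) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the count matrix, the value `k = 2`, the dimension `d = 2`, and the symmetric split of the singular values between `W` and `C` are all assumptions made for the example.

```python
import numpy as np

# Toy co-occurrence counts #(w, c): rows are words, columns are contexts.
# The numbers are made up for illustration.
counts = np.array([
    [8.0, 1.0, 0.0],
    [1.0, 6.0, 1.0],
    [0.0, 2.0, 9.0],
])
k = 2  # number of negative samples (assumed value)

total = counts.sum()                               # |D|
word_counts = counts.sum(axis=1, keepdims=True)    # #(w), shape (3, 1)
ctx_counts = counts.sum(axis=0, keepdims=True)     # #(c), shape (1, 3)

# PMI(w; c) = log(#(w,c) * |D| / (#(w) * #(c))); unseen pairs give -inf.
with np.errstate(divide="ignore"):
    pmi = np.log(counts * total / (word_counts * ctx_counts))

# Shift by log k and clip at zero, which also turns the -inf entries into 0.
sppmi = np.maximum(pmi - np.log(k), 0.0)

# Truncated SVD: keep the top d singular values and split them symmetrically.
d = 2
U, S, Vt = np.linalg.svd(sppmi)
W = U[:, :d] * np.sqrt(S[:d])       # word vectors, one row per word
C = Vt[:d, :].T * np.sqrt(S[:d])    # context vectors, one row per context
approx = W @ C.T                    # rank-d reconstruction of the SPPMI matrix
```

With only `d` singular values kept, `W @ C.T` is the best rank-`d` approximation of the SPPMI matrix in the Frobenius norm, which is the sense in which the rows of `W` act as word vectors here.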

When representing words, there is some intuition behind ignoring negative values: humans can easily think of positive associations (e.g. “Canada” and “snow”) but find it much harder to invent negative ones (“Canada” and “desert”). This suggests that the perceived similarity of two words is more influenced by the positive context they share than by the negative context they share. It therefore makes some intuitive sense to discard the negatively associated contexts and mark them as “uninformative” (0) instead. Indeed, it was shown that the PPMI metric performs very well on semantic similarity tasks.
- Seen this way, GloVe is also a factorization of a matrix (the word co-occurrence matrix), just with a different loss function
  $\sum_{i,j=1}^V f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log{X_{ij}})^2$
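
To make the GloVe objective concrete, here is a small NumPy sketch that evaluates it for given parameters. The weighting function $f$ with `x_max = 100` and `alpha = 0.75` uses the defaults suggested in the GloVe paper. The function names, the toy matrix `X`, and the choice to skip zero entries of `X` (where $f(0) = 0$ and $\log X_{ij}$ is undefined) are assumptions made for this illustration.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 from x_max upward."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error between w_i . w~_j + b_i + b~_j and log X_ij."""
    i, j = np.nonzero(X)                    # only observed co-occurrences contribute
    x = X[i, j]
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)


# Toy co-occurrence matrix (made up): two observed pairs, two zeros.
X = np.array([[10.0, 0.0],
              [0.0, 200.0]])
W = np.zeros((2, 3))
W_tilde = np.zeros((2, 3))
b = np.log(np.array([10.0, 200.0]))   # word biases chosen to match log X on the diagonal
b_tilde = np.zeros(2)

# With these parameters every observed entry is fitted by the biases alone.
print(glove_loss(X, W, W_tilde, b, b_tilde))
```

The only structural difference from the shifted-PMI view is what is being fitted ($\log X_{ij}$ with learned biases instead of PMI) and how the errors are weighted.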

Are they additive?

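Suppose king - man + woman = queen. Then king + woman = queen + man = ?
happy (开心) + not (不) = sad (伤心/难过)?
- In my earlier experiments, the top two results by cosine similarity were "happy" and "not" themselves. Across different parameter settings, "sad" did show up in the top 20, but its rank was unstable, and many unrelated words appeared as well
- From another angle: sad - happy = healthy - sick = not (?)
- Next: sad + sad = grief-stricken (悲痛欲绝) (?)
- And then: isn't = is + not (?)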
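
The kind of query discussed above (add and subtract vectors, then rank the vocabulary by cosine similarity) can be sketched as follows. The 3-dimensional vectors are hand-made for illustration and prove nothing about real embeddings; `cosine`, `combine`, and `nearest` are hypothetical helper names. The `exclude` argument is there because, as noted above, the query words themselves tend to rank at the top, so analogy evaluations commonly drop them from the candidates.

```python
import math

# Hand-made toy vectors (assumption): real embeddings are learned and much larger.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combine(positive, negative=()):
    """Sum the vectors of `positive` words and subtract those of `negative` words."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for word in positive:
        out = [o + x for o, x in zip(out, vectors[word])]
    for word in negative:
        out = [o - x for o, x in zip(out, vectors[word])]
    return out


def nearest(query, exclude=(), topn=3):
    """Rank vocabulary words by cosine similarity to `query`, skipping `exclude`."""
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# king - man + woman, with the three query words removed from the candidates
query = combine(["king", "woman"], ["man"])
print(nearest(query, exclude={"king", "man", "woman"}))
```

Calling `nearest(combine(["king", "woman"]))` without `exclude` is the plain-addition variant of the same query, which makes it easy to check how often the inputs dominate the ranking on real vectors.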
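So why is this?
- Continuing from the matrix-factorization view above, no conclusion about additivity (or subtractivity) follows. What comes next is personal speculation
  - Co-occurrence-based methods (word2vec/GloVe) mainly capture structural features of words (context, which words go together) plus shallow semantic features, because the window is too small
  - Think back to how we learned the word 悲痛欲绝 ("grief-stricken")
    - A teacher explained its meaning to us, or it described the thoughts and feelings of a character in some passage of a text
      - Both are abstractions over longer text: not only is the window larger, the objective is different too (summarization/similarity rather than adjacency), and it is certainly more than simple addition
    - 悲痛 (grief) + 欲 (about to) + 绝 (perish) (sometimes we do read it this way)
- The uninterpretability of embeddings also comes from their being based on a linear factorization of a matrix
  - Think of running PCA on the co-occurrence matrix: the individual components already sound hard to interpret
- So an ideal embedding would probably have a structure like this (pure fantasy on my part)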
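  - Each position of the vector stands for an interpretable attribute, like hand-designed features (king=[part-of-speech id: 2, living being: 1, gender: 0, kind id: 1, power: 100, country id: 10]). Current embeddings look more like the output of some nonlinear transformation, with everything mixed together
  - Nonlinear operations: deep learning on the outside, something like a decision tree being executed on the inside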
