MNLM: A Neural Probabilistic Language Model, the Seminal Word Embedding Paper



Today let's dig into the seminal word embedding paper, A Neural Probabilistic Language Model, abbreviated MNLM. I don't know why it's called that; personally I feel it should be NPLM, but it's not my call.

Download: ResearchGate: A Neural Probabilistic Language Model

Plus two explainer videos: MNLM: A neural probabilistic language model_哔哩哔哩_bilibili

Without further ado, let's get started. This post pairs my English reading notes with my own annotations.


Objective: learn the joint probability function of sequences of words in a language.

That is, the paper sets out to learn the joint probability distribution function of word sequences in a language.

Roadblocks at the time:

  • Curse of dimensionality: with one-hot vectors, as the vocabulary grows, each word's representation becomes extremely long and almost entirely zeros, wasting space.
  • What if, in real use, we run into word sequences that never appeared in training?
  • Computation is expensive.

From these pain points we can guess what the authors want to do:

  • Find a relatively dense representation for words.
  • Improve generalization.
  • A dense representation cuts the dimensionality, and with it the amount of computation.

Existing best method at the time:

  • N-gram models, which generalize by concatenating short overlapping word sequences seen in training (see the sketch below).
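For context, here is a minimal sketch of that kind of count-based model: a trigram with add-one smoothing. The toy corpus and the smoothing choice are my own illustrative assumptions, not something from the paper.

```python
from collections import Counter

corpus = ("the cat is walking in the bedroom "
          "a dog was running in a room").split()

# Count every trigram and its two-word prefix
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_next(w1, w2, w3):
    """P(w3 | w1, w2) with add-one (Laplace) smoothing."""
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))

print(p_next("cat", "is", "walking"))  # seen in training: relatively likely
print(p_next("dog", "is", "walking"))  # never seen: only the smoothing floor
```

This is exactly the weakness the paper attacks: a trigram not seen verbatim in training gets only the smoothing floor, no matter how similar it is to sentences that were seen.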

Proposed work in the paper:

  • Fight the curse of dimensionality by learning a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

  • The model learns simultaneously:

    • a distributed representation for each word, and

    • the probability function for word sequences, expressed in terms of these representations.

Similarity between words: For example, having seen the sentence “The cat is walking in the bedroom” in the training corpus should help us generalize to make the sentence “A dog was running in a room” almost as likely, simply because “dog” and “cat” (resp. “the” and “a”, “room” and “bedroom”, etc.) have similar semantic and grammatical roles.

The model must be able to capture similarity between words: words with similar roles should receive similar representations. Seeing “The cat is walking in the bedroom” in the corpus should then help us arrive at “A dog was running in a room”, with the two sentences' representations close to each other, because “dog” and “cat” (and likewise “the”/“a”, “room”/“bedroom”, etc.) play similar semantic and grammatical roles.

Fighting the curse of dimensionality, in three steps (a sketch of step 1 follows this list):

  • associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$);

  • express the joint probability function of word sequences in terms of the feature vectors of the words in the sequence;

  • learn simultaneously the word feature vectors and the parameters of that probability function.
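A minimal sketch of step 1, the shared feature matrix. The sizes are my own illustrative choices (matching the 30,000-word example further down): looking a word up in the matrix is mathematically the same as multiplying its one-hot vector by the matrix, only far cheaper.

```python
import numpy as np

V, m = 30000, 60                    # vocabulary size vs. feature dimension
rng = np.random.default_rng(0)
C = rng.normal(0.0, 0.1, (V, m))    # feature matrix: one m-dim vector per word

word_id = 4242                      # hypothetical index of some word
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by C just selects a row, so the model
# stores and learns 60 numbers per word instead of 30,000.
assert np.allclose(one_hot @ C, C[word_id])
print(C[word_id].shape)             # (60,)
```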

Important points from above:

Feature vector: represents different aspects of the word; each word is associated with a point in a vector space. The number of features (e.g. m = 30, 60, or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 100,000).

The presence of only one of the sentences in the training data will increase the probability not only of that sentence, but also of its combinatorial number of “neighbors” in sentence space.

If we knew that dog and cat played similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from

The cat is walking in the bedroom

  • A dog was running in a room
  • The cat is running in a room
  • A dog is walking in a bedroom
  • ......

To summarize the key points above:

  • Feature vector: each feature captures a different aspect of the word, and the feature dimension is far smaller than the vocabulary size (m = 30, 60, or 100 in the experiments). With one-hot encoding, a 30,000-word vocabulary means a 30,000-dimensional vector per word; a feature vector gets by with a few dozen dimensions.

  • If we know that dog and cat play similar roles (semantically and syntactically), and likewise (the, a), (bedroom, room), (is, was), (running, walking), we can naturally generalize from “The cat is walking in the bedroom” to the neighboring sentences listed above.

[Figure: the model architecture from the original paper]

Finally, we have to include the figure from the original paper; let's walk through what it shows.

Each word in the input sequence is mapped through a learned lookup table C to its feature vector. The concatenated feature vectors go into a hidden layer with a tanh activation, and they also skip over the hidden layer straight to the output layer. I don't know what this was called back then, but you can think of it as something like a residual connection. The hidden layer's output flows into the output layer as well; the two contributions are combined to produce the final output.

Now look at the formula (a runnable sketch follows the symbol definitions):

$$y = b + Wx + U\tanh(d + Hx)$$
  • y is the output: one score per vocabulary word (a softmax over y gives the next-word probabilities)
  • x is the input: the concatenation of the context words' feature vectors
  • d is the hidden-layer bias
  • H is the input-to-hidden weight matrix
  • U is the hidden-to-output weight matrix
  • W is the weight matrix of the direct connection from x to the output layer
  • b is the output-layer bias
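
To make the formula concrete, here is a minimal NumPy sketch of the forward pass. All sizes, the random initialization, and the example word ids are my own illustrative assumptions; the softmax at the end is how the paper turns y into next-word probabilities.

```python
import numpy as np

V, m, ctx, h = 10000, 60, 5, 50         # vocab size, feature dim, context words, hidden units
rng = np.random.default_rng(0)

C = rng.normal(0.0, 0.1, (V, m))        # word feature matrix: one row per word
H = rng.normal(0.0, 0.1, (h, ctx * m))  # input-to-hidden weights
d = np.zeros(h)                         # hidden-layer bias
U = rng.normal(0.0, 0.1, (V, h))        # hidden-to-output weights
W = rng.normal(0.0, 0.1, (V, ctx * m))  # direct connection from x to the output
b = np.zeros(V)                         # output-layer bias

def forward(context_ids):
    """Scores every vocabulary word as the next word after the given context."""
    x = C[context_ids].reshape(-1)            # look up and concatenate feature vectors
    y = b + W @ x + U @ np.tanh(d + H @ x)    # y = b + Wx + U tanh(d + Hx)
    e = np.exp(y - y.max())                   # softmax: scores -> probabilities
    return e / e.sum()

probs = forward(np.array([12, 7, 403, 9, 55]))  # five hypothetical word ids
print(probs.shape, probs.sum())                 # (10000,) 1.0
```

The W @ x term is the skip connection from the figure: the feature vectors reach the output layer both through the tanh hidden layer and directly.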


Reposted from juejin.im/post/7111736463728836644