Simple and Crude Understanding and Implementation of Machine Learning Neural Networks NN (IV): word vectors - word2vec, Word2Vec model introduction, statistical language models, the neural network language model NNLM, a Word2Vec case study, word vector tools

7.4 Word vectors - word2vec

Learning objectives

  • Objectives
    • Know the statistical language model
    • Grasp the principles of the neural network language model (NNLM)
    • Master word2vec's implementation features and optimizations
  • Applications
    • None

7.3.1 Introduction to the Word2Vec model

7.3.1.1 Why learn word embeddings

Image and audio processing systems work with rich, high-dimensional datasets: image data, for example, is encoded as vectors of raw pixel intensities. Natural language processing systems, however, have long treated words as discrete atomic symbols. Representing a word as a unique discrete ID leads to data sparsity, and usually means we need more data to train statistical models successfully. Using vector representations can remove some of these obstacles.


  • Computing similarity
    • Finding similar words
    • Information retrieval
  • As input to models such as SVMs / LSTMs
    • Chinese word segmentation
    • Named entity recognition
  • Sentence representation
    • Sentiment analysis
  • Document representation
    • Document topic classification
  • Machine translation and chatbots

7.3.1.2 What is a word vector

Definition: a word vector represents a word as a vector of numbers

  • One-hot word representation

    One-hot Representation

    • Uses sparse storage; simple and easy to implement
    • Light bulb: [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0], Tube: [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]

Drawbacks: the dimensionality grows with the vocabulary, and the "word gap" phenomenon appears: any two words are isolated from each other. From these two vectors alone there is no way to tell whether the words are related, even though "light bulb" and "tube" are near-synonyms.

  • Distributed word representation

    Distributed Representation

    • The traditional one-hot representation only encodes the symbol of a word and carries no semantic information
    • Distributed representation was first proposed by Hinton in 1986. It is a low-dimensional real-valued vector that typically looks like: [0.792, -0.177, -0.107, 0.109, -0.542, ...]
    • Its biggest contribution is that related or similar words end up closer to each other in the vector space (see the sketch below)
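To make the contrast concrete, here is a minimal sketch, assuming numpy and a made-up three-word vocabulary with invented dense values: one-hot vectors are mutually orthogonal and carry no similarity information, while dense vectors can place related words close together.

```python
import numpy as np

# A tiny made-up vocabulary; indices play the role of discrete word IDs.
vocab = {"light_bulb": 0, "tube": 1, "banana": 2}
V = len(vocab)

def one_hot(word):
    """One-hot representation: a sparse V-dimensional indicator vector."""
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two different one-hot vectors are orthogonal: the similarity is always 0,
# so "light_bulb" looks no closer to "tube" than to "banana".
print(cosine(one_hot("light_bulb"), one_hot("tube")))    # 0.0
print(cosine(one_hot("light_bulb"), one_hot("banana")))  # 0.0

# Distributed representation: low-dimensional dense vectors (values invented here).
dense = {
    "light_bulb": np.array([0.79, -0.18, -0.11, 0.11]),
    "tube":       np.array([0.75, -0.20, -0.05, 0.14]),
    "banana":     np.array([-0.40, 0.62, 0.33, -0.51]),
}
# Related words end up with higher similarity than unrelated ones.
print(cosine(dense["light_bulb"], dense["tube"]))    # close to 1
print(cosine(dense["light_bulb"], dense["banana"]))  # much lower (negative here)
```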

7.3.1.3 Where the idea of training word vectors comes from: the statistical language model

Statistical language models

  • Statistical language model: a statistical language model treats language (a sequence of words) as a random event and assigns it a probability that describes how likely the sequence is to belong to the language in question

Note: a language model is a model that computes the probability of a sentence, i.e. it judges how likely a sequence of words is to be natural human language.

For example: for a sentence made up of the words w1, w2, w3, w4, w5, ..., the model should assign P(w1, w2, w3, w4, w5, ...) a high probability (which can be estimated from the training corpus).

  • N-Gram:

    The N-gram model assumes that the probability of the current word depends only on the preceding N-1 words

    • Language is a sequence; words are not independent of one another
    • Unigram model: assumes each word occurs independently of all preceding words
      • P(s) = P(w1)P(w2)P(w3)…P(wn)
    • Bigram model: assumes the probability of a word depends only on the single preceding word
      • P(s) = P(w1)P(w2|w1)P(w3|w2)…P(w_i|w_{i-1})
    • Trigram model: assumes the probability of a word depends only on the two preceding words
      • P(s) = P(w1)P(w2|w1)P(w3|w1,w2)…P(w_i|w_{i-2},w_{i-1})

Note: the trigram model is the most widely used in practice; because training corpora are limited, larger N is rarely pursued, and increasing N makes the amount of computation much larger. A tiny numeric sketch of the bigram decomposition follows.
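A tiny sketch of the bigram decomposition above, with an invented probability table (the words and numbers are placeholders, not estimated from data):

```python
# Bigram decomposition: P(s) = P(w1) * P(w2|w1) * ... * P(wn|w(n-1)).
# The probabilities below are made-up toy values, not estimated from a real corpus.
p_unigram = {"i": 0.2}
p_bigram = {("i", "like"): 0.3, ("like", "nlp"): 0.1}

def bigram_sentence_prob(words):
    """Probability of a sentence under a (toy) bigram model."""
    prob = p_unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram[(prev, cur)]
    return prob

print(bigram_sentence_prob(["i", "like", "nlp"]))  # 0.2 * 0.3 * 0.1 = 0.006
```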

  • Solving:
    • According to the conditional probability formula and the law of large numbers, when the corpus is large enough, we have:

P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Statistical language model example

This example comes from a somewhat larger corpus, and the goal is to compute the parameters of the corresponding bigram model, i.e. P(w_i | w_{i-1}). We first count the co-occurrences c(w_{i-1}, w_i), then count c(w_{i-1}), and then divide to obtain these conditional probabilities.

Co-occurrence counts:

(Table: bigram co-occurrence counts c(w_{i-1}, w_i))

Count of each word:

(Table: unigram counts c(w_{i-1}))
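A minimal sketch of this counting procedure, using an invented miniature corpus (a real model would be estimated from a much larger corpus and would also need smoothing and sentence-boundary markers):

```python
from collections import Counter

# Toy corpus (invented); each sentence is already tokenized.
corpus = [
    ["i", "want", "chinese", "food"],
    ["i", "want", "to", "eat"],
    ["i", "like", "chinese", "food"],
]

bigram_counts = Counter()   # c(w_{i-1}, w_i): counts of adjacent word pairs
unigram_counts = Counter()  # c(w_{i-1}): counts of each word

for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def cond_prob(prev, cur):
    """P(cur | prev) = c(prev, cur) / c(prev), as in the formula above."""
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(cond_prob("i", "want"))        # 2/3
print(cond_prob("chinese", "food"))  # 2/2 = 1.0
```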

Drawbacks of statistical language models

N-gram language models also suffer from the OOV problem (Out Of Vocabulary): the sequence contains words outside the vocabulary (also called unknown words), i.e. words appear in the test or validation set that never occurred in the training set. The usual workaround (a small sketch follows the list):

  • Set a word-frequency threshold; only words whose frequency is above the threshold are added to the vocabulary.
  • Replace all words below the threshold with UNK (a special symbol).
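A minimal sketch of this workaround; the token stream and the threshold value are invented for illustration:

```python
from collections import Counter

# Invented token stream; in practice this would be the whole training corpus.
tokens = ["the", "cat", "sat", "on", "the", "mat", "quokka"]

min_count = 2  # frequency threshold (chosen arbitrarily for this sketch)
counts = Counter(tokens)

# Only words whose frequency reaches the threshold enter the vocabulary.
vocab = {w for w, c in counts.items() if c >= min_count}

# Every out-of-vocabulary word is replaced by the special symbol UNK.
normalized = [w if w in vocab else "UNK" for w in tokens]
print(normalized)  # ['the', 'UNK', 'UNK', 'UNK', 'the', 'UNK', 'UNK']
```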

The ideas behind statistical language models can be used for many other things.

7.3.1.4 The neural network language model (NNLM)

The neural network language model (NNLM) is still a probabilistic language model; it uses a neural network to compute each of the parameters of that probabilistic language model.

  • The model was proposed in 2003 in the paper "A Neural Probabilistic Language Model" by Bengio et al.

(Figure: NNLM architecture from Bengio et al., 2003)

  • Model explanation:

    • Input layer: each word in context(w) is mapped to a word vector of length m (the length is chosen by whoever trains the model); the word vectors start out random and are also trained along with the network

      • A lookup table of size m × N (one randomly initialized m-dimensional vector for each of the N words) is built
      • context(w): the context window, i.e. how many preceding words are used, analogous to choosing N in an N-gram
    • Projection layer: all the context word vectors are concatenated into one long vector, which serves as the feature vector of the target word w; its length is m(n-1)

    • Hidden layer: the concatenated vector then passes through a hidden layer of size h; the paper uses tanh

    • Output layer: the final output goes through a softmax, producing a probability distribution over all N words of the vocabulary
  • Training process:

    • Cross-entropy is used as the loss function, and the model is trained with the backpropagation algorithm

    • When training finishes, we obtain the N-gram neural language model and, as a by-product, the word vectors

      • The randomly initialized lookup table is trained and updated at the same time as the other network parameters (a minimal numeric sketch of the forward pass follows this list)
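A minimal numeric sketch of the layers described above, in plain numpy with made-up sizes (vocabulary N, word-vector length m, hidden size h) and random weights; no training loop is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

N, m, h = 8, 3, 16      # vocabulary size, word-vector length, hidden-layer size
n_context = 3           # number of preceding words, i.e. n-1 for an n-gram

# Input layer: randomly initialized lookup table, one m-dimensional vector per word.
C = rng.normal(scale=0.1, size=(N, m))
# Projection-to-hidden weights and hidden-to-output weights.
H = rng.normal(scale=0.1, size=(n_context * m, h))
U = rng.normal(scale=0.1, size=(h, N))

def nnlm_forward(context_ids):
    """Forward pass: lookup -> concatenate -> tanh hidden layer -> softmax over the vocabulary."""
    x = C[context_ids].reshape(-1)       # projection layer: a long vector of length m * (n-1)
    z = np.tanh(x @ H)                   # hidden layer of size h (tanh, as in the paper)
    scores = z @ U                       # one score per word in the vocabulary
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()               # probability distribution over the N words

probs = nnlm_forward([0, 1, 2])          # P(next word | three preceding words)
print(probs.shape, probs.sum())          # (8,) 1.0
# Training would minimize the cross-entropy -log(probs[target_id]) by backpropagation,
# updating C, H and U together.
```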
Example of a neural network language model
  • Corpus:
    • "training neural-network language model requires back propagation algorithm" (translated from "训练 神经网络 语言 模型 需要 反向 传播 算法", already split into 8 tokens)
    • Assuming this single sentence is the entire corpus, and setting the context window to c = 3 with N = 8 distinct words in total, it is split into the following combinations:
    • "training neural-network language model"
    • "neural-network language model requires"
    • "language model requires back"
    • "model requires back propagation"
    • "requires back propagation algorithm"

In the example below we set the word-vector dimension to 3, which gives a 3 × 8 lookup table (a 3-dimensional vector for each of the 8 words):

(Figure: the randomly initialized 3 × 8 lookup table)
The process is described as:

(Figure: the NNLM forward pass on this toy example)
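A minimal sketch of this toy setup, using the translated tokens from the example above and arbitrary random values for the lookup table:

```python
import numpy as np

# The single-sentence toy corpus, tokenized (translated from the original example).
corpus = ["training", "neural-network", "language", "model",
          "requires", "back", "propagation", "algorithm"]

window = 4  # 3 context words + 1 target word (c = 3)
samples = [corpus[i:i + window] for i in range(len(corpus) - window + 1)]
for s in samples:
    print(s[:-1], "->", s[-1])   # three context words predict the fourth

# Vocabulary of N = 8 distinct words and a randomly initialized 3 x 8 lookup table.
word2id = {w: i for i, w in enumerate(dict.fromkeys(corpus))}
lookup_table = np.random.default_rng(0).normal(size=(3, len(word2id)))
print(lookup_table.shape)        # (3, 8): one 3-dimensional column vector per word
```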

7.3.1.5 Word2Vec

Word2Vec is essentially also a neural language model, but its goal is not the language model itself; its goal is the word vectors. The series of optimizations it makes are therefore all aimed at obtaining the word vectors faster and better.

(Figures: the word2vec model architectures, CBOW and Skip-gram)
Here we describe one of the models, CBOW, which is also the model used by the Python gensim library and Google's TensorFlow word2vec.

Example: CBOW forward computation and the derivation of the vector (parameter) updates

CBOW differs somewhat from Bengio's 2003 architecture: CBOW removes the most time-consuming part, the nonlinear hidden layer, and all words share the projection layer. Note that the derivation below does not include the hierarchical softmax and negative sampling optimizations described later!

(Figure: the CBOW model structure)

  • Forward computation:

Input layer and hidden layer: the average of the input context words (a [1, V] vector) is multiplied by the weight matrix W, [1, V] x [V, N] = [1, N], giving the intermediate vector h.

Hidden layer to output layer: each word j gets a score u_j = v'_j · h (where v'_j comes from the hidden-to-output weights), and the softmax turns the scores into probabilities: y_j = exp(u_j) / Σ_k exp(u_k).

The meaning is: make the probability exp(output at the target position) / Σ exp(outputs at all positions) as large as possible. Writing this objective as E, the maximization is turned into a minimization by taking a minus sign (common loss optimization is minimization). A numeric sketch of the full forward pass and update step follows the derivation below.

  • Derivation of the input-vector parameter update:

(Figure: derivation of the gradient used to update the input vectors)

This finally gives the gradient formula for updating the input vectors; some intermediate quantities are introduced as temporary shorthand to keep the expressions from getting too long. The derivation involving y_j proceeds as follows:

Here U denotes the hidden-to-output weights, whose output size equals the V above; Vc denotes the h above, i.e. the hidden-layer output, which involves the input-to-hidden weights and the input word vectors.

Because this is a composite function, both parts are differentiated further with respect to the input word vector Vc, which yields the final result (figure: the resulting gradient with respect to Vc).
The whole process is just the chain rule for differentiating a composite function, so knowing the derivatives of common functions well makes differentiating complex formulas much easier.
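To tie the derivation together, here is a minimal numeric sketch of one CBOW training step in plain numpy. The sizes and learning rate are made up, and it uses the full softmax (no hierarchical softmax or negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)

V, N = 8, 3            # vocabulary size, word-vector dimension (made up)
W_in = rng.normal(scale=0.1, size=(V, N))    # input word vectors (the lookup table)
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output weights (the U / v'_j above)

def cbow_step(W_in, W_out, context_ids, target_id, lr=0.05):
    """One SGD step: forward pass, softmax loss, and the update rules derived above."""
    # Forward: h is the average of the context word vectors
    # (equivalent to the averaged one-hots times W: [1, V] x [V, N] = [1, N]).
    h = W_in[context_ids].mean(axis=0)
    u = h @ W_out                              # scores u_j = v'_j . h, shape (V,)
    y = np.exp(u - u.max())
    y /= y.sum()                               # softmax: y_j = exp(u_j) / sum_k exp(u_k)
    loss = -np.log(y[target_id])               # E = -log(y at the target position)

    # Backward: e_j = y_j - t_j, where t is the one-hot target vector.
    e = y.copy()
    e[target_id] -= 1.0
    EH = W_out @ e                             # dE/dh, the intermediate vector from the derivation

    W_out -= lr * np.outer(h, e)               # update hidden-to-output weights (in place)
    W_in[context_ids] -= lr * EH / len(context_ids)  # update each context word's input vector
    return loss

# Toy usage: context words 0 and 2 predict target word 1 (indices chosen arbitrarily).
for _ in range(5):
    print(cbow_step(W_in, W_out, [0, 2], 1))   # the loss should tend to decrease
```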

7.3.1.6 Extension: Word2vec training tricks (optimizations)

When computing the softmax, both the Skip-Gram model and the CBOW model need the whole vocabulary V. In practice, however, the number of words is extremely large, which makes the computation very expensive and inefficient, so tricks are needed to speed up training.

  • Hierarchical softmax
    • Essentially turns an N-way classification problem into log(N) binary classifications, reducing the time complexity from O(N) to O(log(N))
  • Negative sampling
    • Essentially predicts only a subset of the total set of classes (a sketch follows this list)
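A minimal sketch of the negative-sampling idea in plain numpy (the sizes, learning rate, and uniform sampling of negatives are simplifications; real word2vec samples negatives from a unigram distribution raised to the 3/4 power): instead of a softmax over all V words, each step only involves the true target word plus k sampled negative words.

```python
import numpy as np

rng = np.random.default_rng(0)

V, N, lr = 10000, 100, 0.025          # vocabulary size, vector size, learning rate (made up)
W_in = rng.normal(scale=0.01, size=(V, N))
W_out = rng.normal(scale=0.01, size=(V, N))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(W_in, W_out, context_ids, target_id, k=5):
    """One CBOW step with negative sampling: a few binary classifications instead of a V-way softmax."""
    h = W_in[context_ids].mean(axis=0)

    # One positive example (label 1) and k sampled negatives (label 0).
    neg_ids = rng.integers(0, V, size=k)        # uniform sampling, for simplicity
    ids = np.concatenate(([target_id], neg_ids))
    labels = np.zeros(k + 1)
    labels[0] = 1.0

    scores = sigmoid(W_out[ids] @ h)            # only k+1 dot products, not V
    loss = -np.log(scores[0]) - np.log(1.0 - scores[1:]).sum()

    # Gradients of the binary cross-entropy losses.
    g = scores - labels                         # shape (k+1,)
    EH = g @ W_out[ids]                         # gradient with respect to h
    W_out[ids] -= lr * np.outer(g, h)
    W_in[context_ids] -= lr * EH / len(context_ids)
    return loss

print(negative_sampling_step(W_in, W_out, [3, 7, 42], 5))
```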

Other topics: the derivations of the Skip-gram and CBOW models, the principle behind word2vec's hierarchical softmax, and the derivation of the parameter-update formulas for word2vec's negative sampling.

We can visualize the learned vectors by projecting them into two-dimensional space with a dimensionality-reduction technique such as t-SNE. When inspecting these visualizations, it becomes clear that the vectors capture general semantic information about words and their relationships, and that this information is actually very useful. It was first discovered that certain directions in the induced vector space specialize in particular semantic relations between words, for example male-female, verb tense, and even country-capital relations (on reading t-SNE plots carefully, see https://distill.pub/2016/misread-tsne/). A minimal visualization sketch follows the figure below.

(Figure: a two-dimensional projection of learned word vectors)
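A minimal visualization sketch, assuming scikit-learn and matplotlib are installed; the word list and the random 100-dimensional vectors are placeholders for vectors taken from a trained model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: in practice these would come from a trained model (e.g. model.wv).
words = ["king", "queen", "man", "woman", "paris", "france"]
vectors = np.random.default_rng(0).normal(size=(len(words), 100))

# Project the high-dimensional vectors to 2-D with t-SNE.
coords = TSNE(n_components=2, perplexity=2, init="random", random_state=0).fit_transform(vectors)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.title("2-D t-SNE projection of word vectors")
plt.show()
```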

7.3.2 Using the Word2vec word-vector tool
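As a usage sketch for this section, here is a typical workflow with the gensim library mentioned earlier, assuming gensim 4.x (where the dimensionality parameter is vector_size; older versions call it size). The toy sentences are invented and far too small to produce meaningful vectors.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (invented; real training needs much more text).
sentences = [
    ["i", "like", "natural", "language", "processing"],
    ["word2vec", "learns", "word", "vectors"],
    ["i", "like", "word", "vectors"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # dimension of the word vectors (called `size` in gensim < 4.0)
    window=5,          # context window size
    min_count=1,       # frequency threshold for the vocabulary
    sg=0,              # 0 = CBOW, 1 = skip-gram
    negative=5,        # number of negative samples (hs=1 would use hierarchical softmax)
    epochs=50,
)

print(model.wv["word"][:5])                   # the learned vector for "word"
print(model.wv.most_similar("word", topn=3))  # nearest neighbours by cosine similarity
```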

7.3.3 Summary

  • Grasp the principles of the neural network language model (NNLM)
  • Master word2vec's implementation features and optimizations