word2vec learning summary

1 Introduction

word2vec is an open-source word-vector toolkit that Google released in 2013. It includes a group of models for word embedding; these models are usually shallow (two-layer) neural networks used to train word vectors.

A word2vec model takes a large-scale corpus as input and trains a vector space (typically of a few hundred dimensions) with a neural network. Each word in the vocabulary corresponds to a unique vector in this space, and words that share common contexts in the corpus are mapped to vectors that lie closer to each other, i.e. have higher cosine similarity.

While learning about word2vec I was inspired by the two articles [4] and [6], and the background material here also draws on them. The treatment of the training objective and the models also benefits from [8], the book "Neural Networks and Deep Learning" by Qiu Xipeng of Fudan University.

2. Starting from the statistical language model

In machine learning, the statistical language model (Statistical Language Model) is one of the primary tools of natural language processing and a very important part of it.

In cognitive psychology there is a classic experiment: when a person reads the following two sentences

Spread butter on the bread
Spread socks on the bread

the second sentence requires more processing time for semantic integration in the brain, because it does not comply with the rules of natural language. From a statistical point of view, these linguistic rules can be viewed as a probability distribution, and for the two sentences above, the probability of the second one occurring is significantly smaller.
Treat a text sequence of length \(T\) as a random event \(X_{1:T} = ⟨X_1, \cdots, X_T⟩\), where the variable \(X_t\) at each position takes values in a given vocabulary (vocabulary) \(V\), and the sample space of the whole sequence \(x_{1:T}\) is \(|V|^T\).
That is, for a given sample sequence \(x_{1:T} = x_1, x_2, \cdots, x_T\), its probability can be seen as the joint probability of the \(T\) words.

\[ P(X_{1:T} = x_{1:T}) = P(X_1 = x_1, X_2 = x_2, \ldots, X_T = x_T) = p(x_{1:T}). \]

2.1 The sequence probability model

Sequence data has two characteristics: (1) the samples have variable length; and (2) the sample space is very large. For a sequence of length \(T\), the sample space is \(|V|^T\). It is therefore difficult to model the probability of the whole sequence directly with a standard probability model.

By the multiplication (chain) rule of probability, the probability of the sequence \(x_{1:T}\) can be written as

\[ p(x_{1:T}) = p(x_1)p(x_2|x_1)p(x_3|x_{1:2}) \cdots p(x_T|x_{1:(T-1)}) \\ = \prod^T_{t=1} p(x_t|x_{1:(t-1)}) \]

where \(x_t \in V, t \in \{1, \cdots, T\}\) is a word in the vocabulary \(V\), and \(p(x_1|x_0) = p(x_1)\).

Therefore, the density-estimation problem for sequence data can be reduced to a univariate conditional-probability estimation problem, namely estimating the conditional probability \(p(x_t|x_{1:(t-1)})\) of \(x_t\) given \(x_{1:(t-1)}\).

Given \(N\) sequences \(\{x^{(n)}_{1:T_n}\}^N_{n=1}\), a sequence probability model learns a model \(p_\theta(\mathbf{x}|x_{1:(t-1)})\) so as to maximize the log-likelihood of the whole data set:

\[ \max_{\theta} \sum^N_{n=1} \log p_\theta(x^{(n)}_{1:T_n}) = \max_{\theta} \sum^N_{n=1} \sum^{T_n}_{t=1} \log p_\theta(x_t^{(n)}|x^{(n)}_{1:(t-1)}). \]

2.2 The N-gram statistical model

Because of data sparsity, the conditional probability \(p(x_t|x_{1:(t-1)})\) is still difficult to estimate when \(t\) is relatively large. A simplification is the N-gram model (N-Gram Model), which assumes that each word \(x_t\) depends only on the \(n-1\) words in front of it (the \(n\)-th order Markov property), i.e.,

\[ p(x_t|x_{1:(t-1)}) = p(x_t|x_{(t-n+1):(t-1)}). \]

When \(n = 1\) it is called the unigram model; when \(n = 2\), the bigram model; and so on.

Unigram: the unigram model, with \(n = 1\); each word in the sequence \(x_{1:T}\) is independent of the other words, i.e. independent of its context. In other words, each word is drawn from a multinomial distribution, and the log-likelihood function is:

\[ \log \prod^{N'}_{n=1} p(x_{1:T_n}^{(n)}; \theta) = \log \prod^{|V|}_{k=1} \theta^{m_k}_k = \sum^{|V|}_{k=1} m_k \log \theta_k \]

where \(m_k\) is the number of occurrences of the \(k\)-th word in the data set. It can be shown that the maximum likelihood estimate is equivalent to the frequency estimate, i.e. \(\theta_k = m_k / \sum^{|V|}_{k'=1} m_{k'}\).

N-gram model: similar to the unigram case, except that the current word depends only on the \(n-1\) words before it, satisfying the \(n\)-th order Markov property. Maximum likelihood estimation gives:

\[ p(x_t|x_{(t-n+1):(t-1)}) = \frac{m(x_{(t-n+1):t})}{m(x_{(t-n+1):(t-1)})} \]
where \(m(x_{(t-n+1):t})\) is the number of occurrences of \(x_{(t-n+1):t}\) in the data set.
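As a concrete illustration of this frequency estimate, here is a minimal Python sketch (the toy corpus and function names are invented for illustration) that estimates bigram (\(n = 2\)) probabilities by counting:

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized corpus.
corpus = [["spread", "butter", "on", "bread"],
          ["spread", "jam", "on", "bread"]]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    for i in range(len(sentence) - 1):
        bigram_counts[(sentence[i], sentence[i + 1])] += 1
        unigram_counts[sentence[i]] += 1

def bigram_prob(w_prev, w):
    """Maximum-likelihood estimate p(w | w_prev) = m(w_prev, w) / m(w_prev)."""
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("on", "bread"))   # 1.0 in this toy corpus
print(bigram_prob("on", "socks"))   # 0.0 -- the sparsity problem the smoothing techniques below address
```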

Smoothing techniques

Consider the following two questions

  1. If a word \(w\) appears \(0\) times in the current corpus (the training set), is the probability that it appears in the current language sequence really \(0\)?
  2. If a word \(w_1\) and another word \(w_2\) appear the same number of times in the corpus, must their probabilities of appearing in a given sequence, i.e. \(p(w_k|w_1^{k-1})\), be equal?

Of course, the answer to both questions is clearly no. However large the training set is, the two situations above cannot be avoided for some words, and smoothing techniques are used to address them.
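One of the simplest such techniques is add-one (Laplace) smoothing; below is a minimal sketch under the bigram-counting setup above (the parameter `delta` generalizes it to add-\(\delta\) smoothing):

```python
def smoothed_bigram_prob(bigram_counts, unigram_counts, vocab_size, w_prev, w, delta=1.0):
    """Add-delta smoothed estimate of p(w | w_prev).

    Every bigram gets an extra pseudo-count delta, so a word that never
    followed w_prev in the training set still gets a non-zero probability.
    """
    numerator = bigram_counts.get((w_prev, w), 0) + delta
    denominator = unigram_counts.get(w_prev, 0) + delta * vocab_size
    return numerator / denominator

# With the toy corpus above (5 distinct words): about 0.14 instead of 0.0.
print(smoothed_bigram_prob(bigram_counts, unigram_counts, vocab_size=5, w_prev="on", w="socks"))
```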

3. Deep sequence models

A deep sequence model uses a neural network to estimate the conditional probability \(p_\theta(x_t|x_{1:(t-1)})\).

3.1 The neural probabilistic language model

The neural probabilistic language model was proposed by Bengio et al. in the paper "A Neural Probabilistic Language Model" (Journal of Machine Learning Research, 2003) [1]. The model makes use of an important tool: the word vector.
A neural probabilistic language model can generally be divided into three parts: the embedding layer, the feature layer, and the output layer.

3.1.1 The embedding layer

Let \(h_t = x_{1:(t-1)}\) denote the input history, which is a sequence of symbols. Since a neural network usually requires real-valued vectors as input, these symbols must be converted into vector form before the network can process them. A simple conversion is an embedding lookup table (Embedding Lookup Table) that maps each symbol directly to its vector representation. The embedding table is also called the embedding matrix or lookup table.
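A minimal numpy sketch of such a lookup table (the toy vocabulary, embedding dimension, and helper name are invented for illustration): each symbol is mapped to an integer index, and the lookup is simply row selection in the embedding matrix.

```python
import numpy as np

vocab = {"bread": 0, "butter": 1, "spread": 2, "on": 3}
embed_dim = 4                                   # word-vector length m
rng = np.random.default_rng(0)
M = rng.normal(size=(len(vocab), embed_dim))    # embedding matrix, one row per word

def embed(history):
    """Map a sequence of symbols to their vector representations."""
    idx = [vocab[w] for w in history]
    return M[idx]                               # shape (len(history), embed_dim)

e = embed(["spread", "butter", "on"])           # inputs e_1, ..., e_{t-1} to the feature layer
```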

3.1.2 The feature layer

The feature layer extracts features from the input vector sequence \(e_1, \cdots, e_{t-1}\); its output can be regarded as a vector representation \(h_t\) of the history. Put simply, the selected word vectors are fed into a neural network, which performs the computation and is trained on the data.

The feature layer can be implemented with different types of neural networks, such as feedforward and recurrent neural networks. Three common choices are:

  1. Simple average.
  2. Feedforward neural networks.
  3. Recurrent Neural Networks.

3.1.3 The output layer

The output layer is usually a softmax classifier. It receives the vector representation of the history \(h_t \in \mathbb{R}^{D_2}\) and outputs the posterior probability of each word in the vocabulary, so the output has size \(|V|\). In a nutshell, a softmax is usually applied at the end of the neural network to output a probability distribution, from which the loss can then be computed.

3.2 One-hot vector representation

Before Google proposed the word2vec models, the word representation commonly used in neural language models was the one-hot vector: for a corpus with a vocabulary of 10,000 words, a single word is represented as a 10,000 × 1 vector; if the current word is at position 4,096 in the vocabulary, its one-hot vector has a 1 at position 4,096 and 0 everywhere else.
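A minimal sketch of this representation (using the same vocabulary size and position as the example above):

```python
import numpy as np

def one_hot(index, vocab_size=10000):
    """Return the one-hot column vector for the word at position `index`."""
    v = np.zeros((vocab_size, 1))
    v[index] = 1.0
    return v

x = one_hot(4096)   # 10000 x 1 vector: 1 at position 4096, 0 elsewhere
```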

The advantage of this word representation is that it is simple and can be used directly: since text itself cannot take part in mathematical computation, a vector representation makes it possible to use words in the calculations of a language model.
But the disadvantage is also obvious: such a vector only encodes the position of the current word in the vocabulary, i.e. its identity, and cannot express any association between two vectors.

So we need a representation whose vectors can express the relationship between two words.

3.3 word2vec

word2vec can also be called word embedding, because both the word2vec and the one-hot representations are used in the embedding layer of a neural probabilistic model, as the conversion that allows a neural network to process symbolic data.

The word2vec representation differs from one-hot. In word2vec each word is likewise a vector, but the length of the vector is not fixed to the vocabulary size; it is a length \(m\) specified by the user and given to the algorithm, and each dimension is no longer restricted to the simple 0/1 scheme but is in general an arbitrary real number. With such a representation, a word vector can both identify the current word (i.e. distinguish it from other words) and be used to compute the semantic relationship between the current word and other words.

Consider how such vectors capture semantic association between words. First, because each word is a multi-dimensional vector, it can be mapped into a high-dimensional space. Next, once the words are mapped into that space, the relationship between vectors can be examined: the degree of association between two words can be derived from the cosine of the angle between their vectors. What is more, by adding and subtracting word vectors, other words can be computed.

How do we obtain the word vectors? In a neural probabilistic model, the word vectors (word embeddings) are parameters of the neural network, because the embedding layer is itself a set of parameters that turns the input into real-valued vectors. So obtaining word embeddings amounts to learning the parameters of the neural network.

Training is done on a given set of training sequences; the objective is to find a set of parameters \(\theta\) that maximizes the log-likelihood function, where \(\theta\) denotes all the network parameters, including the embedding matrix \(M\) and the weights and biases of the neural network.
The embedding matrix is the matrix of word vectors, of size \(m \times N\), where \(N\) is the number of words in the vocabulary and \(m\) is the given word-vector length.

3.3.1 The neural network models trained by word2vec

The two papers [2] and [3], published by Google in 2013, introduced the two important models that word2vec uses: the CBOW model and the Skip-gram model.

The skip-gram model (Skip-gram)

As the figure given in the paper illustrates, the skip-gram model is a neural probabilistic model that generates the surrounding words from the current word. In the discussion of these models, words are generally divided into two kinds (and the finally trained word vectors are likewise divided into two sets): one is the center word, the other is the context (background) word.
From this setup it is clear that in the skip-gram model the center word (\(w(t)\) in the figure) generates the surrounding words \((w(t-2), w(t-1), w(t+1), w(t+2))\) through the neural probabilistic model; here two words are generated on each side of the center word, and the range within which context words are generated is called the context window; in the figure the window size is 2.

In the model above, let us denote the word vector of the center word by \(v_c\) and that of a context word by \(u_o\). In the skip-gram model, the input layer fetches the vector of the word currently acting as the center word, the projection layer multiplies it with the model parameters (which are in fact the context-word vectors), and finally the output layer applies a softmax to output a probability distribution over the generated context words.

Let the center word \(w_c\) have index \(c\) in the dictionary and the context word \(w_o\) have index \(o\). The conditional probability of generating the context word given the center word can be obtained by applying a softmax to the inner product of their vectors:

\[ P(w_o \mid w_c) = \frac{\text{exp}(\boldsymbol{u}_o^\top \boldsymbol{v}_c)}{ \sum_{i \in \mathcal{V}} \text{exp}(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}. \]

Here the dictionary index set is \(\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}\). Suppose we are given a text sequence of length \(T\) in which the word at time step \(t\) is \(w^{(t)}\). Assuming that context words are generated independently given the center word, when the context window size is \(m\) the likelihood function of the skip-gram model is the probability of generating all context words given each center word:

\[ \prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}) \]
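A minimal numpy sketch of the conditional probability \(P(w_o \mid w_c)\) above (the vocabulary size, dimensions, and vectors are random placeholders standing in for trained parameters): the rows of `U` play the role of the context-word vectors \(u_i\) and the rows of `V` the center-word vectors \(v_i\).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 5
U = rng.normal(size=(vocab_size, dim))    # context ("background") vectors u_i, one row per word
V = rng.normal(size=(vocab_size, dim))    # center-word vectors v_i

def skipgram_prob(c, o):
    """P(w_o | w_c): softmax over the inner products u_i^T v_c, evaluated at i = o."""
    scores = U @ V[c]                     # u_i^T v_c for every word i in the vocabulary
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

p = skipgram_prob(c=2, o=5)
```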

Training the skip-gram model

The computation of the skip-gram model from the input layer to the output layer was described above; for this neural network, backpropagation can be used to train the whole model together with the word vectors. As mentioned in Section 3.1.3 and the discussion after it, the model parameters are learned by maximizing the likelihood function, i.e. maximum likelihood estimation. From the derivation and analysis above, the loss is:

\[ - \sum_{t=1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \text{log}\, P(w^{(t+j)} \mid w^{(t)}).\]

If we use stochastic gradient descent, then in each iteration we randomly sample a short subsequence, compute the loss on that subsequence, and then compute the gradients to update the model parameters. The key to the gradient computation is the gradient of the log conditional probability with respect to the center-word vector and the context-word vectors. By definition, we first have

\[\log P(w_o \mid w_c) = \boldsymbol{u}_o^\top \boldsymbol{v}_c - \log\left(\sum_{i \in \mathcal{V}} \text{exp}(\boldsymbol{u}_i^\top \boldsymbol{v}_c)\right)\]

Differentiating, we obtain the gradient of the expression above with respect to \(\boldsymbol{v}_c\):

\[ \begin{aligned} \frac{\partial \text{log}\, P(w_o \mid w_c)}{\partial \boldsymbol{v}_c} &= \boldsymbol{u}_o - \frac{\sum_{j \in \mathcal{V}} \exp(\boldsymbol{u}_j^\top \boldsymbol{v}_c)\boldsymbol{u}_j}{\sum_{i \in \mathcal{V}} \exp(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}\\ &= \boldsymbol{u}_o - \sum_{j \in \mathcal{V}} \left(\frac{\text{exp}(\boldsymbol{u}_j^\top \boldsymbol{v}_c)}{ \sum_{i \in \mathcal{V}} \text{exp}(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}\right) \boldsymbol{u}_j\\ &= \boldsymbol{u}_o - \sum_{j \in \mathcal{V}} P(w_j \mid w_c) \boldsymbol{u}_j. \end{aligned} \]

Computing it requires the conditional probabilities of all words in the dictionary given \(w_c\) as the center word. The gradients with respect to the other word vectors are obtained in the same way.
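The gradient formula translates almost line by line into numpy. The sketch below (reusing the placeholder matrices `U` and `V` from the previous sketch) computes \(\partial \log P(w_o \mid w_c) / \partial \boldsymbol{v}_c\) and makes explicit why it needs \(P(w_j \mid w_c)\) for every word in the vocabulary.

```python
import numpy as np

def grad_log_p_wrt_vc(U, V, c, o):
    """Gradient of log P(w_o | w_c) with respect to the center vector v_c.

    Equals u_o minus the probability-weighted sum of all context vectors,
    so it requires P(w_j | w_c) for every word j in the vocabulary.
    """
    scores = U @ V[c]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()   # P(w_j | w_c) for all j
    return U[o] - probs @ U                          # u_o - sum_j P(w_j | w_c) u_j
```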

After training, for any word with index \(i\) in the dictionary, we obtain two word vectors, \(\boldsymbol{v}_i\) as the center word and \(\boldsymbol{u}_i\) as a context word. In natural language processing applications, the center-word vectors of the skip-gram model are generally used as the representation vectors of words.

The derivation above is taken from the book "Dive into Deep Learning" [7]; a more detailed derivation can be found in [4], which gives a more detailed and rigorous treatment of the whole word2vec model and its formulas.

The continuous bag-of-words model (CBOW)

The CBOW (Continuous Bag-of-Words) model is the opposite of the skip-gram model: it is a neural probabilistic model that generates the center word from the surrounding (context) words. Here we denote the context-word vectors by \(v_o\) and the generated center word's vector by \(u_c\). Since several context words generate one center word, the input layer takes the word vectors of the context words, sums them and averages them; then, as in the skip-gram model, the result is multiplied with the center-word vectors (i.e. a matrix operation between the hidden layer and the output layer), followed by a softmax normalization in the output layer.

Let the center word \(w_c\) have index \(c\) in the dictionary and the context words \(w_{o_1}, \ldots, w_{o_{2m}}\) have indices \(o_1, \ldots, o_{2m}\). Then the conditional probability of generating the center word given the context words is

\[P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\text{exp}\left(\frac{1}{2m}\boldsymbol{u}_c^\top (\boldsymbol{v}_{o_1} + \ldots + \boldsymbol{v}_{o_{2m}}) \right)}{ \sum_{i \in \mathcal{V}} \text{exp}\left(\frac{1}{2m}\boldsymbol{u}_i^\top (\boldsymbol{v}_{o_1} + \ldots + \boldsymbol{v}_{o_{2m}}) \right)}.\]
To simplify the notation, write \(\mathcal{W}_o= \{w_{o_1}, \ldots, w_{o_{2m}}\}\) and \(\bar{\boldsymbol{v}}_o = \left(\boldsymbol{v}_{o_1} + \ldots + \boldsymbol{v}_{o_{2m}} \right)/(2m)\); then the expression above can be abbreviated as

\[P(w_c \mid \mathcal{W}_o) = \frac{\exp\left(\boldsymbol{u}_c^\top \bar{\boldsymbol{v}}_o\right)}{\sum_{i \in \mathcal{V}} \exp\left(\boldsymbol{u}_i^\top \bar{\boldsymbol{v}}_o\right)}.\]
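A minimal numpy sketch of \(P(w_c \mid \mathcal{W}_o)\) (placeholders again; note that in the CBOW notation the rows of `U` are now the center-word vectors \(u_i\) and the rows of `V` the context-word vectors \(v_i\)):

```python
import numpy as np

def cbow_prob(U, V, c, context_indices):
    """P(w_c | W_o): softmax over u_i^T v_bar, where v_bar averages the 2m context vectors."""
    v_bar = V[context_indices].mean(axis=0)   # bar v_o = (v_{o_1} + ... + v_{o_{2m}}) / 2m
    scores = U @ v_bar                        # u_i^T bar v_o for every word i
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c]

# Example with random placeholder vectors: window size m = 2, so 4 context words.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))
p = cbow_prob(U, V, c=3, context_indices=[1, 2, 4, 5])
```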

Given a text sequence of length \(T\), let the word at time step \(t\) be \(w^{(t)}\) and the context window size be \(m\). The likelihood function of the continuous bag-of-words model is the probability of generating every center word from its context words:

\[ \prod_{t=1}^{T} P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}).\]

Training the continuous bag-of-words model

Training the CBOW model is similar to training the skip-gram model: maximum likelihood estimation is equivalent to minimizing a loss function, which gives the loss

\[ -\sum_{t=1}^T \text{log}\, P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}).\]
Note that

\[\log\,P(w_c \mid \mathcal{W}_o) = \boldsymbol{u}_c^\top \bar{\boldsymbol{v}}_o - \log\,\left(\sum_{i \in \mathcal{V}} \exp\left(\boldsymbol{u}_i^\top \bar{\boldsymbol{v}}_o\right)\right).\]

Differentiating, we can compute the gradient of the log conditional probability above with respect to any context-word vector \(\boldsymbol{v}_{o_i}\) (\(i = 1, \ldots, 2m\)):

\[\frac{\partial \log\, P(w_c \mid \mathcal{W}_o)}{\partial \boldsymbol{v}_{o_i}} = \frac{1}{2m} \left(\boldsymbol{u}_c - \sum_{j \in \mathcal{V}} \frac{\exp(\boldsymbol{u}_j^\top \bar{\boldsymbol{v}}_o)\boldsymbol{u}_j}{ \sum_{i \in \mathcal{V}} \text{exp}(\boldsymbol{u}_i^\top \bar{\boldsymbol{v}}_o)} \right) = \frac{1}{2m}\left(\boldsymbol{u}_c - \sum_{j \in \mathcal{V}} P(w_j \mid \mathcal{W}_o) \boldsymbol{u}_j \right).\]

The gradients with respect to the other word vectors are obtained in the same way. One difference from the skip-gram model is that the context-word vectors of the continuous bag-of-words model are generally used as the representation vectors of words.

The gradient derivation in this part is also taken from the book "Dive into Deep Learning" [7].

3.3.2 Approximate computation

From the conditional probabilities of the two models in Section 3.3.1 it can be seen that, although the context window limits the number of context words, the normalization in the conditional probability of a single word still ranges over the whole vocabulary; for the conditional probability of the skip-gram model, for instance:

\[ P(w_o \mid w_c) = \frac{\text{exp}(\boldsymbol{u}_o^\top \boldsymbol{v}_c)}{ \sum_{i \in \mathcal{V}} \text{exp}(\boldsymbol{u}_i^\top \boldsymbol{v}_c)}. \]
The \(u_i\) in the denominator range over the whole vocabulary of the training corpus, which makes the final softmax extremely expensive to compute. Two approximate training methods were therefore proposed: Hierarchical Softmax and Negative Sampling. My understanding of these two methods owes a great deal to article [4], so I adopt its notation.

Hierarchical softmax

In textbooks and papers the method is usually illustrated with a balanced binary tree, but in the actual implementation it is realized with a Huffman tree.

Article [4] introduces the following notation:

  1. \(p^w\): the path from the root node to the leaf node corresponding to the word w.
  2. \(l^w\): the number of nodes on the path \(p^w\).
  3. \(p^w_1, p^w_2, \cdots, p^w_{l^w}\): the \(l^w\) nodes on the path \(p^w\), where \(p^w_1\) is the root node and \(p^w_{l^w}\) is the node corresponding to the word w.
  4. \(d^w_2, d^w_3, \cdots, d^w_{l^w} \in \{0, 1\}\): the Huffman code of the word w, consisting of \(l^w - 1\) bits; \(d^w_j\) is the code assigned to the j-th node on the path \(p^w\) (the root node carries no code).
  5. \(\theta^w_1, \theta^w_2, \cdots, \theta^w_{l^w-1} \in \mathbb{R}^m\): the vectors of the non-leaf nodes on the path \(p^w\); \(\theta^w_j\) is the vector of the j-th non-leaf node on the path \(p^w\).

With these symbols, together with the binary-tree figure below, we can gain a deeper understanding of how the parameters are distributed and how the conditional probability of each word is defined.

As article [4] asks: why do we also define a vector of the same length for every non-leaf node of the Huffman tree? Article [4] explains that "they are just auxiliary vectors in the algorithm." The detailed derivation can be found in [4]; here we only give a brief intuition and explanation.

First, why does the Huffman tree reduce the amount of computation so much? We can see this intuitively from the binary tree.

In the binary tree in the figure above, with a vocabulary of size \(|V|\), each word corresponds to a leaf node of the tree, while the auxiliary parameters \(\theta^w_j\) are the vectors attached to the non-leaf nodes. How, then, do we define the conditional probability function \(P(w_c \mid Context(w))\) on this tree (using the CBOW conditional probability here)? More concretely, how do we use the node vectors of the binary tree to define this conditional probability function? Take the red path in the figure as an example: it passes through 3 branches, and each branch can be viewed as a binary classification.
From the binary-classification point of view, each non-leaf node has to assign a class to its left and right children; suppose left is the positive class and right is the negative class. By logistic regression, the probability of the positive class is:

\[\sigma(x^T_w \theta) = \frac{1}{1+\exp(-x^T_w \theta)}.\]
Then the probability of being classified as negative is:

\[ 1 - \sigma(x^T_w \theta) \]
where \(\theta\) is the auxiliary parameter vector of the non-leaf node.

We can then see that the conditional probability along this path is (with the parameters abbreviated below):

\[P(w_3 \mid Context(w)) = \sigma(\boldsymbol{x}_w^\top \boldsymbol{\theta}_{n(w_3,1)}) \cdot (1- \sigma(\boldsymbol{x}_w^\top \boldsymbol{\theta}_{n(w_3,2)})) \cdot \sigma(\boldsymbol{x}_w^\top \boldsymbol{\theta}_{n(w_3,3)}).\]

In general, using the notation introduced above,

\[P(w_c \mid Context(w)) = \prod^{l^w}_{j=2}p(d^w_j \mid x_w, \theta^w_{j-1}),\]
where

\[ p(d^w_j \mid x_w, \theta^w_{j-1}) = \begin{cases} \sigma(x^T_w \theta^w_{j-1}), & d^w_j = 1 \\ 1 - \sigma(x^T_w \theta^w_{j-1}), & d^w_j = 0 \end{cases} \]

or, written as a single expression:

\[ p(d^w_j \mid x_w, \theta^w_{j-1}) = [\sigma(x^T_w \theta^w_{j-1})]^{d^w_j} \cdot [1 - \sigma(x^T_w \theta^w_{j-1})]^{1-d^w_j} \]

where \(d^w_j=1\) indicates the positive class and \(d^w_j=0\) the negative class.

That is, by representing every word of the corpus as a leaf node of a binary tree, the original conditional probability becomes a product of simple binary classifications along a path, which greatly reduces the amount of computation.
Then why use a Huffman tree? Anyone who has studied binary trees and Huffman trees should see by now that the words of a corpus occur with different frequencies, so some high-frequency words are involved in the computation far more often; a Huffman tree assigns shorter codes to high-frequency words and longer codes to low-frequency words, which reduces the amount of computation even further.
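To make the path decomposition concrete, here is a minimal sketch (the path length, codes, and vectors are invented placeholders) that evaluates the product of sigmoid branch decisions along a word's Huffman path, following the convention above that \(d^w_j = 1\) contributes \(\sigma(x_w^\top \theta)\) and \(d^w_j = 0\) contributes \(1 - \sigma(x_w^\top \theta)\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(x_w, node_vectors, code):
    """P(w | Context(w)) as a product of binary decisions along the Huffman path.

    x_w          : projected context vector (e.g. the averaged context vectors in CBOW)
    node_vectors : auxiliary vectors theta of the non-leaf nodes on the path
    code         : Huffman code bits d_j of the word (one bit per non-leaf node)
    """
    prob = 1.0
    for theta, d in zip(node_vectors, code):
        s = sigmoid(x_w @ theta)
        prob *= s if d == 1 else (1.0 - s)
    return prob

# Example: a word whose Huffman code is [1, 0, 1], i.e. a path with 3 branches.
rng = np.random.default_rng(0)
x_w = rng.normal(size=5)
path = [rng.normal(size=5) for _ in range(3)]
p = hierarchical_softmax_prob(x_w, path, [1, 0, 1])
```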

Negative sampling

Compared with hierarchical softmax, negative sampling is simpler.
Negative sampling also uses the notion of positive and negative samples, but it no longer uses a Huffman tree; instead it uses random negative sampling, which greatly improves performance.
Again take the CBOW model: the context Context(w) of a word w is used to predict w, so the given pair (Context(w), w) is a positive sample, and the negative samples are words that do not appear in this context window.

How, then, are the negative samples chosen? As mentioned in the previous subsection, different words occur with different frequencies in the corpus, so selecting negative samples from the corpus becomes a weighted sampling problem.
The weight of each word is set by the following formula:

\[ p(w)=\frac{[count(w)]^{\frac{3}{4}}}{\sum_{u \in D}[count(u)]^{\frac{3}{4}}} \]
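A minimal sketch of this weighted sampler (the word counts are invented for illustration): raising the counts to the power \(3/4\) flattens the distribution, so rare words are sampled somewhat more often than their raw frequency alone would suggest.

```python
import numpy as np

counts = {"the": 500, "bread": 40, "butter": 25, "socks": 3}
words = list(counts)
weights = np.array([counts[w] for w in words], dtype=float) ** 0.75
probs = weights / weights.sum()    # p(w) = count(w)^(3/4) / sum_u count(u)^(3/4)

def sample_negatives(k, rng=np.random.default_rng(0)):
    """Draw k negative samples according to the smoothed unigram distribution."""
    return list(rng.choice(words, size=k, p=probs))

neg = sample_negatives(5)
```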

Having selected a negative-sample subset NEG(w) for w, we define, for every word \(w'\) in the dictionary \(D\):

\[ L^w(w')= \begin{cases} 1 & w'=w \\ 0 & w'\neq w \end{cases} \]

For a given positive sample \((Context(w), w)\), we want to maximize

\[ g(w)=\prod_{u\in \{w\}\bigcup NEG(w)} p(u|Context(w)) \]

where the conditional probabilities of the positive and negative samples are analogous to those of the positive and negative classes in hierarchical softmax:

\[ p(u|Context(w))= \begin{cases} \sigma(X_w^T\theta^u) & L^w(u)=1 \\ 1-\sigma(X_w^T\theta^u) & L^w(u)=0 \end{cases} \\ = [\sigma(X_w^T\theta^u)]^{L^w(u)} \cdot [1-\sigma(X_w^T\theta^u)]^{1-L^w(u)} \]

Negative sampling computes the conditional probability of the target word directly from negative samples (words sampled from outside the context window) and positive samples (words that appear in the context window). Compared with hierarchical softmax, there is no need to build a Huffman tree or to maintain a parameter vector for every non-leaf node of the tree; the words of the corpus are used directly instead.
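Putting the pieces together, here is a minimal sketch of \(\log g(w)\) for one positive sample and its sampled negatives (the names are placeholders; `theta` plays the role of the auxiliary vectors \(\theta^u\) above): the positive word contributes \(\log \sigma(X_w^\top \theta^w)\) and each negative word contributes \(\log(1 - \sigma(X_w^\top \theta^u))\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_log_g(x_w, theta, positive, negatives):
    """log g(w): log-probability of the positive word plus log(1 - p) for each negative.

    x_w       : projected context vector X_w (e.g. averaged context vectors in CBOW)
    theta     : matrix of output vectors theta^u, one row per vocabulary word
    positive  : index of the true center word w
    negatives : indices of the sampled negative words NEG(w)
    """
    log_g = np.log(sigmoid(x_w @ theta[positive]))
    for u in negatives:
        log_g += np.log(1.0 - sigmoid(x_w @ theta[u]))
    return log_g
```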

Detailed derivations of the gradient computations for hierarchical softmax and negative sampling can be found in article [4].

Choosing the word vectors

For word vectors trained with either of the two approximate methods, hierarchical softmax and negative sampling, each word actually has two vectors: one as the center word and one as a context word. In general, the center-word vector is chosen directly as the final trained word vector. As can be seen from negative-sampling training, a vector used as a context word may correspond to a word that does not appear in the context window generating the target word, so the context-word vectors have comparatively lower confidence.

3.4 Practical applications of word2vec

Language translation with word vectors

Tomas Mikolov's team at Google developed a technique for automatically generating dictionaries and terminology tables that converts one language into another. The technique uses data mining to build structural models of two languages and then compares them; the relations among the words of each language, its "language space", can be represented as a set of vectors in the mathematical sense. The vector spaces of different languages share many commonalities, so once a mapping and transformation from one vector space to another is realized, language translation becomes possible. The technique works quite well, reaching an accuracy of 90% for English-Spanish translation.
Article [5], when introducing the algorithm, gives a simple example that helps us better understand how word vectors work.

Synonyms and analogies with word vectors

A simple application of word vectors is finding the synonyms and analogies of a given word.

For synonyms, we simply compute, over the trained word vectors, the cosine similarity between the current word and every other word, and take the few words with the highest similarity as its synonyms.
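For instance, the gensim library provides a word2vec implementation; a minimal usage sketch is shown below (this is my own illustration, not part of the original text, and the parameter names assume a recent gensim 4.x release, where the vector length is called `vector_size`):

```python
from gensim.models import Word2Vec

# sentences: an iterable of tokenized sentences from your own corpus (tiny placeholder here).
sentences = [["spread", "butter", "on", "bread"],
             ["spread", "jam", "on", "bread"]]

model = Word2Vec(sentences,
                 vector_size=100,   # word-vector length m
                 window=5,          # context window size
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 negative=5,        # number of negative samples (0 disables)
                 min_count=1)

# Top synonyms of a word, ranked by cosine similarity of the trained vectors.
print(model.wv.most_similar("bread", topn=3))
```

In gensim's parameterization, `sg` switches between skip-gram and CBOW, while `negative` and `hs` choose between negative sampling and hierarchical softmax.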

Beyond finding synonyms, we can also use pretrained word vectors to find analogies between words. For example, "man" : "woman" :: "son" : "daughter" is an analogy: "man" is to "woman" as "son" is to "daughter". The analogy problem can be defined as: given the four words of an analogy \(a : b :: c : d\), with the first three words \(a\), \(b\), \(c\) known, find \(d\). Let \(\text{vec}(w)\) denote the word vector of \(w\). The idea is to search for the word whose vector is most similar to \(\text{vec}(c)+\text{vec}(b)-\text{vec}(a)\).

Another example is the "capital-country" analogy: "beijing" is to "china" as "tokyo" is to what? The answer should be "japan".

The corresponding analogy word can thus be obtained by adding and subtracting word vectors.
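A minimal numpy sketch of this arithmetic (the `vocab` dictionary and `vectors` matrix are placeholders standing in for a trained vocabulary and its word vectors):

```python
import numpy as np

def analogy(a, b, c, vocab, vectors):
    """Return the word d whose vector is most cosine-similar to vec(c) + vec(b) - vec(a)."""
    target = vectors[vocab[c]] + vectors[vocab[b]] - vectors[vocab[a]]
    sims = vectors @ target / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target) + 1e-9)
    # Exclude the three query words themselves from the candidates.
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf
    best = int(np.argmax(sims))
    return [w for w, i in vocab.items() if i == best][0]

# Usage with trained vectors: analogy("man", "woman", "son", vocab, vectors) -> "daughter"
```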

4. Summary

This article has given a brief introduction to the word vectors used in current natural language processing. During training, every word vector is required to predict the words that appear in its surrounding context, so even though the word vectors are initialized randomly, through this prediction task they eventually capture the semantic relations between words and become good word vectors. Such trained word vectors perform very well in a number of natural language tasks; for example, they can be applied to sentiment classification.

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research (JMLR), 3:1137-1155, 2003.
[2] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.
[3] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.
[4] https://www.cnblogs.com/peghoty/p/3857839.html
[5] Tomas Mikolov, Quoc V. Le, Ilya Sutskever. Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168v1, 2013.
[6] https://www.zybuluo.com/Dounm/note/591752#5-%E5%9F%BA%E4%BA%8Enegative-sampling%E7%9A%84%E6%A8%A1%E5%9E%8B
[7] https://zh.d2l.ai/
[8] https://nndl.github.io/
