A Summary of word2vec and doc2vec

Time to make good on the flag I planted earlier. This is the first article in a detailed NLP technique series. School is about to start and I will be very busy, so the second article may not appear for quite a while; its topic will be GloVe or NNLM.
There are still some incomplete parts in this article, none of them essential; I will fill them in gradually when I have time.

Update:
9/4
Added some notes on how the loss is computed at the end of the SG model, and corrected an error in the doc2vec loss computation section.

Represent the Meaning of a Word

WordNet

WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

pros

  1. can find synonyms.

cons

  1. missing new words (impossible to keep up to date).
  2. subjective.
  3. requires human labor to create and adapt.

One Hot Encoding

Discrete representation.

cons

  1. dimension is extremely high.
  2. hard to compute accurate word similarity (all vectors are orthogonal; see the sketch below).
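
To illustrate the second point, here is a minimal numpy sketch (toy three-word vocabulary, made up for illustration) showing that any two distinct one-hot vectors are orthogonal, so their cosine similarity is always 0:

import numpy as np

vocab = ["cat", "dog", "banking"]           # toy vocabulary, made up for illustration

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

cat, dog = one_hot("cat"), one_hot("dog")
cos = cat @ dog / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(cos)   # 0.0 -- every pair of distinct words looks equally unrelated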

Bag of Words

Co-occurrence of words with variable window size.

cons

  1. dimension is extremely high and grows with the dictionary size, which also burdens downstream ML models.

Word2vec

A neural probabilistic language model.

Distributional-similarity-based representations: represent a word by means of its neighbors; the context is sufficient to convey a word's meaning.

We will build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context.

Distributional similarity & Distributed representation (dense vector)

There are certain differences between the two. Distributional similarity emphasizes that the meaning of a word should be inferred from its context. Distributed representation is the opposite of one-hot encoding: the vector representation is dense rather than sparse.

pros

  1. can compute accurate word similarity.

cons

  1. The computed vectors are association-based word vectors rather than sense-level semantic vectors, so polysemy cannot be handled (there is one vector per word instead of one per word sense).

Loss Function

Softmax function: the standard mapping from R^V to a probability distribution. The numerator guarantees that each value is positive, and the denominator guarantees that all probabilities sum to 1.

p_i = \frac{\exp(u_i)}{\sum_{j} \exp(u_j)}

After obtaining the probability distribution over the center/context word, we still need cross-entropy to compute the loss.

L(\hat y, y) = -\sum_{j=1}^{V} y_j \log(\hat y_j). According to this formula, the loss is 0 when the prediction is perfect.
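
Both formulas can be seen in a few lines of numpy; the sketch below uses made-up scores for a three-word vocabulary:

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())              # subtract the max for numerical stability
    return e / e.sum()

u = np.array([2.0, 1.0, 0.1])            # made-up scores for a 3-word vocabulary
y_hat = softmax(u)                       # a valid probability distribution (positive, sums to 1)
y = np.array([1.0, 0.0, 0.0])            # one-hot true label
loss = -np.sum(y * np.log(y_hat))        # cross-entropy; 0 only for a perfect prediction
print(y_hat, loss)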

Here we take skip-gram training as the example.

J = 1 - p(w_{-t} \mid w_t)

w_{-t} denotes the context of w_t (the minus sign means all words except w_t itself).

p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}

o is the outside (or output) word index, c is the center word index. v_c and u_o are the center and outside vectors for indices c and o. The softmax uses word c to obtain the probability of word o.

According to this formula, each word in the text is represented by two vectors: one when it serves as a center word, and another when it serves as a context word.

The derivation of the loss mainly uses the softmax probability formula and the chain rule from calculus.
\frac{\partial}{\partial v_c} \log p(o \mid c) = \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} = \underbrace{u_o}_{①} - \underbrace{\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^T v_c)}_{②}
① is the observation, i.e. what the context word actually is (the true label). ② is the expectation, i.e. which word the model assigns the highest probability to (the predicted label). So in essence we are minimizing the gap between the observation and the prediction.
② = \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^T v_c) = \sum_{x=1}^{V} \frac{\exp(u_x^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \, u_x = \sum_{x=1}^{V} p(x \mid c) \, u_x, \quad \text{so} \quad \frac{\partial}{\partial v_c} \log p(o \mid c) = u_o - \sum_{x=1}^{V} p(x \mid c) \, u_x
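
The result can be checked numerically. The sketch below (random toy vectors; V, d, and the index o are made-up values) compares the analytic gradient u_o - \sum_x p(x|c) u_x with a finite-difference estimate:

import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                               # toy vocabulary size and vector dimension
U = rng.normal(size=(V, d))               # outside ("context") vectors u_w, one per row
v_c = rng.normal(size=d)                  # center word vector
o = 2                                     # index of the observed context word

def log_p(o, v_c):
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())

# analytic gradient: observation minus expectation
p = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
grad_analytic = U[o] - p @ U

# finite-difference estimate of the same gradient
eps = 1e-6
grad_numeric = np.array([
    (log_p(o, v_c + eps * np.eye(d)[i]) - log_p(o, v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True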

When we optimize with SGD, each window contains at most 2m+1 words, so \nabla_{\theta} J_{t}(\theta) is very sparse.

Training Algorithms

Skip-grams (SG)

Predict context words given target (position independent).

[Figure: context window around the center word "into"]

In this example, "into" is the target (center word), and "problems turning" and "banking crises" are our output context words. Suppose the sentence has T words in total. We define the window size (the radius of the predicted context) as m; in this example m = 2.

[Figure: (center word, context word) training pairs]

A training pair is formed from a center word and a context word and fed to the word2vec model.

Objective Function

Maximize the probability of any context word given the current center word. \theta represents all the variables we will optimize. The total number of words is T and the window size is m.

J'(\theta) = \prod_{t=1}^{T} \prod_{j=-m, j \ne 0}^{m} p(w_{t+j} \mid w_t; \theta)

We use negative log likelihood to turn the objective function into a loss function.

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=-m, j \ne 0}^{m} \log p(w_{t+j} \mid w_t)

Training Process

[Figure: skip-gram training workflow]

At first glance this figure looks quite busy, but it actually lays out the workflow clearly. d is the vector dimension and V is the vocabulary size.

The matrix W in the figure is the center-word matrix: each column stores a word's vector representation when it acts as the center word, W \in R^{d \times V}. In a single training step there is only one center word, so it can be represented by a one-hot vector w_t. Multiplying the two gives the vector of the current center word: v_c = W \cdot w_t, with v_c \in R^{d \times 1}.

The matrix W' in the figure is the context-word matrix: each row stores a word's vector representation when it acts as a context word, W' \in R^{V \times d}. Multiplying this matrix by the center word vector gives an intermediate product v_{tmp} = W' \cdot v_c. Applying softmax to this intermediate product yields, for every word, the probability of being the context word; this probability vector, written p(x|c) = softmax(v_{tmp}), is the vector y_{pred} of size V. In y_{pred} we want the values at the indices of the true context words (there were 4 context words in the example of the previous section) to be large and the values at all other indices to be small.

Both W and W' are learned during training.
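
Translated into code, one forward pass of the figure looks roughly like the following numpy sketch (toy sizes, randomly initialized W and W'; training would continue with the cross-entropy loss and the gradient derived earlier):

import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 6                        # vocabulary size and embedding dimension (toy values)
W = rng.normal(size=(d, V))         # center-word matrix, one column per word
W_prime = rng.normal(size=(V, d))   # context-word matrix, one row per word

t = 3                               # index of the current center word
w_t = np.zeros(V)
w_t[t] = 1.0                        # one-hot representation of the center word

v_c = W @ w_t                       # look up the center word vector, shape (d,)
v_tmp = W_prime @ v_c               # one score per vocabulary word, shape (V,)
y_pred = np.exp(v_tmp - v_tmp.max())
y_pred /= y_pred.sum()              # softmax: p(x | c) for every word x in the vocabulary
print(y_pred.shape)                 # (10,), probabilities sum to 1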

[Figure: the parameter vector theta]

As mentioned earlier, each word has two vector representations, v (as center word) and u (as context word). Concatenating the two (adding them also works) gives the training parameter \theta \in R^{2Vd}. Note that \theta here is one very long vector, not a matrix.

Continuous Bag of Words (CBOW)

Predict target word from bag-of-words context.

Objective Function

Maximize the probability of the center word given its context words. \theta represents all the variables we will optimize. The total number of words is T and the window size is m.

J'(\theta) = \prod_{t=1}^{T} \prod_{j=-m, j \ne 0}^{m} p(w_t \mid w_{t+j}; \theta)

We use negative log likelihood to turn the objective function into a loss function.

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=-m, j \ne 0}^{m} \log p(w_t \mid w_{t+j})

Training Process

[Figure: CBOW training workflow]

The training process is very similar to skip-gram's.

When computing the hidden-layer output, instead of directly copying the input vector of the input context word, the CBOW model takes the average of the vectors of the input context words and uses the product of the input→hidden weight matrix and this average vector as the output. The matrix W in the figure is the context-word matrix: each column stores a word's vector representation when it acts as a context word, W \in R^{d \times V}. If only a single context word were considered in a training step, it could be represented by a one-hot vector x_t, and multiplying the two would give the context word vector v_{context} = W \cdot x_t, with v_{context} \in R^{d \times 1}. When the context contains several words, however, the average of the corresponding context word vectors is usually used as the input: v_{context} = \frac{1}{2m} \sum_{j=-m, j \ne 0}^{m} W \cdot x_j.
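
A minimal numpy sketch of this averaging step (toy sizes; the one-hot lookups are written out explicitly to mirror the formula):

import numpy as np

rng = np.random.default_rng(0)
V, d, m = 10, 6, 2                  # vocabulary size, dimension, window radius (toy values)
W = rng.normal(size=(d, V))         # context-word matrix, one column per word

context_ids = [1, 4, 7, 9]          # indices of the 2m context words around the center word
x = np.zeros((len(context_ids), V))
x[np.arange(len(context_ids)), context_ids] = 1.0   # one one-hot row per context word

# average of the looked-up context word vectors, used as the hidden-layer input
v_context = (W @ x.T).mean(axis=1)  # shape (d,)
print(v_context.shape)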

Improving Training Efficiency

Hierarchical SoftMax

The hierarchical softmax encodes the language model’s output softmax layer into a tree hierarchy, where each leaf is one word and each internal node stands for relative probabilities of the children nodes.

[Figure: hierarchical softmax tree with the path to w_2 highlighted]

An example path from the root to w_2 is highlighted. p^w denotes the path from the root node to the leaf node; in the example shown, the path length is l(w_2) = 4. n(w, j) denotes the j-th node on the path from the root to the word w, d_j^w \in \{0, 1\} is the code of the j-th node on the path p^w, and \theta_j^w is the vector of the j-th node on the path p^w.

In the hierarchical softmax model, there is no output vector representation for words. This effectively replaces the model's dense output layer, because the matrix multiplication from the hidden layer to the output layer is too expensive.

With a Huffman tree, the time complexity drops from O(|V|) to O(\log_2 |V|). In addition, because of the way a Huffman tree is built, high-frequency words get short codes, which speeds up training even further.
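
The following sketch illustrates the idea with made-up numbers (a random hidden vector, three inner nodes on the path, an arbitrary code, and an assumed sign convention of sigma(theta·h) for a 0 bit): the probability of a word is a product of sigmoid decisions along its path, so only about log2|V| node vectors are touched per word:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 6
h = rng.normal(size=d)                  # hidden-layer output for the current sample
path_thetas = rng.normal(size=(3, d))   # vectors theta_j^w of the inner nodes on the path p^w
path_codes = [0, 1, 1]                  # Huffman code d_j^w of the word (made up)

# p(w) is a product of binary decisions along the path: one sigmoid per inner node,
# with the sign chosen by the code bit (assumed convention: 0 -> sigma(theta.h), 1 -> sigma(-theta.h))
p_w = 1.0
for theta, code in zip(path_thetas, path_codes):
    p_w *= sigmoid(theta @ h) if code == 0 else sigmoid(-theta @ h)
print(p_w)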

// TODO: add the derivations for SG and CBOW

Negative Sampling

Each training step does not update the weights of all negative words; it updates only k sampled ones.

For the unigram model, the power of 3/4 works best. Word2vec raises the word frequency to the 0.75 power, which reduces the effect of large differences between word frequencies and gives low-frequency words a chance to be drawn as negative samples.

weight(w) = \frac{count(w)^{0.75}}{\sum_{i=1}^{V} count(i)^{0.75}}

P(w) = U(w)^{0.75} / Z
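
A small numpy sketch of this sampling distribution with made-up counts, showing how the 0.75 exponent flattens the distribution and gives rare words a better chance of being drawn:

import numpy as np

counts = np.array([1000.0, 100.0, 10.0, 1.0])        # made-up word frequencies

p_raw = counts / counts.sum()                        # plain unigram distribution
p_075 = counts**0.75 / (counts**0.75).sum()          # P(w) = U(w)^0.75 / Z
print(p_raw)    # the most frequent word takes roughly 90% of the mass
print(p_075)    # the 0.75 power shifts noticeably more mass to the rare words

# drawing K = 5 negative samples from the adjusted distribution
negatives = np.random.default_rng(0).choice(len(counts), size=5, p=p_075)
print(negatives)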

Loss Function

with SG

Our new objective function:

\log \sigma(u_o^T v_c) + \sum_{k=1}^{K} \mathbb{E}_{j \sim P(w)} \log \sigma(-u_j^T v_c)

Loss function:

J_{neg}(o, v_c, U) = -\log \sigma(u_o^T v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^T v_c)

This maximizes the probability that the real outside word appears and minimizes the probability that random words appear around the center word.

Consider a pair (w, c) of word and context. Did this pair come from the training data? Let’s denote by P(D = 1|w, c) the probability that (w, c) came from the corpus data. Correspondingly, P(D = 0|w, c) will be the probability that (w, c) did not come from the corpus data. First, let’s model P(D = 1|w, c) with the sigmoid function

P(D = 1 \mid w, c, \theta) = \frac{1}{1 + \exp(-v_c^T v_w)}

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. (Here we take θ to be the parameters of the model, and in our case it is V and U.)
\theta = \arg\max_\theta \prod_{(w,c) \in D} P(D=1 \mid w,c,\theta) \prod_{(w,c) \in \widetilde D} P(D=0 \mid w,c,\theta)
= \arg\max_\theta \prod_{(w,c) \in D} P(D=1 \mid w,c,\theta) \prod_{(w,c) \in \widetilde D} \big(1 - P(D=1 \mid w,c,\theta)\big)
= \arg\max_\theta \sum_{(w,c) \in D} \log P(D=1 \mid w,c,\theta) + \sum_{(w,c) \in \widetilde D} \log\big(1 - P(D=1 \mid w,c,\theta)\big)
= \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \widetilde D} \log\Big(1 - \frac{1}{1 + \exp(-u_w^T v_c)}\Big)
= \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \widetilde D} \log \frac{1}{1 + \exp(u_w^T v_c)}
\widetilde D stands for a "false" corpus, e.g. a corpus of unnatural sentences.

Our new objective function:

\log \sigma(u_{c-m+j}^T v_c) + \sum_{k=1}^{K} \log \sigma(-\widetilde u_k^T v_c)

In the above formulation, \{\widetilde u_k \mid k = 1, \dots, K\} are sampled from P(w).

Maximize the probability of positive samples and minimize the probability of negative samples. The sigmoid replaces the softmax, which turns the problem into a binary classification between positive and negative samples.

E = -\log \sigma({v'_{w_{pos}}}^T h) - \sum_{w_j \in W_{neg}} \log \sigma(-{v'_{w_j}}^T h)

Taking the derivative with respect to v' in W':

v_{w_j}^{'(t+1)} = v_{w_j}^{'(t)} - \eta \big(\sigma(v_{w_j}^{'(t)T} h) - t_j\big) h
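
A minimal numpy sketch of the loss E and this update rule (random toy vectors; t_j is 1 for the positive word and 0 for each sampled negative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, K, eta = 6, 5, 0.025
h = rng.normal(size=d)                  # hidden-layer output (v_c in the skip-gram case)
v_pos = rng.normal(size=d)              # output vector of the true context word
v_neg = rng.normal(size=(K, d))         # output vectors of the K sampled negative words

# E = -log sigma(v_pos . h) - sum_j log sigma(-v_neg_j . h)
E = -np.log(sigmoid(v_pos @ h)) - np.log(sigmoid(-v_neg @ h)).sum()

# update rule: v' <- v' - eta * (sigma(v'.h) - t_j) * h, with t_j = 1 / 0 for positive / negative words
v_pos_new = v_pos - eta * (sigmoid(v_pos @ h) - 1.0) * h
v_neg_new = v_neg - eta * (sigmoid(v_neg @ h) - 0.0)[:, None] * h
print(E)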

with CBOW

Doc2Vec

In fact, the vector representation of a sentence can be obtained by taking a weighted average of the vectors of its words, with high-frequency words given slightly smaller weights. Sentence vectors obtained this way actually work quite well. However, this approach has one flaw: it ignores the effect of word order on the meaning of the sentence or text. For example, if the subject and object of a sentence are swapped, the meaning is completely different, yet this method yields exactly the same vector.
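
As a sketch of this weighted-averaging idea (random toy word vectors, made-up frequencies, and an inverse-frequency weight of the form a/(a+freq) chosen only for illustration), note how swapping subject and object leaves the sentence vector unchanged:

import numpy as np

rng = np.random.default_rng(0)
dim = 4
word_vecs = {w: rng.normal(size=dim) for w in ["the", "cat", "chased", "dog"]}
word_freq = {"the": 0.05, "cat": 0.001, "chased": 0.0005, "dog": 0.001}   # made-up frequencies

def sentence_vector(tokens, a=1e-3):
    # frequent words get smaller weights via a / (a + freq)
    weights = np.array([a / (a + word_freq[t]) for t in tokens])
    vecs = np.array([word_vecs[t] for t in tokens])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

# word order is ignored: swapping subject and object gives exactly the same vector
s1 = sentence_vector("the cat chased the dog".split())
s2 = sentence_vector("the dog chased the cat".split())
print(np.allclose(s1, s2))   # True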

Doc2vec is used to address this problem: when representing a paragraph or document as a vector, it takes the effect of word order on semantics into account.

pros

  1. An unsupervised learning method, so it can be applied when there is not enough labeled training data.

cons

  1. missing new words (impossible to keep up to date).

Training Algorithms

Distributed Memory (PV-DM)

PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both the context word-vectors and the full document's doc-vector. The doc-vector acts as a memory that remembers what is missing from the current context, or as the topic of the paragraph. Despite its name, PV-DM corresponds to the CBOW mode of word2vec: it predicts the probability of a word given both the context and the document vector.

During training, the DM model first initializes a K-dimensional vector for each document ID and for every word in the corpus. The document vector and the context word vectors are then fed into the model, and the hidden layer sums them (or averages them, or concatenates them) into an intermediate vector, which becomes the input to the output softmax layer. Throughout the training of a document, the document ID stays the same and the same document vector is shared, so every word prediction makes use of the semantics of the whole document.
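
A minimal numpy sketch of how the PV-DM input is formed (toy sizes; both the averaging and the concatenation variants mentioned above are shown):

import numpy as np

rng = np.random.default_rng(0)
N, V, k = 3, 10, 6                  # number of documents, vocabulary size, vector size (toy values)
D = rng.normal(size=(N, k))         # one k-dimensional vector per document ID
W = rng.normal(size=(V, k))         # one k-dimensional vector per word

doc_id = 1
context_ids = [2, 5, 7]             # word indices in the current context window

# hidden layer: combine the shared document vector with the context word vectors
h_avg = np.vstack([D[doc_id], W[context_ids]]).mean(axis=0)      # averaging variant, shape (k,)
h_cat = np.concatenate([D[doc_id], W[context_ids].ravel()])      # concatenation variant, shape (k + 3k,)
print(h_avg.shape, h_cat.shape)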

[Figure: PV-DM architecture]
At first sight the figure seems to say that only the preceding words are used to predict the following word, e.g. in this example "the cat sat" precedes "on". That is somewhat misleading: the loss function in the original paper still includes both the preceding and the following context of a center word.

Suppose there are N documents in total, mapped to vectors of dimension p, and the dictionary contains V words, mapped to vectors of dimension q. Then the model has N*p + V*q parameters in total.

Distributed Bag of Words (PV-DBOW)

PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. Despite its name, PV-DBOW corresponds to the SG mode of word2vec: in each iteration a window is sampled from the text, one word is then sampled at random from that window as the prediction target, and the model has to predict it with the paragraph vector as the input.

[Figure: PV-DBOW architecture]

This training mode is usually much faster than DM and needs less memory, but it is not as accurate as DM.

Using gensim

Parameters

  • dm: 0 = DBOW; 1 = DMPV. The training mode of the model.
  • vector_size: Dimensionality of the feature vectors.
  • window: The maximum distance between the current and predicted word within a sentence.
  • min_count: Ignores all words with total frequency lower than this.
  • sample: this is the sub-sampling threshold to downsample frequent words; 10e-5 is usually good for DBOW, and 10e-6 for DMPV.
  • hs: 1 turns on hierarchical softmax; this is rarely turned on, as negative sampling is in general better.
  • negative: number of negative samples; 5 is a good value.
  • dm_mean (optional): If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.
  • dm_concat (optional): If 1, use concatenation of context vectors rather than sum/average; Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
  • dbow_words (optional): If set to 1 trains word-vectors (in skip-gram fashion) simultaneous with DBOW doc-vector training; If 0, only trains doc-vectors (faster).

Methods

class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.

    Replaces "sentence as a list of words" from Word2Vec.

Many people wonder why doc2vec, an unsupervised method, needs a words/tags option. Looking at the documentation, we can see that it is enough to pass a unique identifier for each document as its tag. We can of course also pass the corresponding class label, but that does not make doc2vec treat the documents as labeled data. Note that the tags must be passed as a list.
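
For example, a minimal sketch that tags two toy tokenized documents with unique integer ids:

from gensim.models.doc2vec import TaggedDocument

docs = [["the", "cat", "sat"], ["banking", "crises", "are", "turning"]]
# one unique integer id per document, passed as a single-element list of tags
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]
print(tagged[0].words, tagged[0].tags)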

Example

The example below is based on a three-class classification problem and uses the dm mode. xtrain["ngram"] holds the already tokenized corpus.

import numpy as np
import matplotlib.colors
from sklearn.manifold import TSNE
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

def getVec(model, tagged_docs, epochs=20):
  # infer a vector for every tagged document with the trained model
  sents = tagged_docs.values
  regressors = [model.infer_vector(doc.words, epochs=epochs) for doc in sents]
  return np.array(regressors)

def plotVec(ax, x, y, title="title"):
  scatter = ax.scatter(x[:, 0], x[:, 1], c=y, 
             cmap=matplotlib.colors.ListedColormap(["red", "blue", "yellow"]))
  ax.set_title(title)
  ax.legend(*scatter.legend_elements(), loc=0, title="Classes")

xtrain_tagged = xtrain.apply(
    lambda r: TaggedDocument(words=r["ngram"], tags=[r["Label"]]), axis=1
)

# PV-DM mode with negative sampling (hs=0); xtrain, ytrain and ax1 come from the surrounding notebook
model_dm = Doc2Vec(dm=1, vector_size=30, negative=5, hs=0, min_count=2, sample=0)
model_dm.build_vocab(xtrain_tagged.values)
for epoch in range(10):
    sents = xtrain_tagged.values
    model_dm.train(sents, total_examples=len(sents), epochs=1)
    model_dm.alpha -= 0.002          # manually decay the learning rate after each pass
    model_dm.min_alpha = model_dm.alpha
xtrain_vec = getVec(model_dm, xtrain_tagged)
xtrain_tsne = TSNE(n_components=2, metric="cosine").fit_transform(xtrain_vec)
plotVec(ax1, xtrain_tsne, ytrain, title="training")

Reference


Reposted from blog.csdn.net/qq_40136685/article/details/108354404