The principle and implementation of word2vec

Word2vec was a standard preprocessing step in early NLP, used to generate word vector representations (embeddings):

w_{embed} = f(word)=[0.1,0.2,...,0.21,0.32]

It maps each word to a fixed-length vector (an embedding vector). Because correlations between words are expressed more naturally in vector form, downstream NLP tasks such as classification and generation become easier to learn and train. The correlation that word2vec captures is mainly the co-occurrence of a word with the other words in its context. There are two main paradigms:

  • Skip-gram model: it assumes that the context words w_{c\pm i} are generated from the center word w_c, so the goal is to maximize the conditional probability of the context given the center word, P(w_{c-k},...,w_{c-1},w_{c+1},...,w_{c+k}|w_c). This gives the optimization objective below, where C is the number of center words and k is the context window size (see the pair-extraction sketch after this list).

\prod^C_c P(w_{c-k},...,w_{c-1},w_{c+1},...,w_{c+k}|w_c)=\prod^C_c \prod^k_{i=1} P(w_{c-i}|w_c)*P(w_{c+i}|w_c)

  • Continuous bag of words (CBOW): it assumes that the center word w_c is generated from its context w_{c\pm i}, so the goal is to maximize the conditional probability of the center word given the context, P(w_c | w_{c-k},...,w_{c-1},w_{c+1},...,w_{c+k}), i.e. the following objective:

\prod^C_c P(w_c|w_{c-k},...,w_{c-1},w_{c+1},...,w_{c+k})
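As a concrete illustration of the two paradigms, the following toy Python sketch (hypothetical tokens, not code from the original article) shows how (center, context) pairs for Skip-gram and (context, center) pairs for CBOW are extracted from a token sequence with window size k:

# A minimal sketch (not from the original article): extracting training pairs
# from a tokenized sentence for Skip-gram and CBOW with window size k.
def build_pairs(tokens, k=2):
    skipgram_pairs, cbow_pairs = [], []
    for c, center in enumerate(tokens):
        context = [tokens[i] for i in range(max(0, c - k), min(len(tokens), c + k + 1)) if i != c]
        # Skip-gram: the center word predicts each context word
        skipgram_pairs.extend((center, w) for w in context)
        # CBOW: the whole context predicts the center word
        cbow_pairs.append((context, center))
    return skipgram_pairs, cbow_pairs

sg, cbow = build_pairs(["the", "quick", "brown", "fox", "jumps"], k=2)
print(sg[:4])    # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
print(cbow[0])   # (['quick', 'brown'], 'the')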

This article focuses on the Skip-gram model. Taking the negative logarithm turns the maximization above into the following minimization:

Min - \sum^C_c \sum^{\pm k}_{i=\pm 1} log(P(w_{c+i}|w_c))

P(w_u|w_c)=\frac{P(w_u, w_c)}{P(w_c)}=\frac{P(w_u, w_c)}{\sum P(w_*, w_c)}

The joint distribution P(w_u, w_c) in the formula above can be measured by word vector similarity; for convenience of calculation, word2vec measures it in exponential form:

P(w_u|w_c)=\frac{P(w_u, w_c)}{P(w_c)}=\frac{P(w_u, w_c)}{\sum P(w_*, w_c)}=\frac{e^{Ew_u * Ew_c}}{\sum e^{Ew_* *Ew_c}}

In the formula above, Ew denotes a word vector. word2vec realizes the conversion from word to word vector through an Embedding module trained so that the loss above is minimized. The Embedding module is essentially a fully connected layer: its input is an N-dimensional one-hot vector (N is the vocabulary size), its output is an L-dimensional vector (L is the length of the word vector), and it therefore has N*L parameters in total.
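To make this equivalence concrete, here is a small Paddle sketch (the values of N and L are toy numbers chosen for illustration) showing that an embedding lookup gives the same result as multiplying a one-hot vector by the N x L weight matrix:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

N, L = 10, 4                      # toy vocabulary size and embedding length
embed = nn.Embedding(N, L)        # holds an N x L weight matrix, i.e. N*L parameters

word_id = paddle.to_tensor([3])
v_lookup = embed(word_id)                                      # direct table lookup
one_hot = F.one_hot(word_id, num_classes=N).astype('float32')  # [1, N] one-hot vector
v_matmul = paddle.matmul(one_hot, embed.weight)                # one-hot @ weight matrix

print(paddle.allclose(v_lookup, v_matmul))                     # True: identical results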

Training word2vec amounts to solving for the weight parameters of this Embedding module; its rows form the word vectors, e.g. Ew_c for the center word c. The partial derivative of the log conditional probability with respect to Ew_c is:

\frac{\partial\, logP(w_o|w_c)}{\partial Ew_c} = \frac{\partial }{\partial Ew_c}\left(Ew_o*Ew_c-log(\sum P(w_*, w_c))\right)
= Ew_o-\frac{\sum Ew_*\, P(w_*, w_c)}{\sum P(w_*, w_c)}
= Ew_o-\sum P(w_*|w_c)\, Ew_*
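The derivation can be checked numerically. The toy NumPy sketch below (not part of the original article) uses separate center and context embedding matrices so that Ew_* does not depend on Ew_c, matching the assumption in the derivation, and compares the analytic gradient Ew_o - sum_* P(w_*|w_c) Ew_* against a central-difference estimate:

import numpy as np

np.random.seed(0)
N, L = 6, 4                       # toy vocabulary size and embedding length
V = np.random.randn(N, L)         # center-word vectors Ew_c
U = np.random.randn(N, L)         # context-word vectors Ew_* (kept separate so the
                                  # softmax terms do not depend on Ew_c)
c, o = 0, 2                       # center word index and observed context word index

def log_p(o, c, V):
    scores = U @ V[c]             # Ew_* . Ew_c for every word *
    return scores[o] - np.log(np.exp(scores).sum())

# analytic gradient: Ew_o - sum_* P(w_*|w_c) Ew_*
probs = np.exp(U @ V[c]) / np.exp(U @ V[c]).sum()
analytic = U[o] - probs @ U

# central-difference estimate of d log P(w_o|w_c) / d Ew_c
eps, numeric = 1e-6, np.zeros(L)
for j in range(L):
    Vp, Vm = V.copy(), V.copy()
    Vp[c, j] += eps
    Vm[c, j] -= eps
    numeric[j] = (log_p(o, c, Vp) - log_p(o, c, Vm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True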

Below we define the word2vec network structure in Paddle:

import paddle
import paddle.nn as nn


class Word2Vec(nn.Layer):
    def __init__(self, num_embeddings, embedding_dim):
        super(Word2Vec, self).__init__()
        # Embedding table: num_embeddings (vocabulary size N) x embedding_dim (L)
        self.embed = nn.Embedding(num_embeddings, embedding_dim,
                weight_attr=paddle.ParamAttr(
                    name="center_embed",
                    initializer=paddle.nn.initializer.XavierUniform()))

    # forward computation
    def forward(self, center, contexts_and_negatives=None):
        """Skip-Gram"""
        v = self.embed(center)                      # [B, 1, L] center-word vectors
        if contexts_and_negatives is None:
            return v                                # embedding lookup only
        u = self.embed(contexts_and_negatives)      # [B, max_context_len, L]
        # dot product between the center vector and every context/negative vector
        pred = paddle.squeeze(paddle.bmm(v, u.transpose(perm=[0, 2, 1])), axis=1)
        return pred                                 # [B, max_context_len]

In the definition above, pred represents Ew_u*Ew_c, the dot product between each context (or negative) word vector and the center word vector.
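As a usage sketch (the sizes below are hypothetical toy values, not the article's settings), a batch of B center words of shape [B, 1] and a batch of context/negative words of shape [B, max_context_len] produce a prediction of shape [B, max_context_len]:

import paddle

# hypothetical toy sizes for illustration
num_embeddings, embedding_dim, B, max_context_len = 100, 8, 4, 6

net = Word2Vec(num_embeddings, embedding_dim)
center = paddle.randint(0, num_embeddings, shape=[B, 1])                  # [B, 1]
contexts_and_negatives = paddle.randint(0, num_embeddings,
                                        shape=[B, max_context_len])       # [B, max_context_len]

pred = net(center, contexts_and_negatives)
print(pred.shape)            # [4, 6]: one dot product Ew_u . Ew_c per context/negative word
print(net(center).shape)     # [4, 1, 8]: embedding lookup only, when no contexts are given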

During training, we compute the loss batch by batch. Solving for P(w_u|w_c) directly involves a softmax over the whole vocabulary, which is expensive to compute, so we simplify training with negative sampling.

Negative sampling first defines the probability that the pair (w_u, w_c) co-occurs, i.e. that word u appears within the window k of the center word c:

P(D=1|w_u,w_c)=\sigma (Ew_u*Ew_c)=\frac{1}{1 + e^{-Ew_u*Ew_c}}

Similarly, the probability of the pair not co-occurring within the window k is:

P(D=0|w_u,w_c)=1-\sigma (Ew_u*Ew_c)=1-\frac{1}{1 + e^{-Ew_u*Ew_c}}

The conditional probability can then be approximated as:

P(w_u|w_c)=P(D=1|w_u,w_c)*\prod_{*\sim P(w)} P(D=0|w_*,w_c)
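The noise distribution P(w) used for the product over negatives is commonly the unigram distribution raised to the 0.75 power. The helper below is a hypothetical sketch of drawing h negatives per center word, not the article's data pipeline:

import numpy as np

def build_negative_sampler(word_counts, power=0.75):
    """Return a function that samples word ids from the noise distribution P(w),
    i.e. unigram counts raised to the 0.75 power (a common word2vec choice)."""
    probs = np.array(word_counts, dtype=np.float64) ** power
    probs /= probs.sum()
    ids = np.arange(len(word_counts))

    def sample(h, exclude):
        # draw h negatives, rejecting words that actually occur in the context window
        negatives = []
        while len(negatives) < h:
            w = np.random.choice(ids, p=probs)
            if w not in exclude:
                negatives.append(int(w))
        return negatives

    return sample

# usage with toy counts: word 0 is very frequent, word 4 is rare
sampler = build_negative_sampler([50, 20, 10, 5, 1])
print(sampler(h=3, exclude={1, 2}))   # e.g. [0, 0, 3] -- 3 negatives outside the window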

The loss over a batch can then be expressed as:

-\sum^B \left( \sum^{\pm k}_{i=\pm 1} log P(D=1|w_{c+i},w_c)+\sum^h log P(D=0|w_h,w_c) \right)

where B is the batch, k is the window size for positive examples, and h is the number of negative examples (words that are not in the context window). This loss can be expressed with the binary_cross_entropy_with_logits loss function:

Out = -label * log(\sigma (logits)) - (1-label) * log(1 - \sigma (logits))

Here label marks whether the word is a positive or negative example, and logits is Ew_u*Ew_c, so we can implement the loss function as follows:

import paddle.nn as nn


class SigmoidBCELoss(nn.Layer):
    # binary cross-entropy loss with a mask
    def __init__(self):
        super().__init__()

    def forward(self, inputs, label, mask):
        # element-wise BCE; the mask zeroes out padded positions
        out = nn.functional.binary_cross_entropy_with_logits(
            logit=inputs, label=label, weight=mask, reduction="none")
        return out.mean(axis=1)
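A short sketch (toy tensors with hypothetical values) of how the mask removes padded positions from the loss:

import paddle

loss_fn = SigmoidBCELoss()

# logits Ew_u . Ew_c for one sample with 4 context/negative slots, the last one padded
logits = paddle.to_tensor([[2.0, -1.0, 0.5, 9.9]])
label  = paddle.to_tensor([[1.0,  0.0, 0.0, 0.0]])   # 1 = positive, 0 = negative
mask   = paddle.to_tensor([[1.0,  1.0, 1.0, 0.0]])   # padded position contributes 0 loss

print(loss_fn(logits, label, mask))   # the masked position is ignored regardless of its logit

Note that out.mean(axis=1) divides by the full row length, including masked slots; dividing by mask.sum(axis=1) instead would average only over the real (unpadded) positions.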

The overall Paddle training code is as follows:

# center word
center_spec = paddle.static.InputSpec([None, 1], 'int64', 'center')
# context (positive) words and negative-sample words
context_spec = paddle.static.InputSpec([None, max_context_len], 'int64', 'contexts_and_negatives')
# labels marking positive vs. negative examples
label_spec = paddle.static.InputSpec([None, max_context_len], 'float32', 'label')
# mask: padded positions outside the positives/negatives are 0 and do not contribute to training
mask_spec = paddle.static.InputSpec([None, max_context_len], 'float32', 'mask')

model = paddle.Model(Word2Vec(num_embeddings, embedding_dim), [center_spec, context_spec], [label_spec, mask_spec])
model.prepare(
    optimizer=paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters()),
    loss=SigmoidBCELoss()
)
model.fit(
    train_dataset, 
    valid_dataset,
    batch_size=batch_size,
    epochs=num_epochs, 
    eval_freq=1,
    shuffle=True,
    save_dir=save_model_dir,
    callbacks=[loss_print, vdl_record]
)
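After training, the learned embedding matrix can be read out of the network and queried with cosine similarity. The sketch below assumes hypothetical word_to_idx/idx_to_word vocabulary mappings from the data pipeline, which are not defined in the article:

import paddle
import paddle.nn.functional as F

def most_similar(query_word, net, word_to_idx, idx_to_word, top_k=5):
    """Return the top_k words whose embeddings have the highest cosine
    similarity with the query word's embedding (a post-training sketch)."""
    W = net.embed.weight                                   # [num_embeddings, embedding_dim]
    q = W[word_to_idx[query_word]].unsqueeze(0)            # [1, embedding_dim]
    sims = F.cosine_similarity(q, W, axis=1)               # [num_embeddings]
    best = paddle.argsort(sims, descending=True)[1:top_k + 1]  # skip the query word itself
    return [(idx_to_word[int(i)], float(sims[int(i)])) for i in best]

# e.g. most_similar("bank", model.network, word_to_idx, idx_to_word)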

Word Embeddings from Global Vectors (GloVe)

GloVe mainly introduces two changes to the loss function above:

  • A global co-occurrence weight x_{u,v} is introduced to denote the number of times word u and word v co-occur. The global loss function can then be expressed as:

-\sum^N_u \sum^N_v x_{u,v} log(P(w_u|w_v))

  • The conditional probability P(w_u|w_c) is redefined: empirically it equals x_{u,c}/x_c (where x_c is the total count for word c), and assuming P(w_u|w_c)\approx \alpha e^{(Ew_u*Ew_c)}, the learning goal becomes:

\alpha e^{(Ew_u*Ew_c)}-x_{u,c}/x_c=0\Rightarrow Ew_u*Ew_c+log(\alpha)-log(x_{u,c}) + log(x_c)=0

  • Absorbing the constant terms log(\alpha) and log(x_c) into learnable bias terms a_u and b_v, and adding a weighting function h(x_{u,v}), the GloVe loss function is defined as (a code sketch follows the formula):

\sum^N_u \sum^N_v h(x_{u,v})(Ew_u*Ew_v+a_u + b_v - log(x_{u,v}))^2
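Accordingly, here is a minimal Paddle sketch of the GloVe objective (not the original GloVe implementation; the weighting function h uses the commonly cited cap x_max=100 and exponent 0.75, which are assumptions here rather than values from the article):

import paddle
import paddle.nn as nn

class GloVeLoss(nn.Layer):
    """Weighted squared error over co-occurrence counts:
    sum h(x_uv) * (Ew_u . Ew_v + a_u + b_v - log x_uv)^2"""
    def __init__(self, num_embeddings, embedding_dim, x_max=100.0, power=0.75):
        super().__init__()
        self.embed_u = nn.Embedding(num_embeddings, embedding_dim)
        self.embed_v = nn.Embedding(num_embeddings, embedding_dim)
        self.bias_u = nn.Embedding(num_embeddings, 1)
        self.bias_v = nn.Embedding(num_embeddings, 1)
        self.x_max, self.power = x_max, power

    def forward(self, u, v, x_uv):
        # h(x): clipped power weighting, so rare co-occurrences get small weight
        h = paddle.clip(x_uv / self.x_max, max=1.0) ** self.power
        dot = (self.embed_u(u) * self.embed_v(v)).sum(axis=-1)
        pred = dot + self.bias_u(u).squeeze(-1) + self.bias_v(v).squeeze(-1)
        return (h * (pred - paddle.log(x_uv)) ** 2).sum()

Here u and v are batches of word-index pairs and x_uv is their co-occurrence count taken from the global co-occurrence matrix.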
