[Study Notes] Word Vector Model: Word2vec

Reference materials:
- [word2vec word vector model] Detailed explanation of the principle + implementation code
- The classic NLP (natural language processing) model Word2vec

Background knowledge for the paper

Word representation methods

One-hot representation

Simple, but the vector length grows with the vocabulary size, and it cannot express relationships between words.

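As a quick illustration, here is a minimal sketch of one-hot representation for a toy vocabulary (the vocabulary and words are made up for the example):

```python
import numpy as np

# Toy vocabulary; in practice it would be built from the corpus.
vocab = ["cat", "dog", "apple", "banana"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a word: all zeros except a 1 at its index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("dog"))                   # [0. 1. 0. 0.]
# Any two different words have zero dot product, so one-hot vectors
# carry no notion of similarity between words.
print(one_hot("cat") @ one_hot("dog"))  # 0.0
```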

Prerequisite knowledge

The concept of a language model

A language model is a model that computes the probability that a sequence of words forms a valid sentence (both grammatically and semantically).

Development of language models

Language models based on expert grammar rules

Linguists tried to summarize a set of universal grammar rules, for example "an adjective is followed by a noun".

Statistical language models

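For reference, the standard formulation (presumably what the figures showed): a statistical language model factorizes the sentence probability with the chain rule and estimates each conditional probability from corpus counts:

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
P(w_i \mid w_1, \dots, w_{i-1}) \approx
\frac{\operatorname{count}(w_1, \dots, w_i)}{\operatorname{count}(w_1, \dots, w_{i-1})}
```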
Some words may never appear in the corpus, or a phrase may be too long to occur even once, so its estimated probability becomes 0. To solve this problem, the smoothing operations used in statistical language models are introduced below.

Smoothing operations in statistical language models

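One standard smoothing operation is add-one (Laplace) smoothing; a minimal example for a bigram model (the original note may have used a different variant):

```latex
P(w_i \mid w_{i-1}) = \frac{\operatorname{count}(w_{i-1}, w_i) + 1}{\operatorname{count}(w_{i-1}) + |V|}
```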
But smoothing only fixes the zero-probability problem for individual word and phrase probabilities.
To solve the problem of the overly large parameter space, the Markov assumption is introduced: the probability of a word is assumed to depend only on the previous n-1 words.
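Under the Markov assumption, each word is conditioned only on the previous n-1 words; for example, a bigram model:

```latex
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```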

Language model evaluation metrics

Each field has its own evaluation metrics.

A language model can be regarded as a multi-class classification problem (predicting the next word). The standard metric is perplexity; taking the n-th root normalizes for sentence length, so that long sentences are not judged worse simply because their joint probability is smaller than that of short sentences.

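In formula form (perplexity over a sentence of n words; lower is better):

```latex
\operatorname{PP}(s) = P(w_1, w_2, \dots, w_n)^{-\frac{1}{n}}
= \sqrt[n]{\frac{1}{P(w_1, w_2, \dots, w_n)}}
```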

Comparison models

NNLM


Language modeling is unsupervised (self-supervised): it does not require a manually labeled corpus.

Input layer


If a computation can be written as a matrix operation instead of a loop, write it as a matrix operation; this reduces complexity and is much faster in practice.
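A small sketch of that point for the input layer (variable names and shapes are my own for illustration): multiplying a one-hot matrix by the embedding matrix is the same as indexing rows, and both replace an explicit Python loop:

```python
import numpy as np

V, D = 10000, 100                    # vocabulary size, embedding dimension
C = np.random.randn(V, D)            # embedding (projection) matrix
context_ids = np.array([3, 17, 42])  # indices of the context words

# Loop version: look up one embedding at a time.
loop_out = np.stack([C[i] for i in context_ids])

# Matrix version: one-hot matrix times C.
onehot = np.zeros((len(context_ids), V))
onehot[np.arange(len(context_ids)), context_ids] = 1.0
matrix_out = onehot @ C              # same result, as a single matrix product

assert np.allclose(loop_out, matrix_out)
```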

Hidden layer


Output layer

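A minimal sketch of the output layer of a Bengio-style NNLM (with the hidden layer included for context; variable names and sizes are my own), continuing from the concatenated context embeddings x:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, D, N, H = 10000, 100, 4, 128          # vocab, embedding dim, context size, hidden size
x = np.random.randn(N * D)               # concatenated context embeddings (input layer)

W_h = np.random.randn(H, N * D) * 0.01   # hidden layer weights
b_h = np.zeros(H)
W_o = np.random.randn(V, H) * 0.01       # output layer weights (the expensive V x H part)
b_o = np.zeros(V)

h = np.tanh(W_h @ x + b_h)               # hidden layer
p = softmax(W_o @ h + b_o)               # output layer: distribution over all V words
print(p.shape, round(p.sum(), 6))        # (10000,) 1.0
```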

Loss


Batch size is a trade-off.
Since sentences have different lengths, padding must be added to form a batch, but the padded positions must be excluded (masked out) when computing the loss.

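A minimal sketch of masking pad positions out of the loss (pure NumPy, with made-up shapes; a real implementation would typically use a framework's ignore_index or mask argument):

```python
import numpy as np

PAD = 0  # word id reserved for padding

def masked_nll(log_probs, targets):
    """Average negative log-likelihood over non-pad positions only.

    log_probs: (batch, seq_len, vocab) log-probabilities from the model
    targets:   (batch, seq_len) target word ids, PAD after the sentence ends
    """
    mask = (targets != PAD)                                          # True for real tokens
    token_ll = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return -(token_ll * mask).sum() / mask.sum()                     # pads contribute nothing

# Tiny example: batch of 2 sentences, max length 3, vocabulary of 5 words.
logits = np.random.randn(2, 3, 5)
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))   # log-softmax
targets = np.array([[2, 4, PAD],
                    [1, PAD, PAD]])
print(masked_nll(log_probs, targets))
```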

RNNLM

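In formula form (a standard simple RNN language model, which the figures presumably illustrated): the hidden state carries the whole history, so no fixed context window is needed:

```latex
h_t = \tanh(W x_t + U h_{t-1}), \qquad
P(w_{t+1} \mid w_1, \dots, w_t) = \operatorname{softmax}(O\, h_t)
```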

word2vec


Multi-class logistic regression (softmax regression) is a log-linear model.
The skip-gram and CBOW models described below are also log-linear models.
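For concreteness, the softmax (multi-class logistic regression) form; it is log-linear because the log-probability is linear in the parameters apart from the normalization term:

```latex
P(y = k \mid x) = \frac{\exp(w_k^{\top} x)}{\sum_{j=1}^{K} \exp(w_j^{\top} x)},
\qquad
\log P(y = k \mid x) = w_k^{\top} x - \log \sum_{j=1}^{K} \exp(w_j^{\top} x)
```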

The principle of word2vec


skip-gram

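A minimal sketch of the skip-gram setup (window size and variable names are my own): each center word predicts each of its surrounding words with a softmax over the vocabulary:

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs from a list of tokens."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# p(context | center) = softmax(U @ v_center), where V_in holds center-word
# vectors and U holds context-word (output) vectors.
V, D = 1000, 50
V_in = np.random.randn(V, D) * 0.01
U = np.random.randn(V, D) * 0.01
probs = softmax(U @ V_in[7])        # one probability per vocabulary word

print(list(skipgram_pairs(["I", "like", "natural", "language"], window=1)))
```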

CBOW


The bag-of-words model ignores the order of words.

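In formula form (the standard CBOW objective, which the figures presumably illustrated): the context word vectors are averaged and the average is used to predict the center word:

```latex
h = \frac{1}{2c} \sum_{-c \le j \le c,\; j \ne 0} u_{t+j},
\qquad
P(w_t \mid \text{context}) = \frac{\exp(v_{w_t}^{\top} h)}{\sum_{w \in V} \exp(v_w^{\top} h)}
```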

Key technologies

The complexity of softmax needs to be reduced.

hierarchical softmax

The idea is to convert the softmax computation into a sequence of sigmoid computations organized in a binary tree over the vocabulary.
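In formula form (the standard hierarchical softmax construction): the probability of a word is a product of sigmoid decisions along the path from the root of the binary tree to that word's leaf, so the cost per word drops from O(V) to O(log V):

```latex
P(w \mid h) = \prod_{k=1}^{L(w)-1} \sigma\!\left( s_k \cdot \theta_{n_k}^{\top} h \right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

Here n_1, ..., n_L(w) is the path from the root to the leaf of w, θ_n is the parameter vector of internal node n, s_k is +1 if the path goes to the left child at step k and -1 otherwise, and h is the input vector (the center word vector in skip-gram, the averaged context vector in CBOW).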

Hierarchical softmax in skip-gram


Hierarchical softmax in CBOW


The difference from skip-gram's hierarchical softmax is that the input vector u0 is the average of the context word vectors.
With hierarchical softmax, skip-gram ends up with a complete set of center word vectors only (it can no longer average center and context vectors as before), while CBOW ends up with a complete set of context (surrounding) word vectors only.

negative sampling

Idea: convert the multi-class classification problem into a set of binary classification problems.
In practice, negative sampling works better than hierarchical softmax.

Generally, 3-10 negative samples are drawn for each positive (center, context) pair.

skip-gram negative sampling


Important (informative) words tend to appear less frequently, while unimportant words (e.g., stop words) appear more frequently.
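A minimal sketch of skip-gram with negative sampling (variable names are my own): the true (center, context) pair is pushed toward label 1 and k sampled words toward label 0, and negatives are drawn from the unigram distribution raised to the 3/4 power (the exponent used in the word2vec paper), which boosts rarer words relative to very frequent ones:

```python
import numpy as np

def sgns_loss(v_center, u_context, u_negatives):
    """Negative-sampling loss for one (center, context) pair.

    v_center:    (D,)   center word vector
    u_context:   (D,)   output vector of the true context word
    u_negatives: (k, D) output vectors of the k sampled negative words
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(u_context @ v_center))           # true pair -> label 1
    neg = np.log(sigmoid(-u_negatives @ v_center)).sum()  # sampled pairs -> label 0
    return -(pos + neg)

# Sampling distribution for negatives: unigram count ** 0.75, renormalized.
counts = np.array([900.0, 50.0, 30.0, 20.0])   # toy word counts
p_neg = counts ** 0.75
p_neg /= p_neg.sum()
negatives = np.random.choice(len(counts), size=3, p=p_neg)

D = 50
print(sgns_loss(np.random.randn(D), np.random.randn(D), np.random.randn(3, D)))
```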

CBOW negative sampling


Re-sampling (subsampling of frequent words)

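For reference, the subsampling rule from the word2vec paper (probably what the figures showed): each occurrence of a word w is discarded with a probability that grows with its frequency f(w), where the threshold t is typically around 1e-5:

```latex
P(\text{discard } w) = 1 - \sqrt{\frac{t}{f(w)}}
```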

Model complexity


The number of training epochs E and the number of training words T are taken to be the same across models, so Q (the computation required per training example) is used below to represent model complexity; total training complexity is proportional to E × T × Q.

NNLM


Using hierarchical softmax, the V × H term becomes log₂(V) × H.

RNNLM


Skip-gram


skip-gram negative sampling


CBOW


Comparison

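For reference, the per-example complexity terms Q reported in the word2vec paper (with hierarchical softmax for the output layer), which is likely what the comparison figure summarized; N is the context size, D the embedding dimension, H the hidden size, C the window size, and V the vocabulary size:

```latex
\begin{aligned}
Q_{\text{NNLM}} &= N \times D + N \times D \times H + H \times \log_2 V \\
Q_{\text{RNNLM}} &= H \times H + H \times \log_2 V \\
Q_{\text{CBOW}} &= N \times D + D \times \log_2 V \\
Q_{\text{skip-gram}} &= C \times (D + D \times \log_2 V)
\end{aligned}
```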

Source: blog.csdn.net/zhangyifeng_1995/article/details/132719661