Reference materials:
[word2vec word vector model] Detailed explanation of the principle, plus a code implementation, of Word2vec, the classic NLP (natural language processing) model
Paper background knowledge
Word representation methods
One-hot Representation
Simple, but the vector grows as long as the vocabulary, and it cannot express any relationship between words.
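As a quick illustration (a minimal sketch with a toy vocabulary, not from the original notes), one-hot vectors are as long as the vocabulary, and every pair of distinct words is equally dissimilar:

```python
import numpy as np

vocab = ["king", "queen", "apple"]  # toy vocabulary (assumed for illustration)
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Every vector has length |V|, and any two distinct words have dot product 0,
# so one-hot encoding carries no notion of similarity between words.
print(one_hot("king") @ one_hot("queen"))  # 0.0
```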
Prerequisite knowledge for the paper
The concept of a language model
A language model is a model that computes the probability that a word sequence forms a valid sentence (both grammatically and semantically).
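Concretely (the standard formulation, not spelled out in the original notes), the probability of a sentence factorizes by the chain rule:

$$
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$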
Development of language models
Language model based on expert grammar rules
Linguists tried to summarize a set of universal grammatical rules, such as "an adjective is followed by a noun".
Statistical language models
A statistical language model estimates these probabilities from corpus counts. Some words may never appear in the corpus, or a phrase may be too long to occur even once, so its estimated probability is 0. To solve this problem, the smoothing operations in statistical language models are introduced below.
Smoothing operations in statistical language models
But smoothing only addresses the zero word probabilities; the parameter space, with one parameter per possible word history, is still far too large. To solve the problem of the overly large parameter space, the Markov assumption is introduced: each word depends only on the previous n-1 words. A minimal sketch combining the two ideas follows.
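Here is a minimal sketch (toy corpus and add-one smoothing chosen for illustration; the original notes do not fix a specific smoothing method) of a bigram model under the first-order Markov assumption:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # toy corpus (assumed)
vocab = {w for sent in corpus for w in sent}
V = len(vocab)

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (a, b) for sent in corpus for a, b in zip(sent, sent[1:])
)

def p_bigram(word, prev):
    """P(word | prev) with add-one (Laplace) smoothing: never exactly 0."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

# Markov assumption: P(sentence) ~= product of P(w_i | w_{i-1}),
# so we only need |V|^2 parameters instead of one per full history.
print(p_bigram("cat", "the"))  # seen bigram: relatively high
print(p_bigram("dog", "cat"))  # unseen bigram: small but nonzero
```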
Evaluation metrics for language models
Each field has its own evaluation metrics.
A language model can be regarded as a multi-class classification problem (predicting the next word), and is commonly evaluated with perplexity. The purpose of taking the n-th root is to normalize for length, so that long sentences, whose raw probabilities are inevitably smaller than those of short sentences, do not bias the evaluation.
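The standard definition (not written out in the original notes) is:

$$
\mathrm{PP}(s) = P(w_1, w_2, \dots, w_n)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(w_1, w_2, \dots, w_n)}}
$$

Lower perplexity is better.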
Compare models
NNLM
Language model training is unsupervised and does not require a labeled corpus.
Input layer
If a computation can be written as a matrix operation instead of a loop, write it as a matrix operation; this reduces the cost, as in the sketch below.
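For example (a minimal NumPy sketch, with an assumed toy embedding matrix), looking up the embeddings of a whole batch can be done with one indexing operation instead of a Python loop:

```python
import numpy as np

V, D = 1000, 50                       # vocab size, embedding dim (assumed)
C = np.random.randn(V, D)             # embedding (projection) matrix
word_ids = np.array([3, 17, 256, 9])  # a batch of word indices

# Loop version: one lookup at a time.
looped = np.stack([C[i] for i in word_ids])

# Vectorized version: a single fancy-indexing operation, equivalent to
# multiplying a one-hot matrix by C but without materializing the one-hots.
vectorized = C[word_ids]

assert np.allclose(looped, vectorized)
```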
Hidden layer
Output layer
Loss
Batch size is a kind of tradeoff: larger batches give more stable gradients but cost more memory and computation per step.
Since sentence lengths differ, padding must be added, but the padded positions must be masked out (removed) when computing the loss. A minimal sketch of the whole NNLM forward pass follows.
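Putting the pieces together, here is a minimal NumPy sketch of the NNLM forward pass and loss (the dimensions and single-example setup are illustrative assumptions; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H, N = 1000, 50, 64, 4  # vocab, embed dim, hidden dim, context size (assumed)

C = rng.normal(size=(V, D))        # word embedding matrix
W_h = rng.normal(size=(N * D, H))  # input -> hidden weights
W_o = rng.normal(size=(H, V))      # hidden -> output weights (the costly H*V term)

def nnlm_forward(context_ids):
    """Predict a distribution over the next word from N previous word ids."""
    x = C[context_ids].reshape(-1)  # input layer: concatenated embeddings
    h = np.tanh(x @ W_h)            # hidden layer
    logits = h @ W_o                # output layer: one score per vocab word
    p = np.exp(logits - logits.max())
    return p / p.sum()              # softmax over the whole vocabulary

context = np.array([3, 17, 256, 9])  # the N previous words
target = 42                          # the word to predict
p = nnlm_forward(context)
loss = -np.log(p[target])            # cross-entropy loss for this example
print(loss)
```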
RNNLM
word2vec
Multi-class logistic regression (softmax regression) is also a log-linear model.
The skip-gram and CBOW models below are log-linear models as well.
The principle of word2vec
skip-gram
Predicts each surrounding (context) word from the center word.
cbow
CBOW (continuous bag of words) predicts the center word from its surrounding words. The bag-of-words model ignores the order of words; a sketch contrasting the two training setups follows.
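To make the two setups concrete (a small illustrative sketch; the window size and sentence are assumptions), here is how training examples differ between skip-gram and CBOW:

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]  # toy sentence (assumed)
window = 2

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window),
                              min(len(sentence), i + window + 1))
               if j != i]
    # skip-gram: one (center -> context word) example per context word
    skipgram_pairs = [(center, c) for c in context]
    # CBOW: the whole (unordered) context predicts the center word
    cbow_example = (context, center)
    print(skipgram_pairs, cbow_example)
```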
Key technologies
The softmax over the full vocabulary dominates the computation, so its complexity needs to be reduced.
Hierarchical softmax
Convert the softmax computation into a sequence of sigmoid computations arranged in a binary tree (word2vec uses a Huffman tree): a word's probability is the product of the sigmoid decisions along the path from the root to that word's leaf, which reduces the per-word cost from $O(V)$ to $O(\log_2 V)$.
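In the standard formulation (following the word2vec paper; the notation here is assumed), the probability of an output word $w$ given an input vector $u$ is a product over the inner nodes on its path:

$$
p(w \mid u) = \prod_{j=1}^{L(w)-1} \sigma\big( d_j \cdot \theta_{n(w,j)}^{\top} u \big), \qquad d_j = \pm 1
$$

where $n(w,j)$ is the $j$-th inner node on the path from the root to $w$, $L(w)$ is the path length, $\theta_n$ is the vector of inner node $n$, and $d_j$ is $+1$ when the path turns one way (say, left) at node $j$ and $-1$ otherwise.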
Hierarchical softmax in skip-gram
Hierarchical softmax in cbow
The difference from skip-gram's hierarchical softmax is that the input vector $u_0$ is the average of the context word vectors.
With hierarchical softmax, skip-gram only has a complete set of center (input) word vectors, because the output side now stores inner-node vectors rather than word vectors, so it can no longer average center and surrounding word vectors as before.
Likewise, CBOW only has a complete set of surrounding (context) word vectors.
Negative sampling
Idea: convert the multi-class classification over the vocabulary into binary classification: distinguish the true context word from a few randomly sampled noise words.
Negative sampling works better in practice than hierarchical softmax.
Generally, 3-10 negative samples are drawn per positive example; the objective is sketched below.
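The skip-gram negative-sampling objective for a (center word $w_I$, context word $w_O$) pair is (the standard form from the word2vec paper; symbols as usually defined):

$$
\log \sigma\big( {v'}_{w_O}^{\top} v_{w_I} \big) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)} \big[ \log \sigma\big( -{v'}_{w_i}^{\top} v_{w_I} \big) \big]
$$

where $K$ is the number of negative samples and $P_n(w)$ is the noise distribution.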
skip-gram negative sampling
Important (informative) words tend to appear less frequently, while unimportant words (such as function words) appear very frequently; word2vec therefore draws negatives from the unigram distribution raised to the 3/4 power, which boosts the share of rarer words.
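A minimal sketch of that noise distribution (toy counts assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([900, 90, 9, 1], dtype=float)  # toy unigram counts (assumed)

p_raw = counts / counts.sum()
p_neg = counts ** 0.75
p_neg /= p_neg.sum()  # unigram^(3/4) noise distribution

print(p_raw)  # rare words almost never sampled under raw frequency
print(p_neg)  # rare words get a noticeably larger share under the 3/4 power
negatives = rng.choice(len(counts), size=5, p=p_neg)  # draw 5 negative ids
print(negatives)
```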
CBOW negative sampling
Re-sampling (subsampling of frequent words)
Very frequent words carry little information per occurrence, so each occurrence is randomly discarded during training; this speeds up training and improves the vectors of rarer words.
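In the word2vec paper, each occurrence of word $w$ is discarded with probability (where $t$ is a threshold, typically around $10^{-5}$, and $f(w)$ is the word's relative frequency):

$$
P_{\text{discard}}(w) = 1 - \sqrt{\frac{t}{f(w)}}
$$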
Model complexity
E (the number of training epochs) and T (the number of words in the training set) are taken to be the same across models, so Q, the computation per training example, is used below to represent model complexity; total training complexity is proportional to $E \times T \times Q$.
NNLM
Using hierarchical softmax, the $V \times H$ term becomes $\log_2 V \times H$.
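For reference (as given in the word2vec paper, with $N$ context words, embedding dimension $D$, and hidden size $H$), the per-example complexity of the NNLM is:

$$
Q = N \times D + N \times D \times H + H \times V
$$

and hierarchical softmax replaces the dominant $H \times V$ term with $H \times \log_2 V$.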