An NLP beginner's learning path: word2vec principles explained

NLP and ML related resources

- ML
- Awesome NLP
- Word2vec blog
- ML+Learning
- Illustrated Word2vec (图解word2vec)

NLP

[Figure: NLP overview]

Main research directions of NLP

[Figure: main research directions of NLP]

The NLP pipeline

[Figure: the typical NLP task pipeline]

NLP learning

After looking over the general NLP task pipeline, I decided to start learning from embeddings.

embedding

Everything can be embedded. I forget who said that first, but someone must have.
Why embed at all? The ultimate goal is to turn words into a form that computers can understand.
One-hot encoding can do that too, so why not just use it?
1. One-hot vectors are indeed a form computers can understand, but they introduce the curse of dimensionality: each vector is as long as the entire vocabulary.
2. One-hot vectors cannot represent the relationship between two words. For example, "like" and "love" should be words with similar meanings (from a Chinese speaker's perspective), but with one-hot encoding every pair of distinct words is equally far apart, so their similarity cannot be measured.
[Figure: one-hot representation]
So the purpose of embedding is mainly to solve these two problems.
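To make the two points above concrete, here is a minimal sketch (my own addition, with made-up embedding values): every pair of distinct one-hot vectors has cosine similarity 0, while dense embeddings can express that "like" is closer to "love" than to "table".

```python
import numpy as np

vocab = ["like", "love", "table"]   # toy vocabulary
dim = len(vocab)                    # one-hot dimension == vocabulary size

def one_hot(word):
    v = np.zeros(dim)
    v[vocab.index(word)] = 1.0
    return v

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: "like"/"love" are no closer than "like"/"table".
print(cosine(one_hot("like"), one_hot("love")))    # 0.0
print(cosine(one_hot("like"), one_hot("table")))   # 0.0

# Dense embeddings (hypothetical learned values) can encode similarity.
emb = {"like":  np.array([0.9, 0.1]),
       "love":  np.array([0.8, 0.2]),
       "table": np.array([0.1, 0.9])}
print(cosine(emb["like"], emb["love"]))    # close to 1
print(cosine(emb["like"], emb["table"]))   # much smaller
```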

word embedding

The following explains the principle of word2vec without going into implementation details. The focus is on skip-gram with negative sampling, with the window size taken as 2.

word2vec

skip-gram

[Figure: the skip-gram model]
Use the center word to predict the 2 (the window size) words on its left and the 2 words on its right.
[Figure: generating skip-gram pairs with a sliding window]
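As a concrete illustration (my own sketch, not from the original post), here is how the (center, context) training pairs with window size 2 could be generated from a toy sentence:

```python
# For each position, pair the center word with up to `window`
# neighbors on each side.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "thou shalt not make a machine".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
# e.g. not -> thou, not -> shalt, not -> make, not -> a, ...
```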
But to speed up training, the model is modified as described in the last link above ("Illustrated Word2vec"): instead of predicting a context word from the center word, the model takes a (center word, context word) pair as input and outputs whether the two actually appear near each other, as shown below.
[Figure: the modified pair-classification model]
The input data must be restructured accordingly. The changes are as follows:
At this point the target of every sample is 1, so the model cannot possibly learn anything (it could simply output 1 for every pair). Negative sampling is therefore required: words outside the window are drawn as negative samples with target 0.
[Figure: training pairs with negative samples added]
The specific sampling method needs further study.
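Here is a minimal sketch of the restructuring plus negative sampling (my own; the uniform sampling below is a simplification, since word2vec actually draws negatives from the unigram distribution raised to the 3/4 power):

```python
import random

def with_negative_samples(pairs, vocab, k=5, seed=0):
    rng = random.Random(seed)
    data = []
    for center, context in pairs:
        data.append((center, context, 1))      # observed pair: target 1
        for _ in range(k):                     # k random negatives
            noise = rng.choice(vocab)
            while noise in (center, context):  # crude filter; real word2vec
                noise = rng.choice(vocab)      # samples by unigram^(3/4)
            data.append((center, noise, 0))    # noise pair: target 0
    return data

vocab = "thou shalt not make a machine aaron taco".split()
pairs = [("not", "thou"), ("not", "make")]
for row in with_negative_samples(pairs, vocab, k=2):
    print(row)   # (center, word, target)
```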

The parameters involved

- embedding_size: generally 1e1-1e2
- window size: generally 2-15
- number of negative samples: generally 5 is enough
- etc.

[Figure: recommended hyperparameter ranges]
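For completeness, a minimal usage sketch with gensim (a library choice of mine, not mentioned in the post; requires gensim >= 4.0, where the size parameter is called vector_size) wiring up skip-gram with negative sampling using the ranges above:

```python
from gensim.models import Word2Vec

sentences = [["thou", "shalt", "not", "make", "a", "machine"],
             ["i", "like", "machine", "learning"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding_size: generally 1e1-1e2
    window=2,         # window size: generally 2-15
    negative=5,       # negative samples: 5 is generally enough
    sg=1,             # 1 = skip-gram (0 would be CBOW)
    min_count=1,      # keep every word in this tiny corpus
)
print(model.wv["machine"][:5])                 # first 5 dims of the vector
print(model.wv.similarity("like", "machine"))  # cosine similarity
```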

Origin: blog.csdn.net/qq_32507417/article/details/108013427