NLP Learning (1) --- The GloVe Word Vector Model ---

I. Introduction:

1. Concept: GloVe is an unsupervised method for learning word representations (word vectors).

2. Advantages: GloVe makes full and effective use of the corpus's global statistical information, training only on the non-zero elements of the co-occurrence matrix, whereas skip-gram does not make very effective use of some of the corpus's statistics.

3. Development history of word representations:

A detailed derivation of word vectors: https://blog.csdn.net/liuy9803/article/details/86592392

(1) One-hot:

The dimension of the vector equals the size of the entire vocabulary; for each word, the position corresponding to that word in the vocabulary is set to 1, and all remaining dimensions are set to 0.
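A minimal sketch of one-hot encoding, assuming a small toy vocabulary (the words below are purely illustrative):

```python
import numpy as np

# Illustrative toy vocabulary; in practice it is built from the whole corpus.
vocab = ["ice", "steam", "solid", "gas", "water"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector: 1 at the word's position, 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("steam"))  # [0. 1. 0. 0. 0.]
```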

Disadvantages:

  • The dimension is very high and the encoding is extremely sparse, which easily leads to the curse of dimensionality;
  • It does not reflect similarity between words: each word is isolated, so generalization is poor.

 

(2) Vector space model (VSM):

Definition: given a document collection C and a dictionary D, the model represents a document as a bag of words and then computes a real value for each word based on TF-IDF;

Since the size of the dictionary D is M, each document is transformed into an M-dimensional vector. If a word from the dictionary does not appear in the document, the corresponding element of the vector is 0; if a word does appear in the document, the corresponding element is that word's TF-IDF value. In this way a document is represented as a vector, and this is the vector space model (Vector Space Model).

With document vectors, the similarity between documents can then be computed with cosine similarity.
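A minimal sketch of representing documents as TF-IDF vectors and comparing them with cosine similarity (plain TF-IDF with no smoothing; the toy documents are made up for illustration):

```python
import math
import numpy as np

# Illustrative toy corpus; the dictionary D is the set of all words, |D| = M.
docs = [["coffee", "drink", "hot"],
        ["tea", "drink", "hot"],
        ["stock", "market", "price"]]
dictionary = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(dictionary)}
N = len(docs)

def tfidf_vector(doc):
    """M-dimensional vector: 0 if a word is absent, tf*idf if it is present."""
    vec = np.zeros(len(dictionary))
    for w in set(doc):
        tf = doc.count(w)
        df = sum(1 for d in docs if w in d)
        vec[index[w]] = tf * math.log(N / df)
    return vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

v = [tfidf_vector(d) for d in docs]
print(cosine(v[0], v[1]), cosine(v[0], v[2]))  # docs 0 and 1 come out more similar
```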

Disadvantages:

  • Compared with one-hot, TF-IDF adds information, but the vector space model does not capture the relationship between one word (term) and another: it assumes that the terms are independent of each other, so some context information is lost.
  • In practical applications, the theoretical TF*IDF model is not used directly, because the weights it computes are biased toward short texts, and therefore some smoothing is required.

For example, suppose term1 appears 3 times in docA and term2 appears 9 times in docA. Computing TF as above implies that the TF weight (importance) of term2 is three times that of term1, but is it really three times as important? For this reason, Lucene's practical scoring model computes sqrt(tf), i.e. takes the square root of tf, as a smoothing step. Similarly, when computing IDF, the logarithm is taken, also for smoothing.
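A simplified sketch of the smoothing idea described above (square root on tf, logarithm on idf); this is only an illustration of the effect, not Lucene's exact scoring formula:

```python
import math

def raw_tf_weight(tf):
    return tf                       # term2 (tf=9) weighs 3x term1 (tf=3)

def smoothed_tf_weight(tf):
    return math.sqrt(tf)            # sqrt(9)=3 vs sqrt(3)~1.73: the gap is damped

def smoothed_idf(num_docs, doc_freq):
    return 1.0 + math.log(num_docs / (doc_freq + 1))  # log damps the rare-term boost

print(raw_tf_weight(9) / raw_tf_weight(3))            # 3.0
print(smoothed_tf_weight(9) / smoothed_tf_weight(3))  # ~1.73
```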

The main idea of the word vector space model is that words appearing in similar contexts are likely to be semantically similar. For example, if we find that "coffee" and "drink" often co-occur, and likewise "tea" and "drink" often co-occur, then we can infer that "coffee" and "tea" should be similar in meaning. The dimension of each word vector is the total number of contexts of the word. [But if a large number of words are fed in, this produces a problem of excessively high dimensionality.]

(3) Word embedding:

A neural network takes vocabulary words as input and outputs a low-dimensional vector representation; the parameters are then optimized with backpropagation (BP).

Neural network models that generate word vectors fall into two categories:

  • In one category, the goal of training is the word vectors themselves, which can represent semantic relationships and be used for subsequent tasks, such as word2vec;
  • In the other category, word vectors are generated as a by-product: the model is trained for a specific task and the word vectors are obtained along the way, such as fastText.

① Learning probability distributions

Word2Vec: [its output is the probability distribution of words that co-occur with a given word.]

GloVe: [compared with raw co-occurrence probabilities, the ratio of co-occurrence probabilities can better distinguish words.]

For example, suppose we want to represent the two words "ice" and "steam". For a word related to "ice" but unrelated to "steam", such as "solid", we expect P(solid|ice) / P(solid|steam) to be large. Similarly, for a word unrelated to "ice" but related to "steam", such as "gas", we expect P(gas|ice) / P(gas|steam) to be small. In contrast, for words like "water" that are related to both "ice" and "steam", or words like "fashion" that are related to neither, we expect P(water|ice) / P(water|steam) and P(fashion|ice) / P(fashion|steam) to both be close to 1.
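A minimal sketch of this intuition, assuming we already have a co-occurrence count matrix X (the counts below are made up for illustration only):

```python
import numpy as np

words = ["ice", "steam", "solid", "gas", "water", "fashion"]
idx = {w: i for i, w in enumerate(words)}

# Made-up co-occurrence counts: X[i, j] = times word j appears near word i.
X = np.array([
    #  ice steam solid gas water fashion
    [   0,  40,  190,   6,  300,    2],   # ice
    [  40,   0,    4, 150,  280,    2],   # steam
    [ 190,   4,    0,   1,   20,    1],   # solid
    [   6, 150,    1,   0,   25,    1],   # gas
    [ 300, 280,   20,  25,    0,    3],   # water
    [   2,   2,    1,   1,    3,    0],   # fashion
], dtype=float)

def P(context, word):
    """P(context | word) = X[word, context] / sum_k X[word, k]."""
    i, j = idx[word], idx[context]
    return X[i, j] / X[i].sum()

for k in ["solid", "gas", "water", "fashion"]:
    print(f"P({k}|ice) / P({k}|steam) = {P(k, 'ice') / P(k, 'steam'):.2f}")
# Expected pattern: large for 'solid', small for 'gas', close to 1 for 'water' and 'fashion'.
```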

② Objective function: least squares

Word2Vec: [Word2Vec uses no activation function in the hidden layer, which means that what the hidden layer learns is actually a linear relationship.]

GloVe: [easier to use than a neural network model with a hidden layer.]
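For reference, a minimal sketch of GloVe's weighted least-squares objective for a single non-zero entry X_ij, following the formulation in the GloVe paper (variable names here are illustrative):

```python
import numpy as np

def weight_fn(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x): caps the influence of very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 for one co-occurring pair."""
    diff = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return weight_fn(x_ij) * diff ** 2

# The total loss sums glove_pair_loss over all non-zero X_ij; the vectors and
# biases are then updated by gradient descent (the paper uses AdaGrad).
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)
print(glove_pair_loss(w_i, w_j, 0.0, 0.0, x_ij=25.0))
```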

③ Advantages:

    • Unsupervised learning of word vectors is one of the few successful applications of unsupervised learning. The advantage is that no manually annotated corpus is required: unlabeled training text is used directly as input, and the output word vectors can be used in downstream processing tasks.
    • Word vectors can be used for transfer learning (see the sketch after the steps below):

(1) train word vectors on a large corpus (or download pre-trained word vectors);

(2) transfer the word vector model to a task that has only a small labeled training set;

(3) fine-tune the word vectors with the new data (if the new dataset is small, this step is not necessary).
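A minimal sketch of these steps using PyTorch as an example, assuming a pre-trained embedding matrix has already been loaded (the sizes and random matrix below are illustrative placeholders):

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for pre-trained word vectors: |V| = 5000 words, 100 dimensions.
pretrained = torch.tensor(np.random.randn(5000, 100), dtype=torch.float32)

# Steps (1)-(2): initialize the embedding layer from the pre-trained vectors and
# reuse it in the small labeled task. Step (3): freeze=False allows fine-tuning
# on the new data; freeze=True keeps the vectors fixed (useful when the new
# dataset is small).
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[3, 17, 256]])   # a toy batch of word indices
vectors = embedding(token_ids)             # shape: (1, 3, 100)
print(vectors.shape)
```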

    • Word vectors reduce dimensionality (compared with one-hot encoding).

Although word vectors are the input to the neural network, they are not the input of the first layer. The first layer's input is the one-hot encoding of the word, which is multiplied by a weight matrix to obtain the word's vector representation; the weights are updated during the model's training phase.
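A minimal sketch of this point: multiplying a one-hot vector by the weight matrix simply selects one row of the matrix, i.e. an embedding lookup (the sizes here are illustrative):

```python
import numpy as np

vocab_size, embed_dim = 10, 4
W = np.random.randn(vocab_size, embed_dim)  # weight matrix updated during training

word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# One-hot times the weight matrix == directly taking row `word_index` of W.
print(np.allclose(one_hot @ W, W[word_index]))  # True
```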

II. The model

Word-word co-occurrence: the co-occurrence matrix, defined as X.

X_ij: the number of times word j appears in the context window around word i.
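A minimal sketch of building X from a tokenized corpus with a symmetric context window (the window size and toy sentences are illustrative; GloVe additionally weights counts by the inverse of the distance between the two words):

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    """X[(i, j)] = number of times word j appears within `window` words of word i."""
    X = defaultdict(float)
    for tokens in sentences:
        for pos, w_i in enumerate(tokens):
            start = max(0, pos - window)
            end = min(len(tokens), pos + window + 1)
            for ctx in range(start, end):
                if ctx != pos:
                    X[(w_i, tokens[ctx])] += 1.0
    return X

sentences = [["ice", "cold", "solid"], ["steam", "hot", "gas"]]
X = build_cooccurrence(sentences)
print(X[("ice", "solid")], X[("ice", "gas")])  # 1.0 0.0
```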

  

 

Origin: www.cnblogs.com/Lee-yl/p/11172255.html