word2vec (reprint)

Source: https://www.cnblogs.com/iloveai/p/word2vec.html

The Past and Present of word2vec

In 2013, Google open-sourced word2vec, a tool for computing word vectors, and it quickly drew attention from both industry and academia. First, word2vec can be trained efficiently on datasets with dictionaries of millions of words; second, the word vectors (word embeddings) it produces measure the similarity between words remarkably well. As deep learning became popular in natural language processing, many people came to assume, mistakenly, that word2vec is a deep learning algorithm. In fact, the algorithm behind word2vec is a shallow neural network. It should also be emphasized that word2vec is an open-source tool for computing word vectors; when we speak of the word2vec algorithm or model, we really mean the CBoW and Skip-gram models it uses underneath to compute those vectors. Treating word2vec itself as a single model or algorithm is a misconception. In what follows, we start from statistical language models and describe, in as much detail as possible, where the algorithms behind the word2vec tool come from.

Statistical Language Model

Before digging into the details of the word2vec algorithm, let us first look at a fundamental problem in natural language processing: how do we compute the probability that a given text sequence appears in a language? We call this a fundamental problem because it plays an important role in many NLP tasks. In machine translation, for example, if we know the probability of every candidate sentence in the target language, we can pick the most plausible one from the candidate set and return it as the translation.

Statistical language models provide a basic framework for this kind of problem. For a text sequence S = w_1, w_2, \dots, w_T, its probability can be expressed as:

 

P(S) = P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, w_2, \dots, w_{t-1})


That is, the joint probability of the sequence is decomposed into a product of conditional probabilities. The question then becomes how to predict the conditional probability p(w_t | w_1, w_2, ..., w_{t-1}) of a word given all the preceding words.
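
To make the decomposition concrete, here is a minimal Python sketch that multiplies out the chain rule for a three-word sentence. The conditional probabilities are made-up numbers, purely for illustration.

```python
# Toy illustration of the chain rule: P(S) = prod_t p(w_t | w_1..w_{t-1}).
# The conditional probabilities below are invented for illustration only.

sentence = ["the", "cat", "sat"]

# hypothetical conditional probabilities p(w_t | history)
cond_prob = {
    ("the",): 0.1,                # p("the")
    ("the", "cat"): 0.02,         # p("cat" | "the")
    ("the", "cat", "sat"): 0.3,   # p("sat" | "the", "cat")
}

p_sentence = 1.0
for t in range(len(sentence)):
    history_plus_word = tuple(sentence[: t + 1])
    p_sentence *= cond_prob[history_plus_word]

print(p_sentence)  # 0.1 * 0.02 * 0.3
```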

Because its parameter space is enormous, this original model is of little practical use. What we actually use is a simplified version, the N-gram model:

 

p(w_t \mid w_1, w_2, \dots, w_{t-1}) \approx p(w_t \mid w_{t-n+1}, \dots, w_{t-1})


Common choices are the bigram model (N = 2) and the trigram model (N = 3). In practice, because of the trade-off between model complexity and prediction accuracy, we rarely consider models with N > 3.

The parameters of an N-gram model can be estimated by maximum likelihood, which amounts to counting the frequency of each N-gram conditioned on its history, as in the sketch below.
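
The following is a minimal sketch of maximum-likelihood bigram estimation over a tiny invented corpus; it is illustrative only, not a production language model.

```python
from collections import Counter

# Maximum-likelihood bigram estimation: p(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1}).
# The tiny corpus is invented for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    for w in sent:
        unigram_counts[w] += 1
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1

def bigram_prob(prev, cur):
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))   # 2/3
print(bigram_prob("the", "fish"))  # 0.0 -- the zero-probability problem discussed next
```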

To avoid the zero-probability problem (an N-gram never seen in the training set would make the probability of the whole sequence zero), the basic N-gram model was later extended to the back-off trigram model (which falls back to lower-order bigram and unigram probabilities when the trigram probability is zero) and the interpolated trigram model (which expresses the conditional probability as a linear combination of the unigram, bigram, and trigram estimates). We do not go into the details here; interested readers can consult the literature [3].

Distributed Representation

However, the N-gram model has its limitations. First, because of the explosive growth of the parameter space, it cannot handle longer contexts (N > 3). Second, it does not capture the intrinsic relatedness between words. Consider the sentence "the cat is walking in the bedroom". If the training corpus contains many similar sentences such as "the dog is walking in the bedroom" or "the cat is running in the bedroom", then even if we have never seen this exact sentence, we should be able to estimate its probability from the similarity between "cat" and "dog" (and between "walking" and "running") [3]. The N-gram model cannot do this.

The reason is that the N-gram model essentially treats each word as an isolated atomic unit. The mathematical form corresponding to this treatment is a discrete one-hot vector: a vector that is 1 in the dimension matching the word's index in the dictionary and 0 everywhere else. For example, for a dictionary of size 5, {"I", "love", "natural", "language", "processing"}, the one-hot vector for "natural" is [0, 0, 1, 0, 0]. Clearly, the dimension of a one-hot vector equals the size of the dictionary. For the dictionaries of hundreds of thousands or even millions of words found in practice, this poses a huge problem: the curse of dimensionality.
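
A minimal numpy sketch of one-hot encoding, using the 5-word dictionary from the text:

```python
import numpy as np

# One-hot encoding for the 5-word dictionary from the text.
dictionary = ["I", "love", "natural", "language", "processing"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def one_hot(word):
    vec = np.zeros(len(dictionary))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("natural"))  # [0. 0. 1. 0. 0.]
# With a real dictionary of hundreds of thousands of words, each vector would be
# equally long and almost entirely zero -- the curse of dimensionality.
```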

So it is natural to ask: can we describe a word with a continuous, dense vector instead? With such vectors we could not only measure the similarity between words directly, but also build a smooth function from vectors to probabilities, so that similar word vectors are mapped to nearby points in probability space. This continuous, dense word vector is also called a distributed representation [3].

In fact, this idea had long been widely used in information retrieval (IR), where it is known as the vector space model (VSM).

The VSM rests on the Statistical Semantics Hypothesis [4]: statistical patterns of language usage carry hidden semantic information ("statistical patterns of human word usage can be used to figure out what people mean"). For example, two words with similar distributions over documents can be assumed to share a similar topic. This hypothesis has many derived versions, of which the two best known are the Bag of Words Hypothesis and the Distributional Hypothesis. The former says that the word frequencies of a document (rather than the word order) represent its topic; the latter says that two words occurring in similar contexts have similar meanings. As we will see later, the word2vec algorithm is based on the Distributional Hypothesis.

So how does the VSM map sparse, discrete one-hot word vectors to a dense, continuous distributional representation?

In brief, based on the Bag of Words Hypothesis we can construct a term-document matrix A: row A_{i,:} corresponds to a word in the dictionary, column A_{:,j} corresponds to a document in the training corpus, and element A_{ij} is the number of times (or frequency with which) word w_i appears in document D_j. We can then take the row vectors as semantic vectors for the words (although in practice we more often use the column vectors as topic vectors for the documents).

Similarly, based on the Distributional Hypothesis we can construct a word-context matrix. This time, the columns correspond to context words, and each element of the matrix becomes the number of times a word co-occurs with a context word within a context window. A sketch of this construction follows.
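
Here is a minimal sketch of building a word-context co-occurrence matrix with a symmetric window; the toy corpus and window size are illustrative choices, not anything prescribed by the text.

```python
import numpy as np

# Build a word-context co-occurrence matrix with a symmetric window (Distributional Hypothesis).
# The corpus and window size are toy choices for illustration.
corpus = [["the", "cat", "is", "walking", "in", "the", "bedroom"],
          ["the", "dog", "is", "running", "in", "the", "bedroom"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[idx[w], idx[sent[j]]] += 1

# Rows are word vectors: "cat" and "dog" share contexts, so their rows are similar.
print(cooc[idx["cat"]])
print(cooc[idx["dog"]])
```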

Note that the row-vector similarities computed from these two kinds of matrices differ subtly: a term-document matrix tends to assign high similarity to two words that often appear in the same documents, whereas a word-context matrix assigns high similarity to two words that share the same contexts. The latter is a higher-order notion of similarity, which is why it has found broader use in traditional information retrieval.

However, such co-occurrence matrices still suffer from data sparsity and the curse of dimensionality. A series of methods (e.g., LSI/LSA) were therefore proposed to reduce their dimensionality. These methods share the same underlying idea, SVD: factor the original sparse matrix into the product of two low-rank matrices.
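
A minimal sketch of the truncated-SVD idea behind LSA/LSI, applied to a small made-up count matrix:

```python
import numpy as np

# Truncated SVD (the core of LSA/LSI): factor a sparse co-occurrence matrix into
# low-rank factors and keep only the top-k singular directions.
# The matrix below is a small made-up term-document count matrix.
A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 3., 1.],
              [0., 0., 1., 2.]])

k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
word_vectors = U[:, :k] * S[:k]      # k-dimensional dense word representations
doc_vectors = Vt[:k, :].T * S[:k]    # k-dimensional dense document representations

print(word_vectors.shape)  # (4, 2): each word now lives in a 2-d dense space
```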

For more on the VSM, see the reference at the end of this article [4].

Neural Network Language Model

Now let us return to statistical language models. Given the shortcomings of N-gram and related models, Bengio et al. published a seminal paper in 2003, A Neural Probabilistic Language Model [3]. In it they laid out a framework for building statistical language models with neural networks (the Neural Network Language Model, NNLM) and introduced the idea of word embeddings for the first time (though not under that name), laying the foundation for later work on word representation learning, including word2vec.

The basic ideas of the NNLM can be summarized as follows:

  1. Assume that each word in the vocabulary corresponds to a continuous feature vector;
  2. Assume a continuous, smooth probability model that, given a sequence of word vectors, outputs the joint probability of the sequence;
  3. Learn the weights of the word vectors and the parameters of the probability model simultaneously.

A point worth noting is that the word vectors here are parameters to be learned.

In the 2003 paper, Bengio et al. used a simple feed-forward neural network f(w_{t-n+1}, ..., w_t) to fit the conditional probability p(w_t | w_1, w_2, ..., w_{t-1}) of a word sequence. The overall network architecture is shown below:

[Figure: architecture of the neural network language model]

We can understand the model by splitting it into two parts:

  1. The first part is a linear embedding layer. It maps the N-1 input one-hot word vectors, through a shared D × V matrix C, to N-1 distributed word vectors. Here V is the size of the dictionary, D is the dimension of the embedding vectors (a hyperparameter fixed in advance), and the matrix C stores the word vectors to be learned.
  2. The second part is a simple feed-forward neural network g, consisting of a tanh hidden layer and a softmax output layer. It maps the N-1 embedding vectors to a probability distribution of length V over the output layer, thereby estimating the conditional probability of every dictionary word given the input context:

    p(w_t \mid w_1, w_2, \dots, w_{t-1}) \approx f(w_t, w_{t-1}, \dots, w_{t-n+1}) = g(w_t, C(w_{t-n+1}), \dots, C(w_{t-1}))

We can tune the model parameters θ by maximizing the following regularized log-likelihood (equivalently, minimizing a regularized cross-entropy loss):

 

L(\theta) = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \dots, w_{t-n+1}) + R(\theta)

Here the model parameters θ include the elements of the embedding matrix C as well as the weights of the feed-forward network g. This is a huge parameter space. However, when updating the parameters with SGD, not all of them need to be adjusted at each step (for example, the vectors of words that do not appear in the input context). The main computational bottleneck is the normalization in the softmax layer, which requires computing a conditional probability for every word in the dictionary.
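
The following is a minimal numpy sketch of the NNLM forward pass: look up the n-1 context words in the shared embedding matrix C, concatenate, pass through a tanh hidden layer, then softmax over the whole vocabulary. Sizes and random weights are illustrative only, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, H, n = 10, 4, 8, 3           # vocab size, embedding dim, hidden dim, n-gram order
C = rng.normal(size=(V, D))        # embedding matrix (the word vectors to be learned)
W_h = rng.normal(size=(H, (n - 1) * D))
b_h = np.zeros(H)
W_o = rng.normal(size=(V, H))
b_o = np.zeros(V)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nnlm_probs(context_ids):
    """context_ids: indices of the n-1 previous words."""
    x = np.concatenate([C[i] for i in context_ids])   # linear embedding layer
    h = np.tanh(W_h @ x + b_h)                        # tanh hidden layer
    return softmax(W_o @ h + b_o)                     # distribution over the next word

probs = nnlm_probs([1, 5])
print(probs.shape, probs.sum())   # a length-V distribution; the softmax over all V words is the bottleneck
```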

Setting aside the complexity of the parameter space, we have to ask: why has such a simple model been so hugely successful?

A careful look reveals that this model actually solves two problems at once: the conditional probability p(w_t | context) that statistical language models care about, and the vector representation of words that vector space models care about. These two problems are essentially not independent. By introducing continuous, smooth word vectors and a probability model, we can model the sequence probability in a continuous space and thereby alleviate data sparsity and the curse of dimensionality at their root. Conversely, using the conditional probability p(w_t | context) as the objective for updating the word-vector weights gives the training more guidance, and it also coincides with the Distributional Hypothesis of the VSM.

CBoW & Skip-gram Model

After so much groundwork, it is finally time for the protagonist to appear.

Before its official debut, however, let us look at a few problems of the NNLM.

The first problem is that, like the N-gram model, the NNLM can only handle fixed-length sequences. In the 2003 paper, Bengio et al. set the maximum sequence length N the model could process to 5; although this is a big improvement over bigram and trigram models, it still lacks flexibility.

Therefore, in 2010 Mikolov et al. proposed the RNNLM [7], which replaces the feed-forward network of the original model with a recurrent neural network and merges the embedding layer with the RNN's hidden layer, thereby handling sequences of arbitrary length.

The other problem is more serious: the NNLM trains far too slowly. Even on datasets of only millions of words, and even with the help of 40 CPUs, training an NNLM takes several weeks to produce a halfway decent solution. Clearly, for today's real-world corpora of hundreds of millions or even billions of words, training an NNLM is practically a mission impossible.

This is where Mikolov again stands out. He noticed that training the original NNLM can actually be split into two steps:

  1. Train continuous word vectors with a simple model;
  2. Based on these word vectors, train a neural-network N-gram language model.
    The computational bottleneck of the NNLM lies in the second step.

If all we want are continuous word feature vectors, can we not simplify the neural network in that second step?

Mikolov thought so, and did exactly that. In 2013 he released two papers in quick succession, along with an open-source tool for computing word vectors. At that point word2vec was born, and the protagonist took the stage.

Below, I will walk you through a brief analysis of how the word2vec algorithm works. With the background above, understanding it becomes very simple.

First, we apply the following transformations to the original NNLM:

  1. Remove the non-linear hidden layer of the feed-forward network and connect the embedding layer directly to the softmax output layer;
  2. Ignore the sequence information of the context: project all input words onto the same embedding layer and sum them;
  3. Include future words in the context as well.

The resulting model is called the Continuous Bag-of-Words (CBoW) model, and it is the first of word2vec's two models:

[Figure: CBoW model architecture]

Mathematically, the CBoW model is equivalent to multiplying a bag-of-words vector by an embedding matrix to obtain a continuous embedding vector, which is where the model's name comes from.
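
A minimal sketch of the CBoW forward pass: average (or sum) the context word embeddings, ignoring word order, and connect the result directly to the softmax output layer. Dimensions and random weights are illustrative stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 4
V_in = rng.normal(size=(V, D))    # input (embedding) vectors
U_out = rng.normal(size=(V, D))   # output vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_probs(context_ids):
    h = V_in[context_ids].mean(axis=0)   # bag-of-words: just average the context embeddings
    return softmax(U_out @ h)            # distribution over the center word -- no hidden layer

print(cbow_probs([2, 3, 5, 6]).argmax())  # index of the predicted center word
```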

The CBoW model still learns word vectors by predicting a target word from its context. Can we turn this around and learn word vectors by predicting the context from the target word? The answer is clearly yes:

[Figure: Skip-gram model architecture]

This reversed model is called the Skip-gram model (the name comes from the fact that, during training, the model samples words from the context).

If we write out the Skip-gram model's computation in mathematical form, we get:

 

p(w_o \mid w_i) = \frac{e^{U_o \cdot V_i}}{\sum_j e^{U_j \cdot V_i}}


Here, V_i is a column vector of the embedding matrix, also called the input vector of w_i; U_j is a row vector of the softmax-layer matrix, also called the output vector of w_j.

In essence, the Skip-gram model computes the dot-product similarity between the input vector of the input word and the output vector of the target word, and then normalizes with softmax. The parameters the model needs to learn are precisely these two kinds of word vectors. A sketch of this computation is given below.
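
A minimal sketch of the full-softmax Skip-gram probability; the random vectors stand in for learned parameters, and the point is the shape of the computation rather than the values.

```python
import numpy as np

# Skip-gram with a full softmax: p(w_o | w_i) = exp(U_o . V_i) / sum_j exp(U_j . V_i).
rng = np.random.default_rng(0)
V, D = 10, 4
V_in = rng.normal(size=(V, D))    # input vectors V_i (embedding layer)
U_out = rng.normal(size=(V, D))   # output vectors U_j (softmax layer)

def skipgram_prob(center_id, context_id):
    scores = U_out @ V_in[center_id]          # a dot product with every vocabulary word -- the bottleneck
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

print(skipgram_prob(1, 7))
```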

However, directly computing this similarity against all V words in the dictionary and normalizing is clearly prohibitively expensive. For this reason, Mikolov introduced two optimization algorithms: hierarchical softmax and negative sampling.

Hierarchical Softmax [5]

Hierarchical softmax was first introduced into language modeling by Bengio et al. in 2005. Its basic idea is to decompose the complicated normalized probability into a product of conditional probabilities:

 

p(v \mid \text{context}) = \prod_{i=1}^{m} p\big(b_i(v) \mid b_1(v), \dots, b_{i-1}(v), \text{context}\big)


Each factor corresponds to a binary classification problem at one level of the hierarchy, which can be fitted with a simple logistic regression. In this way, the problem of normalizing the probabilities of V words is transformed into fitting roughly log V binary probabilities.

We can understand this process intuitively by constructing a classification binary tree. First, we split the original dictionary D into two subsets D1 and D2, and assume that, given the context, the probability that the target word belongs to subset D1 follows a logistic function:

 

p(w_t \in D_1 \mid \text{context}) = \frac{1}{1 + e^{-U_{D_{root}} \cdot V_{w_t}}}


Here U_{D_{root}} and V_{w_t} are model parameters.

Next, we further split D1 and D2 into subsets and repeat the process until each subset contains only a single word. In this way we convert the original dictionary D of size V into a binary tree of depth log V. The leaf nodes of the tree correspond one-to-one to the words of the original dictionary, and the non-leaf nodes correspond to groupings of words. Clearly, there is exactly one path from the root to any leaf node, and this path encodes the class membership of that leaf word.

The walk from the root node to a leaf node is then a random walk, so we can compute the likelihood of any leaf node of the binary tree. For example, for a target word w_t in the training sample whose binary code is {1, 0, 1, ..., 1}, the likelihood function we construct is:

 

p(w_t \mid \text{context}) = p(D_1 = 1 \mid \text{context}) \cdot p(D_2 = 0 \mid D_1 = 1) \cdots p(w_t \mid D_k = 1)


which is a product of logistic regression functions, one per level of the tree.

We can solve for the parameters of the binary tree (the vectors attached to the non-leaf nodes, used to compute the probability of walking to each child) by maximizing this likelihood function.

Hierarchical softmax is a very clever model. By constructing a binary tree, it reduces the computational complexity of the target probability from the original O(V) to O(log V). The price, however, is an artificially strengthened coupling between words: a change in the conditional probability of one word affects the probabilities on all non-leaf nodes along its path, which in turn indirectly affects the conditional probabilities of other words to varying degrees. A meaningful binary tree structure is therefore very important, and practice has shown that a Huffman-coded binary tree satisfies most application scenarios. The sketch below shows the path-probability computation.
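
The following is a minimal sketch of the path-probability computation: the probability of a word is the product of binary logistic decisions along its path from the root. The tree, the binary code, and the node vectors here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
h = rng.normal(size=D)                         # context (input) vector

# Path of the target word: one vector per inner node visited, plus the binary code (1 = go left).
path_node_vectors = rng.normal(size=(3, D))
code = [1, 0, 1]                               # hypothetical binary code of the target leaf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p = 1.0
for node_vec, bit in zip(path_node_vectors, code):
    p_left = sigmoid(node_vec @ h)             # logistic decision at this inner node
    p *= p_left if bit == 1 else (1.0 - p_left)

print(p)   # p(word | context), computed with ~log V sigmoids instead of a V-way softmax
```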

Negative Sampling[6]

The idea of negative sampling originally comes from an algorithm called Noise-Contrastive Estimation (NCE) [6], which was proposed to estimate the parameters of probability models whose normalizing constant cannot be computed. Unlike hierarchical softmax, which transforms the model's output probability, NCE reformulates the model's likelihood function.

Take the Skip-gram model as an example. Its original likelihood corresponds to a multinomial distribution, and solving it by maximum likelihood gives the cross-entropy loss:

 

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)


where p(w_{t+j} | w_t) is a probability normalized over the entire dictionary.

In the NCE algorithm, we instead pose the following question: for a training pair <context, word>, did the target word come from the context, or from a pre-specified background noise distribution? This question can be answered with a logistic regression function:

 

p(D = 1 \mid w, \text{context}) = \frac{p(w \mid \text{context})}{p(w \mid \text{context}) + k\, p_n(w)} = \sigma\big(\log p(w \mid \text{context}) - \log k\, p_n(w)\big)


This formula gives the probability that a target word w was generated by the context. Here k is a prior parameter giving the number of noise samples drawn per data sample; p(w | context) is an unnormalized probability, for which we use the numerator of the softmax function; and p_n(w) is the distribution of the background noise words, commonly taken to be the unigram distribution of the words.

By drawing k samples from the noise distribution we obtain a new dataset of triples <context, word, label>, where label marks the source of the data (the true data distribution or the background noise distribution). On this new dataset, we can solve for the model parameters by maximizing the likelihood of the logistic regression above.

The negative sampling algorithm Mikolov proposed in his 2013 paper is a simplified version of NCE. In it, Mikolov drops NCE's dependence on the noise distribution inside the likelihood and directly uses the numerator of the original softmax to define the logistic regression function, simplifying the computation further:

 

p(D = 1 \mid w_o, w_i) = \sigma(U_o \cdot V_i)


The corresponding objective function for the model then becomes:

J(\theta) = \log \sigma(U_o \cdot V_i) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim p_n(w)} \big[\log \sigma(-U_j \cdot V_i)\big]
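
A minimal sketch of this objective for a single (center, context) pair with k sampled noise words; the vectors, the uniform stand-in noise distribution, and k are placeholders for illustration.

```python
import numpy as np

# Negative-sampling objective for one training pair:
# J = log sigma(U_o . V_i) + sum over k noise words of log sigma(-U_j . V_i).
rng = np.random.default_rng(0)
V, D, k = 10, 4, 5
V_in = rng.normal(size=(V, D))
U_out = rng.normal(size=(V, D))
unigram = np.ones(V) / V          # stand-in noise distribution p_n(w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(center_id, context_id):
    pos = np.log(sigmoid(U_out[context_id] @ V_in[center_id]))      # true pair
    noise_ids = rng.choice(V, size=k, p=unigram)                     # k sampled noise words
    neg = np.log(sigmoid(-U_out[noise_ids] @ V_in[center_id])).sum()
    return pos + neg              # maximize this (or minimize its negative) with SGD

print(sgns_objective(1, 7))
```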

Besides the hierarchical softmax and negative sampling optimizations described here, Mikolov's 2013 papers introduced another trick: subsampling. The basic idea is to randomly discard high-frequency words during training with a certain probability:

 

p_{discard}(w) = 1 - \sqrt{\frac{t}{f(w)}}


where t is a prior parameter, usually set to 10^{-5}, and f(w) is the frequency of w in the corpus.

Experiments show that this subsampling technique significantly improves the accuracy of the word vectors of low-frequency words.
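
A minimal sketch of the discard rule on a toy corpus. Note that t = 1e-5 is the value suggested for real corpora; here a much larger t is used so the effect is visible on a handful of sentences.

```python
import numpy as np
from collections import Counter

# Subsampling of frequent words: discard token w with probability 1 - sqrt(t / f(w)).
corpus = ["the cat sat on the mat".split(), "the dog sat on the rug".split()]
t = 0.1   # inflated for this toy corpus; the paper suggests ~1e-5 for real data

counts = Counter(w for sent in corpus for w in sent)
total = sum(counts.values())

def p_discard(word):
    f = counts[word] / total
    return max(0.0, 1.0 - np.sqrt(t / f))

rng = np.random.default_rng(0)
subsampled = [[w for w in sent if rng.random() >= p_discard(w)] for sent in corpus]
print(p_discard("the"), subsampled)   # only the frequent "the" has a nonzero discard probability
```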

Beyond the Word Vector

Having introduced the word2vec models and their principles, let us now discuss a lighter topic: applications of the models.

After word2vec appeared in 2013, one of its most talked-about properties was the semantic and syntactic regularities captured by the learned vectors, and in particular the fact that these regularities show up as vector addition and subtraction [8]! The most classic example is v("King") - v("Man") + v("Woman") = v("Queen"). That said, this example does not seem to have much practical use.

Beyond that, word2vec has also been applied to machine translation systems and recommender systems.

Machine Translation [9]

Unlike the later RNN-based models that translate at the sentence level, the word2vec model has mainly been used for machine translation at the word level.

Specifically, we first learn word2vec representations for each language from large monolingual corpora, and then use a small bilingual corpus to learn a linear mapping W between the word2vec representations of the two languages. The loss function is constructed as follows:

 

J(W) = \sum_{i=1}^{n} \| W x_i - z_i \|^2

At translation time, we first map a source-language word2vec vector into the target language's vector space via the matrix W, and then return the word whose vector is closest to the projected vector as the translation.

The principle behind this is that the word2vec vector spaces learned for different languages are, geometrically, approximately isomorphic, and the mapping W is essentially a linear transformation that aligns the two spaces. A least-squares sketch follows.
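
A minimal sketch of fitting W by least squares and translating by nearest neighbor. The "embeddings" here are random placeholders; in practice x_i and z_i would come from two separately trained word2vec models and a bilingual seed dictionary.

```python
import numpy as np

# Learn W minimizing sum_i ||W x_i - z_i||^2 over known translation pairs (x_i, z_i).
rng = np.random.default_rng(0)
n, d_src, d_tgt = 200, 50, 40
X = rng.normal(size=(n, d_src))               # source-language vectors of the seed pairs
Z = rng.normal(size=(n, d_tgt))               # corresponding target-language vectors

# Closed-form least squares: solve X @ W^T = Z for W^T.
W_T, *_ = np.linalg.lstsq(X, Z, rcond=None)
W = W_T.T

def translate(x, target_vectors):
    z = W @ x                                  # project the source vector into the target space
    sims = target_vectors @ z / (np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(z))
    return int(np.argmax(sims))                # index of the nearest target-language word

print(translate(X[0], Z))
```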

Item2Vec [11]

Essentially, the word2vec model is built on a word-context co-occurrence matrix, so any algorithm based on a co-occurrence matrix can borrow word2vec's ideas and be improved by them.

One example is the collaborative filtering algorithms used in recommender systems.

Collaborative filtering makes recommendations based on a user-item co-occurrence matrix, using the similarity of its row or column vectors. If we treat the items purchased by the same user as each other's context, we can build an item-context matrix and then, following the CBoW or Skip-gram model, compute embedding vectors for the items on this matrix and measure higher-order similarities between items, as in the sketch below.
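
An Item2Vec-style sketch: treat each user's purchased-item set as a "sentence" and train Skip-gram with negative sampling on it. This assumes gensim (version 4 or later) is installed; the baskets and hyperparameters are invented for illustration and are not from the original paper.

```python
from gensim.models import Word2Vec

# Each "sentence" is the set of items one user interacted with (toy data).
baskets = [
    ["item_12", "item_55", "item_7"],
    ["item_12", "item_7", "item_98"],
    ["item_55", "item_31"],
]

model = Word2Vec(
    sentences=baskets,
    vector_size=32,   # embedding dimension
    window=5,         # within-basket context window
    sg=1,             # Skip-gram
    negative=5,       # negative sampling
    min_count=1,
)

print(model.wv.most_similar("item_12", topn=2))   # items frequently co-occurring with item_12
```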

For more applications of word2vec, see [10].

Word Embedding

Finally, let me briefly lay out my own thoughts on word embeddings. They are not necessarily correct, and differing opinions are welcome.

Word embeddings first appeared in the seminal 2003 paper by Bengio [3]. The idea is to map the original one-hot vectors to dense, continuous vectors through a linear projection matrix, and to learn these vectors through a language-model task. This idea was later widely adopted in NLP models, including word2vec.

Approaches to training word embeddings fall into two broad categories: unsupervised or weakly supervised pre-training, and end-to-end supervised training.

Unsupervised or weakly supervised pre-training is represented by word2vec and auto-encoders. The appeal of these models is that they can produce good-quality embedding vectors without large amounts of manually labeled data. Because they are not task-oriented, however, the embeddings may be far from what the task at hand actually needs. We therefore usually take the pre-trained embeddings and fine-tune the whole model with a small amount of manually labeled data.

By contrast, end-to-end supervised models have attracted more and more attention in recent years. Compared with unsupervised models, end-to-end models often have more complex structures; at the same time, because they have a clear task objective, the embeddings they learn also tend to be more accurate. For example, a deep neural network that stacks an embedding layer with several convolutional layers to classify the sentiment of sentences can learn word vectors with richer semantics.

Another line of word-embedding research models embedding vectors at a higher level: the sentence.

Words are the basic units of a sentence. The simplest and most direct way to get a sentence embedding is therefore to add up the embedding vectors of all the words in the sentence, much like the CBoW model.

Obviously, this simple and crude approach loses a lot of information.

Another way follows the word2vec idea: treat the sentence or paragraph as a special word and train it with the CBoW or Skip-gram model [12]. The problem with this approach is that for every new article we always need to retrain the sentence vectors. Moreover, as with word2vec, this kind of model lacks the guidance of supervised training.

The third approach, which I personally find more reliable, is end-to-end training on top of word embeddings. A sentence is in essence a sequence of words, so on top of the word embeddings we can stack RNNs or convolutional neural networks that encode the sequence of word embeddings into a sentence embedding.

There has been a great deal of work in this area; if I get the chance, I will write a separate review on sentence embeddings.

References

[1]: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013, January 17). Efficient Estimation of Word Representations in Vector Space. arXiv.org.

[2]: Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013, October 17). Distributed Representations of Words and Phrases and their Compositionality. arXiv.org.

[3]: Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155.

[4]: Turney, P. D., & Pantel, P. (2010). From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37(1).

[5]: Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model. Aistats.

[6]: Mnih, A., & Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation, 2265–2273.

[7]: Mikolov, T., Karafiát, M., Burget, L., & Cernocký, J. (2010). Recurrent neural network based language model. Interspeech.

[8]: Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. Hlt-Naacl.

[9]: Mikolov, T., Le, Q. V., & Sutskever, I. (2013, September 17). Exploiting Similarities among Languages for Machine Translation. arXiv.org.

[10]: Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12(Aug), 2493–2537.

[11]: Barkan, O., & Koenigstein, N. (2016, March 14). Item2Vec: Neural Item Embedding for Collaborative Filtering. arXiv.org.

[12]: Le, Q. V., & Mikolov, T. (2014, May 16). Distributed Representations of Sentences and Documents. arXiv.org.
