deeplearning.ai Sequence Models Week 2: NLP & Word Embeddings

 

1. Word representation

  Disadvantages of the one-hot representation: it treats each word as an isolated symbol, so the algorithm generalizes poorly across related words. For example, after training on "I want a glass of orange ___" (juice), the model cannot handle "I want a glass of apple ___", because the inner product of the one-hot vectors of any two different words is 0; the algorithm has no way of knowing that orange and apple belong to the same kind of word, so it cannot generalize and fill in "juice" after apple.

  Featurized representation: word embeddings solve this problem effectively. As shown in the figure below, each word is described by a large number of features. The figure lists 300 features, including Gender, Royal, Age, Food, Size, Cost, Alive (is it a living thing), Verb (is it a verb), and so on, so each word can be described by a 300*1 vector. For example, Man is the 5391st word in the dictionary, and its feature vector is written $e_{5391}$ (here e stands for word embedding). With this representation, Apple and Orange have very similar values for most features, so the algorithm generalizes between them much better. Of course, in practice the individual elements of a learned feature vector are rarely this interpretable, but the ultimate goal is to capture the correlations between different words. A commonly used visualization algorithm is t-SNE (van der Maaten and Hinton, 2008, "Visualizing data using t-SNE"), which maps the high-dimensional feature vectors to a two-dimensional plane where similar words cluster together. Describing a word by features is equivalent to embedding it as a point in a high-dimensional space (300 dimensions in the example), which is where the name "embedding" comes from. Another advantage of word embeddings is that a relatively low-dimensional feature vector (e.g., 300-dimensional) replaces the high-dimensional one-hot representation (e.g., 10,000-dimensional); the one-hot vector is fast to work with because it is sparse, while the word embedding is dense.
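
A toy Python sketch (not from the course) of the contrast above, using a made-up 6-word vocabulary and 4 features instead of 10,000 words and 300 features:

```python
import numpy as np

vocab = ["man", "woman", "king", "queen", "apple", "orange"]
V, D = len(vocab), 4

one_hot = np.eye(V)
apple, orange = vocab.index("apple"), vocab.index("orange")
# Inner product of any two different one-hot vectors is 0: no notion of similarity.
print(one_hot[apple] @ one_hot[orange])      # 0.0

# A (here random, in practice learned) embedding matrix: one 4-dim column per word.
E = np.random.randn(D, V)
# Dense embeddings can have a large inner product when two words share features.
print(E[:, apple] @ E[:, orange])            # generally nonzero
```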

 

2. Using word embeddings

  If the algorithm is trained only on a relatively small labeled training set, it may encounter many words at test time that never appeared in training; but if those unfamiliar words have word embeddings, the algorithm can still generalize to them well. Word embeddings are most useful when the labeled training set is relatively small, and are less helpful for tasks that already have huge amounts of data (such as machine translation).

  Step 1: Learn word embeddings from a very large text corpus (1–100 billion words, usually collected from the Internet). This corpus needs no labels; it is not the task's training set, but is used only to learn the feature description of each word. Alternatively, download pre-trained embeddings that are available online (a minimal loading sketch is shown after step 3).

  Step 2: Train the algorithm on a much smaller labeled training set (say 100k words), using the feature descriptions obtained in the first step. This is essentially transfer learning: information learned from a large amount of unlabeled text freely available on the Internet is transferred to a specific task.

  Step 3: In the new application, optionally fine-tune the word embeddings with the new data. This step is generally done only when the training set of step 2 is fairly large (a 100k-word training set is still much smaller than the corpus of step 1); if the step-2 training set is small, there is no need to fine-tune.
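
As a small illustration of steps 1 and 2, here is a minimal Python sketch (not from the course) for loading pre-trained embeddings stored in the common GloVe text format; the file name is only an example:

```python
import numpy as np

def load_pretrained_embeddings(path):
    """Load a GloVe-style text file: each line is a word followed by its vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Usage (hypothetical file path):
# embeddings = load_pretrained_embeddings("glove.6B.300d.txt")
# e_orange = embeddings["orange"]             # a 300-dim numpy vector
```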

  In addition, word embeddings are closely related to face recognition, where a face photo is also converted into a feature encoding (e.g., a 128*1 vector); "embedding" and "encoding" mean essentially the same thing here. The difference is that a face recognition network must be able to encode any face photo, even one it has never seen, whereas NLP word embeddings are learned for a fixed-size vocabulary (e.g., 10,000 words).

 

3. Properties of word embeddings

  One of the great uses of word embeddings is analogical reasoning about words. For example, given that man corresponds to woman, which word does king correspond to? Abstracting this into a mathematical model, we look for the word w whose embedding maximizes the similarity $sim(e_w, e_{king}-e_{man}+e_{woman})$, i.e., the problem reduces to measuring the similarity of vectors.

  The most commonly used similarity measure is cosine similarity: $sim(u, v)=\frac{u^Tv}{||u||_2||v||_2}$, which is the cosine of the angle between the vectors u and v. Another, less commonly used option is the squared Euclidean distance $||u-v||^2$, which measures dissimilarity rather than similarity.
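
A minimal Python sketch of cosine similarity and the man : woman :: king : ? analogy, assuming `embeddings` is a dict mapping words to NumPy vectors (the names and structure are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    # sim(u, v) = u^T v / (||u||_2 ||v||_2)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Return the word d maximizing sim(e_b - e_a + e_c, e_d), i.e. a : b :: c : d."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_sim = None, -float("inf")
    for word, e in embeddings.items():
        if word in (a, b, c):
            continue
        s = cosine_similarity(target, e)
        if s > best_sim:
            best_word, best_sim = word, s
    return best_word

# complete_analogy("man", "woman", "king", embeddings)   # ideally returns "queen"
```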

 

4. Embedding matrix

  How do we learn the word embedding of each word? A matrix called the embedding matrix is constructed. Each column of the matrix is the embedding of one word, so if 300 features are used to describe 10,000 words, the embedding matrix has size 300*10000. Denote the embedding matrix by $E$, let $e_j$ be the word embedding of the j-th word, and let $o_j$ be its one-hot vector; then $Eo_j=e_j$. In practice, however, this matrix multiplication is too expensive, so a specialized lookup function reads the corresponding column directly from the embedding matrix instead of doing the multiplication.
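
A small sketch of the lookup-vs-multiplication point, with random numbers standing in for a learned embedding matrix:

```python
import numpy as np

V, D = 10000, 300
E = np.random.randn(D, V)            # embedding matrix (learned in practice)

j = 5391                             # e.g. the index of "man"
o_j = np.zeros(V)
o_j[j] = 1.0                         # one-hot vector o_j

e_by_matmul = E @ o_j                # E o_j = e_j, but costs O(D*V) operations
e_by_lookup = E[:, j]                # direct column lookup, costs O(D)

assert np.allclose(e_by_matmul, e_by_lookup)
```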

5. Learning word embeddings

  In the field of deep learning, researchers initially used fairly complex algorithms to learn word embeddings, and later found that much simpler algorithms achieve the same effect (especially when the dataset is large). Andrew Ng introduces the more complex algorithm first, because it is easier to understand intuitively, and then moves on to the simpler ones.

  One of the earliest methods to efficiently learn word embeddings comes from Bengio et al., 2003, "A neural probabilistic language model." As shown in the figure below, to predict the 7th word from the previous 6 words, first multiply each word's one-hot vector by the embedding matrix to get its word embedding, then stack the 6 word embeddings into an 1800*1 vector (assuming each word embedding is a 300*1 vector; if only 4 words were used, the stacked vector would be 1200*1). This vector is the input to a neural network, and a softmax layer then predicts the probability of each word (assuming a 10,000-word vocabulary, the output is a 10000*1 vector). The prediction is compared with the true next word, and backpropagation with gradient descent optimizes the embedding matrix, the network parameters ($W^{[1]}$, $b^{[1]}$) and the softmax parameters ($W^{[2]}$, $b^{[2]}$). Researchers found that if the goal is to learn a language model, it is natural to use the preceding few words (e.g., 4) to predict the next word; but if the goal is only to learn word embeddings, using the 4 words on the left and 4 words on the right to predict the middle word, using only the last 1 word, or even using a single nearby word, all give good results.
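
A rough Python sketch of the forward pass of such a neural language model; the hidden-layer size (128) and the tanh activation are assumptions made for the sketch, not details from the lecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

V, D, n_context = 10000, 300, 6      # vocabulary size, embedding dim, context words
H = 128                              # hidden-layer size (made up for the sketch)

E  = np.random.randn(D, V)                         # embedding matrix
W1 = np.random.randn(H, n_context * D) * 0.01      # hidden layer
b1 = np.zeros(H)
W2 = np.random.randn(V, H) * 0.01                  # softmax layer
b2 = np.zeros(V)

def predict_next_word(word_indices):
    # Stack the embeddings of the 6 previous words into an 1800-dim input vector.
    x = np.concatenate([E[:, j] for j in word_indices])
    a1 = np.tanh(W1 @ x + b1)
    return softmax(W2 @ a1 + b2)     # probability over the 10,000 words

probs = predict_next_word([12, 845, 3, 991, 27, 5391])   # hypothetical word indices
```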

6. Word2Vec

  Compared with the algorithm in the previous section, the algorithm introduced here (Mikolov et al., 2013, "Efficient estimation of word representations in vector space") is simpler and more computationally efficient. The algorithm picks one word as the context, then randomly picks another word within a certain window around it (e.g., within 5 or 10 words before or after) as the target word; such a model is called a skip-gram model. The context-target pairs define a supervised learning problem: predict the target given the context. The goal is not accurate prediction itself, but learning good word embeddings as a by-product. As shown in the figure below, multiply the one-hot vector of the context by the embedding matrix to get the context embedding, then feed it to a softmax function to get $\hat{y}$, a predicted probability over the 10,000 words in the vocabulary (the conditional probability $p(t|c)$ in the figure), and then compute the loss. The parameters optimized here are the embedding matrix $E$ and the softmax parameters $\theta$.
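
A sketch of the skip-gram softmax probability $p(t|c)$, with randomly initialized parameters standing in for learned ones:

```python
import numpy as np

V, D = 10000, 300
E = np.random.randn(D, V)            # embedding matrix (to be learned)
Theta = np.random.randn(D, V)        # softmax parameters, one column theta_t per word

def p_target_given_context(t, c):
    # p(t | c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c)
    e_c = E[:, c]
    logits = Theta.T @ e_c           # the denominator sum runs over all 10,000 words
    logits -= logits.max()           # for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits[t] / exp_logits.sum()
```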

  The biggest problem with this algorithm is that the denominator of the softmax is expensive to compute, which slows everything down. One solution is the hierarchical softmax classifier. Instead of deciding among all 10,000 classes at once, a binary classifier (e.g., logistic regression) first decides whether the word is in the first 5,000 words of the vocabulary or the last 5,000; if it is in the first 5,000, another classifier decides whether it is in the first 2,500 or the second 2,500, and so on, until a single word is reached. In practice the classification tree is not perfectly balanced (left-right symmetric); it is built according to word frequency, so common words (such as "the" and "of") sit near the top and need only a few steps, while rare words (such as "durian") sit deeper and require more search steps.
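
A sketch of the idea behind hierarchical softmax; the data structures (node indices, path signs, per-node parameters) are illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(e_c, path_nodes, path_signs, node_params):
    """p(word | context) as a product of binary decisions along the tree path.

    path_nodes: indices of the internal nodes from the root to the word's leaf;
    path_signs: +1 or -1 depending on whether the path goes right or left;
    node_params: one logistic weight vector per internal node.
    """
    prob = 1.0
    for node, sign in zip(path_nodes, path_signs):
        prob *= sigmoid(sign * (node_params[node] @ e_c))
    return prob
```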

  How should the context be sampled? Sampling the corpus uniformly at random is not ideal, because common words (such as "the" and "of") would then be sampled very frequently. In practice the sampling probability P(c) of the context is designed heuristically, taking word frequency into account, to balance common and rare words.

 

7. Negative sampling

  The algorithm in this section (Mikolov et al., 2013, "Distributed representations of words and phrases and their compositionality") avoids computing the softmax denominator of the previous section and is therefore more efficient. Pick a context word in the corpus and a word near it as the target; these two words form a positive example. Then, for the same context word, randomly pick k other words from the dictionary to form k negative examples (even if one of them, such as "of" or "the", happens to appear near the context word, it is still treated as a negative example). Supervised learning is then performed on these k+1 examples to predict whether each pair is positive or negative. How to choose k? The smaller the dataset, the larger k should be: k = 5–20 for small datasets and k = 2–5 for large ones. As shown in the figure below, this converts the 10,000-way softmax classifier of the previous section (the vocabulary is again assumed to have 10,000 words) into 10,000 binary classifiers (logistic functions), and each iteration updates only k+1 of them. How are negative samples chosen? The empirical formula proposed by Mikolov et al. is $P(w_i)=\frac{f(w_i)^{3/4}}{\sum_{j=1}^{10000}f(w_j)^{3/4}}$, where $f(w_i)$ is the frequency of the word $w_i$ in the corpus.
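
A sketch of the negative-sampling loss for one context word, using the $f(w)^{3/4}$ sampling heuristic; parameter shapes and names are illustrative:

```python
import numpy as np

V, D, k = 10000, 300, 5
E     = np.random.randn(D, V)        # context embeddings e_c
Theta = np.random.randn(D, V)        # one logistic parameter vector theta_t per word

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(c, t_pos, word_freq, rng=np.random.default_rng(0)):
    # Sample k negative words with probability proportional to f(w)^(3/4).
    p = word_freq ** 0.75
    p = p / p.sum()
    negatives = rng.choice(V, size=k, p=p)

    e_c = E[:, c]
    loss = -np.log(sigmoid(Theta[:, t_pos] @ e_c))       # positive pair, label 1
    for t_neg in negatives:                              # k negative pairs, label 0
        loss -= np.log(sigmoid(-Theta[:, t_neg] @ e_c))
    return loss                                          # only k+1 classifiers touched
```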

8. GloVe(global vectors for word representation) word vectors

  The algorithm in this section (Pennington et al., 2014, "GloVe: Global vectors for word representation") is not used as widely as the Word2Vec / skip-gram models, but it is simpler. The algorithm defines a variable $X_{ij}$, the number of times the i-th word (playing the role of the target in the previous algorithms) appears in the context of the j-th word (playing the role of the context). In other words, $X_{ij}$ is a counter of how often the two words appear together. If the context is defined symmetrically (a window of several words before and after), then $X_{ij}=X_{ji}$; otherwise they need not be equal. The optimization objective of the algorithm is:

$$\text{minimize}\sum_{i=1}^{10000}\sum_{j=1}^{10000}f(X_{ij})\left(\theta_i^Te_j+b_i+b_j'-\log X_{ij}\right)^2$$

The meaning of this formula is to learn $\theta_i$ and $e_j$ so that their inner product predicts how often the two words co-occur, via $\log X_{ij}$ (in the logistic and softmax models the analogous quantity was $e^{\theta_t^Te_c}$; here the log is taken of $X_{ij}$ instead). $b_i$ and $b_j'$ are bias terms, and $f(X_{ij})$ is a weighting term: when $X_{ij}=0$, $f(X_{ij})=0$ and the convention $0\log 0=0$ is used; on the other hand, $f(X_{ij})$ should not be too large for very common words such as "this", "is", "of", nor too small for rare words such as "durian". In this algorithm $\theta_w$ and $e_w$ play symmetric roles with the same meaning, so the final word embedding of word w can be taken as $e_w^{(final)}=(e_w+\theta_w)/2$.
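
A sketch of the GloVe weighting function and objective; the cap at $x_{max}=100$ and the exponent $3/4$ follow the choices reported in the GloVe paper, and the nested loops are written for clarity rather than speed:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Weighting f(X_ij): 0 when X_ij = 0, capped at 1 for very frequent pairs.
    return 0.0 if x == 0 else min((x / x_max) ** alpha, 1.0)

def glove_objective(X, Theta, E, b, b_prime):
    """Sum over i, j of f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2."""
    total = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            w = glove_weight(X[i, j])
            if w == 0.0:                   # implements the 0 * log(0) = 0 convention
                continue
            diff = Theta[:, i] @ E[:, j] + b[i] + b_prime[j] - np.log(X[i, j])
            total += w * diff ** 2
    return total
```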

  It is worth mentioning that the individual dimensions of the learned feature vectors are not necessarily interpretable in a human-friendly way. Even though we imagined dimensions such as Gender, Royal, Age, Food, and so on, because $(A\theta_i)^T(A^{-T}e_j)=\theta_i^TA^TA^{-T}e_j=\theta_i^Te_j$ for any invertible matrix A, the axes of the learned space may be arbitrarily rotated, so each learned dimension can be a combination of the human-interpretable ones and no longer has a clean meaning.

 

9. Sentiment classification

  The sentiment classification problem is to judge, from a piece of text, whether the writer likes something. For example, on a review site such as Dianping, a user writes a comment about a shop and gives it a star rating; the sentiment classification task is to predict the star rating from the comment. The difficulty is that labeled training sets for this problem are usually not very large, but with word embeddings, even a moderately sized labeled training set can produce a decent sentiment classifier.

  The simplest algorithm converts each word of the review into its word embedding using the embedding matrix, averages the embeddings of all the words, and feeds this average (a 300-dimensional vector if the embeddings are 300-dimensional) into a softmax classifier that predicts the number of stars. The problem with this algorithm is that it ignores word order. If the review is "Completely lacking in good taste, good service, and good ambience", the many occurrences of "good" push the averaged embedding toward a positive review; the algorithm does not realize that the negation "lacking in" turns all the "good"s into a negative review. The improved version of the algorithm therefore replaces the averaging operation with an RNN:

 

This is a many-to-one RNN architecture. The value of word embeddings here is that if another review uses "absent of" instead of "lacking in", the two phrases have very similar feature vectors, so the model still generalizes well.
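
For contrast, a minimal Python sketch of the simple averaging classifier described above (the RNN version would replace the averaging step with a recurrent pass over the word sequence); the 5-star output shape and weight names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def average_embedding_classifier(review_words, embeddings, W, b):
    """Average the word embeddings of a review and feed the average to a softmax.

    embeddings: dict word -> 300-dim vector; W: (5, 300) weights, b: (5,) biases
    for a 5-star rating problem (shapes are only an example)."""
    vectors = [embeddings[w] for w in review_words if w in embeddings]
    avg = np.mean(vectors, axis=0)               # still a 300-dimensional vector
    return softmax(W @ avg + b)                  # probabilities over 1..5 stars
```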

 

10. Debiasing word embeddings

  This section's algorithm (Bolukbasi et al., 2016, "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings") discusses how to remove biases (such as gender, race, or age bias) from word embeddings. Take gender bias as an example: man : computer programmer is analogous to woman : what? If the algorithm answers "homemaker", it is exhibiting gender bias; we would prefer that woman also correspond to computer programmer, i.e., that "computer programmer" carry no gender component. How can the bias be removed?

  Step 1: Identify the bias direction in the (e.g., 300-dimensional) embedding space. Concretely, compute the differences between the embeddings of pairs such as he and she, or male and female, and average these differences; the result is the overall gender bias direction. Here the bias direction is one-dimensional, and the remaining 299 dimensions form the non-bias subspace. (Andrew Ng simplifies here: in the original paper the bias direction is not a simple average but is obtained with a method similar to principal component analysis, so the bias subspace may have several dimensions, with the rest forming the non-bias subspace.)

  Step 2: Neutralize: for words such as doctor and babysitter, which should be unrelated to gender, project them onto the non-bias subspace, eliminating their component along the bias direction. How do we decide which words are neutral? The authors of the paper trained a classifier to decide which words are neutral.

  Step 3: Equalize: for pairs such as grandmother and grandfather, which should differ only in gender, adjust them so that they are the same distance from the neutral words.
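
A minimal Python sketch of the simplified bias-direction computation and the neutralize step (step 2); the word pairs and function names are illustrative:

```python
import numpy as np

def bias_direction(embeddings, pairs=(("he", "she"), ("male", "female"))):
    # Simplified (lecture) version: average the differences of a few gendered pairs.
    diffs = [embeddings[a] - embeddings[b] for a, b in pairs]
    g = np.mean(diffs, axis=0)
    return g / np.linalg.norm(g)

def neutralize(e_word, g):
    # Step 2: remove the component of a neutral word (e.g. "doctor") along g.
    return e_word - (e_word @ g) * g
```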

  
