A summary of word embedding (Word Embedding)

I went through the course for a long time without really understanding what word embedding is, so I organized the relevant material below.

1. Text and vectors

Text is unstructured data and cannot be computed on directly.

The purpose of text representation is to turn this unstructured information into structured information, so that computations can be run on the text to accomplish everyday tasks such as text classification and sentiment analysis.
There are many ways to represent text, but they fall into three main categories: one-hot encoding, count-based methods borrowed from information retrieval (bag of words, TF-IDF, n-grams), and distributed representations. The sections below cover each in turn.

2. One-hot encoding

One-hot encoding represents each word as a vector as long as the vocabulary, with a 1 at that word's position and 0 everywhere else. In practice, however, a corpus easily contains tens of thousands of distinct words, so these vectors become very long and more than 99% of their entries are 0.

The disadvantages of one-hot are as follows:

  • It cannot express any relationship between words
  • Such extremely sparse vectors are inefficient to compute with and to store
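
As a minimal sketch (with a made-up five-word toy vocabulary, purely for illustration), one-hot encoding looks like this in Python:

```python
# Minimal one-hot encoding sketch over a made-up toy vocabulary.
vocab = ["i", "like", "love", "beijing", "shanghai"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("beijing"))   # [0, 0, 0, 1, 0]
# With a realistic vocabulary of tens of thousands of words, almost every entry
# is 0, and no pair of one-hot vectors reveals anything about word similarity.
```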

3. Information retrieval (IR) techniques

To overcome the limitations of one-hot encoding, NLP borrowed information retrieval (IR) techniques that vectorize text using documents as context, such as TF-IDF, LSA, and topic models.

·Bag of words

The bag-of-words model ignores the order and contextual relationships of the words in a text and considers only the weight of each word, where a word's weight is related to how often it appears in the text.

The bag-of-words model first segments the text into words. By counting how many times each word appears, we obtain word-based features of the text; putting these words and their per-sample frequencies together is what we usually call vectorization. After vectorization, TF-IDF is typically applied to re-weight the features, which are then normalized. After a few further preprocessing steps, the data can be fed into a machine learning model.

The three steps of the bag-of-words model: tokenizing, counting, and normalizing.
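
A minimal plain-Python sketch of those three steps, using a made-up three-sentence corpus (a real pipeline would normally use a library such as scikit-learn's CountVectorizer):

```python
from collections import Counter

# Made-up toy corpus.
corpus = ["I like Beijing", "I do not like Beijing", "I love Beijing"]

# 1. Tokenizing: split each document into words.
tokenized = [doc.lower().split() for doc in corpus]

# 2. Counting: build a vocabulary and count each word per document.
vocab = sorted({w for doc in tokenized for w in doc})
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# 3. Normalizing: divide by document length so long documents do not dominate.
vectors = [[c / sum(row) for c in row] for row in counts]

print(vocab)        # ['beijing', 'do', 'i', 'like', 'love', 'not']
print(counts[0])    # bag-of-words counts for "I like Beijing"
```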

The bag-of-words model has serious limitations: because it considers only word frequency and ignores contextual relationships, part of the text's semantics is lost.

Disadvantages of the bag of words model:

The essence of the bag-of-words model is to build a vocabulary and then assign values to the vocabulary words based on the text, but the model has no way to express the relationship between similar words.

For example, "I like Beijing" and "I don't like Beijing" mean very different things, yet the bag-of-words model judges them to be highly similar.

"I like Beijing" and "I love Beijing" express nearly the same meaning, but the bag-of-words model cannot capture the strong similarity between "like" and "love". (Of course, the model may still assign these two sentences a high similarity score, but that is beside the point being made here.)

In large text corpora, some words are very common (e.g., "the", "a", "is" in English) and therefore carry little useful information about the actual content of a document. If we feed raw count data directly to a classifier, these very frequent words will overshadow the frequencies of rarer but more meaningful words.

To re-weight the count features and convert them into floating-point values suitable for a classifier, a TF-IDF transformation is usually applied.

TF-IDF

It is a commonly used weighting technique for information retrieval and text mining.

TF-IDF is a statistical method for evaluating how important a word is to a document within a collection or corpus.

The importance of a word increases with the number of times it appears in the document, but decreases with how frequently it appears across the whole corpus.

Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of how relevant a document is to a user query.

The main idea:

If a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good discriminating power between categories and to be suitable for classification.

A high term frequency within a particular document, combined with a low document frequency across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.

Looking at the IDF term, the more documents a word appears in, the lower its IDF value; when it appears in every document, its IDF is 0. Such high-frequency words are usually function words like "the", "I", "do", which contribute little to the weighting of an article.

For example, if the word "environmental protection" appears very frequently in one article, we can consider it important for that article; but once it appears in many articles, it becomes much less discriminative.
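
A minimal plain-Python sketch of this weighting, using the textbook formulas implied above (tf = count in the document / document length, idf = log(N / df), so a word appearing in every document gets idf 0); the three documents are made up, and real libraries such as scikit-learn use slightly smoothed variants:

```python
import math
from collections import Counter

# Made-up toy document collection.
docs = [
    "environmental protection is important for the city".split(),
    "the city builds a new park".split(),
    "the park improves environmental quality".split(),
]
N = len(docs)

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length.
    return Counter(doc)[term] / len(doc)

def idf(term):
    # Inverse document frequency: log(N / df), where df counts the documents containing the term.
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("environmental", docs[0]))  # > 0: appears here but not in every document
print(tf_idf("the", docs[0]))            # 0.0: "the" appears in every document, so idf = 0
```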

Application: the bag-of-words idea is also widely used in image processing (the "bag of visual words").

For a more detailed treatment of the bag-of-words model, see: https://www.jianshu.com/p/0587bc01e414

·Bi-gram and N-Gram

N-Gram is an algorithm based on statistical language models. Its basic idea is to slide a window of size N over the text (by character/byte or by word), producing a sequence of fragments of length N.

Each fragment is called a gram. We count the frequency of every gram and filter them by a preset threshold to form a list of key grams; this list is the vector feature space of the text, and each gram in the list is one dimension of the feature vector.

The model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on nothing else, so the probability of a whole sentence is the product of the probabilities of its words. These probabilities can be estimated by counting how often N words occur together in the corpus. The most commonly used variants are the bigram (Bi-Gram) and trigram (Tri-Gram).
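
A minimal word-level bigram (N=2) sketch with maximum-likelihood counts over a made-up toy corpus; the <s> and </s> sentence-boundary tokens are an assumption added for illustration:

```python
from collections import Counter

# Made-up tokenized toy corpus.
sentences = [
    "i like beijing".split(),
    "i love beijing".split(),
    "i like shanghai".split(),
]

bigram_counts = Counter()
history_counts = Counter()
for sent in sentences:
    padded = ["<s>"] + sent + ["</s>"]          # mark sentence boundaries
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        history_counts[prev] += 1

def bigram_prob(prev, cur):
    # Maximum-likelihood estimate P(cur | prev) = count(prev, cur) / count(prev).
    return bigram_counts[(prev, cur)] / history_counts[prev]

def sentence_prob(words):
    # Sentence probability = product of the bigram probabilities.
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, cur)
    return prob

print(bigram_prob("i", "like"))                 # 2/3 in this toy corpus
print(sentence_prob("i love beijing".split()))  # product of its bigram probabilities
```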
A common application of the N-gram model is search engines (Google or Baidu) and input-method suggestions. When you type one or a few words into Google, the search box usually offers several completions in a drop-down menu; these options are guesses of the word string you intend to search for.

4. Distributed representation

Distributed representation builds on one-hot representation by taking relationships between words into account, such as word meaning and part of speech. Each dimension is no longer 0 or 1 but a continuous real number expressing a degree. Distributed representations come in the following three kinds:

  • Matrix-based distributional representation. A row of the matrix becomes the representation of the corresponding word and describes the distribution of that word's contexts. Since the distributional hypothesis holds that words with similar contexts have similar meanings, under this representation the semantic similarity of two words can be read directly from the spatial distance between their vectors.
  • Cluster-based distributional representation.
  • Neural-network-based distributed representation.
    The core idea of all of these has two parts: 1. choose a way to describe a word's context; 2. choose a model of the relationship between a word (the "target word") and its context.
    The "distributed representation" we usually talk about today is mainly the neural-network-based kind. For example, the distributed representations of 'Hangzhou' and 'Shanghai' might be [0.3 1.2 0.8 0.7] and [0.5 1.2 0.6 0.8] respectively.

So word embedding can be understood as a distributed representation of words: a mapping from a high-dimensional sparse vector to a relatively low-dimensional real-valued vector.

The purpose of distributed representation is to find a transformation function that converts each word into its associated vector. In other words, distributed representation converts words into vectors in such a way that similarity between vectors correlates with semantic similarity between words.
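
For instance, taking the illustrative 'Hangzhou' and 'Shanghai' vectors quoted above (toy numbers, not real trained embeddings), the similarity between vectors is commonly measured with cosine similarity:

```python
import math

hangzhou = [0.3, 1.2, 0.8, 0.7]
shanghai = [0.5, 1.2, 0.6, 0.8]

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|); values close to 1 mean "very similar".
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(hangzhou, shanghai))  # about 0.98: the two cities come out as similar
```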

The distributional hypothesis: words that appear in similar contexts have similar meanings. Information can then be stored distributed across the dimensions of a vector. Such distributed representations are compact and low-dimensional, and make it easy to capture syntactic and semantic information.

Ps: In short, such vectors can represent many more words without the dimensionality having to grow.


1. Matrix-based distributional representation

Matrix-based distributional representation mainly constructs a **"word-context" matrix** and derives word representations from it by some technique. The rows of the matrix correspond to words, the columns to contexts, and each element to the number of times a word co-occurs with a context; one row of the matrix thus describes the context distribution of that word.

Common choices of context are: 1. documents, giving a "word-document" matrix; 2. the words surrounding each word, giving a "word-word" matrix; 3. n-grams, giving a "word-n-gram" matrix. Each matrix element is a word-context co-occurrence count, usually weighted and smoothed with TF-IDF, logarithms, or similar methods. If the matrix is very high-dimensional and sparse, techniques such as SVD can reduce it to a lower-dimensional, denser matrix.

co-occurrence matrix


In the original example, three sentences are given and assumed to be the entire corpus. We slide a window of size 1 over each sentence, which amounts to counting only adjacent words; this yields a co-occurrence matrix.

Each column of the co-occurrence matrix can naturally be regarded as a vector representation of the corresponding word. This is clearly better than the one-hot representation because every dimension has a meaning (a co-occurrence count), so these vectors can be used to measure similarity between words.

Disadvantages: this alleviates, to some extent, the problem that the similarity between any two one-hot vectors is 0, but it still suffers from data sparsity and the curse of dimensionality: the vector dimension grows linearly with the vocabulary size.
Remedy: dimensionality reduction such as SVD or PCA, at considerable computational cost.
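
A minimal sketch of building a co-occurrence matrix with a window of size 1; the three sentences are made up, since the original figure is not reproduced here:

```python
# Made-up three-sentence corpus, tokenized.
sentences = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]

vocab = sorted({w for sent in sentences for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts of adjacent words (window size 1).
matrix = [[0] * len(vocab) for _ in vocab]
for sent in sentences:
    for left, right in zip(sent, sent[1:]):
        matrix[index[left]][index[right]] += 1
        matrix[index[right]][index[left]] += 1

print(vocab)
for word, row in zip(vocab, matrix):
    print(word, row)   # each row (or column) can serve as that word's vector
```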

SVD (Singular Value Decomposition)

First build the word co-occurrence matrix X of size |V|×|V|, where X_ij is the number of times the i-th and j-th words of the vocabulary V co-occur in the corpus and |V| is the vocabulary size. Apply a matrix factorization such as singular value decomposition to X to obtain an orthogonal matrix U; after normalizing U, its rows are taken as the word vectors of all words. The dense word matrix produced by SVD has many nice properties: semantically similar words are close together in the vector space, and it can even capture linear relationships between words to some extent.

But this traditional approach has several problems:

  • Since many word pairs never co-occur, the matrix is extremely sparse, so the raw counts need extra processing before matrix factorization works well.
  • The matrix is very large and its dimensionality is very high.
  • Stop words (such as "although", "a", ...) have to be removed by hand, otherwise these very frequent words distort the factorization.
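
A minimal NumPy sketch of this idea, assuming a small made-up co-occurrence matrix X like the one built above; the truncation level k is an illustrative choice:

```python
import numpy as np

# Made-up |V| x |V| co-occurrence matrix (here |V| = 4).
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# Singular value decomposition: X = U * diag(s) * Vt, with U orthogonal.
U, s, Vt = np.linalg.svd(X)

# Keep only the first k singular directions and use the rows as dense word vectors.
k = 2
word_vectors = U[:, :k] * s[:k]                                      # scale columns by singular values
word_vectors /= np.linalg.norm(word_vectors, axis=1, keepdims=True)  # normalize each row

print(word_vectors)   # one dense k-dimensional vector per word
```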

2. Clustering-based distributional representation (I haven't really understood this one)

The relationship between a word and its contexts is built up through clustering; the representative model is Brown clustering.

3. Neural-network-based distributional representation

  • Neural network language model (NNLM)
    NNLM was proposed by Bengio et al. Its main ideas are:
  1. Each word in the dictionary corresponds to a word feature vector

  2. The joint probability of a word sequence is expressed as a function of these feature vectors

  3. The word feature vectors and the parameters of the probability function are learned automatically and jointly

In NNLM, each word is a point in a vector space whose number of features is much smaller than the dictionary size, and the probability of a sequence is expressed as a product of conditional probabilities of each next word given the preceding words.

Ps: I didn't fully understand the NNLM architecture diagrams (not reproduced here).

  • Recurrent neural network language model (unknown)
  • C&W model (unknown)
  • Word2Vec
    • skip-gram
    • CBOW (Continuous Bag-of-Words)

Ps: Writing this, I got a bit confused: what is the relationship between a corpus, word vectors, word embedding, and a language model? See below ☞.

If the problem of natural language understanding is to be turned into a machine learning problem, the first step is to find a way to mathematize these linguistic symbols.

  • The most direct way is one-hot encoding, but then there is no relationship between words.
  • The other is distributed representation:
    • Matrix-based distributional representation (n-gram matrices, GloVe)
    • Cluster-based distributional representation (similarity judged from the clusters that two words belong to, e.g. Brown clustering)
    • Neural-network-based distributed representation, i.e. word embedding. The core of all of these is still how to represent the context and how to model the relationship between the context and the target word.
      So far, every word-vector training method has obtained word vectors as a by-product of training a language model, which is why the concept of a language model comes in.

Language models include grammar-based language models and statistical language models; usually we mean the statistical kind. Essentially, a language model judges whether a sentence reads like something a person would actually say. For example, after machine translation or speech recognition has produced several candidates, a language model can be used to pick the most plausible result. It is also useful in other NLP tasks.

OK, now back to the original question, what is the relationship between corpus, word vectors, word embeddings, and language models?

A corpus is used to train a language model, and word vectors are produced as a by-product; word embeddings are one form of such word-vector representations.
With your own corpus, can you fine-tune to obtain your own language model?

5. Word Embedding and Word2Vec

Word embedding converts the words in a text into numeric vectors so that they can be analyzed with standard machine learning algorithms, which require numeric input. The embedding maps a high-dimensional space, whose dimensionality equals the number of distinct words, into a continuous vector space of much lower dimension; each word or phrase is mapped to a vector of real numbers. The result of word embedding is a word vector.

Word vectors are the preferred way to vectorize text in all kinds of NLP tasks, such as part-of-speech tagging, named entity recognition, text classification, document clustering, sentiment analysis, document generation, and question answering.

About Word2Vec

Word2vec is one particular word embedding method, proposed by Mikolov's team at Google in 2013.

The algorithm has two training modes:

  • Predict the current word from its context (CBOW)
  • Predict the context from the current word (Skip-Gram)

The Word2Vec workflow really has two parts: the first is constructing the training data set, and the second is obtaining the embedding vectors, i.e. the word embeddings, through the model.

The overall modeling process of Word2Vec is quite similar in spirit to an auto-encoder: a neural network is built and trained on the training data, but afterwards the trained model is not used to process new data; what we actually want are the parameters the model learned from the training data.

As word embedding developed, taking context into account led to models whose input and output both consist of words from the vocabulary, giving rise to the two model variants Skip-Gram and CBOW. Meanwhile, the hidden-layer-to-output-layer computation evolved from a plain softmax() to hierarchical softmax and negative sampling.
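
A minimal usage sketch with the gensim package mentioned later in this post (a made-up toy corpus; parameter names follow gensim 4.x, where older versions used size instead of vector_size):

```python
from gensim.models import Word2Vec

# Made-up tokenized corpus; real training needs a much larger corpus.
sentences = [
    ["i", "like", "beijing"],
    ["i", "love", "beijing"],
    ["i", "like", "shanghai"],
]

# sg=1 selects Skip-Gram (predict the context from the current word);
# sg=0, the default, selects CBOW (predict the current word from its context).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["beijing"]              # the learned word vector (word embedding)
print(vector.shape)                       # (100,)
print(model.wv.most_similar("beijing"))   # nearest neighbours in the embedding space
```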

1. Skip-gram algorithm

Skip-Gram predicts the context given an input word. Think of the fill-in-the-blank sentences from elementary school English class, for example: "The __________".

The Skip-Gram setup involves the following steps:

  • Choose a center word from the sentence; this is the Skip-Gram model's input word.
  • Define the skip_window parameter: the number of words to take from each side (left and right) of the input word.
  • Build the window list from the center word and skip_window.
  • Define the num_skips parameter: how many distinct words to draw from the window list as output words.

Suppose we have the sentence "The quick brown fox jumps over the lazy dog" and set the window size to 2 (window_size=2); that is, we only pair the center word (input word) with the two words on either side of it. The center word then slides along the sentence with a step of 1; in the original figure, blue marks the input word and the boxes mark the words in its window.
Therefore, we can use Skip-Gram to construct the training data of the neural network.
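
A minimal sketch of generating those (input word, output word) pairs for the example sentence with window_size=2 (here every word in the window is kept, rather than sampling num_skips of them):

```python
sentence = "The quick brown fox jumps over the lazy dog".split()
window_size = 2

training_pairs = []
for i, center in enumerate(sentence):
    # Take up to window_size words on each side of the center (input) word.
    lo = max(0, i - window_size)
    hi = min(len(sentence), i + window_size + 1)
    for j in range(lo, hi):
        if j != i:
            training_pairs.append((center, sentence[j]))

# First pairs, with "The" and then "quick" as the center word:
# ('The', 'quick'), ('The', 'brown'),
# ('quick', 'The'), ('quick', 'brown'), ('quick', 'fox'), ...
print(training_pairs[:6])
```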

A word cannot be fed to a neural network as a text string, so we need a way to encode it. To do this, we first build a vocabulary of, say, 10,000 distinct words from the training documents and give each word a one-hot representation. The output of the network is then a single vector (also with 10,000 components) containing, for each word in the vocabulary, the probability that it is a randomly chosen nearby word of the input word.

Suppose we extract 10,000 distinct words from our training documents to form the vocabulary. One-hot encoding these 10,000 words gives each word a 10,000-dimensional vector whose components are all 0 or 1. If the word "ants" occupies the third position in the vocabulary, its vector has a 1 in the third dimension and 0 everywhere else (ants = [0, 0, 1, 0, …, 0]).

The original figure shows the network to be trained. The Input Vector neurons on the left correspond to the one-hot representation of a vocabulary word, and each neuron on the right corresponds to one word of the vocabulary. When feeding training data, both the input word (center word) and the output word are one-hot encoded. When the trained network is evaluated or used for prediction, however, the output vector is a probability distribution over the whole vocabulary computed by softmax() (i.e., a bunch of floating-point values, not a one-hot vector).

hidden layer

Having covered word encoding and the selection of training samples, let's look at the hidden layer. Suppose we want to represent each word with 300 features (i.e., each word becomes a 300-dimensional vector). The hidden-layer weight matrix is then 10,000 rows by 300 columns (the hidden layer has 300 nodes).
Google's model trained on the Google News dataset uses word vectors with 300 features. The dimensionality of the word vector is a tunable hyperparameter (the Word2Vec interface in Python's gensim package defaults to a vector size of 100 and a window of 5).

The original figure shows the input-to-hidden weight matrix from two angles. In the left view, each column is the weight vector connecting the 10,000-dimensional one-hot input to a single hidden neuron; in the right view, each row is the word vector of one word.

So our ultimate goal is simply to learn this hidden-layer weight matrix.
Now let's come back and train the model we have defined.

We mentioned above that both the input word and the output word are one-hot encoded. After one-hot encoding, most dimensions of the input are 0 (in fact exactly one position is 1), so the vector is extremely sparse. What does that imply? Multiplying a 1 x 10000 vector by a 10000 x 300 matrix costs considerable computation, so for efficiency the multiplication is replaced by simply selecting the matrix row whose index corresponds to the dimension that is 1 in the input vector. The example below makes this clear.

Consider the small matrix operation from the original figure: a 1 x 5 vector times a 5 x 3 matrix gives a 1 x 3 result. By the rules of matrix multiplication, the first element of the result is the dot product of the vector with the first column of the matrix, and the other two elements (12 and 19 in the example) are obtained the same way. Carrying out this calculation literally for 10,000-dimensional vectors is very inefficient.

To compute efficiently, the multiplication is never carried out in this sparse form. Notice that the result is exactly the matrix row indexed by the position of the 1 in the input vector. In the example, the 1 sits at index 3 (counting from 0), so the result is row 3 of the matrix (again counting from 0): [10, 12, 19]. The hidden-layer weight matrix therefore acts as a "lookup table": instead of multiplying, we look up the weight row corresponding to the dimension that is 1 in the input vector. The output of the hidden layer is the "word embedding" of the input word.
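
A minimal NumPy sketch of why the multiplication collapses into a row lookup; the weight values are made up except for row 3, which mirrors the [10, 12, 19] row mentioned above:

```python
import numpy as np

# Tiny example: vocabulary of 5 words, 3 hidden features.
W = np.array([
    [ 1,  2,  3],
    [ 4,  5,  6],
    [ 7,  8,  9],
    [10, 12, 19],   # row 3, matching the example in the text
    [13, 14, 15],
])

one_hot = np.array([0, 0, 0, 1, 0])   # the 1 sits at index 3

# The full matrix multiplication...
print(one_hot @ W)             # [10 12 19]
# ...gives exactly the same result as looking up the row directly:
print(W[one_hot.argmax()])     # [10 12 19]
# So the hidden-layer weight matrix acts as a lookup table, and the selected
# row is the word embedding of the input word.
```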

output layer

After the hidden-layer computation, the word "ants" is transformed from a 1 x 10000 vector into a 1 x 300 vector, which is then fed to the output layer. The output layer is a softmax regression classifier: each output node produces a value between 0 and 1 (a probability), and the probabilities of all output nodes sum to 1.
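
A minimal sketch of that output-layer softmax, with a made-up 1 x 300 hidden vector and a tiny 5-word output matrix standing in for the real 10,000-word vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = rng.normal(size=300)        # made-up 1 x 300 hidden-layer output for "ants"
W_out = rng.normal(size=(300, 5))    # output weights; 5 words stand in for 10,000

logits = hidden @ W_out              # one raw score per vocabulary word

# Softmax: exponentiate (shifted for numerical stability) and normalize,
# so the outputs are probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs)         # probability of each vocabulary word being the output word
print(probs.sum())   # 1.0
```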

The original post illustrates this with the computation for the training sample (input word: "ants", output word: "car").
I'll stop with Skip-Gram here for now; this is already getting long...

2. Continuous bag-of-words algorithm (CBOW)


See ☞ https://www.cnblogs.com/Luv-GEM/p/10593103.html

I still don't fully understand this one... (crying)

Source: blog.csdn.net/Mason_Chen/article/details/109528753