Embedding Basics

1. What is Embedding

Simply put, Embedding is a method of representing an object with a numerical vector. The object here can be a word, an item, a movie, and so on. An item can be represented by a vector because the distance between its vector and the vectors of other items reflects how similar those items are. Furthermore, the difference vector between two Embedding vectors can even reflect the relationship between the corresponding objects. This may still sound a bit abstract, so let's explain it with two concrete examples.
[Figure: word-vector analogies from the Word2vec paper: king→queen vs. man→woman (left) and walking→walked vs. swimming→swam (right)]

The figure above is an example from Google's famous Word2vec paper. The Word2vec model maps words into a high-dimensional space, and the positions of the words in this space are very interesting. In the left figure, the vector from king to queen and the vector from man to woman are very close in both direction and magnitude. This shows that arithmetic on word Embedding vectors can reveal the gender relationship between words. For example, the word vector of woman can be approximated by the following operation:

Embedding(woman)=Embedding(man)+[Embedding(queen)-Embedding(king)]

Similarly, in the right figure, the vectors from walking to walked and from swimming to swam are essentially the same, which shows that the word vectors also capture the tense relationship between words.
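To make this vector arithmetic concrete, here is a minimal sketch with made-up 3-dimensional word vectors (real Word2vec vectors are learned from a corpus and have far more dimensions); it computes Embedding(man) + [Embedding(queen) - Embedding(king)] and checks which word's vector is closest to the result.

```python
import numpy as np

# Hypothetical pre-trained word vectors (in practice they come from a trained Word2vec model).
embeddings = {
    "king":  np.array([0.80, 0.30, 0.10]),
    "queen": np.array([0.75, 0.90, 0.12]),
    "man":   np.array([0.60, 0.25, 0.50]),
    "woman": np.array([0.55, 0.85, 0.52]),
}

# Embedding(woman) ≈ Embedding(man) + [Embedding(queen) - Embedding(king)]
predicted = embeddings["man"] + (embeddings["queen"] - embeddings["king"])

# Find the word whose vector is closest (smallest Euclidean distance) to the predicted vector.
closest = min(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - predicted))
print(closest)  # with these toy vectors: woman
```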

[Figure: Embedding vectors of movies and users generated by Netflix, distributed in a two-dimensional space]

The movie Embedding method used by Netflix is a very direct recommendation system application. The schematic diagram above shows the Embedding vectors of movies and users that Netflix generated with matrix factorization, distributed in a two-dimensional space. Because the Embedding vectors preserve the similarity relationships between them, recommending movies becomes very easy once this Embedding space exists: we simply find the movie vectors around a user vector and recommend those movies to the user. This is the most direct application of Embedding technology in recommendation systems.
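As a small sketch of this "recommend the movies nearest to the user vector" idea, the following uses made-up two-dimensional Embedding vectors and ranks movies by cosine similarity to one user; the vectors and movie names are purely illustrative.

```python
import numpy as np

# Toy 2-D Embedding vectors; in practice they come from matrix factorization or another Embedding model.
user_vec = np.array([0.9, 0.2])
movie_vecs = {
    "Movie A": np.array([0.85, 0.25]),
    "Movie B": np.array([-0.30, 0.90]),
    "Movie C": np.array([0.70, 0.10]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank movies by similarity to the user vector and recommend the top ones.
ranked = sorted(movie_vecs.items(), key=lambda kv: cosine(user_vec, kv[1]), reverse=True)
print([name for name, _ in ranked[:2]])  # ['Movie A', 'Movie C'] with these toy vectors
```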

2. The role of Embedding technology in feature engineering

  1. First, Embedding is a powerful tool for handling sparse features. Recommendation scenarios involve many categorical and ID-type features, and encoding them with One-hot vectors makes the sample feature vectors extremely sparse, while the structure of deep learning models is not well suited to processing sparse feature vectors. Therefore, almost all deep learning recommendation models use an Embedding layer to convert sparse, high-dimensional feature vectors into dense, low-dimensional ones, and the various Embedding techniques are basic operations for building deep learning recommendation models (a minimal sketch of this sparse-to-dense conversion follows below).
  2. Second, Embedding can incorporate a large amount of valuable information and is itself an extremely important feature vector. Compared with feature vectors produced directly from the raw information, Embedding has stronger expressive power. Especially since Graph Embedding techniques were proposed, Embedding can encode almost any kind of information, so a pre-trained Embedding vector is itself an extremely important feature vector.

These two characteristics are also the reason why we put the Embedding material in the feature engineering chapter: it is not only a method for handling sparse features, but also an effective means of fusing a large number of basic features into higher-order feature vectors.
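The sketch below illustrates the first point under made-up sizes: an Embedding layer is essentially a trainable V x N weight matrix, and converting a sparse ID (or One-hot) feature into a dense vector is just looking up one row of that matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10000   # number of distinct IDs; the One-hot representation would be 10000-dimensional
embedding_dim = 32   # dimension of the dense Embedding vector

# The Embedding layer is just a V x N weight matrix, learned during model training.
embedding_matrix = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))

item_id = 4217                              # a sparse ID-type feature
dense_feature = embedding_matrix[item_id]   # equivalent to multiplying the One-hot vector by the matrix

print(dense_feature.shape)  # (32,) -- a dense, low-dimensional feature vector
```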

3. Word2vec

1. What is Word2vec

Word2vec is short for "word to vector". As the name suggests, it is a model that generates vector representations of words. To train a Word2vec model, we need to prepare a corpus consisting of a set of sentences. Suppose one of the sentences has length T and contains the words w1, w2, ..., wT, and we assume that each word is most closely related to its neighboring words. Depending on the model assumption, Word2vec comes in two forms: the CBOW model (left figure) and the Skip-gram model (right figure).
[Figure: the CBOW model structure (left) and the Skip-gram model structure (right)]

The CBOW model assumes that each word in a sentence is determined by its adjacent words, so the input of the CBOW model is the words surrounding wt and the predicted output is wt. The Skip-gram model is just the opposite: it assumes that each word determines the choice of its adjacent words, so the input of the Skip-gram model is wt and the predicted output is the words surrounding wt. According to general experience, the Skip-gram model performs better, so next we use Skip-gram as the framework to explain the details of the Word2vec model.

2. How Word2vec samples are generated

As a natural language processing model, the training samples of Word2vec naturally come from a corpus. For example, if we want to train an Embedding model for the keywords of an e-commerce website, the description texts of all the items on that website form a good corpus. We take a sentence from the corpus, choose a sliding window of length 2c+1 (c words before and c words after the target word), and slide the window from left to right; every time it moves, the words inside the window form one training sample. According to the Skip-gram assumption that the central word determines its adjacent words, we can define the input and output of the Word2vec model from such a training sample: the input is the central word of the sample, and the output is each of its adjacent words.
For example:
the sentence "The importance of Embedding technology for deep learning recommendation system" is selected as a sentence sample.
First, we segment it, remove stop words, generate word sequences, and then select a sliding window with a size of 3 to slide from the beginning to the end to generate training samples. Then we use the central word as input and the edge word as output. The training samples available for training the Word2vec model are obtained.
[Figure: generating Word2vec training samples with a sliding window]
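The following sketch reproduces this sample-generation step under the Skip-gram assumption; the word sequence is a made-up example, and a window of size 3 corresponds to c = 1 (one word on each side of the central word).

```python
def skipgram_samples(words, c=1):
    """Generate (input, output) pairs: the central word predicts each neighbor within c positions."""
    samples = []
    for i, center in enumerate(words):
        for j in range(max(0, i - c), min(len(words), i + c + 1)):
            if j != i:
                samples.append((center, words[j]))
    return samples

# Toy word sequence after segmentation and stop-word removal (illustrative only).
sequence = ["embedding", "technology", "deep", "learning", "recommendation", "system"]
for center, context in skipgram_samples(sequence, c=1):
    print(center, "->", context)
# embedding -> technology, technology -> embedding, technology -> deep, ...
```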

3. Word2vec model structure

The Word2vec model is essentially a three-layer neural network.
[Figure: the three-layer neural network structure of the Word2vec model]
The dimensions of its input layer and output layer are both V, where V is the size of the corpus vocabulary. If the corpus uses 10,000 words in total, then V equals 10,000. Given how the training samples are constructed, the input vector is the One-hot encoded vector of the input word, and the output vector is the Multi-hot encoded vector formed from the multiple output words. Clearly, the Skip-gram based Word2vec model solves a multi-classification problem.

The dimension of the hidden layer is N, and choosing N requires some tuning skill: we have to trade off model quality against model complexity to settle on the final value of N, which is also the dimension of each word's Embedding vector. Finally, there is the question of activation functions. Note that the hidden layer neurons have no activation function (equivalently, they use the identity function as the activation function), while the output layer neurons use softmax as the activation function.
This is because the neural network is really expressing the conditional probability relationship from the input word to the output word:

P(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / Σ_{w=1}^{V} exp(v'_{w} · v_{w_I})

where v_{w_I} is the hidden-layer (input) vector of the input word w_I and v'_{w_O} is the output vector of the candidate output word w_O. This conditional probability of predicting the output word w_O from the input word w_I is exactly what the Word2vec neural network wants to express. By maximizing this conditional probability with maximum likelihood estimation, we make the inner-product distance between similar words smaller (their vectors closer), which is exactly what we want the Word2vec network to learn.
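A minimal numerical sketch of this structure (toy sizes and randomly initialized weights): the One-hot input selects one row of the input weight matrix as the hidden layer, and the output layer applies softmax to produce the conditional probabilities over the whole vocabulary.

```python
import numpy as np

rng = np.random.default_rng(42)

V, N = 10, 4                       # toy vocabulary size and hidden (Embedding) dimension
W = rng.normal(size=(V, N))        # input-to-hidden weights: each row is a word's Embedding
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract the max for numerical stability
    return e / e.sum()

input_word_id = 3
hidden = W[input_word_id]          # the One-hot input just selects row i of W (no activation function)
scores = hidden @ W_prime          # one score per word in the vocabulary
probs = softmax(scores)            # P(w_O | w_I) for every candidate output word

print(probs.shape, round(probs.sum(), 6))  # (10,) 1.0 -- a probability distribution over the vocabulary
```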

4. Extract word vectors from the Word2vec model

After training the Word2vec neural network, the Embedding vector of every word is hidden in the weight matrix W (of dimension V×N) between the input layer and the hidden layer.

[Figure: each row of the input weight matrix W corresponds to one word vector]

Each row vector of this input weight matrix is exactly the "word vector" we are looking for. For example, suppose we want the Embedding of the i-th word in the dictionary. Since the input vector is One-hot encoded, its i-th dimension is 1, so the i-th row of the input weight matrix is naturally the Embedding of that word.

The output weight matrix W′ follows the same principle, but in general we are still accustomed to using the input weight matrix as the word vector matrix. In actual use, we often convert the input weight matrix into a word vector lookup table (as shown in the figure below). For example, if the input vectors are One-hot vectors over a vocabulary of 10,000 words and the hidden layer has 300 dimensions, then the weight matrix from the input layer to the hidden layer has size 10,000 × 300. After it is converted into a lookup table, each row of weights becomes the Embedding vector of the corresponding word. If we store this lookup table in an online database, we can conveniently use the Embeddings to compute important features such as similarity when recommending items.
[Figure: converting the input weight matrix into a word vector lookup table]
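The sketch below shows this conversion with a randomly generated stand-in for the trained 10,000 × 300 weight matrix and a placeholder vocabulary; the lookup table is just a word-to-row mapping from which similarities can be computed online.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = [f"word_{i}" for i in range(10000)]   # placeholder vocabulary
W = rng.normal(size=(10000, 300))             # stand-in for the trained input weight matrix

# Word vector lookup table: each word maps to its own row of W.
lookup_table = {word: W[i] for i, word in enumerate(vocab)}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Online use: fetch two Embeddings from the table and compute their similarity as a feature.
print(cosine(lookup_table["word_7"], lookup_table["word_42"]))
```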

5. The foundational significance of Word2vec to Embedding technology

Word2vec was formally proposed by Google in 2013, although it was not entirely original: academic research on word vectors can be traced back to 2003 or even earlier. However, it was Google's successful application of Word2vec that allowed word vector technology to spread rapidly in industry and, in turn, made Embedding a hot research topic. It is no exaggeration to say that Word2vec is foundational to Embedding research in the deep learning era. From another point of view, the model structure, the objective function, the negative sampling method, and the objective function under negative sampling proposed in the Word2vec work have been reused and optimized many times in subsequent research. Mastering every detail of Word2vec has therefore become the basis for studying Embedding.

4. Item2Vec

Item2Vec is an extension of the Word2vec method. After Word2vec was born, the idea of Embedding quickly spread from natural language processing to almost every field of machine learning, and recommendation systems are no exception. Since Word2vec can embed the words in a word "sequence", there should be a corresponding Embedding method for the items a user purchases in a "sequence" or the movies a user watches in a "sequence". Therefore, Microsoft proposed the Item2Vec method in 2015; it extends Word2vec so that the Embedding method applies to almost all sequence data. The technical details of Item2Vec are almost identical to those of Word2vec: as long as we can express the objects we want to embed in the form of sequence data and "feed" that sequence data to the Word2vec model, we can obtain an Embedding for any item. The proposal of Item2Vec is of course crucial to recommendation systems, because it makes "everything is Embedding" possible. For a recommendation system, Item2Vec can use item Embeddings to compute item similarity directly, or feed them into a recommendation model as important features for training; both help improve the effectiveness of the recommendation system.
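As a hedged sketch of this "sequence in, Embedding out" idea, the following treats made-up user behavior sequences as sentences and item IDs as words, and trains them with gensim's Word2Vec implementation (assuming gensim 4.x, where the dimension parameter is named vector_size):

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's behavior sequence; each "word" is an item ID (toy data).
user_sequences = [
    ["item_1", "item_5", "item_3", "item_7"],
    ["item_2", "item_5", "item_3"],
    ["item_1", "item_7", "item_4", "item_5"],
]

# Train a Skip-gram model (sg=1) on the item sequences, exactly as Word2vec is trained on sentences.
model = Word2Vec(sentences=user_sequences, vector_size=16, window=2, min_count=1, sg=1)

item_embedding = model.wv["item_5"]              # dense Embedding vector for an item
similar_items = model.wv.most_similar("item_5")  # items closest to item_5 in the Embedding space
print(item_embedding.shape, similar_items[:3])
```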

5. Summary

[Figure: summary of this section]
