Embedding layer in deep learning

Before introducing the Embedding layer, let us first explain binary one-hot encoding (the one-hot vector):

1. Introduction to binary one-hot encoding

To feed words into a neural network as vectors, a simple method is binary one-hot encoding (one-hot vectors). Assume the dictionary contains N distinct characters (that is, the dictionary size vocab_size = N); each character is then mapped one-to-one to an integer index from 0 to N-1. For example, here is a dictionary we built for N = 10:

我  从  哪  里  来  要  到  何  处  去
0   1   2   3   4   5   6   7   8   9  # index

In fact, we can use a vector of indices to represent any sentence composed of these 10 characters:

For example: 我从哪里来，要到何处去 ("Where do I come from, where am I going") ——>>> [0 1 2 3 4 5 6 7 8 9]

Or, with the two question words swapped: 我从何处来，要到哪里去 ——>>> [0 1 7 8 4 5 6 2 3 9]
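
As a minimal sketch (the variable names such as char_to_index are our own, not from the original post), this is what that index mapping looks like in Python:

# Build the character-to-index dictionary for the N = 10 vocabulary above.
vocab = ["我", "从", "哪", "里", "来", "要", "到", "何", "处", "去"]
char_to_index = {ch: i for i, ch in enumerate(vocab)}

# Encode the two example sentences as lists of integer indices.
sentence_1 = "我从哪里来要到何处去"
sentence_2 = "我从何处来要到哪里去"
print([char_to_index[ch] for ch in sentence_1])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print([char_to_index[ch] for ch in sentence_2])  # [0, 1, 7, 8, 4, 5, 6, 2, 3, 9]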

Binary one-hot encoding means that, to represent a character, we set the position corresponding to its index to 1 and every other position to 0. Each character thus corresponds to an array/list of vocab_size elements that is unique to that character, with the single 1 marking which character it is. As above, "我" is expressed as [1 0 0 0 ...] and "去" as [... 0 0 0 1], so that each piece of text becomes a sparse matrix (a matrix that is mostly zeros).

Using binary one-hot encoding, these two sentences can be expressed as follows:

# 我从哪里来，要到何处去 (Where do I come from, where am I going)
[
[1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 1]
]

# 我从何处来，要到哪里去 (the same characters in a different order)
[
[1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
]
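
A hedged sketch of how these one-hot matrices can be produced with PyTorch (torch.nn.functional.one_hot is a standard utility; the tensor names are ours):

import torch
import torch.nn.functional as F

vocab_size = 10
# Index sequences for the two sentences above.
indices_1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
indices_2 = torch.tensor([0, 1, 7, 8, 4, 5, 6, 2, 3, 9])

# Each sentence becomes a (sequence_length, vocab_size) sparse 0/1 matrix.
one_hot_1 = F.one_hot(indices_1, num_classes=vocab_size)
one_hot_2 = F.one_hot(indices_2, num_classes=vocab_size)
print(one_hot_1)
print(one_hot_2)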

2. Why is there an Embedding layer?

A first question: what advantage does the sparse one-hot matrix (two-dimensional) have over the plain index list (one-dimensional)?

Obviously, computation becomes simple. When a sparse one-hot matrix takes part in a matrix multiplication, you only need to multiply and sum the entries that line up with the 1s; you can almost do it in your head. Can you compute as quickly with a one-dimensional index list? And that list was just one row; what if there were 100 rows, 1,000 rows, or 1,000 columns?

This is where one-hot encoding shines: the computation is convenient and fast, and its expressive power is strong.

However, there are also disadvantages.

For example: there are roughly 100,000 Chinese characters in total, counting simplified and traditional forms, and a novel may contain 1,000,000 characters. Do you really want to represent it as a 1,000,000 × 100,000 matrix?

This is its most obvious shortcoming: when the representation is this sparse, it wastes an enormous amount of memory.

Another example: although the novel has 1,000,000 characters, once we look closely, 990,000 of them are repeats and only 10,000 are distinct. Storing a 1,000,000 × 100,000 matrix therefore wastes the equivalent of a 990,000 × 100,000 block of storage.

Moreover, the biggest problem with one-hot encoding is that it cannot express the similarity between different words. As we saw above, the vectors representing different characters are mutually orthogonal, so if we use the common cosine similarity, the similarity between any two distinct characters is zero:

\cos(x, y) = \frac{x^{\top} y}{\left\| x \right\| \left\| y \right\|} \in \left[ -1, 1 \right]
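
As a quick numerical check (a sketch using PyTorch's built-in cosine_similarity; the character choice is arbitrary):

import torch
import torch.nn.functional as F

# One-hot vectors for two different characters, e.g. "我" (index 0) and "去" (index 9).
x = F.one_hot(torch.tensor(0), num_classes=10).float()
y = F.one_hot(torch.tensor(9), num_classes=10).float()

# Distinct one-hot vectors are orthogonal, so their cosine similarity is 0.
print(F.cosine_similarity(x, y, dim=0))  # tensor(0.)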

In order to solve these two problems, we introduce the Embedding layer.

3. Two functions of Embedding - dimensionality reduction and dimensionality enhancement

(1) Dimensionality reduction

First look at a matrix multiplication:

We have a 2 x 6 matrix, and after multiplying by a 6 x 3 matrix, it becomes a 2 x 3 matrix.

Setting aside what the numbers mean, in this step we turned a matrix with 12 elements into a matrix with 6 elements; intuitively, the size was cut in half. To some extent, this is what the embedding layer does: it reduces dimensionality, and the mechanism behind the reduction is matrix multiplication. In a convolutional network it can be understood as a special fully connected layer, similar in spirit to a 1x1 convolution kernel.

In other words, given the 1,000,000 × 100,000 matrix from before, multiplying it by a 100,000 × 20 matrix reduces it to 1,000,000 × 20, instantly shrinking it to 1/5000 of its original size!

This is one role of the embedding layer: dimensionality reduction. The 100,000 × 20 matrix in the middle can be understood as a lookup table, a mapping table, or a transition table.
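
A small sketch (variable names are our own) showing that multiplying a one-hot matrix by the lookup table is the same as simply selecting rows of that table, which is exactly what an embedding lookup does:

import torch
import torch.nn.functional as F

vocab_size, embed_size = 10, 3
torch.manual_seed(0)

# The "lookup table": one dense embed_size-dimensional row per character.
W = torch.randn(vocab_size, embed_size)

# A sentence as indices, and the same sentence as a one-hot matrix.
indices = torch.tensor([0, 1, 7, 8, 4])
one_hot = F.one_hot(indices, num_classes=vocab_size).float()

# Multiplying by W reduces (5 x 10) down to (5 x 3) ...
reduced = one_hot @ W
# ... and it is identical to just indexing rows of the table.
print(torch.allclose(reduced, W[indices]))  # True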

(2) Dimensionality enhancement

Now look at another picture, a spot-the-difference puzzle:

In this picture, you are asked to find five differences... from 10 meters away! Excuse me? Whoever set this question had better come a couple of steps closer and repeat it.

Of course, at that distance it is impossible by eye. But put the picture one meter away and you may instantly notice that the heart on the clothes is different; move half a meter closer and you find that the upper left and upper right corners also differ; another 20 centimeters and you see that the ears are different; finally, at 10 centimeters from the screen, you spot the fifth difference, a small cloud just below the ears.

However, getting infinitely close does not mean better recognition. If you may only look from 1 centimeter away, all you can see is a patch of green or a patch of blue, and finding the five differences is out of the question.

Clearly, distance affects what we can observe. The same goes for data: the features contained in low-dimensional data tend to be very coarse and general. We need to keep zooming in and out, changing our receptive field, so that we get different vantage points on the picture and can pick out the details we are after.

This is the other role of Embedding. When low-dimensional data is lifted to a higher dimension, some features may be amplified, and general features may be teased apart. Moreover, the Embedding is learned and optimized all the time, so the whole zooming-in-and-out process gradually settles on a good observation point. For example: after moving toward and away from the screen a few times, I find that 45 cm is the best distance, the one from which I can spot the 5 differences in the shortest time.
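
Mechanically, the same lookup-table operation can just as well increase dimensionality: if the table is wider than the one-hot vector, each character is lifted into a larger space. A hedged sketch (the sizes are made up purely for illustration):

import torch
import torch.nn as nn

# 10 characters, each lifted into a 64-dimensional space (64 > 10),
# so the lookup here increases the dimensionality of each token.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=64)

indices = torch.tensor([0, 1, 7, 8, 4])
lifted = embedding(indices)
print(lifted.shape)  # torch.Size([5, 64])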

Recall why a deeper CNN tends to reach higher accuracy. The convolutional layers convolve and convolve, the pooling layers pool up and down, and the fully connected layers connect and connect, because we never know exactly when a useful feature will suddenly be learned. In any case, learning more is a good thing, so we let the machine convolve once more and connect a few more layers; the cross-entropy loss tells it how wrong it is, and gradient descent tells it how to correct itself, so given enough time it will learn. In theory, as long as the network is deep enough and has enough parameters, it can fit almost any feature. In short, it is like virtualizing a set of relationships with which to map the current data.

4. Implementation of Embedding

To address the second problem raised above, suppose we have one sentence, "the princess (公主) is very beautiful", and another, "the concubine (王妃) is very beautiful". With binary one-hot encoding, there is no way to see any similarity between "princess" and "concubine". But from the Chinese expressions we immediately sense that the two are closely related: the princess is the emperor's daughter and the concubine is the emperor's consort, so they are linked through "emperor"; the princess lives in the palace and so does the concubine, so they are linked through "palace"; the princess is a woman and so is the concubine, so they are linked through "female".

Having picked out the words "emperor", "palace", and "female", let us try to define the princess and the concubine in terms of them:

The princess is necessarily the emperor's daughter, so we set her similarity to "emperor" at 1.0. She lived in the palace from birth until she was 20 years old, so her similarity to "palace" is 0.25 (20 years out of an assumed 80-year lifespan). The princess is necessarily a woman, so her similarity to "female" is 1.0.

The concubine is the emperor's consort, related to him by marriage rather than blood, but still closely connected, so let us say her similarity to "emperor" is 0.6. The concubine has lived in the palace from the age of 20 until 80, so her similarity to "palace" is 0.75 (60 years out of 80). The concubine is necessarily a woman, so her similarity to "female" is 1.0.

So we can represent the two words, princess and concubine, like this:

                 emperor  palace  female
Princess (公主)  [  1.0     0.25    1.0 ]
Concubine (王妃) [  0.6     0.75    1.0 ]
In this way, we associate the words princess and concubine with the words (features) of emperor, palace, and female. We can think that:

Princess = 1.0*emperor + 0.25*palace + 1.0*female

Concubine = 0.6*emperor + 0.75*palace + 1.0*female
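
With these hand-crafted dense vectors, the two words are no longer orthogonal; their cosine similarity is clearly non-zero, unlike in the one-hot case. A quick check (the vectors are the ones from the table above; the variable names are ours):

import torch
import torch.nn.functional as F

princess = torch.tensor([1.0, 0.25, 1.0])   # emperor, palace, female
concubine = torch.tensor([0.6, 0.75, 1.0])

print(F.cosine_similarity(princess, concubine, dim=0))  # ≈ 0.90, no longer 0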

Or, put another way, let us assume that each character within a word carries equal weight (note: just an assumption, for ease of explanation):

           emperor  palace  female
公 (gōng)  [  0.5     0.125   0.5 ]
主 (zhǔ)   [  0.5     0.125   0.5 ]
王 (wáng)  [  0.3     0.375   0.5 ]
妃 (fēi)   [  0.3     0.375   0.5 ]
(each character of 公主 receives half of the princess vector, and each character of 王妃 half of the concubine vector)
In this way, we can characterize words, or even single characters, with three features. If we call "emperor" feature (1), "palace" feature (2), and "female" feature (3), we arrive at the implicit feature relationship between the princess and the concubine:

Concubine's feature (1) = Princess's feature (1) × 0.6; Concubine's feature (2) = Princess's feature (2) × 3; Concubine's feature (3) = Princess's feature (3) × 1

And so we have changed the representation of text from a sparse one-hot state to a dense state, turning mutually independent vectors into vectors with intrinsic relationships.

So what exactly does the Embedding layer do? Through a linear transformation (implemented as a fully connected layer in a CNN, also described as a table-lookup operation), it turns our sparse matrix into a dense matrix. This dense matrix represents each single character with N features (N = 3 in the example); on the surface it is just a one-to-one correspondence between rows of the dense matrix and individual characters, but in fact it also encodes a great deal of the intrinsic relationships between characters, words, and even sentences (such as the relationship we derived between the princess and the concubine). These relationships are expressed by the parameters learned by the embedding layer. The process of going from the sparse matrix to the dense matrix is called Embedding; many people also call it a table lookup, because the two representations are related by a one-to-one mapping.
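
To make the "fully connected layer / table lookup" equivalence concrete, here is a hedged sketch showing that a bias-free linear layer applied to one-hot vectors produces exactly the same dense vectors as an embedding lookup (the weights are tied by hand purely for the demonstration):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_size = 10, 3
embedding = nn.Embedding(vocab_size, embed_size)

# A fully connected layer (no bias) whose weight is the same table, transposed.
linear = nn.Linear(vocab_size, embed_size, bias=False)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.t())

indices = torch.tensor([2, 5, 9])
one_hot = F.one_hot(indices, num_classes=vocab_size).float()

# Feeding one-hot vectors through the linear layer gives the same dense rows
# as looking them up directly in the embedding table.
print(torch.allclose(linear(one_hot), embedding(indices)))  # True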

More importantly, these relationships are updated continuously during backpropagation, so after enough training iterations (epochs) they become relatively mature, that is, able to correctly express the semantics of the whole corpus and the relationships between sentences. This mature set of relationships is exactly the weight matrix of the Embedding layer.
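
A minimal training sketch (the task, labels, and dimensions are made up purely for illustration) showing that the embedding weights are ordinary parameters that receive gradients and get updated by the optimizer:

import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)
classifier = nn.Linear(3, 2)
optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(classifier.parameters()), lr=0.1
)

indices = torch.tensor([0, 1, 7, 8])   # a toy batch of character indices
labels = torch.tensor([0, 1, 0, 1])    # toy labels

before = embedding.weight.clone()
logits = classifier(embedding(indices))           # dense vectors flow through the model
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                   # gradients reach embedding.weight
optimizer.step()

print(torch.equal(before, embedding.weight))  # False: the lookup table was updated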

Embedding is one of the most important inventions in natural language processing (NLP): in one stroke it links vectors that used to be independent. What is that like? You are your father's son, your father is A's colleague, and B is A's son; it sounds like a distant chain of relations, until you realize that B is your desk mate. The Embedding layer is the tool that uncovers this kind of hidden connection.

In PyTorch, there is a dedicated Embedding layer that can be called directly:

nn.Embedding(vocab_size, embed_size)
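
A minimal usage sketch (the vocab_size and embed_size values here are arbitrary):

import torch
import torch.nn as nn

vocab_size, embed_size = 10, 3
embedding = nn.Embedding(vocab_size, embed_size)

# A batch of 2 sentences, each 5 character indices long.
batch = torch.tensor([[0, 1, 2, 3, 4],
                      [0, 1, 7, 8, 4]])

dense = embedding(batch)
print(dense.shape)  # torch.Size([2, 5, 3]): one dense vector per character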

Origin blog.csdn.net/qq_54708219/article/details/129331889