How does tf1.x use Embedding?

How to use Embedding?

Recently I needed to use Embedding for feature embedding, but I couldn't find a concrete explanation of how to use it anywhere online. After piecing things together I finally understand it, so I'm writing this article to summarize it and sort out the ins and outs.

Embedding can be described as a way of encoding discrete features.
When it comes to encoding discrete features, I believe most people first think of one-hot encoding, so let's start by reviewing one-hot encoding.

1. What is one-hot encoding?

I believe everyone is familiar with the MNIST dataset, a dataset for handwritten digit classification. It contains the ten digits 0 through 9, so there are 10 classes of labels, 0-9, corresponding to the digits 0-9.
With one-hot encoding, the labels become:

0: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
3: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
4: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
5: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
6: [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
7: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
8: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
9: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

This is what we usually do during training: we first one-hot encode the labels to make subsequent training, validation and testing easier.
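As a quick illustration, here is a minimal sketch (not tied to any particular training pipeline) of one-hot encoding MNIST labels with tf.one_hot in tf1.x:

import numpy as np
import tensorflow as tf

labels = np.array([3, 0, 7])                    # three example MNIST labels
one_hot_labels = tf.one_hot(labels, depth=10)   # shape (3, 10), a single 1 per row

with tf.Session() as sess:
    print(sess.run(one_hot_labels))
    # [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
    #  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]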

Why one-hot? Most algorithms compute with distances in a vector space. One-hot encoding keeps the values of an unordered categorical variable from acquiring an artificial ordering: every value gets its own dimension and sits at the same distance from the origin. In other words, one-hot encoding extends the values of a discrete feature into Euclidean space, with each value corresponding to a point in that space, which makes distance calculations between feature values more reasonable.

But one-hot encoding has problems. When the feature space is very large, for example when encoding every word in a dictionary of 100,000 (10W) words, we need a 10W*10W matrix to encode them all. The redundancy of this encoding is obviously far too high: most of the values are 0, and each vector carries very little information.

2. What is embedding?

This is where Embedding comes in. Isn't redundancy one-hot's biggest problem? Embedding is there to remove that redundancy (Embedding is mostly used for dimensionality reduction, though it can also be used to increase a feature's dimensionality). Suppose we have the 10W-row, 10W-column one-hot feature matrix, where each row represents one word in the dictionary. If we multiply this matrix by a matrix with 10W rows and 200 columns, the result is a matrix with 10W rows and 200 columns. Each row still represents one word, but now only 200 features are used to distinguish that word from all the others. The width of the matrix has shrunk by a factor of 100000/200 = 500.
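In terms of shapes, the idea looks like this; a rough numpy sketch (the 10W*200 matrix here is random, whereas in a real model it would be learned):

import numpy as np

vocab_size, embed_dim = 100000, 200
embedding_matrix = np.random.randn(vocab_size, embed_dim).astype(np.float32)

word_ids = np.array([7, 42, 99999])                  # three words out of the 10W-word dictionary
one_hot = np.zeros((len(word_ids), vocab_size), dtype=np.float32)
one_hot[np.arange(len(word_ids)), word_ids] = 1.0    # (3, 100000) one-hot matrix

embedded = one_hot @ embedding_matrix                # (3, 100000) x (100000, 200) -> (3, 200)
print(embedded.shape)                                # each word is now a 200-dim vector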

Let's take another concrete example.
Suppose the dictionary contains only 6 words: sun, orange, grape, wheel, banana, durian.
With one-hot encoding we need a feature matrix of 6*6 = 36 entries:

    sun:    [1, 0, 0, 0, 0, 0]
    orange: [0, 1, 0, 0, 0, 0]
    grape:  [0, 0, 1, 0, 0, 0]
    wheel:  [0, 0, 0, 1, 0, 0]
    banana: [0, 0, 0, 0, 1, 0]
    durian: [0, 0, 0, 0, 0, 1]

But we can distinguish all of them using just three features: is it a fruit? is it round? is it large?

            fruit  round  large
    sun:    [0, 1, 1]
    orange: [1, 1, 1]
    grape:  [1, 1, 0]
    wheel:  [0, 1, 0]
    banana: [1, 0, 0]
    durian: [1, 0, 1]

As you can see, a feature matrix of only 6*3 = 18 entries is enough to tell these six words apart. The reason is that in the one-hot feature matrix each row carries no meaning apart from the position of its single 1, whereas each feature in every row of the matrix above has a fixed meaning.

3. How to use Embedding in tf1.x?

Call the one-hot matrix above A and the three-feature matrix below it B. We can think of them as related by B = A * X.
This X is our embedding matrix, and its dimensions follow immediately: 6 rows and 3 columns, with each column representing a feature. In practice, however, the learned features are rarely as interpretable as in this example, so we usually make X a variable matrix whose values are obtained by training in a neural network.
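To make this concrete, here is a small numpy sketch using the 6-word example above (X is written out by hand here, whereas in a real model it would be a trainable variable):

import numpy as np

# A: one-hot matrix, one row per word (sun, orange, grape, wheel, banana, durian)
A = np.eye(6)

# X: embedding matrix, one row per word, columns = (fruit?, round?, large?)
X = np.array([[0, 1, 1],   # sun
              [1, 1, 1],   # orange
              [1, 1, 0],   # grape
              [0, 1, 0],   # wheel
              [1, 0, 0],   # banana
              [1, 0, 1]])  # durian

B = A @ X                    # (6, 6) x (6, 3) -> (6, 3)
print(np.array_equal(B, X))  # True: multiplying a one-hot row by X simply selects a row of X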

To work out the dimension of an embedding matrix X in general: if the one-hot matrix A has 10 columns (10 classes) and we want the embedded matrix B to have 4 columns, then from A * X = B, matrix multiplication tells us that X must have 10 rows and 4 columns (10*4).

Let’s take a simple neural network demo:

import tensorflow as tf
import tensorflow.contrib.slim as slim

mnist_dim = 784  # assumed here: 28 * 28, the length of a flattened MNIST image

def generator(x, y):
    # reuse the variables if the 'generator' scope has already been built once
    reuse = len([t for t in tf.global_variables() if t.name.startswith('generator')]) > 0
    with tf.variable_scope('generator', reuse=reuse):
        # trainable embedding dictionary: 10 label classes, each mapped to an 8-dim vector
        embedding_dict = tf.get_variable(name="embedding_1", shape=(10, 8), dtype=tf.float32)
        # look up the embedding vector for each integer label in y
        y = tf.nn.embedding_lookup(embedding_dict, y)
        y = slim.flatten(y)
        # concatenate the input vector with the label embedding
        x = tf.concat([x, y], 1)
        x = slim.fully_connected(x, 32, activation_fn=tf.nn.relu)
        x = slim.fully_connected(x, 128, activation_fn=tf.nn.relu)
        x = slim.fully_connected(x, mnist_dim, activation_fn=tf.nn.sigmoid)
    return x

This is a simple generator from a generative adversarial network. It takes two inputs, x and y, where y is the MNIST label (0-9), so the label feature has 10 classes. We want to embed it into a vector of length 8, so we create an embedding dictionary matrix whose values are variables that need to be learned.
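For reference, here is a hypothetical way this generator might be wired up in a conditional GAN (the placeholder names and shapes are assumptions, not part of the original code):

# z: a 100-dim noise vector per sample; y: the integer MNIST label (0-9) per sample
z = tf.placeholder(tf.float32, shape=[None, 100])
y = tf.placeholder(tf.int32, shape=[None])
fake_images = generator(z, y)   # shape (batch, mnist_dim)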

Then we encode the feature by calling tf.nn.embedding_lookup(). It takes two arguments: the embedding dictionary matrix we just created, and the feature we want to encode.
The function tf.nn.embedding_lookup() is essentially equivalent to first one-hot encoding the feature and then multiplying the one-hot feature matrix by the dictionary matrix with matmul (the A * X = B example discussed in detail above).
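This equivalence is easy to verify directly; a small sketch assuming, as above, 10 label classes and an 8-dim embedding:

import numpy as np
import tensorflow as tf

ids = tf.constant([2, 5, 5, 9])                        # labels to embed
E = tf.get_variable("embedding_demo", shape=(10, 8), dtype=tf.float32)

looked_up = tf.nn.embedding_lookup(E, ids)              # direct lookup
via_matmul = tf.matmul(tf.one_hot(ids, depth=10), E)    # one-hot, then matrix multiplication

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    a, b = sess.run([looked_up, via_matmul])
    print(np.allclose(a, b))                            # True: both give the same (4, 8) result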

In fact, to put it bluntly, the embedding operation is just a matrix multiplication, the same thing a fully connected layer computes, so it can be replaced by a single Dense layer (called a fully connected, or FC, layer in computer vision), provided the layer has no bias and no activation.

The replacement is also very simple: one-hot encode the feature, which gives a vector whose length equals the number of classes, and feed it into an FC layer whose output dim equals the embedding length (no bias, no activation). After training, the output of that layer is the embedding vector, and its weight matrix is the embedding dictionary.

Taking the MNIST label embedding as an example: first one-hot encode the label, so each label becomes a vector of length 10; feed that 10-dim vector into an FC layer with dim = 8; the 8-dim output is the embedding of the label.
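A minimal sketch of that replacement (using slim.fully_connected with no bias and no activation, so the layer is a pure matrix multiplication whose weight matrix plays the role of the embedding dictionary):

import tensorflow as tf
import tensorflow.contrib.slim as slim

labels = tf.constant([3, 0, 7])
one_hot_labels = tf.one_hot(labels, depth=10)   # (3, 10): one length-10 vector per label

# dim=8 FC layer, no bias, no activation: output = one_hot_labels x W, with W of shape (10, 8)
label_embedding = slim.fully_connected(one_hot_labels, 8,
                                       activation_fn=None,
                                       biases_initializer=None)
# label_embedding has shape (3, 8); after training, W is exactly the embedding dictionary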

Coding is not easy. If it helps you, please like and follow!

Origin blog.csdn.net/weixin_43669978/article/details/122738768