The role of the embedding layer

Feature extraction for text

1, the most straightforward way is to represent each distinct word with a sequence number and use those numbers as features

     For example, take the ten characters 形 容 女 人 用 漂 亮 孩 可 爱 (from a sentence describing a woman as beautiful and a girl as lovely) and number them:

     形→0  容→1  女→2  人→3  用→4  漂→5  亮→6  孩→7  可→8  爱→9

    The feature of the original text is then: 0 1 2 3 4 5 6 7 8 9

Represented this way, the words are just arbitrary numbers rather than values along different dimensions, so no meaningful computation can be done on the features. That is why one-hot notation was introduced.
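A minimal sketch of this index encoding in Python (the variable names are only illustrative):

sentence = "形容女人用漂亮孩可爱"                                  # the ten characters of the example
vocab = {ch: i for i, ch in enumerate(dict.fromkeys(sentence))}    # character -> index
encoded = [vocab[ch] for ch in sentence]
print(encoded)                                                     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]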

2, use a one-dimensional (one-hot) vector to represent each word; a sentence then becomes a two-dimensional sparse matrix

    For example, the same ten characters as above:

  

形 - 1000000000
容 - 0100000000
女 - 0010000000
人 - 0001000000
用 - 0000100000
漂 - 0000010000
亮 - 0000001000
孩 - 0000000100
可 - 0000000010
爱 - 0000000001
then the original sentence is represented as
1000000000 
0100000000 
0010000000 
0001000000 
0000100000 
0000010000 
0000001000 
0000000100 
0000000010 
0000000001

Compared with the representation above, this one has the advantage that computation is simple: sparse matrices can be multiplied and added position by position directly. Its disadvantage is that the matrix is sparse, so most of the entries are 0, which wastes both storage and compute. This is what leads to the embedding layer.
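A self-contained sketch of this one-hot matrix (numpy assumed; the index sequence is the one from point 1):

import numpy as np

encoded = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]     # index sequence from point 1
X = np.eye(10)[encoded]                      # each index becomes a row with a single 1
print(X)                                     # the 10 x 10 sparse 0/1 matrix above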

3, map the one-hot sparse matrix to a matrix whose dimension is smaller than the total number of features; this is the role of the embedding layer

  Take the sparse matrix from point 2:

X = [
1000000000
0100000000
0010000000
0001000000
0000100000
0000010000
0000001000
0000000100
0000000010
0000000001
]
multiplied by the weight matrix
w=[
w10   w11   
w20   w21
w30   w31
w40   w41
w50   w51
w60   w61
w70   w71
w80   w81
w90   w91
w100 w101
]

X * w = [
w10   w11   
w20   w21
w30   w31
w40   w41
w50   w51
w60   w61
w70   w71
w80   w81
w90   w91
w100 w101
]

 In other words, the 10 × 10 matrix X, multiplied by a 10 × 2 matrix w, becomes a 10 × 2 matrix, shrinking the feature size by a factor of 10/2 = 5. (Note: w could also be set to 10 × 20 here, in which case the layer raises the dimension instead; an embedding layer can either reduce or increase the dimensionality.)
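A numpy sketch (numpy assumed) of why this works: multiplying a one-hot matrix by w simply selects rows of w, so the embedding layer is effectively a lookup table.

import numpy as np

encoded = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
X = np.eye(10)[encoded]                              # 10 x 10 one-hot matrix
w = np.random.default_rng(0).normal(size=(10, 2))    # 10 x 2 weight matrix
embedded = X @ w                                     # 10 x 2 result
assert np.allclose(embedded, w[encoded])             # same as just looking up rows of w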

You can see from the above that embedding reduces the dimensionality. But does this dimensionality reduction lose information, and what is its real significance? The following text example gives some intuition.

4, word embedding

Suppose the vocabulary has 1000 words. The first word is [1,0,0,0,0,...]; each remaining word is likewise a 1000-dimensional vector with a 1 in one position and 0 everywhere else, i.e. one-hot encoding.

From the way one-hot encoding is built, different words simply have their 1 placed at different positions (assigned at random or in some order), with the remaining positions set to 0. In other words, there is no relationship at all between different words, which does not match reality. For example:

Semantics: girl and woman are used for different age groups, but both mean a female; likewise man and boy are used at different ages, but both refer to a male.

Number: word and words differ only in being singular or plural.

Tense: buy and bought both express "buy", just occurring at different times.

We would like to describe a word with dimensions such as "semantics", "number", "tense" and so on, where each dimension is not 0 or 1 but a continuous real number expressing a degree. This is the idea behind the distributed representation.
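A purely illustrative sketch of such continuous dimensions (the numbers below are made up for illustration, not learned values):

# hypothetical dimensions: (femaleness, age, plurality)
words = {
    "girl":  [0.9, 0.2, 0.0],
    "woman": [0.9, 0.8, 0.0],
    "boy":   [0.1, 0.2, 0.0],
    "man":   [0.1, 0.8, 0.0],
}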

Neural network analysis

Suppose our vocabulary contains only four words: girl, woman, boy, man. Consider what the difference is between the following two ways of representing them.

One-hot representation

Although we know how these words relate to each other, the computer does not. In the input layer of a neural network each word is treated as one node, and training the network means learning the weight on every connection. Looking at the weights of the first layer (here with 3 nodes in the next layer), they are determined by 4 * 3 connecting lines. Since the dimensions are independent of one another, training examples containing girl do not help train any of the other words, so the amount of training data required stays essentially fixed.

 

 

 

Distributed representation

Here we manually work out the relationship among these four words. Two nodes are enough to represent all four, with each node's values given a fixed meaning (one node for gender, one for age). Girl can then be encoded as the vector [0, 1] and man as [1, 1] (the first dimension being gender and the second being age).

 

 

 

The first-layer weights the network now has to learn are reduced to 2 * 3. Moreover, when girl appears in the training data it is fed to the input through these two nodes, so other words connected to the same nodes are trained along with it (for example, the female node is shared with woman, and the age node is shared with boy).
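A rough sketch of that parameter-count difference (assuming PyTorch and a 3-node next layer, as in the toy example above):

import torch.nn as nn

# first layer fed with 4-dimensional one-hot vectors: 4 * 3 = 12 weights
one_hot_layer = nn.Linear(4, 3, bias=False)
# first layer fed with the hand-made 2-dimensional representation: 2 * 3 = 6 weights
distributed_layer = nn.Linear(2, 3, bias=False)
print(one_hot_layer.weight.numel(), distributed_layer.weight.numel())   # 12 6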

 

 

 

Word embedding, then, is about letting a neural network achieve the effect of this second representation, so as to reduce the amount of training data needed.

Splitting the four words above across two nodes was determined by prior knowledge that we supplied by hand: the original input space is projected onto another space of smaller dimension, which is what reduces the amount of training data required. But for a real vocabulary we cannot supply this by hand, and the goal of machine learning is precisely to let the machine, rather than a human, discover such patterns.

Word embedding means learning, automatically from data, the mapping from the one-hot input space to the distributed-representation space.
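In a framework such as PyTorch (assumed here) this learned mapping is exactly what an embedding layer provides; a minimal sketch with the 1000-word vocabulary and 2 dimensions used above:

import torch
import torch.nn as nn

# a trainable lookup table of shape vocab_size x embedding_dim
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=2)
word_ids = torch.tensor([0, 7, 42])       # three arbitrary word indices
vectors = embedding(word_ids)             # equivalent to one-hot @ weight, without storing the zeros
print(vectors.shape)                      # torch.Size([3, 2])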

5, so how is the embedding trained?

The question is how to automatically find relationships like the ones above and transform the one-hot representation into a distributed representation. We do not know in advance what the target is, so this is an unsupervised learning task.

A common line of thought in unsupervised learning: we have the data x, but we do not know the target (output) y.

    • Direction 1: look for relationships among the inputs {x_i} themselves, as in clustering.
    • Direction 2: attach another task g that takes the desired output y as its new input, and whose own output z we do know. Training on the data (x, z) yields the composition z = g(f(x)), and the intermediate representation y = f(x) is the target we actually want. Generative adversarial networks are an example.

 

Word embedding leans more toward direction 2. It likewise learns a mapping g(f(x)), but after training we do not use the whole mapping; we keep only its first half, f.

At this point, what we are looking for is a task that both comes with known labels z and makes the representation y = f(x) obtained through the transformation exhibit the properties demonstrated in the distributed-representation example.

At the same time we also know that:

the meaning of a word has to be understood in its specific context;

so words that share the same context tend to be related.

Example: the two words below are both names of dog breeds, and the context already implies that the word refers to something cute that licks people.

  • 这个可爱的 泰迪 舔了我的脸。 (This cute 泰迪 licked my face.)
  • 这个可爱的 金巴 舔了我的脸。 (This cute 金巴 licked my face.)

From this example we can find exactly such a task g: predicting the context.

Use the input word x as the center word to predict the likelihood that other words appear around it; the surrounding words serve as the labels z.

This way we know the corresponding labels z, and at the same time the task makes the representation y = f(x) obtained through the transformation exhibit the distributed-representation properties shown earlier. Because we force similar words (such as 泰迪 and 金巴) to produce the same output (the same context), the inputs for 泰迪 and 金巴, after passing through the network, must yield almost the same output, and therefore their internal representations end up almost the same.
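A minimal skip-gram-style sketch of this f / g split (assuming PyTorch; the sizes and indices are arbitrary, and real word2vec training uses tricks such as negative sampling instead of a full softmax):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 1000, 50
f = nn.Embedding(vocab_size, embed_dim)     # x -> y, the part we keep after training
g = nn.Linear(embed_dim, vocab_size)        # y -> scores over context words, discarded later

center = torch.tensor([3])                  # center word x (arbitrary index)
context = torch.tensor([17])                # observed surrounding word z
loss = F.cross_entropy(g(f(center)), context)   # train g(f(x)) to predict z
loss.backward()                             # gradients update both g and f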


Origin www.cnblogs.com/pyclq/p/12405631.html