[NLP] Adaptive Softmax

1. Overview

The adaptive softmax algorithm was proposed in the paper linked in the references [1]. It is designed to improve the computational efficiency of the softmax function for neural networks with very large vocabularies.

Most NLP tasks use a softmax output layer, but for tasks with a very large vocabulary, computing the full softmax every time is very expensive: predicting each token takes O(|V|) time.
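
To make the O(|V|) cost concrete, here is a minimal NumPy sketch of a plain softmax output layer; the sizes are arbitrary illustrative values, not settings from the paper.

```python
import numpy as np

B, d, V = 64, 512, 250_000        # batch size, hidden size, vocabulary size (illustrative)
h = np.random.randn(B, d)         # hidden states from the network
W = np.random.randn(d, V)         # one output vector per vocabulary word

logits = h @ W                                 # (B, V): cost grows linearly with |V|
logits -= logits.max(axis=1, keepdims=True)    # numerical stability
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)      # normalization also touches all |V| entries
```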

The paper therefore proposes adaptive softmax to improve the computational efficiency of the softmax.

1) The algorithm exploits the unbalanced word distribution to form clusters, so that the softmax computation no longer depends linearly on the vocabulary size, which reduces computation time;

2) It takes into account the characteristics of modern architectures and of matrix-matrix multiplication, making the computation better suited to GPUs and yielding further acceleration.

2. Introduction

2.1 Two general ways to reduce the computational complexity of softmax

1) Approximate the original distribution: either approximate the original probability distribution itself, or approximate it over a subset of the vocabulary.

2) Approximate the model structure but produce an exact probability distribution, for example hierarchical softmax.

(Roughly, the two families of methods are: exact models that produce an approximate probability distribution, and approximate models that produce an exact probability distribution.)

2.2 Contributions of the paper

The paper mainly follows approach (2) above, drawing on hierarchical softmax and some of its variants. Unlike previous work, it takes the characteristics of GPU computation into account. The main contributions:

1. It defines a strategy for building hierarchy-like models that takes into account the computation time of matrix-matrix products, which is not a simple linear function of the matrix dimensions.

2. It performs an empirical analysis of this computation model on recent GPUs, and the proposed optimization algorithm includes a model of the actual computation time in its objective.

3. Compared with the full softmax, it achieves a 2x to 10x speedup, which is equivalent to improving accuracy under a fixed computation budget. Just as importantly, unlike many competing approximation methods, this gain in efficiency comes with no loss of accuracy for a given amount of training data.

3. Adaptive Softmax 

3.1 Modeling the computation time of matrix multiplication

Hidden states: $(B \times d)$; word representations: $(d \times k)$; time to compute the product of the two matrices: $g(k, B)$, where $B$ is the batch size, $d$ is the hidden-layer size, and $k$ is the number of word vectors.

(1) Fix $B$ and $d$, and explore how $g(k)$ depends on $k$:

Experiments on two GPU models (K40 and M40) show that $g(k)$ is roughly constant for $k$ up to about $k_0 \approx 50$, and grows linearly with $k$ for $k > k_0$. This is the behavior captured by the computation model in Eq. (7) below.

The computation time as a function of $B$ shows the same behavior. In other words, matrix multiplication is inefficient when one of its dimensions is very small.

How to understand this? For example, take $k_1 = 2^2$ and $k_2 = 2^4$: both products complete in the same constant time, so compute is clearly being wasted for a matrix on the order of $k_1$.

This also shows that a hierarchy over the words in which each node has only a small number of children (such as a Huffman-coded tree) is suboptimal.

(2) Explore how $g(B)$ depends on the batch size $B$:

Similarly, when exploring the relationship between computation time and the batch size $B$, the paper finds that matrix multiplication is inefficient when either dimension is small. Two consequences follow:

(i) In a hierarchy where each node has only a few children (such as a Huffman tree), computational efficiency is suboptimal;

(ii) When clusters are formed by word frequency, a cluster containing only rare words is selected with a small probability $p$, so its effective batch size shrinks to $pB$, and the same matrix-product inefficiency appears.

(3) The computation model used in the paper

Therefore, taking both $k$ and $B$ into account, the paper proposes the following computation model:

$$g(k, B) = \max(c + \lambda k_0 B_0,\ c + \lambda k B) \tag{7}$$
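
As a rough illustration, this cost model can be written as a small Python helper. The constants $c$, $\lambda$, $k_0$, $B_0$ below are arbitrary placeholders, not the values measured in the paper.

```python
# Sketch of the computation-time model g(k, B) of Eq. (7).
# c, lam, k0, B0 are hardware-dependent constants; the numbers here are
# arbitrary placeholders, not measurements from the paper.
C_CONST, LAM, K0, B0 = 1.0, 0.01, 50, 64

def g(k, B):
    """Modeled time of multiplying a (B x d) matrix by a (d x k) matrix."""
    return max(C_CONST + LAM * K0 * B0, C_CONST + LAM * k * B)

# Below k*B ~ k0*B0 the cost is flat: small matrices waste GPU capacity.
print(g(4, 64), g(16, 64), g(64, 64), g(1024, 64))
```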
The core idea of adaptive softmax is to partition the vocabulary into clusters by word frequency, giving high-frequency words priority access: each low-frequency cluster is first treated as a single unit in one softmax, and a second softmax is then run over the words inside the selected cluster.

Following the paper, the model is first explained with two clusters as an example and then extended to the general case of multiple clusters.


3.2 Two-Cluster Case

According to Zipf's law, on the Penn TreeBank 20% of the vocabulary covers 87% of the word occurrences in the documents.

Intuitively, the dictionary $V$ can be split into two parts, $V_h$ and $V_t$, where $V_h$ is the set of frequent (head) words and $V_t$ is the set of infrequent (tail) words. In general, $|V_h| \ll |V_t|$ and $P(V_h) \gg P(V_t)$.

(1) Organization of the clusters

Intuitively, there are two ways to organize these two clusters: (i) both clusters are placed in a two-layer tree structure; (ii) the head is kept in a short list at the first layer and only the tail is stored in the second layer. To choose between them, compare their computational efficiency and accuracy:

In terms of accuracy, (i) is generally 5 to 10% worse than (ii), for the following reason:

The probability of a word $w$ is computed as follows:

With (i): $P(c \mid h) \cdot P(w \mid c, h)$ for every word, since each word must first pass through its cluster $c$.

With (ii): for the high-frequency words in the head, $P(w \mid h)$ is computed directly, which is simpler and avoids the extra factor.

Therefore, unless there is a very large difference in computation time between (i) and (ii), organization (ii) is chosen.

(2) Reducing computation time

 

Figure 2. Schematic of the two-cluster case.

In Figure 2, $k_h = |V_h|$, $k_t = k - k_h$, and $p_t = 1 - p_h$.

(i) First layer: for an input batch of size $B$, a softmax is computed over the $k_h$ vectors of the high-frequency head words plus one vector representing the whole tail cluster (the shaded part), i.e. $k_h + 1$ vectors in total.

The fraction $p_h$ of words that fall in the head can thus be finalized directly by looking them up in the head's short list;

If instead the softmax assigns the largest value to the shaded cluster vector, the word to be predicted is a low-frequency word and the second layer must be consulted.

Computation cost of the first layer: $g(k_h + 1, B)$

(ii) Once the $p_h \times B$ inputs resolved by the short list have been determined, the remaining $p_t B$ inputs must continue with a softmax over the tail to determine the predicted word.

The second layer computes a softmax over $k_t$ vectors;

Computation cost of the second layer: $g(k_t, p_t B)$

In summary, the total computational cost is: $$C = g(k_h + 1, B) + g(k_t, p_t B)$$

Compared with the full softmax, whose normalization term requires computing a score for every word in the dictionary, adaptive softmax splits the computation into two parts, each involving only a subset of the word vectors, so the computation time is reduced.

Figure 2 of the paper shows that, with a reasonable choice of $k_h$, a speedup of up to 5x over the full softmax can be achieved.
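
The effect of the choice of $k_h$ can be explored with a small script built on the cost model above. The Zipf-like frequencies and constants below are synthetic, used only to illustrate the search, not data or results from the paper.

```python
import numpy as np

# Same placeholder cost model as in the Section 3.1 sketch.
C_CONST, LAM, K0, B0 = 1.0, 0.01, 50, 64
def g(k, B):
    return max(C_CONST + LAM * K0 * B0, C_CONST + LAM * k * B)

# Synthetic Zipf-like unigram distribution (illustrative only, not the paper's data).
V, B = 100_000, 512
freq = 1.0 / np.arange(1, V + 1)
p = freq / freq.sum()                          # p[i] = probability of the i-th most frequent word

full_cost = g(V, B)                            # baseline: full softmax

def two_cluster_cost(kh):
    p_t = p[kh:].sum()                         # probability mass of the tail
    return g(kh + 1, B) + g(V - kh, p_t * B)   # C = g(k_h + 1, B) + g(k_t, p_t B)

kh_best = min(range(500, V, 500), key=two_cluster_cost)
print(f"best k_h ~ {kh_best}, modeled speedup ~ {full_cost / two_cluster_cost(kh_best):.1f}x")
```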

(3) Different capacities for different clusters

Since the clusters are essentially computed independently, they do not need to have the same capacity.

In general, more capacity can be given to the cluster of high-frequency words, and the capacity of the low-frequency clusters can be reduced accordingly. Because low-frequency words rarely appear in the documents, reducing their capacity does not hurt the overall performance much.

In the paper, the clusters share the hidden layer, and the size of the classifier's input is reduced by adding a projection matrix. Typically, the tail projection shrinks the representation from $d$ dimensions to $d_t = d/4$ dimensions.
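
For reference, PyTorch ships an implementation of this scheme, `torch.nn.AdaptiveLogSoftmaxWithLoss`, whose `div_value` argument controls exactly this capacity reduction (its default of 4.0 matches the $d/4$ rule of thumb). A minimal usage sketch with made-up sizes and cutoffs:

```python
import torch
import torch.nn as nn

d, V = 512, 100_000
# cutoffs give the right boundaries of the head and of each tail cluster
# (the values here are arbitrary, chosen only for illustration).
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=d, n_classes=V,
                                    cutoffs=[2_000, 10_000, 50_000],
                                    div_value=4.0)   # deeper clusters use d/4, d/16, ...

hidden = torch.randn(32, d)               # hidden states for a batch of 32 tokens
targets = torch.randint(0, V, (32,))      # gold next-word indices
out = asm(hidden, targets)                # namedtuple with fields `output` and `loss`
print(out.loss)                           # mean negative log-likelihood over the batch
log_probs = asm.log_prob(hidden)          # full (32, V) log-probabilities if needed
```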

3.3 General case

Section 3.2 used two clusters as an example to describe how adaptive softmax is organized and computed; here it is extended to the general case where the vocabulary is divided into multiple clusters.

Suppose the whole dictionary is split into a head part of frequent words and $J$ tail clusters; then:

$$V = V_h \cup V_1 \cup \dots \cup V_J, \qquad V_i \cap V_j = \emptyset \ \text{ for } i \neq j$$

Here $V_h$ is in the first layer and the remaining clusters are in the second layer, as shown in Figure 3.

The number of word vectors in cluster $i$ is $k_i = |V_i|$, and the probability that the word $w$ belongs to cluster $i$ is $p_i = \sum_{w \in V_i} p(w)$.

Then:

The computation cost of the head is: $C_h = g(J + k_h, B)$

The computation cost of each second-layer cluster is: $\forall i,\ C_i = g(k_i, p_i B)$

So the total time cost is: $$C = g(J + k_h, B) + \sum_i g(k_i, p_i B) \tag{8}$$
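
Equation (8) transcribes directly into code. The sketch below reuses the placeholder cost model from Section 3.1, and the partition in the example is arbitrary, not an optimized one:

```python
import numpy as np

# Same placeholder cost model as in the Section 3.1 sketch.
C_CONST, LAM, K0, B0 = 1.0, 0.01, 50, 64
def g(k, B):
    return max(C_CONST + LAM * K0 * B0, C_CONST + LAM * k * B)

def adaptive_cost(k_h, tail_sizes, p, B):
    """Total cost of Eq. (8): head over k_h words + J cluster units, then each tail cluster."""
    J = len(tail_sizes)
    cost = g(J + k_h, B)
    start = k_h
    for k_i in tail_sizes:
        p_i = p[start:start + k_i].sum()      # probability mass of cluster i
        cost += g(k_i, p_i * B)               # effective batch size is p_i * B
        start += k_i
    return cost

# Arbitrary partition of a 100k synthetic Zipf vocabulary into a head and three tail clusters.
V, B = 100_000, 512
p = 1.0 / np.arange(1, V + 1); p /= p.sum()
print(adaptive_cost(3_000, [7_000, 20_000, 70_000], p, B), "vs full softmax:", g(V, B))
```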

 

Recalling the formula (7):

$$g(k,B) = \max(c + \lambda k_0 B_0,\ c + \lambda k B) \tag{7}$$

It has two parts: a constant part $c + \lambda k_0 B_0$ and a linear part $c + \lambda k B$.

From Section 3.1, in the constant regime the GPU's compute capacity is underutilized, so falling into that regime should be avoided. This requires:

$kB \geq k_0 B_0$, so that the max in (7) is attained by the second (linear) part.

Substituting the linear part of (7) into (8), assuming every product satisfies $kB \geq k_0 B_0$, gives:

$$C = (J + 1)c + \lambda B \left[ J + k_h + \sum_i p_i k_i \right] \tag{10}$$

The goal is then to minimize the time cost $C$ in (10).

In (10), $J$ and $B$ are fixed, so we can focus on $\sum_i p_i k_i$ and $k_h$.

(1) $\sum_i p_i k_i$

Suppose $p_{i+j} = p_i + p_j$; then $p_j k_j = (p_{i+j} - p_i) k_j$, so

$$p_i k_i + p_j k_j = p_i (k_i - k_j) + p_{i+j} k_j \tag{11}$$

Suppose $k_i > k_j$ and that $p_{i+j}$ and $k_j$ are fixed. Then the only free variable in (11) is $p_i$, and (11) decreases as $p_i$ decreases. In other words, with $k_i > k_j$, $p_i$ should be as small as possible: the cluster containing more words should be assigned the smaller probability mass.
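
As a quick numerical check (with invented numbers): take $p_{i+j} = 0.2$, $k_i = 3000$, $k_j = 1000$. With $p_i = 0.15$ the cost term is $p_i k_i + p_j k_j = 0.15 \cdot 3000 + 0.05 \cdot 1000 = 500$; with $p_i = 0.05$ it drops to $0.05 \cdot 3000 + 0.15 \cdot 1000 = 300$. Assigning the smaller probability mass to the larger cluster is indeed cheaper.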

(2) $k_h$

Reducing $k_h$ also reduces (10), i.e. the head cluster of high-frequency words can be made to contain fewer words.

In summary, given the number of clusters $J$ and the batch size $B$, assigning smaller probability mass to the larger clusters reduces the total time cost $C$.


The paper also mentions a dynamic-programming method that, given $J$, determines how the sizes $k_i$ should be divided.
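
Below is a simplified sketch of that idea: a small dynamic program over a coarse grid of candidate boundaries that minimizes the modeled cost (8) for a fixed $J$. The grid, constants, and Zipf-like frequencies are invented for illustration, and the paper's actual procedure may differ in its details.

```python
import numpy as np
from functools import lru_cache

# Same placeholder cost model as in the Section 3.1 sketch.
C_CONST, LAM, K0, B0 = 1.0, 0.01, 50, 64
def g(k, B):
    return max(C_CONST + LAM * K0 * B0, C_CONST + LAM * k * B)

V, B, J = 100_000, 512, 4                        # vocab size, batch size, number of tail clusters
p = 1.0 / np.arange(1, V + 1); p /= p.sum()      # synthetic Zipf-like unigram distribution
cum = np.concatenate([[0.0], np.cumsum(p)])      # cum[i] = mass of the i most frequent words

grid = list(range(1_000, V, 1_000)) + [V]        # coarse candidate boundaries keep the DP small

@lru_cache(maxsize=None)
def best_tail(start, j):
    """Minimum modeled cost of splitting frequency-sorted words [start, V) into j clusters."""
    if j == 1:
        return g(V - start, (cum[V] - cum[start]) * B)
    return min((g(m - start, (cum[m] - cum[start]) * B) + best_tail(m, j - 1)
                for m in grid if start < m < V), default=float("inf"))

# Outer loop: try each head size k_h on the grid, split the rest with the DP.
k_h, total = min(((kh, g(J + kh, B) + best_tail(kh, J)) for kh in grid if kh < V),
                 key=lambda t: t[1])
print(f"k_h = {k_h}, modeled cost = {total:.0f}, full softmax = {g(V, B):.0f}")
```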

Choosing the number of clusters $J$: the paper experiments with different values of $J$ and measures the resulting computation time (Figure 4). Although values of $J$ in the range 10 to 15 give the best results, the improvement beyond $J > 5$ is not very significant, so the paper recommends using 2 to 5 clusters.

The main experiments are carried out on three datasets: Text8, Europarl, and One Billion Word. Comparing perplexity, adaptive softmax keeps perplexity low while running 2x to 10x faster than the original full softmax.

  

Reference links:

1. Efficient softmax approximation for GPUs: https://arxiv.org/pdf/1609.04309.pdf
