Can attention mechanisms explain a model's decisions? Here is what this ACL 2019 paper says...

Attention has recently driven progress on a range of NLP tasks. Because an attention layer computes an explicit weight for each input component, it is often assumed that these weights can be used to identify the information the model relies on (e.g., specific contextualized words). The researchers test this hypothesis by modifying the attention weights of already-trained text classification models and analyzing how the models' predictions change. They observe that, although components with high attention weights often have a larger effect on the prediction, there are also many cases where this does not hold. The researchers conclude that although attention weights relate to the importance of input components for the model as a whole, they are not a reliable way of indicating that importance.

Selected from arXiv. Authors: Sofia Serrano, Noah A. Smith. Compiled by Heart of the Machine.

In contrast to a previously covered article arguing that attention can improve model interpretability, this paper investigates whether attention is interpretable at the contextualized word level. Unfortunately, the authors conclude that the attention layer is not sufficient to explain what the model attends to.

Paper link: arxiv.org/abs/1906.03...

Interpretability is a pressing issue for many NLP models. As models become more complex and learn from data, it is important to ensure that we can understand why a model makes a particular decision.

Existing work on interpretability has only begun to assess what kind of information the computed attention weights actually convey. In this paper, the researchers use a different analysis method, based on erasing intermediate representations, to evaluate whether attention weights can be relied upon to explain the relative importance of the inputs to the attention layer itself. They find that attention weights are only noisy predictors of the importance of intermediate components and should not be treated as justification for a decision.

Test setup

The researchers focus on 5-class and 10-class text classification models that include attention, because interpretability for text classification is a domain that has already attracted considerable research attention (Yang et al., 2016; Ribeiro et al., 2016; Lei et al., 2016; Feng et al., 2018).

An interpretable model must not only provide plausible explanations, but also ensure that those explanations are the real reasons behind its decisions. Note that this analysis does not depend on ground-truth labels: if a model produces an incorrect output but gives a faithful account of which factors played an important role in its computation, we still consider the model interpretable.

Intermediate representation erasure

The researchers are interested in the effect of a subset of contextualized inputs I' ⊆ I on the output of the model's attention layer. To test the importance of I', they run the model twice (see Figure 1): once without any modification, and once with the attention weights of I' zeroed out and the remaining attention distribution renormalized, in the spirit of other erasure-based work. They then observe how the model's output is affected. Erasing at the attention layer isolates the attention layer from the encoder that precedes it. The reason for renormalizing is to avoid artificially shrinking the output document representation toward zero, a representation the model never encountered during training, which could make subsequent measurements unrepresentative of the model's behavior on inputs within its usual representation space.

Figure 1: The attention-zeroing method used in this paper to compute the importance of the corresponding representations; four output classes are assumed.
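To make the erasure step concrete, here is a minimal sketch (not the authors' code) of zeroing out selected attention weights, renormalizing the rest, and recomputing the attention-weighted representation; the array names and toy values are purely illustrative.

```python
import numpy as np

def erase_and_renormalize(attn, values, erase_idx):
    """Zero out the attention weights at erase_idx, renormalize the remaining
    weights so they sum to 1, and return the new attention-weighted output."""
    attn = np.array(attn, dtype=float)
    attn[list(erase_idx)] = 0.0
    total = attn.sum()
    if total == 0.0:
        raise ValueError("cannot renormalize: all attention weights were erased")
    return (attn / total) @ values  # weighted sum of the attended value vectors

# Toy usage: erase the single highest-attention item i*.
attn = np.array([0.05, 0.60, 0.25, 0.10])   # attention distribution over 4 items
values = np.random.randn(4, 8)              # toy contextualized representations
i_star = int(attn.argmax())
new_doc_repr = erase_and_renormalize(attn, values, [i_star])
```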

Data and models

The researchers explore four model architectures on one topic classification dataset (Yahoo Answers) and three review-rating datasets (IMDB, Yelp 2017, Amazon). Statistics for each dataset are shown in Table 1.

Table 1: Datasets used in the experiments.

The text classification architectures in this paper are inspired by the hierarchical attention network (HAN), which contains two attention layers: attention is first applied to the word tokens within each sentence, and then to the resulting sentence representations. The document representation is then fed to a final linear layer with a softmax for classification.

The researchers test the softmax formulation of attention used by most models, including HAN. Specifically, they use the additive formulation originally defined by Bahdanau et al. (2015).
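For reference, a generic additive (Bahdanau-style) attention over encoder states h_i has the following shape; the notation here is schematic rather than copied from the paper:

e_i = v^T tanh(W h_i + b),   α_i = exp(e_i) / Σ_j exp(e_j),   output = Σ_i α_i h_i

where W, b, and v are learned parameters, and the α_i are the attention weights that the erasure experiments manipulate.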

The importance of a single attention weight

At the start of the tests, the researchers explore relative importance when only one attention weight is removed. Let i^* ∈ I be the component with the highest attention, and α_{i^*} its attention weight. The researchers compare the importance of i^* with the importance of other attended items in two ways.

JS divergence of the model's output distributions

The researchers want to compare the effect of i^* on the model's output distribution with the effect of an attended item r ∈ I drawn uniformly at random. The first method computes two JS divergences: the JS divergence between the original output distribution and the output distribution after removing only i^*, and the JS divergence between the original output distribution and the output distribution after removing only r, and then compares them.

They subtract the JS divergence of the output after removing r from the JS divergence of the output after removing i^*:

ΔJS = JS(q, q without i^*) − JS(q, q without r)    (Eq. 1)

where q is the model's original output distribution and "q without x" denotes the output distribution after erasing item x at the attention layer.
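A minimal sketch of this comparison, assuming we already have the three output distributions as probability vectors (the numbers below are toy values):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors (in nats)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * (kl_pm + kl_qm)

# q_orig: original output distribution;
# q_no_istar / q_no_r: outputs after erasing i* or a random item r (toy values).
q_orig     = np.array([0.70, 0.10, 0.10, 0.10])
q_no_istar = np.array([0.40, 0.30, 0.15, 0.15])
q_no_r     = np.array([0.68, 0.11, 0.11, 0.10])

delta_js = js_divergence(q_orig, q_no_istar) - js_divergence(q_orig, q_no_r)
# Eq. 1: positive when erasing i* perturbs the output more than erasing r.
```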

Intuitively, if i^* really is the most important item, we would expect Eq. 1 to be positive, and this is true most of the time. However, Figure 3 shows that almost all ΔJS values are close to 0. Figure 4 shows that in the cases where the impact of i^* is small, there is little difference between the attention of i^* and that of r. This result is relatively encouraging: it suggests that in these cases, i^* and r are nearly "tied" in terms of attention.

Figure 3: Difference in attention weight magnitude vs. ΔJS for HANrnn.

Figure 4: Counts of HANrnn test examples in which the JS divergence for i^* is the smaller of the two.

However, once the magnitudes of the ΔJS values in Figure 3 are taken into account, the interpretability of attention becomes blurrier. The researchers note that across the datasets, even when the attention-weight difference is very large, such as 0.4, many positive ΔJS values are still very close to zero. Although ΔJS does eventually surge once Δα grows large enough, this happens only for extremely peaked attention distributions, so there is real doubt about how much more influence i^* has than r.

Decision flips caused by zeroing attention

Since attention weights are generally treated as an explanation of the model's argmax decision, the second test concerns another, more intuitive change in the model's output: the decision flip. For clarity, only the HANrnn results are discussed here; they reflect the same patterns observed in the other architectures.


Figure 9: Using the definition of i^* given above and comparing it against a different, randomly selected attended item, the figure shows, for each model on each of the four test sets, the percentage of test examples in each decision-flip indicator category. Since the random item cannot be i^* itself, all instances whose final sequence has length 1 are excluded from the analysis.

In most cases, erasing i^* does not change the model's decision (the "no" column in the figure). This may be related to how the classification signal is distributed across a document (for example, in the Yahoo Answers dataset, a "Sports" question may signal the topic through several different words, any one of which is enough to classify it correctly).
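A sketch of the decision-flip check itself, assuming a hypothetical model_fn that maps a document representation to a class-probability vector:

```python
import numpy as np

def decision_flipped(model_fn, attn, values, erase_idx):
    """Return True if erasing the attention weights at erase_idx (with
    renormalization) changes the model's argmax decision.
    model_fn maps a document representation to a class-probability vector;
    this interface is hypothetical, for illustration only."""
    original = int(np.argmax(model_fn(attn @ values)))
    attn = np.array(attn, dtype=float)
    attn[list(erase_idx)] = 0.0
    attn = attn / attn.sum()      # renormalize the remaining weights
    erased = int(np.argmax(model_fn(attn @ values)))
    return original != erased
```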

The importance of attention-layer weights

To address the interpretability of the attention layer as a whole, rather than of single weights, the researchers use a new test that studies how well multiple attention weights together perform as predictors of importance.

Table 2: Percentage of HANrnn test examples in each decision-flip indicator category.

Multi-weight test

For a hypothesized importance ordering, such as the one implied by the attention layer's weights, the researchers would like the top-ranked attended components to serve as a concise explanation of the model's decision. The less concise that explanation is, i.e., the further down the ranking the components that truly drive the model's decision appear, the less plausible the ranking is as an explanation of importance. In other words, the researchers expect that for an effective importance ranking, only a small fraction of the highest-ranked components should be needed to account for the information guiding the model's decision.

Specific importance-ranking methods

The researchers propose two specific importance orderings to compare against the attention-based ordering.

The first is a random importance ordering. The researchers expect this ordering to perform poorly, but it provides a baseline against which the ordering by descending attention weight can be compared.

The second ordering re-ranks the attention layer's components in descending order of the gradient of the model's decision function with respect to each attention weight (with a variant that multiplies this gradient by the attention weight's magnitude, the gradient-attention product ordering). Since each dataset has either 5 or 10 classes, the decision function's output is a vector of class scores over which the model takes its argmax. A sketch of the resulting removal test is shown below.
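Here is a sketch of the multi-weight removal test under a given importance ordering, with the candidate orderings noted in the comments; the helper names and the grad_wrt_attn array are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def fraction_removed_until_flip(model_fn, attn, values, ranking):
    """Erase attended components in the given importance order, renormalizing
    the remaining attention weights each time, until the argmax decision flips.
    Returns the fraction of components removed (1.0 if the decision never flips)."""
    attn = np.asarray(attn, dtype=float)
    original = int(np.argmax(model_fn(attn @ values)))
    erased = np.zeros(len(attn), dtype=bool)
    for idx in ranking:
        erased[idx] = True
        if erased.all():                     # nothing left to attend to
            return 1.0
        masked = np.where(erased, 0.0, attn)
        masked = masked / masked.sum()       # renormalize remaining weights
        if int(np.argmax(model_fn(masked @ values))) != original:
            return float(erased.sum()) / len(attn)
    return 1.0

# Candidate orderings to compare (illustrative; grad_wrt_attn is assumed given):
#   by attention weight:  np.argsort(-attn)
#   random baseline:      np.random.permutation(len(attn))
#   by |gradient|:        np.argsort(-np.abs(grad_wrt_attn))
#   gradient * attention: np.argsort(-(np.abs(grad_wrt_attn) * attn))
```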

Attention is not an ideal way of describing model decisions

Based on the analysis in Figure 5, the researchers find that ordering by attention weight is not ideal for the models that have an encoder. Although removing intermediate representations in descending order of attention weight usually flips the decision faster than removing them in random order, in many cases this ordering is less efficient at flipping the decision than the gradient ordering or the gradient-attention product ordering.

In addition, although the product ordering usually (but not always) needs to remove slightly fewer components than the gradient-based ordering, the researchers find that the pure gradient ordering, which ignores attention entirely, comes very close to it and performs better than the purely attention-based ordering. In 10 of the 16 models with encoders, more than 50% of the test cases reach a decision flip by removing a smaller set under the gradient ordering than under the attention ordering. The study also finds that in every model with an encoder, cases where the gradient-only ordering flips the decision faster than the attention-based ordering outnumber the counter-examples (where attention flips the decision faster) by a factor of 1.6 on the test sets.

Decision flips occur late

Across every ordering scheme and many of the models, the researchers run into the problem that a large fraction of the components must be removed before the decision flips. For the HANs this is not surprising, since their attention mechanisms compute attention over short sequences. For the FLANs the result is more unexpected: a FLAN typically computes attention over sequences of hundreds of tokens, so each individual attention weight is likely to be very small.

For the models studied, especially the FLANs (which compute attention over hundreds of tokens), this fact poses a problem for explainability. Lipton argues that a model is transparent "if a person can contemplate the entire model at once" (The Mythos of Model Interpretability, arXiv preprint arXiv:1606.03490). Under this interpretation, if an explanation of importance has to take the attention weights of hundreds of tokens into account, then even though each individual weight is very small, it still creates a serious transparency problem.

Figure 5: For three of the model architectures, the distribution of the fraction of items that must be removed, under the different ordering schemes, before the first decision flip occurs.

The effect of context scope on attention's interpretability

In machine translation, a previous study observed that over a complete sequence, a recurrent encoder can move a token's signal elsewhere, leading to counterintuitive results after the attention layer. The researchers hypothesize that in their text classification setup, bidirectional recurrent encoders such as those in HANrnn and FLANrnn may similarly redistribute signal away from some tokens into the representations of other parts of the input. Comparing the decision-flip results of FLANconv and FLANrnn in Figure 5 supports this theory. The researchers note that decisions flip far more slowly for the rnn-based model than for the convolutional one, indicating that the bidirectional recurrent network can effectively learn to redistribute the classification signal widely, whereas the convolutional encoder learns context only from the two tokens before and after each input token.

Comparing the results of the two HAN architectures shows the same situation, although it is less pronounced. This may be because a HAN draws context over shorter sequences (word representations within a sentence, and sentence representations), so the neighborhood context a token draws on is already a large part of the complete sequence.

The difference becomes even more obvious when comparing against the model architectures without an encoder, as shown in the figure. Compared with the other two architectures, decisions flip much sooner as the important parts of the input are erased. At the same time, the random ordering performs better than before, indicating that the decision boundaries are more fragile, especially on the Amazon dataset. This suggests that the gradient may matter more than attention here.

Conclusion

Attention mechanisms are often regarded as a tool for explaining models, but the researchers find that attention weights do not correspond closely enough to importance at the attention layer.

In some cases the two are related. For example, when comparing a high attention weight with a low one, the high-weight component usually has a larger impact on the model. However, once we take into account the instances in which even the highest attention weight fails to have a large impact, the picture looks bleaker.

From the multi-weight tests, the researchers find that attention weights often fail to identify the representations that play the most important role in the model's final decision. Moreover, even in the cases where the attention-based importance ordering flips the decision much faster than the other orderings, the number of components that must be zeroed out is usually too large for the ordering to be helpful as an explanation.

The researchers also note that the scope of contextualization before the attention layer affects how well attention describes the model's decisions. Although the attention layer is somewhat more effective in settings where context is drawn from a smaller part of the input, in other cases its poor performance as a basis for decisions remains a problem. The researchers conclude that, in their test setup, the attention layer is not an ideal tool for identifying which particular inputs caused a particular output. Attention may be interpretable in other ways, but not as an importance ranking: as an importance ranking, the attention layer does not explain the model's decisions.


Reproduced from: https://juejin.im/post/5d06fae26fb9a07eb67d8f49


Origin: blog.csdn.net/weixin_33722405/article/details/93169615