How can I trust you? Six common approaches to NLP model interpretability

1. Introduction



There are many ways to explain NLP models. Some of them dig into the model itself and trace how parameters and activations flow through it; we can call these "white-box methods". Others do not need access to the model's internals at all and instead use the relationship between inputs and outputs to "guess" the model's likely reasoning process; we can call these "black-box methods". The two families have different goals and complementary strengths and weaknesses. White-box methods aim to explain the model's "real" reasoning process, because they look at its internal structure and the way information propagates through it. However, many white-box methods are tied to a specific architecture, and none of them apply when the model's internal parameters are unavailable. Black-box methods never see the internal parameters, so they cannot explain the true reasoning process; they are better understood as an educated guess about that process and as a description of what the model can and cannot do. In exchange, black-box methods are faster to put into practice and can deliver value in business scenarios sooner. This article introduces three common white-box methods and three black-box methods. The white-box methods are neuron analysis, diagnostic classifiers, and attention. The black-box methods are adversarial datasets, rationales, and local surrogate models.





2. Neuron analysis



In 2015, Karpathy & Johnson of Stanford University compared different recurrent neural networks and visualized the values of individual LSTM memory cells character by character, with some interesting findings. Different memory cells in an LSTM serve different functions: some cells track whether the current position is inside quotation marks, and some track the length of the sentence. When the input is source code, some cells track whether the text is inside an if statement or a comment, and others track the depth of indentation. They also analyzed the switching behavior of the forget, input, and output gates: a gate value below 0.1 is considered "left-saturated", and a value above 0.9 is considered "right-saturated". They found that the forget gates include right-saturated gates that carry long-distance information, while the output gates show no consistently saturated gates, which is consistent with how the LSTM is designed to work.

In 2017, OpenAI researchers discovered a neuron in a recurrent neural network that is highly sensitive to sentiment. They showed that the value of this neuron is very important for sentiment classification, and that using this single neuron alone already achieves very good classification results. The figure below shows the value of the neuron at the character level, with green indicating positive sentiment and red indicating negative sentiment.

In my view, neuron visualization can objectively reflect part of the model's reasoning process, but the method has several problems. First, not all neurons can be visualized in a way that reveals a pattern; most neurons look irregular, so a human has to judge what information a neuron might represent, which does not generalize. Second, although visualization makes a neuron's function intuitive, it offers no quantitative metric for judging it. Third, the method carries a lot of uncertainty, which makes it hard to deliver value in business scenarios.
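To make "reading out memory cells character by character" concrete, here is a minimal PyTorch sketch. The embedding and LSTM cell are untrained and the inspected cell index is arbitrary; in the original work one would first train a character-level LSTM and then scan all cells for interpretable patterns.

```python
# A minimal sketch of character-level memory-cell inspection, in the spirit of
# Karpathy & Johnson (2015). The model is untrained and the cell index is a
# placeholder; a trained model is needed before any pattern becomes visible.
import torch
import torch.nn as nn

text = 'print("hello, world")  # a comment'
vocab = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(vocab)}

embed = nn.Embedding(len(vocab), 16)
cell = nn.LSTMCell(input_size=16, hidden_size=32)

h = torch.zeros(1, 32)
c = torch.zeros(1, 32)
cell_trace = []  # memory-cell vector after reading each character
with torch.no_grad():
    for ch in text:
        x = embed(torch.tensor([char_to_id[ch]]))
        h, c = cell(x, (h, c))
        cell_trace.append(c.squeeze(0).clone())

unit = 5  # hypothetical cell index to inspect
for ch, state in zip(text, cell_trace):
    print(f"{ch!r:>6}  {state[unit].item():+.3f}")
```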









3. Diagnostic classifiers



Sometimes it is hard to see what information a vector contains through visualization alone, so we can flip the question: if a vector contains certain information, then it should be able to support a task that requires that information. This leads to another commonly used method: take a vector from inside the model and train a simple model on top of it (usually a linear classifier) to predict some features (usually linguistic features). In this way, we can tell what each layer of the model has learned.

In 2019, Google researchers used this method to analyze BERT. They extracted the embedding output of every BERT layer and evaluated diagnostic classifiers on eight tasks. The experiments show that BERT carries out the steps of the traditional NLP pipeline (POS tagging, coreference resolution, and so on) across its Transformer layers, and that these pipeline steps can dynamically adjust themselves to achieve better performance. The figure below shows, for each task, the expected layer at which the required information becomes available and where the "center of gravity" of those layers lies. The results show that for low-level tasks (such as POS tagging and entity recognition) the required information is available in the lower layers, while complex tasks (such as coreference resolution) need information from the higher layers, which matches our expectations.

In my view, the diagnostic-classifier method is a relatively simple way to analyze what a model learns during training, and it is well suited to deep models. It can also inspire model optimization: Jo & Myaeng's ACL 2020 paper points out that when using BERT, combining some intermediate-layer embeddings with the last-layer embedding can achieve better results than using the last layer alone.
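A minimal probing sketch of this idea is shown below: freeze BERT, take one layer's token embeddings, and fit a linear classifier to predict a linguistic label. It uses the Hugging Face transformers API; the two sentences and their POS tags are toy stand-ins for a real annotated corpus, and the probed layer index is arbitrary.

```python
# A minimal "diagnostic classifier" (probing) sketch over one BERT layer.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["the cat sat", "a dog barked loudly"]
pos_tags = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB", "ADV"]]  # toy gold labels

layer = 4  # 0 = embedding output, 1-12 = Transformer layers of bert-base
X, y = [], []
with torch.no_grad():
    for sent, tags in zip(sentences, pos_tags):
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, 768)
        # word_ids() maps subword positions back to words so labels line up;
        # special tokens map to None and are skipped.
        for pos, word_id in enumerate(enc.word_ids()):
            if word_id is not None:
                X.append(hidden[pos].numpy())
                y.append(tags[word_id])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on its own training data:", probe.score(X, y))
```

In a real probe one would train on a held-out annotated corpus and compare the probe's accuracy layer by layer; the layer where accuracy saturates suggests where the relevant information first becomes available.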








4. Attention



Because of how attention works, there is a weight between every pair of words, and these weights may be able to express the model's reasoning process. In 2019, researchers from Stanford University and Facebook analyzed the attention heads in BERT and found that different heads extract different kinds of information, and that using individual heads directly on specific sub-tasks can also achieve relatively good results.

However, whether attention can represent the model's reasoning process remains controversial. Also in 2019, Jain & Wallace's paper "Attention is not Explanation" pointed out that attention-based and gradient-based analyses reach different conclusions, and that different attention weights can lead to the same prediction; they therefore argued that there is no direct relationship between attention and the output. Wiegreffe & Pinter's paper "Attention is not not Explanation" pushed back, arguing that the experimental design was insufficient to support that conclusion. Among the ACL 2020 papers, attention is still a hot topic. Abnar & Zuidema proposed a way to make attention more interpretable: since the information of different tokens gets mixed together after the first Transformer layer, they propose an attention "rollout" that unrolls this mixing across layers. Sen et al. proposed comparing the model's attention with human-annotated attention and introduced a way to compute the similarity between the two. Pruthi et al., however, argue that attention-based explanations are unreliable, because a model can be deliberately manipulated so that its attention weights look unbiased when in fact they are not.
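To make the rollout idea concrete, here is a simplified sketch of attention rollout in the spirit of Abnar & Zuidema, computed on a single arbitrary sentence with Hugging Face BERT. The per-layer recipe (average heads, add the identity for the residual connection, renormalize, multiply across layers) is a simplified reading of the method, not a reference implementation.

```python
# A simplified attention-rollout sketch: accumulate token mixing across layers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

enc = tokenizer("the movie was surprisingly good", return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions  # one (1, heads, seq, seq) tensor per layer

seq_len = attentions[0].shape[-1]
rollout = torch.eye(seq_len)
for layer_attn in attentions:
    a = layer_attn[0].mean(dim=0)           # average over heads -> (seq, seq)
    a = a + torch.eye(seq_len)              # account for the residual connection
    a = a / a.sum(dim=-1, keepdim=True)     # renormalize rows
    rollout = a @ rollout                   # accumulate mixing across layers

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, w in zip(tokens, rollout[0]):      # how much [CLS] draws from each token
    print(f"{tok:>14}  {w.item():.3f}")
```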







In my view, attention has an inherently interpretable quality and a natural advantage over other mechanisms, but how to actually use attention to produce explanations still needs more research. Moreover, this approach is tied to attention-based models themselves and does not apply to other models. In addition, how the interpretability of attention can be applied to model optimization also deserves more thought.


5. Rationales



Rationales can be thought of as the "supporting evidence" for a task, roughly a combination of "key words". In 2019, DeYoung et al. proposed a benchmark called ERASER (Evaluating Rationales And Simple English Reasoning). Their premise is that when people perform NLP tasks, they focus on certain "key words". Erasing some of the words in the input lowers the model's confidence on the task or even makes it fail, and the combination of words whose removal lowers the confidence the most can be taken as the "key words" from the model's point of view. They released datasets for seven tasks that include human-annotated "key words". By comparing the words the model treats as key with the words humans annotated as key, we can judge whether the model is doing the task with the information people consider essential. In the figures below, the first shows the human-annotated "key words" for four tasks, and the second shows how the model's "key words" are identified.
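The erase-and-measure idea can be sketched in a few lines. The tiny bag-of-words classifier and the sentences below are toy stand-ins, not the benchmark itself; the quantity computed at the end corresponds to what ERASER calls "comprehensiveness", the confidence drop after deleting a candidate rationale.

```python
# A minimal sketch of ERASER-style rationale evaluation with a toy classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great movie loved it", "terrible plot boring film",
               "wonderful acting throughout", "awful and dull"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative
clf = make_pipeline(CountVectorizer(), LogisticRegression()).fit(train_texts, train_labels)

text = "loved the wonderful acting but the plot was thin"
rationale = {"loved", "wonderful"}  # candidate "supporting evidence" tokens

full_prob = clf.predict_proba([text])[0]
pred = int(full_prob.argmax())

# Erase the rationale tokens and re-score the prediction
erased = " ".join(tok for tok in text.split() if tok not in rationale)
erased_prob = clf.predict_proba([erased])[0]

comprehensiveness = full_prob[pred] - erased_prob[pred]
print("prediction:", pred)
print("confidence drop after erasing the rationale:", round(float(comprehensiveness), 3))
```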

In my view, although this method cannot explain the model's reasoning process, it can uncover which words are key from the model's point of view, and it applies to any model. Its main cost is the human annotation of the "key words", but as an evaluation dataset it does not require a large amount of data and can be reused. For these reasons, the method is fairly practical for model quality assurance.


6. Adversarial datasets




By making small changes to the data and checking whether these perturbations disturb the model's output, we can analyze the model's reasoning process. For example, in 2019 Niven & Kao added negations ("not") to the dataset of a reasoning task for BERT, and BERT's accuracy dropped straight down to chance level. This shows that when BERT does this kind of reasoning, it treats "not" as a conspicuous clue; once that clue is changed, the model's ability collapses.

However, constructing adversarial datasets is difficult, and for some tasks it requires a great deal of linguistic knowledge, so the cost of building them is high. The ACL 2020 best paper "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" proposes a methodology for diagnosing NLP models. The authors define a matrix of common linguistic capabilities and test types that helps construct a reasonably complete test set. Borrowing from software testing, they propose three test types: Minimum Functionality Tests (MFT), Invariance tests (INV), and Directional Expectation tests (DIR), and they provide a tool that can quickly generate large numbers of test cases.
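As a concrete illustration, here is a minimal sketch of a CheckList-style invariance (INV) test: apply a label-preserving perturbation (here, swapping person names) and flag every case where the prediction changes. The keyword-based `predict` function is a toy stand-in for whatever model is being diagnosed, and the name list and test cases are made up; the real CheckList tool generates such cases from templates at scale.

```python
# A minimal sketch of a CheckList-style invariance (INV) test.
def predict(text: str) -> int:
    """Toy sentiment model: 1 if a positive keyword appears, else 0."""
    return int(any(w in text for w in ("good", "great", "love")))

NAMES = ["Alice", "Bob", "Carol"]

def name_swaps(text: str):
    """Yield label-preserving variants that substitute one person name for another."""
    for old in NAMES:
        if old in text:
            for new in NAMES:
                if new != old:
                    yield text.replace(old, new)

test_cases = ["Alice thought the movie was great",
              "Bob said the service was awful"]

failures = []
for original in test_cases:
    base = predict(original)
    for variant in name_swaps(original):
        if predict(variant) != base:  # INV expects the prediction not to change
            failures.append((original, variant))

print(f"{len(failures)} invariance failures across {len(test_cases)} test cases")
```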
In my view, the CheckList approach has a relatively low cost and can construct adversarial test sets fairly systematically. Although it cannot explain the model's reasoning process, it can explain how the model "sees" the data: which kinds of perturbation affect the model most, which kinds of perturbation the model pays more attention to, and so on. In practical business scenarios this can serve as a way to assure model robustness, and we have already started using this approach for model evaluation.

7. Local surrogate models



The local surrogate model is a post-hoc method that can be applied to any model. The basic idea is to use an interpretable model to locally fit the unexplainable black-box model, so that individual predictions gain some degree of interpretability. In 2016, Ribeiro et al. proposed LIME: for each instance, apply small perturbations to it to generate a neighborhood of samples, collect the black-box model's predictions on that neighborhood, and then fit an interpretable model (such as a linear regression) to those predictions, so that the prediction for this instance can be explained locally. SHAP is a similar method.

However, the explanations obtained this way are not robust. We assume that small perturbations will not change the result much, which is not always true for local surrogate models. Moreover, if the distribution of the training data differs from that of the data being explained, the surrogate's results cannot be trusted; Rahnama et al.'s study "A study of data and label shift in the LIME framework" makes this point, finding experimentally that the distribution of the data generated by LIME's perturbation process differs substantially from that of the training data. In addition, LIME's explanations depend heavily on hyperparameter choices: how many perturbed points to sample, how to weight those points, and how strong the regularization should be.

In my view, although the local surrogate model can explain any model, the explanation does not reflect the model's actual reasoning process but rather "the reasoning process that people imagine the model to have". In business scenarios it may work well for decision-support models (such as robo-advisors), but the faithfulness of this method is still worth discussing and studying.
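Here is a minimal LIME-like sketch of the procedure described above: perturb one input by randomly dropping words, query the black box on the perturbed texts, and fit a weighted linear surrogate whose coefficients act as the local explanation. The tiny classifier is a toy stand-in for a real black box, and the similarity weighting (fraction of words kept) simplifies LIME's kernel; in practice one would use the `lime` package itself.

```python
# A minimal LIME-like local surrogate sketch around one text prediction.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy black-box model standing in for the real one
train_texts = ["great fun movie", "boring terrible film",
               "loved every minute", "awful waste of time"]
train_labels = [1, 0, 1, 0]
black_box = make_pipeline(CountVectorizer(), LogisticRegression()).fit(train_texts, train_labels)

text = "a great but slightly boring movie"
words = text.split()

# 1) Perturb: random binary masks over the words (row 0 keeps the full text)
n_samples = 500
masks = rng.integers(0, 2, size=(n_samples, len(words)))
masks[0] = 1
perturbed = [" ".join(w for w, keep in zip(words, row) if keep) for row in masks]

# 2) Query the black box on the perturbed neighborhood
probs = black_box.predict_proba(perturbed)[:, 1]

# 3) Weight samples by similarity to the original instance
weights = masks.mean(axis=1)

# 4) Fit an interpretable surrogate locally; its coefficients explain this prediction
surrogate = Ridge(alpha=1.0).fit(masks, probs, sample_weight=weights)
for word, coef in sorted(zip(words, surrogate.coef_), key=lambda p: -abs(p[1])):
    print(f"{word:>10}  {coef:+.3f}")
```

The hyperparameters flagged in the paragraph above show up directly in this sketch: the number of perturbed points, the choice of sample weights, and the Ridge regularization strength all change the resulting explanation.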





