Explainable Artificial Intelligence - Input Unit Importance Attribution

Input unit importance attribution computes the importance (Importance) of each unit in the input. The importance reflects the influence of an input unit on the neural network: the higher the importance, the greater the influence. Quantifying and analyzing the importance of input units helps people understand which input variables drive the neural network to its current result, and thereby gain a preliminary understanding of how the network models its features.


In some studies, people use attribution (Attribution) or saliency (Saliency) to describe the influence of an input unit on a neural network; importance, attribution, and saliency have very similar meanings. Therefore, the term importance will be used uniformly in the following discussion.


(Figure: overview of the most representative interpretability methods, which are covered in the sections below.)


1. SHAP algorithm

1.1 Shapley Value

The Shapley Value comes from cooperative game theory: it is a method of assigning the total payout to the players according to their contribution to it. In the machine-learning analogy, the features are the players and the prediction is the total payout. Players cooperate in a coalition and obtain a certain payout from that cooperation.

The Shapley Value of a player is the average of its marginal contributions over all possible coalitions. It can be understood intuitively as follows: the feature values enter a room in random order, and all feature values already in the room take part in the game (i.e., contribute to the prediction). The Shapley value of a feature value is the average change in the prediction received by the coalition already in the room when that feature value joins it.

Example: Three players cooperate to complete a project (making 500 parts).



| Probability | Join order | Player 1's marginal contribution | Player 2's marginal contribution | Player 3's marginal contribution |
|---|---|---|---|---|
| 1/6 | 1, 2, 3 | 100 | 170 | 230 |
| 1/6 | 1, 3, 2 | 100 | 125 | 275 |
| 1/6 | 2, 1, 3 | 145 | 125 | 230 |
| 1/6 | 2, 3, 1 | 150 | 125 | 225 |
| 1/6 | 3, 1, 2 | 325 | 125 | 50 |
| 1/6 | 3, 2, 1 | 150 | 300 | 50 |

Then, the Shapley Value of each player is:

| Player | Shapley Value |
|---|---|
| 1 | (100 + 100 + 145 + 150 + 325 + 150) / 6 = 970/6 |
| 2 | (170 + 125 + 125 + 125 + 125 + 300) / 6 = 970/6 |
| 3 | (230 + 275 + 230 + 225 + 50 + 50) / 6 = 1060/6 |

Therefore, if the payout is divided in proportion to these Shapley Values, player 1 receives 32.3% of the total, player 2 receives 32.3%, and player 3 receives 35.3%.
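The example above can be reproduced with a brute-force computation: enumerate every join order and average each player's marginal contribution. This is a minimal sketch; the coalition values below are the ones implied by the marginal contributions in the table.

```python
from itertools import permutations
from fractions import Fraction

# Characteristic function v(S): value (parts produced) of each coalition.
value = {
    frozenset(): 0,
    frozenset({1}): 100, frozenset({2}): 125, frozenset({3}): 50,
    frozenset({1, 2}): 270, frozenset({1, 3}): 375, frozenset({2, 3}): 350,
    frozenset({1, 2, 3}): 500,
}

players = [1, 2, 3]
orders = list(permutations(players))
shapley = {p: Fraction(0) for p in players}

for order in orders:
    coalition = frozenset()
    for p in order:
        marginal = value[coalition | {p}] - value[coalition]  # marginal contribution of p
        shapley[p] += Fraction(marginal, len(orders))         # average over all join orders
        coalition = coalition | {p}

print(shapley)  # player 1: 970/6, player 2: 970/6, player 3: 1060/6
```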

So in machine learning, how do we control whether a particular player (feature) participates in the game?
If a player does not participate in the game, we give that feature a random value drawn from the feature's distribution in the training data set. A feature assigned a random value has no predictive power, so we consider that feature as not participating in the game (the prediction).
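As a minimal illustration (the data and names here are stand-ins, not from the text), "removing" a feature can be simulated by overwriting it with a value drawn from the training data, so it no longer carries information about the instance being explained:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))   # stand-in training data
x = X_train[0].copy()                  # instance to explain
absent = [1, 3]                        # features treated as "not participating"

x_masked = x.copy()
for j in absent:
    # draw feature j from its marginal distribution in the training set
    x_masked[j] = X_train[rng.integers(len(X_train)), j]
```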


Disadvantages of Shapley Value:

  • The computational complexity is very high (the number of possible coalitions grows exponentially with the number of features).
  • The Shapley Value returns a single value per feature but, unlike LIME, no prediction model. This means it cannot be used to make statements about how the prediction would change if the input changed.
  • If you want to calculate the Shapley value of a new data instance, you need to access a lot of data to perform the calculation.
  • Like many other permutation-based interpretation methods, the Shapley value method suffers from including unrealistic data instances when features are correlated.



1.2 SHAP algorithm

Because computing the exact Shapley Value has very high computational complexity, in 2017 Lundberg and Lee proposed the SHapley Additive exPlanations (SHAP) algorithm ^{[2]}, which efficiently approximates the Shapley Value of each input unit. The goal is to explain an instance x by computing the contribution of each feature to the prediction f(x). The model's prediction is interpreted as a linear function of binary variables:

g(z') = \phi_0 + \sum\limits^M_{i=1}\phi_i z'_i

where g is the explanation model; z' \in \{0,1\}^M is the coalition vector (also called "simplified features"; 1 = present, 0 = absent); M is the maximum coalition size; and \phi_i \in \mathbb{R} is the Shapley value attributed to feature i.

SHAP describes the following three desirable properties:

  • Local Accuracy: the sum of the feature attributions equals the output of the model being explained, i.e. \widehat{f}(x) = g(x') = \phi_0 + \sum\limits^M_{i=1}\phi_i x'_i (checked numerically in the sketch after this list).
  • Missingness: all feature values x'_i of the instance to be explained should be "1"; if one is "0", it means the instance to be explained is missing that feature value, and a missing feature receives an attribution of zero.
  • Consistency: if the model changes so that the marginal contribution of a feature value increases or stays the same (regardless of the other features), its Shapley value increases or stays the same accordingly.
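The local accuracy property can be checked numerically with the shap library. This is a hedged usage sketch (it assumes shap and scikit-learn are installed and uses the KernelExplainer API as found in common shap releases); the toy model and data are placeholders, not from the text.

```python
import numpy as np
import shap
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 3.0
model = LinearRegression().fit(X, y)

background = X[:50]                          # background data used to simulate "missing" features
explainer = shap.KernelExplainer(model.predict, background)
x = X[:1]                                    # instance to explain
phi = explainer.shap_values(x)               # one Shapley value per feature

# Local accuracy: base value + sum of attributions ≈ model prediction
print(np.isclose(explainer.expected_value + np.sum(phi), model.predict(x)[0]))
```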

Calculate importance:

  • KernelSHAP ^{[2]}: a kernel-based method that estimates Shapley values via a local surrogate model.
  • DeepSHAP ^{[3]}: obtained by adapting an earlier backpropagation-based algorithm (DeepLIFT ^{[4]}) to estimate Shapley values.

1. KernelSHAP(Linear LIME + Shapley values)

(1) Introduction to KernelSHAP

KernelSHAP estimates, for an instance x, the contribution of each feature value to the prediction.

Recall SHAP's definition of an explanation: g(z') = \phi_0 + \sum\limits^M_{j=1}\phi_j z'_j = \widehat{f}(z), i.e. SHAP wants to train a regression model that locally fits the output of the model being explained. (Wait, isn't that exactly what LIME does?!) What we want here is the Shapley value corresponding to each feature, i.e. \phi_j, the contribution of each feature to the prediction.

The biggest difference between KernelSHAP and LIME lies in how the instances in the regression model are weighted:

  • LIME weights instances according to how close they are to the original instance ;

  • SHAP weights the sampled instances according to the weights obtained by the coalition in Shapley value estimation .

    To achieve Shapley-consistent weighting, Lundberg et al. proposed the SHAP kernel, where \pi_x(z') denotes the weight of instance z':

    \pi_x(z') = \frac{M-1}{\binom{M}{|z'|}\,|z'|\,(M-|z'|)}

    where M is the maximum coalition size and |z'| is the number of features present in instance z'.
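The kernel can be transcribed directly. This is a minimal sketch (it assumes scipy is available for the binomial coefficient):

```python
from scipy.special import comb

def shap_kernel_weight(M, s):
    """SHAP kernel weight pi_x(z') for a coalition with s of M features present."""
    if s == 0 or s == M:
        # empty and full coalitions get infinite weight; in practice they are
        # handled as constraints rather than as regression samples
        return float("inf")
    return (M - 1) / (comb(M, s) * s * (M - s))

print(shap_kernel_weight(10, 1), shap_kernel_weight(10, 5))  # small coalitions weigh more
```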


(2) How is KernelSHAP calculated?

The calculation of KernelSHAP mainly includes the following five steps ^{[10]} (a minimal code sketch follows the list):

  1. Initialize (sample) some simplified features z'_k \in \{0,1\}^M near the target instance (1 means the feature is present in the coalition, 0 means it is absent).
  2. Map the simplified features z'_k back to the original data space, h_x(z') = z, and then apply the model \widehat{f} to obtain the prediction \widehat{f}(h_x(z'_k)) for z'_k.
    Here, if an entry of the vector is 1, the feature's original value is used; if it is 0, the feature value is treated as "missing", and h_x fills it in by sampling from the marginal distribution of that feature in the data (sampling from the marginal distribution means ignoring dependencies between the present and absent features).
  3. Compute the weight of each z'_k using the SHAP kernel:
    \pi_x(z') = \frac{M-1}{\binom{M}{|z'|}\,|z'|\,(M-|z'|)}
  4. Fit a weighted linear model.
  5. Return the Shapley values \phi_k, i.e. the coefficients of the linear model.
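Below is a compact sketch of these five steps for a toy tabular model. The model, sample sizes, and the dropping of empty/full coalitions are illustrative choices, not part of the original text; the full method additionally enforces the local-accuracy constraint.

```python
import numpy as np
from scipy.special import comb

rng = np.random.default_rng(0)
M = 4                                                      # number of features
X_train = rng.normal(size=(500, M))                        # background data
f = lambda X: X @ np.array([2.0, -1.0, 0.5, 0.0]) + 3.0    # black-box model
x = X_train[0]                                             # instance to explain

# Step 1: sample simplified features z' in {0,1}^M
n = 200
Z = rng.integers(0, 2, size=(n, M))

# Step 2: map each z' to the original space h_x(z') and evaluate the model.
# Present features (z' = 1) keep x's value; absent ones are drawn from the
# marginal distribution of the background data.
background = X_train[rng.integers(len(X_train), size=n)]
X_mapped = np.where(Z == 1, x, background)
y = f(X_mapped)

# Step 3: SHAP kernel weights (empty/full coalitions, which would get infinite
# weight, are simply dropped here for brevity)
s = Z.sum(axis=1)
keep = (s > 0) & (s < M)
Z, y, s = Z[keep], y[keep], s[keep]
w = (M - 1) / (comb(M, s) * s * (M - s))

# Steps 4-5: weighted linear regression; its coefficients approximate the Shapley values
A = np.column_stack([np.ones(len(Z)), Z])     # intercept phi_0 plus one column per feature
phi = np.linalg.lstsq(A * np.sqrt(w)[:, None], y * np.sqrt(w), rcond=None)[0]
print("phi_0:", phi[0], "feature attributions:", phi[1:])
```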



2. DeepSHAP(DeepLIFT + Shapley values)

DeepSHAP uses the DeepLIFT method to estimate the Shapley Value corresponding to each feature.
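A hedged usage sketch with the shap library's DeepExplainer, which implements this DeepLIFT-based estimation; it assumes a shap version with PyTorch support, and the toy model and tensors are placeholders.

```python
import shap
import torch
import torch.nn as nn

# Toy model and data, just to illustrate the call pattern.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
background = torch.randn(100, 4)   # reference inputs that play the role of "feature absent"
x = torch.randn(5, 4)              # instances to explain

explainer = shap.DeepExplainer(model, background)
phi = explainer.shap_values(x)     # DeepLIFT-style attributions per input unit
```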




2. LIME: a local, model-agnostic explanation method

2.1 Introduction to LIME

LIME ^{[9]} stands for Local Interpretable Model-agnostic Explanations. It is a local surrogate model method, i.e. it focuses on training a local surrogate model to explain a single prediction. "Local" means local in the sample space; "model-agnostic" means independent of the model being explained.

This approach treats the original model as a black box into which data points can be fed to obtain predictions. The black box can be probed as often as needed, and the goal of LIME is to understand why the black-box model made a particular prediction.

LIME generates a new data set and then trains an interpretable model (e.g. a linear model or a decision tree) on it, with each new sample weighted according to its proximity (which can be understood as distance) to the target instance. The learned model should be a good approximation of the machine-learning model's predictions locally, but it need not be a good global approximation.

Mathematically, a local surrogate model with interpretability constraints can be expressed as:

explanation(x) = argmin_{g \in G} L(\widehat{f}, g, \pi_x) + \Omega(g)

where g is the explanation model for x, \widehat{f} denotes the original model, the loss L measures how close the predictions of the explanation model g are to those of the original model \widehat{f}, the proximity measure \pi_x defines the size of the neighborhood around instance x that is considered for the explanation, and the model complexity \Omega(g) is kept low (e.g. for decision trees, \Omega(g) might be the depth of the tree). In practice, LIME only optimizes the loss part; the user must choose the complexity.


2.2 How is LIME calculated?

Consider a complex model used to solve a classification problem. For a given sample, we want to know which factors affect its classification.
We take the local region around that sample and fit the decision boundary there with a linear model. How do we do it? (A minimal code sketch follows the steps below.)

  1. Generate many sample points near the sample to be explained. Around this sample, we randomly generate many new sample points.
  2. Label the newly generated samples. We feed the newly generated samples into the already trained complex model \widehat{f} and obtain the corresponding predictions.
  3. Use the labeled samples to train the explanation model g. Find the explanation model g that minimizes the weighted loss L(\widehat{f}, g, \pi_x). Why a weighted loss? Because the newly generated sample points lie at different distances from the sample being explained, they cannot all be treated equally: points closer to the explained sample receive a larger weight, and points farther away receive a smaller weight.
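Here is a minimal sketch of the three steps for a toy tabular black box; the kernel width, sample count, and ridge surrogate are arbitrary illustrative choices rather than values prescribed by LIME.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
f = lambda X: 1 / (1 + np.exp(-(3 * X[:, 0] - X[:, 1] ** 2)))  # toy black-box model
x = np.array([0.5, -1.0])                                      # instance to explain

# 1. Generate sample points around the instance.
Z = x + rng.normal(scale=0.5, size=(500, 2))

# 2. Label them with the black-box model.
y = f(Z)

# 3. Weight by proximity to x and fit a simple interpretable (linear) surrogate.
dist = np.linalg.norm(Z - x, axis=1)
weights = np.exp(-(dist ** 2) / 0.5 ** 2)      # exponential kernel, width 0.5
g = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)

print("local feature importances:", g.coef_)
```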

2.3 How does LIME give its explanation?

From the above, we know that LIME trains an interpretable model (a linear model, decision tree, etc.) that is a good local approximation of the original machine-learning model around the target instance. The weight this interpretable model assigns to each input unit is then taken as the importance of that input unit.



3. Interpretability methods that compute gradients via backpropagation

Among interpretability methods there is a large class based on backpropagation: the gradient of each input unit is computed and used as that unit's importance. These methods design backpropagation rules for each layer of the neural network, such as the convolutional layers, pooling layers, and nonlinear activation layers, so that importance is distributed more fairly and reasonably during backpropagation; in the end, the gradient value obtained for each input unit reflects that unit's importance well.
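In its most basic form, the importance is simply the raw input gradient. A minimal PyTorch sketch with a toy model (the refined per-layer backward rules described above are what the later methods add):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(1, 10, requires_grad=True)

score = model(x).sum()
score.backward()                 # backpropagate from the output score to the input
importance = x.grad.abs()        # gradient magnitude per input unit
```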



3.1 Guided Backpropagation (GBP)

Here we need to understand three concepts: deconvolution (deconvnet), backpropagation, and guided backpropagation.

1. Deconvolution (Deconvnet)

In the paper Visualizing and Understanding Convolutional Networks, deconvolution (deconvnet) is used to visualize the intermediate layers of a CNN. The basic building block of a typical convolutional neural network is convolution + ReLU activation + max pooling; a deconvnet reverses these three operations, i.e. unpooling + reversed ReLU + deconvolution. See the referenced article for details.

2. Backpropagation

Reference Article 1
Reference Article 2


3. Guided Backpropagation (GBP)

In the Guided Backpropagation (GBP) algorithm ^{[5]}, all layers other than the ReLU layers, including convolutional layers and pooling layers, use the traditional backpropagation rules during the backward pass. For the ReLU layers, the GBP algorithm sets gradient values less than 0 to 0 before continuing to propagate the gradient, and the gradient obtained for each input sample is used as the explanation of its importance.

The GBP algorithm is equivalent to adding guidance to ordinary backpropagation: it blocks gradients less than 0 from flowing back, and those negative gradients correspond to the parts of the original image that weaken the feature we want to visualize, which are exactly the parts we do not want.
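A hedged PyTorch sketch of the GBP rule on a toy model: during the backward pass through each ReLU, negative gradients are zeroed out in addition to ReLU's usual masking (assumes a PyTorch version providing register_full_backward_hook).

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

def guided_relu_hook(module, grad_input, grad_output):
    # keep only positive gradients flowing back through the ReLU
    return (torch.clamp(grad_input[0], min=0.0),)

for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.register_full_backward_hook(guided_relu_hook)

x = torch.randn(1, 10, requires_grad=True)
model(x).sum().backward()
saliency = x.grad                # GBP importance of each input unit
```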

(Figure: comparison of how backpropagation, deconvnet, and guided backpropagation handle the ReLU during the backward pass.)
In this figure,

  • Figure a shows that given an input image, we perform forward propagation on the layer of interest, then set all but one activation to zero, and perform backpropagation to reconstruct the original image;
  • Figure b shows different methods of nonlinear backpropagation through ReLU;
  • Figure c formally defines the different ways the output activation is propagated backward through a ReLU unit in layer l. Note: deconvnet and guided backpropagation do not compute the true gradient, but an estimated version of it.



3.2 Integrated Gradients (IG)

The backpropagation-based gradient methods are very intuitive, and the gradient obtained for each input unit can be used to characterize that unit's importance. However, their limitation is the problem of gradient saturation: a certain factor may promote the prediction for the image, but only to a limited extent. Beyond a certain level the predicted probability no longer increases, so in that region the gradient is 0, and a gradient of 0 reveals no useful information. As shown in the figure below:
(Figure: gradient saturation — once the feature exceeds a certain level, the model output flattens and the local gradient drops to zero.)
To address this defect of gradient-based methods, Sundararajan et al. proposed the Integrated Gradients (IG) algorithm in 2017 ^{[8]}. The method integrates the gradient of each input variable along a path from a baseline x' to the input x and uses the result as that variable's importance, i.e.:

IG_i(x) = (x_i - x'_i) \times \int^1_0 \frac{\partial \widehat{f}(x' + \alpha(x - x'))}{\partial x_i}\, d\alpha
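A minimal sketch of the usual numerical approximation: average the gradients at points along the straight-line path from the baseline to the input, then scale by (x - x'). The toy model, zero baseline, and step count are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(1, 10)
baseline = torch.zeros_like(x)     # a common (but not mandatory) choice of x'
steps = 50

grads = []
for alpha in torch.linspace(0, 1, steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)
    model(point).sum().backward()
    grads.append(point.grad)

avg_grad = torch.stack(grads).mean(dim=0)          # Riemann approximation of the integral
integrated_gradients = (x - baseline) * avg_grad   # importance of each input unit
```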



3.3 Layer-wise Relevance Propagation (LRP)

The Layer-wise Relevance Propagation (LRP) algorithm assumes that the neurons in each layer of the neural network carry a certain relevance (Relevance) to the network's output, and it redistributes this relevance backward layer by layer.

I haven't fully understood it yet; it feels a bit like the gating mechanism of an LSTM, selectively retaining the content of the neurons in the layer above.
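For a rough feel of the idea, here is a hedged sketch of the commonly used LRP-epsilon rule for a single linear layer: the relevance of the layer's outputs is redistributed to its inputs in proportion to each input's contribution a_i * w_ij. This is only an illustration of the general mechanism, not the full algorithm from the reference.

```python
import numpy as np

def lrp_epsilon(a, W, b, relevance_out, eps=1e-6):
    """Redistribute relevance_out over the inputs a of a layer with pre-activations z = a @ W + b."""
    z = a @ W + b                               # forward pre-activations of this layer
    s = relevance_out / (z + eps * np.sign(z))  # stabilized "message" per output neuron
    return a * (W @ s)                          # relevance assigned to each input neuron

a = np.array([1.0, 0.5, -0.3])                  # activations entering the layer
W = np.random.default_rng(0).normal(size=(3, 2))
b = np.zeros(2)
R_out = np.array([0.7, 0.3])                    # relevance already assigned to the two outputs
print(lrp_epsilon(a, W, b, R_out))              # relevance of the three inputs
```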









References
[1] Introduction to Explainable Artificial Intelligence (可解释人工智能导论)
[2] Lundberg S M, Lee S I. A unified approach to interpreting model predictions[J]. Advances in neural information processing systems, 2017, 30.
[3] Fernando Z T, Singh J, Anand A. A study on the Interpretability of Neural Retrieval Models using DeepSHAP[C]//Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019: 1005-1008.
[4] Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences[C]//International conference on machine learning. PMLR, 2017: 3145-3153.
[5] Springenberg J T, Dosovitskiy A, Brox T, et al. Striving for simplicity: The all convolutional net[J]. arXiv preprint arXiv:1412.6806, 2014.
[6] "Intuitive understanding" convolutional neural network (2): Guided backpropagation (Guided-Backpropagation) [
7] Intuitive understanding of deconvolution and guided backpropagation in deep learning
[8] Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks[C]//International conference on machine learning. PMLR, 2017: 3319-3328. [9] R ibeiro MT, Singh S, Guestrin C. "Why should i trust you?" Explaining the predictions of any classifier[C]//Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016: 1135-1144. [10]
KernelSHAP
& TreeSHAP
