Integrated gradients: a novel neural network visualization method

This article introduces a neural network visualization method: Integrated Gradients, first proposed in the paper Gradients of Counterfactuals [1] and later presented again in Axiomatic Attribution for Deep Networks [2]. The two papers share the same authors and essentially the same content; the latter is the easier read, so if you want to study the originals, it is recommended to start with it.

Of course, this is work from 2016–2017; "novel" refers to the innovative and interesting idea, not to a recent publication date.

Visualization, simply put, means that for a given input $x$ and model $F(x)$, we want a way to identify which components of $x$ have an important influence on the model's decision, or to rank the components of $x$ by importance; the technical term for this is "attribution". A simple idea is to use the gradient $\nabla_x F(x)$ directly as the importance score of each component of $x$; integrated gradients is an improvement on this.

However, I feel that many articles introducing integrated gradients (including the original papers) are too "rigid" (formalistic) and fail to highlight the essential reason why integrated gradients beats the naive gradient. This article attempts to introduce integrated gradients in its own way.

Naive gradient

First, let's review the gradient-based method, which is actually based on the first-order Taylor expansion:

$$F(x+\Delta x) \approx F(x) + \langle \nabla_x F(x), \Delta x \rangle \tag{1}$$

We know that \bigtriangledown_{x}F(x)is the same size with the vector x, here [\bigtriangledown_{x}F(x)]_{i}for its i-th component, then for the same size \bigtriangleup x_{i}, [\bigtriangledown_{x}F(x)]_{i}the larger the absolute value, then the F(x+\bigtriangleup x)relative F(x)changes in the larger, that is to say: [\bigtriangledown_{x}F(x)]_{i} a measure of the model's input the sensitivity of the i-th component, so we use [\bigtriangledown_{x}F(x)]_{i}as the i-th component of the importance of indicators.

This line of thinking is simple and direct. It is described in How to Explain Individual Classification Decisions [3] and Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps [4]. In many cases it can indeed explain some prediction results successfully, but it also has obvious shortcomings.

Many articles have mentioned the saturation-zone problem: once the input enters a saturation zone (typically the negative half-axis of a ReLU), the gradient becomes 0, and then no useful information can be extracted from it.
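A tiny numeric illustration of this, with a made-up one-dimensional $F$ (not from the papers): inside the ReLU's saturated region the gradient is exactly zero, even though the input plainly matters relative to a baseline.

```python
import numpy as np

# F(x) = ReLU(1 - x) saturates (outputs a constant 0) once x >= 1.
relu = lambda z: np.maximum(z, 0.0)
F = lambda x: relu(1.0 - x)

x = 2.0                # deep inside the saturated region
eps = 1e-6
grad = (F(x + eps) - F(x)) / eps
print(grad)            # 0.0: the gradient claims x is "unimportant"
print(F(0.0) - F(x))   # 1.0: yet moving x to a baseline of 0 clearly changes F
```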

From a practical point of view this understanding is reasonable, but I think it is not deep enough. From the earlier article on adversarial training (its meaning, methods and reflections, with a Keras implementation), one can see that the goal of adversarial training can be understood as pushing $\Vert \nabla_x F(x) \Vert^2 \to 0$. In other words, the gradient can be "manipulated": without affecting the model's prediction accuracy, we can make the gradient as close to 0 as we like.

So, back to the subject of this article: $[\nabla_x F(x)]_i$ does measure the model's sensitivity to the $i$-th component of the input, but sensitivity alone is not a good measure of importance.

Integrated gradients

In view of the above shortcomings of using gradients directly, several improvements have been proposed, such as LRP [5] and DeepLIFT [6]; but comparatively speaking, I still find the improvement made by integrated gradients the most concise and elegant.

2.1 Reference background

First of all, we need to look at the original problem from a different angle: our goal is to find the more important components, but this importance should not be absolute, only relative. For example, if we want to find recent buzzwords, we cannot simply rank words by raw frequency, otherwise we would only find stop words like "的" and "了". Instead, we should prepare a balanced corpus, compute a "reference" word-frequency table from it, and compare differences in frequency rather than absolute values. This tells us that to measure the importance of each component of $x$, we also need a "reference background" $\bar{x}$.

Of course, in many scenarios we can simply take $\bar{x} = 0$, but that may not be optimal; for example, we could also choose $\bar{x}$ to be the mean of all training samples. What we expect is that $F(\bar{x})$ gives a relatively trivial prediction; for a classification model, say, the prediction at $\bar{x}$ should assign roughly equal probability to every class. So we consider $F(\bar{x}) - F(x)$, which we can picture as the cost of moving from $x$ to $\bar{x}$.
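As a quick sketch (with a hypothetical X_train array standing in for real training data), the two baseline choices mentioned above look like this; ideally one would also check that $F(\bar{x})$ is indeed a near-uniform prediction:

```python
import numpy as np

# X_train: hypothetical (num_samples, num_features) array of training inputs.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 3))

baseline_zero = np.zeros(X_train.shape[1])  # x̄ = 0, the simple default
baseline_mean = X_train.mean(axis=0)        # x̄ = mean of all training samples
```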

If we again use the approximate expansion (1), then we get:

$$F(\bar{x}) - F(x) \approx \sum_i [\nabla_x F(x)]_i [\bar{x} - x]_i \tag{2}$$

The above formula admits a new interpretation:

The total cost of moving from $x$ to $\bar{x}$ is $F(\bar{x}) - F(x)$, which is the sum of the costs of the individual components, and the cost of the $i$-th component is approximately $[\nabla_x F(x)]_i [\bar{x} - x]_i$. So we can use $[\nabla_x F(x)]_i [\bar{x} - x]_i$ as the importance score of the $i$-th component.
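A minimal sketch of this attribution, reusing the toy model from the earlier snippet (illustrative, not from the papers):

```python
import numpy as np

w = np.array([0.1, 2.0, 0.5])
F = lambda x: np.sum(w * x ** 2)      # toy model from the earlier sketch
grad_F = lambda x: 2 * w * x          # its analytic gradient

x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros_like(x)           # x̄ = 0

# Equation (2): estimated per-component cost of moving from x to x̄,
# using only the single gradient at x.
attribution = grad_F(x) * (baseline - x)
print(attribution)                    # [-0.2 -4.  -1. ]
print(attribution.sum())              # -5.2, but F(x̄) - F(x) = -2.6:
                                      # equation (2) is only approximate
```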

Of course, mathematically $[\nabla_x F(x)]_i$ and $[\nabla_x F(x)]_i [\bar{x} - x]_i$ share the same flaw (the vanishing gradient), but the explanations behind them differ. As mentioned earlier, the defect of $[\nabla_x F(x)]_i$ stems from "sensitivity is not a good measure of importance", whereas, following the reasoning of this section, the defect of $[\nabla_x F(x)]_i [\bar{x} - x]_i$ comes only from "equation (2) being merely approximate"; the logical reasoning itself is sound.

2.2 Integral identity

Often a new explanation gives us a new perspective, which in turn inspires new improvements. For example, the defect analysis above says that "$[\nabla_x F(x)]_i [\bar{x} - x]_i$ is not good enough because formula (2) is not exact". If we can find a similar expression that holds exactly, this problem is solved.

Integrated gradients finds exactly such an expression. Let $\gamma(\alpha)$, $\alpha \in [0,1]$, be a parametric curve connecting $x$ and $\bar{x}$, with $\gamma(0) = x$ and $\gamma(1) = \bar{x}$. Since $\frac{d}{d\alpha} F(\gamma(\alpha)) = \langle \nabla_\gamma F(\gamma(\alpha)), \gamma'(\alpha) \rangle$, integrating both sides over $[0,1]$ gives:

$$F(\bar{x}) - F(x) = \int_0^1 \left\langle \nabla_\gamma F(\gamma(\alpha)), \gamma'(\alpha) \right\rangle d\alpha = \sum_i \int_0^1 \left[\nabla_\gamma F(\gamma(\alpha))\right]_i \left[\gamma'(\alpha)\right]_i\, d\alpha \tag{3}$$

So integrated gradients proposes to use

$$\int_0^1 \left[\nabla_\gamma F(\gamma(\alpha))\right]_i \left[\gamma'(\alpha)\right]_i\, d\alpha \tag{4}$$

as the importance score of the $i$-th component. As the simplest choice, $\gamma(\alpha)$ is naturally taken to be the straight line between the two points, namely:

$$\gamma(\alpha) = (1 - \alpha)\, x + \alpha\, \bar{x} \tag{5}$$

In this case $\gamma'(\alpha) = \bar{x} - x$, so the integrated gradient becomes:

$$\left[\bar{x} - x\right]_i \int_0^1 \left[\nabla_x F\big(x + \alpha(\bar{x} - x)\big)\right]_i\, d\alpha \tag{6}$$

So, compared with $[\nabla_x F(x)]_i [\bar{x} - x]_i$, integrated gradients replaces $\nabla_x F(x)$ with

$$\int_0^1 \nabla_x F\big(x + \alpha(\bar{x} - x)\big)\, d\alpha$$

that is, the average of the gradients at every point on the line from $x$ to $\bar{x}$. Intuitively, since the gradients of all points along the path are taken into account, the result is no longer held hostage by a zero gradient at any single point.

If readers consult the two original papers on integrated gradients, they will find that the presentation there is reversed: formula (6) is first given out of nowhere, then proved to satisfy two seemingly arbitrary axioms (sensitivity and implementation invariance), and only then shown to satisfy formula (3).

In short, it takes readers on a long detour without making clear the essential reason why it is a better importance measure: we are decomposing $F(\bar{x}) - F(x)$, and formula (3) is exact where formula (2) is only approximate.

2.3 Discrete approximation

Finally, how do we compute this integral-form quantity? Deep learning frameworks have no built-in symbolic integration. In fact it is simple: following the "approximate, then take the limit" definition of the integral, we can use a discrete approximation directly. Taking formula (6) as an example, it is approximated by:

$$\left[\bar{x} - x\right]_i\, \frac{1}{n} \sum_{k=1}^{n} \left[\nabla_x F\!\left(x + \frac{k}{n}(\bar{x} - x)\right)\right]_i \tag{7}$$

So, to repeat: integrated gradients is essentially "the average of the gradients at the points on the line from $x$ to $\bar{x}$", which works better than the gradient at a single point.
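Here is a minimal numpy sketch of formula (7), again with the toy model used above. In practice grad_F would come from your framework's automatic differentiation, and a few dozen to a few hundred steps is usually enough.

```python
import numpy as np

w = np.array([0.1, 2.0, 0.5])
F = lambda x: np.sum(w * x ** 2)   # toy model from the earlier sketches
grad_F = lambda x: 2 * w * x       # analytic gradient of the toy F

def integrated_gradients(x, baseline, n=50):
    """Formula (7): average the gradient at n points on the straight line
    from x to the baseline, then scale by (x̄ - x), matching the
    F(x̄) - F(x) convention used in this article."""
    alphas = (np.arange(n) + 1) / n                # α = k/n, k = 1..n
    path = x + alphas[:, None] * (baseline - x)    # points γ(α) on the line
    avg_grad = np.mean([grad_F(p) for p in path], axis=0)
    return (baseline - x) * avg_grad

x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(x, baseline)
print(attr)                             # ≈ [-0.1 -2.  -0.5]
print(attr.sum(), F(baseline) - F(x))   # ≈ -2.6 vs exactly -2.6
```

Unlike the single-gradient attribution sketched earlier, these values sum (up to discretization error) to exactly $F(\bar{x}) - F(x)$, which is precisely the exactness that equation (3) guarantees.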
