14. Visualization and Understanding of Convolutional Neural Networks


Previously we discussed the following topics:

  • Attention mechanism: attention is a mechanism we can add to existing neural networks so that the model focuses on different parts of the input at different time steps; from it we built general self-attention layers that can serve as building blocks for new network architectures
  • Transformer: using self-attention we can build a new kind of neural network model that relies entirely on attention for its processing

But when we face vision tasks, how do we judge what the neural network has learned? Suppose we train a convolutional neural network: what are the intermediate features the network is looking for? Can we observe the internals of the network and understand what features the different layers are searching for?

Convolutional layer visualization

Recall the idea that a linear classifier learns a set of templates, one per class, and that the class score is computed as the inner product between the learned template and the input image

We have the same idea when we generalize to Convolutional Neural Networks

For a first-layer convolution kernel, after training, sliding the kernel over the image yields an inner product at each position that measures how well the local patch matches the kernel

If we visualize these convolution kernels as images, we find that each filter responds strongly to inputs that look like the filter itself, so visualizing the kernels gives us some idea of the features this layer has learned to look for

Below, the first-layer convolution kernels of four different models, each pre-trained on the ImageNet dataset, are visualized

[Figure: first-layer convolution kernels of four models pre-trained on ImageNet]

We can see that although the architectures of the models differ, the features their first-layer kernels look for are very similar
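
As a concrete sketch of how such a visualization can be produced (assuming PyTorch with a recent torchvision; the output filename is arbitrary), one can render the first-layer filters of a pre-trained AlexNet like this:

```python
import torch
import torchvision
from torchvision.utils import make_grid, save_image

# Load an ImageNet-pretrained AlexNet; its first layer is a Conv2d
# with 64 kernels of shape 3x11x11.
model = torchvision.models.alexnet(weights="IMAGENET1K_V1")
filters = model.features[0].weight.data.clone()  # (64, 3, 11, 11)

# Rescale each filter to [0, 1] so it can be displayed as an RGB image.
fmin = filters.amin(dim=(1, 2, 3), keepdim=True)
fmax = filters.amax(dim=(1, 2, 3), keepdim=True)
filters = (filters - fmin) / (fmax - fmin + 1e-8)

# Tile the 64 filters into one image and save it.
save_image(make_grid(filters, nrow=8, padding=1), "alexnet_conv1_filters.png")
```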

If we want to apply the visualization method to a higher layer, there will be some problems, because in the first convolutional layer, the convolution kernel usually learns some basic visual features, such as edges and colors. For example, a convolutional kernel might produce high activations when vertical edges are detected. We can see these features by visualizing the weights of these filters as images.

But in deeper convolution layers, convolution kernels usually learn more complex features. Since these features are learned in a high-dimensional space, it may be difficult to visualize directly. However, methods such as deconvolution or feature backprojection can be used to try to understand these features.

We can still visualize second-layer filters to some extent: they are still looking for blobs and edges, but now instead of looking for them in RGB space, they look for blobs and edges in the feature space produced by the first convolutional layer

[Figure: visualization of second-layer filters]

But we still don't have a strong intuition about what these filters are looking for, nor does visualizing the filter weights help us understand what the higher layers are doing, so we need other methods to try to understand what the rest of the ConvNet is doing

Let's first skip over the intermediate convolutional layers and try to understand what the last fully connected layer is doing

Fully connected layer visualization

The FC7 layer of AlexNet outputs 4096 features; a final linear transformation turns these into class scores for the 1000 classes of the ImageNet dataset. So one thing we can try is to understand what this 4096-dimensional vector represents.

What is this trained AlexNet doing? It takes our input image and turns it into a 4096-dimensional vector, then applies a linear classifier on top of that vector, so we can try to understand the network visually by understanding what is going on inside this 4096-dimensional vector

We take the trained AlexNet model, run forward inference on the images of a test set, and record the 4096-dimensional feature vector produced for each image. Once we have collected the images and their feature vectors, we can visualize them with various techniques; first, we run a nearest neighbor search on these feature vectors

[Figure: nearest neighbors of test images in FC7 feature space]

Recall that when we first used the nearest neighbor algorithm, we computed distances directly on raw pixels. Nearest neighbor then tends to group images with similar pixels together, even though pixel similarity does not mean the images really belong to the same category.

Here, instead, we perform the nearest neighbor search on the feature vectors computed by AlexNet. This lets us see how close images are to each other in the learned feature space, i.e. how the classifier uses the feature vectors to decide which images belong to the same category
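
A minimal sketch of this procedure (assuming `images` is a batch of normalized (N, 3, 224, 224) test images, and using scikit-learn for the neighbor search):

```python
import torch
import torchvision
from sklearn.neighbors import NearestNeighbors

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()

# Truncate the classifier at FC7: classifier[:6] ends with the ReLU
# after the second 4096-d linear layer.
fc7_model = torch.nn.Sequential(
    model.features, model.avgpool, torch.nn.Flatten(1), model.classifier[:6]
)

with torch.no_grad():
    feats = fc7_model(images).numpy()        # (N, 4096) feature vectors

# For each query, find the 5 nearest images in FC7 space
# (6 neighbors, since the nearest neighbor of a point is itself).
nn_index = NearestNeighbors(n_neighbors=6).fit(feats)
dists, idxs = nn_index.kneighbors(feats[:1])  # neighbors of the first image
```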

Look at the elephant row (second row) in the figure above: although the backgrounds and the elephants themselves differ considerably across images, the neighbors are still all elephants. In other words, when AlexNet processes an image it ignores much of the low-level pixel content

Put differently, the network has somehow encoded something like "elephant" into this vector, yielding a genuinely useful feature vector

Visualization via Dimensionality Reduction

We can train classifiers in the 4096-dimensional feature space, but we humans, living in three dimensions, cannot picture such a high-dimensional space. So we reduce the dimensionality to two or three dimensions to obtain a visualization humans can understand

The first method here is principal component analysis (PCA), a linear dimensionality reduction method that projects the data linearly while preserving as much of the structure of the high-dimensional feature space as possible.

[Figure: linear dimensionality reduction of feature vectors with PCA]

Another method is the t-SNE algorithm (t-Distributed Stochastic Neighbor Embedding), a dimensionality reduction technique for data visualization that is especially good at handling high-dimensional data. It was developed by Laurens van der Maaten and Geoffrey Hinton in 2008. t-SNE is mainly used to visualize the distribution of high-dimensional datasets in two- or three-dimensional space; its key characteristics are that it is nonlinear, preserves local structure, and targets high-dimensional visualization:

  1. Nonlinear: Unlike linear dimensionality reduction techniques such as Principal Component Analysis (PCA), t-SNE is a nonlinear dimensionality reduction technique. This enables it to handle complex data patterns.
  2. Preserves local structure: t-SNE is particularly good at preserving the local structure of the data, meaning that points that are close in the high-dimensional space will also be close in the low-dimensional space.
  3. High-dimensional visualization: Since t-SNE is typically used to reduce data to two or three dimensions, it is a very useful data visualization tool.

It should be noted that although t-SNE has many advantages, it also has some limitations. For example, t-SNE is sensitive to the choice of hyperparameters and may give very different results for different hyperparameters. Furthermore, t-SNE is quite computationally expensive and can be intractable for large-scale datasets.
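
A sketch of both reductions with scikit-learn (assuming `feats` is an (N, 4096) feature matrix as above and `labels` holds the class labels):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Linear projection to 2-D with PCA.
feats_pca = PCA(n_components=2).fit_transform(feats)

# Nonlinear embedding with t-SNE. Reducing first to ~50 dimensions with
# PCA is a common trick to cut the cost of the pairwise computations.
feats_50 = PCA(n_components=50).fit_transform(feats)
feats_tsne = TSNE(n_components=2, perplexity=30).fit_transform(feats_50)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.scatter(feats_pca[:, 0], feats_pca[:, 1], c=labels, s=5, cmap="tab10")
ax1.set_title("PCA")
ax2.scatter(feats_tsne[:, 0], feats_tsne[:, 1], c=labels, s=5, cmap="tab10")
ax2.set_title("t-SNE")
plt.show()
```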

Then, when we reduce the dimensionality of the feature vectors computed by AlexNet on MNIST and visualize them, we can see that the ten digit classes tend to occupy different regions, which tells us that the network really has encoded the classes in some way

[Figure: 2-D embedding of MNIST feature vectors, with the ten digit classes in different regions]

Visualization of Convolutional Activations

Another way to understand what a ConvNet is looking for is to visualize the convolutional activations in the middle layers

For example, the fifth convolutional layer of AlexNet outputs a feature map of size 13x13 with 128 channels, meaning that the layer has 128 convolution kernels. We can render each single-channel feature map as a grayscale image; of course, many of these images will be almost purely black, because of the activation function

For feature maps that are not all zero, we can align them with the original input image. For example, in the figure below we feed in a portrait, and one channel's activation roughly traces the shape of a face: this kernel seems to align, in some way, with human faces or skin color, so perhaps the kernels in this layer have somehow learned to respond to faces or skin tones

[Figure: conv5 activation maps for a portrait image, with one channel highlighting the face]

We can visualize these convolutional activations to give us some intuition as to what different features these different convolutional kernels might respond to

Why are most of the images black? Probably because ReLU is nonlinear: any negative value is set to zero and any positive value is left intact. Also, to visualize the activations we have to compress them into the 0-255 range somehow, which can affect the overall brightness of the images
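
A sketch using a forward hook to grab the activations (in torchvision's single-stream AlexNet, `features[10]` is the fifth conv layer, with 256 channels rather than the 128 of the original two-GPU model; `image` is assumed to be a normalized (1, 3, 224, 224) tensor):

```python
import torch
import torchvision
import matplotlib.pyplot as plt

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
acts = {}

# A forward hook that stores the layer's output whenever the model runs.
model.features[10].register_forward_hook(
    lambda module, inputs, output: acts.update(conv5=output.detach())
)

with torch.no_grad():
    model(image)

fmap = acts["conv5"][0]                       # (256, 13, 13)
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, channel in zip(axes.flat, fmap[:64]):
    ax.imshow(channel.numpy(), cmap="gray")   # one channel as grayscale
    ax.axis("off")
plt.show()
```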

Maximally Activating Patches

This concept refers to the image patch or region that produces the largest activation value for a given neuron in the network. As a convolutional neural network trains, the neurons in each layer learn to respond to certain features, such as colors, shapes, or more complex image patterns. When these features appear in the input image, the corresponding neuron is "fired". "Maximally activating patches" are therefore the image regions that elicit the strongest response from a particular neuron.

By looking at these regions of maximum activation, we can understand and explain how the neural network "sees" the image and how it makes decisions. This is very useful for understanding how neural networks work, improving model performance, and improving model interpretability.

Because this is a convolutional neural network, each element of an activation map corresponds to a finite-sized patch of the input image, its receptive field (at minimum the size of one convolution kernel, at maximum the whole image). For example, with 3x3 kernels, an element after two stacked convolutions depends on a 5x5 patch, and after three stacked convolutions on a 7x7 patch; see the sketch below
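
For stride-1 convolutions this follows a simple rule: a stack of L layers with kernel size k sees a receptive field of 1 + L·(k − 1) pixels. A tiny sketch of that arithmetic:

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field of a stack of stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

assert receptive_field(2) == 5   # two stacked 3x3 convs -> 5x5 patch
assert receptive_field(3) == 7   # three stacked 3x3 convs -> 7x7 patch
```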

Using the trained model, we this time pick an intermediate convolutional layer, feed in all the images, find the patches with the highest response for the selected neuron, and record and display those patches. From them we can try to understand what features the selected neuron is looking for

[Figure: maximally activating patches for several intermediate neurons]

From the figure above we can see that the patches in the first row suggest a neuron looking for something like dog noses; for the other rows, the maximally activating patches likewise share very similar characteristics

This visualization of maximally activating patches lets us understand what the intermediate convolutional layers are recognizing
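
A rough sketch of the search (the channel index, crop size, and total stride of 16 are illustrative assumptions; `dataset` is assumed to be an indexable collection of normalized image tensors):

```python
import torch

@torch.no_grad()
def top_patches(model, layer, dataset, channel=42, k=8, patch=65):
    """Find the k image crops that maximally activate one channel of `layer`."""
    acts = {}
    layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    best = []                                # (activation, image index, y, x)
    for idx, img in enumerate(dataset):
        model(img[None])
        fmap = acts["out"][0, channel]       # (H, W) activation map
        flat = fmap.argmax().item()
        y, x = divmod(flat, fmap.shape[1])
        best.append((fmap.max().item(), idx, y, x))
    best.sort(reverse=True)
    crops = []
    for _, idx, y, x in best[:k]:
        img = dataset[idx]
        cy, cx = y * 16, x * 16              # map back to pixel coordinates
        crops.append(img[:, max(0, cy - patch // 2): cy + patch // 2,
                            max(0, cx - patch // 2): cx + patch // 2])
    return crops
```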

Saliency Visualization via Occlusion

So another thing we can try to do is to understand which pixels in the input image these networks use to calculate the result, which is very important for classification problems

[Figure: occlusion experiment on an elephant image and the resulting saliency map]

We again take a trained model, feed in a picture of an elephant, and obtain the correct classification. Now we want to know which pixels made the biggest contribution to the network classifying the picture as an elephant

We process the elephant picture by masking some region with a gray square (i.e., replacing the pixels in that region), pass the masked image through the convolutional network, and record the predicted probability of "elephant". Repeating this with the mask at every position yields a saliency map of probabilities (right side of the figure above), from which we can see which pixels are important for the classification

If we occlude the region where the elephant is, the predicted probability of "elephant" in the corresponding part of the saliency map drops a lot. This is intuitive, and it means the network really is, in some way, looking at the correct part of the image to make its classification decision

We can repeat this approach on other images and get similar results: if we block out the sky or the mountains, the network still classifies correctly, but if we block out the car or the sailboat, it becomes much less accurate
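
A sketch of the occlusion loop (the mask size and stride are illustrative; `image` is a normalized (1, 3, 224, 224) tensor and `cls` the target class index):

```python
import torch

@torch.no_grad()
def occlusion_saliency(model, image, cls, mask=32, stride=16):
    """Class probability as a gray square slides over the image."""
    model.eval()
    _, _, H, W = image.shape
    rows, cols = (H - mask) // stride + 1, (W - mask) // stride + 1
    heatmap = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            occluded = image.clone()
            y, x = i * stride, j * stride
            # Zero is the mean color in normalized space, i.e. a "gray" patch.
            occluded[:, :, y:y + mask, x:x + mask] = 0.0
            heatmap[i, j] = model(occluded).softmax(dim=1)[0, cls]
    return heatmap   # low values mark regions the prediction depends on
```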

This method can also expose cheating: if something in your dataset lets the network reach the right answer for the wrong reason, occlusion can reveal it, letting us judge whether the network is looking at the parts of the image we think it should be looking at. For example, a network shown a sailboat at sea might see a certain color of water and predict "sailboat", instead of actually seeing the boat

However, this method requires a separate forward pass for every mask position, which is computationally very expensive, so we consider other methods

Saliency Visualization with Backpropagation

First take our input image, the cute dog shown below. In backpropagation, we can compute the gradient of the dog score with respect to each pixel of the input image. This tells us, for every pixel, how much slightly changing that pixel would affect the classification score at the end of the network. In the saliency map, brighter areas indicate pixels with greater influence on the model's decision

[Figure: dog image and its gradient saliency map]

Using this image gradient we obtain a gradient saliency map, which looks a bit like the ghost image in the lower right of the figure above. It tells us that the pixels that can most change the classification score are the pixels on the dog itself; changing pixels outside the dog would hardly change the score
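
A minimal sketch of this computation (same `image` and `cls` assumptions as before):

```python
import torch

def gradient_saliency(model, image, cls):
    """Saliency = max over color channels of |d(score_cls) / d(pixel)|."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, cls]             # unnormalized class score
    score.backward()                         # gradient w.r.t. every pixel
    return image.grad[0].abs().amax(dim=0)   # (H, W) saliency map
```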

So this again tells us that the network is looking at the correct part of the image, and we can use it to gain some intuition about what the network is learning (pictured below)

[Figure: gradient saliency maps for several images]

But note that this method requires a trained model. If we run it on an untrained model, the saliency map we get may be very messy.

The convolutional structure does impose a strong regularizing prior on the function being computed, but an untrained model has not yet learned how to extract useful information from the input image. It has no preference for any particular input or category, so we cannot read a meaningful interpretation from its gradient saliency map.

That is, no region of the image has a distinctive impact on the model's output, because the model has not yet learned to distinguish between different categories or features.

However, as the model's training process progresses, it will start to learn how to extract useful features from the data, at which point we can start using gradient saliency maps to understand and explain the behavior of the model.

Saliency Maps: Unsupervised Segmentation

Having obtained the gradient saliency map, can we segment the objects in the image based on it, without supervision? For example, in the figure below we can segment out objects such as grasshoppers and snakes

[Figure: unsupervised segmentation of grasshopper and snake images from saliency maps]

We can apply some image processing technique to the saliency maps computed by these networks; in this way, a network trained only for classification can, in some sense, be used to carve out the part of the image corresponding to the object category
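
One such technique is to seed GrabCut with the saliency values; a rough sketch with OpenCV (the thresholds are heuristic assumptions):

```python
import cv2
import numpy as np

def segment_from_saliency(image_bgr, saliency):
    """Seed GrabCut from a gradient saliency map."""
    saliency = (saliency - saliency.min()) / (np.ptp(saliency) + 1e-8)
    mask = np.full(saliency.shape, cv2.GC_PR_BGD, np.uint8)
    mask[saliency > 0.3] = cv2.GC_PR_FGD     # probably foreground
    mask[saliency > 0.7] = cv2.GC_FGD        # definitely foreground
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```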

So far the idea has been to use per-pixel gradient information to see how much each pixel contributes to the final output score, so that we can understand what the network is doing. But we are not limited to this: we can also use gradient information to look at the intermediate features the network searches for internally

Intermediate features via guided backpropagation

To see the intermediate features the network is looking for, we take the trained model, run a test image forward, and perform backpropagation starting from an intermediate layer. We can then see which pixels do not affect the classification result but do affect the values of neurons in that intermediate layer, or rather, which pixels have the greatest influence on those neurons

Note that after running forward inference on the trained model with an image, we do not start backpropagating from the loss function or the classification result, but from the intermediate layer

The basic idea of guided backpropagation is: during the backward pass through a ReLU, the gradient is zeroed wherever the forward activation was negative (the ordinary ReLU rule) and also wherever the incoming gradient itself is negative. Only features that are both activated and contribute positively to the response of interest are treated as important.

For networks using the ReLU activation function, this is implemented simply by additionally setting all negative gradients to zero during backpropagation. The intuition is that we only care about neurons that are activated (have positive values) and that contribute positively to the result.
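
A sketch of this ReLU override as a custom autograd function, one common way to implement guided backpropagation in PyTorch (replacing the model's ReLUs with this op is left implicit):

```python
import torch

class GuidedReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradient only where the input was positive (ordinary ReLU rule)
        # AND the incoming gradient is positive (the "guided" rule).
        return grad_out.clamp(min=0) * (x > 0)
```

With every ReLU in the network swapped for `GuidedReLU.apply`, one runs a forward pass, picks a single neuron's activation in the chosen intermediate layer, and calls `.backward()` on it; the gradient that lands on the input image is the guided-backpropagation visualization.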

Visualizations produced this way are much cleaner, letting us see which pixels have the greatest impact on a neuron's output. The figure below shows which features different neurons are looking for, that is, which pixels indeed have a significant effect on each neuron's output

[Figure: guided backpropagation visualizations for several neurons]

Gradient ascent

Previously we fed test images to the trained model to see which pixels affect a neuron's output or the classification result. We can now go a step further: instead of testing images, we try to synthesize an image that maximizes a neuron's output

Activation maximization is a method for understanding deep learning models, especially CNNs. Its basic idea is to generate an image that maximizes the activation value of a particular neuron. In this way, we can intuitively understand how this neuron responds to the input.

At the same time, we need a regularization function to keep the image natural, avoid overfitting, and preserve the interpretability of the generated image; otherwise the result may be overly complex, hard to interpret, or unlike any meaningful image to human eyes. This is because, to squeeze out the maximum possible activation, the optimization may over-adjust per-pixel details in ways the human eye cannot comprehend.

[Figure: gradient ascent setup: maximize a neuron's activation plus a natural-image regularizer]

Specifically, activation maximization first selects a neuron and creates an image initialized with random noise or zeros. It then modifies this image by gradient ascent (somewhat like training a network to minimize a loss, but ascending on the activation instead) so that the selected neuron's activation is maximized. This process continues for many iterations until the image converges.
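
A minimal sketch with a simple L2 regularizer (the step count, learning rate, regularization weight, and scoring function are illustrative assumptions):

```python
import torch

def activation_maximization(model, score_fn, steps=200, lr=0.1, l2_weight=1e-3):
    """Gradient ascent on the input: maximize score_fn(output) - L2(image)."""
    model.eval()
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)  # zero init
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        objective = score_fn(model(img)) - l2_weight * img.pow(2).sum()
        (-objective).backward()   # optimizers minimize, so negate to ascend
        opt.step()
    return img.detach()

# Example: synthesize an image maximizing the score of one ImageNet class.
# result = activation_maximization(model, lambda out: out[0, 130])
```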

[Figure: images synthesized by gradient ascent]

In this way, we can generate an image that "activates" a specific neuron in the model. This can help us understand what kind of features this neuron is looking for, thus providing a way to understand how the model works. For example, with an image classification model, we can use activation maximization to understand how the model recognizes different classes.

[Figure: maximally activating images for several classes]

The images above are maximally activating images generated by backpropagating from neurons of a trained model. You can make out some rough shapes, but because the regularizer used is not particularly good, the images are not very realistic. People have therefore kept trying to invent better regularizers to make images generated this way more natural; for example, the image in the figure below looks more natural

[Figure: gradient ascent results with an improved regularizer]

Of course, some researchers have become fascinated with using more complex natural-image regularizers to generate more realistic-looking images. There are some very fancy regularizers, actually based on generative adversarial networks, that can produce very nice and natural-looking images (shown below)

[Figure: GAN-regularized activation maximization results]

The original motivation of this research direction is to understand what the neural network is actually looking for, and the lecturer, Dr. Justin, believes that the more obsessed you become with strong regularizers for these maximally activating images, the further you go astray: when he sees these beautiful images, it is hard to say how much of them is what the ConvNet is actually looking for. He prefers a simple regularizer, arguing that it gives a purer picture of the features the ConvNet is seeking in the original image

Adversarial examples

An adversarial attack is an attack method against neural networks: by adding extremely small perturbations to the input data (changes that are almost imperceptible to humans), it can cause the network to make wrong predictions.

This phenomenon was first described around 2014 by Szegedy, Goodfellow, and their colleagues. They found that by optimizing an objective function it is possible to generate a specific noise pattern that, when added to the original image, causes the neural network to completely change its prediction, even though the noise is barely noticeable to the human eye. The noise-added image is called an adversarial example (Adversarial Example).

In the figure below, adding human-imperceptible noise to an image of an elephant and an image of a sailboat causes the network's predictions to go badly wrong.

[Figure: adversarial perturbations of elephant and sailboat images]
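
One simple way to construct such examples is the fast gradient sign method (FGSM) of Goodfellow et al.; a sketch (the perturbation budget `eps` is illustrative):

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, true_label, eps=2.0 / 255):
    """One-step adversarial perturbation along the sign of the loss gradient."""
    model.eval()
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), torch.tensor([true_label]))
    loss.backward()
    # Nudge every pixel by +/- eps so as to *increase* the loss.
    return (image + eps * image.grad.sign()).detach()
```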

Adversarial attacks reveal the vulnerability of deep learning models in the face of small perturbations, which is of great significance for the security and robustness of deep learning. For example, in key domains such as autonomous driving and medical diagnosis, adversarial attacks may lead to serious consequences.

There is no perfect solution for how to defend against adversarial attacks, but there are some commonly used defense strategies, such as adversarial training (adding adversarial samples during training), or input transformation (for example, denoising or compression to remove possible adversarial noise).

Feature inversion

Feature Inversion is a technique commonly used in computer vision to understand the inner workings of convolutional neural networks. The core idea: given an image, record its feature representation at a specific layer of the network, then use gradient descent to reconstruct an image whose features at that layer are as close as possible to those of the original input.

[Figure: feature inversion reconstructions from different layers]

The process of feature inversion is usually implemented by optimization algorithms such as gradient descent or gradient ascent. First, we select a specific layer from the CNN, and then forward-propagate an input image through the network to obtain the feature representation of that layer. Next, we create a random noisy image and forward-propagate through the network to the same layers to get a feature representation of that noisy image. We then define a loss function whose value is the difference between these two feature representations, and finally use an optimization algorithm to minimize this loss. In this process, the noisy image is continuously adjusted so that its feature representation in the network is as close as possible to the feature representation of the original image.
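
A sketch of that optimization (`extract`, a helper returning the chosen layer's activations, is hypothetical; the total-variation smoothing term is a common addition in feature inversion work, and the weights are illustrative):

```python
import torch

def feature_inversion(extract, target_image, steps=500, lr=0.05, tv_weight=1e-4):
    """Reconstruct an image whose features match those of target_image."""
    with torch.no_grad():
        target_feats = extract(target_image)
    img = torch.rand_like(target_image, requires_grad=True)  # random noise init
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        feat_loss = (extract(img) - target_feats).pow(2).sum()
        # Total variation encourages a smooth, natural-looking image.
        tv = (img[..., 1:, :] - img[..., :-1, :]).abs().sum() \
           + (img[..., :, 1:] - img[..., :, :-1]).abs().sum()
        (feat_loss + tv_weight * tv).backward()
        opt.step()
    return img.detach()
```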

This technique can help us understand what each layer in a neural network is learning. For example, if we choose the first layer of the network, the image generated by feature inversion will usually be very close to the original image, because the first layer usually learns low-level features such as edges and colors. However, if we choose deeper layers in the network, the generated images may be more blurred and abstract, because these layers usually learn higher-level, abstract features.

Feature inversion provides us with an intuitive way to understand the role of each layer in the neural network in the image recognition process, which helps us better understand and improve the neural network model.
