Understanding convolutional neural networks with PyTorch

In today's era, machines have reached up to 99% accuracy at recognizing features and objects in images. We see this every day: smartphones recognize faces through the camera, Google Images lets us search for specific photos, and apps scan text from barcodes or books. All of this is possible thanks to convolutional neural networks (CNNs). Convolutional neural networks are a specific type of neural network, also known as convnets.

If you are a deep learning enthusiast, you have probably heard of convolutional neural networks, and maybe you have even built a few image classifiers yourself. Modern deep learning frameworks such as TensorFlow and PyTorch make it easy for machines to learn from images. However, some questions remain: how does data pass through the artificial layers of a neural network? How does the computer learn from it? One good way to explain a convolutional neural network is with PyTorch. So let's dig into CNNs by visualizing the image at each layer.

 

    Interpretation of convolutional neural networks

What is a convolutional neural network?

A convolutional neural network (CNN) is a special type of neural network that performs particularly well on images. The convolutional neural network was proposed by Yann LeCun in 1998 and can recognize the digits present in a given input image.

Before starting with convolutional neural networks, it is important to understand how neural networks work. Neural networks mimic how the human brain solves complex problems and finds patterns in a given data set. Over the past few years, neural networks have swept through many machine learning and computer vision tasks.

The basic model of a neural network consists of neurons organized in layers. Every neural network has an input layer and an output layer, with many hidden layers added depending on the complexity of the problem. As data passes through these layers, the neurons learn to recognize patterns. This representation of a neural network is called a model. Once the model is trained, we ask the network to make predictions on test data. If you are not familiar with neural networks, this article on deep learning using Python is a good starting point.

As mentioned above, a CNN is a special kind of neural network that performs particularly well on images. Other applications that use CNNs include speech recognition, image segmentation, and text processing. Before convolutional neural networks, multilayer perceptrons (MLPs) were used to build image classifiers.

Image classification is the task of extracting information classes from a multi-band raster image. Multilayer perceptrons need more time and space to find information in a picture, because every input feature must be connected to every neuron in the next layer. CNNs replace this with a concept called local connectivity, which connects each neuron only to a local region of the input volume. By letting different parts of the network specialize in high-level features (such as textures or repeating patterns), the number of parameters is minimized. Confused? Don't worry. Let's compare how an image passes through a multilayer perceptron and through a convolutional neural network for a better understanding.

 

    Comparing MLPs and CNNs

Consider the MNIST dataset. Since the size of each input image is 28x28 = 784, the input layer of the multilayer perceptron has 784 nodes. The network should be able to predict the digit in a given input image, which means the output falls into one of ten classes, from 0 to 9 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). In the output layer we return the class scores: for example, if the given input is an image of the digit "3", then the neuron corresponding to "3" should have a higher class score than the other neurons. How many hidden layers do we need to include, and how many neurons should each contain? Here is an example of an MLP in code:
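A minimal sketch, assuming the Keras Sequential API, of the architecture described below:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))  # first hidden layer
model.add(Dropout(0.2))                                       # drop 20% of the activations
model.add(Dense(512, activation='relu'))                      # second hidden layer
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))                    # one output per digit class
model.summary()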

The above code snippet is implemented using a framework called Keras (ignore the syntax for now). It tells us that there are 512 neurons in the first hidden layer, which is connected to the input layer of shape 784. The hidden layer is followed by a dropout layer, which helps overcome the problem of overfitting; the value 0.2 means there is a 20% probability of dropping any given neuron after the first hidden layer. Again, we add a second hidden layer with the same number of neurons (512) as the first, followed by another dropout layer. Finally, we end this stack of layers with an output layer containing 10 classes. The class with the highest value is the model's prediction.

This is what the multilayer network looks like after all the layers are defined. One disadvantage of this multilayer perceptron is that it is fully connected, so learning takes more time and space. Also, an MLP only accepts vectors as input.

Convolutional layers do not use fully connected layers but sparsely connected layers, and they accept matrices as input, which gives them an advantage over MLPs. Input features are connected only to local nodes. In an MLP, every node is responsible for gaining an understanding of the entire picture. In a CNN, we decompose the image into regions (local patches of pixels). Each hidden node reports to the output layer, where the outputs are combined to find patterns. The figure below shows how each layer is connected locally.

Before we look at how a CNN finds information in pictures, we need to understand how features are extracted. A convolutional neural network uses different layers, and each layer stores the features found in the image. For example, consider a picture of a dog. Whenever the network needs to classify a dog, it should recognize all of its features: eyes, ears, tongue, legs, and so on. Using filters and kernels, these features are broken down and identified in the local layers of the network.

 

    How does the computer see the image?

Unlike humans, who understand an image by looking at it, a computer uses a set of pixel values between 0 and 255 to understand a picture. The computer looks at these pixel values and interprets them. At first glance, it does not know the objects or the colors; it only recognizes pixel values, and that is all an image is to a computer.

After analyzing the pixel values, the computer slowly begins to understand whether the image is grayscale or color. It knows the difference because a grayscale image has only one channel, where each pixel represents the intensity of a single color: zero means black, 255 means white, and the values in between are shades of gray. A color image, on the other hand, has three channels: red, green, and blue. These represent the intensities of the three colors (a 3D matrix), and varying them together produces a huge range of colors. After determining the color properties, the computer recognizes the curves and contours of the objects in the image.
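In code, the difference is simply the shape of the pixel array:

import numpy as np

gray = np.zeros((28, 28), dtype=np.uint8)      # one channel, values 0-255
color = np.zeros((28, 28, 3), dtype=np.uint8)  # red, green and blue channels
print(gray.shape, color.shape)                 # (28, 28) (28, 28, 3)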

You can explore this process with PyTorch by loading a dataset and applying filters to the images. Below is the code snippet. (This code can be found on GitHub)
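A possible version of the loading step, assuming the MNIST dataset from torchvision (the dataset choice is an assumption; any image dataset works the same way):

import numpy as np
import torch
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

# ToTensor() rescales the 0-255 pixel values into the 0-1 range
transform = transforms.ToTensor()
train_data = datasets.MNIST(root='data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=20, shuffle=True)

# Grab one batch and display the first few digits
images, labels = next(iter(train_loader))
images = images.numpy()

fig = plt.figure(figsize=(25, 4))
for idx in range(20):
    ax = fig.add_subplot(2, 10, idx + 1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    ax.set_title(str(labels[idx].item()))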

  

Now, let us see how to input a single image into the neural network.

(This code can be found on GitHub)

img = np.squeeze(images[7])
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
ax.imshow(img, cmap='gray')
width, height = img.shape
thresh = img.max() / 2.5
for x in range(width):
    for y in range(height):
        # annotate every pixel with its (rounded) value
        val = round(img[x][y], 2) if img[x][y] != 0 else 0
        ax.annotate(str(val), xy=(y, x),
                    color='white' if img[x][y] < thresh else 'black')

This is how the digit 3 is broken down into pixels. A "3" is randomly selected from a set of handwritten digits and its pixel values are displayed. Here, ToTensor() normalizes the actual pixel values (0–255) and restricts them to the range 0 to 1. Why? Because this makes the computations in the later sections easier, whether we are interpreting the image or finding the general patterns that exist in it.

Create your own filter

In a convolutional neural network, the pixel information in the image is filtered. Why do we need filters at all? Just like a child, a computer needs to go through a learning process to understand images. Thankfully, this doesn't take years! The computer does it by learning from the basics first and then gradually progressing to the whole. The network must therefore first learn all the primitive parts of the image, such as edges, contours, and other low-level features. Once these are detected, the computer can handle more complex features. In short, low-level features must be extracted first, then mid-level features, and then high-level features. Filters provide a way to extract this information.

Low-level features can be extracted using a specific filter, which is itself a set of pixel values, much like the image. It can be understood as the weights connecting the layers of a CNN. Multiplying these weights, or filters, with the input produces an intermediate image, which represents the computer's partial understanding of the image. These by-products are then multiplied by more filters to expand the view. This process, and the detection of features, continues until the computer understands what it is looking at.

You can use as many filters as you need. You may want to blur, sharpen, darken, detect edges, and so on; all of these are filters.

Let's look at some code snippets to understand what a filter does.
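A sketch of such a snippet, assuming OpenCV's filter2D and a local image file (the file name is hypothetical):

import cv2
import numpy as np
import matplotlib.pyplot as plt

image = cv2.imread('dog.png')                # hypothetical image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# a standard 3x3 Sobel kernel for vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

filtered = cv2.filter2D(gray, -1, sobel_x)   # convolve the image with the kernel
plt.imshow(filtered, cmap='gray')
plt.show()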

This is how the image looks after applying the filter. In this case, we used a Sobel filter.

 

    The complete convolutional neural network (CNN)

We already know how filters extract features from an image, but to complete the convolutional neural network we need to understand the layers used to design a CNN. The layers in a convolutional neural network are called:

1. Convolutional layer

2. Pooling layer

3. Fully connected layer

Using these 3 layers, an image classifier like this can be constructed:

The role of CNN layers

Now let's take a look at what each layer is used for.

Convolutional layer - The convolutional layer (CONV) uses filters that perform convolution operations while scanning the input image. Its hyperparameters include the filter size, usually set to 2x2, 3x3, 4x4 or 5x5 (but not limited to these sizes), and the stride (S). The output (O) is called a feature map or activation map and contains all the features computed from the input and the filters. The figure below shows the feature map generated when a convolution is applied:

 

Convolution operation
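In PyTorch this is the nn.Conv2d layer. A tiny sketch (the shapes are illustrative) showing the feature map produced when a 3x3 filter scans a 28x28 input:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, stride=1)
x = torch.randn(1, 1, 28, 28)    # one 28x28 grayscale image
feature_map = conv(x)
print(feature_map.shape)         # torch.Size([1, 4, 26, 26]); (28 - 3) / 1 + 1 = 26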

Pooling layer - The pooling layer (POOL) is used to downsample features and is usually applied after a convolutional layer. The two common pooling operations are max pooling and average pooling, which take the maximum and the average of the features, respectively. The diagrams below illustrate the basic principle of pooling:

Max pooling

Average pooling
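A minimal sketch contrasting the two pooling operations in PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

print(nn.MaxPool2d(kernel_size=2, stride=2)(x))  # keeps the largest value in each 2x2 window
print(nn.AvgPool2d(kernel_size=2, stride=2)(x))  # averages each 2x2 window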

Fully connected layer - The fully connected layer (FC) operates on a flattened input, where each input is connected to every neuron. Fully connected layers are usually used towards the end of the network to connect the hidden layers to the output layer, which helps produce the class scores.

Fully connected layer
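A small sketch of a flattened feature map feeding a fully connected layer that produces class scores (the shapes are illustrative):

import torch
import torch.nn as nn

feature_map = torch.randn(1, 4, 13, 13)           # e.g. the output of conv + pooling
flat = feature_map.view(feature_map.size(0), -1)  # flatten to a vector of length 4*13*13
fc = nn.Linear(4 * 13 * 13, 10)                   # 10 output classes
scores = fc(flat)
print(scores.shape)                               # torch.Size([1, 10])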

 

    Visualizing a CNN in PyTorch

Now that we have a better understanding of how a CNN works, let's implement one using Facebook's PyTorch framework.

Step 1: Load the input image. We will use Numpy and OpenCV. (The code can be found on GitHub)
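A possible version of this step, assuming a local image file such as 'dog.png' (the file name is hypothetical):

import cv2
import numpy as np
import matplotlib.pyplot as plt

img_path = 'dog.png'
bgr_img = cv2.imread(img_path)
gray_img = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)

gray_img = gray_img.astype('float32') / 255  # rescale pixel values to 0-1
plt.imshow(gray_img, cmap='gray')
plt.show()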

  

  

Step 2: Visualize the filters to better understand the filters we will use. (The code can be found on GitHub)
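A sketch of this step: four simple edge filters built from a single 4x4 kernel (the exact values are an assumption; any edge-like kernel works):

import numpy as np

filter_vals = np.array([[-1, -1, 1, 1],
                        [-1, -1, 1, 1],
                        [-1, -1, 1, 1],
                        [-1, -1, 1, 1]])

# the kernel above, its negative, and their transposes
filters = np.array([filter_vals, -filter_vals, filter_vals.T, -filter_vals.T])
print('Filter 1:\n', filters[0])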

Step 3: Define the convolutional neural network. The CNN has a convolutional layer and a max pooling layer, and its weights are initialized with the filters above. (The code can be found on GitHub)
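A sketch of how such a network could be defined, assuming the filters array from the previous step:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, weight):
        super(Net, self).__init__()
        k_height, k_width = weight.shape[2:]
        # four filters, one grayscale input channel, no bias so the output is the raw convolution
        self.conv = nn.Conv2d(1, 4, kernel_size=(k_height, k_width), bias=False)
        self.conv.weight = torch.nn.Parameter(weight)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        conv_x = self.conv(x)           # convolutional layer output
        activated_x = F.relu(conv_x)    # activated output
        pooled_x = self.pool(activated_x)
        return conv_x, activated_x, pooled_x

weight = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor)
model = Net(weight)
print(model)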

Step 4: Visualize the filters. Take a quick look at the filters being used. (The code can be found on GitHub)

def viz_layer(layer, n_filters=4):
    fig = plt.figure(figsize=(20, 20))
    for i in range(n_filters):
        ax = fig.add_subplot(1, n_filters, i + 1)
        # show the output of each filter in grayscale
        ax.imshow(np.squeeze(layer[0, i].data.numpy()), cmap='gray')
        ax.set_title('Output %s' % str(i + 1))

fig = plt.figure(figsize=(12, 6))
fig.subplots_adjust(left=0, right=1.5, bottom=0.8, top=1, hspace=0.05, wspace=0.05)
for i in range(4):
    ax = fig.add_subplot(1, 4, i + 1, xticks=[], yticks=[])
    ax.imshow(filters[i], cmap='gray')
    ax.set_title('Filter %s' % str(i + 1))

gray_img_tensor = torch.from_numpy(gray_img).unsqueeze(0).unsqueeze(1)

The filters:

Step 5: Filter outputs across the layers. The images output by the CONV and POOL layers are shown below.
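Before visualizing, the grayscale image tensor is passed through the model defined in Step 3 (a sketch, assuming the variable names used above):

conv_layer, activated_layer, pooled_layer = model(gray_img_tensor)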

viz_layer(activated_layer)
viz_layer(pooled_layer)

Convolutional layer

Pooling layer

Reference: CS230 Convolutional Neural Networks cheatsheet (https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks).

You can view the code here: https://github.com/vihar/visualising-cnns

 
