Getting started with deep learning: convolution and a NumPy implementation

Foreword

Convolutional neural networks (ConvNets or CNNs) are a type of neural network that has driven much of the progress in computer vision (CV). This article introduces the core of a CNN, the convolution operation: how it works and, from a beginner's perspective, how to implement it from scratch in NumPy.

1

Convolutional neural networks (ConvNets or CNNs) are often the first neural network people meet when getting into artificial intelligence, and they are widely used in areas such as image recognition and classification. Besides providing vision for robots and self-driving cars, ConvNets are widely used to recognize faces, objects and traffic signs. The convolution operation is the core of a CNN, and its adoption has accelerated the development of artificial intelligence.

The term convolution comes from mathematics, where it describes the accumulated (superposed) effect of one function as it is shifted across another. In practice it acts as a filter: on a one-dimensional signal it filters the signal, and on a two- or three-dimensional image it filters the image, suppressing the unimportant parts and extracting the main features. This article focuses on the convolution of images. Image convolution is not a recent invention: traditional image processing already uses it, for example in Sobel filtering for edge detection.
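To make the filtering idea concrete, here is a minimal sketch on a one-dimensional signal using NumPy's `np.convolve`. The signal and the two kernels are made-up illustration values, not taken from the article: a moving-average kernel smooths a step, while a difference kernel responds only where the signal changes.

```python
import numpy as np

# A 1-D signal with a single sharp step from 0 to 1 (illustrative values).
signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Moving-average filter: each output is the mean of 3 neighbouring samples.
smooth_kernel = np.ones(3) / 3
smoothed = np.convolve(signal, smooth_kernel, mode="valid")

# Difference filter: output[n] = signal[n] - signal[n-1], non-zero only at the step.
edge_kernel = np.array([1.0, -1.0])
edges = np.convolve(signal, edge_kernel, mode="valid")

print(smoothed)  # the step is spread over several samples
print(edges)     # a single non-zero value where the step occurs
```

The difference kernel plays the same role in one dimension that edge filters such as Sobel play on images.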

2

To understand convolution, first understand what it operates on: an image, which is essentially a matrix of pixels:
[Figure: a grayscale image shown as a matrix of pixel values]

The grayscale image in the figure above has a single channel, while a typical color image has three channels (R, G and B). In each channel, every pixel value lies between 0 and 255; in a grayscale image 0 is black and 255 is white. Convolution (conv) is an operator applied to this matrix. To make it easier to follow, we use the 5×5 matrix in the following figure as the image:
[Figure: a 5×5 matrix of pixel values]
Consider another 3×3 matrix as the convolution matrix. The convolution of the 5×5 image with this 3×3 matrix is then computed as follows.

Slide the convolution matrix over the image from left to right and from top to bottom, 1 pixel at a time (the step size is called the stride). At each position, multiply the convolution matrix elementwise with the image pixels it covers and add the products; the resulting number becomes a single element of the output matrix. With a 5×5 image, a 3×3 matrix and a stride of 1, the output is 3×3. The example above is a single-channel image. For a typical three-channel image, the in_channel of the kernel must also be 3: each channel is convolved with its own slice of the kernel, and the per-channel results are then added position by position to obtain the final output feature map, as shown in the figure below.
[Figure: multi-channel convolution, with per-channel results summed into one feature map]
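The sliding-window procedure described above can be sketched in a few lines of NumPy. The function names `conv2d_single` and `conv2d_multi`, and the values of the 5×5 image and 3×3 kernel, are illustrative assumptions; the loop follows the description: multiply elementwise, sum, and step by the stride.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over a 2-D `image` (no padding) and sum the elementwise products."""
    h, w = image.shape
    k = kernel.shape[0]
    hout = (h - k) // stride + 1  # number of vertical window positions
    wout = (w - k) // stride + 1  # number of horizontal window positions
    out = np.zeros((hout, wout), dtype=image.dtype)
    for i in range(hout):
        for j in range(wout):
            window = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(window * kernel)
    return out

def conv2d_multi(image, kernel, stride=1):
    """image: [c, h, w], kernel: [c, k, k]; the per-channel results are added together."""
    return sum(conv2d_single(image[c], kernel[c], stride) for c in range(image.shape[0]))

# Illustrative 5x5 single-channel image and 3x3 kernel (assumed values).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

out = conv2d_single(image, kernel)
print(out)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]

# Three identical channels: the three per-channel maps are summed into one output.
out_rgb = conv2d_multi(np.stack([image] * 3), np.stack([kernel] * 3))
print(out_rgb.shape)  # (3, 3)
```

Note that, like most deep learning code, this sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.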

In a CNN, the 3×3 matrix is called a "filter", "kernel" or "feature detector", and the output matrix is called a "convolved feature" or "feature map". As the animation above shows, different kernels produce different feature maps for the same input image. For example, start from an original image:
[Figure: original input image]
Choosing different filter matrices for the convolution gives operations such as edge detection, sharpening and blurring, each of which detects a different feature of the image, such as edges or curves:
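As a rough illustration of how the choice of filter matrix changes the result, the sketch below applies three commonly used 3×3 kernels to a tiny synthetic image. The helper `filter2d`, the test image and the dictionary keys are assumptions made for this example, not code from the article.

```python
import numpy as np

def filter2d(image, kernel):
    """Slide a square kernel over a 2-D image (stride 1, no padding, no kernel flip)."""
    k = kernel.shape[0]
    hout, wout = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((hout, wout))
    for i in range(hout):
        for j in range(wout):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

kernels = {
    # Horizontal intensity change (vertical edges).
    "edge (Sobel x)": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    # Boosts the centre pixel relative to its 4 neighbours.
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    # Mean of the 3x3 neighbourhood.
    "box blur": np.ones((3, 3)) / 9.0,
}

# Synthetic 5x6 image: dark left half (0), bright right half (255).
image = np.zeros((5, 6))
image[:, 3:] = 255

for name, kernel in kernels.items():
    print(name)
    print(filter2d(image, kernel))
```

On this image the edge kernel is non-zero only around the dark/bright boundary, the blur kernel turns the sharp step into a gradual ramp, and the sharpen kernel exaggerates the step.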

In an actual convolution layer, the feature maps are formed as follows:
[Figure: two convolution kernels (red and green boxes) sliding over the input image and producing two feature maps]

The red and green boxes are two convolution kernels; each slides over the input image and produces its own feature map, as shown in the figure. Because of its size, a kernel can only capture local dependencies in the image (unless, of course, you make the kernel as large as the image). In a real CNN the kernel values are not set by hand: they are learned during training, which determines what features each kernel extracts from the image.

3

Frameworks such as torch and tensorflow already ship a conv operator that works out of the box. To understand it better, here we implement conv from scratch in numpy. Because the conv operator needs not only a forward computation but also a backward pass and a parameter update, we first create a Layers base class:

import numpy as np


class Layers():
    """Base class: every layer has a forward pass, a backward pass and an update step."""

    def __init__(self, name):
        self.name = name

    def forward(self, x):
        pass

    def zero_grad(self):
        pass

    def backward(self, grad_out):
        pass

    def update(self, lr):
        pass

The conv operator inherits from the Layers class; its forward and backward passes are implemented as follows:

import numpy as np
from module import Layers  # assumes the Layers class above is saved as module.py


class Con2d(Layers):
    """
    Forward pass:
      input:  x [b, cin, h, w]
      weight: [cout, cin, ksize, ksize], plus stride and padding
      steps:
        1. Flatten the weights to [cout, cin*ksize*ksize] with reshape(cout, -1).
        2. Rearrange the input into [b*hout*wout, cin*ksize*ksize] (img2col):
           compute hout and wout from h, w, pad, ksize and stride,
           hout = (h + 2*pad - ksize)//stride + 1, then flatten every window:
           (b, hout, wout, cin*ksize*ksize) -> (b*hout*wout, cin*ksize*ksize).
        3. Multiply the two with np.dot, then reshape:
           (cout, b*hout*wout) -> (b, cout, hout*wout) -> (b, cout, hout, wout).

    Backward pass:
      input: grad_out [b, cout, hout, wout], gradient of the loss w.r.t. the output
      steps:
        1. Rearrange: [b, cout, hout, wout] -> [cout, b, hout, wout] -> [cout, b*hout*wout].
        2. Multiply by the saved img2col matrix:
           (cout, b*hout*wout) x (b*hout*wout, cin*ksize*ksize) -> (cout, cin*ksize*ksize),
           which is the gradient of the weights.
        3. Pad grad_out and multiply it with the 180-degree rotated weights
           to get the gradient w.r.t. the input (stride 1 only).
    """

    def __init__(self, name, in_channel, out_channel, kernel_size, padding, stride=1):
        super().__init__(name)
        self.in_channel = in_channel
        self.out_channel = out_channel
        self.ksize = kernel_size
        self.padding = padding
        self.stride = stride

        self.weights = np.random.standard_normal((out_channel, in_channel, kernel_size, kernel_size))
        self.bias = np.zeros(out_channel)
        self.grad_w = np.zeros(self.weights.shape)
        self.grad_b = np.zeros(self.bias.shape)

    def img2col(self, x, ksize, stride):
        b, c, h, w = x.shape  # e.g. (5, 3, 34, 34)
        img_col = []
        for n in range(b):
            for i in range(0, h - ksize + 1, stride):
                for j in range(0, w - ksize + 1, stride):
                    # one window, flattened: (c, ksize, ksize) -> (c*ksize*ksize,)
                    col = x[n, :, i:i + ksize, j:j + ksize].reshape(-1)
                    img_col.append(col)
        # (b*hout*wout, c*ksize*ksize), e.g. (4805, 48) for ksize=4
        return np.array(img_col)

    def forward(self, x):
        self.x = x
        weights = self.weights.reshape(self.out_channel, -1)  # (cout, cin*k*k), e.g. (12, 48)
        p = self.padding
        x = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), "constant")
        b, c, h, w = x.shape  # padded shape, e.g. (5, 3, 34, 34)
        hout = (h - self.ksize) // self.stride + 1
        wout = (w - self.ksize) // self.stride + 1
        self.img_col = self.img2col(x, self.ksize, self.stride)  # (b*hout*wout, cin*k*k)
        out = np.dot(weights, self.img_col.T)  # (cout, b*hout*wout), e.g. (12, 4805)
        out = out.reshape(self.out_channel, b, -1).transpose(1, 0, 2)  # (b, cout, hout*wout)
        # restore the spatial layout and add the per-channel bias
        self.out = out.reshape(b, self.out_channel, hout, wout) + self.bias.reshape(1, -1, 1, 1)
        return self.out

    def backward(self, grad_out):
        b, c, h, w = self.out.shape
        grad_out_ = grad_out.transpose(1, 0, 2, 3)
        grad_out_flat = np.reshape(grad_out_, [self.out_channel, -1])  # (cout, b*hout*wout)
        # gradient of the weights: (cout, cin*k*k) -> (cout, cin, k, k)
        self.grad_w = np.dot(grad_out_flat, self.img_col).reshape(c, self.in_channel, self.ksize, self.ksize)
        # gradient of the bias: (cout,)
        self.grad_b = np.sum(grad_out_flat, axis=1)
        # gradient of the input; valid for stride == 1 and padding <= ksize - 1
        tmp = self.ksize - self.padding - 1
        grad_out_pad = np.pad(grad_out, ((0, 0), (0, 0), (tmp, tmp), (tmp, tmp)), 'constant')
        # rotate every kernel by 180 degrees and swap cin/cout: (cin, cout*k*k)
        weights = self.weights[:, :, ::-1, ::-1].transpose(1, 0, 2, 3).reshape([self.in_channel, -1])
        col_grad = self.img2col(grad_out_pad, self.ksize, 1)  # (b*h*w, cout*k*k)
        next_eta = np.dot(weights, col_grad.T).reshape(self.in_channel, b, -1).transpose(1, 0, 2)
        next_eta = np.reshape(next_eta, self.x.shape)
        return next_eta

    def zero_grad(self):
        self.grad_w = np.zeros_like(self.grad_w)
        self.grad_b = np.zeros_like(self.grad_b)

    def update(self, lr=1e-3):
        self.weights -= lr * self.grad_w
        self.bias -= lr * self.grad_b


if __name__ == '__main__':
    x = np.ones([2, 3, 32, 32])
    conv = Con2d('conv1', 3, 12, 3, 1, 1)
    # gradients are summed over every output position, so keep the step small
    lr = 1e-6
    for i in range(100):
        y = conv.forward(x)
        loss = np.abs(y - 1)   # L1 distance to a target of all ones
        grad = np.sign(y - 1)  # derivative of |y - 1| with respect to y
        conv.backward(grad)    # the returned input gradient is not needed here
        conv.update(lr)
        print(np.sum(loss))
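
The triple loop in img2col above is easy to read but slow. As a hedged sketch of one possible speed-up, the same matrix can be built without Python loops using `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later). The function names `img2col_loop` and `img2col_strided` are hypothetical helpers written for this comparison, not part of the class above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def img2col_loop(x, ksize, stride):
    """Reference version: one flattened window per row, built with explicit loops."""
    b, c, h, w = x.shape
    cols = []
    for n in range(b):
        for i in range(0, h - ksize + 1, stride):
            for j in range(0, w - ksize + 1, stride):
                cols.append(x[n, :, i:i + ksize, j:j + ksize].reshape(-1))
    return np.array(cols)

def img2col_strided(x, ksize, stride):
    """Loop-free version based on sliding_window_view."""
    b, c, h, w = x.shape
    # All stride-1 windows: (b, c, h-k+1, w-k+1, k, k); keep every `stride`-th position.
    windows = sliding_window_view(x, (ksize, ksize), axis=(2, 3))[:, :, ::stride, ::stride]
    # Reorder to (b, hout, wout, c, k, k) so that each row is one flattened window.
    return windows.transpose(0, 2, 3, 1, 4, 5).reshape(-1, c * ksize * ksize)

x = np.random.default_rng(0).standard_normal((2, 3, 8, 8))
same = np.allclose(img2col_loop(x, 3, 2), img2col_strided(x, 3, 2))
print(same)
```

The final reshape still copies the data, so this removes the Python-level loop overhead but not the memory cost of materialising every window.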

The implementation shows that a large share of the time in a convolution is spent in the img2col function, which rearranges the image into a matrix. This is one reason why later lightweight models such as MobileNet use depthwise separable convolution followed by 1x1 convolution: a 1x1 convolution needs no img2col, which makes it easy to deploy on chips and speeds up the convolution.
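To see why a 1x1 convolution needs no img2col, note that with a 1x1 kernel, stride 1 and no padding, every output pixel depends only on the input pixel at the same position, so the whole layer is a matrix multiply over the channel axis. The sketch below checks this on random data; the shapes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))  # [b, cin, h, w]
w = rng.standard_normal((5, 3, 1, 1))  # [cout, cin, 1, 1]

# 1x1 convolution as a per-pixel matrix multiply over channels, no window extraction.
w_mat = w.reshape(5, 3)                          # (cout, cin)
y_matmul = np.einsum("oc,bchw->bohw", w_mat, x)  # (b, cout, h, w)

# Reference: the convolution definition written out channel by channel.
y_ref = np.zeros((2, 5, 4, 4))
for o in range(5):
    for c in range(3):
        y_ref[:, o] += w[o, c, 0, 0] * x[:, c]

print(y_matmul.shape)
print(np.allclose(y_matmul, y_ref))
```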

For more beginner-oriented articles, follow the official account [ The Invincible Zhang Dadao ].
