Foreword
Convolutional neural networks (ConvNets or CNNs) are a type of neural network that has driven much of the progress in computer vision (CV). This article introduces the core of a CNN, the convolution operation: how it works and, from a beginner's perspective, how to implement it from scratch in NumPy.
1
Convolutional neural networks (ConvNets or CNNs) are often the first neural network that newcomers to artificial intelligence learn, and they are widely used in areas such as image recognition and classification. In addition to providing vision for robots and self-driving cars, ConvNets are also widely used to recognize faces, objects and traffic signs. The convolution operation is the core of a CNN, and its introduction accelerated the development of artificial intelligence.
The word convolution comes from mathematics, where it describes the accumulated (superposed) effect of one function over all shifted copies of another. In signal-processing terms it is essentially a filter: on a one-dimensional signal it filters the signal, and on a two-dimensional image it filters the image, suppressing the unimportant parts and extracting the main features. This article focuses on the convolution of images. Image convolution is not a recent invention: traditional image processing already uses it, for example in Sobel filtering for edge detection.
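As a small illustration of convolution acting as a filter on a one-dimensional signal, the sketch below smooths a short signal with a 3-tap moving average using NumPy's `np.convolve`. The signal values and the kernel are made up for illustration and do not come from the article.

```python
import numpy as np

# Hypothetical 1-D signal: a single spike followed by a step (illustrative values).
signal = np.array([0.0, 0.0, 9.0, 0.0, 0.0, 6.0, 6.0, 6.0])

# 3-tap moving-average kernel: each output is the mean of 3 neighbouring samples.
kernel = np.ones(3) / 3.0

# mode="valid" keeps only positions where the kernel fully overlaps the signal,
# so the output has len(signal) - len(kernel) + 1 samples.
smoothed = np.convolve(signal, kernel, mode="valid")

print(smoothed.shape)
print(smoothed)
```

The spike of 9 is spread out into three values of about 3, which is the "filtering" effect described above: sharp, isolated changes are suppressed while the overall trend is kept.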
2
To understand convolution, you must first understand its operating object: an image, which is essentially a matrix of pixels:
The grayscale image in the figure above has a single channel, while a typical color image has three channels (R, G and B). In each channel, every pixel value lies between 0 and 255, where 0 is black and 255 is white. Convolution (conv) is an operator applied to this pixel matrix. To make it easier to follow, we use the 5×5 matrix in the following figure as the image pixels:
Consider another 3×3 matrix as the convolution matrix:
Then the convolution of the 5×5 image with the 3×3 matrix can be computed:
Slide the convolution matrix over the image from left to right and from top to bottom, 1 pixel at a time (the step size is called the stride). At each position, multiply the convolution matrix element-wise with the image pixels it covers and add the products; the resulting number is a single element of the output matrix. The picture above shows a single-channel image. For a typical three-channel image, the in_channel of the kernel must also be 3. The calculation is shown in the figure below: each channel is convolved with its own slice of the kernel, and the per-channel results are then added position by position to obtain the final output feature map.
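The sliding-window procedure described above can be sketched in a few lines of NumPy. Because the figures are not reproduced here, the 5×5 image and 3×3 kernel values below are assumptions chosen for illustration, and the function names `conv2d_single` and `conv2d_multi` are hypothetical. As in CNN frameworks, the kernel is not flipped.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over a 2-D `image` (no padding) and sum the element-wise products."""
    h, w = image.shape
    k = kernel.shape[0]
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    out = np.zeros((h_out, w_out), dtype=image.dtype)
    for i in range(h_out):
        for j in range(w_out):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

def conv2d_multi(image, kernel, stride=1):
    """Convolve each channel with its own kernel slice, then add the per-channel results."""
    return sum(conv2d_single(image[c], kernel[c], stride) for c in range(image.shape[0]))

# Assumed example values (the original figures are not available).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = conv2d_single(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]

# Three-channel case: the kernel needs one 3x3 slice per input channel.
rgb = np.stack([image, image, image])        # shape (3, 5, 5)
rgb_kernel = np.stack([kernel] * 3)          # shape (3, 3, 3)
rgb_map = conv2d_multi(rgb, rgb_kernel)      # shape (3, 3)
print(rgb_map.shape)
```

With three identical channels and three identical kernel slices, `rgb_map` is simply three times `feature_map`, which makes the "add the per-channel results" step easy to check by hand.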
In a CNN, the 3×3 matrix is called a "filter", "kernel" or "feature detector", and the output matrix is called the "convolved feature" or "feature map". As the animation above shows, different kernels generate different feature maps from the same input image. As shown in the figure below, first take an original image as input:
Then convolve the image with different filter matrices to perform operations such as edge detection, sharpening and blurring, that is, to detect different features of the image such as edges and curves:
In an actual convolution layer, the feature maps are formed as follows:
The red and green boxes are two convolution kernels, which slide over the input image and produce two feature maps, as shown in the figure. Because of its size, a convolution kernel can only capture local dependencies in the image (although you could, of course, set the kernel size to the image size). In a real CNN, the values of these kernels are learned through training, which determines what features they extract from the image.
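To make the effect of different filter matrices concrete, here is a minimal sketch that applies three commonly used 3×3 kernels (a Sobel kernel for horizontal gradients, a sharpening kernel and a box blur) to a tiny synthetic image. The image and the helper name `apply_kernel` are assumptions for illustration; the article's own figures may use different kernels.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Stride-1, no-padding sliding-window filter over a 2-D image."""
    k = kernel.shape[0]
    h_out, w_out = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Hypothetical 5x5 test image: dark on the left, bright on the right (a vertical edge).
image = np.tile(np.array([0.0, 0.0, 10.0, 10.0, 10.0]), (5, 1))

kernels = {
    "sobel_x": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "box_blur": np.ones((3, 3)) / 9.0,
}

results = {name: apply_kernel(image, k) for name, k in kernels.items()}
for name, fmap in results.items():
    print(name)
    print(fmap)
```

On this image the Sobel kernel responds strongly where the window straddles the edge and returns 0 on the flat bright region, the sharpening kernel exaggerates the contrast on both sides of the edge, and the box blur turns the hard step into a gradual ramp.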
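The "two kernels give two feature maps" idea can be sketched as follows. This is a naive loop version under assumed shapes (a 3-channel 8×8 input and two random 3×3 kernels); the function name `conv_forward_naive` is hypothetical, and the random weights stand in for values that a real CNN would learn during training.

```python
import numpy as np

def conv_forward_naive(x, weights):
    """x: (cin, h, w); weights: (cout, cin, k, k); stride 1, no padding.

    Returns one feature map per kernel, shape (cout, h - k + 1, w - k + 1).
    """
    cout, cin, k, _ = weights.shape
    _, h, w = x.shape
    out = np.zeros((cout, h - k + 1, w - k + 1))
    for o in range(cout):                      # one pass per kernel
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                # Each output value only sees a local k x k patch of the input.
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * weights[o])
    return out

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))                      # assumed 3-channel input
weights = rng.standard_normal((2, 3, 3, 3))    # two kernels, like the red and green boxes
feature_maps = conv_forward_naive(x, weights)
print(feature_maps.shape)                      # (2, 6, 6)
```

Each kernel contributes one output channel, so the number of kernels is the number of feature maps produced by the layer.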
3
Frameworks such as torch and tensorflow already package the conv operator, so it is very convenient to use out of the box. To aid understanding, here we implement conv from scratch in numpy. Because the conv operator needs not only a forward computation but also a backward pass and a parameter update, we first create a Layers base class:
import numpy as np
import os


class Layers():
    def __init__(self, name):
        self.name = name

    def forward(self, x):
        pass

    def zero_grad(self):
        pass

    def backward(self, grad_out):
        pass

    def update(self, lr):
        pass
The conv operator inherits from the Layers class. Its forward and backward implementations are as follows:
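import numpy as np
from module import Layers  # assumes the Layers class above is saved as module.py


class Con2d(Layers):
    """
    Convolution forward pass:
    Input: input: [b, cin, h, w]
           weight: [cout, cin, ksize, ksize], stride, padding
    Steps:
    1. Flatten the weights to [cout, cin*ksize*ksize] with self.weights.reshape(cout, -1)
    2. Rearrange the input to [b*hout*wout, cin*ksize*ksize]:
       first compute hout and wout from hin, win, pad, ksize and stride: (h+2*pad-ksize)//stride + 1, so the output is (b, cout, hout, wout)
       then flatten each image patch: img (b, hout, wout, cin*ksize*ksize) -> (b*hout*wout, cin*ksize*ksize)
    3. Multiply the two with np.dot, then reshape: (cout, b*hout*wout) -> (b, cout, hout*wout) -> (b, cout, hout, wout)
    """
    """
    Convolution backward pass:
    Input: gradient of the loss with respect to the output, [b, cout, hout, wout]
    Steps:
    1. Rearrange the incoming gradient: [b, cout, hout, wout] -> [cout, b, hout, wout] -> [cout, b*hout*wout]
    2. Multiply it by the stored image columns: (cout, b*hout*wout) * (b*hout*wout, cin*ksize*ksize) -> (cout, cin*ksize*ksize), which gives the weight gradient
    3. Multiply the weights by the padded output gradient to get the gradient with respect to the input
    """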
    def __init__(self, name, in_channel, out_channel, kernel_size, padding, stride=1):
        super(Con2d, self).__init__(name)
        self.in_channel = in_channel
        self.out_channel = out_channel
        self.ksize = kernel_size
        self.padding = padding
        self.stride = stride
        self.weights = np.random.standard_normal((out_channel, in_channel, kernel_size, kernel_size))
        self.bias = np.zeros(out_channel)
        self.grad_w = np.zeros(self.weights.shape)
        self.grad_b = np.zeros(self.bias.shape)

    def img2col(self, x, ksize, stride):
        # Shape comments use an example: batch 5, 3 channels, ksize 4, padded input 34x34.
        b, c, h, w = x.shape  # (5, 3, 34, 34)
        img_col = []
        for n in range(b):  # 5
            for i in range(0, h - ksize + 1, stride):
                for j in range(0, w - ksize + 1, stride):
                    # One (3, 4, 4) patch flattened to a row of 48 values
                    col = x[n, :, i:i + ksize, j:j + ksize].reshape(-1)
                    img_col.append(col)
        return np.array(img_col)  # (5*31*31, 48) = (4805, 48)
    def forward(self, x):
        self.x = x  # unpadded input, e.g. (5, 3, 32, 32); backward uses its shape
        weights = self.weights.reshape(self.out_channel, -1)  # (12, 3*4*4) = (12, 48)
        x = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding), (self.padding, self.padding)), "constant")  # (5, 3, 34, 34)
        b, c, h, w = x.shape
        self.out = np.zeros((b, self.out_channel, (h - self.ksize) // self.stride + 1, (w - self.ksize) // self.stride + 1))  # (5, 12, 31, 31)
        self.img_col = self.img2col(x, self.ksize, self.stride)  # (5*31*31, 48) = (4805, 48)
        # (12, 48) x (48, 4805) -> (12, 4805) -> (12, 5, 961) -> (5, 12, 961)
        out = np.dot(weights, self.img_col.T).reshape(self.out_channel, b, -1).transpose(1, 0, 2)
        out = out + self.bias[None, :, None]  # add the per-channel bias
        self.out = np.reshape(out, self.out.shape)
        return self.out
    def backward(self, grad_out):
        # grad_out: gradient of the loss w.r.t. the output, shape (b, cout, hout, wout)
        b, c, h, w = self.out.shape
        grad_out_ = grad_out.transpose(1, 0, 2, 3)  # (cout, b, hout, wout)
        grad_out_flat = np.reshape(grad_out_, [self.out_channel, -1])  # (cout, b*hout*wout)
        # (cout, b*hout*wout) x (b*hout*wout, cin*ksize*ksize) -> weight gradient
        self.grad_w = np.dot(grad_out_flat, self.img_col).reshape(c, self.in_channel, self.ksize, self.ksize)
        self.grad_b = np.sum(grad_out_flat, axis=1)  # (cout,) bias gradient
        # Gradient w.r.t. the input (valid for stride 1 only): slide the kernels,
        # rotated by 180 degrees, over the zero-padded output gradient.
        tmp = self.ksize - self.padding - 1
        grad_out_pad = np.pad(grad_out, ((0, 0), (0, 0), (tmp, tmp), (tmp, tmp)), 'constant')
        weights = self.weights[:, :, ::-1, ::-1].transpose(1, 0, 2, 3).reshape([self.in_channel, -1])  # (cin, cout*ksize*ksize)
        col_grad = self.img2col(grad_out_pad, self.ksize, 1)  # (b*h*w, cout*ksize*ksize)
        next_eta = np.dot(weights, col_grad.T).reshape(self.in_channel, b, -1).transpose(1, 0, 2)
        next_eta = np.reshape(next_eta, self.x.shape)
        return next_eta

    def zero_grad(self):
        self.grad_w = np.zeros_like(self.grad_w)
        self.grad_b = np.zeros_like(self.grad_b)

    def update(self, lr=1e-3):
        self.weights -= lr * self.grad_w
        self.bias -= lr * self.grad_b
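if __name__ == '__main__':
    x = np.ones([2, 3, 32, 32])
    conv = Con2d('conv1', 3, 12, 3, 1, 1)
    lr = 1e-4
    for i in range(100):
        y = conv.forward(x)
        loss = abs(y - 1)  # element-wise L1 loss against a target of 1
        grad_out = np.sign(y - 1)  # gradient of sum(|y - 1|) with respect to y
        conv.backward(grad_out)  # fills conv.grad_w and conv.grad_b
        conv.update(lr)
        print(np.sum(loss))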
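One way to sanity-check a convolution backward pass is a numerical gradient check. The sketch below is a compact, self-contained re-implementation of the same img2col idea, not the article's class: it supports stride 1 only, and all function names are hypothetical. It relies on the standard result that, for stride 1, the input gradient is obtained by sliding the 180-degree-rotated kernels over the zero-padded output gradient.

```python
import numpy as np

def pad_hw(x, p):
    """Zero-pad the last two (height, width) axes by p on every side."""
    return np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), "constant")

def im2col(x, ksize):
    """x: (b, c, h, w) -> (b*h_out*w_out, c*ksize*ksize), stride 1."""
    b, c, h, w = x.shape
    return np.array([x[n, :, i:i + ksize, j:j + ksize].reshape(-1)
                     for n in range(b)
                     for i in range(h - ksize + 1)
                     for j in range(w - ksize + 1)])

def conv_forward(x, weights, padding):
    """x: (b, cin, h, w), weights: (cout, cin, k, k) -> (output, image columns)."""
    cout, _, k, _ = weights.shape
    xp = pad_hw(x, padding)
    b, _, h, w = xp.shape
    cols = im2col(xp, k)                                   # (b*h_out*w_out, cin*k*k)
    out = weights.reshape(cout, -1) @ cols.T               # (cout, b*h_out*w_out)
    out = out.reshape(cout, b, h - k + 1, w - k + 1).transpose(1, 0, 2, 3)
    return out, cols

def conv_backward(grad_out, weights, cols, padding, x_shape):
    """Return (grad_w, grad_x) for a stride-1 convolution."""
    cout, cin, k, _ = weights.shape
    g = grad_out.transpose(1, 0, 2, 3).reshape(cout, -1)   # (cout, b*h_out*w_out)
    grad_w = (g @ cols).reshape(weights.shape)
    # Rotate each kernel by 180 degrees, then swap the in/out channel axes.
    w_rot = weights[:, :, ::-1, ::-1].transpose(1, 0, 2, 3).reshape(cin, -1)
    grad_cols = im2col(pad_hw(grad_out, k - padding - 1), k)
    grad_x = (w_rot @ grad_cols.T).reshape(cin, x_shape[0], -1)
    return grad_w, grad_x.transpose(1, 0, 2).reshape(x_shape)

def numerical_grad(loss_fn, arr, eps=1e-6):
    """Central-difference gradient of loss_fn() with respect to every entry of arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        hi = loss_fn()
        arr[idx] = old - eps
        lo = loss_fn()
        arr[idx] = old
        grad[idx] = (hi - lo) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
weights = rng.standard_normal((4, 3, 3, 3))
padding = 1
out, cols = conv_forward(x, weights, padding)
upstream = rng.standard_normal(out.shape)      # stand-in for dLoss/dOut

def loss_fn():
    return np.sum(conv_forward(x, weights, padding)[0] * upstream)

grad_w, grad_x = conv_backward(upstream, weights, cols, padding, x.shape)
w_ok = np.allclose(grad_w, numerical_grad(loss_fn, weights), atol=1e-5)
x_ok = np.allclose(grad_x, numerical_grad(loss_fn, x), atol=1e-5)
print(w_ok, x_ok)
```

If the analytic gradients are right, both checks agree with the finite-difference estimates. Dropping the kernel rotation is a quick way to see the input-gradient check fail.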
The implementation shows that a large share of the time in a convolution is spent in the img2col function, which converts the image into a matrix. This is also one reason why, in model lightweighting, MobileNet uses depthwise separable convolution plus 1x1 convolution: a 1x1 convolution does not need img2col, which makes it easy to deploy on chips and speeds up convolution.
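To see why a 1x1 convolution needs no img2col step, note that every output pixel depends only on the input pixel at the same position, so the whole layer reduces to one matrix multiplication over the channel axis. The sketch below assumes stride 1 and no padding; the function name `conv1x1` and the shapes are illustrative.

```python
import numpy as np

def conv1x1(x, weights):
    """x: (b, cin, h, w); weights: (cout, cin, 1, 1) -> (b, cout, h, w)."""
    b, cin, h, w = x.shape
    cout = weights.shape[0]
    w2d = weights.reshape(cout, cin)                  # (cout, cin)
    x2d = x.transpose(1, 0, 2, 3).reshape(cin, -1)    # (cin, b*h*w): a plain reshape, no patch extraction
    out = w2d @ x2d                                   # (cout, b*h*w)
    return out.reshape(cout, b, h, w).transpose(1, 0, 2, 3)

rng = np.random.default_rng(0)
x = rng.random((2, 3, 4, 4))
weights = rng.standard_normal((5, 3, 1, 1))
y = conv1x1(x, weights)

# Reference: per-pixel weighted sum over the input channels.
y_ref = np.einsum("oc,bchw->bohw", weights[:, :, 0, 0], x)
print(y.shape, np.allclose(y, y_ref))   # (2, 5, 4, 4) True
```

The input is only transposed and reshaped, never copied into overlapping patches, which is the memory and time cost that img2col adds for larger kernels.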
For more beginner-oriented articles, follow the official account [ The Invincible Zhang Dadao ].