Deep Learning|Convolutional Neural Network

1. Introduction to Convolutional Neural Networks

Convolutional Neural Network (CNN) is a deep learning neural network structure, mainly used in image recognition, computer vision and related fields. This structure performs well on high-dimensional data such as images because of its shared weights and local perception. On the one hand, sharing weights reduces their number and makes the network easier to optimize; on the other hand, it lowers the complexity of the model and thereby reduces the risk of overfitting.

The convolutional neural network mainly consists of convolutional layers, pooling layers, fully connected layers and activation functions. The convolutional layer is the core part of a CNN: it extracts features from the input image through convolution operations and passes these features to the next layer as input. The pooling layer performs downsampling, which reduces the size of the feature map output by the convolutional layer, thereby reducing the number of network parameters and the amount of computation. The fully connected layer combines the outputs of the convolutional and pooling layers for final classification and prediction.

A convolutional neural network is mainly trained with the back-propagation algorithm to update the weights in the network (similar to a BP neural network), so that the network gradually learns the characteristics of the input data and achieves better performance on the final classification or prediction task.

At present, convolutional neural networks have achieved outstanding results in image classification, object recognition, face recognition, natural language processing and other fields, and are an important part of the modern deep learning field.

2. Convolution layer

In mathematics, the "convolution" between two functions (say f and g) is defined as:

\left ( f*g \right )(x)=\int f(z)g(x-z)dz

That is, convolution "flips" a function and shifts x, measuring the overlap between and g  . When it is a discrete object, integration becomes summation. For example, for a vector drawn from a set of square-summable, infinite-dimensional vectors with index Z, we get the following definition:

(f*g)(i)=\sum_af(a)g(i-a)

For a two-dimensional tensor, it is the corresponding sum over the index (a, b) of f and the index (i-a, j-b) of g:

(f*g)(i,j)=\sum _a\sum_bf(a,b)g(i-a,j-b)

What needs to be explained here is that in convolutional neural networks, the convolutional layer actually implements the cross-correlation operation, which differs from the definition of convolution in mathematical analysis; cross-correlation is used in place of convolution. Cross-correlation measures the correlation between two sequences and is usually implemented as a dot product over a sliding window, whereas true convolution additionally requires flipping the filter.
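To make the distinction concrete, here is a minimal sketch using PyTorch's functional API (whose conv2d in fact computes cross-correlation): flipping the kernel along both spatial dimensions before the operation yields the mathematical convolution. The 3x3 input and 2x2 kernel values below are just an illustration.

import torch
import torch.nn.functional as F

X = torch.arange(9.0).reshape(1, 1, 3, 3)                      # a small 3x3 input
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)  # a 2x2 kernel

# PyTorch's conv2d computes cross-correlation (no kernel flip)
print(F.conv2d(X, K))
# Flipping the kernel in both spatial dimensions gives the mathematical convolution
print(F.conv2d(X, torch.flip(K, dims=[2, 3])))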

2.1 One-dimensional convolution

One-dimensional convolution is often used in signal processing to compute the delayed accumulation of a signal. Assume that a signal generator produces a signal x_t at each time t, and that its information decays at rate w_k, i.e. after k-1 time steps the information is w_k times its original value. The signal y_t received at time t is then the superposition of the information generated at the current time and the delayed information from previous times, that is:

y_t=w_1x_t+w_2x_{t-1}+w_3x_{t-2}=\sum_{k=1}^3w_kx_{t-k+1}

Here, w_k is called the filter or convolution kernel.
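A minimal numerical sketch of this delayed accumulation (the signal values and decay weights below are made up for illustration):

import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical signal x_1 ... x_5
w = torch.tensor([1.0, 0.5, 0.25])            # hypothetical decay weights w_1, w_2, w_3

# y_t = w_1*x_t + w_2*x_{t-1} + w_3*x_{t-2}, computed for every t >= 3
y = torch.stack([sum(w[k] * x[t - k] for k in range(len(w)))
                 for t in range(len(w) - 1, len(x))])
print(y)  # tensor([4.2500, 6.0000, 7.7500])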

2.2 Two-dimensional convolution 

In image processing, the image is fed into the neural network as a two-dimensional matrix. Given an image X\in R^{M\times N} and a filter W\in R^{U\times V}, with U\ll M and V\ll N, their convolution is:

y_{ij}=\sum_{u=1}^U\sum_{v=1}^Vw_{uv}x_{i-u+1,j-v+1}

The specific calculation process is shown in the figure below:

Usually we write the two-dimensional convolution of input X and filter W as Y=W*X. If the height and width of the convolution kernel are k_h and k_w respectively, it is called a k_h\times k_w convolution. For example, a 3\times 5 convolution means that the height of the convolution kernel is 3 and the width is 5. In a convolutional neural network, in addition to the convolution process described above, a convolution operator also adds a bias term.

When the convolution kernel size is greater than 1, the output feature map after one convolution is smaller than the input image. The output feature map size is calculated as:

\left\{\begin{matrix} H_{out}=H-k_h+1\\ W_{out}=W-k_w+1 \end{matrix}\right.
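This can be checked quickly with PyTorch (a minimal sketch; the 8x8 input size is arbitrary): an 8x8 input convolved with a 3x3 kernel and no padding gives a 6x6 output, i.e. 8 - 3 + 1 = 6.

import torch
from torch import nn

conv2d = nn.Conv2d(1, 1, kernel_size=3)   # no padding
X = torch.rand(size=(1, 1, 8, 8))         # batch size 1, 1 channel, 8x8 image
print(conv2d(X).shape)                    # torch.Size([1, 1, 6, 6])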

2.2.1 Padding

After multiple convolutions, the output image keeps shrinking. To prevent the image from becoming smaller after convolution, padding is usually added around the border of the image, as shown in the following figure:

If, along the height of the image, p_{h1} rows are padded before the first row and p_{h2} rows are padded after the last row, and along the width, p_{w1} columns are padded before the first column and p_{w2} columns are padded after the last column, then after a k_h\times k_w convolution kernel is applied to the padded image, the size of the output image is:

\left\{\begin{matrix} H_{out}=H+p_{h1}+p_{h2}-k_h+1\\ W_{out}=W+p_{w1}+p_{w2}-k_w+1 \end{matrix}\right.

In convolution calculations, equal padding is usually used on both sides along the height and along the width, which requires:

p_{h1}=p_{h2}=p_h, p_{w1}=p_{w2}=p_w

Then the output size becomes:

\left\{\begin{matrix} H_{out}=H+2p_h-k_h+1\\ W_{out}=W+2p_w-k_w+1 \end{matrix}\right.

The convolution kernel size is usually an odd number such as 1, 3, 5 or 7. If the padding size is chosen as:

p_h=\frac{k_h-1}{2}, p_w=\frac{k_w-1}{2}

Then the image size remains unchanged after convolution. For example, when the convolution kernel size is 5 and the padding size is 2, the image size will not change after convolution.

To understand this better, suppose we create a 2D convolutional layer whose kernel has a height and width of 3, and pad the input with one pixel on all sides. Given an input with a height and width of 8, the output will also have a height and width of 8:

import torch
from torch import nn
# For convenience, we define a function that computes the output of a convolutional layer.
# It initializes the convolutional layer weights and adds / removes the corresponding
# dimensions on the input and output.
def comp_conv2d(conv2d, X):
    # Here (1, 1) means that the batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Drop the first two dimensions: batch size and channels
    return Y.reshape(Y.shape[2:])

# Note that one row or column is padded on each side, so two rows or columns are added in total
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape

## Result
torch.Size([8, 8])

When the height and width of the convolution kernel differ, we can pad different amounts in the height and width directions so that the output and input have the same height and width. If a convolution kernel with a height of 5 and a width of 3 is used, the padding on both sides of the height and width is 2 and 1 respectively:

conv2d = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape

## Result
torch.Size([8, 8]) 

2.2.2 Stride

The stride will also affect the size of the output image. The following figure is a convolution process with a stride of 2. When the convolution kernel moves on the image, each movement is 2 pixels:

When the strides in the height and width directions are s_h and s_w respectively, the output feature map size is:

\left\{\begin{matrix} H_{out}=\frac{H+2p_h-k_h}{s_h}+1\\ \\ W_{out}=\frac{W+2p_w-k_w}{s_w}+1 \end{matrix}\right.
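When the division does not come out even, the result is rounded down. A quick shape check of this formula (a minimal sketch; the 8x8 input is arbitrary):

import torch
from torch import nn

# H_out = (8 + 2*1 - 3) // 2 + 1 = 4, and likewise for the width
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
X = torch.rand(size=(1, 1, 8, 8))
print(conv2d(X).shape)  # torch.Size([1, 1, 4, 4])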

2.2.3 Code implementation

We can use the torch library to implement the above process. Given the following input data and convolution kernel, the output can be computed as:

import torch
from torch import nn

def corr2d(X, K): #@save 
    """计算二维互相关运算"""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

Passing the input tensor X and the convolution kernel tensor K from the figure above to this function gives the output tensor shown in the figure:

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)

## Output
tensor([[19., 25.],
        [37., 43.]])
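As mentioned in Section 2.2, a convolution operator in a network also adds a bias term. A minimal layer sketch built on top of the corr2d function above (the class name Conv2D is our own, not part of torch) could look like:

import torch
from torch import nn

class Conv2D(nn.Module):
    """A 2D convolution layer with a learnable kernel and bias, built on corr2d above"""
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias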

2.2.4 Receptive field

In two-dimensional convolution, as the number of convolutional layers increases, each point on the output feature map represents more and more information. With one convolutional layer, the value of each point on the output feature map is obtained by multiplying the elements in a k_h\times k_w region of the input image with the corresponding elements of the convolution kernel and summing, so a change in the value of any element in this region of the input image affects the pixel value of the output point. We call this region the receptive field of the corresponding point on the output feature map. Taking a 3\times 3 convolution as an example, the corresponding receptive field is a 3\times 3 region:

When the number of convolutional layers is 2, the receptive field grows to 5\times 5, as shown in the following figure:
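The growth of the receptive field with depth can also be computed directly. A small sketch (assuming a stride of 1 everywhere unless given otherwise):

def receptive_field(kernel_sizes, strides=None):
    """Receptive field of one output point with respect to the input, for stacked convolutions"""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input-pixel steps
        jump *= s              # stride multiplies the step between neighbouring outputs
    return rf

print(receptive_field([3]))     # 3: one 3x3 layer sees a 3x3 region
print(receptive_field([3, 3]))  # 5: two stacked 3x3 layers see a 5x5 region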

2.2.5 Connection

Each neuron in a convolutional layer is connected only to a local region of neurons in the previous layer, forming a locally connected network. By using convolutional layers instead of fully connected layers, the number of connections between layers can be greatly reduced, as shown in the figure (one-dimensional and two-dimensional cases):

2.2.6 Channels

The images we are familiar with generally contain three channels (three colors), called RGB. In fact, an image is not a two-dimensional tensor but a three-dimensional tensor composed of height, width and color, for example 1024\times 1024\times 3 pixels. The first two axes correspond to the spatial position of a pixel, and the third axis can be regarded as a multidimensional representation of each pixel.

(1) Multiple input channel scenario

To compute the output of the convolution, the form of the convolution kernel must change as well. Assume the number of channels of the input image is C_{in}, so the shape of the input data is C_{in}\times H_{in}\times W_{in}. The calculation process is as follows:

  1. A separate two-dimensional kernel is used for each channel, so the shape of the convolution kernel array is C_{in}\times k_h\times k_w;
  2. For each channel c_{in}\in [0, C_{in}), a k_h\times k_w kernel is convolved with the corresponding two-dimensional array of size H_{in}\times W_{in};
  3. The results of the C_{in} channels are added together, giving a two-dimensional array of shape H_{out}\times W_{out}.

The following implements the cross-correlation operation with two input channels using the torch library:

import torch
from d2l import torch as d2l

def corr2d_multi_in(X, K):
    # First iterate over the 0th (channel) dimension of X and K, then add the results together
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

Construct the input tensor X and kernel tensor K corresponding to the values in the figure above and pass them to this function:

X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
corr2d_multi_in(X, K)

## Output
tensor([[ 56.,  72.],
        [104., 120.]])

(2) Multi-output channel scenario

The output feature map of the convolution operation can also have multiple channels C_{out}. In this case, C_{out} convolution kernels are needed, each of dimension C_{in}\times k_h\times k_w, so the dimension of the full kernel array is C_{out}\times C_{in}\times k_h\times k_w, as shown in the following figure:

The number of output channels of the convolution kernel is also called the number of convolution kernels. The figure contains two convolution kernels: red, green and blue represent the three input channels of the first convolution kernel, and the lighter colors represent the three input channels of the second convolution kernel.

In a convolution layer, a convolution kernel can learn and extract a feature in the image, but the image often contains a variety of different feature information, so we need multiple different convolution kernels to extract different features. 

Similarly, we implement a cross-correlation function that computes the output of multiple channels:

def corr2d_multi_in_out(X, K):
    # Iterate over the 0th dimension of K, performing the multi-input cross-correlation
    # with X each time, then stack all the results together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

Stack the kernel tensor K with K + 1 (each element of K plus 1) and K + 2 to construct a convolution kernel with 3 output channels:

K = torch.stack((K, K + 1, K + 2), 0)
K.shape
## Result
torch.Size([3, 2, 2, 2])

We apply this function to the input tensor X and the kernel tensor K. The output now contains 3 channels, and the result of the first channel is consistent with the earlier multi-input, single-output-channel result for X and K:

corr2d_multi_in_out(X, K)
## Output
tensor([[[ 56.,  72.],
         [104., 120.]],
        [[ 76., 100.],
         [148., 172.]],
        [[ 96., 128.],
         [192., 224.]]])

(3) Batch operation

In the calculation process of a convolutional neural network, multiple samples are usually grouped into a mini-batch for batch operations, so the dimension of the input data is N\times C_{in}\times H_{in}\times W_{in}. Since the same convolution kernels are used for every image, the dimension of the kernel array is the same as in the multi-output-channel case above, namely C_{out}\times C_{in}\times k_h\times k_w, and the dimension of the output feature maps is N\times C_{out}\times H_{out}\times W_{out}, as shown in the following figure:
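A minimal shape check of this batched layout (the batch size, channel counts and image size below are arbitrary):

import torch
from torch import nn

conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
X = torch.rand(size=(16, 3, 32, 32))   # N=16 images, C_in=3, 32x32 pixels
print(conv2d(X).shape)                 # torch.Size([16, 8, 32, 32])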

3. Pooling layer and fully connected layer

3.1 Pooling

3.1.1 Pooling operation

The pooling layer is also called the aggregation layer or subsampling layer. Its main function is to select features and reduce the number of features, thereby reducing the number of parameters. Pooling amounts to dimensionality reduction in the spatial domain: it acts on each input feature map separately and reduces its size.

The pooling layer contains a preset pooling function, whose function is to replace the result of a single point in the feature map with the feature map statistics of its adjacent regions. The advantage of using the overall statistical characteristics of adjacent outputs at a certain position to replace the network's output at that position is that when the input data undergoes a small amount of translation, most of the outputs after the pooling function remain unchanged.

There are usually two types of pooling, namely average pooling and maximum pooling, as shown in the following figure:

Similar to the convolution kernel, when the pooling window (denoted k_h\times k_w) slides over the image, the step size of each movement is called the stride; when the strides in the height and width directions differ, they are denoted s_h and s_w respectively. The image to be pooled can also be padded, in the same way as for convolution. Assume that p_{h1} rows are padded before the first row, p_{h2} rows after the last row, p_{w1} columns before the first column, and p_{w2} columns after the last column; then the output feature map size of the pooling layer is:

\left\{\begin{matrix} H_{out}=\frac{H+p_{h1}+p_{h2}-k_{h}}{s_h}+1\\ \\ W_{out}=\frac{W+p_{w1}+p_{w2}-k_w}{s_w}+1 \end{matrix}\right.

Usually, a pooling window of size 2\times 2 is used with a stride of 2 and a padding of 0; pooling in this way halves the height and width of the output feature map, while the number of channels does not change.

To implement forward propagation of the pooling layer, we can still use the corr2d function, but there is no convolution kernel here, and the output is the maximum or average value of each region in the input:

import torch
from torch import nn
from d2l import torch as d2l

def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

Construct the input tensor x and verify the output of the 2D max pooling layer:

## Max pooling
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

## Result
tensor([[4., 5.],
        [7., 8.]])


# Average pooling
pool2d(X, (2, 2), 'avg')

# Result
tensor([[2., 3.],
        [5., 6.]])
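For pooling with stride and padding we can use the built-in nn.MaxPool2d instead of the simple pool2d above. A quick check of the common 2x2, stride-2 setting described earlier (the input sizes are arbitrary):

import torch
from torch import nn

pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
X = torch.rand(size=(1, 3, 8, 8))   # 3 channels, 8x8 feature maps
print(pool2d(X).shape)              # torch.Size([1, 3, 4, 4]): height and width halved, channels unchanged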

3.1.2 Properties of pooling

The pooling layer can not only effectively reduce the number of neurons, but also make the network remain invariant to some small local morphological changes and have a larger receptive field, as shown in the following figure:

3.2 Fully connected layer

Each node in the fully-connected layer is connected to all nodes in the previous layer, and the features extracted from the forward layer are combined. Due to its fully connected characteristics, the fully connected layer generally has the most parameters.

In a CNN, one or more fully connected layers follow the convolutional and pooling layers. Similar to an MLP, each neuron in a fully connected layer is fully connected to all the neurons in the previous layer. To improve the performance of the network, the ReLU function is generally used as the activation function of the neurons in the fully connected layers. The output of the last fully connected layer is passed to the output layer, which can perform classification using softmax logistic regression.
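A minimal sketch of such a fully connected head (the layer sizes below are placeholders; the softmax is usually folded into the cross-entropy loss rather than applied explicitly):

import torch
from torch import nn

head = nn.Sequential(
    nn.Flatten(),                        # flatten C x H x W feature maps into a vector
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10))                   # one score per class; softmax is applied in the loss

feature_maps = torch.rand(size=(1, 16, 5, 5))
print(head(feature_maps).shape)          # torch.Size([1, 10])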

4. LeNet network

LeNet is one of the earliest published convolutional neural networks and received widespread attention for its performance in computer vision tasks. The model was proposed by Yann LeCun, a researcher at AT&T Bell Labs, in 1989. At the time, LeNet achieved results comparable to those of support vector machines, then a dominant approach in supervised learning. LeNet was widely used in automatic teller machines (ATMs) to help recognize the digits on checks, and is still used in a range of applications today.

4.1 LeNet structure

The LeNet network consists of two parts, namely:

  • Convolutional encoder: consists of two convolutional layers
  • Fully connected layer dense block: consists of three fully connected layers.

Its network architecture is shown in the figure below:

The basic units in each convolutional block are a convolutional layer, a sigmoid activation function and an average pooling layer. Each convolutional layer uses a 5\times 5 convolution kernel and a sigmoid activation function. These layers map the input to multiple two-dimensional feature maps, usually increasing the number of channels at the same time. The first convolutional layer has 6 output channels, while the second has 16. The output shape of each convolution is determined by the batch size, the number of channels, and the height and width.

It is not difficult for a deep learning framework to implement such a model, just instantiate a Sequential module and connect the required layers together:

import torch
from torch import nn
from d2l import torch as d2l
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10))

We make a small change to the original model: the Gaussian activation of the last layer is removed, and the rest is consistent with the original network. By passing a 28\times 28 single-channel (black and white) image through LeNet and printing the shape of the output at each layer, we can check that the model operates as we expect, as shown below:

X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape: \t',X.shape)

## Output
Conv2d output shape:            torch.Size([1, 6, 28, 28])
Sigmoid output shape:           torch.Size([1, 6, 28, 28])
AvgPool2d output shape:         torch.Size([1, 6, 14, 14])
Conv2d output shape:            torch.Size([1, 16, 10, 10])
Sigmoid output shape:           torch.Size([1, 16, 10, 10])
AvgPool2d output shape:         torch.Size([1, 16, 5, 5])
Flatten output shape:           torch.Size([1, 400])
Linear output shape:            torch.Size([1, 120])
Sigmoid output shape:           torch.Size([1, 120])
Linear output shape:            torch.Size([1, 84])
Sigmoid output shape:           torch.Size([1, 84])
Linear output shape:            torch.Size([1, 10])

Throughout the convolutional block, the feature height and width of each layer are smaller than those of the previous layer. The first convolutional layer uses 2 pixels of padding to compensate for the reduction caused by the 5\times 5 kernel. In contrast, the second convolutional layer has no padding, so its height and width are each reduced by 4 pixels. Going up the stack, the number of channels increases from 1 at the input to 6 after the first convolutional layer and 16 after the second. Meanwhile, each pooling layer halves the height and width. Finally, each fully connected layer reduces the dimensionality, until the final output has a dimension matching the number of classes.

4.2 LeNet model training

Based on this implementation of LeNet, we evaluate its performance on the Fashion-MNIST data set.

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

Although convolutional neural networks have fewer parameters, they can still be computationally expensive compared with deep multilayer perceptrons, because each parameter participates in more multiplications. A GPU can be used to speed up training.

def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
    """Compute the accuracy of a model on a dataset using a GPU"""
    if isinstance(net, nn.Module):
        net.eval()  # Set to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # Number of correct predictions, total number of predictions
    metric = d2l.Accumulator(2)
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]

As with the fully connected layer, we use the cross-entropy loss function and mini-batch stochastic gradient descent:

#@save
def train_ch6(net, train_iter, test_iter, num_epochs, lr, device): 
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, number of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0],
                           d2l.accuracy(y_hat,y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

Afterwards, we can train and evaluate the LeNet model:

lr, num_epochs = 0.9, 10
train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

## Output
loss 0.467, train acc 0.825, test acc 0.821
88556.9 examples/sec on cuda:0
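To spot-check the trained model, one can look at a few test predictions. This is a small sketch beyond the original post, using the d2l label helper for Fashion-MNIST:

net.eval()
X, y = next(iter(test_iter))
X = X.to(d2l.try_gpu())
with torch.no_grad():
    preds = net(X).argmax(axis=1)
print('true:', d2l.get_fashion_mnist_labels(y[:6]))
print('pred:', d2l.get_fashion_mnist_labels(preds[:6].cpu()))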


Origin blog.csdn.net/weixin_58243219/article/details/129777227