Table of contents
- 1 Fully connected network
- 2 Convolutional Neural Networks
- 2.1 What is a Convolutional Neural Network
- 2.2 Convolutional neural network or fully connected network?
- 2.3 Convolutional Neural Network Execution Process
- 2.4 Calculation process of convolution
- 2.5 Detailed Code Explanation
- 2.6 Detailed explanation of each step of convolutional neural network
- 2.7 Realize the calculation process
- 3 Comes with a wallpaper (^ ^)/~
1 Fully connected network
A fully connected neural network is a basic artificial neural network model, also known as a multi-layer perceptron (MLP). In a fully connected network, each neuron is connected to every neuron in the previous layer, and each connection has its own weight. A neuron computes the weighted sum of its inputs (plus a bias), applies a nonlinear activation function to it, and outputs the result.
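As a minimal sketch of what a single fully connected layer computes (assuming PyTorch, which this article uses later; the sizes 4 and 3 are arbitrary), the check below confirms that `torch.nn.Linear` is a weighted sum of all inputs plus a bias, with one weight per connection, followed here by a ReLU activation:

```python
import torch

torch.manual_seed(0)

# A fully connected layer with 4 inputs and 3 output neurons (sizes chosen for illustration).
layer = torch.nn.Linear(in_features=4, out_features=3)
x = torch.randn(2, 4)  # a batch of 2 input vectors

# Each output neuron computes a weighted sum of *all* inputs plus a bias:
# y = x @ W^T + b, where W has shape (3, 4): one weight per connection.
manual = x @ layer.weight.T + layer.bias
y = layer(x)

# A nonlinear activation (ReLU here) is then applied to the weighted sums.
activated = torch.relu(y)

print(layer.weight.shape)          # torch.Size([3, 4])
print(torch.allclose(y, manual))   # True
```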
However, because every neuron connects to every neuron in the previous layer, the number of parameters grows quickly with the input size, which makes the model prone to problems such as overfitting. For this reason, many practical applications use other types of neural networks instead, such as convolutional neural networks for images and recurrent neural networks for sequences, although fully connected layers are still common as components (for example, the final classification layer of the CNN later in this article).
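To make the parameter growth concrete, the sketch below counts the parameters of a hypothetical MLP for 28 × 28 grayscale images. The hidden sizes 512 and 256 are illustrative assumptions, not values from this article:

```python
import torch

# A hypothetical MLP for 28x28 grayscale images (hidden sizes are illustrative).
mlp = torch.nn.Sequential(
    torch.nn.Flatten(),            # (batch, 1, 28, 28) -> (batch, 784)
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Each Linear layer has in_features * out_features weights plus out_features biases.
for layer in mlp:
    if isinstance(layer, torch.nn.Linear):
        n = sum(p.numel() for p in layer.parameters())
        print(layer, n)

total = sum(p.numel() for p in mlp.parameters())
print("total:", total)  # total: 535818
```

Most of the parameters sit in the first layer (401,920 of them), because every one of the 784 pixels is connected to every neuron of the first hidden layer.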
The following is a schematic diagram of a fully connected neural network:
2 Convolutional Neural Networks
2.1 What is a Convolutional Neural Network
A convolutional neural network (CNN) is a feedforward neural network. It extracts image features with convolution operations and then reduces the spatial size (height and width) of the feature maps with pooling operations.
2.2 Convolutional neural network or fully connected network?
Convolutional neural networks have the following advantages over fully connected networks:
Parameter sharing: each convolution kernel slides over the entire input, so the same weights are reused at every position. As a result, the model has far fewer parameters, which reduces memory and computation and often helps it generalize better.
Position invariance (more precisely, translation equivariance): convolution is a local operation applied identically at every position, so shifting an object in the image shifts its features in the feature map by the same amount; combined with pooling, this gives some tolerance to small shifts. Rotation and scaling are not handled by convolution itself; robustness to them usually comes from data augmentation or from what the network learns. This property is one of the main reasons convolutional neural networks work well for tasks such as image recognition.
Deep composability: a convolutional neural network stacks multiple convolutional and pooling layers, each extracting features from the output of the previous one. Stacking layers lets later layers combine simpler features into more complex ones, which makes harder feature extraction and classification tasks possible.
In summary, compared with fully connected networks, convolutional neural networks offer better parameter efficiency, position invariance, and deep composability.
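A small sketch of the first two properties, assuming PyTorch. The `dense` layer is a hypothetical fully connected layer sized to produce the same 10 × 24 × 24 output as the convolution, and circular padding is an assumption chosen so that the shift check holds exactly, even at the borders:

```python
import torch

torch.manual_seed(0)

# Parameter sharing: a 5x5 convolution from 1 to 10 channels reuses the same
# 10 kernels at every position of a 28x28 image ...
conv = torch.nn.Conv2d(1, 10, kernel_size=5)
conv_params = sum(p.numel() for p in conv.parameters())    # 10*1*5*5 + 10

# ... while a fully connected layer producing an output of the same size
# (10 x 24 x 24 = 5760 values) needs one weight per input-output pair.
dense = torch.nn.Linear(28 * 28, 10 * 24 * 24)
dense_params = sum(p.numel() for p in dense.parameters())  # 784*5760 + 5760
print(conv_params, dense_params)  # 260 4521600

# Translation equivariance: shifting the input shifts the output by the same amount.
conv_eq = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1,
                          padding_mode="circular", bias=False)
x = torch.randn(1, 1, 8, 8)
shift = (2, 3)  # rows, columns
with torch.no_grad():
    shifted_then_conv = conv_eq(torch.roll(x, shifts=shift, dims=(2, 3)))
    conv_then_shifted = torch.roll(conv_eq(x), shifts=shift, dims=(2, 3))
print(torch.allclose(shifted_then_conv, conv_then_shifted, atol=1e-6))  # True
```

With ordinary zero padding, the equality holds everywhere except near the borders, where the zeros enter the computation.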
2.3 Convolutional Neural Network Execution Process
Note: the input image size 1x28x28 means channels × height × width, the (C, H, W) order that PyTorch uses.
The role of each layer in the above process:
C1 layer (convolution): extracts local features while preserving the spatial structure of the image. A fully connected network has to flatten the image into a one-dimensional vector, which discards the spatial arrangement of the pixels; a convolutional layer works directly on the two-dimensional grid, so neighboring pixels are processed together.
S1 layer (subsampling, also called downsampling or pooling): shrinks the feature maps to reduce the amount of data and computation. Pooling itself has no learnable parameters, but smaller feature maps mean less computation in later layers and fewer inputs, and therefore fewer weights, for the final fully connected layer.
Feature extraction: a CNN learns features automatically; stacked convolution and pooling layers extract progressively more abstract and meaningful representations of the input.
Classification: the extracted features are mapped to categories to classify the input.
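One way to see how stacking layers builds larger, more abstract features is to measure the receptive field: the region of the input that a single output value depends on. The sketch below (an illustration using two hypothetical 3 × 3 layers whose weights are set to 1) uses autograd to find that region. One 3 × 3 layer sees 3 × 3 pixels; two stacked layers see 5 × 5:

```python
import torch

# Two stacked 3x3 convolutions without padding. Weights are set to 1 so that
# every input pixel inside the receptive field has a non-zero influence.
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 1, kernel_size=3, bias=False),
    torch.nn.Conv2d(1, 1, kernel_size=3, bias=False),
)
with torch.no_grad():
    for layer in net:
        layer.weight.fill_(1.0)

x = torch.zeros(1, 1, 9, 9, requires_grad=True)
out = net(x)                 # 9x9 -> 7x7 -> 5x5
out[0, 0, 2, 2].backward()   # gradient of one output pixel w.r.t. the input

# Input pixels with a non-zero gradient form this output pixel's receptive field.
influenced = x.grad[0, 0] != 0
print(influenced.sum().item())   # 25, i.e. a 5x5 region
```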
A CNN extracts image features with its convolutional layers, flattens the final feature maps into a one-dimensional vector, and uses a fully connected layer to map that vector to the size of the output space, for example 10 values for 10 classes. The softmax function converts these outputs into a probability distribution, and the cross-entropy loss compares that distribution with the true label to train the classifier. (In PyTorch, torch.nn.CrossEntropyLoss applies the softmax internally, so the model itself outputs raw scores.)
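A short sketch of how the loss and softmax relate in PyTorch: `torch.nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally. The logits and labels below are made-up values for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)            # raw model outputs: batch of 4, 10 classes
labels = torch.tensor([3, 0, 9, 1])    # true class indices (made-up labels)

# CrossEntropyLoss takes raw logits and applies log-softmax internally.
loss = torch.nn.CrossEntropyLoss()(logits, labels)

# The same value by hand: softmax -> probability of the true class -> -log -> mean.
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(4), labels]).mean()

print(torch.allclose(loss, manual, atol=1e-5))  # True
print(probs.sum(dim=1))                         # each row sums to 1
```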
2.4 Calculation process of convolution
Put simply, a color image is split into three channels: red, green, and blue. Convolution is then computed over a patch of the image, that is, a small region of it.
In a convolutional neural network, features are extracted by taking a small region of the image (a patch) and combining it with a small weight matrix of the same size (the convolution kernel, or filter). The kernel is usually square, sometimes rectangular, and its size is chosen by the user. During convolution the kernel slides over the image; at each position, the patch and the kernel are multiplied element-wise and the products are summed (a dot product), producing one output value. Different kernel sizes and weights extract different features.
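The sliding-window computation can be written out directly. This is a minimal, unoptimized sketch (`conv2d_manual` is a hypothetical helper; stride 1, no padding) checked against `torch.nn.functional.conv2d`, using the 5 × 5 input and 3 × 3 kernel from the code in 2.5:

```python
import torch
import torch.nn.functional as F

def conv2d_manual(image, kernel):
    """Slide `kernel` over a 2-D `image` (stride 1, no padding).

    At each position, multiply the patch under the kernel element-wise by the
    kernel and sum. PyTorch's conv2d computes the same thing (a cross-correlation:
    the kernel is not flipped).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = (patch * kernel).sum()
    return out

image = torch.tensor([[3., 4., 6., 5., 7.],
                      [2., 4., 6., 8., 2.],
                      [1., 6., 7., 8., 4.],
                      [9., 7., 4., 6., 2.],
                      [3., 7., 5., 4., 1.]])
kernel = torch.arange(1., 10.).view(3, 3)   # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

manual = conv2d_manual(image, kernel)
reference = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3))[0, 0]
print(manual[0, 0].item())                 # 211.0 (top-left patch times kernel)
print(torch.allclose(manual, reference))   # True
```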
The operation process of convolution:
Note: the red box in the figure above corresponds to one patch.
Final result:
Convolution with multiple kernels:
Note: each convolution kernel must have the same number of channels as the input.
To get m output channels, you need m convolution kernels. Together they form a weight tensor of shape
m × n × kernel height × kernel width,
where n is the number of input channels and m is the number of convolution kernels (equal to the number of output channels).
For example, here m = 4, so the four kernels are stacked into a four-dimensional tensor.
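A sketch of the shapes involved, assuming n = 3 input channels and m = 4 kernels (values chosen for illustration). It also checks that each output channel is the sum, over input channels, of single-channel convolutions with the matching slice of one kernel:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_in, m_out = 3, 4          # n input channels (e.g. RGB), m kernels = output channels
conv = torch.nn.Conv2d(n_in, m_out, kernel_size=3, bias=False)
print(conv.weight.shape)    # torch.Size([4, 3, 3, 3]): m x n x kernel_h x kernel_w

x = torch.randn(1, n_in, 6, 6)
y = conv(x)
print(y.shape)              # torch.Size([1, 4, 4, 4]): one output channel per kernel

# Output channel k = sum over input channels c of
# (input channel c convolved with channel slice c of kernel k).
k = 2
per_channel = [F.conv2d(x[:, c:c + 1], conv.weight[k:k + 1, c:c + 1]) for c in range(n_in)]
manual = torch.stack(per_channel).sum(dim=0)
print(torch.allclose(y[:, k:k + 1], manual, atol=1e-5))  # True
```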
2.5 Detailed Code Explanation
import torch

# Input data: a 5x5 single-channel image, written row by row
input = [3, 4, 6, 5, 7,
         2, 4, 6, 8, 2,
         1, 6, 7, 8, 4,
         9, 7, 4, 6, 2,
         3, 7, 5, 4, 1]
# Convert to a tensor of shape (batch_size=1, channels=1, height=5, width=5)
input = torch.Tensor(input).view(1, 1, 5, 5)
# Convolution layer: 1 input channel, 1 output channel, 3x3 kernel, padding 1, no bias
conv_layer = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
# Define the kernel. view() reshapes the 1-D tensor [1, 2, ..., 9] into a 4-D tensor
# of shape [1, 1, 3, 3]:
#   first 1  -> number of output channels (one kernel)
#   second 1 -> number of input channels
#   3, 3     -> kernel height and kernel width
kernel = torch.Tensor([1, 2, 3, 4, 5, 6, 7, 8, 9]).view(1, 1, 3, 3)
# Copy the kernel values into the convolution layer's weights
conv_layer.weight.data = kernel.data
# Apply the convolution to the input
output = conv_layer(input)
# Print the result, a tensor of shape (1, 1, 5, 5)
print(output)
Output result:
2.6 Detailed explanation of each step of convolutional neural network
2.6.1 Padding
Padding:
In convolution, padding adds a fixed number of extra pixels (usually zeros) around the border of the input, so that the kernel can also be centered on the boundary pixels of the input.
Without padding, edge pixels are covered by fewer kernel positions than interior pixels, so their information is underused, and every convolution shrinks the feature map. With padding, the output feature map can keep the same size as the input (for example, padding 1 with a 3 × 3 kernel and stride 1) or shrink more slowly, and border information is preserved, which can improve the network's performance.
import torch
input = [3,4,6,5,7,
2,4,6,8,2,
1,6,7,8,4,
9,7,4,6,2,
3,7,5,4,1]
input = torch.Tensor(input).view(1, 1, 5, 5)
conv_layer = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
kernel = torch.Tensor([1,2,3,4,5,6,7,8,9]).view(1, 1, 3, 3)
conv_layer.weight.data = kernel.data
output = conv_layer(input)
print(output)
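For input height H, kernel size k, padding p, and stride s, the output height of a convolution (with dilation 1) is floor((H + 2p - k) / s) + 1, and the same holds for the width. The hypothetical helper `conv_output_size` below implements this formula and checks it against `Conv2d` for the 5 × 5 input above:

```python
import torch

def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution with dilation 1: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

x = torch.randn(1, 1, 5, 5)
results = []
for p in (0, 1):
    conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=p, bias=False)
    h = conv(x).shape[-1]
    results.append((p, h, conv_output_size(5, 3, padding=p)))
    print(p, h, conv_output_size(5, 3, padding=p))
# Prints "0 3 3" then "1 5 5": padding 1 keeps a 5x5 input at 5x5 for a 3x3 kernel.
```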
2.6.2 Stride
Assume that the stride is 2:
In a convolutional neural network, the stride is the number of pixels the kernel moves at each step as it slides over the input. A larger stride produces a smaller output feature map and reduces computation; a smaller stride produces a larger output feature map, which costs more computation but preserves more spatial detail.
The stride should therefore be chosen to balance computational cost against the detail needed for good accuracy.
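A sketch comparing stride 1 and stride 2 with the same 3 × 3 kernel and padding 1 on a random 5 × 5 input. With stride 2 the kernel skips every other position, so the output shrinks from 5 × 5 to 3 × 3, and its values are exactly the stride-1 outputs at every second row and column:

```python
import torch

torch.manual_seed(0)

x = torch.randn(1, 1, 5, 5)
conv_s1 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=1, bias=False)
conv_s2 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2, bias=False)
with torch.no_grad():
    conv_s2.weight.copy_(conv_s1.weight)   # same kernel, only the stride differs
    out_s1 = conv_s1(x)
    out_s2 = conv_s2(x)

print(out_s1.shape)   # torch.Size([1, 1, 5, 5])
print(out_s2.shape)   # torch.Size([1, 1, 3, 3]): floor((5 + 2*1 - 3) / 2) + 1 = 3
# Stride 2 computes the stride-1 outputs at every second position.
print(torch.allclose(out_s2, out_s1[..., ::2, ::2]))   # True
```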
2.6.3 Pooling
Max pooling:
A common downsampling method is max pooling with a 2 × 2 window. In PyTorch, the stride of MaxPool2d defaults to the window size (here 2), so the windows do not overlap. For the 4 × 4 input below, this splits the input into four 2 × 2 blocks, takes the maximum of each block, and assembles the maxima into a 2 × 2 output:
Note: max pooling operates on each channel separately, so the number of channels stays the same.
Code:
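import torch
# Define the 4x4 input matrix
input = [3, 4, 6, 5,
         2, 4, 6, 8,
         1, 6, 7, 8,
         9, 7, 4, 6]
# Convert to a tensor of shape (batch_size=1, channels=1, height=4, width=4)
input = torch.Tensor(input).view(1, 1, 4, 4)
# Max pooling layer with a 2x2 window (stride defaults to 2)
maxpooling_layer = torch.nn.MaxPool2d(kernel_size=2)
# Pass the input through the max pooling layer
output = maxpooling_layer(input)
# Print the result: shape (1, 1, 2, 2), values [[4, 8], [9, 8]]
print(output)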
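To make the block-wise maximum explicit, the sketch below performs the same 2 × 2 max pooling by reshaping the 4 × 4 input into 2 × 2 blocks (this reshape trick assumes the height and width are divisible by 2) and compares it with `MaxPool2d`. It also shows that pooling keeps the channel count:

```python
import torch

x = torch.tensor([3., 4., 6., 5.,
                  2., 4., 6., 8.,
                  1., 6., 7., 8.,
                  9., 7., 4., 6.]).view(1, 1, 4, 4)

# Split H into (H/2, 2) and W into (W/2, 2), then take the max inside each 2x2 block.
n, c, h, w = x.shape
blocks = x.view(n, c, h // 2, 2, w // 2, 2)
manual = blocks.amax(dim=(3, 5))
print(manual.tolist())   # [[[[4.0, 8.0], [9.0, 8.0]]]]

reference = torch.nn.MaxPool2d(kernel_size=2)(x)
print(torch.equal(manual, reference))   # True

# Pooling works per channel, so the number of channels is unchanged.
multi = torch.randn(2, 3, 8, 8)
print(torch.nn.MaxPool2d(2)(multi).shape)   # torch.Size([2, 3, 4, 4])
```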
2.7 Realize the calculation process
2.7.1 Detailed calculation process
Next, let's implement a simple convolutional neural network that recognizes the handwritten digits of the MNIST dataset.
First, let's walk through the computation in detail:
The input has shape (batch, 1, 28, 28): a batch of batch images, where each image is a three-dimensional array of shape (1, 28, 28) holding 1 channel, 28 rows, and 28 columns of pixel data.
The first convolution uses a 5 × 5 kernel with 1 input channel and 10 output channels.
After this convolution the shape is (batch, 10, 24, 24): 10 is the number of channels and 24 × 24 is the spatial size. With stride 1 and no padding, the number of rows after convolution = original rows - kernel rows + 1 = 28 - 5 + 1 = 24, and likewise for the columns.
Next, max pooling halves the size from 24 × 24 to 12 × 12, giving (batch, 10, 12, 12).
Then a second convolution with a 5 × 5 kernel and 20 output channels gives (batch, 20, 8, 8), since 12 - 5 + 1 = 8.
Another max pooling gives (batch, 20, 4, 4).
Finally, the multi-dimensional feature map is flattened into a one-dimensional feature vector of 20 × 4 × 4 = 320 elements per image, and a fully connected layer maps these 320 elements to 10 output values, one per digit class.
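The shape arithmetic above can be checked by pushing a random batch through the layers one at a time. This is a sketch with untrained weights and random data standing in for MNIST images:

```python
import torch

torch.manual_seed(0)

# The layer configuration described above (weights are random, untrained).
conv1 = torch.nn.Conv2d(1, 10, kernel_size=5)
conv2 = torch.nn.Conv2d(10, 20, kernel_size=5)
pool = torch.nn.MaxPool2d(2)
fc = torch.nn.Linear(320, 10)

x = torch.randn(64, 1, 28, 28)   # random stand-in for a batch of 64 MNIST images
shapes = []
with torch.no_grad():
    for name, layer in [("conv1", conv1), ("pool", pool), ("conv2", conv2), ("pool", pool)]:
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(name, tuple(x.shape))
    x = x.view(x.size(0), -1)    # flatten: 20 * 4 * 4 = 320 features per image
    shapes.append(tuple(x.shape))
    x = fc(x)
    shapes.append(tuple(x.shape))
print(shapes)
```

The recorded shapes are (64, 10, 24, 24), (64, 10, 12, 12), (64, 20, 8, 8), (64, 20, 4, 4), (64, 320), and (64, 10).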
The whole process is as follows:
2.7.2 Code implementation
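Full code:
import torch
import torch.nn.functional as F

# Define a class named Net that inherits from torch.nn.Module
class Net(torch.nn.Module):
    def __init__(self):
        # Call the parent class constructor
        super(Net, self).__init__()
        # First convolutional layer: 1 input channel, 10 output channels, 5x5 kernel
        self.conv1 = torch.nn.Conv2d(1, 10, kernel_size=5)
        # Second convolutional layer: 10 input channels, 20 output channels, 5x5 kernel
        self.conv2 = torch.nn.Conv2d(10, 20, kernel_size=5)
        # Max pooling layer with a 2x2 window (stride defaults to 2)
        self.pooling = torch.nn.MaxPool2d(2)
        # Fully connected layer: 320 inputs (20 * 4 * 4), 10 outputs (one per digit)
        self.fc = torch.nn.Linear(320, 10)

    # Forward pass of the model
    def forward(self, x):
        # Batch size, read from the first dimension of the input
        batch_size = x.size(0)
        # First convolution, then max pooling, then ReLU
        # (ReLU and max pooling commute, so this equals conv -> ReLU -> pooling):
        # (batch, 1, 28, 28) -> (batch, 10, 24, 24) -> (batch, 10, 12, 12)
        x = F.relu(self.pooling(self.conv1(x)))
        # Second convolution, then max pooling, then ReLU:
        # (batch, 10, 12, 12) -> (batch, 20, 8, 8) -> (batch, 20, 4, 4)
        x = F.relu(self.pooling(self.conv2(x)))
        # Flatten the feature maps into one vector per sample: (batch, 320)
        x = x.view(batch_size, -1)
        # Fully connected layer: 10 raw scores (logits) per sample
        x = self.fc(x)
        return x

# Instantiate Net to create the model
model = Net()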
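As a usage sketch, the same architecture can also be written as an `nn.Sequential` (an equivalent restatement, not the article's code) and trained for one step. The random `images` and `labels`, and the SGD learning rate 0.01 and momentum 0.5, are illustrative assumptions:

```python
import torch

torch.manual_seed(0)

# Same layers and order (conv -> pool -> ReLU) as Net.forward, as an nn.Sequential.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 10, kernel_size=5),
    torch.nn.MaxPool2d(2),
    torch.nn.ReLU(),
    torch.nn.Conv2d(10, 20, kernel_size=5),
    torch.nn.MaxPool2d(2),
    torch.nn.ReLU(),
    torch.nn.Flatten(),          # same effect as x.view(batch_size, -1)
    torch.nn.Linear(320, 10),
)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)   # 8490 = (10*1*5*5 + 10) + (20*10*5*5 + 20) + (320*10 + 10)

# One training step on random data (a stand-in for an MNIST batch, not real images).
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))
criterion = torch.nn.CrossEntropyLoss()   # applies softmax internally, expects logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

logits = model(images)            # shape (8, 10): raw scores
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape, loss.item())
```

Compared with the 535,818 parameters of a typical small MLP on the same input, this CNN needs only 8,490, which is the parameter sharing described in 2.2 at work.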
2.7.3 Some possible difficulties
Why use ReLU?
The ReLU activation, ReLU(x) = max(0, x), sets negative values in the feature map to zero and passes positive values through unchanged. This makes the mapping nonlinear: without an activation, stacked convolutional and fully connected layers would compose into a single linear mapping no matter how many layers there are, which limits what the network can represent. That is why ReLU is usually applied after a convolutional layer, to increase the expressive power of the feature maps.
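The reasoning can be checked numerically. In this sketch (arbitrary layer sizes), two stacked linear layers without an activation are reproduced exactly by a single linear layer with weight W2·W1 and bias W2·b1 + b2, while ReLU zeroes negative inputs:

```python
import torch

torch.manual_seed(0)

x = torch.randn(5, 4)
f1 = torch.nn.Linear(4, 6)
f2 = torch.nn.Linear(6, 3)

with torch.no_grad():
    # Without an activation, two linear layers collapse into one linear layer:
    # f2(f1(x)) = x @ (W2 W1)^T + (W2 b1 + b2)
    W = f2.weight @ f1.weight
    b = f2.weight @ f1.bias + f2.bias
    stacked = f2(f1(x))
    single = x @ W.T + b
print(torch.allclose(stacked, single, atol=1e-5))   # True

# ReLU(z) = max(0, z): negatives become 0, positives pass through unchanged.
z = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(z).tolist())   # [0.0, 0.0, 0.0, 1.5]
```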
What does -1 in x.view(batch_size, -1) mean?
-1 tells PyTorch to infer the size of that dimension automatically, so that the tensor is flattened into the shape (batch_size, num_features). The inferred value is the total number of elements in the tensor divided by batch_size; for the (batch, 20, 4, 4) feature maps in this network, that is 20 × 4 × 4 = 320.
This operation flattens each sample's multi-dimensional feature map into a one-dimensional vector, which is the input format the fully connected layer expects.
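A quick sketch, assuming the (64, 20, 4, 4) shape that the second pooling layer produces for a batch of 64:

```python
import torch

x = torch.randn(64, 20, 4, 4)        # feature maps after the second pooling layer
batch_size = x.size(0)

flat = x.view(batch_size, -1)        # -1 is inferred as 20 * 4 * 4 = 320
print(flat.shape)                    # torch.Size([64, 320])

# Equivalent ways to flatten everything except the batch dimension.
print(torch.equal(flat, torch.flatten(x, start_dim=1)))   # True
print(torch.equal(flat, x.reshape(batch_size, 320)))      # True

# -1 may appear only once, and the known sizes must divide the element count.
caught = False
try:
    x.view(batch_size, -1, 7)        # 20480 elements cannot be split as 64 x ? x 7
except RuntimeError as err:
    caught = True
    print("RuntimeError:", err)
```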