Getting Started with Deep Learning: The Fully Connected Layer and a NumPy Implementation

Foreword

Convolutional neural networks (ConvNets or CNNs) are the type of neural network that underpins much of computer vision (CV). This article introduces another operation in convolutional neural networks, the fully connected operation: its principle, and, from a beginner's perspective, a NumPy implementation from 0 to 1.

1

This series is a beginner's introduction to deep learning operators; today's operator is the FC layer. If you missed the other operators, follow the official account "Invincible Zhang Dadao" to get them.

FC is short for Full Connection: every neuron in the previous layer is connected to every neuron in the next layer, hence the name fully connected layer. It is generally placed after the convolution, pooling, and other operations in a convolutional neural network. The reason, explained earlier, is that when a convolutional network tackles a goal such as a classification task, it adopts a two-step strategy: first extract features from the input data, then map the extracted features to the classification targets. Convolution, pooling, and activation functions correspond to the first step, feature engineering of the data, while the fully connected layer corresponds to the second step: it weights the features and plays the role of a "classifier".

2

The essence of the FC layer is a traditional multi-layer perceptron (MLP), as shown in the figure below:
[Figure: a multi-layer perceptron]

For example, the value of the orange neuron in the middle is computed as:
[Figure: calculation of the orange neuron's value]

Here w is a weight, bias is the bias term, and f is the activation function: each output neuron equals the weighted sum of every neuron in the previous layer plus the bias, passed through the activation, i.e. output = f(Σᵢ wᵢ·xᵢ + bias). For the details of the forward and backward passes of the multi-layer perceptron, refer to [1].
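
As a minimal sketch of that formula (the layer sizes and the ReLU choice of f are my own illustrative assumptions, not from the post), one fully connected layer in NumPy is a matrix multiplication plus a bias, followed by the activation:

import numpy as np

def relu(z):
    # an example choice of f; the post treats the activation generically
    return np.maximum(z, 0)

x = np.random.randn(4)       # the 4 neurons of the previous layer
W = np.random.randn(4, 3)    # one weight per (input neuron, output neuron) pair
b = np.zeros(3)              # one bias per output neuron

# each output neuron j equals f(sum_i W[i, j] * x[i] + b[j])
y = relu(x @ W + b)
print(y.shape)               # (3,)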

The activation at the end of the fully connected part generally uses softmax, so that the probabilities over the whole output layer sum to 1: the softmax function compresses a vector of arbitrary scores into values between 0 and 1 whose sum is 1. For a detailed explanation of the softmax activation, see the previous article "Activation function and numpy implementation".
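
As a quick illustrative sketch (not taken from that earlier article), a numerically stable softmax in NumPy can look like this:

import numpy as np

def softmax(scores):
    # subtracting the max does not change the result but avoids overflow in exp
    exp = np.exp(scores - np.max(scores))
    return exp / np.sum(exp)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)         # values between 0 and 1
print(probs.sum())   # 1.0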

Going from the extracted features to the fully connected layer is usually done by flattening the feature map directly and then passing it through the fully connected layer to the corresponding output branch, as shown in the figure below:
[Figure: flattening the feature map and feeding it to the fully connected layer]
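
A minimal sketch of this flatten-then-FC step, with shapes chosen purely for illustration:

import numpy as np

# a batch of 2 feature maps, each with 3 channels of size 3x3
feature_map = np.random.randn(2, 3, 3, 3)

# flatten everything except the batch dimension: (2, 27)
flat = feature_map.reshape(feature_map.shape[0], -1)

# a fully connected layer mapping the 27 flattened features to 10 class scores
W = np.random.randn(27, 10)
b = np.zeros(10)
scores = flat @ W + b
print(scores.shape)   # (2, 10)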

After the image of a "7" is fed in and goes through several rounds of feature extraction such as convolution and pooling, the feature map is flattened into a long vector and the fully connected operation is applied. (The picture here is a bit misleading: it does not actually flatten the feature map; instead, each neuron in the upper layer has a weighted connection to every value of the feature map in the lower layer, which essentially replaces the fully connected operation with a global convolution.) See the detailed diagram below:
[Figure: detailed view of the connections between the feature map and the fully connected layer]

In practice, the fully connected layer can be implemented with convolution. If the preceding layer is convolutional, the fully connected layer can be converted into a global convolution whose kernel is h×w, as the detailed diagram above shows. For example, for a feature map with h=3, w=3, channel=3, producing an FC output of length 2 requires two convolution kernels the same size as the feature map. For the convolution operation itself, refer to "Convolution Operator and Numpy Implementation". You can derive the equivalence yourself; the mathematical expressions are exactly the same. The 3D visualization of feature maps in [2] is also recommended here.
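
As a rough sanity check of that equivalence (the 3×3×3 feature map and output length 2 follow the example above; everything else is assumed for illustration), a global convolution with kernels the size of the feature map reduces to one dot product per kernel, which is exactly what a fully connected layer computes:

import numpy as np

h, w, c, out_len = 3, 3, 3, 2
feature_map = np.random.randn(c, h, w)

# "global convolution": two kernels, each exactly the size of the feature map,
# so each kernel fits at only one position and produces a single number
kernels = np.random.randn(out_len, c, h, w)
conv_out = np.array([np.sum(feature_map * k) for k in kernels])

# fully connected view: flatten the feature map and the kernels
fc_out = kernels.reshape(out_len, -1) @ feature_map.reshape(-1)

print(np.allclose(conv_out, fc_out))   # True: the two computations are identical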

And if the previous layer is itself a fully connected layer, the next fully connected layer can be realized with a 1×1 convolution, as shown in the figure below:

[Figure: a fully connected layer implemented as a 1×1 convolution]

A fully connected layer of length 4 becomes one of length 3. This is essentially the same as the figure below: a vector of length 4 can be viewed as a feature map with h=1, w=1, and channel=4, and convolving it with (output length) 1×1 kernels, each with an input length of 4, produces a feature map of the output length; the mathematical expression is the same:

[Figure: 1×1 convolution over a 1×1×4 feature map]
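
A minimal sketch of that 1×1-convolution view (the lengths 4 and 3 follow the example above; everything else is assumed for illustration):

import numpy as np

in_len, out_len = 4, 3
x = np.random.randn(in_len)              # output of the previous FC layer, length 4

# view x as a feature map with h = 1, w = 1 and 4 channels
feature_map = x.reshape(in_len, 1, 1)

# three 1x1 kernels, each with 4 input channels
kernels = np.random.randn(out_len, in_len, 1, 1)

# 1x1 convolution: each output channel is a weighted sum over the input channels
conv_out = np.array([np.sum(feature_map * k) for k in kernels])

# the same thing written as a fully connected layer with a (4, 3) weight matrix
fc_out = x @ kernels.reshape(out_len, in_len).T

print(np.allclose(conv_out, fc_out))     # True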

There is in fact another operation, global average pooling (GAP), which uses a pooling window whose height and width equal those of the feature map, so each channel is reduced to a single value. A whole feature map tensor thus becomes a vector after global pooling, completing the role of the fully connected operation; it can be regarded as another way to replace the fully connected layer.

[Figure: global average pooling reducing each channel of the feature map to one value]
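
A minimal illustrative sketch of GAP in NumPy (shapes assumed):

import numpy as np

# a batch of 2 feature maps with 8 channels of size 5x5
feature_map = np.random.randn(2, 8, 5, 5)

# global average pooling: average each channel over its full height and width
gap = feature_map.mean(axis=(2, 3))
print(gap.shape)   # (2, 8): one value per channel, ready to feed a classifier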

It is said that, because of the parameter redundancy of the FC layer, GAP can prevent overfitting to some extent. However, I have also seen in a paper that a network with global pooling converges more slowly than one with FC. The reason may be this: if an FC layer is used, its large parameter count and redundancy let the classification part of the two-step process be learned largely inside the FC layer, so the pressure on the earlier feature extraction part is relatively small. If GAP is chosen instead, the global pooling is comparatively crude and does not express the positional relationships between features well, so for the whole network to work, the convolutional layers doing feature extraction must represent the features more precisely. Another implication of being more precise is weaker generalization, which leads to a further problem: during transfer learning, the feature extraction part needs larger parameter adjustments than in a network whose head is a fully connected layer. This also suggests that a convolutional network containing an FC layer generalizes relatively better in transfer training. Friends are welcome to discuss this.

3

The implementation of the FC layer has long been packaged in frameworks such as torch and tensorflow, where it is available out of the box. The idea here is as follows: as with the other operators, it inherits from the Layers class; for the code of the Layers class, see the implementation in the conv operator article.
The FC layer inherits the Layers class, and the forward and backward passes are implemented as follows:

import numpy as np
from module import Layers

class FC(Layers):
    """
    Forward: flatten the input to (batch, in_channels), then y = x @ W + b.

    Backward: the gradient coming from the next layer is multiplied by the stored
    input to get the weight gradient; that gradient times lr is the update step
    applied in update().
    """

    def __init__(self, name, in_channels, out_channels):
        super(FC, self).__init__(name)
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.weights = np.random.standard_normal((in_channels, out_channels))
        self.bias = np.zeros(out_channels)
        self.grad_w = np.zeros((in_channels, out_channels))
        self.grad_b = np.zeros(out_channels)

    def forward(self, input):
        # remember the original shape so backward can restore it
        self.in_shape = input.shape
        input = np.reshape(input, (input.shape[0], -1))
        self.input = input
        return np.dot(self.input, self.weights) + self.bias

    def backward(self, grad_out):
        # gradients w.r.t. the input, the weights and the bias
        dx = np.dot(grad_out, self.weights.T)
        self.grad_w = np.dot(self.input.T, grad_out)
        self.grad_b = np.sum(grad_out, axis=0)
        return dx.reshape(self.in_shape)

    def zero_grad(self):
        self.grad_w.fill(0)
        self.grad_b.fill(0)

    def update(self, lr):
        self.weights -= lr * self.grad_w
        self.bias -= lr * self.grad_b
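
To sanity-check the class, here is a short usage sketch of my own (not from the original post); it assumes the `from module import Layers` import is replaced by a minimal stand-in base class like the one below, since the real Layers class lives in the conv operator article:

import numpy as np

class Layers:
    # minimal stand-in for the base class from the conv article: it only stores the name
    def __init__(self, name):
        self.name = name

# (define the FC class from above here)

fc = FC("fc1", in_channels=27, out_channels=10)
x = np.random.randn(2, 3, 3, 3)      # batch of 2 feature maps, flattened to 27 features each
scores = fc.forward(x)               # shape (2, 10)
grad_out = np.ones_like(scores)      # a dummy upstream gradient
dx = fc.backward(grad_out)           # shape (2, 3, 3, 3), same as the input
fc.update(lr=0.01)
fc.zero_grad()
print(scores.shape, dx.shape)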

The FC layer has fallen somewhat out of favor because of its large parameter count and redundancy. From the interchangeability of convolution and full connection shown above, we can see that a global convolution is equivalent to an FC layer: global convolution can learn global information, including the relative positions between features across the whole map, while ordinary convolution learns local information and enlarges what it sees by increasing the receptive field. Ordinary convolution, however, has far fewer parameters, so it feels like a compromise on the amount of computation. The current transformer applications in CV use the attention mechanism to attend over the entire feature map, which is somewhat like the fully connected operation; but that is a story for later. Friends are welcome to discuss it together.

Reference:
[1] https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/
[2] https://www.cs.ryerson.ca/~aharley/vis/conv/
[3] https://www.zhihu.com/question/41037974/answer/150522307

For more deep learning content, please follow the official account "Invincible Zhang Dadao".

Origin blog.csdn.net/zqwwwm/article/details/123703045