Basic concepts of CNNs, commonly used calculation formulas, and PyTorch code


1. Basic concepts of CNNs

Convolution, padding, and pooling: refer here
Activation functions: refer here
In short:
convolution helps us find specific local image features.
Pooling is used for feature fusion and dimensionality reduction, and also helps prevent overfitting.
The activation function adds nonlinearity; a suitable activation (e.g. ReLU) also helps mitigate vanishing and exploding gradients.
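
In PyTorch these three building blocks are typically stacked as convolution → activation → pooling. A minimal sketch (the channel counts and image size below are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

# convolution -> activation -> pooling: the basic CNN building block
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # extract local features
    nn.ReLU(),                                   # add nonlinearity
    nn.MaxPool2d(kernel_size=2, stride=2),       # fuse features and halve the resolution
)
x = torch.randn(1, 3, 32, 32)   # a dummy 32x32 RGB image
print(block(x).shape)           # torch.Size([1, 16, 16, 16])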

2. Common convolution

refer here

1. General convolution

This is the most common type of convolution and is easiest to understand from an animation.
[Animation: a standard convolution kernel sliding over the input]

2. Dilated convolution (atrous convolution)

Dilated convolution is mainly used to enlarge the receptive field; varying the dilation rate yields multi-scale information.
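
A short sketch of a dilated convolution (the channel counts are arbitrary); a 3×3 kernel with dilation=2 covers a 5×5 region:

import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
# dilation=2 spreads the 3x3 taps over a 5x5 window, enlarging the receptive field
dilated = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)
print(dilated(x).shape)  # torch.Size([1, 8, 32, 32]) -- padding=2 keeps the spatial size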

3. Transposed convolution (deconvolution)

It reverses the spatial reduction of a convolution, restoring the size but not the original values. Typical uses include visualization, upsampling for segmentation and restoration, GAN generation, etc.
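
A sketch of upsampling with nn.ConvTranspose2d (the channel counts here are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 16, 14, 14)
# stride=2 roughly doubles the spatial size; the values are learned, not recovered
up = nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2)
print(up(x).shape)  # torch.Size([1, 8, 28, 28])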

4. Separable convolution

Separable convolution can be mainly divided into spatially separable convolution and depthwise separable convolution.

4.1. Spatial Separable Convolution

Replacing one 2D convolution kernel with two 1D kernels (for example, a 3×3 kernel as a 3×1 pass followed by a 1×3 pass) produces the same result while reducing the amount of computation.
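
A sketch of the idea with a 3×1 pass followed by a 1×3 pass (the channel counts are arbitrary; a full 3×3 convolution is only reproduced exactly when its kernel happens to be separable):

import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
spatial_sep = nn.Sequential(
    nn.Conv2d(8, 8, kernel_size=(3, 1), padding=(1, 0)),  # vertical 3x1 pass
    nn.Conv2d(8, 8, kernel_size=(1, 3), padding=(0, 1)),  # horizontal 1x3 pass
)
print(spatial_sep(x).shape)  # torch.Size([1, 8, 32, 32])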

4.2. Depthwise separable convolution

Each of the N input channels is convolved separately (depthwise), and then a pointwise 1×1×N convolution is applied to that output, which also reduces the amount of computation.

A 1×1 convolution kernel can reduce dimensionality (the number of channels), increase model depth, and add nonlinearity; an N×N convolution kernel extracts local spatial features.
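
A sketch of a depthwise separable convolution in PyTorch: groups=in_channels gives the per-channel depthwise step, followed by a 1×1 pointwise convolution (the channel counts are arbitrary):

import torch
import torch.nn as nn

in_ch, out_ch = 32, 64
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise: one filter per channel
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise: 1x1xN channel mixing
)
x = torch.randn(1, in_ch, 28, 28)
print(depthwise_separable(x).shape)  # torch.Size([1, 64, 28, 28])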

3. Input and output size calculation formulas for CNNs

3.1. Convolution layer

N = floor((W - F + 2P) / S) + 1
where
N: output size
W: input size
F: convolution kernel size
P: padding size
S: stride

For example, when W=256, F=3, P=1, S=2: N = floor((256 - 3 + 2×1) / 2) + 1 = 127 + 1 = 128
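
The formula can be checked with a small helper against nn.Conv2d (conv_out_size below is just an illustrative helper, not a library function):

import torch
import torch.nn as nn

def conv_out_size(W, F, P, S):
    # N = floor((W - F + 2P) / S) + 1
    return (W - F + 2 * P) // S + 1

print(conv_out_size(256, 3, 1, 2))  # 128
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
print(conv(torch.randn(1, 1, 256, 256)).shape)  # torch.Size([1, 1, 128, 128])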

3.2. Pooling layer

N = floor((W + 2P - D×(F - 1) - 1) / S) + 1
where
N: output size
W: input size
D: dilation (1 by default in PyTorch)
F: pooling kernel size
P: padding size
S: stride

For example, when W=128, D=1, F=2, P=0, S=2: N = floor((128 + 0 - 1×1 - 1) / 2) + 1 = 63 + 1 = 64. The division is a floor division, so an odd numerator is rounded down before adding 1.
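
The same kind of check for pooling (pool_out_size is an illustrative helper; note that PyTorch's MaxPool2d uses dilation D = 1 by default):

import torch
import torch.nn as nn

def pool_out_size(W, F, P=0, S=None, D=1):
    # N = floor((W + 2P - D*(F - 1) - 1) / S) + 1
    S = S or F  # PyTorch's default pooling stride equals the kernel size
    return (W + 2 * P - D * (F - 1) - 1) // S + 1

print(pool_out_size(128, 2, 0, 2))  # 64
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(torch.randn(1, 1, 128, 128)).shape)  # torch.Size([1, 1, 64, 64])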

4. Commonly used CNN activation functions

4.1.Sigmoid

Calculation formula: Sigmoid(x) = 1 / (1 + e^(-x)); derivative: Sigmoid'(x) = Sigmoid(x) × (1 - Sigmoid(x))
[Figure: sigmoid function and its derivative]

4.2. Tanh

Calculation formula: Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)); derivative: Tanh'(x) = 1 - Tanh(x)^2
[Figure: tanh function and its derivative]

4.3. ReLU

Calculation formula: ReLU(x) = max(0, x); derivative: 1 for x > 0, 0 for x < 0
[Figure: ReLU function and its derivative]

In addition, there are LeakyReLU, ELU, SiLU, etc.
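
All three activations are available directly in torch; a quick numerical check (values are rounded):

import torch

x = torch.tensor([-2.0, 0.0, 2.0])
print(torch.sigmoid(x))  # ~ tensor([0.1192, 0.5000, 0.8808])
print(torch.tanh(x))     # ~ tensor([-0.9640,  0.0000,  0.9640])
print(torch.relu(x))     # tensor([0., 0., 2.])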

5. Standardization and normalization

refer here

1. Standardization

Transform the data into a distribution with mean 0 and standard deviation 1: x' = (x - μ) / σ

1.1. Batch Normalization

Each activation function has an unsaturated region. Without BN before the activation, inputs that fall outside this region saturate the activation and the gradient vanishes; BN keeps the data inside the unsaturated region, which also speeds up convergence and helps prevent overfitting. BN is therefore generally placed before the activation function.
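
The usual Conv → BN → activation ordering therefore looks like this (a minimal sketch, channel counts arbitrary):

import torch.nn as nn

conv_bn_relu = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(16),  # normalize each channel to mean 0, variance 1 (then learnable scale/shift)
    nn.ReLU(inplace=True),
)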

2. Normalization

Transform the data into a fixed interval, such as [0, 1]: x' = (x - x_min) / (x_max - x_min)
Summary: normalization is a rescaling based on the minimum and maximum values. Whether your data needs it depends on the situation; typical benefits are faster convergence during gradient-based optimization and putting features of different scales on a common footing.
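
A minimal min-max normalization sketch on a tensor:

import torch

x = torch.tensor([10.0, 20.0, 30.0, 40.0])
x_norm = (x - x.min()) / (x.max() - x.min())  # rescale into [0, 1]
print(x_norm)  # tensor([0.0000, 0.3333, 0.6667, 1.0000])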

6. Save and load the model

1. Only save the learned parameters

1.1. Save the model: torch.save(model.state_dict(), PATH)

1.2. Load the model: model.load_state_dict(torch.load(PATH)); model.eval()

2. Save the entire model

2.1. Save the model: torch.save(model, PATH)

2.2. Load the model: model = torch.load(PATH)
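
Both variants side by side (a sketch: model, MyModel and the file names are placeholders for your own model class and paths):

import torch

# 1. parameters only (recommended): the model class must be available when loading
torch.save(model.state_dict(), "model_weights.pth")
model = MyModel()                                   # rebuild the architecture first
model.load_state_dict(torch.load("model_weights.pth"))
model.eval()

# 2. whole model: simpler, but tied to the code layout used when saving
torch.save(model, "model_full.pth")
model = torch.load("model_full.pth")
model.eval()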

7. The role and difference between model.eval() and with torch.no_grad()

Refer to here
In train mode, the dropout layer randomly zeroes activations with probability p (with nn.Dropout, p is the drop probability, so the keep probability is 1 - p), and the batchnorm layer keeps computing and updating the running mean and var of the data.

In eval mode, the dropout layer passes all activations through, and the batchnorm layer stops updating mean and var and directly uses the running values learned during training.

model.eval() does not turn off gradient computation; gradients are still tracked and backpropagation is still possible.
Inside with torch.no_grad(), no gradients are computed at all, so no backpropagation can take place.

So model.eval() is what gives correct test-time behaviour, while with torch.no_grad() additionally saves computing resources.
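
A typical inference loop therefore combines the two (a sketch; model and test_loader are placeholders):

import torch

model.eval()                      # switch dropout/batchnorm to inference behaviour
with torch.no_grad():             # no graph is built, saving memory and compute
    for images, labels in test_loader:
        outputs = model(images)
        preds = outputs.argmax(dim=1)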

8. Copy

Assignment: the memory address is unchanged; it is just another reference, so changing one changes the other.
copy.copy(): shallow copy; adding elements to the copy does not affect the original, but modifying a shared (nested) element also changes the original data.
copy.deepcopy(): deep copy, i.e. a complete copy with new memory addresses.
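
A quick demonstration of the three behaviours:

import copy

a = [[1, 2], [3, 4]]
b = a                    # assignment: same object, changing b changes a
c = copy.copy(a)         # shallow copy: new outer list, shared inner lists
d = copy.deepcopy(a)     # deep copy: fully independent

c.append([5, 6])         # does not affect a
c[0][0] = 99             # shared inner list: a[0][0] also becomes 99
d[1][0] = 0              # independent: a is untouched
print(a)                 # [[99, 2], [3, 4]]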

9. Fine-tuning

Refer to here.
Under what circumstances is fine-tuning appropriate?
(1) The dataset you want to use is similar to the dataset the model was pre-trained on.
(2) The accuracy of the CNN model you built or are using is too low.
(3) The datasets are similar, but your dataset is too small.
(4) Computing resources are too limited.
Judge whether a pre-trained model is needed based on your own data.

There are also different ways of fine-tuning for different situations:
Dataset 1 - small amount of data, but very high data similarity - in this case, all we do is modify the output categories of the last few layers or the final softmax layer.

Dataset 2 - small amount of data, low data similarity - in this case we can freeze the initial k layers of the pretrained model and train the remaining (n - k) layers again. Since the similarity of the new dataset is low, it makes sense to retrain the higher layers on the new dataset.

Dataset 3 - Large amount of data, low data similarity - In this case, since we have a large dataset, neural network training will be effective. However, because our data is very different from the data used to train the pretrained model, predictions made with the pretrained model will not be accurate. Therefore, it is better to train a neural network from scratch on your data (training from scratch).

Dataset 4 - Large amount of data, high similarity of data - This is the ideal situation. In this case, a pretrained model should be most effective. The best way to use the model is to preserve the architecture of the model and the initial weights of the model. Then, we can use the weights in the pre-trained model to retrain the model.

1. freeze

Refer to here
Freezing means preventing some layers from being updated by backpropagation:

for param in list(model.parameters())[:]:  # select here the parameters/layers you want to freeze
    param.requires_grad = False  # no gradient will be computed for them
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.0001)  # pass only the parameters that still need backpropagation

Unfreeze:

for param in list(model.parameters())[:]:
    param.requires_grad = True
    optimizer.add_param_group({'params': param})

2. Modify the layer

After viewing the structure of the model, we get:
model(
    (feature): Sequential(
        (0): Conv(
            (conv): Conv2d(3, 24, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            …
        )
        …
        …
    )
    (fc): Sequential(
        (0): Dropout(p=0.3, inplace=True)
        (1): Linear(in_features=1792, out_features=1000, bias=True)
    )
)

For example, to change the classifier in the fc block from 1000 classes to 10, replace the Linear layer inside it:

model.fc[1] = nn.Linear(1792, 10)

The fc block then prints as:
(fc): Sequential(
    (0): Dropout(p=0.3, inplace=True)
    (1): Linear(in_features=1792, out_features=10, bias=True)
)

10. Evaluation metrics

Refer to here
TP: predicted 1 (Positive), actually 1 (True - correct prediction)
TN: predicted 0 (Negative), actually 0 (True - correct prediction)
FP: predicted 1 (Positive), actually 0 (False - wrong prediction)
FN: predicted 0 (Negative), actually 1 (False - wrong prediction)
Total number of samples: TP + TN + FP + FN

The metric you usually see first is accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision rate Precision = TP/(TP+FP)

Recall rate Recall = TP/(TP+FN)

F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)
AR = TP/(TP+FP)
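
A small sketch that computes these metrics from a batch of predictions (the preds and labels tensors are made up for illustration):

import torch

preds  = torch.tensor([1, 0, 1, 1, 0, 1])
labels = torch.tensor([1, 0, 0, 1, 1, 1])

tp = ((preds == 1) & (labels == 1)).sum().item()  # 3
tn = ((preds == 0) & (labels == 0)).sum().item()  # 1
fp = ((preds == 1) & (labels == 0)).sum().item()  # 1
fn = ((preds == 0) & (labels == 1)).sum().item()  # 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)         # 0.667
precision = tp / (tp + fp)                          # 0.75
recall    = tp / (tp + fn)                          # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75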

Origin: blog.csdn.net/qq_37249793/article/details/122244198