A complete PyTorch workflow for writing, training, and testing a model

Take the network model of CIFAR10 as an example.

[Figure: CIFAR10 network structure diagram]

Network structure

First write the network structure according to the structure diagram:

import torch
from torch import nn

# Build the neural network
class Cifar10(nn.Module):
    def __init__(self):
        super(Cifar10, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64*4*4, 64),
            nn.Linear(64, 10)
        )
    def forward(self, input):
        output = self.model(input)
        return output

# Verify that the network is correct
if __name__ == '__main__':
    cifar10 = Cifar10()
    input = torch.ones([64, 3, 32, 32])
    output = cifar10(input)
    print(output.shape)

Running this file on its own, the terminal prints:

torch.Size([64, 10])

The 64 rows correspond to the 64 (batch_size) images, and each row contains 10 values that score the image against the 10 categories.
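Note that these 10 values are raw scores (logits); nn.CrossEntropyLoss applies softmax internally during training. If you want actual probabilities at inference time, you can apply softmax yourself. A minimal sketch, assuming the network above is saved as model.py (as the training file below imports it):

import torch
from model import Cifar10   # assumes the network above is saved as model.py

cifar10 = Cifar10()
batch = torch.ones([64, 3, 32, 32])        # dummy batch, same shape as above
logits = cifar10(batch)                    # raw scores, shape [64, 10]
probs = torch.softmax(logits, dim=1)       # per-image probabilities over the 10 classes
print(probs.shape, probs[0].sum())         # torch.Size([64, 10]); each row sums to 1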

Training file

Then write the training file:

import torch
import torchvision
from torch import nn
from torch.utils.tensorboard import SummaryWriter

from model import *   # import the file that defines the network model
from torch.utils.data import DataLoader

# Define the device used for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Prepare the datasets
train_data = torchvision.datasets.CIFAR10(root="dataset", train=True, transform=torchvision.transforms.ToTensor(),
                                          download=True)
test_data = torchvision.datasets.CIFAR10(root="dataset", train=False, transform=torchvision.transforms.ToTensor(),
                                          download=True)

# Dataset sizes
train_data_size = len(train_data)
test_data_size = len(test_data)

# Load the datasets
train_dataloader = DataLoader(dataset=train_data, batch_size=64)
test_dataloader = DataLoader(dataset=test_data, batch_size=64)

# Create the network model
cifar10 = Cifar10()
cifar10 = cifar10.to(device)

# Loss function
loss_fn = nn.CrossEntropyLoss()
loss_fn = loss_fn.to(device)

# Optimizer
lr = 1e-2
optimizer = torch.optim.SGD(cifar10.parameters(), lr=lr)

writer = SummaryWriter("./logs")

epochs = 100
for epoch in range(epochs):
    print("----------开始第{}轮训练".format(epoch))
    # Training phase
    # Put the network into training mode. This matters for special layers such as
    # Dropout and BatchNorm; it is safe to call even if the model has none.
    cifar10.train()
    total_train_loss = 0
    for data in train_dataloader:
        imgs, targets = data
        imgs = imgs.to(device)
        targets = targets.to(device)
        output = cifar10(imgs)
        loss = loss_fn(output, targets)

        # The optimizer updates the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_train_loss += loss.item()

    writer.add_scalar("train_loss", total_train_loss, epoch)

    # Validation phase
    # Put the network into evaluation mode. This matters for special layers such as
    # Dropout and BatchNorm; it is safe to call even if the model has none.
    cifar10.eval()
    total_test_loss = 0
    preds = 0
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            imgs = imgs.to(device)
            targets = targets.to(device)
            output = cifar10(imgs)
            pred = (output.argmax(1) == targets).sum()
            preds += pred
            loss = loss_fn(output, targets)
            total_test_loss += loss.item()
    print("test loss:{}".format(total_test_loss))
    print("test accuracy:{}".format(preds/test_data_size))
    writer.add_scalar("test_loss", total_test_loss, epoch)
    writer.add_scalar("test_accuracy", preds/test_data_size, epoch)

    # Save the model after every epoch (the models/ directory must already exist)
    torch.save(cifar10, "models/cifar10_{}.pth".format(epoch))

writer.close()

1. Zeroing the gradients: to prevent the gradients computed in the previous iteration from affecting this iteration's parameter update, the gradients are cleared first in every iteration and the parameters are then updated from the freshly computed gradients.
2. No gradient computation during validation: first, it saves resources; second, validation only produces metrics for us to judge how training is going, and the gradients it would generate must not affect the parameter updates.
3. Prediction accuracy:
output.argmax(1): returns, for each row of the network output (the predictions), the position of the maximum value, i.e. the predicted class.

If the argument were 0, the maximum would be taken down each column instead; here each row holds the score distribution of one image, so we need the row-wise (horizontal) maximum rather than the column-wise one.

output.argmax(1) == targets: compares the predicted class with the true class; the result is True wherever an image's predicted and true classes are equal.

(output.argmax(1) == targets).sum(): counts how many predictions are correct; dividing this count by the total number of test samples gives the prediction accuracy on the test data.
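
As a quick illustration, here is a minimal self-contained sketch of this accuracy computation, using made-up outputs and targets:

import torch

output = torch.tensor([[0.1, 0.9],    # predicted class 1
                       [0.8, 0.2],    # predicted class 0
                       [0.3, 0.7]])   # predicted class 1
targets = torch.tensor([1, 0, 0])

correct = (output.argmax(1) == targets).sum()   # tensor(2): two predictions match
accuracy = correct.item() / len(targets)
print(accuracy)   # 0.666...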

Results

Start TensorBoard from the terminal with tensorboard --logdir=logs, then open the URL it prints.
First, look at the accuracy on the test set:
[Figure: test accuracy curve]
The overall upward trend is normal, but plateauing around 0.65 is quite low (though this is only an experiment, not a practical application).
Next, look at train_loss:
[Figure: train_loss curve]
The training loss drops steadily, which is normal.
Then look at test_loss:
[Figure: test_loss curve]
This curve is abnormal: it starts to soar after about 20 epochs. Combined with the train_loss trend, the model may be stuck around a local optimum: the network is quite simple and the learning rate never decreases in later epochs, which leads to overfitting, i.e. the network keeps learning hard in one local direction while the test loss rises.

Revision

Here I ran an experiment (adjusting only the learning rate) to see whether there would be an improvement: move the optimizer construction inside the 100-epoch loop, and multiply the learning rate by 0.95 at every epoch.
[Figure: modified training code with per-epoch learning-rate decay]
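Roughly, the change amounts to the following sketch (variable names follow the training file above; PyTorch's built-in torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95) would achieve the same decay more idiomatically):

lr = 1e-2
for epoch in range(epochs):
    # Re-create the optimizer each epoch with the current learning rate,
    # then shrink the rate by a factor of 0.95 for the next epoch.
    optimizer = torch.optim.SGD(cifar10.parameters(), lr=lr)
    lr = lr * 0.95

    # ... the training and validation steps inside the loop stay unchanged ...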
Let's look at the accuracy first:
[Figure: test accuracy curve after the change]
After 100 epochs the accuracy is still trending upward at the end and is close to 0.7, higher than before, so the change has a visible effect.
Look at train_loss again:
[Figure: train_loss curve after the change]
The curve is now very smooth, better than the slight oscillation we saw before.
Finally, look at the test_loss that was problematic before:
[Figure: test_loss curve after the change]
It now looks much better and also declines smoothly.
So the change was quite successful.

Test file

First, I found a random picture online that belongs to one of the 10 CIFAR10 classes; here I use a dog as the example.
[Figure: the dog image used for testing]

import torch
from PIL import Image
import torchvision

img_path = "./dog.png"
img = Image.open(img_path)

img = img.convert("RGB")  # keep only the RGB channels; a PNG has 4 channels (an extra alpha channel), so converting makes the code work with images in various formats

transform = torchvision.transforms.Compose([torchvision.transforms.Resize((32, 32)),
                                            torchvision.transforms.ToTensor()])

img = transform(img)
print(img.shape)

img = torch.reshape(img, (1, 3, 32, 32))

img = img.cuda()
model = torch.load("models/cifar10_99.pth")  # the curves show the last epoch performs best, so load the model saved after the final epoch
model.eval()
with torch.no_grad():     # saves memory; no gradients are needed for inference
    output = model(img)
print(output.argmax(1))

Problems

There are several problems here:
1. The size argument of Resize should be wrapped in a tuple or list: a single int means the shorter side is resized to that value and the longer side is scaled proportionally, whereas a pair resizes the image to exactly the given [h, w].
2. Since only a single image is fed in here, rather than a batch of images loaded from a folder through a DataLoader with a batch_size, there is no batch dimension. If you do not reshape the input to four dimensions (adding the batch dimension), you get this error:

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [32, 3, 5, 5],
 but got 3-dimensional input of size [3, 32, 32] instead

That is, the network input should be 4-dimensional, but it is actually only 3-dimensional.
3. If the input tensor is not moved to CUDA, you get an error like this:

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) 
should be the same or input should be a MKLDNN tensor and weight is a dense tensor

The input type and the weight type must match: since the weights were trained on CUDA, the input here must also be a CUDA tensor.
If you run the model in a different environment, for example training on a machine with a GPU and testing on a machine with only a CPU, you need to pass a map_location when loading the model so that it is mapped to the CPU:

model = torch.load("models/cifar10_99.pth", map_location=torch.device("cpu"))
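
Putting these points together, a device-agnostic version of the test script could look like the sketch below (unsqueeze(0) is an equivalent way to add the batch dimension; paths follow the example above):

import torch
import torchvision
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

img = Image.open("./dog.png").convert("RGB")
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((32, 32)),    # tuple: both sides resized to 32x32
    torchvision.transforms.ToTensor()])
img = transform(img).unsqueeze(0).to(device)    # add the batch dimension and match the model's device

model = torch.load("models/cifar10_99.pth", map_location=device)   # map_location keeps this working on CPU-only machines
model.eval()
with torch.no_grad():
    output = model(img)
print(output.argmax(1))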

Output

Output on CPU:

tensor([5])

Output on GPU:

tensor([5], device='cuda:0')

Check the class-to-index mapping of the CIFAR10 dataset on the PyTorch official website:
[Figure: CIFAR10 class-to-index table]
Class index 5 is dog, so the prediction is correct.
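
You can also read this mapping directly from the dataset object instead of the website. A small sketch, assuming the dataset has already been downloaded to dataset/ as in the training file:

import torchvision

test_data = torchvision.datasets.CIFAR10(root="dataset", train=False, download=True)
print(test_data.classes)      # ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
print(test_data.classes[5])   # dog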

Summary

The overall workflow is split across three files:

model.py: write the network structure (the model).
train.py: generate the datasets, load them with DataLoader, and instantiate the network model; then loop for epochs rounds, and within each round loop dataset_size/batch_size times over the batches: feed the input, run the forward pass, compare the output with the ground truth to compute the loss, zero the gradients, back-propagate to update the gradients, and update the convolution-kernel and other parameters from those gradients; after each round, validate on the test set and save the model.
test.py: load a saved model, feed it the test data, and read off the prediction result.


Origin blog.csdn.net/weixin_45354497/article/details/130348301