[Pytorch Series - 46]: Convolutional Neural Networks - Training ResNet on the CIFAR100 Dataset with a GPU

Author's homepage (文火冰糖的硅基工坊): 文火冰糖(王文兵)的博客_文火冰糖的硅基工坊_CSDN博客

Article URL: https://blog.csdn.net/HiWangWenBing/article/details/121288804


Table of Contents

Preface: ResNet in Detail

(1) ResNet architecture in detail

(2) The official PyTorch definition of ResNet

Chapter 1: Problem Analysis

1.1 Problem analysis

1.2 Prerequisites

Chapter 2: Defining the Forward Model

2.1 Step 2-1: Dataset selection

2.2 Step 2-2: Data preprocessing

2.3 Step 2-3: Network definition

2.4 Step 2-4: Instantiating the network and checking its output

Chapter 3: Defining the Backward Pass

3.1 Step 3-1: Defining the loss

3.2 Step 3-2: Defining the optimizer

3.3 Step 3-3: Preparing for training

3.4 Step 3-4: Training the model

3.5 Visualizing the loss over iterations

3.6 Visualizing per-batch accuracy over iterations

Chapter 4: Model Validation

4.1 Manual spot check

4.2 Accuracy on the full training set: about 98%

4.3 Accuracy on the full test set: only about 53%

Chapter 9: Saving the Model

Chapter 10: Restoring the Model

Chapter 11: Closing Thoughts




Preface: ResNet in Detail

(1) ResNet architecture in detail

[人工智能-深度学习-38]:卷积神经网络CNN - 常见分类网络- ResNet网络架构分析与详解_文火冰糖(王文兵)的博客-CSDN博客 https://blog.csdn.net/HiWangWenBing/article/details/120915279

(2) The official PyTorch definition of ResNet

[Pytorch系列-43]:工具集 - torchvision预训练模型参数的导入(以ResNet为例)_文火冰糖(王文兵)的博客-CSDN博客 https://blog.csdn.net/HiWangWenBing/article/details/121184678

Chapter 1: Problem Analysis

1.1 Problem analysis

ResNet was originally designed for the ImageNet dataset as a 1000-class classification network.

The goal of this article is to take the ResNet model provided by torchvision and retrain it on the CIFAR100 dataset, so that it classifies CIFAR100 images.

To demonstrate fine-tuning, this article uses the pre-trained model published on the official site.

1.2 Prerequisites

# Environment setup
import numpy as np              # array library
import math                     # math functions
import matplotlib.pyplot as plt # plotting
import time

import torch             # torch core
import torch.nn as nn    # neural network layers
import torch.nn.functional as F
import torchvision.datasets as dataset        # download and manage public datasets
import torchvision.transforms as transforms   # dataset preprocessing / format conversion
import torchvision.utils as utils 
import torch.utils.data as data_utils         # batched loading of datasets
from PIL import Image                         # image handling
from collections import OrderedDict
import torchvision.models as models

print("Hello World")
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.backends.cudnn.version())
Hello World
1.10.0
True
10.2
7605

Chapter 2: Defining the Forward Model

2.1 Step 2-1: Dataset selection

(1) The CIFAR100 dataset

[Pytorch系列-33]:数据集 - torchvision与CIFAR10/CIFAR100详解_文火冰糖(王文兵)的博客-CSDN博客

(2) Sample and label format

The format is the same as CIFAR10; the only difference is that the number of classes grows from 10 to 100.

(3) Source code example: downloading and loading the data

#2-1 Prepare the datasets
# Dataset format conversion
transform_train = transforms.Compose(
    [transforms.Resize(256),           # formerly transforms.Scale(256)
     transforms.CenterCrop(224), 
     transforms.ToTensor(),
     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

transform_test = transforms.Compose(
    [transforms.Resize(256),         # formerly transforms.Scale(256)
     transforms.CenterCrop(224), 
     transforms.ToTensor(),
     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])


# Training set
train_data = dataset.CIFAR100(root = "../datasets/cifar100",
                           train = True,
                           transform = transform_train,
                           download = True)

# Test set
test_data = dataset.CIFAR100(root = "../datasets/cifar100",
                           train = False,
                           transform = transform_test,
                           download = True)

print(train_data)
print("size=", len(train_data))
print("")
print(test_data)
print("size=", len(test_data))
Files already downloaded and verified
Files already downloaded and verified
Dataset CIFAR100
    Number of datapoints: 50000
    Root location: ../datasets/cifar100
    Split: Train
    StandardTransform
Transform: Compose(
               Resize(size=256, interpolation=bilinear, max_size=None, antialias=None)
               CenterCrop(size=(224, 224))
               ToTensor()
               Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
           )
size= 50000

Dataset CIFAR100
    Number of datapoints: 10000
    Root location: ../datasets/cifar100
    Split: Test
    StandardTransform
Transform: Compose(
               Resize(size=256, interpolation=bilinear, max_size=None, antialias=None)
               CenterCrop(size=(224, 224))
               ToTensor()
               Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
           )
size= 10000
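
As a quick supplementary check, the 100 fine-grained class names can be inspected through the dataset's classes attribute (a sketch; the printed names illustrate the alphabetical ordering):

# Supplementary sketch: inspect the fine-grained class names of CIFAR100.
print(len(train_data.classes))   # 100
print(train_data.classes[:5])    # e.g. ['apple', 'aquarium_fish', 'baby', 'bear', 'beaver']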

2.2 Step 2-2: Data preprocessing

(1) Batched reading: use a DataLoader to read batches from the dataset

# Batched data loading
batch_size = 32

train_loader = data_utils.DataLoader(dataset = train_data,   # training data
                                  batch_size = batch_size,   # images per batch
                                  shuffle = True)            # shuffle the samples

test_loader = data_utils.DataLoader(dataset = test_data,     # test data
                                  batch_size = batch_size,
                                  shuffle = True)

print(train_loader)
print(test_loader)
print(len(train_data), len(train_data)/batch_size)
print(len(test_data),  len(test_data)/batch_size)
<torch.utils.data.dataloader.DataLoader object at 0x000002C43E31AD60>
<torch.utils.data.dataloader.DataLoader object at 0x000002C43E31ADC0>
50000 1562.5
10000 312.5
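
The divisions above print 1562.5 and 312.5 because the dataset sizes are not multiples of 32. The loaders themselves report the actual batch counts (supplementary sketch): drop_last defaults to False, so the final, smaller batch is kept.

# Supplementary sketch: number of batches per epoch, final partial batch included.
print(len(train_loader))   # 1563 == ceil(50000 / 32)
print(len(test_loader))    # 313  == ceil(10000 / 32)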

(2) Displaying one batch of images (for debugging only)

# Display one batch of images
print("Fetch one batch of images")
imgs, labels = next(iter(train_loader))
print(imgs.shape)
print(labels.shape)
print(labels.size()[0])

print("\nMerge the batch into a single image grid")
images = utils.make_grid(imgs)
print(images.shape)
print(labels.shape)

print("\nConvert to imshow layout (H, W, C)")
images = images.numpy().transpose(1,2,0) 
print(images.shape)
print(labels.shape)


print("\nShow the sample labels")

for i in range(batch_size):
    print(labels[i], end=" ")
    # newline after every 8 labels
    if (i + 1) % 8 == 0:
        print()
        
print("\nShow the images")
plt.imshow(images)
plt.show()
Fetch one batch of images
torch.Size([32, 3, 224, 224])
torch.Size([32])
32

Merge the batch into a single image grid
torch.Size([3, 906, 1810])
torch.Size([32])

Convert to imshow layout (H, W, C)
(906, 1810, 3)
torch.Size([32])

Show the sample labels
tensor(86) tensor(99) tensor(51) tensor(7) tensor(21) tensor(5) tensor(68) tensor(23) 
tensor(43) tensor(77) tensor(61) tensor(97) tensor(97) tensor(5) tensor(32) tensor(2) 
tensor(51) tensor(5) tensor(16) tensor(70) tensor(69) tensor(17) tensor(54) tensor(68) 
tensor(45) tensor(78) tensor(10) tensor(41) tensor(72) tensor(86) tensor(94) tensor(70) 

Show the images

2.3 Step 2-3: Network definition

# 2-3 Define the network
# Use the predefined torchvision model directly; set the number of output
# classes to 100 (the default is 1000)
net = models.resnet101(num_classes = 100) 
#print(net)

# Pre-trained parameters, downloaded in advance from the official site
net_params_path = "../models/resnet101.pth" 

# Read the pre-trained parameters
net_params = torch.load(net_params_path)
#print(net_params)

Note: judging from the training results later in this article, these pre-trained parameters are of little value for the CIFAR100 dataset and could be skipped entirely; they are kept here only to illustrate the fine-tuning workflow.
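
For reference, here is a minimal sketch of actually applying the downloaded 1000-class weights to this 100-class network (assuming net_params holds the official state_dict): the final fc layer has a different shape, so it is filtered out and left at its fresh random initialization.

# Sketch: load the ImageNet weights into everything except the 100-class fc head.
backbone_params = {k: v for k, v in net_params.items() if not k.startswith("fc.")}
missing, unexpected = net.load_state_dict(backbone_params, strict=False)
print("missing keys:", missing)        # only fc.weight and fc.bias
print("unexpected keys:", unexpected)  # should be empty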

2.4 Step 2-4: Instantiating the network and checking its output

(1) Training mode

# 2-4 Check the network's prediction output
# Sanity check: verify that the network runs at all
print("Define test input")
input = torch.randn(1, 3, 224, 224)
print(input.shape)

print("\nTraining mode: BatchNorm normalizes with the statistics of the current batch, so the output differs from evaluation mode")
net.train() 
print("Output, method 1:")
out = net(input)
print(out.shape)
print(out)

print("")
print("Output, method 2:")
out = net.forward(input)
print(out)
Define test input
torch.Size([1, 3, 224, 224])

Training mode: BatchNorm normalizes with the statistics of the current batch, so the output differs from evaluation mode
Output, method 1:
torch.Size([1, 100])
tensor([[ 0.5060, -0.5695,  0.8392, -0.6783,  0.9691, -0.4808, -0.8619, -0.8942,
          1.4290, -0.1397,  0.7581, -0.0853, -0.7866, -0.3950, -1.2386, -0.1561,
         -1.6180,  0.1931,  0.4454,  0.5431,  0.3443,  0.0752,  0.7343, -0.0758,
          1.2076,  0.0883, -0.4626,  0.8100, -0.2666,  0.9284, -0.6745,  0.3130,
          0.2800,  0.2795,  0.4193, -0.1789, -0.4788,  0.3691,  0.5758, -0.0748,
          0.8036,  0.1345,  0.9415,  0.3341,  0.6594,  0.5718, -0.5204, -0.2753,
         -0.3526,  0.6810,  0.1863, -0.5330, -1.3422, -0.1423, -0.9509, -0.6229,
         -0.1301, -0.0780,  0.2751,  0.3925, -0.6509,  0.2398, -1.2762,  0.1492,
         -0.5825,  0.4123, -0.2696,  0.4022, -0.3477,  0.4775, -0.2604,  0.8975,
          0.5744, -0.7714,  0.6217, -0.7464,  0.4009, -0.5662, -0.4907,  0.3237,
          0.0228, -0.5212,  0.3008,  0.5322, -0.3559,  0.2930,  0.1472, -0.0984,
         -0.1121, -0.3464,  0.5836,  0.6288, -0.5286,  0.1338,  0.5177,  0.5826,
          0.7557,  0.1628,  0.2516,  0.6534]], grad_fn=<AddmmBackward0>)

Output, method 2:
tensor([[ 0.5060, -0.5695,  0.8392, -0.6783,  0.9691, -0.4808, -0.8619, -0.8942,
          1.4290, -0.1397,  0.7581, -0.0853, -0.7866, -0.3950, -1.2386, -0.1561,
         -1.6180,  0.1931,  0.4454,  0.5431,  0.3443,  0.0752,  0.7343, -0.0758,
          1.2076,  0.0883, -0.4626,  0.8100, -0.2666,  0.9284, -0.6745,  0.3130,
          0.2800,  0.2795,  0.4193, -0.1789, -0.4788,  0.3691,  0.5758, -0.0748,
          0.8036,  0.1345,  0.9415,  0.3341,  0.6594,  0.5718, -0.5204, -0.2753,
         -0.3526,  0.6810,  0.1863, -0.5330, -1.3422, -0.1423, -0.9509, -0.6229,
         -0.1301, -0.0780,  0.2751,  0.3925, -0.6509,  0.2398, -1.2762,  0.1492,
         -0.5825,  0.4123, -0.2696,  0.4022, -0.3477,  0.4775, -0.2604,  0.8975,
          0.5744, -0.7714,  0.6217, -0.7464,  0.4009, -0.5662, -0.4907,  0.3237,
          0.0228, -0.5212,  0.3008,  0.5322, -0.3559,  0.2930,  0.1472, -0.0984,
         -0.1121, -0.3464,  0.5836,  0.6288, -0.5286,  0.1338,  0.5177,  0.5826,
          0.7557,  0.1628,  0.2516,  0.6534]], grad_fn=<AddmmBackward0>)

(2) Evaluation mode

print("")
print("\nEvaluation mode: BatchNorm uses its stored running statistics, so the same input gives a deterministic output")
net.eval()
print("Output, method 1:")
out = net(input)
print(out)
print("")
print("Output, method 2:")
out = net.forward(input)
print(out)
Evaluation mode: BatchNorm uses its stored running statistics, so the same input gives a deterministic output
Output, method 1:
tensor([[ 0.0056, -0.1342,  0.2105, -0.3518,  0.3745,  0.0017, -0.2244, -0.2227,
          0.3048, -0.1846,  0.1992,  0.1433, -0.0071, -0.0260, -0.2884, -0.0533,
         -0.2792,  0.0174,  0.1062,  0.2369,  0.0064, -0.1088,  0.1273,  0.0255,
          0.2831, -0.0370, -0.1836,  0.2289, -0.0434,  0.1998, -0.1511,  0.0357,
         -0.0087,  0.1043,  0.1091, -0.0693,  0.0074,  0.1424,  0.0913, -0.1034,
          0.1704,  0.0785,  0.2595, -0.0798,  0.0065,  0.0271, -0.2049,  0.0609,
          0.0778,  0.1077,  0.0576, -0.0038, -0.2775, -0.0428, -0.2166, -0.2215,
         -0.0695,  0.0724,  0.0616,  0.0592, -0.1542,  0.1019, -0.2565,  0.1209,
         -0.2126,  0.0780, -0.0670,  0.0913, -0.0570,  0.0251,  0.0436,  0.1729,
          0.0907, -0.1402,  0.1202, -0.0949,  0.0666, -0.1870, -0.0631, -0.0450,
         -0.0495, -0.1537, -0.0672,  0.1438, -0.1168,  0.2345, -0.0313, -0.1587,
         -0.1158,  0.0047,  0.0150,  0.1421, -0.1779, -0.1804,  0.1350,  0.2048,
          0.0224, -0.0026,  0.0082,  0.0645]], grad_fn=<AddmmBackward0>)

Output, method 2:
tensor([[ 0.0056, -0.1342,  0.2105, -0.3518,  0.3745,  0.0017, -0.2244, -0.2227,
          0.3048, -0.1846,  0.1992,  0.1433, -0.0071, -0.0260, -0.2884, -0.0533,
         -0.2792,  0.0174,  0.1062,  0.2369,  0.0064, -0.1088,  0.1273,  0.0255,
          0.2831, -0.0370, -0.1836,  0.2289, -0.0434,  0.1998, -0.1511,  0.0357,
         -0.0087,  0.1043,  0.1091, -0.0693,  0.0074,  0.1424,  0.0913, -0.1034,
          0.1704,  0.0785,  0.2595, -0.0798,  0.0065,  0.0271, -0.2049,  0.0609,
          0.0778,  0.1077,  0.0576, -0.0038, -0.2775, -0.0428, -0.2166, -0.2215,
         -0.0695,  0.0724,  0.0616,  0.0592, -0.1542,  0.1019, -0.2565,  0.1209,
         -0.2126,  0.0780, -0.0670,  0.0913, -0.0570,  0.0251,  0.0436,  0.1729,
          0.0907, -0.1402,  0.1202, -0.0949,  0.0666, -0.1870, -0.0631, -0.0450,
         -0.0495, -0.1537, -0.0672,  0.1438, -0.1168,  0.2345, -0.0313, -0.1587,
         -0.1158,  0.0047,  0.0150,  0.1421, -0.1779, -0.1804,  0.1350,  0.2048,
          0.0224, -0.0026,  0.0082,  0.0645]], grad_fn=<AddmmBackward0>)
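
Note that torchvision's ResNet contains no dropout layers; the train/eval difference above comes entirely from BatchNorm. A minimal sketch to confirm this, reusing the input tensor defined above:

# ResNet has no dropout; BatchNorm is what changes between the two modes.
# train(): normalize with the current batch's statistics (updating running averages)
# eval():  normalize with the stored running statistics
with torch.no_grad():
    net.train()
    out_train = net(input)
    net.eval()
    out_eval = net(input)
print(torch.allclose(out_train, out_eval))   # False: different normalization statistics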

Chapter 3: Defining the Backward Pass

3.1 Step 3-1: Defining the loss

# 3-1 Define the loss function
loss_fn = nn.CrossEntropyLoss()
print(loss_fn)
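
For reference, CrossEntropyLoss fuses LogSoftmax and NLLLoss: it expects raw logits of shape [batch, num_classes] plus integer class indices of shape [batch], which is exactly what the network and the dataset deliver. A minimal sketch:

# Supplementary sketch: CrossEntropyLoss takes raw logits and integer class labels.
logits  = torch.randn(4, 100)             # 4 samples, 100 classes (unnormalized scores)
targets = torch.tensor([3, 17, 99, 0])    # class indices, not one-hot vectors
print(loss_fn(logits, targets))           # a scalar loss tensor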

3.2 Step 3-2: Defining the optimizer

# 3-2 Define the optimizer
Learning_rate = 0.01     # learning rate

# optimizer = SGD: stochastic gradient descent (here with momentum)
# parameters: the parameters to optimize
# lr: the learning rate
#optimizer = torch.optim.Adam(net.parameters(), lr = Learning_rate)
optimizer = torch.optim.SGD(net.parameters(), lr = Learning_rate, momentum=0.9)
print(optimizer)
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.01
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)

3.3 Step 3-3: Preparing for training

# 3-3 Pre-training setup
# Assume that we are on a CUDA machine, then this should print a CUDA device:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

# Move the network to the GPU
net.to(device)     # adaptive: follows the selected device
#net.cuda()        # hard-coded alternative

# Move the loss computation to the GPU
loss_fn = loss_fn.to(device)   # adaptive: follows the selected device
#loss_fn.cuda()                # hard-coded alternative

# Number of training epochs
epochs = 30

loss_history = []      # per-batch loss collected during training
accuracy_history = []  # per-batch accuracy collected during training

accuracy_batch = 0.0

# Put the network into training mode
net.train() 
cuda:0

Note: in testing, after epochs = 30 the accuracy on the training set exceeds 98%.

3.4 Step 3-4: Training the model

(1) Training loop

# 3-4 Train the model
train_start = time.time()
print('train start at {}'.format(train_start))

for i in range(0, epochs):
    epoch_start = time.time()
    for j, (x_train, y_train) in enumerate(train_loader):
        
        # keep the model in train mode
        # net.train()
        # move the batch to the selected device
        x_train = x_train.to(device)
        #x_train = x_train.cuda()
        y_train = y_train.to(device)
        #y_train = y_train.cuda()
        
        #(0) reset the optimizer's gradients
        optimizer.zero_grad()    
        
        #(1) forward pass
        y_pred = net(x_train)
    
        #(2) compute the loss
        loss = loss_fn(y_pred, y_train)
    
        #(3) backward pass
        loss.backward()
    
        #(4) parameter update
        optimizer.step()
    
        # record the loss
        loss_history.append(loss.item())  # loss for this batch
        
        # record the accuracy
        number_batch = y_train.size()[0]  # images in this batch
        _, predicted = torch.max(y_pred.data, dim = 1)
        correct_batch = (predicted == y_train).sum().item()  # correct predictions
        accuracy_batch = 100 * correct_batch/number_batch
        accuracy_history.append(accuracy_batch)
    
        if(j % 10 == 0):
            print('epoch {} batch {} In {} loss = {:.4f} accuracy = {:.4f}%'.format(i, j , len(train_data)/batch_size, loss.item(), accuracy_batch)) 
    
    epoch_end = time.time()
    epoch_cost = epoch_end - epoch_start
    print('epoch {} cost {}s '.format(i, epoch_cost))

print("")
train_end = time.time()
train_cost = train_end - train_start
print('train finished at {}'.format(train_end))
print('train cost {}s '.format(train_cost))
print("final loss =", loss.item())
print("final accu =", accuracy_batch)
train start at 1636735439.1526277
epoch 0 batch 0 In 1562.5 loss = 5.0001 accuracy = 0.0000%
epoch 0 batch 10 In 1562.5 loss = 6.5815 accuracy = 0.0000%
epoch 0 batch 20 In 1562.5 loss = 5.9869 accuracy = 6.2500%
epoch 0 batch 30 In 1562.5 loss = 6.7164 accuracy = 0.0000%
epoch 0 batch 40 In 1562.5 loss = 6.5062 accuracy = 0.0000%
epoch 0 batch 50 In 1562.5 loss = 4.6485 accuracy = 3.1250%
epoch 0 batch 60 In 1562.5 loss = 4.7968 accuracy = 0.0000%
epoch 0 batch 70 In 1562.5 loss = 4.5075 accuracy = 3.1250%
epoch 0 batch 80 In 1562.5 loss = 4.7070 accuracy = 9.3750%
epoch 0 batch 90 In 1562.5 loss = 4.8183 accuracy = 0.0000%
epoch 0 batch 100 In 1562.5 loss = 4.4981 accuracy = 0.0000%
epoch 0 batch 110 In 1562.5 loss = 4.7588 accuracy = 0.0000%
epoch 0 batch 120 In 1562.5 loss = 4.7810 accuracy = 6.2500%
epoch 0 batch 130 In 1562.5 loss = 4.7711 accuracy = 0.0000%
..........................................
epoch 0 batch 1510 In 1562.5 loss = 3.8084 accuracy = 6.2500%
epoch 0 batch 1520 In 1562.5 loss = 3.5800 accuracy = 12.5000%
epoch 0 batch 1530 In 1562.5 loss = 4.3732 accuracy = 6.2500%
epoch 0 batch 1540 In 1562.5 loss = 4.3241 accuracy = 0.0000%
epoch 0 batch 1550 In 1562.5 loss = 3.6709 accuracy = 15.6250%
epoch 0 batch 1560 In 1562.5 loss = 3.5751 accuracy = 12.5000%
epoch 0 cost 555.7686467170715s 
epoch 4 batch 1500 In 1562.5 loss = 2.5541 accuracy = 40.6250%
epoch 4 batch 1510 In 1562.5 loss = 1.8727 accuracy = 46.8750%
epoch 4 batch 1520 In 1562.5 loss = 2.3561 accuracy = 37.5000%
epoch 4 batch 1530 In 1562.5 loss = 2.3473 accuracy = 31.2500%
epoch 4 batch 1540 In 1562.5 loss = 2.1816 accuracy = 50.0000%
epoch 4 batch 1550 In 1562.5 loss = 2.3976 accuracy = 40.6250%
epoch 4 batch 1560 In 1562.5 loss = 1.7835 accuracy = 59.3750%
epoch 4 cost 546.6405458450317s 
epoch 9 batch 1500 In 1562.5 loss = 1.3677 accuracy = 62.5000%
epoch 9 batch 1510 In 1562.5 loss = 1.5999 accuracy = 62.5000%
epoch 9 batch 1520 In 1562.5 loss = 1.6098 accuracy = 59.3750%
epoch 9 batch 1530 In 1562.5 loss = 0.8323 accuracy = 75.0000%
epoch 9 batch 1540 In 1562.5 loss = 1.6116 accuracy = 53.1250%
epoch 9 batch 1550 In 1562.5 loss = 1.2115 accuracy = 71.8750%
epoch 9 batch 1560 In 1562.5 loss = 0.9043 accuracy = 68.7500%
epoch 9 cost 547.8049418926239s 
epoch 14 batch 1530 In 1562.5 loss = 0.3810 accuracy = 81.2500%
epoch 14 batch 1540 In 1562.5 loss = 0.3506 accuracy = 87.5000%
epoch 14 batch 1550 In 1562.5 loss = 0.2332 accuracy = 90.6250%
epoch 14 batch 1560 In 1562.5 loss = 0.2628 accuracy = 93.7500%
epoch 14 cost 547.7282028198242s 
epoch 19 batch 1480 In 1562.5 loss = 0.2537 accuracy = 93.7500%
epoch 19 batch 1490 In 1562.5 loss = 0.0164 accuracy = 100.0000%
epoch 19 batch 1500 In 1562.5 loss = 0.2798 accuracy = 90.6250%
epoch 19 batch 1510 In 1562.5 loss = 0.0585 accuracy = 96.8750%
epoch 19 batch 1520 In 1562.5 loss = 0.0779 accuracy = 93.7500%
epoch 19 batch 1530 In 1562.5 loss = 0.2052 accuracy = 90.6250%
epoch 19 batch 1540 In 1562.5 loss = 0.0291 accuracy = 100.0000%
epoch 19 batch 1550 In 1562.5 loss = 0.0898 accuracy = 93.7500%
epoch 19 batch 1560 In 1562.5 loss = 0.0784 accuracy = 96.8750%
epoch 19 cost 547.69011926651s 
epoch 24 batch 1500 In 1562.5 loss = 0.0145 accuracy = 100.0000%
epoch 24 batch 1510 In 1562.5 loss = 0.0015 accuracy = 100.0000%
epoch 24 batch 1520 In 1562.5 loss = 0.0194 accuracy = 100.0000%
epoch 24 batch 1530 In 1562.5 loss = 0.0024 accuracy = 100.0000%
epoch 24 batch 1540 In 1562.5 loss = 0.0042 accuracy = 100.0000%
epoch 24 batch 1550 In 1562.5 loss = 0.0419 accuracy = 96.8750%
epoch 24 batch 1560 In 1562.5 loss = 0.0164 accuracy = 100.0000%
epoch 24 cost 547.8020505905151s 
epoch 29 batch 1550 In 1562.5 loss = 0.0356 accuracy = 100.0000%
epoch 29 batch 1560 In 1562.5 loss = 0.0832 accuracy = 96.8750%
epoch 29 cost 547.8359341621399s 

train finished at 1636751872.6923056
train cost 16433.539677858353s 
final loss = 0.09446956217288971
final accu = 93.75

(2) Monitoring GPU utilization on Windows during training

nvidia-smi.exe -l 3

GPU utilization reached as high as 94% under the following conditions:

  • ResNet network
  • CIFAR100 dataset
  • batch size = 32

(3) Inspecting GPU memory from PyTorch

print(torch.cuda.memory_summary())
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  537296 KB |    4735 MB |  738598 GB |  738598 GB |
|       from large pool |  376576 KB |    4578 MB |  736127 GB |  736127 GB |
|       from small pool |  160720 KB |     158 MB |    2470 GB |    2470 GB |
|---------------------------------------------------------------------------|
| Active memory         |  537296 KB |    4735 MB |  738598 GB |  738598 GB |
|       from large pool |  376576 KB |    4578 MB |  736127 GB |  736127 GB |
|       from small pool |  160720 KB |     158 MB |    2470 GB |    2470 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    4998 MB |    4998 MB |    4998 MB |       0 B  |
|       from large pool |    4838 MB |    4838 MB |    4838 MB |       0 B  |
|       from small pool |     160 MB |     160 MB |     160 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |     821 MB |     862 MB |  340993 GB |  340992 GB |
|       from large pool |     820 MB |     860 MB |  338252 GB |  338251 GB |
|       from small pool |       1 MB |       2 MB |    2740 GB |    2740 GB |
|---------------------------------------------------------------------------|
| Allocations           |    1264    |    1680    |   68026 K  |   68024 K  |
|       from large pool |     104    |     316    |   32216 K  |   32216 K  |
|       from small pool |    1160    |    1371    |   35809 K  |   35808 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    1264    |    1680    |   68026 K  |   68024 K  |
|       from large pool |     104    |     316    |   32216 K  |   32216 K  |
|       from small pool |    1160    |    1371    |   35809 K  |   35808 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     187    |     187    |     187    |       0    |
|       from large pool |     107    |     107    |     107    |       0    |
|       from small pool |      80    |      80    |      80    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      68    |     105    |   31604 K  |   31604 K  |
|       from large pool |      25    |      77    |   15924 K  |   15924 K  |
|       from small pool |      43    |      52    |   15680 K  |   15680 K  |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

(4) Releasing unused cached GPU memory

torch.cuda.empty_cache()
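
Note that empty_cache() only hands cached blocks that are no longer in use back to the driver; memory held by tensors still referenced in Python is unaffected. A quick way to see both figures (supplementary sketch):

# Supplementary sketch: live-tensor memory vs. memory reserved by the caching allocator.
print(torch.cuda.memory_allocated() // 2**20, "MB held by live tensors")
print(torch.cuda.memory_reserved()  // 2**20, "MB reserved by the allocator")
torch.cuda.empty_cache()    # return unused cached blocks to the driver
print(torch.cuda.memory_reserved()  // 2**20, "MB reserved afterwards")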

3.5 Visualizing the loss over iterations

# Plot the loss history
plt.grid()
plt.xlabel("iters")
plt.ylabel("loss")
plt.title("loss", fontsize = 12)
plt.plot(loss_history, "r")
plt.show()

3.6 Visualizing per-batch accuracy over iterations

# Plot the accuracy history
plt.grid()
plt.xlabel("iters")
plt.ylabel("%")
plt.title("accuracy", fontsize = 12)
plt.plot(accuracy_history, "b+")
plt.show()

Chapter 4: Model Validation

4.1 Manual spot check

# Manual spot check
index = 0
print("Fetch one batch of samples")
images, labels = next(iter(test_loader))
images = images.to(device)   # the network lives on the GPU, so move the batch there too
labels = labels.to(device)
print(images.shape)
print(labels.shape)
print(labels)


print("\nPredict every sample in the batch")
outputs = net(images)
print(outputs.data.shape)

print("\nFor each sample, pick the most likely class")
_, predicted = torch.max(outputs, 1)
print(predicted.data.shape)
print(predicted)


print("\nCompare predictions with labels across the batch")
bool_results = (predicted == labels)
print(bool_results.shape)
print(bool_results)

print("\nCount the correct predictions and the accuracy")
corrects = bool_results.sum().item()
accuracy = corrects/(len(bool_results))
print("corrects=", corrects)
print("accuracy=", accuracy)

print("\nSample index =", index)
print("Label          :", labels[index].item())
print("Class scores   :", outputs.data[index].cpu().numpy())   # CUDA tensors need .cpu() before .numpy()
print("Predicted class:", predicted.data[index].item())
print("Correct        :", bool_results.data[index].item())
Fetch one batch of samples
torch.Size([32, 3, 224, 224])
torch.Size([32])
tensor([16, 73, 86, 25, 95,  4, 93, 17, 65, 81, 14, 58, 19, 91, 44, 60,  4, 32,
        34, 13, 52, 58,  2,  4, 11, 33, 68, 58, 23, 40, 59, 34],
       device='cuda:0')

Predict every sample in the batch
torch.Size([32, 100])

For each sample, pick the most likely class
torch.Size([32])
tensor([16, 73, 20, 25, 95, 74, 45, 15, 54, 81, 18, 58, 19, 32, 51, 60, 55, 14,
        34, 13, 59, 58, 57, 73,  4, 33, 68, 58, 73, 40, 47, 34],
       device='cuda:0')

Compare predictions with labels across the batch
torch.Size([32])
tensor([ True,  True, False,  True,  True, False, False, False, False,  True,
        False,  True,  True, False, False,  True, False, False,  True,  True,
        False,  True, False, False, False,  True,  True,  True, False,  True,
        False,  True], device='cuda:0')

Count the correct predictions and the accuracy
corrects= 16
accuracy= 0.5

Sample index = 0
Label          : 16
Class scores   : [ -2.5697033   -1.220762     5.808267     1.0053865   -5.303615
  -6.0952377   -4.375238    -7.512243     6.0723825   10.191355
   0.35357097   6.1513376   -0.15592083   1.59748      0.98906416
   6.4971786   15.163284    13.626103    -3.0172207    6.5226107
  -5.030594     4.090198     9.860107    -2.37206    -11.219335
   1.1417238  -10.93424     -6.318894    -3.6712632   -1.5013194
  -6.6784906   10.382248     2.768404     3.186685    -4.0075517
   7.340126     3.0197084   13.070431     5.8434143    9.353743
   5.7942214   -2.6026294    1.9901704    2.215488   -15.612972
  -5.3089857    7.888714    -8.096723     1.6488069    1.1043426
  -0.43023878  -0.130286    -5.267493    -3.2633924   -2.571636
  -5.6545606   -1.6048725    1.5597013   -8.894119     0.795938
  -3.3343716   -4.036054     1.180913     3.0039542   12.392508
   9.878975    11.541693    -5.346169    -2.3276188   -4.377976
  -3.426058    -8.6689415   -0.97619486  -1.8408787  -11.230231
  -1.0999361    6.2287726  -13.040726    -1.2952008   -7.3313904
  -3.3088627    5.210841    -6.2590923    1.3729277   -0.26622474
  -3.5561118   10.700713     2.9397843   -2.2580347   -4.614285
   2.527799    -8.197358     1.2615534   -9.895809     6.7659397
  -5.247879     2.1971831    9.978545     8.740657    -7.7820134 ]
Predicted class: 16
Correct        : True

4.2 Accuracy on the full training set: about 98%

# Evaluate the trained model: overall accuracy on the training set
correct_dataset  = 0
total_dataset    = 0
accuracy_dataset = 0.0

net.eval()   # use BatchNorm running statistics during evaluation

# No gradients are needed during evaluation
with torch.no_grad():
    for i, data in enumerate(train_loader):
        # fetch one batch of samples
        images, labels = data
        images = images.to(device)
        labels = labels.to(device)
        
        # predict every sample in the batch
        outputs = net(images)
        
        # for each sample, pick the most likely class
        _, predicted = torch.max(outputs.data, 1)
        
        # accumulate the sample count
        total_dataset += labels.size()[0] 
        
        # compare predictions with labels across the batch
        bool_results = (predicted == labels)
        
        # accumulate the number of correct predictions
        correct_dataset += bool_results.sum().item()
        
        # running accuracy so far
        accuracy_dataset = 100 * correct_dataset/total_dataset
        
        if(i % 100 == 0):
            print('batch {} In {} accuracy = {:.4f}'.format(i, len(train_data)/batch_size, accuracy_dataset))
            
print('Final result with the model on the dataset, accuracy =', accuracy_dataset)
batch 0 In 1562.5 accuracy = 96.8750
batch 100 In 1562.5 accuracy = 98.8552
batch 200 In 1562.5 accuracy = 98.8184
batch 300 In 1562.5 accuracy = 98.8476
batch 400 In 1562.5 accuracy = 98.9557
batch 500 In 1562.5 accuracy = 98.9958
batch 600 In 1562.5 accuracy = 99.0017
batch 700 In 1562.5 accuracy = 98.9881
batch 800 In 1562.5 accuracy = 98.9739
batch 900 In 1562.5 accuracy = 98.9456
batch 1000 In 1562.5 accuracy = 98.9417
batch 1100 In 1562.5 accuracy = 98.9356
batch 1200 In 1562.5 accuracy = 98.9124
batch 1300 In 1562.5 accuracy = 98.9047
batch 1400 In 1562.5 accuracy = 98.9026
batch 1500 In 1562.5 accuracy = 98.9028
Final result with the model on the dataset, accuracy = 98.902

Note: accuracy on the training set reaches 98%+.

4.3 Accuracy on the full test set: only about 53%

# Evaluate the trained model: overall accuracy on the test set
correct_dataset  = 0
total_dataset    = 0
accuracy_dataset = 0.0
net.eval()
# No gradients are needed during evaluation
with torch.no_grad():
    for i, data in enumerate(test_loader):
        # fetch one batch of samples
        images, labels = data
        images = images.to(device)
        labels = labels.to(device)
        
        # predict every sample in the batch
        outputs = net(images)
        
        # for each sample, pick the most likely class
        _, predicted = torch.max(outputs.data, 1)
        
        # accumulate the sample count
        total_dataset += labels.size()[0] 
        
        # compare predictions with labels across the batch
        bool_results = (predicted == labels)
        
        # accumulate the number of correct predictions
        correct_dataset += bool_results.sum().item()
        
        # running accuracy so far
        accuracy_dataset = 100 * correct_dataset/total_dataset
        
        if(i % 100 == 0):
            print('batch {} In {} accuracy = {:.4f}'.format(i, len(test_data)/batch_size, accuracy_dataset))
            
print('Final result with the model on the dataset, accuracy =', accuracy_dataset)
batch 0 In 312.5 accuracy = 43.7500
batch 100 In 312.5 accuracy = 52.1968
batch 200 In 312.5 accuracy = 53.0162
batch 300 In 312.5 accuracy = 53.0108
Final result with the model on the dataset, accuracy = 52.89

Chapter 9: Saving the Model

A model that took hours to train should not be lost, so save it to disk.

# Save the whole model object
torch.save(net, "models/resnet_model_cifar100.pkl")

# Save only the parameters (state_dict)
torch.save(net.state_dict(), "models/resnet_model_params_cifar100.pkl")

Chapter 10: Restoring the Model

A model can be restored in two ways:

(1) Load the whole trained model object directly.

(2) Load only the trained parameters:

  • first instantiate the predefined model architecture,
  • then load the trained parameters into it (a sketch of both ways follows below).

This article builds its model in the second way.
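
A minimal sketch of both approaches, assuming the file names used in the save step above:

# (1) Restore the whole pickled model object in a single call
net_restored = torch.load("models/resnet_model_cifar100.pkl")

# (2) Rebuild the architecture first, then load the trained parameters into it
net_restored = models.resnet101(num_classes = 100)
net_restored.load_state_dict(torch.load("models/resnet_model_params_cifar100.pkl"))

net_restored.eval()   # switch to evaluation mode before running inference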

Chapter 11: Closing Thoughts

11.1 Training setup

Training was done on a single RTX 2070 GPU.

11.2 Analysis of the training results

(1) Training set

Although this GPU can train AlexNet on CIFAR10 quickly and efficiently, training ResNet on CIFAR100 with it is still quite time-consuming.

With batch size = 32, each epoch takes about 600 s, roughly 10 minutes.

By epoch 5, accuracy reaches about 50% (roughly 50 min of training).

By epoch 10, accuracy reaches about 70% (roughly 100 min, about an hour and a half).

By epoch 15, accuracy reaches about 85% (roughly 150 min, over two hours; see the epoch-14 batch accuracies in the training log).

By epoch 20, accuracy reaches about 95% (roughly 200 min, about three hours).

By epoch 25, accuracy reaches about 98% (roughly 250 min, about four hours).

(2) Test set

batch 300 In 312.5 accuracy = 53.0108 / Final result with the model on the dataset, accuracy = 52.89

The test-set accuracy is clearly far below the training-set accuracy; the cause, and how to address it, will be explored in a follow-up article.
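
Such a gap is a classic sign of overfitting. Purely for reference (a sketch of one common first mitigation, not the method used in this article), light augmentation of the training transform often narrows it:

# Sketch: random crops and flips so the network sees varied views instead of
# memorizing each training image (the test transform stays deterministic).
transform_train_aug = transforms.Compose(
    [transforms.Resize(256),
     transforms.RandomCrop(224),           # random instead of center crop
     transforms.RandomHorizontalFlip(),
     transforms.ToTensor(),
     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])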


Author's homepage (文火冰糖的硅基工坊): 文火冰糖(王文兵)的博客_文火冰糖的硅基工坊_CSDN博客

Article URL: https://blog.csdn.net/HiWangWenBing/article/details/121288804
