(Series complete) Deep Learning from Zero with PyTorch on the MNIST Dataset, Day 3: Training the Model

1. Introduction

Today is the third day of running the MNIST handwritten-digit dataset with the PyTorch framework; the focus is training the network. This post mainly records the learning path and collects the learning materials.

Note: the code below is written for Python 2.7.

Day 1 (building the LeNet network): https://blog.csdn.net/qq_36627158/article/details/108098147

Day 2 (loading the MNIST dataset): https://blog.csdn.net/qq_36627158/article/details/108119048

Day 3 (training the model): https://blog.csdn.net/qq_36627158/article/details/108163693

Day 4 (single-sample testing): https://blog.csdn.net/qq_36627158/article/details/108183655

2. Code (mnist_train.py)

Thanks to Kai (凯神) for providing the code and for the patient guidance!

from lenet import Net
import torch
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
from mnist_load import testset_loader, trainset_loader


LEARNING_RATE = 0.001
MOMENTUM = 0.9
EPOCH = 5


if torch.cuda.is_available():
    device = torch.device('cuda')
    print 'cuda'
else:
    device = torch.device('cpu')
    print 'cpu'


mnist_model = Net().to(device)

optimizer = optim.SGD(
    mnist_model.parameters(),
    lr=LEARNING_RATE,
    momentum=MOMENTUM
)


# save_model
def save_checkpoint(checkpoint_path, model, optimizer):
    # state_dict: a Python dictionary object that:
    # - for a model, maps each layer to its parameter tensor;
    # - for an optimizer, contains info about the optimizer's states and hyperparameters used.
    state = {
        'model': model.state_dict(),
        'optimizer' : optimizer.state_dict()
    }
    torch.save(state, checkpoint_path)
    print 'model saved to ', checkpoint_path


# train
def mnist_train(epoch, save_interval):
    mnist_model.train()  # set training mode

    iteration = 0
    loss_plt = []

    for ep in range(epoch):
        for batch_idx, batch_data in enumerate(trainset_loader):
            images, labels = batch_data
            images = images.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()

            output = mnist_model(images)

            loss = F.cross_entropy(output, labels)
            loss_plt.append(loss.item())  # store the scalar value; appending the tensor itself would also keep its autograd graph alive

            loss.backward()
            optimizer.step()

            print 'Train Epoch:', ep+1, '\tBatch: ', batch_idx+1, '/', len(trainset_loader), '\tLoss: ', loss.item()

            # different from before: saving model checkpoints
            if iteration % save_interval == 0 and iteration > 0:
                save_checkpoint('module/pytorch-mnist-batchsize-128-%i.pth' % iteration, mnist_model, optimizer)

            iteration += 1

        mnist_test()
        mnist_model.train()  # mnist_test() switched the model to eval mode; switch back for the next epoch

    # save the final model
    save_checkpoint('module/pytorch-mnist-batch-128-%i.pth' % iteration, mnist_model, optimizer)
    plt.plot(loss_plt, label='loss')
    plt.legend()
    plt.show()


# test
def mnist_test():
    mnist_model.eval()  # set evaluation mode

    test_loss = 0
    correct = 0

    with torch.no_grad():
        for images, labels in testset_loader:
            images = images.to(device)
            labels = labels.to(device)

            output = mnist_model(images)

            test_loss += F.cross_entropy(output, labels, reduction='sum').item()  # sum over the batch, so dividing by the dataset size below gives the per-sample average

            pred = output.max(1, keepdim=True)[1]  # get the index of the class with the highest score
            correct += pred.eq(labels.view_as(pred)).sum().item()


    test_loss /= len(testset_loader.dataset)

    print '\nTest set: Average loss:', test_loss, '\tAccuracy:', (100. * correct / len(testset_loader.dataset)), '%\n'


if __name__ == '__main__':
    mnist_train(EPOCH, save_interval=1000)


3. Materials

1. torch.optim, the optimization-algorithms package

https://pytorch.org/docs/stable/optim.html

4. Details

1. OSError: [Errno 12] Cannot allocate memory

At first I thought my machine was just too weak (not enough RAM) to load many images per batch, so I kept lowering the BATCH_SIZE hyperparameter. When the error kept appearing no matter how small BATCH_SIZE got, I realized it probably was not a memory-capacity problem.

Looking it up later, it turned out to be the number of data-loading workers (the DataLoader's num_workers).

https://blog.csdn.net/breeze210/article/details/99679048
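
A hedged sketch of the fix: when building the loaders in mnist_load.py, lower num_workers (or set it to 0 to load batches in the main process) so no extra worker processes have to be spawned. The variable name trainset below is only a placeholder for whatever Dataset day 2 defines.

from torch.utils.data import DataLoader

# "trainset" is a placeholder for the MNIST Dataset object built on day 2
trainset_loader = DataLoader(
    trainset,
    batch_size=128,
    shuffle=True,
    num_workers=0   # 0 = load batches in the main process; no extra workers
)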

2. You have to create the module folder yourself first

Right, it turns out that when Python writes a file, it does not create missing folders along the path for you. Noted!
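
A minimal sketch (Python 2.7 compatible) that creates the checkpoint folder before save_checkpoint() is first called:

import os

# torch.save() / open() will not create missing directories, so create "module" first
if not os.path.exists('module'):
    os.makedirs('module')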

3. The momentum parameter of the optimizer (more reading on optimizers still needed)

Kai's explanation: momentum is a parameter used in stochastic gradient descent when updating the model weights.

https://www.lizenghai.com/archives/29512.html

https://pytorch.org/docs/stable/optim.html
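
Roughly, the rule behind the MOMENTUM hyperparameter looks like the sketch below (an illustration only, not PyTorch's internal code): a "velocity" accumulates past gradients so the update keeps moving in a consistent direction.

def sgd_momentum_step(w, grad, v, lr=0.001, momentum=0.9):
    # v is the velocity: an exponentially weighted sum of past gradients
    v = momentum * v + grad
    # the weight moves along the velocity instead of along the raw gradient
    w = w - lr * v
    return w, v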

4. model.to(device)

Copies the tensors (the model's parameters here, and the data tensors when they are first read) onto the specified device; all subsequent computation then runs on that device.

https://www.jb51.net/article/178049.htm
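
In mnist_train.py above this happens in two places: the model is moved once, but every batch read from the loader also has to be moved, otherwise the forward pass would mix CPU and GPU tensors. A condensed sketch:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

mnist_model = Net().to(device)      # copies all parameters/buffers onto the device
for images, labels in trainset_loader:
    images = images.to(device)      # the data tensors must be moved too
    labels = labels.to(device)
    output = mnist_model(images)    # the whole forward pass now runs on one device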

5. Module.parameters()

https://blog.csdn.net/qq_39463274/article/details/105295272?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
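
A quick sketch of what parameters() yields: one tensor per layer weight/bias, which is exactly what is handed to optim.SGD above (the layer names depend on the Net defined on day 1).

for name, p in mnist_model.named_parameters():
    # e.g. conv1.weight, conv1.bias, fc1.weight, ... together with their shapes
    print name, tuple(p.size()), p.requires_grad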

6. if __name__ == "__main__"

https://blog.csdn.net/yjk13703623757/article/details/77918633
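
A tiny demo of the guard (demo.py is a hypothetical file): the entry point only runs when the file is executed directly, not when another script, such as day 4's test script, imports from it.

# demo.py (hypothetical, only to illustrate the guard)
def main():
    print 'running as a script'

if __name__ == '__main__':
    # executed by "python demo.py", but NOT by "import demo" from another module
    main()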

7. state_dict()
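
A sketch to peek at what save_checkpoint() above actually serializes: the model's state_dict maps each layer name to its parameter tensor, and the optimizer's state_dict records its hyperparameters and internal state (exact key names and shapes depend on day 1's Net).

state = mnist_model.state_dict()
for key in state:
    print key, tuple(state[key].size())          # e.g. conv1.weight and its shape

print optimizer.state_dict()['param_groups']     # lr, momentum, the managed parameter ids, ...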

8. checkpoint

A way to save the current state of your experiment so that you can pick up from where you left off.

https://www.cnblogs.com/Arborday/p/9740253.html
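
The natural counterpart of save_checkpoint() is a loader. The helper below is not part of the original script, just a hedged sketch of how a saved .pth file would be restored to resume training:

def load_checkpoint(checkpoint_path, model, optimizer):
    # restore both the weights and the optimizer state written by save_checkpoint()
    state = torch.load(checkpoint_path)
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    print 'model loaded from ', checkpoint_path

# e.g. load_checkpoint('module/pytorch-mnist-batchsize-128-1000.pth', mnist_model, optimizer)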

9. Why optimizer.zero_grad() is needed

Because the gradients computed in the backward pass are accumulated into each parameter's .grad by default; clearing them first keeps the previous iteration's gradients from contaminating the current one.

https://blog.csdn.net/scut_salmon/article/details/82414730
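
A tiny self-contained sketch of that accumulation: without zeroing, each backward() adds onto the existing .grad.

import torch

x = torch.ones(1, requires_grad=True)

(2 * x).backward()
print x.grad          # tensor([2.])

(2 * x).backward()    # without zeroing, the new gradient is ADDED to the old one
print x.grad          # tensor([4.])

x.grad.zero_()        # this is what optimizer.zero_grad() does for every parameter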

10. The difference between optimizer.step() and loss.backward()

At first I could not tell what each of these two functions does. An analogy from a video made it click:

In linear regression, the weight-update formula is w_new = w_old - lr * gradient.

loss.backward() is the part that computes the gradient.

optimizer.step() is the part that applies w_new = w_old - lr * gradient using that gradient.

https://v.qq.com/x/page/t0554h33liw.html
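
The same division of labour written out for plain SGD (momentum left out for clarity; a sketch of the idea, not PyTorch's internal code):

loss = F.cross_entropy(mnist_model(images), labels)

loss.backward()    # autograd fills p.grad for every parameter p: the "gradient" in the formula

# roughly what optimizer.step() then does for plain SGD:
with torch.no_grad():
    for p in mnist_model.parameters():
        p -= LEARNING_RATE * p.grad    # w_new = w_old - lr * gradient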

11. with torch.no_grad() and model.eval()

Use both. They do different things and have different scopes.
torch.no_grad(): disables tracking of gradients in autograd.
model.eval(): changes the forward() behaviour of the module it is called on, e.g. it disables dropout and makes batch norm use its running (population) statistics.

https://www.cnblogs.com/shiwanghualuo/p/11789018.html

https://blog.csdn.net/songyu0120/article/details/103884586?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param
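
How the two are combined in practice, as in mnist_test() above; a condensed sketch:

mnist_model.eval()             # dropout off, batch norm uses its running statistics
with torch.no_grad():          # no autograd graph is built: faster and less memory
    for images, labels in testset_loader:
        output = mnist_model(images.to(device))

mnist_model.train()            # switch back before training continues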
