0. Abstract

Restart techniques are common in gradient-free optimization to deal with multi-modal functions.

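This shows that warm restarts were not invented in this paper; they were already in use in gradient-free optimizers.
The contribution here, then, is applying warm restarts to a gradient-based optimizer.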

In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks.

This states the paper's contribution: applying warm restarts to a gradient-based optimizer (SGD).

1. Introduction

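The authors first note that DNNs (Deep Neural Networks) perform very well in classification, object detection, speech processing, and similar tasks. They then raise a problem: despite this strong performance, DNNs are usually trained on large-scale datasets, which often takes several days. How to reduce training time effectively is therefore a question worth studying.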

This introduces the problem and underlines the value of their work.

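The authors also point out that, at the time, the best-performing models on large-scale datasets (CIFAR, MS COCO, PASCAL) were not trained with more advanced optimizers such as AdaDelta or Adam, but with the classic SGD optimizer.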

The authors then introduce learning rate schedules and explain: A common learning rate schedule is to use a constant learning rate and divide it by a fixed constant in (approximately) regular intervals.


In other words, the strong models of that period were trained with SGD, but their learning rate followed a step-decay schedule.
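The step-decay schedule described above can be sketched in a few lines of Python. This is only an illustration: the function name `step_decay_lr` and the values used (start at 0.1, divide by 10 every 30 epochs) are assumptions, not settings taken from the paper.

```python
def step_decay_lr(base_lr, epoch, step_size, factor):
    """Learning rate after dividing base_lr by `factor` once every `step_size` epochs."""
    num_drops = epoch // step_size  # how many drops have happened so far
    return base_lr / (factor ** num_drops)


# Hypothetical values: start at 0.1 and divide by 10 every 30 epochs.
for epoch in (0, 29, 30, 60):
    print(epoch, step_decay_lr(0.1, epoch, step_size=30, factor=10.0))
```

The learning rate stays constant inside each interval and only changes at the interval boundaries, which gives the staircase shape.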

Note: in the learning rate plot, the logarithmic axis obscures the typical shape of the cosine function.

In this paper, we propose to periodically simulate warm restarts of SGD, where in each restart the learning rate is initialized to some value and is scheduled to decrease.

The authors propose their method, SGD with warm restarts (SGDR from here on), and use it to retrain four models.

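The experiments show that the warm-restart method needs roughly 2x to 4x fewer epochs than the original schedules. Beyond the shorter training time, SGDR reaches test errors of 3.14% on CIFAR-10 and 16.21% on CIFAR-100, which points to two advantages of SGD with warm restarts:

Faster model convergence
Higher model accuracy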

3. SGDR (Stochastic Gradient Descent with Warm Restarts)

To keep warm restarts simple and easy to generalize, the authors reduce the schedule to the following formula:

$\eta_t = \eta^i_{min} + \frac{1}{2}(\eta^i_{max} - \eta^i_{min}) (1 + \cos(\frac{T_{cur}}{T_i}\pi))$

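Here $\eta_t$ is the current learning rate, $\eta^i_{min}$ and $\eta^i_{max}$ bound the learning rate during the $i$-th run, $T_i$ is the length of the $i$-th run in epochs, and $T_{cur}$ is the number of epochs performed since the last restart. When $t = 0$ and $T_{cur} = 0$, the cosine term equals $1$ and the learning rate is at its maximum, $\eta_t = \eta^i_{max}$. When $T_{cur} = T_i$, the cosine term equals $-1$ and the learning rate is at its minimum, $\eta_t = \eta^i_{min}$.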
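The formula can be checked numerically with a small Python sketch. The function name `sgdr_lr` and the range $[0, 0.1]$ are assumptions chosen for illustration; only the formula itself comes from the text above.

```python
import math


def sgdr_lr(eta_min, eta_max, t_cur, t_i):
    """Cosine-annealed learning rate within one run of length t_i epochs."""
    cosine = math.cos(math.pi * t_cur / t_i)  # goes from 1 (t_cur = 0) to -1 (t_cur = t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


# Hypothetical range [0, 0.1] and run length of 10 epochs.
for t_cur in (0, 5, 10):
    print(t_cur, sgdr_lr(0.0, 0.1, t_cur, 10))
```

At $T_{cur} = 0$ the result is $\eta^i_{max}$, at $T_{cur} = T_i / 2$ it is the midpoint of the range, and at $T_{cur} = T_i$ it is $\eta^i_{min}$ (up to floating-point rounding).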

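To make SGDR more broadly applicable, the authors suggest starting with a small initial period $T_0$ and multiplying the period by $T_{mult}$ at every restart. They recommend either of the following settings: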

$\begin{cases} T_0 = 1, T_{mult}=2 \\ T_0 = 10, T_{mult}=2 \end{cases}$
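To see how $T_0$ and $T_{mult}$ interact, the following sketch generates one learning rate per epoch and records where the restarts fall. It is a simplification under stated assumptions: `sgdr_schedule` is a hypothetical helper, the range $[0, 0.1]$ is illustrative, and $T_{cur}$ is advanced once per epoch, whereas the paper updates it at every batch so it can take fractional values.

```python
import math


def sgdr_schedule(num_epochs, t_0, t_mult, eta_min=0.0, eta_max=0.1):
    """Return (per-epoch learning rates, epochs at which a new run starts)."""
    lrs, restarts = [], []
    t_i = t_0   # length of the current run, in epochs
    t_cur = 0   # epochs elapsed since the last restart
    for epoch in range(num_epochs):
        cosine = math.cos(math.pi * t_cur / t_i)
        lrs.append(eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine))
        t_cur += 1
        if t_cur == t_i:                # run finished
            restarts.append(epoch + 1)  # the next epoch starts a new run
            t_cur = 0
            t_i *= t_mult               # each run is t_mult times longer
    return lrs, restarts


lrs, restarts = sgdr_schedule(num_epochs=35, t_0=10, t_mult=2)
print(restarts)  # [10, 30]
```

With per-epoch stepping the learning rate jumps back to the maximum at each restart epoch but never lands exactly on the minimum, because the run ends one step before $T_{cur} = T_i$ would be evaluated.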

4. Experiments

The authors reproduce several network models; the results follow.


Figure 2: Test errors on CIFAR-10 (left column) and CIFAR-100 (right column) datasets. Note that for SGDR we only plot the recommended solutions. The top and middle rows show the same results on WRN-28-10, with the middle row zooming into the good performance region of low test error. The bottom row shows performance with a wider network, WRN-28-20.


5. Summary

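SGDR's warm-restart learning rate schedule is effective, particularly for residual architectures. Compared with a hand-designed step-decay learning rate, SGDR reaches the same accuracy earlier (roughly 2x to 4x faster).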

6. PyTorch code

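# Imports
from torch import optim
from torch.optim import lr_scheduler

# Build the model (generate_model and opt are defined elsewhere and not shown in this post)
model, parameters = generate_model(opt)

# Define the optimizer; PyTorch requires dampening=0 when Nesterov momentum is enabled
if opt.nesterov:
    dampening = 0
else:
    dampening = 0.9
optimizer = optim.SGD(parameters, lr=0.1, momentum=0.9, dampening=dampening, weight_decay=1e-3, nesterov=opt.nesterov)

# Define the warm-restart schedule: the first run lasts T_0=10 epochs, each later run is T_mult=2 times longer
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0, last_epoch=-1)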

7. Computing the restart epochs

Variable	Epoch of restart
a	$T_0$
b	$a \times T_{mult} + a$ (equal to $a \times 3$ when $T_{mult} = 2$)
c	$b \times T_{mult} + a$
d	$c \times T_{mult} + a$
e	$d \times T_{mult} + a$
…	…

A simple example, with $T_0 = 10$ and $T_{mult} = 2$:

Variable	Epoch of restart
a	$10$
b	$10 \times 2 + 10 = 30$
c	$30 \times 2 + 10 = 70$
d	$70 \times 2 + 10 = 150$
e	$150 \times 2 + 10 = 310$
…	…
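The recurrence in the table (each restart epoch is the previous one times $T_{mult}$, plus $T_0$) is easy to script. The helper name `restart_epochs` is an assumption introduced here for illustration.

```python
def restart_epochs(t_0, t_mult, count):
    """First `count` restart epochs via the recurrence next = prev * t_mult + t_0."""
    epochs = []
    current = t_0  # the first restart happens after the first run of t_0 epochs
    for _ in range(count):
        epochs.append(current)
        current = current * t_mult + t_0
    return epochs


print(restart_epochs(10, 2, 5))  # [10, 30, 70, 150, 310]
```

For $T_{mult} > 1$ the same values follow from the geometric-series closed form $T_0 \cdot (T_{mult}^k - 1) / (T_{mult} - 1)$ for the $k$-th restart, and for $T_{mult} = 1$ the recurrence reduces to a restart every $T_0$ epochs.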
