The powerful deep learning optimizer Ranger is now open source: RAdam + LookAhead is a strong combination, with better performance and faster convergence

As one half of the Ranger optimizer, LookAhead was first proposed by Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey Hinton in the July 2019 paper "Lookahead Optimizer: k steps forward, 1 step back". Inspired by recent advances in understanding the loss surfaces of neural networks, Lookahead offers a new way to stabilize deep learning training and speed up convergence. Building on the breakthrough in variance management achieved by RAdam (Rectified Adam), I found that combining RAdam and Lookahead into Ranger yields a true "dream team", exceeding even RAdam's own level of optimization.

I have combined the two into a single Ranger optimizer codebase, to make them easier to use and to integrate with FastAI. The Ranger source code is available now.


Comparison of vanilla Adam, SGD, and LookAhead + Adam/SGD on an LSTM (from the LookAhead paper)

Why RAdam and LookAhead can complement each other

RAdam arguably provides the best foundation for an optimizer at the start of training. It uses a dynamic rectifier to adjust Adam's adaptive momentum based on the variance actually observed, effectively providing an automatic warm-up, tailored to the dataset at hand, that ensures training takes its first steps on solid ground.

LookAhead, inspired by recent advances in understanding the loss surfaces of deep neural networks, provides a robust and stable breakthrough across the entire training run.

To quote the LookAhead team: LookAhead "lessens the need for extensive hyperparameter tuning" while achieving "faster convergence across different deep learning tasks with minimal computational overhead."

Each therefore brings a breakthrough to a different aspect of deep learning optimization, and the combination is highly synergistic, potentially offering the best of both improvements for your deep learning results. By combining the two latest breakthroughs (RAdam + LookAhead), Ranger promises to give deep learning a new boost in the pursuit of ever more stable and powerful optimization methods.

Zhang, Lucas, Hinton, and Ba state: "We empirically demonstrate that Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings, on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank."

A comparison of Lookahead and SGD in practice: thanks to its dual exploration setup, Lookahead reaches tighter (better) minima. (From the Lookahead paper)

Following my earlier introduction to RAdam, this article explains what Lookahead is, and how combining RAdam and Lookahead into a single optimizer, Ranger, achieves higher accuracy. A quick summary: I ran a 20-epoch test and personally obtained an accuracy 1% higher than the current FastAI leaderboard record:

Ranger hit 93% accuracy on its first 20-epoch test

The FastAI leaderboard record for the 20-epoch test is 92%

More importantly, anyone can use the source code and accompanying notes to try Ranger and check whether it improves the stability and accuracy of your own deep learning results!

Below, we delve into the two major components that drive Ranger: RAdam and Lookahead.

What is RAdam (Rectified Adam)

A brief summary of RAdam: its researchers investigated the mechanics of adaptive momentum optimizers (Adam, RMSProp, etc.) and found that all of them need a warm-up phase, without which training often falls into poor local optima right at the start.

When the optimizer has not yet seen enough data to make accurate adaptive momentum decisions, we see this poor behavior in the initial phase of training. Warm-up therefore helps reduce variance at the start of training... but even once the required amount of warm-up has been established, we still have to tune it manually and adjust it for each specific dataset.

Rectified Adam was therefore created: it uses a rectification function, driven by the variance actually encountered, to determine the "warm-up heuristic". The rectifier dynamically switches off and "clamps down" the adaptive momentum, ensuring it does not jump ahead at full speed until the variance of the data has settled.

In this way, we neatly escape the need for a manual warm-up, and training stabilizes automatically.
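Concretely, the rectifier can be sketched from the formulas in the RAdam paper: ρ_t approximates the length of the simple moving average the adaptive term has effectively seen, and the rectification factor r_t scales the adaptive step once enough variance evidence has accumulated. The sketch below is a minimal numerical illustration of the math only, not the full optimizer (the function name is mine, not from the RAdam codebase):

```python
import math

def radam_rectification(t, beta2=0.999):
    """Return (rho_t, r_t) for step t, per the RAdam paper.

    rho_t approximates the effective SMA length; while rho_t <= 4 the
    variance of the adaptive term is intractable, so RAdam takes an
    un-adapted (SGD-with-momentum-style) step and r_t is None."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return rho_t, None  # adaptivity switched off at this step
    # Rectification factor: rises toward 1 as the variance stabilizes.
    r_t = math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                    / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
    return rho_t, r_t

# At step 1, rho_1 is about 1 (far below 4), so no adaptive step is
# taken yet. By step 1000, r_t is partway to 1; it approaches 1 as t
# grows, gradually releasing full Adam behavior -- an automatic warm-up.
```

This is exactly the "no manual warm-up" behavior described above: the schedule emerges from the variance statistics rather than from a hand-tuned epoch count.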

Once the variance stabilizes, RAdam essentially acts as Adam, or even SGD, for the remainder of training. RAdam's contribution, then, is mainly at the start of training.

Readers may notice in the results section that although RAdam outperforms Adam over long stretches of training... plain SGD eventually catches up and surpasses both RAdam and Adam in final accuracy.

This is where LookAhead comes in: integrated as a new exploration mechanism, it maintains its accuracy advantage over SGD even at very high epoch counts (even thousands of epochs).

Lookahead: an auxiliary system for exploring the loss surface, delivering faster and more stable exploration and convergence

According to the Lookahead researchers, most successful optimizers today are built on top of SGD, adding either adaptive momentum (Adam, AdaGrad) or an accelerated form (Nesterov momentum or Polyak's heavy ball) to improve exploration and training, and ultimately convergence.

Lookahead, however, is a new kind of development: it keeps two sets of weights and interpolates between them, letting the faster set "look ahead" and explore while the slower set stays behind to provide long-term stability.

The result is lower variance during training, reduced sensitivity to suboptimal hyperparameters, and less need for extensive hyperparameter tuning. This delivers faster convergence on a wide range of deep learning tasks. In other words, it is an impressive breakthrough.


A simple analogy may help clarify what Lookahead does. Imagine standing at the summit of a mountain range, with ridges of varying heights all around. One of them eventually leads down to the foot of the mountain and a successful descent, but the others merely wind around without ever reaching the bottom.

Exploring on foot is hard, because choosing one route means abandoning the others, at least until we finally find the right path.

But suppose we leave a companion at or near the summit who calls us back whenever a route starts to look bad. This helps us make rapid progress in finding a way down, because we can scan the whole terrain faster and are less likely to get stuck in dead ends.

That is essentially what Lookahead does. It keeps an extra copy of the weights and then lets the internal, "faster" optimizer (in Ranger, that is RAdam) explore for 5 or 6 batches. The batch interval is specified by the k parameter.

Then, each time the k interval triggers, Lookahead takes the difference between its saved weights and RAdam's latest weights, multiplies it by the alpha parameter (0.5 by default) every k batches, and updates RAdam's weights accordingly.
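In code, that interpolation is a single line. The toy sketch below (scalar "weights" and a hypothetical helper name, purely for illustration) shows the k-step cycle:

```python
def lookahead_sync(slow, fast, alpha=0.5):
    """Move the slow weights a fraction alpha toward the fast weights,
    then restart the fast (inner-optimizer) weights from that point."""
    slow = slow + alpha * (fast - slow)
    return slow, slow  # fast weights are reset to the new slow weights

# Toy run: pretend each inner step moves the fast weights by 1.0;
# every k steps the slow weights absorb half the progress and
# re-anchor the fast weights.
slow = fast = 0.0
k = 5
for step in range(1, 11):
    fast += 1.0                 # stand-in for one inner (RAdam) update
    if step % k == 0:           # the k-interval triggers
        slow, fast = lookahead_sync(slow, fast, alpha=0.5)
print(slow, fast)  # -> 5.0 5.0
```

Note how the slow weights only ever move half (alpha) of the distance the fast weights covered, which is exactly the pull-back behavior described next.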


Ranger code showing Lookahead updating RAdam's parameters

The result is effectively a fast-moving average from the internal optimizer (RAdam in this case) plus a slow-moving average maintained by Lookahead. The fast average explores; the slow average acts as a pull-back or stabilizing mechanism. Usually the slow weights simply trail behind, but sometimes they also "clamp" the fast weights down onto a result that is more likely to be correct.

With the safety net Lookahead provides, the optimizer can fully explore the paths downhill without worrying about getting stuck in a dead end.

This approach is completely different from the two mainstream mechanisms in use today: adaptive momentum, and "heavy ball" / Nesterov-style momentum.

Therefore, thanks to the marked improvement in training stability, Lookahead can finish its exploration and "clamp down" sooner, ultimately achieving accuracy results that surpass SGD.

Ranger: a single optimizer codebase integrating RAdam and Lookahead

Lookahead can run on top of any optimizer to obtain its "fast" weights. The paper used vanilla Adam, since RAdam had only appeared a month earlier.


A PyTorch implementation of LookAhead (lonePatient's implementation: https://github.com/lonePatient/lookahead_pytorch)

However, to ease code integration with FastAI and encourage broader use, I went a step further and merged the two into a single optimizer named Ranger (the "Ra" in Ranger is a nod to Rectified Adam; I chose "Ranger" because LookAhead works like a ranger scouting a way out when you are lost).


My personal 20-epoch high score on ImageNette, and also my very first run with Ranger. (The current leaderboard stands at 92%.) Note, too, that training stability improved.

Use Ranger right away!

There are already several Lookahead implementations on GitHub. I started from lonePatient's, because I liked its concise code style, and built on top of it. For RAdam, I naturally used the code from the official RAdam GitHub repository. Ranger's source file is here:

https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer

Usage guide:

1 — Copy ranger.py into your project directory.

2 — Import Ranger:


Import Ranger and it is ready to use.

3 — Create a partial for Ranger in FastAI, then set the learner's opt_func to that partial.
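The "partial" here is just functools.partial binding your chosen hyperparameters, so FastAI can later call the optimizer with the model's parameters. Below is a minimal sketch using a stand-in class in place of the real Ranger (the class body and hyperparameter defaults are illustrative, not the actual ranger.py implementation):

```python
from functools import partial

# Stand-in for the real Ranger class from ranger.py, shown only to
# illustrate the calling convention FastAI expects.
class Ranger:
    def __init__(self, params, lr=1e-3, alpha=0.5, k=6):
        self.params = list(params)
        self.lr, self.alpha, self.k = lr, alpha, k

# Bind the hyperparameters now; FastAI supplies the parameters later:
opt_func = partial(Ranger, lr=1e-3, alpha=0.5, k=6)

# A Learner would then do roughly: opt = opt_func(model.parameters())
opt = opt_func([0.0, 1.0])
```

You would then pass this along the lines of `Learner(data, model, opt_func=opt_func)`; the exact Learner signature depends on your FastAI version.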


4 — Start testing!


LookAhead parameters:

  • k parameter — controls how many batches to run before merging in the Lookahead weights. Common defaults are 5 or 6; I believe the paper may have used 20.

  • alpha — controls the fraction of the distance between the slow and fast weights applied at each update. The default value is 0.5. Zhang et al. make a strong case for 0.5, but some quick testing of your own may still be worthwhile.

The paper also raises the follow-up conjecture that k or alpha could perhaps be scheduled according to how far training has progressed.

Based on feedback from my earlier RAdam article, I plan to publish a notebook as soon as possible to help you quickly try Ranger on ImageNette or other datasets, further lowering the barrier to using Ranger/RAdam/Lookahead.

Summary

Two independent research teams have made new breakthrough contributions toward fast, stable optimization algorithms for deep learning. I found that by combining RAdam and Lookahead, I could build a synergistic optimizer, Ranger, and ultimately set a new high score on the 20-epoch ImageNette test.

Going forward, further testing is needed to tune Lookahead's k parameter and the learning rate together with RAdam. Even as things stand, though, Lookahead and RAdam already reduce the amount of manual hyperparameter tuning, so they should offer you a new way to improve your training results.

The Ranger source code has been officially released. Grab it from the repository above and see whether the new combination of RAdam and Lookahead can further improve your deep learning results!

Further testing suggests that Ranger plus the new Mish activation function (in place of ReLU) produces even better results. For details about Mish, see:

https://medium.com/@lessw/meet-mish-new-state-of-the-art-ai-activation-function-the-successor-to-relu-846a6d93471f



Origin blog.51cto.com/15060462/2677936