ACCELERATING NEURAL ARCHITECTURE SEARCH USING PERFORMANCE PREDICTION: Paper Reading Notes

Copyright notice: This is an original post by the blogger and may not be reproduced without permission. https://blog.csdn.net/Maybemust/article/details/84205403

I could not find any existing walkthrough of this paper online, so I had to work through it slowly on my own. If there are any mistakes, corrections are very welcome.

ACCELERATING NEURAL ARCHITECTURE SEARCH USING PERFORMANCE PREDICTION

Bowen Baker, Otkrist Gupta, Ramesh Raskar, Nikhil Naik

ABSTRACT

Methods for neural network hyperparameter optimization and meta-modeling are computationally expensive due to the need to train a large number of model configurations. In this paper, we show that standard frequentist regression models can predict the final performance of partially trained model configurations using features based on network architectures, hyperparameters, and time-series validation performance data. We empirically show that our performance prediction models are much more effective than prominent Bayesian counterparts, are simpler to implement, and are faster to train. Our models can predict final performance in both visual classification and language modeling domains, are effective for predicting performance of drastically varying model architectures, and can even generalize between model classes. Using these prediction models, we also propose an early stopping method for hyperparameter optimization and meta-modeling, which obtains a speedup of a factor up to 6x in both hyperparameter optimization and meta-modeling. Finally, we empirically show that our early stopping method can be seamlessly incorporated into both reinforcement learning-based architecture selection algorithms and bandit based search methods. Through extensive experimentation, we empirically show our performance prediction models and early stopping algorithm are state-of-the-art in terms of prediction accuracy and speedup achieved while still identifying the optimal model configurations.

Hyperparameter optimization and meta-modeling for neural networks (I have not found a particularly good explanation of "meta-model"; it seems formally close to the article linked in the original post) are computationally expensive because a large number of model configurations must be trained. In this paper the authors show that standard frequentist regression models (as opposed to Bayesian ones) can predict the final performance of a partially trained network from features of the architecture, the hyperparameters, and the time series of validation performance. Empirically these predictors are more effective than the prominent Bayesian alternatives, and are simpler to implement and faster to train. The models predict final performance well in both image classification and language modeling, remain effective for drastically different architectures, and can even generalize between model classes. Using these predictors, the authors also build an early stopping method that speeds up hyperparameter optimization and meta-modeling by a factor of up to 6x, and that can be plugged seamlessly into reinforcement-learning-based architecture selection and bandit-based search. Extensive experiments show that the performance prediction models and the early stopping algorithm are state-of-the-art in prediction accuracy and speedup, while still identifying the optimal configurations.

1 INTRODUCTION

At present, significant human expertise and labor is required for designing high-performing neural network architectures and successfully training them for different applications. Ongoing research in two areas—meta-modeling and hyperparameter optimization—attempts to reduce the amount of human intervention required for these tasks. Hyperparameter optimization methods (e.g., Hutter et al. (2011); Snoek et al. (2015); Li et al. (2017)) focus primarily on obtaining good optimization hyperparameter configurations for training human-designed networks, whereas meta-modeling algorithms (Bergstra et al., 2013; Verbancsics & Harguess, 2013; Baker et al., 2017; Zoph & Le, 2017) aim to design neural network architectures from scratch. Both sets of algorithms require training a large number of neural network configurations for identifying the right set of hyperparameters or the right network architecture—and are hence computationally expensive.

At present, designing high-performing neural network architectures and training them successfully for different applications still requires significant human expertise and labor. Ongoing research in two areas, meta-modeling and hyperparameter optimization, tries to reduce the amount of human intervention needed. Hyperparameter optimization methods focus mainly on finding good optimization hyperparameters for training human-designed networks, while meta-modeling algorithms aim to design the network architecture itself from scratch. Both families of algorithms require training a large number of network configurations in order to identify the right hyperparameters or the right architecture, and are therefore computationally expensive.

When sampling many different model configurations, it is likely that many subpar configurations will be explored. Human experts are quite adept at recognizing and terminating suboptimal model configurations by inspecting their partial learning curves. In this paper we seek to emulate this behavior and automatically identify and terminate subpar model configurations in order to speedup both meta-modeling and hyperparameter optimization methods. Our method parameterizes learning curve trajectories with simple features derived from model architectures, training hyperparameters, and early time-series measurements from the learning curve. We use these features to train a set of frequentist regression models that predicts the final validation accuracy of partially trained neural network configurations using a small training set of fully trained curves from both image classification and language modeling domains. We use these predictions and uncertainty estimates obtained from small model ensembles to construct a simple early stopping algorithm that can speedup both meta-modeling and hyperparameter optimization methods.

When sampling many different model configurations, it is likely that many subpar ones will be explored. Human experts are quite good at recognizing such configurations from a partial learning curve and terminating them. This paper tries to emulate that behavior: automatically identify and terminate subpar configurations so as to speed up both meta-modeling and hyperparameter optimization. Learning-curve trajectories are parameterized with simple features derived from the model architecture, the training hyperparameters, and early time-series measurements of the learning curve. These features are used to train a set of frequentist regression models, on a small training set of fully trained curves from image classification and language modeling, to predict the final validation accuracy of partially trained network configurations. Finally, the predictions, together with uncertainty estimates obtained from small model ensembles, are used to build a simple early stopping algorithm that speeds up both meta-modeling and hyperparameter optimization.

[Figure 1] Example learning curves from the datasets studied in this paper. Note the diversity in convergence times and in the overall shapes of the learning curves.

While there is some prior work on neural network performance prediction using Bayesian methods (Domhan et al., 2015; Klein et al., 2017), our proposed method is significantly more accurate, accessible, and efficient. We hope that our work leads to inclusion of neural network performance prediction and early stopping in the practical neural network training pipeline.


Some prior work on neural network performance prediction based on Bayesian methods does exist, but the method proposed here is significantly more accurate, more accessible, and more efficient. The authors hope this work leads to neural network performance prediction and early stopping becoming part of the practical neural network training pipeline.

2 RELATED WORK

Neural Network Performance Prediction: There has been limited work on predicting neural network performance during the training process. Domhan et al. (2015) introduce a weighted probabilistic model for learning curves and utilize this model for speeding up hyperparameter search in small convolutional neural networks (CNNs) and fully-connected networks (FCNs). Building on Domhan et al. (2015), Klein et al. (2017) train Bayesian neural networks for predicting unobserved learning curves using a training set of fully and partially observed learning curves. Both methods rely on expensive Markov chain Monte Carlo (MCMC) sampling procedures and handcrafted learning curve basis functions. We also note that Swersky et al. (2014) develop a Gaussian Process kernel for predicting individual learning curves, which they use to automatically stop and restart configurations.

Neural network performance prediction: There has been only limited work on predicting neural network performance during training. Domhan et al. (2015) introduce a weighted probabilistic model of learning curves and use it to speed up hyperparameter search for small convolutional neural networks (CNNs) and fully connected networks (FCNs). Building on this, Klein et al. (2017) train a Bayesian neural network on a set of fully and partially observed learning curves to predict unobserved learning curves. Both methods rely on expensive Markov chain Monte Carlo (MCMC) sampling and handcrafted learning-curve basis functions. Note also that Swersky et al. (2014) develop a Gaussian Process kernel for predicting individual learning curves, which they use to automatically stop and restart configurations.

Meta-modeling: We define meta-modeling as an algorithmic approach for designing neural network architectures from scratch. The earliest meta-modeling approaches were based on genetic algorithms (Schaffer et al., 1992; Stanley & Miikkulainen, 2002; Verbancsics & Harguess, 2013) or Bayesian optimization (Bergstra et al., 2013; Shahriari et al., 2016). More recently, reinforcement learning methods have become popular. Baker et al. (2017) use Q-learning to design competitive CNNs for image classification. Zoph & Le (2017) use policy gradients to design state-of-the-art CNNs and Recurrent cell architectures. Several methods for architecture search (Cortes et al., 2017; Negrinho & Gordon, 2017; Zoph et al., 2017; Brock et al., 2017; Suganuma et al., 2017) have been proposed this year since the publication of Baker et al. (2017) and Zoph & Le (2017).

Meta-modeling: Here meta-modeling is defined as an algorithmic approach for designing neural network architectures from scratch. The earliest meta-modeling approaches were based on genetic algorithms or Bayesian optimization. More recently, reinforcement learning methods have become popular. Baker et al. use Q-learning to design competitive CNNs for image classification. Zoph and Le use policy gradients to design state-of-the-art CNNs and recurrent cell architectures. Since those two papers were published, several more architecture search methods have been proposed.

Hyperparameter Optimization: We define hyperparameter optimization as an algorithmic approach for finding optimal values of design-independent hyperparameters such as learning rate and batch size, along with a limited search through the network design space. Bayesian hyperparameter optimization methods include those based on sequential model-based optimization (SMAC) (Hutter et al., 2011), Gaussian processes (GP) (Snoek et al., 2012), TPE (Bergstra et al., 2013), and neural networks Snoek et al. (2015). However, random search or grid search is most commonly used in practical settings (Bergstra & Bengio, 2012). Recently, Li et al. (2017) introduced Hyperband, a multiarmed bandit-based efficient random search technique that outperforms state-of-the-art Bayesian optimization methods.

Hyperparameter optimization: Here hyperparameter optimization is defined as an algorithmic approach for finding optimal values of design-independent hyperparameters, such as the learning rate and batch size, together with a limited search through the network design space. Bayesian hyperparameter optimization methods include SMAC, based on sequential model-based optimization (shouldn't that be SMBO?), as well as methods based on Gaussian processes, TPE, and neural networks. In practice, however, random search and grid search are still used most often. Recently, Li et al. introduced Hyperband, a random search technique based on multi-armed bandits that outperforms state-of-the-art Bayesian optimization methods.

3 NEURAL NETWORK PERFORMANCE PREDICTION

We first describe our model for neural network performance prediction, followed by a description of the datasets used to evaluate our model, and finally present experimental results.

This section first describes the model for neural network performance prediction, then the datasets used to evaluate it, and finally presents the experimental results.

3.1 MODELING LEARNING CURVES


The goal is to predict a model's final validation accuracy y_T from the network configuration x, the training horizon T, and the performance observations y(t) made so far. For each configuration x trained for T epochs, a time series of validation accuracies y(t) is recorded; training n such configurations yields the set S. Note that this setup is very similar in form to the one described in Klein et al.'s paper.


A regression model is trained on this data to predict y_T, using a feature set x_f to parameterize each configuration. In addition, T - 1 regression models are trained, where each successive model uses one more point of the time-series validation data; this is what makes them sequential regression models (SRMs). As shown later, this SRM-based approach is both more accurate and cheaper to compute than the Bayesian alternatives.
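The equations in this subsection appear only as images in the original post. Based on the description above, a plausible LaTeX reconstruction of the setup is the following (my notation, which may differ slightly from the paper's; the symbol φ simply stands for the feature extraction):

```latex
% Training set of n fully observed configurations
S = \left\{ \big(x^{(i)},\, y^{(i)}(1), \dots, y^{(i)}(T)\big) \right\}_{i=1}^{n},
\qquad
% A regression model over features x_f predicts the final accuracy
\hat{y}_T = f(x_f), \quad
x_f = \varphi\big(x,\, y(1), \dots, y(\tau)\big), \ \tau < T .
```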


Features: The features used cover the time series (TS) of model accuracy, the architecture parameters (AP), and the hyperparameters (HP). TS: the validation accuracies together with their first-order and second-order differences. AP: for example, the total number of layers. HP: all hyperparameters used to train the network, such as the initial learning rate and the learning-rate decay.
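As a rough illustration (my own sketch, not the authors' code), the feature vector for one partially trained configuration could be assembled like this, assuming the observed validation accuracies, the layer count, and the learning-rate settings are available; the exact feature set in the paper may differ:

```python
import numpy as np

def build_features(val_acc, num_layers, init_lr, lr_decay):
    """Concatenate time-series (TS), architecture (AP), and hyperparameter (HP)
    features for one partially trained configuration."""
    ts = np.asarray(val_acc, dtype=float)   # validation accuracies observed so far
    d1 = np.diff(ts)                        # first-order differences
    d2 = np.diff(ts, n=2)                   # second-order differences
    ap = np.array([num_layers], dtype=float)
    hp = np.array([init_lr, lr_decay], dtype=float)
    return np.concatenate([ts, d1, d2, ap, hp])

# Example: features from the first few observed epochs of one configuration
x_f = build_features(val_acc=[0.31, 0.45, 0.52, 0.55, 0.58],
                     num_layers=20, init_lr=0.1, lr_decay=0.95)
```
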
3.2 DATASETS AND TRAINING PROCEDURES

We experiment with small and very deep CNNs (e.g., ResNet, Cuda-Convnet) trained on image classification datasets and with LSTMs trained with Penn Treebank (PTB), a language modeling dataset. Figure 1 shows example learning curves from three of the datasets considered in our experiments. We provide brief summary of the datasets below. Please see Appendix Section A for further details on the search space, preprocessing, hyperparameters and training settings of all datasets.

The experiments use small as well as very deep CNNs (e.g., ResNet, Cuda-Convnet) trained on image classification datasets, and LSTMs trained on Penn Treebank (PTB), a language modeling dataset. Figure 1 shows example learning curves from three of these datasets; details of the search spaces, preprocessing, hyperparameters, and training settings are given in Appendix A of the paper.


3.3 PREDICTION PERFORMANCE

Choice of Regression Method: We now describe our results for predicting final neural network performance. For all experiments, we train our SRMs on 100 randomly sampled neural network configurations. We obtain the best performing method using random hyperparameter search over 3-fold cross-validation. We then compute the regression performance over the remainder of the dataset using the coefficient of determination R2. We repeat each experiment 10 times and report the results with standard errors. We experiment with a few different frequentist regression models, including ordinary least squares (OLS), random forests, and ν-support vector machine regression (ν-SVR). As seen in Table 1, ν-SVR with linear or RBF kernels performs the best on most datasets, though not by a large margin. For the rest of this paper, we use ν-SVR (RBF) unless otherwise specified.

Choice of regression method: The authors evaluate how well the final performance of a neural network can be predicted. In all experiments, the SRMs are trained on 100 randomly sampled neural network configurations. The best-performing method is selected by random hyperparameter search with 3-fold cross-validation, and regression performance on the remainder of the dataset is measured with the coefficient of determination R². Each experiment is repeated 10 times and results are reported with standard errors. Several frequentist regression models are compared, including ordinary least squares (OLS), random forests, and ν-support vector machine regression (ν-SVR). As Table 1 shows, ν-SVR with a linear or RBF kernel performs best on most datasets, though not by a large margin. For the rest of the paper, ν-SVR (RBF) is used unless otherwise specified.
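A minimal sketch of how such an SRM could be fit with scikit-learn, assuming the features of the sampled configurations are already stacked into X and the final validation accuracies into y (the search ranges below are my own assumptions, not the paper's):

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.svm import NuSVR
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import r2_score

def fit_srm(X_train, y_train, X_test, y_test, seed=0):
    """Fit a nu-SVR (RBF) performance predictor with random hyperparameter
    search over 3-fold cross-validation, and report R^2 on held-out data."""
    search = RandomizedSearchCV(
        NuSVR(kernel="rbf"),
        param_distributions={
            "C": loguniform(1e-2, 1e2),          # assumed search ranges
            "gamma": loguniform(1e-4, 1e0),
            "nu": np.linspace(0.1, 0.9, 9),
        },
        n_iter=50, cv=3, random_state=seed,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, r2_score(y_test, search.predict(X_test))
```

In the paper's protocol, X_train would hold the 100 randomly sampled configurations, and the whole procedure is repeated 10 times to obtain standard errors.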

Ablation Study on Feature Sets: In Table 2, we compare the predictive ability of different feature sets, training SVR (RBF) with time-series (TS) features obtained from 25% of the learning curve, along with features of architecture parameters (AP), and hyperparameters (HP). TS features explain the largest fraction of the variance in all cases. For datasets with varying architectures, AP are more important than HP; and for hyperparameter search datasets, HP are more important than AP, which is expected. AP features almost match TS on the ResNet (TinyImageNet) dataset, indicating that choice of architecture has a large influence on accuracy for ResNets. Figure 2 shows the true vs. predicted performance for all test points in three datasets, trained with TS, AP, and HP features.

Feature-set ablation study: Table 2 compares the predictive ability of different feature sets, training an SVR (RBF) with time-series (TS) features obtained from 25% of the learning curve, together with architecture parameter (AP) and hyperparameter (HP) features. TS features explain the largest fraction of the variance in all cases. As expected, for datasets with varying architectures AP is more important than HP, and for hyperparameter-search datasets HP is more important than AP. On the ResNet (TinyImageNet) dataset, AP features almost match TS features in predictive power, indicating that the choice of architecture has a large influence on ResNet accuracy. Figure 2 shows true vs. predicted performance for all test points in three datasets, using TS, AP, and HP features.


Generalization Between Depths: We also test to see whether SRMs can accurately predict the performance of out-of-distribution neural networks. In particular, we train SVR (RBF) with 25% of TS, along with AP and HP features on ResNets (TinyImagenet) dataset, using 100 models with number of layers less than a threshold d and test on models with number of layers greater than d, averaging over 10 runs. Value of d varies from 14 to 110. For d = 32, R2 is 80.66 ± 3.8. For d = 62, R2 is 84.58 ± 2.7.

Generalization between depths: The authors also test whether an SRM can accurately predict the performance of out-of-distribution networks. Specifically, an SVR (RBF) is trained with 25% of the TS features plus AP and HP features on the ResNet (TinyImageNet) dataset, using 100 models whose number of layers is below a threshold d, and tested on models with more than d layers, averaged over 10 runs. The value of d ranges from 14 to 110. For d = 32 the R² is 80.66 ± 3.8; for d = 62 it is 84.58 ± 2.7.
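The protocol can be sketched as follows, reusing build_features and fit_srm from the earlier sketches; the records list and its num_layers / curve / final_acc fields are hypothetical stand-ins for the ResNet dataset:

```python
def depth_generalization(records, d, runs=10):
    """Train an SRM on 100 models shallower than d layers, test on deeper ones."""
    shallow = [r for r in records if r["num_layers"] < d]
    deep = [r for r in records if r["num_layers"] > d]
    # assumes each record's curve is already truncated to the same length (25%)
    featurize = lambda rs: np.stack([build_features(r["curve"], r["num_layers"],
                                                    r["init_lr"], r["lr_decay"])
                                     for r in rs])
    targets = lambda rs: np.array([r["final_acc"] for r in rs])
    scores = []
    for seed in range(runs):
        rng = np.random.default_rng(seed)
        train = list(rng.choice(shallow, size=100, replace=False))
        _, r2 = fit_srm(featurize(train), targets(train),
                        featurize(deep), targets(deep), seed=seed)
        scores.append(r2)
    return float(np.mean(scores)), float(np.std(scores))
```
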
3.3.1 COMPARISON WITH EXISTING METHODS:

We now compare the neural network performance prediction ability of SRMs with three existing learning curve prediction methods: (1) Bayesian Neural Network (BNN) (Klein et al., 2017), (2) the learning curve extrapolation (LCE) method (Domhan et al., 2015), and (3) the last seen value (LastSeenValue) heuristic (Li et al., 2017). When training the BNN, we not only present it with the subset of fully observed learning curves but also all other partially observed learning curves from the training set. While we do not present the partially observed curves to the v-SVR SRM for training, we felt this was a fair comparison as v-SVR uses the entire partially observed learning curve during inference. Methods (2) and (3) do not incorporate prior learning curves during training. Figure 3 shows the R2 obtained by each method for predicting the final performance versus the percent of the learning curve used for training the model. We see that in all neural network configuration spaces and across all datasets, either one or both SRMs outperform the competing methods. The LastSeenValue heuristic only becomes viable when the configurations are near convergence, and its performance is worse than an SRM for very deep models. We also find that the SRMs outperform the LCE method in all experiments, even after we remove a few extreme prediction outliers produced by LCE. Finally, while BNN outperforms the LastSeenValue and LCE methods when only a few iterations have been observed, it does worse than our proposed method. In summary, we show that our simple, frequentist SRMs outperforms existing Bayesian approaches on predicting neural network performance on modern, very deep models in computer vision and language modeling tasks.

This section compares the SRM's performance prediction ability with three existing learning-curve prediction methods: (1) the Bayesian neural network (BNN) of Klein et al. (2017), (2) the learning curve extrapolation (LCE) method of Domhan et al. (2015), and (3) the last-seen-value (LastSeenValue) heuristic of Li et al. (2017). When training the BNN, not only the subset of fully observed learning curves is provided but also all other partially observed learning curves in the training set. The partially observed curves are not given to the ν-SVR SRM for training, but the authors consider this a fair comparison because the ν-SVR uses the entire partially observed learning curve during inference. Methods (2) and (3) do not use prior learning curves during training at all. Figure 3 plots the R² obtained by each method for predicting final performance against the fraction of the learning curve used. In every neural network configuration space and on every dataset, one or both SRMs outperform the competing methods. The LastSeenValue heuristic only becomes viable when configurations are near convergence, and it performs worse than an SRM for very deep models. The SRMs also outperform the LCE method in all experiments, even after a few extreme prediction outliers produced by LCE are removed. Finally, while the BNN beats LastSeenValue and LCE when only a few iterations have been observed, it still does worse than the proposed method. In short, the simple frequentist SRMs outperform existing Bayesian approaches at predicting the performance of modern, very deep models in computer vision and language modeling tasks.

Since most of our experiments perform stepwise learning rate decay, it is conceivable that the performance gap between SRMs and both LCE and BNN results from a lack of a sharp jump in their basis functions. We experimented with exponential learning rate decay (ELRD), which the basis functions in LCE are designed for. We trained 630 random nets with ELRD, from the 1000 MetaQNN-CIFAR10 nets. Predicting from 25% of the learning curve, the R2 is 0.95 for ν-SVR (RBF), 0.48 for LCE (with extreme outlier removal, negative without), and 0.31 for BNN. This comparison illuminates another benefit of our method: we do not require handcrafted basis functions to model new learning curve types.

Since most of the experiments use stepwise learning rate decay, it is conceivable that the performance gap between the SRMs and both LCE and BNN comes from the lack of a sharp jump in their basis functions (my guess: the fitted curves are too smooth to capture such a jump). The authors therefore experiment with exponential learning rate decay (ELRD), which the basis functions in LCE were designed for, training 630 random networks with ELRD out of the 1000 MetaQNN-CIFAR10 networks. Predicting from 25% of the learning curve, the R² is 0.95 for ν-SVR (RBF), 0.48 for LCE (after removing extreme outliers; negative without), and 0.31 for BNN. This comparison highlights another benefit of the method: no handcrafted basis functions are needed to model new types of learning curves.

Training and Inference Speed Comparison: Another advantage of our regression approach is speed. SRMs are much faster to train and do inference in than the proposed Bayesian methods (Domhan et al., 2015; Klein et al., 2017). On 1 core of an Intel 6700K CPU, a ν-SVR (RBF) with 100 training points trains in 0.006 seconds, and each inference takes 0.00006 seconds. In comparison, the LCE code takes 60 seconds and BNN code takes 0.024 seconds on the same hardware for each inference.

Training and inference speed comparison: Another advantage of the regression approach is speed. SRMs are much faster to train and run inference with than the earlier Bayesian methods (Domhan et al., 2015; Klein et al., 2017). On one core of an Intel 6700K CPU, a ν-SVR (RBF) with 100 training points trains in 0.006 seconds, and each inference takes 0.00006 seconds. In comparison, the LCE code takes 60 seconds and the BNN code 0.024 seconds per inference on the same hardware.

4 APPLYING PERFORMANCE PREDICTION FOR EARLY STOPPING

To speed up hyperparameter optimization and meta-modeling methods, we develop an algorithm to determine whether to continue training a partially trained model configuration using our sequential regression models. If we would like to sample N total neural network configurations, we begin by sampling and training n ≪ N configurations to create a training set S. We then train a model f(x_f) to predict y_T. Now, given the current best performance observed y_BEST, we would like to terminate training a new configuration x′, given its partial learning curve y′(t), if f(x′_f) = ŷ_T ≤ y_BEST, so as to not waste computational resources exploring a suboptimal configuration.

To speed up hyperparameter optimization and meta-modeling, the authors build, on top of the sequential regression models, an algorithm that decides whether a partially trained model should continue training. If N configurations are to be sampled in total, first n ≪ N configurations are sampled and trained to create the training set S, and a model f(x_f) is trained to predict y_T. Then, given the best performance observed so far, y_BEST, a new configuration can be terminated early as soon as its predicted final performance ŷ_T falls at or below y_BEST, so that no resources are wasted on a suboptimal configuration.
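The probabilistic form of this rule appears only as an image in the original post. Assuming the ensemble's predictions for a new configuration x′ are summarized by a mean μ(x′_f) and standard deviation σ(x′_f), and writing Φ for the standard normal CDF, a plausible reconstruction is:

```latex
\text{terminate } x' \quad \text{if} \quad
p\big(\hat{y}_T \le y_{\mathrm{BEST}}\big)
  \approx \Phi\!\left(\frac{y_{\mathrm{BEST}} - \mu(x'_f)}{\sigma(x'_f)}\right)
  \ \ge\ \Delta ,
```

with Δ the probability threshold (Δ = 0.99 in the more conservative experiments reported later).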


However, if f(x_f) generalizes poorly, training may be terminated by mistake, harming the optimization process. Since the model's uncertainty can be estimated, the termination threshold can be set from that uncertainty, balancing the gain in speed against the risk of incorrectly terminating a good configuration. In many cases one may also want several configurations close to the optimum rather than just the single best. There are two ways to handle this: relax the threshold, or set the threshold with respect to the n-th best configuration observed so far, which guarantees that those top results are all kept.
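A minimal sketch of this check, assuming the uncertainty comes from a small ensemble of SRMs (e.g. ν-SVRs trained on bootstrapped subsets of S) whose disagreement is treated as a Gaussian; the function names are my own:

```python
import numpy as np
from math import erf, sqrt

def p_final_below(mean, std, y_best):
    """P(final accuracy <= y_best) under a Gaussian predictive distribution."""
    if std == 0.0:
        return 1.0 if mean <= y_best else 0.0
    return 0.5 * (1.0 + erf((y_best - mean) / (std * sqrt(2.0))))

def should_terminate(ensemble, x_f, y_best, delta=0.99):
    """Stop a partially trained configuration if the ensemble is confident
    enough that its final accuracy will not beat the best seen so far."""
    preds = np.array([m.predict(x_f.reshape(1, -1))[0] for m in ensemble])
    return p_final_below(preds.mean(), preds.std(), y_best) >= delta
```

Raising delta toward 1 makes termination more conservative (fewer false stops, less speedup); to keep the top n configurations instead of only the best, y_best can simply be replaced by the n-th best accuracy observed so far.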

4.1 EARLY STOPPING FOR META-MODELING

Baker et al. (2017) train a Q-learning agent to design convolutional neural networks. In this method, the agent samples architectures from a large, finite space by traversing a path from input layer to termination layer. However, the MetaQNN method uses 100 GPU-days to train 2700 neural architectures and the similar experiment by Zoph & Le (2017) utilized 10,000 GPU-days to train 12,800 models on CIFAR-10. The amount of computing resources required for these approaches makes them prohibitively expensive for large datasets (e.g., Imagenet) and larger search spaces. The main computational expense of reinforcement learning-based meta-modeling methods is training the neural network configuration to T epochs (where T is typically a large number at which the network stabilizes to peak accuracy).

We now detail the performance of a ν-SVR (RBF) SRM in speeding up architecture search using sequential configuration selection. First, we take 1,000 random models from the MetaQNN (Baker et al., 2017) search space. We simulate the MetaQNN algorithm by taking 10 random orderings of each set and running our early stopping algorithm. We compare against the LCE early stopping algorithm (Domhan et al., 2015) as a baseline, which has a similar probability threshold termination criterion. Our SRM trains off of the first 100 fully observed curves, while the LCE model trains from each individual partial curve and can begin early termination immediately. Despite this “burn in” time needed by an SRM, it is still able to significantly outperform the LCE model (Figure 4). In addition, fitting the LCE model to a learning curve takes between 1-3 minutes on a modern CPU due to expensive MCMC sampling, and it is necessary to fit a new LCE model each time a new point on the learning curve is observed. Therefore, on a full meta-modeling experiment involving thousands of neural network configurations, our method could be faster by several orders of magnitude as compared to LCE based on current implementations.

This section details how a ν-SVR (RBF) SRM speeds up architecture search with sequential configuration selection. First, 1,000 random models are taken from the MetaQNN search space. The MetaQNN algorithm is simulated by taking 10 random orderings of this set and running the early stopping algorithm, compared against the LCE early stopping algorithm as a baseline, which uses a similar probability-threshold termination criterion. The SRM is trained on the first 100 fully observed curves, while the LCE model fits each partial curve individually and can therefore begin terminating immediately. Despite this "burn-in" period, the SRM still significantly outperforms the LCE model (Figure 4). Moreover, fitting the LCE model to a learning curve takes 1 to 3 minutes on a modern CPU because of the expensive MCMC sampling, and a new LCE model must be fit every time a new point of the learning curve is observed. On a full meta-modeling experiment with thousands of configurations, the proposed method can therefore be several orders of magnitude faster than LCE with current implementations.

We furthermore simulate early stopping for ResNets trained on CIFAR-10. We found that only the probability threshold Δ = 0.99 resulted in recovering the top model consistently. However, even with such a conservative threshold, the search was sped up by a factor of 3.4 over the baseline. While we do not have the computational resources to run the full experiment from Zoph & Le (2017), our method could provide similar gains in large scale architecture searches.

The authors further simulate early stopping for ResNets trained on CIFAR-10. Only a probability threshold of Δ = 0.99 consistently recovered the top model. Even with such a conservative threshold, however, the search was still sped up by a factor of 3.4 over the baseline. Although the computational resources to rerun the full experiment of Zoph & Le (2017) were not available, the method could provide similar gains in large-scale architecture search.

It is not enough, however, to simply simulate the speedup because meta-modeling algorithms typically use the observed performance in order to update an acquisition function to inform future sampling. In the reinforcement learning setting, the performance is given to the agent as a reward, so we also empirically verify that substituting ŷ_T for y_T does not cause the MetaQNN agent to converge to a subpar policy. Replicating the MetaQNN experiment on CIFAR-10 (see Figure 5), we find that integrating early stopping with the Q-learning procedure does not disrupt learning and resulted in a speedup of 3.8x with Δ = 0.99. The speedup is relatively low due to a conservative value of Δ. After training the top models to 300 epochs, we also find that the resulting performance (just under 93%) is on par with original results of Baker et al. (2017).

However, merely simulating the speedup is not enough, because meta-modeling algorithms typically use the observed performance to update an acquisition function that guides future sampling. In the reinforcement learning setting the performance is handed to the agent as a reward, so the authors also verify empirically that substituting ŷ_T for y_T does not make the MetaQNN agent converge to a subpar policy. Replicating the MetaQNN experiment on CIFAR-10 (see Figure 5), integrating early stopping with the Q-learning procedure does not disrupt learning and gives a 3.8x speedup with Δ = 0.99; the speedup is relatively low because of the conservative value of Δ. After training the top models to 300 epochs, the resulting performance (just under 93%) is on par with the original results of Baker et al. (2017).

4.2 EARLY STOPPING FOR HYPERPARAMETER OPTIMIZATION

Recently, Li et al. (2017) introduced Hyperband, a random search technique based on multi-armed bandits that obtains state-of-the-art performance in hyperparameter optimization in a variety of settings. The Hyperband algorithm trains a population of models with different hyperparameter configurations and iteratively discards models below a certain percentile in performance among the population until the computational budget is exhausted or satisfactory results are obtained.

Recently, Li et al. (2017) introduced Hyperband, a random search technique based on multi-armed bandits that achieves state-of-the-art hyperparameter optimization performance in a variety of settings. The Hyperband algorithm trains a population of models with different hyperparameter configurations and iteratively discards the models whose performance falls below a certain percentile of the population, until the computational budget is exhausted or a satisfactory result is obtained.
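For concreteness, here is a rough sketch of one bracket of successive halving, the core loop inside Hyperband; train_to and validate are hypothetical callbacks for resuming training and measuring validation accuracy:

```python
def successive_halving(configs, min_epochs, max_epochs, eta=3,
                       train_to=None, validate=None):
    """One Hyperband bracket: train every surviving configuration to the
    current budget, keep the top 1/eta, then multiply the budget by eta."""
    survivors, epochs = list(configs), min_epochs
    while True:
        for cfg in survivors:
            train_to(cfg, epochs)                  # resume training up to `epochs`
        survivors.sort(key=validate, reverse=True)
        if epochs >= max_epochs or len(survivors) == 1:
            return survivors[0]                    # best configuration in this bracket
        survivors = survivors[: max(1, len(survivors) // eta)]
        epochs = min(max_epochs, epochs * eta)
```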

4.2.1 FAST HYPERBAND

We present a Fast Hyperband (f-Hyperband) algorithm based on our early stopping scheme. During each iteration of successive halving, Hyperband trains n_i configurations to r_i epochs. In f-Hyperband, we train an SRM to predict y_{r_i} and do early stopping within each iteration of successive halving. We initialize f-Hyperband in exactly the same way as vanilla Hyperband, except once we have trained 100 models to r_i iterations, we begin early stopping for all future successive halving iterations that train to r_i iterations. By doing this, we exhibit no initial slowdown to Hyperband due to a “burn-in” phase. We also introduce a parameter which denotes the proportion of the n_i models in each iteration that must be trained to the full r_i iterations. This is similar to setting the criterion based on the n-th best model in the previous section. See Appendix Section C for an algorithmic representation of f-Hyperband.

This section proposes Fast Hyperband (f-Hyperband), built on the early stopping scheme. In each iteration of successive halving, Hyperband trains n_i configurations to r_i epochs. In f-Hyperband, an SRM is trained to predict y_{r_i}, and early stopping is applied within each successive halving iteration. f-Hyperband is initialized exactly like vanilla Hyperband; only after 100 models have been trained to r_i epochs does early stopping kick in for all later successive halving iterations that train to r_i epochs, so there is no initial slowdown from a "burn-in" phase. A parameter is also introduced that specifies the proportion of the n_i models in each iteration that must be trained for the full r_i epochs, similar to setting the criterion based on the n-th best model as in the previous section. An algorithmic description of f-Hyperband is given in Appendix C of the paper.
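As I read the description above, the modification can be sketched roughly as follows (reusing p_final_below from the earlier early stopping sketch); srm_pool is an assumed helper that keeps one ensemble SRM per budget r_i and returns None until 100 fully observed curves exist at that budget, and phi is my own name for the proportion of models that must always be trained to the full budget:

```python
def f_hyperband_rung(survivors, r_i, srm_pool, y_best, delta=0.99, phi=0.2,
                     train_to=None, curve_features=None, validate=None):
    """Train one successive-halving rung to r_i epochs, early-stopping
    configurations the SRM predicts will not beat y_best."""
    must_finish = {id(c) for c in survivors[: max(1, int(phi * len(survivors)))]}
    results = []
    for cfg in survivors:
        train_to(cfg, max(1, r_i // 4))            # observe part of the curve first
        srm = srm_pool.get(r_i)                    # None until 100 full curves at r_i
        if srm is not None and id(cfg) not in must_finish:
            mean, std = srm.predict(curve_features(cfg))
            if p_final_below(mean, std, y_best) >= delta:
                continue                           # terminate this configuration early
        train_to(cfg, r_i)                         # otherwise train to the full budget
        acc = validate(cfg)
        srm_pool.record(r_i, cfg, acc)             # one more fully observed curve at r_i
        results.append((acc, cfg))
    return results
```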

We empirically evaluate f-Hyperband using Cuda-Convnet trained on CIFAR-10 and SVHN datasets. Figure 6 shows that f-Hyperband evaluates the same number of unique configurations as Hyperband within half the compute time, while achieving the same final accuracy within standard error. When reinitializing hyperparameter searches, one can use previously-trained set of SRMs to achieve even larger speedups. Figure 8 in Appendix shows that one can achieve up to a 7x speedup in such cases.

f-Hyperband is evaluated empirically with Cuda-Convnet trained on the CIFAR-10 and SVHN datasets. Figure 6 shows that f-Hyperband evaluates the same number of unique configurations as Hyperband in half the compute time, while reaching the same final accuracy within standard error. When re-running hyperparameter searches, previously trained SRMs can be reused for even larger speedups; Figure 8 in the paper's appendix shows up to a 7x speedup in that case.

5 CONCLUSION

In this paper we introduce a simple, fast, and accurate model for predicting future neural network performance using features derived from network architectures, hyperparameters, and time-series performance data. We show that the performance of drastically different network architectures can be jointly learned and predicted on both image classification and language models. Using our simple algorithm, we can speedup hyperparameter search techniques with complex acquisition functions, such as a Q-learning agent, by a factor of 3x to 6x and Hyperband—a state-of-the-art hyperparameter search method—by a factor of 2x, without disturbing the search procedure. We outperform all competing methods for performance prediction in terms of accuracy, train and test time, and speedups obtained on hyperparameter search methods. We hope that the simplicity and success of our method will allow it to be easily incorporated into current hyperparameter optimization pipelines for deep neural networks. With the advent of large scale automated architecture search (Baker et al., 2017; Zoph & Le, 2017), methods such as ours will be vital in exploring even larger and more complex search spaces.

This paper introduces a simple, fast, and accurate model for predicting future neural network performance from features derived from the network architecture, hyperparameters, and time-series performance data. The performance of drastically different architectures can be learned and predicted jointly, for both image classification and language models. With this simple algorithm, hyperparameter search techniques with complex acquisition functions, such as a Q-learning agent, can be sped up by a factor of 3x to 6x, and Hyperband, a state-of-the-art hyperparameter search method, by a factor of 2x, without disturbing the search procedure. The method outperforms all competing performance prediction approaches in accuracy, training and test time, and the speedups obtained for hyperparameter search. The authors hope that its simplicity and effectiveness will allow it to be easily incorporated into current hyperparameter optimization pipelines for deep neural networks. With the advent of large-scale automated architecture search (Baker et al., 2017; Zoph & Le, 2017), methods like this will be vital for exploring even larger and more complex search spaces.

REFERENCES


APPENDIX


B HYPERPARAMETER SELECTION IN RANDOM FOREST AND SVM BASED EXPERIMENTS


C F-HYPERBAND

