Early Stopping | but when?

https://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf

  1. Abstract
    1. Early stopping should be validation-based, but in practice it is usually done in an ad-hoc fashion, or training is stopped interactively
      1. That is, training is stopped based on ad-hoc rules or by interaction; "interactively" I take to mean that an engineer watches metrics such as the loss and decides by hand when to stop training
    2. Validation-based early stopping
    3. The paper proposes stopping criteria in a systematic fashion
    4. There is a trade-off between training time and generalization ability
  2. Early stopping is not trivial
    1. Why early stopping
      1. Two kinds of methods to prevent overfitting:
        1. Reduce the number of dimensions of the parameter space
          1. greedy constructive learning, i.e., growing the network by adding units incrementally (e.g., cascade-correlation)
          2. pruning
          3. weight sharing
        2. Reduce the effective size of each dimension
          1. regularization
            1. weight decay
            2. early stopping, reported to work better than regularization methods
    2. The basic technique
    3. The ugliness of reality
      1. In practice the validation error curve is not smooth; it can rise before it falls, and it has more than one local minimum. The curve in Figure 2 of the paper has 16 local minima
      2. Stopping after 400 epochs (when overfitting becomes visible) instead of after 45 epochs (the first local minimum) takes 7 times as long but lowers the validation error by only 1.1%, and that 1.1% holds only if the validation data are quite representative
      3. Validation error curves all behave differently; the only thing they seem to have in common is that the gap between the first local minimum and the global minimum is not large
        1. "Unfortunately, the above or any other validation error curve is not typical in the sense that all curves share the same qualitative behavior. Other curves might never reach a better minimum than the first, or than, say, the third; the mountains and valleys in the curve can be of very different width, height, and shape. The only thing all curves seem to have in common is that the differences between the first and the following local minima are not huge."
  3. How to do early stopping best
    1. We look for criteria that yield the lowest generalization error, and also for those with the best "price-performance ratio"
      1. i.e., minimal training time for a given generalization error, or minimal generalization loss for a given training time
    2. Classes of stopping criteria
      1. Definitions:
        1. E: the error function used by the training algorithm
        2. E_{tr}(t): training set error
          1. the average error per example over the training set, measured after epoch t
        3. E_{va}(t): validation set error
          1. used by the stopping criteria
        4. E_{te}(t): test set error
          1. it is not known to the training algorithm but estimates the generalization error and thus benchmarks the quality of the network resulting from training
          2. during training this value is not available to the algorithm
        5. E_{opt}(t): the lowest validation set error obtained in the first t epochs: E_{opt}(t) := \min_{t' \le t} E_{va}(t')
        6. GL(t): the generalization loss at epoch t: GL(t) = 100 \cdot \left( \frac{E_{va}(t)}{E_{opt}(t)} - 1 \right)
      2. The first class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold. GL_{\alpha}: stop after the first epoch t with GL(t) > \alpha (a code sketch follows)
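A minimal sketch of these two definitions, assuming `va_errors` is a hypothetical Python list holding the per-epoch validation errors E_{va}(1..t) (this is not code from the paper):

```python
def generalization_loss(va_errors):
    """GL(t) = 100 * (E_va(t) / E_opt(t) - 1)."""
    e_opt = min(va_errors)                  # E_opt(t) = min over t' <= t of E_va(t')
    return 100.0 * (va_errors[-1] / e_opt - 1.0)


def stop_GL(va_errors, alpha):
    """GL_alpha: stop after the first epoch t with GL(t) > alpha."""
    return generalization_loss(va_errors) > alpha
```

For example, `stop_GL([0.30, 0.25, 0.27], alpha=5)` returns True, since GL = 100 * (0.27 / 0.25 - 1) = 8 > 5.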
      3. We want to suppress stopping if training is still progressing very rapidly: while the training error is still decreasing quickly, the validation error may well keep decreasing too. We assume that overfitting often does not begin until the error decreases only slowly. To formalize this notion, we define a training strip of length k to be a sequence of k epochs numbered n+1, ..., n+k, where n is divisible by k
        1. P_{k}(t) := 1000 \cdot \left( \frac{\sum_{t'=t-k+1}^{t} E_{tr}(t')}{k \cdot \min_{t'=t-k+1}^{t} E_{tr}(t')} - 1 \right)
        2. i.e., how much larger the average training error during the strip was than the minimum training error during the strip
        3. When the training error barely changes, P_{k}(t) approaches 0 (a code sketch of this measure follows below)
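A matching sketch of the progress measure, again with hypothetical names (`tr_errors` holds the per-epoch training errors E_{tr}(1..t)):

```python
def training_progress(tr_errors, k):
    """P_k(t): how much the mean training error over the last strip
    exceeds the minimum training error over that strip, times 1000.

    Call only at end-of-strip epochs, i.e., when t is divisible by k.
    """
    strip = tr_errors[-k:]                  # the strip E_tr(t-k+1), ..., E_tr(t)
    return 1000.0 * (sum(strip) / (k * min(strip)) - 1.0)
```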
      4. The second class of stopping criteria: use the quotient of generalization loss and progress
        1. PQ_{\alpha}: stop after the first end-of-strip epoch t with \frac{GL(t)}{P_{k}(t)} > \alpha
      5. Assumption: the strip length k is 5, and validation is performed only at the end of each strip.
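Combining the two sketches above gives PQ_{\alpha} under this assumption (hypothetical code, not the paper's):

```python
def stop_PQ(va_errors, tr_errors, alpha, k=5):
    """PQ_alpha: stop after the first end-of-strip epoch t
    with GL(t) / P_k(t) > alpha."""
    assert len(tr_errors) % k == 0, "evaluate only at end-of-strip epochs"
    # The complementary rule "stop when progress drops below 0.1"
    # (item 7 below) guards against P_k(t) approaching 0 here.
    return generalization_loss(va_errors) / training_progress(tr_errors, k) > alpha
```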
      6. The third class of stopping criteria: stop when the generalization error has increased in s successive strips
        1. uses only the direction of change of the validation error to decide whether to stop
        2. UP_{s}: stop when the validation error has increased in s successive strips
        3. UP_{1}: stop after the first end-of-strip epoch t with E_{va}(t) > E_{va}(t-k)
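A sketch of this class, assuming `va_strip_errors` is a hypothetical list of E_{va} values measured only at end-of-strip epochs:

```python
def stop_UP(va_strip_errors, s):
    """UP_s: stop once validation error has risen in s successive strips."""
    if len(va_strip_errors) < s + 1:
        return False
    last = va_strip_errors[-(s + 1):]       # the last s+1 strip-end errors
    return all(last[i + 1] > last[i] for i in range(s))
```

With s = 1 this reduces to the UP_{1} rule above: E_{va}(t) > E_{va}(t-k).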
      7. None of these criteria alone can guarantee termination. We thus complement them by the rule that training is stopped when the progress drops below 0.1 or after at most 3000 epochs.
      8. All stopping criteria are used in the same way: they decide to stop at some time t during training, and the result of the training is then the set of weights that exhibited the lowest validation error E_{opt}(t). Note that in order to implement this scheme, only one duplicate weight set is needed (a training-loop sketch follows below)
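Putting the pieces together, a sketch of the full scheme as described here: validate every k = 5 epochs, keep a single duplicate of the best weights so far, and stop when the chosen criterion fires, when progress drops below 0.1, or after 3000 epochs. `model`, `train_one_epoch`, and `validate` are hypothetical stand-ins; `training_progress` is the sketch from above.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validate, criterion, k=5):
    tr_errors, va_strip_errors = [], []
    best_va, best_weights = float("inf"), None    # the one duplicate weight set
    for epoch in range(1, 3001):                  # at most 3000 epochs
        tr_errors.append(train_one_epoch(model))  # returns E_tr(epoch)
        if epoch % k:
            continue                              # act only at end-of-strip epochs
        va_strip_errors.append(validate(model))   # returns E_va(epoch)
        if va_strip_errors[-1] < best_va:         # new E_opt(t): duplicate weights
            best_va = va_strip_errors[-1]
            best_weights = copy.deepcopy(model.weights)
        if (criterion(tr_errors, va_strip_errors)
                or training_progress(tr_errors, k) < 0.1):
            break
    model.weights = best_weights                  # result: weights at E_opt(t)
    return model
```

For a GL_{5} run one would pass `criterion=lambda tr, va: stop_GL(va, 5)`; for UP_{2}, `criterion=lambda tr, va: stop_UP(va, 2)`.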
    3. Criterion selection rules:
      1. The results indicate that "slower" criteria, which stop later than others, on the average lead to improved generalization compared to "faster" ones. However, the training time that has to be expended for such improvements is rather large on average and also varies dramatically when slow criteria are used. The systematic differences between the criteria classes are only small.
      2. rules:
        1. Use a fast stopping criterion unless a slow one brings a clear improvement in network performance
        2. To maximize the probability of finding a good solution, use a GL criterion
        3. To maximize the average quality of solutions, use a PQ criterion if the network overfits only a little, otherwise a UP criterion
    4. How well do the rules work
      1. Concrete questions
        1. training time
        2. efficiency: does the criterion stop only after the to-be-chosen validation error minimum has already occurred?
        3. effectiveness
        4. robustness
        5. trade-offs
        6. quantification
      2. The data are split into training, validation, and test sets
      3. The 14 criteria evaluated, by class and parameter:

         criterion | parameter (α for GL and PQ, s for UP)
         GL        | 1, 2, 3, 5
         PQ        | 0.5, 0.75, 1, 2, 3
         UP        | 2, 3, 4, 6, 8
      4. Definition: S_{x}(C) = \frac{t_{s}(C)}{t_{s}(x)}, the slowness of criterion C relative to criterion x, where t_{s} denotes the training time until the criterion stops
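A hypothetical worked example of this definition (numbers invented for illustration): if criterion C stops training at t_{s}(C) = 900 epochs while criterion x stops at t_{s}(x) = 300 epochs, then S_{x}(C) = 900 / 300 = 3, i.e., C trains three times slower than x.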

Reposted from blog.csdn.net/qq_32110859/article/details/86673609