Several Regularization Methods in Deep Learning

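Regularization is a strategy for reducing test error, possibly at the cost of higher training error. Broadly, it works by enlarging or shrinking the set of functions the model is able to fit, which increases or decreases the model's capacity.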

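Parameter norm penalties, references: [1], [2], [3]
See also "Regularized approximation" on page 297 of 《凸优化》 (Convex Optimization).
Usually only the weights are penalized, not the biases.
Basic formula, where $\alpha \ge 0$ weights the penalty $\Omega$ against the original objective $J$: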
$\widetilde{J}(\pmb{\theta};\pmb{X},y)=J(\pmb{\theta};\pmb{X},y)+\alpha\Omega(\pmb{\theta})$
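Common variants:
- $L^2$ regularization (weight decay, ridge regression, Tikhonov regularization)
  $\Omega(\pmb{\theta})=\frac{1}{2}||\pmb{w}||^2_2$
- $L^1$ regularization
  $\Omega(\pmb{\theta})=||\pmb{w}||_1=\sum_i|w_i|$
- Constraining the penalty term, for example requiring $\Omega(\pmb{\theta})<k$, which gives the generalized Lagrangian:
  $\mathcal{L}(\pmb{\theta},\alpha;\pmb{X},y)=J(\pmb{\theta};\pmb{X},y)+\alpha(\Omega(\pmb{\theta})-k)$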
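The penalized objective above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names and the value used for $J$ are made up for the example, and the bias is simply left out of the penalty.

```python
def l2_penalty(w):
    """Omega(theta) = 0.5 * ||w||_2^2"""
    return 0.5 * sum(wi * wi for wi in w)


def l1_penalty(w):
    """Omega(theta) = ||w||_1 = sum_i |w_i|"""
    return sum(abs(wi) for wi in w)


def regularized_objective(data_loss, w, alpha, penalty):
    """J~ = J + alpha * Omega(w). Only the weights w are penalized, not the bias."""
    return data_loss + alpha * penalty(w)


w = [3.0, -4.0]      # weights (bias excluded on purpose)
data_loss = 1.0      # hypothetical value of the unregularized objective J
alpha = 0.5          # penalty strength, a hyperparameter

print(regularized_objective(data_loss, w, alpha, l2_penalty))  # 7.25
print(regularized_objective(data_loss, w, alpha, l1_penalty))  # 4.5
```

With the same $\alpha$, the $L^2$ term grows quadratically in the weights while the $L^1$ term grows linearly, which is why the two penalties shape the solution differently.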
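Dataset augmentation
Noise injection
- Injecting noise into the input data (for some models this is equivalent to a norm penalty on the weights; references: [4], [5])
- Adding noise to the hidden units (e.g. Dropout)
- Adding noise to the weights, references: [6], [7], [8]
- Adding noise to the output targets (motivated by the fact that a certain fraction of the labels in a dataset are wrong)
  - Label smoothing, reference: [9]
    Instead of hard 0 and 1 classification targets, a softmax model with $k$ output values is trained against targets of $\frac{\epsilon}{k-1}$ and $1-\epsilon$ respectively.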
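Label smoothing is easy to illustrate. The sketch below is only an example under stated assumptions: `smooth_labels` is a hypothetical helper name, and the values of $k$ and $\epsilon$ are arbitrary.

```python
def smooth_labels(class_index, k, eps):
    """Build smoothed targets for one example.

    The true class gets 1 - eps; each of the other k - 1 classes gets eps / (k - 1),
    so the targets still sum to 1.
    """
    off_value = eps / (k - 1)
    return [1.0 - eps if j == class_index else off_value for j in range(k)]


targets = smooth_labels(class_index=2, k=5, eps=0.1)
print(targets)  # [0.025, 0.025, 0.9, 0.025, 0.025]
```

Because the softmax can never output exactly 0 or 1, the smoothed targets give the model a reachable goal and stop it from pushing the weights ever larger in pursuit of hard targets.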
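Multitask learning, references: [10], [11]
Early stopping, references: [12], [13]
For a simple linear model with a quadratic error function trained by plain gradient descent, early stopping is equivalent to $L^2$ regularization.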
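The stopping rule itself can be sketched as follows. This is one common patience-based variant, offered as an assumption rather than as the method of the cited papers. The function name and the validation losses are invented for illustration, and a real training loop would also save the parameters at the best step.

```python
def early_stopping(val_losses, patience):
    """Scan validation losses in order and return (best_step, best_loss).

    Stops as soon as `patience` consecutive evaluations fail to improve
    on the best validation loss seen so far.
    """
    best_step, best_loss = -1, float("inf")
    evaluations_without_improvement = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss = step, loss
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                break  # give up; later losses are never evaluated
    return best_step, best_loss


# Hypothetical validation curve: improves, then starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.40]
print(early_stopping(val_losses, patience=3))  # (3, 0.5)
```

Note that the final value 0.40 is never reached: with a patience of 3 the loop stops at step 6, which shows that the patience setting trades training time against the risk of stopping too soon.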
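Parameter tying, reference: [14]
Basic idea: for similar tasks, the weights of the models used are likely to be close to each other.
The basic approach is a parameter norm penalty:
$\Omega(\pmb{w}^{(A)}, \pmb{w}^{(B)})=||\pmb{w}^{(A)}-\pmb{w}^{(B)}||^2_2$
The best-known method of this kind is parameter sharing, which forces sets of parameters to be equal instead of merely close.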
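As a small worked example of the tying penalty $||\pmb{w}^{(A)}-\pmb{w}^{(B)}||^2_2$, the sketch below computes the penalty and its gradient with respect to $\pmb{w}^{(A)}$, which is $2(\pmb{w}^{(A)}-\pmb{w}^{(B)})$. The function names and weight values are hypothetical.

```python
def tying_penalty(w_a, w_b):
    """Omega(w_a, w_b) = ||w_a - w_b||_2^2"""
    return sum((a - b) ** 2 for a, b in zip(w_a, w_b))


def tying_penalty_grad_a(w_a, w_b):
    """Gradient of the penalty with respect to w_a: 2 * (w_a - w_b)."""
    return [2.0 * (a - b) for a, b in zip(w_a, w_b)]


w_a = [1.0, 2.0]   # weights of model A (illustrative values)
w_b = [0.5, 4.0]   # weights of model B (illustrative values)

print(tying_penalty(w_a, w_b))         # 4.25
print(tying_penalty_grad_a(w_a, w_b))  # [1.0, -4.0]
```

A gradient step on this penalty moves $\pmb{w}^{(A)}$ toward $\pmb{w}^{(B)}$. Parameter sharing is the limiting case in which the two sets of weights are stored once and are therefore always identical.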
Bagging, references: [15], [16], [17], [18], [19]
Several different models are trained separately, and the final output is decided by a vote over the outputs of these models.
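The train-then-vote procedure can be sketched as below. This is a toy illustration under stated assumptions: each model is trained on a bootstrap resample of the data (sampling with replacement, as in standard bagging), the weak learner is a made-up 1-D threshold classifier, and the dataset is invented.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement from data."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Toy weak learner: choose a threshold t so that 'x >= t means class 1'
    makes the fewest mistakes on the sample."""
    best_t, best_err = None, None
    for t in sorted(x for x, _ in sample):
        err = sum((x >= t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_predict(thresholds, x):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

# An odd number of models avoids ties in a two-class vote.
thresholds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(11)]
print(bagged_predict(thresholds, 0.05), bagged_predict(thresholds, 0.95))  # 0 1
```

Each resample leaves out some examples and repeats others, so the individual stumps differ, and the vote averages out part of their individual errors.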
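Dropout, references: [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33]
Units in a hidden layer are randomly removed by multiplying their outputs by zero. Each unit is zeroed with probability $p$, a hyperparameter set by hand.
At inference time, layers trained with Dropout should be corrected using the weight scaling inference rule: multiply the weights going out of each unit by the probability that the unit was kept during training, that is, $1-p$.
Dropout is best understood from the Bagging point of view, as an inexpensive approximation to training and averaging an ensemble of sub-networks that share parameters.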
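The weight scaling inference rule can be checked numerically. The sketch below is illustrative only. It assumes $p$ denotes the probability of zeroing a unit, so the keep probability is $1-p$. It applies the scaling to the activations instead of the outgoing weights, which is equivalent when the next layer is linear. The helper names `dropout_train` and `dropout_infer` are hypothetical.

```python
import random


def dropout_train(h, p_drop, rng):
    """Training time: zero each unit independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else v for v in h]


def dropout_infer(h, p_drop):
    """Inference time: keep every unit and scale by the keep probability 1 - p_drop."""
    keep_prob = 1.0 - p_drop
    return [keep_prob * v for v in h]


rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]   # activations of one hidden layer (illustrative)
p_drop = 0.5
n_masks = 20000

# Average the training-time output over many random masks.
mean_train = [0.0] * len(h)
for _ in range(n_masks):
    for i, v in enumerate(dropout_train(h, p_drop, rng)):
        mean_train[i] += v / n_masks

print(mean_train)                # close to the scaled activations below
print(dropout_infer(h, p_drop))  # [0.5, 1.0, 1.5, 2.0]
```

The average over random masks approaches the scaled activations, so the single scaled network approximates the average prediction of the ensemble of sub-networks without sampling any masks at inference time.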
Adversarial training, references: [34], [35], [36]

References:
[1] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[2] Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag.
[3] Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.

[4] Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN’95, volume 1, pages 141–148.
[5] Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.

[6] Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6), 1424–1438.
[7] Graves, A. (2011). Practical variational inference for neural networks. In NIPS’2011.
[8] Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, pages 529–536. MIT Press.

[9] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints.

[10] Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models Summer School, pages 372–379.
[11] Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th International Conference on Computational Learning Theory (COLT’95), pages 311–320, Santa Cruz, California. ACM Press.

[12] Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN’95, volume 1, pages 141–148.
[13] Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6), 1391–1407.

[14] Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer Society.

[15] Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
[16] Koren, Y. (2009). The BellKor solution to the Netflix grand prize.
[17] Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156, USA. ACM.
[18] Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332.
[19] Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural networks. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10 (NIPS’97), pages 647–653. MIT Press.

[20] Srivastava, N. (2013). Improving Neural Networks With Dropout. Master’s thesis, U. Toronto.
[21] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
[22] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical report, arXiv:1409.4842.
[23] Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis of dropout in piecewise linear networks. In ICLR’2014.
[24] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[25] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS26. NIPS Foundation.
[26] Gal, Y. and Ghahramani, Z. (2015). Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158.
[27] Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. ArXiv e-prints.
[28] Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep recurrent neural networks. In ICLR’2014.
[29] Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
[30] Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer.
[31] Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359.
[32] Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013.
[33] Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In ICML’2013.

[34] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.
[35] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.
[36] Miyato, T., Maeda, S., Koyama, M., Nakae, K., and Ishii, S. (2015). Distributional smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677.
