Using L2 regularization in TensorFlow to fix overfitting

How L2 regularization works:

Add the sum of squares of the weights w to the loss. During training this penalizes large weights; smaller weights give a smoother fitted function and therefore less overfitting. The standard form is total_loss = cross_entropy + λ * Σ ||w||² / 2, where λ is the weight-decay factor.

Regularization does not prevent you from fitting the data, and it does not blindly suppress every parameter. It is a dynamic process, a tug-of-war between cross_entropy and the L2 loss: training pushes w toward values that fit the data, regularization pulls them back toward zero, and the two forces balance out. Irrelevant w_i shrink smaller and smaller (though not exactly to zero), while useful w_i are kept within a reasonable range. I will skip any further theory and derivation.

Run MNIST classification training and compare plain cross_entropy against the L2-regularized total_loss.

MNIST is not a hard dataset, so don't put too many CONV layers in front of the FC layers: the accuracy becomes so good that no gap shows up. To demonstrate the effect of the L2 norm I keep only one CONV layer (note that FC1 takes h_pool1 as its input, bypassing conv2); the two-conv version can serve as a control group.

Because of limited machine performance, I simply take the first 1000 training samples as the validation set and the first 1000 test samples as the test set.

Code overview: a basic CONV+FC network that predicts the image label and is trained with cross_entropy as the performance measure.
Both cross_entropy and the L2 loss terms are pushed into the collection 'losses':

tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)

total_loss = tf.add_n(tf.get_collection('losses')) then sums every loss in the collection; training on total_loss implements exactly the formula above.
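
Before the full program, here is a minimal standalone sketch of the collection idiom; the two constant losses are placeholders invented purely for illustration:

import tensorflow as tf

# made-up scalar losses, only to illustrate the collection mechanism
cross_entropy = tf.constant(0.7, name='cross_entropy')
weight_decay = tf.constant(0.1, name='weight_loss')

# push every loss term into the same collection ...
tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)

# ... then sum the whole collection into the value that is actually minimized
total_loss = tf.add_n(tf.get_collection('losses'))

with tf.Session() as sess:
    print(sess.run(total_loss))  # 0.8 (up to float rounding)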

The complete code:


from __future__ import print_function
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# MNIST digit data (0-9), one-hot labels
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

def compute_accuracy(v_xs, v_ys):
    global prediction
    y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
    correct_prediction = tf.equal(tf.argmax(y_pre,1), tf.argmax(v_ys,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # y_pre and v_ys are already numpy arrays here, so no feed_dict is needed
    result = sess.run(accuracy)
    return result

def weight_variable(shape, wd):
    initial = tf.truncated_normal(shape, stddev=0.1)
    var = tf.Variable(initial)

    if wd is not None and wd > 0:
        print('adding L2 weight decay, wd =', wd)
        # the penalty must be built on the Variable itself, not on the initializer
        # tensor, otherwise it is independent of the weights and never regularizes
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)

    return var

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    # stride [1, x_movement, y_movement, 1]
    # Must have strides[0] = strides[3] = 1
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # stride [1, x_movement, y_movement, 1]
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

# define placeholder for inputs to network
xs = tf.placeholder(tf.float32, [None, 784])/255.   # 28x28
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# print(x_image.shape)  # [n_samples, 28,28,1]

## conv1 layer ##
W_conv1 = weight_variable([5,5, 1,32], 0.) # patch 5x5, in size 1, out size 32
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) # output size 28x28x32
h_pool1 = max_pool_2x2(h_conv1)                                         # output size 14x14x32

## conv2 layer ##
W_conv2 = weight_variable([5,5, 32, 64], 0.) # patch 5x5, in size 32, out size 64
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2) # output size 14x14x64
h_pool2 = max_pool_2x2(h_conv2)                                         # output size 7x7x64

###############################################################################################################
## fc1 layer ##
W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)#do not use conv2
#W_fc1 = weight_variable([7*7*64, 1024], wd = 0.00)#use conv2
b_fc1 = bias_variable([1024])
# [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]
h_pool2_flat = tf.reshape(h_pool1, [-1, 14*14*32])#do not use conv2
#h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])#use conv2
##################################################################################################################



h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

## fc2 layer ##
W_fc2 = weight_variable([1024, 10], wd = 0.)
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)


# the error between prediction and real data
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                                              reduction_indices=[1]))       # loss

tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))
print(total_loss)  # sanity check: prints the add_n tensor itself, not its value

train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

sess = tf.Session()
# important step
# tf.initialize_all_variables() is deprecated
# since 2017-03-02 for tensorflow >= 0.12
if int((tf.__version__).split('.')[1]) < 12 and int((tf.__version__).split('.')[0]) < 1:
    init = tf.initialize_all_variables()
else:
    init = tf.global_variables_initializer()
sess.run(init)

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op_with_l2_norm, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
    if i % 100 == 0:
        print('train accuracy',compute_accuracy(
            mnist.train.images[:1000], mnist.train.labels[:1000]))
        print('test accuracy',compute_accuracy(
            mnist.test.images[:1000], mnist.test.labels[:1000]))


The training runs follow.

No dropout, no L2 norm, 1000 training steps:

weight_variable([1024, 10], wd = 0.)

Train accuracy is better than test accuracy at almost every step: clear overfitting!

train accuracy 0.094
test accuracy 0.089
train accuracy 0.892
test accuracy 0.874
train accuracy 0.91
test accuracy 0.893
train accuracy 0.925
test accuracy 0.925
train accuracy 0.945
test accuracy 0.935
train accuracy 0.954
test accuracy 0.944
train accuracy 0.961
test accuracy 0.951
train accuracy 0.965
test accuracy 0.955
train accuracy 0.964
test accuracy 0.959
train accuracy 0.962
test accuracy 0.956

No dropout, L2 norm on the FC layers with the weight-decay factor set to 0.004, 1000 training steps:

weight_variable([1024, 10], wd = 0.004)

Overfitting is noticeably reduced; at some steps the test set even beats the training set (given the small validation/test subsets, this only shows the rough trend).

train accuracy 0.107
test accuracy 0.145
train accuracy 0.876
test accuracy 0.861
train accuracy 0.91
test accuracy 0.909
train accuracy 0.923
test accuracy 0.919
train accuracy 0.931
test accuracy 0.927
train accuracy 0.936
test accuracy 0.939
train accuracy 0.956
test accuracy 0.949
train accuracy 0.958
test accuracy 0.954
train accuracy 0.947
test accuracy 0.95
train accuracy 0.947
test accuracy 0.953

Control group: dropout only, no L2 regularization. Overfitting is also reduced.

W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)
W_fc2 = weight_variable([1024, 10], wd = 0.)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
train accuracy 0.132
test accuracy 0.104
train accuracy 0.869
test accuracy 0.859
train accuracy 0.898
test accuracy 0.889
train accuracy 0.917
test accuracy 0.906
train accuracy 0.923
test accuracy 0.917
train accuracy 0.928
test accuracy 0.925
train accuracy 0.938
test accuracy 0.94
train accuracy 0.94
test accuracy 0.942
train accuracy 0.947
test accuracy 0.941
train accuracy 0.944
test accuracy 0.947

Control group: two conv layers. Overfitting is not obvious there in the first place, so the results are omitted.

https://github.com/huqinwei/tensorflow_demo/blob/master/tutorials/tensorflowTUT/tf18_CNN3/cnn_with_l2_norm.py

Another approach: the regularizer API

You can also add the regularization term directly into the loss expression and train on that (reg_lambda below stands for the decay coefficient; `lambda` itself is a reserved word in Python):

loss = tf.reduce_mean(tf.square(y_ - y)) + tf.contrib.layers.l2_regularizer(reg_lambda)(w)
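
As a rough sketch of wiring this into a trainable graph under TF 1.x (x, y_, w, b and reg_lambda are placeholder names of my own, not from the original post):

import tensorflow as tf

reg_lambda = 0.004  # assumed decay coefficient, analogous to wd above

x  = tf.placeholder(tf.float32, [None, 3])   # dummy 3-feature input
y_ = tf.placeholder(tf.float32, [None, 1])   # dummy regression target

w = tf.Variable(tf.truncated_normal([3, 1], stddev=0.1))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, w) + b

# data term plus the L2 penalty returned by the regularizer functor
mse = tf.reduce_mean(tf.square(y_ - y))
loss = mse + tf.contrib.layers.l2_regularizer(reg_lambda)(w)

train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)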

A quick test of running the regularization ops on their own (I won't list the version where it is added to the loss; it is long, and you can simply swap it into the earlier code):

import tensorflow as tf
CONST_SCALE = 0.5
w = tf.constant([[5.0, -2.0], [-3.0, 1.0]])
with tf.Session() as sess:
    print(sess.run(tf.abs(w)))
    print('preprocessing:', sess.run(tf.reduce_sum(tf.abs(w))))
    print('manual computation:', sess.run(tf.reduce_sum(tf.abs(w)) * CONST_SCALE))
    print('l1_regularizer:', sess.run(tf.contrib.layers.l1_regularizer(CONST_SCALE)(w))) #11 * CONST_SCALE

    print(sess.run(w**2))
    print(sess.run(tf.reduce_sum(w**2)))
    print('preprocessing:', sess.run(tf.reduce_sum(w**2) / 2))  # note the built-in 1/2 factor
    print('manual computation:', sess.run(tf.reduce_sum(w**2) / 2 * CONST_SCALE))
    print('l2_regularizer:', sess.run(tf.contrib.layers.l2_regularizer(CONST_SCALE)(w))) #19.5 * CONST_SCALE

Output:

[[5. 2.]
 [3. 1.]]
preprocessing: 11.0
manual computation: 5.5
l1_regularizer: 5.5
[[25.  4.]
 [ 9.  1.]]
39.0
preprocessing: 19.5
manual computation: 9.75
l2_regularizer: 9.75

Note: the L2 regularizer's base quantity is the sum of squares divided by 2; the 1/2 is just a convenience factor, the same convention used in the loss formula above. For the example weights: (25 + 4 + 9 + 1) / 2 = 19.5, and 19.5 × 0.5 = 9.75, matching the output.

In practice, for a system of any complexity, writing the formula out by hand is less convenient than throwing the base loss and every regularization term into a collection; all the more so because you usually want different weights to have different decay coefficients, which gets very verbose as a single formula.
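
For example, a sketch of my own combining the regularizer API with the collection idiom (conv_wd, fc_wd and the shapes are made up; a constant stands in for cross_entropy):

import tensorflow as tf

conv_wd = 0.0    # assumed: no decay on the conv kernel
fc_wd   = 0.004  # assumed: stronger decay on the large FC weight

W_conv = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
W_fc   = tf.Variable(tf.truncated_normal([14 * 14 * 32, 1024], stddev=0.1))

# one regularizer per weight, each with its own scale, all pushed into one collection
for w_var, wd in [(W_conv, conv_wd), (W_fc, fc_wd)]:
    if wd > 0:
        tf.add_to_collection('losses', tf.contrib.layers.l2_regularizer(wd)(w_var))

# in the real network cross_entropy comes from the softmax output; a constant stands in here
cross_entropy = tf.constant(0.7)
tf.add_to_collection('losses', cross_entropy)

total_loss = tf.add_n(tf.get_collection('losses'))  # train on this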


Reposted from blog.csdn.net/huqinweI987/article/details/82957034