[Deep Learning Basics 5] Optimizing and Tuning Deep Neural Networks (2)

     Please credit the source when reposting. Thank you.

     This post is organized from Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization. As a record of deep network optimization, the three weeks of material are summarized as bullet points and illustrated with examples.
     The focus of this part is understanding the optimization algorithms. Deep learning frameworks encapsulate them well, and each usually takes a single line of code, but understanding the principles lets you apply them better.

     Numbering continues from the previous post.

Q7: Mini-batch gradient descent

     There are three variants of gradient descent:

1. (Batch) gradient descent: the classic method; each update processes the entire dataset and takes one gradient step. Its advantage is that the cost function always moves in the decreasing direction, but with a large dataset each step is slow.

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

2. Stochastic gradient descent (SGD): performs one gradient step per training example. The downsides are that it loses the speedup of vectorization, and although each step moves roughly toward the minimum, the noise of individual examples prevents it from settling exactly at the global minimum; the cost oscillates.

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation on a single example (sliced to stay 2-D)
        a, caches = forward_propagation(X[:, j:j+1], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:, j:j+1])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

3. Mini-batch gradient descent: a trade-off between batch GD and SGD. Choosing 1 < batch_size < m keeps training fast while making the cost curve smoother than SGD's.

Remember to handle the leftover, incomplete batch when partitioning.

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector, of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of (mini_batch_X, mini_batch_Y) tuples
    """
    
    np.random.seed(seed)            # for reproducible mini-batches
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, mini_batch_size*k : mini_batch_size*(k+1)]
        mini_batch_Y = shuffled_Y[:, mini_batch_size*k : mini_batch_size*(k+1)]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        # index from num_complete_minibatches (k is undefined when m < mini_batch_size)
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size :]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size :]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches

Experience shows that mini-batch sizes are usually chosen as powers of two (2^6, 2^7, ..., 2^10), and more importantly should fit the CPU/GPU memory.
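As a quick sanity check on the partition arithmetic above (the numbers here are made up for illustration): 148 examples with a batch size of 64 yield two full mini-batches plus one leftover batch of 20.

```python
import math

m, mini_batch_size = 148, 64           # hypothetical: m does not divide evenly
num_complete = math.floor(m / mini_batch_size)
sizes = [mini_batch_size] * num_complete
if m % mini_batch_size != 0:
    sizes.append(m % mini_batch_size)  # the leftover, incomplete mini-batch
print(sizes)                           # [64, 64, 20]
```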

Q8. Optimization methods for gradient descent

Background: exponentially weighted averages

Formula: v_t=\beta v_{t-1}+(1-\beta)\theta_t

The meaning of the exponentially weighted average is easiest to see from a concrete example.

Apply the formula to a series of daily temperatures \theta_t, with \beta=0.9:

v_0=0

v_1=\beta v_0+(1-\beta)\theta_1=0.1\theta_1

v_2=\beta v_1+(1-\beta)\theta_2=0.1*0.9*\theta_1+0.1*\theta_2

......

v_{100}=\beta v_{99}+(1-\beta)\theta_{100}=0.1*0.9^{99}*\theta_1+0.1*0.9^{98}*\theta_2+\cdots +0.1*0.9*\theta_{99}+0.1\theta_{100}

As the expansion shows, v_t is a weighted average of the daily temperatures; the weights decay exponentially with age, so more recent days carry larger weights, hence the name exponentially weighted average.

Next, consider the effect of the coefficient β:

β=0.9 (red line): roughly an average over the last 10 days

β=0.98 (green line): roughly an average over the last 50 days

β=0.5 (yellow line): roughly an average over the last 2 days

The reason: take β=0.9 as an example: 0.9^{10}\approx 0.35\approx 1/e. Using 1/e as the cutoff, weights smaller than 1/e are treated as negligible.

For a general β, \beta^{\frac{1}{1-\beta}}\approx\frac{1}{e}, so letting N=\frac{1}{1-\beta}, the exponentially weighted average is roughly an average over the last \frac{1}{1-\beta} days.

In theory one can also divide by 1-\beta^t, i.e. use \frac{v_t}{1-\beta^t}, as a bias correction for the early iterations, but in practice this is often skipped.
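A minimal numeric sketch of the bias correction: feeding a constant signal θ_t = 10 into the recursion, the raw average v starts far below 10, while v_t/(1-β^t) recovers 10 exactly at every step.

```python
beta = 0.9
v = 0.0
for t in range(1, 5):                  # four steps of a constant signal theta = 10
    v = beta * v + (1 - beta) * 10.0
    v_corrected = v / (1 - beta ** t)  # bias correction
    print(t, round(v, 4), round(v_corrected, 4))
# v climbs 1.0, 1.9, 2.71, 3.439, while v_corrected stays 10.0 throughout
```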

1. Momentum

Formulas:

V_{dw}=\beta_1 V_{dw}+(1-\beta_1)dw

V_{db}=\beta_1 V_{db}+(1-\beta_1)db

W:=W-\alpha V_{dw}      b:=b-\alpha V_{db}

Analysis:

    With plain gradient descent, a large learning rate can overshoot and diverge, so one is forced to use a small learning rate, which is slow. Momentum exploits averaging: the up-and-down oscillations along the vertical axis average out to roughly zero, while the consistent horizontal component survives the average and keeps advancing, so the overall trajectory moves along the horizontal axis with much smaller vertical oscillation.

Code:

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):
        
        # compute velocities
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1 - beta) * grads['dW' + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1 - beta) * grads['db' + str(l+1)]
        # update parameters
        parameters["W" + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * v["db" + str(l+1)]
        
    return parameters, v
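The update above assumes the velocity dictionary v already exists. A minimal zero-initialization helper could look like this (the name initialize_velocity and the one-layer parameters below are illustrative, not from the original post):

```python
import numpy as np

def initialize_velocity(parameters):
    """Create zero velocities with the same shapes as each W and b."""
    L = len(parameters) // 2                # number of layers
    v = {}
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
    return v

# hypothetical one-layer parameters, just to show the shapes line up
params = {"W1": np.ones((2, 3)), "b1": np.ones((2, 1))}
v = initialize_velocity(params)
print(v["dW1"].shape, v["db1"].shape)       # (2, 3) (2, 1)
```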

2. RMSprop (root mean square prop)

Formulas:

S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2

S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2

W:=W-\alpha \frac{dw}{\sqrt{S_{dw}}+\epsilon}     b:=b-\alpha \frac{db}{\sqrt{S_{db}}+\epsilon}

Analysis:

The only difference from momentum is that RMSprop keeps an exponentially weighted average of the squared gradients and divides each gradient by the square root of this average. This damps the oscillation dimension by dimension and speeds up convergence. For example, along the vertical axis the gradient oscillates strongly, so S is large; dividing by it makes 1/\sqrt{S}, the scale of the update in that direction, small. Conversely, where S is small (little oscillation), 1/\sqrt{S} is large and that direction is updated more.

Code:

import numpy as np

def update_parameters_with_rmsprop(parameters, grads, s, beta, learning_rate):
    """
    Update parameters using RMSprop
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    s -- python dictionary containing the moving average of the squared gradients:
                    s['dW' + str(l)] = ...
                    s['db' + str(l)] = ...
    beta -- the RMSprop decay hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    s -- python dictionary containing your updated squared-gradient averages
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # RMSprop update for each parameter
    for l in range(L):
        
        # update the squared-gradient averages
        s["dW" + str(l+1)] = beta * s["dW" + str(l+1)] + (1 - beta) * np.power(grads['dW' + str(l+1)], 2)
        s["db" + str(l+1)] = beta * s["db" + str(l+1)] + (1 - beta) * np.power(grads['db' + str(l+1)], 2)
        # update parameters (note: the db update divides by s["db"], not s["dW"])
        parameters["W" + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)] / (np.sqrt(s["dW" + str(l+1)]) + 1e-8)
        parameters["b" + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * grads['db' + str(l+1)] / (np.sqrt(s["db" + str(l+1)]) + 1e-8)
        
    return parameters, s
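To see the per-dimension scaling in isolation (a toy sketch, not part of the original code): feed a constant 2-D gradient that is steep in one dimension and shallow in the other. Once S settles near dw^2, dividing by \sqrt{S}+\epsilon equalizes the two step sizes.

```python
import numpy as np

grad = np.array([10.0, 0.1])        # hypothetical: steep axis 0, shallow axis 1
beta, eps = 0.9, 1e-8

s = np.zeros(2)
for _ in range(100):                # let S approach its steady state grad**2
    s = beta * s + (1 - beta) * grad ** 2
step = grad / (np.sqrt(s) + eps)    # RMSprop-scaled step, before the learning rate
print(step)                         # both components come out close to 1
```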

3. Adam

Formulas:

V_{dw}=\beta_1 V_{dw}+(1-\beta_1)dw  V_{db}=\beta_1 V_{db}+(1-\beta_1)db

S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dw^2  S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2

V_{dw}^{corrected}=\frac{V_{dw}}{1-\beta_1^t}    V_{db}^{corrected}=\frac{V_{db}}{1-\beta_1^t}

S_{dw}^{corrected}=\frac{S_{dw}}{1-\beta_2^t}     S_{db}^{corrected}=\frac{S_{db}}{1-\beta_2^t}, where t is the iteration count

W:=W-\alpha \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}}+\epsilon}     b:=b-\alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}

Analysis:

Adam combines the advantages of momentum (1) and RMSprop (2).

Code:

import numpy as np

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8):
    """
    Update parameters using Adam
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- Adam step counter, used for bias correction (starts at 1)
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates 
    beta2 -- Exponential decay hyperparameter for the second moment estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    
    L = len(parameters) // 2                 # number of layers in the neural networks
    v_corrected = {}                         # Initializing first moment estimate, python dictionary
    s_corrected = {}                         # Initializing second moment estimate, python dictionary
    
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads["db" + str(l+1)]

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - np.power(beta1, t))

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * np.power(grads["dW" + str(l+1)], 2)
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * np.power(grads["db" + str(l+1)], 2)

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - np.power(beta2, t))

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)

    return parameters, v, s
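A one-step sanity check (toy numbers, not from the original post): starting from zero moments, the bias-corrected first step has magnitude close to the learning rate for every coordinate, regardless of the raw gradient scale. This is one reason Adam handles poorly scaled gradients well.

```python
import numpy as np

beta1, beta2, eps, lr, t = 0.9, 0.999, 1e-8, 0.01, 1
grad = np.array([100.0, 0.001])     # hypothetical gradients of wildly different scale

v = (1 - beta1) * grad              # first update starting from v = 0
s = (1 - beta2) * grad ** 2         # first update starting from s = 0
v_hat = v / (1 - beta1 ** t)        # bias correction: v_hat == grad at t = 1
s_hat = s / (1 - beta2 ** t)        # bias correction: s_hat == grad**2 at t = 1
step = lr * v_hat / (np.sqrt(s_hat) + eps)
print(step)                         # both components are close to lr = 0.01
```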

Q9. Learning rate decay

    If training uses a fixed learning rate, the inherent noise of the samples makes the iterates wander around the minimum, so the model never converges precisely. Learning rate decay addresses this: start with a large learning rate for fast progress, then shrink it near the minimum so the iterates settle closer to the optimum.

Common schedules:

(1) \alpha = \frac{1}{1+decay\_rate*epoch\_num}\alpha_0

(2) Exponential decay: \alpha = 0.95^{epoch\_num}\alpha_0

(3) \alpha = \frac{k}{\sqrt{epoch\_num}}\alpha_0

(4) Discrete staircase: use a different constant α on each stage of training.
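The four schedules can be sketched as plain functions of the epoch number (the hyperparameter values alpha0 = 0.2, decay_rate = 1.0, k = 0.3 and the halving interval below are arbitrary examples):

```python
import numpy as np

alpha0, decay_rate, k = 0.2, 1.0, 0.3    # hypothetical hyperparameters

def inverse_decay(epoch):                # schedule (1)
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(epoch):            # schedule (2)
    return 0.95 ** epoch * alpha0

def sqrt_decay(epoch):                   # schedule (3), epoch >= 1
    return k / np.sqrt(epoch) * alpha0

def staircase_decay(epoch):              # schedule (4): halve alpha every 10 epochs
    return alpha0 * 0.5 ** (epoch // 10)

for epoch in (1, 4, 20):
    print(epoch, inverse_decay(epoch), exponential_decay(epoch),
          sqrt_decay(epoch), staircase_decay(epoch))
```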

Reposted from blog.csdn.net/Biyoner/article/details/85016844