2.2 Notes on Andrew Ng's Deep Learning Course: Optimization Algorithms

1. Mini-batch gradient descent

Idea: batch gradient descent processes the entire dataset at once, so when the dataset is large, say 5 million examples, a single iteration takes a long time. Mini-batch gradient descent instead splits the dataset into many subsets and takes one gradient step per subset, so each epoch works through the subsets one at a time.

Implementation: building mini-batches takes two steps. Step 1, shuffle: randomly permute the original set; the shuffling ensures that examples are assigned to different mini-batches at random. Step 2, partition: once the training set is shuffled, slice it into mini-batches.

Shuffle snippet (m is the number of examples; X has shape (n_x, m) and Y has shape (1, m)):

    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape(1, m)

Partition snippet (requires `import math`):

    mini_batches = []
    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    if m % mini_batch_size != 0:    # handle the last, smaller batch
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:m]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:m]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)    # mini_batches is a list whose elements are tuples
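The two steps above can be combined into a single helper. A minimal self-contained sketch (the name `random_mini_batches` and the `seed` parameter are conveniences chosen here, not fixed by the notes):

```python
import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) column-wise and cut into mini-batches.

    X -- input data, shape (n_x, m)
    Y -- labels, shape (1, m)
    Returns a list of (mini_batch_X, mini_batch_Y) tuples.
    """
    np.random.seed(seed)
    m = X.shape[1]
    mini_batches = []

    # Step 1: shuffle columns of X and Y with one shared permutation
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape(1, m)

    # Step 2: partition into full batches plus an optional smaller remainder
    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    return mini_batches
```

For example, 10 examples with `mini_batch_size=4` yield two full batches and one remainder batch of 2.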

After shuffling and partitioning, mini_batches is a list of tuples, which the training loop can iterate over as follows:

    for minibatch in mini_batches:
        minibatch_X, minibatch_Y = minibatch

Then treat minibatch_X and minibatch_Y as X and Y for forward propagation, backpropagation, and so on.
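Put together, an epoch loop might look like the sketch below. A trivial linear model y = w * x stands in for the network here so the example is runnable; in the notes, the inner loop would call the full forward/backward pass instead:

```python
import numpy as np

np.random.seed(1)
m = 100
X = np.random.randn(1, m)
Y = 3.0 * X                      # ground-truth weight is 3
w = 0.0                          # single scalar parameter of the toy model
learning_rate = 0.1
mini_batch_size = 20

for epoch in range(50):
    # re-shuffle every epoch so batches differ between epochs
    permutation = np.random.permutation(m)
    shuffled_X, shuffled_Y = X[:, permutation], Y[:, permutation]
    for k in range(m // mini_batch_size):
        sl = slice(k * mini_batch_size, (k + 1) * mini_batch_size)
        minibatch_X, minibatch_Y = shuffled_X[:, sl], shuffled_Y[:, sl]
        # "forward" and "backward" for the toy model y_hat = w * x
        y_hat = w * minibatch_X
        dw = 2 * np.mean((y_hat - minibatch_Y) * minibatch_X)
        w -= learning_rate * dw

print(w)   # converges toward 3.0
```

Each epoch makes one pass over all mini-batches, and each mini-batch contributes one gradient step.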

2. Gradient descent with momentum

Idea: compute an exponentially weighted average of the gradients and use that average, rather than the raw gradient, to update the weights. Momentum damps the vertical oscillations of the descent path and speeds up convergence.

Implementation:

Step 1, initialize the velocity v:

def initialize_velocity(parameters):
    L = len(parameters) // 2    # number of layers
    v = {}
    for l in range(L):
        v['dW' + str(l + 1)] = np.zeros_like(parameters['W' + str(l + 1)])
        v['db' + str(l + 1)] = np.zeros_like(parameters['b' + str(l + 1)])

    return v

Step 2, modify the parameter-update function update_parameters():

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    L = len(parameters) // 2
    for l in range(L):
        # exponentially weighted average of the gradients
        v['dW' + str(l + 1)] = beta * v['dW' + str(l + 1)] + (1 - beta) * grads['dW' + str(l + 1)]
        v['db' + str(l + 1)] = beta * v['db' + str(l + 1)] + (1 - beta) * grads['db' + str(l + 1)]
        # update parameters with the velocity instead of the raw gradient
        parameters['W' + str(l + 1)] = parameters['W' + str(l + 1)] - learning_rate * v['dW' + str(l + 1)]
        parameters['b' + str(l + 1)] = parameters['b' + str(l + 1)] - learning_rate * v['db' + str(l + 1)]

    return parameters, v
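A quick sanity check on the update rule: starting from zero velocity, one momentum step moves each parameter by learning_rate * (1 - beta) * gradient. The following self-contained snippet repeats the update rule inline on a single weight matrix rather than calling the functions above:

```python
import numpy as np

beta, learning_rate = 0.9, 0.1
W1 = np.zeros((2, 3))
dW1 = np.ones((2, 3))           # constant gradient for illustration
v_dW1 = np.zeros((2, 3))        # velocity starts at zero

# one momentum step, same rule as in update_parameters_with_momentum
v_dW1 = beta * v_dW1 + (1 - beta) * dW1
W1 = W1 - learning_rate * v_dW1

print(W1[0, 0])   # ~ -0.01: the first step is damped by the (1 - beta) factor
```

Over many steps with a steady gradient, the velocity grows toward the full gradient, which is what smooths oscillations while preserving progress along the consistent descent direction.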

3. The Adam optimization algorithm

Adam refines the momentum idea further: it keeps the moving average of the gradients (v) and adds a moving average of the squared gradients (s), plus bias correction for both.

The implementation again takes two steps: initialize v and s, then modify the parameter-update function.

The full implementation:

def initialize_adam(parameters):
    L = len(parameters) // 2
    v = {}   # first-moment (momentum) estimates
    s = {}   # second-moment (squared-gradient) estimates
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

        s["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        s["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

    return v, s

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2
    v_corrected = {}   # bias-corrected first moment estimates
    s_corrected = {}   # bias-corrected second moment estimates

    for l in range(L):
        # moving average of the gradients; inputs: "v, grads, beta1", output: "v"
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads["db" + str(l + 1)]

        # bias-corrected first moment estimate; inputs: "v, beta1, t", output: "v_corrected"
        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))

        # moving average of the squared gradients; inputs: "s, grads, beta2", output: "s"
        s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.square(grads["dW" + str(l + 1)])
        s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.square(grads["db" + str(l + 1)])

        # bias-corrected second moment estimate; inputs: "s, beta2, t", output: "s_corrected"
        s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))

        # update parameters; note epsilon is added outside the square root
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * (v_corrected["dW" + str(l + 1)] / (np.sqrt(s_corrected["dW" + str(l + 1)]) + epsilon))
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * (v_corrected["db" + str(l + 1)] / (np.sqrt(s_corrected["db" + str(l + 1)]) + epsilon))

    return parameters, v, s
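As with momentum, a single self-contained step shows why the bias correction matters: at t = 1, v equals (1 - beta1) * gradient, and dividing by (1 - beta1^t) recovers the raw gradient exactly, so Adam's first step has roughly unit scale per coordinate. The snippet repeats the update rules inline on one weight matrix:

```python
import numpy as np

beta1, beta2, epsilon, t = 0.9, 0.999, 1e-8, 1
dW = np.full((2, 2), 0.5)       # constant gradient for illustration
v = np.zeros((2, 2))
s = np.zeros((2, 2))

# one Adam step, same rules as in update_parameters_with_adam
v = beta1 * v + (1 - beta1) * dW
s = beta2 * s + (1 - beta2) * np.square(dW)
v_corrected = v / (1 - beta1 ** t)   # equals dW exactly at t = 1
s_corrected = s / (1 - beta2 ** t)   # equals dW**2 exactly at t = 1
update = v_corrected / (np.sqrt(s_corrected) + epsilon)

print(update[0, 0])   # ~ 1.0 regardless of the gradient's magnitude
```

Without the correction, v and s would start heavily biased toward zero, and the early steps would be far too small.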

Reprinted from blog.csdn.net/qq_40103460/article/details/80192461