Machine Learning: Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-batch Stochastic Gradient Descent (Mini-batch SGD)

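I. Gradient Descent

In gradient descent, every iteration uses all of the training data. What is usually called Batch Gradient Descent is the same thing as gradient descent; here "batch" refers to the entire training set.

Loss function: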

$J(\theta ) = \frac{1}{{2m}}\sum\limits_{i = 1}^m {{{({h_\theta }({x^{(i)}}) - {y^{(i)}})}^2}}$
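
As a quick sketch, the loss can be evaluated with NumPy as follows. This assumes a linear hypothesis ${h_\theta}(x) = \theta^T x$, which is what the code in this post uses; the function name `mse_loss` and the toy data are made up for illustration.

```python
import numpy as np

def mse_loss(weight, x, y):
    """Return J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2) for a linear model."""
    m = len(x)
    residual = x @ weight - y            # h_theta(x^(i)) - y^(i) for every sample
    return float(residual @ residual) / (2 * m)

# Toy data (hypothetical): y = 2 * x exactly, so weight = [2.0] gives zero loss.
x_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])

loss_at_optimum = mse_loss(np.array([2.0]), x_demo, y_demo)
loss_at_zero = mse_loss(np.array([0.0]), x_demo, y_demo)   # (4 + 16 + 36) / 6
```

The factor 1/2 in front only simplifies the derivative; it does not change where the minimum lies.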

Training procedure:

$\begin{array}{l} \text{repeat}\ \{ \\ \quad \theta_j := \theta_j - \alpha \frac{1}{m}\sum\limits_{i = 1}^m ( {h_\theta }({x^{(i)}}) - {y^{(i)}})x_j^{(i)} \quad (\text{simultaneously for every } j)\\ \} \end{array}$
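
The update has this form because it steps against the partial derivative of $J(\theta)$ with respect to a single component $\theta_j$. Assuming a linear hypothesis ${h_\theta}(x) = \sum_k \theta_k x_k$ (the case the code in this post implements; the index $k$ is introduced here only for the derivation), a sketch of the steps is:

$\begin{array}{rl} \frac{\partial J(\theta)}{\partial \theta_j} &= \frac{1}{2m}\sum\limits_{i = 1}^m 2\,({h_\theta}({x^{(i)}}) - {y^{(i)}})\,\frac{\partial {h_\theta}({x^{(i)}})}{\partial \theta_j} \\ &= \frac{1}{m}\sum\limits_{i = 1}^m ({h_\theta}({x^{(i)}}) - {y^{(i)}})\,x_j^{(i)} \end{array}$

Moving $\theta_j$ a step of size $\alpha$ in the opposite direction of this derivative gives the training rule.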

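A rough implementation looks like this:

import numpy as np

# Assumes x_train has shape (m, n), y_train holds m targets, weight has shape (n,) or (1, n).
for epoch in range(epoches):
    y_pred = np.dot(weight, x_train.T)                        # predictions for all m samples
    deviation = y_pred - y_train.reshape(y_pred.shape)        # h(x) - y
    gradient = 1 / len(x_train) * np.dot(deviation, x_train)  # average over all samples
    weight = weight - learning_rate * gradient                # one update per epoch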
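
To try the loop end to end, the sketch below wraps it with synthetic data. The data, the seed, the learning rate and the epoch count are all assumptions chosen for illustration, not values from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the run is repeatable

# Synthetic, noise-free data (hypothetical): y = 3*x1 - 2*x2
true_weight = np.array([3.0, -2.0])
x_train = rng.normal(size=(200, 2))
y_train = x_train @ true_weight

weight = np.zeros(2)
learning_rate = 0.1
epoches = 500

for epoch in range(epoches):
    y_pred = np.dot(weight, x_train.T)
    deviation = y_pred - y_train.reshape(y_pred.shape)
    gradient = 1 / len(x_train) * np.dot(deviation, x_train)
    weight = weight - learning_rate * gradient   # exactly one update per epoch

print(weight)   # should end up very close to true_weight
```

Because the targets contain no noise, the weights can be recovered almost exactly; with noisy data the loop would settle near the least-squares solution instead.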

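Advantages:

1. The gradient direction in each iteration is decided jointly by all training samples, so the updates are stable and do not oscillate much.

Disadvantages:

1. Compared with SGD, convergence is slow, because the parameters are updated infrequently.

2. On large datasets, every update requires a computation over the entire data matrix, which can exhaust memory.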

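II. Stochastic Gradient Descent

In stochastic gradient descent, each iteration (parameter update) uses only one randomly drawn training sample. SGD therefore updates the parameters far more often: epochs × m times, where m is the number of samples, whereas gradient descent updates them only once per epoch.

Its update rule is: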

$\theta = \theta + \alpha(y^{(i)}-h_{\theta}(x^{(i)}))x^{(i)}$

A rough implementation looks like this:

import numpy as np

for epoch in range(epoches):
    for i in range(len(x_train)):
        index = np.random.randint(0, len(x_train))   # upper bound is exclusive
        y_pred = np.dot(weight, x_train[index].T)
        deviation = y_pred - y_train[index].reshape(y_pred.shape)
        gradient = deviation * x_train[index]        # gradient from this one sample
        weight = weight - learning_rate * gradient

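For a fixed number of epochs, GD and SGD perform roughly the same amount of arithmetic: in each epoch GD computes one product of an m×n matrix with an n×1 vector, while SGD computes m products of a 1×n vector with an n×1 vector. So SGD does not necessarily finish earlier than GD. SGD converges faster mainly because it updates the parameters many more times. Because NumPy's vectorized matrix operations are faster than a Python for loop, GD may even finish before SGD. When training stops on a convergence criterion rather than after a fixed number of epochs, however, SGD may finish earlier than GD.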
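
The last point can be illustrated with a small experiment. The sketch below stops each method once the loss falls below a tolerance and reports how many epochs that took. Everything here is an assumption made for illustration: the helper names, the tolerance, the learning rate, and the noise-free synthetic data. It is a toy comparison, not a general benchmark.

```python
import numpy as np

def loss(weight, x, y):
    residual = x @ weight - y
    return float(residual @ residual) / (2 * len(x))

def epochs_until_converged(x, y, update_one_epoch, tol=1e-8, max_epochs=1000):
    """Run whole epochs until the loss drops below tol; return the epoch count."""
    weight = np.zeros(x.shape[1])
    for epoch in range(1, max_epochs + 1):
        weight = update_one_epoch(weight, x, y)
        if loss(weight, x, y) < tol:
            return epoch
    return max_epochs

def gd_epoch(weight, x, y, learning_rate=0.1):
    gradient = (x @ weight - y) @ x / len(x)       # one full-batch update
    return weight - learning_rate * gradient

def make_sgd_epoch(rng, learning_rate=0.1):
    def sgd_epoch(weight, x, y):
        for _ in range(len(x)):                    # m single-sample updates
            i = rng.integers(len(x))
            weight = weight - learning_rate * (x[i] @ weight - y[i]) * x[i]
        return weight
    return sgd_epoch

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 2))
y_train = x_train @ np.array([3.0, -2.0])          # noise-free synthetic targets

gd_epochs = epochs_until_converged(x_train, y_train, gd_epoch)
sgd_epochs = epochs_until_converged(x_train, y_train, make_sgd_epoch(rng))
print(gd_epochs, sgd_epochs)
```

On this easy problem SGD needs fewer epochs than GD to reach the tolerance, since it makes 200 updates per epoch instead of one. Fewer epochs does not automatically mean less wall-clock time, for the reasons given above.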

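Advantages:

1. Fast convergence (many parameter updates).

Disadvantages:

1. Each iteration relies on a single training sample; if that sample is not representative, the weights w oscillate strongly and training is unstable.

2. The parameters are updated very frequently, which adds overhead.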
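
The oscillation in the first drawback can be made concrete by comparing single-sample gradients with the full-batch gradient at the same weights. The sketch below uses made-up noisy data and a linear model; all names and numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(size=(500, 3))
y_train = x_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

weight = np.zeros(3)
residual = x_train @ weight - y_train              # shape (500,)

per_sample_grads = residual[:, None] * x_train     # row i = gradient from sample i alone
full_grad = residual @ x_train / len(x_train)      # gradient used by GD

# The average of the single-sample gradients is exactly the full gradient ...
mean_matches = np.allclose(per_sample_grads.mean(axis=0), full_grad)
# ... but individual samples scatter around it, which is what makes SGD oscillate.
spread = per_sample_grads.std(axis=0)
print(mean_matches, spread)
```

Each SGD step is therefore correct on average but noisy, and a single unrepresentative sample can push the weights in a direction quite different from the full gradient.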

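III. Mini-batch Stochastic Gradient Descent

This is the most commonly used method in machine learning. It is a compromise between the previous two methods, so it keeps the advantages of both GD and SGD while mitigating some of their drawbacks. It is typically used when the dataset is large.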
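
The post gives no formula for this variant. By analogy with the update rules for GD and SGD, one common way to write it is the following, where $B$ is the set of sample indices in the current batch and $b = |B|$ is the batch size (both symbols are introduced here and are not from the original):

$\theta_j := \theta_j - \alpha \frac{1}{b}\sum\limits_{i \in B} ( {h_\theta }({x^{(i)}}) - {y^{(i)}})x_j^{(i)}$

With $b = m$ this reduces to gradient descent, and with $b = 1$ to stochastic gradient descent.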

A rough implementation looks like this:

import numpy as np

def batch_generator(x, y, batch_size):          # batch generator
    nsamples = len(x)
    batch_num = nsamples // batch_size          # leftover samples that do not fill a batch are skipped
    indexes = np.random.permutation(nsamples)   # reshuffled on every call, i.e. every epoch
    for i in range(batch_num):
        batch_idx = indexes[i*batch_size:(i+1)*batch_size]
        yield x[batch_idx], y[batch_idx]


for epoch in range(epoches):
    for x_batch, y_batch in batch_generator(x_train, y_train, batch_size):
        y_hat = np.dot(weight, x_batch.T)
        deviation = y_hat - y_batch.reshape(y_hat.shape)
        gradient = 1 / len(x_batch) * np.dot(deviation, x_batch)   # average over the batch
        weight = weight - learning_rate * gradient
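
Note that the generator skips any samples left over when the number of samples is not a multiple of the batch size. If those samples should also be used, one possible variant yields a smaller final batch. The name `batch_generator_keep_last` and the explicit `rng` argument are hypothetical and not part of the original code.

```python
import numpy as np

def batch_generator_keep_last(x, y, batch_size, rng):
    """Yield shuffled mini-batches; the final batch may be smaller than batch_size."""
    nsamples = len(x)
    indexes = rng.permutation(nsamples)
    for start in range(0, nsamples, batch_size):
        batch_idx = indexes[start:start + batch_size]
        yield x[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
x_demo = np.arange(20, dtype=float).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.arange(10, dtype=float)

batch_sizes = [len(xb) for xb, yb in batch_generator_keep_last(x_demo, y_demo, 4, rng)]
print(batch_sizes)  # [4, 4, 2]
```

Since the training loop divides by `len(x_batch)` rather than by `batch_size`, it averages correctly over a smaller final batch without further changes.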

IV. References

https://www.cnblogs.com/richqian/p/4549590.html

JacksonKim
