Batch gradient descent and stochastic gradient descent: implementation

An introduction to batch gradient descent and stochastic gradient descent

In this article, m is the number of training samples, w is the weight, b is the intercept, x is a feature, y is a label, ŷ is the predicted value, and i indexes the samples.

The loss function is the mean squared error (written here with the conventional 1/2 factor, so that the gradients below match the code):

        L(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (w x_i + b - y_i)^2
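For reference, a minimal NumPy sketch of this loss (an illustrative addition, not part of the original code; it assumes x and y are 1-D arrays holding the m samples):

import numpy as np

def mse_loss(w, b, x, y):
    # hypothetical helper: mean squared error with the 1/2 factor from the formula above
    return np.mean((w * x + b - y) ** 2) / 2

The compute_cost function further below reports the same quantity without the 1/2 factor, which only changes the reported number by a constant.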

1. Batch gradient descent

  Batch gradient descent (BGD) is the most basic form of gradient descent: every iteration uses all of the samples to compute the gradient and update the parameters. Mathematically:

       Partial derivative of L with respect to w:

                \frac{\partial L}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (w x_i + b - y_i)\, x_i

       Partial derivative of L with respect to b:

                \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (w x_i + b - y_i)

  Advantages:
  (1) Every iteration uses all of the samples, so the computation can be written as matrix operations and parallelized (a vectorized sketch is shown after the convergence figure below).
  (2) The gradient over the full data set represents the whole population, so the update direction points more accurately toward the extremum. When the objective function is convex, BGD reaches the global optimum.
  Disadvantages:
  (1) When the number of samples m is large, every iteration has to evaluate all of the samples, so training becomes very slow.
  In terms of iteration count, BGD needs relatively few iterations. A schematic of its convergence curve looks like this:

                      [Figure: schematic convergence curve of BGD]
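As mentioned in advantage (1), the per-sample loop can be replaced by matrix (vectorized) operations. The sketch below is an illustrative addition, not part of the original implementation; it assumes x and y are the NumPy arrays loaded in the code further down and computes the same update as the loop version:

import numpy as np

def bgd_step(w, b, x, y, alpha):
    # hypothetical helper: one vectorized BGD update over the whole batch
    err = w * x + b - y          # residuals of all m samples at once
    grad_w = np.mean(err * x)    # dL/dw, identical to sum_w / l in the loop version below
    grad_b = np.mean(err)        # dL/db, identical to sum_b / l in the loop version below
    return w - alpha * grad_w, b - alpha * grad_b

This produces exactly the same numbers as the loop in param_gra_des, but lets NumPy do the summation in one pass.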

 

Python implementation:

"" " Batch gradient descent ." ""

# Import Package 
Import numpy AS NP
 Import   matplotlib.pyplot AS PLT

# Import data 
DATAS = np.genfromtxt ( " CSV / the data.csv " , DELIMITER = " , " , skip_header =. 1 )
     # import data, to be the new CSV file in the project, and removing the first line of 
X = DATAS [: , 0]
and = datas [: 1 ]
plt.scatter(x,y)
plt.show()

Plotting the data with matplotlib gives the scatter plot below:

Batch gradient descent algorithm:

# gradient descent
l = len(datas)
def param_gra_des(cw, cb, datas, alpha):
    """
    :param cw: current value of w
    :param cb: current value of b
    :param datas: the imported data
    :param alpha: learning rate
    :return: the parameters after one gradient update
    """
    sum_w = 0  # accumulated gradient for w
    sum_b = 0  # accumulated gradient for b
    for i in range(l):
        x = datas[i, 0]
        y = datas[i, 1]
        sum_w += (cw*x + cb - y)*x  # sum of the derivative of the loss with respect to w
        sum_b += cw*x + cb - y      # sum of the derivative of the loss with respect to b
    upw = cw - (alpha * sum_w/l)  # updated w
    upb = cb - (alpha * sum_b/l)  # updated b
    return [upw, upb]

def step_grd_des(qw, qb, alpha, times, datas):
    """
    :param qw: initial value of w
    :param qb: initial value of b
    :param alpha: learning rate
    :param times: number of updates
    :param datas: the data
    :return: the final weights
    """
    w = qw
    b = qb
    for i in range(times):
        w, b = param_gra_des(w, b, datas, alpha)  # update w and b from the same gradient evaluation
    return w, b
def compute_cost(w, b, datas):
    total_cost = 0
    for i in range(l):
        x = datas[i, 0]
        y = datas[i, 1]
        total_cost += (y - w*x - b)**2
    return total_cost/l  # a single / is float division; // is floor division (integer), e.g. 3 // 4

Test:

# set the starting point and learning rate
qw = 0
qb = 0
alpha = 0.0000001
times = 5000
# run the training
w, b = step_grd_des(qw, qb, alpha, times, datas)
loss_cost = compute_cost(w, b, datas)
print("Weight: " + str(w))
print("Intercept: " + str(b))
print("Mean loss: " + str(loss_cost))
plt.scatter(x, y)
m_y = w*x + b
plt.plot(x, m_y, c="r")
plt.show()

The result and plot are as follows:

Weight: 1.0180178588402415
Intercept: 7.997346339384322
Mean loss: 939.0039074472908

 

2. Stochastic gradient descent

  Unlike batch gradient descent, stochastic gradient descent (SGD) updates the parameters using a single sample per iteration, which makes training much faster.

         Partial derivative of L with respect to w (for the single randomly chosen sample i):

              \frac{\partial L}{\partial w} = (w x_i + b - y_i)\, x_i

         Partial derivative of L with respect to b:

              \frac{\partial L}{\partial b} = w x_i + b - y_i

  Note: here x and y are one sample chosen at random from the m samples!

In terms of iteration count, SGD needs many more iterations, and its search through the solution space looks rather blind. A schematic of its convergence curve looks like this:

                [Figure: schematic convergence curve of SGD]
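To see this behaviour concretely, here is a small self-contained sketch (an illustrative addition, using synthetic linear data rather than the CSV file above) that runs single-sample updates, records the full-data loss every few hundred steps, and plots how it evolves:

import numpy as np
import matplotlib.pyplot as plt
import random

# synthetic data: y ≈ 1.0*x + 8 plus noise (assumed, roughly like the data set above)
rng = np.random.default_rng(0)
xs = rng.uniform(0, 100, 200)
ys = 1.0 * xs + 8 + rng.normal(0, 5, 200)

w, b, alpha = 0.0, 0.0, 0.00005
losses = []
for t in range(50000):
    i = random.randint(0, len(xs) - 1)   # one random sample per update
    err = w * xs[i] + b - ys[i]
    w -= alpha * err * xs[i]             # single-sample gradient step for w
    b -= alpha * err                     # single-sample gradient step for b
    if t % 500 == 0:
        losses.append(np.mean((w * xs + b - ys) ** 2))  # full-data MSE, recorded for the curve

plt.plot(losses)
plt.xlabel("update (x500)")
plt.ylabel("mean squared error")
plt.show()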

  Advantages:
  (1) Instead of computing the loss over the entire training set, each iteration optimizes the loss on a single randomly chosen training sample, so each parameter update is much faster.
  Disadvantages:
  (1) Lower accuracy: even when the objective function is strongly convex, SGD still cannot achieve linear convergence.
  (2) It may converge to a local optimum, because a single sample does not represent the trend of the whole data set.
  (3) It is not easy to parallelize.

Python implementation:

# import packages
import numpy as np
import matplotlib.pyplot as plt
import random


# import data

datas = np.genfromtxt("csv/data.csv", delimiter=",", skip_header=1)
    # the CSV file must be added to the project; skip_header=1 drops the header line
x = datas[:, 0]
y = datas[:, 1]
plt.scatter(x, y)
#plt.show()

# gradient descent
l = len(datas)
def param_gra_des(cw, cb, datas, alpha):
    """
    :param cw: current value of w
    :param cb: current value of b
    :param datas: the imported data
    :param alpha: learning rate
    :return: the parameters after one gradient update
    """
    i = random.randint(0, l-1)  # pick one sample at random from the m samples
    x = datas[i, 0]
    y = datas[i, 1]
    upw = cw - (alpha * (cw*x + cb - y)*x)  # updated w
    upb = cb - (alpha * (cw*x + cb - y))    # updated b
    return [upw, upb]

def step_grd_des(qw, qb, alpha, times, datas):
    """
    :param qw: initial value of w
    :param qb: initial value of b
    :param alpha: learning rate
    :param times: number of updates
    :param datas: the data
    :return: the final weights
    """
    w = qw
    b = qb
    for i in range(times):
        w, b = param_gra_des(w, b, datas, alpha)  # update w and b from the same random sample
    return w, b

def compute_cost(w, b, datas):
    total_cost = 0
    for i in range(l):
        x = datas[i, 0]
        y = datas[i, 1]
        total_cost += (y - w*x - b)**2
    return total_cost/l  # a single / is float division; // is floor division (integer), e.g. 3 // 4

# set the starting point, learning rate and number of iterations
qw = 0
qb = 0
alpha = 0.00000001
times = 50000
w, b = step_grd_des(qw, qb, alpha, times, datas)
loss_cost = compute_cost(w, b, datas)
print("Weight: " + str(w))
print("Intercept: " + str(b))
print("Mean loss: " + str(loss_cost))
plt.scatter(x, y)
m_y = w*x + b
plt.plot(x, m_y, c="b")
plt.show()

The result and plot are as follows:

Weight: 1.0180178588402415
Intercept: 7.997346339384322
Mean loss: 939.0039074472908

Why does SGD converge faster than BGD?
  Answer: suppose there are 300,000 samples. For BGD, each iteration has to process all 300,000 samples to perform one parameter update, and reaching the minimum may take many iterations (say 300 here). For SGD, each update needs only one sample, so a single pass over the 300,000 samples updates (iterates) the parameters 300,000 times, and within that pass SGD can already converge to a reasonable minimum. If both run 300 iterations, BGD has processed 300 × 300,000 samples, while SGD has processed only 300.
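To make that comparison concrete, a tiny sketch using the hypothetical numbers from the paragraph above:

m = 300000             # assumed number of training samples
bgd_iterations = 300   # assumed number of BGD updates
print(bgd_iterations * m)   # samples processed by BGD: 90,000,000
print(m)                    # samples processed by SGD in one pass (300,000 parameter updates)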

Comparing the two programs above: BGD ran 5,000 iterations while SGD needed 50,000 iterations to reach a roughly comparable result; with fewer iterations the difference can be large!

                

        

 

 

 


Origin www.cnblogs.com/hhxz/p/11992143.html