SGD (Stochastic Gradient Descent) is a commonly used optimization method in deep learning.
1. A Motivating Example
Before introducing SGD, let me start with an example. Three people are standing on a mountaintop, thinking about how to get down the mountain quickly. The boss, the second child, and the third child each propose a different approach.
- The boss said: starting from the top of the mountain, every time you walk a certain distance, survey all the nearby paths and take the steepest one. In other words, the boss always picks the steepest path available.
- The second child said: starting from the top of the mountain, every time you walk a certain distance, randomly sample a few nearby paths and take the steepest of those. In other words, the second child searches only a random subset of paths, then takes the steepest one he found.
- The third child said: starting from the top of the mountain, just pick a path at random and keep walking until you reach the bottom.
Although the boss's approach is locally optimal at every step, he spends a lot of time searching for the steepest path.
The second child's approach does not guarantee the optimal path at every step, but it usually finds a good one, and without spending much time searching.
The third child's moves are the most random; the path he takes may turn out to be the best or the worst.
So who do you think reaches the bottom of the mountain first? After studying SGD, you will have your answer.
2. Introduction to SGD
2.1 Introducing problems
Given an $xy$ coordinate system with some points on it, and a straight line through the origin $y = wx$, how do we fit these points as quickly as possible?
To solve this problem, we first need to define a goal: minimize the deviation of all points from the line. A commonly used error function is the mean squared error. For a single point $p_1$, its squared error with respect to the line can be defined as $e_1$:
$$e_1 = (y_1 - wx_1)^2 = (wx_1 - y_1)^2$$
Expanding the square:

$$e_1 = w^2{x_1}^2 - 2(wx_1y_1) + {y_1}^2 = {x_1}^2w^2 - 2(x_1y_1)w + {y_1}^2$$

Similarly, for points $p_2, p_3, \ldots, p_n$:

$$e_2 = {x_2}^2w^2 - 2(x_2y_2)w + {y_2}^2$$
$$e_3 = {x_3}^2w^2 - 2(x_3y_3)w + {y_3}^2$$
$$\vdots$$
$$e_n = {x_n}^2w^2 - 2(x_ny_n)w + {y_n}^2$$

Our final error is the average over all points:

$$e = (e_1 + e_2 + \cdots + e_n)/n$$

Merging like terms, we finally get:

$$e = (aw^2 - 2bw + c)/n$$

where $a = {x_1}^2 + \cdots + {x_n}^2$, $b = x_1y_1 + \cdots + x_ny_n$, and $c = {y_1}^2 + \cdots + {y_n}^2$.

Because $a = {x_1}^2 + \cdots + {x_n}^2 > 0$, the graph of $e$ as a function of $w$ is an upward-opening parabola.
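This parabola shape can be checked numerically. Below is a minimal sketch; the points and the candidate values of $w$ are made up purely for illustration:

```python
# Toy data, made up for illustration; the points lie roughly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

def mean_squared_error(w, xs, ys):
    # e = (e_1 + ... + e_n) / n with e_i = (w*x_i - y_i)^2, as defined above
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# e(w) falls and then rises again as w sweeps past the optimum near w = 2
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(w, round(mean_squared_error(w, xs, ys), 3))
```

The printed errors decrease toward $w = 2$ and increase afterwards, which is exactly the upward-opening parabola derived above.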
After defining the error function, we can start to calculate the gradient.
Obviously, $e$ is smallest at the lowest point of the curve of $e$ versus $w$, and the $w$ at that point is optimal.
How do we quickly move from the yellow point on the right of the figure down to the lowest point? This is what gradient descent does: starting from its current position, it moves step by step in the direction in which $e$ decreases fastest (the negative gradient) until it reaches the lowest point.
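The step-by-step descent can be sketched as follows; the data, the starting point, and the learning rate are illustrative assumptions, not values from the text:

```python
# Toy data that lies exactly on y = 2x, so the optimal w is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def gradient(w, xs, ys):
    # de/dw = (2/n) * sum(x_i * (w*x_i - y_i)), from the error defined above
    n = len(xs)
    return 2.0 / n * sum(x * (w * x - y) for x, y in zip(xs, ys))

w = 0.0                   # start far from the optimum
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * gradient(w, xs, ys)  # step against the gradient
print(round(w, 4))  # converges to 2.0
```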
How big should each step be? That is controlled by the learning rate. With learning rate = 0.1 the point moves in smaller steps than with learning rate = 0.2, and a well-chosen learning rate drives the point to the minimum quickly.
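The effect of the learning rate can be sketched with a toy setup; the data and the three rates below are assumptions for illustration (for this particular data, rates much above 0.2 overshoot so badly that the iteration diverges):

```python
# Toy data on y = 2x, so the optimal w is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def gradient(w):
    # de/dw = (2/n) * sum(x_i * (w*x_i - y_i))
    n = len(xs)
    return 2.0 / n * sum(x * (w * x - y) for x, y in zip(xs, ys))

def run_gd(lr, steps=50):
    """Run plain gradient descent from w = 0 and return the final w."""
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

# Too small: slow progress. Well chosen: fast convergence. Too big: divergence.
for lr in [0.01, 0.1, 0.25]:
    print(lr, round(run_gd(lr), 4))
```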
2.2 Calculation steps of SGD
Going back to the mountain-descent problem: extensive experiments show that the second child's strategy, which is exactly SGD, reaches the bottom of the mountain fastest.
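The "second child" idea translated into code: each step estimates the gradient from one randomly chosen point instead of from all of them. A minimal sketch with made-up data (the dataset, seed, and learning rate are illustrative assumptions):

```python
import random

# Toy data on y = 2x, so the optimal w is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

random.seed(0)
w = 0.0
learning_rate = 0.05
for _ in range(500):
    i = random.randrange(len(xs))               # pick ONE random point
    grad_i = 2.0 * xs[i] * (w * xs[i] - ys[i])  # gradient from that point only
    w -= learning_rate * grad_i
print(round(w, 3))  # converges to 2.0
```

Each update is much cheaper than the full-batch version (one point instead of all $n$), which is precisely why the second child gets down the mountain fastest.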
3. Code implementation of SGD
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
X_scaler = StandardScaler()
y_scaler = StandardScaler()
X = [[50],[100],[150],[200],[250],[300],[50],[100],[150],[200],[250],[300],[50],[100],[150],[200],[250],[300],[50],[100],[150],[200],[250],[300],[50],[100],[150],[200],[250],[300],[50],[100],[150],[200],[250],[300],[50],[100],[150],[200],[250],[300],[50],[100],[150],[200],[250],[300]]
y = [[150],[200],[250],[280],[310],[330],[150],[200],[250],[280],[310],[330],[150],[200],[250],[280],[310],[330],[150],[200],[250],[280],[310],[330],[150],[200],[250],[280],[310],[330],[150],[200],[250],[280],[310],[330],[150],[200],[250],[280],[310],[330],[150],[200],[250],[280],[310],[330]]
X = X_scaler.fit_transform(X)  # standardize the data (zero mean, unit variance)
y = y_scaler.fit_transform(y)
X_test = [[40],[400]]  # used for the final test of the fit
X_test = X_scaler.transform(X_test)
model = SGDRegressor()
model.fit(X, y.ravel())
y_result = model.predict(X_test)
plt.title('single variable')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True)
plt.plot(X, y, 'k.')
plt.plot(X_test, y_result, 'g-')
plt.show()
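One possible follow-up step (a sketch, not part of the original listing): since the model above is trained on standardized targets, its predictions are in standardized units, and `inverse_transform` maps them back to the original scale. The tiny dataset below is an assumption for illustration:

```python
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

# Illustrative toy data (made up): price-like y grows with x.
X = np.array([[50.0], [100.0], [150.0], [200.0]])
y = np.array([[150.0], [200.0], [250.0], [280.0]])

X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(y)

model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
model.fit(X_scaler.transform(X), y_scaler.transform(y).ravel())

# Predict in standardized units, then map back to the original y scale.
pred_scaled = model.predict(X_scaler.transform([[125.0]]))
pred = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred)  # a value on the original scale, near the middle of the y range
```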
Result: the black dots are the (standardized) training points, and the green line is the fitted regression line.