Machine Learning Primer (3): Gradient Descent

Summary:

Strictly speaking, gradient descent is not a machine learning algorithm in its own right but an optimization algorithm: it searches for the parameters that minimize a loss function. Because it is simple yet effective, it is widely used inside machine learning algorithms.

Principle:

Suppose we have the following function defined on a two-dimensional plane:

[Figure: the curve of the function y = f(x)]

To find the lowest point of this function, we can define the update x = x - η·y′(x), where y′ is the derivative of y. Repeating this update moves the point downhill; as it descends, y′ gets smaller and smaller, and once y′ reaches 0 the value of x no longer changes, which means we have found the lowest point.

Here η is what we call the learning rate. It is not hard to see that the smaller η is, the more times we have to apply the formula and the slower we reach the result. But η cannot be too large either, otherwise the very first step may overshoot the lowest point and later steps can drift further and further away from it:
[Figure: with too large a learning rate, the iterates overshoot the minimum and diverge]
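To make the update rule concrete, here is a minimal one-dimensional sketch; the function y = x² (so y′ = 2x), the starting point, and the two η values are illustrative assumptions only:

def dy(x):
    # derivative of y = x ** 2
    return 2 * x

def gradient_descent_1d(x0, eta, n_iters=50):
    x = x0
    for _ in range(n_iters):
        x = x - eta * dy(x)  # the update x = x - eta * y'(x)
    return x

print(gradient_descent_1d(x0=10.0, eta=0.1))  # approaches the minimum at x = 0
print(gradient_descent_1d(x0=10.0, eta=1.1))  # learning rate too large: the point drifts away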

Gradient descent is not limited to two-dimensional data; it works on multi-dimensional data as well, and the extension to several dimensions is illustrated below:

[Figure: gradient descent on a multi-dimensional surface]

In the multi-dimensional case we can build on the multiple linear regression model covered earlier: by writing the linear regression equation in matrix form, the algorithm above can be implemented directly in code.
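Concretely, writing the training matrix as X_b (the feature matrix with a column of ones prepended for the intercept) and the parameters as the vector θ, the loss and its gradient take the vectorized form sketched below; the helper names are just for illustration, and these are the same formulas the class in the code section implements:

import numpy as np

def mse_loss(theta, x_b, y):
    # J(theta) = sum((y - X_b · theta)^2) / m
    return np.sum((y - x_b.dot(theta)) ** 2) / len(x_b)

def mse_gradient(theta, x_b, y):
    # ∇J(theta) = 2/m · X_b^T · (X_b · theta - y)
    return x_b.T.dot(x_b.dot(theta) - y) * 2 / len(x_b)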

Let me also introduce stochastic gradient descent here (the method described above is usually called batch gradient descent); its purpose is to speed up computation. Batch gradient descent has to use all of the data at every step, which costs a lot of computing time, while experiments show that even if each step is computed from a single randomly chosen sample, we can still end up near the lowest point (in effect trading a little precision for time).
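In code, the idea can be sketched like this: each update uses the gradient estimated from a single randomly chosen sample, and the learning rate decays over time so the iterates settle down near the minimum. This is only a minimal sketch (the sgd helper and its arguments are illustrative, not part of the original code), and the t0/(t + t1) schedule mirrors the one used in fit_random_gd below:

import numpy as np

def sgd(x_b, y, theta, n_passes=5, t0=5, t1=50):
    m = len(x_b)
    for cur_pass in range(n_passes):
        indexes = np.random.permutation(m)      # visit the samples in random order
        for i, idx in enumerate(indexes):
            # gradient estimated from a single sample
            gradient = x_b[idx] * (x_b[idx].dot(theta) - y[idx]) * 2
            eta = t0 / (cur_pass * m + i + t1)  # decaying learning rate
            theta = theta - eta * gradient
    return theta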

That is the general idea; now let us implement the algorithm above in Python:

Code:

1. Encapsulating the algorithm

# Gradient descent for multiple linear regression, wrapped in a class
import numpy as np
from sklearn.metrics import r2_score


class LinearRegression:
    def __init__(self):
        self.coef_ = None
        self.interception_ = None
        self._theta = None

    def fit_normal(self, x_train, y_train):
        """Fit with the normal equation (closed-form solution)."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        self._theta = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def fit_gd(self, x_train, y_train, eta=0.01, n_iters=1e4):
        """Fit with batch gradient descent."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"

        def lose(theta, x_b, y):
            # MSE loss; return +inf if the computation overflows
            try:
                return np.sum((y - x_b.dot(theta)) ** 2) / len(x_b)
            except:
                return float("inf")

        def derivative(theta, x_b, y):
            # gradient of the MSE loss in matrix form
            return x_b.T.dot(x_b.dot(theta) - y) * 2 / len(x_b)

        def gradient_descent(x_b, y, init_theta, eta, epsilon=1e-8):
            theta = init_theta
            i_iters = 0
            while i_iters < n_iters:
                gradient = derivative(theta, x_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                # stop when the loss no longer decreases noticeably
                if abs(lose(theta, x_b, y) - lose(last_theta, x_b, y)) < epsilon:
                    break
                i_iters += 1
            return theta

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        initial_theta = np.zeros(x_b.shape[1])
        self._theta = gradient_descent(x_b, y_train, initial_theta, eta, epsilon=1e-8)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def fit_random_gd(self, x_train, y_train, n_iters=5, t0=5, t1=50):
        """Fit with stochastic gradient descent (one random sample per update)."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must be equal to the size of y_train"
        assert n_iters >= 1

        def derivative(theta, x_b_i, y_i):
            # gradient estimated from a single sample
            return x_b_i * (x_b_i.dot(theta) - y_i) * 2

        def random_gradient_descent(x_b, y, initial_theta):
            def learning_rate(t):
                # decaying learning rate: eta = t0 / (t + t1)
                return t0 / (t + t1)

            theta = initial_theta
            m = len(x_b)
            for cur_iter in range(n_iters):
                # shuffle the samples once per pass over the data
                indexes = np.random.permutation(m)
                x_b_new = x_b[indexes]
                y_new = y[indexes]
                for i in range(m):
                    gradient = derivative(theta, x_b_new[i], y_new[i])
                    theta = theta - learning_rate(cur_iter * m + i) * gradient
            return theta

        x_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        initial_theta = np.zeros(x_b.shape[1])
        self._theta = random_gradient_descent(x_b, y_train, initial_theta)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert x_predict.shape[1] == len(self.coef_), \
            "the number of features of x_predict must match the training data"

        x_b = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return x_b.dot(self._theta)

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

2. The main program:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from LR_GD_class import LinearRegression
from sklearn.preprocessing import StandardScaler

boston = datasets.load_boston()
x = boston.data
y = boston.target
x = x[y < 50.0]
y = y[y < 50.0]

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)

standardScaler = StandardScaler()
standardScaler.fit(x_train)
x_train_stand = standardScaler.transform(x_train)
x_test_stand = standardScaler.transform(x_test)

# Batch gradient descent
lin_reg = LinearRegression()
lin_reg.fit_gd(x_train_stand, y_train)
print(lin_reg.score(x_test_stand, y_test))

# Stochastic gradient descent
lin_reg1 = LinearRegression()
lin_reg1.fit_random_gd(x_train_stand, y_train, n_iters=50)
print(lin_reg1.score(x_test_stand, y_test))


I hope this is helpful to readers. If you like it, you can follow my WeChat public account, where I post my study notes so that we can learn together!



Source: blog.csdn.net/Rosen_er/article/details/104378129