Gradient descent is an optimization algorithm used to minimize a cost function in various machine learning algorithms. It is mainly used to update the parameters of the learned model.
Types of Gradient Descent:
- Batch Gradient Descent:
This variant processes all training examples in each iteration. It becomes computationally expensive when the number of training examples is large, so for large datasets stochastic or mini-batch gradient descent is usually preferred.
- Stochastic Gradient Descent:
This variant processes one training example per iteration, so the parameters are updated after every single example. Each update is therefore much cheaper than a batch gradient descent update. The drawback is that, when the number of training examples is large, the number of iterations also becomes very large, which may introduce additional overhead.
- Mini-batch Gradient Descent:
This variant is often faster in practice than both batch and stochastic gradient descent. Here b examples, where b < m, are processed in each iteration. Even if the number of training examples is large, they are processed b at a time, so the method handles large training sets while needing fewer iterations than stochastic gradient descent.
Variables used:
Let m be the number of training examples.
Let n be the number of features.
Note:
If b == m, then mini-batch gradient descent behaves like batch gradient descent.
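To make the difference between the three variants concrete, here is a minimal sketch that counts how many parameter updates one pass over the training set involves. The helper name updates_per_epoch is hypothetical (it does not appear elsewhere in this article), and the sketch assumes that a final, smaller batch is still processed when b does not divide m.

```python
import math

def updates_per_epoch(m, b):
    """Number of parameter updates in one pass over m examples with batch size b."""
    if not 1 <= b <= m:
        raise ValueError("batch size b must satisfy 1 <= b <= m")
    return math.ceil(m / b)

m = 100
batch_updates = updates_per_epoch(m, m)         # batch gradient descent: b == m
sgd_updates = updates_per_epoch(m, 1)           # stochastic gradient descent: b == 1
mini_batch_updates = updates_per_epoch(m, 10)   # mini-batch gradient descent: b == 10
print(batch_updates, sgd_updates, mini_batch_updates)  # 1 100 10
```

Fewer updates per pass does not mean less work: the batch variant still touches all m examples for its single update.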
1. Batch gradient descent algorithm:
Let $h_\theta(x)$ be the hypothesis for linear regression. The cost function is then given by:
$$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
where the sum runs over all training examples, from i = 1 to m.
Repeat {
$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
(for every j = 0, …, n)
}
where α is the learning rate and $x_j^{(i)}$ denotes the jth feature of the ith training example. So if m is very large (e.g. 5 million training examples), every single update requires a pass over all m examples, and it can take hours or even days to converge to the global minimum. This is why batch gradient descent is not recommended for large datasets: it slows down learning.
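The update rule above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name batch_gradient_descent, the default learning rate and iteration count, and the noise-free toy data are all choices made for this sketch, not part of the original algorithm description.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, num_iters=1000):
    """Minimize (1/2m) * sum((X @ theta - y)^2), using all m examples per update.

    X has shape (m, n + 1) with a leading column of ones; y has shape (m,).
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
        grad = (X.T @ errors) / m       # (1/m) * sum_i errors_i * x_j^(i), for every j
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta

# Toy data (an assumption for illustration): y = 1 + 2x with no noise
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack((np.ones_like(x), x))
y = 1.0 + 2.0 * x
theta = batch_gradient_descent(X, y)
print(theta)  # expected to be close to [1, 2] for this toy data
```

Note that every iteration multiplies by the full matrix X, which is exactly the cost that grows with m.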
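2. Stochastic gradient descent algorithm (SGD):
1) Randomly shuffle the dataset so that the parameters are trained evenly across all types of data.
2) As above, but process only one example per iteration.
Hence,
Let $(x^{(i)}, y^{(i)})$ be a training example. The cost for a single example contains no summation:
$$\mathrm{Cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
$$J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(\theta, (x^{(i)}, y^{(i)}))$$
Repeat {
For i = 1 to m {
$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
(for every j = 0, …, n)
}
}
where α is the learning rate.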
In SGD, we find the gradient of the cost function for a single example at each iteration, not the sum of the gradients of the cost function for all examples.
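The per-example update described above can be sketched in NumPy as follows. The function name sgd_linear_regression, its default arguments, the fixed random seed, and the noise-free toy data are assumptions made for this illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.1, num_epochs=200, seed=0):
    """Linear regression trained with one example per parameter update."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(m):               # visit the examples in shuffled order
            error = X[i] @ theta - y[i]            # h_theta(x^(i)) - y^(i), a scalar
            theta = theta - alpha * error * X[i]   # no sum: gradient of one example only
    return theta

# Toy data (an assumption for illustration): y = 1 + 2x with no noise
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack((np.ones_like(x), x))
y = 1.0 + 2.0 * x
theta = sgd_linear_regression(X, y)
print(theta)  # expected to be close to [1, 2] for this toy data
```

Because the toy data is noise-free, a constant learning rate is enough here; with noisy data the iterates would keep fluctuating around the minimum.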
In SGD, since each iteration uses only one randomly selected example from the dataset, the path the algorithm takes towards the minimum is usually noisier than that of batch gradient descent. This is acceptable: the exact path matters little as long as the algorithm gets close to the minimum, and the training time is significantly reduced.
[Figure: path taken by batch gradient descent]
[Figure: path taken by stochastic gradient descent]
One thing to note is that, because SGD is noisier than batch gradient descent, it usually takes more iterations to reach the minimum. Even so, each iteration is so much cheaper that SGD is usually computationally cheaper overall. For this reason, SGD is preferred over batch gradient descent for optimizing learning algorithms in most cases where the dataset is large.
Pseudocode for SGD in Python:
def SGD(f, theta0, alpha, num_iters):
    """
    Arguments:
    f -- the function to optimize; it takes a single argument
         and yields two outputs, a cost and the gradient
         with respect to the argument
    theta0 -- the initial point to start SGD from
    alpha -- the learning rate (step size)
    num_iters -- total iterations to run SGD for
    Return:
    theta -- the parameter value after SGD finishes
    """
    start_iter = 0
    theta = theta0
    for it in range(start_iter + 1, num_iters + 1):
        _, grad = f(theta)
        # element-wise update: there is NO dot product here
        theta = theta - (alpha * grad)
    return theta
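As a usage sketch for a function with this interface, the example below minimizes a simple quadratic whose gradient is deliberately corrupted with noise, to mimic the noisy single-example gradients of SGD. The objective noisy_quadratic, the target point, the noise level, and the step settings are all hypothetical choices for this illustration; the loop is a compact restatement of the pseudocode so that the block runs on its own.

```python
import numpy as np

def sgd(f, theta0, alpha, num_iters):
    """Same interface as the SGD pseudocode: f returns (cost, gradient)."""
    theta = theta0
    for _ in range(num_iters):
        _cost, grad = f(theta)
        theta = theta - alpha * grad
    return theta

rng = np.random.default_rng(0)
target = np.array([3.0, -2.0])   # minimizer of the hypothetical objective

def noisy_quadratic(theta):
    """Cost 0.5 * ||theta - target||^2, returned with a noisy gradient."""
    diff = theta - target
    cost = 0.5 * float(diff @ diff)
    grad = diff + rng.normal(scale=0.1, size=theta.shape)
    return cost, grad

theta = sgd(noisy_quadratic, theta0=np.zeros(2), alpha=0.1, num_iters=500)
print(theta)  # ends near target, but not exactly on it because of the noise
```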
3. Mini-batch gradient descent algorithm:
Suppose b is the number of examples in a batch, where b < m.
Suppose b = 10, m = 100.
Note:
The batch size can be adjusted. It is usually chosen as a power of 2, because some hardware (e.g. GPUs) achieves better runtimes with such batch sizes.
Repeat {
For i = 1, 11, 21, …, 91 {
$$\theta_j := \theta_j - \frac{\alpha}{b} \sum_{k=i}^{i+9} \left( h_\theta(x^{(k)}) - y^{(k)} \right) x_j^{(k)}$$
(for every j = 0, …, n)
}
}
where α is the learning rate and the sum runs over the b = 10 examples of the current batch, indexed by k from i to i+9.
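The loop above, with b = 10 and m = 100, can be sketched in NumPy as follows. The function name mini_batch_gradient_descent, the default arguments, and the noise-free toy data are assumptions for illustration. Indices are 0-based here, so the batches start at 0, 10, …, 90.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, alpha=0.5, b=10, num_epochs=200):
    """Linear regression trained with b examples per parameter update."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for start in range(0, m, b):                     # start = 0, 10, ..., 90
            X_batch = X[start:start + b]
            y_batch = y[start:start + b]
            errors = X_batch @ theta - y_batch           # h_theta(x^(k)) - y^(k)
            grad = (X_batch.T @ errors) / len(y_batch)   # average over the batch
            theta = theta - alpha * grad
    return theta

# Toy data (an assumption for illustration): m = 100, y = 1 + 2x with no noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack((np.ones_like(x), x))
y = 1.0 + 2.0 * x
theta = mini_batch_gradient_descent(X, y)
print(theta)  # expected to be close to [1, 2] for this toy data
```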
Convergence trends of different variants of gradient descent:
In the case of batch gradient descent, the algorithm follows a smooth, fairly direct path towards the minimum. If the cost function is convex, it converges to the global minimum (given a suitable learning rate); if the cost function is not convex, it may converge only to a local minimum. Here the learning rate is usually kept constant.
In the case of stochastic gradient descent and mini-batch gradient descent with a constant learning rate, the algorithm does not settle at the minimum but keeps fluctuating around it. Therefore, in order for it to converge, the learning rate has to be decreased slowly over time. The path of stochastic gradient descent is the noisier of the two, because each iteration processes only one training example.
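The effect of slowly decreasing the learning rate can be illustrated with a small experiment. The schedule used here, alpha0 / (1 + decay * t), is one common choice and is an assumption of this sketch, as are the helper names, the one-dimensional objective 0.5 * theta^2, and the noise model.

```python
import numpy as np

def decayed_learning_rate(alpha0, decay, t):
    """Inverse-time decay: starts at alpha0 and shrinks as the iteration t grows."""
    return alpha0 / (1.0 + decay * t)

def mean_distance_from_minimum(schedule, num_iters=5000, tail=1000, seed=0):
    """Run noisy gradient descent on 0.5 * theta^2 and average |theta| over the last steps."""
    rng = np.random.default_rng(seed)
    theta = 5.0
    distances = []
    for t in range(num_iters):
        noisy_grad = theta + rng.normal()   # true gradient (theta) plus unit-variance noise
        theta = theta - schedule(t) * noisy_grad
        distances.append(abs(theta))
    return float(np.mean(distances[-tail:]))

constant = mean_distance_from_minimum(lambda t: 0.5)
decayed = mean_distance_from_minimum(lambda t: decayed_learning_rate(0.5, 0.01, t))
print(constant, decayed)  # the decayed schedule should end much closer to the minimum at 0
```

With the constant rate the iterates keep bouncing around the minimum, while the decaying rate lets them settle closer to it.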
Python implementation:
Step 1:
The first step is to import dependencies, generate data for linear regression, and visualize the generated data. We have generated 8000 data examples, each with 2 attributes: the first is used as the input feature and the second as the target. A column of ones is prepended for the bias term. These data samples are further divided into a training set (X_train, y_train) and a test set (X_test, y_test), with 7200 and 800 samples respectively.
# importing dependencies
import numpy as np
import matplotlib.pyplot as plt
# creating data
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)
# visualising data
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.show()
# train-test-split
data = np.hstack((np.ones((data.shape[0], 1)), data))
split_factor = 0.90
split = int(split_factor * data.shape[0])
X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))
print("Number of examples in training set = %d" % (X_train.shape[0]))
print("Number of examples in testing set = %d" % (X_test.shape[0]))
Output:
Number of examples in training set = 7200
Number of examples in testing set = 800
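Because the data is drawn at random, the numbers reported later will differ from run to run. A minimal sketch of a reproducible variant of this step is shown below. It assumes NumPy's Generator API (np.random.default_rng) with an arbitrary seed of 42, which is a choice made for this illustration, and it skips the plot.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed (arbitrary): same data on every run
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = rng.multivariate_normal(mean, cov, 8000)

# prepend a column of ones for the bias term, then split 90% / 10%
data = np.hstack((np.ones((data.shape[0], 1)), data))
split = int(0.90 * data.shape[0])
X_train, y_train = data[:split, :-1], data[:split, -1].reshape((-1, 1))
X_test, y_test = data[split:, :-1], data[split:, -1].reshape((-1, 1))

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
# (7200, 2) (7200, 1) (800, 2) (800, 1)
```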
Step 2:
Next, we write the code that implements linear regression using mini-batch gradient descent. gradientDescent() is the main driver function, and the other functions are helpers: hypothesis() makes predictions, gradient() computes gradients, cost() computes the error, and create_mini_batches() creates the mini-batches. The driver function initializes the parameters, computes the optimal set of parameters for the model, and returns these parameters along with a list containing the error recorded after each parameter update.
# linear regression using "mini-batch" gradient descent
# function to compute hypothesis / predictions
def hypothesis(X, theta):
    return np.dot(X, theta)
# function to compute gradient of error function w.r.t. theta
# (summed over the examples in X, not averaged)
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    grad = np.dot(X.transpose(), (h - y))
    return grad
# function to compute the error for current values of theta
def cost(X, y, theta):
    h = hypothesis(X, theta)
    J = np.dot((h - y).transpose(), (h - y))
    J /= 2
    return J[0]
# function to create a list containing mini-batches
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    data = np.hstack((X, y))
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size
    # full-sized mini-batches
    for i in range(n_minibatches):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))
    # leftover examples, if batch_size does not divide the dataset evenly
    if data.shape[0] % batch_size != 0:
        mini_batch = data[n_minibatches * batch_size:data.shape[0]]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))
    return mini_batches
# function to perform mini-batch gradient descent
def gradientDescent(X, y, learning_rate=0.001, batch_size=32):
    theta = np.zeros((X.shape[1], 1))
    error_list = []
    max_iters = 3  # number of passes over the training set
    for itr in range(max_iters):
        mini_batches = create_mini_batches(X, y, batch_size)
        for mini_batch in mini_batches:
            X_mini, y_mini = mini_batch
            theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
            error_list.append(cost(X_mini, y_mini, theta))
    return theta, error_list
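When splitting data into mini-batches, it is worth checking that every example lands in exactly one batch and that no batch is empty. The sketch below shows one way to compute the batch boundaries. The helper name mini_batch_bounds is hypothetical and is not one of the functions used in this article.

```python
def mini_batch_bounds(num_examples, batch_size):
    """Return (start, stop) index pairs that cover range(num_examples) exactly once."""
    return [(start, min(start + batch_size, num_examples))
            for start in range(0, num_examples, batch_size)]

# 100 examples with batch size 32: three full batches and one leftover batch
sizes = [stop - start for start, stop in mini_batch_bounds(100, 32)]
print(sizes)  # [32, 32, 32, 4]

# 7200 examples with batch size 32: the split is exact, so there is no leftover batch
print(len(mini_batch_bounds(7200, 32)))  # 225
```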
Call the gradientDescent() function to calculate the model parameters (theta) and visualize the change in the error function.
theta, error_list = gradientDescent(X_train, y_train)
print("Bias = ", theta[0])
print("Coefficients = ", theta[1:])
# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()
Output:
Bias = [0.81830471]
Coefficients = [[1.04586595]]
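One way to sanity-check these parameters is to compare them with the closed-form least-squares fit on data drawn from the same distribution. The sketch below uses np.linalg.lstsq for that purpose; the seed and the variable names are assumptions of this illustration. For this distribution the ideal line has slope 0.95 / 1.0 = 0.95 and intercept 6.0 - 0.95 * 5.0 = 1.25, which suggests that three passes of mini-batch gradient descent end up close to, but not exactly at, the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (arbitrary) so the check is repeatable
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = rng.multivariate_normal(mean, cov, 8000)

X = np.column_stack((np.ones(data.shape[0]), data[:, 0]))  # bias column + feature
y = data[:, 1]                                             # target

# closed-form least-squares solution, used here only as a reference point
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Bias =", theta_ls[0])
print("Coefficient =", theta_ls[1])
```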
Step 3:
Finally, we make predictions on the test set and calculate the mean absolute error in the predictions.
# predicting output for X_test
y_pred = hypothesis(X_test, theta)
plt.scatter(X_test[:, 1], y_test[:, 0], marker='.')
plt.plot(X_test[:, 1], y_pred, color='orange')
plt.show()
# calculating error in predictions
error = np.sum(np.abs(y_test - y_pred) / y_test.shape[0])
print("Mean absolute error = ", error)
Output:
Mean absolute error = 0.4366644295854125
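The mean absolute error can also be wrapped in a small helper. The sketch below is one possible version; the name mean_absolute_error and the flattening step are choices made for this illustration. Flattening both inputs guards against accidental broadcasting when one array has shape (m, 1) and the other has shape (m,).

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all examples."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    return float(np.mean(np.abs(y_true - y_pred)))

# absolute errors are 0.5, 0.0 and 1.0, so the mean is 0.5
mae = mean_absolute_error([[1.0], [2.0], [4.0]], [[1.5], [2.0], [3.0]])
print(mae)  # 0.5
```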
The orange line represents the final hypothesis function: y = theta[0] + theta[1]*X_test[:, 1]