Machine Learning - Gradient Descent

Gradient descent is an optimization algorithm used to minimize a cost function in various machine learning algorithms. It is mainly used to update the parameters of the model being learned.

Types of Gradient Descent:

  1. Batch Gradient Descent:
    This variant processes all training examples in every iteration. It is computationally expensive when the number of training examples is large, so batch gradient descent is not preferred for large datasets; stochastic gradient descent or mini-batch gradient descent is used instead.
  2. Stochastic Gradient Descent:
    This variant processes one training example per iteration, so the parameters are updated after every single example. Each update is therefore much faster than in batch gradient descent. However, when the number of training examples is large, processing only one example at a time introduces overhead, because the number of iterations becomes very large.
  3. Mini-batch gradient descent:
    This variant is often faster than both batch gradient descent and stochastic gradient descent. Here b examples (where b < m) are processed in each iteration, so even when the number of training examples is large, they can be handled in batches of b at a time. It therefore scales to large training sets and needs fewer iterations than stochastic gradient descent (see the sketch after the note below).

Variables used:
Let m be the number of training examples.
Let n be the number of features.

Note:
If b == m, then mini-batch gradient descent behaves like batch gradient descent.
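To make the distinction concrete, here is a minimal NumPy sketch contrasting one parameter update under each of the three variants. The data arrays, learning rate, and batch size are hypothetical placeholders, not values from the text:

import numpy as np

# hypothetical data: m training examples, n features
m, n = 100, 3
X = np.random.randn(m, n)
y = np.random.randn(m)
theta = np.zeros(n)
lr = 0.01

def grad(X_part, y_part):
    # gradient of the squared-error cost on a subset of the data
    return X_part.T @ (X_part @ theta - y_part) / len(y_part)

# batch gradient descent: one update uses all m examples
theta = theta - lr * grad(X, y)

# stochastic gradient descent: one update uses a single example
i = np.random.randint(m)
theta = theta - lr * grad(X[i:i+1], y[i:i+1])

# mini-batch gradient descent: one update uses b examples, b < m
b = 10
idx = np.random.choice(m, b, replace=False)
theta = theta - lr * grad(X[idx], y[idx])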

1. Batch gradient descent algorithm:

The hypothesis hθ(x) is the standard linear regression hypothesis. The cost function is then given by:
Let Σ denote the sum over all training examples from i = 1 to m.

Jtrain(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²

Repeat {

    θj = θj − (learning_rate/m) * Σ (hθ(x(i)) − y(i)) xj(i)
        For every j = 0 … n
}

where xj(i) denotes the jth feature of the ith training example. If m is very large (e.g. 5 million training examples), it can take hours or even days to converge to the global minimum. This is why batch gradient descent is not recommended for large datasets: it slows down learning.
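As a runnable version of the update rule above, here is a minimal sketch in NumPy, using the same assumed setup as the earlier snippet. The number of iterations and learning rate are illustrative choices:

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, num_iters=1000):
    # every update uses the gradient averaged over all m examples
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m
        theta = theta - lr * grad
    return theta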

2. Stochastic gradient descent (SGD) algorithm:

1) Randomly shuffle the dataset so that the parameters are trained evenly across all kinds of data.
2) Process one training example per iteration, as described above.

Hence,
Let (x(i), y(i)) be a single training example. Its cost is:

Cost(θ, (x(i), y(i))) = (1/2) (hθ(x(i)) − y(i))²

Jtrain(θ) = (1/m) Σ Cost(θ, (x(i), y(i)))

Repeat {

    For i = 1 to m {

        θj = θj − (learning_rate) * (hθ(x(i)) − y(i)) xj(i)
            For every j = 0 … n

    }
}


In SGD, we find the gradient of the cost function for a single example at each iteration, not the sum of the gradients of the cost function for all examples.

In SGD, since each iteration randomly selects only one sample from the dataset, the path the algorithm takes to reach the minimum is usually noisier than that of typical (batch) gradient descent. This is usually acceptable: the exact path is irrelevant as long as we reach the minimum, and the training time is significantly reduced.

[Figure gd_path: the path taken by batch gradient descent]

[Figure sgd_path: the path taken by stochastic gradient descent]

One thing to note: because of the stochastic nature of its descent, SGD is usually noisier than typical gradient descent and usually takes more iterations to reach the minimum. Even so, each iteration is computationally much cheaper, so SGD is still much cheaper overall than typical gradient descent. Therefore, SGD outperforms batch gradient descent for optimizing learning algorithms in most cases.

Pseudocode for SGD in Python:

def SGD(f, theta0, alpha, num_iters):
	"""
	Arguments:
	f -- the function to optimize; it takes a single argument
			and yields two outputs, a cost and the gradient
			with respect to the argument
	theta0 -- the initial point to start SGD from
	alpha -- the learning rate
	num_iters -- total iterations to run SGD for
	Return:
	theta -- the parameter value after SGD finishes
	"""
	theta = theta0
	for _ in range(1, num_iters + 1):
		_, grad = f(theta)

		# element-wise update, no dot product: step against the gradient
		theta = theta - (alpha * grad)
	return theta
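As a usage sketch, any callable returning a (cost, gradient) pair can be passed in as f. The quadratic objective below is a hypothetical example, not part of the original text:

import numpy as np

def f(theta):
	# toy objective: cost = sum of squares, gradient = 2 * theta
	return np.sum(theta ** 2), 2 * theta

theta = SGD(f, theta0=np.array([5.0, -3.0]), alpha=0.1, num_iters=100)
print(theta)  # both components approach 0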

3. Mini-batch gradient descent algorithm:

Suppose b is the number of examples in a batch, where b < m.
For example, b = 10 and m = 100.

Note:
The batch size can be tuned. It is usually kept as a power of 2, because some hardware (e.g. GPUs) achieves better runtimes with such batch sizes.

Repeat {

    For i = 1, 11, 21, ..., 91

        Let Σ denote the sum over k from i to i+9.

        θj = θj − (learning_rate/b) * Σ (hθ(x(k)) − y(k)) xj(k)
            For every j = 0 … n

}

Convergence trends of different variants of gradient descent:

In the case of batch gradient descent, the algorithm follows a relatively straight path towards the minimum. If the cost function is convex, it converges to the global minimum; if it is not convex, it converges to a local minimum. Here the learning rate is usually kept constant.

In the case of stochastic gradient descent and mini-batch gradient descent, the algorithm does not settle at the minimum but keeps fluctuating around it. Therefore, to make it converge, we have to slowly decrease the learning rate. The convergence of stochastic gradient descent is noisier, however, because each iteration processes only one training sample.
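One common way to decrease the learning rate slowly is a decay schedule. A minimal sketch follows; the 1/(1 + decay·t) form and the constants are illustrative assumptions, not prescribed by the text:

def decayed_learning_rate(initial_lr, iteration, decay=0.01):
    # shrink the step size over time so SGD can settle near the
    # minimum instead of fluctuating around it indefinitely
    return initial_lr / (1.0 + decay * iteration)

# usage inside an SGD loop (itr is the current iteration number):
# theta = theta - decayed_learning_rate(0.1, itr) * grad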

Python implementation:
Step 1:
The first step is to import dependencies, generate data for linear regression, and visualize the generated data. We have generated 8000 data examples, each with 2 attributes/features. These data samples are further divided into training set (X_train, y_train) and test set (X_test, y_test), with 7200 and 800 samples respectively.

# importing dependencies
import numpy as np
import matplotlib.pyplot as plt

# creating data
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)

# visualising data
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.show()

# train-test-split
data = np.hstack((np.ones((data.shape[0], 1)), data))

split_factor = 0.90
split = int(split_factor * data.shape[0])

X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))

print("Number of examples in training set = %d" % (X_train.shape[0]))
print("Number of examples in testing set = %d" % (X_test.shape[0]))

Output:

[Figure: scatter plot of the generated data]

Number of examples in training set = 7200
Number of examples in testing set = 800

Step 2:
Next, we write the code that implements linear regression using mini-batch gradient descent. gradientDescent() is the main driver function; the other functions are helpers for making predictions (hypothesis()), computing gradients (gradient()), computing errors (cost()), and creating mini-batches (create_mini_batches()). The driver function initializes the parameters, computes the optimal set of parameters for the model, and returns these parameters along with a list recording the error after each parameter update.

Example:

# linear regression using "mini-batch" gradient descent
# function to compute hypothesis / predictions


def hypothesis(X, theta):
	return np.dot(X, theta)

# function to compute gradient of error function w.r.t. theta


def gradient(X, y, theta):
	h = hypothesis(X, theta)
	grad = np.dot(X.transpose(), (h - y))
	return grad

# function to compute the error for current values of theta


def cost(X, y, theta):
	h = hypothesis(X, theta)
	J = np.dot((h - y).transpose(), (h - y))
	J /= 2
	return J[0]

# function to create a list containing mini-batches


def create_mini_batches(X, y, batch_size):
	mini_batches = []
	data = np.hstack((X, y))
	np.random.shuffle(data)
	n_minibatches = data.shape[0] // batch_size

	for i in range(n_minibatches):
		mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
		X_mini = mini_batch[:, :-1]
		Y_mini = mini_batch[:, -1].reshape((-1, 1))
		mini_batches.append((X_mini, Y_mini))
	# collect the leftover examples when m is not divisible by batch_size
	if data.shape[0] % batch_size != 0:
		mini_batch = data[n_minibatches * batch_size:, :]
		X_mini = mini_batch[:, :-1]
		Y_mini = mini_batch[:, -1].reshape((-1, 1))
		mini_batches.append((X_mini, Y_mini))
	return mini_batches

# function to perform mini-batch gradient descent


def gradientDescent(X, y, learning_rate=0.001, batch_size=32):
	theta = np.zeros((X.shape[1], 1))
	error_list = []
	max_iters = 3
	for itr in range(max_iters):
		mini_batches = create_mini_batches(X, y, batch_size)
		for mini_batch in mini_batches:
			X_mini, y_mini = mini_batch
			theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
			error_list.append(cost(X_mini, y_mini, theta))

	return theta, error_list
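Before running the driver, the mini-batch helper can be sanity-checked on the training data from Step 1. The expected values in the comments assume the 7200 × 2 X_train built earlier:

mini_batches = create_mini_batches(X_train, y_train, batch_size=32)
print(len(mini_batches))           # 225 full batches for 7200 examples
X_mini, y_mini = mini_batches[0]
print(X_mini.shape, y_mini.shape)  # (32, 2) (32, 1)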

Call the gradientDescent() function to calculate the model parameters (theta) and visualize the change in the error function.

theta, error_list = gradientDescent(X_train, y_train)
print("Bias = ", theta[0])
print("Coefficients = ", theta[1:])

# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()

Output:
Bias = [0.81830471]
Coefficients = [[1.04586595]]

[Figure: cost versus number of iterations]
Step 3:
Finally, we make predictions on the test set and calculate the mean absolute error in the predictions.

# predicting output for X_test
y_pred = hypothesis(X_test, theta)
plt.scatter(X_test[:, 1], y_test[:, 0], marker='.')
plt.plot(X_test[:, 1], y_pred, color='orange')
plt.show()

# calculating error in predictions
error = np.mean(np.abs(y_test - y_pred))
print("Mean absolute error = ", error)

Output:
[Figure: test-set predictions with the fitted regression line]
Mean Absolute Error = 0.4366644295854125

The orange line represents the final hypothesis function: y_pred = theta[0] + theta[1] * X_test[:, 1] (the bias term plus the coefficient of the single feature).

the end.


Origin: blog.csdn.net/weixin_43367756/article/details/126004573