Deriving the Gradient Descent Algorithm for Mean Squared Error
Gradient descent series blog:
- 1. Gradient descent algorithm basics
- 2. The mathematical intuition behind the gradient descent algorithm (you are here!)
- 3. Batch gradient descent algorithm code implementation
- 4. Mini-batch gradient descent algorithm code implementation
Introduction:
Welcome! Today we will build a strong mathematical intuition for how the gradient descent algorithm finds the best values for its parameters. This intuition can help you spot errors in machine learning output and deepen your understanding of how gradient descent makes machine learning so powerful. Over the next few sections, we will derive the gradient descent update equations for the mean squared error function. We will use the results of this blog when we code the gradient descent algorithm. Let's dig in!
Deriving the gradient descent algorithm for the mean squared error:
1. Step 1:
The input data is shown in the matrix below. Here, we can observe that there are **m** training examples and **n** features.
Dimensions: X = (m, n)
2. Step 2:
The expected output matrix is as follows. Our expected output matrix has size **m × 1**, since we have **m** training examples.
Dimensions: Y = (m, 1)
3. Step 3:
We will add a bias element to the parameters to be trained.
Dimensions: α = (1, 1)
4. Step 4:
Our parameters also include the weight matrix. The weight matrix has **n** elements, where **n** is the number of features in our training dataset.
Dimensions: β = (1, n)
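The four quantities defined in Steps 1–4 can be sketched in NumPy as follows. The data values here are hypothetical; only the shapes matter at this point:

```python
import numpy as np

# Hypothetical sizes: m training examples, n features
m, n = 4, 2

rng = np.random.default_rng(0)
X = rng.random((m, n))        # Step 1: input matrix, shape (m, n)
Y = rng.random((m, 1))        # Step 2: expected outputs, shape (m, 1)
alpha = np.zeros((1, 1))      # Step 3: bias element, shape (1, 1)
beta = rng.random((1, n))     # Step 4: weight matrix, shape (1, n)

print(X.shape, Y.shape, alpha.shape, beta.shape)
```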
5. Step 5:
The predicted value for each training example is given by,
Note that we are transposing the weight matrix (β) to make dimensions compatible with the matrix multiplication rules.
Dimensions: predicted_value = (1, 1) + (m, n) * (1, n)
— Transpose the weight matrix (β) —
Dimensions: predicted_value = (1, 1) + (m, n) * (n, 1) = (m, 1)
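This prediction step can be sketched in NumPy. The data values are hypothetical; the point is that transposing β makes the shapes line up and the bias broadcasts across all rows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 2))        # hypothetical inputs, shape (m, n)
alpha = np.zeros((1, 1))      # bias, shape (1, 1)
beta = rng.random((1, 2))     # weights, shape (1, n)

# (m, n) @ (n, 1) -> (m, 1); the (1, 1) bias broadcasts over every row
predicted_value = alpha + X @ beta.T
print(predicted_value.shape)  # (4, 1)
```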
6. Step 6:
The mean square error is defined as follows.
Dimensions: cost = scalar
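The cost equation itself is not reproduced here, but assuming the common form J = (1/m) Σ (ŷ − y)² (the exact scaling constant is an assumption), it can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
predicted_value = rng.random((4, 1))  # hypothetical predictions, (m, 1)
Y = rng.random((4, 1))                # hypothetical targets, (m, 1)

# Mean squared error: averaging collapses the (m, 1) error vector to a scalar
cost = np.mean((predicted_value - Y) ** 2)
print(cost)
```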
7. Step 7:
In this case, we will use the following gradient descent rule to determine the optimal parameters.
Dimensions: α = (1, 1) & β = (1, n)
8. Step 8:
Now, let's find the partial derivative of the cost function with respect to the bias element ( α ).
Dimensions: (1, 1)
9. Step 9:
Now, we simplify the above equation to make the partial derivatives easier to compute.
Dimensions: u = (m, 1)
10. Step 10:
Based on Step 9, we can write the cost function as,
Dimensions: cost = scalar
11. Step 11:
Next, we will compute the partial derivative of the cost function with respect to the bias element ( α ) using the chain rule.
Dimensions: (m, 1)
12. Step 12:
Next, we compute the first part of the partial derivative from Step 11.
Dimensions: (m, 1)
13. Step 13:
Next, we compute the second part of the partial derivative from Step 11.
Dimensions: scalar
14. Step 14:
Next, we multiply the results of Steps 12 and 13 to get the final result.
Dimensions: (m, 1)
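Putting Steps 12–14 together in code, and assuming the cost J = (1/m) Σ (ŷ − y)² so that ∂J/∂α = (2/m) Σ (ŷ − y) (the constant follows from that assumed cost form):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
predicted_value = rng.random((m, 1))  # hypothetical predictions, (m, 1)
Y = rng.random((m, 1))                # hypothetical targets, (m, 1)

error = predicted_value - Y           # (m, 1)
# Summing over the m examples collapses (m, 1) to a single (1, 1) value
d_alpha = (2 / m) * np.sum(error, axis=0, keepdims=True)
print(d_alpha.shape)  # (1, 1)
```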
15. Step 15:
Next, we will compute the partial derivatives of the cost function with respect to the weights ( β ) using the chain rule.
Dimensions: (1,n)
16. Step 16:
Next, we compute the second part of the partial derivative from Step 15.
Dimensions: (m, n)
17. Step 17:
Next, we multiply the results of Steps 12 and 16 to get the final partial derivative.
Now, since we want **n** weight values, we remove the summation from the above equation.
Note that here we have to transpose the first part of the calculation to be compatible with the matrix multiplication rules.
Dimensions: (m, 1) * (m, n)
— Transpose the error term —
Dimensions: (1, m) * (m, n) = (1, n)
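Step 17 can be sketched as follows, again assuming J = (1/m) Σ (ŷ − y)², which gives ∂J/∂β = (2/m) · errorᵀ · X with shape (1, n) — one entry per weight:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 2
X = rng.random((m, n))                # hypothetical inputs, (m, n)
predicted_value = rng.random((m, 1))  # hypothetical predictions, (m, 1)
Y = rng.random((m, 1))                # hypothetical targets, (m, 1)

error = predicted_value - Y           # (m, 1)
# Transpose the error term: (1, m) @ (m, n) -> (1, n)
d_beta = (2 / m) * (error.T @ X)
print(d_beta.shape)  # (1, 2)
```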
18. Step 18:
Next, we substitute the calculated values into the rule from Step 7 to get the gradient rule for updating α .
Dimensions: α = (1, 1)
19. Step 19:
Next, we substitute the calculated values into the rule from Step 7 to get the gradient rule for updating **β**.
Note that we have to transpose the error values to make the function compatible with the matrix multiplication rules.
Dimensions: β = (1, n) - (1, n) = (1, n)
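The two update rules from Steps 18 and 19 can be sketched together. The learning rate and data values are hypothetical, and the gradient expressions assume the cost form J = (1/m) Σ (ŷ − y)²; note that each parameter keeps its shape after the update:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 2
X = rng.random((m, n))                # hypothetical inputs
Y = rng.random((m, 1))                # hypothetical targets
alpha = np.zeros((1, 1))              # bias, (1, 1)
beta = rng.random((1, n))             # weights, (1, n)
learning_rate = 0.1                   # hypothetical value

error = (alpha + X @ beta.T) - Y      # (m, 1)
# Step 18: update the bias; Step 19: update the weights
alpha = alpha - learning_rate * (2 / m) * np.sum(error, axis=0, keepdims=True)
beta = beta - learning_rate * (2 / m) * (error.T @ X)
print(alpha.shape, beta.shape)        # shapes unchanged
```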
A working example of the gradient descent algorithm:
Now, let's take an example to see how the gradient descent algorithm finds the best parameter values.
1. Step 1:
The input data is shown in the matrix below. Here, we can observe that there are **4** training examples and **2** features.
2. Step 2:
The expected output matrix is as follows. Our expected output matrix has size **4 × 1**, since we have **4** training examples.
3. Step 3:
We will add a bias element to the parameters to be trained. Here we choose an initial value of 0 for the bias.
4. Step 4:
Our parameters also include the weight matrix. The weight matrix has 2 elements, since 2 is the number of features in our training dataset. Initially, we can choose arbitrary random numbers for the weight matrix.
5. Step 5:
Next, we will use the input matrix, weight matrix, and biases to predict values.
6. Step 6:
Next, we calculate the cost using the following equation.
7. Step 7:
Next, we compute the partial derivative of the cost function with respect to the bias element. We will use this result in the gradient descent algorithm to update the value of the bias parameter.
8. Step 8:
Next, we compute the partial derivatives of the cost function with respect to the weight matrix. We will use this result in the gradient descent algorithm to update the values of the weight matrix.
9. Step 9:
Next, we define the value of the learning rate. The learning rate is a parameter that controls how quickly the model learns.
10. Step 10:
Next, we use the gradient descent rule to update the parameter values of the bias elements.
11. Step 11:
Next, we use the gradient descent rule to update the parameter values of the weight matrix.
12. Step 12:
We now repeat this process for many iterations to find the parameters that best fit our model. In each iteration, we use the updated values of the parameters.
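The twelve steps of this worked example can be sketched as one training loop. The data, learning rate, and iteration count below are hypothetical stand-ins for the values in the original example, and the gradients assume the cost J = (1/m) Σ (ŷ − y)²; the cost should fall as the loop runs:

```python
import numpy as np

# Hypothetical data: 4 training examples, 2 features (Steps 1-2)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([[5.0], [4.0], [11.0], [10.0]])

m, n = X.shape
alpha = np.zeros((1, 1))                  # Step 3: initial bias
beta = np.zeros((1, n))                   # Step 4: initial weights
learning_rate = 0.01                      # Step 9: hypothetical value

costs = []
for _ in range(1000):                     # Step 12: repeat for many iterations
    predicted = alpha + X @ beta.T        # Step 5: predictions, (m, 1)
    error = predicted - Y
    costs.append(np.mean(error ** 2))     # Step 6: mean squared error
    d_alpha = (2 / m) * np.sum(error, axis=0, keepdims=True)  # Step 7
    d_beta = (2 / m) * (error.T @ X)                          # Step 8
    alpha -= learning_rate * d_alpha      # Step 10: update the bias
    beta -= learning_rate * d_beta        # Step 11: update the weights

print(costs[0], costs[-1])                # cost before vs. after training
```

With this toy data the targets happen to fit y = x₁ + 2·x₂ exactly, so the loop drives the cost close to zero.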
Endnotes:
So, this is how we find the update rule using the gradient descent algorithm with mean squared error. We hope this has sparked your curiosity and left you eager to learn more about machine learning. We'll be implementing the gradient descent algorithm in a future blog using the rules we've derived here, so don't miss the third part of the gradient descent series, where it all comes together - the grand finale!