Gradient Descent Series Blog: 2. The Mathematical Intuition Behind the Gradient Descent Algorithm

[Cover image]

Deriving the Gradient Descent Algorithm for the Mean Squared Error

Gradient descent series blog:

  1. Gradient descent series blog: 1. Gradient descent algorithm basics
  2. Gradient descent series blog: 2. The mathematical intuition behind the gradient descent algorithm (you are here!)
  3. Gradient descent series blog: 3. Batch gradient descent hands-on code
  4. Gradient descent series blog: 4. Mini-batch gradient descent hands-on code

Introduction:

Welcome! Today, we are going to develop a strong mathematical intuition for how the gradient descent algorithm finds the best values for its parameters. This intuition can help you spot errors in machine learning output and better understand how gradient descent makes machine learning so powerful. In the next few sections, we will derive the gradient descent update equations for the mean squared error function, and we will use these results to code the gradient descent algorithm in the next part of the series. Let's dig in!

Deriving the gradient descent algorithm for the mean squared error:

1. Step 1:

The input data is shown in the matrix below. Here, we can observe that there are **m** training examples and **n** features.

[Image: input matrix X with m rows (training examples) and n columns (features)]

Dimensions: X = (m, n)

2. Step 2:

The expected output matrix is as follows. Our expected output matrix has size **m** × 1, since we have **m** training examples.

[Image: expected output matrix Y with m rows and 1 column]

Dimensions: Y = (m, 1)

3. Step 3:

We will add a bias element to the parameters to be trained.

[Image: the bias parameter α]

Dimensions: α = (1, 1)

4. Step 4:

In our parameters we also have the weight matrix. The weight matrix will have **n** elements. Here, **n** is the number of features in our training dataset.

[Image: weight matrix β with n elements]

Dimensions: β = (1, n)
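To make Steps 1 through 4 concrete, here is a minimal NumPy sketch that creates placeholders with exactly these shapes. The variable names (`X`, `Y`, `alpha`, `beta`) and the random initialization are illustrative assumptions, not taken from the original figures.

```python
import numpy as np

m, n = 4, 2                   # m training examples, n features (any sizes work)

X = np.random.rand(m, n)      # input matrix,           shape (m, n)  -- Step 1
Y = np.random.rand(m, 1)      # expected output matrix, shape (m, 1)  -- Step 2
alpha = np.zeros((1, 1))      # bias element,           shape (1, 1)  -- Step 3
beta = np.random.rand(1, n)   # weight matrix,          shape (1, n)  -- Step 4

print(X.shape, Y.shape, alpha.shape, beta.shape)   # (4, 2) (4, 1) (1, 1) (1, 2)
```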

5. Step 5:

The predicted value for each training example is given by,

[Image: prediction equation, predicted_value = α + X · βᵀ]

Note that we are transposing the weight matrix (β) to make dimensions compatible with the matrix multiplication rules.

Dimensions: predicted_value = (1, 1) + (m, n) * (1, n)

— Transpose the weight matrix (β) —

Dimensions: predicted_value = (1, 1) + (m, n) * (n, 1) = (m, 1)
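Continuing the sketch from Step 4, the prediction step can be written in NumPy as below; `beta.T` is the transpose mentioned above, and broadcasting the (1, 1) bias against the (m, 1) product yields an (m, 1) result.

```python
# predicted_value = alpha + X @ beta.T
# shapes: (1, 1) + (m, n) @ (n, 1) -> (m, 1)
y_pred = alpha + X @ beta.T
print(y_pred.shape)   # (m, 1)
```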

6. Step 6:

The mean squared error cost function is defined as follows.

[Image: mean squared error cost function]

Dimensions: cost = scalar function
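Assuming the cost in the figure is the usual mean squared error, (1/m) times the sum of squared differences between the predicted and actual values, a one-line sketch of Step 6 looks like this (some formulations add an extra factor of 1/2; the original figure may differ). This continues the earlier sketch.

```python
# mean squared error over the m training examples -> a single scalar
cost = np.mean((y_pred - Y) ** 2)
print(cost)
```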

7. Step 7:

In this case, we will use the following gradient descent rule to determine the optimal parameters.

[Image: gradient descent update rules for α and β]

Dimensions: α = (1, 1) & β = (1, n)
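For reference, the standard gradient descent update rule that this step refers to is usually written as follows, where η is the learning rate (the exact symbols in the original figure may differ):

```latex
\alpha := \alpha - \eta \, \frac{\partial \, \mathrm{cost}}{\partial \alpha},
\qquad
\beta := \beta - \eta \, \frac{\partial \, \mathrm{cost}}{\partial \beta}
```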

8. Step 8:

Now, let's find the partial derivative of the cost function with respect to the bias element ( α ).

[Image: partial derivative of the cost function with respect to α]

Dimensions: (1, 1)

9. Step 9:

Now, to simplify the equation before taking the partial derivatives, we substitute an intermediate variable u for the error term.

[Image: definition of u, the difference between the predicted and actual values]

Dimensions: u = (m, 1)

10. Step 10:

Based on Step 9, we can write the cost function as,

[Image: cost function written in terms of u]

Dimensions: scalar function

11. Step 11:

Next, we will compute the partial derivatives of the cost function with respect to the intercept ( α ) using the chain rule.

[Image: chain rule expansion of the partial derivative of the cost with respect to α]

Dimensions: (m, 1)

12. Step 12:

Next, we compute the first part of the partial derivative from Step 11.

[Image: first part, the partial derivative of u with respect to α]

Dimensions: (m, 1)

13. Step 13:

Next, we calculate the second part of the partial derivative from Step 11.

[Image: second part, the partial derivative of the cost with respect to u]

Dimensions: scalar function

14. Step 14:

Next, we multiply the results of steps 12 and 13 to get the final result.

[Image: the combined partial derivative of the cost with respect to α]

Dimensions: (m, 1)
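Putting Steps 9 through 14 together in code: assuming the cost is (1/m) · Σ u², the gradient with respect to α works out to (2/m) · Σ (predicted − actual); if the original uses a 1/(2m) cost, the leading factor becomes 1/m instead. Continuing the earlier sketch:

```python
u = y_pred - Y                     # error term from Step 9, shape (m, 1)
grad_alpha = (2 / m) * np.sum(u)   # d(cost)/d(alpha), collapses to a single value
```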

15. Step 15:

Next, we will compute the partial derivatives of the cost function with respect to the weights ( β ) using the chain rule.

[Image: chain rule expansion of the partial derivative of the cost with respect to β]

Dimensions: (1,n)

16. Step 16:

Next, we calculate the second part of the partial derivative from Step 15.

[Image: second part, the partial derivative of u with respect to β]

Dimensions: (m, n)

17. Step 17:

Next, we multiply the results of Step 12 and Step 16 to get the final partial derivative result.

[Image: combined partial derivative of the cost with respect to β, written with a summation]

Now, since we want **n** separate weight gradients, we will remove the summation from the above equation and keep it as a matrix product.

[Image: partial derivative of the cost with respect to β written as a matrix product]

Note that here we have to transpose the first part of the calculation to be compatible with the matrix multiplication rules.

Dimensions: (m, 1) * (m, n)

— Transpose the first part (the error term) —

Dimensions: (1, m) * (m, n) = (1, n)
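The same idea in code for the weights: transposing the error term gives a (1, m) row vector that multiplies the (m, n) input matrix into a (1, n) gradient, matching the dimensions above (again assuming the 2/m factor that follows from a (1/m) · Σ u² cost).

```python
grad_beta = (2 / m) * (u.T @ X)    # shapes: (1, m) @ (m, n) -> (1, n)
print(grad_beta.shape)             # (1, n), one gradient per weight
```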

18. Step 18:

Next, we substitute all the calculated values into Step 7 to obtain the gradient descent rule for updating α.

[Image: gradient descent update rule for α]

Dimensions: α = (1, 1)

19. Step 19:

Next, we substitute all the calculated values into Step 7 to obtain the gradient descent rule for updating **β**.

[Image: gradient descent update rule for β]

Note that we have to transpose the error values to make them compatible with the matrix multiplication rules.

Dimensions: β = (1, n) - (1, n) = (1, n)
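Combining Steps 18 and 19, one full parameter update can be sketched as below, with `lr` standing in for whatever learning rate value is used:

```python
lr = 0.01                          # learning rate (an assumed value)

alpha = alpha - lr * grad_alpha    # bias update,   shape stays (1, 1)
beta = beta - lr * grad_beta       # weight update, shape stays (1, n)
```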

A working example of the gradient descent algorithm:

Now, let's take an example to see how the gradient descent algorithm finds the best parameter values.

1. Step 1:

The input data is shown in the matrix below. Here, we can observe that there are **4** training examples and **2** features.

[Image: 4 × 2 input matrix X]

2. Step 2:

The expected output matrix is as follows. Our expected output matrix has size **4** × 1, since we have **4** training examples.

[Image: 4 × 1 expected output matrix Y]

3. Step 3:

We will add a bias element to the parameters to be trained. Here we choose an initial value of 0 for the bias.

[Image: bias α initialized to 0]

4. Step 4:

In our parameters we also have the weight matrix. The weight matrix will have 2 elements. Here, 2 is the number of features in our training dataset. Initially, we can choose arbitrary random numbers for the weight matrix.

[Image: initial 1 × 2 weight matrix β]

5. Step 5:

Next, we will use the input matrix, weight matrix, and biases to predict values.

[Image: computing the predicted values for the example]

6. Step 6:

Next, we calculate the cost using the following equation.

[Image: computing the cost for the example]

7. Step 7:

Next, we compute the partial derivative of the cost function with respect to the bias element. We will use this result in the gradient descent algorithm to update the value of the bias parameter.

[Image: partial derivative of the cost with respect to the bias for the example]

8. Step 8:

Next, we compute the partial derivatives of the cost function with respect to the weight matrix. We will use this result in the gradient descent algorithm to update the values of the weight matrix.

[Image: partial derivative of the cost with respect to the weights for the example]

9. Step 9:

Next, we define the value of the learning rate. The learning rate is a parameter that controls how quickly the model learns.

[Image: chosen learning rate value]

10. Step 10:

Next, we use the gradient descent rule to update the value of the bias element.

[Image: updated bias value]

11. Step 11:

Next, we use the gradient descent rule to update the values of the weight matrix.

[Image: updated weight matrix values]

12. Step 12:

We now repeat this process for many iterations to find the parameters that best fit our model. In each iteration, we use the updated values of the parameters.
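To tie the worked example together, here is a complete, minimal training loop following the steps above. The actual numbers from the figures are not reproduced here, so the 4 × 2 input matrix, the targets, the initial weights, and the learning rate below are all made-up stand-ins; only the shapes and the update logic follow the example.

```python
import numpy as np

# made-up stand-ins for the 4 x 2 example data shown in the figures
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])                      # input matrix, (4, 2)
Y = np.array([[5.0], [4.0], [11.0], [10.0]])    # expected outputs, (4, 1)

m, n = X.shape
alpha = np.zeros((1, 1))            # bias initialized to 0 (Step 3)
beta = np.random.rand(1, n)         # random initial weights (Step 4)
lr = 0.01                           # assumed learning rate (Step 9)

for i in range(1000):               # Step 12: repeat for many iterations
    y_pred = alpha + X @ beta.T                 # Step 5: predictions, (4, 1)
    cost = np.mean((y_pred - Y) ** 2)           # Step 6: cost
    u = y_pred - Y                              # error term
    grad_alpha = (2 / m) * np.sum(u)            # Step 7: d(cost)/d(alpha)
    grad_beta = (2 / m) * (u.T @ X)             # Step 8: d(cost)/d(beta)
    alpha = alpha - lr * grad_alpha             # Step 10: update bias
    beta = beta - lr * grad_beta                # Step 11: update weights

print(alpha, beta, cost)
```

If the updates are working, the printed cost should be much smaller than the cost of the first iteration, which is the behavior the derivation above is designed to produce.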

Endnotes:

So, this is how we derive the update rules for the gradient descent algorithm with the mean squared error. We hope this has sparked your curiosity and left you eager to learn more about machine learning. We will implement the gradient descent algorithm in the next blog using the rules derived here, so don't miss the third part of the gradient descent series, where it all comes together in the grand finale!
