Deep Learning from Scratch in Modern C++: [4/8] Gradient Descent

1. Description

In this series, we will learn how to write must-know deep learning algorithms such as convolutions, backpropagation, activation functions, optimizers, deep neural networks, and more, using only plain and modern C++.

In this story, we introduce the gradient descent algorithm and use it to fit 2D convolution kernels to data. We will code everything in modern C++ and Eigen, using the convolution routine and the cost function concept introduced in the previous stories.

This story is Gradient Descent in C++. See the other stories:

0 — Basics of Modern C++ Deep Learning Programming

1 — Coding 2D convolution in C++

2 — Cost function using Lambda

4 — Activation Functions

...and more coming soon.

2. Function approximation as an optimization problem

If you've read the previous stories, you already know that in machine learning, most of the time, we focus on using data to find function approximations.

Usually, we obtain a function approximation by finding the coefficients that minimize the cost value. Our approximation problem is thus transformed into an optimization problem, in which we try to minimize the value of the cost function.

3. Cost function and gradient descent

The cost function measures the cost of approximating the objective function F(X) with the function H(X). For example, if H(X) is a convolution between the input X and a kernel k, we can use the MSE cost function introduced in the previous story to measure the approximation error.

Writing Yₙ = F(Xₙ) for the expected output of each instance and Tₙ for the actual output of the convolution, MSE(k), the mean squared error, is the mean of the squared differences (Tₙ[r, c] - Yₙ[r, c])² taken over every instance n and every output coefficient (r, c).
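The MSE lambda itself was coded in the previous story and is not repeated here. As a reminder, below is a minimal sketch of one possible implementation (assuming the Matrix alias for Eigen::MatrixXd used throughout this series; the exact normalization should follow the previous story's definition):

#include <Eigen/Dense>
#include <iostream>
#include <vector>

using Matrix = Eigen::MatrixXd;

// Sketch of an MSE cost lambda: mean of the squared differences between the
// expected outputs ys and the actual outputs ts, over all instances and
// coefficients. The normalization here is an assumption; match the previous story.
auto MSE = [](const std::vector<Matrix> &ys, const std::vector<Matrix> &ts)
{
    double sum = 0.;
    for (std::size_t n = 0; n < ys.size(); ++n) {
        const Matrix diff = ts[n] - ys[n];
        sum += diff.cwiseProduct(diff).sum();   // sum of squared differences
    }
    const double coefficients = static_cast<double>(ys.size()) * ys[0].rows() * ys[0].cols();
    return sum / coefficients;
};

int main() {
    std::vector<Matrix> ys{Matrix::Constant(2, 2, 1.0)};
    std::vector<Matrix> ts{Matrix::Constant(2, 2, 0.5)};
    std::cout << "MSE = " << MSE(ys, ts) << "\n";   // prints 0.25
}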

Therefore, our goal is to find the kernel value kₘ that minimizes MSE(k). The most basic (but very powerful) algorithm for finding kₘ is gradient descent.

Gradient descent uses the cost function gradient to find the minimum cost. To understand what a gradient is, let's talk about cost surfaces.

4. Draw the cost surface

For easier understanding, let us temporarily assume that the kernel k = [k₀₀, k₀₁] consists of only two coefficients. If we plot the value of MSE(k) for each possible combination of k₀₀ and k₀₁, we end up with a surface.

At each point (k₀₀, k₀₁, MSE(k₀₀, k₀₁)), the surface has one inclination with respect to the Ok₀₀ axis and another inclination with respect to the Ok₀₁ axis:

Partial derivative

These two slopes are the partial derivatives of the MSE surface with respect to the axes Ok₀₀ and Ok₀₁, respectively. In calculus, we use the notation ∂ to denote partial derivatives: ∂MSE/∂k₀₀ and ∂MSE/∂k₀₁.
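To make these two slopes concrete, the short program below estimates them numerically with central finite differences. The two-coefficient quadratic cost used here is a hypothetical stand-in for illustration only, not the MSE surface of this story:

#include <iostream>

// Hypothetical two-coefficient cost surface, used only to illustrate partial derivatives.
double cost(double k00, double k01) {
    return (k00 - 1.0) * (k00 - 1.0) + 2.0 * (k01 + 0.5) * (k01 + 0.5);
}

int main() {
    const double h = 1e-6;              // small step for the central finite difference
    const double k00 = 0.0, k01 = 0.0;  // point where we measure the slopes

    // numerical estimates of the two partial derivatives (the components of the gradient)
    const double d_k00 = (cost(k00 + h, k01) - cost(k00 - h, k01)) / (2 * h);
    const double d_k01 = (cost(k00, k01 + h) - cost(k00, k01 - h)) / (2 * h);

    std::cout << "dCost/dk00 ≈ " << d_k00 << "\n";  // approximately -2
    std::cout << "dCost/dk01 ≈ " << d_k01 << "\n";  // approximately  2
}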

These two partial derivatives together constitute the gradient of the MSE with respect to the axes Ok₀₀ and Ok₀₁. This gradient is used to drive the execution of the gradient descent algorithm as follows:

Practical Applications of Gradient Descent

The algorithm that performs this "navigation" on a cost surface is called gradient descent.

5. Gradient descent

The gradient descent pseudocode is described as follows:

gradient_descent:
    initialize k, learning_rate, epoch = 1
    repeat
        k = k - learning_rate x ∇Cost(k)
        epoch = epoch + 1
    until epoch > max_epoch
    return k

The value learning_rate x ∇Cost(k) is often called the weight update. We can summarize the behavior of gradient descent as:

for each iteration:
    calculate the weight update
    subtract it from the parameter k

As the name suggests, Cost(k) is the cost function evaluated at configuration k. The purpose of gradient descent is to find the value of k that minimizes Cost(k).

learning_rate is usually a small scalar such as 0.1, 0.01, or 0.001. This value controls the step size during optimization.

The algorithm loops for max_epoch iterations. Sometimes we stop the algorithm earlier, i.e., even if epoch < max_epoch, when Cost(k) becomes small enough.

We usually refer to parameters like learning_rate and max_epoch as hyperparameters.
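To see the loop in isolation before applying it to convolution kernels, here is a minimal, self-contained sketch that runs gradient descent on a hypothetical one-parameter cost Cost(k) = (k - 3)². The cost, its gradient, and the hyperparameter values below are illustrative assumptions, not part of this story's problem:

#include <iostream>

// Hypothetical one-parameter cost and its analytic gradient, for illustration only.
double cost(double k)     { return (k - 3.0) * (k - 3.0); }
double gradient(double k) { return 2.0 * (k - 3.0); }

int main() {
    double k = 0.0;                    // initial guess for the parameter
    const double learning_rate = 0.1;  // hyperparameter: step size
    const int max_epochs = 50;         // hyperparameter: maximum number of iterations

    for (int epoch = 1; epoch <= max_epochs; ++epoch) {
        const double update = learning_rate * gradient(k);  // weight update
        k -= update;                                         // k = k - learning_rate x ∇Cost(k)
        if (cost(k) < 1e-12) break;                          // optional early stop
    }

    std::cout << "k converged to " << k << " (expected 3)\n";
}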

To implement gradient descent, the last thing we need to know is how to compute the gradient of Cost(k). Fortunately, in the case where the cost function is MSE, finding ∇Cost(k) is straightforward, as we show next.

6. Find the MSE gradient

So far we have seen that the components of the gradient are the slopes of the cost surface along each axis Okᵢⱼ. We have also seen that each component of the gradient of MSE(k) is the partial derivative ∂MSE/∂kᵢⱼ with respect to the corresponding coefficient kᵢⱼ of the kernel k.

Let us remember that MSE(k) is the mean of the squared differences (Tₙ[r, c] - Yₙ[r, c])², where n is the index of each pair (Yₙ, Tₙ) and r & c are the indices of the output matrix coefficients:

output layout

Using the chain rule and the rule for linear combinations, we can expand the MSE gradient: each partial derivative ∂MSE/∂kᵢⱼ reduces to a (normalized) sum, over the instances and the output coefficients, of the differences (Tₙ[r, c] - Yₙ[r, c]) multiplied by the derivatives ∂Tₙ[r, c]/∂kᵢⱼ.

Since the values of N, R, C, Yₙ, and Tₙ are known, all we need to compute is the partial derivative of each coefficient of Tₙ with respect to each kernel coefficient kᵢⱼ. In the case of a convolution with padding P, this derivative is simply the corresponding coefficient of the (padded) input Xₙ.

If we expand the sums over r and c, we find that the gradient is proportional to the sum, over the instances, of the 2D convolution of each input Xₙ with a matrix δₙ, where δₙ = Tₙ - Yₙ.

The following code does this:

auto gradient = [](const std::vector<Matrix> &xs, const std::vector<Matrix> &ys, const std::vector<Matrix> &ts, const int padding)
{
    const int N = xs.size();
    const int R = xs[0].rows();
    const int C = xs[0].cols();

    // the gradient accumulator; with same-size outputs this matches the kernel shape
    const int result_rows = xs[0].rows() - ys[0].rows() + 2 * padding + 1;
    const int result_cols = xs[0].cols() - ys[0].cols() + 2 * padding + 1;
    Matrix result = Matrix::Zero(result_rows, result_cols);
    
    for (int n = 0; n < N; ++n) {
        const auto &X = xs[n];
        const auto &Y = ys[n];
        const auto &T = ts[n];

        Matrix delta = T - Y;                               // δn = Tn - Yn
        Matrix update = Convolution2D(X, delta, padding);   // contribution of instance n
        result = result + update;
    }

    // constant factor coming from the derivative of the squared error and the normalization
    result *= 2.0/(R * C);

    return result;
};

Now that we know how to obtain gradients, let's implement the gradient descent algorithm.

7. Coding Gradient Descent

Finally, our gradient descent code is here:

auto gradient_descent = [](Matrix &kernel, Dataset &dataset, const double learning_rate, const int MAX_EPOCHS)
{
    std::vector<double> losses; losses.reserve(MAX_EPOCHS);

    const int padding = kernel.rows() / 2;   // "same" padding keeps the output the size of the input
    const int N = dataset.size();

    std::vector<Matrix> xs; xs.reserve(N);
    std::vector<Matrix> ys; ys.reserve(N);
    std::vector<Matrix> ts; ts.reserve(N);

    int epoch = 0;
    while (epoch < MAX_EPOCHS)
    {
        xs.clear(); ys.clear(); ts.clear();

        // forward pass: compute the actual output T of every instance using the current kernel
        for (auto &instance : dataset) {
            const auto & X = instance.first;
            const auto & Y = instance.second;
            const auto T = Convolution2D(X, kernel, padding);
            xs.push_back(X);
            ys.push_back(Y);
            ts.push_back(T);
        }

        losses.push_back(MSE(ys, ts));   // keep the loss history for plotting

        // backward pass: compute the gradient and apply the weight update
        auto grad = gradient(xs, ys, ts, padding);
        Matrix update = grad * learning_rate;
        kernel -= update;

        epoch++;
    }

    return losses;
};

This is the base code. We can improve it in several ways, for example:

  • using the loss of each individual instance to update the kernel, which is called Stochastic Gradient Descent (SGD) and is very useful in real-world scenarios;
  • grouping instances into batches and updating the kernel after each batch, which is called minibatch gradient descent (see the sketch after this list);
  • using a learning rate schedule to reduce the learning rate across epochs;
  • plugging an optimizer such as Momentum, RMSProp, or Adam into the line kernel -= update;. We will discuss optimizers in the next story;
  • bringing in a validation set or using some cross-validation scheme;
  • replacing the nested loop for (auto &instance : dataset) with vectorization, for better performance and CPU usage (as mentioned in the previous story);
  • adding callbacks and hooks to more easily customize our training loop.
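The following self-contained sketch illustrates only the shuffling and batching skeleton behind the minibatch idea mentioned above; integer indices stand in for the (X, Y) instances of the Dataset, and the per-batch gradient step is left as a comment:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // stand-ins for dataset instance indices; in the real loop each index would
    // refer to one (X, Y) pair of the Dataset used in this story
    std::vector<int> indices(10);
    std::iota(indices.begin(), indices.end(), 0);

    std::mt19937 rng(42);
    const std::size_t batch_size = 4;

    for (int epoch = 0; epoch < 2; ++epoch) {
        std::shuffle(indices.begin(), indices.end(), rng);   // reshuffle every epoch
        for (std::size_t start = 0; start < indices.size(); start += batch_size) {
            const std::size_t end = std::min(start + batch_size, indices.size());
            std::cout << "epoch " << epoch << ", batch:";
            for (std::size_t i = start; i < end; ++i) std::cout << ' ' << indices[i];
            std::cout << '\n';
            // here we would compute the gradient over this batch only and apply
            // the update: kernel -= learning_rate * batch_gradient;
        }
    }
}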

We can set these improvements aside for a moment. For now, the focus is on understanding how gradients are used to update parameters (kernels, in our case). This is a fundamental concept in machine learning today and a key stepping stone toward more advanced topics.

Let's put this into action with an illustrative experiment to see how this code works.

8. A practical experiment: recovering the Sobel edge detector

In the last story, we learned that we can apply a Sobel filter Gx to detect vertical edges:
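A common definition of this Gx kernel, and one hypothetical way to package it as the Sobel object used by the code later in this story (the SobelKernels struct below is an assumption; the repository's actual definition may differ), is:

#include <Eigen/Dense>
#include <iostream>

using Matrix = Eigen::MatrixXd;

// Hypothetical holder for the Sobel Gx kernel (vertical-edge detector);
// the repository's actual packaging may differ.
struct SobelKernels {
    Matrix Gx = (Matrix(3, 3) <<
        -1., 0., 1.,
        -2., 0., 2.,
        -1., 0., 1.).finished();
};

const SobelKernels Sobel;

int main() {
    std::cout << "Gx:\n" << Sobel.Gx << "\n";
}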

Now, the question is: given the original images and the edge images, can we recover the Sobel filter Gx?

In other words, can we fit a kernel given an input X and an expected output Y?

The answer is yes, we will use gradient descent to do this.

9. Load and prepare data

First, we read some images from a folder using OpenCV. We apply the Gx filter to each of them and store the resulting (input, output) pairs in our dataset object:

auto load_dataset = [](std::string data_folder, const int padding) {

    Dataset dataset;
    for (const auto & entry : fs::directory_iterator(data_folder)) {

        // entry.path() already contains the folder prefix
        Mat image = cv::imread(entry.path().string(), cv::IMREAD_GRAYSCALE);
        Mat formatted_image = resize_image(image, 640, 640);

        Matrix X;
        cv::cv2eigen(formatted_image, X);
        X /= 255.;                                      // normalize pixels to [0, 1]

        auto Y = Convolution2D(X, Sobel.Gx, padding);   // ground truth generated with Gx

        auto pair = std::make_pair(X, Y);
        dataset.push_back(pair);
    }

    return dataset;
};

auto dataset = load_dataset("../images/", padding);

We use the helper utility resize_image to fit each input image onto a 640x640 grid.

resize_image centers each image on a black 640x640 canvas without stretching it, rather than simply resizing the image.
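The actual resize_image helper lives in the repository and is not listed in this story. A possible implementation of the behavior described above (letterboxing with OpenCV; the name and signature are assumed here) could look like this:

#include <opencv2/opencv.hpp>
#include <algorithm>

// Letterboxes 'image' onto a black target_width x target_height canvas,
// preserving the aspect ratio instead of stretching the image.
// The signature is assumed; the repository's helper may differ.
cv::Mat resize_image(const cv::Mat &image, int target_width, int target_height)
{
    const double scale = std::min(
        static_cast<double>(target_width) / image.cols,
        static_cast<double>(target_height) / image.rows);

    cv::Mat scaled;
    cv::resize(image, scaled, cv::Size(), scale, scale);

    cv::Mat canvas = cv::Mat::zeros(target_height, target_width, image.type());
    const int x = (target_width - scaled.cols) / 2;   // horizontal offset to center the image
    const int y = (target_height - scaled.rows) / 2;  // vertical offset to center the image
    scaled.copyTo(canvas(cv::Rect(x, y, scaled.cols, scaled.rows)));
    return canvas;
}

int main() {
    cv::Mat image = cv::imread("example.png", cv::IMREAD_GRAYSCALE);  // any sample image
    if (image.empty()) return 1;
    cv::imwrite("formatted.png", resize_image(image, 640, 640));
    return 0;
}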

        We use the Gx filter to generate the ground truth output Y for each image. Now, we can forget about this filter. We'll recover it from the data using gradient descent and 2D convolution.

10. Run the experiment       

By connecting all the pieces, we can finally watch the training run:

int main() {
    const int padding = 1;
    auto dataset = load_dataset("../images/", padding);

    // start from a random 3x3 kernel; gradient descent will fit it to the data
    Matrix kernel = Matrix::Random(3, 3);

    const int MAX_EPOCHS = 1000;
    const double learning_rate = 0.1;
    auto history = gradient_descent(kernel, dataset, learning_rate, MAX_EPOCHS);
    
    std::cout << "Original kernel is:\n\n" << std::fixed << std::setprecision(2) << Sobel.Gx << "\n\n";
    std::cout << "Trained kernel is:\n\n" << std::fixed << std::setprecision(2) << kernel << "\n\n";

    plot_performance(history);

    return 0;
}

The following sequence illustrates the fitting process:

In the beginning, the kernel is filled with random numbers. Therefore, in the first epoch, the output image is mostly black.

However, after a few epochs, gradient descent starts fitting the kernel toward the minimum of the cost surface.

Finally, in the last epochs, the output is almost equal to the ground truth. At this point, the loss value approaches its lowest value asymptotically. Let's check the loss performance over time:

training performance

This loss curve shape is very common in machine learning. In the first epochs the parameters are essentially random values, which results in a high initial loss:

Algorithmic Search Representation on Cost Surfaces

In the later epochs, gradient descent finally does its job, fitting the kernel to a suitable value and making the loss converge to a minimum.

We can now compare the learned kernel with the original Gx Sobel filter:

As we expected, the learned kernel is very close to the original one. Note that the difference could be even smaller if we trained the kernel for more epochs (and used a smaller learning rate).

The code used to train this kernel can be found in this repository.

11. About differentiation and autodiff

In this story, we used common calculus rules to find the MSE partial derivatives. However, finding the algebraic derivative of a complex cost function can be challenging in some cases. Luckily, modern machine learning frameworks provide a magical feature called automatic differentiation, or simply autodiff.

autodiff keeps track of each basic arithmetic operation (such as addition or multiplication) and applies the chain rule to them to find the partial derivatives. Thus, when using autodiff, we don't need algebraic formulas for computing partial derivatives, nor do we need to implement them directly.
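As a tiny, framework-independent illustration of this idea, forward-mode autodiff can be built with dual numbers: every value carries its derivative, and each arithmetic operation applies the chain rule. This sketch is only illustrative and does not reflect how any particular framework is implemented:

#include <iostream>

// Minimal forward-mode autodiff with dual numbers (illustration only):
// each value carries its derivative, and every operation applies the chain rule.
struct Dual {
    double value;
    double deriv;
};

Dual operator+(Dual a, Dual b) { return {a.value + b.value, a.deriv + b.deriv}; }
Dual operator*(Dual a, Dual b) { return {a.value * b.value, a.deriv * b.value + a.value * b.deriv}; }

int main() {
    // f(k) = k*k + 2*k, evaluated at k = 3 with dk/dk = 1
    Dual k{3.0, 1.0};
    Dual two{2.0, 0.0};
    Dual f = k * k + two * k;
    std::cout << "f(3) = " << f.value << ", f'(3) = " << f.deriv << "\n";  // 15 and 8
}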

Since here we are using simple, well-known cost formulas, there was no need to use autodiff or to solve complex differentials by hand.

Covering derivatives, partial derivatives, and automatic differentiation in more detail deserves a new story!

12. Conclusion 

        In this story, we learned how to use gradients to fit kernels to data. We introduced gradient descent, which is simple, powerful, and the basis for deriving more complex algorithms such as backpropagation. We also performed a practical experiment using gradient descent to recover the Sobel filter from the data.

