Deep Learning from Scratch in Modern C++: [6/8] Cost Functions

1. Description

        In machine learning, we typically model problems as functions. Therefore, much of our work consists of finding ways to approximate functions using known models. In this process, the cost function plays a central role.

        This story is a continuation of our previous discussion on convolutions. Today, we'll introduce the concept of cost functions, show common examples, and learn how to code and plot them. As always, in pure C++ with Eigen, from scratch.

2. About this series

        In this series, we will learn how to code must-know deep learning algorithms such as convolutions, backpropagation, activation functions, optimizers, deep neural networks, and more, using only plain and modern C++.

This story: Cost Functions in C++

Previous stories in this series:

Deep Learning from Scratch in Modern C++ [3/8]: Activation Functions 

Deep Learning from Scratch in Modern C++: [4/8] Gradient Descent 

Deep Learning from Scratch in Modern C++: [5/8] Convolution 

...and more coming soon.

3. Modeling in machine learning

        As AI engineers, we usually define each task or problem as a function.

        For example, if we are developing a face recognition system, our first step is to define the problem as a function that maps an input image to an identifier: F(image) = identity.

        For a medical diagnosis system, we can define a function that maps symptoms to a diagnosis: F(symptoms) = diagnosis.

        We can write a model that provides an image given a sequence of words: F(words) = image.

        The list is endless. Using functions to represent tasks or problems is a simplified way to implement machine learning systems.
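
For instance, each of these tasks can be declared as a function signature in C++. The type names below are hypothetical, just to make the idea concrete (a sketch, not code from this series):

#include <string>
#include <vector>

// hypothetical types, for illustration only
struct Image {};
struct Identity {};
struct Symptoms {};
struct Diagnosis {};

// each task is modeled as a function from inputs to outputs
Identity recognize_face(const Image &input);                   // face recognition: F(image) = identity
Diagnosis diagnose(const Symptoms &symptoms);                  // medical diagnosis: F(symptoms) = diagnosis
Image generate_image(const std::vector<std::string> &words);   // text-to-image: F(words) = image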

The question is often: how do we know the formula of F(X)?

4. Approximating functions

        In fact, it is not feasible to define F(X) using a formula or sequence of rules (I will explain why someday).

        In general, instead of finding or defining the correct function F(X), we try to find an approximation of F(X). Let's call this approximation the hypothesis function, or simply H(X).

        At first glance, this doesn't make sense: if we need to find an approximate function H(X), why don't we try to find F(X) directly?

        The answer is: we know H(X). Although we know very little about F(X), we know almost everything about H(X): its formula, its parameters, etc. The only thing we don't know about H(X) is the values of its parameters.

        In fact, the main concern of machine learning is finding ways to determine appropriate parameter values for a given problem and data. Let's see how we can do that.
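
For example, a hypothesis whose formula is known but whose parameter value is not could look like this (a minimal sketch of my own, anticipating the example in section 7):

// H(X) = k * X: the formula is fixed, but the value of k is unknown.
// Machine learning's job is to find a k that makes H behave like F.
auto make_hypothesis = [](double k) {
    return [k](double x) { return k * x; };
};

auto H = make_hypothesis(2.1);  // some candidate value for k
double t = H(1.0);              // t == 2.1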

In machine learning terminology, H(X) is known as an approximation of F(X). The existence of such an approximation is covered by the universal approximation theorem.

5. Cost Function and Universal Approximation Theorem

        Consider a situation where we know the value of the input X and the corresponding output Y = F(X), but we don't know the formula of F(X). For example, we know that if the input is X = 1.0, then the result is Y = F(1.0) = 2.0.

        Mapping of X and F(X)

        Now, consider that we have a known function H(X) and we want to know whether H(X) is a good approximation of F(X). Therefore, we calculate T = H(1.0) and find T = 1.9.

        How bad is this value T = 1.9, given that we know the real value is Y = 2.0 when X = 1.0?

        The metric used to quantify the cost of the difference between Y and T is called the cost function.

Note that Y is the expected value and T is the actual value guessed by H(X).
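
For this single observation, the squared difference is one simple way to measure that cost (a small sketch anticipating the MSE defined in the next section):

double Y = 2.0;                    // expected value, F(1.0)
double T = 1.9;                    // value guessed by H(1.0)
double cost = (Y - T) * (Y - T);   // ≈ 0.01: a small cost, so H(1.0) is a close guess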

The concept of a cost function is at the heart of machine learning. Let's take the most common cost function as an example.

6. Mean squared error

        The most famous cost function is the mean squared error:

MSE = (1/N) * Σ (Yi − Ti)², summing over the N pairs i = 1, …, N

        where each Ti is given by the convolution of the input Xi with the kernel k.

We discussed convolution in a previous story

        Note that we have N pairs (Yi, Ti), each combining an expected value Yi with the corresponding predicted value Ti. The MSE is evaluated by summing the squared differences (Yi − Ti)² over all N pairs and dividing the total by N.

We can write the first version of MSE as follows:

#include <functional>
#include <numeric>
#include <stdexcept>
#include <vector>

auto MSE = [](const std::vector<double> &Y_true, const std::vector<double> &Y_pred) {

    if (Y_true.empty()) throw std::invalid_argument("Y_true cannot be empty.");

    if (Y_true.size() != Y_pred.size()) throw std::invalid_argument("Y_true and Y_pred sizes do not match.");

    // squared difference of a single pair (Yi, Ti)
    auto quadratic = [](const double a, const double b) {
        double result = a - b;
        return result * result;
    };

    const int N = Y_true.size();

    // accumulate the squared differences over all pairs
    double acc = std::inner_product(Y_true.begin(), Y_true.end(), Y_pred.begin(), 0.0, std::plus<>(), quadratic);

    // average over the number of pairs
    double result = acc / N;

    return result;
};
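
As a quick sanity check, we can call this lambda on small vectors (the values below are my own, not from the original example):

#include <iostream>

int main()
{
    std::vector<double> Y_true = {1.0, 2.0, 3.0};
    std::vector<double> Y_pred = {1.1, 1.9, 3.2};

    // (0.01 + 0.01 + 0.04) / 3 = 0.02
    std::cout << "MSE: " << MSE(Y_true, Y_pred) << "\n";

    return 0;
}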

        Now that we know how to calculate MSE, let's see how to use it to approximate functions.

7. Using MSE to find the best parameter

        Suppose we have a mapping F(X) synthetically generated by:

F(X) = 2*X + N(0, 0.1)

        where N(0, 0.1) represents random values drawn from a normal distribution with mean = 0 and standard deviation = 0.1. We can generate sample data with:

#include <algorithm>
#include <ctime>
#include <random>
#include <utility>
#include <vector>

std::default_random_engine dre(time(0));

std::normal_distribution<double> gaussian_dist(0., 0.1);
std::uniform_real_distribution<double> uniform_dist(0., 1.);

// 90 (x, y) pairs drawn from F(X) = 2*X + N(0, 0.1)
std::vector<std::pair<double, double>> sample(90);

std::generate(sample.begin(), sample.end(), [&]() {
    double x = uniform_dist(dre);
    double noise = gaussian_dist(dre);
    double y = 2. * x + noise;
    return std::make_pair(x, y);
});

        If we plot this example using any spreadsheet software, we get something like this:

        Note that we know the formulas of the generator G(X) = 2*X and of F(X) = G(X) + N(0, 0.1). In real life, however, these generator functions are an untold secret of the underlying phenomenon. Here, in our example, we only know them because we are generating synthetic data to help us understand better.

        In real life, all we know is an assumption that a hypothesis function H(X) defined by H(X) = k*X might be a good approximation of F(X). Of course, we don't know the value of k yet.

        Let's see how to use MSE to find the appropriate value of k. In fact, it's as simple as computing and plotting the MSE for a range of different values of k:

// expected outputs Y, extracted from the generated sample
std::vector<double> ys(sample.size());
std::transform(sample.begin(), sample.end(), ys.begin(), [](const auto &pair) {
    return pair.second;
});

std::vector<std::pair<double, double>> measures;

double smallest_mse = 1'000'000'000.;
double best_k = -1;
double step = 0.1;

for (double k = 0.; k < 4.1; k += step) {

    // predictions T = H(X) = k * X for the current candidate k
    std::vector<double> ts(sample.size());
    std::transform(sample.begin(), sample.end(), ts.begin(), [k](const auto &pair) {
        return pair.first * k;
    });

    double mse = MSE(ys, ts);
    if (mse < smallest_mse) {
        smallest_mse = mse;
        best_k = k;
    }

    measures.push_back(std::make_pair(k, mse));
}

std::cout << "best k was " << best_k << " for a MSE of " << smallest_mse << "\n";

        Often, this program outputs something like this:

        best k was 2.1 for a MSE of 0.00828671

        If we plot MSE(k) versus k, we can see a very interesting fact:

        MSE(k) for k from 0 to 4 in steps of 0.1

        Note that the value of MSE(k) is smallest around k = 2. In fact, 2 is the parameter of the generator function G(X) = 2*X.

Given the data and using a step size of 0.1, the smallest value of MSE(k) is found at k = 2.1. This shows that H(X) = 2.1*X is a good approximation of F(X). In fact, if we plot F(X), G(X) and H(X), we have:

        From the graph above, we can see that H(X) indeed approximates F(X). However, we can try smaller step sizes, such as 0.01 or 0.001, to find a better approximation.
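
For instance, a finer scan around the best k found so far could look like this (a sketch that reuses sample, ys, MSE, best_k, and smallest_mse from the code above):

// refine the search around the previous best k with a step of 0.01
double fine_step = 0.01;
for (double k = best_k - 0.1; k <= best_k + 0.1; k += fine_step) {
    std::vector<double> ts(sample.size());
    std::transform(sample.begin(), sample.end(), ts.begin(), [k](const auto &pair) {
        return pair.first * k;
    });

    double mse = MSE(ys, ts);
    if (mse < smallest_mse) {
        smallest_mse = mse;
        best_k = k;
    }
}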

Code can be found in this repository

8. Cost surface

        The curve of MSE(k) versus k is a one-dimensional example of a cost surface.

MSE curves are 1D surfaces

        What the previous example shows is that we can use the minimum of the cost surface to find the best-fitting value of the parameter k.

This example describes the most important paradigm in machine learning: function approximation via cost function minimization.

        The figure above shows a one-dimensional cost surface, that is, a cost curve for a single parameter k. In two dimensions, i.e., when we have two parameters k1 and k2, the cost surface looks more like an actual surface:

2D cost surface

        Whether k is 1D, 2D, or higher-dimensional, the process of finding the best value of k is the same: find the minimum of the cost surface.

The minimum cost value is also known as the global minimum.

        In 1D space, the process of finding the global minimum is relatively easy. However, in high dimensions, scanning all the space to find the minimum can be computationally expensive. In the next story, we'll introduce algorithms to perform this search at scale.
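
To make the 2D case concrete, here is a minimal sketch of the same brute-force scan over two parameters. It assumes a hypothesis of the form H(X) = k1*X + k2 and reuses the sample, ys, and MSE defined earlier (my own illustration, not code from the original story):

#include <limits>

double best_k1 = 0., best_k2 = 0.;
double smallest = std::numeric_limits<double>::max();

// scan the 2D cost surface MSE(k1, k2) on a regular grid
for (double k1 = 0.; k1 < 4.1; k1 += 0.1) {
    for (double k2 = -1.; k2 < 1.1; k2 += 0.1) {
        std::vector<double> ts(sample.size());
        std::transform(sample.begin(), sample.end(), ts.begin(), [k1, k2](const auto &pair) {
            return k1 * pair.first + k2;
        });

        double mse = MSE(ys, ts);
        if (mse < smallest) {
            smallest = mse;
            best_k1 = k1;
            best_k2 = k2;
        }
    }
}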

        It is not only k that can be high-dimensional. In practical problems, the output is usually high-dimensional as well. Let's learn how to calculate MSE in this case.

9. MSE on high-dimensional output

        In real-world problems, Y and T are vectors or matrices. Let's see how to handle such data.

        If the output is one-dimensional, the previous formulation of MSE will work. But if the output is multidimensional, we need to change the formula a bit. For example:

        In this case, Yn and Tn are not scalar values, but matrices of size (2, 3). Before applying MSE to this data, we need to change the formula as follows:

MSE = (1 / (N * R * C)) * Σn Σr Σc (Yn(r, c) − Tn(r, c))²

        In this formula, N is the number of pairs, R is the number of rows, and C is the number of columns in each pair. As usual, we can implement this version of MSE using lambdas:

#include <functional>
#include <iostream>
#include <numeric>
#include <stdexcept>
#include <vector>

#include <Eigen/Core>

using Eigen::MatrixXd;

int main() 
{
    auto MSE = [](const std::vector<MatrixXd> &Y_true, const std::vector<MatrixXd> &Y_pred) 
    {
        if (Y_true.empty()) throw std::invalid_argument("Y_true cannot be empty.");

        if (Y_true.size() != Y_pred.size()) throw std::invalid_argument("Y_true and Y_pred sizes do not match.");

        const int N = Y_true.size();
        const int R = Y_true[0].rows();
        const int C = Y_true[0].cols();

        // sum of squared element-wise differences of a single pair of matrices
        auto quadratic = [](const MatrixXd &a, const MatrixXd &b) 
        {
            MatrixXd diff = a - b;
            return diff.cwiseProduct(diff).sum();
        };

        // accumulate the squared differences over all N pairs
        double acc = std::inner_product(Y_true.begin(), Y_true.end(), Y_pred.begin(), 0.0, std::plus<>(), quadratic);

        // average over the number of pairs and the matrix dimensions
        double result = acc / (N * R * C);

        return result;
    };

    std::vector<MatrixXd> A(4, MatrixXd::Zero(2, 3)); 
    A[0] << 1., 2., 1., -3., 0, 2.;
    A[1] << 5., -1., 3., 1., 0.5, -1.5; 
    A[2] << -2., -2., 1., 1., -1., 1.; 
    A[3] << -2., 0., 1., -1., -1., 3.;

    std::vector<MatrixXd> B(4, MatrixXd::Zero(2, 3)); 
    B[0] << 0.5, 2., 1., 1., 1., 2.; 
    B[1] << 4., -2., 2.5, 0.5, 1.5, -2.; 
    B[2] << -2.5, -2.8, 0., 1.5, -1.2, 1.8; 
    B[3] << -3., 1., -1., -1., -1., 3.5;

    std::cout << "MSE: " << MSE(A, B) << "\n";

    return 0;
}

        It is worth noting that the MSE is always a scalar value, regardless of whether k or Y is multidimensional.

10. Other cost functions

        In addition to MSE, other cost functions often appear in deep learning models. The most common are categorical cross-entropy, log cosh, and cosine similarity.

We'll cover these cost functions in upcoming stories, especially when we introduce classification and non-linear inference.
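
As a teaser, the log cosh cost can be written in the same style as the MSE lambda above (a sketch of my own, not code from this series):

#include <cmath>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <vector>

// log cosh cost: the mean of log(cosh(prediction - expected)) over all pairs
auto log_cosh = [](const std::vector<double> &Y_true, const std::vector<double> &Y_pred) {
    if (Y_true.empty() || Y_true.size() != Y_pred.size())
        throw std::invalid_argument("invalid input sizes.");

    double acc = std::inner_product(Y_true.begin(), Y_true.end(), Y_pred.begin(), 0.0,
                                    std::plus<>(),
                                    [](double y, double t) { return std::log(std::cosh(t - y)); });

    return acc / Y_true.size();
};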

11. Conclusion and next steps

        Cost functions are one of the most important topics in machine learning. In this story, we learned how to code the most commonly used cost function, MSE, and how to use it to fit a one-dimensional problem. We also saw why cost functions are so important for finding function approximations.

        In the next story, we will learn how to train a convolution kernel from data using a cost function. We will introduce the basic algorithm for fitting kernels and discuss the implementation of training mechanisms such as epochs, stopping conditions, and hyperparameters.

Source: blog.csdn.net/gongdiwudu/article/details/132162995