Deep Learning from Scratch in Modern C++: Cost Functions
1. Description
In machine learning, we typically model problems as functions. Therefore, much of our work consists of finding ways to approximate functions using known models. In this context, the cost function plays a central role.
This story is a continuation of our previous discussion on convolutions. Today, we'll introduce the concept of cost functions, show common examples, and learn how to code and plot them. As always, in pure C++ and Eigen, from scratch.
2. About this series
In this series, we will learn how to code must-know deep learning algorithms such as convolutions, backpropagation, activation functions, optimizers, deep neural networks, and more, using only plain and modern C++.
This story is: Cost Function in C++
Deep Learning from Scratch in Modern C++ [3/8]: Activation Functions
Deep Learning from Scratch in Modern C++: [4/8] Gradient Descent
Deep Learning from Scratch in Modern C++: [5/8] Convolution
...and more coming soon.
3. Modeling in machine learning
As AI engineers, we usually define each task or problem as a function.
For example, if we are developing a face recognition system, our first step is to define the problem as a function that maps input images to identifiers, such as F(image) = identifier.
For a medical diagnosis system, we can define a function that maps symptoms to diagnoses, such as F(symptoms) = diagnosis.
We can even write a model that produces an image from a sequence of words, such as F(words) = image.
It's an endless list. Using functions to represent tasks or problems is a simplified way to implement machine learning systems.
The question is often: how do we find the formula for F(X)?
4. Approximate functions
In fact, it is not feasible to define F(X) using a formula or sequence of rules (I will explain why someday).
In general, instead of finding or defining the correct function F(X), we try to find an approximation of F(X). Let's call this approximation the hypothesis function, or simply H(X).
At first glance, this doesn't make sense: if we need to find an approximate function H(X), why don't we try to find F(X) directly?
The answer is: we know H(X). Although we know very little about F(X), we know almost everything about H(X): its formula, its parameters, and so on. The only things we don't know about H(X) are its parameter values.
In fact, the main concern of machine learning is finding ways to determine appropriate parameter values for a given problem and data. Let's see how we can implement it.
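To make "known formula, unknown parameter values" concrete, here is a minimal sketch. It assumes the simplest possible family, H(X) = kX, with a single parameter k; the names `LinearHypothesis` and `predict` are hypothetical and chosen only for illustration.

```cpp
#include <vector>

// Hypothetical name: a one-parameter family of hypothesis functions H(X) = k * X.
// The formula is fixed and known; only the value of k has to be found from data.
struct LinearHypothesis {
    double k;  // the single unknown parameter

    double operator()(double x) const { return k * x; }
};

// Apply a hypothesis to every input, producing the guessed outputs T.
std::vector<double> predict(const LinearHypothesis &h, const std::vector<double> &xs) {
    std::vector<double> ts;
    ts.reserve(xs.size());
    for (double x : xs) ts.push_back(h(x));
    return ts;
}
```

Choosing a different k gives a different member of the same family, for example `LinearHypothesis{2.0}` or `LinearHypothesis{0.5}`. Learning then amounts to deciding which member fits the data best.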
In machine learning terminology, H(X) is known as an "approximation of F(X)". The existence of H(X) is covered by the universal approximation theorem.
5. Cost Function and Universal Approximation Theorem
Consider a situation where we know the value of the input X and the corresponding output Y = F(X), but we don't know the formula of F(X). For example, we know that if the input is X = 1.0, then F(1.0) results in Y = 2.0.
Figure: Mapping of X and F(X)
Now, consider that we have a known function H(X) and we want to know whether H(X) is a good approximation of F(X). Therefore, we calculate T = H(1.0) and find T = 1.9.
How bad is this value T = 1.9, given that we know the real value is Y = 2.0 when X = 1.0?
The metric used to quantify the cost of the difference between Y and T is called the cost function.
Note that Y is the expected value and T is the actual value obtained from our guess H(X).
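As a small illustration of scoring a single guess, the sketch below shows two common per-sample costs. The function names `squared_error` and `absolute_error` are assumptions made for this example, not names used elsewhere in the series.

```cpp
#include <cmath>

// Two common ways to score a single guess T against the expected value Y.
// Both are zero for a perfect guess and grow as the guess gets worse.
double squared_error(double y, double t) {
    const double diff = y - t;
    return diff * diff;
}

double absolute_error(double y, double t) {
    return std::abs(y - t);
}
```

For Y = 2.0 and T = 1.9, these give roughly 0.01 and 0.1 respectively. The squared version penalizes large misses much more heavily than small ones.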
The concept of a cost function is at the heart of machine learning. Let's take the most common cost function as an example.
6. Mean square error
The most famous cost function is the mean squared error (MSE):
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - T_i)^2$$
where T_i is given by the convolution of X_i with a kernel k:
$$T_i = X_i * k$$
We discussed convolution in a previous story.
Note that we have n pairs (Y_i, T_i), each of which combines an expected value Y_i with the actual value T_i. The MSE is the average of the squared differences over all n pairs.
We can write the first version of MSE as follows:
#include <functional>
#include <numeric>
#include <stdexcept>
#include <vector>

auto MSE = [](const std::vector<double> &Y_true, const std::vector<double> &Y_pred) {
    if (Y_true.empty()) throw std::invalid_argument("Y_true cannot be empty.");
    if (Y_true.size() != Y_pred.size()) throw std::invalid_argument("Y_true and Y_pred sizes do not match.");
    // squared difference of one (expected, actual) pair
    auto quadratic = [](const double a, const double b) {
        double result = a - b;
        return result * result;
    };
    const auto N = static_cast<double>(Y_true.size());
    // sum of quadratic(Y_true[i], Y_pred[i]) over all pairs
    double acc = std::inner_product(Y_true.begin(), Y_true.end(), Y_pred.begin(), 0.0, std::plus<>(), quadratic);
    double result = acc / N;
    return result;
};
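A small worked example may help to check the arithmetic by hand. The sketch below uses a plain-loop variant of the same computation (the name `mse` and the three data points are hypothetical, chosen so the result is easy to verify).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Free-function variant of the MSE computation (the name `mse` is an assumption).
double mse(const std::vector<double> &y_true, const std::vector<double> &y_pred) {
    if (y_true.empty()) throw std::invalid_argument("y_true cannot be empty.");
    if (y_true.size() != y_pred.size()) throw std::invalid_argument("sizes do not match.");
    double acc = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        const double diff = y_true[i] - y_pred[i];
        acc += diff * diff;
    }
    return acc / static_cast<double>(y_true.size());
}

// Hypothetical data: Y = {1, 2, 3}, T = {1.5, 2, 2}
// squared differences: 0.25, 0, 1  ->  sum 1.25  ->  MSE = 1.25 / 3
double worked_example() {
    return mse({1.0, 2.0, 3.0}, {1.5, 2.0, 2.0});
}
```

Here `worked_example()` evaluates to 1.25 / 3, about 0.4167. Note that the single large miss (3 versus 2) contributes four times as much as the smaller one (1 versus 1.5).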
Now that we know how to calculate MSE, let's see how to use it to approximate functions.
7. Use MSE to find the intuition of the best parameters
Suppose we have a map F(X) synthetically generated:
F(X) = 2*X + N(0, 0.1)
where N(0, 0.1) represents random values drawn from a normal distribution with mean = 0 and standard deviation = 0.1. We can generate sample data with:
#include <algorithm>
#include <ctime>
#include <random>
#include <utility>
#include <vector>

std::default_random_engine dre(std::time(nullptr));
std::normal_distribution<double> gaussian_dist(0., 0.1);
std::uniform_real_distribution<double> uniform_dist(0., 1.);
std::vector<std::pair<double, double>> sample(90);
// capture by reference so the lambda also sees the engine `dre`
std::generate(sample.begin(), sample.end(), [&]() {
    double x = uniform_dist(dre);
    double noise = gaussian_dist(dre);
    double y = 2. * x + noise;
    return std::make_pair(x, y);
});
If we plot this example using any spreadsheet software, we get something like this:
Note that we know the formulas for both the noise-free generator function G(X) = 2X and for F(X), which is G(X) plus noise. In real life, however, these generator functions are an untold secret of the underlying phenomenon. Here, in our example, we only know about them because we are generating synthetic data to help us understand better.
In real life, all we have is an assumption that a hypothesis function H(X), defined by H(X) = kX, might be a good approximation of F(X). Of course, we don't know the value of k yet.
Let's see how to use MSE to find an appropriate value of k. In fact, it's as simple as computing the MSE for a range of different values of k:
// expected outputs: the second element of each (x, y) pair
std::vector<double> ys(sample.size());
std::transform(sample.begin(), sample.end(), ys.begin(), [](const auto &pair) {
    return pair.second;
});
std::vector<std::pair<double, double>> measures;
double smallest_mse = 1'000'000'000.;
double best_k = -1;
double step = 0.1;
for (double k = 0.; k < 4.1; k += step) {
    // actual outputs of H(X) = kX for the current k
    std::vector<double> ts(sample.size());
    std::transform(sample.begin(), sample.end(), ts.begin(), [k](const auto &pair) {
        return pair.first * k;
    });
    double mse = MSE(ys, ts);
    if (mse < smallest_mse) {
        smallest_mse = mse;
        best_k = k;
    }
    measures.push_back(std::make_pair(k, mse));
}
std::cout << "best k was " << best_k << " for a MSE of " << smallest_mse << "\n";
Typically, this program outputs something like this:
best k was 2.1 for a MSE of 0.00828671
If we plot MSE(k) against k, we can see a very interesting fact:
Figure: MSE(k) for k from 0 to 4 in steps of 0.1
Note that the value of MSE(k) is smallest around k = 2. In fact, 2 is the parameter of the generator function G(X) = 2X.
Given the data and using a step size of 0.1, the smallest value of MSE(k) is found at k = 2.1. This shows that H(X) = 2.1X is a good approximation of F(X). In fact, if we plot F(X), G(X) and H(X), we have:
From the graph above, we can see that H(X) actually approximates F(X). However, we can try smaller step sizes, such as 0.01 or 0.001, to find a better approximation.
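One way to try smaller steps without scanning the whole range again is to search coarsely first and then rescan only the neighborhood of the best coarse value. The sketch below illustrates the idea; `mse_for_k` and `scan_k` are hypothetical helper names, and the code assumes non-empty inputs of equal size.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// MSE of the hypothesis H(X) = k * X over paired data (xs[i], ys[i]).
// Assumes xs and ys are non-empty and have the same size.
double mse_for_k(double k, const std::vector<double> &xs, const std::vector<double> &ys) {
    double acc = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        const double diff = ys[i] - k * xs[i];
        acc += diff * diff;
    }
    return acc / static_cast<double>(xs.size());
}

// Scan k over [lo, hi] with the given step and return {best_k, smallest_mse}.
// k is computed as lo + i * step so rounding errors do not accumulate.
std::pair<double, double> scan_k(const std::vector<double> &xs, const std::vector<double> &ys,
                                 double lo, double hi, double step) {
    const auto steps = static_cast<std::size_t>(std::floor((hi - lo) / step + 0.5));
    double best_k = lo;
    double smallest_mse = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i <= steps; ++i) {
        const double k = lo + static_cast<double>(i) * step;
        const double mse = mse_for_k(k, xs, ys);
        if (mse < smallest_mse) {
            smallest_mse = mse;
            best_k = k;
        }
    }
    return {best_k, smallest_mse};
}
```

A coarse pass such as `scan_k(xs, ys, 0., 4., 0.1)` followed by a fine pass over `[best_k - 0.1, best_k + 0.1]` with step 0.001 evaluates a few hundred candidates instead of the roughly four thousand needed to cover the whole range at the fine resolution.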
Code can be found in this repository
8. Cost surface
The curve of MSE(k) versus k is a one-dimensional example of a cost surface.
What the previous example shows is that we can use the minimum value of the cost surface to find the best-fit value for the parameter k.
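Why this particular curve has a single lowest point can be sketched with a little algebra. The derivation below assumes the hypothesis H(X) = kX used in this story, so that T_i = kX_i, and at least one nonzero X_i; it is specific to this simple model and does not carry over to arbitrary hypotheses.

```latex
\begin{aligned}
\mathrm{MSE}(k) &= \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - kX_i\right)^2 \\
  &= \left(\frac{1}{n}\sum_{i=1}^{n}X_i^2\right)k^2
     - 2\left(\frac{1}{n}\sum_{i=1}^{n}X_iY_i\right)k
     + \frac{1}{n}\sum_{i=1}^{n}Y_i^2 \\
\frac{d\,\mathrm{MSE}}{dk} &= \frac{2}{n}\left(k\sum_{i=1}^{n}X_i^2 - \sum_{i=1}^{n}X_iY_i\right) = 0
  \quad\Longrightarrow\quad
  k^{*} = \frac{\sum_{i=1}^{n}X_iY_i}{\sum_{i=1}^{n}X_i^2}
\end{aligned}
```

The second line is a quadratic in k with a positive leading coefficient, that is, an upward-opening parabola, which is why the plotted curve has exactly one minimum.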
This example describes the most important paradigm in machine learning: function approximation via cost function minimization.
The figure above shows a one-dimensional cost surface, that is, a cost curve for a given one-dimensional k. In two dimensions, i.e. when we have two parameters, namely k1 and k2, the cost surface looks more like an actual surface:
Whether k is 1D, 2D or higher dimensional, the process of finding the best value of k is the same: find the minimum of the cost surface.
The minimum cost value is also known as the global minimum.
In 1D space, the process of finding the global minimum is relatively easy. However, in high dimensions, scanning all the space to find the minimum can be computationally expensive. In the next story, we'll introduce algorithms to perform this search at scale.
The parameter k is not the only thing that can be high-dimensional. In practical problems, the output is usually also high-dimensional. Let's learn how to calculate MSE in this case.
9. MSE on high-dimensional output
In real world problems, Y and T are vectors or matrices. Let's see how to handle such data.
If the output is one-dimensional, the previous formulation of MSE will work. But if the output is multidimensional, we need to change the formula a bit.
For example, suppose Y_n and T_n are not scalar values but matrices of size (2,3). Before applying MSE to this data, we need to change the formula as follows:
$$MSE = \frac{1}{N \cdot R \cdot C} \sum_{n=1}^{N} \sum_{r=1}^{R} \sum_{c=1}^{C} (Y_{n,r,c} - T_{n,r,c})^2$$
In this formula, N is the number of pairs, R is the number of rows, and C is the number of columns in each pair. As usual, we can implement this version of MSE using lambdas:
#include <functional>
#include <iostream>
#include <numeric>
#include <stdexcept>
#include <vector>
#include <Eigen/Core>
using Eigen::MatrixXd;
int main()
{
    auto MSE = [](const std::vector<MatrixXd> &Y_true, const std::vector<MatrixXd> &Y_pred)
    {
        if (Y_true.empty()) throw std::invalid_argument("Y_true cannot be empty.");
        if (Y_true.size() != Y_pred.size()) throw std::invalid_argument("Y_true and Y_pred sizes do not match.");
        // all matrices are assumed to share the shape of the first one
        const double N = static_cast<double>(Y_true.size());
        const double R = static_cast<double>(Y_true[0].rows());
        const double C = static_cast<double>(Y_true[0].cols());
        // sum of squared differences between two matrices
        auto quadratic = [](const MatrixXd &a, const MatrixXd &b)
        {
            MatrixXd result = a - b;
            return result.cwiseProduct(result).sum();
        };
        double acc = std::inner_product(Y_true.begin(), Y_true.end(), Y_pred.begin(), 0.0, std::plus<>(), quadratic);
        double result = acc / (N * R * C);
        return result;
    };
    // four expected (2,3) matrices
    std::vector<MatrixXd> A(4, MatrixXd::Zero(2, 3));
    A[0] << 1., 2., 1., -3., 0., 2.;
    A[1] << 5., -1., 3., 1., 0.5, -1.5;
    A[2] << -2., -2., 1., 1., -1., 1.;
    A[3] << -2., 0., 1., -1., -1., 3.;
    // four actual (2,3) matrices
    std::vector<MatrixXd> B(4, MatrixXd::Zero(2, 3));
    B[0] << 0.5, 2., 1., 1., 1., 2.;
    B[1] << 4., -2., 2.5, 0.5, 1.5, -2.;
    B[2] << -2.5, -2.8, 0., 1.5, -1.2, 1.8;
    B[3] << -3., 1., -1., -1., -1., 3.5;
    std::cout << "MSE: " << MSE(A, B) << "\n";
    return 0;
}
It is worth noting that MSE is always a scalar value regardless of whether k or Y is multidimensional or not.
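As a cross-check that does not depend on Eigen, the same number can be reproduced with plain vectors. The sketch below stores each matrix as a flat list of its entries; `Flat`, `mse_flat`, and `example_from_text` are hypothetical names, and the data is copied from the matrices A and B in the program above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// One matrix stored as a flat list of its entries, row by row.
using Flat = std::vector<double>;

// Mean squared error over every entry of every pair of matrices.
double mse_flat(const std::vector<Flat> &y_true, const std::vector<Flat> &y_pred) {
    if (y_true.empty()) throw std::invalid_argument("y_true cannot be empty.");
    if (y_true.size() != y_pred.size()) throw std::invalid_argument("sizes do not match.");
    double acc = 0.0;
    std::size_t count = 0;  // ends up equal to N * R * C
    for (std::size_t n = 0; n < y_true.size(); ++n) {
        if (y_true[n].size() != y_pred[n].size()) throw std::invalid_argument("shapes do not match.");
        for (std::size_t i = 0; i < y_true[n].size(); ++i) {
            const double diff = y_true[n][i] - y_pred[n][i];
            acc += diff * diff;
        }
        count += y_true[n].size();
    }
    if (count == 0) throw std::invalid_argument("matrices cannot be empty.");
    return acc / static_cast<double>(count);
}

// The four (2,3) matrices A and B from the program above, flattened.
double example_from_text() {
    const std::vector<Flat> A = {{1., 2., 1., -3., 0., 2.},
                                 {5., -1., 3., 1., 0.5, -1.5},
                                 {-2., -2., 1., 1., -1., 1.},
                                 {-2., 0., 1., -1., -1., 3.}};
    const std::vector<Flat> B = {{0.5, 2., 1., 1., 1., 2.},
                                 {4., -2., 2.5, 0.5, 1.5, -2.},
                                 {-2.5, -2.8, 0., 1.5, -1.2, 1.8},
                                 {-3., 1., -1., -1., -1., 3.5}};
    return mse_flat(A, B);
}
```

Summing the squared differences by hand gives 17.25 + 3.75 + 2.82 + 6.25 = 30.07 for the four pairs, so the result should be 30.07 / 24, about 1.2529: a single scalar, even though every Y and T is a matrix.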
10. Other cost functions
In addition to MSE, other cost functions often appear in deep learning models. The most common are categorical cross-entropy, log cosh, and cosine similarity.
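For orientation only, the following sketch shows the textbook definitions of these three functions for plain vectors. The function names, the `eps` clamp, and the single-sample form of cross-entropy are assumptions of this sketch; the versions developed later in the series may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// All three functions assume non-empty vectors of equal size.

// Log-cosh: mean of log(cosh(y - t)). Close to half the squared error for
// small differences and close to the absolute error for large ones.
double log_cosh(const std::vector<double> &y_true, const std::vector<double> &y_pred) {
    double acc = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i)
        acc += std::log(std::cosh(y_true[i] - y_pred[i]));
    return acc / static_cast<double>(y_true.size());
}

// Categorical cross-entropy for one sample: -sum(y_i * log(t_i)), where y_true is
// a one-hot vector and y_pred holds predicted probabilities.
// `eps` (an assumption of this sketch) keeps log() away from zero.
double categorical_cross_entropy(const std::vector<double> &y_true,
                                 const std::vector<double> &y_pred, double eps = 1e-12) {
    double acc = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i)
        acc -= y_true[i] * std::log(std::max(y_pred[i], eps));
    return acc;
}

// Cosine similarity: dot(y, t) / (|y| * |t|); assumes neither vector is all zeros.
double cosine_similarity(const std::vector<double> &y_true, const std::vector<double> &y_pred) {
    double dot = 0.0, norm_y = 0.0, norm_t = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        dot += y_true[i] * y_pred[i];
        norm_y += y_true[i] * y_true[i];
        norm_t += y_pred[i] * y_pred[i];
    }
    return dot / (std::sqrt(norm_y) * std::sqrt(norm_t));
}
```

Like MSE, each of these reduces a pair of vectors to a single scalar. Note that cosine similarity is largest for a perfect match, so when it is used as a cost it is typically negated.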
We'll cover these functions in upcoming stories, especially when we introduce classification and nonlinear inference.
11. Conclusion and next steps
Cost functions are one of the most important topics in machine learning. In this story, we learned how to code the most commonly used cost function, MSE, and how to use it to fit a one-dimensional problem. We also saw why cost functions are so important for finding function approximations.
In the next story, we will learn how to train a convolution kernel from data using a cost function. We will introduce the basic algorithm for fitting kernels and discuss the implementation of training mechanisms such as epochs, stopping conditions, and hyperparameters.