Deep Learning from Scratch in Modern C++: Activation Functions

1. Description

Let's have some fun implementing activation functions in C++. An artificial neural network is an example of a biologically inspired model. In such a network, processing units called neurons are grouped into computational layers and are typically used to perform pattern recognition tasks.

In this model, we usually want the output of each layer to obey certain constraints. For example, we can restrict the output of a neuron to the interval [0, 1], [0, ∞) or [-1, +1]. Another very common scenario is to ensure that the outputs of the neurons in a layer always sum to 1. The way to apply these constraints is to use activation functions.

In this story, we will introduce 5 important activation functions: sigmoid, tanh, ReLU, identity and Softmax.

2. About this series

In this series, we will learn how to code must-know deep learning algorithms such as convolutions, backpropagation, activation functions, optimizers, deep neural networks, and more, using only plain and modern C++.

The story is: Activation functions in C++

Check out other stories:

0 — Basics of Modern C++ Deep Learning Programming

1 — Coding 2D convolution in C++

2 — Cost function using Lambda

3 — Implementing Gradient Descent

...and more coming soon.

3. Sigmoid activation

Historically, the most famous activation is the sigmoid function:

Sigmoid function and first derivative

This diagram shows three important properties of sigmoids:

  • Its output is bounded between 0 and 1;
  • It is smooth or, in better mathematical terms, differentiable;
  • It is S-shaped.

You may be wondering: why does the shape matter? Being S-shaped means that the curve resembles a straight line in the neighborhood of the origin.

This near-linear behavior helps training converge faster when inputs are small. There are two common ways to write the sigmoid formula:
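
$$\sigma(x) = \frac{e^{x}}{1 + e^{x}} \qquad\text{and}\qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$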

These two formulas are equivalent, but when implementing, we prefer to use the latter:

#include <cmath>  // for std::exp

double sigmoid(double x)
{
    return 1. / (1. + std::exp(-x));
}

The reason we prefer the second formula is that the first one is numerically unstable: for large positive x, the exponentials in its numerator and denominator both overflow. We also often short-circuit the computation when implementing sigmoid:

double sigmoid(double x)
{
    double result;
    if (x >= 45.) result = 1.;        // exp(-45) is negligible in double precision
    else if (x <= -45.) result = 0.;  // 1 / (1 + exp(45)) is effectively zero
    else result = 1. / (1. + std::exp(-x));
    return result;
}

This saves processing and avoids calling exp at all when |x| is large, where the result is indistinguishable from 0 or 1 in double precision anyway.
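
As a quick sanity check (a minimal sketch that assumes the clamped sigmoid above is in scope), we can print its values at the extremes and at zero:

#include <cstdio>

// assumes the clamped sigmoid(double) defined above
int main()
{
    std::printf("%f %f %f\n", sigmoid(-50.), sigmoid(0.), sigmoid(50.));
    // prints: 0.000000 0.500000 1.000000
    return 0;
}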

4. Sigmoid derivative

Using the chain rule, we can find the sigmoid derivative as:
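
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$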

For convenience, we group the sigmoid and its first derivative into a functor:

class Sigmoid : public ActivationFunction
{
    public:

        virtual Matrix operator()(const Matrix &z) const
        {
            return z.unaryExpr(std::ref(Sigmoid::helper));
        }

        virtual Matrix jacobian(const Vector &z) const
        {
            Vector output = (*this)(z);

            // dsigmoid/dz expressed in terms of the output y = sigmoid(z): y * (1 - y)
            Vector diagonal = output.unaryExpr([](double y) {
                return (1. - y) * y;
            });

            DiagonalMatrix result = diagonal.asDiagonal();

            return result;
        }

    private:

        // the clamped sigmoid defined earlier, applied coefficient-wise
        static double helper(double z)
        {
            double result;
            if (z >= 45.) result = 1.;
            else if (z <= -45.) result = 0.;
            else result = 1. / (1. + exp(-z));
            return result;
        }

};
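
The functor above relies on an ActivationFunction base class and on the Matrix, Vector and DiagonalMatrix aliases introduced in earlier stories of this series. For reference only, a minimal sketch of that scaffolding, assuming the Eigen-based aliases used throughout the series, could look like this:

#include <Eigen/Dense>

// hypothetical aliases; the real definitions come from the earlier stories
using Matrix = Eigen::MatrixXd;
using Vector = Eigen::VectorXd;
using DiagonalMatrix = Eigen::DiagonalMatrix<double, Eigen::Dynamic>;

// minimal interface: an activation maps a layer output z to a(z) and exposes
// the Jacobian of that mapping, which backpropagation will need later
class ActivationFunction
{
    public:
        virtual ~ActivationFunction() = default;
        virtual Matrix operator()(const Matrix &z) const = 0;
        virtual Matrix jacobian(const Vector &z) const = 0;
};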

We will see how activation function derivatives are used when introducing the backpropagation algorithm.

Sigmoid is mainly used in the output layer of binary classifiers or regression systems, where the result is always non-negative. If the output can be negative, consider using the Tanh activation described below.


5. Tanh activation

As the name suggests, the tanh activation is defined by the hyperbolic tangent function:
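
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$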

Like sigmoid, tanh is S-shaped and differentiable. However, the bounds of tanh are -1 and 1:

Tanh function and first derivative

The tanh and sigmoid activations are closely related:
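
$$\tanh(x) = 2\,\sigma(2x) - 1$$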

Note that, since tanh can output negative values, we cannot use it with cost functions that require non-negative predictions, such as the log-based losses used for classification.

The first derivative of tanh is:
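
$$\tanh'(x) = 1 - \tanh^{2}(x)$$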

We can pack tanh and its derivative into a functor:

class Tanh : public ActivationFunction
{
    public:

        virtual Matrix operator()(const Matrix &z) const
        {
            // apply tanh coefficient-wise; a lambda avoids ambiguity between the tanh overloads
            return z.unaryExpr([](double v) { return std::tanh(v); });
        }

        virtual Matrix jacobian(const Vector &z) const
        {
            Vector output = (*this)(z);

            // dtanh/dz expressed in terms of the output y = tanh(z): 1 - y*y
            Vector diagonal = output.unaryExpr([](double y) {
                return (1. - y * y);
            });

            DiagonalMatrix result = diagonal.asDiagonal();

            return result;
        }
};

6. ReLU

One problem with sigmoid and tanh is that they are computationally expensive, which makes training slower. ReLU is a much simpler activation:
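
$$\mathrm{ReLU}(x) = \max(0, x)$$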

ReLU activation and first derivative

Since ReLU is a simple comparison, its computational cost is very low compared to other functions.

We can implement ReLU as follows:

class ReLU : public ActivationFunction
{

    public:

        virtual Matrix operator()(const Matrix &z) const
        {
            return z.unaryExpr([](double v) {
                return std::max(0., v);
            });
        }

        virtual Matrix jacobian(const Vector &z) const
        {

            Vector output = (*this)(z);
            // dReLU/dz from the output y = ReLU(z): 1 if y > 0, otherwise 0
            // (this applies the convention that the derivative is 0 at z = 0)
            Vector diagonal = output.unaryExpr([](double y) {
                double result = 0.;
                if (y > 0) result = 1.;
                return result;
            });

            DiagonalMatrix result = diagonal.asDiagonal();

            return result;
        }

};

The relevant points are:

  • It is bounded below for negative values but unbounded for positive values of x: its range is [0, ∞);
  • It is not differentiable at x = 0. In practice, we relax this condition by assuming that the derivative dReLU(x)/dx is 0 when x = 0.

Since ReLU boils down to a single comparison, it is a very fast operation. Its first derivative can also be computed quickly:
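
$$\frac{d\,\mathrm{ReLU}(x)}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$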

Despite its advantages, ReLU has three main disadvantages:

  • Since it is not bounded above, we cannot use it to constrain the output to [0, 1]. Because of this, in practice ReLU usually appears only in inner (hidden) layers.
  • Since ReLU is 0 for any x < 0, sometimes our model simply "dies" during training, as some or all neurons get stuck outputting only 0.
  • Since the derivative of ReLU is discontinuous at x = 0, training may become unstable for some inputs.

There are alternatives that address these problems (see Softplus, Leaky ReLU, ELU and GELU). Nevertheless, because of its considerable benefits, ReLU is still widely used in real-world models.

7. Identity activation

The definition of the identity activation is simple:
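
$$f(x) = x$$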

Its derivative is:
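
$$f'(x) = 1$$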

Using the identity activation means that the output of the neuron is not modified in any way. In this case, the implementation is very simple:

class Identity : public ActivationFunction
{
    public:
        virtual Matrix operator()(const Matrix &z) const { return z; }

        virtual Matrix jacobian(const Vector &z) const
        {

            // the Jacobian of the identity map is the identity matrix
            Vector diagonal = Vector::Ones(z.rows());

            DiagonalMatrix result = diagonal.asDiagonal();

            return result;
        }
};

Identity function and first derivative

8. Softmax

Suppose we have a photo of a pet and we need to determine what kind of animal it is: a dog? A cat? A hamster? A bird? A guinea pig? In machine learning, we usually model such problems as classification problems and refer to the model as a classifier.

Softmax is very suitable as the output of a classifier because it actually represents a discrete probability distribution. For example, consider the following:

Classifiers for cats, dogs, and birds

In the previous example, the network was pretty sure the pet in the image was a cat. In the next example, the model scores an image as a dog:

In deep learning models, we use Softmax to represent this type of output.

This stunning pet photo was taken by Amber Janssens

8.1 Defining Softmax

The original formula of Softmax is:
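
$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}$$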

This formula means that, if we have k neurons, the output of the i-th neuron is given by the exponential of x_i divided by the sum of the exponentials of every neuron's x_j.

The first implementation of Softmax can be:

const auto buggy_softmax = [](const Vector &z) {

    Vector expo = z.array().exp();                  // e^z, coefficient-wise
    Vector sums = expo.colwise().sum();             // sum of the exponentials
    Vector result = expo.array().rowwise() / sums.transpose().array();
    return result;

};

We'll see shortly that this implementation is seriously flawed. But what this code does illustrate is the most important aspect of softmax: the output of each neuron depends on every input.

We can run the following code:

Vector input1 = Vector::Zero(3);
input1 << 0.1, 1., -2.;

std::cout << "Input 1:\n" << input1.transpose() << "\n\n";
std::cout << "results in:\n" << buggy_softmax(input1).transpose() << "\n\n";

to output:
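
Input 1:
0.1   1  -2

results in:
0.2792 0.6866 0.0342

(approximate values; note that the three outputs sum to 1)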

The two most important aspects of Softmax are:

  • The outputs of all neurons always sum to 1;
  • Each neuron's value lies in the interval [0, 1].

8.2 Implementation of Softmax

The problem with our previous softmax implementation is that the exponential function grows very fast. For example, e¹⁰ is about 22,026, but e¹⁰⁰ is 2.688117142×10⁴³, a dauntingly large number. It turns out that our implementation fails even when we use modest numbers as input:

Vector input2 = Vector::Zero(4);
input2 << 100, 1000., -500., 200.;

std::cout << "Input 2:\n" << input2.transpose() << "\n\n";
std::cout << "results in:\n" << buggy_softmax(input2).transpose() << "\n\n";
std::cout << "using the buggy implementation.\n";

This happens because C++ floating point has a fixed-size representation. With a regular 64-bit processor, any call to cmath's exp(x) passing 750 or more results in an inf value.

Fortunately, we can fix it with the following trick:
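
$$\mathrm{softmax}(x_i) = \frac{e^{x_i - m}}{\sum_{j=1}^{k} e^{x_j - m}}$$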

where m is the maximum input:
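
$$m = \max_{j} x_j$$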

Now, by fixing the code, we get:

const auto good_softmax = [](const Vector &z) {

    Vector maxs = z.colwise().maxCoeff();              // m, the maximum input
    Vector reduc = z.rowwise() - maxs.transpose();     // z - m
    Vector expo = reduc.array().exp();                 // e^(z - m), which no longer overflows
    Vector sums = expo.colwise().sum();
    Vector result = expo.array().rowwise() / sums.transpose().array();
    return result;

};

Overflow is a source of numerical instability.

Numerical stability is a very common problem when we develop real-world deep learning systems.

8.3 Softmax Derivatives

There is a very clear difference between Softmax and the other activations. Typically, activations like sigmoid or ReLU are coefficient-wise operations, i.e., the value of one coefficient does not affect the other coefficients. In Softmax this is not true, since all values need to sum to 1. This dependency makes the calculation of the Softmax derivative a bit tricky. Still, after a little calculus and using our old friend the chain rule, we can work it out:
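
$$\frac{\partial\,\mathrm{softmax}(x_i)}{\partial x_j} = \mathrm{softmax}(x_i)\left(\delta_{ij} - \mathrm{softmax}(x_j)\right)$$

where $\delta_{ij}$ (the Kronecker delta) is 1 when i = j and 0 otherwise.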

Let me know if you would like to read the full derivation of this result.

For example, if we have 5 neurons, the derivative of each neuron with respect to each neuron in the same layer is given by:
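
Writing s_i as a shorthand for softmax(x_i), this 5×5 Jacobian is:

$$J = \begin{bmatrix} s_1(1-s_1) & -s_1 s_2 & \cdots & -s_1 s_5 \\ -s_2 s_1 & s_2(1-s_2) & \cdots & -s_2 s_5 \\ \vdots & \vdots & \ddots & \vdots \\ -s_5 s_1 & -s_5 s_2 & \cdots & s_5(1-s_5) \end{bmatrix} = \mathrm{diag}(s) - s\,s^{T}$$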

This derivative will be applied in the next story, when we train the first classifier.

9. Packaging Softmax for further use

Finally, we can implement the Softmax functor as follows:

class Softmax : public ActivationFunction
{
    public:

        virtual Matrix operator()(const Matrix &z) const
        {

            if (z.rows() == 1)
            {
                throw std::invalid_argument("Softmax is not suitable for single value outputs. Use sigmoid/tanh instead.");
            }
            Vector maxs = z.colwise().maxCoeff();
            Matrix reduc = z.rowwise() - maxs.transpose();
            Matrix expo = reduc.array().exp();
            Vector sums = expo.colwise().sum();
            Matrix result = expo.array().rowwise() / sums.transpose().array();
            return result;
        }

        virtual Matrix jacobian(const Vector &z) const
        {
            // s = softmax(z), kept as a vector so asDiagonal() yields diag(s) below
            Vector output = (*this)(z);

            Matrix outputAsDiagonal = output.asDiagonal();

            // Jacobian of softmax: diag(s) - s * s^T
            Matrix result = outputAsDiagonal - (output * output.transpose());

            return result;
        }

};
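
As a quick usage example (a small sketch assuming the aliases discussed earlier and reusing the input from section 8.1), we can check that the functor produces a valid probability distribution:

#include <iostream>

// assumes Softmax and the Vector/Matrix aliases are in scope
int main()
{
    Softmax softmax;

    Vector input = Vector::Zero(3);
    input << 0.1, 1., -2.;

    Matrix output = softmax(input);

    std::cout << output.transpose() << "\n";   // every value lies in [0, 1]
    std::cout << output.sum() << "\n";         // the values sum to 1
    return 0;
}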

Almost every classifier these days uses Softmax in the output layer. We will introduce some real examples of softmax in the next story.

10. Other activation functions

There are several other activation functions. Besides the ones described here, we can also list Softplus, Softsign, SELU, ELU, GELU, exponential, Swish, etc. Generally, they are variants of sigmoid or ReLU.

11. Conclusion and next steps

Activation functions are among the most important building blocks of machine learning models. In this story, we covered some of the most relevant ones: sigmoid, tanh, ReLU, identity and Softmax.

In the next story, we will dive into the implementation of the most important deep learning algorithm: backpropagation. From scratch, in C++ and Eigen.


Origin blog.csdn.net/gongdiwudu/article/details/131966942