Neural Network Architecture Design Frequently Asked Questions and Answers

If you are new to Artificial Neural Networks (ANNs), you probably have questions such as: How many hidden layers should I use? How many neurons should each hidden layer have? What are hidden layers and neurons actually for? Does increasing the number of hidden layers or neurons always lead to better results? Which loss function should I use? How many epochs should I train for? Which weight initialization method should I use?

Answering these questions is the basis of designing the architecture of a neural-network-based project, so it is important to choose these parameters wisely.

The good news is that these questions can be answered. To be fair, the answers get harder to pin down as the problem to be solved gets more complex, but after reading this article you should at least know how to approach them and how to test the answers yourself on simple examples.



1. Basic neural network structure

Taking its cue from nature, a neural network is our usual loose model of the brain: neurons interconnected with other neurons to form a network. Even a simple instruction like "move your hand and pick up this pencil" passes through many neurons before it becomes an actual action.

The operation of a complete neural network is simple: it takes variables as input (e.g., an image if the network is supposed to say what is in the image) and, after some computation, returns an output (continuing that example, given an image of a cat it should return the label "cat").

2. Input neurons

This is the number of features the neural network uses to make predictions.

Each feature of the input vector requires one input neuron. For tabular data, this is the number of relevant features in the dataset. You need to choose these features carefully and remove any that might contain patterns that fail to generalize beyond the training set (and lead to overfitting). For images, this is the number of pixels (28×28 = 784 for MNIST).

3. Output neurons

This is the number of predictions you want to make.

Regression: For regression tasks, this can be a single value (e.g. house price). For multivariate regression, there is one neuron per predicted value (for example, for bounding boxes this could be 4 neurons: one each for the bounding box height, width, x-coordinate, and y-coordinate).

Classification: For binary classification (spam vs. not spam), we use a single output neuron whose output represents the probability of the positive class. For multi-class classification (e.g. in object detection, an instance can be classified as a car, a dog, a house, etc.), we have one output neuron per class and use a softmax activation function at the output layer to ensure that the final probabilities sum to 1.
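For concreteness, here is a minimal sketch of how the input and output neuron counts map to code. It assumes tf.keras, a flattened 784-feature input (e.g. MNIST) and 10 classes; the hidden layer is a placeholder discussed in the next section.

```python
import tensorflow as tf

# One input neuron per feature: 28*28 = 784 flattened pixels for MNIST.
# One output neuron per class, with softmax so the predicted probabilities sum to 1.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                     # 784 input neurons
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer (see next section)
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 output neurons, one per class
])
model.summary()
```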

4. Hidden layers and neurons

The number of hidden layers depends a lot on the problem and the architecture of the neural network. You're essentially trying to find the sweet spot for your architecture: not too big, not too small, just right.

In general, 1-5 hidden layers will work well for most problems. When working with image or speech data, you will want your network to have dozens or hundreds of layers, not all of which are fully connected. For these use cases there are pre-trained models (YOLO, ResNet, VGG) that allow you to reuse large parts of their networks and train your model on top of them to learn only the higher-level features. In that case, your model still only has a few layers to train.

In general, it is sufficient to use the same number of neurons in all hidden layers. For some datasets, having a large first layer followed by smaller layers leads to better performance, because the first layer can learn many low-level features that feed into the higher-level features of subsequent layers.

In general, adding more layers yields more performance gains than adding more neurons in each layer.

I recommend starting with 1-5 layers and 1-100 neurons, and slowly adding more layers and neurons until you start to overfit. You can track loss and accuracy in your Weights & Biases dashboard to see which combination of hidden layers and hidden neurons gives the best loss.

The thing to keep in mind when choosing a small number of layers/neurons is that if the number is too small, your network will not be able to learn the underlying patterns in the data and will be useless. One way to handle this is to start with a large number of hidden layers and hidden neurons, then use dropout and early stopping to let the neural network shrink itself for you. Again, I recommend trying a few combinations and tracking the performance in your Weights & Biases dashboard to determine the best network size for your problem.

The well-known researcher Andrej Karpathy also recommends the overfit-then-regularize approach: "first get a model large enough that it can overfit (i.e. focus on training loss), then regularize it properly (give up some training loss to improve the validation loss)."
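As a rough sketch of the "start small, grow until you overfit" advice (tf.keras is assumed; build_mlp and the 2x64 default are illustrative, not a recommendation for your data):

```python
import tensorflow as tf

def build_mlp(n_features, n_classes, n_hidden_layers=2, n_neurons=64):
    """Build a plain MLP; grow n_hidden_layers / n_neurons until validation loss stops improving."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(n_features,))])
    for _ in range(n_hidden_layers):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model

# Try a few combinations, e.g. (1, 32), (2, 64), (3, 128), and track the
# validation loss of each run to find the smallest network that still fits well.
model = build_mlp(n_features=784, n_classes=10)
```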

5. Loss function

The loss function is used to measure the error between the predicted output and the provided target value. The loss function tells us how far the algorithm model is from achieving the desired result. The word "loss" refers to the penalty a model receives for failing to produce the expected result.


  • Regression

Mean squared error (MSE) is the most common loss function to optimize, unless there are a lot of outliers, in which case mean absolute error (MAE) or the Huber loss is used.

  • Classification

In most cases, cross-entropy will serve you well.

You can learn more about loss functions in neural networks from this post.
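For reference, the standard tf.keras loss objects corresponding to these recommendations look like this (a sketch; the delta value is just an example):

```python
import tensorflow as tf

# Regression: MSE by default, MAE or Huber when the targets contain many outliers.
mse   = tf.keras.losses.MeanSquaredError()
mae   = tf.keras.losses.MeanAbsoluteError()
huber = tf.keras.losses.Huber(delta=1.0)  # delta sets where the loss switches from quadratic to linear

# Classification: cross-entropy in its binary and multi-class forms.
bce  = tf.keras.losses.BinaryCrossentropy()              # binary, single sigmoid output
cce  = tf.keras.losses.CategoricalCrossentropy()         # multi-class, one-hot labels
scce = tf.keras.losses.SparseCategoricalCrossentropy()   # multi-class, integer labels

# Any of these can be passed to model.compile(loss=...).
```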

6. Batch size

Batch size refers to the number of training examples used in one iteration.

Large batches can be great because they can take advantage of the power of the GPU to process more training instances at a time. OpenAI has found that large batch sizes (tens of thousands for image classification and language modeling, millions for reinforcement learning agents) are well suited for scaling and parallelization.

However, there are also cases where smaller batches are suitable. According to this paper by Masters and Luschi, the advantage gained from the parallelism of large batches is offset by the better generalization performance and smaller memory footprint of small batches. They show that increasing the batch size reduces the acceptable range of learning rates that provide stable convergence. Their conclusion is that smaller is, in fact, better, and that the best performance was obtained with mini-batch sizes between 2 and 32.

If you're not operating at a very large scale, I suggest starting with a smaller batch size, then slowly increasing it while monitoring performance in your Weights & Biases dashboard to determine what works best.
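Batch size is just an argument to model.fit, so a simple sweep is easy to run. Below is a hypothetical sketch using MNIST; the candidate sizes and the 5-epoch budget are placeholders to keep the example fast.

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Train a fresh model for each batch size and compare the best validation loss.
for batch_size in [16, 32, 64, 128]:
    history = build_model().fit(x_train, y_train, validation_split=0.2,
                                epochs=5, batch_size=batch_size, verbose=0)
    print(batch_size, min(history.history["val_loss"]))
```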

7. Number of epochs

I suggest starting with a large number of epochs and using early stopping to halt training once performance stops improving (see the early stopping section below).

8. Learning rate

Choosing the learning rate is very important, and you want to make sure you get it right! Ideally, you should retune the learning rate whenever you tune the other hyperparameters of the network.

To find the optimal learning rate, start with a very low value (e.g. 10^-6) and slowly multiply it by a constant until you reach a very high value (e.g. 10). Measure model performance (against the logarithm of the learning rate) in your Weights & Biases dashboard to determine which rate works well for your problem. You can then retrain your model using this optimal learning rate.

The optimal learning rate is usually around half the learning rate at which the model starts to diverge. Feel free to set different values for learn_rate in the accompanying code and see how it affects model performance to develop your intuition about learning rates.

I also recommend using the learning rate finder (range test) proposed by Leslie Smith. It is an excellent way to find a good learning rate for most gradient optimizers (most variants of SGD) and works with most network architectures.
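Here is a minimal sketch of that exponential sweep as a custom Keras callback. It is not Leslie Smith's exact implementation; the start value and multiplier are assumptions to tune.

```python
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    """Multiply the learning rate by a constant factor after every batch and record the loss."""
    def __init__(self, start_lr=1e-6, factor=1.05):
        super().__init__()
        self.lr = start_lr
        self.factor = factor
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        self.model.optimizer.learning_rate = self.lr

    def on_train_batch_end(self, batch, logs=None):
        self.lrs.append(self.lr)
        self.losses.append(logs["loss"])
        self.lr *= self.factor
        self.model.optimizer.learning_rate = self.lr

# Usage sketch: run an epoch or two with this callback, plot losses against
# log10(lrs), and pick a rate somewhat below the point where the loss blows up.
# model.fit(x_train, y_train, epochs=2, callbacks=[LRRangeTest()])
```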

See also the section on learning rate scheduling below.

9. Momentum


Comparing the learning paths of the SGD algorithm with and without momentum

Gradient descent takes small, consistent steps to approach a local minimum, which can take a long time to converge when the gradient is small. Momentum, on the other hand, takes into account previous gradients and speeds up convergence by crossing valleys faster and avoiding local minima.

In general, you want the momentum value to be very close to 1. A value of 0.9 is a good starting point for smaller datasets, and you want to move progressively closer to 1 (e.g. 0.999) for larger datasets. (Setting nesterov=True lets momentum take into account the gradient of the cost function a few steps ahead of the current point, which makes it slightly more accurate and faster.)
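In tf.keras, momentum and the Nesterov variant are just arguments to the SGD optimizer (the learning rate here is a placeholder):

```python
import tensorflow as tf

# Plain momentum: 0.9 is a common starting point; move toward 0.999 for larger datasets.
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Nesterov momentum evaluates the gradient slightly ahead along the accumulated direction.
sgd_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```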

10. Vanishing and Exploding Gradients

Just like people, not all neural network layers learn at the same rate. So when the backpropagation algorithm propagates the error gradient from the output layer to the first layer, the gradient gets smaller and smaller until it reaches the first layer and is almost negligible. This means that the weights of the first layer are not significantly updated at each step.


This is the problem of vanishing gradients. (A similar gradient explosion problem arises when the gradients of some layers get progressively larger, resulting in massive updates to the weights of some layers but not others.)

There are several ways to counteract vanishing gradients. Let's take a look at them now!

11. Hidden layer activation function

In general, the performance of using different activation functions improves in the following order (from lowest → highest performance):

logistic → tanh → ReLU → Leaky ReLU → ELU → SELU

ReLU is the most popular activation function, and if you don't want to tune activation functions it is a good starting point. Keep in mind, however, that ReLU is becoming less attractive than newer options such as ELU or GELU.

If you're feeling adventurous, try the following:

  • Combating Neural Network Overfitting: RReLU
  • Reducing runtime latency: leaky ReLU
  • For large training sets: PReLU
  • Fast inference: leaky ReLU
  • If your network is not self-normalizing: ELU
  • For an overall robust activation function: SELU

As always, don't be afraid to experiment with different activation functions, and turn to your Weights & Biases dashboard to help you choose the one that works best for you!

You can refer to this research paper which goes in depth on the comparison of different activation functions used in neural networks.
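Most of the activations above are available in tf.keras either as string identifiers or as standalone layers. The model below is purely illustrative (you would not normally mix this many activations in one network); RReLU is not part of core tf.keras, so it is left out.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),    # the usual default
    layers.Dense(128),
    layers.LeakyReLU(),                      # leaky ReLU as a separate layer
    layers.Dense(128),
    layers.PReLU(),                          # PReLU learns its own negative slope
    layers.Dense(128, activation="elu"),     # ELU
    layers.Dense(128, activation="selu",
                 kernel_initializer="lecun_normal"),  # SELU pairs with LeCun init
    layers.Dense(10, activation="softmax"),
])
```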

12. Output layer activation function

Regression: The output neuron of a regression problem does not need an activation function, because we want the output to be able to take any value. If we want the output to be limited to a certain range, we can use tanh for values in -1→1 and the logistic (sigmoid) function for values in 0→1. If we are only looking for positive outputs, we can use a softplus activation (a smooth approximation of the ReLU activation function).

Classification: Use the sigmoid activation function for binary classification to ensure that the output is squeezed between 0 and 1. Using softmax for multi-class classification ensures that the output probabilities add up to 1.
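A short sketch of the corresponding output layers in tf.keras (the layer sizes are placeholders):

```python
from tensorflow.keras import layers

reg_output      = layers.Dense(1)                         # regression: linear output, any value
bounded_output  = layers.Dense(1, activation="tanh")      # regression bounded to (-1, 1)
positive_output = layers.Dense(1, activation="softplus")  # regression, positive values only
binary_output   = layers.Dense(1, activation="sigmoid")   # binary classification
multi_output    = layers.Dense(10, activation="softmax")  # 10-class classification
```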

13. Weight initialization method

The right weight initialization method can greatly speed up convergence. The choice of initialization method depends on your activation function. Some things worth trying:

  • When using ReLU or leaky ReLU, use He initialization
  • When using SELU or ELU, use LeCun initialization
  • When using softmax, logistic or tanh, use Glorot initialization
  • Most initialization methods come in uniform and normal distribution variants
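In tf.keras the initializer is a per-layer argument, so the pairings above translate directly (a sketch; the layer width is arbitrary):

```python
from tensorflow.keras import layers

relu_layer = layers.Dense(64, activation="relu", kernel_initializer="he_normal")       # ReLU / leaky ReLU
selu_layer = layers.Dense(64, activation="selu", kernel_initializer="lecun_normal")    # SELU / ELU
tanh_layer = layers.Dense(64, activation="tanh", kernel_initializer="glorot_uniform")  # tanh / logistic / softmax

# Each family has uniform and normal variants, e.g. "he_uniform" vs. "he_normal".
```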

14. Batch normalization

Batch normalization learns the optimal mean and scale for the inputs to each layer. It does this by zero-centering and normalizing its input vectors, then scaling and shifting them. It also acts a bit like a regularizer, which means it can reduce the need for dropout or L2 regularization.

Using batch normalization allows us to use larger learning rates (which lead to faster convergence) and brings large improvements to most neural networks by reducing the vanishing gradient problem. The only downside is that it slightly increases training time because of the extra computation required at each layer.
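A sketch of one common way to insert batch normalization between a layer's linear part and its activation (placing it after the activation instead is also common; this is a design choice, not the only correct ordering):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(256, use_bias=False),   # BN adds its own shift, so the Dense bias is redundant
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])
```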

15. Gradient Clipping

One of the good ways to reduce exploding gradients, especially when training RNNs, is simply to clip the gradients when they exceed a certain value. I suggest trying clipping by norm rather than clipping by value, because it preserves the direction of the gradient vector: any gradient whose L2 norm exceeds a chosen threshold is rescaled down to that threshold.

Try a few different thresholds to find what works best for you.
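In tf.keras, gradient clipping is an optimizer argument; the thresholds below are examples to tune, not recommendations:

```python
import tensorflow as tf

# Clip by norm: if the gradient's L2 norm exceeds 1.0, rescale it down to 1.0,
# which keeps its direction intact (usually preferable).
opt_clipnorm = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Clip by value: clamp every gradient component to [-0.5, 0.5] independently.
opt_clipvalue = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)
```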

16. Early stopping


Early stopping lets you train a model with more hidden layers, more hidden neurons, and more epochs than you need, and simply stop training once performance stops improving for n consecutive epochs. It also saves the best-performing model for you. You can enable early stopping by setting up a callback when you fit your model and setting save_best_only=True.
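A sketch with tf.keras callbacks (the file name, patience, and epoch budget are placeholders):

```python
import tensorflow as tf

callbacks = [
    # Stop once val_loss has not improved for 10 consecutive epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Keep only the best-performing model on disk.
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                       save_best_only=True),
]

# model.fit(x_train, y_train, validation_split=0.2, epochs=500, callbacks=callbacks)
```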

17. Dropout

Dropout is an excellent regularization technique that gives us a large performance boost (about 2% for state-of-the-art models) despite how simple it actually is. What dropout does is randomly turn off a certain percentage of neurons in each layer at every training step. This makes the network more robust, because it cannot rely on any particular set of input neurons for its predictions; the knowledge is distributed throughout the network. During training, approximately 2^n slightly different neural networks (where n is the number of neurons in the architecture) are generated and ensembled together to make predictions.

A good dropout rate is between 0.1 and 0.5; 0.3 for RNNs and 0.5 for CNNs. Use larger rates for larger layers. Increasing the dropout rate can reduce overfitting, while decreasing the dropout rate can help combat underfitting.

You will want to experiment with different dropout values in the earlier layers of the network, and check the Weights & Biases dashboard to choose the best-performing one. You definitely do not want to use dropout in the output layer.

Please read this article before using Dropout with BatchNorm.

In this kernel, I used AlphaDropout, a variant of dropout that works well with the SELU activation function because it preserves the mean and standard deviation of its inputs.
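A sketch of both flavors in tf.keras (the rates follow the rules of thumb above; AlphaDropout availability may depend on your Keras version):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),                     # randomly zeroes 30% of activations each training step
    layers.Dense(256, activation="selu", kernel_initializer="lecun_normal"),
    layers.AlphaDropout(0.1),                # preserves the self-normalizing property of SELU
    layers.Dense(10, activation="softmax"),  # no dropout on the output layer
])
```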

18. Optimizer

Gradient descent is not the only optimizer used in neural networks; there are several to choose from. Here I only describe a few of them; you can check out this post, where I discuss the optimizers in detail.

If you are very concerned about the quality of convergence and time is not paramount, I recommend using Stochastic Gradient Descent (SGD).

If you care about convergence time and getting close to the optimum is enough, try the Adam, Nadam, RMSProp, and Adamax optimizers. Your Weights & Biases dashboard will guide you to the optimizer that works best for you!

Adam/Nadam is usually a good starting point and tends to be quite forgiving of a poor learning rate and other non-optimal hyperparameters.

According to Andrej Karpathy, "well-tuned SGD will almost always slightly outperform Adam" when it comes to ConvNets.

In this kernel, I get the best performance from Nadam, which is just a regular Adam optimizer with Nesterov's trick, so it converges faster than Adam.

19. Learning rate scheduling

We've already discussed the importance of a good learning rate - we don't want it to be so high that the cost function dances around the optimum and diverges. We also don't want it to be too low, as that means convergence will take a long time.

Taking care of the learning rate can be difficult because both higher and lower learning rates have their advantages. The good news is that we don't have to commit to a learning rate! With learning rate scheduling, we can start at a higher rate to go through gradient slopes faster and slow down when we reach gradient valleys in hyperparameter space, which requires taking smaller steps.

There are many ways to schedule the learning rate, including decreasing it exponentially, using a step function, reducing it when performance starts to degrade, or using 1cycle scheduling. In this kernel, I show how to use the ReduceLROnPlateau callback to reduce the learning rate by a constant factor whenever performance has not improved for n epochs.
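A sketch of the two most common options in tf.keras (the factor, patience, and decay values are placeholders; 1cycle scheduling is not built in and needs a custom callback or a third-party implementation):

```python
import tensorflow as tf

# Option 1: halve the learning rate whenever val_loss has not improved for 5 epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.5, patience=5)

# Option 2: a fixed exponential decay schedule attached to the optimizer.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.9)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[reduce_lr])
```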

I highly recommend trying 1cycle scheduling as well.

Use a constant learning rate until you have tuned all the other hyperparameters, then implement learning rate decay scheduling at the end.

As with most things, I recommend running a few experiments with different scheduling strategies and using your Weights & Biases dashboard to choose the one that produces the best model.


Original Link: Neural Network Design FAQ — BimAnt
