Estimation of computational load for deep learning model training

In today's machine learning landscape, the performance and sophistication of deep learning models are closely tied to the amount of compute used to train them. To make fair comparisons between models, it is therefore critical to estimate and report the computing resources consumed during training.

This article will explore methods for estimating the computational load of deep learning model training and introduce some frontiers in this field.

Compute resource usage is typically measured in the number of floating-point operations (FLOPs) required to train the final version of the model.

We will focus on two methods for estimating and comparing the training compute of deep learning models:

  • Based on the network architecture and the number of training batches
  • Based on the hardware configuration and the model training time

Method 1: Calculate the number of operations based on the network architecture and the number of training batches

This method estimates the computational effort by analyzing the architecture of the model and the amount of training data. We will explore how this information can be used to estimate the computational resource requirements of model training and its application in practical research.

The approximate formula is as follows:

$$\text{compute} = 2 \times \text{connections} \times 3 \times \text{training\_examples} \times \text{epochs}$$

connections: the number of connections in the neural network, i.e., the direct links between neurons along which information is passed.

For example, a fully connected layer with N input neurons and M output neurons has N × M connections: each input neuron is connected to every output neuron.

training examples: the number of samples in the data set used to train the model
epochs: the number of complete passes over the training data set during training
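
As a quick illustration of this formula, here is a minimal sketch; the function name and all the numbers are made up for the example:

```python
def compute_from_connections(connections, training_examples, epochs):
    """Rough FLOP estimate: 2 FLOPs per connection per forward pass,
    times 3 to account for the backward pass, per example per epoch."""
    return 2 * connections * 3 * training_examples * epochs

# Hypothetical example: a small fully connected network with 1,000,000
# connections, trained on 50,000 examples for 10 epochs.
print(compute_from_connections(1_000_000, 50_000, 10))  # 3.0e12 FLOP
```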

The usage of computing resources is usually measured by the number of floating point operations (FLOPs) required for a forward pass (inference) or a backward pass of the model. This is the cost of a single iteration (one batch), not the sum over all iterations. In a deep learning framework, once a batch has been processed, the framework normally releases the corresponding computational resources automatically, including the memory used for intermediate results.

Why can't we compute one layer at a time, releasing its resources before moving on to the next layer?

In neural network training, the computation of each layer depends on the output of the previous layer, so resources cannot simply be released after each layer before moving on. Neural network computation is usually pipelined: the output of each layer is the input of the next, and if every layer had to wait for the previous one to finish and free its resources, the whole computation would become very slow. In addition, the intermediate activations of each layer must be kept in memory, because the backward pass needs them to compute gradients.

We can now rewrite the formula above in the following form:

$$training\_compute = (ops\_per\_forward\_pass + ops\_per\_backward\_pass) \times n\_passes$$

where:

  • ops_per_forward_pass: the number of operations in one forward pass
  • ops_per_backward_pass: the number of operations in one backward pass
  • n_passes: the product of the number of epochs and the number of training examples:

$$n\_passes = n\_epochs \times n\_examples$$

If the number of training examples is not known directly, it can sometimes be computed as the number of batches per epoch multiplied by the batch size.

$$n\_examples = n\_batches \times batch\_size$$

The ratio of ops_per_backward_pass to ops_per_forward_pass is relatively stable, so the two can be combined using:

$$fp\_to\_bp\_ratio = \frac{ops\_per\_backward\_pass}{ops\_per\_forward\_pass}$$

The following formula is obtained:

$$training\_compute = ops\_per\_forward\_pass \times (1 + fp\_to\_bp\_ratio) \times n\_passes$$

The fp_to_bp_ratio is usually estimated to be 2:1, which gives the final equation:

$$training\_compute = ops\_per\_forward\_pass \times 3 \times n\_epochs \times n\_examples$$
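
This final formula is easy to turn into a small helper; the following is a minimal sketch, and the example numbers are purely illustrative:

```python
def training_compute(ops_per_forward_pass, n_epochs, n_examples, fp_to_bp_ratio=2.0):
    """Training FLOPs from per-example forward-pass FLOPs.
    The default backward:forward ratio of 2:1 follows the rule of thumb above."""
    return ops_per_forward_pass * (1 + fp_to_bp_ratio) * n_epochs * n_examples

# Hypothetical example: 1e9 FLOP per forward pass, 90 epochs, 1.28e6 training images.
print(f"{training_compute(1e9, 90, 1.28e6):.3e} FLOP")  # ~3.456e+17 FLOP
```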

Why the ratio of backward pass operations to forward pass operations is 2:1

The backward pass requires, for each layer, computing the gradient with respect to the weights as well as the error gradient with respect to the layer's input, which is propagated back to the previous layer. Each of these computations costs roughly as much as that layer's forward pass, so fp_to_bp_ratio is approximately 2:1.

Why the cost of the weight update is negligible

In deep learning training, the amount of parameter calculation required for weight update can usually be considered negligible compared to forward propagation and back propagation. This is mainly due to the following reasons:

  1. Per-batch updates: optimization algorithms such as batch or mini-batch gradient descent update the weights once per batch, using gradients that have already been computed. Compared with the forward and backward passes, which must be computed for every training sample, the weight update touches each parameter only once per batch, so its computational cost is relatively small.
  2. Accumulated gradients: in practice, gradients are often accumulated over several batches before the parameters are updated, which reduces the variance of the gradient estimate and improves its stability. Because of this accumulation, the weight-update computation is an even smaller fraction of the overall training work.
  3. Parameter sharing: in architectures such as convolutional neural networks (CNNs), many neurons share the same set of weights, which reduces the number of parameters. Thanks to parameter sharing, the cost of updating the weights is correspondingly small.
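
To see the orders of magnitude involved, here is a rough back-of-the-envelope sketch; the parameter count, batch size, and per-operation costs are illustrative assumptions rather than measurements:

```python
# Rough cost comparison for one mini-batch of a dense network (illustrative numbers).
params = 10_000_000      # assumed number of trainable parameters
batch_size = 128

forward_flops  = 2 * params * batch_size   # ~2 FLOPs per parameter per example
backward_flops = 2 * forward_flops         # backward pass ~2x the forward pass
update_flops   = 2 * params                # plain SGD: one multiply and one subtract per parameter

total = forward_flops + backward_flops + update_flops
print(f"weight update share: {update_flops / total:.4%}")  # ~0.26% of the batch's FLOPs
```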

Forward pass calculation and number of parameters for common layers

The following compiles, for several common neural network layer types, an estimate of the number of parameters and the number of floating point operations required for one forward pass.

As noted above, for many layers the number of FLOPs in the forward pass is roughly twice the number of parameters. There are exceptions, however: CNNs have far fewer parameters than their FLOP count would suggest because of parameter sharing, while embedding layers perform essentially no floating-point operations.

Fully connected layer

A fully connected layer with N input neurons and M output neurons.

(figures: parameter count and forward-pass FLOPs)

CNN

A convolutional layer applied to a tensor of shape H × W × C, using D filters of shape K × K × C, with stride S and padding P.

(figures: parameter count and forward-pass FLOPs)

Transposed CNN

A transposed convolutional layer applied to a tensor of shape H × W × C, using D filters of shape K × K × C, with stride S and padding P.

(figures: parameter count and forward-pass FLOPs)

RNN

A recurrent neural network (RNN) with bias vectors, input size N and output size M.

(figures: parameter count and forward-pass FLOPs)

GRU

A fully gated GRU with bias vectors, input size N and output size M.

(figures: parameter count and forward-pass FLOPs)

LSTM

A long short-term memory (LSTM) network with bias vectors, input size N and output size M.

(figures: parameter count and forward-pass FLOPs)

Self-Attention

A self-attention layer with sequence length L, input size W, key size D, and output size N.

(figures: parameter count and forward-pass FLOPs)

Multi-Headed Attention

A multi-headed attention layer with sequence length L, input size W, key size D, output size N per attention head, final output size M, and H attention heads.

(figures: parameter count and forward-pass FLOPs)
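
As a rough guide, the sketch below implements the standard textbook approximations for two of these layers (a fully connected layer and a 2-D convolution). Conventions differ on whether biases are counted and whether a multiply-accumulate counts as one or two FLOPs, so treat the outputs as estimates:

```python
def dense_layer(n_in, n_out, bias=True):
    """Parameter count and approximate forward-pass FLOPs of a fully connected layer."""
    params = n_in * n_out + (n_out if bias else 0)
    flops = 2 * n_in * n_out                    # one multiply + one add per connection
    return params, flops

def conv2d_layer(h, w, c, k, d, stride=1, padding=0, bias=True):
    """Parameter count and approximate forward-pass FLOPs of a 2-D convolution
    with D filters of shape K x K x C applied to an H x W x C input."""
    h_out = (h - k + 2 * padding) // stride + 1
    w_out = (w - k + 2 * padding) // stride + 1
    params = k * k * c * d + (d if bias else 0)
    flops = 2 * k * k * c * d * h_out * w_out   # 2 FLOPs per multiply-accumulate
    return params, flops

print(dense_layer(1024, 512))                              # (524800, 1048576)
print(conv2d_layer(224, 224, 3, 3, 64, stride=1, padding=1))  # (1792, 173408256)
```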

Example: CNN-LSTM-FCN model

For example, suppose we have a CNN-LSTM-FCN architecture as follows:

  • The input is a sequence of images of shape [400x400x5].
  • The average length of each input sequence is 20 images.
  • The CNN has 16 filters of shape 5x5x5, applied with stride 2 and padding 2.
  • The LSTM is a many-to-one layer with 256 output units and bias vectors.
  • The fully connected layer has 10 output units.
  • The training process goes through 10 epochs, each epoch contains 100 sequence batches of size 128.
    (figure: CNN-LSTM-FCN architecture diagram)
    In this CNN-LSTM-FCN model, the recurrent part of the network comprises the CNN and the LSTM (applied to every image in the sequence), while the fully connected layer is the non-recurrent part.

First, the CNN takes the input image sequence, each image of shape 400 × 400 × 5. The CNN has 16 filters of shape 5 × 5 × 5, applied with stride 2 and padding 2. It produces an output of width and height H' = W' = ⌊(W − K + 2P)/S⌋ + 1 = ⌊(400 − 5 + 2×2)/2⌋ + 1 = 200, with 16 channels. The forward pass of the entire CNN requires approximately 1.024 × 10¹² floating point operations (FLOP).

Before it is fed to the LSTM, the CNN output is rearranged into an input of size 200 × 200 × 16. The LSTM then performs approximately 1.31 × 10⁹ FLOP per sequence token. Finally, the fully connected (FC) layer has 10 output units and requires about 5,120 FLOP.

The non-recurrent part of the network is comparatively small, so we can approximate the total number of operations as:

$$training\_compute \approx ops\_per\_forward\_pass\_recurrent \times 3 \times n\_epochs \times n\_batches \times batch\_size \times avg\_tokens\_per\_sequence$$
$$\approx 1.024 \times 10^{12}\,\text{FLOP} \times 3 \times 10 \times 100 \times 128 \times 20 = 7.86432 \times 10^{18}\,\text{FLOP}$$
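
The same arithmetic, written out as a short script (it simply reproduces the calculation above, taking the forward-pass FLOP count of the recurrent part as given):

```python
# Worked example: training compute of the CNN-LSTM-FCN model (Method 1).
ops_per_forward_pass_recurrent = 1.024e12  # FLOP per sequence token for the recurrent part (dominated by the CNN)
n_epochs = 10
n_batches = 100            # batches per epoch
batch_size = 128
avg_tokens_per_sequence = 20

training_compute = (ops_per_forward_pass_recurrent * 3 *   # 3x: forward pass + ~2x for the backward pass
                    n_epochs * n_batches * batch_size * avg_tokens_per_sequence)
print(f"{training_compute:.5e} FLOP")   # 7.86432e+18 FLOP
```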

Method 2: Calculate the number of operations based on hardware settings and training time

Another way to estimate involves considering the hardware used and training time. We'll examine how this hardware information can be used to estimate computing resource usage, and explore how hardware choices affect the performance and efficiency of deep learning models.

The traditional way of estimating the number of operations from the hardware setup and training time is to use GPU-days as the unit.

GPU-days: the cumulative number of days a single GPU has been used for training. If training lasted 5 days and used a total of 4 GPUs, that equals 20 GPU-days.

Estimating computing resources in GPU-days has some problems. First, it only captures the time spent training and ignores the performance of the hardware used. GPU performance has improved dramatically over the past decade, so the same number of GPU-days can correspond to very different amounts of actual computation in different periods.

Furthermore, this method does not take into account the differences between different hardware setups. The same number of GPU days may result in a different number of floating point operations under different hardware configurations. Therefore, to more accurately estimate computing resource usage, we need to consider the impact of hardware performance and configuration.

Therefore, we need to combine GPU time with the hardware configuration to estimate FLOP. The specific steps are as follows.

1. Extract information from papers/references:

When delving into model-related papers, we need to extract the following key information from them:

  1. Number of GPU days: The paper should include information on the number of GPU days used to train the model, which reflects the model's computing resource usage during training.
  2. Computing system/GPU used: The paper should clearly state the computing system or GPU model used during training. This is crucial for understanding the hardware specifications and performance.
  3. Numeric representation of floating point numbers used during the training run: the paper should state the numeric representation used during training, such as FP32, FP16, BF16, or INT8. This determines the numerical precision of the computation and which peak-throughput figure from the spec sheet applies.

2. Read the hardware specifications

By reading the hardware spec sheet, we can obtain the following information:

  1. GPU/System Model: By consulting the manufacturer's spec sheet, we can determine the specific model of GPU or computing system being used. This information is critical for accurate evaluation of computing performance.
  2. Peak performance of the GPU: Spec sheets usually include the peak performance of the GPU in FLOP/s (floating point operations per second). This is a key metric for evaluating hardware computing capabilities.

Here is an example for the NVIDIA A100:

(figure: NVIDIA A100 specification table, listing peak throughput for each numeric format)
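
For reference, the commonly cited dense (non-sparse) peak figures from NVIDIA's published A100 specifications are roughly the following; always confirm them against the spec sheet for the exact variant that was used:

```python
# Commonly cited NVIDIA A100 peak throughput figures (dense, no sparsity), in FLOP/s.
# Verify against the official datasheet for the exact variant (PCIe vs. SXM).
A100_PEAK_FLOPS = {
    "FP64": 9.7e12,
    "FP32": 19.5e12,
    "TF32 (Tensor Core)": 156e12,
    "FP16/BF16 (Tensor Core)": 312e12,
}
```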

If you cannot find which hardware was used, or the specs for that hardware, I recommend consulting the table linked below to estimate the average computing power for a given year. A graph of peak performance by year is also available there.

ML Hardware Data sheet

3. Make an estimate

Based on the above information, we can make the following estimates:

Estimate total FLOPs for GPU:

Step 1: Calculate peak performance of a single GPU

Obtain the peak performance of the GPU from the hardware specification table, expressed in FLOP/s. For example, suppose the peak performance is X FLOP/s.

Step 2: Calculate total GPU FLOPs

Multiply the peak performance of a single GPU by the GPU time used to get the total GPU FLOP. If the number of GPU-days is Y, the total is X FLOP/s × Y days × 86,400 s/day.
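
A minimal sketch of this step, assuming the GPUs run at their listed peak throughput for the entire time (which in practice overestimates real utilization):

```python
SECONDS_PER_DAY = 86_400

def total_gpu_flop(peak_flops_per_second, gpu_days):
    """Upper-bound FLOP estimate: peak throughput sustained for the full GPU time."""
    return peak_flops_per_second * gpu_days * SECONDS_PER_DAY

# Hypothetical example: 20 GPU-days on hardware with a 19.5 TFLOP/s FP32 peak.
print(f"{total_gpu_flop(19.5e12, 20):.3e} FLOP")  # ~3.370e+19 FLOP
```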

Precision considerations:

Step 1: Determine the numeric representation used for training

From the paper, determine the numerical representation the model used during training, such as FP32 or FP16.

Step 2: Determine the throughput for each numeric representation

Hardware delivers a different peak throughput for each numeric representation. For example, the hardware may deliver A FLOP/s at FP32 and B FLOP/s at FP16.

Step 3: Calculate the total number of FLOPs

Multiply the throughput associated with each numeric representation by how much of the model's computation uses that representation, and sum the results. For example, if both FP32 and FP16 are used, the total is A × quantity₁ + B × quantity₂, where quantity₁ and quantity₂ are the amounts of work done in each representation.

Consider hardware features:

Step 1: Check if tensor core is used

Check the hardware spec sheet or relevant literature to determine whether NVIDIA's tensor cores are enabled. We can consider the performance impact of this feature if enabled.

Step 2: Understand Tensor Core usage

If Tensor Core is enabled, learn how it is used in model training. This may involve special parameter settings or architectural requirements.

Step 3: Adjust total GPU FLOPs

If tensor cores are used, the total GPU FLOPs can be adjusted based on their usage. This may require some additional calculations and estimates depending on the circumstances.

Through these detailed steps, we can estimate the model's training compute more accurately, taking into account the effects of different numerical precisions and hardware characteristics.

Origin blog.csdn.net/weixin_42010722/article/details/133907640