Large model training time estimation

With activation recomputation enabled

GPU utilization is typically between 0.3 and 0.55; here we assume 0.45.
RTX 4090 theoretical peak performance: FP16: 82.58 TFLOPS
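
A sketch of the resulting estimate, under the assumption that total training compute is 8 × tokens × params as derived below (the symbols here are illustrative notation, not the article's own):

$$
t_{\text{train}} \approx \frac{8 \times \text{tokens} \times \text{params}}{N_{\text{GPU}} \times \text{FLOPS}_{\text{peak}} \times \text{utilization}}
$$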

Without activation recomputation
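
Assuming the same form of estimate, only without the extra forward pass, the coefficient drops from 8 to 6 (see the derivation below):

$$
t_{\text{train}} \approx \frac{6 \times \text{tokens} \times \text{params}}{N_{\text{GPU}} \times \text{FLOPS}_{\text{peak}} \times \text{utilization}}
$$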

Let’s talk about where the coefficients 8 and 6 come from:

  • For each model parameter, two floating-point operations are performed per token: when computing Y = AB, each parameter participates in one element-wise multiplication and one addition (a multiply-accumulate), so each parameter accounts for 2 FLOPs.
  • The backward pass requires roughly twice as much computation as the forward pass. Intuitively, for each linear layer the forward pass computes one matrix product (Y = AB), while the backward pass must compute two: the gradient with respect to the layer input (needed to keep propagating the error backward) and the gradient with respect to the weights. Each of these costs about as much as the forward product, so the backward pass is roughly 2x the forward pass.
  • When activation recomputation is enabled, an additional forward pass is performed during the backward pass, because the activations are recomputed instead of being stored.

Therefore:

forward pass + backward pass + activation recomputation = 1 + 2 + 1 = 4

In one training iteration with activation recomputation, each token and each model parameter therefore requires 2 × 4 = 8 floating-point operations. Without activation recomputation, the extra forward pass is dropped, giving 2 × 3 = 6, which is where the coefficient 6 comes from.
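
A minimal Python sketch of this estimate (the 82.58 TFLOPS peak and 0.45 utilization are the figures quoted above; the model size, token count, and GPU count in the example are illustrative assumptions, not values from the article):

```python
# Estimate total training FLOPs and wall-clock time from the
# per-token, per-parameter FLOP counts derived above.

def training_flops(num_params, num_tokens, activation_recompute=True):
    """Total training FLOPs: 8 * tokens * params with activation
    recomputation (forward 2 + backward 4 + extra forward 2),
    otherwise 6 * tokens * params."""
    coeff = 8 if activation_recompute else 6
    return coeff * num_tokens * num_params

def training_days(num_params, num_tokens, num_gpus,
                  peak_flops=82.58e12,   # RTX 4090 FP16 peak, in FLOPS
                  utilization=0.45,      # assumed GPU utilization
                  activation_recompute=True):
    """Estimated wall-clock training time in days."""
    flops = training_flops(num_params, num_tokens, activation_recompute)
    seconds = flops / (num_gpus * peak_flops * utilization)
    return seconds / 86400

# Illustrative example (assumed numbers): a 7B-parameter model trained
# on 1T tokens across 8 GPUs, with activation recomputation enabled.
if __name__ == "__main__":
    days = training_days(num_params=7e9, num_tokens=1e12, num_gpus=8)
    print(f"Estimated training time: {days:.0f} days")
```

Setting activation_recompute=False reproduces the coefficient-6 case above.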

Readers who want a more detailed account of why the backward pass costs roughly twice as much as the forward pass can refer to the following articles:
1. What's the backward-forward FLOP ratio for Neural Networks?
2. How the backpropagation algorithm works

Note: enable activation recomputation when GPU memory is tight; if memory is sufficient, there is no need to enable it.

Origin blog.csdn.net/qq_44193969/article/details/132246050