[DeepSpeed Tutorial Translation] Part 3: Performance Debugging with PyTorch Profiler in DeepSpeed, and the Flops Profiler

0x0. Preamble

This post translates two tutorials: https://www.deepspeed.ai/tutorials/pytorch-profiler/ and https://www.deepspeed.ai/tutorials/flops-profiler/. When training a model with DeepSpeed, you can follow these two tutorials to profile the model and determine where its compute and memory bottlenecks are.

0x1. Use PyTorch Profiler in DeepSpeed for performance debugging

Corresponding to the original tutorial: https://www.deepspeed.ai/tutorials/pytorch-profiler/

This tutorial describes how to use the PyTorch Profiler tool (https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/) in DeepSpeed.

PyTorch Profiler is an open-source tool that enables accurate and efficient performance analysis and troubleshooting for large-scale deep learning models. Profiling results can be output as a .json trace file and viewed in Google Chrome's trace viewer (chrome://tracing). The Python extension for Microsoft Visual Studio Code integrates TensorBoard into the code editor, including support for the PyTorch Profiler. For more details, see the PyTorch Profiler recipe (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#pytorch-profiler).

Profile the model training loop

The following shows how to profile a training loop by wrapping the code in the profiler context manager. The profiler assumes that the training process is composed of steps (numbered starting from zero). PyTorch Profiler accepts a number of parameters, such as schedule, on_trace_ready, with_stack, etc.

In the example below, the profiler skips the first 5 steps, uses the next 2 steps as warm-up, and actively records the next 6 steps. Because repeat is set to 2, the profiler stops recording after the first two cycles (one cycle is the wait, warmup, and active steps together). For detailed usage of schedule, please refer to Using profiler to analyze long-running jobs (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#using-profiler-to-analyze-long-running-jobs).


import torch
from torch.profiler import profile, record_function, ProfilerActivity

with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=5, # During this phase profiler is not active.
        warmup=2, # During this phase profiler starts tracing, but the results are discarded.
        active=6, # During this phase profiler traces and records data.
        repeat=2), # Specifies an upper bound on the number of cycles.
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"), # Writes traces for TensorBoard; "./log" is only an example directory.
    with_stack=True # Enable stack tracing, adds extra profiling overhead.
) as profiler:
    for step, batch in enumerate(data_loader):
        print("step:{}".format(step))

        #forward() method
        loss = model_engine(batch)

        #runs backpropagation
        model_engine.backward(loss)

        #weight update
        model_engine.step()
        profiler.step() # Send the signal to the profiler that the next step has started.
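
The schedule used above can be traced by hand. The following is a minimal pure-Python sketch of the wait/warmup/active/repeat cycle as described in this section; profiler_phase is a hypothetical helper written for illustration and is not the torch.profiler implementation.

```python
def profiler_phase(step, wait=5, warmup=2, active=6, repeat=2):
    """Return 'wait', 'warmup', 'active', or 'done' for a zero-based step.

    Simplified model of the schedule described above (assumption: no other
    schedule options are in use).
    """
    cycle_len = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle_len:
        return "done"        # all cycles finished, nothing more is recorded
    pos = step % cycle_len   # position inside the current cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"


# Steps whose data is kept when the loop runs for 40 steps.
recorded = [s for s in range(40) if profiler_phase(s) == "active"]
print(recorded)
```

With these settings one cycle is 13 steps long, so recording happens in steps 7 to 12 and 20 to 25, and nothing is recorded from step 26 onward.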

Label arbitrary code ranges

Arbitrary user-specified code ranges can be labeled with the record_function context manager. For example, the following code labels the forward pass as "model_forward":

with profile(record_shapes=True) as prof: # record_shapes indicates whether to record shapes of the operator inputs.
    with record_function("model_forward"):
        model_engine(inputs)

Later, the profile results show the time spent in the range labeled "model_forward".
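
To make the idea of a labeled range concrete, here is a rough standard-library stand-in, not record_function itself: a context manager that accumulates wall-clock time per label. The names labeled_range and range_totals are hypothetical, and the loop inside the with-block is only a placeholder for the forward pass.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical stand-in for illustration only; not part of torch.profiler.
range_totals = defaultdict(float)


@contextmanager
def labeled_range(name):
    """Add the wall-clock seconds spent inside the with-block to range_totals[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        range_totals[name] += time.perf_counter() - start


with labeled_range("model_forward"):
    sum(i * i for i in range(10_000))  # placeholder for model_engine(inputs)

print(dict(range_totals))
```

Unlike this sketch, the real profiler also attributes the operators and kernels launched inside the range to its label.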

Profile CPU/GPU activity

The activities parameter passed to the Profiler specifies the list of activities to be profiled during the execution of the code scope wrapped with the profiler context manager:

  • ProfilerActivity.CPU - PyTorch operators, TorchScript functions, and user-defined code labels (record_function).
  • ProfilerActivity.CUDA - CUDA kernels on the device. Note that CUDA profiling imposes non-negligible overhead.

The example below profiles CPU and GPU activity during the forward pass of the model and prints a summary table sorted by total CUDA time.

with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_forward"):
        model_engine(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Profile memory consumption

Passing profile_memory=True to the PyTorch Profiler enables memory profiling, which records the amount of memory used by the model's tensors that is allocated (or released) during the execution of the model's operators. For example:

with profile(activities=[ProfilerActivity.CUDA],
        profile_memory=True, record_shapes=True) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

0x2. Flops Profiler

Corresponding to the original tutorial: https://www.deepspeed.ai/tutorials/flops-profiler/

In this tutorial, we introduce the DeepSpeed Flops Profiler and provide examples of its usage.

Overview

Efficient use of hardware resources is critical to good performance, but performance inefficiencies in existing implementations of large-scale model training and inference are often hard to spot and to attribute to specific module components. The DeepSpeed Flops Profiler helps users easily measure both the training/inference speed (latency, throughput) and the efficiency (floating-point operations per second, i.e., FLOPS) of a model and its submodules, with the aim of eliminating inefficiencies in existing implementations.

Below is an example output for BERT-Large (NVIDIA) on an A100 GPU with a batch size of 80:

-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   1
data parallel size:                                           1
model parallel size:                                          1
batch size per GPU:                                           80
params per gpu:                                               336.23 M
params of model = params per GPU * mp_size:                   336.23 M
fwd MACs per GPU:                                             3139.93 G
fwd flops per GPU:                                            6279.86 G
fwd flops of model = fwd flops per GPU * mp_size:             6279.86 G
fwd latency:                                                  76.67 ms
bwd latency:                                                  108.02 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          81.9 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      116.27 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):   102.0 TFLOPS
step latency:                                                 34.09 us
iter latency:                                                 184.73 ms
samples/second:                                               433.07

----------------------------- Aggregated Profile per GPU -----------------------------
Top modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'BertForPreTrainingPreLN': '336.23 M'}
    MACs        - {'BertForPreTrainingPreLN': '3139.93 GMACs'}
    fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'}
depth 1:
    params      - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'}
    MACs        - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'}
    fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'}
depth 2:
    params      - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'}
    MACs        - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'}
    fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'}
depth 3:
    params      - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'}
    MACs        - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'}
    fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms'}
depth 4:
    params      - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M'}
    MACs        - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'}
    fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'}
depth 5:
    params      - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'}
    MACs        - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'}
    fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'}
depth 6:
    params      - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'}
    MACs        - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'}
    fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

BertForPreTrainingPreLN(
  336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS,
  (bert): BertModel(
    335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS,
    (embeddings): BertEmbeddings(...)
    (encoder): BertEncoder(
      302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS,
      (FinalLayerNorm): FusedLayerNorm(...)
      (layer): ModuleList(
        302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS,
        (0): BertLayer(
          12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS,
          (attention): BertAttention(
            4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS,
            (self): BertSelfAttention(
              3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS,
              (query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS,...)
              (key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS,...)
              (value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS,...)
              (dropout): Dropout(...)
              (softmax): Softmax(...)
            )
            (output): BertSelfOutput(
              1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS,
              (dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...)
              (dropout): Dropout(...)
            )
          )
          (PreAttentionLayerNorm): FusedLayerNorm(...)
          (PostAttentionLayerNorm): FusedLayerNorm(...)
          (intermediate): BertIntermediate(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS,
            (dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS,...)
          )
          (output): BertOutput(
            4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS,
            (dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS,...)
            (dropout): Dropout(...)
          )
        )
        ...
        (23): BertLayer(...)
      )
    )
    (pooler): BertPooler(...)
  )
  (cls): BertPreTrainingHeads(...)
)
------------------------------------------------------------------------------

In the profile summary, the DeepSpeed Flops Profiler reports the model's number of parameters, number of floating-point operations (flops), FLOPS, latency, and throughput in samples/second. This summary shows the gap between the current model execution and the peak performance of the hardware, and helps users tune the training or inference setup (e.g., hyperparameters, data parallelism, model parallelism, system configuration) for better performance.
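
The derived rows of the summary follow from the formulas printed in the output itself. As a hedged check, the sketch below recomputes them from the rounded BERT-Large figures shown above; the helper name tflops is made up for this example, and because the inputs are already rounded the results match the printed values only up to rounding.

```python
def tflops(flops, seconds):
    """Achieved throughput in TFLOPS (1e12 floating-point operations per second)."""
    return flops / seconds / 1e12


# Rounded values copied from the BERT-Large profile summary above.
fwd_flops = 6279.86e9    # fwd flops per GPU
fwd_latency = 76.67e-3   # seconds
bwd_latency = 108.02e-3  # seconds
step_latency = 34.09e-6  # seconds
batch_size = 80          # batch size per GPU

fwd_tflops = tflops(fwd_flops, fwd_latency)
bwd_tflops = tflops(2 * fwd_flops, bwd_latency)  # backward estimated as 2x forward
fwd_bwd_tflops = tflops(3 * fwd_flops, fwd_latency + bwd_latency)
iter_latency = fwd_latency + bwd_latency + step_latency
samples_per_second = batch_size / iter_latency

print(f"fwd {fwd_tflops:.2f}, bwd {bwd_tflops:.2f}, fwd+bwd {fwd_bwd_tflops:.2f} TFLOPS")
print(f"iter latency {iter_latency * 1e3:.2f} ms, {samples_per_second:.2f} samples/second")
```

The results land within rounding distance of the 81.9, 116.27, and 102.0 TFLOPS, the 184.73 ms iteration latency, and the 433.07 samples/second reported by the profiler.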

The DeepSpeed Flops Profiler can also profile the top modules at different model depths (aggregated profile) and specific modules in the model architecture (detailed profile). With these profiles, DeepSpeed users can understand how each layer or submodule contributes to the overall model complexity and performance, and can then adjust or refactor the model design to improve performance. For example, using the profiler, DeepSpeed users can quantify whether stacking smaller layers is lighter or performs better than having larger ones. The aggregated and detailed profiles also allow users to quickly identify bottleneck modules. In the BERT-Large example above, using the DeepSpeed Flops Profiler, we find that BertLayer is the most significant layer and contains quite a few dropout, softmax, and layer norm modules along with linear modules. These modules are not heavy in flops, but they trigger many GPU kernel invocations and create excessive memory read/write requests. The pattern shown in the detailed profile suggests that this is a perfect match for kernel fusion, and we developed fused transformer kernels to reduce data movement (see https://www.deepspeed.ai/tutorials/bert-pretraining/). After applying our optimizations, we see a 25% improvement in FLOPS per GPU and in overall training samples/second in the DeepSpeed Flops Profiler output.

The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime without any user code changes, or independently of DeepSpeed as a standalone package. When using DeepSpeed for model training, the profiler can be enabled in the DeepSpeed configuration file (https://www.deepspeed.ai/docs/config-json/#flops-profiler). As a standalone package, the profiler API can be used in both training and inference code. The DeepSpeed profiler is still under active development and currently includes only initial features. Stay tuned for more features to be added soon.

Flops measurement

Similar to existing flops calculation tools or methods, the DeepSpeed Flops Profiler measures the flops of a module's forward pass, while the flops of the backward pass are estimated as twice those of the forward pass. Unlike the PyTorch profiler, which counts the flops of PyTorch operators, the DeepSpeed Flops Profiler measures the flops within the modules of a model and gives users more insight into model execution. The flops estimation is partly inspired by ptflops (https://github.com/sovrasov/flops-counter.pytorch); the main difference is that the DeepSpeed Flops Profiler not only supports flops computation directly at the module level, but can also capture torch.nn.functional calls invoked inside a module to estimate the flops. The DeepSpeed Flops Profiler therefore allows custom modules in the model, e.g., ParallelTransformerLayer, ParallelSelfAttention, RowParallelLinear, etc. in Megatron-LM. This is in contrast to ptflops, which requires users to write a custom flops calculation function for each custom module.
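
As a rough illustration of module-level counting, the sketch below estimates the parameters and multiply-accumulate operations (MACs) of a single linear layer and applies the 2x rule for the backward pass. This is not the profiler's code. The sequence length of 128 and the hidden size of 1024 are assumptions: neither is printed in the BERT-Large profile above, although both are consistent with its numbers.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a linear layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


def linear_macs(num_tokens, in_features, out_features):
    """One multiply-accumulate per weight for every token passing through the layer."""
    return num_tokens * in_features * out_features


batch_size = 80  # from the BERT-Large example above
seq_len = 128    # assumption: not shown in the profile output
hidden = 1024    # assumption: BERT-Large hidden size

params = linear_params(hidden, hidden)
macs = linear_macs(batch_size * seq_len, hidden, hidden)
fwd_flops = 2 * macs                # one multiply and one add per MAC
bwd_flops_estimate = 2 * fwd_flops  # backward pass estimated as 2x forward

print(f"{params / 1e6:.2f} M params, {macs / 1e9:.2f} GMACs, {fwd_flops / 1e9:.2f} G fwd flops")
```

Under these assumptions the result reproduces the 1.05 M params and 10.74 GMACs listed for the query, key, and value Linear modules in the detailed profile.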

Multi-GPU, multi-node, data-parallel and model-parallel

The DeepSpeed Flops Profiler outputs the per-GPU profile along with the world size, data parallel size, and model parallel size.

For models running on multiple GPUs or multiple nodes, only a change in model parallelism (e.g., --model-parallel-size in Megatron-LM) affects the profiled number of flops and parameters, i.e., model_parallel_size * flops = total_flops and model_parallel_size * parameters = total_parameters. The data parallel size or world size (related to the number of GPUs or nodes) does not affect the per-GPU profile.
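
The two relations above can be written as a small helper. This is only a sketch: model_totals is a hypothetical name, and the 4-way model-parallel case uses made-up inputs purely to show the scaling.

```python
def model_totals(flops_per_gpu, params_per_gpu, model_parallel_size):
    """Scale per-GPU profile numbers to whole-model totals."""
    total_flops = model_parallel_size * flops_per_gpu
    total_params = model_parallel_size * params_per_gpu
    return total_flops, total_params


# Per-GPU values from the BERT-Large summary above (model parallel size 1).
bert_totals = model_totals(6279.86e9, 336.23e6, model_parallel_size=1)

# Hypothetical case: the same per-GPU numbers reported under 4-way model parallelism.
mp4_totals = model_totals(6279.86e9, 336.23e6, model_parallel_size=4)

print(bert_totals, mp4_totals)
```

The data parallel size does not appear in the helper at all, which mirrors the statement that it leaves the per-GPU profile unchanged.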

Usage

The DeepSpeed Flops Profiler can be used with the DeepSpeed runtime or as a standalone package. When using DeepSpeed for model training, users can configure the profiler in the DeepSpeed configuration file (https://www.deepspeed.ai/docs/config-json/#flops-profiler) without changing any code. To use the flops profiler outside the DeepSpeed runtime, install DeepSpeed and import the flops_profiler package to use the API directly. Examples of each usage are given below.

Usage with the DeepSpeed runtime

When using DeepSpeed for model training, the profiler can be configured in the DeepSpeed configuration file. No explicit API calls are needed to use the profiler. It can be enabled by adding the following field to DeepSpeed's configuration JSON file. For details, please refer to flops profiler (https://www.deepspeed.ai/docs/config-json/#flops-profiler).

{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
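
If the configuration is assembled in Python rather than edited by hand, the same section can be added to a config dictionary and serialized with the standard json module. This is a sketch: with_flops_profiler is a hypothetical helper, and the empty base config stands in for whatever other DeepSpeed settings a real run needs.

```python
import json

# Same keys and values as the JSON fragment above.
flops_profiler_section = {
    "enabled": True,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": True,
    "output_file": None,
}


def with_flops_profiler(config):
    """Return a copy of a config dict with the flops_profiler section added."""
    merged = dict(config)
    merged["flops_profiler"] = dict(flops_profiler_section)
    return merged


# An empty dict stands in for an existing DeepSpeed config (assumption).
config_text = json.dumps(with_flops_profiler({}), indent=2)
print(config_text)
```

json.dumps converts Python's True and None into the JSON literals true and null, so the printed text matches the fragment above.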

Example: Megatron-LM

For information on running Megatron-LM with DeepSpeed, please refer to our Megatron-LM tutorial.

An example output for a 12-layer Megatron-LM model (hidden_size = 8192, num_attention_heads = 32, batch_size = 1024, seq_length = 1024) is shown below.

-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   1
data parallel size:                                           1
model parallel size:                                          1
batch size per GPU:                                           1024
params per gpu:                                               1.29 M
params of model = params per GPU * mp_size:                   1.29 M
fwd MACs per GPU:                                             41271.95 G
fwd flops per GPU:                                            82543.9 G
fwd flops of model = fwd flops per GPU * mp_size:             82543.9 G
fwd latency:                                                  1.89 s
bwd latency:                                                  5.38 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          43.68 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      30.7 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):   34.07 TFLOPS
step latency:                                                 34.12 s
iter latency:                                                 41.39 s
samples/second:                                               24.74

----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'GPT2Model': '1.29 M'}
    MACs        - {'GPT2Model': '41271.95 GMACs'}
    fwd latency - {'GPT2Model': '1.84 s'}
depth 1:
    params      - {'TransformerLanguageModel': '1.29 M'}
    MACs        - {'TransformerLanguageModel': '39584.03 GMACs'}
    fwd latency - {'TransformerLanguageModel': '1.83 s'}
depth 2:
    params      - {'ParallelTransformer': '1.29 M'}
    MACs        - {'ParallelTransformer': '39584.03 GMACs'}
    fwd latency - {'ParallelTransformer': '1.81 s'}
depth 3:
    params      - {'ModuleList': '1.28 M'}
    MACs        - {'ModuleList': '39584.03 GMACs'}
    fwd latency - {'ModuleList': '1.3 s'}
depth 4:
    params      - {'ParallelTransformerLayerPart2': '688.15 k'}
    MACs        - {'ParallelTransformerLayerPart2': '26388.28 GMACs'}
    fwd latency - {'ParallelTransformerLayerPart2': '865.73 ms'}
depth 5:
    params      - {'ParallelMLP': '491.54 k'}
    MACs        - {'ParallelMLP': '26388.28 GMACs'}
    fwd latency - {'ParallelMLP': '849.4 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs(or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.

GPT2Model(
  1.29 M, 100.00% Params, 41271.95 GMACs, 100.00% MACs, 1.84 s, 100.00% latency, 44.78 TFLOPS,
  (language_model): TransformerLanguageModel(
    1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.83 s, 99.11% latency, 43.34 TFLOPS,
    (embedding): Embedding(
      2, 0.00% Params, 0 MACs, 0.00% MACs, 18.1 ms, 0.98% latency, 0.0 FLOPS,
      (word_embeddings): VocabParallelEmbedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 164.75 us, 0.01% latency, 0.0 FLOPS, )
      (position_embeddings): Embedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 489.23 us, 0.03% latency, 0.0 FLOPS, 1024, 8192)
      (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 93.94 us, 0.01% latency, 0.0 FLOPS, p=0.1, inplace=False)
    )
    (transformer): ParallelTransformer(
      1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.81 s, 98.11% latency, 43.78 TFLOPS,
      (layers): ModuleList(
        1.28 M, 98.73% Params, 39584.03 GMACs, 95.91% MACs, 1.3 s, 70.66% latency, 60.79 TFLOPS,
        (0): ParallelTransformerLayerPart1(
          49.15 k, 3.80% Params, 1099.65 GMACs, 2.66% MACs, 23.5 ms, 1.27% latency, 93.6 TFLOPS,
          (input_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 128.75 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
          (attention): ParallelSelfAttention(
            32.77 k, 2.53% Params, 1099.65 GMACs, 2.66% MACs, 22.8 ms, 1.24% latency, 96.46 TFLOPS,
            (query_key_value): ColumnParallelLinear(24.58 k, 1.90% Params, 824.63 GMACs, 2.00% MACs, 8.93 ms, 0.48% latency, 184.7 TFLOPS, )
            (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 134.22 MMACs, 0.00% MACs, 151.16 us, 0.01% latency, 1.78 TFLOPS, )
            (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.63 us, 0.00% latency, 0.0 FLOPS, p=0.1, inplace=False)
            (dense): RowParallelLinear(8.19 k, 0.63% Params, 274.88 GMACs, 0.67% MACs, 2.67 ms, 0.14% latency, 205.81 TFLOPS, )
          )
        )
        (1): ParallelTransformerLayerPart2(
          57.35 k, 4.43% Params, 2199.02 GMACs, 5.33% MACs, 77.53 ms, 4.21% latency, 56.73 TFLOPS,
          (post_attention_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 116.11 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
          (mlp): ParallelMLP(
            40.96 k, 3.16% Params, 2199.02 GMACs, 5.33% MACs, 76.19 ms, 4.13% latency, 57.72 TFLOPS,
            (dense_h_to_4h): ColumnParallelLinear(32.77 k, 2.53% Params, 1099.51 GMACs, 2.66% MACs, 10.79 ms, 0.59% latency, 203.81 TFLOPS, )
            (dense_4h_to_h): RowParallelLinear(8.19 k, 0.63% Params, 1099.51 GMACs, 2.66% MACs, 14.38 ms, 0.78% latency, 152.95 TFLOPS, )
          )
        )
        ...
        (23): ParallelTransformerLayerPart2(...)
      )
      (final_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 110.86 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
    )
  )
)
------------------------------------------------------------------------------

You can refer to the latest DeepSpeed-Megatron repository and enable the DeepSpeed Flops Profiler in the DeepSpeed config file when training the model.

Usage outside the DeepSpeed runtime

The profiler can be used outside the DeepSpeed runtime as a standalone package. Simply install DeepSpeed and import the flops_profiler package to use the API directly. For how to install DeepSpeed, please refer to the DeepSpeed installation guide.

In model inference

To profile a trained model at inference time, use the get_model_profile function. Some examples are given below.

AlexNet example

The following example shows how to profile AlexNet using the DeepSpeed Flops Profiler.

import torchvision.models as models
import torch
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator

with get_accelerator().device(0):
    model = models.alexnet()
    batch_size = 256
    flops, macs, params = get_model_profile(model=model, # model
                                    input_shape=(batch_size, 3, 224, 224), # input shape to the model. If specified, the model takes a tensor with this shape as the only positional argument.
                                    args=None, # list of positional arguments to the model.
                                    kwargs=None, # dictionary of keyword arguments to the model.
                                    print_profile=True, # prints the model graph with the measured profile attached to each module
                                    detailed=True, # print the detailed profile
                                    module_depth=-1, # depth into the nested modules, with -1 being the inner most modules
                                    top_modules=1, # the number of top modules to print aggregated profile
                                    warm_up=10, # the number of warm-ups before measuring the time of each module
                                    as_string=True, # print raw numbers (e.g. 1000) or as human-readable strings (e.g. 1k)
                                    output_file=None, # path to the output file. If None, the profiler prints to stdout.
                                    ignore_modules=None) # the list of modules to ignore in the profiling

BERT example

from functools import partial
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator


def bert_input_constructor(batch_size, seq_len, tokenizer):
    fake_seq = ""
    for _ in range(seq_len - 2):  # ignore the two special tokens [CLS] and [SEP]
      fake_seq += tokenizer.pad_token
    inputs = tokenizer([fake_seq] * batch_size,
                       padding=True,
                       truncation=True,
                       return_tensors="pt")
    labels = torch.tensor([1] * batch_size)
    inputs = dict(inputs)
    inputs.update({"labels": labels})
    return inputs


with get_accelerator().device(0):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    batch_size = 4
    seq_len = 128
    enable_profile = True
    if enable_profile:
      flops, macs, params = get_model_profile(
          model,
          kwargs=bert_input_constructor(batch_size, seq_len, tokenizer),
          print_profile=True,
          detailed=True,
      )
    else:
      # Pass the three arguments separately and unpack the dict into keyword arguments.
      inputs = bert_input_constructor(batch_size, seq_len, tokenizer)
      outputs = model(**inputs)

In the model training workflow

To profile a model's forward pass in a training workflow, use the FlopsProfiler class. The FlopsProfiler class provides the following methods:

  • start_profile() - Start profiling.
  • get_total_flops(as_string=False) - Return the total number of floating-point operations in the model.
  • get_total_macs(as_string=False) - Return the total number of MACs in the model.
  • get_total_params(as_string=False) - Return the total number of parameters in the model.
  • print_model_profile(profile_step=1, module_depth=-1, top_modules=3, detailed=True, output_file=None) - Print the model profile.
  • stop_profile() - Stop profiling. This stops the counting of floating-point operations in the model.
  • end_profile() - Clean up. This removes the profiling attributes added to the model during profiling. It should be called at the end of profiling, after calling get_total_flops, get_total_params, or print_model_profile.

Training Workflow Example

Below is an example of using this method in a typical training workflow.

from deepspeed.profiling.flops_profiler import FlopsProfiler

model = Model()
prof = FlopsProfiler(model)

profile_step = 5
print_profile = True

for step, batch in enumerate(data_loader):
  # start profiling at training step "profile_step"
  if step == profile_step:
    prof.start_profile()

  # forward() method
  loss = model(batch)

  # end profiling and print output
  if step == profile_step: # if using multi nodes, check global_rank == 0 as well
    prof.stop_profile()
    flops = prof.get_total_flops()
    macs = prof.get_total_macs()
    params = prof.get_total_params()
    if print_profile:
        prof.print_model_profile(profile_step=profile_step)
    prof.end_profile()

  # runs backpropagation
  loss.backward()

  # weight update
  optimizer.step()

Origin blog.csdn.net/just_sort/article/details/131402383