Summary of Tips for Upgrading to PyTorch 2.0

PyTorch 2.0 has been out for some time now — have you started using it? PyTorch 2.0 can significantly speed up training and inference through the new torch.compile API. Unlike eager mode, the compilation API converts the model into an intermediate computation graph (an FX graph) and then compiles it down into low-level compute kernels, which can improve running speed.

For PyTorch 2.0, you might see:

"Just wrapping them with torch.compile calls can speed things up"

But in practice there are many factors that can interfere with graph compilation and/or prevent the desired performance improvement. Reaching the best performance may require tuning the model, redesigning parts of the project, or changing some coding habits.

In this article, we'll demonstrate the use of this new feature and share a few examples of problems we ran into while working with the torch.compile API. These examples are not comprehensive; you may well hit issues that are not mentioned here, and torch.compile is still under active development with plenty of room for improvement.

There are many innovative technologies behind Torch compilation, including TorchDynamo, FX Graph, TorchInductor, Triton, and more. We won't dive into the different components in this post, but if you're interested in those, you can check out the PyTorch documentation, which goes into great detail.

Two offhand comparisons between TensorFlow and PyTorch

1. In the past, there was a clear distinction between PyTorch and TensorFlow: PyTorch used eager execution, TensorFlow used graph mode, and the two developed independently. Then TensorFlow 2 introduced eager execution as the default execution mode, and TensorFlow became a bit like PyTorch. Now PyTorch has introduced its own graph-mode solution and has become a bit like TensorFlow. The TensorFlow vs. PyTorch rivalry continues, but the differences between the two are slowly disappearing.

2. AI development is a trendy industry, but the popular models, architectures, learning algorithms, and training frameworks keep evolving. Looking at published papers, a few years ago most of the models we dealt with were written in TensorFlow. Yet people often complained that the high-level model.fit API limited their development flexibility and that graph mode made debugging hard. Many of them switched to PyTorch, saying "PyTorch lets you build models any way you want and debug them easily". But more flexible custom operations bring more development complexity, so high-level APIs such as PyTorch Lightning emerged precisely to replicate the model.fit experience — and now the same people say "we must adopt PyTorch Lightning, we must use torch.compile to speed up our training". It is hard to be fully flexible and fully simple at the same time.

Getting started

Let's start with a collection of tips on how to use the PyTorch 2 compilation API, as well as some potential problems you may face. Adapting a model to PyTorch's graph mode can take real effort. Hopefully this post will help you evaluate that effort and decide the best way to take this step.

Install PyTorch 2

From the PyTorch installation documentation it looks as if installing PyTorch 2 is no different from installing any other PyTorch version, but in practice you may run into some problems. First, PyTorch 2.0 (as of this writing) requires Python 3.8 or higher. Second, PyTorch 2 includes package dependencies that did not exist in previous versions (most notably pytorch-triton), and these may introduce new conflicts that need to be resolved.

So if you are familiar with Docker, it is recommended to use the container directly, which will be much simpler.
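A minimal post-install sanity check from Python might look like the sketch below (the triton import assumes a CUDA build of PyTorch 2.0, which is what pulls in the pytorch-triton dependency):

 import torch

 print(torch.__version__)            # should start with "2."
 print(torch.cuda.is_available())    # True if the GPU build and drivers are set up
 print(hasattr(torch, "compile"))    # torch.compile only exists in PyTorch 2.x

 # the CUDA builds of PyTorch 2.0 pull in a Triton package; importing it
 # confirms the new dependency landed without conflicts
 import triton
 print(triton.__version__)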

PyTorch 2 Compatibility

One of the advantages of PyTorch 2 is that it is fully backwards compatible: even if we don't use torch.compile, we can still run on PyTorch 2.0 and benefit from its other new features and enhancements. At worst you miss out on the speed improvement, but there are no compatibility issues. If you want the additional speed, read on.
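As a sketch of how to exploit this backward compatibility, the compile call can be guarded so that the same script still runs (in eager mode) on older PyTorch versions; the model below is just a stand-in:

 import torch
 import torch.nn as nn

 model = nn.Linear(128, 10)  # stand-in for any nn.Module

 # only wrap with torch.compile when it is available (PyTorch 2.x);
 # otherwise keep the plain eager-mode module
 if hasattr(torch, "compile"):
   model = torch.compile(model)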

A simple example

Let's start with an example of a simple image classification model. In the code block below, we build a basic Vision Transformer (ViT) model using the timm Python package (version 0.6.12) and train it for 500 steps (not epochs) on a fake dataset. The use_compile flag is defined here to control whether to perform model compilation (torch.compile), and use_amp controls whether to use automatic mixed precision (AMP) or full precision (FP) operation.

 import time, os
 import torch
 from torch.utils.data import Dataset
 from timm.models.vision_transformer import VisionTransformer

 use_amp = True      # toggle to enable/disable amp
 use_compile = True  # toggle to use eager/graph execution mode

 # use a fake dataset (random data)
 class FakeDataset(Dataset):
   def __len__(self):
     return 1000000

   def __getitem__(self, index):
     rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
     label = torch.tensor(data=[index % 1000], dtype=torch.int64)
     return rand_image, label

 def train():
   device = torch.cuda.current_device()
   dataset = FakeDataset()
   batch_size = 64

   # define an image classification model with a ViT backbone
   model = VisionTransformer()

   if use_compile:
     model = torch.compile(model)

   model.to(device)

   optimizer = torch.optim.Adam(model.parameters())
   data_loader = torch.utils.data.DataLoader(dataset,
                           batch_size=batch_size, num_workers=4)
   loss_function = torch.nn.CrossEntropyLoss()

   t0 = time.perf_counter()
   summ = 0
   count = 0

   for idx, (inputs, target) in enumerate(data_loader, start=1):
     inputs = inputs.to(device)
     targets = torch.squeeze(target.to(device), -1)

     optimizer.zero_grad()

     with torch.cuda.amp.autocast(
       enabled=use_amp,
       dtype=torch.bfloat16
     ):
       outputs = model(inputs)
       loss = loss_function(outputs, targets)

     loss.backward()
     optimizer.step()

     batch_time = time.perf_counter() - t0

     if idx > 10:  # skip first few steps
       summ += batch_time
       count += 1
     t0 = time.perf_counter()
     if idx > 500:
       break

   print(f'average step time: {summ/count}')

 if __name__ == '__main__':
   train()

The comparative performance results are recorded in the table below. These results vary greatly depending on the environment, so treat them only as a reference.

The performance improvement from model compilation is much more pronounced with AMP (28.6%) than with FP (4.5%); this is a well-known difference. If you are not yet training with AMP, the bigger speedup comes from switching from FP to AMP in the first place, so it is recommended to adopt AMP first. Also note that the performance boost comes with a very slight increase in GPU memory utilization.

When scaling to multiple GPUs, comparative performance may change due to the way distributed training is implemented on compiled graphs. See the official documentation for details.

https://pytorch.org/get-started/pytorch-2.0/#distributed
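As a rough sketch of combining the two (the helper name is made up for illustration, and the recommended ordering of wrapping and compiling may differ between versions, so check the linked documentation), a common pattern is to wrap the module with DistributedDataParallel and then compile the wrapped module:

 import torch
 import torch.distributed as dist
 from torch.nn.parallel import DistributedDataParallel as DDP
 from timm.models.vision_transformer import VisionTransformer

 def setup_compiled_ddp(rank, world_size):
   # assumes MASTER_ADDR / MASTER_PORT are already set in the environment
   dist.init_process_group('nccl', rank=rank, world_size=world_size)
   torch.cuda.set_device(rank)

   model = VisionTransformer().to(rank)
   # wrap with DDP first, then compile the wrapped module; TorchDynamo's
   # DDP optimizer splits the graph at bucket boundaries so that gradient
   # communication can still overlap with computation
   ddp_model = DDP(model, device_ids=[rank])
   return torch.compile(ddp_model)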

Advanced options

The compile API includes a number of options for controlling graph creation, enabling fine-tuning of the compilation for a particular model and possibly further performance improvements. The following code block shows the function's official signature and docstring:

 def compile(model: Optional[Callable] = None, *,
             fullgraph: builtins.bool = False,
             dynamic: builtins.bool = False,
             backend: Union[str, Callable] = "inductor",
             mode: Union[str, None] = None,
             options: Optional[Dict[str, Union[str, builtins.int, builtins.bool]]] = None,
             disable: builtins.bool = False) -> Callable:
     """
     Optimizes given model/function using TorchDynamo and specified backend.

     Args:
        model (Callable): Module/function to optimize
        fullgraph (bool): Whether it is ok to break model into several subgraphs
        dynamic (bool): Use dynamic shape tracing
        backend (str or Callable): backend to be used
        mode (str): Can be either "default", "reduce-overhead" or "max-autotune"
        options (dict): A dictionary of options to pass to the backend.
        disable (bool): Turn torch.compile() into a no-op for testing
     """

mode (compilation mode): lets you choose between reducing the per-step overhead added by the framework ("reduce-overhead") and maximizing potential performance gains ("max-autotune").
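For example, selecting a mode is just a keyword argument on the same call; a minimal sketch using a stand-in module:

 import torch
 import torch.nn as nn

 model = nn.Linear(128, 10)  # stand-in for any nn.Module

 # default mode: a balance of compile time and runtime speedup
 model_default = torch.compile(model)

 # "reduce-overhead": trades extra memory for lower per-step overhead,
 # most useful for small models / small batches
 model_low_overhead = torch.compile(model, mode="reduce-overhead")

 # "max-autotune": searches for the fastest kernels at the cost of a
 # much longer compilation time
 model_autotuned = torch.compile(model, mode="max-autotune")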

The table below compares the results of compiling the above ViT models in different compilation modes.

The compile modes behave much as their names suggest: "reduce-overhead" cuts per-step overhead at the cost of additional memory utilization, and "max-autotune" achieves the best performance at the cost of a much longer compilation time.

backend (compiler backend): which backend the API uses to convert the intermediate representation (the FX graph) into low-level kernel operations. This option is very useful for debugging graph compilation issues and for better understanding the internals of torch.compile. In most cases the default Inductor backend seems to give the best training performance. There are many available backends, which can be listed with the following command:

 from torch import _dynamo
 print(_dynamo.list_backends())

We tested the nvprims-nvfuser backend, which achieved a 13% performance improvement over eager mode (compared to the 28.6% improvement of the default backend). For the differences between backends, refer to the PyTorch documentation; we will not go into the details here.
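Selecting a backend works the same way; the snippet below is a sketch (the available names depend on your installation, which is why listing them as shown above is useful):

 import torch
 import torch.nn as nn
 from torch import _dynamo

 print(_dynamo.list_backends())   # names vary with the installed version

 model = nn.Linear(128, 10)       # stand-in for any nn.Module

 # the default backend, usually the best choice for training performance
 model_inductor = torch.compile(model, backend="inductor")

 # "eager" and "aot_eager" skip the backend compiler and are mainly useful
 # for isolating where a compilation failure comes from (see the debugging
 # section below)
 model_aot_eager = torch.compile(model, backend="aot_eager")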

fullgraph (force a single graph): this parameter is very useful for ensuring there is no undesired graph truncation.

dynamic (dynamic shapes): compilation support for tensors with dynamic shapes is currently somewhat limited in 2.0. The common behavior when shapes change is to recompile, which adds significant overhead and slows training considerably. If your model does contain dynamic shapes, setting the dynamic flag to True can yield better performance, in particular by reducing the number of recompilations.

What is a dynamic shape? The simplest example is time series or text whose lengths differ: if the sequences are not padded to a common length, the varying sequence length is a dynamic shape.
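A minimal sketch of these two flags on a stand-in model:

 import torch
 import torch.nn as nn

 model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

 # fullgraph=True makes compilation fail loudly on the first graph break,
 # instead of silently splitting the model into multiple subgraphs
 strict_model = torch.compile(model, fullgraph=True)

 # dynamic=True asks the compiler to trace with symbolic shapes, which can
 # reduce the number of recompilations when input sizes vary between steps
 dynamic_model = torch.compile(model, dynamic=True)

 x = torch.randn(4, 128)
 print(strict_model(x).shape)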

Performance analysis

PyTorch Profiler is one of the key tools used to analyze the performance of PyTorch models. It can evaluate and analyze the way graph compilation optimizes training steps. In the following code block, we use the profiler to generate TensorBoard results to see the performance of the training:

 out_path = os.path.join(os.environ.get('SM_MODEL_DIR', '/tmp'), 'profile')
 from torch.profiler import profile, ProfilerActivity
 with profile(
         activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
         schedule=torch.profiler.schedule(
           wait=20,
           warmup=5,
           active=10,
           repeat=1),
         on_trace_ready=torch.profiler.tensorboard_trace_handler(
                                               dir_name=out_path)
 ) as p:
   for idx, (inputs, target) in enumerate(data_loader, start=1):
     inputs = inputs.to(device)
     targets = torch.squeeze(target.to(device), -1)
     optimizer.zero_grad()

     with torch.cuda.amp.autocast(
       enabled=use_amp,
       dtype=torch.bfloat16
     ):
       outputs = model(inputs)
       loss = loss_function(outputs, targets)
     loss.backward()
     optimizer.step()
     p.step()

The image below is taken from TensorBoard generated by PyTorch Profiler. It provides details of the kernels running on the GPU during the training step of the compiled model experiment above.

We can see that torch.compile increased the utilization of the GPU's Tensor Cores (from 51% to 60%) and that it introduced GPU kernels developed with Triton.

Debug model compilation issues

torch.compile is currently in beta. If you run into problems and you are lucky, you will get an informative error message that you can search for directly (or ask ChatGPT about). If you are not so lucky, you will have to track down the source of the problem yourself.

The main resource for troubleshooting compilation issues is the TorchDynamo troubleshooting documentation, which lists debugging tools and provides step-by-step guidance for diagnosing errors. Currently, however, these tools and techniques seem aimed more at PyTorch developers than at PyTorch users; they may point to the underlying cause of a compilation problem, but there is a good chance they won't actually help. So what then?

Below we demonstrate a do-it-yourself debugging process; following this way of thinking, at least some problems can be solved.

Below is a simple distributed model that includes a call to torch.distributed.all_reduce. The model runs as expected in eager mode but fails during graph compilation with an attribute error (torch.classes.c10d.ProcessGroup does not have a field with name 'shape'). By raising the log level to INFO, we find that the error occurs in "step 3" of the compilation, i.e. in TorchInductor. We then verify that compilation succeeds with the "eager" and "aot_eager" backends, and finally use the PyTorch Minifier to create a minimal code example that reproduces the failure.

 import os, logging
 import torch
 from torch import _dynamo

 # enable debug prints
 torch._dynamo.config.log_level = logging.INFO
 torch._dynamo.config.verbose = True

 # uncomment to run minifier
 # torch._dynamo.config.repro_after = "aot"

 def build_model():
   import torch.nn as nn
   import torch.nn.functional as F

   class DumbNet(nn.Module):
     def __init__(self):
       super().__init__()
       self.conv1 = nn.Conv2d(3, 6, 5)
       self.pool = nn.MaxPool2d(2, 2)
       self.fc1 = nn.Linear(1176, 10)

     def forward(self, x):
       x = self.pool(F.relu(self.conv1(x)))
       x = torch.flatten(x, 1)
       x = self.fc1(x)
       with torch.no_grad():
         sum_vals = torch.sum(x, 0)
         # this is the problematic line of code
         torch.distributed.all_reduce(sum_vals)
       # add noise
       x = x + 0.1 * sum_vals
       return x

   net = DumbNet()
   return net

 def train():
   os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR',
                                              'localhost')
   os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT',
                                              str(2222))
   torch.distributed.init_process_group('nccl', rank=0,
                                        world_size=1)
   torch.cuda.set_device(0)
   device = torch.cuda.current_device()

   model = build_model()

   model = torch.compile(model)

   # replace with this to verify that the error is not in TorchDynamo
   # model = torch.compile(model, backend='eager')
   # replace with this to verify that the error is not in AOTAutograd
   # model = torch.compile(model, backend='aot_eager')

   model.to(device)

   rand_image = torch.randn([4, 3, 32, 32], dtype=torch.float32).to(device)

   model(rand_image)

 if __name__ == '__main__':
   train()

In this example, running the generated minifier_launcher.py script results in a different attribute error ('Repro' object has no attribute '_tensor_constant0'), which is not very helpful for our purposes, so we set it aside for now. This also shows that torch.compile is not perfect yet and still has room for improvement. If you cannot resolve a problem like this, it may be simpler not to compile that part of the model; after all, "slower but working" beats "not working at all" (and the speed improvement is limited anyway).

Common graph truncation issues

One of the advantages of PyTorch eager mode is the ability to interleave pure Python code with PyTorch operations. When using torch.compile, this freedom is much more limited: Python-side operations cause TorchDynamo to split the computation graph into multiple components (graph breaks), hindering the potential performance gains. The goal of our code optimization is to reduce such graph truncation as much as possible. The easiest way is to compile the model with the fullgraph flag set: the resulting errors prompt us to remove any code that causes graph breaks, and also teach us how to best adapt our coding habits to PyTorch 2. However, for distributed code it must be set to False, because the current way of implementing communication between GPUs requires graph splitting. We can also use the torch._dynamo.explain utility to analyze graph truncation.

The following code block demonstrates a simple model whose forward pass contains four potential graph breaks; usage patterns like these are not uncommon in typical PyTorch models.

 import torch
 from torch import _dynamo
 import numpy as np

 def build_model():
   import torch.nn as nn
   import torch.nn.functional as F

   class DumbNet(nn.Module):
     def __init__(self):
       super().__init__()
       self.conv1 = nn.Conv2d(3, 6, 5)
       self.pool = nn.MaxPool2d(2, 2)
       self.fc1 = nn.Linear(1176, 10)
       self.fc2 = nn.Linear(10, 10)
       self.fc3 = nn.Linear(10, 10)
       self.fc4 = nn.Linear(10, 10)
       self.d = {}

     def forward(self, x):
       x = self.pool(F.relu(self.conv1(x)))
       x = torch.flatten(x, 1)
       assert torch.all(x >= 0) # graph break
       x = self.fc1(x)
       self.d['fc1-out'] = x.sum().item() # graph break
       x = self.fc2(x)
       for k in np.arange(1): # graph break
         x = self.fc3(x)
       print(x)  # graph break
       x = self.fc4(x)
       return x

   net = DumbNet()
   return net

 def train():
   model = build_model()
   rand_image = torch.randn([4, 3, 32, 32], dtype=torch.float32)
   explanation = torch._dynamo.explain(model, rand_image)
   print(explanation)

 if __name__ == '__main__':
   train()

Graph truncation will not cause compilation to fail (unless the fullgraph flag is set). So it's quite possible that the model is compiling and running, but actually contains multiple graph truncations, which slows it down.

Troubleshooting Training Issues

For now, getting a model to compile successfully with PyTorch 2 can be considered an achievement worth celebrating, but it does not guarantee that training will succeed.

The low-level kernels that run on the GPU differ between eager mode and graph mode, and certain operations may exhibit different behavior. You may find that operations which run fine in eager mode fail in graph mode (e.g., torch.argmin), or that numerical differences in the computations affect training.

Debugging in graph mode is much more difficult than debugging in eager mode. In eager mode, each line of code is executed independently, so we can place a breakpoint at any point in the code and inspect the tensor values there. In graph mode, however, the model defined by the code undergoes multiple transformations before being executed, and the breakpoint may never be triggered.

So a practical approach is to develop and debug in eager mode first; once the model runs correctly, apply torch.compile to each part separately, or deliberately introduce graph breaks (for example by inserting print or Tensor.numpy calls) so that breakpoints in the affected code are triggered again. In other words, using torch.compile lengthens development time, so the trade-off between training speed and development speed is yours to make.
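One way to implement this incremental approach (a sketch with a hypothetical two-part model; the submodule names are made up for illustration) is to compile submodules one at a time, or to switch compilation off with the documented disable flag while chasing a bug:

 import torch
 import torch.nn as nn

 class TwoPartModel(nn.Module):
   def __init__(self):
     super().__init__()
     self.backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
     self.head = nn.Linear(64, 10)

   def forward(self, x):
     return self.head(self.backbone(x))

 model = TwoPartModel()

 # compile only the backbone first; if training still behaves correctly,
 # move on and compile the head (or the whole model) as well
 model.backbone = torch.compile(model.backbone)

 # while debugging, disable=True turns the compile call into a no-op
 # without having to remove it from the code
 debug_model = torch.compile(TwoPartModel(), disable=True)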

And don't forget what we said above: your model may not run correctly after adding torch.compile, which is another hidden cost.

Include the loss function in the graph

Graph mode is enabled by wrapping a PyTorch model (or function) in a torch.compile call, but the loss function is typically not part of that call and therefore not part of the generated graph. Usually the loss is a relatively small part of the training step and incurs little overhead when run in eager mode. However, if you have a computationally expensive loss function, performance can be further improved by including it in the compiled computation graph as well.

In the code below, we define a loss function to perform model distillation from a large ViT model (with 24 ViT blocks) to a smaller ViT model (with 12 ViT blocks).

 import torch
 from timm.models.vision_transformer import VisionTransformer

 class ExpensiveLoss(torch.nn.Module):
   def __init__(self):
     super(ExpensiveLoss, self).__init__()
     self.expert_model = VisionTransformer(depth=24)
     if torch.cuda.is_available():
       self.expert_model.to(torch.cuda.current_device())
     self.mse_loss = torch.nn.MSELoss()

   def forward(self, input, outputs):
     expert_output = self.expert_model(input)
     return self.mse_loss(outputs, expert_output)

This is a loss function that is much more computationally intensive than CrossEntropyLoss. Here are two methods to make it perform faster.

1. Wrap the loss function in its own torch.compile call, as follows:

 loss_function = ExpensiveLoss()
 compiled_loss = torch.compile(loss_function)

The disadvantage of this method is that the compiled graph of the loss function is disjoint from the compiled graph of the model; its advantage is obvious: simplicity.

2. Create a wrapper model that contains both the model and the loss, compile them together, and return the resulting loss as the output.

 import time, os
 import torch
 from torch.utils.data import Dataset
 from torch import nn
 from timm.models.vision_transformer import VisionTransformer

 # use a fake dataset (random data)
 class FakeDataset(Dataset):
   def __len__(self):
     return 1000000

   def __getitem__(self, index):
     rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
     label = torch.tensor(data=[index % 1000], dtype=torch.int64)
     return rand_image, label

 # create a wrapper model for the ViT model and loss
 class SuperModel(torch.nn.Module):
   def __init__(self):
     super(SuperModel, self).__init__()
     self.model = VisionTransformer()
     self.expert_model = VisionTransformer(depth=24 if torch.cuda.is_available() else 2)
     self.mse_loss = torch.nn.MSELoss()

   def forward(self, inputs):
     outputs = self.model(inputs)
     with torch.no_grad():
       expert_output = self.expert_model(inputs)
     return self.mse_loss(outputs, expert_output)

 # a loss that simply passes through the model output
 class PassthroughLoss(nn.Module):
   def __call__(self, model_output):
     return model_output

 def train():
   device = torch.cuda.current_device()
   dataset = FakeDataset()
   batch_size = 64

   # create and compile the model
   model = SuperModel()
   model = torch.compile(model)

   model.to(device)

   optimizer = torch.optim.Adam(model.parameters())
   data_loader = torch.utils.data.DataLoader(dataset,
                           batch_size=batch_size, num_workers=4)

   loss_function = PassthroughLoss()

   t0 = time.perf_counter()
   summ = 0
   count = 0

   for idx, (inputs, target) in enumerate(data_loader, start=1):
     inputs = inputs.to(device)
     targets = torch.squeeze(target.to(device), -1)

     optimizer.zero_grad()

     with torch.cuda.amp.autocast(
       enabled=True,
       dtype=torch.bfloat16
     ):
       outputs = model(inputs)
       loss = loss_function(outputs)

     loss.backward()
     optimizer.step()

     batch_time = time.perf_counter() - t0

     if idx > 10:  # skip first few steps
       summ += batch_time
       count += 1
     t0 = time.perf_counter()
     if idx > 500:
       break

   print(f'average step time: {summ/count}')

 if __name__ == '__main__':
   train()

The downside of this approach is that the actual model inside needs to be extracted from the wrapper model when running the model in inference mode.
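For example, continuing from the code above, the inner ViT could be pulled out roughly like this (a sketch: the _orig_mod attribute used here to reach the original module inside the compiled wrapper is an implementation detail rather than a stable public API):

 import torch

 # build and compile the wrapper as in train() above
 compiled_super = torch.compile(SuperModel())

 # the compiled object wraps the original SuperModel; the ViT we actually
 # want to serve is the .model attribute of that original module
 inner_vit = compiled_super._orig_mod.model

 inner_vit.eval()
 with torch.no_grad():
   preds = inner_vit(torch.randn(1, 3, 224, 224))
 print(preds.shape)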

Both options yield roughly the same performance improvement of about 8%, which shows that compiling the loss is also a meaningful part of the optimization.

Dynamic shapes

The official documentation also notes that torch.compile has limited support for models with dynamic shapes. The compile API includes a dynamic parameter for signaling this to the compiler, but the extent to which it helps performance is questionable. If you are struggling to compile and optimize a dynamic-shape model, it may simply not be worth the trouble to use torch.compile for it yet.
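As a small illustration with a hypothetical variable-length input, each new input length can trigger a recompilation by default, and dynamic=True asks the compiler to trace the varying dimension symbolically instead:

 import torch
 import torch.nn as nn

 model = nn.Linear(64, 8)

 # with dynamic=True the sequence dimension is traced symbolically, so the
 # graph does not have to be recompiled for every new length seen below
 compiled = torch.compile(model, dynamic=True)

 for seq_len in (10, 37, 101):   # variable-length "sequences"
   x = torch.randn(seq_len, 64)
   print(seq_len, compiled(x).shape)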

Summary

The PyTorch 2.0 compilation mode has the potential to significantly increase training and inference speed, leading to meaningful cost savings, but the amount of effort required to realize this potential varies widely from model to model. Many public models need only a one-line change; for others, especially models with non-standard operations, dynamic shapes, and/or lots of interleaved Python code, the effort may outweigh the benefit or even make compilation impossible. Still, it is a good idea to start adapting your models now, because torch.compile is an important and actively developed feature of PyTorch 2.

https://avoid.overfit.cn/post/dfea563957fc43a19f1aaf7733888031

By Chaim Rand

Origin blog.csdn.net/m0_46510245/article/details/130822098