
PyTorch Model Performance Evaluation and Optimization


How to Use PyTorch Profiler and TensorBoard to Accelerate Training and Reduce Cost


Training deep learning models, especially large ones, can be a costly undertaking. One of the main methods we have at our disposal for managing these costs is performance optimization. Performance optimization is an iterative process in which we continually search for opportunities to increase the performance of our application and then take advantage of those opportunities. In previous posts (e.g., here) we have stressed the importance of having appropriate tools for conducting this analysis. The tools of choice will likely depend on a number of factors, including the type of training accelerator (e.g., GPU, HPU, or other) and the training framework.

Performance Optimization Flow (By Author)

The focus of this post will be on training in PyTorch on GPU. More specifically, we will focus on PyTorch's built-in performance analyzer, PyTorch Profiler, and on one of the ways to view its results, the PyTorch Profiler TensorBoard plugin.

This post is not meant to be a replacement for the official PyTorch documentation on either PyTorch Profiler or the use of the TensorBoard plugin for analyzing the profiler results. Our intention is rather to demonstrate how these tools might be used in the course of one's daily development. In fact, if you haven't already, we recommend that you look over the official documentation before reading this post.

For some time, I have been intrigued by one portion in particular of the TensorBoard-plugin tutorial. The tutorial introduces a classification model (based on the ResNet architecture) that is trained on the popular CIFAR10 dataset. It proceeds to demonstrate how PyTorch Profiler and the TensorBoard plugin can be used to identify and fix a bottleneck in the data loader. Performance bottlenecks in the input data pipeline are not unusual, and we have discussed them at length in some of our previous posts (e.g., here). What is surprising about the tutorial is the final (post-optimization) results that are presented (as of the time of this writing), which we have pasted in below:

Performance Following Optimization (From PyTorch Website)

If you look closely, you will see that the post-optimization GPU utilization is 40.46%. There is no way to sugarcoat this: these results are absolutely abysmal and should keep you up at night. As we have expanded on in the past (e.g., here), the GPU is the most expensive resource in our training machine, and our goal should be to maximize its utilization. A 40.46% utilization result usually represents a significant opportunity for training acceleration and cost savings. Surely, we can do better! In this blog post we will try to do better. We will start by attempting to reproduce the results presented in the official tutorial and see whether we can use the same tools to further improve the training performance.

The code block below contains the training loop defined by the TensorBoard-plugin tutorial, with two minor modifications:

  1. We use a fake dataset with the same properties and behaviors as the CIFAR10 dataset that was used in the tutorial. The motivation for this change can be found here.
  2. We initialize the torch.profiler.schedule with the warmup flag set to 4 and the repeat flag set to 1. We found that this slight increase in the number of warmup steps improves the stability of the profiling results.
import numpy as np
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T
from torchvision.datasets.vision import VisionDataset
from PIL import Image

class FakeCIFAR(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.data = np.random.randint(low=0, high=256, size=(10000, 32, 32, 3), dtype=np.uint8)
        self.targets = np.random.randint(low=0, high=10, size=(10000), dtype=np.uint8).tolist()

    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.data)

transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_set = FakeCIFAR(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True)

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# train step
def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# training loop wrapped with profiler object
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        if step >= (1 + 4 + 3) * 1:
            break
        train(batch_data)
        prof.step()  # need to call this at the end of each step

The GPU used in the tutorial was a Tesla V100-DGXS-32GB. In this post we attempt to reproduce, and improve on, the performance results from the tutorial using an Amazon EC2 p3.2xlarge instance that contains a Tesla V100-SXM2-16GB GPU. Although they share the same architecture, there are some differences between the two GPUs which you can read about here. We ran the training script using an AWS PyTorch 2.0 Docker image. The performance results of the training script, as displayed in the overview page of the TensorBoard viewer, are captured in the image below:

Baseline Performance Results as Shown in the TensorBoard Profiler Overview Tab (Captured by Author)

We first note that, contrary to the tutorial, the Overview page (of torch-tb-profiler version 0.4.1) in our experiment combined the three profiling steps into one. Thus, the average overall step time is 80 milliseconds and not 240 milliseconds as reported. This can be seen clearly in the Trace tab (which, in our experience, almost always provides a more accurate report), where each step takes ~80 milliseconds.

Baseline Performance Results as Shown in the TensorBoard Profiler Trace View Tab (Captured by Author)

Note that our starting point of 31.65% GPU utilization and a step time of 80 milliseconds is different from the starting point presented in the tutorial of 23.54% and 132 milliseconds, respectively. This is likely a result of differences in the training environment, including the GPU type and the PyTorch version. We also note that while the tutorial baseline results clearly diagnose the performance issue as a bottleneck in the DataLoader, our results do not. We have often found that data loading bottlenecks will disguise themselves as a high percentage of “CPU Exec” or “Other” in the Overview tab.

Optimization #1: Multi-process Data Loading

Let's start by applying multi-process data loading as described in the tutorial. Since the Amazon EC2 p3.2xlarge instance has 8 vCPUs, we set the number of DataLoader workers to 8 for maximum performance:

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=8)

The results of this optimization are displayed below:

Results of Multi-proc Data Loading in the TensorBoard Profiler Overview Tab (Captured by Author)

This change to a single line of code more than doubled the GPU utilization (from 31.65% to 72.81%) and more than halved our training step time (from 80 milliseconds down to 37).

This is where the optimization process in the tutorial comes to an end. Although our GPU utilization (72.81%) is quite a bit higher than the result in the tutorial (40.46%), I have no doubt that, like us, you find these results to still be quite unsatisfactory.

Personal commentary that you should feel free to skip: Imagine how much money could be saved globally if PyTorch applied multi-process data loading by default when training on GPU! True, there may be some unwanted side effects to using multiprocessing. Nevertheless, there must be some form of auto-detection algorithm that could be run to rule out potentially problematic scenarios and apply this optimization accordingly.

Optimization #2: Memory Pinning

If we analyze the Trace view of our last experiment, we can see that a significant amount of time (10 out of 37 milliseconds) is still spent on loading the training data into the GPU.

Results of Multi-proc Data Loading in the Trace View Tab (Captured by Author)

To address this, we will apply another PyTorch-recommended optimization for streamlining the data input flow, memory pinning. Using pinned memory can increase the speed of host-to-GPU data copies and, more importantly, allows us to make them asynchronous. This means that we can prepare the next training batch in the GPU in parallel to running the training step on the current batch. For more details, as well as the potential side effects of memory pinning, please see the PyTorch documentation.

This optimization requires changes to two lines of code. First, we set the pin_memory flag of the DataLoader to True.

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=8, pin_memory=True)

Then we modify the host-to-device memory transfer (in the train function) to be non-blocking:

inputs, labels = data[0].to(device=device, non_blocking=True), \
                 data[1].to(device=device, non_blocking=True)

The results of the memory pinning optimization are displayed below:

Results of Memory Pinning in the TensorBoard Profiler Overview Tab (Captured by Author)

Our GPU utilization now stands at a respectable 92.37%, and our step time has further decreased. But we can still do better. Note that despite this optimization, the performance report continues to indicate that we are spending a lot of time copying data into the GPU. We will come back to this in step 4 below.

Optimization #3: Increase Batch Size

For our next optimization, we turn our attention to the Memory View of the last experiment:

Memory View in TensorBoard Profiler (Captured by Author)

The chart shows that out of 16 GB of GPU memory, we are peaking at less than 1 GB of utilization. This is an extreme example of resource under-utilization that often (though not always) indicates an opportunity to boost performance. One way to control the memory utilization is to increase the batch size. In the image below we display the performance results when we increase the batch size to 512 (and the memory utilization to 11.3 GB).

Results of Increasing Batch Size in the TensorBoard Profiler Overview Tab (Captured by Author)

Although the GPU utilization measure did not change much, our training speed has increased considerably, from 1200 samples per second (46 milliseconds for batch size 32) to 1584 samples per second (324 milliseconds for batch size 512).

Caution: Contrary to our previous optimizations, increasing the batch size may affect the behavior of your training application. Different models exhibit different levels of sensitivity to a change in batch size. Some may require nothing more than some tuning of the optimizer settings. For others, adjusting to a large batch size may be more difficult or even impossible. See this previous post for some of the challenges involved in training on large batches.
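For reference, the change itself amounts to a single DataLoader argument. The sketch below is a minimal illustration; the linear learning-rate scaling shown alongside it is one common way of retuning the optimizer for a larger batch, not something taken from the tutorial, and whether it is appropriate depends on your model:

# increase the batch size from 32 to 512 (memory permitting)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=512,
                                           shuffle=True, num_workers=8,
                                           pin_memory=True)

# illustrative only: linear learning-rate scaling for the larger batch;
# whether (and how) to retune the optimizer is model-dependent
base_lr = 0.001
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * 512 / 32,
                            momentum=0.9)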

Optimization #4: Reduce Host to Device Copy

You probably noticed the large red eyesore representing the host-to-device data copy in the pie chart of our previous results. The most direct way of trying to address this kind of bottleneck is to see if we can reduce the amount of data in each batch. Notice that in the case of our image input, we convert the data type from an 8-bit unsigned integer to a 32-bit float and apply normalization before performing the data copy. In the code block below, we propose a change to the input data flow in which we delay the data type conversion and normalization until the data is on the GPU:

# maintain the image input as an 8-bit uint8 tensor
transform = T.Compose(
    [T.Resize(224),
     T.PILToTensor()
     ])
train_set = FakeCIFAR(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=512,
                                           shuffle=True, num_workers=8, pin_memory=True)

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# train step
def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    # convert to float32 and normalize on the GPU
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

As a result of this change, the amount of data being copied from the CPU to the GPU is reduced by 4x, and the red eyesore virtually disappears:

Results of Reducing CPU to GPU Copy in the TensorBoard Profiler Overview Tab (Captured by Author)

We now stand at a new high of 97.51% (!!) GPU utilization and a training speed of 1670 samples per second! Let's see what else we can do.

Optimization #5: Set Gradients to None

At this stage we appear to be fully utilizing the GPU, but that doesn't mean we can't utilize it more effectively. One popular optimization that is said to reduce memory operations on the GPU is to set the model parameter gradients to None rather than zero in each training step. Please see the PyTorch documentation for more details on this optimization. All that is required to implement it is to set the set_to_none argument of the optimizer.zero_grad call to True:

optimizer.zero_grad(set_to_none=True)
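To make the effect concrete: with set_to_none=True, the parameter gradients are detached and set to None rather than being overwritten with zero-filled tensors, so no zero-filling kernels need to be launched. A minimal sketch (our own illustration, not from the tutorial):

optimizer.zero_grad(set_to_none=True)
# the .grad attributes are now None instead of zero-filled tensors
assert all(p.grad is None for p in model.parameters())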

In our case this optimization didn’t boost our performance in any meaningful way.

Optimization #6: Automatic Mixed Precision

The GPU Kernel View displays the amount of time that the GPU kernels were active and can be a helpful resource for improving GPU utilization:

Kernel View in TensorBoard Profiler (Captured by Author)

One of the most glaring details in this report is the lack of use of the GPU Tensor Cores. Available on relatively newer GPU architectures, Tensor Cores are dedicated processing units for matrix multiplication that can boost AI application performance significantly. Their lack of use may represent a major opportunity for optimization.

Since Tensor Cores are specifically designed for mixed-precision computing, one straightforward way to increase their utilization is to modify our model to use Automatic Mixed Precision (AMP). In AMP mode, portions of the model are automatically cast to lower-precision 16-bit floats and run on the GPU Tensor Cores.

Importantly, note that a full implementation of AMP may require gradient scaling, which we do not include in our demonstration. Be sure to review the documentation on mixed precision training before adopting it.

The modification to the training step required to enable AMP is demonstrated in the code block below.

def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    # Note - torch.cuda.amp.GradScaler() may be required
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
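As noted above, a complete AMP setup typically pairs autocast with gradient scaling to guard against fp16 gradient underflow. Below is a minimal sketch of how torch.cuda.amp.GradScaler would slot into the training step; it is not something we used in the experiments reported here:

scaler = torch.cuda.amp.GradScaler()

def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales the gradients, then steps the optimizer
    scaler.update()                # adjusts the scale factor for the next iteration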

The impact on the Tensor Core utilization is displayed in the image below. Although it continues to indicate opportunity for further improvement, with just one line of code the utilization jumped from 0% to 26.3%.

Tensor Core Utilization with AMP Optimization from Kernel View in TensorBoard Profiler (Captured by Author)

In addition to increasing Tensor Core utilization, using AMP lowers the GPU memory utilization, freeing up additional space to increase the batch size. The image below captures the training performance results following the AMP optimization with the batch size set to 1024:

Results of AMP Optimization in the TensorBoard Profiler Overview Tab (Captured by Author)

Although the GPU utilization has slightly decreased, our primary throughput metric has further increased by nearly 50%, from 1670 samples per second to 2477. We are on a roll!

Caution: Lowering the precision of portions of your model may have a meaningful effect on its convergence. As in the case of increasing the batch size (see above), the impact of using mixed precision will vary per model. In some cases, AMP will work with little to no effort. Other times you might need to work a bit harder to tune the autoscaler. Still other times you might need to set the precision types of different portions of the model explicitly (i.e., manual mixed precision).

For more details on using mixed precision as a method for memory optimization, please see our previous blog post on the subject.

Optimization #7: Train in Graph Mode

The final optimization we will apply is model compilation. Contrary to the default PyTorch eager-execution mode, in which each PyTorch operation is run "eagerly", the compile API converts your model into an intermediate computation graph, which it then compiles into low-level compute kernels in a manner that is optimal for the underlying training accelerator. For more on model compilation in PyTorch 2, check out our previous post on the subject.

The following code block demonstrates the change required to apply model compilation:

model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
model = torch.compile(model)

The results of the model compilation optimization are displayed below:

Results of Graph Compilation in the TensorBoard Profiler Overview Tab (Captured by Author)

Model compilation further increases our throughput to 3268 samples per second, compared with 2477 in the previous experiment, an additional 32% (!!) boost in performance.

The way in which graph compilation changes the training step is very evident in the different views of the TensorBoard plugin. The Kernel View, for instance, indicates the use of new (fused) GPU kernels, and the Trace View (shown below) displays an entirely different pattern than what we saw previously.

Results of Graph Compilation in the TensorBoard Profiler Trace View Tab (Captured by Author)

In the table below we summarize the results of the successive optimizations we have applied.

Performance Results Summary (By Author)

By applying our iterative approach of analysis and optimization using PyTorch Profiler and the TensorBoard plugin, we were able to increase performance by 817%!!

Is our work complete? Absolutely not! Each optimization that we implement uncovers new potential opportunities for performance improvement. These opportunities come in the form of resources being freed up (e.g., the way in which moving to mixed precision enabled us to increase the batch size) or in the form of newly uncovered performance bottlenecks (e.g., the way in which our final optimization uncovered a bottleneck in host-to-device data transfer). Moreover, there are many other well-known forms of optimization that we did not attempt in this post (e.g., see here and here). And lastly, new library optimizations (e.g., the model compilation feature that we demonstrated in step 7) are released all the time, further enabling our performance improvement objectives. As we emphasized in the introduction, to fully leverage such opportunities, performance optimization must be an iterative and consistent part of your development workflow.

In this post we have demonstrated the significant potential of performance optimization on a toy classification model. Although there are other performance analyzers you can use, each with their pros and cons, we chose PyTorch Profiler and the TensorBoard plugin due to their ease of integration.

We should emphasize that the path to successful optimization will vary greatly based on the details of the training project, including the model architecture and training environment. In practice, reaching your goals may be harder than in the example we presented here. Some of the techniques we described may have little impact on your performance or might even make it worse. We also note that the precise optimizations we selected, and the order in which we chose to apply them, were somewhat arbitrary. You are highly encouraged to develop your own tools and techniques for reaching your optimization goals based on the particular details of your project.

Performance optimization of machine learning workloads is sometimes viewed as secondary, non-critical, and tedious. I hope that we have succeeded in convincing you that the potential savings in development time and cost warrant a meaningful investment in performance analysis and optimization. And, hey, you might even find it to be fun :).

What Next?

This was just the tip of the iceberg. There is a lot more to performance optimization than we have covered here. In a sequel to this post, we will dive into a performance issue that is quite common in PyTorch models, in which an excessive amount of computation is run on the CPU rather than the GPU, often in a manner that is unbeknownst to the developer. We also encourage you to check out our other posts on Medium, many of which cover different aspects of performance optimization of machine learning workloads.
