How to Reduce “Cuda Memcpy Async” Events and Why You Should Beware of Boolean Mask Operations

This is the third part of a series of posts on the topic of analyzing and optimizing PyTorch models using PyTorch Profiler and TensorBoard. Our intention has been to highlight the benefits of performance profiling and optimization of GPU-based training workloads and their potential impact on the speed and cost of training. In particular, we wish to demonstrate the accessibility of profiling tools such as PyTorch Profiler and TensorBoard to all ML developers. You do not need to be a CUDA expert in order to derive meaningful performance gains from applying the techniques we discuss in our posts.
In our first post we demonstrated how the different views of the PyTorch Profiler TensorBoard plugin can be used to identify performance issues and reviewed a few popular techniques for accelerating training. In the second post we showed how the TensorBoard plugin Trace View can be used to identify when tensors are being copied from the CPU to the GPU, and back. Such movement of data, which can cause points of synchronization and slow the speed of training considerably, is often unintentional and can sometimes be easily avoided. The topic of this post will be situations in which we encounter points of synchronization between the GPU and CPU that are not related to tensor copies. As in the case of tensor copies, these can cause stagnation in your training step and slow the overall time of your training considerably. We will demonstrate the existence of such occurrences, how they can be identified using PyTorch Profiler and the PyTorch Profiler TensorBoard plugin Trace View, and the potential performance benefits of building your model in a way that minimizes such synchronization events.
As in our previous posts, we will define a toy PyTorch model and then iteratively profile its performance, identify bottlenecks, and attempt to fix them. We will run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) using the official AWS PyTorch 2.0 Docker image. Keep in mind that some of the behaviors we describe may vary between versions of PyTorch.
In the code blocks below we introduce a toy PyTorch model that performs semantic segmentation on a 256×256 input image, i.e., it takes a 256×256 RGB image as input and outputs a 256×256 map of “per-pixel” labels from a set of ten semantic categories.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import torch.profiler
import torch.utils.data
from torch import Tensor

class Net(nn.Module):
    def __init__(self, num_hidden=10, num_classes=10):
        super().__init__()
        self.conv_in = nn.Conv2d(3, 10, 3, padding='same')
        hidden = []
        for i in range(num_hidden):
            hidden.append(nn.Conv2d(10, 10, 3, padding='same'))
            hidden.append(nn.ReLU())
        self.hidden = nn.Sequential(*hidden)
        self.conv_out = nn.Conv2d(10, num_classes, 3, padding='same')

    def forward(self, x):
        x = F.relu(self.conv_in(x))
        x = self.hidden(x)
        x = self.conv_out(x)
        return x
To train our model we will use the standard cross-entropy loss with a few modifications:
- We will assume that the target labels include an ignore value indicating pixels that we want to exclude from the loss calculation.
- We will assume that one of the semantic labels identifies certain pixels as belonging to the “background” of the image. We define our loss function to treat these as ignore labels.
- We will update our model weights only when we encounter batches with target tensors that include at least two unique values.
While we have chosen these modifications for the purposes of our demonstration, these types of operations are not uncommon and can be found in many “standard” PyTorch models. Since we are already “experts” at performance profiling, we have already gone ahead and wrapped each of the operations in our loss function with a torch.profiler.record_function context manager (as described in our second post).
class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        self.loss = torch.nn.CrossEntropyLoss()

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:
        # create a boolean mask of valid labels
        with torch.profiler.record_function('create mask'):
            mask = target != self.ignore_val
        # permute the logits in preparation for masking
        with torch.profiler.record_function('permute'):
            permuted_pred = torch.permute(pred, [0, 2, 3, 1])
        # apply the boolean mask to the targets and logits
        with torch.profiler.record_function('mask'):
            masked_target = target[mask]
            masked_pred = permuted_pred[mask.unsqueeze(-1).expand(-1, -1, -1,
                                                        self.num_classes)]
            masked_pred = masked_pred.reshape(-1, self.num_classes)
        # calculate the cross-entropy loss
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(masked_pred, masked_target)
        return loss

    def ignore_background(self, target: Tensor) -> Tensor:
        # identify all indices where the target label is "background"
        with torch.profiler.record_function('non_zero'):
            inds = torch.nonzero(target == self.num_classes - 1, as_tuple=True)
        # reset all "background" labels to the ignore index
        with torch.profiler.record_function('index assignment'):
            target[inds] = self.ignore_val
        return target

    def forward(self, pred: Tensor, target: Tensor) -> Tensor:
        # ignore background labels
        target = self.ignore_background(target)
        # retrieve a list of unique elements in target
        with torch.profiler.record_function('unique'):
            unique = torch.unique(target)
        # check if the number of unique items passes the threshold
        with torch.profiler.record_function('numel'):
            ignore_loss = torch.numel(unique) < 2
        # calculate the cross-entropy loss
        loss = self.cross_entropy(pred, target)
        # zero the loss in the case that the number of unique elements
        # is below the threshold
        if ignore_loss:
            loss = 0. * loss
        return loss
Our loss function seems innocent enough, right? Wrong! As we will see below, the loss function includes a number of operations that trigger host-device synchronization events and slow the speed of training considerably, none of which involve copying tensors into or out of the GPU. As in our previous post, we challenge you to try to identify three opportunities for performance optimization before reading on.
For the purposes of our demo, we use randomly generated images and per-pixel label maps, as defined below.
from torch.utils.data import Dataset

# A dataset with random images and label maps
class FakeDataset(Dataset):
    def __init__(self, num_classes=10):
        super().__init__()
        self.num_classes = num_classes
        self.img_size = [256, 256]

    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3]+self.img_size, dtype=torch.float32)
        rand_label = torch.randint(low=-1, high=self.num_classes,
                                   size=self.img_size)
        return rand_image, rand_label

train_set = FakeDataset()
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=8,
                                           pin_memory=True)
Last, we define our training step with the PyTorch Profiler configured to our liking:
device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = MaskedLoss().cuda(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# training loop wrapped with profiler object
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)
        if step >= (1 + 4 + 3) * 1:
            break
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()
If you were to naively run this training script, you would probably see high GPU utilization (~90%) and not know that anything was wrong with it. It is only through profiling that we are able to identify the underlying performance bottlenecks and potential opportunities for training acceleration. So, without further ado, let’s see how our model performs.
In this post we will focus on the Trace View of the PyTorch Profiler TensorBoard plugin. Please see our previous posts for tips on how to use some of the other views supported by the plugin.
In the image below we show the Trace View of a single training step of our toy model.
We can clearly see that our 1.3-second-long training step is completely dominated by the torch.nonzero operator in the first line of our loss function. All the other operations appear bunched together on either side of the huge cudaMemcpyAsync event. What is going on??!! Why would such a seemingly innocent operation cause such a huge eyesore?
Perhaps we should not be so surprised, since the torch.nonzero documentation does include the following note: “When input is on CUDA, torch.nonzero() causes host-device synchronization.” The need for synchronization arises from the fact that, contrary to other common PyTorch ops, the size of the tensor returned by torch.nonzero is not pre-determined. The CPU does not know ahead of time how many non-zero elements there are in the input tensor. It needs to wait for the sync event from the GPU in order to perform the appropriate GPU memory allocation and properly prepare the subsequent PyTorch ops.
Note that the length of the cudaMemcpyAsync event is not indicative of the complexity of the torch.nonzero op, but rather reflects the amount of time that the CPU needs to wait for the GPU to finish all of the previous kernels that the CPU launched. For example, were we to make an additional torch.nonzero call immediately after our first one, the second cudaMemcpyAsync event would appear significantly shorter than the first since the CPU and GPU are already roughly “in sync”. (Keep in mind that this explanation is coming from a non-CUDA expert, so make of it what you will…)
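As a side note, if you want to surface these synchronization points without inspecting a full trace, PyTorch provides torch.cuda.set_sync_debug_mode, which warns (or raises an error) whenever an operation forces the CPU to synchronize with the GPU. The short sketch below was not part of our original experiments; it is one possible way of flagging the torch.nonzero call, the tensor names are illustrative only, and it assumes a CUDA device is available.

import torch

# ask PyTorch to warn whenever an op forces a host-device synchronization;
# use "error" instead of "warn" to raise an exception at the offending call
torch.cuda.set_sync_debug_mode("warn")

target = torch.randint(-1, 10, (256, 256), device="cuda")

# this call should trigger a synchronization warning, since the size of the
# returned index tensor is only known after the GPU kernel completes
inds = torch.nonzero(target == 9, as_tuple=True)

# restore the default behavior
torch.cuda.set_sync_debug_mode("default")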
Now that we understand the source of the bottleneck, the challenge becomes finding an alternative sequence of operations that performs the same logic but does not trigger a host-device synchronization event. In the case of our loss function, we can easily accomplish this using the torch.where operator as shown in the code block below:
def ignore_background(self, target: Tensor) -> Tensor:
    with torch.profiler.record_function('update background'):
        target = torch.where(target == self.num_classes - 1,
                             -1 * torch.ones_like(target), target)
    return target
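As a quick sanity check (not part of the original experiments), we can verify on a small random tensor that the torch.where formulation produces exactly the same result as the index-assignment version it replaces; the variable names below are illustrative only.

import torch

num_classes = 10
ignore_val = -1
target = torch.randint(low=-1, high=num_classes, size=(4, 8, 8))

# original approach: torch.nonzero followed by index assignment
ref = target.clone()
inds = torch.nonzero(ref == num_classes - 1, as_tuple=True)
ref[inds] = ignore_val

# synchronization-free approach: torch.where
alt = torch.where(target == num_classes - 1,
                  -1 * torch.ones_like(target), target)

assert torch.equal(ref, alt)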
In the image below we show the Trace View following this change.
While we have succeeded in removing the cudaMemcpyAsync coming from the torch.nonzero op, it has been immediately replaced with one coming from the torch.unique op, and our step time has not budged. Here the PyTorch documentation is less kind, but based on our previous experience we can assume that, once again, we are suffering from a host-device synchronization event due to our use of tensors with undetermined size.
Replacing the torch.unique operator with an equivalent alternative is not always possible. However, in our case we do not actually need to know the values of the unique labels; we need to know only the number of unique labels. This can be calculated by applying the torch.sort op to the flattened target tensor and counting the number of steps in the resultant step function.
def forward(self, pred: Tensor, target: Tensor) -> Tensor:
    # ignore background labels
    target = self.ignore_background(target)
    # sort the list of labels
    with torch.profiler.record_function('sort'):
        sorted, _ = torch.sort(target.flatten())
    # identify the steps of the resultant step function
    with torch.profiler.record_function('deriv'):
        deriv = sorted[1:] - sorted[:-1]
    # count the number of steps
    with torch.profiler.record_function('count_nonzero'):
        num_unique = torch.count_nonzero(deriv) + 1
    # calculate the cross-entropy loss
    loss = self.cross_entropy(pred, target)
    # zero the loss in the case that the number of unique elements
    # is below the threshold
    with torch.profiler.record_function('where'):
        loss = torch.where(num_unique < 2, 0. * loss, loss)
    return loss
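To convince ourselves that the sort-based count is equivalent, the short snippet below (illustrative only, not taken from the original experiments) compares it to torch.unique on a random label tensor.

import torch

target = torch.randint(low=-1, high=10, size=(4, 8, 8))

# reference: number of unique labels via torch.unique
ref = torch.unique(target).numel()

# sort-based count: number of "steps" in the sorted label sequence, plus one
sorted_vals, _ = torch.sort(target.flatten())
deriv = sorted_vals[1:] - sorted_vals[:-1]
alt = torch.count_nonzero(deriv) + 1

assert alt.item() == ref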
In the image below we capture the Trace View following our second optimization:
Once again, we have solved one bottleneck only to be faced with a new one, this time coming from the boolean mask routine.
Boolean masking is a routine we commonly use in order to reduce the overall number of machine operations that are required. In our case, our intention was to reduce the amount of computation by removing the “ignore” pixels and limiting the cross-entropy calculation to the pixels of interest. Clearly, this has backfired. As before, applying a boolean mask results in a tensor of undetermined size, and the cudaMemcpyAsync that it triggers greatly overshadows any of the savings from excluding the “ignore” pixels.
In our case, fixing this issue is rather simple since the PyTorch CrossEntropyLoss has a built-in option for setting an ignore_index.
class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        self.loss = torch.nn.CrossEntropyLoss(ignore_index=-1)

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(pred, target)
        return loss
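For completeness, here is a small illustrative check (not from the original experiments) that the built-in ignore_index option computes the same value as the explicit boolean-masking approach it replaces; the shapes and names are ours.

import torch

num_classes = 10
pred = torch.randn(2, num_classes, 8, 8)            # logits: N x C x H x W
target = torch.randint(-1, num_classes, (2, 8, 8))  # labels with ignore value -1

# built-in ignore_index handling
builtin = torch.nn.CrossEntropyLoss(ignore_index=-1)(pred, target)

# explicit boolean masking (the approach we just removed)
mask = target != -1
masked_pred = pred.permute(0, 2, 3, 1)[mask]   # shape: (num_valid, C)
masked_target = target[mask]
manual = torch.nn.CrossEntropyLoss()(masked_pred, masked_target)

assert torch.allclose(builtin, manual)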
In the image below we show the resultant Trace View:
Holy cow!! Our step time has dropped all the way down to 5.4 milliseconds. That is 240 (!!) times faster than what we started with. By simply changing around a few function calls and without any modification to the loss function logic, we were able to optimize the performance of the training step dramatically.
Important Note: In the toy example we have chosen, the steps that we took to reduce the number of cudaMemcpyAsync events had a clear impact on the training step time. However, there may be situations where the same types of changes will harm performance rather than improve it. For example, in the case of boolean masking, if our mask is extremely sparse and the original tensors extremely large, the savings in computation from applying the mask might outweigh the cost of the host-device synchronization. Importantly, the impact of each optimization should be evaluated on a case-by-case basis.
In this post we have focused on performance issues in training applications that are caused by host-device synchronization events. We saw several examples of PyTorch operators that trigger such events; the common property of all of them is that the sizes of the tensors they output are dependent on the input data. You may also encounter synchronization events from other operators not covered in this post. We demonstrated how performance analyzers such as PyTorch Profiler and its associated TensorBoard plugin can be used to identify these kinds of events.
In the case of our toy example, we were able to find equivalent alternatives to the problematic operators that use fixed-size tensors and avoid the need for synchronization events. These led to a significant improvement in training time. However, in practice you might find it much harder, or even impossible, to solve these kinds of bottlenecks. Sometimes, overcoming them might require redesigning parts of your model.