Instance Selection for Deep Learning — Part 2
This post was written in collaboration with Tomer Berkovich, Yitzhak Levi, and Max Rabin.
Appropriate instance selection for machine learning (ML) workloads is an important decision with potentially significant implications on the speed and cost of development. In a previous post we expanded on this process, proposed a metric for making this important decision, and highlighted some of the many factors you should take into consideration. In this post we will demonstrate the opportunity for reducing AI model training costs by taking Spot Instance availability into account when making your cloud-based instance selection decision.
One of the most significant opportunities for cost savings in the cloud is to take advantage of low cost Amazon EC2 Spot Instances. Spot instances are discounted compute engines from surplus cloud service capacity. In exchange for the discounted price, AWS maintains the right to preempt the instance with little to no warning. Consequently, the relevance of Spot instance utilization is limited to workloads that are fault tolerant. Fortunately, through effective use of model checkpointing, ML training workloads can be designed to be fault tolerant and to take advantage of the Spot instance offering. In fact, Amazon SageMaker, AWS's managed service for developing ML, makes it easy to train on Spot instances by managing the end-to-end Spot life-cycle for you.
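As a rough illustration of the kind of checkpointing logic that makes a training job resilient to Spot preemptions, the sketch below shows functions for saving and restoring the training state. The checkpoint path follows SageMaker's default local checkpoint directory, but treat the path and the contents of the checkpoint as assumptions to adapt to your own project:

import os
import torch

CKPT_PATH = '/opt/ml/checkpoints/ckpt.pt'  # assumed path; SageMaker syncs this directory to S3

def save_checkpoint(model, optimizer, epoch):
    # persist everything required to resume training after a Spot interruption
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'epoch': epoch},
               CKPT_PATH)

def load_checkpoint(model, optimizer):
    # resume from the last saved state if a checkpoint exists, otherwise start from scratch
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location='cpu')
        model.load_state_dict(ckpt['model'])
        optimizer.load_state_dict(ckpt['optimizer'])
        return ckpt['epoch'] + 1
    return 0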
Unfortunately, Spot instance capacity, which measures the availability of Spot instances for use, is subject to constant fluctuations and can be very difficult to predict. Amazon offers partial assistance in assessing the Spot instance capacity of an instance type of choice via its Spot placement score (SPS) feature, which indicates the likelihood that a Spot request will succeed in a given region or availability zone (AZ). This is especially helpful when you have the freedom to choose to train your model in one of several different locations. However, the SPS feature offers no guarantees.
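For example, here is a sketch of how the Spot placement score could be queried with the AWS Python SDK. The instance type, target capacity, and candidate regions are illustrative assumptions:

import boto3

ec2 = boto3.client('ec2')
response = ec2.get_spot_placement_scores(
    InstanceTypes=['g5.4xlarge'],           # instance type of interest
    TargetCapacity=4,                       # number of instances we intend to request
    SingleAvailabilityZone=True,            # score individual AZs rather than whole regions
    RegionNames=['us-east-1', 'us-west-2']  # candidate regions
)
for entry in response['SpotPlacementScores']:
    # the score is a value from 1 to 10 indicating the likelihood of the request succeeding
    print(entry['Region'], entry.get('AvailabilityZoneId'), entry['Score'])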
When you choose to train a model on one or more Spot instances, you are taking the risk that your instance type of choice does not have any Spot capacity (i.e., your training job will not start), or worse, that you will enter an iterative cycle in which your training repeatedly runs for just a small number of training steps and is stopped before you have made any meaningful progress, which can run up your training costs without any return.
Over the past couple of years, the challenges of Spot instance utilization have been particularly acute when it comes to multi-GPU EC2 instance types such as g5.12xlarge and p4d.24xlarge. A huge increase in demand for powerful training accelerators (driven in part by advances in the field of Generative AI) combined with disruptions in the global supply chain have made it virtually impossible to reliably depend on multi-GPU Spot instances for ML training. The natural fallback is to use the more costly On-Demand (OD) or reserved instances. However, in our previous post we emphasized the value of considering many different alternatives in your choice of instance type. In this post we will demonstrate the potential gains of replacing multi-GPU On-Demand instances with multiple single-GPU Spot instances.
Although our demonstration will use Amazon Web Services, similar conclusions can be reached on alternative cloud service platforms (CSPs). Please do not interpret our choice of CSP or services as an endorsement. The best option for you will depend on the unique details of your project. Furthermore, please take into account the possibility that the type of cost savings we will demonstrate will not reproduce in the case of your project and/or that the solution we propose will not be applicable (e.g., for some reason beyond the scope of this post). Be sure to conduct a detailed evaluation of the relevance and efficacy of the proposal before adapting it to your use case.
Nowadays, training AI models on multiple GPU devices in parallel, a process called distributed training, is commonplace. Setting aside instance pricing, when you have the choice between an instance type with multiple GPUs and multiple instance types with the same type of single GPUs, you would typically choose the multi-GPU instance. Distributed training typically requires a considerable amount of data communication (e.g., gradient sharing) between the GPUs. The proximity of the GPUs on a single instance is bound to facilitate higher network bandwidth and lower latency. Moreover, some multi-GPU instances include dedicated GPU-to-GPU interconnects that can further accelerate the communication (e.g., NVLink on p4d.24xlarge). However, when Spot capacity is limited to single-GPU instances, the option of training on multiple single-GPU instances at a much lower cost becomes more compelling. At the very least, it warrants evaluation of its opportunity for cost savings.
When distributed training runs on multiple instances, the GPUs communicate with one another via the network between the host machines. To optimize the speed of training and reduce the likelihood and/or impact of a network bottleneck, we need to ensure minimal network latency and maximal data throughput. These can be affected by a number of factors.
Instance Collocation
Network latency can be greatly impacted by the relative locations of the EC2 instances. Ideally, when we request multiple cloud-based instances we would like them all to be collocated on the same physical rack. In practice, without appropriate configuration, they may not even be in the same city. In our demonstration below we will use a VPC Config object to program an Amazon SageMaker training job to use a single subnet of an Amazon Virtual Private Cloud (VPC). This technique will ensure that all of the requested training instances will be in the same availability zone (AZ). However, collocation in the same AZ may not suffice. Furthermore, the method we described involves choosing a subnet associated with one specific AZ (e.g., the one with the highest Spot placement score). A preferable API would fulfill the request in any AZ that has sufficient capacity.
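As an illustration of how the chosen AZ could be mapped to a subnet for the SageMaker VPC configuration, the sketch below filters the subnets of a VPC by availability zone. The AZ name and VPC id are placeholders:

import boto3

ec2 = boto3.client('ec2')
# find the subnets of our VPC that reside in the AZ with the highest Spot placement score
response = ec2.describe_subnets(
    Filters=[{'Name': 'availability-zone', 'Values': ['us-east-1a']},    # placeholder AZ
             {'Name': 'vpc-id', 'Values': ['vpc-0123456789abcdef0']}]    # placeholder VPC id
)
subnet_ids = [subnet['SubnetId'] for subnet in response['Subnets']]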
A better way to control the placement of our instances is to launch them within a placement group, specifically a cluster placement group. Not only will this guarantee that all of the instances will be in the same AZ, it will also place them on "the same high-bisection bandwidth segment of the network" so as to maximize the performance of the network traffic between them. However, as of the time of this writing SageMaker does not provide the option to specify a placement group. To take advantage of placement groups we would need to use an alternative training service solution (as we will demonstrate below).
EC2 Network Bandwidth Constraints
Be sure to take into account the maximal network bandwidth supported by the EC2 instances that you choose. Note, in particular, that the network bandwidths associated with single-GPU machines are often documented as being "up to" a certain number of Gbps. Make sure to understand what that means and how it can impact the speed of training over time.
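For instance, a brief sketch of how the documented network bandwidth of candidate instance types could be checked programmatically with the AWS Python SDK:

import boto3

ec2 = boto3.client('ec2')
response = ec2.describe_instance_types(
    InstanceTypes=['g5.2xlarge', 'g5.4xlarge', 'g5.12xlarge']
)
for info in response['InstanceTypes']:
    # values such as 'Up to 25 Gigabit' indicate burstable bandwidth that may be throttled over time
    print(info['InstanceType'], info['NetworkInfo']['NetworkPerformance'])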
Keep in mind that the GPU-to-GPU data communication (e.g., gradient sharing) may need to share the limited network bandwidth with other data flowing through the network, such as training samples being streamed into the training instances or training artifacts being uploaded to persistent storage. Consider ways of reducing the payload of each of these types of data to minimize the likelihood of a network bottleneck.
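One way to reduce the gradient-sharing payload in PyTorch, for example, is to register a DDP communication hook that compresses gradients to bfloat16 before they are sent over the network. This is a minimal sketch, not something we evaluated in this post, and it assumes a recent PyTorch version and an NCCL build with bfloat16 support:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def wrap_with_gradient_compression(model: torch.nn.Module) -> DDP:
    # wrap the model with DDP (assumes the process group has already been initialized)
    ddp_model = DDP(model.to(torch.cuda.current_device()))
    # compress gradients to bfloat16 before they are all-reduced,
    # roughly halving the gradient payload on the network
    ddp_model.register_comm_hook(state=None, hook=default_hooks.bf16_compress_hook)
    return ddp_model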
Elastic Fabric Adapter (EFA)
A growing number of EC2 instance types support Elastic Fabric Adapter (EFA), a dedicated network interface for optimizing inter-node communication. Using EFA can have a decisive impact on the runtime performance of your training workload. Note that the bandwidth on the EFA network channel is different from the documented bandwidth of the standard network. As of the time of this writing, detailed documentation of the EFA capabilities is hard to come by, and it is usually best to evaluate its impact through trial and error. Consider using an EC2 instance type that supports EFA when relevant.
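Since the documentation is sparse, one practical way to check whether EFA is actually being used is to raise NCCL's log verbosity and look for the EFA provider in the logs of the aws-ofi-nccl plugin. The sketch below sets the relevant environment variables before initializing the process group; treat the libfabric setting as an assumption to validate against your instance type and AMI:

import os
import torch.distributed as dist

# increase NCCL verbosity so that the logs report which network provider was selected
os.environ['NCCL_DEBUG'] = 'INFO'
# hint libfabric to prefer the EFA provider (assumption; verify for your setup)
os.environ.setdefault('FI_PROVIDER', 'efa')

dist.init_process_group('nccl')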
We will now demonstrate the comparative price performance of training on four single-GPU EC2 g5 Spot instances (ml.g5.2xlarge and ml.g5.4xlarge) vs. a single four-GPU On-Demand instance (ml.g5.12xlarge). We will use the training script below containing a Vision Transformer (ViT) backed classification model (trained on synthetic data).
import os, torch, time
import torch.distributed as dist
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast
from torch.nn.parallel import DistributedDataParallel as DDP
from timm.models.vision_transformer import VisionTransformer

batch_size = 128
log_interval = 10

# use random data
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=[index % 1000], dtype=torch.int64)
        return rand_image, label

def mp_fn():
    local_rank = int(os.environ['LOCAL_RANK'])
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    # model definition
    model = VisionTransformer()
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(torch.cuda.current_device())
    model = DDP(model)
    optimizer = torch.optim.Adam(params=model.parameters())

    # dataset definition
    num_workers = os.cpu_count() // int(os.environ['LOCAL_WORLD_SIZE'])
    dl = DataLoader(FakeDataset(), batch_size=batch_size, num_workers=num_workers)

    model.train()
    t0 = time.perf_counter()
    for batch_idx, (x, y) in enumerate(dl, start=1):
        optimizer.zero_grad(set_to_none=True)
        x = x.to(torch.cuda.current_device())
        y = torch.squeeze(y.to(torch.cuda.current_device()), -1)
        with autocast(enabled=True, dtype=torch.bfloat16):
            outputs = model(x)
            loss = loss_fn(outputs, y)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0 and local_rank == 0:
            time_passed = time.perf_counter() - t0
            samples_processed = dist.get_world_size() * batch_size * log_interval
            print(f'{samples_processed / time_passed} samples/second')
            t0 = time.perf_counter()

if __name__ == '__main__':
    mp_fn()
The code block below demonstrates how we used the SageMaker Python package (version 2.203.1) to run our experiments. Note that for the four-instance experiments, we configure the job to use a VPC with a single subnet, as explained above.
from sagemaker.pytorch import PyTorch
from sagemaker.vpc_utils import VPC_CONFIG_DEFAULT

# Toggle flag to switch between multiple single-GPU nodes and
# a single multi-GPU node
multi_inst = False

inst_count = 1
inst_type = 'ml.g5.12xlarge'
use_spot_instances = False
max_wait = None  # max seconds to wait for the Spot job to complete
subnets = None
security_group_ids = None

if multi_inst:
    inst_count = 4
    inst_type = 'ml.g5.4xlarge'  # optionally change to ml.g5.2xlarge
    use_spot_instances = True
    max_wait = 24 * 60 * 60  # 24 hours
    # configure vpc settings
    subnets = ['']
    security_group_ids = ['']

estimator = PyTorch(
    role='',
    entry_point='train.py',
    source_dir='',
    instance_type=inst_type,
    instance_count=inst_count,
    framework_version='2.1.0',
    py_version='py310',
    distribution={'torch_distributed': {'enabled': True}},
    subnets=subnets,
    security_group_ids=security_group_ids,
    use_spot_instances=use_spot_instances,
    max_wait=max_wait
)

# start job
estimator.fit()
Note that our code depends on the third-party timm Python package that we point to in a requirements.txt file in the root of the source directory. This assumes that the VPC has been configured to enable internet access. Alternatively, you could define a private PyPI server (as described here), or create a custom image with your third-party dependencies preinstalled (as described here).
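In our case the requirements.txt file could contain just the single dependency:

timm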
We summarize the results of our experiment in the table below. The On-Demand prices were taken from the SageMaker pricing page (as of the time of this writing, January 2024). The Spot savings values were collected from the reported managed spot training savings of the completed job. Please see the EC2 Spot pricing documentation to get a sense of how the reported Spot savings are calculated.
Our results clearly demonstrate the potential for considerable savings when using four single-GPU Spot instances rather than a single four-GPU On-Demand instance. They further demonstrate that although the On-Demand price of the g5.4xlarge instance type is higher, its increased CPU power and/or network bandwidth, combined with higher Spot savings, resulted in much greater savings.
Importantly, keep in mind that the relative performance results can vary considerably based on the details of your job as well as the Spot prices at the time that you run your experiments.
In a previous post we described how to create a custom managed environment on top of an unmanaged service, such as Amazon EC2. One of the motivating factors listed there was the desire to have greater control over device placement in a multi-instance setup, e.g., by using a cluster placement group, as discussed above. In this section, we demonstrate the creation of a multi-node setup using a cluster placement group.
Our code assumes the presence of a default VPC in addition to the (one-time) creation of a cluster placement group, demonstrated here using the AWS Python SDK (version 1.34.23):
import boto3

ec2 = boto3.client('ec2')
ec2.create_placement_group(
    GroupName='cluster-placement-group',
    Strategy='cluster'
)
In the code block below we use the AWS Python SDK to launch our Spot instances:
import boto3

ec2 = boto3.resource('ec2')
instances = ec2.create_instances(
    MaxCount=4,
    MinCount=4,
    ImageId='ami-0240b7264c1c9e6a9',  # replace with image of choice
    InstanceType='g5.4xlarge',
    Placement={'GroupName': 'cluster-placement-group'},
    InstanceMarketOptions={
        'MarketType': 'spot',
        'SpotOptions': {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate"
        }
    },
)
Please see our previous post for step-by-step tips on how to extend this into an automated training solution.
In this post, we have illustrated how demonstrating flexibility in your choice of training instance type can increase your ability to leverage Spot instance capacity and reduce the overall cost of training.
As the sizes of AI models continue to grow and the costs of AI training accelerators continue to rise, it becomes increasingly important that we explore ways to mitigate training expenses. The technique outlined here is just one of several methods for optimizing cost performance. We encourage you to explore our previous posts for insights into additional opportunities in this realm.