Instance Selection for Deep Learning

Tips for how to choose the best machine for your ML workload

Photo by Cezary Morga on Unsplash

In the course of our daily AI development, we are continually making decisions about the most appropriate machines on which to run each of our machine learning (ML) workloads. These decisions are not taken lightly, as they can have a meaningful impact on both the speed and the cost of development. Allocating a machine with multiple GPUs to run a sequential algorithm (e.g., the standard implementation of the connected components algorithm) might be considered wasteful, while training a large language model on a CPU would likely take a prohibitively long time.

In most cases we will have a range of machine options to choose from. When using a cloud service infrastructure for ML development, we typically have the choice of a wide selection of machine types that vary greatly in their hardware specifications. These are usually grouped into families of machine types (called instance types on AWS, machine families on GCP, and virtual machine series on Microsoft Azure), with each family targeting different types of use cases. With all the many options it is easy to feel overwhelmed or suffer from choice overload, and many online resources exist to help one navigate the process of instance selection.

In this post we would like to focus our attention on choosing an appropriate instance type for deep learning (DL) workloads. DL workloads are typically extremely compute-intensive and often require dedicated hardware accelerators such as GPUs. Our intentions in this post are to propose a few guiding principles for choosing a machine type for DL and to highlight some of the primary differences between machine types that should be considered when making this decision.

What’s Different About this Instance Selection Guide

In our view, many of the existing instance guides result in a great deal of missed opportunity. They typically involve classifying your application based on a few predefined properties (e.g., compute requirements, memory requirements, network requirements, etc.) and propose a flow chart for choosing an instance type based on those properties. They tend to underestimate the high degree of complexity of many ML applications and the simple fact that classifying them in this manner does not always sufficiently predict their performance challenges. We have found that naively following such guidelines can, sometimes, result in choosing a sub-optimal instance type. As we will see, the approach we propose is far more hands-on and data driven. It involves defining clear metrics for measuring the performance of your application and tools for comparing its performance on different instance type options. It is our belief that this kind of approach is required to ensure that you are truly maximizing your opportunity.

Disclaimers

Please do not view our mention of any specific instance type, DL library, cloud service provider, etc. as an endorsement of their use. The best option for you will depend on the unique details of your own project. Furthermore, any suggestion we make should not be regarded as anything more than a humble proposal that should be carefully evaluated and adapted to your use case before being applied.

Part 1: Proposed Principles for Instance Type Selection

As with any other important development design decision, it is highly recommended that you have a clear set of guidelines for reaching an optimal solution. There is nothing easier than simply using the machine type you used in your previous project and/or are most familiar with. However, doing so may cause you to miss out on opportunities for significant cost savings and/or significant speedups in your overall development time. In this section we propose a few guiding principles for your instance type search.

Define Clear Metrics and Tools for Comparison

Perhaps the most important guideline we will discuss is the need to clearly define both the metrics for comparing the performance of your application on different instance types and the tools for measuring them. Without a clear definition of the utility function you are trying to optimize, you will have no way of knowing whether the machine you have chosen is optimal. Your utility function might be different across projects and might even change during the course of a single project. When your budget is tight you might prioritize reducing cost over increasing speed. When an important customer deadline is approaching, you might prefer speed at any cost.

Example: Samples per Dollar Metric
In previous posts (e.g., here) we have proposed Samples per Dollar — i.e., the number of samples that are fed into our ML model for each dollar spent — as a measure of performance for a running DL model (for training or inference). The formula for Samples per Dollar is:

samples per dollar = samples per second / cost per second

…where samples per second = batch size * batches per second. The training instance cost can usually be found online. Of course, optimizing this metric alone might be insufficient: it might minimize the overall cost of training, but without including a metric that considers the overall development time, you might end up missing all of your customer deadlines. On the other hand, the speed of development can sometimes be controlled by training on multiple instances in parallel, allowing us to reach our speed goals regardless of the instance type of choice. In any case, our simple example demonstrates the need to consider multiple performance metrics and weigh them based on details of the ML project such as budget and scheduling constraints.

Formulating the metrics is useless if you don't have a way to measure them. It is critical that you define and build tools for measuring your metrics of choice into your applications. In the code block below, we show a simple PyTorch based training loop in which we include a simple line of code for periodically printing out the average number of samples processed per second. Dividing this by the published cost (per second) of the instance type gives you the samples per dollar metric we mentioned above.

import time

# get_data_loader and train_step are assumed to be defined elsewhere in the
# training script; world_size is the number of data-parallel workers
# (use 1 for single-device training).
batch_size = 128
data_loader = get_data_loader(batch_size)
global_batch_size = batch_size * world_size
interval = 100
t0 = time.perf_counter()

for idx, (inputs, target) in enumerate(data_loader, 1):
    train_step(inputs, target)
    if idx % interval == 0:
        # average throughput over the last `interval` steps
        time_passed = time.perf_counter() - t0
        samples_processed = global_batch_size * interval
        print(f'{samples_processed / time_passed} samples/second')
        t0 = time.perf_counter()
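Converting the measured throughput into the samples per dollar metric is then a matter of dividing by the instance's cost per second. Below is a minimal sketch using hypothetical values for the measured throughput and the published hourly price:

# Hypothetical values: substitute the throughput reported by the loop above
# and the published on-demand price of the instance under evaluation.
samples_per_second = 1000.0    # measured throughput
cost_per_hour = 4.0            # published instance price (USD per hour)

cost_per_second = cost_per_hour / 3600
samples_per_dollar = samples_per_second / cost_per_second
print(f'{samples_per_dollar:,.0f} samples/dollar')   # 900,000 for these values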

Have a Wide Variety of Options

Once we have clearly defined our utility function, choosing the best instance type is reduced to finding the instance type that maximizes the utility function. Clearly, the larger the search space of instance types we can choose from, the greater the overall utility we can reach. Hence the desire to have a wide variety of options. But we should also aim for diversity in instance types. Deep learning projects typically involve running multiple application workloads that vary greatly in their system needs and system utilization patterns. It is likely that the optimal machine type for one workload will differ substantially in its specifications from the optimal machine type of another. Having a wide and diverse set of instance types will increase your ability to maximize the performance of all of your project's workloads.

Consider Multiple Options

Some instance selection guides will recommend categorizing your DL application (e.g., by the size of the model and/or whether it performs training or inference) and choosing a (single) compute instance accordingly. For example, AWS promotes the use of certain types of instances (e.g., the Amazon EC2 g5 family) for ML inference, and other (more powerful) instance types (e.g., the Amazon EC2 p4 family) for ML training. However, as we mentioned in the introduction, it is our view that blindly following such guidance can result in missed opportunities for performance optimization. And, in fact, we have found that for many training workloads, including ones with large ML models, our utility function is maximized by instances that were considered to be targeted for inference.

Of course, we don't expect you to test every available instance type. There are many instance types that can (and should) be ruled out based on their hardware specifications alone. We would not recommend taking the time to evaluate the performance of a large language model on a CPU. And if we know that our model requires high precision arithmetic for successful convergence, we won't take the time to run it on a Google Cloud TPU (see here). But barring clearly prohibitive HW limitations, it is our view that instance types should only be ruled out based on performance data.

One of the reasons that multi-GPU Amazon EC2 g5 instances are often not considered for training models is the fact that, contrary to Amazon EC2 p4, the medium of communication between the GPUs is PCIe, and not NVLink, which supports much lower data throughput. However, although a high rate of GPU-to-GPU communication is indeed important for multi-GPU training, the bandwidth supported by PCIe may be sufficient for your network, or you may find that other performance bottlenecks prevent you from fully utilizing the speed of the NVLink connection. The only way to know for sure is through experimentation and performance evaluation.
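One way to get a rough sense of the GPU-to-GPU bandwidth on a candidate instance is to time a large collective operation. The sketch below is a minimal, illustrative micro-benchmark, assuming a single node, the NCCL backend, and launch via torchrun; it reports an effective payload rate for an all_reduce and is not a substitute for profiling your actual training workload.

import time
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
def main():
    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # assumes one process per local GPU

    # ~256 MB float32 payload
    tensor = torch.randn(64 * 1024 * 1024, device='cuda')

    # warm-up so that NCCL initialization is not included in the timing
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    if rank == 0:
        payload_gb = tensor.numel() * tensor.element_size() * iters / 1e9
        # effective payload rate: a rough proxy for interconnect throughput
        print(f'~{payload_gb / elapsed:.1f} GB/s')
    dist.destroy_process_group()

if __name__ == '__main__':
    main()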

Any instance type is fair game for reaching our utility function goals, and in the course of our instance type search we often find ourselves rooting for the lower-power, more environmentally friendly, under-valued, and lower-priced underdogs.

Develop your Workloads in a Manner that Maximizes your Options

Different instance types may impose different constraints on our implementation. They may require different initialization sequences, support different floating point data types, or rely on different SW installations. Developing your code with these differences in mind will decrease your dependency on specific instance types and increase your ability to take advantage of performance optimization opportunities.

Some high-level APIs include support for multiple instance types. PyTorch Lightning, for example, has built-in support for running a DL model on many different types of processors, hiding the details of the implementation required for each from the user. The supported processors include CPU, GPU, Google Cloud TPU, HPU (Habana Gaudi), and more. However, keep in mind that some of the adaptations required for running on specific processor types may require code changes to the model definition (without changing the model architecture). You may also need to include blocks of code that are conditional on the accelerator type. Some API optimizations may be implemented for specific accelerators but not for others (e.g., the scaled dot product attention (SDPA) API for GPU). Some hyper-parameters, such as the batch size, may need to be tuned in order to reach maximum performance. Additional examples of changes that may be required were demonstrated in our series of blog posts on the topic of dedicated AI training accelerators.
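As a rough illustration of what this can look like, the sketch below uses PyTorch Lightning's automatic accelerator detection together with a block that is conditional on the accelerator type. The model class, the flag it consumes, and the data loader helper are all hypothetical placeholders.

import torch
import pytorch_lightning as pl

from my_project import MyLightningModel, get_data_loader  # hypothetical helpers

model = MyLightningModel()

# Example of a block that is conditional on the accelerator type: enable a
# (hypothetical) flag that switches the model to SDPA-based attention only
# when a GPU is present.
if torch.cuda.is_available():
    model.use_sdpa = True

trainer = pl.Trainer(
    accelerator='auto',   # let Lightning detect CPU / GPU / TPU / HPU
    devices='auto',
    max_epochs=1,
)
trainer.fit(model, train_dataloaders=get_data_loader(batch_size=128))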

(Re)Evaluate Constantly

Importantly, in our current environment of constant innovation in the field of DL runtime optimization, performance comparison results become outdated very quickly. New instance types are periodically released that expand our search space and offer the potential for increasing our utility. On the other hand, popular instance types can reach end-of-life or become difficult to acquire due to high global demand. Optimizations at different levels of the software stack (e.g., see here) can also move the performance needle considerably. For example, PyTorch recently released a new graph compilation mode which can, reportedly, speed up training by as much as 51% on modern GPUs. These speed-ups have not (as of the time of this writing) been demonstrated on other accelerators. This is a considerable speed-up that may force us to reevaluate some of our previous instance choice decisions. (For more on PyTorch compile mode, see our recent post on the topic.) Thus, performance comparison should not be a one-time activity; to take full advantage of all of this incredible innovation, it should be conducted and updated regularly.
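Because the compilation mode mentioned above is a one-line change in PyTorch 2.x, it is cheap to include in a periodic re-evaluation. A minimal sketch, assuming a CUDA GPU is available and using a toy linear layer in place of a real model:

import torch

# Wrap any nn.Module with the PyTorch 2.x graph compilation mode. A toy
# linear layer is used here purely for illustration.
model = torch.nn.Linear(1024, 1024).cuda()
compiled_model = torch.compile(model)

x = torch.randn(128, 1024, device='cuda')
y = compiled_model(x)  # the first call triggers compilation; later calls reuse it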

Part 2: Differences Between Instance Types

Knowing the details of the instance types at your disposal and, in particular, the differences between them, is important for deciding which ones to consider for performance evaluation. In this section we have grouped these into three categories: HW specifications, SW stack support, and instance availability.

Hardware Specifications

The most important differentiation between potential instance types is in the details of their hardware specifications. There are a whole host of hardware details that can have a meaningful impact on the performance of a deep learning workload. These include:

  • The specifics of the hardware accelerator: Which AI accelerators are we using (e.g., GPU/HPU/TPU), how much memory does each support, how many FLOPs can it run, what base types does it support (e.g., bfloat16/float32), etc.? (See the short inspection sketch following this list.)
  • The medium of communication between hardware accelerators and its supported bandwidths
  • The medium of communication between multiple instances and its supported bandwidth (e.g., does the instance type include a high bandwidth network such as Amazon EFA or Google FastSocket?)
  • The network bandwidth of sample data ingestion
  • The ratio between the overall CPU compute power (typically responsible for the sample data input pipeline) and the accelerator compute power
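Some of these details can be queried directly from within your code. The sketch below, assuming a CUDA GPU instance, inspects the first category (accelerator type, memory, and bfloat16 support); similar queries exist for other accelerator types.

import torch

# Inspect the accelerator(s) visible on this instance (CUDA GPUs assumed).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f'accelerator: {props.name}')
    print(f'device count: {torch.cuda.device_count()}')
    print(f'memory per device: {props.total_memory / 1e9:.1f} GB')
    print(f'bfloat16 support: {torch.cuda.is_bf16_supported()}')
else:
    print('no CUDA accelerator detected')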

For a comprehensive and detailed review of the differences in the hardware specifications of ML instance types on AWS, check out the following TDS post:

Having a deep understanding of the details of the instance types you are using is important not only for knowing which instance types are relevant for you, but also for understanding and overcoming runtime performance issues discovered during development. This has been demonstrated in a number of our previous blog posts (e.g., here).

Software Stack Support

Another input into your instance type search should be the SW support matrix of the instance types you are considering. Some software components, libraries, and/or APIs support only specific instance types. If your workload requires these, then your search space will be more limited. For example, some models rely on compute kernels built for GPU but not for other types of accelerators. Another example is the dedicated library for model distribution offered by Amazon SageMaker, which can boost the performance of multi-instance training but, as of the time of this writing, supports a limited set of instance types (for more details on this, see here). Also note that some newer instance types, such as the AWS Trainium based Amazon EC2 trn1 instance, have limitations on the frameworks that they support.

Instance Availability

The past few years have seen extended periods of chip shortages that have led to a drop in the supply of HW components and, in particular, accelerators such as GPUs. Unfortunately, this has coincided with a significant increase in demand for such components, driven by the recent milestones in the development of large generative AI models. The imbalance between supply and demand has created a situation of uncertainty with regard to our ability to acquire the machine types of our choice. If once we might have taken for granted our ability to spin up as many machines as we wanted of any given type, we now need to adapt to situations in which our top choices may not be available at all.

The availability of instance types is an important input into their evaluation and selection. Unfortunately, it can be very difficult to measure availability, and even harder to predict and plan for it. Instance availability can change very suddenly. It can be here today and gone tomorrow.

Note that in cases in which we use multiple instances, we may require not only the availability of instance types but also their co-location in the same data centers (e.g., see here). ML workloads often rely on low network latency between instances, and their distance from one another could hurt performance.

Another important consideration is the availability of low cost spot instances. Many cloud service providers offer discounted compute engines from surplus cloud service capacity (e.g., Amazon EC2 Spot Instances in AWS, Preemptible VM Instances in Google Cloud Platform, and Low-Priority VMs in Microsoft Azure). The drawback of spot instances is the fact that they can be interrupted and taken from you with little to no warning. If available, and if you program fault tolerance into your applications, spot instances can enable considerable cost savings.
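The fault tolerance that makes spot instances practical usually amounts to periodic checkpointing and the ability to resume from the latest checkpoint after an interruption. A minimal sketch follows; the path and helper names are hypothetical.

import torch

CKPT_PATH = '/checkpoints/latest.pt'  # hypothetical persistent storage path

def save_checkpoint(model, optimizer, epoch):
    # called periodically during training so an interruption loses little work
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'epoch': epoch}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # on (re)start, resume from the latest checkpoint if one exists
    try:
        state = torch.load(CKPT_PATH)
    except FileNotFoundError:
        return 0  # no checkpoint yet: start from epoch 0
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    return state['epoch'] + 1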

Summary

In this post we have reviewed some considerations and recommendations for instance type selection for deep learning workloads. The choice of instance type can have a critical impact on the success of your project, and the process of discovering the most optimal one should be approached accordingly. This post is by no means comprehensive. There may be additional, even critical, considerations that we have not discussed that may apply to your deep learning project and should be accounted for.

The explosion in AI development over the past few years has been accompanied by the introduction of a number of new dedicated AI accelerators. This has led to an increase in the number of instance type options available and, with it, the opportunity for optimization. It has also made the search for the most optimal instance type both more challenging and more exciting. Happy hunting :)!!
