Concept and a PyTorch Lightning example
Parallel training on many GPUs is state of the art in deep learning. The open-source image generation model Stable Diffusion was trained on a cluster of 256 GPUs. Meta's AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs that are used to train models such as Llama 3.
By using multiple GPUs, machine learning practitioners reduce the wall time of their training runs. Training Stable Diffusion took 150,000 GPU hours, or more than 17 years on a single GPU. Parallel training reduced that to 25 days.
There are two kinds of parallelism in deep learning:
- Data parallelism, where a large dataset is distributed across multiple GPUs.
- Model parallelism, where a deep learning model that is too large to fit on a single GPU is distributed across multiple devices.
We will focus here on data parallelism, since model parallelism only becomes relevant for very large models beyond 500M parameters.
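As a minimal sketch of what data parallelism looks like in PyTorch Lightning, the snippet below launches one process per GPU with the distributed data parallel (DDP) strategy. The `LitModel` class and the random dataset are placeholders for illustration, not taken from the article; the relevant part is the `Trainer` configuration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    """Placeholder model: a single linear layer trained with MSE loss."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # Dummy dataset; under DDP each process receives a different shard of the batches.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=64, num_workers=2)

    # Data parallelism: one process per GPU, gradients averaged across processes.
    trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
    trainer.fit(LitModel(), loader)
```

With `strategy="ddp"`, each of the four processes holds a full copy of the model, works on its own portion of the data, and synchronizes gradients after every backward pass.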
Beyond reducing wall time, there is an economic argument for parallel training: cloud compute providers such as AWS offer single machines with up to 16 GPUs. Parallel training makes use of all available GPUs, so you get more value for your money.
Parallel computing
Parallelism is the dominant paradigm in high-performance computing. Programmers identify tasks that can be executed independently of one another and distribute them across many devices. The serial parts of the program distribute the tasks and gather the results.
Parallel computing reduces computation time. A program parallelized across four devices could ideally run four times faster than the same program running on a single device. In practice, communication overhead limits this scaling.
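A toy example of this pattern, outside of deep learning: the main process distributes independent tasks to a pool of workers and gathers the results. The task itself is made up for illustration.

```python
from multiprocessing import Pool


def run_task(task_id: int) -> str:
    # An independent task: each worker executes it without
    # coordinating with the other workers.
    return f"task {task_id} done"


if __name__ == "__main__":
    # Serial part: distribute the tasks across 4 worker processes ...
    with Pool(processes=4) as pool:
        results = pool.map(run_task, range(8))
    # ... and gather the results.
    print(results)
```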
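As a rough model of this effect (an assumption for illustration, not a formula from the original text), the speedup on $N$ devices can be written as

$$S(N) = \frac{T_1}{\tfrac{T_1}{N} + T_{\text{comm}}},$$

where $T_1$ is the runtime on a single device and $T_{\text{comm}}$ is the time spent on communication. With no communication overhead ($T_{\text{comm}} = 0$) the speedup is the ideal $N$; as $T_{\text{comm}}$ grows, the speedup falls below $N$.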
As an analogy, consider a group of painters painting a wall. The communication overhead occurs when the foreman tells everyone what color to use and what area to paint. The painting itself can be done in parallel, and only the finishing is again a serial task.