
In the field of Artificial Intelligence (AI), Multi-Layer Perceptrons (MLPs) are the foundation for many Machine Learning (ML) tasks, including solving partial differential equations, representing density functions in Neural Radiance Fields (NeRFs), and simulating ray tracing with Neural Ray Tracing.
Fully connected layers, in which every neuron in a layer is connected to every neuron in the layers above and below, are a defining characteristic of MLPs. Unlike in certain other topologies, each neuron's output in an MLP is independent of the outputs of its neighbors in the same layer. This property is what makes MLPs amenable to full fusion, which is important for some computational workloads.
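To make the architecture concrete, here is a minimal sketch of a narrow MLP in PyTorch (one of the frameworks the paper benchmarks against). The width of 64, depth of 4, and input/output dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

# A narrow MLP: every layer is fully connected, with a small fixed
# width (here 64 neurons) and a shallow depth (here 4 hidden layers).
class NarrowMLP(nn.Module):
    def __init__(self, in_dim=3, width=64, depth=4, out_dim=1):
        super().__init__()
        layers = [nn.Linear(in_dim, width), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), nn.ReLU()]
        layers += [nn.Linear(width, out_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Each neuron's output depends only on the previous layer,
        # never on its neighbors within the same layer.
        return self.net(x)

model = NarrowMLP()
y = model(torch.randn(8, 3))  # batch of 8 inputs
print(y.shape)  # torch.Size([8, 1])
```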
In recent research, a team from Intel Corporation and Ecole Polytechnique has focused on efficiently implementing narrow MLPs on Intel GPUs. Narrow MLPs have a small, fixed number of neurons per layer and a shallow depth (number of layers). Despite their narrow width, they are universal approximators with significance in a wide range of applications. That narrow width, however, limits their performance, resulting in low memory-bandwidth utilization and low arithmetic intensity during training and inference.
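A rough back-of-envelope calculation shows why a narrow layer run as a standalone kernel is memory-bound. The numbers below (width 64, half-precision values, batch 65,536) are illustrative assumptions, not figures from the paper.

```python
# Arithmetic intensity (FLOPs per byte of global-memory traffic)
# for one width-64 linear layer executed as its own kernel.
width, batch, bytes_per_el = 64, 2**16, 2  # half precision (fp16/bf16)

flops = 2 * batch * width * width      # one multiply-add per weight per input
bytes_moved = bytes_per_el * (
    batch * width      # read input activations
    + width * width    # read weight matrix
    + batch * width    # write output activations
)
print(flops / bytes_moved)  # ~32 FLOP/byte
# Far below the hundreds of FLOP/byte a modern GPU needs to be
# compute-bound; most of the traffic is intermediate activations,
# which is exactly what fusing the layers eliminates.
```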
Fusing the layers into a single kernel is a popular solution to these problems, because it allows the use of faster memories such as caches, shared memory, and register files. This approach, called 'fully-fused MLPs,' was previously implemented in CUDA for Nvidia GPUs.
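The idea can be sketched in plain Python. In NumPy both functions below compute the same result; the difference only matters on a GPU, where the unfused version makes a round trip to global memory per layer while a fused kernel keeps the running activation in registers or shared memory. This is a conceptual illustration, not the paper's kernel.

```python
import numpy as np

def mlp_unfused(x, weights):
    # One "kernel" per layer: each intermediate activation `x`
    # would be written to and re-read from global memory.
    for W in weights:
        x = np.maximum(x @ W, 0.0)  # linear layer + ReLU
    return x

def mlp_fused(x, weights):
    # Fully fused: conceptually one kernel walks all layers while the
    # running activation stays in registers/shared memory; only the
    # initial input and final output touch global memory.
    acc = x
    for W in weights:  # this loop is unrolled inside the single kernel
        acc = np.maximum(acc @ W, 0.0)
    return acc

rng = np.random.default_rng(0)
ws = [rng.standard_normal((64, 64)) * 0.1 for _ in range(4)]
x = rng.standard_normal((8, 64))
assert np.allclose(mlp_unfused(x, ws), mlp_fused(x, ws))
```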
The team has shared that the goal of this study is to create fully-fused MLPs with a fixed layer width of 2^i neurons (where i ranges from 4 to 7, i.e., 16 to 128 neurons) and arbitrary depth, using SYCL for Intel GPUs. Despite the fixed layer width, these MLPs remain effective universal approximators. The implementation relies on Intel's joint_matrix SYCL extensions, utilizing the XMX hardware in Intel's Data Center GPU Max 1550.
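Matrix engines like XMX operate on small fixed-size tiles, which is why the layer widths are powers of two. The NumPy sketch below shows how a width-64 layer's matmul decomposes into tile products; the 8x16x16 tile shape is an illustrative assumption, and a real kernel would issue these through the joint_matrix extension rather than NumPy.

```python
import numpy as np

TM, TK, TN = 8, 16, 16  # illustrative tile shape (M x K x N)

def tiled_matmul(A, B):
    # Decompose a (batch x 64) @ (64 x 64) product into TMxTK @ TKxTN
    # tile products, the way a matrix-engine kernel would issue them.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TM == 0 and K % TK == 0 and N % TN == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, TM):
        for j in range(0, N, TN):
            acc = np.zeros((TM, TN), dtype=A.dtype)  # accumulator tile
            for k in range(0, K, TK):
                acc += A[i:i+TM, k:k+TK] @ B[k:k+TK, j:j+TN]
            C[i:i+TM, j:j+TN] = acc
    return C

A = np.random.rand(32, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```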
Models requiring high data throughput, with batch sizes of 2^i where i is greater than 15 (i.e., at least 65,536 elements), are especially well suited to this method. Compared with similar CUDA implementations, the SYCL version on Intel hardware performs better, particularly for 64-width MLPs. The study also indicates that this method requires fewer accesses to global memory than prior ones, which improves inference speed and theoretical peak performance.
Benchmarks and applications, including image compression, Neural Radiance Fields (NeRFs), and Physics-Informed Machine Learning, have been tested to demonstrate the performance improvements and possible applications. In all cases, the proposed approach performs significantly better than off-the-shelf implementations such as CUDA PyTorch on Nvidia's H100 GPU and Intel Extension for PyTorch (IPEX) on the same Intel GPU.
The team has summarized its primary contributions as follows:
- The first SYCL implementation of fully-fused Multi-Layer Perceptrons designed for Intel GPUs using XMX instructions has been introduced.
- The performance of the implementation has been assessed using a roofline model, showing an increase in arithmetic intensity of up to 2.15x compared with another fully-fused implementation (a roofline sketch follows this list).
- Four example applications have been used to validate the higher performance: a regression benchmark, image compression, neural radiance fields, and physics-informed neural networks.
- The implementation is noteworthy in that it performs training 1.75x faster and inference 2.84x faster than another fully-fused implementation. Its effectiveness across a variety of tasks and datasets is further demonstrated by the up-to-30x performance improvement it delivers over commercially available PyTorch versions.
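The roofline model referenced above bounds attainable throughput by min(peak compute, arithmetic intensity x memory bandwidth). The sketch below plugs in illustrative numbers (not the paper's measured figures) to show how a 2.15x gain in arithmetic intensity raises the bound for a memory-bound kernel.

```python
def roofline(ai, peak_flops, bandwidth):
    # Attainable FLOP/s = min(compute roof, memory roof).
    return min(peak_flops, ai * bandwidth)

peak = 800e12   # illustrative peak compute, FLOP/s
bw = 3e12       # illustrative memory bandwidth, bytes/s

before = roofline(32.0, peak, bw)         # unfused arithmetic intensity
after = roofline(32.0 * 2.15, peak, bw)   # fused: 2.15x higher intensity
print(f"{before/1e12:.0f} -> {after/1e12:.0f} attainable TFLOP/s")
# 96 -> 206 attainable TFLOP/s: still memory-bound, but the bound rises
# in direct proportion to the arithmetic intensity gain.
```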
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.