At Meta, AI workloads are everywhere, serving as the foundation for a wide range of applications such as content understanding, Feeds, generative AI, and ad ranking. Thanks to its seamless Python integration, eager-mode programming, and simple APIs, PyTorch is the framework these workloads run on. In particular, deep learning recommendation models (DLRMs) are vital to improving user experiences across Meta's products and services. As these models grow in size and complexity, the underlying hardware must deliver ever more memory and compute, all without sacrificing efficiency.
When it comes to efficiently processing Meta's unique recommendation workloads at scale, GPUs are not always the best option. To address this gap, Meta developed a family of application-specific integrated circuits (ASICs) called the Meta Training and Inference Accelerator (MTIA). Designed with the needs of next-generation recommendation models in mind, the first-generation ASIC integrates with PyTorch to deliver a fully optimized ranking system. Keeping developers productive is an ongoing process: the team maintains support for PyTorch 2.0, which dramatically improves PyTorch's compiler-level performance.
In 2020, the team created the original MTIA ASIC to handle Meta's internal inference needs. The silicon, PyTorch, and the recommendation models were co-designed, making this inference accelerator part of a full-stack solution. Built on TSMC's 7 nm process and running at 800 MHz, the accelerator delivers 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision, with a thermal design power (TDP) of 25 W.
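To put those headline numbers in context, a quick back-of-envelope calculation shows how many operations the chip must retire per clock cycle to hit its published peaks. The per-PE breakdown below is our own illustrative arithmetic (assuming the peak is spread evenly across the 64 processing elements described later), not a figure from Meta:

```python
# Back-of-envelope check of MTIA v1's published peak throughput.
# Assumption: peak figures are aggregate and spread evenly across all PEs.
FREQ_HZ = 800e6        # 800 MHz clock
INT8_TOPS = 102.4      # peak INT8 throughput, tera-ops/s
FP16_TFLOPS = 51.2     # peak FP16 throughput, tera-FLOPs/s
NUM_PES = 64           # 8 x 8 grid of processing elements

# Operations the whole chip must complete per cycle to reach peak.
int8_ops_per_cycle = INT8_TOPS * 1e12 / FREQ_HZ    # ~128,000 chip-wide
fp16_ops_per_cycle = FP16_TFLOPS * 1e12 / FREQ_HZ  # ~64,000 chip-wide

# Divided evenly across the PE grid.
int8_ops_per_pe = int8_ops_per_cycle / NUM_PES     # ~2,000 per PE per cycle

print(f"INT8: {int8_ops_per_pe:.0f} ops per PE per cycle")
```

The roughly 2,000 INT8 operations per PE per cycle implied here is the kind of density that only wide fixed-function matrix units can sustain, which is consistent with the PE design described below.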
The accelerator can be divided into constituent parts: processing elements (PEs), on-chip and off-chip memory resources, and interconnects arranged in a grid structure. An independent control subsystem inside the accelerator manages the software. Firmware coordinates job execution on the accelerator, manages the available compute and memory resources, and communicates with the host through a dedicated host interface. The memory subsystem uses LPDDR5 for off-chip DRAM, which allows expansion to 128 GB. Frequently accessed data and instructions get higher bandwidth and lower latency from the chip's 128 MB of on-chip SRAM, which is shared among all the PEs.
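The capacities above form a classic three-level hierarchy. A minimal sketch (capacities taken from the article; the latency/bandwidth ordering is qualitative, and the table structure is our own) makes the trade-off explicit:

```python
# Sketch of MTIA v1's memory hierarchy as described in the article.
# Capacities are from the text; scope annotations summarize the description.
KB, MB, GB = 1024, 1024**2, 1024**3

hierarchy = [
    # (level, total capacity in bytes, scope)
    ("PE-local SRAM", 64 * 128 * KB, "private, 128 KB per PE"),
    ("on-chip SRAM",  128 * MB,      "shared by all 64 PEs"),
    ("LPDDR5 DRAM",   128 * GB,      "off-chip, expandable"),
]

# Capacity grows by orders of magnitude at each level down the hierarchy,
# while bandwidth and latency get worse (qualitatively, not measured here).
for level, capacity, scope in hierarchy:
    print(f"{level:14s} {capacity / MB:>10.1f} MiB  ({scope})")

# Aggregate PE-local SRAM: 64 PEs x 128 KB = 8 MiB.
total_local_sram = 64 * 128 * KB
```

Note that all the PE-local SRAM combined (8 MiB) is a small fraction of the 128 MB shared SRAM, which is why data reuse within a PE matters so much for this design.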
The 64 PEs in the grid are laid out in an 8 × 8 matrix. Each PE has 128 KB of local SRAM for fast data storage and processing. A mesh network links the PEs to one another and to the memory banks. The grid can be used as a whole to run a single job, or it can be split into multiple subgrids, each handling its own work. Each PE contains two processor cores and multiple fixed-function units optimized for essential tasks such as matrix multiplication, accumulation, data movement, and nonlinear function computation. The RISC-V ISA-based processor cores have been extensively customized to perform the required compute and control operations. The architecture was designed to exploit two essentials of effective workload handling: parallelism and data reuse.
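The subgrid partitioning described above can be sketched in a few lines. This is purely illustrative (the `subgrids` helper and coordinate scheme are our own invention, not Meta's scheduler API), but it shows how an 8 × 8 grid divides cleanly into independent units of work:

```python
# Illustrative sketch of splitting an 8x8 PE grid into independent subgrids,
# each of which could be assigned its own job. Hypothetical helper, not MTIA's
# actual scheduling interface.
from itertools import product

GRID = 8  # PEs per side of the mesh

def subgrids(rows_per_sub, cols_per_sub):
    """Yield each subgrid as a list of (row, col) PE coordinates."""
    assert GRID % rows_per_sub == 0 and GRID % cols_per_sub == 0
    for r0, c0 in product(range(0, GRID, rows_per_sub),
                          range(0, GRID, cols_per_sub)):
        yield [(r0 + r, c0 + c)
               for r, c in product(range(rows_per_sub), range(cols_per_sub))]

# Example: split the grid into four 4x4 subgrids, e.g. one per concurrent job.
parts = list(subgrids(4, 4))
print(len(parts), "subgrids of", len(parts[0]), "PEs each")
```

Because every partition covers the full mesh with no overlap, each subgrid can run its job in parallel with the others, which is one of the two essentials (parallelism) the architecture was built around.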
The researchers compared MTIA to an NNPI accelerator and a GPU. The results show that MTIA excels at efficiently handling the small shapes and batch sizes of low-complexity models, and the team is actively optimizing its software stack to reach similar performance on medium- and high-complexity models, which use larger shapes that are currently much better optimized on the GPU's software stack.
To optimize performance for Meta's workloads, the team is now focused on striking a balance between compute power, memory capacity, and interconnect bandwidth to build a better, more efficient solution.