We now have to design efficient 'SW/algorithm aware' HW and 'HW aware' SW/algorithms, so that they can tightly intertwine to squeeze every last bit out of our limited compute resources. How should we do it? Below is an evolving methodology that can be used as a reference if you are architecting new HW/SW features for AI.
1. Identify proxy AI workloads.
To initiate the evaluation, we need to collect proxy AI models and define a priority list for further investigation. There are multiple resources you can draw on, including the latest research papers with open-source code (CVPR, SIGGRAPH, research labs from big tech), customer feedback or requests, industry trends, etc. Filter out several representative models based on your expert judgment. This step is crucial, as you will be using them to design your 'future architecture'.
2. Thorough model architecture evaluation.
You should investigate the model architecture comprehensively to understand its functionality and innovations, and break it down into as much detail as you can. Are there new operators unsupported in the current tech stack? Where are the compute-intensive layers? Is it a data transfer (memory) heavy model? What datatype is required, and what kind of quantization techniques can be applied without sacrificing accuracy? Which parts of the model can be HW accelerated, and where are the potential performance optimizations?
For instance, in neural rendering, the model requires both rendering and compute (matrix multiplication) to run in parallel, so you need to check whether the current SW stack supports rendering and compute concurrently. In LLMs, the key-value (KV) cache size grows with the input sequence length, so it is critical to understand the memory requirement and the potential data transfer/memory hierarchy optimizations needed to handle a large KV cache. A rough sizing sketch is shown below.
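As a minimal sketch, the FP16 KV cache can be estimated from the model dimensions; the configuration below is an assumption for illustration (roughly a Llama-2-13B-style decoder) rather than a measurement.
# Rough KV cache size estimate for a decoder-only LLM
# (assumed config: 40 layers, 40 heads, head_dim 128, FP16 = 2 bytes)
n_layers, n_heads, head_dim = 40, 40, 128
bytes_per_elem = 2
batch, seq_len = 1, 4096    # assumed serving scenario

# K and V are each [batch, heads, seq_len, head_dim] per layer
kv_bytes = 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 1e9:.1f} GB")   # ~3.4 GB, growing linearly with seq_len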
3. SW enabling and prototyping.
Download the open-source code for the model identified in Step 2, and run it on the 'target' SW framework/HW. This step is not straightforward, especially for brand-new/disruptive models. Because the goal is to enable a workable solution for performance evaluation, it is not necessary to deliver product-quality code at this stage. A dirty fix on SW without performance tuning is acceptable to proceed to Step 4. A major step is to convert the pre-trained model from the development framework (PyTorch) to a new format required by the target framework, for instance via ONNX export:
import torch
import torchvision

# Load a pre-trained model and create a dummy input with the expected shape
model = torchvision.models.resnet50(pretrained=True)
dummy_input = torch.randn(1, 3, 224, 224)
# Export the PyTorch model to ONNX for the target inference framework
torch.onnx.export(model,
                  dummy_input,
                  "resnet50.onnx",
                  verbose=False,
                  input_names=["input"],
                  output_names=["output"],
                  export_params=True)
Nevertheless, there are often cases where significant support effort is required. For instance, to run differentiable rendering models, autograd must be supported. It is very likely that this feature is not ready in the new framework and requires months of effort from the development team. Another example is GPTQ quantization for LLMs, which may not be supported in the inference framework initially. Instead of waiting for the engineering team, architects can run the workload on an Nvidia system for performance evaluation, as Nvidia is the HW of choice for academic development. This allows a list of SW requirements to be compiled based on the gaps observed during SW enabling.
4. Performance evaluation and architectural innovation.
There are many metrics to evaluate an AI model's performance. Below are the most important ones we should consider.
4.1 FLOPs (Floating Point Operations) and MACs (Multiply-Accumulate Operations).
These metrics are commonly used to measure the computational complexity of deep learning models. They provide a quick and easy way to understand the number of arithmetic operations required. FLOPs can be obtained through methods such as paper analysis, VTune reports, or tools like flops-counter.pytorch and pytorch-OpCounter, as in the sketch below.
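A minimal sketch using pytorch-OpCounter (the thop package); ResNet-50 with a single 224x224 RGB input is an assumed example, not a requirement of the tool.
import torch
import torchvision
from thop import profile  # pytorch-OpCounter

# Count MACs and parameters for ResNet-50 on one 224x224 RGB image
model = torchvision.models.resnet50()
dummy_input = torch.randn(1, 3, 224, 224)
macs, params = profile(model, inputs=(dummy_input,))

# FLOPs are often approximated as 2 * MACs (one multiply plus one add)
print(f"MACs: {macs / 1e9:.2f} G, FLOPs: ~{2 * macs / 1e9:.2f} G, params: {params / 1e6:.2f} M")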
4.2 Memory Footprint and Bandwidth (BW)
Memory footprint mainly consists of weights (network parameters) and input data. For instance, a Llama model with 13B parameters in FP16 consumes about 13 * 2 (FP16 = 2 bytes) = 26 GB of memory (the input is negligible, as the weights take much more space). Another key factor for LLMs is the KV cache size. The KV cache can take up to 30% of total memory, and it is dynamic (refer to the picture in Step 2). Large models are usually memory bound, because speed depends on how quickly data can be moved from system memory to local memory, or from local memory to local caches/registers. Available memory BW is a much better predictor of inference latency (token generation time for LLMs) than peak compute TOPS. One performance indicator is memory bandwidth utilization (MBU), defined as actual BW / peak BW. Ideally, an MBU close to 100% indicates that the memory BW is fully utilized. A back-of-the-envelope sketch follows.
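A back-of-the-envelope sketch of the memory-bound view; the 900 GB/s peak bandwidth and the 40 ms/token measurement below are assumed values for illustration only.
# Memory-bound latency estimate for LLM token generation (hypothetical numbers).
# Assumes each generated token streams all FP16 weights from device memory once.
params = 13e9                  # Llama-13B
weight_bytes = params * 2      # FP16 = 2 bytes, ~26 GB
peak_bw = 900e9                # assumed peak memory BW, ~900 GB/s

min_latency = weight_bytes / peak_bw      # lower bound on per-token latency
print(f"Weights: {weight_bytes / 1e9:.0f} GB, lower bound: {min_latency * 1e3:.1f} ms/token")

# MBU = achieved BW / peak BW; e.g., with an assumed measured 40 ms/token:
achieved_bw = weight_bytes / 0.040
print(f"MBU: {achieved_bw / peak_bw:.0%}")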
Enough Memory is Not Enough!
As memory is a bottleneck, exploration of advanced model compression and memory/caching technologies is required. A few pioneering works are listed below:
- MemGPT: it utilizes resources from different levels of the memory hierarchy, such as a combination of small, fast RAM and large, slow storage memory. Information must be explicitly transferred between them. [2]
- Low-precision quantization (GPTQ, AWQ, GGML) to reduce the memory footprint of models (see the footprint sketch after this list)
- In-memory computing (PIM): reduces power and improves performance by eliminating the need for data movement.
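A quick sketch of the footprint savings from low-precision weights; the bit widths are typical choices used for illustration, not a statement about any specific implementation.
# Approximate weight memory for a 13B-parameter model at different precisions
params = 13e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4 (e.g., GPTQ/AWQ)", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:22s} ~{gb:.1f} GB")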
4.3 Latency/throughput.
In computer vision, latency is the time to generate one frame. In the context of LLMs, it is the time between the first token and the next token generation. Throughput is the number of tokens/frames per second. Latency is a critical metric for measuring AI system performance, and it is a compound factor of both SW and HW performance. There are many optimization strategies to consider; to name a few:
- Optimization of bandwidth-constrained operations like normalizations, pointwise operations, SoftMax, and ReLU. It is estimated that normalization and pointwise operations consume nearly 40% more runtime than matrix multiplications, while achieving 250x and 700x fewer FLOPS than matrix multiplications, respectively. To address this, kernel fusion can be applied to fuse multiple operators and save data transfer costs, or expensive operators (softmax) can be replaced with lighter ones (ReLU); see the fusion sketch after this list.
- Specialized HW architecture. The integration of specialized hardware (AVX, GPUs, TPUs, NPUs) can result in significant speedups and energy savings, which is especially important for applications that require real-time processing on resource-constrained devices. For instance, Intel AVX instructions can yield up to a 60,000x speed-up over native Python code.
Tensor cores on Nvidia GPUs (V100, A100, H100, etc.) can multiply and add two FP16 and/or FP32 matrices in one clock cycle, compared to CUDA cores, which can only perform one operation per cycle. However, tensor core utilization is often very low (3%-9% for end-to-end training), leading to high energy cost and low performance. There is active research on improving systolic array utilization (FlexSA, multi-directional SA, etc.) that I will cover in the next posts in this series.
In addition, as memory and data traffic are always a bottleneck for large AI models, it is crucial to explore advanced architectures with bigger and more efficient on-chip memory. One example is the Cerebras core memory design, where memory is independently addressed per core.
- There are plenty of other optimizations: parallelism, KV cache quantization for LLMs, sparse activation, and end-to-end optimization. I will explain more in upcoming posts.
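A minimal illustration of the kernel fusion idea mentioned above, sketched with torch.compile; the op chain and tensor shapes are arbitrary assumptions.
import torch
import torch.nn.functional as F

# A bandwidth-bound chain of pointwise ops: bias add, GELU, scale.
# Run eagerly, each op reads and writes the full tensor in memory.
def pointwise_chain(x, bias, scale):
    return F.gelu(x + bias) * scale

# torch.compile can fuse such pointwise chains into fewer kernels,
# cutting round trips to device memory (one form of kernel fusion).
fused_chain = torch.compile(pointwise_chain)

x = torch.randn(4096, 4096)
bias = torch.randn(4096)
out = fused_chain(x, bias, 0.5)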
4.4 Power and energy efficiency
Power is another beast we need to look into, especially for low-power user scenarios. There is always a tradeoff between performance and power. As illustrated below, a memory access operation takes a few orders of magnitude more energy than a compute operation, so reducing memory transfers is essential to save power.