Revolutionizing AI Efficiency: UC Berkeley's SqueezeLLM Debuts Dense-and-Sparse Quantization, Marrying Quality and Speed in Large Language Model Serving

Recent developments in Large Language Models (LLMs) have demonstrated their impressive problem-solving abilities across many fields. LLMs can contain hundreds of billions of parameters and are trained on enormous text corpora.

Studies show that in LLM inference, memory bandwidth, not compute, is the key performance bottleneck for generative tasks. For these memory-bound workloads, the speed at which parameters can be loaded from and stored to memory, rather than the cost of arithmetic operations, becomes the main latency barrier. However, progress in memory bandwidth technology has lagged far behind computation, giving rise to a phenomenon referred to as the Memory Wall.
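To make the memory-bound argument concrete, here is a rough back-of-the-envelope estimate in Python. The bandwidth figure and the assumption that every weight is read from memory once per generated token are illustrative placeholders, not numbers from the paper:

```python
# Illustrative estimate: if each generated token requires reading every weight
# from GPU memory once, memory bandwidth sets a floor on per-token latency.

params = 7e9          # LLaMA-7B parameter count
bandwidth = 2e12      # assumed ~2 TB/s of GPU memory bandwidth (placeholder)

for bits in (16, 4, 3):
    weight_bytes = params * bits / 8
    min_latency_ms = weight_bytes / bandwidth * 1e3
    print(f"{bits}-bit weights: {weight_bytes / 1e9:.1f} GB, "
          f">= {min_latency_ms:.1f} ms per token")
```

Under these assumed numbers, shrinking the weights from 16 to 3 or 4 bits directly shrinks the memory traffic per token, which is why quantization helps even when the arithmetic itself is cheap.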

Quantization is a promising method that stores model parameters at lower precision than the usual 16 or 32 bits used during training. Despite recent advancements like LLaMA and its instruction-following variants, it is still difficult to achieve good quantization performance, especially at lower bit precisions and with relatively modest model sizes (e.g., 50B parameters).
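For intuition on why low-bit uniform quantization struggles, the minimal NumPy sketch below applies round-to-nearest uniform quantization to a toy Gaussian weight tensor. This is a generic baseline for illustration, not the method proposed in the paper:

```python
import numpy as np

def uniform_quantize(w, bits):
    """Round-to-nearest uniform quantization over the full weight range (simplified sketch)."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    codes = np.round((w - w_min) / scale)   # integer codes in [0, levels]
    return codes * scale + w_min            # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)      # toy weight tensor
for bits in (8, 4, 3):
    err = np.abs(uniform_quantize(w, bits) - w).mean()
    print(f"{bits}-bit uniform: mean abs error {err:.5f}")
```

Because all quantization levels are spread evenly across the full range, a handful of extreme values stretches the grid and wastes levels where few weights actually live, which is exactly the failure mode the sections below address.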


A new study from UC Berkeley investigates low-bit-precision quantization in depth to reveal the shortcomings of current methods. Based on these findings, the researchers introduce SqueezeLLM, a post-training quantization framework that combines a Dense-and-Sparse decomposition technique with a novel sensitivity-based non-uniform quantization strategy. These methods enable quantization at ultra-low bit precision while preserving competitive model performance, drastically cutting down model size and inference-time cost. On the C4 dataset, their method reduces the LLaMA-7B model's perplexity at 3-bit precision from 28.26 with uniform quantization to 7.75, a substantial improvement.
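The sensitivity-based non-uniform quantization can be pictured as a weighted clustering problem over the weight values: quantization levels are placed where sensitive weights concentrate rather than on an even grid. The sketch below uses a simple 1-D weighted k-means in NumPy with placeholder sensitivity scores; the actual framework derives sensitivities from second-order (Fisher) information and includes further optimizations, so treat this only as an illustration of the idea:

```python
import numpy as np

def sensitivity_kmeans(w, sens, bits, iters=20):
    """Weighted 1-D k-means: the centroids become a non-uniform codebook of 2**bits levels.
    `sens` holds per-weight sensitivity scores (placeholders here)."""
    k = 2 ** bits
    # initialize centroids at evenly spaced quantiles of the weight distribution
    centroids = np.quantile(w, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centroids[j] = np.average(w[mask], weights=sens[mask])
    assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[assign]   # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=50_000)
sens = rng.uniform(0.1, 1.0, size=w.shape)   # placeholder sensitivities
w_q = sensitivity_kmeans(w, sens, bits=3)
print("mean abs error:", np.abs(w_q - w).mean())
```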

Through comprehensive testing on the C4 and WikiText2 benchmarks, the researchers discovered that SqueezeLLM consistently outperforms existing quantization approaches by a large margin across different bit precisions when applied to LLaMA-7B, 13B, and 30B for language modeling tasks.

According to the team, low-bit-precision quantization of many LLMs is especially difficult due to substantial outliers in the weight matrices. These outliers also hurt their non-uniform quantization approach, since they bias the allocation of quantization levels toward extremely high or low values. To handle the outlier values, they propose a simple method that splits the model weights into dense and sparse components. By isolating the extreme values, the remaining central region exhibits a range that is up to 10x narrower, leading to higher quantization precision. With efficient sparse storage formats like Compressed Sparse Row (CSR), the sparse data can be kept in full precision. This method incurs low overhead by using efficient sparse kernels for the sparse part and parallelizing its computation alongside the dense part.
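Below is a minimal sketch of such a dense-and-sparse split, assuming an illustrative outlier fraction (the 0.45% threshold is a placeholder, not necessarily the paper's setting). The outliers go into a SciPy CSR matrix in full precision, while the remaining narrow-range dense part is what would get quantized:

```python
import numpy as np
from scipy.sparse import csr_matrix

def dense_sparse_split(W, outlier_pct=0.45):
    """Split W into a narrow-range dense part and a sparse outlier part stored in CSR.
    The outlier percentage is an assumed value for illustration."""
    thresh = np.percentile(np.abs(W), 100 - outlier_pct)
    outlier_mask = np.abs(W) > thresh
    sparse_part = csr_matrix(np.where(outlier_mask, W, 0.0))   # kept in full precision
    dense_part = np.where(outlier_mask, 0.0, W)                # this part gets quantized
    return dense_part, sparse_part

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(512, 512))
W.flat[rng.choice(W.size, 50, replace=False)] = 1.0            # inject a few outliers
dense, sparse = dense_sparse_split(W)
print("dense range:", dense.min(), dense.max())
print("outliers stored sparsely:", sparse.nnz)
# At inference, y = W_dense_quantized @ x + sparse @ x, with the second term
# handled by a sparse kernel running alongside the dense one.
```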

The team demonstrates their framework's potential for quantizing instruction-following (IF) models by applying SqueezeLLM to the Vicuna-7B and 13B models. They compare two setups in their tests. First, they use the MMLU dataset, a multi-task benchmark that measures a model's knowledge and problem-solving abilities, to gauge the quality of the generated output. They also use GPT-4 to rank the generation quality of the quantized models relative to the FP16 baseline, following the evaluation methodology presented in Vicuna. In both benchmarks, SqueezeLLM consistently outperforms GPTQ and AWQ, two current state-of-the-art approaches. Notably, in both assessments, the 4-bit quantized model performs just as well as the baseline.

The work shows considerable latency reductions and advances in quantization performance with the models running on A6000 GPUs. The researchers demonstrate speedups of up to 2.3x compared with baseline FP16 inference for LLaMA-7B and 13B. Moreover, the proposed method achieves up to 4x faster latency than GPTQ, demonstrating its efficacy in both quantization performance and inference efficiency.


Check out the Paper and GitHub.





Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the application of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.


