
Large Language Models (LLMs) have taken the world by storm due to their remarkable performance and potential across a wide range of tasks. They are best known for their capabilities in text generation, language understanding, text summarization, and many more. The downside to their widespread adoption is the astronomical size of their model parameters, which requires significant memory capacity and specialized hardware for inference. Consequently, deploying these models has been quite difficult.
One way to reduce the computational power required for inference is quantization, i.e., reducing the precision of the weights and activations of an artificial neural network. INT8 quantization and weight-only quantization are two approaches for lowering the inference cost. These methods, however, are generally optimized for CUDA and may not work well on CPUs.
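To make the idea concrete, here is a minimal NumPy sketch of weight-only quantization (purely illustrative, not the paper's implementation): the weights of a linear layer are stored in INT8 and dequantized on the fly, while the activations stay in FP32.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization of a weight matrix."""
    scale = np.abs(w).max() / 127.0                    # map the largest magnitude to 127
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def linear_weight_only(x_fp32: np.ndarray, w_int8: np.ndarray, scale: float):
    """Activations stay in FP32; weights are dequantized before the matmul."""
    w_fp32 = w_int8.astype(np.float32) * scale
    return x_fp32 @ w_fp32.T

# Toy layer: one token of activation against a 4096x4096 weight matrix
rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)
x = rng.normal(size=(1, 4096)).astype(np.float32)

w_q, s = quantize_weights_int8(w)
err = np.abs(linear_weight_only(x, w_q, s) - x @ w.T).mean()
print(f"mean absolute error vs FP32: {err:.4f}")       # small, since only the weights lost precision
```

Because only the weights are stored at low precision, the memory footprint shrinks while the numerically sensitive activations keep their full range.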
The authors of this research paper from Intel have proposed a way of efficiently deploying LLMs on CPUs. Their approach supports an automatic INT4 weight-only quantization flow (low precision is applied to the model weights only, while the activations are kept at high precision). They have also designed a dedicated LLM runtime with highly optimized kernels that speed up the inference process on CPUs.
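The accompanying code is released as part of Intel Extension for Transformers. As a hedged sketch only (the class names, arguments, and model identifier below are assumptions and may differ between library versions), loading a model with INT4 weight-only quantization might look roughly like this:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"   # example model id, not prescribed by the paper
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit requests the INT4 weight-only quantization path (argument name assumed)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```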
The quantization flow is built on top of the Intel Neural Compressor and allows tuning over different quantization recipes, granularities, and group sizes to generate an INT4 model that meets the accuracy goal. The model is then passed to the LLM runtime, a specialized environment designed to evaluate the performance of the quantized model. The runtime has been designed to provide efficient inference of LLMs on CPUs.
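The "group size" knob controls how many weights share a single scale and zero point; smaller groups track the weight distribution more closely at the cost of extra metadata. The following NumPy sketch illustrates group-wise asymmetric INT4 quantization (an illustration of the idea only; the actual recipes and kernels live in the Intel Neural Compressor and the LLM runtime):

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 128):
    """Asymmetric INT4 quantization with one (scale, zero_point) per group of weights."""
    w_groups = w.reshape(-1, group_size)                  # each row is one group
    w_min = w_groups.min(axis=1, keepdims=True)
    w_max = w_groups.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)      # INT4 unsigned range: 0..15
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w_groups / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point, shape):
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)

for g in (32, 128):                                       # smaller groups -> lower error, more overhead
    q, s, z = quantize_int4_groupwise(w, group_size=g)
    err = np.abs(dequantize(q, s, z, w.shape) - w).mean()
    print(f"group_size={g}: mean abs error {err:.5f}")
```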
For their experiments, the researchers selected some of the most popular LLMs, covering a diverse range of parameter sizes (from 7B to 20B). They evaluated the performance of the FP32 and INT4 models using open-source datasets and observed that the accuracy of the quantized model on the chosen datasets was nearly on par with that of the FP32 model. Moreover, they performed a comparative analysis of next-token generation latency and found that the LLM runtime outperforms the ggml-based solution by up to 1.6x.
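To give a rough idea of what "next-token latency" means in practice, the hedged sketch below times token-by-token greedy decoding with Hugging Face Transformers (the model name is a small placeholder, and this is not the paper's benchmark harness):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder; the paper evaluates 7B-20B models on CPUs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

input_ids = tokenizer("The benefits of CPU inference are", return_tensors="pt").input_ids

latencies = []
with torch.no_grad():
    past = None
    for _ in range(32):                                    # generate 32 tokens greedily
        start = time.perf_counter()
        out = model(input_ids if past is None else input_ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        latencies.append(time.perf_counter() - start)

# The first step is the prompt (prefill); the rest are per-token decode latencies
decode = latencies[1:]
print(f"avg next-token latency: {1000 * sum(decode) / len(decode):.1f} ms")
```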
In conclusion, this research paper presents a solution to one of the biggest challenges associated with LLMs, i.e., inference on CPUs. Traditionally, these models have required specialized hardware like GPUs, rendering them inaccessible to many organizations. This paper presents INT4 model quantization together with a specialized LLM runtime to provide efficient inference of LLMs on CPUs. When evaluated on a set of popular LLMs, the method demonstrated an advantage over ggml-based solutions and delivered accuracy on par with that of FP32 models. There is, however, scope for further improvement, and the researchers plan to empower generative AI on PCs to meet the growing demands of AI-generated content.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our 32k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Arham Islam
I am a Civil Engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.