Improving LLM Inference Speeds on CPUs with Model Quantization
Inference Latency in Application Development
Model Compression
Conclusion and Discussion

Image property of Creator. Created with Nightcafe.

Discover how to significantly improve inference latency on CPUs using quantization techniques for mixed, int8, and int4 precisions.

Towards Data Science

One of the most significant challenges the AI space faces is the need for computing resources to host large-scale, production-grade LLM-based applications. At scale, LLM applications require redundancy, scalability, and reliability, which have historically been possible only on general-purpose computing platforms like CPUs. Still, the prevailing narrative today is that CPUs cannot handle LLM inference at latencies comparable to high-end GPUs.

One open-source tool in the ecosystem that can help address inference latency challenges on CPUs is the Intel Extension for PyTorch (IPEX), which provides up-to-date feature optimizations for an extra performance boost on Intel hardware. IPEX delivers a variety of easy-to-implement optimizations that make use of hardware-level instructions. This tutorial dives into the concept of model compression and the out-of-the-box model compression techniques IPEX provides. These compression techniques directly impact LLM inference performance on general-purpose computing platforms, like 4th- and 5th-generation Intel CPUs.

Second only to application safety and security, inference latency is one of the most critical parameters of an AI application in production. For LLM-based applications, latency or throughput is usually measured in tokens/second. As illustrated in the simplified inference processing sequence below, tokens are processed by the language model and then de-tokenized into natural language.

GIF 1. Inference processing sequence. Image by Creator

Thinking about inference this way can sometimes lead us astray because we analyze this component of AI applications in isolation from the traditional production software paradigm. Yes, AI apps have their nuances, but at the end of the day, we’re still talking about transactions per unit of time. If we start to think about inference as a transaction like any other, the problem becomes less complex from an application design standpoint. For instance, let’s say we have a chat application with the following requirements:

  • Average of 300 user sessions per hour
  • Average of 5 transactions (LLM inference requests) per user per session
  • Average 100 tokens generated per transaction
  • Each session has an average of 10,000 ms (10 s) of overhead for user authentication, guardrailing, network latency, and pre/post-processing.
  • Users take an average of 30,000 ms (30 s) to reply when actively engaged with the chatbot.
  • The average total active session time goal is 3 minutes or less.

With some simple napkin math, we can approximate the required latency of our LLM inference engine.

Figure 1. A simple equation to calculate the required transaction and token latency based on various application requirements. Image by Creator
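
For readers who want to redo the napkin math with their own numbers, here is a minimal sketch based on the requirements listed above. It assumes the user pauses once per transaction and that whatever remains of the session budget after overhead and user think time is split evenly across the LLM requests. The variable names are illustrative and not taken from the figure.

```python
# Application requirements from the list above
sessions_per_hour = 300
transactions_per_session = 5
tokens_per_transaction = 100
session_overhead_ms = 10_000         # auth, guardrailing, network, pre/post-processing
user_reply_ms = 30_000               # average time a user takes to reply
session_target_ms = 3 * 60 * 1000    # 3-minute active session goal

# Assumption: the user pauses once per transaction
user_time_ms = transactions_per_session * user_reply_ms

# Time left in the session budget for LLM inference
inference_budget_ms = session_target_ms - session_overhead_ms - user_time_ms

transaction_latency_ms = inference_budget_ms / transactions_per_session
token_latency_ms = transaction_latency_ms / tokens_per_transaction
tokens_per_second = 1000 / token_latency_ms

# Average load across all sessions, useful for capacity planning
aggregate_tokens_per_second = (
    sessions_per_hour * transactions_per_session * tokens_per_transaction / 3600
)

print(f"required transaction latency: {transaction_latency_ms:.0f} ms")
print(f"required token latency: {token_latency_ms:.0f} ms/token")
print(f"required per-request throughput: {tokens_per_second:.0f} tokens/s")
print(f"average aggregate load: {aggregate_tokens_per_second:.1f} tokens/s")
```

Under these assumptions, the budget works out to 4,000 ms per transaction, or 40 ms per token (25 tokens/second per request).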

Achieving required latency thresholds in production is a challenge, especially if you need to do it without incurring additional compute infrastructure costs. In the rest of this article, we’ll explore one way to significantly improve inference latency: model compression.

Model compression is a loaded term because it covers a variety of techniques, such as model quantization, distillation, pruning, and more. At their core, the main aim of these techniques is to reduce the computational complexity of neural networks.

GIF 2. Illustration of inference processing sequence — Image by Creator

The technique we’ll focus on today is model quantization, which involves reducing the byte precision of the weights and, at times, the activations. This reduces the computational load of matrix operations and the memory burden of moving around larger, higher-precision values. The figure below illustrates the process of quantizing fp32 weights to int8.

Fig 2. Visual representation of model quantization going from full precision at FP32 down to quarter precision at INT8, theoretically reducing the model complexity by a factor of 4. Image by Creator
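
To make the idea concrete, here is a small sketch of one common scheme, symmetric per-tensor quantization, written in plain Python. This is an assumption chosen for illustration and is not necessarily the exact scheme the figure depicts or the one IPEX applies. Each fp32 weight (4 bytes) is mapped to an integer in the int8 range (1 byte) using a single scale factor, and the round trip shows the precision that is lost.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]


weights = [0.4217, -1.27, 0.0513, 0.8931, -0.3349]  # toy fp32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.4f}")
```

The round-trip error of each weight is bounded by half the scale, which is why a tensor with a few very large values (and therefore a large scale) loses more precision on its small values.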

It’s worth mentioning that the 4x reduction in complexity that results from quantizing from fp32 (full precision) to int8 (quarter precision) doesn’t translate into a 4x latency reduction during inference, because inference latency depends on more factors than just model-centric properties.

As with many things, there is no one-size-fits-all approach, and in this article, we’ll explore three of my favorite techniques for quantizing models using IPEX:

Mixed-Precision (bf16/fp32)

This technique quantizes some, but not all, of the weights in the neural network, resulting in a partial compression of the model. It is ideal for smaller models, like the <1B LLMs of the world.

Fig 3. Simple illustration of mixed precision, showing FP32 weights in orange and half-precision quantized bf16 weights in green. Image by Creator
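
To get a feel for what is given up when a weight is stored in bf16 instead of fp32, the sketch below emulates the conversion in plain Python. bf16 keeps the sign bit, all 8 exponent bits, and only the top 7 mantissa bits of an fp32 value, so the numeric range is preserved while precision drops. The helper `to_bf16` is a hypothetical name for illustration, it uses round-to-nearest-even, and it does not handle NaN inputs.

```python
import struct


def to_bf16(x):
    """Round an fp32 value to bf16 precision and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that will be discarded
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


for value in (1.0, 3.14159265, 0.001234, 1e30):
    approx = to_bf16(value)
    rel_error = abs(approx - value) / abs(value)
    print(f"{value!r} -> {approx!r} (relative error {rel_error:.2e})")
```

Even very large values such as 1e30 survive the conversion, which is the main reason bf16 is a convenient drop-in for fp32 in mixed-precision inference.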

The implementation is quite straightforward: using Hugging Face transformers, a model can be loaded into memory and optimized with the IPEX LLM-specific optimization function ipex.llm.optimize(model, dtype=dtype). By setting dtype = torch.bfloat16, we activate the mixed-precision inference capability, which improves inference latency over full-precision (fp32) and stock (non-optimized) inference.

import time
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# PART 1: Model and tokenizer loading using transformers
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")

# PART 2: Use IPEX to optimize the model
#dtype = torch.float # use for full precision FP32
dtype = torch.bfloat16 # use for mixed precision inference
model = ipex.llm.optimize(model, dtype=dtype)

# PART 3: Create a hugging face inference pipeline and generate results
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
st = time.time()
results = pipe("A fisherman at sea...", max_length=250)
end = time.time()
generation_latency = end-st

print('generation latency: ', generation_latency)

Of the three compression techniques we’ll explore, this is the easiest to implement (measured by unique lines of code) and offers the smallest net improvement over a non-quantized baseline.
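
When comparing a quantized model against a baseline, a single timed call can be noisy. Below is a hedged sketch of a small timing helper that discards warm-up runs and reports the median. `measure_latency` is a hypothetical helper, not part of IPEX or transformers, and the `time.sleep` workload is only a stand-in for a real generation call.

```python
import statistics
import time


def measure_latency(generate_fn, n_tokens, warmup=1, runs=3):
    """Time generate_fn() and report median latency and tokens/second."""
    for _ in range(warmup):
        generate_fn()  # warm-up runs are excluded from timing
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn()
        timings.append(time.perf_counter() - start)
    median_s = statistics.median(timings)
    return {"median_s": median_s, "tokens_per_s": n_tokens / median_s}


# Stand-in workload; swap in a real call such as
# lambda: pipe("A fisherman at sea...", max_length=250)
stats = measure_latency(lambda: time.sleep(0.01), n_tokens=250)
print(f"median latency: {stats['median_s']:.3f} s")
print(f"throughput: {stats['tokens_per_s']:.0f} tokens/s")
```

Note that in transformers, max_length counts the prompt tokens as well, so n_tokens should be set to the number of tokens actually generated if you want an accurate tokens/second figure.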

SmoothQuant (int8)

This technique addresses the core challenges of quantizing LLMs, which include handling large-magnitude outliers in activation channels across all layers and tokens, a common issue that traditional quantization techniques struggle to manage effectively. It applies a joint mathematical transformation to both weights and activations within the model. The transformation strategically reduces the disparity between outlier and non-outlier values for activations, albeit at the cost of increasing this ratio for weights. This adjustment renders the Transformer layers “quantization-friendly,” enabling the successful application of int8 quantization without degrading model quality.

Fig 4. Simple illustration of SmoothQuant showing weights as circles and activations as triangles. The diagram depicts the two main steps: (1) the application of a scaler for smoothing and (2) the quantization to int8. Image by Creator
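
The smoothing step can be sketched in a few lines of plain Python. This toy example follows the per-channel scale described in the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1-α), with α = 0.5 assumed as the migration strength. Activations are divided by the scale and the matching weight rows are multiplied by it, so the layer output is mathematically unchanged. The matrices and function names are made up for illustration and this is not how IPEX implements it internally.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Shift quantization difficulty from activations x to weights w."""
    n_channels = len(w)
    scales = []
    for j in range(n_channels):
        act_max = max(abs(row[j]) for row in x)      # per input channel
        weight_max = max(abs(v) for v in w[j])       # matching weight row
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    x_smooth = [[row[j] / scales[j] for j in range(n_channels)] for row in x]
    w_smooth = [[v * scales[j] for v in w[j]] for j in range(n_channels)]
    return x_smooth, w_smooth, scales


# Two tokens, three input channels; channel 1 carries activation outliers
x = [[0.5, 40.0, -0.2],
     [-0.3, -60.0, 0.4]]
w = [[0.2, -0.1],
     [0.1, 0.3],
     [-0.4, 0.2]]

x_s, w_s, scales = smooth(x, w)
y, y_s = matmul(x, w), matmul(x_s, w_s)

x_max = max(abs(v) for row in x for v in row)
x_s_max = max(abs(v) for row in x_s for v in row)
w_max = max(abs(v) for row in w for v in row)
w_s_max = max(abs(v) for row in w_s for v in row)

print(f"activation max: {x_max:.2f} -> {x_s_max:.2f}")
print(f"weight max: {w_max:.2f} -> {w_s_max:.2f}")
print("outputs match:", all(abs(a - b) < 1e-9 for r, s in zip(y, y_s) for a, b in zip(r, s)))
```

The activation range shrinks considerably while the weight range grows, which is the trade-off described above: the outliers become easier to represent in int8 at the cost of slightly harder weights.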

Below, you’ll find a simple SmoothQuant implementation, omitting the code for creating the DataLoader, which is a common and well-documented PyTorch concept. SmoothQuant is an accuracy-aware post-training quantization recipe, meaning that by providing a calibration dataset and model, you can establish a baseline and limit language modeling degradation. The calibration model generates a quantization configuration, which is then passed to ipex.llm.optimize() together with the SmoothQuant mapping. Upon execution, SmoothQuant is applied, and the model can be tested using the .generate() method.

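import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM

# PART 1: Load model and tokenizer from Hugging Face + Load SmoothQuant config mapping
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()

# PART 2: Configure calibration
# prepare your calibration dataset samples
calib_dataset = DataLoader(...)  # your dataloader parameters
example_inputs = ...  # provide a sample input from your calib_dataset
# placeholder path for the calibration summary file; choose your own location
qconfig_summary_file_path = "smoothquant_qconfig.json"
calibration_model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
)
prepared_model = prepare(
    calibration_model.eval(), qconfig, example_inputs=example_inputs
)
with torch.no_grad():
    # run each calibration sample through the prepared model to collect statistics
    for calib_samples in calib_dataset:
        prepared_model(calib_samples)
# save the calibration result; check the arguments against your installed IPEX version
prepared_model.save_qconf_summary(qconf_summary=qconfig_summary_file_path)

# PART 3: Model Quantization using SmoothQuant
model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
    qconfig_summary_file=qconfig_summary_file_path,
)

# generation inference loop
with torch.inference_mode():
    model.generate(...)  # your generate parameters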

SmoothQuant is a powerful model compression technique that helps significantly improve inference latency over full-precision models. Still, it requires a little upfront work to prepare a calibration dataset and model.

Weight-Only Quantization (int8 and int4)

Compared to traditional int8 quantization applied to both activations and weights, weight-only quantization (WOQ) offers a better balance between performance and accuracy. It’s worth noting that int4 WOQ requires dequantizing to bf16/fp16 before computation (Figure 5), which introduces a compute overhead. A basic WOQ technique, tensor-wise asymmetric Round To Nearest (RTN) quantization, presents challenges and often leads to reduced accuracy (source). However, the literature (Zhewei Yao, 2022) suggests that groupwise quantization of the model’s weights helps maintain accuracy. Because the weights are only dequantized for computation, a significant memory advantage remains despite this extra step.

Fig 5. Simple illustration of weight-only quantization, with pre-quantized weights in orange and the quantized weights in green. Note that this depicts the initial quantization to int4/int8 and dequantization to fp16/bf16 for the computation step. Image by Creator
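
The difference between tensor-wise and groupwise RTN is easy to see on a toy example. The sketch below simulates asymmetric 4-bit round-to-nearest in plain Python (quantize, then immediately dequantize) and compares the error when one scale covers the whole tensor versus one scale per group of four weights. The weights, the group size, and the function names are assumptions for illustration. Real int4 kernels also pack two codes per byte, which is not shown here.

```python
def rtn_asymmetric(values, n_bits=4):
    """Asymmetric round-to-nearest: map [min, max] onto integer levels, then dequantize."""
    levels = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]  # integers in [0, levels]
    return [lo + c * scale for c in codes]


def groupwise_rtn(values, group_size, n_bits=4):
    """Apply RTN independently to consecutive groups, each with its own scale."""
    out = []
    for i in range(0, len(values), group_size):
        out.extend(rtn_asymmetric(values[i:i + group_size], n_bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.02, -0.05, 0.03, 0.01, -0.04, 0.06, -0.01, 4.0]  # last value is an outlier

tensor_err = mean_abs_error(weights, rtn_asymmetric(weights))
group_err = mean_abs_error(weights, groupwise_rtn(weights, group_size=4))

print(f"tensor-wise int4 mean abs error: {tensor_err:.4f}")
print(f"groupwise int4 mean abs error: {group_err:.4f}")
```

With a single scale, the outlier stretches the quantization grid so far that every small weight collapses onto the same level. Grouping confines that damage to the group containing the outlier, which on this data gives a noticeably lower error.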

The WOQ implementation below showcases the few lines of code required to quantize a model from Hugging Face with this technique. As with the previous implementations, we start by loading a model and tokenizer from Hugging Face. We can use the get_weight_only_quant_qconfig_mapping() method to configure the WOQ recipe. The recipe is then passed to the ipex.llm.optimize() function together with the model for optimization and quantization. The quantized model can then be used for inference with the .generate() method.

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM

# PART 1: Model and tokenizer loading
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-3")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")

# PART 2: Preparation of quantization config
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=torch.qint8,  # or torch.quint4x2
    lowp_mode=ipex.quantization.WoqLowpMode.NONE,  # or FP16, BF16, INT8
)
checkpoint = None  # optionally load int4 or int8 checkpoint

# PART 3: Model optimization and quantization
model = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_checkpoint=checkpoint)

# PART 4: Generation inference loop
with torch.inference_mode():
    model.generate(...)  # your generate parameters

As you can see, WOQ provides a powerful way to compress models down to a fraction of their original size with limited impact on language modeling capabilities.
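
To put "a fraction of their original size" into numbers, here is a rough back-of-the-envelope sketch. It assumes a model with roughly 7 billion parameters and counts only the weights themselves, ignoring activations, the KV cache, and the scales and zero points that quantized formats store alongside the weights.

```python
# Storage cost per weight for each precision
BYTES_PER_WEIGHT = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_WEIGHT[dtype] / 1e9


n_params = 7e9  # assumption: a model with roughly 7 billion parameters

for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gb(n_params, dtype):.1f} GB")
```

Under these assumptions, the weights shrink from about 28 GB at fp32 to 7 GB at int8 and 3.5 GB at int4.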

As an engineer at Intel, I’ve worked closely with the IPEX engineering team. This has afforded me unique insight into its benefits and development roadmap, making IPEX a preferred tool. However, for developers seeking simplicity without the need to manage an extra dependency, PyTorch offers three quantization recipes: Eager Mode, FX Graph Mode (under maintenance), and PyTorch 2 Export Quantization, providing strong, less specialized alternatives.

Regardless of which technique you choose, model compression will result in some degree of language modeling performance loss, albeit less than 1% in many cases. For this reason, it’s essential to evaluate the application’s fault tolerance and establish a baseline for model performance at full (FP32) and/or half precision (BF16/FP16) before pursuing quantization.

In applications that leverage some degree of in-context learning, like Retrieval Augmented Generation (RAG), model compression might be an excellent choice. In these cases, the mission-critical knowledge is spoon-fed to the model at inference time, so the risk is heavily reduced even for applications with low fault tolerance.

Quantization is an excellent way to address LLM inference latency concerns without upgrading or expanding compute infrastructure. It’s worth exploring regardless of your use case, and IPEX provides a good option to start with just a few lines of code.

A few exciting things to try would be:

  • Test the sample code in this tutorial on the Intel Developer Cloud’s free Jupyter Environment.
  • Take an existing model that you’re running on an accelerator at full precision and test it out on a CPU at int4/int8.
  • Explore all three techniques and determine which works best for your use case. Make sure to compare the loss of language modeling performance, not just latency.
  • Upload your quantized model to the Hugging Face Model Hub! If you do, let me know. I’d love to check it out!

Thanks for reading! Don’t forget to follow my profile for more articles like this!

