Introduction to Weight Quantization
📚 Background on Floating Point Representation
🔰 Naïve 8-bit Quantization
🔢 8-bit Quantization with LLM.int8()
Conclusion
References

The Python implementation is quite straightforward:

def zeropoint_quantize(X):
    # Calculate value range (denominator)
    x_range = torch.max(X) - torch.min(X)
    x_range = 1 if x_range == 0 else x_range

    # Calculate scale
    scale = 255 / x_range

    # Shift by zero-point
    zeropoint = (-scale * torch.min(X) - 128).round()

    # Scale and round the inputs
    X_quant = torch.clip((X * scale + zeropoint).round(), -128, 127)

    # Dequantize
    X_dequant = (X_quant - zeropoint) / scale

    return X_quant.to(torch.int8), X_dequant

Instead of relying on complete toy examples, we can use these two functions on a real model thanks to the transformers library.

We start by loading the model and tokenizer for GPT-2. This is a very small model that we probably don't need to quantize, but it will be good enough for this tutorial. First, we want to observe the model's size so we can compare it later and evaluate the memory savings due to 8-bit quantization.

!pip install -q bitsandbytes>=0.39.0
!pip install -q git+https://github.com/huggingface/accelerate.git
!pip install -q git+https://github.com/huggingface/transformers.git
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

# Set device to CPU for now
device = 'cpu'

# Load model and tokenizer
model_id = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Print model size
print(f"Model size: {model.get_memory_footprint():,} bytes")

Model size: 510,342,192 bytes

The size of the GPT-2 model is approximately 487MB in FP32 (510,342,192 bytes ÷ 1,024² ≈ 487 MiB, essentially 4 bytes per parameter for GPT-2's ~124M parameters plus a few buffers). The next step consists of quantizing the weights using zero-point and absmax quantization. In the following example, we apply these techniques to the first attention layer of GPT-2 to see the results.

# Extract weights of the first layer
weights = model.transformer.h[0].attn.c_attn.weight.data
print("Original weights:")
print(weights)

# Quantize layer using absmax quantization
weights_abs_quant, _ = absmax_quantize(weights)
print("\nAbsmax quantized weights:")
print(weights_abs_quant)

# Quantize layer using zero-point quantization
weights_zp_quant, _ = zeropoint_quantize(weights)
print("\nZero-point quantized weights:")
print(weights_zp_quant)

Original weights:
tensor([[-0.4738, -0.2614, -0.0978,  ...,  0.0513, -0.0584,  0.0250],
        [ 0.0874,  0.1473,  0.2387,  ..., -0.0525, -0.0113, -0.0156],
        [ 0.0039,  0.0695,  0.3668,  ...,  0.1143,  0.0363, -0.0318],
        ...,
        [-0.2592, -0.0164,  0.1991,  ...,  0.0095, -0.0516,  0.0319],
        [ 0.1517,  0.2170,  0.1043,  ...,  0.0293, -0.0429, -0.0475],
        [-0.4100, -0.1924, -0.2400,  ..., -0.0046,  0.0070,  0.0198]])

Absmax quantized weights:
tensor([[-21, -12,  -4,  ...,   2,  -3,   1],
        [  4,   7,  11,  ...,  -2,  -1,  -1],
        [  0,   3,  16,  ...,   5,   2,  -1],
        ...,
        [-12,  -1,   9,  ...,   0,  -2,   1],
        [  7,  10,   5,  ...,   1,  -2,  -2],
        [-18,  -9, -11,  ...,   0,   0,   1]], dtype=torch.int8)

Zero-point quantized weights:
tensor([[-20, -11,  -3,  ...,   3,  -2,   2],
        [  5,   8,  12,  ...,  -1,   0,   0],
        [  1,   4,  18,  ...,   6,   3,   0],
        ...,
        [-11,   0,  10,  ...,   1,  -1,   2],
        [  8,  11,   6,  ...,   2,  -1,  -1],
        [-18,  -8, -10,  ...,   1,   1,   2]], dtype=torch.int8)

The difference between the original (FP32) and quantized values (INT8) is clear, but the difference between absmax and zero-point weights is more subtle. In this case, the values look shifted by a value of -1. This suggests that the weight distribution in this layer is quite symmetric.
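One quick way to verify this, reusing the weights_abs_quant and weights_zp_quant tensors from the cell above, is to look at their element-wise difference:

# Compare the two quantized tensors element-wise (cast to int32 to avoid int8 overflow)
diff = weights_zp_quant.to(torch.int32) - weights_abs_quant.to(torch.int32)

# If the distribution were perfectly symmetric, the difference would be a constant offset
print(diff.unique(return_counts=True))
print(f"Mean offset: {diff.float().mean():.2f}")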

We can compare these techniques by quantizing every layer in GPT-2 (linear layers, attention layers, etc.) and creating two new models: model_abs and model_zp. To be precise, we'll actually replace the original weights with de-quantized ones. This has two benefits: it allows us to 1/ compare the distribution of our weights (same scale) and 2/ actually run the models.

Indeed, PyTorch doesn't allow INT8 matrix multiplication by default. In a real scenario, we would dequantize them to run the model (in FP16 for example) but store them as INT8. In the next section, we'll use the bitsandbytes library to solve this issue.
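To illustrate that idea, here is a rough sketch (not code from this article) of storing a layer's weights as INT8 plus a scale factor and dequantizing them just before the matrix multiplication:

# Sketch: store INT8 weights + a scale, dequantize on the fly for the matmul
W = model.transformer.h[0].attn.c_attn.weight.data   # FP32 weights, shape (768, 2304)
scale = 127 / torch.max(torch.abs(W))                # absmax scale factor
W_int8 = (W * scale).round().to(torch.int8)          # stored with 1 byte per value

x = torch.randn(1, W.shape[0])                       # dummy activation
y = x @ (W_int8.to(torch.float32) / scale)           # dequantize (FP16 on a GPU), then multiply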

import numpy as np
from copy import deepcopy

# Store original weights
weights = [param.data.clone() for param in model.parameters()]

# Create model to quantize
model_abs = deepcopy(model)

# Quantize all model weights
weights_abs = []
for param in model_abs.parameters():
    _, dequantized = absmax_quantize(param.data)
    param.data = dequantized
    weights_abs.append(dequantized)

# Create model to quantize
model_zp = deepcopy(model)

# Quantize all model weights
weights_zp = []
for param in model_zp.parameters():
    _, dequantized = zeropoint_quantize(param.data)
    param.data = dequantized
    weights_zp.append(dequantized)

Now that our models have been quantized, we want to check the impact of this process. Intuitively, we want to make sure that the quantized weights are close to the original ones. A visual way to check this is to plot the distribution of the dequantized and original weights. If the quantization is lossy, it would drastically change the weight distribution.

The following figure shows this comparison, where the blue histogram represents the original (FP32) weights, and the red one represents the dequantized (from INT8) weights. Note that we only display this plot between -2 and 2 because of outliers with very high absolute values (more on that later).
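The plot is not reproduced here, but a minimal sketch of how it could be generated, assuming matplotlib is available and reusing the weights and weights_abs lists from above, would look like this:

import matplotlib.pyplot as plt

# Flatten the original and dequantized (absmax) weights into 1-D arrays
weights_orig = np.concatenate([t.cpu().numpy().flatten() for t in weights])
weights_deq = np.concatenate([t.cpu().numpy().flatten() for t in weights_abs])

# Overlay the two histograms on the [-2, 2] range used in the figure
bins = np.linspace(-2, 2, 150)
plt.hist(weights_orig, bins=bins, alpha=0.5, color='blue', label='Original (FP32) weights')
plt.hist(weights_deq, bins=bins, alpha=0.5, color='red', label='Dequantized (absmax) weights')
plt.legend()
plt.show()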

Both plots are quite similar, with a surprising spike around 0. This spike shows that our quantization is quite lossy since reversing the process doesn't output the original values. This is particularly true for the absmax model, which displays both a lower valley and a higher spike around 0.

Let's compare the performance of the original and quantized models. For this purpose, we define a generate_text() function to generate 50 tokens with top-k sampling.

def generate_text(model, input_text, max_length=50):
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    output = model.generate(inputs=input_ids,
                            max_length=max_length,
                            do_sample=True,
                            top_k=30,
                            pad_token_id=tokenizer.eos_token_id,
                            attention_mask=input_ids.new_ones(input_ids.shape))
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate text with original and quantized models
original_text = generate_text(model, "I have a dream")
absmax_text = generate_text(model_abs, "I have a dream")
zp_text = generate_text(model_zp, "I have a dream")

print(f"Original model:\n{original_text}")
print("-" * 50)
print(f"Absmax model:\n{absmax_text}")
print("-" * 50)
print(f"Zeropoint model:\n{zp_text}")

Original model:
I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had been told that my family wasn't even that strong. And then I got the
--------------------------------------------------
Absmax model:
I have a dream to find out the origin of her hair. She loves it. But there's no way you could be honest about how her hair is made. She must be crazy.

We found a photo of the hairstyle posted on
--------------------------------------------------
Zeropoint model:
I have a dream of creating two full-time jobs in America—one for people with mental health issues, and one for people who don't suffer from mental illness—or at least have an employment and family history of substance abuse, to work part

Instead of trying to see if one output makes more sense than the others, we can quantify it by calculating the perplexity of each output. This is a common metric used to evaluate language models, which measures the uncertainty of a model in predicting the next token in a sequence. In this comparison, we make the common assumption that the lower the score, the better the model is. In practice, a sentence with a high perplexity could also be correct.
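For reference, perplexity is simply the exponential of the average negative log-likelihood of the tokens: PPL = exp(-(1/N) * Σ_i log p(x_i | x_1, …, x_{i-1})). The cross-entropy loss returned by the model is already this average, which is why the function below only needs to take its exponential.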

We implement it using a minimal function since it doesn't need to consider details like the length of the context window, as our sentences are short.

def calculate_perplexity(model, text):
    # Encode the text
    encodings = tokenizer(text, return_tensors='pt').to(device)

    # Define input_ids and target_ids
    input_ids = encodings.input_ids
    target_ids = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

    # Loss calculation
    neg_log_likelihood = outputs.loss

    # Perplexity calculation
    ppl = torch.exp(neg_log_likelihood)

    return ppl

ppl = calculate_perplexity(model, original_text)
ppl_abs = calculate_perplexity(model_abs, absmax_text)
ppl_zp = calculate_perplexity(model_zp, zp_text)

print(f"Original perplexity: {ppl.item():.2f}")
print(f"Absmax perplexity: {ppl_abs.item():.2f}")
print(f"Zeropoint perplexity: {ppl_zp.item():.2f}")

Original perplexity: 15.53
Absmax perplexity: 17.92
Zeropoint perplexity: 17.97

We see that the perplexity of the original model is slightly lower than the two others. A single experiment is not very reliable, but we could repeat this process multiple times to see the difference between each model. In theory, zero-point quantization should be slightly better than absmax, but is also more costly to compute.
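As a rough sketch of such a repetition, we could average the perplexity over several generations per model (the prompt list below is purely illustrative):

# Illustrative only: average perplexity over a few prompts per model
prompts = ["I have a dream", "The future of AI is", "Quantization works by"]

for name, m in [("Original", model), ("Absmax", model_abs), ("Zeropoint", model_zp)]:
    scores = [calculate_perplexity(m, generate_text(m, p)).item() for p in prompts]
    print(f"{name} mean perplexity: {sum(scores) / len(scores):.2f}")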

In this example, we applied quantization techniques to entire layers (on a per-tensor basis). However, we could apply them at different granularity levels: from the entire model to individual values. Quantizing the entire model in one pass would seriously degrade the performance, while quantizing individual values would create a big overhead. In practice, we often prefer vector-wise quantization, which considers the variability of values in rows and columns inside the same tensor.
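To make this idea concrete, here is a sketch of a per-row (vector-wise) variant of absmax quantization; it illustrates the principle rather than the exact scheme used by any particular library:

def rowwise_absmax_quantize(X):
    # One scale per row instead of a single scale for the whole tensor
    scale = 127 / torch.max(torch.abs(X), dim=1, keepdim=True).values.clamp(min=1e-8)

    # Quantize and dequantize row by row
    X_quant = (X * scale).round()
    X_dequant = X_quant / scale

    return X_quant.to(torch.int8), X_dequant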

However, even vector-wise quantization doesn't solve the problem of outlier features. Outlier features are extreme values (negative or positive) that appear in all transformer layers when the model reaches a certain scale (>6.7B parameters). This is an issue since a single outlier can reduce the precision of all the other values. But discarding these outlier features is not an option since it would greatly degrade the model's performance.

Introduced by Dettmers et al. (2022), LLM.int8() is a solution to the outlier problem. It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. This means that outlier features are processed in FP16 format to retain their precision, while the other values are processed in INT8 format. As outliers represent about 0.1% of values, this effectively reduces the memory footprint of the LLM by almost 2x.


LLM.int8() works by conducting the matrix multiplication computation in three key steps (a toy sketch of this decomposition is shown after the list):

  1. Extract columns from the input hidden states X containing outlier features using a custom threshold.
  2. Perform the matrix multiplication of the outliers using FP16 and the non-outliers using INT8 with vector-wise quantization (row-wise for the hidden state X and column-wise for the weight matrix W).
  3. Dequantize the non-outlier results (INT8 to FP16) and add them to the outlier results to get the full result in FP16.
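The following toy sketch illustrates this decomposition. It is not the bitsandbytes implementation: the hypothetical mixed_precision_matmul helper and its threshold value are only for illustration, and the INT8 operands are kept as rounded floats because, as noted earlier, PyTorch has no native INT8 matmul.

# Toy illustration of the LLM.int8() decomposition (not the real bitsandbytes kernel)
def mixed_precision_matmul(X, W, threshold=6.0):
    # 1. Find the columns of X (features) that contain at least one outlier
    outlier_cols = (X.abs() > threshold).any(dim=0)

    # 2a. Outlier part: kept in higher precision (FP16 on a GPU; input dtype here)
    out_hi = X[:, outlier_cols] @ W[outlier_cols, :]

    # 2b. Non-outlier part: vector-wise absmax quantization
    X_sub, W_sub = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = 127 / X_sub.abs().max(dim=1, keepdim=True).values.clamp(min=1e-8)  # row-wise for X
    sw = 127 / W_sub.abs().max(dim=0, keepdim=True).values.clamp(min=1e-8)  # column-wise for W
    X_i8 = (X_sub * sx).round()  # rounded values, stored as float for the CPU matmul
    W_i8 = (W_sub * sw).round()

    # 3. Dequantize the INT8-style result and add it to the high-precision result
    out_lo = (X_i8 @ W_i8) / (sx * sw)
    return out_hi + out_lo

# The result stays close to the full-precision product
X, W = torch.randn(4, 8), torch.randn(8, 3)
print((mixed_precision_matmul(X, W) - X @ W).abs().max())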
