ExLlamaV2: The Fastest Library to Run LLMs

Quantize and run EXL2 models

Image by author

Quantizing Large Language Models (LLMs) is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library.

ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it's optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.

In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab.

⚡ Quantize EXL2 models

To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows:

git clone https://github.com/turboderp/exllamav2
pip install exllamav2

Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let's use the excellent zephyr-7B-beta, a Mistral-7B model fine-tuned using Direct Preference Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space.

We download zephyr-7B-beta using the following command (this can take a while since the model is about 15 GB):

git lfs install
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

GPTQ also requires a calibration dataset, which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows:

wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet
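
Before launching the quantization, it is worth taking a quick look at the calibration file to make sure the download went well. Here is a minimal check in Python (assuming pandas and pyarrow are installed; the wikitext parquet should expose a single "text" column):

import pandas as pd

# Load the calibration samples and peek at their structure
df = pd.read_parquet("wikitext-test.parquet")
print(df.shape)                    # number of rows and columns
print(df.columns.tolist())         # expected: ['text']
print(df["text"].iloc[1][:200])    # preview one passage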

Once it's done, we can leverage the convert.py script provided by the ExLlamaV2 library. We're mostly concerned with four arguments:

  • -i: Path of the base model to convert in HF format (FP16).
  • -o: Path of the working directory with temporary files and final output.
  • -c: Path of the calibration dataset (in Parquet format).
  • -b: Target average number of bits per weight (bpw). For example, 4.0 bpw will store the weights in 4-bit precision.

The complete list of arguments is available on this page. Let's start the quantization process using the convert.py script with the following arguments:

mkdir quant
python exllamav2/convert.py \
    -i base_model \
    -o quant \
    -c wikitext-test.parquet \
    -b 5.0

Note that you'll need a GPU to quantize this model. The official documentation specifies that you need approximately 8 GB of VRAM for a 7B model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta using a T4 GPU.
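
If you are not sure whether your GPU is big enough, you can check the available VRAM before starting the conversion. Here is a quick check with PyTorch (already installed as a dependency of ExLlamaV2):

import torch

# Print the name and total memory of the first CUDA device
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB of VRAM")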

Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. You can find more details about the GPTQ algorithm in this article.
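
To build some intuition for what "lowering the precision of the weights" means in practice, here is a toy round-to-nearest, group-wise quantizer in plain PyTorch. It only illustrates the general idea (integer weights plus one scale per group); it is not ExLlamaV2's actual implementation, which relies on the error-minimizing GPTQ procedure and custom CUDA kernels:

import torch

def quantize_group(w: torch.Tensor, bits: int = 4, group_size: int = 32):
    """Naive round-to-nearest quantization with one scale per group of weights."""
    groups = w.reshape(-1, group_size)
    half = 2 ** (bits - 1)                                    # 8 for 4-bit
    scale = groups.abs().amax(dim=1, keepdim=True) / (half - 1)
    scale = scale.clamp(min=1e-8)                             # avoid division by zero
    q = torch.clamp(torch.round(groups / scale) + half, 0, 2 * half - 1)  # integers in [0, 15]
    dequant = (q - half) * scale                              # reconstructed floating-point weights
    return q.to(torch.uint8), scale, dequant.reshape(w.shape)

w = torch.randn(4096, 32)                 # a made-up weight matrix
_, _, w_hat = quantize_group(w, bits=4, group_size=32)
print(f"Mean absolute quantization error: {(w - w_hat).abs().mean().item():.5f}")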

So why are we using the "EXL2" format instead of the regular GPTQ format? EXL2 comes with a few new features:

  • It supports different levels of quantization: it's not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.
  • It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits.

ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5, for example.

The benchmark of different parameters it creates is saved in the measurement.json file. The following JSON shows the measurement for one layer:

"key": "model.layers.0.self_attn.q_proj",
"numel": 16777216,
"options": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},

In this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error that is taken into account to select the best parameters.
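
The reported bpw can be checked directly from the numbers in this entry: total_bits / numel gives the headline figure, while the split between weight bits and grouping overhead is easy to estimate (the breakdown below is my own back-of-the-envelope reading of the qparams, not something the file spells out):

# Headline figure: total storage divided by the number of weights
numel = 16_777_216
total_bits = 36_706_304
print(total_bits / numel)          # 2.1878662109375, matches "bpw"

# Rough breakdown: 5% of weights at 3 bits, 95% at 2 bits...
raw_bpw = 0.05 * 3 + 0.95 * 2      # 2.05 bits per weight for the weights themselves
# ...plus one 4-bit scale for every group of 32 weights
scale_bpw = 4 / 32                 # 0.125 bits per weight of overhead
print(raw_bpw + scale_bpw)         # 2.175; the remaining ~0.013 bpw is other metadata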

🦙 Running ExLlamaV2 for Inference

Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden (.*) or a safetensors file. Additionally, we don't need the out_tensor directory that was created by ExLlamaV2 during quantization.

In bash, you can implement this as follows:

!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
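
If you prefer to stay in Python rather than shelling out, the same cleanup and copy can be done with the standard library. This is a minimal sketch mirroring the rsync command above (it only copies top-level files, which is enough for the config and tokenizer files we need):

import shutil
from pathlib import Path

src, dst = Path("base_model"), Path("quant")
shutil.rmtree(dst / "out_tensor", ignore_errors=True)   # drop the temporary directory

for f in src.iterdir():
    # Skip hidden files (.*) and the original FP16 shards (*.safetensors)
    if f.name.startswith(".") or f.suffix == ".safetensors":
        continue
    if f.is_file():
        shutil.copy2(f, dst / f.name)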

Our EXL2 model is ready and we have several options to run it. The most straightforward method consists of using the test_inference.py script in the ExLlamaV2 repo (note that I don't use a chat template here):

python exllamav2/test_inference.py -m quant/ -p "I actually have a dream"

The generation is very fast (56.44 tokens/second on a T4 GPU), even compared to other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You can find an in-depth comparison between different solutions in this excellent article from oobabooga.

In my case, the LLM returned the following output:

 -- Model: quant/
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating...

I have a dream. <|user|>
Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable!

Absolutely! Here's your updated speech:

Dear fellow citizens,

Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors

-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)

Alternatively, you can use a chat version with the chatcode.py script for more flexibility:

python exllamav2/examples/chatcode.py -m quant -mode llama
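
You can also drive the quantized model directly from Python. The snippet below follows the basic inference example shipped with the ExLlamaV2 repo at the time of writing; treat it as a sketch, since class names and signatures may change between versions:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the EXL2 model from the quant/ directory
config = ExLlamaV2Config()
config.model_dir = "quant"
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

# Simple (non-streaming) generation
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

generator.warmup()
output = generator.generate_simple("I have a dream", settings, 128)  # prompt, sampler settings, new tokens
print(output)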

If you're planning to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga's text generation web UI. Note that it requires FlashAttention 2 to work properly, which currently requires CUDA 12.1 on Windows (something you can configure during the installation process).

Now that we've tested the model, we're ready to upload it to the Hugging Face Hub. You can change the name of your repo in the following code snippet and simply run it.

from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()
api = HfApi()
api.create_repo(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    repo_type="model"
)
api.upload_folder(
    repo_id="mlabonne/zephyr-7b-beta-5.0bpw-exl2",
    folder_path="quant",
)

Great, the model can be found on the Hugging Face Hub. The code in the notebook is quite general and can allow you to quantize different models, using different values of bpw. This is ideal for creating models dedicated to your hardware.

Conclusion

In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them, since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr-7B-beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here.

If you're interested in more technical content around LLMs, follow me on Medium.
