QA-LoRA: Fine-Tune a Quantized Large Language Model on Your GPU

Quantization-aware fine-tuning

Illustration by the author, made with images from Pixabay (1, 2)

State-of-the-art large language models (LLMs) are pre-trained with billions of parameters. While pre-trained LLMs can perform many tasks, they become much better once fine-tuned.

Thanks to LoRA, fine-tuning costs can be dramatically reduced. LoRA adds low-rank tensors, i.e., a small number of parameters (millions), on top of the frozen original parameters. Only the parameters in the added tensors are trained during fine-tuning.
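To make this concrete, here is a minimal sketch of what adding LoRA tensors looks like with Hugging Face's peft library. The model name, rank, and target modules below are illustrative choices, not settings prescribed by QA-LoRA:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model; its original weights will stay frozen.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# LoRA configuration: small low-rank tensors are added on top of the
# frozen weights of the targeted modules (values here are illustrative).
lora_config = LoraConfig(
    r=16,                               # rank of the added tensors
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Only the LoRA parameters are trainable: typically a few million,
# i.e., well under 1% of the total parameter count.
model.print_trainable_parameters()
```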

LoRA still requires the full model to be loaded in memory. To reduce the memory cost and speed up fine-tuning, a new approach proposes quantization-aware LoRA (QA-LoRA) fine-tuning.

In this article, I explain QA-LoRA and review its performance compared with previous work (especially QLoRA). I also show how to use QA-LoRA to fine-tune your own quantization-aware LoRA for Llama 2.

What’s Wrong with QLoRA?

Fine-tuning a LoRA on top of a quantized LLM is something that can already be done with QLoRA. In my previous articles, I used it many times to fine-tune LLMs, for instance, Llama 2 and GPT-NeoX, on my desktop computer or using the free instance of Google Colab.

Before delving into QA-LoRA, it is interesting to understand the current limits of QLoRA.

The NormalFloat4 (NF4) Quantization

LLM quantization algorithms usually quantize parameters to 4-bit precision using the INT4 data type. Computation with this data type is more and more optimized in recent GPUs.
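As a toy illustration of the principle (not the kernel any library actually uses), symmetric round-to-nearest INT4 quantization maps each weight to one of 16 integer levels plus a floating-point scale:

```python
import torch

def quantize_int4_symmetric(weights: torch.Tensor):
    """Toy symmetric round-to-nearest quantization to 4-bit integers.

    Real quantizers work per group or per channel and pack the values,
    but the principle is the same: 16 integer levels plus a float scale.
    """
    # INT4 covers the integers -8..7.
    scale = weights.abs().max() / 7.0
    q = torch.clamp(torch.round(weights / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(8)
q, scale = quantize_int4_symmetric(w)
print(w)
print(dequantize(q, scale))  # approximation of the original weights
```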

QLoRA doesn’t use INT4 by default but another data type called NormalFloat4 (NF4). You can see it as a compressed float number. According to the authors of QLoRA, NF4 is superior to INT4: LLMs quantized with NF4 achieve a lower perplexity.
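With Hugging Face transformers and bitsandbytes, NF4 is the quantization applied when a model is loaded in 4-bit for QLoRA-style fine-tuning. A minimal sketch (the model name and compute dtype are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style loading: the base model is quantized to NF4 on the fly.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4; the alternative is "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,     # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```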

However, NF4 computation is not optimal for fast inference. This is one of the reasons why…
