Meet LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

The introduction of Pre-trained Language Models (PLMs) has marked a transformative shift in the field of Natural Language Processing. They have demonstrated exceptional proficiency across a wide range of language tasks, including Natural Language Understanding (NLU) and Natural Language Generation (NLG). These models typically contain hundreds of millions or even billions of parameters, and their considerable computational and memory requirements present significant deployment challenges, as the research community has acknowledged.

In this paper, the authors introduce a novel quantization framework called LoRA-Fine-Tuning-aware Quantization (LoftQ). The framework is specifically tailored for pre-trained models that require both quantization and LoRA fine-tuning. It combines low-rank approximation with quantization to jointly approximate the original high-precision pre-trained weights.
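Concretely, the joint approximation can be understood as alternating between two steps: quantize the weights minus the current low-rank correction, then refit the low-rank factors to the quantization residual via a truncated SVD. The following is a minimal NumPy sketch of that idea; the function names, the simple global uniform quantizer, and the hyper-parameters are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def uniform_quantize(W, bits=4):
    # Simple global uniform quantizer: map values in [-m, m] onto 2^N
    # levels, where m is the stored max absolute value, then dequantize.
    m = np.abs(W).max()
    levels = 2 ** bits
    codes = np.round((W + m) / (2 * m) * (levels - 1))  # integers 0..2^N-1
    return codes / (levels - 1) * 2 * m - m

def loftq_init(W, rank=8, bits=4, steps=5):
    # LoftQ-style alternating optimization (a sketch): find a quantized
    # matrix Q and low-rank factors A, B so that Q + A @ B ≈ W.
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((rank, W.shape[1]))
    for _ in range(steps):
        Q = uniform_quantize(W - A @ B, bits)            # quantize the residual
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                       # rank-r SVD truncation
        B = Vt[:rank]
    return Q, A, B
```

The returned Q would be stored in low precision, while A and B serve as the LoRA adapter initialization, so fine-tuning starts closer to the original full-precision weights than plain quantization would allow.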

The figure above shows QLoRA performance at different bit widths. Left: QLoRA initialization of LLAMA-2-13b on WikiText-2. Right: QLoRA applied to LLAMA-2-13b on the WikiText-2 language modelling task. Lower perplexity indicates better performance.

Quantization Methods. The authors apply two quantization methods to show that LoftQ is compatible with different quantization functions:

• Uniform quantization is a classic quantization method. It uniformly divides a continuous interval into 2^N categories and stores a local maximum absolute value for dequantization.

• NF4 and its 2-bit variant NF2 are quantization methods utilized in QLoRA. They assume that the high-precision values are drawn from a Gaussian distribution and map these values to discrete slots which have equal probability.
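The uniform method above can be sketched in a few lines of NumPy. This is a block-wise variant, where one absolute maximum is stored per block as the local scale for dequantization; the function name, block size, and padding scheme are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def uniform_quantize_blockwise(x, bits=4, block=64):
    # Uniform quantization sketch: split the tensor into blocks, divide
    # each block's range into 2^N levels, and store one absmax per block
    # as the local scale used for dequantization.
    flat = x.ravel()
    pad = (-flat.size) % block                   # zero-pad to a whole number of blocks
    flat = np.concatenate([flat, np.zeros(pad)])
    blocks = flat.reshape(-1, block)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)  # stored per block
    absmax[absmax == 0] = 1.0                    # avoid division by zero
    levels = 2 ** bits
    codes = np.round((blocks / absmax + 1) / 2 * (levels - 1))  # integers 0..2^N-1
    deq = (codes / (levels - 1) * 2 - 1) * absmax               # dequantize with absmax
    return deq.ravel()[: x.size].reshape(x.shape)
```

NF4 differs in that its 2^N code values are not evenly spaced: they are chosen so each bin holds equal probability mass under a Gaussian, which suits normally distributed pre-trained weights better than a uniform grid.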

We perform 2-bit and 4-bit quantization on all models, achieving compression ratios of 25-30% and 15-20% on the 4-bit and 2-bit levels, respectively. All of the experiments are conducted on NVIDIA A100 GPUs.

The evaluation of the quantization framework is carried out through extensive experiments on various downstream tasks, including NLU, question answering, summarization, and NLG. The results show that LoftQ consistently surpasses QLoRA across all precision levels. For instance, with 4-bit quantization, the authors attain gains of 1.1 and 0.8 in Rouge-1 on XSum and CNN/DailyMail, respectively. As the field of NLP continues to advance, further innovations and optimizations are expected to bridge the gap between the immense potential of PLMs and their practical deployment, benefiting a wide range of applications and users.

Check out the Paper. All credit for this research goes to the researchers on this project. Also, don't forget to join our 31k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.

We're also on WhatsApp. Join our AI Channel on WhatsApp.

Janhavi Lande is an Engineering Physics graduate from IIT Guwahati, class of 2023. She is an upcoming data scientist and has been working in the world of ML/AI research for the past two years. She is most fascinated by this ever-changing world and its constant demand for humans to keep up with it. In her spare time she enjoys traveling, reading, and writing poems.


