Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)
1. HuggingFace

Exploring Pre-Quantized Large Language Models

Towards Data Science

Over the last year, we have seen the Wild West of Large Language Models (LLMs). The pace at which new techniques and models were released was astounding! As a result, we have many different standards and ways of working with LLMs.

In this article, we will explore one such topic, namely loading your local LLM through several (quantization) standards. With sharding, quantization, and different saving and compression strategies, it is not easy to know which method is suitable for you.

Throughout the examples, we’ll use Zephyr 7B, a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO).

🔥 TIP: After each example of loading an LLM, it is advised to restart your notebook to prevent OutOfMemory errors. Loading multiple LLMs requires significant RAM/VRAM. You can reset memory by deleting the models and emptying your cache like so:

# Delete any models previously created
del model, tokenizer, pipe

# Empty VRAM cache
import torch
torch.cuda.empty_cache()

You can also follow along with the Google Colab Notebook to make sure everything works as intended.

The most straightforward, and vanilla, way of loading your LLM is through 🤗 Transformers. HuggingFace has created a large suite of packages that allow us to do amazing things with LLMs!

We will start by installing 🤗 Transformers, amongst other packages, from its main branch to support newer models:

# Latest HF transformers version for Mistral-like models
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate bitsandbytes xformers

After installation, we can use the following pipeline to easily load our LLM:

from torch import bfloat16
from transformers import pipeline

# Load in your LLM with none compression tricks
pipe = pipeline(
"text-generation",
model="HuggingFaceH4/zephyr-7b-beta",
torch_dtype=bfloat16,
device_map="auto"
)
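Running inference then follows the standard pipeline workflow. As a minimal sketch (the chat messages and sampling parameters below are illustrative, not prescriptive), you can format a prompt with the tokenizer's chat template and generate a response:

# Format a prompt using the model's chat template (example messages are illustrative)
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate a response (sampling parameters chosen for illustration)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])

Since the model is loaded in bfloat16 (2 bytes per parameter) without any compression tricks, the weights alone take roughly 7B × 2 ≈ 14 GB of VRAM before any quantization.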
