Fine-tune Google Gemma with Unsloth and Distilled DPO on Your Computer
A Closer Look at Zephyr Gemma

Following Hugging Face’s Zephyr recipe

Image generated with DALL-E

Finding good training hyperparameters for new LLMs is always difficult and time-consuming. With Zephyr Gemma 7B, Hugging Face seems to have found a good recipe for fine-tuning Gemma. They used a combination of distilled supervised fine-tuning and DPO, similar to what they did for their original Zephyr based on Mistral 7B. However, training Gemma with DPO on consumer hardware is challenging because of its memory consumption.

In this article, I first review the recipe used by Hugging Face to train Zephyr Gemma 7B. Then, I show how to use this recipe with Unsloth, a framework implementing various optimizations for fast and memory-efficient training. The method presented in this article has a peak memory consumption of 19 GB of VRAM and a total training time of only 8 hours. In other words, DPO training for Gemma is possible on consumer hardware.
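To make this concrete, here is a minimal sketch of how Gemma 7B could be loaded in 4-bit with Unsloth. The model name, sequence length, and quantization choice below are my assumptions for a consumer GPU, not necessarily the exact settings used later in the article.

```python
# Minimal sketch: load Gemma 7B in 4-bit with Unsloth (assumed settings).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-7b",  # assumed base checkpoint
    max_seq_length=2048,           # assumed; adjust to your VRAM budget
    dtype=None,                    # let Unsloth pick (bfloat16 on recent GPUs)
    load_in_4bit=True,             # 4-bit quantization keeps peak VRAM low
)
```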

Supervised Fine-tuning (SFT)

DPO requires, as a reference, a model trained with supervised fine-tuning (SFT) on an instruction dataset. Hugging Face also released the SFT model they used as this reference.

For SFT, they used deita-10k, a small instruction dataset of 9.5k examples.

A wide range of LLMs generated the examples in this dataset (GPT-4, GPT-3.5, Claude, Vicuna, Llama 2, Mistral 7B, Zephyr, etc.). For SFT training, they used a special data format that we will also use.
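As an illustration of what that format looks like in practice, the sketch below renders each list of chat messages into a single training string with the tokenizer's chat template. The dataset name, split, and column are assumptions based on the public deita-10k release, and `tokenizer` is the one returned by the Unsloth loading step above.

```python
# Hedged sketch: turn chat-style examples into flat SFT training text.
from datasets import load_dataset

# Assumed dataset/split names; the "messages" column holds lists of
# {"role": ..., "content": ...} dicts.
dataset = load_dataset("HuggingFaceH4/deita-10k-v0-sft", split="train_sft")

def to_text(example):
    # Requires the tokenizer to define a chat template (e.g., Zephyr's);
    # otherwise set tokenizer.chat_template before calling this.
    example["text"] = tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False
    )
    return example

dataset = dataset.map(to_text)
```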

Hugging Face used the hyperparameters referenced in this configuration file from their alignment handbook. They didn’t use LoRA or quantization, which implies that they probably used many A100/H100 GPUs to train Zephyr Gemma. Note: in the model card, they wrote “16 devices” but don’t say what these devices are.

To run this recipe on consumer hardware, we are going to use LoRA and quantization, i.e., QLoRA. I’ll detail the LoRA configuration in the following section.
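Before getting to the exact values, here is a minimal sketch, assuming common QLoRA defaults, of what such a configuration looks like with Unsloth’s LoRA helper; the rank, alpha, and target modules below are placeholders, not the final configuration.

```python
# Hedged sketch: attach LoRA adapters to the 4-bit model loaded earlier.
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,                            # 4-bit Gemma model from the loading step
    r=16,                             # LoRA rank (assumed)
    lora_alpha=16,                    # LoRA scaling (assumed)
    lora_dropout=0.0,                 # Unsloth is fastest with dropout disabled
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,  # trades compute for lower peak VRAM
)
```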
