Mistral 7B aligned with IPO
To turn them into chat models, pre-trained large language models (LLMs) are fine-tuned on large datasets of instructions/questions paired with expected answers. While this simple fine-tuning yields convincing chat models, their answers can still be incoherent, biased, unethical, and unsafe from a human perspective. This is why we often perform an extra training step to better align the LLM with humans.
This alignment can be done with reinforcement learning from human feedback (RLHF). As demonstrated by OpenAI and the success of ChatGPT, RLHF can yield state-of-the-art chat models. However, RLHF is expensive to run. It requires large datasets annotated by humans and the training of several auxiliary models (reference and reward models).
As a simpler and cheaper alternative to RLHF, direct preference optimization (DPO) has recently been applied with success to align LLMs, such as Hugging Face’s Zephyr and Intel’s Neural Chat.
In this article, based on a paper by Google DeepMind, we will see that, while RLHF and DPO perform well at aligning LLMs, they are far from optimal given the datasets used for training. DeepMind also demonstrates why DPO is prone to overfitting. I’ll explain, in plain English, how the alternative proposed by DeepMind, the identity policy optimization (IPO) objective, is simpler and better designed to learn from the training data than RLHF and DPO.
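For reference, here is my transcription of the two objectives in the paper’s notation (π_θ is the model being aligned, π_ref the frozen reference model, y_w and y_l the preferred and rejected answers, β and τ the regularization strengths):

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

$$\mathcal{L}_{\mathrm{IPO}}(\theta) = \mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\left[\left(\log \frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\tau}\right)^{2}\right]$$

Where DPO pushes the preference log-ratio as far as its logistic loss allows, IPO regresses that same log-ratio toward the fixed target 1/(2τ). Keeping the target bounded is the intuition behind IPO’s resistance to overfitting, which I’ll come back to later.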
In the next sections, I show how to use IPO, following a training recipe close to the one used by Hugging Face to train the Zephyr models.
I have also implemented a notebook demonstrating IPO training for Mistral 7B. You can find it here:
Get the notebook (#31)
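To make the recipe concrete before we dive in, here is a minimal sketch of IPO training with TRL’s DPOTrainer, which exposes the IPO objective through its loss_type option. The model name, dataset, and hyperparameters below are illustrative assumptions rather than the exact settings of the notebook, it assumes a recent TRL version where DPOConfig carries beta and loss_type, and it omits the quantization/LoRA tricks usually needed to fit a 7B model on a single GPU:

```python
# Minimal sketch of IPO training with Hugging Face TRL (assumed recent version
# where DPOConfig exposes `loss_type` and `beta`). Settings are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Preference data: the UltraFeedback set used for Zephyr. Each example stores
# the conversation as a list of messages; keep the last (assistant) turn as text.
raw = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

def to_text(example):
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

train_dataset = raw.map(to_text, remove_columns=raw.column_names)

training_args = DPOConfig(
    output_dir="./mistral-7b-ipo",
    loss_type="ipo",                  # switch the DPO objective to the IPO loss
    beta=0.1,                         # plays the role of tau in the IPO paper
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    max_length=1024,
    max_prompt_length=512,
)

trainer = DPOTrainer(
    model=model,                      # the reference model is created internally
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,       # older TRL versions use `tokenizer=` instead
)
trainer.train()
```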
The paper by DeepMind describing IPO is on arXiv:
A General Theoretical Paradigm to Understand Learning from Human Preferences
RLHF and DPO are trained on similar datasets: prompts paired with at least two possible answers rated by humans (or LLMs). The answers are paired so that, in a…