Fine-tune a Mistral-7b model with Direct Preference Optimization
🥇 Preference datasets
🎓 Direct Preference Optimization
💾 Formatting the data
⚙️ Training the model with DPO

Boost the performance of your supervised fine-tuned models

Towards Data Science

Pre-trained Large Language Models (LLMs) can only perform next-token prediction, making them unable to answer questions. This is why these base models are then fine-tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, harmful, etc. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.

RLHF provides different answers to the LLM, which are ranked according to a desired behavior (helpfulness, toxicity, etc.). The model learns to output the best answer among these candidates, hence mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown in neural-chat-7b-v3-1.

In this article, we’ll create NeuralHermes-2.5 by fine-tuning OpenHermes-2.5 using an RLHF-like technique: Direct Preference Optimization (DPO). For this purpose, we’ll introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We’ll see that it significantly improves the performance of the base model on the Open LLM Leaderboard.

As usual, the code is available on GitHub and Google Colab.

Preference datasets are not standardized, but they typically consist of a collection of answers that are ranked by humans. This ranking is essential, as the RLHF process fine-tunes LLMs to output the preferred answer. Here is an example from Anthropic/hh-rlhf, a popular preference dataset:

Image by author

The structure of the dataset is straightforward: for each row, there is one chosen (preferred) answer and one rejected answer. The goal of RLHF is to guide the model to output the preferred answer.
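When a dataset stores a full ranking of answers rather than pairs, it can be flattened into this chosen/rejected layout. The snippet below is a minimal sketch under that assumption: ranked_to_pairs is a hypothetical helper (it is not part of Anthropic/hh-rlhf or of any library), and the prompt and answers are made up for illustration.

```python
from itertools import combinations


def ranked_to_pairs(prompt, ranked_answers):
    """Turn answers sorted best-first into chosen/rejected rows (hypothetical helper)."""
    rows = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_answers, 2):
        rows.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return rows


# Made-up example: three answers ranked from best to worst.
rows = ranked_to_pairs(
    "How do I boil an egg?",
    ["Simmer it for about 8 minutes.", "Use hot water.", "I don't know."],
)
for row in rows:
    print(repr(row["chosen"]), "is preferred over", repr(row["rejected"]))
```

A ranking of n answers yields n * (n - 1) / 2 rows, which is one reason pairwise datasets are a convenient common format.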

Preference datasets are notoriously costly and difficult to make, as they require collecting manual feedback from humans. This feedback is also subjective and can easily be biased toward confident (but wrong) answers or contradict itself (different annotators have different values). Over time, several solutions have been proposed to tackle these issues, such as replacing human feedback with AI feedback (RLAIF).

These datasets also tend to be much smaller than fine-tuning datasets. To illustrate this, the excellent neural-chat-7b-v3-1 (best 7B LLM on the Open LLM Leaderboard when it was released) uses 518k samples for fine-tuning (Open-Orca/SlimOrca) but only 12.9k samples for RLHF (Intel/orca_dpo_pairs). In this case, the authors generated answers with GPT-4/3.5 to create the preferred answers, and with Llama 2 13b chat to create the rejected responses. It’s a smart way to bypass human feedback and only rely on models with different levels of performance.

While the concept of RLHF has been used in robotics for a long time, it was popularized for LLMs in OpenAI’s paper Fine-Tuning Language Models from Human Preferences. In this paper, the authors present a framework where a reward model is trained to approximate human feedback. This reward model is then used to optimize the fine-tuned model’s policy using the Proximal Policy Optimization (PPO) algorithm.

Image by author

The core concept of PPO revolves around making smaller, incremental updates to the policy, as larger updates can lead to instability or suboptimal solutions. From experience, this technique is unfortunately still unstable (loss diverges), difficult to reproduce (numerous hyperparameters, sensitive to random seeds), and computationally expensive.
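To make the idea of small, incremental updates concrete, here is a minimal sketch of the clipped surrogate objective that PPO implementations commonly use. The function name and the clip_eps=0.2 default are illustrative assumptions, not settings used in this article, and a real implementation works on batches of tensors rather than single floats.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for a single sample (to be maximized).

    ratio     -- new_policy_prob / old_policy_prob for the sampled action
    advantage -- how much better the action was than expected
    clip_eps  -- illustrative clipping range (assumption)
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # far outside [1 - clip_eps, 1 + clip_eps] in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large policy change on a good action earns no more than the clipped gain.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
# A small policy change is left untouched.
print(ppo_clipped_objective(ratio=1.1, advantage=1.0))
```

The many moving parts around this objective (reward model, value estimates, sampling) are where the instability and hyperparameter sensitivity mentioned above tend to come from.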

This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is to make sure the trained model outputs higher probabilities for preferred answers than the reference model. Conversely, we also want it to output lower probabilities for rejected answers. This means we’re penalizing the LLM for bad answers and rewarding it for good ones.

Image by author

By using the LLM itself as a reward model and employing binary cross-entropy objectives, DPO efficiently aligns the model’s outputs with human preferences without the need for extensive sampling, reward model fitting, or intricate hyperparameter adjustments. This results in a more stable, more efficient, and computationally less demanding process.
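The classification view can be sketched in a few lines. The function below is an illustrative, single-example version of the DPO objective as described in the DPO paper, written with plain Python floats. The function and argument names are assumptions made for this example, beta is the scaling coefficient from the paper, and real implementations operate on batched tensors of summed token log probabilities.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Single-example DPO loss (illustrative sketch, not a library API)."""
    # How much more (or less) likely each answer is under the trained model
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary classification logit: positive when the trained model favors
    # the chosen answer more strongly than the reference model does.
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)) rewritten as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Before any training the policy equals the reference, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # about 0.693

# Raising the chosen answer and lowering the rejected one reduces the loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Only log probabilities from the two models are needed, which is why no separate reward model or sampling loop appears anywhere in the computation.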

In this example, we’ll fine-tune the excellent OpenHermes-2.5-Mistral-7B, which is a Mistral-7b model that was only supervised fine-tuned. To this end, we’ll use the Intel/orca_dpo_pairs dataset to align our model and improve its performance. We call this new model NeuralHermes-2.5-Mistral-7B.

The first step consists of installing the required libraries as follows.

pip install -q datasets trl peft bitsandbytes sentencepiece wandb

Once it’s done, we can import the libraries. I’m also using the secrets tab in Google Colab to store my Hugging Face token.

import os
import gc
import torch

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb
from google.colab import userdata
import wandb

# Defined within the secrets tab in Google Colab
hf_token = userdata.get('huggingface')
wb_token = userdata.get('wandb')

model_name = "teknium/OpenHermes-2.5-Mistral-7B"
new_model = "NeuralHermes-2.5-Mistral-7B"

OpenHermes-2.5-Mistral-7B uses a specific chat template, called ChatML. Here is an example of a conversation formatted with this template:

<|im_start|>system
You are a helpful chatbot assistant.<|im_end|>
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
Hi, how can I help you?<|im_end|>

As you can see, ChatML defines different roles (system, user, assistant) and appends special tokens (<|im_start|> and <|im_end|>) to separate them. Moreover, DPOTrainer also requires a specific format with three columns: prompt, chosen, and rejected.
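To make the template concrete, here is a hand-written approximation of ChatML rendering. to_chatml is a hypothetical helper, not the tokenizer’s real implementation: in practice the template ships with the tokenizer, so treat this purely as an illustration of the layout. Its add_generation_prompt flag mirrors the tokenizer argument of the same name and opens an assistant turn for the model to complete.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML text (illustrative)."""
    text = ""
    for message in messages:
        # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn so the model writes the answer.
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "system", "content": "You are a helpful chatbot assistant."},
    {"role": "user", "content": "Hi"},
]
print(to_chatml(conversation, add_generation_prompt=True))
```

In a DPO row, text like this plays the role of the prompt column, while the chosen and rejected columns only hold the answer followed by the closing <|im_end|> token.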

Our dataset contains four columns: system, question, chatgpt, and llama2-13b-chat. We’ll simply concatenate the system and question columns to the prompt column. We’ll also map the chatgpt column to “chosen” and llama2-13b-chat to “rejected”. To format the dataset in a reliable way, we’ll use the tokenizer’s apply_chat_template() function, which already uses ChatML.

def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Load dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset (the call is truncated in the source; dropping the
# saved original columns is inferred from the code above)
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)

Let’s print a sample of the formatted dataset to confirm that everything works as expected:


We can see that the prompt combines system and user instructions. Thanks to the add_generation_prompt=True argument, it also appends the beginning of the assistant’s answer. If you want to skip this step, you can directly use the preprocessed dataset available as mlabonne/chatml_dpo_pairs.

Next, we define the LoRA configuration to train the model. As described in Intel’s blog post, we set the rank value to be equal to lora_alpha, which is unusual (a common rule of thumb is lora_alpha = 2 * r). We also target all the linear modules with adapters.

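# LoRA configuration
peft_config = LoraConfig(
    # The rank, lora_alpha, and other arguments are elided in the source;
    # the text only states that the rank is set equal to lora_alpha.
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj'],
)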
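To get a feel for what targeting every linear module costs, the sketch below counts the trainable LoRA parameters and computes the scaling factor applied to the adapter output. It rests on two assumptions that are not stated in this article: the layer shapes are taken from the public Mistral-7B configuration (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 decoder blocks), and r = 16 is a hypothetical rank chosen only for illustration.

```python
# (in_features, out_features) of each targeted linear layer in one decoder block.
# Shapes are assumed from the public Mistral-7B config, not from this article.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_BLOCKS = 32  # assumed number of decoder blocks


def lora_trainable_params(r):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_block = sum(r * (fan_in + fan_out) for fan_in, fan_out in LAYER_SHAPES.values())
    return per_block * NUM_BLOCKS


def lora_scaling(r, lora_alpha):
    """The adapter output is multiplied by lora_alpha / r."""
    return lora_alpha / r


r = 16  # hypothetical rank, for illustration only
print(lora_trainable_params(r))            # 41943040
print(lora_scaling(r, lora_alpha=r))       # 1.0 (rank equal to lora_alpha)
print(lora_scaling(r, lora_alpha=2 * r))   # 2.0 (the usual rule of thumb)
```

Under these assumptions the adapters hold roughly 42 million trainable weights, well under 1% of the 7B base model, and setting lora_alpha equal to the rank simply leaves the adapter output unscaled.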

We’re now ready to load the model we want to fine-tune with DPO. In this case, two models are required: the model to fine-tune as well as the reference model. This is mostly for the sake of readability, as the DPOTrainer object automatically creates a reference model if none is provided.

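# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Precision and quantization arguments are elided in the source
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Precision and quantization arguments are elided in the source
)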

The final step consists of providing all the hyperparameters to TrainingArguments and DPOTrainer:

  • Among them, the beta parameter is unique to DPO since it controls the divergence from the initial policy (0.1 is a typical value for it).
  • Compared to the values described in Intel’s blog post, we lower the learning rate (from 5e-4 to 5e-5) and the number of steps (from 1,000 to 200). I manually optimized these values after a few runs to stabilize training and achieve the best results.

We can now start training the model. Note that it requires an A100 GPU and takes about 1 hour to complete the training.

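# Training arguments (only values stated in the text; the rest is elided in the source)
training_args = TrainingArguments(
    learning_rate=5e-5, max_steps=200, warmup_steps=100,
)

# Create DPO trainer (beta uses the typical value from the text;
# other arguments are elided in the source)
dpo_trainer = DPOTrainer(
    model, ref_model, args=training_args, train_dataset=dataset,
    tokenizer=tokenizer, peft_config=peft_config, beta=0.1,
)

# Fine-tune model with DPO
dpo_trainer.train()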

Our model is now fine-tuned. You can check the project on Weights & Biases at this address. Here are some interesting metrics to analyze:

Image by author

Interestingly, the training loss quickly drops to zero (before 50 steps), despite 100 warmup steps. Meanwhile, the other metrics keep evolving.
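To put the warmup in perspective, here is a minimal sketch of a linear warmup schedule. It assumes the learning rate ramps linearly from 0 to the 5e-5 peak over the 100 warmup steps and ignores any decay afterwards. The exact scheduler used in the run is not stated here, so warmup_lr is a hypothetical helper rather than the actual training schedule.

```python
def warmup_lr(step, peak_lr=5e-5, warmup_steps=100):
    """Learning rate under linear warmup (decay after warmup is ignored)."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


for step in (10, 50, 100, 200):
    print(step, warmup_lr(step))
```

Under this assumption the model is still training at half the peak learning rate or less when the loss reaches zero, which is what makes the fast drop notable.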

The train/rewards/chosen and train/rewards/rejected plots correspond to the mean difference between the log probabilities output by the trained and reference models. It makes sense that, over time, they diverge as our trained model learns the preferred answers. The train/rewards/margins plot also shows the difference between these two plots. Finally, the train/rewards/accuracies plot shows the frequency of choosing the preferred answer. The trained model quickly reaches a perfect accuracy score, which is a good sign but could also mean that the difference between preferred and rejected answers is too obvious.
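These four metrics can be reproduced by hand from per-example log probabilities. The sketch below is an illustration on made-up numbers, not the trainer’s code: reward_metrics and the dictionary keys are hypothetical, and the beta scaling of the log-probability differences follows my understanding of how TRL reports rewards.

```python
def reward_metrics(batch, beta=0.1):
    """Compute DPO reward metrics for a batch of log probabilities (illustrative)."""
    # Implicit reward: scaled gap between trained-model and reference log probs.
    chosen = [beta * (ex["policy_chosen"] - ex["ref_chosen"]) for ex in batch]
    rejected = [beta * (ex["policy_rejected"] - ex["ref_rejected"]) for ex in batch]
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(batch)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(margins) / n,
        # Fraction of examples where the chosen answer earns the higher reward.
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }


# Made-up log probabilities for two examples.
batch = [
    {"policy_chosen": -10.0, "ref_chosen": -12.0, "policy_rejected": -15.0, "ref_rejected": -12.0},
    {"policy_chosen": -9.0, "ref_chosen": -9.0, "policy_rejected": -8.0, "ref_rejected": -9.0},
]
metrics = reward_metrics(batch)
for name, value in metrics.items():
    print(name, round(value, 3))
```

An accuracy of 1.0 only says that every margin is positive. It does not say how large the margins are, which is why the margins plot keeps evolving after the accuracy saturates.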

Now that it’s trained, we can merge the adapter with the original model. Next, we save the merged model and the tokenizer before pushing it to the Hugging Face Hub.

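# Save artifacts (calls elided in the source; "final_checkpoint" is the path loaded below)
dpo_trainer.model.save_pretrained("final_checkpoint")
tokenizer.save_pretrained("final_checkpoint")

# Flush memory
del dpo_trainer, model, ref_model
gc.collect()
torch.cuda.empty_cache()

# Reload model in FP16 (instead of NF4)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    # Any further arguments are elided in the source
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, "final_checkpoint")
model = model.merge_and_unload()

# Save model and tokenizer (calls elided in the source; the local
# directory name is inferred from the push calls below)
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

# Push them to the HF Hub
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)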
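Conceptually, merging a LoRA adapter folds the low-rank update back into the frozen weight, so the merged model no longer needs the adapter at inference time. The sketch below shows that arithmetic on a tiny 2x2 example with plain Python lists. matmul and merge_lora are hypothetical helpers written for illustration, and the real PEFT implementation additionally deals with dtypes, devices, and every adapted layer.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, r, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A, the merged weight (illustrative)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out x in)
A = [[1.0, 2.0]]              # LoRA A matrix (r x in), with r = 1
B = [[0.5], [0.0]]            # LoRA B matrix (out x r)

merged = merge_lora(W, A, B, r=1, lora_alpha=1)
print(merged)  # [[1.5, 1.0], [0.0, 1.0]]
```

Because the update is added directly to the weight, the base model should be reloaded in a regular floating-point precision before merging, which matches the FP16 reload step above.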

Let’s see how our model performs in a real test. We’ll format the prompt to ask a basic question: “What is a Large Language Model?”

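# Format prompt
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"}
]
tokenizer = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline (arguments elided in the source; inferred from context)
pipeline = transformers.pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer
)

# Generate text (sampling arguments are elided in the source)
sequences = pipeline(prompt)
print(sequences[0]['generated_text'])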

Here’s the answer from the model:

A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data. These models are designed to understand and generate human language, allowing them to perform various natural language processing tasks, such as text generation, language translation, and question answering. Large language models typically use deep learning techniques, like recurrent neural networks (RNNs) or transformers, to learn patterns and relationships in the data, enabling them to generate coherent and contextually relevant responses. The size of these models, in terms of the number of parameters and the volume of data they are trained on, plays a significant role in their ability to comprehend and produce complex language structures.

Everything seems to be working, so we can now evaluate the merged model. As this is a general-purpose model, we can leverage the lm-evaluation-harness to evaluate it. As the process is quite resource-intensive, we can also directly submit it for evaluation on the Open LLM Leaderboard. It took a few days, but here are the results compared to other OpenHermes models:

Image by author

Compared to the original model, the NeuralHermes-2.5-Mistral-7B model improved the average score by 6.7 points (particularly on GSM8K). This is an unexpectedly large improvement, which showcases the power of Direct Preference Optimization.

In this article, we fine-tuned an already supervised fine-tuned model using DPO and created our own NeuralHermes-2.5 model. By leveraging a high-quality preference dataset, we created a sample-efficient fine-tuning pipeline that produced a significant improvement on the Open LLM Leaderboard. If you want to give it a try, you can find quantized variants of this model or use this Hugging Face Space.

Note that our fine-tuning pipeline can still be improved in different ways. For example, the preference dataset is still quite raw and could be improved with more filtering and by using different models. In addition, numerous hyperparameters can still be tweaked to achieve better results. In particular, the learning rate can still be lowered to train the model on more steps and inject more preference data.

