In the previous article of this series, we saw how we could build practical LLM-powered applications by integrating prompt engineering into our Python code. For the vast majority of LLM use cases, this is the initial approach I recommend, since it requires significantly fewer resources and less technical expertise than other methods while still providing much of the upside.
However, there are situations where prompting an existing LLM out-of-the-box doesn't cut it, and a more sophisticated solution is required. This is where model fine-tuning can help.
Fine-tuning is taking a pre-trained model and training at least one internal model parameter (i.e. weights). In the context of LLMs, what this typically accomplishes is transforming a general-purpose base model (e.g. GPT-3) into a specialized model for a particular use case (e.g. ChatGPT) [1].
The key upside of this approach is that models can achieve better performance while requiring (far) fewer manually labeled examples compared to models that rely solely on supervised training.
While strictly self-supervised base models can exhibit impressive performance on a wide variety of tasks with the help of prompt engineering [2], they are still word predictors and may generate completions that are not entirely helpful or accurate. For example, let's compare the completions of davinci (the base GPT-3 model) and text-davinci-003 (a fine-tuned model).
Notice the base model simply tries to complete the text by listing a set of questions like a Google search or homework assignment, while the fine-tuned model gives a more helpful response. The flavor of fine-tuning used for text-davinci-003 is alignment tuning, which aims to make the LLM's responses more helpful, honest, and harmless, but more on that later [3,4].
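For reference, completions like the ones compared above could be generated with the legacy openai Python SDK (v0.x); both models have since been deprecated by OpenAI, and the prompt below is just an illustrative placeholder, so treat this as a sketch rather than a working recipe.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = "Tell me how to fine-tune a language model."  # illustrative prompt
for model_name in ["davinci", "text-davinci-003"]:
    # legacy Completions endpoint (openai-python v0.x)
    response = openai.Completion.create(
        model=model_name,
        prompt=prompt,
        max_tokens=64,
    )
    print(model_name, "->", response["choices"][0]["text"])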
Fine-tuning not only improves the performance of a base model, but a smaller (fine-tuned) model can often outperform larger (more expensive) models on the set of tasks on which it was trained [4]. This was demonstrated by OpenAI with their first-generation "InstructGPT" models, where the 1.3B parameter InstructGPT model's completions were preferred over those of the 175B parameter GPT-3 base model despite being 100x smaller [4].
Although most of the LLMs we interact with these days are not strictly self-supervised models like GPT-3, there are still drawbacks to prompting an existing fine-tuned model for a specific use case.
A big one is that LLMs have a finite context window. Thus, the model may perform sub-optimally on tasks that require a large knowledge base or domain-specific information [1]. Fine-tuned models can avoid this issue by "learning" this information during the fine-tuning process. This also removes the need to jam-pack prompts with additional context and thus can lead to lower inference costs.
There are 3 generic ways one can fine-tune a model: self-supervised, supervised, and reinforcement learning. These are not mutually exclusive, in that any combination of these three approaches can be used in succession to fine-tune a single model.
Self-supervised Learning
Self-supervised learning consists of training a model based on the inherent structure of the training data. In the context of LLMs, what this typically looks like is: given a sequence of words (or tokens, to be more precise), predict the next word (token).
While this is how many pre-trained language models are developed these days, it can also be used for model fine-tuning. A possible use case is developing a model that can mimic a person's writing style given a set of example texts.
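As a rough sketch of what that could look like with the Hugging Face libraries used later in this article (the file my_writing.txt, the GPT-2 base model, and the hyperparameters are illustrative assumptions, not a prescription):
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
style_model = AutoModelForCausalLM.from_pretrained("gpt2")

# a plain text file of example writing (hypothetical file name)
dataset = load_dataset("text", data_files={"train": "my_writing.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# mlm=False -> next-token (causal) objective, i.e. self-supervised labels
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=style_model,
    args=TrainingArguments(output_dir="gpt2-my-style",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()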
Supervised Learning
The next, and perhaps most popular, way to fine-tune a model is via supervised learning. This involves training a model on input-output pairs for a particular task. An example is instruction tuning, which aims to improve model performance in answering questions or responding to user prompts [1,3].
The key step in supervised learning is curating a training dataset. A straightforward way to do this is to create question-answer pairs and integrate them into a prompt template [1,3]. For example, the question-answer pair "Who was the 35th President of the United States?" and "John F. Kennedy" could be pasted into the prompt template below. More example prompt templates are available in section A.2.1 of ref [4].
"""Please answer the next query.Q: {Query}
A: {Answer}"""
Using a prompt template is important because base models like GPT-3 are essentially "document completers". Meaning, given some text, the model generates more text that (statistically) makes sense in that context. This goes back to the previous blog of this series and the idea of "tricking" a language model into solving your problem via prompt engineering.
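As a quick illustration (the variable names here are mine, not from any particular library), pasting question-answer pairs into the template might look like this:
# paste question-answer pairs into the prompt template above
prompt_template = """Please answer the following question.
Q: {question}
A: {answer}"""

qa_pairs = [
    {"question": "Who was the 35th President of the United States?",
     "answer": "John F. Kennedy"},
]

# each formatted string becomes one training example for supervised fine-tuning
training_texts = [prompt_template.format(**pair) for pair in qa_pairs]
print(training_texts[0])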
Reinforcement Learning
Finally, one can use reinforcement learning (RL) to fine-tune models. RL uses a reward model to guide the training of the base model. This can take many different forms, but the basic idea is to train the reward model to score language model completions such that they reflect the preferences of human labelers [3,4]. The reward model can then be combined with a reinforcement learning algorithm (e.g. Proximal Policy Optimization (PPO)) to fine-tune the pre-trained model.
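To make the reward-model idea a bit more concrete, here is a minimal sketch of a single training step on one preference pair, using a generic Hugging Face sequence-classification model with a single output score; the model choice, prompt, and completions are illustrative assumptions, not OpenAI's actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# a small model with a single-scalar "reward" head (illustrative choice)
reward_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1)

prompt = "Explain what fine-tuning is."
chosen = prompt + " Fine-tuning means further training a pre-trained model."  # preferred by labeler
rejected = prompt + " Explain. Explain. Explain."                             # not preferred

def score(text):
    inputs = reward_tokenizer(text, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits[0, 0]

# pairwise (Bradley-Terry style) loss: push the chosen completion's score
# above the rejected completion's score
loss = -torch.nn.functional.logsigmoid(score(chosen) - score(rejected))
loss.backward()  # a real training loop would now take an optimizer step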
An example of how RL can be used for model fine-tuning is demonstrated by OpenAI's InstructGPT models, which were developed through 3 key steps [4].
- Generate high-quality prompt-response pairs and fine-tune a pre-trained model using supervised learning. (~13k training prompts) Note: One can (alternatively) skip to step 2 with the pre-trained model [3].
- Use the fine-tuned model to generate completions and have human labelers rank responses based on their preferences. Use these preferences to train the reward model. (~33k training prompts)
- Use the reward model and an RL algorithm (e.g. PPO) to fine-tune the model further. (~31k training prompts)
While the strategy above does generally result in LLM completions that are significantly preferable to those of the base model, it can also come at the cost of lower performance on a subset of tasks. This drop in performance is also known as an alignment tax [3,4].
As we saw above, there are many ways in which one can fine-tune an existing language model. However, for the remainder of this article, we will focus on fine-tuning via supervised learning. Below is a high-level procedure for supervised model fine-tuning [1].
- Select fine-tuning task (e.g. summarization, question answering, text classification)
- Prepare training dataset, i.e. create (100–10k) input-output pairs and preprocess data (i.e. tokenize, truncate, and pad text).
- Select a base model (experiment with different models and choose one that performs best on the desired task).
- Fine-tune model via supervised learning
- Evaluate model performance
While each of these steps could be an article of its own, I want to focus on step 4 and discuss how we can go about training the fine-tuned model.
When it comes to fine-tuning a model with ~100M-100B parameters, one must be mindful of computational costs. Toward this end, an important question is: which parameters do we (re)train?
With the mountain of parameters at play, we have countless choices for which ones to train. Here, I will focus on three generic options to choose from.
Option 1: Retrain all parameters
The first option is to train all internal model parameters (called full parameter tuning) [3]. While this option is simple (conceptually), it is the most computationally expensive. Moreover, a known issue with full parameter tuning is the phenomenon of catastrophic forgetting. This is where the model "forgets" useful information it "learned" in its initial training [3].
One way to mitigate the downsides of Option 1 is to freeze a large portion of the model parameters, which brings us to Option 2.
Option 2: Transfer Learning
The big idea with transfer learning (TL) is to preserve the useful representations/features the model has learned from past training when applying the model to a new task. This generally consists of dropping "the head" of a neural network (NN) and replacing it with a new one (e.g. adding new layers with randomized weights). Note: The head of an NN includes its final layers, which translate the model's internal representations into output values.
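For instance, here is a minimal sketch of that idea with the Hugging Face transformers library (the attribute name tl_model.distilbert is specific to DistilBERT; other architectures name their base module differently):
from transformers import AutoModelForSequenceClassification

# loading a base model for a new task automatically attaches a fresh,
# randomly initialized classification head
tl_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# freeze the pre-trained "body" so only the new head gets trained
for param in tl_model.distilbert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in tl_model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # only the head's parameters remain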
While leaving the majority of parameters untouched mitigates the huge computational cost of training an LLM, TL may not necessarily resolve the problem of catastrophic forgetting. To better handle both of these issues, we can turn to a different set of approaches.
Option 3: Parameter Efficient Fine-tuning (PEFT)
PEFT involves augmenting a base model with a relatively small number of trainable parameters. The key result is a fine-tuning methodology that demonstrates comparable performance to full parameter tuning at a tiny fraction of the computational and storage cost [5].
PEFT encapsulates a family of techniques, one of which is the popular LoRA (Low-Rank Adaptation) method [6]. The basic idea behind LoRA is to pick a subset of layers in an existing model and modify their weights according to the following equation:

h(x) = W₀x + ΔWx = W₀x + BAx

Where h() = a hidden layer that will be tuned, x = the input to h(), W₀ = the original weight matrix for h, and ΔW = a matrix of trainable parameters injected into h. ΔW is decomposed according to ΔW = BA, where ΔW is a d by k matrix, B is d by r, and A is r by k. Here, r is the assumed "intrinsic rank" of ΔW (which can be as small as 1 or 2) [6].
Sorry for all the math, but the key point is that the (d * k) weights in W₀ are frozen and, thus, not included in the optimization. Instead, the ((d * r) + (r * k)) weights making up matrices B and A are the only ones that are trained.
Plugging in some made-up numbers for d=1000, k=1000, and r=2 to get a sense of the efficiency gains, the number of trainable parameters drops from 1,000,000 to 4,000 in that layer. In practice, the authors of the LoRA paper cited a 10,000x reduction in parameter checkpoint size using LoRA to fine-tune GPT-3 compared to full parameter tuning [6].
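To make those numbers tangible, here is a toy LoRA layer in plain PyTorch with d=1000, k=1000, and r=2; this is an illustrative sketch of the equation above, not how the peft library implements it internally.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r=2):
        super().__init__()
        # frozen original weight matrix W0 (d x k)
        self.W0 = nn.Linear(k, d, bias=False)
        self.W0.weight.requires_grad = False
        # trainable low-rank factors: B (d x r) and A (r x k)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))  # zero init so ΔW starts at 0

    def forward(self, x):
        # h(x) = W0 x + B A x
        return self.W0(x) + x @ (self.B @ self.A).T

layer = LoRALinear(d=1000, k=1000, r=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4,000 trainable parameters instead of 1,000,000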
With that, let's see how we can use LoRA to fine-tune a language model efficiently enough to run on a laptop.
In this example, we will use the Hugging Face ecosystem to fine-tune a language model to classify text as 'positive' or 'negative'. Here, we fine-tune distilbert-base-uncased, a ~70M parameter model based on BERT. Since this base model was trained to do language modeling and not classification, we employ transfer learning to replace the base model's head with a classification head. Additionally, we use LoRA to fine-tune the model efficiently enough that it can run on my Mac Mini (M1 chip with 16GB memory) in a reasonable amount of time (~20 min).
The code, along with the conda environment files, is available in the GitHub repository. The final model and dataset [7] are available on Hugging Face.
Imports
We start by importing helpful libraries and modules. Datasets, transformers, peft, and evaluate are all libraries from Hugging Face (HF).
from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np
Base model
Next, we load in our base model. The base model here is a relatively small one, but there are several other (larger) ones that we could have used (e.g. roberta-base, llama2, gpt2). A full list is available here.
model_checkpoint = 'distilbert-base-uncased'

# define label maps
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)
Load data
We can then load our training and validation data from HF's datasets library. This is a dataset of 2000 movie reviews (1000 for training and 1000 for validation) with binary labels indicating whether the review is positive (or not).
# load dataset
dataset = load_dataset("shawhin/imdb-truncated")
dataset

# dataset =
# DatasetDict({
#     train: Dataset({
#         features: ['label', 'text'],
#         num_rows: 1000
#     })
#     validation: Dataset({
#         features: ['label', 'text'],
#         num_rows: 1000
#     })
# })
Preprocess data
Next, we need to preprocess our data so that it can be used for training. This consists of using a tokenizer to convert the text into an integer representation understood by the base model.
# create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)
To apply the tokenizer to the dataset, we use the .map() method. This takes in a custom function that specifies how the text should be preprocessed. In this case, that function is called tokenize_function(). In addition to translating text to integers, this function truncates integer sequences so that they are no longer than 512 tokens, to conform to the base model's max input length.
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset
# tokenized_dataset =
# DatasetDict({
#     train: Dataset({
#         features: ['label', 'text', 'input_ids', 'attention_mask'],
#         num_rows: 1000
#     })
#     validation: Dataset({
#         features: ['label', 'text', 'input_ids', 'attention_mask'],
#         num_rows: 1000
#     })
# })
At this point, we can also create a data collator, which will dynamically pad examples in each batch during training so that they all have the same length. This is more computationally efficient than padding all examples to be of equal length across the entire dataset.
# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Evaluation metrics
We can define how we want to evaluate our fine-tuned model via a custom function. Here, we define the compute_metrics() function to compute the model's accuracy.
# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")

# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}
Untrained model performance
Before training our model, we can evaluate how the base model with a randomly initialized classification head performs on some example inputs.
# define list of examples
text_list = ["It was good.", "Not a fan, don't recommed.",
             "Better than the first one.", "This is not worth watching even once.",
             "This one is a pass."]

print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)

    print(text + " - " + id2label[predictions.tolist()])
# Output:
# Untrained model predictions:
# ----------------------------
# It was good. - Negative
# Not a fan, don't recommed. - Negative
# Better than the first one. - Negative
# This is not worth watching even once. - Negative
# This one is a pass. - Negative
As expected, the model performance is equivalent to random guessing. Let's see how we can improve this with fine-tuning.
Fine-tuning with LoRA
To use LoRA for fine-tuning, we first need a config. This sets all the parameters for the LoRA algorithm. See the comments in the code block for more details.
peft_config = LoraConfig(task_type="SEQ_CLS", # sequence classification
                         r=4, # intrinsic rank of trainable weight matrix
                         lora_alpha=32, # this is like a learning rate
                         lora_dropout=0.01, # probability of dropout
                         target_modules=['q_lin']) # we apply LoRA to the query layer only
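In case you adapt this to a different base model: 'q_lin' is the name of DistilBERT's query projection layers, and a quick (if informal) way to find candidate names for target_modules is to list the model's linear layers, as sketched below.
# list linear-layer names to help choose target_modules
# (for DistilBERT this prints q_lin, k_lin, v_lin, out_lin, lin1, lin2
#  for each transformer layer, plus the classification head layers)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)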
We can then create a new version of our model that can be trained via PEFT. Notice in the printout below that the number of trainable parameters is now less than 2% of the total.
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# trainable params: 1,221,124 || all params: 67,584,004 || trainable%: 1.8068239934408148
Next, we define hyperparameters for model training.
# hyperparameters
lr = 1e-3 # size of optimization step
batch_size = 4 # number of examples processed per optimization step
num_epochs = 10 # number of times model runs through training data

# define training arguments
training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
Finally, we create a Trainer object and fine-tune the model!
# create trainer object
trainer = Trainer(
    model=model, # our peft model
    args=training_args, # hyperparameters
    train_dataset=tokenized_dataset["train"], # training data
    eval_dataset=tokenized_dataset["validation"], # validation data
    tokenizer=tokenizer, # define tokenizer
    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics, # evaluates model using compute_metrics() function from before
)

# train model
trainer.train()
The above code will print a table of metrics (training loss, validation loss, and accuracy) for each epoch during training.
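After training, here is a sketch of how the LoRA adapter could be saved and reloaded using the PeftModel and PeftConfig classes imported earlier (the directory name is an assumption); only the small adapter weights are written to disk, which is why LoRA checkpoints are so compact.
# save only the (small) LoRA adapter weights
model.save_pretrained("distilbert-lora-text-classification")

# reload: attach the saved adapter to a freshly loaded base model
peft_config_loaded = PeftConfig.from_pretrained("distilbert-lora-text-classification")
base_model = AutoModelForSequenceClassification.from_pretrained(
    peft_config_loaded.base_model_name_or_path,
    num_labels=2, id2label=id2label, label2id=label2id)
reloaded_model = PeftModel.from_pretrained(base_model, "distilbert-lora-text-classification")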
Trained model performance
To see how the model performance has improved, let's apply it to the same 5 examples from before.
model.to('mps') # moving to mps for Mac (can alternatively do 'cpu')

print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to("mps") # moving to mps for Mac (can alternatively do 'cpu')

    logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])
# Output:
# Trained model predictions:
# ----------------------------
# It was good. - Positive
# Not a fan, don't recommed. - Negative
# Better than the first one. - Positive
# This is not worth watching even once. - Negative
# This one is a pass. - Positive # this one is tricky
The fine-tuned model improved significantly over its prior random guessing, correctly classifying all but one of the examples in the code above. This aligns with the ~90% accuracy metric we saw during training.
Links: Code Repo | Model | Dataset
While fine-tuning an existing model requires more computational resources and technical expertise than using one out-of-the-box, (smaller) fine-tuned models can outperform (larger) pre-trained base models for a particular use case, even when employing clever prompt engineering strategies. Furthermore, with all the open-source LLM resources available, it's never been easier to fine-tune a model for a custom application.
The next (and final) article of this series will go one step beyond model fine-tuning and discuss how to train a language model from scratch.
👉 More on LLMs: Introduction | OpenAI API | Hugging Face Transformers | Prompt Engineering