Optimizing Language Model Fine-Tuning: A Practical Guide to Low-Rank Adaptation
Fine-tuning Large Language Models (LLMs) is where general-purpose models are turned into tools that fit our specific requirements. However, adapting a model of this size demands significant computational power, which can come at a hefty cost.
Efficiency and cost-effectiveness are therefore crucial factors to consider. We want to get the most out of our resources without exceeding our budget. This is where Low-Rank Adaptation (LoRA) comes into play: it bridges performance and affordability by training only a small number of additional low-rank weights instead of the full model.
There are different approaches to Parameter-Efficient Fine-Tuning (PEFT), but in this article, we’ll dive into LoRA and discover how it can make our model-tuning process smarter and more efficient. We’ll see what LoRA is, how it works, why it’s a game-changer, and how we can fine-tune an LLM using this technique. Join us on this journey to uncover how LoRA can optimize your language models without breaking the bank.
Fine-Tuning Large Language Models: A Leap Beyond Training from Scratch
LLMs are specific kinds of machine learning models that are pre-trained on a large amount of text data. They are designed to predict the upcoming word in a sentence, which helps them grasp grammar rules, gain knowledge about the world, and develop some level of reasoning skills.
When starting to train a model from scratch, the first step is to set up a model with random parameters. These parameters are adjusted as the model learns from its mistakes in making predictions. This iterative process occurs many times (often millions or billions of times) with a vast amount of data. This requires substantial computational resources and time.
Two of the most common techniques for adapting language models today are Retrieval-Augmented Generation (RAG) and fine-tuning. RAG uses a retriever to find relevant data and a generator to form an answer based on the retrieved data, resulting in precise and context-rich responses. Fine-tuning, on the other hand, takes a pre-trained model that has already learned from a large dataset and refines it for a specific task. The benefit is that the pre-trained model has already gained valuable knowledge from its previous training, which serves as a foundation for the new task. Fine-tuning makes comparatively small adjustments to the model’s parameters, making the process faster and requiring less data than training a model from scratch.
So, the main difference between pre-training from scratch and fine-tuning is as follows:
- Pre-training: Involves starting with a model that is randomly initialized and then learning all parameters from the data. This method requires a large amount of data and computational resources.
- Fine-tuning: Begins with a pre-trained model and only makes slight adjustments to its parameters for a specific task. This approach requires less data and computational resources compared to training from scratch.
Fine-tuning LLMs involves two main steps: integrating task-specific heads and adjusting the neural network weights. Let’s break it down:
- Integrating Task-Specific Heads: To start fine-tuning, one or more task-specific heads are added to the pre-trained model, depending on how many tasks you want to fine-tune for. For example, if you want to classify texts into different rubrics, you can add several heads, one per rubric. These heads are additional layers within the existing model architecture. Their purpose is to convert the model’s output into a format suitable for the particular task. For instance, in text classification, the head may consist of a fully connected layer followed by a softmax activation function to produce probabilities for each class.
- Updating Neural Network Weights: Once the task-specific head is in place, the next step is to update the network weights by training the model on a dataset for that task (or tasks). During training, both the task-specific heads and the existing layers are adjusted: the head layers typically get a higher learning rate because they must learn from scratch, while the pre-existing layers get a lower learning rate because they have already been trained. This merges the model’s previous knowledge with the new task requirements, as sketched in the example below.
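To make these two steps concrete, here is a minimal, illustrative PyTorch sketch; the backbone, head, dimensions, and learning rates are all stand-ins chosen for the example, not taken from any particular model:
import torch
import torch.nn as nn

# Stand-in for the pre-trained layers (in practice this would be the LLM backbone)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
# Task-specific head: maps the pooled representation to 3 hypothetical classes
head = nn.Linear(256, 3)

# Lower learning rate for the already-trained layers,
# higher learning rate for the head, which learns from scratch
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

x = torch.randn(8, 16, 256)                  # dummy batch: 8 sequences of 16 tokens
logits = head(backbone(x).mean(dim=1))       # pool over tokens, then classify
labels = torch.randint(0, 3, (8,))           # dummy labels
loss = nn.functional.cross_entropy(logits, labels)  # applies softmax internally
loss.backward()
optimizer.step()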
To sum up, fine-tuning LLMs is a valuable technique that builds upon the knowledge captured in pre-trained models. This allows us to achieve good performance on specific tasks with less computational resources. The main idea of this technique is to specialize pre-trained LLMs.
What is LoRA?
Low-Rank Adaptation is a parameter-efficient approach designed to adapt large pre-trained models like language models. It’s a method that aims to balance the trade-off between the model’s performance and the computational resources required for its adaptation.
Here are some of the key benefits of LoRA over full fine-tuning:
- Parameter Efficiency: LoRA introduces a low-rank structure into the pre-trained model’s parameters during the adaptation process. This structure significantly reduces the number of parameters that need to be updated, leading to a more efficient use of computational resources.
- Resource Savings: By reducing the number of parameters that need to be updated, LoRA can save substantial computational resources. This includes both memory and processing power, making it a more cost-effective solution for adapting large models.
- Comparable Performance: Despite its efficiency, LoRA does not compromise on performance. It has been shown to achieve comparable, and in some cases even superior, performance to full fine-tuning on a variety of tasks.
- Preservation of Pre-trained Parameters: Unlike full fine-tuning, which updates all parameters and may cause catastrophic forgetting of the pre-trained knowledge, LoRA only adapts a small subset of parameters, preserving the valuable pre-trained parameters.
In summary, LoRA provides a more resource-efficient way to adapt large pre-trained models, offering a promising solution for real-world applications where both performance and efficiency are critical. It’s a step towards making large models more accessible and practical.
Now, let’s look at how LoRA works in a bit more detail:
- In traditional fine-tuning, we change the original weight matrix (W) of the model to adapt to a new task. The changes made to W are represented by another matrix (ΔW), so the updated weights can be expressed as W + ΔW.
- Instead of changing W directly, LoRA decomposes ΔW into two smaller matrices, A and B. This decomposition is a crucial step in reducing the computational overhead associated with fine-tuning large models.
- The updated weight matrix (W’) thus becomes: W’ = W + BA. In this equation, W remains frozen (i.e., it is not updated during training). The matrices B and A are of lower dimensionality, with their product (BA) representing a low-rank approximation of ΔW.
- By choosing matrices A and B with a low rank r, the number of trainable parameters is significantly reduced. The product BA always has the same shape as ΔW regardless of r, but the number of entries in A and B themselves shrinks as r shrinks. For example, if W is a d x d matrix, updating it directly would involve d² parameters. With B of size d x r and A of size r x d, the total number of trainable parameters drops to 2dr, which is much smaller when r << d (see the sketch after this list).
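To make the savings concrete, here is a tiny PyTorch sketch of the decomposition. The dimensions are illustrative only, and the real method also scales BA by a factor lora_alpha / r (we will set lora_alpha later in the fine-tuning configuration):
import torch

d, r = 4096, 8               # hidden size d and a small LoRA rank r (example values)

W = torch.randn(d, d)        # frozen pre-trained weight: d * d parameters
A = torch.randn(r, d) * 0.01 # trainable, r x d (small random init)
B = torch.zeros(d, r)        # trainable, d x r (initialized to zero in LoRA)

delta_W = B @ A              # same d x d shape as ΔW, built from only 2 * d * r numbers
W_adapted = W + delta_W      # W' = W + BA

print(f"Parameters in a full update: {d * d:,}")      # 16,777,216
print(f"Parameters trained by LoRA:  {2 * d * r:,}")  # 65,536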
LoRA is like a smart shortcut for fine-tuning large models. It keeps the original model frozen, adds small trainable parts, drastically reduces the number of parameters that change, and adds no extra inference latency once the adapter weights are merged back into the base model. This makes it a very efficient method for adapting large models to specific tasks or domains.
How to Fine-tune an LLM using LoRA
In this part, we will fine-tune a Large Language Model (Google Gemma-2b in this particular case) using Low-Rank Adaptation.
The beauty of this process is that it doesn’t require any expensive hardware or complex setup. All you need is a Hugging Face free account to download models from its repository and a Google email account in order to have access to Google Colab, a popular platform that provides free access to a T4 GPU for machine learning education and research.
We will guide you through this process step by step, ensuring that each stage is clear and understandable. From setting up your environment on Google Colab to running your first fine-tuning experiment, we’ve got you covered.
What’s more, we will provide code snippets at each step, allowing you to follow along interactively. By the end of this section, you will not only have a theoretical understanding of fine-tuning an LLM using LoRA but also practical experience that you can apply to your own projects.
First of all, we have to create a Google Colab notebook and attach the GPU. This is done by clicking into the “Runtime” tab and then “Change runtime type”:
A window will pop up in which you will have to select “Python 3” and “T4 GPU”, and click “Save”. It will take a few seconds to have the resources allocated for the rest of the process:
Once we have everything set up, we can start coding! So, let’s install the necessary packages:
# For any HF basic activities like loading models
# and tokenizers for running inference
# upgrade is a must for the newest Gemma model
!pip install --upgrade datasets
!pip install --upgrade transformers
# For doing efficient stuff - PEFT (Parameter-Efficient Fine-Tuning)
!pip install --upgrade peft
!pip install --upgrade trl
!pip install bitsandbytes
!pip install accelerate
# For logging and visualizing training progress
!pip install tensorboard
Then, we need to create a Hugging Face token, store it as a secret variable in Google Colab, and enable notebook access to it:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')
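Retrieving the secret alone does not authenticate the session. Assuming you stored the token under the name HF_TOKEN, a minimal way to log in (so that gated models like Gemma can be downloaded) is with the huggingface_hub client:
from huggingface_hub import login

# Authenticate this Colab session with the token read from the notebook secrets
login(HF_TOKEN)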
In order to download the Google Gemma-2b model from Hugging Face, we first need to visit the model page and accept its terms and conditions. Once they are accepted, we can load the model from the Hugging Face model repository:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
print(model)
Let’s test the model with a generic question:
input_text = "What should I do on a trip to Europe?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
As a first test, we will try to fine-tune Gemma-2b on the Dolly dataset without LoRA, with the aim of getting more concise responses to the questions we give the LLM. Let’s see what happens:
from datasets import load_dataset
dataset_name = "databricks/databricks-dolly-15k"
dataset = load_dataset(dataset_name, split="train[0:1000]")
print(f"Instruction is: {dataset[0]['instruction']}")
print(f"Response is: {dataset[0]['response']}")
dataset
We need to keep only the ‘open_qa’ and ‘general_qa’ categories for question-answer fine-tuning, and format each example into a single text string, because the trainer expects one text field rather than separate columns. Then we fine-tune the model:
from trl import SFTTrainer
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        # Keep only the question-answering categories
        if example['category'][i] in ['open_qa', 'general_qa']:
            # Collapse instruction and response into a single training text
            text = f"Instruction:\n{example['instruction'][i]}\n\nResponse:\n{example['response'][i]}"
            output_texts.append(text)
    return output_texts
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    formatting_func=formatting_prompts_func,
)
print("Initialized trainer for training!")
trainer.train()
We run into an “OutOfMemoryError”: full fine-tuning of this model needs more resources than we have. The T4 GPU, with roughly 15 GB of VRAM, is not enough for this process. We will need a parameter-efficient fine-tuning technique such as LoRA.
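If you want to verify how much GPU memory your session actually has before trying again, a quick check with PyTorch looks like this (the exact numbers vary by Colab session):
import torch

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}")
print(f"Total memory: {props.total_memory / 1024**3:.1f} GB")
print(f"Allocated:    {torch.cuda.memory_allocated(0) / 1024**3:.1f} GB")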
Since we have already allocated the base model to the GPU, we will need to click “Disconnect and delete runtime” in the “Runtime” tab to empty the resources and then “Reconnect” the resources:
We need to install all the packages one more time, and import all the necessary libraries for the fine-tuning process:
# For any HF basic activities like loading models
# and tokenizers for running inference
# upgrade is a must for the newest Gemma model
!pip install --upgrade datasets
!pip install --upgrade transformers
# For doing efficient stuff - PEFT
!pip install --upgrade peft
!pip install --upgrade trl
!pip install bitsandbytes
!pip install accelerate
# for logging and visualizing training progress
!pip install tensorboard
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
And now, we load the Dolly dataset again (remember to also re-run the formatting_prompts_func definition from before, since the runtime restart cleared it):
from datasets import load_dataset
dataset_name = "databricks/databricks-dolly-15k"
dataset = load_dataset(dataset_name, split="train[0:1000]")
dataset
Then, we have to add all the configuration parameters for fine-tuning the model, including LoRA ones:
# define some variables - model names
model_name = "google/gemma-2b"
new_model = "gemma-ft"
################################################################################
# LoRA parameters
################################################################################
# LoRA attention dimension
lora_r = 4
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1
################################################################################
# bitsandbytes parameters
################################################################################
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
################################################################################
# TrainingArguments parameters
################################################################################
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"
# Number of training epochs
num_train_epochs = 1
# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False
# Batch size per GPU for training
per_device_train_batch_size = 4
# Batch size per GPU for evaluation
per_device_eval_batch_size = 4
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "paged_adamw_32bit"
# Learning rate schedule ("constant" worked slightly better than "cosine" here)
lr_scheduler_type = "constant"
# Number of training steps (overrides num_train_epochs)
max_steps = -1
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Group sequences into batches with the same length
# Saves memory and speeds up training considerably
group_by_length = True
# Save checkpoint every X update step
save_steps = 25
# Log every X updates steps
logging_steps = 25
################################################################################
# SFT parameters
################################################################################
# Maximum sequence length to use
max_seq_length = 40 # None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = True # False
# Device map for loading the model ("auto" lets Accelerate place the layers;
# use {"": 0} to load the entire model on GPU 0)
device_map = "auto"
# Load QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,                       # Activates 4-bit precision loading
    bnb_4bit_quant_type=bnb_4bit_quant_type,     # nf4
    bnb_4bit_compute_dtype=compute_dtype,        # float16
    bnb_4bit_use_double_quant=use_nested_quant,  # False
)
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("Setting BF16 to True")
        bf16 = True
    else:
        bf16 = False
Load the model and tokenizer:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
Initialize LoRA and training arguments:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)
training_arguments
Initialize the SFTTrainer:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
    formatting_func=formatting_prompts_func,
)
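Optionally, before launching training, you can check how few parameters LoRA will actually update. When a peft_config is passed, trl wraps the model as a PeftModel, so the following sanity check should work (the exact counts depend on your library versions):
# Prints the number of trainable (LoRA adapter) parameters vs. total parameters
trainer.model.print_trainable_parameters()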
And fine-tune the model:
# Fine-tune the model
trainer.train()
trainer.model.save_pretrained(new_model)
Let’s visualize the learning curve in TensorBoard:
# !pip install tensorboard
%load_ext tensorboard
%tensorboard --logdir results/runs
Note that the “train/loss” curve decreases over the course of training, indicating successful training progress. What’s even more important is that this progress was achieved in just a few minutes.
Finally, we load and merge the LoRA weights with the base model weights and run the inference with the same prompt we used to test the pre-trained model:
input_text = "What should I do on a trip to Europe?"
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
Surprise! We can see that the output is much more detailed and less generic than the pre-trained model’s. Good job!
Conclusions
In this article, we have seen how fine-tuning works for LLMs, highlighting the importance of efficient resource utilization and cost-effectiveness. We have specifically focused on LoRA, a parameter-efficient method that offers significant benefits over full fine-tuning.
We have seen how LoRA, by freezing the pre-trained weights and training only small low-rank adapter matrices, can achieve comparable performance to full fine-tuning while saving substantial resources. This makes it an attractive option for businesses looking to leverage the power of LLMs without incurring prohibitive costs.
The practical application of LoRA was demonstrated with simple step-by-step code examples, showing its effectiveness in terms of performance and resource utilization. The results underscore the potential of LoRA as a viable and efficient strategy for fine-tuning LLMs.
In conclusion, the adoption of parameter-efficient methods like LoRA can have far-reaching implications for businesses. It allows them to harness the power of LLMs cost-effectively, opening up new possibilities for leveraging AI in various domains. We encourage further exploration of related parameter-efficient methods, such as DoRA, to continue pushing the boundaries of what is possible with LLMs. The future of AI is not just about bigger models, but also about smarter and more efficient ways to fine-tune them.