Introduction to LLM Fine-Tuning

Duration: 5 min

This module introduces fine-tuning of Large Language Models (LLMs). It covers Parameter-Efficient Fine-Tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA), along with Instruction Tuning, Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and evaluation methods. Understanding these techniques is key to adapting LLMs to specific tasks and improving their performance on them.

Low-Rank Adaptation (LoRA)

LoRA fine-tunes an LLM efficiently by learning a low-rank update to each adapted weight matrix while the original weights stay frozen. Instead of updating all weights $W$, LoRA adds a low-rank update: $W' = W + \Delta W = W + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ are trainable matrices with rank $r \ll d$. For a square $d \times d$ matrix, this reduces the trainable parameters from $d^2$ to $2dr$, which lowers memory use and speeds up fine-tuning. In practice the update is scaled by $\alpha / r$, where $\alpha$ is a hyperparameter, and $B$ is initialized to zero so that training starts from the pre-trained model.
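To make the parameter savings concrete, the following sketch counts trainable parameters for a single square $d \times d$ weight matrix. The values $d = 768$ and $r = 8$ are illustrative assumptions, and the helper names are hypothetical.

```python
def full_param_count(d: int) -> int:
    """Parameters updated when fine-tuning a full d x d weight matrix."""
    return d * d


def lora_param_count(d: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x d)."""
    return d * r + r * d


d, r = 768, 8  # illustrative values, not prescribed by the method
full = full_param_count(d)
lora = lora_param_count(d, r)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full / lora:.0f}x")
# full: 589,824  lora: 12,288  reduction: 48x
```

The reduction applies per adapted matrix; the overall saving depends on how many layers receive an adapter.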

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained model and tokenizer
model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define LoRA parameters
lora_r = 8  # Rank of low-rank decomposition
lora_alpha = 32  # Scaling factor
lora_dropout = 0.1

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, r, alpha, dropout):
        super().__init__()
        self.lora_A = nn.Linear(in_dim, r, bias=False)
        self.lora_B = nn.Linear(r, out_dim, bias=False)
        self.scale = alpha / r
        self.dropout = nn.Dropout(dropout)
        # Initialize A with Gaussian, B with zeros
        nn.init.normal_(self.lora_A.weight, std=1/r)
        nn.init.zeros_(self.lora_B.weight)
    
    def forward(self, x):
        # Dropout is applied to the adapter input, then the low-rank update is scaled by alpha / r
        return self.lora_B(self.lora_A(self.dropout(x))) * self.scale

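class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable LoRA update."""
    def __init__(self, base, r, alpha, dropout):
        super().__init__()
        self.base = base
        self.lora = LoRALayer(base.in_features, base.out_features, r, alpha, dropout)

    def forward(self, x):
        return self.base(x) + self.lora(x)

# Freeze the pre-trained weights; only the adapters created below are trained
for param in model.parameters():
    param.requires_grad = False

# Collect the attention linear layers first, so the module tree is not modified while iterating
targets = [(name, module) for name, module in model.named_modules()
           if 'attention' in name and isinstance(module, nn.Linear)]

# Apply LoRA to attention layers
for name, module in targets:
    parent_name, _, child_name = name.rpartition('.')
    parent = model.get_submodule(parent_name)
    # Replace the layer with its LoRA-wrapped version
    # (in practice, use the peft library for production code)
    setattr(parent, child_name, LoRALinear(module, lora_r, lora_alpha, lora_dropout))
    print(f'Applied LoRA to {name}: {module.in_features}x{module.out_features} -> rank {lora_r}')

print('LoRA applied successfully.')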

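Sample output (abridged). Module names and shapes depend on the model architecture; the line below shows GPT-NeoX-style naming: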

Applied LoRA to gpt_neox.gpt_neox.layers.0.attention.query_key_value: 768x2304 -> rank 8
LoRA applied successfully.
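
Because the adapter is a plain additive update, it can be folded back into $W$ after training, so inference needs no extra matrix multiplications. The sketch below checks this on a tiny example in pure Python. The matrix values, the dimensions, and the helper names (`matmul`, `matvec`) are made up for illustration; the scaling factor `alpha / r` follows the `LoRALayer` code above.

```python
def matmul(X, Y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


d, r, alpha = 3, 1, 2.0        # toy sizes; real layers are far larger
scale = alpha / r

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]          # frozen base weight (d x d)
B = [[0.5], [-1.0], [0.25]]    # trainable, d x r
A = [[1.0, 2.0, -1.0]]         # trainable, r x d
x = [1.0, 1.0, 2.0]

# Adapter path: base output plus the scaled low-rank correction.
base_out = matvec(W, x)
low_rank = matvec(B, matvec(A, x))
adapter_out = [b + scale * l for b, l in zip(base_out, low_rank)]

# Merged path: fold the update into the weight once, then use a single product.
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]
merged_out = matvec(W_merged, x)

assert all(abs(a - m) < 1e-9 for a, m in zip(adapter_out, merged_out))

# With B initialized to zeros, the adapter contributes nothing before training.
B_zero = [[0.0] for _ in range(d)]
assert matvec(B_zero, matvec(A, x)) == [0.0] * d
```

The second assertion is why the zero initialization of `lora_B` matters: the wrapped model starts out producing exactly the pre-trained model's outputs.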

Quantized Low-Rank Adaptation (QLoRA)

QLoRA extends LoRA by quantizing the frozen base model to 4-bit precision while keeping the LoRA adapters in higher precision. The base model weights $W$ are stored in 4-bit and dequantized on the fly for computation, and only the low-rank updates $BA$ are trained, typically in 16-bit precision. Storing the base weights in 4-bit takes roughly 4x less memory than storing them in 16-bit, as plain LoRA does. The QLoRA paper reports fine-tuning a 65B-parameter model on a single 48 GB GPU.
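
A back-of-the-envelope calculation shows where the saving comes from. The sketch below counts only the bytes needed to store the base weights. The 7-billion-parameter size is an illustrative assumption, and activations, adapter weights, optimizer state, and quantization constants are deliberately ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # hypothetical model size
mem_16bit = weight_memory_gb(num_params, 16)
mem_4bit = weight_memory_gb(num_params, 4)
print(f"16-bit: {mem_16bit:.1f} GB, 4-bit: {mem_4bit:.1f} GB, "
      f"ratio: {mem_16bit / mem_4bit:.0f}x")
# 16-bit: 14.0 GB, 4-bit: 3.5 GB, ratio: 4x
```

Real memory use during training is higher than these figures, since gradients and optimizer state for the adapters and the activations also need space.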

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

# Quantization config: 4-bit NormalFloat
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model with 4-bit quantization
model_name = 'EleutherAI/gpt-neo-125M'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj"]  # Target query and value projections
)

model = get_peft_model(model, lora_config)
print(f'Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}')

💡 Tip: QLoRA uses 4-bit NormalFloat (NF4) quantization, whose levels are spaced to suit normally distributed weights and therefore preserve more information than uniformly spaced INT4 levels. Double quantization reduces memory further by quantizing the quantization constants themselves.
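
The saving from double quantization can be worked out directly. The sketch below assumes the settings described in the QLoRA paper: one 32-bit constant per block of 64 weights, which double quantization replaces with an 8-bit constant per block plus one 32-bit second-level constant shared across 256 blocks. The function names are hypothetical, and the defaults in a given library version may differ.

```python
def overhead_bits_plain(block_size: int = 64, const_bits: int = 32) -> float:
    """Extra bits per parameter when each block stores one full-precision constant."""
    return const_bits / block_size


def overhead_bits_double(block_size: int = 64, first_bits: int = 8,
                         second_bits: int = 32, second_block: int = 256) -> float:
    """Extra bits per parameter when the block constants are themselves quantized."""
    return first_bits / block_size + second_bits / (block_size * second_block)


plain = overhead_bits_plain()
double = overhead_bits_double()
print(f"{plain:.3f} -> {double:.3f} bits per parameter "
      f"(saves {plain - double:.3f})")
# 0.500 -> 0.127 bits per parameter (saves 0.373)
```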

💡 Tip: When applying LoRA or QLoRA, choose the rank 'r' to balance memory efficiency against model performance. A rank that is too low may not capture enough task-specific information, while a rank that is too high erodes the memory and speed benefits of these techniques.
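
The cost side of this trade-off is easy to quantify. The sketch below shows how the adapter's share of a $d \times d$ matrix grows with the rank, using the $2dr$ count from the LoRA section. The hidden size and the ranks are illustrative assumptions, and the sketch says nothing about model quality, which has to be measured empirically.

```python
def lora_fraction(d: int, r: int) -> float:
    """Trainable LoRA parameters as a fraction of the full d x d matrix."""
    return (2 * d * r) / (d * d)


d = 768  # illustrative hidden size
for r in (1, 8, 64, 384):
    print(f"r={r:>3}: {lora_fraction(d, r):.1%} of the full matrix")
# r=  1: 0.3% of the full matrix
# r=  8: 2.1% of the full matrix
# r= 64: 16.7% of the full matrix
# r=384: 100.0% of the full matrix
```

At $r = d/2$ the adapter has as many parameters as the matrix it adapts, so any parameter saving is gone.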

❓ What is the primary benefit of using LoRA for fine-tuning LLMs?

Increased model size
Reduced training time
Higher computational cost
Complex model architecture

❓ How does QLoRA differ from LoRA?

QLoRA uses higher precision
QLoRA introduces additional trainable parameters
QLoRA incorporates quantization
QLoRA requires more memory
