Evaluation Metrics for Fine-Tuned LLMs

Duration: 5 min

This module covers two metrics commonly used to evaluate fine-tuned Large Language Models (LLMs): perplexity and BLEU. Knowing what each one measures, and what it misses, helps you judge whether a fine-tuned model meets your performance targets and generalizes to new, unseen data.

Perplexity

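Perplexity is a standard metric for evaluating language models. It measures how well the model predicts a sample of text: formally, it is the exponential of the average negative log-likelihood per token, which is the exponential of the cross-entropy loss. A lower perplexity means the model assigns higher probability to the text it is evaluated on. Perplexity values are only directly comparable between models that share a tokenizer and are scored on the same text.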

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained model and tokenizer
model_name = 'distilgpt2'
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define a sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
input_ids = tokenizer.encode(text, return_tensors='pt')

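# Calculate perplexity
# Passing labels=input_ids makes the model return the mean cross-entropy loss
# over the predicted tokens (the labels are shifted internally).
with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
    # Perplexity is the exponential of the mean cross-entropy loss
    perplexity = torch.exp(loss)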

print(f'Perplexity: {perplexity.item():.2f}')

Example output (illustrative only; the actual value depends on the model and the input text):

Perplexity: 12.34
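To see what the metric computes without loading a model, the sketch below calculates perplexity directly from per-token probabilities. The helper name `perplexity` and the probability lists are hypothetical, chosen only for illustration; a real evaluation would take these probabilities from the model's output.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each token that actually occurred."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (the cross-entropy loss)
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities, for illustration only
confident = [0.9, 0.8, 0.95, 0.85]   # model mostly expected these tokens
uncertain = [0.2, 0.1, 0.3, 0.25]    # model was often surprised

print(f"confident model: {perplexity(confident):.2f}")
print(f"uncertain model: {perplexity(uncertain):.2f}")

# Sanity check: a model that always spreads probability uniformly
# over 50 choices has a perplexity of 50.
print(f"uniform over 50 choices: {perplexity([1 / 50] * 10):.2f}")
```

The uniform case gives a useful intuition: a perplexity of 50 means the model is, on average, as uncertain as if it were choosing evenly among 50 tokens at each step.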

BLEU Score

The BLEU (Bilingual Evaluation Understudy) score evaluates the quality of machine-translated text by comparing a candidate translation against one or more reference translations. It combines the precision of overlapping n-grams (typically up to 4-grams) with a brevity penalty for candidates that are shorter than the reference. Scores range from 0 to 1, and higher scores indicate closer agreement with the references.

from nltk.translate.bleu_score import sentence_bleu

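# Define reference and candidate translations as lists of tokens.
# sentence_bleu expects a list of references, so the single reference is wrapped in a list.
reference = [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

# Calculate the BLEU score (default: equal weights on 1-gram to 4-gram precision)
bleu_score = sentence_bleu(reference, candidate)

print(f'BLEU Score: {bleu_score:.4f}')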
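To make the calculation less of a black box, here is a minimal sketch of sentence-level BLEU written by hand. It assumes a single reference, equal weights on 1-gram to 4-gram precision, and no smoothing. The function names (`ngrams`, `modified_precision`, `simple_bleu`) are hypothetical and this is not NLTK's implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Share of candidate n-grams that also occur in the reference,
    with each n-gram's count clipped to its count in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def simple_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU against one reference (sketch, no smoothing)."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    # Geometric mean of the n-gram precisions
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize candidates shorter than the reference
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)

reference = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

for n in range(1, 5):
    print(f'{n}-gram precision: {modified_precision(reference, candidate, n):.4f}')
print(f'BLEU (sketch): {simple_bleu(reference, candidate):.4f}')
```

For this pair, a single changed word ("jumped" instead of "jumps") leaves 8 of 9 unigrams matching but only 2 of 6 4-grams, which is why BLEU drops well below the unigram precision. On very short sentences the higher-order precisions are often zero, so in practice a smoothing method is usually applied.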

💡 Tip: When evaluating fine-tuned LLMs, it's important to use a diverse set of metrics to get a comprehensive understanding of the model's performance. Relying solely on one metric can lead to an incomplete assessment.

❓ What does a lower perplexity indicate in the context of language models?

Higher uncertainty
Lower quality
Better predictions
Irrelevant metric

❓ What does a higher BLEU score signify in machine translation?

Poor translation quality
Average translation quality
Excellent translation quality
Inapplicable metric
