Capstone Project
Duration: 12 min
This module focuses on the practical application of NLP and Transformers, specifically BERT, using HuggingFace's Transformers library. You will learn how to load a pre-trained model and fine-tune it for a specific task, a core skill for building NLP applications.
Loading and Using BERT with HuggingFace
In this section, we will explore how to load a pre-trained BERT model using the HuggingFace Transformers library. This model can be fine-tuned for various NLP tasks, such as sentiment analysis or named entity recognition. The library provides a straightforward API to load and use these models efficiently.
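Before the model sees any text, the tokenizer splits it into subword pieces and wraps them in special tokens. The sketch below illustrates that idea with a greedy longest-match-first split in the style of WordPiece. The vocabulary `TOY_VOCAB` and the helpers `wordpiece_tokenize` and `encode` are hypothetical names invented for this illustration; they are not part of the Transformers library, and the real tokenizer handles many more details (normalization, ids, padding).

```python
import re

# Hypothetical toy vocabulary for illustration; the real bert-base-uncased
# vocabulary has roughly 30,000 entries.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "hello", "how", "are", "you",
             "fine", "##tun", "##ing", ",", "?"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one lowercase word into the longest matching pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Lowercase, split off punctuation, tokenize each word, add special tokens."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


print(encode("Hello, how are you?"))
print(wordpiece_tokenize("finetuning", TOY_VOCAB))
```

The library code that follows does the real work: `BertTokenizer` additionally maps each piece to an integer id and returns tensors the model can consume.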
from transformers import BertTokenizer, BertModel
import torch
# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Encode a sample text
inputs = tokenizer("Hello, how are you?", return_tensors='pt')
# Get the model's output
outputs = model(**inputs)
# Print the last hidden state
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)
Output (values are illustrative and will differ in practice; the tensor has shape [1, sequence_length, 768]):
tensor([[[-0.0134, -0.0929,  0.0246,  ..., -0.0483, -0.0322,  0.0186],
         [-0.0134, -0.0929,  0.0246,  ..., -0.0483, -0.0322,  0.0186],
         [-0.0134, -0.0929,  0.0246,  ..., -0.0483, -0.0322,  0.0186],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]], grad_fn=<AddmmBackward>)
Fine-tuning BERT for a Specific Task
Fine-tuning a pre-trained BERT model involves training it on a specific dataset for a particular NLP task. This process allows the model to adapt its knowledge to the new task, improving its performance. We will demonstrate how to fine-tune BERT for a sentiment analysis task using the HuggingFace library.
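For a two-class sentiment task, the classification model produces one raw score (logit) per class for each input, and training minimizes the cross-entropy between those scores and the true label. The sketch below works through that arithmetic in plain Python for a single example. The logit values are made up for illustration, and the helper names `softmax` and `cross_entropy` are hypothetical; in practice the library computes this internally.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


# Made-up logits for one review: index 0 = negative, index 1 = positive.
logits = [-1.2, 2.3]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print("probabilities:", [round(p, 3) for p in probs])
print("predicted class:", predicted)
print("loss if true label is 1:", round(cross_entropy(logits, 1), 3))
print("loss if true label is 0:", round(cross_entropy(logits, 0), 3))
```

The loss is small when the model is confident in the correct label and large when it is confident in the wrong one, which is the signal fine-tuning uses to adjust the weights.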
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import torch
# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Sample dataset (two examples only, to show the data flow; real fine-tuning needs far more)
texts = ["I love this product!", "This is awful."]
labels = [1, 0]  # 1 for positive, 0 for negative
# Split the dataset
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2)
# Tokenize the dataset
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
# Create a PyTorch dataset that pairs each tokenized example with its label
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example: a tensor per encoding field plus the label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SentimentDataset(train_encodings, train_labels)
test_dataset = SentimentDataset(test_encodings, test_labels)
# Training arguments
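training_args = TrainingArguments(
    output_dir='./results',
    # Evaluate at the end of every epoch. Newer transformers releases name this
    # argument eval_strategy (evaluation_strategy is deprecated there).
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)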
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
# Train the model
trainer.train()
💡 Tip: Ensure that your dataset is properly tokenized and formatted as a PyTorch Dataset before training.
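One way to act on this tip is to run a quick sanity check over the dataset before training. The sketch below is a hypothetical helper, `check_dataset`, not a Transformers function. It assumes each item is a dict-like object with `input_ids`, `attention_mask`, and `labels`, and that all examples were padded to the same length, as happens when a whole split is tokenized with `padding=True`.

```python
def check_dataset(dataset, required_keys=("input_ids", "attention_mask", "labels")):
    """Return a list of readable problems; an empty list means the checks passed."""
    if len(dataset) == 0:
        return ["dataset is empty"]
    problems = []
    expected_len = None
    for idx in range(len(dataset)):
        item = dataset[idx]
        missing = [key for key in required_keys if key not in item]
        if missing:
            problems.append(f"item {idx}: missing keys {missing}")
            continue
        n_tokens = len(item["input_ids"])
        if len(item["attention_mask"]) != n_tokens:
            problems.append(f"item {idx}: attention_mask length differs from input_ids")
        if expected_len is None:
            expected_len = n_tokens  # first valid item sets the reference length
        elif n_tokens != expected_len:
            problems.append(
                f"item {idx}: {n_tokens} tokens, expected {expected_len} (was padding applied?)"
            )
    return problems


# Hypothetical stand-ins for real datasets; the token ids are made up.
good = [
    {"input_ids": [101, 2001, 102], "attention_mask": [1, 1, 1], "labels": 1},
    {"input_ids": [101, 102, 0], "attention_mask": [1, 1, 0], "labels": 0},
]
bad = [
    {"input_ids": [101, 102], "attention_mask": [1, 1], "labels": 1},
    {"input_ids": [101, 2001, 102], "attention_mask": [1, 1, 1]},  # no label
]

print(check_dataset(good))
print(check_dataset(bad))
```

Because the helper only relies on `len()` and indexing, it should work equally on plain lists like these and on a `torch.utils.data.Dataset` whose items hold one-dimensional tensors.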
❓ What is the purpose of the 'Trainer' class in the HuggingFace library?
❓ Which argument in the 'TrainingArguments' class controls the number of training epochs?