
 Fine-Tuning GPT-2 with the Alpaca Dataset: A Practical Guide

Introduction

Large language models like GPT-2 can generate impressive text, but sometimes they need to be adapted to specific tasks or styles. That’s where fine-tuning comes in.

In this blog, we’ll walk through fine-tuning GPT-2 using the Alpaca dataset (a popular dataset for instruction tuning, inspired by Stanford’s Alpaca). By the end, you’ll have a working model trained on Alpaca that can respond better to instruction-like prompts.


What You’ll Need

  • Python 3.8+

  • Google Colab (or a local GPU environment)

  • Hugging Face transformers and datasets libraries

  • The Alpaca dataset

Install dependencies:

!pip install transformers datasets accelerate

Step 1: Load the Alpaca Dataset

The Alpaca dataset is formatted as instructions + outputs. We’ll use Hugging Face Datasets to load it.

from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned")
print(dataset["train"][0])

Example output:

{
  "instruction": "Describe the benefits of meditation.",
  "input": "",
  "output": "Meditation helps reduce stress, improves focus, and promotes emotional health."
}

Step 2: Prepare the Data for GPT-2

We need to format each example into a single text prompt that GPT-2 can learn from.

def format_instruction(example):
    # Include an Input section only when the example actually has one.
    if example["input"]:
        return (
            f"### Instruction:\n{example['instruction']}\n"
            f"### Input:\n{example['input']}\n"
            f"### Response:\n{example['output']}"
        )
    return (
        f"### Instruction:\n{example['instruction']}\n"
        f"### Response:\n{example['output']}"
    )

dataset = dataset.map(lambda x: {"text": format_instruction(x)})
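As a quick sanity check, you can run the formatter on a hand-written record before mapping the whole dataset (the formatter is repeated here so the snippet runs standalone; the sample dict below is made up for illustration):

```python
def format_instruction(example):
    # Same template as above: the Input section appears only when "input" is non-empty.
    if example["input"]:
        return (f"### Instruction:\n{example['instruction']}\n"
                f"### Input:\n{example['input']}\n"
                f"### Response:\n{example['output']}")
    return (f"### Instruction:\n{example['instruction']}\n"
            f"### Response:\n{example['output']}")

sample = {"instruction": "Summarize the text.", "input": "", "output": "A short summary."}
text = format_instruction(sample)
print(text)
```

Since `sample["input"]` is empty, the printed prompt should contain only the Instruction and Response sections.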

Step 3: Tokenize the Data

We’ll use GPT-2’s tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)
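Conceptually, `truncation=True` with `padding="max_length"` forces every example to exactly `max_length` token IDs. A minimal pure-Python sketch of that behavior (with `pad_id` standing in for `tokenizer.pad_token_id`):

```python
def pad_or_truncate(ids, max_length, pad_id):
    # Truncate sequences longer than max_length...
    ids = ids[:max_length]
    # ...and right-pad shorter ones with the pad token ID.
    return ids + [pad_id] * (max_length - len(ids))

print(pad_or_truncate([5, 6, 7], 5, 0))           # [5, 6, 7, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5, 0))  # [1, 2, 3, 4, 5]
```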

Step 4: Fine-Tune GPT-2

We’ll use the Trainer API from Hugging Face.

from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Causal-LM collator (mlm=False) copies input_ids into labels so the
# Trainer can compute the language-modeling loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./gpt2-alpaca",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=2,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./gpt2-alpaca")  # write the final weights to the output directory
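For a rough sense of training length: the cleaned Alpaca set has about 52,000 examples (the exact count varies by dataset version), so with the settings above the number of optimizer steps works out to roughly:

```python
import math

num_examples = 52_000  # approximate size of yahma/alpaca-cleaned
batch_size = 2         # per_device_train_batch_size (single GPU, no gradient accumulation)
epochs = 2

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 52000
```

Tens of thousands of steps at batch size 2 can take several hours on a free Colab GPU, so consider raising the batch size or using gradient accumulation if memory allows.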

Step 5: Test the Fine-Tuned Model

After training, let’s test it with an instruction.

from transformers import pipeline

generator = pipeline("text-generation", model="./gpt2-alpaca", tokenizer=tokenizer)

instruction = "Write a short story about a robot learning emotions."
prompt = f"### Instruction:\n{instruction}\n### Response:\n"

output = generator(prompt, max_length=200, num_return_sequences=1)
print(output[0]["generated_text"])
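The pipeline returns the prompt together with the continuation. To show only the model's answer, you can split on the response marker (a minimal sketch, using a hard-coded string in place of real generator output so it runs standalone):

```python
def extract_response(generated_text):
    # Everything after the last "### Response:" marker is the model's answer.
    marker = "### Response:\n"
    return generated_text.split(marker)[-1].strip()

fake_output = "### Instruction:\nSay hi.\n### Response:\nHello there!"
print(extract_response(fake_output))  # Hello there!
```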

Conclusion

In just a few steps, we fine-tuned GPT-2 on the Alpaca dataset. The fine-tuned model follows instruction-style prompts noticeably better than the base version, though a model of this size still has clear limits.

  • Fine-tuning helps align general-purpose models with specific needs.

  • The Alpaca dataset is a great starting point for instruction tuning.

  • You can experiment with different hyperparameters, datasets, or larger models for better results.


 Next steps: You could deploy this fine-tuned model with Hugging Face Spaces, integrate it into a chatbot, or continue training with more data.
