Chief Idiot · 4 min read

App 4: The Writer (Text Generator)

Fine-tune a language model to write in any style. Poems, code, tweets — you pick the voice.

Teaching AI Your Voice

In App 2 we built a GPT from scratch and it learned to sound like Shakespeare. But what if you want it to sound like you? Or like a pirate? Or like a recipe book?

That is fine-tuning — taking a pre-trained model and teaching it your style. Today, we use HuggingFace Transformers and PyTorch to do exactly that.

The Writer

The Idea

Training a language model from scratch takes millions of dollars. Fine-tuning one takes a laptop and 20 minutes.

  1. Start with a pre-trained model (it already knows English)
  2. Feed it your text (poems, tweets, recipes, whatever)
  3. It adapts its style to match yours

Setup

# Install what we need
# pip install transformers datasets torch
 
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,  # deprecated in recent transformers versions, but fine for this demo
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

We are using GPT-2 Small (124M parameters). It is free, open-source, and fits on any laptop.
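That 124M figure can be reproduced from GPT-2 Small's published dimensions (vocabulary 50,257; context 1,024; hidden size 768; 12 layers). This back-of-the-envelope tally is a sketch of my own, not code from the rest of the post:

```python
# Rough parameter count for GPT-2 Small from its standard configuration.
# The dimensions below are the published GPT-2 Small hyperparameters.
vocab, ctx, d, n_layers = 50257, 1024, 768, 12

embeddings = vocab * d + ctx * d      # token + position embeddings
attention = 4 * d * d + 4 * d         # fused Q/K/V projection + output projection
mlp = 8 * d * d + 5 * d               # two linear layers with 4x expansion
layernorms = 4 * d                    # two LayerNorms per block (weight + bias)
per_block = attention + mlp + layernorms

total = embeddings + n_layers * per_block + 2 * d  # plus the final LayerNorm
print(f"~{total / 1e6:.0f}M parameters")           # ~124M
```

Everything after the embeddings is just twelve copies of the same block, which is why scaling GPT-2 up mostly means adding layers and widening `d`.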

Step 1: Prepare Your Data

Create a text file with the style you want. The more text, the better. Here is an example — pirate speak:

# Create a training file
training_text = """
Ahoy! The sea be rough today, and me crew be lazier than a barnacle on a rock.
We sailed three leagues before the wind turned foul. The captain cursed the sky.
Every morning I wake to the sound of waves and the smell of salt and bad decisions.
The treasure map be nothing but lies, but we follow it anyway. What else is there?
A pirate without a ship is just a man with bad hygiene and questionable life choices.
The parrot said nothing useful today. As usual. I am starting to doubt its intelligence.
"""
 
# Save to file (in practice, use a larger dataset)
with open("pirate_text.txt", "w") as f:
    for _ in range(100):  # repeat to give the model more to learn from
        f.write(training_text)

In real use, you would collect much more text — blog posts, books, chat logs, etc.
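Collecting that corpus is mostly file plumbing. As a minimal sketch (the folder layout, `*.txt` glob, and cleanup rules here are assumptions, not part of the post), you might merge a folder of sources into one training file:

```python
from pathlib import Path

def build_corpus(source_dir, out_path="training.txt"):
    """Concatenate every .txt file in source_dir into one training file,
    skipping blank lines. A minimal sketch: real cleanup (deduping,
    encoding fixes, stripping boilerplate) depends on your sources."""
    lines = []
    for path in sorted(Path(source_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                lines.append(line.strip())
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)

# Example: merge a folder of exported blog posts
# n = build_corpus("my_blog_posts/")
# print(f"{n} lines of training text")
```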

Step 2: Load the Model

model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
 
# Prepare the dataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="pirate_text.txt",
    block_size=128,
)
 
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # GPT-2 is causal, not masked
)

Step 3: Fine-Tune

training_args = TrainingArguments(
    output_dir="./pirate-gpt",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    learning_rate=5e-5,
    warmup_steps=100,
)
 
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
 
trainer.train()

On a laptop CPU this takes 10-20 minutes; on a GPU, a couple of minutes.
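Why so fast? The dataset above is tiny. A rough count of optimizer steps for the run above, assuming the common rule of thumb of ~4 characters per token (the character count is an estimate, not measured):

```python
# Back-of-the-envelope: how many optimizer steps does the run above take?
chars_per_copy = 550      # roughly the length of training_text above (estimate)
copies = 100              # we wrote the text to file 100 times
tokens = (chars_per_copy * copies) // 4   # ~4 chars per token (rule of thumb)
blocks = tokens // 128                    # block_size=128 from TextDataset
steps_per_epoch = -(-blocks // 4)         # batch size 4, rounded up
total_steps = steps_per_epoch * 3         # num_train_epochs=3
print(total_steps)                        # on the order of 80 steps
```

A few dozen steps over a 124M-parameter model is cheap; the expensive part (learning English) was already done during pre-training.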

Step 4: Generate Text

def generate(prompt, max_length=150, temperature=0.8):
    inputs = tokenizer.encode(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs,
        max_length=max_length,
        num_return_sequences=1,
        temperature=temperature,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
 
print(generate("The captain looked at the horizon and"))

The output will have a pirate flavor — sea metaphors, rough language, and questionable decisions.

Temperature: The Creativity Knob

Temperature   Behavior
0.1           Very safe, repetitive, boring
0.5           Balanced, coherent
0.8           Creative, interesting
1.2           Wild, sometimes nonsense
2.0           Completely unhinged

# Conservative
print(generate("The sea", max_length=50, temperature=0.3))
 
# Creative
print(generate("The sea", max_length=50, temperature=1.0))
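Under the hood, temperature simply divides the model's logits before the softmax: low values sharpen the distribution toward the top token, high values flatten it. A tiny standalone illustration in plain Python (the logits here are made up, not from GPT-2):

```python
import math

def softmax_temp(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]           # made-up scores for three candidate tokens
cold = softmax_temp(logits, 0.2)   # nearly all probability on the top token
hot = softmax_temp(logits, 2.0)    # much closer to uniform
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At temperature 0.2 the top token dominates; at 2.0 the runners-up get real probability mass, which is where the "wild" outputs come from.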

Save Your Model

model.save_pretrained("./pirate-gpt")
tokenizer.save_pretrained("./pirate-gpt")
 
# Load later
model = GPT2LMHeadModel.from_pretrained("./pirate-gpt")
tokenizer = GPT2Tokenizer.from_pretrained("./pirate-gpt")

Ideas for Your Own Writer

Style                    Training data
Your own writing voice   Your blog posts, emails, journal
Recipe generator         Cooking websites, recipe books
Poet                     Poetry collections (public domain)
Code commenter           GitHub commit messages
DnD narrator             Game transcripts and fantasy novels

What You Built

A text generator that:

  • Starts from GPT-2 (which already knows English)
  • Learns your specific writing style from examples
  • Generates new text that sounds like your training data
  • Has a temperature knob for creativity control

This is the same process used to create specialized AI assistants, creative writing tools, and domain-specific chatbots. The only difference is scale.

Next up: we put everything on the internet.
