
Fine-Tuning a 3.8B Model for $0 to Fix Hymn Punctuation

12 min read · Machine Learning · Accessibility · Python · Data Pipeline

How I built a data cleaning pipeline, generated synthetic training data, and fine-tuned Phi-3.5-mini with QLoRA on a free Colab GPU — all to make hymn lyrics accessible to screen readers.

[Image: Hymnal digital songbook platform]

When I built hymnal.kolbeypruitt.com, I scraped 260 hymns from hymnary.org and gccsatx.com. The lyrics looked fine on screen. Then I turned on VoiceOver.

What I heard was a wall of run-on speech. “Amazing grace how sweet the sound that saved a wretch like me I once was lost but now am found was blind but now I see.” No pauses. No phrasing. No breath marks. The screen reader had no idea where one thought ended and the next began, because the lyrics had no punctuation.

This is the story of how I built a deterministic cleaning pipeline, generated synthetic training data, and fine-tuned a 3.8B parameter model for $0 to fix it.

The Problem

Screen readers rely on punctuation for speech prosody. A comma triggers a short pause. A period triggers a longer pause and a pitch drop. A question mark raises the pitch. Without these cues, hymn lyrics become an undifferentiated stream of words — unusable for anyone navigating by ear.

The lyrics in the database are stored as JSON:

{
  "verses": [
    "Amazing grace how sweet the sound\nThat saved a wretch like me",
    "Twas grace that taught my heart to fear\nAnd grace my fears relieved"
  ],
  "chorus": "Optional chorus text here"
}

When I audited the 260 songs, roughly 120 had zero or near-zero punctuation. But punctuation was only one of seven quality issues I found. Fixing the others turned out to be a prerequisite — and the pipeline I built to do it ended up generating 21% more training data than I would have had otherwise.
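What counts as "near-zero punctuation" can be made concrete with a density check — a minimal sketch of the kind of measurement the audit (and later the add_punctuation cleaner, with its 1% threshold) relies on; `punctuation_density` is an illustrative name, not the pipeline's actual function:

```python
# Pacing punctuation that screen readers turn into pauses and pitch changes
PUNCT = set(".,;:!?")

def punctuation_density(text: str) -> float:
    """Fraction of non-whitespace characters that are pacing punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in PUNCT for c in chars) / len(chars)

# Below the 1% threshold: this verse needs punctuation restored
print(punctuation_density("Amazing grace how sweet the sound") < 0.01)  # True
```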

Data Discovery

I wrote a series of inspection scripts to categorize every quality problem across the 260 songs. Seven distinct issue categories emerged:

| # | Issue | Songs Affected | Severity |
|---|-------|----------------|----------|
| 1 | Crammed verses — all verses concatenated into a single string with embedded "1.", "2." markers | 53 | High |
| 2 | Truncated lyrics — incomplete scrapes, sometimes just a few words | 3+ | High |
| 3 | Stray markers — [Refrain], [Chorus] text left in verses | 67 | Medium |
| 4 | Verse numbers — leading "1.", "2)" prefixes on properly split verses | 35 | Medium |
| 5 | Tabs and whitespace — literal \t characters and multi-space runs | ~20 | Low |
| 6 | Soft hyphens and zero-width spaces — invisible Unicode artifacts from web scraping | ~15 | Low |
| 7 | Missing punctuation — no commas, periods, or terminal marks | ~120 | High |

The key insight: issues 1–6 had to be fixed before issue 7. You cannot ask a model to punctuate text that has all its verses crammed into one string or contains [Refrain] markers between stanzas. And fixing issue 1 (verse splitting) would unlock songs that were previously unusable as training data.

Building the Cleaning Pipeline

Architecture

The pipeline is a standalone Python project with a single core abstraction — the cleaner function:

def clean(song: Song) -> Song

Every cleaner follows two contracts:

  1. Immutable. A cleaner never mutates the input Song. It returns either the original object (no changes needed) or a new Song via song.copy(). This makes the identity check cleaned is not song a reliable signal for whether anything changed.

  2. Idempotent. Running the same cleaner twice on the same input produces identical output. The second pass returns the original object because there is nothing left to fix.
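The two contracts can be demonstrated with a toy cleaner. This is a simplified stand-in — the real pipeline uses a mutable Song with song.copy() rather than a frozen dataclass — but the identity-check logic is the same:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Song:
    # Simplified stand-in; the real model also carries lyrics, changes, flags, etc.
    verses: tuple[str, ...]

def strip_tabs(song: Song) -> Song:
    """Replace literal tabs with spaces; return the ORIGINAL object if nothing changed."""
    cleaned = tuple(v.replace("\t", " ") for v in song.verses)
    if cleaned == song.verses:
        return song                       # identity signals "no change"
    return replace(song, verses=cleaned)  # fresh object; the input is never mutated

dirty = Song(("Amazing\tgrace",))
once = strip_tabs(dirty)
assert once is not dirty            # immutable: a change produced a new object
assert strip_tabs(once) is once     # idempotent: the second pass has nothing to fix
```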

The Song model carries its own audit trail:

@dataclass
class Song:
    id: str
    title: str
    slug: str
    lyrics: LyricsData
    changes: list[ChangeRecord]  # what changed and why
    flags: list[Flag]            # issues that need human review
    extra: dict                  # passthrough CSV columns

The Seven Cleaners

Cleaners run in a fixed order — split_verses must run before strip_numbers so that newly created verses get their leading numbers stripped.
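Why the order matters can be sketched with hypothetical minimal versions of the two cleaners (function names and regexes here are illustrative, not the pipeline's exact code):

```python
import re

def split_verses(blob: str) -> list[str]:
    # Split on a newline followed by a verse-number marker like "2. "
    parts = re.split(r"\n(?=\d+\.\s)", blob)
    return [p.strip() for p in parts if p.strip()]

def strip_numbers(verses: list[str]) -> list[str]:
    # Drop a leading "1. " or "2) " prefix from each verse
    return [re.sub(r"^\d+[.)]\s*", "", v) for v in verses]

blob = "1. Amazing grace how sweet the sound\n2. Twas grace that taught my heart"
verses = strip_numbers(split_verses(blob))
# Splitting first leaves each new verse carrying its "1." prefix, which
# strip_numbers then removes; reversed, "2." would stay embedded mid-string.
```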

1. split_verses — Detects songs where all verses are crammed into one string. Splits on verse number patterns (\n followed by \d+\.\s) or falls back to double-newline boundaries.

2. strip_numbers — Removes leading 1. , 2) prefixes from verses. Includes a sanity check: it only strips if the remaining text starts with an uppercase letter or quote mark.

3. strip_tabs — Replaces literal tabs with spaces and collapses multi-space runs. Simple but necessary — tabs cause screen readers to announce “tab” as a word.

4. fix_special — Removes soft hyphens (\u00ad) and zero-width spaces (\u200b) left over from web scraping. Also flags songs whose last verse contains embedded metadata for manual review.

5. fix_refrains — Strips [Refrain] and [Chorus] markers from verse text — both standalone lines and inline markers like , [Refrain] at line endings. Flags songs for manual review when markers existed but no chorus field is set.

6. detect_truncated — Detection only, no modification. Flags songs with very short verses, low total word counts, or known broken titles. These get routed to the manual_review.json report.

7. add_punctuation — The LLM-based cleaner. Songs below 1% punctuation density are sent to the model with a strict system prompt. The response is validated at two levels: word-level comparison (case-insensitive, ignoring refrain markers) and structural comparison (verse count must match). Any deviation causes the edit to be rejected and flagged.

# Validate: words must be identical (case-insensitive, ignore refrain markers)
old_words = [w.lower() for w in _extract_words(_lyrics_text(song.lyrics))
             if w.lower() not in ("refrain", "chorus")]
new_words = [w.lower() for w in _extract_words(_lyrics_text(new_lyrics))
             if w.lower() not in ("refrain", "chorus")]

if old_words != new_words:
    result.flags.append(Flag(
        type="llm_word_change",
        severity="error",
        description="LLM changed words, not just punctuation — rejecting edit",
    ))
    return result

Deterministic Cleaning Results

| Cleaner | Songs Modified |
|---------|----------------|
| strip_tabs | 64 |
| split_verses | 30 |
| strip_numbers | 4 |
| fix_refrains | 2 |
| fix_special | 1 |

In total, 109 songs were modified, backed by 34 pytest tests covering every cleaner plus edge cases.

Generating Training Data

The cleaning pipeline handled the deterministic issues. But ~67 songs still needed punctuation, and I did not want to pay for Claude API calls in production or require an API key to run the pipeline. I needed a local model.

The Synthetic Pairs Approach

The idea: take the ~170 songs that already have good punctuation, strip the punctuation programmatically, and use the original/stripped pairs as training data. The model learns to restore what was removed.

STRIP_CHARS = set('.,;:!?')         # Characters to strip
PRESERVE_CHARS = set("'\"—-()")     # Characters to keep (word-internal)

The distinction between STRIP_CHARS and PRESERVE_CHARS is critical. Apostrophes in contractions (heav'n, o'er, 'Tis) and em dashes are part of the hymn’s text — stripping them would change words. Only the pacing-relevant characters get removed.
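The stripping step itself is a one-liner over the character sets above (the function name is illustrative):

```python
STRIP_CHARS = set('.,;:!?')         # pacing punctuation to remove
PRESERVE_CHARS = set("'\"—-()")     # word-internal marks to keep

def strip_punct(text: str) -> str:
    """Drop pacing punctuation; everything else, including PRESERVE_CHARS, passes through."""
    return "".join(c for c in text if c not in STRIP_CHARS)

print(strip_punct("'Tis grace hath brought me safe thus far,"))
# 'Tis grace hath brought me safe thus far   (apostrophe intact, comma gone)
```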

Data Augmentation

173 base pairs wasn’t enough for a 3.8B model to learn the “don’t change words” constraint. I added two augmentation strategies:

Partial-strip variants — For each training song, generate additional pairs by stripping only subsets of punctuation:

AUGMENT_SUBSETS = [
    set('.,'),      # commas and periods only
    set(',;'),      # commas and semicolons only
    set('.!?'),     # terminal punctuation only
    set(':;'),      # colons and semicolons only
]
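Wiring the subsets into pair generation might look like this — `make_pairs` and `strip_chars` are illustrative names, not the script's actual functions:

```python
AUGMENT_SUBSETS = [set('.,'), set(',;'), set('.!?'), set(':;')]
FULL_SET = set('.,;:!?')

def strip_chars(text: str, chars: set) -> str:
    return "".join(c for c in text if c not in chars)

def make_pairs(original: str) -> list[dict]:
    """One fully stripped pair, plus a partially stripped variant per subset."""
    pairs = [{"original": original, "stripped": strip_chars(original, FULL_SET)}]
    for subset in AUGMENT_SUBSETS:
        stripped = strip_chars(original, subset)
        if stripped != original:                 # skip subsets that change nothing
            pairs.append({"original": original, "stripped": stripped})
    return pairs

verse = "I once was lost, but now am found; was blind, but now I see."
assert len(make_pairs(verse)) == 5   # full strip + all four subsets apply
```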

Identity examples — Pairs where input equals output (already punctuated). These teach the model “if it’s already correct, don’t change anything”:

identity_pairs.append({
    'original': p['original'],
    'stripped': p['original'],  # input = output (no stripping)
})

Training Data Cleanup

Refrain markers in the training data were confusing the model — it would drop “Refrain” from verses or add refrain text that wasn’t there. The fix: strip all refrain/chorus markers from training data before generating pairs, including inline markers like , [Refrain] at line endings and instruction lines like “Refrain (may be sung after final stanza only).”

Final Training Data

| Metric | Value |
|--------|-------|
| Base pairs from cleaned songs | 173 |
| Training pairs (after augmentation) | 725 |
| Identity examples | 108 |
| Total training examples | 833 |
| Eval pairs (unaugmented) | 26 |

The 30 additional base pairs (vs. the original 143) came from verse splitting — songs that were single-blob strings became properly structured multi-verse lyrics that passed the quality filters. The cleaning pipeline paid for itself.

Fine-Tuning with QLoRA

Model Choice: Phi-3.5-mini-instruct

I chose Microsoft’s Phi-3.5-mini-instruct (3.8B parameters) because it’s small enough for free Colab T4 GPUs with 4-bit quantization (~4GB VRAM), has strong instruction following for its size, and is fast enough to run locally via Ollama without a dedicated GPU.
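The training notebook itself is not reproduced here, but a typical QLoRA setup consistent with the settings mentioned in this post (4-bit NF4 quantization, LoRA rank 16, dropout 0.05 from v3 onward) would look roughly like this — the target modules and lora_alpha are assumptions, not the notebook's confirmed values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the 3.8B model around ~4GB on a T4
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,   # T4 has no bfloat16 support
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16,                  # LoRA rank stated in the post
    lora_alpha=32,         # assumption: a common 2x-rank choice
    lora_dropout=0.05,     # the mild dropout added in v3
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],  # Phi-3 projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
```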

Training Evolution

Getting the training right took four iterations:

v1 — Baseline (147 examples, 3 epochs, lr 2e-4) The model learned JSON output perfectly (100% valid) but only preserved words in 42% of eval songs. With such a small dataset, the model’s strong language priors overwhelmed the “don’t change words” instruction.

v2 — More data, more epochs (725 examples, 5 epochs, lr 1e-4) Added data augmentation (partial punctuation stripping) to 5x the training set. But validation loss told the story: it was 0.78 at epoch 1 and climbed to 1.39 by epoch 5. Classic overfitting — the model memorized the training data.

| Step | Train Loss | Val Loss |
|------|-----------|----------|
| 91 | 0.732 | 0.785 |
| 182 | 0.471 | 0.884 |
| 273 | 0.242 | 1.205 |
| 364 | 0.107 | 1.388 |

v3 — Fixed overfitting (725 examples, 2 epochs, dropout 0.05) Reduced to 2 epochs, added mild LoRA dropout, and enabled load_best_model_at_end to auto-select the checkpoint with lowest validation loss. Val loss stayed healthy: 0.82 → 0.78, with train/val gap never exceeding 0.15.

v4 — Clean data + identity examples (833 examples, 2 epochs) Stripped all refrain markers from training data. Added 108 identity examples (input = output). Same hyperparameters as v3.

SFTConfig(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,    # effective batch size = 8
    warmup_steps = 10,
    max_steps = total_steps,            # 2 epochs
    learning_rate = 1e-4,
    lr_scheduler_type = "cosine",
    load_best_model_at_end = True,
    metric_for_best_model = "eval_loss",
)

Total training cost across all four iterations: $0 (free Colab T4 GPU).

Evaluation & Results

The eval loop checks 26 held-out songs against three criteria: valid JSON, word preservation (case-insensitive, ignoring refrain markers), and exact punctuation match.

Honest Accounting

Not all “failures” in the eval are model failures. Of the 26 eval songs, some contain data quality issues (truncated verses, broken input) that no model could fix correctly. Counting those against the model would misrepresent its actual performance.

Raw eval results (v4):

| Metric | Result |
|--------|--------|
| Valid JSON output | 100% (26/26) |
| Words preserved (strict) | 62% (16/26) |
| Exact punctuation match | 0% (0/26) |

After excluding data quality issues and normalizing for acceptable variations:

Of the 10 “failures,” the breakdown is:

| Category | Count | Examples |
|----------|-------|----------|
| Archaic contraction expansion | 3 | th'angelic → the angelic, ev'ry → every, vict'ry → victory |
| Word added/dropped | 3 | Added "O" before "Thou", dropped "and" |
| Content truncated/reordered | 2 | Dropped last 4 words, reordered verse content |
| Minor word substitution | 2 | upon → on, lest → least |

The archaic contraction expansions are debatable — every for ev'ry is a reasonable modernization, not a hallucination. If we count those as acceptable, effective word preservation is ~73% (19/26).

Production Pipeline Results

When the v4 model ran against all 260 songs through the full pipeline:

| Metric | Value |
|--------|-------|
| Songs needing punctuation | 67 |
| Ollama accepted (words preserved) | 11 |
| Ollama rejected by safety net | 48 |
| Skipped (empty/broken input data) | 8 |

The 19% acceptance rate on production data (vs. 62–73% on the eval set) reflects the difficulty gap — the eval set contains songs that already had good punctuation (stripped synthetically), while the production set contains songs that never had punctuation at all. These are harder because they’re often the most unusual hymns in the corpus.

The safety net caught every bad output. No incorrect lyrics were written to the database.

Integration & Deployment

Local Inference with Ollama

The fine-tuned model exports as safetensors, converts to GGUF format via llama.cpp, and loads into Ollama with a one-command setup. Inference runs locally on Apple Silicon (M3 Pro, 18GB) — no API key, no cost, no network latency.

Pipeline Fallback Chain

The add_punctuation cleaner supports multiple backends with automatic fallback:

def _call_llm(lyrics: LyricsData) -> tuple[LyricsData | None, str]:
    result = _call_ollama(lyrics)      # 1. Local fine-tuned model (free)
    if result is not None:
        return result, "ollama"

    result = _call_claude(lyrics)       # 2. Claude API (paid fallback)
    if result is not None:
        return result, "claude"

    return None, "none"                 # 3. Flag-only mode

The validation layer makes this swap safe. If any backend produces output that changes words or breaks JSON structure, the cleaner rejects it and flags for manual review. The worst case is not wrong punctuation — it is no punctuation (status quo).

Data Flow

Supabase CSV Export
  ↓
Cleaning Pipeline (7 ordered cleaners)
  ↓
Cleaned CSV + Reports
  ↓
prep_finetune_data.py (quality filters + augmentation)
  ↓
833 training examples + 26 eval examples
  ↓
Google Colab Notebook (QLoRA + Phi-3.5-mini, free T4 GPU)
  ↓
GGUF Model → Ollama (local, free inference)
  ↓
Pipeline add_punctuation cleaner (Ollama → Claude API fallback)
  ↓
Fixed lyrics → Supabase

What I Learned

What Worked

The immutable cleaner contract was the best architectural decision. Every cleaner returns either the original Song object or a fresh copy. This makes identity checks trivially correct and debugging easy — you can inspect the changes list on any song and see exactly what happened and why.

The cleaning pipeline increased training data by 21%. By splitting crammed verses before generating training pairs, 30 songs that were previously disqualified became properly structured multi-verse lyrics that passed quality filters.

Synthetic training data from real data works. Rather than manually labeling songs with “correct” punctuation, I used the songs that already had it as ground truth and synthetically created the unpunctuated inputs. The ground truth is real (not LLM-generated), and the validation is mechanical (word lists must match).

Word-level validation as a universal safety net. The same extract-words-and-compare pattern appears in training data generation, inference validation, and the evaluation loop. It catches every possible failure mode without knowing anything about the model’s internals.

Overfitting is visible if you look. The jump from v2 (val loss climbing every epoch) to v3 (2 epochs, load_best_model_at_end) was a dramatic improvement. Watching train/val loss diverge in real-time made the problem obvious and the fix obvious.

What Surprised Me

Data quality issues masquerading as model failures. Early eval results showed 42% word preservation. Alarming. But detailed error analysis revealed that many “failures” were the model correctly handling bad input — dropping refrain markers that shouldn’t have been there, expanding archaic contractions. The model was smarter than the eval gave it credit for.

Identity examples matter. Adding 108 examples where input = output (already punctuated, output unchanged) was a small change that reinforced the most important constraint: don’t change what’s already correct.

Free GPU quotas run out fast. After three Colab training runs, Google throttled my GPU access. The workaround: logging into a different Google account. Not elegant, but effective for $0.

What I Would Do Differently

Start with the data audit. I initially went straight to the punctuation problem and tried to fix it with Claude API calls. Only after burning through API credits did I step back and realize the upstream data issues were both fixable and necessary to fix first.

A 3.8B model may be too small for this task. The model frequently changes words it shouldn’t — expanding contractions, substituting synonyms, adding interjections. A 7B model (Mistral, Llama) would likely have better instruction following with the same training data. The tradeoff is slower local inference and potentially needing more VRAM.

Key Numbers

| Metric | Value |
|--------|-------|
| Total songs in database | 260 |
| Songs cleaned by deterministic rules | 109 |
| Training examples (with augmentation) | 833 |
| Eval set | 26 songs |
| Valid JSON output | 100% |
| Word preservation (eval, normalized) | ~73% |
| Production acceptance rate | 19% (11/59 viable songs) |
| Training cost | $0 |
| Training iterations | 4 |
| Base model | Phi-3.5-mini-instruct (3.8B) |
| Fine-tuning method | QLoRA (4-bit, LoRA rank 16) |
| Pytest tests | 34 |
| Pipeline cleaners | 7 |

Built for hymnal.kolbeypruitt.com. Source code at github.com/kolbeypruitt/hymnal-toolkit.
