
Fine-Tuning a 3.8B Model for $0 to Fix Hymn Punctuation

12 min read · Machine Learning · Accessibility · Python · Data Pipeline

How I built a data cleaning pipeline, generated synthetic training data, and fine-tuned Phi-3.5-mini with QLoRA on a free Colab GPU — all to make hymn lyrics accessible to screen readers.

[Image: Hymnal digital songbook platform]

When I built hymnal.kolbeypruitt.com, I scraped 260 hymns from hymnary.org and gccsatx.com. The lyrics looked fine on screen. Then I turned on VoiceOver.

What I heard was a wall of run-on speech. “Amazing grace how sweet the sound that saved a wretch like me I once was lost but now am found was blind but now I see.” No pauses. No phrasing. No breath marks. The screen reader had no idea where one thought ended and the next began, because the lyrics had no punctuation.

This is the story of how I built a deterministic cleaning pipeline, generated synthetic training data, and fine-tuned a 3.8B parameter model for $0 to fix it.

The Problem

Screen readers rely on punctuation for speech prosody. A comma triggers a short pause. A period triggers a longer pause and a pitch drop. A question mark raises the pitch. Without these cues, hymn lyrics become an undifferentiated stream of words — unusable for anyone navigating by ear.

The lyrics in the database are stored as JSON:

{
  "verses": [
    "Amazing grace how sweet the sound\nThat saved a wretch like me",
    "Twas grace that taught my heart to fear\nAnd grace my fears relieved"
  ],
  "chorus": "Optional chorus text here"
}

When I audited the 260 songs, roughly 120 had zero or near-zero punctuation. But punctuation was only one of seven quality issues I found. Fixing the others turned out to be a prerequisite — and the pipeline I built to do it ended up generating 21% more training data than I would have had otherwise.
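What counts as "near-zero punctuation" can be made concrete with a density check — a minimal sketch of the kind of measurement the audit (and later the add_punctuation cleaner, with its 1% threshold) relies on; `punctuation_density` is an illustrative name, not the pipeline's actual function:

```python
# Pacing punctuation that screen readers turn into pauses and pitch changes
PUNCT = set(".,;:!?")

def punctuation_density(text: str) -> float:
    """Fraction of non-whitespace characters that are pacing punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in PUNCT for c in chars) / len(chars)

# Below the 1% threshold: this verse needs punctuation restored
print(punctuation_density("Amazing grace how sweet the sound") < 0.01)  # True
```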

Data Discovery

I wrote a series of inspection scripts to categorize every quality problem across the 260 songs. Seven distinct issue categories emerged:

| # | Issue | Songs Affected | Severity |
|---|-------|----------------|----------|
| 1 | Crammed verses — all verses concatenated into a single string with embedded "1.", "2." markers | 53 | High |
| 2 | Truncated lyrics — incomplete scrapes, sometimes just a few words | 3+ | High |
| 3 | Stray markers — [Refrain], [Chorus] text left in verses | 67 | Medium |
| 4 | Verse numbers — leading "1.", "2)" prefixes on properly split verses | 35 | Medium |
| 5 | Tabs and whitespace — literal \t characters and multi-space runs | ~20 | Low |
| 6 | Soft hyphens and zero-width spaces — invisible Unicode artifacts from web scraping | ~15 | Low |
| 7 | Missing punctuation — no commas, periods, or terminal marks | ~120 | High |

The key insight: issues 1–6 had to be fixed before issue 7. You cannot ask a model to punctuate text that has all its verses crammed into one string or contains [Refrain] markers between stanzas. And fixing issue 1 (verse splitting) would unlock songs that were previously unusable as training data.

Building the Cleaning Pipeline

Architecture

The pipeline is a standalone Python project with a single core abstraction — the cleaner function:

def clean(song: Song) -> Song

Every cleaner follows two contracts:

  1. Immutable. A cleaner never mutates the input Song. It returns either the original object (no changes needed) or a new Song via song.copy(). This makes the identity check cleaned is not song a reliable signal for whether anything changed.

  2. Idempotent. Running the same cleaner twice on the same input produces identical output. The second pass returns the original object because there is nothing left to fix.
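The two contracts can be demonstrated with a toy cleaner. This is a simplified stand-in — the real pipeline uses a mutable Song with song.copy() rather than a frozen dataclass — but the identity-check logic is the same:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Song:
    # Simplified stand-in; the real model also carries lyrics, changes, flags, etc.
    verses: tuple[str, ...]

def strip_tabs(song: Song) -> Song:
    """Replace literal tabs with spaces; return the ORIGINAL object if nothing changed."""
    cleaned = tuple(v.replace("\t", " ") for v in song.verses)
    if cleaned == song.verses:
        return song                       # identity signals "no change"
    return replace(song, verses=cleaned)  # fresh object; the input is never mutated

dirty = Song(("Amazing\tgrace",))
once = strip_tabs(dirty)
assert once is not dirty            # immutable: a change produced a new object
assert strip_tabs(once) is once     # idempotent: the second pass has nothing to fix
```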

The Song model carries its own audit trail:

@dataclass
class Song:
    id: str
    title: str
    slug: str
    lyrics: LyricsData
    changes: list[ChangeRecord]  # what changed and why
    flags: list[Flag]            # issues that need human review
    extra: dict                  # passthrough CSV columns

The Seven Cleaners

Cleaners run in a fixed order — split_verses must run before strip_numbers so that newly created verses get their leading numbers stripped.
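Why the order matters can be sketched with hypothetical minimal versions of the two cleaners (function names and regexes here are illustrative, not the pipeline's exact code):

```python
import re

def split_verses(blob: str) -> list[str]:
    # Split on a newline followed by a verse-number marker like "2. "
    parts = re.split(r"\n(?=\d+\.\s)", blob)
    return [p.strip() for p in parts if p.strip()]

def strip_numbers(verses: list[str]) -> list[str]:
    # Drop a leading "1. " or "2) " prefix from each verse
    return [re.sub(r"^\d+[.)]\s*", "", v) for v in verses]

blob = "1. Amazing grace how sweet the sound\n2. Twas grace that taught my heart"
verses = strip_numbers(split_verses(blob))
# Splitting first leaves each new verse carrying its "1." prefix, which
# strip_numbers then removes; reversed, "2." would stay embedded mid-string.
```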

1. split_verses — Detects songs where all verses are crammed into one string. Splits on verse number patterns (\n followed by \d+\.\s) or falls back to double-newline boundaries.

2. strip_numbers — Removes leading 1. , 2) prefixes from verses. Includes a sanity check: it only strips if the remaining text starts with an uppercase letter or quote mark.

3. strip_tabs — Replaces literal tabs with spaces and collapses multi-space runs. Simple but necessary — tabs cause screen readers to announce “tab” as a word.

4. fix_special — Removes soft hyphens (\u00ad) and zero-width spaces (\u200b) left over from web scraping. Also flags songs whose last verse contains embedded metadata for manual review.

5. fix_refrains — Strips [Refrain] and [Chorus] markers from verse text — both standalone lines and inline markers like , [Refrain] at line endings. Flags songs for manual review when markers existed but no chorus field is set.

6. detect_truncated — Detection only, no modification. Flags songs with very short verses, low total word counts, or known broken titles. These get routed to the manual_review.json report.

7. add_punctuation — The LLM-based cleaner. Songs below 1% punctuation density are sent to the model with a strict system prompt. The response is validated at two levels: word-level comparison (case-insensitive, ignoring refrain markers) and structural comparison (verse count must match). Any deviation causes the edit to be rejected and flagged.

# Validate: words must be identical (case-insensitive, ignore refrain markers)
old_words = [w.lower() for w in _extract_words(_lyrics_text(song.lyrics))
             if w.lower() not in ("refrain", "chorus")]
new_words = [w.lower() for w in _extract_words(_lyrics_text(new_lyrics))
             if w.lower() not in ("refrain", "chorus")]

if old_words != new_words:
    result.flags.append(Flag(
        type="llm_word_change",
        severity="error",
        description="LLM changed words, not just punctuation — rejecting edit",
    ))
    return result

Deterministic Cleaning Results

| Cleaner | Songs Modified |
|---------|----------------|
| strip_tabs | 64 |
| split_verses | 30 |
| strip_numbers | 4 |
| fix_refrains | 2 |
| fix_special | 1 |

In total, 109 songs were modified, backed by 34 pytest tests covering every cleaner plus edge cases.

Generating Training Data

The cleaning pipeline handled the deterministic issues. But ~67 songs still needed punctuation, and I did not want to pay for Claude API calls in production or require an API key to run the pipeline. I needed a local model.

The Synthetic Pairs Approach

The idea: take the ~170 songs that already have good punctuation, strip the punctuation programmatically, and use the original/stripped pairs as training data. The model learns to restore what was removed.

STRIP_CHARS = set('.,;:!?')         # Characters to strip
PRESERVE_CHARS = set("'\"—-()")     # Characters to keep (word-internal)

The distinction between STRIP_CHARS and PRESERVE_CHARS is critical. Apostrophes in contractions (heav'n, o'er, 'Tis) and em dashes are part of the hymn’s text — stripping them would change words. Only the pacing-relevant characters get removed.
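The stripping step itself is a one-liner over the character sets above (the function name is illustrative):

```python
STRIP_CHARS = set('.,;:!?')         # pacing punctuation to remove
PRESERVE_CHARS = set("'\"—-()")     # word-internal marks to keep

def strip_punct(text: str) -> str:
    """Drop pacing punctuation; everything else, including PRESERVE_CHARS, passes through."""
    return "".join(c for c in text if c not in STRIP_CHARS)

print(strip_punct("'Tis grace hath brought me safe thus far,"))
# 'Tis grace hath brought me safe thus far   (apostrophe intact, comma gone)
```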

Data Augmentation

173 base pairs wasn’t enough for a 3.8B model to learn the “don’t change words” constraint. I added two augmentation strategies:

Partial-strip variants — For each training song, generate additional pairs by stripping only subsets of punctuation:

AUGMENT_SUBSETS = [
    set('.,'),      # commas and periods only
    set(',;'),      # commas and semicolons only
    set('.!?'),     # terminal punctuation only
    set(':;'),      # colons and semicolons only
]
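Wiring the subsets into pair generation might look like this — `make_pairs` and `strip_chars` are illustrative names, not the script's actual functions:

```python
AUGMENT_SUBSETS = [set('.,'), set(',;'), set('.!?'), set(':;')]
FULL_SET = set('.,;:!?')

def strip_chars(text: str, chars: set) -> str:
    return "".join(c for c in text if c not in chars)

def make_pairs(original: str) -> list[dict]:
    """One fully stripped pair, plus a partially stripped variant per subset."""
    pairs = [{"original": original, "stripped": strip_chars(original, FULL_SET)}]
    for subset in AUGMENT_SUBSETS:
        stripped = strip_chars(original, subset)
        if stripped != original:                 # skip subsets that change nothing
            pairs.append({"original": original, "stripped": stripped})
    return pairs

verse = "I once was lost, but now am found; was blind, but now I see."
assert len(make_pairs(verse)) == 5   # full strip + all four subsets apply
```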

Identity examples — Pairs where input equals output (already punctuated). These teach the model “if it’s already correct, don’t change anything”:

identity_pairs.append({
    'original': p['original'],
    'stripped': p['original'],  # input = output (no stripping)
})

Training Data Cleanup

Refrain markers in the training data were confusing the model — it would drop “Refrain” from verses or add refrain text that wasn’t there. The fix: strip all refrain/chorus markers from training data before generating pairs, including inline markers like , [Refrain] at line endings and instruction lines like “Refrain (may be sung after final stanza only).”

Final Training Data

| Metric | Value |
|--------|-------|
| Base pairs from cleaned songs | 173 |
| Training pairs (after augmentation) | 725 |
| Identity examples | 108 |
| Total training examples | 833 |
| Eval pairs (unaugmented) | 26 |

The 30 additional base pairs (vs. the original 143) came from verse splitting — songs that were single-blob strings became properly structured multi-verse lyrics that passed the quality filters. The cleaning pipeline paid for itself.

Fine-Tuning with QLoRA

Model Choice: Phi-3.5-mini-instruct

I chose Microsoft’s Phi-3.5-mini-instruct (3.8B parameters) because it’s small enough for free Colab T4 GPUs with 4-bit quantization (~4GB VRAM), has strong instruction following for its size, and is fast enough to run locally via Ollama without a dedicated GPU.
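The training notebook itself is not reproduced here, but a typical QLoRA setup consistent with the settings mentioned in this post (4-bit NF4 quantization, LoRA rank 16, dropout 0.05 from v3 onward) would look roughly like this — the target modules and lora_alpha are assumptions, not the notebook's confirmed values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the 3.8B model around ~4GB on a T4
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,   # T4 has no bfloat16 support
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16,                  # LoRA rank stated in the post
    lora_alpha=32,         # assumption: a common 2x-rank choice
    lora_dropout=0.05,     # the mild dropout added in v3
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],  # Phi-3 projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
```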

Training Evolution

Getting the training right took four iterations:

v1 — Baseline (147 examples, 3 epochs, lr 2e-4) The model learned JSON output perfectly (100% valid) but only preserved words in 42% of eval songs. With such a small dataset, the model’s strong language priors overwhelmed the “don’t change words” instruction.

v2 — More data, more epochs (725 examples, 5 epochs, lr 1e-4) Added data augmentation (partial punctuation stripping) to 5x the training set. But validation loss told the story: it was 0.78 at epoch 1 and climbed to 1.39 by epoch 5. Classic overfitting — the model memorized the training data.

| Step | Train Loss | Val Loss |
|------|-----------|----------|
| 91 | 0.732 | 0.785 |
| 182 | 0.471 | 0.884 |
| 273 | 0.242 | 1.205 |
| 364 | 0.107 | 1.388 |

v3 — Fixed overfitting (725 examples, 2 epochs, dropout 0.05) Reduced to 2 epochs, added mild LoRA dropout, and enabled load_best_model_at_end to auto-select the checkpoint with lowest validation loss. Val loss stayed healthy: 0.82 → 0.78, with train/val gap never exceeding 0.15.

v4 — Clean data + identity examples (833 examples, 2 epochs) Stripped all refrain markers from training data. Added 108 identity examples (input = output). Same hyperparameters as v3.

SFTConfig(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,    # effective batch size = 8
    warmup_steps = 10,
    max_steps = total_steps,            # 2 epochs
    learning_rate = 1e-4,
    lr_scheduler_type = "cosine",
    load_best_model_at_end = True,
    metric_for_best_model = "eval_loss",
)

Total training cost across all four iterations: $0 (free Colab T4 GPU).

Evaluation & Results

The eval loop checks 26 held-out songs against three criteria: valid JSON, word preservation (case-insensitive, ignoring refrain markers), and exact punctuation match.

Honest Accounting

Not all “failures” in the eval are model failures. Of the 26 eval songs, some contain data quality issues (truncated verses, broken input) that no model could fix correctly. Counting those against the model would misrepresent its actual performance.

Raw eval results (v4):

| Metric | Result |
|--------|--------|
| Valid JSON output | 100% (26/26) |
| Words preserved (strict) | 62% (16/26) |
| Exact punctuation match | 0% (0/26) |

After excluding data quality issues and normalizing for acceptable variations:

Of the 10 “failures,” the breakdown is:

| Category | Count | Examples |
|----------|-------|----------|
| Archaic contraction expansion | 3 | th'angelic → the angelic, ev'ry → every, vict'ry → victory |
| Word added/dropped | 3 | Added "O" before "Thou", dropped "and" |
| Content truncated/reordered | 2 | Dropped last 4 words, reordered verse content |
| Minor word substitution | 2 | upon → on, lest → least |

The archaic contraction expansions are debatable — every for ev'ry is a reasonable modernization, not a hallucination. If we count those as acceptable, effective word preservation is ~73% (19/26).

Production Pipeline Results

When the v4 model ran against all 260 songs through the full pipeline:

| Metric | Value |
|--------|-------|
| Songs needing punctuation | 67 |
| Ollama accepted (words preserved) | 11 |
| Ollama rejected by safety net | 48 |
| Skipped (empty/broken input data) | 8 |

The 19% acceptance rate on production data (vs. 62–73% on the eval set) reflects the difficulty gap — the eval set contains songs that already had good punctuation (stripped synthetically), while the production set contains songs that never had punctuation at all. These are harder because they’re often the most unusual hymns in the corpus.

The safety net caught every bad output. No incorrect lyrics were written to the database.

Integration & Deployment

Local Inference with Ollama

The fine-tuned model exports as safetensors, converts to GGUF format via llama.cpp, and loads into Ollama with a one-command setup. Inference runs locally on Apple Silicon (M3 Pro, 18GB) — no API key, no cost, no network latency.

Pipeline Fallback Chain

The add_punctuation cleaner supports multiple backends with automatic fallback:

def _call_llm(lyrics: LyricsData) -> tuple[LyricsData | None, str]:
    result = _call_ollama(lyrics)      # 1. Local fine-tuned model (free)
    if result is not None:
        return result, "ollama"

    result = _call_claude(lyrics)       # 2. Claude API (paid fallback)
    if result is not None:
        return result, "claude"

    return None, "none"                 # 3. Flag-only mode

The validation layer makes this swap safe. If any backend produces output that changes words or breaks JSON structure, the cleaner rejects it and flags for manual review. The worst case is not wrong punctuation — it is no punctuation (status quo).

Data Flow

Supabase CSV Export
  ↓
Cleaning Pipeline (7 ordered cleaners)
  ↓
Cleaned CSV + Reports
  ↓
prep_finetune_data.py (quality filters + augmentation)
  ↓
833 training examples + 26 eval examples
  ↓
Google Colab Notebook (QLoRA + Phi-3.5-mini, free T4 GPU)
  ↓
GGUF Model → Ollama (local, free inference)
  ↓
Pipeline add_punctuation cleaner (Ollama → Claude API fallback)
  ↓
Fixed lyrics → Supabase

What I Learned

What Worked

The immutable cleaner contract was the best architectural decision. Every cleaner returns either the original Song object or a fresh copy. This makes identity checks trivially correct and debugging easy — you can inspect the changes list on any song and see exactly what happened and why.

The cleaning pipeline increased training data by 21%. By splitting crammed verses before generating training pairs, 30 songs that were previously disqualified became properly structured multi-verse lyrics that passed quality filters.

Synthetic training data from real data works. Rather than manually labeling songs with “correct” punctuation, I used the songs that already had it as ground truth and synthetically created the unpunctuated inputs. The ground truth is real (not LLM-generated), and the validation is mechanical (word lists must match).

Word-level validation as a universal safety net. The same extract-words-and-compare pattern appears in training data generation, inference validation, and the evaluation loop. It catches every possible failure mode without knowing anything about the model’s internals.

Overfitting is visible if you look. The jump from v2 (val loss climbing every epoch) to v3 (2 epochs, load_best_model_at_end) was a dramatic improvement. Watching train/val loss diverge in real-time made the problem obvious and the fix obvious.

What Surprised Me

Data quality issues masquerading as model failures. Early eval results showed 42% word preservation. Alarming. But detailed error analysis revealed that many “failures” were the model correctly handling bad input — dropping refrain markers that shouldn’t have been there, expanding archaic contractions. The model was smarter than the eval gave it credit for.

Identity examples matter. Adding 108 examples where input = output (already punctuated, output unchanged) was a small change that reinforced the most important constraint: don’t change what’s already correct.

Free GPU quotas run out fast. After three Colab training runs, Google throttled my GPU access. The workaround: logging into a different Google account. Not elegant, but effective for $0.

What I Would Do Differently

Start with the data audit. I initially went straight to the punctuation problem and tried to fix it with Claude API calls. Only after burning through API credits did I step back and realize the upstream data issues were both fixable and necessary to fix first.

A 3.8B model may be too small for this task. The model frequently changes words it shouldn’t — expanding contractions, substituting synonyms, adding interjections. A 7B model (Mistral, Llama) would likely have better instruction following with the same training data. The tradeoff is slower local inference and potentially needing more VRAM.

Key Numbers

| Metric | Value |
|--------|-------|
| Total songs in database | 260 |
| Songs cleaned by deterministic rules | 109 |
| Training examples (with augmentation) | 833 |
| Eval set | 26 songs |
| Valid JSON output | 100% |
| Word preservation (eval, normalized) | ~73% |
| Production acceptance rate | 19% (11/59 viable songs) |
| Training cost | $0 |
| Training iterations | 4 |
| Base model | Phi-3.5-mini-instruct (3.8B) |
| Fine-tuning method | QLoRA (4-bit, LoRA rank 16) |
| Pytest tests | 34 |
| Pipeline cleaners | 7 |

Built for hymnal.kolbeypruitt.com. Source code at github.com/kolbeypruitt/hymnal-toolkit.
