Hymnal is a digital songbook platform I built for churches. One of the biggest data quality challenges was duplicate detection. When thousands of songs are entered by hundreds of different people, you end up with duplicates — but not the kind that a simple DISTINCT query can catch.
Consider these entries:
- “Amazing Grace” and “Amazing Grace (How Sweet the Sound)”
- “How Great Thou Art” and “How Great Thou Art!”
- “10,000 Reasons” and “Ten Thousand Reasons (Bless the Lord)”
- “It Is Well” and “It Is Well with My Soul”
Each pair names the same song, but no combination of string normalization, Levenshtein distance, or trigram matching will reliably catch all of them without drowning you in false positives.
Why Traditional Approaches Fall Short
I started with the usual toolkit:
- **Exact matching after normalization** — lowercase, strip punctuation, collapse whitespace. This catches maybe 20% of duplicates.
- **Levenshtein distance** — works for typos but fails on alternate titles. “10,000 Reasons” and “Ten Thousand Reasons” have a huge edit distance despite being the same song.
- **Trigram similarity (PostgreSQL’s pg_trgm)** — better than Levenshtein for partial matches, but the similarity threshold is very hard to tune. Too low and you get false positives. Too high and you miss real duplicates.
- **Soundex / Metaphone** — phonetic algorithms help with spelling variations but can’t handle fundamentally different titles for the same song.
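To make the failure mode concrete, here is a minimal sketch of the first two techniques (`normalize` and `levenshtein` are illustrative names, not code from the actual pipeline). Normalization handles punctuation and case variants, but the edit distance between alternate titles stays enormous:

```typescript
// Lowercase, strip punctuation, collapse whitespace.
function normalize(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^\w\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalization catches the easy case:
console.log(normalize("How Great Thou Art!") === normalize("How Great Thou Art")); // true
// But alternate titles of the same song stay far apart:
console.log(levenshtein(normalize("10,000 Reasons"), normalize("Ten Thousand Reasons")) >= 7); // true
```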
The core problem is that duplicate detection in this domain requires semantic understanding, not just string comparison. You need to know that “It Is Well” is a commonly shortened form of “It Is Well with My Soul.”
Enter Claude
I built a batch processing pipeline that sends groups of potentially similar songs to the Claude API for evaluation. The approach:
- Pre-filter candidates using trigram similarity (threshold 0.3) to create candidate pairs. This reduces the problem space from O(n²) to a manageable set.
- Batch candidates into groups of 10-20 potential duplicates.
- Ask Claude to evaluate each group and identify true duplicates.
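In production the pre-filter runs as a pg_trgm similarity query inside Postgres; the sketch below shows the same idea in TypeScript so the pipeline shape is visible end to end. `Song`, `candidatePairs`, and `batch` are illustrative names, and the trigram padding only approximates pg_trgm's exact behavior:

```typescript
interface Song {
  id: string;
  title: string;
}

// Extract padded character trigrams, roughly as pg_trgm does.
function trigrams(s: string): Set<string> {
  const padded = `  ${s.toLowerCase()} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) grams.add(padded.slice(i, i + 3));
  return grams;
}

// Jaccard similarity over trigram sets: shared / union.
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  return shared / (ta.size + tb.size - shared);
}

// Keep only pairs above the 0.3 threshold — this is what collapses
// the O(n²) comparison space into a manageable candidate set.
function candidatePairs(songs: Song[], threshold = 0.3): [Song, Song][] {
  const pairs: [Song, Song][] = [];
  for (let i = 0; i < songs.length; i++)
    for (let j = i + 1; j < songs.length; j++)
      if (similarity(songs[i].title, songs[j].title) >= threshold)
        pairs.push([songs[i], songs[j]]);
  return pairs;
}

// Chunk candidates into groups for the API calls.
function batch<T>(items: T[], size = 15): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}
```

Note that “It Is Well” vs. “It Is Well with My Soul” sails through this filter (the shared prefix gives high trigram overlap), which is exactly the point: the filter's job is recall, and the final judgment is left to the model.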
The prompt is specific and structured:
```typescript
const prompt = `You are evaluating potential duplicate songs in a hymnal database.

For each group of songs below, identify which ones are duplicates of each other.
Consider:
- Alternate titles (shortened or extended versions)
- Spelling variations
- Number formatting (10,000 vs Ten Thousand)
- Subtitles in parentheses
- Common hymn naming conventions

Return a JSON array of duplicate groups. Each group should contain the IDs
of songs that are the same song.

Songs to evaluate:
${JSON.stringify(candidates, null, 2)}`;
```
The key insight is that Claude has broad knowledge of hymns and worship songs. It knows that “It Is Well” and “It Is Well with My Soul” are the same hymn. It knows that “10,000 Reasons” is the same as “Ten Thousand Reasons (Bless the Lord).” This domain knowledge is exactly what makes it better than algorithmic approaches.
Structured Output
I used Claude’s tool use to get structured JSON responses rather than parsing free text:
```typescript
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{ role: "user", content: prompt }],
  tools: [{
    name: "report_duplicates",
    description: "Report groups of duplicate songs",
    input_schema: {
      type: "object",
      properties: {
        duplicate_groups: {
          type: "array",
          items: {
            type: "object",
            properties: {
              canonical_title: { type: "string" },
              song_ids: { type: "array", items: { type: "string" } },
              confidence: { type: "number" },
            },
          },
        },
      },
    },
  }],
});
```
This gives me structured, parseable results with confidence scores. I only auto-merge above 0.95 confidence; everything else goes into a review queue.
Results
Running this across the full hymnal database:
- Traditional methods caught ~35% of duplicates
- Claude-assisted detection caught ~92% of duplicates
- False positive rate was under 2% (and those were caught by the review queue)
The total API cost for processing the entire database was under $5. For a one-time data cleanup operation, that’s essentially free.
Batch Processing Details
Processing thousands of songs required some engineering around the AI calls:
- Rate limiting — I throttled requests to stay within API limits, processing batches with a small delay between each.
- Idempotency — Each batch is tracked in the database so the pipeline can be restarted without reprocessing.
- Human review — Medium-confidence matches (0.7-0.95) go into a review queue where an admin can confirm or reject the merge.
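The three concerns above fit in one small driver loop. This is a sketch under assumptions: `isDone`/`markDone` stand in for the database batch-tracking, `processBatch` for the Claude API call, and the delay value is arbitrary:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Throttled, restartable batch driver: already-completed batches are
// skipped (idempotency), and a fixed delay spaces out API calls
// (crude but effective rate limiting for a one-off cleanup job).
async function runPipeline<T>(
  batches: T[][],
  isDone: (i: number) => Promise<boolean>,
  processBatch: (b: T[]) => Promise<void>,
  markDone: (i: number) => Promise<void>,
  delayMs = 1000
): Promise<void> {
  for (let i = 0; i < batches.length; i++) {
    if (await isDone(i)) continue; // restart-safe: skip finished work
    await processBatch(batches[i]);
    await markDone(i); // record completion before moving on
    await sleep(delayMs);
  }
}
```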
When to Use AI for Data Quality
This approach works well when:
- Domain knowledge matters. The AI knows things about the data that you’d have to hard-code (like common hymn title variations).
- The volume is manageable. Thousands of records, not millions. At scale, you’d want to use AI to build training data for a traditional classifier.
- Precision matters more than speed. This isn’t a real-time system — it’s a batch process that runs periodically.
It’s not a replacement for traditional deduplication. The trigram pre-filter is essential for reducing the candidate set. But for the final determination of “are these actually the same thing?” — AI is remarkably good.
Key Takeaway
AI-assisted data quality isn’t about replacing traditional tools — it’s about adding a semantic layer on top of them. Use algorithms for the cheap, fast filtering. Use AI for the nuanced judgment calls. The combination is more powerful than either alone.