If you’ve read the SQR Classifier article, you know the origin story: single-pass LLM classification is unreliable. Run the same batch of search terms twice and you get two different answers. Not wildly different — just different enough that you can’t trust any individual run.

That article describes the classification rules — the four intent categories, the context-before-classification requirement, the conservative default. Those rules power each individual run and they’re solid.

But a single run of solid rules still produces inconsistent output. That’s the nature of probabilistic systems. The rules don’t fix the variance problem. They just make each individual run better.

The 3-Run Pipeline is what fixes the variance problem.

Get the SQR 3-Run Pipeline skill → github.com/fourteenwm/ppc-ai-skills/sqr-3run

Free and open-sourced. Drop the SKILL.md into any Claude Code project. Requires the SQR Classifier skill as the classification engine.

The Insight That Changed Everything

The framework that made this click came from a developer named Indie Dev Dan. He talks about scaling compute rather than refining prompts — when you have an unreliable system, sometimes the answer isn’t to make a single pass better, it’s to run multiple passes and extract the signal from the noise.

When I heard that, my whole approach shifted. I’d been spending weeks trying to perfect the classification prompt. Tweaking the wording. Adding examples. Adjusting the category boundaries. Each revision made the single-pass output a little better, but it never got reliable enough to stop reviewing every batch.

Three runs with a consensus rule got me there in an afternoon.

How It Works

The pipeline has three stages:

Stage 1: Three Independent Classification Passes

The same batch of search terms runs through the SQR Classifier three times. Each run is an independent Claude Sonnet agent with no knowledge of the other runs’ output. Same prompt, same rules, same context — but three separate inference passes.

This matters. If the agents could see each other’s output, they’d converge. You’d get three copies of the same answer instead of three independent judgments. The independence is what makes the consensus meaningful.
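
In the skill itself, each pass is a Claude Code Task agent. As a rough Python sketch of the same idea, three independent calls to a single-pass classifier with no shared state between them (classify_batch is a hypothetical stand-in for one SQR Classifier run):

from concurrent.futures import ThreadPoolExecutor

def run_three_passes(search_terms, business_context):
    # Three independent passes: same input, same rules, no shared state.
    # classify_batch() is a hypothetical wrapper around one SQR Classifier run;
    # it returns {search_term: category} for the whole batch.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(classify_batch, search_terms, business_context)
            for _ in range(3)
        ]
        return [f.result() for f in futures]  # [run1, run2, run3]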

Stage 2: Consensus Merge

After all three runs complete, a compare step aligns the results by search term and checks agreement:

  • 3-of-3 agree — High confidence. All three runs independently reached the same classification. Act on it.
  • 2-of-3 agree — Medium confidence. Majority rules. Still reliable enough to act on in most cases.
  • 0-of-3 agree — No consensus. All three runs returned different categories (a 1-1-1 split). These are the genuinely ambiguous terms where human judgment matters.

The merge step doesn’t just count votes. It captures the specific disagreement — which run said what — so when I review the no-consensus bucket, I can see the reasoning from each pass and make a faster call.
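
The vote itself fits in a few lines. This is an illustrative Python sketch, not the skill's exact implementation; each merged record keeps the per-run labels so the disagreement stays visible at review time:

from collections import Counter

def merge_consensus(run1, run2, run3):
    # Each run is a dict of {search_term: category} from one independent pass.
    merged = []
    for term in run1:
        votes = [run1[term], run2[term], run3[term]]
        top_label, top_count = Counter(votes).most_common(1)[0]
        if top_count == 3:
            tier = "3-of-3"        # unanimous: high confidence
        elif top_count == 2:
            tier = "2-of-3"        # majority rules: medium confidence
        else:
            tier = "no-consensus"  # 1-1-1 split: human review
        merged.append({
            "search_term": term,
            "consensus": top_label if top_count >= 2 else None,
            "tier": tier,
            "votes": votes,        # which run said what, for the review step
        })
    return merged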

Stage 3: Output

Results go back to a Google Sheet with tabs for each confidence tier. I review the 3-of-3 tab in under a minute (spot-check only), spend real time on the 2-of-3 tab (override the occasional bad majority call), and manually classify the no-consensus terms.

In practice, the no-consensus bucket is usually 5-10% of the total batch. That means the pipeline handles 90-95% of the classification work without me, and the 5-10% it can’t handle are exactly the terms where my judgment adds the most value.
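
Writing the merged records into per-tier tabs is the mechanical part. A minimal sketch using gspread, where the spreadsheet name and column layout are assumptions rather than anything the skill requires:

import gspread

def write_tiers(merged, spreadsheet_name="SQR Review"):
    # Group merged records by tier and write each group to its own tab.
    gc = gspread.service_account()   # assumes a service-account credentials file
    sheet = gc.open(spreadsheet_name)
    header = ["search_term", "consensus", "run_1", "run_2", "run_3"]
    for tier in ("3-of-3", "2-of-3", "no-consensus"):
        rows = [
            [r["search_term"], r["consensus"] or ""] + r["votes"]
            for r in merged if r["tier"] == tier
        ]
        ws = sheet.add_worksheet(title=tier, rows=len(rows) + 1, cols=len(header))
        ws.append_rows([header] + rows)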

Why Three Runs and Not Two or Five

Two runs create a problem: when they disagree, you have a tie with no tiebreaker. You're back to making a judgment call on every contested term, which defeats the purpose.

Five runs would work, but the marginal accuracy gain from runs 4 and 5 is small relative to the compute cost. Three runs gives you a clear majority rule (2-of-3) with a meaningful confidence tier (3-of-3 unanimous), and it completes fast enough to run across a full portfolio in a morning.

I tested this empirically before settling on three. The jump from one run to three was dramatic. The jump from three to five was not.
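
A back-of-the-envelope model shows the same shape. Treat each run as independently correct with probability p (a simplification: real errors are not fully independent, and there are four categories rather than a binary right/wrong), and majority voting captures most of the available gain at three runs:

from math import comb

def majority_accuracy(p, n):
    # Probability that a strict majority of n independent runs is correct,
    # assuming each run is correct with probability p.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

p = 0.90  # illustrative per-run accuracy, not a measured figure
print(majority_accuracy(p, 1))  # 0.900
print(majority_accuracy(p, 3))  # 0.972
print(majority_accuracy(p, 5))  # ~0.991

At that hypothetical 90% per-run accuracy, going from one run to three closes most of the gap; runs 4 and 5 add less than two more points.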

What This Actually Looks Like in Production

I run this pipeline weekly across my portfolio. The practical workflow:

  1. Pull search terms from the last 7 days via the Google Ads API
  2. Batch them by account with business context attached to each batch
  3. Spawn three Sonnet agents in parallel — each classifies the full batch independently
  4. Merge results into consensus tiers
  5. Review: spot-check the 3-of-3 tab, inspect the 2-of-3 tab, manually classify the no-consensus terms
  6. Approved negatives go to the SQR Upload skill for bulk addition to shared keyword lists

The whole cycle — from pulling terms to uploading negatives — runs in about 20 minutes for a portfolio of accounts. The manual review portion is usually 5-8 minutes. The rest is compute.

Before this pipeline existed, the same work took me most of a day. And I was still missing things, because human attention degrades across hundreds of queries in a way that AI compute doesn’t.
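
For step 1, the pull is a single GAQL query per account. A minimal sketch with the official google-ads Python client; the customer ID handling and field selection here are assumptions, and the skill may structure this step differently:

from google.ads.googleads.client import GoogleAdsClient

QUERY = """
    SELECT
      search_term_view.search_term,
      campaign.name,
      metrics.impressions,
      metrics.clicks,
      metrics.cost_micros
    FROM search_term_view
    WHERE segments.date DURING LAST_7_DAYS
"""

def pull_search_terms(customer_id):
    # Returns the last 7 days of search terms for one account.
    client = GoogleAdsClient.load_from_storage()   # reads google-ads.yaml credentials
    service = client.get_service("GoogleAdsService")
    terms = []
    for batch in service.search_stream(customer_id=customer_id, query=QUERY):
        for row in batch.results:
            terms.append(row.search_term_view.search_term)
    return terms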

The Failure Mode I Had to Fix Along the Way

One thing I learned the hard way: when you tell an AI agent to “classify these search terms,” some agents will try to be clever about it. Instead of using LLM judgment on each term, they’ll write a Python regex script that pattern-matches keywords.

The output looks reasonable. The coverage is fast. But it misses everything that requires actual intent judgment — which is the entire point of using an LLM instead of a keyword script.

The fix was explicit in the agent prompt: “Do NOT write a script. Classify using your own judgment on each term.” Without that instruction, one of my three runs would occasionally produce a completely different classification methodology than the other two, and the consensus step would flag everything as no-consensus because the disagreement wasn’t about individual terms — it was about the classification approach itself.

This is the kind of thing you only discover by running the pipeline across real batches. The SKILL.md includes this instruction because I hit it in production.

The Architecture Decision: Claude Sonnet Agents, Not External APIs

The 3-Run Pipeline uses Claude’s built-in Task agents for each classification pass. That means:

  • No external API costs. No OpenAI bills. No per-token charges beyond your Claude subscription.
  • Parallel execution. All three runs start simultaneously and finish within minutes.
  • Self-contained. The entire pipeline runs inside Claude Code. No orchestration layer, no webhook chains, no Make.com scenarios.

I originally built this on external APIs (GPT-4o through Make.com), and it worked, but the cost per batch made it hard to justify running weekly across every account. Moving to Claude Sonnet agents dropped the incremental cost to effectively zero, which meant I could run it as often as I wanted without budget anxiety.

The Companion Skills

The 3-Run Pipeline is the orchestration layer. It depends on two other skills for the classification and upload steps:

  • SQR Classifier — The four-category classification rules that power each individual run
  • SQR Upload — Bulk upload approved negatives to Google Ads shared keyword lists via the API

Together, the three skills form a complete pipeline: classify → vote → upload. Each skill works independently, but the real value is in the combination.

Get the SQR 3-Run Pipeline Skill

Install in 30 seconds

→ View the skill on GitHub

Copy the SKILL.md file into your Claude Code project:

mkdir -p .claude/skills/sqr-3run
curl -o .claude/skills/sqr-3run/SKILL.md \
  https://raw.githubusercontent.com/fourteenwm/ppc-ai-skills/main/sqr-3run/SKILL.md

The pipeline orchestrates three classification passes and merges results into consensus tiers. Requires the SQR Classifier skill as the classification engine. Works with any Google Sheets setup for input/output.

Free. Open-sourced. MIT licensed.

The full repo has 32 PPC AI skills I use in production — mutation safety, ad copy verification, impression share diagnostics, portfolio health prioritization, and more. All at github.com/fourteenwm/ppc-ai-skills.

The Bigger Point: Reliability Is a Design Problem, Not a Model Problem

The PPC industry’s relationship with AI right now is stuck on a question of model quality. “Is GPT-4 good enough?” “Is Claude accurate enough?” “What about Gemini?” These are the wrong questions.

The right question is: given that any single LLM pass is probabilistic, what system design makes the output reliable enough to ship?

The answer, at least for classification tasks, turns out to be embarrassingly simple. Run it three times. Take the consensus. Flag the disagreements for human review. That’s it. No fine-tuning. No custom training data. No proprietary model. Just a design decision to treat variance as a signal instead of pretending it doesn’t exist.

The SQR 3-Run Pipeline is the simplest version of that idea I could build. It’s three runs, one merge, and a human review step on the ambiguous tail. And it’s the most reliable component in my entire AI ops stack — not because the underlying model is perfect, but because the architecture doesn’t require it to be.

Build for reliability first. The speed is already there.