Back in 2024, search query reports were starting to crush me.
I was running a portfolio large enough that the math had stopped working. Every account produced a new batch of search terms every few days. Every batch needed a human eye to decide what to negative out. At some point the volume hit the ceiling — I literally couldn’t get to all of them in a week, and the ones I couldn’t review were just burning budget on queries I’d never have approved.
This is the problem that made me incorporate AI into my PPC workflow for the first time.
I started with a Make.com workflow. It worked. It wasn’t pretty, but it pulled search terms, ran them through an LLM, and handed me back a classification. For a while, it kept me from drowning. Then I moved the whole thing over to Claude Code for more flexibility, and that’s where the real failure hit.
One day I ran a batch through the Claude Code version and the output came back with almost nothing flagged. I mean almost nothing. A batch that should have produced dozens of clear off-brand negatives came back looking like everything was fine. I ran it again. Different result — still sparse, but different.
That was the moment I stopped trusting single-pass LLM classification for anything production.
The hard question that forced me to rebuild this:
If a single LLM run on the same batch can produce two different answers, how can I build a pipeline that actually catches the bad queries before they burn my clients’ budget?
That question is what the SQR Classifier skill exists to answer. It’s the first AI skill I ever built for Google Ads, the longest-running one I’ve got, and the foundation the rest of my AI ops layer sits on.
Here’s what it does, why single-pass isn’t enough, and the framework I borrowed from a developer named Indie Dev Dan that fixed the whole thing.
Get the SQR Classifier skill → github.com/fourteenwm/ppc-ai-skills/sqr-classifier
Free and open-sourced. Drop the SKILL.md into any Claude Code project in under a minute. No configuration required.
The Core Problem: One LLM Pass Is a Coin Flip at Scale
When you run an LLM over a batch of search queries once and look at the output, it usually seems right. The categories are plausible. The reasoning is coherent. You can read the output and nod along.
But “seems right” isn’t the same as “is right.” When I started running the same batch through multiple times, the disagreement rate was high enough to scare me. A term that got flagged as off-brand on one run would come back as low-intent on the next. A clearly informational query would sometimes get waved through as high-intent. None of the individual runs were obviously broken. They just weren’t the same.
This is the failure mode nobody talks about with LLM classification. It isn’t that the model hallucinates wildly. It’s that the model is slightly inconsistent in ways you can’t predict from any single output. You only see it when you look at the same input twice.
For one-off analysis, that’s fine. You use your judgment, you move on. For a production pipeline that’s supposed to catch wasted spend at scale, it’s a disaster. The whole point is that you’re trusting the output enough to not look at every query. If the output is a coin flip, you’ve just added an expensive step that produces unreliable results.
The fix isn’t “use a better model.” Better models still drift. The fix is to run the classification enough times that the disagreements become a signal instead of noise.
Rule 1: Context Before Classification
The SKILL.md opens with the most important rule, and it’s the one I see violated most often when people try to replicate this with raw prompting: context is required before you classify anything.
A search term doesn’t have a fixed category. “Pool maintenance” is high intent if the business is a pool service company. It’s off-brand if the business is an apartment complex. “Apartments in Austin” is high intent for an Austin property and off-brand for a property in Denver. You cannot classify a term correctly without knowing what business it belongs to and where that business operates.
So the first thing the SQR classifier does — before it touches a single query — is ask what the business is and where it serves. If I skip this step, the classification falls apart. If a teammate uses the skill without this step, they get noise.
This sounds obvious. It isn’t. I’ve seen people try to classify search terms with “just run this through GPT” and get garbage results because the model has no idea which of the 87 industries the term belongs to.
Rule 2: When in Doubt, Default to Low Intent
The second rule is the one that makes the skill conservative in the right direction: when a term is ambiguous between high intent and low intent, classify it as low intent.
The asymmetry matters. A false positive — keeping a marginal term that should have been negated — costs you some wasted clicks. A false negative — negating a real buyer who was just using an unusual phrase — costs you the conversion entirely. Between those two errors, the cheaper mistake is keeping the marginal term.
So the classifier leans conservative. If it isn’t obvious a term is high intent, it gets bucketed as low intent, and the reviewer decides whether to negate it later. This is the exact opposite of how most keyword research tools work, and it’s deliberate. Conservative classification protects budget because it doesn’t act on low-confidence calls.
Rule 3: Classify the Intent, Not the Word
The third rule is the one I had to learn the hard way: pay attention to what the searcher wants, not what the word technically means.
“Cheap apartments” contains the word cheap. A lot of people see that and assume it’s a low-value searcher. But the intent in that phrase is clearly “I want an apartment, and I have a budget constraint.” That’s still a buyer. That’s still high intent.
On the flip side, “apartment maintenance jobs” contains the word apartment, which is the business’s core product. But the intent is “I want to work in apartment maintenance.” That’s a job seeker, not a renter. Off-brand or low-intent, depending on how strict you want to be.
The classifier is built around this distinction. It doesn’t pattern-match on keywords. It judges the intent expressed by the whole phrase, given the business context. That’s what makes it work on terms a regex never could.
The Four Intent Categories
The public skill classifies every term into exactly one of four buckets:
- High Intent — actively looking for what the business offers. Keep these. Bid on them.
- Low Intent — related but unlikely to convert. Usually current customers, job seekers, or early research.
- Informational — research or educational queries. “How to,” “what is,” “why does.” Usually negate unless you’re running a top-of-funnel content play.
- Off-Brand — completely unrelated to the business. Different industry, wrong category, wrong geography. Negate immediately.
One category per term. No multi-labeling. No “somewhere between high and low.” The classifier forces a single judgment call, because that’s what the downstream human reviewer needs to make a keep/negate decision in under a second.
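As a minimal sketch (the names here are mine, not pulled from the SKILL.md), the four-bucket taxonomy and the action each bucket implies look something like this:

```python
from enum import Enum

class Intent(Enum):
    """The four mutually exclusive SQR buckets -- one per term, never several."""
    HIGH_INTENT = "high_intent"      # actively seeking what the business offers
    LOW_INTENT = "low_intent"        # related but unlikely to convert
    INFORMATIONAL = "informational"  # "how to" / "what is" research queries
    OFF_BRAND = "off_brand"          # wrong industry, category, or geography

# What the reviewer typically does with each bucket, per the rules above.
# "review" reflects Rule 2: ambiguous terms default to low intent and a
# human decides whether to negate them later.
ACTION = {
    Intent.HIGH_INTENT: "keep",
    Intent.LOW_INTENT: "review",
    Intent.INFORMATIONAL: "negate",
    Intent.OFF_BRAND: "negate",
}
```

The single-label constraint is what makes the downstream decision fast: every term maps to exactly one bucket, and every bucket maps to exactly one default action.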
Why I Run This Three Times in Production
Here’s the part that’s not in the public skill, but is the reason I trust this pipeline in production across my portfolio.
The inspiration came from a developer named Indie Dev Dan and a framework he talks about called “Scale Your Compute to Scale Your Impact.” The idea is that when you’re working with LLMs, the thing you can scale isn’t always the model or the context or the prompt — sometimes the thing to scale is the raw number of passes. If one run is unreliable, running the same prompt three times and taking the consensus is often dramatically more reliable than spending a week fine-tuning the prompt.
When I heard that, something clicked. My single-pass pipeline wasn’t failing because the prompt was bad. It was failing because I was asking a probabilistic system for a single deterministic answer. The only thing holding me back from running the classification three times was having to do it manually — and “manually” is exactly the thing AI is supposed to eliminate.
So I rebuilt it. Three independent classification passes on the same batch, in parallel, with no agents seeing each other’s output. Then a compare step that merges the results into three buckets: queries where all three runs agreed, queries where two of three agreed, and queries where all three disagreed. I only act on queries with 2-of-3 or 3-of-3 agreement. The all-disagree bucket gets flagged for manual review — those are usually the genuinely ambiguous terms where my judgment matters anyway.
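The compare step is simple enough to sketch. This is my own illustration of the merge logic, not code from the repo, and the function and label names are hypothetical:

```python
from collections import Counter

def merge_consensus(runs: list[dict]) -> dict:
    """Merge three independent classification passes into agreement buckets.

    runs: three dicts mapping search term -> category label, produced by
    passes that never saw each other's output.
    Returns 'unanimous' (3-of-3), 'majority' (2-of-3), and 'disputed'
    (all three runs disagree -- flagged for manual review).
    """
    buckets = {"unanimous": {}, "majority": {}, "disputed": []}
    for term in runs[0]:
        votes = Counter(run[term] for run in runs)
        label, count = votes.most_common(1)[0]
        if count == 3:
            buckets["unanimous"][term] = label
        elif count == 2:
            buckets["majority"][term] = label
        else:
            buckets["disputed"].append(term)  # genuinely ambiguous: human call
    return buckets

passes = [
    {"pool cleaning near me": "high_intent", "pool jobs": "low_intent",
     "pool party ideas": "informational"},
    {"pool cleaning near me": "high_intent", "pool jobs": "low_intent",
     "pool party ideas": "off_brand"},
    {"pool cleaning near me": "high_intent", "pool jobs": "off_brand",
     "pool party ideas": "low_intent"},
]
result = merge_consensus(passes)
# "pool cleaning near me" is unanimous, "pool jobs" has 2-of-3 agreement,
# and "pool party ideas" got three different labels, so it goes to review.
```

Only the unanimous and majority buckets feed the negative list; the disputed bucket is exactly the set of terms where three independent passes couldn't agree, which is a decent proxy for "a human should look at this."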
The result is a pipeline I actually trust to run across my full portfolio without me reviewing every batch. Consensus-based confidence — agreement across independent runs — where before I had a single unverifiable guess.
The open-sourced version in the repo is the single-pass classifier — the four-category taxonomy and the judgment rules that power each individual run. If you want the 3-pass consensus on top of it, that’s a workflow you’d build yourself using whatever orchestration you prefer. The classification skill is the atomic unit either way.
Get the SQR Classifier Skill
Install in 30 seconds
Copy the SKILL.md file into your Claude Code project:
mkdir -p .claude/skills/sqr-classifier
curl -o .claude/skills/sqr-classifier/SKILL.md \
  https://raw.githubusercontent.com/fourteenwm/ppc-ai-skills/main/sqr-classifier/SKILL.md

Claude Code auto-loads the skill when you paste search terms or ask for intent classification. No configuration required. Works with any AI harness that respects skill files — I built it for Claude Code but the rules are portable.
Free. Open-sourced. MIT licensed.
The full repo has nine other PPC AI skills I use in production every day — mutation safety, impression share diagnostics, ad copy verification, lead quality pattern analysis, and more. All at github.com/fourteenwm/ppc-ai-skills.
The Bigger Point: Scale Your Compute, Not Your Headcount
The reality is, managing search query reports at scale is one of those PPC tasks that looks boring from the outside and is genuinely the difference between a portfolio that performs and one that bleeds out slowly. Nobody wins an account review by saying “I reviewed all the search terms this week.” Accounts get lost by not reviewing the search terms and letting the wrong queries eat a meaningful chunk of the budget before anyone notices.
AI doesn’t fix this problem by being smarter than a human. It fixes it by being willing to work on every query every day without getting tired, bored, or distracted. The catch is that AI also isn’t consistent out of the box. One pass is a coin flip. Two passes is better but still noisy. Three passes with a consensus rule is where it becomes something you can actually build a workflow around.
This is the lesson I took from Indie Dev Dan’s framework and baked into my production pipeline. When the thing you’re scaling is an unreliable probabilistic system, the answer is often more of the same system, not a better one.
Classify once, get a guess. Classify three times, get a signal. Classify three times and demand agreement, get a result you can ship.