The Task

Search query review is the most fundamental quality control task in PPC. You pull the queries that triggered your ads, classify each one as on-brand or off-brand, and add the bad ones as negative keywords to stop wasting spend. For a single account, it’s straightforward — export to a spreadsheet, scan the list, make decisions, upload negatives. Maybe 20 minutes of work.

The Pain at Scale

Across a portfolio of 20-30 auto repair accounts, a monthly review cycle means classifying over 3,000 search queries. Each query requires a judgment call: is “emergency AC repair near me” on-brand because the shop services automotive AC, or off-brand because the searcher wants home HVAC repair? Is “brake shop hiring” a job seeker or someone looking for a brake shop? These classifications require understanding each client’s specific service offerings, their brand names, and their competitors’ brand names: 298 competitor terms across the portfolio.

At that volume, the work is too tedious to do well by hand. Fatigue sets in after the first few hundred queries, and classification quality degrades with most of the list still ahead. Reviews get pushed to “next week.” Off-brand queries keep triggering ads and wasting spend for days or weeks before someone gets to them.

The other problem is consistency. Two people reviewing the same query list will disagree on 10-20% of classifications. Even the same person reviewing on different days will make different calls. For queries in the gray zone, there’s no reliable baseline. The work isn’t intellectually hard — it’s relentless, and humans aren’t built for relentless.

The Agent

The Search Query Pipeline uses a consensus architecture: three independent classification runs, each performed by a separate AI agent, with only the queries where two or more runs agree on “off-brand” surfaced for human review.

The pipeline has three stages.

Stage 1: Prep. A script reads the portfolio’s search query data from Google Sheets and splits it into batches of roughly 50 queries each. Each batch includes the account’s brand names and the 298-term competitor keyword list for context.
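A minimal sketch of what that prep step could look like, assuming the gspread library for Sheets access. The sheet layout, the search_term column name, and the BRAND_TERMS/COMPETITOR_TERMS structures are illustrative stand-ins, not the pipeline’s actual code:

    import gspread

    BATCH_SIZE = 50  # roughly 50 queries per batch

    # Hypothetical context data; the real pipeline stores these per client.
    BRAND_TERMS = {"Acme Auto": ["acme auto", "acme repair"]}
    COMPETITOR_TERMS = ["midas", "jiffy lube", "pep boys"]  # stand-in for the 298-term list

    def load_batches(sheet_key: str, account: str) -> list[dict]:
        gc = gspread.service_account()  # service-account credentials
        rows = gc.open_by_key(sheet_key).worksheet(account).get_all_records()
        queries = [row["search_term"] for row in rows]  # column name is assumed
        # Each batch carries the brand and competitor context the agents need.
        return [
            {
                "account": account,
                "brand_terms": BRAND_TERMS[account],
                "competitor_terms": COMPETITOR_TERMS,
                "queries": queries[i : i + BATCH_SIZE],
            }
            for i in range(0, len(queries), BATCH_SIZE)
        ]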

Stage 2: Classification. Three AI agents launch in parallel — each processing every batch independently. The agents don’t see each other’s work. Each classifies every query into one of four categories: high intent, low intent, informational, or off-brand.
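The fan-out could be as simple as three independent passes over the same batches, for example with a thread pool. Here classify_batch is a placeholder for whatever agent call the pipeline actually makes; the post doesn’t specify the model or prompt:

    from concurrent.futures import ThreadPoolExecutor

    CATEGORIES = ("high intent", "low intent", "informational", "off-brand")

    def classify_batch(batch: dict, run_id: int) -> dict[str, str]:
        # Placeholder for one agent call; must return {query: category}.
        raise NotImplementedError("agent call not specified in the post")

    def run_once(batches: list[dict], run_id: int) -> dict[str, str]:
        results: dict[str, str] = {}
        for batch in batches:
            results.update(classify_batch(batch, run_id))
        return results

    def run_three_agents(batches: list[dict]) -> list[dict[str, str]]:
        # Three runs in parallel; no run sees another's output.
        with ThreadPoolExecutor(max_workers=3) as pool:
            futures = [pool.submit(run_once, batches, rid) for rid in range(3)]
            return [f.result() for f in futures]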

Stage 3: Consensus. A comparison script merges the three runs and filters for agreement. Queries where all three runs agree on off-brand go to a “3-3 Agree” tab. Queries where two of three agree go to a “2-3 Agree” tab. Both tabs include a human review column — the final call is always a person’s.
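The comparison step reduces to counting off-brand votes per query. A minimal sketch (writing the tabs back to Google Sheets is omitted):

    from collections import Counter

    def consensus_offbrand(runs: list[dict[str, str]]) -> dict[str, list[str]]:
        votes = Counter()
        for run in runs:
            for query, category in run.items():
                if category == "off-brand":
                    votes[query] += 1
        # Queries bucketed by agreement level; each tab then gets a blank
        # human-review column when written back to the sheet.
        return {
            "3-3 Agree": sorted(q for q, n in votes.items() if n == 3),
            "2-3 Agree": sorted(q for q, n in votes.items() if n == 2),
        }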

Only after human approval does the upload script push the negatives to the accounts’ shared keyword lists via the Google Ads API.
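Assuming the shared lists are Google Ads shared sets and the upload uses the official google-ads Python client, that push could look roughly like this; the resource names and the phrase-match choice are assumptions:

    from google.ads.googleads.client import GoogleAdsClient

    def upload_negatives(customer_id: str, shared_set: str, terms: list[str]) -> None:
        client = GoogleAdsClient.load_from_storage()  # reads google-ads.yaml
        service = client.get_service("SharedCriterionService")
        operations = []
        for term in terms:  # only human-approved queries reach this point
            op = client.get_type("SharedCriterionOperation")
            criterion = op.create
            criterion.shared_set = shared_set  # e.g. "customers/123/sharedSets/456"
            criterion.keyword.text = term
            criterion.keyword.match_type = client.enums.KeywordMatchTypeEnum.PHRASE
            operations.append(op)
        service.mutate_shared_criteria(customer_id=customer_id, operations=operations)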

The Result

3,000+ queries classified in 30-60 minutes of wall-clock time. The same volume would take multiple days of manual review; realistically, that work wouldn’t get done at all.

The consensus model produces two tiers of confidence. The unanimous results (all three runs agree on off-brand) are nearly 100% accurate — I almost never override them. The majority results (two of three agree) land around 80% accuracy, which is why they get a closer human look.

The biggest value isn’t catching things a human would miss. It’s doing the work at all. Before the pipeline, a monthly review of 3,000+ queries across 20-30 accounts simply wasn’t happening at this depth. Fatigue made it impossible. The agent doesn’t get tired, doesn’t lose focus at query 2,000, and doesn’t decide “good enough” after the first hundred. It processes every query with the same attention as the first one.


This is one of 22 agents running in production across 118 accounts, orchestrated by a 600-line operational playbook.
