Industry TrendsLLM TestingChatbot QAAI Automation

AI Secret Shoppers at Scale: How LLM Simulators Are Replacing Manual Chatbot QA

UndercoverAgent Team

DoorDash's new LLM conversation simulator signals a shift in chatbot testing. Here's how synthetic test generation, LLM-as-Judge scoring, and continuous evaluation are redefining QA in 2026.

DoorDash just showed the industry what chatbot QA looks like at enterprise scale. Their newly published LLM conversation simulator generates thousands of synthetic multi-turn customer support conversations, tests their AI chatbot against them, and scores the results automatically. No manual reviewers. No spreadsheets. Just an LLM stress-testing another LLM, around the clock.

It's the secret shopper concept, running at machine speed. And it's not just DoorDash. This pattern is becoming the standard for conversational AI testing in 2026.

LLM-as-Judge Is Replacing Manual QA Rubrics

The biggest shift? A second LLM now plays the role of the evaluator. Instead of human reviewers grading chatbot transcripts against a rubric, companies are using LLM judges to score responses across dimensions like correctness, helpfulness, safety, and format compliance.

This isn't about cutting corners. It's about making continuous evaluation possible. Manual QA can review dozens of conversations per day. An LLM judge can evaluate thousands per hour, with consistent criteria every time. Teams at SitePoint and in the broader testing community are documenting how to separate deterministic checks (tool routing, API parsing) from semantic evaluation, giving teams confidence in both the plumbing and the output.

Synthetic Conversations from Real Transcripts

DoorDash's approach is especially interesting because of where their test data comes from: real customer support transcripts. They feed historical conversations into an LLM along with backend mocks, generating realistic multi-turn scenarios that mirror actual user behavior.

This creates a simulation-evaluation flywheel. Engineers tweak a prompt, regenerate test scenarios, run them through the simulator, and get scored results in minutes. Iteration cycles that used to take days now happen before lunch.

The approach solves one of the hardest problems in chatbot testing: getting realistic test data without exposing customer PII or waiting for production incidents to reveal gaps.

The Secret Shopper Pattern Goes 24/7

What DoorDash built internally, platforms like ChatBotKit are now packaging as reusable frameworks. The pattern is consistent: deploy automated test suites that run against live chatbots on a schedule, essentially AI mystery shoppers that probe for regressions before customers encounter them.

This is exactly the approach UndercoverAgent was built around. Our 22 pre-built scenarios cover happy paths, adversarial inputs, edge cases, and compliance checks. Each test conversation gets multi-pass analysis with scoring from 0 to 100. The difference between running this manually and running it continuously is the difference between a quarterly audit and a 24/7 security camera.

Red Teaming Adds a Security Layer

Functional QA catches broken flows. But chatbots have attack surfaces that traditional testing never covered: prompt injection, hallucination under pressure, and data leakage through clever social engineering.

AI red teaming is emerging as a parallel discipline. Companies are pen-testing their chatbots the same way they pen-test their APIs, probing for vulnerabilities that only appear in adversarial conversations. QA trends reports from Sthenos and CloudQA both flag this as a top priority for teams shipping AI features in 2026.

What This Means for Your Team

If you're still testing chatbots with manual transcripts and spot checks, you're operating on a model that doesn't match how these systems fail. LLMs drift. Prompts regress. Edge cases multiply with every new feature.

The companies getting this right are running automated, scenario-driven evaluations continuously. Not once before launch. Every day.

Want to see how your chatbot holds up? Try UndercoverAgent's demo and run your first secret shopper test in under two minutes.

Catch Failures Before Production

Run secret-shopper QA continuously and surface hidden chatbot failures before customers do.

Request a Demo