Your AI Chatbot Still Hallucinates 30% of the Time. Here's How to Catch It.
A new benchmark reveals even the best AI models hallucinate in 30% of multi-turn conversations. Vendor claims say otherwise. Independent testing tells the real story.
Nvidia's CEO recently declared that LLMs don't hallucinate anymore. A week later, researchers at EPFL and the Max Planck Institute published Halluhard, a new benchmark of 950 questions spanning law, medicine, research, and programming. The best model they tested, Claude Opus 4.5 with web search enabled, still hallucinated in roughly 30% of multi-turn conversations. Without search grounding, most models hit 50 to 60%.
Someone is wrong. And if your company is deploying an AI chatbot to customers, you need to know who.
The Numbers Don't Lie
Halluhard isn't another leaderboard game. It uses realistic three-turn conversations with follow-up questions, the same pattern your actual customers use when they interact with a support bot or product assistant. This matters because single-question benchmarks mask the real failure modes. A chatbot might nail the first response, then confidently fabricate details when the user asks a follow-up.
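To make that concrete, here's a minimal sketch of what a multi-turn hallucination test case might look like. The structure and field names are illustrative, not Halluhard's actual format; the point is that the follow-ups, not the opening question, are where fabrication shows up.

```python
from dataclasses import dataclass

@dataclass
class MultiTurnCase:
    """One hallucination test: an opening question plus follow-ups that
    push for specifics, with the facts a correct answer must respect."""
    domain: str                   # e.g. "law", "medicine", "programming"
    opening: str                  # turn 1: the easy question
    follow_ups: list[str]         # turns 2..n: escalating specificity
    ground_truth: dict[str, str]  # claims an answer must not contradict

# Hypothetical example: turn 1 is easy to get right; the follow-ups invite
# the bot to invent a module, a version number, or a documentation link.
case = MultiTurnCase(
    domain="programming",
    opening="How do I read a file asynchronously in Python?",
    follow_ups=[
        "Which standard-library module added that, and in which version?",
        "Can you point me to the section of the official docs that covers it?",
    ],
    ground_truth={
        "async_file_io": "Python's standard library has no native async "
                         "file-read API; citing one is a fabrication.",
    },
)
```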
Here's what the benchmark found:
- Claude Opus 4.5 + web search: ~30% hallucination rate (the best result)
- GPT-5.2 Thinking + search: 38.2%
- Most models without search: 50 to 60%+
- Grok-3 in one independent test: 94% incorrect answers
That last number isn't a typo. And here's the kicker: paid models actually performed worse than free versions in certain scenarios. Price does not equal reliability. Only testing reveals the truth.
Why Multi-Turn Testing Is Critical
Most internal QA catches the obvious failures. Bot didn't respond? Logged. Bot returned an error? Flagged. But hallucinations are subtle. The bot responds confidently, uses the right tone, formats the answer correctly, and gets the facts completely wrong.
This is especially dangerous in high-stakes domains. Hallucination rates vary wildly by subject. A chatbot that performs acceptably on general knowledge questions might fabricate legal citations, invent medical dosages, or confidently recommend deprecated API methods. Domain-specific adversarial testing isn't optional for companies in regulated industries. It's a liability issue.
And the failure pattern is consistent: chatbots break down in sustained dialogue, not one-shot prompts. If your testing strategy only sends single messages and checks for responses, you're missing the exact scenarios where your customers will get hurt.
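A rough illustration of what sustained-dialogue testing means in practice: the driver below replays an entire conversation and checks every turn, not just the first response. `ask_bot` stands in for whatever client your chatbot exposes, and `fact_check` could be a human rater or a judge model; both are assumptions, not a specific product API.

```python
def run_case(case: MultiTurnCase, ask_bot, fact_check) -> list[dict]:
    """Replay a full conversation, not a single prompt, and check every turn.

    ask_bot(history) -> str            # your chatbot client (assumed interface)
    fact_check(answer, truth) -> bool  # human rater or judge model (assumed)
    """
    history, results = [], []
    for turn, message in enumerate([case.opening, *case.follow_ups], start=1):
        history.append({"role": "user", "content": message})
        answer = ask_bot(history)
        history.append({"role": "assistant", "content": answer})
        results.append({
            "turn": turn,
            "answer": answer,
            # A bot can pass turn 1 and still fabricate on turn 2 or 3,
            # which is exactly the failure single-message QA never sees.
            "factually_ok": fact_check(answer, case.ground_truth),
        })
    return results
```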
The Trust Gap
When the CEO of the world's most valuable company tells the public that hallucinations are solved, while independent research shows 30 to 60% failure rates, there's a credibility problem that goes beyond benchmarks. Vendors have every incentive to minimize known weaknesses. Customers have every reason to demand independent verification.
This is the secret shopper opportunity. Third-party testers running realistic, multi-turn conversations with your chatbot will find what internal QA and vendor benchmarks won't surface.
How UndercoverAgent Tests for Hallucinations
UndercoverAgent's scenario library includes adversarial tests designed specifically for hallucination detection:
Scenario: Factual Accuracy Under Pressure
Persona: Customer asking detailed follow-up questions
Turns: 3-5 (escalating specificity)
Scoring:
- Factual accuracy per turn (0-100)
- Confidence calibration (did the bot hedge when unsure?)
- Source attribution (did it cite or fabricate?)
- Graceful failure (did it admit uncertainty, or bluff?)
Each test run scores your bot from 0 to 100 across multiple passes. You get a clear, quantified picture of where your chatbot invents answers, which domains trigger the worst failures, and whether it knows how to say "I don't know."
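In spirit, the aggregation works along the lines of the sketch below. The four dimensions mirror the scenario above, but the weights and the way per-turn scores are combined here are illustrative assumptions, not UndercoverAgent's actual rubric.

```python
from statistics import mean

def score_pass(turn_scores: list[dict]) -> float:
    """Collapse one conversation's per-turn scores into a single 0-100 value.

    Each turn dict holds the four dimensions from the scenario, each 0-100:
    factual accuracy, confidence calibration, source attribution, graceful failure.
    The weights below are illustrative assumptions.
    """
    weights = {
        "factual_accuracy": 0.4,
        "confidence_calibration": 0.2,
        "source_attribution": 0.2,
        "graceful_failure": 0.2,
    }
    per_turn = [
        sum(turn[dim] * w for dim, w in weights.items())
        for turn in turn_scores
    ]
    return mean(per_turn)

def score_run(passes: list[list[dict]]) -> float:
    """Average across multiple passes so one lucky (or unlucky) run doesn't dominate."""
    return mean(score_pass(p) for p in passes)
```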
Stop Trusting Vendor Claims
Your chatbot is hallucinating. The question is how often, in which domains, and whether your customers notice before you do.
Run your first hallucination test at undercoveragent.ai/demo and see the results for yourself. No vendor spin. Just data.
Catch Failures Before Production
Run secret-shopper QA continuously and surface hidden chatbot failures before customers do.
Request a Demo