Can AI coaching clones hallucinate? How to evaluate accuracy.
Why hallucinations are the category's defining failure mode
Coaches, consultants, and other experts build their practices on being right more often than their clients. An AI protégé that makes up an answer is not just embarrassing; it's the exact failure mode that destroys the reason a client would trust the protégé in the first place.
The problem is that modern language models are optimized to produce plausible text. When grounded context is thin, the most plausible continuation is often a confidently worded answer that isn't grounded. The failure is quiet: nothing in the model's behavior tells the client the answer was fabricated. The client finds out later, if at all. Usually from someone else.
A hallucinated answer in a session is not a bug; it's the model doing what it was trained to do. The question for a platform is whether the architecture made it possible for the model to do it in the first place.
You can't prompt your way out of this. "Don't make things up" is not a constraint; it's a request. The fix has to happen at the architecture layer. That's what we cover in "What is a zero-hallucination AI architecture?" This post is the evaluation companion: how to test any platform's accuracy claim before you commit.
The five accuracy tests to run before you buy
Most evaluation frameworks for AI products ask general questions: "how good is the voice?" "how fast is the setup?" For accuracy, the questions need to be specific and reproducible. Run these five before you sign a contract.
- The adversarial question. Ask the protégé something the expert genuinely doesn't know.
- The out-of-scope question. Ask something plausible but clearly outside the expert's practice.
- The citation trace. Ask a question the expert does know, then try to trace the answer back to source material.
- The refinement round-trip. Note a specific mistake or style issue; submit feedback; see how long it takes to show up in a live session.
- The transcript audit. Pull a session transcript and verify you can reconstruct what the protégé retrieved and why.
Each tests a different property, so a platform can pass some and fail others.
Test 1: The adversarial question
Pick a topic adjacent to the expert's practice that they genuinely don't know: not obscure, just outside their lane. For a leadership coach: ask a specific tax question. For a personal-injury attorney: ask about a mergers-and-acquisitions provision. For a financial advisor: ask about a medical device.
What you're testing: Whether the protégé says "I don't have that in my knowledge base" or invents a plausible answer.
Red flags:
- Confident answer with specifics you can't verify
- Generic advice that could have come from any blog post
- No acknowledgment that the question is outside scope
Green flags:
- Explicit disclosure: "I don't have that information. Do you want me to flag it for [expert] to follow up?"
- Offer to escalate or to refer out
- The answer, if given, is narrow and cites the specific part of the KB it came from
This test is the fastest way to tell prompt-level grounding from architectural grounding. Prompt-level systems can be talked into answering. Architectural systems can't.
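To make the distinction concrete, here is a minimal sketch of what a threshold gate could look like. It is an illustration under stated assumptions, not any platform's actual implementation: the names `Chunk`, `answer`, `fake_retrieve`, and `fake_generate` are hypothetical, and the 0.75 threshold is an arbitrary placeholder. The point is structural: when nothing retrieved clears the threshold, the model is never called, so there is nothing to talk into answering.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    source: str        # hypothetical document identifier
    text: str
    similarity: float  # retrieval score, assumed to lie in [0, 1]


REFUSAL = "I don't have that in my knowledge base."


def answer(
    question: str,
    retrieve: Callable[[str], List[Chunk]],
    generate: Callable[[str, List[Chunk]], str],
    threshold: float = 0.75,  # assumed value, for illustration only
) -> str:
    """Call the model only when retrieval clears the similarity threshold."""
    grounded = [c for c in retrieve(question) if c.similarity >= threshold]
    if not grounded:
        # The model is never invoked on this path, so it cannot improvise.
        return REFUSAL
    return generate(question, grounded)


# Stand-ins for a real retriever and model, for illustration only.
def fake_retrieve(question: str) -> List[Chunk]:
    score = 0.91 if "feedback" in question.lower() else 0.12
    return [Chunk("leadership-notes.md", "Give feedback within 24 hours.", score)]


def fake_generate(question: str, chunks: List[Chunk]) -> str:
    return f"{chunks[0].text} (source: {chunks[0].source})"


print(answer("How should I give feedback?", fake_retrieve, fake_generate))
print(answer("What is the tax rate on stock options?", fake_retrieve, fake_generate))
```

A prompt-level system, by contrast, would pass the low-scoring chunk to the model along with an instruction not to guess, which is the request-not-constraint problem described earlier.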
Test 2: The out-of-scope question
Similar to Test 1, but subtler: ask a question that sounds like the expert's domain but is deliberately outside it. For a recruiting expert: ask about compensation law in a country where the expert doesn't practice. For a coach: ask about a similar-sounding framework by a different author.
What you're testing: Whether the protégé distinguishes "I know this topic generally" from "I know this from the expert's perspective."
Red flags:
- The protégé confidently attributes a competitor's framework to the expert
- The protégé blends its own knowledge with the expert's without distinguishing
- No acknowledgment that the question shades outside the expert's specific work
Green flags:
- The protégé says "[Expert] doesn't work on that specifically. Would you like me to note it for follow-up?"
- Explicit attribution when the answer draws from the expert's published work
- Clear scope-setting in the opening greeting that primes this behavior
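To keep Tests 1 and 2 reproducible across platforms, it can help to run the same probe questions each time and record which responses disclose a gap. The sketch below is one rough way to do that, assuming you can capture the protégé's replies as text. The function names and the phrase list are hypothetical, and phrase matching is only a first pass: a flagged response still needs a human read.

```python
import re
from typing import Callable, Dict, Iterable

# Phrases drawn from the green-flag examples above; extend as needed.
DISCLOSURE_PATTERNS = [
    r"\bI don't have that\b",
    r"\bnot in my knowledge base\b",
    r"\bdoesn't work on that\b",
    r"\bflag it for\b",
    r"\bnote it for follow-up\b",
]


def discloses_gap(response: str) -> bool:
    """Return True if the response contains an explicit scope disclosure."""
    normalized = response.replace("\u2019", "'")  # treat curly apostrophes as straight
    return any(re.search(p, normalized, re.IGNORECASE) for p in DISCLOSURE_PATTERNS)


def run_probe(questions: Iterable[str], ask: Callable[[str], str]) -> Dict[str, bool]:
    """Ask each probe question and record whether the reply disclosed a gap."""
    return {q: discloses_gap(ask(q)) for q in questions}


# Canned replies standing in for a live session, for illustration only.
canned = {
    "What is the tax treatment of stock options?":
        "Stock options are generally taxed at a flat 20% rate.",
    "What does compensation law require in Brazil?":
        "I don't have that information. Do you want me to flag it for follow-up?",
}

results = run_probe(canned, lambda q: canned[q])
for question, disclosed in results.items():
    print("PASS" if disclosed else "FAIL", "-", question)
```

Every question in the probe set should be one the expert genuinely cannot answer, so any `FAIL` is a confident answer that needs explaining.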
Test 3: The citation trace
Ask a question the expert does know well. Get the answer. Then try to trace the answer back to source material: a document, a transcript, a specific KB entry.
What you're testing: Whether the platform exposes an audit trail.
Red flags:
- No way to see what was retrieved
- Citations, if shown, are vague ("from your uploaded content")
- No similarity scores or threshold information
Green flags:
- You can see the specific chunks retrieved and their similarity scores
- You can click through to the source document and see the passage
- The transcript log records the threshold in force at the time
If a platform can't produce a citation trace, it can't defend its answers. That's disqualifying in any regulated vertical and concerning in any engagement where a client might push back on something the protégé said.
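As a rough picture of what a usable citation trace contains, the sketch below models one answer's audit record and checks it for the gaps listed above. The field names (`document`, `passage`, `similarity`, `threshold`) are assumptions for illustration; real platforms will expose this information in their own shape, if they expose it at all.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RetrievedChunk:
    document: str      # source document the passage came from
    passage: str       # the text that was retrieved
    similarity: float  # score assigned at retrieval time


@dataclass
class AnswerTrace:
    question: str
    answer: str
    threshold: float   # threshold in force when the answer was produced
    chunks: List[RetrievedChunk] = field(default_factory=list)


def trace_problems(trace: AnswerTrace) -> List[str]:
    """List the reasons this trace would not let you defend the answer."""
    problems = []
    if not trace.chunks:
        problems.append("no retrieved chunks recorded")
    for chunk in trace.chunks:
        if not chunk.document or not chunk.passage:
            problems.append("chunk missing source document or passage")
        if chunk.similarity < trace.threshold:
            problems.append(f"chunk from {chunk.document!r} is below the threshold")
    return problems


good = AnswerTrace(
    question="How should I structure a first one-on-one?",
    answer="Start with the report's agenda, not yours.",
    threshold=0.75,
    chunks=[RetrievedChunk("one-on-ones.pdf", "Start with their agenda.", 0.82)],
)
vague = AnswerTrace(
    question="How should I structure a first one-on-one?",
    answer="Start with the report's agenda, not yours.",
    threshold=0.75,
)

print(trace_problems(good))
print(trace_problems(vague))
```

An empty list means the answer can be traced to a specific passage that cleared the threshold. A citation like "from your uploaded content" corresponds to the second record: an answer with nothing behind it that you can inspect.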
Test 4: The refinement round-trip
Do a session. Note something specific you'd want to change: a tone issue, a missing framework, a handling of a particular question type. Submit that feedback to the platform. Then do another session and see if the change is live.
What you're testing: How fast the platform closes the loop between "I noticed something" and "the protégé behaves differently."
Red flags:
- Feedback disappears into a queue with no timeline
- Changes require admin-panel edits by platform staff
- You have to re-upload documents or re-run training to see changes
Green flags:
- Feedback extracts into structured instructions you can see
- Regeneration and redeploy happen in one request
- The next session shows the change
On Apex Replicant, this is the refinement loop: plain-English feedback → our system extracts structured instructions → protégé regenerates → redeploys to our voice platform, all in one request. See "Why most AI coaching clones fail at nuance" for why this loop matters more than the initial onboarding.
Test 5: The transcript audit
After a session, pull the full transcript. Check that you can see:
- The client's questions verbatim
- The protégé's answers verbatim
- Which parts of the KB were retrieved to produce each answer
- Whether the similarity threshold was met or the "I don't know" path was triggered
- The structured insights extracted by session insight analysis
What you're testing: Whether the platform treats the session as an auditable artifact or a disposable interaction.
Red flags:
- No transcript, or transcript stored in an opaque format
- No retrieval information
- Session insights, if present, are generic summaries without category structure
Green flags:
- Full transcript with timestamps
- Retrieval chunks visible with similarity scores
- Insights structured across categories (sentiment, topics, action items, unanswered questions, etc.)
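If a platform lets you export transcripts, the checklist above can be run mechanically. The sketch below assumes a hypothetical export shape (a dict with `turns` and `insights` keys); the field and category names are illustrative stand-ins, so map them to whatever the platform actually produces.

```python
from typing import Any, Dict, List

# Assumed field names for one client/protégé exchange.
REQUIRED_TURN_FIELDS = ("timestamp", "question", "answer", "retrieved", "threshold_met")
# Assumed insight categories, taken from the green-flag list above.
INSIGHT_CATEGORIES = ("sentiment", "topics", "action_items", "unanswered_questions")


def audit_transcript(transcript: Dict[str, Any]) -> List[str]:
    """Return the audit gaps found in an exported session transcript."""
    gaps = []
    turns = transcript.get("turns", [])
    if not turns:
        gaps.append("no turns recorded")
    for i, turn in enumerate(turns):
        for name in REQUIRED_TURN_FIELDS:
            if name not in turn:
                gaps.append(f"turn {i}: missing {name}")
        # A grounded answer should list the chunks it was built from.
        if turn.get("threshold_met") and not turn.get("retrieved"):
            gaps.append(f"turn {i}: answered without retrieval records")
    insights = transcript.get("insights", {})
    for category in INSIGHT_CATEGORIES:
        if category not in insights:
            gaps.append(f"insights: missing {category}")
    return gaps


session = {
    "turns": [
        {
            "timestamp": "2026-04-22T10:00:00Z",
            "question": "How do I handle a missed deadline?",
            "answer": "Name the impact first, then ask what got in the way.",
            "retrieved": [{"document": "feedback.md", "similarity": 0.88}],
            "threshold_met": True,
        },
        {
            "timestamp": "2026-04-22T10:02:00Z",
            "question": "What is the tax rate on bonuses?",
            "answer": "I don't have that information.",
            "retrieved": [],
            "threshold_met": False,
        },
    ],
    "insights": {"sentiment": "neutral", "topics": ["deadlines"]},
}

for gap in audit_transcript(session):
    print(gap)
```

Note that the second turn passes: an "I don't know" with an empty retrieval list is exactly what the record should show when the threshold was not met.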
Scoring the platform
Run the five tests. Each is pass/fail. A platform that passes all five is architecturally serious about accuracy. A platform that passes two or fewer is using "accuracy" as a marketing word, not an engineering constraint.
Here's the shorthand I use when evaluating any AI clone or RAG product:
- 5/5: architectural grounding, auditable, fast refinement loop. Safe for regulated verticals and reputation-sensitive engagements.
- 3–4/5: good posture but gaps. Usable for most coaching contexts; risky for regulated verticals.
- ≤2/5: prompt-level grounding at best. Avoid for anything where an incorrect answer has a reputational or regulatory cost.
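The shorthand above is simple enough to write down as a function, which is handy when comparing several platforms side by side. This is a minimal sketch; the test keys and the `score_platform` name are made up for illustration.

```python
from typing import Dict

TESTS = (
    "adversarial",
    "out_of_scope",
    "citation_trace",
    "refinement_round_trip",
    "transcript_audit",
)


def score_platform(results: Dict[str, bool]) -> str:
    """Map pass/fail results for the five tests to the shorthand tier."""
    missing = [t for t in TESTS if t not in results]
    if missing:
        raise ValueError(f"missing results for: {', '.join(missing)}")
    passed = sum(1 for t in TESTS if results[t])
    if passed == 5:
        tier = "architectural grounding, auditable, fast refinement loop"
    elif passed >= 3:
        tier = "good posture but gaps"
    else:
        tier = "prompt-level grounding at best"
    return f"{passed}/5: {tier}"


trial = {
    "adversarial": True,
    "out_of_scope": True,
    "citation_trace": False,
    "refinement_round_trip": True,
    "transcript_audit": False,
}
print(score_platform(trial))  # prints "3/5: good posture but gaps"
```

Requiring a result for every test is deliberate: a test you could not run during the trial is a finding in its own right, not a pass.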
Platforms rarely volunteer their scores. You have to run the tests yourself. The good news is that all five can be done during a trial, in under a day of deliberate effort.
An accuracy story that can't survive five deliberate tests isn't an accuracy story; it's an accuracy wish.
FAQ
Can I really test these during a sales trial, or do I need to be a paying customer? All five tests work on a trial. The adversarial and out-of-scope questions run on any protégé. The citation trace and transcript audit require access to the admin or transcript view, and most platforms offer that to trialists. The refinement round-trip is the one that may be gated; if it is, that's signal.
What if the protégé says "I don't know" to too many real client questions? Expected early on when the knowledge base is thin. The refinement loop is how you close those gaps; flagged unanswered questions surface as refinement suggestions. Most experts see the "I don't know" rate drop sharply in weeks 2–4.
Is hallucination different from "getting something wrong because my source material was wrong"? Yes. Hallucination is the model inventing content that wasn't in any source. Source-based errors are the model accurately reflecting bad source material. Zero-hallucination architecture prevents the first; it doesn't prevent the second. The fix for bad source material is curation; in practice, the AI insight preview catches most issues before ingestion.
How does Apex Replicant score on its own five tests? All five pass as a platform property: architectural grounding (patent-protected retrieval architecture), explicit threshold behavior, full transcript and retrieval audit, a one-request refinement loop, and structured session insights. Individual protégés score based on their expert's KB and configuration, but the platform scaffolding is in place.
Do any other platforms pass all five? Based on public documentation as of our competitive refresh on 2026-04-22, none of the major competitors publicly documents all five properties. Delphi documents grounding behavior and refinement; Steno documents the interview onboarding; Personal.ai documents enterprise memory primitives. The combination of architectural grounding, transcript audit, and refinement round-trip is the one that hasn't been claimed at the architecture level.
What's the single most important of the five tests? Test 1 (adversarial question) for most buyers. It's the fastest and most diagnostic. If a platform fails test 1, it's unlikely to pass the other four, and the failure is immediately visible to you.
Drew Harris, co-founder of Expert Scale, Inc., writes on platform architecture, product decisions, and how Apex Replicant builds expert-driven AI that refuses to guess.