AI hallucination guardrails — what we built into our VA workflow

The four guardrails we put around every AI-augmented task to stop a VA from shipping a hallucination. Real examples of what slipped through, what we changed, and what we still cannot prevent.

Every AI-augmented VA agency that says “we’ve never shipped a hallucination” is lying or hasn’t been doing this long enough. We have shipped them. We have caught them. We have built guardrails because of them.

Here are the four guardrails we put in place and the cases that slipped through anyway.

What we mean by hallucination

Two flavours:

Factual hallucination. The AI states a number, source, date, or fact that is not in the prompt and is wrong. Example: “Australian Consumer Law section 42 protects refund rights” (it doesn’t; the relevant sections are 18 for misleading or deceptive conduct and 54 for the consumer guarantee of acceptable quality).

Tone hallucination. The AI states something true but in a tone that doesn’t match the brand. Example: an empathetic refund response that says “we totally understand and absolutely want to make this right!!” when our brand voice is “we’re sorry, here’s the fix, this is what we’ll do differently”. This is less obviously wrong, but it cumulatively erodes trust.

Across our 12 months of data, tone hallucinations outnumber factual hallucinations 4-to-1. We over-invested in fact-check guardrails initially and under-invested in tone guardrails.

Guardrail 1: source verification at the prompt level

Every prompt that asks for a factual output includes the constraint that any “fact” must come from the prompt itself.

Bad prompt:

Write a customer reply explaining our refund policy.

Better prompt:

Write a customer reply explaining our refund policy.

Our refund policy:
- 30-day no-questions-asked window for change-of-mind returns
- Faulty goods refunded under Australian Consumer Law
- Refunds processed within 5 business days
- Customer pays return shipping for change-of-mind

DO NOT cite any other policy. If the customer asks about something not
covered here, output [ESCALATE] instead of inventing the answer.

The bad prompt gets the AI inventing 30-day vs 14-day vs 60-day windows depending on its mood. The better prompt grounds the response.

This is the easiest guardrail to teach and the one with the biggest single payoff.
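As a minimal sketch, the better prompt can be assembled in code so the policy text and the escalation instruction are never left out. The function names (`build_grounded_prompt`, `needs_escalation`) are illustrative assumptions, not an existing tool; only the prompt wording and the [ESCALATE] sentinel come from the example above.

```python
ESCALATE = "[ESCALATE]"  # sentinel the prompt asks the model to emit


def build_grounded_prompt(task: str, heading: str, facts: list[str]) -> str:
    """Embed the source of truth in the prompt and forbid anything outside it."""
    if not facts:
        # With no facts supplied, the model would have to invent them.
        raise ValueError("a factual prompt needs at least one source fact")
    bullets = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"{task}\n\n"
        f"{heading}:\n{bullets}\n\n"
        "DO NOT cite any other policy. If the customer asks about something not\n"
        f"covered here, output {ESCALATE} instead of inventing the answer."
    )


def needs_escalation(model_output: str) -> bool:
    """True when the model declined to answer and a human should take over."""
    return ESCALATE in model_output


prompt = build_grounded_prompt(
    "Write a customer reply explaining our refund policy.",
    "Our refund policy",
    [
        "30-day no-questions-asked window for change-of-mind returns",
        "Faulty goods refunded under Australian Consumer Law",
        "Refunds processed within 5 business days",
        "Customer pays return shipping for change-of-mind",
    ],
)
```

Refusing to build a prompt with an empty fact list is a design choice in this sketch: it makes the "bad prompt" above impossible to produce by accident.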

Guardrail 2: claim-level review at the edit step

Before any AI output ships, the VA marks every specific claim and verifies it.

A “claim” is anything that could be fact-checked: a number, a date, a regulation reference, a named person, a source attribution, a price, a deadline, a feature claim.

The verification loop:

  1. VA reads the AI output and highlights every claim
  2. For each, VA either:
    • Confirms the claim came from the prompt (no further check needed)
    • Verifies the claim against the source of truth (Xero, Shopify, your SOP, etc.)
    • Marks it [VERIFY] and escalates if it can’t be confirmed

This takes 30-90 seconds per output. It catches roughly 6-8 hallucinations per VA per month based on our spot-check data.

The training rule: if you’re not certain a claim is real, don’t ship it. Mark it [VERIFY] and ask.
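Part of the highlighting step can be pre-filled mechanically. The sketch below is an assumption about how that might look, not a description of the workflow above: it handles numbers only (the cheapest claim type to detect), and the names `unverified_numbers` and `tag_for_review` are hypothetical. Named people, regulation references, and source attributions still need the VA's eye.

```python
import re

# Matches integers and decimals such as 5, 14, 1,250.50.
# Assumption: numbers are a useful first pass; other claim types are not covered.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def unverified_numbers(prompt: str, output: str) -> list[str]:
    """Numbers in the AI output that never appeared in the prompt."""
    known = set(NUMBER.findall(prompt))
    flagged: list[str] = []
    for number in NUMBER.findall(output):
        if number not in known and number not in flagged:
            flagged.append(number)
    return flagged


def tag_for_review(prompt: str, output: str) -> str:
    """Append [VERIFY] after each number the prompt cannot account for."""
    flagged = set(unverified_numbers(prompt, output))

    def mark(match: re.Match) -> str:
        text = match.group(0)
        return f"{text} [VERIFY]" if text in flagged else text

    return NUMBER.sub(mark, output)


source_prompt = "30-day window. Refunds processed within 5 business days."
ai_output = "You have 30 days to return it, and refunds take 7 business days."
tagged = tag_for_review(source_prompt, ai_output)
```

Here `tagged` leaves "30" alone because it came from the prompt and marks "7" for verification. A number that matches the prompt can still be used in the wrong context, so this narrows the review and does not replace it.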

Guardrail 3: customer-facing edit floor

For any customer-facing copy (emails, social, support replies, marketing), the VA must visibly edit the AI output. Minimum threshold: at least 2 sentences different from the AI version.

Rationale: rubber-stamping AI output is the #1 failure mode (see Training a new VA on AI in their first week). The edit floor forces the VA to engage with the output rather than copy-paste.

What “edited” looks like:

  • Replaced cliched openers (“I hope this email finds you well”) with specific openers
  • Tightened over-apologetic phrases (“we’re so terribly sorry”) to measured ones (“we apologise”)
  • Added a specific detail from the customer’s history (their name, their previous order, their location)
  • Removed any sentence that adds no information

If the final output is 100 per cent identical to the AI version, the VA hasn’t done their job, and we redo it.
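The two-sentence floor can be checked automatically before sending. This is a rough sketch under stated assumptions: the sentence splitter is naive (it breaks on ., ! or ? followed by whitespace), a deleted sentence counts as an edit, and the function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Collapse internal whitespace so reflowed lines are not counted as edits.
    return [" ".join(part.split()) for part in parts if part.strip()]


def changed_sentences(ai_text: str, final_text: str) -> int:
    """Count sentences that were rewritten, added, or removed."""
    ai = set(split_sentences(ai_text))
    final = set(split_sentences(final_text))
    added = len(final - ai)
    removed = len(ai - final)
    # A rewritten sentence shows up once in each set, so take the larger
    # count rather than the sum to avoid counting it twice.
    return max(added, removed)


def meets_edit_floor(ai_text: str, final_text: str, minimum: int = 2) -> bool:
    """True when at least `minimum` sentences differ from the AI version."""
    return changed_sentences(ai_text, final_text) >= minimum


ai_version = (
    "I hope this email finds you well. We're so terribly sorry. "
    "Your refund is on its way."
)
final_version = (
    "Hi Sam, thanks for flagging the late delivery. We apologise. "
    "Your refund is on its way."
)
passes = meets_edit_floor(ai_version, final_version)
```

A check like this only proves that text changed, not that the changes were good, so it complements the human review and cannot stand in for it.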

Guardrail 4: the day-5 competency check

Every new VA passes a recognition drill on day 5 of onboarding before they go live (see Training a new VA on AI in their first week).

The drill: we show them deliberately broken AI outputs containing the three failure modes (confident invention, tone drift, generic phrasing). They must identify each failure and fix it.

Pass: they go live with weekly spot-checks for the first 30 days.

Fail: they get a second week of supervised practice. We don’t push a VA to handle AI-augmented customer-facing work until they pass.

Roughly 12 per cent of new VAs need the second week. Roughly 3 per cent never pass, and we replace them.

What still slips through

Honest cases from the last 12 months:

Case 1: Fabricated study citation in a blog draft. AI cited “a 2023 University of Melbourne study showing 78 per cent of small businesses…” The study did not exist. VA didn’t verify because the AI cited a confidence-inspiring source. Caught at editorial review (we have a senior content checker on any blog draft before publication). The study citation was removed; we now require every external citation to have a URL or [VERIFY] tag.

Case 2: Wrong product weight in a Shopify product description. AI invented a “250g” weight where the actual product was 175g. VA didn’t catch because the spec sheet wasn’t in the prompt. Customer complained on receipt. We refunded shipping. Workflow now requires the spec sheet to be in every product-description prompt; weights are tagged with [SPEC] to indicate they need verification.

Case 3: Tone drift on a complaint resolution. AI’s reply was technically correct but used the phrase “totally understand”. Customer was already inflamed and read it as condescending. Reply was sent before we caught it. We apologised separately. Now: any reply to a complaint flagged with severity > 5/10 gets a founder review before sending.

Case 4: Date hallucination on a calendar invite. AI generated a Calendly invite for “Tuesday 14 May” when 14 May was actually a Wednesday. VA was on autopilot, didn’t double-check. Customer arrived on Tuesday 13th to a no-show. Fix: all date generation now uses a function-call to the system clock, not AI free-text.

Each case prompted a workflow change. The guardrails are not “set and forget” — they evolve with new failure modes.
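The Case 4 fix is the most mechanical of the four. As a sketch of the idea (not the actual function call used in the workflow, which the post does not show), the weekday is derived from the calendar date instead of being trusted from free text. The year 2025 is an assumption chosen because 14 May falls on a Wednesday that year; the function names are hypothetical.

```python
from datetime import date

# Explicit English names avoid locale-dependent strftime output.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def format_invite_date(day: date) -> str:
    """Build the human-readable date from the calendar, never from AI text."""
    return f"{WEEKDAYS[day.weekday()]} {day.day} {MONTHS[day.month - 1]}"


def weekday_matches(claimed_weekday: str, day: date) -> bool:
    """Check a weekday the AI wrote against the real calendar."""
    return WEEKDAYS[day.weekday()].lower() == claimed_weekday.strip().lower()


meeting = date(2025, 5, 14)  # assumed year, for illustration only
label = format_invite_date(meeting)
ai_was_right = weekday_matches("Tuesday", meeting)
```

With these inputs, `label` is "Wednesday 14 May" and `ai_was_right` is False, which is the mismatch that sent the customer to the wrong day.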

The hardest case

The one we still can’t fully prevent: when the VA’s verification is itself wrong.

Example: AI cites “ATO Determination 2024/3”. VA googles, sees results, accepts as valid. The first Google result is itself AI-generated content citing the same fabricated determination. The hallucination has propagated through the broader web.

Mitigation: for legal, tax, and regulatory references specifically, we require verification against the primary source (ato.gov.au, legislation.gov.au, ahpra.gov.au, mara.gov.au) — not against secondary search results.
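One way to enforce the primary-source rule is to check the domain of the URL the VA verified against. The sketch below assumes the VA records that URL; the allowlist is the four domains named above, and `is_primary_source` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# The four primary-source domains listed in the mitigation above.
PRIMARY_SOURCES = {
    "ato.gov.au",
    "legislation.gov.au",
    "ahpra.gov.au",
    "mara.gov.au",
}


def is_primary_source(url: str) -> bool:
    """True only when the URL's host is a listed domain or a subdomain of one."""
    # hostname is None when the URL has no scheme, which is treated as a failure.
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in PRIMARY_SOURCES
    )


checks = {
    "https://www.ato.gov.au/some-page": is_primary_source(
        "https://www.ato.gov.au/some-page"
    ),
    "https://ato.gov.au.example.com/": is_primary_source(
        "https://ato.gov.au.example.com/"
    ),
}
```

Matching on the exact host or a dot-prefixed suffix is deliberate: a plain substring check would accept lookalike domains such as the second URL in `checks`. Confirming the domain does not confirm the determination exists on that site; the VA still has to find it there.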

For everything else, the residual risk is real but manageable. We have caught 6 propagated hallucinations in 12 months across 30+ placements. None have caused material customer harm.

What we tell clients

Honesty over optics:

  • We use AI in roughly 65-70 per cent of VA tasks.
  • We’ve shipped at least 4 visible hallucinations in 12 months across 30 placements.
  • Each one taught us a guardrail. The rate is dropping.
  • We will not promise zero hallucinations. We will promise faster detection and faster correction than any non-AI workflow.

Clients respond well to this. Vibes-y “we have it under control” loses business.

What’s next

For the wider context, see Why we don’t replace VAs with AI.

For the training plan that builds in these guardrails, see Training a new VA on AI in their first week.

For the SOP format that supports verifiable outputs, see Writing SOPs your VA AND Claude can both follow.

If you want a placement that uses these guardrails by default, book a discovery call.
