Shipping an AI feature straight into the user path is how teams end up learning from angry customers instead of from data. The model looks fine in demos, then quietly misroutes tickets, leaks raw context into logs, or burns budget because the real workload is messier than the test prompt pack.
Shadow mode fixes that. You let the model see production-shaped traffic, capture its decisions, and score its behavior before it is allowed to affect user-visible outcomes.
This post walks through a practical shadow mode rollout for AI features, including replay architecture, scorecards, redaction boundaries, and the gates I would require before turning on actions.
## Why this matters
Most AI failures are not syntax failures. They are judgment failures under noisy inputs, stale retrieval, missing tools, unexpected latency, and ambiguous user intent. That is why a clean notebook demo tells you almost nothing about production readiness.
Shadow mode gives you real traffic shape, side-by-side scoring, and promotion evidence that is stronger than gut feel. It works especially well for drafting, routing, ranking, and tool-selection features where correctness and policy both matter.
## Architecture and workflow overview
```mermaid
flowchart LR
    A[User request] --> B[Primary production path]
    A --> C[Shadow traffic fork]
    C --> D[Redaction + trace envelope]
    D --> E[Model inference]
    E --> F[Shadow result store]
    F --> G[Scorer + policy checks]
    G --> H[Daily scorecards]
    H --> I{Promotion gate}
    I -->|pass| J[Enable limited live actions]
    I -->|fail| K[Stay shadow-only and fix]
```
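The right-hand side of the diagram is just a consumer loop over the shadow queue. Here is a minimal sketch in Python; the `queue`, `model`, `result_store`, and `scorer` objects are hypothetical stand-ins for whatever your stack provides, not a specific library.

```python
import time

def shadow_worker(queue, model, result_store, scorer):
    """Drain the shadow queue: run inference, persist the run, score it."""
    while True:
        envelope = queue.dequeue(timeout_s=5)
        if envelope is None:
            continue  # nothing to replay right now

        started = time.monotonic()
        # Inference only: no tool side effects are reachable in shadow mode.
        output = model.infer(envelope["input"])
        latency_ms = int((time.monotonic() - started) * 1000)

        run = {
            "request_id": envelope["requestId"],
            "output": output,
            "latency_ms": latency_ms,
        }
        result_store.save(run)   # feeds the daily scorecards
        scorer.score_async(run)  # policy checks + failure labels
```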
## Implementation details
### 1) Fork traffic without leaking unsafe context
The shadow request should carry enough context to reproduce the decision, but not every raw payload field. Redact first, then replay.
```typescript
import type { Request, Response, NextFunction } from "express";
import { enqueueShadowRun } from "./shadow-queue";
import { redactForModel } from "./redact";

// `id`, `user`, and `context` are attached by upstream middleware
// (request-id, auth, retrieval), not by base Express.
type ShadowRequest = Request & {
  id?: string;
  user?: { id: string };
  context?: { documents?: Array<{ id: string }> };
};

export function forkShadowTraffic(req: ShadowRequest, _res: Response, next: NextFunction) {
  const envelope = {
    requestId: req.id,
    route: req.path,
    actorId: req.user?.id,
    createdAt: new Date().toISOString(),
    // Redact before anything leaves the request path.
    input: redactForModel(req.body),
    // Reference documents by id; never copy their bodies into the trace.
    retrievalRefs: req.context?.documents?.map((doc) => doc.id) ?? [],
  };

  // Fire-and-forget: the shadow lane must never block or fail the user request.
  void enqueueShadowRun(envelope, {
    feature: "ticket-router-v2",
    mode: "shadow",
    deadlineMs: 12000,
  });

  next();
}
```

I would not pass raw transcripts, full customer records, or tool credentials into the shadow lane unless there is a documented reason and a retention policy.
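What belongs inside `redactForModel` depends on your data, but the safest default is an allowlist rather than a blocklist. A minimal sketch of that policy, written in Python for consistency with the scoring examples below; the field names are assumptions:

```python
ALLOWED_FIELDS = {"subject", "body_excerpt", "product_area", "locale"}

def redact_for_model(payload: dict) -> dict:
    """Keep only fields the model is allowed to see.

    An allowlist fails closed: new upstream fields stay out of the
    shadow lane until someone deliberately adds them here.
    """
    redacted = {k: v for k, v in payload.items() if k in ALLOWED_FIELDS}
    # Record what was dropped so reviewers can audit the boundary.
    redacted["_dropped_fields"] = sorted(set(payload) - ALLOWED_FIELDS)
    return redacted
```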
### 2) Score the model on behavior, not vibes
A shadow system that only logs outputs will become a graveyard of JSON. You need scoring rules tied to the actual job.
```python
from dataclasses import dataclass

@dataclass
class ScoreResult:
    correctness: float
    policy_ok: bool
    latency_ms: int
    label: str

def score_ticket_route(run, expected_team, allow_actions=False):
    predicted_team = run.output.get("team")
    action_attempted = run.output.get("action_called", False)

    correctness = 1.0 if predicted_team == expected_team else 0.0
    # In shadow mode, any attempted tool action is a policy violation.
    policy_ok = allow_actions or not action_attempted

    if not policy_ok:
        label = "policy_violation"
    elif correctness == 0.0:
        label = "wrong_route"
    elif run.latency_ms > 8000:
        label = "slow_but_usable"
    else:
        label = "pass"

    return ScoreResult(
        correctness=correctness,
        policy_ok=policy_ok,
        latency_ms=run.latency_ms,
        label=label,
    )
```

Failure labeling matters more than people think. Wrong route, tool misuse, timeout, and retrieval miss each imply a different fix.
### 3) Make promotion explicit and boring
If the switch from shadow to live is hidden in code or tribal memory, someone will flip it too early.
```yaml
feature: ticket-router-v2
rollout:
  mode: shadow
  promotion_requirements:
    min_shadow_runs: 5000
    min_accuracy: 0.93
    max_policy_violations: 0
    max_p95_latency_ms: 4500
    required_review_sample: 100
    blocked_labels:
      - prompt_injection_followed
      - pii_leak
      - wrong_high_priority_route
  next_step:
    mode: cohort_live
    percentage: 5
```
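The config only stays boring if a script, not a person, evaluates it. A minimal sketch of the gate check, assuming PyYAML and the scorecard dict from the aggregation sketch above:

```python
import yaml  # PyYAML

def check_promotion(config_path: str, scorecard: dict) -> bool:
    """Return True only if every promotion requirement is met."""
    with open(config_path) as f:
        reqs = yaml.safe_load(f)["rollout"]["promotion_requirements"]

    blocked = set(reqs["blocked_labels"])
    seen_labels = {label for label, _count in scorecard["top_failure_labels"]}

    return (
        scorecard["runs"] >= reqs["min_shadow_runs"]
        and scorecard["accuracy"] >= reqs["min_accuracy"]
        and scorecard["policy_violations"] <= reqs["max_policy_violations"]
        and scorecard["p95_latency_ms"] <= reqs["max_p95_latency_ms"]
        # Review counts come from the human-review tracker, not the scorer.
        and scorecard.get("review_sample_completed", 0) >= reqs["required_review_sample"]
        and not (blocked & seen_labels)
    )
```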
## Terminal output example

```console
$ python scripts/shadow_report.py --feature ticket-router-v2 --window 24h
feature: ticket-router-v2
mode: shadow
runs: 6842
accuracy: 94.1%
policy_violations: 0
p95_latency_ms: 3920
review_sample_completed: 124
top_failure_labels:
  retrieval_miss: 131
  wrong_route: 96
  slow_but_usable: 41
promotion_status: eligible_for_5_percent_live
```

## Tradeoffs and what went wrong
| Approach | What you gain | What it costs | Where it breaks |
|---|---|---|---|
| Demo-only launch | Fastest path to users | Almost no evidence | Real traffic surprises you immediately |
| Shadow mode | Real production-shaped evaluation | Extra infra and scoring work | Can drift if traffic replay is low quality |
| Full parallel live rollout | Strongest comparability | Higher cost and ops complexity | Risky if tool actions are not tightly sandboxed |
### Replay drift
Your shadow request is often missing hidden context from the production path. Maybe the retrieval snapshot was not captured, maybe a feature flag changed behavior, maybe a dependent service returned different data five seconds later.
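The cheapest mitigation is to snapshot the drift-prone context at fork time. A sketch of the extra fields the envelope might carry; every name and value here is an assumption, not a schema:

```python
envelope = {
    "requestId": "req-8c1f",  # hypothetical values throughout
    "input": {"subject": "refund request", "product_area": "billing"},
    "retrievalRefs": ["doc-112", "doc-584"],
    # Pin the context that tends to change between fork and replay:
    "flagSnapshot": {"router_v2_prompt": "b", "strict_mode": True},
    "retrievalSnapshotId": "snap-2024-06-01T12:00:00Z",
    "dependencyVersions": {"kb-index": "2024-05-30"},
}
```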
### PII and secret leakage
Teams often fork traffic first and think about privacy later. That is backwards. AI traces are sticky, searchable, and easy to over-retain.
### Latency blind spots
A model can look accurate in shadow mode but still be too slow for the intended UX. If you only score correctness, you may promote a feature users will hate.
## Practical checklist
- Redact inputs before they enter the shadow queue
- Capture retrieval references and feature-flag state in the replay envelope
- Block all state-changing tool actions in shadow mode (see the guard sketch after this list)
- Label failures by class, not just pass or fail
- Track latency, timeout rate, and policy violations alongside quality
- Require sampled human review for hard or high-risk cases
- Put promotion gates in config, not in verbal agreements
- Start live rollout with a narrow cohort and a fast rollback path
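For the tool-blocking item above, a minimal sketch of a guard that fails closed in shadow mode; the registry contents and tool names are assumptions:

```python
TOOL_REGISTRY = {
    "search_kb": lambda query: f"results for {query!r}",      # read-only
    "close_ticket": lambda ticket_id: f"closed {ticket_id}",  # side-effecting
}
READ_ONLY_TOOLS = {"search_kb"}

class ShadowPolicyViolation(Exception):
    """Raised when shadow traffic attempts a side-effecting tool call."""

def call_tool(name, mode="shadow", **args):
    if mode == "shadow" and name not in READ_ONLY_TOOLS:
        # The exception surfaces in the trace, so the scorer can label it.
        raise ShadowPolicyViolation(f"blocked in shadow mode: {name}")
    return TOOL_REGISTRY[name](**args)
```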
## What I would not do
I would not let a shadow system call the same side-effecting tools as production just for realism. That is how test traffic becomes real damage.
I would not promote based on a single benchmark set collected by the same team that built the feature.
I would not keep unlimited shadow traces. Evaluation data is useful, but operational hoarding creates privacy and security debt.
## Direct references
- OpenAI evals design ideas
- LangSmith tracing and evaluation docs
- OpenTelemetry documentation
- GitHub Actions environments and deployment protection rules
## Conclusion
Shadow mode is not glamorous, but it is one of the cleanest ways to make AI launches less reckless. If the model needs real traffic to be trustworthy, let it observe first, score it honestly, and only then give it the right to act.