Token Spend Anomaly Detection for Always-On AI Agents

Always-on agents rarely fail with one dramatic invoice. They drift. A new fallback path fires too often, a retrieval loop gets chatty, or a verifier starts replaying large prompts after every timeout. Everything still works, but your daily spend graph starts leaning the wrong way.

The annoying part is that plain provider dashboards usually tell you that spend is up, not which workflow changed shape. If you run coding agents, cron workers, support automation, or evaluation jobs, you need task-level cost signals before the monthly bill becomes the first alert.

This post walks through a practical token spend anomaly pipeline: label every run, build rolling baselines per lane, and fire human-actionable alerts. Along the way it avoids the two classic mistakes: alerting on normal product growth, and ignoring prompt inflation hidden inside successful runs.

Why this matters

Agent systems create cost in ways that look legitimate at first glance. Better models quietly become the default, retries multiply prompt overhead, retrieval expands context packets over time, and background jobs keep running long after the original need fades.

That makes cost a reliability concern, not just a finance report. If a run suddenly needs 4x more input tokens, something operational changed, even if the user-facing result still looks fine.

A useful cost alert should answer four questions:

Which workflow lane got more expensive?
Is the jump volume-driven or per-run inflation?
Did a model route, retry path, or context packet change?
What should the operator do next?
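
The second question, volume versus per-run inflation, comes down to simple arithmetic: daily spend is run count times average cost per run, so a change in spend splits into a volume term and a per-run term. Here is a minimal sketch, assuming you already have run counts and average per-run cost for a baseline day and the current day. The function name and the numbers are illustrative, not taken from a real system.

```python
def decompose_spend_change(base_runs: int, base_avg_cost: float,
                           cur_runs: int, cur_avg_cost: float) -> dict:
    """Split a change in daily spend into volume and per-run effects.

    total = runs * avg_cost, so
    delta = (cur_runs - base_runs) * base_avg_cost      # volume effect
          + (cur_avg_cost - base_avg_cost) * cur_runs   # per-run effect
    """
    volume_effect = (cur_runs - base_runs) * base_avg_cost
    per_run_effect = (cur_avg_cost - base_avg_cost) * cur_runs
    return {
        "delta_usd": volume_effect + per_run_effect,
        "volume_effect_usd": volume_effect,
        "per_run_effect_usd": per_run_effect,
    }


# Illustrative numbers: traffic grew 10%, but per-run cost grew 50%.
result = decompose_spend_change(100, 0.10, 110, 0.15)
```

In this made-up case most of the increase sits in the per-run term, which points at the workflow itself rather than at growth.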

Architecture or workflow overview

flowchart LR
  A[agent run] --> B[usage event]
  B --> C[label normalizer]
  C --> D[time-series store]
  D --> E[baseline job]
  E --> F[anomaly scorer]
  F --> G[alert router]
  G --> H[Slack or Discord summary]
  F --> I[budget lane action]
  I --> J[throttle fallback or human review]

The key design choice is to baseline lanes, not the whole platform. A nightly eval suite should not share a baseline with customer-facing chat or a repo automation worker.
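
One way to keep baselines separate is to bucket per-run costs by a (workflow, lane) key before any scoring happens. The sketch below assumes events are plain dicts with `workflow`, `lane`, and `cost_usd` fields. The helper name and the `nightly-eval` and `batch-low-risk` labels are hypothetical.

```python
from collections import defaultdict


def group_costs_by_lane(events: list[dict]) -> dict[tuple[str, str], list[float]]:
    """Bucket per-run costs by (workflow, lane) so each bucket gets its own baseline."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for event in events:
        buckets[(event["workflow"], event["lane"])].append(event["cost_usd"])
    return dict(buckets)


# Hypothetical events: an eval suite and a repo worker never share a bucket.
events = [
    {"workflow": "nightly-eval", "lane": "batch-low-risk", "cost_usd": 1.20},
    {"workflow": "repo-pr-worker", "lane": "scheduled-medium-risk", "cost_usd": 0.21},
    {"workflow": "repo-pr-worker", "lane": "scheduled-medium-risk", "cost_usd": 0.19},
]
by_lane = group_costs_by_lane(events)
```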

Implementation details

Start with one rule: every model invocation must emit a normalized usage event, even if the request fails.

1) Emit lane-aware usage events

{
  "ts": "2026-05-23T11:42:18Z",
  "run_id": "run_8fd2",
  "workflow": "repo-pr-worker",
  "lane": "scheduled-medium-risk",
  "model": "gpt-5.4",
  "input_tokens": 18420,
  "output_tokens": 2110,
  "cached_input_tokens": 9400,
  "tool_calls": 6,
  "retry_count": 1,
  "status": "success",
  "git_repo": "negiadventures.github.io",
  "cost_usd": 0.2142
}

The lane field matters more than most teams expect. If you only group by model, you miss the operational story. The expensive run is usually tied to a workflow pattern, not just a model name.
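
A small normalizer can enforce the "every invocation emits an event" rule, including for failed calls that report no usage. This is a sketch under assumptions: the field names follow the example event above, while `REQUIRED_FIELDS`, `COUNTER_FIELDS`, and `normalize_usage_event` are names invented here for illustration.

```python
from datetime import datetime, timezone

# Labels that must be present before an event is accepted (assumed policy).
REQUIRED_FIELDS = ("run_id", "workflow", "lane", "model", "status")
# Numeric fields that default to zero when the provider reports nothing.
COUNTER_FIELDS = ("input_tokens", "output_tokens", "cached_input_tokens",
                  "tool_calls", "retry_count")


def normalize_usage_event(raw: dict) -> dict:
    """Return an event with every expected field present, even for failed calls."""
    missing = [name for name in REQUIRED_FIELDS if not raw.get(name)]
    if missing:
        raise ValueError(f"usage event missing labels: {missing}")

    event = {name: str(raw[name]).strip() for name in REQUIRED_FIELDS}
    event["lane"] = event["lane"].lower()
    event["status"] = event["status"].lower()

    # Failed calls may report no usage; record zeros rather than dropping the event.
    for name in COUNTER_FIELDS:
        event[name] = int(raw.get(name) or 0)
    event["cost_usd"] = float(raw.get("cost_usd") or 0.0)
    event["ts"] = raw.get("ts") or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return event


# A failed call with no usage data still becomes a complete event.
failed = normalize_usage_event({
    "run_id": "run_8fd3",
    "workflow": "repo-pr-worker",
    "lane": " Scheduled-Medium-Risk ",
    "model": "gpt-5.4",
    "status": "ERROR",
})
```

Rejecting unlabeled events at the edge is a deliberate choice in this sketch: an event without a lane cannot be baselined, so it is better to find out at emit time.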

2) Score anomalies against a rolling baseline

I like an EWMA baseline plus median absolute deviation because it is simple, explainable, and good enough for most agent fleets.

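from dataclasses import dataclass
from statistics import median


@dataclass
class UsagePoint:
    lane: str
    cost_usd: float
    input_tokens: int
    output_tokens: int


def mad(values: list[float]) -> float:
    """Median absolute deviation, floored so it is safe to divide by."""
    center = median(values)
    return median([abs(v - center) for v in values]) or 0.0001


def score_lane(points: list[UsagePoint], current: UsagePoint, alpha: float = 0.25) -> dict:
    """Score the current run against this lane's history (oldest point first)."""
    if not points:
        raise ValueError("score_lane needs at least one historical point for the lane")

    # EWMA over per-run cost: recent runs weigh more than old ones.
    costs = [p.cost_usd for p in points]
    ewma = costs[0]
    for value in costs[1:]:
        ewma = alpha * value + (1 - alpha) * ewma

    spread = mad(costs)
    z_like = (current.cost_usd - ewma) / spread
    return {
        "baseline_cost": round(ewma, 4),
        "spread": round(spread, 4),
        "score": round(z_like, 2),
        # Require both a large relative jump and a large robust score, so lanes
        # with a tiny spread do not page on small absolute moves.
        "is_anomalous": current.cost_usd > ewma * 1.8 and z_like >= 4.0,
    }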

This is intentionally boring. That is a feature. Operators can reason about it during an incident. Fancy black-box anomaly detectors tend to lose trust fast when they cannot explain why a normal traffic spike got paged.
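
To make the scoring rule concrete, here is the arithmetic on a tiny made-up history, using the same alpha of 0.25 and the same 1.8x and 4.0 thresholds as the scorer above. The cost values are illustrative only.

```python
from statistics import median

history = [0.10, 0.12, 0.08, 0.10]  # illustrative per-run costs, oldest first
alpha = 0.25

# EWMA: start at the oldest point and fold in each newer one.
ewma = history[0]
for cost in history[1:]:
    ewma = alpha * cost + (1 - alpha) * ewma

# MAD: median distance from the median, a spread measure that resists outliers.
center = median(history)
spread = median([abs(cost - center) for cost in history])


def is_anomalous(current_cost: float) -> bool:
    """Both gates must pass: a 1.8x relative jump and a robust score of at least 4."""
    score = (current_cost - ewma) / spread
    return current_cost > ewma * 1.8 and score >= 4.0


spike = is_anomalous(0.30)   # far above baseline on both measures
bump = is_anomalous(0.15)    # high score on a tight lane, but under 1.8x baseline
```

Here the EWMA lands near 0.099 and the MAD near 0.01. A run at 0.30 clears both gates. A run at 0.15 has a score above 4 because the lane is tight, yet it stays under 1.8x the baseline, so it does not page anyone.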

3) Generate alerts with an operator hint

An alert without probable cause is just a guilt delivery system. Include model route, retry count, cache effectiveness, and context size change.

def build_alert(run: dict, baseline: dict) -> str:
    """Render a chat-ready alert from a usage event and a score_lane() result."""
    # Assumes input_tokens includes cached tokens, so the ratio stays in [0, 1].
    cache_ratio = 0.0
    if run["input_tokens"]:
        cache_ratio = run.get("cached_input_tokens", 0) / run["input_tokens"]

    return f"""
Lane: {run['lane']}
Workflow: {run['workflow']}
Cost: ${run['cost_usd']:.4f} vs baseline ${baseline['baseline_cost']:.4f}
Score: {baseline['score']}
Model: {run['model']}
Retries: {run['retry_count']}
Cache hit ratio: {cache_ratio:.0%}
Likely check: route change, retry storm, or larger context packet
Recommended action: inspect last 20 runs and compare token shape before throttling
""".strip()

Example terminal summary

$ agent-cost-watch daily-summary --lane scheduled-medium-risk

lane=scheduled-medium-risk
runs=148
avg_cost=$0.071
current_p95=$0.204
baseline_p95=$0.109
anomalies=3
likely_cause=retry inflation after verifier timeout change
next_action=cap retries at 1 and re-check context packet size
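
Fields like `runs`, `avg_cost`, and the p95 values in a summary like this can be derived from per-run costs with a few lines of code. The sketch below uses a nearest-rank percentile, which is my assumption for illustration and not necessarily how any particular tool computes p95. Both function names are hypothetical.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_lane(costs: list[float]) -> dict:
    """Per-lane summary: run count, mean cost, and p95 cost."""
    return {
        "runs": len(costs),
        "avg_cost": sum(costs) / len(costs),
        "p95": percentile_nearest_rank(costs, 95),
    }


# Illustrative lane: 18 cheap runs and two expensive ones.
summary = summarize_lane([0.05] * 18 + [0.20, 0.40])
```

Comparing p95 against its baseline, rather than the mean alone, is what surfaces a handful of inflated runs inside an otherwise normal day.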

What went wrong and tradeoffs

The first approach I would not trust on its own is a hard daily cap. Static budget caps are useful, but they are too blunt by themselves. They catch catastrophes late and say nothing about why costs shifted.

Control style	Good at	Weak at	Best use
Static daily cap	catching runaway spend	misses gradual drift	final safety net
Pure anomaly baseline	spotting shape changes	noisy during launches	operator visibility
Hybrid cap plus baseline	alerting early and containing blast radius	more moving parts	most production agent systems
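
The hybrid row can be reduced to a small decision function: the hard cap contains blast radius, and the anomaly flag only asks a human to look. This is a sketch with hypothetical parameter names and action labels. It deliberately has no branch that swaps models automatically.

```python
def decide_action(lane_spend_today_usd: float, hard_cap_usd: float, is_anomalous: bool) -> str:
    """Pick the lane action for a hybrid cap-plus-baseline control."""
    # The cap is the final safety net and wins over everything else.
    if lane_spend_today_usd >= hard_cap_usd:
        return "throttle"
    # An anomaly under the cap is a visibility signal, not an automatic fix.
    if is_anomalous:
        return "human_review"
    return "none"


over_cap = decide_action(52.0, 50.0, is_anomalous=False)
drifting = decide_action(12.0, 50.0, is_anomalous=True)
normal = decide_action(12.0, 50.0, is_anomalous=False)
```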

Pitfalls:

Mixing unrelated workflows into one baseline, which hides real anomalies.
Ignoring failed runs, even though retries and failed tool loops often create the spend spike.
Alerting on total daily cost instead of per-run inflation, which confuses growth with waste.
Missing cached-token ratios, so prompt regressions look like model-price changes.

Cost alerts can also expose private workload names, repo names, or user identifiers if you dump raw labels into chat. Normalize labels up front and redact anything user-derived before sending alerts outside your metrics store.
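
One way to apply that rule is an allowlist: pass through a few known-safe fields, replace sensitive labels with a salted hash so operators can still correlate alerts, and drop everything else. This is a sketch under assumptions. Which fields count as safe is a policy decision for your team (workflow names may need hashing too), and `user_id`, `prompt_preview`, and the function name are hypothetical.

```python
import hashlib

# Assumed policy: these fields are safe to show in chat as-is.
SAFE_FIELDS = {"lane", "workflow", "model", "retry_count", "cost_usd", "status"}
# Assumed policy: these are replaced by a short salted hash.
HASHED_FIELDS = {"git_repo", "user_id"}


def redact_for_chat(event: dict, salt: str) -> dict:
    """Return a copy of the event that is safe to post outside the metrics store."""
    out = {}
    for key, value in event.items():
        if key in SAFE_FIELDS:
            out[key] = value
        elif key in HASHED_FIELDS:
            digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
            out[key] = f"{key}:{digest[:10]}"
        # Any field not on either list is dropped.
    return out


redacted = redact_for_chat(
    {
        "lane": "scheduled-medium-risk",
        "workflow": "repo-pr-worker",
        "git_repo": "negiadventures.github.io",
        "cost_usd": 0.2142,
        "prompt_preview": "hypothetical raw prompt text",
    },
    salt="rotate-me",
)
```

The same salt yields the same hash for the same repo, so repeated alerts about one repo can still be grouped without naming it.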

I would not let anomaly detection auto-switch every workflow to a cheaper model the moment spend rises. That often turns one problem into two: higher costs and degraded outcomes.

Practical checklist or decision framework

What I would do again:

Label every run with workflow and lane before building any dashboards.
Track per-run cost, retry count, cache ratio, and input token growth together.
Use a simple explainable baseline before reaching for more complex detection.
Send alerts with one probable-cause hint and one concrete next step.
Keep a hard budget ceiling as the final blast-radius guard.

[ ] every model call emits a normalized usage event
[ ] failed runs count toward spend analysis
[ ] baselines are separated by workflow lane
[ ] alerts include retry count and cache effectiveness
[ ] operators can compare the current run against the previous 20 runs
[ ] a hard spend cap still exists for catastrophic failures
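
The checklist item about comparing the current run against the previous 20 runs could be backed by a small "token shape" helper like the one below. It is a sketch that assumes runs are dicts with the fields from the usage event schema. The function name and the sample history are made up.

```python
from statistics import median

SHAPE_FIELDS = ("input_tokens", "output_tokens", "cached_input_tokens", "retry_count")


def token_shape_delta(previous_runs: list[dict], current: dict, window: int = 20) -> dict:
    """Compare the current run to the median of the most recent runs, field by field."""
    recent = previous_runs[-window:]
    if not recent:
        raise ValueError("need at least one previous run to compare against")

    delta = {}
    for field in SHAPE_FIELDS:
        typical = median([run.get(field, 0) for run in recent])
        value = current.get(field, 0)
        delta[field] = {
            "typical": typical,
            "current": value,
            # A ratio is undefined when the typical value is zero.
            "ratio": value / typical if typical else None,
        }
    return delta


# Made-up history of 20 similar runs, then the expensive run from the schema example.
previous = [
    {"input_tokens": 5000, "output_tokens": 2000, "cached_input_tokens": 2500, "retry_count": 0}
    for _ in range(20)
]
current_run = {"input_tokens": 18420, "output_tokens": 2110,
               "cached_input_tokens": 9400, "retry_count": 1}
shape = token_shape_delta(previous, current_run)
```

In this made-up case input tokens are several times the typical value while output tokens barely move, which is the signature of prompt or context inflation rather than longer answers.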

Conclusion

If you run agents continuously, token spend is an operational signal, not just a billing artifact. The practical win is not a perfect forecast. It is catching quiet per-run inflation early enough that a human can fix the workflow before the invoice becomes the incident report.
