Circuit Breakers for AI Agents That Touch Real Systems

AI agents fail differently than normal request handlers. A flaky model endpoint does not just fail one call. It can trigger retries, replan loops, duplicate tool invocations, and confused fallback behavior that burns budget while doing less work.

I have become a lot more skeptical of simple retry logic in agent systems. When an LLM is orchestrating other systems, a retry policy without a circuit breaker is often an outage amplifier.

This post shows how to wrap model calls and tool invocations in circuit breakers, what signals to trip on, how to recover safely, and where teams usually get the thresholds wrong.

Why this matters

In a production AI workflow, one broken dependency can spread across the whole run. A model API starts timing out, the planner retries and creates more requests, the executor repeats a side-effecting tool call, and the user gets a vague status update while costs keep climbing.

The practical goal is simple. When a dependency gets unhealthy, the agent should do less, explain more, and preserve the option to recover cleanly.

Architecture overview

Failure containment path

User task → Planner → Guard layer → Model or tool call
Timeout or budget breach → breaker opens → cooldown window → safe fallback or stop
Stable probe success → half-open → closed again
Mermaid version
flowchart LR
  U[User task] --> P[Planner]
  P --> G[Guard layer]
  G --> M[Model call]
  G --> T[Tool call]
  M -->|timeout/error| B[Circuit breaker]
  T -->|timeout/error| B
  B --> C[Cooldown window]
  C --> F[Fallback or safe stop]

Implementation details

1) Put a breaker in front of every unstable edge

The first mistake is having one global breaker for the entire agent. That hides the real failure domain. Breakers should usually live per model endpoint, per tool class, or per tenant-sensitive integration.

// breaker.ts
export type BreakerState = 'closed' | 'open' | 'half-open';

export class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,  // consecutive failures before opening
    private readonly halfOpenSuccesses = 2, // stable probes required to close
    private readonly cooldownMs = 30_000,   // how long to stay open before probing
  ) {}

  canExecute(now = Date.now()) {
    // Open and still cooling down: reject the call outright.
    if (this.state === 'open' && now - this.openedAt < this.cooldownMs) return false;
    // Cooldown elapsed: let probe traffic through in half-open.
    if (this.state === 'open') {
      this.state = 'half-open';
      this.successes = 0;
    }
    return true;
  }

  recordSuccess() {
    if (this.state === 'half-open') {
      this.successes += 1;
      // Close only after enough stable probes, not on the first lucky one.
      if (this.successes >= this.halfOpenSuccesses) {
        this.state = 'closed';
        this.failures = 0;
      }
      return;
    }
    this.failures = 0; // a success in the closed state resets the streak
  }

  recordFailure() {
    // A single failed probe while half-open reopens immediately.
    if (this.state === 'half-open') {
      this.state = 'open';
      this.openedAt = Date.now();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = Date.now();
    }
  }
}
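
To keep breakers per edge without a global chokepoint, a small registry keyed by dependency name is enough. A minimal sketch; the default tuning here is a placeholder, and real values belong in reviewed config (see section 3):

// breaker-registry.ts
import { CircuitBreaker } from './breaker';

const breakers = new Map<string, CircuitBreaker>();

export function breakerFor(dependency: string): CircuitBreaker {
  let breaker = breakers.get(dependency);
  if (!breaker) {
    breaker = new CircuitBreaker(); // placeholder defaults; load real policy from config
    breakers.set(dependency, breaker);
  }
  return breaker;
}

// breakerFor('planner-model') and breakerFor('github_write') trip independently,
// so a flaky write tool cannot open the path for cheap read-only fetches.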

2) Wrap model calls with timeout and token budget guards

Model outages are not always hard 500s. Often the first symptom is latency drift or cost blowups from repeated re-asks. The wrapper should account for both.

// guarded-model-call.ts
import pTimeout from 'p-timeout';
import { CircuitBreaker } from './breaker';

// One breaker per dependency: this one guards only the planner endpoint.
const plannerBreaker = new CircuitBreaker(4, 2, 45_000);

export async function guardedPlannerCall(client: any, payload: any) {
  if (!plannerBreaker.canExecute()) {
    throw new Error('planner breaker open: skip call and use fallback summary');
  }

  try {
    // Latency drift is a failure signal too, so the call gets a hard timeout.
    const result = await pTimeout(
      client.responses.create(payload),
      { milliseconds: 12_000, message: 'planner timed out' }
    );

    // Cost blowups count against breaker health, not just hard errors.
    if ((result.usage?.total_tokens ?? 0) > 40_000) {
      throw new Error('planner exceeded token budget');
    }

    plannerBreaker.recordSuccess();
    return result;
  } catch (error) {
    plannerBreaker.recordFailure();
    throw error;
  }
}
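
For completeness, here is one way a caller might consume the guard. This is a sketch: planOrDegrade and cachedSummaryFor are hypothetical names, and the fallback shape is illustrative:

// planner-usage.ts
import { guardedPlannerCall } from './guarded-model-call';

// Hypothetical stand-in: a real system would read the last good plan from a cache.
function cachedSummaryFor(taskId: string): string {
  return `cached summary for ${taskId}`;
}

export async function planOrDegrade(client: any, payload: any, taskId: string) {
  try {
    return await guardedPlannerCall(client, payload);
  } catch (error) {
    // Breaker-open, timeout, and budget errors all land here.
    // Degrade visibly instead of silently retrying.
    return {
      degraded: true,
      reason: error instanceof Error ? error.message : String(error),
      plan: cachedSummaryFor(taskId),
    };
  }
}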

3) Keep policy in config

Prompt tweaks are too soft for operational safety. Breaker policy should live in config that is reviewed like infrastructure.

# breaker-policy.yaml
models:
  planner:
    failure_threshold: 4
    cooldown_ms: 45000
    half_open_successes: 2
    timeout_ms: 12000
    max_total_tokens: 40000
  executor:
    failure_threshold: 3
    cooldown_ms: 60000
    half_open_successes: 1
    timeout_ms: 15000

tools:
  github_write:
    failure_threshold: 2
    cooldown_ms: 180000
    half_open_successes: 1
    side_effecting: true
  web_fetch:
    failure_threshold: 5
    cooldown_ms: 20000
    half_open_successes: 2
    side_effecting: false
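
Turning that file into live breakers takes only a few lines. A sketch that assumes js-yaml for parsing; the policy shape matches the YAML above, and keys like timeout_ms are left for the call-site wrappers:

// load-policy.ts
import { readFileSync } from 'node:fs';
import { load } from 'js-yaml';
import { CircuitBreaker } from './breaker';

interface BreakerPolicy {
  failure_threshold: number;
  cooldown_ms: number;
  half_open_successes: number;
}

export function buildBreakers(path: string): Map<string, CircuitBreaker> {
  const policy = load(readFileSync(path, 'utf8')) as {
    models: Record<string, BreakerPolicy>;
    tools: Record<string, BreakerPolicy>;
  };

  const breakers = new Map<string, CircuitBreaker>();
  for (const [name, p] of [...Object.entries(policy.models), ...Object.entries(policy.tools)]) {
    breakers.set(name, new CircuitBreaker(p.failure_threshold, p.half_open_successes, p.cooldown_ms));
  }
  return breakers;
}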

4) Log why the breaker opened

A breaker that opens silently creates a second debugging problem. Structured open, half-open, and close events make incidents much easier to reconstruct.

2026-05-05T11:42:09Z breaker.open dependency=planner-model reason=timeout_window
window_failures=4 timeout_ms=12000 fallback=summary_only_request

2026-05-05T11:42:54Z breaker.half_open dependency=planner-model probe=1
2026-05-05T11:43:01Z breaker.close dependency=planner-model stable_successes=2
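
One lightweight way to get those lines is to route every transition through a small structured emitter. The helper below is a sketch; the field names simply mirror the example output above:

// breaker-events.ts
type BreakerEvent = {
  event: 'breaker.open' | 'breaker.half_open' | 'breaker.close';
  dependency: string;
  reason?: string;
  window_failures?: number;
  fallback?: string;
  probe?: number;
  stable_successes?: number;
};

export function emitBreakerEvent(event: BreakerEvent) {
  // One JSON line per transition keeps incident timelines greppable.
  console.log(JSON.stringify({ ts: new Date().toISOString(), ...event }));
}

// emitBreakerEvent({
//   event: 'breaker.open',
//   dependency: 'planner-model',
//   reason: 'timeout_window',
//   window_failures: 4,
//   fallback: 'summary_only_request',
// });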

What went wrong, and the tradeoffs

Without a breaker, retries stack on top of replanning and tool loops. The system looks busy while quality collapses. If a write tool fails after a remote system already accepted the request, an agent may try again. Breakers reduce blast radius, but idempotency keys are still mandatory for write paths.
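
What that looks like for a write tool, as a sketch: derive the key deterministically from run identity, step, and payload, so a retried call carries the same key. The endpoint and header name are assumptions, and the key only helps if the remote system deduplicates on it:

// idempotent-write.ts
import { createHash } from 'node:crypto';

export async function idempotentWrite(url: string, body: unknown, runId: string, step: number) {
  // Same logical write => same key, even across retries and replans.
  const key = createHash('sha256')
    .update(`${runId}:${step}:${JSON.stringify(body)}`)
    .digest('hex');

  return fetch(url, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'idempotency-key': key, // assumed header; honored only if the server deduplicates
    },
    body: JSON.stringify(body),
  });
}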

The opposite problem is false opens. If thresholds are too strict, one short regional wobble opens the circuit for everyone. Sliding windows and tenant-aware scopes help, and half-open probes should stay scarce so recovery testing does not become a thundering herd.
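
The sliding-window part can replace the raw consecutive-failure counter in the breaker above. A sketch of a time-windowed counter:

// sliding-window.ts
export class SlidingWindowCounter {
  private timestamps: number[] = [];

  constructor(private readonly windowMs = 60_000) {}

  record(now = Date.now()) {
    this.timestamps.push(now);
  }

  count(now = Date.now()): number {
    // Drop entries that have aged out before counting, so a brief regional
    // wobble stops mattering once it slides past the window.
    this.timestamps = this.timestamps.filter((t) => now - t <= this.windowMs);
    return this.timestamps.length;
  }
}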

| Choice | Upside | Downside | Best fit |
| --- | --- | --- | --- |
| Per-model breaker | Clean failure isolation | More configs to tune | Planner and executor use different providers or budgets |
| Global agent breaker | Easy to add | Hides root cause, over-blocks healthy paths | Tiny prototypes only |
| Fast open thresholds | Stops cost leaks quickly | Can degrade availability during transient blips | Side-effecting tools or expensive models |
| Slow open thresholds | Fewer false positives | More wasted retries and user latency | Cheap read-only tools |
Best practice

When the breaker opens, do not pretend the agent is still fully capable. Downgrade the plan visibly, for example “search is temporarily degraded, returning cached context only,” so users know the system chose safety on purpose.

Pitfall

Teams often keep retries in the HTTP client, the tool wrapper, and the planner at the same time. That triple stack makes outages look random. Pick one retry owner and let the breaker coordinate the rest.
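
Concretely, a single retry owner can ask the breaker for permission before every attempt, with retries disabled in the HTTP client and tool wrappers underneath. A sketch:

// retry-owner.ts
import { CircuitBreaker } from './breaker';

export async function withRetries<T>(
  breaker: CircuitBreaker,
  attempt: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    // The breaker, not the loop, decides whether another try is allowed.
    if (!breaker.canExecute()) break;
    try {
      const result = await attempt();
      breaker.recordSuccess();
      return result;
    } catch (error) {
      lastError = error;
      breaker.recordFailure();
    }
  }
  throw lastError ?? new Error('breaker open before first attempt');
}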

Practical checklist

- One breaker per unstable edge: model endpoint, tool class, or tenant-sensitive integration.
- Treat latency drift and token budget breaches as failures, not just hard errors.
- Keep thresholds in reviewed config, not in prompts.
- Emit structured open, half-open, and close events with the reason attached.
- Require idempotency keys on every side-effecting tool path.
- Pick one retry owner and disable retries everywhere else.
- Downgrade visibly when a breaker opens instead of pretending full capability.

Conclusion

Agent reliability gets much better when unhealthy dependencies make the agent quieter, not louder. Circuit breakers are not glamorous, but they are one of the cleanest ways to stop model hiccups and flaky tools from turning into cascading incidents.

If I were adding only three things tomorrow, I would start with per-dependency breakers, token-aware health thresholds, and visible fallback messages.
