CI Failure Triage for AI Coding Agents That Should Not Retry Blindly

Most AI coding pipelines treat a red CI run as a generic retry signal. That is how you end up paying for three more model calls when the real problem is a missing secret, a flaky browser test, or a transient runner outage.

A human reviewer usually sees the shape of the failure quickly. Agents need that judgment encoded in the runtime, not improvised from scratch every time the workflow turns red.

This pattern is the one I trust most: bucket the failure, preserve a compact artifact pack, then choose between a retry lane, a patch lane, or a human-escalation lane with explicit policy.

Why this matters

CI is where a lot of agent systems quietly waste money. The code patcher wakes up for failures it should never touch, retries spin on the same bad signal, and reviewers get a noisy branch with no clear explanation of what the system believed.

The production problem is not just correctness. It is allocation. Code regressions deserve model effort. Infra faults deserve patience. Secret and permission failures deserve a human.

Getting that allocation right pays off in four ways:

real regressions move faster into a focused patch lane
flaky and infra failures stop consuming expensive model turns
every automated action becomes easier to audit later
PRs carry fewer fake fixes for non-code problems

Workflow overview

flowchart TD
    A[CI run fails] --> B[Collect artifact pack]
    B --> C[Failure classifier]
    C --> D{Bucket}
    D -->|code regression| E[Patch lane]
    D -->|flaky test| F[Rerun or quarantine lane]
    D -->|infra outage| G[Cooldown retry]
    D -->|secret or config| H[Human escalation]
    E --> I[Focused patch + verifier]
    F --> J[Fingerprint failure]
    G --> K[Retry same SHA]
    H --> L[Stop automation, attach evidence]

Retry, patch, and escalation are different products. They should not share the same budget, permissions, or evidence packet.
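
One way to make that separation concrete is a per-lane policy table. The sketch below is illustrative only: the field names (`maxModelCalls`, `canPushCommits`, `canRerunWorkflow`, `evidence`) and every number in it are assumptions, not values from a real system.

```typescript
// Hypothetical per-lane policy table. Field names and limits are
// illustrative assumptions, not a prescribed schema.
type Lane = 'retry' | 'patch' | 'escalation';

interface LanePolicy {
  maxModelCalls: number;     // model budget for a single failure
  canPushCommits: boolean;   // may this lane write to the branch?
  canRerunWorkflow: boolean; // may this lane trigger a CI rerun?
  evidence: string[];        // artifact-pack fields the lane receives
}

const lanePolicies: Record<Lane, LanePolicy> = {
  retry: {
    maxModelCalls: 0,
    canPushCommits: false,
    canRerunWorkflow: true,
    evidence: ['run_id', 'commit_sha', 'fingerprint'],
  },
  patch: {
    maxModelCalls: 3,
    canPushCommits: true,
    canRerunWorkflow: false,
    evidence: ['fingerprint', 'first_failure_line', 'suspect_files'],
  },
  escalation: {
    maxModelCalls: 0,
    canPushCommits: false,
    canRerunWorkflow: false,
    evidence: ['run_id', 'bucket', 'first_failure_line', 'links'],
  },
};

// Returns true only when the lane's policy grants the requested capability.
function laneAllows(
  lane: Lane,
  capability: 'canPushCommits' | 'canRerunWorkflow',
): boolean {
  return lanePolicies[lane][capability];
}
```

Checking `laneAllows('retry', 'canPushCommits')` before any write keeps the retry lane from ever touching the branch, regardless of what the model proposes.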

Implementation details

1) Normalize raw failures into buckets

I would not start by sending the full workflow transcript back to the model. Start with a small classifier config and obvious string or structured-log signals.

buckets:
  code_regression:
    match_any:
      - "AssertionError"
      - "TypeError:"
      - "undefined is not a function"
    action: patch
    max_auto_retries: 0
  flaky_test:
    match_any:
      - "Timeout 30000ms exceeded"
      - "ECONNRESET during test"
      - "stale element reference"
    action: rerun_once
    max_auto_retries: 1
  infra_fault:
    match_any:
      - "failed to pull image"
      - "network timed out"
      - "No space left on device"
    action: cooldown_retry
    max_auto_retries: 2
  secret_or_config:
    match_any:
      - "401 Unauthorized"
      - "Missing required environment variable"
      - "Resource not accessible by integration"
    action: escalate
    max_auto_retries: 0

This gets the easy cases out of the way and blocks obviously wrong remediation, such as asking the patch lane to fix a 401.
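
A minimal sketch of how such a config could be applied, assuming plain substring matching and a first-match-wins order. The function name `classifyFailure`, the `BucketRule` shape, and the precedence order are assumptions of this example; the match strings are copied from the config above.

```typescript
type Bucket = 'code_regression' | 'flaky_test' | 'infra_fault' | 'secret_or_config';

interface BucketRule {
  bucket: Bucket;
  matchAny: string[];
}

// Order encodes precedence (an assumption of this sketch): buckets that must
// never reach the patch lane are checked first, so a log containing both
// "401 Unauthorized" and "AssertionError" escalates instead of patching.
const rules: BucketRule[] = [
  {
    bucket: 'secret_or_config',
    matchAny: [
      '401 Unauthorized',
      'Missing required environment variable',
      'Resource not accessible by integration',
    ],
  },
  {
    bucket: 'infra_fault',
    matchAny: ['failed to pull image', 'network timed out', 'No space left on device'],
  },
  {
    bucket: 'flaky_test',
    matchAny: ['Timeout 30000ms exceeded', 'ECONNRESET during test', 'stale element reference'],
  },
  {
    bucket: 'code_regression',
    matchAny: ['AssertionError', 'TypeError:', 'undefined is not a function'],
  },
];

// Returns the first bucket whose signals appear in the log, or null when
// nothing matches.
function classifyFailure(logText: string): Bucket | null {
  for (const rule of rules) {
    if (rule.matchAny.some((needle) => logText.includes(needle))) {
      return rule.bucket;
    }
  }
  return null;
}
```

A `null` result is worth treating as its own case. Escalating unmatched failures, or handing them to a slower model-based classifier, is probably safer than defaulting them into the patch lane.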

2) Persist a compact artifact pack

The artifact pack is the unit I want to store, inspect, and pass to downstream automation. It captures the high-signal evidence without becoming another giant log blob.

{
  "run_id": 194281775,
  "commit_sha": "1d3c5af",
  "workflow": "test-and-lint",
  "failed_job": "playwright-e2e",
  "bucket": "flaky_test",
  "fingerprint": "playwright-timeout:checkout.spec.ts:guest checkout works",
  "first_failure_line": "Timeout 30000ms exceeded while waiting for [data-test=place-order]",
  "suspect_files": ["tests/e2e/checkout.spec.ts", "playwright.config.ts"],
  "rerun_eligible": true,
  "links": {
    "run": "https://github.com/org/repo/actions/runs/194281775",
    "artifacts": "https://github.com/org/repo/actions/runs/194281775/artifacts"
  }
}

I especially like stable fingerprints because they let you separate one-off failures from recurring test debt.
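
A fingerprint stays stable only if volatile details (durations, selectors, run IDs) are kept out of it. The sketch below builds the `kind:file:test` shape seen in the artifact pack; the helper name `buildFingerprint` and its normalization rules are assumptions, not the author's implementation.

```typescript
// Hypothetical helper: builds a fingerprint in the "kind:file:test" shape
// used in the artifact pack above. The normalization rules are assumptions.
interface FailureIdentity {
  kind: string;     // e.g. "playwright-timeout"
  filePath: string; // e.g. "tests/e2e/checkout.spec.ts"
  testName: string; // e.g. "guest checkout works"
}

function buildFingerprint(id: FailureIdentity): string {
  // Keep only the file name so moving a test directory does not change the key.
  const fileName = id.filePath.split('/').pop() ?? id.filePath;
  // Collapse whitespace and lowercase so cosmetic reporter differences
  // do not fork the fingerprint.
  const testName = id.testName.trim().replace(/\s+/g, ' ').toLowerCase();
  return `${id.kind}:${fileName}:${testName}`;
}
```

Dropping the directory is a tradeoff: two spec files with the same name in different folders would share a fingerprint, so a repo with that layout would want to keep the full path instead.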

3) Put retry policy behind code, not model optimism

The runtime should decide whether another automated attempt is allowed.

export function decideNextAction(input: {
  bucket: 'code_regression' | 'flaky_test' | 'infra_fault' | 'secret_or_config';
  retriesUsed: number;
  maxRetries: number;
  sameFingerprintCount: number;
}) {
  // Code regressions skip retries entirely and go straight to the patch lane.
  if (input.bucket === 'code_regression') {
    return { action: 'open_patch_lane', reason: 'code evidence present' };
  }

  // Flaky tests get at most one rerun, and none if config sets maxRetries to 0.
  if (input.bucket === 'flaky_test' && input.retriesUsed < Math.min(1, input.maxRetries)) {
    return { action: 'rerun_same_sha', reason: 'single rerun allowed for flaky bucket' };
  }

  // Infra faults retry after a cooldown until the per-bucket budget is spent.
  if (input.bucket === 'infra_fault' && input.retriesUsed < input.maxRetries) {
    return { action: 'cooldown_retry', reason: 'runner or network fault likely transient' };
  }

  // Everything else escalates: secret_or_config always, plus any bucket
  // that has exhausted its retry budget.
  return {
    action: 'escalate',
    reason: input.sameFingerprintCount > 2
      ? 'repeated fingerprint suggests systemic issue'
      : 'policy disallows more automation'
  };
}

This sounds obvious, but it is where a lot of systems improve the most: the model no longer self-approves another turn.
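
The cooldown itself can live in the same policy code. Below is a sketch of a capped exponential schedule for the `cooldown_retry` action; the function name `cooldownDelayMs`, the one-minute base, and the fifteen-minute cap are all assumptions chosen for illustration.

```typescript
// Hypothetical cooldown schedule for the infra_fault lane. The base delay
// and cap are illustrative assumptions.
function cooldownDelayMs(
  retriesUsed: number,
  baseMs: number = 60000,
  capMs: number = 900000,
): number {
  // Double the wait after each attempt: 1 min, 2 min, 4 min, ... up to the cap.
  const delay = baseMs * 2 ** Math.max(0, retriesUsed);
  return Math.min(delay, capMs);
}
```

With a budget of two retries for `infra_fault`, this waits one minute before the first retry and two minutes before the second, which gives a short runner or registry outage some time to clear before escalation.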

4) Give humans a short triage summary

$ triage-ci-failure --run 194281775
bucket: flaky_test
fingerprint: playwright-timeout:checkout.spec.ts:guest checkout works
sha: 1d3c5af
next-action: rerun_same_sha
why: single rerun allowed for flaky bucket
notes: same code passed on previous two commits, failure isolated to e2e shard 3

A reviewer can absorb that in five seconds, which is exactly what you want when a bot is making repeated decisions on their behalf.
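
Rendering that summary is a small formatting step over the decision and the artifact pack. The sketch below reproduces the layout shown above; `formatTriageSummary` and the `TriageSummaryInput` shape are hypothetical names for this example, not the internals of the `triage-ci-failure` command.

```typescript
// Hypothetical input shape: fields mirror the summary lines shown above.
interface TriageSummaryInput {
  bucket: string;
  fingerprint: string;
  sha: string;
  nextAction: string;
  why: string;
  notes?: string; // optional free-text context for the reviewer
}

// Produces one "key: value" line per field, omitting notes when absent.
function formatTriageSummary(s: TriageSummaryInput): string {
  const lines = [
    `bucket: ${s.bucket}`,
    `fingerprint: ${s.fingerprint}`,
    `sha: ${s.sha}`,
    `next-action: ${s.nextAction}`,
    `why: ${s.why}`,
  ];
  if (s.notes) {
    lines.push(`notes: ${s.notes}`);
  }
  return lines.join('\n');
}
```

Keeping the field order fixed means reviewers learn where to look, and the same text can be posted as a PR comment or stored next to the artifact pack.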

What went wrong, and the tradeoffs

Failure buckets drift over time. A log signature that used to indicate a flaky test can start indicating a genuine regression after a framework upgrade. So the classifier needs periodic review, not blind trust.

Artifact packs are also easy to size wrong. If you over-reduce, the patch lane lacks context. If you under-reduce, you are back to stuffing giant logs into prompts.

Bucket	Best first action	Risk if misclassified	What I watch
Code regression	Open patch lane	missed bug or bad auto-fix	failing test name, blame window, suspect files
Flaky test	Rerun once, same SHA	hides real instability	fingerprint frequency, shard skew, pass-on-rerun rate
Infra fault	Cooldown retry	wasteful loops during outages	provider status, runner pull and cache failures
Secret or config	Escalate to human	impossible patch attempts	auth errors, env availability, permission scope

A few more failure modes to plan for:

fork PR permissions often look like failing code when they are actually expected security boundaries
rerun-success can hide test-order dependence, so passing on rerun is not the end of the story
artifact links need durable retention or incident review gets much worse
repeat fingerprints should cap automation before one noisy test burns a week of tokens
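
The fingerprint cap in the last point can be a plain count over a time window. This sketch keeps events in memory for brevity; a real system would persist them. The names `countRecent` and `automationAllowed`, the seven-day window, and the threshold of two repeats are assumptions, with the threshold chosen to line up with the `sameFingerprintCount > 2` check in the policy function.

```typescript
// Hypothetical record of one failed run for a given fingerprint.
interface FingerprintEvent {
  fingerprint: string;
  atMs: number; // epoch milliseconds of the failed run
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Counts how often a fingerprint failed inside the trailing window.
function countRecent(
  events: FingerprintEvent[],
  fingerprint: string,
  nowMs: number,
  windowMs: number,
): number {
  return events.filter((e) => {
    const age = nowMs - e.atMs;
    return e.fingerprint === fingerprint && age >= 0 && age <= windowMs;
  }).length;
}

// Automation stays enabled only while the fingerprint has recurred at most
// maxRepeats times inside the window.
function automationAllowed(
  events: FingerprintEvent[],
  fingerprint: string,
  nowMs: number,
  maxRepeats: number = 2,
  windowMs: number = SEVEN_DAYS_MS,
): boolean {
  return countRecent(events, fingerprint, nowMs, windowMs) <= maxRepeats;
}
```

The count from `countRecent` can be passed directly as `sameFingerprintCount`, so the cap and the escalation reason come from the same number.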

Pitfall: do not let the patch lane mutate code after a secret_or_config bucket. If the environment is wrong, code changes are usually just theater.

Best practice: pass the patch lane only the failing test context, suspect files, and compact artifact pack. Narrow packets are cheaper and easier to verify.
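
Both rules can be enforced at the point where the patch lane's input is built. The sketch below is one possible shape: `buildPatchPacket`, the `PatchPacket` fields, and the five-file limit are assumptions for illustration, while the `ArtifactPack` fields are taken from the artifact pack example earlier in the article.

```typescript
// Subset of the artifact pack fields that the patch lane needs.
interface ArtifactPack {
  bucket: 'code_regression' | 'flaky_test' | 'infra_fault' | 'secret_or_config';
  commit_sha: string;
  fingerprint: string;
  first_failure_line: string;
  suspect_files: string[];
}

// Hypothetical narrow packet handed to the patch lane.
interface PatchPacket {
  commitSha: string;
  fingerprint: string;
  failureLine: string;
  files: string[];
}

// Assumed limit on how many suspect files the patch lane may see.
const MAX_SUSPECT_FILES = 5;

// Returns a narrow packet for code regressions and null for every other
// bucket, so the patch lane cannot be started on a secret or config failure.
function buildPatchPacket(pack: ArtifactPack): PatchPacket | null {
  if (pack.bucket !== 'code_regression') {
    return null;
  }
  return {
    commitSha: pack.commit_sha,
    fingerprint: pack.fingerprint,
    failureLine: pack.first_failure_line,
    files: pack.suspect_files.slice(0, MAX_SUSPECT_FILES),
  };
}
```

Because the guard sits in the packet builder rather than in a prompt, a misbehaving model has nothing to patch with when the bucket is wrong.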

Practical checklist

[ ] define a small, reviewable set of CI failure buckets
[ ] collect artifact packs with stable fingerprints
[ ] separate retry, patch, and escalation lanes in code
[ ] cap retries by bucket instead of with one global number
[ ] rerun flaky failures on the same commit SHA before changing code
[ ] persist recurring fingerprints so chronic flakes are visible
[ ] stop automation on auth, secret, and permission failures
[ ] attach a short reviewer summary to every decision

Conclusion

AI coding agents get more reliable when CI failure handling stops being one vague fix-it loop.

Bucket the failure, preserve the evidence, and make retry a policy decision. That alone cuts a surprising amount of bad patching and noisy automation.