Most AI coding pipelines treat a red CI run as a generic retry signal. That is how you end up paying for three more model calls when the real problem is a missing secret, a flaky browser test, or a transient runner outage.
A human reviewer usually sees the shape of the failure quickly. Agents need that judgment encoded in the runtime, not improvised from scratch every time the workflow turns red.
This pattern is the one I trust most: bucket the failure, preserve a compact artifact pack, then choose between a retry lane, a patch lane, or a human-escalation lane with explicit policy.
Why this matters
CI is where a lot of agent systems quietly waste money. The code patcher wakes up for failures it should never touch, retries spin on the same bad signal, and reviewers get a noisy branch plus no clear explanation of what the system believed.
The production problem is not just correctness. It is allocation. Code regressions deserve model effort. Infra faults deserve patience. Secret and permission failures deserve a human.
- real regressions move faster into a focused patch lane
- flaky and infra failures stop consuming expensive model turns
- every automated action becomes easier to audit later
- PRs carry fewer fake fixes for non-code problems
Architecture or workflow overview
```mermaid
flowchart TD
    A[CI run fails] --> B[Collect artifact pack]
    B --> C[Failure classifier]
    C --> D{Bucket}
    D -->|code regression| E[Patch lane]
    D -->|flaky test| F[Rerun or quarantine lane]
    D -->|infra outage| G[Cooldown retry]
    D -->|secret or config| H[Human escalation]
    E --> I[Focused patch + verifier]
    F --> J[Fingerprint failure]
    G --> K[Retry same SHA]
    H --> L[Stop automation, attach evidence]
```

Retry, patch, and escalation are different products. They should not share the same budget, permissions, or evidence packet.
Implementation details
1) Normalize raw failures into buckets
I would not send a full workflow transcript back into the model first. Start with a small classifier config and obvious string or structured-log signals.
```yaml
buckets:
  code_regression:
    match_any:
      - "AssertionError"
      - "TypeError:"
      - "undefined is not a function"
    action: patch
    max_auto_retries: 0
  flaky_test:
    match_any:
      - "Timeout 30000ms exceeded"
      - "ECONNRESET during test"
      - "stale element reference"
    action: rerun_once
    max_auto_retries: 1
  infra_fault:
    match_any:
      - "failed to pull image"
      - "network timed out"
      - "No space left on device"
    action: cooldown_retry
    max_auto_retries: 2
  secret_or_config:
    match_any:
      - "401 Unauthorized"
      - "Missing required environment variable"
      - "Resource not accessible by integration"
    action: escalate
    max_auto_retries: 0
```

This gets the easy cases out of the way and blocks a lot of obviously wrong remediation.
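To make the matching step concrete, here is a minimal sketch of a classifier driven by a config like the one above. It assumes the YAML has already been loaded into a plain in-memory structure; `classifyFailure`, `BucketRule`, `RULES`, and the `unknown` fallback are hypothetical names, not part of any existing tool. Matching is plain case-sensitive substring search, and the rule order (config first, code last) is my own assumption so that an auth error wrapped in an assertion message still escalates instead of opening a patch lane.

```typescript
type Bucket = 'code_regression' | 'flaky_test' | 'infra_fault' | 'secret_or_config';

interface BucketRule {
  matchAny: string[];
  action: string;
  maxAutoRetries: number;
}

// Hypothetical in-memory form of the bucket config.
// Order is an assumption: the most conservative buckets are checked first.
const RULES: Array<[Bucket, BucketRule]> = [
  ['secret_or_config', {
    matchAny: ['401 Unauthorized', 'Missing required environment variable', 'Resource not accessible by integration'],
    action: 'escalate',
    maxAutoRetries: 0,
  }],
  ['infra_fault', {
    matchAny: ['failed to pull image', 'network timed out', 'No space left on device'],
    action: 'cooldown_retry',
    maxAutoRetries: 2,
  }],
  ['flaky_test', {
    matchAny: ['Timeout 30000ms exceeded', 'ECONNRESET during test', 'stale element reference'],
    action: 'rerun_once',
    maxAutoRetries: 1,
  }],
  ['code_regression', {
    matchAny: ['AssertionError', 'TypeError:', 'undefined is not a function'],
    action: 'patch',
    maxAutoRetries: 0,
  }],
];

// Returns the first bucket (in RULES order) with a pattern found in any log line,
// plus the line that matched so it can be stored as evidence.
function classifyFailure(logLines: string[]): { bucket: Bucket | 'unknown'; matchedLine?: string } {
  for (const [bucket, rule] of RULES) {
    const matchedLine = logLines.find((line) =>
      rule.matchAny.some((pattern) => line.includes(pattern)),
    );
    if (matchedLine !== undefined) {
      return { bucket, matchedLine };
    }
  }
  // Nothing matched: leave the decision to a slower path or a human.
  return { bucket: 'unknown' };
}
```

Anything that lands in `unknown` is a candidate for a new pattern or a manual look, which keeps the string list small and reviewable.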
2) Persist a compact artifact pack
The artifact pack is the unit I want to store, inspect, and pass to downstream automation. It captures the high-signal evidence without becoming another giant log blob.
```json
{
  "run_id": 194281775,
  "commit_sha": "1d3c5af",
  "workflow": "test-and-lint",
  "failed_job": "playwright-e2e",
  "bucket": "flaky_test",
  "fingerprint": "playwright-timeout:checkout.spec.ts:guest checkout works",
  "first_failure_line": "Timeout 30000ms exceeded while waiting for [data-test=place-order]",
  "suspect_files": ["tests/e2e/checkout.spec.ts", "playwright.config.ts"],
  "rerun_eligible": true,
  "links": {
    "run": "https://github.com/org/repo/actions/runs/194281775",
    "artifacts": "https://github.com/org/repo/actions/runs/194281775/artifacts"
  }
}
```

I especially like stable fingerprints because they let you separate one-off failures from recurring test debt.
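A fingerprint only stays stable if it is built from parts that do not change between runs. The sketch below shows one way to do that, assuming the `kind:file:title` shape I read from the example pack; `buildFingerprint` and its parameters are hypothetical. It deliberately leaves out run IDs, SHAs, and timestamps, and normalizes the path and whitespace so the same test produces the same key on every runner.

```typescript
// Hypothetical helper: builds a fingerprint from parts that are stable across runs.
// Assumes the "kind:file:title" layout seen in the example artifact pack.
function buildFingerprint(kind: string, specFile: string, testTitle: string): string {
  // Keep only the file name so checkout-path differences between runners do not matter.
  const fileName = specFile.split('/').pop() ?? specFile;
  // Collapse repeated whitespace so cosmetic title differences do not create new keys.
  const title = testTitle.trim().replace(/\s+/g, ' ');
  return `${kind}:${fileName}:${title}`;
}

const fingerprint = buildFingerprint(
  'playwright-timeout',
  'tests/e2e/checkout.spec.ts',
  '  guest checkout   works ',
);
console.log(fingerprint);
```

Two spec files with the same name in different directories would collide under this scheme, so a real system might keep more of the path.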
3) Put retry policy behind code, not model optimism
The runtime should decide whether another automated attempt is allowed.
```typescript
export function decideNextAction(input: {
  bucket: 'code_regression' | 'flaky_test' | 'infra_fault' | 'secret_or_config';
  retriesUsed: number;
  maxRetries: number;
  sameFingerprintCount: number;
}) {
  // Code regressions go straight to the patch lane; rerunning them is wasted budget.
  if (input.bucket === 'code_regression') {
    return { action: 'open_patch_lane', reason: 'code evidence present' };
  }
  // Flaky tests get exactly one rerun on the same commit.
  if (input.bucket === 'flaky_test' && input.retriesUsed < 1) {
    return { action: 'rerun_same_sha', reason: 'single rerun allowed for flaky bucket' };
  }
  // Infra faults may retry until the per-bucket cap is reached.
  if (input.bucket === 'infra_fault' && input.retriesUsed < input.maxRetries) {
    return { action: 'cooldown_retry', reason: 'runner or network fault likely transient' };
  }
  // Everything else, including secret_or_config and exhausted retries, goes to a human.
  return {
    action: 'escalate',
    reason: input.sameFingerprintCount > 2
      ? 'repeated fingerprint suggests systemic issue'
      : 'policy disallows more automation'
  };
}
```

This sounds obvious, but it is where a lot of systems quietly improve. The model no longer self-approves another turn.
4) Give humans a short triage summary
```console
$ triage-ci-failure --run 194281775
bucket: flaky_test
fingerprint: playwright-timeout:checkout.spec.ts:guest checkout works
sha: 1d3c5af
next-action: rerun_same_sha
why: single rerun allowed for flaky bucket
notes: same code passed on previous two commits, failure isolated to e2e shard 3
```

A reviewer can absorb that in five seconds, which is exactly what you want when a bot is making repeated decisions on their behalf.
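Rendering that summary is cheap to keep deterministic. Here is a minimal sketch of a formatter that produces the same `key: value` layout; `TriageDecision` and `formatTriageSummary` are hypothetical names, and the field set is simply what appears in the example output.

```typescript
// Hypothetical shape of a triage decision, mirroring the fields in the summary output.
interface TriageDecision {
  bucket: string;
  fingerprint: string;
  sha: string;
  nextAction: string;
  why: string;
  notes?: string;
}

// Renders one "key: value" line per field; the notes line is omitted when empty.
function formatTriageSummary(d: TriageDecision): string {
  const lines = [
    `bucket: ${d.bucket}`,
    `fingerprint: ${d.fingerprint}`,
    `sha: ${d.sha}`,
    `next-action: ${d.nextAction}`,
    `why: ${d.why}`,
  ];
  if (d.notes) {
    lines.push(`notes: ${d.notes}`);
  }
  return lines.join('\n');
}

const summary = formatTriageSummary({
  bucket: 'flaky_test',
  fingerprint: 'playwright-timeout:checkout.spec.ts:guest checkout works',
  sha: '1d3c5af',
  nextAction: 'rerun_same_sha',
  why: 'single rerun allowed for flaky bucket',
});
console.log(summary);
```

Because the output is built from the decision record and nothing else, the same text can go to a PR comment, a log line, or a chat message without the model rewording it.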
What went wrong, and the tradeoffs
Failure buckets drift over time. What used to mean "flaky" can become a genuine regression after a framework upgrade. So the classifier needs periodic review, not blind trust.
Artifact packs can also become too thin. If you over-reduce, the patch lane lacks context. If you under-reduce, you are back to stuffing giant logs into prompts.
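One way to sit between those two extremes is to keep a bounded window of log lines around the first failure instead of either a single line or the whole transcript. The sketch below is an illustration under assumptions: `extractFailureWindow` is a hypothetical helper, and the default window sizes are arbitrary starting points, not tuned values.

```typescript
// Hypothetical helper: returns the lines surrounding the first failing line.
// `before` and `after` bound how much context is kept; the defaults are arbitrary.
function extractFailureWindow(
  lines: string[],
  isFailure: (line: string) => boolean,
  before = 5,
  after = 20,
): string[] {
  const index = lines.findIndex(isFailure);
  if (index === -1) {
    // No failing line found: return nothing rather than the whole log.
    return [];
  }
  const start = Math.max(0, index - before);
  // slice() excludes its end index, so add 1 to include the last context line.
  return lines.slice(start, index + after + 1);
}

const log = ['setup', 'install', 'run tests', 'Timeout 30000ms exceeded', 'teardown', 'done'];
const windowed = extractFailureWindow(log, (line) => line.includes('Timeout'), 1, 1);
console.log(windowed);
```

Tuning `before` and `after` per bucket is a reasonable next step, since a code regression usually needs more surrounding context than an image pull failure.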
| Bucket | Best first action | Risk if misclassified | What I watch |
|---|---|---|---|
| Code regression | Open patch lane | Missed bug or bad auto-fix | failing test name, blame window, suspect files |
| Flaky test | Rerun once, same SHA | hides real instability | fingerprint frequency, shard skew, pass-on-rerun rate |
| Infra fault | Cooldown retry | wasteful loops during outages | provider status, runner pull and cache failures |
| Secret or config | Escalate to human | impossible patch attempts | auth errors, env availability, permission scope |
- fork PR permissions often look like failing code when they are actually expected security boundaries
- rerun-success can hide test-order dependence, so passing on rerun is not the end of the story
- artifact links need durable retention or incident review gets much worse
- repeat fingerprints should cap automation before one noisy test burns a week of tokens
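The last point, capping automation on repeat fingerprints, can be sketched as a small counter. This is an in-memory illustration only: `FingerprintLedger` is a hypothetical name, a real system would persist the counts somewhere durable, and the threshold of 2 is borrowed from the `sameFingerprintCount > 2` check in the policy function rather than from any measured value.

```typescript
// Hypothetical in-memory ledger that counts how often each fingerprint has failed.
class FingerprintLedger {
  private counts = new Map<string, number>();
  private maxRepeats: number;

  constructor(maxRepeats: number) {
    this.maxRepeats = maxRepeats;
  }

  // Records one more occurrence and returns the updated count.
  record(fingerprint: string): number {
    const next = (this.counts.get(fingerprint) ?? 0) + 1;
    this.counts.set(fingerprint, next);
    return next;
  }

  // True once a fingerprint has been seen more than maxRepeats times,
  // meaning further automated attempts should stop.
  isCapped(fingerprint: string): boolean {
    return (this.counts.get(fingerprint) ?? 0) > this.maxRepeats;
  }
}

const ledger = new FingerprintLedger(2);
const key = 'playwright-timeout:checkout.spec.ts:guest checkout works';
ledger.record(key);
ledger.record(key);
const cappedAfterTwo = ledger.isCapped(key);
ledger.record(key);
const cappedAfterThree = ledger.isCapped(key);
console.log({ cappedAfterTwo, cappedAfterThree });
```

Checking the cap before the policy function runs means a chronic flake escalates even when the per-run retry budget has not been used yet.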
Pitfall: do not let the patch lane mutate code after a secret_or_config bucket. If the environment is wrong, code changes are usually just theater.
Best practice: pass the patch lane only the failing test context, suspect files, and compact artifact pack. Narrow packets are cheaper and easier to verify.
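Both rules can be enforced at the point where the patch lane's input is built. The following is a minimal sketch, assuming an artifact pack with the field names from the JSON example; `buildPatchPacket`, `PatchPacket`, and the `maxFiles` cap are hypothetical additions for illustration.

```typescript
// Subset of the artifact pack fields used here (names follow the JSON example).
interface ArtifactPackLike {
  bucket: string;
  commit_sha: string;
  fingerprint: string;
  first_failure_line: string;
  suspect_files: string[];
}

// Hypothetical narrow input for the patch lane.
interface PatchPacket {
  commitSha: string;
  fingerprint: string;
  firstFailureLine: string;
  suspectFiles: string[];
}

// Returns a packet only for code regressions; every other bucket gets null,
// so the patch lane cannot run after a secret_or_config or infra classification.
function buildPatchPacket(pack: ArtifactPackLike, maxFiles = 5): PatchPacket | null {
  if (pack.bucket !== 'code_regression') {
    return null;
  }
  return {
    commitSha: pack.commit_sha,
    fingerprint: pack.fingerprint,
    firstFailureLine: pack.first_failure_line,
    // Cap the file list so the packet stays small; maxFiles is an assumed default.
    suspectFiles: pack.suspect_files.slice(0, maxFiles),
  };
}

const blocked = buildPatchPacket({
  bucket: 'secret_or_config',
  commit_sha: '1d3c5af',
  fingerprint: 'auth:deploy',
  first_failure_line: '401 Unauthorized',
  suspect_files: [],
});
console.log(blocked);
```

Returning `null` instead of throwing keeps the decision visible to the caller, which can then route the run to escalation with the evidence attached.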
Practical checklist
- [ ] define a small, reviewable set of CI failure buckets
- [ ] collect artifact packs with stable fingerprints
- [ ] separate retry, patch, and escalation lanes in code
- [ ] cap retries by bucket instead of with one global number
- [ ] rerun flaky failures on the same commit SHA before changing code
- [ ] persist recurring fingerprints so chronic flakes are visible
- [ ] stop automation on auth, secret, and permission failures
- [ ] attach a short reviewer summary to every decision
Conclusion
AI coding agents get more reliable when CI failure handling stops being one vague fix-it loop.
Bucket the failure, preserve the evidence, and make retry a policy decision. That alone cuts a surprising amount of bad patching and noisy automation.