
Eval Harnesses for AI Coding Agents That Actually Catch Bad Patches

April 19, 2026 • 10 min read

Most AI coding demos still fail the same test: the patch looks plausible, but it quietly breaks something off-camera. The model updates the file you expected, misses the adjacent invariant, and the reviewer only notices after CI fails or a customer does.

If you want coding agents to improve, you need an eval harness that grades real repository work instead of prompt beauty contests. That means task manifests, sandboxed verification, invariant checks, and scorecards that separate "made a diff" from "produced a safe fix".

This is the harness shape I would actually use for patch-level agent evaluation.

Why this matters

A coding agent is rarely judged on one thing. In practice you care about task success, file accuracy, invariant safety, reviewability, and cost. Generic benchmark scores do not give you that mix, but a repo-local harness can.

  • did it edit the right files?
  • did the verify commands pass?
  • did it preserve non-obvious behavior?
  • did it avoid risky shortcuts?
  • is the patch something a reviewer can merge calmly?

Useful references: OpenAI Evals, SWE-bench, OpenTelemetry, and LangSmith evaluations.

Architecture or workflow overview

A useful harness has four stages: prepare a task, run the agent in isolation, verify the patch, and score the result.

```mermaid
flowchart LR
    A["Task manifest<br/>repo state + prompt + allowed paths"] --> B["Agent runner<br/>branch or worktree sandbox"]
    B --> C["Verification lane<br/>pytest, lint, invariant checks, policy scans"]
    C --> D["Scorecard<br/>pass rate, file accuracy, risk flags, review quality"]
    D --> E["Report<br/>json + html summary + failing evidence"]
```

Task packet

I like task manifests that are explicit enough to replay later and small enough to diff in code review.

```yaml
id: auth-refresh-token-regression
base_commit: 6bbda18
repo: github.com/acme/api
prompt: |
  Fix the bug where refresh tokens remain valid after password reset.
  Preserve the mobile login flow and do not change public API schemas.
allowed_paths:
  - services/auth/**
  - tests/auth/**
verify:
  - pytest tests/auth -q
  - ruff check services/auth tests/auth
  - python scripts/check_invariants.py --task auth-refresh-token-regression
invariants:
  - password reset must revoke outstanding refresh tokens
  - existing session audit logging must stay intact
risk_flags:
  - auth
  - session-management
```
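A manifest this small is also easy to validate before any agent runs, which keeps bad task definitions out of your results. A minimal loader sketch in Python (the field names mirror the YAML above; feeding it the dict from `yaml.safe_load` is an assumption about your pipeline):

```python
from dataclasses import dataclass, field

REQUIRED_KEYS = {"id", "base_commit", "prompt", "allowed_paths", "verify"}


@dataclass
class TaskManifest:
    id: str
    base_commit: str
    prompt: str
    allowed_paths: list[str]
    verify: list[str]
    invariants: list[str] = field(default_factory=list)
    risk_flags: list[str] = field(default_factory=list)
    repo: str = ""


def load_manifest(raw: dict) -> TaskManifest:
    """Validate a parsed manifest dict and build a typed TaskManifest."""
    missing = REQUIRED_KEYS - raw.keys()
    if missing:
        raise ValueError(f"manifest missing required keys: {sorted(missing)}")
    return TaskManifest(**raw)
```

Failing loudly at load time beats discovering mid-run that a task had no verify commands.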

Runner evidence

The runner should save more than a final diff. It should capture changed files, checks, trace output, runtime, and risk flags so failures are explainable.

```python
from dataclasses import dataclass
from pathlib import Path
import subprocess
import time


@dataclass
class EvalResult:
    task_id: str
    exit_code: int
    changed_files: list[str]
    runtime_seconds: float
    checks: dict
    risk_flags: list[str]


def run_check(command: str, cwd: Path) -> dict:
    """Run one verify command and capture structured evidence."""
    started = time.time()
    proc = subprocess.run(command, cwd=cwd, shell=True, text=True, capture_output=True)
    return {
        "command": command,
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        # keep only the tail of the output so artifacts stay bounded
        "stdout": proc.stdout[-6000:],
        "stderr": proc.stderr[-6000:],
        "runtime_seconds": round(time.time() - started, 2),
    }
```
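With per-check records in that shape, aggregating a task's verification lane stays trivial. A sketch of a summarizer (it assumes the dict shape returned by `run_check` above; the function name is my own):

```python
def summarize_checks(checks: list[dict]) -> dict:
    """Collapse individual check records into one verdict the scorecard can use."""
    failed = [c["command"] for c in checks if not c["ok"]]
    return {
        "ok": not failed,
        "failed_commands": failed,
        "runtime_seconds": round(sum(c["runtime_seconds"] for c in checks), 2),
    }
```

Keeping the failed command names in the summary means the report can point straight at the evidence.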

Implementation details

The highest-value harnesses do not collapse everything into one magic number. They keep a small scorecard that maps to how humans actually review patches.

| Dimension | What it measures | Good signal | Failure smell |
|---|---|---|---|
| Task success | Whether required checks passed | Targeted tests green | Suite skipped or weakened |
| File accuracy | Whether edits stayed in scope | Only expected files changed | Unrelated churn across repo |
| Invariant safety | Whether critical behavior stayed true | Custom checks pass | Auth, billing, or data-loss regressions |
| Reviewability | Whether a human can inspect the patch quickly | Clear diff, small scope | Giant generated rewrite |
| Efficiency | Whether runtime and cost stay bounded | Stable runtime | Looping retries and token waste |
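If you do need one sortable number, report a weighted aggregate alongside, never instead of, the per-dimension scores. A sketch (the weights are illustrative, not from any benchmark):

```python
# illustrative weights; tune per repo and risk profile
WEIGHTS = {
    "task_success": 0.35,
    "file_accuracy": 0.20,
    "invariant_safety": 0.25,
    "reviewability": 0.15,
    "efficiency": 0.05,
}


def scorecard(scores: dict[str, float]) -> dict:
    """Attach a weighted aggregate without hiding the individual dimensions."""
    aggregate = sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)
    return {**scores, "aggregate": round(aggregate, 3)}
```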

Invariant checks

A lot of bad patches pass focused tests because the original tests were incomplete. A second verification lane catches cheap wins that would otherwise pollute your benchmark.

```python
from pathlib import Path
import re

# where the runner writes the candidate patch for inspection
DIFF = Path('.git/invariant.patch')


def invariant_auth_logging(diff_text: str) -> bool:
    return 'audit.log_security_event' in diff_text


def invariant_no_test_downgrade(diff_text: str) -> bool:
    forbidden = [r'-\s*assert .*is False', r'\bskip\(', r'xfail']
    return not any(re.search(pattern, diff_text) for pattern in forbidden)
```
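Dispatching those predicates over the diff and collecting failures by name keeps the scorecard explainable. A small sketch (the registry shape is my own; any callable taking the diff text works):

```python
from typing import Callable


def run_invariants(diff_text: str, checks: dict[str, Callable[[str], bool]]) -> list[str]:
    """Return the names of invariant checks that the diff violates."""
    return [name for name, check in checks.items() if not check(diff_text)]
```

Usage: pass the functions above keyed by the invariant names from the task manifest, and feed the result straight into the report's evidence section.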
```text
$ python run_eval.py --task auth-refresh-token-regression --model local-qwen-coder
[agent] plan: inspect auth service, patch token revocation, run focused tests
[verify] pytest tests/auth -q .................................... PASSED
[verify] ruff check services/auth tests/auth ..................... PASSED
[verify] python scripts/check_invariants.py ...................... FAILED
[score] task_success=0.75 file_accuracy=1.00 invariant_safety=0.00 reviewability=0.92
[hint] audit.log_security_event disappeared from services/auth/reset.py
```

What went wrong and the tradeoffs

Failure mode 1: you overfit to your harness

If tasks stay static, the model starts learning the grading routine instead of the engineering problem. Hidden holdout tasks and rotated prompts help a lot.

Failure mode 2: patch success hides review pain

A patch can pass tests and still be miserable to merge because it rewrites too much or smuggles in unrelated cleanup. Reviewability deserves its own score.
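A crude but serviceable reviewability signal is blast radius: files touched and lines changed, penalized past a threshold. A minimal sketch (the thresholds are illustrative, not from any standard):

```python
def reviewability_score(files_changed: int, lines_changed: int,
                        max_files: int = 5, max_lines: int = 200) -> float:
    """Score 1.0 for a tight patch, stepping down as the diff sprawls."""
    score = 1.0
    if files_changed > max_files:
        score -= 0.5
    if lines_changed > max_lines:
        score -= 0.5
    return score
```

Step functions are blunt, but they are deterministic and easy to explain in a review, which matters more here than precision.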

Failure mode 3: security-sensitive repos need harsher weighting

In auth, payments, infra, and deletion flows, passing tests should not outweigh broken invariants or forbidden patterns.
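One concrete way to encode that is to make invariant failures a hard gate for flagged task families rather than just another weighted term. A sketch (the flag names echo the manifest's risk_flags; the gate is my own construction):

```python
# task families where a broken invariant should zero the score
HIGH_RISK_FLAGS = {"auth", "payments", "infra", "data-deletion"}


def gated_score(aggregate: float, invariant_safety: float, risk_flags: list[str]) -> float:
    """Zero the aggregate when a high-risk task family breaks an invariant."""
    if invariant_safety < 1.0 and HIGH_RISK_FLAGS & set(risk_flags):
        return 0.0
    return aggregate
```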

Pitfall: do not grade only success-path tests. Models will absolutely preserve the demo while quietly weakening the guardrails around it.
| Eval style | Fast to set up | Useful for coding agents | Main weakness |
|---|---|---|---|
| Prompt-response grading | Yes | Low | Ignores repo state and execution |
| Golden diff matching | Medium | Medium | Punishes valid alternate fixes |
| Test-only grading | Yes | Medium | Misses unsafe shortcuts |
| Patch + invariants + review score | No | High | More setup and maintenance |

Practical checklist

Best practice: keep the harness boring and deterministic. The model can be creative. The grader should not be.
  • define task manifests with base commit, prompt, allowed paths, and verify commands
  • run each task in a fresh branch, worktree, or container
  • score both success and blast radius
  • add at least one invariant check for every high-risk task family
  • save structured artifacts, not just pass or fail
  • maintain a hidden holdout set before changing prompts or models
  • inspect a sample of passing patches manually every week
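For the structured-artifacts item, a sketch of a report writer (this mirrors the EvalResult shape from the runner section in simplified form; the output path convention is an assumption):

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class EvalResult:
    """Simplified mirror of the runner's result record."""
    task_id: str
    exit_code: int
    changed_files: list[str] = field(default_factory=list)
    checks: dict = field(default_factory=dict)
    risk_flags: list[str] = field(default_factory=list)


def write_report(result: EvalResult, out_dir: str) -> Path:
    """Persist one task's evidence as JSON so failures can be replayed later."""
    path = Path(out_dir) / f"{result.task_id}.json"
    path.write_text(json.dumps(asdict(result), indent=2))
    return path
```

One JSON file per task keeps diffs between harness runs reviewable with ordinary tools.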

Conclusion

If you want better coding agents, stop asking whether the model can produce a patch and start asking whether the patch survives verification, preserves invariants, and stays mergeable. A good eval harness turns that question into data instead of guesswork.

AI Evaluation Coding Agents Reliability Developer Workflows Benchmarks

Want more practical AI engineering notes? Browse the rest of the blog.