Most AI coding demos still fail the same test: the patch looks plausible, but it quietly breaks something off-camera. The model updates the file you expected, misses the adjacent invariant, and the reviewer only notices after CI fails or a customer does.
If you want coding agents to improve, you need an eval harness that grades real repository work instead of prompt beauty contests. That means task manifests, sandboxed verification, invariant checks, and scorecards that separate "made a diff" from "produced a safe fix".
This is the harness shape I would actually use for patch-level agent evaluation.
Why this matters
A coding agent is rarely judged on one thing. In practice you care about task success, file accuracy, invariant safety, reviewability, and cost. Generic benchmark scores do not give you that mix, but a repo-local harness can answer the questions that actually matter:
- did it edit the right files?
- did the verify commands pass?
- did it preserve non-obvious behavior?
- did it avoid risky shortcuts?
- is the patch something a reviewer can merge calmly?
Useful references: OpenAI Evals, SWE-bench, OpenTelemetry, and LangSmith evaluations.
Architecture overview
A useful harness has four stages: prepare a task, run the agent in isolation, verify the patch, and score the result.
```mermaid
flowchart LR
    A["Task manifest<br/>repo state + prompt + allowed paths"] --> B["Agent runner<br/>branch or worktree sandbox"]
    B --> C["Verification lane<br/>pytest, lint, invariant checks, policy scans"]
    C --> D["Scorecard<br/>pass rate, file accuracy, risk flags, review quality"]
    D --> E["Report<br/>json + html summary + failing evidence"]
```
Task packet
I like task manifests that are explicit enough to replay later and small enough to diff in code review.
```yaml
id: auth-refresh-token-regression
base_commit: 6bbda18
repo: github.com/acme/api
prompt: |
  Fix the bug where refresh tokens remain valid after password reset.
  Preserve the mobile login flow and do not change public API schemas.
allowed_paths:
  - services/auth/**
  - tests/auth/**
verify:
  - pytest tests/auth -q
  - ruff check services/auth tests/auth
  - python scripts/check_invariants.py --task auth-refresh-token-regression
invariants:
  - password reset must revoke outstanding refresh tokens
  - existing session audit logging must stay intact
risk_flags:
  - auth
  - session-management
```

Runner evidence
The runner should save more than a final diff. It should capture changed files, checks, trace output, runtime, and risk flags so failures are explainable.
```python
from dataclasses import dataclass
from pathlib import Path
import subprocess
import time


@dataclass
class EvalResult:
    task_id: str
    exit_code: int
    changed_files: list[str]
    runtime_seconds: float
    checks: dict
    risk_flags: list[str]


def run_check(command: str, cwd: Path) -> dict:
    started = time.time()
    proc = subprocess.run(command, cwd=cwd, shell=True, text=True, capture_output=True)
    return {
        "command": command,
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout": proc.stdout[-6000:],
        "stderr": proc.stderr[-6000:],
        "runtime_seconds": round(time.time() - started, 2),
    }
```

Implementation details
The highest-value harnesses do not collapse everything into one magic number. They keep a small scorecard that maps to how humans actually review patches.
| Dimension | What it measures | Good signal | Failure smell |
|---|---|---|---|
| Task success | Whether required checks passed | Targeted tests green | Suite skipped or weakened |
| File accuracy | Whether edits stayed in scope | Only expected files changed | Unrelated churn across repo |
| Invariant safety | Whether critical behavior stayed true | Custom checks pass | Auth, billing, or data-loss regressions |
| Reviewability | Whether a human can inspect the patch quickly | Clear diff, small scope | Giant generated rewrite |
| Efficiency | Whether runtime and cost stay bounded | Stable runtime | Looping retries and token waste |
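When you do need a single number to rank runs, collapse those dimensions with weights that are explicit and tunable rather than buried in the grader. A minimal sketch; the weights here are illustrative, not prescriptive:

```python
# Illustrative weights per scorecard dimension; tune per repo and risk profile.
WEIGHTS = {
    "task_success": 0.35,
    "file_accuracy": 0.20,
    "invariant_safety": 0.25,
    "reviewability": 0.15,
    "efficiency": 0.05,
}


def aggregate(scores: dict[str, float]) -> float:
    # Weighted mean over the five dimensions; missing dimensions score zero.
    return round(sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS), 2)
```

Keeping the weights in one dict makes the tradeoff reviewable: anyone can see that invariant safety outweighs reviewability before arguing about a ranking.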
Invariant checks
A lot of bad patches pass focused tests because the original tests were incomplete. A second verification lane catches cheap wins that would otherwise pollute your benchmark.
```python
from pathlib import Path
import re

# Unified diff captured by the runner before verification.
DIFF = Path('.git/invariant.patch')


def invariant_auth_logging(diff_text: str) -> bool:
    # Fail if the patch removes any call to the audit logger.
    return not re.search(r'^-.*audit\.log_security_event', diff_text, re.MULTILINE)


def invariant_no_test_downgrade(diff_text: str) -> bool:
    # Fail if the patch deletes assertions or sneaks in skip/xfail markers.
    forbidden = [r'-\s*assert .*is False', r'\bskip\(', r'xfail']
    return not any(re.search(pattern, diff_text) for pattern in forbidden)
```

```
$ python run_eval.py --task auth-refresh-token-regression --model local-qwen-coder
[agent] plan: inspect auth service, patch token revocation, run focused tests
[verify] pytest tests/auth -q .................................... PASSED
[verify] ruff check services/auth tests/auth ..................... PASSED
[verify] python scripts/check_invariants.py ...................... FAILED
[score] task_success=0.75 file_accuracy=1.00 invariant_safety=0.00 reviewability=0.92
[hint] audit.log_security_event disappeared from services/auth/reset.py
```
What went wrong and the tradeoffs
Failure mode 1: you overfit to your harness
If tasks stay static, the model starts learning the grading routine instead of the engineering problem. Hidden holdout tasks and rotated prompts help a lot.
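One cheap way to keep the holdout stable is to derive membership from a hash of the task id, so the split never drifts as tasks are added or reordered. A sketch under that assumption:

```python
import hashlib


def is_holdout(task_id: str, fraction: float = 0.2) -> bool:
    # Deterministic split: the same task id always lands on the same
    # side, regardless of how many tasks exist or their ordering.
    digest = hashlib.sha256(task_id.encode()).digest()
    return digest[0] / 256 < fraction
```

Because the split depends only on the id, you can regenerate it on any machine without shipping a membership list.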
Failure mode 2: patch success hides review pain
A patch can pass tests and still be miserable to merge because it rewrites too much or smuggles in unrelated cleanup. Reviewability deserves its own score.
Failure mode 3: security-sensitive repos need harsher weighting
In auth, payments, infra, and deletion flows, passing tests should not outweigh broken invariants or forbidden patterns.
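One way to encode that weighting is a hard gate rather than a softer penalty: in risky task families, a broken invariant zeroes the run no matter what the tests say. A sketch; the family names are illustrative:

```python
# Task families where a broken invariant should never be outvoted by green tests.
RISKY_FAMILIES = {"auth", "payments", "infra", "deletion"}


def gated_score(base_score: float, invariant_safety: float, risk_flags: list[str]) -> float:
    # A hard gate: any invariant failure in a risky family zeroes the score.
    if invariant_safety < 1.0 and RISKY_FAMILIES & set(risk_flags):
        return 0.0
    return base_score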
| Eval style | Fast to set up | Useful for coding agents | Main weakness |
|---|---|---|---|
| Prompt-response grading | Yes | Low | Ignores repo state and execution |
| Golden diff matching | Medium | Medium | Punishes valid alternate fixes |
| Test-only grading | Yes | Medium | Misses unsafe shortcuts |
| Patch + invariants + review score | No | High | More setup and maintenance |
Practical checklist
- define task manifests with base commit, prompt, allowed paths, and verify commands
- run each task in a fresh branch, worktree, or container
- score both success and blast radius
- add at least one invariant check for every high-risk task family
- save structured artifacts, not just pass or fail
- maintain a hidden holdout set before changing prompts or models
- inspect a sample of passing patches manually every week
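The checklist above collapses into a fairly small driver loop. A minimal sketch, with the manifest represented as a plain dict and `run_check` mirroring the runner shown earlier:

```python
import subprocess
import time
from pathlib import Path


def run_check(command: str, cwd: Path) -> dict:
    # Same shape as the runner's check record: command, pass/fail, timing.
    started = time.time()
    proc = subprocess.run(command, cwd=cwd, shell=True, text=True, capture_output=True)
    return {
        "command": command,
        "ok": proc.returncode == 0,
        "runtime_seconds": round(time.time() - started, 2),
    }


def run_task(manifest: dict, workdir: Path) -> dict:
    # Run every verify command; a task only succeeds if all of them pass,
    # and the full check list is kept as structured evidence.
    checks = [run_check(cmd, workdir) for cmd in manifest["verify"]]
    return {
        "task_id": manifest["id"],
        "task_success": all(c["ok"] for c in checks),
        "checks": checks,
    }
```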
Conclusion
If you want better coding agents, stop asking whether the model can produce a patch and start asking whether the patch survives verification, preserves invariants, and stays mergeable. A good eval harness turns that question into data instead of guesswork.