Most AI coding demos still fail the same test: the patch looks plausible, but it quietly breaks something off-camera. The model updates the file you expected, misses the adjacent invariant, and the reviewer only notices after CI fails or a customer does.
If you want coding agents to improve, you need an eval harness that grades real repository work instead of prompt beauty contests. That means task manifests, sandboxed verification, invariant checks, and scorecards that separate "made a diff" from "produced a safe fix".
This is the harness shape I would actually use for patch-level agent evaluation.
Why this matters
A coding agent is rarely judged on one thing. In practice you care about task success, file accuracy, invariant safety, reviewability, and cost. Generic benchmark scores do not give you that mix, but a repo-local harness can answer the questions that actually matter:
- did it edit the right files?
- did the verify commands pass?
- did it preserve non-obvious behavior?
- did it avoid risky shortcuts?
- is the patch something a reviewer can merge calmly?
Useful references: OpenAI Evals, SWE-bench, OpenTelemetry, and LangSmith evaluations.
Architecture overview
A useful harness has four stages: prepare a task, run the agent in isolation, verify the patch, and score the result.
```mermaid
flowchart LR
    A["Task manifest<br/>repo state + prompt + allowed paths"] --> B["Agent runner<br/>branch or worktree sandbox"]
    B --> C["Verification lane<br/>pytest, lint, invariant checks, policy scans"]
    C --> D["Scorecard<br/>pass rate, file accuracy, risk flags, review quality"]
    D --> E["Report<br/>json + html summary + failing evidence"]
```
Task packet
I like task manifests that are explicit enough to replay later and small enough to diff in code review.
```yaml
id: auth-refresh-token-regression
base_commit: 6bbda18
repo: github.com/acme/api
prompt: |
  Fix the bug where refresh tokens remain valid after password reset.
  Preserve the mobile login flow and do not change public API schemas.
allowed_paths:
  - services/auth/**
  - tests/auth/**
verify:
  - pytest tests/auth -q
  - ruff check services/auth tests/auth
  - python scripts/check_invariants.py --task auth-refresh-token-regression
invariants:
  - password reset must revoke outstanding refresh tokens
  - existing session audit logging must stay intact
risk_flags:
  - auth
  - session-management
```

Runner evidence
The runner should save more than a final diff. It should capture changed files, checks, trace output, runtime, and risk flags so failures are explainable.
```python
from dataclasses import dataclass
from pathlib import Path
import subprocess
import time


@dataclass
class EvalResult:
    task_id: str
    exit_code: int
    changed_files: list[str]
    runtime_seconds: float
    checks: dict
    risk_flags: list[str]


def run_check(command: str, cwd: Path) -> dict:
    started = time.time()
    proc = subprocess.run(command, cwd=cwd, shell=True, text=True, capture_output=True)
    return {
        "command": command,
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout": proc.stdout[-6000:],
        "stderr": proc.stderr[-6000:],
        "runtime_seconds": round(time.time() - started, 2),
    }
```

Implementation details
The highest-value harnesses do not collapse everything into one magic number. They keep a small scorecard that maps to how humans actually review patches.
| Dimension | What it measures | Good signal | Failure smell |
|---|---|---|---|
| Task success | Whether required checks passed | Targeted tests green | Suite skipped or weakened |
| File accuracy | Whether edits stayed in scope | Only expected files changed | Unrelated churn across repo |
| Invariant safety | Whether critical behavior stayed true | Custom checks pass | Auth, billing, or data-loss regressions |
| Reviewability | Whether a human can inspect the patch quickly | Clear diff, small scope | Giant generated rewrite |
| Efficiency | Whether runtime and cost stay bounded | Stable runtime | Looping retries and token waste |
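When you do need a single number to rank runs, collapse those dimensions with weights that are explicit and tunable rather than buried in the grader. A minimal sketch; the weights here are illustrative, not prescriptive:

```python
# Illustrative weights per scorecard dimension; tune per repo and risk profile.
WEIGHTS = {
    "task_success": 0.35,
    "file_accuracy": 0.20,
    "invariant_safety": 0.25,
    "reviewability": 0.15,
    "efficiency": 0.05,
}


def aggregate(scores: dict[str, float]) -> float:
    # Weighted mean over the five dimensions; missing dimensions score zero.
    return round(sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS), 2)
```

Keeping the weights in one dict makes the tradeoff reviewable: anyone can see that invariant safety outweighs reviewability before arguing about a ranking.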
Invariant checks
A lot of bad patches pass focused tests because the original tests were incomplete. A second verification lane catches cheap wins that would otherwise pollute your benchmark.
```python
from pathlib import Path
import re

# Unified diff captured by the runner before verification.
DIFF = Path('.git/invariant.patch')


def invariant_auth_logging(diff_text: str) -> bool:
    # Fail if the patch removes any call to the audit logger.
    return not re.search(r'^-.*audit\.log_security_event', diff_text, re.MULTILINE)


def invariant_no_test_downgrade(diff_text: str) -> bool:
    # Fail if the patch deletes assertions or sneaks in skip/xfail markers.
    forbidden = [r'-\s*assert .*is False', r'\bskip\(', r'xfail']
    return not any(re.search(pattern, diff_text) for pattern in forbidden)
```

```
$ python run_eval.py --task auth-refresh-token-regression --model local-qwen-coder
[agent] plan: inspect auth service, patch token revocation, run focused tests
[verify] pytest tests/auth -q .................................... PASSED
[verify] ruff check services/auth tests/auth ..................... PASSED
[verify] python scripts/check_invariants.py ...................... FAILED
[score] task_success=0.75 file_accuracy=1.00 invariant_safety=0.00 reviewability=0.92
[hint] audit.log_security_event disappeared from services/auth/reset.py
```
What went wrong and the tradeoffs
Failure mode 1: you overfit to your harness
If tasks stay static, the model starts learning the grading routine instead of the engineering problem. Hidden holdout tasks and rotated prompts help a lot.
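One cheap way to keep the holdout stable is to derive membership from a hash of the task id, so the split never drifts as tasks are added or reordered. A sketch under that assumption:

```python
import hashlib


def is_holdout(task_id: str, fraction: float = 0.2) -> bool:
    # Deterministic split: the same task id always lands on the same
    # side, regardless of how many tasks exist or their ordering.
    digest = hashlib.sha256(task_id.encode()).digest()
    return digest[0] / 256 < fraction
```

Because the split depends only on the id, you can regenerate it on any machine without shipping a membership list.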
Failure mode 2: patch success hides review pain
A patch can pass tests and still be miserable to merge because it rewrites too much or smuggles in unrelated cleanup. Reviewability deserves its own score.
Failure mode 3: security-sensitive repos need harsher weighting
In auth, payments, infra, and deletion flows, passing tests should not outweigh broken invariants or forbidden patterns.
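One way to encode that weighting is a hard gate rather than a softer penalty: in risky task families, a broken invariant zeroes the run no matter what the tests say. A sketch; the family names are illustrative:

```python
# Task families where a broken invariant should never be outvoted by green tests.
RISKY_FAMILIES = {"auth", "payments", "infra", "deletion"}


def gated_score(base_score: float, invariant_safety: float, risk_flags: list[str]) -> float:
    # A hard gate: any invariant failure in a risky family zeroes the score.
    if invariant_safety < 1.0 and RISKY_FAMILIES & set(risk_flags):
        return 0.0
    return base_score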
| Eval style | Fast to set up | Useful for coding agents | Main weakness |
|---|---|---|---|
| Prompt-response grading | Yes | Low | Ignores repo state and execution |
| Golden diff matching | Medium | Medium | Punishes valid alternate fixes |
| Test-only grading | Yes | Medium | Misses unsafe shortcuts |
| Patch + invariants + review score | No | High | More setup and maintenance |
Practical checklist
- define task manifests with base commit, prompt, allowed paths, and verify commands
- run each task in a fresh branch, worktree, or container
- score both success and blast radius
- add at least one invariant check for every high-risk task family
- save structured artifacts, not just pass or fail
- maintain a hidden holdout set before changing prompts or models
- inspect a sample of passing patches manually every week
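The checklist above collapses into a fairly small driver loop. A minimal sketch, with the manifest represented as a plain dict and `run_check` mirroring the runner shown earlier:

```python
import subprocess
import time
from pathlib import Path


def run_check(command: str, cwd: Path) -> dict:
    # Same shape as the runner's check record: command, pass/fail, timing.
    started = time.time()
    proc = subprocess.run(command, cwd=cwd, shell=True, text=True, capture_output=True)
    return {
        "command": command,
        "ok": proc.returncode == 0,
        "runtime_seconds": round(time.time() - started, 2),
    }


def run_task(manifest: dict, workdir: Path) -> dict:
    # Run every verify command; a task only succeeds if all of them pass,
    # and the full check list is kept as structured evidence.
    checks = [run_check(cmd, workdir) for cmd in manifest["verify"]]
    return {
        "task_id": manifest["id"],
        "task_success": all(c["ok"] for c in checks),
        "checks": checks,
    }
```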
Conclusion
If you want better coding agents, stop asking whether the model can produce a patch and start asking whether the patch survives verification, preserves invariants, and stays mergeable. A good eval harness turns that question into data instead of guesswork.