Multi-agent demos usually fail in one boring place: the handoff. The planner writes a beautiful task list, the executor improvises past it, and by the time a reviewer looks at the diff nobody can tell what was intended and what was guessed.
This gets worse under real load. Long prompts turn into soft contracts, retries re-run half-complete work, and two agents quietly operate on different assumptions about allowed files, budgets, or verification.
The fix is not more agent cleverness. It is a better handoff contract. If the planner emits a small manifest, the executor claims one task at a time, and verification gates are explicit, the system becomes much easier to debug and trust.
Why this matters
The planner-executor split is attractive because it gives each agent a narrower job:
- the planner decomposes work
- the executor edits code or runs commands
- a verifier checks whether the result actually satisfied the contract
That separation only helps if the handoff is strong enough to survive model drift, retries, and human review. Otherwise you get more moving parts and less accountability.
Useful references: Anthropic on building effective agents, OpenTelemetry, and AWS Step Functions task tokens.
Architecture or workflow overview
I like a four-artifact handoff: plan, claim check, execution record, and verification result.
```mermaid
flowchart LR
    A[Planner] --> B["Task manifest<br/>objective, files, constraints, checks"]
    B --> C["Claim check<br/>task id, owner, ttl, attempt"]
    C --> D["Executor<br/>edit, run, summarize"]
    D --> E["Verifier<br/>tests, invariants, diff review"]
    E --> F["Ledger<br/>status, rollback note, trace id"]
```
The contract I actually want
- Manifest with one bounded task, not a novel.
- Allowed surface area listing files, tools, and write scope.
- Verification gates that must pass before completion.
- Claim check so only one executor owns the task at a time.
- Execution summary that says what changed and what remains risky.
Implementation details
1. Make the planner write a real task manifest
If the planner hands over prose, the executor will fill in missing rules from vibes. A manifest gives you something reviewable.
```yaml
task_id: repo-142
objective: Add rate-limit backoff to the GitHub sync worker
allowed_paths:
  - src/github_sync.py
  - tests/test_github_sync.py
constraints:
  - Do not change API response schemas
  - Keep max retry delay under 30 seconds
verification:
  - pytest tests/test_github_sync.py -q
  - python -m scripts.smoke_github_sync
handoff_notes:
  - Existing failures come from 429 handling
  - Prefer jittered exponential backoff
```
The planner should not pre-write the entire patch. It should define the boundary, the success checks, and the danger zones.
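A manifest is only reviewable if something rejects a malformed one before the executor sees it. Here is a minimal sketch of that check, assuming the manifest has already been parsed into a dict (for example by a YAML loader). `validate_manifest` and `REQUIRED_FIELDS` are hypothetical names, and treating `constraints` and `handoff_notes` as optional is an assumption of this sketch, not a rule of the manifest format.

```python
from typing import Any

# Field names mirror the example manifest; which ones are mandatory is an assumption.
REQUIRED_FIELDS = {
    "task_id": str,
    "objective": str,
    "allowed_paths": list,
    "verification": list,
}


def validate_manifest(manifest: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the manifest is usable."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in manifest:
            problems.append(f"missing field: {name}")
        elif not isinstance(manifest[name], expected_type):
            problems.append(f"{name} must be a {expected_type.__name__}")
        elif not manifest[name]:
            problems.append(f"{name} must not be empty")
    return problems


manifest = {
    "task_id": "repo-142",
    "objective": "Add rate-limit backoff to the GitHub sync worker",
    "allowed_paths": ["src/github_sync.py", "tests/test_github_sync.py"],
    "constraints": ["Do not change API response schemas"],
    "verification": ["pytest tests/test_github_sync.py -q"],
}
print(validate_manifest(manifest))  # prints []
```

Returning a list of problems instead of raising on the first one lets the planner fix everything in a single round trip.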
2. Use a claim check before execution starts
This is the part teams skip, and then they wonder why two executors stamped on the same task.
```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ClaimCheck:
    task_id: str
    owner: str
    expires_at: datetime
    attempt: int

    @classmethod
    def create(cls, task_id: str, owner: str, minutes: int = 20, attempt: int = 1) -> "ClaimCheck":
        return cls(
            task_id=task_id,
            owner=owner,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
            expires_at=datetime.now(timezone.utc) + timedelta(minutes=minutes),
            attempt=attempt,
        )

    def expired(self) -> bool:
        # Once the TTL has passed, another executor may take over the task.
        return datetime.now(timezone.utc) >= self.expires_at
```
A claim check is not fancy. It just prevents accidental parallelism and gives you a clean retry story.
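The claim only prevents parallelism if something refuses a second claim while the first is still live. The sketch below shows one way that could look, using an in-memory registry as a stand-in for whatever shared store a real system would use. `ClaimRegistry`, `try_claim`, and the simplified `Claim` record are hypothetical names defined here so the block runs on its own; bumping `attempt` when an expired claim is taken over is also an assumption.

```python
import threading
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Claim:
    task_id: str
    owner: str
    expires_at: datetime
    attempt: int


class ClaimRegistry:
    """In-memory stand-in for a shared claim store (hypothetical)."""

    def __init__(self) -> None:
        self._claims: dict[str, Claim] = {}
        self._lock = threading.Lock()

    def try_claim(self, task_id: str, owner: str, ttl: timedelta,
                  now: Optional[datetime] = None) -> Optional[Claim]:
        now = now or datetime.now(timezone.utc)
        with self._lock:
            current = self._claims.get(task_id)
            if current is not None and now < current.expires_at:
                return None  # another executor still holds a live claim
            # No claim yet, or the old one expired: take over and count the attempt.
            attempt = current.attempt + 1 if current else 1
            claim = Claim(task_id, owner, now + ttl, attempt)
            self._claims[task_id] = claim
            return claim


registry = ClaimRegistry()
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
ttl = timedelta(minutes=20)
first = registry.try_claim("repo-142", "executor-3", ttl, now=start)
second = registry.try_claim("repo-142", "executor-4", ttl, now=start + timedelta(minutes=5))
third = registry.try_claim("repo-142", "executor-4", ttl, now=start + timedelta(minutes=21))
print(first.attempt, second, third.attempt)  # prints 1 None 2
```

Passing `now` explicitly keeps the expiry logic testable without sleeping. A lock only covers one process, so a multi-worker setup would need the same check done atomically in the shared store.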
3. Keep executor output structured
I do not want the executor returning only "done". I want a compact record a human or verifier can inspect.
```json
{
  "task_id": "repo-142",
  "status": "needs_review",
  "files_changed": [
    "src/github_sync.py",
    "tests/test_github_sync.py"
  ],
  "commands_run": [
    "pytest tests/test_github_sync.py -q",
    "python -m scripts.smoke_github_sync"
  ],
  "risks": [
    "Backoff constants tuned for current API rate window"
  ],
  "next_step": "Review retry cap before merging"
}
```
That record becomes the bridge between execution and review. It also makes it easier to build audit trails later.
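A structured record also lets you cross-check the executor's claims against the manifest. The following sketch loads a record and reports which verification commands from the manifest never show up in `commands_run`. `load_record`, `unrun_checks`, and `REQUIRED_KEYS` are hypothetical names, and the choice of required keys is an assumption based on the example record.

```python
import json

# Sample record with one verification command deliberately left out.
RECORD_JSON = """
{
  "task_id": "repo-142",
  "status": "needs_review",
  "files_changed": ["src/github_sync.py", "tests/test_github_sync.py"],
  "commands_run": ["pytest tests/test_github_sync.py -q"],
  "risks": ["Backoff constants tuned for current API rate window"],
  "next_step": "Review retry cap before merging"
}
"""

REQUIRED_KEYS = ("task_id", "status", "files_changed", "commands_run", "risks")


def load_record(raw: str) -> dict:
    """Parse an execution record and reject one that omits required keys."""
    record = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"execution record is missing: {', '.join(missing)}")
    return record


def unrun_checks(required: list[str], record: dict) -> list[str]:
    """Return manifest verification commands the executor did not report running."""
    ran = set(record["commands_run"])
    return [command for command in required if command not in ran]


record = load_record(RECORD_JSON)
required = [
    "pytest tests/test_github_sync.py -q",
    "python -m scripts.smoke_github_sync",
]
print(unrun_checks(required, record))
# prints ['python -m scripts.smoke_github_sync']
```

This only compares what the executor reported, so it catches omissions, not dishonest or mistaken reports. Independent verification still has to rerun the checks.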
4. Put verification outside the executor loop
The executor should propose work. The verifier should decide whether the work satisfied the contract.
```text
$ agent-run execute repo-142
[claim] task=repo-142 owner=executor-3 ttl=20m
[edit] src/github_sync.py updated
[test] pytest tests/test_github_sync.py -q ........ PASS
[smoke] python -m scripts.smoke_github_sync ...... PASS
[summary] status=needs_review files=2 risks=1
[verify] waiting on diff and invariant checks
```
This split matters because otherwise the same model that made the change is also grading its own homework.
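One way to keep the grading independent is a verifier that reruns the manifest's gates itself and trusts only exit codes. This is a minimal sketch under that assumption; `GateResult`, `run_gates`, and `verdict` are hypothetical names, and the two gates below are trivial stand-ins for real test commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GateResult:
    command: list[str]
    passed: bool
    output: str


def run_gates(commands: list[list[str]], timeout_s: float = 300) -> list[GateResult]:
    """Run each gate in its own process; a nonzero exit, timeout, or launch error fails it."""
    results = []
    for command in commands:
        try:
            proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
            results.append(GateResult(command, proc.returncode == 0, proc.stdout + proc.stderr))
        except subprocess.TimeoutExpired:
            results.append(GateResult(command, False, "timed out"))
        except OSError as error:
            results.append(GateResult(command, False, f"could not start: {error}"))
    return results


def verdict(results: list[GateResult]) -> str:
    # An empty gate list is rejected: no evidence is not the same as passing.
    return "verified" if results and all(r.passed for r in results) else "rejected"


# Stand-in gates: one that succeeds and one that exits with status 1.
gates = [
    [sys.executable, "-c", "print('ok')"],
    [sys.executable, "-c", "raise SystemExit(1)"],
]
results = run_gates(gates)
print(verdict(results))  # prints rejected
```

Commands are passed as argument lists instead of shell strings so the verifier never interprets executor-supplied text through a shell.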
What went wrong and the tradeoffs
Failure mode 1, the planner over-specifies the implementation
A planner that writes twenty tiny steps plus code-level instructions usually just pushes complexity downstream. The executor either ignores the plan or follows it so literally that better options are missed.
What I would not do: make the planner dictate exact code when the real need is a boundary and a verification target.
Failure mode 2, the executor escapes the manifest
If allowed paths are not explicit, the executor will eventually wander into config files, shared utilities, or formatting churn. Then review scope explodes.
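A scope check of this kind can be very small. The sketch below compares the files an executor reports changing against the manifest's allowed paths, after normalizing both so that `./` and `..` segments cannot disguise an escape. `out_of_scope` is a hypothetical helper, and exact-path matching (no globs) is an assumption of this sketch.

```python
import posixpath


def out_of_scope(files_changed: list[str], allowed_paths: list[str]) -> list[str]:
    """Return changed files that are not on the manifest's allowed list."""
    allowed = {posixpath.normpath(path) for path in allowed_paths}
    return [path for path in files_changed if posixpath.normpath(path) not in allowed]


allowed = ["src/github_sync.py", "tests/test_github_sync.py"]
changed = [
    "src/github_sync.py",
    "./tests/test_github_sync.py",
    "src/../config/settings.yaml",
]
print(out_of_scope(changed, allowed))
# prints ['src/../config/settings.yaml']
```

In practice the list of changed files should come from the diff itself, not from the executor's summary, so the check does not depend on the executor reporting honestly.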
Failure mode 3, retries lose ownership state
Without claim expiry and attempt counts, a crashed executor leaves behind a zombie task. The next worker cannot tell whether to resume, retry, or stop.
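With expiry and an attempt count on the claim, the next worker's decision can be a small pure function. The sketch below is one possible policy, not the only reasonable one: `next_action`, the `Action` values, and the `max_attempts` limit of 3 are all assumptions, and the `Claim` record is redefined here so the block is self-contained. It does not cover resuming partial work, which would also need the execution record.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Action(Enum):
    START = "start"  # nobody has claimed the task yet
    WAIT = "wait"    # a live claim exists; do not touch the task
    RETRY = "retry"  # previous claim expired and attempts remain
    STOP = "stop"    # attempts exhausted; escalate to a human


@dataclass
class Claim:
    task_id: str
    owner: str
    expires_at: datetime
    attempt: int


def next_action(claim: Optional[Claim], now: datetime, max_attempts: int = 3) -> Action:
    if claim is None:
        return Action.START
    if now < claim.expires_at:
        return Action.WAIT
    if claim.attempt >= max_attempts:
        return Action.STOP
    return Action.RETRY


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = Claim("repo-142", "executor-3", now - timedelta(minutes=1), attempt=1)
print(next_action(stale, now).value)  # prints retry
```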
| Pattern | Why it feels good initially | What breaks later | Better default |
|---|---|---|---|
| Planner writes long prose | Easy to prompt | Soft constraints, hidden assumptions | Small manifest with bounded fields |
| No claim ownership | Fewer moving parts | Duplicate execution | Claim check with TTL and attempt |
| Executor self-verifies | Fast loop | Self-grading bias | Separate verifier or invariant stage |
| Open-ended file access | More flexibility | Review blast radius grows | Allowed path list |
One tradeoff is real, though. Stronger handoffs add a bit of friction. For very small tasks, a single well-bounded agent may be simpler. I only split planner and executor when the task has enough complexity that reviewability matters more than pure speed.
Practical checklist or decision framework
- write manifests with objective, allowed paths, constraints, and verification commands
- issue claim checks before any write-capable work starts
- keep one executor per task at a time, with ttl and attempt count
- require executor summaries that list files changed, commands run, and remaining risks
- separate verification from generation whenever the task can affect code, infra, or external state
- prefer a single agent for tiny tasks and a planner-executor split for higher-risk or multi-file work
- store ledger events so retries and postmortems are explainable later
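For the ledger item, an append-only JSON Lines file is often enough to start with. The sketch below assumes that storage choice; `append_event`, `history`, the event names, and the trace id value are all hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_event(ledger_path: Path, task_id: str, event: str, **fields) -> dict:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "task_id": task_id,
        "event": event,
        **fields,
    }
    with ledger_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def history(ledger_path: Path, task_id: str) -> list[dict]:
    """Return every recorded event for one task, in the order it was written."""
    if not ledger_path.exists():
        return []
    with ledger_path.open(encoding="utf-8") as handle:
        events = [json.loads(line) for line in handle if line.strip()]
    return [entry for entry in events if entry["task_id"] == task_id]


with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "ledger.jsonl"
    append_event(ledger, "repo-142", "claimed", owner="executor-3", attempt=1)
    append_event(ledger, "repo-142", "verified", trace_id="hypothetical-trace-id")
    append_event(ledger, "repo-143", "claimed", owner="executor-4", attempt=1)
    events = [entry["event"] for entry in history(ledger, "repo-142")]
print(events)  # prints ['claimed', 'verified']
```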
Conclusion
Planner-executor systems stay useful when the handoff looks more like a small workflow contract and less like a motivational speech. Keep the manifest tight, claim work explicitly, verify outside the edit loop, and make every run easy to inspect. Then multi-agent architecture stops being theater and starts being operationally sane.