Planner-Executor Handoffs for AI Agents That Stay Reviewable

Multi-agent demos usually fail in one boring place, the handoff. The planner writes a beautiful task list, the executor improvises past it, and by the time a reviewer looks at the diff nobody can tell what was intended versus what was guessed.

This gets worse under real load. Long prompts turn into soft contracts, retries re-run half-complete work, and two agents quietly operate on different assumptions about allowed files, budgets, or verification.

The fix is not more agent cleverness. It is a better handoff contract. If the planner emits a small manifest, the executor claims one task at a time, and verification gates are explicit, the system becomes much easier to debug and trust.

Why this matters

The planner-executor split is attractive because it gives each agent a narrower job:

the planner decomposes work
the executor edits code or runs commands
a verifier checks whether the result actually satisfied the contract

That separation only helps if the handoff is strong enough to survive model drift, retries, and human review. Otherwise you get more moving parts and less accountability.

Useful references: Anthropic on building effective agents, OpenTelemetry, and AWS Step Functions task tokens.

Architecture or workflow overview

I like a four-artifact handoff: plan, claim check, execution record, and verification result.

flowchart LR
    A[Planner] --> B[Task manifest
objective, files, constraints, checks]
    B --> C[Claim check
task id, owner, ttl, attempt]
    C --> D[Executor
edit, run, summarize]
    D --> E[Verifier
tests, invariants, diff review]
    E --> F[Ledger
status, rollback note, trace id]

The contract I actually want

Manifest with one bounded task, not a novel.
Allowed surface area listing files, tools, and write scope.
Verification gates that must pass before completion.
Claim check so only one executor owns the task at a time.
Execution summary that says what changed and what remains risky.

Implementation details

1. Make the planner write a real task manifest

If the planner hands over prose, the executor will fill in missing rules from vibes. A manifest gives you something reviewable.

task_id: repo-142
objective: Add rate-limit backoff to the GitHub sync worker
allowed_paths:
  - src/github_sync.py
  - tests/test_github_sync.py
constraints:
  - Do not change API response schemas
  - Keep max retry delay under 30 seconds
verification:
  - pytest tests/test_github_sync.py -q
  - python -m scripts.smoke_github_sync
handoff_notes:
  - Existing failures come from 429 handling
  - Prefer jittered exponential backoff

The planner should not pre-write the entire patch. It should define the boundary, the success checks, and the danger zones.

2. Use a claim check before execution starts

This is the part teams skip, and then they wonder why two executors stamped on the same task.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ClaimCheck:
    task_id: str
    owner: str
    expires_at: datetime
    attempt: int

    @classmethod
    def create(cls, task_id: str, owner: str, minutes: int = 20, attempt: int = 1):
        return cls(
            task_id=task_id,
            owner=owner,
            expires_at=datetime.utcnow() + timedelta(minutes=minutes),
            attempt=attempt,
        )

    def expired(self) -> bool:
        return datetime.utcnow() >= self.expires_at

A claim check is not fancy. It just prevents accidental parallelism and gives you a clean retry story.

3. Keep executor output structured

I do not want the executor returning only "done". I want a compact record a human or verifier can inspect.

{
  "task_id": "repo-142",
  "status": "needs_review",
  "files_changed": [
    "src/github_sync.py",
    "tests/test_github_sync.py"
  ],
  "commands_run": [
    "pytest tests/test_github_sync.py -q",
    "python -m scripts.smoke_github_sync"
  ],
  "risks": [
    "Backoff constants tuned for current API rate window"
  ],
  "next_step": "Review retry cap before merging"
}

That record becomes the bridge between execution and review. It also makes it easier to build audit trails later.

4. Put verification outside the executor loop

The executor should propose work. The verifier should decide whether the work satisfied the contract.

$ agent-run execute repo-142
[claim] task=repo-142 owner=executor-3 ttl=20m
[edit] src/github_sync.py updated
[test] pytest tests/test_github_sync.py -q ........ PASS
[smoke] python -m scripts.smoke_github_sync ...... PASS
[summary] status=needs_review files=2 risks=1
[verify] waiting on diff and invariant checks

This split matters because otherwise the same model that made the change is also grading its own homework.

What went wrong and the tradeoffs

Failure mode 1, the planner over-specifies the implementation

A planner that writes twenty tiny steps plus code-level instructions usually just pushes complexity downstream. The executor either ignores the plan or follows it so literally that better options are missed.

What I would not do: make the planner dictate exact code when the real need is a boundary and a verification target.

Failure mode 2, the executor escapes the manifest

If allowed paths are not explicit, the executor will eventually wander into config files, shared utilities, or formatting churn. Then review scope explodes.

Failure mode 3, retries lose ownership state

Without claim expiry and attempt counts, a crashed executor leaves behind a zombie task. The next worker cannot tell whether to resume, retry, or stop.

Pitfall: a planner-executor split does not magically create reliability. It usually adds one more place where ambiguous state can hide unless the handoff is explicit.

Pattern	Why it feels good initially	What breaks later	Better default
Planner writes long prose	Easy to prompt	Soft constraints, hidden assumptions	Small manifest with bounded fields
No claim ownership	Fewer moving parts	Duplicate execution	Claim check with TTL and attempt
Executor self-verifies	Fast loop	Self-grading bias	Separate verifier or invariant stage
Open-ended file access	More flexibility	Review blast radius grows	Allowed path list

One tradeoff is real, though. Stronger handoffs add a bit of friction. For very small tasks, a single well-bounded agent may be simpler. I only split planner and executor when the task has enough complexity that reviewability matters more than pure speed.

Practical checklist or decision framework

Best practice: if a human reviewer cannot understand the task from the manifest plus execution record, the handoff is still too fuzzy.

write manifests with objective, allowed paths, constraints, and verification commands
issue claim checks before any write-capable work starts
keep one executor per task at a time, with ttl and attempt count
require executor summaries that list files changed, commands run, and remaining risks
separate verification from generation whenever the task can affect code, infra, or external state
prefer a single agent for tiny tasks and a planner-executor split for higher-risk or multi-file work
store ledger events so retries and postmortems are explainable later

Conclusion

Planner-executor systems stay useful when the handoff looks more like a small workflow contract and less like a motivational speech. Keep the manifest tight, claim work explicitly, verify outside the edit loop, and make every run easy to inspect. Then multi-agent architecture stops being theater and starts being operationally sane.

Multi-Agent Systems AI Agents Workflow Design Reliability Reviewability