
Prompt Caching for Tool-Heavy AI Agents That Need to Stay Fast

April 2, 2026 • 9 min read

Tool-heavy agents get expensive for a boring reason: they keep resending the same context. The system prompt barely changes. Tool schemas barely change. Operating rules barely change. But every turn pays to reprocess them unless you design for cacheability.

If your agent reads code, calls tools, retries, summarizes, and asks follow-up questions, prompt caching is one of the highest-leverage optimizations you can make. It does not magically fix a bad workflow, but it can shave real latency and cost off the repetitive part of every turn.

The trick is simple in theory and easy to get wrong in practice: keep the front of the prompt stable, push variable inputs toward the end, and stop dragging a full transcript around after the useful bits have already been compressed.

Where tool-heavy agents waste money

A typical coding or ops agent sends the model a long system prompt, stable policies, tool schemas, prior turns, retrieved files, logs, and the actual new task. Only the tail of that request changes a lot. The front is usually near-identical.

That means the expensive part of each request is often not the new task. It is the repeated preamble. If you keep changing the prefix order, inserting dynamic state too early, or stuffing every new retrieval chunk before the instruction block, you destroy cache reuse.

The practical design rule: freeze the prefix

Prompt caching keys on exact prefix matches. So the goal is not just making prompts shorter. It is making the beginning of prompts extremely stable.
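A small sketch makes the exact-match point concrete. Caches actually operate on token prefixes, but serializing two requests and measuring their shared leading characters is a fair proxy: one dynamic value injected early wipes out nearly all reuse. The request shapes here are illustrative.

```python
import json

def shared_prefix_chars(a: dict, b: dict) -> int:
    """Length of the common leading span of two serialized requests."""
    sa, sb = json.dumps(a), json.dumps(b)
    n = 0
    for ca, cb in zip(sa, sb):
        if ca != cb:
            break
        n += 1
    return n

stable = {"input": [{"role": "system", "content": "You are a coding agent."},
                    {"role": "user", "content": "Fix the failing tests."}]}
# Same request with a timestamp shoved in front of everything.
with_time = {"input": [{"role": "user", "content": "Current time: 12:00:07"},
                       *stable["input"]]}

# Identical requests share the full prefix; the early timestamp
# reduces the shared prefix to a couple of dozen characters.
print(shared_prefix_chars(stable, stable))
print(shared_prefix_chars(stable, with_time))
```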

  1. System role and permanent operating rules
  2. Stable team or app instructions
  3. Stable tool definitions
  4. Stable examples if you truly need them
  5. Rolling compacted state
  6. Retrieved task-specific context
  7. The latest user request

That ordering keeps the reusable block at the front and pushes volatile data to the back where it cannot poison the cache key.
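The seven-slot ordering above can be sketched as a request builder. Everything here is illustrative (the message shapes and names are assumptions, not a specific vendor API); the point is that the frozen prefix is a constant and every volatile input is appended after it.

```python
# Slots 1-4: frozen, byte-identical on every request.
STABLE_PREFIX = [
    {"role": "system", "content": "Permanent operating rules ..."},   # 1
    {"role": "system", "content": "Team/app instructions ..."},       # 2
    {"role": "system", "content": "Tool definitions ..."},            # 3
    {"role": "system", "content": "Stable few-shot examples ..."},    # 4
]

def build_input(compacted_state: str, retrieved_chunks: list[str],
                user_request: str) -> list[dict]:
    """Volatile material goes after the frozen prefix, never before it."""
    return [
        *STABLE_PREFIX,
        {"role": "assistant", "content": compacted_state},               # 5
        *({"role": "user", "content": c} for c in retrieved_chunks),     # 6
        {"role": "user", "content": user_request},                       # 7
    ]
```

Because slots 1-4 never change, the cacheable block grows to cover the entire instruction payload, and only the tail is reprocessed each turn.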

A bad pattern versus a better one

Bad

{
  "input": [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Current time: 2026-04-02T12:00:07Z"},
    {"role": "user", "content": "Relevant repo files: ...dynamic chunk..."},
    {"role": "user", "content": "Tool list: ...long schema block..."},
    {"role": "user", "content": "Please fix the failing tests."}
  ]
}

Better

{
  "input": [
    {"role": "system", "content": "You are a coding agent. Follow repo rules, explain risky actions, and verify before finishing."},
    {"role": "system", "content": "Available tools: ...stable schema block..."},
    {"role": "system", "content": "Workflow policy: inspect, plan, edit, test, summarize."},
    {"role": "assistant", "content": "Compacted prior state: branch, known failures, accepted constraints."},
    {"role": "user", "content": "Relevant repo files: ...dynamic chunk..."},
    {"role": "user", "content": "Please fix the failing tests in auth middleware."}
  ]
}

Nothing revolutionary there. It is just discipline.

Prompt caching is not only for chatbots

In a tool-heavy workflow, the model repeatedly re-reads tool schemas, environment policies, output rules, coding conventions, repo guidance, and escalation rules. That is exactly the material you want the model to see every turn, and exactly the material that should be stable enough to cache.

Agent quality often pushes you toward richer instructions, while agent cost pushes you toward a reusable prefix. Prompt caching is the compromise that lets you have both.

Use compaction before the transcript becomes the product

Long-running agents have a second problem: even if the prefix is stable, the conversation history keeps growing. After enough tool calls, intermediate plans, and dead-end reasoning, sending the full transcript every turn becomes silly.

The fix is to carry forward state, not raw history. Use server-side compaction when your API supports it, or explicitly compact and pass forward the compacted window as the new baseline context.

What should survive compaction

  • the current goal
  • accepted constraints
  • important prior decisions
  • unresolved risks
  • tool outputs that still matter
  • critical names, paths, and identifiers

Usually drop superseded plans, repetitive confirmations, verbose tool chatter, and raw logs that can be fetched again.
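A crude version of that keep/drop policy can be written as a filter. The record shape and `kind` tags are assumptions for illustration; real compaction is usually model-driven, but the survival rules are the same.

```python
# Categories that carry forward as state.
KEEP = {"goal", "constraint", "decision", "risk", "tool_output", "identifier"}
# Categories that get dropped: superseded plans, confirmations,
# verbose tool chatter, and raw logs that can be fetched again.

def compact(records: list[dict]) -> list[dict]:
    """Carry forward state, not raw history."""
    return [r for r in records if r["kind"] in KEEP]

history = [
    {"kind": "goal", "text": "fix auth middleware tests"},
    {"kind": "raw_log", "text": "...3000 lines of pytest output..."},
    {"kind": "decision", "text": "pin bcrypt to 4.x"},
    {"kind": "tool_chatter", "text": "running command..."},
]
print(compact(history))  # keeps only the goal and the decision
```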

Retrieval boundaries matter as much as caching

Teams often sabotage caching by retrieving too much context too early. A better pattern is to keep the reusable instruction prefix fixed, retrieve only the smallest task-relevant context, inject those results after the stable prefix, and replace them aggressively between turns.

A coding agent usually does not need the entire repo map, multiple old traces, and a copied issue thread on every turn. It needs the few files or traces relevant to the current decision.
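A minimal sketch of that narrowing, assuming a naive keyword-overlap score (any real system would use embeddings or a code index): the details that matter are the relevance floor and the hard cap on chunks.

```python
def fetch_minimum_relevant_context(task: str, corpus: dict[str, str],
                                   max_chunks: int = 3,
                                   min_overlap: int = 2) -> list[str]:
    """Rank files by word overlap with the task; keep only the top few."""
    task_words = set(task.lower().split())
    scored = []
    for path, text in corpus.items():
        overlap = len(task_words & set(text.lower().split()))
        if overlap >= min_overlap:          # relevance floor
            scored.append((overlap, path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:max_chunks]]  # hard cap

corpus = {
    "auth/middleware.py": "auth middleware token check failing tests",
    "README.md": "project overview and setup",
    "billing/invoice.py": "invoice generation",
}
print(fetch_minimum_relevant_context("fix failing auth middleware tests", corpus))
```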

A production-friendly request loop

from openai import OpenAI

client = OpenAI()

# Stable, frozen prefix: byte-identical on every request.
BASE_PREFIX = [
    SYSTEM_RULES,
    TOOL_SCHEMAS,
    WORKFLOW_POLICY,
]

state = load_compacted_state()

while True:
    # Volatile material: fetched fresh each turn, injected after the prefix.
    retrieved = fetch_minimum_relevant_context(task=state.active_task)
    latest_user_message = get_latest_user_message()

    response = client.responses.create(
        model="gpt-5.4",
        input=[*BASE_PREFIX, state, *retrieved, latest_user_message],
        prompt_cache_retention="24h",        # keep the cached prefix warm across sessions
        prompt_cache_key="repo:main-agent",  # route related requests to the same cache
        store=False,
        context_management=[{"type": "compaction", "compact_threshold": 200000}],
    )

    # Carry forward compacted state, not the raw transcript.
    state = extract_latest_compacted_state(response)
    handle_tools_and_output(response)
The important bits are that the base prefix stays stable, state is compact instead of a raw transcript, retrieval is deliberately minimal, and compaction prevents context from bloating forever.

Mistakes that quietly kill cache hit rate

  • Injecting timestamps near the top
  • Reordering tools between requests
  • Shoving retrieval before instructions
  • Treating every past turn as sacred
  • Changing prompt templates casually
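Most of these mistakes are detectable mechanically. One cheap guard, sketched here as an assumption rather than a vendor feature, is to fingerprint the stable block on every request: if the hash changes between turns, something dynamic has leaked into the prefix.

```python
import hashlib
import json

def prefix_fingerprint(prefix_messages: list[dict]) -> str:
    """Stable hash of the supposedly frozen prefix."""
    payload = json.dumps(prefix_messages, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

prefix = [{"role": "system", "content": "You are a coding agent."}]
first = prefix_fingerprint(prefix)

# A timestamp sneaking into the prefix shows up immediately as a changed hash.
drifted = [{"role": "system", "content": "You are a coding agent. Now: 12:00"}]
print(first == prefix_fingerprint(prefix))   # unchanged prefix: True
print(first == prefix_fingerprint(drifted))  # drifted prefix: False
```

Logging this fingerprint alongside each request turns "the cache hit rate dropped" from a mystery into a diff.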

How to tell whether this is working

Do not guess. Measure cached prompt tokens, median latency by workflow, long-tail latency after many turns, input token count before and after compaction, and task success rate after shrinking retrieval scope.

If cost drops but accuracy tanks, you compacted too aggressively or retrieved too little. If accuracy stays stable but latency barely moves, your supposedly stable prefix probably is not actually stable.
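The first of those metrics is a one-liner once you read the usage object. The field names below follow the OpenAI Responses API shape (`usage.input_tokens` and `usage.input_tokens_details.cached_tokens`); adjust for your provider if they differ.

```python
def cache_hit_rate(input_tokens: int, cached_tokens: int) -> float:
    """Fraction of this turn's prompt that was served from cache."""
    return cached_tokens / input_tokens if input_tokens else 0.0

# Example: a 12,000-token prompt with 10,240 tokens cached.
rate = cache_hit_rate(12_000, 10_240)
print(f"{rate:.0%}")
```

Track this per workflow over time; a stable prefix should push it high and keep it there, while drift shows up as sudden drops.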

A sane rollout plan

  1. Freeze and version the reusable prefix
  2. Move dynamic retrieval and timestamps to the end
  3. Keep tool schemas stable across related requests
  4. Introduce prompt cache keys for repeated workflows
  5. Add compaction once long-running sessions become noisy
  6. Measure cost and task success before tightening further

The real takeaway

The fastest way to make an agent cheaper is often not choosing a smaller model. It is stopping the expensive habit of making the model reread the same novel every turn.

Prompt caching works best when paired with prompt discipline, narrow retrieval, and aggressive context compaction. Together, those three patterns make tool-heavy agents less bloated, less fragile, and much easier to run at scale.

