
Prompt Caching for Tool-Heavy AI Agents That Need to Stay Fast

April 2, 2026 • 9 min read

Tool-heavy agents get expensive for a boring reason: they keep resending the same context. The system prompt barely changes. Tool schemas barely change. Operating rules barely change. But every turn pays to reprocess them unless you design for cacheability.

If your agent reads code, calls tools, retries, summarizes, and asks follow-up questions, prompt caching is one of the highest-leverage optimizations you can make. It does not magically fix a bad workflow, but it can shave real latency and cost off the repetitive part of every turn.

The trick is simple in theory and easy to get wrong in practice: keep the front of the prompt stable, push variable inputs toward the end, and stop dragging a full transcript around after the useful bits have already been compressed.

Where tool-heavy agents waste money

A typical coding or ops agent sends the model a long system prompt, stable policies, tool schemas, prior turns, retrieved files, logs, and the actual new task. Only the tail of that request changes a lot. The front is usually near-identical.

That means the expensive part of each request is often not the new task. It is the repeated preamble. If you keep changing the prefix order, inserting dynamic state too early, or stuffing every new retrieval chunk before the instruction block, you destroy cache reuse.

The practical design rule: freeze the prefix

Prompt caching keys on exact prefix matches. So the goal is not just making prompts shorter. It is making the beginning of prompts extremely stable.
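A small sketch makes the exact-match point concrete. Caches actually operate on token prefixes, but serializing two requests and measuring their shared leading characters is a fair proxy: one dynamic value injected early wipes out nearly all reuse. The request shapes here are illustrative.

```python
import json

def shared_prefix_chars(a: dict, b: dict) -> int:
    """Length of the common leading span of two serialized requests."""
    sa, sb = json.dumps(a), json.dumps(b)
    n = 0
    for ca, cb in zip(sa, sb):
        if ca != cb:
            break
        n += 1
    return n

stable = {"input": [{"role": "system", "content": "You are a coding agent."},
                    {"role": "user", "content": "Fix the failing tests."}]}
# Same request with a timestamp shoved in front of everything.
with_time = {"input": [{"role": "user", "content": "Current time: 12:00:07"},
                       *stable["input"]]}

# Identical requests share the full prefix; the early timestamp
# reduces the shared prefix to a couple of dozen characters.
print(shared_prefix_chars(stable, stable))
print(shared_prefix_chars(stable, with_time))
```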

  1. System role and permanent operating rules
  2. Stable team or app instructions
  3. Stable tool definitions
  4. Stable examples if you truly need them
  5. Rolling compacted state
  6. Retrieved task-specific context
  7. The latest user request

That ordering keeps the reusable block at the front and pushes volatile data to the back where it cannot poison the cache key.
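The seven-slot ordering above can be sketched as a request builder. Everything here is illustrative (the message shapes and names are assumptions, not a specific vendor API); the point is that the frozen prefix is a constant and every volatile input is appended after it.

```python
# Slots 1-4: frozen, byte-identical on every request.
STABLE_PREFIX = [
    {"role": "system", "content": "Permanent operating rules ..."},   # 1
    {"role": "system", "content": "Team/app instructions ..."},       # 2
    {"role": "system", "content": "Tool definitions ..."},            # 3
    {"role": "system", "content": "Stable few-shot examples ..."},    # 4
]

def build_input(compacted_state: str, retrieved_chunks: list[str],
                user_request: str) -> list[dict]:
    """Volatile material goes after the frozen prefix, never before it."""
    return [
        *STABLE_PREFIX,
        {"role": "assistant", "content": compacted_state},               # 5
        *({"role": "user", "content": c} for c in retrieved_chunks),     # 6
        {"role": "user", "content": user_request},                       # 7
    ]
```

Because slots 1-4 never change, the cacheable block grows to cover the entire instruction payload, and only the tail is reprocessed each turn.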

A bad pattern versus a better one

Bad

{
  "input": [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Current time: 2026-04-02T12:00:07Z"},
    {"role": "user", "content": "Relevant repo files: ...dynamic chunk..."},
    {"role": "user", "content": "Tool list: ...long schema block..."},
    {"role": "user", "content": "Please fix the failing tests."}
  ]
}

Better

{
  "input": [
    {"role": "system", "content": "You are a coding agent. Follow repo rules, explain risky actions, and verify before finishing."},
    {"role": "system", "content": "Available tools: ...stable schema block..."},
    {"role": "system", "content": "Workflow policy: inspect, plan, edit, test, summarize."},
    {"role": "assistant", "content": "Compacted prior state: branch, known failures, accepted constraints."},
    {"role": "user", "content": "Relevant repo files: ...dynamic chunk..."},
    {"role": "user", "content": "Please fix the failing tests in auth middleware."}
  ]
}

Nothing revolutionary there. It is just discipline.

Prompt caching is not only for chatbots

In a tool-heavy workflow, the model repeatedly re-reads tool schemas, environment policies, output rules, coding conventions, repo guidance, and escalation rules. That is exactly the material you want the model to see every turn, and exactly the material that should be stable enough to cache.

Agent quality often pushes you toward richer instructions, while agent cost pushes you toward a reusable prefix. Prompt caching is the compromise that lets you have both.

Use compaction before the transcript becomes the product

Long-running agents have a second problem: even if the prefix is stable, the conversation history keeps growing. After enough tool calls, intermediate plans, and dead-end reasoning, sending the full transcript every turn becomes silly.

The fix is to carry forward state, not raw history. Use server-side compaction when your API supports it, or explicitly compact and pass forward the compacted window as the new baseline context.

What should survive compaction

  • the current goal
  • accepted constraints
  • important prior decisions
  • unresolved risks
  • tool outputs that still matter
  • critical names, paths, and identifiers

Usually drop superseded plans, repetitive confirmations, verbose tool chatter, and raw logs that can be fetched again.
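A crude version of that keep/drop policy can be written as a filter. The record shape and `kind` tags are assumptions for illustration; real compaction is usually model-driven, but the survival rules are the same.

```python
# Categories that carry forward as state.
KEEP = {"goal", "constraint", "decision", "risk", "tool_output", "identifier"}
# Categories that get dropped: superseded plans, confirmations,
# verbose tool chatter, and raw logs that can be fetched again.

def compact(records: list[dict]) -> list[dict]:
    """Carry forward state, not raw history."""
    return [r for r in records if r["kind"] in KEEP]

history = [
    {"kind": "goal", "text": "fix auth middleware tests"},
    {"kind": "raw_log", "text": "...3000 lines of pytest output..."},
    {"kind": "decision", "text": "pin bcrypt to 4.x"},
    {"kind": "tool_chatter", "text": "running command..."},
]
print(compact(history))  # keeps only the goal and the decision
```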

Retrieval boundaries matter as much as caching

Teams often sabotage caching by retrieving too much context too early. A better pattern is to keep the reusable instruction prefix fixed, retrieve only the smallest task-relevant context, inject those results after the stable prefix, and replace them aggressively between turns.

A coding agent usually does not need the entire repo map, multiple old traces, and a copied issue thread on every turn. It needs the few files or traces relevant to the current decision.
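A minimal sketch of that narrowing, assuming a naive keyword-overlap score (any real system would use embeddings or a code index): the details that matter are the relevance floor and the hard cap on chunks.

```python
def fetch_minimum_relevant_context(task: str, corpus: dict[str, str],
                                   max_chunks: int = 3,
                                   min_overlap: int = 2) -> list[str]:
    """Rank files by word overlap with the task; keep only the top few."""
    task_words = set(task.lower().split())
    scored = []
    for path, text in corpus.items():
        overlap = len(task_words & set(text.lower().split()))
        if overlap >= min_overlap:          # relevance floor
            scored.append((overlap, path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:max_chunks]]  # hard cap

corpus = {
    "auth/middleware.py": "auth middleware token check failing tests",
    "README.md": "project overview and setup",
    "billing/invoice.py": "invoice generation",
}
print(fetch_minimum_relevant_context("fix failing auth middleware tests", corpus))
```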

A production-friendly request loop

from openai import OpenAI

client = OpenAI()

# Stable, frozen prefix: byte-identical on every request.
BASE_PREFIX = [
    SYSTEM_RULES,
    TOOL_SCHEMAS,
    WORKFLOW_POLICY,
]

state = load_compacted_state()

while True:
    # Volatile material: fetched fresh each turn, injected after the prefix.
    retrieved = fetch_minimum_relevant_context(task=state.active_task)
    latest_user_message = get_latest_user_message()

    response = client.responses.create(
        model="gpt-5.4",
        input=[*BASE_PREFIX, state, *retrieved, latest_user_message],
        prompt_cache_retention="24h",        # keep the cached prefix warm across sessions
        prompt_cache_key="repo:main-agent",  # route related requests to the same cache
        store=False,
        context_management=[{"type": "compaction", "compact_threshold": 200000}],
    )

    # Carry forward compacted state, not the raw transcript.
    state = extract_latest_compacted_state(response)
    handle_tools_and_output(response)
The important bits are that the base prefix stays stable, state is compact instead of a raw transcript, retrieval is deliberately minimal, and compaction prevents context from bloating forever.

Mistakes that quietly kill cache hit rate

  • Injecting timestamps near the top
  • Reordering tools between requests
  • Shoving retrieval before instructions
  • Treating every past turn as sacred
  • Changing prompt templates casually
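Most of these mistakes are detectable mechanically. One cheap guard, sketched here as an assumption rather than a vendor feature, is to fingerprint the stable block on every request: if the hash changes between turns, something dynamic has leaked into the prefix.

```python
import hashlib
import json

def prefix_fingerprint(prefix_messages: list[dict]) -> str:
    """Stable hash of the supposedly frozen prefix."""
    payload = json.dumps(prefix_messages, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

prefix = [{"role": "system", "content": "You are a coding agent."}]
first = prefix_fingerprint(prefix)

# A timestamp sneaking into the prefix shows up immediately as a changed hash.
drifted = [{"role": "system", "content": "You are a coding agent. Now: 12:00"}]
print(first == prefix_fingerprint(prefix))   # unchanged prefix: True
print(first == prefix_fingerprint(drifted))  # drifted prefix: False
```

Logging this fingerprint alongside each request turns "the cache hit rate dropped" from a mystery into a diff.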

How to tell whether this is working

Do not guess. Measure cached prompt tokens, median latency by workflow, long-tail latency after many turns, input token count before and after compaction, and task success rate after shrinking retrieval scope.

If cost drops but accuracy tanks, you compacted too aggressively or retrieved too little. If accuracy stays stable but latency barely moves, your supposedly stable prefix probably is not actually stable.
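The first of those metrics is a one-liner once you read the usage object. The field names below follow the OpenAI Responses API shape (`usage.input_tokens` and `usage.input_tokens_details.cached_tokens`); adjust for your provider if they differ.

```python
def cache_hit_rate(input_tokens: int, cached_tokens: int) -> float:
    """Fraction of this turn's prompt that was served from cache."""
    return cached_tokens / input_tokens if input_tokens else 0.0

# Example: a 12,000-token prompt with 10,240 tokens cached.
rate = cache_hit_rate(12_000, 10_240)
print(f"{rate:.0%}")
```

Track this per workflow over time; a stable prefix should push it high and keep it there, while drift shows up as sudden drops.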

A sane rollout plan

  1. Freeze and version the reusable prefix
  2. Move dynamic retrieval and timestamps to the end
  3. Keep tool schemas stable across related requests
  4. Introduce prompt cache keys for repeated workflows
  5. Add compaction once long-running sessions become noisy
  6. Measure cost and task success before tightening further

The real takeaway

The fastest way to make an agent cheaper is often not choosing a smaller model. It is stopping the expensive habit of making the model reread the same novel every turn.

Prompt caching works best when paired with prompt discipline, narrow retrieval, and aggressive context compaction. Together, those three patterns make tool-heavy agents less bloated, less fragile, and much easier to run at scale.

