Semantic caching looks like free performance until it replays yesterday's answer into today's repository. That is where a lot of agent setups get into trouble. A fast wrong answer is often worse than a slow fresh one.
The useful version of semantic caching is not "vector search, but for prompts." It is a narrow reuse layer that understands task shape, repo state, tool versions, and trust boundaries. If those inputs drift, the cache should miss on purpose.
In this post I will show how I would wire a semantic cache for coding agents so it actually helps. The goal is lower latency and lower model spend without reusing stale plans, leaking cross-repo context, or letting cached tool results bypass verification.
Why this matters
Coding agents repeat themselves more than people expect. They summarize the same stack traces, explain the same build failures, restate the same repo instructions, and regenerate similar patch plans for similar files. If every one of those calls goes back to the model, latency and cost climb fast.
The catch is that code tasks are not static. A small commit, dependency bump, or changed tool contract can turn a previously correct answer into a misleading one. That makes semantic caching a production reliability problem, not just an optimization trick. Before serving any hit, the cache has to answer four questions:
- Is this really the same task?
- Is the repository state close enough?
- Did the dependency or tool contract change?
- Does this response still need verification before reuse?
Architecture or workflow overview
```mermaid
flowchart LR
    A[Agent request] --> B[Normalize task]
    B --> C[Build semantic fingerprint]
    C --> D[Vector + exact lookup]
    D --> E{Hit above threshold?}
    E -- no --> F[Call model]
    E -- yes --> G[Check repo snapshot, tool versions, policy lane]
    G --> H{Still valid?}
    H -- no --> F
    H -- yes --> I[Run lightweight verifier]
    I --> J{Verifier passes?}
    J -- no --> F
    J -- yes --> K[Serve cached response]
    F --> L[Store with metadata, TTL, and invalidation keys]
```

The important part is that the semantic index is only the first gate. Similarity gets you a candidate; it does not give you permission to trust it.
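The gate chain in the diagram can be sketched as a small dispatcher. This is a minimal illustration, not a real API; the `Candidate` type and function names are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    response: str
    snapshot: str  # hash of the repo state the entry was built against

def serve_or_regenerate(candidate, repo_snapshot, verifier, regenerate):
    # Similarity only nominates a candidate; validation and verification
    # decide whether it may actually be served.
    if candidate is None:
        return regenerate()            # no hit above threshold
    if candidate.snapshot != repo_snapshot:
        return regenerate()            # repo drifted since the entry was stored
    if not verifier(candidate):
        return regenerate()            # cheap assumption check failed
    return candidate.response          # safe to serve from cache
```

Every gate falls through to regeneration, so a missing or stale entry costs one model call, never a wrong answer.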
| Layer | Good for | What I would cache | Main risk |
|---|---|---|---|
| Prefix or provider prompt cache | Large repeated system prompts and repo instructions | Stable instruction blocks | Limited control over semantic correctness |
| Semantic response cache | Repeated explanations, plans, summaries, and low-risk code assists | Natural-language outputs with narrow scope | Stale answers after repo drift |
| Tool-result cache | Expensive read-only tool calls like repo maps or symbol graphs | Deterministic read outputs | Hidden dependence on changed files |
| Patch cache | Almost never by default | Tiny boilerplate transforms only | Replaying wrong edits into changed code |
Implementation details
1. Build a fingerprint that is more than prompt text
A prompt-only hash misses the thing that matters most in coding workflows: the surrounding state. I would fingerprint a request from normalized task inputs plus a small repo snapshot.
```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path
from typing import Iterable

def sha256_text(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def build_repo_snapshot(repo_root: Path, files: Iterable[str]) -> dict:
    # Hash only the files the task actually touches; the snapshot stays
    # small but still forces a cache miss when any of them drift.
    snapshot = {}
    for rel in sorted(set(files)):
        path = repo_root / rel
        if not path.exists():
            snapshot[rel] = {"exists": False}
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        snapshot[rel] = {
            "exists": True,
            "sha256": sha256_text(text),
            "bytes": len(text.encode("utf-8")),
        }
    return snapshot

def semantic_fingerprint(task: dict, repo_snapshot: dict, tool_versions: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) makes the digest depend
    # on content, not on dict ordering or formatting.
    payload = {
        "intent": task["intent"],
        "language": task.get("language"),
        "error_class": task.get("error_class"),
        "paths": sorted(task.get("paths", [])),
        "constraints": sorted(task.get("constraints", [])),
        "repo_snapshot": repo_snapshot,
        "tool_versions": tool_versions,
        "policy_lane": task.get("policy_lane", "default"),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return sha256_text(canonical)
```

This is intentionally picky. If the repo snapshot or tool versions change, I want the cache key to change too. That hurts hit rate a bit, but it protects correctness where it matters.
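That pickiness is easy to demonstrate with a stripped-down fingerprint. The helper and payloads below are illustrative; the point is that canonical serialization ignores key order but reacts to any value drift:

```python
import hashlib
import json

def fingerprint(payload: dict) -> str:
    # Sorted keys and fixed separators mean logically equal payloads
    # always serialize, and therefore hash, identically.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

base = {"intent": "summarize_build_failure", "tool_versions": {"node": "22.22.1"}}
reordered = {"tool_versions": {"node": "22.22.1"}, "intent": "summarize_build_failure"}
bumped = {"intent": "summarize_build_failure", "tool_versions": {"node": "22.22.2"}}
```

`base` and `reordered` produce the same digest; `bumped` differs in a single patch version and misses on purpose.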
2. Store retrieval metadata, not just the answer
If you only store the final text, you lose the context needed to decide whether a hit is safe. A usable entry needs its own small audit record.
```json
{
  "fingerprint": "d9c8...",
  "embedding_key": "plan-debug-webpack-build",
  "task_class": "build_failure_summary",
  "repo": "negiadventures.github.io",
  "branch": "master",
  "created_at": "2026-05-08T11:40:00Z",
  "ttl_seconds": 21600,
  "tool_versions": {
    "node": "22.22.1",
    "eslint": "9.2.0"
  },
  "watched_paths": [
    "package.json",
    "package-lock.json",
    "webpack.config.js",
    "src/components/Header.tsx"
  ],
  "verification": {
    "required": true,
    "kind": "lightweight"
  },
  "response": {
    "summary": "ESM import mismatch caused the build failure",
    "actions": ["rename config to webpack.config.mjs", "update import path aliases"]
  }
}
```

The watched paths become your invalidation hooks. If one of them changes, the entry should miss even if the semantic similarity still looks high.
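The TTL and watched-path fields combine into one cheap liveness check. A minimal sketch, assuming the entry carries a hypothetical `created_at_epoch` field (a parsed version of `created_at`) and the caller has already computed the set of changed paths:

```python
def entry_is_live(entry: dict, changed_paths: set, now: float) -> bool:
    # Serve only if the TTL has not expired and no watched path has
    # changed since the entry was stored. Either failure is a miss.
    if now - entry["created_at_epoch"] > entry["ttl_seconds"]:
        return False
    return not (set(entry["watched_paths"]) & changed_paths)

entry = {
    "created_at_epoch": 0.0,
    "ttl_seconds": 21600,
    "watched_paths": ["package.json", "webpack.config.js"],
}
```

Note that the check never looks at similarity: a high-scoring candidate with a touched lockfile is still a miss.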
3. Separate low-risk reuse from high-risk reuse
- Low risk: summaries, explanations, stack-trace classification, command suggestions
- Medium risk: refactor plans, test recommendations, migration checklists
- High risk: concrete patches, deployment commands, schema-changing edits
Only low-risk items should be reusable with lightweight verification. Medium-risk items usually need a fresh repo snapshot check and a short verifier. High-risk items should often fall through to a fresh generation, even when the semantic match is strong.
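One way to keep that policy honest is to make the lane assignment explicit in code, with the strictest lane as the default. The task-class names and policy strings here are illustrative:

```python
RISK_LANES = {
    "stack_trace_summary": "low",
    "build_failure_summary": "low",
    "refactor_plan": "medium",
    "concrete_patch": "high",
}

def reuse_policy(task_class: str) -> str:
    # Unknown task classes default to the strictest lane: when in doubt,
    # regenerate rather than replay.
    lane = RISK_LANES.get(task_class, "high")
    return {
        "low": "serve_after_lightweight_verify",
        "medium": "serve_after_snapshot_check_and_verify",
        "high": "regenerate",
    }[lane]
```

A task class you forgot to register falls into the high-risk lane automatically, which is the failure direction you want.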
4. Verify the hit before serving it
The verifier can be small. You do not need a second full model pass every time. You need a cheap guard that checks whether the assumptions behind the cached answer still hold.
```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

type CacheEntry = {
  watchedPaths: string[];
  toolVersions: Record<string, string>;
  verification: { required: boolean; kind: "none" | "lightweight" | "strict" };
};

export async function verifyCacheHit(entry: CacheEntry): Promise<boolean> {
  if (!entry.verification.required) return true;

  // Any watched path that changed since the entry was stored invalidates it.
  const { stdout } = await execFileAsync("git", ["diff", "--name-only", "HEAD~1", "HEAD"]);
  const changed = new Set(stdout.split("\n").filter(Boolean));
  for (const path of entry.watchedPaths) {
    if (changed.has(path)) return false;
  }

  if (entry.verification.kind === "strict") {
    const checks = [
      execFileAsync("npm", ["run", "lint", "--", "--quiet"]),
      execFileAsync("npm", ["test", "--", "--runInBand", "affected"]),
    ];
    const results = await Promise.allSettled(checks);
    return results.every((result) => result.status === "fulfilled");
  }
  return true;
}
```

Cache safety usually comes from boring checks tied to real repository signals, not from clever embedding math.
5. Cache tool outputs with explicit invalidation keys
Tool-result caching is underrated. Repo maps, dependency graphs, and symbol indexes are often expensive to regenerate but deterministic enough to cache safely if you bind them to the right files.
A symbol-graph cache tied to tsconfig.json, package-lock.json, and the hashes of all src/**/*.ts files is much easier to reason about than a cached patch suggestion. Start there.
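Binding a tool result to its input files can be as simple as hashing their bytes into the key. A sketch, with an invented `tool_cache_key` helper:

```python
import hashlib
from pathlib import Path

def tool_cache_key(tool: str, inputs: list) -> str:
    # Fold the exact bytes of every input file into the key, so editing
    # any of them yields a new key and forces regeneration. Missing files
    # contribute a sentinel instead of being silently skipped.
    h = hashlib.sha256(tool.encode("utf-8"))
    for path in sorted(inputs):
        h.update(str(path).encode("utf-8"))
        h.update(path.read_bytes() if path.exists() else b"<missing>")
    return h.hexdigest()
```

There is no TTL or verifier here on purpose: for deterministic read-only tools, the input hashes are the whole invalidation story.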
Terminal view
```text
$ agent run "explain why lint fails on CI but not locally"
cache.lookup candidate_score=0.94 task_class=lint_failure_summary
cache.validate repo_snapshot=match tool_versions=match policy_lane=low-risk
cache.verify watched_paths_clean=true verifier=lightweight
cache.result HIT latency_ms=74 saved_model_call=true
```

What went wrong and the tradeoffs
Failure mode 1: cache hits survived dependency bumps
The request looked the same, but a minor dependency release changed build behavior. The cached explanation kept pointing engineers toward the old fix.
Failure mode 2: semantically similar does not mean operationally identical
Two TypeScript errors can look almost identical while coming from very different causes. If one lives in a generated file and the other in a hand-written adapter, the right next step may be completely different.
Failure mode 3: cached plans leaked across trust boundaries
A cache shared between repos or teams can quietly turn into a context leak. This is especially ugly if prompts contain internal architecture notes or file paths.
Failure mode 4: aggressive caching made humans trust the agent too much
Fast answers feel authoritative. Teams stop noticing that they are reading reused advice.
| Choice | Upside | Downside | My default |
|---|---|---|---|
| Long TTLs | Better hit rate | More stale answers | Short TTLs for code tasks |
| Broad semantic matching | More reuse | Wrong-task collisions | Narrow task-class matching |
| Shared cache across repos | Lower cost | Context leakage risk | No cross-repo reuse |
| Cache patch outputs | Fast edits | High correctness risk | Avoid unless boilerplate |
| No verifier | Lowest latency | Silent bad hits | Always verify anything non-trivial |
Practical checklist
- Start with read-only tasks like summaries, diagnostics, and repo maps
- Include repo snapshot data, tool versions, and policy lane in the fingerprint
- Scope cache entries to a single repo and trust boundary
- Add watched-path invalidation for lockfiles, config files, and affected sources
- Require a verifier for anything that changes engineering decisions
- Keep TTLs short for build, test, and migration related outputs
- Log cache provenance so operators can see why a hit was served
- Track stale-hit rejection rate, not just hit rate
- Avoid caching concrete patches unless the transform is tiny and deterministic
What I would and would not do
I would absolutely cache stack-trace summaries, repeated repo instructions, and deterministic tool outputs like symbol maps. Those are high-frequency and low-drama.
I would not cache migration patches, deployment commands, or broad refactor plans without a strict verifier. The speedup is real, but so is the blast radius when the cached answer is slightly out of date.
Conclusion
Semantic caching for coding agents works best when it behaves like a suspicious teammate. It should ask whether the task is really the same, whether the repo is still in the same shape, and whether the answer still deserves trust.
That makes the cache a little less magical, but much more useful.