Most tool-heavy agents fail in a boring way. They are not missing a better model. They are drowning the model in raw output from kubectl, gh, terraform, logs, JSON payloads, and shell transcripts that should never have gone back into the prompt unchanged.
That creates two problems at once. First, context fills up with low-signal noise. Second, tainted external text gets another chance to steer the agent. The result is slower runs, weaker decisions, and summaries that somehow hide the one line you actually needed.
A better pattern is a reduction layer between tool execution and model context. Instead of feeding raw output back into the loop, reduce it into typed fields, bounded excerpts, citations, and escalation signals. This post walks through a practical implementation that keeps the useful evidence without paying to re-prompt the entire transcript.
Why this matters
Tool output is one of the easiest ways to quietly ruin an otherwise decent agent system. It usually starts with convenience. A tool returns text, the orchestrator appends that text to the conversation, and things look fine until a few long runs later.
In production, this shows up as:
- token spend creeping up because every step carries previous step output
- important signals buried inside hundreds of lines of logs
- prompt-injection risk when untrusted content comes from web pages, CI logs, or tickets
- bad retries because the model reasons over stale or partial output instead of structured state
Direct references worth reading here: Model Context Protocol, OpenTelemetry, and jq.
Treat raw tool output as an artifact to store and cite, not the default thing to re-inject into the next model turn.
Architecture or workflow overview
Mermaid flow
```mermaid
flowchart TD
    A[Tool executes] --> B[Raw artifact store]
    A --> C[Reducer selector]
    C --> D[Schema-aware reducer]
    C --> E[Text reducer with taint labels]
    D --> F[Cited result packet]
    E --> F
    F --> G[Planner or executor model]
    B --> H[Human review or deep debug]
    F --> I[Escalation rule if confidence is low]
    I --> H
```
Numbered sequence
1. Run the tool and capture the full raw output as an immutable artifact.
2. Attach metadata such as tool name, trust lane, byte size, exit status, and source URL if relevant.
3. Choose a reducer based on output type, size, and safety lane.
4. Produce a bounded result packet with typed fields, summary bullets, citations, and truncation notes.
5. Feed the packet, not the raw blob, into the next model step.
6. Escalate to human review or artifact fetch when the packet loses too much confidence or precision.
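The sequence above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not a reference implementation: `ARTIFACTS`, `store_artifact`, `select_reducer`, `run_step`, and the two toy reducers are hypothetical names, an in-memory dict stands in for a real artifact store, and the selector looks only at output type (size and safety lane are left out for brevity).

```python
import hashlib
import json
from typing import Any, Callable

# Hypothetical in-memory stand-in for a real artifact store.
ARTIFACTS: dict[str, dict[str, Any]] = {}


def store_artifact(tool: str, raw: str, trust_lane: str, exit_status: int) -> str:
    """Steps 1-2: keep the raw output and its metadata under a content-derived ID."""
    data = raw.encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    artifact_id = f"toolrun_{digest[:12]}"
    ARTIFACTS[artifact_id] = {
        "tool": tool,
        "trust_lane": trust_lane,
        "bytes": len(data),
        "exit_status": exit_status,
        "sha256": digest,
        "raw": raw,
    }
    return artifact_id


def reduce_json(raw: str) -> dict[str, Any]:
    """Toy schema-aware reducer: report only the top-level keys."""
    payload = json.loads(raw)
    keys = sorted(payload) if isinstance(payload, dict) else []
    return {
        "summary": [f"JSON payload with {len(keys)} top-level keys"],
        "fields": {"keys": keys},
        "truncated": False,
    }


def reduce_text(raw: str, max_lines: int = 5) -> dict[str, Any]:
    """Toy text reducer: keep a bounded tail and admit when lines were dropped."""
    lines = raw.splitlines()
    return {
        "summary": [f"captured {len(lines)} lines"],
        "fields": {"tail": lines[-max_lines:]},
        "truncated": len(lines) > max_lines,
    }


def select_reducer(raw: str) -> Callable[[str], dict[str, Any]]:
    """Step 3: pick a reducer by output type (here: JSON if it parses, else text)."""
    try:
        json.loads(raw)
    except ValueError:
        return reduce_text
    return reduce_json


def run_step(tool: str, raw: str, trust_lane: str, exit_status: int = 0) -> dict[str, Any]:
    """Steps 4-5: build the bounded packet that the next model step receives."""
    artifact_id = store_artifact(tool, raw, trust_lane, exit_status)
    packet = select_reducer(raw)(raw)
    packet["artifact_id"] = artifact_id  # path back to the evidence
    packet["tainted"] = trust_lane != "internal"  # assumption: only "internal" is trusted
    return packet
```

Step 6, the escalation decision, is deliberately left out here; the point is only that the model-facing value is the packet returned by `run_step`, never `raw`.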
Implementation details
Store raw output, but do not pass it through by default
I like making raw output a first-class artifact with an ID. That keeps the evidence available for debugging without forcing every downstream model call to pay for it.
```json
{
  "artifact_id": "toolrun_01jw4x9m5k6r_logs",
  "tool": "kubectl.logs",
  "created_at": "2026-05-26T12:05:00Z",
  "content_type": "text/plain",
  "trust_lane": "external-runtime",
  "bytes": 184392,
  "sha256": "8c57d3...",
  "retention": "7d"
}
```

That metadata is already more useful than a raw transcript pasted back into chat. It lets policy decide whether the next step should see a reducer packet, a quoted excerpt, or nothing.
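To make the "packet, excerpt, or nothing" decision concrete, here is a small sketch of a policy function that reads only the artifact metadata. The lane name `quarantined`, the 2,000-byte threshold, and the function name `decide_visibility` are assumptions for illustration; a real policy would be driven by your own trust lanes and budgets.

```python
from typing import Literal

Visibility = Literal["packet", "excerpt", "nothing"]

# Hypothetical policy values, chosen only for illustration.
BLOCKED_LANES = {"quarantined"}
EXCERPT_MAX_BYTES = 2_000


def decide_visibility(meta: dict) -> Visibility:
    """Decide what the next model step may see, using metadata alone."""
    if meta["trust_lane"] in BLOCKED_LANES:
        return "nothing"
    # Small plain-text outputs are cheap enough to quote directly.
    if meta["content_type"] == "text/plain" and meta["bytes"] <= EXCERPT_MAX_BYTES:
        return "excerpt"
    # Everything else goes through a reducer first.
    return "packet"
```

With the artifact above (184,392 bytes from `external-runtime`), this policy would route the output through a reducer rather than quoting it.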
Build reducer contracts per tool family
Reducers should not be one generic summarizer prompt. They should be small contracts that know what matters for a given tool family.
```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ReducedPacket:
    summary: list[str]
    fields: dict[str, Any]
    citations: list[dict[str, Any]]
    tainted: bool
    truncated: bool
    confidence: float


def reduce_github_pr(payload: dict) -> ReducedPacket:
    checks = payload.get("statusCheckRollup", [])
    # Anything that is neither successful nor still pending (None) counts as failing.
    failing = [c for c in checks if c.get("conclusion") not in ("SUCCESS", None)]
    reviewers = [
        r["login"]
        for r in payload.get("latestReviews", [])
        if r.get("state") == "CHANGES_REQUESTED"
    ]
    return ReducedPacket(
        summary=[
            f"PR #{payload['number']} is {payload['state'].lower()}",
            f"{len(failing)} checks are failing",
            f"changes requested by: {', '.join(reviewers) if reviewers else 'none'}",
        ],
        fields={
            "pr_number": payload["number"],
            "title": payload["title"],
            "failing_checks": [c.get("name") for c in failing],
            "reviewers_requesting_changes": reviewers,
        },
        # Each citation points at the part of the payload that backs a claim.
        citations=[
            {"path": "statusCheckRollup", "kind": "json-pointer"},
            {"path": "latestReviews", "kind": "json-pointer"},
        ],
        tainted=False,
        truncated=False,
        confidence=0.97,
    )
```

The key point is that the model receiving this packet can reason over stable fields instead of rediscovering the same facts from a large payload every time.
Reduce unstructured text into bounded evidence, not vibes
For logs and shell output, I prefer a reducer that extracts the shape of the failure and then preserves only the lines needed to support it.
```python
import re

ERROR_PATTERNS = [
    re.compile(r"\bERROR\b"),
    re.compile(r"\bTraceback\b"),
    re.compile(r"\bpanic:\b"),
]


def reduce_text_output(text: str) -> dict:
    lines = text.splitlines()
    hits = []
    # idx is the 1-based line number, so the matching line sits at lines[idx - 1].
    for idx, line in enumerate(lines, start=1):
        if any(p.search(line) for p in ERROR_PATTERNS):
            # Two lines before and two lines after the match: a five-line window.
            window = lines[max(0, idx - 3): min(len(lines), idx + 2)]
            hits.append({
                "line": idx,
                "excerpt": window,
            })
    return {
        "summary": [
            f"captured {len(lines)} lines",
            f"found {len(hits)} error windows",
        ],
        "citations": hits[:5],  # cap the evidence that reaches the model
        "truncated": len(hits) > 5,
        "tainted": True,  # log text is untrusted by default
    }
```

A five-line cited window around the real error is far more valuable than replaying 2,000 lines of container logs into the next step.
Carry truncation and trust signals forward
Reducers should admit when they might be hiding something. If the packet is partial, say so explicitly.
```yaml
tool: web.fetch
packet:
  summary:
    - "Page contains a pricing comparison table for hosted vector databases"
    - "Three providers mentioned in the visible excerpt"
  fields:
    providers: ["Pinecone", "Weaviate", "pgvector"]
  citations:
    - kind: line-range
      start: 88
      end: 121
  tainted: true
  truncated: true
  confidence: 0.62
  escalation:
    recommended: true
    reason: "Long external page reduced to one excerpt window"
```

The right goal is not perfect compression. It is honest compression with a clear path back to the source artifact.
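One way to derive the `escalation` block from the other signals is a small rule function. The sketch below is an assumption about how such a rule could look, not the rule behind the packet above: the name `decide_escalation`, the 0.7 confidence floor, and the `high_risk_action` flag are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    recommended: bool
    reason: str


def decide_escalation(
    packet: dict,
    high_risk_action: bool = False,
    min_confidence: float = 0.7,  # hypothetical threshold
) -> Escalation:
    """Recommend an artifact fetch or human review when the packet is too lossy."""
    reasons = []
    confidence = packet.get("confidence", 0.0)
    if confidence < min_confidence:
        reasons.append(f"confidence {confidence:.2f} is below {min_confidence:.2f}")
    # Missing flags fail closed: an unlabeled packet is treated as truncated and tainted.
    if high_risk_action and (packet.get("truncated", True) or packet.get("tainted", True)):
        reasons.append("high-risk action requested on a truncated or tainted packet")
    return Escalation(recommended=bool(reasons), reason="; ".join(reasons))
```

Under these assumptions, the `web.fetch` packet escalates on confidence alone, while a clean, complete, high-confidence packet does not escalate even for a high-risk action.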
Example terminal-output visual
```text
$ agent-run inspect toolrun_01jw4x9m5k6r_logs
artifact: toolrun_01jw4x9m5k6r_logs
source: kubectl.logs
bytes: 184392
reducer: text-error-window-v2
packet size: 812 bytes
confidence: 0.74
tainted: true
truncated: true
escalation: artifact fetch required before auto-remediation
```
That terminal block is the behavior I want. The model gets a compact packet, while operators still have a clean path to the source evidence.
What went wrong, and the tradeoffs
Failure mode 1: the reducer removes the clue you needed
This is the obvious risk. If the reduction layer is too aggressive, the agent becomes fast and confidently wrong. What helps:
- attach citations to every important claim
- propagate a confidence score and truncation flag
- let reducers request artifact fetch when the packet is too lossy
Failure mode 2: reducers become prompt-injection laundromats
A reducer is not magically safe because it is shorter. If it paraphrases malicious tool output without taint labels, you still have untrusted text steering the agent, just in a tidier form. What helps:
- preserve trust-lane metadata from the original tool source
- mark summaries of untrusted content as tainted
- prevent tainted packets from directly triggering write actions without another policy gate
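The last bullet is the easiest to express as code. The sketch below assumes a hypothetical set of write-action names and a hypothetical `authorize` function; it shows one possible gate, where tainted evidence can inform a write action but cannot approve it without a separate human decision.

```python
# Hypothetical action names; substitute whatever your orchestrator calls its write tools.
WRITE_ACTIONS = {"kubectl.apply", "gh.pr.merge", "terraform.apply"}


def authorize(action: str, packets: list[dict], human_approved: bool = False) -> bool:
    """Allow read actions freely; gate write actions that rest on tainted evidence."""
    if action not in WRITE_ACTIONS:
        return True
    # A packet with no taint label is treated as tainted (fail closed).
    if any(p.get("tainted", True) for p in packets):
        return human_approved
    return True
```

The design choice worth noting is the fail-closed default: a reducer that forgets to set `tainted` loses the ability to trigger writes instead of silently gaining it.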
Failure mode 3: every tool gets the same summarizer
A generic summarizer tends to flatten meaning. CI logs, GitHub PR metadata, Terraform plans, and RAG traces each need different reduction rules. What helps:
- group reducers by tool family
- prefer typed extraction before free-text summarization
- log reducer version so you can evaluate regressions later
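A registry keyed by tool family is one simple way to get all three properties. In this sketch the `register` decorator, `reduce_for`, and the version label `plan-counts-v1` are hypothetical; the two reducers are deliberately tiny and only illustrate typed extraction (the Terraform one parses the `Plan: N to add, N to change, N to destroy.` summary line).

```python
import re
from typing import Callable

Reducer = Callable[[str], dict]

# family name -> (reducer version, reducer function)
REDUCERS: dict[str, tuple[str, Reducer]] = {}


def register(family: str, version: str) -> Callable[[Reducer], Reducer]:
    """Decorator that records a reducer and its version under a tool family."""
    def decorator(fn: Reducer) -> Reducer:
        REDUCERS[family] = (version, fn)
        return fn
    return decorator


@register("kubectl", "text-error-window-v2")
def reduce_kubectl(raw: str) -> dict:
    lines = raw.splitlines()
    errors = [i for i, line in enumerate(lines, start=1) if "ERROR" in line]
    return {
        "summary": [f"{len(errors)} error lines in {len(lines)} lines"],
        "fields": {"error_lines": errors},
    }


PLAN_LINE = re.compile(r"Plan: (\d+) to add, (\d+) to change, (\d+) to destroy")


@register("terraform", "plan-counts-v1")
def reduce_terraform(raw: str) -> dict:
    match = PLAN_LINE.search(raw)
    if match is None:
        return {"summary": ["no plan summary line found"], "fields": {}}
    add, change, destroy = (int(n) for n in match.groups())
    return {
        "summary": [f"plan: +{add} ~{change} -{destroy}"],
        "fields": {"add": add, "change": change, "destroy": destroy},
    }


def reduce_for(tool: str, raw: str) -> dict:
    """Dispatch on the tool family (the part before the first dot) and stamp the version."""
    family = tool.split(".", 1)[0]
    if family not in REDUCERS:
        # Fail loudly instead of silently falling back to a generic summarizer.
        raise KeyError(f"no reducer registered for tool family {family!r}")
    version, fn = REDUCERS[family]
    packet = fn(raw)
    packet["reducer"] = version
    return packet
```

Because every packet carries its `reducer` version, a regression in extraction quality can be traced to a specific reducer change later.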
| Pattern | Good at | Weak at | I would use it when |
|---|---|---|---|
| Raw output passthrough | Maximum fidelity | Token bloat, safety risk, low signal | Human-only debugging in short runs |
| Generic summarizer | Fast to ship | Drops structure, inconsistent quality | Temporary stopgap while contracts mature |
| Schema-aware reducers | Stable fields and low noise | More implementation effort | Core tools appear in many workflows |
| Artifact plus cited packet | Good balance of cost, safety, debugability | Needs storage and retrieval plumbing | Production agent systems with long runs |
Security and reliability concerns
Reducers are part of the trusted computing base. A bad reducer can hide a risky line as effectively as a bad model can miss it. Version them, test them, and keep examples of known bad outputs.
Raw artifacts may contain secrets, tokens, or user data. If you store them for later citation, storage policy matters just as much as prompt policy. Redaction, retention windows, and access controls need to exist before the artifact cache quietly becomes your most sensitive datastore.
- Do not let reducers emit unsupported conclusions without a citation.
- Do not mix trusted internal outputs and tainted external content in one unlabeled packet.
- Do not hide truncation just because the short answer looked plausible.
- Do not assume JSON equals safe. Structured payloads can still carry hostile instructions in string fields.
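The first three rules can be checked mechanically before a packet reaches the model. The sketch below is a hypothetical lint pass: the function name `validate_packet` and the per-citation `trust_lane` key are assumptions, and it does not attempt the fourth rule, since detecting hostile instructions inside string fields needs more than a shape check.

```python
from typing import Any


def validate_packet(packet: dict[str, Any]) -> list[str]:
    """Return a list of policy violations; an empty list means the packet passes."""
    problems = []
    # Rule 1: summary claims need at least one citation behind them.
    if packet.get("summary") and not packet.get("citations"):
        problems.append("summary claims without citations")
    # Rules 2 and 3: taint and truncation must be stated, never implied.
    if "tainted" not in packet:
        problems.append("missing taint label")
    if "truncated" not in packet:
        problems.append("missing truncation flag")
    # Rule 2: evidence from several trust lanes must not ride in an untainted packet.
    lanes = {c.get("trust_lane") for c in packet.get("citations", [])}
    if len(lanes) > 1 and not packet.get("tainted", False):
        problems.append("mixed trust lanes in a packet not marked tainted")
    return problems
```

A check like this is cheap to run on every reducer output and gives reducer tests something concrete to assert against.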
Practical checklist
What I would do again
- Store raw tool output as an artifact with an ID and retention policy.
- Feed reduced packets to the next model step by default.
- Prefer typed field extraction before prose summarization.
- Preserve citations, truncation flags, and taint labels.
- Escalate to artifact fetch when confidence is low or the action is high risk.
- Version reducers and test them against known bad examples.
What I would not do
- I would not pipe full logs back into the planner just because it is easy.
- I would not use one generic summarizer prompt for every tool in the system.
- I would not allow tainted packets to authorize write operations on their own.
- I would not retain raw artifacts indefinitely without redaction and access controls.
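For the redaction point, a minimal sketch of a pass that runs before an artifact is written to storage might look like this. The pattern list is illustrative and far from exhaustive, and the names `REDACTION_PATTERNS` and `redact` are hypothetical; a real deployment would more likely lean on a dedicated secret scanner.

```python
import re

# Illustrative patterns only; real secret detection needs a much broader set.
REDACTION_PATTERNS = [
    ("github_token", re.compile(r"ghp_[A-Za-z0-9]{36}")),
    ("aws_access_key_id", re.compile(r"AKIA[0-9A-Z]{16}")),
    # Match only the credential that follows "Bearer ", keeping the prefix readable.
    ("bearer_token", re.compile(r"(?<=bearer )[A-Za-z0-9._~+/=-]+", re.IGNORECASE)),
]


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace known secret shapes with labeled placeholders and count the hits."""
    counts: dict[str, int] = {}
    for label, pattern in REDACTION_PATTERNS:
        text, n = pattern.subn(f"[REDACTED:{label}]", text)
        if n:
            counts[label] = n
    return text, counts
```

Returning the hit counts alongside the cleaned text lets the artifact metadata record that redaction happened, which is useful when someone later wonders why an excerpt looks incomplete.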
Conclusion
Tool-heavy agents do better when raw output stops being the default context format. Store the evidence, reduce it into typed packets, and make the lossiness visible.
That one architectural move usually improves three things at once: cost, reviewability, and safety. It is not flashy, but it is one of the highest-leverage habits in practical agent engineering.