Local coding stacks feel great right up until they sit idle for ten minutes. Then the next agent call hits a half-awake model server, spends a minute paging weights back into memory, and times out before the first token arrives.
That failure mode looks random from the agent side, but it usually is not. It is a traffic-shaping problem mixed with memory pressure, weak health checks, and a gateway that assumes every request is equally urgent.
This post covers a setup I would actually run for a small team: one warm lane, one burst lane, a strict queue budget, and a cheap fallback path. The goal is not maximum benchmark throughput. It is predictable first-token latency for real coding workflows.
Why this matters
Cold starts are more expensive in coding workflows than in chat demos. Agents usually invoke the model after a tool run, retries can duplicate expensive work, and one stalled request can block every other session waiting behind it.
- coding agents have already spent time on tool runs before they ask for tokens
- local boxes have uneven headroom once editors, browsers, and indexes are open
- timeouts create misleading failure reports when the real issue is serving latency
Architecture overview
```mermaid
flowchart LR
    A[Agent request] --> B[Gateway]
    B --> C{Hot model available?}
    C -- yes --> D[Warm lane]
    C -- no --> E{Queue budget left?}
    E -- yes --> F[Wake or load model]
    F --> D
    E -- no --> G[Fallback lane or fast failure]
    D --> H[Stream tokens]
    H --> I[Latency + memory metrics]
    I --> B
```

A good local gateway needs admission control, a warm-pool policy, first-token health probes, and a fallback route that preserves reliability when the machine is under pressure.
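The decision path in the flowchart is small enough to write down as a pure function. This is a minimal sketch, not the gateway's real code: `Decision`, `LaneSnapshot`, and `admit` are names I made up for illustration, and I am assuming the gateway already knows whether the lane is hot and how many requests are waiting.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    WARM_LANE = "warm_lane"    # model is hot, serve immediately
    WAKE_MODEL = "wake_model"  # model is cold, but the queue budget allows a load
    FALLBACK = "fallback"      # budget exhausted, reroute to the fallback lane
    FAIL_FAST = "fail_fast"    # budget exhausted and no fallback configured


@dataclass
class LaneSnapshot:
    hot: bool           # did the lane pass its last readiness check?
    queued: int         # requests currently waiting on this lane
    max_queue: int      # queue budget for the lane
    has_fallback: bool  # is a fallback lane configured?


def admit(lane: LaneSnapshot) -> Decision:
    """Mirror the flowchart: hot lane first, then queue budget, then fallback."""
    if lane.hot:
        return Decision.WARM_LANE
    if lane.queued < lane.max_queue:
        return Decision.WAKE_MODEL
    return Decision.FALLBACK if lane.has_fallback else Decision.FAIL_FAST
```

Keeping this as a pure function makes the policy easy to unit test without a model server running.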
Implementation details
Put a real gateway in front of the runtime
The runtime should serve tokens. The gateway should decide whether a request can enter, wait, or reroute.
```yaml
# gateway-policy.yaml
models:
  deep-coder:
    backend: ollama
    model: qwen2.5-coder:32b
    keepWarm: true
    maxConcurrent: 2
    maxQueue: 6
    coldStartBudgetMs: 25000
    fallback: coder-small
  coder-small:
    backend: llama.cpp
    model: /models/qwen2.5-coder-7b-q4_k_m.gguf
    keepWarm: true
    maxConcurrent: 4
    maxQueue: 12
    coldStartBudgetMs: 4000
routing:
  - match: { lane: "high-context" }
    target: deep-coder
  - match: { lane: "fast-edit" }
    target: coder-small
```

The gateway should own queue budgets and fallback policy. If every caller talks directly to the runtime, each caller discovers overload too late.
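Here is one way a lane could enforce `maxConcurrent` and `maxQueue` in process. Treat it as a sketch under assumptions: the `Lane` class and `QueueBudgetExceeded` exception are hypothetical names, and I am assuming an asyncio-based gateway where each request is awaited as a coroutine.

```python
import asyncio


class QueueBudgetExceeded(Exception):
    """Raised when a lane already holds max_concurrent + max_queue requests."""


class Lane:
    def __init__(self, max_concurrent: int, max_queue: int):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0  # requests running or waiting

    async def run(self, handler):
        # Reject immediately instead of letting the caller wait into a timeout.
        if self._admitted >= self._capacity:
            raise QueueBudgetExceeded()
        self._admitted += 1
        try:
            async with self._slots:  # waits here while the lane is saturated
                return await handler()
        finally:
            self._admitted -= 1


async def demo() -> list:
    lane = Lane(max_concurrent=1, max_queue=1)

    async def slow_call():
        await asyncio.sleep(0.05)  # stand-in for a model call
        return "ok"

    # Three simultaneous callers: one runs, one waits, one is rejected.
    results = await asyncio.gather(
        *(lane.run(slow_call) for _ in range(3)), return_exceptions=True
    )
    return ["rejected" if isinstance(r, QueueBudgetExceeded) else r for r in results]
```

The rejected caller is the one the gateway would hand to the `fallback` target instead of leaving it to time out.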
Probe for first-token readiness
```python
import time

import requests


def probe_first_token(base_url: str, model: str, timeout_s: float = 8.0) -> dict:
    """Ask the runtime for a single token and time the full round trip."""
    started = time.perf_counter()
    response = requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": "ping", "stream": False, "options": {"num_predict": 1}},
        timeout=timeout_s,
    )
    # A timeout or HTTP error raises here; callers should treat that as a cold lane.
    response.raise_for_status()
    elapsed_ms = round((time.perf_counter() - started) * 1000)
    return {"ok": True, "latencyMs": elapsed_ms, "model": model}
```

If a probe takes 18 seconds to emit one token, the lane is cold for interactive work. Mark it that way and stop accepting latency-sensitive traffic.
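Marking a lane cold can be as simple as comparing probe latency against a cutoff. The sketch below assumes a hypothetical `LaneHealth` tracker and a 2000 ms threshold that I picked for illustration; tune the number against your own first-token measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LaneHealth:
    warm_threshold_ms: int = 2000  # hypothetical cutoff for "interactive"
    state: str = "warm"

    def record(self, ok: bool, latency_ms: Optional[int] = None) -> str:
        """Update the lane state from one probe result and return it."""
        fast = ok and latency_ms is not None and latency_ms <= self.warm_threshold_ms
        self.state = "warm" if fast else "cold"
        return self.state

    def accepts(self, latency_sensitive: bool) -> bool:
        # A cold lane can still take work that tolerates a slow first token.
        return self.state == "warm" or not latency_sensitive
```

A failed or timed-out probe is recorded as `record(ok=False)`, which flips the lane to cold the same way a slow probe does.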
Serialize model wake-ups
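```python
from asyncio import Lock

# One lock per model, so concurrent requests trigger a single load.
wake_locks: dict[str, Lock] = {}


async def ensure_model_ready(model_key: str, load_fn):
    # current_state() is provided elsewhere by the gateway; it reports
    # whether the model is loaded and ready to serve.
    lock = wake_locks.setdefault(model_key, Lock())
    async with lock:
        state = await current_state(model_key)
        if state.ready:
            # Another caller finished the load while we waited on the lock.
            return state
        await load_fn(model_key)
        return await current_state(model_key)
```

```
$ gateway status
MODEL        STATE  FIRST TOKEN  QUEUE  MEM
qwen32b      warm   1180ms       2/6    31.8G
qwen7b       warm   420ms        1/12   8.6G
embed-small  warm   90ms         0/20   1.4G
policy: deep-coder fallback -> coder-small after 25000ms cold-start budget
```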
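To check that the per-model lock in `ensure_model_ready` really collapses a burst into one load, I would run something like the following. The in-memory `ready` set, `load_calls` list, and `fake_load` are stand-ins I invented for the demo; a real gateway would query the runtime instead.

```python
import asyncio
from types import SimpleNamespace

wake_locks: dict[str, asyncio.Lock] = {}
ready: set[str] = set()     # stand-in for real runtime state
load_calls: list[str] = []  # records every actual load


async def current_state(model_key: str):
    return SimpleNamespace(ready=model_key in ready)


async def fake_load(model_key: str):
    load_calls.append(model_key)
    await asyncio.sleep(0.05)  # pretend to page weights into memory
    ready.add(model_key)


async def ensure_model_ready(model_key: str, load_fn):
    lock = wake_locks.setdefault(model_key, asyncio.Lock())
    async with lock:
        state = await current_state(model_key)
        if state.ready:
            return state
        await load_fn(model_key)
        return await current_state(model_key)


async def demo() -> int:
    # Five agents hit a cold model at the same moment.
    states = await asyncio.gather(
        *(ensure_model_ready("deep-coder", fake_load) for _ in range(5))
    )
    assert all(s.ready for s in states)
    return len(load_calls)
```

Without the lock, all five callers would see a cold model and each would start its own load.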
Use memory-aware routing instead of a fixed favorite
| Routing choice | Good for | Risk | What I would do |
|---|---|---|---|
| Always hit the biggest coder | Best quality when idle | Terrible burst behavior | Avoid as the default |
| Always hit the smallest coder | Fast replies | Lower patch quality on harder edits | Use only for trivial lanes |
| Memory-aware primary with fallback | Mixed workloads | More routing complexity | Best overall tradeoff |
| Remote failover after queue budget | Keeps SLOs intact | Costs money and adds trust concerns | Worth it for team use |
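The memory-aware row is the one I would implement first. A rough sketch, assuming the gateway can read free memory from the host and knows each model's resident size: `ModelSpec`, `pick_target`, and the 4 GB reserve are illustrative names and numbers, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str
    resident_gb: float  # memory the model needs once loaded
    loaded: bool        # already resident, so no extra memory is needed
    fallback: Optional[str] = None


def pick_target(primary: str, models: dict, free_gb: float, reserve_gb: float = 4.0) -> Optional[str]:
    """Walk the fallback chain and return the first model that fits, or None."""
    seen = set()
    name = primary
    while name is not None and name not in seen:
        seen.add(name)  # guards against a fallback cycle in the config
        spec = models[name]
        if spec.loaded or free_gb - spec.resident_gb >= reserve_gb:
            return spec.name
        name = spec.fallback
    return None
```

Returning `None` is the signal to fail fast or, if you have one, hand off to a remote lane.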
What went wrong and tradeoffs
If you add a remote fallback, treat it as a separate trust lane. Do not forward raw repo context, secrets, or tool output just because the local box is busy.
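One way to enforce that boundary is an allowlist on the outbound payload. This is a sketch with made-up field names (`task`, `instruction`, `snippet`); the point is that anything not explicitly allowed stays on the local box.

```python
REMOTE_ALLOWED_FIELDS = {"task", "instruction", "snippet"}  # hypothetical allowlist


def build_remote_payload(request: dict, max_snippet_chars: int = 2000) -> dict:
    """Copy only allowlisted fields; everything else is dropped before forwarding."""
    payload = {k: v for k, v in request.items() if k in REMOTE_ALLOWED_FIELDS}
    snippet = payload.get("snippet")
    if isinstance(snippet, str):
        # Cap the snippet so a whole file cannot ride along by accident.
        payload["snippet"] = snippet[:max_snippet_chars]
    return payload
```

An allowlist does not catch a secret pasted inside the snippet itself, so it complements a secret scanner rather than replacing one.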
On small shared hosts, the difference between cold and warm first-token latency is often much larger than the quality difference between neighboring quantization levels. That is why cold-start policy deserves first-class treatment.
Practical checklist
- [ ] Keep exactly one high-value coding model warm during active hours
- [ ] Measure first-token readiness, not only port health
- [ ] Serialize wake-ups per model
- [ ] Set a queue budget per lane and fail fast after it fills
- [ ] Reserve a smaller fallback model for short or low-risk tasks
- [ ] Expose queue depth, first-token latency, and memory headroom in status output
- [ ] Treat remote fallback as a separate trust boundary
Conclusion
Local coding models do not fail only because the model is weak. They fail because the serving path has no opinion about cold starts, queues, or memory pressure. Add those opinions at the gateway layer and the whole stack feels much more reliable.