Cold-Start Control for Local Coding Model Gateways

Local coding stacks feel great right up until they sit idle for ten minutes. Then the next agent call hits a half-awake model server, spends a minute paging weights back into memory, and times out before the first token arrives.

That failure mode looks random from the agent side, but it usually is not. It is a traffic-shaping problem mixed with memory pressure, weak health checks, and a gateway that assumes every request is equally urgent.

This post covers a setup I would actually run for a small team: one warm lane, one burst lane, a strict queue budget, and a cheap fallback path. The goal is not maximum benchmark throughput. It is predictable first-token latency for real coding workflows.

Why this matters

Cold starts are more expensive in coding workflows than in chat demos. Agents usually invoke the model after a tool run, retries can duplicate expensive work, and one stalled request can block every other session waiting behind it.

Coding agents have already spent time on tool runs before they ask for tokens.
Local boxes have uneven headroom once editors, browsers, and indexes are open.
Timeouts create misleading failure reports when the real issue is serving latency.

Architecture overview

flowchart LR
    A[Agent request] --> B[Gateway]
    B --> C{Hot model available?}
    C -- yes --> D[Warm lane]
    C -- no --> E{Queue budget left?}
    E -- yes --> F[Wake or load model]
    F --> D
    E -- no --> G[Fallback lane or fast failure]
    D --> H[Stream tokens]
    H --> I[Latency + memory metrics]
    I --> B

A good local gateway needs admission control, a warm-pool policy, first-token health probes, and a fallback route that preserves reliability when the machine is under pressure.

Implementation details

Put a real gateway in front of the runtime

The runtime should serve tokens. The gateway should decide whether a request can enter, wait, or reroute.

# gateway-policy.yaml
models:
  deep-coder:
    backend: ollama
    model: qwen2.5-coder:32b
    keepWarm: true
    maxConcurrent: 2
    maxQueue: 6
    coldStartBudgetMs: 25000
    fallback: coder-small
  coder-small:
    backend: llama.cpp
    model: /models/qwen2.5-coder-7b-q4_k_m.gguf
    keepWarm: true
    maxConcurrent: 4
    maxQueue: 12
    coldStartBudgetMs: 4000

routing:
  - match: { lane: "high-context" }
    target: deep-coder
  - match: { lane: "fast-edit" }
    target: coder-small

The gateway should own queue budgets and fallback policy. If every caller talks directly to the runtime, each caller discovers overload too late.
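The enter, wait, or reroute decision can be sketched as a small pure function. This is a minimal illustration, not the gateway's real code: `LaneState`, `LanePolicy`, `Decision`, and the `est_cold_start_ms` estimate are hypothetical names, and the policy fields simply mirror `maxConcurrent`, `maxQueue`, `coldStartBudgetMs`, and `fallback` from the YAML above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ADMIT = "admit"        # run now on the target lane
    QUEUE = "queue"        # wait for a slot (or for the model to load)
    FALLBACK = "fallback"  # reroute to the configured fallback model
    REJECT = "reject"      # fail fast so the caller sees overload immediately


@dataclass
class LaneState:
    ready: bool             # model is loaded and answering probes
    in_flight: int          # requests currently generating
    queued: int             # requests waiting for a slot
    est_cold_start_ms: int  # assumed gateway-side estimate of load time


@dataclass
class LanePolicy:
    max_concurrent: int
    max_queue: int
    cold_start_budget_ms: int
    fallback: Optional[str] = None


def admit(state: LaneState, policy: LanePolicy) -> Decision:
    # Where overflow goes depends on whether a fallback lane is configured.
    overflow = Decision.FALLBACK if policy.fallback else Decision.REJECT
    # A cold lane is only worth waiting for if the load fits the budget.
    if not state.ready and state.est_cold_start_ms > policy.cold_start_budget_ms:
        return overflow
    if state.ready and state.in_flight < policy.max_concurrent:
        return Decision.ADMIT
    if state.queued < policy.max_queue:
        return Decision.QUEUE
    return overflow
```

Keeping the decision free of I/O makes it easy to test against recorded lane states before wiring it into a real request path.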

Probe for first-token readiness

import time
import requests


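def probe_first_token(base_url: str, model: str, timeout_s: float = 8.0) -> dict:
    """Time a one-token generation against Ollama's /api/generate endpoint.

    Raises requests.RequestException if the server errors or the request times out.
    """
    started = time.perf_counter()
    response = requests.post(
        f"{base_url}/api/generate",
        # num_predict=1 caps output at one token, so the non-streaming round
        # trip approximates time to first token, including any model load.
        json={"model": model, "prompt": "ping", "stream": False, "options": {"num_predict": 1}},
        timeout=timeout_s,
    )
    response.raise_for_status()
    elapsed_ms = round((time.perf_counter() - started) * 1000)
    return {"ok": True, "latencyMs": elapsed_ms, "model": model}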

If a probe takes 18 seconds to emit one token, or times out before it gets one, the lane is cold for interactive work. Mark it that way and stop accepting latency-sensitive traffic.
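Marking a lane from probe latency could look like the sketch below. The state names and both thresholds are assumptions chosen for illustration, not values from any runtime; tune them to what your agents' timeouts can tolerate.

```python
from typing import Optional


def classify_lane(
    latency_ms: Optional[int],
    warm_ms: int = 2000,    # assumed ceiling for interactive first-token latency
    cold_ms: int = 10000,   # assumed point where the lane counts as cold
) -> str:
    """Map a probe result to a lane state. None means the probe failed or timed out."""
    if latency_ms is None:
        return "down"
    if latency_ms <= warm_ms:
        return "warm"
    if latency_ms < cold_ms:
        return "slow"
    return "cold"


def accepts_interactive(state: str) -> bool:
    # Only warm lanes should receive latency-sensitive traffic.
    return state == "warm"
```

With these defaults, an 18-second probe classifies as cold and a 1180 ms probe as warm.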

Serialize model wake-ups

from asyncio import Lock

wake_locks: dict[str, Lock] = {}

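async def ensure_model_ready(model_key: str, load_fn):
    # current_state() is assumed to be the gateway's own coroutine that reports
    # whether a model is loaded; it is not defined in this snippet.
    lock = wake_locks.setdefault(model_key, Lock())
    async with lock:
        # Re-check under the lock: an earlier waiter may have finished the load.
        state = await current_state(model_key)
        if state.ready:
            return state
        await load_fn(model_key)
        return await current_state(model_key)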
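The per-model lock exists so that several requests arriving for the same cold model trigger one load, not several. The self-contained sketch below demonstrates that with stand-ins: the `ready` set and `fake_load` are hypothetical replacements for real runtime state and a real load call.

```python
import asyncio

ready: set[str] = set()                  # stand-in for real runtime state
wake_locks: dict[str, asyncio.Lock] = {}
load_calls = 0                           # counts how many loads actually ran


async def fake_load(model_key: str) -> None:
    global load_calls
    load_calls += 1
    await asyncio.sleep(0.05)            # pretend to page weights into memory
    ready.add(model_key)


async def ensure_model_ready(model_key: str, load_fn) -> bool:
    lock = wake_locks.setdefault(model_key, asyncio.Lock())
    async with lock:
        # Re-check under the lock: an earlier waiter may have loaded the model.
        if model_key not in ready:
            await load_fn(model_key)
        return model_key in ready


async def main() -> int:
    # Five concurrent callers ask for the same cold model.
    await asyncio.gather(
        *(ensure_model_ready("deep-coder", fake_load) for _ in range(5))
    )
    return load_calls


calls = asyncio.run(main())
print(f"load calls: {calls}")
```

Without the lock, each of the five callers would see a cold model and start its own load, which is exactly the memory spike the gateway is supposed to prevent.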

$ gateway status
MODEL         STATE   FIRST TOKEN   QUEUE   MEM
qwen32b       warm    1180ms        2/6     31.8G
qwen7b        warm    420ms         1/12    8.6G
embed-small   warm    90ms          0/20    1.4G

policy: deep-coder fallback -> coder-small after 25000ms cold-start budget

Use memory-aware routing instead of a fixed favorite

Routing choice	Good for	Risk	What I would do
Always hit the biggest coder	Best quality when idle	Terrible burst behavior	Avoid as the default
Always hit the smallest coder	Fast replies	Lower patch quality on harder edits	Use only for trivial lanes
Memory-aware primary with fallback	Mixed workloads	More routing complexity	Best overall tradeoff
Remote failover after queue budget	Keeps SLOs intact	Costs money and adds trust concerns	Worth it for team use
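A memory-aware primary with fallback can be sketched as a walk down a preference list. Everything here is illustrative: `pick_target`, the tuple layout, and the 4 GB reserve are assumptions, and `free_gb` is whatever your host monitoring reports. The memory figures reuse the numbers from the status output above.

```python
from typing import Optional


def pick_target(
    candidates: list[tuple[str, bool, float]],
    free_gb: float,
    reserve_gb: float = 4.0,   # assumed headroom kept for editors, builds, browsers
) -> Optional[str]:
    """Return the first usable model, or None if nothing fits.

    candidates: (name, already_resident, required_gb) in preference order.
    """
    for name, resident, required_gb in candidates:
        if resident:
            return name        # already loaded, so routing costs no new memory
        if free_gb - required_gb >= reserve_gb:
            return name        # loading it still leaves the reserve intact
    return None


lanes = [("deep-coder", False, 31.8), ("coder-small", False, 8.6)]
print(pick_target(lanes, free_gb=48.0))
print(pick_target(lanes, free_gb=20.0))
```

A `None` result is the signal to fail fast or, if the team has opted in, to hand off to a remote lane.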

Pitfalls and tradeoffs

Pitfall: keep-alive can hide bad capacity planning. If a single resident model leaves almost no breathing room, your next build or browser burst can destabilize the whole box.

Pitfall: burst queues feel safe until they become silent latency. Expose queue depth and estimated wait time, or users will assume the system is broken.
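A rough wait estimate is enough to make the queue visible. The sketch below assumes every slot is busy and that requests take about `avg_service_ms` each, so it leans pessimistic; the function name and the averaging approach are assumptions, not part of any gateway API.

```python
import math


def estimated_wait_ms(queued_ahead: int, max_concurrent: int, avg_service_ms: float) -> int:
    """Estimate how long a new arrival waits before it starts generating."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    # The queue drains in waves of max_concurrent requests. A new arrival
    # sits at position queued_ahead + 1 and starts once the waves ahead finish.
    waves = math.ceil((queued_ahead + 1) / max_concurrent)
    return round(waves * avg_service_ms)


# Two requests already queued on a lane with two slots, about 8 s per request.
print(estimated_wait_ms(queued_ahead=2, max_concurrent=2, avg_service_ms=8000))
```

Returning this number alongside queue depth lets a caller choose between waiting and rerouting, so it does not have to guess.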

If you add a remote fallback, treat it as a separate trust lane. Do not forward raw repo context, secrets, or tool output just because the local box is busy.
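One way to enforce that boundary is an explicit opt-in plus a field allowlist, sketched below. The `remote_ok` flag and the field names are hypothetical, and this only strips fields: it does not scan prompt text for secrets, so callers still need to decide what goes into the prompt.

```python
from typing import Optional

# Hypothetical request fields that are allowed to leave the machine.
REMOTE_ALLOWED_FIELDS = frozenset({"model", "prompt", "max_tokens"})


def remote_payload(request: dict) -> Optional[dict]:
    """Build the payload for a remote lane, or None if the request must stay local."""
    # Require an explicit opt-in; a busy local box is not consent.
    if request.get("remote_ok") is not True:
        return None
    # Drop everything not on the allowlist (repo context, tool output, and so on).
    return {k: v for k, v in request.items() if k in REMOTE_ALLOWED_FIELDS}


request = {
    "model": "coder-small",
    "prompt": "Rename this variable.",
    "repo_context": "<file contents>",
    "tool_output": "<test logs>",
    "remote_ok": True,
}
print(remote_payload(request))
```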

On small shared hosts, the gap between cold and warm first-token latency usually affects users far more than the quality gap between neighboring quantization levels. That is why cold-start policy deserves first-class treatment.

Practical checklist

[ ] Keep exactly one high-value coding model warm during active hours
[ ] Measure first-token readiness, not only port health
[ ] Serialize wake-ups per model
[ ] Set a queue budget per lane and fail fast after it fills
[ ] Reserve a smaller fallback model for short or low-risk tasks
[ ] Expose queue depth, first-token latency, and memory headroom in status output
[ ] Treat remote fallback as a separate trust boundary

Conclusion

Local coding models do not fail only because the model is weak. They fail because the serving path has no opinion about cold starts, queues, or memory pressure. Add those opinions at the gateway layer and the whole stack feels much more reliable.
