If you want Claude Code to use a local model, the hard part is not finding a model name. The hard part is choosing the right model shape, getting the serving stack right, and avoiding the performance traps that make open models feel much slower than they should.
This guide walks through a practical setup for running Claude Code with Unsloth/Qwen3.5-35B-A3B on a Mac mini M4 Pro with 48GB of unified memory. The goal is not just to get it technically running. The goal is to get a setup that is usable for real coding sessions, especially agentic flows where latency compounds across many tool calls.
The short version is that Qwen3.5-35B-A3B is often a better local fit than the dense Qwen3.5-35B for this kind of workflow because it is designed for more efficient inference while still retaining strong coding ability. On a machine like an M4 Pro Mac mini, that difference matters.
Who this guide is for
- a local fallback or primary model for Claude Code
- better privacy and local control than a hosted-only setup
- a step-by-step macOS workflow they can copy without guessing
- a coding model that is practical for repeated agentic turns, not just one-off completions
First fix the slowdown issue
Before anything else, See Fixing 90% slower inference in Claude Code first to fix open models being 90% slower due to KV Cache invalidation.
This matters because many people judge an open-model setup before fixing the most obvious performance trap. If Claude Code is invalidating the KV cache too aggressively, the model can feel dramatically slower than it should, even if the model and hardware are otherwise fine.
- understand the KV cache issue
- apply the fix or recommended configuration from the Unsloth guidance
- then judge throughput and usability
- Unsloth Claude Code basics
- If you are following Unsloth's companion notes on KV cache invalidation, do that first before benchmarking the final setup.
Why Qwen3.5-35B-A3B is a good fit for agentic coding
A dense 35B model can still be very capable, but for local agentic setups you usually care about more than raw benchmark quality. You care about latency, sustained throughput across many requests, memory pressure, and whether the model remains pleasant to use when the tool loop gets long.
That is where Qwen3.5-35B-A3B gets interesting. The A3B version is optimized around a more efficient mixture-of-experts style architecture, which means you can often get better speed behavior than a dense 35B model while still keeping strong coding performance.
Quick comparison
| Model | Why you might choose it | Main downside | Best fit |
|---|---|---|---|
| Dense Qwen3.5-35B | Strong all-around quality | Heavier, slower local feel on repeated turns | Higher-latency local use or stronger hardware |
| Unsloth/Qwen3.5-35B-A3B | Better efficiency for local inference, practical for agent loops | Still not as strong as top hosted frontier models in every scenario | Local Claude Code workflows and iterative coding |
| Smaller Qwen variants | Faster response and easier memory fit | Lower ceiling on difficult coding tasks | Lightweight local assistance |
The point is not that the dense model is bad. The point is that for Claude Code-style agentic usage, where you care about many repeated calls and not just a single completion, the efficient variant is often the more practical choice.
Hardware target: Mac mini M4 Pro with 48GB
This is a strong machine for local development because the unified memory pool makes larger local model setups much more realistic than they would be on many laptops. You still have to be careful with context size, quantization choice, and server flags, but the machine is a very good fit for a serious local coding workflow.
- stable model loading
- reasonable context size
- good cache behavior
- acceptable parallelism
- avoiding pathological slowdowns
Architecture overview
Claude Code
-> model alias or provider shim
-> local llama.cpp server
-> Unsloth Qwen3.5-35B-A3B model
-> Mac mini M4 Pro (48GB unified memory)In practice, Claude Code talks to a model alias, that alias points to a local server, and the local server hosts the Unsloth model with the serving flags that keep latency under control.
Step 1: install prerequisites on macOS
xcode-select --install
brew install git cmake python nodeIf you already have Homebrew and the developer tools, this should be quick. The point is just to make sure you have the build and runtime dependencies needed for local serving and command-line tooling.
Step 2: install or build llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -jIf you want the most predictable local serving path on macOS, llama.cpp is a good default. If you already have a working build, you can skip rebuilding it.
Step 3: get the Unsloth model
Follow the Unsloth guidance for the exact model artifact you want. For a local Claude Code workflow on Apple Silicon, an Unsloth-flavored Qwen3.5-35B-A3B GGUF is the practical target here.
mkdir -p ~/models/qwen
# place the model file here once downloaded
ls ~/models/qwenOne practical filename pattern looks like Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf. The exact name may vary depending on the quantization you choose.
Step 4: launch the local inference server
cd ~/ai/llama.cpp
./llama-server \
--model ~/models/qwen/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
--alias unsloth/Qwen3.5-35B-A3B \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--port 8081 \
--parallel 4 \
--kv-unified \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--ctx-size 131072Why these flags matter: the alias gives Claude Code a stable target, parallelism helps repeated tool loops, the cache flags matter for performance consistency, flash attention improves throughput, and the context size should be tuned based on actual responsiveness rather than vanity.
Step 5: create a simple Claude Code alias
A simple way to test the local model with Claude Code is to create a shell alias that pins the model name directly. Add this to your shell config, such as ~/.zshrc:
alias claudecli='claude --model unsloth/Qwen3.5-35B-A3B'Then reload your shell:
source ~/.zshrcIf you are intentionally testing a permissive local workflow, you can create a second alias for that case only:
alias claudecli_unsafe='claude --model unsloth/Qwen3.5-35B-A3B --dangerously-skip-permissions'I would keep the safer alias as the default and only use the permissive one for deliberate local experiments.
Step 6: verify the local server is healthy
Reload your shell so the new alias is actually available in the terminal you are using.
curl http://127.0.0.1:8081/health || true
curl http://127.0.0.1:8081/models || trueIf `claudecli` is not available after this step, stop and fix that before going further.
Step 7: run Claude Code against the local model
Before touching Claude Code, make sure the server itself is alive:
cd ~/path/to/test-repo
claudecliYou want to confirm the server is reachable and that the model alias exists before debugging Claude Code itself.
Summarize the repo structure and suggest one safe small improvement.Step 8: verify the model is actually in use
If anything feels wrong, keep the verification loop small. Check the local server logs, the model alias, whether Claude Code is silently falling back to another model, and whether the KV cache fix is actually in place. Asking a direct prompt like What model are you? is a quick sanity check before you spend time debugging the wrong part of the stack. Only after that should you judge whether the setup is good enough for everyday use.
First-time Claude Code account note
For many people, the very first Claude Code setup still requires logging in with a Claude account and satisfying Claude Code access requirements, including any minimum account funding requirements that apply at the time, such as a small initial credit threshold. Treat that as the bootstrap step for getting Claude Code itself working. Once Claude Code is installed, logged in, and functioning, you can then point it at the local model path described in this guide for everyday local use.
The smallest version that actually works
- install
llama.cpp - download the Unsloth model artifact
- start the server with a moderate context size
- verify the alias
- fix the KV cache invalidation issue
- test Claude Code on a small task
That gets you to a usable baseline faster than trying to optimize everything at once.
Practical tuning on an M4 Pro Mac mini
Context size
A huge context window sounds attractive, but on local hardware it can hurt latency badly. Start lower if the machine feels sluggish.
Parallelism
If the system feels unstable or too memory-hungry, reduce --parallel. If it is underutilized, you can experiment upward.
Cache precision
The cache flags are worth keeping an eye on because they affect both memory usage and runtime behavior.
Prompt discipline
Agentic systems magnify token waste. Smaller, cleaner context packets help local models much more than people expect.
Where most setups go wrong
Pitfall: Many people compare open local models to hosted frontier models before fixing the obvious performance killers. If KV cache invalidation or a bad server configuration is dragging the system down, you are not evaluating the model fairly. You are evaluating a broken setup.
- using an overly large context window immediately
- assuming the biggest model is automatically the best local choice
- forgetting to verify that Claude Code is actually using the local alias
- judging the setup on one bad run before the cache and server settings are sane
Best practices I would use
Best practice: optimize for responsiveness and stability first. A local model that is slightly weaker on paper but consistently fast enough for repeated coding turns is often more useful than a heavier model that drags the entire loop down.
- fix KV cache invalidation first
- start with a moderate context size
- verify the alias and server health explicitly
- keep one known-good server command documented
- only then start chasing bigger context windows or fancier tuning
What I would not do
I would not jump straight to the dense 35B variant just because it sounds more complete on paper. On a local agentic workflow, the subjective feel of the system matters a lot. If the model becomes sluggish across repeated calls, the whole setup becomes harder to use no matter how good its raw capability is.
I would also avoid treating this as a full replacement for hosted top-end Claude models in every scenario. The better framing is a hybrid one: use the local model for private, iterative, or cost-sensitive workflows, and keep hosted models for the highest-end reasoning or coding ceiling.
Final checklist
- Install the macOS prerequisites
- Build or install
llama.cpp - Download the Unsloth Qwen3.5-35B-A3B model artifact
- Start the local server with a documented command
- Read the Unsloth Claude Code guide
- Fix the KV cache invalidation issue first
- Verify Claude Code is actually calling the local alias
- Test on a small repo before using it for real work
- Tune context size and parallelism for your machine
Conclusion
If you want a local coding model for Claude Code on a Mac mini M4 Pro, Unsloth/Qwen3.5-35B-A3B is a very sensible choice. The reason is not just model quality. It is that the model is much more practical for repeated local agent loops than a heavier dense alternative when the rest of the stack is tuned properly.
The key is to treat the setup as a system, not just a model download. Fix the KV cache issue first, keep the serving command sane, verify the alias path end to end, and tune for repeated interactive use rather than bragging-rights context sizes.