rafw007/qwen36-a3b-claude-coder-llama.cpp-GGUF overview
Qwen3.6 Claude Coder — local MoE coding agent llama.cpp build A custom configuration of Qwen3.6 35B A3B Mixture of Experts, ~3B active parameters , set up to a…
Runs locally from ~22.29 GB disk (24 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf | GGUF | Q4_K_M | 22.29 GB | Download |
Model Details
| Model ID | rafw007/qwen36-a3b-claude-coder-llama.cpp-GGUF |
|---|---|
| Author | rafw007 |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | Qwen/Qwen3.6-35B-A3B |
| Last modified | 2026-06-07T22:58:21.000Z |
Model README
---
license: apache-2.0
language:
- en
- pl
base_model: Qwen/Qwen3.6-35B-A3B
pipeline_tag: text-generation
tags:
- qwen3
- moe
- coding-agent
- tool-calling
- gguf
- llama.cpp
- ik_llama.cpp
- claude-code
- opencode
---
Qwen3.6 Claude Coder — local MoE coding agent (llama.cpp build)
A custom configuration of Qwen3.6-35B-A3B (Mixture-of-Experts, ~3B active parameters), set up
to act as an autonomous coding agent: it uses tools instead of guessing, grounds every answer in
the actual tool output (never fabricates results), does not loop on the same tool, and returns
complete, runnable code. No-think mode is wired into the system prompt for fast, direct answers.
Safety guardrails of the base model are intact.
It drives Claude Code, Codex and opencode fully locally — your code never leaves your
machine and cloud token cost drops to zero.
> This is the llama.cpp / ik_llama.cpp build. Same behavior and configuration as
> rafw007/qwen36-a3b-claude-coder on Ollama —
> packaged so it loads on stock llama.cpp. See "Why a separate version" below.
Why a separate version (vs. the Ollama one)
The Ollama model and this one share the same agent config (system prompt + sampling params).
What differs is packaging and the loader they target:
| | Ollama version | This llama.cpp version |
|---|---|---|
| Runtime | Ollama engine + Modelfile (RENDERER/PARSER qwen3.5) | stock llama.cpp / ik_llama.cpp (llama-server) |
| Weights | nvfp4 (~21 GB) | GGUF Q4_K_M (~24 GB) |
| Tool format | Ollama's native Qwen parser | GGUF Jinja chat template + --jinja |
| Agent config | baked into the Modelfile | supplied via launch flags + a system-prompt file (below) |
The actual fix. Qwen3.5/3.6-MoE uses multimodal RoPE (mRoPE) whose native
rope.dimension_sections is 3 ints [t, h, w]. Ollama's loader is lenient and accepts that.
Recent stock llama.cpp (the Qwen3.5 loader from PR #19435) validates that key as a length-4
array and rejects the 3-element one:
key qwen35moe.rope.dimension_sections has wrong array length; expected 4, got 3
This is a known, family-wide converter/loader mismatch — not specific to this quant. **This GGUF has
the section array padded to length 4** ([11, 11, 10] → [11, 11, 10, 0]; the 4th slot is the unused
text section, it does not change inference), so it loads cleanly on current llama.cpp and
ik_llama.cpp. If you hit the error above with any other Qwen3.5/3.6-MoE GGUF, this is the cause.
What it is (and what it is not)
Honest framing: the weights are stock Qwen3.6-35B-A3B. The "Claude Coder" behavior comes entirely
from an agentic system prompt + sampling configuration, plus the llama.cpp-compatibility rope fix
described above. Everything here is measured, not marketing.
Quick start (llama.cpp / ik_llama.cpp)
llama-server \
-m qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf \
--jinja --reasoning-budget 0 \
-c 65536 \
--temp 0.6 --top-k 20 --top-p 0.8 --repeat-penalty 1 --presence-penalty 0 \
--system-prompt-file qwen36-system.txt \
--host 0.0.0.0 --port 8080
--reasoning-budget 0 enforces no-think. --jinja enables native tool-calling via the embedded
Qwen chat template. qwen36-system.txt is your agent system-prompt file (same configuration as the
Ollama build — its contents are not published).
Tested
End-to-end under opencode against ik_llama.cpp (llama-server, port-bound, --jinja): the
model emitted real tool_calls, executed a real df -h, grounded its answer on the actual output
and exited cleanly (no tool loop). Loads without the rope error on ik_llama.cpp (mRoPE sections
reported as [11, 11, 10, 0]).
Context
- Configured for 64K (Claude Code's recommended minimum). Base Qwen3.6 natively supports 262K,
so context can be raised on stronger hardware. On a CPU-only box lower it (e.g. 16–32K) to fit RAM.
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf | Q4_K_M | ~24 GB | mRoPE dimension_sections padded to length-4 for stock llama.cpp / ik_llama.cpp. |
How it was made
Designed, built and tested with the help of Claude Opus — the system prompt, parameter choices
and context configuration come from that work. The llama.cpp packaging (rope-section fix + launch
recipe) was added after a user report that the Ollama-targeted GGUF would not load on stock
llama.cpp.
License
Apache 2.0 (inherited from the base Qwen3.6).
Run rafw007/qwen36-a3b-claude-coder-llama.cpp-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models