GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

rafw007/qwen36-a3b-claude-coder-llama.cpp-GGUF overview

Qwen3.6 Claude Coder — local MoE coding agent llama.cpp build A custom configuration of Qwen3.6 35B A3B Mixture of Experts, ~3B active parameters , set up to a…

ggufqwen3moecoding-agenttool-callingllama.cppik_llama.cppclaude-codeopencodetext-generationenplbase_model:Qwen/Qwen3.6-35B-A3Bbase_model:quantized:Qwen/Qwen3.6-35B-A3Blicense:apache-2.0endpoints_compatibleregion:usconversational

Runs locally from ~22.29 GB disk (24 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
0
Likes
0
Pipeline
text-generation
Author

Repository Files & Downloads

1 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
qwen36-a3b-claude-coder-q4_K_M-llama.cpp.ggufGGUFQ4_K_M22.29 GBDownload

Model Details

Model IDrafw007/qwen36-a3b-claude-coder-llama.cpp-GGUF
Authorrafw007
Pipelinetext-generation
Licenseapache-2.0
Base modelQwen/Qwen3.6-35B-A3B
Last modified2026-06-07T22:58:21.000Z

Model README

---

license: apache-2.0

language:

  • en
  • pl

base_model: Qwen/Qwen3.6-35B-A3B

pipeline_tag: text-generation

tags:

  • qwen3
  • moe
  • coding-agent
  • tool-calling
  • gguf
  • llama.cpp
  • ik_llama.cpp
  • claude-code
  • opencode

---

Qwen3.6 Claude Coder — local MoE coding agent (llama.cpp build)

A custom configuration of Qwen3.6-35B-A3B (Mixture-of-Experts, ~3B active parameters), set up

to act as an autonomous coding agent: it uses tools instead of guessing, grounds every answer in

the actual tool output (never fabricates results), does not loop on the same tool, and returns

complete, runnable code. No-think mode is wired into the system prompt for fast, direct answers.

Safety guardrails of the base model are intact.

It drives Claude Code, Codex and opencode fully locally — your code never leaves your

machine and cloud token cost drops to zero.

> This is the llama.cpp / ik_llama.cpp build. Same behavior and configuration as

> rafw007/qwen36-a3b-claude-coder on Ollama —

> packaged so it loads on stock llama.cpp. See "Why a separate version" below.

Why a separate version (vs. the Ollama one)

The Ollama model and this one share the same agent config (system prompt + sampling params).

What differs is packaging and the loader they target:

| | Ollama version | This llama.cpp version |

|---|---|---|

| Runtime | Ollama engine + Modelfile (RENDERER/PARSER qwen3.5) | stock llama.cpp / ik_llama.cpp (llama-server) |

| Weights | nvfp4 (~21 GB) | GGUF Q4_K_M (~24 GB) |

| Tool format | Ollama's native Qwen parser | GGUF Jinja chat template + --jinja |

| Agent config | baked into the Modelfile | supplied via launch flags + a system-prompt file (below) |

The actual fix. Qwen3.5/3.6-MoE uses multimodal RoPE (mRoPE) whose native

rope.dimension_sections is 3 ints [t, h, w]. Ollama's loader is lenient and accepts that.

Recent stock llama.cpp (the Qwen3.5 loader from PR #19435) validates that key as a length-4

array and rejects the 3-element one:

key qwen35moe.rope.dimension_sections has wrong array length; expected 4, got 3

This is a known, family-wide converter/loader mismatch — not specific to this quant. **This GGUF has

the section array padded to length 4** ([11, 11, 10] → [11, 11, 10, 0]; the 4th slot is the unused

text section, it does not change inference), so it loads cleanly on current llama.cpp and

ik_llama.cpp. If you hit the error above with any other Qwen3.5/3.6-MoE GGUF, this is the cause.

What it is (and what it is not)

Honest framing: the weights are stock Qwen3.6-35B-A3B. The "Claude Coder" behavior comes entirely

from an agentic system prompt + sampling configuration, plus the llama.cpp-compatibility rope fix

described above. Everything here is measured, not marketing.

Quick start (llama.cpp / ik_llama.cpp)

llama-server \
  -m qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf \
  --jinja --reasoning-budget 0 \
  -c 65536 \
  --temp 0.6 --top-k 20 --top-p 0.8 --repeat-penalty 1 --presence-penalty 0 \
  --system-prompt-file qwen36-system.txt \
  --host 0.0.0.0 --port 8080

--reasoning-budget 0 enforces no-think. --jinja enables native tool-calling via the embedded

Qwen chat template. qwen36-system.txt is your agent system-prompt file (same configuration as the

Ollama build — its contents are not published).

Tested

End-to-end under opencode against ik_llama.cpp (llama-server, port-bound, --jinja): the

model emitted real tool_calls, executed a real df -h, grounded its answer on the actual output

and exited cleanly (no tool loop). Loads without the rope error on ik_llama.cpp (mRoPE sections

reported as [11, 11, 10, 0]).

Context

  • Configured for 64K (Claude Code's recommended minimum). Base Qwen3.6 natively supports 262K,

so context can be raised on stronger hardware. On a CPU-only box lower it (e.g. 16–32K) to fit RAM.

Files

| File | Quant | Size | Notes |

|---|---|---|---|

| qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf | Q4_K_M | ~24 GB | mRoPE dimension_sections padded to length-4 for stock llama.cpp / ik_llama.cpp. |

How it was made

Designed, built and tested with the help of Claude Opus — the system prompt, parameter choices

and context configuration come from that work. The llama.cpp packaging (rope-section fix + launch

recipe) was added after a user report that the Ollama-targeted GGUF would not load on stock

llama.cpp.

License

Apache 2.0 (inherited from the base Qwen3.6).

Run rafw007/qwen36-a3b-claude-coder-llama.cpp-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models