lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test overview
Runs locally from ~401.6 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
---
license: mit
---
llama.cpp Pull Request: https://github.com/ggml-org/llama.cpp/pull/22105
DFlash Drafter: https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash
2026-06-25:
Re-converted and tested v2 implementation
>DFlash v2 has been updated and now looks cleaner and more robust. The performance also looks good to me. As discussed with @ggerganov in #24904, this implementation is much simpler and offers better graph reuse.
Update: https://github.com/ggml-org/llama.cpp/pull/22105#issuecomment-4792767176
Steps (follow the PR)
1) git clone -b dflash https://github.com/ruixiang63/llama.cpp
2) Download the draft model from https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/
3) Download the tokenizer from https://huggingface.co/Qwen/Qwen3.6-35B-A3B
4) Convert the draft model to GGUF
Conversion log: [conversion.log](https://huggingface.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test/blob/main/gguf_conversion.log)
5) Build llama.cpp
- CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
- VULKAN
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
6) Run DFlash speculative decoding
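Steps 1 to 5 above can be collected into a dry-run plan. The sketch below only prints the commands and executes nothing. The `git` and `cmake` lines come from the steps themselves (with `-S` added so they can run from the parent directory). Everything else is an assumption: the use of `huggingface-cli download`, the `tokenizer*` file pattern, placing the tokenizer next to the draft weights, the local directory name, and the `convert_hf_to_gguf.py` flags. The PR and the linked conversion log are the authority on what the DFlash branch actually expects.

```python
import shlex
from typing import List

# Hypothetical local names, chosen for illustration only.
DRAFT_DIR = "Qwen3.6-35B-A3B-DFlash"
OUT_GGUF = "Qwen3.6-35B-A3B-DFlash-q8_0.gguf"


def build_plan(backend: str = "vulkan") -> List[List[str]]:
    """Return the commands for steps 1-5 as argument lists (nothing is executed)."""
    backend_flag = {"cuda": "-DGGML_CUDA=ON", "vulkan": "-DGGML_VULKAN=ON"}[backend]
    return [
        # 1) clone the dflash branch (taken from the steps above)
        ["git", "clone", "-b", "dflash", "https://github.com/ruixiang63/llama.cpp"],
        # 2) draft model (assumption: huggingface-cli from the huggingface_hub package)
        ["huggingface-cli", "download", "z-lab/Qwen3.6-35B-A3B-DFlash",
         "--local-dir", DRAFT_DIR],
        # 3) tokenizer files only (assumption: file pattern and target directory)
        ["huggingface-cli", "download", "Qwen/Qwen3.6-35B-A3B",
         "--include", "tokenizer*", "--local-dir", DRAFT_DIR],
        # 4) conversion (assumption: the branch keeps the standard converter flags)
        ["python", "llama.cpp/convert_hf_to_gguf.py", DRAFT_DIR,
         "--outfile", OUT_GGUF, "--outtype", "q8_0"],
        # 5) configure and build for the chosen backend
        ["cmake", "-S", "llama.cpp", "-B", "llama.cpp/build", backend_flag],
        ["cmake", "--build", "llama.cpp/build", "--config", "Release", "-j"],
    ]


if __name__ == "__main__":
    for cmd in build_plan("vulkan"):
        print(shlex.join(cmd))
```

Printing the plan first makes it easy to compare each command against the PR instructions before running anything.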
# Test Results
WebUI (llama.cpp)

Config (models.ini)
version = 1
[*]
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
batch-size = 256
ubatch-size = 256
cache-type-k = q4_0
cache-type-v = q5_1
kv-unified = true
swa-full = true
jinja = true
direct-io = off
cache-prompt = true
cache-ram = 28672
n-gpu-layers = 99
reasoning = off
reasoning-budget = 0
chat-template-kwargs = {"preserve_thinking": true}
spec-default = true
ctx-checkpoints = 64
parallel = 1
threads-http = 1
ctx-size = 65536
# --- MODELS ---
[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-GGUF]
alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-GGUF
model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-V3-APEX-Compact.gguf
mmproj = /root/.cache/llama.cpp/mmproj/mmproj-Qwen3.6-35B-A3B-Uncensored-BF16.gguf
#spec-draft-model = /root/.cache/llama.cpp/mtp/Qwen3.6-35B-A3B-MTP-q4_0.gguf
#spec-draft-model = /root/.cache/llama.cpp/eagle3/eagle3-draft-q8_0.gguf
spec-draft-model = /root/.cache/llama.cpp/dflash/Qwen3.6-35B-A3B-DFlash-q8_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
seed = 42
spec-type = draft-dflash,ngram-mod,ngram-map-k4v
spec-draft-n-max = 3
spec-draft-p-min = 0.50
spec-draft-prio = 2
spec-draft-prio-batch = 2
spec-ngram-mod-n-match = 24
spec-ngram-mod-n-min = 48
spec-ngram-mod-n-max = 64
spec-ngram-map-k4v-size-n = 8
spec-ngram-map-k4v-size-m = 24
spec-ngram-map-k4v-min-hits = 2
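
To make the two draft limits in the config above more concrete, here is a toy sketch of greedy draft-and-verify decoding. It is not llama.cpp code. It assumes that `spec-draft-n-max` caps how many tokens are drafted per step and that `spec-draft-p-min` stops drafting once the drafter's confidence falls below the threshold. The two "models" are hypothetical lookup tables.

```python
from typing import Callable, List, Tuple

# A toy "model": maps a context to (next_token, probability of that token).
Model = Callable[[List[str]], Tuple[str, float]]


def draft_tokens(draft: Model, context: List[str], n_max: int, p_min: float) -> List[str]:
    """Draft up to n_max tokens, stopping early when confidence drops below p_min."""
    drafted: List[str] = []
    ctx = list(context)
    for _ in range(n_max):
        tok, p = draft(ctx)
        if p < p_min:
            break
        drafted.append(tok)
        ctx.append(tok)
    return drafted


def verify(target: Model, context: List[str], drafted: List[str]) -> List[str]:
    """Accept the longest drafted prefix the target agrees with, plus one target token."""
    ctx = list(context)
    accepted: List[str] = []
    for tok in drafted:
        expected, _ = target(ctx)
        if expected != tok:
            accepted.append(expected)  # the target's correction replaces the bad draft
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    bonus, _ = target(ctx)  # every draft matched, so the target adds one more token
    accepted.append(bonus)
    return accepted


# Hypothetical toy data: the drafter is wrong (and less sure) at the third token.
TARGET_TEXT = ["the", "quick", "brown", "fox", "jumps"]
DRAFT_GUESSES = [("the", 0.9), ("quick", 0.8), ("red", 0.6), ("fox", 0.4), ("jumps", 0.9)]


def toy_target(ctx: List[str]) -> Tuple[str, float]:
    return TARGET_TEXT[len(ctx)], 1.0


def toy_draft(ctx: List[str]) -> Tuple[str, float]:
    return DRAFT_GUESSES[len(ctx)]


def step(n_max: int, p_min: float) -> Tuple[List[str], List[str]]:
    drafted = draft_tokens(toy_draft, [], n_max, p_min)
    return drafted, verify(toy_target, [], drafted)


if __name__ == "__main__":
    print(step(3, 0.5))  # drafts three tokens; the third is rejected and corrected
    print(step(3, 0.7))  # stops drafting before the low-confidence third token
```

In this toy, both settings yield the same accepted tokens. The stricter threshold only avoids drafting a token that would have been rejected, which is the trade-off these two options control.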
Hardware tested (low-budget mini PC)
Model: Machenike GTR Mini PC (~$600)
CPU: AMD R7-H255 (780M iGPU)
RAM: 32 GB DDR5 (shared/unified memory)
Backend: llama.cpp (Vulkan)
---
## Archived Tests and Investigations
# 2026-04-19:
Rebase the dflash feature onto the latest master:
git clone -b master https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add ruixiang63 https://github.com/ruixiang63/llama.cpp
git fetch ruixiang63
git checkout -b dflash-test origin/master
git merge ruixiang63/dflash --no-edit
Then resolve the conflicts manually in:
- gguf-py/gguf/constants.py
- src/CMakeLists.txt
- src/llama-arch.cpp
- src/llama-hparams.h
- src/llama-model.cpp
Notes
- src/CMakeLists.txt
Use glob to collect src/models sources: https://github.com/ggml-org/llama.cpp/pull/22005/changes
- src/llama-arch.cpp
remove per-arch tensor name lists: https://github.com/ggml-org/llama.cpp/pull/21531/changes
- src/llama-model.cpp
Refactor bias tensor variable names: https://github.com/ggml-org/llama.cpp/pull/22079/changes#diff-36e262e316ec1404e29880eb8b8ce4660ac584f0d0434710efc48a66497bdb59
from:
layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd_k_gqa}, TENSOR_NOT_REQUIRED);
layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd_v_gqa}, TENSOR_NOT_REQUIRED);
layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd}, TENSOR_NOT_REQUIRED);
to:
layer.wq_b = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);
layer.wk_b = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd_k_gqa}, TENSOR_NOT_REQUIRED);
layer.wv_b = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd_v_gqa}, TENSOR_NOT_REQUIRED);
layer.wo_b = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd}, TENSOR_NOT_REQUIRED);
- src/models/dflash.cpp follows the same renaming:
layer.bq -> layer.wq_b
layer.bk -> layer.wk_b
layer.bv -> layer.wv_b
layer.bo -> layer.wo_b
- src/models/eagle3.cpp:134:
support NVFP4 tensors for Gemma4: https://github.com/ggml-org/llama.cpp/pull/21971/changes#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR2099
cur = build_attn(inp_attn,
model.layers[il].wo, NULL, NULL, // 3rd tensor parameter (wo_s)
Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, kq_scale, il);
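
The bias-field renames listed in the notes above (`bq` to `wq_b`, `bk` to `wk_b`, `bv` to `wv_b`, `bo` to `wo_b`) are mechanical, so they can be applied with a small script when resolving the merge. This is a minimal sketch: the helper name is hypothetical, and the regex assumes the fields only appear as member accesses after a dot, so the resulting diff should still be reviewed by hand.

```python
import re

# Old bias field name -> new name, as listed in the notes above.
RENAMES = {"bq": "wq_b", "bk": "wk_b", "bv": "wv_b", "bo": "wo_b"}

# Match the old names only as whole identifiers directly after a '.' member access.
_FIELD = re.compile(r"(?<=\.)(bq|bk|bv|bo)\b")


def rename_bias_fields(source: str) -> str:
    """Rewrite `.bq`, `.bk`, `.bv`, `.bo` member accesses to the new field names."""
    return _FIELD.sub(lambda m: RENAMES[m.group(1)], source)


if __name__ == "__main__":
    line = 'layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n}, TENSOR_NOT_REQUIRED);'
    print(rename_bias_fields(line))
```

The lookbehind keeps unrelated identifiers such as a local variable named `bo` untouched, and the word boundary prevents partial matches inside longer field names.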
# 2026-04-20:
Support for Qwen3.5/3.6 MoE and notes
- https://github.com/ggml-org/llama.cpp/discussions/21569#discussioncomment-16624433
- https://github.com/ggml-org/llama.cpp/pull/22105/changes/d1d2c81caccc748eaaff32b6b7823bad090fd1dd
Z Lab's new benchmark
- https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/82252400cd9baebdfa5730b0aa809e10db5dba12
# 2026-04-22:
Re-uploaded gguf based on new drafter https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/31977fbe13a86e8b961774f773058175676d89b8
# Issues and Solutions
/src/models/dflash.cpp:39: GGML_ASSERT(model.target_tok_embd != nullptr && "DFlash decoder requires target model's tok_embd") failed
Check that the `--dflash` parameter is passed to the `llama-speculative-simple` test.