What is lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test?

A test GGUF conversion of the z-lab DFlash drafter (https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash) for Qwen3.6-35B-A3B, intended for DFlash speculative decoding as implemented in the llama.cpp pull request https://github.com/ggml-org/llama.cpp/pull/22105. The full README, including the conversion and build steps, is reproduced under "Model README" below.

What license applies to lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test?

License: mit. Verify terms on Hugging Face before commercial use.

How do I run lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test locally?

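Download a GGUF file from this page and load it as the draft model for speculative decoding in a llama.cpp build from the DFlash branch, alongside a Qwen3.6-35B-A3B target model (see the steps in the README below). Pipeline task: text-generation.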

Model Intelligence Sheet


Tags: gguf, license:mit, endpoints_compatible, region:us, conversational

Runs locally from ~401.6 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads	234
Author	lym00

Repository Files & Downloads

3 GGUF files detected

Direct downloads for local inference

File	Type	Quantization	Size
Qwen3.6-35B-A3B-DFlash-bf16.gguf	GGUF	BF16	746.6 MB
Qwen3.6-35B-A3B-DFlash-f16.gguf	GGUF	F16	746.6 MB
Qwen3.6-35B-A3B-DFlash-q8_0.gguf	GGUF	Q8_0	401.6 MB

Model Details

Model ID	lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test
Author	lym00
Pipeline	—
License	mit
Base model	—
Last modified	2026-06-25T01:09:05.000Z

Model README

---

license: mit

---

llama.cpp Pull Request: https://github.com/ggml-org/llama.cpp/pull/22105

DFlash Drafter: https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash

# 2026-06-25:
Re-converted and tested v2 implementation

[ruixiang63](https://github.com/ruixiang63):

> DFlash v2 has been updated and now looks cleaner and more robust. The performance also looks good to me. As discussed with [@ggerganov](https://github.com/ggerganov) in [#24904](https://github.com/ggml-org/llama.cpp/discussions/24904), this implementation is much simpler and offers better graph reuse.

Update: https://github.com/ggml-org/llama.cpp/pull/22105#issuecomment-4792767176

# Steps (follow the PR)

1) `git clone -b dflash https://github.com/ruixiang63/llama.cpp`

2) Download the draft model from https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/

3) Download the tokenizer from https://huggingface.co/Qwen/Qwen3.6-35B-A3B

4) Convert the draft model to GGUF


Conversion log: [conversion.log](https://huggingface.co/lym00/Qwen3.6-35B-A3B-DFlash-GGUF-Test/blob/main/gguf_conversion.log)
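
The exact conversion command is not reproduced on this page; the linked conversion log is the authoritative record. As a rough sketch only, the Python helper below prints one `convert_hf_to_gguf.py` invocation per output type listed in the file table. The directory names, the output file naming, and the assumption that the `dflash` branch needs no extra conversion flags are all hypothetical. In particular, the sketch does not show how the tokenizer from step 3 is supplied, so check the PR before relying on it.

```python
import shlex
import sys
from pathlib import Path

# Output types matching the three GGUF files published in this repository.
OUTTYPES = ("bf16", "f16", "q8_0")


def build_convert_cmd(llama_cpp_dir, draft_dir, out_dir, outtype):
    """Build argv for llama.cpp's convert_hf_to_gguf.py (dry run, nothing is executed)."""
    script = Path(llama_cpp_dir) / "convert_hf_to_gguf.py"
    # Hypothetical naming scheme, chosen to mirror the file names in this repo.
    outfile = Path(out_dir) / f"Qwen3.6-35B-A3B-DFlash-{outtype}.gguf"
    return [
        sys.executable, str(script), str(draft_dir),
        "--outfile", str(outfile),
        "--outtype", outtype,
    ]


# Hypothetical local paths: the cloned dflash branch and the downloaded drafter.
for outtype in OUTTYPES:
    cmd = build_convert_cmd("llama.cpp", "Qwen3.6-35B-A3B-DFlash", ".", outtype)
    print(shlex.join(cmd))
```

The helper only prints the commands so they can be reviewed and compared against the conversion log before being run by hand.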

5) Build llama.cpp
- CUDA

cmake -B build -DGGML_CUDA=ON

cmake --build build --config Release -j


- VULKAN

cmake -B build -DGGML_VULKAN=ON

cmake --build build --config Release -j


6) Run DFlash speculative decoding

# Test Results

WebUI (llama.cpp)

![image](https://cdn-uploads.huggingface.co/production/uploads/683b74909c3fe5951bce2e37/F5aTybUXRlYqyfDItWfTb.png)

Config (models.ini)

version = 1

[*]

flash-attn = on

mlock = off

mmap = off

fit = on

warmup = on

batch-size = 256

ubatch-size = 256

cache-type-k = q4_0

cache-type-v = q5_1

kv-unified = true

swa-full = true

jinja = true

direct-io = off

cache-prompt = true

cache-ram = 28672

n-gpu-layers = 99

reasoning = off

reasoning-budget = 0

chat-template-kwargs = {"preserve_thinking": true}

spec-default = true

ctx-checkpoints = 64

parallel = 1

threads-http = 1

ctx-size = 65536

--- MODELS ---

[LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-GGUF]

alias = LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-GGUF

model = /root/.cache/llama.cpp/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-GGUF/Qwen3.6-35B-A3B-Uncensored-Claude-Genesis-V3-APEX-Compact.gguf

mmproj = /root/.cache/llama.cpp/mmproj/mmproj-Qwen3.6-35B-A3B-Uncensored-BF16.gguf

#spec-draft-model = /root/.cache/llama.cpp/mtp/Qwen3.6-35B-A3B-MTP-q4_0.gguf

#spec-draft-model = /root/.cache/llama.cpp/eagle3/eagle3-draft-q8_0.gguf

spec-draft-model = /root/.cache/llama.cpp/dflash/Qwen3.6-35B-A3B-DFlash-q8_0.gguf

temperature = 0.7

top-k = 20

top-p = 0.8

presence-penalty = 1.5

repeat-penalty = 1.0

seed = 42

spec-type = draft-dflash,ngram-mod,ngram-map-k4v

spec-draft-n-max = 3

spec-draft-p-min = 0.50

spec-draft-prio = 2

spec-draft-prio-batch = 2

spec-ngram-mod-n-match = 24

spec-ngram-mod-n-min = 48

spec-ngram-mod-n-max = 64

spec-ngram-map-k4v-size-n = 8

spec-ngram-map-k4v-size-m = 24

spec-ngram-map-k4v-min-hits = 2
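
To see which settings a given model ends up with, the preset can be read as "shared `[*]` defaults overlaid by the model's own section". The Python sketch below is a minimal illustration of that reading, not llama.cpp's actual loader: the parsing rules (skipping `#`/`;` lines and separator lines without `=`), the function names, and the sample paths are assumptions made for this example.

```python
def parse_preset(text):
    """Parse a models.ini-style preset into {section_name: {key: value}}."""
    sections = {"": {}}  # "" holds keys that appear before any [section]
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or commented-out option
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
            continue
        if "=" not in line:
            continue  # separator such as "--- MODELS ---"
        key, value = line.split("=", 1)
        sections[current][key.strip()] = value.strip()
    return sections


def effective_settings(sections, model):
    """Overlay one model's section on top of the shared [*] defaults."""
    merged = dict(sections.get("*", {}))
    merged.update(sections.get(model, {}))
    return merged


# Shortened, hypothetical excerpt in the same shape as the config above.
SAMPLE = """\
version = 1
[*]
flash-attn = on
ctx-size = 65536
--- MODELS ---
[demo-model]
#spec-draft-model = /models/mtp-q4_0.gguf
spec-draft-model = /models/dflash-q8_0.gguf
spec-type = draft-dflash,ngram-mod,ngram-map-k4v
spec-draft-n-max = 3
"""

settings = effective_settings(parse_preset(SAMPLE), "demo-model")
print(settings["spec-draft-model"])
print(settings["spec-type"].split(","))
```

Commented-out `#spec-draft-model` lines are ignored, so only the DFlash drafter is reported as the active draft model, which matches how the config above switches between the MTP, EAGLE3, and DFlash drafters.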


Hardware tested (low-budget mini PC)

Model: Machenike GTR Mini PC (~$600)
CPU: AMD R7-H255 (780M iGPU)
RAM: 32 GB DDR5 (Shared/Unified memory)
Backend: llama.cpp (Vulkan)

---

## Archived Tests and Investigations

# 2026-04-19:
Rebase dflash feature onto latest master

git clone -b master https://github.com/ggml-org/llama.cpp

cd llama.cpp

git remote add ruixiang63 https://github.com/ruixiang63/llama.cpp

git fetch ruixiang63

git checkout -b dflash-test origin/master

git merge ruixiang63/dflash --no-edit


Then resolve the merge conflicts manually in:
- gguf-py/gguf/constants.py
- src/CMakeLists.txt
- src/llama-arch.cpp
- src/llama-hparams.h
- src/llama-model.cpp

Notes
- src/CMakeLists.txt
Use glob to collect src/models sources: https://github.com/ggml-org/llama.cpp/pull/22005/changes

- src/llama-arch.cpp
Remove per-arch tensor name lists: https://github.com/ggml-org/llama.cpp/pull/21531/changes

- src/llama-model.cpp
Refactor bias tensor variable names: https://github.com/ggml-org/llama.cpp/pull/22079/changes#diff-36e262e316ec1404e29880eb8b8ce4660ac584f0d0434710efc48a66497bdb59

from:

layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);

layer.bk = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd_k_gqa}, TENSOR_NOT_REQUIRED);

layer.bv = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd_v_gqa}, TENSOR_NOT_REQUIRED);

layer.bo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd}, TENSOR_NOT_REQUIRED);

to:

layer.wq_b = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);

layer.wk_b = create_tensor(tn(LLM_TENSOR_ATTN_K, "bias", i), {n_embd_k_gqa}, TENSOR_NOT_REQUIRED);

layer.wv_b = create_tensor(tn(LLM_TENSOR_ATTN_V, "bias", i), {n_embd_v_gqa}, TENSOR_NOT_REQUIRED);

layer.wo_b = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "bias", i), {n_embd}, TENSOR_NOT_REQUIRED);


- src/models/dflash.cpp follows the same renaming:

layer.bq -> layer.wq_b

layer.bk -> layer.wk_b

layer.bv -> layer.wv_b

layer.bo -> layer.wo_b
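
The four renames above are mechanical, so they can be applied with a small script instead of by hand. The Python sketch below is one possible way to do that; the regular expression (which only touches `layer.<name>` and `layers[...].<name>` accesses) and the function name are assumptions for illustration, and the resulting diff should still be reviewed before committing.

```python
import re

# Old bias member -> new member, as listed in the notes above.
RENAMES = {"bq": "wq_b", "bk": "wk_b", "bv": "wv_b", "bo": "wo_b"}

# Match `layer.bq` or `layers[il].bq`, but not longer names such as `layer.bq_extra`.
_PATTERN = re.compile(r"\b(layer|layers\[[^\]]+\])\.(bq|bk|bv|bo)\b")


def rename_bias_fields(source):
    """Return `source` with the old attention-bias member names replaced."""
    return _PATTERN.sub(lambda m: f"{m.group(1)}.{RENAMES[m.group(2)]}", source)


before = 'layer.bq = create_tensor(tn(LLM_TENSOR_ATTN_Q, "bias", i), {n_embd_head_k * n_head}, TENSOR_NOT_REQUIRED);'
print(rename_bias_fields(before))
```

To apply it to a file, read the file contents (for example `src/models/dflash.cpp`), pass them through `rename_bias_fields`, and write the result back. Running the function twice is harmless because the new names no longer match the pattern.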

- src/models/eagle3.cpp:134:
Support NVFP4 tensors for Gemma4, which adds a third tensor parameter (wo_s) to the build_attn call: https://github.com/ggml-org/llama.cpp/pull/21971/changes#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR2099

cur = build_attn(inp_attn,

model.layers[il].wo, NULL, NULL, // 3rd tensor parameter (wo_s)

Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, kq_scale, il);


# 2026-04-20:
Support for Qwen3.5/3.6 MoE and notes
- https://github.com/ggml-org/llama.cpp/discussions/21569#discussioncomment-16624433

- https://github.com/ggml-org/llama.cpp/pull/22105/changes/d1d2c81caccc748eaaff32b6b7823bad090fd1dd

Z Lab's new benchmark
- https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/82252400cd9baebdfa5730b0aa809e10db5dba12

# 2026-04-22: 
Re-uploaded the GGUF files based on the new drafter: https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash/commit/31977fbe13a86e8b961774f773058175676d89b8


# Issues and Solutions

/src/models/dflash.cpp:39: GGML_ASSERT(model.target_tok_embd != nullptr && "DFlash decoder requires target model's tok_embd") failed

Solution: check that the `--dflash` parameter is passed when running the `llama-speculative-simple` test.
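
As a guard against this assertion, the command line can be assembled in one place so the flag is never forgotten. The Python sketch below only builds and prints the command. The binary path, the file names, and the prompt are hypothetical; `-m`, `-md`, and `-p` are the usual llama.cpp options for target model, draft model, and prompt, and `--dflash` is taken from the note above rather than verified against the branch.

```python
import shlex


def build_dflash_cmd(target_gguf, draft_gguf, prompt,
                     binary="./build/bin/llama-speculative-simple"):
    """Build argv for a DFlash test run; nothing is executed here."""
    return [
        binary,
        "-m", target_gguf,   # target model
        "-md", draft_gguf,   # DFlash draft model
        "--dflash",          # flag named in the note above
        "-p", prompt,
    ]


# Hypothetical file names for illustration.
cmd = build_dflash_cmd("target.gguf", "Qwen3.6-35B-A3B-DFlash-q8_0.gguf", "Hello")
if "--dflash" not in cmd:
    raise SystemExit("missing --dflash: the DFlash decoder needs the target model's tok_embd")
print(shlex.join(cmd))
```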


Source: Hugging Face