ji-farthing/Qwen3.5-35B-A3B-DFlash-SWA-ik-llama-GGUF overview
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Qwen3.5-35B-A3B-DFlash-SWA-ik_llama-Q8_0.gguf | GGUF | Q8_0 | 401.6 MB | Download |
Model Details
| Model ID | ji-farthing/Qwen3.5-35B-A3B-DFlash-SWA-ik-llama-GGUF |
|---|---|
| Author | ji-farthing |
| Pipeline | — |
| License | apache-2.0 |
| Base model | z-lab/Qwen3.5-35B-A3B-DFlash,Qwen/Qwen3.5-35B-A3B |
| Last modified | 2026-06-24T13:33:51.000Z |
Model README
---
license: apache-2.0
base_model:
- z-lab/Qwen3.5-35B-A3B-DFlash
- Qwen/Qwen3.5-35B-A3B
tags:
- gguf
- qwen3.5
- dflash
- speculative-decoding
- sliding-window-attention
- ik_llama
---
Qwen3.5-35B-A3B DFlash SWA draft for ik_llama
This repo contains an ik_llama-compatible DFlash draft GGUF converted from z-lab/Qwen3.5-35B-A3B-DFlash, carrying the per-layer sliding-window attention (SWA) pattern.
This is not a standalone chat model. Use it as a --model-draft file next to a matching Qwen3.5-35B-A3B target GGUF, with DFlash speculative decoding.
Sliding-window attention
The draft uses sliding-window attention on every layer except the final one, which is a full-attention (global) layer: sliding_window_pattern = [true, true, true, true, true, false], sliding_window = 4096.
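As a rough illustration of what this pattern means per layer, the sketch below lists which key positions a query can see under a generic causal sliding-window convention. The function name and the exact window boundary (here the window counts the current token) are assumptions for illustration; ik_llama's DFlash implementation may define the boundary differently.

```python
# Values copied from this README.
SLIDING_WINDOW = 4096
SLIDING_WINDOW_PATTERN = [True, True, True, True, True, False]


def visible_key_range(layer: int, query_pos: int) -> range:
    """Key positions a query at `query_pos` may attend to in `layer`.

    Assumption: causal attention, with the window including the current token.
    """
    if SLIDING_WINDOW_PATTERN[layer]:
        # Sliding layer: only the most recent SLIDING_WINDOW positions.
        start = max(0, query_pos - SLIDING_WINDOW + 1)
    else:
        # Global layer: everything up to and including the query position.
        start = 0
    return range(start, query_pos + 1)


if __name__ == "__main__":
    for layer in range(len(SLIDING_WINDOW_PATTERN)):
        keys = visible_key_range(layer, query_pos=10_000)
        print(f"layer {layer}: sees {len(keys)} positions, starting at {keys.start}")
```

Under these assumptions, a query far into a long prompt sees at most 4096 positions in the five sliding layers and the whole prefix in the final global layer.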
Files
| File | Quant | Draft window |
| --- | --- | --- |
| Qwen3.5-35B-A3B-DFlash-SWA-ik_llama-Q8_0.gguf | Q8_0 | 4096 (5 sliding + 1 global) |
Use
llama-server \
  -m /path/to/Qwen3.5-35B-A3B-<quant>.gguf \
  --model-draft /path/to/Qwen3.5-35B-A3B-DFlash-SWA-ik_llama-Q8_0.gguf \
  --spec-type dflash:n_max=4,cross_ctx=8192 \
  -c 8192
SWA only engages once the DFlash cross-context exceeds the 4096 window, so set cross_ctx above the window for long-context prompts (the default 512 does not grow with -c).
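A small sanity check for this rule, as a sketch. The helper names are hypothetical; the only values taken from this README are the 4096 window, the default cross_ctx of 512, and the dflash:n_max=...,cross_ctx=... argument format shown in the command.

```python
SLIDING_WINDOW = 4096     # draft window stated in this README
DEFAULT_CROSS_CTX = 512   # default cross-context stated in this README


def swa_can_engage(cross_ctx: int = DEFAULT_CROSS_CTX,
                   window: int = SLIDING_WINDOW) -> bool:
    """True if the cross-context is large enough to overflow the window."""
    return cross_ctx > window


def dflash_spec_type(n_max: int, cross_ctx: int) -> str:
    """Build the --spec-type value in the format used in this README."""
    return f"dflash:n_max={n_max},cross_ctx={cross_ctx}"


if __name__ == "__main__":
    for cross_ctx in (DEFAULT_CROSS_CTX, 4096, 8192):
        print(cross_ctx, swa_can_engage(cross_ctx), dflash_spec_type(4, cross_ctx))
```

With the default of 512 the check returns False, which is why the example command sets cross_ctx=8192 explicitly.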
Validation (RTX 4070, ik_llama DFlash SWA branch)
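Draft acceptance and throughput for the SWA draft versus the same draft run with full attention, as the prompt overflows the 4096-token window. Here clip = (prompt - 4096) / prompt is the fraction of prompt tokens that fall outside the window, taken as 0% when the prompt fits inside it: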
| prompt tok | clip | accept, full-attn | accept, SWA | acceptance gain | tok/s change |
| --- | --- | --- | --- | --- | --- |
| 37 | 0% | 36.1% | 39.2% | +3.0 pp | +6.8% |
| 5613 | 27% | 28.4% | 28.6% | +0.2 pp | -1.1% |
| 8044 | 49% | 22.3% | 27.7% | +5.4 pp | +9.9% |
| 11005 | 63% | 15.9% | 26.4% | +10.5 pp | +23% |
| 19946 | 80% | 5.5% | 23.0% | +17.4 pp | +44% |
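Past the smallest overflow (27% clip, where the two configurations are roughly level), the benefit grows with how far the prompt overflows the window and shows no sign of saturating up to the largest prompt tested (19946 tokens, 80% clip).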
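The clip column can be recomputed from the prompt lengths with the formula given above. This is a minimal sketch: the function name is hypothetical, and flooring at zero for prompts shorter than the window is an assumption inferred from the 37-token row.

```python
WINDOW = 4096  # draft sliding window from this README


def clip_fraction(prompt_tokens: int, window: int = WINDOW) -> float:
    """Fraction of the prompt lying outside the window.

    Assumption: floored at 0 when the prompt fits inside the window.
    """
    return max(0.0, (prompt_tokens - window) / prompt_tokens)


if __name__ == "__main__":
    # Prompt lengths taken from the validation table.
    for n in (37, 5613, 8044, 11005, 19946):
        print(f"{n:>6} tokens -> clip {clip_fraction(n):.1%}")
```

The recomputed values match the table to within a percentage point; the last row works out to roughly 79.5%, which the table lists as 80%.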
Conversion
Converted from z-lab/Qwen3.5-35B-A3B-DFlash with ik_llama's convert_hf_to_gguf.py DFlash draft converter (sliding-window support branch), then quantized to Q8_0. The per-layer SWA pattern is taken from the source layer_types. Conversion requires a --target-model-dir containing the target tokenizer merges.
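How the pattern might be derived from the source config, as a hedged sketch. It assumes layer_types is a list of strings using the Hugging Face convention "sliding_attention" / "full_attention", which is not confirmed for this checkpoint, and the function name is hypothetical rather than part of convert_hf_to_gguf.py.

```python
def swa_pattern_from_layer_types(layer_types: list[str]) -> list[bool]:
    """Map per-layer attention types to a boolean sliding-window pattern.

    Assumption: sliding layers are labelled "sliding_attention"; any other
    label (for example "full_attention") is treated as a global layer.
    """
    return [layer_type == "sliding_attention" for layer_type in layer_types]


# Illustrative input only, chosen to match the pattern stated in this README.
example_layer_types = ["sliding_attention"] * 5 + ["full_attention"]
pattern = swa_pattern_from_layer_types(example_layer_types)

if __name__ == "__main__":
    print(pattern)
```

For the illustrative input, the result is five sliding layers followed by one global layer, the same shape as the sliding_window_pattern listed earlier.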
Source: Hugging Face