
ji-farthing/Qwen3.5-35B-A3B-DFlash-SWA-ik-llama-GGUF overview


Tags: gguf, qwen3.5, dflash, speculative-decoding, sliding-window-attention, ik_llama, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, conversational

Runs locally from ~401.6 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads: 0
Likes: 0

Repository Files & Downloads

1 GGUF file detected
Direct downloads for local inference
| File | Type | Quantization | Size |
| --- | --- | --- | --- |
| Qwen3.5-35B-A3B-DFlash-SWA-ik_llama-Q8_0.gguf | GGUF | Q8_0 | 401.6 MB |

Model Details

Model ID: ji-farthing/Qwen3.5-35B-A3B-DFlash-SWA-ik-llama-GGUF
Author: ji-farthing
License: apache-2.0
Base model: z-lab/Qwen3.5-35B-A3B-DFlash, Qwen/Qwen3.5-35B-A3B
Last modified: 2026-06-24T13:33:51.000Z

Model README

---
license: apache-2.0
base_model:
- z-lab/Qwen3.5-35B-A3B-DFlash
- Qwen/Qwen3.5-35B-A3B
tags:
- gguf
- qwen3.5
- dflash
- speculative-decoding
- sliding-window-attention
- ik_llama
---

Qwen3.5-35B-A3B DFlash SWA draft for ik_llama

This repo contains an ik_llama-compatible DFlash draft GGUF converted from z-lab/Qwen3.5-35B-A3B-DFlash, carrying the per-layer sliding-window attention (SWA) pattern.

This is not a standalone chat model. Use it as a --model-draft file next to a matching Qwen3.5-35B-A3B target GGUF, with DFlash speculative decoding.

Sliding-window attention

The draft is sliding-window on every layer except a final full-attention (global) layer: sliding_window_pattern = [true, true, true, true, true, false], sliding_window = 4096.
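To make the pattern concrete, here is a minimal Python sketch of which key positions a causal query can see in each draft layer. It only illustrates the stated `sliding_window_pattern` and `sliding_window` values; the function name `visible_key_range` is hypothetical, and counting the query token itself as part of the window is an assumption (implementations differ on this off-by-one).

```python
SLIDING_WINDOW = 4096
# True = sliding-window layer, False = full-attention (global) layer.
SLIDING_WINDOW_PATTERN = [True, True, True, True, True, False]


def visible_key_range(layer: int, query_pos: int) -> range:
    """Key positions a causal query at `query_pos` may attend to in `layer`.

    Assumption: the window includes the query token itself, so a sliding
    layer sees at most SLIDING_WINDOW positions.
    """
    if SLIDING_WINDOW_PATTERN[layer]:
        start = max(0, query_pos - SLIDING_WINDOW + 1)
    else:
        start = 0  # global layer: the whole causal prefix is visible
    return range(start, query_pos + 1)


if __name__ == "__main__":
    for layer in range(len(SLIDING_WINDOW_PATTERN)):
        keys = visible_key_range(layer, query_pos=10_000)
        print(f"layer {layer}: keys {keys.start}..{keys.stop - 1} ({len(keys)} positions)")
```

Under these assumptions the five sliding layers only diverge from the global layer once the query position passes the window, which is why short prompts behave the same either way.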

Files

| File | Quant | Draft window |
| --- | --- | --- |
| Qwen3.5-35B-A3B-DFlash-SWA-ik_llama-Q8_0.gguf | Q8_0 | 4096 (5 sliding + 1 global) |

Use

llama-server \
  -m /path/to/Qwen3.5-35B-A3B-<quant>.gguf \
  --model-draft /path/to/Qwen3.5-35B-A3B-DFlash-SWA-ik_llama-Q8_0.gguf \
  --spec-type dflash:n_max=4,cross_ctx=8192 \
  -c 8192

SWA only engages once the DFlash cross-context exceeds the 4096 window, so set cross_ctx above the window for long-context prompts (the default 512 does not grow with -c).
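As a rough illustration of that rule, the Python sketch below assembles the same `llama-server` invocation and reports whether the chosen `cross_ctx` is above the 4096 draft window. The helper name `build_server_command` is hypothetical, and setting `cross_ctx` equal to the `-c` value is an assumption that simply mirrors the example command above; only flags that appear in that command are used.

```python
import shlex

DRAFT_WINDOW = 4096  # sliding_window of this draft, per the card


def build_server_command(target_gguf, draft_gguf, ctx_size, n_max=4):
    """Return (argv, swa_engages) for a DFlash speculative-decoding server.

    Assumption: cross_ctx is set to ctx_size, as in the card's example
    (cross_ctx=8192 with -c 8192). Other choices may be valid.
    """
    cross_ctx = ctx_size
    # Per the card, SWA only engages once cross_ctx exceeds the window.
    swa_engages = cross_ctx > DRAFT_WINDOW
    argv = [
        "llama-server",
        "-m", target_gguf,
        "--model-draft", draft_gguf,
        "--spec-type", f"dflash:n_max={n_max},cross_ctx={cross_ctx}",
        "-c", str(ctx_size),
    ]
    return argv, swa_engages


if __name__ == "__main__":
    argv, engaged = build_server_command(
        "/path/to/target.gguf",  # placeholder path
        "/path/to/Qwen3.5-35B-A3B-DFlash-SWA-ik_llama-Q8_0.gguf",
        ctx_size=8192,
    )
    print(shlex.join(argv))
    print("cross_ctx above draft window:", engaged)
```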

Validation (RTX 4070, ik_llama DFlash SWA branch)

Draft acceptance and throughput versus the same draft run with full attention, as the prompt overflows the 4096 window, where clip = (prompt - 4096) / prompt:

| prompt tok | clip | accept, full-attn | accept, SWA | acceptance gain | tok/s change |
| --- | --- | --- | --- | --- | --- |
| 37 | 0% | 36.1% | 39.2% | +3.0 pp | +6.8% |
| 5613 | 27% | 28.4% | 28.6% | +0.2 pp | -1.1% |
| 8044 | 49% | 22.3% | 27.7% | +5.4 pp | +9.9% |
| 11005 | 63% | 15.9% | 26.4% | +10.5 pp | +23% |
| 19946 | 80% | 5.5% | 23.0% | +17.4 pp | +44% |

The benefit grows with how far the prompt overflows the window and shows no sign of saturating over the tested range (up to 19,946 prompt tokens).
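The clip column can be recomputed from the prompt lengths with a few lines of Python. This is a sketch of the stated formula only; clamping to zero for prompts shorter than the window is an assumption, chosen because the 37-token row reports 0%. The results should land within about a point of the table's rounded values.

```python
WINDOW = 4096  # draft sliding window, per the card


def clip_fraction(prompt_tokens: int, window: int = WINDOW) -> float:
    """Fraction of the prompt that falls outside the sliding window.

    Implements clip = (prompt - window) / prompt, clamped at 0 for
    prompts that fit inside the window (an assumption, see lead-in).
    """
    return max(0, prompt_tokens - window) / prompt_tokens


if __name__ == "__main__":
    for n in (37, 5613, 8044, 11005, 19946):
        print(f"{n:>6} tokens -> clip {clip_fraction(n):.1%}")
```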

Conversion

Converted from z-lab/Qwen3.5-35B-A3B-DFlash with ik_llama's convert_hf_to_gguf.py DFlash draft converter (sliding-window support branch), then quantized to Q8_0. The per-layer SWA pattern is taken from the source layer_types. Conversion requires a --target-model-dir containing the target tokenizer merges.
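The mapping from `layer_types` to the boolean SWA pattern can be pictured with the Python sketch below. It is not the converter's actual code: the function name is hypothetical, and the string values `"sliding_attention"` and `"full_attention"` are an assumption based on the naming commonly used in Hugging Face model configs, not something verified against this checkpoint.

```python
# Assumed layer_types vocabulary (common Hugging Face config convention).
SLIDING = "sliding_attention"
FULL = "full_attention"


def swa_pattern_from_layer_types(layer_types):
    """Map per-layer type strings to booleans (True = sliding-window layer)."""
    pattern = []
    for index, layer_type in enumerate(layer_types):
        if layer_type not in (SLIDING, FULL):
            raise ValueError(f"layer {index}: unexpected layer type {layer_type!r}")
        pattern.append(layer_type == SLIDING)
    return pattern


if __name__ == "__main__":
    # Hypothetical config excerpt matching "5 sliding + 1 global".
    example_layer_types = [SLIDING] * 5 + [FULL]
    print(swa_pattern_from_layer_types(example_layer_types))
```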


Source: Hugging Face