GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

XpressAI/Qwen3.6-27B-RYS-GGUF overview

Qwen3.6 27B — RYS Layer Surgery GGUF A modified version of Qwen3.6 27B Instruct https://huggingface.co/Qwen/Qwen3.6 27B Instruct produced by RYS layer duplicat…

ggufqwen3.6ryslayer-surgeryreasoningbfclfunction-callingspeculative-decodingdflashenlicense:apache-2.0endpoints_compatibleregion:usconversational

Runs locally from ~1.72 GB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
309
Likes
8
Pipeline
Author

Repository Files & Downloads

3 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
Qwen3.6-27B-DFlash-Q8_0-rys.ggufGGUFQ8_01.72 GBDownload
Qwen3.6-27B-rys_33-36-Q8_0.ggufGGUFQ8_028.14 GBDownload
Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.ggufGGUFQ4_K_XL17.32 GBDownload

Model Details

Model IDXpressAI/Qwen3.6-27B-RYS-GGUF
AuthorXpressAI
Pipeline
Licenseapache-2.0
Base modelQwen/Qwen3.6-27B-Instruct
Last modified2026-06-18T23:45:09.000Z

Model README

---

license: apache-2.0

base_model: Qwen/Qwen3.6-27B-Instruct

tags:

- gguf

- qwen3.6

- rys

- layer-surgery

- reasoning

- bfcl

- function-calling

- speculative-decoding

- dflash

language:

- en

---

Qwen3.6-27B — RYS Layer Surgery (GGUF)

A modified version of Qwen3.6-27B-Instruct

produced by RYS layer duplication — no training, no weight changes, just

running layers 33–36 a second time during the forward pass.

Based on David Ng's RYS method.

---

TL;DR

On the Berkeley Function-Call Leaderboard (BFCL v4, 100 tests/category × 13

single-turn categories, sampled), this variant **beats the unmodified base

model by +1.96 pp on average** when run with thinking mode enabled — driven by

large gains on the hardest live categories:

| Category | Base | rys_33-36 | Δ |

|---|---|---|---|

| live_parallel | 68.75% | 87.50% | +18.75 |

| live_relevance | 68.75% | 81.25% | +12.50 |

| live_parallel_multiple | 70.83% | 75.00% | +4.17 |

| mean (13 categories) | 82.56% | 84.52% | +1.96 |

The wins come from improved reasoning during prefill on multi-call /

relevance-judgement queries. The trade is small regressions (−1 to −3 pp) on

easier non-live categories. Thinking mode is required — without it, this

variant slightly underperforms base.

---

Files

| File | Quant | Layers | Size | Role |

|------|-------|--------|------|------|

| Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf | Q4_K_XL | 68 | 18 GiB | target model |

| Qwen3.6-27B-rys_33-36-Q8_0.gguf | Q8_0 | 68 | 29 GiB | target model |

| Qwen3.6-27B-DFlash-Q8_0-rys.gguf | Q8_0 | 5 | 1.8 GiB | DFlash draft (speculative decoding) |

The base GGUF (no surgery) is at

unsloth/Qwen3.6-27B-GGUF.

---

Internal probe results

A small probe of math, EQ, and reasoning prompts was run during the layer

search. The probe categories are tiny (3 questions per reasoning subcategory,

~16 EQ-Bench-style items, ~16 math problems) so individual numbers should be

treated as directional, not definitive.

| Metric | Base | rys_33-36 |

|---|---|---|

| Math (GSM8K-style partial credit) | 0.537 | 0.500 |

| EQ (EQ-Bench-style, 0–100) | 93.59 | 86.64 |

| Reasoning total (17 probes, 5 subcategories) | 0.765 | 0.882 |

|   ↳ causal | 0.67 | 1.00 |

|   ↳ date | 1.00 | 1.00 |

|   ↳ logic | 1.00 | 1.00 |

|   ↳ navigation | 0.67 | 1.00 |

|   ↳ gsm | 0.60 | 0.60 |

Layers 33–36 was the only configuration in the layer-block sweep that

achieved a perfect score on the causal reasoning subcategory while keeping

the other reasoning categories at or above their baseline. This is what

motivated picking it for the BFCL run below.

---

BFCL results (sampled, thinking enabled)

| Category | Base | rys_33-36 |

|---|---|---|

| irrelevance | 90.00 | 88.00 |

| multiple | 96.00 | 95.00 |

| parallel | 93.00 | 91.00 |

| parallel_multiple | 87.00 | 85.00 |

| simple_java | 59.00 | 61.00 |

| simple_javascript | 74.00 | 72.00 |

| simple_python | 95.00 | 92.00 |

| live_irrelevance | 98.00 | 99.00 |

| live_multiple | 88.00 | 87.00 |

| live_parallel | 68.75 | 87.50 |

| live_parallel_multiple | 70.83 | 75.00 |

| live_relevance | 68.75 | 81.25 |

| live_simple | 85.00 | 85.00 |

| mean | 82.56 | 84.52 |

Sample size: 100 tests/category for categories with ≥100 entries; the full

category was used for the smaller ones (live_parallel, live_parallel_multiple,

live_relevance, simple_javascript). 1006 tests per model in total. The full

benchmark would be ~5x larger and would also cover multi-turn, memory, and

web-search categories that we did not run.

Inference: llama.cpp llama-server --jinja, BFCL via /v1/chat/completions

with native tool use, temperature=1.0, top_p=0.95, top_k=20, max_tokens=8192.

Multi-turn, memory, and web-search categories were not run.

---

What is RYS?

Transformers self-organise during training into functional circuits

contiguous blocks of layers that act together. RYS duplicates a specific block

in the forward pass using the same weights:

Normal:    0 → … → 32 → 33 → 34 → 35 → 36 → 37 → … → 63
rys_33-36: 0 → … → 32 → 33 → 34 → 35 → 36
                       → 33 → 34 → 35 → 36 → 37 → … → 63

The model processes layers 33–36 twice. No fine-tuning, no extra parameters

beyond the GGUF file overhead. Total layer count goes from 64 → 68.

---

How the layer range was found

A two-pass sweep across all 64 layers using a small probe of math, EQ, and

reasoning prompts:

  • Pass 1 (8-layer blocks, stride 4): identified hot zones around layers

32–48 (math gains, causal reasoning) and 48–60 (general reasoning gains).

  • Pass 2 (4-layer blocks, stride 1, layers 32–58): (33, 37) was the only

configuration that achieved a perfect score on the probe's causal

reasoning subcategory while keeping date, logic, and nav at their

baseline ceilings.

The probe alone suggested rys_33-36 was a moderate win. The **sampled BFCL

run with thinking enabled confirms it on the harder live categories** (above).

Extended evaluation (Ng's protocol)

After a thoughtful question on the discussion forum about deviations from

David Ng's suggested reproduction path,

we went back and ran the steps we had skipped:

Extended probemath_120 + eq_140 from Ng's repo, --reasoning off to

match the protocol's intent (the math probe is designed for intuitive

guessing, not deliberate computation):

| Variant | math_120 | eq_140 |

|---|---|---|

| base | 0.9986 | 74.53 |

| rys_33-36 | 0.9930 | 78.81 |

On the larger probe rys_33-36 holds its EQ improvement (+4.28 pp). Math is at

ceiling for both. Note this is the opposite direction from our small

internal probe (where rys_33-36 had lower EQ) — small-probe variance was

misleading us; the 140-question sample is the trustworthy reading.

Depth-2 beam search — 10 non-overlapping pair-combinations of the top

single-block configs, each scored on the same probe:

| Variant | math_120 | eq_140 |

|---|---|---|

| rys_33-36 | 0.9930 | 78.81 |

| rys_33-36 + 49-52 | 0.9226 | 75.66 |

| rys_33-36 + 53-56 | 0.9219 | 75.27 |

| rys_33-36 + 54-57 | 0.9639 | 72.21 |

| rys_33-36 + 56-59 | 0.9643 | 74.21 |

| rys_33-36 + 58-61 | 0.9930 | 68.78 |

| rys_49-52 + 53-56 | 0.8864 | 66.70 |

| rys_49-52 + 56-59 | 0.9654 | 69.67 |

| rys_49-52 + 58-61 | 0.9606 | 69.18 |

| rys_53-56 + 58-61 | 0.9635 | 63.57 |

| rys_54-57 + 58-61 | 0.9703 | 59.93 |

No depth-2 combination beats rys_33-36 on EQ_140. Stacking blocks degrades

math (sometimes catastrophically) without improving EQ. So the shortcut we

took in candidate selection (no beam search) did not cost us a better

configuration in this neighborhood. We did not train Ng's surrogate

regressor or run a deeper beam search — those would explore more of the

configuration space and might find something better.

---

Hybrid Mamba/attention architecture constraint

Qwen3.6-27B is a hybrid SSM/attention model (full_attention_interval = 4):

full attention every 4th layer, Gated DeltaNet SSM everywhere else. This

creates a hard constraint: the total layer count must remain divisible by 4.

  • Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)
  • Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗ → crash)

---

Usage

llama.cpp / llama-server

The wins require thinking mode. Use --jinja so the server applies the

Qwen3.6 chat template, which primes thinking properly:

llama-server -m Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf \
             --jinja \
             -ngl 99 -c 32768 \
             --port 8080

Sampling parameters (Qwen3.6 thinking-mode defaults)

temperature = 1.0
top_p       = 0.95
top_k       = 20
min_p       = 0.0

For more deterministic / coding-focused tasks, Qwen recommends

temperature=0.6 instead. Either way, leave thinking enabled.

Token budget

Qwen3.6's thinking chains can be long (we observed up to ~7k tokens of

reasoning on hard BFCL parallel cases). Set max_tokens ≥ 8192 to avoid

truncating mid-thought.

VRAM

About 22 GiB at Q4_K_XL with 32k context and Q8 KV cache. Fits comfortably on

a single A100 40 GB.

Speculative decoding with DFlash (faster inference)

RYS makes the model deeper (68 vs 64 layers), so each token costs a little more

compute. You can win that back — and then some — with **DFlash speculative

decoding**, a diffusion-style draft head that proposes a block of tokens per

step which the target model verifies in one pass.

Qwen3.6-27B-DFlash-Q8_0-rys.gguf is a DFlash draft **trained against the

68-layer RYS layout** — its hidden-state taps (`target_layer_ids = [1, 16, 31,

50, 65]`) index into the duplicated-block arrangement, so it must be paired with

the RYS target, not the stock base model.

DFlash support is not in upstream llama.cpp yet; use

BeeLlama, a llama.cpp fork that adds

the dflash spec type:

./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf \
  --spec-draft-model ~/models/Qwen3.6-27B-DFlash-Q8_0-rys.gguf \
  --spec-type copyspec,dflash \
  --spec-dflash-cross-ctx 1024 \
  --port 9999 -np 1 --kv-unified \
  -ngl all --spec-draft-ngl all \
  -b 2048 -ub 1024 --flash-attn on \
  --jinja --no-host --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 --top-k 20 --top-p 1.0 --min-p 0.0

The draft adds ~1.8 GiB of VRAM. Acceptance is highest on the long, structured

thinking chains this model produces, which is exactly where the RYS variant

spends its tokens.

---

When to use this

  • You want better function-calling performance on complex live queries

(parallel calls, relevance judgement) and you can afford ~6 extra layers of

prefill compute.

  • You're running with thinking mode on (this is where the gain comes from).

When NOT to use this

  • You're running without thinking — base will be ~1.5 pp better.
  • You care about the very-easy categories (simple_python, multiple) more

than the hard live ones — base is 1–3 pp better there.

---

Credits

License

Apache 2.0 (inherited from base model).

Run XpressAI/Qwen3.6-27B-RYS-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models