XpressAI/Qwen3.6-27B-RYS-GGUF overview
Qwen3.6 27B — RYS Layer Surgery GGUF A modified version of Qwen3.6 27B Instruct https://huggingface.co/Qwen/Qwen3.6 27B Instruct produced by RYS layer duplicat…
Runs locally from ~1.72 GB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
Model Details
Model README
---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B-Instruct
tags:
- gguf
- qwen3.6
- rys
- layer-surgery
- reasoning
- bfcl
- function-calling
- speculative-decoding
- dflash
language:
- en
---
Qwen3.6-27B — RYS Layer Surgery (GGUF)
A modified version of Qwen3.6-27B-Instruct
produced by RYS layer duplication — no training, no weight changes, just
running layers 33–36 a second time during the forward pass.
Based on David Ng's RYS method.
---
TL;DR
On the Berkeley Function-Call Leaderboard (BFCL v4, 100 tests/category × 13
single-turn categories, sampled), this variant **beats the unmodified base
model by +1.96 pp on average** when run with thinking mode enabled — driven by
large gains on the hardest live categories:
| Category | Base | rys_33-36 | Δ |
|---|---|---|---|
| live_parallel | 68.75% | 87.50% | +18.75 |
| live_relevance | 68.75% | 81.25% | +12.50 |
| live_parallel_multiple | 70.83% | 75.00% | +4.17 |
| mean (13 categories) | 82.56% | 84.52% | +1.96 |
The wins come from improved reasoning during prefill on multi-call /
relevance-judgement queries. The trade is small regressions (−1 to −3 pp) on
easier non-live categories. Thinking mode is required — without it, this
variant slightly underperforms base.
---
Files
| File | Quant | Layers | Size | Role |
|------|-------|--------|------|------|
| Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf | Q4_K_XL | 68 | 18 GiB | target model |
| Qwen3.6-27B-rys_33-36-Q8_0.gguf | Q8_0 | 68 | 29 GiB | target model |
| Qwen3.6-27B-DFlash-Q8_0-rys.gguf | Q8_0 | 5 | 1.8 GiB | DFlash draft (speculative decoding) |
The base GGUF (no surgery) is at
---
Internal probe results
A small probe of math, EQ, and reasoning prompts was run during the layer
search. The probe categories are tiny (3 questions per reasoning subcategory,
~16 EQ-Bench-style items, ~16 math problems) so individual numbers should be
treated as directional, not definitive.
| Metric | Base | rys_33-36 |
|---|---|---|
| Math (GSM8K-style partial credit) | 0.537 | 0.500 |
| EQ (EQ-Bench-style, 0–100) | 93.59 | 86.64 |
| Reasoning total (17 probes, 5 subcategories) | 0.765 | 0.882 |
| ↳ causal | 0.67 | 1.00 |
| ↳ date | 1.00 | 1.00 |
| ↳ logic | 1.00 | 1.00 |
| ↳ navigation | 0.67 | 1.00 |
| ↳ gsm | 0.60 | 0.60 |
Layers 33–36 was the only configuration in the layer-block sweep that
achieved a perfect score on the causal reasoning subcategory while keeping
the other reasoning categories at or above their baseline. This is what
motivated picking it for the BFCL run below.
---
BFCL results (sampled, thinking enabled)
| Category | Base | rys_33-36 |
|---|---|---|
| irrelevance | 90.00 | 88.00 |
| multiple | 96.00 | 95.00 |
| parallel | 93.00 | 91.00 |
| parallel_multiple | 87.00 | 85.00 |
| simple_java | 59.00 | 61.00 |
| simple_javascript | 74.00 | 72.00 |
| simple_python | 95.00 | 92.00 |
| live_irrelevance | 98.00 | 99.00 |
| live_multiple | 88.00 | 87.00 |
| live_parallel | 68.75 | 87.50 |
| live_parallel_multiple | 70.83 | 75.00 |
| live_relevance | 68.75 | 81.25 |
| live_simple | 85.00 | 85.00 |
| mean | 82.56 | 84.52 |
Sample size: 100 tests/category for categories with ≥100 entries; the full
category was used for the smaller ones (live_parallel, live_parallel_multiple,
live_relevance, simple_javascript). 1006 tests per model in total. The full
benchmark would be ~5x larger and would also cover multi-turn, memory, and
web-search categories that we did not run.
Inference: llama.cpp llama-server --jinja, BFCL via /v1/chat/completions
with native tool use, temperature=1.0, top_p=0.95, top_k=20, max_tokens=8192.
Multi-turn, memory, and web-search categories were not run.
---
What is RYS?
Transformers self-organise during training into functional circuits —
contiguous blocks of layers that act together. RYS duplicates a specific block
in the forward pass using the same weights:
Normal: 0 → … → 32 → 33 → 34 → 35 → 36 → 37 → … → 63
rys_33-36: 0 → … → 32 → 33 → 34 → 35 → 36
→ 33 → 34 → 35 → 36 → 37 → … → 63
The model processes layers 33–36 twice. No fine-tuning, no extra parameters
beyond the GGUF file overhead. Total layer count goes from 64 → 68.
---
How the layer range was found
A two-pass sweep across all 64 layers using a small probe of math, EQ, and
reasoning prompts:
- Pass 1 (8-layer blocks, stride 4): identified hot zones around layers
32–48 (math gains, causal reasoning) and 48–60 (general reasoning gains).
- Pass 2 (4-layer blocks, stride 1, layers 32–58):
(33, 37)was the only
configuration that achieved a perfect score on the probe's causal
reasoning subcategory while keeping date, logic, and nav at their
baseline ceilings.
The probe alone suggested rys_33-36 was a moderate win. The **sampled BFCL
run with thinking enabled confirms it on the harder live categories** (above).
Extended evaluation (Ng's protocol)
After a thoughtful question on the discussion forum about deviations from
David Ng's suggested reproduction path,
we went back and ran the steps we had skipped:
Extended probe — math_120 + eq_140 from Ng's repo, --reasoning off to
match the protocol's intent (the math probe is designed for intuitive
guessing, not deliberate computation):
| Variant | math_120 | eq_140 |
|---|---|---|
| base | 0.9986 | 74.53 |
| rys_33-36 | 0.9930 | 78.81 |
On the larger probe rys_33-36 holds its EQ improvement (+4.28 pp). Math is at
ceiling for both. Note this is the opposite direction from our small
internal probe (where rys_33-36 had lower EQ) — small-probe variance was
misleading us; the 140-question sample is the trustworthy reading.
Depth-2 beam search — 10 non-overlapping pair-combinations of the top
single-block configs, each scored on the same probe:
| Variant | math_120 | eq_140 |
|---|---|---|
| rys_33-36 | 0.9930 | 78.81 |
| rys_33-36 + 49-52 | 0.9226 | 75.66 |
| rys_33-36 + 53-56 | 0.9219 | 75.27 |
| rys_33-36 + 54-57 | 0.9639 | 72.21 |
| rys_33-36 + 56-59 | 0.9643 | 74.21 |
| rys_33-36 + 58-61 | 0.9930 | 68.78 |
| rys_49-52 + 53-56 | 0.8864 | 66.70 |
| rys_49-52 + 56-59 | 0.9654 | 69.67 |
| rys_49-52 + 58-61 | 0.9606 | 69.18 |
| rys_53-56 + 58-61 | 0.9635 | 63.57 |
| rys_54-57 + 58-61 | 0.9703 | 59.93 |
No depth-2 combination beats rys_33-36 on EQ_140. Stacking blocks degrades
math (sometimes catastrophically) without improving EQ. So the shortcut we
took in candidate selection (no beam search) did not cost us a better
configuration in this neighborhood. We did not train Ng's surrogate
regressor or run a deeper beam search — those would explore more of the
configuration space and might find something better.
---
Hybrid Mamba/attention architecture constraint
Qwen3.6-27B is a hybrid SSM/attention model (full_attention_interval = 4):
full attention every 4th layer, Gated DeltaNet SSM everywhere else. This
creates a hard constraint: the total layer count must remain divisible by 4.
- Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)
- Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗ → crash)
---
Usage
llama.cpp / llama-server
The wins require thinking mode. Use --jinja so the server applies the
Qwen3.6 chat template, which primes thinking properly:
llama-server -m Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf \
--jinja \
-ngl 99 -c 32768 \
--port 8080
Sampling parameters (Qwen3.6 thinking-mode defaults)
temperature = 1.0
top_p = 0.95
top_k = 20
min_p = 0.0
For more deterministic / coding-focused tasks, Qwen recommends
temperature=0.6 instead. Either way, leave thinking enabled.
Token budget
Qwen3.6's thinking chains can be long (we observed up to ~7k tokens of
reasoning on hard BFCL parallel cases). Set max_tokens ≥ 8192 to avoid
truncating mid-thought.
VRAM
About 22 GiB at Q4_K_XL with 32k context and Q8 KV cache. Fits comfortably on
a single A100 40 GB.
Speculative decoding with DFlash (faster inference)
RYS makes the model deeper (68 vs 64 layers), so each token costs a little more
compute. You can win that back — and then some — with **DFlash speculative
decoding**, a diffusion-style draft head that proposes a block of tokens per
step which the target model verifies in one pass.
Qwen3.6-27B-DFlash-Q8_0-rys.gguf is a DFlash draft **trained against the
68-layer RYS layout** — its hidden-state taps (`target_layer_ids = [1, 16, 31,
50, 65]`) index into the duplicated-block arrangement, so it must be paired with
the RYS target, not the stock base model.
DFlash support is not in upstream llama.cpp yet; use
BeeLlama, a llama.cpp fork that adds
the dflash spec type:
./build/bin/llama-server \
-m ~/models/Qwen3.6-27B-rys_33-36-UD-Q4_K_XL.gguf \
--spec-draft-model ~/models/Qwen3.6-27B-DFlash-Q8_0-rys.gguf \
--spec-type copyspec,dflash \
--spec-dflash-cross-ctx 1024 \
--port 9999 -np 1 --kv-unified \
-ngl all --spec-draft-ngl all \
-b 2048 -ub 1024 --flash-attn on \
--jinja --no-host --reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.6 --top-k 20 --top-p 1.0 --min-p 0.0
The draft adds ~1.8 GiB of VRAM. Acceptance is highest on the long, structured
thinking chains this model produces, which is exactly where the RYS variant
spends its tokens.
---
When to use this
- You want better function-calling performance on complex live queries
(parallel calls, relevance judgement) and you can afford ~6 extra layers of
prefill compute.
- You're running with thinking mode on (this is where the gain comes from).
When NOT to use this
- You're running without thinking — base will be ~1.5 pp better.
- You care about the very-easy categories (
simple_python,multiple) more
than the hard live ones — base is 1–3 pp better there.
---
Credits
- David Ng for the original RYS method
- Unsloth for the base
UD-Q4_K_XLquantization - Qwen team for Qwen3.6-27B
- llama.cpp for local inference
- The Berkeley Function-Call Leaderboard for the eval harness
License
Apache 2.0 (inherited from base model).
Run XpressAI/Qwen3.6-27B-RYS-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models