GraySoft
Projects Models About FAQ Contact Download guIDE →
Model Intelligence Sheet

xpressai/qwen3.5-27b-rys-ud-q4_k_xl-gguf overview

Two modified versions of Qwen3.5-27B produced by RYS layer duplication — no training, no weight changes, just routing hidden states through a specific circuit twice. Based on David Ng's RYS method. ---

ggufqwen3.5ryslayer-surgeryreasoningmambahybridenbase_model:Qwen/Qwen3.5-27Bbase_model:quantized:Qwen/Qwen3.5-27Blicense:apache-2.0endpoints_compatibleregion:usimatrixconversational
xpressai/qwen3.5-27b-rys-ud-q4_k_xl-gguf visual
Downloads
5,337
Likes
2
Pipeline
Library
Visibility
Public
Access
Open

Repository Files & Downloads

2 files detected
Direct downloads for all repository files
FileTypeQuantizationSizeLink
Qwen3.5-27B-rys_30-33-UD-Q4_K_XL.gguf GGUF Q4_K_XL 17.31 GB Download
Qwen3.5-27B-rys_34-37_eq-UD-Q4_K_XL.gguf GGUF Q4_K_XL 17.33 GB Download

Model Details Live

Model Slug
xpressai/qwen3.5-27b-rys-ud-q4_k_xl-gguf
Author
XpressAI
Pipeline Task
Library
Created
2026-03-26
Last Modified
2026-04-16
Gated
No
Private
No
HF SHA
d7e7c78e0a88277f4aa5b77341fa6b29284f12b8
License
apache-2.0
Language
en
Base Model
Qwen/Qwen3.5-27B

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "license": "apache-2.0",
    "base_model": "Qwen/Qwen3.5-27B",
    "tags": [
      "gguf",
      "qwen3.5",
      "rys",
      "layer-surgery",
      "reasoning",
      "mamba",
      "hybrid"
    ],
    "language": [
      "en"
    ],
    "frontmatter": {
      "license": "apache-2.0",
      "base_model": "Qwen/Qwen3.5-27B",
      "tags": [
        "gguf",
        "qwen3.5",
        "rys",
        "layer-surgery",
        "reasoning",
        "mamba",
        "hybrid"
      ],
      "language": [
        "en"
      ]
    },
    "hero_image_url": "",
    "summary": "Two modified versions of Qwen3.5-27B produced by **RYS layer duplication** — no training, no weight changes, just routing hidden states through a specific circuit twice. Based on David Ng's RYS method. ---",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: apache-2.0\nbase_model: Qwen/Qwen3.5-27B\ntags:\n  - gguf\n  - qwen3.5\n  - rys\n  - layer-surgery\n  - reasoning\n  - mamba\n  - hybrid\nlanguage:\n  - en\n---\n\n# Qwen3.5-27B — RYS Layer Surgery (GGUF)\n\nTwo modified versions of [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) produced by\n**RYS layer duplication** — no training, no weight changes, just routing hidden states through a specific circuit twice.\n\nBased on [David Ng's RYS method](https://dnhkng.github.io/posts/rys/).\n\n---\n\n## Files\n\n| File | Layers | Size |\n|------|--------|------|\n| `Qwen3.5-27B-UD-Q4_K_XL.gguf` | 64 | 17 GiB |\n| `Qwen3.5-27B-rys_30-33-UD-Q4_K_XL.gguf` | 68 | 21 GiB |\n| `Qwen3.5-27B-rys_34-37_eq-UD-Q4_K_XL.gguf` | 68 | 21 GiB |\n\n---\n\n## Probe scores\n\nScores from an internal sweep benchmark run during circuit search. Sample sizes are small — treat these as directional indicators, not definitive benchmarks.\n\n| Model | Math | EQ | Reasoning | Logic |\n|-------|------|----|-----------|-------|\n| Base (64 layers) | 0.375 | 11.5 | 0.000 | 0.00 |\n| rys_30-33 (68 layers) | 0.438 | 29.5 | **0.353** | **1.00** |\n| rys_34-37 (68 layers) | 0.375 | **39.4** | 0.000 | 0.00 |\n\n- **Math**: Ng's partial-credit scoring on a small GSM8K sample\n- **EQ**: EQ-Bench-style emotional intelligence score (0–100)\n- **Reasoning**: fraction correct across causal, date, logic, navigation, and GSM8K probes\n- **Logic**: fraction correct on logical deduction probes only\n\nrys_30-33 shows the best combined improvement across reasoning categories. rys_34-37 shows the highest EQ score but no reasoning improvement over baseline.\n\n---\n\n# Benchmarks (based on BFCLv4)\n\n## Non-Live Tests\n\n| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |\n|------|-----------------------------------|--------------------------|---------------------|-----------------------|--------------|-------------|--------------|\n| irrelevance | 86.67% (-1.25%) | 87.50% | 85.83% | **87.92%** | 85.42% | 77.50% | 80.00% |\n| multiple | **96.50%** | 96.50% | 95.50% | 95.50% | 95.00% | 92.50% | 88.00% |\n| parallel | **95.00%** | 93.00% | 93.50% | 94.50% | 91.50% | 88.50% | 89.00% |\n| parallel_multiple | 91.50% (-0.50%) | 76.00% | 88.50% | **92.00%** | 89.50% | 87.00% | 77.50% |\n| simple_java | 62.00% (-3.00%) | **65.00%** | 60.00% | 62.00% | 64.00% | 62.00% | 62.00% |\n| simple_javascript | 72.00% (-2.00%) | 66.00% | **74.00%** | 58.00% | 64.00% | 66.00% | 64.00% |\n| simple_python | 95.25% (-2.50%) | 95.00% | 96.50% | **97.75%** | 94.75% | 92.50% | 92.75% |\n\n## Live Tests\n\n| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |\n|------|-----------------------------------|--------------------------|---------------------|-----------------------|--------------|-------------|--------------|\n| live_irrelevance | 82.24% (-3.05%) | 80.88% | 83.60% | **85.29%** | 84.50% | 73.30% | 78.85% |\n| live_multiple | 79.68% (-1.14%) | **80.82%** | 78.16% | 78.92% | 78.92% | 73.88% | 70.37% |\n| live_parallel | 81.25% (-6.25%) | **87.50%** | **87.50%** | **87.50%** | 81.25% | 75.00% | 68.75% |\n| live_parallel_multiple | 75.00% (-8.33%) | 79.17% | 75.00% | **83.33%** | 75.00% | 79.17% | 58.33% |\n| live_relevance | 81.25% (-6.25%) | 68.75% | 62.50% | 68.75% | 75.00% | **87.50%** | 75.00% |\n| live_simple | 84.50% (-5.03%) | 87.60% | 86.43% | **89.53%** | **89.53%** | 82.17% | 71.71% |\n\n## Multi-Turn Tests\n\n| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |\n|------|-----------------------------------|--------------------------|---------------------|-----------------------|--------------|-------------|--------------|\n| multi_turn_base | 74.50% (-6.50%) | 70.50% | **81.00%** | 69.00% | 74.50% | 44.00% | 36.50% |\n| multi_turn_long_context | 67.50% (-3.00%) | 59.00% | **70.50%** | 59.00% | 66.50% | 44.00% | 30.50% |\n\n## Memory Tests (Agentic)\n\n| Task | Qwen3.5-27B-RYS-30-34 (Δ vs Best) | Qwen3.5-27B-FC (Baseline) | Claude Opus 4.5 (FC) | Claude Sonnet 4.5 (FC) | GLM 4.6 (FC) | Grok-4 (FC) | GPT-5.2 (FC) |\n|------|-----------------------------------|--------------------------|---------------------|-----------------------|--------------|-------------|--------------|\n| memory_kv | 45.81% (-25.16%) | N/A | **70.97%** | 54.19% | 43.87% | 57.42% | 33.55% |\n| memory_rec_sum | 70.97% (-12.26%) | N/A | 77.42% | **83.23%** | 67.10% | 51.61% | 60.65% |\n| memory_vector | 63.23% (-9.67%) | N/A | **72.90%** | 57.42% | 56.13% | 58.71% | 43.23% |\n\n## RYS vs Baseline Comparison (All Tests)\n\n| Task | RYS | Baseline | Δ (RYS - Baseline) |\n|------|-----|----------|-------------------|\n| irrelevance | 86.67% | 87.50% | **-0.83%** |\n| multiple | 96.50% | 96.50% | **0.00%** |\n| parallel | 95.00% | 93.00% | **+2.00%** ✅ |\n| parallel_multiple | 91.50% | 76.00% | **+15.50%** ✅ |\n| simple_java | 62.00% | 65.00% | **-3.00%** |\n| simple_javascript | 72.00% | 66.00% | **+6.00%** ✅ |\n| simple_python | 95.25% | 95.00% | **+0.25%** |\n| live_irrelevance | 82.24% | 80.88% | **+1.36%** ✅ |\n| live_multiple | 79.68% | 80.82% | **-1.14%** |\n| live_parallel | 81.25% | 87.50% | **-6.25%** |\n| live_parallel_multiple | 75.00% | 79.17% | **-4.17%** |\n| live_relevance | 81.25% | 68.75% | **+12.50%** ✅ |\n| live_simple | 84.50% | 87.60% | **-3.10%** |\n| multi_turn_base | 74.50% | 70.50% | **+4.00%** ✅ |\n| multi_turn_long_context | 67.50% | 59.00% | **+8.50%** ✅ |\n| memory_kv | 45.81% | N/A | ✅ |\n| memory_rec_sum | 70.97% | N/A | ✅ |\n| memory_vector | 63.23% | N/A |  ✅ |\n\n\n---\n\n## What is RYS?\n\nTransformers self-organise during training into functional **circuits** — contiguous blocks of layers that act together. The RYS technique duplicates a specific block in the forward pass using the same weights, with no extra copies on disk beyond the GGUF file overhead:\n\n```\nNormal:     0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 34 → … → 63\nrys_30-33:  0 → 1 → … → 29 → 30 → 31 → 32 → 33 → 30 → 31 → 32 → 33 → 34 → … → 63\n```\n\nThe model processes the same circuit twice, without any weight changes or fine-tuning.\n\n---\n\n## Hybrid Mamba/attention architecture constraint\n\nQwen3.5-27B is a **hybrid SSM/attention model** (`full_attention_interval = 4`): full attention every 4th layer, Mamba SSM everywhere else.\n\nThis creates a hard constraint on layer surgery: **the total layer count must remain divisible by 4**.\n\n- Block size 4 → 64 + 4 = 68 layers (68 ÷ 4 = 17 ✓)\n- Block size 3 → 64 + 3 = 67 layers (67 ÷ 4 = 16.75 ✗ → server crash at load)\n- Block size 8 → 64 + 8 = 72 layers (72 ÷ 4 = 18 ✓)\n\nOnly multiples of 4 work as block sizes for this model family.\n\n---\n\n## How the circuit was found\n\nA two-pass sweep over the 64-layer model using a probe benchmark:\n\n**Pass 1** — 8-layer blocks, stride 4, layers 4–60:\n- Identified hot zones at layers 8–16 (reasoning) and 28–40 (EQ/math)\n\n**Pass 2** — 4-layer blocks, stride 1, within each hot zone:\n- `(30, 34)` achieved the best combined score: reasoning=0.353, EQ=29.5, logic=1.0\n- `(34, 38)` achieved the highest EQ score: EQ=39.4\n\nEach configuration was tested by patching the GGUF layer path, loading with llama-server, and scoring with the probe suite.\n\n---\n\n## Usage\n\n### llama.cpp / llama-server\n\n```bash\nllama-server -m Qwen3.5-27B-rys_30-33.gguf -ngl 99 --port 8080\n```\n\n### Thinking mode\n\nQwen3.5 defaults to thinking mode (`<think>…</think>`). Add `/no_think` to the system prompt for fast, direct answers:\n\n```python\nmessages = [\n    {\"role\": \"system\", \"content\": \"/no_think\"},\n    {\"role\": \"user\",   \"content\": \"Your question here\"}\n]\n```\n\n### VRAM requirements\n\nThe model weights alone are ~21 GiB (Q4_K_XL quantization, 68 layers). A single A100 80GB or H100 runs this comfortably. Consumer GPU setups depend on your llama.cpp version's tensor split support.\n\n---\n\n## Credits\n\n- [David Ng](https://dnhkng.github.io/posts/rys/) for the original RYS method\n- [Unsloth](https://huggingface.co/unsloth) for the base `Q4_K_XL` GGUF quantization\n- [Qwen team](https://huggingface.co/Qwen) for Qwen3.5-27B\n- [llama.cpp](https://github.com/ggml-org/llama.cpp) for local inference\n\n## License\n\nApache 2.0 (inherited from base model)\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "qwen3.5",
    "rys",
    "layer-surgery",
    "reasoning",
    "mamba",
    "hybrid",
    "en",
    "base_model:Qwen/Qwen3.5-27B",
    "base_model:quantized:Qwen/Qwen3.5-27B",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "imatrix",
    "conversational"
  ],
  "likes": 2,
  "downloads": 5337,
  "gated": false,
  "private": false,
  "last_modified": "2026-04-16T00:37:24.000Z",
  "created_at": "2026-03-26T11:55:20.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
Source payload excerpt (from Hugging Face API)
{
  "_id": "69c51ea85ee2ee57e0359762",
  "id": "XpressAI/Qwen3.5-27B-RYS-UD-Q4_K_XL-GGUF",
  "modelId": "XpressAI/Qwen3.5-27B-RYS-UD-Q4_K_XL-GGUF",
  "sha": "d7e7c78e0a88277f4aa5b77341fa6b29284f12b8",
  "createdAt": "2026-03-26T11:55:20.000Z",
  "lastModified": "2026-04-16T00:37:24.000Z",
  "author": "XpressAI",
  "downloads": 5337,
  "likes": 2,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 4
}