GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

deucebucket/Qwen3.6-27B-Cerebellum-GGUF overview

<p align="center" <img src="cerebellum banner.png" alt="Cerebellum" width="640" </p Qwen 3.6 27B — Cerebellum v4 GGUF 12 GB Ablation informed mixed precision q…

ggufquantizedcerebellumqwen3.6ablation-informedtext-generationarxiv:2306.00978base_model:Qwen/Qwen3.6-27Bbase_model:quantized:Qwen/Qwen3.6-27Blicense:apache-2.0model-indexendpoints_compatibleregion:usimatrixconversational

Runs locally from ~11.98 GB disk (12 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
314
Likes
10
Pipeline
text-generation

Repository Files & Downloads

1 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.ggufGGUFQ2_K_MIXED11.98 GBDownload

Model Details

Model IDdeucebucket/Qwen3.6-27B-Cerebellum-GGUF
Authordeucebucket
Pipelinetext-generation
Licenseapache-2.0
Base modelQwen/Qwen3.6-27B
Last modified2026-06-15T04:20:57.000Z

Model README

---

license: apache-2.0

tags:

  • gguf
  • quantized
  • cerebellum
  • qwen3.6
  • ablation-informed

base_model: Qwen/Qwen3.6-27B

model_type: qwen3

quantized_by: deucebucket

pipeline_tag: text-generation

model-index:

  • name: Qwen3.6-27B-Cerebellum-GGUF

results:

- task:

name: Text Generation

type: text-generation

dataset:

name: AI2 Reasoning Challenge

type: ai2_arc

config: ARC-Challenge

split: test

metrics:

- name: normalized accuracy

type: acc_norm

value: 0.968

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

- task:

name: Text Generation

type: text-generation

dataset:

name: HellaSwag

type: hellaswag

split: validation

metrics:

- name: accuracy

type: acc

value: 0.922

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

- task:

name: Text Generation

type: text-generation

dataset:

name: MMLU-Redux

type: cais/mmlu

config: all

split: test

metrics:

- name: accuracy

type: acc

value: 0.766

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

- task:

name: Text Generation

type: text-generation

dataset:

name: HumanEval (pass@1)

type: openai_humaneval

split: test

metrics:

- name: pass@1

type: pass@1

value: 0.927

source:

name: EvalPlus chat-mode (llama-server) — samples + eval JSON in benchmark_results/

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmark_results

- task:

name: Text Generation

type: text-generation

dataset:

name: HumanEval+ (pass@1)

type: evalplus/humanevalplus

split: test

metrics:

- name: pass@1

type: pass@1

value: 0.890

source:

name: EvalPlus chat-mode (llama-server) — samples + eval JSON in benchmark_results/

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmark_results

- task:

name: Text Generation

type: text-generation

dataset:

name: WikiText-2 Perplexity

type: wikitext

config: wikitext-2-raw-v1

split: test

metrics:

- name: perplexity

type: perplexity

value: 7.034

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

---

<p align="center">

<img src="cerebellum_banner.png" alt="Cerebellum" width="640">

</p>

Qwen 3.6 27B — Cerebellum v4 GGUF (12 GB)

Ablation-informed mixed-precision quantization of Qwen 3.6 27B. 12 GB file size, 7.034 perplexity, 181 per-tensor quant overrides.

Benchmarks

Measured directly on this GGUF with the local llama.cpp benchmark harness on RTX 3090 at temperature 0. The model-index metadata in this card's frontmatter mirrors the v4 numbers below; MMLU-Redux is used for the MMLU entry there.

| Benchmark | Score | Questions |

|-----------|-------|-----------|

| Perplexity (WikiText-2, 2048 ctx) | 7.034 | — |

| HumanEval pass@1 | 92.7% | 164 |

| HumanEval+ pass@1 | 89.0% | 164 |

| ARC-Challenge | 96.8% | 1,172 |

| HellaSwag | 92.2% | 10,042 |

| MMLU | 82.5% | 11,643 |

| MMLU-Redux | 76.6% | 2,400 |

Recommended sampling: temperature=0. Tested across the full benchmark suite, temp=0 scored highest on all benchmarks.

2026-05-03 Score Corrections: Found and fixed bugs in the benchmark scripts. ARC had 19 questions misjudged due to numeric label handling. HellaSwag had 108 empty responses incorrectly counted as wrong. Full audit trail and per-question results in the Cerebellum repo.

2026-06-14 HumanEval correction (81.1% → 92.7% / HumanEval+ 89.0%): The earlier HumanEval figure (81.1%) came from a local raw-/v1/completions harness that mechanically understated the score — code was extracted from raw completions, where indentation and sanitization artifacts cost real passes that the model had actually solved. The corrected numbers come from the standard upstream EvalPlus pipeline (evalplus.codegen --backend openai against llama-server in chat mode, greedy/temp=0), which scored HumanEval 92.7% and HumanEval+ 89.0% on this same GGUF. The samples JSONL and EvalPlus eval JSON are in benchmark_results/ for verification. Only the HumanEval/HumanEval+ numbers changed; PPL, ARC, HellaSwag, and MMLU are unchanged.

vs Q2_K imatrix (10 GB)

| Benchmark | Cerebellum v4 (12 GB) | Q2_K imatrix (10 GB) |

|-----------|:---:|:---:|

| Perplexity | 7.034 | 7.500 |

| HumanEval | 92.7%¹ | 47.0%² |

| ARC-Challenge | 96.8% | 95.0% |

| HellaSwag | 92.2% | 90.8% |

| MMLU-Redux | 76.6% | 74.3% |

¹ Cerebellum v4 HumanEval re-measured under upstream EvalPlus chat-mode (see correction note above). ² The Q2_K imatrix HumanEval figure is still from the older local raw-completion harness and is pending a like-for-like EvalPlus rerun, so treat this row as indicative of direction, not an exact gap.

Short-answer benchmarks (ARC, HellaSwag) are nearly identical — both methods preserve surface reasoning at 2-bit. The gap opens on tasks requiring precise code generation and deep knowledge (MMLU-Redux: +2.8%), where ablation-informed precision allocation protects the tensors that matter.

Speed (RTX 3090, full GPU offload)

| Metric | Value |

|--------|-------|

| Prompt processing | 71 tok/s |

| Generation | 36.5 tok/s |

| Context tested | 4,096 tokens |

Perplexity vs Size

| Method | Size | PPL |

|--------|------|-----|

| Cerebellum v4 | 11.98 GB | 7.034 |

| Cerebellum v2 | 10.68 GB | 7.087 |

| Q2_K + imatrix | 9.98 GB | 7.500 |

| Q2_K (no imatrix) | 9.98 GB | 7.649 |

How Cerebellum Works

Standard quantization applies the same precision level uniformly across every tensor. Cerebellum measures the actual sensitivity of each tensor and allocates bits where they matter.

Step 1: Ablation Sweep

Each tensor is individually crushed to Q2_K while keeping all other tensors at their baseline quant. The perplexity impact of each crush is measured. This produces a sensitivity map of the entire model.

Example measurements from this model (baseline PPL 8.256):

| Tensor | PPL when crushed | Delta | Verdict |

|--------|-----------------|-------|---------|

| blk.63.attn_q | 8.418 | +0.162 | Sacred — needs max precision |

| blk.63.ffn_down | 8.393 | +0.138 | Sacred |

| blk.1.ffn_gate | 8.294 | +0.039 | Sensitive |

| blk.50.ffn_down | 8.246 | -0.010 | Safe to crush |

| blk.34.ffn_down | 8.161 | -0.095 | Demotable — improves when crushed |

| blk.2.ffn_gate | 8.109 | -0.147 | Demotable — actively helps |

Step 2: Budget Allocation

Given a target file size (12 GB), the allocator promotes sacred tensors to higher quant levels (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0) in multiple passes, spending the size budget on tensors with the highest measured sensitivity. Demotable tensors are explicitly kept at Q2_K.

Step 3: Build

The final GGUF is built with llama-quantize --tensor-type @tensor_types.txt, which applies per-tensor quant overrides.

What It Found

181 tensor overrides across 64 layers:

| Quant Level | Tensors | Purpose |

|-------------|---------|---------|

| Q8_0 | 7 | Sacred attention/FFN in the most sensitive layers |

| Q6_K | 41 | High-sensitivity layers |

| Q5_K | 70 | Moderate sensitivity |

| Q4_K | 22 | Mild sensitivity |

| Q3_K | 19 | Low sensitivity |

| Q2_K (demoted) | 22 | Improve when crushed — kept at minimum |

Key findings:

  • Layer 63 is the most sensitive — q_proj (+0.162 PPL) and ffn_down (+0.138 PPL) need maximum precision
  • 7 tensors actively improve at Q2_K — crushing them reduces perplexity (negative delta)
  • Same-layer interactions are destructive — crushing two FFN tensors in the same layer simultaneously causes worse regression than expected (interaction ratio 0.13)
  • Cross-layer effects are ~86% additive — single-tensor ablation deltas predict multi-tensor outcomes with ~14% attenuation

VRAM Requirements

| Context | VRAM |

|---------|------|

| 2K | ~13 GB |

| 4K | ~13.5 GB |

| 16K | ~14.5 GB |

Measured launch (RTX 3090, llama.cpp)

Measured 2026-06-13 on a single RTX 3090 (24 GB), one llama-server, KV cache q8_0:

| metric | measured |

|---|---|

| decode speed | 36.5 tok/s |

| peak VRAM (4-slot serving) | 16.2 GB |

| max measured context (q8_0 KV) | 131,072 |

llama-server -m Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
  -ngl 99 --parallel 4 -c 24576 --jinja

_This rig's measurements; no quality claims beyond them._

Usage

# llama.cpp — thinking disabled (recommended for chat)
llama-server \
  --model Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  --reasoning-budget 0

# llama.cpp — thinking enabled (for complex reasoning)
llama-server \
  --model Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
  --n-gpu-layers 99 \
  --ctx-size 4096

Ollama

echo 'FROM ./Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf' > Modelfile
ollama create qwen36-cerebellum -f Modelfile
ollama run qwen36-cerebellum

Reproducing This Quant

The full ablation data, tensor type allocations, and tools are in the Cerebellum repo.

pip install -e .

# 1. Run ablation sweep
python -m osmosis.cerebellum ablate \
    --base-gguf qwen36-Q2_K.gguf \
    --tensors ablation_plan.json \
    --output ablation_results.json

# 2. Generate tensor allocation for 12GB budget
python -m osmosis.cerebellum allocate \
    --ablation ablation_results.json \
    --budget 12.0 \
    --output tensor_types.txt

# 3. Build the GGUF
llama-quantize --imatrix imatrix.dat \
    --tensor-type @tensor_types.txt \
    qwen36-f16.gguf qwen36-cerebellum.gguf Q2_K

Model Details

  • Base model: Qwen/Qwen3.6-27B
  • Architecture: Dense transformer, 64 layers, 851 tensors
  • Base quant: Q2_K with importance matrix
  • Overrides: 181 tensors promoted or demoted based on ablation data
  • File format: GGUF v3

Test Hardware

| Component | Spec |

|-----------|------|

| GPU | NVIDIA RTX 3090 (24 GB) |

| CPU | AMD Ryzen 7 5800XT |

| RAM | 64 GB DDR4 |

| OS | Fedora Linux 43 (Atomic) |

Attribution

  • Qwen Team — open-weight base model
  • llama.cpp — imatrix quantization and tensor type override support
  • AWQ — channel-level weight sensitivity insights

License

Apache 2.0

Run deucebucket/Qwen3.6-27B-Cerebellum-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models