deucebucket/Qwen3.6-27B-Cerebellum-GGUF overview
<p align="center" <img src="cerebellum banner.png" alt="Cerebellum" width="640" </p Qwen 3.6 27B — Cerebellum v4 GGUF 12 GB Ablation informed mixed precision q…
Runs locally from ~11.98 GB disk (12 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf | GGUF | Q2_K_MIXED | 11.98 GB | Download |
Model Details
| Model ID | deucebucket/Qwen3.6-27B-Cerebellum-GGUF |
|---|---|
| Author | deucebucket |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | Qwen/Qwen3.6-27B |
| Last modified | 2026-06-15T04:20:57.000Z |
Model README
---
license: apache-2.0
tags:
- gguf
- quantized
- cerebellum
- qwen3.6
- ablation-informed
base_model: Qwen/Qwen3.6-27B
model_type: qwen3
quantized_by: deucebucket
pipeline_tag: text-generation
model-index:
- name: Qwen3.6-27B-Cerebellum-GGUF
results:
- task:
name: Text Generation
type: text-generation
dataset:
name: AI2 Reasoning Challenge
type: ai2_arc
config: ARC-Challenge
split: test
metrics:
- name: normalized accuracy
type: acc_norm
value: 0.968
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks
- task:
name: Text Generation
type: text-generation
dataset:
name: HellaSwag
type: hellaswag
split: validation
metrics:
- name: accuracy
type: acc
value: 0.922
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks
- task:
name: Text Generation
type: text-generation
dataset:
name: MMLU-Redux
type: cais/mmlu
config: all
split: test
metrics:
- name: accuracy
type: acc
value: 0.766
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks
- task:
name: Text Generation
type: text-generation
dataset:
name: HumanEval (pass@1)
type: openai_humaneval
split: test
metrics:
- name: pass@1
type: pass@1
value: 0.927
source:
name: EvalPlus chat-mode (llama-server) — samples + eval JSON in benchmark_results/
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: HumanEval+ (pass@1)
type: evalplus/humanevalplus
split: test
metrics:
- name: pass@1
type: pass@1
value: 0.890
source:
name: EvalPlus chat-mode (llama-server) — samples + eval JSON in benchmark_results/
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: WikiText-2 Perplexity
type: wikitext
config: wikitext-2-raw-v1
split: test
metrics:
- name: perplexity
type: perplexity
value: 7.034
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks
---
<p align="center">
<img src="cerebellum_banner.png" alt="Cerebellum" width="640">
</p>
Qwen 3.6 27B — Cerebellum v4 GGUF (12 GB)
Ablation-informed mixed-precision quantization of Qwen 3.6 27B. 12 GB file size, 7.034 perplexity, 181 per-tensor quant overrides.
Benchmarks
Measured directly on this GGUF with the local llama.cpp benchmark harness on RTX 3090 at temperature 0. The model-index metadata in this card's frontmatter mirrors the v4 numbers below; MMLU-Redux is used for the MMLU entry there.
| Benchmark | Score | Questions |
|-----------|-------|-----------|
| Perplexity (WikiText-2, 2048 ctx) | 7.034 | — |
| HumanEval pass@1 | 92.7% | 164 |
| HumanEval+ pass@1 | 89.0% | 164 |
| ARC-Challenge | 96.8% | 1,172 |
| HellaSwag | 92.2% | 10,042 |
| MMLU | 82.5% | 11,643 |
| MMLU-Redux | 76.6% | 2,400 |
Recommended sampling: temperature=0. Tested across the full benchmark suite, temp=0 scored highest on all benchmarks.
2026-05-03 Score Corrections: Found and fixed bugs in the benchmark scripts. ARC had 19 questions misjudged due to numeric label handling. HellaSwag had 108 empty responses incorrectly counted as wrong. Full audit trail and per-question results in the Cerebellum repo.
2026-06-14 HumanEval correction (81.1% → 92.7% / HumanEval+ 89.0%): The earlier HumanEval figure (81.1%) came from a local raw-/v1/completions harness that mechanically understated the score — code was extracted from raw completions, where indentation and sanitization artifacts cost real passes that the model had actually solved. The corrected numbers come from the standard upstream EvalPlus pipeline (evalplus.codegen --backend openai against llama-server in chat mode, greedy/temp=0), which scored HumanEval 92.7% and HumanEval+ 89.0% on this same GGUF. The samples JSONL and EvalPlus eval JSON are in benchmark_results/ for verification. Only the HumanEval/HumanEval+ numbers changed; PPL, ARC, HellaSwag, and MMLU are unchanged.
vs Q2_K imatrix (10 GB)
| Benchmark | Cerebellum v4 (12 GB) | Q2_K imatrix (10 GB) |
|-----------|:---:|:---:|
| Perplexity | 7.034 | 7.500 |
| HumanEval | 92.7%¹ | 47.0%² |
| ARC-Challenge | 96.8% | 95.0% |
| HellaSwag | 92.2% | 90.8% |
| MMLU-Redux | 76.6% | 74.3% |
¹ Cerebellum v4 HumanEval re-measured under upstream EvalPlus chat-mode (see correction note above). ² The Q2_K imatrix HumanEval figure is still from the older local raw-completion harness and is pending a like-for-like EvalPlus rerun, so treat this row as indicative of direction, not an exact gap.
Short-answer benchmarks (ARC, HellaSwag) are nearly identical — both methods preserve surface reasoning at 2-bit. The gap opens on tasks requiring precise code generation and deep knowledge (MMLU-Redux: +2.8%), where ablation-informed precision allocation protects the tensors that matter.
Speed (RTX 3090, full GPU offload)
| Metric | Value |
|--------|-------|
| Prompt processing | 71 tok/s |
| Generation | 36.5 tok/s |
| Context tested | 4,096 tokens |
Perplexity vs Size
| Method | Size | PPL |
|--------|------|-----|
| Cerebellum v4 | 11.98 GB | 7.034 |
| Cerebellum v2 | 10.68 GB | 7.087 |
| Q2_K + imatrix | 9.98 GB | 7.500 |
| Q2_K (no imatrix) | 9.98 GB | 7.649 |
How Cerebellum Works
Standard quantization applies the same precision level uniformly across every tensor. Cerebellum measures the actual sensitivity of each tensor and allocates bits where they matter.
Step 1: Ablation Sweep
Each tensor is individually crushed to Q2_K while keeping all other tensors at their baseline quant. The perplexity impact of each crush is measured. This produces a sensitivity map of the entire model.
Example measurements from this model (baseline PPL 8.256):
| Tensor | PPL when crushed | Delta | Verdict |
|--------|-----------------|-------|---------|
| blk.63.attn_q | 8.418 | +0.162 | Sacred — needs max precision |
| blk.63.ffn_down | 8.393 | +0.138 | Sacred |
| blk.1.ffn_gate | 8.294 | +0.039 | Sensitive |
| blk.50.ffn_down | 8.246 | -0.010 | Safe to crush |
| blk.34.ffn_down | 8.161 | -0.095 | Demotable — improves when crushed |
| blk.2.ffn_gate | 8.109 | -0.147 | Demotable — actively helps |
Step 2: Budget Allocation
Given a target file size (12 GB), the allocator promotes sacred tensors to higher quant levels (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0) in multiple passes, spending the size budget on tensors with the highest measured sensitivity. Demotable tensors are explicitly kept at Q2_K.
Step 3: Build
The final GGUF is built with llama-quantize --tensor-type @tensor_types.txt, which applies per-tensor quant overrides.
What It Found
181 tensor overrides across 64 layers:
| Quant Level | Tensors | Purpose |
|-------------|---------|---------|
| Q8_0 | 7 | Sacred attention/FFN in the most sensitive layers |
| Q6_K | 41 | High-sensitivity layers |
| Q5_K | 70 | Moderate sensitivity |
| Q4_K | 22 | Mild sensitivity |
| Q3_K | 19 | Low sensitivity |
| Q2_K (demoted) | 22 | Improve when crushed — kept at minimum |
Key findings:
- Layer 63 is the most sensitive — q_proj (+0.162 PPL) and ffn_down (+0.138 PPL) need maximum precision
- 7 tensors actively improve at Q2_K — crushing them reduces perplexity (negative delta)
- Same-layer interactions are destructive — crushing two FFN tensors in the same layer simultaneously causes worse regression than expected (interaction ratio 0.13)
- Cross-layer effects are ~86% additive — single-tensor ablation deltas predict multi-tensor outcomes with ~14% attenuation
VRAM Requirements
| Context | VRAM |
|---------|------|
| 2K | ~13 GB |
| 4K | ~13.5 GB |
| 16K | ~14.5 GB |
Measured launch (RTX 3090, llama.cpp)
Measured 2026-06-13 on a single RTX 3090 (24 GB), one llama-server, KV cache q8_0:
| metric | measured |
|---|---|
| decode speed | 36.5 tok/s |
| peak VRAM (4-slot serving) | 16.2 GB |
| max measured context (q8_0 KV) | 131,072 |
llama-server -m Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
-ngl 99 --parallel 4 -c 24576 --jinja
_This rig's measurements; no quality claims beyond them._
Usage
# llama.cpp — thinking disabled (recommended for chat)
llama-server \
--model Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
--n-gpu-layers 99 \
--ctx-size 4096 \
--reasoning-budget 0
# llama.cpp — thinking enabled (for complex reasoning)
llama-server \
--model Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
--n-gpu-layers 99 \
--ctx-size 4096
Ollama
echo 'FROM ./Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf' > Modelfile
ollama create qwen36-cerebellum -f Modelfile
ollama run qwen36-cerebellum
Reproducing This Quant
The full ablation data, tensor type allocations, and tools are in the Cerebellum repo.
pip install -e .
# 1. Run ablation sweep
python -m osmosis.cerebellum ablate \
--base-gguf qwen36-Q2_K.gguf \
--tensors ablation_plan.json \
--output ablation_results.json
# 2. Generate tensor allocation for 12GB budget
python -m osmosis.cerebellum allocate \
--ablation ablation_results.json \
--budget 12.0 \
--output tensor_types.txt
# 3. Build the GGUF
llama-quantize --imatrix imatrix.dat \
--tensor-type @tensor_types.txt \
qwen36-f16.gguf qwen36-cerebellum.gguf Q2_K
Model Details
- Base model: Qwen/Qwen3.6-27B
- Architecture: Dense transformer, 64 layers, 851 tensors
- Base quant: Q2_K with importance matrix
- Overrides: 181 tensors promoted or demoted based on ablation data
- File format: GGUF v3
Test Hardware
| Component | Spec |
|-----------|------|
| GPU | NVIDIA RTX 3090 (24 GB) |
| CPU | AMD Ryzen 7 5800XT |
| RAM | 64 GB DDR4 |
| OS | Fedora Linux 43 (Atomic) |
Attribution
- Qwen Team — open-weight base model
- llama.cpp — imatrix quantization and tensor type override support
- AWQ — channel-level weight sensitivity insights
License
Apache 2.0
Run deucebucket/Qwen3.6-27B-Cerebellum-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models