What license applies to deucebucket/Qwen3.6-27B-Cerebellum-GGUF?

License: apache-2.0. Verify terms on Hugging Face before commercial use.

How do I run deucebucket/Qwen3.6-27B-Cerebellum-GGUF locally?

Download a GGUF file from this page and load it in guIDE or llama.cpp. Pipeline task: text-generation.

Model Intelligence Sheet

deucebucket/Qwen3.6-27B-Cerebellum-GGUF overview

Q: How much VRAM or disk space does deucebucket/Qwen3.6-27B-Cerebellum-GGUF need?

Runs locally from ~11.98 GB disk (12 GB VRAM class GPUs with llama.cpp / guIDE).

<p align="center" <img src="cerebellum banner.png" alt="Cerebellum" width="640" </p Qwen 3.6 27B — Cerebellum v4 GGUF 12 GB Ablation informed mixed precision q…

ggufquantizedcerebellumqwen3.6ablation-informedtext-generationarxiv:2306.00978base_model:Qwen/Qwen3.6-27Bbase_model:quantized:Qwen/Qwen3.6-27Blicense:apache-2.0model-indexendpoints_compatibleregion:usimatrixconversational

Runs locally from ~11.98 GB disk (12 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads

314

Likes

Pipeline

text-generation

Author

deucebucket

Repository Files & Downloads

1 GGUF files detected

Direct downloads for local inference

File	Type	Quantization	Size	Link
Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf	GGUF	Q2_K_MIXED	11.98 GB	Download

Model Details

Model ID	deucebucket/Qwen3.6-27B-Cerebellum-GGUF
Author	deucebucket
Pipeline	text-generation
License	apache-2.0
Base model	Qwen/Qwen3.6-27B
Last modified	2026-06-15T04:20:57.000Z

Model README

---

license: apache-2.0

tags:

gguf
quantized
cerebellum
qwen3.6
ablation-informed

base_model: Qwen/Qwen3.6-27B

model_type: qwen3

quantized_by: deucebucket

pipeline_tag: text-generation

model-index:

name: Qwen3.6-27B-Cerebellum-GGUF

results:

- task:

type: text-generation

dataset:

type: ai2_arc

config: ARC-Challenge

split: test

metrics:

- name: normalized accuracy

type: acc_norm

value: 0.968

source:

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

- task:

type: text-generation

dataset:

type: hellaswag

split: validation

metrics:

- name: accuracy

type: acc

value: 0.922

source:

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

- task:

type: text-generation

dataset:

type: cais/mmlu

config: all

split: test

metrics:

- name: accuracy

type: acc

value: 0.766

source:

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

- task:

type: text-generation

dataset:

type: openai_humaneval

split: test

metrics:

- name: pass@1

type: pass@1

value: 0.927

source:

name: EvalPlus chat-mode (llama-server) — samples + eval JSON in benchmark_results/

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmark_results

- task:

type: text-generation

dataset:

type: evalplus/humanevalplus

split: test

metrics:

- name: pass@1

type: pass@1

value: 0.890

source:

name: EvalPlus chat-mode (llama-server) — samples + eval JSON in benchmark_results/

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmark_results

- task:

type: text-generation

dataset:

type: wikitext

config: wikitext-2-raw-v1

split: test

metrics:

- name: perplexity

type: perplexity

value: 7.034

source:

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-GGUF/tree/main/benchmarks

---

</p>

Qwen 3.6 27B — Cerebellum v4 GGUF (12 GB)

Ablation-informed mixed-precision quantization of Qwen 3.6 27B. 12 GB file size, 7.034 perplexity, 181 per-tensor quant overrides.

Benchmarks

Measured directly on this GGUF with the local llama.cpp benchmark harness on RTX 3090 at temperature 0. The model-index metadata in this card's frontmatter mirrors the v4 numbers below; MMLU-Redux is used for the MMLU entry there.

| Benchmark | Score | Questions |

|-----------|-------|-----------|

| Perplexity (WikiText-2, 2048 ctx) | 7.034 | — |

| HumanEval pass@1 | 92.7% | 164 |

| HumanEval+ pass@1 | 89.0% | 164 |

| ARC-Challenge | 96.8% | 1,172 |

| HellaSwag | 92.2% | 10,042 |

| MMLU | 82.5% | 11,643 |

| MMLU-Redux | 76.6% | 2,400 |

Recommended sampling: temperature=0. Tested across the full benchmark suite, temp=0 scored highest on all benchmarks.

2026-05-03 Score Corrections: Found and fixed bugs in the benchmark scripts. ARC had 19 questions misjudged due to numeric label handling. HellaSwag had 108 empty responses incorrectly counted as wrong. Full audit trail and per-question results in the Cerebellum repo.

2026-06-14 HumanEval correction (81.1% → 92.7% / HumanEval+ 89.0%): The earlier HumanEval figure (81.1%) came from a local raw-/v1/completions harness that mechanically understated the score — code was extracted from raw completions, where indentation and sanitization artifacts cost real passes that the model had actually solved. The corrected numbers come from the standard upstream EvalPlus pipeline (evalplus.codegen --backend openai against llama-server in chat mode, greedy/temp=0), which scored HumanEval 92.7% and HumanEval+ 89.0% on this same GGUF. The samples JSONL and EvalPlus eval JSON are in benchmark_results/ for verification. Only the HumanEval/HumanEval+ numbers changed; PPL, ARC, HellaSwag, and MMLU are unchanged.

vs Q2_K imatrix (10 GB)

| Benchmark | Cerebellum v4 (12 GB) | Q2_K imatrix (10 GB) |

|-----------|:---:|:---:|

| Perplexity | 7.034 | 7.500 |

| HumanEval | 92.7%¹ | 47.0%² |

| ARC-Challenge | 96.8% | 95.0% |

| HellaSwag | 92.2% | 90.8% |

| MMLU-Redux | 76.6% | 74.3% |

¹ Cerebellum v4 HumanEval re-measured under upstream EvalPlus chat-mode (see correction note above). ² The Q2_K imatrix HumanEval figure is still from the older local raw-completion harness and is pending a like-for-like EvalPlus rerun, so treat this row as indicative of direction, not an exact gap.

Short-answer benchmarks (ARC, HellaSwag) are nearly identical — both methods preserve surface reasoning at 2-bit. The gap opens on tasks requiring precise code generation and deep knowledge (MMLU-Redux: +2.8%), where ablation-informed precision allocation protects the tensors that matter.

Speed (RTX 3090, full GPU offload)

| Metric | Value |

|--------|-------|

| Prompt processing | 71 tok/s |

| Generation | 36.5 tok/s |

| Context tested | 4,096 tokens |

Perplexity vs Size

| Method | Size | PPL |

|--------|------|-----|

| Cerebellum v4 | 11.98 GB | 7.034 |

| Cerebellum v2 | 10.68 GB | 7.087 |

| Q2_K + imatrix | 9.98 GB | 7.500 |

| Q2_K (no imatrix) | 9.98 GB | 7.649 |

How Cerebellum Works

Standard quantization applies the same precision level uniformly across every tensor. Cerebellum measures the actual sensitivity of each tensor and allocates bits where they matter.

Step 1: Ablation Sweep

Each tensor is individually crushed to Q2_K while keeping all other tensors at their baseline quant. The perplexity impact of each crush is measured. This produces a sensitivity map of the entire model.

Example measurements from this model (baseline PPL 8.256):

|--------|-----------------|-------|---------|

| blk.63.attn_q | 8.418 | +0.162 | Sacred — needs max precision |

| blk.63.ffn_down | 8.393 | +0.138 | Sacred |

| blk.1.ffn_gate | 8.294 | +0.039 | Sensitive |

| blk.50.ffn_down | 8.246 | -0.010 | Safe to crush |

| blk.34.ffn_down | 8.161 | -0.095 | Demotable — improves when crushed |

| blk.2.ffn_gate | 8.109 | -0.147 | Demotable — actively helps |

Step 2: Budget Allocation

Given a target file size (12 GB), the allocator promotes sacred tensors to higher quant levels (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0) in multiple passes, spending the size budget on tensors with the highest measured sensitivity. Demotable tensors are explicitly kept at Q2_K.

Step 3: Build

The final GGUF is built with llama-quantize --tensor-type @tensor_types.txt, which applies per-tensor quant overrides.

What It Found

181 tensor overrides across 64 layers:

| Quant Level | Tensors | Purpose |

|-------------|---------|---------|

| Q8_0 | 7 | Sacred attention/FFN in the most sensitive layers |

| Q6_K | 41 | High-sensitivity layers |

| Q5_K | 70 | Moderate sensitivity |

| Q4_K | 22 | Mild sensitivity |

| Q3_K | 19 | Low sensitivity |

| Q2_K (demoted) | 22 | Improve when crushed — kept at minimum |

Key findings:

Layer 63 is the most sensitive — q_proj (+0.162 PPL) and ffn_down (+0.138 PPL) need maximum precision
7 tensors actively improve at Q2_K — crushing them reduces perplexity (negative delta)
Same-layer interactions are destructive — crushing two FFN tensors in the same layer simultaneously causes worse regression than expected (interaction ratio 0.13)
Cross-layer effects are ~86% additive — single-tensor ablation deltas predict multi-tensor outcomes with ~14% attenuation

VRAM Requirements

| Context | VRAM |

|---------|------|

| 2K | ~13 GB |

| 4K | ~13.5 GB |

| 16K | ~14.5 GB |

Measured launch (RTX 3090, llama.cpp)

Measured 2026-06-13 on a single RTX 3090 (24 GB), one llama-server, KV cache q8_0:

| metric | measured |

|---|---|

| decode speed | 36.5 tok/s |

| peak VRAM (4-slot serving) | 16.2 GB |

| max measured context (q8_0 KV) | 131,072 |

llama-server -m Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
  -ngl 99 --parallel 4 -c 24576 --jinja

_This rig's measurements; no quality claims beyond them._

Usage

# llama.cpp — thinking disabled (recommended for chat)
llama-server \
  --model Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  --reasoning-budget 0

# llama.cpp — thinking enabled (for complex reasoning)
llama-server \
  --model Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf \
  --n-gpu-layers 99 \
  --ctx-size 4096

Ollama

echo 'FROM ./Qwen3.6-27B-Cerebellum-v4-Q2_K_Mixed.gguf' > Modelfile
ollama create qwen36-cerebellum -f Modelfile
ollama run qwen36-cerebellum

Reproducing This Quant

The full ablation data, tensor type allocations, and tools are in the Cerebellum repo.

pip install -e .

# 1. Run ablation sweep
python -m osmosis.cerebellum ablate \
    --base-gguf qwen36-Q2_K.gguf \
    --tensors ablation_plan.json \
    --output ablation_results.json

# 2. Generate tensor allocation for 12GB budget
python -m osmosis.cerebellum allocate \
    --ablation ablation_results.json \
    --budget 12.0 \
    --output tensor_types.txt

# 3. Build the GGUF
llama-quantize --imatrix imatrix.dat \
    --tensor-type @tensor_types.txt \
    qwen36-f16.gguf qwen36-cerebellum.gguf Q2_K

Model Details

Base model: Qwen/Qwen3.6-27B
Architecture: Dense transformer, 64 layers, 851 tensors
Base quant: Q2_K with importance matrix
Overrides: 181 tensors promoted or demoted based on ablation data
File format: GGUF v3

Test Hardware

| Component | Spec |

|-----------|------|

| GPU | NVIDIA RTX 3090 (24 GB) |

| CPU | AMD Ryzen 7 5800XT |

| RAM | 64 GB DDR4 |

| OS | Fedora Linux 43 (Atomic) |

Attribution

Qwen Team — open-weight base model
llama.cpp — imatrix quantization and tensor type override support
AWQ — channel-level weight sensitivity insights

License

Apache 2.0

Run deucebucket/Qwen3.6-27B-Cerebellum-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models