GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF overview

Qwen 3.6 27B — Q2 K GGUF with Weight Sensitivity Imatrix 10 GB A 2 bit quantization of Qwen/Qwen3.6 27B https://huggingface.co/Qwen/Qwen3.6 27B using a custom …

ggufqwenqwen3.6quantizedimatrix2-bittool-callingtext-generationarxiv:2306.00978base_model:Qwen/Qwen3.6-27Bbase_model:quantized:Qwen/Qwen3.6-27Blicense:apache-2.0model-indexendpoints_compatibleregion:usconversational

Runs locally from ~9.98 GB disk (12 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
221
Likes
2
Pipeline
text-generation

Repository Files & Downloads

1 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
qwen3.6-27b-cerebellum-imatrix-Q2_K.ggufGGUFQ2_K9.98 GBDownload

Model Details

Model IDdeucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF
Authordeucebucket
Pipelinetext-generation
Licenseapache-2.0
Base modelQwen/Qwen3.6-27B
Last modified2026-06-12T08:33:15.000Z

Model README

---

license: apache-2.0

base_model: Qwen/Qwen3.6-27B

tags:

- gguf

- qwen

- qwen3.6

- quantized

- imatrix

- 2-bit

- tool-calling

model_type: qwen3

quantized_by: deucebucket

pipeline_tag: text-generation

model-index:

  • name: Qwen3.6-27B-Cerebellum-Q2K-GGUF

results:

- task:

name: Text Generation

type: text-generation

dataset:

name: AI2 Reasoning Challenge

type: ai2_arc

config: ARC-Challenge

split: test

metrics:

- name: normalized accuracy

type: acc_norm

value: 0.95

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks

- task:

name: Text Generation

type: text-generation

dataset:

name: HellaSwag

type: hellaswag

split: validation

metrics:

- name: accuracy

type: acc

value: 0.908

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks

- task:

name: Text Generation

type: text-generation

dataset:

name: MMLU-Redux

type: cais/mmlu

config: all

split: test

metrics:

- name: accuracy

type: acc

value: 0.743

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks

- task:

name: Text Generation

type: text-generation

dataset:

name: HumanEval (pass@1)

type: openai_humaneval

split: test

metrics:

- name: pass@1

type: pass@1

value: 0.47

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks

- task:

name: Text Generation

type: text-generation

dataset:

name: WikiText-2 Perplexity

type: wikitext

config: wikitext-2-raw-v1

split: test

metrics:

- name: perplexity

type: perplexity

value: 7.5

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks

---

Qwen 3.6 27B — Q2_K GGUF with Weight-Sensitivity Imatrix (10 GB)

A 2-bit quantization of Qwen/Qwen3.6-27B using a custom importance matrix computed from weight statistics. No calibration data required — the imatrix is generated in ~60 seconds on CPU.

Benchmarks

Measured directly on this GGUF with the local llama.cpp benchmark harness on RTX 3090 at temperature 0. The model-index metadata in this card's frontmatter mirrors these numbers; MMLU-Redux is used for the MMLU entry there.

| Benchmark | Score | Questions |

|-----------|-------|-----------|

| Perplexity (WikiText-2, 2048 ctx) | 7.500 | -- |

| HumanEval pass@1 | 47.0% | 164 |

| ARC-Challenge | 95.0% | 1,172 |

| HellaSwag | 90.8% | 10,042 |

| MMLU | 74.8% | 11,643 |

| MMLU-Redux | 74.3% | 2,400 |

| File size | 9.98 GB | -- |

| Tool calls | 72 sequential, zero duplicates | -- |

| Inference speed | 31 tok/s (RTX 3090) | -- |

vs Cerebellum v4 (12 GB, ablation-informed mixed-precision)

| Benchmark | Q2_K imatrix (10 GB) | Cerebellum v4 (12 GB) |

|-----------|:---:|:---:|

| Perplexity | 7.500 | 7.034 |

| HumanEval | 47.0% | 75.0% |

| ARC-Challenge | 95.0% | 95.1% |

| HellaSwag | 90.8% | 91.2% |

| MMLU-Redux | 74.3% | 77.1% |

Short-answer benchmarks (ARC, HellaSwag) are nearly identical. The gap opens on code generation (HumanEval: -28%) and knowledge tasks (MMLU-Redux: -2.8%). See Cerebellum v4 for the ablation-informed version.

MMLU Breakdown

| Category | Accuracy |

|----------|----------|

| Social Sciences | 80.9% |

| Humanities | 77.1% |

| STEM | 74.7% |

| Other | 70.5% |

Knowledge-based subjects hold up well (College Biology 92%, HS Psychology 92%). Math-heavy subjects lose the most precision at Q2_K (HS Mathematics 52%, Abstract Algebra 51%).

Perplexity Across Quant Levels

All quants generated with the same weight-sensitivity imatrix:

| Quant | Size | PPL | Notes |

|-------|------|-----|-------|

| Q4_K_M | 16 GB | 7.44 | |

| Q3_K_M | 13 GB | 7.45 | |

| Q2_K | 10 GB | 7.50 | This file |

| IQ2_M | 8.5 GB | 12.80 | Reasoning degrades |

| IQ2_XS | 8.5 GB | 18.66 | |

| IQ1_M | 7.2 GB | 44.45 | Not usable |

The quality cliff is between Q2_K and IQ2_M. At Q2_K, reasoning and tool calling stay fully intact.

How the Imatrix Works

Standard imatrix generation (llama-imatrix) runs calibration text through the full model — hours of GPU time for large models. This approach computes importance directly from the weights:

  1. For each weight tensor, compute channel sensitivity: L2_norm x max_abs x variance
  2. Write importance scores in llama.cpp's imatrix binary format
  3. Feed to llama-quantize --imatrix — the quantizer allocates more bits to important blocks

No calibration data. No GPU required. ~60 seconds on CPU for any model size.

The osmosis_imatrix.dat file is included in this repo for anyone who wants to reproduce or create other quant sizes.

Tool Calling

72 unique sequential tool calls across a multi-city travel planning scenario:

  • 12 tools: get_weather, search_flights, book_hotel, get_attractions, convert_currency, get_visa_requirements, book_restaurant, get_transport, check_health_advisory, get_events, get_travel_insurance, check_embassy
  • 6 cities: Tokyo, Seoul, Bangkok, Singapore, Hanoi, Taipei
  • Result: 72/72 unique calls, coherent final summary
  • Config: 32K context, temperature 0, --reasoning-budget 0

VRAM Requirements

| Context | VRAM |

|---------|------|

| 2K | ~11 GB |

| 8K | ~12 GB |

| 16K | ~13.4 GB |

| 32K | ~14.3 GB |

Usage

# Thinking disabled (recommended for tool calling and chat)
llama-server \
  --model qwen3.6-27b-cerebellum-imatrix-Q2_K.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --reasoning-budget 0

# Thinking enabled (for complex reasoning)
llama-server \
  --model qwen3.6-27b-cerebellum-imatrix-Q2_K.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768

Ollama

echo 'FROM ./qwen3.6-27b-cerebellum-imatrix-Q2_K.gguf' > Modelfile
ollama create qwen36-q2k -f Modelfile
ollama run qwen36-q2k

Reproducing This Quant

The imatrix generation tool is in the Cerebellum repo.

pip install git+https://github.com/deucebucket/cerebellum.git

# 1. Generate imatrix (~60 seconds on CPU)
python -m cerebellum.imatrix_stream \
  --model Qwen/Qwen3.6-27B \
  --output osmosis_imatrix.dat -v

# 2. Convert to f16 GGUF (using llama.cpp)
python convert_hf_to_gguf.py Qwen3.6-27B --outfile qwen3.6-27b-f16.gguf --outtype f16

# 3. Quantize with imatrix
llama-quantize --imatrix osmosis_imatrix.dat qwen3.6-27b-f16.gguf qwen3.6-27b-Q2_K.gguf Q2_K

Model Details

  • Base model: Qwen/Qwen3.6-27B
  • Architecture: Dense transformer, 64 layers, 851 tensors
  • Quantization: Q2_K via llama-quantize with weight-sensitivity imatrix
  • Imatrix method: L2_norm x max_abs x variance per channel, no calibration data
  • File format: GGUF v3

Test Hardware

| Component | Spec |

|-----------|------|

| GPU | NVIDIA RTX 3090 (24 GB) |

| CPU | AMD Ryzen 7 5800XT |

| RAM | 64 GB DDR4 |

| OS | Fedora Linux 43 (Atomic) |

Attribution

  • Qwen Team — open-weight base model
  • llama.cpp — imatrix quantization system
  • AWQ — channel-level weight sensitivity insights

License

Apache 2.0

Run deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models