deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF overview
Qwen 3.6 27B — Q2 K GGUF with Weight Sensitivity Imatrix 10 GB A 2 bit quantization of Qwen/Qwen3.6 27B https://huggingface.co/Qwen/Qwen3.6 27B using a custom …
Runs locally from ~9.98 GB disk (12 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| qwen3.6-27b-cerebellum-imatrix-Q2_K.gguf | GGUF | Q2_K | 9.98 GB | Download |
Model Details
| Model ID | deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF |
|---|---|
| Author | deucebucket |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | Qwen/Qwen3.6-27B |
| Last modified | 2026-06-12T08:33:15.000Z |
Model README
---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
tags:
- gguf
- qwen
- qwen3.6
- quantized
- imatrix
- 2-bit
- tool-calling
model_type: qwen3
quantized_by: deucebucket
pipeline_tag: text-generation
model-index:
- name: Qwen3.6-27B-Cerebellum-Q2K-GGUF
results:
- task:
name: Text Generation
type: text-generation
dataset:
name: AI2 Reasoning Challenge
type: ai2_arc
config: ARC-Challenge
split: test
metrics:
- name: normalized accuracy
type: acc_norm
value: 0.95
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks
- task:
name: Text Generation
type: text-generation
dataset:
name: HellaSwag
type: hellaswag
split: validation
metrics:
- name: accuracy
type: acc
value: 0.908
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks
- task:
name: Text Generation
type: text-generation
dataset:
name: MMLU-Redux
type: cais/mmlu
config: all
split: test
metrics:
- name: accuracy
type: acc
value: 0.743
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks
- task:
name: Text Generation
type: text-generation
dataset:
name: HumanEval (pass@1)
type: openai_humaneval
split: test
metrics:
- name: pass@1
type: pass@1
value: 0.47
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks
- task:
name: Text Generation
type: text-generation
dataset:
name: WikiText-2 Perplexity
type: wikitext
config: wikitext-2-raw-v1
split: test
metrics:
- name: perplexity
type: perplexity
value: 7.5
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF/tree/main/benchmarks
---
Qwen 3.6 27B — Q2_K GGUF with Weight-Sensitivity Imatrix (10 GB)
A 2-bit quantization of Qwen/Qwen3.6-27B using a custom importance matrix computed from weight statistics. No calibration data required — the imatrix is generated in ~60 seconds on CPU.
Benchmarks
Measured directly on this GGUF with the local llama.cpp benchmark harness on RTX 3090 at temperature 0. The model-index metadata in this card's frontmatter mirrors these numbers; MMLU-Redux is used for the MMLU entry there.
| Benchmark | Score | Questions |
|-----------|-------|-----------|
| Perplexity (WikiText-2, 2048 ctx) | 7.500 | -- |
| HumanEval pass@1 | 47.0% | 164 |
| ARC-Challenge | 95.0% | 1,172 |
| HellaSwag | 90.8% | 10,042 |
| MMLU | 74.8% | 11,643 |
| MMLU-Redux | 74.3% | 2,400 |
| File size | 9.98 GB | -- |
| Tool calls | 72 sequential, zero duplicates | -- |
| Inference speed | 31 tok/s (RTX 3090) | -- |
vs Cerebellum v4 (12 GB, ablation-informed mixed-precision)
| Benchmark | Q2_K imatrix (10 GB) | Cerebellum v4 (12 GB) |
|-----------|:---:|:---:|
| Perplexity | 7.500 | 7.034 |
| HumanEval | 47.0% | 75.0% |
| ARC-Challenge | 95.0% | 95.1% |
| HellaSwag | 90.8% | 91.2% |
| MMLU-Redux | 74.3% | 77.1% |
Short-answer benchmarks (ARC, HellaSwag) are nearly identical. The gap opens on code generation (HumanEval: -28%) and knowledge tasks (MMLU-Redux: -2.8%). See Cerebellum v4 for the ablation-informed version.
MMLU Breakdown
| Category | Accuracy |
|----------|----------|
| Social Sciences | 80.9% |
| Humanities | 77.1% |
| STEM | 74.7% |
| Other | 70.5% |
Knowledge-based subjects hold up well (College Biology 92%, HS Psychology 92%). Math-heavy subjects lose the most precision at Q2_K (HS Mathematics 52%, Abstract Algebra 51%).
Perplexity Across Quant Levels
All quants generated with the same weight-sensitivity imatrix:
| Quant | Size | PPL | Notes |
|-------|------|-----|-------|
| Q4_K_M | 16 GB | 7.44 | |
| Q3_K_M | 13 GB | 7.45 | |
| Q2_K | 10 GB | 7.50 | This file |
| IQ2_M | 8.5 GB | 12.80 | Reasoning degrades |
| IQ2_XS | 8.5 GB | 18.66 | |
| IQ1_M | 7.2 GB | 44.45 | Not usable |
The quality cliff is between Q2_K and IQ2_M. At Q2_K, reasoning and tool calling stay fully intact.
How the Imatrix Works
Standard imatrix generation (llama-imatrix) runs calibration text through the full model — hours of GPU time for large models. This approach computes importance directly from the weights:
- For each weight tensor, compute channel sensitivity:
L2_norm x max_abs x variance - Write importance scores in llama.cpp's imatrix binary format
- Feed to
llama-quantize --imatrix— the quantizer allocates more bits to important blocks
No calibration data. No GPU required. ~60 seconds on CPU for any model size.
The osmosis_imatrix.dat file is included in this repo for anyone who wants to reproduce or create other quant sizes.
Tool Calling
72 unique sequential tool calls across a multi-city travel planning scenario:
- 12 tools: get_weather, search_flights, book_hotel, get_attractions, convert_currency, get_visa_requirements, book_restaurant, get_transport, check_health_advisory, get_events, get_travel_insurance, check_embassy
- 6 cities: Tokyo, Seoul, Bangkok, Singapore, Hanoi, Taipei
- Result: 72/72 unique calls, coherent final summary
- Config: 32K context, temperature 0,
--reasoning-budget 0
VRAM Requirements
| Context | VRAM |
|---------|------|
| 2K | ~11 GB |
| 8K | ~12 GB |
| 16K | ~13.4 GB |
| 32K | ~14.3 GB |
Usage
# Thinking disabled (recommended for tool calling and chat)
llama-server \
--model qwen3.6-27b-cerebellum-imatrix-Q2_K.gguf \
--n-gpu-layers 99 \
--ctx-size 32768 \
--reasoning-budget 0
# Thinking enabled (for complex reasoning)
llama-server \
--model qwen3.6-27b-cerebellum-imatrix-Q2_K.gguf \
--n-gpu-layers 99 \
--ctx-size 32768
Ollama
echo 'FROM ./qwen3.6-27b-cerebellum-imatrix-Q2_K.gguf' > Modelfile
ollama create qwen36-q2k -f Modelfile
ollama run qwen36-q2k
Reproducing This Quant
The imatrix generation tool is in the Cerebellum repo.
pip install git+https://github.com/deucebucket/cerebellum.git
# 1. Generate imatrix (~60 seconds on CPU)
python -m cerebellum.imatrix_stream \
--model Qwen/Qwen3.6-27B \
--output osmosis_imatrix.dat -v
# 2. Convert to f16 GGUF (using llama.cpp)
python convert_hf_to_gguf.py Qwen3.6-27B --outfile qwen3.6-27b-f16.gguf --outtype f16
# 3. Quantize with imatrix
llama-quantize --imatrix osmosis_imatrix.dat qwen3.6-27b-f16.gguf qwen3.6-27b-Q2_K.gguf Q2_K
Model Details
- Base model: Qwen/Qwen3.6-27B
- Architecture: Dense transformer, 64 layers, 851 tensors
- Quantization: Q2_K via llama-quantize with weight-sensitivity imatrix
- Imatrix method:
L2_norm x max_abs x varianceper channel, no calibration data - File format: GGUF v3
Test Hardware
| Component | Spec |
|-----------|------|
| GPU | NVIDIA RTX 3090 (24 GB) |
| CPU | AMD Ryzen 7 5800XT |
| RAM | 64 GB DDR4 |
| OS | Fedora Linux 43 (Atomic) |
Attribution
- Qwen Team — open-weight base model
- llama.cpp — imatrix quantization system
- AWQ — channel-level weight sensitivity insights
License
Apache 2.0
Run deucebucket/Qwen3.6-27B-Cerebellum-Q2K-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models