deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF overview
Qwen 3.6 35B A3B — Cerebellum GGUF Sensitivity guided mixed precision quantization of Qwen/Qwen3.6 35B A3B https://huggingface.co/Qwen/Qwen3.6 35B A3B . Two va…
Runs locally from ~857.6 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
Model Details
| Model ID | deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF |
|---|---|
| Author | deucebucket |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | Qwen/Qwen3.6-35B-A3B |
| Last modified | 2026-06-13T00:57:58.000Z |
Model README
---
license: apache-2.0
library_name: gguf
base_model: Qwen/Qwen3.6-35B-A3B
base_model_relation: quantized
model_name: Qwen3.6-35B-A3B-Cerebellum-GGUF
model_creator: Qwen
model_type: qwen3
quantized_by: deucebucket
pipeline_tag: text-generation
tags:
- GGUF
- qwen3
- qwen
- quantized
- cerebellum
- imatrix
- moe
- mixed-precision
- 3-bit
- conversational
model-index:
- name: Qwen3.6-35B-A3B-Cerebellum-GGUF
results:
- task:
name: Text Generation
type: text-generation
dataset:
name: AI2 Reasoning Challenge
type: ai2_arc
config: ARC-Challenge
split: test
metrics:
- name: normalized accuracy
type: acc_norm
value: 0.958
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: HellaSwag
type: hellaswag
split: validation
metrics:
- name: accuracy
type: acc
value: 0.923
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: MMLU-Redux
type: cais/mmlu
config: all
split: test
metrics:
- name: accuracy
type: acc
value: 0.75
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results
- task:
name: Text Generation
type: text-generation
dataset:
name: HumanEval+ (pass@1)
type: openai_humaneval
split: test
metrics:
- name: pass@1
type: pass@1
value: 0.652
source:
name: Local benchmark run (RTX 3090, llama.cpp)
url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results
---
Qwen 3.6 35B-A3B — Cerebellum GGUF
Sensitivity-guided mixed-precision quantization of Qwen/Qwen3.6-35B-A3B. Two variants available:
| Variant | File | Size | BPW |
|---------|------|------|-----|
| Cerebellum v3 (recommended) | Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf | 11 GB | 2.76 |
| Cerebellum v1 (legacy) | Qwen3.6-35B-A3B-Cerebellum-Q3_K_M.gguf | 12 GB | 2.73 |
Cerebellum measures which weight groups survive extreme compression and which don't, then writes a single GGUF with per-tensor precision assignments. v3 uses 360 tensor-level overrides guided by group ablation and reverse ablation analysis.
Benchmarks
All benchmarks measured directly on these GGUF files using llama.cpp inference with cleaned evaluation harness.
The model-index metadata in this card's frontmatter mirrors the recommended v3 build. Protocol: local llama.cpp chat harness on RTX 3090, temperature 0, no thinking mode. Full per-question artifacts are in benchmark_results/v3/.
| Benchmark | v3 (11 GB) | v1 (12 GB) | Q3_K_M (15.6 GB) |
|-----------|:---:|:---:|:---:|
| ARC-Challenge | 95.8% | 94.8% | 96.1% |
| HellaSwag | 92.3% | 91.5% | 91.5% |
| MMLU-Redux | 75.0% | 73.9% | 74.1% |
| HumanEval base | 70.7% | — | 64.0% |
| HumanEval+ | 65.2% | — | 56.7% |
| Vision smoke (36 images) | 100% | 100% | — |
v3 at 11 GB is 29% smaller than stock Q3_K_M (15.6 GB) while outperforming it on 4 of the 5 measured benchmarks (ARC is the one it loses; the vision check has no Q3_K_M baseline to compare). The Q2_K regularization effect on gate/mixing weights actively improves downstream task performance despite reducing perplexity.
v3 Allocation
| Group | Precision | Rationale |
|-------|-----------|-----------|
| attn_qkv | Q3_K_M | Critical for vision and attention routing |
| ssm_out | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) |
| ffn_gate_exps | Q2_K | Q2_K regularization outperforms Q3_K_M |
| ffn_up_exps | Q2_K | Q2_K regularization outperforms Q3_K_M |
| ffn_down_exps | Q2_K | Acceptable loss for size savings |
| ffn_gate_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M |
| ffn_up_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M |
| ffn_down_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M |
| attn_gate | Q2_K | Q2_K regularization outperforms Q3_K_M |
| ssm_alpha, ssm_beta | Q2_K | Q2_K regularization outperforms Q3_K_M |
Protected: all norms (F32), SSM state params (F32), router tensors (default).
Ablation Data
Full ablation methodology and results are in the ablation/ directory:
group_ablation_results.log— Forward ablation: demote each group to Q2_K, measure PPLreverse_ablation_results.log— Reverse ablation: from fully-demoted v1, restore each groupcerebellum_v3_overrides.txt— The 360-line tensor type override file used for v3
Key finding from reverse ablation: 7 of 10 groups perform better at Q2_K than Q3_K_M — imatrix-guided Q2_K acts as beneficial regularization on gate, mixing, and shared expert weights.
Usage
# v3 (recommended, 11 GB)
llama-server --model Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf \
--mmproj mmproj-F16.gguf --n-gpu-layers 99 --ctx-size 8192
# v1 (legacy, 12 GB)
llama-server --model Qwen3.6-35B-A3B-Cerebellum-Q3_K_M.gguf \
--mmproj mmproj-F16.gguf --n-gpu-layers 99 --ctx-size 8192
Files
| File | Size | Description |
|------|------|-------------|
| Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf | 11 GB | v3 — recommended, 29% smaller than Q3_K_M |
| Qwen3.6-35B-A3B-Cerebellum-Q3_K_M.gguf | 12 GB | v1 — legacy |
| mmproj-F16.gguf | 858 MB | Vision projection (F16) |
| benchmark_results/v3/ | — | Full benchmark JSON artifacts for v3 |
| ablation/ | — | Ablation logs and override files |
Methodology
Built with Cerebellum — sensitivity-guided mixed-precision quantization. v3 uses unsloth coder imatrix for importance-weighted quantization within each precision level.
Quantized by @deucebucket.
Independent records
This build has a recorded data point in club-3090's BENCHMARKS (author-rig numbers from a full report.sh --full chain: bench n=5, verify-full pass, soak-continuous pass). The same report led to a correction of their engine support table for this model (issue #390, PR #393). The numbers there are author-reported, not club-validated.
Run deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models