GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF overview

Qwen 3.6 35B A3B — Cerebellum GGUF Sensitivity guided mixed precision quantization of Qwen/Qwen3.6 35B A3B https://huggingface.co/Qwen/Qwen3.6 35B A3B . Two va…

ggufGGUFqwen3qwenquantizedcerebellumimatrixmoemixed-precision3-bitconversationaltext-generationbase_model:Qwen/Qwen3.6-35B-A3Bbase_model:quantized:Qwen/Qwen3.6-35B-A3Blicense:apache-2.0model-indexendpoints_compatibleregion:us

Runs locally from ~857.6 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
809
Likes
10
Pipeline
text-generation

Repository Files & Downloads

3 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
Qwen3.6-35B-A3B-Cerebellum-Q3_K_M.ggufGGUFQ3_K_M11.02 GBDownload
Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.ggufGGUFQ3_K_M11.13 GBDownload
mmproj-F16.ggufGGUFF16857.6 MBDownload

Model Details

Model IDdeucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF
Authordeucebucket
Pipelinetext-generation
Licenseapache-2.0
Base modelQwen/Qwen3.6-35B-A3B
Last modified2026-06-13T00:57:58.000Z

Model README

---

license: apache-2.0

library_name: gguf

base_model: Qwen/Qwen3.6-35B-A3B

base_model_relation: quantized

model_name: Qwen3.6-35B-A3B-Cerebellum-GGUF

model_creator: Qwen

model_type: qwen3

quantized_by: deucebucket

pipeline_tag: text-generation

tags:

- GGUF

- qwen3

- qwen

- quantized

- cerebellum

- imatrix

- moe

- mixed-precision

- 3-bit

- conversational

model-index:

  • name: Qwen3.6-35B-A3B-Cerebellum-GGUF

results:

- task:

name: Text Generation

type: text-generation

dataset:

name: AI2 Reasoning Challenge

type: ai2_arc

config: ARC-Challenge

split: test

metrics:

- name: normalized accuracy

type: acc_norm

value: 0.958

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results

- task:

name: Text Generation

type: text-generation

dataset:

name: HellaSwag

type: hellaswag

split: validation

metrics:

- name: accuracy

type: acc

value: 0.923

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results

- task:

name: Text Generation

type: text-generation

dataset:

name: MMLU-Redux

type: cais/mmlu

config: all

split: test

metrics:

- name: accuracy

type: acc

value: 0.75

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results

- task:

name: Text Generation

type: text-generation

dataset:

name: HumanEval+ (pass@1)

type: openai_humaneval

split: test

metrics:

- name: pass@1

type: pass@1

value: 0.652

source:

name: Local benchmark run (RTX 3090, llama.cpp)

url: https://huggingface.co/deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF/tree/main/benchmark_results

---

Qwen 3.6 35B-A3B — Cerebellum GGUF

Sensitivity-guided mixed-precision quantization of Qwen/Qwen3.6-35B-A3B. Two variants available:

| Variant | File | Size | BPW |

|---------|------|------|-----|

| Cerebellum v3 (recommended) | Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf | 11 GB | 2.76 |

| Cerebellum v1 (legacy) | Qwen3.6-35B-A3B-Cerebellum-Q3_K_M.gguf | 12 GB | 2.73 |

Cerebellum measures which weight groups survive extreme compression and which don't, then writes a single GGUF with per-tensor precision assignments. v3 uses 360 tensor-level overrides guided by group ablation and reverse ablation analysis.

Benchmarks

All benchmarks measured directly on these GGUF files using llama.cpp inference with cleaned evaluation harness.

The model-index metadata in this card's frontmatter mirrors the recommended v3 build. Protocol: local llama.cpp chat harness on RTX 3090, temperature 0, no thinking mode. Full per-question artifacts are in benchmark_results/v3/.

| Benchmark | v3 (11 GB) | v1 (12 GB) | Q3_K_M (15.6 GB) |

|-----------|:---:|:---:|:---:|

| ARC-Challenge | 95.8% | 94.8% | 96.1% |

| HellaSwag | 92.3% | 91.5% | 91.5% |

| MMLU-Redux | 75.0% | 73.9% | 74.1% |

| HumanEval base | 70.7% | — | 64.0% |

| HumanEval+ | 65.2% | — | 56.7% |

| Vision smoke (36 images) | 100% | 100% | — |

v3 at 11 GB is 29% smaller than stock Q3_K_M (15.6 GB) while outperforming it on 4 of the 5 measured benchmarks (ARC is the one it loses; the vision check has no Q3_K_M baseline to compare). The Q2_K regularization effect on gate/mixing weights actively improves downstream task performance despite reducing perplexity.

v3 Allocation

| Group | Precision | Rationale |

|-------|-----------|-----------|

| attn_qkv | Q3_K_M | Critical for vision and attention routing |

| ssm_out | Q3_K_M | Most sensitive tensor per ablation (+0.24 PPL) |

| ffn_gate_exps | Q2_K | Q2_K regularization outperforms Q3_K_M |

| ffn_up_exps | Q2_K | Q2_K regularization outperforms Q3_K_M |

| ffn_down_exps | Q2_K | Acceptable loss for size savings |

| ffn_gate_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M |

| ffn_up_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M |

| ffn_down_shexp | Q2_K | Q2_K regularization outperforms Q3_K_M |

| attn_gate | Q2_K | Q2_K regularization outperforms Q3_K_M |

| ssm_alpha, ssm_beta | Q2_K | Q2_K regularization outperforms Q3_K_M |

Protected: all norms (F32), SSM state params (F32), router tensors (default).

Ablation Data

Full ablation methodology and results are in the ablation/ directory:

  • group_ablation_results.log — Forward ablation: demote each group to Q2_K, measure PPL
  • reverse_ablation_results.log — Reverse ablation: from fully-demoted v1, restore each group
  • cerebellum_v3_overrides.txt — The 360-line tensor type override file used for v3

Key finding from reverse ablation: 7 of 10 groups perform better at Q2_K than Q3_K_M — imatrix-guided Q2_K acts as beneficial regularization on gate, mixing, and shared expert weights.

Usage

# v3 (recommended, 11 GB)
llama-server --model Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf \
  --mmproj mmproj-F16.gguf --n-gpu-layers 99 --ctx-size 8192

# v1 (legacy, 12 GB)
llama-server --model Qwen3.6-35B-A3B-Cerebellum-Q3_K_M.gguf \
  --mmproj mmproj-F16.gguf --n-gpu-layers 99 --ctx-size 8192

Files

| File | Size | Description |

|------|------|-------------|

| Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf | 11 GB | v3 — recommended, 29% smaller than Q3_K_M |

| Qwen3.6-35B-A3B-Cerebellum-Q3_K_M.gguf | 12 GB | v1 — legacy |

| mmproj-F16.gguf | 858 MB | Vision projection (F16) |

| benchmark_results/v3/ | — | Full benchmark JSON artifacts for v3 |

| ablation/ | — | Ablation logs and override files |

Methodology

Built with Cerebellum — sensitivity-guided mixed-precision quantization. v3 uses unsloth coder imatrix for importance-weighted quantization within each precision level.

Quantized by @deucebucket.

Independent records

This build has a recorded data point in club-3090's BENCHMARKS (author-rig numbers from a full report.sh --full chain: bench n=5, verify-full pass, soak-continuous pass). The same report led to a correction of their engine support table for this model (issue #390, PR #393). The numbers there are author-reported, not club-validated.

Run deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models