Model Intelligence Sheet

sandeshrajx/qwen3.5-24b-a3b-reap-0.32-gguf overview

transformers, gguf, qwen, MOE, pruning, compression, GGUF, text-generation, en, arxiv:2510.13999, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

Downloads

2,081

Likes

9

Pipeline

text-generation

Library

transformers

Visibility

Public

Access

Open

Repository Files & Downloads

5 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Qwen3.5-18B-REAP-A3B-Coding-IQ4_K_M.gguf	GGUF	IQ4_K_M	11.28 GB	Download
Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf	GGUF	IQ4_K_M	14.62 GB	Download
Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_S.gguf	GGUF	IQ4_K_S	13.87 GB	Download
Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf	GGUF	Q4_K_M	13.94 GB	Download
Qwen3.5-25B-REAP-A3B-Coding-IQ4_K_S.gguf	GGUF	IQ4_K_S	14.20 GB	Download
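
The files in the table above can be fetched over HTTPS. The following is a minimal sketch, assuming the usual Hugging Face `resolve/<revision>/<filename>` URL layout and a `main` revision; the repository id is taken from the source payload on this page, and the helper names `gguf_url` and `smallest_file` are hypothetical.

```python
from urllib.parse import quote

# Repository id as it appears in the source payload on this page.
REPO_ID = "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF"

# File names and listed sizes (GB) copied from the table above.
FILES = {
    "Qwen3.5-18B-REAP-A3B-Coding-IQ4_K_M.gguf": 11.28,
    "Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf": 14.62,
    "Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_S.gguf": 13.87,
    "Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf": 13.94,
    "Qwen3.5-25B-REAP-A3B-Coding-IQ4_K_S.gguf": 14.20,
}


def gguf_url(filename, repo_id=REPO_ID, revision="main"):
    """Build a direct-download URL (assumed resolve/<revision>/ layout)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(filename)}"


def smallest_file(files=FILES):
    """Return the name of the file with the smallest listed size."""
    return min(files, key=files.get)


if __name__ == "__main__":
    # Print the files from smallest to largest with their assumed URLs.
    for name, size_gb in sorted(FILES.items(), key=lambda kv: kv[1]):
        print(f"{size_gb:6.2f} GB  {gguf_url(name)}")
```

If the `huggingface_hub` package is installed, `hf_hub_download(repo_id=REPO_ID, filename=...)` is an alternative that also handles caching.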

Model Details

Model Slug

sandeshrajx/qwen3.5-24b-a3b-reap-0.32-gguf

Author

sandeshrajx

Pipeline Task

text-generation

Library

transformers

Created

2026-03-04

Last Modified

2026-03-07

Gated

No

Private

No

HF SHA

c2f0bed72cb8be3f99f193fda9b444c3a21d5816

License

apache-2.0

Language

en

Base Model

Qwen/Qwen3.5-35B-A3B

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "language": [
      "en"
    ],
    "library_name": "transformers",
    "tags": [
      "qwen",
      "MOE",
      "pruning",
      "compression",
      "GGUF"
    ],
    "license": "apache-2.0",
    "name": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32",
    "description": "This model was obtained by uniformly pruning 32% of experts in Qwen3.5-35B-A3B using the REAP method.\n",
    "pipeline_tag": "text-generation",
    "base_model": [
      "Qwen/Qwen3.5-35B-A3B"
    ],
    "frontmatter": {
      "language": [
        "en"
      ],
      "library_name": "transformers",
      "tags": [
        "qwen",
        "MOE",
        "pruning",
        "compression",
        "GGUF"
      ],
      "license": "apache-2.0",
      "name": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32",
      "description": ">",
      "pipeline_tag": "text-generation",
      "base_model": [
        "Qwen/Qwen3.5-35B-A3B"
      ]
    },
    "hero_image_url": "",
    "summary": "",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlanguage:\n- en\nlibrary_name: transformers\ntags:\n- qwen\n- MOE\n- pruning\n- compression\n- GGUF\nlicense: apache-2.0\nname: sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32\ndescription: >\n  This model was obtained by uniformly pruning 32% of experts in Qwen3.5-35B-A3B using the REAP method.\npipeline_tag: text-generation\nbase_model:\n- Qwen/Qwen3.5-35B-A3B\n---\n\n<p align=\"center\">\n  <em>𓌳 <strong>REAP</strong>𓌳  the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>\n</p>\n\n# Qwen3.5-24B-A3B-REAP-0.32\n\n## ✨ Highlights\n\nIntroducing **Qwen3.5-24B-A3B-REAP-0.32**, a **memory-efficient compressed variant** of Qwen3.5-35B-A3B that maintains the core reasoning and coding capabilities of the architecture while being **32% lighter**.\n\nThis model was created using **REAP (Router-weighted Expert Activation Pruning)**, a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. 
Key features include:\n\n- **Aggressive Compression**: 32% reduction in expert count, bringing the total parameter count down to approximately 24B.\n- **3B Active Parameters**: Maintains the same computational efficiency during inference as the original model (3B parameters activated per token).\n- **High-Precision GGUF**: Includes optimized quants using an importance matrix (imatrix) and custom tensor precision recipes.\n- **Drop-in Compatibility**: Fully compatible with the latest `transformers` (from source) and `vLLM`.\n- **Orchestration Scripts**: Full pipeline available at [sandeshrajbhandari/reap-qwen3.5-modal](https://github.com/sandeshrajbhandari/reap-qwen3.5-modal).\n\n---\n\n## 📋 Model Overview\n\n**Qwen3.5-24B-A3B-REAP-0.32** has the following specifications:\n\n- **Base Model**: [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)\n- **Compression Method**: REAP (Router-weighted Expert Activation Pruning)\n- **Compression Ratio**: 32% expert pruning\n- **Type**: Sparse Mixture-of-Experts (SMoE) Causal Language Model\n- **Number of Parameters**: ~24B total, ~3B activated per token\n- **Number of Experts**: 175 (uniformly pruned from 256)\n- **Number of Activated Experts**: 8 per token\n- **License**: Apache 2.0\n\n---\n\n## 📂 Repository Contents\n\nThis repository contains the following artifacts:\n\n- **`Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf`**: High-precision 4-bit quant using the Unsloth-style recipe (imatrix + Q8_0 overrides for critical tensors).\n- **`Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_S.gguf`**: Smaller 4-bit quant variant.\n- **`Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf`**: Naive Q4_K_M quant. 
\n- **`imatrix.dat`**: The importance matrix used for quantization.\n- **`calibration_data_v5_rc.txt`**: The calibration corpus used to generate the imatrix.\n\n---\n\n## 🚀 Deployment\n\n### Transformers\nSince Qwen 3.5 MoE is a new architecture, ensure you are using the latest `transformers` from source:\n\n```bash\npip install git+https://github.com/huggingface/transformers.git\n```\n\n### vLLM\nYou can deploy the model directly using **vLLM**:\n\n```bash\nvllm serve sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32 \\\n    --enable-expert-parallel\n```\n\n### GGUF (llama.cpp)\nOptimized GGUF versions are available in this repository. We recommend using the `IQ4_K_M` variant for the best balance of size and performance.\n\n---\n\n## 🧩 Model Creation\n\n### How REAP Works\nREAP selects experts to prune based on a **saliency criterion** that considers router gate values and expert activation norms. This ensures that only experts contributing minimally to the model's internal representations are removed.\n\n### Infrastructure\nThe project utilized **Modal** for high-memory compute (A100-80GB) and a custom fork of the REAP library.\n- **Orchestration Code**: [reap-qwen3.5-modal](https://github.com/sandeshrajbhandari/reap-qwen3.5-modal)\n- **Library Fork**: [sandeshrajbhandari/reap](https://github.com/sandeshrajbhandari/reap/tree/feat/qwen3.5-moe-support)\n\n### ⚠️ Caveats & Future Work\n- **Compute Constraints**: Due to current memory limitations, the model was calibrated with a context length of **1024 tokens** and a limited sample size. \n- **Room for Optimization**: There is significant room for improvement by using larger sample sizes and the full 2048/4096 context length. 
The current REAP fork for Qwen 3.5 still hits OOM on 80GB VRAM at 2048 context length during activation profiling, which is a target for future optimization.\n\n---\n\n## 📚 References & Resources\n\n### 🔧 GGUF & Quantization Guides\n- [Overview of GGUF quantization methods](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)\n- [Quant Cookers Basic Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/434)\n\n### 📊 Benchmarks & Comparisons\n- [Qwen3.5-35B-A3B Q4 Quantization Comparison](https://www.reddit.com/r/LocalLLaMA/comments/1rfds1h/qwen3535ba3b_q4_quantization_comparison/)\n- [Qwen3.5 GGUF Benchmarks (Unsloth)](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)\n\n### 🎯 Research\n- [REAP arXiv Preprint](https://arxiv.org/abs/2510.13999)\n- [REAP Blog (Cerebras)](https://www.cerebras.ai/blog/reap-one-shot-pruning-for-trillion-parameter-mixture-of-experts-models)\n\n---\n\n## ⚖️ License\n\nThis model is derived from **[`Qwen3.5-35B-A3B`](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)** and distributed under the **Apache 2.0 License**.\n\n---\n\n## 🧾 Citation\n\n```bibtex\n@article{lasby-reap,\n  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},\n  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},\n  journal={arXiv preprint arXiv:2510.13999},\n  year={2025}\n}\n```\n",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "qwen",
    "MOE",
    "pruning",
    "compression",
    "GGUF",
    "text-generation",
    "en",
    "arxiv:2510.13999",
    "base_model:Qwen/Qwen3.5-35B-A3B",
    "base_model:quantized:Qwen/Qwen3.5-35B-A3B",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "imatrix",
    "conversational"
  ],
  "likes": 9,
  "downloads": 2081,
  "gated": false,
  "private": false,
  "last_modified": "2026-03-07T01:31:41.000Z",
  "created_at": "2026-03-04T08:51:10.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "transformers"
}
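
The model card embedded in the metadata above describes REAP as scoring experts by router gate values and expert activation norms, then uniformly pruning 32% of the 256 experts. The toy sketch below illustrates that idea and is not the author's pipeline. The score used here (gate value times output norm, averaged over the tokens routed to each expert) is one reading of the card's description, all function names are hypothetical, and rounding the pruned count down is an assumption chosen because it reproduces the card's figure of 175 retained experts.

```python
import math


def experts_kept(num_experts, prune_ratio):
    """Experts left after pruning; the pruned count is rounded down (assumption)."""
    return num_experts - math.floor(num_experts * prune_ratio)


def saliency_scores(routed, num_experts):
    """Average gate * ||expert output|| over the tokens routed to each expert.

    `routed` holds (expert_index, gate_value, expert_output_vector) records,
    one per (token, selected expert) pair from a calibration pass.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert, gate, output in routed:
        norm = math.sqrt(sum(v * v for v in output))
        totals[expert] += gate * norm
        counts[expert] += 1
    # Experts that were never selected get a score of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def select_experts(scores, prune_ratio):
    """Return the sorted indices of the highest-saliency experts to keep."""
    keep = experts_kept(len(scores), prune_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    print(experts_kept(256, 0.32))  # 256 - floor(81.92) = 175

    # Toy calibration records for a 4-expert layer.
    toy = [
        (0, 0.6, [3.0, 4.0]),
        (1, 0.4, [0.1, 0.0]),
        (2, 0.7, [1.0, 0.0]),
        (0, 0.5, [0.0, 2.0]),
        (3, 0.3, [0.0, 0.5]),
    ]
    scores = saliency_scores(toy, num_experts=4)
    print(select_experts(scores, prune_ratio=0.25))
```

In the toy data, expert 1 has the lowest score (a small gate value on a near-zero output), so it is the one dropped when a quarter of the four experts is pruned.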

Source payload excerpt (from Hugging Face API)

{
  "_id": "69a7f27ec4dac4ed616af317",
  "id": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF",
  "modelId": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF",
  "sha": "c2f0bed72cb8be3f99f193fda9b444c3a21d5816",
  "createdAt": "2026-03-04T08:51:10.000Z",
  "lastModified": "2026-03-07T01:31:41.000Z",
  "author": "sandeshrajx",
  "downloads": 2081,
  "likes": 9,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "transformers",
  "siblings_count": 9
}
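
The `createdAt` and `lastModified` values in the excerpt above are UTC timestamps with millisecond precision, and the "Created" and "Last Modified" fields earlier on the page are their date parts. A small sketch of reading them follows, assuming the payload has been loaded into a dictionary; `PAYLOAD` repeats only a few fields from the excerpt, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# A subset of the payload excerpt above, copied verbatim.
PAYLOAD = {
    "id": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF",
    "createdAt": "2026-03-04T08:51:10.000Z",
    "lastModified": "2026-03-07T01:31:41.000Z",
    "downloads": 2081,
    "likes": 9,
}


def parse_hf_timestamp(value):
    """Parse 'YYYY-MM-DDTHH:MM:SS.mmmZ' into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def summarize(payload):
    """Reduce the payload to display dates and the time between them."""
    created = parse_hf_timestamp(payload["createdAt"])
    modified = parse_hf_timestamp(payload["lastModified"])
    return {
        "id": payload["id"],
        "created": created.date().isoformat(),
        "last_modified": modified.date().isoformat(),
        "age_at_last_update": modified - created,
    }


if __name__ == "__main__":
    for key, value in summarize(PAYLOAD).items():
        print(f"{key}: {value}")
```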