Model Intelligence Sheet
sandeshrajx/qwen3.5-24b-a3b-reap-0.32-gguf overview
| Field | Value |
|---|---|
| Downloads | 2,081 |
| Likes | 9 |
| Pipeline | text-generation |
| Library | transformers |
| Visibility | Public |
| Access | Open |
Repository Files & Downloads
5 files detected
Direct downloads for all repository files
| File | Type | Quantization | Size |
|---|---|---|---|
| Qwen3.5-18B-REAP-A3B-Coding-IQ4_K_M.gguf | GGUF | IQ4_K_M | 11.28 GB |
| Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf | GGUF | IQ4_K_M | 14.62 GB |
| Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_S.gguf | GGUF | IQ4_K_S | 13.87 GB |
| Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf | GGUF | Q4_K_M | 13.94 GB |
| Qwen3.5-25B-REAP-A3B-Coding-IQ4_K_S.gguf | GGUF | IQ4_K_S | 14.20 GB |
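The file names in the table can be turned into direct-download URLs. The following is a minimal sketch, assuming the standard Hugging Face `resolve` URL layout and that the files live on the `main` branch (the branch name is an assumption, not something this page states). The repository id is taken from the API payload further down this page; `gguf_url` is a hypothetical helper name.

```python
from urllib.parse import quote

# Repository id as reported in the API payload on this page.
REPO_ID = "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF"


def gguf_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hugging Face model repo.

    Assumes the usual https://huggingface.co/<repo>/resolve/<revision>/<file>
    layout; `revision` defaults to "main", which is an assumption here.
    """
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


if __name__ == "__main__":
    # File name copied from the table above.
    print(gguf_url("Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf"))
```

The resulting URL can be passed to any HTTP client. For large files, a resumable downloader is usually a better choice than a plain one-shot request.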
Model Details
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"language": [
"en"
],
"library_name": "transformers",
"tags": [
"qwen",
"MOE",
"pruning",
"compression",
"GGUF"
],
"license": "apache-2.0",
"name": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32",
"description": "This model was obtained by uniformly pruning 32% of experts in Qwen3.5-35B-A3B using the REAP method.\n",
"pipeline_tag": "text-generation",
"base_model": [
"Qwen/Qwen3.5-35B-A3B"
],
"frontmatter": {
"language": [
"en"
],
"library_name": "transformers",
"tags": [
"qwen",
"MOE",
"pruning",
"compression",
"GGUF"
],
"license": "apache-2.0",
"name": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32",
"description": ">",
"pipeline_tag": "text-generation",
"base_model": [
"Qwen/Qwen3.5-35B-A3B"
]
},
"hero_image_url": "",
"summary": "",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\nlanguage:\n- en\nlibrary_name: transformers\ntags:\n- qwen\n- MOE\n- pruning\n- compression\n- GGUF\nlicense: apache-2.0\nname: sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32\ndescription: >\n This model was obtained by uniformly pruning 32% of experts in Qwen3.5-35B-A3B using the REAP method.\npipeline_tag: text-generation\nbase_model:\n- Qwen/Qwen3.5-35B-A3B\n---\n\n<p align=\"center\">\n <em><strong>REAP</strong> the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>\n</p>\n\n# Qwen3.5-24B-A3B-REAP-0.32\n\n## Highlights\n\nIntroducing **Qwen3.5-24B-A3B-REAP-0.32**, a **memory-efficient compressed variant** of Qwen3.5-35B-A3B that maintains the core reasoning and coding capabilities of the architecture while being **32% lighter**.\n\nThis model was created using **REAP (Router-weighted Expert Activation Pruning)**, a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. 
Key features include:\n\n- **Aggressive Compression**: 32% reduction in expert count, bringing the total parameter count down to approximately 24B.\n- **3B Active Parameters**: Maintains the same computational efficiency during inference as the original model (3B parameters activated per token).\n- **High-Precision GGUF**: Includes optimized quants using an importance matrix (imatrix) and custom tensor precision recipes.\n- **Drop-in Compatibility**: Fully compatible with the latest `transformers` (from source) and `vLLM`.\n- **Orchestration Scripts**: Full pipeline available at [sandeshrajbhandari/reap-qwen3.5-modal](https://github.com/sandeshrajbhandari/reap-qwen3.5-modal).\n\n---\n\n## Model Overview\n\n**Qwen3.5-24B-A3B-REAP-0.32** has the following specifications:\n\n- **Base Model**: [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)\n- **Compression Method**: REAP (Router-weighted Expert Activation Pruning)\n- **Compression Ratio**: 32% expert pruning\n- **Type**: Sparse Mixture-of-Experts (SMoE) Causal Language Model\n- **Number of Parameters**: ~24B total, ~3B activated per token\n- **Number of Experts**: 175 (uniformly pruned from 256)\n- **Number of Activated Experts**: 8 per token\n- **License**: Apache 2.0\n\n---\n\n## Repository Contents\n\nThis repository contains the following artifacts:\n\n- **`Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_M.gguf`**: High-precision 4-bit quant using the Unsloth-style recipe (imatrix + Q8_0 overrides for critical tensors).\n- **`Qwen3.5-24B-A3B-REAP-0.32-IQ4_K_S.gguf`**: Smaller 4-bit quant variant.\n- **`Qwen3.5-24B-A3B-REAP-0.32-Q4_K_M.gguf`**: Naive Q4_K_M quant. 
\n- **`imatrix.dat`**: The importance matrix used for quantization.\n- **`calibration_data_v5_rc.txt`**: The calibration corpus used to generate the imatrix.\n\n---\n\n## Deployment\n\n### Transformers\nSince Qwen 3.5 MoE is a new architecture, ensure you are using the latest `transformers` from source:\n\n```bash\npip install git+https://github.com/huggingface/transformers.git\n```\n\n### vLLM\nYou can deploy the model directly using **vLLM**:\n\n```bash\nvllm serve sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32 \\\n --enable-expert-parallel\n```\n\n### GGUF (llama.cpp)\nOptimized GGUF versions are available in this repository. We recommend using the `IQ4_K_M` variant for the best balance of size and performance.\n\n---\n\n## Model Creation\n\n### How REAP Works\nREAP selects experts to prune based on a **saliency criterion** that considers router gate values and expert activation norms. This ensures that only experts contributing minimally to the model's internal representations are removed.\n\n### Infrastructure\nThe project utilized **Modal** for high-memory compute (A100-80GB) and a custom fork of the REAP library.\n- **Orchestration Code**: [reap-qwen3.5-modal](https://github.com/sandeshrajbhandari/reap-qwen3.5-modal)\n- **Library Fork**: [sandeshrajbhandari/reap](https://github.com/sandeshrajbhandari/reap/tree/feat/qwen3.5-moe-support)\n\n### Caveats & Future Work\n- **Compute Constraints**: Due to current memory limitations, the model was calibrated with a context length of **1024 tokens** and a limited sample size. \n- **Room for Optimization**: There is significant room for improvement by using larger sample sizes and the full 2048/4096 context length. 
The current REAP fork for Qwen 3.5 still hits OOM on 80GB VRAM at 2048 context length during activation profiling, which is a target for future optimization.\n\n---\n\n## References & Resources\n\n### GGUF & Quantization Guides\n- [Overview of GGUF quantization methods](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)\n- [Quant Cookers Basic Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/434)\n\n### Benchmarks & Comparisons\n- [Qwen3.5-35B-A3B Q4 Quantization Comparison](https://www.reddit.com/r/LocalLLaMA/comments/1rfds1h/qwen3535ba3b_q4_quantization_comparison/)\n- [Qwen3.5 GGUF Benchmarks (Unsloth)](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)\n\n### Research\n- [REAP arXiv Preprint](https://arxiv.org/abs/2510.13999)\n- [REAP Blog (Cerebras)](https://www.cerebras.ai/blog/reap-one-shot-pruning-for-trillion-parameter-mixture-of-experts-models)\n\n---\n\n## License\n\nThis model is derived from **[`Qwen3.5-35B-A3B`](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)** and distributed under the **Apache 2.0 License**.\n\n---\n\n## Citation\n\n```bibtex\n@article{lasby-reap,\n title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},\n author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},\n journal={arXiv preprint arXiv:2510.13999},\n year={2025}\n}\n```\n",
"related_quantizations": []
},
"tags": [
"transformers",
"gguf",
"qwen",
"MOE",
"pruning",
"compression",
"GGUF",
"text-generation",
"en",
"arxiv:2510.13999",
"base_model:Qwen/Qwen3.5-35B-A3B",
"base_model:quantized:Qwen/Qwen3.5-35B-A3B",
"license:apache-2.0",
"endpoints_compatible",
"region:us",
"imatrix",
"conversational"
],
"likes": 9,
"downloads": 2081,
"gated": false,
"private": false,
"last_modified": "2026-03-07T01:31:41.000Z",
"created_at": "2026-03-04T08:51:10.000Z",
"pipeline_tag": "text-generation",
"library_name": "transformers"
}
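The README embedded in the metadata above describes REAP as ranking experts by a saliency criterion built from router gate values and expert activation norms, then pruning 32% of the 256 experts in each layer to leave 175. The sketch below illustrates one plausible reading of that description on toy data. It is not the REAP library's code: the per-expert score (mean of gate value times activation norm over the tokens routed to that expert), the floor rounding of the prune count, and the names `expert_saliency` and `experts_to_prune` are all assumptions made for illustration. The floor rounding was chosen only because it reproduces the card's figure of 175.

```python
import math
from collections import defaultdict


def expert_saliency(records):
    """Average gate * activation-norm per expert.

    `records` is an iterable of (expert_id, gate_value, activation_norm),
    one entry per (token, selected expert) pair seen during calibration.
    Experts that were never routed to do not appear in the result.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for expert_id, gate, norm in records:
        total[expert_id] += gate * norm
        count[expert_id] += 1
    return {e: total[e] / count[e] for e in total}


def experts_to_prune(saliency, ratio):
    """Return the ids of the lowest-saliency experts, floor(ratio * N) of them."""
    n_prune = math.floor(len(saliency) * ratio)
    # Ties are broken by expert id so the result is deterministic.
    ranked = sorted(saliency, key=lambda e: (saliency[e], e))
    return ranked[:n_prune]


# Toy calibration trace for a single layer with 4 experts (made-up numbers).
records = [
    (0, 0.5, 2.0),
    (0, 0.3, 2.0),
    (1, 0.1, 1.0),
    (2, 0.4, 4.0),
    (3, 0.2, 1.0),
]

if __name__ == "__main__":
    scores = expert_saliency(records)
    print("pruned experts:", experts_to_prune(scores, 0.5))

    # Expert-count arithmetic for the figures quoted in the model card.
    n_experts, ratio = 256, 0.32
    n_pruned = math.floor(n_experts * ratio)
    print("experts kept per layer:", n_experts - n_pruned)  # 175
```

Under this reading, "uniform" pruning means the same ratio is applied independently in every MoE layer, so each layer ends up with the same number of experts even though the specific experts removed differ from layer to layer.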
Source payload excerpt (from Hugging Face API)
{
"_id": "69a7f27ec4dac4ed616af317",
"id": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF",
"modelId": "sandeshrajx/Qwen3.5-24B-A3B-REAP-0.32-GGUF",
"sha": "c2f0bed72cb8be3f99f193fda9b444c3a21d5816",
"createdAt": "2026-03-04T08:51:10.000Z",
"lastModified": "2026-03-07T01:31:41.000Z",
"author": "sandeshrajx",
"downloads": 2081,
"likes": 9,
"gated": false,
"private": false,
"pipeline_tag": "text-generation",
"library_name": "transformers",
"siblings_count": 9
}