0xsero/qwen3.5-122b-a10b-reap-20-gguf is indexed on GraySoft with repository links, GGUF quant files (Q4_K_M, Q6_K, and Q8_0), and Hugging Face metadata. This page helps you pick a local model for guIDE or other runtimes.
Model Intelligence Sheet
0xsero/qwen3.5-122b-a10b-reap-20-gguf overview
GGUF quantizations of 0xSero/Qwen3.5-122B-A10B-REAP-20, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.
Downloads: 5,302
Likes: 5
Pipeline: not set
Library: not set
Visibility: Public
Access: Open
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"language": [
"en"
],
"license": "other",
"tags": [
"gguf",
"moe",
"pruned",
"reap",
"qwen3.5",
"expert-pruning",
"llama-cpp",
"strix-halo"
],
"base_model": "0xSero/Qwen3.5-122B-A10B-REAP-20",
"quantized_by": "0xSero",
"frontmatter": {
"language": [
"en"
],
"license": "other",
"tags": [
"gguf",
"moe",
"pruned",
"reap",
"qwen3.5",
"expert-pruning",
"llama-cpp",
"strix-halo"
],
"base_model": "0xSero/Qwen3.5-122B-A10B-REAP-20",
"quantized_by": "0xSero"
},
"hero_image_url": "",
"summary": "GGUF quantizations of 0xSero/Qwen3.5-122B-A10B-REAP-20, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\nlanguage:\n- en\nlicense: other\ntags:\n- gguf\n- moe\n- pruned\n- reap\n- qwen3.5\n- expert-pruning\n- llama-cpp\n- strix-halo\nbase_model: 0xSero/Qwen3.5-122B-A10B-REAP-20\nquantized_by: 0xSero\n---\n> [!TIP]\n> Support this work: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**\n> \n> REAP surfaces: [GLM](https://huggingface.co/spaces/0xSero/reap-glm-family) | [MiniMax](https://huggingface.co/spaces/0xSero/reap-minimax-family) | [Qwen](https://huggingface.co/spaces/0xSero/reap-qwen-family) | [Gemma](https://huggingface.co/spaces/0xSero/reap-gemma-family) | [Paper](https://arxiv.org/abs/2510.13999) | [Code](https://github.com/CerebrasResearch/reap) | [PR17](https://github.com/CerebrasResearch/reap/pull/17) | [Cerebras Collection](https://huggingface.co/collections/cerebras/cerebras-reap)\n\n# Qwen3.5-122B-A10B-REAP-20 — GGUF\n\nGGUF quantizations of [0xSero/Qwen3.5-122B-A10B-REAP-20](https://huggingface.co/0xSero/Qwen3.5-122B-A10B-REAP-20), a 20% expert-pruned Qwen3.5-122B MoE model using [REAP](https://arxiv.org/abs/2510.13999).\n\n## Available Quantizations\n\n| File | Quant | BPW | Size | Description |\n|------|-------|-----|------|-------------|\n| `Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf` | Q4_K_M | 4.86 | 57 GB | Best speed-to-quality ratio. Fits in 64 GB GTT. |\n| `Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf` | Q6_K | 6.57 | 76 GB | Higher quality. Needs 80+ GB VRAM/GTT. |\n| `Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf` | Q8_0 | 8.51 | 99 GB | Near-lossless. Needs 100+ GB VRAM/GTT. 
|\n\n## Model Details\n\n| Property | Value |\n|----------|-------|\n| **Base Model** | [Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) |\n| **Pruned Model** | [0xSero/Qwen3.5-122B-A10B-REAP-20](https://huggingface.co/0xSero/Qwen3.5-122B-A10B-REAP-20) |\n| **Architecture** | Qwen3.5 MoE (GDN + Full Attention hybrid) |\n| **Total Parameters** | 99B (205 experts/layer, down from 256) |\n| **Active Parameters** | ~10B per token (8 experts selected) |\n| **Context Length** | 262,144 tokens |\n| **Thinking Mode** | Yes (reasoning_content in chat completions) |\n| **Pruning Method** | REAP — 20% expert removal with super-expert protection |\n| **Quantization Tool** | llama.cpp (llama-quantize) |\n| **Converted From** | Safetensors (BF16) via llama.cpp convert_hf_to_gguf.py |\n\n## Speed Benchmarks\n\nTested on AMD Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S (gfx1151), 128 GB LPDDR5X. llama.cpp b8746, Vulkan RADV, Flash Attention ON.\n\n### llama-bench (pp512 / tg128)\n\n| Quant | GPU Layers | Prefill (t/s) | Token Gen (t/s) |\n|-------|------------|---------------|-----------------|\n| **Q4_K_M** | 49/49 (full) | 295.74 | **27.56** |\n| Q6_K | 35/49 (partial) | 121.35 | 15.74 |\n| Q8_0 | 25/49 (partial) | 44.55 | 9.89 |\n\n### API Speed (llama-server, real chat completions)\n\n| Quant | Prefill (short) | Prefill (long) | Token Gen |\n|-------|-----------------|----------------|-----------|\n| **Q4_K_M** | 141.8 t/s | 62.3 t/s | **28.4 t/s** |\n| Q6_K | 48.8 t/s | 21.7 t/s | 15.4 t/s |\n| Q8_0 | 25.8 t/s | 14.2 t/s | 9.0 t/s |\n\n> Q6_K and Q8_0 are partially offloaded to CPU because they exceed the default 64 GB GTT limit. 
With GTT increased to 120 GB (BIOS GART + modprobe config), they would run at full GPU speed.\n\n## Quality Benchmarks\n\nTested via llama-server API with thinking mode enabled.\n\n### Reasoning (5 questions — math, calculus, logic, code comprehension, knowledge)\n\n| Quant | Score |\n|-------|-------|\n| **Q4_K_M** | **5/5** |\n| **Q6_K** | **5/5** |\n| **Q8_0** | **5/5** |\n\nAll quants produce correct answers for arithmetic (127*43=5461), calculus (derivative of x^3+2x^2-5x+7), formal logic, Python reference semantics, and factual recall.\n\n### Code Generation (HumanEval subset — 5 problems, executed and tested)\n\n| Quant | Passed |\n|-------|--------|\n| **Q4_K_M** | 4/5 |\n| **Q6_K** | 4/5 |\n| **Q8_0** | 3/5 |\n\nThe model generates correct code for all problems. Score differences are due to code extraction from the thinking format, not model quality.\n\n### Full Benchmarks (safetensors, from base model card)\n\n| Benchmark | Score |\n|-----------|-------|\n| HumanEval | 81.1% |\n| HumanEval+ | 76.8% |\n| MBPP | 86.2% |\n| MBPP+ | 73.0% |\n| ARC Challenge | 63.7% |\n| HellaSwag | 84.1% |\n| TruthfulQA MC2 | 52.4% |\n| Winogrande | 75.5% |\n\nSee the [full model card](https://huggingface.co/0xSero/Qwen3.5-122B-A10B-REAP-20) for complete benchmark results and methodology.\n\n## How to Run\n\n### llama-server (recommended)\n\n```bash\n# Q4_K_M — fits in 64 GB, fastest\nllama-server \\\n -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \\\n -ngl 999 --flash-attn on -c 4096 \\\n --port 8080 --host 0.0.0.0\n\n# With speculative decoding for faster generation\nllama-server \\\n -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \\\n -ngl 999 --flash-attn on -c 4096 \\\n --spec-type ngram-mod --spec-ngram-size-n 24 \\\n --draft-min 48 --draft-max 64 \\\n --port 8080 --host 0.0.0.0\n```\n\n### Ollama\n\n```bash\n# Create a Modelfile\necho 'FROM ./Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf' > Modelfile\nollama create reap20 -f Modelfile\nollama run reap20\n```\n\n### Python 
(llama-cpp-python)\n\n```python\nfrom llama_cpp import Llama\n\nllm = Llama(\n model_path=\"Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf\",\n n_gpu_layers=-1,\n n_ctx=4096,\n flash_attn=True,\n)\n\noutput = llm.create_chat_completion(\n messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n max_tokens=512,\n)\nprint(output[\"choices\"][0][\"message\"][\"content\"])\n```\n\n## Which Quant Should I Use?\n\n| Your Setup | Recommended |\n|------------|-------------|\n| 64 GB VRAM/GTT (e.g., Strix Halo default) | **Q4_K_M** — full GPU offload, 28 t/s |\n| 80-96 GB VRAM/GTT | **Q6_K** — higher quality, full GPU offload |\n| 128+ GB VRAM (e.g., 2x Strix Halo cluster, A100) | **Q8_0** — near-lossless quality |\n| RTX 4090 (24 GB) | Model too large. Use a smaller model. |\n\n## Hardware Notes\n\nThis model was designed for and tested on AMD Strix Halo (Ryzen AI MAX+ 395) with 128 GB unified memory. It also works on any system with sufficient VRAM/RAM:\n\n- **Strix Halo (64 GB GTT default)**: Q4_K_M fits fully, Q6_K/Q8_0 partial offload\n- **Strix Halo (120 GB GTT increased)**: All quants fit fully\n- **2x Strix Halo cluster (RPC)**: All quants at full speed\n- **NVIDIA A100 80GB**: Q4_K_M and Q6_K fit fully\n- **Apple M-series (128 GB)**: All quants should work via Metal\n\n## What is REAP?\n\n[REAP](https://arxiv.org/abs/2510.13999) (Router-weighted Expert Activation Pruning) removes the least-activated experts from Mixture-of-Experts models while preserving critical capabilities. This model has 20% of experts removed (256 -> 205 per layer), retaining **97.9% average capability** across standard benchmarks.\n\n## Credits\n\n- **Pruning**: [0xSero](https://huggingface.co/0xSero) / Sybil Solutions\n- **Base Model**: [Qwen Team](https://huggingface.co/Qwen)\n- **REAP Method**: [arxiv:2510.13999](https://arxiv.org/abs/2510.13999)\n- **Quantization**: llama.cpp\n\n## License\n\nSame license as the base model. 
See [Qwen3.5-122B-A10B license](https://huggingface.co/Qwen/Qwen3.5-122B-A10B).\n",
"related_quantizations": []
},
"tags": [
"gguf",
"moe",
"pruned",
"reap",
"qwen3.5",
"expert-pruning",
"llama-cpp",
"strix-halo",
"en",
"arxiv:2510.13999",
"base_model:0xSero/Qwen3.5-122B-A10B-REAP-20",
"base_model:quantized:0xSero/Qwen3.5-122B-A10B-REAP-20",
"license:other",
"endpoints_compatible",
"region:us",
"conversational"
],
"likes": 5,
"downloads": 5302,
"gated": false,
"private": false,
"last_modified": "2026-04-14T22:46:38.000Z",
"created_at": "2026-04-10T16:27:07.000Z",
"pipeline_tag": "",
"library_name": ""
}
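The file sizes in the README's quantization table can be sanity-checked from its bits-per-weight (BPW) column: size is roughly total parameters times BPW, divided by 8 bits per byte. The sketch below is illustrative only. It assumes the README's rounded "99B" total-parameter figure, reads the listed "GB" sizes as GiB, and ignores runtime overhead such as the KV cache. The helper names `estimate_gib` and `largest_quant_that_fits` are hypothetical and not part of any tool mentioned on this page.

```python
from typing import Optional

# Rounded total-parameter count from the README's Model Details table.
TOTAL_PARAMS = 99e9

# BPW and listed file sizes copied from the README's quantization table.
QUANTS = {
    "Q4_K_M": {"bpw": 4.86, "listed_gb": 57},
    "Q6_K": {"bpw": 6.57, "listed_gb": 76},
    "Q8_0": {"bpw": 8.51, "listed_gb": 99},
}

GIB = 1024 ** 3


def estimate_gib(params: float, bpw: float) -> float:
    """Approximate weight-file size in GiB: params * bits-per-weight / 8 bytes."""
    return params * bpw / 8 / GIB


def largest_quant_that_fits(budget_gb: float) -> Optional[str]:
    """Hypothetical helper: highest-BPW quant whose listed size fits the budget.

    Compares against weights only; context (KV cache) needs extra memory.
    """
    fitting = [(q["bpw"], name) for name, q in QUANTS.items()
               if q["listed_gb"] <= budget_gb]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for name, q in QUANTS.items():
        est = estimate_gib(TOTAL_PARAMS, q["bpw"])
        print(f"{name}: ~{est:.1f} GiB estimated, {q['listed_gb']} GB listed")
    print("64 GB budget:", largest_quant_that_fits(64))
```

Under these assumptions each estimate lands within about 1 GiB of the listed size, which suggests the table's sizes follow directly from the BPW values. Treat the selection helper as a first filter only, since the README itself recommends extra headroom (for example "80+ GB" for the 76 GB Q6_K file).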
Source payload excerpt (from Hugging Face API)
{
"_id": "69d924dbc70f0feb0f15d7cc",
"id": "0xSero/Qwen3.5-122B-A10B-REAP-20-GGUF",
"modelId": "0xSero/Qwen3.5-122B-A10B-REAP-20-GGUF",
"sha": "81cadfaffa3d6b1bd0786cbd4a3e316035a91406",
"createdAt": "2026-04-10T16:27:07.000Z",
"lastModified": "2026-04-14T22:46:38.000Z",
"author": "0xSero",
"downloads": 5302,
"likes": 5,
"gated": false,
"private": false,
"pipeline_tag": "",
"library_name": "",
"siblings_count": 5
}
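The `id` and `sha` fields in the payload above are enough to build a download URL pinned to this exact repository revision, so a later re-upload cannot silently change the file you fetch. The sketch below is a minimal illustration using Hugging Face's `resolve/<revision>/<filename>` URL layout. It assumes the file name from the README's quantization table matches the repository contents, and `resolve_url` is a hypothetical helper name, not a library function.

```python
from urllib.parse import quote

HF_BASE = "https://huggingface.co"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a Hugging Face file URL; pass a commit sha as revision to pin it."""
    return (
        f"{HF_BASE}/{quote(repo_id)}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


# Fields copied from the source payload excerpt.
payload = {
    "id": "0xSero/Qwen3.5-122B-A10B-REAP-20-GGUF",
    "sha": "81cadfaffa3d6b1bd0786cbd4a3e316035a91406",
}

# File name taken from the README's quantization table (assumed to exist as-is).
url = resolve_url(payload["id"], "Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf", payload["sha"])
print(url)
```

If the `huggingface_hub` package is available, `hf_hub_download(repo_id=..., filename=..., revision=...)` accepts the same three values and handles caching and resumable downloads.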