
0xsero/qwen3.5-122b-a10b-reap-20-gguf is indexed on GraySoft with repository links, GGUF quant files (Q4_K_M, Q6_K, and Q8_0), and Hugging Face metadata. Use this page to pick a local model for guIDE or other runtimes. See related models in the same shard below.

Model Intelligence Sheet

0xsero/qwen3.5-122b-a10b-reap-20-gguf overview

GGUF quantizations of 0xSero/Qwen3.5-122B-A10B-REAP-20, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.
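As a quick sanity check on the "20%" figure, the arithmetic can be sketched as follows. The counts (256 experts per layer before pruning, 205 after, 8 selected per token) are taken from the model card stored in the metadata further down this page; the variable names are illustrative only.

```python
# Expert counts as stated in the model card (not measured here).
EXPERTS_BEFORE = 256    # experts per MoE layer in the base model
EXPERTS_AFTER = 205     # experts per MoE layer after REAP pruning
ACTIVE_PER_TOKEN = 8    # experts selected by the router for each token

removed = EXPERTS_BEFORE - EXPERTS_AFTER
pruned_fraction = removed / EXPERTS_BEFORE
active_share = ACTIVE_PER_TOKEN / EXPERTS_AFTER

print(f"removed per layer: {removed}")
print(f"pruned fraction:   {pruned_fraction:.1%}")
print(f"active share:      {active_share:.1%} of the remaining experts")
```

The nominal 20% therefore corresponds to 51 of 256 experts per layer, or about 19.9%.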

Tags: gguf, moe, pruned, reap, qwen3.5, expert-pruning, llama-cpp, strix-halo, en, arxiv:2510.13999, base_model:0xSero/Qwen3.5-122B-A10B-REAP-20, base_model:quantized:0xSero/Qwen3.5-122B-A10B-REAP-20, license:other, endpoints_compatible, region:us, conversational
Downloads: 5,302
Likes: 5
Pipeline: not set
Library: not set
Visibility: Public
Access: Open

Repository Files & Downloads

3 files detected
Direct downloads for all repository files
File | Type | Quantization | Size | Link
Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf | GGUF | Q4_K_M | 56.06 GB | Download
Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf | GGUF | Q6_K | 75.74 GB | Download
Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf | GGUF | Q8_0 | 98.06 GB | Download
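The listed sizes can be cross-checked against the bits-per-weight (BPW) figures in the model card. The sketch below is a rough estimate under two assumptions that the page does not state explicitly: that the "GB" values in this listing are binary gigabytes (GiB), and that the rounded "99B" total parameter count from the model card is accurate enough for the purpose. The helper name `bits_per_weight` is hypothetical.

```python
GIB = 1024 ** 3        # assumption: the listing's "GB" are binary gigabytes
TOTAL_PARAMS = 99e9    # "99B" total parameters per the model card (rounded)

# File sizes as shown in the repository file listing.
FILE_SIZES_GIB = {
    "Q4_K_M": 56.06,
    "Q6_K": 75.74,
    "Q8_0": 98.06,
}


def bits_per_weight(size_gib: float, params: float = TOTAL_PARAMS) -> float:
    """Average number of bits stored per parameter for a file of this size."""
    return size_gib * GIB * 8 / params


bpw = {quant: bits_per_weight(size) for quant, size in FILE_SIZES_GIB.items()}
for quant, value in bpw.items():
    print(f"{quant}: {value:.2f} bits/weight")
```

Under these assumptions the estimates land close to the BPW column in the model card (4.86, 6.57, and 8.51). The file also holds GGUF metadata and the parameter count is rounded, so the agreement is only approximate.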

Model Details

Model Slug: 0xsero/qwen3.5-122b-a10b-reap-20-gguf
Author: 0xSero
Pipeline Task: not set
Library: not set
Created: 2026-04-10
Last Modified: 2026-04-14
Gated: No
Private: No
HF SHA: 81cadfaffa3d6b1bd0786cbd4a3e316035a91406
License: other
Language: en
Base Model: 0xSero/Qwen3.5-122B-A10B-REAP-20

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "language": [
      "en"
    ],
    "license": "other",
    "tags": [
      "gguf",
      "moe",
      "pruned",
      "reap",
      "qwen3.5",
      "expert-pruning",
      "llama-cpp",
      "strix-halo"
    ],
    "base_model": "0xSero/Qwen3.5-122B-A10B-REAP-20",
    "quantized_by": "0xSero",
    "frontmatter": {
      "language": [
        "en"
      ],
      "license": "other",
      "tags": [
        "gguf",
        "moe",
        "pruned",
        "reap",
        "qwen3.5",
        "expert-pruning",
        "llama-cpp",
        "strix-halo"
      ],
      "base_model": "0xSero/Qwen3.5-122B-A10B-REAP-20",
      "quantized_by": "0xSero"
    },
    "hero_image_url": "",
    "summary": "GGUF quantizations of 0xSero/Qwen3.5-122B-A10B-REAP-20, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlanguage:\n- en\nlicense: other\ntags:\n- gguf\n- moe\n- pruned\n- reap\n- qwen3.5\n- expert-pruning\n- llama-cpp\n- strix-halo\nbase_model: 0xSero/Qwen3.5-122B-A10B-REAP-20\nquantized_by: 0xSero\n---\n> [!TIP]\n> Support this work: **[donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)**\n> \n> REAP surfaces: [GLM](https://huggingface.co/spaces/0xSero/reap-glm-family) | [MiniMax](https://huggingface.co/spaces/0xSero/reap-minimax-family) | [Qwen](https://huggingface.co/spaces/0xSero/reap-qwen-family) | [Gemma](https://huggingface.co/spaces/0xSero/reap-gemma-family) | [Paper](https://arxiv.org/abs/2510.13999) | [Code](https://github.com/CerebrasResearch/reap) | [PR17](https://github.com/CerebrasResearch/reap/pull/17) | [Cerebras Collection](https://huggingface.co/collections/cerebras/cerebras-reap)\n\n# Qwen3.5-122B-A10B-REAP-20 — GGUF\n\nGGUF quantizations of [0xSero/Qwen3.5-122B-A10B-REAP-20](https://huggingface.co/0xSero/Qwen3.5-122B-A10B-REAP-20), a 20% expert-pruned Qwen3.5-122B MoE model using [REAP](https://arxiv.org/abs/2510.13999).\n\n## Available Quantizations\n\n| File | Quant | BPW | Size | Description |\n|------|-------|-----|------|-------------|\n| `Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf` | Q4_K_M | 4.86 | 57 GB | Best speed-to-quality ratio. Fits in 64 GB GTT. |\n| `Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf` | Q6_K | 6.57 | 76 GB | Higher quality. Needs 80+ GB VRAM/GTT. |\n| `Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf` | Q8_0 | 8.51 | 99 GB | Near-lossless. Needs 100+ GB VRAM/GTT. 
|\n\n## Model Details\n\n| Property | Value |\n|----------|-------|\n| **Base Model** | [Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) |\n| **Pruned Model** | [0xSero/Qwen3.5-122B-A10B-REAP-20](https://huggingface.co/0xSero/Qwen3.5-122B-A10B-REAP-20) |\n| **Architecture** | Qwen3.5 MoE (GDN + Full Attention hybrid) |\n| **Total Parameters** | 99B (205 experts/layer, down from 256) |\n| **Active Parameters** | ~10B per token (8 experts selected) |\n| **Context Length** | 262,144 tokens |\n| **Thinking Mode** | Yes (reasoning_content in chat completions) |\n| **Pruning Method** | REAP — 20% expert removal with super-expert protection |\n| **Quantization Tool** | llama.cpp (llama-quantize) |\n| **Converted From** | Safetensors (BF16) via llama.cpp convert_hf_to_gguf.py |\n\n## Speed Benchmarks\n\nTested on AMD Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S (gfx1151), 128 GB LPDDR5X. llama.cpp b8746, Vulkan RADV, Flash Attention ON.\n\n### llama-bench (pp512 / tg128)\n\n| Quant | GPU Layers | Prefill (t/s) | Token Gen (t/s) |\n|-------|------------|---------------|-----------------|\n| **Q4_K_M** | 49/49 (full) | 295.74 | **27.56** |\n| Q6_K | 35/49 (partial) | 121.35 | 15.74 |\n| Q8_0 | 25/49 (partial) | 44.55 | 9.89 |\n\n### API Speed (llama-server, real chat completions)\n\n| Quant | Prefill (short) | Prefill (long) | Token Gen |\n|-------|-----------------|----------------|-----------|\n| **Q4_K_M** | 141.8 t/s | 62.3 t/s | **28.4 t/s** |\n| Q6_K | 48.8 t/s | 21.7 t/s | 15.4 t/s |\n| Q8_0 | 25.8 t/s | 14.2 t/s | 9.0 t/s |\n\n> Q6_K and Q8_0 are partially offloaded to CPU because they exceed the default 64 GB GTT limit. 
With GTT increased to 120 GB (BIOS GART + modprobe config), they would run at full GPU speed.\n\n## Quality Benchmarks\n\nTested via llama-server API with thinking mode enabled.\n\n### Reasoning (5 questions — math, calculus, logic, code comprehension, knowledge)\n\n| Quant | Score |\n|-------|-------|\n| **Q4_K_M** | **5/5** |\n| **Q6_K** | **5/5** |\n| **Q8_0** | **5/5** |\n\nAll quants produce correct answers for arithmetic (127*43=5461), calculus (derivative of x^3+2x^2-5x+7), formal logic, Python reference semantics, and factual recall.\n\n### Code Generation (HumanEval subset — 5 problems, executed and tested)\n\n| Quant | Passed |\n|-------|--------|\n| **Q4_K_M** | 4/5 |\n| **Q6_K** | 4/5 |\n| **Q8_0** | 3/5 |\n\nThe model generates correct code for all problems. Score differences are due to code extraction from the thinking format, not model quality.\n\n### Full Benchmarks (safetensors, from base model card)\n\n| Benchmark | Score |\n|-----------|-------|\n| HumanEval | 81.1% |\n| HumanEval+ | 76.8% |\n| MBPP | 86.2% |\n| MBPP+ | 73.0% |\n| ARC Challenge | 63.7% |\n| HellaSwag | 84.1% |\n| TruthfulQA MC2 | 52.4% |\n| Winogrande | 75.5% |\n\nSee the [full model card](https://huggingface.co/0xSero/Qwen3.5-122B-A10B-REAP-20) for complete benchmark results and methodology.\n\n## How to Run\n\n### llama-server (recommended)\n\n```bash\n# Q4_K_M — fits in 64 GB, fastest\nllama-server \\\n  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \\\n  -ngl 999 --flash-attn on -c 4096 \\\n  --port 8080 --host 0.0.0.0\n\n# With speculative decoding for faster generation\nllama-server \\\n  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \\\n  -ngl 999 --flash-attn on -c 4096 \\\n  --spec-type ngram-mod --spec-ngram-size-n 24 \\\n  --draft-min 48 --draft-max 64 \\\n  --port 8080 --host 0.0.0.0\n```\n\n### Ollama\n\n```bash\n# Create a Modelfile\necho 'FROM ./Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf' > Modelfile\nollama create reap20 -f Modelfile\nollama run reap20\n```\n\n### Python 
(llama-cpp-python)\n\n```python\nfrom llama_cpp import Llama\n\nllm = Llama(\n    model_path=\"Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf\",\n    n_gpu_layers=-1,\n    n_ctx=4096,\n    flash_attn=True,\n)\n\noutput = llm.create_chat_completion(\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n    max_tokens=512,\n)\nprint(output[\"choices\"][0][\"message\"][\"content\"])\n```\n\n## Which Quant Should I Use?\n\n| Your Setup | Recommended |\n|------------|-------------|\n| 64 GB VRAM/GTT (e.g., Strix Halo default) | **Q4_K_M** — full GPU offload, 28 t/s |\n| 80-96 GB VRAM/GTT | **Q6_K** — higher quality, full GPU offload |\n| 128+ GB VRAM (e.g., 2x Strix Halo cluster, A100) | **Q8_0** — near-lossless quality |\n| RTX 4090 (24 GB) | Model too large. Use a smaller model. |\n\n## Hardware Notes\n\nThis model was designed for and tested on AMD Strix Halo (Ryzen AI MAX+ 395) with 128 GB unified memory. It also works on any system with sufficient VRAM/RAM:\n\n- **Strix Halo (64 GB GTT default)**: Q4_K_M fits fully, Q6_K/Q8_0 partial offload\n- **Strix Halo (120 GB GTT increased)**: All quants fit fully\n- **2x Strix Halo cluster (RPC)**: All quants at full speed\n- **NVIDIA A100 80GB**: Q4_K_M and Q6_K fit fully\n- **Apple M-series (128 GB)**: All quants should work via Metal\n\n## What is REAP?\n\n[REAP](https://arxiv.org/abs/2510.13999) (Router-weighted Expert Activation Pruning) removes the least-activated experts from Mixture-of-Experts models while preserving critical capabilities. This model has 20% of experts removed (256 -> 205 per layer), retaining **97.9% average capability** across standard benchmarks.\n\n## Credits\n\n- **Pruning**: [0xSero](https://huggingface.co/0xSero) / Sybil Solutions\n- **Base Model**: [Qwen Team](https://huggingface.co/Qwen)\n- **REAP Method**: [arxiv:2510.13999](https://arxiv.org/abs/2510.13999)\n- **Quantization**: llama.cpp\n\n## License\n\nSame license as the base model. 
See [Qwen3.5-122B-A10B license](https://huggingface.co/Qwen/Qwen3.5-122B-A10B).\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "moe",
    "pruned",
    "reap",
    "qwen3.5",
    "expert-pruning",
    "llama-cpp",
    "strix-halo",
    "en",
    "arxiv:2510.13999",
    "base_model:0xSero/Qwen3.5-122B-A10B-REAP-20",
    "base_model:quantized:0xSero/Qwen3.5-122B-A10B-REAP-20",
    "license:other",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 5,
  "downloads": 5302,
  "gated": false,
  "private": false,
  "last_modified": "2026-04-14T22:46:38.000Z",
  "created_at": "2026-04-10T16:27:07.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
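The top-level `tags` array in the normalized metadata mixes plain labels with prefixed entries such as `license:other`. A minimal sketch of separating the two follows, assuming that a colon marks a namespace prefix (an interpretation based on how the tags appear here, not a documented rule). The function name `split_tags` is hypothetical.

```python
from collections import defaultdict

# Tag list copied from the normalized metadata above.
TAGS = [
    "gguf", "moe", "pruned", "reap", "qwen3.5", "expert-pruning",
    "llama-cpp", "strix-halo", "en", "arxiv:2510.13999",
    "base_model:0xSero/Qwen3.5-122B-A10B-REAP-20",
    "base_model:quantized:0xSero/Qwen3.5-122B-A10B-REAP-20",
    "license:other", "endpoints_compatible", "region:us", "conversational",
]


def split_tags(tags):
    """Separate plain tags from 'key:value' tags, grouping values by key."""
    plain = []
    namespaced = defaultdict(list)
    for tag in tags:
        # Split on the first colon only, so nested prefixes stay in the value.
        key, sep, value = tag.partition(":")
        if sep:
            namespaced[key].append(value)
        else:
            plain.append(tag)
    return plain, dict(namespaced)


plain, namespaced = split_tags(TAGS)
print(plain)
print(namespaced)
```

Splitting on the first colon keeps `quantized:0xSero/Qwen3.5-122B-A10B-REAP-20` intact as a second value under `base_model`, so both base-model relations survive.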
Source payload excerpt (from Hugging Face API)
{
  "_id": "69d924dbc70f0feb0f15d7cc",
  "id": "0xSero/Qwen3.5-122B-A10B-REAP-20-GGUF",
  "modelId": "0xSero/Qwen3.5-122B-A10B-REAP-20-GGUF",
  "sha": "81cadfaffa3d6b1bd0786cbd4a3e316035a91406",
  "createdAt": "2026-04-10T16:27:07.000Z",
  "lastModified": "2026-04-14T22:46:38.000Z",
  "author": "0xSero",
  "downloads": 5302,
  "likes": 5,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 5
}
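The README embedded in the metadata above gives a memory requirement for each quant (Q4_K_M fits in 64 GB, Q6_K needs 80+ GB, Q8_0 needs 100+ GB of VRAM/GTT). One way to turn that guidance into a lookup is sketched below. The thresholds are copied from the README; the function name `pick_quant` is hypothetical, and the sketch considers full GPU offload only.

```python
# (quant, minimum GPU-addressable memory in GB) from the README's notes,
# ordered from highest to lowest quality.
QUANT_REQUIREMENTS = [
    ("Q8_0", 100),
    ("Q6_K", 80),
    ("Q4_K_M", 64),
]


def pick_quant(memory_gb: float):
    """Return the highest-quality quant whose stated requirement fits, else None."""
    for quant, needed_gb in QUANT_REQUIREMENTS:
        if memory_gb >= needed_gb:
            return quant
    return None


for budget in (24, 64, 96, 128):
    choice = pick_quant(budget) or "too large for full GPU offload"
    print(f"{budget} GB -> {choice}")
```

The README's benchmarks show that larger quants can still run with partial CPU offload at reduced speed, so a `None` result means "will not fit entirely on the GPU", not "cannot run".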