djlougen/nemotron-h-120b-reap-50pct-gguf Q4_K_M GGUF - Free GGUF Download
djlougen/nemotron-h-120b-reap-50pct-gguf is indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. Use this page to choose a quantization of this model for guIDE or another local runtime. See related models in the same shard below.
Model Intelligence Sheet
djlougen/nemotron-h-120b-reap-50pct-gguf overview
GGUF quantizations of the REAP 50%-pruned Nemotron-H 120B model for use with llama.cpp and compatible tools.
Downloads: 991
Likes: 0
Pipeline: text-generation
Library: llama.cpp
Visibility: Public
Access: Open
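The README embedded in this page's metadata lists four GGUF files with their sizes. As a rough aid for choosing one, the sketch below picks the largest quantization whose file fits a given memory budget and builds a download URL for it. The function names and the `headroom_gb` reserve are hypothetical, the sizes are copied from the README table, the file names are assumed to match the repository contents, and the URL follows the usual Hugging Face `resolve/main` pattern. File size is only a lower bound on memory use, since the KV cache and runtime buffers come on top of it.

```python
from typing import Optional, Tuple

# Repository id as reported in the Hugging Face API payload on this page.
REPO_ID = "DJLougen/Nemotron-H-120B-REAP-50pct-GGUF"

# (quant label, file name, file size in GB), copied from the README table
# and ordered from largest to smallest.
QUANTS = [
    ("BF16", "Nemotron-H-120B-REAP-50pct-BF16.gguf", 128.6),
    ("Q8_0", "Nemotron-H-120B-REAP-50pct-Q8_0.gguf", 68.4),
    ("Q6_K", "Nemotron-H-120B-REAP-50pct-Q6_K.gguf", 59.7),
    ("Q4_K_M", "Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf", 45.4),
]


def pick_quant(budget_gb: float, headroom_gb: float = 4.0) -> Optional[Tuple[str, str, float]]:
    """Return the largest quant whose file fits in budget_gb minus headroom_gb.

    headroom_gb is an assumed reserve for KV cache and buffers, not a measured value.
    """
    usable = budget_gb - headroom_gb
    for label, filename, size_gb in QUANTS:
        if size_gb <= usable:
            return label, filename, size_gb
    return None


def download_url(filename: str) -> str:
    """Build a direct download URL using the Hugging Face resolve/main pattern."""
    return f"https://huggingface.co/{REPO_ID}/resolve/main/{filename}"


if __name__ == "__main__":
    for budget in (48, 64, 128, 192):
        choice = pick_quant(budget)
        if choice is None:
            print(f"{budget} GB: no listed quantization fits")
        else:
            label, filename, size_gb = choice
            print(f"{budget} GB: {label} ({size_gb} GB) -> {download_url(filename)}")
```

The headroom default is deliberately conservative for short contexts only; long-context use of the 262,144-token window would need a larger reserve.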
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"license": "other",
"license_name": "nvidia-open-model-license",
"base_model": [
"0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft",
"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
],
"library_name": "llama.cpp",
"pipeline_tag": "text-generation",
"tags": [
"nemotron_h",
"reap",
"pruned",
"gguf",
"sparse-moe",
"mamba",
"quantized"
],
"frontmatter": {
"license": "other",
"license_name": "nvidia-open-model-license",
"base_model": [
"0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft",
"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
],
"library_name": "llama.cpp",
"pipeline_tag": "text-generation",
"tags": [
"nemotron_h",
"reap",
"pruned",
"gguf",
"sparse-moe",
"mamba",
"quantized"
]
},
"hero_image_url": "",
"summary": "GGUF quantizations of the REAP 50%-pruned Nemotron-H 120B model for use with llama.cpp and compatible tools.",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\nlicense: other\nlicense_name: nvidia-open-model-license\nbase_model:\n- 0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft\n- nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16\nlibrary_name: llama.cpp\npipeline_tag: text-generation\ntags:\n- nemotron_h\n- reap\n- pruned\n- gguf\n- sparse-moe\n- mamba\n- quantized\n---\n\n> **Credit:** This is a GGUF quantization of [0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft](https://huggingface.co/0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft), a REAP expert-pruned checkpoint derived from [nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16). All credit for the base model goes to NVIDIA, and the REAP pruning work to [0xSero](https://huggingface.co/0xSero).\n\n# NVIDIA Nemotron-H 120B REAP 50% — GGUF\n\nGGUF quantizations of the REAP 50%-pruned Nemotron-H 120B model for use with [llama.cpp](https://github.com/ggml-org/llama.cpp) and compatible tools.\n\n## Available Quantizations\n\n| File | Quant | Size | BPW |\n|------|-------|------|-----|\n| `Nemotron-H-120B-REAP-50pct-BF16.gguf` | BF16 | 128.6 GB | 16.01 |\n| `Nemotron-H-120B-REAP-50pct-Q8_0.gguf` | Q8_0 | 68.4 GB | 8.52 |\n| `Nemotron-H-120B-REAP-50pct-Q6_K.gguf` | Q6_K | 59.7 GB | 7.43 |\n| `Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf` | Q4_K_M | 45.4 GB | 5.65 |\n\n## Model Details\n\n| Property | Value |\n|----------|-------|\n| Architecture | NemotronH (hybrid Mamba + MoE + Attention) |\n| Total Blocks | 88 (40 Mamba, 40 MoE, 8 Attention) |\n| Original Parameters | ~120B (64B after 50% expert pruning) |\n| Experts per MoE Layer | 256 (pruned from 512) |\n| Routed Experts per Token | 22 |\n| Context Length | 262,144 tokens |\n| Vocab Size | 131,072 |\n\n## Usage\n\n```bash\n# With llama.cpp\nllama-cli -m Nemotron-H-120B-REAP-50pct-Q4_K_M.gguf -p \"Hello\" -n 128\n\n# With ollama (create a Modelfile first)\nollama create nemotron-h-reap -f 
Modelfile\n```\n\n## About REAP Pruning\n\nThis model was pruned using the REAP method ([arXiv:2510.13999](https://arxiv.org/abs/2510.13999)), which selectively removes 50% of MoE experts based on layerwise activation observations. This reduces memory footprint while preserving quality for the most commonly activated expert pathways.\n\n- Source: [0xSero/reap-expert-swap](https://github.com/0xSero/reap-expert-swap)\n- Research funding: [donate.sybilsolutions.ai](https://donate.sybilsolutions.ai)\n\n## Draft Caveats\n\nThis is a **draft** derived checkpoint from the original author. Full serving benchmarks and quality evaluations have not been completed. Evaluate accordingly.\n\n## License\n\nDistributed under the NVIDIA Open Model License. See the [original model](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) for full terms.\n\n## Quantized by\n\n[DJLougen](https://huggingface.co/DJLougen) using llama.cpp on DGX Spark\n",
"related_quantizations": []
},
"tags": [
"llama.cpp",
"gguf",
"nemotron_h",
"reap",
"pruned",
"sparse-moe",
"mamba",
"quantized",
"text-generation",
"arxiv:2510.13999",
"base_model:0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft",
"base_model:quantized:0xSero/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-REAP-50pct-draft",
"license:other",
"endpoints_compatible",
"region:us",
"conversational"
],
"likes": 0,
"downloads": 991,
"gated": false,
"private": false,
"last_modified": "2026-03-28T16:07:41.000Z",
"created_at": "2026-03-28T14:08:28.000Z",
"pipeline_tag": "text-generation",
"library_name": "llama.cpp"
}
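The sizes and bits-per-weight (BPW) figures in the README table above can be cross-checked against the stated post-pruning parameter count, because the parameter count is approximately the file size in bits divided by BPW. A minimal sketch, assuming GB means 10^9 bytes and ignoring GGUF header and metadata overhead (the helper name is illustrative):

```python
# (file size in GB, bits per weight), copied from the README table.
TABLE = {
    "BF16": (128.6, 16.01),
    "Q8_0": (68.4, 8.52),
    "Q6_K": (59.7, 7.43),
    "Q4_K_M": (45.4, 5.65),
}


def estimated_params_billions(size_gb: float, bpw: float) -> float:
    """Estimate the parameter count in billions from file size and BPW.

    size_gb * 1e9 bytes * 8 bits / bpw gives parameters; dividing by 1e9
    to express the result in billions cancels the 1e9 factor.
    """
    return size_gb * 8 / bpw


ESTIMATES = {
    quant: estimated_params_billions(size_gb, bpw)
    for quant, (size_gb, bpw) in TABLE.items()
}

if __name__ == "__main__":
    for quant, params in ESTIMATES.items():
        print(f"{quant}: ~{params:.1f}B parameters")
```

All four rows work out to a little over 64B parameters, which is consistent with the README's figure of 64B after 50% expert pruning.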
Source payload excerpt (from Hugging Face API)
{
"_id": "69c7e0dc17242a3b23740340",
"id": "DJLougen/Nemotron-H-120B-REAP-50pct-GGUF",
"modelId": "DJLougen/Nemotron-H-120B-REAP-50pct-GGUF",
"sha": "a7c5011b4c39c714c6de3904649ee2605e955535",
"createdAt": "2026-03-28T14:08:28.000Z",
"lastModified": "2026-03-28T16:07:41.000Z",
"author": "DJLougen",
"downloads": 991,
"likes": 0,
"gated": false,
"private": false,
"pipeline_tag": "text-generation",
"library_name": "llama.cpp",
"siblings_count": 5
}
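The normalized metadata and the API payload excerpt repeat several fields under slightly different key names. The sketch below, which is only an illustration and not GraySoft's actual pipeline, compares the shared fields and computes the time between repository creation and last modification. The `KEY_MAP` mapping and the function names are assumptions introduced here; the values are copied from the two JSON blocks on this page.

```python
import json
from datetime import datetime, timedelta

# Subset of the Hugging Face API payload excerpt shown on this page.
PAYLOAD = json.loads("""
{
  "id": "DJLougen/Nemotron-H-120B-REAP-50pct-GGUF",
  "createdAt": "2026-03-28T14:08:28.000Z",
  "lastModified": "2026-03-28T16:07:41.000Z",
  "downloads": 991,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "llama.cpp"
}
""")

# Matching subset of the normalized metadata shown on this page.
NORMALIZED = json.loads("""
{
  "likes": 0,
  "downloads": 991,
  "gated": false,
  "private": false,
  "last_modified": "2026-03-28T16:07:41.000Z",
  "created_at": "2026-03-28T14:08:28.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "llama.cpp"
}
""")

# Assumed mapping from payload keys to normalized keys.
KEY_MAP = {
    "createdAt": "created_at",
    "lastModified": "last_modified",
    "downloads": "downloads",
    "likes": "likes",
    "gated": "gated",
    "private": "private",
    "pipeline_tag": "pipeline_tag",
    "library_name": "library_name",
}


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' as a UTC datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_mismatches(payload: dict, normalized: dict) -> list:
    """Return (payload_key, normalized_key) pairs whose values differ."""
    return [
        (p_key, n_key)
        for p_key, n_key in KEY_MAP.items()
        if payload.get(p_key) != normalized.get(n_key)
    ]


MISMATCHES = find_mismatches(PAYLOAD, NORMALIZED)
ELAPSED: timedelta = parse_timestamp(PAYLOAD["lastModified"]) - parse_timestamp(PAYLOAD["createdAt"])

if __name__ == "__main__":
    print("mismatched fields:", MISMATCHES)
    print("time from creation to last modification:", ELAPSED)
```

With the values shown on this page the two blocks agree on every mapped field, and the repository was last modified just under two hours after it was created.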