Model Intelligence Sheet
weathermanj/nvidia-nemotron-nano-9b-v2-gguf overview
GGUF quantizations of NVIDIA’s NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.
Downloads: 1,217
Likes: 1
Pipeline: text-generation
Library: llama.cpp
Visibility: Public
Access: Open
Repository Files & Downloads
11 files detected
Direct downloads for all repository files
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf | GGUF | IQ3_M | 4.85 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf | GGUF | IQ4_XS | 4.99 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf | GGUF | Q2_K | 4.66 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf | GGUF | Q4_0 | 4.94 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf | GGUF | Q4_1 | 5.43 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf | GGUF | Q4_K_M | 6.08 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf | GGUF | Q4_K_S | 5.79 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf | GGUF | Q5_K_M | 6.58 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf | GGUF | Q6_K | 8.51 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf | GGUF | Q8_0 | 8.81 GB | Download |
| NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf | GGUF | F16 | 16.57 GB | Download |
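The sizes in the table above can be used to shortlist a file for a given memory budget. The following is a minimal sketch, not a recommendation from the repository author: the quantization labels are read from the file names, `largest_fitting_quant` and `headroom_gb` are hypothetical names, and the 1 GB headroom is an illustrative placeholder for context state and runtime buffers. File size is used only as a rough proxy for quality.

```python
# File sizes in GB, copied from the repository table; labels come from the file names.
QUANT_SIZES_GB = {
    "IQ3_M": 4.85,
    "IQ4_XS": 4.99,
    "Q2_K": 4.66,
    "Q4_0": 4.94,
    "Q4_1": 5.43,
    "Q4_K_M": 6.08,
    "Q4_K_S": 5.79,
    "Q5_K_M": 6.58,
    "Q6_K": 8.51,
    "Q8_0": 8.81,
    "F16": 16.57,
}


def largest_fitting_quant(budget_gb, headroom_gb=1.0):
    """Return the label of the largest file that fits in budget_gb.

    headroom_gb is a hypothetical allowance for context state and runtime
    buffers; the real overhead depends on context length and backend.
    Returns None when no file fits.
    """
    usable = budget_gb - headroom_gb
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= usable}
    if not fitting:
        return None
    # Pick the biggest file that still fits (size as a rough quality proxy).
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for budget in (6, 8, 12, 24):
        print(budget, "GB ->", largest_fitting_quant(budget))
```

Because size and quality do not track each other exactly (for example, the legacy Q4_0 and the IQ4_XS files are nearly the same size), the result is best treated as a starting point for testing rather than a final choice.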
Model Details
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"tags": [
"gguf",
"llama.cpp",
"text-generation",
"quantized",
"nvidia",
"nemotron",
"mamba2",
"transformer"
],
"language": [
"en"
],
"license": "other",
"license_name": "nvidia-open-model-license",
"license_link": "https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/",
"base_model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
"library_name": "llama.cpp",
"pipeline_tag": "text-generation",
"model_type": "nemotron_h",
"quantized": true,
"quantization_type": "gguf",
"quantization_config": {
"quantized": true,
"format": "gguf",
"variants": [
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf",
"size": "4.7GB",
"bits_per_weight": "~2.0",
"description": "2-bit K-quantization, maximum compression"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf",
"size": "8.9GB",
"bits_per_weight": "~8.0",
"description": "Near-lossless, reference quality"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf",
"size": "8.6GB",
"bits_per_weight": "~6.0",
"description": "High quality, recommended"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf",
"size": "6.6GB",
"bits_per_weight": "~5.0",
"description": "Good quality, balanced"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf",
"size": "6.1GB",
"bits_per_weight": "~4.0",
"description": "Standard choice, good compression"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf",
"size": "5.5GB",
"bits_per_weight": "~4.0",
"description": "Legacy 4-bit (Q4_1), slightly better quality than Q4_0"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf",
"size": "5.0GB",
"bits_per_weight": "~4.0",
"description": "Legacy 4-bit (Q4_0), smaller, lower quality"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf",
"size": "~5.8GB",
"bits_per_weight": "~4.0",
"description": "4-bit K (small), smaller than Q4_K_M"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf",
"size": "5.0GB",
"bits_per_weight": "4.25",
"description": "Integer quantization, excellent compression"
},
{
"filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf",
"size": "4.9GB",
"bits_per_weight": "3.66",
"description": "Ultra-small, mobile/edge"
}
]
},
"frontmatter": {
"tags": [
"gguf",
"llama.cpp",
"text-generation",
"quantized",
"nvidia",
"nemotron",
"mamba2",
"transformer"
],
"language": [
"en"
],
"license": "other",
"license_name": "nvidia-open-model-license",
"license_link": "https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/",
"base_model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
"library_name": "llama.cpp",
"pipeline_tag": "text-generation",
"model_type": "nemotron_h",
"quantized": "true",
"quantization_type": "gguf",
"quantization_config": [
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf",
"filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf"
]
},
"hero_image_url": "",
"summary": "GGUF quantizations of NVIDIA’s NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\ntags:\n - gguf\n - llama.cpp\n - text-generation\n - quantized\n - nvidia\n - nemotron\n - mamba2\n - transformer\nlanguage:\n - en\nlicense: other\nlicense_name: nvidia-open-model-license\nlicense_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/\nbase_model: nvidia/NVIDIA-Nemotron-Nano-9B-v2\nlibrary_name: llama.cpp\npipeline_tag: text-generation\nmodel_type: nemotron_h\nquantized: true\nquantization_type: gguf\nquantization_config:\n quantized: true\n format: gguf\n variants:\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf\n size: 4.7GB\n bits_per_weight: \"~2.0\"\n description: \"2-bit K-quantization, maximum compression\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf\n size: 8.9GB\n bits_per_weight: \"~8.0\"\n description: \"Near-lossless, reference quality\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf\n size: 8.6GB\n bits_per_weight: \"~6.0\"\n description: \"High quality, recommended\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf\n size: 6.6GB\n bits_per_weight: \"~5.0\"\n description: \"Good quality, balanced\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf\n size: 6.1GB\n bits_per_weight: \"~4.0\"\n description: \"Standard choice, good compression\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf\n size: 5.5GB\n bits_per_weight: \"~4.0\"\n description: \"Legacy 4-bit (Q4_1), slightly better quality than Q4_0\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf\n size: 5.0GB\n bits_per_weight: \"~4.0\"\n description: \"Legacy 4-bit (Q4_0), smaller, lower quality\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf\n size: \"~5.8GB\"\n bits_per_weight: \"~4.0\"\n description: \"4-bit K (small), smaller than Q4_K_M\"\n - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf\n size: 5.0GB\n bits_per_weight: \"4.25\"\n description: \"Integer quantization, excellent compression\"\n - filename: 
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf\n size: 4.9GB\n bits_per_weight: \"3.66\"\n description: \"Ultra-small, mobile/edge\"\n---\n\n# NVIDIA-Nemotron-Nano-9B-v2-gguf\n\nGGUF quantizations of NVIDIA’s [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2). These files target llama.cpp-compatible runtimes.\n\n## Available Models\n\n| Model | Size | Bits/Weight | Description |\n|-------|------|-------------|-------------|\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf` | 8.9GB | ~8.0 | Near-lossless, reference quality |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf` | 8.6GB | ~6.0 | High quality, recommended for most users |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf` | 6.6GB | ~5.0 | Good quality, balanced |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf` | 6.1GB | ~4.0 | Standard choice, good compression |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf` | 5.5GB | ~4.0 | Legacy 4-bit (Q4_1), better than Q4_0 |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf` | 5.0GB | ~4.0 | Legacy 4-bit (Q4_0), smaller |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf` | 5.0GB | 4.25 | Integer quantization, excellent compression |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf` | 4.9GB | 3.66 | Ultra-small, mobile/edge deployment |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf` | 5.8GB | ~4.0 | 4-bit K (small), smaller than Q4_K_M |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf` | 4.7GB | ~2.0 | 2-bit K, maximum compression |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf` | 17GB | 16.0 | Full precision reference (optional) |\n\n## Usage\n\n- Download a quantization\n - `huggingface-cli download weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf --local-dir ./`\n- Run with llama.cpp\n - `./llama-server -m NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf -c 4096`\n\n## Performance (tokens/s)\n\nCPU vs CUDA vs CUDA+FlashAttn on a 24GB RTX 3090, n_predict=64, temp=0.7, top_p=0.95.\n\n| Model | CPU Factoid | CPU Code 
| CPU Reasoning | CUDA Factoid | CUDA Code | CUDA Reasoning | CUDA+FA Factoid | CUDA+FA Code | CUDA+FA Reasoning |\n|--------|------------:|---------:|--------------:|-------------:|----------:|---------------:|----------------:|-------------:|------------------:|\n| IQ3_M | 10.96 | 9.83 | 9.84 | 59.51 | 48.83 | 51.22 | 49.46 | 51.48 | 51.54 |\n| Q4_K_M | 8.59 | 8.03 | 8.02 | 48.28 | 48.72 | 48.70 | 53.48 | 48.73 | 47.97 |\n| Q5_K_M | 7.54 | 7.54 | 7.52 | 49.09 | 46.00 | 46.87 | 51.25 | 50.58 | 47.00 |\n| Q6_K | 6.65 | 6.19 | 5.89 | 52.77 | 41.84 | 42.06 | 47.59 | 41.48 | 42.85 |\n| Q8_0 | 6.95 | 5.79 | 5.93 | 45.99 | 40.81 | 41.51 | 48.32 | 41.21 | 41.54 |\n\nNotes:\n- IQ3_M is fastest on this setup; Q4_K_M offers stronger quality with close speed.\n- Flash Attention helps variably; larger micro-batches (e.g., `--ubatch-size 1024`) can improve throughput.\n\n\n## Notes\n\n- Base model: [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)\n- These are GGUF files suitable for llama.cpp and compatible backends.\n- Choose a quantization based on your resource/quality needs (see table).\n\n## License\n\n- NVIDIA Open Model License: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/\n",
"related_quantizations": []
},
"tags": [
"llama.cpp",
"gguf",
"nemotron_h",
"text-generation",
"quantized",
"nvidia",
"nemotron",
"mamba2",
"transformer",
"en",
"base_model:nvidia/NVIDIA-Nemotron-Nano-9B-v2",
"base_model:quantized:nvidia/NVIDIA-Nemotron-Nano-9B-v2",
"license:other",
"endpoints_compatible",
"region:us"
],
"likes": 1,
"downloads": 1217,
"gated": false,
"private": false,
"last_modified": "2025-08-29T00:20:12.000Z",
"created_at": "2025-08-28T19:28:08.000Z",
"pipeline_tag": "text-generation",
"library_name": "llama.cpp"
}
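The `bits_per_weight` values in the metadata above are nominal figures. A rough cross-check is to scale each file size against the f16 file, as sketched below. This assumes the f16 file stores essentially every weight at 16 bits and that header and metadata overhead is negligible in both files; neither assumption is confirmed by this page, and `approx_bits_per_weight` is a hypothetical helper name.

```python
F16_SIZE_GB = 16.57  # f16 file size from the repository file table


def approx_bits_per_weight(file_size_gb, f16_size_gb=F16_SIZE_GB):
    """Estimate average bits per weight by scaling against the f16 file.

    Assumption: the f16 file holds (nearly) all weights at 16 bits, so the
    size ratio between two files approximates the ratio of their average
    bits per weight.
    """
    return 16.0 * file_size_gb / f16_size_gb


if __name__ == "__main__":
    # Sizes in GB taken from the repository file table.
    for name, size_gb in [("Q2_K", 4.66), ("Q4_K_M", 6.08), ("Q8_0", 8.81)]:
        print(f"{name}: ~{approx_bits_per_weight(size_gb):.2f} bits/weight")
```

With the sizes listed on this page, the estimate for Q8_0 comes out near 8.5, while the estimates for the lower-bit files come out above their nominal `bits_per_weight` values. That would be consistent with some tensors being stored at higher precision than the nominal type, but this page does not say so, and the estimate should be read as approximate.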
Source payload excerpt (from Hugging Face API)
{
"_id": "68b0adc81f7f07b7b1018abc",
"id": "weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf",
"modelId": "weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf",
"sha": "b41807a00ee3c57eb43cb8eee9a71935595c2627",
"createdAt": "2025-08-28T19:28:08.000Z",
"lastModified": "2025-08-29T00:20:12.000Z",
"author": "weathermanj",
"downloads": 1217,
"likes": 1,
"gated": false,
"private": false,
"pipeline_tag": "text-generation",
"library_name": "llama.cpp",
"siblings_count": 15
}
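The `id` field in the payload, combined with a file name from the repository table, is enough to build a direct-download URL and the server command shown in the README's usage section. The sketch below assumes the Hub's `/resolve/` route and the `main` revision; `resolve_url` and `llama_server_argv` are hypothetical helper names, and the `./llama-server -m ... -c 4096` form is taken from the README.

```python
from urllib.parse import quote

REPO_ID = "weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf"  # "id" in the payload


def resolve_url(filename, repo_id=REPO_ID, revision="main"):
    """Build a direct-download URL using the Hub's /resolve/ route.

    revision="main" is an assumption; passing a commit hash (such as the
    "sha" in the payload) would pin the download to an exact revision.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(filename)}"


def llama_server_argv(model_path, ctx_size=4096):
    """Return an argument list mirroring the README's llama-server call."""
    return ["./llama-server", "-m", model_path, "-c", str(ctx_size)]


if __name__ == "__main__":
    name = "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf"
    print(resolve_url(name))
    print(" ".join(llama_server_argv(name)))
```

The argument list can be passed to `subprocess.run` once the file has been downloaded. The README's own `huggingface-cli download` command is an equivalent way to fetch the file.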