Model Intelligence Sheet

weathermanj/nvidia-nemotron-nano-9b-v2-gguf overview

GGUF quantizations of NVIDIA’s NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.

llama.cpp, gguf, nemotron_h, text-generation, quantized, nvidia, nemotron, mamba2, transformer, en, base_model:nvidia/NVIDIA-Nemotron-Nano-9B-v2, base_model:quantized:nvidia/NVIDIA-Nemotron-Nano-9B-v2, license:other, endpoints_compatible, region:us

Downloads

1,217

Likes

1

Pipeline

text-generation

Library

llama.cpp

Visibility

Public

Access

Open

Repository Files & Downloads

11 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf	GGUF	IQ3_M	4.85 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf	GGUF	IQ4_XS	4.99 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf	GGUF	Q2_K	4.66 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf	GGUF	Q4_0	4.94 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf	GGUF	Q4_1	5.43 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf	GGUF	Q4_K_M	6.08 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf	GGUF	Q4_K_S	5.79 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf	GGUF	Q5_K_M	6.58 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf	GGUF	Q6_K	8.51 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf	GGUF	Q8_0	8.81 GB	Download
NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf	GGUF	F16	16.57 GB	Download
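
The table lists file sizes but not the average bits per weight each file actually uses. A rough estimate can be derived from the sizes alone, assuming the F16 file stores essentially every weight at 16 bits and that metadata overhead is negligible (both are assumptions, not statements from the repository). The helper name `effective_bpw` is hypothetical.

```python
# File sizes in GB, copied from the repository file table above.
SIZES_GB = {
    "Q2_K": 4.66,
    "IQ3_M": 4.85,
    "Q4_0": 4.94,
    "IQ4_XS": 4.99,
    "Q4_1": 5.43,
    "Q4_K_S": 5.79,
    "Q4_K_M": 6.08,
    "Q5_K_M": 6.58,
    "Q6_K": 8.51,
    "Q8_0": 8.81,
    "F16": 16.57,
}


def effective_bpw(sizes, reference="F16", reference_bits=16.0):
    """Estimate whole-file average bits per weight for each quantization.

    Assumption: the reference file stores every weight at `reference_bits`,
    so size ratios translate directly into bits-per-weight ratios.
    """
    ref_size = sizes[reference]
    return {name: reference_bits * size / ref_size for name, size in sizes.items()}


if __name__ == "__main__":
    for name, bpw in sorted(effective_bpw(SIZES_GB).items(), key=lambda kv: kv[1]):
        print(f"{name:8s} {bpw:5.2f} bits/weight (estimated)")
```

Under these assumptions Q8_0 comes out near 8.5 and Q4_K_M near 5.9 bits per weight. These are whole-file averages, so they can sit above a format's nominal bit width if some tensors are stored at higher precision.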

Model Details

Model Slug

weathermanj/nvidia-nemotron-nano-9b-v2-gguf

Author

weathermanj

Pipeline Task

text-generation

Library

llama.cpp

Created

2025-08-28

Last Modified

2025-08-29

Gated

No

Private

No

HF SHA

b41807a00ee3c57eb43cb8eee9a71935595c2627

License

other

Language

en

Base Model

nvidia/NVIDIA-Nemotron-Nano-9B-v2

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "tags": [
      "gguf",
      "llama.cpp",
      "text-generation",
      "quantized",
      "nvidia",
      "nemotron",
      "mamba2",
      "transformer"
    ],
    "language": [
      "en"
    ],
    "license": "other",
    "license_name": "nvidia-open-model-license",
    "license_link": "https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/",
    "base_model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "library_name": "llama.cpp",
    "pipeline_tag": "text-generation",
    "model_type": "nemotron_h",
    "quantized": true,
    "quantization_type": "gguf",
    "quantization_config": {
      "quantized": true,
      "format": "gguf",
      "variants": [
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf",
          "size": "4.7GB",
          "bits_per_weight": "~2.0",
          "description": "2-bit K-quantization, maximum compression"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf",
          "size": "8.9GB",
          "bits_per_weight": "~8.0",
          "description": "Near-lossless, reference quality"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf",
          "size": "8.6GB",
          "bits_per_weight": "~6.0",
          "description": "High quality, recommended"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf",
          "size": "6.6GB",
          "bits_per_weight": "~5.0",
          "description": "Good quality, balanced"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf",
          "size": "6.1GB",
          "bits_per_weight": "~4.0",
          "description": "Standard choice, good compression"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf",
          "size": "5.5GB",
          "bits_per_weight": "~4.0",
          "description": "Legacy 4-bit (Q4_1), slightly better quality than Q4_0"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf",
          "size": "5.0GB",
          "bits_per_weight": "~4.0",
          "description": "Legacy 4-bit (Q4_0), smaller, lower quality"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf",
          "size": "~5.8GB",
          "bits_per_weight": "~4.0",
          "description": "4-bit K (small), smaller than Q4_K_M"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf",
          "size": "5.0GB",
          "bits_per_weight": "4.25",
          "description": "Integer quantization, excellent compression"
        },
        {
          "filename": "NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf",
          "size": "4.9GB",
          "bits_per_weight": "3.66",
          "description": "Ultra-small, mobile/edge"
        }
      ]
    },
    "frontmatter": {
      "tags": [
        "gguf",
        "llama.cpp",
        "text-generation",
        "quantized",
        "nvidia",
        "nemotron",
        "mamba2",
        "transformer"
      ],
      "language": [
        "en"
      ],
      "license": "other",
      "license_name": "nvidia-open-model-license",
      "license_link": "https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/",
      "base_model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
      "library_name": "llama.cpp",
      "pipeline_tag": "text-generation",
      "model_type": "nemotron_h",
      "quantized": "true",
      "quantization_type": "gguf",
      "quantization_config": [
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf",
        "filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf"
      ]
    },
    "hero_image_url": "",
    "summary": "GGUF quantizations of NVIDIA’s NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\ntags:\n  - gguf\n  - llama.cpp\n  - text-generation\n  - quantized\n  - nvidia\n  - nemotron\n  - mamba2\n  - transformer\nlanguage:\n  - en\nlicense: other\nlicense_name: nvidia-open-model-license\nlicense_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/\nbase_model: nvidia/NVIDIA-Nemotron-Nano-9B-v2\nlibrary_name: llama.cpp\npipeline_tag: text-generation\nmodel_type: nemotron_h\nquantized: true\nquantization_type: gguf\nquantization_config:\n  quantized: true\n  format: gguf\n  variants:\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf\n      size: 4.7GB\n      bits_per_weight: \"~2.0\"\n      description: \"2-bit K-quantization, maximum compression\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf\n      size: 8.9GB\n      bits_per_weight: \"~8.0\"\n      description: \"Near-lossless, reference quality\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf\n      size: 8.6GB\n      bits_per_weight: \"~6.0\"\n      description: \"High quality, recommended\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf\n      size: 6.6GB\n      bits_per_weight: \"~5.0\"\n      description: \"Good quality, balanced\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf\n      size: 6.1GB\n      bits_per_weight: \"~4.0\"\n      description: \"Standard choice, good compression\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf\n      size: 5.5GB\n      bits_per_weight: \"~4.0\"\n      description: \"Legacy 4-bit (Q4_1), slightly better quality than Q4_0\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf\n      size: 5.0GB\n      bits_per_weight: \"~4.0\"\n      description: \"Legacy 4-bit (Q4_0), smaller, lower quality\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf\n      size: \"~5.8GB\"\n      bits_per_weight: \"~4.0\"\n      description: \"4-bit K (small), smaller than Q4_K_M\"\n    - filename: 
NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf\n      size: 5.0GB\n      bits_per_weight: \"4.25\"\n      description: \"Integer quantization, excellent compression\"\n    - filename: NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf\n      size: 4.9GB\n      bits_per_weight: \"3.66\"\n      description: \"Ultra-small, mobile/edge\"\n---\n\n# NVIDIA-Nemotron-Nano-9B-v2-gguf\n\nGGUF quantizations of NVIDIA’s [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2). These files target llama.cpp-compatible runtimes.\n\n## Available Models\n\n| Model | Size | Bits/Weight | Description |\n|-------|------|-------------|-------------|\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf` | 8.9GB | ~8.0 | Near-lossless, reference quality |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf` | 8.6GB | ~6.0 | High quality, recommended for most users |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf` | 6.6GB | ~5.0 | Good quality, balanced |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf` | 6.1GB | ~4.0 | Standard choice, good compression |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf` | 5.5GB | ~4.0 | Legacy 4-bit (Q4_1), better than Q4_0 |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf` | 5.0GB | ~4.0 | Legacy 4-bit (Q4_0), smaller |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf` | 5.0GB | 4.25 | Integer quantization, excellent compression |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf` | 4.9GB | 3.66 | Ultra-small, mobile/edge deployment |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf` | 5.8GB | ~4.0 | 4-bit K (small), smaller than Q4_K_M |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf` | 4.7GB | ~2.0 | 2-bit K, maximum compression |\n| `NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf` | 17GB | 16.0 | Full precision reference (optional) |\n\n## Usage\n\n- Download a quantization\n  - `huggingface-cli download weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf --local-dir ./`\n- Run with llama.cpp\n  - `./llama-server -m 
NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf -c 4096`\n\n## Performance (tokens/s)\n\nCPU vs CUDA vs CUDA+FlashAttn on a 24GB RTX 3090, n_predict=64, temp=0.7, top_p=0.95.\n\n| Model  | CPU Factoid | CPU Code | CPU Reasoning | CUDA Factoid | CUDA Code | CUDA Reasoning | CUDA+FA Factoid | CUDA+FA Code | CUDA+FA Reasoning |\n|--------|------------:|---------:|--------------:|-------------:|----------:|---------------:|----------------:|-------------:|------------------:|\n| IQ3_M  |       10.96 |     9.83 |          9.84 |        59.51 |     48.83 |          51.22 |           49.46 |        51.48 |             51.54 |\n| Q4_K_M |        8.59 |     8.03 |          8.02 |        48.28 |     48.72 |          48.70 |           53.48 |        48.73 |             47.97 |\n| Q5_K_M |        7.54 |     7.54 |          7.52 |        49.09 |     46.00 |          46.87 |           51.25 |        50.58 |             47.00 |\n| Q6_K   |        6.65 |     6.19 |          5.89 |        52.77 |     41.84 |          42.06 |           47.59 |        41.48 |             42.85 |\n| Q8_0   |        6.95 |     5.79 |          5.93 |        45.99 |     40.81 |          41.51 |           48.32 |        41.21 |             41.54 |\n\nNotes:\n- IQ3_M is fastest on this setup; Q4_K_M offers stronger quality with close speed.\n- Flash Attention helps variably; larger micro-batches (e.g., `--ubatch-size 1024`) can improve throughput.\n\n\n## Notes\n\n- Base model: [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)\n- These are GGUF files suitable for llama.cpp and compatible backends.\n- Choose a quantization based on your resource/quality needs (see table).\n\n## License\n\n- NVIDIA Open Model License: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/\n",
    "related_quantizations": []
  },
  "tags": [
    "llama.cpp",
    "gguf",
    "nemotron_h",
    "text-generation",
    "quantized",
    "nvidia",
    "nemotron",
    "mamba2",
    "transformer",
    "en",
    "base_model:nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "base_model:quantized:nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "license:other",
    "endpoints_compatible",
    "region:us"
  ],
  "likes": 1,
  "downloads": 1217,
  "gated": false,
  "private": false,
  "last_modified": "2025-08-29T00:20:12.000Z",
  "created_at": "2025-08-28T19:28:08.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "llama.cpp"
}
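
The README embedded in `readme_markdown` reports tokens/s for five quantizations on CPU, CUDA, and CUDA with Flash Attention. As a rough sketch of how those figures compare, the snippet below averages the three prompt types per backend and derives a CUDA-over-CPU speedup. The numbers are copied from that table; the `summarize` helper and its averaging scheme are illustrative assumptions, not part of the original benchmark.

```python
# Tokens/s from the README performance table (24GB RTX 3090, n_predict=64).
# Each tuple holds (factoid, code, reasoning) for one backend.
RESULTS = {
    "IQ3_M":  {"cpu": (10.96, 9.83, 9.84), "cuda": (59.51, 48.83, 51.22), "cuda_fa": (49.46, 51.48, 51.54)},
    "Q4_K_M": {"cpu": (8.59, 8.03, 8.02),  "cuda": (48.28, 48.72, 48.70), "cuda_fa": (53.48, 48.73, 47.97)},
    "Q5_K_M": {"cpu": (7.54, 7.54, 7.52),  "cuda": (49.09, 46.00, 46.87), "cuda_fa": (51.25, 50.58, 47.00)},
    "Q6_K":   {"cpu": (6.65, 6.19, 5.89),  "cuda": (52.77, 41.84, 42.06), "cuda_fa": (47.59, 41.48, 42.85)},
    "Q8_0":   {"cpu": (6.95, 5.79, 5.93),  "cuda": (45.99, 40.81, 41.51), "cuda_fa": (48.32, 41.21, 41.54)},
}


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def summarize(results):
    """Average the three prompt types per backend and compute CUDA/CPU speedup."""
    summary = {}
    for quant, by_backend in results.items():
        means = {backend: mean(vals) for backend, vals in by_backend.items()}
        means["cuda_speedup"] = means["cuda"] / means["cpu"]
        summary[quant] = means
    return summary


if __name__ == "__main__":
    for quant, row in summarize(RESULTS).items():
        print(
            f"{quant:7s} cpu={row['cpu']:6.2f} cuda={row['cuda']:6.2f} "
            f"cuda_fa={row['cuda_fa']:6.2f} speedup={row['cuda_speedup']:4.2f}x"
        )
```

With these averages IQ3_M reaches about 10.2 tokens/s on CPU and 53.2 on CUDA, roughly a 5.2x speedup, and it has the highest mean in both CUDA columns, which matches the README's note that it is the fastest on this setup. With only 64 predicted tokens per run, small differences between rows are likely within noise.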

Source payload excerpt (from Hugging Face API)

{
  "_id": "68b0adc81f7f07b7b1018abc",
  "id": "weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf",
  "modelId": "weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf",
  "sha": "b41807a00ee3c57eb43cb8eee9a71935595c2627",
  "createdAt": "2025-08-28T19:28:08.000Z",
  "lastModified": "2025-08-29T00:20:12.000Z",
  "author": "weathermanj",
  "downloads": 1217,
  "likes": 1,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "llama.cpp",
  "siblings_count": 15
}
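
The `id` and `sha` fields above, together with a file name from the repository table, are enough to build a direct download URL. A minimal sketch, assuming the Hub's usual `https://huggingface.co/<repo>/resolve/<revision>/<file>` layout (the URL pattern and the `download_url` helper are assumptions, not something this sheet states):

```python
from urllib.parse import quote

# Values copied from the source payload excerpt above.
REPO_ID = "weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf"
PINNED_SHA = "b41807a00ee3c57eb43cb8eee9a71935595c2627"


def download_url(repo_id, filename, revision="main"):
    """Build a direct-download URL for one file in a Hub model repository.

    Assumption: the repository follows the standard `resolve/<revision>` layout.
    """
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


if __name__ == "__main__":
    name = "NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf"
    print(download_url(REPO_ID, name))              # tracks the default branch
    print(download_url(REPO_ID, name, PINNED_SHA))  # pinned to the listed commit
```

Pinning to the commit hash keeps a download reproducible even if the repository is updated after the `lastModified` timestamp recorded here.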