majentik/nemotron-3-nano-4b-turboquant-gguf-q4_k_m is a Q4_K_M GGUF model indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. This page helps you pick a local model for guIDE or other runtimes. See related models in the same shard below.

Model Intelligence Sheet

majentik/nemotron-3-nano-4b-turboquant-gguf-q4_k_m overview

GGUF Q4_K_M weight-quantized variant of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16, optimised for use with TurboQuant KV cache compression via a dedicated llama.cpp fork. Important: TurboQuant KV cache types (planar3, iso3) are not available in upstream llama.cpp, standard Ollama, or LM Studio; they require a specific llama.cpp fork. The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).
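The split between fork-only and standard KV cache types can be sketched as a small command builder. This is an illustrative sketch, not part of llama.cpp or GraySoft: `build_llama_cli_args` and its `using_fork` parameter are hypothetical names, the flags (`-m`, `--cache-type-k`, `--cache-type-v`, `-ngl 99`, `-fa`, `-p`) are copied from the model card stored in the metadata on this page, and the model filename is taken from the file table. Some newer upstream llama.cpp builds may expect an explicit value after `-fa`, so check `llama-cli --help` for your build.

```python
import shlex

# Cache types named on this page. The fork-only set comes from the model card;
# the standard set is the subset the overview lists for any llama.cpp runtime.
FORK_ONLY_CACHE_TYPES = {"planar3", "iso3"}
STANDARD_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def build_llama_cli_args(model_path, cache_type="q8_0", using_fork=False, prompt=None):
    """Return an argv list for llama-cli (hypothetical helper, not a llama.cpp API)."""
    if cache_type not in FORK_ONLY_CACHE_TYPES | STANDARD_CACHE_TYPES:
        raise ValueError(f"unknown cache type: {cache_type}")
    if cache_type in FORK_ONLY_CACHE_TYPES and not using_fork:
        raise ValueError(f"{cache_type} requires the TurboQuant llama.cpp fork")
    args = [
        "llama-cli", "-m", model_path,
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
        "-ngl", "99",  # offload all layers to the GPU, as in the model card
        "-fa",         # flash attention, as in the model card
    ]
    if prompt is not None:
        args += ["-p", prompt]
    return args


if __name__ == "__main__":
    argv = build_llama_cli_args(
        "Nemotron-3-Nano-4B-TurboQuant-Q4_K_M.gguf",
        prompt="Explain quantum computing",
    )
    print(shlex.join(argv))  # shell-quoted command line, ready to paste
```

Passing `cache_type="planar3"` without `using_fork=True` raises a `ValueError`, which mirrors the note above: an upstream build will not recognise the TurboQuant cache types.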

gguf, turboquant, kv-cache-quantization, nemotron, nvidia, mamba2, hybrid, llama-cpp, quantized, text-generation, en, arxiv:2504.19874, base_model:nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16, base_model:quantized:nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16, license:other, endpoints_compatible, region:us, conversational

Downloads

197

Likes

0

Pipeline

text-generation

Library

gguf

Visibility

Public

Access

Open

Repository Files & Downloads

1 file detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Nemotron-3-Nano-4B-TurboQuant-Q4_K_M.gguf	GGUF	Q4_K_M	2.64 GB	Download
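The target of the Download link is not shown on this page. As a hedged sketch, the file can usually be addressed through the standard Hugging Face `resolve` URL pattern, pinned to the commit SHA listed under the model details so the download is reproducible. The helper name `resolve_url` is hypothetical, and whether GraySoft's own link uses this pattern is an assumption; the repository ID, filename, and SHA are the values shown elsewhere on this page.

```python
from urllib.parse import quote

REPO_ID = "majentik/Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M"  # "id" in the API payload on this page
FILENAME = "Nemotron-3-Nano-4B-TurboQuant-Q4_K_M.gguf"          # from the file table
REVISION = "86e2a178317d53d6173ca390adbd1bbdad98435d"           # HF SHA from the model details


def resolve_url(repo_id, filename, revision="main"):
    """Build a Hugging Face 'resolve' URL for one file at a given revision.

    Hypothetical helper: it only formats the URL and performs no network access.
    """
    return (
        "https://huggingface.co/"
        f"{quote(repo_id)}/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


if __name__ == "__main__":
    print(resolve_url(REPO_ID, FILENAME, REVISION))
```

Pinning to the SHA rather than `main` means a later push to the repository cannot silently change the file you fetch.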

Model Details Live

Model Slug

majentik/nemotron-3-nano-4b-turboquant-gguf-q4_k_m

Author

majentik

Pipeline Task

text-generation

Library

gguf

Created

2026-04-14

Last Modified

2026-04-17

Gated

No

Private

No

HF SHA

86e2a178317d53d6173ca390adbd1bbdad98435d

License

other

Language

en

Base Model

nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "other",
    "license_name": "nvidia-open-model-license",
    "license_link": "https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf",
    "base_model": "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    "tags": [
      "gguf",
      "turboquant",
      "kv-cache-quantization",
      "nemotron",
      "nvidia",
      "mamba2",
      "hybrid",
      "llama-cpp",
      "quantized"
    ],
    "library_name": "gguf",
    "pipeline_tag": "text-generation",
    "language": [
      "en"
    ],
    "frontmatter": {
      "license": "other",
      "license_name": "nvidia-open-model-license",
      "license_link": "https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf",
      "base_model": "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
      "tags": [
        "gguf",
        "turboquant",
        "kv-cache-quantization",
        "nemotron",
        "nvidia",
        "mamba2",
        "hybrid",
        "llama-cpp",
        "quantized"
      ],
      "library_name": "gguf",
      "pipeline_tag": "text-generation",
      "language": [
        "en"
      ]
    },
    "hero_image_url": "",
    "summary": "GGUF Q4_K_M weight-quantized variant of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 optimised for use with **TurboQuant** KV cache compression via a dedicated llama.cpp fork. > **Important:** TurboQuant KV cache types (planar3, iso3) are **not** available in upstream llama.cpp, standard Ollama, or LM Studio. > They require a specific llama.cpp fork. > The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: other\nlicense_name: nvidia-open-model-license\nlicense_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf\nbase_model: nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16\ntags:\n- gguf\n- turboquant\n- kv-cache-quantization\n- nemotron\n- nvidia\n- mamba2\n- hybrid\n- llama-cpp\n- quantized\nlibrary_name: gguf\npipeline_tag: text-generation\nlanguage:\n- en\n---\n\n# Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M\n\nGGUF Q4_K_M weight-quantized variant of [nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16) optimised for use with **TurboQuant** KV cache compression via a dedicated llama.cpp fork.\n\n> **Important:** TurboQuant KV cache types (`planar3`, `iso3`) are **not** available in upstream llama.cpp, standard Ollama, or LM Studio.\n> They require a [specific llama.cpp fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache).\n> The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).\n\n## Overview\n\nThis model combines two independent compression techniques:\n\n| Technique | What it does | Requirement |\n|-----------|-------------|-------------|\n| **GGUF Q4_K_M weight quantization** | Reduces model size from ~8 GB (BF16) to ~2.2 GB | Any llama.cpp-compatible runtime |\n| **TurboQuant KV cache compression** — random rotation + Lloyd-Max scalar quantization (`--cache-type-k planar3 --cache-type-v planar3`) | Block-diagonal rotations / random rotation for compressed KV cache | [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) only |\n\n## Quickstart\n\n### Option A — With TurboQuant KV cache (fork required)\n\nYou must build from the TurboQuant-enabled llama.cpp fork:\n\n```bash\n# Clone and build the fork\ngit clone 
https://github.com/johndpope/llama-cpp-turboquant.git\ncd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache\n\n# CUDA (Windows/Linux)\ncmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j\n\n# Metal (Apple Silicon)\ncmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j\n\n# Run with TurboQuant KV cache\n./build/bin/llama-cli -m Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M.gguf \\\n  --cache-type-k planar3 --cache-type-v planar3 \\\n  -ngl 99 -fa \\\n  -p \"Explain quantum computing\"\n\n# Or run as a server\n./build/bin/llama-server -m Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M.gguf \\\n  --cache-type-k planar3 --cache-type-v planar3 \\\n  -ngl 99 -fa --jinja\n```\n\n### Option B — With standard llama.cpp / LM Studio / Ollama\n\nThe GGUF works as a normal quantised model. You won't get TurboQuant-specific KV cache benefits, but standard KV cache quantization (q8_0, q4_0) still reduces VRAM significantly.\n\n**llama.cpp (upstream)**\n```bash\nllama-cli -m Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M.gguf \\\n  --cache-type-k q8_0 --cache-type-v q8_0 \\\n  -ngl 99 -fa \\\n  -p \"Explain quantum computing\"\n```\n\n**LM Studio**\n1. Download the GGUF file and load in LM Studio.\n2. Enable **Developer Mode** (Settings → Developer).\n3. In the model loader's advanced settings, set **Flash Attention** to ON.\n4. Set **K Cache Quantization** and **V Cache Quantization** to `q8_0` (or `q4_0` for more aggressive VRAM savings).\n5. Note: LM Studio does not currently support TurboQuant's `planar3` cache types. 
Track [this feature request](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) for updates.\n\n**Ollama**\n```bash\n# Standard Ollama does not support TurboQuant cache types.\n# Use with default or q8_0 KV cache via OLLAMA_KV_CACHE_TYPE=q8_0\nOLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama run majentik/Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M\n```\n\n## Specifications\n\n| Property | Value |\n|----------|-------|\n| Base Model | [nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16) |\n| Architecture | Mamba-2 + Transformer hybrid (dense) |\n| Parameters | 4B (dense hybrid) |\n| Context Length | 262K |\n| Weight Quantization | GGUF Q4_K_M (popular 4-bit, best quality/size tradeoff) |\n| Original Size (BF16) | ~8 GB |\n| Quantized File Size | ~2.2 GB |\n| KV Cache (TurboQuant) | 3-bit via `--cache-type-k planar3 --cache-type-v planar3` (fork only) |\n| KV Cache (standard) | q8_0, q4_0, f16, etc. (any llama.cpp runtime) |\n| License | other |\n| Modalities | Text only |\n| Compatible Runtimes | llama.cpp, LM Studio, Ollama, koboldcpp |\n\n## What is TurboQuant?\n\n[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) is a KV cache compression method that applies a random orthogonal rotation followed by optimal scalar quantization. Bit-identical prefill logits at 4-bit on tested models, with up to 4-8× memory savings for long sequences.\n\n**Benchmarks from the TurboQuant repository** (Llama 3.1 8B, RTX 5090 — results will vary by model and hardware):\n\n| Metric | TurboQuant (4-bit) | Standard q4_0 |\n|--------|--------------------|---------------|\n| Quality | Bit-identical prefill | Lossy |\n| KV Compression | ~4× vs FP16 | ~4× vs FP16 |\n| Speedup (Apple Silicon) | 1.4–1.7× | — |\n\n> **Note:** These benchmarks are from the TurboQuant repository using Llama 3.1 8B on an RTX 5090. Performance on Nemotron-3-Nano-4B will differ. 
Independent benchmarks for this specific model are welcome — please open a discussion if you have results to share.\n\n## Current Status of TurboQuant in the Ecosystem\n\n| Runtime | TurboQuant Support | Standard KV Quant |\n|---------|---------------------|-------------------|\n| llama.cpp (upstream) | ❌ Not merged | ✅ q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 |\n| llama-cpp-turboquant fork | ✅ planar3 | ✅ All standard types |\n| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | ✅ Via advanced settings |\n| Ollama | ❌ Not supported | ✅ Via OLLAMA_KV_CACHE_TYPE |\n| koboldcpp | ❌ Not supported | ✅ Standard types |\n\n## Recommended Settings\n\nFor VRAM-constrained setups, standard q8_0 KV cache quantization already halves KV cache memory with negligible quality impact. Flash Attention should always be enabled — it is required for V cache quantization and improves memory efficiency regardless.\n\n| VRAM | Suggested Configuration |\n|------|------------------------|\n| 24 GB (RTX 4090) | Q4_K_M + q8_0 KV cache + Flash Attention, 8K–16K context |\n| 16 GB | Q4_K_M + q4_0 KV cache + Flash Attention, 4K–8K context |\n| 48+ GB | Q4_K_M + f16 KV cache, full 32K+ context |\n\n## See Also\n\n- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)\n- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)\n- [TurboQuant llama.cpp discussion](https://github.com/ggml-org/llama.cpp/discussions/20969)\n- [TurboQuant paper (arXiv: 2504.19874)](https://arxiv.org/abs/2504.19874)\n- [Base model: nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16)\n- [Nemotron-3-Nano-4B announcement](https://huggingface.co/blog/nvidia/nemotron-3-nano-4b)\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "turboquant",
    "kv-cache-quantization",
    "nemotron",
    "nvidia",
    "mamba2",
    "hybrid",
    "llama-cpp",
    "quantized",
    "text-generation",
    "en",
    "arxiv:2504.19874",
    "base_model:nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    "base_model:quantized:nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    "license:other",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 197,
  "gated": false,
  "private": false,
  "last_modified": "2026-04-17T01:17:54.000Z",
  "created_at": "2026-04-14T02:30:22.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "gguf"
}

Source payload excerpt (from Hugging Face API)

{
  "_id": "69dda6be169d1fe41ead9ceb",
  "id": "majentik/Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M",
  "modelId": "majentik/Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M",
  "sha": "86e2a178317d53d6173ca390adbd1bbdad98435d",
  "createdAt": "2026-04-14T02:30:22.000Z",
  "lastModified": "2026-04-17T01:17:54.000Z",
  "author": "majentik",
  "downloads": 197,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "gguf",
  "siblings_count": 3
}
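The listing fields above (model slug, created and last-modified dates, open access) appear to follow directly from this payload. The sketch below shows one way they could be derived. It is an assumption about how the page is built, not GraySoft's actual code: `summarize` and `parse_hf_timestamp` are hypothetical names, and `PAYLOAD` is a trimmed copy of the excerpt above.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the API payload excerpt shown above.
PAYLOAD = """
{
  "id": "majentik/Nemotron-3-Nano-4B-TurboQuant-GGUF-Q4_K_M",
  "createdAt": "2026-04-14T02:30:22.000Z",
  "lastModified": "2026-04-17T01:17:54.000Z",
  "gated": false,
  "private": false
}
"""


def parse_hf_timestamp(value):
    """Parse a timestamp such as 2026-04-14T02:30:22.000Z as an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def summarize(payload_text):
    """Derive listing-style fields from the payload (hypothetical mapping)."""
    data = json.loads(payload_text)
    return {
        "slug": data["id"].lower(),  # the page slug is the lower-cased repo ID
        "created": parse_hf_timestamp(data["createdAt"]).date().isoformat(),
        "last_modified": parse_hf_timestamp(data["lastModified"]).date().isoformat(),
        "open_access": not data["gated"] and not data["private"],
    }


if __name__ == "__main__":
    for key, value in summarize(PAYLOAD).items():
        print(f"{key}: {value}")
```

Note that the slug is lower-cased while the `id` keeps its original capitalisation; the sketch treats the `id` as the canonical repository name.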

More models in this shard