GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

FreedomAISVR/Nemotron-3-30B-Nano-Omni-NVFP4-GGUF overview

license: other language: en library name: gguf tags: gguf nemotron nemotron 3 nvidia nvfp4 mamba2 hybrid moe base model: nvidia/Nemotron 3 Nano Omni 30B A3B Re…

ggufnemotronnemotron-3nvidianvfp4mamba2hybridmoetext-generationenbase_model:nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16base_model:quantized:nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16license:otherregion:usconversational

Runs locally from ~1.48 GB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
578
Likes
0
Pipeline
text-generation

Repository Files & Downloads

2 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
mmproj-nemotron-3-30b-f16.ggufGGUFF161.48 GBDownload
nemotron-3-30b-NVFP4.ggufGGUFGGUF17.93 GBDownload

Model Details

Model IDFreedomAISVR/Nemotron-3-30B-Nano-Omni-NVFP4-GGUF
AuthorFreedomAISVR
Pipelinetext-generation
Licenseother
Base modelnvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Last modified2026-06-18T12:42:45.000Z

Model README

---

license: other

language:

  • en

library_name: gguf

tags:

  • gguf
  • nemotron
  • nemotron-3
  • nvidia
  • nvfp4
  • mamba2
  • hybrid
  • moe

base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

pipeline_tag: text-generation

inference: false

quantized_by: FreedomAISVR

---

Nemotron-3-30B-Nano-Omni NVFP4 GGUF

NVFP4 (E4M3) quantization of nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16, NVIDIA's 30B MoE (~3B active) reasoning model with Mamba2-SSM hybrid architecture.

Why Is This File Bigger Than Expected?

This model uses a Mamba2-Transformer hybrid MoE architecture — 23 out of 52 layers are Mamba2 state-space model (SSM) layers. The CUDA kernels that run Mamba2 SSM operations (SSM_SCAN, SSM_CONV) require F32 inputs and will reject quantized weight tensors entirely.

When SSM weight tensors are quantized, the CUDA backend falls back to CPU for those layers. Since 44% of the model is Mamba2, this significantly impacts performance.

The Workaround (Hybrid Quantization)

The 46 SSM weight tensors (ssm_in.weight, ssm_out.weight across 23 Mamba2 blocks) are stored in F16 instead of the quantized format. Everything else stays at the original quantization.

This Is a Bandaid, Not a Fix

This hybrid approach is a temporary workaround until llama.cpp adds quantized support to CUDA SSM kernels (SSM_SCAN, SSM_CONV). The kernels are currently hardcoded to F32 (ssm_conv_f32, ssm_scan_f32 in the CUDA backend).

Files

| Filename | Type | Size | Description |

|---|---|---|---|

| nemotron-3-30b-NVFP4.gguf | Hybrid NVFP4+F16 | ~18.4 GB | Main model weights (~4.88 BPW effective) |

| mmproj-nemotron-3-30b-f16.gguf | F16 | ~1.5 GB | SigLIP vision encoder |

Architecture

  • 52 layers: 23 Mamba2 (SSM) + 23 MoE FFN (128 experts, top-6) + 6 attention
  • MoE: 128 routed experts + 1 shared expert, top-6 routing
  • Vision: SigLIP ViT (30 blocks, 1280 hidden)
  • Context: 262K tokens

Usage

llama-server -m nemotron-3-30b-NVFP4.gguf --mmproj mmproj-nemotron-3-30b-f16.gguf --host 0.0.0.0 --port 8080 -ngl 99

Credits

Run FreedomAISVR/Nemotron-3-30B-Nano-Omni-NVFP4-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models