GraySoft
Model Intelligence Sheet

FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF overview

Gemma-4-12B-it-NVFP4-GGUF: NVFP4 (Blackwell FP4) quantization of Google's Gemma 4 12B It (https://huggingface.co/google/gemma-4-12B-it), a multimodal language mode…

Tags: gguf, gemma, gemma-4, 12b, nvfp4, fp4, blackwell, vision, multimodal, text-generation, en, base_model:google/gemma-4-12B-it, base_model:quantized:google/gemma-4-12B-it, license:apache-2.0, region:us, conversational

Runs locally with llama.cpp / guIDE. The main model file is 6.50 GB, plus a 116.4 MB vision projector; the README below puts the minimum VRAM at about 10 GB.

Downloads: 0
Likes: 1
Pipeline: text-generation

Repository Files & Downloads

2 GGUF files detected
Direct downloads for local inference
| File | Type | Quantization | Size |
|---|---|---|---|
| gemma-4-12b-it-nvfp4.gguf | GGUF | GGUF | 6.50 GB |
| mmproj-gemma-4-12b-it-f16.gguf | GGUF | F16 | 116.4 MB |

Model Details

Model ID: FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF
Author: FreedomAISVR
Pipeline: text-generation
License: apache-2.0
Base model: google/gemma-4-12B-it
Last modified: 2026-06-08T04:34:52.000Z

Model README

---
license: apache-2.0
language:
  - en
library_name: gguf
tags:
  - gguf
  - gemma
  - gemma-4
  - 12b
  - nvfp4
  - fp4
  - blackwell
  - vision
  - multimodal
base_model: google/gemma-4-12B-it
pipeline_tag: text-generation
inference: false
quantized_by: freedomaisvr
---

Gemma-4-12B-it-NVFP4-GGUF

NVFP4 (Blackwell FP4) quantization of Google's Gemma 4 12B It, a multimodal language model with native vision understanding.

This repository contains two files:

  • gemma-4-12b-it-nvfp4.gguf — Text backbone (48 transformer layers, 3840 hidden dim, 262k context) quantized to NVFP4
  • mmproj-gemma-4-12b-it-f16.gguf — SigLIP vision encoder + projector at F16 precision (required for image input)

About NVFP4

NVFP4 is NVIDIA's 4-bit floating-point format, purpose-built for the Blackwell GPU architecture (RTX 50-series). Each weight is stored as an E2M1 value (1 sign, 2 exponent, 1 mantissa bit), and each small block of weights shares an FP8 scale factor in E4M3 (1 sign, 4 exponent, 3 mantissa bits). Unlike block-quantized INT4 (Q4_K_M) or microscaling MXFP4, NVFP4 maps directly onto Blackwell's native tensor core data type, which removes the separate dequantization step.

| Feature | NVFP4 | Q4_K_M | MXFP4 |
|---|---|---|---|
| Numeric format | E2M1 FP4 elements with FP8 E4M3 block scales | INT4 block quantization | E2M1 microscaling |
| Block size | 16 elements | 32 elements | 32 elements |
| Effective BPW | 4.68 | ~4.50 | 4.72 |
| Dequantization overhead | None (native tensor cores) | Required on every load | Required on every load |
| Hardware acceleration | Blackwell (RTX 5060 Ti, 5070, 5090) | CUDA cores / CPU | CUDA cores / CPU / AMD |
| Dynamic range (max normal) | 6 per element (E2M1); block scale up to 448 (E4M3) | 7 (INT4, symmetric) | 6 (E2M1) |
| PPL vs F16 (estimated) | +0.2–0.5% | +0.3–0.6% | +0.3–0.7% |

When to use NVFP4: You have a Blackwell GPU (RTX 5060 Ti / 5070 / 5090) and want maximum throughput with near-lossless quality. The native FP4 tensor cores give approximately 2× throughput vs Q4_K_M on the same hardware.

When to use alternatives: You're running on pre-Blackwell NVIDIA GPUs, AMD GPUs, or CPU inference — use Q4_K_M or MXFP4 instead.
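
To make the format discussion in "About NVFP4" more concrete, here is a minimal sketch of block-wise FP4 quantization. It assumes NVIDIA's published NVFP4 layout (E2M1 elements, 16-element blocks); the scale is kept as an ordinary Python float rather than an FP8 E4M3 value, and the per-tensor FP32 factor is not modeled. The function names are illustrative and do not correspond to llama.cpp internals.

```python
import math

# The eight non-negative magnitudes representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0
BLOCK_SIZE = 16  # assumption: block size from NVIDIA's NVFP4 definition


def quantize_block(block):
    """Return (scale, values) where every value is a signed E2M1 magnitude.

    The scale maps the largest absolute weight in the block onto E2M1_MAX.
    It stays a plain float here (simplification, see lead-in).
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAX
    quantized = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude, then restore the sign.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append(math.copysign(nearest, x))
    return scale, quantized


def dequantize_block(scale, quantized):
    """Reconstruct approximate weights from a quantized block."""
    return [scale * q for q in quantized]
```

Calling `quantize_block` on 16 weights and comparing `dequantize_block(scale, q)` against the input gives a feel for the rounding error that a single shared scale introduces.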

Files

| Filename | Type | Size | BPW | Description |
|---|---|---|---|---|
| gemma-4-12b-it-nvfp4.gguf | NVFP4 quantized (text) | 6.98 GB | 4.68 | 48-layer text backbone with hybrid attention |
| mmproj-gemma-4-12b-it-f16.gguf | Vision encoder (F16) | 122 MB | 16.0 | SigLIP vision embedder + GEMMA4UV projector |

Quantization Characteristics

| Metric | Value |
|---|---|
| Input format | F16 GGUF (23.83 GB, 667 tensors) |
| Output format | NVFP4 GGUF (6.98 GB, 667 tensors) |
| Quantization type | LLAMA_FTYPE_MOSTLY_NVFP4 (type 39) |
| Compression ratio | ~3.41× (23.83 GB → 6.98 GB) |
| Quantization time | ~293 seconds on RTX 5060 Ti |
| 1D tensors (norms, scales) | Kept at F32 |
| Attention + FFN weights | Converted to NVFP4 |
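
The compression ratio and bits-per-weight figures can be cross-checked from the two file sizes alone. The sketch below assumes the sizes are decimal gigabytes and that nearly all of the F16 file is 2-byte weights (it ignores GGUF metadata and the F32 norm tensors), so the result is an approximation rather than the exact value llama.cpp reports.

```python
F16_SIZE_GB = 23.83    # F16 GGUF size from the table above
NVFP4_SIZE_GB = 6.98   # NVFP4 GGUF size from the table above
F16_BYTES_PER_WEIGHT = 2


def compression_ratio(source_gb, quantized_gb):
    """Size of the source file relative to the quantized file."""
    return source_gb / quantized_gb


def approx_weight_count(f16_size_gb):
    """Rough weight count, treating the whole F16 file as 2-byte weights."""
    return f16_size_gb * 1e9 / F16_BYTES_PER_WEIGHT


def effective_bpw(quantized_gb, weight_count):
    """Average bits stored per weight in the quantized file."""
    return quantized_gb * 1e9 * 8 / weight_count


weights = approx_weight_count(F16_SIZE_GB)
print(f"compression ratio: {compression_ratio(F16_SIZE_GB, NVFP4_SIZE_GB):.2f}x")
print(f"approx. weights:   {weights / 1e9:.2f} B")
print(f"effective BPW:     {effective_bpw(NVFP4_SIZE_GB, weights):.2f}")
```

Under these assumptions the estimate comes out at roughly 11.9 billion weights and just under 4.7 bits per weight, in line with the 4.68 BPW quoted for the file.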

Model Description

Gemma 4 12B is part of Google's fourth-generation Gemma family, featuring:

  • 48 transformer layers with 3840 hidden dimensions and 15360 FFN intermediate size
  • Hybrid attention: 40 sliding-window layers (window 1024, kv_heads=8, head_dim=256) interleaved with 8 full-attention layers (kv_heads=2, head_dim=512) in a 5:1 pattern
  • Context window: up to 262,144 tokens
  • RoPE scaling: separate frequency bases for sliding window and full attention
  • Final logit softcapping: stabilizes large-vocabulary predictions
  • Vision: SigLIP-based embedder (not a full ViT — a lightweight patch embedder with learned positional encoding) enables native image understanding without a separate vision transformer
  • Instruction-tuned: optimized for chat and instruction-following with Gemma 4's structured turn format
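
As an illustration of the logit softcapping item above, here is a small sketch of the tanh-based form used in earlier Gemma releases. It is an assumption that Gemma 4 uses the same formula, and the cap value of 30.0 is purely illustrative; the real value would come from the model's configuration.

```python
import math


def softcap(logits, cap):
    """Squash logits smoothly into [-cap, cap] via cap * tanh(x / cap).

    Small logits pass through almost unchanged; very large ones saturate
    near the cap instead of growing without bound.
    """
    return [cap * math.tanh(x / cap) for x in logits]


ILLUSTRATIVE_CAP = 30.0  # hypothetical value, not read from this model

raw = [-200.0, -5.0, 0.0, 5.0, 200.0]
print(softcap(raw, ILLUSTRATIVE_CAP))
```

Because tanh is monotonic, the ranking of tokens is preserved; only the spread of extreme values is compressed.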

The base model is google/gemma-4-12B-it, released under the Apache 2.0 license.

Usage

llama.cpp CLI

# Text-only inference
./llama-cli \
  -m gemma-4-12b-it-nvfp4.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 512

# Vision inference (requires mmproj)
./llama-cli \
  -m gemma-4-12b-it-nvfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --image diagram.png \
  -p "Explain what this diagram shows" \
  -n 512

llama.cpp Server (OpenAI-compatible API)

./llama-server \
  -m gemma-4-12b-it-nvfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192

# Chat completion with vision (via API)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-12b-it-nvfp4",
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        {"type": "text", "text": "Describe this image"}
      ]}
    ]
  }'
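
The same request can be sent from Python without extra dependencies. This is a sketch that mirrors the curl call above, assuming the server is listening on localhost:8080 as started earlier; `build_vision_request` and `chat` are hypothetical helper names, not part of llama.cpp.

```python
import json
import urllib.request


def build_vision_request(image_url, prompt, model="gemma-4-12b-it-nvfp4"):
    """Build an OpenAI-style chat payload with one image and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def chat(payload, base_url="http://localhost:8080"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat(build_vision_request("https://example.com/photo.jpg", "Describe this image"))` should return the model's description as a string.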

LM Studio

  1. Download both .gguf files
  2. Place them in the same model folder
  3. In LM Studio, select the model and ensure the mmproj is auto-detected (same basename)
  4. The chat template is embedded in the GGUF — no manual configuration needed

Python (llama-cpp-python)

from llama_cpp import Llama

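llm = Llama(
    model_path="gemma-4-12b-it-nvfp4.gguf",
    # NOTE: upstream llama-cpp-python loads vision projectors through a
    # chat_handler object; check that your build accepts mmproj= directly.
    mmproj="mmproj-gemma-4-12b-it-f16.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to GPU
)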

# Text
output = llm("What is the capital of France?", max_tokens=128)
print(output["choices"][0]["text"])

# Vision
output = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "photo.jpg"}},
            {"type": "text", "text": "What's in this image?"}
        ]
    }],
    max_tokens=256
)
print(output["choices"][0]["message"]["content"])

Download

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF",
    filename="gemma-4-12b-it-nvfp4.gguf"
)
mmproj_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF",
    filename="mmproj-gemma-4-12b-it-f16.gguf"
)
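
After downloading, a quick sanity check is to read the fixed-size GGUF header and compare the tensor count with the 667 tensors this README reports for the text model. The sketch below assumes a little-endian GGUF file of version 2 or later, where both counts are 64-bit; `read_gguf_header` is an illustrative helper, not a library function.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return version, tensor count and metadata key/value count of a GGUF file."""
    with open(path, "rb") as handle:
        header = handle.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For example, `read_gguf_header(model_path)["tensor_count"]` can be compared against the expected tensor count before loading the model.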

Thinking / Reasoning Behavior

Gemma 4 supports structured reasoning using <|channel>thought tags. The chat template handles reasoning as follows:

  • enable_thinking=false (default): The template inserts empty think tags (<|channel>thought\n<channel|>) at the start of the model's generation turn. This signals to the model that reasoning has already been completed, suppressing thinking output entirely.
  • enable_thinking=true: The model may generate internal reasoning tokens enclosed in <|channel>thought...<channel|> before its final response.

In LM Studio, reasoning sections are rendered as collapsible blocks when the chat template and reasoning.parsing configuration are properly set (start string: <|channel>thought, end string: <channel|>).

This behavior matches the official Google/unsloth GGUFs.
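
When consuming raw completions outside LM Studio, the reasoning block can be separated from the final answer with plain string handling. This sketch assumes the start and end strings quoted above and at most one reasoning block per response; `split_reasoning` is a hypothetical helper name.

```python
THINK_START = "<|channel>thought"
THINK_END = "<channel|>"


def split_reasoning(text):
    """Split a raw completion into (reasoning, answer).

    Returns an empty reasoning string when no reasoning block is present.
    If the block is never closed, everything after the start tag is
    treated as reasoning.
    """
    start = text.find(THINK_START)
    if start == -1:
        return "", text.strip()
    body_start = start + len(THINK_START)
    end = text.find(THINK_END, body_start)
    if end == -1:
        return text[body_start:].strip(), text[:start].strip()
    reasoning = text[body_start:end].strip()
    answer = (text[:start] + text[end + len(THINK_END):]).strip()
    return reasoning, answer
```

With thinking disabled, the empty tag pair inserted by the template yields an empty reasoning string and the answer is returned unchanged.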

Conversion Pipeline

google/gemma-4-12B-it (23.92 GB, safetensors)
  │
  ├─ convert_hf_to_gguf.py --outtype f16 (llama.cpp d403f00, Gemma4Unified handler)
  │     → gemma-4-12b-f16.gguf (23.83 GB, 667 tensors, text backbone only)
  │
  ├─ convert_hf_to_gguf.py --mmproj --outtype f16
  │     → mmproj-gemma-4-12b-f16.gguf (122 MB, 11 tensors, SigLIP embedder)
  │
  └─ llama-quantize.exe NVFP4
        → gemma-4-12b-it-nvfp4.gguf (6.98 GB, 667 tensors, 4.68 BPW)

Key facts:

  • Built with llama.cpp commit d403f00 (CUDA backend, Blackwell-enabled)
  • The converter uses the Gemma4UnifiedForConditionalGeneration handler, which automatically separates the text backbone from the vision embedders
  • No post-processing or binary patching required
  • Google's original chat template (tokenizer.chat_template, 17,466 bytes) is preserved as-is from the source model

Hardware Requirements

| GPU | VRAM | Model + KV Cache (4k ctx) | Estimated Speed |
|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | ~10 GB | ~50 tok/s |
| RTX 5070 12GB | 12 GB | ~10 GB | ~55 tok/s |
| RTX 5090 32GB | 32 GB | ~10 GB | ~80 tok/s |
| CPU (6+ cores) | System RAM | ~7 GB | ~5-8 tok/s |

Minimum VRAM: ~10 GB in total (~7 GB of weights at 4.68 BPW plus ~2.5 GB of KV cache at 8192 ctx).
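
The KV cache share can be estimated from the attention layout given in the Model Description (40 sliding-window layers with window 1024, kv_heads=8, head_dim=256; 8 full-attention layers with kv_heads=2, head_dim=512). The sketch below assumes an F16 cache and compares two cases: a runtime that caps sliding-window layers at their window, and one that allocates the full context for every layer. Which case applies depends on the runtime and its cache settings, so treat both numbers as rough bounds.

```python
BYTES_F16 = 2  # assumption: keys and values cached at F16

SLIDING_LAYERS, SLIDING_WINDOW = 40, 1024
SLIDING_KV_HEADS, SLIDING_HEAD_DIM = 8, 256
FULL_LAYERS, FULL_KV_HEADS, FULL_HEAD_DIM = 8, 2, 512


def kv_bytes_per_token(kv_heads, head_dim, bytes_per_value=BYTES_F16):
    """Bytes one layer stores per token: one key and one value vector per KV head."""
    return 2 * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(n_ctx, windowed=True):
    """Total KV cache size in bytes for a context of n_ctx tokens."""
    sliding_tokens = min(n_ctx, SLIDING_WINDOW) if windowed else n_ctx
    sliding = SLIDING_LAYERS * sliding_tokens * kv_bytes_per_token(
        SLIDING_KV_HEADS, SLIDING_HEAD_DIM
    )
    full = FULL_LAYERS * n_ctx * kv_bytes_per_token(FULL_KV_HEADS, FULL_HEAD_DIM)
    return sliding + full


for windowed in (True, False):
    size_gib = kv_cache_bytes(8192, windowed) / 2**30
    print(f"8192 ctx, windowed={windowed}: {size_gib:.2f} GiB")
```

At 8192 tokens this gives about 0.56 GiB with windowed caching and about 2.75 GiB without it, so the ~2.5 GB figure above is closer to the conservative, unwindowed case.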

Hardware Compatibility

| Backend | NVFP4 Support | Notes |
|---|---|---|
| CUDA (Blackwell, SM 120a) | ✅ Native tensor cores | RTX 5060 Ti, 5070, 5090 |
| CUDA (pre-Blackwell) | ❌ Falls back to CPU | SM < 120, no FP4 support |
| CPU (llamafile) | ❌ Not supported | Use Q4_K_M instead |
| Vulkan | ❌ Not supported | Use Q4_K_M instead |
| Metal (Apple) | ❌ Not supported | Use Q4_K_M instead |
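
A script can check the CUDA row of this table before downloading the NVFP4 file. The sketch below assumes a reasonably recent NVIDIA driver whose `nvidia-smi` supports the `compute_cap` query field, and it applies the SM 120 threshold exactly as stated in the table; the helper names are illustrative.

```python
import subprocess

MIN_NVFP4_COMPUTE_CAP = (12, 0)  # SM 120, per the table above


def parse_compute_cap(text):
    """Turn a string such as '12.0' into a (major, minor) tuple."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def supports_native_nvfp4(compute_cap):
    """True if the capability meets the threshold given in the table."""
    return compute_cap >= MIN_NVFP4_COMPUTE_CAP


def query_compute_caps():
    """Ask nvidia-smi for the compute capability of each visible GPU."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_compute_cap(line) for line in output.splitlines() if line.strip()]
```

On a machine with an NVIDIA driver installed, `any(supports_native_nvfp4(cap) for cap in query_compute_caps())` indicates whether at least one GPU meets the stated requirement; otherwise a Q4_K_M build is the safer choice.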

License

Apache 2.0, as per Google's Gemma 4 license.


Source: Hugging Face