FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF overview
Model Details
| Model ID | FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF |
|---|---|
| Author | FreedomAISVR |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | google/gemma-4-12B-it |
| Last modified | 2026-06-08T04:34:52.000Z |
Model README
---
license: apache-2.0
language:
- en
library_name: gguf
tags:
- gguf
- gemma
- gemma-4
- 12b
- nvfp4
- fp4
- blackwell
- vision
- multimodal
base_model: google/gemma-4-12B-it
pipeline_tag: text-generation
inference: false
quantized_by: freedomaisvr
---
Gemma-4-12B-it-NVFP4-GGUF
NVFP4 (Blackwell FP4) quantization of Google's Gemma 4 12B It, a multimodal language model with native vision understanding.
This repository contains two files:
- `gemma-4-12b-it-nvfp4.gguf`: Text backbone (48 transformer layers, 3840 hidden dim, 262k context) quantized to NVFP4
- `mmproj-gemma-4-12b-it-f16.gguf`: SigLIP vision encoder + projector at F16 precision (required for image input)
About NVFP4
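NVFP4 is NVIDIA's 4-bit floating-point format for the Blackwell GPU architecture (RTX 50-series). Each weight is stored as an E2M1 value (1 sign, 2 exponent, 1 mantissa bit), and every block of 16 weights shares one FP8 E4M3 scale (1 sign, 4 exponent, 3 mantissa bits). Unlike Q4_K_M (block-quantized 4-bit integers) or MXFP4 (E2M1 elements with a power-of-two scale per 32 weights), NVFP4 matches a data type that Blackwell tensor cores consume directly, so the weights do not need a separate dequantization pass to a wider type before the matrix multiply.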
| Feature | NVFP4 | Q4_K_M | MXFP4 |
|---|---|---|---|
| Numeric format | E2M1 FP4 elements + FP8 E4M3 block scale | 4-bit integer block quantization | E2M1 FP4 elements + E8M0 block scale |
| Block size | 16 elements | 32-element sub-blocks in 256-element super-blocks | 32 elements |
| Effective BPW | 4.68 | ~4.50 | 4.72 |
| Dequantization overhead | None (native tensor cores) | Required on every load | Required on every load |
| Hardware acceleration | Blackwell (RTX 5060 Ti, 5070, 5090) | CUDA cores / CPU | CUDA cores / CPU / AMD |
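| Element range (max magnitude before scaling) | 6 (E2M1); block scale up to 448 (E4M3) | 15 (unsigned 4-bit code, plus per-block scale and min) | 6 (E2M1); power-of-two block scale (E8M0) |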
| PPL vs F16 (estimated) | +0.2–0.5% | +0.3–0.6% | +0.3–0.7% |
When to use NVFP4: You have a Blackwell GPU (RTX 5060 Ti / 5070 / 5090) and want maximum throughput with near-lossless quality. The native FP4 tensor cores give approximately 2× throughput vs Q4_K_M on the same hardware.
When to use alternatives: You're running on pre-Blackwell NVIDIA GPUs, AMD GPUs, or CPU inference — use Q4_K_M or MXFP4 instead.
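To make the storage layout concrete, here is a minimal sketch of block-scaled FP4 quantization in the style NVIDIA describes for NVFP4: each element is rounded to the nearest E2M1 magnitude, and a 16-element block shares one scale. It is an illustration only. `quantize_block` and `dequantize_block` are hypothetical helpers, the scale is kept as a plain Python float instead of being rounded to FP8 E4M3, and none of this reflects how llama.cpp implements the type.

```python
import math

# The eight non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # elements sharing one scale


def quantize_block(block):
    """Return (scale, codes) where each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest E2M1 value (6).
    scale = amax / 6.0
    codes = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the shared scale and E2M1 codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    weights = [float(i - 8) for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.4f} worst abs error={worst:.4f}")
```

Because the E2M1 grid is coarsest between 4 and 6, the largest rounding error in a block is bounded by the block scale, which is why a small block with its own scale helps preserve accuracy.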
Files
| Filename | Type | Size | BPW | Description |
|---|---|---|---|---|
| gemma-4-12b-it-nvfp4.gguf | NVFP4 quantized (text) | 6.98 GB | 4.68 | 48-layer text backbone with hybrid attention |
| mmproj-gemma-4-12b-it-f16.gguf | Vision encoder (F16) | 122 MB | 16.0 | SigLIP vision embedder + GEMMA4UV projector |
Quantization Characteristics
| Metric | Value |
|---|---|
| Input format | F16 GGUF (23.83 GB, 667 tensors) |
| Output format | NVFP4 GGUF (6.98 GB, 667 tensors) |
| Quantization type | LLAMA_FTYPE_MOSTLY_NVFP4 (type 39) |
| Compression ratio | 3.42× (23.83 GB → 6.98 GB) |
| Quantization time | ~293 seconds on RTX 5060 Ti |
| 1D tensors (norms, scales) | Kept at F32 |
| Attention + FFN weights | Converted to NVFP4 |
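The figures in this table can be cross-checked with a few lines of arithmetic. This is a rough sketch that assumes every weight in the input file takes 2 bytes (F16) and ignores GGUF metadata and the F32 tensors, so small rounding differences from the quoted values are expected.

```python
GB = 1e9  # assumption: sizes are decimal gigabytes (the unit cancels in both results)

f16_bytes = 23.83 * GB
nvfp4_bytes = 6.98 * GB

# F16 stores 2 bytes per weight, so the F16 file size approximates the parameter count.
approx_params = f16_bytes / 2

compression_ratio = f16_bytes / nvfp4_bytes
bits_per_weight = nvfp4_bytes * 8 / approx_params

print(f"approx. parameters: {approx_params / 1e9:.2f} B")
print(f"compression ratio:  {compression_ratio:.2f}x")
print(f"bits per weight:    {bits_per_weight:.2f}")
```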
Model Description
Gemma 4 12B is part of Google's fourth-generation Gemma family, featuring:
- 48 transformer layers with 3840 hidden dimensions and 15360 FFN intermediate size
- Hybrid attention: 40 sliding-window layers (window 1024, kv_heads=8, head_dim=256) interleaved with 8 full-attention layers (kv_heads=2, head_dim=512) in a 5:1 pattern
- Context window: up to 262,144 tokens
- RoPE scaling: separate frequency bases for sliding window and full attention
- Final logit softcapping: stabilizes large-vocabulary predictions
- Vision: SigLIP-based embedder (not a full ViT — a lightweight patch embedder with learned positional encoding) enables native image understanding without a separate vision transformer
- Instruction-tuned: optimized for chat and instruction-following with Gemma 4's structured turn format
The base model is google/gemma-4-12B-it under Apache 2.0 license.
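As a quick sanity check on the hybrid attention layout, the sketch below enumerates a 5:1 pattern over 48 layers. The exact position of the full-attention layer within each group of six is an assumption (here, the last one); only the 40/8 split comes from the description above.

```python
N_LAYERS = 48
GROUP = 6  # assumption: five sliding-window layers followed by one full-attention layer


def layer_kind(index: int) -> str:
    """Return the attention type of a zero-based layer index under the assumed pattern."""
    return "full" if (index + 1) % GROUP == 0 else "sliding"


kinds = [layer_kind(i) for i in range(N_LAYERS)]
print("sliding:", kinds.count("sliding"), "full:", kinds.count("full"))
```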
Usage
llama.cpp CLI
# Text-only inference
./llama-cli \
-m gemma-4-12b-it-nvfp4.gguf \
-p "Explain quantum computing in simple terms" \
-n 512
# Vision inference (requires mmproj)
./llama-cli \
-m gemma-4-12b-it-nvfp4.gguf \
--mmproj mmproj-gemma-4-12b-it-f16.gguf \
--image diagram.png \
-p "Explain what this diagram shows" \
-n 512
llama.cpp Server (OpenAI-compatible API)
./llama-server \
-m gemma-4-12b-it-nvfp4.gguf \
--mmproj mmproj-gemma-4-12b-it-f16.gguf \
--port 8080 \
-ngl 99 \
-c 8192
# Chat completion with vision (via API)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-12b-it-nvfp4",
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
{"type": "text", "text": "Describe this image"}
]}
]
}'
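The same request can be sent from Python using only the standard library. This is a sketch, not part of the original card: `build_vision_request` and `post_chat` are hypothetical helper names, and it assumes the server accepts a base64 `data:` URI in `image_url` as an alternative to a remote URL, which is useful for local files.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat payload with one inline image and one text part."""
    data_uri = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gemma-4-12b-it-nvfp4",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8080") -> str:
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the previous command running, a call would look like `post_chat(build_vision_request(open("photo.jpg", "rb").read(), "Describe this image"))`.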
LM Studio
- Download both `.gguf` files
- Place them in the same model folder
- In LM Studio, select the model and ensure the mmproj is auto-detected (same basename)
- The chat template is embedded in the GGUF, so no manual configuration is needed
Python (llama-cpp-python)
from llama_cpp import Llama
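llm = Llama(
    model_path="gemma-4-12b-it-nvfp4.gguf",
    # NOTE: check this argument against your installed llama-cpp-python version;
    # some releases expect the projector via a chat handler (chat_handler=...,
    # built with clip_model_path=...) instead of `mmproj`.
    mmproj="mmproj-gemma-4-12b-it-f16.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to GPU
)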
# Text
output = llm("What is the capital of France?", max_tokens=128)
print(output["choices"][0]["text"])
# Vision
output = llm.create_chat_completion(
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "photo.jpg"}},
{"type": "text", "text": "What's in this image?"}
]
}],
max_tokens=256
)
print(output["choices"][0]["message"]["content"])
Download
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
repo_id="FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF",
filename="gemma-4-12b-it-nvfp4.gguf"
)
mmproj_path = hf_hub_download(
repo_id="FreedomAISVR/Gemma-4-12B-it-NVFP4-GGUF",
filename="mmproj-gemma-4-12b-it-f16.gguf"
)
Thinking / Reasoning Behavior
Gemma 4 supports structured reasoning using <|channel>thought tags. The chat template handles reasoning as follows:
- `enable_thinking=false` (default): The template inserts empty think tags (`<|channel>thought\n<channel|>`) at the start of the model's generation turn. This signals to the model that reasoning has already been completed, suppressing thinking output entirely.
- `enable_thinking=true`: The model may generate internal reasoning tokens enclosed in `<|channel>thought...<channel|>` before its final response.
In LM Studio, reasoning sections are rendered as collapsible blocks when the chat template and reasoning.parsing configuration are properly set (start string: <|channel>thought, end string: <channel|>).
This behavior matches the official Google/unsloth GGUFs.
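For clients that receive the raw text, the start and end strings above are enough to separate reasoning from the final answer. The following is a minimal sketch with a hypothetical helper, `split_reasoning`; it handles a single reasoning block and is not the parser LM Studio or llama.cpp uses.

```python
THINK_START = "<|channel>thought"
THINK_END = "<channel|>"


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a response containing at most one reasoning block."""
    start = text.find(THINK_START)
    if start == -1:
        return "", text.strip()
    body_start = start + len(THINK_START)
    end = text.find(THINK_END, body_start)
    if end == -1:
        # Unterminated block (for example, generation stopped early): everything after
        # the start string is treated as reasoning.
        return text[body_start:].strip(), text[:start].strip()
    reasoning = text[body_start:end].strip()
    answer = (text[:start] + text[end + len(THINK_END):]).strip()
    return reasoning, answer


if __name__ == "__main__":
    reasoning, answer = split_reasoning("<|channel>thought\nadd 2 and 2<channel|>4")
    print("reasoning:", reasoning)
    print("answer:", answer)
```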
Conversion Pipeline
google/gemma-4-12B-it (23.92 GB, safetensors)
│
├─ convert_hf_to_gguf.py --outtype f16 (llama.cpp d403f00, Gemma4Unified handler)
│ → gemma-4-12b-f16.gguf (23.83 GB, 667 tensors, text backbone only)
│
├─ convert_hf_to_gguf.py --mmproj --outtype f16
│ → mmproj-gemma-4-12b-f16.gguf (122 MB, 11 tensors, SigLIP embedder)
│
└─ llama-quantize.exe NVFP4
→ gemma-4-12b-it-nvfp4.gguf (6.98 GB, 667 tensors, 4.68 BPW)
Key facts:
- Built with llama.cpp commit `d403f00` (CUDA backend, Blackwell-enabled)
- The converter uses the `Gemma4UnifiedForConditionalGeneration` handler, which automatically separates the text backbone from the vision embedders
- No post-processing or binary patching required
- Google's original chat template (`tokenizer.chat_template`, 17,466 bytes) is preserved as-is from the source model
Hardware Requirements
| GPU | VRAM | Model + KV Cache (4k ctx) | Estimated Speed |
|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | ~10 GB | ~50 tok/s |
| RTX 5070 12GB | 12 GB | ~10 GB | ~55 tok/s |
| RTX 5090 32GB | 32 GB | ~10 GB | ~80 tok/s |
| CPU (6+ cores) | System RAM | ~7 GB | ~5-8 tok/s |
Minimum VRAM: ~10 GB for the model (~7 GB at 4.68 BPW + ~2.5 GB KV cache at 8192 ctx).
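The KV cache share of that budget can be estimated from the attention layout in the Model Description. This sketch assumes an F16 cache (2 bytes per element) and shows two cases, because the result depends heavily on whether the runtime caches only the 1024-token window for the sliding layers; actual llama.cpp allocations may differ.

```python
def kv_bytes(layers, kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
    """Bytes needed for K and V across `layers` layers (the factor 2 covers K plus V)."""
    return layers * 2 * kv_heads * head_dim * cached_tokens * bytes_per_elem


def kv_cache_bytes(n_ctx, window_limited):
    """Total KV cache for 40 sliding-window layers and 8 full-attention layers."""
    sliding_tokens = min(n_ctx, 1024) if window_limited else n_ctx
    sliding = kv_bytes(40, 8, 256, sliding_tokens)
    full = kv_bytes(8, 2, 512, n_ctx)
    return sliding + full


GIB = 1024 ** 3
for limited in (False, True):
    total = kv_cache_bytes(8192, limited)
    print(f"8192 ctx, window_limited={limited}: {total / GIB:.2f} GiB")
```

Under these assumptions the unlimited case lands in the same range as the ~2.5 GB quoted above, while a window-limited cache would be several times smaller.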
Hardware Compatibility
| Backend | NVFP4 Support | Notes |
|---|---|---|
| CUDA (Blackwell, SM 120a) | ✅ Native tensor cores | RTX 5060 Ti, 5070, 5090 |
| CUDA (pre-Blackwell) | ❌ Falls back to CPU | SM < 120, no FP4 support |
| CPU (llamafile) | ❌ Not supported | Use Q4_K_M instead |
| Vulkan | ❌ Not supported | Use Q4_K_M instead |
| Metal (Apple) | ❌ Not supported | Use Q4_K_M instead |
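The table reduces to a simple rule that a launcher script could apply. The sketch below is illustrative: `recommended_quant` is a hypothetical helper, the SM 120 threshold is taken directly from the table, and the input is a compute capability string such as the one `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` reports on recent drivers (an assumption about your driver version).

```python
def recommended_quant(compute_capability: str) -> str:
    """Pick a quantization from a CUDA compute capability string such as '12.0'."""
    major, minor = (int(part) for part in compute_capability.strip().split("."))
    # SM 120 corresponds to compute capability 12.0, the threshold used in the table.
    return "NVFP4" if (major, minor) >= (12, 0) else "Q4_K_M"


if __name__ == "__main__":
    for cap in ("12.0", "8.9"):
        print(cap, "->", recommended_quant(cap))
```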
License
Apache 2.0, as per Google's Gemma 4 license.