GraySoft
Model Intelligence Sheet

Krasnopjorovs/gemma-4-12B-it-Imatrix-GGUF overview


Tags: gguf, imatrix, llama.cpp, quantized, verified, text-generation, base_model:google/gemma-4-12B-it, base_model:finetune:google/gemma-4-12B-it, license:apache-2.0, region:us
Downloads: 0
Likes: 0
Pipeline: text-generation

Repository Files & Downloads

0 GGUF files detected
Direct downloads for local inference

Model Details

Model ID: Krasnopjorovs/gemma-4-12B-it-Imatrix-GGUF
Author: Krasnopjorovs
Pipeline: text-generation
License: apache-2.0
Base model: google/gemma-4-12B-it
Last modified: 2026-06-06T19:44:49.000Z

Model README

---
license: apache-2.0
base_model: google/gemma-4-12B-it
pipeline_tag: text-generation
library_name: gguf
tags:
- gguf
- imatrix
- llama.cpp
- quantized
- verified
quantized_by: Krasnopjorovs
---

🦙 gemma-4-12B-it: Imatrix GGUF

> Verified quants only. Sub-4-bit variants are excluded from this release because they produce degenerate output on this model size, so there is no point shipping broken files.

GGUF imatrix builds of google/gemma-4-12B-it.

Built with llama.cpp and importance-matrix calibration on a public multilingual + code + math corpus. Every quant is loaded, prompted, and visually checked before publication.

---

👋 From the author

Hi local LLM enthusiasts! You may have noticed a few quants from me over the past months; those were one-off experiments. This release is the first one from a fully automated pipeline.

Please try them out, share feedback, and let me know which quants you find useful or which ones you wish I had made. For now, anything below 4-bit produces degraded output on this architecture, so I am holding off on those until they actually work well. More quants and more models coming.

---

🎯 Pick a quant

| Quant | Size | What it's for |
|---|---|---|
| Q8_0 | 12G | Almost the original. Pick this if RAM isn't a concern. |
| Q6_K_L | 9.4G | Near-lossless with Q8_0 embeddings. Best of the K-quants. |
| Q6_K | 9.2G | Near-lossless. Excellent fidelity at smaller size than Q8_0. |
| Q5_K_L | 8.2G | Q5_K_M with Q8_0 embeddings. High quality, small overhead. |
| Q5_K_M | 8.0G | Sweet spot between Q4 and Q6. Solid all-rounder. |
| Q5_K_S | 7.8G | Slightly smaller than Q5_K_M, virtually identical output. |
| Q4_K_L | 7.2G | Q4_K_M with Q8_0 embeddings. The smart 4-bit choice. |
| Q4_K_M | 6.9G | The default. Most-downloaded quant for a reason. |
| Q4_K_S | 6.6G | A bit smaller than Q4_K_M, a tiny step down. |
| IQ4_NL | 6.5G | ARM-optimized 4-bit. For Raspberry Pi & friends. |
| IQ4_XS | 6.2G | Tightest 4-bit format. Quality close to Q4_K_S, smaller. |

*_L variants override the output tensor and embeddings to Q8_0: small disk cost, better output stability.

TL;DR:

  • 🟢 Q4_K_M: start here unless you have a reason not to
  • 🟡 Q5_K_M: small quality bump if you have RAM to spare
  • 🔵 Q8_0: pick this if you're not RAM-constrained and want max quality
  • 🟠 IQ4_XS: when the other 4-bit quants are too big
  • 🟣 IQ4_NL: running on ARM (Pi, phones)
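
The selection logic above can be sketched as a small helper that returns the largest quant fitting a memory budget. The sizes are copied from the table; the `headroom_gb` default is an assumed allowance for KV cache and runtime buffers, not a measured figure, so treat the result as a starting point rather than a guarantee.

```python
# File sizes as listed in the quant table (the "G" column).
QUANT_SIZES_GB = {
    "Q8_0": 12.0,
    "Q6_K_L": 9.4,
    "Q6_K": 9.2,
    "Q5_K_L": 8.2,
    "Q5_K_M": 8.0,
    "Q5_K_S": 7.8,
    "Q4_K_L": 7.2,
    "Q4_K_M": 6.9,
    "Q4_K_S": 6.6,
    "IQ4_NL": 6.5,
    "IQ4_XS": 6.2,
}


def pick_quant(budget_gb, headroom_gb=2.0):
    """Return the largest quant whose file fits in the budget, or None.

    headroom_gb is a hypothetical reserve for context (KV cache) and
    runtime buffers; real usage depends on context length and backend.
    """
    usable = budget_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= usable]
    if not fitting:
        return None
    # Sizes are unique, so max() on (size, name) picks the largest file.
    return max(fitting)[1]
```

For example, `pick_quant(16)` selects `Q8_0`, while `pick_quant(8)` returns `None` because no listed file fits once the assumed headroom is subtracted.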

---

💬 Prompt format

<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
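
If you build raw prompts yourself instead of relying on a chat endpoint, the template above can be filled in with a helper like this minimal sketch. The function name and the `include_bos` switch are illustrative assumptions; some runtimes add the BOS token on their own, in which case passing `include_bos=False` avoids doubling it.

```python
def format_gemma_prompt(prompt: str, include_bos: bool = True) -> str:
    """Wrap a single user message in the turn markers shown above."""
    bos = "<bos>" if include_bos else ""
    return (
        f"{bos}<start_of_turn>user\n"
        f"{prompt}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```

The model's reply is expected to follow the final `<start_of_turn>model` line.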

---

📥 Download

Single file (recommended):

hf download Krasnopjorovs/gemma-4-12B-it-Imatrix-GGUF \
  --include "gemma-4-12B-it-Q4_K_M.gguf" --local-dir .

Whole repo:

hf download Krasnopjorovs/gemma-4-12B-it-Imatrix-GGUF --local-dir ./gemma-4-12B-it-gguf
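
The same single-file download can be scripted from Python with `huggingface_hub`. This is a sketch under one assumption: that every quant follows the file-name pattern of the one example shown above (`gemma-4-12B-it-Q4_K_M.gguf`). Check the repository file list before relying on it.

```python
REPO_ID = "Krasnopjorovs/gemma-4-12B-it-Imatrix-GGUF"


def gguf_filename(quant: str) -> str:
    # Assumption: all files share the naming pattern of the documented example.
    return f"gemma-4-12B-it-{quant}.gguf"


def download_quant(quant: str, local_dir: str = ".") -> str:
    """Download one quant file and return its local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

Calling `download_quant("Q4_K_M")` would fetch the default quant into the current directory.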

---

▶ Run

./llama-server \
  -m gemma-4-12B-it-Q4_K_M.gguf \
  -c 32768 -ngl 99 \
  --host 0.0.0.0 --port 8080

Then point any OpenAI-compatible chat client at http://localhost:8080/v1.
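
As a quick smoke test without a chat client, the server's OpenAI-compatible chat completions route can be called with the Python standard library. This sketch assumes the server is running with the host and port from the command above; the helper names and the `max_tokens` default are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the run command


def build_chat_request(user_message, max_tokens=128):
    """Build a POST request for the /chat/completions route."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Say hello in one sentence."))` should print a short reply. The server applies the chat template itself on this route, so no manual turn markers are needed.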

---

🔬 Calibration

Imatrix generated from reapmix, a community calibration mix (~400K tokens, multilingual + code + math). Same class of public calibration data used by other community publishers; this release makes no unique calibration claim.

The output tensor and embeddings carry disproportionate weight in quantized output. The *_L variants keep these at Q8_0: a small disk cost for noticeably better stability at low bit-rates.
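
For readers who want to reproduce an *_L-style build, the sketch below assembles a `llama-quantize` command that applies an imatrix and overrides the output and token-embedding tensors to Q8_0. It is not the author's actual pipeline, which is not published here; the binary path and all file names are hypothetical placeholders.

```python
import shlex


def quantize_l_variant_cmd(src_gguf, dst_gguf, imatrix_path, base_type="Q4_K_M"):
    """Return an argv list for llama-quantize with Q8_0 output/embedding overrides."""
    return [
        "./llama-quantize",                   # assumed location of the binary
        "--imatrix", imatrix_path,            # importance matrix from calibration
        "--output-tensor-type", "q8_0",       # keep the output tensor at Q8_0
        "--token-embedding-type", "q8_0",     # keep token embeddings at Q8_0
        src_gguf,
        dst_gguf,
        base_type,                            # quant type for the remaining tensors
    ]


# Hypothetical file names, shown only to illustrate the resulting command line.
cmd = quantize_l_variant_cmd("model-f16.gguf", "model-Q4_K_L.gguf", "imatrix.dat")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run` once the paths point at real files.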

---

✅ Build info

  • Source: google/gemma-4-12B-it
  • llama.cpp: latest mainline
  • Quantization: CPU
  • Imatrix calibration: GPU
  • Generated: 2026-06-06

---

🙏 Credits


Source: Hugging Face