Model Intelligence Sheet

TheStageAI/gemma-4-E2B-it-qat-GGUF overview


Tags: llama.cpp, gguf, gemma, gemma-4, qat, quantization, text-generation, en, multilingual, base_model:google/gemma-4-E2B-it-qat-q4_0-unquantized, base_model:quantized:google/gemma-4-E2B-it-qat-q4_0-unquantized, license:mit, endpoints_compatible, region:us

Runs locally from ~2.32 GB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads: 0
Likes: 0
Pipeline: text-generation

Repository Files & Downloads

3 GGUF files detected. Direct downloads for local inference:

| File | Type | Quantization | Size |
|---|---|---|---:|
| gemma-4-E2B-it-qat-GGUF-L.gguf | GGUF | GGUF | 2.53 GB |
| gemma-4-E2B-it-qat-GGUF-M.gguf | GGUF | GGUF | 2.32 GB |
| gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf | GGUF | GGUF | 2.50 GB |

Model Details

Model ID: TheStageAI/gemma-4-E2B-it-qat-GGUF
Author: TheStageAI
Pipeline: text-generation
License: mit
Base model: google/gemma-4-E2B-it-qat-q4_0-unquantized
Last modified: 2026-06-11T11:22:08.000Z

Model README

---
license: mit
base_model:
- google/gemma-4-E2B-it-qat-q4_0-unquantized
base_model_relation: quantized
library_name: llama.cpp
pipeline_tag: text-generation
tags:
- gguf
- llama.cpp
- gemma
- gemma-4
- qat
- quantization
language:
- en
- multilingual
---

TheStageAI/gemma-4-E2B-it-qat-GGUF

A portable GGUF release of Google's Gemma 4 E2B instruction model, compressed from Google's
QAT-trained BF16 weights and emitted as standard llama.cpp-compatible .gguf files.

Use this repo when deployment portability matters most. If you can run our native MLX runtime and
want the smallest artifacts, use the edge-lm sibling release.

Why this exists

The native edge-lm checkpoints use custom codecs for both decoder weights and PLE tables, which is
why they are smaller at comparable quality. Many deployments, however, need standard GGUF files that
work with llama.cpp-compatible tooling.

This repo keeps the production bit-width schedules from our native compression pipeline, but maps the
weights into GGUF-compatible quantization formats. The result is larger than the native release, but
portable.

How it was compressed

We start from Google's QAT-trained BF16 checkpoint and reuse the production m and l schedules from
the native release.

  • Transformer blocks - the M and L files follow our RCO-selected production bit-width schedules, then emit the weights in GGUF-compatible K-quant layouts with the required group sizes and symmetric/asymmetric modes for each tensor family.
  • PLE tables - stored with GGUF-compatible Q4 scalar quantization instead of the native AQLM PLE codec, so the files stay portable across GGUF runtimes.
  • Token embeddings / LM head - quantized through the same GGUF-compatible path as the rest of the model.
  • W4-uniform - a conservative uniform 4-bit GGUF variant with the same Q4 PLE path.
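The "symmetric/asymmetric modes" and "group sizes" mentioned above can be illustrated with a toy example. The sketch below is an assumption-laden illustration of generic 4-bit group quantization, not the authors' pipeline and not the actual GGUF K-quant bit layout; all function names are hypothetical.

```python
from typing import List, Tuple


def quantize_symmetric(group: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric mode: one scale per group, and 0.0 always maps to code 0."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4 bits
    qmin = -(2 ** (bits - 1))       # -8 for 4 bits
    max_abs = max(abs(x) for x in group)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(x / scale))) for x in group]
    return codes, scale


def dequantize_symmetric(codes: List[int], scale: float) -> List[float]:
    return [q * scale for q in codes]


def quantize_asymmetric(group: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Asymmetric mode: a scale plus an offset (the group minimum) per group."""
    levels = 2 ** bits - 1          # 15 for 4 bits
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in group]
    return codes, scale, lo


def dequantize_asymmetric(codes: List[int], scale: float, offset: float) -> List[float]:
    return [q * scale + offset for q in codes]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


if __name__ == "__main__":
    # A skewed group (all values positive) wastes half of the symmetric range.
    group = [0.10, 0.12, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40]

    s_codes, s_scale = quantize_symmetric(group)
    a_codes, a_scale, a_offset = quantize_asymmetric(group)

    print("symmetric error: ", max_abs_error(group, dequantize_symmetric(s_codes, s_scale)))
    print("asymmetric error:", max_abs_error(group, dequantize_asymmetric(a_codes, a_scale, a_offset)))
```

The trade-off this shows: the asymmetric mode spends extra storage on a per-group offset but uses all 16 levels on skewed data, while the symmetric mode is cheaper per group and suits roughly zero-centered weights. Smaller groups reduce error in both modes at the cost of more scales to store.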

Operating points

| File | Trade-off | Size | Compression vs BF16 | Transformer | PLE |
|---|---|---:|---:|---|---|
| gemma-4-E2B-it-qat-GGUF-M.gguf | Compact GGUF target | 2.47 GB | 4.1x | production m mapped to GGUF | GGUF Q4 |
| gemma-4-E2B-it-qat-GGUF-L.gguf | Higher-quality GGUF target | 2.68 GB | 3.8x | production l mapped to GGUF | GGUF Q4 |
| gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf | Uniform W4 baseline | 2.69 GB | 3.8x | uniform W4 GGUF | GGUF Q4 |
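The compression column can be reproduced from the file sizes and the 10.21 GB BF16 reference size listed in the Benchmarks table. The sketch below also derives an approximate average bit-width per stored weight, under the assumption (ours, not stated in the source) that the reference file is essentially all 16-bit weights and that metadata overhead is negligible.

```python
BF16_SIZE_GB = 10.21  # BF16 reference size from the Benchmarks table

FILE_SIZES_GB = {
    "M": 2.47,
    "L": 2.68,
    "W4-uniform": 2.69,
}


def compression_ratio(size_gb: float, reference_gb: float = BF16_SIZE_GB) -> float:
    """How many times smaller the file is than the BF16 reference."""
    return reference_gb / size_gb


def effective_bits_per_weight(size_gb: float,
                              reference_gb: float = BF16_SIZE_GB,
                              reference_bits: float = 16.0) -> float:
    """Rough average bits per weight, assuming the reference is pure 16-bit data."""
    return reference_bits * size_gb / reference_gb


if __name__ == "__main__":
    for name, size in FILE_SIZES_GB.items():
        print(f"{name:>10}: {compression_ratio(size):.1f}x, "
              f"~{effective_bits_per_weight(size):.2f} bits/weight")
```

Under these assumptions the M file averages a little under 4 bits per weight and the L and W4-uniform files a little over 4, which includes the overhead of scales and any tensors kept at higher precision.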

Usage

Use a recent upstream llama.cpp build. Example:

llama-completion \
  -m gemma-4-E2B-it-qat-GGUF-L.gguf \
  -p "Explain gravity in one sentence." \
  -n 64
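For scripted use, the same command can be driven from Python. This is a minimal sketch that only wraps the flags shown above (`-m`, `-p`, `-n`); the helper names are hypothetical, and it assumes a `llama-completion` binary from a recent llama.cpp build is on `PATH`.

```python
import shlex
import shutil
import subprocess
from typing import List


def build_llama_command(model_path: str, prompt: str, n_predict: int = 64,
                        binary: str = "llama-completion") -> List[str]:
    """Build the argument list for the command shown in the Usage section."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]


def run_completion(model_path: str, prompt: str, n_predict: int = 64) -> str:
    """Run the completion and return stdout; raises if the binary is missing."""
    cmd = build_llama_command(model_path, prompt, n_predict)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH; build or install llama.cpp first")
    # stdin is closed so the process cannot block waiting for interactive input.
    result = subprocess.run(cmd, stdin=subprocess.DEVNULL,
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_completion() once the model file is downloaded.
    print(shlex.join(build_llama_command("gemma-4-E2B-it-qat-GGUF-L.gguf",
                                         "Explain gravity in one sentence.")))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or newlines.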

Benchmarks

For quality evaluation, GGUF checkpoints are converted through the same dequantized BF16 evaluation
path used for the native release, so every row is scored on the same backend. IFEval p/i reports
prompt-level strict / instruction-level strict accuracy, using the corrected public recipe with
max_gen_toks=1280.

| Model | Size | Compression | MMLU-Pro | IFEval p/i |
|---|---:|---:|---:|---:|
| BF16 reference | 10.21 GB | 1.0x | 61.85 | 75.23 / 82.37 |
| GGUF M | 2.47 GB | 4.1x | 53.79 | 72.64 / 81.29 |
| GGUF L | 2.68 GB | 3.8x | 57.12 | 73.38 / 81.65 |
| GGUF W4-uniform | 2.69 GB | 3.8x | 56.91 | 74.68 / 82.61 |

MMLU-Pro is the official checkpoint-wise vLLM route with Gemma chat formatting and thinking enabled.

The .gguf files in this repo also passed generation smoke tests with upstream llama.cpp.
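To compare operating points, it can help to express each score as a difference from, and a fraction of, the BF16 reference. The sketch below only re-computes values from the table above; the dictionary keys and function names are our own.

```python
REFERENCE = "BF16 reference"

# Scores copied from the Benchmarks table.
BENCHMARKS = {
    "BF16 reference":  {"mmlu_pro": 61.85, "ifeval_prompt": 75.23, "ifeval_inst": 82.37},
    "GGUF M":          {"mmlu_pro": 53.79, "ifeval_prompt": 72.64, "ifeval_inst": 81.29},
    "GGUF L":          {"mmlu_pro": 57.12, "ifeval_prompt": 73.38, "ifeval_inst": 81.65},
    "GGUF W4-uniform": {"mmlu_pro": 56.91, "ifeval_prompt": 74.68, "ifeval_inst": 82.61},
}


def delta_vs_reference(model: str, metric: str) -> float:
    """Score difference in points relative to the BF16 reference."""
    return BENCHMARKS[model][metric] - BENCHMARKS[REFERENCE][metric]


def retention(model: str, metric: str) -> float:
    """Fraction of the BF16 reference score that the model retains."""
    return BENCHMARKS[model][metric] / BENCHMARKS[REFERENCE][metric]


if __name__ == "__main__":
    for model in BENCHMARKS:
        if model == REFERENCE:
            continue
        for metric in ("mmlu_pro", "ifeval_prompt", "ifeval_inst"):
            print(f"{model:<16} {metric:<14} "
                  f"{delta_vs_reference(model, metric):+.2f} pts, "
                  f"{retention(model, metric):.1%} retained")
```

Read this way, MMLU-Pro is the metric most sensitive to compression, while the IFEval scores stay within a few points of the reference. The source reports no confidence intervals, so sub-point differences should be treated with caution.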

Files

| File | Contents |
|---|---|
| gemma-4-E2B-it-qat-GGUF-M.gguf | Compact GGUF target |
| gemma-4-E2B-it-qat-GGUF-L.gguf | Higher-quality GGUF target |
| gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf | Uniform W4 GGUF baseline |
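After downloading, a quick sanity check is to read the fixed-size GGUF header and confirm the file is not truncated or an HTML error page. The sketch below assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts; the function and type names are hypothetical.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(path: str) -> GGUFHeader:
    """Read the first 24 bytes of a GGUF file and decode the header fields."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return GGUFHeader(version, tensor_count, kv_count)
```

For example, `read_gguf_header("gemma-4-E2B-it-qat-GGUF-M.gguf")` should return a header with a non-zero tensor count for a valid download. Inspecting per-tensor quantization types requires parsing the metadata and tensor-info sections, which is better left to llama.cpp's own tooling.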

License

Released under the MIT License.

As a derivative of Gemma, the weights are also subject to the Gemma Terms of Use.

Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on
(GPTQ, QEP, AQLM, RCO) - see the references in the edge-lm write-up.


Source: Hugging Face