TheStageAI/gemma-4-E2B-it-qat-GGUF overview
Runs locally with about 2.32 GB of disk space (4 GB VRAM-class GPUs, using llama.cpp or guIDE).
Model Details
| Model ID | TheStageAI/gemma-4-E2B-it-qat-GGUF |
|---|---|
| Author | TheStageAI |
| Pipeline | text-generation |
| License | mit |
| Base model | google/gemma-4-E2B-it-qat-q4_0-unquantized |
| Last modified | 2026-06-11T11:22:08.000Z |
Model README
---
license: mit
base_model:
- google/gemma-4-E2B-it-qat-q4_0-unquantized
base_model_relation: quantized
library_name: llama.cpp
pipeline_tag: text-generation
tags:
- gguf
- llama.cpp
- gemma
- gemma-4
- qat
- quantization
language:
- en
- multilingual
---
TheStageAI/gemma-4-E2B-it-qat-GGUF
A portable GGUF release of Google's Gemma 4 E2B instruction model, compressed from Google's
QAT-trained BF16 weights and emitted as standard llama.cpp-compatible .gguf files.
- Run it with: llama.cpp or other GGUF-compatible runtimes
- Compression source: google/gemma-4-E2B-it-qat-q4_0-unquantized
- BF16 reference: google/gemma-4-E2B-it
- Smaller native release: TheStageAI/gemma-4-E2B-it-qat
Use this repo when deployment portability matters most. If you can run our native MLX runtime and
want the smallest artifacts, use the edge-lm sibling release.
Why this exists
The native edge-lm checkpoints use custom codecs for both decoder weights and PLE tables, which is
why they are smaller at comparable quality. Many deployments, however, need standard GGUF files that
work with llama.cpp-compatible tooling.
This repo keeps the production bit-width schedules from our native compression pipeline, but maps the
weights into GGUF-compatible quantization formats. The result is larger than the native release, but
portable.
How it was compressed
We start from Google's QAT-trained BF16 checkpoint and reuse the production m and l schedules from
the native release.
- Transformer blocks - the M and L files follow our RCO-selected production bit-width
schedules, with the weights emitted in GGUF-compatible K-quant layouts using the required group
sizes and symmetric/asymmetric modes for each tensor family.
- PLE tables - stored with GGUF-compatible Q4 scalar quantization instead of the native AQLM PLE
codec, so the files stay portable across GGUF runtimes.
- Token embeddings / LM head - quantized through the same GGUF-compatible path as the rest of the
model.
- W4-uniform - a conservative uniform 4-bit GGUF variant with the same Q4 PLE path.
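The list above mentions group sizes and symmetric/asymmetric modes. As a rough illustration of what those terms mean, here is a minimal sketch of generic round-to-nearest 4-bit group quantization in plain Python. It is not the pipeline used for this release and does not reproduce the actual GGUF K-quant block layouts. All function names are hypothetical.

```python
from typing import List, Tuple


def split_groups(weights: List[float], group_size: int) -> List[List[float]]:
    """Split a flat weight list into consecutive groups of `group_size`."""
    return [weights[i:i + group_size] for i in range(0, len(weights), group_size)]


def quantize_symmetric(group: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric mode: one scale per group, signed codes centred on zero."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    amax = max(abs(x) for x in group)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in group]
    return codes, scale


def dequantize_symmetric(codes: List[int], scale: float) -> List[float]:
    return [q * scale for q in codes]


def quantize_asymmetric(group: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Asymmetric mode: one scale and one offset (the group minimum) per group."""
    levels = 2 ** bits - 1  # 15 for 4-bit
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [max(0, min(levels, round((x - lo) / scale))) for x in group]
    return codes, scale, lo


def dequantize_asymmetric(codes: List[int], scale: float, lo: float) -> List[float]:
    return [q * scale + lo for q in codes]
```

Symmetric mode stores only a scale per group, while asymmetric mode also stores an offset, which tends to help when the values in a group are not centred on zero. Smaller groups reduce reconstruction error at the cost of more per-group metadata.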
Operating points
| File | Trade-off | Size | Compression vs BF16 | Transformer | PLE |
|---|---|---:|---:|---|---|
| gemma-4-E2B-it-qat-GGUF-M.gguf | Compact GGUF target | 2.47 GB | 4.1x | production m mapped to GGUF | GGUF Q4 |
| gemma-4-E2B-it-qat-GGUF-L.gguf | Higher-quality GGUF target | 2.68 GB | 3.8x | production l mapped to GGUF | GGUF Q4 |
| gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf | Uniform W4 baseline | 2.69 GB | 3.8x | uniform W4 GGUF | GGUF Q4 |
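The compression column follows directly from the file sizes. The sketch below recomputes it and filters the files by a size budget. The sizes come from the table above and the 10.21 GB BF16 reference size from the Benchmarks table. The helper names are hypothetical, and the budget covers file size only, not the extra runtime memory a llama.cpp session needs for context.

```python
from typing import Dict, List

BF16_REFERENCE_GB = 10.21  # BF16 reference size listed in the Benchmarks table

# File sizes in GB, copied from the operating-points table.
VARIANT_SIZES_GB: Dict[str, float] = {
    "gemma-4-E2B-it-qat-GGUF-M.gguf": 2.47,
    "gemma-4-E2B-it-qat-GGUF-L.gguf": 2.68,
    "gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf": 2.69,
}


def compression_ratio(size_gb: float, reference_gb: float = BF16_REFERENCE_GB) -> float:
    """Return how many times smaller a file is than the BF16 reference."""
    return reference_gb / size_gb


def variants_within(budget_gb: float) -> List[str]:
    """Return the files that fit the size budget, smallest first."""
    fitting = [(size, name) for name, size in VARIANT_SIZES_GB.items() if size <= budget_gb]
    return [name for _, name in sorted(fitting)]


if __name__ == "__main__":
    for name, size in VARIANT_SIZES_GB.items():
        print(f"{name}: {compression_ratio(size):.1f}x")
```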
Usage
Use a recent upstream llama.cpp build. Example:
llama-completion \
-m gemma-4-E2B-it-qat-GGUF-L.gguf \
-p "Explain gravity in one sentence." \
-n 64
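The same invocation can be scripted. This is a minimal sketch that builds the command shown above and runs it only if a `llama-completion` binary is found on `PATH`. The function names and the timeout value are assumptions, and only the `-m`, `-p`, and `-n` flags from the example above are used.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(model_path: str, prompt: str, n_predict: int = 64,
                  binary: str = "llama-completion") -> List[str]:
    """Build the argument list for the CLI call shown in the usage example."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]


def run_completion(model_path: str, prompt: str, n_predict: int = 64,
                   timeout_s: float = 300.0) -> Optional[str]:
    """Run the command and return stdout, or None if the binary is not installed."""
    cmd = build_command(model_path, prompt, n_predict)
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True,
                            check=True, timeout=timeout_s)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or spaces.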
Benchmarks
For quality evaluation, the GGUF checkpoints are dequantized to BF16 and run through the same
evaluation path as the native release, so every row uses the same backend. IFEval p/i reports
prompt-level strict / instruction-level strict accuracy, using the corrected public recipe with
max_gen_toks=1280.
| Model | Size | Compression | MMLU-Pro | IFEval p/i |
|---|---:|---:|---:|---:|
| BF16 reference | 10.21 GB | 1.0x | 61.85 | 75.23 / 82.37 |
| GGUF M | 2.47 GB | 4.1x | 53.79 | 72.64 / 81.29 |
| GGUF L | 2.68 GB | 3.8x | 57.12 | 73.38 / 81.65 |
| GGUF W4-uniform | 2.69 GB | 3.8x | 56.91 | 74.68 / 82.61 |
MMLU-Pro scores use the official checkpoint-wise vLLM route with Gemma chat formatting and thinking enabled.
The .gguf files in this repo also passed generation smoke tests with upstream llama.cpp.
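To compare the rows more easily, the sketch below computes each variant's point difference from the BF16 reference. The scores are copied from the table above. The helper name and dictionary layout are hypothetical. A positive value means the variant scores lower than the reference.

```python
from typing import Dict

# (MMLU-Pro, IFEval prompt strict, IFEval instruction strict), from the table.
SCORES: Dict[str, tuple] = {
    "BF16 reference": (61.85, 75.23, 82.37),
    "GGUF M": (53.79, 72.64, 81.29),
    "GGUF L": (57.12, 73.38, 81.65),
    "GGUF W4-uniform": (56.91, 74.68, 82.61),
}
METRICS = ("MMLU-Pro", "IFEval prompt", "IFEval instruction")


def drop_vs_reference(model: str, reference: str = "BF16 reference") -> Dict[str, float]:
    """Return reference score minus model score, in points, per metric."""
    ref, cur = SCORES[reference], SCORES[model]
    return {name: round(r - c, 2) for name, r, c in zip(METRICS, ref, cur)}


if __name__ == "__main__":
    for model in SCORES:
        if model != "BF16 reference":
            print(model, drop_vs_reference(model))
```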
Files
| File | Contents |
|---|---|
| gemma-4-E2B-it-qat-GGUF-M.gguf | Compact GGUF target |
| gemma-4-E2B-it-qat-GGUF-L.gguf | Higher-quality GGUF target |
| gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf | Uniform W4 GGUF baseline |
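One way to fetch a single file rather than the whole repository is `hf_hub_download` from the `huggingface_hub` package. This sketch assumes that package is installed and that the repository is publicly accessible. The variant keys and helper names are illustrative only.

```python
REPO_ID = "TheStageAI/gemma-4-E2B-it-qat-GGUF"

# Short variant keys (hypothetical) mapped to the file names listed above.
FILES = {
    "M": "gemma-4-E2B-it-qat-GGUF-M.gguf",
    "L": "gemma-4-E2B-it-qat-GGUF-L.gguf",
    "W4-uniform": "gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf",
}


def filename_for(variant: str) -> str:
    """Map a variant key to its .gguf file name."""
    try:
        return FILES[variant]
    except KeyError:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(FILES)}")


def download(variant: str = "L") -> str:
    """Download one file into the local Hugging Face cache and return its path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=filename_for(variant))
```

The returned path can be passed straight to the `-m` flag of a llama.cpp binary.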
License
Released under the MIT License.
As a derivative of Gemma, the weights are also subject to the
Citation
If you use these checkpoints, please cite the Gemma 4 release and the methods we build on
(GPTQ, QEP, AQLM, RCO) - see the references in the
edge-lm write-up.
Source: Hugging Face