LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF overview
Gemma 4 31B it QAT MX46A S GGUF MX46A S sensitivity tiered adaptive NVFP4/MXFP6 mix, ~4.4 bpw average quantization of Google's Gemma 4 31B It QAT https://huggi…
Runs locally from ~910.6 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
Model Details
| Model ID | LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF |
|---|---|
| Author | LordAce9 |
| Pipeline | text-generation |
| License | gemma |
| Base model | google/gemma-4-31B-it-qat-q4_0-unquantized |
| Last modified | 2026-06-11T05:41:14.000Z |
Model README
---
license: gemma
language:
- en
library_name: gguf
base_model: google/gemma-4-31B-it-qat-q4_0-unquantized
pipeline_tag: text-generation
inference: false
quantized_by: LordAce9
tags:
- gguf
- gemma
- gemma-4
- 31b
- qat
- mx46as
- mx46a
- nvfp4
- mxfp6
- blackwell
- fp4
- multimodal
- vision
- speculative-decoding
- mtp
---
Gemma-4-31B-it-QAT-MX46A_S-GGUF
MX46A_S (sensitivity-tiered adaptive NVFP4/MXFP6 mix, ~4.4 bpw average) quantization of Google's
Gemma 4 31B It QAT — the
half-precision checkpoint extracted from Google's quantization-aware-training pipeline, which makes
it markedly more robust at 4-bit precision than post-training quantization of the standard release.
> ## ⚠️ Requires a custom llama.cpp fork
> The MX46A / MX46AS tensor types (GGML types 43/44) do not exist in mainline llama.cpp.
> This GGUF will not load in stock llama.cpp, LM Studio, Ollama, or Jan.
> Build the fork here: https://github.com/AcerThyRacer/llama.cpp/tree/mxfp6-adaptive (commit b0e5c2a24), with
> -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 for RTX 50-series.
Files
| File | Size | Purpose |
|---|---|---|
| gemma-4-31B-it-qat-MX46A_S.gguf | 17.8 GB | Main model, MX46A_S tiered quantization |
| gemma-4-31B-it-mmproj.gguf | 1.2 GB | Vision projector (F16, from Google's official GGUF release) — required for image input |
| gemma-4-31B-it-F16-MTP.gguf | 955 MB | MTP head for --spec-type draft-mtp speculative decoding (optional, big decode speedup) |
What even is MX46A_S?
MX46A_S is a per-tensor-category mix built from two custom superblock formats plus established ones,
assigning bits where the signal actually lives:
| Tensor category | Type | Why |
|---|---|---|
| attn_v, attn_output | MX46A (5.0 bpw) | Highest quantization sensitivity; 256-weight superblocks where the chunks with the largest imatrix-weighted error get promoted from FP4 (E2M1) to MXFP6 (E2M3) |
| ffn_down (outer layers) | MX46A | First/last layers carry disproportionate signal |
| ffn_gate, ffn_up | IQ4_XS (4.25 bpw, imatrix) | Bulk of parameters, lowest sensitivity |
| everything else (attn_q/k, inner ffn_down) | MX46AS (4.625 bpw) | Lighter sibling: MXFP4-style E8M0 base, exactly 1-of-8 chunks promoted to FP6 |
| output, token_embd | Q6_K | Standard practice for fragile tensors |
All FP4/FP6 element formats and E8M0/UE4M3 scale formats follow the OCP Microscaling (MX) and
NVIDIA NVFP4 conventions and decode through native CUDA kernels (MMVQ vec-dot for generation,
MMQ with Blackwell-specific SRAM layouts for prefill on supported types).
Quantization provenance
- Source:
google/gemma-4-31B-it-qat-q4_0-unquantized(BF16 QAT checkpoint), streamed remotely
and converted to Q8_0 GGUF (convert_hf_to_gguf.py --remote --outtype q8_0).
The Q8_0 intermediate hop is effectively transparent (per-element RMSE ≈ 7e-5) and was required
because a 62 GB BF16 intermediate did not fit the build machine's disk.
- Importance matrix: 120 chunks × 512 tokens of WikiText-2 train, computed on the Q8_0
model (llama-imatrix). The imatrix guides both MX46A chunk promotion and the IQ4_XS ffn tiers.
- Quantized with
llama-quantize ... MX46A_Sat fork commitb0e5c2a24. - Build/eval hardware: RTX 5080 16 GB (Blackwell, CC 12.0), CUDA 13.3, driver 595.80.
Measured quality & speed
> Why KL-divergence and not wikitext perplexity? This model is a thinking-instruct QAT
> artifact: raw untemplated text is far outside its post-training distribution, and its raw
> wikitext perplexity lands in the thousands — including Google's own official Q4_0 GGUF
> (verified on 2026-06-10; the model is healthy: templated long-context comprehension is
> flawless). Do not panic if you measure it yourself. Quantization fidelity is instead
> measured as KL-divergence of this quant's token distributions against the Q8_0 reference
> on identical inputs.
All numbers measured on RTX 5080 16 GB (partial offload -ngl 30, FA on, fork commit 47fa11b36+fixes),
same Q8_0 reference, same text, same machine:
| Metric | This release (MX46A_S) | UD-Q4_K_XL (pure Q4_0) |
|---|---|---|
| File size / bpw | 17.8 GB / 4.64 | 17.3 GB / 4.50 |
| Mean KLD vs Q8_0 | 0.142 (median 0.069) | 0.011 |
| Mean Δp on reference tokens | −0.54 % | — |
| pp512 t/s | 486 | 515 |
| tg64 t/s | 2.15 | 3.30 |
Honest positioning — read this before using
This checkpoint was QAT-trained against Q4_0's exact quantization grid: its weights sit on
Q4_0 lattice points by construction, so plain Q4_0 quantizes it near-losslessly (KLD 0.011) and
any other 4-bit grid — including this one — moves weights off their QAT-optimal positions.
For everyday use of this particular model on this class of hardware, Google's official Q4_0
(or a UD-Q4_K_XL) is the better choice: smaller, more faithful, and faster under partial offload,
where decode is bound by CPU-resident layers and Q4_0's mature AVX2 kernels win.
What this release is for instead:
- Format research: a complete, working, end-to-end MX46A/MX46AS/NVFP4/MXFP6 pipeline on a
real 31B model — reference quality numbers for adaptive FP4/FP6 superblock formats included.
- Full-GPU Blackwell scenarios: the native FP4/FP6 MMQ/MMVQ paths pay off when all layers are
GPU-resident (24 GB+ cards, or the smaller Gemma-4 variants); on 16 GB with a 31B they are
masked by the CPU-side bottleneck.
- Non-QAT models: against ordinary BF16 checkpoints the imatrix-guided FP6 promotion competes
on fidelity-per-bit; against a Q4_0-QAT checkpoint nothing beats Q4_0 — by design.
VRAM note: a 31B dense model at ~4.4 bpw is ~17 GB of weights — on a 16 GB card expect
partial offload (≈30/61 layers on an RTX 5080 alongside a desktop session). Full-GPU
residency on 16 GB requires the smaller Gemma-4 variants or a lower-bpw mix.
Run it
# text + vision
./build/bin/llama-server \
-m gemma-4-31B-it-qat-MX46A_S.gguf \
--mmproj gemma-4-31B-it-mmproj.gguf \
-ngl 30 -fa on -c 8192 -ctk q8_0 -ctv q8_0 --port 8080
# with MTP speculative decoding (recommended; the head is trained with the model)
./build/bin/llama-server \
-m gemma-4-31B-it-qat-MX46A_S.gguf \
--spec-type draft-mtp \
--spec-draft-model gemma-4-31B-it-F16-MTP.gguf \
-ngl 30 -fa on -c 8192 -ctk q8_0 -ctv q8_0 --port 8080
Known limitations
- Fork-only format — see the warning above. Track upstreaming status at https://github.com/AcerThyRacer/llama.cpp/tree/mxfp6-adaptive.
- The MXFP6 direct-FP8 activation decode path (
GGML_CUDA_MXFP6_FP8_ACT=1) trades ~2× matmul NMSE
for speed and is off by default; this release's tier mix does not use standalone MXFP6 tensors.
- imatrix calibration is English-centric (WikiText-2); multilingual quality may shift slightly.
- Vision projector is kept at F16 by design; quantizing it is not supported in this release.
License
Subject to the Gemma Terms of Use. This repository
redistributes a quantized derivative of a Google Gemma model; you must comply with the Gemma
license, including its use restrictions, when using these files.
Run LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models