What license applies to LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF?

License: gemma. Verify terms on Hugging Face before commercial use.

How do I run LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF locally?

Download a GGUF file from this page and load it in guIDE or llama.cpp. Pipeline task: text-generation.

Model Intelligence Sheet

LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF overview

Gemma 4 31B it QAT MX46A S GGUF MX46A S sensitivity tiered adaptive NVFP4/MXFP6 mix, ~4.4 bpw average quantization of Google's Gemma 4 31B It QAT https://huggi…

ggufgemmagemma-431bqatmx46asmx46anvfp4mxfp6blackwellfp4multimodalvisionspeculative-decodingmtptext-generationenbase_model:google/gemma-4-31B-it-qat-q4_0-unquantizedbase_model:quantized:google/gemma-4-31B-it-qat-q4_0-unquantizedlicense:gemmaregion:usimatrix

Runs locally from ~910.6 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads

Likes

Pipeline

text-generation

Author

LordAce9

Repository Files & Downloads

3 GGUF files detected

Direct downloads for local inference

File	Type	Quantization	Size	Link
gemma-4-31B-it-F16-MTP.gguf	GGUF	F16	910.6 MB	Download
gemma-4-31B-it-mmproj.gguf	GGUF	GGUF	1.12 GB	Download
gemma-4-31B-it-qat-MX46A_S.gguf	GGUF	GGUF	16.59 GB	Download

Model Details

Model ID	LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF
Author	LordAce9
Pipeline	text-generation
License	gemma
Base model	google/gemma-4-31B-it-qat-q4_0-unquantized
Last modified	2026-06-11T05:41:14.000Z

Model README

---

license: gemma

language:

library_name: gguf

base_model: google/gemma-4-31B-it-qat-q4_0-unquantized

pipeline_tag: text-generation

inference: false

quantized_by: LordAce9

tags:

gguf
gemma
gemma-4
31b
qat
mx46as
mx46a
nvfp4
mxfp6
blackwell
fp4
multimodal
vision
speculative-decoding
mtp

---

Gemma-4-31B-it-QAT-MX46A_S-GGUF

MX46A_S (sensitivity-tiered adaptive NVFP4/MXFP6 mix, ~4.4 bpw average) quantization of Google's

Gemma 4 31B It QAT — the

half-precision checkpoint extracted from Google's quantization-aware-training pipeline, which makes

it markedly more robust at 4-bit precision than post-training quantization of the standard release.

> ## ⚠️ Requires a custom llama.cpp fork

> The MX46A / MX46AS tensor types (GGML types 43/44) do not exist in mainline llama.cpp.

> This GGUF will not load in stock llama.cpp, LM Studio, Ollama, or Jan.

> Build the fork here: https://github.com/AcerThyRacer/llama.cpp/tree/mxfp6-adaptive (commit b0e5c2a24), with

> -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 for RTX 50-series.

Files

| File | Size | Purpose |

|---|---|---|

| gemma-4-31B-it-qat-MX46A_S.gguf | 17.8 GB | Main model, MX46A_S tiered quantization |

| gemma-4-31B-it-mmproj.gguf | 1.2 GB | Vision projector (F16, from Google's official GGUF release) — required for image input |

| gemma-4-31B-it-F16-MTP.gguf | 955 MB | MTP head for --spec-type draft-mtp speculative decoding (optional, big decode speedup) |

What even is MX46A_S?

MX46A_S is a per-tensor-category mix built from two custom superblock formats plus established ones,

assigning bits where the signal actually lives:

| Tensor category | Type | Why |

|---|---|---|

| attn_v, attn_output | MX46A (5.0 bpw) | Highest quantization sensitivity; 256-weight superblocks where the chunks with the largest imatrix-weighted error get promoted from FP4 (E2M1) to MXFP6 (E2M3) |

| ffn_down (outer layers) | MX46A | First/last layers carry disproportionate signal |

| ffn_gate, ffn_up | IQ4_XS (4.25 bpw, imatrix) | Bulk of parameters, lowest sensitivity |

| everything else (attn_q/k, inner ffn_down) | MX46AS (4.625 bpw) | Lighter sibling: MXFP4-style E8M0 base, exactly 1-of-8 chunks promoted to FP6 |

| output, token_embd | Q6_K | Standard practice for fragile tensors |

All FP4/FP6 element formats and E8M0/UE4M3 scale formats follow the OCP Microscaling (MX) and

NVIDIA NVFP4 conventions and decode through native CUDA kernels (MMVQ vec-dot for generation,

MMQ with Blackwell-specific SRAM layouts for prefill on supported types).

Quantization provenance

Source: google/gemma-4-31B-it-qat-q4_0-unquantized (BF16 QAT checkpoint), streamed remotely

and converted to Q8_0 GGUF (convert_hf_to_gguf.py --remote --outtype q8_0).

The Q8_0 intermediate hop is effectively transparent (per-element RMSE ≈ 7e-5) and was required

because a 62 GB BF16 intermediate did not fit the build machine's disk.

Importance matrix: 120 chunks × 512 tokens of WikiText-2 train, computed on the Q8_0

model (llama-imatrix). The imatrix guides both MX46A chunk promotion and the IQ4_XS ffn tiers.

Quantized with llama-quantize ... MX46A_S at fork commit b0e5c2a24.
Build/eval hardware: RTX 5080 16 GB (Blackwell, CC 12.0), CUDA 13.3, driver 595.80.

Measured quality & speed

> Why KL-divergence and not wikitext perplexity? This model is a thinking-instruct QAT

> artifact: raw untemplated text is far outside its post-training distribution, and its raw

> wikitext perplexity lands in the thousands — including Google's own official Q4_0 GGUF

> (verified on 2026-06-10; the model is healthy: templated long-context comprehension is

> flawless). Do not panic if you measure it yourself. Quantization fidelity is instead

> measured as KL-divergence of this quant's token distributions against the Q8_0 reference

> on identical inputs.

All numbers measured on RTX 5080 16 GB (partial offload -ngl 30, FA on, fork commit 47fa11b36+fixes),

same Q8_0 reference, same text, same machine:

| Metric | This release (MX46A_S) | UD-Q4_K_XL (pure Q4_0) |

|---|---|---|

| File size / bpw | 17.8 GB / 4.64 | 17.3 GB / 4.50 |

| Mean KLD vs Q8_0 | 0.142 (median 0.069) | 0.011 |

| Mean Δp on reference tokens | −0.54 % | — |

| pp512 t/s | 486 | 515 |

| tg64 t/s | 2.15 | 3.30 |

Honest positioning — read this before using

This checkpoint was QAT-trained against Q4_0's exact quantization grid: its weights sit on

Q4_0 lattice points by construction, so plain Q4_0 quantizes it near-losslessly (KLD 0.011) and

any other 4-bit grid — including this one — moves weights off their QAT-optimal positions.

For everyday use of this particular model on this class of hardware, Google's official Q4_0

(or a UD-Q4_K_XL) is the better choice: smaller, more faithful, and faster under partial offload,

where decode is bound by CPU-resident layers and Q4_0's mature AVX2 kernels win.

What this release is for instead:

Format research: a complete, working, end-to-end MX46A/MX46AS/NVFP4/MXFP6 pipeline on a

real 31B model — reference quality numbers for adaptive FP4/FP6 superblock formats included.

Full-GPU Blackwell scenarios: the native FP4/FP6 MMQ/MMVQ paths pay off when all layers are

GPU-resident (24 GB+ cards, or the smaller Gemma-4 variants); on 16 GB with a 31B they are

masked by the CPU-side bottleneck.

Non-QAT models: against ordinary BF16 checkpoints the imatrix-guided FP6 promotion competes

on fidelity-per-bit; against a Q4_0-QAT checkpoint nothing beats Q4_0 — by design.

VRAM note: a 31B dense model at ~4.4 bpw is ~17 GB of weights — on a 16 GB card expect

partial offload (≈30/61 layers on an RTX 5080 alongside a desktop session). Full-GPU

residency on 16 GB requires the smaller Gemma-4 variants or a lower-bpw mix.

Run it

# text + vision
./build/bin/llama-server \
  -m  gemma-4-31B-it-qat-MX46A_S.gguf \
  --mmproj gemma-4-31B-it-mmproj.gguf \
  -ngl 30 -fa on -c 8192 -ctk q8_0 -ctv q8_0 --port 8080

# with MTP speculative decoding (recommended; the head is trained with the model)
./build/bin/llama-server \
  -m  gemma-4-31B-it-qat-MX46A_S.gguf \
  --spec-type draft-mtp \
  --spec-draft-model gemma-4-31B-it-F16-MTP.gguf \
  -ngl 30 -fa on -c 8192 -ctk q8_0 -ctv q8_0 --port 8080

Known limitations

Fork-only format — see the warning above. Track upstreaming status at https://github.com/AcerThyRacer/llama.cpp/tree/mxfp6-adaptive.
The MXFP6 direct-FP8 activation decode path (GGML_CUDA_MXFP6_FP8_ACT=1) trades ~2× matmul NMSE

for speed and is off by default; this release's tier mix does not use standalone MXFP6 tensors.

imatrix calibration is English-centric (WikiText-2); multilingual quality may shift slightly.
Vision projector is kept at F16 by design; quantizing it is not supported in this release.

License

Subject to the Gemma Terms of Use. This repository

redistributes a quantized derivative of a Google Gemma model; you must comply with the Gemma

license, including its use restrictions, when using these files.

Run LordAce9/Gemma-4-31B-it-QAT-MX46A_S-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models