superbonyx/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf overview
gemma 4 26B A4B it qat — MTP draft head Q8 0 GGUF A Q8 0 GGUF of the Gemma 4 26B A4B QAT Multi Token Prediction "assistant" head , for use as a self speculativ…
Runs locally from ~440.4 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf | GGUF | Q8_0 | 440.4 MB | Download |
Model Details
| Model ID | superbonyx/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf |
|---|---|
| Author | superbonyx |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant |
| Last modified | 2026-06-08T02:05:17.000Z |
Model README
---
license: apache-2.0
base_model:
- google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- gguf
- llama.cpp
- speculative-decoding
- mtp
- gemma4-assistant
- draft-model
- gemma
---
gemma-4-26B-A4B-it-qat — MTP draft head (Q8_0 GGUF)
A Q8_0 GGUF of the Gemma 4 26B A4B (QAT) Multi-Token-Prediction "assistant" head, for use as a
self-speculative draft model in llama.cpp. Pairing this head with the full model lets the target
model draft several tokens per step and verify them in a single forward pass — accelerating
token generation with no change to output (speculative decoding is lossless).
This is not a standalone chat model. It only contains the MTP/nextn head and is meaningless on
its own — load it as the --spec-draft-model alongside the full Gemma 4 26B A4B QAT model.
Benchmark
Measured against the full target model unsloth/gemma-4-26B-A4B-it-qat-GGUF → UD-Q4_K_XL, with the
exact llama-server configuration shown under Usage below, same build and same prompt, only the MTP
head toggled.
| Config | Decode (mean) | Prefill (mean) | VRAM | Draft acceptance |
|---|---|---|---|---|
| Target only (no MTP) | 199.8 tok/s | 10,568 tok/s | 23.5 GB | — |
| Target + this MTP head | 282.3 tok/s | 10,236 tok/s | 25.3 GB | 78–92% |
| Δ | +41% | ~flat | +1.8 GB | |
<sub>Setup: single RTX 5090; llama-server v129; the full config under Usage; native /completion,
fixed 3,000-token prompt, n_predict 512, ignore_eos, cache_prompt:false, 3 reps + warmup.</sub>
Speculative-decoding speedup is content-dependent (it scales with draft acceptance), so real-world
gains vary — roughly +40–55% decode on varied prose here, higher on predictable/boilerplate output.
Prefill is unaffected; MTP only accelerates generation. Output is identical to running without the head.
Requirements
- A
llama.cppbuild with Gemma 4 MTP support — PR #23398
/ commit 04eb4c4 or later (tested on build 9e3b928, llama-server version 129).
- A full Gemma 4 26B A4B QAT GGUF to serve as the target model — e.g.
unsloth/gemma-4-26B-A4B-it-qat-GGUF
(the UD-Q4_K_XL quant was used here). This head is only a draft for that model and does nothing on its own.
Usage
Use it as the speculative draft for the full model — the Unsloth QAT GGUF this head was tested
against, unsloth/gemma-4-26B-A4B-it-qat-GGUF
(file gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf):
hf download unsloth/gemma-4-26B-A4B-it-qat-GGUF gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --local-dir .
hf download superbonyx/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf --local-dir .
Minimal (the bits that enable MTP)
llama-server \
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
--spec-type draft-mtp \
--spec-draft-model gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf \
--spec-draft-n-max 3 \
--draft-p-min 0.75 \
-ngl 99 -fa on
--spec-type draft-mtptellsllama.cppto treat the head as an MTP draft (required).--spec-draft-n-max 3— draft tokens per step.--draft-p-min 0.75— only accept a drafted token when the head's probability ≥ 0.75 (greedy gate).
> Note: the older --mtp-model / --mtp-draft-n flag names do not exist; MTP is wired through the
> speculative-decoding subsystem (--spec-type / --spec-draft-*).
Full configuration (the exact setup benchmarked above)
This is the exact llama-server invocation (and environment) that produced the Benchmark numbers
above — included so the results are reproducible. Several flags are specific to the test box and
can be changed or dropped (see the notes below the block).
# pin to a single GPU (box-specific: index of the target GPU in PCI order)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1
llama-server \
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
--mmproj mmproj-F16.gguf \
--chat-template-file gemma-4-chat_template.jinja \
-a gemma4 \
--jinja \
--reasoning on \
--reasoning-budget 32000 \
-c 262144 \
-ngl 99 \
-fa on \
-ctk q8_0 -ctv q8_0 \
--ctx-checkpoints 0 \
--spec-type draft-mtp \
--spec-draft-model gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf \
--spec-draft-n-max 3 \
--draft-p-min 0.75 \
-sm none -mg 0 \
--temp 0.9 --top-k 20 --min-p 0.1 --top-p 1.0 --repeat-penalty 1.0 \
--parallel 1 \
--host 127.0.0.1 --port 8082
What each part does:
- MTP head (the only flags that enable speculative decoding):
--spec-type draft-mtp,
--spec-draft-model …, --spec-draft-n-max 3, --draft-p-min 0.75.
- Quality/behaviour:
--jinja+--chat-template-file(official Gemma 4 template; optional if the
GGUF's embedded one works), --reasoning on --reasoning-budget 32000 (thinking), and the
--temp/--top-k/--min-p/--top-p/--repeat-penalty defaults (clients can override per request).
- Capacity:
-c 262144(256K context) is affordable only because of-ctk q8_0 -ctv q8_0
(q8 KV cache) and Gemma's sliding-window attention — lower -c if you're VRAM-limited.
- Vision:
--mmproj mmproj-F16.ggufenables image input (Gemma 4 is a VLM); drop it for text-only. - Box-specific (change to match your machine):
CUDA_DEVICE_ORDER/CUDA_VISIBLE_DEVICESand
-sm none -mg 0 select/pin one GPU; --host/--port are local.
- Build workaround:
--ctx-checkpoints 0dodges an SWA-checkpoint crash seen on this build; omit it
on patched builds.
Conversion
# from the source bf16 safetensors, with current-master llama.cpp
python convert_hf_to_gguf.py google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant \
--outfile gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf --outtype q8_0
Why this re-conversion exists
The Gemma 4 MTP head landed in llama.cpp via PR #23398,
which registers the architecture as gemma4-assistant (hyphen) and expects nextn_predict_layers
plus nextn.pre_projection / nextn.post_projection tensors.
Pre-merge community GGUFs were tagged gemma4_assistant (underscore) with a different tensor/hparam
schema, and they fail to load on current builds with:
error loading model: unknown model architecture: 'gemma4_assistant'
This file was re-converted from source with current-master convert_hf_to_gguf.py (see
Conversion above), so it has the correct gemma4-assistant arch and nextn_* tensors
and loads on up-to-date llama.cpp.
| | value |
|---|---|
| Architecture | gemma4-assistant |
| Quant | Q8_0 (general.file_type = 7) |
| Tensors | 49 (23× Q8_0 weights, 26× F32 norms/scales) |
| Size | ~462 MB |
| Source | google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant (bf16 safetensors) |
License & attribution
Derived from Google's gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
(Apache-2.0). As a derivative of Gemma 4, use is also subject to Google's
Gemma Terms of Use. Only the quantization (bf16 → Q8_0) and GGUF
packaging were performed here; no weights were retrained or modified.
Run superbonyx/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models