GraySoft
Model Intelligence Sheet

superbonyx/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf overview


Tags: gguf, llama.cpp, speculative-decoding, mtp, gemma4-assistant, draft-model, gemma, text-generation, base_model:google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant, base_model:quantized:google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant, license:apache-2.0, endpoints_compatible, region:us, conversational

Runs locally from ~440.4 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads: 0 · Likes: 0 · Pipeline: text-generation

Repository Files & Downloads

1 GGUF file detected
Direct downloads for local inference

| File | Type | Quantization | Size |
|---|---|---|---|
| gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf | GGUF | Q8_0 | 440.4 MB |

Model Details

Model ID: superbonyx/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf
Author: superbonyx
Pipeline: text-generation
License: apache-2.0
Base model: google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
Last modified: 2026-06-08T02:05:17.000Z

Model README

---
license: apache-2.0
base_model:
  - google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
base_model_relation: quantized
pipeline_tag: text-generation
tags:
  - gguf
  - llama.cpp
  - speculative-decoding
  - mtp
  - gemma4-assistant
  - draft-model
  - gemma
---

gemma-4-26B-A4B-it-qat — MTP draft head (Q8_0 GGUF)

A Q8_0 GGUF of the Gemma 4 26B A4B (QAT) Multi-Token-Prediction "assistant" head, for use as a self-speculative draft model in llama.cpp. Paired with the full model, the head drafts several tokens per step and the target model verifies them in a single forward pass, accelerating token generation with no change to output (speculative decoding is lossless).

This is not a standalone chat model. It contains only the MTP/nextn head and is meaningless on its own: load it as the --spec-draft-model alongside the full Gemma 4 26B A4B QAT model.

Benchmark

Measured against the full target model unsloth/gemma-4-26B-A4B-it-qat-GGUF (UD-Q4_K_XL quant), with the exact llama-server configuration shown under Usage below. Build and prompt were the same across runs; only the MTP head was toggled.

| Config | Decode (mean) | Prefill (mean) | VRAM | Draft acceptance |
|---|---|---|---|---|
| Target only (no MTP) | 199.8 tok/s | 10,568 tok/s | 23.5 GB | n/a |
| Target + this MTP head | 282.3 tok/s | 10,236 tok/s | 25.3 GB | 78–92% |
| Δ | +41% | ~flat | +1.8 GB | |

<sub>Setup: single RTX 5090; llama-server v129; the full config under Usage; native /completion, fixed 3,000-token prompt, n_predict 512, ignore_eos, cache_prompt:false, 3 reps + warmup.</sub>
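The Δ row can be re-derived from the two measured rows. The following is a small sketch using only the numbers in the table; `pct_change` is a hypothetical helper name introduced here for illustration, not something from the source.

```shell
# pct_change BASE NEW: signed percent change from BASE to NEW, one decimal place.
pct_change() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%+.1f\n", (cur / base - 1) * 100 }'
}

pct_change 199.8 282.3   # decode tok/s: target only vs target + MTP head
pct_change 10568 10236   # prefill tok/s: target only vs target + MTP head
awk 'BEGIN { printf "%+.1f GB\n", 25.3 - 23.5 }'   # VRAM difference
```

This prints +41.3, -3.1, and +1.8 GB, which matches the rounded +41%, "~flat", and +1.8 GB entries.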

Speculative-decoding speedup is content-dependent (it scales with draft acceptance), so real-world gains vary: roughly +40–55% decode on varied prose here, higher on predictable/boilerplate output.

Prefill is essentially unaffected (about 3% lower in this run); MTP only accelerates generation. Output is identical to running without the head.
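To make the link between acceptance and speedup concrete, here is the usual back-of-the-envelope estimate for speculative decoding. It rests on two assumptions that the source does not state: that the reported "draft acceptance" is a per-token rate $\alpha$, and that each drafted token is accepted independently. With a draft length of $n$ (here $n = 3$, from --spec-draft-n-max 3), a verification pass emits the accepted prefix plus one token from the target model, so

$$
\mathbb{E}[\text{tokens per step}] \;=\; \sum_{k=0}^{n} \alpha^{k} \;=\; \frac{1 - \alpha^{\,n+1}}{1 - \alpha}.
$$

For $n = 3$ this gives about 2.86 tokens per step at $\alpha = 0.78$ and about 3.55 at $\alpha = 0.92$. The measured decode gain (+41%) is well below those ratios, which is plausible because each step also pays for running the head and verifying the drafted positions, and because the --draft-p-min gate can end a draft before it reaches $n$ tokens.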

Requirements

  • A llama.cpp build with Gemma 4 MTP support: PR #23398 / commit 04eb4c4 or later (tested on build 9e3b928, llama-server version 129).
  • A full Gemma 4 26B A4B QAT GGUF to serve as the target model, e.g. unsloth/gemma-4-26B-A4B-it-qat-GGUF (the UD-Q4_K_XL quant was used here). This head is only a draft for that model and does nothing on its own.
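One way to check the first requirement is to look for the MTP-related flag in the server's help text. This is a sketch, not a guaranteed test: it assumes that a build with the PR lists --spec-type in the output of `llama-server --help`, which the source does not confirm. `has_flag` is a hypothetical helper name.

```shell
# has_flag FLAG: succeeds if FLAG appears literally in the text read from stdin.
has_flag() {
  grep -qF -- "$1"
}

# Only runs when llama-server is on PATH; otherwise this block does nothing.
if command -v llama-server >/dev/null 2>&1; then
  if llama-server --help 2>&1 | has_flag '--spec-type'; then
    echo "this build advertises --spec-type"
  else
    echo "no --spec-type in --help; the build may predate Gemma 4 MTP support"
  fi
fi
```

If the flag is missing, rebuilding llama.cpp from a commit at or after the one named above is the likely fix.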

Usage

Use it as the speculative draft for the full model. The commands below download the Unsloth QAT GGUF this head was tested against, unsloth/gemma-4-26B-A4B-it-qat-GGUF (file gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf), and then this head:

hf download unsloth/gemma-4-26B-A4B-it-qat-GGUF gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --local-dir .
hf download superbonyx/gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf --local-dir .

Minimal (the bits that enable MTP)

llama-server \
  -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
  --spec-type draft-mtp \
  --spec-draft-model gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-draft-n-max 3 \
  --draft-p-min 0.75 \
  -ngl 99 -fa on

  • --spec-type draft-mtp tells llama.cpp to treat the head as an MTP draft (required).
  • --spec-draft-n-max 3 sets the maximum number of draft tokens per step.
  • --draft-p-min 0.75 keeps a drafted token only when the head's probability for it is ≥ 0.75 (greedy gate); the target model still verifies every drafted token.

> Note: the older --mtp-model / --mtp-draft-n flag names do not exist; MTP is wired through the speculative-decoding subsystem (--spec-type / --spec-draft-*).

Full configuration (the exact setup benchmarked above)

This is the exact llama-server invocation (and environment) that produced the Benchmark numbers above, included so the results are reproducible. Several flags are specific to the test box and can be changed or dropped (see the notes below the block).

# pin to a single GPU (box-specific: index of the target GPU in PCI order)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1

llama-server \
  -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
  --mmproj mmproj-F16.gguf \
  --chat-template-file gemma-4-chat_template.jinja \
  -a gemma4 \
  --jinja \
  --reasoning on \
  --reasoning-budget 32000 \
  -c 262144 \
  -ngl 99 \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --ctx-checkpoints 0 \
  --spec-type draft-mtp \
  --spec-draft-model gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-draft-n-max 3 \
  --draft-p-min 0.75 \
  -sm none -mg 0 \
  --temp 0.9 --top-k 20 --min-p 0.1 --top-p 1.0 --repeat-penalty 1.0 \
  --parallel 1 \
  --host 127.0.0.1 --port 8082

What each part does:

  • MTP head (the only flags that enable speculative decoding): --spec-type draft-mtp, --spec-draft-model …, --spec-draft-n-max 3, --draft-p-min 0.75.
  • Quality/behaviour: --jinja + --chat-template-file (official Gemma 4 template; optional if the GGUF's embedded one works), --reasoning on --reasoning-budget 32000 (thinking), and the --temp/--top-k/--min-p/--top-p/--repeat-penalty defaults (clients can override per request).
  • Capacity: -c 262144 (256K context) is affordable only because of -ctk q8_0 -ctv q8_0 (q8 KV cache) and Gemma's sliding-window attention; lower -c if you're VRAM-limited.
  • Vision: --mmproj mmproj-F16.gguf enables image input (Gemma 4 is a VLM); drop it for text-only.
  • Box-specific (change to match your machine): CUDA_DEVICE_ORDER / CUDA_VISIBLE_DEVICES and -sm none -mg 0 select/pin one GPU; --host/--port are local.
  • Build workaround: --ctx-checkpoints 0 dodges an SWA-checkpoint crash seen on this build; omit it on patched builds.
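To send one benchmark-style request to a server started with the configuration above, the parameters from the Benchmark setup note (n_predict 512, ignore_eos, cache_prompt:false) can be passed to the native /completion endpoint. This is a sketch under stated assumptions: `build_payload` is a hypothetical helper, the prompt is a short placeholder rather than the 3,000-token prompt used for the benchmark, and no JSON escaping is performed.

```shell
# build_payload PROMPT N_PREDICT: JSON body for llama-server's /completion endpoint.
# Assumes PROMPT contains no quotes, backslashes, or newlines (nothing is escaped).
build_payload() {
  printf '{"prompt":"%s","n_predict":%d,"ignore_eos":true,"cache_prompt":false}\n' "$1" "$2"
}

# Send the request only if a server answers on the host/port used above.
if curl -sf http://127.0.0.1:8082/health >/dev/null 2>&1; then
  build_payload "Write a short note about GPUs." 512 |
    curl -s http://127.0.0.1:8082/completion \
      -H 'Content-Type: application/json' -d @-
fi
```

The JSON response should include a timings section with throughput figures; the exact field names depend on the build, so inspect the output rather than assuming a fixed schema. Running the same request with and without the --spec-* flags gives a rough local version of the comparison in the Benchmark table.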

Conversion

# from the source bf16 safetensors, with current-master llama.cpp
# (the first argument is a local directory holding the downloaded source checkpoint)
python convert_hf_to_gguf.py google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant \
  --outfile gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf --outtype q8_0

Why this re-conversion exists

The Gemma 4 MTP head landed in llama.cpp via PR #23398, which registers the architecture as gemma4-assistant (hyphen) and expects nextn_predict_layers plus nextn.pre_projection / nextn.post_projection tensors.

Pre-merge community GGUFs were tagged gemma4_assistant (underscore) with a different tensor/hparam schema, and they fail to load on current builds with:

error loading model: unknown model architecture: 'gemma4_assistant'

This file was re-converted from source with current-master convert_hf_to_gguf.py (see Conversion above), so it has the correct gemma4-assistant arch and nextn_* tensors and loads on up-to-date llama.cpp.
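To tell which of the two architecture tags a given GGUF carries without loading it, one option is to scan the start of the file. The sketch below assumes that the architecture string is stored as plain text within the first 4 KiB of metadata, which is typical for GGUF files but not guaranteed; `gguf_arch_variant` is a hypothetical helper name.

```shell
# gguf_arch_variant FILE: prints "hyphen", "underscore", "unknown", or "not-gguf".
gguf_arch_variant() {
  # GGUF files begin with the 4-byte magic "GGUF".
  [ "$(head -c 4 "$1")" = "GGUF" ] || { echo "not-gguf"; return 0; }
  # Replace non-printable bytes with spaces so the header can be matched as text.
  header=$(head -c 4096 "$1" | LC_ALL=C tr -c '[:print:]' ' ')
  case $header in
    *gemma4-assistant*) echo "hyphen" ;;
    *gemma4_assistant*) echo "underscore" ;;
    *) echo "unknown" ;;
  esac
}
```

Calling `gguf_arch_variant gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf` should report the hyphenated tag for this file; a result of "underscore" would point to one of the pre-merge conversions that triggers the error above.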

| Property | Value |
|---|---|
| Architecture | gemma4-assistant |
| Quant | Q8_0 (general.file_type = 7) |
| Tensors | 49 (23× Q8_0 weights, 26× F32 norms/scales) |
| Size | ~462 MB (≈440 MiB) |
| Source | google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant (bf16 safetensors) |

License & attribution

Derived from Google's gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant (Apache-2.0). As a derivative of Gemma 4, use is also subject to Google's Gemma Terms of Use. Only the quantization (bf16 → Q8_0) and GGUF packaging were performed here; no weights were retrained or modified.


Source: Hugging Face