ysong21/gemma-4-12B-it-qat-assistant-MTP-Q4_0-GGUF overview
Gemma 4 12B IT QAT Assistant MTP Q4 0 GGUF This repository contains a Q4 0 GGUF conversion of Google's official Gemma 4 12B IT QAT assistant / drafter checkpoi…
Runs locally from ~308.0 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf | GGUF | Q4_0 | 308.0 MB | Download |
Model Details
| Model ID | ysong21/gemma-4-12B-it-qat-assistant-MTP-Q4_0-GGUF |
|---|---|
| Author | ysong21 |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | google/gemma-4-12B-it-qat-q4_0-unquantized-assistant |
| Last modified | 2026-06-08T05:58:30.000Z |
Model README
---
license: apache-2.0
library_name: llama.cpp
base_model: google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
tags:
- gguf
- llama.cpp
- gemma
- gemma-4
- qat
- mtp
- draft-model
- assistant
- speculative-decoding
pipeline_tag: text-generation
---
Gemma 4 12B IT QAT Assistant MTP Q4_0 GGUF
This repository contains a Q4_0 GGUF conversion of Google's official Gemma 4 12B IT QAT assistant / drafter checkpoint:
- Source checkpoint:
google/gemma-4-12B-it-qat-q4_0-unquantized-assistant - Intended target model:
google/gemma-4-12B-it-qat-q4_0-gguf - Output file:
gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf - Runtime: llama.cpp with Gemma 4 MTP /
draft-mtpsupport
This is not a standalone chat model. It is a draft model for speculative decoding and must be loaded together with a matching Gemma 4 12B IT QAT target model.
Usage
llama-server \
-hf google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 \
--model-draft gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 4
For multimodal use, keep the target model's matching mmproj enabled or pass it explicitly.
Conversion
Converted with ggml-org/llama.cpp tag b9553 / commit 9e3b928fd:
python3 convert_hf_to_gguf.py \
gemma-4-12B-it-qat-q4_0-unquantized-assistant \
--outfile gemma-4-12B-it-qat-assistant-MTP-BF16.gguf \
--outtype bf16
llama-quantize \
gemma-4-12B-it-qat-assistant-MTP-BF16.gguf \
gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf \
Q4_0
The quantization log reports:
- BF16 model size: 806.57 MiB
- Q4_0 quantized size: 292.90 MiB
- Effective BPW: 5.81
token_embd.weightwas quantized toQ6_Kby llama.cpp's standard mostly-Q4_0 recipe.
Local Validation
Smoke-tested with llama.cpp b9553 on macOS / Apple Metal. The server loaded the Q4_0 draft model and initialized draft-mtp successfully:
loading draft model 'gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf'
common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
srv load_model: speculative decoding context initialized
File SHA-256:
11d666fcab5284a4a333a4785f26dd72588855c07d7522331439f9aca7fee84e
Notes
This file is intended for benchmarking against the Q8_0 assistant drafter. Final output correctness is still verified by the target model during speculative decoding, but lower draft quantization can reduce acceptance rate or speed if the draft distribution diverges too much.
Run ysong21/gemma-4-12B-it-qat-assistant-MTP-Q4_0-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models