GraySoft
Model Intelligence Sheet

ysong21/gemma-4-12B-it-qat-assistant-MTP-Q4_0-GGUF overview


Tags: llama.cpp, gguf, gemma, gemma-4, qat, mtp, draft-model, assistant, speculative-decoding, text-generation, base_model:google/gemma-4-12B-it-qat-q4_0-unquantized-assistant, base_model:quantized:google/gemma-4-12B-it-qat-q4_0-unquantized-assistant, license:apache-2.0, endpoints_compatible, region:us, conversational

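The GGUF file takes about 308.0 MB on disk and is listed for 4 GB VRAM class GPUs with llama.cpp / guIDE. That sizing covers this draft file alone; per the README below, it must be loaded together with a matching target model.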

Repository Files & Downloads

1 GGUF file detected
Direct downloads for local inference
File | Type | Quantization | Size
gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf | GGUF | Q4_0 | 308.0 MB

Model Details

Model ID: ysong21/gemma-4-12B-it-qat-assistant-MTP-Q4_0-GGUF
Author: ysong21
Pipeline: text-generation
License: apache-2.0
Base model: google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
Last modified: 2026-06-08T05:58:30.000Z

Model README

---
license: apache-2.0
library_name: llama.cpp
base_model: google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
tags:
- gguf
- llama.cpp
- gemma
- gemma-4
- qat
- mtp
- draft-model
- assistant
- speculative-decoding
pipeline_tag: text-generation
---

Gemma 4 12B IT QAT Assistant MTP Q4_0 GGUF

This repository contains a Q4_0 GGUF conversion of Google's official Gemma 4 12B IT QAT assistant / drafter checkpoint:

  • Source checkpoint: google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
  • Intended target model: google/gemma-4-12B-it-qat-q4_0-gguf
  • Output file: gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf
  • Runtime: llama.cpp with Gemma 4 MTP / draft-mtp support

This is not a standalone chat model. It is a draft model for speculative decoding and must be loaded together with a matching Gemma 4 12B IT QAT target model.

Usage

llama-server \
  -hf google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 \
  --model-draft gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4

For multimodal use, keep the target model's matching mmproj (multimodal projector) file enabled, or pass it explicitly with --mmproj.

Conversion

Converted with ggml-org/llama.cpp tag b9553 / commit 9e3b928fd:

python3 convert_hf_to_gguf.py \
  gemma-4-12B-it-qat-q4_0-unquantized-assistant \
  --outfile gemma-4-12B-it-qat-assistant-MTP-BF16.gguf \
  --outtype bf16

llama-quantize \
  gemma-4-12B-it-qat-assistant-MTP-BF16.gguf \
  gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf \
  Q4_0

The quantization log reports:

  • BF16 model size: 806.57 MiB
  • Q4_0 quantized size: 292.90 MiB
  • Effective BPW: 5.81
  • token_embd.weight was quantized to Q6_K by llama.cpp's standard mostly-Q4_0 recipe.
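
The reported BPW can be cross-checked from the two sizes in the log. The sketch below assumes every weight in the BF16 file takes 16 bits and ignores GGUF metadata, so it is an approximation rather than the exact calculation llama-quantize performs. The function name effective_bpw is illustrative only. The result sits above the nominal 4.5 bits per weight of pure Q4_0 blocks, which is consistent with token_embd.weight being kept at Q6_K.

```shell
#!/bin/sh
# Approximate effective bits per weight from file sizes.
# Assumption: the reference file stores each weight in `bits` bits (16 for BF16),
# and metadata overhead is negligible.
# Args: quantized size, reference size (same unit), bits per weight (default 16).
effective_bpw() {
  awk -v q="$1" -v f="$2" -v bits="${3:-16}" \
    'BEGIN { printf "%.2f\n", q / f * bits }'
}

# Sizes in MiB taken from the quantization log above.
effective_bpw 292.90 806.57   # prints 5.81
```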

Local Validation

Smoke-tested with llama.cpp b9553 on macOS / Apple Metal. The server loaded the Q4_0 draft model and initialized draft-mtp successfully:

loading draft model 'gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf'
common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
srv load_model: speculative decoding context initialized
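
If the server output is saved to a file, the same check can be scripted. This is a minimal sketch that assumes the log messages match the ones quoted above verbatim; the function name and the log path are hypothetical, and message wording may differ in other llama.cpp builds.

```shell
#!/bin/sh
# Return 0 if a saved llama-server log contains both draft-mtp
# initialization messages quoted above, non-zero otherwise.
# Arg: path to the saved log file (chosen by the caller).
draft_mtp_initialized() {
  grep -qF "adding speculative implementation 'draft-mtp'" "$1" &&
    grep -qF "speculative decoding context initialized" "$1"
}

# Example (assumes the server output was redirected to server.log):
#   draft_mtp_initialized server.log && echo "draft-mtp active"
```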

File SHA-256:

11d666fcab5284a4a333a4785f26dd72588855c07d7522331439f9aca7fee84e
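
A downloaded copy can be checked against this digest before use. The sketch below uses sha256sum where available and falls back to shasum -a 256 (the usual tool on macOS); the helper names are illustrative, not part of llama.cpp.

```shell
#!/bin/sh
# Print the SHA-256 hex digest of a file, using whichever tool is installed.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# Return 0 if the file's digest equals the expected hex string, non-zero otherwise.
# Args: file path, expected lowercase hex digest.
verify_sha256() {
  [ "$(file_sha256 "$1")" = "$2" ]
}

# Example (assumes the file is in the current directory):
#   verify_sha256 gemma-4-12B-it-qat-assistant-MTP-Q4_0.gguf \
#     11d666fcab5284a4a333a4785f26dd72588855c07d7522331439f9aca7fee84e \
#     && echo "checksum OK"
```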

Notes

This file is intended for benchmarking against the Q8_0 assistant drafter. Final output correctness is still verified by the target model during speculative decoding, but lower draft quantization can reduce acceptance rate or speed if the draft distribution diverges too much.
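
To see why acceptance rate matters, the idealized estimate from the speculative decoding literature can be worked through: if each drafted token is accepted independently with probability a and n tokens are drafted per step, the target model yields (1 - a^(n+1)) / (1 - a) tokens per verification pass on average. The sketch below evaluates this for n = 4, matching --spec-draft-n-max 4. The acceptance rates are hypothetical values for illustration, not measurements of this drafter, and the estimate ignores the cost of running the draft model.

```shell
#!/bin/sh
# Expected tokens produced per target-model pass under the idealized model:
# each draft token accepted independently with probability a, n tokens drafted.
# Args: acceptance probability a (0..1), draft length n (default 4).
expected_tokens() {
  awk -v a="$1" -v n="${2:-4}" 'BEGIN {
    if (a >= 1) { printf "%.2f\n", n + 1; exit }   # every draft token accepted
    printf "%.2f\n", (1 - a ^ (n + 1)) / (1 - a)
  }'
}

# Hypothetical acceptance rates, for illustration only.
for a in 0.9 0.8 0.6; do
  printf 'acceptance %s -> %s tokens per target pass\n' "$a" "$(expected_tokens "$a" 4)"
done
```

Under these assumptions, dropping from 0.8 to 0.6 acceptance lowers the yield from about 3.36 to about 2.31 tokens per pass, which is the kind of effect a benchmark against the Q8_0 drafter would need to measure directly.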

Source: Hugging Face