GraySoft
Model Intelligence Sheet

lym00/Qwen3.6-27B-MTP-ONLY-GGUF overview

Separate MTP GGUF. NOTE: This repo contains a Multi-Token Prediction (MTP) GGUF for llama.cpp, extracted from the base model Qwen/Qwen3.6-27B https://huggingface.co/Q…

Tags: gguf · license:apache-2.0 · endpoints_compatible · region:us · conversational

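Smallest file: ~1.87 GB on disk (Q4_0). This is an MTP draft file only: it is loaded alongside a full Qwen3.6-27B target model, so the total memory needed is far larger than the draft file alone.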

Downloads: 0 · Likes: 0

Repository Files & Downloads

4 GGUF files:

| File | Quantization | Size |
|---|---|---|
| Qwen3.6-27B-MTP-bf16.gguf | BF16 | 5.54 GB |
| Qwen3.6-27B-MTP-f16.gguf | F16 | 5.54 GB |
| Qwen3.6-27B-MTP-q4_0.gguf | Q4_0 | 1.87 GB |
| Qwen3.6-27B-MTP-q8_0.gguf | Q8_0 | 2.95 GB |
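The sizes in the table can be sanity-checked against the tensor shapes in the conversion log further down. The sketch below is a back-of-the-envelope estimate, not the actual quantization recipe: it assumes every 2-D weight uses the named ggml type, the eight 1-D norm weights stay F32, and it uses the standard ggml block layouts (Q8_0: 34 bytes per 32 weights, Q4_0: 18 bytes per 32, Q6_K: 210 bytes per 256). The Q6_K variant for output.weight is a hypothesis: llama-quantize commonly keeps the output tensor at a higher-precision type, but the README does not say how the Q4_0 file was produced.

```python
# Element counts derived from the tensor shapes in conversion.log.
OUTPUT_PARAMS = 5120 * 248320                     # output.weight
MATRIX_PARAMS = (
    2 * 5120 * 248320                             # token_embd.weight + output.weight
    + 10240 * 5120                                # nextn.eh_proj.weight
    + 3 * 5120 * 17408                            # ffn_gate, ffn_up, ffn_down
    + 5120 * (12288 + 1024 + 1024)                # attn_q, attn_k, attn_v
    + 6144 * 5120                                 # attn_output
)
NORM_PARAMS = 6 * 5120 + 2 * 256                  # eight 1-D norm weights, kept F32

# (bytes per block, weights per block) for the ggml types used below.
BLOCK = {
    "BF16": (2, 1),
    "Q8_0": (34, 32),    # fp16 scale + 32 int8 values
    "Q4_0": (18, 32),    # fp16 scale + 32 packed 4-bit values
    "Q6_K": (210, 256),  # 256-weight super-block
}


def tensor_bytes(n_params: int, ggml_type: str) -> int:
    """Bytes needed to store n_params weights in the given ggml type."""
    block_bytes, block_size = BLOCK[ggml_type]
    assert n_params % block_size == 0, "element count must fill whole blocks"
    return n_params // block_size * block_bytes


def estimate(matrix_type: str, output_type: str = "") -> int:
    """Estimated tensor-data bytes; output.weight may use a different type."""
    output_type = output_type or matrix_type
    others = MATRIX_PARAMS - OUTPUT_PARAMS
    return (tensor_bytes(others, matrix_type)
            + tensor_bytes(OUTPUT_PARAMS, output_type)
            + NORM_PARAMS * 4)


GIB = 2**30
for label, args in [("BF16", ("BF16",)), ("Q8_0", ("Q8_0",)),
                    ("Q4_0 everywhere", ("Q4_0",)),
                    ("Q4_0, output.weight Q6_K", ("Q4_0", "Q6_K"))]:
    print(f"{label:26s} {estimate(*args) / GIB:.2f} GiB")
```

Under these assumptions the script prints about 5.53, 2.94, 1.55 and 1.86 GiB. The BF16 and Q8_0 figures sit just below the table's 5.54 and 2.95, which suggests the listed sizes are GiB plus GGUF metadata such as the tokenizer; the all-Q4_0 estimate undershoots 1.87 noticeably, while the Q6_K-output variant lands close to it. Both readings are inferences, not facts about how the files were made.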

Model Details

Model ID: lym00/Qwen3.6-27B-MTP-ONLY-GGUF
Author: lym00
License: apache-2.0
Base model: Qwen/Qwen3.6-27B
Last modified: 2026-06-18T02:20:30.000Z

Model README

---

license: apache-2.0

---

Separate MTP GGUF

> [!NOTE]

> This repo contains a Multi-Token Prediction (MTP) GGUF for llama.cpp, extracted from the base model (Qwen/Qwen3.6-27B).

> It can be paired with a target model using the --spec-draft-model flag.

> See PR: https://github.com/ggml-org/llama.cpp/pull/22673

>

> If you’re looking for an MTP GGUF for transplanting/"grafting" onto your model, check out:

> - https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-GGUF

> - https://huggingface.co/IHaveNoClueAndIMustPost/Qwen3.6-27B-MTP-TENSORS-ONLY
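As a rough mental model of what pairing a draft with a target buys, here is a minimal, greedy-only sketch of draft-and-verify speculative decoding. The toy models and function names are hypothetical. n_max and p_min are meant to mirror spec-draft-n-max and spec-draft-p-min in the config further down, assuming they behave like llama.cpp's older --draft-max and --draft-p-min options (a cap on drafted tokens, and stop drafting once the draft's top-token probability falls below the threshold). Real llama.cpp verifies all drafted positions in one batched target pass and supports sampling; this sketch calls the target once per position and models only greedy (temperature 0) acceptance.

```python
from typing import Callable, Dict, List, Tuple

# A toy "model": maps a token context to a next-token probability table.
Model = Callable[[List[int]], Dict[int, float]]


def top1(dist: Dict[int, float]) -> Tuple[int, float]:
    """Return the most likely token and its probability."""
    tok = max(dist, key=lambda t: dist[t])
    return tok, dist[tok]


def draft_tokens(draft: Model, ctx: List[int], n_max: int, p_min: float) -> List[int]:
    """Greedily draft up to n_max tokens; stop early when confidence drops below p_min."""
    proposal: List[int] = []
    for _ in range(n_max):
        tok, p = top1(draft(ctx + proposal))
        if p < p_min:
            break
        proposal.append(tok)
    return proposal


def speculative_step(target: Model, draft: Model, ctx: List[int],
                     n_max: int, p_min: float) -> List[int]:
    """One draft-and-verify round; always returns at least one token."""
    proposal = draft_tokens(draft, ctx, n_max, p_min)
    accepted: List[int] = []
    for tok in proposal:
        expected, _ = top1(target(ctx + accepted))
        if expected != tok:
            # First mismatch: keep the target's token, discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every drafted token matched; the target also contributes the next token
    # (a batched implementation gets it from the same verification pass).
    accepted.append(top1(target(ctx + accepted))[0])
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_tokens: int,
             n_max: int = 3, p_min: float = 0.5) -> Tuple[List[int], int]:
    """Generate n_tokens; also return the number of draft-and-verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out += speculative_step(target, draft, out, n_max, p_min)
        rounds += 1
    return out[len(prompt):len(prompt) + n_tokens], rounds


def plain_greedy(target: Model, prompt: List[int], n_tokens: int) -> List[int]:
    """Reference: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(top1(target(out))[0])
    return out[len(prompt):]


# Hypothetical toy models: the target counts upward mod 10.
def toy_target(ctx: List[int]) -> Dict[int, float]:
    nxt = (ctx[-1] + 1) % 10
    return {nxt: 0.9, (nxt + 5) % 10: 0.1}


def toy_draft(ctx: List[int]) -> Dict[int, float]:
    if ctx[-1] == 4:
        return {7: 0.8, 5: 0.2}            # confident but wrong
    if ctx[-1] == 8:
        return {9: 0.4, 3: 0.35, 2: 0.25}  # right, but below p_min
    return toy_target(ctx)                 # otherwise agrees with the target


tokens, rounds = generate(toy_target, toy_draft, [0], 12)
print(tokens, rounds)
```

Because verification is greedy here, the output is identical to plain greedy decoding with the target alone; the draft only changes how many target rounds are needed (4 draft-and-verify rounds instead of 12 single-token steps in this toy run).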

am17an:

>yes it can be loaded separately using --spec-draft-model.

>The convert_hf_to_gguf.py changes have an option of --mtp which just outputs the MTP gguf.

>

>Using the "grafted" on MTP is more VRAM efficient though.

>

>Another thing is that -hf option will try to look for the MTP gguf like it does for mmproj in case spec-draft-type draft-mtp is mentioned.

Discussion: https://github.com/ggml-org/llama.cpp/pull/22673#issuecomment-4456979078

Findings

Original MTP tensors:

  • https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/model-00013-of-00015.safetensors
  • https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/model-00015-of-00015.safetensors

Shared embeddings/output weights:

  • https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/model-00001-of-00015.safetensors
  • https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/model-00008-of-00015.safetensors

Ref: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/model.safetensors.index.json
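The shard lookup above can be reproduced from model.safetensors.index.json, whose weight_map maps each tensor name to the shard file that stores it. The sketch below groups shard files by tensor-name prefix. The prefixes in GROUPS and the tensor names in EXAMPLE_INDEX are hypothetical placeholders, not the actual Qwen3.6-27B names; substitute the real names from the index file before drawing conclusions.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

# HYPOTHETICAL tensor names for illustration only; check the real index file.
EXAMPLE_INDEX = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
        "model.layers.0.mlp.up_proj.weight": "model-00002-of-00015.safetensors",
        "lm_head.weight": "model-00008-of-00015.safetensors",
        "mtp.fc.weight": "model-00013-of-00015.safetensors",
        "mtp.layers.0.mlp.up_proj.weight": "model-00015-of-00015.safetensors",
    },
}

# HYPOTHETICAL prefix groups; str.startswith accepts a tuple of prefixes.
GROUPS: Dict[str, Tuple[str, ...]] = {
    "mtp": ("mtp.",),
    "shared": ("model.embed_tokens.", "lm_head."),
}


def load_index(path: str) -> dict:
    """Read a safetensors index file (a JSON object with a "weight_map" key)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def shards_by_group(index: dict, groups: Dict[str, Tuple[str, ...]]) -> Dict[str, List[str]]:
    """Return, per group, the sorted shard files holding any tensor with a matching prefix."""
    found = defaultdict(set)
    for name, shard in index["weight_map"].items():
        for label, prefixes in groups.items():
            if name.startswith(prefixes):
                found[label].add(shard)
    return {label: sorted(files) for label, files in found.items()}


print(shards_by_group(EXAMPLE_INDEX, GROUPS))
```

For the real repository, pass load_index("model.safetensors.index.json") instead of EXAMPLE_INDEX. With the toy data the function returns shards 13 and 15 for "mtp" and shards 1 and 8 for "shared", mirroring the Findings list; the conversion log shows the converter indexing exactly those four shards.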

MTP tensors to GGUF conversion:

python convert_hf_to_gguf.py ../Qwen3.6-27B --outtype bf16 --outfile ../Qwen3.6-27B-MTP/Qwen3.6-27B-MTP-bf16.gguf --mtp

Conversion log: conversion.log

INFO:hf-to-gguf:Loading model: Qwen3.6-27B
INFO:hf-to-gguf:Model architecture: Qwen3_5ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model-00001-of-00015.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00008-of-00015.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00013-of-00015.safetensors'
INFO:hf-to-gguf:gguf: indexing model part 'model-00015-of-00015.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only

INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:token_embd.weight,                    torch.bfloat16 --> BF16, shape = {5120, 248320}
INFO:hf-to-gguf:output.weight,                        torch.bfloat16 --> BF16, shape = {5120, 248320}
INFO:hf-to-gguf:blk.64.nextn.eh_proj.weight,          torch.bfloat16 --> BF16, shape = {10240, 5120}
INFO:hf-to-gguf:blk.64.ffn_down.weight,               torch.bfloat16 --> BF16, shape = {17408, 5120}
INFO:hf-to-gguf:blk.64.ffn_gate.weight,               torch.bfloat16 --> BF16, shape = {5120, 17408}
INFO:hf-to-gguf:blk.64.ffn_up.weight,                 torch.bfloat16 --> BF16, shape = {5120, 17408}
INFO:hf-to-gguf:blk.64.attn_k.weight,                 torch.bfloat16 --> BF16, shape = {5120, 1024}
INFO:hf-to-gguf:blk.64.attn_q.weight,                 torch.bfloat16 --> BF16, shape = {5120, 12288}
INFO:hf-to-gguf:blk.64.attn_v.weight,                 torch.bfloat16 --> BF16, shape = {5120, 1024}
INFO:hf-to-gguf:output_norm.weight,                   torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.attn_norm.weight,              torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.post_attention_norm.weight,    torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.attn_k_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.64.attn_output.weight,            torch.bfloat16 --> BF16, shape = {6144, 5120}
INFO:hf-to-gguf:blk.64.attn_q_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.64.nextn.shared_head_norm.weight, torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.nextn.enorm.weight,            torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.nextn.hnorm.weight,            torch.bfloat16 --> F32, shape = {5120}

INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:..\Qwen3.6-27B-MTP\Qwen3.6-27B-MTP-bf16.gguf: n_tensors = 18, total_size = 5.9G
Writing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5.94G/5.94G [00:46<00:00, 127Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to ..\Qwen3.6-27B-MTP\Qwen3.6-27B-MTP-bf16.gguf
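The log above can be cross-checked mechanically. The hypothetical helper below parses the "Exporting model" lines (pasted from the log), counts tensors, and sums their storage from the logged GGUF types. It only knows the unquantized types that appear here (BF16, F16, F32), and the byte total covers tensor data only, not GGUF metadata.

```python
import re

# The "Exporting model" lines from conversion.log, pasted verbatim.
LOG = """\
INFO:hf-to-gguf:token_embd.weight,                    torch.bfloat16 --> BF16, shape = {5120, 248320}
INFO:hf-to-gguf:output.weight,                        torch.bfloat16 --> BF16, shape = {5120, 248320}
INFO:hf-to-gguf:blk.64.nextn.eh_proj.weight,          torch.bfloat16 --> BF16, shape = {10240, 5120}
INFO:hf-to-gguf:blk.64.ffn_down.weight,               torch.bfloat16 --> BF16, shape = {17408, 5120}
INFO:hf-to-gguf:blk.64.ffn_gate.weight,               torch.bfloat16 --> BF16, shape = {5120, 17408}
INFO:hf-to-gguf:blk.64.ffn_up.weight,                 torch.bfloat16 --> BF16, shape = {5120, 17408}
INFO:hf-to-gguf:blk.64.attn_k.weight,                 torch.bfloat16 --> BF16, shape = {5120, 1024}
INFO:hf-to-gguf:blk.64.attn_q.weight,                 torch.bfloat16 --> BF16, shape = {5120, 12288}
INFO:hf-to-gguf:blk.64.attn_v.weight,                 torch.bfloat16 --> BF16, shape = {5120, 1024}
INFO:hf-to-gguf:output_norm.weight,                   torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.attn_norm.weight,              torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.post_attention_norm.weight,    torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.attn_k_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.64.attn_output.weight,            torch.bfloat16 --> BF16, shape = {6144, 5120}
INFO:hf-to-gguf:blk.64.attn_q_norm.weight,            torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.64.nextn.shared_head_norm.weight, torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.nextn.enorm.weight,            torch.bfloat16 --> F32, shape = {5120}
INFO:hf-to-gguf:blk.64.nextn.hnorm.weight,            torch.bfloat16 --> F32, shape = {5120}
"""

# Bytes per element for the unquantized GGUF types that appear in this log.
BYTES_PER_ELEMENT = {"BF16": 2, "F16": 2, "F32": 4}

# Captures tensor name, target GGUF type, and the comma-separated shape.
LINE_RE = re.compile(r"hf-to-gguf:([\w.]+),\s+torch\.\w+ --> (\w+), shape = \{([^}]*)\}")


def tally(log: str):
    """Return (name, gguf_type, element_count) for every exported tensor line."""
    tensors = []
    for m in LINE_RE.finditer(log):
        name, gguf_type, dims = m.group(1), m.group(2), m.group(3)
        count = 1
        for d in dims.split(","):
            count *= int(d)
        tensors.append((name, gguf_type, count))
    return tensors


tensors = tally(LOG)
n_tensors = len(tensors)
total_params = sum(n for _, _, n in tensors)
total_bytes = sum(n * BYTES_PER_ELEMENT[t] for _, t, n in tensors)
shared = sum(n for name, _, n in tensors if name in ("token_embd.weight", "output.weight"))

print(f"n_tensors    = {n_tensors}")
print(f"total_params = {total_params:,}")
print(f"total_bytes  = {total_bytes / 1e9:.2f} GB ({total_bytes / 2**30:.2f} GiB)")
print(f"token_embd + output share = {shared / total_params:.1%}")
```

Run as-is, it reports 18 tensors and about 5.94 GB (5.53 GiB), matching n_tensors = 18 and the 5.94G written in the log. It also shows that the two {5120, 248320} matrices, token_embd.weight and output.weight, account for about 85.7% of the parameters. That is consistent with am17an's remark that the grafted MTP is more VRAM efficient, presumably because a grafted layer can reuse the target model's own embedding and output tensors instead of loading a second copy; this explanation is an inference, not something stated in the quoted comment.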

---

Test Results

WebUI (llama.cpp)

[screenshot not preserved]

Harness (OpenCode)

[screenshot not preserved]

Config (models.ini)

version = 1

[*]
flash-attn = on
mlock = off
mmap = on
fit = on
warmup = on
batch-size = 256
ubatch-size = 256
cache-type-k = q4_0
cache-type-v = q4_0
kv-unified = true
swa-full = true
jinja = true
direct-io = off
cache-prompt = true
cache-ram = 28672
n-gpu-layers = 99
reasoning = off
reasoning-budget = 0
min-p = 0
presence-penalty = 1.5
top-k = 40
chat-template-kwargs = {"preserve_thinking": true}
spec-default = true
ctx-checkpoints = 64
parallel = 1
threads-http = 1
ctx-size = 131072

# --- MODELS ---
[TeichAI/Qwen3.6-27B-Fable-5-Experimental-GGUF]
alias = TeichAI/Qwen3.6-27B-Fable-5-Experimental-GGUF
model = /root/.cache/llama.cpp/TeichAI/Qwen3.6-27B-Fable-5-Experimental-GGUF/Qwen3.6-27B-Fable-5-Distill.iq4_nl.gguf
mmproj = /root/.cache/llama.cpp/mmproj/mmproj-Qwen3.6-27B-BF16.gguf
spec-draft-model = /root/.cache/llama.cpp/mtp/Qwen3.6-27B-MTP-q4_0.gguf
temperature = 0.7
top-k = 20
top-p = 0.8
presence-penalty = 1.5
repeat-penalty = 1.0
seed = 42
spec-type = draft-mtp,ngram-mod,ngram-map-k4v
spec-draft-n-max = 3
spec-draft-p-min = 0.50
spec-draft-prio = 2
spec-draft-prio-batch = 2
spec-ngram-mod-n-match = 24
spec-ngram-mod-n-min = 48
spec-ngram-mod-n-max = 64
spec-ngram-map-k4v-size-n = 8
spec-ngram-map-k4v-size-m = 24
spec-ngram-map-k4v-min-hits = 2
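The preset file follows a common INI layout: a [*] section with global defaults and one section per model that overrides some of them (here top-k drops from 40 to 20 for the Fable model). Assuming llama-server resolves presets that way, the effective settings for a model can be previewed with Python's configparser. The helper name load_presets and the trimmed sample are illustrative, and the merge order (model section wins over [*]) is an assumption about llama-server's behavior, not taken from its source.

```python
import configparser
from typing import Dict

# A trimmed copy of the models.ini above (subset of keys, same structure).
SAMPLE_INI = """\
version = 1

[*]
top-k = 40
presence-penalty = 1.5
ctx-size = 131072

# --- MODELS ---
[TeichAI/Qwen3.6-27B-Fable-5-Experimental-GGUF]
spec-draft-model = /root/.cache/llama.cpp/mtp/Qwen3.6-27B-MTP-q4_0.gguf
top-k = 20
spec-draft-n-max = 3
"""


def load_presets(text: str) -> Dict[str, Dict[str, str]]:
    """Merge [*] defaults into every model section (assumed precedence: model wins)."""
    # interpolation=None keeps values such as JSON strings verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    # "version = 1" sits before any section header, which configparser rejects,
    # so wrap it in a throwaway section.
    parser.read_string("[__preamble__]\n" + text)
    defaults = dict(parser["*"]) if parser.has_section("*") else {}
    presets: Dict[str, Dict[str, str]] = {}
    for section in parser.sections():
        if section in ("__preamble__", "*"):
            continue
        merged = dict(defaults)
        merged.update(parser[section])
        presets[section] = merged
    return presets


for model, settings in load_presets(SAMPLE_INI).items():
    print(model)
    for key, value in sorted(settings.items()):
        print(f"  {key} = {value}")
```

The same function accepts the full models.ini text; every value comes back as a string, so numeric settings still need converting before use.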

Hardware tested (low-budget mini PC)

  • Model: Machenike GTR Mini PC (~$600)
  • CPU: AMD R7-H255 (780M iGPU)
  • RAM: 32 GB DDR5 (shared/unified memory)
  • Backend: llama.cpp (Vulkan)

Source: Hugging Face