GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

autotrust/gpt-oss-120b-Fable-5-Distilled-GGUF overview

GPT OSS 120B Fable 5 Distilled — GGUF Trained & Converted by : cloudyu https://huggingface.co/cloudyu Compute Sponsor : AutoTrust AI Lab https://huggingface.co…

ggufloramuonapple-siliconmoecodingagentfine-tunedendataset:armand0e/claude-fable-5-claude-codeendpoints_compatibleregion:usconversational

Runs locally from ~75.15 GB disk (32 GB+ VRAM class GPUs with llama.cpp / guIDE).

Downloads
0
Likes
2
Pipeline
Author

Repository Files & Downloads

1 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
gpt-oss-120b-Fable-5-Distilled-Q5_0.ggufGGUFQ5_075.15 GBDownload

Model Details

Model IDautotrust/gpt-oss-120b-Fable-5-Distilled-GGUF
Authorautotrust
Pipeline
License
Base modelgpt-oss-120b-heretic-v2-mxfp4-q8-hi-mlx
Last modified2026-06-19T23:34:59.000Z

Model README

---

base_model: gpt-oss-120b-heretic-v2-mxfp4-q8-hi-mlx

tags:

- gguf

- lora

- muon

- apple-silicon

- moe

- coding

- agent

- fine-tuned

language:

- en

datasets:

- armand0e/claude-fable-5-claude-code

---

GPT-OSS 120B Fable-5 Distilled — GGUF

> Trained & Converted by: cloudyu

> Compute Sponsor: AutoTrust AI Lab

> Source Model: cloudyu/gpt-oss-120b-Fable-5-Distilled

> Architecture: GPT-OSS (OpenAI MoE, 36 layers, 128 experts, 4 active)

> Quantization: Q8_0 (115.7 GB) | Q5_0 (67 GB) — GGUF V3 for llama.cpp

>

> GGUF is the universal format for LLM inference. These files run on llama.cpp, LM Studio, Ollama, GPT4All, text-generation-webui, llamafile, MLX-LM, and any GGUF-compatible runtime — across macOS, Windows, Linux, iOS, and Android, on CPU, CUDA, Metal, Vulkan, and ROCm backends.

Model Overview

This is a distilled variant of the OpenAI gpt-oss-120b model, fine-tuned using MLX with LoRA adapters (rank=16, targeting all attention projections, MoE router, and expert FFN layers). The training was performed in the MLX ecosystem using the MXFP4 quantized base model format. These GGUF files are the first community-produced GGUF conversions of a LoRA-fine-tuned GPT-OSS model from the MXFP4 format.

Available Quantizations

| Format | File | Size | Quality |

|--------|------|------|---------|

| Q8_0 | gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf | 115.7 GB | Near-lossless (8-bit uniform, biases/norms kept at F32) |

| Q5_0 | gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf | ~67 GB | Strong compression (5-bit symmetric, via llama-quantize from Q8_0) |

Weight matrices use Q8_0/Q5_0; biases, layer norms, and attention sinks are stored at full F32 precision for numerical stability.

Conversion Pipeline

The conversion from MLX MXFP4 format to llama.cpp GGUF involved a two-phase pipeline addressing several novel technical challenges:

Phase 1: MXFP4 Dequantization + LoRA Fusion → BF16

The source model is stored in Apple's MLX format using 4-bit MXFP4 quantization (Microscaling FP4, OCP spec) for the expert layers, with 8-bit affine quantization for attention projections. The LoRA adapter weights (576 tensors, float32) modify 288 modules spanning all Q/K/V/O projections, the MoE router, and all expert gate/up/down projections.

Key technical breakthroughs in Phase 1:

  • Per-weight quantization dispatch: The model uses heterogeneous quantization parameters — MXFP4 4-bit (experts), affine 8-bit (attention + embeddings), and a special group_size=64 variant for the router. Each weight's dequantization must use the correct (bits, group_size) pair from the config.
  • MXFP4 dequantization without explicit mode: MLX's dequantize function, when called with bits=4 and group_size=32, auto-detects the microscaling format. Explicit mode='mxfp4' is incompatible with the BF16 scale storage convention actually used.
  • LoRA fusion with shape-dependent matmul: The MLX LoRA adapter stores lora_a as [input_dims, rank] and lora_b as [rank, output_dims] for 2D weights, but [num_experts, rank, input_dims] and [num_experts, output_dims, rank] for 3D expert weights. The fusion requires lora_b.T @ lora_a.T (2D) vs lora_b @ lora_a (3D) — using the wrong formula produces catastrophic dimension mismatches.
  • BF16 overflow during MXFP4 dequant: The MXFP4 e8m0 scale representation can produce float32 values exceeding the BF16 representable range (~3.4e38), resulting in NaN values in ~5% of dequantized expert weights. These must be sanitized before downstream quantization.

Phase 2: BF16 → Quantized GGUF

Phase 2 reads the intermediate BF16 safetensors and writes the final GGUF in a single streaming pass, keeping RAM below 2 GB for a 218 GB intermediate.

Key technical breakthroughs in Phase 2:

  • Missing attention.key_length metadata: The llama.cpp GPT-OSS handler computes n_embd_head_k = n_embd / n_head = 2880 / 64 = 45 by default, which is incorrect (the true head dimension is 64). This causes a fatal GGML_ASSERT(a->ne[0] == b->ne[0]) crash during KV cache initialization. The fix is explicitly setting gpt-oss.attention.key_length, value_length, key_length_swa, and value_length_swa to head_dim=64 in the GGUF metadata.
  • Tokenizer vocabulary padding: The model's embedding table has 201,088 rows (vocab_size in config) but the o200k harmony tokenizer has only 200,019 entries. llama.cpp validates token_embd.weight against the tokenizer count. The token list must be padded with dummy tokens to reach 201,088.
  • Attention sinks must be F32: The ggml_soft_max_add_sinks and ggml_flash_attn_ext_add_sinks operations require the sinks tensor to be GGML_TYPE_F32. Storing it as BF16 causes a Metal backend crash during inference.
  • All biases must be F32: ggml_add operations in the attention and FFN graph require type-matched operands. Since attention outputs are F32, bias tensors stored as BF16 trigger binary_op: unsupported types errors on CPU and Metal.
  • GGUF tensor naming alignment: The GPT-OSS architecture in llama.cpp (PR #15091, LLM_ARCH_OPENAI_MOE) maps MLX tensor names via specific conventions: self_attn.o_projattn_output (not attn_out), mlp.routerffn_gate_inp, mlp.experts.gate_projffn_gate_exps, self_attn.sinksattn_sinks.weight, etc. The architecture identifier is gpt-oss (hyphenated), not gptoss.
  • Safetensors byte-level data extraction: Reading individual tensors from safetensor files requires accounting for the JSON header size (f.seek(8 + header_length + offset), not f.seek(8 + offset)). Missing this causes the first tensor in each shard to read partially corrupt data — a silent data integrity bug.

Usage with llama.cpp

# Download (example path)
wget https://huggingface.co/cloudyu/gpt-oss-120b-Fable-5-Distilled/resolve/main/gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf

# Inference
llama-cli -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf \
  -p "Explain quantum computing in simple terms" -n 512

# Server (recommended — uses sliding window attention efficiently)
llama-server -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf \
  --jinja --reasoning-format auto --port 8080

The model uses GPT-OSS's native reasoning format. Set reasoning_effort via chat template kwargs for API usage. The chat template supports system, developer, user, and assistant roles with channel markers for analysis, commentary, and final output.

Local Agent Best Practice

This model demonstrates extremely strong reasoning capabilities — when given --jinja and --reasoning-format auto with reasoning_effort: "high", it can perform multi-step planning, complex code analysis, and structured problem decomposition at a level comparable to frontier models. However, it has a critical weakness: severe hallucination. The model will confidently fabricate facts, API signatures, file paths, URLs, and library versions.

Golden Rule: Always pair this model with the anysearch skill in llama-agent or opencode, which grounds responses against real web search results. Do not trust any factual claim from this model without verification.

./build/bin/llama-agent \
  -m /path/to/gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf \
  --jinja \
  --reasoning-format auto \
  --temp 1.0 \
  --chat-template-kwargs '{"reasoning_effort": "high"}'

Key flags explained:

  • --jinja — Enables Jinja template rendering for GPT-OSS's multi-channel chat format
  • --reasoning-format auto — Uses the model's native reasoning block structure (analysis / commentary / final)
  • --temp 1.0 — Maximum temperature; strong reasoning models benefit from full creativity in the analysis phase
  • --chat-template-kwargs '{"reasoning_effort": "high"}' — Instructs the model to allocate more compute to reasoning chains

Known failure modes without search grounding:

  • Confidently invents non-existent CLI flags and API parameters
  • Hallucinates plausible but wrong library versions and package names
  • Fabricates convincing-looking URLs and file paths
  • Constructs internally consistent but factually incorrect arguments

---

本地 Agent 最佳实践

本模型拥有极强的推理能力——配合 --jinja--reasoning-format autoreasoning_effort: "high" 参数,可以进行多步规划、复杂代码分析和结构化问题拆解,水平可比肩前沿模型。但它有一个致命弱点:严重的幻觉问题。模型会自信地捏造事实、API 签名、文件路径、URL 和库版本号。

黄金法则:始终将此模型与 anysearch skill 配合使用,以真实网页搜索结果作为事实锚定。未经查证,不要信任该模型的任何事实性结论。

./build/bin/llama-agent \
  -m /path/to/gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf \
  --jinja \
  --reasoning-format auto \
  --temp 1.0 \
  --chat-template-kwargs '{"reasoning_effort": "high"}'

关键参数说明

  • --jinja — 启用 Jinja 模板渲染,支持 GPT-OSS 多通道 chat 格式
  • --reasoning-format auto — 使用模型原生的推理块结构(analysis / commentary / final 三通道)
  • --temp 1.0 — 最高温度;强推理模型在分析阶段受益于充分的创造性
  • --chat-template-kwargs '{"reasoning_effort": "high"}' — 指示模型为推理链分配更多计算资源

未配合搜索验证时的已知失败模式

  • 自信地编造不存在的 CLI 参数和 API 接口
  • 幻觉出看似合理但实际错误的库版本号和包名
  • 构造看似可信的虚假 URL 和文件路径
  • 生成内部自洽但与事实不符的论证

---

GPT-OSS 120B Fable-5 Distilled — GGUF(中文说明)

模型概要

训练与转换: cloudyu

算力支持: AutoTrust AI Lab

基于 OpenAI gpt-oss-120b 的蒸馏微调版本。使用 MLX 框架对 MXFP4 量化基座模型训练 LoRA 适配器(rank=16,覆盖所有注意力投影、MoE 路由器及专家 FFN 层)。这是社区首次将 LoRA 微调后的 GPT-OSS 模型从 MLX MXFP4 格式转换为 llama.cpp GGUF 格式。

量化版本

GGUF 是 LLM 推理的通用格式——支持 llama.cpp、LM Studio、Ollama、GPT4All、text-generation-webui、llamafile 等主流运行时,兼容 macOS/Windows/Linux/iOS/Android 全平台,CPU/CUDA/Metal/Vulkan/ROCm 全后端。

| 文件 | 大小 | 说明 |

|------|------|------|

| gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf | 115.7 GB | Q8_0 权重 + F32 biases/norms,接近无损 |

| gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf | ~67 GB | Q5_0 对称量化,高压缩比 |

转换技术突破

阶段一:MXFP4 反量化 + LoRA 融合 → BF16

源模型以 MLX 格式存储,专家层使用 MXFP4(4-bit 显微缩放格式),注意力层使用 8-bit affine 量化。LoRA 适配器包含 576 个张量,修改 288 个模块。

核心技术挑战:

  1. 异构量化参数调度:不同层使用不同的量化参数——专家层 bits=4, group_size=32 (mxfp4),注意力层 bits=8, group_size=32 (affine),路由器 bits=8, group_size=64。反量化必须按张量逐一使用正确的 (bits, group_size) 组合。
  1. MXFP4 反量化的 mode 陷阱mlx.dequantizebits=4 时自动检测 MXFP4 格式,显式传 mode='mxfp4' 反而报错——因为它期望 uint8 格式的 scale,而实际存储的是 BF16。
  1. LoRA 融合的形状相关矩阵乘法:2D 权重(注意力、路由器)与 3D 权重(专家)的 LoRA 融合公式不同——2D 用 lora_b.T @ lora_a.T,3D 用 lora_b @ lora_a。混用会导致维度崩溃。
  1. BF16 溢出:MXFP4 的 e8m0 scale 编码可能产生超出 BF16 范围(±3.4e38)的 float32 值,导致约 5% 的专家权重变成 NaN。必须在量化前用 nan_to_num 清理。

阶段二:BF16 → 量化 GGUF

以流式方式读取 BF16 中间文件,逐张量量化写入 GGUF,内存占用 < 2 GB。

核心技术挑战:

  1. attention.key_length 元数据缺失:llama.cpp 的 GPT-OSS 处理程序默认计算 n_embd_head_k = hidden_size / num_heads = 2880 / 64 = 45,但实际 head_dim 是 64。导致 KV cache 初始化时 ggml_set_rows assertion 崩溃(缓存 ne[0]=360 ≠ K张量 ne[0]=512)。修复方法是在 GGUF 元数据中显式设置 key_length/value_length/key_length_swa/value_length_swa = 64
  1. 词表补齐:模型 embedding 有 201,088 行,但 o200k harmony tokenizer 只有 200,019 个 token。llama.cpp 会校验 token_embd.weight 的形状与 tokenizer 数量必须一致。需用虚拟 token 补齐。
  1. Attention Sinks 必须 F32ggml_soft_max_add_sinks 操作要求 sinks 张量为 F32 类型,存为 BF16 会导致 Metal 后端推理崩溃。
  1. 所有 biases 必须 F32:注意力输出是 F32,ggml_add 要求操作数类型匹配。BF16 存储的 bias 会在 CPU/Metal 后端触发 binary_op: unsupported types 错误。
  1. GGUF 张量命名对齐:llama.cpp GPT-OSS 架构(PR #15091)的张量命名约定:attn_output(非 attn_out)、ffn_gate_inpffn_gate_expsattn_sinks.weight 等。架构标识符为 gpt-oss(带连字符)。
  1. Safetensors 字节级偏移:读取单个张量时必须计算 JSON header 的实际长度:f.seek(8 + header_length + offset),而非简单的 f.seek(8 + offset)。每个 shard 的第一个张量会因忽略 header 长度而读到前一个 shard 的尾部垃圾数据。

使用方式

llama-cli -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf -p "你好" -n 512

# API 服务(推荐)
llama-server -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf \
  --jinja --reasoning-format auto --port 8080

模型使用 GPT-OSS 原生的 reasoning format,支持 reasoning_effort 参数控制推理深度,chat template 包含 analysiscommentaryfinal 三个输出通道。

Run autotrust/gpt-oss-120b-Fable-5-Distilled-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models