autotrust/gpt-oss-120b-Fable-5-Distilled-GGUF overview
GPT OSS 120B Fable 5 Distilled — GGUF Trained & Converted by : cloudyu https://huggingface.co/cloudyu Compute Sponsor : AutoTrust AI Lab https://huggingface.co…
Runs locally from ~75.15 GB disk (32 GB+ VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf | GGUF | Q5_0 | 75.15 GB | Download |
Model Details
Model README
---
base_model: gpt-oss-120b-heretic-v2-mxfp4-q8-hi-mlx
tags:
- gguf
- lora
- muon
- apple-silicon
- moe
- coding
- agent
- fine-tuned
language:
- en
datasets:
- armand0e/claude-fable-5-claude-code
---
GPT-OSS 120B Fable-5 Distilled — GGUF
> Trained & Converted by: cloudyu
> Compute Sponsor: AutoTrust AI Lab
> Source Model: cloudyu/gpt-oss-120b-Fable-5-Distilled
> Architecture: GPT-OSS (OpenAI MoE, 36 layers, 128 experts, 4 active)
> Quantization: Q8_0 (115.7 GB) | Q5_0 (67 GB) — GGUF V3 for llama.cpp
>
> GGUF is the universal format for LLM inference. These files run on llama.cpp, LM Studio, Ollama, GPT4All, text-generation-webui, llamafile, MLX-LM, and any GGUF-compatible runtime — across macOS, Windows, Linux, iOS, and Android, on CPU, CUDA, Metal, Vulkan, and ROCm backends.
Model Overview
This is a distilled variant of the OpenAI gpt-oss-120b model, fine-tuned using MLX with LoRA adapters (rank=16, targeting all attention projections, MoE router, and expert FFN layers). The training was performed in the MLX ecosystem using the MXFP4 quantized base model format. These GGUF files are the first community-produced GGUF conversions of a LoRA-fine-tuned GPT-OSS model from the MXFP4 format.
Available Quantizations
| Format | File | Size | Quality |
|--------|------|------|---------|
| Q8_0 | gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf | 115.7 GB | Near-lossless (8-bit uniform, biases/norms kept at F32) |
| Q5_0 | gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf | ~67 GB | Strong compression (5-bit symmetric, via llama-quantize from Q8_0) |
Weight matrices use Q8_0/Q5_0; biases, layer norms, and attention sinks are stored at full F32 precision for numerical stability.
Conversion Pipeline
The conversion from MLX MXFP4 format to llama.cpp GGUF involved a two-phase pipeline addressing several novel technical challenges:
Phase 1: MXFP4 Dequantization + LoRA Fusion → BF16
The source model is stored in Apple's MLX format using 4-bit MXFP4 quantization (Microscaling FP4, OCP spec) for the expert layers, with 8-bit affine quantization for attention projections. The LoRA adapter weights (576 tensors, float32) modify 288 modules spanning all Q/K/V/O projections, the MoE router, and all expert gate/up/down projections.
Key technical breakthroughs in Phase 1:
- Per-weight quantization dispatch: The model uses heterogeneous quantization parameters — MXFP4 4-bit (experts), affine 8-bit (attention + embeddings), and a special group_size=64 variant for the router. Each weight's dequantization must use the correct
(bits, group_size)pair from the config. - MXFP4 dequantization without explicit mode: MLX's
dequantizefunction, when called withbits=4andgroup_size=32, auto-detects the microscaling format. Explicitmode='mxfp4'is incompatible with the BF16 scale storage convention actually used. - LoRA fusion with shape-dependent matmul: The MLX LoRA adapter stores
lora_aas[input_dims, rank]andlora_bas[rank, output_dims]for 2D weights, but[num_experts, rank, input_dims]and[num_experts, output_dims, rank]for 3D expert weights. The fusion requireslora_b.T @ lora_a.T(2D) vslora_b @ lora_a(3D) — using the wrong formula produces catastrophic dimension mismatches. - BF16 overflow during MXFP4 dequant: The MXFP4 e8m0 scale representation can produce float32 values exceeding the BF16 representable range (~3.4e38), resulting in NaN values in ~5% of dequantized expert weights. These must be sanitized before downstream quantization.
Phase 2: BF16 → Quantized GGUF
Phase 2 reads the intermediate BF16 safetensors and writes the final GGUF in a single streaming pass, keeping RAM below 2 GB for a 218 GB intermediate.
Key technical breakthroughs in Phase 2:
- Missing
attention.key_lengthmetadata: The llama.cpp GPT-OSS handler computesn_embd_head_k = n_embd / n_head = 2880 / 64 = 45by default, which is incorrect (the true head dimension is 64). This causes a fatalGGML_ASSERT(a->ne[0] == b->ne[0])crash during KV cache initialization. The fix is explicitly settinggpt-oss.attention.key_length,value_length,key_length_swa, andvalue_length_swatohead_dim=64in the GGUF metadata. - Tokenizer vocabulary padding: The model's embedding table has 201,088 rows (
vocab_sizein config) but the o200k harmony tokenizer has only 200,019 entries. llama.cpp validatestoken_embd.weightagainst the tokenizer count. The token list must be padded with dummy tokens to reach 201,088. - Attention sinks must be F32: The
ggml_soft_max_add_sinksandggml_flash_attn_ext_add_sinksoperations require the sinks tensor to beGGML_TYPE_F32. Storing it as BF16 causes a Metal backend crash during inference. - All biases must be F32:
ggml_addoperations in the attention and FFN graph require type-matched operands. Since attention outputs are F32, bias tensors stored as BF16 triggerbinary_op: unsupported typeserrors on CPU and Metal. - GGUF tensor naming alignment: The GPT-OSS architecture in llama.cpp (PR #15091,
LLM_ARCH_OPENAI_MOE) maps MLX tensor names via specific conventions:self_attn.o_proj→attn_output(notattn_out),mlp.router→ffn_gate_inp,mlp.experts.gate_proj→ffn_gate_exps,self_attn.sinks→attn_sinks.weight, etc. The architecture identifier isgpt-oss(hyphenated), notgptoss. - Safetensors byte-level data extraction: Reading individual tensors from safetensor files requires accounting for the JSON header size (
f.seek(8 + header_length + offset), notf.seek(8 + offset)). Missing this causes the first tensor in each shard to read partially corrupt data — a silent data integrity bug.
Usage with llama.cpp
# Download (example path)
wget https://huggingface.co/cloudyu/gpt-oss-120b-Fable-5-Distilled/resolve/main/gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf
# Inference
llama-cli -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf \
-p "Explain quantum computing in simple terms" -n 512
# Server (recommended — uses sliding window attention efficiently)
llama-server -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf \
--jinja --reasoning-format auto --port 8080
The model uses GPT-OSS's native reasoning format. Set reasoning_effort via chat template kwargs for API usage. The chat template supports system, developer, user, and assistant roles with channel markers for analysis, commentary, and final output.
Local Agent Best Practice
This model demonstrates extremely strong reasoning capabilities — when given --jinja and --reasoning-format auto with reasoning_effort: "high", it can perform multi-step planning, complex code analysis, and structured problem decomposition at a level comparable to frontier models. However, it has a critical weakness: severe hallucination. The model will confidently fabricate facts, API signatures, file paths, URLs, and library versions.
Golden Rule: Always pair this model with the anysearch skill in llama-agent or opencode, which grounds responses against real web search results. Do not trust any factual claim from this model without verification.
./build/bin/llama-agent \
-m /path/to/gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf \
--jinja \
--reasoning-format auto \
--temp 1.0 \
--chat-template-kwargs '{"reasoning_effort": "high"}'
Key flags explained:
--jinja— Enables Jinja template rendering for GPT-OSS's multi-channel chat format--reasoning-format auto— Uses the model's native reasoning block structure (analysis/commentary/final)--temp 1.0— Maximum temperature; strong reasoning models benefit from full creativity in the analysis phase--chat-template-kwargs '{"reasoning_effort": "high"}'— Instructs the model to allocate more compute to reasoning chains
Known failure modes without search grounding:
- Confidently invents non-existent CLI flags and API parameters
- Hallucinates plausible but wrong library versions and package names
- Fabricates convincing-looking URLs and file paths
- Constructs internally consistent but factually incorrect arguments
---
本地 Agent 最佳实践
本模型拥有极强的推理能力——配合 --jinja、--reasoning-format auto 和 reasoning_effort: "high" 参数,可以进行多步规划、复杂代码分析和结构化问题拆解,水平可比肩前沿模型。但它有一个致命弱点:严重的幻觉问题。模型会自信地捏造事实、API 签名、文件路径、URL 和库版本号。
黄金法则:始终将此模型与 anysearch skill 配合使用,以真实网页搜索结果作为事实锚定。未经查证,不要信任该模型的任何事实性结论。
./build/bin/llama-agent \
-m /path/to/gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf \
--jinja \
--reasoning-format auto \
--temp 1.0 \
--chat-template-kwargs '{"reasoning_effort": "high"}'
关键参数说明:
--jinja— 启用 Jinja 模板渲染,支持 GPT-OSS 多通道 chat 格式--reasoning-format auto— 使用模型原生的推理块结构(analysis/commentary/final三通道)--temp 1.0— 最高温度;强推理模型在分析阶段受益于充分的创造性--chat-template-kwargs '{"reasoning_effort": "high"}'— 指示模型为推理链分配更多计算资源
未配合搜索验证时的已知失败模式:
- 自信地编造不存在的 CLI 参数和 API 接口
- 幻觉出看似合理但实际错误的库版本号和包名
- 构造看似可信的虚假 URL 和文件路径
- 生成内部自洽但与事实不符的论证
---
GPT-OSS 120B Fable-5 Distilled — GGUF(中文说明)
模型概要
训练与转换: cloudyu
算力支持: AutoTrust AI Lab
基于 OpenAI gpt-oss-120b 的蒸馏微调版本。使用 MLX 框架对 MXFP4 量化基座模型训练 LoRA 适配器(rank=16,覆盖所有注意力投影、MoE 路由器及专家 FFN 层)。这是社区首次将 LoRA 微调后的 GPT-OSS 模型从 MLX MXFP4 格式转换为 llama.cpp GGUF 格式。
量化版本
GGUF 是 LLM 推理的通用格式——支持 llama.cpp、LM Studio、Ollama、GPT4All、text-generation-webui、llamafile 等主流运行时,兼容 macOS/Windows/Linux/iOS/Android 全平台,CPU/CUDA/Metal/Vulkan/ROCm 全后端。
| 文件 | 大小 | 说明 |
|------|------|------|
| gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf | 115.7 GB | Q8_0 权重 + F32 biases/norms,接近无损 |
| gpt-oss-120b-Fable-5-Distilled-Q5_0.gguf | ~67 GB | Q5_0 对称量化,高压缩比 |
转换技术突破
阶段一:MXFP4 反量化 + LoRA 融合 → BF16
源模型以 MLX 格式存储,专家层使用 MXFP4(4-bit 显微缩放格式),注意力层使用 8-bit affine 量化。LoRA 适配器包含 576 个张量,修改 288 个模块。
核心技术挑战:
- 异构量化参数调度:不同层使用不同的量化参数——专家层
bits=4, group_size=32(mxfp4),注意力层bits=8, group_size=32(affine),路由器bits=8, group_size=64。反量化必须按张量逐一使用正确的(bits, group_size)组合。
- MXFP4 反量化的 mode 陷阱:
mlx.dequantize在bits=4时自动检测 MXFP4 格式,显式传mode='mxfp4'反而报错——因为它期望 uint8 格式的 scale,而实际存储的是 BF16。
- LoRA 融合的形状相关矩阵乘法:2D 权重(注意力、路由器)与 3D 权重(专家)的 LoRA 融合公式不同——2D 用
lora_b.T @ lora_a.T,3D 用lora_b @ lora_a。混用会导致维度崩溃。
- BF16 溢出:MXFP4 的 e8m0 scale 编码可能产生超出 BF16 范围(±3.4e38)的 float32 值,导致约 5% 的专家权重变成 NaN。必须在量化前用
nan_to_num清理。
阶段二:BF16 → 量化 GGUF
以流式方式读取 BF16 中间文件,逐张量量化写入 GGUF,内存占用 < 2 GB。
核心技术挑战:
attention.key_length元数据缺失:llama.cpp 的 GPT-OSS 处理程序默认计算n_embd_head_k = hidden_size / num_heads = 2880 / 64 = 45,但实际 head_dim 是 64。导致 KV cache 初始化时ggml_set_rowsassertion 崩溃(缓存 ne[0]=360 ≠ K张量 ne[0]=512)。修复方法是在 GGUF 元数据中显式设置key_length/value_length/key_length_swa/value_length_swa = 64。
- 词表补齐:模型 embedding 有 201,088 行,但 o200k harmony tokenizer 只有 200,019 个 token。llama.cpp 会校验
token_embd.weight的形状与 tokenizer 数量必须一致。需用虚拟 token 补齐。
- Attention Sinks 必须 F32:
ggml_soft_max_add_sinks操作要求 sinks 张量为 F32 类型,存为 BF16 会导致 Metal 后端推理崩溃。
- 所有 biases 必须 F32:注意力输出是 F32,
ggml_add要求操作数类型匹配。BF16 存储的 bias 会在 CPU/Metal 后端触发binary_op: unsupported types错误。
- GGUF 张量命名对齐:llama.cpp GPT-OSS 架构(PR #15091)的张量命名约定:
attn_output(非attn_out)、ffn_gate_inp、ffn_gate_exps、attn_sinks.weight等。架构标识符为gpt-oss(带连字符)。
- Safetensors 字节级偏移:读取单个张量时必须计算 JSON header 的实际长度:
f.seek(8 + header_length + offset),而非简单的f.seek(8 + offset)。每个 shard 的第一个张量会因忽略 header 长度而读到前一个 shard 的尾部垃圾数据。
使用方式
llama-cli -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf -p "你好" -n 512
# API 服务(推荐)
llama-server -m gpt-oss-120b-Fable-5-Distilled-Q8_0.gguf \
--jinja --reasoning-format auto --port 8080
模型使用 GPT-OSS 原生的 reasoning format,支持 reasoning_effort 参数控制推理深度,chat template 包含 analysis、commentary、final 三个输出通道。
Run autotrust/gpt-oss-120b-Fable-5-Distilled-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models