majentik/gpt-oss-120b-rotorquant-gguf-q5_k_m (Q5_K_M GGUF) is indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. This page helps you pick a local model for guIDE or other runtimes. See related models in the same shard below.
Model Intelligence Sheet
majentik/gpt-oss-120b-rotorquant-gguf-q5_k_m overview
GGUF Q5_K_M weight-quantized variant of openai/gpt-oss-120b optimised for use with RotorQuant KV cache compression via a dedicated llama.cpp fork. Important: RotorQuant KV cache types (planar3, iso3) are not available in upstream llama.cpp, standard Ollama, or LM Studio. They require a specific llama.cpp fork. The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).
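To give a rough sense of what the standard KV cache types named above cost per cached element, here is a minimal sketch. It assumes the usual ggml block layout for the 8-bit and 4-bit types (32 values sharing one fp16 scale); that layout is not stated on this page, and the RotorQuant types are left out because their storage layout is not described here.

```python
# Approximate storage cost per cached element for standard KV cache types.
# Assumption (not from this page): the 8-bit and 4-bit types store blocks of
# 32 values plus one fp16 scale per block, as in ggml's q8_0 / q4_0 layouts.
BLOCK_SIZE = 32   # elements per quantization block
SCALE_BYTES = 2   # one fp16 scale per block


def bits_per_element(payload_bits: int) -> float:
    """Bits per element, including the amortized per-block scale."""
    block_bytes = BLOCK_SIZE * payload_bits / 8 + SCALE_BYTES
    return block_bytes * 8 / BLOCK_SIZE


CACHE_BITS = {
    "f16": 16.0,
    "q8_0": bits_per_element(8),
    "q4_0": bits_per_element(4),
}


def compression_vs_f16(cache_type: str) -> float:
    """How many times smaller the cache is compared with f16."""
    return CACHE_BITS["f16"] / CACHE_BITS[cache_type]


for name, bits in CACHE_BITS.items():
    print(f"{name}: {bits:.1f} bits/element, {compression_vs_f16(name):.2f}x vs f16")
```

Under these assumptions the 8-bit type roughly halves KV cache memory, and the 4-bit type lands a little under a 4x reduction because of the per-block scale.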
Downloads
86
Likes
0
Pipeline
text-generation
Library
gguf
Visibility
Public
Access
Open
Repository Files & Downloads
1 file detected
Direct downloads for all repository files
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| gpt-oss-120b-RotorQuant-Q5_K_M.gguf | GGUF | Q5_K_M | 1.01 GB | Download |
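The target of the "Download" link is not preserved on this page. As a hedged sketch, the helper below builds a direct-download URL from the repository id in the source payload and the filename in the table, using the common Hugging Face `resolve` URL pattern. The `main` revision and the helper name `resolve_url` are assumptions, and the sketch does not verify that the file exists at that path.

```python
from urllib.parse import quote

HF_BASE = "https://huggingface.co"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hugging Face repository.

    Hypothetical helper: `revision="main"` is an assumed default branch.
    """
    # quote() keeps "/" by default, so "owner/name" repo ids and nested paths survive.
    return (
        f"{HF_BASE}/{quote(repo_id)}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


url = resolve_url(
    "majentik/gpt-oss-120b-RotorQuant-GGUF-Q5_K_M",   # id from the source payload
    "gpt-oss-120b-RotorQuant-Q5_K_M.gguf",            # filename from the table above
)
print(url)
```

The printed URL can be passed to any downloader that supports resuming, which matters for multi-gigabyte GGUF files.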
Model Details Live
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"license": "apache-2.0",
"base_model": "openai/gpt-oss-120b",
"tags": [
"gguf",
"rotorquant",
"kv-cache-quantization",
"gpt-oss",
"openai",
"moe",
"llama-cpp",
"quantized"
],
"library_name": "gguf",
"pipeline_tag": "text-generation",
"language": [
"en"
],
"frontmatter": {
"license": "apache-2.0",
"base_model": "openai/gpt-oss-120b",
"tags": [
"gguf",
"rotorquant",
"kv-cache-quantization",
"gpt-oss",
"openai",
"moe",
"llama-cpp",
"quantized"
],
"library_name": "gguf",
"pipeline_tag": "text-generation",
"language": [
"en"
]
},
"hero_image_url": "",
"summary": "GGUF Q5_K_M weight-quantized variant of openai/gpt-oss-120b optimised for use with **RotorQuant** KV cache compression via a dedicated llama.cpp fork. > **Important:** RotorQuant KV cache types (planar3, iso3) are **not** available in upstream llama.cpp, standard Ollama, or LM Studio. > They require a specific llama.cpp fork. > The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\nlicense: apache-2.0\nbase_model: openai/gpt-oss-120b\ntags:\n- gguf\n- rotorquant\n- kv-cache-quantization\n- gpt-oss\n- openai\n- moe\n- llama-cpp\n- quantized\nlibrary_name: gguf\npipeline_tag: text-generation\nlanguage:\n- en\n---\n\n# gpt-oss-120b-RotorQuant-GGUF-Q5_K_M\n\nGGUF Q5_K_M weight-quantized variant of [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) optimised for use with **RotorQuant** KV cache compression via a dedicated llama.cpp fork.\n\n> **Important:** RotorQuant KV cache types (`planar3`, `iso3`) are **not** available in upstream llama.cpp, standard Ollama, or LM Studio.\n> They require a [specific llama.cpp fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache).\n> The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).\n\n## Overview\n\nThis model combines two independent compression techniques:\n\n| Technique | What it does | Requirement |\n|-----------|-------------|-------------|\n| **GGUF Q5_K_M weight quantization** | Reduces model size from ~240 GB (BF16) to ~81.6 GB | Any llama.cpp-compatible runtime |\n| **RotorQuant KV cache compression** — block-diagonal Clifford-algebra rotors for 3-bit KV cache (`--cache-type-k iso3 --cache-type-v iso3`) | Block-diagonal rotations / random rotation for compressed KV cache | [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) only |\n\n## Quickstart\n\n### Option A — With RotorQuant KV cache (fork required)\n\nYou must build from the RotorQuant-enabled llama.cpp fork:\n\n```bash\n# Clone and build the fork\ngit clone https://github.com/johndpope/llama-cpp-turboquant.git\ncd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache\n\n# CUDA (Windows/Linux)\ncmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j\n\n# Metal (Apple 
Silicon)\ncmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j\n\n# Run with RotorQuant KV cache\n./build/bin/llama-cli -m gpt-oss-120b-RotorQuant-GGUF-Q5_K_M.gguf \\\n --cache-type-k iso3 --cache-type-v iso3 \\\n -ngl 99 -fa \\\n -p \"Explain quantum computing\"\n\n# Or run as a server\n./build/bin/llama-server -m gpt-oss-120b-RotorQuant-GGUF-Q5_K_M.gguf \\\n --cache-type-k iso3 --cache-type-v iso3 \\\n -ngl 99 -fa --jinja\n```\n\n### Option B — With standard llama.cpp / LM Studio / Ollama\n\nThe GGUF works as a normal quantised model. You won't get RotorQuant-specific KV cache benefits, but standard KV cache quantization (q8_0, q4_0) still reduces VRAM significantly.\n\n**llama.cpp (upstream)**\n```bash\nllama-cli -m gpt-oss-120b-RotorQuant-GGUF-Q5_K_M.gguf \\\n --cache-type-k q8_0 --cache-type-v q8_0 \\\n -ngl 99 -fa \\\n -p \"Explain quantum computing\"\n```\n\n**LM Studio**\n1. Download the GGUF file and load in LM Studio.\n2. Enable **Developer Mode** (Settings → Developer).\n3. In the model loader's advanced settings, set **Flash Attention** to ON.\n4. Set **K Cache Quantization** and **V Cache Quantization** to `q8_0` (or `q4_0` for more aggressive VRAM savings).\n5. Note: LM Studio does not currently support RotorQuant's `iso3` cache types. 
Track [this feature request](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) for updates.\n\n**Ollama**\n```bash\n# Standard Ollama does not support RotorQuant cache types.\n# Use with default or q8_0 KV cache via OLLAMA_KV_CACHE_TYPE=q8_0\nOLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama run majentik/gpt-oss-120b-RotorQuant-GGUF-Q5_K_M\n```\n\n## Specifications\n\n| Property | Value |\n|----------|-------|\n| Base Model | [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) |\n| Architecture | Sparse MoE |\n| Parameters | 120B total (MoE) |\n| Context Length | 128K |\n| Weight Quantization | GGUF Q5_K_M (high quality, balanced 5-bit) |\n| Original Size (BF16) | ~240 GB |\n| Quantized File Size | ~81.6 GB |\n| KV Cache (RotorQuant) | 3-bit via `--cache-type-k iso3 --cache-type-v iso3` (fork only) |\n| KV Cache (standard) | q8_0, q4_0, f16, etc. (any llama.cpp runtime) |\n| License | apache-2.0 |\n| Modalities | Text only |\n| Compatible Runtimes | llama.cpp, LM Studio, Ollama, koboldcpp |\n\n## What is RotorQuant?\n\n[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors. 
It was developed as a faster, more parameter-efficient alternative to Google's [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026).\n\nInstead of applying a dense d×d random orthogonal rotation matrix (as TurboQuant does), RotorQuant uses lightweight block-diagonal rotations — independent 2D/4D rotations per pair/quartet — achieving O(d) complexity instead of O(d log d), fully parallelisable with no inter-element dependencies.\n\n**Benchmarks from the RotorQuant repository** (Llama 3.1 8B, RTX 5090 — results will vary by model and hardware):\n\n| Metric | RotorQuant (iso3) | TurboQuant | Standard q4_0 |\n|--------|-------------------|------------|---------------|\n| Prefill Speed | 3,822 tok/s | 722 tok/s | — |\n| Decode Speed | 119 tok/s | 93 tok/s | — |\n| Perplexity (PPL) | 6.91 | 7.07 | — |\n| KV Compression | ~5× vs FP16 | ~5× vs FP16 | ~4× vs FP16 |\n| Rotation Parameters | 4 per rotor | 16,384 per matrix | N/A |\n\n> **Note:** These benchmarks are from the RotorQuant repository using Llama 3.1 8B on an RTX 5090. Performance on gpt-oss-120b will differ. Independent benchmarks for this specific model are welcome — please open a discussion if you have results to share.\n\n## Current Status of RotorQuant in the Ecosystem\n\n| Runtime | RotorQuant Support | Standard KV Quant |\n|---------|---------------------|-------------------|\n| llama.cpp (upstream) | ❌ Not merged | ✅ q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 |\n| llama-cpp-turboquant fork | ✅ planar3, iso3 | ✅ All standard types |\n| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | ✅ Via advanced settings |\n| Ollama | ❌ Not supported | ✅ Via OLLAMA_KV_CACHE_TYPE |\n| koboldcpp | ❌ Not supported | ✅ Standard types |\n\n## Recommended Settings\n\nFor VRAM-constrained setups, standard q8_0 KV cache quantization already halves KV cache memory with negligible quality impact. 
Flash Attention should always be enabled — it is required for V cache quantization and improves memory efficiency regardless.\n\n| VRAM | Suggested Configuration |\n|------|------------------------|\n| 24 GB (RTX 4090) | Q5_K_M + q8_0 KV cache + Flash Attention, 8K–16K context |\n| 16 GB | Q5_K_M + q4_0 KV cache + Flash Attention, 4K–8K context |\n| 48+ GB | Q5_K_M + f16 KV cache, full 32K+ context |\n\n## See Also\n\n- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)\n- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)\n- [TurboQuant llama.cpp discussion](https://github.com/ggml-org/llama.cpp/discussions/20969)\n- [TurboQuant paper (arXiv: 2504.19874)](https://arxiv.org/abs/2504.19874)\n- [Base model: openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)\n- [gpt-oss-120b announcement](https://openai.com/blog/gpt-oss)\n",
"related_quantizations": []
},
"tags": [
"gguf",
"rotorquant",
"kv-cache-quantization",
"gpt-oss",
"openai",
"moe",
"llama-cpp",
"quantized",
"text-generation",
"en",
"arxiv:2504.19874",
"base_model:openai/gpt-oss-120b",
"base_model:quantized:openai/gpt-oss-120b",
"license:apache-2.0",
"region:us"
],
"likes": 0,
"downloads": 86,
"gated": false,
"private": false,
"last_modified": "2026-04-17T01:24:47.000Z",
"created_at": "2026-04-14T15:07:30.000Z",
"pipeline_tag": "text-generation",
"library_name": "gguf"
}
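The model card embedded in the metadata above quotes approximate sizes of ~240 GB for BF16 and ~81.6 GB after Q5_K_M quantization. A small worked calculation, using only those two figures and the fact that BF16 uses 16 bits per weight, gives the effective bits per weight. The function name is illustrative, and the result is only as accurate as the rounded sizes it starts from.

```python
BF16_BITS = 16        # bits per weight in the unquantized checkpoint
ORIGINAL_GB = 240.0   # "~240 GB (BF16)" as quoted in the model card
QUANTIZED_GB = 81.6   # "~81.6 GB" as quoted in the model card


def effective_bits_per_weight(original_gb: float, quantized_gb: float,
                              original_bits: int = BF16_BITS) -> float:
    """Scale the original bit width by the size ratio of the two files."""
    return original_bits * quantized_gb / original_gb


bpw = effective_bits_per_weight(ORIGINAL_GB, QUANTIZED_GB)
ratio = ORIGINAL_GB / QUANTIZED_GB
print(f"{bpw:.2f} bits per weight, {ratio:.2f}x smaller than BF16")
```

The figure comes out above 5 bits per weight, which is consistent with a mixed 5-bit scheme that also stores per-block scales and keeps some tensors at higher precision.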
Source payload excerpt (from Hugging Face API)
{
"_id": "69de5832b2548ab9ddc65449",
"id": "majentik/gpt-oss-120b-RotorQuant-GGUF-Q5_K_M",
"modelId": "majentik/gpt-oss-120b-RotorQuant-GGUF-Q5_K_M",
"sha": "08300df50aee8715b7451b92bce84832abe0a66a",
"createdAt": "2026-04-14T15:07:30.000Z",
"lastModified": "2026-04-17T01:24:47.000Z",
"author": "majentik",
"downloads": 86,
"likes": 0,
"gated": false,
"private": false,
"pipeline_tag": "text-generation",
"library_name": "gguf",
"siblings_count": 3
}
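As an illustration of how the payload fields above might be consumed, the sketch below parses a trimmed copy of the excerpt and computes the time between repository creation and the last modification. The trimmed payload and the helper name `parse_ts` are for illustration only; the field names and values are copied from the excerpt.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the payload excerpt above (only the fields used here).
payload = json.loads("""
{
  "id": "majentik/gpt-oss-120b-RotorQuant-GGUF-Q5_K_M",
  "createdAt": "2026-04-14T15:07:30.000Z",
  "lastModified": "2026-04-17T01:24:47.000Z",
  "downloads": 86,
  "gated": false,
  "private": false
}
""")


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp with milliseconds and a trailing Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


delta = parse_ts(payload["lastModified"]) - parse_ts(payload["createdAt"])
is_open = not payload["gated"] and not payload["private"]

print(f"{payload['id']}: last modified {delta} after creation, open access: {is_open}")
```

Using `strptime` with an explicit format avoids relying on `datetime.fromisoformat` accepting the trailing `Z`, which depends on the Python version.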