Model Intelligence Sheet

ubergarm/qwen3.5-122b-a10b-gguf overview

Comprehensive model page for ubergarm/qwen3.5-122b-a10b-gguf

gguf, imatrix, conversational, qwen3_5_moe, ik_llama.cpp, text-generation, base_model:Qwen/Qwen3.5-122B-A10B, base_model:quantized:Qwen/Qwen3.5-122B-A10B, license:apache-2.0, endpoints_compatible, region:us

Downloads

5,617

Likes

18

Pipeline

text-generation

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

5 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Qwen3.5-122B-A10B-IQ1_KT.gguf	GGUF	IQ1_KT	30.23 GB	Download
Qwen3.5-122B-A10B-IQ2_KL.gguf	GGUF	IQ2_KL	43.33 GB	Download
Qwen3.5-122B-A10B-IQ4_KSS.gguf	GGUF	IQ4_KSS	61.23 GB	Download
Qwen3.5-122B-A10B-smol-IQ2_KS.gguf	GGUF	IQ2_KS	35.33 GB	Download
Qwen3.5-122B-A10B-smol-IQ5_KS.gguf	GGUF	IQ5_KS	77.35 GB	Download
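
The README stored in the metadata on this page quotes a size in GiB, a bits-per-weight (BPW) figure, and a perplexity for each quant. The sketch below is only an illustration of how those numbers relate: it derives the parameter count implied by size and BPW, the perplexity increase over the quoted BF16 baseline, and the lowest-perplexity quant that fits a memory budget. The `Quant` class, the function names, and the `headroom_gib` parameter are hypothetical. The headroom is a placeholder for KV cache and compute buffers, which this page does not quantify.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 2**30          # bytes per GiB
BF16_PPL = 4.8159    # BF16 baseline perplexity quoted in the README


@dataclass(frozen=True)
class Quant:
    name: str
    size_gib: float  # size quoted in the README heading
    bpw: float       # bits per weight quoted in the README heading
    ppl: float       # perplexity on wiki.test.raw, n_ctx=512


# Figures copied from the README headings; nothing here is measured anew.
QUANTS = [
    Quant("IQ5_KS", 77.341, 5.441, 4.8264),
    Quant("IQ4_KSS", 61.219, 4.306, 4.8741),
    Quant("IQ2_KL", 43.319, 3.047, 5.1012),
    Quant("smol-IQ2_KS", 35.319, 2.485, 5.4614),
    Quant("IQ1_KT", 30.217, 2.126, 5.7763),
]


def implied_params(q: Quant) -> float:
    """Parameter count implied by size and BPW: total bits / bits per weight."""
    return q.size_gib * GIB * 8 / q.bpw


def ppl_increase_pct(q: Quant) -> float:
    """Perplexity increase over the BF16 baseline, in percent."""
    return (q.ppl / BF16_PPL - 1.0) * 100.0


def best_fit(budget_gib: float, headroom_gib: float = 0.0) -> Optional[Quant]:
    """Lowest-perplexity quant whose file size plus headroom fits the budget."""
    fitting = [q for q in QUANTS if q.size_gib + headroom_gib <= budget_gib]
    return min(fitting, key=lambda q: q.ppl) if fitting else None


if __name__ == "__main__":
    for q in QUANTS:
        print(f"{q.name:12s} {implied_params(q) / 1e9:6.1f}B params  "
              f"+{ppl_increase_pct(q):5.2f}% PPL vs BF16")
    pick = best_fit(budget_gib=48.0, headroom_gib=4.0)
    print("pick for 48 GiB with 4 GiB headroom:", pick.name if pick else None)
```

Every quant implies roughly the same parameter count, which is a quick consistency check on the quoted size and BPW pairs. Real memory needs depend on context length and runtime flags, so treat the budget check as a first filter rather than a guarantee.
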

Model Details

Model Slug

ubergarm/qwen3.5-122b-a10b-gguf

Author

ubergarm

Pipeline Task

text-generation

Library

—

Created

2026-02-24

Last Modified

2026-03-20

Gated

No

Private

No

HF SHA

4cb49cf72d5647605b5510b0745bab8a6e10124e

License

apache-2.0

Language

Unknown

Base Model

Qwen/Qwen3.5-122B-A10B

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "quantized_by": "ubergarm",
    "pipeline_tag": "text-generation",
    "base_model": "Qwen/Qwen3.5-122B-A10B",
    "base_model_relation": "quantized",
    "license": "apache-2.0",
    "license_link": "https://huggingface.co/Qwen/Qwen3.5-122B-A10B/blob/main/LICENSE",
    "tags": [
      "imatrix",
      "conversational",
      "qwen3_5_moe",
      "ik_llama.cpp"
    ],
    "frontmatter": {
      "quantized_by": "ubergarm",
      "pipeline_tag": "text-generation",
      "base_model": "Qwen/Qwen3.5-122B-A10B",
      "base_model_relation": "quantized",
      "license": "apache-2.0",
      "license_link": "https://huggingface.co/Qwen/Qwen3.5-122B-A10B/blob/main/LICENSE",
      "tags": [
        "imatrix",
        "conversational",
        "qwen3_5_moe",
        "ik_llama.cpp"
      ]
    },
    "hero_image_url": "images/perplexity.png \"Chart showing Perplexity vs Model Size.\"",
    "summary": "",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nquantized_by: ubergarm\npipeline_tag: text-generation\nbase_model: Qwen/Qwen3.5-122B-A10B\nbase_model_relation: quantized\nlicense: apache-2.0\nlicense_link: https://huggingface.co/Qwen/Qwen3.5-122B-A10B/blob/main/LICENSE\ntags:\n- imatrix\n- conversational\n- qwen3_5_moe\n- ik_llama.cpp\n---\n\n## `ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3.5-122B-A10B\nThe quants in this collection **REQUIRE** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!\n\n*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants. Only a couple of quants in this collection are compatible with mainline llama.cpp/LMStudio/KoboldCPP/etc as mentioned in the specific description; all others require ik_llama.cpp.\n\nSome of ik's new quants are supported with the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP with Windows builds. Also check for [ik_llama.cpp windows builds by Thireus here.](https://github.com/Thireus/ik_llama.cpp/releases).\n\nThese quants provide best-in-class perplexity for the given memory footprint.\n\n## Big Thanks\nShout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)!  
**BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!\n\nAlso thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models! Thanks to Hugging Face for hosting all these big quants!\n\nFinally, I *really* appreciate the support from [aifoundry.org](https://aifoundry.org) so check out their open source RISC-V based solutions!\n\n## Quant Collection\nPerplexity computed against *wiki.test.raw* (lower is \"better\").\n\n![Perplexity Chart](images/perplexity.png \"Chart showing Perplexity vs Model Size.\")\n\nThese two are just test quants for baseline perplexity comparison and are not available for download here:\n* `BF16` 227.525 GiB (16.005 BPW)\n  - PPL over 580 chunks for n_ctx=512 = 4.8159 +/- 0.02839\n* `Q8_0` 120.942 GiB (8.508 BPW)\n  - PPL over 580 chunks for n_ctx=512 = 4.8196 +/- 0.02841\n\n*NOTE*: The first split file is much smaller on purpose, as it only contains metadata; it's fine!\n\n## IQ5_KS 77.341 GiB (5.441 BPW)\nFinal estimate: PPL over 580 chunks for n_ctx=512 = 4.8264 +/- 0.02846\n\nThis is the best quality version for full offload on 96GB VRAM. 
This is it.\n\n<details>\n\n<summary>👈 Secret Recipe</summary>\n\n```bash\n#!/usr/bin/env bash\n\ncustom=\"\n# 60 Repeating Layers [0-59]\n\n## Gated Attention/Delta Net [Blended 0-59]\nblk\\..*\\.attn_gate\\.weight=q8_0\nblk\\..*\\.attn_qkv\\.weight=q8_0\nblk\\..*\\.attn_output\\.weight=q8_0\nblk\\..*\\.attn_q\\.weight=q8_0\nblk\\..*\\.attn_k\\.weight=q8_0\nblk\\..*\\.attn_v\\.weight=q8_0\nblk\\..*\\.ssm_alpha\\.weight=f32\nblk\\..*\\.ssm_beta\\.weight=f32\nblk\\..*\\.ssm_out\\.weight=q8_0\n\n# Shared Expert Layers [0-59]\nblk\\..*\\.ffn_down_shexp\\.weight=q8_0\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=q8_0\n\n# Routed Experts Layers [0-59]\nblk\\..*\\.ffn_down_exps\\.weight=iq5_ks\nblk\\..*\\.ffn_(gate|up)_exps\\.weight=iq5_ks\n\n# Non-Repeating Layers\ntoken_embd\\.weight=q8_0\noutput\\.weight=q8_0\n\"\n\ncustom=$(\n  echo \"$custom\" | grep -v '^#' | \\\n  sed -Ez 's:\\n+:,:g;s:,$::;s:^,::'\n)\n\n    #--dry-run \\\nnumactl -N ${SOCKET} -m ${SOCKET} \\\n./build/bin/llama-quantize \\\n    --custom-q \"$custom\" \\\n    --imatrix /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/imatrix-Qwen3.5-122B-A10B-BF16.dat \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-BF16-00001-of-00005.gguf \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-smol-IQ5_KS.gguf \\\n    IQ5_KS \\\n    128\n```\n\n</details>\n\n## IQ4_KSS 61.219 GiB (4.306 BPW)\nPPL over 580 chunks for n_ctx=512 = 4.8741 +/- 0.02879\n\n<details>\n\n<summary>👈 Secret Recipe</summary>\n\n```bash\n#!/usr/bin/env bash\n\ncustom=\"\n# 60 Repeating Layers [0-59]\n\n## Gated Attention/Delta Net [Blended 0-59]\nblk\\..*\\.attn_gate\\.weight=q8_0\nblk\\..*\\.attn_qkv\\.weight=q8_0\nblk\\..*\\.attn_output\\.weight=q8_0\nblk\\..*\\.attn_q\\.weight=q8_0\nblk\\..*\\.attn_k\\.weight=q8_0\nblk\\..*\\.attn_v\\.weight=q8_0\nblk\\..*\\.ssm_alpha\\.weight=q8_0\nblk\\..*\\.ssm_beta\\.weight=q8_0\nblk\\..*\\.ssm_out\\.weight=q8_0\n\n# Shared Expert Layers 
[0-59]\nblk\\..*\\.ffn_down_shexp\\.weight=q8_0\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=q8_0\n\n# Routed Experts Layers [0-59]\nblk\\..*\\.ffn_down_exps\\.weight=iq4_ks\nblk\\..*\\.ffn_(gate|up)_exps\\.weight=iq4_kss\n\n# Non-Repeating Layers\ntoken_embd\\.weight=iq6_k\noutput\\.weight=iq6_k\n\"\n\ncustom=$(\n  echo \"$custom\" | grep -v '^#' | \\\n  sed -Ez 's:\\n+:,:g;s:,$::;s:^,::'\n)\n\nnumactl -N ${SOCKET} -m ${SOCKET} \\\n./build/bin/llama-quantize \\\n    --custom-q \"$custom\" \\\n    --imatrix /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/imatrix-Qwen3.5-122B-A10B-BF16.dat \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-BF16-00001-of-00005.gguf \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf \\\n    IQ4_KSS \\\n    128\n```\n\n</details>\n\n## IQ2_KL 43.319 GiB (3.047 BPW)\nPPL over 580 chunks for n_ctx=512 = 5.1012 +/- 0.03038\n\n<details>\n\n<summary>👈 Secret Recipe</summary>\n\n```bash\n#!/usr/bin/env bash\n\ncustom=\"\n# 60 Repeating Layers [0-59]\n\n## Gated Attention/Delta Net [Blended 0-59]\nblk\\..*\\.attn_gate\\.weight=iq6_k\nblk\\..*\\.attn_qkv\\.weight=iq6_k\nblk\\..*\\.attn_output\\.weight=iq6_k\nblk\\..*\\.attn_q\\.weight=iq6_k\nblk\\..*\\.attn_k\\.weight=iq6_k\nblk\\..*\\.attn_v\\.weight=iq6_k\nblk\\..*\\.ssm_alpha\\.weight=iq6_k\nblk\\..*\\.ssm_beta\\.weight=iq6_k\nblk\\..*\\.ssm_out\\.weight=iq6_k\n\n# Shared Expert Layers [0-59]\nblk\\..*\\.ffn_down_shexp\\.weight=iq6_k\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=iq6_k\n\n# Routed Experts Layers [0-59]\nblk\\..*\\.ffn_down_exps\\.weight=iq3_ks\nblk\\..*\\.ffn_(gate|up)_exps\\.weight=iq2_kl\n\n# Non-Repeating Layers\ntoken_embd\\.weight=iq4_k\noutput\\.weight=iq6_k\n\"\n\ncustom=$(\n  echo \"$custom\" | grep -v '^#' | \\\n  sed -Ez 's:\\n+:,:g;s:,$::;s:^,::'\n)\n\n    #--dry-run \\\nnumactl -N ${SOCKET} -m ${SOCKET} \\\n./build/bin/llama-quantize \\\n    --custom-q \"$custom\" \\\n    --imatrix 
/mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/imatrix-Qwen3.5-122B-A10B-BF16.dat \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-BF16-00001-of-00005.gguf \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ2_KL.gguf \\\n    IQ2_KL \\\n    128\n```\n\n</details>\n\n## smol-IQ2_KS 35.319 GiB (2.485 BPW)\nPPL over 580 chunks for n_ctx=512 = 5.4614 +/- 0.03292\n\n<details>\n\n<summary>👈 Secret Recipe</summary>\n\n```bash\n#!/usr/bin/env bash\n\ncustom=\"\n# 60 Repeating Layers [0-59]\n\n## Gated Attention/Delta Net [Blended 0-59]\nblk\\..*\\.attn_gate\\.weight=q8_0\nblk\\..*\\.attn_qkv\\.weight=q8_0\nblk\\..*\\.attn_output\\.weight=q8_0\nblk\\..*\\.attn_q\\.weight=q8_0\nblk\\..*\\.attn_k\\.weight=q8_0\nblk\\..*\\.attn_v\\.weight=q8_0\nblk\\..*\\.ssm_alpha\\.weight=q8_0\nblk\\..*\\.ssm_beta\\.weight=q8_0\nblk\\..*\\.ssm_out\\.weight=q8_0\n\n# Shared Expert Layers [0-59]\nblk\\..*\\.ffn_down_shexp\\.weight=q8_0\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=q8_0\n\n# Routed Experts Layers [0-59]\nblk\\..*\\.ffn_down_exps\\.weight=iq2_ks\nblk\\..*\\.ffn_(gate|up)_exps\\.weight=iq2_ks\n\n# Non-Repeating Layers\ntoken_embd\\.weight=iq4_k\noutput\\.weight=iq6_k\n\"\n\ncustom=$(\n  echo \"$custom\" | grep -v '^#' | \\\n  sed -Ez 's:\\n+:,:g;s:,$::;s:^,::'\n)\n\nnumactl -N ${SOCKET} -m ${SOCKET} \\\n./build/bin/llama-quantize \\\n    --custom-q \"$custom\" \\\n    --imatrix /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/imatrix-Qwen3.5-122B-A10B-BF16.dat \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-BF16-00001-of-00005.gguf \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-smol-IQ2_KS.gguf \\\n    IQ2_KS \\\n    128\n```\n\n</details>\n\n## IQ1_KT 30.217 GiB (2.126 BPW)\nPPL over 580 chunks for n_ctx=512 = 5.7763 +/- 0.03535\n\n<details>\n\n<summary>👈 Secret Recipe</summary>\n\n```bash\n#!/usr/bin/env bash\n\ncustom=\"\n# 60 Repeating Layers [0-59]\n\n## Gated 
Attention/Delta Net [Blended 0-59]\nblk\\..*\\.attn_gate\\.weight=iq6_k\nblk\\..*\\.attn_qkv\\.weight=iq6_k\nblk\\..*\\.attn_output\\.weight=iq6_k\nblk\\..*\\.attn_q\\.weight=iq6_k\nblk\\..*\\.attn_k\\.weight=iq6_k\nblk\\..*\\.attn_v\\.weight=iq6_k\nblk\\..*\\.ssm_alpha\\.weight=iq6_k\nblk\\..*\\.ssm_beta\\.weight=iq6_k\nblk\\..*\\.ssm_out\\.weight=iq6_k\n\n# Shared Expert Layers [0-59]\nblk\\..*\\.ffn_down_shexp\\.weight=iq6_k\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=iq6_k\n\n# Routed Experts Layers [0-59]\nblk\\..*\\.ffn_down_exps\\.weight=iq2_kt\nblk\\..*\\.ffn_(gate|up)_exps\\.weight=iq1_kt\n\n# Non-Repeating Layers\ntoken_embd\\.weight=iq4_k\noutput\\.weight=iq6_k\n\"\n\ncustom=$(\n  echo \"$custom\" | grep -v '^#' | \\\n  sed -Ez 's:\\n+:,:g;s:,$::;s:^,::'\n)\n\n    #--dry-run \\\nnumactl -N ${SOCKET} -m ${SOCKET} \\\n./build/bin/llama-quantize \\\n    --custom-q \"$custom\" \\\n    --imatrix /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/imatrix-Qwen3.5-122B-A10B-BF16.dat \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-BF16-00001-of-00005.gguf \\\n    /mnt/data/models/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ1_KT.gguf \\\n    IQ1_KT \\\n    128\n```\n\n</details>\n\n## Quick Start\n\n```bash\n# Clone and checkout\n$ git clone https://github.com/ikawrakow/ik_llama.cpp\n$ cd ik_llama.cpp\n\n# Build for hybrid CPU+CUDA\n$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON\n$ cmake --build build --config Release -j $(nproc)\n\n# Download Desired Quants\n$ pip install huggingface_hub\n$ hf download --local-dir ./ --include=smol-IQ2_KS/*.gguf ubergarm/Qwen3.5-122B-A10B-GGUF\n\n# Full GPU Offload\n./build/bin/llama-server \\\n  --model \"$model\" \\\n  --alias Qwen3.5-122B-A10B \\\n  -c 262144 \\\n  -fa on \\\n  -ger \\\n  --merge-qkv \\\n  -sm graph \\\n  -ngl 99 \\\n  -ub 4096 -b 4096 \\\n  --parallel 1 \\\n  --threads 1 \\\n  --host 127.0.0.1 \\\n  --port 8080 \\\n  --jinja \\\n  --no-mmap\n\n# Hybrid CPU+GPU 
Offload\necho TODO or see other recent modelcards for examples running Qwen3.5\n\n# CPU-Only Inference\nnumactl -N \"$SOCKET\" -m \"$SOCKET\" \\\n./build/bin/llama-server \\\n    --model \"$model\"\\\n    --alias ubergarm/Qwen3.5-122B-A10B \\\n    --ctx-size 65536 \\\n    -ctk q8_0 -ctv q8_0 \\\n    --parallel 1 \\\n    --threads 96 \\\n    --threads-batch 128 \\\n    --numa numactl \\\n    --host 127.0.0.1 \\\n    --port 8080 \\\n    --no-mmap \\\n    --jinja\n```\n\nIf you're using chat completions endpoint, you can disable thinking with `--chat-template-kwargs '{\"enable_thinking\": false }'`.\n\n## References\n* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)\n* [ubergarm on quantizing LLMs and tuning GPUs with aifoundry.org](https://blog.aifoundry.org/p/adventures-in-model-quantization)\n* [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584)\n* [Getting Started Guide (out of date)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)\n* [Quant Cookers Guide (out of date)](https://github.com/ikawrakow/ik_llama.cpp/discussions/434)\n* [high quality imatrix MoE optimized mainline llama.cpp quants AesSedai/Qwen3.5-122B-A10B-GGUF](https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF)\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "imatrix",
    "conversational",
    "qwen3_5_moe",
    "ik_llama.cpp",
    "text-generation",
    "base_model:Qwen/Qwen3.5-122B-A10B",
    "base_model:quantized:Qwen/Qwen3.5-122B-A10B",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us"
  ],
  "likes": 18,
  "downloads": 5617,
  "gated": false,
  "private": false,
  "last_modified": "2026-03-20T03:43:17.000Z",
  "created_at": "2026-02-24T18:04:50.000Z",
  "pipeline_tag": "text-generation",
  "library_name": ""
}
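
Each "Secret Recipe" script in the stored README builds a commented, multi-line list of `regex=type` rules, then pipes it through `grep -v '^#'` and `sed -Ez 's:\n+:,:g;s:,$::;s:^,::'` before passing it to `--custom-q`. The Python sketch below mirrors that text transformation so the resulting argument can be inspected. It is not part of ik_llama.cpp. The names `flatten_recipe` and `split_rules` are hypothetical, and the sample recipe is a shortened excerpt of the IQ5_KS one.

```python
from typing import List, Tuple


def flatten_recipe(recipe: str) -> str:
    """Mimic `grep -v '^#' | sed -Ez 's:\\n+:,:g;s:,$::;s:^,::'`.

    Lines that start with '#' are dropped, runs of newlines collapse into
    a single comma, and any leading or trailing comma is removed.
    """
    lines = recipe.split("\n")
    # Dropping empty lines before joining has the same effect as collapsing
    # newline runs and then trimming the leading and trailing commas.
    return ",".join(ln for ln in lines if ln and not ln.startswith("#"))


def split_rules(flat: str) -> List[Tuple[str, str]]:
    """Split 'regex=type,regex=type' into (regex, type) pairs.

    Assumes no comma appears inside a regex, which holds for the recipes
    on this page.
    """
    pairs = []
    for rule in flat.split(","):
        pattern, _, qtype = rule.rpartition("=")
        pairs.append((pattern, qtype))
    return pairs


# Shortened excerpt of the IQ5_KS recipe; a raw string keeps the backslashes.
SAMPLE = r"""
# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_ks

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
"""

if __name__ == "__main__":
    flat = flatten_recipe(SAMPLE)
    print(flat)
    for pattern, qtype in split_rules(flat):
        print(f"{qtype:8s} <- {pattern}")
```

How `llama-quantize` applies the rules to tensor names (for example, which rule wins when several match) is not described on this page, so the sketch stops at producing the argument string.
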

Source payload excerpt (from Hugging Face API)

{
  "_id": "699de84273489fecb704eddf",
  "id": "ubergarm/Qwen3.5-122B-A10B-GGUF",
  "modelId": "ubergarm/Qwen3.5-122B-A10B-GGUF",
  "sha": "4cb49cf72d5647605b5510b0745bab8a6e10124e",
  "createdAt": "2026-02-24T18:04:50.000Z",
  "lastModified": "2026-03-20T03:43:17.000Z",
  "author": "ubergarm",
  "downloads": 5617,
  "likes": 18,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "",
  "siblings_count": 12
}
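
The fields shown under "Model Details" line up with this payload: the slug is the lowercased `id`, and the dates are the day part of `createdAt` and `lastModified`. The sketch below shows one way such a normalization could be written. It is an assumption about the mapping, not the page's actual code. The function name `normalize` and the output keys are hypothetical, as is the mapping of `private` and `gated` to the "Public" and "Open" labels (with "Private" and "Gated" as assumed opposites).

```python
import json
from datetime import datetime

# Payload excerpt copied from this page.
PAYLOAD = json.loads("""
{
  "_id": "699de84273489fecb704eddf",
  "id": "ubergarm/Qwen3.5-122B-A10B-GGUF",
  "modelId": "ubergarm/Qwen3.5-122B-A10B-GGUF",
  "sha": "4cb49cf72d5647605b5510b0745bab8a6e10124e",
  "createdAt": "2026-02-24T18:04:50.000Z",
  "lastModified": "2026-03-20T03:43:17.000Z",
  "author": "ubergarm",
  "downloads": 5617,
  "likes": 18,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "",
  "siblings_count": 12
}
""")


def _day(timestamp: str) -> str:
    """Reduce an ISO timestamp such as 2026-02-24T18:04:50.000Z to its date."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.date().isoformat()


def normalize(payload: dict) -> dict:
    """Map raw API fields to the display fields used on this page (assumed)."""
    return {
        "slug": payload["id"].lower(),
        "author": payload["author"],
        "pipeline_task": payload["pipeline_tag"],
        "created": _day(payload["createdAt"]),
        "last_modified": _day(payload["lastModified"]),
        "hf_sha": payload["sha"],
        "downloads": payload["downloads"],
        "likes": payload["likes"],
        "visibility": "Private" if payload["private"] else "Public",
        "access": "Gated" if payload["gated"] else "Open",
    }


if __name__ == "__main__":
    for key, value in normalize(PAYLOAD).items():
        print(f"{key}: {value}")
```
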