GraySoft
Model Intelligence Sheet

General-Instinct/InstinctRazor-Qwen3.5-122B-A10B-GGUF overview


Tags: gguf, llama.cpp, mixture-of-experts, quantized, iq3_xxs, instinctrazor, text-generation, base_model:Qwen/Qwen3.5-122B-A10B, base_model:quantized:Qwen/Qwen3.5-122B-A10B, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

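Runs locally with llama.cpp / guIDE. The main IQ3_XXS weights need about 48 GB of disk; the 870.0 MB file is only the F16 multimodal projector. Per the README below, that means one 80 GB GPU, or an 8 GB GPU plus ≈48 GiB of system RAM with CPU expert offload.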

Downloads: 1,444
Likes: 17
Pipeline: text-generation

Repository Files & Downloads

2 GGUF files detected
Direct downloads for local inference
File | Type | Quantization | Size
InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf | GGUF | IQ3_XXS | 48.05 GB
InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf | GGUF | F16 | 870.0 MB
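
A quick sanity check after downloading can catch a truncated or mis-saved file before llama.cpp tries to load it. GGUF files begin with the four ASCII bytes `GGUF`, so a minimal sketch might look like the following. The helper name `is_gguf` is hypothetical and not part of llama.cpp or guIDE, and the check only inspects the header; comparing a checksum against the one published by the host would be a stronger test.

```shell
#!/bin/sh
# is_gguf: hypothetical helper; succeeds if FILE starts with the GGUF magic bytes.
is_gguf() {
    # Read only the first 4 bytes so a ~48 GB file is not scanned.
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example usage with the two files listed above (skipped if they are absent).
for f in InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf \
         InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf; do
    [ -e "$f" ] || continue
    if is_gguf "$f"; then
        echo "$f: GGUF header OK"
    else
        echo "$f: not a GGUF file"
    fi
done
```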

Model Details

Model ID: General-Instinct/InstinctRazor-Qwen3.5-122B-A10B-GGUF
Author: General-Instinct
Pipeline: text-generation
License: apache-2.0
Base model: Qwen/Qwen3.5-122B-A10B
Last modified: 2026-06-06T21:41:22.000Z

Model README

---
license: apache-2.0
base_model:
  - Qwen/Qwen3.5-122B-A10B
tags:
  - gguf
  - llama.cpp
  - mixture-of-experts
  - quantized
  - iq3_xxs
  - instinctrazor
pipeline_tag: text-generation
---

InstinctRazor — Qwen3.5-122B-A10B · IQ3_XXS GGUF

A 122B hybrid Gated-DeltaNet MoE (256 experts, 8 active), packed to 48 GiB so it runs on **one 80 GB GPU** (or a small card + CPU offload). Quantized from the **original BF16** with an importance matrix (math + code + general calibration), via llama.cpp.

Framework, recipe, and full reproduction: https://github.com/General-Instinct/InstinctRazor

Speed (llama.cpp, this artifact)

  • 1× H100-80GB, all layers on GPU: 115.9 tok/s decode (prefill ≈2541 tok/s).
  • Small card + CPU expert-offload (--n-cpu-moe 48, peak ≈7.6 GiB VRAM): 45.7 tok/s decode — runs on an 8 GB GPU + ≈48 GiB system RAM.
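
As a rough worked check on these figures, the sketch below derives two numbers from the README's nominal values: the average bits stored per parameter (48 GiB spread over 122B parameters) and the offload decode speed as a share of the full-GPU speed. The function names are hypothetical, and the bits-per-weight value is a file-wide average that assumes the nominal 122B count, not a per-tensor figure.

```shell
#!/bin/sh
# Rough arithmetic on the README's nominal numbers; helper names are illustrative.

# bits_per_weight SIZE_GIB PARAMS_BILLIONS -> average bits stored per parameter
bits_per_weight() {
    awk -v gib="$1" -v b="$2" \
        'BEGIN { printf "%.2f\n", gib * 1073741824 * 8 / (b * 1e9) }'
}

# percent_of A B -> A expressed as a percentage of B
percent_of() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

echo "average bits per weight: $(bits_per_weight 48 122)"
echo "offload decode vs full GPU: $(percent_of 45.7 115.9)%"
```

With the quoted inputs this works out to about 3.38 bits per weight, and the CPU-offload configuration decodes at about 39.4% of the full-GPU rate.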

Run

# full GPU
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 -fa on -p "Your prompt"
# small card + CPU offload (routed experts on CPU)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 --n-cpu-moe 48 -t 52 -p "Your prompt"
# multimodal (image input)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf --mmproj InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf --image pic.png -p "Describe the image"
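
Choosing between the first two recipes can be scripted. The sketch below is one possible approach under stated assumptions: `pick_flags` is a hypothetical helper, the `FULL_GPU_MIN_MIB` cutoff of 80000 MiB is an assumed threshold rather than a figure from the README, and `-t 52` is copied from the command above and presumably reflects the author's CPU, so adjust it to your own core count. The script only prints the suggested command; it does not launch llama-cli.

```shell
#!/bin/sh
# pick_flags: hypothetical helper mapping total VRAM (MiB) to one of the two recipes above.
# FULL_GPU_MIN_MIB is an assumed cutoff, not a value from the README.
FULL_GPU_MIN_MIB=${FULL_GPU_MIN_MIB:-80000}

pick_flags() {
    if [ "$1" -ge "$FULL_GPU_MIN_MIB" ]; then
        echo "-ngl 999 -fa on"                 # full GPU
    else
        echo "-ngl 999 --n-cpu-moe 48 -t 52"   # routed experts on CPU
    fi
}

# Query the first GPU's total memory if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
else
    vram=0
fi
# Fall back to the offload recipe if the query returned anything non-numeric.
case "$vram" in
    ''|*[!0-9]*) vram=0 ;;
esac

echo "llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf $(pick_flags "$vram") -p \"Your prompt\""
```

Cards between 8 GB and 80 GB fall into the offload branch here; a lower `--n-cpu-moe` value may keep more experts on the GPU, but the README gives no figures for that case.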

Requires a llama.cpp build with qwen3_5_moe support (upstream, 2026-02+).

Scope & roadmap

This GGUF matches or beats the footprint-matched A4B on knowledge, reasoning, and multimodal-MMMU. Where it still trails (code on LiveCodeBench v6, and math / multimodal-math), the loss is largely token-inefficiency introduced by quantization, and is the target of OPD (on-policy distillation), a separate framework we'll open-source later. Eval absolutes are subject to a same-harness validation gate; see the GitHub results/RESULTS.md for full per-number provenance.

Attribution

  • Base model: Qwen3.5-122B-A10B © Qwen — subject to its own model license.
  • Quantization recipe + framework: General Instinct, released under Apache-2.0.


Source: Hugging Face