Model Intelligence Sheet

unidaikon/qwen3.5-35b-a3b-q5_k_xxl-gguf overview

### Overview

This repository provides a custom quantization of Qwen3.5-35B-A3B to Q5_K format, using a hybrid-precision approach that keeps the `ssm` and attention layers in high precision to preserve long-context performance. The resulting model is approximately 23.8 GiB and is intended for systems with 32 GiB RAM + 8 GiB VRAM.

The vision tower (mmproj) is the same file as the one in other quantization repos.

### Background

Currently, all GGUF quantizations of Qwen3.5 compress the `ssm` layers to low precision. For example, in Qwen3.5-35B-A3B-UD-Q5_K_XS, `blk.0.ssm_alpha.weight` [2,048, 32], `blk.0.ssm_beta.weight` [2,048, 32], and `blk.0.ssm_out.weight` [4,096, 2,048] are all stored as Q5_K.

This may cause issues in long-context scenarios:

* `ssm` layers accumulate state linearly during generation, so quantization errors can compound over time.
* `ssm` layers are small (2048×32 and 4096×2048), so quantizing them provides minimal performance gain.
* For certain tokens requiring minor knowledge updates, `ssm_beta` quantization may introduce noticeable degradation.

This quant keeps the `ssm` layers in BF16 precision.

### Other Modifications

* Higher precision for token embeddings (`token_embd.weight`)
  * Qwen3.5 has a much larger token list, so higher precision can prevent token representation collapse.
* Higher attention matrix precision
  * Full attention is critical in the SSM–full-attention fusion architecture.
  * As only 25% of layers use full attention, it is safe to give them more bits without slowing down inference.
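The claim that the `ssm` tensors are small can be checked with rough arithmetic. The sketch below is illustrative only: it covers just the three `ssm` tensors whose shapes are quoted in the card (other `ssm` tensors may exist), and it assumes the usual llama.cpp Q5_K layout of 176 bytes per 256-weight super-block (5.5 bits per weight). The helper names are hypothetical.

```python
# Bits per weight for the two formats compared in the card.
# Assumption: Q5_K stores 256 weights in a 176-byte super-block,
# i.e. 176 * 8 / 256 = 5.5 bits per weight; BF16 is 16 bits per weight.
BITS_PER_WEIGHT = {"Q5_K": 176 * 8 / 256, "BF16": 16.0}

# Shapes quoted in the card for the ssm tensors of a single block.
SSM_SHAPES = {
    "ssm_alpha": (2048, 32),
    "ssm_beta": (2048, 32),
    "ssm_out": (4096, 2048),
}


def tensor_bytes(shape, fmt):
    """Return the storage size in bytes of a tensor with this shape and format."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BITS_PER_WEIGHT[fmt] / 8


def extra_bytes_per_block():
    """Extra bytes per block when the listed ssm tensors go from Q5_K to BF16."""
    return sum(
        tensor_bytes(shape, "BF16") - tensor_bytes(shape, "Q5_K")
        for shape in SSM_SHAPES.values()
    )


if __name__ == "__main__":
    extra = extra_bytes_per_block()
    print(f"extra per block: {extra / 2**20:.2f} MiB")
```

Under these assumptions the three listed tensors cost roughly 10.7 MiB more per block in BF16 than in Q5_K. The total overhead depends on how many blocks contain `ssm` tensors, which the card does not state.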

Tags: gguf, Qwen3.5-35B-A3B, GGUF, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, endpoints_compatible, region:us, imatrix, conversational

unidaikon/qwen3.5-35b-a3b-q5_k_xxl-gguf visual

Downloads

141

Likes

0

Pipeline

—

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

2 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Qwen3.5-35B-A3B-Q5_K_HIGH.gguf	GGUF	Q5_K_HIGH	24.19 GB	Download
Qwen3.5-35B-A3B-mmproj-BF16.gguf	GGUF	BF16	861.00 MB	Download
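
The two files in the table can also be fetched from a script. This is a minimal sketch, assuming the `huggingface_hub` package is installed; the repo id is taken from the source payload on this page, and `download_all` and the `models` directory are hypothetical names chosen for illustration.

```python
# Repo id as reported in the Hugging Face API payload for this model.
REPO_ID = "unidaikon/Qwen3.5-35B-A3B-Q5_K_XXL-GGUF"

# File names as listed in the repository file table.
FILES = [
    "Qwen3.5-35B-A3B-Q5_K_HIGH.gguf",
    "Qwen3.5-35B-A3B-mmproj-BF16.gguf",
]


def download_all(local_dir="models"):
    """Download every listed file and return the local paths."""
    # Imported lazily so the constants above can be used without the dependency.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in FILES
    ]
```

Calling `download_all()` transfers roughly 25 GB in total, so it is worth checking free disk space first.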

Model Details Live

Model Slug

unidaikon/qwen3.5-35b-a3b-q5_k_xxl-gguf

Author

unidaikon

Pipeline Task

—

Library

—

Created

2026-02-26

Last Modified

2026-02-26

Gated

No

Private

No

HF SHA

7e646f72ad7014c0a8f82b95fb20cd718ad94893

License

Unknown

Language

Unknown

Base Model

Qwen/Qwen3.5-35B-A3B, unsloth/Qwen3.5-35B-A3B-GGUF

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "base_model": [
      "Qwen/Qwen3.5-35B-A3B",
      "unsloth/Qwen3.5-35B-A3B-GGUF"
    ],
    "tags": [
      "Qwen3.5-35B-A3B",
      "GGUF"
    ],
    "frontmatter": {
      "base_model": [
        "Qwen/Qwen3.5-35B-A3B",
        "unsloth/Qwen3.5-35B-A3B-GGUF"
      ],
      "tags": [
        "Qwen3.5-35B-A3B",
        "GGUF"
      ]
    },
    "hero_image_url": "",
    "summary": "### Overview This repository provides a custom quantization of **Qwen3.5-35B-A3B** to **Q5_K** format, with a hybrid precision approach that keeps ssm and attention layers in high precision to preserve long-context performance. The resulting model size is approximately 23.8 GiB, optimized for systems with 32 GiB RAM + 8 GiB VRAM. The Vision tower (mmproj) is the same file as the one in any other quantization repos. ### Background Currently, all gguf quantizations of Qwen3.5 compress ssm layers to low precision. For example, in Qwen3.5-35B-A3B-UD-Q5_K_XS: `` blk.0.attn_qkv.weight \t[2,048, 8,192] \tQ5_K ... blk.0.ffn_gate_inp_shexp.weight \t[2,048] \tF32 blk.0.ffn_gate_shexp.weight \t[2,048, 512] \tQ8_0 blk.0.ffn_up_exps.weight \t[2,048, 512, 256] \tQ5_K ... blk.0.ssm_alpha.weight \t[2,048, 32] \tQ5_K blk.0.ssm_beta.weight \t[2,048, 32] \tQ5_K ... blk.0.ssm_out.weight \t[4,096, 2,048] \tQ5_K ` This may cause issues in long-context scenarios: * ssm layers perform linear accumulation during generation, causing quantization errors to compound over time * ssm layers are small (2048×32 and 4096×2048), so quantization provides minimal performance gain * For certain tokens requiring minor knowledge updates, ssm_beta quantization may introduce noticeable degradation This Quant: Keep ssm` layers in **BF16** precision. ### Other Modificatoin * Higher precision to token embeddings (token_embd.weight) * Qwen3.5 has much larger token list and so high precision can prevents token representation collapse * Higher attention matrix precision * Full attention is critical in the SSM–FULL attention fusion architecture * As only 25% layers have full attention, it's safe to have more bits without slow down inference",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nbase_model:\n- Qwen/Qwen3.5-35B-A3B\n- unsloth/Qwen3.5-35B-A3B-GGUF\ntags:\n- Qwen3.5-35B-A3B\n- GGUF\n---\n\n\n### Overview\n\nThis repository provides a custom quantization of **Qwen3.5-35B-A3B** to **Q5_K** format, \nwith a hybrid precision approach that keeps `ssm` and `attention` layers in high precision to preserve long-context performance. \nThe resulting model size is approximately 23.8 GiB, optimized for systems with 32 GiB RAM + 8 GiB VRAM.\n\nThe Vision tower (mmproj) is the same file as the one in any other quantization repos.\n\n### Background\n\nCurrently, all gguf quantizations of `Qwen3.5` compress `ssm` layers to low precision. \nFor example, in `Qwen3.5-35B-A3B-UD-Q5_K_XS`:\n\n```\nblk.0.attn_qkv.weight \t[2,048, 8,192] \tQ5_K\n...\nblk.0.ffn_gate_inp_shexp.weight \t[2,048] \tF32\nblk.0.ffn_gate_shexp.weight \t[2,048, 512] \tQ8_0\nblk.0.ffn_up_exps.weight \t[2,048, 512, 256] \tQ5_K\n...\nblk.0.ssm_alpha.weight \t[2,048, 32] \tQ5_K\nblk.0.ssm_beta.weight \t[2,048, 32] \tQ5_K\n...\nblk.0.ssm_out.weight \t[4,096, 2,048] \tQ5_K\n```\n\nThis may cause issues in long-context scenarios:\n\n* `ssm` layers perform linear accumulation during generation, causing quantization errors to compound over time\n* `ssm` layers are small (`2048×32` and `4096×2048`), so quantization provides minimal performance gain\n* For certain tokens requiring minor knowledge updates, `ssm_beta` quantization may introduce noticeable degradation\n\nThis Quant: Keep `ssm` layers in **BF16** precision.\n\n### Other Modificatoin\n\n* Higher precision to token embeddings (token_embd.weight)\n  * Qwen3.5 has much larger token list and so high precision can prevents token representation collapse \n* Higher attention matrix precision\n  * Full attention is critical in the SSM–FULL attention fusion architecture\n  * As only 25% layers have full attention, it's safe to have more bits without slow down inference",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "Qwen3.5-35B-A3B",
    "GGUF",
    "base_model:Qwen/Qwen3.5-35B-A3B",
    "base_model:quantized:Qwen/Qwen3.5-35B-A3B",
    "endpoints_compatible",
    "region:us",
    "imatrix",
    "conversational"
  ],
  "likes": 0,
  "downloads": 141,
  "gated": false,
  "private": false,
  "last_modified": "2026-02-26T08:03:19.000Z",
  "created_at": "2026-02-26T07:19:25.000Z",
  "pipeline_tag": "",
  "library_name": ""
}

Source payload excerpt (from Hugging Face API)

{
  "_id": "699ff3fd8c00ecb963d03c2d",
  "id": "unidaikon/Qwen3.5-35B-A3B-Q5_K_XXL-GGUF",
  "modelId": "unidaikon/Qwen3.5-35B-A3B-Q5_K_XXL-GGUF",
  "sha": "7e646f72ad7014c0a8f82b95fb20cd718ad94893",
  "createdAt": "2026-02-26T07:19:25.000Z",
  "lastModified": "2026-02-26T08:03:19.000Z",
  "author": "unidaikon",
  "downloads": 141,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 4
}