GraySoft
Model Intelligence Sheet

unidaikon/qwen3.5-35b-a3b-q5_k_xxl-gguf overview

### Overview

This repository provides a custom quantization of Qwen3.5-35B-A3B to Q5_K format, using a hybrid-precision approach that keeps the `ssm` and attention layers in high precision to preserve long-context performance. The resulting model is approximately 23.8 GiB, sized for systems with 32 GiB RAM + 8 GiB VRAM.

The vision tower (mmproj) is the same file as the one in other quantization repos.

### Background

Currently, all GGUF quantizations of Qwen3.5 compress the `ssm` layers to low precision. For example, in Qwen3.5-35B-A3B-UD-Q5_K_XS, `blk.0.ssm_alpha.weight` and `blk.0.ssm_beta.weight` (both [2048, 32]) and `blk.0.ssm_out.weight` ([4096, 2048]) are all stored as Q5_K.

This may cause issues in long-context scenarios:

* `ssm` layers perform linear accumulation during generation, so quantization errors can compound over time
* `ssm` layers are small (2048×32 and 4096×2048), so quantizing them yields little size or speed benefit
* For certain tokens that require minor knowledge updates, `ssm_beta` quantization may introduce noticeable degradation

This quant keeps the `ssm` layers in BF16 precision.

### Other Modifications

* Higher precision for token embeddings (`token_embd.weight`)
  * Qwen3.5 has a much larger vocabulary, so higher precision can help prevent token representation collapse
* Higher precision for attention matrices
  * Full attention is critical in the SSM–full attention hybrid architecture
  * Because only 25% of the layers use full attention, giving them more bits does not noticeably slow inference
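The claim that the `ssm` tensors are too small to be worth quantizing can be checked with quick arithmetic. The sketch below is illustrative only: it assumes the standard GGUF block layouts (Q5_K stores 256 weights in a 176-byte block, i.e. 5.5 bits per weight; BF16 uses 16 bits per weight) and counts only the three `ssm` tensors named above for block 0. The helper `choose_type` is a hypothetical stand-in for the "keep `ssm` in BF16" rule, not the tooling actually used to build this quant.

```python
# Bits per weight for the two GGUF types compared here (assumed layouts:
# Q5_K = 176 bytes per 256 weights = 5.5 bpw, BF16 = 16 bpw).
BITS_PER_WEIGHT = {"Q5_K": 5.5, "BF16": 16.0}

# Shapes copied from the block-0 tensor listing in the model card.
SSM_TENSORS = {
    "blk.0.ssm_alpha.weight": (2048, 32),
    "blk.0.ssm_beta.weight": (2048, 32),
    "blk.0.ssm_out.weight": (4096, 2048),
}


def tensor_bytes(shape, qtype):
    """Storage size in bytes of a tensor with the given shape and type."""
    count = 1
    for dim in shape:
        count *= dim
    return int(count * BITS_PER_WEIGHT[qtype] / 8)


def choose_type(name, default="Q5_K"):
    """Hypothetical rule: ssm tensors stay in BF16, everything else uses the default."""
    return "BF16" if ".ssm_" in name else default


q5_total = sum(tensor_bytes(s, "Q5_K") for s in SSM_TENSORS.values())
hybrid_total = sum(tensor_bytes(s, choose_type(n)) for n, s in SSM_TENSORS.items())
extra_bytes = hybrid_total - q5_total

print(f"Q5_K:   {q5_total / 2**20:.2f} MiB")
print(f"BF16:   {hybrid_total / 2**20:.2f} MiB")
print(f"extra:  {extra_bytes / 2**20:.2f} MiB for these three tensors")
```

Under these assumptions, keeping the three listed tensors in BF16 costs roughly 10.7 MiB more than Q5_K for the block. The listing elides other tensors, so this is a lower bound for one block rather than a total for the model.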

Tags: gguf, Qwen3.5-35B-A3B, GGUF, base_model:Qwen/Qwen3.5-35B-A3B, base_model:quantized:Qwen/Qwen3.5-35B-A3B, endpoints_compatible, region:us, imatrix, conversational
Downloads: 141
Likes: 0
Pipeline: (not set)
Library: (not set)
Visibility: Public
Access: Open

Repository Files & Downloads

2 files detected
Direct downloads for all repository files
| File | Type | Quantization | Size |
| --- | --- | --- | --- |
| Qwen3.5-35B-A3B-Q5_K_HIGH.gguf | GGUF | Q5_K_HIGH | 24.19 GB |
| Qwen3.5-35B-A3B-mmproj-BF16.gguf | GGUF | BF16 | 861.00 MB |
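For scripted downloads, the two files listed here could be fetched with the `huggingface_hub` client. This is a minimal sketch, assuming that package is installed and that the repository ID matches the `id` field reported in the Hugging Face API payload; the `download_all` helper and the `models` directory are illustrative names, not part of the repository.

```python
# Repository ID as reported by the Hugging Face API payload on this page.
REPO_ID = "unidaikon/Qwen3.5-35B-A3B-Q5_K_XXL-GGUF"

# File names as listed in the repository file table.
FILES = [
    "Qwen3.5-35B-A3B-Q5_K_HIGH.gguf",
    "Qwen3.5-35B-A3B-mmproj-BF16.gguf",
]


def download_all(local_dir="models"):
    """Download every listed file and return the local paths."""
    # Imported lazily so the module can be inspected without the dependency.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in FILES
    ]


if __name__ == "__main__":
    for path in download_all():
        print(path)
```

The main model file is about 24 GB, so check free disk space before running it.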

Model Details Live

Model Slug: unidaikon/qwen3.5-35b-a3b-q5_k_xxl-gguf
Author: unidaikon
Pipeline Task: (not set)
Library: (not set)
Created: 2026-02-26
Last Modified: 2026-02-26
Gated: No
Private: No
HF SHA: 7e646f72ad7014c0a8f82b95fb20cd718ad94893
License: Unknown
Language: Unknown
Base Model: Qwen/Qwen3.5-35B-A3B, unsloth/Qwen3.5-35B-A3B-GGUF

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "base_model": [
      "Qwen/Qwen3.5-35B-A3B",
      "unsloth/Qwen3.5-35B-A3B-GGUF"
    ],
    "tags": [
      "Qwen3.5-35B-A3B",
      "GGUF"
    ],
    "frontmatter": {
      "base_model": [
        "Qwen/Qwen3.5-35B-A3B",
        "unsloth/Qwen3.5-35B-A3B-GGUF"
      ],
      "tags": [
        "Qwen3.5-35B-A3B",
        "GGUF"
      ]
    },
    "hero_image_url": "",
    "summary": "### Overview This repository provides a custom quantization of **Qwen3.5-35B-A3B** to **Q5_K** format, with a hybrid precision approach that keeps ssm and attention layers in high precision to preserve long-context performance. The resulting model size is approximately 23.8 GiB, optimized for systems with 32 GiB RAM + 8 GiB VRAM. The Vision tower (mmproj) is the same file as the one in any other quantization repos. ### Background Currently, all gguf quantizations of Qwen3.5 compress ssm layers to low precision. For example, in Qwen3.5-35B-A3B-UD-Q5_K_XS: `` blk.0.attn_qkv.weight \t[2,048, 8,192] \tQ5_K ... blk.0.ffn_gate_inp_shexp.weight \t[2,048] \tF32 blk.0.ffn_gate_shexp.weight \t[2,048, 512] \tQ8_0 blk.0.ffn_up_exps.weight \t[2,048, 512, 256] \tQ5_K ... blk.0.ssm_alpha.weight \t[2,048, 32] \tQ5_K blk.0.ssm_beta.weight \t[2,048, 32] \tQ5_K ... blk.0.ssm_out.weight \t[4,096, 2,048] \tQ5_K ` This may cause issues in long-context scenarios: * ssm layers perform linear accumulation during generation, causing quantization errors to compound over time * ssm layers are small (2048×32 and 4096×2048), so quantization provides minimal performance gain * For certain tokens requiring minor knowledge updates, ssm_beta quantization may introduce noticeable degradation This Quant: Keep ssm` layers in **BF16** precision. ### Other Modificatoin * Higher precision to token embeddings (token_embd.weight) * Qwen3.5 has much larger token list and so high precision can prevents token representation collapse * Higher attention matrix precision * Full attention is critical in the SSM–FULL attention fusion architecture * As only 25% layers have full attention, it's safe to have more bits without slow down inference",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nbase_model:\n- Qwen/Qwen3.5-35B-A3B\n- unsloth/Qwen3.5-35B-A3B-GGUF\ntags:\n- Qwen3.5-35B-A3B\n- GGUF\n---\n\n\n### Overview\n\nThis repository provides a custom quantization of **Qwen3.5-35B-A3B** to **Q5_K** format, \nwith a hybrid precision approach that keeps `ssm` and `attention` layers in high precision to preserve long-context performance. \nThe resulting model size is approximately 23.8 GiB, optimized for systems with 32 GiB RAM + 8 GiB VRAM.\n\nThe Vision tower (mmproj) is the same file as the one in any other quantization repos.\n\n### Background\n\nCurrently, all gguf quantizations of `Qwen3.5` compress `ssm` layers to low precision. \nFor example, in `Qwen3.5-35B-A3B-UD-Q5_K_XS`:\n\n```\nblk.0.attn_qkv.weight \t[2,048, 8,192] \tQ5_K\n...\nblk.0.ffn_gate_inp_shexp.weight \t[2,048] \tF32\nblk.0.ffn_gate_shexp.weight \t[2,048, 512] \tQ8_0\nblk.0.ffn_up_exps.weight \t[2,048, 512, 256] \tQ5_K\n...\nblk.0.ssm_alpha.weight \t[2,048, 32] \tQ5_K\nblk.0.ssm_beta.weight \t[2,048, 32] \tQ5_K\n...\nblk.0.ssm_out.weight \t[4,096, 2,048] \tQ5_K\n```\n\nThis may cause issues in long-context scenarios:\n\n* `ssm` layers perform linear accumulation during generation, causing quantization errors to compound over time\n* `ssm` layers are small (`2048×32` and `4096×2048`), so quantization provides minimal performance gain\n* For certain tokens requiring minor knowledge updates, `ssm_beta` quantization may introduce noticeable degradation\n\nThis Quant: Keep `ssm` layers in **BF16** precision.\n\n### Other Modificatoin\n\n* Higher precision to token embeddings (token_embd.weight)\n  * Qwen3.5 has much larger token list and so high precision can prevents token representation collapse \n* Higher attention matrix precision\n  * Full attention is critical in the SSM–FULL attention fusion architecture\n  * As only 25% layers have full attention, it's safe to have more bits without slow down inference",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "Qwen3.5-35B-A3B",
    "GGUF",
    "base_model:Qwen/Qwen3.5-35B-A3B",
    "base_model:quantized:Qwen/Qwen3.5-35B-A3B",
    "endpoints_compatible",
    "region:us",
    "imatrix",
    "conversational"
  ],
  "likes": 0,
  "downloads": 141,
  "gated": false,
  "private": false,
  "last_modified": "2026-02-26T08:03:19.000Z",
  "created_at": "2026-02-26T07:19:25.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
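The top-level `tags` array above mixes plain labels with `key:value` relations such as `base_model:quantized:...`. As an illustration of how a consumer might separate the two, here is a small sketch; the `split_tags` helper is hypothetical and simply splits each tag on its first colon.

```python
# Tags copied from the normalized metadata above.
TAGS = [
    "gguf",
    "Qwen3.5-35B-A3B",
    "GGUF",
    "base_model:Qwen/Qwen3.5-35B-A3B",
    "base_model:quantized:Qwen/Qwen3.5-35B-A3B",
    "endpoints_compatible",
    "region:us",
    "imatrix",
    "conversational",
]


def split_tags(tags):
    """Separate plain labels from key:value tags (split on the first colon)."""
    plain, keyed = [], {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep:
            keyed.setdefault(key, []).append(value)
        else:
            plain.append(tag)
    return plain, keyed


plain, keyed = split_tags(TAGS)
print(plain)
print(keyed)
```

Splitting on only the first colon keeps the `quantized:` qualifier attached to the base-model value, so the relation type is not lost.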
Source payload excerpt (from Hugging Face API)
{
  "_id": "699ff3fd8c00ecb963d03c2d",
  "id": "unidaikon/Qwen3.5-35B-A3B-Q5_K_XXL-GGUF",
  "modelId": "unidaikon/Qwen3.5-35B-A3B-Q5_K_XXL-GGUF",
  "sha": "7e646f72ad7014c0a8f82b95fb20cd718ad94893",
  "createdAt": "2026-02-26T07:19:25.000Z",
  "lastModified": "2026-02-26T08:03:19.000Z",
  "author": "unidaikon",
  "downloads": 141,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 4
}