Model Intelligence Sheet

sokann/glm-5-gguf-1.594bpw overview

This is a 1.6 BPW quantized model for the GPU poors with 128 GiB of system RAM and 24 GiB of VRAM. The quant aims for best-in-class performance by relying on SOTA IQK quants from ik_llama.cpp.
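
As a rough check of the 128 GiB / 24 GiB target, the sketch below adds up the buffer sizes that the model card (reproduced in the metadata further down) reports from `llama-server`. It assumes the `CUDA_Host` buffer is pinned system RAM, and it ignores the KV cache, compute buffers, and the OS itself, so the headroom figures are upper bounds rather than measurements.

```python
# Buffer sizes in MiB, copied from the llama-server output quoted in the model card.
CONFIGS = {
    "-cmoe --no-mmap": {"CPU": 129975.00, "CUDA_Host": 510.47, "CUDA0": 10897.35},
    "ncmoe 74 --no-mmap": {"CPU": 123043.00, "CUDA_Host": 510.47, "CUDA0": 17829.35},
}

RAM_MIB = 128 * 1024   # 128 GiB of system RAM
VRAM_MIB = 24 * 1024   # 24 GiB of VRAM


def headroom(buffers):
    """Return (free RAM, free VRAM) in MiB after loading the weights.

    Assumption: the CUDA_Host buffer lives in (pinned) system RAM.
    """
    ram_used = buffers["CPU"] + buffers["CUDA_Host"]
    vram_used = buffers["CUDA0"]
    return RAM_MIB - ram_used, VRAM_MIB - vram_used


for name, buffers in CONFIGS.items():
    ram_free, vram_free = headroom(buffers)
    print(f"{name}: RAM headroom {ram_free:.2f} MiB, VRAM headroom {vram_free:.2f} MiB")
```

Under these assumptions the `-cmoe` layout leaves well under 1 GiB of system RAM free, which is consistent with the card's note that it needs a small swap to load. The `ncmoe 74` layout shifts 6932 MiB of weights from the CPU buffer to the GPU.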

gguf, glm_moe_dsa, conversational, ik_llama.cpp, base_model:zai-org/GLM-5, base_model:quantized:zai-org/GLM-5, license:mit, endpoints_compatible, region:us, imatrix

Downloads

256

Likes

3

Pipeline

—

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

1 file detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
GLM-5-GGUF-1.594bpw.gguf	GGUF	—	139.92 GB	Download
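
The size and the bits-per-weight (BPW) figure in the file name can be cross-checked against each other. A minimal sketch, using the `llama-server` figures quoted in the model card further down (138.826 GiB at 1.586 BPW, 751.961 B parameters for the repeating layers) and assuming BPW means total bits divided by parameter count:

```python
GIB = 1024 ** 3  # bytes per GiB


def params_from_size(size_gib, bpw):
    """Estimate the parameter count from a size in GiB and bits per weight."""
    return size_gib * GIB * 8 / bpw


def bpw_from_size(size_gib, n_params):
    """Inverse: bits per weight from a size in GiB and a parameter count."""
    return size_gib * GIB * 8 / n_params


# Figures reported for the repeating layers in the llama-server output.
est = params_from_size(138.826, 1.586)
print(f"estimated parameters: {est / 1e9:.1f} B")  # the card reports 751.961 B
print(f"implied BPW: {bpw_from_size(138.826, 751.961e9):.3f}")
```

The estimate agrees with the reported parameter count to within the rounding of the BPW figure. It also suggests that the 139.92 GB in the table above is a binary (GiB) figure, since `llama-server` reports the model size as 139.907 GiB.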

Model Details Live

Model Slug

sokann/glm-5-gguf-1.594bpw

Author

sokann

Pipeline Task

—

Library

—

Created

2026-02-26

Last Modified

2026-03-01

Gated

No

Private

No

HF SHA

31c21b16a06529ef1b521d9d255fb67634a40a36

License

mit

Language

Unknown

Base Model

zai-org/GLM-5

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "base_model": "zai-org/GLM-5",
    "base_model_relation": "quantized",
    "license": "mit",
    "tags": [
      "glm_moe_dsa",
      "conversational",
      "ik_llama.cpp"
    ],
    "frontmatter": {
      "base_model": "zai-org/GLM-5",
      "base_model_relation": "quantized",
      "license": "mit",
      "tags": [
        "glm_moe_dsa",
        "conversational",
        "ik_llama.cpp"
      ]
    },
    "hero_image_url": "",
    "summary": "This is a 1.6 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM. The quant aims to achieve best-in-class performance, by relying on SOTA IQK-quants from ik_llama.cpp.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nbase_model: zai-org/GLM-5\nbase_model_relation: quantized\nlicense: mit\ntags:\n- glm_moe_dsa\n- conversational\n- ik_llama.cpp\n---\n\n# GLM-5-GGUF-1.594bpw\n\nThis is a 1.6 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM.\n\nThe quant aims to achieve best-in-class performance, by relying on SOTA IQK-quants from ik_llama.cpp.\n\n\n## Size\n\nThe FFN tensors will take about 127GiB, to be loaded into System RAM and partially into VRAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.\n\nThe token_embd tensor will take about 510MiB, and that goes into System RAM as well.\n\nThe other tensors will take about 10.6GiB, to be loaded into VRAM, leaving some space for context, compute buffer, and the few overflow FFN tensors.\n\nSize from `llama-server` output:\n```\nllm_load_print_meta: model size       = 139.907 GiB (1.594 BPW)\nllm_load_print_meta: repeating layers = 138.826 GiB (1.586 BPW, 751.961 B parameters)\n```\n\nBuffer size with `-cmoe --no-mmap` (need a small swap to load):\n```\nllm_load_tensors:        CPU buffer size = 129975.00 MiB\nllm_load_tensors:  CUDA_Host buffer size =   510.47 MiB\nllm_load_tensors:      CUDA0 buffer size = 10897.35 MiB\n```\n\nBuffer size with `ncmoe 74 --no-mmap` (doesn't need a swap):\n```\nllm_load_tensors:        CPU buffer size = 123043.00 MiB\nllm_load_tensors:  CUDA_Host buffer size =   510.47 MiB\nllm_load_tensors:      CUDA0 buffer size = 17829.35 MiB\n```\n\n\n## Quality\n\n<details>\n\n<summary>Recipe</summary>\n\n```\n# Attention\nblk\\..*\\.attn_k_b\\.weight=q6_0\nblk\\..*\\.attn_v_b\\.weight=q6_0\n\nblk\\..*\\.attn_kv_a_mqa\\.weight=iq4_k\nblk\\..*\\.attn_q_a\\.weight=iq4_k\nblk\\..*\\.attn_q_b\\.weight=iq4_k\nblk\\..*\\.attn_output\\.weight=iq5_ks\n\n# First 3 Dense Layers\nblk\\..*\\.ffn_down\\.weight=iq4_k\nblk\\..*\\.ffn_(gate|up)\\.weight=iq4_k\n\n# Shared Expert 
Layers\nblk\\..*\\.ffn_down_shexp\\.weight=iq4_k\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=iq4_k\n\n# Routed Experts Layers\nblk\\..*\\.ffn_(up|gate|down)_exps\\.weight=iq1_s_r4\n\n# Indexer\nblk\\..*\\.indexer\\.proj\\.weight=iq4_k\nblk\\..*\\.indexer\\.attn_k\\.weight=iq4_k\nblk\\..*\\.indexer\\.attn_q_b\\.weight=iq4_k\n\n# NextN MTP Layer\nblk\\..*\\.nextn\\.embed_tokens\\.weight=iq4_k\nblk\\..*\\.nextn\\.shared_head_head\\.weight=iq4_k\nblk\\..*\\.nextn\\.eh_proj\\.weight=iq4_k\n\n# Non-Repeating Layers\ntoken_embd\\.weight=iq4_k\noutput\\.weight=iq5_ks\n```\n</details>\n\nPPL result with wiki.test.raw:\n```\nFinal estimate: PPL over 565 chunks for n_ctx=512 = 6.2248 +/- 0.03964\n```\nCan check the graph from https://huggingface.co/ubergarm/GLM-5-GGUF for comparison.\n\nThis quant uses the imatrix from unsloth, which seems to allow the model to perform more reliably in actual tasks.\n\nWhen using the imatrix from ubergarm, PPL is a bit better at 6.1469 +/- 0.03890, but performance is noticeably worse.\n\n\n## Flags\n\nTo have usable context size, we have to sacrifice PP by going with the much slower `-mla 1`, which doesn't use as much VRAM compared to the usual `-mla 3`.\n\nThese flags allow a 75000 context size:\n```\n-ot \\.(73|74|75|76|77)\\.ffn_down_exps=CUDA0 \\\n-ot \\.(75|76|77)\\.ffn_(up|gate)_exps=CUDA0 \\\n-ot exps=CPU \\\n-mla 1 -c 75000 -ctk q5_0 -khad \\\n-b 2048 -ub 2048 \\\n--jinja -cram 0 -mqkv -ger -cuda graphs=1\n```\n* 11 FFN tensors on GPU, the rest on CPU\n* `-mla 1` to squeeze 75000 context in Q5, `-khad` to reduce quantization error\n* 2048 batch size to allow GPU offload when processing larger prompt\n\nTested to be working well in both Q&A tasks and agentic tasks, with high difficulty.\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "glm_moe_dsa",
    "conversational",
    "ik_llama.cpp",
    "base_model:zai-org/GLM-5",
    "base_model:quantized:zai-org/GLM-5",
    "license:mit",
    "endpoints_compatible",
    "region:us",
    "imatrix"
  ],
  "likes": 3,
  "downloads": 256,
  "gated": false,
  "private": false,
  "last_modified": "2026-03-01T17:49:44.000Z",
  "created_at": "2026-02-26T23:42:10.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
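
The `-ot` overrides in the card's Flags section can be sanity-checked by replaying them over tensor names. The sketch below is an illustration under several assumptions, not ik_llama.cpp's actual logic: it assumes routed-expert tensors are named `blk.<N>.ffn_{up,gate,down}_exps.weight`, that layers 3 to 77 carry routed experts, that patterns are tried in the order given, and that the first pattern found anywhere in the name wins.

```python
import re
from collections import Counter

# (pattern, target) pairs, in the order the flags appear in the card.
OVERRIDES = [
    (r"\.(73|74|75|76|77)\.ffn_down_exps", "CUDA0"),
    (r"\.(75|76|77)\.ffn_(up|gate)_exps", "CUDA0"),
    (r"exps", "CPU"),
]


def place(tensor_name):
    """Return the target of the first matching override, or None if none match."""
    for pattern, target in OVERRIDES:
        if re.search(pattern, tensor_name):
            return target
    return None


# Hypothetical tensor names: layers 3..77, three routed-expert tensors each.
names = [
    f"blk.{layer}.ffn_{kind}_exps.weight"
    for layer in range(3, 78)
    for kind in ("up", "gate", "down")
]

counts = Counter(place(name) for name in names)
print(counts["CUDA0"], "expert tensors on GPU,", counts["CPU"], "on CPU")
```

With these assumptions, 11 tensors land on `CUDA0` (five `ffn_down_exps` plus three layers of `ffn_up_exps` and `ffn_gate_exps`), matching the card's "11 FFN tensors on GPU". The catch-all `exps` pattern sends every remaining routed-expert tensor to the CPU.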

Source payload excerpt (from Hugging Face API)

{
  "_id": "69a0da52d66e26677da34a0b",
  "id": "sokann/GLM-5-GGUF-1.594bpw",
  "modelId": "sokann/GLM-5-GGUF-1.594bpw",
  "sha": "31c21b16a06529ef1b521d9d255fb67634a40a36",
  "createdAt": "2026-02-26T23:42:10.000Z",
  "lastModified": "2026-03-01T17:49:44.000Z",
  "author": "sokann",
  "downloads": 256,
  "likes": 3,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 3
}