GraySoft
Model Intelligence Sheet

sokann/glm-5-gguf-1.594bpw overview

This is a 1.6 BPW (bits per weight) quantization of GLM-5 for the GPU poors: machines with 128 GiB of system RAM and 24 GiB of VRAM. The quant aims for best-in-class performance at this size by relying on the SOTA IQK quants from ik_llama.cpp.
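The sizing claim can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it takes the sizes and buffer figures that `llama-server` reports in the model's README (reproduced in the metadata further down this page) and compares them with the stated 128 GiB / 24 GiB budget. The function names are hypothetical, and the budget assumes all of the RAM and VRAM is available to the model, which a real system will not quite allow.

```python
GIB = 1024 ** 3       # bytes per GiB
MIB_PER_GIB = 1024


def bits_per_weight(size_gib, n_params):
    """Average number of bits stored per parameter for a tensor group."""
    return size_gib * GIB * 8 / n_params


def headroom_mib(budget_gib, *buffers_mib):
    """MiB left in a memory pool after subtracting the listed buffers."""
    return budget_gib * MIB_PER_GIB - sum(buffers_mib)


# Repeating layers: 138.826 GiB for 751.961 B parameters (README log).
bpw = bits_per_weight(138.826, 751.961e9)

# Buffers reported with `-cmoe --no-mmap`; CPU and CUDA_Host both sit in system RAM.
ram_left_cmoe = headroom_mib(128, 129975.00, 510.47)
vram_left_cmoe = headroom_mib(24, 10897.35)

# Buffers reported for the second layout, with 74 layers of experts kept on the CPU.
ram_left_ncmoe = headroom_mib(128, 123043.00, 510.47)
vram_left_ncmoe = headroom_mib(24, 17829.35)

print(f"repeating layers: {bpw:.3f} BPW")
print(f"all experts on CPU: RAM left {ram_left_cmoe:.2f} MiB, VRAM left {vram_left_cmoe:.2f} MiB")
print(f"74 layers on CPU:   RAM left {ram_left_ncmoe:.2f} MiB, VRAM left {vram_left_ncmoe:.2f} MiB")
```

Under these assumptions the all-experts-on-CPU layout leaves well under 1 GiB of system RAM free, which is consistent with the README's remark that this layout needs a small swap to load.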

Tags: gguf, glm_moe_dsa, conversational, ik_llama.cpp, base_model:zai-org/GLM-5, base_model:quantized:zai-org/GLM-5, license:mit, endpoints_compatible, region:us, imatrix
Downloads: 256
Likes: 3
Pipeline: (not set)
Library: (not set)
Visibility: Public
Access: Open

Repository Files & Downloads

1 file detected
Direct downloads for all repository files
File | Type | Quantization | Size | Link
GLM-5-GGUF-1.594bpw.gguf | GGUF | (not listed) | 139.92 GB | Download

Model Details

Model Slug: sokann/glm-5-gguf-1.594bpw
Author: sokann
Pipeline Task: (not set)
Library: (not set)
Created: 2026-02-26
Last Modified: 2026-03-01
Gated: No
Private: No
HF SHA: 31c21b16a06529ef1b521d9d255fb67634a40a36
License: mit
Language: Unknown
Base Model: zai-org/GLM-5

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "base_model": "zai-org/GLM-5",
    "base_model_relation": "quantized",
    "license": "mit",
    "tags": [
      "glm_moe_dsa",
      "conversational",
      "ik_llama.cpp"
    ],
    "frontmatter": {
      "base_model": "zai-org/GLM-5",
      "base_model_relation": "quantized",
      "license": "mit",
      "tags": [
        "glm_moe_dsa",
        "conversational",
        "ik_llama.cpp"
      ]
    },
    "hero_image_url": "",
    "summary": "This is a 1.6 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM. The quant aims to achieve best-in-class performance, by relying on SOTA IQK-quants from ik_llama.cpp.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nbase_model: zai-org/GLM-5\nbase_model_relation: quantized\nlicense: mit\ntags:\n- glm_moe_dsa\n- conversational\n- ik_llama.cpp\n---\n\n# GLM-5-GGUF-1.594bpw\n\nThis is a 1.6 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM.\n\nThe quant aims to achieve best-in-class performance, by relying on SOTA IQK-quants from ik_llama.cpp.\n\n\n## Size\n\nThe FFN tensors will take about 127GiB, to be loaded into System RAM and partially into VRAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.\n\nThe token_embd tensor will take about 510MiB, and that goes into System RAM as well.\n\nThe other tensors will take about 10.6GiB, to be loaded into VRAM, leaving some space for context, compute buffer, and the few overflow FFN tensors.\n\nSize from `llama-server` output:\n```\nllm_load_print_meta: model size       = 139.907 GiB (1.594 BPW)\nllm_load_print_meta: repeating layers = 138.826 GiB (1.586 BPW, 751.961 B parameters)\n```\n\nBuffer size with `-cmoe --no-mmap` (need a small swap to load):\n```\nllm_load_tensors:        CPU buffer size = 129975.00 MiB\nllm_load_tensors:  CUDA_Host buffer size =   510.47 MiB\nllm_load_tensors:      CUDA0 buffer size = 10897.35 MiB\n```\n\nBuffer size with `-ncmoe 74 --no-mmap` (doesn't need a swap):\n```\nllm_load_tensors:        CPU buffer size = 123043.00 MiB\nllm_load_tensors:  CUDA_Host buffer size =   510.47 MiB\nllm_load_tensors:      CUDA0 buffer size = 17829.35 MiB\n```\n\n\n## Quality\n\n<details>\n\n<summary>Recipe</summary>\n\n```\n# Attention\nblk\\..*\\.attn_k_b\\.weight=q6_0\nblk\\..*\\.attn_v_b\\.weight=q6_0\n\nblk\\..*\\.attn_kv_a_mqa\\.weight=iq4_k\nblk\\..*\\.attn_q_a\\.weight=iq4_k\nblk\\..*\\.attn_q_b\\.weight=iq4_k\nblk\\..*\\.attn_output\\.weight=iq5_ks\n\n# First 3 Dense Layers\nblk\\..*\\.ffn_down\\.weight=iq4_k\nblk\\..*\\.ffn_(gate|up)\\.weight=iq4_k\n\n# Shared Expert Layers\nblk\\..*\\.ffn_down_shexp\\.weight=iq4_k\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=iq4_k\n\n# Routed Experts Layers\nblk\\..*\\.ffn_(up|gate|down)_exps\\.weight=iq1_s_r4\n\n# Indexer\nblk\\..*\\.indexer\\.proj\\.weight=iq4_k\nblk\\..*\\.indexer\\.attn_k\\.weight=iq4_k\nblk\\..*\\.indexer\\.attn_q_b\\.weight=iq4_k\n\n# NextN MTP Layer\nblk\\..*\\.nextn\\.embed_tokens\\.weight=iq4_k\nblk\\..*\\.nextn\\.shared_head_head\\.weight=iq4_k\nblk\\..*\\.nextn\\.eh_proj\\.weight=iq4_k\n\n# Non-Repeating Layers\ntoken_embd\\.weight=iq4_k\noutput\\.weight=iq5_ks\n```\n</details>\n\nPPL result with wiki.test.raw:\n```\nFinal estimate: PPL over 565 chunks for n_ctx=512 = 6.2248 +/- 0.03964\n```\nSee the graph at https://huggingface.co/ubergarm/GLM-5-GGUF for comparison.\n\nThis quant uses the imatrix from unsloth, which seems to allow the model to perform more reliably in actual tasks.\n\nWhen using the imatrix from ubergarm, PPL is a bit better at 6.1469 +/- 0.03890, but performance is noticeably worse.\n\n\n## Flags\n\nTo have a usable context size, we have to sacrifice PP (prompt processing) speed by going with the much slower `-mla 1`, which doesn't use as much VRAM as the usual `-mla 3`.\n\nThese flags allow a 75000 context size:\n```\n-ot \\.(73|74|75|76|77)\\.ffn_down_exps=CUDA0 \\\n-ot \\.(75|76|77)\\.ffn_(up|gate)_exps=CUDA0 \\\n-ot exps=CPU \\\n-mla 1 -c 75000 -ctk q5_0 -khad \\\n-b 2048 -ub 2048 \\\n--jinja -cram 0 -mqkv -ger -cuda graphs=1\n```\n* 11 FFN tensors on GPU, the rest on CPU\n* `-mla 1` to squeeze 75000 context in Q5, `-khad` to reduce quantization error\n* 2048 batch size to allow GPU offload when processing larger prompts\n\nTested to work well in both Q&A tasks and agentic tasks of high difficulty.\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "glm_moe_dsa",
    "conversational",
    "ik_llama.cpp",
    "base_model:zai-org/GLM-5",
    "base_model:quantized:zai-org/GLM-5",
    "license:mit",
    "endpoints_compatible",
    "region:us",
    "imatrix"
  ],
  "likes": 3,
  "downloads": 256,
  "gated": false,
  "private": false,
  "last_modified": "2026-03-01T17:49:44.000Z",
  "created_at": "2026-02-26T23:42:10.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
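The `-ot` flags quoted in the README's Flags section pin a handful of routed-expert tensors to the GPU and send the rest to the CPU. The sketch below mimics that placement in Python to show where the "11 FFN tensors on GPU" figure comes from. It is an illustration, not ik_llama.cpp's code. It assumes the overrides are tried in command-line order with the first regex match winning, that tensors matching no override default to `CUDA0`, and that routed experts occupy layers 3 through 77 (a hypothetical range inferred from the recipe's "First 3 Dense Layers" comment and the highest index in the flags).

```python
import re

# Override rules in the order they appear in the README's flags.
OVERRIDES = [
    (r"\.(73|74|75|76|77)\.ffn_down_exps", "CUDA0"),
    (r"\.(75|76|77)\.ffn_(up|gate)_exps", "CUDA0"),
    (r"exps", "CPU"),
]


def place(tensor_name, overrides=OVERRIDES, default="CUDA0"):
    """Return the buffer for a tensor: first matching rule wins (assumed semantics)."""
    for pattern, buffer_name in overrides:
        if re.search(pattern, tensor_name):
            return buffer_name
    return default  # assumption: everything else is offloaded to the GPU


def routed_expert_tensors(first_layer, last_layer):
    """Names of the routed-expert FFN tensors for an inclusive layer range."""
    return [
        f"blk.{layer}.ffn_{kind}_exps.weight"
        for layer in range(first_layer, last_layer + 1)
        for kind in ("up", "gate", "down")
    ]


names = routed_expert_tensors(3, 77)  # hypothetical layer range
on_gpu = [name for name in names if place(name) == "CUDA0"]
on_cpu = [name for name in names if place(name) == "CPU"]

print(len(on_gpu), "routed-expert tensors on GPU")
print(len(on_cpu), "routed-expert tensors on CPU")
```

The catch-all `exps=CPU` rule comes last so that the two more specific rules get a chance to match first. Shared-expert tensors such as `blk.10.ffn_down_shexp.weight` do not contain the substring `exps`, so in this sketch they fall through to the default placement.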
Source payload excerpt (from Hugging Face API)
{
  "_id": "69a0da52d66e26677da34a0b",
  "id": "sokann/GLM-5-GGUF-1.594bpw",
  "modelId": "sokann/GLM-5-GGUF-1.594bpw",
  "sha": "31c21b16a06529ef1b521d9d255fb67634a40a36",
  "createdAt": "2026-02-26T23:42:10.000Z",
  "lastModified": "2026-03-01T17:49:44.000Z",
  "author": "sokann",
  "downloads": 256,
  "likes": 3,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 3
}