Model Intelligence Sheet
sokann/GLM-5-GGUF-1.594bpw overview
This is a 1.6 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM. The quant aims to achieve best-in-class performance, by relying on SOTA IQK-quants from ik_llama.cpp.
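As a rough sanity check on the 128 GiB RAM / 24 GiB VRAM target, the buffer sizes that the README (the `readme_markdown` field further down) quotes from `llama-server` can be summed per device. This is only a sketch built from those reported numbers: the dictionary layout and the function name are illustrative, and counting `CUDA_Host` against system RAM is an assumption (it is treated here as pinned host memory).

```python
# Buffer sizes in MiB, copied from the llama-server output quoted in the README.
BUFFERS_MIB = {
    "-cmoe --no-mmap": {"CPU": 129975.00, "CUDA_Host": 510.47, "CUDA0": 10897.35},
    "-ncmoe 74 --no-mmap": {"CPU": 123043.00, "CUDA_Host": 510.47, "CUDA0": 17829.35},
}

RAM_GIB = 128   # system RAM target stated in the summary
VRAM_GIB = 24   # VRAM target stated in the summary


def headroom_gib(buffers):
    """Return (RAM headroom, VRAM headroom) in GiB for one configuration."""
    # Assumption: CUDA_Host is pinned host memory, so it counts against RAM.
    ram_used = (buffers["CPU"] + buffers["CUDA_Host"]) / 1024
    vram_used = buffers["CUDA0"] / 1024
    return RAM_GIB - ram_used, VRAM_GIB - vram_used


for flags, buffers in BUFFERS_MIB.items():
    ram_free, vram_free = headroom_gib(buffers)
    print(f"{flags}: RAM headroom {ram_free:.2f} GiB, VRAM headroom {vram_free:.2f} GiB")
```

Under these assumptions the first configuration leaves roughly 0.57 GiB of RAM, which fits the README's remark that a small swap is needed to load, while the second leaves roughly 7.3 GiB. Neither figure accounts for the KV cache, compute buffers, or the operating system itself.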
| Field | Value |
|---|---|
| Downloads | 256 |
| Likes | 3 |
| Pipeline | — |
| Library | — |
| Visibility | Public |
| Access | Open |
Repository Files & Downloads
1 file detected
Direct downloads for all repository files
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| GLM-5-GGUF-1.594bpw.gguf | GGUF | — | 139.92 GiB | Download |
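The bits-per-weight figure follows directly from tensor bytes and parameter count. As a small sketch, the snippet below recomputes the repeating-layer BPW from the two numbers in the `llama-server` output quoted in the README (138.826 GiB and 751.961 B parameters); the function name is illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def bits_per_weight(size_gib, params_billion):
    """Average number of bits stored per parameter for a tensor group."""
    total_bits = size_gib * GIB * 8
    return total_bits / (params_billion * 1e9)


# Repeating layers, as reported by llama-server in the README.
bpw = bits_per_weight(138.826, 751.961)
print(f"{bpw:.3f} BPW")  # 1.586 BPW
```

The result rounds to the 1.586 BPW that the log reports for the repeating layers.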
Model Details Live
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"base_model": "zai-org/GLM-5",
"base_model_relation": "quantized",
"license": "mit",
"tags": [
"glm_moe_dsa",
"conversational",
"ik_llama.cpp"
],
"frontmatter": {
"base_model": "zai-org/GLM-5",
"base_model_relation": "quantized",
"license": "mit",
"tags": [
"glm_moe_dsa",
"conversational",
"ik_llama.cpp"
]
},
"hero_image_url": "",
"summary": "This is a 1.6 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM. The quant aims to achieve best-in-class performance, by relying on SOTA IQK-quants from ik_llama.cpp.",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\nbase_model: zai-org/GLM-5\nbase_model_relation: quantized\nlicense: mit\ntags:\n- glm_moe_dsa\n- conversational\n- ik_llama.cpp\n---\n\n# GLM-5-GGUF-1.594bpw\n\nThis is a 1.6 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM.\n\nThe quant aims to achieve best-in-class performance, by relying on SOTA IQK-quants from ik_llama.cpp.\n\n\n## Size\n\nThe FFN tensors will take about 127 GiB, to be loaded into System RAM and partially into VRAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.\n\nThe token_embd tensor will take about 510 MiB, and that goes into System RAM as well.\n\nThe other tensors will take about 10.6 GiB, to be loaded into VRAM, leaving some space for context, compute buffer, and the few overflow FFN tensors.\n\nSize from `llama-server` output:\n```\nllm_load_print_meta: model size = 139.907 GiB (1.594 BPW)\nllm_load_print_meta: repeating layers = 138.826 GiB (1.586 BPW, 751.961 B parameters)\n```\n\nBuffer size with `-cmoe --no-mmap` (needs a small swap to load):\n```\nllm_load_tensors: CPU buffer size = 129975.00 MiB\nllm_load_tensors: CUDA_Host buffer size = 510.47 MiB\nllm_load_tensors: CUDA0 buffer size = 10897.35 MiB\n```\n\nBuffer size with `-ncmoe 74 --no-mmap` (doesn't need a swap):\n```\nllm_load_tensors: CPU buffer size = 123043.00 MiB\nllm_load_tensors: CUDA_Host buffer size = 510.47 MiB\nllm_load_tensors: CUDA0 buffer size = 17829.35 MiB\n```\n\n\n## Quality\n\n<details>\n\n<summary>Recipe</summary>\n\n```\n# Attention\nblk\\..*\\.attn_k_b\\.weight=q6_0\nblk\\..*\\.attn_v_b\\.weight=q6_0\n\nblk\\..*\\.attn_kv_a_mqa\\.weight=iq4_k\nblk\\..*\\.attn_q_a\\.weight=iq4_k\nblk\\..*\\.attn_q_b\\.weight=iq4_k\nblk\\..*\\.attn_output\\.weight=iq5_ks\n\n# First 3 Dense Layers\nblk\\..*\\.ffn_down\\.weight=iq4_k\nblk\\..*\\.ffn_(gate|up)\\.weight=iq4_k\n\n# Shared Expert Layers\nblk\\..*\\.ffn_down_shexp\\.weight=iq4_k\nblk\\..*\\.ffn_(gate|up)_shexp\\.weight=iq4_k\n\n# Routed Experts Layers\nblk\\..*\\.ffn_(up|gate|down)_exps\\.weight=iq1_s_r4\n\n# Indexer\nblk\\..*\\.indexer\\.proj\\.weight=iq4_k\nblk\\..*\\.indexer\\.attn_k\\.weight=iq4_k\nblk\\..*\\.indexer\\.attn_q_b\\.weight=iq4_k\n\n# NextN MTP Layer\nblk\\..*\\.nextn\\.embed_tokens\\.weight=iq4_k\nblk\\..*\\.nextn\\.shared_head_head\\.weight=iq4_k\nblk\\..*\\.nextn\\.eh_proj\\.weight=iq4_k\n\n# Non-Repeating Layers\ntoken_embd\\.weight=iq4_k\noutput\\.weight=iq5_ks\n```\n</details>\n\nPPL result with wiki.test.raw:\n```\nFinal estimate: PPL over 565 chunks for n_ctx=512 = 6.2248 +/- 0.03964\n```\nSee the graph at https://huggingface.co/ubergarm/GLM-5-GGUF for comparison.\n\nThis quant uses the imatrix from unsloth, which seems to allow the model to perform more reliably in actual tasks.\n\nWhen using the imatrix from ubergarm, PPL is a bit better at 6.1469 +/- 0.03890, but performance in actual tasks is noticeably worse.\n\n\n## Flags\n\nTo get a usable context size, we have to sacrifice prompt-processing (PP) speed by going with the much slower `-mla 1`, which uses less VRAM than the usual `-mla 3`.\n\nThese flags allow a 75000 context size:\n```\n-ot \\.(73|74|75|76|77)\\.ffn_down_exps=CUDA0 \\\n-ot \\.(75|76|77)\\.ffn_(up|gate)_exps=CUDA0 \\\n-ot exps=CPU \\\n-mla 1 -c 75000 -ctk q5_0 -khad \\\n-b 2048 -ub 2048 \\\n--jinja -cram 0 -mqkv -ger -cuda graphs=1\n```\n* 11 FFN tensors on GPU, the rest on CPU\n* `-mla 1` to squeeze 75000 context in Q5, `-khad` to reduce quantization error\n* 2048 batch size to allow GPU offload when processing larger prompts\n\nTested to work well on both Q&A and agentic tasks of high difficulty.\n",
"related_quantizations": []
},
"tags": [
"gguf",
"glm_moe_dsa",
"conversational",
"ik_llama.cpp",
"base_model:zai-org/GLM-5",
"base_model:quantized:zai-org/GLM-5",
"license:mit",
"endpoints_compatible",
"region:us",
"imatrix"
],
"likes": 3,
"downloads": 256,
"gated": false,
"private": false,
"last_modified": "2026-03-01T17:49:44.000Z",
"created_at": "2026-02-26T23:42:10.000Z",
"pipeline_tag": "",
"library_name": ""
}
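The `-ot` overrides in the README's Flags section pin a few routed-expert tensors to `CUDA0` and send the rest to CPU. The sketch below replays those three patterns over hypothetical tensor names to count what lands on the GPU. It assumes tensor names of the form `blk.<layer>.ffn_<kind>_exps.weight`, routed-expert layers indexed 3 to 77, and first-match-wins ordering of the rules; none of these is confirmed by the source beyond what the flag order and the recipe suggest.

```python
import re

# Override rules in the order given on the command line: (pattern, target).
OVERRIDES = [
    (r"\.(73|74|75|76|77)\.ffn_down_exps", "CUDA0"),
    (r"\.(75|76|77)\.ffn_(up|gate)_exps", "CUDA0"),
    (r"exps", "CPU"),
]


def place(tensor_name):
    """Return the target of the first rule whose pattern matches, else None."""
    for pattern, target in OVERRIDES:
        if re.search(pattern, tensor_name):
            return target
    return None


# Hypothetical tensor names: layers 3..77 are assumed to hold routed experts.
names = [
    f"blk.{layer}.ffn_{kind}_exps.weight"
    for layer in range(3, 78)
    for kind in ("up", "gate", "down")
]

on_gpu = [n for n in names if place(n) == "CUDA0"]
print(len(on_gpu))  # 5 down + 3 * 2 up/gate = 11
```

The count agrees with the README's note that 11 FFN tensors sit on the GPU. The catch-all `exps=CPU` rule has to come last, because under the first-match assumption it would otherwise claim every routed-expert tensor.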
Source payload excerpt (from Hugging Face API)
{
"_id": "69a0da52d66e26677da34a0b",
"id": "sokann/GLM-5-GGUF-1.594bpw",
"modelId": "sokann/GLM-5-GGUF-1.594bpw",
"sha": "31c21b16a06529ef1b521d9d255fb67634a40a36",
"createdAt": "2026-02-26T23:42:10.000Z",
"lastModified": "2026-03-01T17:49:44.000Z",
"author": "sokann",
"downloads": 256,
"likes": 3,
"gated": false,
"private": false,
"pipeline_tag": "",
"library_name": "",
"siblings_count": 3
}