FreedomAISVR/GLM-4.6V-Flash-MXFP4-GGUF overview
Runs locally with llama.cpp or guIDE. The two GGUF files total about 6.48 GB on disk: 4.82 GB for the text model plus 1.66 GB for the vision projector.
Model Details
| Model ID | FreedomAISVR/GLM-4.6V-Flash-MXFP4-GGUF |
|---|---|
| Author | FreedomAISVR |
| Pipeline | image-text-to-text |
| License | mit |
| Base model | zai-org/GLM-4.6V-Flash |
| Last modified | 2026-06-10T22:55:47.000Z |
Model README
---
license: mit
language:
- en
- zh
base_model: zai-org/GLM-4.6V-Flash
library_name: gguf
pipeline_tag: image-text-to-text
tags:
- glm
- vision
- multi-modal
- mxfp4
- 4-bit
- gguf
- llava
- quantized
extra_field:
  quantized_by: FreedomAI SVR
  arxiv: 2507.01006
---
GLM-4.6V-Flash MXFP4 GGUF
Quantized GGUF version of zai-org/GLM-4.6V-Flash by Z.ai (Zhipu AI), converted to MXFP4 (4-bit Microscaling FP4) format.
Model Details
- Base model: zai-org/GLM-4.6V-Flash — 9B parameter vision-language model by Z.ai with 40 transformer layers, 4096 hidden dim, 32 attention heads (8 KV heads), SwiGLU activation. Paper: 2507.01006.
- Vision encoder: 24-layer ViT (1536 hidden dim, 1536/4096 attention dim, 13696 intermediate FFN)
- Context length: 128K tokens
- Quantization: MXFP4 — OCP Microscaling FP4 format (E2M1 data values with E8M0 per-block scales, 4.41 BPW, 4.82 GB)
- Thinking: Enabled by default (native `<think>`/`</think>` tokens, opt-out via `enable_thinking=false`)
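
The architecture figures above are enough for a rough estimate of KV cache memory, which adds to the model file size at runtime. The sketch below is an illustration only: it assumes a head dimension of hidden dim divided by attention heads (4096 / 32 = 128), an unquantized F16 cache, and 128K meaning 131072 tokens. None of these is stated in the model card, and the vision encoder and compute buffers are not counted.

```python
# Architecture figures quoted in the Model Details list.
N_LAYERS = 40
N_HEADS = 32
N_KV_HEADS = 8
HIDDEN_DIM = 4096

# Assumptions (not stated in the model card):
# head_dim = hidden_dim / n_heads, and the KV cache is stored as F16.
HEAD_DIM = HIDDEN_DIM // N_HEADS  # 128
BYTES_PER_VALUE = 2               # F16


def kv_cache_bytes(n_ctx: int) -> int:
    """Bytes needed to hold keys and values for n_ctx tokens across all layers."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 2 = K and V
    return per_token * n_ctx


if __name__ == "__main__":
    for n_ctx in (8192, 32768, 131072):
        print(f"n_ctx={n_ctx:>6}: {kv_cache_bytes(n_ctx) / 2**30:.2f} GiB")
```

Under these assumptions the cache costs 160 KiB per token, so about 5 GiB at `n_ctx=32768` and 20 GiB at 131072 tokens. On smaller GPUs a shorter context is the main lever.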
Files
| File | Size | Description |
|------|------|-------------|
| glm-4.6v-flash-mxfp4.gguf | 4.82 GB | Quantized text model (523 tensors, 4.41 BPW) |
| mmproj-glm-4.6v-flash-f16.gguf | 1.66 GB | Vision encoder projector (182 tensors, F16) |
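
Both files in the table are needed for image input. As a minimal sketch, they could be fetched with `hf_hub_download` from the `huggingface_hub` package. The repository ID and file names are taken from this page. The `models` target directory and the helper names are hypothetical.

```python
# File names and sizes (GB) as listed in the Files table.
REPO_ID = "FreedomAISVR/GLM-4.6V-Flash-MXFP4-GGUF"
FILES = {
    "glm-4.6v-flash-mxfp4.gguf": 4.82,       # quantized text model
    "mmproj-glm-4.6v-flash-f16.gguf": 1.66,  # vision encoder projector
}


def total_size_gb(files: dict) -> float:
    """Sum the listed file sizes to estimate the disk space required."""
    return round(sum(files.values()), 2)


def download_all(repo_id: str = REPO_ID, local_dir: str = "models") -> list:
    """Download every listed file and return the local paths."""
    # Imported here so the size helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        for name in FILES
    ]


if __name__ == "__main__":
    print(f"About {total_size_gb(FILES)} GB of disk space is needed.")
    for path in download_all():
        print("saved:", path)
```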
Usage
LM Studio
Load both files — the text GGUF as the main model and the mmproj as the vision encoder. Supports multimodal inputs (images + text).
llama.cpp
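```bash
# Recent llama.cpp builds ship the multimodal CLI as llama-mtmd-cli
# (it replaced the older llama-llava-cli binary).
./llama-mtmd-cli \
  -m glm-4.6v-flash-mxfp4.gguf \
  --mmproj mmproj-glm-4.6v-flash-f16.gguf \
  -p "Describe this image in detail." \
  --image path/to/image.jpg
```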
Python (llama-cpp-python)
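```python
import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# llama-cpp-python loads the vision projector through a chat handler, not an
# `mmproj` argument. Llava15ChatHandler is its generic handler; whether its
# prompt template suits GLM-4.6V is an assumption, so check the output.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-glm-4.6v-flash-f16.gguf")

llm = Llama(
    model_path="glm-4.6v-flash-mxfp4.gguf",
    chat_handler=chat_handler,
    n_ctx=32768,
)

# Local images are passed as base64 data URIs.
with open("image.jpg", "rb") as f:
    image_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

output = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_uri}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }]
)
print(output["choices"][0]["message"]["content"])
```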
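
Because thinking is enabled by default, the returned text may contain a `<think>...</think>` span before the answer. The helper below is a minimal sketch for separating the two. `split_thinking` is a hypothetical name, and the sketch assumes the tags arrive inline in the message content. Some runtimes strip them or return the reasoning in a separate field, in which case this step is unnecessary.

```python
import re

# Matches one reasoning span, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple:
    """Return (reasoning, answer) from a response that may contain think tags."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>Two animals, both feline.</think>\nThe image shows two cats."
    reasoning, answer = split_thinking(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```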
Quantization Details
- Source: `zai-org/GLM-4.6V-Flash` → F16 GGUF → `llama-quantize.exe MXFP4`
- Block size: 32 elements; E8M0 shared scale (1 scale per 32-element block)
- Output tensor: Q6_K (higher precision for the final projection)
- Format: OCP MXFP4 specification (E2M1 data values, E8M0 per-block scaling)
- Architecture: `glm4` with 523 tensors (40 transformer layers, vision embedder)
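
To make the block format concrete, the following sketch decodes one MXFP4 block as the OCP Microscaling format defines it: each 4-bit E2M1 code has a sign bit and one of eight magnitudes, and the shared E8M0 byte encodes a power-of-two scale with bias 127. It takes the 32 codes as a plain list, so it does not model how llama.cpp packs nibbles inside a GGUF tensor. The function names are hypothetical.

```python
# The eight E2M1 magnitudes, indexed by the low three bits of a 4-bit code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 pick the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_block(scale_byte: int, codes: list) -> list:
    """Dequantize one block of 32 codes that share a single E8M0 scale."""
    if len(codes) != BLOCK_SIZE:
        raise ValueError("an MXFP4 block holds exactly 32 elements")
    if scale_byte == 0xFF:
        raise ValueError("E8M0 value 0xFF encodes NaN")
    scale = 2.0 ** (scale_byte - 127)  # E8M0: exponent only, bias 127
    return [decode_e2m1(code) * scale for code in codes]


def bits_per_weight(element_bits: int = 4, scale_bits: int = 8) -> float:
    """Storage cost per element of a pure MXFP4 block."""
    return (BLOCK_SIZE * element_bits + scale_bits) / BLOCK_SIZE


if __name__ == "__main__":
    block = decode_block(128, [0x1, 0x7, 0xF] + [0x0] * 29)
    print(block[:3])
    print(bits_per_weight())
```

A pure MXFP4 block costs (32 × 4 + 8) / 32 = 4.25 bits per weight. The 4.41 BPW reported for this file is higher, which would be consistent with some tensors, such as the Q6_K output tensor, being stored at higher precision.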
Hardware Compatibility
- MXFP4 is supported on NVIDIA Blackwell (RTX 50 series) via native FP4 MMA instructions
- Falls back to software dequantization on other GPU architectures and CPU
- Cross-vendor compatible format per OCP Microscaling specification
Source: Hugging Face