GraySoft
Model Intelligence Sheet

FreedomAISVR/GLM-4.6V-Flash-MXFP4-GGUF overview


Tags: gguf, glm, vision, multi-modal, mxfp4, 4-bit, llava, quantized, image-text-to-text, en, zh, arxiv:2507.01006, base_model:zai-org/GLM-4.6V-Flash, base_model:quantized:zai-org/GLM-4.6V-Flash, license:mit, endpoints_compatible, region:us, conversational

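Runs locally with llama.cpp or guIDE. Image input needs both files, about 6.5 GB on disk in total (the 4.82 GB text model plus the 1.66 GB mmproj vision projector). The text model alone is larger than 4 GB, so a 4 GB VRAM GPU cannot hold it entirely and part of the model has to stay on the CPU.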

Downloads: 0
Likes: 1
Pipeline: image-text-to-text

Repository Files & Downloads

2 GGUF files detected
Direct downloads for local inference
| File | Type | Quantization | Size |
|------|------|--------------|------|
| glm-4.6v-flash-mxfp4.gguf | GGUF | MXFP4 | 4.83 GB |
| mmproj-glm-4.6v-flash-f16.gguf | GGUF | F16 | 1.66 GB |

Model Details

Model ID: FreedomAISVR/GLM-4.6V-Flash-MXFP4-GGUF
Author: FreedomAISVR
Pipeline: image-text-to-text
License: mit
Base model: zai-org/GLM-4.6V-Flash
Last modified: 2026-06-10T22:55:47.000Z

Model README

---
license: mit
language:
  - en
  - zh
base_model: zai-org/GLM-4.6V-Flash
library_name: gguf
pipeline_tag: image-text-to-text
tags:
  - glm
  - vision
  - multi-modal
  - mxfp4
  - 4-bit
  - gguf
  - llava
  - quantized
extra_field:
quantized_by: FreedomAI SVR
arxiv: 2507.01006
---

GLM-4.6V-Flash MXFP4 GGUF

Quantized GGUF version of zai-org/GLM-4.6V-Flash by Z.ai (Zhipu AI), converted to MXFP4 (4-bit Microscaling FP4) format.

Model Details

  • Base model: zai-org/GLM-4.6V-Flash — 9B parameter vision-language model by Z.ai with 40 transformer layers, 4096 hidden dim, 32 attention heads (8 KV heads), SwiGLU activation. Paper: 2507.01006.
  • Vision encoder: 24-layer ViT (1536 hidden dim, 1536/4096 attention dim, 13696 intermediate FFN)
  • Context length: 128K tokens
  • Quantization: MXFP4 — OCP Microscaling FP4 format (E2M1 data values with E8M0 per-block scales, 4.41 BPW, 4.82 GB)
  • Thinking: Enabled by default (native <think>/</think> tokens, opt-out via enable_thinking=false)
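The architecture numbers above are enough for a rough estimate of how much memory the KV cache adds on top of the weights at a given context length. The sketch below assumes a head dimension of 4096 / 32 = 128, an F16 cache (2 bytes per element), and plain grouped-query attention with no cache quantization. None of these is stated in the model card, so treat the output as an order-of-magnitude guide.

```python
def kv_cache_bytes(n_tokens, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes for keys plus values across all layers for n_tokens of context."""
    # The leading factor 2 covers the separate key and value tensors.
    # head_dim=128 and bytes_per_elem=2 (F16) are assumptions, see the text above.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


GIB = 1024 ** 3
for n_ctx in (8192, 32768, 131072):
    print(f"n_ctx={n_ctx:>6}: {kv_cache_bytes(n_ctx) / GIB:.2f} GiB")
```

Under these assumptions the cache costs 160 KiB per token, so a 32768-token context needs about 5 GiB and the full 128K context about 20 GiB. Choose `n_ctx` with that in mind.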

Files

| File | Size | Description |

|------|------|-------------|

| glm-4.6v-flash-mxfp4.gguf | 4.82 GB | Quantized text model (523 tensors, 4.41 BPW) |

| mmproj-glm-4.6v-flash-f16.gguf | 1.66 GB | Vision encoder projector (182 tensors, F16) |

Usage

LM Studio

Load both files — the text GGUF as the main model and the mmproj as the vision encoder. Supports multimodal inputs (images + text).
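Once the model is loaded, LM Studio can also serve it over its OpenAI-compatible local server. The following is a minimal sketch of a vision request using only the standard library. It assumes the server is running at LM Studio's default address `http://localhost:1234/v1`. The model identifier `glm-4.6v-flash-mxfp4` and the helpers `build_vision_request` and `send` are hypothetical names; use whatever model name LM Studio displays.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes, prompt, model="glm-4.6v-flash-mxfp4"):
    """Build an OpenAI-style chat payload with one inline JPEG and a text prompt."""
    data_uri = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # hypothetical identifier; use the name LM Studio shows
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def send(payload, base_url="http://localhost:1234/v1"):
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

A typical call reads the image with `open("image.jpg", "rb").read()`, passes the bytes to `build_vision_request`, and hands the result to `send`.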

llama.cpp

# Current llama.cpp builds ship llama-mtmd-cli; it replaces the older llama-llava-cli binary.
./llama-mtmd-cli \
  -m glm-4.6v-flash-mxfp4.gguf \
  --mmproj mmproj-glm-4.6v-flash-f16.gguf \
  -p "Describe this image in detail." \
  --image path/to/image.jpg

Python (llama-cpp-python)

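import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# llama-cpp-python loads the vision projector through a chat handler, not an
# mmproj argument. Assumption: the generic LLaVA-style handler works with this
# model; check your llama-cpp-python release for a GLM-specific handler.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-glm-4.6v-flash-f16.gguf")

llm = Llama(
    model_path="glm-4.6v-flash-mxfp4.gguf",
    chat_handler=chat_handler,
    n_ctx=32768,
)

# A bare file path is not a valid URL, so pass the local image as a data URI.
with open("image.jpg", "rb") as f:
    image_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

output = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_uri}},
            {"type": "text", "text": "What's in this image?"}
        ]
    }]
)
print(output["choices"][0]["message"]["content"])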
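Thinking is enabled by default, so a reply may contain a `<think>...</think>` block ahead of the final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning arrives inline in the message content; some servers strip it or return it in a separate field, in which case this step is unnecessary. `split_thinking` is a hypothetical name, not part of llama-cpp-python.

```python
import re

# Non-greedy match so several <think> blocks in one reply are handled separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from a reply that may contain <think> blocks."""
    reasoning = "\n".join(m.strip() for m in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<think>Two shapes, one red.</think>A red circle and a square."
)
print(answer)  # A red circle and a square.
```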

Quantization Details

  • Source: zai-org/GLM-4.6V-Flash → F16 GGUF → llama-quantize.exe MXFP4
  • Block size: 32 elements; E8M0 shared scale (1 scale per 32-element block)
  • Output tensor: Q6_K (higher precision for the final projection)
  • Format: OCP MXFP4 specification (E2M1 data values, E8M0 per-block scaling)
  • Architecture: glm4 with 523 tensors (40 transformer layers, vision embedder)
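To make the block format concrete, here is a sketch of decoding one MXFP4 block following the OCP definitions of E2M1 and E8M0. The byte layout (scale byte first, then 16 bytes holding two 4-bit codes each, low nibble first) is an assumption made for illustration and is not necessarily how GGUF orders the data on disk.

```python
# The eight non-negative magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(byte):
    """Decode an E8M0 scale: a pure power of two, 2**(byte - 127); 0xFF encodes NaN."""
    return float("nan") if byte == 0xFF else 2.0 ** (byte - 127)


def decode_block(block):
    """Decode 17 bytes (1 scale byte + 16 packed bytes) into 32 floats."""
    assert len(block) == 17
    scale = decode_e8m0(block[0])
    values = []
    for packed in block[1:]:
        values.append(decode_e2m1(packed & 0x0F) * scale)  # low nibble first (assumed layout)
        values.append(decode_e2m1(packed >> 4) * scale)
    return values


# 32 four-bit values plus one 8-bit scale: 17 bytes per 32 weights.
bits_per_weight = 17 * 8 / 32
print(bits_per_weight)  # 4.25
```

The format itself costs 4.25 bits per weight. The reported 4.41 BPW for the whole file is higher, presumably because the Q6_K output tensor and any other tensors kept at higher precision raise the average.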

Hardware Compatibility

  • MXFP4 is supported on NVIDIA Blackwell (RTX 50 series) via native FP4 MMA instructions
  • Falls back to software dequantization on other GPU architectures and CPU
  • Cross-vendor compatible format per OCP Microscaling specification


Source: Hugging Face