FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF overview
GLM-4.7-Flash-MXFP4-MOE-GGUF: GGUF quantization of zai-org/GLM-4.7-Flash (https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-parameter Mixture-of-Experts langu…
Runs locally from a ~15.80 GB file on disk, aimed at 16 GB VRAM-class GPUs with llama.cpp / guIDE.
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| glm-4.7-flash-mxfp4_moe.gguf | GGUF | MXFP4 MOE | 15.80 GB | Download |
Model Details
| Model ID | FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF |
|---|---|
| Author | FreedomAISVR |
| Pipeline | text-generation |
| License | mit |
| Base model | zai-org/GLM-4.7-Flash |
| Last modified | 2026-06-11T00:16:33.000Z |
Model README
---
license: mit
language:
- en
- zh
library_name: gguf
tags:
- gguf
- glm
- glm-4.7-flash
- moe
- mixture-of-experts
- deepseek2
- mxfp4_moe
- MXFP4 MOE
- fp4
base_model: zai-org/GLM-4.7-Flash
pipeline_tag: text-generation
inference: false
quantized_by: FreedomAI
---
GLM-4.7-Flash-MXFP4-MOE-GGUF
GGUF quantization of zai-org/GLM-4.7-Flash — a 30B-parameter Mixture-of-Experts language model with ~3.2B active parameters per token, built on the DeepSeek2 architecture with Multi-head Latent Attention (MLA) and 64 routed experts.
Quantized to MXFP4 MOE format for efficient inference with minimal quality loss.
About MXFP4 MOE
MXFP4 (Microscaling FP4, E2M1) is an open standard 4-bit format under the OCP Microscaling Formats (MX) specification. In MXFP4_MOE mode, expert weights are stored in MXFP4 while non-expert tensors (attention, embeddings, norms) remain at Q8_0, balancing quality and compression for Mixture-of-Experts models. Works on any GPU or CPU without hardware-specific acceleration.
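To make the format concrete, here is a minimal sketch of how one MX block could be decoded, following the OCP MX definition of MXFP4: 32 E2M1 elements that share one E8M0 power-of-two scale. The function names are illustrative, and the code takes already-unpacked 4-bit codes as input; how llama.cpp packs those codes into bytes is not covered here.

```python
import math

BLOCK_SIZE = 32  # elements per MX block in the OCP MX specification

# E2M1 magnitudes indexed by the low 3 bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 pick the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_mx_block(scale_e8m0: int, codes: list[int]) -> list[float]:
    """Decode a block of 32 E2M1 codes that share one E8M0 scale (bias 127)."""
    if len(codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(codes)}")
    if scale_e8m0 == 0xFF:
        # In E8M0 the all-ones pattern encodes NaN.
        return [math.nan] * BLOCK_SIZE
    scale = 2.0 ** (scale_e8m0 - 127)
    return [decode_e2m1(c) * scale for c in codes]
```

With a scale byte of 127 the shared scale is 1, so the codes map directly onto the eight E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) and their negatives.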
Files
| Filename | Type | Size | Description |
|----------|------|------|-------------|
| glm-4.7-flash-mxfp4_moe.gguf | GGUF (MXFP4 MOE) | 15.8 GB | Quantized model weights |
| README.md | Markdown | - | Model card |
Quantization Details
| Property | Value |
|----------|-------|
| Format | MXFP4 MOE |
| Bits Per Weight | 4.53 BPW |
| File Size | 15.8 GB |
| Tensor Count | 844 |
| Architecture | DeepSeek2 (custom for GLM-4.7-Flash) |
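The 4.53 BPW figure can be sanity-checked with a rough back-of-the-envelope sketch. It assumes only two storage types are present: MXFP4 at 4.25 bits per weight (32 four-bit elements plus one 8-bit scale) and Q8_0 at 8.5 bits per weight (32 eight-bit elements plus one 16-bit scale). Tensors kept at any other precision are ignored, so the result is an estimate rather than a measurement.

```python
# Bits per weight for one 32-element block of each type (assumed block layouts).
MXFP4_BPW = (32 * 4 + 8) / 32    # 4.25
Q8_0_BPW = (32 * 8 + 16) / 32    # 8.5


def mxfp4_fraction(blended_bpw: float) -> float:
    """Share of weights stored as MXFP4, assuming the rest are Q8_0.

    Solves blended = f * MXFP4_BPW + (1 - f) * Q8_0_BPW for f.
    """
    return (Q8_0_BPW - blended_bpw) / (Q8_0_BPW - MXFP4_BPW)


fraction = mxfp4_fraction(4.53)
print(f"Estimated share of weights in MXFP4: {fraction:.1%}")
```

Under these assumptions roughly 93% of the weights sit in MXFP4 tensors, which fits a model whose parameters are dominated by expert weights.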
Model Description
- Developer: Zhipu AI
- Architecture: Mixture-of-Experts (MoE) with DeepSeek2-style MLA
- Parameters: ~30B total, ~3.2B active per token
- Context Length: 200,000 tokens
- Layers: 47 transformer layers
- Attention: Multi-head Latent Attention (q_lora_rank=768, kv_lora_rank=512)
- Experts: 64 routed experts (4 per token) + 1 shared expert
- Vocab Size: 151,936
- Languages: English, Chinese
- Thinking: Enabled by default (native `<think>`/`</think>` tokens, hidden in history for clean multi-turn reasoning)
- Pipeline: text-generation only (no vision encoder)
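The expert layout listed above (64 routed experts, 4 selected per token, 1 shared expert that always runs) can be illustrated with a generic top-k routing sketch. This is not the model's actual gating code: the softmax over the selected scores, the toy scalar "experts", and every name below are assumptions chosen for illustration.

```python
import math
import random

NUM_ROUTED_EXPERTS = 64  # from the model description
EXPERTS_PER_TOKEN = 4    # from the model description


def route(router_logits: list[float], k: int = EXPERTS_PER_TOKEN) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, routed_experts, shared_expert):
    """Shared expert output plus the weighted outputs of the selected routed experts."""
    y = shared_expert(x)
    for idx, weight in route(router_logits):
        y += weight * routed_experts[idx](x)
    return y


# Toy usage: scalar "experts" that just scale their input.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_ROUTED_EXPERTS)]
selected = route(logits)
output = moe_forward(1.0, logits, experts, shared_expert=lambda x: x)
print(selected, output)
```

Only 4 of the 64 routed experts run for each token, which is why the active parameter count (~3.2B) is far below the total (~30B).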
Usage
llama.cpp
# Basic generation
./llama-cli -m glm-4.7-flash-mxfp4_moe.gguf \
    -p "Hello, how are you?" \
    -n 256

# Single-shot prompt with conversation mode disabled (-no-cnv)
./llama-cli -m glm-4.7-flash-mxfp4_moe.gguf \
    -p "Solve this step by step: 23 * 47" \
    -n 512 \
    -no-cnv
HuggingFace Hub
from huggingface_hub import hf_hub_download

# Download the GGUF file into the local Hugging Face cache and get its path
model_path = hf_hub_download(
    repo_id="FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF",
    filename="glm-4.7-flash-mxfp4_moe.gguf",
    repo_type="model",
)
print(model_path)
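After downloading, a quick way to confirm the file is a GGUF container is to read its fixed-size header. The sketch below assumes the GGUF version 2 or later layout (4-byte magic `GGUF`, little-endian uint32 version, uint64 tensor count, uint64 metadata key-value count). `read_gguf_header` is a hypothetical helper written for this example, not part of huggingface_hub or llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header fields without loading any tensor data."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<" means little-endian with no padding: I = uint32, Q = uint64.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling `read_gguf_header(model_path)` on the downloaded file should report a tensor count matching the 844 tensors listed under Quantization Details.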
Pipeline Commands
Source: zai-org/GLM-4.7-Flash (58 GB, 48 safetensors shards)
- F16 GGUF Conversion:
```powershell
python convert_hf_to_gguf.py D:\AI_MODELS\glm-4.7-src --outfile glm-4.7-f16.gguf --outtype f16
```
Output: 55.79 GB, 844 tensors (DeepSeek2 arch, Glm4MoeLiteModel)
- MXFP4 MOE Quantization:
```powershell
llama-quantize.exe glm-4.7-f16.gguf glm-4.7-flash-mxfp4_moe.gguf MXFP4_MOE
```
Duration: ~310s on RTX 5060 Ti
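The reported sizes can be cross-checked against the quoted bits per weight. Assuming the F16 file stores essentially all weights at 16 bits and that metadata overhead is negligible, the size ratio gives an effective BPW for the quantized file. The helper name is illustrative.

```python
F16_SIZE_GB = 55.79    # F16 GGUF output size reported above
QUANT_SIZE_GB = 15.8   # MXFP4 MOE GGUF size reported above


def effective_bpw(quant_size: float, f16_size: float, f16_bits: float = 16.0) -> float:
    """Estimate bits per weight from the ratio of quantized to F16 file size."""
    return f16_bits * quant_size / f16_size


bpw = effective_bpw(QUANT_SIZE_GB, F16_SIZE_GB)
ratio = F16_SIZE_GB / QUANT_SIZE_GB
print(f"compression ratio ~{ratio:.2f}x, effective ~{bpw:.2f} bits per weight")
```

The estimate lands close to the 4.53 BPW listed under Quantization Details, so the two file sizes and the BPW figure are mutually consistent.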
Hardware
| Component | Specification |
|-----------|---------------|
| GPU | NVIDIA RTX 5060 Ti 16 GB (Blackwell) |
| System RAM | 64 GB |
| Storage | D: (NVMe) |
License
MIT — same as the original zai-org/GLM-4.7-Flash.
Source: Hugging Face