GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF overview

GLM 4.7 Flash MXFP4 MOE GGUF GGUF quantization of zai org/GLM 4.7 Flash https://huggingface.co/zai org/GLM 4.7 Flash — a 30B parameter Mixture of Experts langu…

ggufglmglm-4.7-flashmoemixture-of-expertsdeepseek2mxfp4_moeMXFP4 MOEfp4text-generationenzhbase_model:zai-org/GLM-4.7-Flashbase_model:quantized:zai-org/GLM-4.7-Flashlicense:mitregion:usconversational

Runs locally from ~15.80 GB disk (16 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
0
Likes
0
Pipeline
text-generation

Repository Files & Downloads

1 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
glm-4.7-flash-mxfp4_moe.ggufGGUFGGUF15.80 GBDownload

Model Details

Model IDFreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF
AuthorFreedomAISVR
Pipelinetext-generation
Licensemit
Base modelzai-org/GLM-4.7-Flash
Last modified2026-06-11T00:16:33.000Z

Model README

---

license: mit

language:

  • en
  • zh

library_name: gguf

tags:

  • gguf
  • glm
  • glm-4.7-flash
  • moe
  • mixture-of-experts
  • deepseek2
  • mxfp4_moe
  • MXFP4 MOE
  • fp4

base_model: zai-org/GLM-4.7-Flash

pipeline_tag: text-generation

inference: false

quantized_by: FreedomAI

---

GLM-4.7-Flash-MXFP4 MOE-GGUF

GGUF quantization of zai-org/GLM-4.7-Flash — a 30B-parameter Mixture-of-Experts language model with ~3.2B active parameters per token, built on the DeepSeek2 architecture with Multi-head Latent Attention (MLA) and 64 routed experts.

Quantized to MXFP4 MOE format for efficient inference with minimal quality loss.

About MXFP4 MOE

MXFP4 (Microscaling FP4, E2M1) is an open standard 4-bit format under the OCP Microscaling Formats (MX) specification. In MXFP4_MOE mode, expert weights are stored in MXFP4 while non-expert tensors (attention, embeddings, norms) remain at Q8_0, balancing quality and compression for Mixture-of-Experts models. Works on any GPU or CPU without hardware-specific acceleration.

Files

| Filename | Type | Size | Description |

|----------|------|------|-------------|

| glm-4.7-flash-mxfp4_moe.gguf | GGUF (MXFP4 MOE) | 15.8 GB | Quantized model weights |

| README.md | Markdown | - | Model card |

Quantization Details

| Property | Value |

|----------|-------|

| Format | MXFP4 MOE |

| Bits Per Weight | 4.53 BPW |

| File Size | 15.8 GB |

| Tensor Count | 844 |

| Architecture | DeepSeek2 (custom for GLM-4.7-Flash) |

Model Description

  • Developer: Zhipu AI
  • Architecture: Mixture-of-Experts (MoE) with DeepSeek2-style MLA
  • Parameters: ~30B total, ~3.2B active per token
  • Context Length: 200,000 tokens
  • Layers: 47 transformer layers
  • Attention: Multi-head Latent Attention (q_lora_rank=768, kv_lora_rank=512)
  • Experts: 64 routed experts (4 per token) + 1 shared expert
  • Vocab Size: 151,936
  • Languages: English, Chinese
  • Thinking: Enabled by default (native <think>/</think> tokens, hidden in history for clean multi-turn reasoning)
  • Pipeline: text-generation only (no vision encoder)

Usage

llama.cpp

# Basic generation
./llama-cli -m glm-4.7-flash-mxfp4_moe.gguf \
  -p "Hello, how are you?" \
  -n 256

# With thinking/reasoning controlled
./llama-cli -m glm-4.7-flash-mxfp4_moe.gguf \
  -p "Solve this step by step: 23 * 47" \
  -n 512 \
  -no-cnv

HuggingFace Hub

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF",
    filename="glm-4.7-flash-mxfp4_moe.gguf",
    repo_type="model"
)

Pipeline Commands

Source: zai-org/GLM-4.7-Flash (58 GB, 48 safetensor shards)

  1. F16 GGUF Conversion:

```powershell

python convert_hf_to_gguf.py D:\AI_MODELS\glm-4.7-src --outfile glm-4.7-f16.gguf --outtype f16

```

Output: 55.79 GB, 844 tensors (DeepSeek2 arch, Glm4MoeLiteModel)

  1. MXFP4 MOE Quantization:

```powershell

llama-quantize.exe glm-4.7-f16.gguf glm-4.7-flash-mxfp4_moe.gguf MXFP4_MOE

```

Duration: ~310s on RTX 5060 Ti

Hardware

| Component | Specification |

|-----------|---------------|

| GPU | NVIDIA RTX 5060 Ti 16 GB (Blackwell) |

| System RAM | 64 GB |

| Storage | D: (NVMe) |

License

MIT — same as the original zai-org/GLM-4.7-Flash.

Run FreedomAISVR/GLM-4.7-Flash-MXFP4-MOE-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models