forkjoin-ai/qwen2.5-72b-instruct-gguf overview
Qwen2.5 72b Instruct GGUF, Q4 K M Production ready GGUF quantization of Qwen/Qwen2.5 72B Instruct https://huggingface.co/Qwen/Qwen2.5 72B Instruct for distribu…
Runs locally from ~44.16 GB disk (32 GB+ VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Qwen2.5-72B-Instruct-Q4_K_M.gguf | GGUF | Q4_K_M | 44.16 GB | Download |
Model Details
| Model ID | forkjoin-ai/qwen2.5-72b-instruct-gguf |
|---|---|
| Author | forkjoin-ai |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | Qwen/Qwen2.5-72B-Instruct |
| Last modified | 2026-06-08T21:25:26.000Z |
Model README
---
language:
- en
license: apache-2.0
library_name: gguf
tags:
- gguf
- qwen2
- instruct
- affectively
- edgework
- aether
- distributed-inference
- edge-deployment
base_model: Qwen/Qwen2.5-72B-Instruct
base_model_relation: quantized
pipeline_tag: text-generation
---
Qwen2.5 72b Instruct (GGUF, Q4_K_M)
> Production-ready GGUF quantization of Qwen/Qwen2.5-72B-Instruct for distributed text generation and conversation — powered by the Aether edge inference runtime on Edgework.ai.
Model Details
| Property | Value |
|----------|-------|
| Base model | Qwen/Qwen2.5-72B-Instruct |
| Parameters | 72B |
| Architecture | Qwen2 |
| Quantization | Q4_K_M |
| Format | GGUF |
| Size | ~43 GB |
| License | apache-2.0 |
Usage
With llama.cpp
./llama-cli -m Qwen2.5-72B-Instruct-Q4_K_M.gguf -p "Your prompt here" -n 256
With Aether (Distributed Inference)
This model is deployed across the Aether distributed inference network. Weights are layer-sharded and distributed across multiple edge nodes for parallel inference.
Also available: .knot (sovereign format)
This repo ships qwen2.5-72b-instruct.knot — the model weights in the KNOT container that the Aether distributed-inference runtime loads natively (the GGUF, when present, sits right beside it). A KNOT is a single self-describing file with a JSON table-of-contents, so any single tensor is one HTTP Range request — ideal for streaming weights to edge nodes.
| | GGUF | KNOT |
|---|---|---|
| Container | format-specific header | single file, JSON table-of-contents |
| Per-tensor fetch | whole-file oriented | one tensor = one Range request |
| Ecosystem | broad (llama.cpp, …) | Aether / Gnosis runtime |
huggingface-cli download forkjoin-ai/qwen2.5-72b-instruct-gguf qwen2.5-72b-instruct.knot --local-dir ./knots
Full format spec: KNOT_FORMAT.md. Inspect the header with bun run open-source/bitwise/scripts/dump-knot.ts qwen2.5-72b-instruct.knot.
Deployment Architecture
This model runs on the Aether distributed inference runtime — a custom engine that shards model layers across multiple nodes for parallel execution:
- Coordinator receives requests and manages token generation
- Layer nodes each hold a subset of model layers (6 nodes for this model)
- Hidden states flow between nodes via gRPC
- Zero cold start via warm pool scheduling
Deployed via Edgework.ai — bringing fast, cheap, and private inference as close to the user as possible.
About
Published by AFFECTIVELY · Managed by @buley
We quantize and publish production-ready models for distributed edge inference via the Aether runtime. Every release is tested for correctness and stability before publication.
Run forkjoin-ai/qwen2.5-72b-instruct-gguf with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models