LibertAIDAI/Nex-N2-mini-GGUF overview
Nex N2 mini GGUF imatrix, fixed chat template Imatrix calibrated GGUF quantizations of nex agi/Nex N2 mini https://huggingface.co/nex agi/Nex N2 mini for llama…
Runs locally from ~861.0 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Nex-N2-mini-IQ4_XS.gguf | GGUF | IQ4_XS | 17.44 GB | Download |
| Nex-N2-mini-Q4_K_M.gguf | GGUF | Q4_K_M | 19.71 GB | Download |
| Nex-N2-mini-Q5_K_M.gguf | GGUF | Q5_K_M | 23.03 GB | Download |
| Nex-N2-mini-Q6_K.gguf | GGUF | Q6_K | 26.56 GB | Download |
| Nex-N2-mini-Q8_0.gguf | GGUF | Q8_0 | 34.37 GB | Download |
| mmproj-Nex-N2-mini-F16.gguf | GGUF | F16 | 861.0 MB | Download |
Model Details
| Model ID | LibertAIDAI/Nex-N2-mini-GGUF |
|---|---|
| Author | LibertAIDAI |
| Pipeline | image-text-to-text |
| License | apache-2.0 |
| Base model | nex-agi/Nex-N2-mini |
| Last modified | 2026-06-12T08:50:19.000Z |
Model README
---
license: apache-2.0
base_model: nex-agi/Nex-N2-mini
quantized_by: LibertAIDAI
tags:
- gguf
- llama.cpp
- imatrix
- nex-n2
- moe
- multimodal
language:
- en
pipeline_tag: image-text-to-text
---
Nex-N2-mini GGUF (imatrix, fixed chat template)
Imatrix-calibrated GGUF quantizations of nex-agi/Nex-N2-mini for llama.cpp — with a fixed chat template so reasoning extraction and tool calling work out of the box (see below).
Nex-N2-mini is a 35B-total / ~3B-active MoE (256 experts, 8 active) with hybrid linear attention, vision input, and "Agentic Thinking" adaptive reasoning. Apache 2.0.
> Looking for Blackwell-optimized files? See LibertAIDAI/Nex-N2-mini-NVFP4-GGUF — NVFP4 expert tensors with native tensor-core kernels on RTX 50-series / B100/B200, faster batched serving than Q4_K_M on those GPUs.
Why these quants? Fixed chat template
The upstream chat template prefills the assistant turn with '<think>' (no trailing newline) while rendering past assistant reasoning as '<think>\n…'. This inconsistency breaks llama.cpp's reasoning parser: the forced-open think block is never recognized, so the full chain-of-thought (plus a stray </think>) leaks into content instead of reasoning_content — on every llama.cpp build, regardless of --reasoning-format. Community GGUFs that embed the upstream template inherit this bug.
These files embed a corrected template (one added newline). With stock llama-server --jinja:
reasoning_content/contentare separated correctly,- tool calls parse into structured
tool_calls, - no extra flags needed.
All quants below (except Q8_0, which doesn't use it) were quantized with an importance matrix computed from the BF16 weights over a diverse ~64k-token calibration set (the imatrix file is included in this repo).
About LibertAI
LibertAI is a decentralized AI platform — private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers. No accounts required to chat, no logs sent home, and the same models you'd self-host are available behind a sovereign endpoint.
If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw — Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.
Files
| File | Size | When to pick |
|------|------|--------------|
| Nex-N2-mini-IQ4_XS.gguf | 18.7 GB | Smallest — fits a 24 GB GPU with long context |
| Nex-N2-mini-Q4_K_M.gguf | 21.2 GB | Recommended — best size/quality balance |
| Nex-N2-mini-Q5_K_M.gguf | 24.7 GB | Higher quality, still fits 32 GB GPUs |
| Nex-N2-mini-Q6_K.gguf | 28.5 GB | Near-lossless |
| Nex-N2-mini-Q8_0.gguf | 36.9 GB | Highest quality (needs >32 GB VRAM or partial offload) |
| mmproj-Nex-N2-mini-F16.gguf | 903 MB | Required for image input — works with all of the above |
| Nex-N2-mini.imatrix | 192 MB | The importance matrix used (for making your own quants) |
Usage
Text-only (CLI)
llama-cli -m Nex-N2-mini-Q4_K_M.gguf -ngl 999 -c 8192 -p "Your prompt here"
Multimodal (server, vision + text)
llama-server \
-m Nex-N2-mini-Q4_K_M.gguf \
--mmproj mmproj-Nex-N2-mini-F16.gguf \
-ngl 999 -c 32768 --jinja \
--host 0.0.0.0 --port 8080
Then POST to /v1/chat/completions — reasoning arrives in reasoning_content, answers in content, tool calls in tool_calls. To disable thinking, set chat_template_kwargs: {"enable_thinking": false} in the request.
About the architecture
Nex-N2-mini is built on the Qwen3.5-MoE architecture (qwen35moe in GGUF): 40 layers, 3 of every 4 using linear attention with every 4th full attention, 256 routed experts (8 active) plus a shared expert. The upstream config declares a 1-layer MTP head, but the published checkpoints do not include MTP weights, so no MTP/speculative variant can be produced from public weights.
Sources & credits
- Base model: nex-agi/Nex-N2-mini by Nex AGI — Apache 2.0
- Calibration data for the imatrix: bartowski's
calibration_datav3 - Tooling: llama.cpp
convert_hf_to_gguf.py,llama-imatrix,llama-quantize
License
Apache 2.0, inherited from the upstream model.
Run LibertAIDAI/Nex-N2-mini-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models