GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

LibertAIDAI/Nex-N2-mini-GGUF overview

Nex N2 mini GGUF imatrix, fixed chat template Imatrix calibrated GGUF quantizations of nex agi/Nex N2 mini https://huggingface.co/nex agi/Nex N2 mini for llama…

ggufllama.cppimatrixnex-n2moemultimodalimage-text-to-textenbase_model:nex-agi/Nex-N2-minibase_model:quantized:nex-agi/Nex-N2-minilicense:apache-2.0endpoints_compatibleregion:usconversational

Runs locally from ~861.0 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
0
Likes
0
Pipeline
image-text-to-text

Repository Files & Downloads

6 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
Nex-N2-mini-IQ4_XS.ggufGGUFIQ4_XS17.44 GBDownload
Nex-N2-mini-Q4_K_M.ggufGGUFQ4_K_M19.71 GBDownload
Nex-N2-mini-Q5_K_M.ggufGGUFQ5_K_M23.03 GBDownload
Nex-N2-mini-Q6_K.ggufGGUFQ6_K26.56 GBDownload
Nex-N2-mini-Q8_0.ggufGGUFQ8_034.37 GBDownload
mmproj-Nex-N2-mini-F16.ggufGGUFF16861.0 MBDownload

Model Details

Model IDLibertAIDAI/Nex-N2-mini-GGUF
AuthorLibertAIDAI
Pipelineimage-text-to-text
Licenseapache-2.0
Base modelnex-agi/Nex-N2-mini
Last modified2026-06-12T08:50:19.000Z

Model README

---

license: apache-2.0

base_model: nex-agi/Nex-N2-mini

quantized_by: LibertAIDAI

tags:

- gguf

- llama.cpp

- imatrix

- nex-n2

- moe

- multimodal

language:

- en

pipeline_tag: image-text-to-text

---

Nex-N2-mini GGUF (imatrix, fixed chat template)

Imatrix-calibrated GGUF quantizations of nex-agi/Nex-N2-mini for llama.cpp — with a fixed chat template so reasoning extraction and tool calling work out of the box (see below).

Nex-N2-mini is a 35B-total / ~3B-active MoE (256 experts, 8 active) with hybrid linear attention, vision input, and "Agentic Thinking" adaptive reasoning. Apache 2.0.

> Looking for Blackwell-optimized files? See LibertAIDAI/Nex-N2-mini-NVFP4-GGUF — NVFP4 expert tensors with native tensor-core kernels on RTX 50-series / B100/B200, faster batched serving than Q4_K_M on those GPUs.

Why these quants? Fixed chat template

The upstream chat template prefills the assistant turn with '<think>' (no trailing newline) while rendering past assistant reasoning as '<think>\n…'. This inconsistency breaks llama.cpp's reasoning parser: the forced-open think block is never recognized, so the full chain-of-thought (plus a stray </think>) leaks into content instead of reasoning_content — on every llama.cpp build, regardless of --reasoning-format. Community GGUFs that embed the upstream template inherit this bug.

These files embed a corrected template (one added newline). With stock llama-server --jinja:

  • reasoning_content / content are separated correctly,
  • tool calls parse into structured tool_calls,
  • no extra flags needed.

All quants below (except Q8_0, which doesn't use it) were quantized with an importance matrix computed from the BF16 weights over a diverse ~64k-token calibration set (the imatrix file is included in this repo).

About LibertAI

LibertAI is a decentralized AI platform — private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers. No accounts required to chat, no logs sent home, and the same models you'd self-host are available behind a sovereign endpoint.

If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw — Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.

Files

| File | Size | When to pick |

|------|------|--------------|

| Nex-N2-mini-IQ4_XS.gguf | 18.7 GB | Smallest — fits a 24 GB GPU with long context |

| Nex-N2-mini-Q4_K_M.gguf | 21.2 GB | Recommended — best size/quality balance |

| Nex-N2-mini-Q5_K_M.gguf | 24.7 GB | Higher quality, still fits 32 GB GPUs |

| Nex-N2-mini-Q6_K.gguf | 28.5 GB | Near-lossless |

| Nex-N2-mini-Q8_0.gguf | 36.9 GB | Highest quality (needs >32 GB VRAM or partial offload) |

| mmproj-Nex-N2-mini-F16.gguf | 903 MB | Required for image input — works with all of the above |

| Nex-N2-mini.imatrix | 192 MB | The importance matrix used (for making your own quants) |

Usage

Text-only (CLI)

llama-cli -m Nex-N2-mini-Q4_K_M.gguf -ngl 999 -c 8192 -p "Your prompt here"

Multimodal (server, vision + text)

llama-server \
  -m Nex-N2-mini-Q4_K_M.gguf \
  --mmproj mmproj-Nex-N2-mini-F16.gguf \
  -ngl 999 -c 32768 --jinja \
  --host 0.0.0.0 --port 8080

Then POST to /v1/chat/completions — reasoning arrives in reasoning_content, answers in content, tool calls in tool_calls. To disable thinking, set chat_template_kwargs: {"enable_thinking": false} in the request.

About the architecture

Nex-N2-mini is built on the Qwen3.5-MoE architecture (qwen35moe in GGUF): 40 layers, 3 of every 4 using linear attention with every 4th full attention, 256 routed experts (8 active) plus a shared expert. The upstream config declares a 1-layer MTP head, but the published checkpoints do not include MTP weights, so no MTP/speculative variant can be produced from public weights.

Sources & credits

  • Base model: nex-agi/Nex-N2-mini by Nex AGI — Apache 2.0
  • Calibration data for the imatrix: bartowski's calibration_datav3
  • Tooling: llama.cpp convert_hf_to_gguf.py, llama-imatrix, llama-quantize

License

Apache 2.0, inherited from the upstream model.

Run LibertAIDAI/Nex-N2-mini-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models