What is paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF?

--- base_model: - nex-agi/Nex-N2-Pro pipeline_tag: image-text-to-text --- Quants of Nex-N2-Pro, a fine tune built on Qwen 3.5 397B A17B. Basically the Qwen 3.6 397B that we never got. Comes with mmproj for vision, but isn't shipped with MTP. All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant. Specific quant details: IQ5_KS - ik fork only - Only works on ik_llama.cpp, targets a 256GB RAM system + nvidia GPU 24/32GB. - Will eat 20822MB of VRAM and 214GB of RAM with this config (needs a strong CPU, like 9950x3d, or PP will be slower): ``` ./build/bin/llama-server -m pmodels/Nex-397B-A17B-IQ5_KS.gguf --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf --no-mmproj-offload -a NexQ8 --slot-save-path slots --context-shift off -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" -ot "token_embd\.weight=CPU" -c 196608 --ctx-checkpoints 12 -…

What license applies to paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF?

License: See model card. Verify terms on Hugging Face before commercial use.

How do I run paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF locally?

Download a GGUF file from this page and load it in guIDE or llama.cpp. Pipeline task: image-text-to-text.

Model Intelligence Sheet

paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF overview

Quants of Nex N2 Pro, a fine tune built on Qwen 3.5 397B A17B. Basically the Qwen 3.6 397B that we never got. Comes with mmproj for vision, but isn't shipped w…

ggufimage-text-to-textbase_model:nex-agi/Nex-N2-Probase_model:quantized:nex-agi/Nex-N2-Proendpoints_compatibleregion:usimatrixconversational

Runs locally from ~787.3 MB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads

10,426

Likes

Pipeline

image-text-to-text

Author

paragon-of-brah

Repository Files & Downloads

53 GGUF files detected

Direct downloads for local inference

File	Type	Quantization	Size	Link
IQ1_M/Nex-397B-A17B-IQ1_M-00001-of-00005.gguf	GGUF	IQ1_M	18.37 GB	Download
IQ1_M/Nex-397B-A17B-IQ1_M-00002-of-00005.gguf	GGUF	IQ1_M	18.62 GB	Download
IQ1_M/Nex-397B-A17B-IQ1_M-00003-of-00005.gguf	GGUF	IQ1_M	18.21 GB	Download
IQ1_M/Nex-397B-A17B-IQ1_M-00004-of-00005.gguf	GGUF	IQ1_M	18.61 GB	Download
IQ1_M/Nex-397B-A17B-IQ1_M-00005-of-00005.gguf	GGUF	IQ1_M	16.21 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00001-of-00009.gguf	GGUF	IQ2_M	17.27 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00002-of-00009.gguf	GGUF	IQ2_M	15.26 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00003-of-00009.gguf	GGUF	IQ2_M	15.26 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00004-of-00009.gguf	GGUF	IQ2_M	15.29 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00005-of-00009.gguf	GGUF	IQ2_M	15.25 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00006-of-00009.gguf	GGUF	IQ2_M	15.26 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00007-of-00009.gguf	GGUF	IQ2_M	15.26 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00008-of-00009.gguf	GGUF	IQ2_M	15.30 GB	Download
IQ2_M/Nex-397B-A17B-IQ2_M-00009-of-00009.gguf	GGUF	IQ2_M	8.72 GB	Download
IQ3_M/Nex-397B-A17B-IQ3_M-00001-of-00010.gguf	GGUF	IQ3_M	18.62 GB	Download
IQ3_M/Nex-397B-A17B-IQ3_M-00002-of-00010.gguf	GGUF	IQ3_M	18.47 GB	Download
IQ3_M/Nex-397B-A17B-IQ3_M-00003-of-00010.gguf	GGUF	IQ3_M	18.56 GB	Download
IQ3_M/Nex-397B-A17B-IQ3_M-00004-of-00010.gguf	GGUF	IQ3_M	18.36 GB	Download
IQ3_M/Nex-397B-A17B-IQ3_M-00005-of-00010.gguf	GGUF	IQ3_M	18.50 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00001-of-00009.gguf	GGUF	IQ3_XXS	18.61 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00002-of-00009.gguf	GGUF	IQ3_XXS	18.59 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00003-of-00009.gguf	GGUF	IQ3_XXS	17.88 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00004-of-00009.gguf	GGUF	IQ3_XXS	18.54 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00005-of-00009.gguf	GGUF	IQ3_XXS	17.88 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00006-of-00009.gguf	GGUF	IQ3_XXS	18.54 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00007-of-00009.gguf	GGUF	IQ3_XXS	17.88 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00008-of-00009.gguf	GGUF	IQ3_XXS	18.57 GB	Download
IQ3_XXS/Nex-397B-A17B-IQ3_XXS-00009-of-00009.gguf	GGUF	IQ3_XXS	787.3 MB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00001-of-00011.gguf	GGUF	IQ4_KSS	18.04 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00002-of-00011.gguf	GGUF	IQ4_KSS	17.98 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00003-of-00011.gguf	GGUF	IQ4_KSS	17.85 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00004-of-00011.gguf	GGUF	IQ4_KSS	18.03 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00005-of-00011.gguf	GGUF	IQ4_KSS	17.98 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00006-of-00011.gguf	GGUF	IQ4_KSS	17.85 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00007-of-00011.gguf	GGUF	IQ4_KSS	17.98 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00008-of-00011.gguf	GGUF	IQ4_KSS	18.03 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00009-of-00011.gguf	GGUF	IQ4_KSS	17.85 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00010-of-00011.gguf	GGUF	IQ4_KSS	17.98 GB	Download
IQ4_KSS(ik)/Nex-397B-A17B-IQ4_KSS-00011-of-00011.gguf	GGUF	IQ4_KSS	12.51 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00001-of-00013.gguf	GGUF	IQ5_KS	18.03 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00002-of-00013.gguf	GGUF	IQ5_KS	18.57 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00003-of-00013.gguf	GGUF	IQ5_KS	18.09 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00004-of-00013.gguf	GGUF	IQ5_KS	18.21 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00005-of-00013.gguf	GGUF	IQ5_KS	18.33 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00006-of-00013.gguf	GGUF	IQ5_KS	18.33 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00007-of-00013.gguf	GGUF	IQ5_KS	18.33 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00008-of-00013.gguf	GGUF	IQ5_KS	18.21 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00009-of-00013.gguf	GGUF	IQ5_KS	18.33 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00010-of-00013.gguf	GGUF	IQ5_KS	18.33 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00011-of-00013.gguf	GGUF	IQ5_KS	18.33 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00012-of-00013.gguf	GGUF	IQ5_KS	18.21 GB	Download
IQ5_KS(ik)/Nex-N2-Pro-397B-A17B-IQ5_KS-00013-of-00013.gguf	GGUF	IQ5_KS	2.33 GB	Download
Nex-397B-A17B-BF16-mmproj.gguf	GGUF	BF16	879.0 MB	Download

Model Details

Model ID	paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF
Author	paragon-of-brah
Pipeline	image-text-to-text
License	—
Base model	nex-agi/Nex-N2-Pro
Last modified	2026-06-14T06:18:29.000Z

Model README

---

base_model:

nex-agi/Nex-N2-Pro

pipeline_tag: image-text-to-text

---

Quants of Nex-N2-Pro, a fine tune built on Qwen 3.5 397B A17B. Basically the Qwen 3.6 397B that we never got.

Comes with mmproj for vision, but isn't shipped with MTP.

All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant.

Specific quant details:

Only works on ik_llama.cpp, targets a 256GB RAM system + nvidia GPU 24/32GB.
Will eat 20822MB of VRAM and 214GB of RAM with this config (needs a strong CPU, like 9950x3d, or PP will be slower):

```

./build/bin/llama-server

-m pmodels/Nex-397B-A17B-IQ5_KS.gguf

--mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf

--no-mmproj-offload

-a NexQ8

--slot-save-path slots

--context-shift off

-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn._exps.=CPU"

-ot "token_embd\.weight=CPU"

-c 196608

--ctx-checkpoints 12

--ctx-checkpoints-interval 512

--ctx-checkpoints-tolerance 4

--parallel 1

-cram 0

-b 4096 -ub 4096

-wgt 1

-ctk q8_0 -ctv q8_0

-khad -vhad

-mqkv

--threads 7 --threads-batch 8 -ngl 100

-cuda fusion=1,offload-batch-size=1000,mmq-id-size=0,fa-offset=0

--host 127.0.0.1

--port 8080

--webui none

--jinja

```

Will eat 23500MB of VRAM and 214GB of RAM with this config (increases PP speed for weaker CPUs at the cost of more VRAM usage):

```

./build/bin/llama-server

-m pmodels/Nex-397B-A17B-IQ5_KS.gguf

--mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf

--no-mmproj-offload

-a NexQ8

--slot-save-path slots

--context-shift off

-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn._exps.=CPU"

-ot "token_embd\.weight=CPU"

-c 196608

--ctx-checkpoints 12

--ctx-checkpoints-interval 512

--ctx-checkpoints-tolerance 4

--parallel 1

-cram 0

-b 4096 -ub 4096

-wgt 1

-ctk q8_0 -ctv q8_0

-khad -vhad

-mqkv

--threads 7 --threads-batch 8 -ngl 100

-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

--host 127.0.0.1

--port 8080

--webui none

--jinja

```

Details:

```

## Gated Attention/Delta Net [Blended 0-59]

blk\..*\.attn_gate\.weight=bf16

blk\..*\.attn_qkv\.weight=bf16

blk\..*\.ssm_alpha\.weight=bf16

blk\..*\.ssm_beta\.weight=bf16

blk\..*\.ssm_out\.weight=bf16

# Normal attention

blk\..*\.attn_output\.weight=q8_0

blk\..*\.attn_q\.weight=q8_0

blk\..*\.attn_k\.weight=q8_0

blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]

blk\..*\.ffn_down_shexp\.weight=q8_0

blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]

blk\..*\.ffn_down_exps\.weight=IQ5_KS

blk\..*\.ffn_(gate|up)_exps\.weight=IQ4_KS

# Non-Repeating Layers

token_embd\.weight=q8_0

output\.weight=q8_0

```

---

</details>

Works with ik only, targets a 192GB RAM system + any GPU 24GB.
Will eat 19450MB of VRAM and 182GB of RAM with standard config:

```

./build/bin/llama-server

-m pmodels/Nex-397B-A17B-IQ4_KSS.gguf

--mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf

--no-mmproj-offload

-a NexQ8

--slot-save-path slots

--context-shift off

-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn._exps.=CPU"

-ot "token_embd\.weight=CPU"

-c 196608

--ctx-checkpoints 12

--ctx-checkpoints-interval 512

--ctx-checkpoints-tolerance 4

--parallel 1

-cram 0

-b 4096 -ub 4096

-wgt 1

-ctk q8_0 -ctv q8_0

-khad -vhad

-mqkv

--threads 7 --threads-batch 8 -ngl 100

-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

--host 127.0.0.1

--port 8080

--webui none

--jinja

```

Details:

```

## Gated Attention/Delta Net [Blended 0-59]

blk\..*\.attn_gate\.weight=q8_0

blk\..*\.attn_qkv\.weight=q8_0

blk\..*\.ssm_alpha\.weight=bf16

blk\..*\.ssm_beta\.weight=bf16

blk\..*\.ssm_out\.weight=bf16

# Normal attention

blk\..*\.attn_output\.weight=q8_0

blk\..*\.attn_q\.weight=q8_0

blk\..*\.attn_k\.weight=q8_0

blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]

blk\..*\.ffn_down_shexp\.weight=q8_0

blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]

blk\..*\.ffn_down_exps\.weight=iq4_kss

blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers

token_embd\.weight=q8_0

output\.weight=q8_0

```

---

</details>

<summary>IQ3_M - mainline compatible (Uploading..)</summary>

Works with mainline and ik, targets a 196GB RAM system + any GPU 24GB.
Will eat 19600MB of VRAM and 180GB of RAM with standard config:

```

./build/bin/llama-server

-m pmodels/Nex-397B-A17B-IQ3_M.gguf

--mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf

--no-mmproj-offload

-a NexQ8

--slot-save-path slots

--context-shift off

-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn._exps.=CPU"

-ot "token_embd\.weight=CPU"

-c 196608

--ctx-checkpoints 12

--ctx-checkpoints-interval 512

--ctx-checkpoints-tolerance 4

--parallel 1

-cram 0

-b 4096 -ub 4096

-wgt 1

-ctk q8_0 -ctv q8_0

-khad -vhad

-mqkv

--threads 7 --threads-batch 8 -ngl 100

-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

--host 127.0.0.1

--port 8080

--webui none

--jinja

```

Details:

```

## Gated Attention/Delta Net [Blended 0-59]

blk\..*\.attn_gate\.weight=q8_0

blk\..*\.attn_qkv\.weight=q8_0

blk\..*\.ssm_alpha\.weight=q8_0

blk\..*\.ssm_beta\.weight=q8_0

blk\..*\.ssm_out\.weight=q8_0

# Normal attention

blk\..*\.attn_output\.weight=q8_0

blk\..*\.attn_q\.weight=q8_0

blk\..*\.attn_k\.weight=q8_0

blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]

blk\..*\.ffn_down_shexp\.weight=q8_0

blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]

blk\..*\.ffn_down_exps\.weight=IQ4_XS

blk\..*\.ffn_(gate|up)_exps\.weight=IQ3_S

# Non-Repeating Layers

token_embd\.weight=q8_0

output\.weight=q8_0

```

---

</details>

<summary>IQ3_XXS - mainline compatible</summary>

Works with mainline and ik, targets a 196GB RAM system + any GPU 24GB.
Will eat 18930MB of VRAM and 151GB of RAM with standard config:

```

./build/bin/llama-server

-m pmodels/Nex-397B-A17B-IQ3_XXS.gguf

--mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf

--no-mmproj-offload

-a NexQ8

--slot-save-path slots

--context-shift off

-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn._exps.=CPU"

-ot "token_embd\.weight=CPU"

-c 196608

--ctx-checkpoints 12

--ctx-checkpoints-interval 512

--ctx-checkpoints-tolerance 4

--parallel 1

-cram 0

-b 4096 -ub 4096

-wgt 1

-ctk q8_0 -ctv q8_0

-khad -vhad

-mqkv

--threads 7 --threads-batch 8 -ngl 100

-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

--host 127.0.0.1

--port 8080

--webui none

--jinja

```

Details:

```

## Gated Attention/Delta Net [Blended 0-59]

blk\..*\.attn_gate\.weight=q8_0

blk\..*\.attn_qkv\.weight=q8_0

blk\..*\.ssm_alpha\.weight=q8_0

blk\..*\.ssm_beta\.weight=q8_0

blk\..*\.ssm_out\.weight=q8_0

# Normal attention

blk\..*\.attn_output\.weight=q8_0

blk\..*\.attn_q\.weight=q8_0

blk\..*\.attn_k\.weight=q8_0

blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]

blk\..*\.ffn_down_shexp\.weight=Q6_K

blk\..*\.ffn_(gate|up)_shexp\.weight=Q6_K

# Routed Experts Layers [0-59]

blk\..*\.ffn_down_exps\.weight=IQ3_XXS

blk\..*\.ffn_(gate|up)_exps\.weight=IQ3_XXS

# Non-Repeating Layers

token_embd\.weight=q6_k

output\.weight=q6_k

```

---

</details>

<summary>IQ2_M - mainline compatible</summary>

Works with mainline and ik, targets a 196GB RAM system + any GPU 24GB.
Will eat 19050MB of VRAM and 138GB of RAM with standard config:

```

./build/bin/llama-server

-m pmodels/Nex-397B-A17B-IQ2_M.gguf

--mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf

--no-mmproj-offload

-a NexQ8

--slot-save-path slots

--context-shift off

-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn._exps.=CPU"

-ot "token_embd\.weight=CPU"

-c 196608

--ctx-checkpoints 12

--ctx-checkpoints-interval 512

--ctx-checkpoints-tolerance 4

--parallel 1

-cram 0

-b 4096 -ub 4096

-wgt 1

-ctk q8_0 -ctv q8_0

-khad -vhad

-mqkv

--threads 7 --threads-batch 8 -ngl 100

-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

--host 127.0.0.1

--port 8080

--webui none

--jinja

```

Details:

```

## Gated Attention/Delta Net [Blended 0-59]

blk\..*\.attn_gate\.weight=q8_0

blk\..*\.attn_qkv\.weight=q8_0

blk\..*\.ssm_alpha\.weight=q8_0

blk\..*\.ssm_beta\.weight=q8_0

blk\..*\.ssm_out\.weight=q8_0

# Normal attention

blk\..*\.attn_output\.weight=q8_0

blk\..*\.attn_q\.weight=q8_0

blk\..*\.attn_k\.weight=q8_0

blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]

blk\..*\.ffn_down_shexp\.weight=q8_0

blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]

blk\..*\.ffn_down_exps\.weight=IQ3_XXS

blk\..*\.ffn_(gate|up)_exps\.weight=IQ2_S

# Non-Repeating Layers

token_embd\.weight=q8_0

output\.weight=q8_0

```

---

</details>

<summary>IQ1_M - mainline compatible</summary>

Works with mainline and ik, targets a 128GB RAM system + any GPU 16GB+.
Will eat 14210MB of VRAM and 94GB of RAM with standard config:

```

./build/bin/llama-server

-m pmodels/Nex-397B-A17B-IQ1_M.gguf

--mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf

--no-mmproj-offload

-a NexQ8

--slot-save-path slots

--context-shift off

-ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn._exps.=CPU"

-ot "token_embd\.weight=CPU"

-c 196608

--ctx-checkpoints 12

--ctx-checkpoints-interval 512

--ctx-checkpoints-tolerance 4

--parallel 1

-cram 0

-b 4096 -ub 4096

-wgt 1

-ctk q8_0 -ctv q8_0

-khad -vhad

-mqkv

--threads 7 --threads-batch 8 -ngl 100

-cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0

--host 127.0.0.1

--port 8080

--webui none

--jinja

```

Details:

```

## Gated Attention/Delta Net [Blended 0-59]

blk\..*\.attn_gate\.weight=IQ4_XS

blk\..*\.attn_qkv\.weight=IQ4_XS

blk\..*\.ssm_alpha\.weight=q8_0

blk\..*\.ssm_beta\.weight=q8_0

blk\..*\.ssm_out\.weight=q8_0

# Normal attention

blk\..*\.attn_output\.weight=IQ4_XS

blk\..*\.attn_q\.weight=IQ4_XS

blk\..*\.attn_k\.weight=IQ4_XS

blk\..*\.attn_v\.weight=IQ4_XS

# Shared Expert Layers [0-59]

blk\..*\.ffn_down_shexp\.weight=IQ4_XS

blk\..*\.ffn_(gate|up)_shexp\.weight=IQ4_XS

# Routed Experts Layers [0-59]

blk\..*\.ffn_down_exps\.weight=IQ2_XXS

blk\..*\.ffn_(gate|up)_exps\.weight=IQ1_M

# Non-Repeating Layers

token_embd\.weight=Q6_K

output\.weight=Q6_K

```

---

</details>

---

Every additional 65536 tokens of context window require one additional GB of VRAM at Q8 KV cache.

The model was natively trained on a 262144 ctx window, so if you want to go beyond 262144 you need to use the additional YARN commands (both for ik and mainline):

  --rope-scaling yarn
  --rope-scale N
  --yarn-orig-ctx 262144

Where N is the context ceiling multiplier (2 for 524288, 4 for 1M). Close to no quality loss at scale 2, some quality loss at scale 4.

Run paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models