duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf Q3_K_S GGUF - Free GGUF Download is indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. This page helps you pick a local model for guIDE or other runtimes. See related models in the same shard below.

Model Intelligence Sheet

duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf overview

Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of nvidia/Llama-3.1-Minitron-4B-Width-Base, which is created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. Llama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K. This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: This model is ready for commercial use.

transformersggufimatrixLlama-3.1-Nemotron-Nano-4B-v1.1text-generationenarxiv:2408.11796license:otherregion:usconversational

duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf visual

Downloads

611

Likes

Pipeline

text-generation

Library

transformers

Visibility

Public

Access

Open

Repository Files & Downloads

27 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ1_M.gguf	GGUF	IQ1_M	1.20 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ1_S.gguf	GGUF	IQ1_S	1.13 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_M.gguf	GGUF	IQ2_M	1.60 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_S.gguf	GGUF	IQ2_S	1.51 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_XS.gguf	GGUF	IQ2_XS	1.41 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_XXS.gguf	GGUF	IQ2_XXS	1.31 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_M.gguf	GGUF	IQ3_M	2.03 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_S.gguf	GGUF	IQ3_S	1.97 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_XS.gguf	GGUF	IQ3_XS	1.89 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_XXS.gguf	GGUF	IQ3_XXS	1.75 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ4_NL.gguf	GGUF	IQ4_NL	2.48 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ4_XS.gguf	GGUF	IQ4_XS	2.36 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q2_K.gguf	GGUF	Q2_K	1.71 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q2_K_S.gguf	GGUF	Q2_K_S	1.61 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q3_K_L.gguf	GGUF	Q3_K_L	2.30 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q3_K_M.gguf	GGUF	Q3_K_M	2.14 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q3_K_S.gguf	GGUF	Q3_K_S	1.96 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_0.gguf	GGUF	—	2.47 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_1.gguf	GGUF	—	2.71 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_K_M.gguf	GGUF	Q4_K_M	2.59 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_K_S.gguf	GGUF	Q4_K_S	2.48 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_0.gguf	GGUF	—	2.95 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_1.gguf	GGUF	—	3.19 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_K_M.gguf	GGUF	Q5_K_M	3.01 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_K_S.gguf	GGUF	Q5_K_S	2.95 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q6_K.gguf	GGUF	Q6_K	3.46 GB	Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q8_0.gguf	GGUF	—	4.47 GB	Download

Model Details Live

Model Slug

duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf

Author

duyntnet

Pipeline Task

text-generation

Library

transformers

Created

2025-05-21

Last Modified

2025-05-21

Gated

Private

HF SHA

15a10f3bf59fb89d91a5485a97d60534b877b5a6

License

other

Language

Base Model

Unknown

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "other",
    "language": [
      "en"
    ],
    "pipeline_tag": "text-generation",
    "inference": false,
    "tags": [
      "transformers",
      "gguf",
      "imatrix",
      "Llama-3.1-Nemotron-Nano-4B-v1.1"
    ],
    "frontmatter": {
      "license": "other",
      "language": [
        "en"
      ],
      "pipeline_tag": "text-generation",
      "inference": "false",
      "tags": [
        "transformers",
        "gguf",
        "imatrix",
        "Llama-3.1-Nemotron-Nano-4B-v1.1"
      ]
    },
    "hero_image_url": "",
    "summary": "Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of nvidia/Llama-3.1-Minitron-4B-Width-Base, which is created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. Llama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K. This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: This model is ready for commercial use.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: other\nlanguage:\n- en\npipeline_tag: text-generation\ninference: false\ntags:\n- transformers\n- gguf\n- imatrix\n- Llama-3.1-Nemotron-Nano-4B-v1.1\n---\n\nQuantizations of https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\n\n\n### Open source inference clients/UIs\n* [llama.cpp](https://github.com/ggerganov/llama.cpp)\n* [KoboldCPP](https://github.com/LostRuins/koboldcpp)\n* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)\n* [ollama](https://github.com/ollama/ollama)\n* [jan](https://github.com/janhq/jan)\n\n### Closed source inference clients/UIs\n* [LM Studio](https://lmstudio.ai/)\n* [Backyard AI](https://backyard.ai/)\n* More will be added...\n---\n\n# From original readme\n\nLlama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of [nvidia/Llama-3.1-Minitron-4B-Width-Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base), which is created from Llama 3.1 8B using [our LLM compression technique](https://arxiv.org/abs/2408.11796) and offers improvements in model accuracy and efficiency. It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. \n\nLlama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K.\n\nThis model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints\n\nThis model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: \n- [Llama-3.3-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1)\n- [Llama-3.3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3.3-Nemotron-Super-49B-v1)\n- [Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1)\n\nThis model is ready for commercial use.\n\n\n## Quick Start and Usage Recommendations:\n\n1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as shown in the example below. All instructions should be contained within the user prompt\n2. We recommend setting temperature to `0.6`, and Top P to `0.95` for Reasoning ON mode\n3. We recommend using greedy decoding for Reasoning OFF mode\n4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required\n\nSee the snippet below for usage with Hugging Face Transformers library. Reasoning mode (ON/OFF) is controlled via system prompt. Please see the example below.\nOur code requires the transformers package version to be `4.44.2` or higher.\n\n\n### Example of “Reasoning On:”\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   temperature=0.6,\n   top_p=0.95,\n   **model_kwargs\n)\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"on\"\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}]))\n```\n\n\n### Example of “Reasoning Off:”\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   do_sample=False,\n   **model_kwargs\n)\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"off\"\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}]))\n```\n\nFor some prompts, even though thinking is disabled, the model emergently prefers to think before responding. But if desired, the users can prevent it by pre-filling the assistant response.\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"off\"\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   do_sample=False,\n   **model_kwargs\n)\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}, {\"role\":\"assistant\", \"content\":\"<think>\\n</think>\"}]))\n```\n\n## Running a vLLM Server with Tool-call Support\n\nLlama-3.1-Nemotron-Nano-4B-v1.1 supports tool calling. This HF repo hosts a tool-callilng parser as well as a chat template in Jinja, which can be used to launch a vLLM server.\n\nHere is a shell script example to launch a vLLM server with tool-call support. `vllm/vllm-openai:v0.6.6` or newer should support the model.\n\n```shell\n#!/bin/bash\n\nCWD=$(pwd)\nPORT=5000\ngit clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\ndocker run -it --rm \\\n    --runtime=nvidia \\\n    --gpus all \\\n    --shm-size=16GB \\\n    -p ${PORT}:${PORT} \\\n    -v ${CWD}:${CWD} \\\n    vllm/vllm-openai:v0.6.6 \\\n    --model $CWD/Llama-3.1-Nemotron-Nano-4B-v1.1 \\\n    --trust-remote-code \\\n    --seed 1 \\\n    --host \"0.0.0.0\" \\\n    --port $PORT \\\n    --served-model-name \"Llama-Nemotron-Nano-4B-v1.1\" \\\n    --tensor-parallel-size 1 \\\n    --max-model-len 131072 \\\n    --gpu-memory-utilization 0.95 \\\n    --enforce-eager \\\n    --enable-auto-tool-choice \\\n    --tool-parser-plugin \"${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py\" \\\n    --tool-call-parser \"llama_nemotron_json\" \\\n    --chat-template \"${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja\"\n```\n\nAlternatively, you can use a virtual environment to launch a vLLM server like below.\n\n```console\n$ git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\n\n$ conda create -n vllm python=3.12 -y\n$ conda activate vllm\n\n$ python -m vllm.entrypoints.openai.api_server \\\n  --model Llama-3.1-Nemotron-Nano-4B-v1.1 \\\n  --trust-remote-code \\\n  --seed 1 \\\n  --host \"0.0.0.0\" \\\n  --port 5000 \\\n  --served-model-name \"Llama-Nemotron-Nano-4B-v1.1\" \\\n  --tensor-parallel-size 1 \\\n  --max-model-len 131072 \\\n  --gpu-memory-utilization 0.95 \\\n  --enforce-eager \\\n  --enable-auto-tool-choice \\\n  --tool-parser-plugin \"Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py\" \\\n  --tool-call-parser \"llama_nemotron_json\" \\\n  --chat-template \"Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja\"\n```\n\nAfter launching a vLLM server, you can call the server with tool-call support using a Python script like below.\n\n```python\n>>> from openai import OpenAI\n>>> client = OpenAI(\n        base_url=\"http://0.0.0.0:5000/v1\",\n        api_key=\"dummy\",\n    )\n\n>>> completion = client.chat.completions.create(\n      model=\"Llama-Nemotron-Nano-v1.1\",\n      messages=[\n        {\"role\": \"system\", \"content\": \"detailed thinking on\"},\n        {\"role\": \"user\", \"content\": \"My bill is $100. What will be the amount for 18% tip?\"},\n      ],\n      tools=[\n        {\"type\": \"function\", \"function\": {\"name\": \"calculate_tip\", \"parameters\": {\"type\": \"object\", \"properties\": {\"bill_total\": {\"type\": \"integer\", \"description\": \"The total amount of the bill\"}, \"tip_percentage\": {\"type\": \"integer\", \"description\": \"The percentage of tip to be applied\"}}, \"required\": [\"bill_total\", \"tip_percentage\"]}}},\n        {\"type\": \"function\", \"function\": {\"name\": \"convert_currency\", \"parameters\": {\"type\": \"object\", \"properties\": {\"amount\": {\"type\": \"integer\", \"description\": \"The amount to be converted\"}, \"from_currency\": {\"type\": \"string\", \"description\": \"The currency code to convert from\"}, \"to_currency\": {\"type\": \"string\", \"description\": \"The currency code to convert to\"}}, \"required\": [\"from_currency\", \"amount\", \"to_currency\"]}}},\n      ],\n    )\n\n>>> completion.choices[0].message.content\n'<think>\\nOkay, let\\'s see. The user has a bill of $100 and wants to know the amount of a 18% tip. So, I need to calculate the tip amount. The available tools include calculate_tip, which requires bill_total and tip_percentage. The parameters are both integers. The bill_total is 100, and the tip percentage is 18. So, the function should multiply 100 by 18% and return 18.0. But wait, maybe the user wants the total including the tip? The question says \"the amount for 18% tip,\" which could be interpreted as the tip amount itself. Since the function is called calculate_tip, it\\'s likely that it\\'s designed to compute the tip, not the total. So, using calculate_tip with bill_total=100 and tip_percentage=18 should give the correct result. The other function, convert_currency, isn\\'t relevant here. So, I should call calculate_tip with those values.\\n</think>\\n\\n'\n\n>>> completion.choices[0].message.tool_calls\n[ChatCompletionMessageToolCall(id='chatcmpl-tool-2972d86817344edc9c1e0f9cd398e999', function=Function(arguments='{\"bill_total\": 100, \"tip_percentage\": 18}', name='calculate_tip'), type='function')]\n```",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "imatrix",
    "Llama-3.1-Nemotron-Nano-4B-v1.1",
    "text-generation",
    "en",
    "arxiv:2408.11796",
    "license:other",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 611,
  "gated": false,
  "private": false,
  "last_modified": "2025-05-21T15:31:19.000Z",
  "created_at": "2025-05-21T14:54:41.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "transformers"
}

Source payload excerpt (from Hugging Face API)

{
  "_id": "682de931f3e4b74b54726e00",
  "id": "duyntnet/Llama-3.1-Nemotron-Nano-4B-v1.1-imatrix-GGUF",
  "modelId": "duyntnet/Llama-3.1-Nemotron-Nano-4B-v1.1-imatrix-GGUF",
  "sha": "15a10f3bf59fb89d91a5485a97d60534b877b5a6",
  "createdAt": "2025-05-21T14:54:41.000Z",
  "lastModified": "2025-05-21T15:31:19.000Z",
  "author": "duyntnet",
  "downloads": 611,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "transformers",
  "siblings_count": 29
}

duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf overview

Repository Files & Downloads

Model Details Live

Metadata Inspector

More models in this shard