GraySoft
Projects Models About FAQ Contact Download guIDE →

duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf Q3_K_S GGUF - Free GGUF Download is indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. This page helps you pick a local model for guIDE or other runtimes. See related models in the same shard below.

Model Intelligence Sheet

duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf overview

Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of nvidia/Llama-3.1-Minitron-4B-Width-Base, which is created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. Llama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K. This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: This model is ready for commercial use.

transformersggufimatrixLlama-3.1-Nemotron-Nano-4B-v1.1text-generationenarxiv:2408.11796license:otherregion:usconversational
duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf visual
Downloads
611
Likes
0
Pipeline
text-generation
Library
transformers
Visibility
Public
Access
Open

Repository Files & Downloads

27 files detected
Direct downloads for all repository files
FileTypeQuantizationSizeLink
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ1_M.gguf GGUF IQ1_M 1.20 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ1_S.gguf GGUF IQ1_S 1.13 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_M.gguf GGUF IQ2_M 1.60 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_S.gguf GGUF IQ2_S 1.51 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_XS.gguf GGUF IQ2_XS 1.41 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ2_XXS.gguf GGUF IQ2_XXS 1.31 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_M.gguf GGUF IQ3_M 2.03 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_S.gguf GGUF IQ3_S 1.97 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_XS.gguf GGUF IQ3_XS 1.89 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ3_XXS.gguf GGUF IQ3_XXS 1.75 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ4_NL.gguf GGUF IQ4_NL 2.48 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-IQ4_XS.gguf GGUF IQ4_XS 2.36 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q2_K.gguf GGUF Q2_K 1.71 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q2_K_S.gguf GGUF Q2_K_S 1.61 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q3_K_L.gguf GGUF Q3_K_L 2.30 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q3_K_M.gguf GGUF Q3_K_M 2.14 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q3_K_S.gguf GGUF Q3_K_S 1.96 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_0.gguf GGUF 2.47 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_1.gguf GGUF 2.71 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_K_M.gguf GGUF Q4_K_M 2.59 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q4_K_S.gguf GGUF Q4_K_S 2.48 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_0.gguf GGUF 2.95 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_1.gguf GGUF 3.19 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_K_M.gguf GGUF Q5_K_M 3.01 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q5_K_S.gguf GGUF Q5_K_S 2.95 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q6_K.gguf GGUF Q6_K 3.46 GB Download
Llama-3.1-Nemotron-Nano-4B-v1.1-Q8_0.gguf GGUF 4.47 GB Download

Model Details Live

Model Slug
duyntnet/llama-3.1-nemotron-nano-4b-v1.1-imatrix-gguf
Author
duyntnet
Pipeline Task
text-generation
Library
transformers
Created
2025-05-21
Last Modified
2025-05-21
Gated
No
Private
No
HF SHA
15a10f3bf59fb89d91a5485a97d60534b877b5a6
License
other
Language
en
Base Model
Unknown

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "license": "other",
    "language": [
      "en"
    ],
    "pipeline_tag": "text-generation",
    "inference": false,
    "tags": [
      "transformers",
      "gguf",
      "imatrix",
      "Llama-3.1-Nemotron-Nano-4B-v1.1"
    ],
    "frontmatter": {
      "license": "other",
      "language": [
        "en"
      ],
      "pipeline_tag": "text-generation",
      "inference": "false",
      "tags": [
        "transformers",
        "gguf",
        "imatrix",
        "Llama-3.1-Nemotron-Nano-4B-v1.1"
      ]
    },
    "hero_image_url": "",
    "summary": "Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of nvidia/Llama-3.1-Minitron-4B-Width-Base, which is created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. Llama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K. This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: This model is ready for commercial use.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: other\nlanguage:\n- en\npipeline_tag: text-generation\ninference: false\ntags:\n- transformers\n- gguf\n- imatrix\n- Llama-3.1-Nemotron-Nano-4B-v1.1\n---\n\nQuantizations of https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\n\n\n### Open source inference clients/UIs\n* [llama.cpp](https://github.com/ggerganov/llama.cpp)\n* [KoboldCPP](https://github.com/LostRuins/koboldcpp)\n* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)\n* [ollama](https://github.com/ollama/ollama)\n* [jan](https://github.com/janhq/jan)\n\n### Closed source inference clients/UIs\n* [LM Studio](https://lmstudio.ai/)\n* [Backyard AI](https://backyard.ai/)\n* More will be added...\n---\n\n# From original readme\n\nLlama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of [nvidia/Llama-3.1-Minitron-4B-Width-Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base), which is created from Llama 3.1 8B using [our LLM compression technique](https://arxiv.org/abs/2408.11796) and offers improvements in model accuracy and efficiency. It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. \n\nLlama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K.\n\nThis model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints\n\nThis model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: \n- [Llama-3.3-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1)\n- [Llama-3.3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3.3-Nemotron-Super-49B-v1)\n- [Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1)\n\nThis model is ready for commercial use.\n\n\n## Quick Start and Usage Recommendations:\n\n1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as shown in the example below. All instructions should be contained within the user prompt\n2. We recommend setting temperature to `0.6`, and Top P to `0.95` for Reasoning ON mode\n3. We recommend using greedy decoding for Reasoning OFF mode\n4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required\n\nSee the snippet below for usage with Hugging Face Transformers library. Reasoning mode (ON/OFF) is controlled via system prompt. Please see the example below.\nOur code requires the transformers package version to be `4.44.2` or higher.\n\n\n### Example of “Reasoning On:”\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   temperature=0.6,\n   top_p=0.95,\n   **model_kwargs\n)\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"on\"\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}]))\n```\n\n\n### Example of “Reasoning Off:”\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   do_sample=False,\n   **model_kwargs\n)\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"off\"\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}]))\n```\n\nFor some prompts, even though thinking is disabled, the model emergently prefers to think before responding. But if desired, the users can prevent it by pre-filling the assistant response.\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"off\"\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   do_sample=False,\n   **model_kwargs\n)\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}, {\"role\":\"assistant\", \"content\":\"<think>\\n</think>\"}]))\n```\n\n## Running a vLLM Server with Tool-call Support\n\nLlama-3.1-Nemotron-Nano-4B-v1.1 supports tool calling. This HF repo hosts a tool-callilng parser as well as a chat template in Jinja, which can be used to launch a vLLM server.\n\nHere is a shell script example to launch a vLLM server with tool-call support. `vllm/vllm-openai:v0.6.6` or newer should support the model.\n\n```shell\n#!/bin/bash\n\nCWD=$(pwd)\nPORT=5000\ngit clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\ndocker run -it --rm \\\n    --runtime=nvidia \\\n    --gpus all \\\n    --shm-size=16GB \\\n    -p ${PORT}:${PORT} \\\n    -v ${CWD}:${CWD} \\\n    vllm/vllm-openai:v0.6.6 \\\n    --model $CWD/Llama-3.1-Nemotron-Nano-4B-v1.1 \\\n    --trust-remote-code \\\n    --seed 1 \\\n    --host \"0.0.0.0\" \\\n    --port $PORT \\\n    --served-model-name \"Llama-Nemotron-Nano-4B-v1.1\" \\\n    --tensor-parallel-size 1 \\\n    --max-model-len 131072 \\\n    --gpu-memory-utilization 0.95 \\\n    --enforce-eager \\\n    --enable-auto-tool-choice \\\n    --tool-parser-plugin \"${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py\" \\\n    --tool-call-parser \"llama_nemotron_json\" \\\n    --chat-template \"${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja\"\n```\n\nAlternatively, you can use a virtual environment to launch a vLLM server like below.\n\n```console\n$ git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\n\n$ conda create -n vllm python=3.12 -y\n$ conda activate vllm\n\n$ python -m vllm.entrypoints.openai.api_server \\\n  --model Llama-3.1-Nemotron-Nano-4B-v1.1 \\\n  --trust-remote-code \\\n  --seed 1 \\\n  --host \"0.0.0.0\" \\\n  --port 5000 \\\n  --served-model-name \"Llama-Nemotron-Nano-4B-v1.1\" \\\n  --tensor-parallel-size 1 \\\n  --max-model-len 131072 \\\n  --gpu-memory-utilization 0.95 \\\n  --enforce-eager \\\n  --enable-auto-tool-choice \\\n  --tool-parser-plugin \"Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py\" \\\n  --tool-call-parser \"llama_nemotron_json\" \\\n  --chat-template \"Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja\"\n```\n\nAfter launching a vLLM server, you can call the server with tool-call support using a Python script like below.\n\n```python\n>>> from openai import OpenAI\n>>> client = OpenAI(\n        base_url=\"http://0.0.0.0:5000/v1\",\n        api_key=\"dummy\",\n    )\n\n>>> completion = client.chat.completions.create(\n      model=\"Llama-Nemotron-Nano-v1.1\",\n      messages=[\n        {\"role\": \"system\", \"content\": \"detailed thinking on\"},\n        {\"role\": \"user\", \"content\": \"My bill is $100. What will be the amount for 18% tip?\"},\n      ],\n      tools=[\n        {\"type\": \"function\", \"function\": {\"name\": \"calculate_tip\", \"parameters\": {\"type\": \"object\", \"properties\": {\"bill_total\": {\"type\": \"integer\", \"description\": \"The total amount of the bill\"}, \"tip_percentage\": {\"type\": \"integer\", \"description\": \"The percentage of tip to be applied\"}}, \"required\": [\"bill_total\", \"tip_percentage\"]}}},\n        {\"type\": \"function\", \"function\": {\"name\": \"convert_currency\", \"parameters\": {\"type\": \"object\", \"properties\": {\"amount\": {\"type\": \"integer\", \"description\": \"The amount to be converted\"}, \"from_currency\": {\"type\": \"string\", \"description\": \"The currency code to convert from\"}, \"to_currency\": {\"type\": \"string\", \"description\": \"The currency code to convert to\"}}, \"required\": [\"from_currency\", \"amount\", \"to_currency\"]}}},\n      ],\n    )\n\n>>> completion.choices[0].message.content\n'<think>\\nOkay, let\\'s see. The user has a bill of $100 and wants to know the amount of a 18% tip. So, I need to calculate the tip amount. The available tools include calculate_tip, which requires bill_total and tip_percentage. The parameters are both integers. The bill_total is 100, and the tip percentage is 18. So, the function should multiply 100 by 18% and return 18.0. But wait, maybe the user wants the total including the tip? The question says \"the amount for 18% tip,\" which could be interpreted as the tip amount itself. Since the function is called calculate_tip, it\\'s likely that it\\'s designed to compute the tip, not the total. So, using calculate_tip with bill_total=100 and tip_percentage=18 should give the correct result. The other function, convert_currency, isn\\'t relevant here. So, I should call calculate_tip with those values.\\n</think>\\n\\n'\n\n>>> completion.choices[0].message.tool_calls\n[ChatCompletionMessageToolCall(id='chatcmpl-tool-2972d86817344edc9c1e0f9cd398e999', function=Function(arguments='{\"bill_total\": 100, \"tip_percentage\": 18}', name='calculate_tip'), type='function')]\n```",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "imatrix",
    "Llama-3.1-Nemotron-Nano-4B-v1.1",
    "text-generation",
    "en",
    "arxiv:2408.11796",
    "license:other",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 611,
  "gated": false,
  "private": false,
  "last_modified": "2025-05-21T15:31:19.000Z",
  "created_at": "2025-05-21T14:54:41.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "transformers"
}
Source payload excerpt (from Hugging Face API)
{
  "_id": "682de931f3e4b74b54726e00",
  "id": "duyntnet/Llama-3.1-Nemotron-Nano-4B-v1.1-imatrix-GGUF",
  "modelId": "duyntnet/Llama-3.1-Nemotron-Nano-4B-v1.1-imatrix-GGUF",
  "sha": "15a10f3bf59fb89d91a5485a97d60534b877b5a6",
  "createdAt": "2025-05-21T14:54:41.000Z",
  "lastModified": "2025-05-21T15:31:19.000Z",
  "author": "duyntnet",
  "downloads": 611,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "transformers",
  "siblings_count": 29
}