GraySoft

ai21labs/ai21-jamba-reasoning-3b-gguf (Q4_K_M GGUF, free download) is indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. Use this page to pick a local model for guIDE or other runtimes. See related models in the same shard below.

Model Intelligence Sheet

ai21labs/ai21-jamba-reasoning-3b-gguf overview

#### llama.cpp Python SDK

    from llama_cpp import Llama
    from huggingface_hub import hf_hub_download

    # Download the Q4_K_M quant from Hugging Face
    model_path = hf_hub_download(
        repo_id="ai21labs/AI21-Jamba-Reasoning-3B-GGUF",
        filename="jamba-reasoning-3b-Q4_K_M.gguf",
        token="<HF token>",
    )

    llm = Llama(
        model_path=model_path,
        n_ctx=128000,
        n_threads=10,        # CPU threads
        n_gpu_layers=-1,     # -1 = all layers on GPU (Metal/CUDA if available)
        flash_attn=True,
    )

    prompt = """
    You are analyzing a stream of customer support tickets to decide which ones require escalation.

    Ticket 1: "The new update caused our app to crash whenever users upload a file larger than 50MB."
    Ticket 2: "I can't log in because I forgot my password."
    Ticket 3: "The billing page is missing the new enterprise pricing option."

    Classify each ticket as 'Critical', 'Medium', or 'Low' priority and explain your reasoning."""

    res = llm(
        prompt,
        max_tokens=8192,
        temperature=0.6,
    )

    print(f"\n\nResponse: {res['choices'][0]['text']}\n\n")

#### llama.cpp server

    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release

Start llama.cpp server with Jamba-Reasoning-3B gguf:

    ./build/bin/llama-server --jinja \
    --hf-repo ai21labs/AI21-Jamba-Reasoning-3B-GGUF \
    --hf-file jamba-reasoning-3b-Q4_K_M.gguf \
    -ngl -1 \
    --host 127.0.0.1 \
    --port 8000

Quick sanity test using curl:

    curl --location 'http://127.0.0.1:8000/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "jamba-reasoning-3b",
        "messages": [
            {
                "role": "user",
                "content": "You are analyzing customer support tickets to decide which need escalation.\nTicket 1: '\''App crashes when uploading files >50MB.'\''\nTicket 2: '\''Forgot password, can’t log in.'\''\nTicket 3: '\''Billing page missing enterprise pricing.'\''\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\n"
            }
        ],
        "max_tokens": 8192,
        "temperature": 0.6
    }'

### Run the model with vLLM

Please reference the base model's model card: https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B/blob/main/README.md#run-the-model-with-vllm
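The curl sanity test above can also be issued from Python with only the standard library. The following is a minimal sketch, not an official client: the helper names `build_chat_payload` and `ask` are hypothetical, the URL assumes the `--host` and `--port` values from the server command, and the response parsing assumes an OpenAI-style reply shape (`choices[0].message.content`).

```python
import json
import urllib.request

# Assumed to match the --host/--port flags of the llama-server command.
SERVER_URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_chat_payload(prompt, max_tokens=8192, temperature=0.6):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": "jamba-reasoning-3b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def ask(prompt, url=SERVER_URL, timeout=600):
    """POST the prompt to the server and return the assistant's text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # Assumption: the reply follows the OpenAI chat-completions layout.
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Classify this ticket: 'Forgot password, can't log in.'"))` would send one request. Nothing is sent at import time, so the block is safe to load without a server.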

Tags: transformers, gguf, text-generation, arxiv:2507.02782, license:apache-2.0, endpoints_compatible, region:us, conversational
Downloads: 417
Likes: 31
Pipeline: text-generation
Library: transformers
Visibility: Public
Access: Open

Repository Files & Downloads

2 files detected
Direct downloads for all repository files
File | Type | Quantization | Size | Link
jamba-reasoning-3b-F16.gguf | GGUF | F16 | 5.96 GB | Download
jamba-reasoning-3b-Q4_K_M.gguf | GGUF | Q4_K_M | 1.80 GB | Download
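The sizes listed here (5.96 GB and 1.80 GB) differ from those in the README stored in the metadata below (6.4 GB and 1.93 GB). One plausible explanation, which is an assumption and not something this page states, is that the README uses decimal gigabytes (10^9 bytes) while this listing uses binary gibibytes (2^30 bytes). The arithmetic can be checked with a short sketch; the function name `decimal_gb_to_gib` is hypothetical.

```python
GB = 10**9   # bytes in one decimal gigabyte
GIB = 2**30  # bytes in one binary gibibyte


def decimal_gb_to_gib(size_gb):
    """Convert a size in decimal gigabytes to binary gibibytes."""
    return size_gb * GB / GIB


# Sizes as quoted in the README (assumed to be decimal GB).
readme_sizes_gb = {
    "jamba-reasoning-3b-F16.gguf": 6.4,
    "jamba-reasoning-3b-Q4_K_M.gguf": 1.93,
}

for name, size_gb in readme_sizes_gb.items():
    print(f"{name}: {size_gb} GB is about {decimal_gb_to_gib(size_gb):.2f} GiB")
```

Both converted values round to the figures in the listing, which is consistent with the unit explanation but does not prove it.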

Model Details

Model Slug: ai21labs/ai21-jamba-reasoning-3b-gguf
Author: ai21labs
Pipeline Task: text-generation
Library: transformers
Created: 2025-10-05
Last Modified: 2026-02-02
Gated: No
Private: No
HF SHA: 462e08a43c3c32f6b8b85f79ff0796e484d7b65a
License: apache-2.0
Language: Unknown
Base Model: Unknown
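The HF SHA listed above identifies one specific commit of the repository, which can be useful for reproducible downloads. The sketch below builds file URLs using the `resolve/<revision>/<path>` pattern that appears in the image URLs in the metadata further down. It is an assumption that a commit SHA is accepted in place of `main` in that pattern; the helper name `resolve_url` is hypothetical.

```python
from urllib.parse import quote

REPO_ID = "ai21labs/AI21-Jamba-Reasoning-3B-GGUF"
HF_SHA = "462e08a43c3c32f6b8b85f79ff0796e484d7b65a"


def resolve_url(filename, revision=HF_SHA, repo_id=REPO_ID):
    """Build a file URL for a given revision (branch name or commit SHA)."""
    # quote() percent-encodes spaces and keeps "/" so nested paths still work.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{quote(filename)}"


for name in ("jamba-reasoning-3b-F16.gguf", "jamba-reasoning-3b-Q4_K_M.gguf"):
    print(resolve_url(name))
```

Passing `revision="main"` reproduces the unpinned form used by the image links in the README.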

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "license": "apache-2.0",
    "license_name": "jamba-open-model-license",
    "license_link": "https://www.ai21.com/jamba-open-model-license/",
    "pipeline_tag": "text-generation",
    "library_name": "transformers",
    "frontmatter": {
      "license": "apache-2.0",
      "license_name": "jamba-open-model-license",
      "license_link": "https://www.ai21.com/jamba-open-model-license/",
      "pipeline_tag": "text-generation",
      "library_name": "transformers"
    },
    "hero_image_url": "https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B-GGUF/resolve/main/assets/Intelligence%20vs%20Speed%20Jamba%20Reasoning%203B.png",
    "summary": "model_path = hf_hub_download( repo_id=\"ai21labs/AI21-Jamba-Reasoning-3B-GGUF\", filename=\"jamba-reasoning-3b-Q4_K_M.gguf\", token=\"\" ) llm = Llama( model_path=model_path, n_ctx=128000, n_threads=10,        # CPU threads n_gpu_layers=-1,     # -1 = all layers on GPU (Metal/CUDA if available) flash_attn=True, ) prompt =  \"\"\" You are analyzing a stream of customer support tickets to decide which ones require escalation. Ticket 1: \"The new update caused our app to crash whenever users upload a file larger than 50MB.\" Ticket 2: \"I can't log in because I forgot my password.\" Ticket 3: \"The billing page is missing the new enterprise pricing option.\" Classify each ticket as 'Critical', 'Medium', or 'Low' priority and explain your reasoning.\"\"\" res = llm( prompt, max_tokens=8192, temperature=0.6, ) print(f\"\\n\\nResponse: {res['choices'][0]['text']}\\n\\n\") `` #### llama.cpp server `bash git clone https://github.com/ggml-org/llama.cpp cd llama.cpp cmake -B build cmake --build build --config Release ` Start llama.cpp server with Jamba-Reasoning-3B gguf: `bash ./build/bin/llama-server --jinja \\ --hf-repo ai21labs/AI21-Jamba-Reasoning-3B-GGUF \\ --hf-file jamba-reasoning-3b-Q4_K_M.gguf \\ -ngl -1 \\ --host 127.0.0.1 \\ --port 8000 ` Quick sanity test using curl: `bash curl --location 'http://127.0.0.1:8000/v1/chat/completions' \\ --header 'Content-Type: application/json' \\ --data '{ \"model\": \"jamba-reasoning-3b\", \"messages\": [ { \"role\": \"user\", \"content\": \"You are analyzing customer support tickets to decide which need escalation.\\nTicket 1: '\\''App crashes when uploading files >50MB.'\\''\\nTicket 2: '\\''Forgot password, can’t log in.'\\''\\nTicket 3: '\\''Billing page missing enterprise pricing.'\\''\\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\\n\" } ], \"max_tokens\": 8192, \"temperature\": 0.6 }' `` ### Run the model with vLLM > [!NOTE] > Please reference the base model's model card 
here.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: apache-2.0\nlicense_name: jamba-open-model-license\nlicense_link: https://www.ai21.com/jamba-open-model-license/\npipeline_tag: text-generation\nlibrary_name: transformers\n---\n## Introduction\n\nAI21’s Jamba Reasoning 3B is a top-performing reasoning model that packs leading scores on intelligence benchmarks and highly-efficient processing into a compact 3B build. \n<br> Read the full blog post [here](https://www.ai21.com/blog/introducing-jamba-reasoning-3B). \n\n### Key Advantages\n\n**Fast: Optimized for efficient sequence processing**\n\nThe hybrid design combines Transformer attention with Mamba (a state-space model). Mamba layers are more efficient for sequence processing, while attention layers capture complex dependencies. This mix reduces memory overhead, improves throughput, and makes the model run smoothly on laptops, GPUs, and even mobile devices, while maintaining impressive quality. \n\n\n<img src=\"https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B-GGUF/resolve/main/assets/Intelligence%20vs%20Speed%20Jamba%20Reasoning%203B.png\" width=\"900\"/>\n\n**Smart: Leading intelligence scores** \n\nThe model outperforms competitors, such as Gemma 3 4B, Llama 3.2 3B, and Granite 4.0 Micro, on a combined intelligence score that averages 6 standard benchmarks.  \n\n\n<img src=\"https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B-GGUF/resolve/main/assets/Benchmark%20Performance%20-%20Jamba%20Reasoning%203B.png\" width=\"900\"/>\n\n**Scalable: Handles very long contexts**\n\nUnlike most compact models, Jamba Reasoning 3B supports extremely long contexts. Mamba layers allow the model to process inputs without storing massive attention caches, so it scales to **256K tokens** while keeping inference practical. 
This makes it suitable for edge deployment as well as datacenter workloads.\n\n\n<img src=\"https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B-GGUF/resolve/main/assets/Speed%20vs%20Context%20Length.png\" width=\"900\"/>\n\n\n## Model Details\n\n- Number of Parameters: 3B\n- Number of Layers: 28 (26 Mamba, 2 Attention)\n- Number of Attention Heads: 20 MQA (20 for Q, 1 for KV)\n- Vocabulary Size: 64K\n- Context Length: **256k**\n- Architecture: Hybrid Transformer–Mamba with efficient attention and long-context support\n- **Developed by:** [**AI21**](https://www.ai21.com/)\n- **Supported languages:** English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic and Hebrew\n- Intelligence benchmark results:\n\n|  | **MMLU-Pro** | **Humanity’s Last Exam** | **IFBench** |\n| --- | --- | --- | --- |\n| DeepSeek R1 Distill Qwen 1.5B | 27.0% | 3.3% | 13.0% |\n| Phi-4 mini | 47.0% | 4.2% | 21.0% |\n| Granite 4.0 Micro | 44.7% | 5.1% | 24.8% |\n| Llama 3.2 3B | 35.0% | 5.2% | 26.0% |\n| Gemma 3 4B | 42.0% | 5.2% | 28.0% |\n| Qwen 3 1.7B | 57.0% | 4.8% | 27.0% |\n| Qwen 3 4B | 70% | 5.1% | 33% |\n| **Jamba Reasoning 3B** | **61.0%** | **6.0%** | **52.0%** |\n\n## Quickstart\n\nYou can run Jamba Reasoning 3B on your own machine using popular lightweight runtimes. 
This makes it possible to experiment with long-context reasoning without relying on cloud infrastructure.\n\n- **Supported runtimes**: [llama.cpp](https://github.com/ggml-org/llama.cpp), [LM Studio](https://lmstudio.ai/), and [Ollama](https://ollama.com/).\n- **Quantizations**: Multiple quantization levels are provided to shrink the model size.\n    - Full precision FP16 GGUF - **6.4** GB\n    - 4 bit quantization using Q4-K-M GGUF - **1.93** GB\n\n### Run the model with llama.cpp\n\n#### llama.cpp Python SDK\n```bash\npip install llama-cpp-python\npip install huggingface_hub\n```      \n```python\nfrom llama_cpp import Llama\nfrom huggingface_hub import hf_hub_download\n\n# Download from HF\nmodel_path = hf_hub_download(\n    repo_id=\"ai21labs/AI21-Jamba-Reasoning-3B-GGUF\",  \n    filename=\"jamba-reasoning-3b-Q4_K_M.gguf\",\n    token=\"<HF token>\"\n)\n\nllm = Llama(\n    model_path=model_path,\n    n_ctx=128000,\n    n_threads=10,        # CPU threads\n    n_gpu_layers=-1,     # -1 = all layers on GPU (Metal/CUDA if available)\n    flash_attn=True,\n)\n\nprompt =  \"\"\"\nYou are analyzing a stream of customer support tickets to decide which ones require escalation.\n\nTicket 1: \"The new update caused our app to crash whenever users upload a file larger than 50MB.\"\nTicket 2: \"I can't log in because I forgot my password.\"\nTicket 3: \"The billing page is missing the new enterprise pricing option.\"\n\nClassify each ticket as 'Critical', 'Medium', or 'Low' priority and explain your reasoning.\"\"\"\nres = llm(\n    prompt,\n    max_tokens=8192,\n    temperature=0.6,\n)\n\nprint(f\"\\n\\nResponse: {res['choices'][0]['text']}\\n\\n\")\n```\n        \n#### llama.cpp server\n```bash\ngit clone https://github.com/ggml-org/llama.cpp\ncd llama.cpp\ncmake -B build\ncmake --build build --config Release\n```\nStart llama.cpp server with Jamba-Reasoning-3B gguf:    \n```bash\n./build/bin/llama-server --jinja \\\n--hf-repo ai21labs/AI21-Jamba-Reasoning-3B-GGUF 
\\\n--hf-file jamba-reasoning-3b-Q4_K_M.gguf \\\n-ngl -1 \\\n--host 127.0.0.1 \\\n--port 8000\n```\nQuick sanity test using curl:       \n```bash\ncurl --location 'http://127.0.0.1:8000/v1/chat/completions' \\\n--header 'Content-Type: application/json' \\\n--data '{\n    \"model\": \"jamba-reasoning-3b\",\n    \"messages\": [\n        {\n            \"role\": \"user\",\n            \"content\": \"You are analyzing customer support tickets to decide which need escalation.\\nTicket 1: '\\''App crashes when uploading files >50MB.'\\''\\nTicket 2: '\\''Forgot password, can’t log in.'\\''\\nTicket 3: '\\''Billing page missing enterprise pricing.'\\''\\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\\n\"\n        }\n    ],\n    \"max_tokens\": 8192,\n    \"temperature\": 0.6\n}'\n```\n### Run the model with vLLM\n\n> [!NOTE]\n> Please reference the base model's model card [here](https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B/blob/main/README.md#run-the-model-with-vllm).\n\n\n## Training Details\n\nWe trained the model in multiple stages, each designed to strengthen reasoning and long-context performance. The process began with large-scale pre-training on a diverse corpus of natural documents. We then mid-trained on ~0.5T tokens of math and code, while extending the context length to 32K tokens. During this stage we also applied a [Mamba-specific long-context method](https://arxiv.org/abs/2507.02782), which we found to significantly improve long-context abilities.\n\nTo improve reasoning, tool use, and instruction following, we applied cold-start distillation: supervised fine-tuning with a 32K window and direct preference optimization with a 64K window. 
Finally, we enhanced reasoning performance further through online reinforcement learning with RLVR, targeting tasks such as code generation, mathematical problem solving, structured output, and information extraction.\n\n## Reinforcement “Fine-Tuning”\n\nFull support for training Jamba through VeRL will be available soon. AI21 has introduced several improvements to the VeRL framework (https://github.com/volcengine/verl), including new capabilities for training hybrid models, and stability improvements for GRPO training. These improvements will soon be available to the open source community. \n\n---\n\n## License\n\n- `Apache 2.0`\n\n---\n\n## Citation\n\n- Blog post- Read the full blog post [here](https://www.ai21.com/blog/introducing-jamba-reasoning-3B).",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "text-generation",
    "arxiv:2507.02782",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 31,
  "downloads": 417,
  "gated": false,
  "private": false,
  "last_modified": "2026-02-02T11:37:51.000Z",
  "created_at": "2025-10-05T11:02:28.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "transformers"
}
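Several entries in the `tags` array above have a `key:value` shape (`arxiv:2507.02782`, `license:apache-2.0`, `region:us`), while the rest are plain labels. A minimal sketch for separating the two follows. Treating the first colon as the separator is an assumption about the tag convention, and `split_tags` is a hypothetical helper name.

```python
# Tag list copied from the normalized metadata.
TAGS = [
    "transformers",
    "gguf",
    "text-generation",
    "arxiv:2507.02782",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "conversational",
]


def split_tags(tags):
    """Return (plain_tags, keyed_tags) where keyed tags are split on the first colon."""
    plain, keyed = [], {}
    for tag in tags:
        key, separator, value = tag.partition(":")
        if separator:
            keyed[key] = value
        else:
            plain.append(tag)
    return plain, keyed


plain_tags, keyed_tags = split_tags(TAGS)
print(plain_tags)
print(keyed_tags)
```

A dictionary keeps only one value per key, which is sufficient for this tag list; a repository with several `arxiv:` tags would need a list per key instead.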
Source payload excerpt (from Hugging Face API)
{
  "_id": "68e2504474d77b5aa4c0eb86",
  "id": "ai21labs/AI21-Jamba-Reasoning-3B-GGUF",
  "modelId": "ai21labs/AI21-Jamba-Reasoning-3B-GGUF",
  "sha": "462e08a43c3c32f6b8b85f79ff0796e484d7b65a",
  "createdAt": "2025-10-05T11:02:28.000Z",
  "lastModified": "2026-02-02T11:37:51.000Z",
  "author": "ai21labs",
  "downloads": 417,
  "likes": 31,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "transformers",
  "siblings_count": 9
}
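The `createdAt` and `lastModified` fields in the payload are UTC timestamps with millisecond precision and a trailing `Z`. As a small sketch of how they can be parsed with the standard library and compared, the helper name `parse_hf_timestamp` below is hypothetical, and the format string is inferred from the two values shown rather than from API documentation.

```python
from datetime import datetime, timezone

# Format inferred from the payload values, e.g. "2025-10-05T11:02:28.000Z".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_hf_timestamp(text):
    """Parse a payload timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


created = parse_hf_timestamp("2025-10-05T11:02:28.000Z")
modified = parse_hf_timestamp("2026-02-02T11:37:51.000Z")

# Whole days between repository creation and the last recorded change.
days_between = (modified - created).days
print(f"Last modified {days_between} days after creation")
```

Using `strptime` with an explicit format avoids relying on `datetime.fromisoformat`, which only accepts the trailing `Z` in newer Python versions.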