Dhptl/Qwen2.5-1.5B-Instruct-GGUF overview
Runs locally: the smallest file takes about 645.0 MB of disk, and the model targets 4 GB VRAM class GPUs with llama.cpp or guIDE.
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct-Q2_K.gguf | GGUF | Q2_K | 645.0 MB | Download |
| Qwen2.5-1.5B-Instruct-Q3_K_L.gguf | GGUF | Q3_K_L | 839.4 MB | Download |
| Qwen2.5-1.5B-Instruct-Q3_K_M.gguf | GGUF | Q3_K_M | 786.0 MB | Download |
| Qwen2.5-1.5B-Instruct-Q3_K_S.gguf | GGUF | Q3_K_S | 725.7 MB | Download |
| Qwen2.5-1.5B-Instruct-Q4_K_M.gguf | GGUF | Q4_K_M | 940.4 MB | Download |
| Qwen2.5-1.5B-Instruct-Q4_K_S.gguf | GGUF | Q4_K_S | 896.8 MB | Download |
| Qwen2.5-1.5B-Instruct-Q5_K_M.gguf | GGUF | Q5_K_M | 1.05 GB | Download |
| Qwen2.5-1.5B-Instruct-Q5_K_S.gguf | GGUF | Q5_K_S | 1.02 GB | Download |
| Qwen2.5-1.5B-Instruct-Q6_K.gguf | GGUF | Q6_K | 1.19 GB | Download |
| Qwen2.5-1.5B-Instruct-Q8_0.gguf | GGUF | Q8_0 | 1.53 GB | Download |
Model Details
| Model ID | Dhptl/Qwen2.5-1.5B-Instruct-GGUF |
|---|---|
| Author | Dhptl |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Last modified | 2026-06-12T08:21:10.000Z |
Model README
---
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
tags:
- base_model:finetune:Qwen/Qwen2.5-1.5B
- conversational
- chat
- base_model:Qwen/Qwen2.5-1.5B
- deploy:azure
- safetensors
- license:apache-2.0
- text-generation-inference
- quantized
- arxiv:2407.10671
- transformers
- region:us
- gguf
- en
- qwen2
- text-generation
language:
- en
---
<div align="center">
Qwen2.5-1.5B-Instruct — GGUF Quantizations



Quantized GGUF versions of Qwen/Qwen2.5-1.5B-Instruct
Works with llama.cpp · Ollama · LM Studio · Open WebUI · Jan
Quantized by Dhptl on June 12, 2026 using quant-kit
</div>
---
⚖️ The Pareto Frontier — Efficiency vs Intelligence
> Can you run a powerful model on a laptop without losing its intelligence?
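These quantizations use llama.cpp's quantization formats (mostly K-quants) to shrink the model to a fraction of its original size.
The 97-99% quality-retention figures below are estimates for Q4_K_M, not measured results: the table holds no benchmark scores yet, and the lower-bit files (Q2_K, Q3_K_S) lose noticeably more quality.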
| Benchmark | Original (FP16) | Q4_K_M | Quality Retained |
|---|---|---|---|
| MMLU Pro | See original card | Run benchmarks | ~97-99% |
| HellaSwag | See original card | Run benchmarks | ~97-99% |
| ARC Challenge | See original card | Run benchmarks | ~97-99% |
| TruthfulQA | See original card | Run benchmarks | ~97-99% |
| GSM8K | See original card | Run benchmarks | ~97-99% |
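The "Quality Retained" column is the quantized score expressed as a percentage of the FP16 score. Since the table has no measured values yet, the sketch below only shows how the column could be filled in once both scores exist. The function name `quality_retained` is an illustrative assumption and is not part of quant-kit.

```python
def quality_retained(fp16_score: float, quant_score: float) -> float:
    """Return the quantized score as a percentage of the FP16 score."""
    if fp16_score <= 0:
        raise ValueError("fp16_score must be positive")
    return 100.0 * quant_score / fp16_score
```

A result in the 97 to 99 range would support the estimate in the table; a lower result suggests stepping up to a larger quant.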
---
📦 Available Files
| Filename | Size | RAM Required | Quant | Quality | Best For |
|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct-Q2_K.gguf | 0.63 GB | ~2.1 GB | Q2_K | ⭐ | Extreme compression, significant quality loss. |
| Qwen2.5-1.5B-Instruct-Q3_K_L.gguf | 0.82 GB | ~2.3 GB | Q3_K_L | ⭐⭐⭐ | Slightly better than Q3_K_M, still a compromise. |
| Qwen2.5-1.5B-Instruct-Q3_K_M.gguf | 0.77 GB | ~2.3 GB | Q3_K_M | ⭐⭐⭐ | Very small file. Quality drop noticeable. |
| Qwen2.5-1.5B-Instruct-Q3_K_S.gguf | 0.71 GB | ~2.2 GB | Q3_K_S | ⭐⭐ | Very high compression, high quality loss. |
| Qwen2.5-1.5B-Instruct-Q4_K_M.gguf | 0.92 GB | ~2.4 GB | Q4_K_M ✅ Recommended | ⭐⭐⭐⭐ | Best balance of size and quality. Recommended for most users. |
| Qwen2.5-1.5B-Instruct-Q4_K_S.gguf | 0.88 GB | ~2.4 GB | Q4_K_S | ⭐⭐⭐½ | Good speed/size balance, slight quality loss. |
| Qwen2.5-1.5B-Instruct-Q5_K_M.gguf | 1.05 GB | ~2.5 GB | Q5_K_M | ⭐⭐⭐⭐½ | Better quality than Q4, slightly larger. Great if you have the RAM. |
| Qwen2.5-1.5B-Instruct-Q5_K_S.gguf | 1.02 GB | ~2.5 GB | Q5_K_S | ⭐⭐⭐⭐ | Large but accurate. |
| Qwen2.5-1.5B-Instruct-Q6_K.gguf | 1.19 GB | ~2.7 GB | Q6_K | ⭐⭐⭐⭐⭐ | Near-perfect quality, very large. |
| Qwen2.5-1.5B-Instruct-Q8_0.gguf | 1.53 GB | ~3.0 GB | Q8_0 | ⭐⭐⭐⭐⭐ | Closest to original quality. Use when RAM is not a concern. |
💡 Which file should I download?
- Most users: Qwen2.5-1.5B-Instruct-Q4_K_M.gguf, best balance of size and quality
- High RAM (32GB+): Qwen2.5-1.5B-Instruct-Q8_0.gguf, near-original quality
- Low RAM (8GB): Qwen2.5-1.5B-Instruct-Q3_K_M.gguf, fits in 8GB with room to spare
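The same choice can be made mechanically from the "RAM Required" column of the Available Files table. The sketch below picks the highest-rated file that fits in the memory you have free. The `pick_quant` name and the 0.5 GB default headroom are assumptions for illustration, not recommendations from the author.

```python
# (quant, approximate RAM required in GB), copied from the Available Files table.
# Ordered by star rating, with ties broken by larger file size first.
QUANTS_BY_QUALITY = [
    ("Q8_0", 3.0),
    ("Q6_K", 2.7),
    ("Q5_K_M", 2.5),
    ("Q5_K_S", 2.5),
    ("Q4_K_M", 2.4),
    ("Q4_K_S", 2.4),
    ("Q3_K_L", 2.3),
    ("Q3_K_M", 2.3),
    ("Q3_K_S", 2.2),
    ("Q2_K", 2.1),
]


def pick_quant(free_ram_gb: float, headroom_gb: float = 0.5):
    """Return the highest-rated file name that fits, or None if none does."""
    budget = free_ram_gb - headroom_gb  # leave room for the OS and other apps
    for quant, ram_gb in QUANTS_BY_QUALITY:
        if ram_gb <= budget:
            return f"Qwen2.5-1.5B-Instruct-{quant}.gguf"
    return None
```

Because every file in this repository needs 3.0 GB of RAM or less, this rule picks Q8_0 on most machines; Q4_K_M remains the author's default when download size or speed matters more.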
---
⚡ Speed Benchmarks
Run python benchmark.py --model Qwen2.5-1.5B-Instruct to generate speed results.
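The contents of benchmark.py are not reproduced in this card. As a rough stand-in, the sketch below times a single completion and reports generated tokens per second. It assumes the callable returns an OpenAI-style dict with a `usage.completion_tokens` field, which is what `Llama.create_completion` in llama-cpp-python returns; the function name `measure_tokens_per_second` is hypothetical.

```python
import time
from typing import Callable


def measure_tokens_per_second(
    generate: Callable[..., dict], prompt: str, max_tokens: int = 128
) -> float:
    """Time one completion call and return generated tokens per second."""
    start = time.perf_counter()
    result = generate(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    # Count only the tokens the model produced, not the prompt tokens.
    generated = result["usage"]["completion_tokens"]
    return generated / elapsed if elapsed > 0 else float("inf")
```

With a loaded model, this could be called as `measure_tokens_per_second(llm.create_completion, "Explain GGUF in one paragraph.")`. The elapsed time includes prompt processing, so short generations understate pure decoding speed.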
---
🧠 Quality Benchmarks
Run kaggle_bench.ipynb on Kaggle to benchmark this model.
---
🚀 How to Use
Ollama
```bash
ollama run dhptl/qwen2.5-1.5b-instruct
```
LM Studio / Jan / Open WebUI
Search for Dhptl/Qwen2.5-1.5B-Instruct in the model browser.
llama.cpp CLI
```bash
# Download the binary from https://github.com/ggerganov/llama.cpp/releases
./llama-cli \
  -m Qwen2.5-1.5B-Instruct-Q4_K_M.gguf \
  -p "You are a helpful assistant." \
  --conversation \
  -n 512
```
Python — llama-cpp-python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen2.5-1.5B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 = offload everything to GPU
    n_ctx=4096,
)

response = llm.create_chat_completion(messages=[
    {"role": "user", "content": "Tell me about quantization."}
])
print(response["choices"][0]["message"]["content"])
```
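The example above expects the GGUF file to be on disk already. As a sketch of an alternative, llama-cpp-python can fetch a file from the Hub through `Llama.from_pretrained`, which needs the `huggingface_hub` package installed. The helper names `gguf_filename` and `load_from_hub` are illustrative assumptions; the file-name pattern follows the Available Files table.

```python
REPO_ID = "Dhptl/Qwen2.5-1.5B-Instruct-GGUF"


def gguf_filename(quant: str = "Q4_K_M") -> str:
    """Build a file name following the pattern in the Available Files table."""
    return f"Qwen2.5-1.5B-Instruct-{quant}.gguf"


def load_from_hub(quant: str = "Q4_K_M", n_ctx: int = 4096):
    """Download the chosen file (cached after the first call) and load it."""
    # Imported lazily so gguf_filename works without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        n_gpu_layers=-1,  # offload all layers to the GPU, as in the example above
        n_ctx=n_ctx,
    )
```

Calling `load_from_hub("Q5_K_M")` would return a `Llama` object that can be used with `create_chat_completion` exactly as shown earlier.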
---
🔍 About GGUF Quantization
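GGUF is the file format used by llama.cpp and the tools built on it for running large language models locally.
Quantization reduces the number of bits stored per weight, which shrinks the file and its memory footprint at some cost in quality: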
| Format | Bits/weight | Size vs FP16 | Quality |
|---|---|---|---|
| Q2_K | ~2.6 | 16% | ⭐ |
| Q3_K_M | ~3.3 | 21% | ⭐⭐⭐ |
| Q4_K_M | ~4.5 | 28% | ⭐⭐⭐⭐ ← sweet spot |
| Q5_K_M | ~5.6 | 35% | ⭐⭐⭐⭐½ |
| Q8_0 | ~8.5 | 53% | ⭐⭐⭐⭐⭐ |
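The "Size vs FP16" column follows directly from the bits-per-weight figures: FP16 stores 16 bits per weight, so the ratio is bits per weight divided by 16. The short sketch below reproduces the column from the table's own numbers; the names `FP16_BITS` and `size_vs_fp16` are chosen here for illustration.

```python
FP16_BITS = 16  # bits per weight in the unquantized FP16 model

# Approximate bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.3,
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.6,
    "Q8_0": 8.5,
}


def size_vs_fp16(bits_per_weight: float) -> int:
    """Return the quantized size as a whole-number percentage of FP16 size."""
    return round(100 * bits_per_weight / FP16_BITS)


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: {size_vs_fp16(bpw)}%")
```

These are idealized ratios. Real files also carry metadata and keep some tensors at higher precision, so actual sizes can differ somewhat.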
---
💬 Community & Feedback
Found an issue? Have a question? Open a Discussion in the Community tab above.
If these quantizations were useful, please consider:
- ⭐ Starring quant-kit on GitHub
- 👍 Liking this model on HuggingFace
- 💬 Leaving feedback in the Community tab