distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf overview
Distil Qwen3 1.7B Customer Support Deferral — GGUF GGUF build of distil labs/distil qwen3 1.7b customer support deferral https://huggingface.co/distil labs/dis…
Runs locally from ~1.03 GB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.gguf | GGUF | Q4_K_M | 1.03 GB | Download |
Model Details
| Model ID | distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf |
|---|---|
| Author | distil-labs |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | distil-labs/distil-qwen3-1.7b-customer-support-deferral |
| Last modified | 2026-06-08T00:45:37.000Z |
Model README
---
license: apache-2.0
base_model: distil-labs/distil-qwen3-1.7b-customer-support-deferral
tags:
- tool-calling
- function-calling
- customer-support
- airline
- model-cascade
- deferral
- distil-labs
- gguf
- llama-cpp
language:
- en
pipeline_tag: text-generation
library_name: llama.cpp
---
Distil-Qwen3-1.7B-Customer-Support-Deferral — GGUF
GGUF build of
distil-labs/distil-qwen3-1.7b-customer-support-deferral,
for serving with llama.cpp.
A fine-tuned Qwen3-1.7B model for multi-turn airline customer support that runs as the
small tier of a two-model cascade: it handles most support turns itself and **defers
genuinely-hard turns to a larger model** by emitting a defer_to_larger_model tool call.
Every assistant action is a single tool call — including talking to the customer via
respond_to_user — so a thin orchestrator can drive it.
> ⚠️ Placeholder weights. This GGUF is currently a build of base Qwen3-1.7B so the
> demo can be served and validated end-to-end. It will be **replaced with the distilled
> weights** once training completes, and the metric tables below will be populated then.
Results
Populated when training completes.
| Model | Parameters | Tool Call Accuracy | ROUGE | Deferral Precision | Deferral Recall |
|---|:---:|:---:|:---:|:---:|:---:|
| GLM-5 (teacher) | — | — | — | — | — |
| This model (tuned) | 1.7B | — | — | — | — |
| Qwen3-1.7B (base) | 1.7B | — | — | — | — |
Usage (llama.cpp)
hf download distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf \
distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.gguf --local-dir models
llama-server \
--model models/distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.gguf \
--port 8000 \
--jinja
Then query the OpenAI-compatible API at http://127.0.0.1:8000/v1. The airline policy (system
prompt) and the 16 tool schemas ship with the demo app as job_description.json.
Demo App
This model powers the Dual-size Customer-Support Bot demo — a terminal cascade where this
local SLM handles most airline-support turns and defers hard turns to a larger,
OpenAI-compatible model.
Quantizations
| File | Quant | Notes |
|---|---|---|
| distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.gguf | Q4_K_M | Default; good size/quality balance |
Additional quants may be added alongside the trained weights.
Links
License
Released under the Apache 2.0 license. See the
for base-model and teacher-model license terms.
Run distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models