GraySoft
Projects Models Compare Cloud benchmarks FAQ Download guIDE →
Model Intelligence Sheet

distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf overview

Distil Qwen3 1.7B Customer Support Deferral — GGUF GGUF build of distil labs/distil qwen3 1.7b customer support deferral https://huggingface.co/distil labs/dis…

llama.cppgguftool-callingfunction-callingcustomer-supportairlinemodel-cascadedeferraldistil-labsllama-cpptext-generationenbase_model:distil-labs/distil-qwen3-1.7b-customer-support-deferralbase_model:quantized:distil-labs/distil-qwen3-1.7b-customer-support-deferrallicense:apache-2.0endpoints_compatibleregion:usimatrixconversational

Runs locally from ~1.03 GB disk (4 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads
0
Likes
0
Pipeline
text-generation

Repository Files & Downloads

1 GGUF files detected
Direct downloads for local inference
FileTypeQuantizationSizeLink
distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.ggufGGUFQ4_K_M1.03 GBDownload

Model Details

Model IDdistil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf
Authordistil-labs
Pipelinetext-generation
Licenseapache-2.0
Base modeldistil-labs/distil-qwen3-1.7b-customer-support-deferral
Last modified2026-06-08T00:45:37.000Z

Model README

---

license: apache-2.0

base_model: distil-labs/distil-qwen3-1.7b-customer-support-deferral

tags:

- tool-calling

- function-calling

- customer-support

- airline

- model-cascade

- deferral

- distil-labs

- gguf

- llama-cpp

language:

- en

pipeline_tag: text-generation

library_name: llama.cpp

---

Distil-Qwen3-1.7B-Customer-Support-Deferral — GGUF

GGUF build of

distil-labs/distil-qwen3-1.7b-customer-support-deferral,

for serving with llama.cpp.

A fine-tuned Qwen3-1.7B model for multi-turn airline customer support that runs as the

small tier of a two-model cascade: it handles most support turns itself and **defers

genuinely-hard turns to a larger model** by emitting a defer_to_larger_model tool call.

Every assistant action is a single tool call — including talking to the customer via

respond_to_user — so a thin orchestrator can drive it.

> ⚠️ Placeholder weights. This GGUF is currently a build of base Qwen3-1.7B so the

> demo can be served and validated end-to-end. It will be **replaced with the distilled

> weights** once training completes, and the metric tables below will be populated then.

Results

Populated when training completes.

| Model | Parameters | Tool Call Accuracy | ROUGE | Deferral Precision | Deferral Recall |

|---|:---:|:---:|:---:|:---:|:---:|

| GLM-5 (teacher) | — | — | — | — | — |

| This model (tuned) | 1.7B | — | — | — | — |

| Qwen3-1.7B (base) | 1.7B | — | — | — | — |

Usage (llama.cpp)

hf download distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf \
    distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.gguf --local-dir models

llama-server \
    --model models/distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.gguf \
    --port 8000 \
    --jinja

Then query the OpenAI-compatible API at http://127.0.0.1:8000/v1. The airline policy (system

prompt) and the 16 tool schemas ship with the demo app as job_description.json.

Demo App

This model powers the Dual-size Customer-Support Bot demo — a terminal cascade where this

local SLM handles most airline-support turns and defers hard turns to a larger,

OpenAI-compatible model.

Quantizations

| File | Quant | Notes |

|---|---|---|

| distil-qwen3-1.7b-customer-support-deferral-Q4_K_M.gguf | Q4_K_M | Default; good size/quality balance |

Additional quants may be added alongside the trained weights.

Links

License

Released under the Apache 2.0 license. See the

transformers model card

for base-model and teacher-model license terms.

Run distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models