build-small-hackathon/proofkit-distilled-qwen0.5b-gguf overview
Runs locally from a ~379.4 MB file on disk (fits 4 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| proofkit-distilled-qwen0.5b-q4_k_m.gguf | GGUF | Q4_K_M | 379.4 MB | Download |
Model Details
| Model ID | build-small-hackathon/proofkit-distilled-qwen0.5b-gguf |
|---|---|
| Author | build-small-hackathon |
| Pipeline | text-generation |
| License | apache-2.0 |
| Base model | visproj/proofkit-distilled-qwen0.5b |
| Last modified | 2026-06-12T23:11:36.000Z |
Model README
---
license: apache-2.0
base_model: visproj/proofkit-distilled-qwen0.5b
library_name: llama.cpp
pipeline_tag: text-generation
language:
- en
tags:
- proofkit
- gguf
- llama.cpp
- distilled
- build-small-hackathon
- work-sample
---
ProofKit Qwen 0.5B — distilled (GGUF)
The llama.cpp / GGUF build of
visproj/proofkit-distilled-qwen0.5b
— a Qwen 0.5B student distilled from ProofKit's fine-tuned gpt-oss-20b teacher. This is
the default model the ProofKit Space serves: it runs free on CPU via
llama.cpp, so the app works on a free Space with no GPU.
- Quantization: q4_k_m (~400 MB)
- Runtime: llama-cpp-python / llama.cpp
- Chat template: Qwen2 (embedded in the GGUF metadata)
Usage
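from llama_cpp import Llama

# Placeholders: fill these with the exact prompt shapes from ProofKit's prompt_formats.py.
SYSTEM = "..."  # system prompt in ProofKit's format
PROMPT = "..."  # user prompt in ProofKit's format

# Downloads the matching GGUF file from the Hub (needs huggingface_hub) and loads it.
llm = Llama.from_pretrained(
    repo_id="visproj/proofkit-distilled-qwen0.5b-gguf",
    filename="*q4_k_m.gguf",
    n_ctx=4096,
)
# temperature=0.0 gives deterministic (greedy) output.
resp = llm.create_chat_completion(
    messages=[{"role": "system", "content": SYSTEM}, {"role": "user", "content": PROMPT}],
    temperature=0.0,
)
print(resp["choices"][0]["message"]["content"])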
Configure it in ProofKit with:
export PROOFKIT_DISTILLED_MODELS='ProofKit Qwen 0.5B Distilled=visproj/proofkit-distilled-qwen0.5b-gguf|*q4_k_m.gguf'
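The value above appears to follow a `Label=repo_id|filename_glob` shape. The sketch below shows one way such an entry could be split into its three parts. It is an illustration only: `parse_model_entry` is a hypothetical helper, not ProofKit's actual parser, and it handles a single entry because the source does not show how multiple entries would be separated.

```python
import os

DEFAULT_ENTRY = (
    "ProofKit Qwen 0.5B Distilled="
    "visproj/proofkit-distilled-qwen0.5b-gguf|*q4_k_m.gguf"
)


def parse_model_entry(entry: str) -> tuple:
    """Split 'Label=repo_id|filename_glob' into (label, repo_id, filename_glob).

    Hypothetical helper: the format is inferred from the export line,
    not taken from ProofKit's source.
    """
    label, sep_eq, target = entry.partition("=")
    repo_id, sep_bar, filename = target.partition("|")
    if not (sep_eq and sep_bar and label.strip() and repo_id.strip() and filename.strip()):
        raise ValueError(f"malformed model entry: {entry!r}")
    return label.strip(), repo_id.strip(), filename.strip()


# Fall back to the documented value when the variable is not set.
entry = os.environ.get("PROOFKIT_DISTILLED_MODELS", DEFAULT_ENTRY)
label, repo_id, filename = parse_model_entry(entry)
print(label, repo_id, filename, sep="\n")
```

The `repo_id` and `filename` parts map directly onto the arguments of `Llama.from_pretrained` shown in the usage snippet.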
Evaluation (post-fix, 3-judge panel)
Mean score (0–100) on 15 held-out prompts, graded by Claude Opus 4.7, GPT-5.5, and a
local Qwen-3B (gpt-oss experts is a deliberately un-retrained stale control):
| model | Claude | GPT-5.5 | Qwen-3B | Avg |
|---|---:|---:|---:|---:|
| gpt-5.5 (frontier ceiling) | 94.6 | 95.6 | 90.8 | 93.7 |
| gpt-oss attn (retrained teacher) | 82.0 | 66.8 | 81.4 | 76.7 |
| qwen-0.5b distilled (served) | 79.0 | 68.6 | 82.2 | 76.6 |
| qwen-0.5b direct 7k (served) | 78.6 | 64.4 | 82.0 | 75.0 |
| gpt-oss experts (stale control) | 67.6 | 68.6 | 81.8 | 72.7 |
| qwen-3b base | 62.1 | 67.1 | 80.5 | 69.9 |
| gpt-oss base | 55.4 | 53.8 | 68.2 | 59.1 |
| qwen-0.5b base | 36.5 | 44.5 | 67.9 | 49.7 |
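Both served retrained 0.5Bs beat the stale control and every untuned base on the
three-judge average, and the distilled 0.5B ≈ ties its own 20B teacher (76.6 vs 76.7).
The margin is not uniform per judge: on GPT-5.5 the distilled model ties the stale
control (68.6) and the direct 7k model trails both it and qwen-3b base (64.4 vs 68.6 and 67.1).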
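The Avg column and the comparison against the stale control can be recomputed from the table. The snippet below is a small check that uses only the per-judge figures printed above. Because those figures are already rounded to one decimal, a recomputed mean can differ from the published Avg by 0.1 in the last digit.

```python
# Per-judge scores copied from the table: (Claude, GPT-5.5, Qwen-3B).
scores = {
    "gpt-5.5 (frontier ceiling)": (94.6, 95.6, 90.8),
    "gpt-oss attn (retrained teacher)": (82.0, 66.8, 81.4),
    "qwen-0.5b distilled (served)": (79.0, 68.6, 82.2),
    "qwen-0.5b direct 7k (served)": (78.6, 64.4, 82.0),
    "gpt-oss experts (stale control)": (67.6, 68.6, 81.8),
    "qwen-3b base": (62.1, 67.1, 80.5),
    "gpt-oss base": (55.4, 53.8, 68.2),
    "qwen-0.5b base": (36.5, 44.5, 67.9),
}

# Unweighted mean over the three judges, then rank from best to worst.
averages = {name: sum(s) / len(s) for name, s in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)
for name in ranking:
    print(f"{name:34s} {averages[name]:5.1f}")

# Per-judge difference between each served model and the stale control.
control = scores["gpt-oss experts (stale control)"]
deltas = {
    name: [round(a - b, 1) for a, b in zip(scores[name], control)]
    for name in ("qwen-0.5b distilled (served)", "qwen-0.5b direct 7k (served)")
}
for name, d in deltas.items():
    print(name, d)
```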
About ProofKit
ProofKit is a work-sample generator for job seekers — it turns a target
role, background, and skills-to-prove into a realistic, clearly-fictional
practice work sample (a role-specific challenge, a guided builder, a readiness
review, and a recruiter-ready portfolio packet). Built for the Hugging Face **Build
Small Hackathon** (Backyard AI track). Integrity rules are load-bearing: outputs
never claim real employment, metrics are labeled hypothetical, and exports carry an
ethical disclosure.
The ProofKit model family
| Repo | What it is |
|---|---|
| visproj/proofkit-qwen0.5b-7k | Qwen2.5-0.5B fine-tuned directly on the 7k set (Transformers) |
| visproj/proofkit-gpt-oss-20b-lora | gpt-oss-20b LoRA — the distillation teacher |
| visproj/proofkit-distilled-qwen0.5b | Qwen2.5-0.5B distilled from the teacher (merged) |
| visproj/proofkit-distilled-qwen0.5b-gguf | GGUF of the distilled student (llama.cpp — served) |
| visproj/proofkit-sft | SFT dataset (synthetic, license-safe) |
| visproj/proofkit-distill-qwen0.5b | Distillation dataset (teacher completions) |
A note on training data (the "static responses" fix)
An earlier version of these models produced repetitive, input-ignoring drafts. The
root cause was synthetic-data leakage: the dataset rendered the example *user
answers and the target* from the same template slots, so the model learned
target = template instead of target = f(input). The fix — faithfulness anchors
(a distinctive token shared by the answer and the target) + **seeded per-example
variation** across every task, then a full-chain retrain — is what these current
weights reflect.
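As a rough illustration of the two ideas named above, the sketch below builds synthetic examples in which a distinctive anchor token appears in both the user answer and the target, with all variation drawn from a per-example seeded generator. Everything in it (the `make_example` function, the role and skill lists, the `ref-` anchor shape) is hypothetical and is not taken from ProofKit's data pipeline.

```python
import random

# Hypothetical slot values, for illustration only.
ROLES = ["data analyst", "support engineer", "product designer"]
SKILLS = ["SQL reporting", "incident triage", "user research"]


def make_example(index: int, base_seed: int = 0) -> dict:
    """Build one synthetic pair whose target depends on the input."""
    # Seeded per-example variation: each index gets its own reproducible RNG.
    rng = random.Random(base_seed * 1_000_003 + index)
    role = rng.choice(ROLES)
    skill = rng.choice(SKILLS)
    # Faithfulness anchor: a distinctive token placed in the answer AND the
    # target, so copying a fixed template cannot reproduce the target.
    anchor = f"ref-{rng.randrange(16**6):06x}"
    answer = f"I want to prove {skill} for a {role} role (project code {anchor})."
    target = f"Practice brief {anchor}: a fictional {role} challenge on {skill}."
    return {"answer": answer, "target": target, "anchor": anchor}


examples = [make_example(i) for i in range(5)]
for ex in examples:
    print(ex["answer"], "->", ex["target"])
```

A model trained on pairs like these can only reproduce the anchor in its output by reading it from the input, which is the property the fix is described as restoring.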
Prompt format is a frozen contract
These 0.5B models were trained on the exact prompt shapes from ProofKit's
prompt_formats.py. They only behave well when prompted in that format; reworded or
free-form prompts push them off-distribution. They are purpose-built components of the
ProofKit app, not general chat models.