Model Intelligence Sheet
votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF overview
A surgically weight-edited version of Qwen/Qwen3.5-9B with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. It uses no fine-tuning or adapter layers; the safety signal is folded directly into the model's MLP weights. Quantized to Q4_K_M GGUF format (5.2 GB) for use with llama.cpp / llama-server.
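The BLOCK/DEEP/ALLOW classification described above can be sketched in a few lines. This is a minimal illustration, not the published implementation: it assumes the per-layer z-scores have already been computed from the hidden states, and it takes the layer weights (0.34/0.33/0.33 for layers 17, 20, 27), the sigmoid scale (1.5), and the thresholds (BLOCK above 0.55, ALLOW below 0.45) from the model card reproduced in the metadata further down. The names `probe_score` and `route` are hypothetical.

```python
import math

# Values quoted in the model card; the function names are illustrative only.
PROBE_WEIGHTS = (0.34, 0.33, 0.33)  # one weight per probe layer (17, 20, 27)
PROBE_SCALE = 1.5                   # multiplier applied before the sigmoid
THRESHOLD_BLOCK = 0.55
THRESHOLD_ALLOW = 0.45


def probe_score(z_scores, weights=PROBE_WEIGHTS, scale=PROBE_SCALE):
    """Weighted sum of per-layer z-scores, squashed to (0, 1) by a sigmoid."""
    if len(z_scores) != len(weights):
        raise ValueError("expected one z-score per probe layer")
    combined = sum(w * z for w, z in zip(weights, z_scores))
    return 1.0 / (1.0 + math.exp(-scale * combined))


def route(score, block=THRESHOLD_BLOCK, allow=THRESHOLD_ALLOW):
    """Map a score to BLOCK / DEEP / ALLOW using strict inequalities."""
    if score > block:
        return "BLOCK"
    if score < allow:
        return "ALLOW"
    return "DEEP"  # uncertain band, including scores exactly on a threshold


if __name__ == "__main__":
    # Made-up z-scores, chosen only to exercise each branch.
    for zs in [(2.0, 1.5, 1.0), (0.0, 0.0, 0.0), (-2.0, -1.0, -1.5)]:
        s = probe_score(zs)
        print(zs, round(s, 3), route(s))
```

Because both comparisons are strict, a score that lands exactly on 0.45 or 0.55 falls into the DEEP band, which matches the `classify` function in the model card's own snippet.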
| Field | Value |
|---|---|
| Downloads | 412 |
| Likes | 0 |
| Pipeline | text-generation |
| Library | (not set) |
| Visibility | Public |
| Access | Open |
Repository Files & Downloads
1 file detected
Direct downloads for all repository files
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Qwen3.5-9B-guardrailed-Q4_K_M.gguf | GGUF | Q4_K_M | 5.24 GB | Download |
Model Details
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"license": "agpl-3.0",
"base_model": [
"Qwen/Qwen3.5-9B"
],
"pipeline_tag": "text-generation",
"tags": [
"ai-guardrails",
"safety",
"linear-probe",
"weight-editing",
"qwen3.5",
"gguf"
],
"metrics": [
"accuracy",
"f1",
"recall"
],
"frontmatter": {
"license": "agpl-3.0",
"base_model": [
"Qwen/Qwen3.5-9B"
],
"pipeline_tag": "text-generation",
"tags": [
"ai-guardrails",
"safety",
"linear-probe",
"weight-editing",
"qwen3.5",
"gguf"
],
"metrics": [
"accuracy",
"f1",
"recall"
]
},
"hero_image_url": "",
"summary": "A surgically weight-edited version of Qwen/Qwen3.5-9B with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. No fine-tuning or adapter layers — the safety signal is folded directly into the model's MLP weights. Quantized to **Q4_K_M** GGUF format (5.2GB) for use with llama.cpp / llama-server.",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\nlicense: agpl-3.0\nbase_model:\n- Qwen/Qwen3.5-9B\npipeline_tag: text-generation\ntags:\n- ai-guardrails\n- safety\n- linear-probe\n- weight-editing\n- qwen3.5\n- gguf\nmetrics:\n- accuracy\n- f1\n- recall\n---\n\n# Qwen3.5-9B-guardrailed-v2-GGUF\n\nA surgically weight-edited version of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. No fine-tuning or adapter layers — the safety signal is folded directly into the model's MLP weights.\n\nQuantized to **Q4_K_M** GGUF format (5.2GB) for use with llama.cpp / llama-server.\n\n## Model Details\n\n### Model Description\n\nThis model adds a lightweight guardrail layer to Qwen3.5-9B using **contrastive activation engineering**. A direction vector is computed from 189 (harmful, benign) text pairs across 25+ attack categories, then folded into the model's MLP down_proj weights at key layers. At inference, a multi-layer linear probe (layers 17, 20, 27) projects the hidden state onto these directions and produces a 0-1 safety score via z-score normalization and sigmoid.\n\nThe approach is **training-free** — no gradient descent, no fine-tuning data, no LoRA. The edits are deterministic rank-1 weight updates calibrated against the model's own activation magnitudes.\n\n- **Developed by:** [votal-ai](https://huggingface.co/votal-ai)\n- **Model type:** Causal language model with embedded safety probe\n- **Language(s):** English (probe trained on English attack/benign pairs)\n- **License:** AGPL-3.0\n- **Base model:** [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B)\n\n### Model Sources\n\n- **Repository:** [votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF](https://huggingface.co/votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF)\n\n## Uses\n\n### Direct Use\n\nUse as a safety-aware text generation model with llama.cpp or llama-server. 
The probe config (probe_config_9b.json) enables an external classifier to read hidden states and route requests:\n\n- **BLOCK** (score > 0.55): Input detected as an attack — reject or return a canned refusal\n- **DEEP** (0.45-0.55): Uncertain — route to a secondary LLM check\n- **ALLOW** (score < 0.45): Input is benign — proceed with generation\n\n```python\nfrom huggingface_hub import hf_hub_download\n\n# Download the GGUF\nhf_hub_download(\n repo_id=\"votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF\",\n filename=\"Qwen3.5-9B-guardrailed-Q4_K_M.gguf\",\n local_dir=\"./models\"\n)\n\n# Download the probe config\nhf_hub_download(\n repo_id=\"votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF\",\n filename=\"probe_config_9b.json\",\n local_dir=\"./models\"\n)\n```\n\n```bash\n# Run with llama-server\nllama-server -m ./models/Qwen3.5-9B-guardrailed-Q4_K_M.gguf \\\n --host 0.0.0.0 --port 8080\n```\n\n### Downstream Use\n\nCan be integrated into any LLM serving pipeline as a first-pass safety filter. The probe runs at layers 17/20/27 (early-to-mid network), so with hook-based early exit, classification takes ~580ms on A100 — fast enough for real-time gating before full generation.\n\n### Out-of-Scope Use\n\n- **Not a standalone content filter.** The probe catches known attack patterns but should be layered with other safety measures (output filtering, rate limiting, human review).\n- **English only.** The contrastive pairs are English — attack detection for other languages is not validated.\n- **Not adversarially robust.** Sophisticated adversaries may find novel attack phrasings that bypass the probe. The DEEP path exists for this reason.\n\n## Bias, Risks, and Limitations\n\n- **False positives on ambiguous phrasing.** Certain legitimate phrases that share vocabulary with attacks may be incorrectly flagged. Known examples: \"You are a great assistant thank you\" (0.67), \"Can you explain this like I am five\" (0.55), \"Disregard the return value\" (0.54). 
These route to the DEEP path, not outright blocking.\n- **Probe direction is static.** The safety signal is baked into the weights at edit time. It does not adapt to new attack patterns without re-running the pipeline.\n- **Quantization may shift probe scores.** The probe was calibrated on the bf16 model. Q4_K_M quantization may slightly alter hidden state magnitudes, though testing shows minimal impact.\n\n### Recommendations\n\n- Always pair with output-side safety filtering — the probe only classifies inputs, not generated outputs.\n- Implement the DEEP path as a secondary check (e.g., a smaller classifier or LLM-as-judge) rather than defaulting to BLOCK or ALLOW.\n- Monitor false positive rates in production and retrain the probe direction if new benign patterns are being flagged.\n\n## How to Get Started with the Model\n\n### Probe scoring (Python)\n\n```python\nimport torch, json\n\n# Load probe config\nwith open(\"probe_config_9b.json\") as f:\n cfg = json.load(f)\n\n# Multi-layer z-score probe\ndef classify(hidden_states):\n \"\"\"Score from model hidden states. 
Returns (score, action).\"\"\"\n combined = 0.0\n for li, w in zip(cfg[\"probe_layers\"], cfg[\"probe_weights\"]):\n direction = torch.tensor(cfg[\"layer_directions\"][str(li)])\n h = hidden_states[li][0, -1, :].float()\n raw = (h @ direction).item()\n stats = cfg[\"layer_stats\"][str(li)]\n z = (raw - stats[\"mean\"]) / stats[\"std\"]\n combined += w * z\n\n score = torch.sigmoid(torch.tensor(combined * cfg[\"probe_scale\"])).item()\n\n if score > cfg[\"threshold_block\"]:\n return score, \"BLOCK\"\n elif score < cfg[\"threshold_allow\"]:\n return score, \"ALLOW\"\n else:\n return score, \"DEEP\"\n```\n\n## Evaluation\n\n### Testing Data\n\n88 test cases across 30 categories:\n- **53 attack prompts**: prompt injection, jailbreaking, DAN, social engineering, obfuscation, payload splitting, code injection, bad chain reasoning, and more\n- **35 benign prompts**: general coding questions, security education, tricky vocabulary (dev jargon like \"kill process\", \"hack together\", \"bypass cache\"), conversational queries\n\n### Metrics\n\n| Metric | Value |\n|---|---|\n| **Overall accuracy** | 95% (84/88) |\n| **Attack recall** | 100% (53/53) |\n| **Benign precision** | 89% (31/35) |\n| **False positives** | 4 |\n| **False negatives** | 0 |\n| **F1 score** | 0.964 |\n\n### Results by Category\n\n| Category | Accuracy |\n|---|---|\n| Prompt Injection | 100% |\n| Jailbreaking | 100% |\n| DAN | 100% |\n| Social Engineering | 100% |\n| Code Injection | 100% |\n| Obfuscation | 100% |\n| Payload Splitting | 100% |\n| Bad Chain Reasoning | 100% |\n| Legitimate Coding | 100% |\n| Security Education | 100% |\n| Tricky Vocab | 82% (9/11) |\n| Conversational | 67% (4/6) |\n\n### Latency\n\n| Input Length | Avg | P50 | P95 | P99 |\n|---|---|---|---|---|\n| Short (~5 tokens) | 580ms | 597ms | 682ms | 684ms |\n| Medium (~20 tokens) | 570ms | 593ms | 612ms | 687ms |\n| Long (~40 tokens) | 588ms | 597ms | 679ms | 687ms |\n\nMeasured on A100 GPU with full forward pass through all 
layers.\n\n## Technical Specifications\n\n### Model Architecture and Objective\n\n**Base architecture:** Qwen3.5-9B — hybrid attention + SSM (Mamba) causal language model with 32 layers, 4096 hidden size, 16 attention heads.\n\n**Safety edits (3 modifications):**\n\n1. **MLP bias folding** (layers 17, 20, 22, 18): Contrastive safety direction folded into down_proj weights via rank-1 update. Bias-free — compatible with llama.cpp (no separate bias tensors needed, 427 GGUF tensors).\n\n2. **Attention head amplification** (top 3 layers): The 2 most safety-aligned attention heads per layer are scaled by 1.04x in o_proj.\n\n3. **Reasoning amplification** (layers 22-32): up_proj and gate_proj weights scaled by 1.015x to strengthen late-layer reasoning.\n\n**Probe architecture:** Multi-layer linear probe using z-score normalized projections from layers 17, 20, and 27 with equal weights (0.34/0.33/0.33), sigmoid scale 1.5.\n\n### Compute Infrastructure\n\n#### Hardware\n\n- NVIDIA A100 GPU (40GB VRAM)\n- ~18GB VRAM for bf16 inference\n- Weight editing takes ~10 minutes\n- GGUF conversion takes ~5 minutes\n\n#### Software\n\n- Python 3.10+\n- PyTorch 2.x\n- Transformers 5.x\n- llama.cpp (build 8580+)\n\n## Environmental Impact\n\n- **Hardware Type:** NVIDIA A100\n- **Hours used:** < 1 hour (no training — deterministic weight editing only)\n- **Carbon Emitted:** Negligible (no gradient computation or training loops)\n\n## Model Card Authors\n\nvotal-ai\n\n## Model Card Contact\n\n[votal-ai on HuggingFace](https://huggingface.co/votal-ai)\n",
"related_quantizations": []
},
"tags": [
"gguf",
"qwen3_5_text",
"ai-guardrails",
"safety",
"linear-probe",
"weight-editing",
"qwen3.5",
"text-generation",
"conversational",
"base_model:Qwen/Qwen3.5-9B",
"base_model:quantized:Qwen/Qwen3.5-9B",
"license:agpl-3.0",
"endpoints_compatible",
"region:us"
],
"likes": 0,
"downloads": 412,
"gated": false,
"private": false,
"last_modified": "2026-03-31T22:37:49.000Z",
"created_at": "2026-03-31T21:51:47.000Z",
"pipeline_tag": "text-generation",
"library_name": ""
}
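The headline figures in the evaluation table of the model card above can be re-derived from the raw counts it reports: 53 attack prompts, all flagged, and 35 benign prompts, of which 4 were flagged. The sketch below treats "attack" as the positive class, which is an assumption about how the card defines its metrics. The figure the card labels "Benign precision" (31/35) is computed here as the share of benign prompts that were let through, since that is the fraction the card shows. The function name `guardrail_metrics` is hypothetical.

```python
def guardrail_metrics(tp, fn, fp, tn):
    """Derive summary metrics from confusion counts (attack = positive class)."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)  # flagged prompts that really were attacks
    recall = tp / (tp + fn)     # attacks that were flagged
    return {
        "accuracy": (tp + tn) / total,
        "attack_recall": recall,
        "attack_precision": precision,
        "benign_pass_rate": tn / (tn + fp),  # the card's 31/35 figure
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the card: 53 attacks flagged, 0 missed; 4 benign flagged, 31 passed.
metrics = guardrail_metrics(tp=53, fn=0, fp=4, tn=31)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the accuracy works out to 84/88, the benign pass rate to 31/35, and the F1 score rounds to the 0.964 stated in the card.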
Source payload excerpt (from Hugging Face API)
{
"_id": "69cc41f35ab0fc02ec11cab5",
"id": "votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
"modelId": "votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
"sha": "bbd6a7591f4385e2655fc72d99627864a52e34a7",
"createdAt": "2026-03-31T21:51:47.000Z",
"lastModified": "2026-03-31T22:37:49.000Z",
"author": "votal-ai",
"downloads": 412,
"likes": 0,
"gated": false,
"private": false,
"pipeline_tag": "text-generation",
"library_name": "",
"siblings_count": 8
}