Model Intelligence Sheet

votal-ai/qwen3.5-9b-guardrailed-v2-gguf overview

A surgically weight-edited version of Qwen/Qwen3.5-9B with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. It uses no fine-tuning or adapter layers: the safety signal is folded directly into the model's MLP weights. Quantized to Q4_K_M GGUF format (5.2 GB) for use with llama.cpp / llama-server.
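
As a rough illustration of how a BLOCK/DEEP/ALLOW decision can be derived from a probe score, the sketch below combines per-layer z-scores and applies the cut-offs quoted in the model card reproduced later on this page (layer weights 0.34/0.33/0.33, sigmoid scale 1.5, BLOCK above 0.55, ALLOW below 0.45). The function names `probe_score` and `route` are hypothetical, and the z-scores are passed in directly instead of being read from real hidden states, so this is a minimal sketch under those assumptions and not the author's implementation.

```python
import math

# Values quoted in the model card; treated here as assumptions, since the
# actual probe_config_9b.json is not reproduced on this page.
LAYER_WEIGHTS = (0.34, 0.33, 0.33)   # layers 17, 20, 27
PROBE_SCALE = 1.5
THRESHOLD_BLOCK = 0.55
THRESHOLD_ALLOW = 0.45


def probe_score(z_scores, weights=LAYER_WEIGHTS, scale=PROBE_SCALE):
    """Combine per-layer z-scores into a 0-1 score with a scaled sigmoid."""
    if len(z_scores) != len(weights):
        raise ValueError("expected one z-score per probe layer")
    combined = sum(w * z for w, z in zip(weights, z_scores))
    return 1.0 / (1.0 + math.exp(-scale * combined))


def route(score, block=THRESHOLD_BLOCK, allow=THRESHOLD_ALLOW):
    """Map a probe score to one of the three routing actions."""
    if score > block:
        return "BLOCK"
    if score < allow:
        return "ALLOW"
    return "DEEP"
```

With all z-scores at zero the score is exactly 0.5, which lands in the DEEP band, so an input that looks average on every probed layer is sent to the secondary check and is neither blocked nor allowed outright.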

gguf, qwen3_5_text, ai-guardrails, safety, linear-probe, weight-editing, qwen3.5, text-generation, conversational, base_model:Qwen/Qwen3.5-9B, base_model:quantized:Qwen/Qwen3.5-9B, license:agpl-3.0, endpoints_compatible, region:us

Downloads

412

Likes

0

Pipeline

text-generation

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

1 file detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Qwen3.5-9B-guardrailed-Q4_K_M.gguf	GGUF	Q4_K_M	5.24 GB	Download

Model Details

Model Slug

votal-ai/qwen3.5-9b-guardrailed-v2-gguf

Author

votal-ai

Pipeline Task

text-generation

Library

—

Created

2026-03-31

Last Modified

2026-03-31

Gated

No

Private

No

HF SHA

bbd6a7591f4385e2655fc72d99627864a52e34a7

License

agpl-3.0

Language

Unknown

Base Model

Qwen/Qwen3.5-9B

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "agpl-3.0",
    "base_model": [
      "Qwen/Qwen3.5-9B"
    ],
    "pipeline_tag": "text-generation",
    "tags": [
      "ai-guardrails",
      "safety",
      "linear-probe",
      "weight-editing",
      "qwen3.5",
      "gguf"
    ],
    "metrics": [
      "accuracy",
      "f1",
      "recall"
    ],
    "frontmatter": {
      "license": "agpl-3.0",
      "base_model": [
        "Qwen/Qwen3.5-9B"
      ],
      "pipeline_tag": "text-generation",
      "tags": [
        "ai-guardrails",
        "safety",
        "linear-probe",
        "weight-editing",
        "qwen3.5",
        "gguf"
      ],
      "metrics": [
        "accuracy",
        "f1",
        "recall"
      ]
    },
    "hero_image_url": "",
    "summary": "A surgically weight-edited version of Qwen/Qwen3.5-9B with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. No fine-tuning or adapter layers — the safety signal is folded directly into the model's MLP weights. Quantized to **Q4_K_M** GGUF format (5.2GB) for use with llama.cpp / llama-server.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: agpl-3.0\nbase_model:\n- Qwen/Qwen3.5-9B\npipeline_tag: text-generation\ntags:\n- ai-guardrails\n- safety\n- linear-probe\n- weight-editing\n- qwen3.5\n- gguf\nmetrics:\n- accuracy\n- f1\n- recall\n---\n\n# Qwen3.5-9B-guardrailed-v2-GGUF\n\nA surgically weight-edited version of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. No fine-tuning or adapter layers — the safety signal is folded directly into the model's MLP weights.\n\nQuantized to **Q4_K_M** GGUF format (5.2GB) for use with llama.cpp / llama-server.\n\n## Model Details\n\n### Model Description\n\nThis model adds a lightweight guardrail layer to Qwen3.5-9B using **contrastive activation engineering**. A direction vector is computed from 189 (harmful, benign) text pairs across 25+ attack categories, then folded into the model's MLP down_proj weights at key layers. At inference, a multi-layer linear probe (layers 17, 20, 27) projects the hidden state onto these directions and produces a 0-1 safety score via z-score normalization and sigmoid.\n\nThe approach is **training-free** — no gradient descent, no fine-tuning data, no LoRA. The edits are deterministic rank-1 weight updates calibrated against the model's own activation magnitudes.\n\n- **Developed by:** [votal-ai](https://huggingface.co/votal-ai)\n- **Model type:** Causal language model with embedded safety probe\n- **Language(s):** English (probe trained on English attack/benign pairs)\n- **License:** AGPL-3.0\n- **Base model:** [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B)\n\n### Model Sources\n\n- **Repository:** [votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF](https://huggingface.co/votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF)\n\n## Uses\n\n### Direct Use\n\nUse as a safety-aware text generation model with llama.cpp or llama-server. 
The probe config (probe_config_9b.json) enables an external classifier to read hidden states and route requests:\n\n- **BLOCK** (score > 0.55): Input detected as an attack — reject or return a canned refusal\n- **DEEP** (0.45-0.55): Uncertain — route to a secondary LLM check\n- **ALLOW** (score < 0.45): Input is benign — proceed with generation\n\n```python\nfrom huggingface_hub import hf_hub_download\n\n# Download the GGUF\nhf_hub_download(\n    repo_id=\"votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF\",\n    filename=\"Qwen3.5-9B-guardrailed-Q4_K_M.gguf\",\n    local_dir=\"./models\"\n)\n\n# Download the probe config\nhf_hub_download(\n    repo_id=\"votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF\",\n    filename=\"probe_config_9b.json\",\n    local_dir=\"./models\"\n)\n```\n\n```bash\n# Run with llama-server\nllama-server -m ./models/Qwen3.5-9B-guardrailed-Q4_K_M.gguf \\\n  --host 0.0.0.0 --port 8080\n```\n\n### Downstream Use\n\nCan be integrated into any LLM serving pipeline as a first-pass safety filter. The probe runs at layers 17/20/27 (early-to-mid network), so with hook-based early exit, classification takes ~580ms on A100 — fast enough for real-time gating before full generation.\n\n### Out-of-Scope Use\n\n- **Not a standalone content filter.** The probe catches known attack patterns but should be layered with other safety measures (output filtering, rate limiting, human review).\n- **English only.** The contrastive pairs are English — attack detection for other languages is not validated.\n- **Not adversarially robust.** Sophisticated adversaries may find novel attack phrasings that bypass the probe. The DEEP path exists for this reason.\n\n## Bias, Risks, and Limitations\n\n- **False positives on ambiguous phrasing.** Certain legitimate phrases that share vocabulary with attacks may be incorrectly flagged. Known examples: \"You are a great assistant thank you\" (0.67), \"Can you explain this like I am five\" (0.55), \"Disregard the return value\" (0.54). 
Only scores in the 0.45-0.55 band route to the DEEP path; under the thresholds above, the 0.67 example exceeds the BLOCK threshold.\n- **Probe direction is static.** The safety signal is baked into the weights at edit time. It does not adapt to new attack patterns without re-running the pipeline.\n- **Quantization may shift probe scores.** The probe was calibrated on the bf16 model. Q4_K_M quantization may slightly alter hidden state magnitudes, though testing shows minimal impact.\n\n### Recommendations\n\n- Always pair with output-side safety filtering, because the probe only classifies inputs, not generated outputs.\n- Implement the DEEP path as a secondary check (e.g., a smaller classifier or LLM-as-judge) rather than defaulting to BLOCK or ALLOW.\n- Monitor false positive rates in production and recompute the probe direction if new benign patterns are being flagged.\n\n## How to Get Started with the Model\n\n### Probe scoring (Python)\n\n```python\nimport torch, json\n\n# Load probe config\nwith open(\"probe_config_9b.json\") as f:\n    cfg = json.load(f)\n\n# Multi-layer z-score probe\ndef classify(hidden_states):\n    \"\"\"Score from model hidden states. 
Returns (score, action).\"\"\"\n    combined = 0.0\n    for li, w in zip(cfg[\"probe_layers\"], cfg[\"probe_weights\"]):\n        direction = torch.tensor(cfg[\"layer_directions\"][str(li)])\n        h = hidden_states[li][0, -1, :].float()\n        raw = (h @ direction).item()\n        stats = cfg[\"layer_stats\"][str(li)]\n        z = (raw - stats[\"mean\"]) / stats[\"std\"]\n        combined += w * z\n\n    score = torch.sigmoid(torch.tensor(combined * cfg[\"probe_scale\"])).item()\n\n    if score > cfg[\"threshold_block\"]:\n        return score, \"BLOCK\"\n    elif score < cfg[\"threshold_allow\"]:\n        return score, \"ALLOW\"\n    else:\n        return score, \"DEEP\"\n```\n\n## Evaluation\n\n### Testing Data\n\n88 test cases across 30 categories:\n- **53 attack prompts**: prompt injection, jailbreaking, DAN, social engineering, obfuscation, payload splitting, code injection, bad chain reasoning, and more\n- **35 benign prompts**: general coding questions, security education, tricky vocabulary (dev jargon like \"kill process\", \"hack together\", \"bypass cache\"), conversational queries\n\n### Metrics\n\n| Metric | Value |\n|---|---|\n| **Overall accuracy** | 95% (84/88) |\n| **Attack recall** | 100% (53/53) |\n| **Benign precision** | 89% (31/35) |\n| **False positives** | 4 |\n| **False negatives** | 0 |\n| **F1 score** | 0.964 |\n\n### Results by Category\n\n| Category | Accuracy |\n|---|---|\n| Prompt Injection | 100% |\n| Jailbreaking | 100% |\n| DAN | 100% |\n| Social Engineering | 100% |\n| Code Injection | 100% |\n| Obfuscation | 100% |\n| Payload Splitting | 100% |\n| Bad Chain Reasoning | 100% |\n| Legitimate Coding | 100% |\n| Security Education | 100% |\n| Tricky Vocab | 82% (9/11) |\n| Conversational | 67% (4/6) |\n\n### Latency\n\n| Input Length | Avg | P50 | P95 | P99 |\n|---|---|---|---|---|\n| Short (~5 tokens) | 580ms | 597ms | 682ms | 684ms |\n| Medium (~20 tokens) | 570ms | 593ms | 612ms | 687ms |\n| Long (~40 tokens) | 588ms | 597ms | 
679ms | 687ms |\n\nMeasured on A100 GPU with full forward pass through all layers.\n\n## Technical Specifications\n\n### Model Architecture and Objective\n\n**Base architecture:** Qwen3.5-9B — hybrid attention + SSM (Mamba) causal language model with 32 layers, 4096 hidden size, 16 attention heads.\n\n**Safety edits (3 modifications):**\n\n1. **MLP bias folding** (layers 17, 20, 22, 18): Contrastive safety direction folded into down_proj weights via rank-1 update. Bias-free — compatible with llama.cpp (no separate bias tensors needed, 427 GGUF tensors).\n\n2. **Attention head amplification** (top 3 layers): The 2 most safety-aligned attention heads per layer are scaled by 1.04x in o_proj.\n\n3. **Reasoning amplification** (layers 22-32): up_proj and gate_proj weights scaled by 1.015x to strengthen late-layer reasoning.\n\n**Probe architecture:** Multi-layer linear probe using z-score normalized projections from layers 17, 20, and 27 with equal weights (0.34/0.33/0.33), sigmoid scale 1.5.\n\n### Compute Infrastructure\n\n#### Hardware\n\n- NVIDIA A100 GPU (40GB VRAM)\n- ~18GB VRAM for bf16 inference\n- Weight editing takes ~10 minutes\n- GGUF conversion takes ~5 minutes\n\n#### Software\n\n- Python 3.10+\n- PyTorch 2.x\n- Transformers 5.x\n- llama.cpp (build 8580+)\n\n## Environmental Impact\n\n- **Hardware Type:** NVIDIA A100\n- **Hours used:** < 1 hour (no training — deterministic weight editing only)\n- **Carbon Emitted:** Negligible (no gradient computation or training loops)\n\n## Model Card Authors\n\nvotal-ai\n\n## Model Card Contact\n\n[votal-ai on HuggingFace](https://huggingface.co/votal-ai)\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "qwen3_5_text",
    "ai-guardrails",
    "safety",
    "linear-probe",
    "weight-editing",
    "qwen3.5",
    "text-generation",
    "conversational",
    "base_model:Qwen/Qwen3.5-9B",
    "base_model:quantized:Qwen/Qwen3.5-9B",
    "license:agpl-3.0",
    "endpoints_compatible",
    "region:us"
  ],
  "likes": 0,
  "downloads": 412,
  "gated": false,
  "private": false,
  "last_modified": "2026-03-31T22:37:49.000Z",
  "created_at": "2026-03-31T21:51:47.000Z",
  "pipeline_tag": "text-generation",
  "library_name": ""
}
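
The evaluation figures quoted in the `readme_markdown` field above can be re-derived from its confusion counts (53 attacks caught, 0 missed, 4 benign prompts flagged, 31 benign prompts passed). The sketch below is a worked check under the assumption that "attack" is the positive class; the helper name `summarize` is hypothetical. Note that 31/35, which the card labels "benign precision", is by the usual definition the fraction of benign prompts correctly passed (specificity).

```python
def summarize(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard binary-classification metrics with 'attack' as the positive class."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)          # flagged prompts that were real attacks
    recall = tp / (tp + fn)             # attacks that were flagged
    return {
        "accuracy": (tp + tn) / total,
        "attack_recall": recall,
        "attack_precision": precision,
        "benign_pass_rate": tn / (tn + fp),   # specificity
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the Metrics table in the model card.
metrics = summarize(tp=53, fn=0, fp=4, tn=31)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The rounded results agree with the card: accuracy 84/88, attack recall 53/53, and an F1 of 106/110, which rounds to 0.964.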

Source payload excerpt (from Hugging Face API)

{
  "_id": "69cc41f35ab0fc02ec11cab5",
  "id": "votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
  "modelId": "votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
  "sha": "bbd6a7591f4385e2655fc72d99627864a52e34a7",
  "createdAt": "2026-03-31T21:51:47.000Z",
  "lastModified": "2026-03-31T22:37:49.000Z",
  "author": "votal-ai",
  "downloads": 412,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "",
  "siblings_count": 8
}