GraySoft
Model Intelligence Sheet

pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-q4-k-m overview

Quantized, converted, and evaluated by PBH Applied Systems, LLC (Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure)

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.

🧠 This is a reasoning-architecture model. Ministral-3-14B-Reasoning-2512 generates internal chain-of-thought traces before producing its final answer. This has significant implications for inference speed, tool-calling behavior, and quantization tradeoffs, all documented in detail below.

gguf, quantized, q4_k_m, mistral, reasoning, chain-of-thought, llama-cpp, agentic, structured-output, pbh-applied-systems, quant-eval, en, base_model:mistralai/Ministral-3-14B-Reasoning-2512, base_model:quantized:mistralai/Ministral-3-14B-Reasoning-2512, license:apache-2.0, endpoints_compatible, region:us, conversational
Downloads
1,012
Likes
0
Pipeline
Library
Visibility
Public
Access
Open

Repository Files & Downloads

1 file detected
Direct downloads for all repository files
File | Type | Quantization | Size | Link
ministral-3-14b-reasoning-2512-gguf-Q4-K-M.gguf | GGUF | (not listed) | 7.67 GB | Download
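The model card publishes a SHA256 for this GGUF, so a download can be checked before it is loaded. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the expected digest is copied from the card's Key Characteristics list; confirm it against the repository before relying on it.

```python
import hashlib
from pathlib import Path

# SHA256 for the Q4_K_M GGUF as published in the model card (copied, not recomputed).
EXPECTED_SHA256 = "e7171d96748ddc948fd6d9edb3d1c6e3f9ba6b855ff964aee98519788da330c2"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a multi-gigabyte GGUF is never fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True only if the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

A typical call would be `matches_published_hash("ministral-3-14b-reasoning-2512-gguf-Q4-K-M.gguf")` after the download finishes; a `False` result suggests a truncated or altered file.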

Model Details Live

Model Slug
pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-q4-k-m
Author
pbhappliedsystems
Pipeline Task
Library
Created
2026-04-11
Last Modified
2026-04-15
Gated
No
Private
No
HF SHA
3ab7c0c089326d103052169da5dd76f633f3954f
License
apache-2.0
Language
en
Base Model
mistralai/Ministral-3-14B-Reasoning-2512

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "language": [
      "en"
    ],
    "license": "apache-2.0",
    "base_model": "mistralai/Ministral-3-14B-Reasoning-2512",
    "tags": [
      "gguf",
      "quantized",
      "q4_k_m",
      "mistral",
      "reasoning",
      "chain-of-thought",
      "llama-cpp",
      "agentic",
      "structured-output",
      "pbh-applied-systems",
      "quant-eval"
    ],
    "frontmatter": {
      "language": [
        "en"
      ],
      "license": "apache-2.0",
      "base_model": "mistralai/Ministral-3-14B-Reasoning-2512",
      "tags": [
        "gguf",
        "quantized",
        "q4_k_m",
        "mistral",
        "reasoning",
        "chain-of-thought",
        "llama-cpp",
        "agentic",
        "structured-output",
        "pbh-applied-systems",
        "quant-eval"
      ]
    },
    "hero_image_url": "",
    "summary": "**Quantized, converted, and evaluated by PBH Applied Systems, LLC** — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure > 🔬 **This repository is part of a production-oriented evaluation series.** Every model published under pbhappliedsystems has been independently evaluated using **quant_eval v7.21** — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies. > 🧠 **This is a reasoning-architecture model.** Ministral-3-14B-Reasoning-2512 generates internal chain-of-thought traces before producing its final answer. This has significant implications for inference speed, tool-calling behavior, and quantization tradeoffs — all documented in detail below. ---",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlanguage:\n  - en\nlicense: apache-2.0\nbase_model: mistralai/Ministral-3-14B-Reasoning-2512\ntags:\n  - gguf\n  - quantized\n  - q4_k_m\n  - mistral\n  - reasoning\n  - chain-of-thought\n  - llama-cpp\n  - agentic\n  - structured-output\n  - pbh-applied-systems\n  - quant-eval\n---\n\n# Ministral-3-14B-Reasoning-2512 · GGUF Q4\\_K\\_M\n\n**Quantized, converted, and evaluated by [PBH Applied Systems, LLC](https://pbhappliedsystems.com)**\n— Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure\n\n> 🔬 **This repository is part of a production-oriented evaluation series.** Every model published under [`pbhappliedsystems`](https://huggingface.co/pbhappliedsystems) has been independently evaluated using **quant_eval v7.21** — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies.\n\n> 🧠 **This is a reasoning-architecture model.** Ministral-3-14B-Reasoning-2512 generates internal chain-of-thought traces before producing its final answer. This has significant implications for inference speed, tool-calling behavior, and quantization tradeoffs — all documented in detail below.\n\n---\n\n## Model Description\n\nThis repository contains the **4-bit quantized (Q4\\_K\\_M)** GGUF of [`mistralai/Ministral-3-14B-Reasoning-2512`](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512), a 14-billion parameter reasoning-tuned model from Mistral AI (December 2025 release). 
This is the reasoning variant of the Ministral-3-14B architecture — distinct from the instruction-tuned variant ([`pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-Q4-K-M`](https://huggingface.co/pbhappliedsystems/ministral-3-14b-instruct-2512-gguf-Q4-K-M)) in that it performs extended internal deliberation before generating output.\n\nThe Q4\\_K\\_M format applies 4-bit quantization with K-quant medium precision. As documented in the evaluation section below, Q4\\_K\\_M quantization has a pronounced and measurable effect on this model's reasoning behavior — compressing the chain-of-thought dramatically and altering failure modes in ways that differ substantially from the F16 baseline.\n\nThe full-precision F16 baseline is published separately at [`pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-F16`](https://huggingface.co/pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-F16).\n\n### Key Characteristics\n\n- **Parameters:** 14B\n- **Architecture:** Reasoning (extended chain-of-thought)\n- **Format:** GGUF Q4\\_K\\_M\n- **File size:** 8.24 GB\n- **SHA256:** `e7171d96748ddc948fd6d9edb3d1c6e3f9ba6b855ff964aee98519788da330c2`\n- **Minimum VRAM (GPU inference):** ~10–12 GB (T4 class or better)\n- **Recommended GPU tier:** NVIDIA T4 (16 GB) · RTX 3080/4080 · A10G\n- **Context window:** 32,768 tokens (per base model specification)\n- **Q4\\_K\\_M avg inference time (eval hardware):** **1.18 sec/case** on RTX 4090\n- **F16 avg inference time (eval hardware):** **65.67 sec/case** on RTX 4090\n\n---\n\n## ⚡ The Quantization Speed Story — 55.7× Faster, With Tradeoffs\n\nThe single most striking finding from this evaluation is the inference timing differential between precision levels:\n\n| Runner | Avg sec/case | fuzz | json | json\\_multistep | toolcall | stateful |\n|---|---:|---:|---:|---:|---:|---:|\n| F16 (`full_weight_transformers`) | 65.67 | 84.22 | 105.82 | 99.64 | 22.90 | 15.93 |\n| Q4\\_K\\_M (`quantized_llama_cpp`) | **1.18** | 1.21 | 1.50 | 2.98 | 
0.64 | 0.46 |\n| **Speedup** | **55.7×** | 69.6× | 70.5× | 33.4× | 35.8× | 34.6× |\n\n**The chain-of-thought is the cost at F16.** The F16 reasoning model spends 35–176 seconds per case generating internal deliberation traces before producing output. Fuzz case `fuzz_0000` took **176.47 seconds** at full precision. The `json` family averaged over **105 seconds per case**.\n\n**Q4\\_K\\_M compresses the reasoning dramatically.** At 1.18 sec/case average, the Q4\\_K\\_M variant is not just faster; it is the fastest model in the PBH Applied Systems evaluated series, faster even than Qwen2.5-3B-Instruct. This speed comes from the effective compression or elimination of the extended chain-of-thought generation. Whether this constitutes a deployment advantage or a capability regression depends entirely on your use case, and the evaluation data below provides the evidence to make that determination.\n\n---\n\n## PBH Applied Systems Evaluation — quant\\_eval v7.21\n\n> **Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21**\n> Run ID: `20260209_233252` · Fixtures: `golden_oracle_fixtures_v7_21` (SHA256: `6d71a0b9147c...`) · Seed: 42\n> Hardware: NVIDIA RTX 4090 · Total rows evaluated: 84 (42 F16 · 42 Q4\\_K\\_M)\n\n### Aggregate Scores (Q4\\_K\\_M)\n\nScores are normalized to [0.0 – 1.0]. Higher is better.\n\n| Dimension | Score |\n|---|---:|\n| Task Completion | 0.6786 |\n| Reasoning | **0.9389** |\n| Coherence | 0.9259 |\n| Instruction Following | 0.9649 |\n| **Q4\\_K\\_M avg inference time** | **1.18 sec/case** |\n\n> **Note on avg_secs:** The 33.43 sec figure that appears in cross-model comparison tables is the arithmetic mean of both runners: (65.67 F16 + 1.18 Q4\\_K\\_M) ÷ 2 = 33.425.
The Q4\\_K\\_M standalone figure of **1.18 sec/case** is the operationally relevant number for deployment planning.\n\n### Per-Family Pass Rates\n\n#### F16 Baseline (`full_weight_transformers`)\n\n| Family | N | Pass Rate | Avg Secs | Notes |\n|---|---:|---:|---:|---|\n| json\\_multistep | 5 | 0.600 | 99.64 | 3 pass — 2× oracle\\_equiv\\_ok failure |\n| stateful\\_followup | 2 | **1.000** | 15.93 | Both turns exact match |\n| toolcall\\_only | 2 | 0.000 | 14.41 | tool\\_name\\_ok=0.5, args\\_ok=0.0 |\n| mixed\\_brief\\_json | 2 | **1.000** | 12.73 | Answer line + JSON schema correct |\n| toolcall | 2 | **1.000** | 22.90 | Tool parse + schema valid |\n| json | 4 | n/a | 105.82 | bucket\\_score avg = 10.000 |\n| fuzz | 20 | n/a | 84.22 | bucket\\_score avg = 10.000 |\n| mcq | 5 | n/a | 4.02 | bucket\\_score avg = 0.800 |\n\n#### Q4\\_K\\_M (`quantized_llama_cpp`)\n\n| Family | N | Pass Rate | Δ vs F16 | Avg Secs | Notes |\n|---|---:|---:|---:|---:|---|\n| json\\_multistep | 5 | 0.600 | 0.000 | 2.98 | Different failure pattern — see below |\n| stateful\\_followup | 2 | **1.000** | 0.000 | 0.46 | Perfect retention maintained |\n| toolcall\\_only | 2 | 0.000 | 0.000 | 0.64 | tool\\_name\\_ok=1.0, args\\_ok=0.0 |\n| mixed\\_brief\\_json | 2 | **1.000** | 0.000 | 0.42 | No degradation |\n| toolcall | 2 | **1.000** | 0.000 | 0.64 | No degradation |\n| json | 4 | n/a | — | 1.50 | bucket\\_score avg = 10.000 |\n| fuzz | 20 | n/a | — | 1.21 | bucket\\_score avg = 10.000 |\n| mcq | 5 | n/a | — | 0.05 | bucket\\_score avg = 0.800 |\n\n---\n\n## Key Findings — Reasoning Architecture Under Quantization\n\n### Finding 1: json\\_multistep — Hard Case Capability Regression\n\nThe aggregate pass rate (0.600) is identical across both runners, but the case-level breakdown reveals a meaningful capability difference:\n\n| Case | Difficulty | F16 Result | Q4\\_K\\_M Result | Delta |\n|---|---|---|---|---|\n| ms\\_easy\\_01 | Easy | ❌ FAIL (oracle\\_equiv\\_ok) | ✅ PASS | Q4\\_K\\_M 
recovers |\n| ms\\_easy\\_02 | Easy | ❌ FAIL (oracle\\_equiv\\_ok) | ❌ FAIL (oracle\\_equiv\\_ok) | Consistent failure |\n| ms\\_med\\_01 | Medium | ✅ PASS | ✅ PASS | Consistent |\n| ms\\_med\\_02 | Medium | ✅ PASS | ✅ PASS | Consistent |\n| ms\\_hard\\_01 | Hard | ✅ **PASS** | ❌ **FAIL** (checks\\_consistent + oracle) | ⚠️ Regression |\n\n**The F16 passes the hard case; the Q4\\_K\\_M fails it.** The F16 model's extended chain-of-thought reasoning (110.73 sec on ms\\_hard\\_01) enables it to work through the harder planning problem correctly. The Q4\\_K\\_M variant, with its compressed reasoning at 3.39 sec, loses this capability precisely where it matters most. The easy case trade also inverts — Q4\\_K\\_M recovers ms\\_easy\\_01 while F16 misses it, suggesting a qualitative shift in how planning is approached rather than uniform degradation.\n\n**Practical implication:** For multi-step planning tasks at medium-to-hard difficulty, the F16 is meaningfully more capable. If planning correctness on harder cases is a deployment requirement, the F16 or an external validation loop is necessary.\n\n### Finding 2: toolcall\\_only — Architectural Incompatibility\n\n`toolcall_only` fails on **both runners** (0.000), but the failure signatures differ in an instructive way:\n\n| Signal | F16 Rate | Q4\\_K\\_M Rate | Interpretation |\n|---|---:|---:|---|\n| tool\\_name\\_ok | 0.500 | **1.000** | Q4\\_K\\_M improves tool name recognition |\n| args\\_ok | 0.000 | 0.000 | Args extraction fails at both precision levels |\n| schema\\_ok | 0.000 | 0.000 | Schema non-compliance at both levels |\n\n**This is an architectural finding, not a quantization finding.** The reasoning model's chain-of-thought output is structurally incompatible with bare schema-only tool dispatch. At F16, the extended reasoning trace interferes with tool name extraction on one of two cases (0.5). 
At Q4\\_K\\_M, the compressed reasoning improves extraction and tool name recognition is actually perfect (1.0), but argument construction remains broken at both precision levels.\n\nThe root cause is the thinking trace: a reasoning model generating chain-of-thought tokens cannot cleanly emit a bare `{\"tool_name\": \"...\", \"args\": {...}}` JSON block without prose scaffolding. This is not a bug in the quantization; it is a characteristic of the reasoning architecture. By contrast, `toolcall` (1.000 on both runners) works correctly: when tool dispatch is embedded within a broader response, the reasoning traces are absorbed into the response structure and the tool call itself is valid.\n\n**Practical implication:** Do not use this model for bare tool-call dispatch at any precision level without a schema enforcement layer. Use `toolcall` mode (tool call embedded in a broader response) for reliable tool use.\n\n### Finding 3: MCQ — Chain-of-Thought Wrapping vs. Answer Accuracy\n\nBoth runners fail `mcq_02`, but for entirely different reasons that reveal the reasoning architecture's behavior:\n\n| Runner | mcq\\_02 result | Detail |\n|---|---|---|\n| F16 | ❌ `invalid_choice raw='```'` | Model outputs markdown fence before answer letter |\n| Q4\\_K\\_M | ❌ `wrong_choice got=A` | Model suppresses fences but selects wrong answer |\n\nThe F16 model wraps its MCQ response in a chain-of-thought + code fence block, causing the choice extractor to capture the fence delimiter instead of the answer letter.
The Q4\\_K\\_M model suppresses this wrapping (consistent with its compressed reasoning behavior) but in doing so, loses whatever deliberation led to the correct answer at F16 — and selects the wrong choice.\n\n**Both fail mcq\\_02; only one of them had a correct answer buried inside an extraction-unfriendly format.**\n\n---\n\n## Signal-Level Diagnostics\n\n### Q4\\_K\\_M — json\\_multistep\n\n| Signal | Rate | Tier |\n|---|---:|---|\n| schema\\_ok | 1.000 | Tier-1 (gating) |\n| checks\\_consistent\\_ok | 0.800 | Tier-1 (gating) |\n| stop\\_semantics\\_ok | 1.000 | Tier-1 (gating) |\n| oracle\\_equiv\\_ok | 0.600 | Tier-1 (gating) |\n| final\\_consistent\\_ok | 0.000 | Tier-2 (tracked, non-gating) |\n| final\\_match\\_reported | 0.000 | Tier-2 (tracked, non-gating) |\n\n> **Note on checks\\_consistent\\_ok:** F16 achieves 1.000 on this signal; Q4\\_K\\_M drops to 0.800. This is the quantization-driven consistency regression on ms\\_hard\\_01.\n\n### Q4\\_K\\_M — stateful\\_followup\n\n| Signal | Rate | Tier |\n|---|---:|---|\n| turn1\\_parse\\_ok | 1.000 | Tier-1 |\n| turn2\\_parse\\_ok | 1.000 | Tier-1 |\n| turn1\\_exact\\_match | 1.000 | Tier-1 |\n| turn2\\_exact\\_match | 1.000 | Tier-1 |\n\n### Q4\\_K\\_M — toolcall\\_only\n\n| Signal | Rate | Tier |\n|---|---:|---|\n| tool\\_name\\_ok | 1.000 | Tier-1 (gating) |\n| args\\_ok | 0.000 | Tier-1 (gating) |\n\n### Q4\\_K\\_M — mixed\\_brief\\_json\n\n| Signal | Rate | Tier |\n|---|---:|---|\n| answer\\_line\\_ok | 1.000 | Tier-1 |\n| json\\_parse\\_ok | 1.000 | Tier-1 |\n| schema\\_ok | 1.000 | Tier-1 |\n\n---\n\n## Recommended Use Cases\n\n### ✅ Deploy with Confidence (Q4\\_K\\_M)\n\n- **Stateful multi-turn agents** — Perfect two-turn state retention (1.000) at 0.46 sec/case. 
Reliable and fast.\n- **Structured JSON outputs (single-step)** — bucket\\_score avg of 10.000 on both `json` and `fuzz`; valid structured outputs at 1.50 sec avg.\n- **Hybrid brief + JSON responses** — `mixed_brief_json` passes at 1.000 in 0.42 sec. The fastest mixed-output result in the evaluated series.\n- **Tool-calling with response scaffolding** — `toolcall` passes at 1.000. Embed tool dispatch in a structured response rather than bare JSON output.\n- **JSON multi-step at easy-to-medium difficulty with validation loop** — ms\\_easy (partial) and ms\\_med cases pass reliably; use with external validation for production.\n- **High-throughput structured inference** — At 1.18 sec/case average, Q4\\_K\\_M is exceptionally fast for a 14B model. Suitable for batch processing pipelines where throughput matters.\n\n### ⚠️ Use with Guardrails (Q4\\_K\\_M)\n\n- **Hard multi-step planning** — ms\\_hard\\_01 fails at Q4\\_K\\_M (passes at F16). Use with an external validator or upgrade to F16 for harder planning horizons.\n- **MCQ and single-choice extraction** — bucket\\_score 0.800 on both runners. Add a post-processing extraction layer to handle potential reasoning token wrapping.\n\n### ❌ Not Recommended (Q4\\_K\\_M or F16)\n\n- **Bare tool-call dispatch (schema-only output)** — `toolcall_only` fails at 0.000 on both runners. This is an architectural limitation of reasoning models, not a quantization artifact. Use `toolcall` mode instead.\n- **Applications requiring full chain-of-thought traces** — Q4\\_K\\_M's compressed reasoning suppresses the deliberation that makes this model variant valuable for hard reasoning tasks. Use F16 if the thinking trace itself is part of the deliverable.\n\n---\n\n## F16 vs. 
Q4\\_K\\_M — When to Choose Each\n\n| Criterion | Q4\\_K\\_M | F16 |\n|---|---|---|\n| VRAM available | ~10–12 GB | ~30 GB |\n| Latency requirement | < 5 sec/response | Acceptable > 60 sec |\n| Planning difficulty | Easy to medium | Medium to hard |\n| Chain-of-thought needed | No | Yes |\n| Tool dispatch | With scaffolding | With scaffolding |\n| Throughput priority | ✅ High | ❌ Low |\n| Hard planning correctness | ❌ ms\\_hard\\_01 fails | ✅ ms\\_hard\\_01 passes |\n\n---\n\n## Hardware Requirements\n\n| Configuration | VRAM Required | Recommended GPU |\n|---|---|---|\n| Q4\\_K\\_M (this repo) · GPU only | ~10–12 GB | T4 16 GB · RTX 3080/4080 · A10G |\n| Q4\\_K\\_M · CPU offload fallback | 8 GB VRAM + 4 GB RAM | Any CUDA-capable GPU |\n| F16 baseline (companion repo) | ~30 GB | A100 40 GB · RTX 4090 · 2× A10G |\n\n---\n\n## Usage\n\n### Installation\n\n```bash\npip install llama-cpp-python huggingface_hub\n```\n\nFor GPU acceleration (CUDA):\n\n```bash\nCMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python --force-reinstall --no-cache-dir\n```\n\n### Python — llama-cpp-python\n\n```python\nfrom huggingface_hub import hf_hub_download\nfrom llama_cpp import Llama\n\nmodel_path = hf_hub_download(\n    repo_id=\"pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-Q4-K-M\",\n    filename=\"ministral-3-14b-reasoning-2512-gguf-Q4-K-M.gguf\"\n)\n\nllm = Llama(\n    model_path=model_path,\n    n_ctx=8192,       # Increase up to 32768; reasoning traces consume context\n    n_gpu_layers=-1,  # -1 offloads all layers to GPU\n    verbose=False,\n)\n\nresponse = llm.create_chat_completion(\n    messages=[\n        {\n            \"role\": \"system\",\n            \"content\": \"You are a precise reasoning assistant. 
Think through the problem carefully before responding.\"\n        },\n        {\n            \"role\": \"user\",\n            \"content\": \"Analyze the following clause for logical inconsistencies and return a structured JSON summary: ...\"\n        }\n    ],\n    temperature=0.7,\n    max_tokens=2048,  # Reasoning models may emit longer outputs even at Q4_K_M\n)\n\nprint(response[\"choices\"][0][\"message\"][\"content\"])\n```\n\nFor tool-calling, always use response-scaffolded mode (see evaluation findings — bare tool dispatch is unreliable):\n\n```python\nimport json, re\nfrom huggingface_hub import hf_hub_download\nfrom llama_cpp import Llama\n\nmodel_path = hf_hub_download(\n    repo_id=\"pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-Q4-K-M\",\n    filename=\"ministral-3-14b-reasoning-2512-gguf-Q4-K-M.gguf\"\n)\n\nllm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1, verbose=False)\n\ndef call_tool_scaffolded(prompt: str) -> dict:\n    \"\"\"\n    Use response-scaffolded tool calling for reasoning models.\n    toolcall pass rate = 1.000; toolcall_only pass rate = 0.000.\n    Embed the tool call instruction in a broader response prompt.\n    \"\"\"\n    response = llm.create_chat_completion(\n        messages=[\n            {\n                \"role\": \"system\",\n                \"content\": (\n                    \"You are a tool-calling assistant. 
When you determine a tool is needed, \"\n                    \"briefly explain your reasoning, then output the tool call as:\\n\"\n                    \"TOOL_CALL: {\\\"tool_name\\\": \\\"...\\\", \\\"args\\\": {...}}\"\n                )\n            },\n            {\"role\": \"user\", \"content\": prompt}\n        ],\n        temperature=0.0,\n        max_tokens=512,\n    )\n    raw = response[\"choices\"][0][\"message\"][\"content\"]\n    # Extract structured tool call from scaffolded response.\n    # raw_decode parses one complete JSON value, so the nested braces in args are handled;\n    # a non-greedy brace regex would stop at the first closing brace and yield invalid JSON.\n    match = re.search(r'TOOL_CALL:\\s*', raw)\n    if match:\n        tool_call, _ = json.JSONDecoder().raw_decode(raw, match.end())\n        return tool_call\n    raise ValueError(f\"No tool call found in response: {raw}\")\n\nresult = call_tool_scaffolded(\"Calculate 5 plus 10 using the add tool.\")\n```\n\n### CLI — llama-cli\n\n```bash\n# One-shot reasoning prompt\nllama-cli \\\n  --model ministral-3-14b-reasoning-2512-gguf-Q4-K-M.gguf \\\n  --chat-template mistral \\\n  --system-prompt \"You are a precise reasoning assistant. Think through the problem carefully.\" \\\n  --prompt \"Analyze the following data and identify the three most significant risk factors.
Return a JSON array.\" \\\n  --n-predict 2048 \\\n  --ctx-size 8192 \\\n  --n-gpu-layers -1 \\\n  --temp 0.15\n```\n\nFor server deployment (OpenAI-compatible endpoint):\n\n```bash\nllama-server \\\n  --model ministral-3-14b-reasoning-2512-gguf-Q4-K-M.gguf \\\n  --chat-template mistral \\\n  --ctx-size 8192 \\\n  --n-gpu-layers -1 \\\n  --port 8080 \\\n  --host 0.0.0.0\n```\n\nQuery via the OpenAI-compatible API:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8080/v1\", api_key=\"not-required\")\n\nresponse = client.chat.completions.create(\n    model=\"ministral-3-14b-reasoning-2512-gguf-Q4-K-M\",\n    messages=[{\"role\": \"user\", \"content\": \"Your prompt here\"}],\n    temperature=0.7,\n)\nprint(response.choices[0].message.content)\n```\n\n---\n\n## Artifact Provenance\n\n| Artifact | Format | Size | SHA256 |\n|---|---|---|---|\n| `ministral-3-14b-reasoning-2512-gguf-Q4-K-M.gguf` | GGUF Q4\\_K\\_M | 8.24 GB | `e7171d96748ddc948fd6d9edb3d1c6e3f9ba6b855ff964aee98519788da330c2` |\n| F16 *(companion repo)* | GGUF F16 | 27.0 GB | `7645d01deed3415326c7c2bf8b58280e234021a91b4b3ade52b4735c976ad221` |\n\nBoth artifacts were produced from `mistralai/Ministral-3-14B-Reasoning-2512` using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.\n\n---\n\n## Evaluation Methodology\n\n**quant_eval v7.21** is a proprietary behavioral evaluation harness developed by PBH Applied Systems. 
It evaluates both the full-precision (F16) and quantized variants against an identical fixture set, enabling direct comparison of capability retention across quantization levels.\n\n**Fixture set:** `golden_oracle_fixtures_v7_21` (SHA256: `6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0`)\n\n| Family | Description | Pass Signals |\n|---|---|---|\n| `fuzz` | Property-based regression; structured placement correctness | schema\\_ok, constraints\\_ok |\n| `json` | Single-step structured JSON with constraint rules | schema\\_ok, constraints\\_ok |\n| `json_multistep` | Multi-step planning with self-check and oracle verification | schema\\_ok, checks\\_consistent\\_ok, stop\\_semantics\\_ok, oracle\\_equiv\\_ok |\n| `mcq` | Multiple-choice extraction | choice\\_ok |\n| `stateful_followup` | Two-turn state tracking; turn-2 correct given turn-1 | turn1/2\\_parse\\_ok, turn1/2\\_exact\\_match |\n| `mixed_brief_json` | Hybrid: natural language answer + valid JSON block | answer\\_line\\_ok, json\\_parse\\_ok, schema\\_ok |\n| `toolcall` | Tool call embedded in response; parse + schema validation | stage1\\_tool\\_parse\\_ok, stage1\\_tool\\_schema\\_ok |\n| `toolcall_only` | Bare schema-only tool call; strict tool name + args check | tool\\_name\\_ok, args\\_ok |\n\n**Evaluation hardware:** NVIDIA RTX 4090 (24 GB VRAM)\n**Evaluation date:** February 9, 2026\n**quant_eval seed:** 42\n\n---\n\n## About PBH Applied Systems\n\n[**PBH Applied Systems, LLC**](https://pbhappliedsystems.com) is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. 
The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints — particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.\n\n### Founder — Patrick Hill, M.S.\n\nPBH Applied Systems was founded by **Patrick Hill**, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a **Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning** (GPA: 4.0) and a B.S. in Business Finance.\n\n**Technical expertise spans:**\n\n- **Languages & Data:** Python, SQL, Linux, Pandas, NumPy, scikit-learn\n- **ML & Modeling:** Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering\n- **AI/ML Frameworks:** PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA\n- **Deployment & MLOps:** Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control\n- **Data Platforms:** Jupyter, Databricks, Power BI, Matplotlib\n- **Quantization:** GGUF conversion, Q4\\_K\\_M / Q5\\_K\\_M / Q8\\_0 strategies, adapter-per-model evaluation architecture\n\n### Published Author\n\nPatrick is the author of **[Applied Machine Learning: Concepts, Tools, and Case Studies](https://a.co/d/05qat7Xz)** — a 1,200+ page practitioner-oriented textbook covering statistical modeling, supervised and unsupervised learning, neural networks, NLP, and real-world decision-support case studies. Adopted as **required reading for CSC 373 – Machine Learning at the University of Advancing Technology**.\n\n### Core Service Areas\n\n**1. 
LLM Optimization & Deployment** — End-to-end GGUF conversion and quantization with custom llama.cpp pipelines and adapter-per-model architecture.\n\n**2. AI Evaluation Frameworks** — Proprietary behavioral evaluation via quant_eval: per-family pass rates, F16 vs. quantized delta analysis, failure cluster diagnostics, deployment recommendations.\n\n**3. Agentic AI Infrastructure** — LlamaIndex ReAct agents, Flask orchestration, serverless GPU inference, full pipeline from model selection to production serving.\n\n**4. Scalable AI Application Development** — Multimodal applications (quantized LLMs + Whisper + BLIP), Dockerized Flask APIs, advanced time-series forecasting with custom attention mechanisms, Bayesian hyperparameter optimization, and FinBERT sentiment fusion.\n\n**5. ML Pipeline Design & Analytics** — Feature engineering, forward-chaining cross-validation, KPI dashboards, analytical governance at scale.\n\n**6. Model & Agent Cataloging** — Structured catalog publishing with reproducible artifacts and clear performance tradeoff documentation.\n\n---\n\n## 📞 Work With PBH Applied Systems\n\nThe findings documented in this card — the 55.7× inference speedup, the hard planning case regression, the reasoning architecture's incompatibility with bare tool dispatch, and the MCQ failure mode divergence between precision levels — are all signals that only emerge from systematic, production-oriented evaluation. None of them appear in perplexity scores or benchmark leaderboards.\n\n**If you are deploying a reasoning model in production, the gap between F16 and Q4\\_K\\_M behavior is not uniform. 
It is task-specific, signal-level, and only visible when you measure it.**\n\n👉 **[Book a Scoping Call](https://pbhappliedsystems.com)** — Discuss your reasoning model selection, quantization strategy, or deployment architecture directly with Patrick.\n\n👉 **[Request an Evaluation Report](https://pbhappliedsystems.com)** — A full quant_eval behavioral audit: per-family pass rates, F16 vs. quantized delta analysis, failure cluster diagnostics, and a deployment recommendation. Engagements from $2,500.\n\n### Connect\n\n| | |\n|---|---|\n| 🌐 **Website** | [pbhappliedsystems.com](https://pbhappliedsystems.com) |\n| 📧 **Email** | [patrick@pbhappliedsystems.com](mailto:patrick@pbhappliedsystems.com) |\n| 💼 **LinkedIn** | [PBH Applied Systems, LLC](https://www.linkedin.com/company/pbh-applied-systems-llc) |\n| ▶️ **YouTube** | [@pbhappliedsystems](https://www.youtube.com/@pbhappliedsystems) |\n| 📸 **Instagram** | [@pbhappliedsystems](https://www.instagram.com/pbhappliedsystems) |\n| 👍 **Facebook** | [pbhappliedsystems](https://www.facebook.com/pbhappliedsystems) |\n\n---\n\n## License\n\nThis GGUF repository inherits the license of the base model:\n**Apache 2.0** — [`mistralai/Ministral-3-14B-Reasoning-2512`](https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512)\n\nThe quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.\n\n---\n\n*GGUF conversion, quantization, and behavioral evaluation performed by [PBH Applied Systems, LLC](https://pbhappliedsystems.com) · quant_eval v7.21 · Run ID: `20260209_233252`*\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "quantized",
    "q4_k_m",
    "mistral",
    "reasoning",
    "chain-of-thought",
    "llama-cpp",
    "agentic",
    "structured-output",
    "pbh-applied-systems",
    "quant-eval",
    "en",
    "base_model:mistralai/Ministral-3-14B-Reasoning-2512",
    "base_model:quantized:mistralai/Ministral-3-14B-Reasoning-2512",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 1012,
  "gated": false,
  "private": false,
  "last_modified": "2026-04-15T06:59:50.000Z",
  "created_at": "2026-04-11T07:56:28.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
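The speedup row and the combined 33.43 sec average quoted in the README above can be recomputed from the per-family timings it reports. The sketch below copies those figures from the card's timing table; the dictionary and function names are hypothetical and are not part of quant_eval.

```python
# Average seconds per case, copied from the model card's timing table.
F16_SECS = {
    "overall": 65.67, "fuzz": 84.22, "json": 105.82,
    "json_multistep": 99.64, "toolcall": 22.90, "stateful": 15.93,
}
Q4_SECS = {
    "overall": 1.18, "fuzz": 1.21, "json": 1.50,
    "json_multistep": 2.98, "toolcall": 0.64, "stateful": 0.46,
}


def speedups(baseline, quantized):
    """Ratio of baseline time to quantized time for each family."""
    return {family: baseline[family] / quantized[family] for family in baseline}


def combined_mean(first, second):
    """Arithmetic mean of two per-runner averages."""
    return (first + second) / 2


if __name__ == "__main__":
    for family, ratio in speedups(F16_SECS, Q4_SECS).items():
        print(f"{family:>15}: {ratio:.1f}x")
    print(f"combined mean: {combined_mean(65.67, 1.18):.3f} sec/case")
```

Rounded to one decimal place, the ratios reproduce the table's speedup row, and the mean of the two runner averages is 33.425, which the card reports as 33.43.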
Source payload excerpt (from Hugging Face API)
{
  "_id": "69d9feac71b9db9fb71d205c",
  "id": "pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-Q4-K-M",
  "modelId": "pbhappliedsystems/ministral-3-14b-reasoning-2512-gguf-Q4-K-M",
  "sha": "3ab7c0c089326d103052169da5dd76f633f3954f",
  "createdAt": "2026-04-11T07:56:28.000Z",
  "lastModified": "2026-04-15T06:59:50.000Z",
  "author": "pbhappliedsystems",
  "downloads": 1012,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 7
}
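The normalized metadata shown earlier uses snake_case keys (`created_at`, `last_modified`) while the source payload uses camelCase (`createdAt`, `lastModified`). The page does not show how it performs that mapping, so the following is only an illustrative sketch; `camel_to_snake`, `normalize_payload`, and the `keep` list are assumptions, not the site's implementation.

```python
import re

# Excerpt of the source payload shown above, limited to the fields used here.
SAMPLE_PAYLOAD = {
    "_id": "69d9feac71b9db9fb71d205c",
    "createdAt": "2026-04-11T07:56:28.000Z",
    "lastModified": "2026-04-15T06:59:50.000Z",
    "downloads": 1012,
    "likes": 0,
    "gated": False,
    "private": False,
    "pipeline_tag": "",
    "library_name": "",
}

# Hypothetical selection of payload keys that appear in the normalized record.
KEEP = ("likes", "downloads", "gated", "private",
        "lastModified", "createdAt", "pipeline_tag", "library_name")


def camel_to_snake(name):
    """Convert a camelCase key such as 'lastModified' to 'last_modified'."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def normalize_payload(payload, keep=KEEP):
    """Copy the selected keys, renaming each one to snake_case."""
    return {camel_to_snake(key): payload[key] for key in keep if key in payload}


if __name__ == "__main__":
    print(normalize_payload(SAMPLE_PAYLOAD))
```

Keys that are already snake_case, such as `pipeline_tag`, pass through unchanged, and anything outside the `keep` list (for example `_id`) is dropped.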