GraySoft

GraySoft indexes jorge-erdb/qwen3.5-35b-a3b-heretic-opus-4.6-distilled-4bit-gguf with repository links, its GGUF quant files (IQ4_NL and Q4_K_M), and Hugging Face metadata. Use this page to choose a local model for guIDE or another GGUF runtime. Related models in the same shard are listed below.

Model Intelligence Sheet

jorge-erdb/qwen3.5-35b-a3b-heretic-opus-4.6-distilled-4bit-gguf overview

4-bit GGUF quantizations of Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled, a Claude 4.6 Opus reasoning-distilled fine-tune of Jongsim/Qwen3.5-35B-A3B-heretic (SOMA+MPOA abliterated Qwen/Qwen3.5-35B-A3B).

Tags: gguf, qwen3_5_moe, heretic, abliterated, uncensored, reasoning, distillation, unsloth, qwen3.5, Mixture of Experts, 4bit, imatrix, image-text-to-text, en, dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered, dataset:TeichAI/claude-4.5-opus-high-reasoning-250x, dataset:Jackrong/Qwen3.5-reasoning-700x, base_model:Jongsim/Qwen3.5-35B-A3B-heretic, base_model:quantized:Jongsim/Qwen3.5-35B-A3B-heretic, license:apache-2.0, endpoints_compatible, region:us, conversational
Downloads: 1,449
Likes: 1
Pipeline: image-text-to-text
Library: (not specified)
Visibility: Public
Access: Open

Repository Files & Downloads

2 files detected
Direct downloads for all repository files
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Qwen3.5-35B-A3B-heretic-Opus-Distilled-4.6-IQ4_NL.gguf | GGUF | IQ4_NL | 18.42 GB | Download |
| Qwen3.5-35B-A3B-heretic-Opus-Distilled-4.6-Q4_K_M.gguf | GGUF | Q4_K_M | 19.71 GB | Download |
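
Choosing between the two files mostly comes down to the inference backend: the repository README captured in the metadata on this page recommends Q4_K_M on Apple Metal and describes IQ4_NL as slightly better quality per bit on CUDA or CPU. The sketch below encodes that rule of thumb. The names `QuantFile`, `pick_quant`, and `download_quant` are hypothetical helpers written for this page, not part of any library; `hf_hub_download` is the real `huggingface_hub` function, and it is imported lazily so the selection logic runs without the package or network access.

```python
from dataclasses import dataclass

# Repository id as reported in the Hugging Face API payload on this page.
REPO_ID = "jorge-erdb/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-4bit-GGUF"


@dataclass(frozen=True)
class QuantFile:
    filename: str
    quant: str
    size_gb: float  # size as listed in the file table


FILES = (
    QuantFile("Qwen3.5-35B-A3B-heretic-Opus-Distilled-4.6-IQ4_NL.gguf", "IQ4_NL", 18.42),
    QuantFile("Qwen3.5-35B-A3B-heretic-Opus-Distilled-4.6-Q4_K_M.gguf", "Q4_K_M", 19.71),
)


def pick_quant(backend: str) -> QuantFile:
    """Return Q4_K_M for Metal and IQ4_NL otherwise (README rule of thumb)."""
    wanted = "Q4_K_M" if backend.strip().lower() == "metal" else "IQ4_NL"
    return next(f for f in FILES if f.quant == wanted)


def download_quant(backend: str, local_dir: str = ".") -> str:
    """Download the chosen file and return its local path.

    Requires network access and `pip install huggingface_hub`.
    """
    from huggingface_hub import hf_hub_download  # lazy import, optional dependency

    chosen = pick_quant(backend)
    return hf_hub_download(repo_id=REPO_ID, filename=chosen.filename, local_dir=local_dir)


if __name__ == "__main__":
    print(pick_quant("metal").filename)
```

Calling `download_quant("cuda")` would fetch the IQ4_NL file into the current directory. The backend check is deliberately coarse; actual throughput depends on the runtime build and hardware.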

Model Details

Model Slug: jorge-erdb/qwen3.5-35b-a3b-heretic-opus-4.6-distilled-4bit-gguf
Author: jorge-erdb
Pipeline Task: image-text-to-text
Library: (not specified)
Created: 2026-04-08
Last Modified: 2026-04-08
Gated: No
Private: No
HF SHA: 9eb1fed20c475a643b8f63c0445cce45fa3b69c9
License: apache-2.0
Language: en
Base Model: Jongsim/Qwen3.5-35B-A3B-heretic

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "license": "apache-2.0",
    "language": [
      "en"
    ],
    "base_model": [
      "Jongsim/Qwen3.5-35B-A3B-heretic"
    ],
    "tags": [
      "qwen3_5_moe",
      "heretic",
      "abliterated",
      "uncensored",
      "reasoning",
      "distillation",
      "unsloth",
      "qwen3.5",
      "Mixture of Experts",
      "4bit",
      "imatrix"
    ],
    "datasets": [
      "nohurry/Opus-4.6-Reasoning-3000x-filtered",
      "TeichAI/claude-4.5-opus-high-reasoning-250x",
      "Jackrong/Qwen3.5-reasoning-700x"
    ],
    "pipeline_tag": "image-text-to-text",
    "frontmatter": {
      "license": "apache-2.0",
      "language": [
        "en"
      ],
      "base_model": [
        "Jongsim/Qwen3.5-35B-A3B-heretic"
      ],
      "tags": [
        "qwen3_5_moe",
        "heretic",
        "abliterated",
        "uncensored",
        "reasoning",
        "distillation",
        "unsloth",
        "qwen3.5",
        "Mixture of Experts",
        "4bit",
        "imatrix"
      ],
      "datasets": [
        "nohurry/Opus-4.6-Reasoning-3000x-filtered",
        "TeichAI/claude-4.5-opus-high-reasoning-250x",
        "Jackrong/Qwen3.5-reasoning-700x"
      ],
      "pipeline_tag": "image-text-to-text"
    },
    "hero_image_url": "",
    "summary": "4-bit GGUF quantizations of Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled, a Claude 4.6 Opus reasoning-distilled fine-tune of Jongsim/Qwen3.5-35B-A3B-heretic (SOMA+MPOA abliterated Qwen/Qwen3.5-35B-A3B).",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: apache-2.0\nlanguage:\n- en\nbase_model:\n- Jongsim/Qwen3.5-35B-A3B-heretic\ntags:\n- qwen3_5_moe\n- heretic\n- abliterated\n- uncensored\n- reasoning\n- distillation\n- unsloth\n- qwen3.5\n- Mixture of Experts\n- 4bit\n- imatrix\ndatasets:\n- nohurry/Opus-4.6-Reasoning-3000x-filtered\n- TeichAI/claude-4.5-opus-high-reasoning-250x\n- Jackrong/Qwen3.5-reasoning-700x\npipeline_tag: image-text-to-text\n---\n\n# Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-GGUF — 4-bit\n\n4-bit GGUF quantizations of [Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled](https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled), a Claude 4.6 Opus reasoning-distilled fine-tune of [Jongsim/Qwen3.5-35B-A3B-heretic](https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic) (SOMA+MPOA abliterated [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)).\n\n## Quantization Details\n\n| Detail | Value |\n|---|---|\n| Quant types | IQ4_NL, Q4_K_M |\n| Quantized by | [jorge-erdb](https://huggingface.co/jorge-erdb) |\n| Method | [llama.cpp](https://github.com/ggml-org/llama.cpp) with importance matrix |\n| Importance matrix | [bartowski's imatrix calibration dataset](https://github.com/ggerganov/llama.cpp/discussions/5263) |\n| Source model | [Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled](https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled) (BF16) |\n\n| Quant | Best for | Notes |\n|---|---|---|\n| IQ4_NL | CUDA / CPU | Non-linear, slightly better quality per bit |\n| Q4_K_M | CUDA / CPU / **Metal** | Linear, best Metal compatibility |\n\n## Download\n\n```bash\npip install -U \"huggingface_hub[cli]\"\n\n# IQ4_NL (non-linear, best for CUDA/CPU)\nhuggingface-cli download jorge-erdb/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-4bit-GGUF --include \"*IQ4_NL.gguf\" --local-dir ./\n\n# Q4_K_M (linear, Metal-friendly)\nhuggingface-cli download jorge-erdb/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-4bit-GGUF --include 
\"*Q4_K_M.gguf\" --local-dir ./\n```\n\n> [!Important]\n> ## Apple Metal Backend Warning\n>\n> **IQ4_NL is a non-linear quantization format.** It performs sub-optimally on Apple's Metal backend due to the lack of native support for non-linear dequantization kernels. If you are running on an Apple Silicon Mac with GPU offloading via Metal, you will likely experience:\n>\n> - Slower inference compared to linear quants of similar size (e.g., Q4_K_M)\n> - No speed benefit from the ARM weight repacking that IQ4_NL supports on CPU\n>\n> **If you're on Apple Metal, use the Q4_K_M quant from this repo instead.** For higher precision options, see [Jongsim's repo](https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled).\n\n## Credits\n\n- **Quantization**: [jorge-erdb](https://huggingface.co/jorge-erdb)\n- **Importance matrix**: [bartowski](https://huggingface.co/bartowski) — imatrix calibration dataset\n- **Fine-tune & abliteration**: [Jongsim](https://huggingface.co/Jongsim) — SOMA+MPOA abliteration via [Heretic v1.2.0](https://github.com/p-e-w/heretic) + Claude 4.6 Opus reasoning distillation via Unsloth + LoRA\n- **Base model**: [Qwen Team](https://huggingface.co/Qwen) — Qwen3.5-35B-A3B\n\n---\n\n# Qwen3.5-35B-A3B-heretic-Reasoning\n\nA reasoning-enhanced, abliterated version of Qwen3.5-35B-A3B (35B total / 3B active parameters, Mixture of Experts). This model was built in two stages: first, censorship removal via directional ablation using [Heretic](https://github.com/p-e-w/heretic), then supervised fine-tuning on high-quality Chain-of-Thought reasoning traces distilled from Claude 4.6 Opus.\n\nThe model produces structured reasoning within `<think>...</think>` tags before delivering final responses. 
All weights are in bf16 precision.\n\n\n## Model Introduction\n\nThis model is a fine-tuned derivative of [Jongsim/Qwen3.5-35B-A3B-heretic](https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic), which itself is an abliterated (decensored) version of [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B).\n\nThe primary objective is to inject high-density structured reasoning capability from Claude 4.6 Opus while preserving the uncensored nature of the abliterated base model. Through SFT on curated reasoning distillation data, the model learns to decompose complex problems into sequential steps within a dedicated thinking block before generating the final answer.\n\n### Architecture Overview\n\n| Property | Value |\n|:---|:---|\n| Architecture | Qwen3.5 MoE (Gated DeltaNet + Gated Attention + MoE) |\n| Total Parameters | 35B |\n| Active Parameters | 3B per token |\n| Hidden Dimension | 2048 |\n| Layers | 40 (10 repeating blocks of 3x DeltaNet-MoE + 1x Attention-MoE) |\n| Experts | 256 total, 8 routed + 1 shared active |\n| Expert Intermediate Dim | 512 |\n| Context Length | 262,144 tokens (native) |\n| Precision | bf16 |\n| Vocabulary | 248,320 tokens |\n\n\n## Training Pipeline\n\n```\nQwen/Qwen3.5-35B-A3B (original)\n |\n | Heretic v1.2.0 (SOMA + MPOA abliteration)\n v\nJongsim/Qwen3.5-35B-A3B-heretic (abliterated base)\n |\n | Supervised Fine-Tuning (LoRA + Unsloth)\n v\nJongsim/Qwen3.5-35B-A3B-heretic-Reasoning (this model)\n```\n\n\n## Stage 1: Abliteration (Censorship Removal)\n\nThe base model was processed with [Heretic v1.2.0](https://github.com/p-e-w/heretic), an automated censorship removal tool that applies directional ablation optimized via Bayesian hyperparameter search (Optuna TPE).\n\nTwo techniques were combined:\n- **SOMA** (Self-Organizing Map Abliteration): Uses a 4x4 SOM to discover multiple refusal directions in activation space, then ablates the top-k directions simultaneously.\n- **MPOA** (Magnitude-Preserving Orthogonal 
Ablation): Projects out the refusal direction while preserving the original weight magnitude via row normalization with low-rank correction (rank 4).\n\n### Abliteration Configuration\n\n| Parameter | Value |\n|:---|:---|\n| Method | SOMA + MPOA |\n| Orthogonalize Direction | true |\n| Row Normalization | full |\n| Full Normalization LoRA Rank | 4 |\n| Winsorization Quantile | 0.95 |\n| SOM Grid | 4 x 4 (16 neurons) |\n| SOM Iterations | 10,000 |\n| SOM Learning Rate | 0.01 |\n| SOM Sigma | 0.5 |\n| SOM k (directions) | 4 |\n| Optimization Trials | 200 (60 startup) |\n| Selected Trial | Trial 84 / 200 |\n| Good Prompts | mlabonne/harmless_alpaca (train[:400]) |\n| Bad Prompts | mlabonne/harmful_behaviors (train[:400]) |\n| Quantization | none (bf16) |\n\n### Abliteration Results\n\n| Metric | Original | Abliterated |\n|:---|:---|:---|\n| KL Divergence | 0 (reference) | 0.0638 |\n| Refusals (out of 100) | 91 | 6 |\n\n93.4% refusal reduction with minimal distribution shift (KL = 0.0638).\n\n\n## Stage 2: Supervised Fine-Tuning (Reasoning Distillation)\n\n### Objective\n\nInject structured Chain-of-Thought reasoning patterns from Claude 4.6 Opus into the abliterated model. 
The training enforces a strict output format where the model generates internal reasoning within `<think>` blocks before producing the final response.\n\n### Training Strategy\n\n- **Framework**: Unsloth 2026.3.3 + TRL SFTTrainer\n- **Method**: LoRA (Low-Rank Adaptation) applied to both attention and MoE expert layers\n- **Loss Computation**: `train_on_responses_only` — loss is calculated exclusively on assistant responses (both thinking trace and final answer), not on user prompts\n  - Instruction boundary: `<|im_start|>user\\n`\n  - Response boundary: `<|im_start|>assistant\\n<think>`\n- **Chat Template**: Qwen ChatML format (`<|im_start|>` / `<|im_end|>`)\n\n### LoRA Configuration\n\n| Parameter | Value |\n|:---|:---|\n| PEFT Method | LoRA |\n| Rank (r) | 16 |\n| Alpha | 32 (= 2 x rank) |\n| Dropout | 0.0 |\n| Bias | none |\n| Target Modules (Attention) | q_proj, k_proj, v_proj, o_proj |\n| Target Modules (FFN) | gate_proj, up_proj, down_proj |\n| Target Modules (MoE) | gate_up_proj |\n| Gradient Checkpointing | unsloth mode |\n\n### Training Hyperparameters\n\n| Parameter | Value |\n|:---|:---|\n| Max Sequence Length | 2,048 |\n| Per-Device Batch Size | 1 |\n| Gradient Accumulation Steps | 8 |\n| Effective Batch Size | 8 |\n| Number of Epochs | 5 |\n| Total Training Steps | 1,995 |\n| Learning Rate | 2e-4 |\n| LR Scheduler | Linear decay |\n| Warmup Steps | 5 |\n| Optimizer | AdamW 8-bit |\n| Weight Decay | 0.001 |\n| Precision | bf16 |\n| Seed | 3407 |\n| Total FLOPs | 3.56 x 10^18 |\n\n### Datasets\n\nThree publicly available reasoning distillation datasets were combined, shuffled (seed=42), and used for training:\n\n| Dataset | Samples | Description |\n|:---|:---|:---|\n| [nohurry/Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | ~2,308 | Filtered reasoning trajectories from Claude 4.6 Opus. Each sample contains a problem, a detailed thinking trace, and a final solution. 
|\n| [TeichAI/claude-4.5-opus-high-reasoning-250x](https://huggingface.co/datasets/TeichAI/claude-4.5-opus-high-reasoning-250x) | ~250 | High-intensity structured reasoning instances from Claude 4.5 Opus multi-turn conversations. |\n| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | ~633 | Curated reasoning samples in both conversation and instruction format, designed for step-by-step problem solving diversity. |\n| **Total** | **~3,191** | Combined after filtering empty/invalid rows. |\n\n### Training Loss\n\n| Epoch | Avg Loss | Steps |\n|:---|:---|:---|\n| 1 | 0.4299 | 79 - 399 |\n| 2 | 0.3729 | 400 - 798 |\n| 3 | 0.3359 | 799 - 1197 |\n| 4 | 0.3059 | 1198 - 1596 |\n| 5 | 0.2958 | 1597 - 1995 |\n\nTraining loss decreased monotonically from 0.4299 to 0.2958 across 5 epochs, indicating stable convergence without overfitting signs at the loss level.\n\n### Checkpoint Selection\n\nThe best checkpoint was selected based on GSM8K accuracy (50 samples). 
All checkpoints were evaluated in isolated subprocesses to prevent GPU memory leaks from Unsloth's model patching.\n\n| Checkpoint | Epoch | GSM8K Accuracy |\n|:---|:---|:---|\n| checkpoint-1200 | 3.0 | 8.0% (4/50) |\n| checkpoint-1400 | 3.5 | 10.0% (5/50) |\n| checkpoint-1596 | 4.0 | 10.0% (5/50) |\n| **checkpoint-1995** | **5.0** | **12.0% (6/50)** |\n\ncheckpoint-1995 (epoch 5) was selected and merged into bf16 for the final release.\n\nNote: GSM8K measures narrow arithmetic reasoning and does not fully reflect the model's broader reasoning capabilities (code generation, logical analysis, multi-step planning) which are the primary targets of the distillation training.\n\n\n## Hardware and Environment\n\n| Component | Value |\n|:---|:---|\n| Hardware | NVIDIA DGX Spark |\n| GPU | NVIDIA GB10 (128GB unified memory) |\n| Compute Capability | sm121 |\n| Architecture | aarch64 |\n| CUDA | 13.0 |\n| PyTorch | 2.9.1a0 |\n| Transformers | 5.2.0 |\n| Unsloth | 2026.3.3 |\n| TRL | 0.24.0 |\n| PEFT | 0.18.1 |\n| Datasets | 4.3.0 |\n| Tokenizers | 0.22.2 |\n\n### DGX Spark-Specific Notes\n\n- Flash Attention and Memory-Efficient Attention (cutlass) are disabled due to sm121 incompatibility (supported: sm80-sm100). Only Math SDP is used.\n- `flash_attn` package is fully removed to prevent FATAL errors on sm121.\n- `torch.compile` / TorchInductor is disabled due to Triton ptxas compatibility issues.\n- The entire model (35B parameters) fits in a single GPU's 128GB unified memory without quantization.\n\n\n## Usage\n\nThis model uses the standard Qwen3.5 chat template. 
It operates in thinking mode by default.\n\n### Inference Example\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_name = \"Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning\"\nmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=\"auto\", device_map=\"auto\")\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\nmessages = [\n    {\"role\": \"user\", \"content\": \"Explain the difference between TCP and UDP, and when you would choose one over the other.\"}\n]\n\ntext = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\ninputs = tokenizer(text, return_tensors=\"pt\").to(model.device)\noutputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.7, top_p=0.8, top_k=20)\nresponse = tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[-1]:], skip_special_tokens=True)\nprint(response)\n```\n\n### Recommended Sampling Parameters\n\n| Mode | temperature | top_p | top_k | presence_penalty |\n|:---|:---|:---|:---|:---|\n| Thinking (general) | 1.0 | 0.95 | 20 | 1.5 |\n| Thinking (coding) | 0.6 | 0.95 | 20 | 0.0 |\n| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.5 |\n\n\n## Example of Learned Reasoning Format\n\nThe model produces output in the following structure:\n\n```\n<think>\nLet me analyze this problem step by step.\n\n1. First, I need to identify the core question being asked.\n2. Then, I'll consider the relevant constraints and conditions.\n3. Next, I'll work through the logic systematically.\n4. 
Finally, I'll verify my reasoning for consistency.\n\n[detailed reasoning follows...]\n</think>\n\n[final answer here]\n```\n\nThis structured thinking pattern, distilled from Claude 4.6 Opus interactions, reduces redundant cognitive loops while preserving deep analytical capacity.\n\n\n## Limitations\n\n- **Hallucination Risk**: As an autoregressive language model, the model may generate plausible-sounding but factually incorrect statements, particularly regarding real-world events or obscure technical details.\n- **GSM8K Performance**: The model scores 12% on GSM8K (50 samples). This is expected because the training data emphasizes broad reasoning patterns (code, logic, planning) rather than arithmetic drill. For pure math benchmarks, consider models specifically trained on mathematical datasets.\n- **Abliteration Residual**: 6 out of 100 harmful prompts still trigger refusal. The abliteration is not exhaustive.\n- **Context Length Trade-off**: While the architecture supports 262K tokens natively, the SFT was performed with max_seq_length=2048. Very long reasoning chains beyond the training distribution may degrade in quality.\n- **MoE Inference Overhead**: Despite having only 3B active parameters per token, the full 35B model must be loaded into memory. 
Minimum ~65GB VRAM/RAM required for bf16.\n\n\n## Acknowledgements\n\n- [Qwen Team](https://qwen.ai/) for the Qwen3.5-35B-A3B architecture and pretrained weights\n- [Heretic](https://github.com/p-e-w/heretic) (p-e-w) for the automated directional ablation framework\n- [Unsloth AI](https://unsloth.ai/) for efficient LoRA fine-tuning of large MoE models\n- [nohurry](https://huggingface.co/nohurry), [TeichAI](https://huggingface.co/TeichAI), and [Jackrong](https://huggingface.co/Jackrong) for the reasoning distillation datasets\n\n\n## Citation\n\n```bibtex\n@misc{jongsim_qwen35_heretic_reasoning,\n  title        = {Qwen3.5-35B-A3B-heretic-Reasoning},\n  author       = {Jongsim},\n  year         = {2026},\n  publisher    = {Hugging Face},\n  howpublished = {\\url{https://huggingface.co/Jongsim/Qwen3.5-35B-A3B-heretic-Reasoning}}\n}\n```",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "qwen3_5_moe",
    "heretic",
    "abliterated",
    "uncensored",
    "reasoning",
    "distillation",
    "unsloth",
    "qwen3.5",
    "Mixture of Experts",
    "4bit",
    "imatrix",
    "image-text-to-text",
    "en",
    "dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered",
    "dataset:TeichAI/claude-4.5-opus-high-reasoning-250x",
    "dataset:Jackrong/Qwen3.5-reasoning-700x",
    "base_model:Jongsim/Qwen3.5-35B-A3B-heretic",
    "base_model:quantized:Jongsim/Qwen3.5-35B-A3B-heretic",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 1,
  "downloads": 1449,
  "gated": false,
  "private": false,
  "last_modified": "2026-04-08T23:15:50.000Z",
  "created_at": "2026-04-08T17:50:03.000Z",
  "pipeline_tag": "image-text-to-text",
  "library_name": ""
}
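
The top-level `tags` array in the normalized record mixes plain labels (`gguf`, `imatrix`) with `key:value` entries such as `dataset:...`, `base_model:...`, and `license:...`. A minimal sketch of splitting them apart follows. It assumes the prefix convention holds exactly as seen in this record, and `group_tags` is a hypothetical helper, not a GraySoft or Hugging Face API. The sample list is a subset of the tags shown above.

```python
from collections import defaultdict


def group_tags(tags):
    """Group 'key:value' tags by key; tags without a colon go under 'tag'."""
    grouped = defaultdict(list)
    for tag in tags:
        # Split on the first colon only, so 'base_model:quantized:X'
        # is stored under 'base_model' with the value 'quantized:X'.
        key, sep, value = tag.partition(":")
        if sep:
            grouped[key].append(value)
        else:
            grouped["tag"].append(tag)
    return dict(grouped)


# Subset of the `tags` array from the normalized metadata above.
TAGS = [
    "gguf",
    "imatrix",
    "dataset:nohurry/Opus-4.6-Reasoning-3000x-filtered",
    "dataset:TeichAI/claude-4.5-opus-high-reasoning-250x",
    "base_model:Jongsim/Qwen3.5-35B-A3B-heretic",
    "base_model:quantized:Jongsim/Qwen3.5-35B-A3B-heretic",
    "license:apache-2.0",
    "region:us",
]

grouped = group_tags(TAGS)

if __name__ == "__main__":
    for key, values in grouped.items():
        print(key, values)
```

Splitting on the first colon keeps the `quantized:` qualifier attached to its base-model value, which a consumer can then inspect separately if needed.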
Source payload excerpt (from Hugging Face API)
{
  "_id": "69d6954b440f679e7f12e47a",
  "id": "jorge-erdb/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-4bit-GGUF",
  "modelId": "jorge-erdb/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-4bit-GGUF",
  "sha": "9eb1fed20c475a643b8f63c0445cce45fa3b69c9",
  "createdAt": "2026-04-08T17:50:03.000Z",
  "lastModified": "2026-04-08T23:15:50.000Z",
  "author": "jorge-erdb",
  "downloads": 1449,
  "likes": 1,
  "gated": false,
  "private": false,
  "pipeline_tag": "image-text-to-text",
  "library_name": "",
  "siblings_count": 4
}
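
The `id` and `sha` fields in this payload are enough to build a download URL pinned to the indexed commit, using the Hugging Face Hub's `resolve` URL pattern. The sketch below is an illustration under assumptions: `resolve_url` is a hypothetical helper, the filename is taken from the file table on this page, and nothing here checks that the file is actually present at that revision.

```python
from urllib.parse import quote


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a Hub 'resolve' URL for one file at a given revision."""
    return (
        "https://huggingface.co/"
        f"{quote(repo_id)}/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


# Values copied from the API payload excerpt and the file table on this page.
PAYLOAD = {
    "id": "jorge-erdb/Qwen3.5-35B-A3B-heretic-Opus-4.6-Distilled-4bit-GGUF",
    "sha": "9eb1fed20c475a643b8f63c0445cce45fa3b69c9",
}
FILENAME = "Qwen3.5-35B-A3B-heretic-Opus-Distilled-4.6-IQ4_NL.gguf"

url = resolve_url(PAYLOAD["id"], FILENAME, revision=PAYLOAD["sha"])

if __name__ == "__main__":
    print(url)
```

Pinning to the commit hash rather than `main` keeps a download reproducible even if the repository is updated after this page was indexed.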