What is trjxter/Gwimi-4-12B-IT-GGUF?

--- license: gemma base_model: - trjxter/Gwimi-4-12B-IT-BF16 library_name: gguf pipeline_tag: text-generation tags: - gguf - gemma - gemma-4 - gemma4 - reasoning - conversational - sft - reinforcement-learning - gspo - math - science - coding datasets: - trjxter/Kimi-K2.6-Reasoning-3300x-WandB - trjxter/Kimi-K2.6-Technical-Reasoning-AddOn-3300x - trjxter/Gemma-4-31B-Reasoning-1000x - Jackrong/Claude-opus-4.7-TraceInversion-5000x - Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned - math-dataset/DAPO-17k-Eng - unsloth/OpenMathReasoning-mini - allenai/sciq --- # Gwimi-4-12B-IT-GGUF Quantized GGUF releases of **Gwimi-4-12B-IT**, a Gemma 4 12B instruction model post-trained through: 1. **Supervised Fine-Tuning (SFT)** on a 20,000-example reasoning mixture. 2. **Group Sequence Policy Optimization (GSPO)** on 12,000 frozen reinforcement-learning prompts. The source model for every file in this reposito…

What license applies to trjxter/Gwimi-4-12B-IT-GGUF?

License: gemma. Verify terms on Hugging Face before commercial use.

How do I run trjxter/Gwimi-4-12B-IT-GGUF locally?

Download a GGUF file from this page and load it in guIDE or llama.cpp. Pipeline task: text-generation.

Model Intelligence Sheet

trjxter/Gwimi-4-12B-IT-GGUF overview

Gwimi 4 12B IT GGUF Quantized GGUF releases of Gwimi 4 12B IT , a Gemma 4 12B instruction model post trained through: 1. Supervised Fine Tuning SFT on a 20,000…

ggufgemmagemma-4gemma4reasoningconversationalsftreinforcement-learninggspomathsciencecodingtext-generationdataset:trjxter/Kimi-K2.6-Reasoning-3300x-WandBdataset:trjxter/Kimi-K2.6-Technical-Reasoning-AddOn-3300xdataset:trjxter/Gemma-4-31B-Reasoning-1000xdataset:Jackrong/Claude-opus-4.7-TraceInversion-5000xdataset:Jackrong/Kimi-K2.5-Reasoning-1M-Cleaneddataset:math-dataset/DAPO-17k-Engdataset:unsloth/OpenMathReasoning-minidataset:allenai/sciqbase_model:trjxter/Gwimi-4-12B-IT-BF16base_model:quantized:trjxter/Gwimi-4-12B-IT-BF16license:gemma

Runs locally from ~4.26 GB disk (8 GB VRAM class GPUs with llama.cpp / guIDE).

Downloads

330

Likes

Pipeline

text-generation

Author

trjxter

Repository Files & Downloads

12 GGUF files detected

Direct downloads for local inference

File	Type	Quantization	Size	Link
Gwimi-4-12B-IT-Q2_K_L.gguf	GGUF	Q2_K_L	4.26 GB	Download
Gwimi-4-12B-IT-Q3_K_L.gguf	GGUF	Q3_K_L	6.12 GB	Download
Gwimi-4-12B-IT-Q3_K_M.gguf	GGUF	Q3_K_M	5.67 GB	Download
Gwimi-4-12B-IT-Q3_K_S.gguf	GGUF	Q3_K_S	5.15 GB	Download
Gwimi-4-12B-IT-Q4_K_L.gguf	GGUF	Q4_K_L	7.10 GB	Download
Gwimi-4-12B-IT-Q4_K_M.gguf	GGUF	Q4_K_M	6.87 GB	Download
Gwimi-4-12B-IT-Q4_K_S.gguf	GGUF	Q4_K_S	6.54 GB	Download
Gwimi-4-12B-IT-Q5_K_L.gguf	GGUF	Q5_K_L	8.19 GB	Download
Gwimi-4-12B-IT-Q5_K_M.gguf	GGUF	Q5_K_M	7.96 GB	Download
Gwimi-4-12B-IT-Q5_K_S.gguf	GGUF	Q5_K_S	7.77 GB	Download
Gwimi-4-12B-IT-Q6_K.gguf	GGUF	Q6_K	9.11 GB	Download
Gwimi-4-12B-IT-Q8_0.gguf	GGUF	Q8_0	11.80 GB	Download

Model Details

Model ID	trjxter/Gwimi-4-12B-IT-GGUF
Author	trjxter
Pipeline	text-generation
License	gemma
Base model	trjxter/Gwimi-4-12B-IT-BF16
Last modified	2026-06-21T04:50:11.000Z

Model README

---

license: gemma

base_model:

trjxter/Gwimi-4-12B-IT-BF16

library_name: gguf

pipeline_tag: text-generation

tags:

gguf
gemma
gemma-4
gemma4
reasoning
conversational
sft
reinforcement-learning
gspo
math
science
coding

datasets:

trjxter/Kimi-K2.6-Reasoning-3300x-WandB
trjxter/Kimi-K2.6-Technical-Reasoning-AddOn-3300x
trjxter/Gemma-4-31B-Reasoning-1000x
Jackrong/Claude-opus-4.7-TraceInversion-5000x
Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
math-dataset/DAPO-17k-Eng
unsloth/OpenMathReasoning-mini
allenai/sciq

---

Gwimi-4-12B-IT-GGUF

Quantized GGUF releases of Gwimi-4-12B-IT, a Gemma 4 12B instruction model post-trained through:

Supervised Fine-Tuning (SFT) on a 20,000-example reasoning mixture.
Group Sequence Policy Optimization (GSPO) on 12,000 frozen reinforcement-learning prompts.

The source model for every file in this repository is:

trjxter/Gwimi-4-12B-IT-BF16

The BF16 release contains the cumulative SFT + GSPO updates merged into the exact original Gemma 4 12B BF16 base. No LoRA adapter is required when using these GGUF files.

---

Available quantizations

| Quantization | File size | Practical guidance |

|---|---:|---|

| Q2_K_L | 4.57 GB | Smallest option. Useful when memory is extremely limited, with the largest expected quality loss. |

| Q3_K_S | 5.53 GB | Compact 3-bit option prioritizing size. |

| Q3_K_M | 6.09 GB | Better-balanced 3-bit option. |

| Q3_K_L | 6.57 GB | Highest-quality 3-bit option in this repository. |

| Q4_K_S | 7.02 GB | Smaller 4-bit option. |

| Q4_K_M | 7.38 GB | Recommended default for most local users. |

| Q4_K_L | 7.63 GB | Higher-precision 4-bit variant for selected important tensors. |

| Q5_K_S | 8.34 GB | Smaller 5-bit option with strong quality. |

| Q5_K_M | 8.55 GB | Recommended when additional memory is available. |

| Q5_K_L | 8.79 GB | Higher-precision 5-bit variant for selected important tensors. |

| Q6_K | 9.79 GB | High-quality quant with relatively little compression loss. |

| Q8_0 | 12.7 GB | Largest quantized release and the closest option here to the BF16 source. |

Quick recommendation

Best general default: Q4_K_M
Better quality with moderate extra memory: Q5_K_M
High-quality local inference: Q6_K
Closest to the BF16 model: Q8_0
Memory-constrained systems: Q3_K_M or Q3_K_L
Absolute smallest file: Q2_K_L

File size is not the same as total runtime memory usage. Your runtime also needs memory for the inference backend, model metadata, temporary buffers, and the KV cache. Longer context lengths increase KV-cache memory use.

---

Model training overview

Stage 1: Supervised Fine-Tuning

The SFT corpus contained exactly 20,000 examples.

SFT dataset composition

| Dataset | Rows |

|---|---:|

| trjxter/Kimi-K2.6-Technical-Reasoning-AddOn-3300x | 3,301 |

| trjxter/Kimi-K2.6-Reasoning-3300x-WandB | 3,303 |

| trjxter/Gemma-4-31B-Reasoning-1000x | 995 |

| Jackrong/Claude-opus-4.7-TraceInversion-5000x | 4,761 |

| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned top-up | 7,640 |

| Total | 20,000 |

The Kimi K2.5 top-up consisted of:

| Category | Rows |

|---|---:|

| General-Distillation | 4,000 |

| General-Math | 1,500 |

| PHD-Science | 1,500 |

| MultilingualSTEM | 640 |

The final split contained:

| Split | Rows |

|---|---:|

| Training | 18,000 |

| Held-out evaluation | 2,000 |

SFT configuration

| Parameter | Value |

|---|---:|

| Maximum sequence length | 32,768 |

| Training method | rank-stabilized LoRA |

| LoRA rank | 128 |

| LoRA alpha | 256 |

| Base loading precision | 8-bit |

| Effective optimizer batch | 16 |

| Epochs | 1 |

| Learning rate | 2e-5 |

| Scheduler | cosine |

| Maximum gradient norm | 1.0 |

SFT evaluation

The 2,000-example evaluation set contained:

100 fixed anchor examples
1,900 examples in a rotating evaluation pool

Scheduled evaluations used 100 fixed anchor examples plus 100 rotating examples.

The final full held-out evaluation produced:

| Metric | Result |

|---|---:|

| Evaluation examples | 2,000 |

| Final evaluation loss | 0.6940310597419739 |

| Final perplexity | 2.001768540 |

These are teacher-forced SFT evaluation metrics and should not be interpreted as free-generation benchmark accuracy.

---

Stage 2: GSPO reinforcement learning

GSPO stands for Group Sequence Policy Optimization.

For each prompt, the model generated multiple candidate completions. Programmatic reward functions scored those completions, and training used within-group reward differences to update the policy.

The defining configuration was:

importance_sampling_level = "sequence"

This applies sequence-level rather than token-level importance ratios.

GSPO datasets

Training split

| Source | Rows |

|---|---:|

| DAPO English mathematics | 8,400 |

| OpenMathReasoning Mini | 1,800 |

| SciQ | 1,800 |

| Total | 12,000 |

Reserved evaluation and protection splits

| Split | Rows |

|---|---:|

| Fixed anchor evaluation | 300 |

| Held-out evaluation | 1,897 |

| Protected SciQ test | 991 |

The frozen dataset suite had zero cross-split normalized-prompt overlap.

Reward functions

The run used three frozen reward components:

correctness_reward_func
format_reward_func
anomaly_reward_func

They measured:

answer correctness;
required response formatting;
malformed, repetitive, degenerate, or suspicious generations.

GSPO configuration

| Parameter | Value |

|---|---:|

| Final global step | 2250 |

| Learning rate | 2e-6 |

| Importance sampling | sequence-level |

| Reward scaling | group |

| KL coefficient | 0.0 |

| Generations per prompt | 8 |

| Unique prompts per rollout | 3 |

| Total completions per rollout | 24 |

| Effective optimizer batch | 8 |

| Maximum completion length | 1,280 |

| Runtime sequence length | 4,096 |

| Temperature | 1.15 |

| Top-p | 0.95 |

| Repetition penalty | 1.05 |

| Maximum gradient norm | 1.0 |

| Scheduler | cosine |

Near-terminal GSPO telemetry

The following values are a near-terminal W&B snapshot around global step 2250. They are training telemetry, not independent benchmark results.

| Metric | Near-terminal value |

|---|---:|

| Combined reward | 0.2667 |

| Correctness reward | 0.1667 |

| Format reward | 0.1000 |

| Reward standard deviation | 0.3086 |

| Fraction of zero-variance reward groups | 0.6667 |

| Entropy | 0.0893 |

| Sequence clip ratio, region mean | 0.125 |

| Mean completion length | 762.875 tokens |

| Completion clipped ratio | 0.25 |

| Gradient norm before clipping | 8.3416 |

| Approximate processed tokens | 15.55 million |

The run stopped at global step 2250 after the format reward had saturated, completion lengths remained controlled, entropy had stabilized, and sequence clipping remained active without saturating.

The online RL worker did not run periodic held-out evaluation during the expensive generation loop. The preserved anchor, held-out, and protected test splits are intended for later independent evaluation.

---

Quantization provenance

All GGUF files were generated from the same verified merged BF16 model:

trjxter/Gwimi-4-12B-IT-BF16

The merged BF16 weights were verified to differ from the untouched base model:

Base SHA-256:
5a84cb313260ac447237b890387116dfa8682e49a6b44bc585ae8353abbff18d

Merged SHA-256:
1e024792bf994c200fc7757621d202eb2bb2ba11593afcf6a6a98ab6bb9c4845

The original and merged BF16 files had the same byte size because LoRA merging changes the numerical values inside the existing model tensors rather than adding a second set of full model weights.

A temporary BF16 GGUF was used as the private quantization source. It was not uploaded to this repository.

All 12 public GGUF files were generated and then verified as present in this repository.

---

Running with llama.cpp

Use a recent llama.cpp build with Gemma 4 GGUF support.

Example using Q4_K_M:

llama-cli \
  -m Gwimi-4-12B-IT-Q4_K_M.gguf \
  -cnv \
  -c 4096 \
  -n 1024 \
  --temp 0.7 \
  --top-p 0.95

For a direct prompt:

llama-cli \
  -m Gwimi-4-12B-IT-Q4_K_M.gguf \
  -p "Solve carefully: What is 17% of 240?" \
  -c 4096 \
  -n 512 \
  --temp 0.7 \
  --top-p 0.95

Adjust GPU offloading according to your hardware. For example:

-ngl 99

attempts to offload as many model layers as possible to the GPU.

---

Using the model in local applications

These files are intended for GGUF-compatible runtimes such as:

llama.cpp
LM Studio
KoboldCpp
compatible local model launchers and servers

Select the quant that fits comfortably within your available RAM or VRAM after accounting for KV cache and runtime overhead.

For most users, begin with:

Gwimi-4-12B-IT-Q4_K_M.gguf

Then compare Q5_K_M, Q6_K, or Q8_0 when more memory is available.

---

Intended uses

This model is intended for experimentation with:

mathematical and scientific reasoning;
coding and debugging assistance;
technical question answering;
structured instruction following;
local inference;
quantization-quality comparisons;
SFT and reinforcement-learning research.

---

Evaluations

Independent free-generation benchmarks were run locally against Gwimi-4-12B-IT-Q6_K.gguf using llama.cpp. Results below are pass@1 with greedy decoding unless noted otherwise.

Run date: 2026-06-20

Common settings

| Setting | Coding benchmarks | MMLU-Pro |

|---|---|---|

| Backend | llama.cpp server (http://localhost:8088/v1) | llama.cpp server (http://localhost:8087) |

| Temperature | 0.0 | 0.0 |

| Thinking mode | on | on |

| Seed | 42 | 42 |

| n_predict | 8192 | 2048 |

| Context | default | 32768 |

| Scoring | executable unit tests | 0-shot chain-of-thought, letter extraction |

Summary

|---|---|---:|---:|---:|---:|---:|---|

| HumanEval | Full | 164 | 164 | 118 | 46 | 72.0% | Python |

| MBPP | Partial | 50 | 500 (test) | 30 | 20 | 60.0% | Python, tests-as-spec |

| LiveCodeBench v6 | Partial | 40 | 175 (test6.jsonl) | 18 | 22 | 45.0% | Python, all tests must pass |

| MultiPL-E | Partial | 80 | 80 (40 JS + 40 PHP) | 48 | 32 | 60.0% | HumanEval ports |

| MMLU-Pro | Partial | 50 | 2,000 (stratified pool) | 25 | 25 | 50.0% | 3 extraction failures |

HumanEval is the only benchmark completed end-to-end. The other four runs are partial subsets; totals in the right-hand column are the intended full benchmark size for that eval configuration.

LiveCodeBench v6 — by difficulty

|---|---:|---:|---:|

| Easy | 13 | 12 | 92.3% |

| Medium | 13 | 4 | 30.8% |

| Hard | 14 | 2 | 14.3% |

| Overall | 40 | 18 | 45.0% |

MultiPL-E — by language

|---|---:|---:|---:|

| JavaScript | 40 | 27 | 67.5% |

| PHP | 40 | 21 | 52.5% |

| Overall | 80 | 48 | 60.0% |

MMLU-Pro — by category

Fifty questions were drawn from a stratified 2,000-question pool (12,032 total in the test split). Three answers failed letter extraction and are counted as incorrect.

|---|---:|---:|---:|

| Math | 6 | 5 | 83.3% |

| Economics | 4 | 3 | 75.0% |

| Biology | 3 | 2 | 66.7% |

| History | 3 | 2 | 66.7% |

| Business | 8 | 4 | 50.0% |

| Law | 5 | 2 | 40.0% |

| Psychology | 6 | 2 | 33.3% |

| Physics | 7 | 1 | 14.3% |

| Chemistry | 2 | 0 | 0.0% |

| Engineering | 1 | 0 | 0.0% |

| Other | 1 | 0 | 0.0% |

| Computer science | 1 | 1 | 100% |

| Health | 2 | 2 | 100% |

| Philosophy | 1 | 1 | 100% |

| Overall | 50 | 25 | 50.0% |

These results reflect one quantization (Q6_K) under one local inference setup. They should not be compared directly to leaderboard numbers from other backends, quantizations, or sampling settings without matching those conditions.

PersonalBenchmarks (custom suite)

A 56-scenario local suite covering reasoning, tool calling, agentic planning, structured output, long-context recall, programming, and response-efficiency constraints. Scoring is deterministic (JSON shape checks, keyword order, executable patterns) — not LLM-as-judge.

Suite version: v2 (2026-06-21) — Performance, Tool Calling, and Agentic use canonical prompts with explicit raw-JSON format instructions (same scenarios across all model comparisons). Prior v1 prompts are archived under PersonalBenchmarks/scenarios/archive/.

Run date: 2026-06-21 · Settings: Q6_K, llama.cpp @ http://127.0.0.1:8088/v1, temperature 0.0, thinking on, seed 42, n_predict 8192

| Metric | Result |

|---|---:|

| Strict pass | 43/56 (76.8%) |

| Average score | 77.7 |

| Category | Pass/Total | Avg score |

|---|---:|---:|

| Tool Calling | 7/8 | 88% |

| Context | 7/8 | 88% |

| Programming | 7/8 | 88% |

| Reasoning | 6/8 | 81% |

| Structured Output | 6/8 | 75% |

| Performance | 5/8 | 62% |

| Agentic | 5/8 | 62% |

Performance, Tool Calling, and Agentic scores are from the v2 canonical scenario run. Reasoning, Structured Output, Context, and Programming scores are from the prior 2026-06-20 run (scenarios unchanged). Re-run python run_personal.py for a single fresh 56-scenario pass on any model.

PersonalBenchmarks is maintained in this repository (PersonalBenchmarks/, run_personal.py). It is independent of the third-party BenchLocal desktop benchmark packs.

---

Limitations

External benchmarking is still in progress: HumanEval is complete, but MBPP, LiveCodeBench, MultiPL-E, and MMLU-Pro are partial runs on Q6_K only.
GSPO reward telemetry is not equivalent to external benchmark accuracy.
Reward optimization can inherit blind spots from the reward functions.
The RL stage focused heavily on verifiable mathematics, science, and structured reasoning.
Long-context behavior was not independently benchmarked after GSPO.
Quantization can change generation quality, especially at lower bit widths.
The model may hallucinate, make calculation errors, produce unsafe advice, or follow incorrect premises.
Outputs should be independently verified for high-stakes use.
Results can vary across llama.cpp versions, inference backends, hardware, context lengths, and sampling settings.

---

Recommended evaluation approach

For a fair comparison, evaluate these models under identical prompts and decoding settings:

1. Original Gemma 4 12B instruction base
2. Gwimi SFT-only checkpoint
3. Gwimi SFT + GSPO BF16 model
4. Each selected GGUF quantization

Keep fixed:

prompt formatting;
chat template;
maximum generated tokens;
temperature and top-p;
random seed;
answer extraction;
benchmark scoring;
context length;
inference backend where practical.

Useful comparisons include:

exact-answer mathematics;
scientific multiple choice;
coding and debugging;
formatting compliance;
repetition and anomaly rate;
response length;
pass@1 and sampled pass@k;
generation speed;
RAM and VRAM use;
qualitative reasoning review.

---

Acknowledgements

This release builds on:

Gemma;
Unsloth;
Hugging Face;
Transformers, PEFT, and TRL;
llama.cpp;
Math-Verify;
the authors and maintainers of the SFT and GSPO datasets.

---

License

This model is a derivative of Gemma and remains subject to the applicable Gemma license and terms of use.

Users are responsible for reviewing the upstream license and ensuring that their intended use complies with it.

Run trjxter/Gwimi-4-12B-IT-GGUF with guIDE

Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.

Download guIDE → · Browse 524k+ models · Compare models

Source: Hugging Face · Compare models