trjxter/Gwimi-4-12B-IT-GGUF overview
Gwimi 4 12B IT GGUF Quantized GGUF releases of Gwimi 4 12B IT , a Gemma 4 12B instruction model post trained through: 1. Supervised Fine Tuning SFT on a 20,000…
Runs locally from ~4.26 GB disk (8 GB VRAM class GPUs with llama.cpp / guIDE).
Repository Files & Downloads
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Gwimi-4-12B-IT-Q2_K_L.gguf | GGUF | Q2_K_L | 4.26 GB | Download |
| Gwimi-4-12B-IT-Q3_K_L.gguf | GGUF | Q3_K_L | 6.12 GB | Download |
| Gwimi-4-12B-IT-Q3_K_M.gguf | GGUF | Q3_K_M | 5.67 GB | Download |
| Gwimi-4-12B-IT-Q3_K_S.gguf | GGUF | Q3_K_S | 5.15 GB | Download |
| Gwimi-4-12B-IT-Q4_K_L.gguf | GGUF | Q4_K_L | 7.10 GB | Download |
| Gwimi-4-12B-IT-Q4_K_M.gguf | GGUF | Q4_K_M | 6.87 GB | Download |
| Gwimi-4-12B-IT-Q4_K_S.gguf | GGUF | Q4_K_S | 6.54 GB | Download |
| Gwimi-4-12B-IT-Q5_K_L.gguf | GGUF | Q5_K_L | 8.19 GB | Download |
| Gwimi-4-12B-IT-Q5_K_M.gguf | GGUF | Q5_K_M | 7.96 GB | Download |
| Gwimi-4-12B-IT-Q5_K_S.gguf | GGUF | Q5_K_S | 7.77 GB | Download |
| Gwimi-4-12B-IT-Q6_K.gguf | GGUF | Q6_K | 9.11 GB | Download |
| Gwimi-4-12B-IT-Q8_0.gguf | GGUF | Q8_0 | 11.80 GB | Download |
Model Details
| Model ID | trjxter/Gwimi-4-12B-IT-GGUF |
|---|---|
| Author | trjxter |
| Pipeline | text-generation |
| License | gemma |
| Base model | trjxter/Gwimi-4-12B-IT-BF16 |
| Last modified | 2026-06-18T03:29:32.000Z |
Model README
---
license: gemma
base_model:
- trjxter/Gwimi-4-12B-IT-BF16
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- gemma
- gemma-4
- gemma4
- reasoning
- conversational
- sft
- reinforcement-learning
- gspo
- math
- science
- coding
datasets:
- trjxter/Kimi-K2.6-Reasoning-3300x-WandB
- trjxter/Kimi-K2.6-Technical-Reasoning-AddOn-3300x
- trjxter/Gemma-4-31B-Reasoning-1000x
- Jackrong/Claude-opus-4.7-TraceInversion-5000x
- Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
- math-dataset/DAPO-17k-Eng
- unsloth/OpenMathReasoning-mini
- allenai/sciq
---
Gwimi-4-12B-IT-GGUF
Quantized GGUF releases of Gwimi-4-12B-IT, a Gemma 4 12B instruction model post-trained through:
- Supervised Fine-Tuning (SFT) on a 20,000-example reasoning mixture.
- Group Sequence Policy Optimization (GSPO) on 12,000 frozen reinforcement-learning prompts.
The source model for every file in this repository is:
trjxter/Gwimi-4-12B-IT-BF16
The BF16 release contains the cumulative SFT + GSPO updates merged into the exact original Gemma 4 12B BF16 base. No LoRA adapter is required when using these GGUF files.
---
Available quantizations
| Quantization | File size | Practical guidance |
|---|---:|---|
| Q2_K_L | 4.57 GB | Smallest option. Useful when memory is extremely limited, with the largest expected quality loss. |
| Q3_K_S | 5.53 GB | Compact 3-bit option prioritizing size. |
| Q3_K_M | 6.09 GB | Better-balanced 3-bit option. |
| Q3_K_L | 6.57 GB | Highest-quality 3-bit option in this repository. |
| Q4_K_S | 7.02 GB | Smaller 4-bit option. |
| Q4_K_M | 7.38 GB | Recommended default for most local users. |
| Q4_K_L | 7.63 GB | Higher-precision 4-bit variant for selected important tensors. |
| Q5_K_S | 8.34 GB | Smaller 5-bit option with strong quality. |
| Q5_K_M | 8.55 GB | Recommended when additional memory is available. |
| Q5_K_L | 8.79 GB | Higher-precision 5-bit variant for selected important tensors. |
| Q6_K | 9.79 GB | High-quality quant with relatively little compression loss. |
| Q8_0 | 12.7 GB | Largest quantized release and the closest option here to the BF16 source. |
Quick recommendation
- Best general default:
Q4_K_M - Better quality with moderate extra memory:
Q5_K_M - High-quality local inference:
Q6_K - Closest to the BF16 model:
Q8_0 - Memory-constrained systems:
Q3_K_MorQ3_K_L - Absolute smallest file:
Q2_K_L
File size is not the same as total runtime memory usage. Your runtime also needs memory for the inference backend, model metadata, temporary buffers, and the KV cache. Longer context lengths increase KV-cache memory use.
---
Model training overview
Stage 1: Supervised Fine-Tuning
The SFT corpus contained exactly 20,000 examples.
SFT dataset composition
| Dataset | Rows |
|---|---:|
| trjxter/Kimi-K2.6-Technical-Reasoning-AddOn-3300x | 3,301 |
| trjxter/Kimi-K2.6-Reasoning-3300x-WandB | 3,303 |
| trjxter/Gemma-4-31B-Reasoning-1000x | 995 |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | 4,761 |
| Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned top-up | 7,640 |
| Total | 20,000 |
The Kimi K2.5 top-up consisted of:
| Category | Rows |
|---|---:|
| General-Distillation | 4,000 |
| General-Math | 1,500 |
| PHD-Science | 1,500 |
| MultilingualSTEM | 640 |
The final split contained:
| Split | Rows |
|---|---:|
| Training | 18,000 |
| Held-out evaluation | 2,000 |
SFT configuration
| Parameter | Value |
|---|---:|
| Maximum sequence length | 32,768 |
| Training method | rank-stabilized LoRA |
| LoRA rank | 128 |
| LoRA alpha | 256 |
| Base loading precision | 8-bit |
| Effective optimizer batch | 16 |
| Epochs | 1 |
| Learning rate | 2e-5 |
| Scheduler | cosine |
| Maximum gradient norm | 1.0 |
SFT evaluation
The 2,000-example evaluation set contained:
- 100 fixed anchor examples
- 1,900 examples in a rotating evaluation pool
Scheduled evaluations used 100 fixed anchor examples plus 100 rotating examples.
The final full held-out evaluation produced:
| Metric | Result |
|---|---:|
| Evaluation examples | 2,000 |
| Final evaluation loss | 0.6940310597419739 |
| Final perplexity | 2.001768540 |
These are teacher-forced SFT evaluation metrics and should not be interpreted as free-generation benchmark accuracy.
---
Stage 2: GSPO reinforcement learning
GSPO stands for Group Sequence Policy Optimization.
For each prompt, the model generated multiple candidate completions. Programmatic reward functions scored those completions, and training used within-group reward differences to update the policy.
The defining configuration was:
importance_sampling_level = "sequence"
This applies sequence-level rather than token-level importance ratios.
GSPO datasets
Training split
| Source | Rows |
|---|---:|
| DAPO English mathematics | 8,400 |
| OpenMathReasoning Mini | 1,800 |
| SciQ | 1,800 |
| Total | 12,000 |
Reserved evaluation and protection splits
| Split | Rows |
|---|---:|
| Fixed anchor evaluation | 300 |
| Held-out evaluation | 1,897 |
| Protected SciQ test | 991 |
The frozen dataset suite had zero cross-split normalized-prompt overlap.
Reward functions
The run used three frozen reward components:
correctness_reward_func
format_reward_func
anomaly_reward_func
They measured:
- answer correctness;
- required response formatting;
- malformed, repetitive, degenerate, or suspicious generations.
GSPO configuration
| Parameter | Value |
|---|---:|
| Final global step | 2250 |
| Learning rate | 2e-6 |
| Importance sampling | sequence-level |
| Reward scaling | group |
| KL coefficient | 0.0 |
| Generations per prompt | 8 |
| Unique prompts per rollout | 3 |
| Total completions per rollout | 24 |
| Effective optimizer batch | 8 |
| Maximum completion length | 1,280 |
| Runtime sequence length | 4,096 |
| Temperature | 1.15 |
| Top-p | 0.95 |
| Repetition penalty | 1.05 |
| Maximum gradient norm | 1.0 |
| Scheduler | cosine |
Near-terminal GSPO telemetry
The following values are a near-terminal W&B snapshot around global step 2250. They are training telemetry, not independent benchmark results.
| Metric | Near-terminal value |
|---|---:|
| Combined reward | 0.2667 |
| Correctness reward | 0.1667 |
| Format reward | 0.1000 |
| Reward standard deviation | 0.3086 |
| Fraction of zero-variance reward groups | 0.6667 |
| Entropy | 0.0893 |
| Sequence clip ratio, region mean | 0.125 |
| Mean completion length | 762.875 tokens |
| Completion clipped ratio | 0.25 |
| Gradient norm before clipping | 8.3416 |
| Approximate processed tokens | 15.55 million |
The run stopped at global step 2250 after the format reward had saturated, completion lengths remained controlled, entropy had stabilized, and sequence clipping remained active without saturating.
The online RL worker did not run periodic held-out evaluation during the expensive generation loop. The preserved anchor, held-out, and protected test splits are intended for later independent evaluation.
---
Quantization provenance
All GGUF files were generated from the same verified merged BF16 model:
trjxter/Gwimi-4-12B-IT-BF16
The merged BF16 weights were verified to differ from the untouched base model:
Base SHA-256:
5a84cb313260ac447237b890387116dfa8682e49a6b44bc585ae8353abbff18d
Merged SHA-256:
1e024792bf994c200fc7757621d202eb2bb2ba11593afcf6a6a98ab6bb9c4845
The original and merged BF16 files had the same byte size because LoRA merging changes the numerical values inside the existing model tensors rather than adding a second set of full model weights.
A temporary BF16 GGUF was used as the private quantization source. It was not uploaded to this repository.
All 12 public GGUF files were generated and then verified as present in this repository.
---
Running with llama.cpp
Use a recent llama.cpp build with Gemma 4 GGUF support.
Example using Q4_K_M:
llama-cli \
-m Gwimi-4-12B-IT-Q4_K_M.gguf \
-cnv \
-c 4096 \
-n 1024 \
--temp 0.7 \
--top-p 0.95
For a direct prompt:
llama-cli \
-m Gwimi-4-12B-IT-Q4_K_M.gguf \
-p "Solve carefully: What is 17% of 240?" \
-c 4096 \
-n 512 \
--temp 0.7 \
--top-p 0.95
Adjust GPU offloading according to your hardware. For example:
-ngl 99
attempts to offload as many model layers as possible to the GPU.
---
Using the model in local applications
These files are intended for GGUF-compatible runtimes such as:
- llama.cpp
- LM Studio
- KoboldCpp
- compatible local model launchers and servers
Select the quant that fits comfortably within your available RAM or VRAM after accounting for KV cache and runtime overhead.
For most users, begin with:
Gwimi-4-12B-IT-Q4_K_M.gguf
Then compare Q5_K_M, Q6_K, or Q8_0 when more memory is available.
---
Intended uses
This model is intended for experimentation with:
- mathematical and scientific reasoning;
- coding and debugging assistance;
- technical question answering;
- structured instruction following;
- local inference;
- quantization-quality comparisons;
- SFT and reinforcement-learning research.
---
Limitations
- No independent final benchmark suite is reported yet.
- GSPO reward telemetry is not equivalent to external benchmark accuracy.
- Reward optimization can inherit blind spots from the reward functions.
- The RL stage focused heavily on verifiable mathematics, science, and structured reasoning.
- Long-context behavior was not independently benchmarked after GSPO.
- Quantization can change generation quality, especially at lower bit widths.
- The model may hallucinate, make calculation errors, produce unsafe advice, or follow incorrect premises.
- Outputs should be independently verified for high-stakes use.
- Results can vary across llama.cpp versions, inference backends, hardware, context lengths, and sampling settings.
---
Recommended evaluation approach
For a fair comparison, evaluate these models under identical prompts and decoding settings:
1. Original Gemma 4 12B instruction base
2. Gwimi SFT-only checkpoint
3. Gwimi SFT + GSPO BF16 model
4. Each selected GGUF quantization
Keep fixed:
- prompt formatting;
- chat template;
- maximum generated tokens;
- temperature and top-p;
- random seed;
- answer extraction;
- benchmark scoring;
- context length;
- inference backend where practical.
Useful comparisons include:
- exact-answer mathematics;
- scientific multiple choice;
- coding and debugging;
- formatting compliance;
- repetition and anomaly rate;
- response length;
- pass@1 and sampled pass@k;
- generation speed;
- RAM and VRAM use;
- qualitative reasoning review.
---
Acknowledgements
This release builds on:
- Gemma;
- Unsloth;
- Hugging Face;
- Transformers, PEFT, and TRL;
- llama.cpp;
- Math-Verify;
- the authors and maintainers of the SFT and GSPO datasets.
---
License
This model is a derivative of Gemma and remains subject to the applicable Gemma license and terms of use.
Users are responsible for reviewing the upstream license and ensuring that their intended use complies with it.
Run trjxter/Gwimi-4-12B-IT-GGUF with guIDE
Download guIDE — the AI-native code editor with local LLM inference and 69 built-in tools.
Source: Hugging Face · Compare models