Model Intelligence Sheet

richarderkhov/euclaise_-_remask-3b-gguf overview

Comprehensive model page for richarderkhov/euclaise_-_remask-3b-gguf

gguf, arxiv:2401.01335, arxiv:2403.02178, endpoints_compatible, region:us, conversational


Downloads

156

Likes

0

Pipeline

—

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

22 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
ReMask-3B.IQ3_M.gguf	GGUF	IQ3_M	1.23 GB	Download
ReMask-3B.IQ3_S.gguf	GGUF	IQ3_S	1.17 GB	Download
ReMask-3B.IQ3_XS.gguf	GGUF	IQ3_XS	1.11 GB	Download
ReMask-3B.IQ4_NL.gguf	GGUF	IQ4_NL	1.51 GB	Download
ReMask-3B.IQ4_XS.gguf	GGUF	IQ4_XS	1.43 GB	Download
ReMask-3B.Q2_K.gguf	GGUF	Q2_K	1.01 GB	Download
ReMask-3B.Q3_K.gguf	GGUF	Q3_K	1.30 GB	Download
ReMask-3B.Q3_K_L.gguf	GGUF	Q3_K_L	1.40 GB	Download
ReMask-3B.Q3_K_M.gguf	GGUF	Q3_K_M	1.30 GB	Download
ReMask-3B.Q3_K_S.gguf	GGUF	Q3_K_S	1.17 GB	Download
ReMask-3B.Q4_0.gguf	GGUF	Q4_0	1.50 GB	Download
ReMask-3B.Q4_1.gguf	GGUF	Q4_1	1.65 GB	Download
ReMask-3B.Q4_K.gguf	GGUF	Q4_K	1.59 GB	Download
ReMask-3B.Q4_K_M.gguf	GGUF	Q4_K_M	1.59 GB	Download
ReMask-3B.Q4_K_S.gguf	GGUF	Q4_K_S	1.51 GB	Download
ReMask-3B.Q5_0.gguf	GGUF	Q5_0	1.81 GB	Download
ReMask-3B.Q5_1.gguf	GGUF	Q5_1	1.96 GB	Download
ReMask-3B.Q5_K.gguf	GGUF	Q5_K	1.86 GB	Download
ReMask-3B.Q5_K_M.gguf	GGUF	Q5_K_M	1.86 GB	Download
ReMask-3B.Q5_K_S.gguf	GGUF	Q5_K_S	1.81 GB	Download
ReMask-3B.Q6_K.gguf	GGUF	Q6_K	2.14 GB	Download
ReMask-3B.Q8_0.gguf	GGUF	Q8_0	2.77 GB	Download
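
As a rough illustration of how the file table can be used, the sketch below picks the largest quantization that fits a file-size budget and builds a direct-download URL for it. The sizes are copied from a subset of the table; the helper names `pick_quant` and `download_url` are hypothetical, and the `/resolve/main/` URL pattern is an assumption based on how the Hugging Face Hub usually serves raw files (the README itself only links to `/blob/main/` pages). File size is only a proxy for the memory needed at inference time.

```python
# Sizes in GB, copied from a subset of the file table above.
QUANT_SIZES_GB = {
    "Q2_K": 1.01,
    "IQ3_XS": 1.11,
    "Q3_K_M": 1.30,
    "Q4_K_M": 1.59,
    "Q5_K_M": 1.86,
    "Q6_K": 2.14,
    "Q8_0": 2.77,
}

REPO_ID = "RichardErkhov/euclaise_-_ReMask-3B-gguf"


def pick_quant(budget_gb):
    """Return the largest listed quantization whose file fits budget_gb, or None."""
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget_gb]
    return max(fitting)[1] if fitting else None


def download_url(quant):
    """Build a direct-download URL (assumes the Hub's /resolve/main/ pattern)."""
    return f"https://huggingface.co/{REPO_ID}/resolve/main/ReMask-3B.{quant}.gguf"
```

For example, `download_url(pick_quant(1.6))` would point at the Q4_K_M file, since it is the largest listed file no bigger than 1.6 GB.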

Model Details

Model Slug

richarderkhov/euclaise_-_remask-3b-gguf

Author

RichardErkhov

Pipeline Task

—

Library

—

Created

2024-08-19

Last Modified

2024-08-19

Gated

No

Private

No

HF SHA

3c15c4bc0ddd3d6951a2ef4a6965566d8d36949f

License

Unknown

Language

Unknown

Base Model

Unknown

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "frontmatter": {},
    "hero_image_url": "",
    "summary": "",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "Quantization made by Richard Erkhov.\n\n[Github](https://github.com/RichardErkhov)\n\n[Discord](https://discord.gg/pvy7H8DZMG)\n\n[Request more models](https://github.com/RichardErkhov/quant_request)\n\n\nReMask-3B - GGUF\n- Model creator: https://huggingface.co/euclaise/\n- Original model: https://huggingface.co/euclaise/ReMask-3B/\n\n\n| Name | Quant method | Size |\n| ---- | ---- | ---- |\n| [ReMask-3B.Q2_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q2_K.gguf) | Q2_K | 1.01GB |\n| [ReMask-3B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.IQ3_XS.gguf) | IQ3_XS | 1.11GB |\n| [ReMask-3B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.IQ3_S.gguf) | IQ3_S | 1.17GB |\n| [ReMask-3B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q3_K_S.gguf) | Q3_K_S | 1.17GB |\n| [ReMask-3B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.IQ3_M.gguf) | IQ3_M | 1.23GB |\n| [ReMask-3B.Q3_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q3_K.gguf) | Q3_K | 1.3GB |\n| [ReMask-3B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q3_K_M.gguf) | Q3_K_M | 1.3GB |\n| [ReMask-3B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q3_K_L.gguf) | Q3_K_L | 1.4GB |\n| [ReMask-3B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.IQ4_XS.gguf) | IQ4_XS | 1.43GB |\n| [ReMask-3B.Q4_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q4_0.gguf) | Q4_0 | 1.5GB |\n| [ReMask-3B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.IQ4_NL.gguf) | IQ4_NL | 1.51GB |\n| 
[ReMask-3B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q4_K_S.gguf) | Q4_K_S | 1.51GB |\n| [ReMask-3B.Q4_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q4_K.gguf) | Q4_K | 1.59GB |\n| [ReMask-3B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q4_K_M.gguf) | Q4_K_M | 1.59GB |\n| [ReMask-3B.Q4_1.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q4_1.gguf) | Q4_1 | 1.65GB |\n| [ReMask-3B.Q5_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q5_0.gguf) | Q5_0 | 1.81GB |\n| [ReMask-3B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q5_K_S.gguf) | Q5_K_S | 1.81GB |\n| [ReMask-3B.Q5_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q5_K.gguf) | Q5_K | 1.86GB |\n| [ReMask-3B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q5_K_M.gguf) | Q5_K_M | 1.86GB |\n| [ReMask-3B.Q5_1.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q5_1.gguf) | Q5_1 | 1.96GB |\n| [ReMask-3B.Q6_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q6_K.gguf) | Q6_K | 2.14GB |\n| [ReMask-3B.Q8_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_ReMask-3B-gguf/blob/main/ReMask-3B.Q8_0.gguf) | Q8_0 | 2.77GB |\n\n\n\n\nOriginal model description:\n---\nlanguage:\n- en\nlicense: cc-by-sa-4.0\ndatasets:\n- euclaise/TinyCoT\n- euclaise/reddit-instruct-curated\n- sablo/oasst2_curated\n---\n\n# ReMask: Improving autoregressive language models via regularized masking\n\n## Background\n\n[Self-Play Finetuning (SPIN)](https://arxiv.org/abs/2401.01335) is a recent finetuning method which outperforms standard supervised finetuning (SFT).\nInstead of just performing next-token 
prediction, SPIN is an iterative method which contrasts generations from the previous iteration of the model with the ground-truth completions.\nUnlike methods like reinforcement learning or ranking losses, SPIN does not require preference data, which makes it an attractive method since preference data can be hard to gather.\nHowever, SPIN's popularity has been limited by the need to repeatedly generate sequences from the model -- generation is much slower than training, so SPIN is much slower and more expensive than SFT.\n\nWith this problem in mind, I set out to create an alternative to SPIN which doesn't require generation.\n\n### Why does SPIN work?\n\nSFT trains models to predict the next token given all the ground-truth previous tokens.\nHowever, in generation, the model doesn't have access to a ground-truth to predict from, and instead repeatedly predicts on top of its own predictions.\nThis creates a bias known as \"exposure bias\": Models often can pick reasonable choices for the next token on average, but can't keep this up for the full sequence.\nIn particular, it might be easy to predict a *reasonable* next token, but much more difficult to predict the full sequence.\n\n***For instance, consider the following case:***\n\n> The astronomer pointed his telescope at the distant star, hoping to see\n\nThe correct prediction here might be \"signs of life.\" 
However, the model might predict \"and\" rather than \"signs\", since \"and\" is *reasonable* in the immediate context - it's grammatically correct, but implies a strange ending to the sentence.\nAs a result, the model might end up with something like *\"The astronomer pointed his telescope at the distant star, hoping to see and hear.\"* - which makes little sense.\n\nSPIN's advantage over SFT likely comes from its partial mitigation of exposure bias.\nSPIN doesn't only train the model to predict the next token accurately; it repeatedly trains the model to identify and fix discrepancies between its generations and the ground-truth.\nIn order to do this, the model must implicitly learn to think ahead, as exposure bias is likely what causes many of the discrepancies.\n\n### How can we simplify this?\n\nUnfortunately, explicitly predicting ahead for many steps is very expensive, and considering full model generations requires a slow generation process.\n\nAn obvious option is to simply randomly corrupt tokens in the sequence.\nThe model must keep an internal estimate of what the corrupted tokens ought to be in order to predict the token after them, forcing the model to think ahead.\n\nThe most obvious ways to do this are to randomly replace input tokens with a special `[mask]` token, or to randomly replace input tokens with other random tokens.\nThese approaches were tried in [Masked Thought](https://arxiv.org/abs/2403.02178), albeit with somewhat different motivations.\n\nHowever, these approaches have a problem: Models can detect when a token is `[mask]` or is highly unlikely, so the model may only learn to think ahead when the corruptions are present.\n\nTo avoid this issue, we can run the model twice - once with a masked sequence, and once on the full sequence.\nThen, we penalize deviations between these two runs, which forces the model to act the same regardless of whether the `[mask]` token is present or not.\n\nThis approach was initially introduced with 
[R-TeaFor](https://aclanthology.org/2022.emnlp-main.423/) for abstractive summarization, but can be easily applied to standard generation tasks too.\n\n### ReMask and ReMask-CoT:\n\nReMask applies an approach similar to R-TeaFor to typical chat/instruction tuning.\n\nConsider the following chat interaction:\n\n> User: What is 1+1?\n> \n> Assistant: **1+1=2**\n> \n> **User:**\n\nThe model must predict the bolded parts.  So, we randomly mask tokens from the bolded parts, and run the model once on the masked sequence and once on the full sequence.\n\nWe then compute a divergence loss `D(p_masked, p_full)` between the two predictions. For this, I used the average of the backwards and forwards KL divergences between the predictions.\n\nFinally, we add this loss to the standard cross-entropy language modeling losses from each prediction, with a weighting value:\n```\nloss = 0.5*(CE(p_masked, labels) + CE(p_full, labels)) + weight*D(p_masked, p_full)\n```\n\n***ReMask-CoT:***\n\nFor CoT tasks where the reasoning is explicitly separated from the answer, we can add some further improvements.\n\nFirst, note that CoT rationales are noisy -- there are many correct rationales which might lead to the same correct answer, and rationales are impacted by things like writing style which don't matter for the actual correctness of the reasoning.\n\nKeeping this in mind:\n\n- We also randomly mask a small portion of the labels of the rationale, but not the answer, such that an accurate answer is more important than a rationale that is word-for-word identical to the annotated rationale.\n- The exact answer is always important and is always a few tokens. 
Hence, we do not mask the labels or input tokens for the answer value.\n- Rarely, we ignore the rationale labels entirely, such that the model is only pushed to learn what leads to the best answer.\n\n## Results\n\nI trained StableLM-3B-4e1t repeatedly on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), along with 1000 examples from [reddit-instruct-curated](https://huggingface.co/datasets/euclaise/reddit-instruct-curated) and 1000 examples from [oasst2-curated](https://huggingface.co/datasets/sablo/oasst2_curated).\n\nI trained once with ReMask/ReMask-CoT, once without regularization to match Masked Thought (w/ partial label-masking for CoT), and once with SFT.\n\nIf my hypothesis regarding exposure bias is correct, ReMask should significantly improve generative benchmarks like GSM8K, but would not necessarily improve logprob-based benchmarks like ARC-c (as implemented by the evaluation harness).\n\nHere are some benchmark results, computed using the LM Evaluation Harness with vllm:\n\n| Model          | GSM8K (strict, 5-shot) | ARC-c (acc_norm, 25-shot) |\n|:--------------:|-----------------------:|--------------------------:|\n| SFT            | 24.34%                 | 42.92%                    |\n| Masked Thought | 24.18%                 | *43.60%*                |\n| **ReMask**     | **27.90%**             | 43.26%                    |\n\nAs I expected, it improves GSM8K, but doesn't do much to ARC.\n\n## Training details\n- Framework: PyTorch Lightning\n- Optimizer: [Lilith](https://github.com/euclaise/supertrainer2000/blob/master/src/supertrainer2k/optim/lilith.py)\n- Training sequence length: 256\n- Input masking probability: 40%\n- Label masking probability: 10%\n- Answer-only (full rationale label masking) probability: 10%\n- Batch size: 16, accumulated to 256\n- Epochs: 6\n- Learning rate: 1e-5\n- Learning rate schedule: One Cycle, cosine, no cycle_momentum\n- Regularization weight: 0.1\n\n## Prompt format\n\nThe format for 
reddit-instruct and oasst2 was:\n\n```\n<|user|>\n[insert instruction here]\n<|assistant|>\n[insert response here]\n<|user|>\n...\n```\n\nThe format for TinyCoT was:\n```\n<|user|>\n[insert instruction here]\n<|rationale|>\n[insert reasoning here]\n<|answer|>\n[insert direct answer here]\n```\n\n\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "arxiv:2401.01335",
    "arxiv:2403.02178",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 156,
  "gated": false,
  "private": false,
  "last_modified": "2024-08-19T07:36:16.000Z",
  "created_at": "2024-08-19T07:08:20.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
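
The loss described in the README above, `0.5*(CE(p_masked, labels) + CE(p_full, labels)) + weight*D(p_masked, p_full)` with `D` taken as the average of the forward and backward KL divergences, can be illustrated with a small pure-Python sketch. This is not the author's training code: the function names are hypothetical, the inputs are plain lists of per-position probability vectors rather than model logits, and averaging over positions is an assumption. The default `weight=0.1` follows the "Regularization weight" listed in the training details.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the correct token index under one distribution."""
    return -math.log(probs[label])


def kl(p, q):
    """KL(p || q) for two discrete distributions; assumes q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of the forward and backward KL divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))


def remask_loss(p_masked, p_full, labels, weight=0.1):
    """Mean cross-entropy of both runs plus the weighted divergence term.

    p_masked, p_full: per-position probability vectors from the masked and
    unmasked forward passes. labels: correct token index for each position.
    Averaging over positions is an assumption of this sketch.
    """
    n = len(labels)
    ce_masked = sum(cross_entropy(p, y) for p, y in zip(p_masked, labels)) / n
    ce_full = sum(cross_entropy(p, y) for p, y in zip(p_full, labels)) / n
    div = sum(symmetric_kl(pm, pf) for pm, pf in zip(p_masked, p_full)) / n
    return 0.5 * (ce_masked + ce_full) + weight * div
```

When the masked and full runs produce identical distributions, the divergence term vanishes and the loss reduces to the ordinary cross-entropy; any disagreement between the two runs adds a non-negative penalty.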

Source payload excerpt (from Hugging Face API)

{
  "_id": "66c2ef64fa92c9e00afab0ce",
  "id": "RichardErkhov/euclaise_-_ReMask-3B-gguf",
  "modelId": "RichardErkhov/euclaise_-_ReMask-3B-gguf",
  "sha": "3c15c4bc0ddd3d6951a2ef4a6965566d8d36949f",
  "createdAt": "2024-08-19T07:08:20.000Z",
  "lastModified": "2024-08-19T07:36:16.000Z",
  "author": "RichardErkhov",
  "downloads": 156,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 24
}
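
The two prompt formats given at the end of the README (chat and TinyCoT) can be assembled with small helpers such as the ones sketched below. The function names are hypothetical, and the exact whitespace, in particular ending the prompt with a newline after the final role tag, is an assumption inferred from the templates rather than something the README states.

```python
def chat_prompt(history, instruction):
    """Build a chat-format prompt that ends at the assistant's turn.

    history: list of (user_text, assistant_text) pairs from earlier turns.
    instruction: the new user message the model should answer.
    """
    parts = []
    for user_text, assistant_text in history:
        parts.append(f"<|user|>\n{user_text}\n<|assistant|>\n{assistant_text}\n")
    parts.append(f"<|user|>\n{instruction}\n<|assistant|>\n")
    return "".join(parts)


def cot_prompt(instruction):
    """Build a TinyCoT-format prompt that ends where the rationale should begin."""
    return f"<|user|>\n{instruction}\n<|rationale|>\n"
```

With the TinyCoT format, the model would be expected to write its reasoning and then emit `<|answer|>` followed by the direct answer, so a caller could split the generated text on that tag.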