Model Intelligence Sheet

richarderkhov/neuralmagic_-_llama-2-7b-pruned70-retrained-gguf overview

This repo contains GGUF quantizations, made by Richard Erkhov, of neuralmagic/Llama-2-7b-pruned70-retrained. The original is a Llama 2 7B model that had 50% of its parameters pruned in one shot with SparseGPT and was then retrained by Cerebras on 50B tokens from SlimPajama while maintaining sparsity. It was then one-shot pruned to 70% sparsity and trained for another 100B tokens. The original weights are the official model weights from the paper "Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment" (arXiv:2405.03594). Authors: Neural Magic, Cerebras.
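The two pruning stages compound: raising sparsity from 50% to 70% removes a further share of the weights that survived the first stage. The arithmetic can be sketched as below. The function name is hypothetical, and the figures are fractions of the pruned weight set rather than exact parameter counts, since the description does not say which layers were pruned.

```python
def additional_prune_fraction(sparsity_before: float, sparsity_after: float) -> float:
    """Fraction of the still-nonzero weights removed when sparsity is raised."""
    if not 0.0 <= sparsity_before <= sparsity_after < 1.0:
        raise ValueError("expected 0 <= sparsity_before <= sparsity_after < 1")
    remaining_before = 1.0 - sparsity_before
    remaining_after = 1.0 - sparsity_after
    return (remaining_before - remaining_after) / remaining_before


# Stage 1: 50% sparsity, retrained on 50B tokens.
# Stage 2: 70% sparsity, trained for another 100B tokens.
second_stage = additional_prune_fraction(0.50, 0.70)
total_retraining_tokens_b = 50 + 100

print(f"Second stage prunes {second_stage:.0%} of the surviving weights")
print(f"Nonzero weights remaining: {1 - 0.70:.0%}")
print(f"Total retraining tokens: {total_retraining_tokens_b}B")
```

In other words, the second stage is the more aggressive of the two relative to what was left, which is consistent with it receiving the larger token budget.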

gguf, arxiv:2301.00774, arxiv:2405.03594, arxiv:2009.03300, arxiv:1905.07830, arxiv:1907.10641, arxiv:1911.01547, arxiv:2109.07958, arxiv:2110.14168, arxiv:2107.03374, endpoints_compatible, region:us

richarderkhov/neuralmagic_-_llama-2-7b-pruned70-retrained-gguf visual

Downloads

686

Likes

0

Pipeline

—

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

19 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Llama-2-7b-pruned70-retrained.IQ4_NL.gguf	GGUF	IQ4_NL	3.58 GB	Download
Llama-2-7b-pruned70-retrained.IQ4_XS.gguf	GGUF	IQ4_XS	3.40 GB	Download
Llama-2-7b-pruned70-retrained.Q2_K.gguf	GGUF	Q2_K	2.36 GB	Download
Llama-2-7b-pruned70-retrained.Q3_K.gguf	GGUF	Q3_K	3.07 GB	Download
Llama-2-7b-pruned70-retrained.Q3_K_L.gguf	GGUF	Q3_K_L	3.35 GB	Download
Llama-2-7b-pruned70-retrained.Q3_K_M.gguf	GGUF	Q3_K_M	3.07 GB	Download
Llama-2-7b-pruned70-retrained.Q3_K_S.gguf	GGUF	Q3_K_S	2.75 GB	Download
Llama-2-7b-pruned70-retrained.Q4_0.gguf	GGUF	Q4_0	3.56 GB	Download
Llama-2-7b-pruned70-retrained.Q4_1.gguf	GGUF	Q4_1	3.95 GB	Download
Llama-2-7b-pruned70-retrained.Q4_K.gguf	GGUF	Q4_K	3.80 GB	Download
Llama-2-7b-pruned70-retrained.Q4_K_M.gguf	GGUF	Q4_K_M	3.80 GB	Download
Llama-2-7b-pruned70-retrained.Q4_K_S.gguf	GGUF	Q4_K_S	3.59 GB	Download
Llama-2-7b-pruned70-retrained.Q5_0.gguf	GGUF	Q5_0	4.33 GB	Download
Llama-2-7b-pruned70-retrained.Q5_1.gguf	GGUF	Q5_1	4.72 GB	Download
Llama-2-7b-pruned70-retrained.Q5_K.gguf	GGUF	Q5_K	4.45 GB	Download
Llama-2-7b-pruned70-retrained.Q5_K_M.gguf	GGUF	Q5_K_M	4.45 GB	Download
Llama-2-7b-pruned70-retrained.Q5_K_S.gguf	GGUF	Q5_K_S	4.33 GB	Download
Llama-2-7b-pruned70-retrained.Q6_K.gguf	GGUF	Q6_K	5.15 GB	Download
Llama-2-7b-pruned70-retrained.Q8_0.gguf	GGUF	Q8_0	6.67 GB	Download

Model Details Live

Model Slug

richarderkhov/neuralmagic_-_llama-2-7b-pruned70-retrained-gguf

Author

RichardErkhov

Pipeline Task

—

Library

—

Created

2024-11-17

Last Modified

2024-11-17

Gated

No

Private

No

HF SHA

c6e86b4bc509b0946c5471a1abd86488f8927900

License

Unknown

Language

Unknown

Base Model

neuralmagic/Llama-2-7b-pruned70-retrained (per the README; not set in the repository metadata)

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "frontmatter": {},
    "hero_image_url": "",
    "summary": "This repo contains model files for a Llama 2 7B model that has had 50% of the parameters pruned in one-shot with SparseGPT, then retrained by Cerebras with 50B tokens from SlimPajama while maintaining sparsity. It was then one-shot pruned to 70% sparsity and trained for another 100B tokens. Official model weights from Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment. **Authors**: Neural Magic, Cerebras",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "Quantization made by Richard Erkhov.\n\n[Github](https://github.com/RichardErkhov)\n\n[Discord](https://discord.gg/pvy7H8DZMG)\n\n[Request more models](https://github.com/RichardErkhov/quant_request)\n\n\nLlama-2-7b-pruned70-retrained - GGUF\n- Model creator: https://huggingface.co/neuralmagic/\n- Original model: https://huggingface.co/neuralmagic/Llama-2-7b-pruned70-retrained/\n\n\n| Name | Quant method | Size |\n| ---- | ---- | ---- |\n| [Llama-2-7b-pruned70-retrained.Q2_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q2_K.gguf) | Q2_K | 2.36GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K_S.gguf) | Q3_K_S | 2.75GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K.gguf) | Q3_K | 3.07GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K_M.gguf) | Q3_K_M | 3.07GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K_L.gguf) | Q3_K_L | 3.35GB |\n| [Llama-2-7b-pruned70-retrained.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.IQ4_XS.gguf) | IQ4_XS | 3.4GB |\n| [Llama-2-7b-pruned70-retrained.Q4_0.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_0.gguf) | Q4_0 | 3.56GB |\n| 
[Llama-2-7b-pruned70-retrained.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.IQ4_NL.gguf) | IQ4_NL | 3.58GB |\n| [Llama-2-7b-pruned70-retrained.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_K_S.gguf) | Q4_K_S | 3.59GB |\n| [Llama-2-7b-pruned70-retrained.Q4_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_K.gguf) | Q4_K | 3.8GB |\n| [Llama-2-7b-pruned70-retrained.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_K_M.gguf) | Q4_K_M | 3.8GB |\n| [Llama-2-7b-pruned70-retrained.Q4_1.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_1.gguf) | Q4_1 | 3.95GB |\n| [Llama-2-7b-pruned70-retrained.Q5_0.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_0.gguf) | Q5_0 | 4.33GB |\n| [Llama-2-7b-pruned70-retrained.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_K_S.gguf) | Q5_K_S | 4.33GB |\n| [Llama-2-7b-pruned70-retrained.Q5_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_K.gguf) | Q5_K | 4.45GB |\n| [Llama-2-7b-pruned70-retrained.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_K_M.gguf) | Q5_K_M | 4.45GB |\n| 
[Llama-2-7b-pruned70-retrained.Q5_1.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_1.gguf) | Q5_1 | 4.72GB |\n| [Llama-2-7b-pruned70-retrained.Q6_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q6_K.gguf) | Q6_K | 5.15GB |\n| [Llama-2-7b-pruned70-retrained.Q8_0.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q8_0.gguf) | Q8_0 | 6.67GB |\n\n\n\n\nOriginal model description:\n---\nbase_model: neuralmagic/Llama-2-7b-pruned50-retrained\ninference: true\nmodel_type: llama\npipeline_tag: text-generation\ndatasets:\n  - cerebras/SlimPajama-627B\ntags:\n- sparse\n---\n\n# Llama-2-7b-pruned70-retrained\n\nThis repo contains model files for a [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) model that has had 50% of the parameters pruned in one-shot with [SparseGPT](https://arxiv.org/abs/2301.00774), then retrained by [Cerebras](https://huggingface.co/cerebras) with 50B tokens from SlimPajama while maintaining sparsity. It was then one-shot pruned to 70% sparsity and trained for another 100B tokens.\n\nOfficial model weights from [Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment](https://arxiv.org/abs/2405.03594).\n\n**Authors**: Neural Magic, Cerebras\n\n## Usage\n\nBelow we share some code snippets on how to get quickly started with running the model.\n\n### Sparse Transfer\n\nBy leveraging a pre-sparsified model's structure, you can efficiently fine-tune on new data, leading to reduced hyperparameter tuning, training times, and computational costs. 
Learn about this process [here](https://neuralmagic.github.io/docs-v2/get-started/transfer).\n\n### Running the model\n\nThis model has not been fine-tuned for instruction-following but may be run with the transformers library. For accelerated inference with sparsity, deploy with [nm-vllm](https://github.com/neuralmagic/nm-vllm) or [deepsparse](https://github.com/neuralmagic/deepsparse).\n\n```python\n# pip install transformers accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ntokenizer = AutoTokenizer.from_pretrained(\"neuralmagic/Llama-2-7b-pruned70-retrained\")\nmodel = AutoModelForCausalLM.from_pretrained(\"neuralmagic/Llama-2-7b-pruned70-retrained\", device_map=\"auto\")\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n## Evaluation Benchmark Results\n\nModel evaluation metrics and results. [UPDATE]\n\n| Benchmark                                      | Metric        | Llama-2-7b  | Llama-2-7b-pruned70-retrained |\n|------------------------------------------------|---------------|-------------|-------------------------------|\n| [MMLU](https://arxiv.org/abs/2009.03300)       | 5-shot        | 46.9%       | 36.5%                         |\n| [HellaSwag](https://arxiv.org/abs/1905.07830)  | 0-shot        | 78.6%       | 74.1%                         |\n| [WinoGrande](https://arxiv.org/abs/1907.10641) | 5-shot        | 74.0%       | 69.5%                         |\n| [ARC-c](https://arxiv.org/abs/1911.01547)      | 25-shot       | 53.1%       | 45.4%                         |\n| [TruthfulQA](https://arxiv.org/abs/2109.07958) | 5-shot        | 38.8%       | 36.7%                         |\n| [GSM8K](https://arxiv.org/abs/2110.14168)      | 5-shot        | 14.5%       | 8.0%                          |\n| [HumanEval](https://arxiv.org/abs/2107.03374)  | pass@1        | 13.4%   
    | 14.4%                         |\n\n## Model Training Details\n\n[UPDATE]\n\n## Help\n\nFor further support, and discussions on these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)\n\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "arxiv:2301.00774",
    "arxiv:2405.03594",
    "arxiv:2009.03300",
    "arxiv:1905.07830",
    "arxiv:1907.10641",
    "arxiv:1911.01547",
    "arxiv:2109.07958",
    "arxiv:2110.14168",
    "arxiv:2107.03374",
    "endpoints_compatible",
    "region:us"
  ],
  "likes": 0,
  "downloads": 686,
  "gated": false,
  "private": false,
  "last_modified": "2024-11-17T09:48:01.000Z",
  "created_at": "2024-11-17T08:37:34.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
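The `tags` array in the normalized metadata mixes plain labels with `arxiv:` references. As a small sketch of how those references could be turned into links, the snippet below filters the tags and builds `https://arxiv.org/abs/` URLs, the same form the README uses. The function name is hypothetical and the JSON is a trimmed copy of the record above.

```python
import json

# Trimmed copy of the normalized metadata shown above (tags only).
metadata_json = """
{
  "tags": [
    "gguf",
    "arxiv:2301.00774", "arxiv:2405.03594", "arxiv:2009.03300",
    "arxiv:1905.07830", "arxiv:1907.10641", "arxiv:1911.01547",
    "arxiv:2109.07958", "arxiv:2110.14168", "arxiv:2107.03374",
    "endpoints_compatible", "region:us"
  ]
}
"""


def arxiv_links(tags: list[str]) -> list[str]:
    """Turn 'arxiv:<id>' tags into abstract-page URLs, preserving order."""
    prefix = "arxiv:"
    return [f"https://arxiv.org/abs/{tag[len(prefix):]}"
            for tag in tags if tag.startswith(prefix)]


links = arxiv_links(json.loads(metadata_json)["tags"])
for link in links:
    print(link)
```

The first two IDs correspond to the SparseGPT and high-sparsity Llama papers cited in the README; the rest are the benchmark papers from its evaluation table.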

Source payload excerpt (from Hugging Face API)

{
  "_id": "6739ab4e8bf916a35ff098fe",
  "id": "RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf",
  "modelId": "RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf",
  "sha": "c6e86b4bc509b0946c5471a1abd86488f8927900",
  "createdAt": "2024-11-17T08:37:34.000Z",
  "lastModified": "2024-11-17T09:48:01.000Z",
  "author": "RichardErkhov",
  "downloads": 686,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 21
}