Model Intelligence Sheet
RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf overview
This repo contains GGUF quantizations, made by Richard Erkhov, of neuralmagic/Llama-2-7b-pruned70-retrained. The original is a Llama 2 7B model that was pruned to 50% sparsity in one shot with SparseGPT, then retrained by Cerebras on 50B tokens from SlimPajama while maintaining sparsity. It was then one-shot pruned to 70% sparsity and trained for another 100B tokens. The original weights are the official release accompanying the paper "Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment" (arXiv:2405.03594). Authors: Neural Magic, Cerebras
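To make the sparsity figures concrete, here is a minimal sketch of what "50% sparsity, then 70% sparsity" means for a weight vector. It uses plain magnitude pruning on a toy list, which is an assumption made purely for illustration: SparseGPT selects and updates weights with a more involved, second-order-informed procedure, and the retraining between the two stages is omitted. The function names and toy values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries until `sparsity` of all entries are zero.

    Illustrative stand-in only; this is NOT the SparseGPT algorithm.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first (already-zero entries come first).
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def measured_sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical toy weights, not taken from the model.
toy = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, 0.1, -0.3]
half = magnitude_prune(toy, 0.5)    # first stage: 50% of entries are zero
final = magnitude_prune(half, 0.7)  # second stage: 70% of entries are zero
print(measured_sparsity(half), measured_sparsity(final))
```

Because the second call starts from the already-pruned list, the zeros from the first stage are kept and only two more entries are removed, which mirrors the staged 50% to 70% schedule described above.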
Downloads
686
Likes
0
Pipeline
—
Library
—
Visibility
Public
Access
Open
Repository Files & Downloads
19 files detected
Direct downloads for all repository files
| File | Type | Quantization | Size | Link |
|---|---|---|---|---|
| Llama-2-7b-pruned70-retrained.IQ4_NL.gguf | GGUF | IQ4_NL | 3.58 GB | Download |
| Llama-2-7b-pruned70-retrained.IQ4_XS.gguf | GGUF | IQ4_XS | 3.40 GB | Download |
| Llama-2-7b-pruned70-retrained.Q2_K.gguf | GGUF | Q2_K | 2.36 GB | Download |
| Llama-2-7b-pruned70-retrained.Q3_K.gguf | GGUF | Q3_K | 3.07 GB | Download |
| Llama-2-7b-pruned70-retrained.Q3_K_L.gguf | GGUF | Q3_K_L | 3.35 GB | Download |
| Llama-2-7b-pruned70-retrained.Q3_K_M.gguf | GGUF | Q3_K_M | 3.07 GB | Download |
| Llama-2-7b-pruned70-retrained.Q3_K_S.gguf | GGUF | Q3_K_S | 2.75 GB | Download |
| Llama-2-7b-pruned70-retrained.Q4_0.gguf | GGUF | Q4_0 | 3.56 GB | Download |
| Llama-2-7b-pruned70-retrained.Q4_1.gguf | GGUF | Q4_1 | 3.95 GB | Download |
| Llama-2-7b-pruned70-retrained.Q4_K.gguf | GGUF | Q4_K | 3.80 GB | Download |
| Llama-2-7b-pruned70-retrained.Q4_K_M.gguf | GGUF | Q4_K_M | 3.80 GB | Download |
| Llama-2-7b-pruned70-retrained.Q4_K_S.gguf | GGUF | Q4_K_S | 3.59 GB | Download |
| Llama-2-7b-pruned70-retrained.Q5_0.gguf | GGUF | Q5_0 | 4.33 GB | Download |
| Llama-2-7b-pruned70-retrained.Q5_1.gguf | GGUF | Q5_1 | 4.72 GB | Download |
| Llama-2-7b-pruned70-retrained.Q5_K.gguf | GGUF | Q5_K | 4.45 GB | Download |
| Llama-2-7b-pruned70-retrained.Q5_K_M.gguf | GGUF | Q5_K_M | 4.45 GB | Download |
| Llama-2-7b-pruned70-retrained.Q5_K_S.gguf | GGUF | Q5_K_S | 4.33 GB | Download |
| Llama-2-7b-pruned70-retrained.Q6_K.gguf | GGUF | Q6_K | 5.15 GB | Download |
| Llama-2-7b-pruned70-retrained.Q8_0.gguf | GGUF | Q8_0 | 6.67 GB | Download |
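As a rough aid for choosing among the files above, the sketch below picks the largest listed quantization that fits a size budget and builds a direct-download URL for it. It rests on several assumptions: the sizes are a subset copied from the table, "largest file that fits" is only a heuristic for quality, and the URL follows the usual Hugging Face `resolve/main` convention rather than anything stated on this page. The helper names `pick_quant` and `download_url` are hypothetical.

```python
REPO_ID = "RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf"
BASE_NAME = "Llama-2-7b-pruned70-retrained"

# File sizes in GB, copied from the table above (subset of the 19 files).
QUANT_SIZES_GB = {
    "Q2_K": 2.36,
    "Q3_K_M": 3.07,
    "Q4_K_M": 3.80,
    "Q5_K_M": 4.45,
    "Q6_K": 5.15,
    "Q8_0": 6.67,
}


def pick_quant(budget_gb, sizes=QUANT_SIZES_GB):
    """Return the largest quantization whose file fits within budget_gb, or None."""
    fitting = {quant: size for quant, size in sizes.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def download_url(quant):
    """Build a direct-download URL (assumes the standard resolve/main layout)."""
    return f"https://huggingface.co/{REPO_ID}/resolve/main/{BASE_NAME}.{quant}.gguf"


choice = pick_quant(4.0)
print(choice, download_url(choice))
```

Runtime memory use is typically higher than the file size because of the context cache and other buffers, so it is sensible to leave headroom in the budget.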
Model Details
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"frontmatter": {},
"hero_image_url": "",
"summary": "This repo contains model files for a Llama 2 7B model that has had 50% of the parameters pruned in one-shot with SparseGPT, then retrained by Cerebras with 50B tokens from SlimPajama while maintaining sparsity. It was then one-shot pruned to 70% sparsity and trained for another 100B tokens. Official model weights from Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment. **Authors**: Neural Magic, Cerebras",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "Quantization made by Richard Erkhov.\n\n[Github](https://github.com/RichardErkhov)\n\n[Discord](https://discord.gg/pvy7H8DZMG)\n\n[Request more models](https://github.com/RichardErkhov/quant_request)\n\n\nLlama-2-7b-pruned70-retrained - GGUF\n- Model creator: https://huggingface.co/neuralmagic/\n- Original model: https://huggingface.co/neuralmagic/Llama-2-7b-pruned70-retrained/\n\n\n| Name | Quant method | Size |\n| ---- | ---- | ---- |\n| [Llama-2-7b-pruned70-retrained.Q2_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q2_K.gguf) | Q2_K | 2.36GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K_S.gguf) | Q3_K_S | 2.75GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K.gguf) | Q3_K | 3.07GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K_M.gguf) | Q3_K_M | 3.07GB |\n| [Llama-2-7b-pruned70-retrained.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q3_K_L.gguf) | Q3_K_L | 3.35GB |\n| [Llama-2-7b-pruned70-retrained.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.IQ4_XS.gguf) | IQ4_XS | 3.4GB |\n| [Llama-2-7b-pruned70-retrained.Q4_0.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_0.gguf) | Q4_0 | 3.56GB |\n| 
[Llama-2-7b-pruned70-retrained.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.IQ4_NL.gguf) | IQ4_NL | 3.58GB |\n| [Llama-2-7b-pruned70-retrained.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_K_S.gguf) | Q4_K_S | 3.59GB |\n| [Llama-2-7b-pruned70-retrained.Q4_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_K.gguf) | Q4_K | 3.8GB |\n| [Llama-2-7b-pruned70-retrained.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_K_M.gguf) | Q4_K_M | 3.8GB |\n| [Llama-2-7b-pruned70-retrained.Q4_1.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q4_1.gguf) | Q4_1 | 3.95GB |\n| [Llama-2-7b-pruned70-retrained.Q5_0.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_0.gguf) | Q5_0 | 4.33GB |\n| [Llama-2-7b-pruned70-retrained.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_K_S.gguf) | Q5_K_S | 4.33GB |\n| [Llama-2-7b-pruned70-retrained.Q5_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_K.gguf) | Q5_K | 4.45GB |\n| [Llama-2-7b-pruned70-retrained.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_K_M.gguf) | Q5_K_M | 4.45GB |\n| 
[Llama-2-7b-pruned70-retrained.Q5_1.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q5_1.gguf) | Q5_1 | 4.72GB |\n| [Llama-2-7b-pruned70-retrained.Q6_K.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q6_K.gguf) | Q6_K | 5.15GB |\n| [Llama-2-7b-pruned70-retrained.Q8_0.gguf](https://huggingface.co/RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf/blob/main/Llama-2-7b-pruned70-retrained.Q8_0.gguf) | Q8_0 | 6.67GB |\n\n\n\n\nOriginal model description:\n---\nbase_model: neuralmagic/Llama-2-7b-pruned50-retrained\ninference: true\nmodel_type: llama\npipeline_tag: text-generation\ndatasets:\n - cerebras/SlimPajama-627B\ntags:\n- sparse\n---\n\n# Llama-2-7b-pruned70-retrained\n\nThis repo contains model files for a [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) model that has had 50% of the parameters pruned in one-shot with [SparseGPT](https://arxiv.org/abs/2301.00774), then retrained by [Cerebras](https://huggingface.co/cerebras) with 50B tokens from SlimPajama while maintaining sparsity. It was then one-shot pruned to 70% sparsity and trained for another 100B tokens.\n\nOfficial model weights from [Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment](https://arxiv.org/abs/2405.03594).\n\n**Authors**: Neural Magic, Cerebras\n\n## Usage\n\nBelow we share some code snippets on how to get quickly started with running the model.\n\n### Sparse Transfer\n\nBy leveraging a pre-sparsified model's structure, you can efficiently fine-tune on new data, leading to reduced hyperparameter tuning, training times, and computational costs. 
Learn about this process [here](https://neuralmagic.github.io/docs-v2/get-started/transfer).\n\n### Running the model\n\nThis model has not been fine-tuned for instruction-following but may be run with the transformers library. For accelerated inference with sparsity, deploy with [nm-vllm](https://github.com/neuralmagic/nm-vllm) or [deepsparse](https://github.com/neuralmagic/deepsparse).\n\n```python\n# pip install transformers accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ntokenizer = AutoTokenizer.from_pretrained(\"neuralmagic/Llama-2-7b-pruned70-retrained\")\nmodel = AutoModelForCausalLM.from_pretrained(\"neuralmagic/Llama-2-7b-pruned70-retrained\", device_map=\"auto\")\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n## Evaluation Benchmark Results\n\nModel evaluation metrics and results. [UPDATE]\n\n| Benchmark | Metric | Llama-2-7b | Llama-2-7b-pruned70-retrained |\n|------------------------------------------------|---------------|-------------|-------------------------------|\n| [MMLU](https://arxiv.org/abs/2009.03300) | 5-shot | 46.9% | 36.5% |\n| [HellaSwag](https://arxiv.org/abs/1905.07830) | 0-shot | 78.6% | 74.1% |\n| [WinoGrande](https://arxiv.org/abs/1907.10641) | 5-shot | 74.0% | 69.5% |\n| [ARC-c](https://arxiv.org/abs/1911.01547) | 25-shot | 53.1% | 45.4% |\n| [TruthfulQA](https://arxiv.org/abs/2109.07958) | 5-shot | 38.8% | 36.7% |\n| [GSM8K](https://arxiv.org/abs/2110.14168) | 5-shot | 14.5% | 8.0% |\n| [HumanEval](https://arxiv.org/abs/2107.03374) | pass@1 | 13.4% | 14.4% |\n\n## Model Training Details\n\n[UPDATE]\n\n## Help\n\nFor further support, and discussions on these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)\n\n",
"related_quantizations": []
},
"tags": [
"gguf",
"arxiv:2301.00774",
"arxiv:2405.03594",
"arxiv:2009.03300",
"arxiv:1905.07830",
"arxiv:1907.10641",
"arxiv:1911.01547",
"arxiv:2109.07958",
"arxiv:2110.14168",
"arxiv:2107.03374",
"endpoints_compatible",
"region:us"
],
"likes": 0,
"downloads": 686,
"gated": false,
"private": false,
"last_modified": "2024-11-17T09:48:01.000Z",
"created_at": "2024-11-17T08:37:34.000Z",
"pipeline_tag": "",
"library_name": ""
}
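The `tags` array in the normalized metadata mixes paper references with hub flags. A minimal sketch, assuming the `arxiv:<id>` prefix convention seen in the tags and the `https://arxiv.org/abs/<id>` link form used in the README, shows how the paper links could be recovered. The function name `arxiv_urls` is hypothetical.

```python
# Tags copied from the normalized metadata above.
TAGS = [
    "gguf",
    "arxiv:2301.00774",
    "arxiv:2405.03594",
    "arxiv:2009.03300",
    "arxiv:1905.07830",
    "arxiv:1907.10641",
    "arxiv:1911.01547",
    "arxiv:2109.07958",
    "arxiv:2110.14168",
    "arxiv:2107.03374",
    "endpoints_compatible",
    "region:us",
]


def arxiv_urls(tags):
    """Turn 'arxiv:<id>' tags into abstract-page URLs, skipping all other tags."""
    prefix = "arxiv:"
    return [
        f"https://arxiv.org/abs/{tag[len(prefix):]}"
        for tag in tags
        if tag.startswith(prefix)
    ]


for url in arxiv_urls(TAGS):
    print(url)
```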
Source payload excerpt (from Hugging Face API)
{
"_id": "6739ab4e8bf916a35ff098fe",
"id": "RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf",
"modelId": "RichardErkhov/neuralmagic_-_Llama-2-7b-pruned70-retrained-gguf",
"sha": "c6e86b4bc509b0946c5471a1abd86488f8927900",
"createdAt": "2024-11-17T08:37:34.000Z",
"lastModified": "2024-11-17T09:48:01.000Z",
"author": "RichardErkhov",
"downloads": 686,
"likes": 0,
"gated": false,
"private": false,
"pipeline_tag": "",
"library_name": "",
"siblings_count": 21
}
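The `createdAt` and `lastModified` fields in the payload are UTC timestamps with a trailing `Z`. The sketch below parses the two values shown above and computes the time between repository creation and the last modification. The helper name `parse_hf_timestamp` is hypothetical, and the format string is an assumption based on the two values in this excerpt.

```python
from datetime import datetime, timezone


def parse_hf_timestamp(value):
    """Parse a timestamp such as '2024-11-17T08:37:34.000Z' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


created = parse_hf_timestamp("2024-11-17T08:37:34.000Z")
modified = parse_hf_timestamp("2024-11-17T09:48:01.000Z")
elapsed = modified - created
print(elapsed)  # 1:10:27
```

The `strptime` route is used instead of `datetime.fromisoformat` because older Python versions do not accept the `Z` suffix in `fromisoformat`.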