Model Intelligence Sheet

richarderkhov/slicexai_-_llama3.1-elm-turbo-3b-instruct-gguf overview

ELM (which stands for Efficient Language Models) Turbo is the next-generation model in the series of language models from SliceX AI, designed to achieve best-in-class performance in terms of quality, throughput and memory. ELM is designed to be a modular and customizable family of neural networks that are highly efficient and performant. Today we are sharing the second version in this series: ELM Turbo models (named Starfruit).

Model: ELM Turbo introduces a more adaptable, decomposable LLM architecture, which gives flexibility in (de)composing LLMs into smaller stand-alone slices. Compared with our previous version, the new architecture allows more powerful model slices to be learnt during training (yielding better quality and higher generative capacity) and gives a higher level of control over LLM efficiency: fine-grained slices produce varying model sizes, depending on the user/task needs and deployment criteria, i.e., Cloud or Edge device constraints.

Training: ELM Turbo introduces algorithmic optimizations that allow us to train a single model which, once trained, can be sliced in many ways to fit different user/task needs. We formulate the entire training procedure for ELM Turbo as a continual learning process during which we apply "slicing" operations and corresponding optimizations during the pre-training and/or fine-tuning stage. In a nutshell, this procedure teaches the model to learn and compress its knowledge into smaller slices.

Fast Inference with Customization: As with our previous version, once trained, the ELM Turbo model architecture permits flexible inference strategies at runtime, depending on deployment and device constraints, so that users can make optimal compute/memory tradeoff choices for their application needs. In addition to the fast speeds achieved by native ELM Turbo slice optimization, we also layered in NVIDIA's TensorRT-LLM integration to get further speedups. The end result: optimized ELM Turbo models that achieve some of the world's best LLM performance.

Tags: gguf, endpoints_compatible, region:us, conversational

richarderkhov/slicexai_-_llama3.1-elm-turbo-3b-instruct-gguf visual

Downloads

90

Likes

0

Pipeline

—

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

22 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Llama3.1-elm-turbo-3B-instruct.IQ3_M.gguf	GGUF	IQ3_M	1.51 GB	Download
Llama3.1-elm-turbo-3B-instruct.IQ3_S.gguf	GGUF	IQ3_S	1.46 GB	Download
Llama3.1-elm-turbo-3B-instruct.IQ3_XS.gguf	GGUF	IQ3_XS	1.41 GB	Download
Llama3.1-elm-turbo-3B-instruct.IQ4_NL.gguf	GGUF	IQ4_NL	1.78 GB	Download
Llama3.1-elm-turbo-3B-instruct.IQ4_XS.gguf	GGUF	IQ4_XS	1.71 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q2_K.gguf	GGUF	Q2_K	1.29 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q3_K.gguf	GGUF	Q3_K	1.55 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q3_K_L.gguf	GGUF	Q3_K_L	1.65 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q3_K_M.gguf	GGUF	Q3_K_M	1.55 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q3_K_S.gguf	GGUF	Q3_K_S	1.45 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q4_0.gguf	GGUF	Q4_0	1.77 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q4_1.gguf	GGUF	Q4_1	1.92 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q4_K.gguf	GGUF	Q4_K	1.82 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q4_K_M.gguf	GGUF	Q4_K_M	1.82 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q4_K_S.gguf	GGUF	Q4_K_S	1.77 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q5_0.gguf	GGUF	Q5_0	2.07 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q5_1.gguf	GGUF	Q5_1	2.22 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q5_K.gguf	GGUF	Q5_K	2.10 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q5_K_M.gguf	GGUF	Q5_K_M	2.10 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q5_K_S.gguf	GGUF	Q5_K_S	2.07 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q6_K.gguf	GGUF	Q6_K	2.39 GB	Download
Llama3.1-elm-turbo-3B-instruct.Q8_0.gguf	GGUF	Q8_0	3.09 GB	Download
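
As a rough illustration of how the table above might be used, the sketch below picks the largest listed file that fits a size budget and builds a direct-download URL for it. The sizes are copied from a subset of the table. The helper names (`pick_quant`, `download_url`) and the 2.0 GB budget are hypothetical. The `resolve/main` URL pattern is the usual Hugging Face convention for direct file downloads and is assumed here rather than taken from this page. File size is only a proxy for memory use, since a runtime also needs room for context and buffers.

```python
from typing import Optional

REPO_ID = "RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf"
FILE_PREFIX = "Llama3.1-elm-turbo-3B-instruct"

# File sizes in GB, copied from the table above (subset, for brevity).
QUANT_SIZES_GB = {
    "Q2_K": 1.29,
    "Q3_K_M": 1.55,
    "Q4_K_M": 1.82,
    "Q5_K_M": 2.10,
    "Q6_K": 2.39,
    "Q8_0": 3.09,
}


def pick_quant(budget_gb: float) -> Optional[str]:
    """Return the largest listed quantization whose file fits the budget.

    Returns None when even the smallest listed file is too large.
    """
    fitting = {q: size for q, size in QUANT_SIZES_GB.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def download_url(quant: str) -> str:
    """Build a direct-download URL (assumed `resolve/main` convention)."""
    return f"https://huggingface.co/{REPO_ID}/resolve/main/{FILE_PREFIX}.{quant}.gguf"


# Hypothetical budget: keep the model file under 2.0 GB.
choice = pick_quant(2.0)
if choice is not None:
    print(choice, download_url(choice))
```

Larger files generally preserve more of the original weights' precision, so choosing the largest file that fits is a common heuristic rather than a guarantee of better output.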

Model Details

Model Slug

richarderkhov/slicexai_-_llama3.1-elm-turbo-3b-instruct-gguf

Author

RichardErkhov

Pipeline Task

—

Library

—

Created

2024-08-03

Last Modified

2024-08-04

Gated

No

Private

No

HF SHA

5ba0545778f59d4e929e04666e74e1d3e72047fc

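License

Unknown (the upstream model card declares llama3.1)

Language

Unknown (the upstream model card declares en)

Base Model

Unknown (the README names slicexai/Llama3.1-elm-turbo-3B-instruct as the original model)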

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "frontmatter": {},
    "hero_image_url": "https://raw.githubusercontent.com/slicex-ai/elm-turbo/main/elm-turbo-training.png",
    "summary": "**ELM** (which stands for **E**fficient **L**anguage **M**odels) **Turbo** is the next generation model in the series of cutting-edge language models from SliceX AI that is designed to achieve the best in class performance in terms of _quality_, _throughput_ & _memory_.    ELM is designed to be a modular and customizable family of neural networks that are highly efficient and performant. Today we are sharing the second version in this series: **ELM Turbo** models (named _Starfruit_). _Model:_ ELM Turbo introduces a more _adaptable_, _decomposable LLM architecture_ thereby yielding flexibility in (de)-composing LLM models into smaller stand-alone slices. In comparison to our previous version, the new architecture allows for more powerful model slices to be learnt during the training process (yielding better quality & higher generative capacity) and a higher level of control wrt LLM efficiency - fine-grained slices to produce varying LLM model sizes (depending on the user/task needs and deployment criteria, i.e., Cloud or Edge device constraints). _Training:_ ELM Turbo introduces algorithmic optimizations that allows us to train a single model but once trained the ELM Turbo model can be sliced in many ways to fit different user/task needs. We formulate the entire training procedure for ELM Turbo as a _continual learning process_ during which we apply **\"slicing\"** operations & corresponding optimizations during the pre-training and/or fine-tuning stage. In a nutshell, this procedure _teaches the model to learn & compress its knowledge into smaller slices_. _Fast Inference with Customization:_ As with our previous version, once trained, ELM Turbo model architecture permits flexible inference strategies at runtime depending on deployment & device constraints to allow users to make optimal compute/memory tradeoff choices for their application needs. 
In addition to the blazing fast speeds achieved by native ELM Turbo slice optimization, we also layered in NVIDIA's TensorRT-LLM integration to get further speedups. The end result 👉 optimized ELM Turbo models that achieve one of the world's best LLM performance.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "Quantization made by Richard Erkhov.\n\n[Github](https://github.com/RichardErkhov)\n\n[Discord](https://discord.gg/pvy7H8DZMG)\n\n[Request more models](https://github.com/RichardErkhov/quant_request)\n\n\nLlama3.1-elm-turbo-3B-instruct - GGUF\n- Model creator: https://huggingface.co/slicexai/\n- Original model: https://huggingface.co/slicexai/Llama3.1-elm-turbo-3B-instruct/\n\n\n| Name | Quant method | Size |\n| ---- | ---- | ---- |\n| [Llama3.1-elm-turbo-3B-instruct.Q2_K.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q2_K.gguf) | Q2_K | 1.29GB |\n| [Llama3.1-elm-turbo-3B-instruct.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.IQ3_XS.gguf) | IQ3_XS | 1.41GB |\n| [Llama3.1-elm-turbo-3B-instruct.IQ3_S.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.IQ3_S.gguf) | IQ3_S | 1.46GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q3_K_S.gguf) | Q3_K_S | 1.45GB |\n| [Llama3.1-elm-turbo-3B-instruct.IQ3_M.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.IQ3_M.gguf) | IQ3_M | 1.51GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q3_K.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q3_K.gguf) | Q3_K | 1.55GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q3_K_M.gguf) | Q3_K_M | 1.55GB |\n| 
[Llama3.1-elm-turbo-3B-instruct.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q3_K_L.gguf) | Q3_K_L | 1.65GB |\n| [Llama3.1-elm-turbo-3B-instruct.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.IQ4_XS.gguf) | IQ4_XS | 1.71GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q4_0.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q4_0.gguf) | Q4_0 | 1.77GB |\n| [Llama3.1-elm-turbo-3B-instruct.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.IQ4_NL.gguf) | IQ4_NL | 1.78GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q4_K_S.gguf) | Q4_K_S | 1.77GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q4_K.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q4_K.gguf) | Q4_K | 1.82GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q4_K_M.gguf) | Q4_K_M | 1.82GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q4_1.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q4_1.gguf) | Q4_1 | 1.92GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q5_0.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q5_0.gguf) | Q5_0 | 2.07GB |\n| 
[Llama3.1-elm-turbo-3B-instruct.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q5_K_S.gguf) | Q5_K_S | 2.07GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q5_K.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q5_K.gguf) | Q5_K | 2.1GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q5_K_M.gguf) | Q5_K_M | 2.1GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q5_1.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q5_1.gguf) | Q5_1 | 2.22GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q6_K.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q6_K.gguf) | Q6_K | 2.39GB |\n| [Llama3.1-elm-turbo-3B-instruct.Q8_0.gguf](https://huggingface.co/RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf/blob/main/Llama3.1-elm-turbo-3B-instruct.Q8_0.gguf) | Q8_0 | 3.09GB |\n\n\n\n\nOriginal model description:\n---\nlicense: llama3.1\nlanguage:\n- en\n---\n# SliceX AI™ ELM Turbo\n**ELM** (which stands for **E**fficient **L**anguage **M**odels) **Turbo** is the next generation model in the series of cutting-edge language models from [SliceX AI](https://slicex.ai) that is designed to achieve the best in class performance in terms of _quality_, _throughput_ & _memory_.\n\n<div align=\"center\">\n  <img src=\"https://raw.githubusercontent.com/slicex-ai/elm-turbo/main/elm-turbo-training.png\" width=\"768\"/>\n</div>\n\nELM is designed to be a modular and customizable family of neural networks that are highly efficient and performant. Today we are sharing the second version in this series: **ELM Turbo** models (named _Starfruit_). 
\n\n_Model:_ ELM Turbo introduces a more _adaptable_, _decomposable LLM architecture_ thereby yielding flexibility in (de)-composing LLM models into smaller stand-alone slices. In comparison to our previous version, the new architecture allows for more powerful model slices to be learnt during the training process (yielding better quality & higher generative capacity) and a higher level of control wrt LLM efficiency - fine-grained slices to produce varying LLM model sizes (depending on the user/task needs and deployment criteria, i.e., Cloud or Edge device constraints).\n\n_Training:_ ELM Turbo introduces algorithmic optimizations that allows us to train a single model but once trained the ELM Turbo model can be sliced in many ways to fit different user/task needs. We formulate the entire training procedure for ELM Turbo as a _continual learning process_ during which we apply **\"slicing\"** operations & corresponding optimizations during the pre-training and/or fine-tuning stage. In a nutshell, this procedure _teaches the model to learn & compress its knowledge into smaller slices_.\n\n_Fast Inference with Customization:_ As with our previous version, once trained, ELM Turbo model architecture permits flexible inference strategies at runtime depending on deployment & device constraints to allow users to make optimal compute/memory tradeoff choices for their application needs. In addition to the blazing fast speeds achieved by native ELM Turbo slice optimization, we also layered in NVIDIA's TensorRT-LLM integration to get further speedups. 
The end result 👉 optimized ELM Turbo models that achieve one of the world's best LLM performance.\n\n- **Blog:** [Medium](https://medium.com/sujith-ravi/introducing-elm-turbo-next-generation-efficient-decomposable-llms-a2347bd08676)\n\n- **Github:** https://github.com/slicex-ai/elm-turbo\n\n- **HuggingFace** (access ELM Turbo Models in HF): 👉 [here](https://huggingface.co/collections/slicexai/llama31-elm-turbo-66a81aa5f6bcb0b775ba5dd7)\n\n## ELM Turbo Model Release (version for sliced Llama 3.1)\nIn this version, we employed our new, improved decomposable ELM techniques on a widely used open-source LLM, `meta-llama/Meta-Llama-3.1-8B-Instruct` (8B params) (check [Llama-license](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE) for usage). After training, we generated three smaller slices with parameter counts ranging from 3B billion to 6B billion. \n\n- [Section 1.](https://huggingface.co/slicexai/Llama3.1-elm-turbo-4B-instruct#1-run-elm-turbo-models-with-huggingface-transformers-library) 👉 instructions to run ELM-Turbo with the Huggingface Transformers library.\n\n**NOTE**: The open-source datasets from the HuggingFace hub used for instruction fine-tuning ELM Turbo include, but are not limited to: `allenai/tulu-v2-sft-mixture`, `microsoft/orca-math-word-problems-200k`, `mlabonne/WizardLM_evol_instruct_70k-ShareGPT`, and `mlabonne/WizardLM_evol_instruct_v2_196K-ShareGPT`. We advise users to exercise caution when utilizing ELM Turbo, as these datasets may contain factually incorrect information, unintended biases, inappropriate content, and other potential issues. It is recommended to thoroughly evaluate the model's outputs and implement appropriate safeguards for your specific use case.\n\n## 1. Run ELM Turbo models with Huggingface Transformers library.\nThere are three ELM Turbo slices derived from the `Meta-Llama-3.1-8B-Instruct` model: \n  1. **`slicexai/Llama3.1-elm-turbo-3B-instruct` (3B params)**\n  2. 
`slicexai/Llama3.1-elm-turbo-4B-instruct`(4B params)\n  3. `slicexai/Llama3.1-elm-turbo-6B-instruct` (6B params) \n\nMake sure to update your transformers installation via pip install --upgrade transformers.\n\nExample - To run the `slicexai/Llama3.1-elm-turbo-3B-instruct`\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\nimport torch\n\nelm_turbo_model = \"slicexai/Llama3.1-elm-turbo-3B-instruct\"\nmodel = AutoModelForCausalLM.from_pretrained( \n    elm_turbo_model,  \n    device_map=\"cuda\",  \n    torch_dtype=torch.bfloat16,  \n    trust_remote_code=True,\n)\nmessages = [ \n    {\"role\": \"user\", \"content\": \"Can you provide ways to eat combinations of bananas and dragonfruits?\"}, \n]\n\ntokenizer = AutoTokenizer.from_pretrained(elm_turbo_model, legacy=False) \npipe = pipeline( \n    \"text-generation\", \n    model=model, \n    tokenizer=tokenizer, \n) \n\ngeneration_args = { \n    \"max_new_tokens\": 500, \n    \"return_full_text\": False,\n    \"repetition_penalty\": 1.2,\n    \"temperature\": 0.0, \n    \"do_sample\": False, \n} \n\noutput = pipe(messages, **generation_args) \nprint(output[0]['generated_text']) \n```\n\n",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 90,
  "gated": false,
  "private": false,
  "last_modified": "2024-08-04T02:19:41.000Z",
  "created_at": "2024-08-03T22:42:41.000Z",
  "pipeline_tag": "",
  "library_name": ""
}

Source payload excerpt (from Hugging Face API)

{
  "_id": "66aeb26187f605ac4d3d4552",
  "id": "RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf",
  "modelId": "RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf",
  "sha": "5ba0545778f59d4e929e04666e74e1d3e72047fc",
  "createdAt": "2024-08-03T22:42:41.000Z",
  "lastModified": "2024-08-04T02:19:41.000Z",
  "author": "RichardErkhov",
  "downloads": 90,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 24
}
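
The payload fields above can be read with the Python standard library alone. The following is a minimal sketch: the JSON literal is copied from the excerpt (trimmed to the fields used), while the helper name `parse_hf_timestamp` and the constant `GGUF_FILES_LISTED` are hypothetical. It derives the time between creation and last modification, and the number of repository entries beyond the 22 GGUF files listed earlier. The page does not say what those extra entries are.

```python
import json
from datetime import datetime, timezone

# Copied from the payload excerpt above, trimmed to the fields used here.
PAYLOAD = """
{
  "id": "RichardErkhov/slicexai_-_Llama3.1-elm-turbo-3B-instruct-gguf",
  "createdAt": "2024-08-03T22:42:41.000Z",
  "lastModified": "2024-08-04T02:19:41.000Z",
  "downloads": 90,
  "likes": 0,
  "siblings_count": 24
}
"""

# Number of GGUF files in the download table of this sheet.
GGUF_FILES_LISTED = 22


def parse_hf_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2024-08-03T22:42:41.000Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


info = json.loads(PAYLOAD)
created = parse_hf_timestamp(info["createdAt"])
modified = parse_hf_timestamp(info["lastModified"])

# Time between repository creation and the last recorded change.
upload_window = modified - created

# Repository entries that are not among the listed GGUF files.
other_files = info["siblings_count"] - GGUF_FILES_LISTED

print(f"{info['id']}: modified {upload_window} after creation, "
      f"{other_files} non-listed file(s)")
```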