GraySoft

maddes8cht/lightonai-alfred-40b-1023-gguf (Q3_K_M and other GGUF quantizations) is indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. Use this page to pick a local model file for guIDE or other runtimes. See related models in the same shard below.

Model Intelligence Sheet

maddes8cht/lightonai-alfred-40b-1023-gguf overview

K-quants in Falcon 7B models

New releases of llama.cpp support K-quantization for previously incompatible models, in particular all Falcon 7B models (Falcon 40B is and always has been fully compatible with K-quantization). This is achieved by a fallback for model layers that cannot be quantized with real K-quants. For Falcon 7B models, only a quarter of the layers can be quantized with true K-quants, but the approach still benefits from using different legacy quantization types (Q4_0, Q4_1, Q5_0, and Q5_1). As a result, it offers better quality at the same file size, or smaller files with comparable performance, than the legacy Q4_0, Q4_1, Q5_0 and Q5_1 quantizations.

About GGUF format

gguf is the current file format used by the ggml library. A growing list of software uses it and can therefore use this model. The core project making use of the ggml library is llama.cpp by Georgi Gerganov.

Quantization variants

A range of quantized files is available to cater to your specific needs. Here's how to choose the best option for you:

Legacy quants

Q4_0, Q4_1, Q5_0, Q5_1 and Q8_0 are legacy quantization types. They remain fully supported, because several circumstances can make a model incompatible with the modern K-quants.
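Because GGUF is a binary container format, a downloaded file can be sanity-checked before it is handed to a runtime. The sketch below assumes the header layout described in the ggml GGUF documentation (a 4-byte magic `GGUF` followed by a little-endian unsigned 32-bit version). The function name `gguf_version` is made up for this illustration and is not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"  # assumed: first four bytes of a GGUF file


def gguf_version(header):
    """Return the GGUF version from the first 8 header bytes, or None if not GGUF."""
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The version field is read as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

To check a file on disk, pass its first eight bytes, for example `gguf_version(open(path, "rb").read(8))`. A `None` result suggests a truncated download or an older ggml-era file rather than a GGUF file.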

gguf, falcon-40b, long-context, falcon, NTK-YaRN, en, fr, de, es, it, dataset:OpenAssistant/oasst1, dataset:ehartford/dolphin, dataset:tau/sled, dataset:tiiuae/falcon-refinedweb, arxiv:2306.15595, arxiv:2309.00071, arxiv:2307.03172, license:apache-2.0, endpoints_compatible, region:us
Downloads: 156
Likes: 0
Pipeline: (not set)
Library: (not set)
Visibility: Public
Access: Open

Repository Files & Downloads

14 files detected
Direct downloads for all repository files
File Type Quantization Size Link
lightonai-alfred-40b-1023-Q2_K.gguf GGUF Q2_K 16.20 GB Download
lightonai-alfred-40b-1023-Q3_K_L.gguf GGUF Q3_K_L 20.12 GB Download
lightonai-alfred-40b-1023-Q3_K_M.gguf GGUF Q3_K_M 18.68 GB Download
lightonai-alfred-40b-1023-Q3_K_S.gguf GGUF Q3_K_S 17.06 GB Download
lightonai-alfred-40b-1023-Q4_0.gguf GGUF Q4_0 22.17 GB Download
lightonai-alfred-40b-1023-Q4_1.gguf GGUF Q4_1 24.58 GB Download
lightonai-alfred-40b-1023-Q4_K_M.gguf GGUF Q4_K_M 23.70 GB Download
lightonai-alfred-40b-1023-Q4_K_S.gguf GGUF Q4_K_S 22.17 GB Download
lightonai-alfred-40b-1023-Q5_0.gguf GGUF Q5_0 26.98 GB Download
lightonai-alfred-40b-1023-Q5_1.gguf GGUF Q5_1 29.39 GB Download
lightonai-alfred-40b-1023-Q5_K_M.gguf GGUF Q5_K_M 28.54 GB Download
lightonai-alfred-40b-1023-Q5_K_S.gguf GGUF Q5_K_S 26.98 GB Download
lightonai-alfred-40b-1023-Q6_K.gguf GGUF Q6_K 32.09 GB Download
lightonai-alfred-40b-1023-Q8_0.gguf GGUF Q8_0 41.41 GB Download
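One way to narrow down the file list is to pick the largest quantization that fits a memory budget. The sketch below is only a rule of thumb: the sizes are copied from the file list above, while the `pick_quant` helper, the 2 GB default headroom, and the tie-break in favor of K-quants are assumptions made for this illustration. Real memory use also depends on context length and the runtime.

```python
# File sizes in GB, copied from the repository file list.
QUANT_SIZES_GB = {
    "Q2_K": 16.20, "Q3_K_S": 17.06, "Q3_K_M": 18.68, "Q3_K_L": 20.12,
    "Q4_0": 22.17, "Q4_K_S": 22.17, "Q4_K_M": 23.70, "Q4_1": 24.58,
    "Q5_0": 26.98, "Q5_K_S": 26.98, "Q5_K_M": 28.54, "Q5_1": 29.39,
    "Q6_K": 32.09, "Q8_0": 41.41,
}


def is_k_quant(name):
    """K-quant names contain '_K' (for example Q4_K_M); legacy names do not."""
    return "_K" in name


def pick_quant(memory_gb, headroom_gb=2.0):
    """Return the largest quant whose file fits in memory_gb minus headroom_gb.

    The headroom value is an assumed placeholder, not a measured requirement.
    Returns None when no file fits.
    """
    budget = memory_gb - headroom_gb
    candidates = [name for name, size in QUANT_SIZES_GB.items() if size <= budget]
    if not candidates:
        return None
    # Prefer the larger file; on equal size, prefer the K-quant variant.
    return max(candidates, key=lambda name: (QUANT_SIZES_GB[name], is_k_quant(name)))
```

For example, `pick_quant(24)` leaves a 22 GB budget and selects `Q3_K_L`, while `pick_quant(25)` finds `Q4_0` and `Q4_K_S` tied at 22.17 GB and returns the K-quant.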

Model Details Live

Model Slug: maddes8cht/lightonai-alfred-40b-1023-gguf
Author: maddes8cht
Pipeline Task: (not set)
Library: (not set)
Created: 2023-11-18
Last Modified: 2023-11-22
Gated: No
Private: No
HF SHA: ea0a7145df283f40c76eb5cfef2b449a1fb00c2b
License: apache-2.0
Language: en, fr, de, es, it
Base Model: Unknown

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "license": "apache-2.0",
    "thumbnail": "images/alfred-40b-1023.png",
    "datasets": [
      "OpenAssistant/oasst1",
      "ehartford/dolphin",
      "tau/sled",
      "tiiuae/falcon-refinedweb"
    ],
    "language": [
      "en",
      "fr",
      "de",
      "es",
      "it"
    ],
    "tags": [
      "falcon-40b",
      "long-context",
      "falcon",
      "NTK-YaRN"
    ],
    "frontmatter": {
      "license": "apache-2.0",
      "thumbnail": "images/alfred-40b-1023.png",
      "datasets": [
        "OpenAssistant/oasst1",
        "ehartford/dolphin",
        "tau/sled",
        "tiiuae/falcon-refinedweb"
      ],
      "language": [
        "en",
        "fr",
        "de",
        "es",
        "it"
      ],
      "tags": [
        "falcon-40b",
        "long-context",
        "falcon",
        "NTK-YaRN"
      ]
    },
    "hero_image_url": "https://maddes8cht.github.io/assets/buttons/Huggingface-banner.jpg",
    "summary": "# K-Quants in Falcon 7b models New releases of Llama.cpp now support K-quantization for previously incompatible models, in particular all Falcon 7B models (While Falcon 40b is and always has been fully compatible with K-Quantisation). This is achieved by employing a fallback solution for model layers that cannot be quantized with real K-quants. For Falcon 7B models, although only a quarter of the layers can be quantized with true K-quants, this approach still benefits from utilizing *different* legacy quantization types Q4_0, Q4_1, Q5_0, and Q5_1. As a result, it offers better quality at the same file size or smaller file sizes with comparable performance. So this solution ensures improved performance and efficiency over legacy Q4_0, Q4_1, Q5_0 and Q5_1 Quantizations. # About GGUF format gguf is the current file format used by the ggml library. A growing list of Software is using it and can therefore use this model. The core project making use of the ggml library is the llama.cpp project by Georgi Gerganov # Quantization variants There is a bunch of quantized files available to cater to your specific needs. Here's how to choose the best option for you: # Legacy quants Q4_0, Q4_1, Q5_0, Q5_1 and Q8 are legacy quantization types. Nevertheless, they are fully supported, as there are several circumstances that cause certain model not to be compatible with the modern K-quants.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: apache-2.0\nthumbnail: images/alfred-40b-1023.png\ndatasets:\n- OpenAssistant/oasst1\n- ehartford/dolphin\n- tau/sled\n- tiiuae/falcon-refinedweb\nlanguage:\n- en\n- fr\n- de\n- es\n- it\ntags:\n- falcon-40b\n- long-context\n- falcon\n- NTK-YaRN\n---\n[![banner](https://maddes8cht.github.io/assets/buttons/Huggingface-banner.jpg)]()\n\nI'm constantly enhancing these model descriptions to provide you with the most relevant and comprehensive information\n\n# alfred-40b-1023 - GGUF\n- Model creator: [lightonai](https://huggingface.co/lightonai)\n- Original model: [alfred-40b-1023](https://huggingface.co/lightonai/alfred-40b-1023)\n\n# K-Quants in Falcon 7b models\n\nNew releases of Llama.cpp now support K-quantization for previously incompatible models, in particular all Falcon 7B models (While Falcon 40b is and always has been fully compatible with K-Quantisation). This is achieved by employing a fallback solution for model layers that cannot be quantized with real K-quants.\n\nFor Falcon 7B models, although only a quarter of the layers can be quantized with true K-quants, this approach still benefits from utilizing *different* legacy quantization types Q4_0, Q4_1, Q5_0, and Q5_1. As a result, it offers better quality at the same file size or smaller file sizes with comparable performance.\n\nSo this solution ensures improved performance and efficiency over legacy Q4_0, Q4_1, Q5_0 and Q5_1 Quantizations.\n\n\n\n\n\n# About GGUF format\n\n`gguf` is the current file format used by the [`ggml`](https://github.com/ggerganov/ggml) library.\nA growing list of Software is using it and can therefore use this model.\nThe core project making use of the ggml library is the [llama.cpp](https://github.com/ggerganov/llama.cpp) project by Georgi Gerganov\n\n# Quantization variants\n\nThere is a bunch of quantized files available to cater to your specific needs. 
Here's how to choose the best option for you:\n\n# Legacy quants\n\nQ4_0, Q4_1, Q5_0, Q5_1 and Q8 are `legacy` quantization types.\nNevertheless, they are fully supported, as there are several circumstances that cause certain model not to be compatible with the modern K-quants.\n## Note:\nNow there's a new option to use K-quants even for previously 'incompatible' models, although this involves some fallback solution that makes them not *real* K-quants. More details can be found in affected model descriptions.\n(This mainly refers to Falcon 7b and Starcoder models)\n\n# K-quants\n\nK-quants are designed with the idea that different levels of quantization in specific parts of the model can optimize performance, file size, and memory load.\nSo, if possible, use K-quants.\nWith a Q6_K, you'll likely find it challenging to discern a quality difference from the original model - ask your model two times the same question and you may encounter bigger quality differences.\n\n\n\n\n---\n\n# Original Model Card:\n# Model Card for Alfred-40B-1023\n\n![a witty and elegant butler with a falcon on his shoulder, smile, flat illustration, simple shapes, colorful, lo-fi aesthetics](images/alfred-40b-1023.png)\n\n`Alfred-40B-1023` is a finetuned version of [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b), with an **extended context length of 8192 tokens**.\nFinetuning was performed in October 2023. 
`Alfred-40B-1023` is made available under the Apache 2.0 License.\n\n## Model Details\n\n### Model Description\n\n- **Developed by:** [LightOn](https://www.lighton.ai/) \n    * [Oskar Hallström](https://huggingface.co/ohallstrom) (project lead, training & modeling, internal long context data, evaluation)\n    * [Amélie Chatelain](https://huggingface.co/ameliechatelain) (internal data & long context data, data generation)\n    * [Clément Thiriet](https://huggingface.co/cthiriet) (data infrastructure, data generation, evaluation)\n    * [Julien Séailles](https://huggingface.co/Jseailleslighton) (data generation)\n    * [Adrien Cavaillès](https://huggingface.co/adcavail) (data generation)\n    * [Axel Marmet](https://huggingface.co/WeightsnWizardry)* (training 2K baseline)\n\n`*` work done while at LightOn\n- **Model type:** Causal decoder-only;\n- **Language(s) (NLP):** English, German, Spanish, French (and limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish);\n- **License:** Apache 2.0 license.\n- **Finetuned from model:** [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b)\n- **Training date:** October 2023 (`1023`).\n\n## Uses\n\n### Direct Use\n\n`Alfred-40B-1023` can be used as a chat model or as an instruct model. \n\nFor both instruct and chat mode, the model has been trained with chat tokens `<start_system>`, `<start_user>`, `<start_assistant>`, and `<end_message>`. These can be integrated into the prompt in the following way:\n```\n<start_system>You are Alfred, a helpful assistant trained by LightOn. Knowledge cutoff: November 2022. Current date: 16 November, 2023<end_message><start_user>{user query}<end_message><start_assistant>\n```\n\nThe stop word `<end_message>` should be used.\n\n### Out-of-Scope Use\n\nProduction use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful. \n\n## Bias, Risks, and Limitations\n\n`Alfred-40B-1023` is a finetune of Falcon-40B. 
As such, it is trained mostly on English, German, Spanish, French, with limited capabilities also in Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish. It will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.\n\n### Recommendations\n\nWe recommend users of `Alfred-40B-1023` to implement appropriate guardrails and precautions in any production use.\n\n## How to Get Started with the Model\n\nUse the code below to get started with the model.\n\n```\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nimport transformers\nimport torch\n\nmodel = \"lightonai/alfred-40b-1023\"\ntokenizer = AutoTokenizer.from_pretrained(\"lightonai/alfred-0923-tokenizer\")\n\npipeline = transformers.pipeline(\n    \"text-generation\",\n    model=model,\n    tokenizer=tokenizer,\n    torch_dtype=torch.bfloat16,\n    trust_remote_code=True,\n    device_map=\"auto\",\n)\n\nsequences = pipeline(\n   \"<start_system>You are Alfred, a helpful assistant trained by LightOn. Knowledge cutoff: November 2022. Current date: 16 November, 2023<end_message><start_user>Write me an email to my boss, explaining how the company could benefit by using LightOns platform for Large Language Models, Paradigm.<end_message><start_assistant>\",\n    max_length=1000,\n    do_sample=True,\n    top_k=3,\n    num_return_sequences=1,\n    eos_token_id=tokenizer.eos_token_id,\n)\nfor seq in sequences:\n    print(f\"Result: {seq['generated_text']}\")\n```\n\n## Training Details\n\n### Training Data\n\nAlfred-40B-1023 was trained on a mixture of publicly available and in-house curated datasets. 
The training data is composed of 50 % short context tasks, 45 % long context tasks and 5 % RefinedWeb.\n\n| **Short context sources** |\n|--------------------|\n| [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) | \n| [dolphin](https://huggingface.co/ehartford/dolphin) |\n| [openai-critiques](https://openaipublic.blob.core.windows.net/critiques/README.md) | \n| internal |\n`internal` is a collection of synthetic and human-generated datasets created by LightOn, tailored towards the use cases of our clients.\n\n| **Long context sources** |\n|--------------------|\n| [sled](https://huggingface.co/datasets/tau/sled) | \n| internal-long-context |\n\n`internal-long-context` is a collection of synthetic datasets generated by LightOn, tailored towards the use cases of our clients.\n\nDuring training, we apply regular language modeling loss for a partition of the prompts in the long context data.\n\n| **Pretraining objective source** |\n|--------------------|\n| [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | \n\n### Training Procedure \n\n`Alfred-40B-1023` was trained on 128 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=2, DP=8) combined with ZeRO. Alfred has been trained through supervised finetuning on 100 megatokens, with a learning rate decayed with a cosine schedule. \n\n#### Preprocessing\n\nAll datasets have been filtered, up or downsampled, and adapted to our chat token format.\n\n#### Context length extension\n\nWe extend the context length to 8K with a custom method that we name NTK-YaRN. As guessable from its name, our extension method draws inspiration from NTK-aware interpolation and YaRN.\n\nDuring our context length extension efforts, we experimented with various methods suitable for RoPE embeddings. 
These include vanilla [positional interpolation](https://arxiv.org/abs/2306.15595), [NTK-aware interpolation](https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/), [NTK-by-parts](https://github.com/jquesnelle/scaled-rope/pull/1), and lastly [YaRN](https://arxiv.org/abs/2309.00071).\n\nYaRN looked very promising when applied at test-time, however finetuning with YaRN was not successful in our experiments. When extending the context length at training-time, NTK-aware interpolation was the most successful out of the already existing methods. Some of our results from trying different long context extension methods are shared in the Evaluation section below. We acknowledge that the same parameter values as proposed in the YaRN-paper have been used in our YaRN experiments, and that these potentially could have other optimal values for our particular setup.\n\n##### NTK-YaRN\n\nSimilarly to NTK-aware interpolation (`NTK`), NTK-YaRN involves increasing the base of the RoPE embeddings. In the original implementation of NTK-aware interpolation the new base `b'` is adapted according to the following formula:\n\n$$ b' = b \\times s^{\\frac{|D|}{|D|-2}} $$\n\nwhere `b` is the original base, `s` the scaling factor of the context length, and `|D|` the model's head dimension.\n\nHowever, we find (similar to other actors) that increasing the base slightly more is even better. The value of `b'` could probably be optimized even further, but for these experiments we have settled with the following value: \n\n$$ b' = b \\times (s+1)^{\\frac{|D|}{|D|-2}} $$\n\nIn the following parts of this model card, context length extension with this extended scaling of the base is referred to as `NTK-Margin`. For `NTK-YaRN`, the extended scaling of the base is combined with the modification of the computation of the attention weights made in YaRN, where the query and key matrices are scaled by the factor `m`. 
\n\n$$ m = 1 + 0.1 \\times \\log(s) $$\n\nScaling the query and key matrices this way substantially reduces the initial grad norm when applying a context length extension method in our training runs.\n\nTo cite NTK-YaRN, please refer to the model bibtex in the bottom of this model card.\n\n## Evaluation\n\n### Context length extension strategies\n#### Training losses\n\nAfter experimenting on a 7B scale, we finally run a selected partition of the extension methods on a 40B scale. In the figure below, we display the resulting training losses when training a 40B model with the different extension methods, ceteris paribus.\n\n![Training loss curves for extension methods](images/training-loss-curves.png \"Training loss curves for extension methods\")\n\nInitially, YaRN has the lowest training loss, which can be seen as a reflection of the fact that YaRN was the most successful extension method at test time. However all the other methods surpass YaRN in terms of training loss already after a handful of megatokens. Comparing NTK-Margin vs NTK-YaRN, we can note that the scaling of Q and K matrices makes the training loss lower in the beginning, however NTK-YaRN's advantage over NTK-Margin decreases as the training goes on. 
Comparing NTK-Margin with NTK in turn, it seems like the larger value of the base in NTK-Margin gives an initial boost in training loss, however this advantage decreases as training goes on.\n\n#### Performance on Long Context Benchmarks\nWe evaluate the context length extension methods on an own benchmark, consisting of four tasks.\n\n* [Key-value retrieval UUID](https://arxiv.org/pdf/2307.03172.pdf)\n* [Coarse-grained Topic Retrieval](https://lmsys.org/blog/2023-06-29-longchat/)\n* [Fine-grained Line Retrieval](https://lmsys.org/blog/2023-06-29-longchat/)\n* [Multi document retrieval data](https://nlp.stanford.edu/data/nfliu/lost-in-the-middle/nq-open-contriever-msmarco-retrieved-documents.jsonl.gz)\n\nFor each task, we have created 3 subtasks - one for each of the three context lengths 2K, 4K and 8K. In total, we thus have 12 subtasks. \n\nIn order to get an aggregated score that values each subtask equally, we normalize the scores for each subtask and then calculate the mean of the normalized scores for each extension method.\n\n![Aggregated scores on long context benchmarks](images/lc_benchmarks.png \"Aggregated scores on long context benchmarks\")\n\nOn these benchmarks, YaRN clearly lags behind. NTK-YaRN is the winning method, however NTK-Margin is so close that more extensive research is needed to verify that NTK-YaRN really is superior to NTK-Margin, especially when trained for longer.\n\n### Comparison to 2K baseline\n\nIn order to track any potential degradation on 2K context tasks due to the context length extension, we compare our 8K model against a 2K model trained in a similar setup for 100 megatokens. 
When training the 2K baseline, we don't include any long context data.\n\nWe conduct the comparison by evaluating the models on a selection of tasks from EleutherAI harness, as well as ranking model outputs internally.\n\n![Evaluation of 2K vs 8K version of alfred-40b-2023](images/2k_vs_8k.png \"Evaluation of 2K vs 8K version of alfred-40b-2023\")\n\nNotably, our 8K model not only performs on par with our 2K model on most of our EleutherAI harness tasks, in fact it outperforms the 2K model on a majority of the tasks. Reading comprehension is the only subcategory for which our 8K model is outperformed by the 2K model.\n\nWe recognize that there is a discrepancy between performance on classical NLP benchmarks and how humans perceive the model quality. When model outputs (limited to 2K context lengths) are ranked by LightOn employees internally, the 2K and 8K have strikingly similar performance. However, a few rare failure modes have been noted for the 8K version, which are not seen when using the 2K model. These failure modes are likely to be fixable with better composition of the long context data.\n\n\n## Compute Infrastructure\n\n### Hardware\n\nAlfred-40B-1023 was trained on AWS SageMaker, on 128 A100 40GB GPUs in P4d instances.\n\n### Software\n\nAlfred-40B-1023 was trained with a custom codebase. 
Training leverages a 3D parallelism approach combined with ZeRO, as well as high-performance kernels such as FlashAttention.\n\n## Model Card Contact\n\nPlease open a Community Discussion for any support request related to using Alfred with HuggingFace transformers.\n\nFor any other inquiry: contact@lighton.ai\n\n## Citation\n\nIf you find the model useful in your work, please use the following bibtex when citing.\n```\n@article{alfred-40b-1023,\n  title={Alfred-40B-1023},\n  author={Hallström, Oskar and Chatelain, Amélie and Thiriet, Clément and Séailles, Julien and Cavaillès, Adrien and Marmet, Axel},\n  year={2023}\n}\n```\n\n***End of original Model File***\n---\n\n\n## Please consider to support my work\n**Coming Soon:** I'm in the process of launching a sponsorship/crowdfunding campaign for my work. I'm evaluating Kickstarter, Patreon, or the new GitHub Sponsors platform, and I am hoping for some support and contribution to the continued availability of these kind of models. Your support will enable me to provide even more valuable resources and maintain the models you rely on. Your patience and ongoing support are greatly appreciated as I work to make this page an even more valuable resource for the community.\n\n<center>\n\n[![GitHub](https://maddes8cht.github.io/assets/buttons/github-io-button.png)](https://maddes8cht.github.io)\n[![Stack Exchange](https://stackexchange.com/users/flair/26485911.png)](https://stackexchange.com/users/26485911)\n[![GitHub](https://maddes8cht.github.io/assets/buttons/github-button.png)](https://github.com/maddes8cht)\n[![HuggingFace](https://maddes8cht.github.io/assets/buttons/huggingface-button.png)](https://huggingface.co/maddes8cht)\n[![Twitter](https://maddes8cht.github.io/assets/buttons/twitter-button.png)](https://twitter.com/maddes1966)\n\n</center>",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "falcon-40b",
    "long-context",
    "falcon",
    "NTK-YaRN",
    "en",
    "fr",
    "de",
    "es",
    "it",
    "dataset:OpenAssistant/oasst1",
    "dataset:ehartford/dolphin",
    "dataset:tau/sled",
    "dataset:tiiuae/falcon-refinedweb",
    "arxiv:2306.15595",
    "arxiv:2309.00071",
    "arxiv:2307.03172",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us"
  ],
  "likes": 0,
  "downloads": 156,
  "gated": false,
  "private": false,
  "last_modified": "2023-11-22T13:12:13.000Z",
  "created_at": "2023-11-18T15:32:24.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
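The model card embedded in `readme_markdown` above describes a single-turn chat format built from the tokens `<start_system>`, `<start_user>`, `<start_assistant>`, and `<end_message>`, with `<end_message>` as the stop word. A minimal sketch of assembling such a prompt as a plain string follows. The helper names `build_prompt` and `cut_at_stop_word` are hypothetical, and whether a given GGUF runtime needs the tokens passed this way is an assumption to verify against that runtime.

```python
STOP_WORD = "<end_message>"  # stop word named in the model card


def build_prompt(system_message, user_query):
    """Assemble the single-turn prompt layout shown in the model card."""
    return (
        f"<start_system>{system_message}{STOP_WORD}"
        f"<start_user>{user_query}{STOP_WORD}"
        "<start_assistant>"
    )


def cut_at_stop_word(generated):
    """Keep only the text generated before the first stop word, if any."""
    return generated.split(STOP_WORD, 1)[0]
```

The system message is left as a required argument so that the knowledge cutoff and current date from the card's example can be supplied by the caller rather than hard-coded.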
Source payload excerpt (from Hugging Face API)
{
  "_id": "6558d908ec17c88302970f94",
  "id": "maddes8cht/lightonai-alfred-40b-1023-gguf",
  "modelId": "maddes8cht/lightonai-alfred-40b-1023-gguf",
  "sha": "ea0a7145df283f40c76eb5cfef2b449a1fb00c2b",
  "createdAt": "2023-11-18T15:32:24.000Z",
  "lastModified": "2023-11-22T13:12:13.000Z",
  "author": "maddes8cht",
  "downloads": 156,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 20
}
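The context length extension formulas quoted in the model card (in `readme_markdown` above) can be evaluated numerically. The sketch below implements the three expressions as written there: the NTK-aware base, the NTK-Margin base with `s + 1`, and the YaRN query/key scaling factor `m`. Reading `log` as the natural logarithm is an assumption, as are the example values in the usage note (RoPE base 10000, head dimension 64, scaling factor 4 for 2K to 8K), which are not stated in the card.

```python
import math


def ntk_base(base, scale, head_dim):
    """NTK-aware interpolation: b' = b * s^(|D| / (|D| - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


def ntk_margin_base(base, scale, head_dim):
    """NTK-Margin variant from the card: b' = b * (s + 1)^(|D| / (|D| - 2))."""
    return base * (scale + 1) ** (head_dim / (head_dim - 2))


def yarn_qk_factor(scale):
    """YaRN-style factor m = 1 + 0.1 * log(s); natural log is assumed here."""
    return 1 + 0.1 * math.log(scale)
```

Under the assumed values, `ntk_margin_base(10000, 4, 64)` is larger than `ntk_base(10000, 4, 64)`, which matches the card's remark that the base is increased slightly more, and `yarn_qk_factor(4)` gives the factor applied to the query and key matrices for NTK-YaRN.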