Model Intelligence Sheet

quantfactory/llama-2-7b-32k-gguf overview

This is a quantized version of togethercomputer/LLaMA-2-7B-32K, created using llama.cpp.

transformers, gguf, en, dataset:togethercomputer/RedPajama-Data-1T, dataset:togethercomputer/RedPajama-Data-Instruct, dataset:EleutherAI/pile, dataset:togethercomputer/Long-Data-Collections, license:llama2, endpoints_compatible, region:us

Downloads

307

Likes

1

Pipeline

—

Library

transformers

Visibility

Public

Access

Open

Repository Files & Downloads

14 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
LLaMA-2-7B-32K.Q2_K.gguf	GGUF	Q2_K	2.36 GB	Download
LLaMA-2-7B-32K.Q3_K_L.gguf	GGUF	Q3_K_L	3.35 GB	Download
LLaMA-2-7B-32K.Q3_K_M.gguf	GGUF	Q3_K_M	3.07 GB	Download
LLaMA-2-7B-32K.Q3_K_S.gguf	GGUF	Q3_K_S	2.75 GB	Download
LLaMA-2-7B-32K.Q4_0.gguf	GGUF	Q4_0	3.56 GB	Download
LLaMA-2-7B-32K.Q4_1.gguf	GGUF	Q4_1	3.95 GB	Download
LLaMA-2-7B-32K.Q4_K_M.gguf	GGUF	Q4_K_M	3.80 GB	Download
LLaMA-2-7B-32K.Q4_K_S.gguf	GGUF	Q4_K_S	3.59 GB	Download
LLaMA-2-7B-32K.Q5_0.gguf	GGUF	Q5_0	4.33 GB	Download
LLaMA-2-7B-32K.Q5_1.gguf	GGUF	Q5_1	4.72 GB	Download
LLaMA-2-7B-32K.Q5_K_M.gguf	GGUF	Q5_K_M	4.45 GB	Download
LLaMA-2-7B-32K.Q5_K_S.gguf	GGUF	Q5_K_S	4.33 GB	Download
LLaMA-2-7B-32K.Q6_K.gguf	GGUF	Q6_K	5.15 GB	Download
LLaMA-2-7B-32K.Q8_0.gguf	GGUF	Q8_0	6.67 GB	Download
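
The table above can be used to pick a file for a given size budget. The following is a minimal sketch, not part of the original listing: the helper name `pick_largest_fitting` is hypothetical, and it assumes that a larger file is a rough proxy for higher fidelity, which holds only approximately across quantization schemes. The budget covers the model file alone; running with a long context needs additional memory on top of it.

```python
# File sizes in GB, copied from the repository file table.
GGUF_SIZES_GB = {
    "LLaMA-2-7B-32K.Q2_K.gguf": 2.36,
    "LLaMA-2-7B-32K.Q3_K_L.gguf": 3.35,
    "LLaMA-2-7B-32K.Q3_K_M.gguf": 3.07,
    "LLaMA-2-7B-32K.Q3_K_S.gguf": 2.75,
    "LLaMA-2-7B-32K.Q4_0.gguf": 3.56,
    "LLaMA-2-7B-32K.Q4_1.gguf": 3.95,
    "LLaMA-2-7B-32K.Q4_K_M.gguf": 3.80,
    "LLaMA-2-7B-32K.Q4_K_S.gguf": 3.59,
    "LLaMA-2-7B-32K.Q5_0.gguf": 4.33,
    "LLaMA-2-7B-32K.Q5_1.gguf": 4.72,
    "LLaMA-2-7B-32K.Q5_K_M.gguf": 4.45,
    "LLaMA-2-7B-32K.Q5_K_S.gguf": 4.33,
    "LLaMA-2-7B-32K.Q6_K.gguf": 5.15,
    "LLaMA-2-7B-32K.Q8_0.gguf": 6.67,
}


def pick_largest_fitting(budget_gb, sizes=GGUF_SIZES_GB):
    """Return the largest file whose size is within budget_gb, or None.

    Hypothetical helper: ties are broken by table order.
    """
    fitting = {name: size for name, size in sizes.items() if size <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(pick_largest_fitting(4.0))
```

With a 4.0 GB budget the helper selects the Q4_1 file (3.95 GB); with a budget below 2.36 GB it returns `None`.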

Model Details

Model Slug

quantfactory/llama-2-7b-32k-gguf

Author

QuantFactory

Pipeline Task

—

Library

transformers

Created

2024-10-18

Last Modified

2024-10-18

Gated

No

Private

No

HF SHA

887b264eb7a08b215c1960c8a1784b782318f2d5

License

llama2

Language

en

Base Model

togethercomputer/LLaMA-2-7B-32K

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "llama2",
    "datasets": [
      "togethercomputer/RedPajama-Data-1T",
      "togethercomputer/RedPajama-Data-Instruct",
      "EleutherAI/pile",
      "togethercomputer/Long-Data-Collections"
    ],
    "language": [
      "en"
    ],
    "library_name": "transformers",
    "frontmatter": {},
    "hero_image_url": "https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ",
    "summary": "This is quantized version of togethercomputer/LLaMA-2-7B-32K created using llama.cpp # Original Model Card # LLaMA-2-7B-32K",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "\n---\n\nlicense: llama2\ndatasets:\n- togethercomputer/RedPajama-Data-1T\n- togethercomputer/RedPajama-Data-Instruct\n- EleutherAI/pile\n- togethercomputer/Long-Data-Collections\nlanguage:\n- en\nlibrary_name: transformers\n\n---\n\n[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)\n\n\n# QuantFactory/LLaMA-2-7B-32K-GGUF\nThis is quantized version of [togethercomputer/LLaMA-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) created using llama.cpp\n\n# Original Model Card\n\n\n# LLaMA-2-7B-32K\n\n## Model Description\n\nLLaMA-2-7B-32K is an open-source, long context language model developed by Together, fine-tuned from Meta's original Llama-2 7B model. \nThis model represents our efforts to contribute to the rapid progress of the open-source ecosystem for large language models. \nThe model has been extended to a context length of 32K with position interpolation, \nallowing applications on multi-document QA, long text summarization, etc.\n\n## What's new?\n\nThis model introduces several improvements and new features:\n\n1. **Extended Context:** The model has been trained to handle context lengths up to 32K, which is a significant improvement over the previous versions.\n\n2. **Pre-training and Instruction Tuning:** We have shared our data recipe, which consists of a mixture of pre-training and instruction tuning data.\n\n3. **Fine-tuning Examples:** We provide examples of how to fine-tune the model for specific applications, including book summarization and long context question and answering.\n\n4. 
**Software Support:** We have updated both the inference and training stack to allow efficient inference and fine-tuning for 32K context.\n\n## Model Architecture\n\nThe model follows the architecture of Llama-2-7B and extends it to handle a longer context. It leverages the recently released FlashAttention-2 and a range of other optimizations to improve the speed and efficiency of inference and training.\n\n## Training and Fine-tuning\n\nThe model has been trained using a mixture of pre-training and instruction tuning data. \n- In the first training phase of continued pre-training, our data mixture contains 25% RedPajama Book, 25% RedPajama ArXiv (including abstracts), 25% other data from RedPajama, and 25% from the UL2 Oscar Data, which is a part of OIG (Open-Instruction-Generalist), asking the model to fill in missing chunks, or complete the text. \nTo enhance the long-context ability, we exclude data shorter than 2K word. The inclusion of UL2 Oscar Data is effective in compelling the model to read and utilize long-range context.\n- We then fine-tune the model to focus on its few shot capacity under long context, including 20% Natural Instructions (NI), 20% Public Pool of Prompts (P3), 20% the Pile. We decontaminated all data against HELM core scenarios . We teach the model to leverage the in-context examples by packing examples into one 32K-token sequence. To maintain the knowledge learned from the first piece of data, we incorporate 20% RedPajama-Data Book and 20% RedPajama-Data ArXiv.\n\nNext, we provide examples of how to fine-tune the model for specific applications. 
\nThe example datasets are placed in [togethercomputer/Long-Data-Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections)\nYou can use the [OpenChatKit](https://github.com/togethercomputer/OpenChatKit) to fine-tune your own 32K model over LLaMA-2-7B-32K.\nPlease refer to [OpenChatKit](https://github.com/togethercomputer/OpenChatKit) for step-by-step illustrations.\n\n1. Long Context QA.\n\n   We take as an example the multi-document question answering task from the paper “Lost in the Middle: How Language Models Use Long Contexts”. The input for the model consists of (i) a question that requires an answer and (ii) k documents, which are passages extracted from Wikipedia. Notably, only one of these documents contains the answer to the question, while the remaining k − 1 documents, termed as \"distractor\" documents, do not. To successfully perform this task, the model must identify and utilize the document containing the answer from its input context. \n\n   With OCK, simply run the following command to fine-tune:\n   ```\n   bash training/finetune_llama-2-7b-32k-mqa.sh\n   ```\n\n2. Summarization.\n\n   Another example is BookSum, a unique dataset designed to address the challenges of long-form narrative summarization. This dataset features source documents from the literature domain, including novels, plays, and stories, and offers human-written, highly abstractive summaries. We here focus on chapter-level data.  BookSum poses a unique set of challenges, necessitating that the model comprehensively read through each chapter.\n\n   With OCK, simply run the following command to fine-tune:\n   ```\n   bash training/finetune_llama-2-7b-32k-booksum.sh\n   ```\n\n\n## Inference\n\nYou can use the [Together API](https://together.ai/blog/api-announcement) to try out LLaMA-2-7B-32K for inference. 
\nThe updated inference stack allows for efficient inference.\n\nTo run the model locally, we strongly recommend to install Flash Attention V2, which is necessary to obtain the best performance:\n```\n# Please update the path of `CUDA_HOME`\nexport CUDA_HOME=/usr/local/cuda-11.8\npip install transformers==4.31.0\npip install sentencepiece\npip install ninja\npip install flash-attn --no-build-isolation\npip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary\n```\n\nYou can use this model directly from the Hugging Face Model Hub or fine-tune it on your own data using the OpenChatKit.\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ntokenizer = AutoTokenizer.from_pretrained(\"togethercomputer/LLaMA-2-7B-32K\")\nmodel = AutoModelForCausalLM.from_pretrained(\"togethercomputer/LLaMA-2-7B-32K\", trust_remote_code=True, torch_dtype=torch.float16)\n\ninput_context = \"Your text here\"\ninput_ids = tokenizer.encode(input_context, return_tensors=\"pt\")\noutput = model.generate(input_ids, max_length=128, temperature=0.7)\noutput_text = tokenizer.decode(output[0], skip_special_tokens=True)\nprint(output_text)\n```\n\nAlternatively, you can set `trust_remote_code=False` if you prefer not to use flash attention.\n\n\n## Limitations and Bias\n\nAs with all language models, LLaMA-2-7B-32K may generate incorrect or biased content. It's important to keep this in mind when using the model.\n\n## Community\n\nJoin us on [Together Discord](https://discord.gg/6ZVDU8tTD4)\n",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "en",
    "dataset:togethercomputer/RedPajama-Data-1T",
    "dataset:togethercomputer/RedPajama-Data-Instruct",
    "dataset:EleutherAI/pile",
    "dataset:togethercomputer/Long-Data-Collections",
    "license:llama2",
    "endpoints_compatible",
    "region:us"
  ],
  "likes": 1,
  "downloads": 307,
  "gated": false,
  "private": false,
  "last_modified": "2024-10-18T17:54:00.000Z",
  "created_at": "2024-10-18T17:21:10.000Z",
  "pipeline_tag": "",
  "library_name": "transformers"
}
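
The README stored above says the context was extended to 32K "with position interpolation". As a rough illustration of that idea, the sketch below compresses position indices linearly before computing rotary (RoPE) angles, so that positions up to 32K map into the 4K range Llama-2 was pre-trained on. The function name `rope_angles` is hypothetical, and the scale of 4096/32768 is an assumption derived from the two context lengths; the sheet does not state the exact factor the model uses.

```python
import math


def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    """Rotary angles for one position.

    scale < 1 compresses the position index (linear position interpolation).
    head_dim=128 and base=10000 match Llama-2 7B; scale is an assumption.
    """
    scaled = position * scale
    return [scaled / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


ORIGINAL_CONTEXT = 4096    # Llama-2 pre-training context length
EXTENDED_CONTEXT = 32768   # 32K target context
SCALE = ORIGINAL_CONTEXT / EXTENDED_CONTEXT  # 0.125

# A position near the end of the 32K window lands on an angle
# the base model already saw near the end of its 4K window.
interpolated = rope_angles(32760, scale=SCALE)
reference = rope_angles(4095)
print(all(math.isclose(a, b) for a, b in zip(interpolated, reference)))
```

The point of the design is that no position produces angles outside the range seen during pre-training; the fine-tuning described in the README then teaches the model to resolve the more tightly spaced positions.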

Source payload excerpt (from Hugging Face API)

{
  "_id": "67129906d5df1a4dc6ae736b",
  "id": "QuantFactory/LLaMA-2-7B-32K-GGUF",
  "modelId": "QuantFactory/LLaMA-2-7B-32K-GGUF",
  "sha": "887b264eb7a08b215c1960c8a1784b782318f2d5",
  "createdAt": "2024-10-18T17:21:10.000Z",
  "lastModified": "2024-10-18T17:54:00.000Z",
  "author": "QuantFactory",
  "downloads": 307,
  "likes": 1,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "transformers",
  "siblings_count": 16
}
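
As a small usage illustration, the payload excerpt above can be parsed with the standard library. This is a sketch under stated assumptions: only a subset of the fields is reproduced, the helper name `parse_ts` is hypothetical, and the timestamp format is inferred from the two values shown rather than from any API specification.

```python
import json
from datetime import datetime

# Subset of the payload excerpt shown above.
payload = json.loads("""
{
  "id": "QuantFactory/LLaMA-2-7B-32K-GGUF",
  "createdAt": "2024-10-18T17:21:10.000Z",
  "lastModified": "2024-10-18T17:54:00.000Z",
  "downloads": 307,
  "likes": 1,
  "gated": false,
  "private": false
}
""")


def parse_ts(value):
    """Parse timestamps of the form 2024-10-18T17:21:10.000Z (format assumed)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


elapsed = parse_ts(payload["lastModified"]) - parse_ts(payload["createdAt"])
is_public = not payload["gated"] and not payload["private"]

print(payload["id"], elapsed.total_seconds(), is_public)
```

For this payload the gap between creation and last modification is 32 minutes 50 seconds, and the repository is neither gated nor private.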