duyntnet/gemma-2-9b-it-imatrix-gguf (IQ3_S GGUF, free download) is indexed on GraySoft with repository links, GGUF quant files, and Hugging Face metadata. This page helps you pick a local model for guIDE or other runtimes. See related models in the same shard below.

Model Intelligence Sheet

duyntnet/gemma-2-9b-it-imatrix-gguf overview

Usage (from the original google/gemma-2-9b-it README)

The original README shares code snippets for getting started quickly with the model. First run pip install -U transformers, then copy the snippet from the section relevant to your use case. The snippets themselves are preserved in the readme_markdown field under Metadata Inspector. The sections are:

Running the model on a single / multi GPU

Running the model on a GPU using different precisions: the native weights of this model were exported in bfloat16 precision. You can use float16, which may be faster on certain hardware, by passing torch_dtype when loading the model. For convenience, the float16 revision of the repo contains a copy of the weights already converted to that precision. You can also get float32 by omitting the dtype, but precision does not increase (the weights are simply upcast to float32). The README gives one example each for torch.float16, torch.bfloat16, and upcasting to torch.float32.

Quantized versions through bitsandbytes: 8-bit precision (int8) and 4-bit precision.

Other optimizations: Flash Attention 2. First install flash-attn in your environment with pip install flash-attn.

Chat Template

The instruction-tuned models use a chat template that must be followed for conversational use. The easiest way to apply it is the tokenizer's built-in chat template. The README loads the model, applies the template to a conversation with a single user turn, and shows the resulting prompt text. Each turn is preceded by a <start_of_turn> delimiter and then the role of the entity (either user, for content supplied by the user, or model, for LLM responses). Turns finish with the <end_of_turn> token. You can follow this format to build the prompt manually if you need to do it without the tokenizer's chat template. Once the prompt is ready, it is encoded and passed to the model's generate method.
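The turn format described above can be sketched in plain Python. This is a minimal illustration based on the prompt text shown in the original README (<bos>, then <start_of_turn>, the role, the content, and <end_of_turn>); the function name build_gemma_prompt and its parameters are hypothetical, and the tokenizer's own chat template remains the reference.

```python
def build_gemma_prompt(turns, add_generation_prompt=True):
    """Build a Gemma-style chat prompt by hand (hypothetical helper).

    `turns` is a list of (role, content) pairs, where role is
    "user" or "model", matching the roles named in the README.
    """
    parts = ["<bos>"]
    for role, content in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        # Each turn: delimiter, role, newline, content, end token, newline.
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the model's reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([("user", "Write a hello world program")])
print(prompt)
```

Because the string already starts with <bos>, the README encodes the prompt with add_special_tokens=False so the token is not added a second time.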

transformers, gguf, imatrix, gemma-2-9b-it, text-generation, en, license:other, region:us, conversational

Downloads

186

Likes

1

Pipeline

text-generation

Library

transformers

Visibility

Public

Access

Open

Repository Files & Downloads

27 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
gemma-2-9b-it-IQ1_M.gguf	GGUF	IQ1_M	2.37 GB	Download
gemma-2-9b-it-IQ1_S.gguf	GGUF	IQ1_S	2.22 GB	Download
gemma-2-9b-it-IQ2_M.gguf	GGUF	IQ2_M	3.20 GB	Download
gemma-2-9b-it-IQ2_S.gguf	GGUF	IQ2_S	2.99 GB	Download
gemma-2-9b-it-IQ2_XS.gguf	GGUF	IQ2_XS	2.86 GB	Download
gemma-2-9b-it-IQ2_XXS.gguf	GGUF	IQ2_XXS	2.63 GB	Download
gemma-2-9b-it-IQ3_M.gguf	GGUF	IQ3_M	4.19 GB	Download
gemma-2-9b-it-IQ3_S.gguf	GGUF	IQ3_S	4.04 GB	Download
gemma-2-9b-it-IQ3_XS.gguf	GGUF	IQ3_XS	3.86 GB	Download
gemma-2-9b-it-IQ3_XXS.gguf	GGUF	IQ3_XXS	3.54 GB	Download
gemma-2-9b-it-IQ4_NL.gguf	GGUF	IQ4_NL	5.07 GB	Download
gemma-2-9b-it-IQ4_XS.gguf	GGUF	IQ4_XS	4.83 GB	Download
gemma-2-9b-it-Q2_K.gguf	GGUF	Q2_K	3.54 GB	Download
gemma-2-9b-it-Q2_K_S.gguf	GGUF	Q2_K_S	3.31 GB	Download
gemma-2-9b-it-Q3_K_L.gguf	GGUF	Q3_K_L	4.78 GB	Download
gemma-2-9b-it-Q3_K_M.gguf	GGUF	Q3_K_M	4.43 GB	Download
gemma-2-9b-it-Q3_K_S.gguf	GGUF	Q3_K_S	4.04 GB	Download
gemma-2-9b-it-Q4_0.gguf	GGUF	Q4_0	5.08 GB	Download
gemma-2-9b-it-Q4_1.gguf	GGUF	Q4_1	5.55 GB	Download
gemma-2-9b-it-Q4_K_M.gguf	GGUF	Q4_K_M	5.37 GB	Download
gemma-2-9b-it-Q4_K_S.gguf	GGUF	Q4_K_S	5.10 GB	Download
gemma-2-9b-it-Q5_0.gguf	GGUF	Q5_0	6.05 GB	Download
gemma-2-9b-it-Q5_1.gguf	GGUF	Q5_1	6.52 GB	Download
gemma-2-9b-it-Q5_K_M.gguf	GGUF	Q5_K_M	6.19 GB	Download
gemma-2-9b-it-Q5_K_S.gguf	GGUF	Q5_K_S	6.04 GB	Download
gemma-2-9b-it-Q6_K.gguf	GGUF	Q6_K	7.07 GB	Download
gemma-2-9b-it-Q8_0.gguf	GGUF	Q8_0	9.15 GB	Download
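One way to use the table above when picking a file is to choose the largest quantization that fits a memory budget. The sketch below is only an illustration: the sizes are copied from a subset of the table, while pick_quant and its overhead_gb parameter (a rough allowance for context cache and runtime buffers) are hypothetical, and real memory use depends on the runtime and context length.

```python
# File sizes in GB, copied from a subset of the table above.
QUANT_SIZES_GB = {
    "IQ2_M": 3.20,
    "IQ3_S": 4.04,
    "IQ4_XS": 4.83,
    "Q4_K_M": 5.37,
    "Q5_K_M": 6.19,
    "Q6_K": 7.07,
    "Q8_0": 9.15,
}


def pick_quant(budget_gb, overhead_gb=1.0, sizes=QUANT_SIZES_GB):
    """Return the largest quant whose file size plus overhead fits the budget.

    `overhead_gb` is an assumed allowance, not a measured value.
    Returns None when nothing fits.
    """
    fitting = {q: s for q, s in sizes.items() if s + overhead_gb <= budget_gb}
    if not fitting:
        return None
    # The largest file that fits is taken as the least aggressive quantization.
    return max(fitting, key=fitting.get)


for budget in (3.0, 6.0, 8.0, 12.0):
    print(budget, pick_quant(budget))
```

File size is used here as a stand-in for quality, on the assumption that a larger file of the same model keeps more precision; it is worth checking output quality on your own prompts before settling on a file.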

Model Details Live

Model Slug

duyntnet/gemma-2-9b-it-imatrix-gguf

Author

duyntnet

Pipeline Task

text-generation

Library

transformers

Created

2024-06-28

Last Modified

2024-09-07

Gated

No

Private

No

HF SHA

522423977f806ddc665320dfcbd010f40b1db1dc

License

other

Language

en

Base Model

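google/gemma-2-9b-it (per the repository README: "Quantizations of https://huggingface.co/google/gemma-2-9b-it")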
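The HF SHA listed above can be used to pin a download to that exact commit, which matters here because the README notes the files were requantized and reuploaded twice. The sketch below builds a direct file URL using the Hugging Face "resolve" URL pattern; hf_file_url is a hypothetical helper, and the repository id casing is taken from the API excerpt on this page.

```python
from urllib.parse import quote


def hf_file_url(repo_id, filename, revision="main"):
    """Build a direct download URL for one file in a Hugging Face repo.

    `revision` may be a branch name or a full commit SHA (hypothetical helper).
    """
    return (
        "https://huggingface.co/"
        f"{quote(repo_id)}/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


url = hf_file_url(
    "duyntnet/gemma-2-9b-it-imatrix-GGUF",
    "gemma-2-9b-it-IQ3_S.gguf",
    revision="522423977f806ddc665320dfcbd010f40b1db1dc",
)
print(url)
```

If the huggingface_hub package is available, its hf_hub_download function accepts the same repo_id, filename, and revision values and handles caching.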

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "other",
    "language": [
      "en"
    ],
    "pipeline_tag": "text-generation",
    "inference": false,
    "tags": [
      "transformers",
      "gguf",
      "imatrix",
      "gemma-2-9b-it"
    ],
    "frontmatter": {
      "license": "other",
      "language": [
        "en"
      ],
      "pipeline_tag": "text-generation",
      "inference": "false",
      "tags": [
        "transformers",
        "gguf",
        "imatrix",
        "gemma-2-9b-it"
      ]
    },
    "hero_image_url": "",
    "summary": "### Usage Below we share some code snippets on how to get quickly started with running the model. First make sure to pip install -U transformers, then copy the snippet from the section that is relevant for your usecase. #### Running the model on a single / multi GPU ``python # pip install accelerate from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\") model = AutoModelForCausalLM.from_pretrained( \"google/gemma-2-9b-it\", device_map=\"auto\", torch_dtype=torch.bfloat16 ) input_text = \"Write me a poem about Machine Learning.\" input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\") outputs = model.generate(**input_ids) print(tokenizer.decode(outputs[0])) `  #### Running the model on a GPU using different precisions The native weights of this model were exported in bfloat16 precision. You can use float16, which may be faster on certain hardware, indicating the torch_dtype when loading the model. For convenience, the float16 revision of the repo contains a copy of the weights already converted to that precision. You can also use float32 if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to float32). See examples below. 
* _Using torch.float16_ `python # pip install accelerate from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\") model = AutoModelForCausalLM.from_pretrained( \"google/gemma-2-9b-it\", device_map=\"auto\", torch_dtype=torch.float16, revision=\"float16\", ) input_text = \"Write me a poem about Machine Learning.\" input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\") outputs = model.generate(**input_ids) print(tokenizer.decode(outputs[0])) ` * _Using torch.bfloat16_ `python # pip install accelerate from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\") model = AutoModelForCausalLM.from_pretrained( \"google/gemma-2-9b-it\", device_map=\"auto\", torch_dtype=torch.bfloat16) input_text = \"Write me a poem about Machine Learning.\" input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\") outputs = model.generate(**input_ids) print(tokenizer.decode(outputs[0])) ` * _Upcasting to torch.float32_ `python # pip install accelerate from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\") model = AutoModelForCausalLM.from_pretrained( \"google/gemma-2-9b-it\", device_map=\"auto\") input_text = \"Write me a poem about Machine Learning.\" input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\") outputs = model.generate(**input_ids) print(tokenizer.decode(outputs[0])) ` #### Quantized Versions through bitsandbytes * _Using 8-bit precision (int8)_ `python # pip install bitsandbytes accelerate from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True) tokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\") model = AutoModelForCausalLM.from_pretrained( \"google/gemma-2-9b-it\", quantization_config=quantization_config) 
input_text = \"Write me a poem about Machine Learning.\" input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\") outputs = model.generate(**input_ids) print(tokenizer.decode(outputs[0])) ` * _Using 4-bit precision_ `python # pip install bitsandbytes accelerate from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_4bit=True) tokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\") model = AutoModelForCausalLM.from_pretrained( \"google/gemma-2-9b-it\", quantization_config=quantization_config) input_text = \"Write me a poem about Machine Learning.\" input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\") outputs = model.generate(**input_ids) print(tokenizer.decode(outputs[0])) ` #### Other optimizations * _Flash Attention 2_ First make sure to install flash-attn in your environment pip install flash-attn `diff model = AutoModelForCausalLM.from_pretrained( model_id, torch_dtype=torch.float16, +   attn_implementation=\"flash_attention_2\" ).to(0) ` ### Chat Template The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet. Let's load the model and apply the chat template to a conversation. 
In this example, we'll start with a single user interaction: `py from transformers import AutoTokenizer, AutoModelForCausalLM import transformers import torch model_id = \"google/gemma-2-9b-it\" dtype = torch.bfloat16 tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained( model_id, device_map=\"cuda\", torch_dtype=dtype,) chat = [ { \"role\": \"user\", \"content\": \"Write a hello world program\" }, ] prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True) ` At this point, the prompt contains the following text: ` user Write a hello world program model ` As you can see, each turn is preceded by a  delimiter and then the role of the entity (either user, for content supplied by the user, or model for LLM responses). Turns finish with the  token. You can follow this format to build the prompt manually, if you need to do it without the tokenizer's chat template. After the prompt is ready, generation can be performed like this: `py inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors=\"pt\") outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150) print(tokenizer.decode(outputs[0])) ``",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: other\nlanguage:\n- en\npipeline_tag: text-generation\ninference: false\ntags:\n- transformers\n- gguf\n- imatrix\n- gemma-2-9b-it\n---\nQuantizations of https://huggingface.co/google/gemma-2-9b-it\n\nUpdate (July 7, 2024): **Requantized and reuploaded** using llama.cpp latest version (b3325), everything should work as expected. \n\nUpdate #2 (Sept 6, 2024): **Requantized and reuploaded** using llama.cpp latest version (b3672), remaining issues (if any) should be gone now.\n\n### Inference Clients/UIs\n* [llama.cpp](https://github.com/ggerganov/llama.cpp)\n* [JanAI](https://github.com/janhq/jan)\n* [KoboldCPP](https://github.com/LostRuins/koboldcpp)\n* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)\n* [ollama](https://github.com/ollama/ollama)\n* [GPT4All](https://github.com/nomic-ai/gpt4all)\n  \n---\n\n# From original readme\n\n### Usage\n\nBelow we share some code snippets on how to get quickly started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your usecase.\n\n\n#### Running the model on a single / multi GPU\n\n\n```python\n# pip install accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nimport torch\n\ntokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"google/gemma-2-9b-it\",\n    device_map=\"auto\",\n    torch_dtype=torch.bfloat16\n)\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n<a name=\"precisions\"></a>\n#### Running the model on a GPU using different precisions\n\nThe native weights of this model were exported in `bfloat16` precision. 
You can use `float16`, which may be faster on certain hardware, indicating the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.\n\nYou can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.\n\n* _Using `torch.float16`_\n\n```python\n# pip install accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nimport torch\n\ntokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"google/gemma-2-9b-it\",\n    device_map=\"auto\",\n    torch_dtype=torch.float16,\n    revision=\"float16\",\n)\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n* _Using `torch.bfloat16`_\n\n```python\n# pip install accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ntokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"google/gemma-2-9b-it\",\n    device_map=\"auto\",\n    torch_dtype=torch.bfloat16)\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n* _Upcasting to `torch.float32`_\n\n```python\n# pip install accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ntokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"google/gemma-2-9b-it\",\n    device_map=\"auto\")\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, 
return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n#### Quantized Versions through `bitsandbytes`\n\n* _Using 8-bit precision (int8)_\n\n```python\n# pip install bitsandbytes accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\nquantization_config = BitsAndBytesConfig(load_in_8bit=True)\n\ntokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"google/gemma-2-9b-it\",\n    quantization_config=quantization_config)\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n* _Using 4-bit precision_\n\n```python\n# pip install bitsandbytes accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\nquantization_config = BitsAndBytesConfig(load_in_4bit=True)\n\ntokenizer = AutoTokenizer.from_pretrained(\"google/gemma-2-9b-it\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"google/gemma-2-9b-it\",\n    quantization_config=quantization_config)\n\ninput_text = \"Write me a poem about Machine Learning.\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutputs = model.generate(**input_ids)\nprint(tokenizer.decode(outputs[0]))\n```\n\n\n#### Other optimizations\n\n* _Flash Attention 2_\n\nFirst make sure to install `flash-attn` in your environment `pip install flash-attn`\n\n```diff\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id, \n    torch_dtype=torch.float16, \n+   attn_implementation=\"flash_attention_2\"\n).to(0)\n```\n\n### Chat Template\n\nThe instruction-tuned models use a chat template that must be adhered to for conversational use.\nThe easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.\n\nLet's 
load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:\n\n```py\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nimport transformers\nimport torch\n\nmodel_id = \"google/gemma-2-9b-it\"\ndtype = torch.bfloat16\n\ntokenizer = AutoTokenizer.from_pretrained(model_id)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    device_map=\"cuda\",\n    torch_dtype=dtype,)\n\nchat = [\n    { \"role\": \"user\", \"content\": \"Write a hello world program\" },\n]\nprompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)\n```\n\nAt this point, the prompt contains the following text:\n\n```\n<bos><start_of_turn>user\nWrite a hello world program<end_of_turn>\n<start_of_turn>model\n```\n\nAs you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity\n(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with\nthe `<end_of_turn>` token.\n\nYou can follow this format to build the prompt manually, if you need to do it without the tokenizer's\nchat template.\n\nAfter the prompt is ready, generation can be performed like this:\n\n```py\ninputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors=\"pt\")\noutputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)\nprint(tokenizer.decode(outputs[0]))\n```",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "imatrix",
    "gemma-2-9b-it",
    "text-generation",
    "en",
    "license:other",
    "region:us",
    "conversational"
  ],
  "likes": 1,
  "downloads": 186,
  "gated": false,
  "private": false,
  "last_modified": "2024-09-07T06:49:56.000Z",
  "created_at": "2024-06-28T15:30:16.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "transformers"
}

Source payload excerpt (from Hugging Face API)

{
  "_id": "667ed70860af4273833c8915",
  "id": "duyntnet/gemma-2-9b-it-imatrix-GGUF",
  "modelId": "duyntnet/gemma-2-9b-it-imatrix-GGUF",
  "sha": "522423977f806ddc665320dfcbd010f40b1db1dc",
  "createdAt": "2024-06-28T15:30:16.000Z",
  "lastModified": "2024-09-07T06:49:56.000Z",
  "author": "duyntnet",
  "downloads": 186,
  "likes": 1,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "transformers",
  "siblings_count": 29
}

More models in this shard