GraySoft
Model Intelligence Sheet

nisten/qwenv2-7b-inst-imatrix-gguf overview

These are a whole bunch of conversions of Qwen2 7B Instruct, made in an attempt to fix the reduced performance seen when quantizing. The bf16 versions will NOT work with Apple GPUs, but will work with most CPUs and newer NVIDIA cards (older ones like the 1080 series don't support bf16 inference well).

Perplexity benchmarks will come later, once an automated suite is written by me or whoever. Sorry, I have just been too busy, and doing those properly for each quant takes all day.

Model names should be self-explanatory. Just pick the biggest one that your hardware can run.

Overall, in this experiment I noticed that quantising the embedding weight had much less effect on perplexity than expected: even at q4k it didn't harm the model much, but under q4k it did drastic damage to intelligence.

Quantising the output weight to q8, on the other hand, was fine as long as it was done directly from bf16, instead of first converting to f16 (which just deletes 3 bits of precision) and then quantising to q8. Going any lower caused a lot of issues. Further improvements are coming in the future, as imatrix optimizations still do not work well when applied to the output weight.

Anyway, I just uploaded what I had experimented with so far, in case anyone wants to carry on the work. Feel free to just use the biggest weight you can run. q4k with 8-bit output works very well, and iq4xs with 8-bit output is probably the best performance/intelligence ratio for any 7B model that exists right now, in my opinion.

Cheers,
Nisten
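The mixed recipes described above (one quant type for the bulk of the model, separate types for the embedding and output tensors) can be expressed with llama.cpp's `llama-quantize` tool, which accepts `--token-embedding-type`, `--output-tensor-type` and `--imatrix`. The sketch below only builds such a command line; it is not necessarily what the author ran. The mapping from the file name to the flags is inferred from the name, and `imatrix.dat` is a hypothetical importance-matrix file.

```python
import shlex


def build_quantize_cmd(src_gguf, dst_gguf, ftype,
                       embed_type=None, output_type=None,
                       imatrix=None, binary="llama-quantize"):
    """Build (but do not run) a llama-quantize command as an argument list."""
    cmd = [binary]
    if imatrix:
        cmd += ["--imatrix", imatrix]                   # importance matrix file
    if embed_type:
        cmd += ["--token-embedding-type", embed_type]   # override embedding tensor type
    if output_type:
        cmd += ["--output-tensor-type", output_type]    # override output tensor type
    cmd += [src_gguf, dst_gguf, ftype]                  # positional: input, output, base quant type
    return cmd


# Assumed recipe for the "iq4xs, embedding 4xs, output 8bit" file,
# quantized straight from the bf16 GGUF rather than from an f16 copy.
cmd = build_quantize_cmd(
    "qwen7bv2instruct_bf16.gguf",
    "qwen7bv2inst_iq4xs_embedding4xs_output8bit.gguf",
    "IQ4_XS",
    embed_type="iq4_xs",
    output_type="q8_0",
    imatrix="imatrix.dat",  # hypothetical file name
)
print(shlex.join(cmd))
```

Starting from the bf16 file keeps the output tensor on the bf16 to q8 path that the author reports as working well.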

Tags: gguf, base_model:Qwen/Qwen2-7B-Instruct, base_model:quantized:Qwen/Qwen2-7B-Instruct, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational
Downloads: 103
Likes: 3
Pipeline: (not set)
Library: (not set)
Visibility: Public
Access: Open

Repository Files & Downloads

8 files detected
Direct downloads for all repository files
File | Type | Quantization | Size
qwen7bv2inst_iq4xs_embedding4xs_output6k.gguf | GGUF | | 3.93 GB
qwen7bv2inst_iq4xs_embedding4xs_output8bit.gguf | GGUF | | 4.05 GB
qwen7bv2inst_iq4xs_embedding8_outputq8.gguf | GGUF | | 4.32 GB
qwen7bv2inst_q4km_embedding4k_output8bit.gguf | GGUF | | 4.48 GB
qwen7bv2inst_q4km_embeddingf16_outputf16.gguf | GGUF | F16 | 5.69 GB
qwen7bv2instruct_bf16.gguf | GGUF | BF16 | 14.19 GB
qwen7bv2instruct_q5km.gguf | GGUF | | 5.19 GB
qwen7bv2instruct_q8.gguf | GGUF | | 7.54 GB
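The advice to "pick the biggest one that your hardware can run" can be sketched as a small selection helper over the sizes listed above. The `headroom_gb` allowance for context cache and runtime overhead is an assumed figure, not something stated in the model card, and the function name is hypothetical.

```python
# File sizes in GB, copied from the repository file list.
FILES_GB = {
    "qwen7bv2inst_iq4xs_embedding4xs_output6k.gguf": 3.93,
    "qwen7bv2inst_iq4xs_embedding4xs_output8bit.gguf": 4.05,
    "qwen7bv2inst_iq4xs_embedding8_outputq8.gguf": 4.32,
    "qwen7bv2inst_q4km_embedding4k_output8bit.gguf": 4.48,
    "qwen7bv2inst_q4km_embeddingf16_outputf16.gguf": 5.69,
    "qwen7bv2instruct_bf16.gguf": 14.19,
    "qwen7bv2instruct_q5km.gguf": 5.19,
    "qwen7bv2instruct_q8.gguf": 7.54,
}


def pick_largest(files_gb, budget_gb, headroom_gb=1.5):
    """Return the largest file that fits in budget_gb after reserving headroom.

    headroom_gb is an assumed allowance for context cache and runtime
    overhead; tune it for your own setup. Returns None if nothing fits.
    """
    fitting = {name: size for name, size in files_gb.items()
               if size + headroom_gb <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(pick_largest(FILES_GB, budget_gb=6.0))
# prints qwen7bv2inst_q4km_embedding4k_output8bit.gguf
```

Note that the helper ranks by size only. On Apple GPUs the bf16 file should be excluded by hand, since the card states that bf16 will not run there.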

Model Details Live

Model Slug: nisten/qwenv2-7b-inst-imatrix-gguf
Author: nisten
Pipeline Task: (not set)
Library: (not set)
Created: 2024-06-16
Last Modified: 2024-06-16
Gated: No
Private: No
HF SHA: 9869461a44b797b7a292c9184d43baac3c33f484
License: apache-2.0
Language: Unknown
Base Model: Qwen/Qwen2-7B-Instruct

Metadata Inspector

Normalized metadata (stored in metadata_json)
{
  "metadata": {},
  "card_data": {
    "license": "apache-2.0",
    "base_model": "Qwen/Qwen2-7B-Instruct",
    "frontmatter": {
      "license": "apache-2.0",
      "base_model": "Qwen/Qwen2-7B-Instruct"
    },
    "hero_image_url": "",
    "summary": "These are a whole bunch of conversions of qwen7b v2 in an attempt to fix the reduced performance while quantizing. The bf16 versions will NOT work with apple GPUs but will work with most cpus and newer nvidia cards (older ones like 1080 series don't support bf16 inference well). Perplexity benchmarks will come later once an automated suite is written by me or whoemever, sorry have just been too busy and doing those properly for each quant takes all day. Model names should be self explanatory. Just pick the biggest one that your hardware can run. Overall in this experiment I noticed that quantising the embedding weight had much less effect on perplexity that expected, even at q4k it didnt harm the model much but under 4k it was drastic damage to intelligence. Whereas quantising the output weight to q8 was fine and nearly as long as it was done from a bf16 instead of quantising to f16 ( which just deletes 3 bits of precision) and then quantising to q8. Going any lower had a lot of issues, there are further improvements coming in the future to this as imatrix optimizations still do not work well when applied to the output weight. Anyway, just uploaded what I had experimented in so far if anyone wants to carry over the work. Feel free to just use the biggest weight you can run. q4k with 8bit output works very well and iq4xs with 8bit out is probably the best performance/intelligence ratio for any 7b model that exists right now in my opinion. Cheers, Nisten",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: apache-2.0\nbase_model: Qwen/Qwen2-7B-Instruct\n---\n\nThese are a whole bunch of conversions of qwen7b v2 in an attempt to fix the reduced performance while quantizing.\nThe bf16 versions will NOT work with apple GPUs but will work with most cpus and newer nvidia cards (older ones like 1080 series don't support bf16 inference well).\n\nPerplexity benchmarks will come later once an automated suite is written by me or whoemever, sorry have just been too busy and doing those properly for each quant takes all day.\n\nModel names should be self explanatory. Just pick the biggest one that your hardware can run. \n\n\nOverall in this experiment I noticed that quantising the embedding weight had much less effect on perplexity that expected, even at q4k it didnt harm the model much but under 4k it was drastic damage to intelligence.\n\n\nWhereas quantising the output weight to q8 was fine and nearly as long as it was done from a bf16 instead of quantising to f16 ( which just deletes 3 bits of precision) and then quantising to q8. Going any lower had a lot of issues, there are further improvements coming in the future to this as imatrix optimizations still do not work well when applied to the output weight.\n\n\nAnyway, just uploaded what I had experimented in so far if anyone wants to carry over the work. Feel free to just use the biggest weight you can run. q4k with 8bit output works very well and iq4xs with 8bit out is probably the best performance/intelligence ratio for any 7b model that exists right now in my opinion.\n\n\n\nCheers,\nNisten",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "base_model:Qwen/Qwen2-7B-Instruct",
    "base_model:quantized:Qwen/Qwen2-7B-Instruct",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "imatrix",
    "conversational"
  ],
  "likes": 3,
  "downloads": 103,
  "gated": false,
  "private": false,
  "last_modified": "2024-06-16T18:12:28.000Z",
  "created_at": "2024-06-16T17:06:26.000Z",
  "pipeline_tag": "",
  "library_name": ""
}
Source payload excerpt (from Hugging Face API)
{
  "_id": "666f1b92a7ca4800af0f6f57",
  "id": "nisten/qwenv2-7b-inst-imatrix-gguf",
  "modelId": "nisten/qwenv2-7b-inst-imatrix-gguf",
  "sha": "9869461a44b797b7a292c9184d43baac3c33f484",
  "createdAt": "2024-06-16T17:06:26.000Z",
  "lastModified": "2024-06-16T18:12:28.000Z",
  "author": "nisten",
  "downloads": 103,
  "likes": 3,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 11
}