Model Intelligence Sheet

nisten/qwenv2-7b-inst-imatrix-gguf overview

These are a whole bunch of conversions of Qwen2-7B-Instruct, made in an attempt to fix the performance lost while quantizing. The bf16 versions will NOT work with Apple GPUs, but they will work with most CPUs and newer NVIDIA cards (older ones like the 1080 series don't support bf16 inference well).

Perplexity benchmarks will come later, once an automated suite is written by me or whoever. Sorry, I have just been too busy, and doing those properly for each quant takes all day.

Model names should be self-explanatory. Just pick the biggest one that your hardware can run.

Overall, in this experiment I noticed that quantizing the embedding weight had much less effect on perplexity than expected. Even at q4k it didn't harm the model much, but below that it did drastic damage to intelligence.

Quantizing the output weight to q8, by contrast, was fine, or nearly so, as long as it was done directly from bf16 instead of first converting to f16 (which just deletes 3 exponent bits) and then quantizing to q8. Going any lower had a lot of issues. Further improvements to this are coming, as imatrix optimizations still do not work well when applied to the output weight.

Anyway, I just uploaded what I had experimented with so far, in case anyone wants to carry on the work. Feel free to just use the biggest weight you can run. q4k with 8-bit output works very well, and iq4xs with 8-bit output is probably the best performance/intelligence ratio for any 7B model that exists right now, in my opinion.

Cheers,
Nisten
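The remark about going through f16 before q8 can be illustrated with a small sketch. bf16 stores 8 exponent bits and 7 mantissa bits, while f16 stores 5 exponent bits and 10 mantissa bits, so a bf16 value inside the f16 range survives the conversion exactly, but very small or very large values do not. The helper names `to_bf16` and `to_f16` are hypothetical, and bf16 is emulated here by plain truncation of the float32 encoding, which is an assumption made for brevity and not necessarily how the conversion tooling behaves.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: plain truncation is used for simplicity; real converters
    typically round to nearest even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_f16(x: float) -> float:
    """Round-trip a value through IEEE half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the f16 range; model that as inf.
        return float("inf") if x > 0 else float("-inf")


# A mid-range weight, a tiny weight, and a large value.
for value in (0.1, 1e-8, 1e5):
    b = to_bf16(value)
    print(f"{value:g}: bf16={b!r} -> f16={to_f16(b)!r}")
```

In this sketch the mid-range value passes through f16 unchanged, while the tiny value flushes to zero and the large one overflows, which is one plausible reading of why quantizing straight from bf16 behaved better.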

gguf, base_model:Qwen/Qwen2-7B-Instruct, base_model:quantized:Qwen/Qwen2-7B-Instruct, license:apache-2.0, endpoints_compatible, region:us, imatrix, conversational

Downloads

103

Likes

3

Pipeline

—

Library

—

Visibility

Public

Access

Open

Repository Files & Downloads

8 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
qwen7bv2inst_iq4xs_embedding4xs_output6k.gguf	GGUF	—	3.93 GB	Download
qwen7bv2inst_iq4xs_embedding4xs_output8bit.gguf	GGUF	—	4.05 GB	Download
qwen7bv2inst_iq4xs_embedding8_outputq8.gguf	GGUF	—	4.32 GB	Download
qwen7bv2inst_q4km_embedding4k_output8bit.gguf	GGUF	—	4.48 GB	Download
qwen7bv2inst_q4km_embeddingf16_outputf16.gguf	GGUF	F16	5.69 GB	Download
qwen7bv2instruct_bf16.gguf	GGUF	BF16	14.19 GB	Download
qwen7bv2instruct_q5km.gguf	GGUF	—	5.19 GB	Download
qwen7bv2instruct_q8.gguf	GGUF	—	7.54 GB	Download
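The advice to pick the biggest file your hardware can run can be sketched as a small selection helper. The file names and sizes are copied from the table above. The function name `pick_largest`, the `headroom_gb` allowance for context and runtime overhead, and the `allow_bf16` switch (for hardware such as Apple GPUs, where the bf16 file is reported not to work) are illustrative assumptions, not part of the repository.

```python
from typing import Optional

# File names and sizes in GB, copied from the repository file table.
FILES_GB = {
    "qwen7bv2inst_iq4xs_embedding4xs_output6k.gguf": 3.93,
    "qwen7bv2inst_iq4xs_embedding4xs_output8bit.gguf": 4.05,
    "qwen7bv2inst_iq4xs_embedding8_outputq8.gguf": 4.32,
    "qwen7bv2inst_q4km_embedding4k_output8bit.gguf": 4.48,
    "qwen7bv2inst_q4km_embeddingf16_outputf16.gguf": 5.69,
    "qwen7bv2instruct_bf16.gguf": 14.19,
    "qwen7bv2instruct_q5km.gguf": 5.19,
    "qwen7bv2instruct_q8.gguf": 7.54,
}


def pick_largest(
    budget_gb: float,
    headroom_gb: float = 1.0,  # assumed allowance for context and runtime overhead
    allow_bf16: bool = True,   # set False where bf16 inference is unsupported
) -> Optional[str]:
    """Return the largest file that fits in the memory budget, or None."""
    usable = budget_gb - headroom_gb
    candidates = [
        (size, name)
        for name, size in FILES_GB.items()
        if size <= usable and (allow_bf16 or "bf16" not in name)
    ]
    return max(candidates)[1] if candidates else None


print(pick_largest(8.0))
print(pick_largest(16.0, allow_bf16=False))
```

File size is only a rough proxy for memory use; the real footprint also depends on context length and the runtime, so the headroom value should be tuned to the setup.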

Model Details Live

Model Slug

nisten/qwenv2-7b-inst-imatrix-gguf

Author

nisten

Pipeline Task

—

Library

—

Created

2024-06-16

Last Modified

2024-06-16

Gated

No

Private

No

HF SHA

9869461a44b797b7a292c9184d43baac3c33f484

License

apache-2.0

Language

Unknown

Base Model

Qwen/Qwen2-7B-Instruct

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "apache-2.0",
    "base_model": "Qwen/Qwen2-7B-Instruct",
    "frontmatter": {
      "license": "apache-2.0",
      "base_model": "Qwen/Qwen2-7B-Instruct"
    },
    "hero_image_url": "",
    "summary": "These are a whole bunch of conversions of qwen7b v2 in an attempt to fix the reduced performance while quantizing. The bf16 versions will NOT work with apple GPUs but will work with most cpus and newer nvidia cards (older ones like 1080 series don't support bf16 inference well). Perplexity benchmarks will come later once an automated suite is written by me or whoemever, sorry have just been too busy and doing those properly for each quant takes all day. Model names should be self explanatory. Just pick the biggest one that your hardware can run. Overall in this experiment I noticed that quantising the embedding weight had much less effect on perplexity that expected, even at q4k it didnt harm the model much but under 4k it was drastic damage to intelligence. Whereas quantising the output weight to q8 was fine and nearly as long as it was done from a bf16 instead of quantising to f16 ( which just deletes 3 bits of precision) and then quantising to q8. Going any lower had a lot of issues, there are further improvements coming in the future to this as imatrix optimizations still do not work well when applied to the output weight. Anyway, just uploaded what I had experimented in so far if anyone wants to carry over the work. Feel free to just use the biggest weight you can run. q4k with 8bit output works very well and iq4xs with 8bit out is probably the best performance/intelligence ratio for any 7b model that exists right now in my opinion. Cheers, Nisten",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: apache-2.0\nbase_model: Qwen/Qwen2-7B-Instruct\n---\n\nThese are a whole bunch of conversions of qwen7b v2 in an attempt to fix the reduced performance while quantizing.\nThe bf16 versions will NOT work with apple GPUs but will work with most cpus and newer nvidia cards (older ones like 1080 series don't support bf16 inference well).\n\nPerplexity benchmarks will come later once an automated suite is written by me or whoemever, sorry have just been too busy and doing those properly for each quant takes all day.\n\nModel names should be self explanatory. Just pick the biggest one that your hardware can run. \n\n\nOverall in this experiment I noticed that quantising the embedding weight had much less effect on perplexity that expected, even at q4k it didnt harm the model much but under 4k it was drastic damage to intelligence.\n\n\nWhereas quantising the output weight to q8 was fine and nearly as long as it was done from a bf16 instead of quantising to f16 ( which just deletes 3 bits of precision) and then quantising to q8. Going any lower had a lot of issues, there are further improvements coming in the future to this as imatrix optimizations still do not work well when applied to the output weight.\n\n\nAnyway, just uploaded what I had experimented in so far if anyone wants to carry over the work. Feel free to just use the biggest weight you can run. q4k with 8bit output works very well and iq4xs with 8bit out is probably the best performance/intelligence ratio for any 7b model that exists right now in my opinion.\n\n\n\nCheers,\nNisten",
    "related_quantizations": []
  },
  "tags": [
    "gguf",
    "base_model:Qwen/Qwen2-7B-Instruct",
    "base_model:quantized:Qwen/Qwen2-7B-Instruct",
    "license:apache-2.0",
    "endpoints_compatible",
    "region:us",
    "imatrix",
    "conversational"
  ],
  "likes": 3,
  "downloads": 103,
  "gated": false,
  "private": false,
  "last_modified": "2024-06-16T18:12:28.000Z",
  "created_at": "2024-06-16T17:06:26.000Z",
  "pipeline_tag": "",
  "library_name": ""
}

Source payload excerpt (from Hugging Face API)

{
  "_id": "666f1b92a7ca4800af0f6f57",
  "id": "nisten/qwenv2-7b-inst-imatrix-gguf",
  "modelId": "nisten/qwenv2-7b-inst-imatrix-gguf",
  "sha": "9869461a44b797b7a292c9184d43baac3c33f484",
  "createdAt": "2024-06-16T17:06:26.000Z",
  "lastModified": "2024-06-16T18:12:28.000Z",
  "author": "nisten",
  "downloads": 103,
  "likes": 3,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "",
  "siblings_count": 11
}