Model Intelligence Sheet

xxxxyu/llama3-8b-1.58-100b-tokens-vlut-gguf overview

This repository contains state-of-the-art ternary-packed versions of Llama3-8B-1.58-100B-tokens in GGUF format, optimized for efficient on-device inference using the Vec-LUT method.

Tags: vlut.cpp, gguf, text-generation, ternary, quantized, edge-ai, on-device, en, arxiv:2512.06443, base_model:HF1BitLLM/Llama3-8B-1.58-100B-tokens, base_model:quantized:HF1BitLLM/Llama3-8B-1.58-100B-tokens, license:other, endpoints_compatible, region:us, conversational

Downloads

117

Likes

0

Pipeline

text-generation

Library

vlut.cpp

Visibility

Public

Access

Open

Repository Files & Downloads

5 files detected

Direct downloads for all repository files

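File	Type	Quantization	Size	Link
ggml-model-I1_V.gguf	GGUF	I1_V (1.60 BPW), tile size 1	1.99 GB	Download
ggml-model-I1_V_2.gguf	GGUF	I1_V (1.60 BPW), tile size 2	1.99 GB	Download
ggml-model-I2_V.gguf	GGUF	I2_V (2.00 BPW), tile size 1	2.31 GB	Download
ggml-model-I2_V_4.gguf	GGUF	I2_V (2.00 BPW), tile size 4	2.31 GB	Download
ggml-model-I2_V_8.gguf	GGUF	I2_V (2.00 BPW), tile size 8	2.31 GB	Download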
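
The README quoted in the metadata further down names files as `ggml-model-{PACKING}_{TILE}.gguf`, with the tile suffix omitted when the tile size is 1. A minimal sketch of decoding that convention is shown here; the `Variant` type and `parse_variant` helper are hypothetical names for illustration and are not part of vlut.cpp, and the BPW values are copied from the README's variant table.

```python
import re
from typing import NamedTuple

# Bits per weight for each packing, as listed in the README's variant table.
PACKING_BPW = {"I1_V": 1.60, "I2_V": 2.00}

# ggml-model-{PACKING}_{TILE}.gguf, where "_{TILE}" is absent for tile size 1.
_PATTERN = re.compile(r"^ggml-model-(I[12]_V)(?:_(\d+))?\.gguf$")


class Variant(NamedTuple):
    packing: str
    bpw: float
    tile: int


def parse_variant(filename: str) -> Variant:
    """Split a repository file name into packing, BPW, and tile size."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a recognised variant file name: {filename!r}")
    packing, tile = match.group(1), match.group(2)
    return Variant(packing, PACKING_BPW[packing], int(tile) if tile else 1)


if __name__ == "__main__":
    for name in ("ggml-model-I1_V.gguf", "ggml-model-I2_V_4.gguf"):
        print(name, parse_variant(name))
```

The README marks `I1_V_2` and `I2_V_4` as the recommended starting points, so a selection script would typically filter the parsed variants on those two packing and tile combinations.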

Model Details

Model Slug

xxxxyu/llama3-8b-1.58-100b-tokens-vlut-gguf

Author

XXXXyu

Pipeline Task

text-generation

Library

vlut.cpp

Created

2025-12-29

Last Modified

2026-01-01

Gated

No

Private

No

HF SHA

3ce02519709f400bb9c19d6ec7410e12a2c20d4f

License

other

Language

en

Base Model

HF1BitLLM/Llama3-8B-1.58-100B-tokens

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "other",
    "license_name": "llama3",
    "license_link": "https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE",
    "base_model": "HF1BitLLM/Llama3-8B-1.58-100B-tokens",
    "tags": [
      "text-generation",
      "ternary",
      "quantized",
      "edge-ai",
      "on-device"
    ],
    "language": [
      "en"
    ],
    "library_name": "vlut.cpp",
    "pipeline_tag": "text-generation",
    "frontmatter": {
      "license": "other",
      "license_name": "llama3",
      "license_link": "https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE",
      "base_model": "HF1BitLLM/Llama3-8B-1.58-100B-tokens",
      "tags": [
        "text-generation",
        "ternary",
        "quantized",
        "edge-ai",
        "on-device"
      ],
      "language": [
        "en"
      ],
      "library_name": "vlut.cpp",
      "pipeline_tag": "text-generation"
    },
    "hero_image_url": "",
    "summary": "This repository contains **state-of-the-art ternary-packed versions** of Llama3-8B-1.58-100B-tokens in GGUF format, optimized for efficient on-device inference using the Vec-LUT method.",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: other\nlicense_name: llama3\nlicense_link: https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE\nbase_model: HF1BitLLM/Llama3-8B-1.58-100B-tokens\ntags:\n- text-generation\n- ternary\n- quantized\n- edge-ai\n- on-device\nlanguage:\n- en\nlibrary_name: vlut.cpp\npipeline_tag: text-generation\n---\n\n# Llama3-8B-1.58-100B-tokens-vlut-gguf\n\nThis repository contains **state-of-the-art ternary-packed versions** of [Llama3-8B-1.58-100B-tokens](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens) in GGUF format, optimized for efficient on-device inference using the [Vec-LUT](https://arxiv.org/abs/2512.06443) method.\n\n### Key Features\n\n- **🎯 SOTA Compression**: Achieves BPW (bits per weight) as low as **1.60** through **lossless** sub-2-bit ternary packing.\n- **⚡ SOTA Performance**: Delivers superior throughput (**4.2x speedup**) in **parallel inference** scenarios via vector lookup table (LUT).\n- **🔌 Drop-in Ready**: Seamless integration with [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) for immediate deployment on edge devices.\n\n## Available Model Variants\n\nModels are named as `ggml-model-{PACKING}_{TILE}.gguf`:\n\n| File Name | Packing (BPW) | Tile Size | Comment |\n|---------|---------|--------|------|\n| `ggml-model-I1_V.gguf` | `I1_V` (1.60) | 1 | |\n| `ggml-model-I1_V_2.gguf` | `I1_V` (1.60) | 2 | Recommended |\n| `ggml-model-I2_V.gguf` | `I2_V` (2.00) | 1 | |\n| `ggml-model-I2_V_4.gguf` | `I2_V` (2.00) | 4 | Recommended |\n| `ggml-model-I2_V_8.gguf` | `I2_V` (2.00) | 8 | |\n\n### Selection Guide\n\n- **BPW vs. Speed**: `I1_V` achieves lower memory usage but may not always outperform `I2_V` in speed.\n- **Tiling Trade-off**: Tiled variants (tile size > 1) deliver higher throughput but require larger cache capacity.\n- **Starting Point**: Use `I1_V_2` or `I2_V_4` as a starting point.\n\nFor detailed tiling parameter analysis, see [Evaluation.md](https://github.com/Cipherxzc/vlut.cpp/blob/master/evaluation/Evaluation.md#tiling-parameters) and the paper.\n\n## Usage\n\n### Prerequisites\n\nInstall [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) (these models require vlut.cpp, **not** vanilla llama.cpp):\n\n```bash\ngit clone https://github.com/Cipherxzc/vlut.cpp.git\ncd vlut.cpp\ncmake -B build && cmake --build build --config Release -j4\n```\n\n### Download & Run\n\n```bash\n# Download the recommended variant, e.g., I2_V_4\nhf download <repo_id> \\\n  ggml-model-I2_V_4.gguf --local-dir ./models\n\n# Run parallel inference\n./build/bin/llama-batched \\\n  -m ./models/ggml-model-I2_V_4.gguf \\\n  -p \"I believe the meaning of life is\" \\\n  -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5\n\n# Benchmark performance\n./build/bin/llama-bench \\\n  -m ./models/ggml-model-I2_V_4.gguf \\\n  -t 1 -p 128 -n 0\n```\n\nFor comprehensive usage instructions, refer to the [vlut.cpp Quick Start Guide](https://github.com/Cipherxzc/vlut.cpp/blob/master/README.md#quick-start).\n\n## Citation\n\nIf you use these models, please cite our [paper](https://arxiv.org/abs/2512.06443):\n\n```bibtex\n@article{li2025veclut,\n  title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},\n  author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},\n  journal={arXiv preprint arXiv:2512.06443},\n  year={2025},\n  url={https://arxiv.org/abs/2512.06443}\n}\n```\n\nAnd the original Llama3-8B-1.58-100B-tokens work:\n\n```bibtex\n@misc{,\n      title={1.58-Bit LLM: A New Era of Extreme Quantization}, \n      author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},\n      year={2024},\n}\n```\n",
    "related_quantizations": []
  },
  "tags": [
    "vlut.cpp",
    "gguf",
    "text-generation",
    "ternary",
    "quantized",
    "edge-ai",
    "on-device",
    "en",
    "arxiv:2512.06443",
    "base_model:HF1BitLLM/Llama3-8B-1.58-100B-tokens",
    "base_model:quantized:HF1BitLLM/Llama3-8B-1.58-100B-tokens",
    "license:other",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 117,
  "gated": false,
  "private": false,
  "last_modified": "2026-01-01T08:55:43.000Z",
  "created_at": "2025-12-29T13:36:46.000Z",
  "pipeline_tag": "text-generation",
  "library_name": "vlut.cpp"
}
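
The README embedded in the metadata above quotes 1.60 bits per weight for `I1_V` and calls the packing lossless. One standard way to reach exactly that figure is to store five ternary weights in one byte, which works because 3^5 = 243 fits within the 256 values of a byte, giving 8 / 5 = 1.60 bits per weight. The sketch below illustrates only that arithmetic. Whether `I1_V` uses this byte layout is an assumption, and the function names are hypothetical; vlut.cpp's actual format may order or group weights differently.

```python
from typing import List, Sequence

TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five ternary digits fit in one byte


def pack_trits(trits: Sequence[int]) -> bytes:
    """Pack weights from {-1, 0, 1} as base-3 digits, five per byte."""
    out = bytearray()
    for start in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        # The first weight of each group becomes the least significant digit.
        for trit in reversed(trits[start:start + TRITS_PER_BYTE]):
            value = value * 3 + (trit + 1)
        out.append(value)
    return bytes(out)


def unpack_trits(data: bytes, count: int) -> List[int]:
    """Invert pack_trits; count drops the padding in the final byte."""
    trits: List[int] = []
    for byte in data:
        for _ in range(TRITS_PER_BYTE):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]


if __name__ == "__main__":
    weights = [-1, 0, 1, 1, 0, -1, -1]
    packed = pack_trits(weights)
    assert unpack_trits(packed, len(weights)) == weights  # lossless round trip
    print(len(packed) * 8 / 10, "bits per weight at full occupancy")
```

A plain 2-bit encoding, by contrast, wastes one of its four codes per weight, which matches the 2.00 BPW the README lists for `I2_V`.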

Source payload excerpt (from Hugging Face API)

{
  "_id": "695283ee852441e40ee00d2b",
  "id": "XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf",
  "modelId": "XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf",
  "sha": "3ce02519709f400bb9c19d6ec7410e12a2c20d4f",
  "createdAt": "2025-12-29T13:36:46.000Z",
  "lastModified": "2026-01-01T08:55:43.000Z",
  "author": "XXXXyu",
  "downloads": 117,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "text-generation",
  "library_name": "vlut.cpp",
  "siblings_count": 7
}
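
The normalized metadata and the source payload excerpt carry the same facts under different key names (`created_at` versus `createdAt`, for example). A small sketch of cross-checking the two is given below, using values copied from this sheet; the `FIELD_MAP` table and the helper names are hypothetical and do not come from any Hugging Face library.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Values copied from the two JSON blocks on this sheet.
normalized = {
    "likes": 0, "downloads": 117, "gated": False, "private": False,
    "last_modified": "2026-01-01T08:55:43.000Z",
    "created_at": "2025-12-29T13:36:46.000Z",
    "pipeline_tag": "text-generation", "library_name": "vlut.cpp",
}
payload = {
    "likes": 0, "downloads": 117, "gated": False, "private": False,
    "lastModified": "2026-01-01T08:55:43.000Z",
    "createdAt": "2025-12-29T13:36:46.000Z",
    "pipeline_tag": "text-generation", "library_name": "vlut.cpp",
}

# Normalized key -> payload key (hypothetical mapping for this sketch).
FIELD_MAP: Dict[str, str] = {
    "likes": "likes", "downloads": "downloads", "gated": "gated",
    "private": "private", "last_modified": "lastModified",
    "created_at": "createdAt", "pipeline_tag": "pipeline_tag",
    "library_name": "library_name",
}


def mismatches(norm: dict, raw: dict) -> List[str]:
    """Return the normalized keys whose values differ from the payload."""
    return [n for n, p in FIELD_MAP.items() if norm[n] != raw[p]]


def parse_ts(value: str) -> datetime:
    """Parse the ISO-style UTC timestamps used on this sheet."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    print("mismatched fields:", mismatches(normalized, payload))
    age = parse_ts(normalized["last_modified"]) - parse_ts(normalized["created_at"])
    print("time between creation and last modification:", age)
```

Note that the slug on this sheet is lowercased (`xxxxyu/...`) while the payload `id` preserves the author's casing (`XXXXyu/...`), so any comparison of repository identifiers would need to be case-insensitive.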