Model Intelligence Sheet
XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf overview
This repository contains state-of-the-art ternary-packed versions of Llama3-8B-1.58-100B-tokens in GGUF format, optimized for efficient on-device inference using the Vec-LUT method.
Downloads: 117
Likes: 0
Pipeline: text-generation
Library: vlut.cpp
Visibility: Public
Access: Open
Repository Files & Downloads
5 files detected
Model Details
Metadata Inspector
Normalized metadata (stored in metadata_json)
{
"metadata": {},
"card_data": {
"license": "other",
"license_name": "llama3",
"license_link": "https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE",
"base_model": "HF1BitLLM/Llama3-8B-1.58-100B-tokens",
"tags": [
"text-generation",
"ternary",
"quantized",
"edge-ai",
"on-device"
],
"language": [
"en"
],
"library_name": "vlut.cpp",
"pipeline_tag": "text-generation",
"frontmatter": {
"license": "other",
"license_name": "llama3",
"license_link": "https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE",
"base_model": "HF1BitLLM/Llama3-8B-1.58-100B-tokens",
"tags": [
"text-generation",
"ternary",
"quantized",
"edge-ai",
"on-device"
],
"language": [
"en"
],
"library_name": "vlut.cpp",
"pipeline_tag": "text-generation"
},
"hero_image_url": "",
"summary": "This repository contains **state-of-the-art ternary-packed versions** of Llama3-8B-1.58-100B-tokens in GGUF format, optimized for efficient on-device inference using the Vec-LUT method. ### Key Features",
"quick_links": [],
"benchmark_table_html": "",
"readme_markdown": "---\nlicense: other\nlicense_name: llama3\nlicense_link: https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE\nbase_model: HF1BitLLM/Llama3-8B-1.58-100B-tokens\ntags:\n- text-generation\n- ternary\n- quantized\n- edge-ai\n- on-device\nlanguage:\n- en\nlibrary_name: vlut.cpp\npipeline_tag: text-generation\n---\n\n# Llama3-8B-1.58-100B-tokens-vlut-gguf\n\nThis repository contains **state-of-the-art ternary-packed versions** of [Llama3-8B-1.58-100B-tokens](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens) in GGUF format, optimized for efficient on-device inference using the [Vec-LUT](https://arxiv.org/abs/2512.06443) method.\n\n### Key Features\n\n- **🎯 SOTA Compression**: Achieves BPW (bits per weight) as low as **1.60** through **lossless** sub-2-bit ternary packing.\n- **⚡ SOTA Performance**: Delivers superior throughput (**4.2x speedup**) in **parallel inference** scenarios via vector lookup table (LUT).\n- **🔌 Drop-in Ready**: Seamless integration with [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) for immediate deployment on edge devices.\n\n## Available Model Variants\n\nModels are named as `ggml-model-{PACKING}_{TILE}.gguf`:\n\n| File Name | Packing (BPW) | Tile Size | Comment |\n|---------|---------|--------|------|\n| `ggml-model-I1_V.gguf` | `I1_V` (1.60) | 1 | |\n| `ggml-model-I1_V_2.gguf` | `I1_V` (1.60) | 2 | Recommended |\n| `ggml-model-I2_V.gguf` | `I2_V` (2.00) | 1 | |\n| `ggml-model-I2_V_4.gguf` | `I2_V` (2.00) | 4 | Recommended |\n| `ggml-model-I2_V_8.gguf` | `I2_V` (2.00) | 8 | |\n\n### Selection Guide\n\n- **BPW vs. 
Speed**: `I1_V` achieves lower memory usage but may not always outperform `I2_V` in speed.\n- **Tiling Trade-off**: Tiled variants (tile size > 1) deliver higher throughput but require larger cache capacity.\n- **Starting Point**: Use `I1_V_2` or `I2_V_4` as a starting point.\n\nFor detailed tiling parameter analysis, see [Evaluation.md](https://github.com/Cipherxzc/vlut.cpp/blob/master/evaluation/Evaluation.md#tiling-parameters) and the paper.\n\n## Usage\n\n### Prerequisites\n\nInstall [vlut.cpp](https://github.com/Cipherxzc/vlut.cpp) (these models require vlut.cpp, **not** vanilla llama.cpp):\n\n```bash\ngit clone https://github.com/Cipherxzc/vlut.cpp.git\ncd vlut.cpp\ncmake -B build && cmake --build build --config Release -j4\n```\n\n### Download & Run\n\n```bash\n# Download the recommended variant, e.g., I2_V_4\nhf download <repo_id> \\\n ggml-model-I2_V_4.gguf --local-dir ./models\n\n# Run parallel inference\n./build/bin/llama-batched \\\n -m ./models/ggml-model-I2_V_4.gguf \\\n -p \"I believe the meaning of life is\" \\\n -np 32 -n 16 -t 1 --temp 0.5 --repeat-penalty 1.5\n\n# Benchmark performance\n./build/bin/llama-bench \\\n -m ./models/ggml-model-I2_V_4.gguf \\\n -t 1 -p 128 -n 0\n```\n\nFor comprehensive usage instructions, refer to the [vlut.cpp Quick Start Guide](https://github.com/Cipherxzc/vlut.cpp/blob/master/README.md#quick-start).\n\n## Citation\n\nIf you use these models, please cite our [paper](https://arxiv.org/abs/2512.06443):\n\n```bibtex\n@article{li2025veclut,\n title={Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices},\n author={Li, Xiangyu and Yin, Chengyu and Wang, Weijun and Wei, Jianyu and Cao, Ting and Liu, Yunxin},\n journal={arXiv preprint arXiv:2512.06443},\n year={2025},\n url={https://arxiv.org/abs/2512.06443}\n}\n```\n\nAnd the original Llama3-8B-1.58-100B-tokens work:\n\n```bibtex\n@misc{,\n title={1.58-Bit LLM: A New Era of Extreme Quantization}, \n author={Mohamed Mekkouri and Marc Sun and 
Leandro von Werra and Thomas Wolf},\n year={2024},\n}\n```\n",
"related_quantizations": []
},
"tags": [
"vlut.cpp",
"gguf",
"text-generation",
"ternary",
"quantized",
"edge-ai",
"on-device",
"en",
"arxiv:2512.06443",
"base_model:HF1BitLLM/Llama3-8B-1.58-100B-tokens",
"base_model:quantized:HF1BitLLM/Llama3-8B-1.58-100B-tokens",
"license:other",
"endpoints_compatible",
"region:us",
"conversational"
],
"likes": 0,
"downloads": 117,
"gated": false,
"private": false,
"last_modified": "2026-01-01T08:55:43.000Z",
"created_at": "2025-12-29T13:36:46.000Z",
"pipeline_tag": "text-generation",
"library_name": "vlut.cpp"
}
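The readme stored above quotes 1.60 bits per weight for the `I1_V` packing and calls it lossless. One well-known way to reach exactly that figure is to store five ternary weights per byte as a base-3 number, since 3^5 = 243 fits in the 256 values of a byte and 8 / 5 = 1.60. The sketch below illustrates that arithmetic only. It is an assumption that `I1_V` uses a comparable layout; the source does not describe the actual bit layout, and the function names `pack_trits` and `unpack_trits` are hypothetical, not part of vlut.cpp.

```python
def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, five weights per byte.

    Illustrative base-3 packing only; not the vlut.cpp on-disk format.
    """
    out = bytearray()
    for start in range(0, len(weights), 5):
        group = list(weights[start:start + 5])
        # Pad a short final group with zeros so every byte holds five trits.
        group += [0] * (5 - len(group))
        value = 0
        for w in group:
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w!r}")
            # Map -1/0/+1 to base-3 digits 0/1/2, most significant first.
            value = value * 3 + (w + 1)
        out.append(value)  # at most 3**5 - 1 = 242, so it fits in one byte
    return bytes(out)


def unpack_trits(data, count):
    """Recover the first `count` ternary weights from packed bytes."""
    weights = []
    for byte in data:
        digits = []
        for _ in range(5):
            byte, digit = divmod(byte, 3)
            digits.append(digit - 1)  # map digits 0/1/2 back to -1/0/+1
        # divmod yields the least significant digit first, so reverse.
        weights.extend(reversed(digits))
    return weights[:count]


weights = [-1, 0, 1, 1, 0, -1, -1, 0, 0, 1]
packed = pack_trits(weights)
bits_per_weight = 8 * len(packed) / len(weights)
```

With ten weights the packed form takes two bytes, so `bits_per_weight` comes out at 1.6, and `unpack_trits(packed, len(weights))` returns the original list, which is what "lossless" means here. A plain 2-bit encoding of the same weights would correspond to the 2.00 BPW listed for `I2_V`.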
Source payload excerpt (from Hugging Face API)
{
"_id": "695283ee852441e40ee00d2b",
"id": "XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf",
"modelId": "XXXXyu/Llama3-8B-1.58-100B-tokens-vlut-gguf",
"sha": "3ce02519709f400bb9c19d6ec7410e12a2c20d4f",
"createdAt": "2025-12-29T13:36:46.000Z",
"lastModified": "2026-01-01T08:55:43.000Z",
"author": "XXXXyu",
"downloads": 117,
"likes": 0,
"gated": false,
"private": false,
"pipeline_tag": "text-generation",
"library_name": "vlut.cpp",
"siblings_count": 7
}
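The readme in the normalized metadata names the model files as `ggml-model-{PACKING}_{TILE}.gguf` and lists a BPW value for each packing. As a small sketch of how a download script might interpret those names, the helper below parses a file name into its packing, BPW, and tile size. The function name `describe_variant` is hypothetical, the BPW values are copied from the readme's table, and treating a missing tile suffix as tile size 1 follows the table's entries for `ggml-model-I1_V.gguf` and `ggml-model-I2_V.gguf`.

```python
import re

# BPW per packing, as listed in the readme's variant table.
BPW_BY_PACKING = {"I1_V": 1.60, "I2_V": 2.00}

# Matches e.g. "ggml-model-I1_V.gguf" or "ggml-model-I2_V_4.gguf".
_VARIANT_NAME = re.compile(r"^ggml-model-(I1_V|I2_V)(?:_(\d+))?\.gguf$")


def describe_variant(file_name):
    """Split a variant file name into packing, BPW, and tile size."""
    match = _VARIANT_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized variant file name: {file_name!r}")
    packing = match.group(1)
    # No tile suffix means tile size 1, per the readme's table.
    tile = int(match.group(2)) if match.group(2) else 1
    return {"packing": packing, "bpw": BPW_BY_PACKING[packing], "tile": tile}


recommended = [
    describe_variant("ggml-model-I1_V_2.gguf"),
    describe_variant("ggml-model-I2_V_4.gguf"),
]
```

The two entries in `recommended` correspond to the variants the readme marks as starting points; any name outside the documented pattern raises `ValueError` instead of being guessed at.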