Model Intelligence Sheet

mungert/hunyuan-7b-instruct-gguf overview

Comprehensive model page for mungert/hunyuan-7b-instruct-gguf

transformers, gguf, base_model:tencent/Hunyuan-7B-Pretrain, base_model:quantized:tencent/Hunyuan-7B-Pretrain, endpoints_compatible, region:us, conversational

Downloads

288

Likes

0

Pipeline

—

Library

transformers

Visibility

Public

Access

Open

Repository Files & Downloads

23 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Hunyuan-7B-Instruct-bf16.gguf	GGUF	BF16	13.99 GB	Download
Hunyuan-7B-Instruct-bf16_q8_0.gguf	GGUF	BF16	10.12 GB	Download
Hunyuan-7B-Instruct-f16_q8_0.gguf	GGUF	F16	10.12 GB	Download
Hunyuan-7B-Instruct-imatrix.gguf	GGUF	—	4.78 MB	Download
Hunyuan-7B-Instruct-iq2_m.gguf	GGUF	IQ2_M	2.86 GB	Download
Hunyuan-7B-Instruct-iq2_s.gguf	GGUF	IQ2_S	2.85 GB	Download
Hunyuan-7B-Instruct-iq2_xs.gguf	GGUF	IQ2_XS	2.66 GB	Download
Hunyuan-7B-Instruct-iq2_xxs.gguf	GGUF	IQ2_XXS	2.50 GB	Download
Hunyuan-7B-Instruct-iq3_m.gguf	GGUF	IQ3_M	3.74 GB	Download
Hunyuan-7B-Instruct-iq3_xxs.gguf	GGUF	IQ3_XXS	3.21 GB	Download
Hunyuan-7B-Instruct-iq4_nl.gguf	GGUF	IQ4_NL	3.95 GB	Download
Hunyuan-7B-Instruct-iq4_xs.gguf	GGUF	IQ4_XS	3.88 GB	Download
Hunyuan-7B-Instruct-q2_k_m.gguf	GGUF	Q2_K_M	2.99 GB	Download
Hunyuan-7B-Instruct-q2_k_s.gguf	GGUF	Q2_K_S	2.93 GB	Download
Hunyuan-7B-Instruct-q3_k_m.gguf	GGUF	Q3_K_M	3.91 GB	Download
Hunyuan-7B-Instruct-q3_k_s.gguf	GGUF	Q3_K_S	3.84 GB	Download
Hunyuan-7B-Instruct-q4_0.gguf	GGUF	Q4_0	3.94 GB	Download
Hunyuan-7B-Instruct-q4_1.gguf	GGUF	Q4_1	4.38 GB	Download
Hunyuan-7B-Instruct-q4_k_m.gguf	GGUF	Q4_K_M	4.37 GB	Download
Hunyuan-7B-Instruct-q4_k_s.gguf	GGUF	Q4_K_S	4.18 GB	Download
Hunyuan-7B-Instruct-q5_0.gguf	GGUF	Q5_0	4.81 GB	Download
Hunyuan-7B-Instruct-q5_1.gguf	GGUF	Q5_1	5.25 GB	Download
Hunyuan-7B-Instruct-q8_0.gguf	GGUF	Q8_0	7.43 GB	Download
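
The sizes in the table can be used to shortlist a file for a given memory budget. The following is a minimal sketch, not part of the repository: the function names `largest_fitting_quant` and `download_url` are hypothetical, the 1 GB headroom for context and runtime buffers is an assumed placeholder rather than a measured figure, and the `main` revision is an assumption. The repository id is taken from the source payload on this page, and the URL follows the Hugging Face `resolve` pattern.

```python
# File sizes in GB, copied from a subset of the rows in the table above.
QUANT_SIZES_GB = {
    "iq2_xxs": 2.50,
    "iq3_xxs": 3.21,
    "iq4_xs": 3.88,
    "q4_k_m": 4.37,
    "q5_1": 5.25,
    "q8_0": 7.43,
    "bf16": 13.99,
}


def largest_fitting_quant(budget_gb, headroom_gb=1.0):
    """Return the largest listed file that fits in the budget, or None.

    headroom_gb is an assumed allowance for the KV cache and runtime
    buffers; real usage depends on context length and backend.
    """
    limit = budget_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= limit}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def download_url(quant, repo_id="Mungert/Hunyuan-7B-Instruct-GGUF", revision="main"):
    """Build a direct-download URL for one of the GGUF files listed above."""
    filename = f"Hunyuan-7B-Instruct-{quant}.gguf"
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


if __name__ == "__main__":
    choice = largest_fitting_quant(8.0)
    print(choice, download_url(choice))
```

File size is only a lower bound on memory use, so the result is a starting point for testing rather than a guarantee that the file will load.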

Model Details

Model Slug

mungert/hunyuan-7b-instruct-gguf

Author

Mungert

Pipeline Task

—

Library

transformers

Created

2026-02-02

Last Modified

2026-02-02

Gated

No

Private

No

HF SHA

def343acd21d51d99fce0a4e065599c9fa44cd6f

License

Unknown

Language

Unknown

Base Model

tencent/Hunyuan-7B-Pretrain

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "base_model": [
      "tencent/Hunyuan-7B-Pretrain"
    ],
    "library_name": "transformers",
    "frontmatter": {
      "base_model": [
        "tencent/Hunyuan-7B-Pretrain"
      ],
      "library_name": "transformers"
    },
    "hero_image_url": "https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png",
    "summary": "",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nbase_model:\n- tencent/Hunyuan-7B-Pretrain\nlibrary_name: transformers\n---\n\n# <span style=\"color: #7FFF7F;\">Hunyuan-7B-Instruct GGUF Models</span>\n\n\n## <span style=\"color: #7F7FFF;\">Model Generation Details</span>\n\nThis model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`0c21677e4`](https://github.com/ggerganov/llama.cpp/commit/0c21677e43044d27f6f7a7f9f95c67f7c4b3fdb4).\n\n\n\n\n\n---\n\n## <span style=\"color: #7FFF7F;\">Quantization Beyond the IMatrix</span>\n\nI've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides.\n\nIn my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually \"bump\" important layers to higher precision. You can see the implementation here:  \n👉 [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py)\n\nWhile this does increase model file size, it significantly improves precision for a given quantization level.\n\n### **I'd love your feedback—have you tried this? 
How does it perform for you?**\n\n\n\n\n---\n\n<a href=\"https://readyforquantum.com/huggingface_gguf_selection_guide.html\" style=\"color: #7FFF7F;\">\n  Click here to get info on choosing the right GGUF model format\n</a>\n\n---\n\n\n\n<!--Begin Original Model Card-->\n\n\n\n\n<p align=\"center\">\n <img src=\"https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png\" width=\"400\"/> <br>\n</p><p></p>\n\n\n<p align=\"center\">\n    🤗&nbsp;<a href=\"https://huggingface.co/tencent/\"><b>HuggingFace</b></a>&nbsp;|&nbsp;\n    🤖&nbsp;<a href=\"https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-7B-Instruct\"><b>ModelScope</b></a>&nbsp;|&nbsp;\n    🪡&nbsp;<a href=\"https://github.com/Tencent/AngelSlim/tree/main\"><b>AngelSlim</b></a>\n</p>\n\n<p align=\"center\">\n    🖥️&nbsp;<a href=\"https://hunyuan.tencent.com\" style=\"color: red;\"><b>Official Website</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;\n    🕖&nbsp;<a href=\"https://cloud.tencent.com/product/hunyuan\"><b>HunyuanAPI</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;\n    🕹️&nbsp;<a href=\"https://hunyuan.tencent.com/\"><b>Demo</b></a>&nbsp;&nbsp;&nbsp;&nbsp;\n</p>\n\n<p align=\"center\">\n    <a href=\"https://github.com/Tencent-Hunyuan/Hunyuan-7B\"><b>GITHUB</b></a> | \n    <a href=\"https://cnb.cool/tencent/hunyuan/Hunyuan-7B\"><b>cnb.cool</b></a> | \n    <a href=\"https://github.com/Tencent-Hunyuan/Hunyuan-7B/blob/main/LICENSE\"><b>LICENSE</b></a> | \n    <a href=\"https://raw.githubusercontent.com/Tencent-Hunyuan/Hunyuan-A13B/main/assets/1751881231452.jpg\"><b>WeChat</b></a> | \n    <a href=\"https://discord.gg/bsPcMEtV7v\"><b>Discord</b></a>\n</p>\n\n\n## Model Introduction\n\nHunyuan is Tencent's open-source efficient large language model series, designed for versatile deployment across diverse computational environments. 
From edge devices to high-concurrency production systems, these models deliver optimal performance with advanced quantization support and ultra-long context capabilities.\n\nWe have released a series of Hunyuan dense models, comprising both pre-trained and instruction-tuned variants, with parameter scales of 0.5B, 1.8B, 4B, and 7B. These models adopt training strategies similar to the Hunyuan-A13B, thereby inheriting its robust performance characteristics. This comprehensive model family enables flexible deployment optimization - from resource-constrained edge computing with smaller variants to high-throughput production environments with larger models, all while maintaining strong capabilities across diverse scenarios.\n\n### Key Features and Advantages\n\n- **Hybrid Reasoning Support**: Supports both fast and slow thinking modes, allowing users to flexibly choose according to their needs.\n- **Ultra-Long Context Understanding**: Natively supports a 256K context window, maintaining stable performance on long-text tasks.\n- **Enhanced Agent Capabilities**: Optimized for agent tasks, achieving leading results on benchmarks such as BFCL-v3, τ-Bench and C3-Bench.\n- **Efficient Inference**: Utilizes Grouped Query Attention (GQA) and supports multiple quantization formats, enabling highly efficient inference.\n\n## Related News\n* 2025.7.30 We have open-sourced  **Hunyuan-0.5B-Pretrain** ,  **Hunyuan-0.5B-Instruct** , **Hunyuan-1.8B-Pretrain** ,  **Hunyuan-1.8B-Instruct** , **Hunyuan-4B-Pretrain** ,  **Hunyuan-4B-Instruct** , **Hunyuan-7B-Pretrain** ,**Hunyuan-7B-Instruct** on Hugging Face.\n<br>\n\n\n## Benchmark\n\nNote: The following benchmarks are evaluated by TRT-LLM-backend on several **base models**. 
\n\n| Model            | Hunyuan-0.5B-Pretrain | Hunyuan-1.8B-Pretrain | Hunyuan-4B-Pretrain | Hunyuan-7B-Pretrain|\n|:------------------:|:---------------:|:--------------:|:-------------:|:---------------:|\n| MMLU             | 54.02          | 64.62         | 74.01        | 79.82         |\n| MMLU-Redux              |  54.72         | 64.42        | 73.53       | 79         |\n| MMLU-Pro        | 31.15             | 38.65            | 51.91        | 57.79          |\n| SuperGPQA    |  17.23         | 24.98          | 27.28           | 30.47          |\n| BBH       | 45.92          | 74.32         | 75.17        | 82.95          |\n| GPQA             | 27.76             | 35.81            | 43.52        | 44.07          |\n| GSM8K | 55.64             | 77.26            | 87.49       | 88.25         |\n| MATH             | 42.95          | 62.85          | 72.25        | 74.85          |\n| EvalPlus             | 39.71          | 60.67          | 67.76        | 66.96          |\n| MultiPL-E            | 21.83          | 45.92         | 59.87        | 60.41          |\n| MBPP            | 43.38          | 66.14         | 76.46        | 76.19          |\n| CRUX-O         | 30.75             | 36.88           | 56.5        | 60.75          |\n| Chinese SimpleQA            | 12.51             | 22.31            | 30.53        | 38.86          |\n| simpleQA (5shot)            | 2.38             | 3.61            | 4.21        | 5.69          |\n\n\n| Topic               |                        Bench                         | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct|\n|:-------------------:|:----------------------------------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|\n| **Mathematics**     |            AIME 2024<br>AIME 2025<br>MATH            | 17.2<br>20<br>48.5 | 56.7<br>53.9<br>86 | 78.3<br>66.5<br>92.6 | 81.1<br>75.3<br>93.7 |\n| **Science**         |            
GPQA-Diamond<br>OlympiadBench             | 23.3<br>29.6 | 47.2<br>63.4 | 61.1<br>73.1 | 60.1<br>76.5 |\n| **Coding**          |           Livecodebench<br>Fullstackbench            | 11.1<br>20.9 | 31.5<br>42   | 49.4<br>54.6 | 57<br>56.3 |\n| **Reasoning**       |              BBH<br>DROP<br>ZebraLogic               | 40.3<br>52.8<br>34.5 | 64.6<br>76.7<br>74.6 | 83<br>78.2<br>83.5 | 87.8<br>85.9<br>85.1 |\n| **Instruction<br>Following** |        IF-Eval<br>SysBench                  | 49.7<br>28.1 | 67.6<br>55.5 | 76.6<br>68 | 79.3<br>72.7 |\n| **Agent**           | BFCL v3<br> τ-Bench<br>ComplexFuncBench<br> C3-Bench | 49.8<br>14.4<br>13.9<br>45.3 | 58.3<br>18.2<br>22.3<br>54.6 | 67.9<br>30.1<br>26.3<br>64.3 | 70.8<br>35.3<br>29.2<br>68.5 |\n| **Long<br>Context** | PenguinScrolls<br>longbench-v2<br>FRAMES          | 53.9<br>34.7<br>41.9 | 73.1<br>33.2<br>55.6 | 83.1<br>44.1<br>79.2 | 82<br>43<br>78.6 |\n\n\n&nbsp;\n\n### Use with transformers\nFirst, please install transformers.\n```SHELL\npip install \"transformers>=4.56.0\"\n```\nOur model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning. \n1. Pass **\"enable_thinking=False\"** when calling apply_chat_template.\n2. Adding **\"/no_think\"** before the prompt will force the model not to use perform CoT reasoning. Similarly, adding **\"/think\"** before the prompt will force the model to perform CoT reasoning.\n\nThe following code snippet shows how to use the transformers library to load and apply the model. 
It also demonstrates how to enable and disable the reasoning mode , and how to parse the reasoning process along with the final output.\n\nwe use tencent/Hunyuan-7B-Instruct for example\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport os\nimport re\n\nmodel_name_or_path = \"tencent/Hunyuan-7B-Instruct\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_name_or_path)\nmodel = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map=\"auto\")  # You may want to use bfloat16 and/or move to GPU here\nmessages = [\n    {\"role\": \"user\", \"content\": \"Write a short summary of the benefits of regular exercise\"},\n]\ntokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,return_tensors=\"pt\",\n                                                enable_thinking=True # Toggle thinking mode (default: True)\n                                                )\n                                                \noutputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)\n\noutput_text = tokenizer.decode(outputs[0])\nprint(\"output_text=\",output_text)\nthink_pattern = r'<think>(.*?)</think>'\nthink_matches = re.findall(think_pattern, output_text, re.DOTALL)\n\nanswer_pattern = r'<answer>(.*?)</answer>'\nanswer_matches = re.findall(answer_pattern, output_text, re.DOTALL)\n\nthink_content = [match.strip() for match in think_matches][0]\nanswer_content = [match.strip() for match in answer_matches][0]\nprint(f\"thinking_content:{think_content}\\n\\n\")\nprint(f\"answer_content:{answer_content}\\n\\n\")\n\n\n```\n\nWe recommend using the following set of parameters for inference. 
Note that our model does not have the default system_prompt.\n\n```json\n\n{\n  \"do_sample\": true,\n  \"top_k\": 20,\n  \"top_p\": 0.8,\n  \"repetition_penalty\": 1.05,\n  \"temperature\": 0.7\n}\n```\n\n&nbsp;\n\n### Training Data Format\n\nIf you need to fine-tune our Instruct model, we recommend processing the data into the following format, corresponding to both slow-thinking and fast-thinking scenarios.\n\n```python\n# think_pattern\nthink = \"\"\nanswer = \"\"\nthink_pattern = f\"<think>\\n{think}\\n</think>\\n<answer>\\n{answer}\\n</answer>\"\n\n# fast think pattern\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"/no_think海水为什么是咸的\" },\n    {\"role\": \"assistant\", \"content\": \"<think>\\n\\n</think>\\n<answer>\\n海水是咸的主要是因为其中含有许多溶解在水中的盐类和矿物质。这些盐类和矿物质来自于地球表面的岩石和土壤中的化学物质，随着时间的推移，它们被带到了海洋中。当海水蒸发时，水分蒸发掉了，但盐类和矿物质仍然留在水中，导致海水变得更加咸味。因此，海水的咸度是由其中的盐类和矿物质的含量决定的。\\n</answer>\"}\n]\n\n# slow think pattern\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"1+1=\" },\n    {\"role\": \"assistant\", \"content\": \"<think>\\n嗯，用户问的是1加1等于多少。首先，我需要确认这是一个基本的算术问题。1加1在十进制的数学体系中，通常的结果是2。不过，可能需要考虑是否有其他情况，比如二进制或者其他数制，但用户没有特别说明，所以默认应该是十进制。另外，有时候可能会有脑筋急转弯的情况，比如在某些语境下1+1可能等于1（比如1滴水加1滴水还是1滴水），但通常数学问题中都是2。所以最准确的回答应该是2。</think>\\n<answer>\\n在十进制的基本算术运算中，1加1的结果是2。这是数学中最基础的加法运算之一，遵循自然数的加法规则。因此，1 + 1 = 2。\\n</answer>\"}\n]\n\nfrom transformers import AutoTokenizer\ntokenizer = AutoTokenizer.from_pretrained(\"your_tokenizer_path\", trust_remote_code=True)\ntrain_ids = tokenizer.apply_chat_template(messages)\n```\n\n&nbsp;\n\n### Train with LLaMA-Factory\n\nIn the following chapter, we will introduce how to use `LLaMA-Factory` to fine-tune the `Hunyuan` model.\n\n#### Prerequisites\n\nVerify installation of the following dependencies:  \n- **LLaMA-Factory**: Follow [official installation 
guide](https://github.com/hiyouga/LLaMA-Factory)\n- **DeepSpeed** (optional): Follow [official installation guide](https://github.com/deepspeedai/DeepSpeed#installation)\n- **Transformer Library**: Use the companion branch (Hunyuan-submitted code is pending review)\n    ```\n    pip install \"transformers>=4.56.0\"\n    ```\n\n#### Data preparation\n\nWe need to prepare a custom dataset:\n1. Organize your data in `json` format and place it in the `data` directory in `LLaMA-Factory`. The current implementation uses the `sharegpt` dataset format, which requires the following structure:\n```\n[\n  {\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"System prompt (optional)\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"Human instruction\"\n      },\n      {\n        \"role\": \"assistant\",\n        \"content\": \"Model response\"\n      }\n    ]\n  }\n]\n```\nRefer to the [Data Format](#training-data-format) section mentioned earlier for details.\n\n2. Define your dataset in the data/dataset_info.json file using the following format:\n```\n\"dataset_name\": {\n  \"file_name\": \"dataset.json\",\n  \"formatting\": \"sharegpt\",\n  \"columns\": {\n    \"messages\": \"messages\"\n  },\n  \"tags\": {\n    \"role_tag\": \"role\",\n    \"content_tag\": \"content\",\n    \"user_tag\": \"user\",\n    \"assistant_tag\": \"assistant\",\n    \"system_tag\": \"system\"\n  }\n}\n```\n\n#### Training execution\n\n1. Copy all files from the `train/llama_factory_support/example_configs` directory to the `example/hunyuan` directory in `LLaMA-Factory`.\n2. Modify the model path and dataset name in the configuration file `hunyuan_full.yaml`. Adjust other configurations as needed:\n```\n### model\nmodel_name_or_path: [!!!add the model path here!!!]\n\n### dataset\ndataset: [!!!add the dataset name here!!!]\n```\n3. 
Execute training commands:\n    *Single-node training\n    Note: Set the environment variable DISABLE_VERSION_CHECK to 1 to avoid version conflicts.\n    ```\n    export DISABLE_VERSION_CHECK=1\n    llamafactory-cli train examples/hunyuan/hunyuan_full.yaml\n    ```\n    *Multi-node training\n    Execute the following command on each node. Configure NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT according to your environment:\n    ```\n    export DISABLE_VERSION_CHECK=1\n    FORCE_TORCHRUN=1 NNODES=${NNODES} NODE_RANK=${NODE_RANK} MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT} \\\n    llamafactory-cli train examples/hunyuan/hunyuan_full.yaml\n    ```\n\n&nbsp;\n\n\n## Quantization Compression\nWe used our own [AngleSlim](https://github.com/tencent/AngelSlim) compression tool to produce FP8 and INT4 quantization models. `AngleSlim` is a toolset dedicated to creating a more user-friendly, comprehensive and efficient model compression solution.\n\n### FP8 Quantization\nWe use FP8-static quantization, FP8 quantization adopts 8-bit floating point format, through a small amount of calibration data (without training) to pre-determine the quantization scale, the model weights and activation values will be converted to FP8 format, to improve the inference efficiency and reduce the deployment threshold. We you can use AngleSlim quantization, you can also directly download our quantization completed open source model to use [LINK](https://huggingface.co/).\n\n### Int4 Quantization\nWe use the GPTQ and AWQ algorithm to achieve W4A16 quantization.\n\nGPTQ processes the model weights layer by layer, uses a small amount of calibration data to minimize the reconfiguration error of the quantized weights, and adjusts the weights layer by layer by the optimization process of approximating the Hessian inverse matrix. 
The process eliminates the need to retrain the model and requires only a small amount of calibration data to quantize the weights, improving inference efficiency and lowering the deployment threshold. \nAWQ using a small amount of calibration data (without the need for training), the amplitude of the activation values is statistically calculated. For each weight channel, a scaling coefficient s is computed to expand the numerical range of important weights, allowing more information to be retained during quantization.\n\nYou can use  [AngleSlim](https://github.com/tencent/AngelSlim) quantization, you can also directly download our quantization completed open source model to use [LINK](https://huggingface.co/).\n\n\n\n#### Quantization Benchmark\nThis subsection describes the Benchmark metrics for the Hunyuan quantitative model.\n\n|     Bench     |           Quantization            |    Hunyuan-0.5B-Instruct     |     Hunyuan-1.8B-Instruct      |     Hunyuan-4B-Instruct      |     Hunyuan-7B-Instruct      |\n|:-------------:|:---------------------------------:|:----------------------------:|:------------------------------:|:----------------------------:|:----------------------------:|\n|     DROP      | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 52.8<br>51.6<br>50.9<br>48.9 |  76.7<br>75.1<br>73.0<br>71.7  | 78.2<br>78.3<br>78.1<br>78.2 | 85.9<br>86.0<br>85.7<br>85.9 |\n| GPQA-Diamond  | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 23.3<br>22.5<br>23.3<br>23.3 | 47.2<br>47.7<br>44.43<br>43.62 |  61.1<br>60.2<br>58.1<br>-   | 60.1<br>60.1<br>60.0<br>60.1 |\n| OlympiadBench | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 29.6<br>29.6<br>26.8<br>26.3 |  63.4<br>62.5<br>60.9<br>61.7  | 73.1<br>73.1<br>71.1<br>71.2 | 76.5<br>76.6<br>76.2<br>76.4 |\n|   AIME 2024   | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ |    17.2<br>17.2<br>-<br>-    |    56.7<br>55.17<br>-<br>-     |    78.3<br>76.6<br>-<br>-    | 81.1<br>80.9<br>81.0<br>80.9 |\n\n\n## Deployment   \n\nFor deployment, you can use frameworks such as 
**TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.\n\nimage: https://hub.docker.com/r/hunyuaninfer/hunyuan-7B/tags \n\n\n### TensorRT-LLM\n\n#### Docker Image \n\nWe provide a pre-built Docker image based on the latest version of TensorRT-LLM.\n\nWe use tencent/Hunyuan-7B-Instruct for example\n- To get started:\n\nhttps://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags \n\n```\ndocker pull hunyuaninfer/hunyuan-7B:hunyuan-moe-7B-trtllm\n```\n```\ndocker run --privileged --user root --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-7B:hunyuan-moe-7B-trtllm\n```\n\n- Prepare Configuration file:\n\n```\ncat >/path/to/extra-llm-api-config.yml <<EOF\nuse_cuda_graph: true\ncuda_graph_padding_enabled: true\ncuda_graph_batch_sizes:\n- 1\n- 2\n- 4\n- 8\n- 16\n- 32\nprint_iter_log: true\nEOF\n```\n\n\n- Start the API server:\n\n\n```\ntrtllm-serve \\\n  /path/to/HunYuan-moe-7B \\\n  --host localhost \\\n  --port 8000 \\\n  --backend pytorch \\\n  --max_batch_size 32 \\\n  --max_num_tokens 16384 \\\n  --tp_size 2 \\\n  --kv_cache_free_gpu_memory_fraction 0.6 \\\n  --trust_remote_code \\\n  --extra_llm_api_options /path/to/extra-llm-api-config.yml\n```\n\n\n### vllm\n\n#### Start\nPlease use vLLM version v0.10.0 or higher for inference.\n\nWe use tencent/Hunyuan-7B-Instruct for example\n- Download Model file: \n  - Huggingface:  will download automicly by vllm.\n  - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-7B-Instruct`\n  \n- model download by huggingface:\n```shell\nexport MODEL_PATH=tencent/Hunyuan-7B-Instruct\n``` \n\n- model downloaded by modelscope:\n```shell\nexport MODEL_PATH=/root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-7B-Instruct/\n```\n\n- Start the API server:\n\n```shell\npython3 -m vllm.entrypoints.openai.api_server \\\n    --host 0.0.0.0 \\\n    --port 8000 \\\n    --trust-remote-code \\\n    --model 
${MODEL_PATH} \\\n    --tensor-parallel-size 1 \\\n    --dtype bfloat16 \\\n    --quantization experts_int8 \\\n    --served-model-name hunyuan \\\n    2>&1 | tee log_server.txt\n``` \n- After running service script successfully, run the request script\n```shell\ncurl http://0.0.0.0:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{\n\"model\": \"hunyuan\",\n\"messages\": [\n    {\n        \"role\": \"system\",\n        \"content\": [{\"type\": \"text\", \"text\": \"You are a helpful assistant.\"}]\n    },\n    {\n        \"role\": \"user\",\n        \"content\": [{\"type\": \"text\", \"text\": \"请按面积大小对四大洋进行排序，并给出面积最小的洋是哪一个？直接输出结果。\"}]\n    }\n],\n\"max_tokens\": 2048,\n\"temperature\":0.7,\n\"top_p\": 0.6,\n\"top_k\": 20,\n\"repetition_penalty\": 1.05,\n\"stop_token_ids\": [127960]\n}'\n```\n#### Quantitative model deployment\nThis section describes the process of deploying a post-quantization model using vLLM.\n\nDefault server in BF16.\n\n##### Int8 quantitative model deployment\nDeploying the Int8-weight-only version of the HunYuan-7B model only requires setting the environment variables\n\nNext we start the Int8 service. Run:\n```shell\npython3 -m vllm.entrypoints.openai.api_server \\\n    --host 0.0.0.0 \\\n    --port 8000 \\\n    --trust-remote-code \\\n    --model ${MODEL_PATH} \\\n    --tensor-parallel-size 1 \\\n    --dtype bfloat16 \\\n    --served-model-name hunyuan \\\n    --quantization experts_int8 \\\n    2>&1 | tee log_server.txt\n```\n\n\n##### Int4 quantitative model deployment\nDeploying the Int4-weight-only version of the HunYuan-7B model only requires setting the environment variables , using the GPTQ method\n```shell\nexport MODEL_PATH=PATH_TO_INT4_MODEL\n```\nNext we start the Int4 service. 
Run\n```shell\npython3 -m vllm.entrypoints.openai.api_server \\\n    --host 0.0.0.0 \\\n    --port 8000 \\\n    --trust-remote-code \\\n    --model ${MODEL_PATH} \\\n    --tensor-parallel-size 1 \\\n    --dtype bfloat16 \\\n    --served-model-name hunyuan \\\n    --quantization gptq_marlin \\\n    2>&1 | tee log_server.txt\n```\n\n##### FP8 quantitative model deployment\nDeploying the W8A8C8 version of the HunYuan-7B model only requires setting the environment variables\n\n\nNext we start the FP8 service. Run\n```shell\npython3 -m vllm.entrypoints.openai.api_server \\\n    --host 0.0.0.0 \\\n    --port 8000 \\\n    --trust-remote-code \\\n    --model ${MODEL_PATH} \\\n    --tensor-parallel-size 1 \\\n    --dtype bfloat16 \\\n    --served-model-name hunyuan \\\n    --kv-cache-dtype fp8 \\\n    2>&1 | tee log_server.txt\n```\n\n\n\n\n### SGLang\n\n#### Docker Image \n\nWe also provide a pre-built Docker image based on the latest version of SGLang.\n\nWe use tencent/Hunyuan-7B-Instruct for example\n\nTo get started:\n\n- Pull the Docker image\n\n```\ndocker pull lmsysorg/sglang:latest\n```\n\n- Start the API server:\n\n```\ndocker run --entrypoint=\"python3\" --gpus all \\\n    --shm-size 32g \\\n    -p 30000:30000 \\\n    --ulimit nproc=10000 \\\n    --privileged \\\n    --ipc=host \\\n     lmsysorg/sglang:latest \\\n    -m sglang.launch_server --model-path hunyuan/huanyuan_7B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000\n```\n\n\n## Contact Us\n\nIf you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . 
You can also contact us via email (hunyuan_opensource@tencent.com).\n\n<!--End Original Model Card-->\n\n---\n\n# <span id=\"testllm\" style=\"color: #7F7FFF;\">🚀 If you find these models useful</span>\n\nHelp me test my **AI-Powered Quantum Network Monitor Assistant** with **quantum-ready security checks**:  \n\n👉 [Quantum Network Monitor](https://readyforquantum.com/?assistant=open&utm_source=huggingface&utm_medium=referral&utm_campaign=huggingface_repo_readme)  \n\n\nThe full Open Source Code for the Quantum Network Monitor Service available at my github repos ( repos with NetworkMonitor in the name) : [Source Code Quantum Network Monitor](https://github.com/Mungert69). You will also find the code I use to quantize the models if you want to do it yourself [GGUFModelBuilder](https://github.com/Mungert69/GGUFModelBuilder)\n\n💬 **How to test**:  \n Choose an **AI assistant type**:  \n   - `TurboLLM` (GPT-4.1-mini)  \n   - `HugLLM` (Hugginface Open-source models)  \n   - `TestLLM` (Experimental CPU-only)  \n\n### **What I’m Testing**  \nI’m pushing the limits of **small open-source models for AI network monitoring**, specifically:  \n- **Function calling** against live network services  \n- **How small can a model go** while still handling:  \n  - Automated **Nmap security scans**  \n  - **Quantum-readiness checks**  \n  - **Network Monitoring tasks**  \n\n🟡 **TestLLM** – Current experimental model (llama.cpp on 2 CPU threads on huggingface docker space):  \n- ✅ **Zero-configuration setup**  \n- ⏳ 30s load time (slow inference but **no API costs**) . No token limited as the cost is low.\n- 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!  \n\n### **Other Assistants**  \n🟢 **TurboLLM** – Uses **gpt-4.1-mini** :\n- **It performs very well but unfortunatly OpenAI charges per token. For this reason tokens usage is limited. 
\n- **Create custom cmd processors to run .net code on Quantum Network Monitor Agents**\n- **Real-time network diagnostics and monitoring**\n- **Security Audits**\n- **Penetration testing** (Nmap/Metasploit)  \n\n🔵 **HugLLM** – Latest Open-source models:  \n- 🌐 Runs on Hugging Face Inference API. Performs pretty well using the lastest models hosted on Novita.\n\n### 💡 **Example commands you could test**:  \n1. `\"Give me info on my websites SSL certificate\"`  \n2. `\"Check if my server is using quantum safe encyption for communication\"`  \n3. `\"Run a comprehensive security audit on my server\"`\n4. '\"Create a cmd processor to .. (what ever you want)\" Note you need to install a [Quantum Network Monitor Agent](https://readyforquantum.com/Download/?utm_source=huggingface&utm_medium=referral&utm_campaign=huggingface_repo_readme) to run the .net code on. This is a very flexible and powerful feature. Use with caution!\n\n### Final Word\n\nI fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is [open source](https://github.com/Mungert69). Feel free to use whatever you find helpful.\n\nIf you appreciate the work, please consider [buying me a coffee](https://www.buymeacoffee.com/mahadeva) ☕. Your support helps cover service costs and allows me to raise token limits for everyone.\n\nI'm also open to job opportunities or sponsorship.\n\nThank you! 😊\n",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "base_model:tencent/Hunyuan-7B-Pretrain",
    "base_model:quantized:tencent/Hunyuan-7B-Pretrain",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 0,
  "downloads": 288,
  "gated": false,
  "private": false,
  "last_modified": "2026-02-02T20:50:41.000Z",
  "created_at": "2026-02-02T17:22:17.000Z",
  "pipeline_tag": "",
  "library_name": "transformers"
}
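
The README stored in `readme_markdown` extracts the reasoning and the final answer with `re.findall(...)` followed by `[0]`, which raises `IndexError` when a tag pair is missing, for example when generation stops at `max_new_tokens` before `</answer>` is emitted. A more defensive variant might look like the sketch below. The function name `split_reasoning` is hypothetical, and the tag format is assumed to match the `<think>...</think><answer>...</answer>` layout shown in the README.

```python
import re

# Tag patterns follow the layout shown in the README's training data format.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_reasoning(output_text):
    """Return (think, answer) from decoded model output.

    Each element is the stripped tag content, or None when that tag
    pair is absent (for example, when the output was truncated).
    """
    think = THINK_RE.search(output_text)
    answer = ANSWER_RE.search(output_text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


if __name__ == "__main__":
    sample = "<think>\nBasic arithmetic.\n</think>\n<answer>\n1 + 1 = 2\n</answer>"
    print(split_reasoning(sample))
```

In fast-thinking mode the `<think>` block is present but empty, so the first element comes back as an empty string rather than `None`, which lets a caller tell "no reasoning requested" apart from "output truncated".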

Source payload excerpt (from Hugging Face API)

{
  "_id": "6980dd490d47ac882886b89f",
  "id": "Mungert/Hunyuan-7B-Instruct-GGUF",
  "modelId": "Mungert/Hunyuan-7B-Instruct-GGUF",
  "sha": "def343acd21d51d99fce0a4e065599c9fa44cd6f",
  "createdAt": "2026-02-02T17:22:17.000Z",
  "lastModified": "2026-02-02T20:50:41.000Z",
  "author": "Mungert",
  "downloads": 288,
  "likes": 0,
  "gated": false,
  "private": false,
  "pipeline_tag": "",
  "library_name": "transformers",
  "siblings_count": 25
}