Model Intelligence Sheet

mungert/qwen2.5-omni-3b-gguf overview

Comprehensive model page for mungert/qwen2.5-omni-3b-gguf

Tags: transformers, gguf, multimodal, any-to-any, en, arxiv:2503.20215, license:other, endpoints_compatible, region:us, conversational

Downloads

176

Likes

Pipeline

any-to-any

Library

transformers

Visibility

Public

Access

Open

Repository Files & Downloads

27 files detected

Direct downloads for all repository files

File	Type	Quantization	Size	Link
Qwen2.5-Omni-3B-bf16.gguf	GGUF	BF16	6.33 GB	Download
Qwen2.5-Omni-3B-bf16_q8_0.gguf	GGUF	BF16_Q8_0	4.77 GB	Download
Qwen2.5-Omni-3B-f16_q8_0.gguf	GGUF	F16_Q8_0	4.77 GB	Download
Qwen2.5-Omni-3B-iq2_m.gguf	GGUF	IQ2_M	1.31 GB	Download
Qwen2.5-Omni-3B-iq2_s.gguf	GGUF	IQ2_S	1.26 GB	Download
Qwen2.5-Omni-3B-iq2_xs.gguf	GGUF	IQ2_XS	1.23 GB	Download
Qwen2.5-Omni-3B-iq2_xxs.gguf	GGUF	IQ2_XXS	1.16 GB	Download
Qwen2.5-Omni-3B-iq3_m.gguf	GGUF	IQ3_M	1.53 GB	Download
Qwen2.5-Omni-3B-iq3_s.gguf	GGUF	IQ3_S	1.52 GB	Download
Qwen2.5-Omni-3B-iq3_xs.gguf	GGUF	IQ3_XS	1.46 GB	Download
Qwen2.5-Omni-3B-iq3_xxs.gguf	GGUF	IQ3_XXS	1.40 GB	Download
Qwen2.5-Omni-3B-iq4_nl.gguf	GGUF	IQ4_NL	1.86 GB	Download
Qwen2.5-Omni-3B-iq4_xs.gguf	GGUF	IQ4_XS	1.77 GB	Download
Qwen2.5-Omni-3B-q2_k_m.gguf	GGUF	Q2_K_M	1.41 GB	Download
Qwen2.5-Omni-3B-q2_k_s.gguf	GGUF	Q2_K_S	1.27 GB	Download
Qwen2.5-Omni-3B-q3_k_m.gguf	GGUF	Q3_K_M	1.71 GB	Download
Qwen2.5-Omni-3B-q3_k_s.gguf	GGUF	Q3_K_S	1.54 GB	Download
Qwen2.5-Omni-3B-q4_0.gguf	GGUF	Q4_0	1.79 GB	Download
Qwen2.5-Omni-3B-q4_1.gguf	GGUF	Q4_1	1.98 GB	Download
Qwen2.5-Omni-3B-q4_k_m.gguf	GGUF	Q4_K_M	2.02 GB	Download
Qwen2.5-Omni-3B-q4_k_s.gguf	GGUF	Q4_K_S	1.96 GB	Download
Qwen2.5-Omni-3B-q5_0.gguf	GGUF	Q5_0	2.18 GB	Download
Qwen2.5-Omni-3B-q5_1.gguf	GGUF	Q5_1	2.38 GB	Download
Qwen2.5-Omni-3B-q5_k_m.gguf	GGUF	Q5_K_M	2.31 GB	Download
Qwen2.5-Omni-3B-q5_k_s.gguf	GGUF	Q5_K_S	2.29 GB	Download
Qwen2.5-Omni-3B-q6_k_m.gguf	GGUF	Q6_K_M	2.60 GB	Download
Qwen2.5-Omni-3B-q8_0.gguf	GGUF	Q8_0	3.37 GB	Download
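
The sizes in the table above can be used to shortlist a file for a given memory budget. The following is a minimal sketch, not part of the original page: the `pick_file` helper and the 1 GB headroom default are illustrative assumptions (actual runtime overhead depends on context length and the inference stack), and only a subset of the listed files is included.

```python
# File sizes in GB, copied from the table above (subset for brevity).
FILES = {
    "Qwen2.5-Omni-3B-iq2_xxs.gguf": 1.16,
    "Qwen2.5-Omni-3B-iq3_m.gguf": 1.53,
    "Qwen2.5-Omni-3B-q4_k_m.gguf": 2.02,
    "Qwen2.5-Omni-3B-q5_k_m.gguf": 2.31,
    "Qwen2.5-Omni-3B-q6_k_m.gguf": 2.60,
    "Qwen2.5-Omni-3B-q8_0.gguf": 3.37,
    "Qwen2.5-Omni-3B-bf16.gguf": 6.33,
}


def pick_file(budget_gb, headroom_gb=1.0):
    """Return the largest listed file whose size plus headroom fits the budget.

    headroom_gb is a hypothetical allowance for runtime buffers; it is an
    assumption of this sketch, not a measured figure. Returns None if
    nothing fits.
    """
    fitting = {
        name: size
        for name, size in FILES.items()
        if size + headroom_gb <= budget_gb
    }
    if not fitting:
        return None
    # Larger files generally mean higher-precision quantization.
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for budget in (2.0, 4.0, 8.0):
        print(budget, "GB ->", pick_file(budget))
```

Picking the largest file that fits is only a starting heuristic; the format guidance in the README below still applies when choosing between quantization families.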

Model Details

Model Slug

mungert/qwen2.5-omni-3b-gguf

Author

Mungert

Pipeline Task

any-to-any

Library

transformers

Created

2025-06-10

Last Modified

2025-09-24

Gated

Private

HF SHA

51a713feff2c285a9b021c243a78cfb8cbd6a71f

License

other

Language

Base Model

Unknown

Metadata Inspector

Normalized metadata (stored in metadata_json)

{
  "metadata": {},
  "card_data": {
    "license": "other",
    "license_name": "qwen-research",
    "license_link": "LICENSE",
    "language": [
      "en"
    ],
    "tags": [
      "multimodal"
    ],
    "library_name": "transformers",
    "pipeline_tag": "any-to-any",
    "frontmatter": {
      "license": "other",
      "license_name": "qwen-research",
      "license_link": "LICENSE",
      "language": [
        "en"
      ],
      "tags": [
        "multimodal"
      ],
      "library_name": "transformers",
      "pipeline_tag": "any-to-any"
    },
    "hero_image_url": "https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5",
    "summary": "",
    "quick_links": [],
    "benchmark_table_html": "",
    "readme_markdown": "---\nlicense: other\nlicense_name: qwen-research\nlicense_link: LICENSE\nlanguage:\n- en\ntags:\n- multimodal\nlibrary_name: transformers\npipeline_tag: any-to-any\n---\n\n# <span style=\"color: #7FFF7F;\">Qwen2.5-Omni-3B GGUF Models</span>\n\n\n## <span style=\"color: #7F7FFF;\">Model Generation Details</span>\n\nThis model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`7f4fbe51`](https://github.com/ggerganov/llama.cpp/commit/7f4fbe5183b23b6b2e25fd1ccc5d1fa8bb010cb7).\n\n\n\n\n## <span style=\"color: #7FFF7F;\"> Quantization beyond the IMatrix</span>\n\nTesting a new quantization method using rules to bump important layers above what the standard imatrix would use.\n\nI have found that the standard IMatrix does not perform very well at low bit quantiztion and for MOE models. So I am using llama.cpp --tensor-type to bump up selected layers. See [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py)\n\nThis does create larger model files but increases precision for a given model size.\n\n### **Please provide feedback on how you find this method performs**\n\n\n\n\n## **Choosing the Right Model Format**  \n\nSelecting the correct model format depends on your **hardware capabilities** and **memory constraints**.  \n\n### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**  \n- A 16-bit floating-point format designed for **faster computation** while retaining good precision.  \n- Provides **similar dynamic range** as FP32 but with **lower memory usage**.  \n- Recommended if your hardware supports **BF16 acceleration** (check your device's specs).  \n- Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32.  \n\n📌 **Use BF16 if:**  \n✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).  \n✔ You want **higher precision** while saving memory.  
\n✔ You plan to **requantize** the model into another format.  \n\n📌 **Avoid BF16 if:**  \n❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).  \n❌ You need compatibility with older devices that lack BF16 optimization.  \n\n---\n\n### **F16 (Float 16) – More widely supported than BF16**  \n- A 16-bit floating-point format with **high precision** but a narrower range of values than BF16.  \n- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).  \n- More mantissa bits than BF16 (10 vs. 7), so finer numerical precision, but a narrower dynamic range; generally sufficient for inference.  \n\n📌 **Use F16 if:**  \n✔ Your hardware supports **FP16** but **not BF16**.  \n✔ You need a **balance between speed, memory usage, and accuracy**.  \n✔ You are running on a **GPU** or another device optimized for FP16 computations.  \n\n📌 **Avoid F16 if:**  \n❌ Your device lacks **native FP16 support** (it may run slower than expected).  \n❌ You have memory limitations.  \n\n---\n\n### **Hybrid Precision Models (e.g., `bf16_q8_0`, `f16_q4_K`) – Best of Both Worlds**  \nThese formats selectively **quantize non-essential layers** while keeping **key layers in full precision** (e.g., attention and output layers).\n\n- Named like `bf16_q8_0` (meaning **full-precision BF16 core layers + quantized Q8_0 other layers**).  \n- Strike a **balance between memory efficiency and accuracy**, improving over fully quantized models without requiring the full memory of BF16/F16.  \n\n📌 **Use Hybrid Models if:**  \n✔ You need **better accuracy than quant-only models** but can’t afford full BF16/F16 everywhere.  \n✔ Your device supports **mixed-precision inference**.  \n✔ You want to **optimize trade-offs** for production-grade models on constrained hardware.  \n\n📌 **Avoid Hybrid Models if:**  \n❌ Your target device doesn’t support **mixed or full-precision acceleration**.  \n❌ You are operating under **ultra-strict memory limits** (in which case use fully quantized formats).  
\n\n---\n\n### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**  \nQuantization reduces model size and memory usage while maintaining as much accuracy as possible.  \n- **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision.  \n- **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, requires more memory.  \n\n📌 **Use Quantized Models if:**  \n✔ You are running inference on a **CPU** and need an optimized model.  \n✔ Your device has **low VRAM** and cannot load full-precision models.  \n✔ You want to reduce **memory footprint** while keeping reasonable accuracy.  \n\n📌 **Avoid Quantized Models if:**  \n❌ You need **maximum accuracy** (full-precision models are better for this).  \n❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).  \n\n---\n\n### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**  \nThese models are optimized for **very high memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.  \n\n- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **very high memory efficiency**.  \n  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.  \n  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.  \n\n- **IQ3_S**: Small block size for **maximum memory efficiency**.  \n  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.  \n\n- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.  \n  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.  \n\n- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.  \n  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.  \n\n- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.  \n  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.  
\n\n### **Ultra Low-Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)** \n- Ultra-low-bit quantization (1-2 bit) with **extreme memory efficiency**.  \n  - **Use case**: Best for cases where you have to fit the model into very constrained memory.\n  - **Trade-off**: Very low accuracy. May not function as expected. Please test fully before using.\n\n---\n\n### **Summary Table: Model Format Selection**  \n\n\n| Model Format             | Precision        | Memory Usage     | Device Requirements             | Best Use Case                                                |  \n|--------------------------|------------------|------------------|----------------------------------|--------------------------------------------------------------|  \n| **BF16**                 | Very High        | High             | BF16-supported GPU/CPU           | High-speed inference with reduced memory                    |  \n| **F16**                  | High             | High             | FP16-supported GPU/CPU           | Inference when BF16 isn’t available                     |  \n| **Q4_K**                 | Medium-Low       | Low              | CPU or Low-VRAM devices          | Memory-constrained inference                                |  \n| **Q6_K**                 | Medium           | Moderate         | CPU with more memory             | Better accuracy with quantization                           |  \n| **Q8_0**                 | High             | Moderate         | GPU/CPU with moderate VRAM       | Highest accuracy among quantized models                     |  \n| **IQ3_XS**               | Low              | Very Low         | Ultra-low-memory devices         | Max memory efficiency, low accuracy                         |  \n| **IQ3_S**                | Low              | Very Low         | Low-memory devices               | Slightly more usable than IQ3_XS                            |  \n| **IQ3_M**                | Low-Medium       | Low              | Low-memory 
devices               | Better accuracy than IQ3_S                                  |  \n| **Q4_0**                 | Low              | Low              | ARM-based/embedded devices       | Llama.cpp automatically optimizes for ARM inference                                 |  \n| **Ultra Low-Bit (IQ1/2_*)** | Very Low      | Extremely Low     | Tiny edge/embedded devices        | Fit models in extremely tight memory; low accuracy           |  \n| **Hybrid (e.g., `bf16_q8_0`)** | Medium–High | Medium           | Mixed-precision capable hardware | Balanced performance and memory, near-FP accuracy in critical layers |\n\n---\n\n\n\n\n\n# Qwen2.5-Omni\n<a href=\"https://chat.qwen.ai/\" target=\"_blank\" style=\"margin: 2px;\">\n    <img alt=\"Chat\" src=\"https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5\" style=\"display: inline-block; vertical-align: middle;\"/>\n</a>\n\n\n## Overview \n### Introduction\nQwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. \n\n<p align=\"center\">\n    <img src=\"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/qwen_omni.png\" width=\"80%\"/>\n<p>\n\n### Key Features\n\n* **Omni and Novel Architecture**: We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. 
We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.\n\n* **Real-Time Voice and Video Chat**: Architecture designed for fully real-time interactions, supporting chunked input and immediate output.\n\n* **Natural and Robust Speech Generation**: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.\n\n* **Strong Performance Across Modalities**: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B.\n\n* **Excellent End-to-End Speech Instruction Following**: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K.\n\n### Model Architecture\n\n<p align=\"center\">\n    <img src=\"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/overview.png\" width=\"80%\"/>\n<p>\n\n### Performance\n\nWe conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. 
Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).\n\n<p align=\"center\">\n    <img src=\"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/bar.png\" width=\"80%\"/>\n<p>\n\n<details>\n<summary>Multimodality  -> Text</summary>\n\n<table class=\"tg\"><thead>\n  <tr>\n    <th class=\"tg-0lax\">Datasets</th>\n    <th class=\"tg-0lax\">Model</th>\n    <th class=\"tg-0lax\">Performance</th>\n  </tr></thead>\n<tbody>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"10\">OmniBench<br>Speech | Sound Event | Music | Avg</td>\n    <td class=\"tg-0lax\">Gemini-1.5-Pro</td>\n    <td class=\"tg-0lax\">42.67%|42.26%|46.23%|42.91%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MIO-Instruct</td>\n    <td class=\"tg-0lax\">36.96%|33.58%|11.32%|33.80%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">AnyGPT (7B)</td>\n    <td class=\"tg-0lax\">17.77%|20.75%|13.21%|18.04%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">video-SALMONN</td>\n    <td class=\"tg-0lax\">34.11%|31.70%|<strong>56.60%</strong>|35.64%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">UnifiedIO2-xlarge</td>\n    <td class=\"tg-0lax\">39.56%|36.98%|29.25%|38.00%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">UnifiedIO2-xxlarge</td>\n    <td class=\"tg-0lax\">34.24%|36.98%|24.53%|33.98%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">-|-|-|40.50%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Baichuan-Omni-1.5</td>\n    <td class=\"tg-0lax\">-|-|-|42.90%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">52.14%|52.08%|52.83%|52.19%</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td 
class=\"tg-0lax\"><strong>55.25%</strong>|<strong>60.00%</strong>|52.83%|<strong>56.13%</strong></td>\n  </tr>\n</tbody></table>\n</details>\n\n\n<details>\n<summary>Audio -> Text</summary>\n\n\n<table class=\"tg\"><thead>\n  <tr>\n    <th class=\"tg-0lax\">Datasets</th>\n    <th class=\"tg-0lax\">Model</th>\n    <th class=\"tg-0lax\">Performance</th>\n  </tr></thead>\n<tbody>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">ASR</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"12\">Librispeech<br>dev-clean | dev other | test-clean | test-other</td>\n    <td class=\"tg-0lax\">SALMONN</td>\n    <td class=\"tg-0lax\">-|-|2.1|4.9</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">SpeechVerse</td>\n    <td class=\"tg-0lax\">-|-|2.1|4.4</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Whisper-large-v3</td>\n    <td class=\"tg-0lax\">-|-|1.8|3.6</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Llama-3-8B</td>\n    <td class=\"tg-0lax\">-|-|-|3.4</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Llama-3-70B</td>\n    <td class=\"tg-0lax\">-|-|-|3.1</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Seed-ASR-Multilingual</td>\n    <td class=\"tg-0lax\">-|-|<strong>1.6</strong>|<strong>2.8</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">-|-|1.7|-</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MinMo</td>\n    <td class=\"tg-0lax\">-|-|1.7|3.9</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen-Audio</td>\n    <td class=\"tg-0lax\">1.8|4.0|2.0|4.2</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\"><strong>1.3</strong>|<strong>3.4</strong>|<strong>1.6</strong>|3.6</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">2.0|4.1|2.2|4.5</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\">1.6|3.5|1.8|3.4</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"5\">Common Voice 15<br>en | zh | 
yue | fr</td>\n    <td class=\"tg-0lax\">Whisper-large-v3</td>\n    <td class=\"tg-0lax\">9.3|12.8|10.9|10.8</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MinMo</td>\n    <td class=\"tg-0lax\">7.9|6.3|6.4|8.5</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\">8.6|6.9|<strong>5.9</strong>|9.6</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">9.1|6.0|11.6|9.6</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>7.6</strong>|<strong>5.2</strong>|7.3|<strong>7.5</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"8\">Fleurs<br>zh | en</td>\n    <td class=\"tg-0lax\">Whisper-large-v3</td>\n    <td class=\"tg-0lax\">7.7|4.1</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Seed-ASR-Multilingual</td>\n    <td class=\"tg-0lax\">-|<strong>3.4</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Megrez-3B-Omni</td>\n    <td class=\"tg-0lax\">10.8|-</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">4.4|-</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MinMo</td>\n    <td class=\"tg-0lax\">3.0|3.8</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\">7.5|-</td>\n  </tr>\n    <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">3.2|5.4</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>3.0</strong>|4.1</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"6\">Wenetspeech<br>test-net | test-meeting</td>\n    <td class=\"tg-0lax\">Seed-ASR-Chinese</td>\n    <td class=\"tg-0lax\"><strong>4.7|5.7</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Megrez-3B-Omni</td>\n    <td class=\"tg-0lax\">-|16.4</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">6.9|-</td>\n  </tr>\n  <tr>\n    <td 
class=\"tg-0lax\">MinMo</td>\n    <td class=\"tg-0lax\">6.8|7.4</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">6.3|8.1</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\">5.9|7.7</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"4\">Voxpopuli-V1.0-en</td>\n    <td class=\"tg-0lax\">Llama-3-8B</td>\n    <td class=\"tg-0lax\">6.2</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Llama-3-70B</td>\n    <td class=\"tg-0lax\"><strong>5.7</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">6.6</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\">5.8</td>\n  </tr>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">S2TT</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"9\">CoVoST2<br>en-de | de-en | en-zh | zh-en</td>\n    <td class=\"tg-0lax\">SALMONN</td>\n    <td class=\"tg-0lax\">18.6|-|33.1|-</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">SpeechLLaMA</td>\n    <td class=\"tg-0lax\">-|27.1|-|12.3</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">BLSP</td>\n    <td class=\"tg-0lax\">14.1|-|-|-</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">-|-|<strong>48.2</strong>|27.2</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MinMo</td>\n    <td class=\"tg-0lax\">-|<strong>39.9</strong>|46.7|26.0</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen-Audio</td>\n    <td class=\"tg-0lax\">25.1|33.9|41.5|15.7</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\">29.9|35.2|45.2|24.4</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">28.3|38.1|41.4|26.6</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>30.2</strong>|37.7|41.4|<strong>29.4</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-9j4x\" 
colspan=\"3\">SER</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"6\">Meld</td>\n    <td class=\"tg-0lax\">WavLM-large</td>\n    <td class=\"tg-0lax\">0.542</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">0.524</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen-Audio</td>\n    <td class=\"tg-0lax\">0.557</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\">0.553</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">0.558</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>0.570</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">VSC</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"6\">VocalSound</td>\n    <td class=\"tg-0lax\">CLAP</td>\n    <td class=\"tg-0lax\">0.495</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Pengi</td>\n    <td class=\"tg-0lax\">0.604</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen-Audio</td>\n    <td class=\"tg-0lax\">0.929</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\"><strong>0.939</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">0.936</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>0.939</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">Music</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"3\">GiantSteps Tempo</td>\n    <td class=\"tg-0lax\">Llark-7B</td>\n    <td class=\"tg-0lax\">0.86</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\"><strong>0.88</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>0.88</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"3\">MusicCaps</td>\n    <td 
class=\"tg-0lax\">LP-MusicCaps</td>\n    <td class=\"tg-0lax\">0.291|0.149|0.089|<strong>0.061</strong>|0.129|0.130</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">0.325|<strong>0.163</strong>|<strong>0.093</strong>|0.057|<strong>0.132</strong>|<strong>0.229</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>0.328</strong>|0.162|0.090|0.055|0.127|0.225</td>\n  </tr>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">Audio Reasoning</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"4\">MMAU<br>Sound | Music | Speech | Avg</td>\n    <td class=\"tg-0lax\">Gemini-Pro-V1.5</td>\n    <td class=\"tg-0lax\">56.75|49.40|58.55|54.90</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\">54.95|50.98|42.04|49.20</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\"><strong>70.27</strong>|60.48|59.16|63.30</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\">67.87|<strong>69.16|59.76|65.60</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">Voice Chatting</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"9\">VoiceBench<br>AlpacaEval | CommonEval | SD-QA | MMSU</td>\n    <td class=\"tg-0lax\">Ultravox-v0.4.1-LLaMA-3.1-8B</td>\n    <td class=\"tg-0lax\"><strong>4.55</strong>|3.90|53.35|47.17</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MERaLiON</td>\n    <td class=\"tg-0lax\">4.50|3.77|55.06|34.95</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Megrez-3B-Omni</td>\n    <td class=\"tg-0lax\">3.50|2.95|25.95|27.03</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Lyra-Base</td>\n    <td class=\"tg-0lax\">3.85|3.50|38.25|49.74</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">4.42|<strong>4.15</strong>|50.72|54.78</td>\n  </tr>\n  <tr>\n    <td 
class=\"tg-0lax\">Baichuan-Omni-1.5</td>\n    <td class=\"tg-0lax\">4.50|4.05|43.40|57.25</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\">3.74|3.43|35.71|35.72</td>\n  </tr>\n    <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">4.32|4.00|49.37|50.23</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\">4.49|3.93|<strong>55.71</strong>|<strong>61.32</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"9\">VoiceBench<br>OpenBookQA | IFEval | AdvBench | Avg</td>\n    <td class=\"tg-0lax\">Ultravox-v0.4.1-LLaMA-3.1-8B</td>\n    <td class=\"tg-0lax\">65.27|<strong>66.88</strong>|98.46|71.45</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MERaLiON</td>\n    <td class=\"tg-0lax\">27.23|62.93|94.81|62.91</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Megrez-3B-Omni</td>\n    <td class=\"tg-0lax\">28.35|25.71|87.69|46.25</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Lyra-Base</td>\n    <td class=\"tg-0lax\">72.75|36.28|59.62|57.66</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MiniCPM-o</td>\n    <td class=\"tg-0lax\">78.02|49.25|97.69|71.69</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Baichuan-Omni-1.5</td>\n    <td class=\"tg-0lax\">74.51|54.54|97.31|71.14</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2-Audio</td>\n    <td class=\"tg-0lax\">49.45|26.33|96.73|55.35</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B</td>\n    <td class=\"tg-0lax\">74.73|42.10|98.85|68.81</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B</td>\n    <td class=\"tg-0lax\"><strong>81.10</strong>|52.87|<strong>99.42</strong>|<strong>74.12</strong></td>\n  </tr>\n</tbody></table>\n</details>\n\n<details>\n<summary>Image -> Text</summary>\n\n| Dataset                        | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | 
\n|--------------------------------|--------------|------------|------------|---------------|-------------|\n| MMMU<sub>val</sub>             | 59.2         | 53.1       | 53.9       | 58.6          | **60.0**    | \n| MMMU-Pro<sub>overall</sub>     | 36.6         | 29.7       | -          | **38.3**      | 37.6        | \n| MathVista<sub>testmini</sub>   | 67.9         | 59.4       | **71.9**   | 68.2          | 52.5        | \n| MathVision<sub>full</sub>      | 25.0         | 20.8       | 23.1       | **25.1**      | -           | \n| MMBench-V1.1-EN<sub>test</sub> | 81.8         | 77.8       | 80.5       | **82.6**      | 76.0        | \n| MMVet<sub>turbo</sub>          | 66.8         | 62.1       | **67.5**   | 67.1          | 66.9        | \n| MMStar                         | **64.0**     | 55.7       | **64.0**   | 63.9          | 54.8        | \n| MME<sub>sum</sub>              | 2340         | 2117       | **2372**   | 2347          | 2003        | \n| MuirBench                      | 59.2         | 48.0       | -          | **59.2**      | -           | \n| CRPE<sub>relation</sub>        | **76.5**     | 73.7       | -          | 76.4          | -           | \n| RealWorldQA<sub>avg</sub>      | 70.3         | 62.6       | **71.9**   | 68.5          | -           | \n| MME-RealWorld<sub>en</sub>     | **61.6**     | 55.6       | -          | 57.4          | -           | \n| MM-MT-Bench                    | 6.0          | 5.0        | -          | **6.3**       | -           | \n| AI2D                           | 83.2         | 79.5       | **85.8**   | 83.9          | -           | \n| TextVQA<sub>val</sub>          | 84.4         | 79.8       | 83.2       | **84.9**      | -           | \n| DocVQA<sub>test</sub>          | 95.2         | 93.3       | 93.5       | **95.7**      | -           | \n| ChartQA<sub>test Avg</sub>     | 85.3         | 82.8       | 84.9       | **87.3**      | -           | \n| OCRBench_V2<sub>en</sub>       | **57.8**     | 51.7 
      | -          | 56.3          | -           | \n\n\n| Dataset                  | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro | \n|--------------------------|--------------|---------------|---------------|----------------|----------------|\n| Refcoco<sub>val</sub>    | 90.5         | 88.7          | 90.0          | **90.6**       | 73.2           | \n| Refcoco<sub>textA</sub>  | **93.5**     | 91.8          | 92.5          | 93.2           | 72.9           | \n| Refcoco<sub>textB</sub>  | 86.6         | 84.0          | 85.4          | **88.2**       | 74.6           | \n| Refcoco+<sub>val</sub>   | 85.4         | 81.1          | 84.2          | **88.2**       | 62.5           | \n| Refcoco+<sub>textA</sub> | **91.0**     | 87.5          | 89.1          | 89.0           | 63.9           | \n| Refcoco+<sub>textB</sub> | **79.3**     | 73.2          | 76.9          | 75.9           | 65.0           | \n| Refcocog+<sub>val</sub>  | **87.4**     | 85.0          | 87.2          | 86.1           | 75.2           | \n| Refcocog+<sub>test</sub> | **87.9**     | 85.1          | 87.2          | 87.0           | 76.2           | \n| ODinW                    | 42.4         | 39.2          | 37.3          | **55.0**       | 36.7           | \n| PointGrounding           | 66.5         | 46.2          | **67.3**      | -              | -              | \n</details>\n\n\n<details>\n<summary>Video(without audio) -> Text</summary>\n\n| Dataset                     | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini | \n|-----------------------------|--------------|------------|------------|---------------|-------------|\n| Video-MME<sub>w/o sub</sub> | 64.3         | 62.0       | 63.9       | **65.1**      | 64.8        | \n| Video-MME<sub>w sub</sub>   | **72.4**     | 68.6       | 67.9       | 71.6          | -           | \n| MVBench                     | **70.3**     | 68.7       | 67.2       | 69.6          | 
-           | \n| EgoSchema<sub>test</sub>    | **68.6**     | 61.4       | 63.2       | 65.0          | -           | \n</details>\n\n<details>\n<summary>Zero-shot Speech Generation</summary>\n\n\n<table class=\"tg\"><thead>\n  <tr>\n    <th class=\"tg-0lax\">Datasets</th>\n    <th class=\"tg-0lax\">Model</th>\n    <th class=\"tg-0lax\">Performance</th>\n  </tr></thead>\n<tbody>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">Content Consistency</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"11\">SEED<br>test-zh | test-en | test-hard </td>\n    <td class=\"tg-0lax\">Seed-TTS_ICL</td>\n    <td class=\"tg-0lax\">1.11 | 2.24 | 7.58</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Seed-TTS_RL</td>\n    <td class=\"tg-0lax\"><strong>1.00</strong> | 1.94 | <strong>6.42</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MaskGCT</td>\n    <td class=\"tg-0lax\">2.27 | 2.62 | 10.27</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">E2_TTS</td>\n    <td class=\"tg-0lax\">1.97 | 2.19 | -</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">F5-TTS</td>\n    <td class=\"tg-0lax\">1.56 | <strong>1.83</strong> | 8.67</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">CosyVoice 2</td>\n    <td class=\"tg-0lax\">1.45 | 2.57 | 6.83</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">CosyVoice 2-S</td>\n    <td class=\"tg-0lax\">1.45 | 2.38 | 8.08</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B_ICL</td>\n    <td class=\"tg-0lax\">1.95 | 2.87 | 9.92</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B_RL</td>\n    <td class=\"tg-0lax\">1.58 | 2.51 | 7.86</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B_ICL</td>\n    <td class=\"tg-0lax\">1.70 | 2.72 | 7.97</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B_RL</td>\n    <td class=\"tg-0lax\">1.42 | 2.32 | 6.54</td>\n  </tr>\n  <tr>\n    <td class=\"tg-9j4x\" colspan=\"3\">Speaker Similarity</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\" rowspan=\"11\">SEED<br>test-zh 
| test-en | test-hard </td>\n    <td class=\"tg-0lax\">Seed-TTS_ICL</td>\n    <td class=\"tg-0lax\">0.796 | 0.762 | 0.776</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Seed-TTS_RL</td>\n    <td class=\"tg-0lax\"><strong>0.801</strong> | <strong>0.766</strong> | <strong>0.782</strong></td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">MaskGCT</td>\n    <td class=\"tg-0lax\">0.774 | 0.714 | 0.748</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">E2_TTS</td>\n    <td class=\"tg-0lax\">0.730 | 0.710 | -</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">F5-TTS</td>\n    <td class=\"tg-0lax\">0.741 | 0.647 | 0.713</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">CosyVoice 2</td>\n    <td class=\"tg-0lax\">0.748 | 0.652 | 0.724</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">CosyVoice 2-S</td>\n    <td class=\"tg-0lax\">0.753 | 0.654 | 0.732</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B_ICL</td>\n    <td class=\"tg-0lax\">0.741 | 0.635 | 0.748</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-3B_RL</td>\n    <td class=\"tg-0lax\">0.744 | 0.635 | 0.746</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B_ICL</td>\n    <td class=\"tg-0lax\">0.752 | 0.632 | 0.747</td>\n  </tr>\n  <tr>\n    <td class=\"tg-0lax\">Qwen2.5-Omni-7B_RL</td>\n    <td class=\"tg-0lax\">0.754 | 0.641 | 0.752</td>\n  </tr>\n</tbody></table>\n</details>\n\n<details>\n<summary>Text -> Text</summary>\n\n| Dataset                           | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B | \n|-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------|\n| MMLU-Pro                          | 47.0      | 40.4       | **56.3**   | 43.7       | 44.1       | 48.3        | 52.1      | \n| MMLU-redux                        | 71.0      | 60.9       | **75.4**   | 64.4       | 67.3       | 67.2        | 72.8      | \n| LiveBench<sub>0831</sub>          | 
29.6      | 22.3       | **35.9**   | 26.8       | 29.2       | 26.7        | 30.6      | \n| GPQA                              | 30.8      | 34.3       | **36.4**   | 30.3       | 34.3       | 32.8        | 32.8      | \n| MATH                              | 71.5      | 63.6       | **75.5**   | 65.9       | 52.9       | 51.9        | 44.3      | \n| GSM8K                             | 88.7      | 82.6       | **91.6**   | 86.7       | 85.7       | 84.5        | 76.7      | \n| HumanEval                         | 78.7      | 70.7       | **84.8**   | 74.4       | 79.9       | 72.6        | 68.9      | \n| MBPP                              | 73.2      | 70.4       | **79.2**   | 72.7       | 67.2       | 69.6        | 74.9      | \n| MultiPL-E                         | 65.8      | 57.6       | **70.4**   | 60.2       | 59.1       | 50.7        | 53.4      | \n| LiveCodeBench<sub>2305-2409</sub> | 24.6      | 16.5       | **28.7**   | 19.9       | 23.9       | 8.3         | 18.9      | \n</details>\n\n## Quickstart\n\nBelow, we provide simple examples that show how to use Qwen2.5-Omni with 🤗 Transformers. The code for Qwen2.5-Omni is available in the latest Hugging Face transformers, and we advise you to build from source with the following commands:\n```\npip uninstall transformers\npip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview\npip install accelerate\n```\nOtherwise, you might encounter the following error:\n```\nKeyError: 'qwen2_5_omni'\n```\n\n\nWe offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. You can install it with the following command; make sure your system has `ffmpeg` installed:\n\n```bash\n# It's highly recommended to use the `[decord]` feature for faster video loading.\npip install qwen-omni-utils[decord] -U\n```\n\nIf you are not using Linux, you might not be able to install `decord` from PyPI. 
In that case, you can use `pip install qwen-omni-utils -U`, which falls back to torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) so that decord is used when loading video.\n\n### 🤗 Transformers Usage\n\nThe following code snippet shows how to use the chat model with `transformers` and `qwen_omni_utils`:\n\n```python\nimport soundfile as sf\n\nfrom transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor\nfrom qwen_omni_utils import process_mm_info\n\n# default: Load the model on the available device(s)\nmodel = Qwen2_5OmniForConditionalGeneration.from_pretrained(\"Qwen/Qwen2.5-Omni-3B\", torch_dtype=\"auto\", device_map=\"auto\")\n\n# We recommend enabling flash_attention_2 for better acceleration and memory saving.\n# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(\n#     \"Qwen/Qwen2.5-Omni-3B\",\n#     torch_dtype=\"auto\",\n#     device_map=\"auto\",\n#     attn_implementation=\"flash_attention_2\",\n# )\n\nprocessor = Qwen2_5OmniProcessor.from_pretrained(\"Qwen/Qwen2.5-Omni-3B\")\n\nconversation = [\n    {\n        \"role\": \"system\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\"}\n        ],\n    },\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"video\", \"video\": \"https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4\"},\n        ],\n    },\n]\n\n# Whether to use the audio track of the video\nUSE_AUDIO_IN_VIDEO = True\n\n# Preparation for inference\ntext = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)\naudios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)\ninputs = processor(text=text, audio=audios, 
images=images, videos=videos, return_tensors=\"pt\", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)\ninputs = inputs.to(model.device).to(model.dtype)\n\n# Inference: Generation of the output text and audio\ntext_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)\n\ntext = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)\nprint(text)\nsf.write(\n    \"output.wav\",\n    audio.reshape(-1).detach().cpu().numpy(),\n    samplerate=24000,\n)\n```\n\n<details>\n<summary>Minimum GPU memory requirements</summary>\n\n| Model        | Precision | 15 s Video | 30 s Video      | 60 s Video      |\n|--------------|-----------|------------|-----------------|-----------------|\n| Qwen-Omni-3B | FP32      | 89.10 GB   | Not Recommended | Not Recommended |\n| Qwen-Omni-3B | BF16      | 18.38 GB   | 22.43 GB        | 28.22 GB        |\n| Qwen-Omni-7B | FP32      | 93.56 GB   | Not Recommended | Not Recommended |\n| Qwen-Omni-7B | BF16      | 31.11 GB   | 41.85 GB        | 60.19 GB        |\n\nNote: The table above presents the theoretical minimum memory requirements for inference with `transformers`; the `BF16` figures were measured with `attn_implementation=\"flash_attention_2\"`. In practice, the actual memory usage is typically at least 1.2 times higher. For more information, see the [model size estimator guide](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator).\n</details>  \n\n<details>\n<summary>Video URL resource usage</summary>\n\nVideo URL compatibility largely depends on the third-party library version. The details are in the table below. 
To override the default backend, set `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord`.\n\n| Backend               | HTTP | HTTPS |\n|-----------------------|------|-------|\n| torchvision >= 0.19.0 | ✅   | ✅    |\n| torchvision < 0.19.0  | ❌   | ❌    |\n| decord                | ✅   | ❌    |\n</details>\n\n<details>\n<summary>Batch inference</summary>\n\nWhen `return_audio=False` is set, the model can batch inputs that mix samples of different types, such as text, images, audio and videos. Here is an example.\n\n```python\n# Sample messages for batch inference\n\n# Conversation with video only\nconversation1 = [\n    {\n        \"role\": \"system\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\"}\n        ],\n    },\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"video\", \"video\": \"/path/to/video.mp4\"},\n        ]\n    }\n]\n\n# Conversation with audio only\nconversation2 = [\n    {\n        \"role\": \"system\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\"}\n        ],\n    },\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"audio\", \"audio\": \"/path/to/audio.wav\"},\n        ]\n    }\n]\n\n# Conversation with pure text\nconversation3 = [\n    {\n        \"role\": \"system\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\"}\n        ],\n    },\n    {\n        \"role\": \"user\",\n        \"content\": 
\"who are you?\"\n    }\n]\n\n\n# Conversation with mixed media\nconversation4 = [\n    {\n        \"role\": \"system\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\"}\n        ],\n    },\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"image\", \"image\": \"/path/to/image.jpg\"},\n            {\"type\": \"video\", \"video\": \"/path/to/video.mp4\"},\n            {\"type\": \"audio\", \"audio\": \"/path/to/audio.wav\"},\n            {\"type\": \"text\", \"text\": \"What elements can you see and hear in these media?\"},\n        ],\n    }\n]\n\n# Combine messages for batch processing\nconversations = [conversation1, conversation2, conversation3, conversation4]\n\n# Whether to use the audio track of the videos\nUSE_AUDIO_IN_VIDEO = True\n\n# Preparation for batch inference\ntext = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)\naudios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)\n\ninputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors=\"pt\", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)\ninputs = inputs.to(model.device).to(model.dtype)\n\n# Batch Inference\ntext_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)\ntext = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)\nprint(text)\n```\n</details>\n\n### Usage Tips\n\n#### Prompt for audio output\nIf users need audio output, the system prompt must be set to \"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\"; otherwise, the audio output may not work as expected.\n```\n{\n    \"role\": 
\"system\",\n    \"content\": [\n        {\"type\": \"text\", \"text\": \"You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.\"}\n    ],\n}\n```\n#### Use audio in video\nIn multimodal interaction, user-provided videos are often accompanied by audio (such as spoken questions about the video content, or sounds generated by events in the video). This audio helps the model provide a better interactive experience, so `use_audio_in_video` can be set in the following three places to control whether the audio track is used.\n```python\n# first place, in data preprocessing\naudios, images, videos = process_mm_info(conversations, use_audio_in_video=True)\n```\n```python\n# second place, in model processor\ninputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors=\"pt\",\n                   padding=True, use_audio_in_video=True)\n```\n```python\n# third place, in model inference\ntext_ids, audio = model.generate(**inputs, use_audio_in_video=True)\n```\nIt is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter must be set to the same value in all of these places; otherwise, unexpected results will occur.\n\n#### Use audio output or not\n\nThe model supports both text and audio outputs. If users do not need audio output, they can call `model.disable_talker()` after initializing the model. This saves about 2 GB of GPU memory, but `return_audio` in the `generate` function can then only be set to `False`.\n```python\nmodel = Qwen2_5OmniForConditionalGeneration.from_pretrained(\n    \"Qwen/Qwen2.5-Omni-3B\",\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\nmodel.disable_talker()\n```\n\nFor a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. 
If `return_audio` is set to `False`, the model returns only text, which makes text responses faster.\n\n```python\nmodel = Qwen2_5OmniForConditionalGeneration.from_pretrained(\n    \"Qwen/Qwen2.5-Omni-3B\",\n    torch_dtype=\"auto\",\n    device_map=\"auto\"\n)\n...\ntext_ids = model.generate(**inputs, return_audio=False)\n```\n\n#### Change voice type of output audio\nQwen2.5-Omni supports changing the voice of the output audio. The `\"Qwen/Qwen2.5-Omni-3B\"` checkpoint supports two voice types:\n\n| Voice Type | Gender | Description |\n|------------|--------|-------------|\n| Chelsie    | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |\n| Ethan      | Male   | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |\n\nUsers can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the default voice type is `Chelsie`.\n\n```python\ntext_ids, audio = model.generate(**inputs, speaker=\"Chelsie\")\n```\n\n```python\ntext_ids, audio = model.generate(**inputs, speaker=\"Ethan\")\n```\n\n#### Flash-Attention 2 to speed up generation\n\nFirst, make sure to install the latest version of Flash Attention 2:\n\n```bash\npip install -U flash-attn --no-build-isolation\n```\n\nYour hardware must also be compatible with FlashAttention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). 
FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.\n\nTo load and run a model using FlashAttention-2, add `attn_implementation=\"flash_attention_2\"` when loading the model:\n\n```python\nimport torch\nfrom transformers import Qwen2_5OmniForConditionalGeneration\n\nmodel = Qwen2_5OmniForConditionalGeneration.from_pretrained(\n    \"Qwen/Qwen2.5-Omni-3B\",\n    device_map=\"auto\",\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n)\n```\n\n\n## Citation\n\nIf you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil: :)\n\n\n\n```BibTeX\n\n@article{Qwen2.5-Omni,\n  title={Qwen2.5-Omni Technical Report},\n  author={Jin Xu and Zhifang Guo and Jinzheng He and Hangrui Hu and Ting He and Shuai Bai and Keqin Chen and Jialin Wang and Yang Fan and Kai Dang and Bin Zhang and Xiong Wang and Yunfei Chu and Junyang Lin},\n  journal={arXiv preprint arXiv:2503.20215},\n  year={2025}\n}\n```\n\n<br>\n\n\n# <span id=\"testllm\" style=\"color: #7F7FFF;\">🚀 If you find these models useful</span>\n\nHelp me test my **AI-Powered Quantum Network Monitor Assistant** with **quantum-ready security checks**:  \n\n👉 [Quantum Network Monitor](https://readyforquantum.com/?assistant=open&utm_source=huggingface&utm_medium=referral&utm_campaign=huggingface_repo_readme)  \n\n\nThe full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (those with NetworkMonitor in the name): [Source Code Quantum Network Monitor](https://github.com/Mungert69). 
You will also find the code I use to quantize the models, if you want to do it yourself, in [GGUFModelBuilder](https://github.com/Mungert69/GGUFModelBuilder).\n\n💬 **How to test**:  \n Choose an **AI assistant type**:  \n   - `TurboLLM` (GPT-4.1-mini)  \n   - `HugLLM` (Hugging Face open-source models)  \n   - `TestLLM` (Experimental CPU-only)  \n\n### **What I’m Testing**  \nI’m pushing the limits of **small open-source models for AI network monitoring**, specifically:  \n- **Function calling** against live network services  \n- **How small can a model go** while still handling:  \n  - Automated **Nmap security scans**  \n  - **Quantum-readiness checks**  \n  - **Network Monitoring tasks**  \n\n🟡 **TestLLM** – Current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker space):  \n- ✅ **Zero-configuration setup**  \n- ⏳ 30s load time (slow inference but **no API costs**). No token limit, as the cost is low.\n- 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate!  \n\n### **Other Assistants**  \n🟢 **TurboLLM** – Uses **gpt-4.1-mini**:\n- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.\n- **Create custom cmd processors to run .NET code on Quantum Network Monitor Agents**\n- **Real-time network diagnostics and monitoring**\n- **Security Audits**\n- **Penetration testing** (Nmap/Metasploit)  \n\n🔵 **HugLLM** – Latest open-source models:  \n- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.\n\n### 💡 **Example commands you could test**:  \n1. `\"Give me info on my website's SSL certificate\"`  \n2. `\"Check if my server is using quantum-safe encryption for communication\"`  \n3. `\"Run a comprehensive security audit on my server\"`\n4. `\"Create a cmd processor to .. (whatever you want)\"` Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. 
Use with caution!\n\n### Final Word\n\nI fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is [open source](https://github.com/Mungert69). Feel free to use whatever you find helpful.\n\nIf you appreciate the work, please consider [buying me a coffee](https://www.buymeacoffee.com/mahadeva) ☕. Your support helps cover service costs and allows me to raise token limits for everyone.\n\nI'm also open to job opportunities or sponsorship.\n\nThank you! 😊\n",
    "related_quantizations": []
  },
  "tags": [
    "transformers",
    "gguf",
    "multimodal",
    "any-to-any",
    "en",
    "arxiv:2503.20215",
    "license:other",
    "endpoints_compatible",
    "region:us",
    "conversational"
  ],
  "likes": 3,
  "downloads": 176,
  "gated": false,
  "private": false,
  "last_modified": "2025-09-24T15:43:03.000Z",
  "created_at": "2025-06-10T12:18:40.000Z",
  "pipeline_tag": "any-to-any",
  "library_name": "transformers"
}

Source payload excerpt (from Hugging Face API)

{
  "_id": "684822a07feb5879a1c2c015",
  "id": "Mungert/Qwen2.5-Omni-3B-GGUF",
  "modelId": "Mungert/Qwen2.5-Omni-3B-GGUF",
  "sha": "51a713feff2c285a9b021c243a78cfb8cbd6a71f",
  "createdAt": "2025-06-10T12:18:40.000Z",
  "lastModified": "2025-09-24T15:43:03.000Z",
  "author": "Mungert",
  "downloads": 176,
  "likes": 3,
  "gated": false,
  "private": false,
  "pipeline_tag": "any-to-any",
  "library_name": "transformers",
  "siblings_count": 34
}
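
The payload fields above can also be read programmatically. The following is a minimal sketch, not part of the original page: it copies a few fields from the excerpt into a string, and the helper names `parse_timestamp`, `summarize_payload` and `file_url` are hypothetical. The `resolve/main` URL layout is assumed here to be the usual Hugging Face direct-download convention, and the file name is taken from the repository file list.

```python
import json
from datetime import datetime

# Fields copied from the payload excerpt above (only those used below).
PAYLOAD = """
{
  "id": "Mungert/Qwen2.5-Omni-3B-GGUF",
  "createdAt": "2025-06-10T12:18:40.000Z",
  "lastModified": "2025-09-24T15:43:03.000Z",
  "downloads": 176,
  "likes": 3
}
"""


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2025-06-10T12:18:40.000Z (UTC, milliseconds)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def summarize_payload(raw: str) -> dict:
    """Extract the repo id, counters, and the age of the repo at its last update."""
    data = json.loads(raw)
    created = parse_timestamp(data["createdAt"])
    modified = parse_timestamp(data["lastModified"])
    return {
        "repo_id": data["id"],
        "days_between_create_and_update": (modified - created).days,
        "downloads": data["downloads"],
        "likes": data["likes"],
    }


def file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL (assumed layout: <repo>/resolve/<revision>/<file>)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


summary = summarize_payload(PAYLOAD)
print(summary)
print(file_url(summary["repo_id"], "Qwen2.5-Omni-3B-q4_k_m.gguf"))
```

Any other file name from the repository file list can be passed to `file_url` in the same way; the sketch does not perform the download itself.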