What license applies to cstr/nemotron-3.5-asr-streaming-GGUF?

License: openmdw-1.1. Verify terms on Hugging Face before commercial use.

How do I run cstr/nemotron-3.5-asr-streaming-GGUF locally?

Download a GGUF file from this page and run it with CrispASR, the runtime the model README targets. Pipeline task: automatic-speech-recognition.

Model Intelligence Sheet

cstr/nemotron-3.5-asr-streaming-GGUF overview

Nemotron-3.5-ASR-Streaming-0.6B GGUF. GGUF conversion of nvidia/nemotron-3.5-asr-streaming-0.6b (https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b) fo…

gguf · asr · speech-recognition · streaming · fastconformer · rnnt · multilingual · crispasr · automatic-speech-recognition · ar · bg · cs · da · de · el · en · es · et · fi · fr · he · hi · hr · hu

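Runs locally: the recommended Q4_K file takes about 457 MB of disk, and the page rates the model for 4 GB VRAM class GPUs. The 3.3 MB file is the reference GGUF for parity testing, which is too small to hold the 0.6B-parameter weights.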

Downloads

110

Pipeline

automatic-speech-recognition

Author

cstr

Repository Files & Downloads

3 GGUF files detected

Direct downloads for local inference

File	Type	Quantization	Size
nemotron-3.5-asr-streaming-0.6b-f16.gguf	GGUF	F16	1.20 GB
nemotron-3.5-asr-streaming-0.6b-q4_k.gguf	GGUF	Q4_K	457.1 MB
nemotron-3.5-asr-streaming-ref.gguf	GGUF	not specified	3.3 MB

Model Details

Model ID	cstr/nemotron-3.5-asr-streaming-GGUF
Author	cstr
Pipeline	automatic-speech-recognition
License	openmdw-1.1
Base model	nvidia/nemotron-3.5-asr-streaming-0.6b
Last modified	2026-06-15T06:41:33.000Z

Model README

---
license: openmdw-1.1
language:
- ar
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- he
- hi
- hr
- hu
- it
- ja
- ko
- lt
- lv
- nb
- nl
- nn
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- th
- tr
- uk
- vi
- zh
tags:
- asr
- speech-recognition
- gguf
- streaming
- fastconformer
- rnnt
- multilingual
- crispasr
pipeline_tag: automatic-speech-recognition
base_model: nvidia/nemotron-3.5-asr-streaming-0.6b
---

Nemotron-3.5-ASR-Streaming-0.6B GGUF

GGUF conversion of nvidia/nemotron-3.5-asr-streaming-0.6b for use with CrispASR.

Model details

Architecture: Cache-Aware Streaming FastConformer encoder (24 layers, d=1024, 8 heads) + RNN-T decoder (2-layer LSTM, hidden=640) + joint network (640 → 13088 vocab).

Languages: 39 languages, selected via prompt_kernel MLP conditioning.

Key properties:

- Sample rate: 16 kHz
- 128 mel filterbank features, n_fft=512, hop=160 (10 ms), win=400 (25 ms)
- 8× time downsampling (causal) → 80 ms frame duration
- Streaming: cache-aware attention with att_context_size=[[56,3],[56,0],[56,6],[56,13]]
- Vocab: 13087 SentencePiece tokens + 1 blank (pure RNN-T, no TDT durations)
- License: OpenMDW-1.1 (permissive)
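
The timing figures in the key properties follow from simple arithmetic, sketched below in Python. The constants come from the list above; the function names are illustrative. One assumption is labeled in the code: each att_context_size pair is read as [left, right] context in 80 ms encoder frames, following the usual NeMo cache-aware streaming convention, so the right value gives the lookahead. The README itself does not spell this out.

```python
# Front-end constants taken from the key properties listed above.
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 160   # mel hop
WIN_SAMPLES = 400   # mel window
DOWNSAMPLE = 8      # encoder time downsampling factor

# The four pairs from att_context_size.
ATT_CONTEXT_SIZES = [[56, 3], [56, 0], [56, 6], [56, 13]]


def samples_to_ms(n_samples: int) -> float:
    """Convert a sample count at 16 kHz to milliseconds."""
    return 1000.0 * n_samples / SAMPLE_RATE_HZ


def encoder_frame_ms() -> float:
    """Duration covered by one encoder output frame."""
    return samples_to_ms(HOP_SAMPLES) * DOWNSAMPLE


def lookahead_ms(right_context_frames: int) -> float:
    # Assumption: the second value of each pair is the right (future)
    # context, counted in encoder frames.
    return right_context_frames * encoder_frame_ms()


def approx_encoder_frames(duration_s: float) -> int:
    """Rough encoder frame count for a clip, ignoring edge padding."""
    mel_frames = int(duration_s * SAMPLE_RATE_HZ) // HOP_SAMPLES
    return mel_frames // DOWNSAMPLE


if __name__ == "__main__":
    print("hop:", samples_to_ms(HOP_SAMPLES), "ms")
    print("window:", samples_to_ms(WIN_SAMPLES), "ms")
    print("encoder frame:", encoder_frame_ms(), "ms")
    for left, right in ATT_CONTEXT_SIZES:
        print(f"context [{left},{right}] -> lookahead {lookahead_ms(right)} ms")
```

Under that reading, the four context settings trade latency for future context, with [56,0] being the fully causal option.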

Files

| File | Size | Description |
|------|------|-------------|
| nemotron-3.5-asr-streaming-0.6b-f16.gguf | ~1.2 GB | F16 weights (unquantized) |
| nemotron-3.5-asr-streaming-0.6b-q4_k.gguf | ~457 MB | Q4_K quantized (recommended) |
| nemotron-3.5-asr-streaming-ref.gguf | ~3.3 MB | Reference GGUF (for parity testing) |
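
After downloading one of these files, a quick sanity check is to read its GGUF header. The sketch below assumes a little-endian GGUF file of version 2 or later, where the 4-byte magic is followed by a uint32 version and two uint64 counts. The function name read_gguf_header is illustrative and not part of CrispASR.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_BYTES = 24  # magic (4) + version (4) + tensor count (8) + KV count (8)


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumes the little-endian GGUF v2+ header layout.
    """
    with open(path, "rb") as f:
        head = f.read(HEADER_BYTES)
    if len(head) < HEADER_BYTES or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a GGUF header")
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:HEADER_BYTES])
    return version, tensor_count, kv_count


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        version, tensors, kvs = read_gguf_header(name)
        print(f"{name}: GGUF v{version}, {tensors} tensors, {kvs} metadata keys")
```

A truncated or mislabeled download fails the magic check immediately, which is cheaper than discovering the problem at model load time.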

Usage with CrispASR

# Download
huggingface-cli download cstr/nemotron-3.5-asr-streaming-GGUF \
  nemotron-3.5-asr-streaming-0.6b-f16.gguf --local-dir models/

# Transcribe (English, default)
crispasr --backend nemotron \
  -m models/nemotron-3.5-asr-streaming-0.6b-f16.gguf \
  -f audio.wav

# Transcribe in German
crispasr --backend nemotron \
  -m models/nemotron-3.5-asr-streaming-0.6b-f16.gguf \
  -f audio.wav --language de-DE

# Streaming mode
crispasr --backend nemotron \
  -m models/nemotron-3.5-asr-streaming-0.6b-f16.gguf \
  -f audio.wav --stream
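
The same steps can be scripted. The sketch below is a thin Python wrapper around the commands shown above, not an official CrispASR API: it uses hf_hub_download from the huggingface_hub package for the download and only the flags that appear in this README. The helper names are hypothetical, and two things are assumptions: that crispasr writes the transcript to stdout, and that --language and --stream can be combined (the README only shows them separately).

```python
import subprocess

REPO_ID = "cstr/nemotron-3.5-asr-streaming-GGUF"
MODEL_FILE = "nemotron-3.5-asr-streaming-0.6b-f16.gguf"


def download_model(local_dir="models"):
    """Fetch the F16 GGUF file and return its local path."""
    # Imported lazily so the rest of the sketch works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=MODEL_FILE, local_dir=local_dir)


def build_crispasr_cmd(model_path, audio_path, language=None, stream=False):
    """Build the crispasr argument list using only flags shown in the README."""
    cmd = ["crispasr", "--backend", "nemotron",
           "-m", str(model_path), "-f", str(audio_path)]
    if language:
        cmd += ["--language", language]  # e.g. "de-DE"
    if stream:
        cmd.append("--stream")
    return cmd


def transcribe(model_path, audio_path, language=None, stream=False):
    """Run crispasr and return whatever it prints.

    Assumption: the transcript is written to stdout.
    """
    cmd = build_crispasr_cmd(model_path, audio_path, language, stream)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    model = download_model()
    print(transcribe(model, "audio.wav", language="de-DE"))
```

Keeping the command construction in its own function makes it easy to log or inspect the exact invocation before running it.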

Conversion

Converted from the original NeMo .nemo checkpoint using:

python models/convert-nemotron-to-gguf.py \
  --nemo nvidia/nemotron-3.5-asr-streaming-0.6b \
  --output nemotron-3.5-asr-streaming-0.6b-f16.gguf

Quantized variants can be produced with:

crispasr-quantize models/nemotron-3.5-asr-streaming-0.6b-f16.gguf \
  models/nemotron-3.5-asr-streaming-0.6b-q4_k.gguf Q4_K
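
To get a feel for what the quantization buys, the average storage per weight can be estimated from the two file sizes on this page. This is a rough sketch under stated assumptions: the F16 file is treated as almost entirely 16-bit weights, metadata is ignored, and "1.20 GB" is read as 1200 MB.

```python
# File sizes as listed on this page.
F16_SIZE_MB = 1200.0   # "1.20 GB", read as 1200 MB (assumption)
Q4_K_SIZE_MB = 457.1
F16_BITS = 16


def effective_bits_per_weight(quant_mb, f16_mb, f16_bits=F16_BITS):
    """Average bits per weight implied by the size ratio of two files.

    Assumes the F16 file is dominated by 16-bit weights.
    """
    return f16_bits * quant_mb / f16_mb


if __name__ == "__main__":
    bits = effective_bits_per_weight(Q4_K_SIZE_MB, F16_SIZE_MB)
    print(f"Q4_K file averages about {bits:.1f} bits per weight")
    print(f"size reduction: {F16_SIZE_MB / Q4_K_SIZE_MB:.1f}x")
```

The average lands near 6 bits rather than 4, which would be consistent with some tensors being stored at higher precision; the README does not say which ones.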

Architecture

Audio (16kHz) → Mel (128 bins, 10ms hop)
  → Pre-encode (3× causal Conv2d, 8× downsample, Linear 4352→1024)
  → 24× Cache-Aware FastConformer block:
      FFN1(½) → MHA(rel_pos, cache-aware) → DWConv(k=9, causal, LN) → FFN2(½) → LN
  → Prompt kernel (MLP: concat(enc[1024], lang_onehot[128]) → 2048 → 1024)
  → RNN-T decoder:
      Prediction: Embed(13088, 640) + 2-layer LSTM(640)
      Joint: enc(1024→640) + pred(640→640) → ReLU → Linear(640→13088)
  → Greedy / beam search decode
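
The last stage of the diagram, greedy RNN-T decoding, can be sketched as a short loop. This is a generic textbook version, not CrispASR's implementation: predictor_step and joint are hypothetical callables standing in for the LSTM prediction network and the joint network, max_symbols_per_frame is an illustrative safety cap, and the blank index is assumed to be the last of the 13088 joint outputs.

```python
BLANK_ID = 13087  # assumption: blank is the last of the 13088 joint outputs


def greedy_rnnt_decode(enc_frames, predictor_step, joint, initial_state,
                       blank_id=BLANK_ID, max_symbols_per_frame=10):
    """Greedy RNN-T decoding over a sequence of encoder frames.

    predictor_step(token, state) -> (pred_out, new_state); token is None at start.
    joint(enc_frame, pred_out)   -> sequence of logits over vocab + blank.
    """
    tokens = []
    pred_out, state = predictor_step(None, initial_state)
    for enc_t in enc_frames:
        # Emit symbols for this frame until the joint network predicts blank.
        for _ in range(max_symbols_per_frame):
            logits = joint(enc_t, pred_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(best)
            # Only non-blank tokens advance the prediction network.
            pred_out, state = predictor_step(best, state)
    return tokens


if __name__ == "__main__":
    # Toy stand-ins: vocabulary {0, 1} plus blank id 2.
    def toy_predictor(token, state):
        return token, state

    def toy_joint(enc_t, pred_out):
        target = enc_t % 2
        chosen = 2 if pred_out == target else target
        return [1.0 if i == chosen else 0.0 for i in range(3)]

    print(greedy_rnnt_decode([0, 1, 2], toy_predictor, toy_joint, None, blank_id=2))
```

Because the encoder frames are consumed one at a time and the predictor state is carried forward, the same loop works on audio that arrives in chunks, which is what makes this decoder a natural fit for streaming.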

Original model

Paper: NVIDIA NeMo documentation
Source: nvidia/nemotron-3.5-asr-streaming-0.6b
License: OpenMDW-1.1


Source: Hugging Face