Anyone who’s processed meeting recordings or podcasts has hit this wall: feed an hour of audio to a speech recognition service and it chops the file into segments, context gets lost at the cut points, speaker information disappears, and you’re stuck stitching everything back together in post-processing.
Microsoft’s open-source VibeVoice addresses exactly that — 60 minutes of audio processed in a single pass through the model, no chunking required. Speaker diarization is built in, so you don’t need a separate model to figure out who said what.
The project has reached 44,746 stars on GitHub, adding 1,523 stars today alone.
What It Does
Traditional speech recognition models like OpenAI’s Whisper process long audio by slicing it into segments and handling each independently. This creates two problems:
- Context breaks — semantics around cut points can be lost, hurting accuracy
- Speaker information loss — the same speaker across segments can’t be automatically linked
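To make the cut-point problem concrete, here is a minimal sketch in plain Python. The word timings are invented for illustration, and the chunking logic is a deliberately naive stand-in for what segment-by-segment recognizers effectively do:

```python
# Toy transcript: (start_sec, end_sec, word) tuples, as a time-aligning
# recognizer might emit them. All values here are invented for illustration.
words = [
    (0.0, 0.4, "the"), (0.4, 1.1, "quarterly"), (1.1, 1.6, "numbers"),
    (1.6, 2.2, "look"), (2.2, 3.4, "unexpectedly"), (3.4, 3.9, "strong"),
]

CHUNK_SEC = 3.0  # naive fixed-size window

def chunk_words(words, chunk_sec):
    """Assign each word to the chunk its start time falls in, and flag
    words whose audio straddles a chunk boundary (the 'cut point')."""
    chunks, straddlers = {}, []
    for start, end, word in words:
        idx = int(start // chunk_sec)
        chunks.setdefault(idx, []).append(word)
        if int(end // chunk_sec) != idx:  # word crosses the cut point
            straddlers.append(word)
    return chunks, straddlers

chunks, straddlers = chunk_words(words, CHUNK_SEC)
print(chunks)      # words grouped per chunk
print(straddlers)  # ['unexpectedly'] -- split across two chunks
```

Any word caught on a boundary is heard only partially by both chunks, which is exactly the accuracy loss single-pass processing avoids.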
VibeVoice’s architecture allows single-pass processing of up to 60 minutes of audio, maintaining context coherence throughout. Beyond ASR, the project includes TTS and fine-tuning modules — a complete speech AI toolkit.
Core capabilities:
- 60-minute single-pass processing: no manual chunking, no context loss
- Speaker diarization: built-in, automatic speaker labelling
- 50+ languages: covers major languages and dialects
- Custom hotwords: domain-specific vocabulary optimization
- vLLM plugin: high-performance inference acceleration
- Apple Silicon support: runs natively on Mac via the MPS backend
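VibeVoice’s own diarization output format is its own affair; the sketch below assumes a generic list of (speaker, text) segments and shows the kind of post-processing you typically layer on top of any diarizing recognizer, collapsing consecutive same-speaker segments into readable turns:

```python
def merge_turns(segments):
    """Collapse consecutive segments by the same speaker into one turn.

    `segments` is a list of (speaker_label, text) pairs -- an assumed
    generic format, not VibeVoice's actual output schema."""
    turns = []
    for speaker, text in segments:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous segment: extend that turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns

segments = [
    ("S1", "Welcome back."), ("S1", "Let's start with Q3."),
    ("S2", "Sure, revenue first?"), ("S1", "Yes."),
]
print(merge_turns(segments))
# [('S1', "Welcome back. Let's start with Q3."),
#  ('S2', 'Sure, revenue first?'), ('S1', 'Yes.')]
```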
Cost-wise, local execution means zero transcription fees. Compare that to current services — Whisper API at ~$0.36/hour, Deepgram at ~$0.26/hour, ElevenLabs at ~$0.40/hour — and for high-frequency usage, the payback period for local deployment is short.
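Back-of-the-envelope, using the per-hour rates above and hypothetical hardware cost and usage volume (both placeholders, not measured figures):

```python
# Rough break-even for local deployment vs per-hour API pricing.
# API rates are from the comparison above; hardware cost and monthly
# volume are hypothetical placeholders.
API_RATES = {"whisper_api": 0.36, "deepgram": 0.26, "elevenlabs": 0.40}  # $/hr

gpu_cost = 1600.0      # hypothetical one-off cost of a GPU workstation, $
hours_per_month = 500  # hypothetical high-frequency transcription volume

for name, rate in API_RATES.items():
    monthly_api_bill = rate * hours_per_month
    months_to_break_even = gpu_cost / monthly_api_bill
    print(f"{name}: ${monthly_api_bill:.0f}/mo, "
          f"breaks even in {months_to_break_even:.1f} months")
```

At 500 hours a month, the hardware pays for itself in well under a year against any of the three services.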
Getting Started
The project ships with a Gradio demo that lets you try ASR and TTS directly in the browser. For production, Docker deployment is supported.
If you have a GPU machine, the minimum path looks like:
```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
# install dependencies per README
# launch Gradio demo
python demo/app.py
```
Apple Silicon users can run directly on Mac via the MPS backend — no external GPU needed.
What Still Needs Watching
The project is young, and a few things are worth watching:
- Chinese accuracy — 50+ languages is claimed, but actual per-language performance needs community validation. Chinese is a high-value use case worth tracking separately
- VRAM requirements — 60-minute single-pass processing has high VRAM demands; lower-spec machines may need to adjust batch size or use chunked mode if available
- Head-to-head vs Whisper-large-v3 — VibeVoice’s differentiation is long audio and speaker diarization, but the gap on short audio and high-noise scenarios needs real-world testing
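On the VRAM point: if a machine can’t hold 60 minutes in one pass, the usual fallback is overlapping windows, where the overlap gives each chunk shared context across the cut points. A minimal sketch of the windowing arithmetic (the 10-minute window and 30-second overlap are illustrative choices, not project defaults):

```python
def overlapped_windows(total_sec, window_sec, overlap_sec):
    """Split a long recording into overlapping windows as a chunked
    fallback for VRAM-limited machines. Returns (start, end) pairs
    in seconds; consecutive windows share `overlap_sec` of audio."""
    step = window_sec - overlap_sec
    windows, start = [], 0.0
    while start < total_sec:
        windows.append((start, min(start + window_sec, total_sec)))
        if start + window_sec >= total_sec:
            break  # this window already reaches the end
        start += step
    return windows

# 60 minutes split into 10-minute windows with 30 s of shared context
print(overlapped_windows(3600, 600, 30))
```

Downstream you still have to reconcile the overlapping transcript regions, which is precisely the stitching work the single-pass mode exists to eliminate.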
Development activity is healthy: 134 commits, 112 closed issues, 32 PRs in progress.