Most product developers have run into this scenario: you want to add text-to-speech to your app, but the cost of TTS APIs makes you hesitate, you worry about sending private user data to the cloud, or the language you need just isn't on the supported list.
Supertonic 3 solves all three problems at once.
Last week, South Korean audio technology company Supertone officially released Python SDK v1.3.1 for Supertonic 3, adding the supertonic serve command. The command spins up a local HTTP server that exposes a native /v1/tts endpoint and an OpenAI-compatible /v1/audio/speech endpoint. In practice, a project that already uses the OpenAI TTS API can switch to a local deployment largely by pointing its base URL at the local server.
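The switch described above can be sketched with nothing but the Python standard library. Treat this as an illustration under assumptions rather than Supertonic's documented usage: the port (8000), the voice name, and the model name are placeholders, and the JSON fields (model, input, voice, response_format) are the ones OpenAI's speech endpoint uses, which an OpenAI-compatible server would presumably accept.

```python
import json
import urllib.request

# Assumption: the address `supertonic serve` listens on. Check the command's
# startup output for the real host and port.
LOCAL_BASE_URL = "http://127.0.0.1:8000"
OPENAI_BASE_URL = "https://api.openai.com"


def build_speech_request(base_url, text, voice="placeholder-voice", model="placeholder-model"):
    """Build a POST request for an OpenAI-style /v1/audio/speech endpoint."""
    payload = {
        "model": model,  # placeholder, not a documented Supertonic model name
        "input": text,
        "voice": voice,  # placeholder, not a documented Supertonic voice name
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url, text, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(base_url, text)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the local server running, synthesize(LOCAL_BASE_URL, "Hello", "hello.wav") would write the audio to disk. The only argument that differs between the hosted and local cases is the base URL, although the hosted API additionally needs an Authorization header that this sketch leaves out.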
Key Metrics
Let's start with the core metrics:
99M Parameters. Most open-source TTS models currently range from 0.7B to 2B parameters. Supertonic 3 achieves comparable performance with roughly 1/7 to 1/20 of the parameters, which directly lowers deployment costs. A smaller model means faster cold starts, a lower memory footprint, and, more importantly, the ability to run on devices without a GPU.
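As back-of-the-envelope arithmetic for the parameter comparison above, the snippet below estimates raw weight sizes. It assumes 4 bytes per parameter (float32). The article does not state the precision of the shipped weights, so the absolute numbers are illustrative; the ratios hold regardless of precision.

```python
# Parameter counts quoted in the article.
SUPERTONIC_PARAMS = 99_000_000
TYPICAL_RANGE = {"0.7B model": 700_000_000, "2B model": 2_000_000_000}

# Assumption: 4 bytes per parameter (float32). The precision of the shipped
# ONNX weights is not stated in the article.
BYTES_PER_PARAM = 4


def weights_size_mb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate raw weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


print(f"Supertonic 3: {weights_size_mb(SUPERTONIC_PARAMS):.0f} MB")
for name, params in TYPICAL_RANGE.items():
    ratio = SUPERTONIC_PARAMS / params
    print(f"{name}: {weights_size_mb(params):.0f} MB (Supertonic 3 is {ratio:.1%} of it)")
```

Under that assumption the weights come to about 396 MB, against roughly 2,800 MB and 8,000 MB at the two ends of the typical range.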
31 Languages. Arabic, Japanese, Korean, Vietnamese, Hindi... the coverage is quite extensive. It also supports a lang="na" mode: if you don't know what language the input text is in, Supertonic processes it in a language-agnostic way. This design is highly practical for real-world applications, as you often can't predict the user's input language in advance.
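One way an application might use the language-agnostic mode is as a fallback when its own language detection comes back empty or unrecognised. The helper below is a hypothetical sketch: the two-letter codes are an assumption (the article names the languages, not the codes the SDK expects), and only the five languages named above are listed.

```python
# Assumption: ISO 639-1 style codes for the five languages the article names.
# The codes Supertonic actually expects are not given in the article.
KNOWN_LANG_CODES = {"ar", "ja", "ko", "vi", "hi"}

# The language-agnostic mode described in the article (lang="na").
LANGUAGE_AGNOSTIC = "na"


def choose_lang(detected):
    """Return the detected code if recognised, else the language-agnostic mode."""
    if detected is None:
        return LANGUAGE_AGNOSTIC
    code = detected.strip().lower()
    return code if code in KNOWN_LANG_CODES else LANGUAGE_AGNOSTIC
```

For example, choose_lang("KO") returns "ko", while choose_lang(None) and choose_lang("xx") both return "na".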
44.1kHz / 16-bit WAV Output. The output is uncompressed audio at CD-quality sample rate and bit depth, not compressed MP3 and not low-sample-rate 22kHz audio. For scenarios like podcast production, audiobooks, and educational content, this quality is more than sufficient.
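To confirm that a generated file really has this format, and to estimate how much disk space uncompressed output takes, Python's standard wave module is enough. The sketch below assumes mono output when estimating size; the article does not state the channel count.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels, seconds) for a WAV.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    return rate, bits, channels, seconds


def pcm_bytes(seconds, rate=44_100, bits=16, channels=1):
    """Uncompressed PCM payload size in bytes (mono is an assumption)."""
    return int(seconds * rate * (bits // 8) * channels)


def silent_wav(seconds, rate=44_100):
    """Build a mono 16-bit WAV of silence in memory, for trying the helper."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf
```

At these settings one minute of mono audio is 60 × 44,100 × 2 = 5,292,000 bytes, or about 5.3 MB, which is worth budgeting for when generating audiobooks in bulk.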
Powered by ONNX Runtime. There are SDK examples for Python, Node.js, browser WebGPU, Java, C++, C#, Go, Swift, iOS, Rust, and Flutter, which covers almost every runtime you can think of. This isn't a project that "only runs in Python."
Expression Tags
I find this feature particularly interesting. Supertonic 3 supports 10 inline expression tags, such as <laugh>, <breath>, and <sigh>. You don't need to write prompts or provide reference audio; simply insert the tags directly into the text, and the generated speech will carry natural human intonation.
For example, with a text like this:
Finally finished this project today<sigh>, <laugh>great work everyone!
The generated speech will include a sigh after "today," followed by a laugh before "great work everyone." Natural tonal shifts like these previously required recording by professional voice actors; now they can be controlled via tags.
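Because the tags live inline in ordinary text, it can help to check them before sending text to the model, or to strip them when the same text is shown as a caption. The helper below is a small sketch, not part of the Supertonic SDK. It recognises only the three tags named above, since the article does not list the other seven, and it assumes tags are lowercase names in angle brackets.

```python
import re

# Only the three tags the article names; the remaining seven are not listed.
NAMED_TAGS = {"laugh", "breath", "sigh"}

# Assumption: a tag is a lowercase name wrapped in angle brackets.
TAG_PATTERN = re.compile(r"<([a-z_]+)>")


def find_tags(text):
    """Return expression tag names in order of appearance."""
    return TAG_PATTERN.findall(text)


def unrecognised_tags(text, allowed=NAMED_TAGS):
    """Return tags that are not in the allowed set, e.g. typos like <lagh>."""
    return [tag for tag in find_tags(text) if tag not in allowed]


def strip_tags(text):
    """Remove expression tags, leaving plain text for display."""
    return TAG_PATTERN.sub("", text)


example = "Finally finished this project today<sigh>, <laugh>great work everyone!"
print(find_tags(example))
print(strip_tags(example))
```

For the example sentence, find_tags returns the sigh and the laugh in that order, and strip_tags yields "Finally finished this project today, great work everyone!".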
Voice Builder: Zero-Shot Voice Cloning
Supertone has also launched Voice Builder, which supports zero-shot voice cloning. You upload a target voice sample, the system generates a corresponding voice profile (in JSON format), and you can then use this profile to generate speech for any text.
More usefully still, Voice Builder now supports downloading JSON files for both Supertonic 2 and Supertonic 3. If you previously created a voice profile for Supertonic 2, you can download the corresponding Supertonic 3 version directly from My Page.
When to Use It (and When Not To)
Ideal Use Cases:
- Embedding TTS features in an app or website without relying on external APIs
- Scenarios with strict data privacy requirements (healthcare, finance)
- Batch generation of multilingual content (audiobooks, educational materials)
- Edge device deployment (Raspberry Pi, embedded systems)
- Teams that need an OpenAI-compatible API but want to control costs
Less Suitable Scenarios:
- Scenarios requiring extreme naturalness, nearly indistinguishable from humans (e.g., film dubbing; the quality is good, but it still falls short of professional voice actors)
- Scenarios requiring real-time streaming output (Supertonic 3 operates in batch mode)
- Commercial projects with extremely high demands for specific voice timbres
Competitive Landscape
Supertonic isn't the first open-source TTS, nor the first to support multiple languages. However, in the 2026 open-source TTS ecosystem, its positioning is distinctive: it strikes a rare balance between parameter count, language coverage, and deployment flexibility.
Kokoro TTS is smaller (~82M parameters) but has limited language support. VITS-based models offer good quality but come with high deployment complexity. Supertonic 3 lowers the deployment barrier to a "pip install" level through ONNX Runtime's unified inference engine.
Coupled with the newly released supertonic serve command, it can now stand in for OpenAI's TTS API behind the same endpoint shape, making it a highly practical choice for teams looking to control costs and protect data privacy.
Conclusion
Supertonic 3 isn't the kind of model that pushes the absolute bleeding edge of technology. Its innovation lies more in engineering: achieving usable quality with fewer parameters, supporting as many languages as possible, providing SDKs for as many runtimes as possible, and making deployment as simple as possible.
In the AI tools space, sometimes "good enough + easy to use" matters more than "the most advanced." Supertonic 3 is walking exactly that path.