Supertonic 3: 99M Parameters, 31 Languages, Local Execution—Why This TTS Tool Can Replace Cloud APIs

Most product developers have run into this scenario: you want to add a text-to-speech feature to your app, but the cost of TTS APIs makes you hesitate, you worry about sending users' private data to the cloud, or the language you need just isn't on the supported list.

Supertonic 3 solves all three problems at once.

Last week, South Korean audio technology company Supertone officially released Python SDK v1.3.1 for Supertonic 3, adding the supertonic serve command. The command spins up a local HTTP server that exposes a native /v1/tts endpoint and an OpenAI-compatible /v1/audio/speech endpoint. In practice, a project that currently calls the OpenAI TTS API can switch to a local deployment by pointing its client at the local server's URL.
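As a rough sketch of what that URL switch could look like, the snippet below builds a request for the OpenAI-style /v1/audio/speech route using only the Python standard library. The address http://localhost:8000, the model name "supertonic-3", and the voice name "default" are placeholders assumed for illustration, not values taken from the Supertonic documentation; check the output of supertonic serve for the real ones. The JSON fields (model, input, voice, response_format) follow the request shape of OpenAI's speech endpoint, and whether the local server accepts every field is also an assumption.

```python
import json
import urllib.request

# Assumed address of a local `supertonic serve` instance; the real host/port may differ.
LOCAL_BASE_URL = "http://localhost:8000"


def build_speech_request(text, base_url=LOCAL_BASE_URL,
                         model="supertonic-3", voice="default"):
    """Build (but do not send) a POST request for the OpenAI-style speech route.

    `model` and `voice` are hypothetical placeholder values.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to `out_path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With a server running, synthesize_to_file("Hello from a local model.", "hello.wav") would save the response body to disk. Keeping the base URL in one constant is what makes the cloud-to-local swap a one-line change.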

Key Metrics

Let's start with the core metrics:

99M Parameters. Most open-source TTS models currently range from 0.7B to 2B parameters. Supertonic 3 achieves comparable performance with roughly 1/7 to 1/20 of the parameters, which directly affects deployment costs. A smaller model means faster cold starts, a lower memory footprint, and, more importantly, the ability to run on devices without a GPU.
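To put the parameter count in perspective, here is a back-of-the-envelope estimate of the raw weight size and of the size gap to a 0.7B and a 2B model. The precisions listed (fp32, fp16, int8) are generic assumptions; the article does not say which precision the released ONNX files use, and real memory use at inference time will be higher than the weights alone.

```python
PARAMS = 99_000_000  # parameter count quoted for Supertonic 3

# Bytes per parameter for common storage precisions (assumed, not confirmed for this model).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(num_params, precision):
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1_000_000


def param_ratio(small, large):
    """How many times larger `large` is than `small`."""
    return large / small


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: ~{weight_size_mb(PARAMS, precision):.0f} MB")
    print(f"vs 0.7B: {param_ratio(PARAMS, 700_000_000):.1f}x smaller")
    print(f"vs 2B:   {param_ratio(PARAMS, 2_000_000_000):.1f}x smaller")
```

Under these assumptions the weights come to about 396 MB in fp32 and 198 MB in fp16, and the model is about 7.1x smaller than a 0.7B model and 20.2x smaller than a 2B one.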

31 Languages. Arabic, Japanese, Korean, Vietnamese, Hindi, and more; the coverage is extensive. It also supports a lang="na" mode for cases where you don't know the language of the input text: Supertonic processes it in a language-agnostic way. This design is practical for real-world applications, where you often can't predict the user's input language in advance.

44.1kHz / 16-bit WAV Output. Not compressed MP3s, not low-sample-rate 22kHz audio, but uncompressed CD-quality output. For scenarios like podcast production, audiobooks, and educational content, this quality is more than sufficient.
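If you want to confirm that a generated file really has this format, Python's built-in wave module can read the header. The sketch below is generic WAV handling, not part of the Supertonic SDK. The make_silent_wav helper is a hypothetical stand-in that fabricates one second of silence so the example runs without a model, and the mono channel count is an assumption, since the article does not state how many channels the output has.

```python
import io
import wave


def describe_wav(wav_bytes):
    """Return sample rate, bit depth, channel count, and duration of a WAV payload."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def make_silent_wav(seconds=1.0, rate=44_100, channels=1):
    """Create a silent 16-bit PCM WAV in memory, used here as a stand-in for TTS output."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    return buffer.getvalue()


if __name__ == "__main__":
    print(describe_wav(make_silent_wav()))
```

For real output, pass the bytes of the generated file to describe_wav. At 44.1 kHz, 16-bit mono, uncompressed audio takes 88,200 bytes per second, or about 5.3 MB per minute, which is worth budgeting for in batch jobs.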

Powered by ONNX Runtime. Supports Python, Node.js, browser WebGPU, Java, C++, C#, Go, Swift, iOS, Rust, and Flutter, with SDK examples for almost every runtime you can think of. This isn't a project that "only runs in Python."

Expression Tags

I find this feature particularly interesting. Supertonic 3 supports 10 inline expression tags, such as <laugh>, <breath>, and <sigh>. You don't need to write prompts or provide reference audio; simply insert the tags directly into the text, and the generated speech will carry natural human intonation.

For example, with a text like this:

Finally finished this project today<sigh>, <laugh>great work everyone!

The generated speech will include a sigh after "today," followed by a laugh before "great work everyone." These natural tonal shifts, which previously required recording by professional voice actors, can now be controlled via tags.
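To make the tag placement concrete, the following sketch splits tagged text into an ordered list of plain-text and tag segments. It only illustrates where the tags sit in the input and is not how Supertonic parses them internally. The regular expression is a generic assumption about tag syntax, based on the three tag names quoted above.

```python
import re

# Generic pattern for inline tags such as <laugh>, <breath>, <sigh>.
TAG_PATTERN = re.compile(r"<([a-z_]+)>")


def split_expression_tags(text):
    """Split text into ("text", str) and ("tag", name) segments, in order."""
    segments = []
    position = 0
    for match in TAG_PATTERN.finditer(text):
        if match.start() > position:
            segments.append(("text", text[position:match.start()]))
        segments.append(("tag", match.group(1)))
        position = match.end()
    if position < len(text):
        segments.append(("text", text[position:]))
    return segments


if __name__ == "__main__":
    sample = "Finally finished this project today<sigh>, <laugh>great work everyone!"
    for kind, value in split_expression_tags(sample):
        print(kind, repr(value))
```

A helper like this can be handy for checking user-supplied text for unknown tags before sending it to the model.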

Voice Builder: Zero-Shot Voice Cloning

Supertone has also launched Voice Builder, which supports zero-shot voice cloning. You upload a target voice sample, the system generates a corresponding voice profile (in JSON format), and you can then use this profile to generate speech for any text.

More practically, Voice Builder now supports downloading JSON files for both Supertonic 2 and Supertonic 3. If you previously created a voice profile for Supertonic 2, you can download the corresponding Supertonic 3 version directly from your My Page.

When to Use It (and When Not To)

Ideal Use Cases:

Need to embed TTS features in an app/website without relying on external APIs
Scenarios with strict data privacy requirements (healthcare, finance)
Batch generation of multilingual content (audiobooks, educational materials)
Edge device deployment (Raspberry Pi, embedded systems)
Teams that need an OpenAI-compatible API but want to control costs

Less Suitable Scenarios:

Scenarios requiring extreme naturalness, nearly indistinguishable from humans (e.g., film dubbing; the quality is good, but it still falls short of professional voice actors)
Scenarios requiring real-time streaming output (Supertonic 3 operates in batch mode)
Commercial projects with extremely high demands for specific voice timbres

Competitive Landscape

Supertonic isn't the first open-source TTS, nor the first to support multiple languages. However, in the 2026 open-source TTS ecosystem, its positioning is distinctive: it strikes a rare balance between parameter count, language coverage, and deployment flexibility.

Kokoro TTS is smaller (~82M parameters) but has limited language support. VITS-based models offer good quality but come with high deployment complexity. Supertonic 3 lowers the deployment barrier to a "pip install" level through ONNX Runtime's unified inference engine.

Coupled with the newly released supertonic serve command, it can now stand in for OpenAI's TTS API directly, which makes it a practical choice for teams looking to control costs and protect data privacy.

Conclusion

Supertonic 3 isn't the kind of model that pushes the absolute bleeding edge of technology. Its innovation lies more in engineering: achieving usable quality with fewer parameters, supporting as many languages as possible, providing SDKs for as many runtimes as possible, and making deployment as simple as possible.

In the AI tools space, sometimes "good enough + easy to use" matters more than "the most advanced." Supertonic 3 is walking exactly that path.
