ChaoBro

OpenAI's WebRTC Approach May Not Be the Optimal Solution for Voice AI

OpenAI published a technical blog post a few days ago about how it uses WebRTC for real-time voice AI communication. A WebRTC veteran pushed back, not with casual criticism but with a protocol-level argument that WebRTC should not be used for this at all.

The author is Luke Curley, who wrote a WebRTC SFU at Twitch and rewrote Discord's WebRTC SFU in Rust. In his own words: "I'd consider myself a Certified WebRTC Expert. Which is why I never, never want to use WebRTC again."

His core argument is direct: WebRTC is a poor fit for voice AI.

Where the Problem Lies

WebRTC was designed for video conferencing: low latency, real-time interaction, and lost packets stay lost. Voice AI has different needs:

Your voice input is an expensive prompt. You say "should I walk or drive to the car wash" into your phone. It gets sent to OpenAI's servers, runs through LLM inference, and comes back as TTS-generated speech. Throughout this process, the user's input is the prompt. Lose it, and you've wasted money.

WebRTC's approach: aggressively drop packets under poor network conditions to maintain low latency. That garbled audio on conference calls? That's WebRTC. For meetings, latency matters more than quality. For voice AI, a complete prompt matters more than 200ms of latency.

More troubling, according to the author: WebRTC in browsers does not support retransmission of audio packets. Discord tried to enable it and could not get the required SDP munging to work. Even if audio NACKs could be enabled, WebRTC's jitter buffer is too small to wait for the retransmitted packets.

The result: a user's sentence gets truncated on a bad network, the LLM receives an incomplete prompt, and returns a wrong answer. The user won't know it's their network — they'll just think the AI is stupid.
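To make the trade-off concrete, here is a minimal sketch in Python. It is a toy model, not how WebRTC or any real transport behaves: each word stands in for an audio frame, and `deliver`, `lost_attempts`, the loss pattern, and the 100 ms round trip are all hypothetical values chosen for illustration.

```python
def deliver(frames, lost_attempts, max_retries, rtt_ms):
    """Send frames over a simulated lossy link.

    lost_attempts: set of (frame_index, attempt) pairs the network drops.
    max_retries=0 models drop-on-loss; a larger value models retransmission.
    Returns (frames that arrived, worst added delay in ms).
    """
    received, worst_delay_ms = [], 0
    for i, frame in enumerate(frames):
        for attempt in range(max_retries + 1):
            if (i, attempt) not in lost_attempts:
                received.append(frame)
                # Each retry costs roughly one extra round trip.
                worst_delay_ms = max(worst_delay_ms, attempt * rtt_ms)
                break
    return received, worst_delay_ms


prompt = "should I walk or drive to the car wash".split()
# Made-up loss pattern: "walk" is lost once, "to" is lost twice.
lost = {(2, 0), (5, 0), (5, 1)}

print(deliver(prompt, lost, max_retries=0, rtt_ms=100))
print(deliver(prompt, lost, max_retries=3, rtt_ms=100))
```

Without retries, the model is handed "should I or drive the car wash" with no added delay. With retries, it gets the full sentence, and the worst-case cost in this made-up scenario is 200 ms.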

The TTS Buffering Problem

Another technical detail is more interesting.

OpenAI's TTS generates speech faster than real-time (say, 2 seconds of GPU time for 8 seconds of audio). Ideally, the server streams audio as it is generated and the client buffers it before playback. That way, network jitter is absorbed by the local buffer.

But WebRTC has no mechanism for buffering ahead of playback: packets are rendered as they arrive. To compensate, OpenAI has to add a sleep before sending each audio packet, timing it to arrive at the client at exactly the right moment.

This means artificially introducing latency, then aggressively dropping packets to "keep latency low." The author used a vivid analogy: it's like screen-sharing a YouTube video instead of buffering it.
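The difference between pacing and buffering can be sketched with a small simulation. This is a simplified model under stated assumptions, not OpenAI's implementation: `starved_ms` is a hypothetical helper, time advances in 1 ms ticks, delivery resumes where it left off after a stall, and propagation delay is ignored. The 8 seconds of audio at 4x real time comes from the example above; the one-second stall is invented.

```python
def starved_ms(total_ms, speedup, paced, stall):
    """Count the milliseconds in which the client had nothing to play.

    total_ms: length of the generated speech
    speedup:  ms of audio the server can produce per ms of wall time
    paced:    True  -> server sends 1 ms of audio per ms (sleep before each packet)
              False -> server sends as fast as it generates; the client buffers
    stall:    (start_ms, end_ms) window in which nothing gets through
    """
    stall_start, stall_end = stall
    rate = 1 if paced else speedup
    sent = played = starved = t = 0
    while played < total_ms:
        if not (stall_start <= t < stall_end):
            sent = min(total_ms, sent + rate)  # audio arrives at the client
        if played < sent:
            played += 1                        # play 1 ms from the buffer
        else:
            starved += 1                       # buffer empty: audible gap
        t += 1
    return starved


# 8 s of speech generated 4x faster than real time, with a 1 s network stall.
print(starved_ms(8000, 4, paced=True, stall=(1000, 2000)))   # paced sender
print(starved_ms(8000, 4, paced=False, stall=(1000, 2000)))  # buffered client
```

In this model the paced sender leaves the client silent for the full 1000 ms of the stall, while the buffered client has already received 4 seconds of audio when the stall begins and plays straight through it.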

Ports and Scalability

WebRTC has another engineering pitfall: port management.

A TCP server opens one port (say, 443) and serves all connections. WebRTC needs to manage a large number of UDP ports, with each connection negotiating connectivity through ICE, STUN, and TURN. At scale, this is a non-trivial operational burden.

OpenAI's blog spent considerable space explaining how their load balancer solves this. The author's subtext: if your protocol didn't require these workarounds, you wouldn't need to spend this effort.

Alternative

The article recommends MoQ (Media over QUIC) — a real-time media protocol based on QUIC, currently being standardized at IETF. Cloudflare already provides CDN support.

MoQ's advantages:

  • QUIC-based, natively supports multiplexing and connection migration
  • Doesn't aggressively drop packets; retransmission strategies can be configured
  • Supports buffering, suitable for TTS's "fast generate, slow play" scenario
  • No complex port management needed
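The first and last points share a root: QUIC packets carry a connection ID, so a server can tell sessions apart on a single UDP port and keep a session alive when the client's address changes. Here is a rough sketch of that idea. The framing is made up for illustration (an 8-byte ID prefix, which is not the real QUIC header layout), and `SinglePortDemux` is a hypothetical name rather than part of any QUIC or MoQ library.

```python
CID_LEN = 8  # hypothetical fixed-length connection ID prefix, not the real QUIC header


class SinglePortDemux:
    """Route datagrams arriving on one UDP port to sessions by connection ID."""

    def __init__(self):
        # connection ID -> {"addr": (host, port), "payloads": [bytes, ...]}
        self.sessions = {}

    def on_datagram(self, data, addr):
        cid, payload = data[:CID_LEN], data[CID_LEN:]
        session = self.sessions.setdefault(cid, {"addr": addr, "payloads": []})
        # The ID, not the source address, identifies the session, so a client
        # that moves from Wi-Fi to cellular keeps its session. Real QUIC also
        # validates the new path before switching; that step is omitted here.
        session["addr"] = addr
        session["payloads"].append(payload)
        return session


demux = SinglePortDemux()
demux.on_datagram(b"AAAAAAAA" + b"hello", ("203.0.113.5", 40000))
demux.on_datagram(b"BBBBBBBB" + b"other user", ("198.51.100.7", 41000))
demux.on_datagram(b"AAAAAAAA" + b" again", ("192.0.2.9", 52000))  # same ID, new address

print(len(demux.sessions))
print(demux.sessions[b"AAAAAAAA"])
```

Two sessions share one listening port here, and the first session survives an address change without any per-connection port allocation.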

MoQ isn't production-ready yet. The IETF standard is still in draft, and the ecosystem is far less mature than WebRTC. But for voice AI — an emerging field — making the right infrastructure choice now might save a lot of rework later.

My Take

This article's value isn't "OpenAI got it wrong." It's a reminder that voice AI infrastructure selection is still in its early stages, and WebRTC may be path dependency rather than the optimal choice.

Taking a conferencing protocol and repurposing it for voice AI is like using HTTP/1.1 for real-time gaming — it works, but it's not the right tool.

For teams building voice AI products, it's worth carefully evaluating WebRTC's trade-offs. If prompt completeness matters more than 200ms of latency in your scenario, QUIC-based alternatives may deserve a look.

