Why real-time voice translation on a phone call is so hard
If you have ever used a translation app that makes you press a button, wait, then press another, you know the gulf between *translation* and *real-time translation*. The first is a tool. The second is a conversation. Closing that gap on an ordinary phone call is harder than it sounds, because the phone network was never designed for it.
This post is the user-facing version of why it is hard. The short sketches along the way are illustrative stand-ins, not a description of any specific implementation.
What "real-time" really means
Real-time means the conversation flows at human-conversation pace. Practically, that's about half a second of delay between the moment one person stops talking and the moment the other starts hearing the translation. Anything more and the call starts feeling like a walkie-talkie. Anything less than 500 ms is rare even for professional human interpreters, who typically lag the speaker by a second or more.
Hitting that target requires a chain of operations to all happen quickly:
1. The system needs to know when a sentence ends. This sounds trivial; it is not. Phones carry breath, room noise, and incidental sound that look like speech to a naive detector.
2. The system needs to understand what was said. Speech recognition is far better than it was ten years ago, but accents, regional dialects, and bad-line conditions still confuse models.
3. The system needs to translate accurately. Word-by-word translation often produces nonsense across languages with different word orders. Sentence-level translation is more accurate but slower.
4. The system needs to speak the result back in a voice that doesn't sound like a robot reading the news.
Each of those four stages eats into the latency budget, and the target leaves roughly half a second for all four end-to-end.
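To make the dependency chain concrete, here is a minimal orchestration sketch. Every name in it (`detect_sentence_end`, `recognize`, `translate`, `synthesize`) is a hypothetical stand-in, not any real product's API; the point is that each stage can only start after the previous one finishes, so their delays add up.

```python
import time

# Hypothetical stage implementations; the bodies are stubs.
async def detect_sentence_end(audio_stream) -> bytes:
    """Wait until the speaker has plausibly finished a sentence."""
    ...

async def recognize(utterance: bytes) -> str:
    """Speech-to-text on the completed utterance."""
    ...

async def translate(text: str, target_lang: str) -> str:
    """Sentence-level machine translation."""
    ...

async def synthesize(text: str) -> bytes:
    """Text-to-speech in the target language."""
    ...

async def translate_turn(audio_stream, target_lang: str) -> bytes:
    # Strictly sequential: no stage can start early, so latencies add.
    start = time.monotonic()
    utterance = await detect_sentence_end(audio_stream)  # ~500 ms
    text = await recognize(utterance)                    # ~100 ms
    translated = await translate(text, target_lang)      # ~100 ms
    speech = await synthesize(translated)                # ~150 ms
    print(f"turn latency: {(time.monotonic() - start) * 1000:.0f} ms")
    return speech
```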
Why a phone call is harder than a video call
Apps that handle video calling between two browsers have several tools that a phone call does not:
- Higher audio quality. Browser audio is captured wideband, typically at 48 kHz. Phone calls still run at the narrowband spec the network was designed around decades ago: 8 kHz sampling, a passband of roughly 300–3400 Hz, and more compression artifacts. Speech recognition simply has more signal to work with on browser audio (the sketch after this list shows how much the phone channel throws away).
- Client-side processing. Browser apps can pre-process audio on the user's device — echo cancellation, noise suppression, sentence-boundary detection. Phones offer none of that to whoever is hosting the call.
- A controlled network path. Browsers negotiate the route between two endpoints. Phone calls go through whatever carrier path the network picks.
- A second screen for fallback. Video calls can show subtitles, a language picker, and mute buttons. Phone calls have audio only.
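If you want to hear the gap yourself, you can simulate the phone channel from any wideband recording. A minimal sketch using the standard narrowband telephony figures (8 kHz sampling, a passband of roughly 300–3400 Hz); the file names are placeholders, and it assumes a mono WAV input:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt, resample_poly

# Load a wideband recording (e.g. 48 kHz); 'input.wav' is a placeholder.
rate, audio = wavfile.read("input.wav")
audio = audio.astype(np.float32)

# Classic telephone passband: everything outside ~300-3400 Hz is discarded.
sos = butter(4, [300, 3400], btype="bandpass", fs=rate, output="sos")
narrowband = sosfilt(sos, audio)

# Downsample to the 8 kHz rate the phone network was built around.
phone_audio = resample_poly(narrowband, up=8000, down=rate)

wavfile.write("phone_sim.wav", 8000,
              np.clip(phone_audio, -32768, 32767).astype(np.int16))
```

Feed the original and the simulated file to the same speech recognizer and the accuracy gap is hard to miss.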
The phone network's biggest constraint is its biggest gift: the recipient doesn't need an app. That is the entire reason translated phone calls are useful for reaching grandparents, suppliers, hotline staff, and anyone else who isn't going to install something. Hiding the technical hardness behind "answer the phone like normal" is the product.
Where the latency hides
A healthy real-time translated call uses about this much time per direction:
- Detecting sentence end: ~500 ms (deliberately tuned: too short cuts people off, too long lags the call; sketched after this list)
- Recognizing what was said: ~100 ms after sentence end
- Translating to the target language: ~100 ms
- Synthesizing the translated voice: ~150 ms
- Network and orchestration overhead: ~100 ms
Total: roughly 950 ms per turn once the line items above are added up, and sometimes more than a second on noisy calls. That is why real-time translated calls feel like a slightly delayed satellite call.
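The first line item deserves a closer look, because it is a tuned trade-off rather than a fixed cost. Here is a toy energy-based endpointer; real systems use trained voice-activity models, and the threshold below is purely illustrative.

```python
FRAME_MS = 20            # analysis frame length
HANG_MS = 500            # continuous silence required to call a sentence over
ENERGY_THRESHOLD = 1e-4  # illustrative; real systems adapt this to the line

def is_speech(frame: list[float]) -> bool:
    """Crude voice-activity check: mean energy above a threshold."""
    return sum(s * s for s in frame) / len(frame) > ENERGY_THRESHOLD

def find_sentence_end(frames: list[list[float]]) -> int | None:
    """Return the index of the frame where the sentence ended, or None."""
    silent_ms = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if is_speech(frame):
            heard_speech = True
            silent_ms = 0  # speech resumed; reset the hang timer
        else:
            silent_ms += FRAME_MS
            # Shorter hang time cuts off slow speakers;
            # longer hang time makes every turn feel laggier.
            if heard_speech and silent_ms >= HANG_MS:
                return i
    return None
```

Every millisecond added to HANG_MS lands directly on the total above, which is why that one knob gets so much tuning attention.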
What good translation feels like
A few signs the system is working:
- Both sides speak naturally. No "speak slowly" tutorial required. No pausing between every word.
- Each speaker keeps their own voice. A translation that sounds like a robot reading subtitles feels like a foreign call. A translation that matches the speaker's tone feels like a real conversation.
- You can interrupt. Real conversations have overlap. The system handles that without dropping audio (see the sketch after this list).
- Code-switching works. People who speak two languages often mix them. The system follows along.
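Interruption handling (often called barge-in) mostly comes down to letting incoming speech pre-empt outgoing playback without discarding anything. A toy sketch, where `vad` and `Playback` are hypothetical stand-ins for a voice-activity detector and an audio output handle:

```python
class Playback:
    """Hypothetical handle on the translated audio currently playing."""
    def pause(self) -> None: ...   # stop speaking, remember the position
    def resume(self) -> None: ...  # continue from where we paused

async def barge_in_loop(vad, playback: Playback) -> None:
    """Pause translated playback the moment the listener starts talking.

    Their speech keeps flowing into the recognizer the whole time,
    so nothing is dropped while playback is paused.
    """
    while True:
        # Hypothetical event stream: "speech_started" / "speech_ended".
        event = await vad.next_event()
        if event == "speech_started":
            playback.pause()
        elif event == "speech_ended":
            playback.resume()
```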
Where translation breaks
Real-time AI translation is not magic. Some things still degrade quality:
- Heavy regional dialects. Standard speech of major languages translates well. Heavy local dialects still give the system trouble.
- Multiple simultaneous speakers. A single voice at a time is the design assumption. A noisy room defeats it.
- Whispered or shouted speech. Recognition models are tuned for normal volume. Outliers in volume hurt accuracy.
- Highly technical jargon. A casual call works fine. A call about cardiothoracic surgical procedures or aerospace engineering may need a domain-specialized human interpreter.
The right framing is that real-time AI translation handles the broad middle of conversations brilliantly. At the edges, chaotic audio on one side and high-stakes specialist domains on the other, a human is sometimes still the better choice.
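For what it's worth, one partial mitigation for the jargon problem is a domain glossary applied as a post-edit pass over the translation. A toy sketch with invented entries, not any real deployment's term list:

```python
# Toy post-edit glossary: the generic rendering a model tends to produce,
# mapped to the term the domain actually uses. Entries are invented.
GLOSSARY = {
    "heart tube": "stent",
    "blood vessel detour": "bypass graft",
}

def apply_glossary(translated: str) -> str:
    """Swap generic renderings for preferred domain terms."""
    for generic, preferred in GLOSSARY.items():
        translated = translated.replace(generic, preferred)
    return translated
```

It is a blunt instrument, which is rather the point: when the vocabulary is that specialized, a human interpreter is often still the safer choice.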
Read more
- Pricing — per-country rates for 230+ destinations
- Owaa vs tuwa.ai — feature-by-feature comparison
- Help — using the hotline and web call