Live Conversation Translation

On-Device Live Conversation Translation with End-to-End Speech-to-Speech Pipeline

Two speakers talk naturally in their own language. Each hears the other translated in real time — transcribed, translated, and spoken aloud by an on-device pipeline with no audio sent to the cloud.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How on-device live conversation translation works

One real-time translation loop powered by on-device Voice AI and Language SDKs

On-device live conversation translation combines Cheetah Streaming Speech-to-Text, Zebra Translate, and Orca Streaming Text-to-Speech in a single local pipeline that runs in a loop. Cheetah converts Speaker 1's speech into text. Zebra translates it to Speaker 2's language. Orca reads the translated text aloud. When Speaker 2 hears the translation and responds, the same loop runs in reverse — Cheetah transcribes Speaker 2's response, Zebra translates it back, and Orca speaks it in Speaker 1's language. Live conversation translation removes the language barrier, but only if the delay between speaker turns is short enough to preserve natural conversation flow. The latency introduced by cloud translation APIs — typically hundreds of milliseconds per round-trip — compounds across every turn and breaks that flow. Lightweight on-device voice and language SDKs eliminate the round-trip entirely, which is why on-device processing is critical for real-world adoption.
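The recipe at the end of this page has the working code; below is a minimal sketch of one direction of the loop. The pvcheetah, pvorca, and pvrecorder packages are real Picovoice Python SDKs (method names follow their public APIs but should be verified against the docs); the pvzebra module and its translate() call are illustrative stand-ins for the Zebra Translate SDK.

    import pvcheetah
    import pvorca
    from pvrecorder import PvRecorder

    cheetah = pvcheetah.create(access_key='${ACCESS_KEY}')  # streaming speech-to-text
    orca = pvorca.create(access_key='${ACCESS_KEY}')        # streaming text-to-speech
    # 'pvzebra' and 'translate()' are illustrative stand-ins for Zebra Translate:
    # zebra = pvzebra.create(access_key='${ACCESS_KEY}', source='en', target='es')

    recorder = PvRecorder(frame_length=cheetah.frame_length)
    recorder.start()

    transcript = ''
    try:
        while True:
            partial, is_endpoint = cheetah.process(recorder.read())
            transcript += partial
            if is_endpoint:  # the speaker paused: finalize this utterance
                transcript += cheetah.flush()
                # translated = zebra.translate(transcript)  # Zebra: EN -> ES
                # pcm = orca.synthesize(translated)         # Orca: speak the translation
                transcript = ''  # swap source and target languages for the reply
    finally:
        recorder.delete()
        cheetah.delete()
        orca.delete()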

Pipeline diagram: Speaker A (English) → Cheetah (streaming STT) → Zebra (translation) → Orca (streaming TTS) → Speaker B (Spanish)
Why Cheetah Streaming Speech-to-Text?

Lowest latency. Lowest compute. No accuracy tradeoff.

10.1%
WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08
CPU core-hours vs. 3.36 for Moonshine Medium (40x less)
8.6%
WER (Spanish) vs. 11.6% Google and 9.4% Azure

Cheetah Streaming Speech-to-Text beats Google Cloud STT in word error rate and word emission latency across all tested languages, and outperforms Azure STT in several benchmarks, per the open-source real-time transcription benchmark — even before it's customized for the use case. It emits words at 590 ms median latency, typically one word behind the speaker. Cheetah requires less compute than any other local engine tested. Result: no tradeoff on accuracy, latency, or privacy, and no minimum hardware requirements.
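For readers who want to reproduce the comparison on their own audio: word error rate is word-level edit distance normalized by reference length. A self-contained Python sketch of the metric behind these charts (not Picovoice's benchmark harness, which is open-source on GitHub):

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + insertions + deletions) / reference word count."""
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("turn on the lights", "turn the light on"))  # 0.75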

English Word Error Rate
Lower is better
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate
Lower is better
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36.0%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why Zebra Translate?

120 words per second. Opus-level accuracy. Zero network latency.

<100 MB
Peak Memory Usage
<80 MB
Model size per language pair
1:1
Accuracy match with Opus MT by Helsinki NLP

Zebra returns up to 120 words per second on-device — fast enough that translation is never the bottleneck in the pipeline. It matches the accuracy of Opus, one of the best-known open translation models, while running locally. Because no request ever leaves the device, Zebra adds zero network latency, an edge that cloud translation APIs such as Google Translate and DeepL can never match.
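BLEU, the metric in the chart below, can be reproduced with the open-source sacrebleu package. This snippet illustrates the metric only; it is not Picovoice's benchmark harness, and the package dependency (pip install sacrebleu) is an assumption about your environment:

    import sacrebleu

    hypotheses = ['the cat sat on the mat']    # model output, one entry per sentence
    references = [['the cat sat on the mat']]  # one reference stream, aligned to hypotheses
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(round(bleu.score, 1))                # 100.0 for an exact match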

Translation accuracy (BLEU) — higher is better
Zebra (DE → EN): 51
Opus (DE → EN): 51
Zebra (EN → FR): 55
Opus (EN → FR): 55
Zebra (ES → IT): 58
Opus (ES → IT): 57
Translation speed (words / sec) — higher is better
Zebra (DE → EN): 112
Opus (DE → EN): 45
Zebra (EN → FR): 105
Opus (EN → FR): 41
Zebra (ES → IT): 90
Opus (ES → IT): 36
Why Orca Streaming Text-to-Speech?

Natural-sounding TTS at 29 MB peak memory.

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model size

Most high-quality TTS solutions require hundreds of megabytes of RAM. Orca TTS uses 29 MB peak memory, 10–50× less than any other on-device alternative except ESpeak. This makes Orca the only natural-sounding TTS deployable in any environment, including browser tabs, mobile apps with strict out-of-memory limits, and embedded devices. Orca Streaming TTS is built for real-time applications; it starts speaking as soon as the upstream component (an LLM, or Zebra Translate in this pipeline) produces a meaningful word or phrase.
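A sketch of that streaming pattern with the pvorca and pvspeaker Python SDKs. The stream_open/synthesize/flush calls match our reading of the public SDKs, but verify the exact signatures against the Orca and PvSpeaker docs:

    import pvorca
    from pvspeaker import PvSpeaker

    orca = pvorca.create(access_key='${ACCESS_KEY}')
    stream = orca.stream_open()  # incremental synthesis session
    speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
    speaker.start()

    # Feed text as it arrives (e.g., translated phrases from Zebra):
    for chunk in ['Hola, ', 'cómo ', 'estás?']:
        pcm = stream.synthesize(chunk)  # returns audio once enough text has accumulated
        if pcm is not None:
            speaker.write(pcm)

    pcm = stream.flush()  # synthesize any remaining buffered text
    if pcm is not None:
        speaker.write(pcm)

    speaker.flush()  # wait for buffered audio to finish playing
    speaker.delete()
    stream.close()
    orca.delete()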

TTS Latency
Lower is better
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Audio Quality
Listen and compare — grouped by peak memory usage.
Peak memory usage < 30 MB: ESpeak, Orca
On-device live conversation translation use cases

From wearables to clinic intake

Field interviews & law enforcement

On-device translation for field interviews

Law enforcement officers conducting interviews, taking witness statements, and communicating with suspects who speak different languages need two-way translation in the field — with no time to wait for a human interpreter. On-device live conversation translation matters for two reasons: body cameras and patrol devices frequently operate in areas with poor or no connectivity, and audio of police interactions is sensitive and cannot be routed through a third-party cloud service.

Wearables & IoT

Licensable translation engine for wearables & IoT

Wearable translators like earbuds and smart glasses typically depend on cloud APIs or whatever is available on the paired phone. Google ML Kit Translation cannot run on embedded devices like cars or kiosks without Google's permission and requires Google Translate attribution, limiting OEM branding. Picovoice ships on-device voice and language SDKs that hardware makers can license and embed for two-way live conversation translation across earbuds, handhelds, in-vehicle systems, and kiosks.

Tourists & expats

Two-way translation for tourists and expats

Tourists and expats navigate markets, hotels, restaurants, and transport in countries where they do not speak the language. On-device live conversation translation works on a plane, underground, in rural areas, and anywhere roaming data is expensive or unavailable — both speakers talk naturally and hear each other in their own language. No cloud dependency means no degraded experience when the signal drops mid-conversation, regardless of where in the world they are.

Customs & border interviews

Private, offline translation for customs and border interviews

Border agents interview travelers who may not speak English or the local language in noisy, time-pressured environments. The audio is sensitive and cannot be sent to a commercial cloud. Connectivity at land border crossings and remote processing centers is often unreliable. On-device live conversation translation keeps all audio off external servers, eliminates network latency, and gives both agent and traveler a real-time two-way voice channel without a human interpreter.

Emergency response

Live translation when seconds matter

Paramedics, firefighters, and emergency dispatchers face language barriers in situations where seconds matter. A patient in distress who cannot describe symptoms, or a bystander who cannot explain what happened, can delay critical decisions. On-device live conversation translation works even when connectivity is poor, adds no perceptible latency to the two-way exchange, and keeps sensitive audio and personally identifiable information off external servers entirely.

Conferences & trade shows

Real-time translation for conferences, summits, and trade shows

Conferences, summits, trade shows, and sporting events bring together multilingual attendees who need real-time two-way translation that does not depend on venue Wi-Fi. Convention center networks are notoriously overloaded during large events. On-device live conversation translation runs on the attendee's own device or event hardware, allowing participants to discuss even confidential deals with no delays or privacy concerns.

Get started

Build an on-device live conversation translation app: Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · live-conversation-translation
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey from the Picovoice Console and a local clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/live-conversation-translation/python.
1

Create a virtual environment

Isolate the recipe's dependencies from your system Python.
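The standard invocation (assuming python3 is on your PATH):

    python3 -m venv .venv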
2

Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
Linux, macOS, or Raspberry Pi
Windows
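The activation command differs by platform:

    # Linux, macOS, or Raspberry Pi
    source .venv/bin/activate

    # Windows
    .venv\Scripts\activate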
3

Install requirements

Install the Cheetah, Zebra, Orca, PvRecorder, and PvSpeaker Python SDKs.
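Assuming the recipe ships a requirements.txt listing those SDKs (the usual layout for pico-cookbook recipes), the install is:

    pip install -r requirements.txt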
4

Download the Required Models

Run the setup script to download the models for Cheetah, Zebra, and Orca.
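The exact script name is in the recipe; a typical invocation looks like this (the file name below is illustrative, not the recipe's actual script):

    python3 download_models.py  # illustrative name; see the recipe for the real setup script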
5

Run the live conversation translation demo

Pass your AccessKey and the pair of languages used in the conversation (e.g., es-en) to run the demo. The demo runs the two-way translation pipeline locally — Cheetah transcribes each speaker, Zebra translates into the other speaker's language, and Orca speaks the translation aloud.
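A hypothetical invocation (the file and flag names below are illustrative; check the recipe's README for the exact ones):

    python3 main.py --access-key ${ACCESS_KEY} --languages es-en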
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook Live Conversation Translation recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions

FAQ

+
What is live conversation translation?
Live conversation translation is a two-way, real-time speech translation flow where each speaker talks naturally in their own language and hears the other person translated as they converse. The pipeline transcribes one speaker's audio, translates it to the second speaker's language, synthesizes the translated voice, and does the same in reverse, with sub-second latency.
+
What is the difference between speech-to-speech translation and live conversation translation?
Speech-to-speech translation is a single-direction flow: one speaker talks and the output is translated into another language. Live conversation translation is bidirectional — two speakers talk naturally in their own languages, and each hears the other translated in real time, in a continuous loop. Picovoice supports both: the speech-to-speech translation guide covers single-direction use cases, and this page covers the two-way conversation flow.
+
Can on-device live translation work without an internet connection?
Yes. Cheetah Streaming Speech-to-Text, Zebra Translate, and Orca Streaming Text-to-Speech all run locally on the device, so live conversation translation works fully offline. Useful at borders, on flights, in clinics with weak coverage, or anywhere the carrier signal is unreliable.
+
What languages are supported?
Cheetah Streaming Speech-to-Text supports English, French, German, Spanish, Italian, and Portuguese. Zebra Translate supports language pairs across English, French, German, Spanish, Italian, Portuguese, Japanese, and Korean. Orca Text-to-Speech generates natural speech in English, French, German, Spanish, Italian, Portuguese, Japanese, and Korean. For detailed language support, check each product's page or documentation.
+
Does live conversation translation send audio or text to the cloud?
No. The entire pipeline runs locally on the device. Audio is processed in memory and discarded. Picovoice has no data controller relationship with end users, which removes cloud voice-data compliance obligations, including BAAs under HIPAA. This makes on-device live translation suitable for clinical interpretation, financial advisory calls, and legal interviews, where patient or caller audio must not leave the device.
+
How is this different from Timekettle, Pocketalk, or other translator earbuds?
Translator earbuds are hardware bundles that ship a single integrated experience. Picovoice ships the underlying on-device translation pipeline as licensable SDKs — you decide the form factor: a phone app, a desktop app, an in-vehicle system, embedded headsets, or smart kiosks. Same engines, your product.
+
How fast is on-device translation?
Cheetah begins emitting transcript tokens within ~590 ms of first audio. Zebra Translate generates up to 120 words per second. Orca generates the first spoken token in 130 ms. End-to-end, the listener typically hears the translated voice begin within roughly one second after the speaker finishes a phrase, fast enough to preserve natural conversation flow without overlapping speech.
+
How can I get technical support?
Visit the GitHub pico-cookbook Live Conversation Translation recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.