On-device Live Captioning and Translation

Live captioning and translation, running entirely on-device

Stream audio for live transcription and real-time translation, and watch source-language and translated captions appear together on screen. Sub-second latency end-to-end. Audio never leaves the device.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How live captioning and translation is built

On-device Voice AI and Language SDKs in a single pipeline

On-device live captioning and translation combines streaming speech-to-text with machine translation in a single local pipeline. Most implementations split these across a cloud STT API and a separate translation API, paying network round-trip latency twice and sending audio off the device at every step. Picovoice runs both stages on-device: the streaming STT engine produces source-language captions as the speaker talks, and the translation engine converts them into the target language in the same pipeline, with no audio or text ever transmitted to a third-party service.

Microphone (audio in) → Cheetah (streaming STT) → Zebra (translation) → Captions (bilingual UI)
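In essence, each audio frame passes through both stages inside a single loop, with no network hop between them. A conceptual sketch only; the helper names and the translate call here are illustrative, not the actual API (the complete, runnable version appears in the recipe steps below):

```python
# Conceptual loop: both engines run in-process on the same audio frames.
while streaming:
    partial, is_endpoint = stt.process(next_audio_frame())  # streaming STT stage
    show_source_caption(partial)
    if is_endpoint:                                         # phrase boundary reached
        show_translated_caption(translator.translate(finalize_phrase()))
```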
Why Cheetah Streaming Speech-to-Text?

Lowest latency. Lowest compute. No accuracy tradeoff.

10.1% WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08 CPU core-hours vs. 3.36 Moonshine Medium (40x less)
8.6% WER (Spanish) vs. 11.6% Google and 9.4% Azure

Cheetah Streaming Speech-to-Text beats Google Cloud STT in word error rate and word emission latency across all tested languages, and outperforms Azure STT in several benchmarks, per the open-source real-time transcription benchmark — even before it's customized for the use case. It emits words at 590 ms median latency, typically one word behind the speaker. Cheetah requires less compute than any other local engine tested. Result: no tradeoff on accuracy, latency, or privacy, and no minimum hardware requirements.

English Word Error Rate (lower is better)
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate (lower is better)
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36.0%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why Zebra Translate?

120 words per second. Opus-level accuracy. Zero network latency.

<100 MB peak memory usage
<80 MB model size per language pair
1:1 accuracy match with Opus MT by Helsinki NLP

Zebra, Picovoice's on-device translation SDK, returns up to 120 words per second, far faster than the roughly 2 words per second a person speaks or the 5 words per second a person reads. The speed does not come at the cost of accuracy: Zebra matches Opus MT, one of the best-known open translation models. On-device processing adds a structural advantage that cloud translation APIs such as Google Translate and DeepL cannot match: zero network latency.

Translation accuracy (BLEU; higher is better)
Zebra (DE → EN): 51
Opus (DE → EN): 51
Zebra (EN → FR): 55
Opus (EN → FR): 55
Zebra (ES → IT): 58
Opus (ES → IT): 57
Translation speed (words/sec; higher is better)
Zebra (DE → EN): 112
Opus (DE → EN): 45
Zebra (EN → FR): 105
Opus (EN → FR): 41
Zebra (ES → IT): 90
Opus (ES → IT): 36
Live captioning and translation use case examples

From broadcasts to cross-platform CART captioning

Broadcasts & streaming

Live captions and translation for broadcasts

Broadcasts — whether for a company-wide meeting, a city-wide council discussion, or educational content — require live captioning and translation for accessibility and reach. They need captions that track audio within seconds and survive flaky uplinks. On-device live captioning and translation meets these requirements and captions the same feed in multiple languages simultaneously.

CART captioning & accessibility

Cross-platform CART captioning

Apple Live Captions and Google Live Captions work well within their own infrastructure. If you are shipping CART captioning or accessibility features in a cross-platform product — including lecture capture systems, hospital communication tools, or public-sector apps — Picovoice gives you the same on-device live captioning capability across Android, iOS, web, embedded, Linux, macOS, and Windows.

Get started

Build an on-device live captioning and translation app in 3 steps

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · live-captioning-and-translation
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey (obtained from Picovoice Console) and a local clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/live-captioning-and-translation/python.
Step 1: Create a virtual environment

Isolate the recipe's dependencies from your system Python.
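The standard Python tooling command, with no recipe-specific assumptions:

```console
python3 -m venv .venv
```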
Step 2: Activate the virtual environment

Activation makes pip install into .venv instead of system Python. The command differs between Linux/macOS/Raspberry Pi and Windows, as shown below.
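The standard activation commands for each platform:

```console
# Linux, macOS, or Raspberry Pi
source .venv/bin/activate

# Windows
.venv\Scripts\activate
```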
Step 3: Install dependencies

Pulls in the Cheetah and Zebra Python SDKs along with audio I/O.
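Assuming the recipe ships a requirements.txt, as pico-cookbook recipes typically do:

```console
pip3 install -r requirements.txt
```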
Step 4: Download the required models

Run the setup script to download the models for Cheetah Streaming Speech-to-Text and Zebra Translate.
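The script name below is a placeholder, not the actual file name; check the recipe's README for the real entry point:

```console
python3 download_models.py  # hypothetical name; see the recipe README
```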
Step 5: Run the captioning pipeline

Cheetah Streaming Speech-to-Text streams partial captions in the source language. Phrases are shown as captions and also fed to Zebra Translate for translation. Both run in the same process, on the same audio frames, on the same machine.
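A condensed sketch of that loop: the pvcheetah and PvRecorder calls follow Picovoice's published Python APIs, while the pvzebra module name and its create/translate calls are assumptions for illustration only; consult the recipe source for the exact Zebra Translate API.

```python
import pvcheetah
import pvzebra  # ASSUMED module name; see the Zebra Translate docs for the real one
from pvrecorder import PvRecorder

ACCESS_KEY = "${ACCESS_KEY}"  # from Picovoice Console

# Streaming STT: emits partial transcripts frame by frame, with punctuation.
cheetah = pvcheetah.create(
    access_key=ACCESS_KEY,
    enable_automatic_punctuation=True)

# Hypothetical Zebra handle for one language pair (e.g., English -> French).
zebra = pvzebra.create(access_key=ACCESS_KEY, source_language="en", target_language="fr")

recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

phrase = ""
try:
    while True:
        # One mic frame feeds both stages: caption now, translate at phrase end.
        partial, is_endpoint = cheetah.process(recorder.read())
        phrase += partial
        print(partial, end="", flush=True)  # source-language caption
        if is_endpoint:
            phrase += cheetah.flush()  # finalize the phrase
            print("\n>> " + zebra.translate(phrase))  # translated caption
            phrase = ""
except KeyboardInterrupt:
    pass
finally:
    recorder.stop()
    recorder.delete()
    cheetah.delete()
```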
Have questions or looking for implementations in other languages? Visit the Live Captioning & Translation recipe in the pico-cookbook repository on GitHub, where you can find the code and open an issue for demo-related technical questions.
Frequently asked questions

FAQ

What is live captioning?
Live captioning is the real-time conversion of spoken audio into on-screen text, generated continuously as someone speaks. Unlike offline transcription, which produces a finished transcript after the audio ends, live captioning streams partial captions within a fraction of a second of each word being spoken. Live captioning is used in broadcast TV, video conferencing, accessibility tools, in-vehicle assistants, and customer support apps.
What is live translation captioning?
Live translation captioning combines live captioning with machine translation: a speaker's words are transcribed in their language and immediately translated into one or more target languages, with both the source and translated captions appearing on screen in real time. This recipe builds exactly that — the streaming STT engine produces source-language captions, and the translation engine converts them into the target language as the speaker continues.
What is CART captioning?
CART stands for Communication Access Realtime Translation. It is a professional live captioning service for deaf and hard-of-hearing individuals, commonly used in legal, educational, and medical settings. Automated CART replaces the human stenographer with a speech recognition engine.
How is this different from Microsoft Live Captions or Apple Live Captions?
Microsoft Live Captions and Apple Live Captions are excellent on-device features for Windows/Copilot+ PCs and Apple devices respectively, but they ship as OS-level features, not licensable SDKs. If you are building a product that needs to run on Android, embedded hardware, or in a web browser, those options do not apply. Picovoice exposes the same on-device capability as a cross-platform SDK that any developer or OEM can integrate.
Does the captioning work offline?
Yes. Cheetah Streaming Speech-to-Text and Zebra Translate both run 100% on-device. Audio is processed locally and never sent to any server.
What languages are supported?
Cheetah Streaming Speech-to-Text currently supports English, French, German, Italian, Portuguese, and Spanish, and Zebra Translate supports a wide set of language pairs across English, French, German, Korean, Japanese, Italian, Spanish, and Portuguese. For enterprises with specific needs, Picovoice offers custom model training for language pairs not in the standard catalog. For the up-to-date language list, see the Cheetah Streaming Speech-to-Text and Zebra Translate product pages.
Can I customize the transcription with industry vocabulary?
Yes. Cheetah Streaming Speech-to-Text supports custom vocabulary: add brand names, jargon, technical terms, and proper nouns that matter for your domain. (Orca, Picovoice's text-to-speech engine, separately supports custom pronunciations.)
How does this compare to running Whisper for live captioning?
Whisper is excellent for offline transcription, but was not designed for streaming. Whisper.cpp's streaming mode emits words at roughly 1.2–2.0 second latency vs Cheetah's 590 ms, uses substantially more compute, and ships base models in the 70–290 MB range vs Cheetah's 34 MB. For live captioning under tight latency and memory budgets, purpose-built Cheetah is a better choice.
How can I get technical support for the live captioning and translation recipe?
Open an issue on the pico-cookbook GitHub repository, where the recipe's source code lives; demo-related technical questions are answered there.