Open-Source Translation Benchmark
Zebra Translate matches Helsinki-NLP/opus-mt's BLEU scores across language pairs while translating 2.4× faster and using only 18% of the RAM (approximately 80–95MB versus 383–572MB). Zebra delivers high translation accuracy on-device, with a fraction of the compute and memory footprint of the leading open-source alternative.
On-Device AI-powered Translation: Open-Source Benchmark
Machine translation (MT), also known as AI-powered translation, is the task of automatically converting text from one language to another. On-device machine translation enables privacy-preserving, low-latency translation without sending data to external APIs. This open-source translation benchmark evaluates Zebra Translate and Helsinki-NLP/opus-mt on the metrics that matter most for production on-device translation:
- Accuracy — BLEU scores across five language pairs
- Speed — words translated per second
- Resource utilization — peak memory usage (RAM)
Open-Source Translation Benchmark Methodology
The Picovoice Open-Source Translation Benchmark measures translation accuracy and performance by running each engine on parallel sentence pairs from a public dataset and comparing the output against reference translations.
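For concreteness, the sketch below shows how the opus-mt side of such a loop can be driven through the Hugging Face transformers MarianMT API; the model name and sample sentence are illustrative, and the Zebra side (run through the Picovoice SDK) is omitted here.

```python
# Minimal sketch of translating a batch with Helsinki-NLP/opus-mt via
# Hugging Face transformers. The model name and sentence are illustrative.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"  # German-to-English pair
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(sentences):
    # Tokenize the source batch, generate target tokens, and decode to text.
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

hypotheses = translate(["Das Wetter ist heute schön."])
# hypotheses are scored against reference translations (see Metrics below).
```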
Dataset
Sentences are drawn from the Tatoeba-Challenge, a collection of parallel corpora used in the Tatoeba Translation Challenge and in machine translation research.
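As an illustration, a test set of this kind can be loaded as two line-aligned files, one holding source sentences and one holding reference translations; the file layout and paths below are assumptions, not the benchmark's actual loader.

```python
# Hypothetical loader for a line-aligned parallel test set; the
# deu-eng/test.src and deu-eng/test.trg paths are assumed, not verified.
def load_parallel(src_path, trg_path):
    with open(src_path, encoding="utf-8") as src, open(trg_path, encoding="utf-8") as trg:
        return [(s.strip(), t.strip()) for s, t in zip(src, trg)]

pairs = load_parallel("deu-eng/test.src", "deu-eng/test.trg")
sources = [s for s, _ in pairs]
references = [t for _, t in pairs]
```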
Metrics
Accuracy
The accuracy of translation engines is measured using BLEU (Bilingual Evaluation Understudy), the standard metric for machine translation accuracy. BLEU measures the overlap between the engine's output and human reference translations using n-gram precision. Proposed by IBM in 2001, BLEU scores range from 0 to 100 and can be interpreted as follows (see the scoring sketch after the list):
- BLEU Score below 30: Hard to understand translations
- BLEU Score between 30 and 50: Understandable translations
- BLEU Score above 50: Good and fluent translations
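As a concrete example, corpus-level BLEU can be computed with the sacrebleu package, a widely used reference implementation; the hypothesis and reference strings below are illustrative.

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["The weather is nice today."]
references = [["The weather is beautiful today."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100 scale, interpreted as above
```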
Speed
The speed of the translation engines is measured in words per second: the number of output words generated per second of wall-clock time. Higher is better.
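A minimal way to measure this, reusing the translate() helper and sources list from the sketches above, is to time the translation call and divide the output word count by the elapsed wall-clock time:

```python
# Words per second: output word count divided by wall-clock translation time.
import time

start = time.perf_counter()
outputs = translate(sources)
elapsed = time.perf_counter() - start

output_words = sum(len(sentence.split()) for sentence in outputs)
print(f"{output_words / elapsed:.1f} words/second")
```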
Resource Utilization
The resource utilization of translation engines is measured by peak memory (RAM) usage: the maximum RAM an engine consumes while translating. Peak memory is especially critical for mobile and embedded applications, where memory budgets are constrained.
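On Linux, which matches the Ubuntu environment used in this benchmark, the peak resident memory of the current process can be read with the standard-library resource module, as in the sketch below (again reusing translate() and sources from above); note this captures the whole process, so each engine is measured in its own process.

```python
# Peak RAM (resident set size) via the resource module; on Linux,
# ru_maxrss is reported in kilobytes.
import resource

outputs = translate(sources)  # run the engine under measurement

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak RAM: {peak_kb / 1024:.0f} MB")
```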
Results
The Open-Source Translation Benchmark results below are measured on Ubuntu 22.04 with Python 3.10, on a consumer-grade CPU (AMD Ryzen 9 5900X, 12 cores @ 3.70GHz).
Accuracy
Picovoice Zebra and Helsinki-NLP/opus-mt achieve matching BLEU scores above 50 across the German-English, English-French, French-Spanish, and Spanish-Italian language pairs. Both engines fall slightly short of 50 on Italian-German translation.
Overall, both translation engines return good and fluent translations across language pairs.
Speed
Picovoice Zebra translates 2.4× faster than Helsinki-NLP/opus-mt across all language pairs, averaging over 90 words per second versus 35–45 words per second for opus-mt.
Resource Utilization
Peak Memory Usage
Picovoice Zebra uses only 17.7% of the RAM required by Helsinki-NLP/opus-mt, with peak usage of 80–95MB versus 383–572MB, depending on the language pair.