Open-Source Translation Benchmark
Zebra Translate matches Helsinki-NLP/opus-mt's BLEU scores across language pairs while translating 2.4× faster and using only 18% of the RAM (approximately 80–95MB versus 383–572MB). Zebra delivers high translation accuracy on-device, with a fraction of the compute and memory footprint of the leading open-source alternative.
On-Device AI-powered Translation: Open-Source Benchmark
Machine translation (MT), also known as AI-powered translation, is the task of automatically converting text from one language to another. On-device machine translation enables privacy-preserving, low-latency translation without sending data to external APIs. This open-source translation benchmark evaluates Zebra Translate and Helsinki-NLP/opus-mt on the metrics that matter most for production on-device translation:
- Accuracy — BLEU scores across five language pairs
- Speed — words translated per second
- Resource utilization — peak memory usage (RAM)
Open-Source Translation Benchmark Methodology
The Picovoice Open-Source Translation Benchmark measures translation accuracy and performance by running each engine on parallel sentence pairs from a public dataset and comparing the output against reference translations.
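For concreteness, the sketch below shows how the opus-mt side of such a loop can be driven through the Hugging Face transformers MarianMT API; the model name and sample sentence are illustrative, and the Zebra side (run through the Picovoice SDK) is omitted here.

```python
# Minimal sketch of translating a batch with Helsinki-NLP/opus-mt via
# Hugging Face transformers. The model name and sentence are illustrative.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"  # German-to-English pair
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(sentences):
    # Tokenize the source batch, generate target tokens, and decode to text.
    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

hypotheses = translate(["Das Wetter ist heute schön."])
# hypotheses are scored against reference translations (see Metrics below).
```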
Dataset
Sentences are drawn from the Tatoeba-Challenge, a collection of parallel corpora used in the Tatoeba Translation Challenge and in machine translation research.
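As an illustration, a test set of this kind can be loaded as two line-aligned files, one holding source sentences and one holding reference translations; the file layout and paths below are assumptions, not the benchmark's actual loader.

```python
# Hypothetical loader for a line-aligned parallel test set; the
# deu-eng/test.src and deu-eng/test.trg paths are assumed, not verified.
def load_parallel(src_path, trg_path):
    with open(src_path, encoding="utf-8") as src, open(trg_path, encoding="utf-8") as trg:
        return [(s.strip(), t.strip()) for s, t in zip(src, trg)]

pairs = load_parallel("deu-eng/test.src", "deu-eng/test.trg")
sources = [s for s, _ in pairs]
references = [t for _, t in pairs]
```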
Metrics
Accuracy
The accuracy of translation engines is measured using BLEU (Bilingual Evaluation Understudy), the standard metric for machine translation accuracy. BLEU measures the overlap between the engine's output and human reference translations using n-gram precision. Proposed by IBM in 2001, BLEU scores range from 0 to 100 and can be interpreted as follows (see the scoring sketch after the list):
- BLEU Score below 30: Hard to understand translations
- BLEU Score between 30 and 50: Understandable translations
- BLEU Score above 50: Good and fluent translations
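As a concrete example, corpus-level BLEU can be computed with the sacrebleu package, a widely used reference implementation; the hypothesis and reference strings below are illustrative.

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["The weather is nice today."]
references = [["The weather is beautiful today."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100 scale, interpreted as above
```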
Speed
The speed of the translation engines is measured in words per second: the number of output words generated per second of wall-clock time. Higher is better.
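A minimal way to measure this, reusing the translate() helper and sources list from the sketches above, is to time the translation call and divide the output word count by the elapsed wall-clock time:

```python
# Words per second: output word count divided by wall-clock translation time.
import time

start = time.perf_counter()
outputs = translate(sources)
elapsed = time.perf_counter() - start

output_words = sum(len(sentence.split()) for sentence in outputs)
print(f"{output_words / elapsed:.1f} words/second")
```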
Resource Utilization
The resource utilization of translation engines is measured by peak memory (RAM) usage: the maximum RAM an engine consumes while translating. Peak memory is especially critical for mobile and embedded applications, where memory budgets are constrained.
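On Linux, which matches the Ubuntu environment used in this benchmark, the peak resident memory of the current process can be read with the standard-library resource module, as in the sketch below (again reusing translate() and sources from above); note this captures the whole process, so each engine is measured in its own process.

```python
# Peak RAM (resident set size) via the resource module; on Linux,
# ru_maxrss is reported in kilobytes.
import resource

outputs = translate(sources)  # run the engine under measurement

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak RAM: {peak_kb / 1024:.0f} MB")
```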
Results
The Open-Source Translation Benchmark results below are measured on Ubuntu 22.04 with Python 3.10, on a consumer-grade CPU (AMD Ryzen 9 5900X, 12 cores @ 3.70GHz).
Accuracy
Picovoice Zebra and Helsinki-NLP/opus-mt achieve matching BLEU scores above 50 across the German-English, English-French, French-Spanish, and Spanish-Italian language pairs. Both engines fall slightly short of 50 on Italian-German translation.
Overall, both translation engines return good and fluent translations across language pairs.
Speed
Picovoice Zebra translates 2.4× faster than Helsinki-NLP/opus-mt across all language pairs, averaging over 90 words per second versus 35–45 words per second for opus-mt.
Resource Utilization
Peak Memory Usage
Picovoice Zebra uses only 17.7% of the RAM required by Helsinki-NLP/opus-mt, with peak usage of 80–95MB versus 383–572MB, depending on the language pair.