Open-Source Language Identification Benchmark
TLDR: Picovoice Bat Spoken Language Identification achieves 93% accuracy on spoken language identification, ~8 percentage points higher than SpeechBrain, which corresponds to ~2x fewer identification errors (7% vs. 15% miss rate). On efficiency, Bat requires 62x less RAM (5 MB vs. 333 MB), 23x less storage (5 MB vs. 118 MB), and 9x less CPU (0.4 vs. 3.9 core-hours) than SpeechBrain.
Spoken Language Identification in Real Time: Open-Source Benchmark
Language Identification (LID), or Language Detection, is the task of automatically detecting which language is being used in text or audio input. Spoken LID focuses on audio streams and files. It is a foundational component in multilingual voice AI systems, used to route calls to the correct language-specific speech recognizer, adapt voice assistants to the user's language in real time, and enable automatic transcription of multilingual audio streams.
This open-source spoken language identification benchmark evaluates popular spoken LID engines, Picovoice Bat and SpeechBrain, using the metrics that matter most for production streaming audio applications:
- Accuracy: measured as the percentage of correctly identified languages.
- Resource utilization: measured in compute efficiency (CPU core-hours), memory footprint, and model size.
Open-Source Language Identification Benchmark Methodology
The Picovoice Open-Source Language Identification Benchmark feeds audio samples from multiple languages into each engine and compares the highest-scored language prediction against the ground truth label.
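The evaluation loop described above can be sketched as follows. The `engine` callable (returning a score per language) and the sample format are illustrative assumptions for the sketch, not the actual Bat or SpeechBrain APIs.

```python
# Sketch of the benchmark loop: each engine returns a score per language,
# the highest-scored language is taken as the prediction, and it is
# compared against the ground truth label. The `engine` callable and the
# (audio, truth) sample format are illustrative assumptions.

def top_prediction(scores: dict) -> str:
    """Return the language with the highest score."""
    return max(scores, key=scores.get)

def run_benchmark(engine, samples) -> float:
    """samples: list of (audio, ground_truth_language) pairs."""
    correct = 0
    for audio, truth in samples:
        if top_prediction(engine(audio)) == truth:
            correct += 1
    return correct / len(samples)

def toy_engine(audio):
    """Stand-in engine that always favors English, for illustration only."""
    return {"en": 0.9, "de": 0.1}
```

With the toy engine above, a sample set of two English clips and one German clip yields an accuracy of 2/3.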
Open-Set Evaluation
Some language identification benchmarks use a closed-set protocol, evaluating engines only on the languages that they’re trained to recognize. This may overstate real-world accuracy as production systems encounter speech in any language, not just the ones in the training set.
Picovoice’s Open-Source Language Identification Benchmark uses an open-set evaluation protocol and includes test data audio from languages outside of the supported languages. If an engine correctly returns "unknown", the benchmark counts that response as correct, rewarding engines that handle out-of-distribution input gracefully.
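Under this protocol, the scoring rule can be sketched as a small function; the function and parameter names are illustrative, not taken from the benchmark code.

```python
# Open-set scoring sketch: if the true language is outside the engine's
# supported set, the only correct answer is "unknown". Names are
# illustrative, not from the actual benchmark harness.

def is_correct(prediction: str, truth: str, supported: set) -> bool:
    if truth in supported:
        return prediction == truth
    return prediction == "unknown"
```

For example, an engine supporting only French and English is credited for answering "unknown" on a Swahili clip, and penalized for guessing a supported language instead.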
Dataset
Audio samples are drawn from the VoxLingua107 dataset, a publicly available corpus covering 107 languages collected from YouTube. VoxLingua107 is the most widely used dataset in spoken language identification research, enabling direct comparison with published academic results.
Metrics
Accuracy — higher is better
Accuracy measures the percentage of correct language predictions over all inferences: Accuracy = (Correct Predictions / Total Inferences) × 100%.
Resource Utilization — lower is better
- CPU Core Hour Ratio: CPU Core Hour Ratio refers to the number of CPU core-hours required to process one hour of audio. One core-hour equals one CPU core running at 100% utilization for one hour.
- CPU Core Hour Ratio < 1.0: A ratio below 1.0 means the engine processes faster than real-time while consuming less than one full core.
- CPU Core Hour Ratio = 1.0: A ratio of 1.0 means the engine fully occupies one CPU core just to keep up with real-time processing, leaving no headroom for other application logic.
- CPU Core Hour Ratio > 1.0: A ratio above 1.0 means a single CPU core cannot keep up with real-time processing; the engine needs more than one core to avoid falling behind.
- Peak Memory (RAM) Usage: Peak Memory (RAM) Usage shows the maximum RAM consumed by the spoken language identification engine during processing. It is critical, especially for mobile and embedded applications, where memory budgets are constrained.
- Model Size: Model Size refers to the total binary file size required to initialize the engine. Model size affects application download size and storage requirements, which is particularly important for over-the-air updates and for mobile and web applications, where lean binaries improve the user's first experience.
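The three resource metrics above can be approximated on Linux with the standard library alone; the sketch below mirrors what each metric means rather than the actual benchmark harness, and the processing function and model path are placeholders.

```python
# Sketch of the three resource metrics, assuming Linux and the Python
# standard library. The `process_fn` callable and model path are
# illustrative placeholders, not the real benchmark harness.
import os
import resource
import time

def cpu_core_hour_ratio(process_fn, audio_hours: float) -> float:
    """CPU core-hours consumed per hour of audio processed."""
    start = time.process_time()  # process-wide CPU time, in seconds
    process_fn()
    cpu_seconds = time.process_time() - start
    return (cpu_seconds / 3600.0) / audio_hours

def peak_rss_mb() -> float:
    """Peak resident set size of this process (Linux reports KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def model_size_mb(path: str) -> float:
    """On-disk size of a model file, in MB."""
    return os.path.getsize(path) / 1e6
```

Note that `ru_maxrss` is reported in kibibytes on Linux but in bytes on macOS, so the conversion above assumes the Linux convention.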
Results
The Open-Source Language Identification Benchmark results below are measured on Ubuntu 22.04, Python 3.10, AMD Ryzen 9 5900X (12 cores @ 3.70GHz), CPU-only.
Accuracy — higher is better
Picovoice Bat Spoken Language Identification achieves 92.9% accuracy versus SpeechBrain Language Identification at 85.0%, a 7.9 percentage point advantage. Equivalently, Bat makes ~2x fewer identification errors (7.1% vs. 15.0% miss rate).
Computational Efficiency — lower is better
CPU Core Hour Ratio
CPU Core Hour Ratio measures the compute cost required to process one hour of audio. Picovoice Bat requires 0.44 core-hours, 8.9× less than SpeechBrain’s 3.90 core-hours. To put this in perspective: to process one hour of audio, Bat occupies one CPU core for 26.4 minutes, whereas SpeechBrain occupies one core for 3 hours and 54 minutes (234 minutes). For real-time processing, Bat utilizes 44% of a single core, or 11% of 4 cores, while SpeechBrain needs nearly four full CPU cores.
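The arithmetic behind these comparisons follows directly from the measured ratios; the helper names below are illustrative.

```python
# Worked arithmetic behind the core-hour comparison, using the measured
# ratios from the benchmark. Helper names are illustrative.
BAT_RATIO = 0.44          # core-hours per hour of audio
SPEECHBRAIN_RATIO = 3.90  # core-hours per hour of audio

def minutes_per_audio_hour(ratio: float) -> float:
    """Single-core minutes needed to process one hour of audio."""
    return ratio * 60.0

def core_utilization(ratio: float, cores: int) -> float:
    """Fraction of the given core budget used for real-time processing."""
    return ratio / cores
```

For example, `minutes_per_audio_hour(0.44)` gives 26.4 minutes, `minutes_per_audio_hour(3.90)` gives 234 minutes, and the ratio of the two engines' costs is 3.90 / 0.44 ≈ 8.9.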
Peak Memory Usage — lower is better
Peak memory usage shows the maximum memory consumption during language detection.
Picovoice Bat Spoken Language Identification consumes only 5.4 MB of memory at peak, making it a great fit for any deployment, including embedded. SpeechBrain Language Identification, on the other hand, requires 333.4 MB of RAM, 62x more than Bat. Since language identification is only a small part of a larger voice AI pipeline, this RAM requirement exceeds the headroom of standard embedded and low-power mobile platforms.
Model Size — lower is better
Model size reflects the storage footprint for deployment. Picovoice Bat's 5.18 MB model makes it well-suited for over-the-air updates, mobile devices, and compute-constrained environments; SpeechBrain's 117.57 MB model is 23× larger.
Usage
The data and code used to create this benchmark are available on GitHub under the Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: