Speech-to-Text Benchmark

Automatic speech recognition (ASR) is the core building block of most voice applications, to the point that practitioners use speech-to-text (STT) and speech recognition interchangeably. ASR systems achieving state-of-the-art accuracy often run in the cloud. Amazon Transcribe, Azure Speech-to-Text, Google Speech-to-Text, and IBM Watson Speech-to-Text are the current dominant transcription API providers.

STT’s reliance on the cloud makes it costly, less reliable, and laggy. On-device ASR can be orders of magnitude more cost-effective than its API counterparts. Additionally, offline ASR is inherently reliable and real-time because it removes the variable delay introduced by network connectivity. Running an ASR engine offline without sacrificing accuracy is challenging: common approaches to audio transcription rely on massive graphs for language modelling and compute-intensive neural networks for acoustic modelling. Picovoice’s Leopard Speech-to-Text engine takes a different approach to achieve cloud-level accuracy while running offline on commodity hardware such as a Raspberry Pi.

Below is a series of benchmarks to back our claims. They also empower customers to make data-driven decisions using the datasets that matter to their business.

The real-time transcription benchmark is also available if you’re interested in evaluating the performance of Cheetah Streaming Speech-to-Text.

Speech-to-Text Benchmark Languages

  • English Speech-to-Text Benchmark
  • French Speech-to-Text Benchmark
  • German Speech-to-Text Benchmark
  • Spanish Speech-to-Text Benchmark
  • Italian Speech-to-Text Benchmark
  • Portuguese Speech-to-Text Benchmark

Speech-to-Text Benchmark Metrics

Word Error Rate (WER)

Word error rate (WER) is the edit distance between the words in a reference transcript and the words in the speech-to-text engine's output, divided by the number of words in the reference transcript. In other words, WER is the ratio of errors in a transcript to the total number of words spoken. Despite its limitations, WER is the most commonly used metric for measuring speech-to-text accuracy. A lower WER (fewer errors) means better accuracy in recognizing speech.
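
For illustration, here is a minimal Python sketch that computes WER as defined above using a word-level Levenshtein edit distance. The sentences are made-up examples, not drawn from the benchmark datasets.

```python
def edit_distance(ref_words, hyp_words):
    # Word-level Levenshtein distance: substitutions, insertions, deletions.
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution (or match)
            )
    return d[len(ref_words)][len(hyp_words)]


reference = "the quick brown fox jumps over the lazy dog".split()
hypothesis = "the quick brown fox jumped over a lazy dog".split()

# WER = edit distance / number of words in the reference transcript
wer = edit_distance(reference, hypothesis) / len(reference)
print(f"WER: {wer:.2%}")  # 2 errors over 9 reference words -> ~22.22%
```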

Core-Hour

The Core-Hour metric measures the computational efficiency of a speech-to-text engine: the number of CPU core-hours required to process one hour of audio. A speech-to-text engine with a lower Core-Hour is more computationally efficient. We omit this metric for cloud-based engines.
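
As a sketch of how such a figure can be derived from a measurement (the function name and the numbers below are hypothetical, for illustration only, not measured results):

```python
def core_hours_per_audio_hour(processing_time_sec, num_cores, audio_duration_sec):
    # Core-Hour = (wall-clock processing time x cores used) / audio duration.
    return (processing_time_sec * num_cores) / audio_duration_sec


# Hypothetical example: 2.6 hours of audio processed in 6 minutes on 10 cores.
print(core_hours_per_audio_hour(6 * 60, 10, 2.6 * 3600))  # ~0.38 core-hours per audio hour
```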

English Speech-to-Text Benchmark

English Speech Corpus

We use the following datasets for benchmarks:

  • LibriSpeech test-clean
  • LibriSpeech test-other
  • Common Voice test
  • TED-LIUM test

Results

Accuracy

The figure below shows the accuracy of each engine averaged over all datasets.

English Speech-to-Text Accuracy Comparison

Core-Hour

The figure below shows the resource requirement of each engine.

English Speech-to-Text Core-Hour Comparison

Please note that we ran the benchmark across the entire TED-LIUM dataset on an Ubuntu 22.04 machine with an AMD Ryzen 9 5900X CPU (12 cores @ 3.70GHz), 64 GB of RAM, and NVMe storage, using 10 cores simultaneously, and recorded the processing time to obtain the results shown in the figure above. Core-Hour depends on the dataset and platform, but one can expect the same ratio among engines when everything else is equal. For example, Whisper Tiny requires about 3x more compute, or takes about 3x longer, than Picovoice Leopard.

French Speech-to-Text Benchmark

French Speech Corpus

We use the following datasets for benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

Results

Accuracy

The figure below shows the accuracy of each engine averaged over all datasets.

French Speech-to-Text Accuracy Comparison

German Speech-to-Text Benchmark

German Speech Corpus

We use the following datasets for benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

Results

Accuracy

The figure below shows the accuracy of each engine averaged over all datasets.

German Speech-to-Text Accuracy Comparison

Spanish Speech-to-Text Benchmark

Spanish Speech Corpus

We use the following datasets for benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

Results

Accuracy

The figure below shows the accuracy of each engine averaged over all datasets.

Spanish Speech-to-Text Accuracy Comparison

Italian Speech-to-Text Benchmark

Italian Speech Corpus

We use the following datasets for benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test
  • VoxPopuli test

Results

Accuracy

The figure below shows the accuracy of each engine averaged over all datasets.

Italian Speech-to-Text Accuracy Comparison

Portuguese Speech-to-Text Benchmark

Portuguese Speech Corpus

We use the following datasets for benchmarks:

  • Multilingual LibriSpeech test
  • Common Voice test

Results

Accuracy

The figure below shows the accuracy of each engine averaged over all datasets.

Portuguese Speech-to-Text Accuracy Comparison

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents; a minimal transcription-and-scoring sketch follows the list:

  • AWS Transcribe accuracy
  • Azure Speech-to-Text accuracy
  • Google Speech-to-Text accuracy
  • Google Speech-to-Text (Enhanced) accuracy - English Only
  • IBM Watson Speech-to-Text accuracy - English Only
  • Picovoice Leopard Speech-to-Text accuracy
  • Whisper Speech-to-Text accuracy
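
For readers who want to reproduce a single data point before running the full benchmark, the sketch below transcribes one file with the pvleopard Python SDK (pip install pvleopard) and scores it with the edit_distance helper from the metrics section above. The AccessKey placeholder, the sample.wav path, and the reference transcript are hypothetical; follow the engine-specific documents above for the actual benchmarking procedure.

```python
import pvleopard

# AccessKey obtained from the Picovoice Console; placeholder value here.
leopard = pvleopard.create(access_key='${ACCESS_KEY}')

try:
    # Recent versions of the Python SDK return the transcript plus word metadata;
    # older versions return only the transcript string.
    transcript, words = leopard.process_file('sample.wav')  # hypothetical file path
    print(transcript)

    # Score against a known reference using the edit_distance helper defined earlier.
    reference = "example reference transcript".split()
    wer = edit_distance(reference, transcript.lower().split()) / len(reference)
    print(f"WER: {wer:.2%}")
finally:
    leopard.delete()
```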
