Text-to-Speech Latency Benchmark

In the era of large language models (LLMs), Text-to-Speech (TTS) engines are in high demand for use in voice assistants. One of the most critical factors in making conversations with AI agents feel natural and seamless is the response time. A responsive TTS system is essential for delivering a smooth conversational experience, ensuring that the AI's responses are conveyed promptly and without noticeable delays.

This benchmark evaluates the response times of different TTS engines when used in LLM-based voice assistants.

Picovoice Orca Streaming Text-to-Speech is compared against well-known TTS providers:

  • Amazon Polly
  • Azure Text-to-Speech
  • ElevenLabs
  • OpenAI TTS

Methodology

This benchmark simulates interactions between a user and a voice assistant by generating LLM responses to user questions and synthesizing each response to speech as soon as possible. We sample user queries from a public dataset and feed them to ChatGPT (gpt-3.5-turbo) via the OpenAI Chat Completions API. ChatGPT generates responses token by token, and the tokens are passed to the different text-to-speech (TTS) engines to compare their response times.
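The token-by-token flow described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the LLM call is replaced by a stub generator emitting canned tokens at a fixed rate (a real run would iterate over a streaming Chat Completions response), and `tts_synthesize` is a hypothetical callback standing in for an engine's synthesis call.

```python
import time

def stream_llm_tokens(prompt):
    """Stand-in for the streaming Chat Completions call. With the
    `openai` package the loop would iterate over
    client.chat.completions.create(model="gpt-3.5-turbo",
    messages=[...], stream=True) and yield each chunk's delta content.
    Here we just emit canned tokens at a fixed rate."""
    del prompt  # unused in this stub
    for token in ["The", " flight", " departs", " at", " 9", " AM", "."]:
        time.sleep(0.02)  # simulated token-generation delay
        yield token

def run_interaction(prompt, tts_synthesize):
    """Forward each LLM token to the TTS engine as soon as it arrives,
    accumulating whatever audio bytes the engine returns."""
    audio = b""
    for token in stream_llm_tokens(prompt):
        audio += tts_synthesize(token)
    return audio
```

Engines that cannot consume partial text would buffer these tokens before synthesis, which is exactly where the response-time differences measured below come from.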

Engines

All TTS engines listed above support streaming audio output. In addition, ElevenLabs supports streaming text input via a WebSocket API, which works by chunking the text at punctuation marks and sending pre-analyzed text chunks to the engine. Orca Streaming Text-to-Speech supports input text streaming without relying on such language markers: it can handle raw LLM tokens as soon as they are produced.
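The punctuation-based chunking mentioned above can be sketched in a few lines. This is an illustrative approximation, not any engine's actual API: the punctuation set and function names are this sketch's own.

```python
import re

# Illustrative set of punctuation marks used as chunk boundaries.
CHUNK_BOUNDARY = re.compile(r"[.!?;:,]")

def chunk_at_punctuation(token_stream):
    """Buffer streamed LLM tokens and release a text chunk whenever a
    punctuation mark arrives, approximating how engines that require
    pre-analyzed text are fed in a streaming setting."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if CHUNK_BOUNDARY.search(token):
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # flush any trailing text without punctuation
```

The cost of this approach is visible in the results: the first chunk cannot be synthesized until the first punctuation mark appears, whereas a token-streaming engine can start immediately.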

Dataset

The public taskmaster2 dataset contains text data of goal-oriented conversations between a user and an assistant. We randomly select user questions from these example conversations and use them as input to the LLM. The topics of the user queries are diverse, including flight booking, food ordering, hotel booking, movie and music recommendations, restaurant search, and sports. The LLM is prompted to answer the questions like a helpful voice assistant, simulating real-world interactions between a user and an AI agent. The responses of the LLM vary in length from a few words to a few sentences, covering a wide range of realistic responses.

Metrics

Response times are typically measured with the time-to-first-byte metric: the time from the moment a request is sent until the first byte of the response is received.

In the context of voice assistants, we care about the time it takes for the assistant to respond to the user. For LLM-based voice assistants we define:

  • Voice Assistant Response Time (VART): Time taken from the moment the user's request is sent to the LLM, until the TTS engine produces the first byte of speech.

The VART metric is the sum of the following components:

  • Time to First Token (TTFT): Time taken from the moment the user's request is sent to the LLM until the LLM produces its first text token.
  • First Token to Speech (FTTS): Time taken from the moment the LLM produces its first text token until the TTS engine produces the first byte of speech.

The TTFT metric depends on the LLM and, if an API is used, on network latency. The FTTS metric depends on the capabilities of the TTS engine, in particular whether it can handle streaming input text, as well as on the token-generation speed of the LLM. To measure FTTS fairly, the LLM's behavior must therefore be kept constant across all experiments.
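The decomposition VART = TTFT + FTTS can be made concrete with three timestamps per interaction. This is a minimal sketch; the field and function names are this example's own, not the benchmark's.

```python
from dataclasses import dataclass

@dataclass
class InteractionTimestamps:
    """Wall-clock timestamps (seconds) captured during one simulated
    interaction. Field names are illustrative."""
    request_sent: float      # user's request is sent to the LLM
    first_llm_token: float   # LLM emits its first text token
    first_audio_byte: float  # TTS engine emits its first byte of speech

def ttft(t: InteractionTimestamps) -> float:
    return t.first_llm_token - t.request_sent

def ftts(t: InteractionTimestamps) -> float:
    return t.first_audio_byte - t.first_llm_token

def vart(t: InteractionTimestamps) -> float:
    # By construction, vart(t) == ttft(t) + ftts(t)
    return t.first_audio_byte - t.request_sent
```

Averaging `ftts` over many interactions, with the same LLM and sampling settings throughout, isolates the TTS engine's contribution to the perceived delay.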

We believe FTTS is the most appropriate metric for a TTS engine's response time in the context of voice assistants, since it comes closest to human behavior: a person can start reading a response as soon as its first token appears.

Results

The figures below show the response times of each engine, averaged over roughly 200 simulated user-assistant interactions.

First Token to Speech

(Figure: average First Token to Speech per engine)

Voice Assistant Response Time

(Figure: average Voice Assistant Response Time per engine)

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:

  • Amazon Polly
  • Azure TTS
  • ElevenLabs
  • OpenAI TTS
  • Picovoice Orca
