On-device AI voice agents and assistants

Build an on-device AI voice agent for customer service, healthcare, and enterprise productivity.

A voice agent powered by wake word detection, streaming speech recognition, an on-device large language model, and streaming text-to-speech. Runs entirely on the device across mobile, embedded, desktop, and browser with no cloud processing.

Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi
How the on-device AI voice agent works

Four on-device AI SDKs to run the entire speech-to-speech LLM agent pipeline

The on-device AI voice agent listens for a wake word, transcribes the user's speech in real time, generates a response using a local large language model, and speaks the answer back, all without sending any data to a cloud service. Picovoice's Porcupine Wake Word, Cheetah Streaming Speech-to-Text, picoLLM on-device LLM, and Orca Streaming Text-to-Speech compose into a single pipeline.
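The four SDKs chain together in a single loop. Below is a minimal Python sketch of that pipeline using the real Picovoice packages (pvporcupine, pvcheetah, picollm, pvorca, pvrecorder); the model file names and the AccessKey placeholder are assumptions for illustration, and playback is left as a stub. See the GitHub recipe for the complete, supported implementation.

```python
# Sketch: wake word -> streaming STT -> local LLM -> streaming TTS, all on-device.
# File paths ("jarvis.ppn", "phi-2.pllm") are placeholders; download real models
# from Picovoice Console.
import pvporcupine
import pvcheetah
import picollm
import pvorca
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"  # obtained from Picovoice Console

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keyword_paths=["jarvis.ppn"])
cheetah = pvcheetah.create(access_key=ACCESS_KEY)
pllm = picollm.create(access_key=ACCESS_KEY, model_path="phi-2.pllm")
orca = pvorca.create(access_key=ACCESS_KEY)

recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

while True:
    # 1. Block until the wake word is detected.
    while porcupine.process(recorder.read()) < 0:
        pass

    # 2. Stream audio into Cheetah until it detects an endpoint (end of speech).
    transcript = ""
    while True:
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript += partial
        if is_endpoint:
            transcript += cheetah.flush()
            break

    # 3. Generate a response locally, feeding tokens to Orca as they arrive,
    #    so speech starts before generation finishes.
    stream = orca.stream_open()

    def on_token(token):
        pcm = stream.synthesize(token)
        if pcm is not None:
            pass  # hand pcm to PvSpeaker for playback (omitted in this sketch)

    pllm.generate(prompt=transcript, stream_callback=on_token)
    stream.flush()
    stream.close()
```

In practice the raw transcript would be wrapped in the model's chat template (picoLLM's dialog helper does this) rather than passed as a bare prompt.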

Pipeline diagram: The user says "Jarvis, what are the side effects of ibuprofen?" → Porcupine (wake word) detects the wake word → Cheetah (streaming STT) produces the source transcript → picoLLM (on-device LLM) generates the response → Orca (streaming TTS) speaks "Common side effects..." The cycle repeats for each conversation turn.
Why Porcupine Wake Word?

Always-on, hands-free activation with minimal CPU usage.

3.8%
Single-Core CPU Utilization on Raspberry Pi 3
97.1%
Accuracy at 1 false alarm per 10 hours
~250K
Custom wake words trained and deployed in 2025

Porcupine Wake Word enables always-on, hands-free activation for the voice agent. It listens continuously with minimal CPU, and therefore battery, usage, so the device stays ready without draining resources. When the user says the wake word, Porcupine immediately interrupts any in-progress LLM response and hands control to the speech-to-text engine. Enterprises can train branded wake words in seconds using the Picovoice Console and deploy them across all supported platforms.
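An always-on detection loop is only a few lines. The sketch below uses the pvporcupine and pvrecorder packages; the .ppn file name is a placeholder for a custom wake word trained in Picovoice Console, and the barge-in behavior is indicated as a comment rather than implemented.

```python
# Minimal always-on wake word loop. The keyword path is a placeholder for a
# custom model trained and downloaded from Picovoice Console.
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(
    access_key="${YOUR_ACCESS_KEY}",
    keyword_paths=["my-brand_wake-word.ppn"],
)
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        # process() returns the index of the detected keyword, or -1 if none.
        if porcupine.process(recorder.read()) >= 0:
            # Barge-in point: stop any in-progress TTS playback or LLM
            # generation here, then hand the microphone stream to Cheetah.
            print("wake word detected")
finally:
    recorder.delete()
    porcupine.delete()
```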

Wake Word Detection Accuracy
Higher is better
Porcupine: 97.1%
Snowboy: 68%
PocketSphinx: 52%
CPU Utilization
Lower is better
Porcupine: 3.8%
Snowboy: 24.8%
PocketSphinx: 31.8%
Why Cheetah Streaming Speech-to-Text?

Lowest latency. Lowest compute. No accuracy tradeoff.

10.1%
WER (English) vs. 11.9% for Google and 10.6% for Moonshine Medium
0.08
CPU Core-Hours vs. 3.36 for Moonshine Medium (40x less)
8.6%
WER (Spanish) vs. 11.6% for Google and 9.4% for Azure

Cheetah transcribes the user's question in real time as they speak, streaming words with an average emission latency of 590 ms. Cheetah matches or beats cloud STT APIs on accuracy and supports custom vocabulary for industry-specific terms: product names in retail, drug names in clinical settings, part numbers in manufacturing. Cheetah requires less compute than any other local engine tested, running efficiently alongside picoLLM and Orca in the same session.
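Streaming transcription with endpoint detection looks like this in Python with the pvcheetah and pvrecorder packages. The endpoint duration value is an illustrative choice, not a recommendation; custom vocabulary is handled by training a custom Cheetah model in Picovoice Console and passing its path at creation time.

```python
# Stream microphone audio into Cheetah; words print as they are spoken,
# and the loop ends when Cheetah detects an endpoint (end of utterance).
import pvcheetah
from pvrecorder import PvRecorder

cheetah = pvcheetah.create(
    access_key="${YOUR_ACCESS_KEY}",
    endpoint_duration_sec=1.0,          # silence that marks end of utterance
    enable_automatic_punctuation=True,
)
recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

transcript = ""
while True:
    partial, is_endpoint = cheetah.process(recorder.read())
    print(partial, end="", flush=True)  # emitted words appear in real time
    transcript += partial
    if is_endpoint:
        transcript += cheetah.flush()   # remaining words after the endpoint
        break

recorder.delete()
cheetah.delete()
```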

English Word Error Rate
Lower is better
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate
Lower is better
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why picoLLM?

Local LLM reasoning with no accuracy tradeoff.

99.9%
Accuracy retained at 3-bit vs. 83.1% for GPTQ for Llama-3-8b
94.5%
Accuracy retained at 2-bit vs. 38.7% for GPTQ for Llama-3-8b
Any
Any transformer architecture on any platform

picoLLM runs custom-trained and open-weight language models (Llama, Gemma, Phi, Mistral) locally on CPU or GPU with no cloud dependency. picoCompression quantizes these models to run on phones, browsers, and embedded boards while preserving task accuracy. picoLLM's minimal memory footprint allows it to run alongside Cheetah and Orca in a single session, with no network latency or privacy exposure.
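Local generation with the picollm Python package is a few calls: create the engine from a .pllm file, track conversation turns with the dialog helper, and stream tokens via a callback. The model file name below is an assumption; download the actual .pllm file for your target device from Picovoice Console.

```python
# Multi-turn local LLM generation with streaming token output.
# The model path is a placeholder; fetch a real .pllm from Picovoice Console.
import picollm

pllm = picollm.create(
    access_key="${YOUR_ACCESS_KEY}",
    model_path="llama-3-8b-instruct.pllm",
)

# The dialog helper formats turns in the model's own chat template.
dialog = pllm.get_dialog()
dialog.add_human_request("What are the side effects of ibuprofen?")

res = pllm.generate(
    prompt=dialog.prompt(),
    completion_token_limit=256,
    # Tokens arrive as they are generated; in the voice agent this callback
    # feeds Orca so speech starts before generation completes.
    stream_callback=lambda token: print(token, end="", flush=True),
)
dialog.add_llm_response(res.completion)  # keep history for the next turn

pllm.release()
```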

3-bit Quantized Llama-3-8b MMLU
Higher is better
Float16 (Original Model): 64.9
picoLLM: 64.8
GPTQ: 53.9
2-bit Quantized Llama-3-8b MMLU
Higher is better
Float16 (Original Model): 64.9
picoLLM: 61.3
GPTQ: 25.1
Why Orca Streaming Text-to-Speech?

Natural-sounding TTS at 29 MB peak memory.

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model Size

Orca speaks the LLM's response as tokens stream in, so the user hears the answer before generation is complete. Orca achieves first-token-to-speech latency of 130 ms and a peak memory footprint of 29 MB, allowing it to run alongside Porcupine, Cheetah, and picoLLM with no performance issues. The result is a natural-sounding conversational voice agent with no perceptible pause between the user's question and the spoken answer.
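Incremental synthesis uses Orca's stream API: open a stream, feed it text chunks as the LLM emits them, and play whatever PCM comes back. The sketch below pairs the pvorca package with PvSpeaker for playback; the hard-coded token list stands in for a live LLM token stream.

```python
# Stream text into Orca and play synthesized audio as it becomes available.
# The token list is a stand-in for tokens arriving from picoLLM's callback.
import pvorca
from pvspeaker import PvSpeaker

orca = pvorca.create(access_key="${YOUR_ACCESS_KEY}")
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
speaker.start()

stream = orca.stream_open()
for token in ["Common ", "side ", "effects ", "include..."]:
    pcm = stream.synthesize(token)   # may return None until enough text buffers
    if pcm is not None:
        speaker.write(pcm)

pcm = stream.flush()                 # synthesize any remaining buffered text
if pcm is not None:
    speaker.write(pcm)
speaker.flush()

stream.close()
speaker.delete()
orca.delete()
```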

TTS Latency
Lower is better
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Audio Quality
Listen and compare samples, grouped by peak memory usage.
Peak Memory Usage < 30 MB: ESpeak, Orca
On-device AI voice agent use cases

From customer service kiosks to IT help desks

Customer Service

Voice agents for customer support and troubleshooting

LLM voice AI agents can handle commonly asked questions, such as "How do I connect to the office printer?", instantly, without waiting in a ticket queue. The on-device AI voice agent resolves routine queries from an internal knowledge base, keeping employee interactions off third-party servers.

Retail & Banking

On-device voice assistants for kiosks and in-store self-service

Retailers, banks, and telcos can deploy conversational voice agents on kiosks, POS terminals, and in-store displays that answer customer questions, look up information, and guide users through processes. The on-device stack has no dependency on network availability in stores, branches, or service centers.

Healthcare

Clinical voice agents that are HIPAA-compliant

Clinicians can query drug interactions, dosing guidelines, or documentation templates by voice during patient encounters. Running the voice agent on-device eliminates the need for a BAA covering voice data with a cloud LLM provider. No audio or text is ever transmitted to any third-party service.

Automotive

In-vehicle voice AI companions that work everywhere

Automakers and fleet operators can embed voice agents that let drivers control navigation, query vehicle diagnostics, or dictate messages without taking their hands off the wheel. The on-device stack eliminates dead zones on rural highways or in parking garages where cellular connectivity drops.

Get started

On-device AI voice agent in 5 steps: Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · llm-voice-AI-agent-assistant
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi

Prerequisites

A Picovoice AccessKey from the Picovoice Console and a clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/call-assist/python.
1

Create a virtual environment

Isolate the recipe's dependencies from your system Python.
2

Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
Linux, macOS, or Raspberry Pi
Windows
3

Install dependencies

Install the Porcupine, Cheetah, picoLLM, Orca, PvRecorder, and PvSpeaker Python SDKs.
4

Download an LLM

Open the Picovoice Console, go to picoLLM, and download a .pllm model file for your target device. Choose a more heavily compressed model for hardware-constrained devices, such as Raspberry Pi.
5

Train the Wake Word

In Picovoice Console, go to Porcupine Wake Word, enter your wake phrase, train, and download the .ppn file for your target platform.
6

Run the AI voice agent demo

Pass your AccessKey and model paths, then run the demo. The demo opens the microphone and speaker and runs the on-device voice agent pipeline locally.
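Taken together, the six steps above look roughly like the following on Linux, macOS, or Raspberry Pi. The demo script name and its flags are assumptions for illustration; check the recipe's README in the GitHub repo for the exact command line.

```shell
# From recipes/call-assist/python. Script name and flag names below are
# placeholders; consult the recipe README for the exact invocation.
python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -r requirements.txt    # Porcupine, Cheetah, picoLLM, Orca, PvRecorder, PvSpeaker
python3 main.py \
  --access_key "${YOUR_ACCESS_KEY}" \
  --picollm_model_path ./phi-2.pllm \
  --keyword_model_path ./my-wake-word.ppn
```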
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook LLM Voice Assistant Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions

FAQ

What is an on-device AI voice agent?
An on-device AI voice agent is a conversational AI that listens for a wake word, transcribes speech, generates responses using a local large language model, and speaks the answer, all running on the device with no cloud dependency. No audio or query data is transmitted externally.
How is this different from cloud voice agent platforms?
Cloud voice agent platforms compose hosted speech-to-text, LLM, and text-to-speech into conversational pipelines. This recipe uses the same architecture but runs every component on the device. No cloud latency, no dependency on third-party uptime, and no voice data leaving the device. The wake word component adds always-on hands-free activation, which cloud voice agent platforms do not provide.
Which LLMs can run on-device with picoLLM?
picoLLM supports open-weight models including Llama, Gemma, Phi, Mistral, and Mixtral. Models are compressed using picoCompression, which recovers accuracy lost by standard quantization methods like GPTQ. Model files are available for download from the Picovoice Console.
Can the voice agent run on a Raspberry Pi?
Yes. The full pipeline — Porcupine Wake Word, Cheetah Streaming Speech-to-Text, picoLLM On-device LLM, and Orca Streaming Text-to-Speech — runs on Raspberry Pi 4 and 5. Smaller models, like Phi and Gemma, fit within the memory constraints of these devices.
How is this on-device AI voice assistant different from Siri, Alexa, or Google Assistant?
Siri, Alexa, and Google Assistant are closed platforms that depend on proprietary infrastructure. This voice agent uses open-weight LLMs running locally via licensable SDKs, with no dependency on Apple, Amazon, or Google services, and no voice data leaving the device. You control the model, the system prompt, and the deployment target.
Does the voice agent store or transmit audio?
No. All audio is processed on the device. It is never transmitted to Picovoice or any third-party cloud. Picovoice has no data controller relationship with the end users.
How can I get technical support for the voice assistant demo?