On-device AI voice agents and assistants

Build an on-device AI voice agent for customer service, healthcare, and enterprise productivity.

A voice agent powered by wake word detection, streaming speech recognition, an on-device large language model, and streaming text-to-speech. Runs entirely on the device across mobile, embedded, desktop, and browser with no cloud processing.

Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi
How the on-device AI voice agent works

Four on-device AI SDKs to run the entire speech-to-speech LLM agent pipeline

The on-device AI voice agent listens for a wake word, transcribes the user's speech in real time, generates a response using a local large language model, and speaks the answer back, all without sending any data to a cloud service. Picovoice's Porcupine Wake Word, Cheetah Streaming Speech-to-Text, picoLLM on-device LLM, and Orca Streaming Text-to-Speech compose into a single pipeline.
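The four SDKs chain together in a single loop. Below is a minimal Python sketch of that pipeline using the real Picovoice packages (pvporcupine, pvcheetah, picollm, pvorca, pvrecorder); the model file names and the AccessKey placeholder are assumptions for illustration, and playback is left as a stub. See the GitHub recipe for the complete, supported implementation.

```python
# Sketch: wake word -> streaming STT -> local LLM -> streaming TTS, all on-device.
# File paths ("jarvis.ppn", "phi-2.pllm") are placeholders; download real models
# from Picovoice Console.
import pvporcupine
import pvcheetah
import picollm
import pvorca
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"  # obtained from Picovoice Console

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keyword_paths=["jarvis.ppn"])
cheetah = pvcheetah.create(access_key=ACCESS_KEY)
pllm = picollm.create(access_key=ACCESS_KEY, model_path="phi-2.pllm")
orca = pvorca.create(access_key=ACCESS_KEY)

recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

while True:
    # 1. Block until the wake word is detected.
    while porcupine.process(recorder.read()) < 0:
        pass

    # 2. Stream audio into Cheetah until it detects an endpoint (end of speech).
    transcript = ""
    while True:
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript += partial
        if is_endpoint:
            transcript += cheetah.flush()
            break

    # 3. Generate a response locally, feeding tokens to Orca as they arrive,
    #    so speech starts before generation finishes.
    stream = orca.stream_open()

    def on_token(token):
        pcm = stream.synthesize(token)
        if pcm is not None:
            pass  # hand pcm to PvSpeaker for playback (omitted in this sketch)

    pllm.generate(prompt=transcript, stream_callback=on_token)
    stream.flush()
    stream.close()
```

In practice the raw transcript would be wrapped in the model's chat template (picoLLM's dialog helper does this) rather than passed as a bare prompt.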

Pipeline diagram: The user says "Jarvis, what are the side effects of ibuprofen?" → Porcupine (wake word) detects the wake word → Cheetah (streaming STT) produces the source transcript → picoLLM (on-device LLM) generates the response → Orca (streaming TTS) speaks "Common side effects..." The cycle repeats for each conversation turn.
Why Porcupine Wake Word?

Always-on, hands-free activation with minimal CPU usage.

3.8%
Single-Core CPU Utilization on Raspberry Pi 3
97.1%
Accuracy at 1 false alarm per 10 hours
~250K
Custom wake words trained and deployed in 2025

Porcupine Wake Word enables always-on, hands-free activation for the voice agent. It listens continuously with minimal CPU, and therefore battery, usage, so the device stays ready without draining resources. When the user says the wake word, Porcupine immediately interrupts any in-progress LLM response and hands control to the speech-to-text engine. Enterprises can train branded wake words in seconds using the Picovoice Console and deploy them across all supported platforms.
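An always-on detection loop is only a few lines. The sketch below uses the pvporcupine and pvrecorder packages; the .ppn file name is a placeholder for a custom wake word trained in Picovoice Console, and the barge-in behavior is indicated as a comment rather than implemented.

```python
# Minimal always-on wake word loop. The keyword path is a placeholder for a
# custom model trained and downloaded from Picovoice Console.
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(
    access_key="${YOUR_ACCESS_KEY}",
    keyword_paths=["my-brand_wake-word.ppn"],
)
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        # process() returns the index of the detected keyword, or -1 if none.
        if porcupine.process(recorder.read()) >= 0:
            # Barge-in point: stop any in-progress TTS playback or LLM
            # generation here, then hand the microphone stream to Cheetah.
            print("wake word detected")
finally:
    recorder.delete()
    porcupine.delete()
```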

Wake Word Detection Accuracy
Higher is better
Porcupine: 97.1%
Snowboy: 68%
PocketSphinx: 52%
CPU Utilization
Lower is better
Porcupine: 3.8%
Snowboy: 24.8%
PocketSphinx: 31.8%
Why Cheetah Streaming Speech-to-Text?

Lowest latency. Lowest compute. No accuracy tradeoff.

10.1%
WER (English) vs. 11.9% for Google and 10.6% for Moonshine Medium
0.08
CPU Core-Hours vs. 3.36 for Moonshine Medium (40x less)
8.6%
WER (Spanish) vs. 11.6% for Google and 9.4% for Azure

Cheetah transcribes the user's question in real time as they speak, streaming words with an average emission latency of 590 ms. Cheetah matches or beats cloud STT APIs on accuracy and supports custom vocabulary for industry-specific terms: product names in retail, drug names in clinical settings, part numbers in manufacturing. Cheetah requires less compute than any other local engine tested, running efficiently alongside picoLLM and Orca in the same session.
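Streaming transcription with endpoint detection looks like this in Python with the pvcheetah and pvrecorder packages. The endpoint duration value is an illustrative choice, not a recommendation; custom vocabulary is handled by training a custom Cheetah model in Picovoice Console and passing its path at creation time.

```python
# Stream microphone audio into Cheetah; words print as they are spoken,
# and the loop ends when Cheetah detects an endpoint (end of utterance).
import pvcheetah
from pvrecorder import PvRecorder

cheetah = pvcheetah.create(
    access_key="${YOUR_ACCESS_KEY}",
    endpoint_duration_sec=1.0,          # silence that marks end of utterance
    enable_automatic_punctuation=True,
)
recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

transcript = ""
while True:
    partial, is_endpoint = cheetah.process(recorder.read())
    print(partial, end="", flush=True)  # emitted words appear in real time
    transcript += partial
    if is_endpoint:
        transcript += cheetah.flush()   # remaining words after the endpoint
        break

recorder.delete()
cheetah.delete()
```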

English Word Error Rate
Lower is better
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate
Lower is better
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why picoLLM?

Local LLM reasoning with no accuracy tradeoff.

99.9%
Accuracy retained at 3-bit vs. 83.1% for GPTQ for Llama-3-8b
94.5%
Accuracy retained at 2-bit vs. 38.7% for GPTQ for Llama-3-8b
Any
Any transformer architecture on any platform

picoLLM runs custom-trained and open-weight language models (Llama, Gemma, Phi, Mistral) locally on CPU or GPU with no cloud dependency. picoCompression quantizes these models to run on phones, browsers, and embedded boards while preserving task accuracy. picoLLM's minimal memory footprint allows it to run alongside Cheetah and Orca in a single session, with no network latency or privacy exposure.
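Local generation with the picollm Python package is a few calls: create the engine from a .pllm file, track conversation turns with the dialog helper, and stream tokens via a callback. The model file name below is an assumption; download the actual .pllm file for your target device from Picovoice Console.

```python
# Multi-turn local LLM generation with streaming token output.
# The model path is a placeholder; fetch a real .pllm from Picovoice Console.
import picollm

pllm = picollm.create(
    access_key="${YOUR_ACCESS_KEY}",
    model_path="llama-3-8b-instruct.pllm",
)

# The dialog helper formats turns in the model's own chat template.
dialog = pllm.get_dialog()
dialog.add_human_request("What are the side effects of ibuprofen?")

res = pllm.generate(
    prompt=dialog.prompt(),
    completion_token_limit=256,
    # Tokens arrive as they are generated; in the voice agent this callback
    # feeds Orca so speech starts before generation completes.
    stream_callback=lambda token: print(token, end="", flush=True),
)
dialog.add_llm_response(res.completion)  # keep history for the next turn

pllm.release()
```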

3-bit Quantized Llama-3-8b MMLU
Higher is better
Float16 (Original Model): 64.9
picoLLM: 64.8
GPTQ: 53.9
2-bit Quantized Llama-3-8b MMLU
Higher is better
Float16 (Original Model): 64.9
picoLLM: 61.3
GPTQ: 25.1
Why Orca Streaming Text-to-Speech?

Natural-sounding TTS at 29 MB peak memory.

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model Size

Orca speaks the LLM's response as tokens stream in, so the user hears the answer before generation is complete. Orca achieves first-token-to-speech latency of 130 ms and a peak memory footprint of 29 MB, allowing it to run alongside Porcupine, Cheetah, and picoLLM with no performance issues. The result is a natural-sounding conversational voice agent with no perceptible pause between the user's question and the spoken answer.
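Incremental synthesis uses Orca's stream API: open a stream, feed it text chunks as the LLM emits them, and play whatever PCM comes back. The sketch below pairs the pvorca package with PvSpeaker for playback; the hard-coded token list stands in for a live LLM token stream.

```python
# Stream text into Orca and play synthesized audio as it becomes available.
# The token list is a stand-in for tokens arriving from picoLLM's callback.
import pvorca
from pvspeaker import PvSpeaker

orca = pvorca.create(access_key="${YOUR_ACCESS_KEY}")
speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
speaker.start()

stream = orca.stream_open()
for token in ["Common ", "side ", "effects ", "include..."]:
    pcm = stream.synthesize(token)   # may return None until enough text buffers
    if pcm is not None:
        speaker.write(pcm)

pcm = stream.flush()                 # synthesize any remaining buffered text
if pcm is not None:
    speaker.write(pcm)
speaker.flush()

stream.close()
speaker.delete()
orca.delete()
```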

TTS Latency
Lower is better
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Audio Quality
Listen and compare samples, grouped by peak memory usage.
Peak Memory Usage < 30 MB: ESpeak, Orca
On-device AI voice agent use cases

From customer service kiosks to IT help desks

Customer Service

Voice agents for customer support and troubleshooting

LLM voice AI agents can handle commonly asked questions, such as "How do I connect to the office printer?", instantly, without waiting in a ticket queue. The on-device AI voice agent resolves routine queries from an internal knowledge base, keeping employee interactions off third-party servers.

Retail & Banking

On-device voice assistants for kiosks and in-store self-service

Retailers, banks, and telcos can deploy conversational voice agents on kiosks, POS terminals, and in-store displays that answer customer questions, look up information, and guide users through processes. The on-device stack has no dependency on network availability in stores, branches, or service centers.

Healthcare

Clinical voice agents that are HIPAA-compliant

Clinicians can query drug interactions, dosing guidelines, or documentation templates by voice during patient encounters. Running the voice agent on-device eliminates the need for a BAA covering voice data with a cloud LLM provider. No audio or text is ever transmitted to any third-party service.

Automotive

In-vehicle voice AI companions that work everywhere

Automakers and fleet operators can embed voice agents that let drivers control navigation, query vehicle diagnostics, or dictate messages without taking their hands off the wheel. The on-device stack eliminates dead zones on rural highways or in parking garages where cellular connectivity drops.

Get started

On-device AI voice agent in 5 steps: Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · llm-voice-AI-agent-assistant
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi

Prerequisites

A Picovoice AccessKey from the Picovoice Console and a clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/call-assist/python.
1

Create a virtual environment

Isolate the recipe's dependencies from your system Python.
2

Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
Linux, macOS, or Raspberry Pi
Windows
3

Install dependencies

Install the Porcupine, Cheetah, picoLLM, Orca, PvRecorder, and PvSpeaker Python SDKs.
4

Download an LLM

Open the Picovoice Console, go to picoLLM, and download a .pllm model file for your target device. Choose a more heavily compressed model for hardware-constrained devices, such as Raspberry Pi.
5

Train the Wake Word

In Picovoice Console, go to Porcupine Wake Word, enter your wake phrase, train, and download the .ppn file for your target platform.
6

Run the AI voice agent demo

Pass your AccessKey and model paths, then run the demo. The demo opens the microphone and speaker and runs the on-device voice agent pipeline locally.
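Taken together, the six steps above look roughly like the following on Linux, macOS, or Raspberry Pi. The demo script name and its flags are assumptions for illustration; check the recipe's README in the GitHub repo for the exact command line.

```shell
# From recipes/call-assist/python. Script name and flag names below are
# placeholders; consult the recipe README for the exact invocation.
python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -r requirements.txt    # Porcupine, Cheetah, picoLLM, Orca, PvRecorder, PvSpeaker
python3 main.py \
  --access_key "${YOUR_ACCESS_KEY}" \
  --picollm_model_path ./phi-2.pllm \
  --keyword_model_path ./my-wake-word.ppn
```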
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook LLM Voice Assistant Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions

FAQ

What is an on-device AI voice agent?
An on-device AI voice agent is a conversational AI that listens for a wake word, transcribes speech, generates responses using a local large language model, and speaks the answer, all running on the device with no cloud dependency. No audio or query data is transmitted externally.
How is this different from cloud voice agent platforms?
Cloud voice agent platforms compose hosted speech-to-text, LLM, and text-to-speech into conversational pipelines. This recipe uses the same architecture but runs every component on the device. No cloud latency, no dependency on third-party uptime, and no voice data leaving the device. The wake word component adds always-on hands-free activation, which cloud voice agent platforms do not provide.
Which LLMs can run on-device with picoLLM?
picoLLM supports open-weight models including Llama, Gemma, Phi, Mistral, and Mixtral. Models are compressed using picoCompression, which recovers accuracy lost by standard quantization methods like GPTQ. Model files are available for download from the Picovoice Console.
Can the voice agent run on a Raspberry Pi?
Yes. The full pipeline — Porcupine Wake Word, Cheetah Streaming Speech-to-Text, picoLLM On-device LLM, and Orca Streaming Text-to-Speech — runs on Raspberry Pi 4 and 5. Smaller models, like Phi and Gemma, fit within the memory constraints of these devices.
How is this on-device AI voice assistant different from Siri, Alexa, or Google Assistant?
Siri, Alexa, and Google Assistant are closed platforms that depend on proprietary infrastructure. This voice agent uses open-weight LLMs running locally via licensable SDKs, with no dependency on Apple, Amazon, or Google services, and no voice data leaving the device. You control the model, the system prompt, and the deployment target.
Does the voice agent store or transmit audio?
No. All audio is processed on the device. It is never transmitted to Picovoice or any third-party cloud. Picovoice has no data controller relationship with the end users.
How can I get technical support for the voice assistant demo?