Speaker-Aware Voice Assistant

Build an AI Voice Assistant with Speaker Recognition for Personalization and Authentication

Identifies who is speaking by voiceprint, personalizes responses, and grants or denies voice commands based on the speaker's role. Built for enterprises developing their own equivalent of Alexa Voice ID or Google Voice Match.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How the AI voice assistant with speaker recognition works

One on-device voice AI pipeline for speaker identification, voice commands, and personalized responses

An AI voice assistant with speaker recognition listens for a wake word, infers the user's intent directly from the follow-on spoken command, identifies who is speaking by voiceprint to grant or deny access based on the speaker's role, and responds to users all on-device with no cloud dependency. Porcupine Wake Word detects the wake word. Rhino Speech-to-Intent extracts the command as a structured intent. Eagle Speaker Recognition compares the speaker's voice against enrolled voiceprints and returns a similarity score. The assistant then checks the speaker's role (admin or user) against the command's permission level before acting. Orca Text-to-Speech speaks the response back.

[Pipeline diagram] User speaks: "Hey Assistant, unlock the front door" → Porcupine (wake word detected) → Rhino (intent extracted: { intent: "adminOnly", action: "unlockDoor" }) → Eagle (voiceprint matched: speaker "Alex", role "admin", score 0.92) → Orca (voice response: "Admin command approved.") → returns to listening for the wake word.
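The access decision at the end of this pipeline can be sketched in a few lines of Python. The stub inputs below stand in for the real SDK outputs (Eagle similarity scores and a Rhino intent name); ADMIN_THRESHOLD, ENROLLED, and PERMISSIONS are illustrative names, not part of any Picovoice SDK:

```python
ADMIN_THRESHOLD = 0.8  # minimum similarity score to accept an identity (illustrative)

ENROLLED = [("Alex", "admin"), ("Jane", "user")]  # one (name, role) per enrolled profile

PERMISSIONS = {  # intent name -> role required to run it
    "adminOnly": "admin",
    "speakerPersonalized": "user",
    "generic": None,  # anyone may run it, even unidentified speakers
}

def identify_speaker(scores):
    """Map similarity scores (one per enrolled profile) to a (name, role) pair."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] >= ADMIN_THRESHOLD:
        return ENROLLED[best]
    return (None, None)

def handle_command(intent, scores):
    """Decide whether the identified speaker may execute the inferred intent."""
    name, role = identify_speaker(scores)
    required = PERMISSIONS.get(intent)
    if required is None:
        return "Command approved."
    if role == "admin" or role == required:
        return f"{role.capitalize()} command approved."
    return "Permission denied."

print(handle_command("adminOnly", [0.92, 0.10]))  # Alex (admin) -> Admin command approved.
print(handle_command("adminOnly", [0.15, 0.85]))  # Jane (user) -> Permission denied.
```

In the real recipe, the scores come from Eagle and the intent from Rhino; the routing logic stays this simple because Rhino returns a structured intent rather than a transcript.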
Why Porcupine Wake Word?

Always-on, low-power wake word detection for embedded devices

3.8%
Single-Core CPU Utilization on Raspberry Pi 3
97.1%
Accuracy at 1 false alarm per 10 hours
~250K
Custom wake words trained and deployed in 2025

Porcupine Wake Word provides always-on, low-power wake word detection for embedded devices. It listens continuously with minimal CPU usage and triggers the command pipeline only when the wake word is detected, keeping power consumption low between activations. Custom wake words can be trained in seconds using the Picovoice Console and exported for embedded, mobile, web, and desktop applications.

Wake Word Detection Accuracy - higher is better
Porcupine: 97.1%
Snowboy: 68%
PocketSphinx: 52%
CPU Utilization - lower is better
Porcupine: 3.8%
Snowboy: 24.8%
PocketSphinx: 31.8%
Why Rhino Speech-to-Intent?

Structured voice commands without intermediate speech-to-text

6x
Higher accuracy than Big Tech average
97.3%
Accuracy tested across 6 to 24 dB Signal-to-Noise Ratio
Unlimited voice interactions per user

Rhino Speech-to-Intent extracts structured intents directly from spoken commands without an intermediate speech-to-text step, achieving higher accuracy than Big Tech alternatives such as Google Dialogflow and Amazon Lex. For a speaker-aware voice assistant, Rhino's domain-specific models define which commands exist and what permission levels they require. You configure the intents (e.g., adminOnly, speakerPersonalized, generic) and slots in the Picovoice Console. Rhino only recognizes commands within the defined domain, eliminating out-of-domain misrecognition.
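The Rhino Python SDK returns an inference object exposing is_understood, intent, and slots. The namedtuple below is a hypothetical stand-in for that object, just to show what the application code consumes instead of a transcript:

```python
from collections import namedtuple

# Hypothetical stand-in for the inference object returned by the Rhino SDK;
# the real object exposes the same three fields.
Inference = namedtuple("Inference", ["is_understood", "intent", "slots"])

def describe(inference):
    """Turn a structured intent into an action string; no transcript involved."""
    if not inference.is_understood:
        return "out-of-domain"  # Rhino rejects speech outside the configured context
    slot_text = ", ".join(f"{k}={v}" for k, v in sorted(inference.slots.items()))
    return f"{inference.intent}({slot_text})"

print(describe(Inference(True, "adminOnly", {"action": "unlockDoor"})))
# -> adminOnly(action=unlockDoor)
print(describe(Inference(False, None, {})))
# -> out-of-domain
```

Because out-of-domain speech is rejected before any handler runs, the permission check only ever sees intents you defined in the Console.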

Voice Command Acceptance Accuracy
Higher is better
Rhino: 97.3%
Amazon Lex: 84.3%
Google Dialogflow: 77.3%
Voice Command Acceptance Accuracy at 21 dB SNR
Higher is better
Rhino: 99%
Amazon Lex: 87%
Google Dialogflow: 83%
Why Eagle Speaker Recognition?

On-device voiceprint matching for identity and role verification

0.18%
Equal Error Rate vs. SpeechBrain 0.49%
4.5 MB
Model Size vs. SpeechBrain 46.5 MB
Any
No language or passphrase restriction

Eagle Speaker Recognition identifies who is speaking by comparing the speaker's voice against enrolled voiceprints. Eagle returns a similarity score for each enrolled speaker, and the application uses that score to determine identity and role. Enrollment takes a few seconds of speech. Eagle runs entirely on-device; voiceprint data is never transmitted to Picovoice or any third-party server.
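One way the application can turn Eagle's per-profile similarity scores into an identity decision is sketched below. The threshold and margin parameters are illustrative tuning knobs added here, not SDK parameters; requiring a margin over the runner-up is one optional way to avoid confusing similar voices:

```python
def identify(scores, names, threshold=0.7, margin=0.1):
    """Pick the enrolled speaker whose similarity score wins clearly.

    The top score must exceed `threshold` and beat the runner-up by `margin`;
    otherwise the speaker is treated as unknown.
    """
    ranked = sorted(zip(scores, names), reverse=True)
    top, name = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if top >= threshold and top - runner_up >= margin:
        return name
    return None

print(identify([0.92, 0.31], ["Alex", "Jane"]))  # Alex
print(identify([0.55, 0.52], ["Alex", "Jane"]))  # None: below threshold and too close to call
```

Raising the threshold favors security (fewer false accepts); lowering it favors convenience (fewer false rejects).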

Equal Error Rate (EER) — lower is better
Eagle Speaker Recognition: 0.18%
SpeechBrain Speaker Recognition: 0.49%
pyannote Speaker Recognition: 0.70%
Model Size (MB to initialize) — lower is better
Eagle Speaker Recognition: 4.5 MB
SpeechBrain Speaker Recognition: 46.5 MB
pyannote Speaker Recognition: 117.5 MB
Why Orca Text-to-Speech?

Spoken responses at 29 MB peak memory

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model Size

Orca Text-to-Speech speaks the assistant's responses back to the user after the access decision, such as "Hi Jane, playing your favourite playlist," "Admin command approved," or "Permission denied." Orca's 29 MB peak memory usage and 7 MB model size make it well-suited for mobile and embedded applications.
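The text Orca synthesizes is composed by the application. A minimal sketch of that step, using the response strings quoted above (the build_response function and its arguments are illustrative, not part of the Orca SDK):

```python
def build_response(decision, speaker=None, intent=None):
    """Compose the text the TTS engine will speak after the access decision."""
    if decision == "denied":
        return "Permission denied."
    if intent == "speakerPersonalized" and speaker:
        return f"Hi {speaker}, playing your favourite playlist."
    if intent == "adminOnly":
        return "Admin command approved."
    return "Command approved."

print(build_response("approved", speaker="Jane", intent="speakerPersonalized"))
# -> Hi Jane, playing your favourite playlist.
```

In the recipe, the returned string is passed to Orca for on-device synthesis and played back through the speaker.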

TTS Latency
Lower is better
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Audio Quality
Listen and compare audio samples, grouped by peak memory usage. Under 30 MB peak memory: ESpeak and Orca.
Speaker-aware voice assistant use cases

From smart homes to clinical workflows: voice authentication and personalization for real devices

Smart Home

Smart home access control

Smart home manufacturers can restrict sensitive commands to enrolled household members. "Disarm the alarm" only works for verified adults, while children and guests can change the music and temperature without access to security or purchase functions.

Enterprise

Role-based voice commands on shared enterprise equipment

Enterprises deploying shared kiosks, terminals, or workstations can assign different permission levels to different speakers. A floor manager can override production settings; an operator can only start and stop cycles. The device identifies the speaker by voiceprint in real time.

Healthcare

Voice-authenticated access to clinical systems

Only authorized clinicians can issue medication commands, adjust device settings, or access patient information by voice, giving clinicians hands-free authentication during procedures and consultations. All speaker recognition runs on the device, keeping voiceprint data off external servers.

Retail

Role-based voice commands at retail stores

Retailers can add voice commands to POS terminals and self-checkout kiosks where managers, store associates, and customers have different access levels. Voiceprints of employees are stored locally and never sent to Picovoice or any third party.

Get started

AI voice assistant with speaker recognition: Code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · ai-voice-assistant-speaker-recognition
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey from Picovoice Console and a local clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/speaker-aware-voice-assistant/python.
1. Create a virtual environment

Isolate the recipe's dependencies from your system Python by setting up a virtual environment.

2. Activate the virtual environment

Activation makes pip install into .venv instead of system Python. The activation command differs between Linux/macOS/Raspberry Pi and Windows.

3. Install dependencies

Pulls in the Porcupine Wake Word, Rhino Speech-to-Intent, Eagle Speaker Recognition, and Orca Text-to-Speech Python SDKs, along with PvRecorder and PvSpeaker for audio input and output.

4. Train a wake word

In Picovoice Console, go to Porcupine Wake Word, enter your wake phrase, and download the .ppn model file optimized for your platform.

5. Design your voice commands

In Picovoice Console, go to Rhino Speech-to-Intent, create a context with intents like adminOnly, speakerPersonalized, and generic. Export the .rhn context file.

6. Enroll speakers

Create one Eagle speaker profile for each user you want the assistant to recognize.

7. Run the voice assistant

Pass your AccessKey, the wake word model, the Rhino context, and the speaker profiles with their assigned roles.
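On Linux, macOS, or Raspberry Pi, steps 1 through 3 look roughly like the following. The PyPI package names are the Picovoice SDKs; adjust the activation path for Windows. The recipe's exact run command and flags are in the GitHub repo, so they are not guessed at here:

```shell
# Step 1: create a virtual environment in .venv
python3 -m venv .venv

# Step 2: activate it (Windows: .venv\Scripts\activate)
. .venv/bin/activate

# Step 3: install the Picovoice SDKs plus audio I/O helpers
pip install pvporcupine pvrhino pveagle pvorca pvrecorder pvspeaker
```

After activation, `pip` and `python` resolve to the versions inside .venv, so the installed SDKs do not touch your system Python.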
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook Speaker-Aware Voice Assistant Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions


What is a speaker-aware voice assistant?
A speaker-aware voice assistant identifies who is speaking by voiceprint and adapts its behavior based on the speaker's identity and role. It uses voice biometrics to grant or deny access to specific commands, personalize responses, or restrict functionality to authorized users.
How does voice biometric authentication work in this recipe?
Eagle Speaker Recognition compares the speaker's voice against enrolled voiceprints and returns a similarity score. If the score exceeds a configurable threshold, the speaker is identified, and their assigned role (admin or user) determines which commands they can execute. All processing runs on-device.
How is this different from Alexa Voice ID, Google Voice Match, or Siri voice recognition?
Alexa Voice ID, Google Voice Match, and Siri all support some form of speaker recognition, but they are closed platforms tied to their respective ecosystems. For example, you cannot embed Alexa Voice ID into your own hardware product without joining Amazon's ecosystem, and all voice data flows through their cloud. Picovoice's Eagle Speaker Recognition is a licensable SDK that runs entirely on your hardware. Voiceprint enrollment and recognition happen on-device. No audio or voiceprint data is transmitted to Picovoice, Amazon, Google, Apple, or any third-party server. You control the hardware, the firmware, and the data.
How accurate is the speaker recognition?
Eagle Speaker Recognition achieves 0.18% EER on VoxConverse, a widely used multi-speaker dataset containing real conversations across multiple languages. That is 2.7x lower than SpeechBrain (0.49%) and 3.9x lower than pyannote (0.70%). EER measures the point where false acceptance and false rejection rates are equal; a lower EER means fewer impostors get through and fewer genuine users get blocked. The admin_similarity_threshold parameter lets you tune the tradeoff between security (higher threshold, fewer false accepts) and convenience (lower threshold, fewer false rejects).
Can someone spoof the voice authentication?
Eagle Speaker Recognition is not an anti-spoofing detector and does not perform liveness detection. Eagle compares the acoustic features of the speaker's voice. It is designed for convenience-level voice biometric authentication (device personalization, role-based access) rather than high-security authentication (financial transactions, facility access). So it does not distinguish between a live speaker and a high-quality recording or synthetic voice. Spoofing techniques, including voice cloning and deepfake audio, continue to advance. For high-security applications, combine voice biometrics with a second factor such as a PIN, badge, or biometric sensor.
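The second-factor pairing suggested above can be sketched as a simple gate: a sensitive action requires both a confident voiceprint match and, for example, a correct PIN. The function and the stored hash below are illustrative placeholders, not part of any Picovoice SDK:

```python
import hashlib
import hmac

# Placeholder PIN hash; a real deployment would store salted hashes securely.
PIN_HASH = hashlib.sha256(b"4921").hexdigest()

def authorize(voice_score, pin, threshold=0.8):
    """Require both a confident voiceprint match and a correct PIN."""
    voice_ok = voice_score >= threshold
    # Constant-time comparison avoids leaking information via timing.
    pin_ok = hmac.compare_digest(hashlib.sha256(pin.encode()).hexdigest(), PIN_HASH)
    return voice_ok and pin_ok

print(authorize(0.92, "4921"))  # True: both factors pass
print(authorize(0.92, "0000"))  # False: wrong PIN, even with a strong voice match
```

A recording or cloned voice that fools the voiceprint check still fails without the second factor.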
How many speakers can be enrolled?
There is no hard limit on the number of enrolled speaker profiles. Each profile is a lightweight file generated from a few seconds of speech. The application passes all profiles to Eagle Speaker Recognition on each recognition call, and Eagle returns similarity scores for each.
Does the speaker recognition work offline?
Yes. Eagle Speaker Recognition runs entirely on-device. Voiceprint enrollment and recognition both happen locally. No audio or voiceprint data is transmitted to Picovoice or any third-party server.
Can I use speaker recognition without the voice command pipeline?
Yes. Eagle Speaker Recognition works as a standalone SDK. You can use it for speaker identification or verification in any application without Porcupine, Rhino, or Orca. This recipe combines all four to demonstrate a complete speaker-aware voice assistant.
Does the voice assistant store or transmit audio?
No. All audio is processed on the device and discarded. Voiceprint profiles are stored locally as .egl files. Nothing is transmitted to Picovoice or any third-party cloud. Picovoice has no data controller relationship with your end users.
How can I get technical support?