Voice Memo Assistant

On-Device Voice Memos and Notes with Wake Word, LLM Summarization, and Voice Playback

Build apps that record, transcribe, summarize, rewrite, and replay voice memos hands-free. Wake word activation, voice commands, real-time transcription, LLM summarization, and spoken playback. All on-device. No audio or text data ever leaves the device.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How the on-device voice memo assistant works

Five on-device AI SDKs to record, summarize, rewrite, and replay voice notes.

The on-device voice memo assistant combines Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, picoLLM, and Orca Streaming Text-to-Speech into a single local pipeline that records, summarizes, rewrites, and replays voice notes. Porcupine listens for the wake word; Rhino captures intents such as start, summarize, rewrite, or read the memo; Cheetah transcribes the dictation; picoLLM produces the summary or rewrite locally; and Orca speaks the result back. Memo content never leaves the device.

Pipeline: Porcupine (WAKE WORD) → Rhino (SPEECH-TO-INTENT) → Cheetah (STREAMING STT) → picoLLM (SUMMARIZE / REWRITE) → Orca (STREAMING TTS). The wake word starts the flow; Rhino routes the spoken intent to one of the three downstream engines.
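A minimal sketch of how the five Python SDKs can be wired together (the keyword and context file names, the AccessKey placeholder, and the endpoint duration below are illustrative assumptions, not the recipe's exact code):

    import pvcheetah
    import picollm
    import pvorca
    import pvporcupine
    import pvrhino
    from pvrecorder import PvRecorder

    ACCESS_KEY = "${YOUR_ACCESS_KEY}"  # obtained from Picovoice Console

    # Create every engine once; all inference below runs on-device.
    porcupine = pvporcupine.create(access_key=ACCESS_KEY, keyword_paths=["memo_assistant.ppn"])
    rhino = pvrhino.create(access_key=ACCESS_KEY, context_path="voice_memo.rhn")
    cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
    pllm = picollm.create(access_key=ACCESS_KEY, model_path="llama-3.2-1b-instruct-385.pllm")
    orca = pvorca.create(access_key=ACCESS_KEY)

    recorder = PvRecorder(frame_length=porcupine.frame_length)  # 512-sample frames at 16 kHz
    recorder.start()

    waiting_for_wake_word = True
    while True:
        frame = recorder.read()
        if waiting_for_wake_word:
            if porcupine.process(frame) >= 0:      # wake word detected
                waiting_for_wake_word = False      # hand the audio stream to Rhino
        elif rhino.process(frame):                 # Rhino finalized an inference
            inference = rhino.get_inference()
            # Route inference.intent to Cheetah, picoLLM, or Orca (see the sections below).
            waiting_for_wake_word = True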
Why Porcupine Wake Word?

Always-listening wake word with very low CPU and battery cost.

3.8%
Single-Core CPU Utilization on Raspberry Pi 3
97.1%
Accuracy at 1 false alarm per 10 hours
~250K
Custom wake words trained and deployed in 2025

Porcupine Wake Word is lightweight, accurate, customizable, and production-ready. It enables an always-listening wake word with very low CPU and battery cost, which is critical for a voice memo app that lives in the background and must be activated without draining a mobile device's battery. With Porcupine, enterprises can train branded wake words and always-listening commands in seconds and deploy them across platforms, with industry-leading accuracy that fits on MCUs.
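A minimal sketch of the always-listening loop, assuming the pvporcupine and pvrecorder Python packages; the keyword file name and sensitivity value are illustrative:

    import pvporcupine
    from pvrecorder import PvRecorder

    porcupine = pvporcupine.create(
        access_key="${YOUR_ACCESS_KEY}",
        keyword_paths=["memo_assistant.ppn"],  # custom wake word trained on Picovoice Console
        sensitivities=[0.6],                   # trade missed detections against false alarms
    )

    recorder = PvRecorder(frame_length=porcupine.frame_length)
    recorder.start()
    try:
        while True:
            if porcupine.process(recorder.read()) >= 0:
                print("Wake word detected; start listening for a memo command.")
    finally:
        recorder.delete()
        porcupine.delete()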

Wake Word Detection Accuracy (higher is better)
Porcupine: 97.1%
Snowboy: 68%
PocketSphinx: 52%
CPU Utilization (lower is better)
Porcupine: 3.8%
Snowboy: 24.8%
PocketSphinx: 31.8%
Why Rhino Speech-to-Intent?

End-to-end intent. No transcript. No hallucinations.

6x
Higher accuracy than Big Tech average
97.3%
Accuracy tested across 6 to 24 dB Signal-to-Noise Ratio
Unlimited voice interactions per user

Rhino Speech-to-Intent maps spoken commands, such as "Start memo" and "Read memo," directly to actions with no intermediate transcript. Most voice command systems run a two-step pipeline: speech-to-text converts audio to a transcript, then a separate NLU model parses that transcript for intent. Every step accumulates error and compounds latency. Rhino achieves 6× higher accuracy than STT-plus-cloud-NLU stacks offered by Big Tech, even in noisy environments.
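A minimal sketch of routing a finalized Rhino inference to memo actions, assuming a context built from the recipe's YAML with the startMemo, readMemo, summarizeMemo, and rewriteMemo intents; the handler functions are hypothetical placeholders:

    import pvrhino

    rhino = pvrhino.create(access_key="${YOUR_ACCESS_KEY}", context_path="voice_memo.rhn")

    def on_audio_frame(frame):
        # Feed 512-sample frames until Rhino finalizes an inference.
        if not rhino.process(frame):
            return
        inference = rhino.get_inference()
        if not inference.is_understood:
            return
        # No transcript is produced; the spoken command maps straight to an intent.
        if inference.intent == "startMemo":
            start_dictation()            # placeholder: hand the microphone stream to Cheetah
        elif inference.intent == "summarizeMemo":
            summarize_with_picollm()     # placeholder
        elif inference.intent == "rewriteMemo":
            rewrite_with_picollm()       # placeholder
        elif inference.intent == "readMemo":
            read_back_with_orca()        # placeholder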

Voice Command Acceptance Accuracy (higher is better)
Rhino: 97.3%
Amazon Lex: 84.3%
Google Dialogflow: 77.3%
Voice Command Acceptance Accuracy at 21 dB SNR (higher is better)
Rhino: 99%
Amazon Lex: 87%
Google Dialogflow: 83%
Why Cheetah Streaming Speech-to-Text?

Real-time dictation transcription with custom vocabulary.

10.1%
WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08
CPU Core-Hour vs. 3.36 Moonshine Medium, 40x less
8.6%
WER (Spanish) vs. 11.6% Google and 9.4% Azure

Cheetah transcribes the memo dictation in real time as the user speaks. Cheetah matches or beats cloud STT API accuracy and supports custom vocabulary for industry jargon and proper nouns, which is important for memos that include client names, drug names, case numbers, or part numbers. Cheetah emits words at 590 ms average latency, typically one word behind the speaker, and requires less compute than any other local engine tested. The result: no tradeoff on accuracy, latency, or privacy.
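A minimal sketch of real-time dictation capture with Cheetah, assuming the pvcheetah and pvrecorder Python packages; the endpoint duration and punctuation flag are illustrative choices:

    import pvcheetah
    from pvrecorder import PvRecorder

    cheetah = pvcheetah.create(
        access_key="${YOUR_ACCESS_KEY}",
        endpoint_duration_sec=1.0,           # end the memo after ~1 s of trailing silence
        enable_automatic_punctuation=True,
    )

    recorder = PvRecorder(frame_length=cheetah.frame_length)
    recorder.start()

    transcript_parts = []
    while True:
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript_parts.append(partial)     # words arrive roughly one word behind the speaker
        if is_endpoint:
            transcript_parts.append(cheetah.flush())
            break

    recorder.stop()
    memo_text = "".join(transcript_parts)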

English Word Error Rate (lower is better)
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate (lower is better)
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why picoLLM?

Local LLM summarization and rewriting with no cloud dependency.

99.9%
Accuracy retained at 3-bit vs. 83.1% with GPTQ for Llama-3-8b
94.5%
Accuracy retained at 2-bit vs. 38.7% with GPTQ for Llama-3-8b
Any
Any transformer architecture on any platform

picoLLM uses picoCompression to quantize language models so they run on phones, browsers, and embedded boards with no cloud dependency while preserving task accuracy, which makes it well-suited for memo summarization and rewriting. picoLLM's minimal memory footprint lets the model run alongside Cheetah, Rhino, and Orca in a single session without introducing network latency or privacy exposure.
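A minimal sketch of local summarization with picoLLM, using the llama-3.2-1b-instruct-385.pllm model referenced in the recipe steps below; the prompt wording, token limit, and placeholder transcript are illustrative:

    import picollm

    pllm = picollm.create(
        access_key="${YOUR_ACCESS_KEY}",
        model_path="llama-3.2-1b-instruct-385.pllm",  # downloaded from Picovoice Console
    )

    memo_text = "<transcript produced by Cheetah>"
    prompt = (
        "Summarize the following voice memo as short bullet points, "
        "including any next steps:\n\n" + memo_text
    )

    # Generation runs entirely on-device; no cloud LLM API is called.
    result = pllm.generate(prompt, completion_token_limit=256)
    summary = result.completion

    pllm.release()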

3-bit Quantized Llama-3-8b MMLU (higher is better)
Float16 (Original Model): 64.9
picoLLM: 64.8
GPTQ: 53.9
2-bit Quantized Llama-3-8b MMLU (higher is better)
Float16 (Original Model): 64.9
picoLLM: 61.3
GPTQ: 25.1
Why Orca Streaming Text-to-Speech?

Natural-sounding TTS at 29 MB peak memory.

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model Size

Orca reads back the memo, the summary, or the rewritten version in a natural-sounding voice. Orca Streaming Text-to-Speech is built for real-time LLM applications: it starts speaking LLM responses as soon as the model produces a meaningful word or phrase, achieving a first-token-to-speech latency of 130 ms. Orca's 29 MB peak memory allows it to run alongside Cheetah, picoLLM, Porcupine, and Rhino with no performance issues, whereas most high-quality TTS engines require hundreds of megabytes of RAM, which limits where they can deploy.
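A minimal sketch of streaming generated text into Orca and out through the speaker, assuming the pvorca and pvspeaker Python packages; chunking by word and the speaker buffer handling are simplified, and in practice the text would come from picoLLM's streaming callback:

    import pvorca
    from pvspeaker import PvSpeaker

    orca = pvorca.create(access_key="${YOUR_ACCESS_KEY}")
    speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
    speaker.start()

    summary = "<summary produced by picoLLM>"
    stream = orca.stream_open()
    for word in summary.split(" "):
        pcm = stream.synthesize(word + " ")  # returns audio as soon as enough text is buffered
        if pcm is not None:
            speaker.write(pcm)

    pcm = stream.flush()                     # synthesize whatever text is still buffered
    if pcm is not None:
        speaker.write(pcm)

    speaker.flush()
    speaker.stop()
    stream.close()
    orca.delete()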

TTS Latency (lower is better)
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Audio Quality: listen and compare audio samples, grouped by peak memory usage (under 30 MB: ESpeak and Orca).
On-device voice memo assistant use cases

From sales reps to clinicians

Sales & CRM capture

On-device voice memo for sales reps and CRM capture

Sales reps walk out of a meeting and record their notes on their phones or tablets on the way to the next meeting or back home. picoLLM structures the transcription into a summary with next steps, attendees, and follow-ups. The captured note can be pushed into the CRM. Customer voice recordings and prospect details never leave the device.

Healthcare

HIPAA-compliant dictation for doctors and clinicians

Doctors, nurses, and therapists can dictate patient encounter notes between visits, reducing time spent on documentation. picoLLM summarizes them into a SOAP-friendly structure. No audio or transcript is sent to a cloud, which removes BAA requirements under HIPAA. Porcupine, Cheetah, and Rhino all support custom medical vocabulary.

Legal, journalism, FSI

On-device dictation and notes in handling sensitive data

Lawyers, journalists, and financial advisors capture interviews containing sensitive data that shouldn't be shared with third parties. Yet Big Tech providers and newer companies like Otter.ai alike have made the news and ended up in courtrooms over questionable privacy practices. On-device AI is the only way to keep voice memos private.

Field service

Offline voice notes for field service

Service technicians inspect assets and capture issues on manufacturing lines and in the field, where connectivity is unreliable and the content carries commercial-sensitivity or confidentiality requirements. On-device processing ensures notes are captured securely and accurately without depending on a network connection.

Get started

On-device voice memo assistant: Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · voice-memo-assistant
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey from Picovoice Console and a clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/voice-memo-assistant/python.
1

Create a virtual environment

Isolate the recipe's dependencies from your system Python.
2

Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
The activation command differs for Linux, macOS, or Raspberry Pi versus Windows.
3

Install dependencies

Install the Porcupine, Rhino, Cheetah, picoLLM, Orca, PvRecorder, and PvSpeaker Python SDKs.
4

Download the on-device LLM

Download llama-3.2-1b-instruct-385.pllm from the Picovoice Console. Summarization and rewriting run locally on this model with no cloud LLM call.
5

Train the Wake Word

In Picovoice Console, go to Porcupine Wake Word, enter your wake phrase, train, and download the .ppn file for your target platform.
6

Train the Speech-to-Intent model

In Picovoice Console, go to Rhino Speech-to-Intent, create an empty context, and import the Rhino context YAML for this recipe. Intents include startMemo, readMemo, summarizeMemo, and rewriteMemo. Download the generated .rhn file.
7

Run the voice memo assistant demo

Pass your AccessKey and run the demo. The demo opens the microphone and runs the on-device voice memo assistant pipeline locally.
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook Voice Memo Assistant Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions


What is a voice memo assistant?
A voice memo assistant is a hands-free dictation app that lets users start, summarize, rewrite, and replay voice memos using only their voice. The wake word starts the flow, the user speaks the memo, and speech-to-text transcribes it. Depending on the user's preference, a local LLM can further summarize or rewrite the captured text, and Orca can read it back. Picovoice's on-device pipeline runs entirely on the device; no audio, transcript, or summary is sent to the cloud.
Is the LLM in voice memo assistant running locally?
Yes. picoLLM Inference runs compressed open-weight LLMs such as Llama 3.2 directly on-device. The summary and rewrite are produced locally — no cloud LLM API call, no third party seeing your memo content.
Can the voice memo assistant work without an internet connection?
Yes. Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, picoLLM Inference, and Orca Streaming Text-to-Speech all run locally. The full record, transcribe, summarize, rewrite, and replay flow processes data locally without sending any data to the cloud.
Can I customize the wake word and the intents?
Yes. Custom wake words compatible with Porcupine Wake Word can be trained through Picovoice Console or the Porcupine Wake Word API powered by picoGYM. You can customize intents in Rhino Speech-to-Intent context YAML files to add voice commands such as "start memo" or "read memo" on Picovoice Console or the Rhino API powered by picoGYM.
Who uses this kind of on-device voice memo assistant?
Deskless workers who are always on the go, mobile professionals, field workers (utility, oil & gas, construction), journalists capturing source material, lawyers dictating case notes, doctors and clinicians dictating between patients, real-estate agents capturing showings, sales reps capturing post-call notes, researchers in the field, and consumers who simply prefer their voice notes to stay private.
How does the on-device voice memo assistant pipeline use a small on-device LLM responsibly?
The on-device voice memo assistant recipe uses a compact LLM such as Llama 3.2 compressed with picoCompression, which preserves task accuracy more effectively than GPTQ at the same bit-width. Summarization and rewriting are scoped tasks where small LLMs perform well, and the user can review the output before saving. The recording and the transcript remain the source of truth; the LLM operates on top of them, not in place of them.
Can I build the on-device voice memo assistant on Android or iOS?
Yes. The on-device voice memo assistant recipe has Android and iOS implementations for developers to get started easily. For more implementations, visit the GitHub pico-cookbook Voice Memo Assistant Recipe.
Does the on-device voice memo assistant send audio to a third-party cloud?
No. Audio and text are processed on the device, and Picovoice has no data controller relationship with end users. This removes the data-processing-agreement and breach-surface concerns that arise when memos containing client, patient, source, or commercially sensitive material are sent to the cloud.
How is this different from Otter, Notta, or Wispr Flow?
Otter, Notta, and Wispr Flow transcribe audio on cloud servers, use cloud LLMs for formatting and rewriting, and are not completely hands-free. Picovoice's voice memo assistant runs the full pipeline — wake word, intent, STT, LLM summarization and rewriting, and TTS — entirely on-device.
How can I get technical support for the voice memo assistant demo?
Visit the GitHub pico-cookbook Voice Memo Assistant Recipe where you can find the open-source demo code and create an issue for demo-related technical questions or reach out to your Picovoice contact.