Voice Memo Assistant

On-Device Voice Memos and Notes with Wake Word, LLM Summarization, and Voice Playback

Build apps that record, transcribe, summarize, rewrite, and replay voice memos hands-free. Wake word activation, voice commands, real-time transcription, LLM summarization, and spoken playback. All on-device. No audio or text data ever leaves the device.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How the on-device voice memo assistant works

Five on-device AI SDKs to record, summarize, rewrite, and replay voice notes.

The on-device voice memo assistant combines Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, picoLLM, and Orca Streaming Text-to-Speech into a single local pipeline that records, summarizes, rewrites, and replays voice notes. Porcupine listens for the wake word; Rhino captures intents such as start, summarize, rewrite, or read the memo; Cheetah transcribes the dictation; picoLLM produces the summary or rewrite locally; and Orca speaks the result back. Memo content never leaves the device.

Pipeline: Porcupine (WAKE WORD) → Rhino (SPEECH-TO-INTENT) → Cheetah (STREAMING STT) → picoLLM (SUMMARIZE / REWRITE) → Orca (STREAMING TTS). The wake word starts the flow; Rhino routes the spoken intent to one of the three downstream engines.
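A minimal sketch of how the five Python SDKs can be wired together (the keyword and context file names, the AccessKey placeholder, and the endpoint duration below are illustrative assumptions, not the recipe's exact code):

    import pvcheetah
    import picollm
    import pvorca
    import pvporcupine
    import pvrhino
    from pvrecorder import PvRecorder

    ACCESS_KEY = "${YOUR_ACCESS_KEY}"  # obtained from Picovoice Console

    # Create every engine once; all inference below runs on-device.
    porcupine = pvporcupine.create(access_key=ACCESS_KEY, keyword_paths=["memo_assistant.ppn"])
    rhino = pvrhino.create(access_key=ACCESS_KEY, context_path="voice_memo.rhn")
    cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
    pllm = picollm.create(access_key=ACCESS_KEY, model_path="llama-3.2-1b-instruct-385.pllm")
    orca = pvorca.create(access_key=ACCESS_KEY)

    recorder = PvRecorder(frame_length=porcupine.frame_length)  # 512-sample frames at 16 kHz
    recorder.start()

    waiting_for_wake_word = True
    while True:
        frame = recorder.read()
        if waiting_for_wake_word:
            if porcupine.process(frame) >= 0:      # wake word detected
                waiting_for_wake_word = False      # hand the audio stream to Rhino
        elif rhino.process(frame):                 # Rhino finalized an inference
            inference = rhino.get_inference()
            # Route inference.intent to Cheetah, picoLLM, or Orca (see the sections below).
            waiting_for_wake_word = True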
Why Porcupine Wake Word?

Always-listening wake word with very low CPU and battery cost.

3.8%
Single-Core CPU Utilization on Raspberry Pi 3
97.1%
Accuracy at 1 false alarm per 10 hours
~250K
Custom wake words trained and deployed in 2025

Porcupine Wake Word is lightweight, accurate, customizable, and production-ready. It enables an always-listening wake word with very low CPU and battery cost, which is critical for a voice memo app that lives in the background and must be activated without draining a mobile device's battery. With Porcupine, enterprises can train branded wake words and always-listening commands in seconds and deploy them across platforms, with industry-leading accuracy that fits on MCUs.
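A minimal sketch of the always-listening loop, assuming the pvporcupine and pvrecorder Python packages; the keyword file name and sensitivity value are illustrative:

    import pvporcupine
    from pvrecorder import PvRecorder

    porcupine = pvporcupine.create(
        access_key="${YOUR_ACCESS_KEY}",
        keyword_paths=["memo_assistant.ppn"],  # custom wake word trained on Picovoice Console
        sensitivities=[0.6],                   # trade missed detections against false alarms
    )

    recorder = PvRecorder(frame_length=porcupine.frame_length)
    recorder.start()
    try:
        while True:
            if porcupine.process(recorder.read()) >= 0:
                print("Wake word detected; start listening for a memo command.")
    finally:
        recorder.delete()
        porcupine.delete()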

Wake Word Detection Accuracy (higher is better)
Porcupine: 97.1%
Snowboy: 68%
PocketSphinx: 52%
CPU Utilization (lower is better)
Porcupine: 3.8%
Snowboy: 24.8%
PocketSphinx: 31.8%
Why Rhino Speech-to-Intent?

End-to-end intent. No transcript. No hallucinations.

6x
Higher accuracy than Big Tech average
97.3%
Accuracy tested across 6 to 24 dB Signal-to-Noise Ratio
Unlimited voice interactions per user

Rhino Speech-to-Intent maps spoken commands, such as "Start memo" and "Read memo," directly to actions with no intermediate transcript. Most voice command systems run a two-step pipeline: speech-to-text converts audio to a transcript, then a separate NLU model parses that transcript for intent. Every step accumulates error and compounds latency. Rhino achieves 6× higher accuracy than STT-plus-cloud-NLU stacks offered by Big Tech, even in noisy environments.
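A minimal sketch of routing a finalized Rhino inference to memo actions, assuming a context built from the recipe's YAML with the startMemo, readMemo, summarizeMemo, and rewriteMemo intents; the handler functions are hypothetical placeholders:

    import pvrhino

    rhino = pvrhino.create(access_key="${YOUR_ACCESS_KEY}", context_path="voice_memo.rhn")

    def on_audio_frame(frame):
        # Feed 512-sample frames until Rhino finalizes an inference.
        if not rhino.process(frame):
            return
        inference = rhino.get_inference()
        if not inference.is_understood:
            return
        # No transcript is produced; the spoken command maps straight to an intent.
        if inference.intent == "startMemo":
            start_dictation()            # placeholder: hand the microphone stream to Cheetah
        elif inference.intent == "summarizeMemo":
            summarize_with_picollm()     # placeholder
        elif inference.intent == "rewriteMemo":
            rewrite_with_picollm()       # placeholder
        elif inference.intent == "readMemo":
            read_back_with_orca()        # placeholder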

Voice Command Acceptance Accuracy (higher is better)
Rhino: 97.3%
Amazon Lex: 84.3%
Google Dialogflow: 77.3%
Voice Command Acceptance Accuracy at 21 dB SNR (higher is better)
Rhino: 99%
Amazon Lex: 87%
Google Dialogflow: 83%
Why Cheetah Streaming Speech-to-Text?

Real-time dictation transcription with custom vocabulary.

10.1%
WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08
CPU Core-Hour vs. 3.36 Moonshine Medium, 40x less
8.6%
WER (Spanish) vs. 11.6% Google and 9.4% Azure

Cheetah transcribes the memo dictation in real time as the user speaks. Cheetah matches or beats cloud STT API accuracy and supports custom vocabulary for industry jargon and proper nouns, which is important for memos that include client names, drug names, case numbers, or part numbers. Cheetah emits words at 590 ms average latency, typically one word behind the speaker, and requires less compute than any other local engine tested. The result: no tradeoff on accuracy, latency, or privacy.
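A minimal sketch of real-time dictation capture with Cheetah, assuming the pvcheetah and pvrecorder Python packages; the endpoint duration and punctuation flag are illustrative choices:

    import pvcheetah
    from pvrecorder import PvRecorder

    cheetah = pvcheetah.create(
        access_key="${YOUR_ACCESS_KEY}",
        endpoint_duration_sec=1.0,           # end the memo after ~1 s of trailing silence
        enable_automatic_punctuation=True,
    )

    recorder = PvRecorder(frame_length=cheetah.frame_length)
    recorder.start()

    transcript_parts = []
    while True:
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript_parts.append(partial)     # words arrive roughly one word behind the speaker
        if is_endpoint:
            transcript_parts.append(cheetah.flush())
            break

    recorder.stop()
    memo_text = "".join(transcript_parts)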

English Word Error Rate (lower is better)
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate (lower is better)
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why picoLLM?

Local LLM summarization and rewriting with no cloud dependency.

99.9%
Accuracy retained at 3-bit vs. 83.1% with GPTQ for Llama-3-8b
94.5%
Accuracy retained at 2-bit vs. 38.7% with GPTQ for Llama-3-8b
Any
Any transformer architecture on any platform

picoLLM uses picoCompression to quantize language models so they run on phones, browsers, and embedded boards with no cloud dependency while preserving task accuracy, which makes it well-suited for memo summarization and rewriting. picoLLM's minimal memory footprint lets the model run alongside Cheetah, Rhino, and Orca in a single session without introducing network latency or privacy exposure.
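A minimal sketch of local summarization with picoLLM, using the llama-3.2-1b-instruct-385.pllm model referenced in the recipe steps below; the prompt wording, token limit, and placeholder transcript are illustrative:

    import picollm

    pllm = picollm.create(
        access_key="${YOUR_ACCESS_KEY}",
        model_path="llama-3.2-1b-instruct-385.pllm",  # downloaded from Picovoice Console
    )

    memo_text = "<transcript produced by Cheetah>"
    prompt = (
        "Summarize the following voice memo as short bullet points, "
        "including any next steps:\n\n" + memo_text
    )

    # Generation runs entirely on-device; no cloud LLM API is called.
    result = pllm.generate(prompt, completion_token_limit=256)
    summary = result.completion

    pllm.release()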

3-bit Quantized Llama-3-8b MMLU (higher is better)
Float16 (Original Model): 64.9
picoLLM: 64.8
GPTQ: 53.9
2-bit Quantized Llama-3-8b MMLU (higher is better)
Float16 (Original Model): 64.9
picoLLM: 61.3
GPTQ: 25.1
Why Orca Streaming Text-to-Speech?

Natural-sounding TTS at 29 MB peak memory.

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model Size

Orca reads back the memo, the summary, or the rewritten version in a natural-sounding voice. Orca Streaming Text-to-Speech is built for real-time LLM applications: it starts speaking LLM responses as soon as the model produces a meaningful word or phrase, achieving a first-token-to-speech latency of 130 ms. Orca's 29 MB peak memory allows it to run alongside Cheetah, picoLLM, Porcupine, and Rhino with no performance issues, whereas most high-quality TTS engines require hundreds of megabytes of RAM, which limits where they can deploy.
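A minimal sketch of streaming generated text into Orca and out through the speaker, assuming the pvorca and pvspeaker Python packages; chunking by word and the speaker buffer handling are simplified, and in practice the text would come from picoLLM's streaming callback:

    import pvorca
    from pvspeaker import PvSpeaker

    orca = pvorca.create(access_key="${YOUR_ACCESS_KEY}")
    speaker = PvSpeaker(sample_rate=orca.sample_rate, bits_per_sample=16)
    speaker.start()

    summary = "<summary produced by picoLLM>"
    stream = orca.stream_open()
    for word in summary.split(" "):
        pcm = stream.synthesize(word + " ")  # returns audio as soon as enough text is buffered
        if pcm is not None:
            speaker.write(pcm)

    pcm = stream.flush()                     # synthesize whatever text is still buffered
    if pcm is not None:
        speaker.write(pcm)

    speaker.flush()
    speaker.stop()
    stream.close()
    orca.delete()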

TTS Latency (lower is better)
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Audio Quality: listen and compare audio samples, grouped by peak memory usage (under 30 MB: ESpeak and Orca).
On-device voice memo assistant use cases

From sales reps to clinicians

Sales & CRM capture

On-device voice memo for sales reps and CRM capture

Sales reps walk out of a meeting and record their notes on their phones or tablets on the way to the next meeting or back home. picoLLM structures the transcription into a summary with next steps, attendees, and follow-ups. The captured note can be pushed into the CRM. Customer voice recordings and prospect details never leave the device.

Healthcare

HIPAA-compliant dictation for doctors and clinicians

Doctors, nurses, and therapists can dictate patient encounter notes between visits, reducing time spent on documentation. picoLLM summarizes them into a SOAP-friendly structure. No audio or transcript is sent to a cloud, which removes BAA requirements under HIPAA. Porcupine, Cheetah, and Rhino all support custom medical vocabulary.

Legal, journalism, FSI

On-device dictation and notes in handling sensitive data

Lawyers, journalists, and financial advisors capture interviews containing sensitive data that shouldn't be shared with third parties. Yet Big Tech providers and newer companies like Otter.ai alike have made the news and ended up in courtrooms over questionable privacy practices. On-device AI is the only way to keep voice memos private.

Field service

Offline voice notes for field service

Service technicians inspect assets and capture issues on manufacturing lines and in the field, where connectivity is unreliable and the content carries commercial-sensitivity or confidentiality requirements. On-device processing ensures notes are captured securely and accurately without depending on a network connection.

Get started

On-device voice memo assistant: Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · voice-memo-assistant
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey from Picovoice Console and a clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/voice-memo-assistant/python.
1

Create a virtual environment

Isolate the recipe's dependencies from your system Python.
2

Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
The activation command differs for Linux, macOS, or Raspberry Pi versus Windows.
3

Install dependencies

Install the Porcupine, Rhino, Cheetah, picoLLM, Orca, PvRecorder, and PvSpeaker Python SDKs.
4

Download the on-device LLM

Download llama-3.2-1b-instruct-385.pllm from the Picovoice Console. Summarization and rewriting run locally on this model with no cloud LLM call.
5

Train the Wake Word

In Picovoice Console, go to Porcupine Wake Word, enter your wake phrase, train, and download the .ppn file for your target platform.
6

Train the Speech-to-Intent model

In Picovoice Console, go to Rhino Speech-to-Intent, create an empty context, and import the Rhino context YAML for this recipe. Intents include startMemo, readMemo, summarizeMemo, and rewriteMemo. Download the generated .rhn file.
7

Run the voice memo assistant demo

Pass your AccessKey and run the demo. The demo opens the microphone and runs the on-device voice memo assistant pipeline locally.
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook Voice Memo Assistant Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions


What is a voice memo assistant?
A voice memo assistant is a hands-free dictation app that lets users start, summarize, rewrite, and replay voice memos using only their voice. The wake word starts the flow, the user speaks the memo, and speech-to-text transcribes it. Depending on the user's preference, a local LLM can further summarize or rewrite the captured text, and Orca can read it back. Picovoice's on-device pipeline runs entirely on the device; no audio, transcript, or summary is sent to the cloud.
Is the LLM in voice memo assistant running locally?
Yes. picoLLM Inference runs compressed open-weight LLMs such as Llama 3.2 directly on-device. The summary and rewrite are produced locally — no cloud LLM API call, no third party seeing your memo content.
Can the voice memo assistant work without an internet connection?
Yes. Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, picoLLM Inference, and Orca Streaming Text-to-Speech all run locally. The full record, transcribe, summarize, rewrite, and replay flow processes data locally without sending any data to the cloud.
Can I customize the wake word and the intents?
Yes. Custom wake words compatible with Porcupine Wake Word can be trained through Picovoice Console or the Porcupine Wake Word API powered by picoGYM. You can customize intents in Rhino Speech-to-Intent context YAML files to add voice commands such as "start memo" or "read memo" on Picovoice Console or the Rhino API powered by picoGYM.
Who uses this kind of on-device voice memo assistant?
Deskless workers who are always on the go, mobile professionals, field workers (utility, oil & gas, construction), journalists capturing source material, lawyers dictating case notes, doctors and clinicians dictating between patients, real-estate agents capturing showings, sales reps capturing post-call notes, researchers in the field, and consumers who simply prefer their voice notes to stay private.
How does the on-device voice memo assistant pipeline use a small on-device LLM responsibly?
The on-device voice memo assistant recipe uses a compact LLM such as Llama 3.2 compressed with picoCompression, which preserves task accuracy more effectively than GPTQ at the same bit-width. Summarization and rewriting are scoped tasks where small LLMs perform well, and the user can review the output before saving. The recording and the transcript remain the source of truth; the LLM operates on top of them, not in place of them.
Can I build the on-device voice memo assistant on Android or iOS?
Yes. The on-device voice memo assistant recipe has Android and iOS implementations for developers to get started easily. For more implementations, visit the GitHub pico-cookbook Voice Memo Assistant Recipe.
Does the on-device voice memo assistant send audio to a third-party cloud?
No. Audio and text are processed on the device, and Picovoice has no data controller relationship with end users. This removes the data-processing-agreement and breach-surface concerns that arise when memos containing client, patient, source, or commercially sensitive material are sent to the cloud.
How is this different from Otter, Notta, or Wispr Flow?
Otter, Notta, and Wispr Flow transcribe audio on cloud servers, use cloud LLMs for formatting and rewriting, and are not completely hands-free. Picovoice's voice memo assistant runs the full pipeline — wake word, intent, STT, LLM summarization and rewriting, and TTS — entirely on-device.
How can I get technical support for the voice memo assistant demo?
Visit the GitHub pico-cookbook Voice Memo Assistant Recipe where you can find the open-source demo code and create an issue for demo-related technical questions or reach out to your Picovoice contact.