Voice-powered document QA

Build an on-device voice-powered document QA system with a RAG pipeline

Ask questions about any document by voice and get spoken answers grounded in the document's content. Local embeddings, local LLM inference, speech-to-text, and text-to-speech, running entirely on-device with no cloud dependency.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How the on-device RAG pipeline works

On-device document embeddings, retrieval, LLM inference, speech-to-text, and text-to-speech powering voice-based document QA

An on-device AI pipeline loads a document, chunks it, and generates local embeddings using picoLLM. When the user asks a question by voice, Cheetah transcribes the question in real time. picoLLM generates an embedding for the question, retrieves the most relevant document chunks by cosine similarity, and generates an answer grounded in those chunks. Orca speaks the answer back. The entire pipeline runs on-device: no document content, queries, or audio data are transmitted externally.

1 · Document ingestion: a plain-text document (contracts, manuals, SOPs) is split into overlapping chunks (1,200 characters with 250-character overlap); picoLLM embeds each chunk; the vectors are saved to a local store on disk (JSON, saved and loaded across runs).

2 · Voice QA loop: the user asks a question by voice ("What are the termination clauses?"); Cheetah transcribes it with streaming STT; picoLLM retrieves the relevant chunks and generates an answer from the document; Orca speaks the answer with streaming TTS, and the pipeline listens for the next question.
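
To make the ingestion stage concrete, here is a minimal Python sketch of splitting with overlap and saving the resulting store as JSON, using the sizes from the diagram above; the function names are illustrative, not taken from the recipe.

import json

def chunk_text(text, chunk_size=1200, overlap=250):
    # Overlapping windows keep sentences that straddle a chunk
    # boundary retrievable from at least one chunk.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

def save_store(path, chunks, vectors):
    # Persist chunks and their embeddings (as plain lists) so
    # ingestion runs only once per document.
    with open(path, 'w') as f:
        json.dump({'chunks': chunks, 'vectors': vectors}, f)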
Why Cheetah Streaming Speech-to-Text?

Real-time transcription with endpoint detection for hands-free document queries

10.1% WER (English), vs. 11.9% for Google and 10.6% for Moonshine Medium
0.08 CPU core-hours, vs. 3.36 for Moonshine Medium (40x less)
8.6% WER (Spanish), vs. 11.6% for Google and 9.4% for Azure

Cheetah transcribes the user's question in real time as they speak, streaming partial results with automatic punctuation and endpoint detection. Cheetah detects when the user has finished speaking and immediately hands the complete question to the retrieval step. Running speech-to-text on-device means the user's spoken queries are never transmitted to a cloud transcription service.
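
A minimal sketch of the listening loop, assuming the pvcheetah and pvrecorder Python packages; parameter values here are illustrative defaults, not necessarily those of the recipe.

import pvcheetah
from pvrecorder import PvRecorder

cheetah = pvcheetah.create(
    access_key='${ACCESS_KEY}',
    enable_automatic_punctuation=True)
recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

question = ''
while True:
    # process() streams partial text and flags the end of speech.
    partial, is_endpoint = cheetah.process(recorder.read())
    question += partial
    if is_endpoint:
        question += cheetah.flush()  # remaining buffered words
        break
recorder.stop()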

English Word Error Rate (lower is better)
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate (lower is better)
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36.0%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why picoLLM On-Device LLM?

Document embeddings and grounded answer generation in a single on-device model

99.9% accuracy retained at 3-bit, vs. 83.1% for GPTQ (Llama-3-8B)
94.5% accuracy retained at 2-bit, vs. 38.7% for GPTQ (Llama-3-8B)
Any transformer architecture on any platform

picoLLM serves two roles in this pipeline. First, it generates document embeddings for chunked text, converting each chunk into a vector for similarity search. Second, it generates answers grounded in retrieved chunks, using a system prompt that constrains responses to the provided document excerpts. picoCompression quantizes models while preserving task accuracy, making it possible to run capable models on laptops, desktops, and Raspberry Pi. picoLLM Inference runs any language model (Llama, Gemma, Phi, Mistral) locally on CPU or GPU. No document content or queries are sent to any external service.
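
A sketch of the retrieve-and-generate step. The generate() call follows the public picollm Python package; the embed argument is a placeholder for however the recipe turns text into a vector with the EmbeddingGemma-300M model, since that call is not shown here.

import numpy as np
import picollm

pllm = picollm.create(
    access_key='${ACCESS_KEY}',
    model_path='llama-3.2-1b-instruct.pllm')

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question, chunks, vectors, embed, top_k=3):
    # Rank chunks by cosine similarity to the question embedding,
    # then generate an answer constrained to the top matches.
    q = embed(question)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(q, vectors[i]), reverse=True)
    context = '\n\n'.join(chunks[i] for i in ranked[:top_k])
    prompt = (
        'Answer only from the document excerpts below. '
        'If the answer is not in them, say you do not know.\n\n'
        + context + '\n\nQuestion: ' + question + '\nAnswer:')
    return pllm.generate(prompt).completion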

3-bit Quantized Llama-3-8B MMLU (higher is better)
Float16 (original model): 64.9
picoLLM: 64.8
GPTQ: 53.9

2-bit Quantized Llama-3-8B MMLU (higher is better)
Float16 (original model): 64.9
picoLLM: 61.3
GPTQ: 25.1
Why Orca Streaming Text-to-Speech?

Speaks the answer as tokens stream in from the LLM

29 MB peak memory usage
130 ms first-token-to-speech latency
7 MB model size

Orca speaks the LLM's answer as tokens stream in, so the user hears the response before generation is complete. Unlike traditional TTS engines that wait for complete text, Orca processes raw LLM tokens as they arrive, achieving 130 ms first-token-to-speech. Orca uses 29 MB peak memory and a 7 MB model size, making it deployable alongside Cheetah and picoLLM on the same device. No audio is transmitted externally.
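
A sketch of speaking the answer as it is generated, assuming the pvorca package's streaming API and picoLLM's stream_callback parameter; pllm and prompt are the handle and prompt from the previous sketch, and playing the collected PCM is left to an audio player of your choice.

import pvorca

orca = pvorca.create(access_key='${ACCESS_KEY}')
stream = orca.stream_open()
pcm_chunks = []

def on_token(token):
    # Convert each LLM token to audio as soon as it arrives.
    pcm = stream.synthesize(token)
    if pcm is not None:
        pcm_chunks.append(pcm)

pllm.generate(prompt, stream_callback=on_token)
tail = stream.flush()  # synthesize any text still buffered
if tail is not None:
    pcm_chunks.append(tail)
stream.close()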

TTS Latency (lower is better)
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms

Audio Quality
Listen and compare, grouped by peak memory usage. Peak memory usage < 30 MB: [audio samples for ESpeak and Orca]
Voice-powered document QA use cases

From legal documents to equipment manuals: voice-powered document QA for enterprise

Legal & Compliance

Voice queries over contracts, policies, and regulations

Legal teams, compliance officers, and auditors can query policy documents, contracts, and regulations by voice. "What are the termination clauses in this agreement?" The document and all queries stay on-device, so privileged or regulated content is never transmitted externally.

Field Service

Hands-free access to equipment manuals and procedures

Technicians can ask questions about equipment manuals, wiring diagrams, and repair procedures by voice while their hands are occupied. "What's the reset sequence for fault code E47?" The RAG pipeline runs offline on the technician's device, so it works in facilities with no connectivity.

Healthcare

Voice queries over drug references and clinical protocols

Clinicians can query drug reference guides, formularies, or protocol documents by voice during patient encounters. "What is the recommended dosage of amoxicillin for pediatric patients?" All document content and queries stay on-device, keeping clinical information off third-party servers.

Onboarding & Training

Voice-powered access to handbooks and SOPs

New employees can ask questions about handbooks, SOPs, and training materials by voice instead of searching through documents. "What's the return policy for opened items?" The voice interface makes internal knowledge accessible without navigation skills or document familiarity.

Get started

On-device voice-powered document QA with a RAG pipeline: Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · on-device-rag-voice-document-qa
Difficulty: Intermediate
Runtime: 100% on-device
Language: Python
Platforms supported: Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey from the Picovoice Console and a local clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/document-qa/python.
1. Create a virtual environment

Isolate the recipe's dependencies from your system Python.
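
A common way to do this, assuming python3 is on your PATH:

python3 -m venv .venv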
2. Activate the virtual environment

Activation makes pip install into .venv instead of the system Python.
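
Standard activation commands:

Linux, macOS, or Raspberry Pi:
source .venv/bin/activate

Windows:
.venv\Scripts\activate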
3. Install dependencies

Pulls in the Cheetah, picoLLM, and Orca Python SDKs along with audio I/O.
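
Assuming the recipe directory ships a requirements.txt (check the repo if the file is named differently):

pip install -r requirements.txt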
4. Download LLM models

Open the Picovoice Console, go to picoLLM, and download two .pllm model files: one embedding model, EmbeddingGemma-300M, and one chat model, Llama-3.2-1B-Instruct.
5. Prepare your document

Place a plain text file in the recipe directory. The default document is the CPAL-1.0 open-source license. You can replace it with any text document you want to query.
6. Run the document QA pipeline

Pass your AccessKey and the paths to both models. The recipe chunks the document, generates embeddings, and then starts listening for voice questions.
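
The invocation looks roughly like this; the script name and flag names are illustrative, so check the repo README for the exact command:

python3 main.py \
    --access-key ${ACCESS_KEY} \
    --embedding-model-path embeddinggemma-300m.pllm \
    --llm-model-path llama-3.2-1b-instruct.pllm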
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook Document QA Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions


What is retrieval-augmented generation (RAG)?
Retrieval-augmented generation (RAG) is a technique that grounds an LLM's responses in the content of a specific document or knowledge base. Instead of relying on the model's training data, the pipeline retrieves relevant passages from the source document and includes them as context in the prompt. This reduces hallucination and produces answers traceable to the source material.
What is an on-device RAG pipeline?
A retrieval-augmented generation (RAG) pipeline chunks a document, generates embeddings, and retrieves relevant chunks to ground an LLM's answer in the document's actual content. An on-device RAG pipeline runs every step locally: the embeddings, the retrieval, the LLM inference, and, in this recipe, the speech-to-text and text-to-speech. No document content, queries, or audio leaves the device.
What documents can I use?
The recipe accepts plain text files. You can use contracts, equipment manuals, policies, handbooks, clinical protocols, technical specifications, or any text-based document. For PDFs, convert to plain text before ingestion.
Can the LLM hallucinate answers?
The demo's system prompt constrains the LLM to answer only from the provided document excerpts. If the answer is not in the retrieved chunks, the LLM is instructed to say it does not know. This grounding reduces hallucination compared to unconstrained generation.
Does the voice-powered document QA pipeline work without an internet connection?
Yes. All three SDKs (Cheetah Streaming Speech-to-Text, picoLLM On-device LLM, and Orca Streaming Text-to-Speech) run entirely on-device. No document content, queries, or audio is transmitted externally. An internet connection is only needed once, to validate your AccessKey with the Picovoice license servers.
How is this different from ChatGPT or cloud-based document QA?
Cloud-based document QA services upload your document to a third-party server for processing. This pipeline keeps the document, the embeddings, the queries, and the answers entirely on your device. No data is transmitted to Picovoice, OpenAI, Google, or any external service. You control the infrastructure and the data.
Does the voice-powered document QA pipeline store or transmit audio?
No. All audio is processed on the device and discarded. The document, embeddings, and generated answers stay local. Nothing is transmitted to Picovoice or any third-party cloud. Picovoice has no data controller relationship with your end users.
How can I get technical support?
Visit the GitHub pico-cookbook Document QA Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.