Voice-powered document QA

Build an on-device voice-powered document QA system with a RAG pipeline

Ask questions about any document by voice and get spoken answers grounded in the document's content. Local embeddings, local LLM inference, speech-to-text, and text-to-speech, running entirely on-device with no cloud dependency.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How the on-device RAG pipeline works

On-device document embeddings, retrieval, LLM inference, speech-to-text, and text-to-speech powering voice-based document QA

An on-device AI pipeline loads a document, chunks it, and generates local embeddings using picoLLM. When the user asks a question by voice, Cheetah transcribes the question in real time. picoLLM generates an embedding for the question, retrieves the most relevant document chunks by cosine similarity, and generates an answer grounded in those chunks. Orca speaks the answer back. The entire pipeline runs on-device: no document content, queries, or audio data are transmitted externally.

1 · Document ingestion: a plain-text document (contracts, manuals, SOPs) is split into overlapping chunks (1,200 characters with 250-character overlap); picoLLM embeds each chunk; the vectors are saved to a local store on disk (JSON, saved and loaded across runs).

2 · Voice QA loop: the user asks a question by voice ("What are the termination clauses?"); Cheetah transcribes it with streaming STT; picoLLM retrieves the relevant chunks and generates an answer from the document; Orca speaks the answer with streaming TTS, and the pipeline listens for the next question.
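
To make the ingestion stage concrete, here is a minimal Python sketch of splitting with overlap and saving the resulting store as JSON, using the sizes from the diagram above; the function names are illustrative, not taken from the recipe.

import json

def chunk_text(text, chunk_size=1200, overlap=250):
    # Overlapping windows keep sentences that straddle a chunk
    # boundary retrievable from at least one chunk.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

def save_store(path, chunks, vectors):
    # Persist chunks and their embeddings (as plain lists) so
    # ingestion runs only once per document.
    with open(path, 'w') as f:
        json.dump({'chunks': chunks, 'vectors': vectors}, f)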
Why Cheetah Streaming Speech-to-Text?

Real-time transcription with endpoint detection for hands-free document queries

10.1% WER (English), vs. 11.9% for Google and 10.6% for Moonshine Medium
0.08 CPU core-hours, vs. 3.36 for Moonshine Medium (40x less)
8.6% WER (Spanish), vs. 11.6% for Google and 9.4% for Azure

Cheetah transcribes the user's question in real time as they speak, streaming partial results with automatic punctuation and endpoint detection. Cheetah detects when the user has finished speaking and immediately hands the complete question to the retrieval step. Running speech-to-text on-device means the user's spoken queries are never transmitted to a cloud transcription service.
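
A minimal sketch of the listening loop, assuming the pvcheetah and pvrecorder Python packages; parameter values here are illustrative defaults, not necessarily those of the recipe.

import pvcheetah
from pvrecorder import PvRecorder

cheetah = pvcheetah.create(
    access_key='${ACCESS_KEY}',
    enable_automatic_punctuation=True)
recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

question = ''
while True:
    # process() streams partial text and flags the end of speech.
    partial, is_endpoint = cheetah.process(recorder.read())
    question += partial
    if is_endpoint:
        question += cheetah.flush()  # remaining buffered words
        break
recorder.stop()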

English Word Error Rate (lower is better)
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%
English Punctuation Error Rate (lower is better)
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36.0%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
Why picoLLM On-Device LLM?

Document embeddings and grounded answer generation in a single on-device model

99.9% accuracy retained at 3-bit, vs. 83.1% for GPTQ (Llama-3-8B)
94.5% accuracy retained at 2-bit, vs. 38.7% for GPTQ (Llama-3-8B)
Any transformer architecture on any platform

picoLLM serves two roles in this pipeline. First, it generates document embeddings for chunked text, converting each chunk into a vector for similarity search. Second, it generates answers grounded in retrieved chunks, using a system prompt that constrains responses to the provided document excerpts. picoCompression quantizes models while preserving task accuracy, making it possible to run capable models on laptops, desktops, and Raspberry Pi. picoLLM Inference runs any language model (Llama, Gemma, Phi, Mistral) locally on CPU or GPU. No document content or queries are sent to any external service.
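
A sketch of the retrieve-and-generate step. The generate() call follows the public picollm Python package; the embed argument is a placeholder for however the recipe turns text into a vector with the EmbeddingGemma-300M model, since that call is not shown here.

import numpy as np
import picollm

pllm = picollm.create(
    access_key='${ACCESS_KEY}',
    model_path='llama-3.2-1b-instruct.pllm')

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question, chunks, vectors, embed, top_k=3):
    # Rank chunks by cosine similarity to the question embedding,
    # then generate an answer constrained to the top matches.
    q = embed(question)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(q, vectors[i]), reverse=True)
    context = '\n\n'.join(chunks[i] for i in ranked[:top_k])
    prompt = (
        'Answer only from the document excerpts below. '
        'If the answer is not in them, say you do not know.\n\n'
        + context + '\n\nQuestion: ' + question + '\nAnswer:')
    return pllm.generate(prompt).completion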

3-bit Quantized Llama-3-8B MMLU (higher is better)
Float16 (original model): 64.9
picoLLM: 64.8
GPTQ: 53.9

2-bit Quantized Llama-3-8B MMLU (higher is better)
Float16 (original model): 64.9
picoLLM: 61.3
GPTQ: 25.1
Why Orca Streaming Text-to-Speech?

Speaks the answer as tokens stream in from the LLM

29 MB peak memory usage
130 ms first-token-to-speech latency
7 MB model size

Orca speaks the LLM's answer as tokens stream in, so the user hears the response before generation is complete. Unlike traditional TTS engines that wait for complete text, Orca processes raw LLM tokens as they arrive, achieving 130 ms first-token-to-speech. Orca uses 29 MB peak memory and a 7 MB model size, making it deployable alongside Cheetah and picoLLM on the same device. No audio is transmitted externally.
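
A sketch of speaking the answer as it is generated, assuming the pvorca package's streaming API and picoLLM's stream_callback parameter; pllm and prompt are the handle and prompt from the previous sketch, and playing the collected PCM is left to an audio player of your choice.

import pvorca

orca = pvorca.create(access_key='${ACCESS_KEY}')
stream = orca.stream_open()
pcm_chunks = []

def on_token(token):
    # Convert each LLM token to audio as soon as it arrives.
    pcm = stream.synthesize(token)
    if pcm is not None:
        pcm_chunks.append(pcm)

pllm.generate(prompt, stream_callback=on_token)
tail = stream.flush()  # synthesize any text still buffered
if tail is not None:
    pcm_chunks.append(tail)
stream.close()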

TTS Latency (lower is better)
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms

Audio Quality
Listen and compare, grouped by peak memory usage. Peak memory usage < 30 MB: [audio samples for ESpeak and Orca]
Voice-powered document QA use cases

From legal documents to equipment manuals: voice-powered document QA for enterprise

Legal & Compliance

Voice queries over contracts, policies, and regulations

Legal teams, compliance officers, and auditors can query policy documents, contracts, and regulations by voice. "What are the termination clauses in this agreement?" The document and all queries stay on-device, so privileged or regulated content is never transmitted externally.

Field Service

Hands-free access to equipment manuals and procedures

Technicians can ask questions about equipment manuals, wiring diagrams, and repair procedures by voice while their hands are occupied. "What's the reset sequence for fault code E47?" The RAG pipeline runs offline on the technician's device, so it works in facilities with no connectivity.

Healthcare

Voice queries over drug references and clinical protocols

Clinicians can query drug reference guides, formularies, or protocol documents by voice during patient encounters. "What is the recommended dosage of amoxicillin for pediatric patients?" All document content and queries stay on-device, keeping clinical information off third-party servers.

Onboarding & Training

Voice-powered access to handbooks and SOPs

New employees can ask questions about handbooks, SOPs, and training materials by voice instead of searching through documents. "What's the return policy for opened items?" The voice interface makes internal knowledge accessible without navigation skills or document familiarity.

Get started

On-device voice-powered document QA with a RAG pipeline: Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · on-device-rag-voice-document-qa
Difficulty: Intermediate
Runtime: 100% on-device
Language: Python
Platforms supported: Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey from the Picovoice Console and a local clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/document-qa/python.
1. Create a virtual environment

Isolate the recipe's dependencies from your system Python.
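
A common way to do this, assuming python3 is on your PATH:

python3 -m venv .venv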
2. Activate the virtual environment

Activation makes pip install into .venv instead of the system Python.
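
Standard activation commands:

Linux, macOS, or Raspberry Pi:
source .venv/bin/activate

Windows:
.venv\Scripts\activate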
3. Install dependencies

Pulls in the Cheetah, picoLLM, and Orca Python SDKs along with audio I/O.
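
Assuming the recipe directory ships a requirements.txt (check the repo if the file is named differently):

pip install -r requirements.txt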
4. Download LLM models

Open the Picovoice Console, go to picoLLM, and download two .pllm model files: one embedding model, EmbeddingGemma-300M, and one chat model, Llama-3.2-1B-Instruct.
5. Prepare your document

Place a plain text file in the recipe directory. The default document is the CPAL-1.0 open-source license. You can replace it with any text document you want to query.
6. Run the document QA pipeline

Pass your AccessKey and the paths to both models. The recipe chunks the document, generates embeddings, and then starts listening for voice questions.
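
The invocation looks roughly like this; the script name and flag names are illustrative, so check the repo README for the exact command:

python3 main.py \
    --access-key ${ACCESS_KEY} \
    --embedding-model-path embeddinggemma-300m.pllm \
    --llm-model-path llama-3.2-1b-instruct.pllm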
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook Document QA Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions


What is retrieval-augmented generation (RAG)?
Retrieval-augmented generation (RAG) is a technique that grounds an LLM's responses in the content of a specific document or knowledge base. Instead of relying on the model's training data, the pipeline retrieves relevant passages from the source document and includes them as context in the prompt. This reduces hallucination and produces answers traceable to the source material.
What is an on-device RAG pipeline?
A retrieval-augmented generation (RAG) pipeline chunks a document, generates embeddings, and retrieves relevant chunks to ground an LLM's answer in the document's actual content. An on-device RAG pipeline runs every step locally: the embeddings, the retrieval, the LLM inference, and, in this recipe, the speech-to-text and text-to-speech. No document content, queries, or audio leaves the device.
What documents can I use?
The recipe accepts plain text files. You can use contracts, equipment manuals, policies, handbooks, clinical protocols, technical specifications, or any text-based document. For PDFs, convert to plain text before ingestion.
Can the LLM hallucinate answers?
The demo's system prompt constrains the LLM to answer only from the provided document excerpts. If the answer is not in the retrieved chunks, the LLM is instructed to say it does not know. This grounding reduces hallucination compared to unconstrained generation.
Does the voice-powered document QA pipeline work without an internet connection?
Yes. All three SDKs (Cheetah Streaming Speech-to-Text, picoLLM On-device LLM, and Orca Streaming Text-to-Speech) run entirely on-device. No document content, queries, or audio is transmitted externally. An internet connection is only needed once, to validate your AccessKey with the Picovoice license servers.
How is this different from ChatGPT or cloud-based document QA?
Cloud-based document QA services upload your document to a third-party server for processing. This pipeline keeps the document, the embeddings, the queries, and the answers entirely on your device. No data is transmitted to Picovoice, OpenAI, Google, or any external service. You control the infrastructure and the data.
Does the voice-powered document QA pipeline store or transmit audio?
No. All audio is processed on the device and discarded. The document, embeddings, and generated answers stay local. Nothing is transmitted to Picovoice or any third-party cloud. Picovoice has no data controller relationship with your end users.
How can I get technical support?
Visit the GitHub pico-cookbook Document QA Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.