Voice DVIR, Maintenance and Inspection

Build a hands-free DVIR and inspection app that runs on-device

Voice prompts guide each inspection step, and inspectors respond in natural speech. The app captures asset ID, fluid levels, tire condition, and service status as structured slots, plus free-form defect notes. Everything runs entirely on the device.

Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi
How on-device voice guided inspection works

Five on-device voice AI SDKs. One hands-free inspection loop.

The on-device voice guided inspection pipeline runs five Picovoice SDKs in a loop on the operator's device: Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, Orca Text-to-Speech, and Koala Noise Suppression. Porcupine listens for the wake word. Koala suppresses shop, ramp, and engine noise. Orca speaks each inspection prompt aloud. Rhino captures structured slots — asset ID, fluid levels, tire condition, service status — directly from speech. Cheetah transcribes free-form defect notes. Audio never leaves the device.
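The loop described above can be sketched in Python. The stub class below stands in for the real SDKs (`pvporcupine`, `pvrhino`, `pvcheetah`, `pvorca`, `pvkoala`), which consume 16 kHz PCM frames rather than the strings used here; this is a control-flow sketch under those assumptions, not the Picovoice API.

```python
# Control-flow sketch of the five-engine inspection loop.
# Stub engines stand in for the real Picovoice SDKs; real code would
# feed 16 kHz PCM frames to pvporcupine, pvrhino, pvcheetah, pvorca, pvkoala.

class StubEngines:
    """Pretends to be the five engines; returns canned results."""
    def denoise(self, frame):               # Koala
        return frame
    def heard_wake_word(self, frame):       # Porcupine
        return frame == "start inspection"
    def speak(self, text):                  # Orca
        print(f"PROMPT: {text}")
    def infer_slots(self, frame):           # Rhino
        return {"oil_level": "full"} if "oil" in frame else {"tire_condition": "worn"}
    def transcribe(self, frame):            # Cheetah
        return frame

def run_inspection(engines, audio_frames, prompts):
    record = {"defect_notes": []}
    awake = False
    step = 0
    for frame in audio_frames:
        frame = engines.denoise(frame)      # every frame is cleaned first
        if not awake:
            if engines.heard_wake_word(frame):
                awake = True
                engines.speak(prompts[step])
            continue
        if step < len(prompts):
            record.update(engines.infer_slots(frame))   # structured slot
            step += 1
            if step < len(prompts):
                engines.speak(prompts[step])
            else:
                engines.speak("Any defects to note?")
        else:
            record["defect_notes"].append(engines.transcribe(frame))  # free-form
    return record

record = run_inspection(
    StubEngines(),
    ["start inspection", "oil is full", "front tires worn", "left mirror cracked"],
    ["What's the oil level?", "What's the tire condition?"],
)
print(record)
```

The key property the sketch illustrates: noise suppression runs on every frame, the wake word gates everything else, and each operator utterance lands either in a typed slot or in the free-form notes.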

Pipeline diagram: Microphone (operator audio) → Koala noise suppression → Porcupine wake word ("Start inspection") → Orca text-to-speech prompt ("What's the oil level?") → operator speaks the response → Rhino speech-to-intent (structured slots) and Cheetah streaming STT (free-form notes) → structured inspection record (asset ID + slot values + defect notes, DOT DVIR ready) → next prompt in the loop.
Why Koala Noise Suppression?

2× more effective at shop and ramp noise. Same footprint.

17.3×
More effective than RNNoise at 0 dB SNR
5.4×
More effective than RNNoise at 5 dB SNR
4.3×
More effective than RNNoise on average

Shop bays, depots, and roadside walkarounds are loud. Koala Noise Suppression removes the noise of air tools, idling engines, traffic, and generators before the audio reaches Porcupine, Rhino, and Cheetah. Koala suppresses background noise twice as effectively as RNNoise at the same compute footprint, which leaves headroom for the rest of the pipeline on embedded devices, legacy phones, and rugged handhelds from Honeywell or Zebra.

STOI Distance to Clean Speech at 0 dB (lower is better)
Original: 0.232
RNNoise: 0.226
Koala: 0.128

STOI Distance to Clean Speech at 5 dB (lower is better)
Original: 0.156
RNNoise: 0.142
Koala: 0.080
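The headline multipliers follow from the STOI distances, assuming "more effective" means how much of the distance to clean speech each engine removes relative to the unprocessed audio (that reading is an inference from the published numbers, not a stated definition):

```python
# "More effective" read as: reduction in STOI distance to clean speech,
# relative to the unprocessed (Original) audio.
def effectiveness_ratio(original, engine_a, engine_b):
    """How many times more distance engine_a removes than engine_b."""
    return (original - engine_a) / (original - engine_b)

# Published STOI distances (lower is better).
r0 = effectiveness_ratio(0.232, engine_a=0.128, engine_b=0.226)  # 0 dB SNR
r5 = effectiveness_ratio(0.156, engine_a=0.080, engine_b=0.142)  # 5 dB SNR

print(round(r0, 1), round(r5, 1))  # matches the 17.3x and 5.4x claims
```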
Why Porcupine Wake Word?

Always-on inspection trigger at low CPU and battery cost.

3.8%
Single-Core CPU Utilization on Raspberry Pi 3
97.1%
Accuracy at 1 false alarm per 10 hours
~250K
Custom wake words trained and deployed in 2025

Porcupine Wake Word starts the inspection workflow when the operator says the chosen phrase. Drivers can train a branded wake word or always-listening commands, such as “Start inspection” or “Begin walkaround”, in the Picovoice Console and deploy them across mobile and embedded devices. Porcupine runs always-on at low CPU and battery cost, so the rest of the pipeline spins up only when needed.

Wake Word Detection Accuracy (higher is better)
Porcupine: 97.1%
Snowboy: 68%
PocketSphinx: 52%

CPU Utilization (lower is better)
Porcupine: 3.8%
Snowboy: 24.8%
PocketSphinx: 31.8%
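The gating pattern can be sketched as follows. The stub mimics only the shape of the real detector (in `pvporcupine`, `process()` returns the index of the detected keyword, or -1 for no match); the string frames are invented for illustration.

```python
# Wake-word gating: run the cheap detector on every frame and start the
# expensive pipeline only on a hit. The stub mimics the shape of the real
# pvporcupine API: process(frame) -> keyword index, or -1 for no match.

class StubWakeWordDetector:
    def __init__(self, keywords):
        self.keywords = keywords
    def process(self, frame):
        for i, kw in enumerate(self.keywords):
            if kw in frame:
                return i
        return -1

def gate(frames, detector):
    """Count frames the heavy pipeline would actually see."""
    started = False
    heavy_frames = 0
    for frame in frames:
        if not started:
            started = detector.process(frame) >= 0
            continue
        heavy_frames += 1  # downstream engines run only after the trigger
    return heavy_frames

detector = StubWakeWordDetector(["start inspection", "begin walkaround"])
frames = ["engine idling", "radio chatter", "start inspection", "oil is full", "tires good"]
print(gate(frames, detector))  # only the 2 post-trigger frames reach the pipeline
```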
Why Rhino Speech-to-Intent?

Structured DVIR slots directly from speech.

6×
Higher accuracy than Big Tech average
97.3%
Accuracy tested across 6 to 24 dB Signal-to-Noise Ratio
Unlimited voice interactions per user

Rhino Speech-to-Intent captures structured DVIR slots — unit ID, oil condition, tire condition, service status — directly from speech. Most voice command systems run a two-step pipeline: speech-to-text produces a transcript, then a separate NLU model parses that transcript for intent. Each step accumulates error and compounds latency. Rhino infers intent and typed slot values directly from audio, which holds higher accuracy even in noisy environments without hallucinations or compounding errors.

Voice Command Acceptance Accuracy (higher is better)
Rhino: 97.3%
Amazon Lex: 84.3%
Google Dialogflow: 77.3%

Voice Command Acceptance Accuracy at 21 dB SNR (higher is better)
Rhino: 99%
Amazon Lex: 87%
Google Dialogflow: 83%
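A Rhino inference arrives as an intent name plus typed slot values. A minimal sketch of folding a sequence of such inferences into one DVIR record, assuming hypothetical intent and slot names (the real names come from the context YAML you train):

```python
# Fold Rhino-style inferences (intent + typed slots) into one DVIR record.
# Intent and slot names here are illustrative; the real ones are defined
# in the Rhino context YAML trained in Picovoice Console.

def fold_inferences(inferences):
    record = {}
    for inf in inferences:
        if not inf["is_understood"]:
            continue  # a real app would re-prompt the operator here
        record.update(inf["slots"])
    return record

inferences = [
    {"is_understood": True, "intent": "setUnit", "slots": {"unit_id": "truck 12"}},
    {"is_understood": True, "intent": "reportOil", "slots": {"oil": "low"}},
    {"is_understood": False, "intent": None, "slots": {}},  # noise / unclear speech
    {"is_understood": True, "intent": "reportTires", "slots": {"tires": "good"}},
]
record = fold_inferences(inferences)
print(record)
```

Because the slots arrive already typed, there is no transcript to re-parse; a not-understood utterance simply yields no slots rather than a corrupted field.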
Why Cheetah Streaming Speech-to-Text?

Free-form defect notes transcribed in real time.

10.1%
WER (English) vs. 11.9% Google and 10.6% Moonshine Medium
0.08
CPU Core-Hours vs. 3.36 for Moonshine Medium, about 40× less
8.6%
WER (Spanish) vs. 11.6% Google and 9.4% Azure

Cheetah Streaming Speech-to-Text transcribes free-form defect notes in real time. Per the open-source real-time transcription benchmark, Cheetah beats Google Cloud STT on word error rate and word emission latency across all tested languages, and outperforms Azure STT on several. It emits words at 590 ms median latency, typically one word behind the speaker, and requires less compute than any other local engine tested. Cheetah accepts custom vocabulary for fleet jargon and OEM part numbers, which raises accuracy further on the actual words inspectors say.

English Word Error Rate (lower is better)
Amazon Streaming: 5.6%
Azure Real-time: 8.2%
Cheetah Streaming: 10.1%
Moonshine Streaming Medium: 10.6%
Vosk Streaming Large: 11.5%
Google Streaming: 11.9%
Whisper.cpp Streaming Base: 19.8%

English Punctuation Error Rate (lower is better)
Cheetah Streaming: 16.1%
Azure Real-time: 16.4%
Amazon Streaming: 24.4%
Google Streaming: 36%
Moonshine Streaming Medium: 45.1%
Whisper.cpp Streaming Base: 54.1%
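Streaming STT emits partial transcripts as audio arrives. The stub below mimics the shape of `pvcheetah`, where `process()` returns a partial transcript plus an endpoint flag and `flush()` returns the remainder; the scripted partials are invented for illustration.

```python
# Accumulate a free-form defect note from streaming partial transcripts.
# The stub mimics the shape of pvcheetah: process(frame) returns
# (partial_transcript, is_endpoint); flush() returns whatever remains.

class StubStreamingSTT:
    def __init__(self, scripted, remainder):
        self.scripted = scripted      # list of (partial, is_endpoint) pairs
        self.remainder = remainder
        self.i = 0
    def process(self, frame):
        out = self.scripted[self.i]
        self.i += 1
        return out
    def flush(self):
        return self.remainder

def take_note(stt, frames):
    note = ""
    for frame in frames:
        partial, is_endpoint = stt.process(frame)
        note += partial
        if is_endpoint:          # speaker paused: close out the note
            note += stt.flush()
            break
    return note.strip()

stt = StubStreamingSTT(
    [("left ", False), ("mirror ", False), ("cracked,", True)],
    " needs replacement",
)
note = take_note(stt, ["frame1", "frame2", "frame3"])
print(note)
```

In the real engine the endpoint flag fires after a configurable silence, which is what lets the app hand control back to the next Orca prompt without a button press.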
Why Orca Text-to-Speech?

Natural-sounding TTS at 29 MB peak memory.

29 MB
Peak Memory Usage
130 ms
First-token-to-speech latency
7 MB
Model Size

Orca Streaming Text-to-Speech reads each inspection prompt aloud — “What's the oil level?”, “What's the tire condition?” — so the operator never has to look at the screen. Most high-quality TTS engines require hundreds of megabytes of RAM. Orca uses 29 MB peak memory, 10 to 50 times less than any natural-sounding on-device alternative, which leaves enough headroom to run all five engines on a single rugged tablet without OOM crashes.

TTS Latency (lower is better)
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
eSpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms

Audio Quality
Listen and compare, grouped by peak memory usage.
Peak Memory Usage < 30 MB: eSpeak, Orca
On-device voice guided inspection use cases

From DVIR walkarounds to plant maintenance

Fleet & DOT DVIR

Hands-free DOT DVIR for fleet drivers and mechanics

A Driver Vehicle Inspection Report (DVIR) is a daily record that the U.S. Department of Transportation (DOT) requires for commercial motor vehicles. Voice guided inspection apps turn pre-trip and post-trip DVIR walkarounds into spoken workflows that capture assets, conditions, and status as structured slots, ready to push to Whip Around, Fleetio, Samsara, or Verizon Connect.

Construction & mining

On-device equipment inspection for construction, mining, and agriculture

Heavy equipment on construction sites, in mines, quarries, and large farms needs daily inspection in places with no signal. The on-device pipeline runs on a commodity Android phone or a rugged tablet from Honeywell or Zebra and captures structured condition data and free-form defect notes with no connectivity and no proprietary inspection hardware required.

Utilities, HVAC & manufacturing

Voice guided facility inspection for utilities, HVAC, and manufacturing

Plant inspectors, utilities crews, and HVAC technicians walk past boilers, pumps, motors, and panels that need a quick condition check. The on-device voice guided inspection pipeline prompts each asset, captures the reading and the inspector's free-form note, and works offline in basements, mechanical rooms, and remote substations where Wi-Fi doesn't reach.

Aviation & marine

Voice guided pre-departure checklists for aviation and marine

Aviation pre-flight checks and marine vessel pre-departure inspections are read-aloud, respond-aloud workflows by nature. The voice guided pipeline matches that workflow and runs offline, critical at smaller airfields, offshore platforms, and harbors where connectivity is unreliable. The same structured slot capture handles airframe, engine, and safety checks.

Get started

On-device voice DVIR and inspection Python code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · voice-dvir-and-inspection
Difficulty
Beginner
Runtime
100% on-device
Language
Python
Platforms supported
Android · iOS · Linux · macOS · Windows · Chrome · Edge · Firefox · Safari · Raspberry Pi

Prerequisites

A Picovoice AccessKey from the Picovoice Console and a local clone of the GitHub repo.

Usage

These instructions assume your current working directory is recipes/voice-guided-maintenance-and-inspection/python.
1. Create a virtual environment

Isolate the recipe's dependencies from your system Python.
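For example, on Linux, macOS, or Raspberry Pi (a sketch; assumes `python3` is on PATH):

```shell
python3 -m venv .venv
```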
2. Activate the virtual environment

Activation makes pip install into .venv instead of the system Python.
Linux, macOS, or Raspberry Pi
Windows
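A sketch of activation; the creation command from step 1 is repeated here so the snippet runs standalone:

```shell
python3 -m venv .venv

# Linux, macOS, or Raspberry Pi:
. .venv/bin/activate

# Windows (PowerShell) equivalent, for reference:
#   .venv\Scripts\Activate.ps1

python -c "import sys; print(sys.prefix)"  # now points inside .venv
```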
3. Install dependencies

Pulls in the Porcupine, Rhino, Cheetah, Orca, and Koala Python SDKs along with PvRecorder and PvSpeaker.
4. Train a wake word

Go to the Picovoice Console, train any phrase, such as “Hey Siri”, “Hey Assistant”, or your brand name, and download the .ppn file for your target platform.
5. Train the Speech-to-Intent model

Open the Picovoice Console, go to Rhino Speech-to-Intent, create an empty context, and import the Rhino context YAML for this recipe. Download the generated .rhn file for your target platform.
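A context YAML for a DVIR-style checklist might look like the sketch below. The intent names, the `condition` slot, and the phrasings are illustrative inventions; the `pv.TwoDigitInteger` built-in and the overall `expressions`/`slots` layout follow Rhino's documented syntax, which should be verified in the Console.

```yaml
context:
  expressions:
    setUnit:
      - "unit number $pv.TwoDigitInteger:unit"
    reportOil:
      - "oil is $condition:oil"
      - "oil level is $condition:oil"
    reportTires:
      - "tires are $condition:tires"
  slots:
    condition:
      - "good"
      - "low"
      - "worn"
      - "needs service"
```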
6. Run the DVIR, maintenance and inspection demo

Pass your AccessKey and the paths to the .ppn and .rhn files.
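The exact script name and flag names belong to the recipe, so the sketch below is kept as comments; every name in it is an assumption to be checked against the recipe's README:

```shell
# Hypothetical invocation; script name and flags are assumptions:
#
#   python main.py \
#     --access_key "${PICOVOICE_ACCESS_KEY}" \
#     --keyword_path path/to/wake_word.ppn \
#     --context_path path/to/context.rhn
```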
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook voice guided maintenance and inspection recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions

FAQ

What is voice guided maintenance and inspection?
Voice guided maintenance and inspection is a hands-free DVIR and equipment-inspection workflow where a wake word activates the app, the app asks each inspection step out loud, and the technician answers in natural speech. The app captures both structured slots, such as asset ID, oil condition, tire condition, or service status, and free-form notes, all without typing on a phone or tablet during a walkaround.
How is this different from DVIR apps like Whip Around, Fleetio, or Verizon Connect?
Existing DVIR apps are tap-and-photo workflows on a phone screen. Picovoice's voice-guided pipeline turns the same DVIR or maintenance checklist into a hands-free, eyes-free flow — drivers and technicians keep their hands on the equipment while the app prompts each step. The pipeline runs entirely on-device, so it works in shop bays, depots, and remote sites without connectivity.
Does voice guided inspection work without an internet connection?
Yes. Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, Orca Streaming Text-to-Speech, and Koala Noise Suppression all run locally. The full inspection workflow works offline — useful in shop bays with weak Wi-Fi, off-site depots, mining and construction sites, and rural transport routes.
Can the captured report meet DOT DVIR compliance requirements?
Yes. The recipe captures structured slots (asset/unit ID, fluid condition, tire condition, service status) plus free-form notes. The output can be mapped to DOT-compliant DVIR formats and pushed to your existing fleet maintenance software (Fleetio, Whip Around, Samsara, Verizon Connect, AssetWorks, etc.) through their APIs. Voice replaces the typing, not the system of record.
What happens in a noisy shop bay or roadside walkaround?
Koala Noise Suppression removes background noise such as air tools, idling engines, traffic, and generators before the audio reaches Cheetah Streaming Speech-to-Text and Rhino Speech-to-Intent. Rhino Speech-to-Intent is end-to-end, with intent accuracy that holds up in noise where transcript-based pipelines collapse. The same pipeline works in a quiet shop and on a busy ramp.
How does the wake word work?
Porcupine Wake Word listens continuously on-device with very low CPU and battery usage, and only triggers the rest of the pipeline when the user speaks the chosen wake phrase. The wake phrase is fully customizable: you choose what your drivers and technicians say, in any supported language.
Is operator audio sent to a third-party cloud?
No. Audio is processed locally on the device. Picovoice cannot access end-user audio. This removes processing-agreement and breach-surface concerns, which is important for fleets in regulated industries (food & pharma logistics, hazmat, defense logistics) and for fleets in jurisdictions with worker-voice rules.
Can I customize the inspection slots and prompts for my equipment?
Yes. The Rhino context YAML defines the unit IDs, the inspection slots (oil, tires, service status — and any others you add), and the accepted phrasings. The Orca Text-to-Speech prompts the operator hears are fully configurable text. Cheetah Streaming Speech-to-Text accepts custom vocabulary for part numbers and equipment names.
What hardware does the on-device voice guided inspection pipeline run on?
The full five-engine pipeline runs on commodity Android phones, iOS devices, rugged tablets from Honeywell and Zebra, and Linux-based fleet hardware. It also runs on Raspberry Pi for embedded telematics installs. No GPU, no NPU, and no dedicated voice hardware required.
Which assets beyond trucks fit this workflow?
Trucks, trailers, forklifts and warehouse equipment, construction and mining equipment, agriculture equipment, aviation pre-flight checks, marine vessels, and stationary plant equipment. Any asset class with a structured inspection checklist and optional free-form notes maps onto the same Porcupine Wake Word, Orca Text-to-Speech, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, and Koala Noise Suppression pipeline.
How can I get technical support for the voice guided maintenance and inspection demo?
Visit the GitHub pico-cookbook voice guided maintenance and inspection recipe where you can find the open-source demo code and create an issue for the demo-related technical questions or reach out to your Picovoice contact.