Warehouse Voice Picking

On-device voice picking for your warehouse

Hands-free, eyes-free pick-by-voice that handles natural voice prompts. Runs on commodity Android handhelds, rugged devices, and embedded Linux. No cloud round-trip.

Platforms supported
Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi
How on-device voice picking works

Four Voice AI SDKs. All on-device

On-device voice picking combines Koala Noise Suppression, Porcupine Wake Word, Rhino Speech-to-Intent, and Orca Text-to-Speech in a single local pipeline. Koala suppresses warehouse noise. Porcupine listens for the wake word. Orca speaks each pick instruction. Rhino captures the picker's check digits and picked quantity. The pipeline runs entirely on the picker's device with no audio leaving it.
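The flow described above can be sketched as a small loop. This is a minimal illustration, not the recipe's actual code: the engines are passed in as objects, and the method names (process, get_inference) and the wake word convention (an index of 0 or higher means "detected") are modeled on the Picovoice Python SDKs and should be checked against the SDK documentation. The stand-in classes, the speak callback, and the checkDigits slot name are hypothetical. The sketch also assumes every engine accepts the same frame size; a real implementation would likely need to re-buffer audio between engines.

```python
from types import SimpleNamespace


def run_pick_exchange(frames, koala, porcupine, rhino, speak, prompt):
    """Process audio frames for one pick exchange.

    Returns the inference produced by `rhino`, or None if the audio ends first.
    """
    awake = False
    for frame in frames:
        clean = koala.process(frame)          # 1. suppress noise on every frame
        if not awake:
            # 2. wake word: an index >= 0 means a keyword was detected
            if porcupine.process(clean) >= 0:
                awake = True
                speak(prompt)                 # 3. TTS speaks the pick instruction
            continue
        if rhino.process(clean):              # 4. True once the intent is finalized
            return rhino.get_inference()
    return None


# --- Stand-ins so the sketch runs without SDKs, models, or a microphone ---
class PassThroughKoala:
    def process(self, pcm):
        return list(pcm)


class MarkerWakeWord:
    """Pretends a frame starting with 1 contains the wake word."""

    def process(self, pcm):
        return 0 if pcm[0] == 1 else -1


class MarkerIntent:
    """Pretends a frame starting with 2 completes the picker's response."""

    def process(self, pcm):
        return pcm[0] == 2

    def get_inference(self):
        return SimpleNamespace(
            is_understood=True,
            intent="confirmLocation",
            slots={"checkDigits": "42"},  # hypothetical slot name
        )


spoken = []
result = run_pick_exchange(
    frames=[[0], [1], [0], [2], [0]],
    koala=PassThroughKoala(),
    porcupine=MarkerWakeWord(),
    rhino=MarkerIntent(),
    speak=spoken.append,
    prompt="Aisle 12, bay 3. Say the check digits.",
)
print(result.intent, result.slots, spoken)
```

Passing the engines in as arguments keeps the dialog logic testable without audio hardware; the real SDK objects would take the place of the stand-ins.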

Pipeline diagram: picker voice (input) → Koala (noise suppression) → Porcupine (wake word) → Rhino (intent) → Orca (TTS) → WMS (output)
Wake word starts each pick exchange · 0 ms network round-trip · Audio never leaves the device
Why Koala Noise Suppression?

2× more effective at warehouse noise. Same footprint.

17.3× more effective than RNNoise at 0 dB SNR
5.4× more effective than RNNoise at 5 dB SNR
4.3× more effective than RNNoise on average

Koala suppresses conveyor noise, forklift backups, fan noise, and HVAC before audio reaches voice AI engines like Porcupine and Rhino. While it's 2× more effective at suppressing real-world warehouse noise, it's still the same size, making it a fit for embedded devices, low-tier or legacy mobile devices, and rugged handhelds, leaving more than sufficient headroom for the rest of the pipeline.

STOI Distance to Clean Speech at 0 dB (lower is better)
Original: 0.232
RNNoise: 0.226
Koala: 0.128

STOI Distance to Clean Speech at 5 dB (lower is better)
Original: 0.156
RNNoise: 0.142
Koala: 0.080
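The 17.3× and 5.4× headline figures appear consistent with a simple ratio: how much each engine reduces the STOI distance relative to the unprocessed audio. The source does not state this formula, so treat the calculation below as one plausible reading of the chart values rather than the published methodology. The 4.3× average presumably covers more noise levels than the two shown here and is not reproduced.

```python
# STOI distance to clean speech, copied from the chart values (lower is better).
stoi_distance = {
    "0 dB": {"original": 0.232, "rnnoise": 0.226, "koala": 0.128},
    "5 dB": {"original": 0.156, "rnnoise": 0.142, "koala": 0.080},
}


def relative_effectiveness(row):
    """Ratio of Koala's reduction in STOI distance to RNNoise's reduction.

    Assumption: "effectiveness" means improvement over the unprocessed audio.
    """
    koala_gain = row["original"] - row["koala"]
    rnnoise_gain = row["original"] - row["rnnoise"]
    return koala_gain / rnnoise_gain


ratios = {snr: round(relative_effectiveness(row), 1) for snr, row in stoi_distance.items()}
print(ratios)  # {'0 dB': 17.3, '5 dB': 5.4}
```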
Why Porcupine Wake Word?

Lightweight, accurate, and customizable for warehouse speech.

3.8% single-core CPU utilization on Raspberry Pi 3
97.1% accuracy at 1 false alarm per 10 hours
~250K custom wake words trained and deployed in 2025

Porcupine Wake Word is lightweight, accurate, customizable, and production-ready. With Porcupine, pickers can wear a Bluetooth headset connected to a rugged Android handheld for an entire shift without draining the device's battery. The wake phrase is fully customizable so it doesn't collide with everyday warehouse speech.

Wake Word Detection Accuracy (higher is better)
Porcupine: 97.1%
Snowboy: 68%
PocketSphinx: 52%

CPU Utilization (lower is better)
Porcupine: 3.8%
Snowboy: 24.8%
PocketSphinx: 31.8%
Why Rhino Speech-to-Intent?

End-to-end intent. No transcript. No hallucinations.

6× higher accuracy than Big Tech average
97.3% accuracy tested across 6 to 24 dB Signal-to-Noise Ratio
Unlimited voice interactions per user

Most voice command systems run a two-step pipeline: speech-to-text converts audio to a transcript, then a separate NLU model parses that transcript for intent. Every step accumulates error and compounds latency. Rhino Speech-to-Intent is an end-to-end speech-to-intent engine with a single model that maps spoken audio directly to a structured intent with typed slot values. High accuracy even in noisy environments. No hallucinations. No intermediate transcript.
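To make "structured intent with typed slot values" concrete, here is a minimal sketch of how an app might act on one picker response. The Inference class is a stand-in carrying the three fields a Rhino inference is understood to expose (is_understood, intent, slots). The intent names are taken from the recipe's context. The slot names (checkDigits, quantity), the PickTask type, and the outcome strings are hypothetical, and slot values are assumed to arrive as digit strings.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Inference:
    """Stand-in for an intent engine result; not the SDK's own class."""
    is_understood: bool
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


@dataclass
class PickTask:
    """What the WMS expects for the current pick (hypothetical shape)."""
    check_digits: str
    quantity: int


EXCEPTION_INTENTS = {"reportShortPick", "reportDamagedItem", "reportLocationEmpty"}


def evaluate(inference: Inference, task: PickTask) -> str:
    """Map one picker response to a workflow outcome."""
    if not inference.is_understood:
        return "reprompt"
    if inference.intent == "confirmLocation":
        said = inference.slots.get("checkDigits")
        return "location_ok" if said == task.check_digits else "wrong_location"
    if inference.intent == "confirmPickedQuantity":
        said = inference.slots.get("quantity", "")
        if not said.isdigit():
            return "reprompt"
        return "quantity_ok" if int(said) == task.quantity else "quantity_mismatch"
    if inference.intent in EXCEPTION_INTENTS:
        return "exception"
    if inference.intent == "exitWorkflow":
        return "exit"
    return "reprompt"


task = PickTask(check_digits="42", quantity=6)
outcomes = [
    evaluate(Inference(True, "confirmLocation", {"checkDigits": "42"}), task),
    evaluate(Inference(True, "confirmPickedQuantity", {"quantity": "5"}), task),
    evaluate(Inference(True, "reportDamagedItem"), task),
    evaluate(Inference(False), task),
]
print(outcomes)  # ['location_ok', 'quantity_mismatch', 'exception', 'reprompt']
```

Because the engine returns an intent and slots rather than free text, the handler is a plain comparison against the expected values, with no transcript parsing step.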

Voice Command Acceptance Accuracy (higher is better)
Rhino: 97.3%
Amazon Lex: 84.3%
Google Dialogflow: 77.3%

Voice Command Acceptance Accuracy at 21 dB SNR (higher is better)
Rhino: 99%
Amazon Lex: 87%
Google Dialogflow: 83%
Why Orca Text-to-Speech?

Natural-sounding TTS at 29 MB peak memory.

29 MB peak memory usage
130 ms first-token-to-speech latency
7 MB model size

Orca speaks each pick instruction the picker hears: aisle, bay, SKU, quantity, and check-digit prompt. Most high-quality TTS solutions require hundreds of megabytes of RAM. Orca TTS uses 29 MB of peak memory, making it the only natural-sounding TTS deployable in any environment, including mobile apps with strict out-of-memory limits and embedded devices. This leaves headroom for the rest of the pipeline and for more operators. Custom pronunciation handles SKU codes, brand names, and warehouse-specific terms.
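As an illustration of the text a TTS engine would receive, the sketch below composes one pick instruction from its parts. The function name, argument names, and phrasing are hypothetical. Spelling the SKU character by character is one simple option; it does not use Orca's custom pronunciation feature.

```python
def build_pick_prompt(aisle: str, bay: str, sku: str, quantity: int) -> str:
    """Compose the pick instruction text to hand to a TTS engine."""
    spoken_sku = " ".join(sku)  # read the SKU one character at a time
    unit = "unit" if quantity == 1 else "units"
    return (
        f"Aisle {aisle}, bay {bay}. SKU {spoken_sku}. "
        f"Pick {quantity} {unit}. Say the check digits."
    )


prompt = build_pick_prompt(aisle="12", bay="3", sku="A47", quantity=2)
print(prompt)  # Aisle 12, bay 3. SKU A 4 7. Pick 2 units. Say the check digits.
```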

TTS Latency (lower is better)
Orca TTS Streaming: 128 ms
ElevenLabs TTS Streaming: 335 ms
ESpeak TTS: 1,430 ms
ElevenLabs TTS: 1,470 ms
Audio Quality
Audio samples to listen to and compare, grouped by peak memory usage. Peak memory usage under 30 MB: ESpeak, Orca.
Voice picking built for warehouse and distribution operations

From e-commerce fulfillment to cold storage

3PL & e-commerce

Voice picking on commodity Android handhelds

3PLs and e-commerce fulfillment operators can deploy voice picking on the rugged Android handhelds and Bluetooth headsets they already own — no proprietary Vocollect or LYDIA hardware required. The same pipeline covers replenishment and putaway, so the same device works across the operation.

WMS integration

Voice picking with Manhattan, Blue Yonder, SAP EWM

Retail DCs running Manhattan, Blue Yonder, SAP EWM, Oracle WMS, or Infor can keep their system of record and add voice picking as the operator interface. Structured intents from the on-device voice picking pipeline can be integrated into any WMS.
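A minimal sketch of that hand-off, assuming a JSON-based integration: the captured intent and slots are wrapped with a picker ID and task ID and serialized for whatever API or middleware sits in front of the WMS. The field names, ID formats, and the to_wms_event function are hypothetical. Each WMS defines its own schema, so this shows the shape of the idea only.

```python
import json
from datetime import datetime, timezone


def to_wms_event(picker_id, task_id, intent, slots, when=None):
    """Wrap one captured intent as a WMS-bound event (hypothetical schema)."""
    when = when or datetime.now(timezone.utc)
    return {
        "pickerId": picker_id,
        "taskId": task_id,
        "event": intent,
        "slots": dict(slots),
        "timestamp": when.isoformat(),
    }


event = to_wms_event(
    picker_id="P-0192",
    task_id="T-55871",
    intent="confirmPickedQuantity",
    slots={"quantity": "6"},
    when=datetime(2025, 1, 1, 8, 30, tzinfo=timezone.utc),
)
payload = json.dumps(event, sort_keys=True)
print(payload)
```

Only this small structured payload needs to reach the WMS; the audio itself stays on the device.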

Cold storage & freezers

Offline voice picking for dead zones

Cold storage facilities, food & beverage warehouses, and pharma distribution centers all share three problems: freezer rooms with no signal, worker-privacy regulations on voice recording, and zero tolerance for inventory errors. The on-device pipeline addresses all three — runs offline, keeps audio local, and avoids LLM hallucinations on the quantity field.

Manufacturing

Voice-directed kitting and line-side replenishment

Manufacturing kitting operations, line-side replenishment, and just-in-time material movement use the same voice-picking pattern. The Rhino YAML can be extended with kit IDs, station IDs, and exception codes specific to your line. The pipeline runs on the same Android tablets your line operators already carry.

Get started

On-device voice picking with Python: code example

A complete working recipe in Python. Open-source on GitHub. Runs 100% on-device.

recipe · voice-picking
Difficulty: Beginner
Runtime: 100% on-device
Language: Python
Platforms supported: Android, iOS, Linux, macOS, Windows, Chrome, Edge, Firefox, Safari, Raspberry Pi

Prerequisites

A Picovoice AccessKey from Picovoice Console and a local clone of the GitHub repository.

Usage

These instructions assume your current working directory is recipes/voice-picking/python.

1. Create a virtual environment

Isolate the recipe's dependencies from your system Python, for example with python3 -m venv .venv.

2. Activate the virtual environment

Activation makes pip install into .venv instead of system Python.
Linux, macOS, or Raspberry Pi: source .venv/bin/activate
Windows: .venv\Scripts\activate

3. Install dependencies

Install the Porcupine, Rhino, Orca, Koala, PvRecorder, and PvSpeaker Python SDKs.

4. Train a wake word

Open the Picovoice Console, go to Porcupine Wake Word, enter the wake phrase your pickers will use (something distinct from everyday warehouse speech), train, and download the .ppn file for your target platform.

5. Train a Speech-to-Intent model

In Picovoice Console, go to Rhino Speech-to-Intent, create an empty context, and import the Rhino context YAML for the voice-picking recipe. Intents include confirmLocation, confirmPickedQuantity, reportShortPick, reportDamagedItem, reportLocationEmpty, and exitWorkflow. Download the generated .rhn file.

6. Run the voice picking demo

Pass your AccessKey and the paths to the .ppn and .rhn files. The demo opens the microphone and runs the voice picking pipeline locally.
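The demo's actual script name and command-line flags are documented in the GitHub recipe and are not reproduced here. As a rough sketch of the three inputs involved, the snippet below parses an AccessKey and two model paths and checks the file extensions before anything is loaded. The flag names (--access_key, --keyword_path, --context_path), the file paths, and the find_problems helper are assumptions for illustration only.

```python
import argparse
from pathlib import Path


def parse_args(argv):
    """Parse the three inputs the demo needs (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Voice picking demo launcher (sketch)")
    parser.add_argument("--access_key", required=True, help="AccessKey from Picovoice Console")
    parser.add_argument("--keyword_path", required=True, type=Path, help="Porcupine .ppn file")
    parser.add_argument("--context_path", required=True, type=Path, help="Rhino .rhn file")
    return parser.parse_args(argv)


def find_problems(args):
    """Return human-readable problems with the arguments; empty means OK."""
    problems = []
    if not args.access_key.strip():
        problems.append("AccessKey is empty")
    if args.keyword_path.suffix != ".ppn":
        problems.append(f"expected a .ppn wake word file, got {args.keyword_path.name}")
    if args.context_path.suffix != ".rhn":
        problems.append(f"expected a .rhn context file, got {args.context_path.name}")
    return problems


args = parse_args([
    "--access_key", "YOUR_ACCESS_KEY",
    "--keyword_path", "models/wake_word.ppn",
    "--context_path", "models/voice_picking.rhn",
])
print(find_problems(args))  # []
```

Catching a swapped .ppn and .rhn path up front gives a clearer message than an engine initialization error.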
Have questions or looking for implementations in other languages? Visit the GitHub pico-cookbook Voice Picking Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions.
Frequently asked questions

FAQ

What is voice picking?
Voice picking — also called pick-by-voice or voice-directed picking — is a hands-free, eyes-free order-fulfillment workflow. The system speaks instructions to the picker (aisle, bay, quantity, check digits) and the picker responds in natural speech. Voice picking is widely used in warehouses to free both hands for picking, scanning, and palletizing while keeping accuracy high.
How is this different from Honeywell Vocollect or LYDIA Voice?
Honeywell Vocollect and LYDIA Voice are bundled hardware-plus-software offerings tied to specific headsets and cloud services. Picovoice provides licensable on-device SDKs that you embed in your own WMS-connected mobile or rugged-device app. The same engines run on Android, iOS, Linux, and Raspberry Pi, with no proprietary headset and no cloud round-trip. The pipeline is the same: wake word, intent, TTS, noise suppression.
Can on-device voice picking work without an internet connection?
Yes. Porcupine Wake Word, Rhino Speech-to-Intent, Orca Text-to-Speech, and Koala Noise Suppression all run locally on the device. The full picking workflow works offline — useful in cold storage facilities with weak signal, freezer rooms, mezzanines, and the back of large facilities where Wi-Fi coverage is patchy.
How does the wake word work in a noisy warehouse?
Porcupine Wake Word, trained using real-world data, is a noise-robust wake word detection engine that listens continuously on-device with very low CPU and battery cost. Koala Noise Suppression improves Porcupine's accuracy further by cleaning incoming audio of conveyor noise, forklift backups, and HVAC before the wake word check.
How accurate is voice command acceptance and intent detection in a noisy warehouse?
Rhino Speech-to-Intent is an end-to-end model that maps spoken audio directly to a structured intent, with no intermediate transcript and no hallucination risk. It outperforms alternatives in noisy environments, as shown in the open-source NLU benchmark. Combined with Koala Noise Suppression, accuracy improves further in high-noise warehouse environments.
Can I customize the workflow for my warehouse?
Yes. The Rhino context YAML defines the intents and accepted phrasings: confirm location with check digits, confirm picked quantity, report short pick, report damaged item, report location empty, and exit workflow. You can add intents (replenishment, putaway, cycle count) and phrasings specific to your operation and fine-tune your own AI model. Orca prompts are fully configurable text with custom pronunciation. Wake words, voice commands, and guide responses can be swapped per warehouse without changing the SDK.
What rugged devices does voice picking run on?
Picovoice SDKs run on standard rugged Android handhelds from Honeywell, Zebra, and Datalogic, as well as any Android phone with a Bluetooth headset, embedded Linux devices, and Raspberry Pi. No proprietary hardware is required.
Can voice picking support multilingual picker populations?
Yes. Porcupine Wake Word and Rhino Speech-to-Intent support multiple languages. You can run multiple wake words across languages or add Bat Spoken Language Identification to detect each picker's language automatically and route to the correct Rhino context.
Can I integrate this voice picking app with my existing WMS?
Yes. The voice pipeline is the operator interface, and your WMS remains the system of record. The captured intents (location confirmation, picked quantity, exception type) and the picker ID can flow into your WMS through whatever API or middleware you already use. Voice picking systems powered by Picovoice can integrate with Manhattan, Blue Yonder, SAP EWM, Oracle WMS, Infor, and homegrown systems.
Is operator audio sent to a third-party cloud?
No. Audio is processed in memory on the device and discarded. Picovoice has no data controller relationship with end users. This matters for fleets subject to worker-voice rules and for facilities operating under regulated frameworks (food & pharma logistics, hazmat, defense).
How can I get technical support?
Visit the GitHub pico-cookbook Voice Picking Recipe, where you can find the open-source demo code and create an issue for demo-related technical questions, or reach out to your Picovoice contact.