
TL;DR: Add voice control to a smart TV with on-device voice AI in Python. Voice searches route through speech-to-intent for instant results, while open-ended requests go to Picovoice's local LLM. All voice models stay on-device, keeping latency low and user data private.

Smart TV voice search works best when responses feel instant. Cloud-based pipelines tend to add network latency at each processing stage, which can make the experience feel slower than expected. On-device voice AI keeps speech processing local to the device, cutting out network round-trips and keeping user data private.

This tutorial covers building a smart TV voice assistant in Python. A custom wake word activates the voice search hands-free, structured content commands route through speech-to-intent for instant catalog lookups, and open-ended requests go to a local LLM for personalized recommendations, all on-device.

What You'll Build:

A smart TV voice assistant that:

  • Activates using two custom wake phrases — one for voice commands (e.g., "Hey TV") and one for personalized recommendations (e.g., "Hey Assistant")
  • Searches the local content catalog instantly for structured queries
  • Routes open-ended requests to a local LLM for intelligent content matching
  • Responds with natural speech synthesis

The voice search system's fully on-device architecture ensures that it:

  • Delivers low-latency responses with all speech processing running locally on the device's hardware.
  • Keeps all user audio and viewing preferences on-device, meeting GDPR and CCPA privacy compliance expectations for in-home devices.

What You'll Need:

  • Python 3.9+
  • Laptop or desktop with a microphone and speakers for testing
  • Picovoice AccessKey from the Picovoice Console

Smart TV Voice Search Architecture

This Python-based voice search system uses an on-device architecture designed for instant content discovery and personalized recommendations:

  1. Always-Listening Activation — The voice search system sits in a low-power, idle state using Porcupine Wake Word to monitor the audio stream for two distinct wake phrases. Detecting "Hey TV" routes to instant content search, while "Hey Assistant" routes to the personalized recommendation assistant. This dual-keyword approach lets viewers choose the right path upfront.

  2. Intent Recognition for Content Search — When "Hey TV" is detected, the audio is analyzed by Rhino Speech-to-Intent. Instead of transcribing words one by one, it maps the speech directly to a structured content query — like "search action movies" or "resume watching." The system queries the local content catalog and returns results immediately without further processing.

  3. Speech-to-Text for Personalized Requests — When "Hey Assistant" is detected, the system routes directly to Cheetah Streaming Speech-to-Text. This engine captures the full detail of open-ended requests that can't be matched to a fixed intent structure.

  4. On-Device Language Model — The transcribed request is passed to picoLLM along with the device's content catalog. The local language model interprets what the viewer is looking for and matches it against available titles, returning structured recommendations without any cloud processing.

  5. Voice Response Generation — Orca Streaming Text-to-Speech converts the response into natural speech, completing the hands-free loop from query to recommendation.

Content Search Workflow: "Hey TV" → Porcupine Wake Word → Rhino Speech-to-Intent → local catalog lookup → Orca Streaming Text-to-Speech

Personalized Recommendation Workflow: "Hey Assistant" → Porcupine Wake Word → Cheetah Streaming Speech-to-Text → picoLLM → Orca Streaming Text-to-Speech

All Picovoice models — Porcupine Wake Word, Rhino Speech-to-Intent, Cheetah Streaming Speech-to-Text, and Orca Streaming Text-to-Speech — support multiple languages including English, Spanish, German and more. Build multilingual voice search to serve international markets by training models in the languages your target regions speak.

Train Custom Wake Words

  1. Sign up for a Picovoice Console account and navigate to the Porcupine page.
  2. Enter your first wake phrase for content commands (e.g., "Hey TV") and test it using the microphone button.
  3. Click "Train," select the target platform, and download the .ppn model file.
  4. Repeat steps 2 & 3 to train an additional wake word for personalized recommendations (e.g., "Hey Assistant").

Porcupine can detect multiple wake words simultaneously. For instance, support both "Hey TV" and "Hey Assistant" for different interaction modes. For tips on designing an effective wake word, review the choosing a wake word guide.

Define Voice Commands for Content Discovery

  1. Create an empty Rhino Speech-to-Intent Context.
  2. Click the "Import YAML" button in the top-right corner of the console and paste the YAML provided below to define intents for structured content search commands.
  3. Test the model with the microphone button and download the .rhn context file for your target platform.

You can refer to the Rhino Syntax Cheat Sheet for more details on building custom contexts.

YAML Context for Smart TV Content Discovery:
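The context below is a starting point; the intent names, expressions, and slot values are illustrative, so adapt them to your catalog:

```yaml
context:
  expressions:
    searchContent:
      - "(search, find, show) [me] [some] $genre:genre (movies, shows, series)"
      - "(search, find) $genre:genre"
    resumePlayback:
      - "resume (watching, playback)"
      - "continue [watching] my (show, movie)"
    topRated:
      - "[show] [me] [the] top rated $genre:genre (movies, shows)"
  slots:
    genre:
      - "action"
      - "comedy"
      - "drama"
      - "documentary"
      - "horror"
      - "thriller"
```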

This context handles the most common structured content search commands. For open-ended requests like "something relaxing to watch tonight" or "a movie similar to what I watched last night," the assistant will use the picoLLM recommendation path.

Set Up Local Large Language Model

  1. Navigate to the picoLLM page in Picovoice Console.
  2. Select a model. This tutorial uses llama-3.2-3b-instruct-505.pllm.
  3. Download the .pllm file and place it in your project directory.

Install Dependencies

Install all required Python SDKs and dependencies using pip:
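The packages below cover each engine used in this tutorial, plus pvrecorder for cross-platform microphone capture:

```bash
pip install pvporcupine pvrhino pvcheetah picollm pvorca pvrecorder
```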

Add Wake Word Detection for Hands-Free Activation

The following code captures audio from your microphone and detects the custom wake word locally:
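A minimal sketch; substitute your AccessKey and the paths to the .ppn files trained earlier:

```python
import pvporcupine
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"  # from Picovoice Console

# Listen for both wake words at once; process() returns the index of the
# detected keyword, or -1 if the frame contains no wake word.
porcupine = pvporcupine.create(
    access_key=ACCESS_KEY,
    keyword_paths=["hey_tv.ppn", "hey_assistant.ppn"])

recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        keyword_index = porcupine.process(recorder.read())
        if keyword_index == 0:
            print("'Hey TV' detected: route to content search")
        elif keyword_index == 1:
            print("'Hey Assistant' detected: route to recommendations")
finally:
    recorder.stop()
    porcupine.delete()
```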

Porcupine Wake Word processes each audio frame on-device with acoustic models optimized for living room environments. By listening for multiple wake words simultaneously, it routes viewers to the right system path instantly — content search or personalized recommendations — without continuous cloud streaming.

Process Content Search Commands

Once the wake word is detected, Rhino Speech-to-Intent listens for structured content queries:
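A sketch of the intent-capture step, assuming the context file from the console is saved as smart_tv.rhn:

```python
import pvrhino
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

rhino = pvrhino.create(access_key=ACCESS_KEY, context_path="smart_tv.rhn")

recorder = PvRecorder(frame_length=rhino.frame_length)
recorder.start()

try:
    # Feed audio frames until Rhino finalizes an inference.
    while not rhino.process(recorder.read()):
        pass
    inference = rhino.get_inference()
    if inference.is_understood:
        print(f"intent: {inference.intent}, slots: {inference.slots}")
    else:
        print("Command not understood.")
finally:
    recorder.stop()
    rhino.delete()
```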

Rhino Speech-to-Intent directly infers intent from speech without requiring a separate transcription step, enabling instant content catalog lookups for structured queries.

Handle Personalized Recommendations with AI

When viewers say "Hey Assistant," the system routes directly to streaming speech-to-text and local LLM for open-ended content discovery:
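A sketch of the recommendation path; the catalog string and prompt format are illustrative placeholders:

```python
import pvcheetah
import picollm
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

# Transcribe the open-ended request until Cheetah detects end of speech.
cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
recorder = PvRecorder(frame_length=cheetah.frame_length)
recorder.start()

transcript = ""
try:
    while True:
        partial, is_endpoint = cheetah.process(recorder.read())
        transcript += partial
        if is_endpoint:
            transcript += cheetah.flush()
            break
finally:
    recorder.stop()
    cheetah.delete()

# Ask the local LLM to match the request against the catalog.
pllm = picollm.create(
    access_key=ACCESS_KEY,
    model_path="llama-3.2-3b-instruct-505.pllm")

catalog = "Skyline Pursuit (action), Quiet Shores (documentary), Laugh Track (comedy)"
prompt = (
    f"You are a smart TV assistant. Available titles: {catalog}.\n"
    f"Viewer request: {transcript}\n"
    "Recommend the best matching titles and briefly explain why.")
print(pllm.generate(prompt, completion_token_limit=256).completion)
pllm.release()
```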

This approach uses Cheetah Streaming Speech-to-Text to capture the viewer's open-ended request, then picoLLM to match it against the local content catalog and generate structured recommendations — all without leaving the device.

Add Voice Response Generation for Smart TV

Transform text responses into natural speech for TV playback:
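A sketch using Orca's streaming API; here the audio lands in a WAV file, while a real TV would hand the PCM straight to its audio output:

```python
import struct
import wave

import pvorca

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

orca = pvorca.create(access_key=ACCESS_KEY)
stream = orca.stream_open()

# Feed text incrementally; Orca starts returning PCM before the full
# response text is available, which keeps perceived latency low.
pcm = []
for text_chunk in ["Here are the top ", "action movies ", "in your catalog."]:
    chunk = stream.synthesize(text_chunk)
    if chunk is not None:
        pcm.extend(chunk)
chunk = stream.flush()
if chunk is not None:
    pcm.extend(chunk)
stream.close()

with wave.open("response.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 16-bit samples
    f.setframerate(orca.sample_rate)
    f.writeframes(struct.pack(f"{len(pcm)}h", *pcm))

orca.delete()
```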

Orca Streaming Text-to-Speech generates natural voice responses with first audio output in under 130ms, providing immediate verbal feedback when a viewer speaks a command.

Route Content Search Commands to Local Catalog

Map structured intents to content catalog queries and format results for voice delivery:
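A sketch with a hypothetical in-memory catalog; the intent names match the YAML context above, and a production TV would query its real content service instead:

```python
CATALOG = [
    {"title": "Skyline Pursuit", "genre": "action", "rating": 8.1, "in_progress": False},
    {"title": "Quiet Shores", "genre": "documentary", "rating": 7.9, "in_progress": True},
    {"title": "Laugh Track", "genre": "comedy", "rating": 7.2, "in_progress": False},
]

def handle_inference(inference):
    """Turn a Rhino inference into a spoken response string."""
    if inference.intent == "searchContent":
        genre = inference.slots.get("genre", "")
        titles = [c["title"] for c in CATALOG if c["genre"] == genre]
        if titles:
            return f"I found {len(titles)} {genre} titles: " + ", ".join(titles)
        return f"Sorry, I couldn't find any {genre} titles."
    if inference.intent == "resumePlayback":
        for c in CATALOG:
            if c["in_progress"]:
                return f"Resuming {c['title']}."
        return "Nothing is in progress right now."
    if inference.intent == "topRated":
        genre = inference.slots.get("genre", "")
        ranked = sorted(
            (c for c in CATALOG if c["genre"] == genre),
            key=lambda c: c["rating"],
            reverse=True)
        if ranked:
            return f"The top rated {genre} title is {ranked[0]['title']}."
        return f"No {genre} titles found."
    return "Sorry, I didn't catch that."
```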

This implementation combines all components for a smart TV voice search system:
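A condensed main loop, assuming the model files from earlier steps and the handle_inference() helper above; error handling and resource cleanup are omitted for brevity:

```python
import pvporcupine
import pvrhino
import pvcheetah
import picollm
import pvorca
from pvrecorder import PvRecorder

ACCESS_KEY = "${YOUR_ACCESS_KEY}"

porcupine = pvporcupine.create(
    access_key=ACCESS_KEY,
    keyword_paths=["hey_tv.ppn", "hey_assistant.ppn"])
rhino = pvrhino.create(access_key=ACCESS_KEY, context_path="smart_tv.rhn")
cheetah = pvcheetah.create(access_key=ACCESS_KEY, endpoint_duration_sec=1.0)
pllm = picollm.create(
    access_key=ACCESS_KEY,
    model_path="llama-3.2-3b-instruct-505.pllm")
orca = pvorca.create(access_key=ACCESS_KEY)

# Porcupine, Rhino, and Cheetah all consume 16 kHz, 512-sample frames,
# so a single recorder can feed whichever engine is active.
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

def recommend(transcript):
    prompt = (
        "You are a smart TV assistant. Recommend titles from the catalog "
        f"for this request: {transcript}")
    return pllm.generate(prompt, completion_token_limit=256).completion

def speak(text):
    # Single-shot synthesis for brevity; see the streaming example above.
    orca.synthesize_to_file(text, "response.wav")

while True:
    keyword_index = porcupine.process(recorder.read())
    if keyword_index == 0:  # "Hey TV": structured content search
        while not rhino.process(recorder.read()):
            pass
        response = handle_inference(rhino.get_inference())
    elif keyword_index == 1:  # "Hey Assistant": open-ended recommendation
        transcript = ""
        while True:
            partial, is_endpoint = cheetah.process(recorder.read())
            transcript += partial
            if is_endpoint:
                transcript += cheetah.flush()
                break
        response = recommend(transcript)
    else:
        continue
    speak(response)
```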

Run the Smart TV Voice Assistant

To run the voice search system, update the model paths to match your local files and have your Picovoice AccessKey ready:
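Assuming the combined script above is saved as smart_tv_assistant.py:

```bash
python smart_tv_assistant.py
```

Then say "Hey TV" followed by a content command, or "Hey Assistant" followed by an open-ended request.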

Example Interactions

The exchanges below are illustrative and use the sample catalog from earlier; actual responses depend on your content library.

Content Search:
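  Viewer: "Hey TV, search action movies."
  TV: "I found one action title: Skyline Pursuit."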

Resume Playback:
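  Viewer: "Hey TV, resume watching."
  TV: "Resuming Quiet Shores."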

Personalized Recommendation:
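  Viewer: "Hey Assistant, something relaxing to watch tonight."
  TV: "Based on your catalog, you might enjoy Quiet Shores, a calm nature documentary."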

You can start building your own commercial or non-commercial projects using Picovoice's self-service Console.


Frequently Asked Questions

Will the voice search work accurately in noisy environments, with different accents, or with varied content titles?
Yes. Porcupine Wake Word, Rhino Speech-to-Intent, and Cheetah Streaming Speech-to-Text are designed to work reliably with background noise and various accents across supported languages.
Can I use different wake words instead of 'Hey TV' and 'Hey Assistant'?
Yes. Train any custom wake phrases using Picovoice Console in seconds without collecting training data. Simply enter your desired phrases and download the trained models. Porcupine detects multiple wake words simultaneously with no added runtime footprint, so both activation paths stay responsive. The wake word guide covers best practices for choosing effective wake phrases.
When should I use Rhino Speech-to-Intent versus picoLLM for content queries?
Use Rhino Speech-to-Intent for structured, predictable content searches like genre filters, resume commands, and top-rated lists. Use picoLLM for open-ended requests where viewers might phrase things in unpredictable ways. The dual wake word architecture lets viewers choose the appropriate path upfront — "Hey TV" for direct searches and "Hey Assistant" for personalized recommendations.