
You've read Snow Crash by Neal Stephenson, watched Free Guy, heard about Facebook's, Microsoft's or NVIDIA's plans for the Metaverse, or the Gucci and Roblox collaboration, and now you probably don't want to see another post on the metaverse. We promise, this one is different. If you're not there yet, let's catch you up quickly. The Metaverse is essentially a network of connected communities where people can interact with each other to work, socialize, shop or play. It consists of always-on, extended-reality digital environments powered by AR, VR and combinations of the two. In this article, we'll discuss the benefits of on-device voice recognition for AR, VR and MR applications.

The Metaverse may seem far away when you think of waiting almost a minute for Alexa to play the next song. When you rely on cloud providers for voice recognition, the shortcomings of broadband connectivity and cloud latency hinder the experience. However, we have a solution, at least for the voice part. Let's say you have an AR-enabled application where users can try on digital clothes, and a user wants to change the color.

Standard Approach

1. The user presses a push-to-talk button before talking.
2. The voice data is recorded and sent to the cloud for transcription. How fast the data gets there depends on the user's internet service provider (ISP).
3. The text is passed to an NLU (Natural Language Understanding) service to infer the user's intent. The latency of these steps depends on the cloud providers' performance and the user's proximity to the data center.
4. The command is sent back to the device to be executed, and once again this leg relies on the ISP.
5. Finally, the user sees that the color has changed.
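Put as a minimal sketch in Python, the flow looks like the following. The helpers record_audio, cloud_transcribe, cloud_nlu and apply_command are hypothetical placeholders for a recorder and cloud services, not a real API; every network hop adds latency the application cannot control.

import time

def handle_push_to_talk():
    audio = record_audio()          # recording starts only after the button press
    t0 = time.time()
    text = cloud_transcribe(audio)  # round trip 1: upload audio, wait for the transcript
    intent = cloud_nlu(text)        # round trip 2: send text, wait for the intent
    apply_command(intent)           # round trip 3: the command travels back to the device
    print(f"end-to-end delay: {time.time() - t0:.2f}s")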

It may sound exhausting. The next time Alexa takes a while to play the next song, just think about this long process.


Picovoice Approach

Porcupine Wake Word eliminates the need for a push-to-talk button: the user can simply start with a voice command. Rhino Speech-to-Intent infers the intent directly from speech, without an intermediate text representation. Orca Text-to-Speech voices machine-generated responses when needed. Throughout the whole process, voice data is processed locally on the device without being sent to the cloud, so users do not face the reliability and latency issues of the cloud-dependent approach.


The user says “Porcupine (or your branded wake word), change the color to black”, and voila!
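Here is a minimal sketch of that interaction in Python. It assumes a hypothetical "clothing" Rhino context built on Picovoice Console, plus placeholder audio_frame() and apply_command() helpers; Porcupine watches the stream for the wake word, then Rhino takes over and infers the intent directly from the speech that follows.

import pvporcupine
import pvrhino

porcupine = pvporcupine.create(
    access_key=access_key,
    keywords=['porcupine'])
rhino = pvrhino.create(
    access_key=access_key,
    context_path='clothing.rhn')  # hypothetical context with a 'changeColor' intent

awaiting_command = False
while True:
    frame = audio_frame()  # 512 samples of 16 kHz, 16-bit audio
    if not awaiting_command:
        if porcupine.process(frame) >= 0:
            awaiting_command = True  # wake word detected; hand the stream to Rhino
    elif rhino.process(frame):  # True once Rhino has heard a complete command
        inference = rhino.get_inference()
        if inference.is_understood:
            apply_command(inference.intent, inference.slots)  # e.g. {'color': 'black'}
        awaiting_command = False

When the application needs to talk back, Orca can synthesize the response on-device in the same loop.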

What's more with Picovoice?

Picovoice technology is not only fast but also mindful of power consumption. For example, Porcupine, the wake word engine, uses less than 4% of a Raspberry Pi 3's CPU and detects multiple wake words concurrently without any additional footprint (see the sketch after the snippets below). You can develop voice products for the Metaverse using Picovoice SDKs for Python, the web, Android, iOS, Unity and more:

Python:

import pvporcupine

o = pvporcupine.create(
    access_key=access_key,
    keyword_paths=keyword_paths)

while True:
    keyword_index = o.process(audio_frame())
    if keyword_index >= 0:
        pass  # Detection callback
Node.js:

const { Porcupine } = require("@picovoice/porcupine-node");

let o = new Porcupine(
    accessKey,
    keywordPaths,
    sensitivities);

while (true) {
    let keywordIndex = o.process(audioFrame());
    if (keywordIndex >= 0) {
        // Detection callback
    }
}
Android (Java):

import ai.picovoice.porcupine.*;

PorcupineManagerCallback callback =
    new PorcupineManagerCallback() {
        @Override
        public void invoke(int keywordIndex) {
            // Detection callback
        }
    };

PorcupineManager o = new PorcupineManager.Builder()
    .setAccessKey(accessKey)
    .setKeywordPath(keywordPath)
    .build(appContext, callback);

o.start();
iOS (Swift):

import Porcupine

let o = try PorcupineManager(
    accessKey: accessKey,
    keywordPath: keywordPath,
    onDetection: { keywordIndex in
        // Detection callback
    })

try o.start()
Web (React):

import { useEffect } from "react";
import { usePorcupine } from "@picovoice/porcupine-react";

const {
    keywordDetection,
    isLoaded,
    isListening,
    error,
    init,
    start,
    stop,
    release,
} = usePorcupine();

await init(
    accessKey,
    keywords,
    model);

await start();

useEffect(() => {
    if (keywordDetection !== null) {
        // Handle keyword detection
    }
}, [keywordDetection]);
Flutter (Dart):

import 'package:porcupine_flutter/porcupine_manager.dart';

PorcupineManager o = await PorcupineManager.fromKeywordPaths(
    accessKey,
    keywordPaths,
    (keywordIndex) {
        // Detection callback
    });

await o.start();
React Native (JavaScript):

import { PorcupineManager } from "@picovoice/porcupine-react-native";

let o = await PorcupineManager.fromKeywordPaths(
    accessKey,
    keywordPaths,
    (keywordIndex) => {
        // Detection callback
    });

await o.start();
Unity (C#):

using Pv.Unity;

PorcupineManager o = PorcupineManager.FromKeywordPaths(
    accessKey,
    keywordPaths,
    (keywordIndex) => {
        // Detection callback
    });

o.Start();
.NET (C#):

using Pv;

Porcupine o = Porcupine.FromKeywordPaths(
    accessKey,
    keywordPaths);

while (true)
{
    int keywordIndex = o.Process(Audio());
    if (keywordIndex >= 0)
    {
        // Detection callback
    }
}
Java:

import ai.picovoice.porcupine.*;

Porcupine o = new Porcupine.Builder()
    .setAccessKey(accessKey)
    .setKeywordPath(keywordPath)
    .build();

while (true) {
    int keywordIndex = o.process(audioFrame());
    if (keywordIndex >= 0) {
        // Detection callback
    }
}
C:

#include "pv_porcupine.h"

pv_porcupine_t *porcupine = NULL;
pv_porcupine_init(
    access_key,
    model_path,
    num_keywords,
    keyword_paths,
    sensitivities,
    &porcupine);

int32_t keyword_index = -1;
while (true) {
    pv_porcupine_process(
        porcupine,
        audio_frame(),
        &keyword_index);
    if (keyword_index >= 0) {
        // Detection callback
    }
}
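As promised above, listening for several wake words at once is just a longer keyword list: a single Porcupine instance reports which one fired through the returned index. Here is a minimal Python sketch with built-in keywords and the same placeholder audio_frame() helper:

import pvporcupine

keywords = ['porcupine', 'bumblebee', 'picovoice']  # built-in keywords
porcupine = pvporcupine.create(
    access_key=access_key,
    keywords=keywords)

while True:
    keyword_index = porcupine.process(audio_frame())
    if keyword_index >= 0:
        print(f"detected '{keywords[keyword_index]}'")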

Watch the demo!

To show how it works, we've built a Voice-Controlled VR Video Player with Unity.

Start Building!

Adding on-device voice recognition to AR, VR or MR applications is just a few clicks away. Explore the Metaverse by creating a free Picovoice Console account.

AR, VR and MR stand for Augmented Reality, Virtual Reality and Mixed Reality, respectively. Extended Reality, shortened to XR, is an umbrella term covering AR, VR and MR.