Speech to Text Transcription in iOS Tutorial

Mobile apps are an ideal use case for Speech Recognition, whether for hands-free dictation, voice interfaces for mobile games, or generating subtitles for video and audio messages.

The iPhone runs iOS, Apple's flagship mobile operating system, and the iPad and Apple Watch run the closely related iPadOS and watchOS. iOS features its own Speech Recognition API, but it can be clumsy and verbose to integrate. Crucially, not all of its supported languages have on-device recognition, and even for those that do, the API may still stream audio to Apple's servers, introducing privacy concerns and latency.

Fortunately, Picovoice's Speech-to-Text technology does not have these downsides, and integrates seamlessly into the iOS ecosystem.

In addition to iOS, Picovoice's Speech-to-Text engines are compatible with a wide array of environments, such as Android, Linux, macOS, Windows, and modern web browsers (via WebAssembly).

With Speech-to-Text transcription, there are two main approaches: Real-Time and Batch.

Real-Time Speech-to-Text

Real-time Speech-to-Text systems produce text as the user speaks, much as humans convert speech into words while listening to a conversation. The downside is that the engine must commit to words before it has heard the rest of the sentence, so acoustic or semantic ambiguities, which often resolve only once the sentence is finished, can cause errors. Weigh this drawback when deciding whether an application needs real-time transcription.

Real-Time Speech-to-Text, Online Automatic Speech Recognition, and Streaming Speech-to-Text all refer to the same core technology.

For iOS devices, Picovoice provides Cheetah Streaming Speech-to-Text, a unique technology that performs all voice recognition in real-time directly on the device. This approach avoids network-related delays and minimizes the latency between the user's speech input and the transcription output.

Below is the list of software development kits (SDKs) available for Cheetah, along with corresponding code snippets.

Python:
import pvcheetah

o = pvcheetah.create(access_key)

partial_transcript, is_endpoint = o.process(get_next_audio_frame())

Node.js:
const o = new Cheetah(accessKey);

const [partialTranscript, isEndpoint] =
  o.process(audioFrame);

Android:
Cheetah o = new Cheetah.Builder()
  .setAccessKey(accessKey)
  .setModelPath(modelPath)
  .build(appContext);

CheetahTranscript partialResult =
  o.process(getNextAudioFrame());

iOS:
let cheetah = try Cheetah(
  accessKey: accessKey,
  modelPath: modelPath)

let (partialTranscript, isEndpoint) =
  try cheetah.process(
    getNextAudioFrame())

Java:
Cheetah o = new Cheetah.Builder()
  .setAccessKey(accessKey)
  .build();

CheetahTranscript r =
  o.process(getNextAudioFrame());

.NET:
Cheetah o =
  Cheetah.Create(accessKey);

CheetahTranscript partialResult =
  o.Process(GetNextAudioFrame());

React:
const {
  result,
  isLoaded,
  isListening,
  error,
  init,
  start,
  stop,
  release,
} = useCheetah();

await init(
  accessKey,
  model
);

await start();
await stop();

useEffect(() => {
  if (result !== null) {
    // Handle transcript
  }
}, [result]);

Flutter:
_cheetah = await Cheetah.create(
  accessKey,
  modelPath);

CheetahTranscript partialResult =
  await _cheetah.process(
    getAudioFrame());

React Native:
const cheetah = await Cheetah.create(
  accessKey,
  modelPath);

const partialResult =
  await cheetah.process(
    getAudioFrame());

C:
pv_cheetah_t *cheetah = NULL;
pv_cheetah_init(
  access_key,
  model_file_path,
  endpoint_duration_sec,
  enable_automatic_punctuation,
  &cheetah);

const int16_t *pcm = get_next_audio_frame();

char *partial_transcript = NULL;
bool is_endpoint = false;
const pv_status_t status = pv_cheetah_process(
  cheetah,
  pcm,
  &partial_transcript,
  &is_endpoint);

Web:
const cheetah =
  await CheetahWorker.create(
    accessKey,
    (cheetahTranscript) => {
      // callback
    },
    {
      base64: cheetahParams,
      // or
      publicPath: modelPath,
    }
  );

WebVoiceProcessor.subscribe(cheetah);
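
The snippets above each show a single `process` call on one audio frame. In practice that call sits inside a loop that feeds frames continuously and stitches the partial results together. The following is a minimal Python sketch of such a loop, not Picovoice's implementation. `StubEngine` is a hypothetical stand-in that only mimics the `(partial_transcript, is_endpoint)` return shape shown above; its `frame_length` attribute and `flush()` method are assumptions made for illustration, and a real engine would replace the stub.

```python
from typing import Iterator, List, Sequence, Tuple


class StubEngine:
    """Hypothetical stand-in for a streaming engine (NOT the real SDK).

    process(frame) returns (partial_transcript, is_endpoint);
    flush() returns any text the engine was still holding back.
    """

    frame_length = 4  # samples per frame; assumed value for the demo

    def __init__(self, responses: Sequence[Tuple[str, bool]], tail: str = "") -> None:
        self._responses = list(responses)  # canned result for each frame
        self._tail = tail                  # text released by the first flush()

    def process(self, frame: Sequence[int]) -> Tuple[str, bool]:
        if len(frame) != self.frame_length:
            raise ValueError("frame must contain exactly frame_length samples")
        return self._responses.pop(0) if self._responses else ("", False)

    def flush(self) -> str:
        tail, self._tail = self._tail, ""
        return tail


def frames(pcm: Sequence[int], frame_length: int) -> Iterator[Sequence[int]]:
    """Yield consecutive full frames; a trailing partial frame is dropped."""
    for start in range(0, len(pcm) - frame_length + 1, frame_length):
        yield pcm[start:start + frame_length]


def transcribe_stream(engine: StubEngine, pcm: Sequence[int]) -> List[str]:
    """Feed audio frame by frame and return one string per utterance."""
    utterances: List[str] = []
    current = ""
    for frame in frames(pcm, engine.frame_length):
        partial, is_endpoint = engine.process(frame)
        current += partial  # in a UI, this is where live text would update
        if is_endpoint:
            current += engine.flush()  # collect text still buffered
            if current.strip():
                utterances.append(current.strip())
            current = ""
    # The audio ended without an endpoint: flush whatever is left.
    current += engine.flush()
    if current.strip():
        utterances.append(current.strip())
    return utterances
```

For example, with `StubEngine([("hello ", False), ("world", True), ("bye", False)], tail="!")` and 14 samples of audio, `transcribe_stream` consumes three full frames and returns `["hello world!", "bye"]`. Treating the endpoint as the utterance boundary keeps the live text responsive while still producing a clean final string per utterance.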

Batch Speech-to-Text

Unlike real-time transcription, Batch Speech-to-Text waits for the entire utterance to finish before producing a transcription. Compared to real-time approaches, this method offers higher accuracy and runtime efficiency. Because the full audio is available, it can use the words that follow, as well as those that precede, to resolve linguistic and acoustic ambiguity. It also avoids alternating between listening and transcribing, which improves overall efficiency.

For iOS-based devices, Picovoice offers Leopard Speech-to-Text, a state-of-the-art technology for batch transcription tasks. Like Cheetah, Leopard processes all voice audio data on device, ensuring privacy by design and compliance with regulations such as HIPAA and GDPR. To further improve accuracy, users can add custom vocabulary and boost specific phrases via the Picovoice Console.

Below is the list of SDKs available for Leopard, along with corresponding code snippets.

Python:
import pvleopard

o = pvleopard.create(access_key)

transcript, words = o.process_file(path)

Node.js:
const o = new Leopard(accessKey);

const { transcript, words } =
  o.processFile(path);

Android:
Leopard o = new Leopard.Builder()
  .setAccessKey(accessKey)
  .setModelPath(modelPath)
  .build(appContext);

LeopardTranscript r =
  o.processFile(path);

iOS:
let o = try Leopard(
  accessKey: accessKey,
  modelPath: modelPath)

let r = try o.processFile(path)

Java:
Leopard o = new Leopard.Builder()
  .setAccessKey(accessKey)
  .build();

LeopardTranscript r =
  o.processFile(path);

.NET:
Leopard o =
  Leopard.Create(accessKey);

LeopardTranscript result =
  o.ProcessFile(path);

React:
const {
  result,
  isLoaded,
  error,
  init,
  processFile,
  startRecording,
  stopRecording,
  isRecording,
  recordingElapsedSec,
  release,
} = useLeopard();

await init(
  accessKey,
  model
);

await processFile(audioFile);

useEffect(() => {
  if (result !== null) {
    // Handle transcript
  }
}, [result]);

Flutter:
Leopard o = await Leopard.create(
  accessKey,
  modelPath);

LeopardTranscript result =
  await o.processFile(path);

React Native:
const o = await Leopard.create(
  accessKey,
  modelPath);

const {transcript, words} =
  await o.processFile(path);

C:
pv_leopard_t *leopard = NULL;
pv_leopard_init(
  access_key,
  model_path,
  enable_automatic_punctuation,
  &leopard);

char *transcript = NULL;
int32_t num_words = 0;
pv_word_t *words = NULL;
pv_leopard_process_file(
  leopard,
  path,
  &transcript,
  &num_words,
  &words);

Web:
const leopard =
  await LeopardWorker.
    fromPublicDirectory(
      accessKey,
      modelPath
    );

const {
  transcript,
  words
} =
  await leopard.process(pcm);
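
Several of the snippets above return a list of words alongside the transcript, which is what makes a use case such as subtitle generation possible. The sketch below, in Python, shows one way word-level results could be grouped into SRT subtitle cues. It is an illustration under stated assumptions, not part of any SDK: the `Word` type and its `word`, `start_sec`, and `end_sec` fields are hypothetical names for a word with start and end times in seconds, and the `max_words` and `max_gap_sec` thresholds are arbitrary defaults.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Assumed shape of one word-level result (hypothetical field names)."""
    word: str
    start_sec: float
    end_sec: float


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7,
                 max_gap_sec: float = 0.8) -> str:
    """Group words into cues; start a new cue after max_words or a long pause."""
    cues: List[List[Word]] = []
    for w in words:
        fits_current_cue = (
            bool(cues)
            and len(cues[-1]) < max_words
            and w.start_sec - cues[-1][-1].end_sec <= max_gap_sec
        )
        if fits_current_cue:
            cues[-1].append(w)
        else:
            cues.append([w])

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.word for w in cue)
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(cue[0].start_sec)} --> "
            f"{srt_timestamp(cue[-1].end_sec)}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)  # blank line between cues, as SRT expects
```

Splitting on pauses as well as on word count tends to keep each cue aligned with a natural phrase. Both thresholds would need tuning for real content.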
