Falcon Speaker Diarization

Find "who spoke when" in Whisper or other STT transcripts

On-device speaker diarization that identifies "who spoke when" in multi-speaker audio by tagging segments with labels such as Speaker 1 and Speaker 2, making AI transcripts readable and analyzable.

What is Falcon Speaker Diarization?

Falcon Speaker Diarization identifies speakers in an audio stream by finding speaker change points and grouping speech segments based on speaker voice characteristics.

Powered by deep learning, Falcon Speaker Diarization enables machines and humans to read and analyze conversation transcripts created by Speech-to-Text APIs or SDKs.

Get started with just a few lines of code

Python

f = pvfalcon.create(access_key)

segments = f.process_file(path)

C

pv_falcon_t *falcon = NULL;
pv_falcon_init(
    access_key,
    model_path,
    &falcon);

int32_t num_segments = 0;
pv_segment_t *segments = NULL;
pv_falcon_process_file(
    falcon,
    path,
    &num_segments,
    &segments);

Web (JavaScript)

const f = await FalconWorker.create(accessKey);

const segments = await f.process(pcm);

Android (Java)

Falcon f = new Falcon.Builder()
    .setAccessKey(accessKey)
    .build(appContext);

FalconSegment[] segments = f.processFile(path);

iOS (Swift)

let f = Falcon(accessKey: accessKey)

let segments = f.processFile(path)
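Each call above returns the detected segments. As a sketch of what to do with them, the helper below renders segments as a readable timeline, assuming the three fields the Python snippet's return type exposes (speaker_tag, start_sec, end_sec); a local namedtuple stands in for the SDK type so the sketch is self-contained:

```python
from collections import namedtuple

# Local stand-in for the SDK's segment type, assuming the same three fields:
# speaker_tag, start_sec, end_sec.
Segment = namedtuple("Segment", ["speaker_tag", "start_sec", "end_sec"])

def format_timeline(segments):
    """Render diarization segments as 'Speaker N [start - end]' lines."""
    return [
        f"Speaker {s.speaker_tag} [{s.start_sec:.2f}s - {s.end_sec:.2f}s]"
        for s in segments
    ]

segments = [Segment(1, 0.0, 3.25), Segment(2, 3.25, 7.10), Segment(1, 7.10, 9.00)]
for line in format_timeline(segments):
    print(line)  # e.g. "Speaker 1 [0.00s - 3.25s]"
```

Substituting the segments returned by any of the SDKs above yields the same kind of "who spoke when" summary.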
Why Falcon Speaker Diarization is needed for conversations transcribed by Whisper Speech-to-Text

Whisper and other speech-to-text engines leave a critical question unanswered while transcribing conversations: "Who said what?" Without speaker identification, meeting transcripts, interviews, and multi-speaker content become difficult to understand and analyze.

Falcon Speaker Diarization fills this gap. As the only modular speaker diarization SDK, Falcon works with any speech-to-text engine and adds precise "Speaker 1, Speaker 2" labels. Unlike cloud APIs, Falcon Speaker Diarization keeps all processing on-device, ensuring privacy.

Why choose Falcon Speaker Diarization over other Speaker Diarization tools?

Frequently asked questions

What is Speaker Diarization?

Speaker Diarization deals with identifying “who spoke when”. Speaker Diarization splits an audio stream that contains human speech into homogeneous segments using speaker voice characteristics, then associates each segment with an individual speaker.

What are the steps in Speaker Diarization?

Speaker Diarization consists of two main steps: speaker segmentation and speaker clustering. Speaker segmentation focuses on finding speaker change points in an audio stream. Clustering groups speech segments together based on speakers’ voice characteristics.
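The two steps can be illustrated on toy per-frame speaker embeddings. This is a didactic sketch, not Falcon's algorithm: real engines use learned neural embeddings, and the cosine-distance measure and thresholds here are illustrative assumptions:

```python
import math

def cosine_dist(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def segment(frames, threshold=0.5):
    """Step 1 (segmentation): split where consecutive embeddings differ sharply."""
    boundaries = [0]
    for i in range(1, len(frames)):
        if cosine_dist(frames[i - 1], frames[i]) > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]

def cluster(frames, segments, threshold=0.5):
    """Step 2 (clustering): greedily group segments with close mean embeddings."""
    def mean(seg):
        start, end = seg
        dim = len(frames[0])
        return [sum(f[d] for f in frames[start:end]) / (end - start) for d in range(dim)]

    centroids = []  # one mean embedding per discovered speaker
    labels = []
    for seg in segments:
        m = mean(seg)
        for tag, c in enumerate(centroids):
            if cosine_dist(m, c) <= threshold:  # close to a known speaker
                labels.append(tag + 1)
                break
        else:  # no match: a new speaker appears
            centroids.append(m)
            labels.append(len(centroids))
    return labels

frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
segs = segment(frames)          # [(0, 2), (2, 4), (4, 5)]
print(cluster(frames, segs))    # [1, 2, 1]: first and last segments share a speaker
```

Note how clustering lets the same speaker be recognized when they talk again later in the stream, which is what turns raw change points into stable Speaker 1 / Speaker 2 labels.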

How does Speaker Diarization differ from Speech-to-Text?

Speech-to-Text deals with “what is said.” It converts speech into text without distinguishing speakers, i.e., “who?”. Speech-to-text with timestamps also includes timing information, i.e., “when”.

Speaker Diarization differentiates speakers, answering “who spoke, when” without analyzing “what’s said.” Thus, developers use Speech-to-Text and Speaker Diarization together to identify “who said what and when.”

In short, Speaker Diarization and Speech-to-Text are complementary speech-processing technologies. Speaker Diarization enhances Speech-to-Text transcripts of conversations where multiple speakers are involved. The combined result tags each word with a number assigned to an individual speaker, using as many distinct numbers as there are speakers that Speaker Diarization can uniquely identify in the audio sample.
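Stitching the two together looks roughly like this: take timestamped words from any Speech-to-Text engine and assign each word the speaker tag of the diarization segment it overlaps most. The tuple shapes below are illustrative assumptions, not any specific SDK's types:

```python
def label_words(words, segments):
    """words: [(text, start_sec, end_sec)] from a Speech-to-Text engine.
    segments: [(speaker_tag, start_sec, end_sec)] from Speaker Diarization.
    Returns [(speaker_tag, text)] by maximum temporal overlap."""
    labeled = []
    for text, w_start, w_end in words:
        best_tag, best_overlap = None, 0.0
        for tag, s_start, s_end in segments:
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_tag, best_overlap = tag, overlap
        labeled.append((best_tag, text))
    return labeled

words = [("hello", 0.1, 0.5), ("there", 0.6, 1.0), ("hi", 1.3, 1.6)]
segments = [(1, 0.0, 1.2), (2, 1.2, 2.0)]
print(label_words(words, segments))  # [(1, 'hello'), (1, 'there'), (2, 'hi')]
```

The same merge works with word timestamps from Whisper, Leopard, or any other engine that reports per-word timing.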

Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text are Picovoice’s Speech-to-Text engines. Leopard Speech-to-Text is ideal for batch audio transcription, while Cheetah Streaming Speech-to-Text is for real-time transcription.

How does Speaker Diarization differ from Speaker Recognition?

Speaker Diarization and Speaker Recognition are related but distinct technologies enabling different use cases. Both identify speakers by analyzing their voice characteristics. Speaker Recognition identifies “known” speakers, whereas Speaker Diarization differentiates speakers without knowing who they are. Speaker Recognition returns the recorded names of enrolled speakers, such as Jane and Joe, and cannot identify speakers without enrolled voice prints. Speaker Diarization, on the other hand, returns labels such as Speaker 1 and Speaker 2 without requiring voice prints. Speaker Diarization does not transfer information between audio files, meaning a speaker can be Speaker 1 in one file and Speaker 2 in another.

In short, Speaker Recognition can verify speakers, whereas Speaker Diarization does not match voice characteristics to verify speakers. Check out Eagle Speaker Recognition and its web demo to learn more about speaker recognition.

What can I build with Speaker Diarization?

Enterprises, from medical and legal practices to call centers, leverage audio transcription to transcribe calls, meetings, and conversations. Speaker Diarization plays a critical role by improving the readability of transcripts and enabling further analysis.

What is engine-agnostic Speaker Diarization?

Most vendors offer Speaker Diarization embedded into their Speech-to-Text software, since developers use Speaker Diarization to identify speakers within a transcript provided by Speech-to-Text. Offering them jointly simplifies the development process. However, it limits developers' ability to choose what works best for them. Engine-agnostic Speaker Diarization works with any Speech-to-Text software. Developers who are unsatisfied with the performance of embedded Speaker Diarization, or who prefer a Speech-to-Text software that doesn't offer embedded Speaker Diarization, can use Falcon Speaker Diarization with the Speech-to-Text of their choice.

Can I use Falcon Speaker Diarization with OpenAI Whisper Speech-to-Text?

Yes, you can use Falcon Speaker Diarization with OpenAI’s Whisper Speech-to-Text or any other automatic speech recognition engine, including but not limited to Amazon Transcribe, Google Speech-to-Text, and Microsoft Azure Speech-to-Text.

Does Falcon Speaker Diarization require knowing the number of speakers ahead of time?

No. Falcon Speaker Diarization automatically detects and labels speakers without requiring any prior knowledge of speaker count.

What's the maximum number of speakers Falcon Speaker Diarization supports?

There is no limit on the number of speakers that Falcon Speaker Diarization supports; it works with an unlimited number of speakers.

Does Falcon Speaker Diarization support real-time Speaker Diarization?

Falcon Speaker Diarization doesn’t support real-time Speaker Diarization out of the box. Enterprise Plan customers can work with Picovoice Consulting for custom development.

Which platforms does Falcon Speaker Diarization support?

Falcon Speaker Diarization ships with SDKs for Python, C, Web, Android, and iOS, covering desktop, mobile, and in-browser deployments.

How do I get technical support for Falcon Speaker Diarization?

The Picovoice docs, blog, Medium posts, and GitHub repositories are great resources for learning about voice AI, Picovoice technology, and how to perform speaker diarization. Enterprise customers get dedicated, application-specific support from the Picovoice Product & Engineering teams. Existing customers can reach out to their contacts, and prospects can purchase Enterprise Support before committing to a paid plan.

How can I get informed about updates and upgrades?

Version changes appear in the release notes and on LinkedIn. Watching Picovoice on GitHub is the best way to get notified of patch releases. If you enjoy building with Falcon Speaker Diarization, show it by giving a GitHub star!