Open source projects for real-time and local speech-to-text, translation, and speaker differentiation

The WhisperLiveKit project is a real-time speech-to-text system that integrates recent research, including SimulStreaming, WhisperStreaming, Streaming Sortformer, and Diart, and supports multiple languages and speaker identification.

Project Introduction

"WhisperLiveKit" is an open-source project by QuentinFuxa that provides real-time, local speech-to-text, translation, and speaker differentiation (speaker diarization), with both a server-side backend and a web UI.

In other words, it converts speech into text in real time, identifies who is speaking, and can translate the text into other languages, running entirely or primarily on local hardware rather than depending on cloud services.

Core features

Specific features include:

  • Speech-to-text: Converts speech content into text.
  • Translation: Translates the transcription into a target language, using either Whisper’s built-in translation function or the NLLB backend.
  • Speaker diarization: Identifies which portions of a multi-person conversation belong to which speaker and labels them (e.g. “Speaker A”, “Speaker B”).
  • Voice Activity Detection (VAD): Detects whether anyone is speaking, reducing the overhead of processing silent or useless audio.
  • Browser UI + backend service: Provides a front-end page that records and streams microphone audio directly from the browser and displays real-time transcription, translation, and speaker labels. The backend can also be integrated directly, for example over WebSockets.
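The VAD idea above can be illustrated with a minimal sketch. This is not the project's actual VAD (real systems use trained neural detectors); it is a toy energy-threshold detector showing how silent frames can be skipped, with the frame size and threshold chosen arbitrarily for illustration:

```python
import math

def frame_energy(samples):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def simple_vad(samples, frame_size=160, threshold=0.02):
    """Label each frame True (speech) or False (silence) by RMS energy.

    Frames whose energy stays below the threshold can be dropped
    before transcription, saving compute on silent audio.
    """
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_energy(frame) > threshold)
    return flags

# Toy signal: two quiet frames, two loud frames, one quiet frame.
quiet = [0.001] * 160
loud = [0.1] * 160
print(simple_vad(quiet * 2 + loud * 2 + quiet))  # → [False, False, True, True, False]
```

A production detector must also handle gradual onsets, breathing, and background noise, which is why neural VADs are preferred in practice.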

Architecture and technical details

Some key technical/architectural details:

  • SimulStreaming, AlignAtt: a method for ultra-low-latency transcription. Whisper was designed around full sentences or longer passages and may perform poorly on small real-time fragments; recent research such as SimulStreaming with the AlignAtt policy improves streaming performance.
  • NLLB (No Language Left Behind): a large-scale multilingual translation model supporting translation between roughly 200 languages. It can be selected as the project's translation backend.
  • WhisperStreaming, LocalAgreement policy: another streaming method for lower-latency recognition.
  • Streaming Sortformer / Diart: backend options for speaker separation/diarization. Streaming Sortformer is the newer option; Diart is the older alternative.
  • Optional acceleration and hardware optimizations: for example, an optimized backend (MLX Whisper) for Apple Silicon, with GPU or CPU execution.
  • Frontend + backend communication: through mechanisms such as WebSockets, the frontend receives results in real time and displays them in the UI.
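The LocalAgreement policy mentioned above can be sketched in a few lines. This is a toy illustration of the idea, not the project's implementation: each time the recognizer re-decodes the growing audio buffer, only the token prefix on which two consecutive hypotheses agree is committed, so unstable tail tokens are never emitted prematurely:

```python
def longest_common_prefix(a, b):
    """Longest shared token prefix of two hypotheses."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

class LocalAgreement:
    """Commit tokens only once two consecutive streaming hypotheses agree."""

    def __init__(self):
        self.previous = []   # previous full hypothesis
        self.committed = []  # tokens already emitted

    def update(self, hypothesis):
        # The stable region is where the new hypothesis agrees with the old one.
        stable = longest_common_prefix(self.previous, hypothesis)
        new_tokens = stable[len(self.committed):]
        self.committed = stable
        self.previous = hypothesis
        return new_tokens  # tokens newly committed this step

agree = LocalAgreement()
print(agree.update(["the", "cat"]))               # → [] (nothing to compare yet)
print(agree.update(["the", "cat", "sat"]))        # → ['the', 'cat']
print(agree.update(["the", "cat", "sat", "on"]))  # → ['sat']
```

The trade-off is visible even in this sketch: committing waits one extra hypothesis, which adds a little latency but prevents flickering, self-correcting output.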

How to use

The quick start is roughly as follows:

  1. Install the package with pip: pip install whisperlivekit.
  2. Start the server, e.g. whisperlivekit-server --model base --language en. This launches a service that accepts audio input and outputs text.
  3. Open the address in a browser (default localhost:8000); the front-end page captures your microphone audio and displays the transcription in real time.
  4. Add parameters to control whether to run diarization, whether to translate, the model size, the language, and so on.
  5. Docker deployment is supported for production use.
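The steps above condense to two commands (model and language values are the examples from the text; other flags exist but are omitted here):

```shell
# Install the package, then start the server
pip install whisperlivekit
whisperlivekit-server --model base --language en
# Then open http://localhost:8000 in a browser to use the UI
```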

Advantages and limitations

Here are, in my opinion, the advantages and the possible limitations or challenges of this project.

Pros:

  • Strong real-time performance: Whisper is adapted for streaming/incremental processing and buffering to support low latency, making it suitable for meetings, conversations, live broadcasts, and similar scenarios.
  • Local (or partially local) processing: reduces dependence on the network and cloud services, which benefits privacy, security, and latency control.
  • Speaker differentiation + multilingual support: This makes it more versatile than just ASR.
  • Complete front end and back end: provides a UI, a server, and extensibility, so users can set it up quickly and customize it.

Limitations/challenges:

  • Resource consumption: real-time recognition, translation, and speaker separation together place significant demands on models and compute, especially with large models or in noisy, multi-speaker environments. A GPU or a strong CPU is recommended.
  • Latency vs. accuracy trade-off: reducing latency can sacrifice accuracy in certain situations, e.g. incomplete context or sentences cut at unnatural boundaries.
  • Recognition and translation quality vary by environment: language, accent, and background noise can affect recognition quality considerably, and translation models have their own limitations.
  • Deployment complexity: although Docker and various devices are supported, deploying at scale or on constrained hardware (edge or embedded devices) may require substantial tuning.

Application scenarios

It can be used in the following places:

  • Automatic captioning and translation for online meetings
  • Educational settings (classrooms, lectures), helping hearing-impaired or non-native speakers follow the content
  • Call-center recording with automatic translation and identification of different speakers
  • Podcast or interview recording and post-processing, displaying transcription/translation while recording
  • Live streaming (live video with subtitles)

GitHub: https://github.com/QuentinFuxa/WhisperLiveKit

