Whisper WebGPU: Using OpenAI Whisper for real-time in-browser speech recognition

The following is a translation of the original article:

Real-time speech recognition that runs directly in a web browser has long been a sought-after milestone. Whisper WebGPU, developed by a Hugging Face engineer known as “Xenova”, uses OpenAI’s Whisper model to achieve exactly that. It marks a significant shift in how users interact with AI-driven web applications.

At the core of Whisper WebGPU is Whisper-base, a 73-million-parameter speech recognition model optimized for inference on the web. At approximately 200 MB, it is designed to be lightweight yet capable, making it well suited to real-time applications. Once downloaded, the model is cached for future use, so subsequent sessions start quickly.

The real innovation of Whisper WebGPU is that it runs entirely in the user’s browser. It uses Hugging Face Transformers.js and ONNX Runtime Web to perform all computation locally, without sending any data to a server. This improves privacy and allows the application to keep working when the device is offline: once the model has loaded, users can disconnect from the Internet and still use Whisper’s speech recognition.
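
As a rough illustration of the local-execution path described above, the sketch below picks an execution backend by checking whether the browser exposes WebGPU. The helper name `pickDevice` and the `NavigatorLike` type are hypothetical, and the `"webgpu"` and `"wasm"` strings are assumed to match the `device` option accepted by Transformers.js v3; this is not taken from the project's own code.

```typescript
// Backends this sketch chooses between
// (assumed to correspond to Transformers.js v3 `device` values).
type Device = "webgpu" | "wasm";

// Minimal shape of the browser's `navigator` object that the check needs.
interface NavigatorLike {
  gpu?: unknown;
}

// Hypothetical helper: prefer WebGPU when the browser exposes `navigator.gpu`,
// otherwise fall back to the WebAssembly backend.
function pickDevice(nav: NavigatorLike | undefined): Device {
  return nav !== undefined && nav.gpu != null ? "webgpu" : "wasm";
}
```

In a page, the returned value would typically be passed as the `device` option when the speech-recognition pipeline is created, so that browsers without WebGPU can still run the model, only more slowly.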

One aspect that makes Whisper WebGPU stand out is its use of ONNX (Open Neural Network Exchange) weights. ONNX is an open-source format for AI models that allows models trained in different frameworks to be shared and used interchangeably. Xenova’s approach of building repositories with the ONNX weights in a dedicated subfolder named “onnx” sets a precedent for future web-ready models. As WebML (Web Machine Learning) technology matures, this interim solution is expected to evolve toward more streamlined integration.
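
To make the “onnx” subfolder convention concrete, here is a minimal sketch of how a client might build the download URL for a weight file in such a repository. The function name `onnxWeightUrl`, the repository id `your-org/your-model`, and the file names are placeholders chosen for illustration; the actual files vary from repository to repository.

```typescript
// Build the download URL for a weight file kept in a repository's "onnx" subfolder.
// The `/resolve/<revision>/` segment is the Hugging Face Hub route for raw file downloads.
function onnxWeightUrl(repoId: string, fileName: string, revision: string = "main"): string {
  return `https://huggingface.co/${repoId}/resolve/${revision}/onnx/${fileName}`;
}

// Placeholder repository id and file names, for illustration only.
const exampleFiles: string[] = ["encoder_model.onnx", "decoder_model_merged.onnx"];
const exampleUrls: string[] = exampleFiles.map((file) =>
  onnxWeightUrl("your-org/your-model", file),
);
```

Keeping the ONNX files in one predictable subfolder means a loader only needs the repository id and a file name to locate the weights, which is the property that makes the layout convenient for web use.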

Whisper WebGPU is not just about on-device processing; it is about doing so with remarkable versatility. The model supports multilingual transcription in 100 languages, making it a broadly applicable tool for speech recognition. Whether for transcription, translation, or assistive applications, Whisper WebGPU brings real-time capabilities to the web.
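
The difference between transcription and translation can be sketched as a small request object. Whisper distinguishes a `transcribe` task (text in the spoken language) from a `translate` task (text in English). The `WhisperRequest` type and `buildRequest` helper below are hypothetical; the `task` and `language` field names are assumed to mirror the options the Transformers.js speech-recognition pipeline accepts.

```typescript
// The two tasks Whisper distinguishes: keep the spoken language, or translate to English.
type WhisperTask = "transcribe" | "translate";

// Hypothetical request shape; field names are assumed to mirror pipeline options.
interface WhisperRequest {
  task: WhisperTask;
  language?: string; // omitted => leave the choice to the runtime's default behavior
}

// Normalize the optional language name and drop it entirely when it is blank.
function buildRequest(task: WhisperTask, language?: string): WhisperRequest {
  const normalized = language?.trim().toLowerCase();
  return normalized ? { task, language: normalized } : { task };
}
```

For example, `buildRequest("translate", "French")` describes turning French speech into English text, while `buildRequest("transcribe", "French")` describes a French transcript.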

In short, Xenova’s Whisper WebGPU represents a paradigm shift in how AI is understood and used on the web. Its real-time in-browser speech recognition, support for 100 languages, and solid foundation in ONNX and Transformers.js set a new standard for web-based AI applications.

To learn more, follow the links below the video.

Full article: https://marktechpost.com/2024/06/08/whisper-webgpu-real-time-in-browser-speech-recognition-with-openai-whisper/
Project: https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu
GitHub: https://github.com/xenova/transformers.js/tree/v3/examples/webgpu-whisper
X post: https://x.com/Marktechpost/status/1799469927876980919

YouTube: