WhisperSpeech: An open-source text-to-speech system

What’s remarkable is that WhisperSpeech was built by reverse engineering OpenAI’s Whisper automatic speech recognition model.

Through this inversion, WhisperSpeech takes text as input and generates natural-sounding speech as output using the repurposed Whisper model.

The output speech is excellent in both pronunciation accuracy and naturalness.

WhisperSpeech project roadmap:

  • Acoustic token extraction: improve the process of extracting acoustic tokens.
  • Semantic token extraction: generate and quantize semantic tokens using the Whisper model.
  • S->A model: develop a model that converts semantic tokens into acoustic tokens.
  • T->S model: implement the conversion from text to semantic tokens.
  • Improve EnCodec speech quality: optimize the EnCodec model to improve text-to-speech quality.
  • Short-sentence inference optimization: improve the system’s handling of short sentences.
  • Expand the emotional speech dataset: collect a larger emotional speech dataset.
  • Document the LibriLight dataset: document the dataset on Hugging Face in detail.
  • Multilingual speech collection: gather community resources to collect speech in multiple languages.
  • Train multilingual models: develop text-to-speech models that support multiple languages.
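The roadmap implies a two-stage token pipeline: text is mapped to semantic tokens (T->S), those are mapped to acoustic tokens (S->A), and an EnCodec-style decoder turns the acoustic tokens into audio. The sketch below illustrates that data flow only; every class, function, and token mapping here is a hypothetical stand-in, not the project’s real API or models.

```python
# Toy sketch of the two-stage pipeline described in the roadmap:
# text -> semantic tokens (T->S) -> acoustic tokens (S->A) -> audio.
# All names and mappings are illustrative placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class TTSPipelineSketch:
    """Stand-in for the two model stages plus an EnCodec-style decoder."""

    def text_to_semantic(self, text: str) -> List[int]:
        # T->S stage: in the real system a model produces semantic tokens
        # derived from Whisper; here we fabricate one token id per word.
        return [hash(word) % 512 for word in text.split()]

    def semantic_to_acoustic(self, semantic: List[int]) -> List[int]:
        # S->A stage: a second model maps semantic tokens into an
        # EnCodec-like acoustic codebook (toy arithmetic mapping).
        return [(tok * 3 + 1) % 1024 for tok in semantic]

    def decode_audio(self, acoustic: List[int]) -> bytes:
        # Decoder stage: acoustic tokens -> waveform bytes (faked here).
        return bytes(tok % 256 for tok in acoustic)

    def synthesize(self, text: str) -> bytes:
        # Chain the three stages end to end.
        semantic = self.text_to_semantic(text)
        acoustic = self.semantic_to_acoustic(semantic)
        return self.decode_audio(acoustic)


pipe = TTSPipelineSketch()
audio = pipe.synthesize("hello world")
print(len(audio))  # → 2: one toy "sample" per input word
```

The point of the staged design is that each mapping can be trained and improved independently, which is exactly how the roadmap items are split up.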

GitHub: https://github.com/collabora/WhisperSpeech
Website: https://collabora.github.io/WhisperSpeech/
Online demo: https://replicate.com/lucataco/whisperspeech-small
