What’s remarkable is that WhisperSpeech was built by inverting OpenAI’s Whisper automatic speech recognition model: where Whisper maps speech to text, the inverted pipeline goes the other way.
Through this inversion, WhisperSpeech takes text input and generates natural-sounding speech output using the modified Whisper model.
The resulting speech is excellent in both pronunciation accuracy and naturalness.
WhisperSpeech project roadmap:
- Acoustic token extraction: improve the process of extracting acoustic tokens.
- Semantic token extraction: generate and quantize semantic tokens using the Whisper model.
- S -> A model: develop a model that converts semantic tokens into acoustic tokens.
- T -> S model: implement the conversion from text to semantic tokens.
- Improve EnCodec speech quality: optimize the EnCodec model to improve text-to-speech quality.
- Short-sentence inference optimization: improve the system's handling of short sentences.
- Expand the emotional speech dataset: collect larger amounts of emotional speech data.
- Document the LibriLight dataset: document the dataset in detail on HuggingFace.
- Multilingual speech collection: gather community resources to collect speech in multiple languages.
- Train multilingual models: develop text-to-speech models that support multiple languages.
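The roadmap above reflects WhisperSpeech's two-stage token pipeline: a T -> S model turns text into semantic tokens, an S -> A model turns those into acoustic tokens, and an EnCodec decoder turns the acoustic tokens into a waveform. As a rough sketch of that flow, here is a minimal Python skeleton with stub functions standing in for the real neural models; every name here is illustrative, not the project's actual API:

```python
# Illustrative sketch of the inference flow described in the roadmap.
# The stub functions stand in for real neural models; none of these
# names come from the WhisperSpeech codebase.

def text_to_semantic_tokens(text: str) -> list[int]:
    """T -> S stage: map text to Whisper-derived semantic tokens (stubbed)."""
    # Stub: one fake token id per word, in a 1024-entry vocabulary.
    return [hash(word) % 1024 for word in text.split()]

def semantic_to_acoustic_tokens(semantic: list[int]) -> list[int]:
    """S -> A stage: map semantic tokens to EnCodec acoustic tokens (stubbed)."""
    # Stub: acoustic sequences are typically longer than semantic ones,
    # so emit two acoustic tokens per semantic token.
    return [t % 512 for t in semantic for _ in range(2)]

def decode_to_waveform(acoustic: list[int]) -> list[float]:
    """EnCodec decoder stage: acoustic tokens -> audio samples (stubbed)."""
    return [t / 512.0 for t in acoustic]

def tts(text: str) -> list[float]:
    """End-to-end text-to-speech: T -> S, then S -> A, then EnCodec decode."""
    semantic = text_to_semantic_tokens(text)
    acoustic = semantic_to_acoustic_tokens(semantic)
    return decode_to_waveform(acoustic)
```

In the real system each stub is a trained model (the T -> S and S -> A items on the roadmap), but the staged token-passing structure is the same.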
GitHub: https://github.com/collabora/WhisperSpeech
Website: https://collabora.github.io/WhisperSpeech/
Online demo: https://replicate.com/lucataco/whisperspeech-small