The model has 1.2 billion parameters and was trained on 100,000 hours of speech data.
- Emotionally expressive English speech
- Cross-lingual voice cloning
- Zero-shot cloning of American and British voices
- Long-form text-to-speech
Main features:
1. Emotional prosody and tone: MetaVoice-1B focuses on emotional expression in English speech, producing smooth, natural output without hallucinations.
2. Cross-lingual voice cloning: supports cloning a voice across languages via fine-tuning; for an Indian speaker, for example, as little as one minute of training data is enough for a successful clone.
3. Zero-shot cloning: for American and British voices, MetaVoice can clone a speaker from only 30 seconds of reference audio, with no fine-tuning.
4. Long-form synthesis: well suited to text-to-speech over long text content.
How it works:
1. Causal GPT prediction: MetaVoice uses a causal GPT-style model to turn text into speech tokens; like any causal language model, it predicts the next token from the tokens that precede it.
In MetaVoice, this model predicts the first two hierarchies of EnCodec tokens, which capture the coarse structure of the speech. The prediction is conditioned on both the text and a reference audio sample, so the generated speech is faithful to the text and sounds natural.
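To make "predicting two token hierarchies with one causal model" concrete, the two parallel EnCodec streams can be interleaved into a single flat sequence that a GPT-style model consumes left to right. This is a minimal sketch; the actual interleaving scheme and vocabulary layout MetaVoice uses may differ.

```python
import numpy as np

def flatten_hierarchies(level1, level2):
    """Interleave two parallel EnCodec token streams into one causal
    sequence: [l1_t0, l2_t0, l1_t1, l2_t1, ...] so a GPT-style model
    can predict both hierarchies autoregressively."""
    level1, level2 = np.asarray(level1), np.asarray(level2)
    assert level1.shape == level2.shape
    flat = np.empty(level1.size * 2, dtype=level1.dtype)
    flat[0::2] = level1
    flat[1::2] = level2
    return flat

def unflatten(flat):
    """Inverse: split the interleaved stream back into the two levels."""
    return flat[0::2], flat[1::2]

# Example: two token hierarchies covering 4 time steps.
l1 = [10, 11, 12, 13]
l2 = [20, 21, 22, 23]
seq = flatten_hierarchies(l1, l2)
print(seq.tolist())  # [10, 20, 11, 21, 12, 22, 13, 23]
```

The interleaving keeps both hierarchies time-aligned, so at every step the model has seen all coarser tokens for the same time step before predicting the finer one.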
2. Speaker conditioning: to make the generated speech imitate a specific speaker, MetaVoice injects speaker information at the token embedding layer. The speaker representation comes from a separately trained speaker verification network, which captures speaker-specific attributes such as timbre and accent. Fusing this representation into the model lets MetaVoice generate speech that resembles the target speaker's voice.
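Conditioning at the embedding layer can be sketched as adding a fixed speaker vector to every token embedding. The sizes are toy and the simple addition is an assumption for illustration; MetaVoice's exact fusion mechanism may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL = 1024, 16  # toy sizes; the real model is far larger
token_embedding = rng.normal(size=(VOCAB, D_MODEL))

def condition_on_speaker(token_ids, speaker_vec):
    """Fuse a fixed speaker vector (e.g. from a speaker-verification
    network) into the token embeddings by addition, so every position
    carries the target speaker's identity.

    NOTE: additive fusion is an illustrative assumption, not a claim
    about MetaVoice's exact architecture."""
    x = token_embedding[token_ids]   # (T, D_MODEL)
    return x + speaker_vec[None, :]  # broadcast the same vector over time

speaker_vec = rng.normal(size=D_MODEL)  # stand-in for a speaker d-vector
h = condition_on_speaker(np.array([1, 5, 7]), speaker_vec)
print(h.shape)  # (3, 16)
```

Because the speaker vector comes from a verification network rather than a text label, any reference audio can supply the conditioning signal at inference time.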
3. Non-causal transformer for the remaining hierarchies: MetaVoice then uses a small non-causal (encoder-style) transformer to predict the remaining six hierarchies of EnCodec tokens. At only about 10 million parameters the model is tiny, yet it predicts the finer details of the speech efficiently and accurately. Because it is non-causal, it can process all time steps in parallel, which speeds up generation.
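The causal/non-causal distinction comes down to the attention mask: the GPT stage may only attend to the past, while the encoder-style refiner sees the whole sequence at once and can therefore fill in all time steps in parallel. A minimal sketch of the two masks:

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Boolean attention mask where True means position i may attend
    to position j. A causal model (the GPT stage) sees only current
    and past positions; a non-causal (encoder-style) model sees the
    whole sequence, which is why it can refine all time steps at once."""
    if causal:
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.ones((seq_len, seq_len), dtype=bool)

print(attention_mask(3, causal=True).astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
print(attention_mask(3, causal=False).astype(int))
# [[1 1 1]
#  [1 1 1]
#  [1 1 1]]
```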
4. Multi-band diffusion for waveform generation: multi-band diffusion converts the EnCodec tokens into a detailed waveform, the final audio output. By processing different frequency bands of the audio signal independently, this method improves sound quality and yields clearer, more natural speech.
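The "multi-band" part means the signal is split into frequency bands that are handled independently and then recombined. The diffusion model itself is out of scope here, but the band split/recombine step can be sketched with FFT masks; the band edges below are arbitrary example values.

```python
import numpy as np

def split_bands(signal, sample_rate, edges):
    """Split a waveform into frequency bands using FFT masks.

    Multi-band diffusion runs a separate process per band; this sketch
    shows only the split step, which is lossless: summing the bands
    reconstructs the original signal exactly."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    bands, lo = [], 0.0
    for hi in list(edges) + [sample_rate / 2 + 1]:
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(spectrum * mask, n=len(signal)))
        lo = hi
    return bands

sr = 16000
t = np.arange(sr) / sr
# Test tone with a low (220 Hz) and a high (3 kHz) component.
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
bands = split_bands(x, sr, edges=[1000, 4000])
print(len(bands))                   # 3
print(np.allclose(sum(bands), x))   # True: the bands sum back to the input
```

Because the masks partition the spectrum, each band can be modelled at a resolution suited to its frequency range without losing information overall.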
5. DeepFilterNet for background-noise removal: the generated speech may contain unwanted background noise, particularly artifacts introduced by the multi-band diffusion process. To address this, MetaVoice applies DeepFilterNet, a network designed specifically for noise suppression. This final step makes the output clearer and more natural, improving the listening experience.
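DeepFilterNet itself is a learned deep filtering network, but the goal it serves can be illustrated with a classic, much cruder technique: spectral gating, which zeroes low-energy frequency bins frame by frame. This is explicitly not DeepFilterNet's algorithm, only an intuition for what noise suppression on the output stage does.

```python
import numpy as np

def spectral_gate(noisy, frame=256, threshold=0.1):
    """Crude spectral gating: in each frame, zero out frequency bins
    whose magnitude falls below a fraction of the frame's peak.

    NOT DeepFilterNet (a learned deep filtering network) -- just a
    classic baseline with the same aim: suppress low-energy background
    noise while keeping the dominant speech components."""
    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        mag = np.abs(spec)
        spec[mag < threshold * mag.max()] = 0.0
        out[start:start + frame] = np.fft.irfft(spec, n=frame)
    return out

# A 448 Hz tone (aligned to the 256-sample frames) plus white noise.
n = 8192
t = np.arange(n) / n
clean = np.sin(2 * np.pi * 448 * t)
noisy = clean + 0.05 * np.random.default_rng(0).normal(size=n)
denoised = spectral_gate(noisy)
print(np.mean((denoised - clean) ** 2) < np.mean((noisy - clean) ** 2))  # True
```

Real systems like DeepFilterNet learn the filtering from data instead of using a fixed threshold, which is what lets them separate speech from noise far more cleanly.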
Model download: https://huggingface.co/metavoiceio/metavoice-1B-v0.1
GitHub: https://github.com/metavoiceio/metavoice-src
Online demo: https://ttsdemo.themetavoice.xyz