Emilia: an open-source, multilingual, high-quality speech dataset

More than 100,000 hours in six languages

CUHK (Shenzhen), together with the Institute of Acoustics of the Chinese Academy of Sciences, the Shanghai Artificial Intelligence Laboratory, and other institutions, has released Emilia, a diverse speech generation dataset with more than 100,000 hours of audio in six languages.

Emilia is an open-source, multilingual, in-the-wild speech dataset designed for large-scale speech generation research. It contains more than 101,000 hours of high-quality speech data with corresponding text transcriptions in six languages, covering a variety of speaking styles and content types, such as talk shows, interviews, debates, sports commentary, and audiobooks.

Target audience:

The Emilia dataset is aimed at scholars and researchers who need to conduct large-scale speech generation research, especially those who focus on multilingual text-to-speech and speech recognition technologies.

Example usage scenarios:

Developing multilingual text-to-speech systems
Serving as training data to improve the accuracy of speech recognition algorithms
Supporting language learning and pronunciation teaching in education

Key features:

Provides more than 101,000 hours of high-quality speech data in six languages
Contains speech and text transcriptions in Chinese, English, Japanese, Korean, German, and French
Sourced from a diverse range of video platforms and podcasts on the Internet, with rich content types
Supports data preprocessing with Emilia-Pipe, an open-source preprocessing pipeline
Allows researchers to download the raw audio files and reconstruct the dataset
Emilia-Pipe supports customized speech data preprocessing to meet specific research needs
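
As a rough illustration of what a customized preprocessing step might look like, here is a minimal Python sketch that filters transcript records by language and clip length, then totals the hours kept per language. It is not part of Emilia-Pipe. The record fields (id, language, duration, text), the sample values, and the function names filter_records and hours_per_language are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical metadata records. The field names and values are assumptions
# for illustration and are not taken from the Emilia release.
SAMPLE_RECORDS = [
    {"id": "utt-0001", "language": "en", "duration": 6.0, "text": "Hello there."},
    {"id": "utt-0002", "language": "zh", "duration": 4.5, "text": "你好。"},
    {"id": "utt-0003", "language": "en", "duration": 31.0, "text": "A long clip."},
    {"id": "utt-0004", "language": "de", "duration": 3.0, "text": ""},
]


def filter_records(records, languages, min_sec=1.0, max_sec=30.0):
    """Keep records in the wanted languages whose duration (in seconds)
    lies within [min_sec, max_sec] and whose transcript is not empty."""
    wanted = set(languages)
    return [
        r for r in records
        if r["language"] in wanted
        and min_sec <= r["duration"] <= max_sec
        and r["text"].strip()
    ]


def hours_per_language(records):
    """Sum clip durations per language and convert seconds to hours."""
    totals = defaultdict(float)
    for r in records:
        totals[r["language"]] += r["duration"] / 3600.0
    return dict(totals)


if __name__ == "__main__":
    kept = filter_records(SAMPLE_RECORDS, ["en", "zh", "de"])
    print([r["id"] for r in kept])
    print(hours_per_language(kept))
```

The thresholds here (1 to 30 seconds) are arbitrary defaults chosen for the sketch; a real study would pick limits that suit its own model and data.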

To learn more, follow the links below the video.
Thank you for watching. If you enjoyed this video, please like and subscribe.

ArXiv: https://arxiv.org/abs/2407.05361
GitHub: https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia
Homepage: https://emilia-dataset.github.io/Emilia-Demo-Page/
HuggingFace: https://huggingface.co/datasets/amphion/Emilia

YouTube: