Text-to-speech has made tremendous technological progress, reflecting the long-standing human pursuit of machines that speak like us.
As we enter an era in which interacting with digital assistants and conversational agents has become commonplace, the need for voices that echo the naturalness and expressiveness of human communication is more urgent than ever. At the core of this challenge is synthesizing speech that not only sounds human but also conforms to an individual's subtle preferences, such as tone, speed, and emotional expression.
A research team at Fudan University has developed SpeechAlign, an innovative framework that targets the core of text-to-speech: making generated speech consistent with human preferences. Unlike traditional models that prioritize technical accuracy alone, SpeechAlign introduces a significant shift by incorporating human feedback directly into speech generation. This feedback loop ensures that the generated speech is not only technically sound but also resonates on a human level.
SpeechAlign stands out through its systematic approach to learning from human feedback. It constructs a preference dataset that pairs preferred "golden" speech samples with less preferred synthetic ones. This comparison dataset is the basis for a series of preference-optimization steps that iteratively refine the speech model. Each iteration moves the model closer to understanding and replicating human voice preferences, with success measured by both objective indicators and subjective human assessments.
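One common way to learn from such golden-versus-synthetic preference pairs is Direct Preference Optimization (DPO), which fits the iterative refinement described above. Below is a minimal sketch of a DPO-style loss for a single pair; the function name, arguments, and `beta` value are our own illustrative choices, not taken from the paper's implementation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one preference pair (illustrative sketch).

    'chosen' is the preferred golden speech sample, 'rejected' is the
    synthetic one. Inputs are sequence log-probabilities under the
    current policy and a frozen reference model.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Loss shrinks as the policy ranks the golden sample above the
    # synthetic one relative to the reference model.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With zero margin the loss is log 2; as the policy prefers the golden sample more strongly than the reference does, the loss decreases, which is what drives each refinement iteration.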
SpeechAlign demonstrates its power through a comprehensive set of evaluations, ranging from subjective assessments, in which human listeners rate the naturalness and quality of the speech, to objective measures such as word error rate (WER) and speaker similarity (SIM). The SpeechAlign-optimized model reduced WER by 0.8 compared to the baseline and raised the speaker similarity score to 0.90. These indicators mark a technical advance and a closer imitation of the human voice and its many nuances.
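For context on the WER metric mentioned above: it is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between a transcript of the generated speech and the reference text, divided by the reference length. A self-contained sketch (not the paper's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # delete all remaining ref words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insert all remaining hyp words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A "0.8 reduction" in WER on this scale means the optimized model makes noticeably fewer word-level mistakes per reference word than the baseline; SIM, by contrast, is usually a cosine similarity between speaker embeddings, where 0.90 indicates a close voice match.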
SpeechAlign also demonstrated its versatility across different model sizes and datasets, proving that the method is powerful enough to enhance smaller models and that its improvements extend to unseen speakers. This is critical for deploying text-to-speech technology in diverse scenarios, ensuring that SpeechAlign's benefits are broadly applicable rather than limited to specific cases or datasets.
In summary, the SpeechAlign research addresses the key challenge of aligning synthesized speech with human preferences, a gap that traditional models have struggled to bridge. The method innovatively incorporates human feedback into an iterative self-improvement strategy, fine-tuning speech models through a detailed understanding of human preferences and delivering quantitative gains on key indicators such as WER and SIM. These results highlight SpeechAlign's effectiveness in enhancing the naturalness and expressiveness of synthesized speech.
If you want to learn more, you can click on the link below the video.
Thank you for watching this video. If you enjoyed it, please like and subscribe. Thanks!
Quick reading: https://marktechpost.com/2024/04/10/speechalign-transforming-speech-synthesis-with-human-feedback-for-enhanced-naturalness-and-expressiveness-in-technological-interactions/
Paper: https://arxiv.org/abs/2404.0560
Github: https://github.com/0nutation/SpeechGPT?tab=readme-ov-file
Video: