StreamVC: Real-time low-latency speech conversion

The following is translated from the original:

Abstract. Google presents StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre of any target speech.
Unlike previous approaches, StreamVC produces the resulting waveform from the input signal with low latency, even on mobile platforms, making it suitable for real-time communication scenarios such as calls and video conferencing, and for use cases such as voice anonymization in those scenarios.
The design leverages the architecture and training strategy of the SoundStream neural audio codec to achieve lightweight, high-quality speech synthesis.
The work demonstrates the feasibility of learning soft speech units causally, and the effectiveness of injecting whitened fundamental-frequency (F0) information to improve pitch stability without leaking the source speaker's timbre.
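To make the "whitened fundamental frequency" idea concrete: one plausible reading is a per-utterance normalization of the F0 track, which removes the speaker-dependent pitch level while keeping the pitch contour. The sketch below is an illustrative assumption, not the paper's exact procedure; the function name, the log-domain choice, and the use of `f0 == 0` to mark unvoiced frames are all assumptions.

```python
import numpy as np

def whiten_f0(f0: np.ndarray) -> np.ndarray:
    """Per-utterance whitening of an F0 track (Hz): normalize log-F0 over
    voiced frames to zero mean and unit variance, so the absolute pitch
    level (a speaker cue) is removed but the contour is kept.
    Unvoiced frames are assumed to be marked with f0 == 0."""
    voiced = f0 > 0
    out = np.zeros_like(f0, dtype=np.float64)
    if voiced.any():
        v = np.log(f0[voiced])                    # log-F0 over voiced frames
        mu, sigma = v.mean(), v.std()
        out[voiced] = (v - mu) / (sigma + 1e-8)   # zero mean, unit variance
    return out
```

The same contour whitened this way looks identical for a low-pitched and a high-pitched speaker, which is exactly why it can condition the decoder without leaking source timbre.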

Introduction
Voice conversion refers to changing the style of a speech signal while retaining its linguistic content. Although style covers many aspects of speech, such as emotion, prosody, accent, and whispering, in this work we focus only on converting the speaker's timbre while keeping the linguistic and paralinguistic information unchanged.

Early attempts at voice conversion relied either on direct conversion based on CycleGAN or StarGAN, or on auto-encoding with learned feature disentanglement. However, neither delivers high-quality results: the former suffer from significant artifacts, while the latter rely mainly on information bottlenecks at the latent or architectural level that are difficult to tune. A bottleneck that is too wide leaks source-speaker information, while one that is too narrow degrades content fidelity.

Recent solutions follow a design in which content information is obtained from a feature extraction network pre-trained for speech recognition, the so-called phonetic posteriorgram (PPG) approach, or from self-supervised representation learning, specifically with HuBERT and WavLM. The combination of this content information with a learned global speaker embedding is then used as the input and conditioning signal for a vocoder model, such as those used in prior work, trained to reconstruct the audio waveform.
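The combination step described above can be sketched as follows. The shapes and the function name are illustrative assumptions, not the paper's actual interface: frame-level content features are concatenated with a time-broadcast global speaker embedding to form the vocoder's conditioning input.

```python
import numpy as np

def build_conditioning(content: np.ndarray, speaker_emb: np.ndarray) -> np.ndarray:
    """Combine frame-level content features (T, Dc) with a single global
    speaker embedding (Ds,) by broadcasting the embedding over time and
    concatenating along the feature axis, giving a (T, Dc + Ds) array."""
    num_frames = content.shape[0]
    spk = np.broadcast_to(speaker_emb, (num_frames, speaker_emb.shape[0]))
    return np.concatenate([content, spk], axis=-1)
```

Because the speaker embedding is a single time-invariant vector, it can only carry global speaker identity, while all time-varying content must flow through the content features.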

Our solution follows the same design pattern, using pseudo-labels derived from HuBERT to train a content encoder that outputs soft speech units. The key new design elements are training this content encoder causally and supplying whitened fundamental-frequency information to improve pitch stability without leaking the source speaker's timbre.
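To make "soft speech units" concrete: the content encoder is trained against discrete pseudo-labels (cluster IDs of HuBERT features), but it emits a per-frame probability distribution over those clusters rather than a hard one-hot assignment. A minimal sketch of that final softmax step, with illustrative shapes and naming:

```python
import numpy as np

def soft_speech_units(logits: np.ndarray) -> np.ndarray:
    """Turn per-frame scores over K discrete HuBERT clusters (T, K) into
    'soft' units: a probability distribution per frame instead of a hard
    one-hot cluster assignment (numerically stable softmax)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize exponent
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)         # rows sum to 1
```

Keeping the distribution soft preserves sub-phonetic detail that a hard cluster assignment would quantize away, which helps content fidelity.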


Blog:https://google-research.github.io/seanet/stream_vc/
ARXIV:https://arxiv.org/html/2401.03078v1

YouTube:
