Full-duplex real-time voice interaction with 120 ms ultra-low latency
Hertz-dev is the first open-source conversational audio model from Standard Intelligence. It is a full-duplex, audio-only Transformer base model.
Its main function is to generate conversational audio, that is, speech that simulates human dialogue. It supports full-duplex audio: it can receive and generate audio at the same time, as in a phone call or live conversation, without waiting for the other speaker to finish a sentence before replying.
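To make the full-duplex idea concrete, here is a minimal toy sketch in Python. It is not hertz-dev's API: the function names, the string "frames", and the placeholder model step are all hypothetical, chosen only to show input and output streams running concurrently instead of taking turns.

```python
import asyncio

async def capture(mic_frames, inbound):
    """Feed incoming audio frames to the model side as they arrive."""
    for frame in mic_frames:
        await inbound.put(frame)
        await asyncio.sleep(0)  # yield so the other tasks can run in between
    await inbound.put(None)     # sentinel: the input stream has ended

async def generate(inbound, outbound):
    """Emit one output frame per input frame, without waiting for the turn to end."""
    while True:
        frame = await inbound.get()
        if frame is None:
            await outbound.put(None)
            return
        # Hypothetical stand-in for a model step; a real system would run inference here.
        await outbound.put(f"reply-to-{frame}")

async def playback(outbound, played):
    """Consume generated frames as soon as they are available."""
    while True:
        frame = await outbound.get()
        if frame is None:
            return
        played.append(frame)

async def run_duplex(mic_frames):
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    played = []
    # All three tasks run concurrently: listening and speaking overlap.
    await asyncio.gather(
        capture(mic_frames, inbound),
        generate(inbound, outbound),
        playback(outbound, played),
    )
    return played

result = asyncio.run(run_duplex(["f0", "f1", "f2"]))
print(result)
```

In a half-duplex design, the generate step would start only after the capture step had finished; here the two queues let every stage make progress frame by frame.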
Project overview
Hertz-dev is an 8.5-billion-parameter Transformer model designed specifically for conversational audio generation. It is trained on 20 million hours of high-quality speech data and models speech well, including natural pauses and emotional intonation. The theoretical latency is 80 milliseconds, and the measured latency on a single RTX 4090 graphics card is about 120 milliseconds, which is significantly lower than that of existing open-source models.
Technical architecture
Hertz-dev contains the following key components:
- Hertz-codec: An efficient audio autoencoder that compresses 16 kHz mono speech into an 8 Hz latent representation at a bitrate of approximately 1 kbps, with better compression efficiency than schemes such as SoundStream and EnCodec.
- Hertz-vae: A 1.8-billion-parameter variational autoencoder (VAE) that generates coherent speech output and supports up to 17 minutes of context, making it suitable for long conversations.
- Hertz-lm: A 6.6-billion-parameter Transformer model, initialized in part from a pretrained language model, that focuses on dialogue fluency and context understanding.
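The codec and context figures above can be turned into a few back-of-the-envelope numbers. The short Python sketch below uses only the values quoted in the text, plus one assumption that is not from the source: a 16-bit PCM baseline for the uncompressed signal.

```python
# Figures quoted in the text
SAMPLE_RATE_HZ = 16_000   # mono input sample rate
LATENT_RATE_HZ = 8        # latent frames per second
BITRATE_BPS = 1_000       # "approximately 1 kbps"
CONTEXT_MINUTES = 17      # hertz-vae context length

# Assumption (not stated in the text): the raw baseline is 16-bit PCM
PCM_BITS_PER_SAMPLE = 16

samples_per_frame = SAMPLE_RATE_HZ // LATENT_RATE_HZ   # audio samples summarized by one latent
frame_duration_ms = 1000 / LATENT_RATE_HZ              # audio duration covered by one latent
bits_per_frame = BITRATE_BPS / LATENT_RATE_HZ          # bit budget per latent frame
raw_bitrate_bps = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE
compression_ratio = raw_bitrate_bps / BITRATE_BPS
context_frames = CONTEXT_MINUTES * 60 * LATENT_RATE_HZ # latents in a 17-minute context

print(f"samples per latent frame: {samples_per_frame}")
print(f"frame duration: {frame_duration_ms} ms")
print(f"bits per latent frame: {bits_per_frame}")
print(f"compression vs. 16-bit PCM: {compression_ratio}x")
print(f"latent frames in {CONTEXT_MINUTES} min: {context_frames}")
```

Under these assumptions each latent frame summarizes 2,000 samples (125 ms of audio) in about 125 bits, and a 17-minute context corresponds to roughly 8,160 latent frames.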
🚀Usage
The project provides several ways to run inference:
- Use inference.ipynb to generate mono or two-channel audio output.
- Use inference_client.py and inference_server.py for real-time microphone interaction (currently tested with an Ubuntu server and a macOS client).
- Use inference_client_webrtc.py, which combines Streamlit and WebRTC, for real-time voice interaction in the browser.
All model weights are downloaded automatically to the ./ckpt directory and are also available from ckpt.si.inc.
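To confirm that the download has completed, one option is to list whatever files are present in ./ckpt. The helper below is a hypothetical convenience written for this article, not part of the hertz-dev repository, and it makes no assumption about the checkpoint file names.

```python
from pathlib import Path

def list_checkpoints(ckpt_dir="./ckpt"):
    """Return (relative path, size in MiB) for each file under ckpt_dir, sorted by path."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return []  # directory missing: nothing has been downloaded yet
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size / 2**20)
        for p in root.rglob("*")
        if p.is_file()
    )

for name, size_mib in list_checkpoints():
    print(f"{name}: {size_mib:.1f} MiB")
```

An empty result means either that the directory does not exist yet or that no weights have been fetched.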
Application scenarios
As a base model, Hertz-dev has not undergone reinforcement learning or instruction fine-tuning. It is suited to further development in scenarios such as:
- Real-time voice assistants
- Multilingual speech translation
- Voice interaction for non-player characters (NPCs) in games
- Customer service voice bots
- Speech emotion recognition and generation
Project links
GitHub: https://github.com/Standard-Intelligence/hertz-dev
YouTube: