Hertz-dev: The first open-source model for conversational audio

Full-duplex real-time voice interaction with ~120 ms ultra-low latency
Hertz-dev, developed by Standard Intelligence, is the first open-source base model for conversational audio. hertz-dev is a full-duplex, audio-only Transformer base model.
Its main function is dialogue audio generation: synthesizing speech that simulates human conversation. It supports full-duplex audio, meaning it can receive and generate audio simultaneously, like a phone call or a live conversation, without waiting for one speaker to finish before replying.

Project overview

Hertz-dev is an 8.5-billion-parameter Transformer model designed specifically for dialogue audio generation. It is trained on 20 million hours of high-quality speech data and has strong speech modeling capabilities, capturing characteristics such as natural pauses and emotional intonation. Its theoretical latency is 80 milliseconds, and measured latency on a single RTX 4090 graphics card is about 120 milliseconds, significantly faster than the response times of existing open-source models.

Technical architecture

Hertz-dev contains the following key components:

  • Hertz-codec: An efficient audio autoencoder that compresses 16 kHz monaural speech into an 8 Hz latent representation at a bitrate of approximately 1 kbps, with compression efficiency better than schemes such as SoundStream and EnCodec.

  • Hertz-vae: A 1.8-billion-parameter variational autoencoder (VAE) that generates coherent speech output and supports up to 17 minutes of context, making it suitable for long conversations.

  • Hertz-lm: A 6.6-billion-parameter Transformer model, partially initialized from a pretrained language model, that focuses on dialogue fluency and context understanding.
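The codec figures above imply a very aggressive compression scheme. As a back-of-envelope check, derived only from the numbers stated in the text (and assuming 16-bit PCM input, which the article does not specify):

```python
# Rough arithmetic on the hertz-codec figures quoted above.
# BITS_PER_SAMPLE = 16 is an assumption; the other constants come from the text.

SAMPLE_RATE_HZ = 16_000      # 16 kHz mono input
BITS_PER_SAMPLE = 16         # assumed 16-bit PCM
LATENT_RATE_HZ = 8           # 8 Hz latent frames
LATENT_BITRATE_BPS = 1_000   # ~1 kbps stated bitrate

raw_bitrate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE          # raw PCM bitrate in bps
compression_ratio = raw_bitrate / LATENT_BITRATE_BPS    # how much smaller the latent is
frame_duration_ms = 1000 / LATENT_RATE_HZ               # audio covered by one latent frame
bits_per_frame = LATENT_BITRATE_BPS / LATENT_RATE_HZ    # bits available per latent frame

print(f"raw PCM bitrate:   {raw_bitrate} bps")
print(f"compression ratio: ~{compression_ratio:.0f}x")
print(f"latent frame span: {frame_duration_ms:.0f} ms")
print(f"bits per frame:    {bits_per_frame:.0f}")
```

So each latent frame summarizes 125 ms of audio in roughly 125 bits, about a 256x reduction over raw PCM under the 16-bit assumption.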

🚀 Usage

The project provides several ways to run inference:

  • Use inference.ipynb to generate mono or two-channel voice output.

  • Use inference_client.py and inference_server.py for real-time microphone interaction (currently tested with an Ubuntu server and a macOS client).

  • Use inference_client_webrtc.py, which combines Streamlit and WebRTC, for real-time voice interaction in the browser.
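The client/server scripts above implement the full-duplex pattern described earlier: capture and playback run concurrently rather than turn by turn. The sketch below illustrates only that pattern with stand-in queues and an echo "model"; none of these names are the real hertz-dev API.

```python
# Illustrative full-duplex sketch: input frames are consumed while output
# frames are produced, instead of waiting for a full utterance to finish.
# The generate() function is a stand-in echo, NOT the real hertz-dev model.
import queue
import threading

mic_frames = queue.Queue()      # frames captured from the microphone
speaker_frames = queue.Queue()  # frames to play back

def capture(frames):
    """Stand-in for a microphone callback pushing audio frames."""
    for frame in frames:
        mic_frames.put(frame)
    mic_frames.put(None)  # end-of-stream sentinel

def generate():
    """Stand-in for the model: emits output while input is still arriving."""
    while (frame := mic_frames.get()) is not None:
        speaker_frames.put(f"reply-to-{frame}")
    speaker_frames.put(None)

threads = [
    threading.Thread(target=capture, args=(["f0", "f1", "f2"],)),
    threading.Thread(target=generate),
]
for t in threads:
    t.start()

played = []
while (out := speaker_frames.get()) is not None:
    played.append(out)
for t in threads:
    t.join()

print(played)  # replies stream out frame by frame, interleaved with capture
```

In the real scripts, the capture side is a microphone stream and the generate side is the model server; the key point is that neither blocks waiting for the other to finish.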

All model weights are downloaded automatically to the ./ckpt directory; they can also be obtained from ckpt.si.inc.

Application scenarios

As a base model, Hertz-dev has not undergone reinforcement learning or instruction fine-tuning. It is suitable for secondary development in scenarios such as:

  • Real-time voice assistant

  • Multilingual speech translation

  • Voice interaction for non-player characters (NPCs) in games

  • Customer service voice bots

  • Speech emotion recognition and generation

Project link

GitHub: https://github.com/Standard-Intelligence/hertz-dev
YouTube:
