Supports an end-to-end voice solution (GLM-4-Voice-THG) and a cascaded solution (ASR-LLM-TTS-THG). Appearance and voice are customizable without training, voice cloning is supported, and first-packet latency is as low as 3 seconds.
Overview
The project demonstrates real-time interaction with a customizable digital human. It supports both an end-to-end voice solution (GLM-4-Voice) and a cascaded solution (ASR-LLM-TTS-THG). Users can customize the digital human's appearance and voice, voice cloning is supported, and first-packet latency is as low as 3 seconds.
Detailed description
The project demonstrates how to build an interactive digital human capable of real-time voice conversation. Its key aspects are described below:
1. Core features:
- Real-time voice interaction: the core of the project is enabling a digital human to hold a natural, real-time conversation with users.
- End-to-end and cascaded solutions: two processing pipelines are provided:
  - End-to-end (GLM-4-Voice): a multimodal large language model (MLLM) processes speech directly and drives talking-head generation (THG).
  - Cascaded (ASR-LLM-TTS-THG): processing is split into stages: Automatic Speech Recognition (ASR), Large Language Model (LLM), Text-to-Speech (TTS), and Talking Head Generation (THG).
- Customization: users can customize the digital human's appearance and voice.
- Voice cloning: users can give the digital human a specific or personalized voice.
- Low latency: the project targets a first-packet latency of approximately 3 seconds.
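The cascaded pipeline can be sketched as four stages chained in sequence. Everything below is a hypothetical stand-in for the real modules (FunASR, Qwen, a TTS engine, MuseTalk), meant only to show the data flow, not the project's actual API:

```python
# Illustrative sketch of the cascaded ASR -> LLM -> TTS -> THG pipeline.
# All four stage functions are stubs standing in for the real modules.

def asr(audio: bytes) -> str:
    """Automatic Speech Recognition: user audio -> transcribed text (stub)."""
    return "hello"

def llm(prompt: str) -> str:
    """Large Language Model: user text -> reply text (stub)."""
    return f"reply to: {prompt}"

def tts(text: str) -> bytes:
    """Text-to-Speech: reply text -> reply audio (stub)."""
    return text.encode("utf-8")

def thg(audio: bytes) -> str:
    """Talking Head Generation: reply audio -> lip-synced video path (stub)."""
    return "/tmp/avatar.mp4"

def cascade(user_audio: bytes) -> str:
    """Run the four stages in order, as in the cascaded solution."""
    text = asr(user_audio)
    reply = llm(text)
    audio = tts(reply)
    return thg(audio)

print(cascade(b"..."))  # prints /tmp/avatar.mp4
```

In the end-to-end solution, the first three stages collapse into a single MLLM call, which is one reason it needs more GPU memory but fewer hand-offs between models.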
2. Technology stack:
- ASR: FunASR for automatic speech recognition.
- LLM: Qwen as the large language model.
- End-to-end MLLM: GLM-4-Voice for end-to-end multimodal processing.
- TTS: multiple engines are supported: GPT-SoVITS, CosyVoice, and edge-tts.
- THG: MuseTalk for talking-head generation.
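Since several TTS engines are interchangeable, a common interface with a name-keyed registry is a natural shape for the code. The sketch below is illustrative only: the engine functions are stubs standing in for GPT-SoVITS, CosyVoice, and edge-tts, and the registry name is invented, not taken from the project's src/tts.py:

```python
# Hedged sketch: selecting one of the supported TTS engines by name.
# The three functions are placeholders, not real engine bindings.

def gpt_sovits(text: str) -> bytes:   # stand-in for GPT-SoVITS
    return b"sovits:" + text.encode("utf-8")

def cosyvoice(text: str) -> bytes:    # stand-in for the CosyVoice API
    return b"cosy:" + text.encode("utf-8")

def edge_tts(text: str) -> bytes:     # stand-in for edge-tts
    return b"edge:" + text.encode("utf-8")

# Hypothetical registry mapping engine names to synthesis functions.
TTS_ENGINES = {
    "GPT-SoVITS": gpt_sovits,
    "CosyVoice": cosyvoice,
    "edge-tts": edge_tts,
}

def synthesize(engine: str, text: str) -> bytes:
    """Dispatch to the chosen engine, failing loudly on unknown names."""
    try:
        return TTS_ENGINES[engine](text)
    except KeyError:
        raise ValueError(f"unknown TTS engine: {engine}") from None

print(synthesize("edge-tts", "hi"))  # prints b'edge:hi'
```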
3. Local deployment:
- Hardware requirements:
  - Cascaded solution: requires approximately 8 GB of GPU memory (e.g., a single A100).
  - End-to-end solution: requires approximately 20 GB of GPU memory.
- Software requirements:
  - Ubuntu 22.04
  - Python 3.10
  - CUDA 12.2
  - PyTorch 2.3.0
- Setup steps:
  - Environment configuration: instructions for cloning the repository, creating a conda environment, and installing the required Python packages.
  - Weight download: instructions for downloading the weights required by MuseTalk, GPT-SoVITS, and GLM-4-Voice, either directly or via ModelScope.
  - Other configuration:
    - API keys: explains how to use API keys for the LLM and TTS modules (Qwen API and CosyVoice API). If you prefer not to use an API key, instructions for local inference are also provided: run Qwen locally for LLM inference or use Edge_TTS for TTS.
  - Start the service: run python app.py to launch the demo.
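Under the requirements above, a typical setup session might look like the following. The conda environment name and the requirements.txt path are assumptions, not taken from the repository; follow the project's README for the exact commands:

```shell
# Hedged setup sketch; env name and dependency file are assumed.
git clone https://github.com/Henry-23/VideoChat.git
cd VideoChat
conda create -n videochat python=3.10 -y   # env name is illustrative
conda activate videochat
pip install -r requirements.txt            # assumed dependency file
python app.py                              # launches the Gradio demo
```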
4. Customization:
- Digital human appearance: users can add their own recorded avatar videos.
- Digital human voice: users can add voice samples to the /data/audio folder and add the voice name in app.py. Supported format: x (GPT-SoVITS).
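The voice-registration step might look like the sketch below. The /data/audio path comes from the document, but the VOICES mapping and register_voice helper are hypothetical illustrations of "add the voice name in app.py", not the project's real code:

```python
# Hedged sketch of registering a custom voice sample.
from pathlib import Path

AUDIO_DIR = Path("/data/audio")   # sample folder named in the document

# Hypothetical mapping from display names to sample files.
VOICES: dict[str, Path] = {}

def register_voice(name: str, filename: str) -> Path:
    """Map a voice name to a sample file under /data/audio (no I/O)."""
    path = AUDIO_DIR / filename
    VOICES[name] = path
    return path

register_voice("my_voice", "my_voice.wav")  # filename is illustrative
print(sorted(VOICES))  # prints ['my_voice']
```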
5. Key files:
- app.py: main application file; handles the Gradio interface and logic.
- src/llm.py: contains the LLM implementations (Qwen, Qwen_API).
- src/tts.py: contains the TTS implementations (GPT_So_Vits_TTS, CosyVoice_API, Edge_TTS).
- src/thg.py: handles talking-head generation using MuseTalk.
GitHub: https://github.com/Henry-23/VideoChat
Online demo: https://www.modelscope.cn/studios/AI-ModelScope/video_chat
YouTube: