Dedicated to providing multimodal, multilingual, high-performance speech understanding capabilities
Project overview
SenseVoice is the foundation model responsible for "speech understanding" in the FunAudioLLM project. Its main capabilities are automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
Developed by Alibaba's Tongyi speech team, the project aims to enable natural voice interaction between humans and computers by combining speech technology with large language models (LLMs).
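These capabilities are typically emitted together: the model prefixes the transcript with special tokens for language, emotion, and audio event (e.g. `<|en|><|HAPPY|><|Speech|>...`). A minimal parser sketch for that token style — the exact tag vocabulary here is an assumption based on SenseVoice-Small's published output format, not an exhaustive list:

```python
import re

def parse_sensevoice_output(raw: str) -> dict:
    """Split a SenseVoice-style transcript into its tag fields and plain text.

    Assumes the model prefixes the text with tokens such as
    <|en|><|HAPPY|><|Speech|>; the tag names are illustrative.
    """
    tags = re.findall(r"<\|([^|]+)\|>", raw)
    text = re.sub(r"<\|[^|]+\|>", "", raw).strip()
    result = {"tags": tags, "text": text}
    # By convention the first three tags are language, emotion, and event.
    if len(tags) >= 3:
        result.update(language=tags[0], emotion=tags[1], event=tags[2])
    return result

example = "<|en|><|HAPPY|><|Speech|>Nice to meet you!"
print(parse_sensevoice_output(example))
```

Downstream code (for example, an LLM prompt builder) can then branch on the parsed `emotion` and `event` fields rather than on raw token strings.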
Core highlights
- High-precision multilingual speech recognition
  - Trained on more than 400,000 hours of data; supports more than 50 languages.
  - Recognition accuracy surpasses OpenAI's Whisper in Mandarin and Cantonese scenarios.
- Rich transcription capabilities
  - Accurate speech emotion recognition, with performance that matches or exceeds current state-of-the-art models.
  - Audio event detection, covering laughter, applause, coughing, sneezing, background music, and other sounds common in human-computer interaction.
- Efficient inference
  - SenseVoice-Small is a non-autoregressive end-to-end model with very low latency: processing 10 seconds of audio takes only about 70 ms, roughly 15x faster than Whisper-Large.
- Easy fine-tuning
  - Provides fine-tuning scripts and recipes so users can handle issues such as small sample sizes and long-tail categories in specific business scenarios.
- Service-friendly deployment
  - Supports multiple concurrent requests, with client support for Python, C++, HTML, Java, C#, and other languages.
- Version updates and extensions
  - SenseVoice-Small was officially open-sourced in July 2024, supporting Mandarin, Cantonese, English, Japanese, Korean, and other languages, with ONNX and libtorch export and a Python runtime.
  - Released alongside it were CosyVoice, a natural speech generation model with multilingual, timbre, and emotion control, and FunASR, a speech processing toolkit.
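The latency figures quoted above imply a real-time factor (RTF) far below 1. A quick back-of-the-envelope check, using only the numbers stated in this article:

```python
# Real-time factor: processing time divided by audio duration.
audio_seconds = 10.0
processing_seconds = 0.070   # ~70 ms quoted for SenseVoice-Small

rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.4f}")  # well below 1.0, i.e. much faster than real time

# The claimed ~15x speedup over Whisper-Large would put Whisper-Large at roughly:
whisper_large_seconds = processing_seconds * 15
print(f"Whisper-Large estimate: {whisper_large_seconds * 1000:.0f} ms")
```

An RTF of 0.007 means the model transcribes audio over a hundred times faster than it plays, which is what makes the interactive scenarios below feasible.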
Application scenarios and overall architecture
The project is part of the FunAudioLLM framework, which comprises two foundation models:
- SenseVoice: Used for “speech understanding”, covering ASR, emotion recognition, audio event detection, etc.;
- CosyVoice: Used for "speech generation", supporting multiple languages, timbre and emotion control, zero-shot voice cloning, and other functions.
By combining these two models with a large language model, rich forms of interaction become possible:
- Speech-to-Speech Translation
- Emotional Voice Chat
- Interactive Podcast
- Expressive Audiobook Narration
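All of the interaction forms above share the same pipeline shape: SenseVoice transcribes and annotates the input, an LLM produces a reply, and CosyVoice renders it as speech. A sketch of that flow with purely illustrative stub functions (the real model APIs are not shown here; every function body below is a placeholder):

```python
def understand(audio: bytes) -> dict:
    """Stub standing in for SenseVoice: transcript plus emotion/event labels."""
    return {"text": "hello there", "emotion": "HAPPY", "event": "Speech"}

def respond(understanding: dict) -> str:
    """Stub standing in for the LLM: conditions the reply on detected emotion."""
    tone = "warmly" if understanding["emotion"] == "HAPPY" else "neutrally"
    return f"Replying {tone} to: {understanding['text']}"

def synthesize(text: str, emotion: str) -> bytes:
    """Stub standing in for CosyVoice: would render text in the given emotion."""
    return f"[{emotion} voice] {text}".encode()

def voice_chat_turn(audio: bytes) -> bytes:
    """One turn of an emotional voice chat: understand -> respond -> synthesize."""
    understanding = understand(audio)
    reply = respond(understanding)
    return synthesize(reply, understanding["emotion"])

print(voice_chat_turn(b"...").decode())
```

The key design point is that the emotion and event labels flow through the whole turn, so the reply's wording and the synthesized voice can both react to how the user sounded, not just what they said.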
Platform and interface support
- The GitHub repository provides complete training, inference, and fine-tuning code, including Python implementations and export pipelines for different formats (ONNX, libtorch, etc.).
- The sherpa-onnx framework also integrates the SenseVoice model, providing multilingual recognition (Mandarin, Cantonese, English, Japanese, and Korean), rich APIs (Python, C++, C#, Go, Java, JS, Swift, Dart, etc.), and multi-platform support (Linux, macOS, Windows, Android, iOS).
Summary table
| Model / Framework | Feature highlights | Example application scenarios |
|---|---|---|
| SenseVoice | Multilingual ASR, emotion recognition, audio event detection; high accuracy, low latency (~70 ms per 10 s of audio) | Real-time speech recognition, emotion-aware interaction, background sound monitoring |
| CosyVoice | Multilingual, timbre and emotion control; natural speech generation; zero-shot voice cloning | Highly human-like speech generation, audio content production, cross-language narration |
| FunASR | Multifunctional toolkit: VAD, punctuation restoration, language models, speaker recognition, etc. | End-to-end speech recognition pipelines, including complex multi-speaker scenarios |
| Platform support | ONNX / libtorch export; rich API support; multi-device deployment via sherpa-onnx | Flexible integration into servers, mobile apps, and web front-ends |
Conclusion
SenseVoice is a powerful speech understanding model combining multilingual support, high accuracy, low latency, and broad applicability, making it well suited to real-time interaction scenarios. As a core component of the FunAudioLLM framework, it works together with CosyVoice to enable natural, emotionally rich voice-based human-computer interaction.
GitHub: https://github.com/FunAudioLLM/SenseVoice
YouTube: