Dedicated to providing multimodal, multilingual, high-performance speech understanding capabilities
Project overview
SenseVoice is the foundation model responsible for "speech understanding" in the FunAudioLLM project. Its main capabilities are automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
Developed by Alibaba's Tongyi speech team, the project aims to enable natural voice interaction between humans and computers by combining speech technology with large language models (LLMs).
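These capabilities are typically emitted together: the model prefixes the transcript with special tokens for language, emotion, and audio event (e.g. `<|en|><|HAPPY|><|Speech|>...`). A minimal parser sketch for that token style — the exact tag vocabulary here is an assumption based on SenseVoice-Small's published output format, not an exhaustive list:

```python
import re

def parse_sensevoice_output(raw: str) -> dict:
    """Split a SenseVoice-style transcript into its tag fields and plain text.

    Assumes the model prefixes the text with tokens such as
    <|en|><|HAPPY|><|Speech|>; the tag names are illustrative.
    """
    tags = re.findall(r"<\|([^|]+)\|>", raw)
    text = re.sub(r"<\|[^|]+\|>", "", raw).strip()
    result = {"tags": tags, "text": text}
    # By convention the first three tags are language, emotion, and event.
    if len(tags) >= 3:
        result.update(language=tags[0], emotion=tags[1], event=tags[2])
    return result

example = "<|en|><|HAPPY|><|Speech|>Nice to meet you!"
print(parse_sensevoice_output(example))
```

Downstream code (for example, an LLM prompt builder) can then branch on the parsed `emotion` and `event` fields rather than on raw token strings.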
Core highlights
- High-precision multilingual speech recognition
  - Trained on more than 400,000 hours of data; supports more than 50 languages.
  - Recognition accuracy surpasses OpenAI's Whisper in Mandarin and Cantonese scenarios.
- Rich transcription capabilities
  - Accurate speech emotion recognition, with performance that matches or exceeds current state-of-the-art models.
  - Audio event detection, covering laughter, applause, coughing, sneezing, background music, and other sounds common in human-computer interaction.
- Efficient inference
  - SenseVoice-Small is a non-autoregressive end-to-end model with very low latency: processing 10 seconds of audio takes only about 70 ms, roughly 15x faster than Whisper-Large.
- Easy fine-tuning
  - Provides fine-tuning scripts and recipes so users can handle issues such as small sample sizes and long-tail categories in specific business scenarios.
- Service-friendly deployment
  - Supports multiple concurrent requests, with client support for Python, C++, HTML, Java, C#, and other languages.
- Version updates and extensions
  - SenseVoice-Small was officially open-sourced in July 2024, supporting Mandarin, Cantonese, English, Japanese, Korean, and other languages, with ONNX and libtorch export and a Python runtime.
  - Released alongside it were CosyVoice, a natural speech generation model with multilingual, timbre, and emotion control, and FunASR, a speech processing toolkit.
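The latency figures quoted above imply a real-time factor (RTF) far below 1. A quick back-of-the-envelope check, using only the numbers stated in this article:

```python
# Real-time factor: processing time divided by audio duration.
audio_seconds = 10.0
processing_seconds = 0.070   # ~70 ms quoted for SenseVoice-Small

rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.4f}")  # well below 1.0, i.e. much faster than real time

# The claimed ~15x speedup over Whisper-Large would put Whisper-Large at roughly:
whisper_large_seconds = processing_seconds * 15
print(f"Whisper-Large estimate: {whisper_large_seconds * 1000:.0f} ms")
```

An RTF of 0.007 means the model transcribes audio over a hundred times faster than it plays, which is what makes the interactive scenarios below feasible.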
Application scenarios and overall architecture
The project is part of the FunAudioLLM framework, which comprises two foundation models:
- SenseVoice: Used for “speech understanding”, covering ASR, emotion recognition, audio event detection, etc.;
- CosyVoice: Used for "speech generation", supporting multiple languages, timbre and emotion control, zero-shot voice cloning, and other functions.
By combining these two models with a large language model, rich forms of interaction become possible:
- Speech-to-Speech Translation
- Emotional Voice Chat
- Interactive Podcast
- Expressive Audiobook Narration
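All of the interaction forms above share the same pipeline shape: SenseVoice transcribes and annotates the input, an LLM produces a reply, and CosyVoice renders it as speech. A sketch of that flow with purely illustrative stub functions (the real model APIs are not shown here; every function body below is a placeholder):

```python
def understand(audio: bytes) -> dict:
    """Stub standing in for SenseVoice: transcript plus emotion/event labels."""
    return {"text": "hello there", "emotion": "HAPPY", "event": "Speech"}

def respond(understanding: dict) -> str:
    """Stub standing in for the LLM: conditions the reply on detected emotion."""
    tone = "warmly" if understanding["emotion"] == "HAPPY" else "neutrally"
    return f"Replying {tone} to: {understanding['text']}"

def synthesize(text: str, emotion: str) -> bytes:
    """Stub standing in for CosyVoice: would render text in the given emotion."""
    return f"[{emotion} voice] {text}".encode()

def voice_chat_turn(audio: bytes) -> bytes:
    """One turn of an emotional voice chat: understand -> respond -> synthesize."""
    understanding = understand(audio)
    reply = respond(understanding)
    return synthesize(reply, understanding["emotion"])

print(voice_chat_turn(b"...").decode())
```

The key design point is that the emotion and event labels flow through the whole turn, so the reply's wording and the synthesized voice can both react to how the user sounded, not just what they said.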
Platform and interface support
- The GitHub repository provides complete training, inference, and fine-tuning code, including Python implementations and export pipelines for different formats (ONNX, libtorch, etc.).
- The sherpa-onnx framework also integrates the SenseVoice model, providing multilingual recognition (Mandarin, Cantonese, English, Japanese, and Korean), rich APIs (Python, C++, C#, Go, Java, JS, Swift, Dart, etc.), and multi-platform support (Linux, macOS, Windows, Android, iOS).
Summary table
| Model / Framework | Feature highlights | Example application scenarios |
|---|---|---|
| SenseVoice | Multilingual ASR, emotion recognition, audio event detection; high accuracy, low latency (~70 ms per 10 s of audio) | Real-time speech recognition, emotion-aware interaction, background sound monitoring |
| CosyVoice | Multilingual, timbre and emotion control; natural speech generation; zero-shot voice cloning | Highly human-like speech generation, audio content production, cross-language narration |
| FunASR | Multifunctional toolkit: VAD, punctuation restoration, language models, speaker recognition, etc. | End-to-end speech recognition pipelines, including complex multi-speaker scenarios |
| Platform support | ONNX / libtorch export; rich API support; multi-device deployment via sherpa-onnx | Flexible integration into servers, mobile apps, and web front-ends |
Conclusion
SenseVoice is a powerful speech understanding model combining multilingual support, high accuracy, low latency, and broad applicability, making it well suited to real-time interaction scenarios. As a core component of the FunAudioLLM framework, it works together with CosyVoice to enable natural, emotionally rich voice-based human-computer interaction.
GitHub: https://github.com/FunAudioLLM/SenseVoice
YouTube: