VITA Open Source Video + Voice Model

Project function: open-source video + voice model
Project information: an open-source multimodal large language model designed for real-time visual and voice interaction.

By reducing interaction latency, enhancing speech processing, and improving multimodal understanding, its ability to process video, images, text, and audio simultaneously approaches the level of GPT-4o.

The following is the description from the project's original page:

VITA-1.5 includes a series of improvements:

Significantly reduced interaction latency. End-to-end voice interaction latency has been shortened from approximately 4 seconds to 1.5 seconds, enabling near-instant interaction and greatly improving the user experience.

Enhanced multimodal performance. The average score across multimodal benchmarks such as MME, MMBench, and MathVista improved significantly from 59.8 to 70.8.

Improved speech processing. Speech capabilities have been raised to a new level, with the ASR WER (Word Error Rate, on the test-other set) reduced from 18.4 to 7.5. In addition, VITA-1.0's stand-alone TTS module has been replaced with an end-to-end TTS module that accepts the LLM's embeddings as input.
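
To illustrate what "accepts the LLM's embeddings as input" can look like, here is a minimal PyTorch sketch of a TTS head conditioned directly on LLM hidden states rather than on re-encoded text. All module names and dimensions are assumptions for illustration, not taken from the VITA codebase.

```python
import torch
import torch.nn as nn

class EmbeddingConditionedTTSHead(nn.Module):
    """Hypothetical end-to-end TTS head: maps LLM hidden states to logits
    over discrete speech-codec tokens, skipping an intermediate text stage."""

    def __init__(self, llm_dim=4096, tts_dim=1024, codec_vocab=1024, num_layers=4):
        super().__init__()
        self.project = nn.Linear(llm_dim, tts_dim)             # bridge LLM width -> TTS width
        layer = nn.TransformerEncoderLayer(d_model=tts_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_codec = nn.Linear(tts_dim, codec_vocab)         # logits over codec tokens

    def forward(self, llm_hidden_states):
        # llm_hidden_states: (batch, seq_len, llm_dim), e.g. the LLM's last hidden layer
        x = self.project(llm_hidden_states)
        x = self.backbone(x)
        return self.to_codec(x)                                 # (batch, seq_len, codec_vocab)

# Usage sketch with random stand-in hidden states
hidden = torch.randn(1, 32, 4096)
speech_logits = EmbeddingConditionedTTSHead()(hidden)           # shape (1, 32, 1024)
```

The point of the design is that the speech output is generated from the same internal representation the LLM already produced, rather than from a separate text-to-speech pipeline.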

Progressive training strategy. With this strategy, adding speech has little impact on the other modalities (vision-language): average image understanding performance dropped only slightly, from 71.3 to 70.8.
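
One common way to add a new modality without disturbing the already-trained ones is to freeze the vision-language weights and update only the newly added speech modules. The sketch below illustrates that idea under assumed module-name prefixes; it is not VITA's actual training code.

```python
import torch

def freeze_except_speech(model, speech_prefixes=("speech_encoder.", "tts_head.")):
    """Hypothetical progressive-training step: keep the already-trained
    vision-language weights fixed and update only speech-related parameters."""
    trainable = []
    for name, param in model.named_parameters():
        if name.startswith(speech_prefixes):   # str.startswith accepts a tuple of prefixes
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False        # vision-language stack stays intact
    return torch.optim.AdamW(trainable, lr=1e-4)
```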

Project: https://github.com/VITA-MLLM/VITA

YouTube:
