CLASI: End-to-end simultaneous speech translation system developed by ByteDance

CLASI is a high-quality simultaneous speech translation system developed by ByteDance, designed to work the way a professional human interpreter does. It translates speech in real time while keeping translation quality high and latency low. CLASI relies on its data strategies and multi-modal retrieval technology to handle complex terminology and unclear speech.

CLASI generates accurate, error-tolerant translations from the current audio, combined with an external knowledge base and the historical context. It performs well across a range of test datasets and conveys a larger share of valid information than the systems it is compared with.

Translation strategy: CLASI uses a strategy that balances translation accuracy against speed, so that the output is both fast and accurate.
System architecture: The system processes the current audio data, retrieves relevant information, loads the historical context, and then outputs the translation. This cycle repeats continuously to keep the translation in real time.
Performance: In real-world use, CLASI's translation accuracy is significantly higher than that of the best commercial and open-source systems currently available. For example, it scores 81.3% for Chinese-to-English translation on the VIP metric.
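
The loop described under "System architecture" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CLASI's actual code: the class StreamingTranslator, the term_store dictionary, the translate_fn callback, and the max_history setting are all hypothetical names, audio chunks are stood in for by plain strings, and retrieval is reduced to a substring match.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StreamingTranslator:
    """Toy loop: take a chunk, retrieve terms, load history, translate."""

    # Hypothetical stand-in for the model call; not CLASI's real interface.
    translate_fn: Callable[[str, List[str], List[str]], str]
    # Hypothetical external term store: source term -> preferred translation.
    term_store: Dict[str, str]
    history: List[str] = field(default_factory=list)
    max_history: int = 3  # assumed context window, in past outputs

    def retrieve(self, chunk: str) -> List[str]:
        # Simplification: substring match on text instead of audio retrieval.
        return [f"{src}={tgt}" for src, tgt in self.term_store.items() if src in chunk]

    def step(self, chunk: str) -> str:
        terms = self.retrieve(chunk)                # retrieve relevant information
        context = self.history[-self.max_history:]  # load the historical context
        output = self.translate_fn(chunk, terms, context)
        self.history.append(output)                 # becomes context for later chunks
        return output

    def run(self, chunks: List[str]) -> List[str]:
        # The cycle repeats once per incoming chunk.
        return [self.step(chunk) for chunk in chunks]


def fake_translate(chunk: str, terms: List[str], context: List[str]) -> str:
    # Placeholder "model" that only reports what it was given.
    return f"<{chunk}|terms={len(terms)}|ctx={len(context)}>"
```

Calling StreamingTranslator(fake_translate, {"MM-RAG": "some translation"}).run([...]) on a list of chunks shows how each output is fed back as context for the next step, which is the point of the loop.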

CLASI addresses the following key challenges:

Balancing translation quality and latency: Traditional speech translation systems are usually cascaded systems that chain several models together (such as an automatic speech recognition model, a punctuation model, and a machine translation model). Error propagation between the stages and the added latency often hurt translation quality. CLASI imitates the strategies of human interpreters and adopts a data-driven read-write policy to balance translation quality against latency, which lets it deliver high-quality real-time translation.
Translation of domain terms: Accurate translation of domain terminology is a major challenge, especially in specialized fields. CLASI uses a multi-modal retrieval-augmented generation (MM-RAG) module that retrieves relevant terms and information from an external database, improving translation quality and keeping specialized terms accurate.
Lack of training data: The scarcity of data for this translation task seriously limits how far system performance can be improved. CLASI uses a multi-stage training approach consisting of large-scale pre-training, continual training, and fine-tuning. With only a small amount of high-quality human-annotated data, the model learns to imitate the behavior of professional human interpreters, which improves the robustness and quality of its translations.
The gap between human evaluation and automatic evaluation: Existing automatic evaluation metrics (such as BLEU) may not fully reflect translation quality, especially for long speech segments. CLASI introduces the valid information proportion (VIP) as a new evaluation metric. It reflects how well a translation system conveys valid information in real scenarios, and CLASI significantly outperforms existing systems on it.
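
To make the VIP metric concrete, here is a minimal sketch. It assumes that VIP is the share of translated segments that a human judge marks as conveying the intended information; the exact evaluation protocol is defined in the technical report, and the function name valid_information_proportion and the boolean-list input are hypothetical choices made for this example.

```python
from typing import Sequence


def valid_information_proportion(judgments: Sequence[bool]) -> float:
    """Share of segments judged valid, as a value between 0 and 1.

    Assumption: each entry is one human judgment for one translated segment,
    True when the segment conveys the speaker's intended information.
    """
    if not judgments:
        raise ValueError("need at least one judged segment")
    valid = sum(1 for judged_valid in judgments if judged_valid)
    return valid / len(judgments)
```

For instance, if 13 of 16 judged segments are marked valid, the function returns 0.8125, which would be reported as a VIP of about 81%.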

If you want to learn more, you can click the links below the video.
Thank you for watching this video. If you liked it, please like and subscribe. Thanks!

Project page and demos: https://byteresearchcla.github.io/clasi/
Paper: https://byteresearchcla.github.io/clasi/technical_report.pdf
