Meta AI releases Seamless, a real-time AI speech translation model

This model unifies the three previous models in the Seamless series and can translate among nearly 100 languages in real time with a delay of under two seconds, starting to translate while the speaker is still speaking.

Seamless does more than convert text: it preserves the speaker’s emotion, tone, and intonation, so the translated speech sounds more natural and lifelike.

Key features:

1. Preserving original emotion: The SeamlessExpressive model focuses on maintaining the expressiveness of the original speech in speech-to-speech translation, preserving the speaker’s intonation, emotion, and vocal style.

2. Real-time translation: Real-time translation runs with a delay of only about two seconds, starting to translate while the speaker is still speaking, which makes conversations more fluid and natural than with traditional translation systems.

3. Multilingual support: It supports automatic speech recognition and speech-to-text translation for nearly 100 input and output languages, as well as speech-to-speech translation from nearly 100 input languages into 36 output languages.

4. Toxicity mitigation and accuracy: In building its AI translation systems, Meta pays special attention to accuracy and to avoiding misunderstandings. The team explored ways to reduce the errors and toxic or inappropriate content that can arise during translation, which is crucial for the quality and safety of communication.

5. Audio watermarking: To prevent abuse and impersonation, Meta has also developed an audio watermarking technique. It embeds a watermark in the audio that is imperceptible to the human ear, so the source of the audio can be traced.
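To give a feel for how correlation-based audio watermarking works in principle, here is a minimal spread-spectrum-style sketch. This is not Meta's actual method (their watermark is learned by a neural model); the function names, the keyed ±1 pattern, and the deliberately exaggerated `alpha` are all illustrative assumptions.

```python
import math
import random

def _keyed_pattern(length, key):
    """Deterministic +/-1 pseudorandom pattern derived from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed_watermark(samples, key, alpha=0.05):
    """Mix a low-amplitude keyed pattern into the audio.
    (alpha is exaggerated here for a robust demo; a real system
    keeps the mark well below audibility.)"""
    pattern = _keyed_pattern(len(samples), key)
    return [s + alpha * p for s, p in zip(samples, pattern)]

def detect_watermark(samples, key, alpha=0.05):
    """Correlate the audio against the keyed pattern: a marked signal
    scores near alpha, an unmarked one near zero."""
    pattern = _keyed_pattern(len(samples), key)
    score = sum(s * p for s, p in zip(samples, pattern)) / len(samples)
    return score > alpha / 2

# One second of a 440 Hz tone at 16 kHz as a stand-in for speech.
audio = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
marked = embed_watermark(audio, key=1234)

print(detect_watermark(marked, key=1234))  # watermark present
print(detect_watermark(audio, key=1234))   # no watermark
```

Because only the key holder can regenerate the pattern, detection doubles as provenance: audio that correlates with the keyed pattern can be traced back to the generating system.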

The Seamless model unifies the functionality of SeamlessExpressive, SeamlessStreaming, and SeamlessM4T v2, and is designed for multilingual, expressive, and streaming speech translation.

Key features and capabilities of these models:

SeamlessM4T v2: An improved version of the large-scale multilingual and multimodal translation model, with better quality and lower inference latency on speech generation tasks. It is based on the updated UnitY2 framework and is trained on more low-resource language data. SeamlessM4T v2 provides the foundation for the other models.

SeamlessM4T v2 delivers state-of-the-art speech-to-speech and speech-to-text translation results in nearly 100 languages. The same model also beats Whisper v3 on average automatic speech recognition accuracy, especially for lower-resource languages.

SeamlessM4T v2 is a 10% improvement over the model released in August, and more than 17% better than the strongest cascaded model when translating into English. For speech-to-speech translation, it improves by more than 15% when translating into English and by 25% over SeamlessM4T (v1) when translating from English.

The following tasks are supported:
• Speech-to-speech translation (S2ST)
• Speech-to-text translation (S2TT)
• Text-to-speech translation (T2ST)
• Text-to-text translation (T2TT)
• Automatic Speech Recognition (ASR)
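The five task codes differ only in their input and output modalities. A tiny hypothetical helper (not part of the official seamless_communication API) makes the taxonomy explicit:

```python
# Hypothetical mapping (illustrative only, not the library's API):
# each Seamless task code paired with its (input, output) modalities.
SEAMLESS_TASKS = {
    "S2ST": ("speech", "speech"),  # speech-to-speech translation
    "S2TT": ("speech", "text"),    # speech-to-text translation
    "T2ST": ("text", "speech"),    # text-to-speech translation
    "T2TT": ("text", "text"),      # text-to-text translation
    "ASR":  ("speech", "text"),    # transcription, same language
}

def describe(task):
    """One-line summary of a task's modalities."""
    src, tgt = SEAMLESS_TASKS[task]
    return f"{task}: {src} in, {tgt} out"

print(describe("S2ST"))  # S2ST: speech in, speech out
```

Note that ASR and S2TT share the same modality pair; they differ in that ASR transcribes within one language while S2TT translates across languages.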

SeamlessExpressive: This model maintains the style and rhythm of the voice during translation. Compared with previous work on expressive speech, SeamlessExpressive focuses on underexplored aspects of prosody such as speech rate and pauses, while preserving the speaker’s vocal style.

SeamlessStreaming:

This is a streaming translation model that accepts speech input and produces speech or text output. It supports the following tasks:
• Speech-to-speech translation (S2ST)
• Speech-to-text translation (S2TT)
• Automatic Speech Recognition (ASR)

This model uses the Efficient Monotonic Multi-Head Attention (EMMA) mechanism to generate low-latency target translations without waiting for the complete source utterance. SeamlessStreaming is the first model to offer simultaneous speech-to-speech and speech-to-text translation across many source and target languages.
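EMMA essentially learns, at each step, whether to READ more source or WRITE a target token. A much simpler classic policy, wait-k (used here purely as an illustrative stand-in, not as EMMA itself), fixes that decision to a constant lag and shows why streaming cuts latency: translation begins after only k source tokens instead of after the whole utterance.

```python
def wait_k_schedule(src_len, tgt_len, k=3):
    """Interleave READ (consume a source token) and WRITE (emit a target
    token) actions under a fixed wait-k policy: writing trails reading by
    at most k tokens, then flushes once the source is exhausted."""
    actions, read, write = [], 0, 0
    while write < tgt_len:
        if read < min(write + k, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            write += 1
    return actions

# The first target token is emitted after only k=2 source tokens,
# rather than after all 5.
print(wait_k_schedule(5, 5, k=2))
```

EMMA replaces the fixed k with a learned, input-dependent monotonic attention schedule, so the model can wait longer on ambiguous passages and commit early on clear ones.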

Meta AI has also released a series of metadata, data, and data alignment tools related to the Seamless Communication project to support the research community.

SeamlessAlign extended metadata: Adds 115,000 hours of speech and text alignment data on top of the existing 470,000 hours. This latest version of SeamlessAlign also covers a wider range of languages, up from the previous 37 to 76. It is by far the largest public parallel speech/speech and speech/text corpus in both total volume and language coverage.

Details: https://ai.meta.com/blog/seamless-communication
Official website: https://ai.meta.com/research/seamless-communication/
Paper: https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/
GitHub: https://github.com/facebookresearch/seamless_communication
Online experience: https://seamless.metademolab.com/expressive?utm_source=metaai&utm_medium=web&utm_campaign=fair10&utm_content=blog
