New AI Model Breaks Language Barriers with Multimodal Translation

Meta, the company formerly known as Facebook, recently unveiled its latest AI model, SeamlessM4T, which aims to revolutionize language translation. This “massively multilingual” model can translate speech and text in up to 100 languages, enabling communication between people who speak different languages.

What sets SeamlessM4T apart is that a single model handles five tasks: automatic speech recognition, speech-to-text translation, speech-to-speech translation, text-to-text translation, and text-to-speech translation. With support for approximately 100 languages for text translation and 36 languages for speech output, SeamlessM4T offers an unusually broad range of language coverage.
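The five tasks differ only in their input and output modalities. A minimal sketch of that breakdown (the task keys and modality labels below are illustrative, not Meta's API):

```python
# Map each SeamlessM4T task to its (input, output) modality pair.
# Task names and modality labels are illustrative only.
TASKS = {
    "speech_recognition":           ("speech", "text"),  # transcription, same language
    "speech_to_text_translation":   ("speech", "text"),
    "speech_to_speech_translation": ("speech", "speech"),
    "text_to_text_translation":     ("text", "text"),
    "text_to_speech_translation":   ("text", "speech"),
}

def modalities(task: str) -> tuple[str, str]:
    """Return the (input, output) modality pair for a given task."""
    return TASKS[task]

print(modalities("speech_to_speech_translation"))  # ('speech', 'speech')
```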

Meta’s approach here is notable: the company is releasing SeamlessM4T under a research license that permits developers to build on the model’s capabilities, encouraging collaboration and innovation in translation AI. In addition, Meta is releasing SeamlessAlign, a dataset for multimodal translation research consisting of 270,000 hours of speech and text alignments, which should help researchers worldwide develop future translation models.

To train SeamlessM4T, Meta’s researchers mined a multimodal corpus, SeamlessAlign, comprising over 470,000 hours of automatically aligned speech translations. They refined this corpus with human-labeled and pseudo-labeled data, producing a training set totaling 406,000 hours. Meta is tight-lipped about the exact sources of its training data. The text data comes from the dataset built for NLLB (No Language Left Behind), Meta’s earlier text-translation model, with sentences drawn from various platforms and translated by professional human translators. The speech data comes from a vast repository of web audio amounting to 4 million hours, 1 million of which are in English.

Meta is not the first to build machine-learning translation tools; earlier advances in audio processing, such as OpenAI’s Whisper, paved the way. What distinguishes SeamlessM4T is its “single system approach”: one monolithic model handles the entire translation task instead of a chain of separate models (for example, speech recognition followed by text translation followed by speech synthesis). This improves efficiency and avoids errors compounding between stages.
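One way to see the appeal of a single system: in a cascaded pipeline, each stage’s errors multiply. A toy calculation with hypothetical per-stage accuracies (the numbers are made up purely for illustration):

```python
# Hypothetical per-stage accuracies in a cascaded speech-to-speech pipeline:
# speech recognition -> text translation -> speech synthesis.
stage_accuracies = [0.95, 0.95, 0.95]

cascade_accuracy = 1.0
for acc in stage_accuracies:
    cascade_accuracy *= acc  # errors compound multiplicatively across stages

single_system_accuracy = 0.95  # one end-to-end model: one chance to err

print(f"cascade:       {cascade_accuracy:.4f}")   # 0.8574
print(f"single system: {single_system_accuracy}")  # 0.95
```

With these (invented) numbers, three 95%-accurate stages yield only about 86% end-to-end accuracy, while a single 95%-accurate system stays at 95%.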

For a deeper understanding of SeamlessM4T, technical details are available on Meta’s website, and the model’s code and weights can be accessed on the Hugging Face platform.


Frequently Asked Questions

1. Can SeamlessM4T translate both speech and text?

Yes, SeamlessM4T is a multimodal AI model that can handle both speech and text translations.

2. How many languages does SeamlessM4T support?

SeamlessM4T supports translations in up to 100 languages for text translation and approximately 36 languages for speech output.

3. Under what license is SeamlessM4T released?

Meta has released SeamlessM4T under a research license (CC BY-NC 4.0) that allows developers to build upon the model’s capabilities.

4. Where did Meta obtain the training data for SeamlessM4T?

Meta used a variety of sources, including the text dataset built for NLLB (No Language Left Behind) and a large repository of web audio for speech data.

5. What is the significance of SeamlessAlign?

SeamlessAlign is a dataset released by Meta that consists of 270,000 hours of speech and text alignments. It serves as a valuable resource for researchers working in the field of multimodal translation.
