Mistral AI Launches Open-Source Voxtral TTS Model for Multilingual Voice Agents

By admin | Mar 26, 2026 | 5 min read

French AI firm Mistral on Thursday introduced a new open-source text-to-speech model aimed at applications ranging from voice AI assistants to enterprise functions such as customer support. The model lets businesses build voice agents for sales and customer engagement, putting Mistral in direct competition with companies like ElevenLabs, Deepgram, and OpenAI.

Named Voxtral TTS, the new model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. According to the company, customer demand drove the creation of a compact speech model capable of running on devices like smartwatches, smartphones, laptops, and other edge hardware.

Mistral highlighted that the model can adapt to a custom voice from a sample shorter than five seconds, capturing nuances such as subtle accents, inflections, intonations, and irregularities in speech flow. Built on the Ministral 3B architecture, it can switch between languages without losing the voice's unique characteristics, making it suitable for applications like dubbing or real-time translation. The company emphasized a focus on achieving a natural, human-like sound rather than a robotic tone.

The model is engineered for real-time performance. Its time-to-first-audio (TTFA), the delay before the model begins speaking after receiving input, is 90 milliseconds for a 10-second, 500-character sample. It also has a real-time factor (RTF) of 6x, meaning it generates audio six times faster than playback speed: a 10-second clip takes approximately 1.7 seconds to render (10 / 6).
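The arithmetic behind the real-time factor can be checked with a short sketch. It assumes "6x" means six seconds of audio produced per second of compute, which is how the figure is used here; the function names are hypothetical and are not part of any Mistral API.

```python
def render_seconds(audio_seconds: float, speedup: float) -> float:
    """Compute time needed to synthesize `audio_seconds` of speech.

    `speedup` is the "Nx" real-time factor as the article uses it:
    seconds of audio produced per second of compute (an assumption).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def keeps_up_with_playback(speedup: float) -> bool:
    """True when audio is generated at least as fast as it is played back."""
    return speedup >= 1.0


if __name__ == "__main__":
    clip = 10.0      # seconds of audio, from the article
    speedup = 6.0    # the quoted 6x real-time factor
    ttfa = 0.090     # the quoted 90 ms time-to-first-audio

    print(f"full clip rendered in {render_seconds(clip, speedup):.2f} s")
    print(f"streaming playback can start after about {ttfa * 1000:.0f} ms")
    print(f"generation outpaces playback: {keeps_up_with_playback(speedup)}")
```

Dividing 10 seconds by 6 gives roughly 1.67 seconds for the whole clip. In a streaming setup the listener would not wait that long, since playback can begin once the first audio arrives and generation stays ahead of playback for any factor above 1x.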

Earlier this year, Mistral released two transcription models: one for large-scale batch processing and another for low-latency, real-time use cases. With this new speech model, the company appears to be building a comprehensive suite of voice products for enterprises. Mistral envisions an end-to-end platform capable of processing multimodal inputs (audio, text, and images) and producing corresponding outputs. The stated advantage is that an end-to-end agentic system supporting audio input and output can capture significantly more information than one that handles text alone.

Mistral's strategy centers on its open-source approach and customization capabilities. The company believes these will encourage enterprises to choose its voice models over those of competitors, because they can fine-tune the technology to their specific needs.