Microsoft AI Unveils Trio of Multimodal Models for Text, Voice, and Image Generation

By admin | Apr 02, 2026 | 2 min read

Microsoft AI, the company's dedicated research division, unveiled three new foundational AI models on Thursday. These models are designed to generate text, voice, and images, marking a significant step in Microsoft's strategy to develop its own comprehensive suite of multimodal AI systems. This move strengthens its competitive position against other AI labs, even as it maintains its established partnership with OpenAI.

The first model, MAI-Transcribe-1, specializes in speech-to-text conversion across 25 languages. According to the company's announcement, it operates 2.5 times faster than Microsoft's existing Azure Fast service. The second, MAI-Voice-1, is an audio generation model that can produce 60 seconds of audio in one second and includes features for creating custom voices. The third model, MAI-Image-2, generates images. It was first made available on March 19 on MAI Playground, a new large language model testing platform.

All three models are now being released on Microsoft Foundry, and the transcription and voice models are also accessible through MAI Playground. The models were developed by the MAI Superintelligence team, an AI research group led by Microsoft AI CEO Mustafa Suleyman. The team was formed and announced in November 2025.

In a blog post, Suleyman outlined the philosophy behind the development. "At Microsoft AI, we’re building Humanist AI. We have a distinct view when creating our AI models - putting humans at the center, optimizing for how people actually communicate, training for practical use," he wrote. He also hinted at future releases, stating, "You’ll see more models from us soon in Foundry and directly in Microsoft products and experiences."

The company highlighted competitive pricing as a key advantage in the crowded large language model market, positioning these models as more cost-effective than offerings from Google and OpenAI. Pricing starts at $0.36 per hour for MAI-Transcribe-1 and $22 per 1 million characters for MAI-Voice-1. MAI-Image-2 is priced at $5 per 1 million tokens of text input and $33 per 1 million tokens of image output.
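To put those starting prices in concrete terms, the following is a minimal back-of-the-envelope sketch in Python. It assumes charges scale linearly with usage at the quoted starting rates, which the announcement does not confirm, and the constant and function names are hypothetical illustrations rather than part of any Microsoft API.

```python
# Rough cost estimates from the starting prices quoted in this article.
# Assumption: cost scales linearly with usage; real billing may differ.
# All names below are hypothetical and are not part of any Microsoft SDK.

TRANSCRIBE_USD_PER_HOUR = 0.36               # MAI-Transcribe-1, per hour of audio
VOICE_USD_PER_MILLION_CHARS = 22.0           # MAI-Voice-1, per 1M characters
IMAGE_INPUT_USD_PER_MILLION_TOKENS = 5.0     # MAI-Image-2, text input
IMAGE_OUTPUT_USD_PER_MILLION_TOKENS = 33.0   # MAI-Image-2, image output


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost in USD of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost in USD of synthesizing speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of an image workload, split by input and output tokens."""
    input_cost = input_tokens / 1_000_000 * IMAGE_INPUT_USD_PER_MILLION_TOKENS
    output_cost = output_tokens / 1_000_000 * IMAGE_OUTPUT_USD_PER_MILLION_TOKENS
    return input_cost + output_cost


if __name__ == "__main__":
    print(f"100 hours of transcription: ${transcription_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"1M input + 1M output image tokens: ${image_cost(1_000_000, 1_000_000):.2f}")
```

Under these assumptions, 100 hours of transcription would come to about $36 and 500,000 characters of generated speech to about $11.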

Even as Microsoft launches its own proprietary models, Suleyman has reaffirmed the company's ongoing commitment to its partnership with OpenAI. In an interview with VentureBeat, he noted that a recent renegotiation of the partnership terms has enabled Microsoft to advance its own superintelligence research. Microsoft has invested over $13 billion in OpenAI and integrates its models across various products through a multi-year agreement. This dual approach mirrors Microsoft's strategy in other areas, such as semiconductors, where it both develops its own chips and sources them from external suppliers.