Microsoft Launches 3 MAI Models to Rival OpenAI, Google

Table of Contents
  1. What Are the Three Microsoft MAI Models?
  2. Benchmarks, Pricing, and the Competitive Picture
  3. How Developers Can Access the MAI Models Now
  4. Common Questions — Microsoft MAI Models
  5. Conclusion

Microsoft built its own AI models — and they are already competing at the top of global benchmarks. On April 2, 2026, the company unveiled three Microsoft MAI models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. All three are available immediately through Microsoft Foundry, with no reliance on OpenAI infrastructure. This is a direct challenge to OpenAI’s Whisper, Google’s Gemini, and every other AI lab competing for enterprise contracts.

The timing is deliberate. Microsoft renegotiated its OpenAI contract in September 2025, freeing itself to pursue independent model development. The MAI Superintelligence team — formed just five months ago in November 2025 — has already shipped three production-ready models.

In this article, you will learn what each model does, how they benchmark against rivals, what they cost, and what this means for developers and businesses building on Azure.

What Are the Three Microsoft MAI Models?

The three models were built by Microsoft’s MAI Superintelligence team under the direction of Mustafa Suleyman, CEO of Microsoft AI. Together, they cover the three most commercially important AI capabilities in enterprise software: speech-to-text, voice generation, and image creation.

MAI-Transcribe-1 — Accurate Speech Recognition at Scale

MAI-Transcribe-1 converts spoken audio to text across 25 languages. On the FLEURS benchmark — the industry standard for multilingual speech recognition — it achieves a word error rate (WER) of approximately 3.9%, ranking first among all tested models. It outperforms Google’s Gemini 3.1 Flash-Lite on 22 of 25 languages and beats OpenAI’s Whisper-large-v3 on overall accuracy.
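
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not the FLEURS scoring pipeline. Benchmark scoring typically normalizes case and punctuation first, which this example skips, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not benchmark data): one substituted word out of six.
wer = word_error_rate(
    "please schedule the meeting for monday",
    "please schedule a meeting for monday",
)
print(f"{wer:.3f}")  # 0.167
```

Read this way, a WER of about 3.9% corresponds to roughly four word errors for every 100 words in the reference transcript.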

Speed is equally impressive. Batch transcription runs 2.5x faster than Microsoft’s previous Azure Fast offering. For businesses processing large volumes of audio — call centers, meeting platforms, media companies — that throughput advantage translates directly into cost savings. Pricing starts at $0.36 per audio hour.

MAI-Voice-1 — Expressive, High-Speed Voice Generation

MAI-Voice-1 generates natural-sounding speech from text. Its standout performance metric: it produces 60 seconds of audio in under one second on a single GPU. The model preserves speaker identity across long-form content and supports a wide emotional range — useful for applications like audiobooks, customer service bots, and accessibility tools.
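
To put the speed figure in perspective, the short sketch below converts it into a real-time factor and applies it to a long-form job. It assumes throughput scales linearly with audio length on a single GPU, which the article does not state, and the 10-hour audiobook is a hypothetical workload. Because the quoted figure is "under one second", the factor is a lower bound and the resulting time an upper bound.

```python
def realtime_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of audio produced per second of compute time."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


def synthesis_seconds(audio_seconds: float, factor: float) -> float:
    """Compute time needed for a job, assuming linear scaling (an assumption)."""
    return audio_seconds / factor


# Quoted figure: 60 seconds of audio in (at most) 1 second on one GPU.
factor = realtime_factor(60.0, 1.0)

# Hypothetical workload: a 10-hour audiobook.
audiobook_seconds = 10 * 3600
print(synthesis_seconds(audiobook_seconds, factor))  # 600.0 seconds, about 10 minutes
```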

MAI-Voice-1 integrates directly with Azure Speech’s gallery of more than 700 voices, so developers get the new model’s quality alongside the existing range of voices. Pricing starts at $22 per million characters.

MAI-Image-2 — Top-Tier Text-to-Image Generation

MAI-Image-2 debuted in the top 3 on the Arena.ai image generation leaderboard, placing it among the most capable image models available globally. Microsoft reports at least 2x faster generation times on Foundry and Copilot compared to its predecessor, based on real-world production traffic data.

The model is already rolling out inside Copilot, Bing, and PowerPoint, which means millions of enterprise users will interact with MAI-Image-2 without switching tools or managing API keys.

Benchmarks, Pricing, and the Competitive Picture

Microsoft’s pricing strategy is explicitly designed to undercut the other hyperscalers. Mustafa Suleyman stated the company is pricing its models to be the very best value offered by any hyperscaler. Here is a side-by-side comparison:

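| Model            | Capability     | Key Benchmark            | Starting Price            |
|------------------|----------------|--------------------------|---------------------------|
| MAI-Transcribe-1 | Speech-to-text | #1 on FLEURS (WER ~3.9%) | $0.36 / audio hour        |
| MAI-Voice-1      | Text-to-speech | 60s audio in <1 second   | $22 / 1M characters       |
| MAI-Image-2      | Text-to-image  | Top 3 on Arena.ai        | $5 / 1M text input tokens |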

MAI-Image-2 carries a dual pricing structure: $5 per million text input tokens and $33 per million image output tokens. For comparison, DALL-E 3 via OpenAI’s API charges $0.04 to $0.08 per image at standard resolutions. Microsoft’s token-based pricing may offer cost advantages for high-volume generation workflows, depending on how many output tokens each image consumes.
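
The quoted starting prices can be turned into rough budget estimates. The sketch below uses only the prices stated in this article; the workload sizes are hypothetical, and the number of output tokens a single image consumes is not published here, so the break-even helper simply reports how many output tokens a flat per-image price would buy. It ignores prompt cost, which is small (a 100-token prompt costs $0.0005 at the quoted input rate).

```python
# Starting prices quoted in the article (USD).
TRANSCRIBE_PER_AUDIO_HOUR = 0.36
VOICE_PER_MILLION_CHARS = 22.00
IMAGE_PER_MILLION_INPUT_TOKENS = 5.00
IMAGE_PER_MILLION_OUTPUT_TOKENS = 33.00


def transcription_cost(audio_hours: float) -> float:
    return audio_hours * TRANSCRIBE_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    return characters / 1_000_000 * VOICE_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    return (
        input_tokens / 1_000_000 * IMAGE_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_PER_MILLION_OUTPUT_TOKENS
    )


def break_even_output_tokens(per_image_price: float) -> float:
    """Output tokens at which token billing matches a flat per-image price
    (prompt cost ignored)."""
    return per_image_price / IMAGE_PER_MILLION_OUTPUT_TOKENS * 1_000_000


# Hypothetical workloads, for illustration only.
print(f"1,000 audio hours: ${transcription_cost(1_000):.2f}")
print(f"5M characters of speech: ${voice_cost(5_000_000):.2f}")
print(f"Break-even vs $0.04/image: {break_even_output_tokens(0.04):.0f} output tokens")
print(f"Break-even vs $0.08/image: {break_even_output_tokens(0.08):.0f} output tokens")
```

Under these assumptions, token billing is the cheaper option only if an image uses fewer than roughly 1,200 output tokens against a $0.04 flat price, or roughly 2,400 against $0.08.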

The enterprise angle is where Microsoft’s strategy becomes most significant. Businesses running Microsoft 365 E5 — the top-tier enterprise subscription — will receive MAI model capabilities bundled into their existing contracts. That means no incremental per-use billing for transcription, voice, and image generation. Independent AI providers cannot replicate that bundling advantage. Follow the latest tech news to track how this pricing shift reshapes the AI market in the months ahead.

How Developers Can Access the MAI Models Now

All three models are live. Developers can experiment in the MAI Playground and deploy through Microsoft Foundry or Azure Speech. The path to production is straightforward for teams already working in the Azure ecosystem.

To access the models, follow these steps:

  1. Sign into the Azure Portal and navigate to Microsoft Foundry.
  2. Open the MAI Playground to test models with your own audio or text inputs.
  3. Select MAI-Transcribe-1, MAI-Voice-1, or MAI-Image-2 from the model catalog.
  4. Deploy using the Azure Speech SDK or Foundry REST API endpoints.
  5. For image generation, access MAI-Image-2 via Copilot or the Foundry API.
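
As a rough illustration of step 4, the sketch below assembles an authenticated JSON POST request using only Python's standard library. Everything specific in it is a placeholder: the article does not give the endpoint URL, the authentication header name, or the request schema, so `ENDPOINT_URL`, the `api-key` header, and the payload fields are hypothetical and must be replaced with the values from the deployment page and official API reference. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical placeholders: copy the real values from your own deployment.
ENDPOINT_URL = "https://example.invalid/your-deployment/synthesize"
API_KEY = "YOUR_KEY_HERE"


def build_request(
    endpoint_url: str,
    api_key: str,
    payload: dict,
    key_header: str = "api-key",  # assumed header name; check the API reference
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a deployed model."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json", key_header: api_key},
        method="POST",
    )


# The payload fields below are illustrative, not a documented schema.
request = build_request(
    ENDPOINT_URL,
    API_KEY,
    {"model": "MAI-Voice-1", "text": "Hello from the MAI Playground."},
)
print(request.get_method(), request.full_url)

# Sending would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect or log the exact payload before pointing the code at a live, billable endpoint.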

MAI-Image-2 also works inside PowerPoint through Copilot integration, removing the need for any developer setup for non-technical users. Full documentation is available via the Microsoft Community Hub, which includes API reference guides and deployment examples.

For developers in Southeast Asia, this launch carries particular weight. Azure operates data centers in Singapore, with further Asia-Pacific regions in Japan and India, so latency-sensitive applications such as live transcription and real-time voice synthesis can be deployed closer to regional users. With MAI-Transcribe-1 topping FLEURS across its 25 supported languages, which include major Southeast Asian languages, multilingual markets across the region now have a strong case for switching from Whisper or Google Speech-to-Text.

Mustafa Suleyman described Microsoft’s competitive position directly: the company now considers itself a top-three AI lab behind only OpenAI and Google. That claim would have seemed improbable a year ago. The MAI model launch gives it credibility.

Common Questions — Microsoft MAI Models

Q: What are Microsoft’s MAI models?

A: Microsoft’s MAI models are three in-house foundational AI models released on April 2, 2026: MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for text-to-speech, and MAI-Image-2 for text-to-image generation. They are available through Microsoft Foundry, the MAI Playground, and Azure Speech services.

Q: How does MAI-Transcribe-1 compare to OpenAI Whisper?

A: MAI-Transcribe-1 ranks #1 on the FLEURS benchmark with a word error rate of approximately 3.9%, outperforming Whisper-large-v3 on overall accuracy across 25 languages. It also transcribes audio 2.5x faster than Microsoft’s previous Azure Fast service, at a starting price of $0.36 per audio hour.

Q: How much do the Microsoft MAI models cost?

A: MAI-Transcribe-1 costs $0.36 per audio hour. MAI-Voice-1 costs $22 per million characters. MAI-Image-2 costs $5 per million text input tokens and $33 per million image output tokens. Microsoft 365 E5 enterprise subscribers may receive these capabilities bundled into their existing contracts at no incremental per-use cost.

Q: Are Microsoft MAI models available globally?

A: Yes. All three models are available immediately worldwide through Microsoft Foundry and the MAI Playground. MAI-Transcribe-1 and MAI-Voice-1 are also accessible via Azure Speech. MAI-Image-2 is rolling out in Copilot, Bing, and PowerPoint for enterprise Microsoft 365 users.

Conclusion

Microsoft’s MAI model launch signals a fundamental shift in the company’s AI strategy. Three key takeaways stand out:

  • Microsoft is now an independent AI lab, competing directly with OpenAI and Google rather than relying solely on OpenAI’s models.
  • Benchmark performance is real: MAI-Transcribe-1 holds the #1 position on FLEURS as of April 2026, and MAI-Image-2 ranks top 3 on Arena.ai.
  • Enterprise bundling changes the economics: M365 E5 customers gain access to newer AI capabilities without paying per-use rates that independent providers must charge.

For developers, the MAI models offer a credible alternative to OpenAI and Google APIs — especially within Azure-native architectures. For enterprises, the M365 bundling strategy may make Microsoft the default AI provider by choice and by contract. Explore our AI coverage for deeper analysis of foundation model developments, and check our Tech News section for daily updates on the latest releases.

About the author: TouchEVA is a tech journalist covering AI, software, and cybersecurity for Hubkub.com — independent tech media since 2025. Every article is researched from primary sources and verified data.

Last Updated: April 13, 2026

TouchEVA

Founder and lead writer at Hubkub. Covers software, AI tools, cybersecurity, and practical Windows/Linux workflows.