In late May 2026, MiniMax’s Head of Engineering, Skyler Miao, broke cover with the first technical preview of MiniMax M3, the Shanghai lab’s next flagship language model. The pitch is sharp: 9.7× faster prefill and 15.6× faster decoding at 1-million-token context versus MiniMax M2. The trick is reintroducing sparse attention, an architecture MiniMax explicitly killed in its own M2 generation a year earlier.
That reversal is the story. In this preview we cover what’s confirmed, what’s still teased, when M3 is likely to ship, and how the speed claims hold up. We also look at what it means for the race against Claude, GPT, Gemini, and DeepSeek, and what you should do until the weights drop. Everything here is sourced from MiniMax’s own posts, its Hugging Face model cards, and the public prediction markets tracking the release.
The Key Takeaways
- No weights yet. As of May 27, 2026, M3 has no Hugging Face repo, no API, and no minimax.io/news entry. The current shipping flagship is MiniMax-M2.7 (229B parameters), released March 18, 2026.
- The hook is sparse attention. M3 returns to sparse attention after MiniMax abandoned it for M2, claiming 9.7× faster prefill and 15.6× faster decoding at 1M tokens versus M2.
- Release window. MiniMax has committed to the second half of 2026. Prediction markets give M3 a 65% chance by July, 80% by September, and 87% by December 2026.
- No accuracy benchmarks yet. All numbers so far are MiniMax-supplied speed claims. SWE-Bench, BFCL, and BrowseComp results land at launch.
- Open-weight, probably non-commercial. If M3 follows M2.7’s licensing, weights will be downloadable but commercial use will require written authorization.
When is MiniMax M3 coming out?
Officially, MiniMax has said only that the next generation of M-series (and Hailuo 3) models will arrive in the second half of 2026. That is the entire on-record commitment.
Independent prediction markets are more specific. The Manifold M3 release-date market currently prices the launch like this.
| Release window | Implied probability |
|---|---|
| Before July 2026 | 65% |
| Before September 2026 | 80% |
| Before December 2026 | 87% |
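The table lists cumulative "released before X" probabilities. A minimal sketch of how to read them as per-window probabilities follows, assuming the three cutoffs can be treated as one consistent cumulative distribution (an assumption made here, not something the market states). The function name `interval_probabilities` is illustrative.

```python
# Cumulative "released before <cutoff>" probabilities from the table above.
CUMULATIVE = [
    ("July 2026", 0.65),
    ("September 2026", 0.80),
    ("December 2026", 0.87),
]


def interval_probabilities(cumulative):
    """Convert cumulative probabilities into the mass assigned to each window."""
    windows = []
    prev_label, prev_p = "now", 0.0
    for label, p in cumulative:
        # Probability of landing in this window = difference of two cutoffs.
        windows.append((f"{prev_label} to {label}", round(p - prev_p, 2)))
        prev_label, prev_p = label, p
    # Whatever is left is the chance that no cutoff is met.
    windows.append((f"after {prev_label}", round(1.0 - prev_p, 2)))
    return windows


for window, prob in interval_probabilities(CUMULATIVE):
    print(f"{window}: {prob:.0%}")
```

Read this way, the market leaves 13% on M3 not arriving before the December cutoff at all.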
The “May drop” rumors that circulated on X earlier this month, including a widely shared post claiming M3 would land in May with “huge gains in agentic capabilities” and a focus on office scenarios, did not pan out. As of May 27, there are no weights.
Skyler Miao’s actual public posture is more grounded; in earlier posts on X he wrote “M3 is scaling up” and “you won’t be disappointed. m3 is gonna stretch people’s imagination.” No date. No specs. Just an architecture preview.
A realistic read: H2 2026 means anytime between July and December, with the August-to-September window most heavily traded by the market.
What’s new in MiniMax M3? Sparse attention is back
The headline change is architectural. MiniMax M3 returns to sparse attention after the company spent the entire M2 generation, including M2, M2.1, M2.5, and M2.7, on full attention across the context window.
In plain English, full attention means every token in the context can “look at” every other token. That is accurate, but expensive. Sparse attention lets the model skip most of those connections and focus only on the tokens that matter for the next word. The savings compound as the context gets longer.
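To make "the savings compound" concrete, here is a rough counting sketch. It compares how many query-key pairs one attention head scores under full causal attention with a version where each token looks at a fixed number of earlier tokens. The budget of 8,192 tokens is a made-up illustration, not a disclosed M3 setting, and the count ignores the cost of deciding which tokens to keep.

```python
def attention_pairs(context_len, budget=None):
    """Count query-key pairs scored by one causal attention head.

    budget=None means full attention: token i attends to all i tokens so far.
    Otherwise each token attends to at most `budget` tokens.
    """
    if budget is None:
        return context_len * (context_len + 1) // 2
    # Early tokens have fewer than `budget` predecessors, so they stay dense.
    dense_part = min(context_len, budget)
    sparse_part = max(0, context_len - budget)
    return dense_part * (dense_part + 1) // 2 + sparse_part * budget


HYPOTHETICAL_BUDGET = 8_192  # illustrative only, not an M3 parameter

for n in (10_000, 100_000, 1_000_000):
    full = attention_pairs(n)
    sparse = attention_pairs(n, HYPOTHETICAL_BUDGET)
    print(f"{n:>9} tokens: full/sparse pair ratio = {full / sparse:.1f}")
```

Full attention grows with the square of the context length while the fixed-budget version grows linearly, so the ratio keeps widening as the prompt gets longer. This is a count of score computations only, not a prediction of wall-clock speedup.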
How sparse attention works in MiniMax M3
From the technical preview, M3’s design has two parts working in sequence. First, a lightweight index branch scans incoming tokens and picks which blocks of past tokens deserve attention. Then sparse attention runs only on those relevant key-value blocks.
The base is Grouped Query Attention (GQA), not Multi-head Latent Attention (MLA). Independent researcher Elie Bakouch describes the design as “block level selection like in CSA but attention is done on the real KV, not in the compressed dimension.” Translation: M3 keeps the precision of real attention computation while only doing it where it actually counts.
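The two-stage idea can be sketched for a single attention head and a single decode step. This is a toy illustration under stated assumptions, not MiniMax's implementation. The block scoring rule (query dotted with the mean key of each block), the block size, and `top_k` are all hypothetical, and the sharing of KV heads across query heads that GQA adds is left out for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_decode_step(q, keys, values, block_size=4, top_k=2):
    """One decode step for one head with block-level KV selection.

    q is the query vector; keys and values are non-empty lists of cached
    vectors. Returns (output_vector, selected_block_indices).
    """
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Stage 1, index branch (hypothetical rule): score each block cheaply by
    # comparing the query with the mean key of that block.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Stage 2, sparse attention: ordinary softmax attention, but only over
    # the real (uncompressed) keys and values inside the selected blocks.
    token_ids = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
           for d in range(len(values[0]))]
    return out, selected
```

Setting `top_k` to the total number of blocks recovers full attention, which is a handy sanity check when experimenting with a selection rule.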
Why MiniMax killed sparse attention in M2 and is bringing it back
This part matters because it is a self-correction. In MiniMax’s own engineering blog explaining the M2 architecture, the team wrote that “the infrastructure for linear and sparse attention is much less mature” than full attention, and that “in a real-world, industrial-grade system, the truth is that efficient attention still has some way to go before it can definitively beat full attention.”
That was published roughly a year ago. The same team is now claiming production-ready sparse attention with 9.7× and 15.6× speedups. Either the infrastructure caught up fast, or MiniMax has solved something specific. The accuracy benchmarks at launch will tell us which.
How fast is MiniMax M3? Breaking down the 9.7× and 15.6× numbers
MiniMax keeps repeating two figures, both measured against M2 at 1-million-token context length.
| Phase | Speedup vs MiniMax M2 |
|---|---|
| Prefill (processing the input) | 9.7× faster |
| Decoding (generating output) | 15.6× faster |
For perspective, a 1M-token prompt on M2 or M2.7 today is technically possible but practically painful: the prefill stage alone can take many minutes, and the per-token output cost is significant. A 9.7× prefill speedup at that scale is not a small optimization. It is the difference between viable and unviable for enterprise agent workloads that need to reason over an entire codebase, a year of email, or a 500-page legal corpus.
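A quick way to see what the two multipliers would mean end to end is to plug them into a simple latency model. The sketch below assumes total time is prefill time plus output tokens divided by decode rate. The baseline figures in the usage example (a 10-minute prefill, 5 tokens per second, 2,000 output tokens) are invented placeholders, not measurements of M2 or M2.7.

```python
PREFILL_SPEEDUP = 9.7   # MiniMax's claimed figure at 1M-token context
DECODE_SPEEDUP = 15.6   # MiniMax's claimed figure at 1M-token context


def end_to_end_seconds(prefill_s, decode_tok_per_s, output_tokens):
    """Total request time: process the prompt, then generate the answer."""
    return prefill_s + output_tokens / decode_tok_per_s


def projected(prefill_s, decode_tok_per_s, output_tokens):
    """Return (before, after, overall_speedup) if both claims hold exactly."""
    before = end_to_end_seconds(prefill_s, decode_tok_per_s, output_tokens)
    after = end_to_end_seconds(prefill_s / PREFILL_SPEEDUP,
                               decode_tok_per_s * DECODE_SPEEDUP,
                               output_tokens)
    return before, after, before / after


# Hypothetical baseline: 600 s prefill, 5 tokens/s decode, 2,000 output tokens.
before, after, speedup = projected(600.0, 5.0, 2_000)
print(f"before: {before:.0f} s, after: {after:.0f} s, overall: {speedup:.1f}x")
```

Under this model the overall speedup always lands between the two claimed figures, with the exact value depending on how much of the request is prompt processing versus generation.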
What MiniMax has not disclosed is equally important. There are no accuracy benchmarks yet, no comparison of M3 against M2.7 on SWE-Bench Verified, BFCL, BrowseComp, or MMLU. No public quality-vs-speed tradeoff curve. And there is no third-party verification of the speedup numbers either, just the MiniMax-supplied figures.
Apply the same skepticism you would to any vendor benchmark. The numbers are real until something contradicts them, but a 15.6× speedup with zero accuracy regression would be one of the most impressive long-context engineering results of 2026 if it survives independent testing.
MiniMax M3 vs M2.7 vs M2.5: How the M-series stacks up
Here is the practical comparison across MiniMax’s current M-series and a major Western reference point.
| | MiniMax M3 (teased) | MiniMax M2.7 | MiniMax M2.5 | Claude Opus 4.7 (reference) |
|---|---|---|---|---|
| Status | Teased, H2 2026 | Released Mar 18, 2026 | Released Feb 12, 2026 | Released |
| Parameters | TBD | 229B | 229B | Undisclosed |
| Attention | Sparse (new) | Full | Full | Proprietary |
| 1M-token decoding | 15.6× vs M2 | Baseline | Baseline | Varies |
| Open weight | Likely (TBD) | Yes, non-commercial | Yes | No |
| Input pricing | TBD | ~$0.30 / M tokens | Comparable | $5 / M tokens |
| Focus | Long context + agents | Self-evolution + office | Coding + agents | Reasoning + agents |
The M2.7 row is the one to anchor against. It is the current shipping flagship, 229B parameters, full attention, with a focus on self-evolution (iteratively rewriting its own scaffold code) and office-suite editing in Excel, PowerPoint, and Word.
If you are using MiniMax today, you are almost certainly using M2.7. M3 is not replacing M2.7 in your stack on day one; it is the model that lets you justify the 1M-token context that M2.7 makes too slow to bother with. For background on how the M-series got here, see our MiniMax M2.5 breakdown.
How does MiniMax M3 compare to Claude, GPT-5.5, Gemini, and DeepSeek?
Until M3 ships with verified accuracy benchmarks, this is a positioning question, not a leaderboard question. Here is how it stacks up on the dimensions we can read today.
Against Claude Opus 4.7. Anthropic’s flagship costs $5 per million input tokens and $25 per million output tokens, and is closed-weight. M3 will almost certainly be more than 15× cheaper on input if it follows M2.7’s pricing of around $0.30/M input tokens, and it will be open-weight. Claude wins on reasoning quality today; M3’s pitch is that long-context economics are about to shift.
Against GPT-5.5. OpenAI does not disclose architecture or pricing internals in the same way, so the comparison is murkier. M3’s sparse-attention pitch is essentially “we made the math work for cheap 1M-token inference,” something OpenAI has not publicly claimed in the same terms.
Against Gemini 3.5. Google has been investing heavily in long context, with Gemini already advertising 1M-token windows. M3 is targeting the same problem from the open-weight side. If the speed claims hold, M3 becomes the first credible open-weight alternative to Gemini’s long-context economics.
Against DeepSeek V4 and Qwen3.7-Max. This is the closest fight. DeepSeek V4 and Qwen3.7-Max are M3’s real competitive set: all Chinese, all open-weight, all racing for the same agentic-AI use cases. The architecture story, sparse attention specifically, is M3’s differentiator in this lane.
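The input-price gap quoted in the Claude Opus 4.7 comparison can be checked with a few lines of arithmetic. The sketch assumes M3 is priced like M2.7 at roughly $0.30 per million input tokens, which is the article's conditional and not an announced M3 price.

```python
CLAUDE_OPUS_INPUT_PER_M = 5.00  # USD per million input tokens, as quoted above
M27_INPUT_PER_M = 0.30          # approximate M2.7 price, assumed to carry over to M3


def input_cost(tokens, price_per_million):
    """Cost in USD of sending `tokens` input tokens at a per-million price."""
    return tokens / 1_000_000 * price_per_million


price_ratio = CLAUDE_OPUS_INPUT_PER_M / M27_INPUT_PER_M
one_million = 1_000_000

print(f"input price ratio: {price_ratio:.1f}x")
print(f"1M-token prompt on Claude Opus 4.7: ${input_cost(one_million, CLAUDE_OPUS_INPUT_PER_M):.2f}")
print(f"1M-token prompt at M2.7 pricing:    ${input_cost(one_million, M27_INPUT_PER_M):.2f}")
```

The ratio works out to a little under 17×, consistent with the "more than 15×" figure, and it covers input tokens only.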
For a broader read on how the U.S.–China model race is shaping pricing and licensing for everyone, see our Anthropic vs OpenAI breakdown.
Will MiniMax M3 be open source?
Probably open-weight, almost certainly not fully open-source in the traditional sense.
Here is the distinction that matters. Open-weight means you can download the model files from Hugging Face and run them locally. Open-source, strictly speaking, also means the training data, training code, and license terms allow unrestricted commercial use.
MiniMax-M2 shipped under a modified-MIT license in 2025, the closest the company has come to fully open. The newer M2.7 license restricts commercial use of the model or derivatives unless MiniMax grants prior written authorization, with commercial APIs and hosted services explicitly named in the license definition. That is open-weight, not open-source.
The reasonable forecast for M3 is weights downloadable from huggingface.co/MiniMaxAI/MiniMax-M3 (the URL does not exist yet, but check there first when M3 lands), a non-commercial license by default, and enterprise licensing available through MiniMax’s direct sales for anyone who needs to ship M3 in a commercial product.
Who is making MiniMax M3?
MiniMax is a Shanghai-based AI lab founded in 2021. Outside China it is best known for its Hailuo video model, covered in our Hailuo 02 review, and for the M-series of language models. The company shipped M1 in mid-2025 (456B parameters) before pivoting to the smaller, denser, more agentic M2 family.
The public face of the M3 teaser is Skyler Miao (@SkylerMiao7), MiniMax’s Head of Engineering. Miao’s preview posts on X are the source of every architectural detail in this article. He is also the one who set the tone for what to expect: M3 is “scaling up” and will “stretch people’s imagination.” Marketing language, sure, but the technical preview that followed has substance.
What MiniMax M3 means for the global AI race
Two things shift if M3 lands and the numbers hold.
First, sparse attention becomes the new baseline for long-context production systems. MiniMax was not bluffing when it called sparse attention not-yet-production-ready in 2025; if they are now confident enough to lead with the architecture, the broader field has a year of catch-up to do. Anthropic, Google DeepMind, and OpenAI all have efficient-attention research in progress, but none have shipped a flagship with this kind of public efficiency commitment.
Second, Chinese open-weight labs keep widening the cost advantage. DeepSeek opened this front. Qwen kept the pressure on. M3 raises it further: cheap to run, open to download, increasingly competitive on capability.
The U.S. labs’ pricing premium gets harder to defend with each launch. For the latest snapshot of where every flagship sits, see our Best AI February 2026 rankings.
Should you wait for MiniMax M3 or use M2.7 now?
If you are running agent workloads or coding pipelines on MiniMax today, stay on M2.7. It is stable, well-priced at around $0.30/M input tokens, and benchmarks competitively against Claude Opus on SWE-Bench Verified.
If your workload needs 1M-token context, things like whole-codebase analysis, multi-document research agents, or long-running session memory, then waiting makes sense. That is the exact case M3 is being built to serve, and the payoff of holding for a 15.6× decoding speedup is enormous.
If you just want to test multiple Chinese and Western flagship models without juggling separate API keys, the Fello AI app for Mac and iOS ($9.99/month) routes between Claude, ChatGPT, Gemini, Grok, and DeepSeek through one interface. M3 is not in Fello AI yet, but the multi-model setup means you will be one tap away the moment any team adds it.
What to watch next
When M3 actually ships, the first signal will come from one of three places. Watch huggingface.co/MiniMaxAI for a new repo titled MiniMax-M3. Watch minimax.io/news for an official launch post. And watch Skyler Miao’s X account for the announcement.
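If you would rather not refresh the page by hand, the Hugging Face check can be scripted. The sketch below is a minimal, standard-library-only example. It assumes that the Hub's model-info endpoint at `/api/models/<repo_id>` answers with HTTP 200 for a public repo and an error status otherwise, and that the repo will be named `MiniMaxAI/MiniMax-M3`; both are assumptions to verify when you run it.

```python
import urllib.error
import urllib.request

# Assumed endpoint shape for the Hugging Face Hub model-info API.
HF_MODEL_API = "https://huggingface.co/api/models/{repo_id}"


def http_status(url, timeout=10):
    """Return the HTTP status code of a GET request, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 401/404 style responses arrive as exceptions; surface the code.
        return err.code


def repo_is_public(repo_id, fetch_status=http_status):
    """True if the Hub reports the repo as publicly reachable.

    `fetch_status` is injectable so the logic can be tested offline.
    """
    return fetch_status(HF_MODEL_API.format(repo_id=repo_id)) == 200


# Example (performs a real network request when uncommented):
# print(repo_is_public("MiniMaxAI/MiniMax-M3"))
```

Running this from a daily scheduled job is enough; a missing repo simply returns `False`, while network failures raise `urllib.error.URLError` so they are not mistaken for "not released".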
Until one of those three lights up, M3 is still a teaser. We will update this article in place the moment it does.
FAQ
When is MiniMax M3 coming out?
MiniMax has said only that M3 will launch in the second half of 2026. Prediction markets give it a 65% chance of shipping by July, 80% by September, and 87% by December 2026. There is no confirmed date yet.
Is MiniMax M3 open source?
Most likely open-weight, not fully open-source. If M3 follows the M2.7 license precedent, weights will be downloadable from Hugging Face but commercial use will require written authorization from MiniMax.
How much faster is MiniMax M3 than M2?
MiniMax claims 9.7× faster prefill and 15.6× faster decoding at 1-million-token context, both measured against M2. No accuracy benchmarks have been disclosed yet, so this is speed only.
What’s the difference between MiniMax M3 and M2.7?
M3 uses sparse attention while M2.7 uses full attention. The shift targets dramatically lower cost and latency for very long contexts. M2.7 remains MiniMax’s shipping flagship for everything else and is not being deprecated.




