A Chinese AI startup just dropped an open-source model with benchmark scores that rival Anthropic and OpenAI’s best — at a fraction of the cost. MiniMax M2.5 claims state-of-the-art performance in coding, web browsing, and tool-calling, while running at 100 tokens per second for roughly $1 per hour. If the numbers hold up, it could reshape the economics of AI agents.
Introducing M2.5, an open-source frontier model designed for real-world productivity.
— MiniMax (official) (@MiniMax_AI) February 12, 2026
– SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work.
– Optimized for efficient execution, 37% faster at complex…
Here’s everything we know so far.
What Is MiniMax M2.5?
MiniMax M2.5 is a large language model built specifically for agentic tasks — coding, web browsing, tool-calling, and multi-step autonomous workflows. It was released in February 2026 by Shanghai-based AI company MiniMax.
The model comes in two variants:
- M2.5 Standard — 50 tokens/sec output speed
- M2.5-Lightning — 100 tokens/sec output speed (higher output token cost)
MiniMax positions M2.5 as a “digital employee” designed for sustained, independent work rather than simple chat interactions. The company claims that within their own organization, 30% of tasks are completed autonomously by M2.5 and 80% of newly committed code is M2.5-generated.
Benchmark Performance
MiniMax reports the following scores:
| Benchmark | Score | What It Measures |
|---|---|---|
| SWE-Bench Verified | 80.2% | Solving real GitHub issues (software engineering) |
| BrowseComp | 76.3%* | Finding hard-to-locate information across the web |
| BFCL Multi-Turn | 76.8% | Calling functions/tools correctly in multi-step tasks |
| AIME 2025 | 86.3% | Competition-level mathematics |
| GPQA-Diamond | 85.2% | Graduate-level science questions |
| Multi-SWE-Bench | 51.3% | Multi-repo software engineering tasks |
| OpenCode | 76.1% | Code generation and understanding |
*BrowseComp score reported “with context management” — an important qualifier discussed below.
MiniMax also claims M2.5 completes SWE-Bench tasks in an average of 22.8 minutes, which is 37% faster than its predecessor M2.1 (31.3 minutes) and roughly on par with Claude Opus 4.6 (22.9 minutes).
How It Compares to Other Models
Coding (SWE-Bench Verified)
| Model | Score |
|---|---|
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| MiniMax M2.5 | 80.2% |
| GPT-5.2 | 80.0% |
| Gemini 3 Flash | 78.0% |
| Claude Sonnet 4.5 | 77.2% |
| Gemini 3 Pro | 76.2% |
| DeepSeek V3.2 | 73.0% |
M2.5 sits within 0.7 percentage points of the top-scoring Claude Opus 4.5 and ahead of GPT-5.2. Notably, DeepSeek V3.2 — the previous best open-source model — scores 73.0%, putting M2.5 about 7 points ahead in the open-weight category.

Web Browsing (BrowseComp)
BrowseComp is an OpenAI-created benchmark with 1,266 problems that require browsing multiple websites, reformulating queries, and synthesizing scattered information.
| Model | Score |
|---|---|
| MiniMax M2.5 | 76.3%* |
| Kimi K2 Thinking | 60.2% |
| GPT-5 | 54.9% |
| o4-mini | 51.5% |
| o3 | 49.7% |
*MiniMax’s score carries a “with context management” qualifier, which likely means an augmented evaluation setup. Direct comparisons to other models should be treated with caution.
Tool-Calling (BFCL)
The Berkeley Function Calling Leaderboard evaluates how well models call APIs and functions. MiniMax claims 76.8% on the multi-turn subset — significantly ahead of what they report for Claude Opus 4.6 (63.3%) and Gemini 3 Pro (61.0%). However, this score has not been independently confirmed on the official BFCL leaderboard.
Pricing and Speed
This is where M2.5 gets interesting regardless of benchmark debates.
| | M2.5 Standard | M2.5-Lightning |
|---|---|---|
| Input | $0.30/M tokens | $0.30/M tokens |
| Output | $1.20/M tokens | $2.40/M tokens |
| Speed | 50 tps | 100 tps |
| Approx. hourly cost | ~$0.30/hr | ~$1.00/hr |
For context, Claude Opus 4.6 charges around $75 per million output tokens. Even accounting for the fact that larger models may use fewer tokens per task, M2.5 is roughly 10-20x cheaper for equivalent workloads.
The “$1 per hour” figure comes from running M2.5-Lightning continuously at 100 tokens/sec: 360,000 output tokens/hour at $2.40/M equals approximately $0.86/hour (MiniMax rounds up). Both variants support prompt caching for additional savings.
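The cost arithmetic above is easy to reproduce. A minimal sketch, counting output tokens only (input tokens and prompt-cache discounts are ignored, which is why the Standard figure lands a bit under the ~$0.30/hr in the table):

```python
# Hourly cost of running a model continuously at a given output speed,
# using the published per-million-token output prices.
def hourly_cost(tokens_per_sec: float, price_per_million_output: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return tokens_per_hour / 1_000_000 * price_per_million_output

standard = hourly_cost(50, 1.20)    # M2.5 Standard
lightning = hourly_cost(100, 2.40)  # M2.5-Lightning

print(f"Standard:  ${standard:.2f}/hr")   # $0.22/hr on output tokens alone
print(f"Lightning: ${lightning:.2f}/hr")  # $0.86/hr, marketed as ~$1/hr
print(f"Lightning 24/7, 30 days: ${lightning * 24 * 30:.0f}")
```

The 30-day figure comes out around $620 on output pricing alone; the article's ~$720/month uses the rounded $1/hr rate.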
MiniMax describes this as making “infinite scaling of long-horizon agents economically possible” — and the math supports the claim. Running autonomous coding agents 24/7 on M2.5-Lightning would cost roughly $720/month, compared to tens of thousands for frontier proprietary models.
Technical Specifications
| Spec | Detail |
|---|---|
| Architecture | Mixture of Experts (MoE), Transformer-based |
| Total parameters | ~230 billion |
| Active parameters per token | ~10 billion |
| Number of experts | 256 (8 active per token) |
| Context window | 200,000 tokens |
| Training method | Reinforcement learning using proprietary “Forge” framework |
| RL algorithm | CISPO (Clipped IS-weight Policy Optimization) |
| Supported languages | Python, JavaScript, TypeScript, Java, C++, Go, Rust, C, Kotlin, PHP, Lua, Dart, Ruby |
The Mixture of Experts architecture is key to M2.5’s cost efficiency. With 230B total parameters but only 10B active per token, the model can maintain the knowledge capacity of a large model while keeping inference costs low.
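A back-of-envelope view of why this matters, using the figures from the spec table. Per-layer expert sizes aren't public, so this is a coarse whole-model ratio rather than an exact FLOPs count:

```python
# Only the routed experts' weights participate in each forward pass,
# so compute per token scales with active parameters, not total parameters.
total_params = 230e9   # ~230B total (spec table above)
active_params = 10e9   # ~10B active per token
experts_total, experts_active = 256, 8

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%} of total weights")  # ~4.3%
print(f"Experts routed:   {experts_active}/{experts_total} per token")
```

Roughly 4% of the weights do work on any given token, which is the core of the cost story: knowledge capacity of a 230B model, inference compute closer to a 10B model.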
MiniMax’s training approach is notable. Their “Forge” framework deploys models into live environments — real code repos, browsers, office apps, API endpoints — and optimizes based on actual task completion rather than synthetic benchmarks. They used over 200,000 real-world training environments and a tree-structured merging strategy that achieved roughly 40x training speedup.
The RL algorithm, CISPO, clips importance sampling weights rather than token updates (unlike PPO or GRPO), reportedly achieving comparable performance to DAPO in half the training steps.
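The distinction is easiest to see in code. The sketch below is schematic, not MiniMax's implementation, and the epsilon values are illustrative: PPO clips the probability ratio inside its surrogate objective, which zeroes the gradient for tokens outside the band, while CISPO clips the importance-sampling weight itself and applies it as a stop-gradient coefficient, so clipped tokens are down-weighted but still contribute gradient:

```python
# CISPO-style clipping: bound the importance-sampling weight, then use it
# as a constant coefficient on the policy-gradient (log-prob) term.
def cispo_weight(ratio: float, eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """Clipped IS weight; treated as detached (no gradient) during backprop."""
    return max(1 - eps_low, min(1 + eps_high, ratio))

# The per-token loss would then look like:
#   loss_t = -cispo_weight(ratio_t) * advantage_t * log_prob_t
# with cispo_weight detached from the computation graph.

print(cispo_weight(1.5))   # clipped down to the upper bound
print(cispo_weight(0.5))   # clipped up to the lower bound
print(cispo_weight(1.05))  # inside the band: passes through unchanged
```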
MiniMax Agent Platform
Alongside the model, MiniMax offers MiniMax Agent (agent.minimax.io) — a general-purpose AI agent platform powered by M2.5. It supports:
- Shell command execution
- Web browsing
- Python code interpreter
- MCP (Model Context Protocol) tool integration
- Multimodal input/output (text, images, voice, documents)
The platform includes an “Expert Builder” system where users create specialized AI agents using natural language instructions. Over 10,000 user-built experts currently exist on the platform.
MiniMax also offers a dedicated CodingPlan subscription at platform.minimax.io/subscribe/coding-plan for developers who want a coding-focused experience.
The API is available at platform.minimax.io.
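MiniMax's M-series models have typically been reachable through OpenAI-compatible chat endpoints; the endpoint URL and model identifier below are assumptions for illustration, so verify both against the platform.minimax.io docs before use:

```python
# Hypothetical request body for an OpenAI-compatible chat-completions
# endpoint. Model id "MiniMax-M2.5" and the endpoint URL in the comment
# below are assumed, not confirmed.
import json

def build_chat_request(prompt: str, model: str = "MiniMax-M2.5",
                       max_tokens: int = 512) -> dict:
    """Assemble the JSON body an OpenAI-compatible endpoint expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Summarize this repo's failing tests.")
print(json.dumps(payload, indent=2))
# POST this to e.g. https://api.minimax.io/v1/chat/completions (assumed URL)
# with an "Authorization: Bearer <key>" header.
```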
Open-Source Availability
M2.5 is available under a Modified MIT License — permissive, but with one condition: commercial users must “prominently display ‘MiniMax M2.5’ on the user interface” of products built with the model.
You can access the model weights on:
- HuggingFace: MiniMaxAI/MiniMax-M2.5
- GitHub: MiniMax-AI/MiniMax-M2.5
- Ollama: available as `minimax-m2.5` for local deployment
- ModelScope: for users in China
GGUF quantized versions are available for running on consumer hardware. Supported inference frameworks include vLLM, SGLang, Transformers, and KTransformers.
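Before downloading a quantized build, it helps to estimate whether it fits your hardware. A rough sketch using the 230B total-parameter count from the spec table; the bits-per-weight values are nominal, and real GGUF files run somewhat larger due to metadata and per-block scales:

```python
# Approximate weight-only memory footprint at common precisions.
# Activations, KV cache, and GGUF overhead are not included.
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    return total_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"{name:7s} ~{weights_gb(230e9, bits):.0f} GB")
```

Even at ~4.5 bits per weight the checkpoint is on the order of 130 GB, so "consumer hardware" here means high-RAM workstations with CPU offload rather than a single gaming GPU.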
Who Is MiniMax?
MiniMax is a Shanghai-based AI company founded in late 2021 by Yan Junjie, a former VP at SenseTime, along with co-founders Yang Bin and Zhou Yucong.
Key milestones:
- Funding: ~$850 million across 7 rounds, with investors including Alibaba, Tencent, Sequoia China, and Hillhouse Capital
- IPO: Listed on the Hong Kong Stock Exchange in January 2026 (ticker: 00100.HK), raising ~$710 million. The stock doubled on debut, valuing the company at roughly $6.5 billion
- Products: Hailuo AI (consumer platform for text, music, and video generation), video-01 (text-to-video), and the M-series language models
The company is one of China’s “AI Tigers” — a group of well-funded Chinese AI startups that also includes Zhipu AI, Moonshot AI, and Baichuan.
Should You Trust the Benchmarks?
This is the critical question. Here’s an honest assessment:
What independent evaluators found:
- OpenHands (independent SWE-Bench evaluation team) ranked M2.5 4th on their composite index, calling it “the first open model that has exceeded Claude Sonnet on recent tests.” They also found issues: the model occasionally pushed to wrong branches and forgot to format answers correctly.
- Artificial Analysis gave M2.5 a score of 42 on their Intelligence Index (a composite of 10 evaluations). This is well above average for open-weight models but significantly below top frontier models, suggesting M2.5 may be optimized for specific agentic benchmarks rather than general intelligence.
- Artificial Analysis measured actual API speed at ~80.8 tokens/sec, below the claimed 100 tps for Lightning.
Red flags to consider:
- All headline benchmark scores are self-reported by MiniMax. Independent replication under identical conditions is limited.
- SWE-Bench scores are highly scaffold-dependent — different evaluation harnesses can produce materially different results for the same model.
- The BrowseComp score of 76.3% carries a “with context management” qualifier that may indicate augmented evaluation rather than raw model capability.
- The BFCL multi-turn score of 76.8% could not be confirmed on the official Berkeley leaderboard.
- User reports on Hacker News include anecdotes of the model writing tests against fake data and declaring everything working.
What’s solidly confirmed:
- The pricing is real and independently verifiable — M2.5 is genuinely 10-20x cheaper than proprietary frontier models.
- The model weights are actually open-source and available for download.
- The 230B total / 10B active MoE architecture is confirmed.
- The 200K context window is confirmed across multiple sources.
Bottom Line
MiniMax M2.5 is a significant release for the open-source AI ecosystem. If its benchmark claims hold up under independent scrutiny, it’s the first open-weight model to genuinely compete with the best proprietary models on agentic coding tasks — and it does so at a dramatically lower price point.
The more conservative read: M2.5 is clearly a strong model that advances the state of the art for open-weight models, but some of its most impressive benchmark numbers carry qualifiers and haven’t been independently verified. The gap between targeted benchmark performance and general intelligence (as suggested by Artificial Analysis’s composite score) deserves attention.
Either way, the pricing changes the calculus. At $1/hour for 100 tokens/sec, running autonomous AI agents continuously becomes economically feasible in a way it simply wasn’t before. For developers building agentic applications, M2.5 is worth evaluating — just don’t take the headline benchmarks at face value without testing on your own workloads.