
MiniMax M2.5: A New Chinese Open-Source Model That Claims to Beat Claude & GPT at Coding

A Chinese AI startup just dropped an open-source model with benchmark scores that rival Anthropic and OpenAI’s best — at a fraction of the cost. MiniMax M2.5 claims state-of-the-art performance in coding, web browsing, and tool-calling, while running at 100 tokens per second for roughly $1 per hour. If the numbers hold up, it could reshape the economics of AI agents.

Here’s everything we know so far.

What Is MiniMax M2.5?

MiniMax M2.5 is a large language model built specifically for agentic tasks — coding, web browsing, tool-calling, and multi-step autonomous workflows. It was released in February 2026 by Shanghai-based AI company MiniMax.

The model comes in two variants:

  • M2.5 Standard — 50 tokens/sec output speed
  • M2.5-Lightning — 100 tokens/sec output speed (higher output token cost)

MiniMax positions M2.5 as a “digital employee” designed for sustained, independent work rather than simple chat interactions. The company claims that within their own organization, 30% of tasks are completed autonomously by M2.5 and 80% of newly committed code is M2.5-generated.

Benchmark Performance

MiniMax reports the following scores:

Benchmark            Score    What It Measures
SWE-Bench Verified   80.2%    Solving real GitHub issues (software engineering)
BrowseComp           76.3%*   Finding hard-to-locate information across the web
BFCL Multi-Turn      76.8%    Calling functions/tools correctly in multi-step tasks
AIME 2025            86.3%    Competition-level mathematics
GPQA-Diamond         85.2%    Graduate-level science questions
Multi-SWE-Bench      51.3%    Multi-repo software engineering tasks
OpenCode             76.1%    Code generation and understanding

*BrowseComp score reported “with context management” — an important qualifier discussed below.

MiniMax also claims M2.5 completes SWE-Bench tasks in an average of 22.8 minutes, which is 37% faster than its predecessor M2.1 (31.3 minutes) and roughly on par with Claude Opus 4.6 (22.9 minutes).

How It Compares to Other Models

Coding (SWE-Bench Verified)

Model              Score
Claude Opus 4.5    80.9%
Claude Opus 4.6    80.8%
MiniMax M2.5       80.2%
GPT-5.2            80.0%
Gemini 3 Flash     78.0%
Claude Sonnet 4.5  77.2%
Gemini 3 Pro       76.2%
DeepSeek V3.2      73.0%

M2.5 sits within 0.7 percentage points of the top-scoring Claude Opus 4.5 and ahead of GPT-5.2. Notably, DeepSeek V3.2, the previous best open-source model, scores 73.0%, putting M2.5 about 7 points ahead in the open-weight category.

SWE-bench Verified scores over time: Anthropic, OpenAI, Google, and MiniMax in a tight race toward 80%+.

Web Browsing (BrowseComp)

BrowseComp is an OpenAI-created benchmark with 1,266 problems that require browsing multiple websites, reformulating queries, and synthesizing scattered information.

Model             Score
MiniMax M2.5      76.3%*
Kimi K2 Thinking  60.2%
GPT-5             54.9%
o4-mini           51.5%
o3                49.7%

*MiniMax’s score carries a “with context management” qualifier, which likely means an augmented evaluation setup. Direct comparisons to other models should be treated with caution.

Tool-Calling (BFCL)

The Berkeley Function Calling Leaderboard evaluates how well models call APIs and functions. MiniMax claims 76.8% on the multi-turn subset — significantly ahead of what they report for Claude Opus 4.6 (63.3%) and Gemini 3 Pro (61.0%). However, this score has not been independently confirmed on the official BFCL leaderboard.
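For readers unfamiliar with function calling: benchmarks of this kind hand the model a set of tool schemas and check whether it emits a well-formed, correctly-argued call at each turn rather than prose. A minimal illustrative exchange (this is a generic sketch, not BFCL's actual harness or format):

```python
# A tool schema of the kind function-calling benchmarks present to the model
# (illustrative only; not BFCL's actual format).
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A correct model response is a structured call, not free text:
model_call = {"name": "get_weather", "arguments": {"city": "Shanghai"}}

# The harness validates the call against the schema before executing the tool.
assert model_call["name"] == weather_tool["name"]
assert set(model_call["arguments"]) >= set(weather_tool["parameters"]["required"])
```

Multi-turn variants chain several such calls, feeding each tool's result back to the model, which is where models most often break down.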

Pricing and Speed

This is where M2.5 gets interesting regardless of benchmark debates.

                     M2.5 Standard    M2.5-Lightning
Input                $0.30/M tokens   $0.30/M tokens
Output               $1.20/M tokens   $2.40/M tokens
Speed                50 tps           100 tps
Approx. hourly cost  ~$0.30/hr        ~$1.00/hr

For context, Claude Opus 4.6 charges around $75 per million output tokens. Even accounting for the fact that larger models may use fewer tokens per task, M2.5 is roughly 10-20x cheaper for equivalent workloads.

The “$1 per hour” figure comes from running M2.5-Lightning continuously at 100 tokens/sec: 360,000 output tokens/hour at $2.40/M equals approximately $0.86/hour (MiniMax rounds up). Both variants support prompt caching for additional savings.
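The arithmetic is easy to sanity-check. A minimal sketch, using the prices and speeds from the table and counting output tokens only (it ignores input-token costs and caching discounts):

```python
def hourly_cost(tokens_per_sec: float, usd_per_million_output: float) -> float:
    """Cost of streaming output continuously for one hour, output tokens only."""
    tokens_per_hour = tokens_per_sec * 3600
    return tokens_per_hour / 1_000_000 * usd_per_million_output

lightning = hourly_cost(100, 2.40)  # M2.5-Lightning
standard = hourly_cost(50, 1.20)    # M2.5 Standard

print(f"Lightning: ${lightning:.3f}/hr, ~${lightning * 24 * 30:.0f}/month")
print(f"Standard:  ${standard:.3f}/hr")
```

The unrounded output-only figures come to about $0.86/hr (roughly $620/month) for Lightning and $0.22/hr for Standard; MiniMax's headline "$1 per hour" and the $720/month figure use the rounded hourly rate.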

MiniMax describes this as making “infinite scaling of long-horizon agents economically possible” — and the math supports the claim. Running autonomous coding agents 24/7 on M2.5-Lightning would cost roughly $720/month, compared to tens of thousands for frontier proprietary models.

Technical Specifications

Spec                         Detail
Architecture                 Mixture of Experts (MoE), Transformer-based
Total parameters             ~230 billion
Active parameters per token  ~10 billion
Number of experts            256 (8 active per token)
Context window               200,000 tokens
Training method              Reinforcement learning using proprietary "Forge" framework
RL algorithm                 CISPO (Clipped IS-weight Policy Optimization)
Supported languages          Python, JavaScript, TypeScript, Java, C++, Go, Rust, C, Kotlin, PHP, Lua, Dart, Ruby

The Mixture of Experts architecture is key to M2.5’s cost efficiency. With 230B total parameters but only 10B active per token, the model can maintain the knowledge capacity of a large model while keeping inference costs low.
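The cost mechanics of sparse MoE can be shown with a toy router: every token's hidden state is scored against all experts, but only the top-k highest-scoring ones (here 8 of 256, matching the reported config) actually execute. All names and numbers below are illustrative, not MiniMax's implementation:

```python
import random

NUM_EXPERTS = 256  # total experts (per the reported spec)
TOP_K = 8          # experts actually run per token

def route(scores: list, k: int = TOP_K) -> list:
    """Return indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in router logits
active = route(scores)

# Only k/NUM_EXPERTS of the expert parameters do work for this token,
# which is how ~230B total parameters yield only ~10B active per token.
print(f"active experts: {active}")
print(f"fraction of experts used per token: {TOP_K / NUM_EXPERTS:.3%}")
```

With 8 of 256 experts active, roughly 3% of the expert weights are touched per token, so per-token compute (and therefore serving cost) tracks the 10B active figure, not the 230B total.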

MiniMax’s training approach is notable. Their “Forge” framework deploys models into live environments — real code repos, browsers, office apps, API endpoints — and optimizes based on actual task completion rather than synthetic benchmarks. They used over 200,000 real-world training environments and a tree-structured merging strategy that achieved roughly 40x training speedup.

The RL algorithm, CISPO, clips importance sampling weights rather than token updates (unlike PPO or GRPO), reportedly achieving comparable performance to DAPO in half the training steps.
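The distinction can be sketched in a few lines. PPO-style objectives clip the per-token surrogate update, which can zero out a token's gradient entirely once the policy ratio leaves the trust region; a CISPO-style objective instead bounds the importance-sampling weight itself, so every token keeps contributing a (bounded) gradient. This is a simplified scalar illustration with made-up clip values, not the published objective:

```python
def ppo_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate: the update itself is clipped."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def cispo_term(ratio: float, advantage: float, eps_high: float = 2.0) -> float:
    """CISPO-style: clip the importance-sampling weight, not the update,
    so off-policy tokens still pass a bounded gradient signal."""
    clipped_ratio = min(ratio, eps_high)
    return clipped_ratio * advantage
```

For the exact objective, stop-gradient placement, and clip bounds, consult MiniMax's CISPO write-up; the point here is only where the clip is applied.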

MiniMax Agent Platform

Alongside the model, MiniMax offers MiniMax Agent (agent.minimax.io) — a general-purpose AI agent platform powered by M2.5. It supports:

  • Shell command execution
  • Web browsing
  • Python code interpreter
  • MCP (Model Context Protocol) tool integration
  • Multimodal input/output (text, images, voice, documents)

The platform includes an “Expert Builder” system where users create specialized AI agents using natural language instructions. Over 10,000 user-built experts currently exist on the platform.

MiniMax also offers a dedicated CodingPlan subscription at platform.minimax.io/subscribe/coding-plan for developers who want a coding-focused experience.

The API is available at platform.minimax.io.

Open-Source Availability

M2.5 is available under a Modified MIT License — permissive, but with one condition: commercial users must “prominently display ‘MiniMax M2.5’ on the user interface” of products built with the model.

The model weights are openly available for download.

GGUF quantized versions are available for running on consumer hardware. Supported inference frameworks include vLLM, SGLang, Transformers, and KTransformers.

Who Is MiniMax?

MiniMax is a Shanghai-based AI company founded in late 2021 by Yan Junjie, a former VP at SenseTime, along with co-founders Yang Bin and Zhou Yucong.

Key milestones:

  • Funding: ~$850 million across 7 rounds, with investors including Alibaba, Tencent, Sequoia China, and Hillhouse Capital
  • IPO: Listed on the Hong Kong Stock Exchange in January 2026 (ticker: 00100.HK), raising ~$710 million. The stock doubled on debut, valuing the company at roughly $6.5 billion
  • Products: Hailuo AI (consumer platform for text, music, and video generation), video-01 (text-to-video), and the M-series language models

The company is one of China’s “AI Tigers” — a group of well-funded Chinese AI startups that also includes Zhipu AI, Moonshot AI, and Baichuan.

Should You Trust the Benchmarks?

This is the critical question. Here’s an honest assessment:

What independent evaluators found:

  • OpenHands (independent SWE-Bench evaluation team) ranked M2.5 4th on their composite index, calling it “the first open model that has exceeded Claude Sonnet on recent tests.” They also found issues: the model occasionally pushed to wrong branches and forgot to format answers correctly.
  • Artificial Analysis gave M2.5 a score of 42 on their Intelligence Index (a composite of 10 evaluations). This is well above average for open-weight models but significantly below top frontier models, suggesting M2.5 may be optimized for specific agentic benchmarks rather than general intelligence.
  • Artificial Analysis measured actual API speed at ~80.8 tokens/sec, below the claimed 100 tps for Lightning.

Red flags to consider:

  1. All headline benchmark scores are self-reported by MiniMax. Independent replication under identical conditions is limited.
  2. SWE-Bench scores are highly scaffold-dependent — different evaluation harnesses can produce materially different results for the same model.
  3. The BrowseComp score of 76.3% carries a “with context management” qualifier that may indicate augmented evaluation rather than raw model capability.
  4. The BFCL multi-turn score of 76.8% could not be confirmed on the official Berkeley leaderboard.
  5. User reports on Hacker News include anecdotes of the model writing tests against fake data and declaring everything working.

What’s solidly confirmed:

  • The pricing is real and independently verifiable — M2.5 is genuinely 10-20x cheaper than proprietary frontier models.
  • The model weights are actually open-source and available for download.
  • The 230B total / 10B active MoE architecture is confirmed.
  • The 200K context window is confirmed across multiple sources.

Bottom Line

MiniMax M2.5 is a significant release for the open-source AI ecosystem. If its benchmark claims hold up under independent scrutiny, it’s the first open-weight model to genuinely compete with the best proprietary models on agentic coding tasks — and it does so at a dramatically lower price point.

The more conservative read: M2.5 is clearly a strong model that advances the state of the art for open-weight models, but some of its most impressive benchmark numbers carry qualifiers and haven’t been independently verified. The gap between targeted benchmark performance and general intelligence (as suggested by Artificial Analysis’s composite score) deserves attention.

Either way, the pricing changes the calculus. At $1/hour for 100 tokens/sec, running autonomous AI agents continuously becomes economically feasible in a way it simply wasn’t before. For developers building agentic applications, M2.5 is worth evaluating — just don’t take the headline benchmarks at face value without testing on your own workloads.
