A Chinese AI startup just dropped an open-source model with benchmark scores that rival Anthropic and OpenAI’s best — at a fraction of the cost. MiniMax M2.5 claims state-of-the-art performance in coding, web browsing, and tool-calling, while running at 100 tokens per second for roughly $1 per hour. If the numbers hold up, it could reshape the economics of AI agents.
Introducing M2.5, an open-source frontier model designed for real-world productivity.
— MiniMax (official) (@MiniMax_AI) February 12, 2026
– SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work.
– Optimized for efficient execution, 37% faster at complex…
Here’s everything we know so far.
What Is MiniMax M2.5?
MiniMax M2.5 is a large language model built specifically for agentic tasks — coding, web browsing, tool-calling, and multi-step autonomous workflows. It was released in February 2026 by Shanghai-based AI company MiniMax.
The model comes in two variants:
- M2.5 Standard — 50 tokens/sec output speed
- M2.5-Lightning — 100 tokens/sec output speed (higher output token cost)
MiniMax positions M2.5 as a “digital employee” designed for sustained, independent work rather than simple chat interactions. The company claims that within their own organization, 30% of tasks are completed autonomously by M2.5 and 80% of newly committed code is M2.5-generated.
Benchmark Performance
MiniMax reports the following scores:
| Benchmark | Score | What It Measures |
|---|---|---|
| SWE-Bench Verified | 80.2% | Solving real GitHub issues (software engineering) |
| BrowseComp | 76.3%* | Finding hard-to-locate information across the web |
| BFCL Multi-Turn | 76.8% | Calling functions/tools correctly in multi-step tasks |
| AIME 2025 | 86.3% | Competition-level mathematics |
| GPQA-Diamond | 85.2% | Graduate-level science questions |
| Multi-SWE-Bench | 51.3% | Multi-repo software engineering tasks |
| OpenCode | 76.1% | Code generation and understanding |
*BrowseComp score reported “with context management” — an important qualifier discussed below.
MiniMax also claims M2.5 completes SWE-Bench tasks in an average of 22.8 minutes, which is 37% faster than its predecessor M2.1 (31.3 minutes) and roughly on par with Claude Opus 4.6 (22.9 minutes).
How It Compares to Other Models
Coding (SWE-Bench Verified)
| Model | Score |
|---|---|
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| MiniMax M2.5 | 80.2% |
| GPT-5.2 | 80.0% |
| Gemini 3 Flash | 78.0% |
| Claude Sonnet 4.5 | 77.2% |
| Gemini 3 Pro | 76.2% |
| DeepSeek V3.2 | 73.0% |
M2.5 sits within 0.7 percentage points of the top-scoring Claude Opus 4.5 and ahead of GPT-5.2. Notably, DeepSeek V3.2 — the previous best open-source model — scores 73.0%, putting M2.5 about 7 points ahead in the open-weight category.

Web Browsing (BrowseComp)
BrowseComp is an OpenAI-created benchmark with 1,266 problems that require browsing multiple websites, reformulating queries, and synthesizing scattered information.
| Model | Score |
|---|---|
| MiniMax M2.5 | 76.3%* |
| Kimi K2 Thinking | 60.2% |
| GPT-5 | 54.9% |
| o4-mini | 51.5% |
| o3 | 49.7% |
*MiniMax’s score carries a “with context management” qualifier, which likely means an augmented evaluation setup. Direct comparisons to other models should be treated with caution.
Tool-Calling (BFCL)
The Berkeley Function Calling Leaderboard evaluates how well models call APIs and functions. MiniMax claims 76.8% on the multi-turn subset — significantly ahead of what they report for Claude Opus 4.6 (63.3%) and Gemini 3 Pro (61.0%). However, this score has not been independently confirmed on the official BFCL leaderboard.
Pricing and Speed
This is where M2.5 gets interesting regardless of benchmark debates.
| | M2.5 Standard | M2.5-Lightning |
|---|---|---|
| Input | $0.30/M tokens | $0.30/M tokens |
| Output | $1.20/M tokens | $2.40/M tokens |
| Speed | 50 tps | 100 tps |
| Approx. hourly cost | ~$0.30/hr | ~$1.00/hr |
For context, Claude Opus 4.6 charges around $75 per million output tokens. Even accounting for the fact that larger models may use fewer tokens per task, M2.5 is roughly 10-20x cheaper for equivalent workloads.
The “$1 per hour” figure comes from running M2.5-Lightning continuously at 100 tokens/sec: 360,000 output tokens/hour at $2.40/M equals approximately $0.86/hour (MiniMax rounds up). Both variants support prompt caching for additional savings.
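The cost arithmetic above is easy to reproduce. A minimal sketch, counting output tokens only (input tokens and prompt-cache discounts are ignored, which is why the Standard figure lands a bit under the ~$0.30/hr in the table):

```python
# Hourly cost of running a model continuously at a given output speed,
# using the published per-million-token output prices.
def hourly_cost(tokens_per_sec: float, price_per_million_output: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return tokens_per_hour / 1_000_000 * price_per_million_output

standard = hourly_cost(50, 1.20)    # M2.5 Standard
lightning = hourly_cost(100, 2.40)  # M2.5-Lightning

print(f"Standard:  ${standard:.2f}/hr")   # $0.22/hr on output tokens alone
print(f"Lightning: ${lightning:.2f}/hr")  # $0.86/hr, marketed as ~$1/hr
print(f"Lightning 24/7, 30 days: ${lightning * 24 * 30:.0f}")
```

The 30-day figure comes out around $620 on output pricing alone; the article's ~$720/month uses the rounded $1/hr rate.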
MiniMax describes this as making “infinite scaling of long-horizon agents economically possible” — and the math supports the claim. Running autonomous coding agents 24/7 on M2.5-Lightning would cost roughly $720/month, compared to tens of thousands for frontier proprietary models.
Technical Specifications
| Spec | Detail |
|---|---|
| Architecture | Mixture of Experts (MoE), Transformer-based |
| Total parameters | ~230 billion |
| Active parameters per token | ~10 billion |
| Number of experts | 256 (8 active per token) |
| Context window | 200,000 tokens |
| Training method | Reinforcement learning using proprietary “Forge” framework |
| RL algorithm | CISPO (Clipped IS-weight Policy Optimization) |
| Supported languages | Python, JavaScript, TypeScript, Java, C++, Go, Rust, C, Kotlin, PHP, Lua, Dart, Ruby |
The Mixture of Experts architecture is key to M2.5’s cost efficiency. With 230B total parameters but only 10B active per token, the model can maintain the knowledge capacity of a large model while keeping inference costs low.
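A back-of-envelope view of why this matters, using the figures from the spec table. Per-layer expert sizes aren't public, so this is a coarse whole-model ratio rather than an exact FLOPs count:

```python
# Only the routed experts' weights participate in each forward pass,
# so compute per token scales with active parameters, not total parameters.
total_params = 230e9   # ~230B total (spec table above)
active_params = 10e9   # ~10B active per token
experts_total, experts_active = 256, 8

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%} of total weights")  # ~4.3%
print(f"Experts routed:   {experts_active}/{experts_total} per token")
```

Roughly 4% of the weights do work on any given token, which is the core of the cost story: knowledge capacity of a 230B model, inference compute closer to a 10B model.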
MiniMax’s training approach is notable. Their “Forge” framework deploys models into live environments — real code repos, browsers, office apps, API endpoints — and optimizes based on actual task completion rather than synthetic benchmarks. They used over 200,000 real-world training environments and a tree-structured merging strategy that achieved roughly 40x training speedup.
The RL algorithm, CISPO, clips importance sampling weights rather than token updates (unlike PPO or GRPO), reportedly achieving comparable performance to DAPO in half the training steps.
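The distinction is easiest to see in code. The sketch below is schematic, not MiniMax's implementation, and the epsilon values are illustrative: PPO clips the probability ratio inside its surrogate objective, which zeroes the gradient for tokens outside the band, while CISPO clips the importance-sampling weight itself and applies it as a stop-gradient coefficient, so clipped tokens are down-weighted but still contribute gradient:

```python
# CISPO-style clipping: bound the importance-sampling weight, then use it
# as a constant coefficient on the policy-gradient (log-prob) term.
def cispo_weight(ratio: float, eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """Clipped IS weight; treated as detached (no gradient) during backprop."""
    return max(1 - eps_low, min(1 + eps_high, ratio))

# The per-token loss would then look like:
#   loss_t = -cispo_weight(ratio_t) * advantage_t * log_prob_t
# with cispo_weight detached from the computation graph.

print(cispo_weight(1.5))   # clipped down to the upper bound
print(cispo_weight(0.5))   # clipped up to the lower bound
print(cispo_weight(1.05))  # inside the band: passes through unchanged
```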
MiniMax Agent Platform
Alongside the model, MiniMax offers MiniMax Agent (agent.minimax.io) — a general-purpose AI agent platform powered by M2.5. It supports:
- Shell command execution
- Web browsing
- Python code interpreter
- MCP (Model Context Protocol) tool integration
- Multimodal input/output (text, images, voice, documents)
The platform includes an “Expert Builder” system where users create specialized AI agents using natural language instructions. Over 10,000 user-built experts currently exist on the platform.
MiniMax also offers a dedicated CodingPlan subscription at platform.minimax.io/subscribe/coding-plan for developers who want a coding-focused experience.
The API is available at platform.minimax.io.
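MiniMax's M-series models have typically been reachable through OpenAI-compatible chat endpoints; the endpoint URL and model identifier below are assumptions for illustration, so verify both against the platform.minimax.io docs before use:

```python
# Hypothetical request body for an OpenAI-compatible chat-completions
# endpoint. Model id "MiniMax-M2.5" and the endpoint URL in the comment
# below are assumed, not confirmed.
import json

def build_chat_request(prompt: str, model: str = "MiniMax-M2.5",
                       max_tokens: int = 512) -> dict:
    """Assemble the JSON body an OpenAI-compatible endpoint expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Summarize this repo's failing tests.")
print(json.dumps(payload, indent=2))
# POST this to e.g. https://api.minimax.io/v1/chat/completions (assumed URL)
# with an "Authorization: Bearer <key>" header.
```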
Open-Source Availability
M2.5 is available under a Modified MIT License — permissive, but with one condition: commercial users must “prominently display ‘MiniMax M2.5’ on the user interface” of products built with the model.
You can access the model weights on:
- HuggingFace: MiniMaxAI/MiniMax-M2.5
- GitHub: MiniMax-AI/MiniMax-M2.5
- Ollama: available as `minimax-m2.5` for local deployment
- ModelScope: for users in China
GGUF quantized versions are available for running on consumer hardware. Supported inference frameworks include vLLM, SGLang, Transformers, and KTransformers.
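Before downloading a quantized build, it helps to estimate whether it fits your hardware. A rough sketch using the 230B total-parameter count from the spec table; the bits-per-weight values are nominal, and real GGUF files run somewhat larger due to metadata and per-block scales:

```python
# Approximate weight-only memory footprint at common precisions.
# Activations, KV cache, and GGUF overhead are not included.
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    return total_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"{name:7s} ~{weights_gb(230e9, bits):.0f} GB")
```

Even at ~4.5 bits per weight the checkpoint is on the order of 130 GB, so "consumer hardware" here means high-RAM workstations with CPU offload rather than a single gaming GPU.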
Who Is MiniMax?
MiniMax is a Shanghai-based AI company founded in late 2021 by Yan Junjie, a former VP at SenseTime, along with co-founders Yang Bin and Zhou Yucong.
Key milestones:
- Funding: ~$850 million across 7 rounds, with investors including Alibaba, Tencent, Sequoia China, and Hillhouse Capital
- IPO: Listed on the Hong Kong Stock Exchange in January 2026 (ticker: 00100.HK), raising ~$710 million. The stock doubled on debut, valuing the company at roughly $6.5 billion
- Products: Hailuo AI (consumer platform for text, music, and video generation), video-01 (text-to-video), and the M-series language models
The company is one of China’s “AI Tigers” — a group of well-funded Chinese AI startups that also includes Zhipu AI, Moonshot AI, and Baichuan.
Should You Trust the Benchmarks?
This is the critical question. Here’s an honest assessment:
What independent evaluators found:
- OpenHands (independent SWE-Bench evaluation team) ranked M2.5 4th on their composite index, calling it “the first open model that has exceeded Claude Sonnet on recent tests.” They also found issues: the model occasionally pushed to wrong branches and forgot to format answers correctly.
- Artificial Analysis gave M2.5 a score of 42 on their Intelligence Index (a composite of 10 evaluations). This is well above average for open-weight models but significantly below top frontier models, suggesting M2.5 may be optimized for specific agentic benchmarks rather than general intelligence.
- Artificial Analysis measured actual API speed at ~80.8 tokens/sec, below the claimed 100 tps for Lightning.
Red flags to consider:
- All headline benchmark scores are self-reported by MiniMax. Independent replication under identical conditions is limited.
- SWE-Bench scores are highly scaffold-dependent — different evaluation harnesses can produce materially different results for the same model.
- The BrowseComp score of 76.3% carries a “with context management” qualifier that may indicate augmented evaluation rather than raw model capability.
- The BFCL multi-turn score of 76.8% could not be confirmed on the official Berkeley leaderboard.
- User reports on Hacker News include anecdotes of the model writing tests against fake data and declaring everything working.
What’s solidly confirmed:
- The pricing is real and independently verifiable — M2.5 is genuinely 10-20x cheaper than proprietary frontier models.
- The model weights are actually open-source and available for download.
- The 230B total / 10B active MoE architecture is confirmed.
- The 200K context window is confirmed across multiple sources.
Bottom Line
MiniMax M2.5 is a significant release for the open-source AI ecosystem. If its benchmark claims hold up under independent scrutiny, it’s the first open-weight model to genuinely compete with the best proprietary models on agentic coding tasks — and it does so at a dramatically lower price point.
The more conservative read: M2.5 is clearly a strong model that advances the state of the art for open-weight models, but some of its most impressive benchmark numbers carry qualifiers and haven’t been independently verified. The gap between targeted benchmark performance and general intelligence (as suggested by Artificial Analysis’s composite score) deserves attention.
Either way, the pricing changes the calculus. At $1/hour for 100 tokens/sec, running autonomous AI agents continuously becomes economically feasible in a way it simply wasn’t before. For developers building agentic applications, M2.5 is worth evaluating — just don’t take the headline benchmarks at face value without testing on your own workloads.