Qwen3.7-Max Review: Alibaba’s New Flagship Just Beat Claude and Gemini on These Benchmarks

Alibaba’s Qwen3.7-Max launched on May 20, 2026 at the Alibaba Cloud Summit in Hangzhou, and the numbers are loud. It scored 56.6 on the Artificial Analysis Intelligence Index at launch (good for #5 that week, currently sitting top 10 of 151 measured models), the highest-ranked Chinese AI model on that leaderboard to date. It carries a 1 million-token context window, costs $2.50 per 1M input tokens, and Alibaba’s internal testing reports a 35-hour autonomous coding run that fired 1,158 tool calls and hit a 10× speedup over the standard Triton reference.

So is it actually better than GPT-5.5, Claude Opus 4.7o Gemini 3.5? Short answer, on a price-per-intelligence basis, Qwen3.7-Max is now one of the strongest frontier models you can call through an API. Long answer, there’s a caveat in how it earns its hallucination score that you’ll want to know about before betting your workflow on it. We tested the released benchmarks, pulled the official pricing, and broke down what this drop actually means for the coding, agent, and AI tooling space heading into mid-2026.

Índice hide

What Is Qwen3.7-Max?

Benchmarks: How Qwen3.7-Max Stacks Up

Pricing: $2.50 Input, $7.50 Output, $0.25 Cached

$0.25 cached input is the killer feature

Where you can pay for it

The 35-Hour Autonomous Run, Explained

Where to Try Qwen3.7-Max

Verdict: Should You Switch?

FAQ

The Key Takeaways

Qwen3.7-Max scores 56.6 on the Artificial Analysis Intelligence Index, ranked top 10 globally and the highest placement for a Chinese model to date

1 million-token context window and a max output of 65,536 tokens, with extended thinking enabled by default

API pricing is $2.50 input / $7.50 output per 1M tokens, with cached input dropping to $0.25 per 1M (90% discount)

Demonstrated 35 hours of continuous autonomous coding, 1,158 tool calls, and a 10× geometric speedup vs the Triton reference kernel

Proprietary model (no open weights), available via Alibaba Cloud Model Studio, OpenRouter, and direct Claude Code integration through the Anthropic API protocol

What Is Qwen3.7-Max?

Qwen3.7-Max is Alibaba’s flagship proprietary reasoning model, designed as an “agent foundation” rather than a general chat assistant. It’s the successor to Qwen3.6 Max Preview, and the first Qwen model to break the 1M-token context barrier, jumping up from 256K. It uses extended chain-of-thought reasoning by default and is positioned to compete head-on with GPT-5.5, Claude Opus 4.7, y Gemini 3.5 Flash on tasks that need long-horizon autonomy.

Alongside Qwen3.7-Max, Alibaba quietly shipped Qwen3.7-Plus-Preview, a multimodal variant that adds vision input and runs at a lower price point. Plus is the budget pick for high-volume, routine workloads, while Max is the heavy hitter for reasoning, agentic coding, and document-scale context. Unlike many earlier Qwen releases, neither model is open-weight; both run only through Alibaba’s hosted API. In June 2026, Rio de Janeiro’s city government released Rio 3.5 Open 397B, an open-weight fine-tune that benchmarks itself against Qwen 3.7 Plus.

Benchmarks: How Qwen3.7-Max Stacks Up

The launch numbers are strong across the board. According to Artificial Analysis, Qwen3.7-Max sits between Gemini 3.5 Flash and Claude Opus 4.7 on the overall Intelligence Index, but it punches above its weight on a handful of demanding reasoning benchmarks.

Model	AA Intelligence Index	Input $/M	Output $/M	Context
GPT-5.5	60.2	$5.00	$30.00	1M
Claude Opus 4.7	57.3	$5.00	$25.00	1M
Qwen3.7-Max	56.6	$2.50	$7.50	1M
Gemini 3.5 Flash	55.3	$1.50	$9.00	1M
DeepSeek V4 Pro	52.0	$1.74	$3.48	1M

On individual benchmarks, Qwen3.7-Max posts 92.4 on GPQA Diamond (ahead of Claude Opus 4.6 Max at 91.3, behind GPT-5.5 at 93.6) and 97.1 on HMMT 2026 February, the highest score in its comparison group. It also lands 44.5 on Apex (well ahead of DeepSeek V4 Pro at 38.3), 80.4 on SWE-Verified for software engineering tasks, and 41.4 on Humanity’s Last Exam, narrowly beating Opus 4.6 Max at 40.0.

The honest caveat: Qwen3.7-Max’s low hallucination rate is partly an artifact of higher abstention. According to Officechai’s benchmark analysis, the model’s attempt rate fell to 48.0%, the lowest among comparable frontier models. In plain English, it refuses to answer more often, which lowers wrong answers but also lowers usefulness on edge cases. That trade-off matters if you’re plugging it into an agent that needs to push through ambiguity.

Pricing: $2.50 Input, $7.50 Output, $0.25 Cached

Qwen3.7-Max is one of the better-priced frontier models on the market right now. The API costs $2.50 per 1 million input tokens y $7.50 per 1 million output tokens, with a max output of 65,536 tokens per request. Cached input drops to $0.25 per 1M tokens, a 90% discount that makes repeated long-context calls cheap.

$0.25 cached input is the killer feature

For agent workflows that re-read the same codebase, document, or chat history across hundreds of turns, the cache discount turns Qwen3.7-Max into a pricing outlier. Claude Opus 4.7 charges $5 per 1M input tokens y $25 output, double the input and over triple the output rate of Qwen3.7-Max. GPT-5.5 is $5 input and $30 output, the most expensive of the bunch. Gemini 3.5 Flash ($1.50/$9) and DeepSeek V4 Pro ($1.74/$3.48) are cheaper but score lower on reasoning benchmarks. If your workload is “long context, lots of tool calls, occasional thinking,” Qwen3.7-Max sits in the sweet spot between frontier intelligence and frontier pricing.

Where you can pay for it

The API is live on Alibaba Cloud Model Studio, OpenRouter, y Together AI, with Qwen Chat (chat.qwen.ai) offering limited free preview access through the web. There is no permanent free tier, and the Plus-Preview variant is text-only on the API even though it accepts image input through the chat interface.

The 35-Hour Autonomous Run, Explained

The headline demo from launch is the 35-hour continuous autonomous coding run, and it deserves a closer look because most launch articles glossed over what actually happened. According to Alibaba’s internal testing (no independent verification has been published yet), Qwen3.7-Max was given an isolated server equipped with a Zhenwu M890 AI accelerator, a brand-new hardware architecture the model had never seen in training. The task was to optimize an attention kernel from scratch.

Over 35 straight hours, the model executed 1,158 distinct tool calls, ran 432 kernel evaluations, diagnosed compilation failures on its own, and iteratively rewrote its code. It ended with a 10× geometric mean speedup over the Triton reference implementation, per VentureBeat. That single demo isn’t proof of general agent intelligence, but it does show the model can hold context, recover from failure, and stay coherent across a length where most other reasoning models start hallucinating variables or looping.

For developers, the practical takeaway is this. If you’re running a coding agent or a research assistant overnight, Qwen3.7-Max is now one of the few models tested at this duration. Whether your wallet can keep up with the token spend is a separate question.

Where to Try Qwen3.7-Max

There are four paths to get hands-on, and your choice depends on whether you want a chat interface, an API, or an agent harness.

Qwen Chat (chat.qwen.ai) is the free preview entry point. It has a daily message limit and exposes both Qwen3.7-Max and Qwen3.7-Plus-Preview through a dropdown. Useful for kicking the tires before committing to API spend.

Alibaba Cloud Model Studio is the official API home with full pricing applied and the longest context limits. You’ll need an Alibaba Cloud account, which is more friction than other providers but enables prompt caching and direct support.

OpenRouter mirrors the model at the same price tier and is easier to sign up for, particularly if you’re already routing multiple models through a single key. Together AI also hosts it for fine-tuning-friendly workloads.

Claude Code is the one that surprised people. Qwen3.7-Max natively supports the Anthropic API protocol, which means you can point Claude Code, OpenClaw, or any other Anthropic-compatible harness at the Qwen endpoint and run it as a drop-in. If you’ve been priced out of Claude Code pricing but want the same workflow, this is the cheapest credible alternative you’ve had access to in months.

If you’d rather skip the API setup and just compare frontier models from one app, the Fello AI Mac and iOS app bundles Claude, ChatGPT, Gemini, Grok, and DeepSeek under one $9.99/month subscription, so you can A/B test prompts across providers without juggling separate billing accounts.

Verdict: Should You Switch?

Qwen3.7-Max is the most credible Chinese frontier model launched to date, and the pricing makes it the natural pick for cost-sensitive agentic and long-context work. If your bottleneck is GPT-5.5 or Claude Opus 4.7 API bills, this is the most direct way to cut spend without dropping into the “Gemini 3.5 Flash, but slightly worse at reasoning” tier.

You should switch if you’re building agents that run for hours, you’re handling million-token contexts, or you need the $0.25 cached input rate to make a use case profitable. You should not switch if your workload depends on full multimodal vision (use Plus-Preview or wait for Max-Vision), or if your application can’t tolerate the higher abstention rate on hard questions. For pure raw intelligence and instruction-following at the absolute frontier, GPT-5.5 still leads on the Intelligence Index, but the price gap is widening.

The bigger story is what this launch signals. Alibaba is now shipping a proprietary model that ties directly to its own custom AI accelerator (the Zhenwu M890) and rack-scale server (Panjiu AL128), in the same week, on the same stage. That’s the full-stack AI play that NVIDIA y OpenAI are also racing toward, and Alibaba is the first Chinese vendor to credibly join that race. Qwen3.7-Max is the model layer of that bet, and on the numbers we have, the bet is paying off.

For the wider picture on how Chinese AI labs caught up this fast, our deep dive on the Chinese AI race and our piece on Kimi K2.5 give you the prior context. If you’re shopping for best free coding models, Qwen3.7-Max isn’t free, but it’s worth comparing on benchmark-per-dollar. For the latest in that lineage, see our guide to Kimi K2.7 Code, Moonshot’s newest open-weight coding model.

Qwen isn’t the only Chinese frontier move this month. MiniMax previewed MiniMax M3, a sparse-attention LLM claiming 9.7× y 15.6× speedups at the 1M-token mark over M2.5, signaling that efficiency at long context, not just raw scale, is the new battleground for Chinese labs.

FAQ

Is Qwen3.7-Max open source?

No. Qwen3.7-Max is proprietary, and weights are not released. It’s available only through Alibaba Cloud Model Studio, OpenRouter, and Together AI. Earlier Qwen 3.6 family models remain open-weight on Hugging Face.

How much does Qwen3.7-Max cost?

The API costs $2.50 per 1M input tokens and $7.50 per 1M output tokens, with a 90% discount on cached input at $0.25 per 1M.

Does Qwen3.7-Max work with Claude Code?

Yes. The model natively supports the Anthropic API protocol, so you can point Claude Code, OpenClaw, or any Anthropic-compatible harness directly at the Qwen endpoint as a drop-in.

Is Qwen3.7-Max better than GPT-5.5?

On the Artificial Analysis Intelligence Index, GPT-5.5 still leads at 60.2 vs Qwen3.7-Max at 56.6. On price-per-intelligence and on long-horizon agentic coding, Qwen3.7-Max is the stronger pick.

What’s the difference between Qwen3.7-Max and Qwen3.7-Plus-Preview?

Max is the text-only reasoning flagship with the highest benchmarks and the largest context. Plus-Preview is multimodal (vision input) and priced for higher-volume, lower-stakes workloads.