Best AI for Writing
Best AI for Writing: Claude Sonnet 4.6 ($3 / $15 per 1M tokens, top in hands-on writing tests)
The best AI for writing is Claude Sonnet 4.6, with GPT-5.5 as the alternative for structured business writing and Claude Opus 4.8 as the alternative for long-form work where every sentence matters. Sonnet 4.6 leads on writing style, voice fidelity, and instruction-following in our hands-on tests, and scores 1,643 Elo on GDPval-AA, Artificial Analysis’s benchmark for real professional deliverables across 44 occupations (a professional-task score, not a prose-only metric). GPT-5.5 Thinking, launched April 23, 2026, is more factually reliable than GPT-5.4 in OpenAI’s selected evaluation and now leads GDPval-AA overall, which makes it the safer default for fact-anchored writing like reports and briefs. Gemini 3.5 Flash (May 19) hit 1,656 GDPval-AA Elo, edging Sonnet 4.6 at lower cost, and is the price-performance pick for bulk content work. Claude Opus 4.8 is the call when you want chain-of-thought editing, long-form revision, or you want the model to push back on weak arguments.
| Model | Best For | Strength | Weakness | Price (per 1M tokens) |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | Style, voice, instruction-following | Top of Anthropic line for style in hands-on tests | More cautious than GPT on opinions | $3 / $15 |
| GPT-5.5 | Business writing, factual reports | Improved factual reliability vs GPT-5.4 (OpenAI eval) | Style less expressive than Sonnet | $5 / $30 |
| Gemini 3.5 Flash | Bulk content, drafts at scale | 1,656 GDPval-AA Elo, 40% cheaper than Pro | Weaker on hardest reasoning | $1.50 / $9.00 |
| Claude Opus 4.8 | Long-form, high-stakes copy | Best editor for argument structure | Among the most expensive options here | $5 / $25 |
| Grok 4.3 | Casual, opinionated, X-style | Native X grounding, fewer guardrails | Not the natural pick for formal copy | $1.25 / $2.50 |
Runner-up and alternatives: Gemini 3.5 Flash is the runner-up for sheer volume at near-Sonnet quality, and GPT-5.5 is the runner-up for factual accuracy. Claude Opus 4.8 is the splurge pick for long-form. Grok 4.3 is the niche pick when you want X-style voice or live web context inside the draft.
What changed this month: No major writing-specific launches in June 2026. Gemini 3.5 Flash (May 19) still hits GDPval-AA 1,656 Elo, just above Claude Sonnet 4.6 at 1,643 and just below GPT-5.4 at 1,671, at $1.50 / $9.00 per 1M tokens. GPT-5.5 still leads GDPval-AA overall and stays the default for structured knowledge work, with improved factual reliability over GPT-5.4 in OpenAI’s selected evaluation. Sonnet 4.6 still leads on style.
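To put the small GDPval-AA gaps quoted above in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head score. It assumes GDPval-AA ratings follow the conventional Elo logistic curve with a 400-point scale, which the article does not confirm; the function name `elo_expected_score` is illustrative only.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: a 400-point scale, as in conventional Elo. The benchmark's
    exact rating method may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# GDPval-AA Elo figures quoted in this section.
ratings = {
    "GPT-5.4": 1671,
    "Gemini 3.5 Flash": 1656,
    "Claude Sonnet 4.6": 1643,
}

flash_vs_sonnet = elo_expected_score(
    ratings["Gemini 3.5 Flash"], ratings["Claude Sonnet 4.6"]
)
sonnet_vs_flash = elo_expected_score(
    ratings["Claude Sonnet 4.6"], ratings["Gemini 3.5 Flash"]
)

print(f"Gemini 3.5 Flash vs Claude Sonnet 4.6: {flash_vs_sonnet:.3f}")
print(f"Claude Sonnet 4.6 vs Gemini 3.5 Flash: {sonnet_vs_flash:.3f}")
```

Under that assumption, a 13-point gap works out to an expected score of roughly 0.52 for the higher-rated model, which is close to a coin flip. That is consistent with treating the two models as near-equals on this benchmark and choosing between them on style and price.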
Best AI for Chat / Daily Assistant
Best AI for Chat & Daily Assistant: GPT-5.5 Instant ($20/month ChatGPT Plus, 52.5% fewer hallucinated claims)
The best AI for everyday chat and daily assistant work is GPT-5.5 Instant, ChatGPT’s default model, with Claude Opus 4.8 as the alternative when you want a more thoughtful tone and Gemini 3.5 Flash as the budget alternative inside the free Gemini app. On high-stakes prompts, OpenAI reports GPT-5.5 Instant produces 52.5% fewer hallucinated claims than GPT-5.3 Instant, and 37.3% fewer inaccurate claims on conversations users had flagged for factual errors, on top of faster response times and a refreshed memory system that make it the most reliable default for general-purpose tasks. It is available inside ChatGPT (free with limits, Plus at $20/month, Pro at $100/month for roughly 5x Plus usage or $200/month for roughly 20x Plus usage), through the API as the gpt-5.5 model at $5 / $30 per 1M tokens, and bundled inside Fello AI alongside Claude, Gemini, Grok, and DeepSeek.
Claude Opus 4.8 is the better pick when you want a model that pushes back on weak prompts and reasons more carefully through ambiguous questions; Gemini 3.5 Flash is the better pick when you are running everything through the free Gemini app or care about speed.
| Model | Best For | Strength | Weakness | Price |
| --- | --- | --- | --- | --- |
| GPT-5.5 Instant | Everyday chat, default assistant | 52.5% fewer hallucinated claims vs 5.3 Instant | Less expressive than Sonnet 4.6 | $20/mo Plus; separate gpt-5.5 API at $5 / $30 |
| Claude Opus 4.8 | Thoughtful, nuanced answers | Strong reasoning, pushes back well | $25 per 1M output tokens is among the priciest API rates here | $20/mo Pro, $5 / $25 API |
| Gemini 3.5 Flash | Fast, free, multimodal | Free in Gemini app, 1M context | Weaker on hardest reasoning | Free / $1.50 input / $9.00 output per 1M API |
| Grok 4.3 | Live news, X integration | Real-time X & web grounding | Smaller ecosystem | $30/mo SuperGrok |
| Fello AI | All five models, one app | $9.99/mo for ChatGPT + Claude + Gemini + Grok + DeepSeek | Routed via app, not direct | $9.99/mo |
Runner-up and alternatives: Claude Opus 4.8 is the runner-up for thoughtful daily use, Gemini 3.5 Flash is the runner-up for fast/free, and Grok 4.3 is the niche pick for live-news heavy days. Fello AI is the natural pick if you want all five top models in one Mac/iOS app for $9.99/month instead of juggling subscriptions.
What changed this month: GPT-5.5 Instant stayed the default for chat with no June regressions. Gemini 3.5 Flash (May 19) made the free Gemini app meaningfully faster and now slightly outscores Sonnet 4.6 on GDPval-AA (1,656 vs 1,643 Elo) at zero cost in the consumer app. Claude Opus 4.8 holds the #1 spot on the Artificial Analysis Intelligence Index at 61, ahead of GPT-5.5.
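For readers weighing the $20/month ChatGPT Plus plan against the separately billed gpt-5.5 API at $5 / $30 per 1M tokens, the sketch below estimates how many API tokens the same $20 would buy. The 25% output share is a hypothetical traffic mix, not a figure from this article, and `tokens_for_budget` is an illustrative helper rather than part of any vendor SDK.

```python
def tokens_for_budget(
    budget_usd: float,
    price_in: float,
    price_out: float,
    output_share: float,
) -> float:
    """Total tokens a budget buys at per-1M-token prices.

    output_share is the fraction of all tokens that are output tokens.
    """
    blended_price_per_million = (
        (1.0 - output_share) * price_in + output_share * price_out
    )
    return budget_usd / blended_price_per_million * 1_000_000


PLUS_MONTHLY_USD = 20.0                    # ChatGPT Plus price quoted above
GPT55_PRICE_IN, GPT55_PRICE_OUT = 5.0, 30.0  # gpt-5.5 API, USD per 1M tokens
OUTPUT_SHARE = 0.25                        # hypothetical: 1 output token per 3 input tokens

break_even_tokens = tokens_for_budget(
    PLUS_MONTHLY_USD, GPT55_PRICE_IN, GPT55_PRICE_OUT, OUTPUT_SHARE
)
print(f"$20 of gpt-5.5 API usage buys about {break_even_tokens:,.0f} tokens")
```

With that mix, $20 covers roughly 1.78 million tokens per month. Subscription limits are not metered in tokens, so this is only a rough way to judge whether light API use or a flat plan is likely to be cheaper for a given workload.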
Best AI for Images
Best AI for Images: ChatGPT Images 2.0 (included in ChatGPT Plus, leader on readable text)
The best AI for image generation is ChatGPT Images 2.0, with Google Nano Banana Pro (Gemini 3 Pro Image) as the alternative for photorealism, Reve 2.0 as the new layout-and-typography alternative, and Midjourney v8 as the alternative for stylized art. ChatGPT Images 2.0 (April 21, 2026) leads on text rendering, multilingual scripts, and infographic-style output, which makes it the natural pick when your image needs to contain words. Google’s Nano Banana Pro (Gemini 3 Pro Image, with the lower-cost Nano Banana 2 / Gemini 3.1 Flash Image as its sibling) is the natural pick for photoreal portraits and product shots, priced around $0.134 per 1K/2K image and $0.24 per 4K image. Reve 2.0 (June 3) jumped to #2 on the Arena text-to-image leaderboard with native 4K output and editing that preserves typography. Midjourney v8 stays the niche choice for distinctive style.
| Model | Best For | Strength | Weakness | Price |
| --- | --- | --- | --- | --- |
| ChatGPT Images 2.0 | Images with readable text | Best multilingual text rendering | Less photoreal than Nano Banana | Included in ChatGPT Plus |
| Nano Banana Pro (Gemini 3 Pro Image) | Photoreal portraits, products | Photorealism, ~$0.134 per 1K/2K image | Style less distinctive | Gemini app / AI Studio |
| Reve 2.0 | Layout, typography, native 4K | #2 Arena, 16MP output, layout editing | New, smaller ecosystem | Free / from $7.99/mo |
| Midjourney v8 | Stylized art, illustration | Aesthetic baseline most artists like | Weaker on text in image | $10-$120/mo |
| Grok Imagine | NSFW / Spicy Mode | Most permissive guardrails | Smallest underlying model | $30/mo SuperGrok |
| MAI-Image-2.5 | Microsoft ecosystem | #3 text-to-image leaderboard, native in Copilot | Just launched, US-first | Included in Copilot |
Runner-up and alternatives: Nano Banana Pro is the runner-up overall and the leader for photoreal work; Reve 2.0 is the new runner-up for layout and typography; Midjourney v8 is the niche pick for art-direction-heavy use. Grok Imagine is the only major model that allows Spicy Mode adult content.
What changed this month: Reve 2.0 (June 3) launched at #2 on the Arena text-to-image leaderboard with native 4K rendering and layout-based editing. Microsoft’s MAI-Image-2.5 (June 2) arrived at #3 on the text-to-image leaderboard, native in Copilot. ChatGPT Images 2.0 still leads on text-in-image.
Best AI for Video
Best AI for Video: Google Veo 3.1 (Gemini App / AI Studio, Sora 2 consumer app retired April 26, 2026)
The best AI for video generation is Google Veo 3.1, with Kling 3.5 as the alternative for fast iteration and Runway Gen-4 as the alternative for cinematic motion control. OpenAI retired the Sora 2 consumer web and app experience on April 26, 2026 (the Sora 2 API remains available to developers until September 24, 2026), so OpenAI no longer ranks in this consumer category. Veo 3.1 is available inside the Gemini app, Google AI Studio, and via Vertex AI, with native audio generation, 1080p output, and the strongest physics consistency in the current lineup. Kling 3.5 stays the speed pick at lower cost; Runway Gen-4 is the choice when you need precise camera control. Pika 2.0 and Luma Ray 3 remain credible alternatives for shorter clips.
| Model | Best For | Strength | Weakness | Price |
| --- | --- | --- | --- | --- |
| Google Veo 3.1 | Highest-fidelity AI video + audio | 1080p, native audio, physics consistency | Compute-heavy, slower | Gemini AI Pro / Ultra |
| Kling 3.5 | Fast iteration | Quick turnaround, strong motion | Less stable on long shots | From $10/mo |
| Runway Gen-4 | Cinematic control | Best-in-class camera/motion control | Pricing premium | Free / $12/mo billed annually, or $15 monthly |
| Pika 2.0 | Short clips, social | Cheap, fast, easy UX | Lower max resolution | From $10/mo |
| Luma Ray 3 | Photoreal scenes | Strong realism for landscapes | Smaller community | Free / from $9.99/mo |
Runner-up and alternatives: Kling 3.5 is the runner-up overall and the cost-conscious pick; Runway Gen-4 is the runner-up for filmmakers and ad teams. Sora 2’s consumer app is retired; only the developer API remains, through September 24, 2026.
What changed this month: No major video launches in June 2026, so Veo 3.1 stays uncontested at the top of the still-supported video models. Google is widely expected to refresh Veo at its next AI event; we will update this section when that happens.
Best AI for Coding
Best AI for Coding: Claude Opus 4.8 vs GPT-5.5 ($5 / $25 vs $5 / $30 per 1M tokens)
The best AI for coding is Claude Opus 4.8, with GPT-5.5 as the proprietary alternative, Gemini 3.5 Flash as the price-performance pick for agent-style coding, Qwen 3.7 Max as the mid-tier value pick, MiniMax M3 as the new open-weight frontier pick, and Microsoft’s MAI-Code-1-Flash as the new budget pick. Claude Opus 4.8 holds Anthropic’s top SWE-bench Verified score and remains the favourite inside Claude Code and Cursor. GPT-5.5 (April 23) is right behind, with OpenAI reporting 58.6% on SWE-Bench Pro and a state-of-the-art 82.7% on Terminal-Bench 2.0, and it leads on FrontierMath.
Gemini 3.5 Flash (May 19) hit 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas at $1.50 / $9.00 per 1M tokens, making it the strongest price-performance option for agent workflows. MiniMax M3 (June 1) posts 59% on SWE-Bench Pro and 66% on Terminal-Bench 2.1 at roughly $0.60 per million input tokens (it launched at a $0.30 promo), the cheapest frontier-class coder you can self-host. Microsoft’s MAI-Code-1-Flash (June 2) beats Claude Haiku 4.5 on SWE-Bench Verified (71.6 vs 66.6) while using up to 60% fewer tokens, rolling out now inside VS Code and the GitHub Copilot CLI. If you want to self-host, Kimi K2.6 (Modified MIT, 58.6% SWE-Bench Pro) and GLM-5.1 (MIT, 58.4% SWE-Bench Pro, built for 8-hour autonomous runs) are the strongest open-weight coders with clear commercial licenses.
| Model | Best For | Strength | Weakness | Price (per 1M tokens) |
| --- | --- | --- | --- | --- |
| Claude Opus 4.8 | Long-running agentic coding | Anthropic-leading SWE-bench, adaptive thinking | Among the most expensive | $5 / $25 |
| GPT-5.5 | Frontier proprietary alternative | 58.6% SWE-Bench Pro, 82.7% Terminal-Bench 2.0 | Less agent-tuned than Opus | $5 / $30 |
| Gemini 3.5 Flash | Agent coding at scale | 76.2% Terminal-Bench, 83.6% MCP Atlas | Weaker on hardest reasoning | $1.50 / $9.00 |
| MiniMax M3 | Cheap frontier-class, self-host | 59% SWE-Bench Pro, 1M context, multimodal | Weights/license still rolling out | ~$0.60 input |
| MAI-Code-1-Flash | Budget IDE coding | 71.6 SWE-Verified, 60% fewer tokens | Tiny model, VS Code-first | Pricing TBD |
| DeepSeek V4-Flash | Cheap open-weight coding | MIT, 1M context, II 47 | Below V4-Pro on hardest tasks | $0.14 / $0.28 |
Runner-up and alternatives: GPT-5.5 is the proprietary runner-up; Gemini 3.5 Flash is the runner-up for price-performance; Qwen 3.7 Max is the runner-up for mid-tier value; MiniMax M3 and DeepSeek V4 are the runners-up for open-weight self-hosters. Inside IDEs, Cursor + Claude Opus 4.8 is the most popular pairing and Claude Code is the natural pick if you live in the terminal.
What changed this month: MiniMax M3 (June 1) shipped frontier-class coding as an open-weight model at roughly $0.60 per million input tokens (launch promo $0.30). Microsoft’s MAI-Code-1-Flash (June 2) beat Claude Haiku 4.5 on SWE-Verified while using up to 60% fewer tokens. The May leaders hold at the top: Gemini 3.5 Flash for cheap agent coding and Qwen 3.7 Max at 80.4 SWE-Verified, with Claude Opus 4.8 and GPT-5.5 still the picks when budget is not the constraint. Open-weight coding now has a deep bench: MiniMax M3 (II 55), Kimi K2.6, and GLM-5.1 all post roughly 58-59% on SWE-Bench Pro under permissive or commercial licenses.
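The per-1M-token prices in the coding table are easier to compare against a concrete workload. The sketch below prices one hypothetical agent run (10M input tokens, 2M output tokens, both invented for illustration) across the models whose input and output rates are both listed above. It ignores caching discounts, tiered pricing, and differences in how many tokens each model needs for the same task, so treat it as a rough ordering rather than a bill.

```python
# (input, output) list prices in USD per 1M tokens, taken from the table above.
PRICES_PER_MILLION = {
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "DeepSeek V4-Flash": (0.14, 0.28),
}


def workload_cost_usd(
    input_tokens: int, output_tokens: int, price_in: float, price_out: float
) -> float:
    """Cost of a workload given per-1M-token input and output prices."""
    return (
        input_tokens / 1_000_000 * price_in
        + output_tokens / 1_000_000 * price_out
    )


# Hypothetical workload: not a measurement from this article.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

costs = {
    model: workload_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, price_in, price_out)
    for model, (price_in, price_out) in PRICES_PER_MILLION.items()
}

# Print cheapest first.
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:<18} ${cost:,.2f}")
```

For this mix the spread is wide: about $2 on DeepSeek V4-Flash, $33 on Gemini 3.5 Flash, $100 on Claude Opus 4.8, and $110 on GPT-5.5. A more output-heavy workload widens the gap between Opus and GPT-5.5, since they differ only on output price.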
Best AI for Creativity
Best AI for Creativity: Grok 4.3 (xAI, $30/month SuperGrok, fewer guardrails)
The best AI for creative writing, brainstorming, and unfiltered ideation is Grok 4.3, with Claude Opus 4.8 as the alternative for structured creative work and Gemini 3.1 Pro as the alternative for multimodal creative tasks. Grok 4.3 (April 30, 2026) has the most permissive guardrails of any frontier model and the strongest native X integration, which makes it the natural pick for opinionated, on-trend, real-time creative work. Claude Opus 4.8 is the better pick when you want a model that holds a long creative thread, edits its own drafts, and engages with the substance of your work. Gemini 3.1 Pro is the better pick when your creative project mixes text with images, video, and live web context.
| Model | Best For | Strength | Weakness | Price |
| --- | --- | --- | --- | --- |
| Grok 4.3 | Unfiltered, opinionated, on-trend | Fewest guardrails, X integration | Less polished for structured work | $30/mo SuperGrok |
| Claude Opus 4.8 | Long-form structured creativity | Holds long threads, self-edits | Most cautious of the four | $20/mo Pro, $5 / $25 API |
| Gemini 3.1 Pro | Multimodal creative | Strong text + image + video chain | Quotas inside Gemini app | Free / $2.00-$4.00 API input |
| ChatGPT-5.5 | Mainstream creative writing | Best at hitting briefs | Heavier guardrails | $20/mo Plus, $5 / $30 API |
| Grok Imagine (Spicy Mode) | NSFW / adult creative | Most permissive image generation | Niche use case | $30/mo SuperGrok |
Runner-up and alternatives: Claude Opus 4.8 is the runner-up overall and the right pick for projects that need to hold together across many turns. Gemini 3.1 Pro is the multimodal runner-up. For adult creative work, Grok Imagine Spicy Mode is the only frontier-grade option.
What changed this month: No major creativity-specific launches in June 2026. Grok 4.3 stayed the category leader. On the multimodal side, the new Reve 2.0 image model (June 3) is a strong addition for creative workflows that need precise layout and typography control.
Best AI for Accuracy
Best AI for Accuracy: Gemini 3.1 Pro (94.3% GPQA Diamond, 44.4% Humanity’s Last Exam, 77.1% ARC-AGI-2)
The best AI for accuracy and research is Gemini 3.1 Pro, with Qwen 3.7 Max as the value alternative and GPT-5.5 Pro as the alternative for hallucination-sensitive work. Gemini 3.1 Pro leads the hardest pure-reasoning tests at 94.3% on GPQA Diamond, 44.4% on Humanity’s Last Exam, and 77.1% on ARC-AGI-2, with native Google Search grounding for live factual answers. Qwen 3.7 Max (May 20) entered the top tier at 92.4 on GPQA Diamond, tied with Claude Opus 4.8, at half the API cost.
GPT-5.5 Pro (April 23) carries GPT-5.5’s factual-reliability gains over GPT-5.4 (claims 23% more likely to be factually correct on OpenAI’s flagged-conversation set), which makes it the right pick when factual reliability matters more than raw benchmark depth. Gemini 3.5 Flash (May 19) outscores Gemini 3.1 Pro on coding and agent benchmarks but trails Pro on these accuracy tests (HLE 40.2% vs 44.4%, ARC-AGI-2 72.1% vs 77.1%), so Pro stays the accuracy pick.
| Model | Best For | Key Benchmark | Weakness | Price |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | Hardest reasoning + research | 94.3% GPQA, 44.4% HLE, 77.1% ARC-AGI-2 | API quotas in app | $2.00-$4.00 / $12.00-$18.00 (tiered) |
| Qwen 3.7 Max | Frontier accuracy at value pricing | 92.4 GPQA Diamond | API-only, no chat front-end | $1.25 / $3.75 promo through June 22; $2.50 / $7.50 list |
| GPT-5.5 Pro | Hallucination-sensitive work | Improved factual reliability vs GPT-5.4 (OpenAI eval) | Pricier API tier | $100/mo ChatGPT Pro |
| Claude Opus 4.8 | Long-form factual writing | #1 Intelligence Index (61) | Slower on hardest math | $5 / $25 |
| Grok 4.3 | Live web facts | Native real-time grounding | Smaller benchmark coverage | $30/mo SuperGrok |
Runner-up and alternatives: Qwen 3.7 Max is the runner-up and the value pick at the frontier. GPT-5.5 Pro is the runner-up for hallucination-sensitive work. Claude Opus 4.8 is the runner-up for long-form factual writing.
What changed this month: No new accuracy leaders shipped in June, so Gemini 3.1 Pro holds the top of the category. The one to watch is Gemini 3.5 Pro, which Google says is in internal use and rolling out this month; its specs are not yet public, but it could reset this ranking the moment it reaches general availability.
Best AI for Problem Solving
Best AI for Problem Solving: GPT-5.5 Pro & Qwen 3.7 Max (39.6% FrontierMath Tier 4, 97.1 HMMT 2026 Feb)
The best AI for hard problem-solving is GPT-5.5 Pro for FrontierMath-style abstract math and Qwen 3.7 Max for competition math, with Claude Opus 4.8 as the alternative for long agentic reasoning chains. GPT-5.5 Pro still leads at 39.6% on FrontierMath Tier 4 (roughly 1.7 times Claude Opus 4.8’s 22.9%), which makes it the right pick when you need step-by-step working on the hardest math and physics problems. Qwen 3.7 Max (May 20) hit 97.1 on HMMT 2026 February, the highest score in its comparison group, and 44.5 on Apex, which makes it the right pick for competition-style problem-solving at half the cost of GPT-5.5 Pro.
Claude Opus 4.7 (April 16) introduced task budgets, a primitive for guiding agentic token spend on long chains; Claude Opus 4.8 (May 28) instead uses adaptive thinking controlled by an effort parameter, and does not support extended-thinking budgets. Gemini 3.5 Flash trades raw reasoning depth for speed and price; for the hardest problems, Gemini 3.1 Pro and the Thinking variants still lead.
| Model | Best For | Key Benchmark | Weakness | Price |
| --- | --- | --- | --- | --- |
| GPT-5.5 Pro | Abstract math, physics | 39.6% FrontierMath Tier 4 | Highest cost tier | $100/mo ChatGPT Pro |
| Qwen 3.7 Max | Competition math | 97.1 HMMT 2026 Feb, 44.5 Apex | API-only | $1.25 / $3.75 promo through June 22; $2.50 / $7.50 list |
| Claude Opus 4.8 | Long agentic reasoning | Adaptive thinking, effort control, #1 Intelligence Index | Slower on math | $5 / $25 |
| Gemini 3.1 Pro | Multimodal reasoning + research | 94.3 GPQA, 77.1 ARC-AGI-2 | API quotas | $2.00-$4.00 / $12.00-$18.00 (tiered) |
| DeepSeek V4-Flash | Open-weight problem solving | MIT, 1M context, II 47 | Below V4-Pro on hardest | $0.14 / $0.28 |
Runner-up and alternatives: Claude Opus 4.8 is the runner-up overall and the natural pick for agentic, long-chain problem-solving. Gemini 3.1 Pro is the multimodal runner-up. DeepSeek V4-Flash is the open-weight runner-up.
What changed this month: No new problem-solving leaders shipped in June. GPT-5.5 Pro still leads FrontierMath Tier 4 at 39.6% and Qwen 3.7 Max still leads competition math at 97.1 HMMT 2026 February. Reasoning depth remains Pro/Thinking territory; the open-weight MiniMax M3 (June 1) is a strong cheaper option for agentic chains.
Best AI Agents
Best AI Agent: Gemini Spark vs Claude Cowork ($100/month Ultra vs $20/month Pro)
The best AI agent right now is Gemini Spark for 24/7 cloud-resident work and Claude Cowork for desktop-resident work, with ChatGPT Codex as the alternative for coding agents and OpenAI Operator-class browser agents as the alternative for web tasks. AI agents are the fastest-moving category of 2026: each top vendor now ships an agent product, and the practical choice is between agents that live in the cloud (run while your laptop is closed) and agents that live on your desktop (drive your apps directly).
Gemini Spark launched at Google I/O on May 19, 2026 and is the first 24/7 cloud agent. Claude Cowork launched in general availability on April 9, 2026 and runs as a desktop agent that drives your local apps. ChatGPT Codex Mobile (May 14) is the pick for coding-agent work, now usable from iOS and Android. Read the full Gemini Spark vs Claude Cowork comparison.
| Agent | Best For | Where It Runs | Strength | Price |
| --- | --- | --- | --- | --- |
| Gemini Spark | 24/7 cloud tasks, Workspace workflows | Google Cloud VM (always-on) | First true 24/7 agent, deep Workspace integration | $100/mo Google AI Ultra |
| Claude Cowork | Desktop, app-driving, design + code | Your Mac/Windows desktop | Drives local apps, sees your screen | $20/mo Claude Pro |
| ChatGPT Codex Mobile | Coding agent on phone | OpenAI cloud + iOS/Android | Approve diffs and redirect work from phone | Included in ChatGPT plans |
| Grok Agentic (Grok 4.3) | Real-time research, X scraping | xAI cloud | Native X integration | $30/mo SuperGrok |
| OpenAI Operator-class | Browser tasks, web forms | OpenAI cloud + your browser | Web automation | ChatGPT Pro |
Runner-up and alternatives: Claude Cowork is the runner-up overall and the natural pick when you want the agent on your machine driving your apps. ChatGPT Codex Mobile is the runner-up for coding agents. Grok Agentic is the niche pick for real-time research.
What changed this month: No new consumer agents shipped in June, so the Gemini Spark (cloud) vs Claude Cowork (desktop) choice still drives most agent decisions for individual users. The new open models, NVIDIA Nemotron 3 Ultra and MiniMax M3, both ship strong agentic benchmarks, which matters for teams building their own agents on open weights.