Best AI for Writing
Best AI for Writing: Claude Sonnet 5 ($2 / $10 introductory, new GDPval-AA leader)
The best AI for writing is Claude Sonnet 5, which Anthropic launched on June 30, 2026 as its new default model. GPT-5.5 is the alternative for fact-anchored business writing, and Claude Opus 4.8 is the alternative for long-form work where every sentence matters. Sonnet 5 gains roughly 223 GDPval-AA Elo over Sonnet 4.6 (which scored 1,643), reaching about 1,866 and taking the top spot on Artificial Analysis’s professional-deliverables benchmark ahead of both Opus 4.8 and GPT-5.5. In our hands-on tests it also keeps the Sonnet line’s lead on writing style, voice fidelity, and instruction-following.
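To put the 223-point gap in perspective, the sketch below converts an Elo difference into an expected head-to-head preference rate. It assumes GDPval-AA uses the conventional logistic Elo scale (base 10, 400-point divisor), which this article does not confirm, so treat the result as illustrative only. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula.

    Assumption: the benchmark uses the conventional 400-point logistic
    scale. This is not confirmed for GDPval-AA.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


sonnet_46 = 1643              # GDPval-AA Elo quoted above
sonnet_5 = sonnet_46 + 223    # roughly 1,866, as quoted above

rate = expected_win_rate(sonnet_5, sonnet_46)
print(f"Sonnet 5 Elo: {sonnet_5}")
print(f"Expected preference rate vs Sonnet 4.6: {rate:.2f}")
```

Under that assumption, a gap of this size means Sonnet 5's output would be preferred in a little under four out of five head-to-head comparisons with Sonnet 4.6.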
It ships with a 1-million-token context window at introductory pricing of $2 / $10 per 1M tokens (input / output) through August 31, 2026, then $3 / $15. It is also the new free and Pro default on claude.ai, so most writers get it at no cost. Note that its updated tokenizer maps the same text to roughly 1.0 to 1.35 times as many tokens, which narrows the real-world price gap. GPT-5.5 stays the safer default for fact-anchored writing like reports and briefs. The older Claude Sonnet 4.6 remains a legacy option at $3 / $15, which, given the tokenizer overhead, can work out cheaper per word only once Sonnet 5’s introductory pricing ends. Gemini 3.5 Flash (1,656 GDPval-AA Elo) is the price-performance pick for bulk content, and Claude Opus 4.8 is the call for long-form revision where you want the model to push back on weak arguments.
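The tokenizer effect on price can be worked through with a few lines of arithmetic. The sketch below scales Sonnet 5's per-token prices by the quoted 1.0 to 1.35 range to express them per the same amount of text. It assumes the multiplier is measured against Sonnet 4.6's tokenizer and applies equally to input and output, neither of which the figures above spell out. `effective_price` is a hypothetical helper.

```python
def effective_price(list_price: float, token_multiplier: float) -> float:
    """Price per 1M old-tokenizer tokens' worth of text.

    Assumption: the same multiplier applies to input and output tokens.
    """
    return list_price * token_multiplier


SONNET_5_INTRO = (2.00, 10.00)   # $ per 1M tokens: input, output
SONNET_5_LIST = (3.00, 15.00)    # after August 31, 2026
MULTIPLIERS = (1.0, 1.35)        # quoted tokenizer range

for label, (inp, out) in (("intro", SONNET_5_INTRO), ("list", SONNET_5_LIST)):
    for mult in MULTIPLIERS:
        print(
            f"{label} @ {mult}x: "
            f"${effective_price(inp, mult):.2f} / ${effective_price(out, mult):.2f}"
        )
```

At the worst-case 1.35x multiplier, the introductory rate works out to about $2.70 / $13.50 in equivalent terms, still under Sonnet 4.6's $3 / $15. The post-introductory rate could reach about $4.05 / $20.25 for the same text.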
Model | Best For | Strength | Weakness | Price (per 1M tokens) |
---|---|---|---|---|
Claude Sonnet 5 | Style + top GDPval-AA writing | New #1 GDPval-AA (~1,866), beats Opus 4.8 & GPT-5.5, 1M context | New tokenizer inflates token counts ~1.0-1.35x | $2 / $10 intro (then $3 / $15) |
GPT-5.5 | Business writing, factual reports | Improved factual reliability vs GPT-5.4 (OpenAI eval) | Style less expressive than Sonnet 5 | $5 / $30 |
Gemini 3.5 Flash | Bulk content, drafts at scale | 1,656 GDPval-AA Elo, 40% cheaper than Pro | Weaker on hardest reasoning | $1.50 / $9.00 |
Claude Opus 4.8 | Long-form, high-stakes copy | Best editor for argument structure | Among the most expensive options here | $5 / $25 |
Claude Sonnet 4.6 | Budget Claude writing | Prior Sonnet, 1,643 GDPval-AA | Superseded by Sonnet 5 | $3 / $15 |
Grok 4.3 | Casual, opinionated, X-style | Native X grounding, fewer guardrails | Not the natural pick for formal copy | $1.25 / $2.50 |
Runner-up and alternatives: Gemini 3.5 Flash is the runner-up for sheer volume at near-Sonnet quality, and GPT-5.5 is the runner-up for factual accuracy. Claude Opus 4.8 is the splurge pick for long-form. Grok 4.3 is the niche pick when you want X-style voice or live web context inside the draft.
What changed this month: Claude Sonnet 5 (June 30) is the headline writing launch, jumping ~223 GDPval-AA Elo over Sonnet 4.6 to lead the professional-writing benchmark ahead of Opus 4.8 and GPT-5.5, and it is now the free and Pro default on claude.ai. GPT-5.5 stays the pick for fact-anchored business writing, and Gemini 3.5 Flash (1,656 GDPval-AA Elo) remains the bulk-content value pick.
Best AI for Chat & Daily Assistant
Best AI for Chat & Daily Assistant: GPT-5.5 Instant ($20/month ChatGPT Plus, 52.5% fewer hallucinated claims)
The best AI for everyday chat and daily assistant work is GPT-5.5 Instant, ChatGPT’s default model, with Claude Opus 4.8 as the alternative when you want a more thoughtful tone and Gemini 3.5 Flash as the budget alternative inside the free Gemini app. On high-stakes prompts, OpenAI reports GPT-5.5 Instant produces 52.5% fewer hallucinated claims than GPT-5.3 Instant, and 37.3% fewer inaccurate claims on conversations users had flagged for factual errors, on top of faster response times and a refreshed memory system that make it the most reliable default for general-purpose tasks. It is available inside ChatGPT (free with limits, Plus at $20/month, Pro at $100/month for roughly 5x Plus usage or $200/month for roughly 20x Plus usage), through the API as the gpt-5.5 model at $5 / $30 per 1M tokens, and bundled inside Fello AI alongside Claude, Gemini, Grok, and DeepSeek.
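The ChatGPT tiers above can be compared on price per unit of usage. The sketch below divides each plan's monthly price by its quoted usage multiplier. The multipliers are the "roughly 5x" and "roughly 20x" figures from the paragraph above, so the results are approximate, and `cost_per_plus_unit` is a hypothetical helper.

```python
def cost_per_plus_unit(monthly_price: float, usage_multiplier: float) -> float:
    """Monthly price divided by usage, measured in Plus-equivalents."""
    return monthly_price / usage_multiplier


# (monthly price in USD, approximate usage relative to Plus)
PLANS = {
    "Plus": (20, 1),
    "Pro ($100)": (100, 5),
    "Pro ($200)": (200, 20),
}

for name, (price, multiplier) in PLANS.items():
    unit_cost = cost_per_plus_unit(price, multiplier)
    print(f"{name}: ${unit_cost:.2f} per Plus-equivalent of usage")
```

On these rough figures the $100 Pro tier costs the same per unit of usage as Plus, while the $200 tier halves it, so the higher tier only pays off for people who actually use that much.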
Claude Opus 4.8 is the better pick when you want a model that pushes back on weak prompts and reasons more carefully through ambiguous questions; Gemini 3.5 Flash is the better pick when you are running everything through the free Gemini app or care about speed.
Model | Best For | Strength | Weakness | Price |
---|---|---|---|---|
GPT-5.5 Instant | Everyday chat, default assistant | 52.5% fewer hallucinated claims vs 5.3 Instant | Less expressive than Claude Sonnet 5 | $20/mo Plus; gpt-5.5 API at $5 / $30 |
Claude Opus 4.8 | Thoughtful, nuanced answers | Strong reasoning, pushes back well | $25 per 1M output tokens, among the priciest API rates here | $20/mo Pro, $5 / $25 API |
Gemini 3.5 Flash | Fast, free, multimodal | Free in Gemini app, 1M context | Weaker on hardest reasoning | Free / $1.50 / $9.00 API |
Grok 4.3 | Live news, X integration | Real-time X & web grounding | Smaller ecosystem | $30/mo SuperGrok |
Fello AI | All five models, one app | ChatGPT + Claude + Gemini + Grok + DeepSeek | Routed via app, not direct | $9.99/mo |
Runner-up and alternatives: Claude Opus 4.8 is the runner-up for thoughtful daily use, Gemini 3.5 Flash is the runner-up for fast/free, and Grok 4.3 is the niche pick for live-news heavy days. Fello AI is the natural pick if you want all five top models in one Mac/iOS app for $9.99/month instead of juggling subscriptions.
What changed this month: GPT-5.5 Instant stayed the default for chat with no regressions. Its successor GPT-5.6 (Terra for everyday work, Luna for speed) is in a gated preview limited to roughly 20 organizations and is not yet a consumer option. Claude Opus 4.8 holds the #1 spot on the Artificial Analysis Intelligence Index at 61, ahead of GPT-5.5. On the Claude side, the new default is Claude Sonnet 5 (June 30), a cheaper near-Opus model at $2 / $10 introductory pricing.
Best AI for Images
Best AI for Images: ChatGPT Images 2.0 (included in ChatGPT Plus, leader on readable text)
The best AI for image generation is ChatGPT Images 2.0, with Google Nano Banana Pro (Gemini 3 Pro Image) as the alternative for photorealism, Reve 2.0 as the layout-and-typography alternative, and Midjourney v8 as the alternative for stylized art. ChatGPT Images 2.0 (April 21, 2026) leads on text rendering, multilingual scripts, and infographic-style output, which makes it the natural pick when your image needs to contain words. Google’s Nano Banana Pro (Gemini 3 Pro Image, with the lower-cost Nano Banana 2 / Gemini 3.1 Flash Image as its sibling) is the natural pick for photoreal portraits and product shots, priced around $0.134 per 1K/2K image and $0.24 per 4K image. Reve 2.0 (June 3) jumped to #2 on the Arena text-to-image leaderboard with native 4K output and editing that preserves typography. Midjourney v8 stays the niche choice for distinctive style.
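For budgeting a batch job, the per-image prices quoted above for Nano Banana Pro can be multiplied out directly. This is a minimal sketch that takes the approximate figures from the paragraph above at face value. `batch_cost` and the example counts are hypothetical.

```python
# Approximate per-image prices in USD, as quoted above (not an official rate card).
PRICE_PER_IMAGE = {"1K": 0.134, "2K": 0.134, "4K": 0.24}


def batch_cost(counts: dict) -> float:
    """Total cost for a batch, given image counts keyed by resolution."""
    return sum(PRICE_PER_IMAGE[res] * n for res, n in counts.items())


# Hypothetical batch: 200 product shots at 2K plus 50 hero images at 4K.
example = {"2K": 200, "4K": 50}
print(f"Estimated cost: ${batch_cost(example):.2f}")
```

At these rates a 4K image costs a little under twice as much as a 1K or 2K image, so reserving 4K for final assets keeps a large batch inexpensive.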
Model | Best For | Strength | Weakness | Price |
---|---|---|---|---|
ChatGPT Images 2.0 | Images with readable text | Best multilingual text rendering | Less photoreal than Nano Banana | Included in ChatGPT Plus |
Nano Banana Pro (Gemini 3 Pro Image) | Photoreal portraits, products | Photorealism, ~$0.134 per 1K/2K image | Style less distinctive | Gemini app / AI Studio |
Reve 2.0 | Layout, typography, native 4K | #2 Arena, 16MP output, layout editing | New, smaller ecosystem | Free / from $7.99/mo |
Midjourney v8 | Stylized art, illustration | Aesthetic baseline most artists like | Weaker on text in image | $10-$120/mo |
Grok Imagine | NSFW / Spicy Mode | Most permissive guardrails | Smallest underlying model | $30/mo SuperGrok |
MAI-Image-2.5 | Microsoft ecosystem | #3 on the Arena text-to-image leaderboard, native in Copilot | Just launched, US-first | Included in Copilot |
Runner-up and alternatives: Nano Banana Pro is the runner-up overall and the leader for photoreal work; Reve 2.0 is the runner-up for layout and typography; Midjourney v8 is the niche pick for art-direction-heavy use. Grok Imagine is the only major model that allows Spicy Mode adult content.
What changed this month: No major image launches in July 2026. Reve 2.0 (June 3) still holds #2 on the Arena text-to-image leaderboard with native 4K rendering and layout-based editing, and Microsoft’s MAI-Image-2.5 (June 2) sits at #3, native in Copilot. ChatGPT Images 2.0 still leads on text-in-image.
Best AI for Video
Best AI for Video: Google Veo 3.1 (Gemini App / AI Studio, Sora 2 consumer app retired April 26, 2026)
The best AI for video generation is Google Veo 3.1, with Kling 3.5 as the alternative for fast iteration and Runway Gen-4 as the alternative for cinematic motion control. OpenAI retired the Sora 2 consumer web and app experience on April 26, 2026 (the Sora 2 API remains available to developers until September 24, 2026), so OpenAI no longer ranks in this consumer category. Veo 3.1 is available inside the Gemini app, Google AI Studio, and via Vertex AI, with native audio generation, 1080p output, and the strongest physics consistency in the current lineup. Kling 3.5 stays the speed pick at lower cost; Runway Gen-4 is the choice when you need precise camera control. Pika 2.0 and Luma Ray 3 remain credible alternatives for shorter clips.
Model | Best For | Strength | Weakness | Price |
---|---|---|---|---|
Google Veo 3.1 | Highest-fidelity AI video + audio | 1080p, native audio, physics consistency | Compute-heavy, slower | Gemini AI Pro / Ultra |
Kling 3.5 | Fast iteration | Quick turnaround, strong motion | Less stable on long shots | From $10/mo |
Runway Gen-4 | Cinematic control | Best-in-class camera/motion control | Pricing premium | Free / $12/mo billed annually, or $15/mo billed monthly |
Pika 2.0 | Short clips, social | Cheap, fast, easy UX | Lower max resolution | From $10/mo |
Luma Ray 3 | Photoreal scenes | Strong realism for landscapes | Smaller community | Free / from $9.99/mo |
Runner-up and alternatives: Kling 3.5 is the runner-up overall and the cost-conscious pick; Runway Gen-4 is the runner-up for filmmakers and ad teams. Sora 2’s consumer app is retired; only the developer API remains, through September 24, 2026.
What changed this month: No major video launches in July 2026, so Veo 3.1 stays uncontested at the top of the still-supported video models. Google is widely expected to refresh Veo at its next AI event; we will update this section when that happens.
Best AI for Coding
Best AI for Coding: Claude Fable 5 (returned July 1, 80.3% SWE-Bench Pro)
The best AI for coding is Claude Fable 5, which returned on July 1 after the US government lifted the June 12 export-control order that had pulled it offline. Anthropic’s Mythos-class flagship retakes the coding crown at 80.3% on SWE-Bench Pro, the highest score of any model you can use, and is purpose-built for long-horizon autonomous runs at $10 / $50 per 1M tokens. Claude Opus 4.8 is the everyday-value pick right behind it, holding Anthropic’s top SWE-bench Verified score, remaining the favourite inside Claude Code and Cursor, and costing half as much at $5 / $25. GPT-5.5 is the proprietary alternative, Gemini 3.5 Flash is the price-performance pick for agent-style coding, Qwen 3.7 Max is the mid-tier value pick, Nex-N2-Pro is the strongest open-weight pick at 80.8 SWE-Bench Verified, and MiniMax M3, LongCat-2.0, and Microsoft’s MAI-Code-1-Flash are the open-weight and budget alternatives.
Gemini 3.5 Flash (May 19) hit 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas at $1.50 / $9.00 per 1M tokens, making it the strongest price-performance option for agent workflows. On the open-weight side, MiniMax M3 (June 1) posts 59% on SWE-Bench Pro and 66% on Terminal-Bench 2.1 at roughly $0.60 per million input tokens, and Meituan’s new LongCat-2.0 (June 29, MIT) posts 59.5% on SWE-Bench Pro and 70.8% on Terminal-Bench; both edge GPT-5.5’s 58.6% on SWE-Bench Pro. Microsoft’s MAI-Code-1-Flash (June 2) beats Claude Haiku 4.5 on SWE-Bench Verified (71.6% vs 66.6%) while using up to 60% fewer tokens, and is rolling out inside VS Code and the GitHub Copilot CLI. If you want to self-host, Kimi K2.6 (Modified MIT), GLM-5.2 (MIT, 1M context, built for long autonomous runs), and LongCat-2.0 are the strongest open-weight coders with clear commercial licenses.
Model | Best For | Strength | Weakness | Price (per 1M tokens) |
---|---|---|---|---|
Claude Fable 5 | Best coding overall, long-horizon agentic | 80.3% SWE-Bench Pro, Mythos-class | Priciest; built for hours-long runs | $10 / $50 |
Claude Opus 4.8 | Everyday-value agentic coding | Anthropic’s top SWE-bench Verified score, adaptive thinking | Below Fable 5 on SWE-Bench Pro (at half the price) | $5 / $25 |
GPT-5.5 | Frontier proprietary alternative | 58.6% SWE-Bench Pro, 82.7% Terminal-Bench 2.0 | Less agent-tuned than Claude | $5 / $30 |
Gemini 3.5 Flash | Agent coding at scale | 76.2% Terminal-Bench, 83.6% MCP Atlas | Weaker on hardest reasoning | $1.50 / $9.00 |
LongCat-2.0 | Open-weight frontier coder | 59.5% SWE-Bench Pro, MIT, 1M context | New, self-host/provider only | Open weights (MIT) |
MiniMax M3 | Cheap frontier-class, self-host | 59% SWE-Bench Pro, 1M context, multimodal | Weights/license still rolling out | ~$0.60 input |
DeepSeek V4-Flash | Cheap open-weight coding | MIT, 1M context, Intelligence Index 47 | Below V4-Pro on hardest tasks | $0.14 / $0.28 |
Runner-up and alternatives: GPT-5.5 is the proprietary runner-up; Gemini 3.5 Flash is the runner-up for price-performance; Qwen 3.7 Max is the runner-up for mid-tier value; MiniMax M3, LongCat-2.0, and DeepSeek V4 are the runners-up for open-weight self-hosters. Inside IDEs, Cursor + Claude Opus 4.8 is the most popular pairing and Claude Code is the natural pick if you live in the terminal.
What changed this month: Claude Fable 5 returned on July 1 and retakes the best-coding pick at 80.3% SWE-Bench Pro, the highest of any usable model. Meituan open-sourced LongCat-2.0 (June 29), a 1.6T-parameter MIT-licensed model at 59.5% SWE-Bench Pro. OpenAI’s GPT-5.6 family (including a Codex-tuned Sol) began a gated preview on June 26 but is not yet generally available. Open-weight coding now has a deep bench: MiniMax M3 (Intelligence Index 55), LongCat-2.0, Kimi K2.6, and GLM-5.2 all sit at or above 58-59% on SWE-Bench Pro under permissive or commercial licenses. Anthropic’s new Claude Sonnet 5 (June 30) posts 63.2% SWE-Bench Pro at $2 / $10 introductory pricing, making it a cheaper near-Opus Claude coder.
Best AI for Creativity
Best AI for Creativity: Grok 4.3 (xAI, $30/month SuperGrok, fewer guardrails)
The best AI for creative writing, brainstorming, and unfiltered ideation is Grok 4.3, with Claude Opus 4.8 as the alternative for structured creative work and Gemini 3.1 Pro as the alternative for multimodal creative tasks. Grok 4.3 (April 30, 2026) has the most permissive guardrails of any frontier model and the strongest native X integration, which makes it the natural pick for opinionated, on-trend, real-time creative work. Claude Opus 4.8 is the better pick when you want a model that holds a long creative thread, edits its own drafts, and engages with the substance of your work. Gemini 3.1 Pro is the better pick when your creative project mixes text with images, video, and live web context.
Model | Best For | Strength | Weakness | Price |
---|---|---|---|---|
Grok 4.3 | Unfiltered, opinionated, on-trend | Fewest guardrails, X integration | Less polished for structured work | $30/mo SuperGrok |
Claude Opus 4.8 | Long-form structured creativity | Holds long threads, self-edits | Most cautious of the four | $20/mo Pro, $5 / $25 API |
Gemini 3.1 Pro | Multimodal creative | Strong text + image + video chain | Quotas inside Gemini app | Free / $2.00-$4.00 API input |
GPT-5.5 | Mainstream creative writing | Best at hitting briefs | Heavier guardrails | $20/mo Plus, $5 / $30 API |
Grok Imagine (Spicy Mode) | NSFW / adult creative | Most permissive image generation | Niche use case | $30/mo SuperGrok |
Runner-up and alternatives: Claude Opus 4.8 is the runner-up overall and the right pick for projects that need to hold together across many turns. Gemini 3.1 Pro is the multimodal runner-up. For adult creative work, Grok Imagine Spicy Mode is the only frontier-grade option.
What changed this month: No major creativity-specific launches in July 2026. Grok 4.3 stayed the category leader; its successor Grok 4.5 was revealed in private beta on June 28 but has no public release date yet, so it does not change the pick this month.
Best AI for Accuracy
Best AI for Accuracy: Gemini 3.1 Pro (94.3% GPQA Diamond, 44.4% Humanity’s Last Exam, 77.1% ARC-AGI-2)
The best AI for accuracy and research is Gemini 3.1 Pro, with Qwen 3.7 Max as the value alternative and GPT-5.5 Pro as the alternative for hallucination-sensitive work. Gemini 3.1 Pro leads the hardest pure-reasoning tests at 94.3% on GPQA Diamond, 44.4% on Humanity’s Last Exam, and 77.1% on ARC-AGI-2, with native Google Search grounding for live factual answers. Qwen 3.7 Max (May 20) entered the top tier at 92.4 on GPQA Diamond, tied with Claude Opus 4.8, at half the API cost.
GPT-5.5 Pro (April 23) carries GPT-5.5’s factual-reliability gains over GPT-5.4 (claims 23% more likely to be factually correct on OpenAI’s flagged-conversation set), which makes it the right pick when factual reliability matters more than raw benchmark depth. Gemini 3.5 Flash (May 19) outscores Gemini 3.1 Pro on coding and agent benchmarks but trails Pro on these accuracy tests (HLE 40.2% vs 44.4%, ARC-AGI-2 72.1% vs 77.1%), so Pro stays the accuracy pick.
Model | Best For | Key Benchmark | Weakness | Price |
---|---|---|---|---|
Gemini 3.1 Pro | Hardest reasoning + research | 94.3% GPQA, 44.4% HLE, 77.1% ARC-AGI-2 | API quotas in app | $2.00-$4.00 / $12.00-$18.00 (tiered) |
Qwen 3.7 Max | Frontier accuracy at value pricing | 92.4 GPQA Diamond | API-only, no chat front-end | $1.25 / $3.75 promo; $2.50 / $7.50 list |
GPT-5.5 Pro | Hallucination-sensitive work | Improved factual reliability vs GPT-5.4 (OpenAI eval) | Pricier API tier | $100/mo ChatGPT Pro |
Claude Opus 4.8 | Long-form factual writing | #1 Intelligence Index (61) | Slower on hardest math | $5 / $25 |
Grok 4.3 | Live web facts | Native real-time grounding | Smaller benchmark coverage | $30/mo SuperGrok |
Runner-up and alternatives: Qwen 3.7 Max is the runner-up and the value pick at the frontier. GPT-5.5 Pro is the runner-up for hallucination-sensitive work. Claude Opus 4.8 is the runner-up for long-form factual writing.
What changed this month: No new accuracy leaders shipped, so Gemini 3.1 Pro holds the top of the category. The one to watch is Gemini 3.5 Pro, now cleared for a July general-availability launch after slipping from June; its specs are not yet public, but it could reset this ranking the moment it reaches general availability.
Best AI for Problem Solving
Best AI for Problem Solving: GPT-5.5 Pro & Qwen 3.7 Max (39.6% FrontierMath Tier 4, 97.1 HMMT 2026 Feb)
The best AI for hard problem-solving is GPT-5.5 Pro for FrontierMath-style abstract math and Qwen 3.7 Max for competition math, with Claude Opus 4.8 as the alternative for long agentic reasoning chains. GPT-5.5 Pro still leads at 39.6% on FrontierMath Tier 4 (about 1.7 times Claude Opus 4.8’s 22.9%), which makes it the right pick when you need step-by-step working on the hardest math and physics problems. Qwen 3.7 Max (May 20) hit 97.1 on HMMT 2026 February, the highest score in its comparison group, and 44.5 on Apex, which makes it the right pick for competition-style problem-solving at half the cost of GPT-5.5 Pro.
Claude Opus 4.7 (April 16) introduced task budgets, a primitive for guiding agentic token spend on long chains; Claude Opus 4.8 (May 28) instead uses adaptive thinking controlled by an effort parameter, and does not support extended-thinking budgets. Gemini 3.5 Flash trades raw reasoning depth for speed and price; for the hardest problems, Gemini 3.1 Pro and the Thinking variants still lead.
Model | Best For | Key Benchmark | Weakness | Price |
---|---|---|---|---|
GPT-5.5 Pro | Abstract math, physics | 39.6% FrontierMath Tier 4 | Highest cost tier | $100/mo ChatGPT Pro |
Qwen 3.7 Max | Competition math | 97.1 HMMT 2026 Feb, 44.5 Apex | API-only | $1.25 / $3.75 promo; $2.50 / $7.50 list |
Claude Opus 4.8 | Long agentic reasoning | Adaptive thinking, effort control, #1 Intelligence Index | Slower on math | $5 / $25 |
Gemini 3.1 Pro | Multimodal reasoning + research | 94.3 GPQA, 77.1 ARC-AGI-2 | API quotas | $2.00-$4.00 / $12.00-$18.00 (tiered) |
DeepSeek V4-Flash | Open-weight problem solving | MIT, 1M context, Intelligence Index 47 | Below V4-Pro on hardest tasks | $0.14 / $0.28 |
Runner-up and alternatives: Claude Opus 4.8 is the runner-up overall and the natural pick for agentic, long-chain problem-solving. Gemini 3.1 Pro is the multimodal runner-up. DeepSeek V4-Flash is the open-weight runner-up.
What changed this month: No new problem-solving leaders shipped, so GPT-5.5 Pro still leads FrontierMath Tier 4 at 39.6% and Qwen 3.7 Max still leads competition math at 97.1 HMMT 2026 February. OpenAI’s GPT-5.6 Sol is the one to watch here once its gated preview opens up. The open-weight MiniMax M3 and LongCat-2.0 are strong cheaper options for agentic reasoning chains.
Best AI Agent
Best AI Agent: Gemini Spark vs Claude Cowork ($100/month Ultra vs $20/month Pro)
The best AI agent right now is Gemini Spark for 24/7 cloud-resident work and Claude Cowork for desktop-resident work, with ChatGPT Codex as the alternative for coding agents and OpenAI Operator-class browser agents as the alternative for web tasks. AI agents are the fastest-moving category of 2026: each top vendor now ships an agent product, and the practical choice is between agents that live in the cloud (run while your laptop is closed) and agents that live on your desktop (drive your apps directly).
Gemini Spark launched at Google I/O on May 19, 2026 and is the first 24/7 cloud agent. Claude Cowork launched in general availability on April 9, 2026 and runs as a desktop agent that drives your local apps. ChatGPT Codex Mobile (May 14) is the pick for coding-agent work, now usable from iOS and Android.
Agent | Best For | Where It Runs | Strength | Price |
---|---|---|---|---|
Gemini Spark | 24/7 cloud tasks, Workspace workflows | Google Cloud VM (always-on) | First true 24/7 agent, deep Workspace integration | $100/mo Google AI Ultra |
Claude Cowork | Desktop, app-driving, design + code | Your Mac/Windows desktop | Drives local apps, sees your screen | $20/mo Claude Pro |
ChatGPT Codex Mobile | Coding agent on phone | OpenAI cloud + iOS/Android | Approve diffs and redirect work from phone | Included in ChatGPT plans |
Grok Agentic (Grok 4.3) | Real-time research, X scraping | xAI cloud | Native X integration | $30/mo SuperGrok |
OpenAI Operator-class | Browser tasks, web forms | OpenAI cloud + your browser | Web automation | ChatGPT Pro |
Runner-up and alternatives: Claude Cowork is the runner-up overall and the natural pick when you want the agent on your machine driving your apps. ChatGPT Codex Mobile is the runner-up for coding agents. Grok Agentic is the niche pick for real-time research.
What changed this month: No new consumer agents shipped, so the Gemini Spark (cloud) vs Claude Cowork (desktop) choice still drives most agent decisions for individual users. With Claude Fable 5 back online, the strongest model you can run inside an agent system for long-horizon autonomous work is available again. The open models LongCat-2.0, NVIDIA Nemotron 3 Ultra, and MiniMax M3 all ship strong agentic benchmarks, which matters for teams building their own agents on open weights.