Best AI for Writing
Claude Sonnet 4.6 does not just score well on writing benchmarks; it leads them by a clear margin. On the GDPval-AA Elo benchmark, which measures real expert-level office work including drafting, editing, and document creation, Sonnet 4.6 scores an Elo of 1,633, higher than every other model including its own sibling Opus 4.6 (1,606). For professional writing tasks, it consistently outperforms models that cost twice as much per token.
The practical advantage comes from Anthropic’s focus on instruction-following. Sonnet 4.6 reliably maintains tone, follows complex style guides, and produces clean structured output without extensive prompt engineering. It handles long-form documents with strong coherence, maintaining argument structure and factual consistency across thousands of words. This precision is notable because Gemini 3.1 Pro, despite its benchmark dominance across 12 of 18 categories, scores only 1,317 on GDPval-AA, far below both Claude models.
Sonnet 4.6 achieves its GDPval-AA results through adaptive thinking, which means it self-allocates more processing effort to complex writing tasks. The trade-off is token consumption: Sonnet 4.6 uses roughly four times as many total tokens as Sonnet 4.5 on the same GDPval-AA tasks. For individual writers, that cost difference is invisible. For teams running high-volume automated content pipelines, it is worth modelling the per-task spend before committing.
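As a rough illustration of that modelling, the sketch below prices a single task at Sonnet 4.6's published $3/$15 rates and applies the ~4x token multiplier versus Sonnet 4.5. The per-task token counts and task volume are hypothetical placeholders, not measured figures.

```python
# Back-of-envelope cost model for a writing pipeline (illustrative assumptions only).
# Prices are the published $3 input / $15 output per 1M tokens for Sonnet 4.6;
# per-task token counts and the ~4x multiplier vs. Sonnet 4.5 are rough estimates.

def task_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float = 3.00,
              output_price_per_m: float = 15.00) -> float:
    """Return the USD cost of one task at the given per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m + \
           (output_tokens / 1_000_000) * output_price_per_m

# Assumed token usage for one long-form draft (hypothetical numbers).
sonnet_45_style = task_cost(input_tokens=8_000, output_tokens=3_000)
sonnet_46_style = task_cost(input_tokens=8_000 * 4, output_tokens=3_000 * 4)  # ~4x total tokens

print(f"Sonnet 4.5-style usage:      ${sonnet_45_style:.3f} per task")
print(f"Sonnet 4.6 adaptive thinking: ${sonnet_46_style:.3f} per task")
print(f"Monthly at 10,000 tasks:      ${sonnet_46_style * 10_000:,.0f}")
```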
Pricing reinforces the case for most users. At $3/$15 per million tokens, Sonnet 4.6 sits between the budget and premium tiers, delivering writing quality that competes with Opus 4.6 at a lower cost. For content teams needing to balance output volume and quality, Sonnet 4.6 is the clear default. If budget is not a constraint, Opus 4.6 handles longer multi-section documents with marginally stronger structural reasoning due to its 1M token context window (beta).
| Model | Writing Benchmark | Instruction Following | Price (I/O per 1M) | Best For |
|---|---|---|---|---|
| Claude Sonnet 4.6 | GDPval-AA: 1,633 Elo (1st) | Excellent | $3 / $15 | Long-form, professional writing |
| GPT-5.4 | GDPval: 83% (1st overall) | Very Good | $2.50 / $15 | Documents, reports, knowledge work |
| Gemini 3.1 Pro | GPQA Diamond: 94.3% | Good | $2 / $12 | Research-heavy, accuracy-critical |
| Claude Opus 4.6 | GDPval-AA strong | Excellent | $5 / $25 | Complex writing with reasoning |
Runner-up and Alternatives
GPT-5.4 is a strong second. Its 83% GDPval score reflects solid document and knowledge-work capability, and it offers a wider tool ecosystem for writers who need integrated search and web access. Gemini 3.1 Pro is worth considering for accuracy-critical writing, such as scientific summaries or financial content.
What Changed This Month: GPT-5.4’s launch strengthens the competition in knowledge-work writing. Sonnet 4.6 still leads on style and instruction-following, but the gap for structured documents has narrowed.
Best AI for Chat / Daily Assistant
GPT-5.4 replaces GPT-5.2 as the default for everyday AI use, and the upgrade is substantial. It is the first general-purpose model to surpass human performance on OSWorld (75.0% vs. human baseline of 72.4%), meaning it can reliably operate software, fill out forms, manage files, and execute multi-step desktop workflows without step-by-step guidance. That capability alone reframes what a daily AI assistant can mean: instead of just advising on a task, GPT-5.4 can complete it.
The model also reduces individual hallucinated statements by 33% compared to GPT-5.2, with 18% fewer errors in complete answers. Its context window is 1 million tokens via API, with premium pricing applied to prompts exceeding 272K input tokens. ChatGPT’s breadth of integrations also contributes: GPT-5.4 ships with native Tool Search for real-time web access, integrates with more third-party workflows than any other model, and is available in GitHub Copilot from day one.
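If you plan to push prompts past that threshold, it is worth sketching the tiered input cost. The snippet below uses the $2.50 per 1M input rate quoted elsewhere in this guide; the premium multiplier above 272K input tokens is a placeholder assumption, since the exact surcharge is not listed here.

```python
# Minimal sketch of tiered input pricing for long prompts (assumptions flagged below).
BASE_INPUT_PER_M = 2.50        # GPT-5.4 input price per 1M tokens quoted in this guide
PREMIUM_THRESHOLD = 272_000    # input tokens above this are billed at a premium rate
PREMIUM_MULTIPLIER = 2.0       # placeholder assumption; check the provider's actual surcharge

def input_cost(input_tokens: int) -> float:
    """USD input cost for one prompt under the assumed two-tier scheme."""
    base_tokens = min(input_tokens, PREMIUM_THRESHOLD)
    premium_tokens = max(input_tokens - PREMIUM_THRESHOLD, 0)
    return (base_tokens * BASE_INPUT_PER_M
            + premium_tokens * BASE_INPUT_PER_M * PREMIUM_MULTIPLIER) / 1_000_000

print(f"200K-token prompt: ${input_cost(200_000):.2f}")
print(f"800K-token prompt: ${input_cost(800_000):.2f}")
```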
For users who do not need computer-use capabilities, Gemini 3.1 Pro (currently in preview) is the most competitive alternative. Its native Google Search integration provides grounded, citation-backed answers, and at $2/$12 per million tokens it costs less than GPT-5.4 at the API level. Grok 4.20 is the strongest option for real-time X and web data, and its per-token pricing is significantly lower, making it cost-effective for developers building chatbot applications.
GPT-5.4’s thinking mode shows a structured work plan before generating an answer. For complex multi-step requests, this transparency helps users catch misunderstood instructions before a full response is generated. Its new Tool Search feature also cuts costs for developers: in testing across 36 MCP servers, it reduced total token usage by 47% while maintaining accuracy, a significant saving for teams running large agentic tool ecosystems.
| Model | Chat Quality | Tool / Web Access | Computer Use | Best For |
|---|---|---|---|---|
| GPT-5.4 | Excellent | Native + Tool Search | Yes (OSWorld 75%) | Daily tasks, automation |
| Gemini 3.1 Pro | Excellent | Google Search native | Limited | Research-heavy conversations |
| Grok 4.20 | Very Good | Real-time X/web data | No | Current events, creative chat |
| Claude Opus 4.6 | Very Good | Limited | Agent teams | Deep analytical conversations |
Runner-up and Alternatives
Gemini 3.1 Pro is the strongest alternative for users who prioritize accuracy and research depth. Grok 4.20 is the best choice for real-time information and costs a fraction of GPT-5.4 at the API level.
What Changed This Month: GPT-5.4 launched March 5, directly replacing GPT-5.2 as the winner in this category.
Best AI for Images
The top spot in AI image generation is genuinely contested as of March 2026. The two major crowd-sourced image leaderboards disagree: on arena.ai (LM Arena), Gemini 3.1 Flash Image (also known as Nano Banana 2) leads at 1,268 Elo, with GPT-Image-1.5 at 1,248 – a 20-point gap in Google’s favour, though Gemini’s score is marked Preliminary with fewer votes. On Artificial Analysis, GPT-Image-1.5 leads at 1,268 Elo, with Gemini 3.1 Flash Image at 1,262 – a 6-point gap in OpenAI’s favour. Both leaderboards use blind human preference voting but draw from different user pools.
We give GPT-Image-1.5 a narrow edge for professional and commercial use based on its practical strengths: it is the first image generator to simultaneously handle accurate text rendering, photorealism, and artistic stylization without forcing a trade-off between them. Text in images – labels, signs, logos, and UI elements – renders accurately rather than distorting into illegible noise. For any project requiring readable on-image copy, GPT-Image-1.5 remains the most reliable choice.
Gemini 3.1 Flash Image is the stronger pick if speed, cost, or multilingual text rendering are priorities. It generates faster, costs roughly half the price per image, and is deeply integrated across Google products (Gemini app, Search AI Mode, Google Ads, Flow). For high-volume production workflows where cost-per-image matters, Gemini 3.1 Flash Image may be the better default despite its Preliminary leaderboard status.
Flux 2 [max] (arena.ai Elo 1,167; Artificial Analysis Elo 1,207) excels at photographic skin texture and fine-art aesthetics, and remains the strongest open-ecosystem option for artistic style diversity. For projects where artistic range matters more than photorealism or text accuracy, Flux 2 is competitive.
| Model | Elo (arena.ai) | Elo (Art. Anls.) | Best Strength | Known Weakness | Best For |
|---|---|---|---|---|---|
| GPT-Image-1.5 | 1,248 | 1,268 | Photorealism + text accuracy | Cost | Professional, branded content |
| Gemini 3.1 Flash Image | 1,268 (Prelim.) | 1,262 | Speed + multilingual + cost | Less artistic range | High-volume, multilingual |
| Gemini 3 Pro Image | 1,236 | 1,221 | Diverse style range | Slightly lower realism | Varied creative projects |
| Flux 2 [max] | 1,167 | 1,207 | Artistic, skin texture | Text rendering | Fine art, photography |
Note: Elo scores from arena.ai (LM Arena) and Artificial Analysis Image Arena as of March 8, 2026. Rankings differ between the two leaderboards.
What Changed This Month: Gemini 3.1 Flash Image (Nano Banana 2) launched February 26 and immediately claimed the top spot on both major image leaderboards. GPT-Image-1.5 has since regained #1 on Artificial Analysis but trails on arena.ai. The top two are closer than ever – the winner depends on which leaderboard you trust and which strengths matter for your use case.
Best AI for Video
Veo 3.1 produces the most cinematic output of any AI video model. It generates at professional 24fps with optional 4K upscaling, produces native synchronized audio – sound effects, ambient noise, and dialogue generated natively – and follows complex multi-element prompts better than any competitor. Released in October 2025, with major feature updates in January 2026, it includes two capabilities that separate it from the field: Scene Extension for continuous narratives exceeding 60 seconds, and Ingredients to Video, which lets you upload up to three reference images to lock character face, clothing, and environment consistently across all scenes. For anyone building branded video series or consistent character-driven content, that scene-level consistency is a practical production advantage no other model currently matches.
Native audio is now table stakes. All four major video models generate synchronized audio as of early 2026. The differentiator has shifted to visual quality, prompt accuracy, and scene-level consistency, and Veo 3.1 leads on all three. The image-to-image transition generation feature (First and Last Frame) also adds polish that previously required manual editing: Veo 3.1 auto-generates smooth transitions between scenes with matched audio, removing a step that typically required post-production.
Sora 2 is the strongest alternative for physically realistic motion. Its physics simulation training means falling objects, water, and crowds behave more convincingly than in Veo 3.1. For storytelling-driven content where physical realism matters more than visual fidelity, Sora 2 is worth testing. Kling 3.0 remains the best option for rapid prototyping and social content, generating at comparable 1080p/24fps quality at lower cost and faster turnaround.
Seedance 2.0 occupies a different niche: its multi-modal input with audio reference makes it the best tool for music video production and brand content that needs to match a specific audio track. Its audio reference input system allows the generated video to sync visually to an existing music bed, a capability the other three models do not offer natively.
| Model | Native Audio | Resolution | Best Strength | Best For |
|---|---|---|---|---|
| Veo 3.1 | Yes | Up to 4K / 24fps | Prompt accuracy, cinematic, scene consistency | Broadcast, commercial, film |
| Sora 2 | Yes | 1080p / 24fps | Physics simulation | Realistic motion, storytelling |
| Kling 3.0 | Yes | 1080p / 24fps | Low cost, fast | Rapid prototyping, social |
| Seedance 2.0 | Yes (+ audio ref) | 1080p / 24fps | Multi-modal input | Music video, brand content |
What Changed This Month: All four major video models now include native audio. Prompt adherence, visual quality, and scene consistency are now the differentiators.
Best AI for Coding
Claude Opus 4.6 scores 80.8% on SWE-bench Verified, leading every general-purpose model. The SWE-bench test evaluates real GitHub issues, not synthetic coding puzzles, requiring the model to understand an existing codebase, identify the relevant files, and write a correct patch. At 80.8%, Opus 4.6 resolves roughly four in five of these real-world engineering problems without human guidance. (Note: this is a marginal 0.1 percentage point regression from Opus 4.5’s 80.9%, suggesting SWE-bench performance has plateaued at the ~80% level across frontier models. The gains in Opus 4.6 are in reasoning and agentic capabilities, not raw SWE-bench scores.)
The architecture advantage is the multi-agent system. Through Claude Code, Opus 4.6 can spawn and coordinate parallel sub-agents, delegating different parts of a codebase to independent processes and recombining results. On large refactors or feature additions spanning multiple files and modules, this approach handles work that single-context models struggle with. Anthropic also specifically trained Opus 4.6 to reduce logic hallucinations – the class of error where code is syntactically valid but logically incorrect – which is the failure mode that wastes the most developer time in AI-assisted coding.
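The pattern itself is easy to picture. The sketch below illustrates the decompose/delegate/recombine idea using the Anthropic Python SDK with plain thread-based fan-out; it is not Claude Code's actual implementation, and the model ID and subtasks are assumptions made for the example.

```python
# Illustrative sketch of decompose/delegate/recombine, not Claude Code's internals.
# Model ID "claude-opus-4-6" is assumed from this guide's naming.
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_subagent(subtask: str) -> str:
    """Send one independent subtask to the model and return its answer."""
    response = client.messages.create(
        model="claude-opus-4-6",   # assumed model ID
        max_tokens=2_000,
        messages=[{"role": "user", "content": subtask}],
    )
    return response.content[0].text

subtasks = [
    "Summarize the public interface of module A and list breaking-change risks.",
    "Summarize the public interface of module B and list breaking-change risks.",
    "Draft a migration checklist for renaming the shared config keys.",
]

# Fan out the subtasks in parallel, then hand the partial results to a final pass.
with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
    partials = list(pool.map(run_subagent, subtasks))

synthesis = run_subagent(
    "Combine these findings into one refactor plan:\n\n" + "\n\n---\n\n".join(partials)
)
print(synthesis)
```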
Gemini 3.1 Pro (currently in preview) is a genuine challenger, scoring 80.6% on SWE-bench, just 0.2 percentage points behind Opus 4.6. Its 1M token context window (standard, not beta) makes it stronger on very large codebases where loading entire repositories matters. At $2/$12 per million tokens compared to Opus 4.6’s $5/$25, it is significantly cheaper for teams running continuous coding automation. For teams on tight API budgets, Gemini 3.1 Pro delivers near-equivalent coding performance at less than half the price.
Claude Sonnet 4.6 sits at 79.6% on SWE-bench and is worth considering for daily coding assistance. At $3/$15 it costs less than Opus 4.6 and handles most coding tasks with nearly identical quality. GPT-5.4 scored 54.6% on Toolathlon (a multi-tool benchmark relevant to agentic coding) and brings strong computer-use integration for developers who need to automate IDE interactions. For prototyping and greenfield development, GPT-5.4’s tool ecosystem and speed make it a practical choice alongside the Claude models.
| Model | SWE-bench | Agent / Multi-file | Context | Price (I/O per 1M) | Best For |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 80.8% | Excellent (agent teams) | 200K (1M beta) | $5 / $25 | Complex, agentic coding |
| Gemini 3.1 Pro | 80.6% | Good | 1M | $2 / $12 | Long-context, cost-sensitive |
| Claude Sonnet 4.6 | 79.6% | Good | 200K (1M beta) | $3 / $15 | Daily coding, near-Opus |
| GPT-5.4 | Competitive | Good | 1M | $2.50 / $15 | Rapid prototyping, tool-use |
What Changed This Month: The Opus 4.6 vs. Gemini 3.1 Pro SWE-bench gap is now just 0.2 percentage points. GPT-5.4 launched with strong Toolathlon scores (54.6%).
Best AI for Creativity
Creativity is the hardest category to measure objectively. There is no authoritative benchmark equivalent to SWE-bench or GPQA Diamond. What we can say with evidence: Grok 4.20 holds a crowd-sourced Arena Elo of 1,493 (rank 4 overall), and human raters consistently prefer its outputs in open-ended conversation, the domain most relevant to creative collaboration. Note that Grok 4.20 is currently in beta and available only to SuperGrok (~$30/month) and X Premium+ subscribers.
Grok 4.20’s four-agent architecture is the key differentiator. Four specialized sub-agents – Grok, Harper, Benjamin, and Lucas – deliberate in parallel, fact-check each other, and reach consensus before responding. This process tends to push outputs away from the statistically safest, most expected answer. The results are less predictable than other frontier models, which is either an advantage or a drawback depending on your creative workflow. For brainstorming, concept generation, and ideation under uncertainty, that divergence from the expected is exactly what you want.
Real-time data access through X and the broader web gives Grok 4.20 a further creative edge. It can incorporate current cultural references, trending formats, and breaking news into its outputs in a way that models without live data access cannot. For content creators working on topical or trend-driven material, this gives Grok 4.20 relevance that Claude and Gemini cannot match without supplementary search tools.
This is the most subjective category we rank. If you need tight style constraints rather than open-ended divergence, Claude Sonnet 4.6 is the better fit. Its instruction-following precision means it will stay inside defined creative parameters far more reliably than Grok 4.20. GPT-5.4, with its Tool Search integration, is the best option for creative projects that blend research with ideation, such as long-form journalism or strategy documents.
| Model | Creative Approach | Real-time Data | Arena Rank | Best For |
|---|---|---|---|---|
| Grok 4.20* | Multi-agent deliberation | Yes (X + web) | 4 (1,493 Elo) | Topical, brainstorming |
| Claude Sonnet 4.6 | Deep instruction following | No | High | Structured creative writing |
| GPT-5.4 | Versatile, tool-enabled | Yes (Tool Search) | TBD (new) | Creative + research |
| Gemini 3.1 Pro | Technically rigorous | Yes (Google) | 2 (1,500 Elo) | Science writing, journalism |
Note: * Grok 4.20 is currently in beta.
What Changed This Month: Grok 4.20 Beta 2 (March 3) updated Beta 1 with improved instruction following and LaTeX output. Grok 4.20 replaced Grok 4.1 as the winner for this category when Beta 1 launched in February.
Best AI for Accuracy
Gemini 3.1 Pro (currently in preview) is the most factually reliable LLM released to date. Its headline numbers: 94.3% on GPQA Diamond (graduate-level science questions), 77.1% on ARC-AGI-2 (novel problem-solving requiring genuine reasoning), and 80.6% on SWE-bench Verified. It leads 12 of 18 standardized benchmarks tracked across the major model evaluation frameworks. The ARC-AGI-2 score represents a 2.5x improvement over its predecessor (31.1%), the largest single-generation reasoning jump recorded by any frontier model.
The native Google Search grounding is the operational advantage. For use cases where correctness matters most, such as medical queries, legal summaries, scientific research, and financial analysis, Gemini 3.1 Pro automatically grounds its answers against current search results when needed. This means factual errors from knowledge cutoffs are far less common than in models without live search integration. The combination of the highest benchmark scores and real-time grounding makes it uniquely reliable for professional research use.
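Turning that grounding on is a one-line config change in the API. The sketch below uses the google-genai Python SDK's Search grounding tool; the model ID follows this guide's naming and should be treated as an assumption, and the model itself decides whether a given query needs grounding.

```python
# Minimal sketch of Google Search grounding with the google-genai Python SDK.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed model ID based on this guide's naming
    contents="What changed in the 2026 IFRS sustainability disclosure rules?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # enable Search grounding
    ),
)

print(response.text)
# When grounding fires, source citations are attached to the candidate:
print(response.candidates[0].grounding_metadata)
```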
Claude Opus 4.6 is the strongest challenger on reasoning accuracy specifically. It holds Arena rank 1 with 1,504 Elo and scores 68.8% on ARC-AGI-2, up sharply from Opus 4.5’s 37.6%. On pure logic and mathematical problem-solving, Opus 4.6’s extended thinking mode can match or exceed Gemini 3.1 Pro’s performance. For tasks where chain-of-thought reasoning matters more than factual grounding, Opus 4.6 is worth testing as an alternative.
GPT-5.4 adds competitive accuracy credentials through its knowledge-work benchmark results (83% GDPval) and Tool Search integration for real-time fact access. However, Gemini 3.1 Pro’s lead on scientific reasoning benchmarks has not been displaced by GPT-5.4’s March 5 launch. For research, analysis, and any task where a factual error has real consequences, Gemini 3.1 Pro remains the safest default.
| Model | GPQA Diamond | ARC-AGI-2 | SWE-bench | Arena Elo | Best For |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 94.3% | 77.1% | 80.6% | 1,500 (rank 2) | Research, science, factual |
| GPT-5.4 | Strong | Competitive | Competitive | TBD (new) | Knowledge-work accuracy |
| Claude Opus 4.6 | 91.3% | 68.8% | 80.8% | 1,504 (rank 1) | Logic, coding accuracy |
| Grok 4.20 | Competitive | Strong | – | 1,493 (rank 4) | Forecasting, real-time |
Note: Claude Opus 4.6’s GPQA Diamond score (91.3%) added for completeness based on published benchmarks.
What Changed This Month: GPT-5.4 launched as a strong challenger but has not displaced Gemini 3.1 Pro’s lead on scientific reasoning benchmarks. Claude Opus 4.6’s ARC-AGI-2 score (68.8%) is a notable jump from Opus 4.5’s 37.6%.
Best AI for Problem Solving
Claude Opus 4.6 Thinking is Anthropic’s extended chain-of-thought mode, holding Arena rank 3 with an Elo of 1,500. The core capability is explicit step-by-step reasoning: the model surfaces its assumptions, considers alternative paths, and shows the working before committing to an answer. For problems where the reasoning process matters as much as the answer, such as strategic planning, mathematical proofs, and multi-constraint optimization, this transparency is operationally useful.
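In API terms, the mode is opt-in per request. The sketch below shows extended thinking enabled via the Anthropic Python SDK; the model ID and the thinking budget are assumptions made for illustration.

```python
# Minimal sketch of enabling extended thinking with the Anthropic Python SDK.
# Model ID is assumed from this guide's naming; the thinking budget is illustrative.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",                        # assumed model ID
    max_tokens=16_000,                              # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{
        "role": "user",
        "content": "Plan a rollout that satisfies these three conflicting constraints: ...",
    }],
)

# The response interleaves "thinking" blocks (the visible working) with the final answer.
for block in response.content:
    if block.type == "thinking":
        print("[reasoning]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```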
The agent team architecture is the decisive advantage for complex problem-solving. Opus 4.6 can decompose a hard problem, assign subtasks to parallel sub-agents via Claude Code, and synthesise results into a coherent solution. This is not a token-level reasoning improvement but a structural one: the model breaks a problem into independently solvable components and recombines them. For problems with no single correct answer, the thinking mode surfaces assumptions and explores alternatives before converging, reducing the risk of confidently wrong outputs.
Gemini 3.1 Pro’s Deep Think mode (currently in preview) is the strongest alternative, specifically for scientific and mathematical problems. It holds the same 1,500 Arena Elo and leads on GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). For hypothesis testing, research design, and problems with verifiable ground truth, Gemini 3.1 Pro Deep Think now rivals Claude Opus 4.6 Thinking. The choice between them often comes down to domain: Claude Opus 4.6 Thinking is stronger on multi-step logic and engineering problems, while Gemini 3.1 Pro Deep Think is stronger on scientific and empirical reasoning.
Grok 4.20 offers a structurally different approach: its four-agent deliberation is always active, not a separately enabled mode. The four sub-agents fact-check each other in parallel before responding, producing a consensus answer rather than a single chain of thought. For forecasting, multi-perspective analysis, and scenarios where contrarian views improve the output, Grok 4.20’s architecture provides a meaningful alternative to the Claude and Gemini extended-thinking approaches.
| Model | Extended Reasoning | Multi-agent | Arena Elo | Best For |
|---|---|---|---|---|
| Claude Opus 4.6 Thinking | Yes (chain-of-thought) | Yes (Claude Code) | 1,500 (rank 3) | Complex reasoning, agentic |
| Gemini 3.1 Pro (Deep Think)* | Yes | Limited | 1,500 (rank 2) | Scientific problems, research |
| GPT-5.4 Thinking | Yes | Limited | TBD (new) | Structured logic, knowledge-work |
| Grok 4.20** | Yes (4-agent) | Built-in | 1,493 (rank 4) | Forecasting, multi-perspective |
* Gemini 3.1 Pro is currently in preview.
** Grok 4.20 is currently in beta.
What Changed This Month: Gemini 3.1 Pro’s improved Deep Think mode now rivals Claude Opus 4.6 Thinking on scientific problems specifically. GPT-5.4 added a Thinking mode at launch.