Best AI for Writing
Claude Sonnet 4.6 remains the strongest writing model in April 2026, even as the benchmark landscape shifts beneath it. On the GDPval-AA Elo leaderboard, which measures real expert-level office work such as drafting, editing, and document creation, GPT-5.5 (released April 23) now leads, outscoring both GPT-5.4 (1,671 Elo) and Sonnet 4.6 (1,643 Elo). But GDPval-AA measures structured knowledge-work output across 44 occupations, not writing quality in the sense most writers care about: voice, tone fidelity, narrative coherence, and the ability to follow a tightly defined style guide without drifting. On those dimensions, Sonnet 4.6 still has no real competitor.
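For readers unfamiliar with Elo, the gaps above are smaller than they look. Assuming GDPval-AA uses standard Elo scoring (an assumption; the leaderboard's exact formula is not reproduced here), a 28-point gap converts to roughly a 54/46 head-to-head preference split:

```python
# Sketch: convert an Elo gap into an expected head-to-head win rate,
# assuming GDPval-AA uses the standard Elo logistic formula.
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that model A beats model B in a pairwise comparison."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# GPT-5.4 (1,671) vs Sonnet 4.6 (1,643):
print(f"{expected_win_rate(1671, 1643):.3f}")  # ~0.540
```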
The practical advantage comes from Anthropic’s focus on instruction-following. Sonnet 4.6 reliably maintains tone, follows complex style guides, and produces clean structured output without extensive prompt engineering. It handles long-form documents with strong coherence, maintaining argument structure and factual consistency across thousands of words. For branded content, ghostwriting, editorial work, and any project where the output needs to sound like a specific human voice, Sonnet 4.6 is the model writers actually reach for. Anthropic released it on February 17, 2026, with a 1M token context window and 64K max output tokens.
GPT-5.5 is the strongest runner-up and is now the better choice for high-volume structured knowledge work: reports, summaries, business documents, technical writeups. Its 60% drop in hallucinations versus GPT-5.4 means fewer factual errors in research-heavy prose, and its native Tool Search integration makes it the best option for writers who blend research with drafting. At $5 / $30 per million tokens, it is more expensive than Sonnet 4.6 ($3 / $15), so factor that in for high-volume work.
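To make that price gap concrete, here is a back-of-the-envelope comparison. The per-token rates are the ones quoted above; the workload (500 drafts a month at 3K input / 2K output tokens each) is an illustrative assumption, not a measured figure.

```python
# Back-of-the-envelope monthly cost for a drafting workload.
# Rates are the article's published API prices (USD per 1M tokens);
# the workload itself is a made-up example for illustration.
RATES = {
    "gpt-5.5": (5.00, 30.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

drafts_per_month = 500
tokens_in, tokens_out = 3_000, 2_000  # per draft (assumed)

for model, (rate_in, rate_out) in RATES.items():
    cost = drafts_per_month * (
        tokens_in * rate_in + tokens_out * rate_out
    ) / 1_000_000
    print(f"{model}: ${cost:.2f}/month")
# gpt-5.5: $37.50/month, claude-sonnet-4.6: $19.50/month
```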
Gemini 3.1 Pro posts strong accuracy benchmarks (94.3% on GPQA Diamond, 77.1% on ARC-AGI-2) but scores below both the Claude and GPT models on the writing leaderboard, which is why it does not lead this category despite leading on factual tests. It is still worth considering for accuracy-critical writing such as scientific summaries or financial content, where factual grounding matters more than prose quality.
Writing Category Comparison Table
| Model | Writing Benchmark | Instruction Following | Price (I/O per 1M) | Best For |
|---|---|---|---|---|
| GPT-5.5 | New GDPval-AA leader, 60% fewer hallucinations vs 5.4 | Very Good | $5 / $30 | Documents, reports, knowledge work |
| Claude Sonnet 4.6 | GDPval-AA: 1,643 Elo | Excellent | $3 / $15 | Long-form, style-guide compliance |
| Gemini 3.1 Pro | GPQA Diamond: 94.3% | Good | $2 / $12 | Research-heavy, accuracy-critical |
| Claude Opus 4.7 | GDPval-AA: strong | Excellent | $5 / $25 | Complex writing with reasoning |
| GPT-5.4 | GDPval-AA: 1,671 Elo (prior leader) | Very Good | $2.50 / $15 | Budget option, still widely available |
Runner-up and alternatives
GPT-5.5 is the strongest second pick and the better choice for structured knowledge work and research-heavy drafting. Gemini 3.1 Pro is worth considering for accuracy-critical writing. Claude Opus 4.7 handles longer multi-section documents with stronger structural reasoning when budget is not a constraint.
What Changed This Month
GPT-5.5 (April 23) overtook both GPT-5.4 and Sonnet 4.6 on the GDPval-AA leaderboard and dropped hallucinations by 60%. Sonnet 4.6 still leads on style, voice fidelity, and instruction-following, but GPT-5.5 is now the better default for structured knowledge work and business documents.
Best AI for Chat / Daily Assistant
GPT-5.5 is OpenAI’s new frontier model and the strongest ChatGPT version to date. The upgrade from GPT-5.4 is substantial: 88.7% on SWE-bench, 92.4% on MMLU, and a 60% drop in hallucinations. For a daily assistant that answers random questions, reads documents, and moves across tools, fewer hallucinations matter more than another benchmark point. It is rolling out now to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex.
The model’s agentic and computer-use gains also change what a daily AI assistant can mean. GPT-5.5 is stronger than any prior OpenAI model at operating software, filling forms, moving across tools, and executing multi-step desktop workflows without step-by-step guidance. Paired with Codex, it can now finish tasks that previously required a human in the loop. Context window is 1M tokens via API, with pricing at $5 / $30 per million input/output tokens. GPT-5.5 Pro is live for Pro, Business, and Enterprise at $30 / $180 per million.
For users who do not need frontier-tier performance, GPT-5.4 remains widely available at $2.50 / $15 and is still excellent. If you are on Plus or Pro, GPT-5.5 is now your default model; on Free, you will still see the GPT-5.4 family. ChatGPT’s router flows everyday chats through GPT-5.3 Instant, GPT-5.4 Thinking, or GPT-5.4 Pro depending on plan, with GPT-5.4 mini serving as a fallback and as the Thinking default for Free and Go tiers.
Gemini 3.1 Pro is the most competitive alternative for research-heavy conversations, with native Google Search grounding that provides citation-backed answers. It is also available as a free native Mac app with Option + Space quick-chat. At $2 / $12 per million tokens, it costs less than GPT-5.5 at the API level. Grok 4.20 is the strongest option for real-time X and web data, with significantly lower per-token pricing that makes it cost-effective for developers building chatbot applications.
Chat Category Comparison Table
| Model | Chat Quality | Tool / Web Access | Computer Use | Best For |
|---|---|---|---|---|
| GPT-5.5 | Excellent | Native + Tool Search | Improved vs 5.4 | Daily tasks, automation, research |
| Gemini 3.1 Pro | Excellent | Google Search native | Limited | Research-heavy conversations |
| Grok 4.20 | Very Good | Real-time X / web | No | Current events, creative chat |
| Claude Opus 4.7 | Very Good | Limited | Agent teams | Deep analytical conversations |
| GPT-5.4 | Excellent | Native + Tool Search | Yes (OSWorld 75%) | Default tier, lower cost |
Runner-up and alternatives
Gemini 3.1 Pro is the strongest alternative for users who prioritize accuracy and research depth. Grok 4.20 is the best choice for real-time information and costs a fraction of GPT-5.5 at the API level. GPT-5.4 is still the right pick if you want ChatGPT-grade quality at lower cost.
What Changed This Month
GPT-5.5 shipped on April 23 with 60% fewer hallucinations than GPT-5.4 and is now the default ChatGPT model for Plus, Pro, Business, and Enterprise. Google shipped a native Gemini Mac app on April 15 with Option + Space and window sharing.
Best AI for Images
ChatGPT Images 2.0 is the new benchmark. OpenAI’s April 21 release (API model name gpt-image-2) finally solved the single hardest capability in AI image generation: readable text inside images. Images 2.0 renders legible typography in dense layouts like menus, signs, scientific diagrams, and infographic posters, and it handles non-Latin scripts like Japanese, Korean, Chinese, Hindi, and Bengali. It supports 2K resolution, aspect ratios from 3:1 to 1:3, and generates up to 8 coherent images from a single prompt with character and object continuity across the batch.
The thinking mode (paid subscribers only) adds web search, multi-image generation from one prompt, and self-verification. The standard version is free for all ChatGPT, Codex, and API users. For posters, slides, infographics, branded content, and any image where text is in the frame, Images 2.0 is now the clear default.
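For API users, a minimal sketch with the OpenAI Python SDK's images endpoint is below. The gpt-image-2 model name is the one reported above; the size value and n=8 batch are assumptions based on this article's description rather than confirmed parameters, so check the current API reference before relying on them.

```python
# Minimal sketch: generating a text-heavy poster with Images 2.0.
# "gpt-image-2" is the API model name reported above; the size and
# n=8 values are assumptions drawn from this article, not confirmed
# parameters.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",
    prompt="A cafe menu board poster with legible prices in English and Japanese",
    size="1024x1536",  # portrait; the article cites ratios from 3:1 to 1:3
    n=8,               # up to 8 coherent images per prompt, per the article
)
for i, image in enumerate(result.data):
    # gpt-image-* models typically return base64 data rather than URLs
    with open(f"poster_{i}.png", "wb") as f:
        f.write(base64.b64decode(image.b64_json))
```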
Gemini 3.1 Flash Image (Nano Banana 2) is still the better choice for speed, cost, and native 4K output. It is also deeply integrated across Google products (Gemini app, Search AI Mode, Google Ads, Flow, and the new Mac app), which matters if you already live in that stack. For high-volume production where cost-per-image matters, Gemini is usually the cheaper pick. Use Images 2.0 when text is in the frame; use Flash Image when speed and cost matter more.
Gamma integrates Nano Banana Pro directly inside decks for in-deck image generation, which makes it one of the fastest prompt-to-slide paths available. Canva AI 2.0 pairs with Nano Banana for design generation. Flux 2 [max] excels at photographic skin texture and fine-art aesthetics, and remains the strongest open-ecosystem option for artistic style diversity. For a deeper side-by-side on the prior leaders, see our Gemini Nano Banana Pro vs GPT-Image-1.5 ultimate comparison.
Image Generation Comparison Table
| Model | Current Leaderboard Position | Best Strength | Known Weakness | Best For |
|---|---|---|---|---|
| ChatGPT Images 2.0 (gpt-image-2) | New benchmark, Apr 21 launch | Readable text, multilingual scripts, 2K, thinking mode | Cost for thinking mode | Posters, slides, infographics, branded content |
| Gemini 3.1 Flash Image / Nano Banana 2 | Top of recent LM Arena snapshots | Speed + multilingual + 4K | Less artistic range | High-volume, multilingual |
| GPT Image 1.5 (high) | Still strong, now superseded | Text rendering + photorealism | Cost | Legacy projects in production |
| Gemini 3 Pro Image | Top-tier | Diverse style range | Slightly lower realism | Varied creative projects |
| Flux 2 [max] | Top open-ecosystem | Artistic, skin texture | Text rendering | Fine art, photography |
Runner-up and alternatives
Gemini 3.1 Flash Image is the best cost and speed pick for high-volume or multilingual work. Flux 2 stays the leader on photographic skin texture and fine-art aesthetics. In Fello AI, you get both ChatGPT Images 2.0 and Gemini 3.1 Flash Image under one $9.99/month subscription.
What Changed This Month
ChatGPT Images 2.0 launched on April 21 and took the top spot on text rendering, multilingual scripts, and infographic-style output. Microsoft released MAI-Image-2 on April 2, too new to rank but worth watching.
Best AI for Video
Video is the most fragmented category right now. Artificial Analysis’s current text-to-video leaderboards place HappyHorse-1.0 at #1, Seedance 2.0 at #2, and Kling 3.0 1080p Pro at #3. On pure preference voting, no single model is a universal winner.
Our editorial pick for cinematic production work is Veo 3.1. Google says it generates at 24fps with optional 4K output, natively produces synchronized audio (sound effects, ambient noise, and dialogue) in the same pass, and follows complex multi-element prompts. It also ships two capabilities that separate it from the field: Scene Extension for continuous narratives exceeding 60 seconds, and Ingredients to Video, which lets you upload up to three reference images to lock character face, clothing, and environment consistently across all scenes.
Veo 3.1 Lite (launched March 31) brings the family’s quality to cost-sensitive workflows at $0.05/sec (720p) and $0.08/sec (1080p), less than half the price of Veo 3.1 Fast. Combined with Veo 3.1 (balanced) and Veo 3.1 Pro (premium), Google now covers every budget tier in the video category from a single family.
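Per-second pricing makes clip costs easy to estimate. A quick sketch using the Lite rates quoted above; substitute your own tier's rates as needed.

```python
# Per-clip cost at Veo 3.1 Lite's quoted per-second rates.
LITE_RATES = {"720p": 0.05, "1080p": 0.08}  # USD per second, from the article

def clip_cost(seconds: float, resolution: str) -> float:
    """Cost of one generated clip at the given resolution."""
    return seconds * LITE_RATES[resolution]

# A 30-second social clip at each resolution:
for res in LITE_RATES:
    print(f"{res}: ${clip_cost(30, res):.2f}")
# 720p: $1.50, 1080p: $2.40
```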
OpenAI’s Sora app shuts down on April 26, 2026, with the API following September 24, 2026. If you currently use Sora 2, migrate to Veo 3.1, Kling 3.0, Seedance 2.0, or HappyHorse-1.0 before the deadline.
Kling 3.0 from Kuaishou is the best value option for high-volume production, with Multi-Shot Storyboard letting you define entire sequences with individual prompts, camera angles, and transitions. Seedance 2.0 occupies a different niche: its multi-modal input with audio reference makes it the best tool for music video production and brand content that needs to match a specific audio track.
Video Generation Comparison Table
| Model | Native Audio | Resolution | Best Strength | Best For |
|---|---|---|---|---|
| Veo 3.1 | Yes | Up to 4K / 24fps | Prompt accuracy, cinematic | Broadcast, commercial, film (editorial pick) |
| HappyHorse-1.0 | – | – | Currently #1 on Artificial Analysis text-to-video | Benchmark-leading preference voting |
| Seedance 2.0 | Yes (+ audio ref) | 1080p / 24fps | Multi-modal input, #2 on Artificial Analysis | Music video, brand content |
| Kling 3.0 1080p Pro | Yes | 1080p / 24fps | Low cost, multi-shot storyboard | Rapid prototyping, social |
| Sora 2 | Yes | 1080p / 24fps | Physics simulation | Shutting down April 26, 2026 |
Runner-up and alternatives
HappyHorse-1.0 and Seedance 2.0 lead preference voting on Artificial Analysis. Kling 3.0 is the best cost-per-clip option for social and rapid prototyping. Migrate away from Sora 2 before April 26.
What Changed This Month
Veo 3.1 Lite launched March 31 at less than half the price of Veo 3.1 Fast. The Sora app shuts down April 26, 2026, with the API following September 24.
Best AI for Coding
Coding is the most contested category right now. Claude Opus 4.7 and GPT-5.5 are neck and neck, and the right pick depends on the shape of your codebase. Anthropic reports Claude Opus 4.7 outperforms Opus 4.6 across industry benchmarks including SWE-bench Verified, SWE-bench Pro, and agentic computer use, and third-party trackers like Vellum AI and VentureBeat back this with scores in the high-80s on SWE-bench Verified and mid-60s on SWE-bench Pro. SWE-bench Pro tests real GitHub issues from open-source repositories, requiring the model to understand an existing codebase, identify relevant files, and write a correct patch.
GPT-5.5 scored 88.7% on SWE-bench at launch on April 23 and brings the strongest agentic coding performance OpenAI has shipped to date, especially inside Codex. For tasks that need computer-use, tool chains, and multi-step automation, GPT-5.5 is the new benchmark. For deep multi-file refactors where context carries across many files, Opus 4.7 still has the edge. If your workflow is agent-first, pick GPT-5.5. If your workflow is refactor-first, pick Opus 4.7.
The architecture advantage for Claude Opus 4.7 is the multi-agent system plus the task budgets feature in public beta. Through Claude Code, Opus 4.7 can spawn and coordinate parallel sub-agents, delegating different parts of a codebase to independent processes and recombining results, and now controls token spend per agentic loop before it starts. On large refactors or feature additions spanning multiple files and modules, this combination handles work that single-context models struggle with. Anthropic also says it specifically trained Opus 4.7 to reduce logic hallucinations, the class of error where code is syntactically valid but logically incorrect.
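Claude Code manages sub-agents and task budgets through its own tooling, but the fan-out pattern itself is easy to sketch with the plain Anthropic Messages API. Everything below is a conceptual illustration, not Claude Code's actual interface: the model ID, the per-agent max_tokens cap standing in for task budgets, and the file-scoped subtasks are all assumptions.

```python
# Conceptual sketch of the fan-out/recombine pattern described above,
# built on the plain Anthropic Messages API. This is NOT Claude Code's
# actual sub-agent interface; the model ID and the use of max_tokens
# as a stand-in for per-loop task budgets are assumptions.
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

SUBTASKS = [  # hypothetical file-scoped slices of one refactor
    "Refactor auth/session.py to the new token interface.",
    "Update api/routes.py call sites to match.",
    "Revise tests/test_auth.py for the changed signatures.",
]

def run_subagent(task: str) -> str:
    """One sub-agent: an independent context working a single slice."""
    response = client.messages.create(
        model="claude-opus-4-7",  # hypothetical model ID
        max_tokens=4_000,         # crude stand-in for a task budget
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

# Fan out: each subtask runs in its own context, in parallel.
with ThreadPoolExecutor(max_workers=len(SUBTASKS)) as pool:
    partials = list(pool.map(run_subagent, SUBTASKS))

# Recombine: a final call synthesizes the partial results.
summary = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4_000,
    messages=[{
        "role": "user",
        "content": "Merge these patches into one coherent change:\n\n"
                   + "\n---\n".join(partials),
    }],
)
print(summary.content[0].text)
```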
Gemini 3.1 Pro is still the best cost-effective alternative, especially for very large codebases where its 1M token context and $2 / $12 per million tokens matter more than the absolute top score. Claude Sonnet 4.6 (around 79.6% SWE-bench Verified) is the right choice for daily coding assistance at $3 / $15. On the open-weight side, DeepSeek V4 Pro (shipped April 24) is now the strongest open-weight coding model with 80%+ SWE-bench Verified and 90% HumanEval at $1.74 / $3.48 per 1M tokens, with Alibaba’s Qwen 3.6-Plus (April 2) a strong second.
Coding Comparison Table
| Model | Coding Strength | Context | Price (I/O per 1M) | Best For |
|---|---|---|---|---|
| GPT-5.5 | 88.7% SWE-bench, strongest agentic coding + Codex | 1M | $5 / $30 | Agentic coding, Codex workflows, computer use |
| Claude Opus 4.7 | Anthropic-reported leader on SWE-bench Verified and Pro | 1M | $5 / $25 | Complex multi-file refactors, long context |
| Gemini 3.1 Pro | 80.6% SWE-bench Verified | 1M | $2 / $12 | Long-context, cost-sensitive, Google Cloud work |
| Claude Sonnet 4.6 | 79.6% SWE-bench Verified | 1M | $3 / $15 | Daily coding, near-Opus quality |
| GPT-5.4 | Strong computer-use and IDE automation | 1.05M | $2.50 / $15 | Rapid prototyping on a budget |
| Claude Opus 4.6 | 80.8% SWE-bench Verified (prior leader) | 1M | $5 / $25 | Legacy workflows |
| DeepSeek V4 Pro | 80%+ SWE-bench, 90% HumanEval | 1M | $1.74 / $3.48 | Open-weight, cost-sensitive coding |
Runner-up and alternatives
For cost-sensitive or large-context work, Gemini 3.1 Pro is the right call at $2 / $12. Claude Sonnet 4.6 at $3 / $15 is the best quality-to-cost option for daily coding. DeepSeek V4 Pro (April 24) is now the strongest open-weight challenger at 80%+ SWE-bench and $1.74 / $3.48 per 1M tokens.
What Changed This Month
DeepSeek V4 Pro shipped on April 24 and is now the strongest open-weight coding model with 80%+ SWE-bench Verified and 90% HumanEval. GPT-5.5 shipped on April 23 with 88.7% SWE-bench, making proprietary coding a two-horse race with Claude Opus 4.7. Anthropic shipped Claude Opus 4.7 on April 16 with material gains on SWE-bench Verified and Pro over Opus 4.6. Qwen 3.6-Plus (April 2) claimed parity with Opus 4.5 on SWE-bench.
Best AI for Creativity
Creativity is the hardest category to measure objectively. There is no authoritative benchmark equivalent to SWE-bench or GPQA Diamond. What we can say with evidence: Grok 4.20 currently sits in the LM Arena top-10 text leaderboard (around 1,485 Elo at the most recent snapshot), and human raters consistently prefer its outputs in open-ended conversation, the domain most relevant to creative collaboration. Grok 4.20 is currently in beta and available only to SuperGrok (~$30/month) and X Premium+ (~$40/month) subscribers.
Grok 4.20’s four-agent architecture is the key differentiator. Four specialized sub-agents (Grok, Harper, Benjamin, and Lucas) deliberate in parallel, fact-check each other, and reach consensus before responding. Grok orchestrates, Harper handles research, Benjamin does logic and math, and Lucas provides contrarian analysis. This process pushes outputs away from the statistically safest, most expected answer. The results are less predictable than other frontier models, which is either an advantage or a drawback depending on your creative workflow. For brainstorming, concept generation, and ideation under uncertainty, that divergence from the expected is exactly what you want.
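xAI has not published the deliberation mechanism, so the sketch below is purely illustrative: a draft-then-consensus loop over four role prompts, run through xAI's OpenAI-compatible endpoint. The role descriptions come from this article; the model ID and control flow are assumptions.

```python
# Illustrative only: a draft -> consensus loop over the four roles the
# article describes. xAI's real deliberation mechanism is internal;
# the model ID and this control flow are assumptions. xAI's API is
# OpenAI-compatible, hence the base_url override.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1",
                api_key=os.environ["XAI_API_KEY"])

ROLES = {
    "Grok": "You orchestrate and integrate the other agents' views.",
    "Harper": "You research: surface relevant facts and sources.",
    "Benjamin": "You handle logic and math: check every inference.",
    "Lucas": "You are the contrarian: attack the emerging consensus.",
}

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="grok-4-20",  # hypothetical model ID
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

question = "Pitch three campaign concepts for a product launch next week."
drafts = {name: ask(role, question) for name, role in ROLES.items()}

# Consensus pass: the orchestrator reconciles all four drafts.
final = ask(ROLES["Grok"],
            f"Question: {question}\n\nAgent drafts:\n"
            + "\n\n".join(f"{n}:\n{d}" for n, d in drafts.items()))
print(final)
```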
Real-time data access through X and the broader web gives Grok 4.20 a further creative edge. It can incorporate current cultural references, trending formats, and breaking news into its outputs in a way that models without live data access cannot. For content creators working on topical or trend-driven material, this gives Grok 4.20 relevance that Claude and Gemini cannot match without supplementary search tools. If you are weighing Grok against ChatGPT for daily use, our Grok vs ChatGPT comparison breaks down where each wins.
This is the most subjective category we rank. If you need tight style constraints rather than open-ended divergence, Claude Sonnet 4.6 is the better fit. Its instruction-following precision means it will stay inside defined creative parameters far more reliably than Grok 4.20. GPT-5.5, with its Tool Search integration and 60% hallucination drop, is the best option for creative projects that blend research with ideation, such as long-form journalism or strategy documents. For visual creative work, Claude Design (new on April 17) is now one of the fastest paths from concept to finished deck, and Gamma with Nano Banana Pro is still the fastest prompt-to-slide path when you want in-deck image generation.
Creativity Comparison Table
| Model | Creative Approach | Real-time Data | Arena Elo (recent) | Best For |
|---|---|---|---|---|
| Grok 4.20 Beta1 | Multi-agent deliberation | Yes (X + web) | ~1,485 | Topical, brainstorming |
| Claude Sonnet 4.6 | Deep instruction following | No | Top-tier | Structured creative writing |
| GPT-5.5 | Versatile, tool-enabled, 60% fewer hallucinations | Yes (Tool Search) | New; not yet ranked | Creative + research combined |
| Gemini 3.1 Pro Preview | Technically rigorous | Yes (Google) | ~1,493 | Science writing, journalism |
Grok 4.20 is currently in beta. Elo values are snapshot readings from LM Arena and shift weekly.
Runner-up and alternatives
Claude Sonnet 4.6 for structured creative writing with tight style constraints. GPT-5.5 for creative work that blends research and ideation. Gemini 3.1 Pro for science writing and journalism with factual rigor.
What Changed This Month
GPT-5.5 (April 23) joined the creativity conversation with versatile tool use and fewer hallucinations. Claude Design (April 17) is the first Anthropic Labs product for visual creative work, powered by Opus 4.7.
Best AI for Accuracy
Gemini 3.1 Pro is the most factually reliable LLM based on directly reported benchmarks. Google cites 94.3% on GPQA Diamond and 77.1% on ARC-AGI-2, along with 80.6% on SWE-bench Verified. The ARC-AGI-2 score represents a large generational jump over its predecessor.
GPT-5.5 Pro is the strongest new challenger. OpenAI reports near parity with the frontier on GPQA Diamond and a 60% drop in hallucinations versus GPT-5.4. For knowledge-work accuracy and research tasks, GPT-5.5 Pro is now the best fit inside the OpenAI ecosystem. Claude Opus 4.7 is also within noise of Gemini 3.1 Pro and GPT-5.5 Pro on GPQA Diamond, and posts the strongest SWE-bench numbers when “accuracy” includes correct engineering output, not just factual recall.
Native Google Search grounding remains Gemini 3.1 Pro’s operational advantage. For use cases where correctness matters most (medical queries, legal summaries, scientific research, financial analysis), Gemini 3.1 Pro automatically grounds its answers against current search results when needed. This makes factual errors from knowledge cutoffs far less common than in models without live search integration.
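For developers, Search grounding is a tool you attach to the request in the Gemini API. A minimal sketch with the google-genai SDK follows; the Tool/GoogleSearch config is the SDK's real grounding mechanism, while the model ID is this article's name for the model and is an assumption.

```python
# Minimal sketch: Google Search grounding via the google-genai SDK.
# The Tool/GoogleSearch config is the SDK's actual grounding mechanism;
# the model ID below is an assumption based on this article's naming.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro",  # hypothetical API string
    contents="Summarize this week's FDA approvals for GLP-1 drugs.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)  # grounded answer; citations live in grounding metadata
```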
For research, analysis, and any task where a factual error has real consequences, Gemini 3.1 Pro remains the safest default. GPT-5.5 Pro is the new second choice thanks to its hallucination drop and tool-enabled research capabilities.
Accuracy Comparison Table
| Model | GPQA Diamond | ARC-AGI-2 | Coding benchmark | Arena Elo (recent) | Best For |
|---|---|---|---|---|---|
| Gemini 3.1 Pro Preview | 94.3% | 77.1% | SWE-bench Verified 80.6% | ~1,493 | Research, science, factual |
| GPT-5.5 Pro | Near parity with frontier | Competitive | SWE-bench 88.7% | New; not yet ranked | Knowledge-work accuracy, 60% fewer hallucinations |
| Claude Opus 4.7 | Parity with frontier (per Anthropic) | – | Anthropic-reported lead on SWE-bench | New; not yet ranked | Logic, coding accuracy |
| Grok 4.20 Beta1 | Competitive | Strong | – | ~1,485 | Forecasting, real-time |
Runner-up and alternatives
GPT-5.5 Pro and Claude Opus 4.7 are both within noise of Gemini 3.1 Pro on GPQA Diamond. Pick based on ecosystem: Gemini for Google Workspace, GPT-5.5 Pro for ChatGPT/Codex, Opus 4.7 for engineering accuracy.
What Changed This Month
GPT-5.5 Pro (April 23) joined the accuracy top tier with a 60% hallucination drop over GPT-5.4. Claude Opus 4.7 (April 16) closed the gap on GPQA Diamond. Claude Opus 4.6 Thinking currently holds the top LM Arena text slot around 1,502 Elo, pending vote accumulation on Opus 4.7.
Best AI for Problem Solving
GPT-5.5 Pro (April 23) took a big swing at this category. It scored 39.6% on FrontierMath Tier 4, nearly double Claude Opus 4.7’s 22.9%. For hard math, physics, and reasoning chains that need extended thinking inside the OpenAI ecosystem, GPT-5.5 Pro is the one to try first.
Claude Opus 4.7 Thinking extends Anthropic’s chain-of-thought mode onto the Opus 4.7 base. Claude Opus 4.6 Thinking still holds the top LM Arena text slot at around 1,502 Elo while the 4.7 release collects enough votes to re-rank, and on Anthropic-reported benchmarks Opus 4.7 already leads its predecessor on multi-step reasoning and engineering tasks. The core capability is explicit step-by-step reasoning: the model surfaces its assumptions, considers alternative paths, and shows the working before committing to an answer. Paired with task budgets in public beta, Opus 4.7 Thinking can now plan the size of its own reasoning envelope before it starts.
The agent team architecture is the decisive advantage for complex problem-solving. Opus 4.7 can decompose a hard problem, assign subtasks to parallel sub-agents via Claude Code, and synthesize results into a coherent solution. This is not a token-level reasoning improvement but a structural one: the model breaks a problem into independently solvable components and recombines them. For problems with no single correct answer, the thinking mode surfaces assumptions and explores alternatives before converging, reducing the risk of confidently wrong outputs.
Gemini 3.1 Pro’s Deep Think mode is the strongest alternative for scientific and mathematical problems. It leads on GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). For hypothesis testing, research design, and problems with verifiable ground truth, Gemini 3.1 Pro Deep Think rivals Claude Opus 4.7 Thinking. The choice between them often comes down to domain: Opus 4.7 Thinking is stronger on multi-step logic and engineering problems, while Gemini 3.1 Pro Deep Think is stronger on scientific and empirical reasoning.
Grok 4.20 offers a structurally different approach: its four-agent deliberation is always active, not a separately enabled mode. The four sub-agents fact-check each other in parallel before responding, producing a consensus answer rather than a single chain of thought. For forecasting, multi-perspective analysis, and scenarios where contrarian views improve the output, Grok 4.20’s architecture provides a meaningful alternative to the Claude and Gemini extended-thinking approaches.
Problem Solving Comparison Table
| Model | Extended Reasoning | Multi-agent | Arena Elo (recent) | Best For |
|---|---|---|---|---|
| GPT-5.5 Pro | Yes (thinking mode) | Via Codex | New; not yet ranked | Hard math: 39.6% FrontierMath Tier 4 (near double Opus 4.7’s 22.9%) |
| Claude Opus 4.7 Thinking | Yes (chain-of-thought + budgets) | Yes (Claude Code) | New; not yet ranked | Complex reasoning, agentic work |
| Claude Opus 4.6 Thinking | Yes | Yes | ~1,502 (current top text) | Current Arena leader |
| Gemini 3.1 Pro Deep Think (Preview) | Yes | Limited | ~1,493 | Scientific problems, research |
| Grok 4.20 Beta1 | Yes (4-agent) | Built-in | ~1,485 | Forecasting, multi-perspective |
Runner-up and alternatives
GPT-5.5 Pro for hard math and FrontierMath-style problems. Gemini 3.1 Pro Deep Think for scientific reasoning. Grok 4.20 for multi-perspective analysis and forecasting.
What Changed This Month
GPT-5.5 Pro (April 23) scored 39.6% on FrontierMath Tier 4, nearly double Opus 4.7’s 22.9%. Claude Opus 4.7 Thinking (April 16) introduced task budgets, a new primitive for controlling agentic token spend.