Claude Opus 4.8 article cover with a glowing robot, orange AI dashboard, and “#1 on the Intelligence Index” badge.

Everything You Need to Know About Anthropic’s Claude Opus 4.8

On May 28, 2026, Anthropic released Claude Opus 4.8, taking the #1 spot on the Artificial Analysis Intelligence Index with a score of 61.4, dethroning GPT-5.5 for the first time since OpenAI’s April launch. The update delivers a 4.9-point jump on SWE-bench Pro (69.2% vs. 64.3%), four times fewer unflagged code flaws, and alignment scores that match Claude Mythos Preview. Pricing stays the same as Opus 4.7 at $5/$25 per million tokens, while fast mode drops to one-third of its previous cost.

Alongside the model, Anthropic shipped three platform features: Dynamic Workflows for running hundreds of parallel subagents in Claude Code, effort control across all plans in claude.ai, and a Messages API enhancement that lets developers inject system directives mid-conversation without breaking prompt cache.

Opus 4.8 is available now through the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

What Changed in Opus 4.8

Opus 4.8 is not a new model tier. It replaces Opus 4.7 as the top Claude model available to the general public. Anthropic positions it as a stronger collaborator, better at catching its own mistakes, more consistent across long-running projects, and meaningfully more honest about what it does and doesn’t know.

The biggest gains are in agentic coding and knowledge work. SWE-bench Pro climbs from 64.3% to 69.2%. GDPval-AA, the benchmark that measures economically valuable real-world work, rises from 1,753 to 1,890 Elo, implying roughly a 67% win rate against GPT-5.5 in head-to-head comparisons. Terminal-Bench 2.1 improves by 8.5 points to 74.6%, though GPT-5.5 still leads that particular benchmark at 78.2%.

The most interesting shift is in honesty and self-correction. Opus 4.8 is the first Claude model to score 0% on uncritically reporting flawed results, and shows a more than ten-fold reduction in overconfidence versus Opus 4.7. Anthropic’s alignment team writes that it “reaches new highs on our measures of prosocial traits,” with misaligned behavior rates substantially lower than Opus 4.7 and comparable to the restricted Claude Mythos Preview.

Here’s a quick summary of what’s new:

  • SWE-bench Pro: 64.3% → 69.2% (+4.9 points)
  • GDPval-AA: 1,753 → 1,890 Elo (+137 Elo)
  • Terminal-Bench 2.1: 66.1% → 74.6% (+8.5 points)
  • HLE (with tools): 54.7% → 57.9% (+3.2 points)
  • OSWorld-Verified: 83.4%, leading all competitors
  • BrowseComp: 79.3% → 84.3% (+5.0 points)
  • Code flaw detection: 4× fewer unflagged flaws
  • Alignment: Near Mythos Preview levels
  • Fast mode: 3× cheaper than previous generation

Benchmark Results

Coding

BenchmarkOpus 4.8Opus 4.7GPT-5.5Gemini 3.1 Pro
SWE-bench Verified88.6%87.6%N/AN/A
SWE-bench Pro69.2%64.3%58.6%54.2%
SWE-bench Multilingual84.4%N/AN/AN/A
Terminal-Bench 2.174.6%66.1%78.2%70.3%
CursorBenchLeads all effort levelsBaselineN/AN/A

The headline number is SWE-bench Pro at 69.2%. This is the harder variant that tests real-world pull request resolution across complex codebases, and Opus 4.8 now leads every model in the comparison by a wide margin, 10.6 points over GPT-5.5 and 15 points over Gemini 3.1 Pro.

Terminal-Bench 2.1 is the one coding benchmark where GPT-5.5 still wins (78.2% vs. 74.6%). This measures multi-tool command-line workflows that require planning, iteration, and error recovery. The gap narrowed from 12.1 points on Opus 4.7 to 3.6 points on Opus 4.8, but it’s still there.

Cursor’s co-founder Michael Truell noted that Opus 4.8 “exceeds prior Opus on CursorBench across all effort levels” with “more efficient tool calling, fewer steps, and better follow-through on longer tasks.”

Agentic Tasks

BenchmarkOpus 4.8Opus 4.7GPT-5.5Gemini 3.1 Pro
OSWorld-Verified83.4%82.8%78.7%76.2%
Online-Mind2Web84.0%N/AN/AN/A
MCP-Atlas82.2%77.3%N/AN/A
Super-Agent BenchmarkCompletes all casesN/AIncompleteN/A
BrowseComp (single-agent)84.3%79.3%84.4%85.9%
BrowseComp (multi-agent)88.5%N/AN/AN/A

Agentic performance is where the clearest gains land. OSWorld-Verified (the benchmark for driving a virtual machine, clicking through UIs, and completing mixed software tasks) hits 83.4%, ahead of all competitors. GenSpark’s co-founder Kay Zhu calls Opus 4.8 “the only model completing every Super-Agent case end-to-end, beating prior Opus and GPT-5.5 at cost parity.”

BrowserBase’s tech lead Miguel Gonzalez reports 84% on Online-Mind2Web, calling it “the strongest computer-use and browser-agent model we’ve tested” with “a meaningful jump over 4.7 and GPT-5.5.”

Reasoning and Knowledge

BenchmarkOpus 4.8Opus 4.7GPT-5.5Gemini 3.1 Pro
GPQA Diamond93.6%94.2%93.6%N/A
HLE (no tools)49.8%46.9%41.4%N/A
HLE (with tools)57.9%54.7%52.2%N/A
USAMO 202696.7%N/AN/AN/A
GDPval-AA1,890 Elo1,753 Elo1,769 EloN/A

GDPval-AA is the standout. At 1,890 Elo, Opus 4.8 leads GPT-5.5 by 121 points and Opus 4.7 by 137 points. Artificial Analysis notes that Opus 4.8 achieves this while using 15% fewer turns and 35% fewer output tokens than its predecessor, though it still takes roughly 30% more turns than GPT-5.5 to complete tasks.

GPQA Diamond dipped slightly from 94.2% to 93.6%, a negligible move. Humanity’s Last Exam improved in both configurations: 49.8% without tools (up from 46.9%) and 57.9% with tools (up from 54.7%), leading every competitor on both.

Knowledge Work

BenchmarkOpus 4.8Opus 4.7GPT-5.5Gemini 3.5 Flash
Finance Agent v253.9%N/AN/A57.9%
Legal Agent BenchmarkHighest recordedN/AN/AN/A

Harvey AI’s head of applied research Niko Grupen reports that Opus 4.8 posted “the highest Legal Agent Benchmark score” and was “the first model to break 10% on the all-pass standard,” noting that “the accuracy lift translates directly to more attorney work that can be safely delegated.”

Bridgewater Associates’ Michael Ran describes “consistently higher analysis quality than prior Opus” with “faster, richer outputs, better signal-to-noise ratio” and the model “proactively flagging input and output issues.”

Enterprise Feedback from Box AI

Aaron Levie, CEO of Box, shared detailed benchmarking results from testing Opus 4.8 with Box AI agents on complex enterprise document workflows:

  • Report drafting: Opus 4.8 outperforms on a majority of tasks. On industrial goods reporting, it scored 87% vs. 77% for Opus 4.7. On consumer product launch evaluation, 90% vs. 84%.
  • Legal NDA review: Near-perfect consistency across all trials, catching more relevant clauses and flagging more potential issues than Opus 4.7.
  • Financial data analysis: On corporate lending analysis comparing syndicated vs. bilateral loan structures, Opus 4.8 leads by nearly 8 percentage points in accurate metric extraction.
  • Public sector grant analysis: Correctly extracted and validated nearly all required eligibility data points, catching specific details that Opus 4.7 “overlooked or misinterpreted.”

Opus 4.8 is rolling out to Box customers for deployment in Box AI agents.

Artificial Analysis Intelligence Index

Opus 4.8 takes the #1 spot on the Artificial Analysis Intelligence Index with a score of 61.4 out of 149 models tracked. This is a +4.1 point jump from Opus 4.7 (57.3) and +1.2 points ahead of GPT-5.5 at max effort (60.2), which held the top spot since April.

ModelIntelligence IndexInput $/MOutput $/MContext
Claude Opus 4.861.4$5.00$25.001M
GPT-5.560.2$5.00$30.00922K
Claude Opus 4.757.3$5.00$25.001M
Qwen3.7-Max56.6$2.50$7.501M
Gemini 3.5 Flash55.3$1.50$9.001M
DeepSeek V4 Pro52.0$1.74$3.481M

The index aggregates ten evaluations: GDPval-AA, Terminal-Bench Hard, τ²-Bench Telecom, AA-LCR, AA-Omniscience, Humanity’s Last Exam, GPQA Diamond, SciCode, IFBench, and CritPt.

Specific point improvements from Opus 4.7: Terminal-Bench Hard gained 6.8 points, τ²-Bench Telecom gained 5.9 points, and IFBench gained 3.6 points. Performance on AA-LCR, GPQA, and SciCode was roughly flat.

One caveat from Artificial Analysis: Opus 4.8 is “among the leading models in intelligence, but particularly expensive when comparing to other models of similar price” and “slower than average and very verbose,” producing approximately 110 million tokens during the full Intelligence Index evaluation versus a 35 million token average. The total evaluation cost was $4,685.85.

Honesty and Alignment

This is arguably the most significant change in Opus 4.8, even if it doesn’t have a single headline benchmark number.

Opus 4.8 is four times less likely than Opus 4.7 to let code flaws pass without flagging them. It produces 17 times fewer dishonest agentic code summaries compared to Claude Sonnet 4.6. It’s the first Claude model to score 0% on uncritically reporting flawed results. Overconfidence (claiming certainty about something it shouldn’t be certain about) dropped by more than 10× versus Opus 4.7.

Anthropic’s alignment team assessed that misaligned behavior rates, including deception and cooperation with misuse, are “substantially lower than Opus 4.7” and now match Claude Mythos Preview, the restricted model that was previously the alignment benchmark. Pre-deployment safety testing details are published in the full system card.

For everyday users, this translates to a model that is more willing to say “I don’t know” and less likely to confidently generate wrong answers. For developers building autonomous agents, it means a lower risk of the model quietly producing bad code or misleading status reports in unattended workflows.

Cognition (Devin) CEO Scott Wu confirmed this in practice: “Opus 4.8 uses tools cleanly and follows instructions with consistency for autonomous engineering. It fixes the comment-verbosity and tool-calling issues we saw in 4.7.”

Dynamic Workflows

The biggest platform launch alongside Opus 4.8 is Dynamic Workflows, now available as a research preview in Claude Code for Enterprise, Team, and Max plan users.

Dynamic Workflows lets Claude orchestrate hundreds of parallel subagents within a single Claude Code session. The system plans the work, distributes it across subagents, verifies outputs, and reports results, all without manual orchestration.

The practical use case Anthropic highlights is codebase-scale migrations spanning hundreds of thousands of lines of code. Claude Code with Opus 4.8 can carry out a migration from kickoff to merge, using the existing test suite as its quality bar.

This is a meaningful step beyond basic multi-file editing. Instead of working through files sequentially, Claude can spin up separate agents for independent changes, run them in parallel, and coordinate the results. Think of it as the difference between a single developer working through a backlog and a team lead distributing tickets.

Dynamic Workflows is also available through the Claude API, Bedrock, Vertex AI, and Microsoft Foundry.

Effort Control

Anthropic introduced a new control that lets users choose how much computational effort Claude applies to a response. This is available across all plans in claude.ai and Cowork.

The settings work as follows:

  • Low effort: Faster responses, lower rate-limit consumption. Best for simple questions.
  • High effort (default): Balanced quality and speed. Uses approximately the same number of tokens as Opus 4.7 while delivering better results.
  • Extra / xhigh: More frequent and deeper thinking. Recommended for difficult tasks and long-running async workflows.
  • Maximum: Highest token consumption. Best performance at the highest cost.

Rate limits in Claude Code have been expanded to accommodate higher effort levels. Anthropic recommends the default “high” setting for most work, noting that it already outperforms Opus 4.7’s default output at similar token budgets.

Messages API Enhancement

Developers can now insert system entries directly inside the messages array during a conversation. This means you can update Claude’s instructions (permissions, token budgets, environment context) mid-task without breaking prompt cache and without routing the update through a user turn.

This matters most for agent builders who need to adjust Claude’s behavior as a task progresses without resetting the cached context. It’s a technical change, but for multi-step workflows it removes a real friction point.

Pricing and Availability

Opus 4.8 ships at the same price as Opus 4.7:

TierInputOutput
Standard$5.00 / 1M tokens$25.00 / 1M tokens
Fast mode (2.5× speed)$10.00 / 1M tokens$50.00 / 1M tokens
Prompt caching (cache hit)$0.50 / 1M tokensN/A

Fast mode is now 3× cheaper than it was on previous models. At 2.5× the speed, this makes it a practical option for latency-sensitive applications where it was previously cost-prohibitive.

Prompt caching at $0.50 per million input tokens is a 90% discount on repeated context, which compounds well for agent workflows that repeatedly reference the same documents or instructions.

Context window: 1 million tokens (~1,500 A4 pages)
Max output: 128K tokens
Model ID: claude-opus-4-8
Input modalities: Text and image
Output modalities: Text only

Platforms: Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry

MetricOpus 4.8GPT-5.5Qwen3.7-MaxGemini 3.5 Flash
Input price$5.00$5.00$2.50$1.50
Output price$25.00$30.00$7.50$9.00
Context window1M922K1M1M
Intelligence Index61.460.256.655.3

At the frontier tier, Opus 4.8 is $5 cheaper per million output tokens than GPT-5.5 ($25 vs. $30) while scoring higher on the Intelligence Index. The cost-conscious alternative is Qwen3.7-Max at $2.50/$7.50, which trades 4.8 Intelligence Index points for a roughly 70% price reduction.

How It Compares

Against Opus 4.7: Straightforward upgrade across the board. Coding benchmarks improve by 5 to 8 points. Knowledge work jumps 137 Elo. Alignment improves dramatically. Same price. There’s no reason to stay on 4.7.

Against GPT-5.5: Split results that lean Opus 4.8. Claude leads cleanly on SWE-bench Pro (+10.6 points), GDPval-AA (+121 Elo), OSWorld-Verified (+4.7 points), and Humanity’s Last Exam. GPT-5.5 holds Terminal-Bench 2.1 (+3.6 points) and FrontierMath Tier 4. On the Intelligence Index, Opus 4.8 edges ahead by 1.2 points. GPT-5.5 is $5 more expensive per million output tokens.

Against Gemini 3.5 Flash: Different weight class. Gemini 3.5 Flash (55.3 Intelligence Index) costs roughly 70% less and runs approximately 4× faster. For budget-sensitive and latency-sensitive work, Flash is still the better value. Opus 4.8 is the pick when raw intelligence and agentic reliability matter more than per-token cost.

Against Claude Mythos Preview: Mythos remains the higher-intelligence tier, but it’s only available to select organizations for cybersecurity applications. Anthropic says the general release is “expected in coming weeks” pending stronger cyber safeguards. Until then, Opus 4.8 is the best Claude model most people can actually use.

Where It Sits in the Claude Lineup

ModelIntelligence IndexBest ForPricing
Claude Mythos PreviewN/ACybersecurity (restricted)Limited access
Claude Opus 4.861.4Frontier coding, agents, knowledge work$5 / $25
Claude Sonnet 4.6N/AEveryday tasks, fast responsesLower tier
Claude Haiku 4.5N/AHigh-throughput, cost-sensitiveBudget tier

Opus 4.8 replaces Opus 4.7 as the default frontier model. The upgrade path is clear: Haiku for volume, Sonnet for everyday work, Opus for hard problems and autonomous agents, and eventually Mythos for the most demanding applications once it clears safety review.

Early Tester Highlights

The breadth of early tester feedback is notable. Here’s a cross-section:

OrganizationTesterKey Quote
ShopifyTom Pritchard, Staff Engineer“Better judgment. Asks right questions, catches mistakes, pushes back on unsound plans.”
CursorMichael Truell, Co-Founder/CEO“Exceeds prior Opus on CursorBench across all effort levels. Fewer steps, better follow-through.”
Cognition (Devin)Scott Wu, CEO“Uses tools cleanly, follows instructions with consistency. Fixes comment-verbosity issues from 4.7.”
Harvey AINiko Grupen, Head of Applied Research“Highest Legal Agent Benchmark score. First to break 10% on all-pass standard.”
BrowserBaseMiguel Gonzalez, Tech Lead“Strongest computer-use model tested. 84% on Online-Mind2Web, a meaningful jump over 4.7 and GPT-5.5.”
DatabricksHanlin Tang, CTO Neural Networks“Step change in agentic reasoning. Multimodal PDF/diagram reasoning at 61% cheaper tokens than 4.7.”
Thomson ReutersJoel Hron, CTO“Meaningful improvements in consistency and reasoning for high-stakes professional workflows.”
BridgewaterMichael Ran, Sr. Investment Associate“Consistently higher analysis quality. Proactively flags input/output issues.”
HebbiaAabhas Sharma, CTO“Better citation precision and token efficiency on retrieval for dense financial filings.”
BoxAaron Levie, CEO“Measurably better at generative and analytical work enterprises care about most.”
EveryKatie Parrott, Staff Writer“Major quality-of-life update: faster, easier collaboration, better style retention across sessions.”

The pattern across these quotes is consistent: better judgment, fewer wasted steps, stronger follow-through. Several testers specifically call out the improvements in reliability and self-correction. Shopify’s note about the model “pushing back on unsound plans” aligns directly with the alignment gains Anthropic measured.

What This Means

For Developers

Opus 4.8 is a drop-in replacement. Use model ID claude-opus-4-8 in the API. The Messages API system-entry enhancement is worth adopting if you’re building multi-step agents, since it removes the need to restructure prompts when updating instructions mid-task.

Dynamic Workflows in Claude Code is worth testing if you’re dealing with large-scale migrations or refactors. The parallel subagent approach should compress wall-clock time significantly for tasks that would otherwise run sequentially.

For Claude Code Users

Experiment with the new effort settings. The default “high” already outperforms Opus 4.7’s default. Try “extra” or “max” for complex debugging sessions or multi-file refactors. If you’re on Enterprise, Team, or Max, try Dynamic Workflows for codebase-wide changes.

For Everyday AI Users

The quality-of-life improvements matter more than the benchmark deltas. Opus 4.8 is less likely to confidently make things up, more likely to flag when it’s unsure, and better at maintaining context across long conversations. For creative and analytical work, the “faster, richer outputs” that early testers describe translate to fewer back-and-forth cycles to get a useful result.

If you want to compare Opus 4.8 against GPT-5.5, Gemini 3.5, DeepSeek V4, and other frontier models from a single app, Fello AI bundles all top AI models into one native Mac, iPhone, and iPad app for $9.99/month. It’s around 30 to 50 MB, has over 25,000 five-star reviews, and consistently ranks among the best AI apps for iPhone. Opus 4.8 will be available in Fello AI in the coming days alongside all existing models.

What’s Next

Opus 4.8 is Anthropic’s fifth Opus release in seven months. The pace signals that incremental, frequent updates, rather than monolithic launches, are the strategy. The company has already indicated two upcoming moves: a lower-cost model matching Opus capabilities (likely a Sonnet-tier update) and a higher-intelligence tier exceeding current Opus (Claude Mythos, pending safety clearance).

Claude Mythos Preview remains available to select organizations for cybersecurity work, with general availability expected “in coming weeks.” Anthropic says safeguard development is making “swift progress.”

For now, Opus 4.8 is the best generally available Claude model and, by the numbers, the most intelligent model most people can access today.

Share Now!

Facebook
X
LinkedIn
Threads
E-Mail

Erhalten Sie exklusive AI-Tipps in Ihrem Posteingang!

Bleiben Sie mit den Erkenntnissen von KI-Experten, auf die sich die besten Technikexperten verlassen, immer einen Schritt voraus!