Everything You Need to Know About Anthropic’s Claude Opus 4.8

On May 28, 2026, Anthropic released Claude Opus 4.8, taking the #1 spot on the Artificial Analysis Intelligence Index with a score of 61.4, dethroning GPT-5.5 for the first time since OpenAI’s April launch. The update delivers a 4.9-point jump on SWE-bench Pro (69.2% vs. 64.3%), four times fewer unflagged code flaws, and alignment scores that match Claude Mythos Preview. Pricing stays the same as Opus 4.7 at $5/$25 per million tokens, while fast mode drops to one-third of its previous cost. Anthropic later released that Mythos-class model publicly as Claude Fable 5, priced at twice Opus 4.8 and falling back to Opus 4.8 on high-risk requests.

Opus 4.8 has since been superseded. Claude Opus 5 shipped on July 24, 2026 at the same $5 / $25 per million tokens, and Anthropic now lists Opus 4.8 under legacy models with a note to consider migrating. Everything below is the launch-day record of what Opus 4.8 delivered in May 2026, with the benchmarks left attributed to the model that earned them.

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors.

Available today at the same price. pic.twitter.com/EufxL7T1kb
— Claude (@claudeai) May 28, 2026

Alongside the model, Anthropic shipped three platform features: Dynamic Workflows for running hundreds of parallel subagents in Claude Code, effort control across all plans in claude.ai, and a Messages API enhancement that lets developers inject system directives mid-conversation without breaking prompt cache.

Opus 4.8 is available now through the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

Inhaltsübersicht hide

What Changed in Opus 4.8

Benchmark Results

Coding

Agentic Tasks

Reasoning and Knowledge

Knowledge Work

Enterprise Feedback from Box AI

Artificial Analysis Intelligence Index

Honesty and Alignment

Dynamic Workflows

Effort Control

Messages API Enhancement

Pricing and Availability

How It Compares

Where It Sits in the Claude Lineup

Early Tester Highlights

What This Means

For Developers

For Claude Code Users

For Everyday AI Users

What’s Next

What Changed in Opus 4.8

Opus 4.8 was not a new model tier. It replaced Opus 4.7 as the top Claude model available to the general public, a position it held until Claude Opus 5 arrived on July 24, 2026. Anthropic positions it as a stronger collaborator, better at catching its own mistakes, more consistent across long-running projects, and meaningfully more honest about what it does and doesn’t know.

The biggest gains are in agentic coding and knowledge work. SWE-bench Pro climbs from 64.3% to 69.2%. Open-weight challengers are closing in too; Nex-N2-Pro posts 58.8% on the same SWE-bench Pro test. A different approach goes further still: Sakana AI’s Fugu orchestrator scores 73.7 on SWE-Bench Pro by routing work across a pool of models, edging past Opus 4.8’s 69.2. GDPval-AA, the benchmark that measures economically valuable real-world work, rises from 1,753 to 1,890 Elo, implying roughly a 67% win rate against GPT-5.5 in head-to-head comparisons. Terminal-Bench 2.1 improves by 8.5 points to 74.6%, though GPT-5.5 still leads that particular benchmark at 78.2%.

The most interesting shift is in honesty and self-correction. Opus 4.8 is the first Claude model to score 0% on uncritically reporting flawed results, and shows a more than ten-fold reduction in overconfidence versus Opus 4.7. Anthropic’s alignment team writes that it “reaches new highs on our measures of prosocial traits,” with misaligned behavior rates substantially lower than Opus 4.7 and comparable to the restricted Claude Mythos Preview.

Here’s a quick summary of what’s new:

SWE-bench Pro: 64.3% → 69.2% (+4.9 points)
GDPval-AA: 1,753 → 1,890 Elo (+137 Elo)
Terminal-Bench 2.1: 66.1% → 74.6% (+8.5 points)
HLE (with tools): 54.7% → 57.9% (+3.2 points)
OSWorld-Verified: 83.4%, leading all competitors
BrowseComp: 79.3% → 84.3% (+5.0 points)
Code flaw detection: 4× fewer unflagged flaws
Alignment: Near Mythos Preview levels
Fast mode: 3× cheaper than previous generation

Benchmark Results

Coding

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
SWE-bench Verified	88.6%	87.6%	N/A	N/A
SWE-bench Pro	69.2%	64.3%	58.6%	54.2%
SWE-bench Multilingual	84.4%	N/A	N/A	N/A
Terminal-Bench 2.1	74.6%	66.1%	78.2%	70.3%
CursorBench	Leads all effort levels	Baseline	N/A	N/A

The headline number is SWE-bench Pro at 69.2%. This is the harder variant that tests real-world pull request resolution across complex codebases, and Opus 4.8 now leads every model in the comparison by a wide margin, 10.6 points over GPT-5.5 and 15 points over Gemini 3.1 Pro.

Terminal-Bench 2.1 is the one coding benchmark where GPT-5.5 still wins (78.2% vs. 74.6%). This measures multi-tool command-line workflows that require planning, iteration, and error recovery. The gap narrowed from 12.1 points on Opus 4.7 to 3.6 points on Opus 4.8, but it’s still there.

Cursor’s co-founder Michael Truell noted that Opus 4.8 “exceeds prior Opus on CursorBench across all effort levels” with “more efficient tool calling, fewer steps, and better follow-through on longer tasks.”

Agentic Tasks

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
OSWorld-Verified	83.4%	82.8%	78.7%	76.2%
Online-Mind2Web	84.0%	N/A	N/A	N/A
MCP-Atlas	82.2%	77.3%	N/A	N/A
Super-Agent Benchmark	Completes all cases	N/A	Incomplete	N/A
BrowseComp (single-agent)	84.3%	79.3%	84.4%	85.9%
BrowseComp (multi-agent)	88.5%	N/A	N/A	N/A

Agentic performance is where the clearest gains land. OSWorld-Verified (the benchmark for driving a virtual machine, clicking through UIs, and completing mixed software tasks) hits 83.4%, ahead of all competitors. GenSpark’s co-founder Kay Zhu calls Opus 4.8 “the only model completing every Super-Agent case end-to-end, beating prior Opus and GPT-5.5 at cost parity.”

BrowserBase’s tech lead Miguel Gonzalez reports 84% on Online-Mind2Web, calling it “the strongest computer-use and browser-agent model we’ve tested” with “a meaningful jump over 4.7 and GPT-5.5.”

Reasoning and Knowledge

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
GPQA Diamond	93.6%	94.2%	93.6%	N/A
HLE (no tools)	49.8%	46.9%	41.4%	N/A
HLE (with tools)	57.9%	54.7%	52.2%	N/A
USAMO 2026	96.7%	N/A	N/A	N/A
GDPval-AA	1,890 Elo	1,753 Elo	1,769 Elo	N/A

GDPval-AA is the standout. At 1,890 Elo, Opus 4.8 leads GPT-5.5 by 121 points and Opus 4.7 by 137 points. Artificial Analysis notes that Opus 4.8 achieves this while using 15% fewer turns and 35% fewer output tokens than its predecessor, though it still takes roughly 30% more turns than GPT-5.5 to complete tasks.

GPQA Diamond dipped slightly from 94.2% to 93.6%, a negligible move. Humanity’s Last Exam improved in both configurations: 49.8% without tools (up from 46.9%) and 57.9% with tools (up from 54.7%), leading every competitor on both.

Knowledge Work

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.5 Flash
Finance Agent v2	53.9%	N/A	N/A	57.9%
Legal Agent Benchmark	Highest recorded	N/A	N/A	N/A

Harvey AI’s head of applied research Niko Grupen reports that Opus 4.8 posted “the highest Legal Agent Benchmark score” and was “the first model to break 10% on the all-pass standard,” noting that “the accuracy lift translates directly to more attorney work that can be safely delegated.”

Bridgewater Associates’ Michael Ran describes “consistently higher analysis quality than prior Opus” with “faster, richer outputs, better signal-to-noise ratio” and the model “proactively flagging input and output issues.”

Enterprise Feedback from Box AI

Aaron Levie, CEO of Box, shared detailed benchmarking results from testing Opus 4.8 with Box AI agents on complex enterprise document workflows:

Report drafting: Opus 4.8 outperforms on a majority of tasks. On industrial goods reporting, it scored 87% vs. 77% for Opus 4.7. On consumer product launch evaluation, 90% vs. 84%.
Legal NDA review: Near-perfect consistency across all trials, catching more relevant clauses and flagging more potential issues than Opus 4.7.
Financial data analysis: On corporate lending analysis comparing syndicated vs. bilateral loan structures, Opus 4.8 leads by nearly 8 percentage points in accurate metric extraction.
Public sector grant analysis: Correctly extracted and validated nearly all required eligibility data points, catching specific details that Opus 4.7 “overlooked or misinterpreted.”

Opus 4.8 is rolling out to Box customers for deployment in Box AI agents.

Artificial Analysis Intelligence Index

At launch, Opus 4.8 took the #1 spot on the Artificial Analysis Intelligence Index with a score of 61.4 out of 149 models tracked. That was a +4.1 point jump from Opus 4.7 (57.3) and +1.2 points ahead of GPT-5.5 at max effort (60.2), which had held the top spot since April.

Model	Intelligence Index	Input $/M	Output $/M	Context
Claude Opus 4.8	61.4	$5.00	$25.00	1M
GPT-5.5	60.2	$5.00	$30.00	922K
Claude Opus 4.7	57.3	$5.00	$25.00	1M
Qwen3.7-Max	56.6	$2.50	$7.50	1M
Gemini 3.5 Flash	55.3	$1.50	$9.00	1M
DeepSeek V4 Pro	52.0	$1.74	$3.48	1M

That version of the index aggregated ten evaluations: GDPval-AA, Terminal-Bench Hard, τ²-Bench Telecom, AA-LCR, AA-Omniscience, Humanity’s Last Exam, GPQA Diamond, SciCode, IFBench, and CritPt.

The board has moved twice since. Artificial Analysis rebased the Intelligence Index to v4.1, a different nine-evaluation mix, so every number in the table above is on the retired scale and is not comparable to current scores. On v4.1, Claude Opus 5 leads at 61, Claude Fable 5 scores 60 and GPT-5.6 Sol scores 59. Opus 4.8 itself is rescored at 56, which puts it 10th of 190 models tracked rather than first.

Specific point improvements from Opus 4.7: Terminal-Bench Hard gained 6.8 points, τ²-Bench Telecom gained 5.9 points, and IFBench gained 3.6 points. Performance on AA-LCR, GPQA, and SciCode was roughly flat.

One caveat from Artificial Analysis: Opus 4.8 is “among the leading models in intelligence, but particularly expensive when comparing to other models of similar price” and “slower than average and very verbose,” producing approximately 110 million tokens during the full Intelligence Index evaluation versus a 35 million token average. The total evaluation cost was $4,685.85.

Honesty and Alignment

This is arguably the most significant change in Opus 4.8, even if it doesn’t have a single headline benchmark number.

Opus 4.8 is four times less likely than Opus 4.7 to let code flaws pass without flagging them. It produces 17 times fewer dishonest agentic code summaries compared to Claude Sonnet 4.6, the mid tier Anthropic has since refreshed as Claude Sonnet 5. It’s the first Claude model to score 0% on uncritically reporting flawed results. Overconfidence (claiming certainty about something it shouldn’t be certain about) dropped by more than 10× versus Opus 4.7.

Anthropic’s alignment team assessed that misaligned behavior rates, including deception and cooperation with misuse, are “substantially lower than Opus 4.7” and now match Claude Mythos Preview, the restricted model that was previously the alignment benchmark. Pre-deployment safety testing details are published in the full system card.

For everyday users, this translates to a model that is more willing to say “I don’t know” and less likely to confidently generate wrong answers. For developers building autonomous agents, it means a lower risk of the model quietly producing bad code or misleading status reports in unattended workflows.

Cognition (Devin) CEO Scott Wu confirmed this in practice: “Opus 4.8 uses tools cleanly and follows instructions with consistency for autonomous engineering. It fixes the comment-verbosity and tool-calling issues we saw in 4.7.”

Dynamic Workflows

The biggest platform launch alongside Opus 4.8 is Dynamic Workflows, now available as a research preview in Claude Code for Enterprise, Team, and Max plan users.

Dynamic Workflows lets Claude orchestrate hundreds of parallel subagents within a single Claude Code session. The system plans the work, distributes it across subagents, verifies outputs, and reports results, all without manual orchestration.

The practical use case Anthropic highlights is codebase-scale migrations spanning hundreds of thousands of lines of code. Claude Code with Opus 4.8 can carry out a migration from kickoff to merge, using the existing test suite as its quality bar.

This is a meaningful step beyond basic multi-file editing. Instead of working through files sequentially, Claude can spin up separate agents for independent changes, run them in parallel, and coordinate the results. Think of it as the difference between a single developer working through a backlog and a team lead distributing tickets.

Dynamic Workflows is also available through the Claude API, Bedrock, Vertex AI, and Microsoft Foundry.

Effort Control

Anthropic introduced a new control that lets users choose how much computational effort Claude applies to a response. This is available across all plans in claude.ai and Cowork.

The settings work as follows:

Low effort: Faster responses, lower rate-limit consumption. Best for simple questions.
High effort (default): Balanced quality and speed. Uses approximately the same number of tokens as Opus 4.7 while delivering better results.
Extra / xhigh: More frequent and deeper thinking. Recommended for difficult tasks and long-running async workflows.
Maximum: Highest token consumption. Best performance at the highest cost.

Rate limits in Claude Code have been expanded to accommodate higher effort levels. Anthropic recommends the default “high” setting for most work, noting that it already outperforms Opus 4.7’s default output at similar token budgets.

Messages API Enhancement

Developers can now insert system entries directly inside the messages array during a conversation. This means you can update Claude’s instructions (permissions, token budgets, environment context) mid-task without breaking prompt cache and without routing the update through a user turn.

This matters most for agent builders who need to adjust Claude’s behavior as a task progresses without resetting the cached context. It’s a technical change, but for multi-step workflows it removes a real friction point.

Pricing and Availability

Opus 4.8 ships at the same price as Opus 4.7:

Tier	Input	Output
Standard	$5.00 / 1M tokens	$25.00 / 1M tokens
Fast mode (2.5× speed)	$10.00 / 1M tokens	$50.00 / 1M tokens
Prompt caching (cache hit)	$0.50 / 1M tokens	N/A

Fast mode is now 3× cheaper than it was on previous models. At 2.5× the speed, this makes it a practical option for latency-sensitive applications where it was previously cost-prohibitive.

Prompt caching at $0.50 per million input tokens is a 90% discount on repeated context, which compounds well for agent workflows that repeatedly reference the same documents or instructions.

Context window: 1 million tokens (~1,500 A4 pages)
Max output: 128K tokens
Model ID: claude-opus-4-8
Input modalities: Text and image
Output modalities: Text only

Platforms: Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry

Metric	Opus 4.8	GPT-5.5	Qwen3.7-Max	Gemini 3.5 Flash
Input price	$5.00	$5.00	$2.50	$1.50
Output price	$25.00	$30.00	$7.50	$9.00
Context window	1M	922K	1M	1M
Intelligence Index	61.4	60.2	56.6	55.3

Index scores in the table above are the May 2026 figures, before the v4.1 rebase. At the frontier tier, Opus 4.8 was $5 cheaper per million output tokens than GPT-5.5 ($25 vs. $30) while scoring higher on the Intelligence Index. The cost-conscious alternative is Qwen3.7-Max at $2.50/$7.50, which trades 4.8 Intelligence Index points for a roughly 70% price reduction.

How It Compares

Against Opus 4.7: Straightforward upgrade across the board. Coding benchmarks improve by 5 to 8 points. Knowledge work jumps 137 Elo. Alignment improves dramatically. Same price. There’s no reason to stay on 4.7.

Against GPT-5.5: Split results that lean Opus 4.8. Claude leads cleanly on SWE-bench Pro (+10.6 points), GDPval-AA (+121 Elo), OSWorld-Verified (+4.7 points), and Humanity’s Last Exam. GPT-5.5 holds Terminal-Bench 2.1 (+3.6 points) and FrontierMath Tier 4. On the Intelligence Index, Opus 4.8 edges ahead by 1.2 points. GPT-5.5 is $5 more expensive per million output tokens.

Against Gemini 3.5 Flash: Different weight class. Gemini 3.5 Flash (55.3 Intelligence Index) costs roughly 70% less and runs approximately 4× faster. For budget-sensitive and latency-sensitive work, Flash is still the better value. Opus 4.8 is the pick when raw intelligence and agentic reliability matter more than per-token cost.

Against Claude Mythos Preview: at the time, Mythos was the higher-intelligence tier and was only available to select organizations for cybersecurity applications, with general release “expected in coming weeks” pending stronger cyber safeguards. That resolved on June 9, 2026: Anthropic released Claude Fable 5 as its most capable widely available model, and shipped Claude Mythos 5 alongside it, still invitation-only for defensive cybersecurity work under Project Glasswing. Opus 4.8 was the best Claude model most people could actually use until Opus 5 replaced it.

Where It Sits in the Claude Lineup

Model	Index (v4.1)	Best For	Pricing
Claude Mythos 5	Not scored	Defensive cybersecurity, invitation-only	Same as Fable 5
Claude Fable 5	60	Highest available capability, long-running agents	$10 / $50
Claude Opus 5	61	Complex agentic coding and enterprise work	$5 / $25
Claude Opus 4.8 (legacy)	56	Superseded by Opus 5, same price	$5 / $25
Claude Sonnet 5	Not published	Speed and intelligence balance	$3 / $15
Claude Haiku 4.5	Not published	High-throughput, cost-sensitive	$1 / $5

Opus 4.8 replaced Opus 4.7 as the default frontier model, and Opus 5 has since replaced it in that slot. The upgrade path still reads the same shape: Haiku for volume, Sonnet for everyday work, Opus for hard problems and autonomous agents, and Fable 5 when a workload needs the highest capability Anthropic sells. Anthropic’s own guidance now reads “start with Claude Opus 5 for complex agentic coding and enterprise work,” and “for workloads that need the highest available capability, use Claude Fable 5.”

Early Tester Highlights

The breadth of early tester feedback is notable. Here’s a cross-section:

Organization	Tester	Key Quote
Shopify	Tom Pritchard, Staff Engineer	“Better judgment. Asks right questions, catches mistakes, pushes back on unsound plans.”
Cursor	Michael Truell, Co-Founder/CEO	“Exceeds prior Opus on CursorBench across all effort levels. Fewer steps, better follow-through.”
Cognition (Devin)	Scott Wu, CEO	“Uses tools cleanly, follows instructions with consistency. Fixes comment-verbosity issues from 4.7.”
Harvey AI	Niko Grupen, Head of Applied Research	“Highest Legal Agent Benchmark score. First to break 10% on all-pass standard.”
BrowserBase	Miguel Gonzalez, Tech Lead	“Strongest computer-use model tested. 84% on Online-Mind2Web, a meaningful jump over 4.7 and GPT-5.5.”
Databricks	Hanlin Tang, CTO Neural Networks	“Step change in agentic reasoning. Multimodal PDF/diagram reasoning at 61% cheaper tokens than 4.7.”
Thomson Reuters	Joel Hron, CTO	“Meaningful improvements in consistency and reasoning for high-stakes professional workflows.”
Bridgewater	Michael Ran, Sr. Investment Associate	“Consistently higher analysis quality. Proactively flags input/output issues.”
Hebbia	Aabhas Sharma, CTO	“Better citation precision and token efficiency on retrieval for dense financial filings.”
Box	Aaron Levie, CEO	“Measurably better at generative and analytical work enterprises care about most.”
Every	Katie Parrott, Staff Writer	“Major quality-of-life update: faster, easier collaboration, better style retention across sessions.”

The pattern across these quotes is consistent: better judgment, fewer wasted steps, stronger follow-through. Several testers specifically call out the improvements in reliability and self-correction. Shopify’s note about the model “pushing back on unsound plans” aligns directly with the alignment gains Anthropic measured.

What This Means

For Developers

Opus 4.8 is a drop-in replacement. Use model ID claude-opus-4-8 in the API. The Messages API system-entry enhancement is worth adopting if you’re building multi-step agents, since it removes the need to restructure prompts when updating instructions mid-task.

Dynamic Workflows in Claude Code is worth testing if you’re dealing with large-scale migrations or refactors. The parallel subagent approach should compress wall-clock time significantly for tasks that would otherwise run sequentially.

For Claude Code Users

Experiment with the new effort settings. The default “high” already outperforms Opus 4.7’s default. Try “extra” or “max” for complex debugging sessions or multi-file refactors. If you’re on Enterprise, Team, or Max, try Dynamic Workflows for codebase-wide changes.

For Everyday AI Users

The quality-of-life improvements matter more than the benchmark deltas. Opus 4.8 is less likely to confidently make things up, more likely to flag when it’s unsure, and better at maintaining context across long conversations. For creative and analytical work, the “faster, richer outputs” that early testers describe translate to fewer back-and-forth cycles to get a useful result. Opus 4.8 stayed on the paid Pro and Max plans throughout its run. The same is true of its successor. Opus 5 is now the default model on Claude Max and the strongest model on Claude Pro. If you are weighing the cost, our guide to Claude’s free tier shows what you can do without paying.

If you want to compare the current Claude models against GPT-5.6, Gemini and DeepSeek V4 from a single app, Fello AI bundles the leading models into one native Mac, iPhone, and iPad app for $9.99/month. It is around 30 to 50 MB, holds a 4.7-star rating across 27,000+ reviews, and consistently ranks among the best AI apps for iPhone. And if Claude ever stops responding mid-task, our guide on what to do when Claude AI is not responding covers the status checks and fastest backups.

What’s Next

Opus 4.8 was Anthropic’s fifth Opus release in seven months, and the pace it signalled held. The company had flagged two upcoming moves at the time, a lower-cost model matching Opus capabilities and a higher-intelligence tier exceeding Opus. Both shipped. Claude Fable 5 and Claude Mythos 5 arrived together on June 9, 2026und Claude Sonnet 5 followed at $3 / $15 per million tokens with a 1M context window.

Mythos did not become a self-serve product. Claude Mythos 5 is offered in limited availability to approved customers under Project Glasswing, with no self-serve sign-up, and it shares Fable 5’s specs and pricing. By Anthropic’s own account, Opus 5 “remains behind Mythos 5 on cybersecurity tasks”, which is a deliberate split rather than a gap the company is racing to close.

Opus 4.8 held the top of the generally available Claude lineup for just under two months. If you are picking a model now, Opus 5 is the default choice at the same price. The only reason to stay on Opus 4.8 is a pinned model ID you have not migrated yet.

Share Now!

Erhalten Sie exklusive AI-Tipps in Ihrem Posteingang!

Bleiben Sie mit den Erkenntnissen von KI-Experten, auf die sich die besten Technikexperten verlassen, immer einen Schritt voraus!

Michal Langmajer
Mai 29, 2026
AI, anthropic, Claude Opus, Claude opus 4.8, llm, llm 101, LLMs

Get Fello AI: All-In-One AI Chatbot

All top AI models like GPT, Claude, Gemini, or Grok – in one app that works on Mac, iPhone, and iPad.

Holen Sie sich Fello AI jetzt!

Everything You Need to Know About Anthropic’s Claude Opus 4.8

What Changed in Opus 4.8