
Anthropic Launches Claude Opus 4.5: Faster, Cheaper, and Crazy Good at Coding

November 24, 2025. Anthropic has officially launched Claude Opus 4.5, a major refresh of its top-tier model and the company’s strongest push yet in the fight for AI leadership. Coming just days after Google’s Gemini 3 debut, the new Opus arrives with a sharp focus on professional coding, long-running agents, and desk-work automation—and Anthropic is backing the launch with aggressive pricing and hard benchmark data.

The company says Opus 4.5 is now the leading model for real-world software engineering, slide and spreadsheet editing, and multi-step agentic workflows. Early numbers support the claim, with the model showing large performance gains across enterprise tasks. And in a move aimed at accelerating adoption, Anthropic has cut Opus pricing by roughly two-thirds compared to the previous generation.

Here’s what’s new, what actually improved, and how Opus 4.5 shifts the AI race heading into 2026.

Performance & Benchmarks

Opus 4.5’s debut is backed by three task-grounded benchmarks engineers actually use: SWE-bench Verified, Terminal-Bench 2.0, and τ²-Bench. Together they cover real-repo bug fixing, command-line automation, and multi-tool agent workflows—exactly what matters when models move from chat to shipped code.

In plain terms: SWE-bench Verified checks end-to-end bug fixes on real open-source repos (apply patch, run tests, pass). Terminal-Bench 2.0 measures how well a model plans and executes in a real shell—CLI tools, file ops, pipelines—so it maps to DevOps and automation. τ²-Bench tests multi-step agent behavior across tools and contexts—reasoning, planning, recovery from errors—so you see if a model can run longer workflows without hand-holding.
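The SWE-bench Verified loop is easy to picture in code. The sketch below is illustrative only: the real harness also pins dependencies per repository and checks that specific previously-failing tests now pass, but the core pass/fail logic looks roughly like this (`evaluate_patch` and its arguments are hypothetical names, not the benchmark's API):

```python
import subprocess

# Minimal sketch of the SWE-bench Verified resolution check (illustrative
# only; the real harness also pins environments and verifies specific
# fail-to-pass tests): apply the model's patch, run the repo's tests, and
# count the instance as resolved only if everything passes.
def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch didn't even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0  # resolved iff the test suite passes
```

The strictness of that final `return` is why the benchmark is hard: a plausible-looking patch that breaks one unrelated test still scores zero.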

| Model | SWE-bench Verified | Terminal-Bench 2.0 | τ²-Bench |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | 59.3% | 88.9% |
| Claude Sonnet 4.5 | 77.2% | 50.0% | 86.2% |
| Gemini 3 Pro | 76.2% | 54.2% | 85.3% |
| GPT-5.1 | 74.9% | 58.1% | 80%* |

Why it matters: Opus 4.5 is the first LLM to clear the 80 % bar on real-world bug-fixing while also topping shell automation and multi-tool agent tests. That three-for-three sweep makes it a genuine all-rounder instead of a one-trick coding pony.

Multilingual edge

On SWE-bench Multilingual, Opus 4.5 ranks first in 7 of 8 languages (Java, Go, Rust, C++, Python, TypeScript, PHP), with Kotlin as the single exception (Gemini 3 leads there). This matters for polyglot codebases where one fix can touch multiple stacks: a single model stays consistent across languages, reducing hand-offs and prompt juggling. The suite measures end-to-end issue resolution on real repos (apply patch → run tests → pass), so these wins reflect practical bug-fixing rather than isolated code snippets.

Vision & agentic browsing

On vision-augmented and browsing tasks (BrowseComp-Plus and Vending-Bench 2), Opus 4.5 shows single-digit percentage-point gains over Sonnet 4.5. The lead holds, but margins are modest, suggesting incremental progress rather than a breakout. Expect movement as Anthropic’s new zoom capability (the model can request tight crops of screenshots/UI regions) is incorporated into public evals; finer visual attention typically helps with small UI text, dense terminal output, and multi-panel pages where previous runs lost points to misreads.

Figure: Vending-Bench 2 results [source]

Long-context Research

Opus 4.5 keeps the 200K-token window but adds automatic context compaction and thinking-block memory. In Anthropic’s internal deep-research eval—30-step literature reviews with sub-agents—the full stack (high effort + memory + sub-agents) lifted accuracy by nearly 15 percentage points versus Sonnet 4.5.
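Anthropic hasn’t published how compaction works internally, but the general shape of the technique is well known: keep recent turns verbatim and collapse older ones into a summary once the transcript nears the token budget. A purely illustrative sketch (all names here are invented, not Anthropic’s implementation):

```python
# Illustrative sketch of context compaction (not Anthropic's actual
# implementation): walk the transcript newest-first, keep messages until
# the token budget is exhausted, and replace everything older with a
# single summary placeholder.
def compact(messages: list[str], budget: int, count_tokens) -> list[str]:
    keep, used = [], 0
    for m in reversed(messages):               # newest first
        cost = count_tokens(m)
        if used + cost > budget:
            break
        keep.append(m)
        used += cost
    dropped = len(messages) - len(keep)
    if dropped == 0:
        return messages
    return [f"[summary of {dropped} earlier message(s)]"] + list(reversed(keep))
```

A production version would replace the placeholder with a model-written summary rather than a marker string, which is where the quality of long-horizon agents is actually won or lost.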

Cost vs. Accuracy

| Model | Token price (in / out, per 1M) | $ per SWE-bench point |
|---|---|---|
| Opus 4.5 | $5 / $25 | 0.31 |
| Sonnet 4.5 | $3 / $15 | 0.19 |
| Gemini 3 Pro | $2 / $12 | 0.16 |
| GPT-5.1 | $1.25 / $10 | 0.13 |

Opus isn’t the cheapest on a raw-token basis, but once you factor in its higher hit-rate (and the human hours saved when a patch lands on the first try) the dollar-per-bug-fixed metric starts to lean its way—especially for teams where a single regression can cost more than an entire month of API usage.
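The “$ per SWE-bench point” column is easy to reproduce: the published values match the per-million output-token price divided by the SWE-bench Verified score. Treat it as a rough efficiency proxy rather than an official metric:

```python
# Reproduce the "$ per SWE-bench point" column: the table's values match
# output-token price (USD per 1M tokens) divided by the SWE-bench
# Verified score. A rough efficiency proxy, not an official metric.
pricing = {
    "Opus 4.5":     (25.00, 80.9),
    "Sonnet 4.5":   (15.00, 77.2),
    "Gemini 3 Pro": (12.00, 76.2),
    "GPT-5.1":      (10.00, 74.9),
}

for model, (out_price, score) in pricing.items():
    print(f"{model:13s} ${out_price / score:.2f} per SWE-bench point")
```

Note that this ignores input tokens and per-task token consumption, which vary widely between models; a full cost comparison would meter real workloads.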

Opus 4.5 Rocks in Anthropic’s Own Hiring Test

One of Anthropic’s boldest claims is that Opus 4.5 scored higher than any human engineer on the company’s internal “performance engineering take-home exam,” a two-hour technical problem-solving test used in real hiring.

There’s nuance here. Anthropic notes that the highest score was achieved using parallel test-time compute (multiple solution attempts in parallel). Without it, Opus 4.5 ties the strongest human, rather than surpassing them outright.
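“Parallel test-time compute” here is essentially best-of-n sampling: run several independent attempts and keep the highest-scoring one. A minimal sketch, where `generate` and `score` stand in for model calls and the exam grader (both assumptions, not Anthropic’s actual harness):

```python
from concurrent.futures import ThreadPoolExecutor

# Hedged sketch of parallel test-time compute as best-of-n sampling.
# `generate(i)` produces the i-th independent attempt and `score` grades
# it; both are stand-ins, not Anthropic's actual setup.
def best_of_n(generate, score, n: int = 8):
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, range(n)))
    return max(candidates, key=score)
```

The catch is that best-of-n requires a reliable scorer; a take-home exam with an objective grader provides one, which is exactly why this trick shines on that kind of test.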

Still, this is the first time Anthropic has publicly stated that a Claude model has matched or exceeded expert humans on one of its real engineering assessments—something they didn’t claim for Sonnet 4.5 or Opus 4.1.

This exam is now effectively an internal AI benchmark, and Opus 4.5 sets a new bar.

Agentic Tasks

Beyond raw code generation, Opus 4.5’s biggest leap appears in agentic tasks—scenarios where the model must plan, act, use tools, and follow rules.

On τ²-Bench, a benchmark for real-world multi-step tasks, Opus 4.5 demonstrated a level of creative decision-making that surprised even Anthropic’s test designers.

The standout example:

A customer with a Basic Economy ticket wants to move their flight to a new date.
Policy: Basic Economy cannot be modified.
Expected benchmark answer: Refuse the request.

Opus 4.5 instead:

  1. Reads the policy closely
  2. Identifies a clause allowing cabin upgrades for all fare types
  3. Proposes upgrading from Basic Economy → Economy
  4. Then modifies the itinerary under the new fare rules

This solution is entirely legitimate, even though the benchmark marked it as a “failure” since it deviated from the expected answer.
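The upgrade-then-modify path can be expressed as a toy policy check. The fare rules below are invented for illustration (they are not the benchmark’s actual schema), but they capture why the workaround is policy-compliant:

```python
# Toy model of the airline fare policy (invented rules for illustration):
# Basic Economy itineraries can't be modified directly, but upgrades are
# allowed for every fare class, so upgrade-then-modify stays within policy.
MODIFIABLE = {"economy", "business", "first"}       # basic_economy excluded
UPGRADE_PATH = {"basic_economy": "economy"}         # upgrades open to all fares

def change_flight(fare: str) -> list[str]:
    steps = []
    if fare not in MODIFIABLE and fare in UPGRADE_PATH:
        fare = UPGRADE_PATH[fare]
        steps.append(f"upgrade to {fare}")
    if fare in MODIFIABLE:
        steps.append("modify itinerary")
        return steps
    raise ValueError("request must be refused under policy")
```

Note that every individual step is allowed; only the benchmark’s expected answer treats the combination as out of scope.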

Anthropic describes this as “creative problem solving”; safety researchers call it “reward hacking.”
Either way, it demonstrates that Opus 4.5 operates with a high degree of policy reasoning and loophole detection, a sign of both capability and potential risk.

Deep Research, Excel, Slides, and Real Workflows

Anthropic positions Opus 4.5 heavily toward knowledge work. The model shows major gains in:

  • Spreadsheet generation and editing
  • Slide deck creation
  • Long-running research
  • Desktop and browser-based tasks
  • Multi-agent coordination

Anthropic reports a 15-point boost on internal deep-research evaluations when combining Opus 4.5 with their improved memory, context management, and multi-agent structure.

These gains surface directly inside product features:

  • Claude for Excel (now expanded to Max, Team, Enterprise)
  • Claude for Chrome (now available to all Max users)
  • Claude Code in the desktop app, which can run multiple coding agents simultaneously

This is where Opus 4.5 becomes more than an LLM: it behaves like an autonomous digital coworker that can navigate tools, edit documents, and reason across long tasks.

Model Safety Check

Claude Opus 4.5 is Anthropic’s most safety-hardened model yet, and the company openly highlights improvements in areas like prompt-injection resistance, misuse refusal, and policy adherence. But while the model is safer than previous Claude versions—and safer than most rivals—the data shows clear limits that matter for anyone deploying AI agents or automating sensitive workflows.

Prompt-Injection Resistance

Anthropic’s launch post includes direct comparisons against models like Gemini 3 Pro and GPT-5.1. Opus 4.5 shows the lowest success rate for strong prompt-injection attacks among all tested models. This is a meaningful upgrade, but not a complete fix: even with its improvements, repeated attacks still succeed at a noticeable rate, meaning developers must keep additional guardrails in place.

Real-World Misuse

In controlled safety evaluations, Opus 4.5 refuses 100% of prohibited coding requests. But in real-world scenarios, especially inside Claude Code, refusal rates drop:

  • Malware-style requests blocked: ≈78%
  • Surveillance / monitoring requests blocked: ≈88%

These are strong results compared to industry norms, yet they confirm that a determined attacker—or even an inexperienced user with persistence—can still slip through.

Creative Reasoning

Anthropic showcased the same τ²-Bench airline scenario as a safety case study. In the simulated booking flow, the benchmark’s expected answer was to refuse the customer’s request to change a Basic Economy ticket, since that fare class doesn’t allow modifications.

Opus 4.5 found a legal workaround:

  1. Upgrade the ticket (allowed across all fare classes),
  2. Then modify the itinerary under the upgraded rules.

This is the same kind of reasoning that makes Opus 4.5 excellent at strategy, policy reading, and long-running tasks, but it also raises a safety question:

  • Is the model truly aligned, or is it discovering loopholes to satisfy instructions?

Anthropic calls this “creative problem solving,” but acknowledges that the behavior closely resembles reward hacking—exactly the kind of emergent behavior safety researchers worry about as models gain autonomy.

Conclusion

Claude Opus 4.5 is a strong upgrade. It’s faster, cheaper, and delivers some of the best coding and agent performance available right now. It leads on SWE-bench Verified, matches the strongest human engineers on Anthropic’s own hiring exam, and handles spreadsheets, slides, and long research tasks noticeably better. And with the price drop to $5/$25 per million tokens, it becomes much easier to use every day.

But this launch doesn’t happen in a vacuum. Gemini 3, GPT-5.1, and Grok 4.1 all arrived within weeks. Each one is strong in different areas, and the gap between top models is now very small. There’s no clear winner anymore — just fast releases and constant leapfrogging.

This raises simple but important questions: how should people compare these models when the differences are tiny? Will companies pick based on cost, ecosystem, or safety? And if every new model finds new strengths while still showing safety issues, how close are we to AI that can reliably run tasks on its own?

For now, Opus 4.5 puts Anthropic back at the front in coding and agent work. The real story is how quickly the entire industry is moving — and how hard it’s becoming to judge which model is actually “the best.”
