
GPT-5.5 Released: Everything You Need to Know

On April 23, 2026, OpenAI introduced GPT-5.5, its most capable model to date and the first fully retrained base model since GPT-4.5. The release brings state-of-the-art scores on Terminal-Bench 2.0, OSWorld, and GDPval, a much stronger agentic coding profile in Codex, a new Pro tier, and a pricing step up that reflects the jump in intelligence.

GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users across ChatGPT and Codex. GPT-5.5 Pro is available to the Pro, Business, and Enterprise tiers. For anyone using Fello AI on Mac, iPhone, or iPad, GPT-5.5 will arrive in the coming weeks alongside Claude 4.6, Gemini 3, DeepSeek, Kimi-K2.6, and Perplexity in the same app.

What Is GPT-5.5

GPT-5.5 is OpenAI’s newest frontier model and the headline release in the GPT-5 family. OpenAI describes it as a new class of intelligence built specifically for real work and for powering agents, with the ability to understand complex goals, use tools, check its own work, and carry multi-step tasks through to completion.

Unlike the incremental GPT-5.1, 5.2, 5.3, and 5.4 releases that preceded it, GPT-5.5 is a fully retrained base model. That difference matters: prior GPT-5.x updates were primarily post-training improvements. GPT-5.5 brings new pretraining, new reasoning behavior, and a large jump on agentic benchmarks.

GPT-5.5 is a reasoning model with text and image inputs and text output. It is ranked #2 on the Artificial Analysis Intelligence Index with a score of 59, evaluated across 10 benchmarks including GDPval, SciCode, GPQA Diamond, and Humanity’s Last Exam.

What Changed in GPT-5.5

Here is a quick summary of what is new compared to GPT-5.4:

  • State-of-the-art agentic scores on Terminal-Bench 2.0, OSWorld-Verified, and BrowseComp
  • Fully retrained base for the first time since GPT-4.5
  • Significantly fewer tokens used on the same Codex tasks
  • Same per-token latency as GPT-5.4 despite higher intelligence
  • 922K token context window for long documents and repos
  • New GPT-5.5 Pro tier for demanding reasoning and research
  • Improved image generation that renders readable text correctly
  • Deeper tool use and self-checking in agent workflows
  • Priced at $5/$30 per million tokens (double GPT-5.4)

OpenAI frames the release as a step toward an AI that does not just answer questions but carries work forward. TechCrunch describes this as OpenAI’s clearest move yet toward a consolidated AI super app that absorbs many everyday computer tasks.

Benchmark Results

OpenAI published detailed benchmark numbers alongside the release, and several independent outlets have verified them. The table below, built from the OpenAI release post, puts GPT-5.5 next to GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, and the restricted Claude Mythos Preview.

| Benchmark | GPT-5.5 | GPT-5.5 Pro | GPT-5.4 | Claude Opus 4.7 | Claude Mythos | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.0 | 82.7% | | 75.1% | 69.4% | 82% / 92.1% | 68.5% |
| GDPval | 84.9% | | 83.0% | 80.3% | | 67.3% |
| OSWorld-Verified | 78.7% | | 75.0% | 78.0% | 79.6% | |
| SWE-bench Pro | 58.6% | | 57.7% | 64.3% | 77.8% | 54.2% |
| GPQA Diamond | 93.6% | | | 94.2% | 94.5% | |
| BrowseComp | 84.4% | 90.1% | 79.3% | | 86.9% | 85.9% |
| FrontierMath Tier 4 | 35.4% | 39.6% | | 22.9% | | 16.7% |
| HLE (no tools) | 41.4% | 43.1% | | | 56.8% | |
| HLE (with tools) | 52.2% | 57.2% | | | 64.7% | |
| CyberGym | 81.8% | | 79.0% | 73.1% | 83.0% | |
| GraphWalks (long context) | 45.4% | | | | 80.0% | |

Cells left blank were not reported.

The 82.7% Terminal-Bench 2.0 result is state of the art for any publicly available model. GDPval at 84.9% measures the kind of work economists can actually price, including financial analysis, legal drafting, and consulting tasks, which is why OpenAI leans on it as a real-world signal rather than a synthetic puzzle.

How It Compares

A few patterns are visible in the table.

Versus GPT-5.4, GPT-5.5 is a clean step up on every agentic benchmark: Terminal-Bench jumps 7.6 points, OSWorld climbs 3.7 points, and GDPval picks up nearly 2 points on a benchmark that was already near saturation.

Versus Claude Opus 4.7, the story splits. Anthropic retains a real lead on SWE-bench Pro (64.3% vs 58.6%) for complex multi-file GitHub issue resolution, and edges ahead on GPQA Diamond. GPT-5.5 takes Terminal-Bench 2.0 by more than 13 points, GDPval by 4.6 points, and nearly doubles Opus on FrontierMath Tier 4. For pure coding precision, Opus 4.7 is still worth considering. For agentic workflow execution, GPT-5.5 is now the stronger pick.

Versus Claude Mythos Preview, Anthropic’s gated frontier model leads cleanly on six of nine overlapping rows, particularly on software engineering and knowledge benchmarks like HLE and GraphWalks. GPT-5.5 narrowly edges Mythos on Terminal-Bench (82.7% vs 82% base), but Mythos reaches 92.1% in its higher configuration. The honest read: Mythos is stronger where it has been tested, but it is only available through Project Glasswing partners, which makes GPT-5.5 the best model most people can actually use.

Versus Gemini 3.1 Pro, GPT-5.5 wins almost every row in the table. Google still holds advantages on very long context reasoning and native multimodality, and its 2M token context window and lower API price remain genuine differentiators for specific workloads.

Agentic Coding in Codex

The most visible gains are in Codex, OpenAI’s agentic coding environment. GPT-5.5 is OpenAI’s strongest agentic coding model to date, and the company says it uses significantly fewer tokens to complete the same tasks compared with GPT-5.4. That matters for cost, latency, and the viability of long running agents.

NVIDIA, which deployed Codex internally to more than 10,000 employees, reported that debugging cycles dropped from days to hours, and that multi-week experimentation now completes overnight in complex codebases. NVIDIA also cites stronger reliability and fewer wasted cycles than earlier models.

On SWE-bench Pro, which evaluates real GitHub issue resolution rather than toy tasks, GPT-5.5 reaches 58.6%, solving more tasks end-to-end in a single pass than any previous GPT release. Terminal-Bench 2.0 tests something different and harder: multi-tool command line workflows that require planning, iteration, and recovery from mistakes. The 82.7% score there is what most coding agent builders are paying attention to.

Knowledge Work and Research

GDPval is OpenAI’s preferred measure of economically valuable work, and 84.9% is a meaningful jump. In Ethan Mollick’s hands-on testing, GPT-5.5 Pro produced PhD-quality academic output on hundreds of crowdfunding research files within four prompts, generating legitimate literature reviews and sophisticated statistical analyses without manual editing.

Mollick also points out the remaining ceiling. Long-form fiction still shows an “uncanny” quality, with ornate sentences, repetitive metaphors, and dialogue where every character speaks in the same clipped tone. Hypothesis generation in statistics is technically sound but often uninteresting without expert prompting. The frontier, as Mollick puts it, remains jagged.

For anyone doing daily analysis, writing, or research, this model is strong enough that the bottleneck moves from the AI to the human choosing what to ask. That is a meaningful shift from twelve months ago.

Computer Use and Browsing

OSWorld-Verified at 78.7% is a real desktop computer use benchmark, and this number suggests GPT-5.5 can actually drive a virtual machine, click through UIs, and complete mixed software tasks with reasonable reliability. That makes computer use agents viable for more than demos.

BrowseComp at 90.1% on the Pro tier is the other half of the agent story. It measures how well a model can track down hard-to-find information across the open web, which is the core loop inside most deep research agents. Combined with stronger tool use, GPT-5.5 pushes practical browsing agents from interesting to useful.

Math and Science

On FrontierMath Tier 4, the hardest tier of a benchmark built by working mathematicians to resist memorization, GPT-5.5 Pro scored 39.6%. That is nearly double Claude Opus 4.7’s 22.9% and is a large step over any previous publicly released model.

SciCode and GPQA Diamond, both represented in the Artificial Analysis composite, test graduate-level scientific reasoning and programming. GPT-5.5 performs at or near the top on both, which is what gets it to the overall Intelligence Index score of 59.

GPT-5.5 Pro

GPT-5.5 Pro is the higher effort variant, aimed at demanding reasoning tasks, deep research, and long horizon agents. It is where the most impressive numbers come from: 90.1% on BrowseComp and 39.6% on FrontierMath Tier 4.

Pro used to be a niche option because of cost and speed. With the efficiency work in GPT-5.5, OpenAI says Pro is now a much more practical option for regular use, and it is what powers the most capable tasks ChatGPT can take on. In Mollick’s testing, GPT-5.5 Pro completed a 5000-year procedurally evolving harbor town simulation in 20 minutes, down from 33 minutes on the previous generation, and built a complete 101-page illustrated tabletop RPG from a single prompt.

Pro is available to Pro, Business, and Enterprise users in ChatGPT.

Speed and Efficiency

One of the more interesting claims in the release is that GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving while performing better across nearly every evaluation. In plain terms, it is smarter without being slower, and it uses fewer tokens to reach the same answer.

For agent builders, this compounds. Fewer tokens per task means lower cost per task, shorter wall-clock time per task, and more tasks per dollar of budget. It also means longer agent runs become economically feasible, which is one of the main things holding back complex workflows today.
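The arithmetic behind that compounding is simple. Here is a minimal sketch using the article's list prices, but with hypothetical per-task token counts and an illustrative 40% token reduction; OpenAI has not published per-task figures, so these inputs are assumptions:

```python
# Rough cost-per-task model: fewer tokens per task can offset a higher
# per-token price. Token counts below are illustrative assumptions,
# not published figures; prices are the article's list prices.

def cost_per_task(in_tokens, out_tokens, in_price, out_price):
    """Dollar cost of one agent task at per-million-token prices."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# GPT-5.4 at $2.50 / $15 per 1M tokens; assume 400K in, 60K out per task.
old = cost_per_task(400_000, 60_000, 2.50, 15.00)

# GPT-5.5 at $5 / $30 per 1M tokens; assume the same task now finishes
# with 40% fewer tokens thanks to the efficiency gains.
new = cost_per_task(240_000, 36_000, 5.00, 30.00)

print(f"GPT-5.4: ${old:.2f}/task, GPT-5.5: ${new:.2f}/task")
```

With these assumed numbers the doubled per-token price is only partly recovered ($2.28 versus $1.90 per task), which matches the article's framing that efficiency offsets part, not all, of the increase.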

On the ChatGPT product side, full-stack inference improvements mean a more capable model served at faster end-user speed. This is part of why OpenAI is comfortable putting Pro in front of more users.

Infrastructure Behind the Model

GPT-5.5 runs on NVIDIA GB200 NVL72 rack-scale systems, which NVIDIA says deliver 35x lower cost per million tokens and 50x higher token output per second per megawatt compared with prior systems. The first 100,000-GPU GB200 NVL72 cluster completed large-scale training runs for GPT-5.5 and set new reliability benchmarks at scale.

The OpenAI and NVIDIA relationship goes back to 2016, when Jensen Huang personally delivered the first DGX-1 to OpenAI. OpenAI has committed to deploying more than 10 gigawatts of NVIDIA systems for next generation infrastructure, which gives a sense of the compute footprint the next generation of models will need.

What Early Testers Say

Before the public release, OpenAI worked with a closed group of testers. Ethan Mollick, who has written the most detailed outside review, calls GPT-5.5 a “research partner” that performs especially well when paired with contextual inputs from documents and plugins.

His concrete examples are worth knowing. The model handled a multi-style image generation test (Klimt, Picasso, and Monet simultaneously) while correctly rendering text, which had been a persistent weakness. It ran a procedural 5000-year harbor town simulation that previous models could not complete. It produced a fully formed, illustrated 101-page tabletop RPG in one session.

NVIDIA’s internal rollout to over 10,000 engineers and researchers is the other signal. Debugging cycles compressed from days to hours, and long experimentation runs moved overnight, which is the type of qualitative change that shows up in productivity data rather than in benchmark tables.

Pricing and Context Window

GPT-5.5 API pricing is $5 per million input tokens and $30 per million output tokens. That is exactly double GPT-5.4 at $2.50 input and $15 output. GPT-5.5 Pro is priced at $30 input and $180 output per million tokens, reflecting its deeper reasoning and agentic loops.

Against the industry average of $1.60 input and $8.40 output across comparable models, GPT-5.5 ranks as expensive for its class, though most teams will find the higher per-task efficiency offsets part of the increase.
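That gap is easier to see as a single blended rate. A quick sketch using the 3:1 input-to-output weighting that the Artificial Analysis profile applies to its own blended figure:

```python
def blended_rate(in_price, out_price, ratio=3):
    """Blended $/1M tokens, assuming `ratio` input tokens per output token."""
    return (ratio * in_price + out_price) / (ratio + 1)

gpt55 = blended_rate(5.00, 30.00)    # -> 11.25, matching the AA profile
industry = blended_rate(1.60, 8.40)  # ≈ 3.30

print(f"GPT-5.5 ${gpt55:.2f}/1M vs industry average ${industry:.2f}/1M")
```

At roughly 3.4x the blended industry average, GPT-5.5 is priced as a premium model; whether that premium clears depends on the per-task efficiency gains above.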

The context window is 922,000 tokens per Artificial Analysis, which works out to roughly 1,383 A4 pages at 12pt Arial. Some partner documentation references a 1M token tier at standard API pricing, which matches Claude Opus 4.7’s window and falls short of Gemini 3.1 Pro’s 2M.
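The page conversion is just a ratio. A small sketch that derives the tokens-per-page figure from the article's own numbers (922K tokens to roughly 1,383 pages); actual density varies with formatting and tokenizer, so treat it as an estimate:

```python
CONTEXT_TOKENS = 922_000
A4_PAGES = 1_383

# Implied density: about 667 tokens per 12pt A4 page.
tokens_per_page = CONTEXT_TOKENS / A4_PAGES

def pages_that_fit(tokens: int) -> float:
    """Approximate A4 pages a given token budget covers."""
    return tokens / tokens_per_page

# A 300-page document set would consume roughly 200K tokens of the window.
print(round(300 * tokens_per_page))  # -> 200000
```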

Artificial Analysis Profile

Artificial Analysis, an independent benchmarking service, places GPT-5.5 (high) at #2 out of 141 evaluated models. Here is how it profiles the release.

| Metric | GPT-5.5 (high) | Notes |
| --- | --- | --- |
| Intelligence Index | 59 | #2 of 141 models, 4/4 intelligence rating |
| Input price | $5.00 / 1M tokens | #118 of 141 (industry avg $1.60) |
| Output price | $30.00 / 1M tokens | #125 of 141 (industry avg $8.40) |
| Blended rate (3:1) | $11.25 / 1M tokens | |
| Context window | 922K tokens | ~1,383 A4 pages |
| Input modalities | Text, image | |
| Output modalities | Text only | |
| Model type | Reasoning with extended thinking | |
| Full eval cost | $2,159.38 | Cost to run the Intelligence Index |
| Verbosity | 45M output tokens | Slightly above 36M median |
| Released | April 23, 2026 | |

The composite Intelligence Index score of 59 draws on 10 benchmarks including GDPval-AA, SciCode, GPQA Diamond, and Humanity’s Last Exam. The median comparable score is 33, so GPT-5.5 sits well above the field, with only one model scoring higher across the full evaluation set.

Availability and Access

GPT-5.5 is available starting April 23, 2026 to Plus, Pro, Business, and Enterprise users in ChatGPT, and in Codex for developers. GPT-5.5 Pro is available to Pro, Business, and Enterprise users in ChatGPT.

The API is live with the standard pricing above. OpenAI has not announced a free-tier rollout.

If you want to avoid yet another subscription, Fello AI bundles top AI models from different companies in a single native Mac, iPhone, and iPad app. The app is around 30 to 50 MB, uses negligible RAM, and runs on any Apple silicon device, which makes it a practical way to try GPT-5.5 next to the other frontier models without juggling accounts.

Where It Sits

Against the current lineup, GPT-5.5 is the general purpose leader for most real tasks. Claude Opus 4.7 is still strongest for some kinds of highly autonomous software engineering, the upcoming Claude Mythos remains a restricted preview, and Gemini 3.1 Pro leads on ultra long context and native multimodality. Among open models, Llama 4 and gpt-oss continue to be the default when self-hosting matters more than peak capability.

Within OpenAI’s own ladder, GPT-5.5 replaces GPT-5.4 as the default and GPT-5.5 Pro sits above it for the hardest work. The smaller GPT-5.1, 5.2, and 5.3 tiers remain relevant for cost-sensitive serving.

What This Means

GPT-5.5 is the clearest example yet that the AI capability curve is still bending upward rather than flattening. It is not a new paradigm, but it is a real jump: state-of-the-art agentic benchmarks, meaningful wins on economically valuable knowledge work, a context window large enough for almost any document set, and a Pro tier fast enough that advanced reasoning is not a once-a-day treat.

The practical shift for most people is this. A year ago, AI could draft. Six months ago, it could reason. Today it can complete. That changes what it is worth handing to a model: not a small piece of work to be reviewed, but an entire task to be finished.

For everyday use, GPT-5.5 is worth trying next to Claude Opus 4.7 on your actual workload, not on benchmarks. For agent and coding teams, the Codex efficiency gains and Terminal-Bench 2.0 score are the numbers that justify the price increase. For anyone who wants to test it without a direct OpenAI subscription, GPT-5.5 will be available in Fello AI in the coming weeks; for now you can work with the other top models such as GPT-5.4, Claude 4.6, and Gemini 3.
