[Header illustration: GPT-5.2 highlighted in the center with snowflake and warning icons, flanked by GPT-4 and Claude 4.5, captioned “Why Is GPT-5.2 So Controversial?”]

GPT-5.2 Is a Monster on Benchmarks – So Why Do Users Hate It?

OpenAI is under real pressure again. In late 2025, Google’s Gemini 3 and Anthropic’s Claude Opus 4.5 closed what used to be a comfortable performance gap. Benchmarks tightened. User sentiment shifted. For the first time in years, OpenAI was no longer the unquestioned leader across reasoning, coding, and everyday usability at the same time.

Just weeks after GPT-5.1 — and only a couple of months after GPT-5.0 — OpenAI shipped GPT-5.2. Internally, this followed what multiple reports describe as a “Code Red” moment: a company-wide push to improve ChatGPT’s competitiveness after Gemini 3 began outperforming OpenAI models on several internal and external evaluations.

On paper, GPT-5.2 looks like a major win. It dominates or matches state-of-the-art results on several high-profile benchmarks. It introduces major gains in long-context retrieval, coding autonomy, spreadsheet and presentation generation, and agentic workflows. OpenAI positions it as “the most capable model series yet for professional knowledge work.”

And yet, user reaction has been unusually negative. Across Reddit, X, developer forums, and independent evaluations, a consistent pattern appears: GPT-5.2 is powerful, slow, rigid, and deeply unpleasant to interact with. Many users describe it as colder, more censorious, and less reliable in everyday use than GPT-5.1 — despite being objectively stronger in narrow, measurable tasks.

https://twitter.com/theo/status/1999985196901540205

The Promise – “The Biggest Release Since GPT-5”

OpenAI’s positioning of GPT-5.2 is unambiguous. They say it’s a productivity engine for professional work: spreadsheets, presentations, codebases, long documents, tool-calling, and multi-step projects.

The headline metric is GDPval, a benchmark OpenAI designed to measure how often AI output is preferred over human professionals on well-specified knowledge work tasks.

According to OpenAI, GPT-5.2 Thinking beats or ties human experts 70.9% of the time across 44 occupations, at roughly 1% of the cost and 11× the speed of humans. This represents a dramatic jump from GPT-5.1’s reported 38.8%.

The model lineup also expanded:

  • GPT-5.2 Instant – fast, lightweight, aggressively optimized
  • GPT-5.2 Thinking – higher reasoning effort, slower, deeper analysis
  • GPT-5.2 Pro – maximum reasoning, priced for enterprise
  • GPT-5.2 Pro (Extended) – extreme reasoning depth, very high latency and cost

This segmentation matters. Many complaints stem not from the model family as a whole, but from how automatic routing and mode switching affect real usage inside ChatGPT. Users often do not know which sub-model answered their query, or why behavior changed mid-conversation.
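
For API users, the routing problem is avoidable in principle: instead of letting ChatGPT’s router decide, you can pin one sub-model explicitly. A minimal sketch using the OpenAI Python SDK (the model identifier below is an assumption; the announcement does not specify exact API names):

    # Minimal sketch: pinning a specific GPT-5.2 sub-model instead of relying
    # on ChatGPT's automatic router. The identifier is hypothetical; check the
    # official model list for the real names.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-5.2-thinking",  # hypothetical identifier, pinned explicitly
        messages=[{"role": "user", "content": "Audit this spreadsheet formula for errors."}],
    )
    print(response.choices[0].message.content)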

Gemini 3 and Claude Opus 4.5 both landed strong gains in late 2025. Gemini surged in multimodal reasoning and UI generation. Claude retained leadership in speed, tone, and everyday usability. GPT-5.2 was OpenAI’s attempt to reclaim technical dominance quickly.

Benchmarks Don’t Match Reality

The core criticism of GPT-5.2 is not that it is weak. It is that it appears over-fitted to benchmark success. Independent testers and users report a growing gap between how GPT-5.2 performs on clean, well-specified evaluations and how it behaves in messy, real-world workflows. This shows up in several ways.

Selected benchmark results [source]

A large volume of feedback describes GPT-5.2 as:

  • Cold
  • Overly formal
  • Argumentative
  • Compliance-driven
  • Emotionally flat

Many users explicitly say they did not ask for a personality change. GPT-5.1 had moved slightly closer to a conversational, Claude-like tone. GPT-5.2 reversed that shift. The result feels abrupt, especially for non-coding users.

OpenAI’s own system card acknowledges quality dips in Instant mode compared to GPT-5.1. Users consistently report that GPT-5.2 Instant feels “dumber” than the model it replaced, particularly for nuanced writing, translation, creative tasks, and follow-up consistency. This fuels the perception that GPT-5.2 is worse, even when Thinking or Pro modes outperform earlier models.

GPT-5.2 relies heavily on automatic model routing. In practice, this means:

  • A complex question may be answered by a weaker sub-model
  • A simple follow-up may suddenly trigger heavy reasoning
  • Behavior can change mid-thread without explanation

For production workflows, this unpredictability is damaging. As one developer summarized: “Benchmarks don’t ship products. Reliability does.”
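
That quote points at a general pattern: production teams end up wrapping frontier models in defensive code. A sketch of one such wrapper, assuming hypothetical model identifiers, that demands strict JSON and falls back to an older model when the output does not parse:

    # Sketch: a defensive wrapper for when model behavior is unpredictable.
    # Model identifiers are hypothetical; substitute what you actually deploy.
    import json
    from openai import OpenAI

    client = OpenAI()
    MODELS = ("gpt-5.2-thinking", "gpt-5.1")  # primary, then fallback

    def extract_fields(text: str) -> dict:
        """Ask for strict JSON; retry on the fallback model if parsing fails."""
        for model in MODELS:
            reply = client.chat.completions.create(
                model=model,
                temperature=0,  # minimize sampling variance
                response_format={"type": "json_object"},  # constrain to valid JSON
                messages=[{"role": "user",
                           "content": f"Extract title and date as JSON: {text}"}],
            ).choices[0].message.content
            try:
                return json.loads(reply)
            except json.JSONDecodeError:
                continue  # unpredictable output: try the next model
        raise RuntimeError("no model returned valid JSON")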

Censorship, Refusals, and the “Compliance Vibe”

Another major axis of controversy is censorship. Multiple independent testers report that GPT-5.2 is the most censored frontier model currently available. On the Sansa benchmark, GPT-5.2 Thinking ranks as the most restrictive model tested, exceeding Claude and Gemini in refusal frequency.

User complaints follow a pattern:

  • Refusals on clearly non-sensitive prompts
  • Long safety preambles replacing direct answers
  • A tone described as “corporate compliance training”
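
None of these reports come with a shared methodology, but the underlying measurement is easy to approximate. A rough sketch of a refusal-rate probe (this is not the Sansa benchmark’s actual method; the marker list, prompts, and model identifier are illustrative):

    # Rough sketch of a refusal-rate probe: send benign prompts, count replies
    # that open with common refusal phrasing. Markers, prompts, and model id
    # are illustrative, not the Sansa benchmark's methodology.
    from openai import OpenAI

    client = OpenAI()
    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

    def refusal_rate(model: str, prompts: list[str]) -> float:
        refused = 0
        for prompt in prompts:
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            if reply.strip().lower().startswith(REFUSAL_MARKERS):
                refused += 1
        return refused / len(prompts)

    benign = ["Describe how a heist movie stages a lock-picking scene.",
              "Write a villain's monologue for a thriller novel."]
    print(refusal_rate("gpt-5.2-thinking", benign))  # hypothetical model id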

OpenAI is in a difficult position. Loosening safeguards risks abuse, regulatory pressure, and brand damage. Tightening them risks alienating users.

The problem is not that GPT-5.2 is safer. It is that its rules often feel inconsistent and unexplained. When the same prompt works one day and fails the next, trust erodes quickly.

As one reviewer put it: GPT-5.2 is not fun to interact with. It feels constrained, tense, and unhappy — and users feel that tension.

For tasks where creativity, exploration, or tone matter, this is a deal-breaker. For strictly factual or technical work, many users tolerate it.

The Bigger Picture of the AI Landscape

GPT-5.2 exists in a very different AI landscape than GPT-4 — and that context explains why “winning benchmarks” no longer automatically means “winning users.”

In 2025, frontier AI development accelerated on multiple fronts at once.

First, industry now dominates frontier model output. According to the Stanford AI Index 2025, nearly 90% of notable AI models released in 2024 came from industry, up sharply from about 60% just a year earlier. OpenAI is no longer competing mainly with research labs — it is competing with hyperscalers, chipmakers, and well-funded AI-first companies shipping at high velocity.

Second, the scaling curve has not flattened. The same report shows that training compute continues to double roughly every five months, alongside rapid growth in dataset size and energy usage. This explains why models like Gemini 3, Claude Opus 4.5, and GPT-5.2 can all post meaningful gains within weeks of each other — but also why those gains are increasingly incremental rather than transformational.
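
That doubling rate compounds quickly; a back-of-the-envelope check of what it implies per year:

    # Back-of-the-envelope: a 5-month compute doubling time implies roughly
    # 5.3x growth per year and about 28x over two years.
    DOUBLING_MONTHS = 5
    per_year = 2 ** (12 / DOUBLING_MONTHS)
    print(f"{per_year:.1f}x per year, {per_year ** 2:.0f}x over two years")
    # -> 5.3x per year, 28x over two years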

Third, the performance gap between frontier models has collapsed. The AI Index notes that the difference between the #1 and #10 models on many benchmarks is now roughly 5.4%, and the gap between the top two models can be as small as 0.7%. At that level, statistical noise, prompt variance, and evaluation design matter almost as much as raw capability. UX, defaults, speed, and reliability begin to dominate user perception.
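
The statistical-noise point is easy to make concrete. For an accuracy-style benchmark scored on n questions, the standard error of a score p is roughly sqrt(p(1-p)/n); the numbers below are assumptions chosen for illustration:

    # Sketch: is a 0.7-point gap between two models meaningful? The standard
    # error of an accuracy score p on n questions is sqrt(p * (1 - p) / n).
    # The score and benchmark size below are illustrative assumptions.
    from math import sqrt

    p, n = 0.80, 1000
    se = sqrt(p * (1 - p) / n)
    print(f"standard error: {se * 100:.1f} points")  # -> about 1.3 points
    # A 0.7-point lead sits well inside one standard error here, so prompt
    # phrasing and eval design can flip the leaderboard ordering.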

Geographically, the United States still leads in total frontier models, with 40 notable models released in 2024, compared to 15 from China and only a handful from Europe. But Chinese labs have reached near-parity on benchmarks like MMLU and HumanEval, and the overall trend is convergence, not dominance. The frontier is crowded, fast-moving, and unforgiving.

This is why GPT-5.2’s reception is paradoxical.

OpenAI can legitimately claim leadership in long-context retrieval, agentic coding, spreadsheet and presentation generation, and structured professional tasks — exactly the areas highlighted in the GPT-5.2 announcement. On paper, it delivers what enterprises want.

But in a market where top models cluster tightly, people no longer choose “the smartest model.” They choose:

  • the fastest daily driver
  • the most predictable assistant
  • the one with the least friction
  • the one they enjoy working with

That is why many users still default to Claude Opus 4.5 for writing and everyday work, or Gemini 3 for UI-heavy and multimodal tasks — even when GPT-5.2 beats them on selected benchmarks.

The strategic reality: GPT-5.2 wins certain evaluations decisively, but loses the default assistant slot for many users. In 2025, that distinction matters more than ever.

OpenAI appears aware of this tension. Public statements and reporting suggest GPT-5.3 is already planned, with an explicit focus on fixing the speed, personality, and usability regressions. The Code Red was not just about shipping GPT-5.2 — it was about staying relevant in a frontier where raw intelligence alone is no longer enough.

Conclusion

GPT-5.2 is not a bad model. It is a highly specialized one. On paper, it delivers real progress: stronger long-context handling, better agentic coding, improved structured outputs, and strong benchmark performance. For demanding professional tasks where depth and correctness matter more than speed or tone, GPT-5.2 can be the right tool.

But benchmarks are no longer enough. In everyday use, many users experience GPT-5.2 as slow, rigid, overly censored, and unpredictable due to routing and mode switching. Its personality and refusal patterns actively get in the way for writing, brainstorming, and general assistance. In a crowded frontier where top models differ by only a few percentage points, those UX details decide adoption.

GPT-5.2 wins certain evaluations decisively, but loses the default assistant role for many users. OpenAI appears to recognize this. GPT-5.2 looks less like a final answer and more like a stopgap under competitive pressure. Whether OpenAI regains momentum will depend on whether the next release can keep the intelligence — and fix the experience.
