OpenAI Quietly Launched GPT‑4.1 – A GPT-4o Successor That’s Crushing Benchmarks

OpenAI has just launched a brand-new series of GPT models—GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano—that promise major advances in coding, instruction following, and the ability to handle incredibly long contexts. In a tweet from Sam Altman (@sama), he explained that these models are designed for real-world utility and are available exclusively via the API. The announcement makes it clear: while ChatGPT will continue using an updated GPT‑4o model, these latest models push the boundaries for developers with even better performance.

GPT-4.1 (and -mini and -nano) are now available in the API!

these models are great at coding, instruction following, and long context (1 million tokens).

benchmarks are strong, but we focused on real-world utility, and developers seem very happy.

GPT-4.1 family is API-only.
— Sam Altman (@sama) April 14, 2025

The GPT‑4.1 series sets a new standard in various areas. With support for up to 1 million tokens of context—enough space to handle multiple long documents or even entire codebases—the new models are built to process extensive input and deliver relevant answers quickly. The series is geared toward applications that demand robust coding capabilities and nuanced instruction-following.

Developers and organizations can expect faster responses and lower costs. GPT‑4.1 mini and GPT‑4.1 nano, in particular, are designed to deliver improved latency and cost efficiency while still offering strong performance. With benchmarks demonstrating significant gains over previous models, OpenAI is clearly aiming to empower more sophisticated agentic applications.

Key Improvements and Capabilities

The GPT‑4.1 model family comes with several important enhancements compared to its predecessors. Here are some of the highlights:

Enhanced Coding Skills

GPT‑4.1 outperforms earlier models on coding benchmarks. It scores 54.6% on SWE‑bench Verified—a 21.4 percentage point improvement over GPT‑4o. This means the model is notably better at tasks such as exploring code repositories, generating code patches that run successfully, and even handling diff formats with fewer extraneous edits.

Instruction Following

The new models excel at following complex, multi-turn instructions. An internal evaluation using Scale’s MultiChallenge benchmark shows GPT‑4.1 scoring 38.3% compared to 27.8% for GPT‑4o. This improvement is critical when building agents that need to understand and act on multiple pieces of instruction sequentially.

Long-Context Processing

Perhaps the most eye-catching feature is the support for up to 1 million tokens per interaction. This extended context window gives the model the ability to “remember” and accurately process long documents or multiple data sources at once. In internal tests—such as Video‑MME and needle-in-a-haystack evaluations—the model demonstrated impressive accuracy even with vast amounts of input.

Improved Efficiency

While GPT‑4.1 offers top-tier performance on several benchmarks, the mini and nano variants trade a bit of accuracy for much lower latency and cost. For example, GPT‑4.1 nano is hailed as the fastest and cheapest option, making it ideal for tasks like classification and autocompletion.

These improvements are not just theoretical. OpenAI has tailored GPT‑4.1 based on direct feedback from the developer community, ensuring that everyday workflows—from automated code reviews to long document analysis—run smoother and faster.

Benchmarks Performance

To put the improvements into perspective, here are some of the key performance figures and comparisons that OpenAI released:

Coding Capabilities

Model	SWE‑bench Verified Accuracy	Polyglot Diff Accuracy (Whole/Diff)
GPT‑4.1	54.6%	52% (whole) / 53% (diff)
GPT‑4o (2024‑11‑20)	33.2%	31% (whole) / 18% (diff)
GPT‑4.5	38%	64% (whole) / 62% (diff)

Note: The ‘diff’ format measures the model’s ability to follow diff formats for code changes, significantly reducing output length and costs.

Instruction Following

Model	Internal API (Hard)	MultiChallenge Accuracy	IFEval Accuracy
GPT‑4.1	49.1%	38.3%	87.4%
GPT‑4o (2024‑11‑20)	29.2%	27.8%	81.0%
GPT‑4.5	54.0%	43.8%	88.2%
GPT‑4.1 mini	45.1%	35.8%	84.1%
GPT‑4.1 nano	31.6%	15.0%	74.5%

These numbers highlight the leaps in performance that GPT‑4.1 offers, especially in areas that matter for real-world applications such as multi-turn conversations and following detailed formatting instructions.

Long Context and Vision Evaluations

GPT‑4.1 also shows strong performance in tasks involving long-context and even vision-based questions. For example:

Long Context Tasks:
In the OpenAI-MRCR evaluation (which tests the model’s ability to retrieve specific “needle” responses from a sea of similar content), GPT‑4.1 outperforms GPT‑4o with scores reaching 57.2% in a 128k token context scenario.
Vision Benchmarks:
In image-understanding tasks like MMMU and MathVista, GPT‑4.1’s scores are competitive, often exceeding those of GPT‑4o.

These benchmarks are supported by detailed evaluations and internal tests, reinforcing GPT‑4.1’s utility in scenarios ranging from detailed legal document reviews to complex coding tasks.

API Exclusivity

Beyond raw performance, GPT‑4.1’s release brings important business considerations for developers and enterprises. Here are some of the key details:

API-Only Access and Fine-Tuning Support

GPT‑4.1 is available exclusively via the API, meaning you won’t see it directly in the ChatGPT interface. Instead, ChatGPT continues to operate on an updated version of GPT‑4o, incorporating some of the improvements found in GPT‑4.1. This approach allows OpenAI to tailor GPT‑4.1 for specific developer needs and keep pushing performance with lower latency and cost.

OpenAI is also rolling out fine-tuning support for GPT‑4.1 and GPT‑4.1 mini. This upcoming feature will let organizations customize the models with their own data, ensuring that the responses match their desired tone, domain-specific terminology, and particular workflows. With integration in services like the Azure AI Foundry, businesses gain more control over versioning, security, and scalability.

Cost and Efficiency

The new models come with a detailed pricing structure designed to balance cost with performance:

Model	Input Cost*	Output Cost*	Cached Input Cost*
GPT‑4.1	$2.00 per 1M tokens	$8.00 per 1M tokens	$0.50 per 1M tokens
GPT‑4.1 mini	$0.40 per 1M tokens	$1.60 per 1M tokens	$0.10 per 1M tokens
GPT‑4.1 nano	$0.10 per 1M tokens	$0.40 per 1M tokens	$0.025 per 1M tokens

*Pricing is based on typical input/output and cache ratios; expect further discounts when using the Batch API, which offers an additional 50% discount.

The efficiency improvements mean that, for example, GPT‑4.1 nano is the fastest and cheapest option available, ideal for scenarios like simple classification or code autocompletion. This pricing structure is designed to make these capabilities available to a wide range of applications—from small-scale startups to large enterprises.

Real-World Impact

Early testing by partners such as Windsurf, Qodo, Hex, Blue J, Thomson Reuters, and Carlyle has already shown the practical benefits of GPT‑4.1. From accelerating code review processes and reducing extraneous edits to handling complex legal and financial documents efficiently, these models are reshaping how developers build and deploy intelligent applications.

The real-world evaluations not only show enhanced performance but also underline substantial improvements in efficiency, making GPT‑4.1 a pivotal tool for building reliable, agentic software capable of independent decision-making and dynamic task management.

Conclusión

OpenAI’s GPT‑4.1 model series marks a significant evolution in AI—especially for developers whose work depends on robust coding, detailed instruction following, and processing long documents. With up to 1 million tokens of context, improved efficiency, and competitive pricing, these models open up new possibilities for a variety of real-world applications.

While GPT‑4.1 remains an API-only offering, its advanced features and upcoming fine-tuning support promise to revolutionize workflows in areas ranging from software engineering and legal analysis to data extraction and beyond. As developers continue to experiment and integrate these capabilities into their systems, GPT‑4.1 stands out as a bold step forward in the practical application of AI.

Stay tuned as the industry adapts to these breakthroughs and as more detailed analyses and user feedback come in. OpenAI’s focused improvements based on community input suggest that the GPT‑4.1 series is not only more powerful but also more aligned with the everyday needs of modern software development.

Reciba consejos exclusivos sobre inteligencia artificial en su buzón de entrada.

Manténgase a la vanguardia con los conocimientos expertos en IA en los que confían los mejores profesionales de la tecnología.

Michal Langmajer
abril 15, 2025
gpt-4, gpt-4.1, gpt-4.5, GPT-4o, GPT-5, OpenAI

Get Fello AI: All-In-One Mac AI Chatbot

All the best AI models such as GPT-4o, Claude 4, Gemini 2.5, LLaMA 4 in a single app. Multi-language support, chat with PDFs, create images, search the web and more!

Consigue ya la IA de Fello

OpenAI Quietly Launched GPT‑4.1 – A GPT-4o Successor That’s Crushing Benchmarks