OpenAI just dropped its first open-weight models since GPT-2, a notable break from its usual closed-release strategy. Announced on August 5, the new models, called gpt-oss, are released as open weights under the permissive Apache 2.0 license. That means anyone can download, modify, and run them locally with essentially no usage restrictions.
This release comes as open-source AI is gaining momentum, with players like Meta and Mistral pushing strong alternatives. gpt-oss includes two models:
- A 120-billion parameter version that performs about the same as OpenAI’s o4-mini
- A 20-billion parameter version aimed at lighter, more accessible setups
Both are optimized to run directly on local hardware — no internet, no API needed — and still support features like chain-of-thought reasoning, tool use, and large context windows.
While OpenAI continues to build out its paid API offerings, this release shows it’s also responding to the demand for more customizable, private, and cost-efficient AI setups.
gpt-oss is a big deal; it is a state-of-the-art open-weights reasoning model, with strong real-world performance comparable to o4-mini, that you can run locally on your own computer (or phone with the smaller size). We believe this is the best and most usable open model in the…
— Sam Altman (@sama) August 5, 2025
What is gpt-oss?
gpt-oss marks OpenAI’s return to open-weight model releases after a six-year hiatus. Released under the permissive Apache 2.0 license, these models can be freely downloaded, modified, and deployed without licensing fees or usage restrictions. Unlike OpenAI’s API-based models, which require internet connectivity and per-token pricing, gpt-oss runs entirely on local hardware, giving users complete control over their AI infrastructure and data privacy.
The release includes two models optimized for different use cases. gpt-oss-120b targets high-performance applications where maximum capability matters, while gpt-oss-20b focuses on efficiency and accessibility for smaller deployments. Both models use mixture-of-experts (MoE) architecture, which dramatically reduces the computational resources needed during inference – the 120-billion parameter model only activates 5.1 billion parameters per token, making it far more efficient than traditional dense models of similar size.
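The MoE efficiency gain is easy to put in concrete terms. Using the parameter counts quoted above, a quick back-of-the-envelope calculation shows how small a fraction of each model's weights is touched per token:

```python
# Back-of-the-envelope: fraction of parameters active per token in gpt-oss.
total_120b = 117e9   # total parameters, gpt-oss-120b
active_120b = 5.1e9  # active parameters per token

total_20b = 21e9     # total parameters, gpt-oss-20b
active_20b = 3.6e9   # active parameters per token

frac_120b = active_120b / total_120b
frac_20b = active_20b / total_20b

print(f"gpt-oss-120b: {frac_120b:.1%} of weights active per token")
print(f"gpt-oss-20b:  {frac_20b:.1%} of weights active per token")
```

Only about 4% of the larger model's weights participate in any single forward pass, which is why its inference cost is closer to a ~5B dense model than a 117B one.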
gpt-oss includes the same advanced reasoning capabilities found in the o-series models, with three configurable reasoning levels (low, medium, high) that let users balance performance against speed. The models natively support tool use – including web browsing, Python code execution, and custom function calling.
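Reasoning effort is selected in the system prompt rather than via a separate API parameter. A minimal sketch of building a chat request, assuming the `Reasoning: <level>` system-prompt convention described in OpenAI's gpt-oss documentation (the helper function name is ours, for illustration):

```python
def build_messages(user_prompt: str, reasoning: str = "medium") -> list[dict]:
    """Build an OpenAI-style message list selecting a gpt-oss reasoning level.

    `reasoning` must be one of the three documented levels: low, medium, high.
    """
    assert reasoning in {"low", "medium", "high"}
    system = f"You are a helpful assistant.\nReasoning: {reasoning}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Prove that sqrt(2) is irrational.", reasoning="high")
print(msgs[0]["content"])
```

Lower levels respond faster; higher levels spend more tokens on chain-of-thought before answering.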
By open-sourcing models that match their proprietary offerings, OpenAI is directly competing with Meta’s Llama series and other open alternatives while potentially expanding their influence in enterprise and government markets where data sovereignty and local deployment are critical requirements.
| Feature | gpt-oss-120b | gpt-oss-20b |
|---|---|---|
| Total Parameters | 117 billion | 21 billion |
| Active Parameters | 5.1B per token | 3.6B per token |
| Memory Required | 80GB GPU | 16GB RAM |
| Performance Comparison | Matches o4-mini | Matches o3-mini |
| Context Length | 128,000 tokens | 128,000 tokens |
| Reasoning Modes | Low/Medium/High | Low/Medium/High |
| Ideal Use Cases | High-performance apps | Edge devices, local inference |
gpt-oss Benchmarks Performance
gpt-oss delivers impressive performance across academic and real-world benchmarks, with the 120-billion parameter model nearly matching OpenAI’s proprietary o4-mini in most categories. On AIME 2024 – a prestigious mathematics competition for high school students – gpt-oss-120b scored 96.6% accuracy when using tools, solving nearly every problem correctly. This performance extends to coding competitions, where the model achieved a 2,622 Elo rating on Codeforces, placing it among the top competitive programmers globally.
The models show particularly strong performance in health-related queries, an area where many open-source models struggle. On HealthBench – a benchmark that tests realistic medical conversations – gpt-oss-120b scored 57.6%, actually outperforming both GPT-4o and o4-mini. However, OpenAI emphasizes these models aren’t medical devices and shouldn’t be used for diagnosis or treatment. The smaller 20-billion parameter model punches well above its weight class, matching or exceeding o3-mini performance across most benchmarks despite being significantly smaller.
Multilingual capabilities are robust across 14 languages, with gpt-oss-120b achieving 81.3% average accuracy on MMMLU (the multilingual version of the Massive Multitask Language Understanding benchmark). The models also excel at software engineering tasks, scoring 62.4% on SWE-bench Verified – a benchmark that tests real-world programming problems by asking models to fix actual GitHub issues. This represents practical coding ability rather than theoretical knowledge.
One notable limitation is hallucination rates. While gpt-oss performs well on complex reasoning tasks, it hallucinates more frequently than o4-mini on simple factual questions (SimpleQA and PersonQA benchmarks). This suggests the models are optimized for complex reasoning rather than basic fact recall, making them better suited for analytical tasks than straightforward information retrieval.
| Benchmark | gpt-oss-120b | gpt-oss-20b | Comparison |
|---|---|---|---|
| AIME 2024 (Math Competition) | 96.6% | 96.0% | Both excel |
| MMLU (General Knowledge) | 90.0% | 85.3% | Strong performance |
| Codeforces ELO (Programming) | 2,622 | 2,516 | Competitive level |
| SWE-bench Verified (Real Coding) | 62.4% | 60.7% | Practical ability |
| HealthBench (Medical Conversations) | 57.6% | 42.5% | Outperforms proprietary models |
| MMMLU (14 Languages Average) | 81.3% | 75.7% | Strong multilingual |
| GPQA Diamond (PhD-level Science) | 80.1% | 71.5% | Advanced reasoning |
| SimpleQA (Basic Facts) | 16.8% | 6.7% | Struggles |
gpt-oss vs Traditional GPT Models
The fundamental difference between gpt-oss and traditional GPT models is control and deployment flexibility. While ChatGPT and GPT-4 require internet connectivity and charge per token used, gpt-oss runs entirely offline on your own hardware with no usage fees after the initial setup. This creates entirely new possibilities that simply aren’t available with API-based models.
Air-Gapped and Classified Applications
This represents the most obvious advantage. Government agencies, defense contractors, and intelligence organizations can run gpt-oss on completely isolated networks without any internet connectivity. This enables AI-powered analysis of classified documents, strategic planning, and sensitive research that would be impossible with API models requiring external connections. Traditional GPT models can’t operate in secure, offline environments where data sovereignty is critical.
Cost Structure Advantages
High-volume applications like 24/7 customer service for major e-commerce sites or continuous content moderation for social platforms can run indefinitely without accumulating per-token usage fees. A company processing millions of customer interactions daily could save hundreds of thousands of dollars in API costs while maintaining comparable service quality.
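The break-even math is straightforward to sketch. All prices below are hypothetical placeholders for illustration, not actual OpenAI API or cloud GPU rates:

```python
# Hypothetical break-even: self-hosted gpt-oss vs. a per-token API.
# Every number below is an illustrative assumption, not a real price.
api_cost_per_1m_tokens = 1.00   # assumed blended $/1M tokens on an API
tokens_per_day = 500e6          # assumed daily processing volume
gpu_cost_per_day = 60.0         # assumed daily rental for one 80GB GPU

api_daily = tokens_per_day / 1e6 * api_cost_per_1m_tokens
savings_per_day = api_daily - gpu_cost_per_day

print(f"API: ${api_daily:.0f}/day, self-hosted: ${gpu_cost_per_day:.0f}/day, "
      f"savings: ${savings_per_day:.0f}/day")
```

Under these assumptions the self-hosted setup wins by a wide margin at high volume; at low volume the fixed GPU cost dominates and the API is cheaper, which is why this advantage is specific to sustained, high-throughput workloads.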
Complete Model Customization
Full access to the model weights allows organizations to fundamentally modify gpt-oss for specialized domains. Medical institutions can fine-tune versions on proprietary hospital protocols, legal firms can train assistants on specific case law databases, and engineering companies can develop AI trained on internal technical documentation. Traditional GPT models offer limited customization through fine-tuning APIs, but you can’t access or modify the underlying model weights.
Transparent Research Capabilities
Unlike ChatGPT, gpt-oss reveals its complete internal reasoning process – you can see exactly how it thinks through problems step by step. This transparency helps researchers study AI behavior and detect potential deception – the reasoning chains often contain errors, inappropriate content, or flawed logic that doesn’t appear in the final answer. OpenAI deliberately left this “messiness” intact for research purposes.
How to Run gpt-oss: Requirements and Setup
The hardware requirements for gpt-oss are surprisingly accessible thanks to aggressive quantization techniques. gpt-oss-120b requires a single GPU with 80GB of VRAM – typically an NVIDIA H100 or A100 – while the smaller gpt-oss-20b can run on just 16GB of system RAM, making it compatible with high-end consumer hardware or even some laptops. Both models use MXFP4 quantization (4.25 bits per parameter), which reduces memory usage by roughly 75% compared to standard 16-bit models without significant performance loss.
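The quoted memory figures line up with a quick calculation from the 4.25-bit quantization and the parameter counts in the table above:

```python
# Approximate weight-memory footprint at MXFP4 (4.25 bits/parameter)
# versus standard 16-bit weights.
BITS_MXFP4 = 4.25
BITS_FP16 = 16.0

def weight_gb(params: float, bits: float) -> float:
    """Weight storage in GB (decimal) for a given parameter count and precision."""
    return params * bits / 8 / 1e9

gb_120b = weight_gb(117e9, BITS_MXFP4)   # ~62 GB: fits on one 80GB GPU
gb_20b = weight_gb(21e9, BITS_MXFP4)     # ~11 GB: fits in 16GB of RAM
reduction = 1 - BITS_MXFP4 / BITS_FP16   # ~73% smaller than fp16

print(f"gpt-oss-120b weights: ~{gb_120b:.0f} GB")
print(f"gpt-oss-20b weights:  ~{gb_20b:.0f} GB")
print(f"vs fp16: {reduction:.0%} reduction")
```

Note this covers weights only; activations and the KV cache add further overhead, which is why the 120b model needs a full 80GB card rather than exactly 62GB.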
Getting started is straightforward through multiple deployment options. The models are available for immediate download on Hugging Face, with OpenAI providing native quantized versions that are ready to run. For quick experimentation, you can try the models through OpenAI’s open model playground without any local setup. Popular inference frameworks like vLLM, Ollama, and llama.cpp already support gpt-oss, while cloud providers including Azure, AWS, and Hugging Face Spaces offer hosted versions for users who prefer not to manage their own infrastructure.
Local deployment requires downloading the model weights (around 120GB for the large model, 20GB for the small one) and installing compatible inference software. OpenAI has released reference implementations in both PyTorch and Apple’s Metal platform, along with harmony renderers in Python and Rust to handle the models’ custom chat format. Windows users get additional support through Microsoft’s AI Toolkit for VS Code and Foundry Local, both optimized with ONNX Runtime for efficient CPU and GPU inference on Windows devices.
The models support OpenAI’s Harmony chat format, which handles multi-role conversations between system, developer, and user messages. This format is essential for tool use and agentic workflows. Setup involves configuring these tools through the harmony renderer, which acts as an interface layer between your application and the model’s tool-calling capabilities.
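To make the format concrete, here is a rough sketch of what a rendered Harmony conversation looks like. The special tokens follow OpenAI's published Harmony spec; in practice you would use the official harmony renderer (Python or Rust) rather than building strings by hand:

```python
# Hand-rolled illustration of the Harmony chat format used by gpt-oss.
# Real applications should use OpenAI's harmony renderer; this only
# shows the rendered structure of a multi-role conversation.
def render_harmony(messages: list[tuple[str, str]]) -> str:
    parts = []
    for role, content in messages:
        # Each message is wrapped in <|start|>role<|message|>content<|end|>.
        parts.append(f"<|start|>{role}<|message|>{content}<|end|>")
    return "".join(parts)

prompt = render_harmony([
    ("system", "You are a helpful assistant.\nReasoning: medium"),
    ("developer", "Answer concisely."),
    ("user", "What is MXFP4 quantization?"),
])
print(prompt)
```

The separate `system` and `developer` roles are what let deployers layer their own instructions on top of the base system policy, and the same wrapping scheme carries tool-call messages in agentic workflows.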
For production deployments, consider partnership integrations with platforms like Databricks, Vercel, Cloudflare, or OpenRouter, which offer managed hosting with optimized inference stacks. Hardware partners including NVIDIA, AMD, Cerebras, and Groq have also optimized their systems for gpt-oss, ensuring maximum performance across different deployment scenarios. The Apache 2.0 license means you can modify, fine-tune, and redistribute the models without licensing restrictions, making them suitable for commercial applications.
Technical Architecture and Safety Innovation
Architecture Innovations
gpt-oss introduces alternating dense and locally banded sparse attention patterns, which means the model focuses computational resources on the most relevant parts of conversations rather than processing everything equally. It also uses grouped multi-query attention with groups of 8, reducing memory usage during operation. The models run on a new o200k_harmony tokenizer with 201,088 tokens, specifically designed to handle complex multi-agent conversations and tool interactions more efficiently than previous systems.
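The memory saving from grouped attention is easy to quantify: with a group size of 8, eight query heads share each key/value head, so the KV cache shrinks roughly 8x versus full multi-head attention. A minimal sketch (the head count below is illustrative, not gpt-oss's actual configuration):

```python
# KV-cache saving from grouped multi-query attention with group size 8.
# n_heads is an illustrative value, not gpt-oss's actual head count.
n_heads = 64
group_size = 8
kv_heads = n_heads // group_size   # 8 query heads share each KV head

full_mha_kv = n_heads              # KV heads cached under standard MHA
gqa_kv = kv_heads                  # KV heads cached under grouped attention

print(f"KV cache shrinks {full_mha_kv // gqa_kv}x "
      f"({full_mha_kv} -> {gqa_kv} KV heads)")
```

Since the KV cache scales with context length, this reduction matters most for the models' 128,000-token context window.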
Unprecedented Safety Testing
OpenAI conducted one of the most comprehensive safety evaluations yet performed on an open-weight model by deliberately trying to make gpt-oss dangerous. They trained specialized versions to be malicious in biology and cybersecurity, simulating how bad actors might weaponize the technology. Even with advanced training methods, these adversarially fine-tuned versions couldn’t reach dangerous capability levels. Three independent expert groups reviewed this methodology, and OpenAI implemented their critical safety recommendations.
Known Vulnerabilities
gpt-oss has documented security weaknesses compared to proprietary models. It’s more susceptible to prompt injection attacks where users try to override system instructions, and it hallucinates more frequently on simple factual questions. These limitations require extra caution when deploying in production environments where security and accuracy are critical.
Final Thoughts
gpt-oss represents a good step forward for open-source AI, delivering genuinely competitive performance with OpenAI’s proprietary models while addressing real-world deployment needs. The models excel at complex reasoning tasks and demonstrate impressive capabilities in math, coding, and tool use, though they notably struggle more with basic factual accuracy compared to their proprietary counterparts. The hardware requirements, while reasonable for their capability level, still limit adoption to organizations with decent computing infrastructure or budget for cloud deployment.
What makes gpt-oss most valuable isn’t necessarily superior performance, but rather the flexibility it provides for specific use cases where traditional API models are not viable. For organizations requiring data privacy, unlimited usage, or complete customization, these models fill genuine gaps in the market. However, most everyday users will likely find little reason to switch from ChatGPT or other established AI services, especially given the technical complexity of local deployment. gpt-oss is ultimately a solid addition to the open-source AI ecosystem rather than a game-changer, offering practical solutions for specialized needs while maintaining competitive quality standards.