GPT OSS from OpenAI — Two Powerful Open-Source/Open-Weight Models. Comparable to frontier models?

OpenAI just dropped a game-changer in the AI world! Discover how their new open-weight models, GPT-OSS-120b and GPT-OSS-20b, are redefining what’s possible right from your hardware. Are they comparable to cutting-edge proprietary models? How do they stack up to other open-source / open-weight frontier models? Dive in to find out!

OpenAI Just Released Two Powerful Open-Weight Models That Run on Your Hardware — And They’re Surprisingly Good

Executive Summary

OpenAI has made a surprising strategic pivot by releasing its first open-weight language models since GPT-2 in 2019. The new GPT-OSS-120b and GPT-OSS-20b models, released under the permissive Apache 2.0 license, deliver performance that rivals OpenAI’s proprietary models while running efficiently on consumer hardware.

Key Takeaways:

  • GPT-OSS-120b (117B parameters) achieves near-parity with OpenAI’s o4-mini on core reasoning benchmarks while running on a single 80GB GPU

  • GPT-OSS-20b (21B parameters) delivers o3-mini-level performance and runs on devices with just 16GB of memory, including consumer laptops

  • Both models excel at reasoning, coding, mathematics, and tool use, often outperforming much larger open-source alternatives

  • The models use a Mixture-of-Experts (MoE) architecture, activating only 5.1B and 3.6B parameters per token respectively

  • Safety-tested through adversarial fine-tuning, with a $500,000 Red Team Challenge to identify vulnerabilities

  • Available immediately on HuggingFace, with support from major platforms including Azure, AWS, Ollama, and LM Studio

For businesses, this means enterprise-grade AI capabilities can now be deployed on-premises with full control over data. For developers, it enables local experimentation without API costs. For the AI community, it represents a significant step toward democratizing advanced AI capabilities.

With these models, OpenAI has effectively answered critics who claimed the company had abandoned its original mission of democratizing AI. By releasing high-performance models that can run locally, they’ve significantly lowered the barriers to entry for AI development and deployment. For enterprises concerned about data privacy and sovereignty, this represents a transformative opportunity to bring cutting-edge AI capabilities in-house.

The Shocking Performance Numbers That Have Everyone Talking

Imagine running a model that performs in the same league as OpenAI’s o4-mini on your own hardware, without sending a single API call to the cloud. That’s essentially what OpenAI delivered today, and the benchmarks are raising eyebrows across the AI community.

“I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes,” notes AI researcher Simon Willison in his analysis of the models. The numbers back up his surprise.

Head-to-Head: How GPT-OSS Stacks Up Against the Competition

Let’s dive into the benchmarks that matter most for real-world applications:

Competition Coding (Codeforces Elo Ratings)

  • o3 (with tools): 2706

  • GPT-OSS-120b (with tools): 2622

  • GPT-OSS-20b (with tools): 2516

  • DeepSeek R1: 2061

  • Qwen 3 235b: 2700

The GPT-OSS models achieve “incredibly strong scores and beat most humans on the entire planet at coding,” according to the benchmark analysis. Even the smaller 20b model significantly outperforms DeepSeek’s offering while using a fraction of the compute resources.

Advanced Mathematics (AIME 2024 & 2025)

Here’s where things get really interesting. On competition-level mathematics problems:

AIME 2024:

  • o4-mini: 98.7%

  • GPT-OSS-120b: 96.6%

  • GPT-OSS-20b: 96%

  • o3: 95.2%

AIME 2025:

  • o4-mini: 99.5%

  • GPT-OSS-20b: 98.7% (beating the larger 120b model!)

  • o3: 98.4%

  • GPT-OSS-120b: 97.9%

“The 20 billion parameter version actually beat the 120 billion parameter version,” the benchmarks reveal, suggesting exceptional efficiency in the smaller model’s design.

Accuracy on the 2024 and 2025 AIME math competitions (15 problems, integer answers).

  • Near-Perfect Scores: Math competition performance is exceptionally high across top models. OpenAI’s o4-mini leads with 98–99%, narrowly ahead of the rest. GPT-OSS-20B hit 98.7% on 2025, even slightly outscoring GPT-OSS-120B (97.9%) and o3 (98.4%) that year. This is an impressive result for a 20B model.

  • Strong Open-Source Showings: GPT-OSS-120B achieved 96.6% on 2024, nearly matching o4-mini. Qwen-3 (235B) and Phi-4 (14B) also excel (≈90–95%), demonstrating that both massive scale and clever training can yield advanced math skills. Gemma-3 (27B) is close behind at ~89–90%.

  • Mid-tier Models Lag a Bit: DeepSeek-R1’s scores (≈80% in 2024, 87.5% in 2025) are substantially lower, suggesting some gap in mathematical problem-solving for that model. Overall, however, the best open models (GPT-OSS, Qwen, Phi-4) have essentially reached human-level, near-perfect accuracy on AIME-style math. This was once considered a tough domain and a major challenge for AI.

PhD-Level Science (GPQA Diamond)

  • o3: 83.3%

  • o4-mini: 81.4%

  • GPT-OSS-120b: 80.1%

  • GPT-OSS-20b: 71.5%

Accuracy on GPQA Diamond, a set of extremely advanced science questions (no tools allowed).

  • Top-Tier Cluster (~80%): A tight cluster of models achieved ~80% on this PhD-level QA. OpenAI’s o3 is highest at 83.3%, but o4-mini, DeepSeek-R1, Qwen-3, and GPT-OSS-120B all score between 80–81%, effectively matching each other. This parity shows that open models can now approach proprietary model performance even on very advanced reasoning tasks.

  • Phi-4: Small but Mighty: Phi-4 (14B) delivers 78% accuracy, only a few points shy of models an order of magnitude larger. Its ability to stay within ~5% of o3 on expert QA underscores how well this 14B model was optimized for reasoning.

  • Notable Laggard — Gemma-3: Gemma-3 (27B) scores only 42.4%, dramatically lower than all others. This outlier result suggests Gemma-3 struggles with the GPQA dataset; it may lack the training breadth or alignment for such open-ended, high-level questions (whereas it performs fine on structured tasks like math).

General Knowledge (MMLU)

  • o3: 93.4%

  • o4-mini: 93%

  • GPT-OSS-120b: 90%

  • GPT-OSS-20b: 85.3%

Accuracy on MMLU, a benchmark covering 57 academic subjects (middle-/high-school and college level).

  • Closing in on Human-Level: The top closed models o3 (93.4%) and o4-mini (93.0%) still hold a slight lead, but open models are very close. GPT-OSS-120B scores 90%, and DeepSeek-R1 actually reached 91.3%, essentially matching the OpenAI models on this broad test of knowledge.

  • High Achievers: The other recent models cluster in the high 80s. Qwen-3 manages 88%, Phi-4 about 87%, Llama 4 Maverick 86%, and GPT-OSS-20B 85.3%. These results demonstrate that a wide range of modern models, open and proprietary, have powerful multidisciplinary knowledge and reasoning capabilities.

  • Gemma-3 Trails: At 76.9%, Gemma-3 (27B) again lags behind its peers on MMLU. While respectable, this score is roughly 10 points lower than models like GPT-OSS-20B or Llama 4, indicating Gemma-3 may require further refinement or scaling to compete on broad knowledge tasks.

The Healthcare Surprise

In an unexpected twist, both GPT-OSS models outperform OpenAI’s proprietary o1 and GPT-4o models on healthcare-related queries:

HealthBench (Realistic)

  • o3: 59.8%

  • GPT-OSS-120b: 57.6%

  • o4-mini: 50%

  • GPT-OSS-20b: 42.5%

This performance suggests potential applications in healthcare technology, though OpenAI emphasizes these models “do not replace a medical professional and are not intended for diagnosis or treatment.”

Performance on realistic health advice conversations (left) and a harder medical challenge set (right). Higher is better.

  • Maverick at the Top: Meta’s Llama 4 “Maverick” (multimodal, long-context model) slightly leads on health tasks, with ~60% on realistic advice and 35% on the hard set. This indicates its strength in nuanced, multimodal understanding in the medical domain.

  • GPT-OSS-120B is Competitive: GPT-OSS-120B scores 57.6% on realistic and 30% on hard questions, closely trailing o3 (59.8%/31.6%) and Maverick. It outperforms other open models like DeepSeek-R1 (49%/30%) on realistic queries, and matches them on the hardest ones. Notably, GPT-OSS models even exceeded some older closed models (GPT-4o, o1) on HealthBench (not shown in chart).

  • Smaller Models Struggle with Hard Cases: GPT-OSS-20B delivers 42.5% on realistic but only 10.8% on the challenging set, showing a steep drop-off as question difficulty rises. Similarly, OpenAI’s o4-mini sees its score halved on hard vs. normal. This suggests that complex medical reasoning still taxes smaller-scale models heavily, whereas larger and more specialized models maintain performance.

Humanity’s Last Exam (HLE) — Expert QA (% Accuracy)

Humanity’s Last Exam (HLE) represents one of the most challenging benchmarks in AI evaluation, designed to test the absolute limits of model capabilities across specialized domains. This exam consists of questions that would challenge even human experts with deep domain knowledge.

The results below demonstrate how various models perform when faced with these extraordinarily difficult problems, highlighting both the impressive capabilities of the GPT-OSS models relative to other open-source offerings and the remaining gap between even the best AI systems and human expertise in specialized fields.

  • Task Difficulty: HLE stumps even the best AI models — OpenAI’s o3 (with tools) tops out at 24.9%, underscoring the challenge. All models struggle to answer these expert-level questions correctly.

  • GPT-OSS Leads Open Models: GPT-OSS-120B (with tools) scores 19.0%, the highest among open-source entries and second only to o3. Its 20B sibling follows with 17.3%. These two notably outperform other new open models by a wide margin. (The next-best open model, DeepSeek-R1, manages only 8.6%.)

  • Tools Matter: Allowing tool use roughly doubled accuracy for GPT-OSS models. For example, GPT-OSS-120B jumps from 14.9% without tools to 19.0% with tools. This highlights that even frontier models benefit significantly from tools on adversarial questions.

Tau-Bench (Function Calling) — Accuracy (%)

Tau-Bench represents a critical evaluation of function-calling capabilities in modern language models. This benchmark tests a model’s ability to accurately understand user requests and translate them into appropriate API calls — a fundamental skill for AI assistants that need to interact with external tools and services.

Function calling is essentially how AI models bridge the gap between natural language and structured actions in software systems. When a user asks to “book a flight to San Francisco next Thursday,” a model with strong function-calling abilities will parse this request and generate the precise API call needed, with all parameters correctly specified.
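
To make that concrete, here is a rough sketch of the two sides of that exchange: the tool schema an application registers, and the structured call a capable model should emit. The function name, fields, and date logic are invented for illustration and are not tied to any particular serving framework.

```python
# Hypothetical illustration of function calling: a tool definition the developer
# registers, and the structured call a capable model should emit for the request
# "book a flight to San Francisco next Thursday". Names and fields are invented.
import json
from datetime import date, timedelta

def next_thursday(today: date) -> str:
    """Resolve 'next Thursday' to a concrete ISO date (Thursday is weekday 3)."""
    days_ahead = (3 - today.weekday()) % 7 or 7
    return (today + timedelta(days=days_ahead)).isoformat()

# Tool schema the application exposes to the model (JSON-Schema style).
book_flight_tool = {
    "name": "book_flight",
    "description": "Book a one-way flight for the current user.",
    "parameters": {
        "type": "object",
        "properties": {
            "destination": {"type": "string"},
            "departure_date": {"type": "string", "format": "date"},
        },
        "required": ["destination", "departure_date"],
    },
}

# What a well-behaved model's tool call should look like for the example request.
expected_call = {
    "name": "book_flight",
    "arguments": {
        "destination": "San Francisco",
        "departure_date": next_thursday(date.today()),
    },
}

print(json.dumps(expected_call, indent=2))
```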

In the retail-focused version of Tau-Bench featured below, models are tasked with interpreting customer queries and calling the right functions to handle tasks like inventory lookups, order processing, and customer service inquiries. The results reveal significant differences in how well various models handle this crucial capability:

Accuracy on Tau-Bench Retail, a test of function-calling (tool usage) accuracy in a retail scenario.

  • Leaders Around 70%: OpenAI’s o3 is the top performer at 70.4%, closely followed by GPT-OSS-120B at 67.8%. These two are effectively tied with only a ~2.5 point difference. Both demonstrate solid ability in invoking the correct functions based on queries.

  • Competitive Field: Most advanced models score in the 60–66% range here. For instance, o4-mini manages 65.6%, DeepSeek-R1 about 65.0%, and Llama 4 Maverick about 62%. This suggests that function calling, which requires structured outputs, has seen improvements across the board in 2025, though it remains an area where even top models make errors ~35–40% of the time.

  • Smaller Model Trails: GPT-OSS-20B lags at 54.8%, performing notably worse than its 120B counterpart. Smaller context or reasoning limitations likely hurt the 20B model in this structured task. Overall, however, all the latest open models have reached broadly similar competence on function calling, with GPT-OSS-120B setting the pace among them.

The Secret Sauce: Mixture-of-Experts Architecture

How do these models achieve such impressive performance while remaining small enough to run locally? The answer lies in their Mixture-of-Experts (MoE) architecture.

Traditional language models activate all their parameters for every token they process. GPT-OSS models take a radically different approach:

  • GPT-OSS-120b: 117 billion total parameters, but only activates 5.1 billion per token

  • GPT-OSS-20b: 21 billion total parameters, but only activates 3.6 billion per token

Think of it like having a team of specialists where only the relevant experts work on each specific problem, rather than having everyone work on everything. This architectural choice enables the models to maintain the knowledge capacity of much larger models while requiring far less computational power to run.
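
To see the routing idea in miniature: a learned router scores every expert for each token, and only the few highest-scoring experts actually execute, so per-token compute tracks the active parameters rather than the total. The NumPy sketch below illustrates that pattern; the expert count, layer sizes, and top-k value are made up for readability and do not mirror the real GPT-OSS layers.

```python
# Minimal sketch of Mixture-of-Experts routing (illustrative only; sizes and
# expert counts are invented and do not match the real GPT-OSS architecture).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each "expert" is a tiny feed-forward block; only top_k of them run per token.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Apply a sparse MoE feed-forward layer to one token's activation x (shape: d_model)."""
    logits = x @ router_w                      # router scores every expert
    top = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    gate = np.exp(logits[top] - logits[top].max())
    gate = gate / gate.sum()                   # softmax over the selected experts only
    out = np.zeros_like(x)
    for g, idx in zip(gate, top):
        w_in, w_out = experts[idx]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)   # small ReLU MLP "expert"
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (64,): same shape as a dense layer, but only 2 of 8 experts ran
```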

The models also use native 4-bit quantization (MXFP4), further reducing memory requirements without significantly impacting performance.
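
A back-of-envelope estimate shows why those hardware targets are plausible. Treating MXFP4 as roughly 4.25 bits per parameter (4-bit values plus shared block scales; the exact overhead depends on block size and which tensors stay in higher precision) gives weight footprints that fit the advertised 80GB and 16GB envelopes:

```python
# Back-of-envelope memory estimate for 4-bit (MXFP4-style) weights.
# Assumes roughly 4.25 bits per parameter; the true figure depends on block size
# and on which tensors are kept in higher precision.
def weight_memory_gb(params_billions: float, bits_per_param: float = 4.25) -> float:
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

for name, params in [("gpt-oss-120b", 117), ("gpt-oss-20b", 21)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights")
# gpt-oss-120b: ~62 GB of weights -> fits on a single 80GB H100 with room for the KV cache
# gpt-oss-20b:  ~11 GB of weights -> fits within a 16GB machine alongside activations
```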

Running GPT-OSS: From Data Centers to Your Laptop

The practical deployment options for these models represent a paradigm shift in AI accessibility:

For the 120b Model:

  • Runs on a single NVIDIA H100 GPU (80GB)

  • Suitable for production workloads

  • Can be deployed on high-end workstations

For the 20b Model:

  • Runs on devices with just 16GB of memory

  • Works on consumer laptops (especially Apple Silicon Macs)

  • Ideal for edge computing and local development

Installation is remarkably simple. With Ollama, it’s typically a single command along the lines of `ollama run gpt-oss:20b`, which pulls the weights and drops you into an interactive chat (the larger model uses a corresponding `gpt-oss:120b` tag).

“That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM,” notes Willison, who tested the model extensively.

The Trade-offs: What You Need to Know

While the performance numbers are impressive, OpenAI has been transparent about limitations:

Hallucination Rates

  • GPT-OSS-120b: 49% on PersonQA benchmark

  • GPT-OSS-20b: 53% on PersonQA benchmark

  • o1: 16% (for comparison)

  • o4-mini: 36% (for comparison)

These higher hallucination rates mean the models require more careful validation of outputs, especially for factual queries.

Hallucination Comparison: GPT-OSS vs. Other Open Models

While GPT-OSS models show higher hallucination rates than OpenAI’s proprietary offerings, how do they compare to other open-source alternatives? Here’s how the landscape looks:

Hallucination Rates Across Open Models:

  • Llama 4 Maverick (Meta): ~40–45% on factual benchmarks similar to PersonQA

  • DeepSeek-R1 (DeepSeek): ~45–50% when tested on common knowledge queries

  • Phi-4 (Microsoft): ~48% hallucination rate on factual retrieval tasks

  • Gemma 3 (Google): ~52% on similar benchmarks, despite its knowledge focus

  • GPT-OSS-120b: 49% on PersonQA

  • GPT-OSS-20b: 53% on PersonQA

Mitigation Strategies in the Open-Source Ecosystem:

  • Retrieval Augmentation: Most open models now explicitly recommend RAG (Retrieval Augmented Generation) deployment to overcome hallucination issues. DeepSeek and Llama in particular have optimized their architectures to perform well in RAG setups.

  • Knowledge Cutoffs: Open models increasingly use explicit disclaimers about knowledge cutoff dates and uncertainty signals in their outputs, though these are less sophisticated than in closed models.

  • Self-consistency: Some implementations use multiple generations with voting mechanisms to reduce hallucination probability, though this increases computation costs (a minimal sketch follows this list).

  • Tool Integration: Tool use improves factuality across all models, with GPT-OSS showing among the largest gains from tool access (4–7 percentage points).
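
For the self-consistency approach mentioned above, a minimal sketch looks like the following. Here `generate` is a placeholder for whatever inference call your runtime exposes (Transformers, vLLM, Ollama, and so on); it is not a GPT-OSS-specific API.

```python
# Self-consistency by majority vote: sample several answers at non-zero temperature
# and keep the most common one. `generate` is a stand-in for a real model call.
from collections import Counter
from typing import Callable

def self_consistent_answer(prompt: str,
                           generate: Callable[[str], str],
                           n_samples: int = 5) -> str:
    """Return the answer produced most often across n_samples independent generations."""
    answers = [generate(prompt).strip() for _ in range(n_samples)]
    best, _count = Counter(answers).most_common(1)[0]
    return best

# Toy usage with a stubbed generator; replace the stub with a real inference call.
if __name__ == "__main__":
    import random
    stub = lambda _prompt: random.choice(["42", "42", "41"])
    print(self_consistent_answer("What is 6 * 7?", stub))
```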

The hallucination rates of GPT-OSS models are comparable to other leading open-source alternatives, suggesting this remains an industry-wide challenge for locally-deployable AI.

Chain-of-Thought Transparency: Unlike some models, GPT-OSS provides complete access to its reasoning process. However, OpenAI warns: “Developers should not directly show CoTs to users in their applications. They may contain hallucinatory or harmful content.”

Safety First: How OpenAI Tested These Models

Given the potential risks of releasing powerful open models, OpenAI conducted extensive safety testing:

  1. Adversarial Fine-tuning: OpenAI’s teams actively tried to fine-tune the models for malicious purposes (cyberattacks, biological weapons) using their “field-leading training stack.” The result? Even with aggressive fine-tuning, the models couldn’t reach dangerous capability levels.

  2. Independent Review: OpenAI’s Safety Advisory Group reviewed all testing and concluded the models were safe for release.

  3. Community Involvement: A $500,000 Red Team Challenge encourages researchers worldwide to identify potential safety issues.

What This Means for Different Stakeholders

For Enterprises:

  • Deploy AI on-premises with complete data control

  • Eliminate API costs for high-volume applications

  • Customize models for specific industry needs

  • Maintain compliance with data residency requirements

For Developers:

  • Experiment locally without usage limits

  • Build and test AI applications offline

  • Fine-tune for specific use cases

  • Integrate AI into edge devices

For Researchers:

  • Full access to model weights for experimentation

  • Ability to study and modify state-of-the-art architectures

  • No restrictions on academic use

  • Complete chain-of-thought visibility for interpretability research

For Emerging Markets:

  • Access to frontier AI capabilities without expensive infrastructure

  • Ability to develop localized AI solutions

  • Reduced dependency on cloud services

  • Lower barriers to AI innovation

The New Contenders: A Profile of Leading-Edge Open Source LLMs

To understand the competitive dynamics shaping the AI industry, we must first grasp the architectural characteristics and strategic goals of key models. Today’s landscape features diverse contenders from major open-weight releases by established leaders to specialized systems from emerging players. Their designs show a strategic pivot toward efficiency, specialization, and agentic capabilities, moving beyond simply pursuing larger parameter counts. The following profiles detail each major model family, creating the technical foundation for our performance analysis in later sections.

A Domain-by-Domain Deep Dive

This section presents the core data analysis of the report, translating the raw benchmark scores into a structured, comparative assessment of model capabilities. By consolidating the performance data into a master summary and then dissecting it across key domains — from pure logic and coding to agentic tool use and specialized applications — a nuanced picture of the competitive landscape emerges. This deep dive moves beyond single-score leaderboards to reveal the spiky, specialized profiles of today’s leading models.

Benchmark Overview

Performance on tasks requiring pure logical, mathematical, and algorithmic reasoning remains a key differentiator for state-of-the-art models. The Codeforces and AIME benchmarks, which test these capabilities in a rigorous and competitive format, reveal a clear hierarchy of performance and highlight the rise of specialized systems.

  • Codeforces Elo Rating: Measures coding ability based on performance in live programming contests, comparable to human competitive programmers.

  • AIME (American Invitational Mathematics Examination): Evaluates advanced mathematical reasoning on extremely difficult problems designed for elite high school mathematicians.

  • GPQA Diamond: Tests graduate-level scientific knowledge and reasoning that even non-expert humans with web access struggle to answer correctly.

  • HLE (Humanity’s Last Exam): The most challenging benchmark, requiring multi-disciplinary reasoning and expert-level knowledge across domains.

  • MMLU (Massive Multitask Language Understanding): Industry standard measure for broad knowledge across academic and professional subjects.

  • Tau-Bench Retail: Evaluates models on multi-turn, tool-using conversations in a simulated retail environment.

  • HealthBench: Simulates realistic physician-patient dialogues to test domain-specific medical conversation capabilities.

The Broader Implications: A New Era of Open AI

This release marks more than just two new models; it signals a potential shift in the AI landscape. For the first time since 2019, OpenAI is contributing to the open-source ecosystem with models that genuinely compete with proprietary offerings.

The timing is significant. With competitors like Meta’s Llama, Mistral, and China’s DeepSeek pushing the boundaries of open models, OpenAI’s entry validates the importance of open-weight AI while raising the bar for what’s possible.

“These models complement our hosted models, giving developers a wider range of tools to accelerate leading-edge research, foster innovation, and enable safer, more transparent AI development,” OpenAI states.

Getting Started Today

The models are available immediately through multiple channels:

  1. Direct Download: Available on HuggingFace with comprehensive documentation

  2. Local Tools: Ollama and LM Studio offer one-click installation

  3. Cloud Providers: Azure, AWS, and others provide hosted options

  4. Development Frameworks: Native support in Transformers, vLLM, and llama.cpp (see the sketch after this list)
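
To illustrate the Transformers route from item 4, a text-generation pipeline is typically all it takes. This is a sketch rather than official sample code; it assumes the Hugging Face checkpoint id openai/gpt-oss-20b and enough local memory for the quantized weights.

```python
# Sketch: local inference via the Transformers text-generation pipeline.
# Assumptions: the checkpoint id is "openai/gpt-oss-20b", and recent versions of
# transformers, torch, and accelerate are installed on a machine with enough memory.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    device_map="auto",   # place weights across available GPU/CPU memory
)

messages = [{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1])   # last message in the returned conversation is the reply
```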

For businesses evaluating these models, the recommendation is clear: start with GPT-OSS-20b for proof-of-concept work on standard hardware, then scale to GPT-OSS-120b for production workloads requiring maximum capability.

The Bottom Line

OpenAI’s GPT-OSS release represents a watershed moment in AI accessibility. By delivering models that approach or match proprietary performance while running on accessible hardware, OpenAI has effectively democratized advanced AI capabilities.

Whether you’re a developer tired of API costs, an enterprise seeking data sovereignty, or a researcher pushing the boundaries of what’s possible, these models offer a compelling new option. The combination of strong performance, permissive Apache 2.0 licensing, and practical deployability makes GPT-OSS a game-changer for anyone working with AI.

The future of AI just became a lot more open, and it runs on your hardware.

Want to try GPT-OSS yourself? Head to Hugging Face to download the models or install Ollama for the simplest getting-started experience. For technical documentation and implementation details, visit the official GitHub repository.

About the Author

Rick Hightower brings extensive enterprise experience as a former executive and distinguished data engineer at a Fortune 100 fintech company, where he specialized in Machine Learning and AI solutions to deliver intelligent customer experiences. His expertise spans both theoretical foundations and practical applications of AI technologies.

As a TensorFlow-certified professional and graduate of Stanford University’s comprehensive Machine Learning Specialization, Rick combines academic rigor with real-world implementation experience. His training includes mastery of supervised learning techniques, neural networks, and advanced AI concepts, which he has successfully applied to enterprise-scale solutions.

Follow Rick on LinkedIn or Medium for more enterprise AI insights. You can find all the source code for this article at this GitHub repo. Try out the notebook too.

Check out my latest article series:

  1. Transformers and the AI Revolution

  2. Why Language is Hard for AI

  3. Building Your AI Workspace

  4. Inside the Transformer Architecture

  5. Tokenization: Gateway to Understanding

  6. Prompt Engineering Fundamentals

  7. Extending Transformers Beyond Language

  8. Customizing Pipelines and Data Workflows

  9. Semantic Search and Embeddings

  10. Fine-Tuning: From Generic to Genius

  11. Hugging Face: Building Custom Language Models: From Raw Data to Production AI

  12. Hugging Face: Advanced Fine-Tuning: Chat Templates, LoRA, and SFT (in progress)

I’ve written articles on MCP, LangChain, DSPy, etc. Mostly, the articles are detailed tutorials. Please check them out from my Medium profile.
