GPT OSS from OpenAI — Two Powerful Open-Source/Open-Weight Models. Comparable to frontier models?
OpenAI Just Released Two Powerful Open-Weight Models That Run on Your Hardware — And They’re Surprisingly Good
Executive Summary
OpenAI has made a surprising strategic pivot by releasing its first open-weight language models since GPT-2 in 2019. The new GPT-OSS-120b and GPT-OSS-20b models, released under the permissive Apache 2.0 license, deliver performance that rivals OpenAI’s proprietary models while running efficiently on consumer hardware.
Key Takeaways:
GPT-OSS-120b (117B parameters) matches or exceeds OpenAI’s o4-mini on most benchmarks while running on a single 80GB GPU
GPT-OSS-20b (21B parameters) delivers o3-mini-level performance and runs on devices with just 16GB of memory, including consumer laptops
Both models excel at reasoning, coding, mathematics, and tool use, often outperforming much larger open-source alternatives
The models use a Mixture-of-Experts (MoE) architecture, activating only 5.1B and 3.6B parameters per token respectively
Safety-tested through adversarial fine-tuning, with a $500,000 Red Team Challenge to identify vulnerabilities
Available immediately on HuggingFace, with support from major platforms including Azure, AWS, Ollama, and LM Studio
For businesses, this means enterprise-grade AI capabilities can now be deployed on-premises with full control over data. For developers, it enables local experimentation without API costs. For the AI community, it represents a significant step toward democratizing advanced AI capabilities.
With these models, OpenAI has effectively answered critics who claimed the company had abandoned its original mission of democratizing AI. By releasing high-performance models that can run locally, they’ve significantly lowered the barriers to entry for AI development and deployment. For enterprises concerned about data privacy and sovereignty, this represents a transformative opportunity to bring cutting-edge AI capabilities in-house.
The Shocking Performance Numbers That Have Everyone Talking
Imagine running a model that performs like GPT-4 on your own hardware, without sending a single API call to the cloud. That’s essentially what OpenAI delivered today — and the benchmarks are raising eyebrows across the AI community.
“I was not expecting the open weights releases to be anywhere near that class, especially given their small sizes,” notes AI researcher Simon Willison in his analysis of the models. The numbers back up his surprise.
Head-to-Head: How GPT-OSS Stacks Up Against the Competition
Let’s dive into the benchmarks that matter most for real-world applications:
Competition Coding (Codeforces Elo Ratings)
o3 (with tools): 2706
GPT-OSS-120b (with tools): 2622
GPT-OSS-20b (with tools): 2516
DeepSeek R1: 2061
Qwen 3 235b: 2700
The GPT-OSS models achieve “incredibly strong scores and beat most humans on the entire planet at coding,” according to the benchmark analysis. Even the smaller 20b model significantly outperforms DeepSeek’s offering while using a fraction of the compute resources.
Advanced Mathematics (AIME 2024 & 2025)
Here's where things get really interesting. On competition-level mathematics problems:
AIME 2024:
o4-mini: 98.7%
GPT-OSS-120b: 96.6%
GPT-OSS-20b: 96%
o3: 95.2%
AIME 2025:
o4-mini: 99.5%
GPT-OSS-20b: 98.7% (beating the larger 120b model!)
o3: 98.4%
GPT-OSS-120b: 97.9%
“The 20 billion parameter version actually beat the 120 billion parameter version,” the benchmarks reveal, suggesting exceptional efficiency in the smaller model’s design.
Accuracy on the 2024 and 2025 AIME math competitions (15 problems, integer answers).
Near-Perfect Scores: Math competition performance is exceptionally high across top models. OpenAI’s o4-mini leads with 98–99%, narrowly ahead of the rest. GPT-OSS-20B hit 98.7% on 2025, even slightly outscoring GPT-OSS-120B (97.9%) and o3 (98.4%) that year. This is an impressive result for a 20B model.
Strong Open-Source Showings: GPT-OSS-120B achieved 96.6% on 2024, nearly matching o4-mini. Qwen-3 (235B) and Phi-4 (14B) also excel (≈90–95%), demonstrating that both massive scale and clever training can yield advanced math skills. Gemma-3 (27B) is close behind at ~89–90%.
Mid-tier Models Lag a Bit: DeepSeek-R1’s scores (≈80% in 2024, 87.5% in 2025) are substantially lower, suggesting some gap in mathematical problem-solving for that model. Overall, however, the best open models (GPT-OSS, Qwen, Phi-4) have essentially reached human-level, near-perfect accuracy on AIME-style math. This was once considered a tough domain and a major challenge for AI.
PhD-Level Science (GPQA Diamond)
o3: 83.3%
o4-mini: 81.4%
GPT-OSS-120b: 80.1%
GPT-OSS-20b: 71.5%
Accuracy on GPQA Diamond, a set of extremely advanced science questions (no tools allowed).
Top-Tier Cluster (~80%): A tight cluster of models achieved ~80% on this PhD-level QA. OpenAI’s o3 is highest at 83.3%, but o4-mini, DeepSeek-R1, Qwen-3, and GPT-OSS-120B all score between 80–81%, effectively matching each other. This parity shows that open models can now approach proprietary model performance even on very advanced reasoning tasks.
Phi-4: Small but Mighty: Phi-4 (14B) delivers 78% accuracy, only a few points shy of models an order of magnitude larger. Its ability to stay within ~5% of o3 on expert QA underscores how well this 14B model was optimized for reasoning.
Notable Laggard — Gemma-3: Gemma-3 (27B) scores only 42.4%, dramatically lower than all others. This outlier result suggests Gemma-3 struggles with the GPQA dataset; it may lack the training breadth or alignment for such open-ended, high-level questions (whereas it performs fine on structured tasks like math).
General Knowledge (MMLU)
o3: 93.4%
o4-mini: 93%
GPT-OSS-120b: 90%
GPT-OSS-20b: 85.3%
Accuracy on MMLU, a benchmark covering 57 academic subjects (middle-/high-school and college level).
Closing in on Human-Level: The top closed models o3 (93.4%) and o4-mini (93.0%) still hold a slight lead, but open models are very close. GPT-OSS-120B scores 90%, and DeepSeek-R1 actually reached 91.3%, essentially matching the OpenAI models on this broad test of knowledge.
High Achievers: Other latest models all cluster in the high 80s. Qwen-3 manages 88%, Phi-4 about 87%, Llama 4 Maverick 86%, and GPT-OSS-20B 85.3%. These results demonstrate that a wide range of modern models — open and proprietary — have powerful multidisciplinary knowledge and reasoning capabilities.
Gemma-3 Trails: At 76.9%, Gemma-3 (27B) again lags behind its peers on MMLU. While respectable, this score is roughly 10 points lower than models like GPT-OSS-20B or Llama 4, indicating Gemma-3 may require further refinement or scaling to compete on broad knowledge tasks.
The Healthcare Surprise
In an unexpected twist, both GPT-OSS models outperform OpenAI’s proprietary o1 and GPT-4o models on healthcare-related queries:
HealthBench (Realistic)
o3: 59.8%
GPT-OSS-120b: 57.6%
o4-mini: 50%
GPT-OSS-20b: 42.5%
This performance suggests potential applications in healthcare technology, though OpenAI emphasizes these models “do not replace a medical professional and are not intended for diagnosis or treatment.”
Performance on realistic health advice conversations (left) and a harder medical challenge set (right). Higher is better.
Maverick at the Top: Meta’s Llama 4 “Maverick” (multimodal, long-context model) slightly leads on health tasks, with ~60% on realistic advice and 35% on the hard set. This indicates its strength in nuanced, multimodal understanding in the medical domain.
GPT-OSS-120B is Competitive: GPT-OSS-120B scores 57.6% on realistic and 30% on hard questions, closely trailing o3 (59.8%/31.6%) and Maverick. It outperforms other open models like DeepSeek-R1 (49%/30%) on realistic queries and matches them on the hardest ones. Notably, GPT-OSS models even exceeded some older closed models (GPT-4o, o1) on HealthBench (not shown in chart).
Smaller Models Struggle with Hard Cases: GPT-OSS-20B delivers 42.5% on realistic but only 10.8% on the challenging set, showing a steep drop-off as question difficulty rises. Similarly, OpenAI’s o4-mini sees its score halved on hard vs. normal. This suggests that complex medical reasoning still taxes smaller-scale models heavily, whereas larger and more specialized models maintain performance.
Humanity’s Last Exam (HLE) — Expert QA (% Accuracy)
Humanity’s Last Exam (HLE) represents one of the most challenging benchmarks in AI evaluation, designed to test the absolute limits of model capabilities across specialized domains. This exam consists of questions that would challenge even human experts with deep domain knowledge.
The results below demonstrate how various models perform when faced with these extraordinarily difficult problems, highlighting both the impressive capabilities of the GPT-OSS models relative to other open-source offerings and the remaining gap between even the best AI systems and human expertise in specialized fields.
Task Difficulty: HLE stumps even the best AI models — OpenAI’s o3 (with tools) tops out at 24.9%, underscoring the challenge. All models struggle to answer these expert-level questions correctly.
GPT-OSS Leads Open Models: GPT-OSS-120B (with tools) scores 19.0%, the highest among open-source entries and second only to o3. Its 20B sibling follows with 17.3%. These two notably outperform other new open models by a wide margin. (The next-best open model, DeepSeek-R1, manages only 8.6%.)
Tools Matter: Tool access gives the GPT-OSS models a meaningful lift. For example, GPT-OSS-120B climbs from 14.9% without tools to 19.0% with tools. This highlights that even frontier models benefit significantly from tools on adversarial questions.
Tau-Bench (Function Calling) — Accuracy (%)
Tau-Bench represents a critical evaluation of function-calling capabilities in modern language models. This benchmark tests a model’s ability to accurately understand user requests and translate them into appropriate API calls — a fundamental skill for AI assistants that need to interact with external tools and services.
Function calling is essentially how AI models bridge the gap between natural language and structured actions in software systems. When a user asks to “book a flight to San Francisco next Thursday,” a model with strong function-calling abilities will parse this request and generate the precise API call needed, with all parameters correctly specified.
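To make this concrete, here is a minimal, hypothetical sketch of what such a request can look like against an OpenAI-compatible endpoint (for example, a locally served GPT-OSS model). The book_flight function, its parameters, the model tag, and the localhost URL are illustrative assumptions, not something defined by the benchmark:

```python
# Hypothetical function-calling sketch against an OpenAI-compatible endpoint.
# The endpoint URL, model tag, and book_flight schema are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "book_flight",  # hypothetical tool exposed by our application
        "description": "Book a flight for the user.",
        "parameters": {
            "type": "object",
            "properties": {
                "destination": {"type": "string", "description": "City to fly to"},
                "departure_date": {"type": "string", "description": "ISO date, e.g. 2025-08-14"},
            },
            "required": ["destination", "departure_date"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Book a flight to San Francisco next Thursday."}],
    tools=tools,
)

# A model with strong function-calling ability returns a structured tool call
# (name plus JSON arguments) rather than free text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```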
In the retail-focused version of Tau-Bench featured below, models are tasked with interpreting customer queries and calling the right functions to handle tasks like inventory lookups, order processing, and customer service inquiries. The results reveal significant differences in how well various models handle this crucial capability:
Accuracy on Tau-Bench Retail, a test of function-calling (tool usage) accuracy in a retail scenario.
Leaders Around 70%: OpenAI’s o3 is the top performer at 70.4%, closely followed by GPT-OSS-120B at 67.8%. These two are effectively tied with only a ~2.5 point difference. Both demonstrate solid ability in invoking the correct functions based on queries.
Competitive Field: Most advanced models score in the 60–66% range here. For instance, o4-mini manages 65.6%, DeepSeek-R1 about 65.0%, and Llama 4 Maverick roughly 62%. This suggests that function calling, which requires structured outputs, has seen improvements across the board in 2025, though it remains an area where even the best models err on roughly 30–40% of requests.
GPT-OSS-20B lags at 54.8%, performing notably worse than its 120B counterpart. Smaller context or reasoning limitations likely hurt the 20B model in this structured task. Overall, however, all the latest open models have reached broadly similar competence on function calling, with GPT-OSS-120B setting the pace among them.
The Secret Sauce: Mixture-of-Experts Architecture
How do these models achieve such impressive performance while remaining small enough to run locally? The answer lies in their Mixture-of-Experts (MoE) architecture.
Traditional language models activate all their parameters for every token they process. GPT-OSS models take a radically different approach:
GPT-OSS-120b: 117 billion total parameters, but only activates 5.1 billion per token
GPT-OSS-20b: 21 billion total parameters, but only activates 3.6 billion per token
Think of it like having a team of specialists where only the relevant experts work on each specific problem, rather than having everyone work on everything. This architectural choice enables the models to maintain the knowledge capacity of much larger models while requiring far less computational power to run.
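To illustrate the pattern (and only the pattern; this is not OpenAI's actual implementation), here is a toy top-k routing layer in PyTorch. The layer sizes, expert count, and top_k value are arbitrary stand-ins:

```python
# Toy illustration of Mixture-of-Experts routing (not OpenAI's actual code).
# A router scores every expert for each token, but only the top-k experts run.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        weights = torch.softmax(self.router(x), dim=-1)    # (tokens, num_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)  # keep only k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * self.experts[e](x[mask])
        return out


layer = ToyMoELayer(d_model=64)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]); only 2 of the 8 experts ran per token
```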
The models also use native 4-bit quantization (MXFP4), further reducing memory requirements without significantly impacting performance.
Running GPT-OSS: From Data Centers to Your Laptop
The practical deployment options for these models represent a paradigm shift in AI accessibility:
For the 120b Model:
Runs on a single NVIDIA H100 GPU (80GB)
Suitable for production workloads
Can be deployed on high-end workstations
For the 20b Model:
Runs on devices with just 16GB of memory
Works on consumer laptops (especially Apple Silicon Macs)
Ideal for edge computing and local development
Installation is remarkably simple: with Ollama, downloading and running either model comes down to a single ollama run command.
“That gpt-oss-20b model should run quite comfortably on a Mac laptop with 32GB of RAM,” notes Willison, who tested the model extensively.
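Once Ollama is serving a model, you can also call it from code, with no API key and nothing leaving your machine. The sketch below uses the ollama Python package and assumes the gpt-oss:20b tag from Ollama's model library; adjust the tag to match your local install:

```python
# Minimal sketch: chat with a locally served GPT-OSS model via the ollama
# Python package (pip install ollama). The "gpt-oss:20b" tag is an assumption
# based on Ollama's model library; change it if your local tag differs.
import ollama

reply = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Give me one sentence on why local inference matters."}],
)
print(reply["message"]["content"])
```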
The Trade-offs: What You Need to Know
While the performance numbers are impressive, OpenAI has been transparent about limitations:
Hallucination Rates
GPT-OSS-120b: 49% on PersonQA benchmark
GPT-OSS-20b: 53% on PersonQA benchmark
o1: 16% (for comparison)
o4-mini: 36% (for comparison)
These higher hallucination rates mean the models require more careful validation of outputs, especially for factual queries.
Hallucination Comparison: GPT-OSS vs. Other Open Models
While GPT-OSS models show higher hallucination rates than OpenAI’s proprietary offerings, how do they compare to other open-source alternatives? Here’s how the landscape looks:
Hallucination Rates Across Open Models:
Llama 4 Maverick (Meta): ~40–45% on factual benchmarks similar to PersonQA
DeepSeek-R1 (DeepSeek): ~45–50% when tested on common knowledge queries
Phi-4 (Microsoft): ~48% hallucination rate on factual retrieval tasks
Gemma 3 (Google): ~52% on similar benchmarks, despite its knowledge focus
GPT-OSS-120b: 49% on PersonQA
GPT-OSS-20b: 53% on PersonQA
Mitigation Strategies in the Open-Source Ecosystem:
Retrieval Augmentation: Most open models now explicitly recommend RAG (Retrieval Augmented Generation) deployment to overcome hallucination issues. DeepSeek and Llama in particular have optimized their architectures to perform well in RAG setups.
Knowledge Cutoffs: Open models increasingly use explicit disclaimers about knowledge cutoff dates and uncertainty signals in their outputs, though these are less sophisticated than in closed models.
Self-consistency: Some implementations use multiple generations with voting mechanisms to reduce hallucination probability, though this increases computation costs (a minimal sketch of this idea follows this list).
Tool Integration: Tool use improves factuality across all models, with GPT-OSS showing among the largest gains from tool access (4–7 percentage points).
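To make the self-consistency idea concrete, here is a minimal sketch that samples the same question several times and keeps the majority answer. The ask callable is a placeholder for whatever client you use (Ollama, vLLM, Transformers); only the voting logic matters:

```python
# Self-consistency sketch: sample the same question N times and keep the
# majority answer. `ask` is a placeholder for your own model client.
from collections import Counter
from typing import Callable


def self_consistent_answer(ask: Callable[[str], str], question: str, n_samples: int = 5) -> str:
    answers = [ask(question).strip().lower() for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    if count == 1:
        # No agreement at all: treat as low confidence rather than trusting one sample.
        return "uncertain: answers did not agree"
    return majority


# Example with a stand-in client that always answers the same thing:
print(self_consistent_answer(lambda q: "Paris", "Capital of France?"))  # prints "paris"
```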
The hallucination rates of GPT-OSS models are comparable to other leading open-source alternatives, suggesting this remains an industry-wide challenge for locally-deployable AI.
Chain-of-Thought Transparency: Unlike some models, GPT-OSS provides complete access to its reasoning process. However, OpenAI warns: “Developers should not directly show CoTs to users in their applications. They may contain hallucinatory or harmful content.”
Safety First: How OpenAI Tested These Models
Given the potential risks of releasing powerful open models, OpenAI conducted extensive safety testing:
Adversarial Fine-tuning: OpenAI’s teams actively tried to fine-tune the models for malicious purposes (cyberattacks, biological weapons) using their “field-leading training stack.” The result? Even with aggressive fine-tuning, the models couldn’t reach dangerous capability levels.
Independent Review: OpenAI’s Safety Advisory Group reviewed all testing and concluded the models were safe for release.
Community Involvement: A $500,000 Red Team Challenge encourages researchers worldwide to identify potential safety issues.
What This Means for Different Stakeholders
For Enterprises:
Deploy AI on-premises with complete data control
Eliminate API costs for high-volume applications
Customize models for specific industry needs
Maintain compliance with data residency requirements
For Developers:
Experiment locally without usage limits
Build and test AI applications offline
Fine-tune for specific use cases
Integrate AI into edge devices
For Researchers:
Full access to model weights for experimentation
Ability to study and modify state-of-the-art architectures
No restrictions on academic use
Complete chain-of-thought visibility for interpretability research
For Emerging Markets:
Access to frontier AI capabilities without expensive infrastructure
Ability to develop localized AI solutions
Reduced dependency on cloud services
Lower barriers to AI innovation
The New Contenders: A Profile of Leading-Edge Open Source LLMs
To understand the competitive dynamics shaping the AI industry, we must first grasp the architectural characteristics and strategic goals of key models. Today’s landscape features diverse contenders from major open-weight releases by established leaders to specialized systems from emerging players. Their designs show a strategic pivot toward efficiency, specialization, and agentic capabilities, moving beyond simply pursuing larger parameter counts. The following profiles detail each major model family, creating the technical foundation for our performance analysis in later sections.
A Domain-by-Domain Deep Dive
This section presents the core data analysis of this article, translating the raw benchmark scores into a structured, comparative assessment of model capabilities. By consolidating the performance data into a master summary and then dissecting it across key domains — from pure logic and coding to agentic tool use and specialized applications — a nuanced picture of the competitive landscape emerges. This deep dive moves beyond single-score leaderboards to reveal the spiky, specialized profiles of today’s leading models.
Benchmark Overview
Performance on tasks requiring pure logical, mathematical, and algorithmic reasoning remains a key differentiator for state-of-the-art models. The Codeforces and AIME benchmarks, which test these capabilities in a rigorous and competitive format, reveal a clear hierarchy of performance and highlight the rise of specialized systems.
Codeforces Elo Rating: Measures coding ability based on performance in live programming contests, comparable to human competitive programmers.
AIME (American Invitational Mathematics Examination): Evaluates advanced mathematical reasoning on extremely difficult problems designed for elite high school mathematicians.
GPQA Diamond: Tests graduate-level scientific knowledge and reasoning that even non-expert humans with web access struggle to answer correctly.
HLE (Humanity’s Last Exam): The most challenging benchmark, requiring multi-disciplinary reasoning and expert-level knowledge across domains.
MMLU (Massive Multitask Language Understanding): Industry standard measure for broad knowledge across academic and professional subjects.
Tau-Bench Retail: Evaluates models on multi-turn, tool-using conversations in a simulated retail environment.
HealthBench: Simulates realistic physician-patient dialogues to test domain-specific medical conversation capabilities.
The Broader Implications: A New Era of Open AI
This release marks more than just two new models; it signals a potential shift in the AI landscape. For the first time since 2019, OpenAI is contributing to the open-source ecosystem with models that genuinely compete with proprietary offerings.
The timing is significant. With competitors like Meta’s Llama, Mistral, and China’s DeepSeek pushing the boundaries of open models, OpenAI’s entry validates the importance of open-weight AI while raising the bar for what’s possible.
“These models complement our hosted models, giving developers a wider range of tools to accelerate leading-edge research, foster innovation, and enable safer, more transparent AI development,” OpenAI states.
Getting Started Today
The models are available immediately through multiple channels:
Direct Download: Available on HuggingFace with comprehensive documentation
Local Tools: Ollama and LM Studio offer one-click installation
Cloud Providers: Azure, AWS, and others provide hosted options
Development Frameworks: Native support in Transformers, vLLM, and llama.cpp (a minimal Transformers sketch follows this list)
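As an example of the Transformers route, here is a minimal sketch. It assumes the openai/gpt-oss-20b model id on Hugging Face, a recent transformers release with chat-style pipeline inputs, accelerate installed for device placement, and roughly 16GB of GPU or unified memory:

```python
# Minimal sketch: run gpt-oss-20b locally with Hugging Face Transformers.
# Assumes the "openai/gpt-oss-20b" model id and enough GPU/unified memory;
# requires transformers and accelerate.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # let Transformers pick the appropriate dtype for the checkpoint
    device_map="auto",    # spread the weights across available GPU/CPU memory
)

messages = [{"role": "user", "content": "Explain chain-of-thought prompting in one paragraph."}]
output = generator(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])  # last message is the model's reply
```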
For businesses evaluating these models, the recommendation is clear: start with GPT-OSS-20b for proof-of-concept work on standard hardware, then scale to GPT-OSS-120b for production workloads requiring maximum capability.
The Bottom Line
OpenAI’s GPT-OSS release represents a watershed moment in AI accessibility. By delivering models that approach or match proprietary performance while running on accessible hardware, OpenAI has effectively democratized advanced AI capabilities.
Whether you’re a developer tired of API costs, an enterprise seeking data sovereignty, or a researcher pushing the boundaries of what’s possible, these models offer a compelling new option. The combination of strong performance, true open-source licensing, and practical deployability makes GPT-OSS a game-changer for anyone working with AI.
The future of AI just became a lot more open, and it runs on your hardware.
Want to try GPT-OSS yourself? Head to Hugging Face to download the models or install Ollama for the simplest getting-started experience. For technical documentation and implementation details, visit the official GitHub repository.
About the Author
Rick Hightower brings extensive enterprise experience as a former executive and distinguished data engineer at a Fortune 100 fintech company, where he specialized in Machine Learning and AI solutions to deliver intelligent customer experiences. His expertise spans both theoretical foundations and practical applications of AI technologies.
As a TensorFlow-certified professional and graduate of Stanford University’s comprehensive Machine Learning Specialization, Rick combines academic rigor with real-world implementation experience. His training includes mastery of supervised learning techniques, neural networks, and advanced AI concepts, which he has successfully applied to enterprise-scale solutions.
Follow Rick on LinkedIn or Medium for more enterprise AI insights. You can find all the source code for this article at this github repo. Try out the notebook too.
Check out my latest article series:
Semantic Search and Embeddings (Article 9)
Fine-Tuning: From Generic to Genius (Article 10)
Hugging Face: Building Custom Language Models: From Raw Data to Production AI (Article 11)
Hugging Face: Article 12 — Advanced Fine-Tuning: Chat Templates, LoRA, and SFT (in progress, should come out this week)
I’ve written articles on MCP, LangChain, DSPy, etc. Mostly, the articles are detailed tutorials. Please check them out from my Medium profile.
Article References
Qwen-3 Benchmarks & Specs — https://dev.to/best_codes/qwen-3-benchmarks-comparisons-model-specifications-and-more-4hoa
Qwen-3 Coder Analysis — https://apidog.com/blog/qwen3-coder/
DeepSeek-R1 Model Card — https://huggingface.co/deepseek-ai/DeepSeek-R1
DeepSeek-R1 Tech Report — https://arxiv.org/html/2504.07343v1
Llama 4 Maverick Docs — https://docsbot.ai/models/llama-4-maverick
Gemma-3 27B Leaderboard — https://rankedagi.com/models/gemma-3-27b
Llama 4 Maverick/Scout Benchmarks — https://www.reddit.com/r/LocalLLaMA/comments/1jv9xxo/benchmark_results_for_llama_4_maverick_and_scout/
HLE Benchmark Guide — https://www.promptfoo.dev/docs/guides/hle-benchmark/
Qwen-3 Summer Release Article — https://venturebeat.com/ai/its-qwens-summer-new-open-source-qwen3-235b-a22b-thinking-2507-tops-openai-gemini-reasoning-models-on-key-benchmarks/
Microsoft Phi-4 Technical Report — https://www.microsoft.com/en-us/research/wp-content/uploads/2024/12/P4TechReport.pdf
Phi-4 Benchmark Round-up — https://bestcodes.dev/blog/phi-4-benchmarks-and-info
Phi-4 vs GPT-4 Comparison — https://llm-stats.com/models/compare/gpt-4-0613-vs-phi-4
DeepSeek-R1 vs GPT-O1 — https://www.leanware.co/insights/deepseek-r1-vs-gpt-o1
Phi-4 Release Blog (DL.ai: The Batch) — https://www.deeplearning.ai/the-batch/microsofts-phi-4-blends-synthetic-and-organic-data-to-surpass-larger-models-in-math-and-reasoning-benchmarks/
Gemma-3 Blog Post — https://huggingface.co/blog/gemma3
GPT-OSS-120B Outperforms DeepSeek-R1 — https://www.reddit.com/r/LocalLLaMA/comments/1mifuqk/gptoss120b_outperforms_deepseekr10528_in/
TechCrunch on Phi-4 — https://techcrunch.com/2025/04/30/microsofts-most-capable-new-phi-4-ai-model-rivals-the-performance-of-far-larger-systems/
Llama 4 Comparison Article — https://blog.getbind.co/2025/04/06/llama-4-comparison-with-claude-3-7-sonnet-gpt-4-5-and-gemini-2-5/
PromptLayer Coding Models Review — https://blog.promptlayer.com/best-llms-for-coding/
OpenAI GPT-OSS Announcement — https://openai.com/index/introducing-gpt-oss/
Simon Willison on GPT-OSS — https://simonwillison.net/2025/Aug/5/gpt-oss/