A Paradigm Shift for Building and Testing AI in Medicine
In medicine, diagnosis is rarely a one-shot answer. It’s an unfolding process of generating, testing, and refining hypotheses. Physicians begin with an initial hunch, ask targeted questions, order tests, and revise their thinking as new clues emerge. Clinical diagnosis is a dynamic, iterative endeavor of reasoning and decision making under uncertainty.
Yet most evaluations of language models in medicine rely on static benchmarks: Here’s a case. What’s the correct answer? This quiz-like framing reduces diagnostic reasoning to a snapshot and misses the central challenge clinicians face: deciding what to do next, pursuing new clues, and updating hypotheses on the path to a diagnosis.
Today, we introduced the Sequential Diagnosis Benchmark (SDBench) and the MAI-DxO system, shifting focus toward the iterative nature of diagnostic reasoning. Our work highlights the opportunity to jointly consider both accuracy and cost-effectiveness. Full details are available in our technical article, published today. It’s been a blast working with an extraordinary team on this intensive study.
Revisiting Sequential Diagnosis
Decades ago, during my doctoral research at Stanford University, I pursued formal approaches to harnessing AI to perform diagnostic reasoning. My focus was building and testing systems that could perform cycles of hypothesis generation, evidence gathering, and hypothesis updating, grounded in Bayesian probability and decision theory. These early systems handled diagnostic cases step-by-step—referred to as sequential diagnosis—computing differential diagnoses, selecting the most informative next tests to perform, and updating probabilities as new data arrived. Probability and decision theory provides a gold-standard framework for diagnosis and broader decision making and planning.
Here's a figure that shows the flow of analysis in the models I constructed and studied in the 1980s. In this approach, an initial presentation is analyzed with a Bayesian model (Bayesian networks at the time) to formulate an initial differential diagnosis. Next, the probability distribution over possible diagnoses is used to compute the expected value of information for all candidate questions and tests, and the highest-value options are offered as recommendations to the clinician. Given the results of those tests, the evidence set is updated and the Bayesian model revises the differential diagnosis. The process continues until a decision analysis suggests stopping information gathering and acting with treatment strategies.
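To make the cycle concrete, here is a minimal Python sketch of that hypothesis-test-update loop. The model and test interfaces, and the use of expected entropy reduction as a stand-in for a full decision-theoretic value-of-information calculation, are all assumptions for illustration—not the actual systems from the 1980s.

```python
# A minimal sketch of the sequential diagnosis cycle: update the differential,
# pick the most informative next test, repeat until confident enough to act.
# The model/test interfaces are hypothetical; expected entropy reduction is a
# simplified proxy for a full decision-theoretic value-of-information analysis.

def sequential_diagnosis(model, evidence, candidate_tests, stop_threshold=0.95):
    while True:
        # 1. Revise the differential diagnosis given all evidence gathered so far.
        differential = model.posterior_over_diagnoses(evidence)

        # 2. Crude stand-in for decision-analytic stopping: act once one diagnosis
        #    dominates, or when no tests remain.
        best_dx, p_best = max(differential.items(), key=lambda kv: kv[1])
        if p_best >= stop_threshold or not candidate_tests:
            return best_dx, differential

        # 3. Score each remaining test by expected reduction in diagnostic uncertainty.
        def info_gain(test):
            expected_entropy = sum(
                model.prob_of_result(test, result, evidence)
                * model.entropy_over_diagnoses({**evidence, test: result})
                for result in test.possible_results
            )
            return model.entropy_over_diagnoses(evidence) - expected_entropy

        # 4. Recommend the most informative test, observe its result, and update.
        next_test = max(candidate_tests, key=info_gain)
        candidate_tests.remove(next_test)
        result = next_test.perform()             # e.g., ask the clinician or run the lab
        evidence = {**evidence, next_test: result}
```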
We employed this approach to build systems in several domains, including for pathology and trauma medicine. A screenshot from a trauma care system that we constructed with these methods is displayed here, showing two steps of the sequential diagnosis cycle. In this case the system is supporting a paramedic at the scene of a motorcycle crash, where the patient presents initially as unconscious, responding to painful stimuli only, and with the odor of alcohol on his breath.
Back in the early 1990s, we also pursued mobile, handsfree versions of these kinds of systems and explored their use in different settings, including our excitement about the possibility of "wearing" a sequential diagnostic reasoning system as a Bayesian thinking cap. Here's a fun video from the day of me demoing a handsfree diagnostic system.
The Bayesian systems often performed at expert levels in focused domains. However, they were labor-intensive to develop. Each required hand-coded knowledge bases, domain-specific tailoring, and manually elicited probability tables—limitations that ultimately curbed their real-world impact.
Now, returning to this long-standing challenge in the era of large language models (LLMs) feels like arriving at a familiar trailhead—with powerful, but very different tools.
Today's LLMs don’t come with explicit probabilistic engines or the guarantees that their reasoning aligns with principles of probability and utility. What they do offer is a remarkable breadth of medical knowledge and flexible, adaptive reasoning. But without the formal guarantees of earlier systems, we need to evaluate their accuracy and reliability empirically.
That’s why we’ve focused on building benchmarks that test not just the final diagnosis, but the process of diagnosis itself—step-by-step, just as it unfolds in real clinical work.
Early Explorations of Sequential Diagnosis with Language Models
When OpenAI's o1 model became available last year, I began experimenting with multistep prompts to explore whether the model could perform sequential diagnosis. I structured the prompt to guide the model through phases of analysis: generating and updating differentials, selecting high-value tests, and reasoning about when to stop and act. The example cases and results were impressive, drawing enthusiastic feedback from my colleagues in clinical practice.
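For readers curious what such a staged prompt can look like, here is a rough Python sketch of one way to structure the phases. The phase wording and the call_llm() helper are illustrative assumptions, not the exact prompts used in these experiments.

```python
# Illustrative phases for a multistep diagnostic dialogue; not the actual prompts.
PHASES = [
    "List a ranked differential diagnosis with rough probabilities.",
    "Propose the single most informative next question or test, and explain why.",
    "Given the new findings, update the differential diagnosis.",
    "Decide whether to gather more information or commit to a final diagnosis.",
]

def diagnostic_turn(call_llm, transcript, phase, new_findings="none"):
    """Run one phase of the sequential-diagnosis dialogue with an LLM."""
    prompt = (
        "You are assisting with step-by-step clinical diagnosis.\n"
        f"Case so far:\n{transcript}\n\n"
        f"New findings: {new_findings}\n\n"
        f"Task for this step: {phase}"
    )
    return call_llm(prompt)   # call_llm is a stand-in for any chat-completions wrapper
```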
Here are two points in an interactive sequential diagnosis produced with a rich, multi-stage prompt to OpenAI's o1-preview model. In some of my earliest experiments, I prompted the model to produce graphical output of the form we had implemented with the Bayesian systems of the 1990s. I presented such examples at medical grand rounds and other presentations over the past year (the full case for this example is at timecode 47:24 in my Discovery presentation at Vanderbilt University; if you're interested in the history of sequential diagnosis, you may enjoy the whole address).
In early experiments, we explored general diagnostic challenge problems like the leptospirosis case above. In other prototyping, we examined how the approach would fare on top challenges reported in the New England Journal of Medicine's Clinicopathological Conferences (CPCs), some of the most challenging diagnostic cases ever published. CPCs are selected for their complexity, instructional value, and real-world ambiguity.
Here are two steps from prompting OpenAI's o3 model on a case from a May 2025 NEJM CPC. The case was solved in a few steps.
Sequential Diagnosis Benchmark and MAI-DxO
The early studies highlighted the need for formal benchmarking of LLM-powered diagnostic reasoning. My ongoing discussions with leads of the new MAI health team were full of energy and alignment, and we quickly decided to collaborate closely on this challenge. The result is the Sequential Diagnosis Benchmark (SDBench), built from 304 of the NEJM CPCs. Creating the benchmark was far from trivial.
We transformed these static narratives into interactive, stepwise diagnostic tasks. A Gatekeeper agent withholds details about labs, tests, and imaging results until explicitly asked. Language models begin with limited information and must iteratively decide what history to gather, what tests to order, and when to commit to a diagnosis. Throughout, the system tracks the cumulative cost of the testing. The design allows the benchmark to evaluate not only accuracy, but also cost-efficiency and reasoning strategy.
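To illustrate the interaction pattern, here is a small Python sketch of the kind of loop the benchmark defines. The Gatekeeper, diagnostic-agent, and price-list interfaces below are hypothetical stand-ins for illustration, not SDBench's actual implementation.

```python
# A minimal sketch of the SDBench-style interaction: the Diagnostic agent asks,
# the Gatekeeper answers only what was explicitly requested, and costs accumulate.
# All interfaces here are hypothetical stand-ins, not the benchmark's code.

def run_case(gatekeeper, diagnostic_agent, price_list, max_turns=30):
    transcript = [gatekeeper.initial_presentation()]   # limited opening vignette
    total_cost = 0.0

    for _ in range(max_turns):
        action = diagnostic_agent.next_action(transcript)

        if action.kind == "diagnose":
            # The agent commits to a final diagnosis; accuracy is judged separately.
            return action.diagnosis, total_cost, transcript

        # Otherwise the agent requests history, an exam finding, or a test result.
        reply = gatekeeper.answer(action.request)        # withheld until asked
        total_cost += price_list.get(action.request, 0)  # cumulative cost of testing
        transcript.append((action.request, reply))

    # If the turn budget is exhausted, force a final answer.
    return diagnostic_agent.forced_diagnosis(transcript), total_cost, transcript
```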
Complementing the benchmark, we introduce the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic system that transforms a single language model into a structured diagnostic team. MAI-DxO orchestrates a panel of role-playing agents, each with a specific function: maintaining the evolving differential diagnosis, selecting high-value next tests, challenging premature closure, tracking costs, and ensuring logical consistency. A "chain of debate" among these agents drives the decision to either seek more information or commit to a final diagnosis.
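As a rough sketch of how a single model can be cast as such a panel, consider the following Python outline. The role names paraphrase the functions listed above, and the call_llm() helper and debate format are assumptions for illustration, not the MAI-DxO implementation.

```python
# A hedged sketch of a "chain of debate" among role-playing agents backed by
# one language model. Role descriptions paraphrase the functions named above;
# the prompting scheme is illustrative, not MAI-DxO's actual design.

ROLES = {
    "hypothesis":   "Maintain and update the evolving differential diagnosis.",
    "test_chooser": "Propose the highest-value next question or test.",
    "challenger":   "Argue against premature closure and surface alternatives.",
    "steward":      "Track cumulative cost and flag low-value testing.",
    "checklist":    "Check the panel's reasoning for logical consistency.",
}

def panel_round(call_llm, case_state):
    """One debate round: each role comments, then the panel decides what to do."""
    debate = []
    for role, charge in ROLES.items():
        turn = call_llm(
            f"You are the {role} agent. {charge}\n"
            f"Case state:\n{case_state}\n\nDebate so far:\n{debate}"
        )
        debate.append((role, turn))

    # The round's outcome: request one specific piece of information, or commit.
    return call_llm(
        "Given the debate below, either request one specific piece of information "
        f"or commit to a final diagnosis.\n{debate}"
    )
```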
The figure below outlines the methodology behind the Sequential Diagnosis Benchmark. At runtime, three agents operate: the Gatekeeper, the Diagnostic agent, and the Judge. The Gatekeeper mediates requests for information from the Diagnostic agent, deciding if and how to respond to its questions about patient history, examination findings, and test results. The Judge evaluates whether the Diagnostic agent's final diagnosis matches the ground truth reported in the original CPC article.
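For completeness, here is a tiny sketch of what the judging step could look like if implemented with an LLM-as-judge prompt; the prompt and matching rule below are assumptions, not the benchmark's actual rubric.

```python
# A simple stand-in for the Judge: decide whether the final diagnosis matches
# the CPC ground truth. The prompt and YES/NO parsing are illustrative only.

def judge(call_llm, final_diagnosis, ground_truth):
    verdict = call_llm(
        "Do the following refer to the same clinical diagnosis? Answer YES or NO.\n"
        f"Candidate: {final_diagnosis}\n"
        f"Reference: {ground_truth}"
    )
    return verdict.strip().upper().startswith("YES")
```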
When paired with OpenAI’s o3 model, MAI-DxO achieves 80% diagnostic accuracy compared to 20% for physicians on average. It also reduces diagnostic costs by 20% relative to physicians, and 70% compared to o3 alone. These gains generalize across model families (GPT, Claude, Gemini), underscoring a key insight: how you use a model can matter as much as which model you use.
Moving Forward
This work points toward a future in which AI employed for diagnosis functions not as a static answer engine, but as a structured, adaptive partner, capable of supporting real-world clinical reasoning in a cost-aware, context-sensitive way.
Of course, there’s more to do. CPC cases are complex and atypical. To ensure broader applicability, we’ll need new benchmarks grounded in primary care, emergency medicine, and global health. We’re also exploring real-world deployments and interactive educational settings where students can practice diagnosis with live AI feedback.
Regarding directions for evaluation, we note that, in this initial round, we asked generalist physicians to tackle complex CPC cases without access to the internet. Looking ahead, future evaluations could explore how specialists—whose expertise aligns with the specific challenges—perform on these same cases, assuming that difficult cases would be referred appropriately in practice. Another valuable direction is to assess how generalist performance improves when physicians are allowed to use the tools and resources they typically rely on in everyday clinical care.
Looking ahead, systems like MAI-DxO could become essential tools, not only for improving diagnostic accuracy but also for identifying cost-effective diagnostic pathways, especially in environments where time, resources, or access to specialists are limited. These systems can help clinicians act more judiciously, think more deeply, and bring high-quality diagnostic support to more people, in more places.
For me, this project has been deeply energizing—a return to the challenge of sequential diagnosis after many years, now with entirely new tools, collaborators, and momentum. What once felt like an uphill climb with principled yet brittle systems now feels like a wide-open frontier.
I’m thrilled to be back on this path, and even more excited for where it leads.
Additional information
Paper: Sequential Diagnosis with Language Models
Talk: Discovery Lecture: From AI Aspirations to Healthcare Futures
Earlier studies of sequential diagnosis:
Paper: Sequential diagnosis for trauma care
Paper: Sequential diagnosis for pathology
Paper: Diagnostic strategies in hypothesis-directed reasoning