A Paradigm Shift for Building and Testing AI in Medicine
In medicine, diagnosis is rarely a one-shot answer. It’s an unfolding process of generating, testing, and refining hypotheses. Physicians begin with an initial hunch, ask targeted questions, order tests, and revise their thinking as new clues emerge. Clinical diagnosis is a dynamic, iterative endeavor of reasoning and decision making under uncertainty.
Yet most evaluations of language models in medicine rely on static benchmarks: Here’s a case. What’s the correct answer? This quiz-like framing reduces diagnostic reasoning to a snapshot and misses the central challenge clinicians face: deciding what to do next, pursuing new clues, and updating hypotheses on the path to a diagnosis.
Today, we introduced the Sequential Diagnosis Benchmark (SDBench) and the MAI-DxO system, shifting focus toward the iterative nature of diagnostic reasoning. Our work highlights the opportunity to jointly consider both accuracy and cost-effectiveness. Full details are available in our technical article, published today. It’s been a blast working with an extraordinary team on this intensive study.
Revisiting Sequential Diagnosis
Decades ago, during my doctoral research at Stanford University, I pursued formal approaches to harnessing AI to perform diagnostic reasoning. My focus was building and testing systems that could perform cycles of hypothesis generation, evidence gathering, and hypothesis updating, grounded in Bayesian probability and decision theory. These early systems handled diagnostic cases step-by-step—referred to as sequential diagnosis—computing differential diagnoses, selecting the most informative next tests to perform, and updating probabilities as new data arrived. Probability and decision theory provides a gold-standard framework for diagnosis and broader decision making and planning.
Here's a figure that shows the flow of analysis in the models I constructed and studied in the 1980s. In this approach, an initial presentation is analyzed with a Bayesian model (Bayesian networks at the time) to formulate an initial differential diagnosis. Next, the probability distribution over possible diagnoses is used to compute the expected value of information for all candidate questions and tests, and the highest-value options are offered as recommendations to the clinician. Given the results of those tests, the evidence set is updated and the Bayesian model revises the differential diagnosis. The process continues until a decision analysis suggests stopping information gathering and acting with treatment strategies.
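To make the cycle concrete, here is a minimal Python sketch of that hypothesis-test-update loop. The model and test interfaces, and the use of expected entropy reduction as a stand-in for a full decision-theoretic value-of-information calculation, are all assumptions for illustration—not the actual systems from the 1980s.

```python
# A minimal sketch of the sequential diagnosis cycle: update the differential,
# pick the most informative next test, repeat until confident enough to act.
# The model/test interfaces are hypothetical; expected entropy reduction is a
# simplified proxy for a full decision-theoretic value-of-information analysis.

def sequential_diagnosis(model, evidence, candidate_tests, stop_threshold=0.95):
    while True:
        # 1. Revise the differential diagnosis given all evidence gathered so far.
        differential = model.posterior_over_diagnoses(evidence)

        # 2. Crude stand-in for decision-analytic stopping: act once one diagnosis
        #    dominates, or when no tests remain.
        best_dx, p_best = max(differential.items(), key=lambda kv: kv[1])
        if p_best >= stop_threshold or not candidate_tests:
            return best_dx, differential

        # 3. Score each remaining test by expected reduction in diagnostic uncertainty.
        def info_gain(test):
            expected_entropy = sum(
                model.prob_of_result(test, result, evidence)
                * model.entropy_over_diagnoses({**evidence, test: result})
                for result in test.possible_results
            )
            return model.entropy_over_diagnoses(evidence) - expected_entropy

        # 4. Recommend the most informative test, observe its result, and update.
        next_test = max(candidate_tests, key=info_gain)
        candidate_tests.remove(next_test)
        result = next_test.perform()             # e.g., ask the clinician or run the lab
        evidence = {**evidence, next_test: result}
```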
We employed this approach to build systems in several domains, including for pathology and trauma medicine. A screenshot from a trauma care system that we constructed with these methods is displayed here, showing two steps of the sequential diagnosis cycle. In this case the system is supporting a paramedic at the scene of a motorcycle crash, where the patient presents initially as unconscious, responding to painful stimuli only, and with the odor of alcohol on his breath.
Back in the early 1990s, we also pursued mobile, handsfree versions of these kinds of systems and explored their use in different settings, including our excitement about the possibility of "wearing" a sequential diagnostic reasoning system as a Bayesian thinking cap. Here's a fun video from the day of me demoing a handsfree diagnostic system.
The Bayesian systems often performed at expert levels in focused domains. However, they were labor-intensive to develop. Each required hand-coded knowledge bases, domain-specific tailoring, and manually elicited probability tables—limitations that ultimately curbed their real-world impact.
Now, returning to this long-standing challenge in the era of large language models (LLMs) feels like arriving at a familiar trailhead—with powerful, but very different tools.
Today's LLMs don’t come with explicit probabilistic engines or the guarantees that their reasoning aligns with principles of probability and utility. What they do offer is a remarkable breadth of medical knowledge and flexible, adaptive reasoning. But without the formal guarantees of earlier systems, we need to evaluate their accuracy and reliability empirically.
That’s why we’ve focused on building benchmarks that test not just the final diagnosis, but the process of diagnosis itself—step-by-step, just as it unfolds in real clinical work.
Early Explorations of Sequential Diagnosis with Language Models
When OpenAI's o1 model became available last year, I began experimenting with multistep prompts to explore whether the model could perform sequential diagnosis. I structured the prompt to guide the model through phases of analysis: generating and updating differentials, selecting high-value tests, and reasoning about when to stop and act. The example cases and results were impressive, drawing enthusiastic feedback from my colleagues in clinical practice.
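For readers curious what such a staged prompt can look like, here is a rough Python sketch of one way to structure the phases. The phase wording and the call_llm() helper are illustrative assumptions, not the exact prompts used in these experiments.

```python
# Illustrative phases for a multistep diagnostic dialogue; not the actual prompts.
PHASES = [
    "List a ranked differential diagnosis with rough probabilities.",
    "Propose the single most informative next question or test, and explain why.",
    "Given the new findings, update the differential diagnosis.",
    "Decide whether to gather more information or commit to a final diagnosis.",
]

def diagnostic_turn(call_llm, transcript, phase, new_findings="none"):
    """Run one phase of the sequential-diagnosis dialogue with an LLM."""
    prompt = (
        "You are assisting with step-by-step clinical diagnosis.\n"
        f"Case so far:\n{transcript}\n\n"
        f"New findings: {new_findings}\n\n"
        f"Task for this step: {phase}"
    )
    return call_llm(prompt)   # call_llm is a stand-in for any chat-completions wrapper
```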
Here are two points in an interactive sequential diagnosis produced with a rich, multi-stage prompt to OpenAI's o1-preview model. In some of my earliest experiments, I prompted the model to produce graphical output of the form we had implemented with the Bayesian systems of the 1990s. I presented such examples at medical grand rounds and other presentations over the past year (the full case for this example is at timecode 47:24 in my Discovery presentation at Vanderbilt University; if you're interested in the history of sequential diagnosis, you may enjoy the whole address).
In early experiments, we explored general diagnostic challenge problems like the leptospirosis case above. In other prototyping, we examined how the approach would fare on top challenges reported in the New England Journal of Medicine's Clinicopathological Conferences (CPCs), some of the most challenging diagnostic cases ever published. CPCs are selected for their complexity, instructional value, and real-world ambiguity.
Here are two steps from prompting OpenAI's o3 model on a case from a May 2025 NEJM CPC. The case was solved in a few steps.
Sequential Diagnosis Benchmark and MAI-DxO
The early studies highlighted the need for formal benchmarking of LLM-powered diagnostic reasoning. My ongoing discussions with leads of the new MAI health team were full of energy and alignment, and we quickly decided to collaborate closely on this challenge. The result is the Sequential Diagnosis Benchmark (SDBench), built from 304 of the NEJM CPCs. Creating the benchmark was far from trivial.
We transformed these static narratives into interactive, stepwise diagnostic tasks. A Gatekeeper agent withholds details about labs, tests, and imaging results until explicitly asked. Language models begin with limited information and must iteratively decide what history to gather, what tests to order, and when to commit to a diagnosis. Throughout, the system tracks the cumulative cost of the testing. The design allows the benchmark to evaluate not only accuracy, but also cost-efficiency and reasoning strategy.
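To illustrate the interaction pattern, here is a small Python sketch of the kind of loop the benchmark defines. The Gatekeeper, diagnostic-agent, and price-list interfaces below are hypothetical stand-ins for illustration, not SDBench's actual implementation.

```python
# A minimal sketch of the SDBench-style interaction: the Diagnostic agent asks,
# the Gatekeeper answers only what was explicitly requested, and costs accumulate.
# All interfaces here are hypothetical stand-ins, not the benchmark's code.

def run_case(gatekeeper, diagnostic_agent, price_list, max_turns=30):
    transcript = [gatekeeper.initial_presentation()]   # limited opening vignette
    total_cost = 0.0

    for _ in range(max_turns):
        action = diagnostic_agent.next_action(transcript)

        if action.kind == "diagnose":
            # The agent commits to a final diagnosis; accuracy is judged separately.
            return action.diagnosis, total_cost, transcript

        # Otherwise the agent requests history, an exam finding, or a test result.
        reply = gatekeeper.answer(action.request)        # withheld until asked
        total_cost += price_list.get(action.request, 0)  # cumulative cost of testing
        transcript.append((action.request, reply))

    # If the turn budget is exhausted, force a final answer.
    return diagnostic_agent.forced_diagnosis(transcript), total_cost, transcript
```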
Complementing the benchmark, we introduce the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic system that transforms a single language model into a structured diagnostic team. MAI-DxO orchestrates a panel of role-playing agents, each with a specific function: maintaining the evolving differential diagnosis, selecting high-value next tests, challenging premature closure, tracking costs, and ensuring logical consistency. A "chain of debate" among these agents drives the decision to either seek more information or commit to a final diagnosis.
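As a rough sketch of how a single model can be cast as such a panel, consider the following Python outline. The role names paraphrase the functions listed above, and the call_llm() helper and debate format are assumptions for illustration, not the MAI-DxO implementation.

```python
# A hedged sketch of a "chain of debate" among role-playing agents backed by
# one language model. Role descriptions paraphrase the functions named above;
# the prompting scheme is illustrative, not MAI-DxO's actual design.

ROLES = {
    "hypothesis":   "Maintain and update the evolving differential diagnosis.",
    "test_chooser": "Propose the highest-value next question or test.",
    "challenger":   "Argue against premature closure and surface alternatives.",
    "steward":      "Track cumulative cost and flag low-value testing.",
    "checklist":    "Check the panel's reasoning for logical consistency.",
}

def panel_round(call_llm, case_state):
    """One debate round: each role comments, then the panel decides what to do."""
    debate = []
    for role, charge in ROLES.items():
        turn = call_llm(
            f"You are the {role} agent. {charge}\n"
            f"Case state:\n{case_state}\n\nDebate so far:\n{debate}"
        )
        debate.append((role, turn))

    # The round's outcome: request one specific piece of information, or commit.
    return call_llm(
        "Given the debate below, either request one specific piece of information "
        f"or commit to a final diagnosis.\n{debate}"
    )
```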
The figure below outlines the methodology behind the Sequential Diagnosis Benchmark. At runtime, three agents operate: the Gatekeeper, the Diagnostic agent, and the Judge. The Gatekeeper mediates requests for information from the Diagnostic agent, deciding if and how to respond to its questions about patient history, examination findings, and test results. The Judge evaluates whether the Diagnostic agent's final diagnosis matches the ground truth reported in the original CPC article.
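For completeness, here is a tiny sketch of what the judging step could look like if implemented with an LLM-as-judge prompt; the prompt and matching rule below are assumptions, not the benchmark's actual rubric.

```python
# A simple stand-in for the Judge: decide whether the final diagnosis matches
# the CPC ground truth. The prompt and YES/NO parsing are illustrative only.

def judge(call_llm, final_diagnosis, ground_truth):
    verdict = call_llm(
        "Do the following refer to the same clinical diagnosis? Answer YES or NO.\n"
        f"Candidate: {final_diagnosis}\n"
        f"Reference: {ground_truth}"
    )
    return verdict.strip().upper().startswith("YES")
```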
When paired with OpenAI’s o3 model, MAI-DxO achieves 80% diagnostic accuracy compared to 20% for physicians on average. It also reduces diagnostic costs by 20% relative to physicians, and 70% compared to o3 alone. These gains generalize across model families (GPT, Claude, Gemini), underscoring a key insight: how you use a model can matter as much as which model you use.
Moving Forward
This work points toward a future in which AI employed for diagnosis functions not as a static answer engine, but as a structured, adaptive partner, capable of supporting real-world clinical reasoning in a cost-aware, context-sensitive way.
Of course, there’s more to do. CPC cases are complex and atypical. To ensure broader applicability, we’ll need new benchmarks grounded in primary care, emergency medicine, and global health. We’re also exploring real-world deployments and interactive educational settings where students can practice diagnosis with live AI feedback.
Regarding directions for evaluation, we note that, in this initial round, we asked generalist physicians to tackle complex CPC cases without access to the internet. Looking ahead, future evaluations could explore how specialists—whose expertise aligns with the specific challenges—perform on these same cases, assuming that difficult cases would be referred appropriately in practice. Another valuable direction is to assess how generalist performance improves when physicians are allowed to use the tools and resources they typically rely on in everyday clinical care.
Looking ahead, systems like MAI-DxO could become essential tools, not only for improving diagnostic accuracy but also for identifying cost-effective diagnostic pathways, especially in environments where time, resources, or access to specialists are limited. These systems can help clinicians act more judiciously, think more deeply, and bring high-quality diagnostic support to more people, in more places.
For me, this project has been deeply energizing—a return to the challenge of sequential diagnosis after many years, now with entirely new tools, collaborators, and momentum. What once felt like an uphill climb with principled yet brittle systems now feels like a wide-open frontier.
I’m thrilled to be back on this path, and even more excited for where it leads.
Additional information
Paper: Sequential Diagnosis with Language Models
Talk: Discovery Lecture: From AI Aspirations to Healthcare Futures
Earlier studies of sequential diagnosis:
Paper: Sequential diagnosis for trauma care
Paper: Sequential diagnosis for pathology
Paper: Diagnostic strategies in hypothesis-directed reasoning