Reflections on the AI Airlock: Charting Practical, Globally Relevant Pathways for LLMs in Healthcare

🌀 RIFF | Edition #3

Reflections on the MHRA’s AI sandbox and where we go from here

Yesterday’s MHRA webinar marked the close of Phase 1 of its AI Airlock sandbox programme and announced new government funding for a second cohort. The initiative was framed as a success story in adaptive regulation—tempered with pragmatism and an honest recognition of the challenges agencies face when implementing sandbox approaches. It also signalled continued government appetite for AI-enabled innovation in healthcare.

Yet beneath the surface, some tensions remain unresolved. Having followed the programme closely and participated in yesterday’s session, I left with mixed feelings—not because the effort lacks merit, but because its core purpose still feels under-articulated.

Here are four reflections—on what the UK sandbox programme did well, and what may need rethinking.

What Is This Sandbox For?

A regulatory sandbox is typically a structured, time-bound environment for safe, collaborative experimentation. It enables regulators, innovators, clinicians, and patients to evaluate emerging technologies under guided conditions—testing design hypotheses without compromising patient safety or public health. In the case of AI, such testing must integrate both foundation-model-level and application-specific risk-tiering. The purpose is to generate transparent, shareable, and regulatory-grade evidence to inform both pre-market approval and post-market surveillance.

Given the complexity introduced by the continual-learning nature of AI as a Medical Device (AIaMD), Phase 1 of the Airlock diverged from this traditional sandbox model. Rather than establishing a testbed to assess evidence-generation strategies or delivery pathways for AI-enabled medical devices, it focused primarily on exploratory risk mapping—particularly the uncertainties introduced by general-purpose AI (GPAI) models, most notably large language models (LLMs).

Notable characteristics of Phase 1:

  • One-to-one consultations encouraged participants to reflect on potential risks and compliance challenges in their specific application areas.
  • The process leaned heavily on hypothetical regulatory futures informed by broad AI ethics principles, rather than adapting existing compliance pathways to the novel features of adaptive AI systems.
  • There was no clear articulation of regulatory benchmarks, performance thresholds, or expectations tied to specific intended uses or clinical contexts.

To be meaningful in future phases, the sandbox must evolve beyond abstract risk reflection. It should focus on delivery-oriented evidence strategies that clearly specify the types of data, operational conditions, and validation protocols required to bring AI-enabled technologies safely and effectively to market—anchored within existing regulatory frameworks, not abstracted from them.

With LLMs increasingly serving as the foundation for AI systems, sandbox activities must support evidence development at two levels:

  • Foundation model level: including generalisation capacity, population-health impact, and mitigation of structural bias.
  • Application domain level: including clinical workflow integration, safety, reliability, and effectiveness in real-world contexts.

Models informed by predicate-device analogies (e.g., the FDA’s 510(k) pathway) and the safe originator concept (proposed in the implementation of the EU AI Act) can be useful in structuring this approach. Crucially, any regulatory submission should clearly articulate both a delivery path and an accompanying evidence-generation strategy.
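To make this concrete, here is a minimal sketch of how a submission might pair a delivery path with evidence expectations at both levels. It is illustrative only: the class and field names are my own shorthand, not MHRA or EU AI Act terminology.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FoundationModelEvidence:
    """Evidence expected at the foundation-model level (illustrative fields)."""
    generalisation_tests: List[str] = field(default_factory=list)        # e.g. out-of-distribution benchmarks
    population_impact_analyses: List[str] = field(default_factory=list)  # population-health impact
    bias_mitigation_audits: List[str] = field(default_factory=list)      # structural-bias mitigation

@dataclass
class ApplicationEvidence:
    """Evidence expected for a specific clinical use of the model (illustrative fields)."""
    intended_use: str                                                     # defined purpose and user population
    workflow_integration_studies: List[str] = field(default_factory=list)
    safety_and_reliability_tests: List[str] = field(default_factory=list)
    real_world_effectiveness_studies: List[str] = field(default_factory=list)

@dataclass
class SandboxSubmission:
    """A submission pairing a delivery path with its two-level evidence-generation strategy."""
    delivery_path: str                     # e.g. "pilot in two testbed hospitals, then phased rollout"
    foundation_evidence: FoundationModelEvidence
    application_evidence: ApplicationEvidence
```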

✅ Recommendation 1: Define and prioritise delivery-focused evidence strategies as the core function of the sandbox.

These strategies should align with regulatory approval and post-market requirements, explicitly connecting foundation-level AI evaluation with use-case-specific validation in healthcare delivery.

Focus on AI Function—Not AI as a Concept

Much of the discussion in Phase 1 became bogged down in conceptual debates about AI—especially around LLMs and model features, such as the use of synthetic data. This focus risks obscuring the regulatory task at hand.

Regulators such as the MHRA, FDA, and EMA do not regulate AI models in the abstract—they regulate medical products with a defined intended purpose, risk classification, and user population. This distinction is fundamental, and it is reflected in both the EU AI Act and the emerging Code of Practice on GPAI.

The sandbox should have centred on how AI models and systems are used, for whom, and in what clinical or operational context. Without this specificity, sandbox activities risk amplifying confusion, not resolving it.

✅ Recommendation 2: Regulate use, not models.

LLMs are general-purpose technologies and should remain outside the scope of direct medical regulation. When they are embedded in regulated products, however, the surrounding strategies—such as prompting, safety filters, and model update procedures—can and should be evaluated, not in the abstract but through adaptive regulatory frameworks applied to the product as a whole.
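As a toy illustration of where that regulatory boundary sits, the sketch below wraps a general-purpose model inside product-level safeguards. The callables `llm` and `safety_filter` are hypothetical placeholders; the point is that the prompting strategy, the filter, and the fallback behaviour are what a regulator would examine, not the underlying model in isolation.

```python
def answer_with_safeguards(query: str, llm, safety_filter, prompt_template: str) -> str:
    """Hypothetical product-level pipeline: the regulated object is this wrapper,
    not the general-purpose LLM it calls."""
    prompt = prompt_template.format(query=query)        # prompting strategy under review
    draft = llm(prompt)                                  # general-purpose model, outside direct scope
    if not safety_filter(draft):                         # product-level safeguard under review
        return "Escalated to a clinician for review."    # defined, auditable fallback behaviour
    return draft
```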

The focus should be on:

  • The product's intended medical use (for example, predicting burnout among healthcare professionals), pursued through a defined delivery-focused evidence strategy.
  • The delivery modality—the broader study configuration that will implement the evidence strategy.
  • The evidence required to ensure safe, effective, and equitable performance.

Simulated Environments Are Not Real Testbeds

A foundational feature of regulatory sandboxes is access to real-world data in controlled, ethically governed environments. Yet, Phase 1 relied almost entirely on simulated case studies—hypothetical scenarios without real patient involvement, live clinical systems, or operational datasets.

Simulation can be useful in early design, but it is not a substitute for:

  • Regulated access to real or near-live data
  • Secure and compliant validation environments
  • Operational testing under real clinical conditions

This absence reinforces the “grey zones” that regulatory initiatives are meant to eliminate—particularly in the UK, where uncertainty around data access and validation frameworks continues to hinder AI deployment.

✅ Recommendation 3: Embed real data governance and validation pathways.

To be fit for purpose, the sandbox should:

  • Distinguish clearly between training data (population-level datasets) and validation data (patient-specific, context-bound).
  • Enable and leverage collaboration within and across platforms such as the NHS Federated Data Platform, HDR UK nodes, or testbed hospitals.
  • Ensure ethical, consent-based access to real-world patient data (in addition to facilitating access to large population health datasets) for participating cohorts.
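By way of illustration, the first point above—the training/validation distinction—could be declared roughly as follows in a cohort's data plan; the sources and governance conditions shown are placeholders, not actual requirements.

```python
# Illustrative data plan distinguishing population-level training data
# from patient-specific, context-bound validation data.
data_plan = {
    "training_data": {
        "scope": "population-level",
        "sources": ["HDR UK research dataset (example)", "synthetic augmentation"],
        "governance": "de-identified, aggregate access under existing research approvals",
    },
    "validation_data": {
        "scope": "patient-specific, context-bound",
        "sources": ["testbed hospital EHR extract (example)"],
        "governance": "consent-based access within a secure, compliant environment",
    },
}
```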

Not a Government Research Lab—And It Shouldn’t Act Like One

The most critical concern is that the sandbox risks being repurposed as an exploratory governance lab—one that tests not only emerging technologies but also speculative regulatory models. That blurs the line between regulatory implementation and regulatory design.

But a regulatory agency is not a futures lab. Nor should it behave like one.

Regulatory sandboxes are designed to apply existing frameworks in novel or fast-evolving contexts, not to serve as a proving ground for entirely new legal or policy regimes. At times, however, the Airlock conveyed the impression that the MHRA is still determining what kind of regulator it intends to be in the AI era. While such strategic ambiguity is understandable—given the pace of change and lack of international consensus—it was often foregrounded in ways that may have undermined the sandbox’s credibility as a market-ready, confidence-building tool.

The explicit acknowledgment of this ambiguity, rather than masking it, was commendable. But clarity is now needed. For the sandbox to serve its intended function, it must avoid becoming a speculative exercise in regulatory identity. It should focus on operationalising current frameworks, offering a trusted environment in which innovators can validate compliance strategies aligned with today’s rules—while regulators gather actionable insights to inform future guidance and capacity-building.

Credit Where It’s Due: A Culture of Openness

Despite the critique, the MHRA deserves recognition for its transparency and willingness to engage the public in an ongoing dialogue. It is rare for a national regulator to invite scrutiny midstream rather than only publishing after conclusions are finalised. This openness is a vital strength.

Phase 1 of the AI Airlock has surfaced the complexities that the sandbox concept must navigate. With renewed funding and focus, there is a real opportunity to fine-tune the approach. Phase 2 must:

  • Centre on product-specific, pathway-relevant questions
  • Facilitate live adaptive learning by showcasing specific use cases
  • Deliver reusable regulatory insights that collectively reduce uncertainty

This approach can transform the sandbox from an abstract consultation into a practical tool that accelerates clarity and innovation in AI-enabled medical devices.

Defining the Sandbox

Phase 1 raised the right questions but drifted off course: too much focus on “what is AI?”, not enough on “is this safe, effective, and usable in the NHS today?” A second cohort is now forming. Its mission should be clearer: to protect patients and public health, not just to speculate on long-term systemic risks from AI. To this end, we must bear in mind that AI is not a medical device—software is. AI may be a component of a regulated product, and in this context, a regulatory sandbox should:

  • Specify the additional evidence, validation, and operational conditions required to bring AI-enabled products to market
  • Extend sandbox models to encompass clinical trials and real-world evaluations—combining population-level training data with patient-specific validation

In other words, the sandbox should not replace MHRA or NICE approval pathways—it should clarify AI-specific requirements within them.

Key questions the sandbox should answer:

  • What counts as "regulatory-grade" evidence for AI components?
  • Under what conditions of use can these be safely deployed?
  • How should adaptive systems fit within lifecycle oversight frameworks?

Summary of Recommendations

Here are the key elements I propose going forward:

🧩 Adopt a delivery-focused evidence strategy: strengthen regulatory approaches by embedding both capability risk (model design and architecture) and situational risk (clinical context and use) within a unified, regulatory-grade evidence framework.

🧩 Regulate use, not models: focus regulatory efforts on demonstrating how models translate technical capability into effective clinical performance within defined healthcare settings.

🧩 Ground the sandbox in use-case specificity: design testing cohorts around real healthcare delivery contexts and patient journeys, rather than generic or hypothetical technology potentials.

🧩 Build toward publishable, globally relevant guidance: leverage insights from sandbox iterations to develop clear, reproducible evidence pathways and support standards development that aligns international regulatory expectations.

Conclusion

In the wake of seismic shifts in what AI represents, the sandbox must evolve from conceptual exploration into a delivery-oriented, evidence-generating mechanism—one that supports the safe, effective, and equitable integration of AI into regulated healthcare.

Its next phase should move beyond broad ethical reflection and instead prioritise real-world validation, contextual specificity, and regulatory alignment. By focusing on how AI performs in use—not merely how it is built—the sandbox can address both foundational and applied risks through a single, transparent, and scalable evidence strategy.

If successful, the AI sandbox will no longer just reflect on the future—it will help shape it. And in doing so, it can answer a pivotal question for health systems and innovators alike:

What does it take to bring foundation-model-enabled AI safely, credibly, and sustainably into healthcare delivery that patients and clinicians can trust?


🌀 Welcome to The Riff

A sharp, human-centred take on where digital health and AI are headed next—offering signal over noise, with an eye on equity, sustainability, and real-world impact. Each edition riffs on a theme—from drift in AI systems and digital bias in healthcare, to sandboxes, standards, and smarter models of care. It's rooted in active work across policy, ethics, and innovation ecosystems—but always grounded in people, practice, and possibility. Whether you're shaping the future of health systems, building technology, or asking better questions, The Riff is your lens into what's emerging, what’s working, and what we need to talk about next.
