Reflections on the AI Airlock: Charting Practical, Globally Relevant Pathways for LLMs in Healthcare
🌀 RIFF | Edition #3
Reflections on the MHRA’s AI sandbox and where we go from here
Yesterday’s MHRA webinar marked the close of Phase 1 of its AI Airlock sandbox programme and announced new government funding for a second cohort. The initiative was framed as a success story in adaptive regulation—tempered with pragmatism and an honest recognition of the challenges agencies face when implementing sandbox approaches. It also signalled continued government appetite for AI-enabled innovation in healthcare.
Yet beneath the surface, some tensions remain unresolved. Having followed the programme closely and participated in yesterday’s session, I left with mixed feelings—not because the effort lacks merit, but because its core purpose still feels under-articulated.
Here are four reflections—on what the UK sandbox programme did well, and what may need rethinking.
What Is This Sandbox For?
A regulatory sandbox is typically a structured, time-bound environment for safe, collaborative experimentation. It enables regulators, innovators, clinicians, and patients to evaluate emerging technologies under guided conditions—testing design hypotheses without compromising patient safety or public health. In the case of AI, such testing must integrate both foundation-model-level and application-specific risk-tiering. The purpose is to generate transparent, shareable, and regulatory-grade evidence to inform both pre-market approval and post-market surveillance.
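To make that two-level idea concrete, here is a minimal sketch of how a capability tier for the foundation model and a situational tier for the clinical use case might be recorded and combined. The tier names, fields, and the "take the stricter tier" combination rule are illustrative assumptions, not any regulator's actual scheme.

```python
# A minimal sketch (not any regulator's scheme) of two-level risk tiering:
# a "capability" tier for the foundation model and a "situational" tier for
# the clinical use case, combined into the tier that drives evidence
# requirements. Tier names and the max() combination rule are assumptions.
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass
class FoundationModelRisk:
    """Capability risk: properties of the model itself."""
    model_id: str
    tier: Tier  # e.g. derived from architecture and evaluation results


@dataclass
class ApplicationRisk:
    """Situational risk: the regulated product's context of use."""
    intended_purpose: str
    user_population: str
    clinical_context: str
    tier: Tier  # e.g. derived from harm severity if an output is wrong


def combined_tier(model: FoundationModelRisk, app: ApplicationRisk) -> Tier:
    # Illustrative rule: evidence requirements follow the stricter tier,
    # so a low-risk use of a high-risk model still gets full scrutiny.
    return max(model.tier, app.tier)


if __name__ == "__main__":
    llm = FoundationModelRisk(model_id="general-purpose-llm", tier=Tier.MODERATE)
    triage = ApplicationRisk(
        intended_purpose="triage suggestions for clinician review",
        user_population="adult primary-care patients",
        clinical_context="NHS general practice",
        tier=Tier.HIGH,
    )
    print(combined_tier(llm, triage).name)  # HIGH
```

The point of the sketch is the structure, not the rule itself: pre-market approval and post-market surveillance both need to see the two tiers recorded separately, and the policy for combining them made explicit.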
Given the complexity introduced by the continual-learning nature of AI as a Medical Device (AIaMD), Phase 1 of the Airlock diverged from this traditional sandbox model. Rather than establishing a testbed to assess evidence-generation strategies or delivery pathways for AI-enabled medical devices, it focused primarily on exploratory risk mapping—particularly the uncertainties introduced by general-purpose AI (GPAI) models, most notably large language models (LLMs).
Notable characteristics of Phase 1:
- Exploratory risk mapping rather than evaluation of evidence-generation strategies or delivery pathways
- A focus on the uncertainties raised by GPAI models, particularly LLMs
- Reliance on simulated case studies rather than real patient data or live clinical systems
- Mid-stream public engagement and a notable commitment to transparency
To be meaningful in future phases, the sandbox must evolve beyond abstract risk reflection. It should focus on delivery-oriented evidence strategies that clearly specify the types of data, operational conditions, and validation protocols required to bring AI-enabled technologies safely and effectively to market—anchored within existing regulatory frameworks, not abstracted from them.
With LLMs increasingly serving as the foundation for AI systems, sandbox activities must support evidence development at two levels:
- the foundation-model level, where capability risk stems from model design and architecture; and
- the application level, where situational risk stems from the intended purpose, user population, and clinical context of use.
Models informed by predicate-device analogies (e.g., the FDA’s 510(k) pathway) and the safe-originator concept (proposed in the implementation of the AI Act) can be useful in structuring this approach. Crucially, this requires clear articulation of both a delivery path and an accompanying evidence-generation strategy as part of any regulatory submission.
✅ Recommendation 1: Define and prioritise delivery-focused evidence strategies as the core function of the sandbox.
These strategies should align with regulatory approval and post-market requirements, explicitly connecting foundation-level AI evaluation with use-case-specific validation in healthcare delivery.
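As a thought experiment, the "delivery path plus evidence-generation strategy" requirement could be expressed as a structured checklist that a sandbox review walks through before approval work begins. Everything below—the field names, the Submission type, the gap check—is hypothetical scaffolding, not an MHRA submission schema; the required elements (data, operational conditions, validation protocols) come from the text above.

```python
# A hypothetical structure for "delivery path + evidence-generation strategy".
# Field names and the gap check are illustrative scaffolding, not an MHRA
# submission schema.
from dataclasses import dataclass


@dataclass
class EvidenceStrategy:
    data_sources: list[str]            # what data will support the claims
    operational_conditions: list[str]  # where and how the system will run
    validation_protocols: list[str]    # pre-market tests and post-market monitoring


@dataclass
class Submission:
    product_name: str
    delivery_path: str  # e.g. the regulatory route the sponsor intends to take
    evidence: EvidenceStrategy


def missing_elements(sub: Submission) -> list[str]:
    """Return the gaps a sandbox review would surface before approval work begins."""
    gaps = []
    if not sub.delivery_path:
        gaps.append("delivery path")
    for name in ("data_sources", "operational_conditions", "validation_protocols"):
        if not getattr(sub.evidence, name):
            gaps.append(name.replace("_", " "))
    return gaps
```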
Focus on AI Function—Not AI as a Concept
Much of the discussion in Phase 1 became bogged down in conceptual debates about AI—especially around LLMs and model features, such as the use of synthetic data. This focus risks obscuring the regulatory task at hand.
Regulators such as the MHRA, FDA, and EMA do not regulate AI models in the abstract—they regulate medical products with a defined intended purpose, risk classification, and user population. This distinction is fundamental, and it is reflected in both the EU AI Act and the emerging Code of Practice on GPAI.
The sandbox should have centred on how AI models and systems are used, for whom, and in what clinical or operational context. Without this specificity, sandbox activities risk amplifying confusion, not resolving it.
✅ Recommendation 2: Regulate use, not models.
LLMs are general-purpose technologies and should remain outside the scope of direct medical regulation. When they are embedded in regulated products, however, the control strategies around them—such as prompting, safety filters, and model update procedures—can and should be evaluated through adaptive regulatory frameworks.
The focus should be on:
- the product’s intended purpose, risk classification, and user population;
- the clinical or operational context in which the model is deployed; and
- the control strategies (prompting, safety filters, model update procedures) that translate general capability into safe clinical performance.
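To illustrate what such use-level controls might look like in practice, the sketch below wraps a hypothetical model call with a pinned version, a fixed prompt template, a crude output filter, and an audit record of the kind post-market surveillance would draw on. The call_llm function, the template, and the blocked phrases are placeholders, not a real API or a validated safety mechanism.

```python
# A sketch of use-level control strategies around an embedded LLM: a pinned
# model version, a fixed prompt template, a crude output filter, and an audit
# record. All names and thresholds here are illustrative placeholders.
import json
from datetime import datetime, timezone

MODEL_VERSION = "llm-v1.2"  # pinned: updates go through change control, not silently
PROMPT_TEMPLATE = (
    "You are a decision-support aid for clinicians. Do not state a diagnosis; "
    "list differentials with supporting evidence.\n\nCase: {case}"
)
BLOCKED_PHRASES = ("definitely", "no need to see a doctor")  # illustrative only


def call_llm(model_version: str, prompt: str) -> str:
    """Hypothetical stand-in for whatever inference API a product actually uses."""
    return "Possible differentials: ..."


def answer(case: str, audit_log: list[str]) -> str:
    prompt = PROMPT_TEMPLATE.format(case=case)
    raw = call_llm(MODEL_VERSION, prompt)
    # Safety filter: block outputs that read as unsupervised medical advice.
    flagged = any(p in raw.lower() for p in BLOCKED_PHRASES)
    output = "Escalated for clinician review." if flagged else raw
    # Audit trail: record model version and filter outcome for every response.
    audit_log.append(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": MODEL_VERSION,
        "flagged": flagged,
    }))
    return output
```

These are exactly the artefacts a regulator can evaluate without regulating the underlying model itself: the prompt, the filter, the update procedure, and the audit trail are all properties of the product, not of the LLM.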
Simulated Environments Are Not Real Testbeds
A foundational feature of regulatory sandboxes is access to real-world data in controlled, ethically governed environments. Yet, Phase 1 relied almost entirely on simulated case studies—hypothetical scenarios without real patient involvement, live clinical systems, or operational datasets.
Simulation can be useful in early design, but it is not a substitute for:
- real patient involvement under ethical governance;
- live clinical systems and operational datasets; and
- the regulatory-grade, real-world evidence needed for approval and post-market surveillance.
This absence reinforces the “grey zones” that regulatory initiatives are meant to eliminate—particularly in the UK, where uncertainty around data access and validation frameworks continues to hinder AI deployment.
✅ Recommendation 3: Embed real data governance and validation pathways.
To be fit for purpose, the sandbox should:
- provide governed access to real-world data in controlled, ethically supervised environments;
- define validation pathways tied to existing approval and post-market requirements; and
- resolve, rather than reproduce, the uncertainty around data access that currently hinders AI deployment in the UK.
Not a Government Research Lab—And It Shouldn’t Act Like One
The most critical concern is that the sandbox risks being repurposed as an exploratory governance lab—one that tests not only emerging technologies but also speculative regulatory models. This approach risks blurring the line between regulatory implementation and regulatory design.
But a regulatory agency is not a futures lab. Nor should it behave like one.
Regulatory sandboxes are designed to apply existing frameworks in novel or fast-evolving contexts, not to serve as a proving ground for entirely new legal or policy regimes. At times, however, the Airlock conveyed the impression that the MHRA is still determining what kind of regulator it intends to be in the AI era. While such strategic ambiguity is understandable—given the pace of change and lack of international consensus—it was often foregrounded in ways that may have undermined the sandbox’s credibility as a market-ready, confidence-building tool.
The explicit acknowledgment of this ambiguity, rather than masking it, was commendable. But clarity is now needed. For the sandbox to serve its intended function, it must avoid becoming a speculative exercise in regulatory identity. It should focus on operationalising current frameworks, offering a trusted environment in which innovators can validate compliance strategies aligned with today’s rules—while regulators gather actionable insights to inform future guidance and capacity-building.
Credit Where It’s Due: A Culture of Openness
Despite the critique, the MHRA deserves recognition for its transparency and willingness to engage the public in an ongoing dialogue. It is rare for a national regulator to invite scrutiny midstream rather than only publishing after conclusions are finalised. This openness is a vital strength.
Phase 1 of the AI Airlock has surfaced the complexities that the sandbox concept must navigate. With renewed funding and focus, there is a real opportunity to fine-tune the approach. Phase 2 must:
- centre delivery-focused evidence strategies rather than abstract risk reflection;
- regulate use, not models;
- embed real data governance and validation pathways; and
- translate its findings into publishable, globally relevant guidance.
This approach can transform the sandbox from an abstract consultation into a practical tool that accelerates clarity and innovation in AI-enabled medical devices.
Defining the Sandbox
Phase 1 raised the right questions but drifted off course: too much focus on “what is AI?”, not enough on “is this safe, effective, and usable in the NHS today?” A second cohort is now forming. Its mission should be clearer: to protect patients and public health, not just to speculate on long-term systemic risks from AI. To this end, we must bear in mind that AI is not a medical device—software is. AI may be a component of a regulated product, and in this context, a regulatory sandbox should:
- test AI components within regulated software products against existing requirements;
- clarify the AI-specific evidence needed for approval and post-market surveillance; and
- support innovators in validating compliance strategies under today’s rules.
In other words, the sandbox should not replace MHRA or NICE approval pathways—it should clarify AI-specific requirements within them.
Key questions the sandbox should answer:
- Is this product safe, effective, and usable in the NHS today?
- What data, operational conditions, and validation protocols does approval require?
- How should updates to embedded models be governed after deployment?
Summary of Recommendations
Here are the key elements I propose going forward:
🧩 Adopt a delivery-focused evidence strategy: Strengthen regulatory approaches by embedding both capability risk (model design and architecture) and situational risk (clinical context and use) within a unified, regulatory-grade evidence framework.
🧩 Regulate use, not models: Focus regulatory efforts on demonstrating how models translate technical capability into effective clinical performance within defined healthcare settings.
🧩 Ground the sandbox in use-case specificity: Design testing cohorts around real healthcare delivery contexts and patient journeys, rather than generic or hypothetical technology potentials.
🧩 Build toward publishable, globally relevant guidance: Leverage insights from sandbox iterations to develop clear, reproducible evidence pathways and support standards development that aligns international regulatory expectations.
Conclusion
In the wake of seismic shifts in what AI represents, the sandbox must evolve from conceptual exploration into a delivery-oriented, evidence-generating mechanism—one that supports the safe, effective, and equitable integration of AI into regulated healthcare.
Its next phase should move beyond broad ethical reflection and instead prioritise real-world validation, contextual specificity, and regulatory alignment. By focusing on how AI performs in use—not merely how it is built—the sandbox can address both foundational and applied risks through a single, transparent, and scalable evidence strategy.
If successful, the AI sandbox will no longer just reflect on the future—it will help shape it. And in doing so, it can answer a pivotal question for health systems and innovators alike:
What does it take to bring foundation-model-enabled AI safely, credibly, and sustainably into healthcare delivery that patients and clinicians can trust?
🌀 Welcome to The Riff
A sharp, human-centred take on where digital health and AI are headed next—offering signal over noise, with an eye on equity, sustainability, and real-world impact. Each edition riffs on a theme—from drift in AI systems and digital bias in healthcare, to sandboxes, standards, and smarter models of care. It's rooted in active work across policy, ethics, and innovation ecosystems—but always grounded in people, practice, and possibility. Whether you're shaping the future of health systems, building technology, or asking better questions, The Riff is your lens into what's emerging, what’s working, and what we need to talk about next.