Enhancing Clinical Decision-Making: Integrating Multi-Agent Systems with Ethical AI Governance

Ying-Jung Chen    Ahmad Albarqawi    Chi-Sheng Chen

Author contributions: YJC and AA conceived the idea; YJC and AA conducted the research; YJC, AA, and CSC wrote and edited the article.
Abstract

Recent advances in data-driven medicine, which integrate ethically governed and explainable artificial intelligence into clinical decision support systems (CDSS), are critical to ensuring reliable and effective patient care. This paper compares novel agent system designs that use modular agents to analyze laboratory results, vital signs, and clinical context, and to predict and validate outcomes. We implement our agent system on the eICU database: lab analysis, vitals-only interpreter, and contextual reasoner agents run first, then share their memory with the integration agent, prediction agent, transparency agent, and validation agent. Our results suggest that the multi-agent system (MAS) outperformed the single-agent system (SAS) in mortality prediction accuracy (59% vs. 56%) and mean error for length of stay (LOS) (4.37 days vs. 5.82 days). However, the transparency score for the SAS (86.21) is slightly better than that for the MAS (85.5). Finally, this study suggests that our agent-based framework not only improves process transparency and prediction accuracy but also strengthens trustworthy AI-assisted decision support in an intensive care setting.

I Introduction

Artificial intelligence (AI) has been widely adopted in healthcare [1]. Within medicine, it is proving valuable for sharpening diagnostic precision, supporting treatment planning, and helping clinicians care for patients [2]. Recent work has focused on using AI to interpret complex medical visuals such as surgical footage [3], computed tomography (CT) scans [4], and magnetic resonance imaging (MRI) scans [5], making interpretation faster and more consistent. These efforts open new possibilities across neurology, psychiatry, and continuous patient monitoring [6]. Altogether, these advancements point to a future where AI supports both visual and signal-based insights, forming the backbone of smarter clinical decision-making tools.

Clinical decision support systems (CDSS) have become a vital part of modern healthcare settings, offering insights drawn from electronic health records (EHRs) and real-time monitoring tools. Yet many of the traditional AI models used in these systems fall short on flexibility, transparency, and oversight, qualities that are especially critical in high-risk settings like intensive care units (ICUs). To address these limitations, we introduce a modular multi-agent system (MAS) designed to reflect how clinical teams make decisions, with a built-in emphasis on ethical AI to uphold both explainability and accountability.

Building on progress in large language model (LLM) agent-based frameworks, our system breaks the decision-making pipeline into focused, collaborative agents. Each agent is responsible for a different aspect of ICU assessment, from interpreting lab results and tracking vital signs to making context-sensitive judgments based on a patient’s history or co-existing conditions. These individual agents pass their findings to an integration agent that brings everything together, enabling more comprehensive predictions, examination of their transparency, and cross-validation of outcomes. This system simulates how doctors gather evidence from various sources, weigh context, and form a unified clinical picture.

By structuring the system around modular agents and grounding it in ethical oversight, we improve not just how interpretable the model is, but also how it upholds accountability throughout the clinical decision-making process. To test the framework, we used the eICU Collaborative Research Database [7], showing that our method can deliver well-organized predictions, shed light on key prognostic indicators, and build greater trust in AI-supported medical judgments.

II Related Work

II-A Applications of Clinical Decision Support Systems in Intensive Care Settings

CDSS have come a long way, especially in ICU environments where every second counts. Earlier systems typically leaned on rule-based logic or statistical methods to generate recommendations [8, 9]. More recent developments have looked to clinical practice guidelines (CPGs) as a way to enrich LLMs, boosting their ability to offer context-aware treatment advice. Research suggests that LLMs enhanced with CPGs outperform traditional models in delivering more accurate clinical suggestions [8].

Meanwhile, MAS approaches to CDSS have been gaining popularity. Researchers developed a particularly interesting framework that combines case-based reasoning with a MAS. This system uses different agents to manage user interaction, task execution, and the application of medical knowledge [9]. What makes this approach valuable is how it merges MAS with case-based reasoning, allowing the system to learn more efficiently and better adapt to each patient’s unique situation.

II-B eICU Data and Its Applications

The eICU Collaborative Research Database has emerged as a critical resource for intensive care research, gathering comprehensive data from over 200,000 ICU stays across the United States [10]. This extensive collection spans key clinical variables—including vital signs, treatment protocols, severity indices, diagnoses, and interventions—serving as solid groundwork for developing and validating AI models that address the specific challenges of critical care.

eICUs represent a significant influence on critical care medicine, harnessing telemedicine to address the shortage of on-site specialists for high-risk patients [11]. This innovative method enables continuous expert monitoring and intervention without the limitation of physical distance. For example, Philips’ eICU system uses the eCareManager platform to virtually bring ICU specialists to the patient’s bedside. By connecting hospital networks and providing real-time clinical feedback, eICUs effectively bridge the gap between remote experts and immediate patient needs [12].

The impact of eICU implementation has been notable, as evidenced by the experience at Baptist Health South Florida. After introducing its eICU model, the institution saw a significant 23% decrease in ICU mortality rates and up to a 25% reduction in average length of stay (LOS) [12]. These improvements show how telemedicine is changing critical care for the better. The eICU approach improves care in several important ways: doctors can watch patients around the clock, catch problems earlier, make better use of limited specialists, follow consistent treatment plans, and work more closely with the nurses and doctors at the bedside. Hospitals also receive financial benefits since patients leave the ICU faster. In addition, the data these systems collect is valuable for research, which can help keep making ICU care better over time.

II-C LLM-Based Agents in Healthcare

LLMs have recently become more common in healthcare. These models now assist in multiple areas, including virtual assistants, individualized health education, symptom checking, and mental health support tools [13]. By improving patient interactions and simplifying administrative tasks, LLM-based systems are beginning to influence how healthcare is provided.

One example is MDAgents, a MAS using LLMs to manage complex medical decisions [14]. Its design replicates the teamwork observed in actual healthcare environments, enabling effective communication among its agents. Testing has shown that MDAgents performs better than earlier models in various evaluations.

A recent review explored the use of LLM-based agents in medical contexts [15]. The review covered technical foundations, practical applications, and existing challenges. It emphasized components such as planning techniques, reasoning strategies, integration of external tools, and agent architecture. These systems are now employed for tasks like CDSS, automatic patient documentation, simulation training, and workflow optimization.

However, MedAgentBench offers a benchmark for evaluating LLMs as medical agents, featuring 300 clinically derived tasks across 10 categories and 100 realistic patient profiles. The results indicate that current LLMs still struggle with complex tasks and need further optimization before use in autonomous healthcare applications [16].

Consequently, LLM-based agents have been further developed into MAS, where multiple agents interact collaboratively. This shift allows for systems that are more organized and flexible, offering new ways to manage challenging healthcare situations, such as emergency response coordination and personalized patient treatment.

II-D Multi-Agent Systems in Healthcare

MAS are gaining attention as a promising way to tackle complex challenges in healthcare. One example applies MAS to pre-hospital emergency response, where agents—such as EMS dispatch centers, ambulances, traffic nodes, and medical providers—collaborate within a distributed decision-making setup [17].

The idea of multi-stage AI agents builds on this by organizing intelligent agents into layers, each handling different parts of perception and reasoning. Many of these layers are now powered by LLMs, allowing for more structured and scalable workflows [18]. This layered setup has shown promise in areas like personalized care and remote health services. The LLM-medical-agent framework, for instance, demonstrates how MAS can be applied to modular analysis of healthcare data in practical settings [19].

II-E Ethical Governance in Healthcare AI

Ethical governance in healthcare AI is addressed in part through Explainable AI (XAI), which exposes the decision-making processes of complex algorithms, and in part through the "Healthcare AI Datasheets" framework, which documents potential biases through demographic data [20]. These complementary approaches not only enhance transparency by demystifying the "black-box" problem but also actively work toward equitable healthcare outcomes by identifying and addressing the sampling and complexity biases that have historically perpetuated healthcare disparities across diverse populations [21].

Growing concerns around the safety of LLM-based agents have prompted the development of MAS [22, 23] that embed ethical advisors and policy guardrails to ensure compliance with safety and privacy standards—an especially important safeguard in clinical environments [24].

Bringing LLMs into electronic health record systems also introduces a range of ethical, legal, and practical questions. These include how to handle consent, maintain oversight, and ensure data governance [25]. A patient-centered approach—with transparency and strong ethical foundations—is essential for protecting vulnerable groups.

The World Health Organization (WHO) has accordingly outlined ethical guidelines for AI in healthcare, highlighting principles such as human autonomy, well-being, and system transparency [26]. Especially in high-risk areas like intensive care units, resolving these governance challenges is key to the responsible deployment of AI [27].

II-F Motivation and Research Gap

While AI-powered CDSS have seen promising adoption, real-world ICU practice still presents unmet needs. Many solutions lack modularity and transparency, or are not built with the inter-agent communication needed to reflect the dynamic, interdisciplinary nature of intensive care.

Most existing approaches focus on isolated tasks—interpreting lab results, monitoring vital signs, or reasoning over medical history—but few bring these components together into a unified, dynamic system that reflects how clinicians actually work as a team. These needs motivate us to propose a novel agent-based system that decomposes the clinical reasoning process.

Thus, we build the system with specialized, collaborative agents—each designed to handle a distinct aspect of care while maintaining accountability and interpretability throughout. By applying ethical AI principles at every stage of the pipeline and validating our design using the eICU database, this study aims to bridge both the technical and ethical gaps in deploying trustworthy AI for high-stakes decision-making in critical care.

Figure 1: Illustration of the MAS Design. The system consists of a set of specialized agents, each responsible for processing a specific type of clinical data. The Context Analysis Agent handles unstructured inputs like clinical notes, while the Vitals Analysis Agent focuses on real-time physiological signals, and the Lab Analysis Agent interprets laboratory test results. These distinct streams of information are brought together by the Integration Agent, which fuses multimodal features into a unified representation. Based on this, the Prediction Agent carries out key forecasting tasks—such as predicting ICU mortality or estimating LOS. To support interpretability, the Transparency Agent generates human-readable, traceable explanations of model outputs. Finally, the Validation Agent oversees performance assessment by comparing predictions against ground truth data.

III Methods

III-A Dataset and Preprocessing

This study used the eICU Collaborative Research Database v2.0 [10], which compiles anonymized ICU records from over 200,000 patient admissions across different hospitals in the U.S. The database contains two types of data: 1) structured details (e.g., vital signs and lab results) and 2) unstructured clinical notes contributed by nurses and physicians, giving an overall view of patient care.

Here we used several key eICU files: patient.csv, lab.csv, vitalPeriodic.csv, note.csv, and medication.csv, along with APACHE-related data files [28] (apacheApsVar.csv and apachePatientResult.csv). To align each patient’s information, records were grouped by patientunitstayid. If any essential data were missing—such as vital signs, lab values, or clinical notes—we removed those entries to maintain reliability.

We filled in data gaps, ordered events by their timestamps, and shortened lengthy text fields to meet language model input limits. We then sampled 150 patients for the study: 76 who died and 74 who survived. This balanced sample provided a controlled dataset for our comparative analysis.

Next, we retrieved specific features from each patient’s record set. We collected the ten most recent vital sign readings to capture each patient’s current physiological state. We also selected the latest value of each distinct lab biomarker deemed clinically relevant. For unstructured clinical documentation, we included up to three notes per patient, focusing primarily on entries written by physicians and nurses. In addition, our analysis tracked the top 20 medications and treatments, identified by frequency or uniqueness within the overall dataset. Finally, we incorporated APACHE scores and predictions as reference points to aid in validating and evaluating our modeling outcomes.
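
As a concrete illustration, the following pandas sketch shows how this per-patient feature extraction could be implemented. It assumes the standard eICU column names (patientunitstayid, observationoffset, labresultoffset, labname, notetext, drugname); the helper itself is our own simplification rather than the exact pipeline code.

import pandas as pd

def extract_patient_features(stay_id, vitals, labs, notes, meds, top_meds):
    # Ten most recent vital sign readings for this ICU stay.
    v = vitals[vitals["patientunitstayid"] == stay_id]
    recent_vitals = v.sort_values("observationoffset").tail(10)

    # Latest value of each distinct, clinically relevant lab biomarker.
    lab_rows = labs[labs["patientunitstayid"] == stay_id]
    latest_labs = (lab_rows.sort_values("labresultoffset")
                           .drop_duplicates("labname", keep="last"))

    # Up to three clinical notes per patient (physician/nurse filtering omitted here).
    patient_notes = notes[notes["patientunitstayid"] == stay_id].head(3)

    # Restrict medications to the top-20 list built over the whole dataset.
    m = meds[meds["patientunitstayid"] == stay_id]
    frequent_meds = m[m["drugname"].isin(top_meds)]

    return {"vitals": recent_vitals, "labs": latest_labs,
            "notes": patient_notes["notetext"].tolist(),
            "medications": frequent_meds}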

III-B Multi-Agent System Design

To emulate real-world ICU decision-making, we implemented a modular MAS consisting of seven discrete agents, each responsible for a semantically distinct task. The system is shown in Figure 1.

  • Lab Analysis Agent: Receives structured lab data and highlights key abnormalities (e.g., hyperlactatemia, creatinine elevation) with implications for APACHE scoring and patient prognosis.

  • Vitals Analysis Agent: Processes vital signs (e.g., heart rate (HR), systolic blood pressure (SBP), peripheral capillary oxygen saturation (SpO2), temperature) and evaluates physiological stability, respiratory function, and cardiovascular performance.

  • Context Analysis Agent: Analyzes unstructured notes, medication usage, and treatment strings to infer diagnoses, risk factors, and progression trajectory.

  • Integration Agent: Aggregates the results from the above agents into a comprehensive, system-by-system clinical assessment. It prioritizes ICU risk factors related to mortality and length of stay (LOS).

  • Prediction Agent: Generates structured outcome predictions (mortality probability and ICU LOS) using integrated findings and APACHE variables. Results follow a strict template for automated parsing.

  • Transparency Agent: Assesses whether clinical prediction outputs meet ethical standards and are explainable to various stakeholders. Its score combines explainability, interpretability, and traceability sub-scores, each calculated from specific criteria for that dimension.

  • Validation Agent: Compares predicted vs. actual ICU outcomes and reflects on the prediction’s accuracy, key contributing variables, and future improvement insights.

To ensure information transfer between agents, we implemented a shared memory architecture that allows any agent to access inputs and outputs from previous pipeline stages. This approach reduces the risk of information loss between modules while maintaining the semantic separation of responsibilities. While this shared memory design has proven effective at improving predictive performance by ensuring that no critical information is lost during the analysis process, we recognize that the MAS introduces additional complexity that can affect transparency. Our current implementation focuses on performance optimization, with ongoing work to enhance clarity across the agent communication pathways.
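
A minimal sketch of this shared memory idea is shown below, using a plain in-process dictionary keyed by agent name; the actual state handling inside the intelli framework may differ.

class SharedMemory:
    """Holds every agent's output so later pipeline stages can read
    any earlier stage without information loss."""

    def __init__(self):
        self._store = {}

    def write(self, agent_name, output):
        self._store[agent_name] = output

    def read_all(self):
        # Downstream agents receive the full pipeline history.
        return dict(self._store)

# Usage: upstream analysis agents write, the Integration Agent reads.
memory = SharedMemory()
memory.write("lab_analysis", "KEY ABNORMALITIES: elevated creatinine ...")
memory.write("vitals_analysis", "SBP trending down; tachycardic ...")
integration_input = memory.read_all()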

Each agent is implemented using OpenAI GPT-4o [29] and configured via the intelli framework [30], which allows asynchronous agent orchestration with JSON-structured prompts and logging.

III-C Few-Shot Learning Example Construction

To ground the reasoning of the Prediction Agent, we incorporated two real ICU patients as few-shot exemplars. These examples span different outcomes (survived vs. expired) and were selected based on APACHE completeness and data richness. Each example includes demographics, APACHE variables, labs, vitals, and actual outcomes. The examples are embedded directly in the prompt using clearly segmented format blocks and are used to improve model generalizability.
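
The sketch below shows one way such exemplars could be rendered into segmented prompt blocks; the block markers and field names are illustrative, not the exact template used in our prompts.

def build_few_shot_block(exemplars):
    """Render real ICU cases as clearly segmented few-shot blocks."""
    blocks = []
    for ex in exemplars:
        blocks.append(
            "=== EXAMPLE PATIENT ===\n"
            f"DEMOGRAPHICS: {ex['demographics']}\n"
            f"APACHE VARIABLES: {ex['apache']}\n"
            f"LABS: {ex['labs']}\n"
            f"VITALS: {ex['vitals']}\n"
            f"ACTUAL OUTCOME: {ex['outcome']}\n"
        )
    return "\n".join(blocks)

# Usage (assuming `exemplars` and `patient_summary` are prepared upstream):
# prompt = build_few_shot_block(exemplars) + "\n=== NEW PATIENT ===\n" + patient_summary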

III-D System Execution and DAG Orchestration

The entire multi-agent pipeline is expressed as a directed acyclic graph (DAG), where tasks are mapped via semantic dependencies. Specifically:

  • lab_analysis, vitals_analysis, and context_analysis feed into integration.

  • integration feeds into prediction.

  • prediction feeds into transparency.

  • transparency feeds into validation.

Execution is managed asynchronously using Python’s asyncio to allow concurrent LLM calls and reduce latency. The system supports multi-threaded batch evaluation and error-tolerant retries.
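
A simplified asyncio sketch of this DAG execution is given below; the agent interface and the absence of retry logic are illustrative assumptions, since the real orchestration is delegated to the intelli framework.

import asyncio

# Semantic dependencies of the pipeline, mirroring the DAG above.
DAG = {
    "lab_analysis": [],
    "vitals_analysis": [],
    "context_analysis": [],
    "integration": ["lab_analysis", "vitals_analysis", "context_analysis"],
    "prediction": ["integration"],
    "transparency": ["prediction"],
    "validation": ["transparency"],
}

async def run_node(name, dep_tasks, results, agents):
    # Block until all upstream agents have finished.
    await asyncio.gather(*dep_tasks)
    upstream = {d: results[d] for d in DAG[name]}
    results[name] = await agents[name].run(upstream)  # concurrent LLM call

async def run_pipeline(agents):
    results, tasks = {}, {}
    for name, deps in DAG.items():  # insertion order guarantees deps exist
        dep_tasks = [tasks[d] for d in deps]
        tasks[name] = asyncio.create_task(
            run_node(name, dep_tasks, results, agents))
    await asyncio.gather(*tasks.values())
    return results

# The three analysis agents run concurrently; integration waits for all three.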

Figure 2: Performance comparison between the MAS and SAS across two evaluation metrics: Mortality Prediction Accuracy and Length of Stay (LOS) Mean Error. Each model was executed 8 times, and the box plots represent the distribution of results across these runs. The MAS demonstrates slightly higher mean mortality accuracy and consistently lower LOS errors than the SAS.

III-E Implementation Details

Each agent is instantiated as a generative pre-trained transformer-based (GPT-based) [31] text agent with a predefined mission, API credentials, and output format enforcement. Prompts are customized per task using structured sections (e.g., “KEY ABNORMALITIES”, “APACHE RELEVANT FINDINGS”). Inputs are truncated or summarized to fit within a 10,000-token prompt limit.

Agent configuration example:

Agent(
    provider="openai",
    mission="Analyze lab data for abnormalities",
    model_params={"key": OPENAI_API_KEY, "model": "gpt-4o"},
)

All patient data is saved per analysis run, including intermediate and final agent outputs in JSON format.
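
For example, per-run persistence might look like the following sketch; the directory layout and file naming are our own illustration.

import json
from pathlib import Path

def save_run_outputs(run_dir, stay_id, agent_outputs):
    """Persist intermediate and final agent outputs for one patient."""
    run_path = Path(run_dir)
    run_path.mkdir(parents=True, exist_ok=True)
    out_file = run_path / f"patient_{stay_id}.json"
    out_file.write_text(json.dumps(agent_outputs, indent=2))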

III-F Ethical AI and Explainability

Here we incorporate the explainability, interpretability, and traceability aspects within our agent system:

  • Explainability: Measures how the system detects and explains the factors influencing predictions. This includes elements such as the patient’s age, intubation status, creatinine levels, and other clinical parameters.

  • Interpretability: Evaluates how the reasoning process is communicated to various stakeholders. This step ensures that the decision-making logic is understandable to clinicians, patients, and administrators.

  • Traceability: Assesses whether input data and influencing factors can be traced back to their sources. This builds a solid trail of evidence backing the system’s findings, making it easier to trust and use in real-world scenarios.

III-G Quantitative Evaluation of Agent Transparency

We built a transparency assessment process, similar to [32], that evaluates the transparency of clinical prediction explanations by analyzing text responses for key features. The transparency score is calculated from explainability, interpretability, and traceability within the clinical prediction framework. The explainability score reflects how the system justifies decisions, based on feature-importance identification, concentration of critical factors, clear reasoning, and stakeholder-oriented explanations. Interpretability measures how well humans can understand the model’s workings, based on reasoning processes, prediction predictability, complexity, and alternative-scenario analysis. Traceability assesses documentation quality by tracking input data sources, data transformations, model development history, and the decision process. The overall transparency score aggregates these components; it shows strong performance overall but indicates a need to improve interpretability.
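
A minimal sketch of such a text-based scorer is shown below, assuming keyword cues for each dimension and equal weighting in the overall score; the actual criteria and weights in our assessment may differ.

# Illustrative cue lists; the real criteria per dimension are richer.
EXPLAINABILITY_CUES = ["because", "due to", "key factor", "feature importance"]
INTERPRETABILITY_CUES = ["reasoning", "therefore", "alternative scenario"]
TRACEABILITY_CUES = ["lab", "vital", "note", "apache", "source"]

def dimension_score(text, cues):
    # Fraction of cues present in the explanation text, scaled to 0-100.
    hits = sum(cue in text.lower() for cue in cues)
    return 100.0 * hits / len(cues)

def transparency_score(explanation_text):
    e = dimension_score(explanation_text, EXPLAINABILITY_CUES)
    i = dimension_score(explanation_text, INTERPRETABILITY_CUES)
    t = dimension_score(explanation_text, TRACEABILITY_CUES)
    return {"explainability": e, "interpretability": i,
            "traceability": t, "overall": (e + i + t) / 3}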

IV Results

Our results, shown in Table I and Figure 2, demonstrate that the MAS consistently outperforms the SAS approach across all predictive metrics over 150 unique patients. Each experiment was conducted across eight runs with approximately 150 patients per run, and the reported values represent the average performance to ensure consistency and robustness of evaluation. In terms of mortality prediction accuracy, the MAS achieved a mean of 59%, while the SAS reached only 56%. While this 3-percentage-point difference was consistent across multiple runs, further statistical analysis with larger samples would be necessary to establish clinical significance. Additionally, the standard deviation for the MAS is marginally higher, suggesting slightly more variability across runs.

We conducted paired t-tests across all performance metrics. The results show that these improvements are not just numerical differences but statistically robust findings, with p-values well below 0.0001 for most metrics and p = 0.0001 for mortality accuracy. What is particularly encouraging is that the confidence intervals for all improvements exclude zero, meaning we can be confident these gains are real rather than chance variations. This statistical validation directly addresses a key concern in clinical AI research: distinguishing between genuine performance improvements and random noise in the data.
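
This comparison can be reproduced with a standard paired t-test over the eight runs, e.g. with scipy; the per-run values below are placeholders rather than our actual measurements.

from scipy import stats

# Placeholder per-run mortality accuracies for the 8 paired runs.
mas_accuracy = [58.1, 59.2, 57.9, 59.6, 58.3, 58.8, 57.6, 59.0]
sas_accuracy = [55.2, 56.1, 55.9, 56.4, 55.3, 55.8, 55.0, 56.2]

t_stat, p_value = stats.ttest_rel(mas_accuracy, sas_accuracy)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4g}")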

TABLE I: Statistical Comparison of Multi-agent and Single-agent Models (Average over 8 runs)
Metric                               Model         Mean (SD)     p-value
Mortality Prediction Accuracy (%)    Multi-agent   58.6 (1.1)    p = 0.0001
                                     Single-agent  55.7 (0.8)
LOS Mean Error (days)                Multi-agent   4.37 (0.21)   p < 0.0001
                                     Single-agent  5.82 (0.11)
Mean Squared Error (days²)           Multi-agent   35.5 (2.3)    p < 0.0001
                                     Single-agent  48.1 (1.4)
Root Mean Squared Error (days)       Multi-agent   5.95 (0.19)   p < 0.0001
                                     Single-agent  6.94 (0.10)

Our analysis also indicates a clear improvement in predicting LOS with the MAS. The average prediction error drops to 4.37 days, compared with 5.82 days for the SAS, an improvement of roughly 25% in accuracy. This gain matters, given its direct influence on how ICU resources are allocated and care is planned. The MAS likewise records a lower mean squared error (35.49 vs. 48.13) and root mean squared error (5.95 vs. 6.94), signifying that it delivers more stable and consistent predictions with fewer extreme fluctuations across various patient groups.

Figure 3: A comparison of MAS and SAS performance across two key metrics: Mortality Prediction Accuracy (top) and Length of Stay Mean Error (bottom). The diagram displays results from eight independent runs. For mortality prediction, higher accuracy indicates better performance, while lower mean error values (in days) represent more accurate predictions for LOS.

Figure 3 provides a clear picture across the eight test runs we conducted. In the top graph, the MAS (blue line) consistently outperformed the SAS (green line) in predicting mortality. The MAS accuracy ranges from about 56% to nearly 60%, while the SAS stays between 55% and 57%. What is notable is not just that it performed better, but that this advantage held steady across every single run.

The LOS prediction results in the bottom graph show an even clearer improvement for the MAS. The blue line stays below the green throughout all runs, with errors around 4–4.7 days compared to the single-agent's 5.7–6 days. That gap of roughly a day and a half might not sound huge until you consider what it means for real patients and hospital planning. Most systems like this show more variability, but here the MAS maintained its edge consistently.

TABLE II: Comparison of transparency score of MAS and SAS (Average over 8 runs)
Metric                            Model         Mean
Average transparency score (%)    Multi-agent   85.50
                                  Single-agent  86.21

Table II provides a side-by-side comparison of average transparency scores from eight independent runs. The data shows that the SAS approach scores an average of 86.21% in contrast to 85.50% for the MAS. This suggests that, under our current evaluation criteria, both models perform similarly in terms of transparency with the SAS design being marginally more interpretable. Despite the distributed nature of the MAS, it maintains nearly equivalent transparency levels, indicating that our shared memory architecture effectively preserves reasoning traceability across multiple specialized agents.

Overall, the MAS outperforms the SAS in predictive accuracy and error. Interestingly, our results offer a different perspective from previous research [33], which suggests that MAS may fail through inter-agent misalignment. The shared memory architecture within the MAS, combined with the transparency assessment process, can establish a trustworthy CDSS. This finding highlights that a well-optimized inter-agent dynamic can be more beneficial than a simpler SAS. Although our MAS offers a robust system, we still need to consider whether additional factors, such as scalability or deployment constraints, justify a MAS in a given setting.

V Conclusion

In this study, we found that the MAS outperforms the SAS in mortality prediction accuracy (59% vs. 56%). Additionally, our results indicate that the MAS achieves a lower LOS prediction error while maintaining a similar degree of transparency. Specifically, the MAS scores 85.5% for transparency, whereas the SAS shows a very close 86.21%. This small difference suggests that our MAS effectively maintains transparency despite the complexity of coordinating decisions among several specialized agents.

The balance between performance and explainability in decision support systems has significant implications for clinical patient care. In critical care scenarios where prediction accuracy is paramount, our MAS approach presents significant advantages. While maintaining high transparency, the MAS improves predictive performance and addresses the critical need for both reliable outcomes and interpretable decision processes that healthcare professionals can incorporate into their clinical judgment. On the other hand, when explaining prediction rationales to build clinician trust is essential, the marginally higher transparency of the SAS still warrants further investigation. Our future research will focus on enhancing the clarity of the MAS while preserving its predictive advantages. Through improved inter-agent communication protocols and more sophisticated explanation mechanisms, we aim to develop a system that combines the predictive capabilities of MAS with the interpretability essential for safe and effective deployment in critical care settings.

Acknowledgment

The authors thank Medwrite Limited for supporting our work.

References

  • [1] D. Menzies, S. Kirwan, and A. Albarqawi, “Ai managed emergency documentation with a pretrained model,” arXiv preprint arXiv:2408.09193, 2024.
  • [2] C.-T. Li, C.-S. Chen, C.-M. Cheng, C.-P. Chen, J.-P. Chen, M.-H. Chen, Y.-M. Bai, and S.-J. Tsai, “Prediction of antidepressant responses to non-invasive brain stimulation using frontal electroencephalogram signals: Cross-dataset comparisons and validation,” Journal of Affective Disorders, vol. 343, pp. 86–95, 2023.
  • [3] S.-L. Lai, C.-S. Chen, B.-R. Lin, and R.-F. Chang, “Intraoperative detection of surgical gauze using deep convolutional neural network,” Annals of Biomedical Engineering, vol. 51, no. 2, pp. 352–362, 2023.
  • [4] P. Deshpande, M. W. Bhatt, S. K. Shinde, N. Labhade-Kumar, N. Ashokkumar, K. Venkatesan, and F. D. Shadrach, “Combining handcrafted features and deep learning for automatic classification of lung cancer on ct scans,” Journal of Artificial Intelligence and Technology, vol. 4, no. 2, pp. 102–113, 2024.
  • [5] S. Dayarathna, K. T. Islam, S. Uribe, G. Yang, M. Hayat, and Z. Chen, “Deep learning based synthesis of mri, ct and pet: Review and analysis,” Medical image analysis, vol. 92, p. 103046, 2024.
  • [6] C.-S. Chen and W.-S. Wang, “Psycho gundam: Electroencephalography based real-time robotic control system with deep learning,” arXiv preprint arXiv:2411.06414, 2024.
  • [7] T. Deng, D. Wu, S.-s. Liu, X.-l. Chen, Z.-w. Zhao, and L.-l. Zhang, “Association of blood urea nitrogen with 28-day mortality in critically ill patients: A multi-center retrospective study based on the eicu collaborative research database,” Plos one, vol. 20, no. 1, p. e0317315, 2025.
  • [8] M. Quttainah, V. Mishra, S. Madakam, Y. Lurie, S. Mark et al., “Cost, usability, credibility, fairness, accountability, transparency, and explainability framework for safe and effective large language models in medical education: Narrative review and qualitative study,” Jmir Ai, vol. 3, no. 1, p. e51834, 2024.
  • [9] D. Wang, J. Liu, Q. Lin, and H. Yu, “A decision-making system based on case-based reasoning for predicting stroke rehabilitation demands in heterogeneous information environment,” vol. 154. Elsevier, 2024, p. 111358.
  • [10] P. Rockenschaub, A. Hilbert, T. Kossen, P. Elbers, F. von Dincklage, V. I. Madai, and D. Frey, “The impact of multi-institution datasets on the generalizability of machine learning prediction models in the icu,” Critical Care Medicine, vol. 52, no. 11, pp. 1710–1721, 2024.
  • [11] S. Gupta, S. Dewan, A. Kaushal, A. Seth, J. Narula, and A. Varma, “eicu reduces mortality in stemi patients in resource-limited areas,” 2014.
  • [12] L. A. Celi, E. Hassan, C. Marquardt, M. Breslow, and B. Rosenfeld, “The eicu: it’s not just telemedicine,” pp. N183–N189, 2001.
  • [13] J. Qiu, K. Lam, G. Li, A. Acharya, T. Y. Wong, A. Darzi, W. Yuan, and E. J. Topol, “Llm-based agentic systems in medicine and healthcare,” pp. 1418–1420, 2024.
  • [14] Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, H. Park et al., “Mdagents: An adaptive collaboration of llms for medical decision-making,” Advances in Neural Information Processing Systems, vol. 37, pp. 79 410–79 452, 2024.
  • [15] W. Wang, Z. Ma, Z. Wang, C. Wu, W. Chen, X. Li, and Y. Yuan, “A survey of llm-based agents in medicine: How far are we from baymax?” arXiv preprint arXiv:2502.11211, 2025.
  • [16] Y. Jiang et al., “Medagentbench: A realistic virtual ehr environment to benchmark medical llm agents,” arXiv preprint arXiv:2501.14654, January 2025, revised February 2025.
  • [17] R. Safdari, J. S. Malak, N. Mohammadzadeh, and A. D. Shahraki, “A multi agent based approach for prehospital emergency management,” Bulletin of Emergency & Trauma, vol. 5, no. 3, p. 171, 2017.
  • [18] Z. Yao and H. Yu, “A survey on llm-based multi-agent ai hospital,” 2025.
  • [19] S. A. Gebreab, K. Salah, R. Jayaraman, M. H. ur Rehman, and S. Ellaham, “Llm-based framework for administrative task automation in healthcare,” IEEE, pp. 1–7, 2024.
  • [20] M. Siddik and H. J. Pandit, “Datasheets for healthcare ai: A framework for transparency and bias mitigation,” arXiv preprint arXiv:2501.05617, January 2025, accessed April 14, 2025.
  • [21] D. Saraswat, P. Bhattacharya, A. Verma, V. K. Prasad, S. Tanwar, G. Sharma, P. N. Bokoro, and R. Sharma, “Explainable ai for healthcare 5.0: opportunities and challenges,” pp. 84 486–84 517, 2022.
  • [22] Y.-J. Chen and V. K. Madisetti, “Information security, ethics, and integrity in llm agent interaction,” Journal of Information Security, vol. 16, no. 1, pp. 184–1, 2024.
  • [23] P. Radanliev, “Ai ethics: Integrating transparency, fairness, and privacy in ai development,” Applied Artificial Intelligence, vol. 39, no. 1, p. 2463722, 2025.
  • [24] Z. Xiang, L. Zheng, Y. Li, J. Hong, Q. Li, H. Xie, J. Zhang, Z. Xiong, C. Xie, C. Yang et al., “Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning,” arXiv preprint arXiv:2406.09187, 2024.
  • [25] S. Tripathi, K. Mongeau, D. Alkhulaifat, A. Elahi, and T. S. Cook, “Large language models in health systems: governance, challenges, and solutions,” Academic Radiology, vol. 32, no. 3, pp. 1189–1191, 2025.
  • [26] W. H. Organization, “Ethics and governance of artificial intelligence for health: large multi-modal models. who guidance,” 2024.
  • [27] M. R. Pinsky, A. Bedoya, A. Bihorac, L. Celi, M. Churpek, N. J. Economou-Zavlanos, P. Elbers, S. Saria, V. Liu, P. G. Lyons et al., “Use of artificial intelligence in critical care: opportunities and obstacles,” Critical Care, vol. 28, no. 1, p. 113, 2024.
  • [28] W. A. Knaus, J. E. Zimmerman, D. P. Wagner, E. A. Draper, and D. E. Lawrence, “Apache—acute physiology and chronic health evaluation: a physiologically based classification system,” Critical care medicine, vol. 9, no. 8, pp. 591–597, 1981.
  • [29] OpenAI, “Gpt-4o,” https://chatgpt.com/, 2024, accessed: 2025-04-09.
  • [30] I. Node, “Intelli: A framework for creating chatbots and ai agent workflows.” https://github.com/intelligentnode/Intelli, 2024, accessed: 2025-04-09.
  • [31] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
  • [32] O. R. Cawiding, S. Lee, H. Jo, S. Kim, S. Suh, E. Y. Joo, S. Chung, and J. K. Kim, “Symscore: Machine learning accuracy meets transparency in a symbolic regression-based clinical score generator,” Computers in Biology and Medicine, vol. 185, p. 109589, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010482524016743
  • [33] M. Cemri, M. Z. Pan, S. Yang, L. A. Agrawal, B. Chopra, R. Tiwari, and I. Stoica, “Why do multi-agent llm systems fail?” arXiv preprint arXiv:2503.13657, 2025.