Systematic Review | Open access
Evaluating the performance of artificial intelligence-based speech recognition for clinical documentation: a systematic review
BMC Medical Informatics and Decision Making volume 25, Article number: 236 (2025)
Abstract
Background
Clinical documentation is vital for effective communication, legal accountability and the continuity of care in healthcare. Traditional documentation methods, such as manual transcription, are time-consuming, prone to errors and contribute to clinician burnout. AI-driven transcription systems utilizing automatic speech recognition (ASR) and natural language processing (NLP) aim to automate and enhance the accuracy and efficiency of clinical documentation. However, the performance of these systems varies significantly across clinical settings, necessitating a systematic review of the published studies.
Methods
A comprehensive search of MEDLINE, Embase, and the Cochrane Library identified studies evaluating AI transcription tools in clinical settings, covering all records up to February 16, 2025. Inclusion criteria encompassed studies involving clinicians using AI-based transcription software, reporting outcomes such as accuracy (e.g., Word Error Rate), time efficiency and user satisfaction. Data were extracted systematically, and study quality was assessed using the QUADAS-2 tool. Due to heterogeneity in study designs and outcomes, a narrative synthesis was performed, with key findings and commonalities reported.
Results
Twenty-nine studies met the inclusion criteria. Reported word error rates (WERs) ranged widely, from 8.7% in controlled dictation settings to over 50% in conversational or multi-speaker scenarios. F1 scores spanned 0.416 to 0.856, reflecting variability in accuracy. Although some studies highlighted reductions in documentation time and improvements in note completeness, others noted increased editing burdens, inconsistent cost-effectiveness and persistent errors with specialized terminology or accented speech. Recent approaches based on large language models (LLMs) offered automated summarization features, yet often required human review to ensure clinical safety.
Conclusions
AI-based transcription systems show potential to improve clinical documentation but face challenges in accuracy, adaptability and workflow integration. Refinements in domain-specific training, real-time error correction and interoperability with electronic health records are critical for their effective adoption in clinical practice. Future research should also focus on next-generation “digital scribes” incorporating LLM-driven summarization and repurposing of text.
Clinical trial number
Not applicable.
Background
Clinical documentation, defined as the systematic recording of a patient’s medical history, diagnoses, treatment plans and care provided, remains a cornerstone of effective healthcare. It is critical in ensuring accurate communication among healthcare providers, legal accountability, and continuity of care [1]. However, traditional documentation methods, such as manual note-taking or transcription, are often labour-intensive, prone to errors and can detract from the quality of patient-clinician interactions [2, 3]. These inefficiencies not only contribute to clinician burnout but also risk compromising the accuracy of medical records and patient safety.
In recent years, Artificial Intelligence (AI) has begun to transform clinical documentation through the use of advanced technologies like automatic speech recognition (ASR), large language models (LLMs) and natural language processing (NLP) [4, 5]. These AI-driven transcription tools automate the process of converting spoken language into structured electronic medical records (EMRs), thereby alleviating the burden of manual data entry [5]. By streamlining this process, AI transcription systems offer the potential to improve the accuracy and completeness of clinical documentation while allowing clinicians to focus more on patient care and communication.
Despite the promise of AI in this domain, the effectiveness of AI transcription tools remains inconsistent across different clinical settings [4]. Studies report varying levels of accuracy, time savings and user satisfaction [4, 5]. While some tools demonstrate significant improvements in documentation speed and precision, others face challenges with speech recognition (SR) errors, the need for manual post-editing and inconsistencies in real-world clinical use [4, 5]. These mixed outcomes highlight the complexity of integrating AI tools into diverse healthcare environments and underscore the need for a thorough evaluation of their performance.
This review aims to synthesize the current evidence on AI transcription tools, focusing on their accuracy, efficiency and usability in clinical practice. By examining the successes and challenges of implementing these technologies, the review seeks to provide insights that can guide the development and integration of AI-driven documentation systems, ultimately shaping the future of clinical workflows and improving the quality of patient care.
Methods
This review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [6] for identifying, selecting and synthesizing evidence, and its protocol was developed and registered in PROSPERO (registration number CRD42024597200).
Search strategy
A comprehensive literature search was conducted on February 16, 2025, across multiple electronic databases, including MEDLINE (via OVID), Embase and the Cochrane Library, covering all records published up to that date. The search strategy was developed in consultation with medical information experts to identify studies that evaluated the performance of AI-based medical transcription software in clinical settings. Keywords and the following Medical Subject Headings (MeSH) were applied: “Artificial Intelligence”, “Digital Scribe”, “Medical Transcription”, “Speech Recognition”, “Natural Language Processing”, “Electronic Health Records” and “Clinical Documentation”. Details of the search strategy can be found in the Supplementary Material (Table S1).
Additionally, grey literature was searched via the Google search engine to capture relevant non-peer-reviewed studies. To further enhance the comprehensiveness of the search, forward and backward citation searching was performed on the reference lists of relevant studies to identify additional literature that may not have been captured in the initial search.
Inclusion and exclusion criteria
The inclusion criteria for this systematic review were as follows: the population of interest included studies that involved clinicians, such as physicians and nurses, who used AI-based transcription software for clinical documentation. The intervention of focus was the use of AI-driven transcription tools, which may include technologies like ASR, LLM and NLP systems. Eligible studies had to report on one or more key outcomes, such as transcription accuracy (measured, for example, through the word error rate, or WER), time savings, clinician satisfaction or the impact on patient care. The review included empirical studies of various designs, including randomized controlled trials (RCTs), cohort studies, cross-sectional studies, comparative evaluations and proof-of-concept studies. Only studies published in English or with an English translation, and indexed up to February 16, 2025, were considered.
Studies that did not involve AI-based transcription tools were excluded, as were studies conducted in clinical settings without direct physician-patient interaction, such as laboratory-based evaluations and reports generated by radiologists and/or pathologists, and studies focusing on non-English language transcription. Additionally, conference abstracts, editorials, commentaries and opinion pieces that did not provide empirical data were excluded.
Study selection
All identified studies were imported into Covidence (Veritas Health Innovation, Melbourne, Australia) to facilitate the screening process. Four independent reviewers (J.J.W.N., E.W., C.X.L.G. and G.Z.N.S.) screened titles and abstracts to exclude studies that did not meet the inclusion criteria. Studies passing this initial screening underwent a full-text review by two independent reviewers (J.J.W.N. and E.W.). Discrepancies in study inclusion were resolved through discussion, with arbitration by a third, senior reviewer (Q.X.N., H.K.T. or S.S.N.G.) if necessary.
Data extraction
Data were extracted from each included study using a standardized data extraction form developed for this review. The extracted data encompassed study characteristics, including the software or model used, the type of AI model, study design, clinical setting and country in which the study was conducted. Additionally, details about the study population, sample size, whether the study was vendor-initiated, the reference standard and the comparator type were also recorded. Key performance metrics, such as F1 score, precision, recall, and WER, along with paper-specific outcomes related to AI transcription proficiency and key findings, were also included. Novel features of the AI transcription tools were documented to provide a comprehensive overview of the studies.
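For reference, the accuracy metrics extracted here (precision, recall and F1) derive from standard confusion counts. The following is a minimal Python sketch of the calculation; the counts themselves are assumed inputs, since each included study defined correct matches against its own reference standard.

```python
def precision_recall_f1(true_positives: int, false_positives: int,
                        false_negatives: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from standard confusion counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative (hypothetical) counts: 73 correctly extracted concepts,
# 27 spurious extractions, 8 concepts missed.
p, r, f1 = precision_recall_f1(73, 27, 8)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```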
Assessment of risk of bias and study quality
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [7] was used to systematically assess the risk of bias and applicability of the studies included in our review. QUADAS-2 is a widely used tool that evaluates risk of bias across four key domains: patient selection, index test, reference standard and flow and timing [7]. Each domain was independently assessed for potential bias by reviewing key study characteristics and determining if the conduct or interpretation of results could have introduced bias. We also evaluated whether the applicability of each study matched our review question.
For each domain, the risk of bias was rated as either “low,” “high,” or “unclear,” depending on the completeness and clarity of the study’s reported methods. Specifically, we looked at factors such as the selection of patients or datasets, how the index test was performed and interpreted, whether the reference standard was appropriate, and if there were any exclusions that could have influenced the results. Any discrepancies were resolved through discussion among the reviewers (E.W., J.J.W.N. and X.Z.). This careful assessment ensured that the studies included in the analysis were reliable and applicable to our research objectives.
Data synthesis
Given that a meta-analysis was not feasible due to anticipated heterogeneity in the study design, interventions and outcomes, a narrative synthesis was conducted, as guided by Popay et al. [8]. Findings were narratively synthesized by summarizing key outcomes such as transcription accuracy, clinician satisfaction, impact on patient care, and usability, and by identifying common patterns across the studies.
Results
Literature retrieval
A total of 5,244 records were initially identified through database searches. After removing 1,011 duplicates using Covidence, 4,233 studies were screened based on titles and abstracts. During this screening phase, 4,173 studies were excluded for not meeting the inclusion criteria, leaving 60 studies for full-text review. All 60 studies were retrieved for detailed assessment. After applying the inclusion and exclusion criteria, 25 studies were included. As illustrated in Fig. 1, an additional four studies were identified through forward and backward citation searching, bringing the final total to 29 studies. The key study characteristics and findings of the 29 studies [4, 9–36] are summarised in Tables 1 and 2, respectively.
Study results
Table 1 provides an overview of each study’s design, setting, participant information, AI transcription tools and whether there was vendor involvement. Study designs ranged from RCTs [11, 16] to comparative or observational studies [9, 10, 12, 20, 21], with a growing number of recent publications employing qualitative or pre-post approaches to capture both performance metrics and user perspectives [23, 28, 33–36]. These studies spanned diverse environments including emergency departments (EDs), inpatient wards, specialty outpatient clinics (e.g., gastroenterology, urology, dermatology) and simulated clinical scenarios.
Key findings
Accuracy and error rates
While some systems demonstrated impressive precision and recall (Happe et al. achieved a precision of 0.73 and a recall of 0.90 in a specialized vocabulary environment [10], and Suominen et al. reached F1 scores of up to 0.856 for nursing tasks [15]), other studies highlighted notable shortcomings. For instance, Lybarger et al. reported a much lower maximum F1 score of 0.49 [18], and Zhou et al. found an F1 score of 0.416 in nursing contexts despite real-world training data [19]. Similarly, WER ranged from as low as 0.087 in controlled scenarios (Issenman et al. [12]) to more than 2.9 in real-time, multi-specialty outpatient encounters (Biro et al. [32]). van Buchem et al. demonstrated modest ROUGE F1 scores (0.32 unedited vs. 0.41 human-edited) for automated summaries [31]; the fact that human editing still improved these outputs underscores both the potential and the current limitations of LLM-driven summarization.
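For context, WER is the word-level edit (Levenshtein) distance between the system transcript and a reference transcript divided by the number of reference words; because insertions add to the numerator but not the denominator, WER can exceed 1.0, which is how values above 2.9 arise. A minimal sketch of the standard dynamic-programming computation follows; whitespace tokenization is a simplifying assumption, as published evaluations typically normalize text first, and the example dictation is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two edits over six reference words gives WER of about 0.33.
print(word_error_rate("start metformin 500 mg twice daily",
                      "start metformin 500 mg twice a day"))
```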
Workflow efficiency and time savings
Results on workflow efficiency were mixed. While Zick et al. [9] and Issenman et al. [12] observed decreased turnaround time (from days to hours or minutes), Hodgson et al. [16] and Blackley et al. [21] found that post-editing often negated potential time gains. More recent LLM-based systems (e.g., Bundy et al. [23], Ma et al. [34]) reported shortened overall documentation time for certain specialties, but these findings often relied on small sample sizes or single-site designs, limiting generalizability.
Cost implications
Cost-effectiveness was inconclusive across the included studies. Early work in EDs (Zick et al. [9]) suggested significant cost savings with ASR, whereas Issenman et al. [12] found that voice recognition could be twice as expensive in pediatric gastroenterology. These differences highlight how cost can vary based on clinical setting, complexity of cases and existing staffing models.
Clinical documentation quality and patient care
Studies such as Almario et al. [13, 14] showed that AI-assisted documentation captured more clinically relevant red flags than physician-typed notes. Other research (e.g., Kodish-Wachs et al. [17], Sezgin et al. [30]) observed that high error rates or poor summarization fidelity could pose risks to patient safety, especially if omissions go unnoticed by overburdened clinicians. Notably, more recent digital scribes and LLM-based solutions are designed to create structured summaries (e.g., SOAP notes), although these still generally require human review for accuracy.
Clinician satisfaction, burnout and adoption
Goss et al. [20] identified higher satisfaction when clinicians encountered fewer transcription errors and minimal editing demands. Misurac et al. [28] and Shah et al. [36] further reported decreased burnout levels following the adoption of AI tools, indicating a potential benefit for clinician well-being. Despite improved satisfaction in some quarters, many clinicians expressed reluctance to rely fully on AI scribing, citing concerns about real-time error correction and the need for manual review (e.g., Bundy et al. [23], Moryousef et al. [35]).
Risk of bias and study quality
Most of the included studies were assessed as having a low risk of bias and low applicability concerns, as shown in Table 3 and illustrated in Figs. 2 and 3, supporting the reliability of the findings on AI-based transcription tools in medical domains. For the studies rated as having a moderate or high risk of bias or applicability concerns, the greatest contributing factor was patient selection, followed by the index test; this was likely because some studies had unclear patient selection criteria and others used controlled test environments, which may bias the index tests. Some studies, such as Zick et al. [9], Blackley et al. [21], Hodgson et al. [16] and Kodish-Wachs et al. [17], had a high risk of bias in patient selection and applicability concerns, which may limit generalizability. Unclear reference standards in studies such as Bundy et al. [23] and Goss et al. [20] suggest potential gaps in validation. While the overall low risk in flow and timing strengthens confidence in the results, variability in methodological rigor underscores the need for standardised evaluation in future studies to ensure consistent and reliable conclusions.
Discussion
Our review identified 29 studies that investigated the applications of ASR and NLP in medical transcription across a variety of clinical settings. The included studies spanned environments such as EDs, inpatient wards, specialized clinics (e.g., gastroenterology, psychiatry and endocrinology) and even simulated scenarios replicating ambulatory primary care workflows. Owing to this diversity and the significant heterogeneity in study designs, sample sizes and performance metrics, direct comparisons were challenging. Nonetheless, the findings underscore the wide-ranging potential of AI-based transcription technology in healthcare and highlight certain common challenges that must be overcome to advance the field.
A broad array of AI models and software systems emerged from the review, from older ASR-based tools, such as Dragon NaturallySpeaking Medical Suite and Dragon Medical One [9, 12, 15, 16, 20], to more advanced products that incorporate LLMs, including DAX Copilot and GPT-4–driven systems [31, 33, 34]. The newer studies tend to describe ambient AI scribes, which not only transcribe but also summarize and repurpose clinical notes, and which may overcome some limitations of standalone SR tools. While these technologies differ in their underlying architectures, their shared aim is to automate or accelerate clinical documentation by converting speech into text and extracting medically relevant content.
This variety in technological approaches was mirrored by the diversity of clinical settings in which these tools were tested. Some studies relied on synthetic data sets, such as nursing handover records, while others evaluated real-world interactions in high-pressure environments like the ED. This breadth demonstrates the adaptability of AI transcription solutions but also reveals that performance is heavily context dependent. Tools that excel in structured, repetitive ED workflows may struggle with varied discussions in multi-specialty clinics or with more complex, freeform patient-doctor dialogues. Likewise, whether a study used real or simulated encounters also influenced performance, as differences in setting and complexity affect metrics such as WER or F1 scores.
In general, the accuracy of AI-driven transcription remains mixed. WER varied from as low as 8.7% in highly controlled settings (Issenman et al. [12]) to over 50% in conversational or multi-speaker encounters (Kodish-Wachs et al. [17]). Some tools achieved more favorable precision/recall in domain-specific contexts, particularly when leveraging specialty vocabularies (e.g., Happe et al. [10], Suominen et al. [15]), whereas others (e.g., Lybarger et al. [18]) highlighted persistent transcription errors that require substantial manual correction. Performance estimates from older systems should also be interpreted cautiously, as rapid technological advances, particularly in neural network and transformer-based models, have likely rendered those results outdated or less generalizable to current AI transcription capabilities.
Besides accuracy, studies conveyed mixed evidence concerning time efficiency. Zick et al. and Issenman et al. both reported substantial reductions in documentation turnaround times [9, 12], whereas later research from Hodgson et al. and Blackley et al. found negligible or even negative impacts once clinicians’ editing tasks were factored in [16, 21]. Similarly, cost analyses yielded no consensus. Zick et al. posited that voice recognition could be up to 100 times less expensive than manual transcription [9], but Issenman et al. found it to be more costly in a pediatric gastroenterology context [12]. As these examples illustrate, site-specific factors, such as the prevalence of templated text, local staff costs and the volume of standard phrases, likely determine the cost-effectiveness of AI transcription.
Although AI transcription systems do not directly deliver patient care, they can indirectly influence clinical outcomes by improving documentation completeness and quality. Almario et al. reported a higher identification of red-flag symptoms in AI-drafted notes [13, 14], while others showed that accurate automated transcription may reduce cognitive load on clinicians. Nevertheless, any potential gains are offset by persistent concerns about error rates. High WER or omissions, as highlighted by Kodish-Wachs et al. [17], remain a threat to real-time decision-making, and this problem has not disappeared with the advent of LLM-based scribes, as seen in Bundy et al., van Buchem et al. and Biro et al. [23, 31, 32]. The subsequent post-editing burden also continues to challenge clinicians’ time management, particularly in busy and dynamic outpatient settings with a variety of patient presentations. Moreover, the efficiency gains from AI transcription are not guaranteed, and initial investment costs can be prohibitive [37].
Clinicians’ opinions, acceptance and burnout also surfaced as important considerations for AI adoption. Surveys by Goss et al. [20] and interventions assessed by Misurac et al. [28] revealed that while some clinicians appreciate the potential reduction in documentation burdens, many remain cautious, dissatisfied with high error rates, or concerned about the reliability of AI-generated transcripts.
Several recurring challenges in this space require further attention. First, transcription accuracy often degrades with longer or more complex audio, which suggests the need for incremental or real-time correction features. Second, accented or non-native speech frequently leads to transcription mistakes, highlighting a need for accent adaptation or multi-accent training modules [11, 20, 22, 38]. While systems like AEGIS and NOMINDEX demonstrated high accuracy in specific clinical environments [39], their performance may not generalize well across diverse settings, particularly where speech patterns differ. This is especially true in multinational healthcare systems or regions with a high percentage of non-native English speakers. Third, the training of specialty-specific AI models can be hampered by privacy concerns, as clinicians and institutions may be reluctant to share sensitive patient transcripts for model fine-tuning. Fourth, only a minority of tools currently offer robust real-time error correction, meaning that any short-term gains in typing speed may be negated by lengthy revision processes. Beyond technical refinement, the limited pace of AI transcription adoption in healthcare might reflect deep structural barriers, including regulatory scrutiny over patient safety, a fragmented EHR environment that impedes easy integration, and unclear financial incentives. Moreover, as some studies have indicated (e.g., Issenman et al. [12]), frustrated or unreceptive physicians may be unwilling to incorporate new documentation technologies, especially if these tools require significant training or produce large volumes of errors. Further progress will depend on resolving issues of accuracy, accent variability, system interoperability and cost. Future research should also incorporate advanced evaluation metrics, like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) [40], to systematically assess the quality of AI-generated summaries beyond simple WER.
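To illustrate one such metric, ROUGE-1 scores the unigram overlap between a generated summary and a reference summary. The sketch below uses lowercased whitespace tokens and clipped counts, a simplification relative to the full ROUGE family; the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap, with candidate counts clipped
    to the reference counts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("patient reports chest pain radiating to the left arm",
                "chest pain radiating to left arm reported"))  # ~0.75
```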
It is worth noting that ambient AI scribe workflows, already in limited clinical use in the UK since early 2025 (e.g., Tortus, Heidi) [41], may represent the logical next step for AI documentation: integrating LLMs to summarize and repurpose transcripts without requiring pristine input accuracy. Transcription, in this sense, is merely a transitional stage toward more comprehensive AI scribe solutions that ultimately address patient-clinician interactions holistically.
Limitations of review
This review is not without limitations. Firstly, it searched three major databases (MEDLINE, Embase and the Cochrane Library) as well as grey literature, but did not consult IEEE Xplore, a key database for engineering and technology-related research, including AI and machine learning. This exclusion may have limited the review’s ability to capture relevant studies on AI transcription systems, particularly those focused on technical innovations in SR and NLP (although such studies may not have involved healthcare applications). Secondly, most of the included studies were short-term evaluations or proof-of-concept studies conducted in controlled environments or with small sample sizes. There is a lack of long-term, real-world data on the sustained use of AI transcription tools in clinical practice. As a result, this review cannot fully assess the long-term impact of AI transcription on clinician efficiency, patient care outcomes or system-wide healthcare improvements. Thirdly, given the narrative synthesis approach, this review could not draw strong, statistically powered conclusions about the overall effectiveness of AI transcription tools. Lastly, the review focused primarily on outcomes such as accuracy, time savings and clinician satisfaction, without addressing other potentially important dimensions, such as cost-effectiveness, user training requirements or implementation barriers. These additional factors could significantly affect the adoption and success of AI transcription tools in clinical practice, but they were not consistently reported in the studies reviewed.
Conclusions
In conclusion, this systematic review found that AI speech recognition and transcription software has the potential to improve clinical documentation, enhance workflow efficiency and reduce the documentation burden on clinicians. Tools designed for specific medical domains can achieve high levels of accuracy, as evidenced by systems like AEGIS and NOMINDEX, which outperformed manual documentation. However, there was significant variability in the performance of AI SR and transcription tools across software platforms and clinical environments, with general-purpose SR systems often producing high error rates and requiring time-consuming manual corrections. This variability highlights that AI transcription software is still in a developmental phase, with much room for refinement, particularly in adapting systems to accents and complex medical language and in improving real-time error correction, before widespread adoption can be achieved. Future work should also expand the scope beyond transcription alone, exploring end-to-end AI scribe capabilities and evaluating their real-world effectiveness.
Data availability
No datasets were generated or analysed during the current study.
References
Kuhn T, Basch P, Barr M, Yackel T, Medical Informatics Committee of the American College of Physicians. Clinical documentation in the 21st century: executive summary of a policy position paper from the American College of Physicians. Ann Intern Med. 2015;162(4):301–3. https://doi.org/10.7326/M14-2128.
Moy AJ, Schwartz JM, Chen R, Sadri S, Lucas E, Cato KD, Rossetti SC. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inf Assoc. 2021;28(5):998–1008. https://doi.org/10.1093/jamia/ocaa325.
Gesner E, Dykes PC, Zhang L, Gazarian P. Documentation burden in nursing and its role in clinician burnout syndrome. Appl Clin Inf. 2022;13(5):983–90. https://doi.org/10.1055/s-0042-1757157.
Balloch J, Sridharan S, Oldham G, Wray J, Gough P, Robinson R, Sebire NJ, Khalil S, Asgari E, Tan C, Taylor A, Pimenta D. Use of an ambient artificial intelligence tool to improve quality of clinical documentation. Future Healthc J. 2024;11(3):100157. https://doi.org/10.1016/j.fhj.2024.100157.
Perkins SW, Muste JC, Alam T, Singh RP. Improving clinical documentation with artificial intelligence: a systematic review. Perspect Health Inform Manage. 2024;21(2):1–25.
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. https://doi.org/10.1136/bmj.n71.
Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MM, Sterne JA, Bossuyt PM, QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–36. https://doi.org/10.7326/0003-4819-155-8-201110180-00009.
Popay J, Roberts H, Sowden A, Petticrew M, Arai L, Rodgers M, Britten N, Roen K, Duffy S. Guidance on the conduct of narrative synthesis in systematic reviews. A product from the ESRC. Methods Programme Version. 2006;1(1):b92.
Zick RG, Olsen J. Voice recognition software versus a traditional transcription service for physician charting in the ED. Am J Emerg Med. 2001;19(4):295–8. https://doi.org/10.1053/ajem.2001.24487.
Happe A, Pouliquen B, Burgun A, Cuggia M, Le Beux P. Automatic concept extraction from spoken medical reports. Int J Med Inf. 2003;70(2–3):255–63. https://doi.org/10.1016/s1386-5056(03)00055-8.
Mohr DN, Turner DW, Pond GR, Kamath JS, De Vos CB, Carpenter PC. Speech recognition as a transcription aid: a randomized comparison with standard transcription. J Am Med Inf Assoc. 2003 Jan-Feb;10(1):85–93. https://doi.org/10.1197/jamia.m1130.
Issenman RM, Jaffer IH. Use of voice recognition software in an outpatient pediatric specialty practice. Pediatrics. 2004;114(3):e290–3. https://doi.org/10.1542/peds.2003-0724-L.
Almario CV, Chey W, Kaung A, Whitman C, Fuller G, Reid M, Nguyen K, Bolus R, Dennis B, Encarnacion R, Martinez B, Talley J, Modi R, Agarwal N, Lee A, Kubomoto S, Sharma G, Bolus S, Chang L, Spiegel BM. Computer-generated vs. physician-documented history of present illness (HPI): results of a blinded comparison. Am J Gastroenterol. 2015;110(1):170–9. https://doi.org/10.1038/ajg.2014.356.
Almario CV, Chey WD, Iriana S, Dailey F, Robbins K, Patel AV, Reid M, Whitman C, Fuller G, Bolus R, Dennis B, Encarnacion R, Martinez B, Soares J, Modi R, Agarwal N, Lee A, Kubomoto S, Sharma G, Bolus S, Chang L, Spiegel BM. Computer versus physician identification of gastrointestinal alarm features. Int J Med Inf. 2015;84(12):1111–7. https://doi.org/10.1016/j.ijmedinf.2015.07.006.
Suominen H, Johnson M, Zhou L, Sanchez P, Sirel R, Basilakis J, Hanlen L, Estival D, Dawson L, Kelly B. Capturing patient information at nursing shift changes: methodological evaluation of speech recognition and information extraction. J Am Med Inf Assoc. 2015;22(e1):e48–66. https://doi.org/10.1136/amiajnl-2014-002868.
Hodgson T, Magrabi F, Coiera E. Efficiency and safety of speech recognition for documentation in the electronic health record. J Am Med Inf Assoc. 2017;24(6):1127–33. https://doi.org/10.1093/jamia/ocx073.
Kodish-Wachs J, Agassi E, Kenny P 3rd, Overhage JM. A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech. AMIA Annu Symp Proc. 2018;2018:683–689.
Lybarger K, Ostendorf M, Yetisgen M. Automatically detecting likely edits in clinical notes created using automatic speech recognition. AMIA Annu Symp Proc. 2018;2017:1186–95.
Zhou L, Blackley SV, Kowalski L, Doan R, Acker WW, Landman AB, Kontrient E, Mack D, Meteer M, Bates DW, Goss FR. Analysis of errors in dictated clinical documents assisted by speech recognition software and professional transcriptionists. JAMA Netw Open. 2018;1(3):e180530. https://doi.org/10.1001/jamanetworkopen.2018.0530.
Goss FR, Blackley SV, Ortega CA, Kowalski LT, Landman AB, Lin CT, Meteer M, Bakes S, Gradwohl SC, Bates DW, Zhou L. A clinician survey of using speech recognition for clinical documentation in the electronic health record. Int J Med Inf. 2019;130:103938. https://doi.org/10.1016/j.ijmedinf.2019.07.017.
Blackley SV, Schubert VD, Goss FR, Al Assad W, Garabedian PM, Zhou L. Physician use of speech recognition versus typing in clinical documentation: a controlled observational study. Int J Med Inf. 2020;141:104178. https://doi.org/10.1016/j.ijmedinf.2020.104178.
Van Woensel W, Taylor B, Abidi SSR. Towards an adaptive clinical transcription system for in-situ transcribing of patient encounter information. Stud Health Technol Inf. 2022;290:158–62. https://doi.org/10.3233/SHTI220052.
Bundy H, Gerhart J, Baek S, Connor CD, Isreal M, Dharod A, Stephens C, Liu TL, Hetherington T, Cleveland J. Can the administrative loads of physicians be alleviated by AI-facilitated clinical documentation? J Gen Intern Med. 2024;39(15):2995–3000. https://doi.org/10.1007/s11606-024-08870-z.
Cao DY, Silkey JR, Decker MC, Wanat KA. Artificial intelligence-driven digital scribes in clinical documentation: pilot study assessing the impact on dermatologist workflow and patient encounters. JAAD Int. 2024;15:149–51. https://doi.org/10.1016/j.jdin.2024.02.009.
Haberle T, Cleveland C, Snow GL, Barber C, Stookey N, Thornock C, Younger L, Mullahkhel B, Ize-Ludlow D. The impact of nuance DAX ambient listening AI documentation: a cohort study. J Am Med Inf Assoc. 2024;31(4):975–9. https://doi.org/10.1093/jamia/ocae022.
Islam MN, Mim ST, Tasfia T, Hossain MM. Enhancing patient treatment through automation: the development of an efficient scribe and prescribe system. Inf Med Unlocked. 2024;45:101456.
Liu TL, Hetherington TC, Stephens C, McWilliams A, Dharod A, Carroll T, Cleveland JA. AI-powered clinical documentation and clinicians’ electronic health record experience: a nonrandomized clinical trial. JAMA Netw Open. 2024;7(9):e2432460. https://doi.org/10.1001/jamanetworkopen.2024.32460.
Misurac J, Knake LA, Blum JM. The effect of ambient artificial intelligence notes on provider burnout. Appl Clin Inf. 2024. https://doi.org/10.1055/a-2461-4576.
Owens LM, Wilda JJ, Grifka R, Westendorp J, Fletcher JJ. Effect of ambient voice technology, natural language processing, and artificial intelligence on the patient-physician relationship. Appl Clin Inf. 2024;15(4):660–7. https://doi.org/10.1055/a-2337-4739.
Sezgin E, Sirrianni JW, Kranz K. Evaluation of a digital scribe: conversation summarization for emergency department consultation calls. Appl Clin Inf. 2024;15(3):600–11. https://doi.org/10.1055/a-2327-4121.
van Buchem MM, Kant IMJ, King L, Kazmaier J, Steyerberg EW, Bauer MP. Impact of a digital scribe system on clinical documentation time and quality: usability study. JMIR AI. 2024;3:e60020. https://doi.org/10.2196/60020.
Biro J, Handley JL, Cobb NK, Kottamasu V, Collins J, Krevat S, Ratwani RM. Accuracy and safety of AI-enabled scribe technology: instrument validation study. J Med Internet Res. 2025;27:e64993. https://doi.org/10.2196/64993.
Duggan MJ, Gervase J, Schoenbaum A, Hanson W, Howell JT 3rd, Sheinberg M, Johnson KB. Clinician experiences with ambient scribe technology to assist with documentation burden and efficiency. JAMA Netw Open. 2025;8(2):e2460637. https://doi.org/10.1001/jamanetworkopen.2024.60637.
Ma SP, Liang AS, Shah SJ, Smith M, Jeong Y, Devon-Sand A, Crowell T, Delahaie C, Hsia C, Lin S, Shanafelt T, Pfeffer MA, Sharp C, Garcia P. Ambient artificial intelligence scribes: utilization and impact on documentation time. J Am Med Inf Assoc. 2025;32(2):381–5. https://doi.org/10.1093/jamia/ocae304.
Moryousef J, Nadesan P, Uy M, Matti D, Guo Y. Assessing the efficacy and clinical utility of artificial intelligence scribes in urology. Urology. 2025;196:12–7. https://doi.org/10.1016/j.urology.2024.11.061.
Shah SJ, Devon-Sand A, Ma SP, Jeong Y, Crowell T, Smith M, Liang AS, Delahaie C, Hsia C, Shanafelt T, Pfeffer MA, Sharp C, Lin S, Garcia P. Ambient artificial intelligence scribes: physician burnout and perspectives on usability and documentation burden. J Am Med Inf Assoc. 2025;32(2):375–80. https://doi.org/10.1093/jamia/ocae295.
Joseph J, Moore ZEH, Patton D, O’Connor T, Nugent LE. The impact of implementing speech recognition technology on the accuracy and efficiency (time to complete) clinical documentation by nurses: a systematic review. J Clin Nurs. 2020;29(13–14):2125–37. https://doi.org/10.1111/jocn.15261.
Koenecke A, Nam A, Lake E, Nudell J, Quartey M, Mengesha Z, Toups C, Rickford JR, Jurafsky D, Goel S. Racial disparities in automated speech recognition. Proc Natl Acad Sci U S A. 2020;117(14):7684–9. https://doi.org/10.1073/pnas.1915768117.
Blackley SV, Huynh J, Wang L, Korach Z, Zhou L. Speech recognition for clinical documentation from 1990 to 2018: a systematic review. J Am Med Inf Assoc. 2019;26(4):324–38. https://doi.org/10.1093/jamia/ocy179.
Gardner N, Khan H, Hung C-C. Definition modeling: literature review and dataset analysis. Appl. Comput. Intell. 2022;2:83–98. https://doi.org/10.3934/aci.2022005.
Lawton J. NHS AI trial hailed as ‘remarkable’ and most ‘transformative’ tech in 15 years [Internet]. 2025 [cited 2025 Mar 10]. Available from: https://www.dailystar.co.uk/news/latest-news/nhs-ai-trial-hailed-remarkable-34423254
Acknowledgements
We thank Dr Clyve Yu Leon Yaow and Mr Ansel Shao Pin Tang for helping to develop and refine the search strategies for this review.
Author information
Contributions
All authors have made substantial contributions to all the following: (1) the conception and design of the study, or acquisition of data, or analysis and interpretation of data, (2) drafting the article or revising it critically for important intellectual content, (3) final approval of the version to be submitted. No writing assistance was obtained in the preparation of the manuscript. The manuscript, including related data, figures and tables has not been previously published, and the manuscript is not under consideration elsewhere. Conceptualization, Design and Methodology: K.X.Z., Q.X.N., Data Curation: K.X.Z., C.X.L.G., G.Z.N.S., S.S.N.G., Q.X.N., J.J.W.N., E.W., X.Z., Formal Analysis: K.X.Z., C.X.L.G., G.Z.N.S., Q.X.N., J.J.W.N., E.W., X.Z., S.S.N.G., H.K.T., Investigation: C.X.L.G., G.Z.N.S., K.X.Z., Q.X.N., J.J.W.N., E.W., X.Z., H.K.T., S.S.N.G., Supervision: H.K.T., S.S.N.G., Q.X.N., Writing – original draft: Q.X.N., J.J.W.N., E.W., X.Z., Writing – review & editing: K.X.Z., H.K.T., S.S.N.G., Q.X.N., J.J.W.N., E.W., X.Z.
Ethics declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Ng, J.J.W., Wang, E., Zhou, X. et al. Evaluating the performance of artificial intelligence-based speech recognition for clinical documentation: a systematic review. BMC Med Inform Decis Mak 25, 236 (2025). https://doi.org/10.1186/s12911-025-03061-0