
Evaluating gender bias in large language models in long-term care

Abstract

Background

Large language models (LLMs) are being used to reduce the administrative burden in long-term care by automatically generating and summarising case notes. However, LLMs can reproduce bias in their training data. This study evaluates gender bias in summaries of long-term care records generated with two state-of-the-art, open-source LLMs released in 2024: Meta’s Llama 3 and Google Gemma.

Methods

Gender-swapped versions were created of long-term care records for 617 older people from a London local authority. Summaries of male and female versions were generated with Llama 3 and Gemma, as well as benchmark models from Meta and Google released in 2019: T5 and BART. Counterfactual bias was quantified through sentiment analysis alongside an evaluation of word frequency and thematic patterns.

Results

The benchmark models exhibited some variation in output on the basis of gender. Llama 3 showed no gender-based differences across any metrics. Gemma displayed the most significant gender-based differences: male summaries focused more on physical and mental health issues, the language used for men was more direct, and women’s needs were downplayed more often than men’s.

Conclusion

Care services are allocated on the basis of need. If women’s health issues are underemphasised, this may lead to gender-based disparities in service receipt. LLMs may offer substantial benefits in easing administrative burden. However, the findings highlight the variation in state-of-the-art LLMs, and the need for evaluation of bias. The methods in this paper provide a practical framework for quantitative evaluation of gender bias in LLMs. The code is available on GitHub.


Introduction

In the US and UK, large language models (LLMs) are being used to generate care documentation by summarising audio transcripts of care interventions or distilling extensive free text case notes into short summaries [1,2,3]. The case for such tools is compelling. Documentation is the most time-consuming task in health and long-term care [4,5,6]. Additionally, electronic care records often span decades, making it impractical for practitioners to review all the information. In some cases, avoidable harm has occurred where workers were unaware of important details in their records [7]. By automatically generating or summarising records, LLMs have the potential to reduce costs without cutting services, improve access to relevant information, and free up time spent on documentation.

There is political will to expand such technologies in health and care. The 2023 US Executive Order issued by President Biden sought to promote the “deployment of… generative AI-enabled technologies in healthcare”, and established a Health and Human Services (HHS) Artificial Intelligence (AI) Task Force [8]. The Spring 2024 UK budget stated that LLMs will be used to increase the time clinicians can spend with patients and unlock an annual productivity benefit of £500 million to £850 million ($643 million to $1.1 billion USD) [9]. The European Union (EU) AI Act provides a framework for the introduction of such products, though it also mandates significant regulatory oversight [10, 11].

LLMs can produce accurate summaries of healthcare records and even outperform humans [12]. High-quality, relevant documentation is associated with reduced cognitive burden, fewer errors, and improved care quality [13,14,15]. However, while accuracy is a necessary condition for deploying such models, it is not sufficient. LLMs can reproduce and even amplify biases present in their training data [16, 17], and subtle differences in tone or emphasis may influence decision-making by care professionals [18].

A growing body of research has found that LLMs perpetuate gender stereotypes across domains. Studies have shown biased outputs in machine translation [19], hiring recommendations [20], and occupational rankings [21], with models often reinforcing stereotypes more strongly than real-world statistics [22]. Similarly, in healthcare contexts, LLMs have been found to generate biased clinical vignettes [23] with diagnoses aligning to stereotypes. Gallegos et al. [24] distinguish between representational harm, where language varies unfairly across groups, and allocational harm, where bias affects treatment decisions. LLM-generated summaries of free-text care records can fall into both categories: if models emphasise one group’s health needs more than another’s, this may influence downstream decisions about service allocation [25].

Although gender bias has been observed in early transformer models such as BERT and GPT-2 [26], recent work suggests that newer models differ widely in their susceptibility to bias. For instance, Shan et al. [27] find large variations across several measures of bias in LLMs with similar architectures and training data. This underscores the need to evaluate specific models to determine which are most appropriate for use in health and care settings.

This study evaluates the gender bias in Meta’s Llama 3 [28] and Google’s Gemma [29], two state-of-the-art, open-source LLMs released in 2024. Summaries of individual-level case notes from a London local authority were generated using each model and compared with summaries produced by earlier benchmark models: Google’s T5 [30] and Meta’s BART [31], which have previously been shown to exhibit gender bias [32, 33]. The aim is to determine whether newer models mitigate or exacerbate gender bias when applied to real-world documentation in long-term care. Gender was chosen as a focal point for this analysis because it is a salient and routinely documented attribute in social care records, and provides a clear entry point for assessing bias in generated content.

Three questions are addressed in this study. Firstly, whether there are measurable, gender-based differences in summaries of long-term care case notes generated by state-of-the-art, open-source LLMs. Secondly, if so, whether there is measurable inclusion bias [34], where different topics are included in summaries for men and women, or linguistic bias [17], where the same topics are discussed using different language. Finally, the implications for care practice of gender-based differences are considered. The paper also contributes a methodological framework for evaluating gender bias in LLM-generated summaries, designed to be reproducible and applicable across domains.

Materials and methods

Data

Pseudonymised records were extracted from a local authority adult social care case recording system in England, recorded between 2010 and 2020. Ethical approval was obtained for the use of the data. Texts about men and women were selected, and gender-swapped versions were created using Llama 3 as outlined in Analysis and data pre-processing. Summaries of each pair of texts were then generated, and the male and female versions of the output were compared in three ways. Firstly, sentiment analysis was applied to determine whether any model generates consistently more negative sentiment. Secondly, the inclusion bias [34] of certain topics was measured by comparing the frequency of terms related to domains such as health and physical appearance in summaries for each gender. Finally, linguistic bias [17] was assessed by comparing the frequencies of words appearing in the output generated by each model.

Measuring bias: counterfactual fairness

To assess bias, this paper uses the framework of counterfactual fairness defined in Kusner et al. [35], that a machine learning model is fair towards an individual if its output is the same in the actual world and a counterfactual world where the individual’s circumstances are identical, except for a demographic change such as gender, race or sexual orientation.

More formally, a predictor \(\hat{Y}\) is counterfactually fair if, for any individual with observed attributes \(A = a\) (protected attribute) and \(X = x\) (remaining attributes), and for any other possible value \(a^{\prime}\) of \(A\), Eq. (1) holds.

$$P\left( \hat{Y}_{A \leftarrow a} = y \mid A = a, X = x \right) = P\left( \hat{Y}_{A \leftarrow a^{\prime}} = y \mid A = a, X = x \right), \quad \text{for all } y.$$
(1)

Where:

  • \(P(\hat{Y}_{A \leftarrow a} = y \mid A = a, X = x)\) is the probability that the prediction \(\hat{Y} = y\), given that the individual actually has attribute \(A = a\) and characteristics \(X = x\).

  • \(P(\hat{Y}_{A \leftarrow a^{\prime}} = y \mid A = a, X = x)\) is the probability that the prediction \(\hat{Y} = y\), if, counterfactually, the protected attribute \(A\) were set to \(a^{\prime}\), while keeping all else the same.

This definition was originally designed for outputs (\(\hat{Y}\)) that are straightforward to compare, such as insurance premiums or predicted risk of offending. The outputs of LLMs are sequences of high-dimensional vectors of varying length. Direct comparisons between them in vector space may be challenging to implement or interpret. Instead, the approach taken here is to analyse differences in the textual content of the model output—a practical adaptation aligned with several methods listed in Gallegos et al. [24]’s recent review of LLM bias metrics, including counterfactual sentiment analysis, Regard score, and lexicon-based comparisons.

Comparison of sentiment output

Three widely used, pre-trained sentiment analysis metrics were initially considered: SiEBERT [36], a binary sentiment classifier based on RoBERTa [37]; Regard [38], which is designed to detect demographic bias including gender stereotypes; and a DistilBERT-based model [39, 40], which outputs continuous sentiment scores. Each metric was tested on gender-swapped versions of the same input text to ensure that any measured sentiment differences in the summaries reflected model output, not bias in the metric itself.
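
To make this metric check concrete, the sketch below scores both gender-swapped versions of each original text with a pre-trained classifier and applies a simple paired test. It assumes the publicly available SiEBERT checkpoint on Hugging Face and illustrative data structures; the paper's own comparison of metrics (reported in Appendix 1) used regression rather than the paired test shown here.

```python
# Minimal sketch of the sentiment-metric check: score both gender-swapped
# versions of each original text and test whether the metric itself treats
# otherwise identical texts differently. Checkpoint ID, data structures and
# the choice of McNemar's test are assumptions, not the paper's code.
from transformers import pipeline
from statsmodels.stats.contingency_tables import mcnemar

sentiment = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english",  # assumed SiEBERT checkpoint
    truncation=True,
)

def score(text: str) -> int:
    """1 if classified positive, 0 otherwise."""
    return int(sentiment(text)[0]["label"] == "POSITIVE")

def metric_gender_check(pairs):
    """pairs: list of (female_version, male_version) of the same original note."""
    female = [score(f) for f, _ in pairs]
    male = [score(m) for _, m in pairs]
    # 2x2 table of paired binary outcomes (female positive? x male positive?)
    table = [[0, 0], [0, 0]]
    for f, m in zip(female, male):
        table[f][m] += 1
    result = mcnemar(table, exact=True)  # McNemar's test for paired binary data
    return result.statistic, result.pvalue
```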

Each metric produces a numeric score per sentence: SiEBERT, fine-tuned on 15 datasets of reviews and social media text, returns a binary classification (1 = positive), while Regard and the DistilBERT-based model produce continuous sentiment scores ranging from 0 to 1. SiEBERT and Regard showed no significant gender bias on the original inputs and were used in the main analysis. However, the DistilBERT-based model produced significantly different sentiment scores for gender-swapped versions of identical input texts (see Appendix 1), suggesting that the metric itself may be sensitive to gender. As a result, it was excluded from further analysis. A mixed effects regression model was then fitted for each of the remaining sentiment metrics, with document-level random intercepts that vary by summarisation model, as specified in Eq. (2).

$$\begin{aligned}\text{sentiment}_{ij} &=\; \beta_0 \;+\;\boldsymbol{\beta}_1^\top\,\mathbf{model}_{j} \;+\;\beta_2\,\text{gender}_{j}\\&\quad{}+\;\boldsymbol{\beta}_3^\top\!\bigl(\mathbf{model}_{j} \times \text{gender}_{j}\bigr)\;+\;\boldsymbol{\beta}_4^\top\,\mathbf{max\_tokens}_{j}\\&\quad{}+\;u_{0i}\;+\;\mathbf{u}_{1i}^\top\,\mathbf{model}_{j}\;+\;\epsilon_{ij}\end{aligned}$$
(2)

The dataset consists of 29,616 rows, representing 617 documents, each with 48 possible combinations of gender (2 levels), maximum token length (6 levels), and summarisation model (4 levels).

Where:

  • \(\text{sentiment}_{ij}\) is the outcome (a numeric score) for observation \(j\) in document \(i\).

  • \(\mathbf{model}_{j}\) is a vector of dummy variables indicating which model (Gemma, Llama 3, T5) level applies to row \(j\), with BART as the reference level.

  • \(\text{gender}_{j}\) is a binary variable with 0 = female and 1 = male.

  • \(\mathbf{model}_{j} \times \text{gender}_{j}\) is the interaction effect between gender and LLM summarisation model.

  • \(\mathbf{max\_tokens}_{j}\) is a vector of dummy variables for the max_tokens factor (75, 100, 150, 300 or None), with length 50 as the reference level.

  • \(u_{0i}\) and \(u_{1i}\) together define the document-level random intercepts for document \(i\) across the four LLMs. \(u_{0i}\) is the random intercept for the reference-level LLM (BART), and \(u_{1i}\) represents the differences between the random intercepts for each of the other models and the random intercept for BART.

  • \(\epsilon_{ij}\) is the residual error term, which is assumed to be \(\mathcal{N}(0,\sigma^2)\).

Data was also available for the age, gender and ethnicity of each individual. However, inclusion of these variables in the model led to very similar results, and a Likelihood Ratio test indicated that they did not improve the model. An alternative specification including an interaction between max_tokens and gender was tested, but a likelihood ratio test indicated that this interaction did not significantly improve the explanatory power of the model. For the sake of parsimony, these models are not included in the output in the Results section. For robustness, estimates were bootstrapped, and a variance-structured mixed effects model, a generalised estimating equations (GEE) model, a robust linear mixed model, and a separate linear model for each language model were fitted. Details of this are included in the Appendix.
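
For readers who want to reproduce the specification in Eq. (2), a minimal sketch is shown below using Python's statsmodels. The paper's models were fitted in R, so this is an illustrative equivalent rather than the original code, and the column names (sentiment, model, gender, max_tokens, doc_id) and file name are assumptions.

```python
# Sketch of the mixed effects specification in Eq. (2): fixed effects for
# summarisation model, gender, their interaction and max_tokens, with
# document-level random intercepts that vary by model. Illustrative only;
# the paper's analysis was run in R.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("sentiment_scores.csv")  # assumed long format: one row per summary

md = smf.mixedlm(
    # The paper uses BART and max_tokens = 50 as reference levels.
    "sentiment ~ C(model, Treatment('BART')) * gender + C(max_tokens)",
    data=df,
    groups="doc_id",                            # random effects clustered by document
    re_formula="~C(model, Treatment('BART'))",  # per-document intercepts by model
)
fit = md.fit(reml=True)
print(fit.summary())
```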

Inclusion bias: comparison of themes

Thematic analysis is used to assess the downstream consequences of counterfactual bias. A sample of original documents was examined to identify common themes across texts. Four themes were identified: physical health, mental health, physical appearance, and subjective language. To aid in the interpretation of differences in output, lists of words related to each theme were created. Llama 3 and Gemma were used to systematically scan the original texts for phrases associated with each theme. For instance, the models were prompted to identify all subjective language (such as “dirty,” “excessive,” and “rude”) in the original texts. A comprehensive list of terms was generated, which was manually refined to remove irrelevant entries, resulting in focused lists of terms. This process was repeated for each theme. The lists are included in the Appendix.

The total frequency of each term in the summaries generated by each model for male and female subjects was counted. As the original texts used all terms an equal number of times for each gender, any differences in the summaries were attributable to the summarisation models. The total counts of these terms in the summaries were compared, and \(\chi^2\) tests were used to determine if the differences were statistically significant. The \(p\)-values were adjusted for multiple comparisons using the Benjamini-Hochberg method [41].
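
As an illustration of this counting step, the sketch below aggregates theme-term counts for male and female summaries, runs a \(\chi^2\) test per theme, and applies the Benjamini-Hochberg adjustment. The 2x2 table construction (theme terms versus all other words, by gender) and the whitespace tokenisation are assumptions made for the example rather than details taken from the paper.

```python
# Sketch of the theme-level inclusion-bias test with Benjamini-Hochberg
# adjustment. Theme word lists and the table construction are illustrative.
from collections import Counter
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

def theme_count(summaries, theme_terms):
    """Total occurrences of any theme term across a list of summaries."""
    total = 0
    for text in summaries:
        tokens = Counter(text.lower().split())
        total += sum(tokens[t] for t in theme_terms)
    return total

def test_themes(male_summaries, female_summaries, themes):
    """themes: dict mapping theme name -> list of terms. Returns adjusted p-values."""
    names, pvals = [], []
    m_total = sum(len(s.split()) for s in male_summaries)
    f_total = sum(len(s.split()) for s in female_summaries)
    for name, terms in themes.items():
        m = theme_count(male_summaries, terms)
        f = theme_count(female_summaries, terms)
        # 2x2 table: theme-term counts vs all other words, by gender
        table = [[m, m_total - m], [f, f_total - f]]
        _, p, _, _ = chi2_contingency(table)
        names.append(name)
        pvals.append(p)
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return dict(zip(names, zip(p_adj, reject)))
```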

Linguistic bias: word frequency analysis

To analyse linguistic bias, frequencies of individual words were compared at two levels: overall counts and document-level counts. Firstly, word counts were aggregated across all documents for each LLM, and the frequency of each word in male and female summaries was compared. A \(\chi^2\) test was used to determine whether differences in overall counts were statistically significant, except where counts of fewer than 5 were observed for either gender, in which case Fisher’s exact test was used instead. Again, \(p\)-values were adjusted for multiple comparisons using the Benjamini-Hochberg method [41]. For document-level analysis, regression was performed on the word counts. For each word, a table of all documents in which it appeared was created, and a Poisson regression was run, where the dependent variable was the word count, and the independent variables were document ID, gender, and the maximum number of tokens, as specified in Eq. (3).

$$\log\left(\mathbb{E}[\text{count}_{ij} \mid \mathbf{X}_{ij}]\right) = \beta_0 + \beta_1\,\text{gender}_{j} + \boldsymbol{\beta}_2^\top\,\mathbf{max\_tokens}_{j} + \boldsymbol{\beta}_3^\top\,\mathbf{doc\_id}_{j}$$
(3)

Where:

  • \(\log\bigl(\mathbb{E}[\text{count}_{ij}\mid \mathbf{X}_{ij}]\bigr)\) is the log of the expected value of the count of each specific word for row \(j\) in document \(i\), given a vector of explanatory variables \(\mathbf{X}_{ij}\).

  • \(\text{gender}_{j}\) is a binary variable with 0 indicating female and 1 indicating male.

  • \(\mathbf{max\_tokens}_{j}\) is a vector of dummy variables for the max_tokens factor (75, 100, 150, 300 or None), with length 50 as the reference level.

  • \(\mathbf{doc\_id}_{j}\) is a vector of dummy variables identifying document \(i\) on row \(j\). This allows the model to account for the fact that words will be expected to appear a different number of times in each document. The document-level coefficients are not of interest and are not included in the results.

Occasionally, perfect separation occurred (i.e., words that never appeared for one gender), so Firth’s penalised likelihood method of Poisson regression [42] was used to obtain reliable parameter estimates. In cases of overdispersion (\(\frac{\sum \left(r_i^2 \right)}{\text{df}_{\text{residual}}} > 1.25\)), a negative binomial regression with the same independent variables was also run. As multiple comparisons were conducted, words were considered to appear significantly differently only if they were statistically significant in both the regression output and the Benjamini-Hochberg adjusted \(\chi^2\) test (adjusted \(p < 0.05\)).
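
A sketch of the word-level model in Eq. (3) is shown below using statsmodels. The paper fits Firth's penalised Poisson via the R brglm2 package; the unpenalised Poisson shown here, together with the overdispersion check that triggers a negative binomial refit, is an illustrative approximation with assumed column names.

```python
# Sketch of the document-level word count regression in Eq. (3), plus the
# overdispersion check. Illustrative approximation: the paper uses Firth's
# penalised Poisson (R brglm2) to handle perfect separation.
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_word_model(df):
    """df: one row per summary in which the word can appear, with columns
    count, gender, max_tokens and doc_id (assumed names)."""
    poisson = smf.glm(
        "count ~ gender + C(max_tokens) + C(doc_id)",
        data=df,
        family=sm.families.Poisson(),
    ).fit()
    # Overdispersion statistic: sum of squared Pearson residuals / residual df
    dispersion = (poisson.resid_pearson ** 2).sum() / poisson.df_resid
    if dispersion > 1.25:
        # Negative binomial refit (dispersion parameter left at its default here)
        nb = smf.glm(
            "count ~ gender + C(max_tokens) + C(doc_id)",
            data=df,
            family=sm.families.NegativeBinomial(),
        ).fit()
        return nb, dispersion
    return poisson, dispersion
```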

Analysis and data pre-processing

Creating equivalent male and female texts

The data included free text records for 3046 older adults receiving care in a London local authority. Free text responses to the care needs assessment question, which asks social workers to write a pen portrait of an individual’s needs at the time of assessment, were selected for summarisation. The analysis was limited to responses of at least 200 words, resulting in 2030 records. Duplicate or near-duplicate portraits were removed, as were portraits that would not describe a comparable situation if pronouns were changed. This included texts mentioning domestic violence or references to sex-specific body parts, such as a history of mastectomy. Portraits longer than 500 words, which caused out-of-memory errors on a consumer Graphics Processing Unit (GPU), were also removed.

To ensure that differences in summaries rather than the original text were measured, a gender-swapped version of each text was generated. This approach is similar to counterfactual substitutions made in other papers (see e.g. [43, 44]). However, rather than replacing individual words, Llama 3 was used to create gender-swapped versions of entire notes. See Table 1 for examples of such changes. Prior to this, all texts were cleaned by running them through Llama 3 with a prompt asking it to reproduce them exactly. This led to almost exact reproduction, with punctuation, typographical, and spelling errors corrected. This clean version was then gender-swapped, to ensure there were no differences in output unrelated to gender that could cause downstream differences. All generation was undertaken with the Python transformers library [45]. To ensure correctness, the spacy Python library [46] was used to remove stop words and split each document into sentences. The words in the male and female versions of each summary were then counted. Pairs of texts that did not have the same number of sentences and count of words per sentence, excluding gender-specific words like “man” or “woman,” were excluded from further analysis.

Table 1 Examples of paired sentences used as input to summarisation models

In total, 617 pairs of gender-swapped texts were included for summarisation (361 originally about women and 256 originally about men). The individuals had a mean age of 82.5 years (SD 8.5 years), and 69% had their ethnicity recorded as white British.
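
The structural check on gender-swapped pairs described above can be sketched as follows; the gendered-term list is a small illustrative subset, and the spaCy pipeline name is an assumption.

```python
# Sketch of the pair-validation step: both versions must have the same number
# of sentences and the same per-sentence word counts once stop words,
# punctuation and gender-specific terms are excluded.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline
GENDERED = {"man", "woman", "mr", "mrs", "he", "she", "him", "her", "his", "hers"}

def sentence_profile(text):
    """Per-sentence counts of non-stop, non-punctuation, non-gendered tokens."""
    doc = nlp(text)
    return [
        sum(
            1
            for tok in sent
            if not tok.is_stop and not tok.is_punct and tok.lower_ not in GENDERED
        )
        for sent in doc.sents
    ]

def is_valid_pair(female_text, male_text):
    """True if the gender-swapped pair is structurally equivalent."""
    return sentence_profile(female_text) == sentence_profile(male_text)
```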

Selecting sentiment analysis metrics

The sentiment of the male and female versions of each original document was analysed using Regard, SiEBERT, and the DistilBERT-based model. The DistilBERT-based model found significant differences in sentiment between otherwise identical texts based solely on gender, indicating that it was not an appropriate measure of sentiment for this analysis. Therefore, it was excluded from further use. Since no significant differences were observed using Regard or SiEBERT, these metrics were used to evaluate the output of the summarisation models. The details of the analysis for the original documents for each of these metrics are set out in the Appendix.

Generation of summaries

The Hugging Face transformers library [45] was used for all models with Python 3.10.12 [47]. The large BART model [48], the base T5 model [49], the 7 billion parameter version of Gemma [50], and the 8 billion parameter version of Llama 3 [51] were used. Statistical tests and regression analyses were run using R 4.4.0 [52]. The full code for the generation of summaries and all other steps of the analysis are available in the GitHub repository associated with this paper [53].
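
A condensed sketch of the generation step is shown below, using the Hugging Face checkpoints cited above (BART and Llama 3 shown; the same loop was run for T5 and Gemma across six maximum output lengths). The prompt wording and generation settings are assumptions; the paper's exact prompts are in the GitHub repository [53].

```python
# Sketch of summary generation with the cited checkpoints. Prompts and
# generation settings are illustrative; device/dtype handling omitted.
from transformers import pipeline

MAX_TOKENS = [50, 75, 100, 150, 300, None]  # the six output-length settings

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

def summarise_bart(text, max_new_tokens):
    kwargs = {} if max_new_tokens is None else {"max_new_tokens": max_new_tokens}
    return summariser(text, **kwargs)[0]["summary_text"]

# Instruction-tuned models are prompted rather than used as plain summarisers.
# (Gated checkpoint: requires accepting the licence on Hugging Face.)
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def summarise_instruct(text, max_new_tokens=300):
    prompt = f"Summarise the following care record:\n\n{text}\n\nSummary:"  # assumed prompt
    out = generator(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
    return out[0]["generated_text"]
```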

Word frequency analysis

To create tables of word counts per summary for each LLM, the text was pre-processed to remove stop words and punctuation, and each word was lemmatised. This produced a list of unique words across all documents. Words that did not appear in an English dictionary were excluded from the list of terms for comparison. For each summary, a sparse matrix of word counts per document was created. For the LLM-level \(\chi^2\) tests, these were aggregated into total counts per word, per gender.
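
The preprocessing described here might look roughly as follows; the dictionary check (a plain word list) and the use of scikit-learn for the sparse matrix are stand-ins for the paper's implementation.

```python
# Sketch of the word-count preprocessing: lemmatise, drop stop words and
# punctuation, keep dictionary words, and build a sparse document-term matrix.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

def lemmatise(text):
    """Lowercased lemmas of alphabetic, non-stop, non-punctuation tokens."""
    return " ".join(
        tok.lemma_.lower()
        for tok in nlp(text)
        if not tok.is_stop and not tok.is_punct and tok.is_alpha
    )

def build_counts(summaries, dictionary_words):
    """Return a sparse document-term matrix restricted to dictionary words."""
    cleaned = [lemmatise(s) for s in summaries]
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(cleaned)  # sparse word counts per summary
    vocab = vectorizer.get_feature_names_out()
    keep = [i for i, w in enumerate(vocab) if w in dictionary_words]
    return matrix[:, keep], [vocab[i] for i in keep]
```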

Results

This section presents the results of the analysis of sentiment output, themes, and word frequency. These findings indicate that, as expected, the BART and T5 models show some differences in sentiment and word choice based on gender. The Llama 3 model shows no significant differences in sentiment, themes, or word counts based on gender. Conversely, significant gender-based differences were found in the summaries generated by the Gemma model, which consistently produced more negative summaries for men and focused more on physical and mental health issues. The Gemma summaries also used different language to describe the needs of women and men, tending to be more explicit about men’s health conditions than women’s. Examples of this are provided below.

Sentiment output

Table 2 presents the estimates from the mixed effects model. The regression results show a consistent and significant effect of document length on sentiment relative to the reference level (maximum tokens 50). The direction of this effect differs by sentiment metric: Regard indicates that longer summaries become more positive, while SiEBERT judges them as more negative. This highlights the challenge of interpreting sentiment direction, as the correlation between Regard and SiEBERT in this data is 0.09 (95% CI 0.08 - 0.11); word- and theme-level analyses help to interpret these results. Table 2 shows that Regard and SiEBERT find significant effects in opposite directions for being male at the reference level (the BART model). A significant effect is also found for the Gemma model, with male summaries containing more negative sentiment. As the coefficients and \(p\) values in Table 2 are compared with reference levels, which can be challenging to interpret, Table 3 includes the estimated marginal means by gender for each of the models, calculated using the emmeans R package [54]. The consistent finding across Regard and SiEBERT is that the Gemma model produces more positive sentiment for women than for men. Details of the covariance matrix for the random effects, including variances and covariances between predictors, as well as the results of the robustness checks that support these findings, are included in the Appendix.

Table 2 Effect of gender and explanatory variables on sentiment (mixed effects model)
Table 3 Estimated marginal mean effect of gender on sentiment (female - male)

Inclusion bias: comparison of themes

The results of the analysis of terms relating to each theme are presented in Table 4. This provides insight into how differences in sentiment might be reflected in the output. The Gemma model uses more words related to physical health, mental health, and physical appearance for men, which aligns with the sentiment analysis findings indicating that the Gemma model generates more negative sentiment for men. Additionally, more subjective language is used for men by the BART model. No other significant differences were observed. However, this relatively broad-brush approach may obscure variation. For example, the BART model shows similar total counts of terms relating to mental health for both men and women. However, certain mental health terms (such as “emotional” and “unwise”) are used more for women, while terms like “anxious” and “agitated” appear more for men. These word-level differences are examined in the next section.

Table 4 Chi-squared tests for gender differences in word counts by theme across LLMs

Linguistic bias: word frequency analysis

Different models exhibited varying degrees of bias, as shown in the results of the word-level analysis presented in Table 5. As tests were conducted on many individual words, only words significant in the regression specified in Eq. (3) and with an adjusted \(p < 0.05\) in the \(\chi^2\) or Fisher’s exact test are included in the table.

Table 5 Word level differences regression and \(\chi^2\) output

Inclusion bias: BART and T5

Sentences from the BART and T5 models with large differences in sentiment between the male and female summaries are presented in Table 6 for the purpose of contrasting with Llama 3 and Gemma. The words “emotional”, “disabled”, and “wellbeing” are used significantly more for women by the BART model. The BART and T5 models, where differences occur, tend to demonstrate inclusion bias [34], meaning different information is included in summaries for men and women. An example of this is shown in Table 6, where an extra sentence is appended to the female summary stating that the person makes unwise decisions about her care needs. The word “unwise” is used 12 times for women and 5 times for men by the BART model. Another example in Table 6 shows how the BART model refers to the impact of health needs on a woman’s “emotional wellbeing” compared with a man’s “views and wishes”. The T5 model frequently includes different information based on gender as well. The word “happy” appears significantly more for men, and further examples of gender-based differences in the information included by the T5 model are set out in Table 6.

Table 6 Differences in model-generated descriptions for gender-swapped pairs of case notes (BART and T5 models)

Linguistic bias: Gemma

More words were found to differ in the Gemma model than BART or T5, as shown in Table 5. Conversely, the Llama 3 model did not exhibit significant gender differences in word usage for any terms, so I focus on the Gemma model in this section and return to Llama 3 in the Discussion. Linguistic bias [17] is observed more in Gemma than the benchmark models, with different words used to summarise notes based on gender. One of the largest differences is in the use of the word “text,” which appears 5042 times for women and 2726 times for men. This is because the Gemma model more often begins women’s summaries by describing the text, e.g. “The text describes Mrs Smith’s care needs.” Comparable texts about men describe the person, e.g. “Mr Smith has care needs.” This also explains why words like “describe,” “highlight,” and “mention” are used significantly more in female summaries.

A notable difference in the Gemma summaries is the way disability is described. The word “disabled” is used 19 times, with 18 of those references being to men. Similarly, the word “unable” is used significantly more for men than for women (373 vs 276 times), and “status”, “resident”, “unable”, “disable”, “require”, and “receive” are more common in male summaries, reflecting more direct discussion of disability and care services. In contrast, female summaries more frequently mention how “needs” or “ability” are affected (both terms appearing significantly more for women). Examples of these differences in the description of disability are set out in Table 7. Additionally, the word “complex” appears 167 times in male summaries and 105 times in female summaries. Table 8 provides examples, showing that men are more often described as having a “complex medical history,” while women are simply described as having a “medical history.” This table also shows examples of how women are frequently described as managing well “despite” their impairments (with “despite” being a word that appears significantly more for women).

Table 7 Differences in descriptions of disability for gender-swapped pairs (Gemma model)
Table 8 Differences in descriptions of complexity for gender-swapped pairs (Gemma model)

Inclusion bias: Gemma

Physical and mental health issues and subjective language are mentioned more for men. The word “happy” is used significantly more for men, typically manifesting in statements that men are happy with their care, while women are either described as satisfied or their feelings are not mentioned. Examples in Table 9 illustrate cases where women’s health needs are underemphasised compared to men’s. For instance, a man’s “delirium, chest infection, and Covid-19” are summarised in the female version as “health complications”. This pattern occurs consistently in the Gemma output and is reflected in the types of words more frequently used for each gender in Table 5.

Table 9 Inclusion bias: comparison of gender-swapped pairs (Gemma model)

Hallucination

When summaries differ for men and women in terms of specific diagnoses, such as medical terms, it is possible that either one gender’s information is being omitted, or that the model is hallucinating additional information for the other gender. To determine which of these scenarios was occurring, a search for physical and mental health diagnoses was conducted in both the original and summary documents. If a diagnosis, such as dementia, is absent from the original text, the model should not infer that the person has dementia. Across the 617 input documents, with two versions (one male, one female) for each, summarised using 24 sets of parameters (four models, each with six maximum lengths for the output), 54 medical terms were checked, representing 1,599,264 possible opportunities for hallucination. In total, 18 cases of hallucinated medical terms were identified across all models (11 for female subjects and 7 for male subjects). This suggests that the gender differences observed in the Gemma model output are not primarily due to hallucinations, but rather to the omission of specific issues in texts about women.
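
A minimal sketch of this check is shown below; the term list is a small illustrative subset of the 54 terms searched, and simple substring matching stands in for whatever matching the paper's code uses.

```python
# Sketch of the hallucination check: flag medical terms that appear in a
# summary but not in the corresponding original text.
MEDICAL_TERMS = ["dementia", "delirium", "chest infection", "covid"]  # illustrative subset

def hallucinated_terms(original, summary, terms=MEDICAL_TERMS):
    original, summary = original.lower(), summary.lower()
    return [t for t in terms if t in summary and t not in original]

def count_hallucinations(pairs):
    """pairs: iterable of (original_text, summary_text) tuples."""
    return sum(len(hallucinated_terms(o, s)) for o, s in pairs)
```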

Discussion

Key findings

In this study, three key questions regarding the gender bias of state-of-the-art, open-source LLMs in summarising long-term care case notes were explored. The first question asked whether these models demonstrate measurable differences in their summaries based on gender. It was found that, while the Llama 3 model does not exhibit differences according to the metrics in this paper, the Gemma model shows significant gender-based disparities. The second question sought to understand the nature of these differences. Several notable patterns were observed in the Gemma model’s summaries. Sentiment for men tends to be more negative than for women. Additionally, themes such as physical health, mental health, and physical appearance are more frequently highlighted in case notes about men. The language used for men is also more direct. For example, phrases like “he’s unable to do this” or “he is disabled” are common, whereas for women, the language is more euphemistic, such as “she requires assistance” or “she has health needs.”

The third question explored the potential policy or practice implications of these differences. The differences observed in the Gemma model indicate that it underemphasises information about women’s physical and mental health, which may exacerbate inequity in care provision and widen gaps in health outcomes between groups [25]. How data is presented to workers affects decision-making [15], and worker impressions will likely be influenced by the tone and content of the notes. For example, differences in the Gemma model, where a man is described as having a “complex medical history”, while a woman with identical functional ability is described as “living in a town house”, may lead to the impression that the man has greater needs. Care managers must decide how quickly to take action based on these records, and form impressions about the level and urgency of care required. Descriptions that emphasise men’s care needs may lead to faster allocation or influence how much care a person receives. These kinds of differences fall into what Gallegos et al. [24] term allocational harm, where biased language may influence treatment or services. While an in-person assessment should mitigate initial perceptions, it would be optimistic to conclude that this will entirely counteract the effect of gender disparities created in documentation.

Generalisability

This paper demonstrates clear gender-based differences in LLM output, but the findings are grounded in a specific context. As such, they may not automatically apply to all healthcare settings. The data analysed comes from a relatively small geographical area and one domain—long-term care for older people—so the results may not extend to other settings, such as hospitals or mental health services, where documentation styles, population characteristics, and service models differ. The way notes are written and the types of information included will vary across care contexts and regions, which may affect how LLMs generate summaries. Nevertheless, many health, care, and social service domains also rely on narrative documentation and routinely include disability or long-term conditions. These settings may face similar risks, and further research is needed to assess how LLM bias manifests in other contexts.

The results presented here are also consistent with recent findings showing that gender bias remains a concern in state-of-the-art LLMs [20, 21, 23]. In particular, they align with Shan et al. [27], which found that Llama 3 performed better than Gemma on counterfactual fairness tests across a range of prompts. However, other work (e.g. Zhang et al. [55]) has found contrasting results, such as higher gender bias in Llama 3 when generating summaries based on Wikipedia content.

Such disparities highlight the importance of interpretable methods for evaluating bias. Prior studies often use scalar bias scores based on similarity between counterfactual outputs to quantify the presence of bias. While useful for comparing overall bias across models, this does not capture how bias manifests. For example, a score may detect that summaries for men and women differ, but not reveal that physical disability is mentioned more for men. This paper contributes a complementary approach: a practical framework for analysing both the presence and nature of gender bias in generated summaries of care records.

The methodological approach used in this paper is generalisable. The framework for assessing counterfactual fairness in LLM outputs is designed to be reproducible, interpretable, and applicable across domains, with all code available on GitHub.

Limitations

Several limitations must be considered when interpreting these results. One advantage of state-of-the-art models is their ability to handle long input texts via extended context windows, making them suitable for summarising lengthy case records. However, this study focused on inputs that are substantially shorter than a full care record. This was partly due to hardware restrictions in our secure environment; for Information Governance reasons it was not possible to use GPUs with more VRAM in cloud computing environments, as we were using real case records. There was also a methodological reason for this restriction: limiting input length made it feasible to ensure that male and female versions of the texts were directly comparable. In longer texts, there is a higher likelihood that gender-specific references (such as sex-specific conditions or experiences like domestic violence) might be included, which would not translate cleanly to a gender-swapped version, limiting the ability to assess counterfactual fairness. Using shorter inputs allowed more consistent and interpretable comparisons between genders, but this inevitably constrains the generalisability of the findings to longer and more complex documents. It is possible that different results could be obtained when longer texts are used, although there is no compelling reason to assume that the gender-specific language generated by models would meaningfully differ solely due to input length.

Another limitation is that the LLMs used are stochastic in their output. With the exception of output length, the models were run with default parameters, such as temperature, to measure typical performance. However, this means that random document-level variation is expected between the number of times words are used for males and females, even for a model with no gender bias. Re-running the code does not yield identical summaries. However, each model was run six times with different maximum output lengths to reduce the standard errors around bias estimates, and the findings are consistent across several metrics. Robustness checks, detailed in the Appendix, consistently yielded the same results. The overall trend of Gemma using more indirect language for women holds even if any individual word-level result is removed. Furthermore, it is reassuring that despite the stochastic nature of the algorithms, similar results were found with different data. As the real administrative data could not be shared, LLMs were used to generate around 400 synthetic case notes, included in this paper’s GitHub repository [53]. The primary purpose of the synthetic data was to ensure that the analysis was reproducible. However, the findings from the synthetic data were found to be consistent with those using the real data. Significant gender-based differences were observed in the summaries generated by the Google Gemma model, with physical and mental health mentioned significantly more in male summaries. Many of the same narrative-type words, such as “text,” “emphasise,” and “describe,” appeared more for women than men, while words relating to needs, such as “require,” “necessitate,” “assistance,” and “old,” appeared more for men. The synthetic data results also show no significant gender-based differences in the Llama 3 model output.

Perhaps a more concerning limitation of the stochastic nature of model output is the difficulty in balancing Type I and Type II errors. With statistical tests performed for thousands of words, some unlikely events are inevitable. Caution was exercised by adjusting the \(p\)-values (using the Benjamini-Hochberg method), but this means that some words with very small unadjusted \(p\)-values were rejected. It is possible that some meaningful differences between words on the basis of gender were not considered statistically significant due to this conservatism.

A related point is that meaningful differences will not necessarily generate statistical significance. For instance, in the BART model, the word “unwise” appears 12 times for women and 5 times for men, which is not statistically significant according to a \(\chi^2\) test or Fisher’s exact test. However, even a single instance stating that a woman is making unwise decisions, where an otherwise identical man is not described this way, could make a practical difference to a care professional acting upon it.

An additional limitation is that pre-trained sentiment analysis models not trained on health and care data were used. SiEBERT is a transfer learning model built on RoBERTa [37] and fine-tuned on a diverse range of data, including reviews and tweets [36]. Regard is based on BERT and fine-tuned on data designed to evaluate demographic bias [38]. Ideally, domain-specific sentiment analysis models would be used, but such models are not currently available, and constructing them would require subjective judgments about how different conditions or care needs relate to positive or negative sentiment. Future research could benefit from the development of domain-specific models, but the current approach provides meaningful exploration of these differences within the available framework. Despite this limitation, sentiment analysis remains useful for identifying that some measurable difference exists between summaries. The interpretation of these differences becomes clearer through the accompanying analysis of themes and word usage.

Thematic analysis, for example, clearly shows when certain domains (e.g. physical health) are included more often for one gender, as we see in the case of Gemma. Word frequency analysis helps drill down into specific patterns of language. For example, Gemma’s greater use of words such as “text”, “describe”, and “highlight” in summaries for women indicates that it tends to describe the text itself rather than the person receiving care. Together, these methods are specific and interpretable—when they detect a difference, this indicates a meaningful instance of gender bias. However, they may not be sensitive to all types of bias. Subtler forms of framing, tone, or discourse structure may go undetected using these techniques. Pfohl et al. [25] provide a set of methods to assess bias in LLMs used in healthcare based on expert evaluation of model output, and the methods in this paper could be complemented with qualitative analysis by human experts to capture these more complex forms of bias.

Finally, cases relating to gender-specific care, such as mastectomies, and those mentioning domestic violence were removed, as they do not fit easily into the counterfactual fairness framework. However, in some cases, gender is salient and output should legitimately differ based on gender or other protected characteristics [43, 56]. The way language models treat gender-specific circumstances remains an important policy question that should be explored in future work.

Conclusion

LLM summarisation models are being used in health and care settings to generate and summarise documentation [1,2,3]. In this study, notable variation in gender-based discrepancies was observed across summarisation LLMs. Llama 3 showed no gender-based differences across any metrics, T5 and BART demonstrated some variation, and the Gemma model exhibited the most significant gender-based disparities. Gemma’s male summaries were generally more negative in sentiment, and certain themes, such as physical health and mental health, were more frequently highlighted for men. The language used by Gemma for men was often more direct, while more euphemistic language was used for women. In the Gemma summaries, women’s health issues appeared less severe than men’s and details of women’s needs were sometimes omitted. Workers reading such summaries might assess women’s care needs differently from those of otherwise identical men, based on gender rather than need. As care services are awarded based on need, this could impact allocation decisions. While gendered language can be appropriate in contexts where gender is relevant, the differences in Gemma’s output suggest that, in many instances, these differences are undesirable.

While this study provides evidence of gender bias in LLM-generated summaries for long-term care, the findings are based on one specific domain and dataset. Further research is needed to assess whether similar patterns arise in other health and care settings, such as hospitals or mental health services, where documentation styles and service models may differ. Given the findings in this paper, future research in other health and care contexts where narrative documentation is used is an important priority. The methodological framework developed can be applied to any dataset of free-text case records to evaluate bias in model outputs.

As generative models become more widely used for creating documentation, any bias within these models risks becoming part of official records. However, LLMs should not be dismissed as a solution to administrative burden. In this study, there were differences in bias across LLMs. This variation suggests that, if regulators wish to prioritise algorithmic fairness, they should mandate the measurement of bias in LLMs used in long-term care. Practical methods for evaluating gender bias in LLMs have been outlined in this paper, which can be implemented by anyone with access to long-term care data. The code for these evaluations is available on GitHub [53]. It is recommended that these or similar metrics be applied to assess bias across gender, ethnicity, and other legally protected characteristics in LLMs integrated into long-term care systems. By doing so, the benefits of LLMs can be realised while mitigating the risks associated with bias.

Data availability

The data that support the findings of this study are individual-level, administrative care records. This is identifiable human data and restrictions apply to the availability of these data, which were used under licence for the current study, and so are not publicly available. It is not possible to share this data publicly as individual privacy could be compromised. Metadata are however available from the authors upon reasonable request. Synthetic data is provided in the GitHub repository [53] so that the methods provided can be reproduced using the code provided. The findings from the synthetic data are consistent with the findings from the real data.

Abbreviations

AI:

Artificial intelligence

ANOVA:

Analysis of variance

BART:

Bidirectional auto-regressive transformers

BERT:

Bidirectional encoder representations from transformers

CAG:

Confidentiality advisory group

CI:

Confidence interval

EU:

European Union

GEE:

Generalised estimating equations

GPU:

Graphics processing unit

HHS:

Health and human services

LLM:

Large language model

NHS:

National health service

SD:

Standard deviation

UK:

United Kingdom

US:

United States

USD:

US Dollars

References

  1. Local Government Association. Artificial intelligence use cases, 2024. https://web.archive.org/web/20240904192138/https://www.local.gov.uk/our-support/cyber-digital-and-technology/artificial-intelligence-hub/artificial-intelligence-use. [Accessed: 2024-09-04].

  2. Local Government: State of the Sector: AI Research Report. Technical report, local government association. 2024. https://web.archive.org/web/20240906174435/https://www.local.gov.uk/sites/default/files/documents/Local%20Government%20State%20of%20the%20Sector%20AI%20Research%20Report%202024%20-%20UPDATED_3.pdf. [Accessed: 2024-09-06].

  3. Google Cloud. MedLM: Generative AI fine-tuned for the healthcare industry, 2024. https://web.archive.org/web/20240804062023/https://cloud.google.com/blog/topics/healthcare-life-sciences/introducing-medlm-for-the-healthcare-industry. [Accessed: 2024-09-01].

  4. Lillis T, Leedham M, Twiner A. Time, the written record, and professional practice: the case of contemporary social work. Writ Commun. 2020;37:431–86. 10.1177/0741088320938804.


  5. Arndt BG, Beasley JW, Watkinson MD, Temte JL, Tuan W-J, Sinsky CA, Gilchrist VJ. Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. Ann Fam Med. 2017;15(5):419–26.


  6. Miller E, Barrie K. Setting the bar for social work in Scotland. 2022.

  7. Preston-Shoot M, Braye S, Preston O, Allen K, Spreadbury K. Analysis of safeguarding adult reviews April 2017–March 2019: findings for sector-led improvement. Local government association. 2020. https://www.local.gov.uk/publications/analysis-safeguarding-adult-reviewsapril-2017-march-2019.

  8. Biden JR. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. 2023.

  9. HM Treasury. Spring budget. 2024. https://www.gov.uk/government/publications/spring-budget-2024/spring-budget-2024-html. [Accessed: 2024-07-25].

  10. European Commission. Regulation of the European Parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence Act) and amending certain union legislative acts: Annexes. 2024.


  11. EU AI act: First regulation on artificial intelligence, June 2023. https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence. Last updated: 18-06-2024- 16:29, Accessed: 2024-09-23.

  12. Van Veen D, Van Uden C, Blankemeier L, Delbrouck J-B, Aali A, Bluethgen C, Pareek A, Polacin M, Reis EP, Seehofnerová A, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024;30(4):1134–42.


  13. Ayaad O, Alloubani A, ALhajaa EA, Farhan M, Abuseif S, Hroub AA, Akhu-Zaheya L. The role of electronic medical records in improving the quality of health care services: comparative study. Int J Med Inf. 2019;127:63–67.


  14. Zegers M, de Bruijne MC, Spreeuwenberg P, Wagner C, Groenewegen PP, van der Wal G. Quality of patient record keeping: an indicator of the quality of care? BMJ Qual Saf. 2011;20(4):314–18.


  15. Ahmed A, Chandra S, Herasevich V, Gajic O, Pickering BW. The effect of two different electronic health record user interfaces on intensive care provider task load, errors of cognition, and performance. Crit Care Med. 2011;39(7):1626–34. https://doi.org/10.1097/CCM.0b013e31821858a0.


  16. Bolukbasi T, Chang K-W, Zou JY, Saligrama V, Kalai AT. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Adv Neural Inf Process Syst. 2016;29.

  17. Caliskan A, Bryson JJ, Narayanan A. Semantics derived automatically from language corpora contain human-like biases. Science. 2017;356(6334):183–86.


  18. Goodman KE, Paul HY, Morgan DJ. AI-Generated clinical summaries require more than accuracy. JAMA. 2024.

  19. Savoldi B, Gaido M, Bentivogli L, Negri M, Turchi M. Gender bias in machine translation. Trans Assoc Comput Linguist. 2021;9:845–74.


  20. Armstrong L, Liu A, MacNeil S, Metaxa D. The silicon ceiling: auditing GPT’s race and gender biases in hiring. Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. 2024:1–18.

  21. Chen E, Zhan R-J, Lin Y-B, Chen H-H. From structured prompts to open narratives: measuring gender bias in LLMs through open-ended storytelling. Preprint arXiv:2503.15904, 2025.

  22. Kotek H, Dockum R, Sun D. Gender bias and stereotypes in large language models. Proceedings of the ACM collective intelligence conference. 2023:12–24.

  23. Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, Jurafsky D, Szolovits P, Bates DW, Abdulnour R-E-E, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: A model evaluation study. The Lancet Digit Health. 2024;6(1):e12–22.


  24. Gallegos IO, Rossi RA, Barrow J, Tanjim MM, Kim S, Dernoncourt F, Tong Y, Zhang R, Ahmed NK. Bias and fairness in large language models: a survey. Comput Linguist. 2024;50(3):1097–179.


  25. Pfohl SR, Cole-Lewis H, Sayres R, Neal D, Asiedu M, Dieng A, Tomasev N, Rashid QM, Azizi S, Rostamzadeh N, et al. A toolbox for surfacing health equity harms and biases in large language models. Nat Med. 2024;1–11.

  26. Nemani P, Joel YD, Vijay P, Liza FF. Gender bias in transformers: a comprehensive review of detection and mitigation strategies. Nat Lang Process J. 2024;6:100047.


  27. Shan Z, Diana ER, Zhou J. Gender Inclusivity Fairness Index (GIFI): a multilevel framework for evaluating gender diversity in large language models. Preprint arXiv:2506.15568, 2025.

  28. Meta AI. Introducing Meta Llama 3: the most capable openly available LLM to date, 2024a. https://ai.meta.com/blog/meta-llama-3/. [Accessed: 2024-07-14].

  29. Team G, Mesnard T, Hardin C, Dadashi R, Bhupatiraju S, Pathak S, Sifre L, Rivière M, Kale MS, Love J, et al. Gemma: Open models based on gemini research and technology. Preprint arXiv:2403.08295, 2024.

  30. Roberts A, Raffel C, Lee K, Matena M, Shazeer N, Liu PJ, Narang S, Wei L, Zhou Y. Exploring the limits of transfer learning with a unified text-to-text transformer. Google, Tech. Rep., 2019.

  31. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L. Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Preprint arXiv:1910.13461, 2019.

  32. Sharma S, Dey M, Sinha K. Evaluating gender bias in natural language inference. arXiv preprint arXiv:2105.05541, 2021.

  33. Katsarou S, Rodríguez-Gálvez B, Shanahan J. Measuring gender bias in contextualized embeddings. In computer sciences and mathematics Forum. Vol. 3. MDPI; 2022:3.


  34. Steen J, Markert K. Investigating gender bias in news summarization. Preprint arXiv:2309.08047, 2023.

  35. Kusner MJ, Loftus J, Russell C, Silva R. Counterfactual fairness. Adv Neural Inf Process Syst. 2017;30.

  36. Hartmann J, Heitmann M, Siebert C, Schamp C. More than a feeling: accuracy and application of sentiment analysis. Int J Res in Mark. 2023;40(1):75–87.


  37. Liu Y, Ott M, Goyal N, Jingfei D, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. Roberta: a robustly optimized bert pretraining approach. Preprint arXiv:1907.11692, 2019.

  38. Sheng E, Chang K-W, Natarajan P, Peng N. The woman worked as a babysitter: on biases in language generation. arXiv preprint arXiv:1909.01326, 2019.

  39. Lik Xun Yuan. Distilbert-base-multilingual-cased-sentiments-student (revision 2e33845). 2023. https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student.

  40. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS \(EMC^2\) Workshop; 2019.

  41. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 1995;57(1):289–300.


  42. Ioannis Kosmidis. Brglm2: Bias reduction in generalized linear models. 2023. https://CRAN.R-project.org/package=brglm2. R package version 0.9.2.

  43. Prabhakaran V, Hutchinson B, Mitchell M. Perturbation sensitivity analysis to detect unintended model biases. Preprint arXiv:1910.04210, 2019.

  44. Garg S, Perot V, Limtiaco N, Taly A, Chi EH, Beutel A. Counterfactual fairness in text classification through robustness. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 2019:219–26.

  45. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al. Huggingface’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

  46. Honnibal M, Montani I. spaCy 3: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. 2021. https://spacy.io/.

  47. Van Rossum G, Drake FL. Python 3 reference manual. CreateSpace, Scotts Valley, CA; 2009. ISBN 1441412697.


  48. Yuan LXBLC. 2024. https://huggingface.co/facebook/bart-large-cnn. [Accessed: 2024-07-25].

  49. Google. T5 base, 2024a. https://huggingface.co/google-t5/t5-base. [Accessed: 2024-07-25].

  50. Google. Gemma 7b-it, 2024b. https://huggingface.co/google/gemma-7b-it. [Accessed: 2024-07-25].

  51. Meta AI. Llama3-8b-Instruct. 2024. https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct. [Accessed: 2024-07-25].

  52. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. https://www.R-project.org/.

  53. Sam Rickman. Evaluating gender bias in LLMs in long-term care. 2024. https://github.com/samrickman/evaluate-llm-gender-bias-ltc. [Accessed: 2024-08-11].

  54. Lenth RV. Emmeans: estimated marginal means, aka least-squares means. 2024. https://rvlenth.github.io/emmeans/. R package version 1.10.2.

  55. Zhang T, Ladhak F, Durmus E, Liang P, McKeown K, Hashimoto TB. Benchmarking large language models for news summarization. Trans Assoc Comput Linguist. 2024;12:39–57.


  56. Dinan E, Fan A, Ledell W, Weston J, Kiela D, Williams A. Multi-dimensional gender bias classification. arXiv preprint arXiv:2005.00614, 2020.


Acknowledgements

I would like to extend my gratitude to Jose-Luis Fernandez and Juliette Malley, for their insightful feedback after our discussion of the preliminary results. I am also grateful to Uche Osuagwu for his dedication to managing data extraction and quality, and William Wood and the Intelligence Solutions for London team for their vital contributions to Information Governance.

Funding

This paper is based on independent research funded through the NIHR Policy Research Unit in Adult Social Care, reference PR-PRU-1217–21101. Funding was also received from the UK National Institute of Health and Care Research (NIHR) Applied Research Collaboration (ARC) North Thames under grant number NIHR200163. An additional grant was received from the NHS Digital Social Care Pathfinders initiative under the contract 8717. The views expressed are those of the author and not necessarily those of the NIHR, ARC, NHS, or the Department of Health and Social Care.

Author information

Authors and Affiliations

Authors

Contributions

All analysis was undertaken by the sole author of the paper.

Corresponding author

Correspondence to Sam Rickman.

Ethics declarations

Ethics approval and consent to participate

This study uses secondary data from administrative records, which were pseudonymised prior to egress to remove identifiable personal information (e.g., names, addresses, NHS numbers, and other unique identifiers). According to the UK General Data Protection Regulation (GDPR), processing of these data was conducted under the legal basis of legitimate interests, which does not require individual opt-in consent. This study was conducted in accordance with the principles of the Declaration of Helsinki. It involved the use of secondary data only, with no direct contact with participants. The data were pseudonymised prior to access and processed in line with established ethical standards for research using routinely collected health and social care records. Individual informed consent was not required, as the project involved no automated decision-making and used pseudonymised data throughout. Ethics approval for the project was granted by the LSE Personal Social Services Research Unit’s ethics committee on 30th May 2019, in compliance with the LSE’s Research Ethics Policy. A Data Processing Impact Assessment was completed, and the details of the project were made publicly available via a Privacy Notice on the local authority’s website, with local opt-out options provided. Approval was also granted by the NHS Confidentiality Advisory Group (CAG) in June 2020 (reference number 20/CAG/0043), with annual renewal.

Consent for publication

Not applicable. This study does not include any individual-level identifying images, names, addresses, locations, or other information that could compromise participant anonymity. All data used in the study were pseudonymised prior to access, and no direct contact with participants occurred.

Competing interests

The author declares no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

12911_2025_3118_MOESM1_ESM.zip

Supplementary Material 1: Three appendices are included: (1) Evaluation of sentiment metrics: establishing which sentiment metrics are appropriate for conducting this analysis. (2) Model diagnostics and robustness checks: verifying the robustness of conclusions using several other methods. (3) Evaluation of themes: full lists of words counted in the frequency of the words appearing in each theme. The code to reproduce this analysis is available with synthetic data in the GitHub repository [53]

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Rickman, S. Evaluating gender bias in large language models in long-term care. BMC Med Inform Decis Mak 25, 274 (2025). https://doi.org/10.1186/s12911-025-03118-0

