Systematic Bias in a Randomized Trial of Hydroxychloroquine as Post-exposure Prophylaxis for COVID-19

Co-author: Karl Peace, Ph.D., ASA Fellow, Jiann-Ping Hsu College of Public Health, Georgia Southern University

Abstract

On June 3, 2020, NEJM published the results of a randomized trial sponsored by the University of Minnesota, which concludes in the abstract that “After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19 or confirmed infection when used as postexposure prophylaxis within 4 days after exposure.” However, the study is systematically biased by design and planning through 1) an incorrect study design and analysis, 2) a flawed sample size adjustment procedure, 3) an incorrect futility analysis using conditional power at the current trend, and 4) selection bias in the patients entered and overstated conclusions. Owing to these design and analysis deficiencies, issues in the reporting of study results, and the lack of disclosure of all necessary information, it is highly questionable that the trial meets the substantial evidence standard of adequate and well-controlled investigations. For ongoing clinical trials of hydroxychloroquine, it is necessary to ensure that systematic bias in the study design and analysis is removed, and that trial organizers and DSMB members are competent to understand the various issues relating to the design and analysis.

Introduction

The University of Minnesota sponsored a randomized, double-blind, placebo-controlled trial of hydroxychloroquine as postexposure prophylaxis for Covid-19 that started on March 17, 2020 and was halted on May 6, 2020 at the third interim analysis, based on a futility analysis showing a conditional power of less than 1%. Upon publication of the study results on June 3, 2020 in the New England Journal of Medicine (NEJM), the study conclusion in the abstract was widely reported by various media outlets without basic vetting of the study design and analysis, its scientific validity, or the reporting of the study results.

A basic review of the study protocol quickly reveals systematic bias in the design and analysis, failure to report all results for the primary study objectives, and scientifically flawed, overstated study conclusions. There are various other methodological issues with the statistical analysis.

Study Design

The study objective is to determine if post-exposure prophylaxis with hydroxychloroquine is effective in preventing COVID-19 disease or ameliorating disease severity. The co-primary endpoints are

  1. Incidence of COVID-19 disease within 14 days (among those asymptomatic at baseline), and
  2. Ordinal scale of COVID-19 disease severity of i) No illness, ii) Illness with outpatient observation, iii) Hospitalization (or post-hospital discharge), and iv) Hospitalization with ICU stay or death

The study target population is adult household contacts or healthcare workers exposed to persons with COVID-19 disease. Permuted block randomization with 1:1 randomization ratio is used to randomize patients to receive treatment with hydroxychloroquine or placebo. 

The study follows a group sequential design with symmetric stopping boundaries for superior or inferior efficacy of hydroxychloroquine compared to placebo at a two-sided Type 1 error rate of 0.05. A Lan-DeMets error-spending function is used to approximate the O’Brien-Fleming stopping boundaries for the disease severity endpoint. The total sample size of 1,500 patients was planned based on 90% power to detect a 50% relative risk reduction for both co-primary endpoints and a 20% dropout rate. Three interim analyses with approximately 375, 750, and 1,125 patients, and a final analysis with 1,500 patients, were planned.
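As a rough sanity check on these design numbers, the planned total can be approximated with the standard unpooled two-proportion sample size formula. This is a sketch only; the protocol's exact method (e.g., pooled variance or a continuity correction) is not specified here, and the helper functions below are illustrative.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p: float) -> float:
    """Standard normal quantile by bisection (ample precision here)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def n_per_group(p_placebo, p_treat, alpha=0.05, power=0.90):
    """Per-group sample size for two proportions (unpooled normal approximation)."""
    za, zb = norm_ppf(1 - alpha / 2), norm_ppf(power)
    var = p_placebo * (1 - p_placebo) + p_treat * (1 - p_treat)
    return math.ceil((za + zb) ** 2 * var / (p_placebo - p_treat) ** 2)

# Protocol assumptions: 10% placebo incidence, 50% relative risk reduction.
n = n_per_group(0.10, 0.05)       # per-group size before dropout
total = math.ceil(2 * n / 0.80)   # inflate the total for 20% dropout
```

With a 10% placebo rate, a 50% relative reduction, 90% power, and 20% dropout inflation, this gives roughly 1,445 patients in total, broadly consistent with the planned 1,500.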

The first interim analysis was used to assess trends in safety and efficacy for the DSMB to determine if more frequent reviews were necessary. If the DSMB determined that more interim analyses were needed, the Lan-DeMets error spending function was to be used to recalculate approximate O’Brien-Fleming boundaries. 

At the second interim analysis, a sample size adjustment was planned based on a 50% relative reduction in transmission risk. The planned sample size of 1,500 assumed a 10% transmission rate for the placebo group, based on the March 6, 2020 CDC report of a symptomatic secondary attack rate of 10.5% (95% CI = 2.9%–31.4%) among household members. The sample size adjustment allowed a four-fold increase of the sample size to 3,000 per treatment group if the incidence rate for the placebo group was only 5%.

The second interim analysis was also used for futility analysis to stop the trial early if the conditional power was less than 20%.  

Issues of the Study Design and Analysis

Type 1 Error Rate of Group Sequential Design and Inference

Lan-DeMets error-spending functions of the information fraction provide flexibility for determining the number and timing of interim analyses based on factors not related to unblinded interim data, as long as the maximum information (e.g., total number of events or sample size) is fixed. The use of the information fraction, however, limits the error-spending approach to a fixed maximum. For the current trial, the sample size of 1,500 patients was based on an effect size of a 50% reduction in the infection rate, assuming a 10% infection rate for the placebo group. When a sample size adjustment is planned, the maximum sample size is no longer fixed. For group sequential designs with sample size adjustment, Cui et al. (1999) proposed a procedure by which the information (e.g., sample size or number of events) can be adaptively modified. Müller and Schäfer (2001) proposed a method by which the remaining trial can be redesigned in a number of different ways, including changing the number or timing of the interim analyses, changing the information, or even changing the significance and futility boundaries.
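The Type 1 error problem that these methods address can be seen in a small simulation. The sketch below uses a hypothetical adaptation rule, not the trial's actual procedure: stage 2 is cut when stage 1 looks promising and expanded otherwise, mimicking in spirit a data-driven sample size reduction. Naively pooling the stages with the realized sample sizes inflates the Type 1 error, while the Cui–Hung–Wang (CHW) weighted statistic, which fixes the stage weights at their planned values, preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim, n1, n2_plan = 200_000, 100, 100
z_crit = 1.959964  # two-sided 0.05

# Under H0 the stage-wise standardized statistics are N(0,1)
# regardless of the realized stage size.
z1 = rng.standard_normal(n_sim)
z2 = rng.standard_normal(n_sim)

# Hypothetical adaptation rule: shrink stage 2 when stage 1 looks
# promising, expand it otherwise.
n2 = np.where(z1 > 1.0, 10, 400)

# Naive pooled statistic weights the stages by realized sample sizes.
z_naive = (np.sqrt(n1) * z1 + np.sqrt(n2) * z2) / np.sqrt(n1 + n2)

# CHW statistic fixes the weights at the *planned* stage sizes,
# so it remains N(0,1) under H0 whatever n2 turns out to be.
w1 = np.sqrt(n1 / (n1 + n2_plan))
w2 = np.sqrt(n2_plan / (n1 + n2_plan))
z_chw = w1 * z1 + w2 * z2

naive_rate = np.mean(np.abs(z_naive) > z_crit)
chw_rate = np.mean(np.abs(z_chw) > z_crit)
```

With this rule the naive two-sided rejection rate comes out noticeably above the nominal 0.05, while the CHW rate stays at the nominal level; the exact amount of inflation depends on the adaptation rule.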

Liu et al. (2012) developed a general conditional theory for the adaptive error-spending approach and provided a rigorous proof that it is possible to make unplanned design changes to an ongoing trial that starts with a fixed information design. The adaptive error-spending approach is fundamentally different from the error-spending approach of Lan and DeMets (1983) because neither an error-spending function nor a maximum information level needs to be specified in advance. For the adaptive error-spending approach, the interim information levels and the final information level are allowed to be random, and the information fractions used for the error-spending function are not required to be proportional to the information levels. As the adaptive error-spending approach contains the error-spending approach of Lan and DeMets (1983) as a special case, the validity of the latter is finally established under the assumptions of the conditional group sequential theory. They also illustrate that a blinded data review process is necessary for design modifications, without which the Type 1 error rate can be seriously inflated.

The study design and analysis fail to address this issue of Type 1 error rate inflation, and the study results are therefore of questionable validity.

No attempt was made in the study protocol to address the inferential issues of bias in naïve p-values and in point and confidence interval estimates following a group sequential design with sample size adjustment.

Sample Size Adjustment

A sample size adjustment was planned based on a 50% relative reduction in incidence rate. The planned sample size of 1,500 assumed a 10% incidence rate for the placebo group. The sample size adjustment allows a four-fold increase of the sample size to 3,000 per treatment group if the incidence rate for the placebo group is only 5%. This implies that an effect size of 2.5%, defined as the difference in incidence rates between the treatment and placebo groups, would be clinically meaningful for the study design. The issue with the 50% relative reduction criterion is that when the placebo incidence rate is higher, say 20%, absolute effect sizes of 2.5%, 5%, 7.5%, or even 9.99% would no longer count as clinically meaningful under the design, when in fact a higher COVID-19 infection rate is a more serious problem for public health, society, and health care costs.

At the second interim analysis on April 22, 2020, the sample size was reduced to 956 patients. The publication does not disclose any interim results submitted to the DSMB; neither the sample size nor the observed placebo incidence rate that led to the reduced sample size of 956 is available for examination. However, the final analysis of 821 patients is close to the planned sample size of 750 patients. Based on the observed incidence rate of 14.3%, a 50% relative reduction would correspond to an effect size of 7.15%, nearly 3 times larger than the 2.5% minimum effect size for the less serious scenario of a 5% placebo incidence rate. At the final analysis, the observed incidence rate for the treatment group is 11.8%, which coincidentally leads to an observed effect size of 2.5% (14.3% − 11.8%), identical to the 2.5% minimum effect size specified in the study protocol. Yet with the placebo incidence rate of 14.3%, higher than the 10% rate used in the sample size calculation, the sample size required to detect this 2.5% difference would be larger than even the four-fold increase to 3,000 patients per treatment group.
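To put the observed numbers in perspective, the standard unpooled two-proportion formula shows what it would take to detect the observed 2.5% absolute difference at the observed rates with 90% power. This is a sketch; the protocol's exact sample size method is not disclosed, and the helper functions are illustrative.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p: float) -> float:
    """Standard normal quantile by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def n_per_group(p1, p2, alpha=0.05, power=0.90):
    """Per-group sample size for two proportions (unpooled normal approximation)."""
    za, zb = norm_ppf(1 - alpha / 2), norm_ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((za + zb) ** 2 * var / (p1 - p2) ** 2)

# Observed rates at the final analysis: placebo 14.3%, treatment 11.8%,
# an absolute difference of 2.5%.
n_observed_effect = n_per_group(0.143, 0.118)
```

The result is roughly 3,800 patients per group, which indeed exceeds the four-fold ceiling of 3,000 per group allowed by the adjustment procedure.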

Conditional Power at the Current Trend

The protocol states that “If the conditional power is <20% at the time of the second interim analysis with approximately 50% of participants enrolled, discontinuation should be considered as a possible recommendation by the DSMB.” The trial was stopped for futility on the basis that the conditional power was less than 1%. However, the protocol does not provide any reference for the definition of conditional power. It is not clear whether the conditional power defined by Jennison and Turnbull (2000) or the conditional power at the current trend (Proschan, Lan and Wittes, 2006) was used.

Correct calculation of conditional power is critical for futility analysis. Proschan, Lan and Wittes (2006) showed that the conditional power at the current trend can be substantially smaller than the conditional power at the originally planned effect size. Liu and Chi (2010) calculated the conditional power at the nonbinding futility boundary derived from the Kim–DeMets error-spending function. They evaluated the conditional power at the specified minimum effect size from which the futility boundary is derived. They also evaluated the conditional power at the current trend, which is simply the observed effect size given by the futility boundary value divided by the square root of the information. They discovered that conditional power at the current trend was consistently biased against the alternative hypotheses of interest. For example, consider a group sequential design with 80% power where the Kim–DeMets error-spending function with shape parameter 0.75 is used to calculate a justifiably aggressive nonbinding futility boundary. At 20% of the maximum information, the Type 2 error rate (or β spent) is 5.9% for prematurely terminating the trial if the alternative hypothesis at the minimum effect size is true; the Type 1 error rate (or futility level) is 44.9% for continuing the trial if the null hypothesis is true. These probability measures, which are easy to understand, represent meaningful and carefully thought-out criteria for stopping the trial for futility. At the futility boundary, the conditional power evaluated at the minimum effect size is 51.2%, which corroborates the aggressive choice of the futility boundary. However, the conditional power at the current trend is less than 0.3%, which is grossly biased. By any standard, one would stop a trial if the probability of achieving statistical significance were less than 1%, yet here that would be a decision with a high probability of Type 2 error.
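The size of this bias is easy to reproduce with the commonly cited single-look conditional power formula (a sketch with hypothetical interim numbers; as noted later in this section, the simple formula itself ignores the group sequential structure):

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def conditional_power(z_t, t, drift, z_crit=1.959964):
    """Single-look conditional power: P(Z(1) > z_crit | Z(t) = z_t),
    assuming standardized drift `drift` for the remainder of the trial."""
    b = z_t * math.sqrt(t)                   # B-value at information time t
    num = b + drift * (1.0 - t) - z_crit
    return norm_cdf(num / math.sqrt(1.0 - t))

# Hypothetical interim look: halfway (t = 0.5) with observed z = 0.5.
z_t, t = 0.5, 0.5
drift_design = 1.959964 + 1.281552   # drift giving 90% power at full information
drift_trend = z_t / math.sqrt(t)     # "current trend" drift estimate

cp_design = conditional_power(z_t, t, drift_design)
cp_trend = conditional_power(z_t, t, drift_trend)
```

For the same interim data, conditional power is about 51% at the design drift but under 4% at the current trend, illustrating how the at-trend calculation is biased against the alternative of interest.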

The issues with conditional power at the current trend, as a form of stochastic curtailment, are well documented in the literature. Armitage (1989) pointed out that

“It is perhaps worth mentioning the extensive literature on stochastic curtailment and Bayesian prediction which bears on this (Lan et al., 1982; Spiegelhalter and Freedman, 1988). The idea is that we should stop early if it can be predicted that the final result will, in some sense, not strongly contradict the null hypothesis. Much of the stochastic curtailment literature runs into difficulties over the choice of hypothesis on which to make the prediction, and I prefer the Bayesian approach. However, I have two reservations about the whole approach—one is that early stopping should be justified by the current inference, rather than by prediction of what might happen after some future random events, and secondly, in trials where collection of reliable information takes precedence over the need to select one or other treatment it will usually be foolish to stop unnecessarily early and to waste information. Even if the final difference is non-significant the trial may contribute usefully to knowledge, particularly if combined in an overview with data from other similar trials (p. 335).”

Jennison and Turnbull (2000, p. 219) stated that

“The conditional power calculations are also non-standard in that they do not refer to a single, well-defined reference test. In order to avoid such difficulties, it is wise to define a study protocol as unambiguously as possible at the outset. If this is done thoroughly and an interim analysis schedule is also defined, the full range of group sequential tests are available for use and one of these tests may be preferred to stochastic curtailment.”

Tweel and Noord (2003) analyzed two data sets to illustrate various difficulties with the use of conditional power for stochastic curtailment. They conclude that group sequential analyses have several advantages over stochastic curtailment and recommend that

“more studies should consider a sequential design and analysis to enable early stopping when enough evidence has accumulated to conclude a lack of the expected effect”

We note that for group sequential designs, the conditional power calculation often cited in the literature is incorrect, as it does not take the group sequential test into consideration. The correct calculation is given in Müller and Schäfer (2001).

A dramatic demonstration of the inappropriateness of conditional power for futility is its application in stopping Biogen’s Alzheimer’s drug development. Early in 2019, Biogen and partner Eisai terminated their phase 3 trials of aducanumab after the futility criterion of a conditional power of less than 20% was met at an interim analysis. However, on October 22, 2019, Biogen announced plans to pursue regulatory approval for aducanumab and subsequently stated that “The result of the futility analysis is incorrect.”

The publication of the hydroxychloroquine as post-exposure prophylaxis for COVID-19 trial does not disclose any interim results submitted to the DSMB. Therefore, the futility analysis that led to trial termination cannot be examined. Using the sample size of 821 patients at the final analysis, which is close to the planned sample size of 750 patients for the second interim analysis, the adjusted sample size of 956 patients, the two-sided p-value of 0.35, and the observed placebo incidence rate of 14.3%, the conditional power at the current trend is 0.08%, which is consistent with the study report that the conditional power is less than 1%. The conditional power at the pre-specified 50% relative reduction from the observed placebo incidence rate of 14.3% is 41%, suggesting that the trial should have continued. These results suggest that the conditional power at the current trend was used for the futility analysis.
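As a cross-check, the at-trend figure can be roughly reproduced from the published numbers with the single-look formula. This is a sketch under stated assumptions (two-sided p = 0.35 in the direction of benefit, 821 patients analyzed of an adjusted target of 956); the exact inputs to the DSMB calculation are not disclosed, so the value need not match the reported figure precisely, but it also falls below 1%.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p: float) -> float:
    """Standard normal quantile by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

# Published figures: two-sided p = 0.35 in the direction of benefit,
# 821 patients analyzed of the adjusted target of 956.
z_obs = norm_ppf(1 - 0.35 / 2)        # observed z, roughly 0.93
t = 821 / 956                         # information fraction, roughly 0.86

drift_trend = z_obs / math.sqrt(t)    # current-trend drift estimate
b = z_obs * math.sqrt(t)              # B-value at information time t
cp_trend = norm_cdf((b + drift_trend * (1 - t) - 1.959964) / math.sqrt(1 - t))
```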

Totality of Evidence

The protocol specifies co-primary endpoints: the incidence of COVID-19 disease within 14 days (among those asymptomatic at baseline) and an ordinal scale of COVID-19 disease severity. In particular, the protocol states that the sample size calculation is based on a 50% relative reduction for both disease incidence and the ordinal scale of disease severity. Furthermore, the protocol states that group sequential boundaries will be provided at each DSMB report for the disease severity endpoint. Yet the publication does not provide any results for the disease severity endpoint.

The protocol specifies various subgroup analyses. The results are given in the supplemental appendix, from which the following forest plot is extracted.

[Forest plot of subgroup analyses from the supplemental appendix]

The results for the subgroups demonstrate a consistent trend toward efficacy of treatment with hydroxychloroquine, except for patients aged greater than 50, for whom no discussion is given to explain the finding.

Study Population

The target population for the study is adult household contacts or healthcare workers exposed to persons with COVID-19 disease. Based on the supplemental appendix of the paper, 61.75% of enrolled participants are White or Caucasian, followed by 21.3% Asian, 5.48% Hispanic or Latino, and 4.5% Black or African American. According to July 1, 2019 U.S. Census Bureau statistics, the percentage of Hispanic or Latino persons is 18.3%, the percentage of Black or African American persons alone is 13.4%, and the percentage of Asian persons alone is only 5.9%. According to CDC, 55% of the Minnesota state population is represented by participating COVID-NET counties. The April 8, 2020 CDC report states that

“in the COVID-NET catchment population, approximately 59% of residents are white, 18% are black, and 14% are Hispanic; however, among 580 hospitalized COVID-19 patients with race/ethnicity data, approximately 45% were white, 33% were black, and 8% were Hispanic, suggesting that black populations might be disproportionately affected by COVID-19.”

Thus, the ethnicity composition of the COVID-NET catchment population is very similar to that of the U.S. general population. Because of the disproportionate infection rate in the black population, it is essential that the patient population enrolled in the study represent the general U.S. population. The lack of such representation suggests selection bias in the enrolled study population. In addition, the study protocol does not provide any information on an enrollment strategy by which the U.S. general population would be well represented. Against this biased patient population, the conclusion in the abstract states that

“After high-risk or moderate-risk exposure to Covid-19, hydroxychloroquine did not prevent illness compatible with Covid-19 or confirmed infection when used as postexposure prophylaxis within 4 days after exposure.”

This is an overstatement of the study results given the flawed sample size adjustment procedure and the incorrect conditional power calculation that led to study termination. The discussion section of the paper does not address the issue of population bias. Interestingly, the conclusion statement in the abstract is not supported by the efficacy statement in the discussion that

“This randomized trial did not demonstrate a significant benefit of hydroxychloroquine as postexposure prophylaxis for Covid-19. Whether pre-exposure prophylaxis would be effective in high risk populations is a separate question, with trials ongoing.”

It is not clear what “a significant benefit” means. The observed effect size of 2.5% (14.3% − 11.8%) is identical to the 2.5% effect size used for sample size adjustment when the placebo rate is 5%. The difference is not statistically significant because of the flawed sample size adjustment procedure and the incorrect conditional power calculation that led to study termination.

Discussion

The presence of multiple obvious issues regarding study design, analysis, and reporting easily leads to the observation that the study does not meet the substantial evidence standard of an adequate and well-controlled investigation from which reliable and meaningful conclusions can be drawn. The study also raises serious questions about how the manuscript was reviewed by NEJM for publication. It is recommended that NEJM publish the peer reviews and documents for the publication. It is also recommended that NEJM require the authors to fully disclose the totality of evidence, including interim analysis results, the DSMB charter, and documents for the interim analyses prepared by both the independent biostatistics reporting group and the DSMB. The unprecedented impact of the COVID-19 pandemic on public health, society, and the economy requires that both the study authors and NEJM be fully transparent.

In the NEJM editorial, Cohen states that

 “On June 1, 2020, ClinicalTrials.gov listed a remarkable 203 Covid-19 trials with hydroxychloroquine, 60 of which were focused on prophylaxis. An important question is to what extent the article by Boulware et al. should affect planned or ongoing hydroxychloroquine trials.”

And

“If postexposure prophylaxis with hydroxychloroquine does not prevent symptomatic SARS-CoV-2 infection (with recognition of the limitations of the trial under discussion), should other trials of postexposure prophylaxis with hydroxychloroquine continue unchanged? Do the participants in these trials need to be informed of these results? Do these trial results with respect to postexposure prophylaxis affect trials of preexposure prophylaxis with hydroxychloroquine, some of which are very large (e.g., the Healthcare Worker Exposure Response and Outcomes of Hydroxychloroquine [HERO-HCQ] trial, involving 15,000 health care workers; ClinicalTrials.gov number, NCT04334148)?”

The answers to these questions are straightforward, but in reverse order. First, these trial results have the potential to introduce various biases into existing trials; therefore, critical review and reporting of the review findings are essential. Second, participants in the other trials should be informed that this study is systematically biased in study design, analysis, and reporting. Third, other trials should review their study design and analysis, including the DSMB charter, to ensure that systematic bias in the study design and analysis is removed and that trial organizers and DSMB members are competent to understand the various issues relating to the design and analysis.

References

1.      Boulware, D. R. et al. (2020). A randomized trial of hydroxychloroquine as postexposure prophylaxis for Covid-19. The New England Journal of Medicine. DOI: 10.1056/NEJMoa2016638.

2.      Cui, L., Hung, H. M. J., and Wang, S. J. (1999). Modification of sample size in group sequential clinical trials. Biometrics 55, 853–857.

3.      Müller, H. and Schäfer, H. (2001). Adaptive group sequential designs for clinical trials: Combining the advantages of adaptive and of classical group sequential approaches. Biometrics 57, 886–891.

4.      Liu, Q., Lim, P., Nuamah, I., and Li, Y. (2012). Invited paper. On adaptive error spending approach for group sequential trials with random information levels. Journal of Biopharmaceutical Statistics, Special issue on adaptive designs, 22, 687-699.

5.      Armitage, P. (1989). Discussion of the paper by Jennison and Turnbull. Journal of the Royal Statistical Society Series B 51, 334–335.

6.      Lan, K. K. G., DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials. Biometrika 70, 659–663.

7.      Jennison, C., Turnbull, B. W. (2000). Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall.

8.      Proschan, M. A., Lan, K. K. G., and Wittes, J. T. (2006). Statistical Monitoring of Clinical Trials, New York: Springer.

9.      Liu, Q. and Chi, G. Y. H. (2010). Understanding the FDA guidance on adaptive designs: Historical, legal, and statistical perspectives. Journal of Biopharmaceutical Statistics. Special issue on adaptive designs, 20, 1178-1219.

10.  Lan, K. K. G., Simon, R., Halperin, M. (1982). Stochastically curtailed tests in long-term clinical trials. Communications in Statistics 1, 207–219.

11.  Spiegelhalter, D. J., Freedman, L. S. (1988). Bayesian approaches to clinical trials. In: Bernardo, J. M., DeGroot, M. H., Lindley, D. V., Smith, A. F. M., eds. Bayesian Statistics 3. Oxford: Oxford University Press.

12.  Tweel, I., Noord, P. A. H. (2003). Early stopping in clinical trials and epidemiologic studies for “futility”: Conditional power versus sequential analysis. Journal of Clinical Epidemiology 56, 610–617.

13.  Cohen, M. S. (2020). Hydroxychloroquine for the prevention of Covid-19 — searching for evidence. Editorial, The New England Journal of Medicine, DOI: 10.1056/NEJMe2020388.

Copyright 2020 Media | QRMedSci, LLC.
