Statistical Considerations for Clinical Trials During COVID-19: Confirmatory Adaptive Platform Trial (CAPT) Design for COVID-19 Treatment (Part III)

Authors: Qing Liu, Ph.D. (Quantitative and Regulatory Medical Science, LLC) and Karl Peace, Ph.D. (Jiann-Ping Hsu College of Public Health, Georgia Southern University)

Statistical Methods for CAPT Design (continued ...)

Sequential p-value and monitoring

Frequentist inference consists of assessing the strength of statistical evidence against the null hypothesis, as measured by significance tests or p-values, along with calculating confidence intervals and point estimates. For group sequential designs, these were not correctly addressed until Liu and Anderson (2008a, 2008b), who provided a unified approach to group sequential inference at interim and final analyses.

Liu and Anderson (2008b) discovered that the p-values computed from procedures in the literature [described in Jennison and Turnbull (2000, pp. 179–181) or in Proschan, Lan, and Wittes (2006, pp. 116–125)] cannot be meaningfully interpreted as p-values. This leads to further problems for confidence interval procedures, which are inversions of the p-value procedures. To resolve the various logical and inferential difficulties of these p-value procedures, Liu and Anderson (2008a), following the ideas of Liu and Pledger (2006), proposed three fundamental principles for ordering the sample space of group sequential trials. This ordering provides the foundation for statistical inference in group sequential trials, from which they derived procedures for calculating sequential p-values, confidence intervals, and median unbiased estimates.

Either sequential p-values or sequential confidence bounds can be used for monitoring an ongoing trial. For example, if, at any interim analysis, the sequential p-value is less than or equal to the significance level α, then the null hypothesis is rejected; likewise, if the sequential lower bound is greater than or equal to zero, then the null hypothesis that the effect size is less than or equal to zero is rejected. These simple and intuitive monitoring methods are equivalent to comparing the canonical test statistics against their corresponding significance boundary values. One operational benefit of using sequential p-values or confidence intervals is that the DSMB can check whether the significance boundary is crossed without having to refer to the specific details of the design. The real inferential and interpretive benefit is that the sequential p-values, from the frequentist point of view, provide a measure of the strength of evidence against the null hypothesis, irrespective of whether a conservative or a less conservative boundary is used. One of the greatest benefits is that the sequential p-value or confidence interval at the final analysis is the final p-value or confidence interval.

At their conception, the sequential inference methods were intended as alternatives to the repeated confidence intervals (RCI) approach of Jennison and Turnbull (1989). It turns out that the fundamental principles for group sequential inference also provide the theoretical foundation for the RCI approach. The sequential inference methods are therefore closely related to the RCI approach: the sequential lower confidence bounds are simply cumulative maxima of the lower repeated confidence bounds, and the sequential p-values are cumulative minima of repeated p-values.
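
To make the relationship concrete, here is a minimal sketch in base R. The repeated p-values and repeated lower confidence bounds are hypothetical placeholders standing in for the output of an RCI computation; only the cumulative minimum/maximum relationships and the monitoring check come from the methods described above.

```r
# Illustrative repeated p-values and repeated lower confidence bounds
# from four interim analyses (hypothetical numbers, not a real trial).
repeated_p     <- c(0.180, 0.070, 0.090, 0.020)
repeated_lower <- c(-0.40, -0.10, -0.15, 0.05)

# Sequential quantities per Liu and Anderson (2008a): cumulative minima
# of repeated p-values, cumulative maxima of repeated lower bounds.
sequential_p     <- cummin(repeated_p)
sequential_lower <- cummax(repeated_lower)

# Monitoring rule: reject the null at the first analysis where the
# sequential p-value falls at or below alpha (equivalently, where the
# sequential lower bound reaches zero or above).
alpha <- 0.025
data.frame(analysis = 1:4, sequential_p, sequential_lower,
           reject = sequential_p <= alpha)
```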

The sequential p-values are robust against various deviations from the original group sequential design, including a flexible stopping time, a change in sample size, a change in the significance level for monitoring, and changes in the number or timing of interim analyses when the significance boundary is calculated according to the method of Lan and DeMets (1983). The sequential p-values are also robust against violations of the underlying assumption that the treatment effect is homogeneous over time. A significant final sequential p-value is interpreted to mean that the underlying null hypothesis was rejected during the trial, with the type I error rate controlled at the specified level. In particular, the sequential inference methods apply when the trial is stopped for any reason at any time, where the decision to stop is based on the totality of accumulating data rather than the rigid binding stopping rules for the primary endpoint that other existing methods require. Sequential inference also addresses over-running, a problem for which group sequential designs are criticized by authors promoting sample size adaptive designs.

The use of sequential p-values for inference and monitoring is illustrated in Liu and Anderson (2008b) with the CAPTURE trial (The CAPTURE Investigators, 1997).

Point estimate and confidence interval

For COVID-19 treatment, point estimates and confidence intervals are critically important for informing the scientific community and the public of the benefits of new treatments. For COVID-19 clinical trials following the CAPT design, point estimate and confidence interval procedures are given in Liu and Chi (2001) and Liu, Li, Anderson and Lim (2012) for two-stage adaptive designs with sample size adjustment, and in Liu and Anderson (2008a) for general adaptive group sequential designs. The underlying theory for point and confidence interval estimation is given in Liu, Proschan and Pledger (2002).

Multiplicity in Hypothesis Testing

Another important property of sequential p-values is that standard methods for testing multiple hypotheses can be easily implemented with them, which expands the number of objectives that can be achieved in a group sequential trial. This can be illustrated with the motivating example in Liu and Anderson (2008a). In the supplemental article, Liu and Anderson (2008b) describe how sequential p-values can be conveniently applied to control the type I error rate under multiplicity in various applications.

Following Liu and Pledger (2006), Liu and Anderson (2008b) provide a sequential adaptive closed testing procedure, which can be modified for applications (e.g., CATCO) in which design modifications include a change of primary endpoints. Liu and Anderson (2008b) also describe sequential tests with hierarchical endpoints, which maintain type I error rate control for most applications, as well as sequential tests for supportive endpoints, sequential adaptive global test statistics, and the sequential p-Max test.
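
As a concrete illustration: because each sequential p-value is a valid p-value, it can be fed directly into standard multiplicity machinery. The sketch below applies Holm's step-down adjustment in base R to hypothetical sequential p-values for three hypotheses; the closed testing procedure of Liu and Anderson (2008b) is more general than this simple adjustment.

```r
# Sequential p-values for three hypotheses (e.g., three endpoints) at
# the current analysis; the numbers are hypothetical.
seq_p <- c(primary = 0.004, key_secondary = 0.030, supportive = 0.012)

# Standard multiplicity adjustments apply directly to sequential
# p-values; Holm's procedure is shown as one example.
p.adjust(seq_p, method = "holm")
```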

Missing Data Handling

Liu and Chi (2010) rely on existing randomization test procedures for time-to-event and binary endpoints, both of which can handle missing data due to early dropout. For the general setting of continuous endpoints with missing data, Liu, Holdbrook and Castelli (2018) developed a randomization test assuming a missing-at-random (MAR) mechanism. During the blinded data review, a relevant test procedure is chosen for each stage, along with a conditional error function. The results from the stages are then combined through the conditional error function to derive the overall p-value for the two-stage adaptive trial.
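
The following sketch illustrates the combination step with a weighted inverse-normal conditional error function, a common choice in the adaptive design literature; the equal stage weights and the stage-wise p-values are assumptions for illustration, not a prescription from Liu and Chi (2010) or Liu, Holdbrook and Castelli (2018).

```r
# Two-stage combination through a conditional error function: an
# illustrative inverse-normal version with pre-specified stage weights.
alpha <- 0.025
w1 <- sqrt(0.5); w2 <- sqrt(0.5)  # assumed equal stage weights

# Conditional error function A(p1): the significance level available
# to the stage 2 test after observing the stage 1 p-value.
cond_error <- function(p1) {
  z1 <- qnorm(1 - p1)
  1 - pnorm((qnorm(1 - alpha) - w1 * z1) / w2)
}

# Stage-wise p-values, e.g., from the randomization tests chosen at
# the blinded data review (hypothetical values).
p1 <- 0.10
p2 <- 0.03

# Reject the null overall if p2 is at or below the conditional error.
p2 <= cond_error(p1)
```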

Use of Non-concurrent Controls

An advantage of COVID-19 treatment trials following the CAPT design is that controls from a prior randomization can be incorporated for a comparative efficacy analysis of a new treatment against a previous treatment, for which both non-concurrent and concurrent controls are available. Due to potential bias in non-concurrent controls, statistical inference following existing naïve statistical methods, including the so-called dynamic pooling methods, is inherently biased as well. It is important to recognize the key differences between non-concurrent and concurrent controls, which include changes in the standard of care, the patient population, or general design changes from the previous randomization. While bias due to imbalance of known risk factors may be controlled through existing naïve statistical methods (e.g., matched control analysis, regression analysis, propensity score matching, synthetic control arms, or dynamic pooling methods), it is impossible to control bias due to imbalance of unknown risk factors. These issues can be addressed when non-concurrent controls are used to augment concurrent controls.

The virtual matched control methodology of Liu, Castelli, and Holdbrook (2020) can be applied to non-concurrent and concurrent controls separately, followed by deriving a mixture distribution with weights proportional to the sample size or information of the two types of controls. The mixture distribution is then used in the exact conditional intra-patient (ECIP) test for assessing the treatment effect, whether it is measured as a change-from-baseline endpoint or a pre- and post-treatment slope endpoint. To assess potential bias not captured by the covariates used in the virtual matched control analysis, a tipping point analysis is performed by shifting the distribution of the non-concurrent controls against the treatment group until the statistical significance of the ECIP test is lost. If the distribution of the non-concurrent controls at the tipping point is comparable to, or exceeds, the distribution of the concurrent controls, then the result of the ECIP test is robust against potential bias that cannot be accounted for by covariates.
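
A minimal sketch of the tipping point mechanics follows. Since the ECIP test is not publicly available, a Wilcoxon rank-sum test stands in purely to show the loop, and the mixture-weighting step is not shown; all data, the shift grid, and the 0.05 threshold are hypothetical.

```r
set.seed(1)
# Hypothetical change-from-baseline data (negative = improvement).
treated        <- rnorm(60, mean = -2.0, sd = 3)
concurrent_ctl <- rnorm(30, mean =  0.0, sd = 3)
nonconc_ctl    <- rnorm(50, mean =  0.2, sd = 3)

tip_p <- function(shift) {
  # Shift the non-concurrent controls toward the treated group to mimic
  # unmeasured bias, pool with concurrent controls, and re-test.
  ctl <- c(concurrent_ctl, nonconc_ctl + shift)
  wilcox.test(treated, ctl)$p.value
}

# Walk the shift until statistical significance is lost: the tipping point.
shifts <- seq(0, -3, by = -0.1)
pvals  <- sapply(shifts, tip_p)
tipping <- shifts[which(pvals > 0.05)[1]]
tipping
# If controls shifted this far would be comparable to, or better than,
# the concurrent controls, the original result is judged robust.
```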

Discussion

Stochastic Curtailment Controversy

There is substantial controversy over the use of stochastic curtailment for stopping group sequential trials for futility. Armitage (1989) pointed out that

“It is perhaps worth mentioning the extensive literature on stochastic curtailment and Bayesian prediction which bears on this (Lan et al., 1982; Spiegelhalter and Freedman, 1988). The idea is that we should stop early if it can be predicted that the final result will, in some sense, not strongly contradict the null hypothesis. Much of the stochastic curtailment literature runs into difficulties over the choice of hypothesis on which to make the prediction, and I prefer the Bayesian approach. However, I have two reservations about the whole approach—one is that early stopping should be justified by the current inference, rather than by prediction of what might happen after some future random events, and secondly, in trials where collection of reliable information takes precedence over the need to select one or other treatment it will usually be foolish to stop unnecessarily early and to waste information. Even if the final difference is non-significant the trial may contribute usefully to knowledge, particularly if combined in an overview with data from other similar trials (p. 335).”

Jennison and Turnbull (2000, p. 219) stated that

“The conditional power calculations are also non-standard in that they do not refer to a single, well-defined reference test. In order to avoid such difficulties, it is wise to define a study protocol as unambiguously as possible at the outset. If this is done thoroughly and an interim analysis schedule is also defined, the full range of group sequential tests are available for use and one of these tests may be preferred to stochastic curtailment.”

Tweel and Noord (2003) analyze two data sets to illustrate various difficulties with the use of conditional power for stochastic curtailment. They conclude that group sequential analyses have several advantages over stochastic curtailment and recommend that

“more studies should consider a sequential design and analysis to enable early stopping when enough evidence has accumulated to conclude a lack of the expected effect”

We note that for group sequential designs, the conditional power calculation often cited in the literature is incorrect, as it does not take the group sequential test into consideration. The correct calculation is given in Müller and Schäfer (2001).

Liu and Chi (2010) calculate the conditional power at the nonbinding futility boundary derived from the Kim–DeMets error-spending function. They evaluate the conditional power at the specified minimum effect size from which the futility boundary is derived, and also at the so-called current trend, the observed effect size given by the futility boundary value divided by the square root of the information. They discover that conditional powers at the current trend are consistently biased against the alternative hypotheses of interest. For example, consider a group sequential design with 80% power in which the Kim–DeMets error-spending function with shape parameter 0.75 is used to calculate a justifiably aggressive nonbinding futility boundary. At 20% of the maximum information, the type 2 error rate (or β spent) for prematurely terminating the trial is 5.9% if the alternative hypothesis at the minimum effect size is true; the type 1 error rate (or futility level) for continuing the trial is 44.9% if the null hypothesis is true. These probability measures, which are easy to understand, represent a meaningful, carefully-thought-out criterion for stopping the trial for futility. At the futility boundary, the conditional power evaluated at the minimum effect size is 51.2%, which corroborates the aggressive choice of the futility boundary. However, the conditional power at the current trend is less than 0.3%, which is grossly biased. By any standard, one would stop a trial if the probability of achieving statistical significance were below 1%; yet here that decision would carry a high probability of a type 2 error.
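
The design in this example can be approximated with the gsDesign R package (Anderson, 2020), where sfPower is the Kim–DeMets power-family spending function; to my understanding, gsCP() computes conditional boundary-crossing probabilities that account for the remaining group sequential boundaries, in the spirit of Müller and Schäfer (2001). The analysis timing below is an assumption, so the output need not reproduce the quoted figures exactly.

```r
library(gsDesign)

# Approximate design: 80% power, five equally spaced analyses, efficacy
# spending via Lan-DeMets O'Brien-Fleming, non-binding futility via the
# Kim-DeMets power spending function with shape parameter 0.75.
x <- gsDesign(k = 5, test.type = 4, alpha = 0.025, beta = 0.2,
              sfu = sfLDOF,
              sfl = sfPower, sflpar = 0.75,
              timing = c(0.2, 0.4, 0.6, 0.8, 1))

# Interim z-value sitting on the futility boundary at 20% of maximum
# information (the first analysis).
z1 <- x$lower$bound[1]

# Conditional power given z1 at the "current trend" estimate and at the
# minimum effect size used to size the design.
theta_trend <- z1 / sqrt(x$n.I[1])
cp <- gsCP(x, theta = c(theta_trend, x$delta), i = 1, zi = z1)

# Total probability of eventually crossing the efficacy boundary,
# one column per theta value (current trend, then minimum effect).
colSums(cp$upper$prob)
```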

The most dramatic demonstration of the inappropriateness of conditional power for futility is its application in stopping Biogen’s Alzheimer’s drug development. After analyzing additional follow-up data, Biogen announced that “The result of the futility analysis is incorrect.”

Adaptive Bayesian Designs

Bayesian methods have been used in adaptive platform trial designs; examples include REMAP-CAP and its adaptation for COVID-19 by UPMC. The Bayesian design is exploratory in nature and does not meet the statutory requirements for adequate and well-controlled investigations. Both the 2016 FDA guidance on adaptive designs for medical devices and the 2019 FDA guidance on adaptive designs for drugs and biologics require computer simulations to obtain the operating characteristics of a Bayesian design, including type 1 error rates and power. As a result, it is necessary to specify the exact decision rules so that simulation can proceed. This is a substantial limitation of the Bayesian methods, as the trials have no flexibility to handle unexpected design modifications that are not part of the Bayesian decision rules used in the simulation studies. In fact, the whole exercise of simulation studies is unnecessary. Emerson, Kittelson and Gillen (2007a, 2007b) show that when the canonical joint distribution of the standardized test statistics (Jennison and Turnbull, 2000, p. 49) is used, Bayesian boundaries can be converted to traditional group sequential boundaries for the standardized test statistics, and vice versa. Similar methods are used by Spiegelhalter, Abrams and Myles (2004). Gerber and Gsponer (2016) describe an R package (gsbDesign) for evaluating the operating characteristics of a group sequential Bayesian design. From the resulting rejection probabilities under the null and alternative hypotheses, traditional group sequential boundaries can then be derived.
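
A minimal worked sketch of this correspondence, assuming a normal prior under the canonical joint distribution: a posterior probability stopping rule maps in closed form to a group sequential z boundary. The prior standard deviation, posterior threshold, and information levels below are illustrative assumptions.

```r
# Model: theta ~ N(0, tau^2); Z_k | theta ~ N(theta * sqrt(I_k), 1),
# the canonical joint distribution (Jennison and Turnbull, 2000, p. 49).
tau  <- 0.5               # prior standard deviation (assumption)
eps  <- 0.025             # stop when P(theta > 0 | data) >= 1 - eps
info <- c(20, 40, 60, 80) # statistical information at each analysis

# P(theta > 0 | Z_k = z) = pnorm(sqrt(I_k) * tau * z / sqrt(1 + I_k * tau^2)),
# so the Bayesian rule "posterior probability >= 1 - eps" is exactly a
# group sequential z boundary:
z_bound <- qnorm(1 - eps) * sqrt(1 + info * tau^2) / (sqrt(info) * tau)
round(z_bound, 3)

# Conversely, any z boundary can be re-expressed as a posterior
# probability threshold by inverting the same formula.
```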

Because there is a large number of choices for traditional group sequential boundaries, one could instead find an optimal group sequential design and then convert its boundaries to Bayesian boundaries. This would yield an optimal group sequential design superior to commonly used group sequential Bayesian designs. Because the same group sequential design can be expressed in either form, the only benefit of the Bayesian formulation is to help with the interpretation of frequentist confidence intervals.

Free Software for Group Sequential Designs

A group sequential design can be easily set up with the free R package gsDesign developed by Anderson (2020). Better yet, the gsD web app, developed as a Shiny app with an intuitive user interface, can be installed as an iPhone, iPad, or Android app, as well as a Windows 10 app that can be pinned to the taskbar or the Start menu. It takes minutes to set up a group sequential design from a large number of options for boundary functions and other design parameters.
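
For instance, a sketch of such a setup (all settings illustrative):

```r
library(gsDesign)

# Four analyses, non-binding futility, 90% power; Lan-DeMets
# O'Brien-Fleming efficacy spending and Hwang-Shih-DeCani futility
# spending with gamma = -2.
d <- gsDesign(k = 4, test.type = 4, alpha = 0.025, beta = 0.1,
              sfu = sfLDOF, sfl = sfHSD, sflpar = -2)
d        # prints boundaries, spending, and sample size inflation
plot(d)  # boundary plot
```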

Return to Part I: https://lnkd.in/ejYYTZN

Return to Part II: https://lnkd.in/eTMafdn

References

1. Liu, Q., and Anderson, K. M. (2008a). On adaptive extensions of group sequential trials for clinical investigations. Journal of the American Statistical Association 103, 1621–1630.

2. Liu, Q., and Anderson, K. M. (2008b). Theory of inference for adaptively extended group sequential designs with applications in clinical trials. Journal of the American Statistical Association, supplement to Liu and Anderson (2008a).

3. Proschan, M. A., Lan, K. K. G., and Wittes, J. T. (2006). Statistical Monitoring of Clinical Trials. New York: Springer.

4. Lan, K. K. G., and DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials. Biometrika 70, 659–663.

5. The CAPTURE Investigators. (1997). Randomized placebo-controlled trial of abciximab before and during coronary intervention in refractory unstable angina: the CAPTURE study. Lancet 349, 1429–1435.

6. Liu, Q., and Chi, G. Y. H. (2001). On sample size and inference for two-stage adaptive designs. Biometrics 57, 172–177.

7. Liu, Q., Li, G., Anderson, K. M., and Lim, P. (2012). On efficient two-stage adaptive designs for clinical trials with sample size adjustment. Journal of Biopharmaceutical Statistics 22 (special issue on adaptive designs), 617–640.

8. Liu, Q., Proschan, M. A., and Pledger, G. W. (2002). A unified theory of two-stage adaptive designs. Journal of the American Statistical Association 97, 1034–1041. https://www.jstor.org/stable/3085828?seq=1

9. Liu, Q., and Pledger, G. W. (2006). On design and inference for two-stage adaptive clinical trials with dependent data. Journal of Statistical Planning and Inference 136, 1962–1984.

10. Jennison, C., and Turnbull, B. W. (1989). Interim analyses: the repeated confidence interval approach (with discussion). Journal of the Royal Statistical Society, Series B 51, 305–361.

11. Liu, Q., Holdbrook, F., and Castelli, F. (2018). On multiple imputation to assess impacts of potentially non-ignorable missing data with exact tests. Unpublished technical document.

12. Liu, Q., Castelli, J., and Holdbrook, F. (2020). Statistical analysis of single-arm trial with virtual matched controls. Submitted to a special journal issue on complex innovative designs.

13. Armitage, P. (1989). Discussion of the paper by Jennison and Turnbull. Journal of the Royal Statistical Society, Series B 51, 334–335.

14. Lan, K. K. G., Simon, R., and Halperin, M. (1982). Stochastically curtailed tests in long-term clinical trials. Communications in Statistics 1, 207–219.

15. Spiegelhalter, D. J., and Freedman, L. S. (1988). Bayesian approaches to clinical trials. In: Bernardo, J. M., DeGroot, M. H., Lindley, D. V., and Smith, A. F. M., eds. Bayesian Statistics 3. Oxford: Oxford University Press.

16. Tweel, I., and Noord, P. A. H. (2003). Early stopping in clinical trials and epidemiologic studies for “futility”: conditional power versus sequential analysis. Journal of Clinical Epidemiology 56, 610–617.

17. Müller, H., and Schäfer, H. (2001). Adaptive group sequential designs for clinical trials: combining the advantages of adaptive and of classical group sequential approaches. Biometrics 57, 886–891.

18. Liu, Q., and Chi, G. Y. H. (2010). Fundamental theory of adaptive designs with unplanned design change in clinical trials with blinded data. In: Pong, A., and Chow, S. C., eds. Handbook of Adaptive Designs in Pharmaceutical and Clinical Development, 2-1 to 2-8. Chapman & Hall.

19. Emerson, S., Kittelson, J., and Gillen, D. (2007a). Bayesian evaluation of group sequential clinical trial designs. Statistics in Medicine 26, 1431–1449. doi:10.1002/sim.2640.

20. Emerson, S., Kittelson, J., and Gillen, D. (2007b). Frequentist evaluation of group sequential clinical trial designs. Statistics in Medicine 26, 5047–5080. doi:10.1002/sim.2901.

21. Jennison, C., and Turnbull, B. W. (2000). Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall.

22. Spiegelhalter, D., Abrams, K., and Myles, J. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Statistics in Practice. John Wiley & Sons.

23. Gerber, F., and Gsponer, T. (2016). gsbDesign: an R package for evaluating the operating characteristics of a group sequential Bayesian design. Journal of Statistical Software 69(11). doi:10.18637/jss.v069.i11.

24. Anderson, K. (2020). Package gsDesign. https://cran.r-project.org/web/packages/gsDesign/gsDesign.pdf

Copyright 2020 QRMedSci, LLC.
