Finding the right power balance: better study design and collaboration can reduce dependence on statistical power
Shinichi Nakagawa, Malgorzata Lagisz, Yefeng Yang & Szymon Drobniak
Electronic Supplementary Material
Setup
Loading packages and custom functions. If your computer does not have
the required packages, please install them via
install.packages("package.name")
# load required packages
pacman::p_load(dplyr, magrittr, tidyr, stringr, ggplot2, cowplot,
patchwork, tidyverse, here, readxl, retrodesign, pwr, Superpower,
pander, metafor)
# custom function for the approximate sample size needed for a main
# effect ("normal") or an interaction effect ("interaction")
short_cut <- function(d, method = c("normal", "interaction")) {
method <- match.arg(method)
if (method == "normal") {
# two-group comparison: n per group ~ 16/d^2 (Lehr 1992)
size <- 16 * (1/d^2)
} else {
# 2 x 2 interaction: n per group ~ 32/d^2
size <- 32 * (1/d^2)
}
size
}
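As a quick check, this rule of thumb reproduces the per-group sample sizes used later in this document (16 for the male diet effect with d = 1, and roughly 294 for the interaction with d = 0.33):
short_cut(d = 1, method = "normal")  # 16
short_cut(d = 0.33, method = "interaction")  # 293.8476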
Aims of this Supporting Information
There are three sections in this document: 1) a fictional story that includes some examples of the small effects, mainly interaction effects, that we mention in the main text (A Fiction); 2) how we obtained the sample sizes presented in this fictitious story (Power Calculation); and 3) why it is important to publish results regardless of sample size or the power of a study (Small N & Meta-analysis).
Section 1: A Fiction
A likely story of a principal investigator
Imagine that you are a new principal investigator (PI) studying the detrimental consequences of obesity on cognition using a mouse model. You must put together your first grant as a lead PI. So you start with a power analysis to determine the sample size for a new idea: investigating the effect of a maternal obesogenic diet on offspring’s memory tasks in mice, and sex-specific differences in the tasks (Fig S1A). You need an estimated effect of the treatment and of the sex difference; of course, nobody has yet performed such a study, so you do not know the exact effect sizes. An earlier study examining the effect of maternal obesogenic diet on offspring’s cognition, using a different assay, has suggested surprisingly large effects: a 30% increase in the number of errors in males and 20% in females (i.e., a 10% difference between the sexes; standardized mean difference, \(d\) = 1.0, 0.67 and 0.33 for the male treatment effect, the female treatment effect, and the sex difference, respectively). Using these effect sizes, a power analysis, which assumes the independence of samples, shows you will need sample sizes of 16 male and 36 female mice (F1 generation) in each of the control and treatment groups (32 males and 72 females in total) to detect similar effects in your proposed experiment. So, you decide that you will use 72 offspring mice of each sex to have a nicely balanced design (Fig S1B). But then, you will need at least 36 x 2 mother mice (F0 generation) to preserve the independence of samples, at least between the treatment and control groups (i.e., taking one male and one female pup from each mother for each group; see Fig S1C).
Figure S1
An experimental design with hypothetical effects. Panel A shows an overview of the experimental design with four groups: 1) control females (Ncf), 2) treatment females (Nof), 3) control males (Ncm) and 4) treatment males (Nom), with N representing ‘sample size’. Panel B shows the assumed effect sizes as % and d (standardized mean difference; assuming, e.g., a baseline mean of 100 and a standard deviation of 30) and the resultant sample sizes from the power analysis. Panel C shows two possible scenarios based on the power analysis: F0 are mothers, F1 are offspring females (pinkish) and males (navy), with a target sample size of 36 offspring per sex per group.
Although this planned experiment is just one part of your grant proposal, it is already eating up the budget, even without considering sex differences. Thus, you decide that it is acceptable to take 4 male and 4 female pups from each mouse mother, so you only need 9 x 2 (18) mothers in total (Fig S1C). But your design now violates the statistical assumption of independent samples used for the power analysis (see Fig 1). By the way, what you decide not to mention in your grant proposal is that the original power analysis also suggests you would need 294 independent offspring of each sex per group to compare the sex difference in the treatment effect, or, statistically speaking, to test for an interaction among the four groups (i.e., 1,176 offspring mice from 294 x 2 [588] mothers; see Section 2 for details). You conceal this, despite your interest in the sex difference, because not only would incorporating such a design be far too expensive, but running such an experiment would also be too labor intensive, even with a PhD student and a technician. Surely, the ethics committee would not like such a large mouse experiment even if you got funded.
Now let’s use smaller and more realistic, albeit hypothetical, estimates of the effect sizes: 5% and 3% increases for the treatment effect in males and females, respectively, and thus a 2% sex difference (\(d\) = 0.17, 0.1 and 0.07, respectively; note that replication work suggests these hypothetical effect sizes are reasonable). For this scenario, you will need 574 x 2 (1,148) males, 1,600 x 2 (3,200) females, and 7,128 offspring mice in each of the control and treatment groups of both sexes (i.e., 28,512 mice in total) to reach statistical significance. You must not forget that you will also need 7,128 x 2 (14,256) mothers to produce these offspring mice (only using one male and one female pup per mother to meet the independence assumption). There is no way of doing such a ‘bigger-than-your-university-facility-can-take’ experiment, especially as a new PI (at least, not by yourself). In fact, there is little or no incentive whatsoever to provide realistic estimates of sample sizes when grant bodies request ‘value for money’ and ethics committees ask you to minimize the use of animals. Instead, such well-intended recommendations have been incentivizing scientists to use the smallest sample sizes possible, based on greatly inflated published effect sizes (see the main text). Note that this example was inspired by our earlier meta-analyses of related studies (Lagisz et al. 2014, 2015; Anwer et al. 2022).
Section 2: Power Calculation
Preamble
Statistical power is determined by the following three parameters:
Type I error probability, \(\alpha\), also known as significance threshold, which is usually fixed at 0.05 (see Table I);
sample size, \(n\), that is, the number of subjects required for an experiment;
standardized effect size, \(E[\theta]/\sqrt{Var[\theta]}\), where \(\theta\) is the effect size of interest, which is indicated by the real difference between two groups (in our case: obesogenic diet vs. control diet), \(E[\theta]\) is the population average/expectation, and \(Var[\theta]\) is the respective variance; note that standardized mean difference \(d\) is an example of a standardized effect size (for more on effect size, see also Fig 2 and Box 2).
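To see how these three parameters jointly determine power, here is a minimal sketch using pwr.t.test() from the pwr package loaded above (the particular values of n and d are illustrative assumptions):
# power of a two-sided, two-sample t-test with n = 20 per group and d = 0.5
pwr.t.test(n = 20, d = 0.5, sig.level = 0.05, type = "two.sample",
alternative = "two.sided")$power  # ~0.34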
# drawing Figure S2; Parameter 1: alpha level vs. power
#### set a range of alpha levels (0.001 to ~1)
alpha_range <- seq(0.001, 1, by = 0.01)
#### calculate power at the set alpha levels, assuming a medium
#### true effect size of 0.5 with a standard error of 0.2
power_range <- retro_design(A = 0.5, s = 0.2, alpha = alpha_range)
#### create a data frame
power_vs_alpha <- data.frame(alpha = alpha_range, power = power_range$power)
#### plot
power_vs_alpha_plot <- ggplot(power_vs_alpha) + geom_line(aes(x = alpha,
y = power), show.legend = F) + scale_y_continuous(breaks = seq(0,
1, 0.2), limits = c(0, 1)) + geom_hline(yintercept = 0.8,
colour = "red") + labs(x = "Type 1 error (alpha)", y = "Statistical power",
title = "(A) alpha level vs. power") + theme_bw()
### Parameter 2: n vs. power
#### set a range of n (2 to 100)
n_range <- seq(2, 100, by = 2)
## calculate power for a two-sample t test
## (two-independent-samples-design) using a medium
## magnitude of standardized effect size 0.5
power_range2 <- pwr.t.test(d = 0.5, n = n_range, sig.level = 0.05,
type = "two.sample", alternative = "two.sided")
#### create a dataframe
power_vs_n <- data.frame(n = n_range, power = power_range2$power)
#### plot
power_vs_n_plot <- ggplot(power_vs_n) + geom_line(aes(x = n,
y = power), show.legend = F) + scale_y_continuous(breaks = seq(0,
1, 0.2), limits = c(0, 1)) + geom_hline(yintercept = 0.8,
colour = "red") + labs(x = "Sample size (n)", y = "Statistical power",
title = "(B) n vs. power") + theme_bw()
### Parameter 3: effect size vs. power
#### create a plausible range of standardized effect sizes
es_range <- seq(0.01, 1.01, by = 0.01)
#### calculate power with alpha 0.05
power_range3 <- retrodesign::retro_design(A = es_range, s = 0.2,
alpha = 0.05)
#### create a dataframe
power_vs_es <- data.frame(es = es_range, power = power_range3$power,
alpha = rep(c("0.05"), length(es_range)))
#### plot
power_vs_es_plot <- ggplot(power_vs_es) + geom_line(aes(x = es,
y = power), show.legend = F) + scale_y_continuous(breaks = seq(0,
1, 0.2), limits = c(0, 1)) + geom_hline(yintercept = 0.8,
colour = "red") + labs(x = "Effect size (d)", y = "Statistical power",
title = "(C) effect size vs. power") + theme_bw()
### put all figures together
power_plot <- power_vs_alpha_plot/power_vs_n_plot/power_vs_es_plot
power_plot
Figure S2
An example showing how the three parameters affect statistical power: (A) Type I error (\(\alpha\)); (B) sample size (\(n\)); (C) magnitude of the standardized effect size (\(d\)). These figures were produced using the retro_design() function in the retrodesign package (Gelman & Carlin 2014). See the corresponding code chunk for the detailed code.
From Figure S2A, we can see that when an experiment permits a higher Type I error rate (which we do not want), it is easier to achieve a desired statistical power (i.e., Cohen’s recommendation of 80% power). Increasing the sample size (\(n\)) and the magnitude of the standardized effect size are effective ways to increase the statistical power of a given experiment (Figures S2B and S2C).
Estimating sample sizes in the ‘fictitious’ experiment with large effects
When designing the ‘fictitious’ diet experiment in your grant proposal, you choose a common significance threshold, \(\alpha\) = 0.05, and the nominal power level of 80%. Based on your pilot or external information (e.g., relevant studies or a meta-analysis on maternal effects), you assume the maternal obesogenic diet will lead to a 30% increase in the number of mistakes in a memory task in males, 20% in females, and a 10% difference between the sexes. To quantify the diet effect using a standardized effect size (i.e., \(d\)), we assume the following:
control group (both male and female) - mean = 100 (arbitrary unit) and standard deviation (sd) = 30;
male treatment group - mean = 130 and sd = 30;
female treatment group - mean = 120 and sd = 30.
Note that we assume the homogeneity of variances among groups (i.e. sd = 30).
## scenario 1: large effects
### set up an independent design
design <- ANOVA_design(
design = "2b*2b", # independent design, which means no correlation
n = 290, # the sample size in each group for testing the sex difference
mu = c(130, 120, 100, 100),
sd = 30,
labelnames = c("diet", "obesogenic ", "control ", "sex", "male", "female"),
plot = FALSE)
meanplot_largeES <- design$meansplot + labs(x = "Groups", y = "Mean")
Figure_S3 <- meanplot_largeES
Figure_S3
Figure S3
Visualization of the assumed means and standard deviation (sd) of each group, using the package Superpower. Error bars represent sd. Here we assume each group is independent (note that this is not quite true if we take one male and one female from one mother, but for convenience, let’s assume this).
Presumed effect sizes (large)
Using these means and sds, we obtain the following standardized mean differences \(d\) and the corresponding % differences:
1.0, corresponding to a 30% increase in the number of mistakes in a memory task in males after a diet intervention;
0.67, corresponding to a 20% increase in the number of mistakes in a memory task in females after a diet intervention;
0.33, corresponding to a 10% sex difference or interaction between diet and sex.
You are planning to use a typical two-sample t-test to examine the statistical significance of the diet effect in males and females. Then you can approximate sample size required for each group using the following formula (Lehr 1992):
\[ n = 16 \frac{Var[\theta]}{E[\theta]^2} = 16 \frac{1}{d^2} \] However, the last effect (the interaction) involves comparing four groups, so this formula does not directly apply. You can nonetheless use the same approach by replacing 16 with 32, because an interaction involves four groups rather than two.
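Plugging the presumed large effects into these approximations recovers the sample sizes used in the story:
\[ n_{male} = \frac{16}{1.0^2} = 16, \quad n_{female} = \frac{16}{0.67^2} \approx 36, \quad n_{interaction} = \frac{32}{0.33^2} \approx 294 \]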
Sample size calculation for diet effects
In the following section, we use our custom function (based on the above formula) and the existing R package pwr to estimate the sample sizes needed in your proposed experiment under different scenarios (the sample sizes mentioned in the fictitious story in the main text).
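The two unlabelled outputs below are the standardized mean differences for the male and female diet effects; the code chunk is not echoed in the original, but it presumably amounts to:
(130 - 100)/30  # d for the male diet effect
(120 - 100)/30  # d for the female diet effect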
## [1] 1
## [1] 0.6666667
# the first set (surprising large effects)
## male treatment effect
pwr.t.test(d = 1, sig.level = 0.05, power = 0.8, type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 16.71472
## d = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
# the result from our custom function is very close (short_cut gives 16)
pwr_independent_m_d1 <- short_cut(d = 1, method = "normal")
## female treatment effect
pwr.t.test(d = 0.67, sig.level = 0.05, power = 0.8, type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 35.95537
## d = 0.67
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Sample size calculation for sex difference (interaction)
We cannot use pwr.t.test to obtain the sample size for the sex difference. So we first use our formula, and then we use the package Superpower to obtain a simulation-based sample size.
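The two outputs below are the interaction effect size and the corresponding formula-based sample size; presumably:
(130 - 120)/30  # d for the sex difference (interaction)
short_cut(d = 0.33, method = "interaction")  # n per group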
## [1] 0.3333333
## [1] 293.8476
The main results of the ANOVA_exact output when detecting large effects:
## scenario 1: large effects; perform ANOVA and calculate
## power for an independent design
power_results <- ANOVA_exact(design, alpha_level = 0.05, verbose = FALSE)
power_results$main_results
## power partial_eta_squared cohen_f non_centrality
## diet 100.00000 0.14836492 0.41738692 201.388889
## sex 80.94609 0.00692025 0.08347738 8.055556
## diet:sex 80.94609 0.00692025 0.08347738 8.055556
The exact simulation (using the ANOVA_exact function) shows that collecting data from \(n\) = 290 F1 mice (per group) gives 81% power for the interaction, or sex difference (see the code chunk for the R syntax).
We also plot a power curve over a range of sample sizes (Figure S4), from which you can visually explore whether the expected power is achieved for the interaction (bottom panel), and if so, at which sample size.
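The output below was presumably generated with Superpower’s plot_power() function; a minimal sketch (the exact n range used is our assumption):
plot_power(design, min_n = 10, max_n = 300, desired_power = 80)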
## Achieved Power and Sample Size for ANOVA-level effects
## variable label n achieved_power desired_power
## 1 diet Desired Power Achieved 12 80.60 80
## 2 sex Desired Power Achieved 284 80.13 80
## 3 diet:sex Desired Power Achieved 284 80.13 80
Figure S4
Power curves for large intergenerational effects in an independent design (diet and sex are manipulated between animals). Top panel = the main effect of diet; middle panel = the main effect of sex; bottom panel = the interaction effect (sex difference). The orange horizontal lines denote the desired statistical power (80%). Note that the main diet effect is an average effect over the two sexes.
As you can see, the simulation-based method suggests we need 284 subjects per group to reach 80% power for the interaction effect, which confirms that the estimate we obtained from the formula was close enough.
Estimating sample sizes in the ‘fictitious’ experiment with realistic (small) effects
This time, we assume the maternal obesogenic diet will have more realistic effects on the pups’ memory: a 5% increase in males, 3% in females, and thus a 2% difference in the diet effect between the sexes (interaction).
To calculate \(d\), you assume:
control group (both male and female) - mean = 100 (arbitrary unit) and sd = 30;
male treatment group - mean = 105 and sd = 30;
female treatment group - mean = 103 and sd = 30.
## scenario 2: small effects
### set up an independent design
design2 <- ANOVA_design(
design = "2b*2b", # independent design
n = 7128, # the sample size in each group used for testing the sex difference
mu = c(105, 103, 100, 100),
sd = 30,
labelnames = c("diet", "obesogenic ", "control ", "sex", "male", "female"),
plot = FALSE)
meanplot_smallES <- design2$meansplot +
labs(x = "Groups", y = "Mean", title = "Realistic small effect")
meanplot_smallES
Figure S5
Visualization of the expected means and standard deviation (sd) of each group, using the package Superpower, under the more realistic scenario. Error bars represent sd. Here we assume each group is independent (note that this is not quite true if we take one male and one female from one mother, but for convenience, let’s assume this).
Presumed effect sizes (small)
As above, we assume all groups share a common sd = 30 (the population standard deviation). Then you can obtain the following standardized mean differences \(d\):
0.17, corresponding to a 5% increase in the number of mistakes in a memory task in males after the diet intervention;
0.1, corresponding to a 3% increase in the number of mistakes in a memory task in females after the diet intervention;
0.07, corresponding to a 2% sex difference or interaction between diet and sex.
Sample size calculation for diet effects
Following similar procedures to those used for estimating sample sizes for large effects (see above), you can obtain sample sizes for these new presumed effect sizes:
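As before, the two unlabelled outputs below are the standardized mean differences for the male and female diet effects, presumably from:
(105 - 100)/30  # d for the male diet effect
(103 - 100)/30  # d for the female diet effect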
## [1] 0.1666667
## [1] 0.1
# the first set (realistic small effects)
## male treatment effect
pwr.t.test(d = 0.167, sig.level = 0.05, power = 0.8, type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 563.8262
## d = 0.167
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
pwr_independent_m_d0.167 <- short_cut(d = 0.167, method = "normal")
## female treatment effect
pwr.t.test(d = 0.1, sig.level = 0.05, power = 0.8, type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 1570.733
## d = 0.1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
Sample size calculation for sex difference (interaction)
We can also use our formula to estimate it (assuming independence of all groups):
## [1] 0.1
# sex difference
pwr_independent_i_d0.067 <- short_cut(d = 0.067, method = "interaction")
pwr_independent_i_d0.067
## [1] 7128.536
We use a similar simulation-based approach to empirically calculate power for the interaction effect using Superpower. The main results of the ANOVA_exact output when assuming small effects:
## scenario 2: small effects; perform ANOVA and calculate
## power for an independent design
power_results2 <- ANOVA_exact(design2, alpha_level = 0.05, verbose = FALSE)
power_results2$main_results
## power partial_eta_squared cohen_f non_centrality
## diet 100.00000 0.0044253969 0.06667134 126.72
## sex 80.35012 0.0002777396 0.01666784 7.92
## diet:sex 80.35012 0.0002777396 0.01666784 7.92
The exact simulation (using the ANOVA_exact function) shows that collecting data from \(n\) = 7128 F1 mice (per group) gives 80.35% power for the interaction, or sex difference. So the simulation result closely matches the sample size estimated by the formula.
Correlated samples and statistical power
As mentioned, correlated samples can increase the statistical power of an experiment, so we require fewer samples. Here, we assume that sibling traits are correlated (\(r\) = 0.5) regardless of sex. We find that \(n\) = 3564 reaches the expected statistical power (80%) for the interaction (i.e., sex difference).
## scenario 3: small effects with correlated samples
### perform ANOVA and calculate power for a dependent design,
### assuming the siblings are very similar to each other - r = 0.5
design3 <- ANOVA_design(
design = "2w*2w", # dependent design
n =3564,
r = 0.5,
mu = c(105, 103, 100, 100),
sd = 30,
labelnames = c("diet", "obesogenic", "control", "sex", "male", "female"),
plot = FALSE)
power_results3 <- ANOVA_exact(design3, alpha_level = 0.05, verbose = FALSE)
power_results3$main_results
## power partial_eta_squared cohen_f non_centrality
## diet 100.00000 0.034344069 0.18858827 126.72
## sex 80.33173 0.002217916 0.04714707 7.92
## diet:sex 80.33173 0.002217916 0.04714707 7.92
This number (3564) corresponds to the value calculated from the following formula:
\[ n_{interaction} = \frac{32} {d^2}(1 - r) \]
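Plugging in our numbers (\(d\) = 0.067, \(r\) = 0.5):
\[ n_{interaction} = \frac{32}{0.067^2}(1 - 0.5) \approx 7128.5 \times 0.5 \approx 3564 \]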
Using this formula with a lower correlation (\(r\) = 0.25), we find \(n\) = 5346. As before, we can verify this using ANOVA_design and ANOVA_exact:
## scenario 4: small effects with correlated samples
### perform ANOVA and calculate power for a dependent design,
### assuming the siblings are moderately similar to each other - r = 0.25
design4 <- ANOVA_design(
design = "2w*2w", # dependent design
n =5346,
r = 0.25,
mu = c(105, 103, 100, 100),
sd = 30,
labelnames = c("diet", "obesogenic", "control", "sex", "male", "female"),
plot = FALSE)
power_results4 <- ANOVA_exact(design4, alpha_level = 0.05, verbose = FALSE)
power_results4$main_results
## power partial_eta_squared cohen_f non_centrality
## diet 100.00000 0.023159080 0.15397447 126.72
## sex 80.33874 0.001479566 0.03849362 7.92
## diet:sex 80.33874 0.001479566 0.03849362 7.92
As you can see, with \(n\) = 5346, we have ~80% power. We note that, as mentioned in the main text, for more complex designs (e.g., including different strains of mice), we cannot use the formula or the functions from Superpower. We need to use other software packages that can accommodate such design features.
Section 3: Small N & Meta-analysis
Simulating the population
We simulate two groups of individuals, e.g. males and females, with means of 10 and 20, respectively. Variances are equal (= 100).
set.seed(7777)
mean_m <- 10
mean_f <- 20
sd <- sqrt(100)
N <- 888
males <- rnorm(N, mean_m, sd)
females <- rnorm(N, mean_f, sd)
popdata <- data.frame(y = c(males, females), sex = rep(c("m",
"f"), each = N))
ggplot(data = popdata) + geom_density(aes(y = y, colour = sex,
fill = sex), alpha = 0.4, linewidth = 0.8) + geom_boxplot(aes(x = 0,
y = y, group = sex, colour = sex), position = "dodge2", width = 0.01,
size = 0.8) + xlab("Density") + ylab("Measured trait") +
theme_bw() + coord_flip()
Figure S6
Visualizing the difference between females (f) and males (m).
Simulation of multiple studies
We simulate M = 15 independent studies that sample this population, each testing for a difference between males and females. Per-group sample sizes range from n = 5 to n = 10.
M <- 15
samples <- 5:10
outdata <- data.frame(diff = numeric(0), t = numeric(0), p = numeric(0),
n = numeric(0), pooled_s = numeric(0))
for (i in 1:M) {
sample <- sample(samples, 1)
# each sex subset has N rows; draw 'sample' rows without replacement
males_s <- subset(popdata, sex == "m")[sample(1:N, size = sample,
replace = FALSE), ]
females_s <- subset(popdata, sex == "f")[sample(1:N, size = sample,
replace = FALSE), ]
expdata <- rbind(males_s, females_s)
test <- t.test(y ~ sex, data = expdata)
outdata[i, "diff"] <- test$estimate[1] - test$estimate[2]
outdata[i, "n"] <- sample
outdata[i, "t"] <- test$statistic
outdata[i, "p"] <- test$p.value
outdata[i, "pooled_s"] <- sqrt(((sample - 1) * var(males_s$y) +
(sample - 1) * var(females_s$y))/(2 * sample - 2))
}
# glimpse(outdata)
ggplot(data = outdata, aes(x = diff, y = p)) + geom_point(aes(size = n),
shape = 1) + theme_bw() + geom_hline(yintercept = 0.05, colour = "red",
lty = 2) + geom_point(aes(x = mean(diff), y = 0.1), size = 3,
colour = "blue") + geom_errorbarh(aes(y = 0.1, xmin = mean(diff) -
sd(diff)/sqrt(nrow(outdata)), xmax = mean(diff) + sd(diff)/sqrt(nrow(outdata))),
size = 0.2, height = 0.1, colour = "blue") + geom_point(aes(x = mean(filter(outdata,
p < 0.05)$diff), y = 0.1), size = 3, colour = "purple") +
geom_errorbarh(aes(y = 0.1, xmin = mean(filter(outdata, p <
0.05)$diff) - sd(filter(outdata, p < 0.05)$diff)/sqrt(nrow(outdata)),
xmax = mean(filter(outdata, p < 0.05)$diff) + sd(filter(outdata,
p < 0.05)$diff)/sqrt(nrow(outdata))), size = 0.2,
height = 0.1, colour = "purple") + geom_vline(xintercept = 10,
colour = "red") + geom_text(aes(x = 10, y = 0.5, label = "true difference"),
colour = "red", hjust = "left", nudge_x = 0.2) + xlab("Effect size (mean difference)") +
ylab("p value")
Figure S7
In this example, only 26.7% (4 of 15) of effect sizes pass the 0.05 significance cut-off; the remaining effect sizes would be prone to being omitted, potentially biasing the overall effect size (purple vs. the true effect size in blue).
Scenario 1 - low-powered studies are excluded
Based on significance, only “sexy” (statistically significant) results are published. An example meta-analysis shows that this inflates the overall effect size.
# Calculate Cohen's d effect size and its sampling variance
# (per-study n; variance of d ~ (n1 + n2)/(n1 * n2) + d^2/(2 * (n1 + n2)))
outdata <- outdata %>%
mutate(d = diff/pooled_s, var_d = (n + n)/(n * n) +
d^2/(2 * (n + n)))
model1 <- rma(yi = filter(outdata, p < 0.05)$d, vi = filter(outdata,
p < 0.05)$var_d, method = "FE")
summary(model1)
##
## Fixed-Effects Model (k = 4)
##
## logLik deviance AIC BIC AICc
## -2.4518 0.1816 6.9035 6.2898 8.9035
##
## I^2 (total heterogeneity / total variability): 0.00%
## H^2 (total variability / sampling variability): 0.06
##
## Test for Heterogeneity:
## Q(df = 3) = 0.1816, p-val = 0.9805
##
## Model Results:
##
## estimate se zval pval ci.lb ci.ub
## 1.2821 0.3599 3.5624 0.0004 0.5767 1.9875 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure S8
A forest plot for Scenario 1
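The plotting code for the forest plot is not echoed in the original; it was presumably a call to metafor’s forest() function along these lines:
forest(model1)  # forest plot of the k = 4 significant effect sizes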
Scenario 2 - conservative publication of all estimates
Publication mandate/archiving makes all effect sizes discoverable, regardless of their significance and magnitude.
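The model-fitting code for this scenario is not echoed either; given the output below (a random-effects model with k = 15 and the REML estimator), it was presumably along these lines:
# random-effects meta-analysis of all 15 effect sizes
model2 <- rma(yi = d, vi = var_d, data = outdata, method = "REML")
summary(model2)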
##
## Random-Effects Model (k = 15; tau^2 estimator: REML)
##
## logLik deviance AIC BIC AICc
## -10.0477 20.0954 24.0954 25.3735 25.1863
##
## tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.1875)
## tau (square root of estimated tau^2 value): 0
## I^2 (total heterogeneity / total variability): 0.00%
## H^2 (total variability / sampling variability): 1.00
##
## Test for Heterogeneity:
## Q(df = 14) = 4.1608, p-val = 0.9944
##
## Model Results:
##
## estimate se zval pval ci.lb ci.ub
## 0.8628 0.1819 4.7430 <.0001 0.5063 1.2193 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure S9
A forest plot for Scenario 2
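As for Scenario 1, the forest plot was presumably produced with:
forest(model2)  # forest plot of all 15 effect sizes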
If power calculations are based on meta-analytic effect sizes
For the biased overall estimate (based on significant effect sizes only), a simple power analysis suggests:
powercalc <- power.t.test(delta = summary(model1)$b, sig.level = 0.05,
power = 0.8, sd = 1)
powercalc
##
## Two-sample t test power calculation
##
## n = 10.60216
## delta = 1.282074
## sd = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
In other words, it suggests sampling 11 individuals per group.
Achieving the same power assuming the conservative (unbiased) estimate of the effect size would require:
powercalc <- power.t.test(delta = summary(model2)$b, sig.level = 0.05,
power = 0.8, sd = 1)
powercalc
##
## Two-sample t test power calculation
##
## n = 22.08957
## delta = 0.8628064
## sd = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
We need to sample 23 individuals per group (more than twice the N). Clearly, the problem lies not in conducting under-powered studies, but in not publishing their results. The degree of bias will be exacerbated by noisier data (larger sampling variance), small true effect sizes, and additional clustering or non-additive effects in the model.
References
Anwer H, Morris MJ, Noble DW, Nakagawa S, Lagisz M. Transgenerational effects of obesogenic diets in rodents: a meta-analysis. Obesity Reviews. 2022;23(1): e13342.
Gelman A, Carlin J. Beyond power calculations: assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science. 2014;9: 641–651.
Lagisz M, Blair H, Kenyon P, Uller T, Raubenheimer D, Nakagawa S. Transgenerational effects of caloric restriction on appetite: a meta-analysis. Obesity Reviews. 2014;15(4): 294–309.
Lagisz M, Blair H, Kenyon P, Uller T, Raubenheimer D, Nakagawa S. Little appetite for obesity: meta-analysis of the effects of maternal obesogenic diets on offspring food intake and body mass in rodents. International Journal of Obesity. 2015;39(12): 1669–1678.
Lehr R. Sixteen s-squared over d-squared: a relation for crude sample size estimates. Statistics in Medicine. 1992;11: 1099–1102.
R Session Information
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
locale: en_AU.UTF-8||en_AU.UTF-8||en_AU.UTF-8||C||en_AU.UTF-8||en_AU.UTF-8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: metafor(v.3.8-1), metadat(v.1.2-0), Matrix(v.1.5-3), pander(v.0.6.5), Superpower(v.0.2.0), pwr(v.1.3-0), retrodesign(v.0.1.0), readxl(v.1.4.1), here(v.1.0.1), forcats(v.0.5.2), purrr(v.1.0.0), readr(v.2.1.3), tibble(v.3.1.8), tidyverse(v.1.3.2), patchwork(v.1.1.1), cowplot(v.1.1.1), ggplot2(v.3.4.0), stringr(v.1.5.0), tidyr(v.1.2.1), magrittr(v.2.0.3) and dplyr(v.1.0.10)
loaded via a namespace (and not attached): googledrive(v.2.0.0), minqa(v.1.2.4), colorspace(v.2.0-3), ellipsis(v.0.3.2), rprojroot(v.2.0.3), estimability(v.1.4.1), fs(v.1.5.2), rstudioapi(v.0.14), farver(v.2.1.1), fansi(v.1.0.3), mvtnorm(v.1.1-3), lubridate(v.1.8.0), mathjaxr(v.1.6-0), xml2(v.1.3.3), codetools(v.0.2-18), splines(v.4.2.1), cachem(v.1.0.6), knitr(v.1.41), afex(v.1.2-1), jsonlite(v.1.8.3), nloptr(v.2.0.3), broom(v.1.0.2), dbplyr(v.2.2.1), compiler(v.4.2.1), httr(v.1.4.4), emmeans(v.1.8.2), backports(v.1.4.1), assertthat(v.0.2.1), fastmap(v.1.1.0), gargle(v.1.2.1), cli(v.3.4.1), formatR(v.1.12), htmltools(v.0.5.3), tools(v.4.2.1), lmerTest(v.3.1-3), coda(v.0.19-4), gtable(v.0.3.1), glue(v.1.6.2), reshape2(v.1.4.4), Rcpp(v.1.0.8.3), carData(v.3.0-5), cellranger(v.1.1.0), jquerylib(v.0.1.4), vctrs(v.0.5.0), nlme(v.3.1-157), xfun(v.0.34), lme4(v.1.1-30), rvest(v.1.0.3), lifecycle(v.1.0.3), pacman(v.0.5.1), googlesheets4(v.1.0.1), MASS(v.7.3-58.1), scales(v.1.2.1), hms(v.1.1.2), parallel(v.4.2.1), yaml(v.2.3.6), sass(v.0.4.2), stringi(v.1.7.8), highr(v.0.9), boot(v.1.3-28), rlang(v.1.0.6), pkgconfig(v.2.0.3), evaluate(v.0.17), lattice(v.0.20-45), labeling(v.0.4.2), tidyselect(v.1.2.0), plyr(v.1.8.7), bookdown(v.0.31), R6(v.2.5.1), generics(v.0.1.3), DBI(v.1.1.3), pillar(v.1.8.1), haven(v.2.5.1), withr(v.2.5.0), abind(v.1.4-5), modelr(v.0.1.9), crayon(v.1.5.2), car(v.3.1-0), utf8(v.1.2.2), tzdb(v.0.3.0), rmarkdown(v.2.17), grid(v.4.2.1), rmdformats(v.1.0.4), reprex(v.2.0.2), digest(v.0.6.30), xtable(v.1.8-4), numDeriv(v.2016.8-1.1), munsell(v.0.5.0), viridisLite(v.0.4.1) and bslib(v.0.4.0)