Power Analysis: How Much Data Do You Actually Need?
Every empirical study begins with a deceptively simple question: how many participants do I need? Answer too low, and you risk running a study that cannot detect the very effect it was designed to test — wasting the time of your participants, your collaborators, and your funders. Answer too high, and you waste resources, delay publication, and in clinical contexts, expose more people than necessary to experimental procedures.
The formal tool for answering this question is statistical power analysis. Despite being a half-century-old technique, it remains one of the most misunderstood and frequently neglected steps in research design. This guide walks through what power is, where the conventions came from, how to compute sample sizes for the most common test families, and the deeper conceptual issues that even seasoned researchers often get wrong.
What Is Statistical Power?
Statistical power is the probability that a study will correctly reject the null hypothesis when an effect of a specified size truly exists. Formally, power equals 1 − β, where β is the probability of a Type II error (a false negative). It is a conditional probability: given that the alternative hypothesis is true and the effect size equals some specified value, power tells you how often, in the long run, your study would detect that effect at your chosen alpha level.
Power depends on four interlocking quantities, often called the "power quartet":
- Sample size (N). Larger samples produce tighter sampling distributions and more power.
- Effect size. The true magnitude of the phenomenon you are trying to detect, in standardized units.
- Alpha (α). The Type I error rate, conventionally set to .05.
- Power (1 − β). The desired detection probability, conventionally set to .80.
Fix any three, and the fourth is mathematically determined. Power analysis is simply the algebra (and, increasingly, the simulation) that solves for whichever quantity you do not yet know.
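As a rough illustration of that algebra, consider a two-sided comparison of two equal groups with n participants per group and standardized effect d. Under a normal approximation (an assumption made here for simplicity; exact calculations use the noncentral t distribution), the four quantities are tied together as follows, where Φ is the standard normal CDF and z_p is its p-th quantile:

```latex
1 - \beta \;\approx\; \Phi\!\left( d\sqrt{n/2} \;-\; z_{1-\alpha/2} \right)
\qquad\Longleftrightarrow\qquad
n \;\approx\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{d^{2}}
```

The left-hand form solves for power given N, effect size, and alpha; the right-hand form solves the same relation for the per-group sample size.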
Where Did 0.80 Come From?
The convention that power should be at least 0.80 traces directly to Jacob Cohen. In his foundational textbook Statistical Power Analysis for the Behavioral Sciences (1988), Cohen proposed 0.80 as a reasonable balance between two competing concerns: studies should have a high probability of detecting real effects, but the effort required to push power closer to 1.00 grows quickly and produces diminishing returns.
Cohen's reasoning was explicitly pragmatic, not principled. He suggested that researchers typically treat Type I errors as roughly four times more serious than Type II errors, hence the β:α ratio of 4:1 (.20 to .05). He acknowledged this was a heuristic, writing that 0.80 "is offered as a convention" and warning that it should be revised upward in high-stakes contexts.
The convention β = .20 ... was chosen with the dual considerations that (a) Type I errors are typically thought of as more serious than Type II errors, and (b) the cost of higher power in terms of sample size becomes prohibitive.
In modern practice, 0.80 has hardened into something Cohen never intended: a lower bound that researchers treat as adequate rather than minimal. For confirmatory work, replication studies, and any context where a false negative carries real consequences, 0.90 or 0.95 is increasingly the norm.
Why Power Matters: The Underpowered-Study Problem
If power were merely a planning convenience, neglecting it would be sloppy but not catastrophic. The problem is that low-powered studies do far more damage than simply failing to detect real effects. Three pathologies follow directly from running underpowered research.
1. Inflated effect size estimates
When a study has, say, 20% power, only the largest random fluctuations of the sample estimate will cross the significance threshold. This means that any "significant" finding from a low-powered study is almost certainly an overestimate of the true effect, sometimes by a factor of two or more. Gelman and Carlin (2014) formalized this as the Type M (magnitude) error and showed that in low-power regimes the expected exaggeration ratio, computed among the results that reach statistical significance, can exceed 2.0.
2. Sign errors
In extreme low-power situations, statistically significant results can even point in the wrong direction. Gelman and Carlin call this Type S (sign) error. For a study with 6% power on a small true effect, the probability that a significant result will have the wrong sign can exceed 20%.
3. Failed replication and degraded literature
Button and colleagues (2013), in their landmark review "Power failure: why small sample size undermines the reliability of neuroscience," estimated the median statistical power across 49 meta-analyses in neuroscience at roughly 21%. The implications cascade: published effects are inflated, the literature is dominated by unreplicable findings, and meta-analytic estimates inherit the bias of the underlying studies. Maxwell (2004) had documented essentially the same problem in psychology a decade earlier, finding that the typical study examining multiple predictors of a behavioral outcome had power well below 0.50 for many of the comparisons of interest.
An underpowered study is not just less likely to find something real. When it does find something, the finding is more likely to be wrong — wrong in magnitude, wrong in sign, or simply a false positive that survived because of selective reporting. Power is not a procedural box to tick; it is a precondition for the literature to mean anything at all.
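The Type M and Type S errors described in this section can be computed directly for a simple case. The sketch below assumes a two-sided z-test whose estimate is normally distributed around a positive true effect; the function name `design_errors` and the two effect-to-standard-error ratios are illustrative choices, not values taken from Gelman and Carlin's paper.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def design_errors(true_effect, std_error, alpha=0.05):
    """Power, Type S rate and exaggeration ratio for a two-sided z-test.

    Assumes a normally distributed estimate and a positive true effect.
    """
    mu = true_effect / std_error           # true effect in standard-error units
    crit = STD.inv_cdf(1 - alpha / 2)      # two-sided critical value

    p_right = 1 - STD.cdf(crit - mu)       # significant with the correct sign
    p_wrong = STD.cdf(-crit - mu)          # significant with the wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power               # P(wrong sign | significant)

    # E[|estimate|; significant] in standard-error units (truncated-normal means)
    upper = mu * p_right + STD.pdf(crit - mu)
    lower = -mu * p_wrong + STD.pdf(crit + mu)
    exaggeration = (upper + lower) / power / mu
    return power, type_s, exaggeration

for ratio in (1.12, 0.25):  # illustrative true-effect / standard-error ratios
    power, type_s, exaggeration = design_errors(ratio, 1.0)
    print(f"effect/SE = {ratio:.2f}: power = {power:.2f}, "
          f"Type S = {type_s:.2f}, exaggeration = {exaggeration:.1f}")
```

The first ratio corresponds to roughly 20% power and the second to roughly 6% power, the two regimes discussed above.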
Computing Sample Size by Test Family
Power formulas differ across statistical tests, but the conceptual structure is identical: you specify the effect size you want to detect, the alpha level, and the desired power, then solve for N. What follows are the workhorse cases.
Independent-samples t-test
For comparing two group means, the relevant effect size is Cohen's d, the standardized mean difference. With α = .05 (two-tailed) and 80% power:
- Large effect (d = 0.8): n ≈ 26 per group (52 total).
- Medium effect (d = 0.5): n ≈ 64 per group (128 total).
- Between small and medium (d = 0.4): n ≈ 100 per group (200 total).
- Between small and medium (d = 0.3): n ≈ 176 per group (352 total).
- Small effect by Cohen's benchmark (d = 0.2): n ≈ 394 per group (788 total).
Notice how rapidly the required N grows as the effect shrinks. Doubling sensitivity (halving the effect size you can detect) requires roughly a fourfold increase in sample size — a direct consequence of the standard error scaling with 1/√n.
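The figures in the list above can be approximated without specialized software. The following is a minimal sketch using a normal approximation plus a commonly used small-sample correction term (z²/4); the helper name `n_per_group` is hypothetical, and dedicated tools solve the exact noncentral t equation instead.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided independent-samples t-test.

    Normal approximation plus a z^2 / 4 correction term; exact software
    works with the noncentral t distribution.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = z(power)            # quantile matching the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4)

for d in (0.8, 0.5, 0.4, 0.3, 0.2):
    n = n_per_group(d)
    print(f"d = {d:.1f}: n = {n} per group ({2 * n} total)")
```

Because d enters the formula squared, halving d roughly quadruples n, which is the scaling described above.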
Paired-samples t-test
Within-subjects designs are dramatically more efficient when responses are correlated across conditions. The effective effect size for the paired test becomes dz = d / √(2(1 − ρ)), where ρ is the within-subject correlation between conditions. For d = 0.4 and a typical ρ = 0.5, dz equals 0.4 and you would need roughly 52 participants rather than the 200 required for an independent-samples design, close to a fourfold reduction. This is why repeated-measures designs are the workhorses of cognitive psychology and psychophysics.
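To see how strongly the within-subject correlation drives the requirement, the conversion can be sketched in a few lines. This assumes equal variances in both conditions and uses a normal approximation with a z²/2 correction term, so the result can land a participant or so away from exact noncentral-t software; the helper names `paired_dz` and `n_pairs` are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def paired_dz(d, rho):
    """Standardized mean of the difference scores, given the between-condition
    effect d and the within-subject correlation rho."""
    return d / sqrt(2 * (1 - rho))

def n_pairs(dz, alpha=0.05, power=0.80):
    """Approximate number of participants for a two-sided paired t-test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    return ceil((z_alpha + z(power)) ** 2 / dz ** 2 + z_alpha ** 2 / 2)

for rho in (0.3, 0.5, 0.7):
    dz = paired_dz(0.4, rho)
    print(f"rho = {rho}: dz = {dz:.2f}, participants ~ {n_pairs(dz)}")
```

Higher correlations shrink the variance of the difference scores, so the same raw effect needs fewer participants.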
One-way ANOVA
For ANOVA, the relevant effect size is Cohen's f, related to η² by f = √(η² / (1 − η²)). Cohen's small/medium/large benchmarks are f = .10, .25, and .40. To detect a medium effect (f = .25) across four groups at 80% power, you need approximately 45 participants per group, or 180 total. Be wary: ANOVA's omnibus F-test power does not translate to the power for any specific pairwise contrast, which is typically much lower.
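The conversion between η² and Cohen's f is easy to get backwards, so a small sketch may help. The function names are hypothetical; the formulas are the ones stated above.

```python
from math import sqrt

def f_from_eta_squared(eta_sq):
    """Cohen's f from the proportion of variance explained (eta squared)."""
    return sqrt(eta_sq / (1 - eta_sq))

def eta_squared_from_f(f):
    """Inverse conversion: eta squared implied by a given Cohen's f."""
    return f ** 2 / (1 + f ** 2)

for label, f in (("small", 0.10), ("medium", 0.25), ("large", 0.40)):
    print(f"{label}: f = {f:.2f} -> eta squared = {eta_squared_from_f(f):.3f}")
```

Seen this way, a "medium" f of .25 corresponds to group membership explaining a little under 6% of the variance.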
Multiple regression
Power for regression depends on Cohen's f², defined as R² / (1 − R²) for the full model, or (R²full − R²reduced) / (1 − R²full) for testing the incremental contribution of a predictor set. To detect a small effect (f² = .02) for a single predictor at the 8th step of a regression with seven covariates, you need approximately 395 participants at 80% power. To detect a medium effect (f² = .15), about 55 suffice.
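The two definitions of f² differ only in what the baseline model is. A minimal sketch, with a hypothetical function name and made-up R² values chosen purely for illustration:

```python
def f_squared(r2_full, r2_reduced=0.0):
    """Cohen's f^2 for the predictors added between the reduced and full model.

    With r2_reduced = 0 this is the whole-model effect size R^2 / (1 - R^2).
    """
    return (r2_full - r2_reduced) / (1 - r2_full)

# Hypothetical values: seven covariates explain 30% of the variance and the
# focal predictor adds a further 1.4 percentage points.
print(round(f_squared(0.314, 0.300), 3))

# Whole-model effect size for the same full model.
print(round(f_squared(0.314), 3))
```

The first call shows how small an incremental R² is implied by f² = .02 once the covariates already explain a sizeable share of the variance.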
Correlation
For a single Pearson correlation tested against zero at α = .05 (two-tailed) and 80% power:
- r = .50 (large): N ≈ 29
- r = .30 (medium): N ≈ 84
- r = .20: N ≈ 193
- r = .10 (small): N ≈ 782
For comparing two independent correlations, sample sizes roughly double per group. For mediation analyses estimated via bootstrapped indirect effects, simulation studies (Fritz & MacKinnon, 2007) suggest that detecting a medium-medium pathway often requires N > 150, and small-small pathways can require N > 500.
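The single-correlation figures listed above can be approximated with the Fisher z transformation, whose standard error is 1/√(N − 3). The sketch below is an approximation under that assumption and tends to land about one participant above the exact values; `n_for_correlation` is a hypothetical helper name.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N to detect correlation r with a two-sided test against
    zero, using the Fisher z transformation."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha / 2) + z(power)) / atanh(r)) ** 2 + 3)

for r in (0.50, 0.30, 0.20, 0.10):
    print(f"r = {r:.2f}: N ~ {n_for_correlation(r)}")
```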
Worked Example: A Two-Group Intervention Trial
Suppose you are designing a randomized trial of a brief cognitive-behavioral intervention for social anxiety. Pilot data and meta-analyses of similar interventions suggest a true effect somewhere between d = 0.30 and d = 0.50. You decide to power for the lower end of this range to ensure adequate sensitivity if the effect turns out to be modest.
Targeting d = 0.35 with two-tailed α = .05 and power = .80, the required sample size is approximately n = 130 per group, or 260 total. Anticipating a 15% attrition rate, you divide by the expected retention (260 / 0.85) and raise the recruitment target to about 306 participants. If you instead want 90% power against the same effect, the requirement rises to roughly n = 173 per group (346 total, or about 408 with attrition).
Now run a sensitivity analysis. With 260 participants analyzed (130 per group), the smallest effect detectable at 80% power is d ≈ 0.35, which simply recovers the design target. The more informative check is what happens if the true effect is smaller: at d = 0.20 the same design has only about 36% power. The study is well calibrated for the effect range you specified, but it would be substantially underpowered if the true effect were closer to d = 0.20.
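The arithmetic of this worked example can be retraced in a short script. It is a sketch under a normal approximation (with a z²/4 correction for the sample-size step), so results may differ from exact software by a participant or so; all function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD = NormalDist()

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n (normal approximation + z^2 / 4 correction)."""
    z_alpha = STD.inv_cdf(1 - alpha / 2)
    return ceil(2 * (z_alpha + STD.inv_cdf(power)) ** 2 / d ** 2 + z_alpha ** 2 / 4)

def approx_power(d, n, alpha=0.05):
    """Approximate two-sided power with n participants per group."""
    z_alpha = STD.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n / 2)   # expected test statistic under the alternative
    return (1 - STD.cdf(z_alpha - shift)) + STD.cdf(-z_alpha - shift)

def recruit_target(n_total, attrition):
    """Inflate the analyzable N so that it survives the expected dropout."""
    return ceil(n_total / (1 - attrition))

n80 = n_per_group(0.35, power=0.80)
n90 = n_per_group(0.35, power=0.90)
print("80% power:", n80, "per group, recruit", recruit_target(2 * n80, 0.15))
print("90% power:", n90, "per group, recruit", recruit_target(2 * n90, 0.15))
print("power at d = 0.20 with the 80% design:", round(approx_power(0.20, n80), 2))
```

Note that the attrition adjustment divides by the retention rate rather than multiplying by 1.15, which would under-recruit slightly.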
Effect Size Estimation: The Hardest Part
Power formulas are mechanical. The genuinely difficult part of power analysis is choosing the effect size to plug in. There are four legitimate approaches, ranked roughly by quality of evidence.
1. Smallest effect size of interest (SESOI)
Lakens (2017) and colleagues argue that researchers should specify the smallest effect that would be theoretically meaningful or practically consequential, and power their study to detect that. SESOI is defensible because it grounds power in what matters rather than what we hope to find. It also forms the basis of equivalence testing, where rejecting an effect larger than the SESOI is itself an informative result.
2. Meta-analytic estimates
If a literature exists on your phenomenon, the meta-analytic mean (corrected for publication bias when possible) is a good anchor. Be cautious: published effect sizes are systematically inflated, so apply a discount of 25-50% if the literature is dominated by underpowered studies.
3. Pilot studies
Pilot data are useful for checking feasibility, refining procedures, and estimating nuisance parameters (variances, attrition rates). They are not reliable sources for effect-size estimates, because pilot samples are small and the resulting estimates have wide confidence intervals. Powering a confirmatory study off a pilot's point estimate is one of the most common and consequential mistakes in applied research.
4. Theoretical derivation
In some domains (psychophysics, computational modeling), theory specifies the expected effect size directly. This is the gold standard but rare outside formal modeling traditions.
Sensitivity, Compromise, and Sequential Analyses
A priori power analysis — choosing N before data collection — is the textbook case, but several variants address the constraints of real research.
Sensitivity analysis
If your sample size is fixed by funding or feasibility, you cannot solve for N. Instead, you solve for the smallest effect detectable at your chosen power. Sensitivity analyses are honest: they tell readers exactly what the study can and cannot rule out.
Compromise power analysis
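When N and the effect size of interest are both fixed and the conventional pairing of α = .05 with β = .20 is out of reach or hard to justify, you can instead specify the ratio of β to α that reflects the relative seriousness of the two errors, then solve for the alpha and beta that satisfy that ratio given your sample. This is rarely used in practice but conceptually elegant.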
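A compromise analysis can be sketched as a one-dimensional root search. The version below assumes a two-sided, two-group comparison under a normal approximation and uses plain bisection; `compromise` and its parameters are illustrative names, and the example inputs are made up.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()

def compromise(d, n_per_group, beta_alpha_ratio=1.0, tol=1e-10):
    """Find (alpha, beta) with beta / alpha equal to the requested ratio."""
    shift = d * sqrt(n_per_group / 2)   # expected test statistic under H1

    def beta(alpha):
        # Type II error rate; the negligible wrong-direction tail is ignored
        return STD.cdf(STD.inv_cdf(1 - alpha / 2) - shift)

    lo, hi = 1e-12, 1 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # beta(alpha) falls and ratio * alpha rises as alpha grows, so the
        # two curves cross exactly once
        if beta(mid) > beta_alpha_ratio * mid:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    return alpha, beta(alpha)

alpha, beta = compromise(d=0.5, n_per_group=30, beta_alpha_ratio=1.0)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

With a small sample and a 1:1 ratio, both error rates come out well above .05, which is exactly the trade-off a compromise analysis is meant to make visible.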
Sequential analysis and group sequential designs
Lakens (2014) and others have popularized sequential analysis, in which data are analyzed at predetermined interim points with adjusted alpha thresholds (e.g., Pocock or O'Brien-Fleming boundaries). Sequential designs preserve the overall Type I error rate while allowing studies to stop early for efficacy or futility, often with substantial savings in expected sample size. They require pre-registration of the stopping rules to remain valid.
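Why interim looks need adjusted thresholds is easiest to see by simulation. The sketch below simulates a z-test under the null at three equally spaced looks and compares an unadjusted threshold of 1.96 with 2.289, the value commonly tabulated as the Pocock constant for three looks at an overall two-sided α of .05; treat that constant as an assumption to verify against dedicated software. The function name and simulation settings are illustrative.

```python
import random
from math import sqrt

def type_one_error(critical_value, looks=3, sims=20000, seed=1):
    """Monte Carlo Type I error rate when a z-test is run after each of
    `looks` equally sized batches and the study stops at the first rejection."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        cumulative = 0.0
        for k in range(1, looks + 1):
            cumulative += rng.gauss(0.0, 1.0)    # standardized batch sum under H0
            if abs(cumulative / sqrt(k)) > critical_value:
                rejections += 1
                break                            # stop at the first rejection
    return rejections / sims

print("unadjusted 1.96 at every look:", type_one_error(1.96))
print("Pocock-style 2.289 at every look:", type_one_error(2.289))
```

Repeated testing at 1.96 roughly doubles the false-positive rate, whereas the stricter constant brings it back near the nominal level.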
The Trap of Retrospective ("Observed") Power
After running a non-significant study, researchers sometimes report "observed power" — the power computed using the effect size estimated from their own data. This is not informative and can be actively misleading.
Hoenig and Heisey (2001) showed that observed power is a deterministic, monotonic function of the p-value: the larger the p-value, the lower the observed power, and a result that just misses significance corresponds to observed power of roughly 50%, regardless of the true effect. It tells you nothing beyond what the p-value already told you. To assess whether your null result was meaningful, you need either a sensitivity analysis (what effect size could the design have detected?) or an equivalence test (can you bound the effect to within a region of practical equivalence?).
Observed power calculations are a pseudo-solution to a real problem. They feel like they answer the question, but they do not.
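The one-to-one link between the p-value and observed power can be made concrete for a two-sided z-test (an assumption chosen for simplicity; the same logic carries over to other tests). The function name `observed_power` is illustrative.

```python
from statistics import NormalDist

STD = NormalDist()

def observed_power(p_value, alpha=0.05):
    """'Observed power' of a two-sided z-test, computed from its own p-value."""
    z_obs = STD.inv_cdf(1 - p_value / 2)    # |z| implied by the p-value
    z_crit = STD.inv_cdf(1 - alpha / 2)
    # Treat the observed |z| as if it were the true standardized effect
    return (1 - STD.cdf(z_crit - z_obs)) + STD.cdf(-z_crit - z_obs)

for p in (0.01, 0.05, 0.20, 0.50, 0.80):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

No information about the data beyond the p-value enters the calculation, which is why the resulting number cannot add anything to it.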
Common Mistakes
- Powering on the published effect size of a single prior study. Single studies have wide confidence intervals; their point estimates are typically inflated. Always discount.
- Confusing η²p (partial) with η² (total). Partial eta-squared can be much larger than total eta-squared in factorial designs and cannot be used directly with Cohen's f benchmarks.
- Ignoring multiple testing. If you plan to test six contrasts, your alpha must be adjusted, and power calculations must use the adjusted alpha.
- Treating the omnibus F as power for a specific contrast. The interaction in a 2×2 ANOVA may have very different power from the main effects.
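- Assuming equal group sizes when allocation is unequal. For a fixed total N, power is highest with equal groups and falls as allocation becomes lopsided: a 2:1 split needs about 12.5% more participants in total to match a balanced design, and a 3:1 split about 33% more. Allocation ratios of 2:1 or worse should be modeled explicitly.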
- Not accounting for clustering. In nested designs (students within classrooms, patients within clinics), the design effect 1 + (m − 1)ρ, where m is the cluster size and ρ is the intraclass correlation, can multiply required sample sizes substantially.
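Two of the adjustments in the list above reduce to simple multipliers on a baseline sample size. The sketch below assumes equal cluster sizes and equal variances across groups; the function names and the baseline of 128 participants are illustrative.

```python
from math import ceil

def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

def allocation_inflation(ratio):
    """Factor by which total N must grow, relative to 1:1 allocation, to keep
    the same power with ratio:1 allocation."""
    return (1 + ratio) ** 2 / (4 * ratio)

base_total = 128   # e.g. a balanced, unclustered two-group design

# Clusters of 20 with an intraclass correlation of .05
print(ceil(base_total * design_effect(cluster_size=20, icc=0.05)))

# 2:1 allocation instead of 1:1
print(ceil(base_total * allocation_inflation(2)))
```

Even a modest intraclass correlation nearly doubles the requirement once clusters are moderately large, while a 2:1 allocation costs comparatively little.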
Power for the Methods You Actually Use
Closed-form formulas exist for the classical tests above. For more contemporary methods (multilevel models, structural equation models, generalized estimating equations, Bayesian estimation with informative priors), the appropriate tool is Monte Carlo simulation. You specify a data-generating process, simulate thousands of datasets under that process, run your planned analysis on each, and count the proportion in which the analysis meets your decision criterion (for example, rejecting the null at your chosen alpha). That proportion is the estimated power. Simulation handles arbitrary complexity, multiple effects, dropout, and non-normal distributions in ways closed-form solutions cannot.
Simulation is also the only honest route to power for novel analytic strategies. If you are running a multiverse analysis or applying a method without a published power table, you need to simulate.
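The simulation recipe can be sketched with nothing beyond the standard library. The example below assumes a deliberately simple data-generating process (unit-variance normal outcomes, a mean shift of d, dropout unrelated to outcome) and compares a Welch-type statistic with a normal critical value, which is slightly liberal at small n. All names and settings are illustrative; a real study would substitute its own model and planned analysis.

```python
import random
from math import sqrt

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def simulate_power(n_per_group, d, dropout=0.0, sims=2000, z_crit=1.96, seed=7):
    """Monte Carlo power for a two-group comparison with random dropout."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        groups = []
        for mean in (0.0, d):
            # Generate outcomes only for participants who do not drop out
            scores = [rng.gauss(mean, 1.0) for _ in range(n_per_group)
                      if rng.random() >= dropout]
            groups.append(scores)
        control, treated = groups
        if len(control) < 2 or len(treated) < 2:
            continue   # too few completers to test; counts as a non-detection
        m0, v0 = mean_and_var(control)
        m1, v1 = mean_and_var(treated)
        se = sqrt(v0 / len(control) + v1 / len(treated))
        if abs(m1 - m0) / se > z_crit:
            hits += 1
    return hits / sims

print("no dropout: ", simulate_power(64, 0.5))
print("15% dropout:", simulate_power(64, 0.5, dropout=0.15))
```

The first estimate should sit near the closed-form 80% for d = 0.5 with 64 per group, and the second shows how much power the same design loses once attrition is built into the data-generating process.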
Try It in PsyStat Nexus
Power analysis is built into PsyStat Nexus's Study Planner module. Specify your test family, expected effect size, alpha, and power; the planner returns the required N, plots the power curve across a range of plausible effects, and produces a sensitivity analysis tied to your achievable sample. For more complex designs — multilevel models, mediation, sequential stopping rules — the planner runs Monte Carlo simulations with your specified data-generating process.