Multiple Comparisons and the Garden of Forking Paths
A single p-value of .03 looks like evidence. Twenty p-values, of which one is .03, look like noise. The difference is not in the statistic itself but in the inferential context that produced it, and that context is one of the most consistently mishandled aspects of empirical research. This essay walks through the multiple comparisons problem from its arithmetic foundations to its more subtle modern incarnation: the garden of forking paths articulated by Gelman and Loken (2013), in which multiplicity emerges not from a researcher's bad faith but from the ordinary, defensible decisions made during analysis.
The Arithmetic of Multiplicity
Begin with the textbook setup. A null hypothesis significance test rejects the null when p < α, conventionally α = .05. Under the null, the probability of not rejecting on a single test is 1 − α = .95. If we conduct k independent tests, all under their respective nulls, the probability of avoiding every false rejection is (1 − α)^k. The probability of at least one false positive (the family-wise error rate, or FWER) is therefore:
FWER = 1 − (1 − α)^k
For α = .05 and k = 20 tests, this gives 1 − .95^20 ≈ 0.642. A researcher who runs twenty independent tests on null data has roughly a 64% chance of finding at least one "significant" effect. At k = 100 the probability climbs above 99%. The nominal α protects an individual test, not a study; the gap between those two guarantees is the multiple comparisons problem in its purest form.
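The formula is easy to check numerically. The sketch below (Python, standard library only) evaluates it for a few values of k and confirms the k = 20 case by simulation, relying on the fact that a p-value is uniformly distributed under a true null. The function names and the number of simulated studies are illustrative choices, not part of any package.

```python
import random

def fwer_independent(alpha: float, k: int) -> float:
    """Probability of at least one false rejection across k independent null tests."""
    return 1.0 - (1.0 - alpha) ** k

def simulate_fwer(alpha: float, k: int, n_studies: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check: draw k uniform p-values per study, count studies with any p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(k)):
            hits += 1
    return hits / n_studies

for k in (1, 5, 20, 100):
    print(f"k = {k:>3}: FWER = {fwer_independent(0.05, k):.3f}")

print(f"simulated FWER at k = 20: {simulate_fwer(0.05, 20):.3f}")
```

The simulated value should land close to the analytic 0.642, with the small difference reflecting Monte Carlo error only.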
This much has been understood since at least Bonferroni's inequality work in the 1930s (Bonferroni, 1936). What has shifted over the last fifteen years is the recognition that the relevant k is rarely the number of tests a researcher reports. It is the number of tests that could have been reported given the choices the data invited.
Family-Wise Error Rate Control
The Bonferroni correction
The classical fix attributed to Bonferroni (and popularized by Dunn, 1961) is to test each of the k hypotheses at the more stringent threshold α/k. By Boole's inequality, this guarantees FWER ≤ α regardless of dependence structure among the tests. It is the most conservative and most general correction available, and that is both its strength and its weakness.
Why too conservative? Three reasons. First, Boole's inequality is loose when tests are positively correlated — common in psychology, where outcome measures often share method variance. Second, the correction treats every comparison as equally interesting, ignoring prior plausibility or theoretical structure. Third, and most importantly, it controls the wrong quantity for many research questions. A behavioral geneticist scanning 500,000 SNPs for association with depression does not want a 5% chance of any false positive across the entire genome; she wants to know what proportion of her flagged loci are spurious. Bonferroni answers the first question; in answering it strictly, it leaves nearly every real signal undetected.
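As a minimal sketch of the α/k rule described above, the following Python applies the correction to a list of p-values and also computes the FWER the rule actually achieves when all k nulls are true and the tests are independent. The p-values are hypothetical and the function names are my own.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / k; returns one boolean per input p-value."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def fwer_after_bonferroni(alpha, k):
    """Exact FWER for k independent true nulls, each tested at alpha / k."""
    return 1.0 - (1.0 - alpha / k) ** k

# Hypothetical p-values, for illustration only; with k = 4 the threshold is 0.0125.
print(bonferroni_reject([0.001, 0.004, 0.03, 0.2]))  # → [True, True, False, False]

for k in (5, 20, 500_000):
    print(f"k = {k:>7}: achieved FWER = {fwer_after_bonferroni(0.05, k):.4f}")
```

Under independence the achieved FWER sits just below .05 for every k, which is Boole's bound at work. With positively correlated tests the achieved rate falls further below α, which is the looseness the first reason refers to.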
Holm's step-down procedure
Holm (1979) offered a uniformly more powerful alternative that still controls FWER at α. Order the p-values from smallest to largest: p(1) ≤ p(2) ≤ … ≤ p(k). Compare p(1) to α/k; if p(1) ≤ α/k, reject that hypothesis and compare p(2) to α/(k − 1); continue until a p-value exceeds its threshold, then stop and retain that null along with all remaining ones. The procedure rejects at least as many hypotheses as Bonferroni and often several more, with no additional assumptions. It should be the default when FWER control is genuinely the goal; there is essentially no reason to prefer raw Bonferroni today.
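The step-down logic can be sketched in a few lines of Python. This is an illustrative implementation under the description above, not code from any particular package, and the p-values in the example are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: returns one boolean per input p-value, in the original order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * k
    for step, idx in enumerate(order):
        # The threshold relaxes from alpha/k to alpha/(k-1), ..., alpha/1.
        if p_values[idx] <= alpha / (k - step):
            reject[idx] = True
        else:
            break  # first failure: retain this null and every later one
    return reject

# Hypothetical p-values. Bonferroni (threshold .01) would reject only the first.
print(holm_reject([0.004, 0.011, 0.02, 0.03, 0.6]))  # → [True, True, False, False, False]
```

In this example the second p-value (.011) misses the Bonferroni threshold of .010 but clears Holm's second-step threshold of .0125, which is the extra power the procedure buys.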
From FWER to FDR
Benjamini and Hochberg (1995) reframed the problem. Instead of asking "what is the probability of any false positive?" they asked "among my discoveries, what fraction are false?" The false discovery rate — the expected proportion of rejected nulls that are actually true — is a more natural target whenever a researcher expects multiple real effects and is willing to tolerate some spurious ones in exchange for greater sensitivity.
Their procedure is elegant. Order the p-values as before. Find the largest i such that p(i) ≤ (i/k) · q, where q is the desired FDR. Reject all hypotheses with rank ≤ i. Under independence (and certain forms of positive dependence), this controls FDR ≤ q.
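A minimal Python sketch of the step-up rule follows. It is written from the description above rather than taken from a library, and the example p-values are hypothetical, chosen to show that a hypothesis can be rejected even when its own p-value misses its own rank threshold.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up: one boolean per input p-value, in the original order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices, smallest p first
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * q:
            largest_rank = rank  # keep the largest rank that passes
    reject = [False] * k
    for idx in order[:largest_rank]:
        reject[idx] = True  # everything at or below that rank is rejected
    return reject

# Hypothetical p-values. Ranks 1-3 miss their own thresholds (.01, .02, .03),
# but rank 4 passes (.039 <= .04), so ranks 1-4 are all rejected.
print(benjamini_hochberg_reject([0.03, 0.035, 0.038, 0.039, 0.9]))
# → [True, True, True, True, False]
```

The contrast with Holm is the direction of travel: Holm stops at the first failure from the top, while BH searches for the last success from the bottom.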
The Benjamini-Hochberg procedure is now the de facto standard in genomics, neuroimaging, and any setting where dozens to millions of tests are conducted simultaneously. Its philosophical appeal is that it scales: doubling the number of tests does not automatically halve the effective threshold, as it does under Bonferroni, because the procedure is sensitive to the distribution of p-values rather than treating each in isolation.
FWER (Holm-Bonferroni) is appropriate when even one false positive is consequential — clinical trial primary endpoints, regulatory decisions, confirmatory tests of a single pre-specified hypothesis. FDR (Benjamini-Hochberg) is appropriate for exploratory or screening contexts — finding candidate biomarkers, flagging brain regions for follow-up, ranking gene-trait associations. The choice is not statistical; it is about what kind of error your downstream decisions can absorb.
The Garden of Forking Paths
The arithmetic above assumes the researcher chose her k tests in advance. In practice, k is shaped by the data. Gelman and Loken (2013) coined "the garden of forking paths" to describe the phenomenon: even an honest researcher, running what feels like a single analysis, has implicitly traversed one branch of a tree whose other branches she would have followed had the data looked different.
"Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data." — Gelman & Loken (2013)
Consider a study on whether warm-temperature priming increases prosocial behavior. The pre-registered analysis is "compare donation amounts between warm-cup and cold-cup conditions." But suppose the data come in. The headline contrast is null. Then the researcher notices:
- The effect is significant among women.
- The effect appears if she excludes three high-leverage outliers.
- The effect emerges if she dichotomizes the dependent variable above the median.
- The effect is robust if she controls for age and self-reported mood.
None of these moves is fraudulent. Each is a sensible response to what the data revealed. But the researcher's analytic path was selected because the data fell as they did. Had the headline contrast been positive, none of those alternatives would have been explored. The reported test is therefore the most favorable of an unobserved set of candidate tests (its p-value is the smallest of that set), and taken at face value it dramatically overstates the evidence.
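A small simulation makes the selection effect concrete. The sketch below generates studies with no true effect, analyses each one along four defensible paths, and compares the false-positive rate of the planned analysis with the rate obtained when the most favorable path is reported. Everything here is an assumption made for illustration: the four paths, the sample size, the 50/50 gender split, and the use of a large-sample z-test in place of a t-test.

```python
import random
from statistics import NormalDist

def z_test_p(a, b):
    """Two-sided large-sample z-test for a difference in means (an approximation)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Four ways to subset the same data; each maps (outcome, is_woman) to keep or drop.
PATHS = {
    "full sample": lambda y, is_woman: True,
    "women only": lambda y, is_woman: is_woman,
    "men only": lambda y, is_woman: not is_woman,
    "drop |y| > 2": lambda y, is_woman: abs(y) <= 2.0,
}

def p_values_for_null_study(rng, n_per_group=100):
    """Simulate one study with no true effect and analyse it along every path."""
    warm = [(rng.gauss(0, 1), rng.random() < 0.5) for _ in range(n_per_group)]
    cold = [(rng.gauss(0, 1), rng.random() < 0.5) for _ in range(n_per_group)]
    return {
        name: z_test_p([y for y, w in warm if keep(y, w)],
                       [y for y, w in cold if keep(y, w)])
        for name, keep in PATHS.items()
    }

def simulate(n_studies=2000, alpha=0.05, seed=1):
    """Return (false-positive rate of planned test, rate when the best path is reported)."""
    rng = random.Random(seed)
    planned = best = 0
    for _ in range(n_studies):
        p = p_values_for_null_study(rng)
        planned += p["full sample"] < alpha
        best += min(p.values()) < alpha
    return planned / n_studies, best / n_studies

planned_rate, best_path_rate = simulate()
print(f"planned analysis only: {planned_rate:.3f}")
print(f"best of four paths:    {best_path_rate:.3f}")
```

The planned analysis should reject at roughly the nominal 5%, while the best-of-four rate is noticeably higher, even though no single simulated researcher ever ran "many tests" in the textbook sense.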
Forking Paths versus P-Hacking
P-hacking is the intentional manipulation of analytic choices — collecting a few more participants until p dips below .05, dropping conditions that don't work, switching outcome measures — in pursuit of a publishable result. It is conscious and motivated. Forking paths is the unintentional, often unconscious version: a researcher who would have made different defensible choices had the data looked different. The statistical consequence is identical — the reported p-value reflects a selection process — but the moral framing differs profoundly. Most practitioners are not p-hackers. Almost all are forkers.
This distinction matters because the remedies differ. P-hacking is curbed by norms, training, and professional sanctions. Forking paths is curbed only by procedural commitments made before seeing the data. No amount of integrity protects you from a path you never knew you might have taken.
Researcher Degrees of Freedom
Simmons, Nelson, and Simonsohn (2011), in the article that catalyzed psychology's credibility reckoning, demonstrated through Monte Carlo simulation that four common — and individually defensible — researcher degrees of freedom inflate the false-positive rate from the nominal 5% to over 60%. The four were:
- Optional stopping. Collecting data, peeking, and continuing if results are not yet significant.
- Flexible covariates. Reporting whichever covariate-adjusted analysis "works."
- Outcome flexibility. Choosing among several measured dependent variables post hoc.
- Condition flexibility. Dropping experimental conditions that produce inconvenient patterns.
Their famous demonstration showed that, by exploiting flexibility of this kind, an analysis could "show" that listening to the Beatles' "When I'm Sixty-Four" made participants about 1.5 years younger. The point was not that the finding was real but that the analytic pipeline could manufacture comparable absurdities at will.
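The first of these flexibilities, optional stopping, is simple to simulate. The sketch below is not a reproduction of the Simmons, Nelson, and Simonsohn simulations; it isolates one degree of freedom under simplifying assumptions of my own (normal data with known unit variance, a z-test, and looks at 20, 30, 40, and 50 participants per group).

```python
import random
from statistics import NormalDist

def z_test_p(sum_a, sum_b, n):
    """Two-sided z-test for equal means, assuming known unit variance in both groups."""
    z = (sum_a / n - sum_b / n) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def study_is_significant(rng, looks, alpha=0.05):
    """Collect data up to each look; stop and declare success as soon as p < alpha."""
    sum_a = sum_b = 0.0
    n = 0
    for target in looks:
        while n < target:
            sum_a += rng.gauss(0, 1)  # both groups come from the same distribution,
            sum_b += rng.gauss(0, 1)  # so every "success" is a false positive
            n += 1
        if z_test_p(sum_a, sum_b, n) < alpha:
            return True
    return False

def false_positive_rate(looks, n_studies=5000, seed=2):
    rng = random.Random(seed)
    return sum(study_is_significant(rng, looks) for _ in range(n_studies)) / n_studies

fixed_rate = false_positive_rate(looks=[50])
peeking_rate = false_positive_rate(looks=[20, 30, 40, 50])
print(f"fixed n = 50 per group:   {fixed_rate:.3f}")
print(f"peeking at 20/30/40/50:   {peeking_rate:.3f}")
```

With a single look the rate should sit near 5%; with four looks it rises well above that. Stacking the other three flexibilities on top is what drives the rate toward the figure reported in the article.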
When to Correct — and When Not To
Multiple-comparisons correction is not a moral imperative; it is a tool fitted to a specific inferential goal. There are well-defined situations in which correction is inappropriate:
- Pre-registered, theoretically distinct hypotheses. If you predicted three independent effects in advance and each is a separate confirmatory test of a separate theoretical claim, correcting across them confuses "family" with "study." Each effect stands or falls on its own merits.
- Bayesian analyses with proper priors. Posterior probabilities already incorporate prior plausibility; an additional frequentist correction is conceptually misplaced.
- Descriptive or exploratory work explicitly labelled as such. Correction implies confirmatory inference. If the goal is hypothesis generation, the correct move is honest labelling, not threshold adjustment.
Correction is warranted when:
- You are testing many comparisons within a single coherent inferential family (e.g., post-hoc pairwise comparisons after an omnibus ANOVA).
- You are screening — brain voxels, SNPs, candidate predictors — with the intention of treating survivors as discoveries.
- You explored the data before settling on a test, and the reported test was selected from among alternatives.
Pre-Registration: The Procedural Solution
The forking-paths framing implies that no post-hoc correction can fully remedy unintentional multiplicity, because the relevant k — the set of analyses the researcher would have run under counterfactual data — is unobservable. The only structural fix is to commit, in writing, to the analytic plan before seeing the data. Pre-registration converts what would have been forking paths into a single specified path; deviations are visible as deviations rather than rebranded as "the analysis."
Pre-registration does not prevent exploration. It separates exploration from confirmation. A pre-registered confirmatory test means what its p-value advertises; a post-hoc exploratory finding can still be reported, but as the hypothesis-generating observation it actually is, with no claim to the false-positive guarantees that confirmatory inference provides.
Multiverse Analysis: When You Cannot Pre-Register
For analyses of existing datasets, where pre-registration is impossible or only partial, Steegen and colleagues (2016) proposed the multiverse analysis. Rather than choosing one defensible specification, run all of them. Plot the distribution of effect sizes and p-values across the full grid. The resulting specification curve reveals how robust a finding is to the analytical choices the researcher could plausibly have made.
A multiverse does not eliminate forking paths; it makes them visible. If 95% of specifications yield the same conclusion, the finding is robust to analyst choice. If 30% do, the headline result is a fragile artifact of a particular path through the garden. Either way, the reader can judge.
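A toy multiverse can be sketched with nothing more than a grid of choices. The example below crosses two transformations with three outlier rules on synthetic data and reports the standardized difference and p-value for each specification. The data, the choice set, the per-group trimming, and the large-sample z-test are all assumptions for illustration; a real multiverse would enumerate the choices that are defensible for the study at hand.

```python
import itertools
import math
import random
from statistics import NormalDist

rng = random.Random(3)
# Synthetic, strictly positive outcomes with a modest built-in effect (hypothetical data).
treated = [math.exp(rng.gauss(0.3, 1.0)) for _ in range(80)]
control = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(80)]

TRANSFORMS = {"raw": lambda y: y, "log": math.log}
OUTLIER_CUTOFFS_SD = {"keep all": None, "drop beyond 3 SD": 3.0, "drop beyond 2 SD": 2.0}

def mean_sd(ys):
    m = sum(ys) / len(ys)
    return m, (sum((y - m) ** 2 for y in ys) / (len(ys) - 1)) ** 0.5

def trim(ys, cutoff):
    """Drop values more than `cutoff` SDs from the group mean (no-op if cutoff is None)."""
    if cutoff is None:
        return ys
    m, sd = mean_sd(ys)
    return [y for y in ys if abs(y - m) <= cutoff * sd]

def run_specification(transform, cutoff):
    """Return (standardized mean difference, two-sided p) for one analytic path."""
    a = trim([transform(y) for y in treated], cutoff)
    b = trim([transform(y) for y in control], cutoff)
    (ma, sa), (mb, sb) = mean_sd(a), mean_sd(b)
    z = (ma - mb) / (sa ** 2 / len(a) + sb ** 2 / len(b)) ** 0.5
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    d = (ma - mb) / ((sa ** 2 + sb ** 2) / 2) ** 0.5
    return d, p

results = []
for (t_name, transform), (o_name, cutoff) in itertools.product(
        TRANSFORMS.items(), OUTLIER_CUTOFFS_SD.items()):
    d, p = run_specification(transform, cutoff)
    results.append({"transform": t_name, "outliers": o_name, "d": d, "p": p})

results.sort(key=lambda r: r["d"])  # ordering by effect size gives the curve its shape
for r in results:
    print(f"{r['transform']:>3} | {r['outliers']:<17} | d = {r['d']:+.2f} | p = {r['p']:.3f}")

share_significant = sum(r["p"] < 0.05 for r in results) / len(results)
print(f"share of specifications with p < .05: {share_significant:.0%}")
```

The share printed at the end is the quantity the paragraph above asks the reader to weigh: a high share suggests robustness to analyst choice, a low share suggests a path-dependent result.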
A Worked Example
Suppose a developmental psychologist tests whether a literacy intervention improves reading scores. She has measured five outcomes (vocabulary, comprehension, fluency, decoding, spelling). She runs five t-tests and finds p-values of .008, .021, .039, .118, and .402.
Uncorrected, three results are "significant." But under Bonferroni at α = .05, only p ≤ .05/5 = .010 survives, so just vocabulary clears the bar. Under Holm-Bonferroni, she compares .008 to .05/5 = .010 (passes), then .021 to .05/4 = .0125 (fails), and stops: the same conclusion as Bonferroni in this case. Under Benjamini-Hochberg at FDR q = .05, she finds the largest i with p(i) ≤ (i/5) · .05, working down from the largest rank. i = 5: .402 ≤ .050? No. i = 4: .118 ≤ .040? No. i = 3: .039 ≤ .030? No. i = 2: .021 ≤ .020? No. i = 1: .008 ≤ .010? Yes. So BH also rejects only the vocabulary null. Across all three procedures, the headline conclusion narrows to a single robust finding rather than three putative ones.
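The arithmetic in this example can be replayed directly. The short Python script below applies the three procedures to the five p-values from the text; the variable names are mine, and the logic is a compact restatement of the rules as described in this essay.

```python
p_values = {
    "vocabulary": 0.008,
    "comprehension": 0.021,
    "fluency": 0.039,
    "decoding": 0.118,
    "spelling": 0.402,
}
alpha = 0.05
k = len(p_values)
ranked = sorted(p_values.items(), key=lambda item: item[1])  # smallest p first

# Bonferroni: a single threshold alpha / k for every test.
bonferroni = {name for name, p in ranked if p <= alpha / k}

# Holm: step down from the smallest p, stop at the first failure.
holm = set()
for step, (name, p) in enumerate(ranked):
    if p > alpha / (k - step):
        break
    holm.add(name)

# Benjamini-Hochberg: find the largest rank i with p(i) <= (i / k) * q, reject ranks 1..i.
largest_rank = max(
    (rank for rank, (_, p) in enumerate(ranked, start=1) if p <= rank / k * alpha),
    default=0,
)
bh = {name for name, _ in ranked[:largest_rank]}

print("Bonferroni:", bonferroni)  # → {'vocabulary'}
print("Holm:      ", holm)        # → {'vocabulary'}
print("BH:        ", bh)          # → {'vocabulary'}
```

Agreement across all three is a property of these particular p-values, not a general rule; with more small p-values the procedures typically diverge, with BH rejecting the most.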
Now suppose the same psychologist had also informally tried log-transforming her outcomes, dropping two children with attendance below 80%, and adding pretest scores as a covariate — each switch defensible, none reported. Her effective k is no longer 5 but something closer to 5 × 2 × 2 × 2 = 40. Under that implicit family, even her vocabulary p = .008 looks suspect: the Bonferroni-equivalent threshold is .00125. The forking-paths inflation reaches deep enough to threaten conclusions that survived the explicit correction.
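The count of implicit analyses is just the size of the grid of choices. A brief sketch, using labels of my own for the options named above:

```python
import itertools

outcomes = ["vocabulary", "comprehension", "fluency", "decoding", "spelling"]
transforms = ["raw", "log"]
exclusions = ["all children", "attendance >= 80% only"]
covariates = ["none", "pretest score"]

paths = list(itertools.product(outcomes, transforms, exclusions, covariates))
effective_k = len(paths)
threshold = 0.05 / effective_k

print(effective_k)                 # → 40
print(round(threshold, 5))         # → 0.00125
print(0.008 <= threshold)          # → False
```

Treating the 40 paths as a Bonferroni family is a rough heuristic rather than an exact correction: the paths reuse the same data and are strongly correlated, so the true inflation is smaller than 40 independent tests would imply, but it is not zero either.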
Try It in PsyStat Nexus
PsyStat Nexus operationalizes both halves of this problem. The Multiverse Analysis module lets you specify the analytical degrees of freedom in your study — outlier rules, covariate sets, transformation choices — and runs every combination, returning the full specification curve so you can see how sensitive your conclusions are to the path you would otherwise have chosen silently. The Convergent Core Analysis module then extracts the robust invariant: the claim that holds across nearly all specifications, with explicit boundary conditions where it does not.
Together they offer a workflow for taking forking paths seriously without abandoning analysis altogether.
References
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
- Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, 3–62.
- Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52–64.
- Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
- Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
- Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.