When to Use Bayesian Methods (And When Not To)
Bayesian methods have moved from the methodological fringe to the mainstream of psychology in the past fifteen years. Tools like JASP and the BayesFactor R package have made what was once a niche specialty available to any researcher willing to install a free program. Major journals now publish Bayesian analyses without flinching. Influential works (Wagenmakers et al., 2018; Kruschke, 2015) argue that Bayesian inference solves problems that frequentist statistics cannot.
And yet most published psychology is still frequentist. Most reviewers still think in p-values. Most pre-registration templates ask for power analyses, not prior specifications. The gap between the methodological literature and everyday practice remains wide — and it isn't entirely a failure of education. There are real reasons researchers stay frequentist, and real situations where they shouldn't.
This post is an attempt to give an honest, scholarly account of when Bayesian methods earn their keep, and when they don't.
A Brief Bayesian Primer
The mathematical core of Bayesian inference is one of the simplest equations in statistics:
posterior ∝ likelihood × prior
Three pieces, three jobs:
- Prior. What you believed about the parameter before seeing the data, expressed as a probability distribution. A prior on Cohen's d might say "I expect effects between −1 and 1, with most mass near zero."
- Likelihood. How probable the observed data are under each possible parameter value. This is the same likelihood frequentists use; it isn't uniquely Bayesian.
- Posterior. The updated belief about the parameter after seeing the data. The posterior is what you report — typically as a point estimate (the posterior mean or median) and a credible interval (e.g., the 95% highest-density interval).
The interpretive payoff is large. A 95% credible interval means exactly what most students mistakenly think a 95% confidence interval means: given the data and the prior, there is a 95% probability the parameter lies in this range. A frequentist confidence interval is a statement about the long-run behavior of the procedure, not the parameter itself — a distinction that matters mathematically but is almost never honored in practice.
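To make the three pieces concrete, here is a minimal sketch that applies posterior ∝ likelihood × prior on a grid of candidate effect sizes. The numbers are assumptions chosen for illustration, not taken from any study: an observed d of 0.4 with standard error 0.2, a normal likelihood, and a normal prior centered on zero with standard deviation 0.5. The interval reported is equal-tailed, which coincides with the highest-density interval only because this posterior is symmetric.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def grid_posterior(d_obs, se, prior_sd, lo=-3.0, hi=3.0, steps=6001):
    """Approximate the posterior for an effect size on an evenly spaced grid."""
    width = (hi - lo) / (steps - 1)
    grid = [lo + i * width for i in range(steps)]
    # Posterior is proportional to likelihood times prior at every grid point.
    unnormalized = [
        normal_pdf(d_obs, d, se) * normal_pdf(d, 0.0, prior_sd) for d in grid
    ]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


def summarize(grid, weights, mass=0.95):
    """Posterior mean and equal-tailed credible interval from grid weights."""
    mean = sum(g * w for g, w in zip(grid, weights))
    tail = (1.0 - mass) / 2.0
    cumulative, lower, upper = 0.0, None, None
    for g, w in zip(grid, weights):
        cumulative += w
        if lower is None and cumulative >= tail:
            lower = g
        if upper is None and cumulative >= 1.0 - tail:
            upper = g
            break
    return mean, lower, upper


# Illustrative inputs (assumed): observed d = 0.4, SE = 0.2, prior sd = 0.5.
grid, weights = grid_posterior(d_obs=0.4, se=0.2, prior_sd=0.5)
post_mean, ci_low, ci_high = summarize(grid, weights)
print(f"posterior mean = {post_mean:.3f}, 95% interval = [{ci_low:.3f}, {ci_high:.3f}]")
```

The posterior mean lands between the prior's center (0) and the observed estimate (0.4), closer to the data because the data are more precise than the prior. Shrinking prior_sd pulls it further toward zero.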
Bayes Factors vs. p-values
For hypothesis testing, the Bayesian alternative to the p-value is the Bayes factor. A Bayes factor BF10 is a ratio of marginal likelihoods: how much more probable the observed data are under the alternative hypothesis (H1), averaged over its prior, than under the null (H0).
A worked example. Suppose you run a small two-group experiment and obtain BF10 = 3. The interpretation is direct:
The data are three times more likely under H1 than under H0. If you started with equal odds for the two hypotheses, you should now favor H1 by 3:1.
By the widely cited heuristic scale derived from Jeffreys (1961) (the verbal labels below follow later adaptations of his original wording), BF10 between 1 and 3 is "anecdotal" evidence, 3 to 10 is "moderate," 10 to 30 is "strong," and above 30 is "very strong." A BF10 of 3 sits right on the boundary between anecdotal and moderate: real but unimpressive evidence, nothing like the rhetorical force of "p < .05."
Crucially, BF01 = 1/BF10 quantifies evidence for the null. A BF01 of 8 means the data are 8 times more likely under H0 than under H1 — a result frequentist methods structurally cannot deliver. A non-significant p-value is famously not evidence for the null; it is merely failure to reject it. Bayes factors fix this asymmetry.
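The step from a Bayes factor to a belief is just posterior odds = Bayes factor × prior odds. A small sketch of that arithmetic follows; the function name and the skeptical prior of 0.1 are illustrative assumptions.

```python
def posterior_prob_h1(bf10, prior_prob_h1=0.5):
    """Convert a Bayes factor and a prior probability for H1 into P(H1 | data)."""
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    posterior_odds = bf10 * prior_odds  # posterior odds = BF10 x prior odds
    return posterior_odds / (1.0 + posterior_odds)


# BF10 = 3 with equal prior odds: 3:1 in favor of H1.
p_equal = posterior_prob_h1(3.0)

# The same evidence seen by a skeptic who gave H1 only a 10% prior probability.
p_skeptic = posterior_prob_h1(3.0, prior_prob_h1=0.1)

# BF01 = 8 is BF10 = 1/8; the probability of H0 is the complement.
p_h0_after_bf01_8 = 1.0 - posterior_prob_h1(1.0 / 8.0)

print(p_equal, p_skeptic, p_h0_after_bf01_8)
```

With equal prior odds, BF10 = 3 corresponds to a posterior probability of 0.75 for H1; for the skeptic, the same data only raise it to 0.25. The Bayes factor describes how much the data should move you, not where you end up.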
The Real Advantages
1. Quantifying evidence for the null
This is the single most underappreciated reason to go Bayesian. Psychology has a long history of pretending that p = .42 is uninformative when it is often quite informative. Rouder et al. (2009) developed default Bayes factors for t-tests precisely so researchers could report "BF01 = 12, strong evidence the means are equivalent" rather than the evasive "no significant difference was found."
For null-result papers, replication studies, equivalence claims, and any context where "the effect is small or absent" is itself the finding, Bayes factors are simply the right tool.
2. Sequential analysis and optional stopping
Frequentist p-values are not robust to checking your data as you collect it. Each peek inflates the Type I error rate; researchers either pre-commit to a sample size or pay for sequential designs with corrections like alpha spending.
Bayes factors are different. Because they are likelihood ratios rather than tail probabilities, they can in principle be monitored continuously, and you can stop when evidence reaches a pre-specified threshold (e.g., BF10 > 10 or BF01 > 10). This is sometimes overstated — optional stopping based on Bayes factors does still bias the distribution of obtained Bayes factors and requires honest priors — but the practical flexibility is real and substantial.
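A sequential design of this kind can be sketched as a simulation. The example below is a simplification under stated assumptions, not the default Bayes factor from JASP or the BayesFactor package: it treats observations as normal with known standard deviation 1, tests H0: effect = 0 against H1: effect ~ Normal(0, prior_sd²), and checks the Bayes factor after every new observation once a minimum n is reached. All names and thresholds are illustrative.

```python
import math
import random


def bf10_known_sigma(sample_mean, n, sigma=1.0, prior_sd=0.5):
    """BF10 for a normal mean with known sigma and a Normal(0, prior_sd^2) prior under H1."""
    var0 = sigma ** 2 / n          # variance of the sample mean under H0
    var1 = var0 + prior_sd ** 2    # marginal variance of the sample mean under H1
    log_bf10 = 0.5 * math.log(var0 / var1) + 0.5 * sample_mean ** 2 * (1.0 / var0 - 1.0 / var1)
    return math.exp(log_bf10)


def run_sequential(true_effect, threshold=10.0, min_n=10, max_n=500, seed=1):
    """Collect one observation at a time and stop once the evidence threshold is crossed."""
    rng = random.Random(seed)
    total, bf10 = 0.0, 1.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_effect, 1.0)
        if n < min_n:
            continue
        bf10 = bf10_known_sigma(total / n, n)
        if bf10 >= threshold:
            return n, bf10, "evidence for H1"
        if bf10 <= 1.0 / threshold:
            return n, bf10, "evidence for H0"
    return max_n, bf10, "max n reached"


n_used, final_bf10, decision = run_sequential(true_effect=0.5)
print(n_used, round(final_bf10, 2), decision)
```

Note the maximum sample size: under this model, evidence for the null accumulates slowly, so a true null can take hundreds of observations to reach BF01 > 10, and a cap is a practical necessity rather than an afterthought.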
3. Accumulating prior evidence across studies
If twenty previous studies have established that an intervention has roughly d = 0.3, frequentist methods give you no principled way to use that information in your twenty-first study. You either ignore it or fold it into a separate meta-analysis. Bayesian methods let you encode prior research as — literally — the prior. The posterior then reflects the cumulative scientific picture, not just your single sample.
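As a sketch of what "the prior is the prior research" looks like numerically, the conjugate normal update below combines a prior summarizing earlier studies with a new estimate. The specific numbers are assumptions for illustration: a prior of d = 0.3 with standard deviation 0.05 (standing in for the pooled literature) and a new study reporting d = 0.1 with standard error 0.2.

```python
import math


def normal_update(prior_mean, prior_sd, estimate, se):
    """Conjugate normal-normal update: a precision-weighted average of prior and data."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / se ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * estimate) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)


# Assumed inputs: literature-based prior d = 0.3 (sd 0.05); new study d = 0.1 (SE 0.2).
post_mean, post_sd = normal_update(prior_mean=0.3, prior_sd=0.05, estimate=0.1, se=0.2)
print(f"posterior: d = {post_mean:.3f} (sd {post_sd:.3f})")
```

Because the prior is far more precise than the single new study, the posterior barely moves from 0.3. That is the intended behavior when the prior is trustworthy, and the main risk when it is not (for example, if the earlier literature is inflated by publication bias).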
4. More interpretable intervals
As noted above, credible intervals say what people want them to say. For applied audiences — clinicians, policymakers, journalists — "there is a 95% probability the true effect is between 0.21 and 0.44" is comprehensible in a way no frequentist statement is.
The Real Disadvantages
1. Prior specification is genuinely hard
The most honest Bayesian critique of Bayesianism is that priors are subjective and consequential. Default priors (Cauchy, Jeffreys-Zellner-Siow, etc.) are conventions designed to be reasonable across many situations, but they are not neutral. A wider prior makes BF01 larger by spreading the alternative's probability mass over implausible effects, mechanically punishing H1. Narrower priors do the opposite.
This is not a fatal problem. It does, however, mean that prior sensitivity analysis is mandatory, not optional. Reporting a Bayes factor without showing how it varies across reasonable priors is the Bayesian equivalent of p-hacking.
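A sensitivity analysis can be as simple as recomputing the Bayes factor over a range of prior widths. The sketch below uses a deliberately simplified model, which is an assumption and not the Cauchy-based default: a normal likelihood for the observed effect with known standard error, H0: δ = 0, and H1: δ ~ Normal(0, prior_sd²). The observed effect, standard error, and prior widths are illustrative.

```python
import math


def bf01_normal_prior(d_obs, se, prior_sd):
    """BF01 for H0: delta = 0 versus H1: delta ~ Normal(0, prior_sd^2)."""
    var0 = se ** 2                    # predictive variance of d_obs under H0
    var1 = se ** 2 + prior_sd ** 2    # predictive variance of d_obs under H1
    log_bf01 = 0.5 * math.log(var1 / var0) - 0.5 * d_obs ** 2 * (1.0 / var0 - 1.0 / var1)
    return math.exp(log_bf01)


def sensitivity_table(d_obs, se, prior_sds=(0.2, 0.5, 1.0, 2.0)):
    """BF01 recomputed under each candidate prior width."""
    return [(sd, bf01_normal_prior(d_obs, se, sd)) for sd in prior_sds]


# Assumed data: a near-zero observed effect (d = 0.05) with SE = 0.1.
table = sensitivity_table(d_obs=0.05, se=0.1)
for prior_sd, bf01 in table:
    print(f"prior sd = {prior_sd:<4} BF01 = {bf01:.2f}")
```

For these inputs BF01 climbs steadily as the prior widens, moving from roughly 2 to well above 10 without a single data point changing. Reporting the whole table, rather than the most flattering row, is what separates a sensitivity analysis from prior shopping.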
2. Computational cost
Closed-form or near-closed-form Bayesian solutions exist for simple cases (t-tests, ANOVAs, linear regression with conjugate priors). For anything complicated (multilevel models, structural equation models, custom likelihoods) you need Markov chain Monte Carlo (MCMC). MCMC draws correlated samples from the posterior by simulating a Markov chain; modern variants like Hamiltonian Monte Carlo (the engine behind Stan) use gradient information to avoid the slow random-walk behavior of the Metropolis-Hastings samplers of the 1990s, but they still require:
- Convergence diagnostics (R-hat < 1.01, effective sample size > 400, trace plot inspection).
- Hours to days of compute for large hierarchical models.
- A non-trivial amount of statistical literacy to debug when chains misbehave.
For a 2 × 2 ANOVA, MCMC is overkill. For a six-level hierarchical model with crossed random effects, it may be the only honest option — but it is not a quick analysis.
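To show what the R-hat diagnostic actually measures, here is a minimal sketch of the classic split R-hat: each chain is cut in half, and the variance between the pieces is compared with the variance within them. This is the simpler textbook version, written out as an assumption-laden illustration; current Stan releases report a rank-normalized refinement of it. The simulated chains are made up for the demonstration.

```python
import math
import random


def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def split_rhat(chains):
    """Split R-hat for a list of equal-length chains of posterior draws."""
    half = len(chains[0]) // 2
    # Split each chain so that drift within a chain shows up as disagreement between pieces.
    pieces = []
    for chain in chains:
        pieces.append(chain[:half])
        pieces.append(chain[half:2 * half])
    means = [sum(p) / half for p in pieces]
    within = sum(variance(p) for p in pieces) / len(pieces)
    between = half * variance(means)
    pooled = (half - 1) / half * within + between / half
    return math.sqrt(pooled / within)


rng = random.Random(0)
# Four well-mixed chains exploring the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains where two are stuck in a different region (centers 0 and 3).
stuck = [[rng.gauss(center, 1.0) for _ in range(1000)] for center in (0.0, 0.0, 3.0, 3.0)]

rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
print(f"mixed chains: {rhat_mixed:.3f}, stuck chains: {rhat_stuck:.3f}")
```

Values near 1 mean the chains agree with each other; values well above 1 mean at least one chain is exploring a different region, and posterior summaries from that fit should not be trusted.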
3. Communication friction
Most reviewers were trained on frequentist statistics. Many will ask for a p-value alongside your Bayes factor, or politely express skepticism about your priors, or simply not engage with the analysis at all. Editors at certain journals still treat Bayesian methods as exotic. In regulated contexts (FDA submissions, formal clinical trial reporting), frequentist procedures are often outright required.
This is a sociological problem, not a statistical one, but it is real. Choosing Bayesian methods in the wrong venue can mean choosing not to be read.
The Lindley Paradox
Lindley (1957) pointed out a striking phenomenon: it is possible for a result to yield p < .05 (rejecting the null) while simultaneously yielding a Bayes factor that strongly favors the null. The two paradigms can give opposite answers from the same data.
The resolution is that they are answering different questions. A p-value asks: "How surprising are these data if the null is exactly true?" A Bayes factor asks: "How well does the null predict these data, compared to the alternative averaged over its prior?" When the sample is very large and the effect is very small, the data can be surprising under H0 (giving a small p) and yet still be better predicted by the point null H0 than by a diffuse H1 that wasted prior probability on large effects that didn't materialize.
The Lindley paradox is not a bug; it is a feature that exposes a hidden assumption. A p-value is computed under H0 alone and never asks how well any specific alternative predicts the data; a Bayes factor must, so it forces you to say which effect sizes H1 considers plausible. Which weighting is correct depends on what you actually believe before the study, which is exactly what makes the prior non-trivial.
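The paradox is easy to reproduce numerically. The sketch below holds the test statistic fixed at z = 1.96 (so p stays at about .05) and lets the sample size grow. It assumes a simplified setting rather than a default Bayes factor: normal data with known standard deviation 1, H0: effect = 0, and H1: effect ~ Normal(0, 1). The function names and the prior width are illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bf01_at_fixed_z(z, n, sigma=1.0, prior_sd=1.0):
    """BF01 when the sample mean sits exactly z standard errors from zero."""
    se2 = sigma ** 2 / n
    spread = prior_sd ** 2 / se2                 # how diffuse H1 is relative to the data
    shrink = prior_sd ** 2 / (se2 + prior_sd ** 2)
    return math.sqrt(1.0 + spread) * math.exp(-0.5 * z * z * shrink)


z = 1.96
p_value = two_sided_p(z)
results = [(n, bf01_at_fixed_z(z, n)) for n in (100, 10_000, 1_000_000)]
print(f"p = {p_value:.3f} at every sample size")
for n, bf01 in results:
    print(f"n = {n:>9,}  BF01 = {bf01:.1f}")
```

At n = 100 the Bayes factor is close to 1 and says little either way; by n = 10,000 the same "significant" z yields BF01 above 10. A fixed z at a growing n implies an ever smaller observed effect, which the point null predicts better than a diffuse alternative does.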
When to Prefer Bayesian Methods
- Small samples. Frequentist asymptotic justifications are weakest where Bayesian methods (with informative priors) are strongest.
- Optional stopping. If you genuinely don't know your final n, Bayes factors handle this more gracefully than corrected sequential frequentist designs.
- Evidence for the null. Equivalence claims, replication failures, "the manipulation didn't work" findings — all of these benefit enormously from BF01.
- Model comparison. Bayes factors, information criteria such as WAIC, and leave-one-out cross-validation (LOO) handle non-nested models more cleanly than likelihood ratio tests.
- Strong prior information. When previous studies, theory, or physical constraints meaningfully inform the parameter, ignoring that information is wasteful.
- Hierarchical models with sparse cells. Partial pooling via Bayesian shrinkage handles small cell sizes more gracefully than maximum-likelihood mixed models.
When to Stick with Frequentist
- Large pre-registered studies. If you have plenty of data and a clean design, the frequentist analysis will give nearly the same answer with less methodological overhead.
- Regulatory contexts. Clinical trial submissions, official statistics, and many funding-body reports require specific frequentist procedures.
- Journal expectations. Some subfields are still firmly frequentist. Battling reviewers over your prior choice when the result would be the same under either framework is a poor use of time.
- Routine descriptive comparisons. If you just want to know whether two means differ, a t-test communicates faster than a posterior plot.
- Truly diffuse priors. If you have no prior information and no defensible reason to choose any particular default, frequentist procedures are at least transparent about what they assume.
Don't choose a paradigm; choose an analysis. The most defensible practice in modern psychology is to report both — p-values, effect sizes, and confidence intervals alongside Bayes factors and credible intervals — and let readers see whether the conclusions align. When they do, you've gained inferential robustness. When they don't (the Lindley scenario), you've found something genuinely interesting about your data.
A Word on MCMC
For anyone moving past default-prior t-tests into substantive Bayesian modeling, MCMC is the entry ticket. Its honest pros and cons:
Pros. MCMC fits essentially any model you can write down. It naturally produces full posterior distributions for every parameter, with uncertainty quantification baked in. Modern tools (Stan, PyMC, brms) make specification surprisingly tractable, and convergence diagnostics are well established.
Cons. It is slow, sometimes very slow. Convergence problems can be subtle and require statistical intuition to diagnose. Reparameterization tricks (non-centered parameterizations, etc.) are sometimes necessary to get chains to mix. And the posterior depends on the prior in ways that may surprise users who haven't done sensitivity analysis.
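For readers who have never seen a sampler from the inside, here is a minimal random-walk Metropolis sketch. It is the simplest member of the MCMC family, not what Stan, PyMC, or brms run (those rely on Hamiltonian Monte Carlo). The model is an assumption for illustration: an observed d of 0.4 with standard error 0.2, a normal likelihood, and a Cauchy prior on the effect with scale 0.707, a combination with no closed-form posterior.

```python
import math
import random


def log_posterior(d, d_obs=0.4, se=0.2, prior_scale=0.707):
    """Unnormalized log posterior: normal likelihood plus Cauchy(0, prior_scale) prior."""
    log_likelihood = -0.5 * ((d_obs - d) / se) ** 2
    log_prior = -math.log(1.0 + (d / prior_scale) ** 2)
    return log_likelihood + log_prior


def metropolis(n_draws=20000, step=0.3, start=0.0, seed=42):
    """Random-walk Metropolis: propose a nearby value, accept with probability min(1, ratio)."""
    rng = random.Random(seed)
    current, current_lp = start, log_posterior(start)
    draws, accepted = [], 0
    for _ in range(n_draws):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal)
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        draws.append(current)  # a rejected proposal repeats the current value
    return draws, accepted / n_draws


draws, acceptance_rate = metropolis()
kept = draws[2000:]  # discard early draws as warm-up
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean = {posterior_mean:.3f}, acceptance rate = {acceptance_rate:.2f}")
```

The step size is the tuning knob: too small and the chain crawls, too large and nearly every proposal is rejected. Much of the engineering in modern samplers goes into choosing such settings automatically.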
Recommended Reading
The papers and books worth reading first, in roughly increasing technical depth:
- Wagenmakers et al. (2018), "Bayesian inference for psychology, Part I." A clear, opinionated introduction targeted at psychologists. Probably the best on-ramp.
- Rouder et al. (2009), "Bayesian t-tests for accepting and rejecting the null hypothesis." The technical foundation of the default Bayes factor scale used in JASP and the BayesFactor R package.
- Kruschke (2015), Doing Bayesian Data Analysis (2nd ed.). A comprehensive, application-oriented textbook. Heavy on examples, light on measure theory. The standard graduate-course text.
- Lindley (1957), "A statistical paradox." The original two-page note. Worth reading in its original form.
Try It in PsyStat Nexus
The Bayesian Lab module computes default Bayes factors for t-tests, correlations, and ANOVAs, and produces credible intervals with adjustable priors so you can run sensitivity analyses without leaving the browser. The Dual Report module runs the same data through both frequentist and Bayesian pipelines side by side — ideal when you want to honor reviewer expectations on both fronts, or when you suspect a Lindley-paradox situation in your data.