When to Use Bayesian Methods (And When Not To)

By Moonlit Social Labs · April 16, 2026 · 13 min read

Bayesian methods have moved from the methodological fringe to the mainstream of psychology in the past fifteen years. Software like JASP and the BayesFactor R package have made what was once a niche specialty available to any researcher willing to install a free program. Major journals now publish Bayesian analyses without flinching. Influential papers (Wagenmakers et al., 2018; Kruschke, 2015) argue that Bayesian inference solves problems that frequentist statistics cannot.

And yet most published psychology is still frequentist. Most reviewers still think in p-values. Most pre-registration templates ask for power analyses, not prior specifications. The gap between the methodological literature and everyday practice remains wide — and it isn't entirely a failure of education. There are real reasons researchers stay frequentist, and real situations where they shouldn't.

This post is an attempt to give an honest, scholarly account of when Bayesian methods earn their keep, and when they don't.

A Brief Bayesian Primer

The mathematical core of Bayesian inference is one of the simplest equations in statistics:

posterior ∝ likelihood × prior

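Three pieces, three jobs: the prior encodes what was believed about the parameter before seeing the data, the likelihood says how probable the observed data are under each candidate parameter value, and the posterior is the updated belief that results from combining the two.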

The interpretive payoff is large. A 95% credible interval means exactly what most students mistakenly think a 95% confidence interval means: given the data and the prior, there is a 95% probability the parameter lies in this range. A frequentist confidence interval is a statement about the long-run behavior of the procedure, not the parameter itself — a distinction that matters mathematically but is almost never honored in practice.
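To make the primer concrete, here is a minimal sketch of posterior ∝ likelihood × prior on a grid, followed by a central 95% credible interval. The data (7 successes in 10 trials), the flat prior, and the function names are all hypothetical choices for illustration, not taken from any particular package.

```python
# Grid approximation of posterior ∝ likelihood × prior for a proportion.
# The data and the flat prior below are hypothetical illustration values.

def grid_posterior(successes, trials, prior, grid_size=10001):
    """Return (grid, normalized posterior weights) for a binomial proportion."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalized posterior: likelihood times prior at each grid point.
    unnorm = [
        (t ** successes) * ((1 - t) ** (trials - successes)) * prior(t)
        for t in grid
    ]
    total = sum(unnorm)
    return grid, [u / total for u in unnorm]


def central_interval(grid, weights, mass=0.95):
    """Central credible interval read off the cumulative posterior."""
    lo_p, hi_p = (1 - mass) / 2, 1 - (1 - mass) / 2
    cum, lower, upper = 0.0, None, None
    for t, w in zip(grid, weights):
        cum += w
        if lower is None and cum >= lo_p:
            lower = t
        if upper is None and cum >= hi_p:
            upper = t
            break
    return lower, upper


def flat_prior(theta):
    """Assumed prior: every value of theta equally plausible beforehand."""
    return 1.0


grid, post = grid_posterior(successes=7, trials=10, prior=flat_prior)
post_mean = sum(t * w for t, w in zip(grid, post))
low, high = central_interval(grid, post)
print(f"posterior mean = {post_mean:.3f}, 95% credible interval = [{low:.3f}, {high:.3f}]")
```

Swapping `flat_prior` for a function that concentrates mass near some value is the quickest way to see how much the interval depends on the prior when the sample is this small.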

Bayes Factors vs. p-values

For hypothesis testing, the Bayesian alternative to the p-value is the Bayes factor. A Bayes factor BF10 is a ratio of marginal likelihoods: how much more probable the observed data are under the alternative hypothesis (H1), averaged over its prior, than under the null (H0).

A worked example. Suppose you run a small two-group experiment and obtain BF10 = 3. The interpretation is direct:

The data are three times more likely under H1 than under H0. If you started with equal odds for the two hypotheses, you should now favor H1 by 3:1.
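The arithmetic behind that statement is just posterior odds = Bayes factor × prior odds. A small sketch, with a hypothetical helper name, shows how the same BF10 = 3 lands differently depending on where you started:

```python
def posterior_prob_h1(bf10, prior_odds=1.0):
    """Posterior probability of H1 from a Bayes factor and prior odds (H1:H0)."""
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1 + posterior_odds)


# Equal prior odds: 3:1 in favor of H1, i.e. a probability of 0.75.
print(posterior_prob_h1(3.0))                            # 0.75

# A skeptic who started at 1:4 against H1 still ends up below 50%.
print(f"{posterior_prob_h1(3.0, prior_odds=0.25):.3f}")  # 0.429
```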

By the widely cited heuristic scale that traces back to Jeffreys (1961), in the relabeled form now common in psychology, BF10 between 1 and 3 is "anecdotal" evidence, 3 to 10 is "moderate," 10 to 30 is "strong," and above 30 is "very strong." A BF10 of 3 therefore sits right at the boundary between anecdotal and moderate: real but unimpressive evidence, with nothing like the rhetorical force of "p < .05."

Crucially, BF01 = 1/BF10 quantifies evidence for the null. A BF01 of 8 means the data are 8 times more likely under H0 than under H1, a statement that a standard significance test structurally cannot deliver. A non-significant p-value is famously not evidence for the null; it is merely failure to reject it. Bayes factors fix this asymmetry.
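Because the scale is symmetric, one small function can label evidence in either direction. This is a sketch using the cutoffs quoted above; the function name and output wording are hypothetical, and the labels are heuristics rather than decision rules.

```python
def describe_bayes_factor(bf10):
    """Label a Bayes factor using the 3 / 10 / 30 heuristic cutoffs."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    # Express the evidence as a ratio >= 1 in favor of whichever hypothesis wins.
    if bf10 >= 1:
        favored, strength = "H1", bf10
    else:
        favored, strength = "H0", 1 / bf10
    if strength < 3:
        label = "anecdotal"
    elif strength < 10:
        label = "moderate"
    elif strength < 30:
        label = "strong"
    else:
        label = "very strong"
    return f"{label} evidence for {favored}"


print(describe_bayes_factor(3.0))    # moderate evidence for H1
print(describe_bayes_factor(1 / 8))  # moderate evidence for H0 (BF01 = 8)
```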

The Real Advantages

1. Quantifying evidence for the null

This is the single most underappreciated reason to go Bayesian. Psychology has a long history of pretending that p = .42 is uninformative when it is often quite informative. Rouder et al. (2009) developed default Bayes factors for t-tests precisely so researchers could report "BF01 = 12, strong evidence for the null of no difference between the means" rather than the evasive "no significant difference was found."

For null-result papers, replication studies, equivalence claims, and any context where "the effect is small or absent" is itself the finding, Bayes factors are simply the right tool.
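The default t-test Bayes factor needs numerical integration, so as a stand-in here is a simpler case where everything is exact: a binomial test of H0: θ = 0.5 against H1: θ ~ Uniform(0, 1). This is a deliberately swapped-in toy model, not the Rouder et al. method, and the data are hypothetical.

```python
from math import comb


def bf01_binomial_uniform(successes, trials):
    """BF01 for H0: theta = 0.5 against H1: theta ~ Uniform(0, 1)."""
    # Under H0 the data have the usual binomial probability at theta = 0.5.
    like_h0 = comb(trials, successes) * 0.5 ** trials
    # Under H1 with a uniform prior, every count from 0 to trials is equally likely.
    like_h1 = 1.0 / (trials + 1)
    return like_h0 / like_h1


# Hypothetical data: exactly 50 successes in 100 trials.
bf01_null_like = bf01_binomial_uniform(50, 100)
print(f"BF01 = {bf01_null_like:.2f}")  # about 8: moderate evidence FOR the null

# Hypothetical data: 65 successes in 100 trials now favors H1 (BF01 < 1).
bf01_effect_like = bf01_binomial_uniform(65, 100)
print(f"BF01 = {bf01_effect_like:.3f}")
```

The first result is a positive statement about the null, which is exactly what "no significant difference was found" fails to provide.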

2. Sequential analysis and optional stopping

Frequentist p-values are not robust to checking your data as you collect it. Each peek inflates the Type I error rate; researchers either pre-commit to a sample size or pay for sequential designs with corrections like alpha spending.

Bayes factors are different. Because they are likelihood ratios rather than tail probabilities, they can in principle be monitored continuously, and you can stop when evidence reaches a pre-specified threshold (e.g., BF10 > 10 or BF01 > 10). This is sometimes overstated — optional stopping based on Bayes factors does still bias the distribution of obtained Bayes factors and requires honest priors — but the practical flexibility is real and substantial.
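A sketch of what threshold-based monitoring looks like, again using the toy binomial Bayes factor (H0: θ = 0.5 versus a Beta prior under H1) rather than a t-test. The true effect, the minimum and maximum sample sizes, and the threshold of 10 are all assumptions for the simulation.

```python
import random
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bf10_binomial(successes, trials, a=1.0, b=1.0):
    """BF10 for H1: theta ~ Beta(a, b) against H0: theta = 0.5."""
    # The binomial coefficient appears in both marginal likelihoods and cancels.
    log_m1 = log_beta(successes + a, trials - successes + b) - log_beta(a, b)
    log_m0 = trials * log(0.5)
    return exp(log_m1 - log_m0)


def run_sequential(true_theta, threshold=10.0, min_n=10, max_n=500, seed=1):
    """Collect one observation at a time; stop once evidence crosses a threshold."""
    rng = random.Random(seed)
    successes = 0
    bf10 = 1.0
    for n in range(1, max_n + 1):
        successes += int(rng.random() < true_theta)
        if n < min_n:
            continue  # do not evaluate stopping on a handful of observations
        bf10 = bf10_binomial(successes, n)
        if bf10 >= threshold:
            return n, bf10, "stop: evidence for H1"
        if bf10 <= 1 / threshold:
            return n, bf10, "stop: evidence for H0"
    return max_n, bf10, "stop: maximum sample size reached"


n_used, final_bf10, reason = run_sequential(true_theta=0.7)
print(n_used, round(final_bf10, 2), reason)
```

Fixing `min_n`, `max_n`, the threshold, and the prior before data collection is what keeps this kind of design honest.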

3. Accumulating prior evidence across studies

If twenty previous studies have established that an intervention has roughly d = 0.3, frequentist methods give you no principled way to use that information in your twenty-first study. You either ignore it or fold it into a separate meta-analysis. Bayesian methods let you encode prior research as — literally — the prior. The posterior then reflects the cumulative scientific picture, not just your single sample.
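In the simplest case this is a precision-weighted average. The sketch below assumes both the prior and the new study's estimate can be treated as approximately normal; the prior SD of 0.1 and the new result (d = 0.5, SE = 0.2) are made-up numbers for illustration.

```python
from math import sqrt


def normal_update(prior_mean, prior_sd, estimate, std_error):
    """Conjugate normal-normal update for an effect size."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / std_error ** 2
    post_precision = prior_precision + data_precision
    # The posterior mean weights each source by how precise it is.
    post_mean = (
        prior_precision * prior_mean + data_precision * estimate
    ) / post_precision
    return post_mean, sqrt(1 / post_precision)


# Prior from earlier literature: d is about 0.3 (assumed SD 0.1).
# New study: d = 0.5 with standard error 0.2 (hypothetical).
post_mean, post_sd = normal_update(0.3, 0.1, 0.5, 0.2)
print(f"posterior d = {post_mean:.2f} (SD {post_sd:.3f})")
```

The noisy new estimate pulls the posterior only slightly above the prior mean, which is the "cumulative picture" behavior described above.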

4. More interpretable intervals

As noted above, credible intervals say what people want them to say. For applied audiences — clinicians, policymakers, journalists — "there is a 95% probability the true effect is between 0.21 and 0.44" is comprehensible in a way no frequentist statement is.

The Real Disadvantages

1. Prior specification is genuinely hard

The most honest Bayesian critique of Bayesianism is that priors are subjective and consequential. Default priors (Cauchy, Jeffreys-Zellner-Siow, etc.) are conventions designed to be reasonable across many situations, but they are not neutral. A wider prior makes BF01 larger by spreading the alternative's probability mass over implausible effects, mechanically punishing H1. Narrower priors do the opposite.

This is not a fatal problem. It does, however, mean that prior sensitivity analysis is mandatory, not optional. Reporting a Bayes factor without showing how it varies across reasonable priors is the Bayesian equivalent of p-hacking.
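A sensitivity analysis can be as simple as recomputing the Bayes factor over a range of priors. The sketch below reuses a toy binomial model (H0: θ = 0.5 versus H1: θ ~ Beta(a, a)) in place of a Cauchy-prior t-test; the data and the grid of prior shapes are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bf01_binomial(successes, trials, a):
    """BF01 for H0: theta = 0.5 against H1: theta ~ Beta(a, a)."""
    log_m0 = trials * log(0.5)
    log_m1 = log_beta(successes + a, trials - successes + a) - log_beta(a, a)
    return exp(log_m0 - log_m1)


# Smaller a spreads the H1 prior widely; larger a concentrates it near 0.5.
prior_shapes = [0.5, 1, 2, 5, 20]
sensitivity = {a: bf01_binomial(50, 100, a) for a in prior_shapes}

for a, bf01 in sensitivity.items():
    print(f"Beta({a}, {a}) prior under H1 -> BF01 = {bf01:.2f}")
```

With the same 50-out-of-100 data, the evidence for the null shrinks steadily as the prior under H1 narrows, which is why a single Bayes factor without this table tells only part of the story.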

2. Computational cost

Closed-form Bayesian solutions exist for simple cases (t-tests, ANOVAs, linear regression with conjugate priors). For anything complicated (multilevel models, structural equation models, custom likelihoods) you need Markov chain Monte Carlo (MCMC). MCMC approximates the posterior by drawing a long chain of correlated samples from it; modern variants like Hamiltonian Monte Carlo (the engine behind Stan) explore the posterior far more efficiently than the random-walk Metropolis-Hastings samplers of the 1990s, but they still require:

For a 2 × 2 ANOVA, MCMC is overkill. For a six-level hierarchical model with crossed random effects, it may be the only honest option — but it is not a quick analysis.

3. Communication friction

Most reviewers were trained on frequentist statistics. Many will ask for a p-value alongside your Bayes factor, or politely express skepticism about your priors, or simply not engage with the analysis at all. Editors at certain journals still treat Bayesian methods as exotic. In regulated contexts (FDA submissions, formal clinical trial reporting), frequentist procedures are often outright required.

This is a sociological problem, not a statistical one, but it is real. Choosing Bayesian methods in the wrong venue can mean choosing not to be read.

The Lindley Paradox

Lindley (1957) pointed out a striking phenomenon: it is possible for a result to yield p < .05 (rejecting the null) while simultaneously yielding a Bayes factor that strongly favors the null. The two paradigms can give opposite answers from the same data.

The resolution is that they are answering different questions. A p-value asks: "How surprising are these data if the null is exactly true?" A Bayes factor asks: "How well does the null predict these data, compared to the alternative averaged over its prior?" When the sample is very large and the effect is very small, the data can be surprising under H0 (giving a small p) and yet still better predicted by the point null H0 than by a diffuse H1 that wasted prior probability on large effects that didn't materialize.

The Lindley paradox is not a bug; it is a feature that exposes a hidden assumption. A p-value is computed under H0 alone and never commits to which alternatives are plausible; a Bayes factor forces that commitment through the prior on H1. Which commitment is correct depends on what you actually believe before the study, and that is exactly what makes the prior non-trivial.
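The paradox is easy to reproduce numerically. The sketch below uses a hypothetical large binomial sample, a normal-approximation p-value, and a Bayes factor for H0: θ = 0.5 against a uniform prior under H1. Both the data and the choice of a uniform prior are assumptions for illustration.

```python
from math import erfc, exp, lgamma, log, sqrt


def two_sided_p(successes, trials):
    """Two-sided p-value for H0: theta = 0.5 (normal approximation)."""
    z = (successes - trials / 2) / sqrt(trials / 4)
    return erfc(abs(z) / sqrt(2))


def bf01_uniform(successes, trials):
    """BF01 for H0: theta = 0.5 against H1: theta ~ Uniform(0, 1)."""
    # Log binomial probability of the data under theta = 0.5.
    log_pmf_h0 = (
        lgamma(trials + 1)
        - lgamma(successes + 1)
        - lgamma(trials - successes + 1)
        + trials * log(0.5)
    )
    # The marginal likelihood under the uniform prior is 1 / (trials + 1).
    return exp(log_pmf_h0 + log(trials + 1))


# Hypothetical data: 50,317 successes in 100,000 trials (observed rate 0.50317).
n, k = 100_000, 50_317
p_value = two_sided_p(k, n)
bf01 = bf01_uniform(k, n)
print(f"p = {p_value:.4f}, BF01 = {bf01:.1f}")
```

Here the p-value falls just under .05 while BF01 comes out above 30: the test rejects the null and the Bayes factor strongly supports it, from the same counts.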

When to Prefer Bayesian Methods

When to Stick with Frequentist

A Pragmatic Heuristic

Don't choose a paradigm; choose an analysis. The most defensible practice in modern psychology is to report both: p-values, effect sizes, and confidence intervals alongside Bayes factors and credible intervals, so that readers can see whether the conclusions align. When they do, you've gained inferential robustness. When they don't (the Lindley scenario), you've found something genuinely interesting about your data.

A Word on MCMC

For anyone moving past default-prior t-tests into substantive Bayesian modeling, MCMC is the entry ticket. Its honest pros and cons:

Pros. MCMC fits essentially any model you can write down. It naturally produces full posterior distributions for every parameter, with uncertainty quantification baked in. Modern tools (Stan, PyMC, brms) make specification surprisingly tractable, and convergence diagnostics are well established.

Cons. It is slow, sometimes very slow. Convergence problems can be subtle and require statistical intuition to diagnose. Reparameterization tricks (non-centered parameterizations, etc.) are sometimes necessary to get chains to mix. And the posterior depends on the prior in ways that may surprise users who haven't done sensitivity analysis.
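For a feel of what the machinery does, here is a minimal random-walk Metropolis sampler with a basic (non-split) Gelman-Rubin R-hat check across chains. It is a teaching sketch in plain Python, not how Stan, PyMC, or brms work internally, and the target (a binomial proportion with 7 successes in 10 trials under a flat prior) is a hypothetical example with a known answer.

```python
import random
from math import exp, log, sqrt
from statistics import mean, variance


def log_posterior(theta, successes=7, trials=10):
    """Log of likelihood × flat prior for a binomial proportion."""
    if not 0 < theta < 1:
        return float("-inf")
    return successes * log(theta) + (trials - successes) * log(1 - theta)


def metropolis(n_draws, start, step=0.2, seed=0, warmup=500):
    """Random-walk Metropolis: propose a nearby value, accept or stay put."""
    rng = random.Random(seed)
    theta, logp = start, log_posterior(start)
    draws = []
    for i in range(n_draws + warmup):
        proposal = theta + rng.gauss(0, step)
        logp_new = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < exp(min(0.0, logp_new - logp)):
            theta, logp = proposal, logp_new
        if i >= warmup:
            draws.append(theta)
    return draws


def r_hat(chains):
    """Basic Gelman-Rubin statistic; values near 1 suggest the chains agree."""
    n = len(chains[0])
    between = n * variance([mean(c) for c in chains])
    within = mean(variance(c) for c in chains)
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


# Four chains from dispersed starting points.
starts = [0.1, 0.3, 0.5, 0.9]
chains = [metropolis(4000, start=s, seed=i) for i, s in enumerate(starts)]
all_draws = [d for c in chains for d in c]
mcmc_mean = mean(all_draws)
print(f"posterior mean = {mcmc_mean:.3f}, R-hat = {r_hat(chains):.3f}")
```

The exact posterior here is Beta(8, 4) with mean 2/3, so the sampler can be checked against a known value, a luxury that disappears in the models where MCMC is actually needed.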

Recommended Reading

The papers and books worth reading first, in roughly increasing technical depth:

Try It in PsyStat Nexus

The Bayesian Lab module computes default Bayes factors for t-tests, correlations, and ANOVAs, and produces credible intervals with adjustable priors so you can run sensitivity analyses without leaving the browser. The Dual Report module runs the same data through both frequentist and Bayesian pipelines side by side — ideal when you want to honor reviewer expectations on both fronts, or when you suspect a Lindley-paradox situation in your data.
