What is statistical power?

Statistical power is the probability that a test will correctly reject a false null hypothesis, given a specified true effect size. It equals 1 minus beta and is determined by alpha, the true effect size, sample size, and the variability of the outcome. Cohen recommended a minimum of .80 power for most psychological research.

What is a p-value?

A p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true. It does not directly tell you the probability that the null is true, that the effect is real, or that a replication will succeed. The American Statistical Association cautions that p-values should never be the sole basis for scientific conclusions.

What is Cohen's d?

Cohen's d is a standardized effect size for the difference between two means, expressed in standard deviation units. Cohen's classic benchmarks are small d = 0.2, medium = 0.5, and large = 0.8, though these are field-specific. APA reporting standards require effect sizes alongside test statistics.

What is Bayesian inference?

Bayesian inference treats parameters as random variables described by probability distributions. It combines prior beliefs with observed data via Bayes' theorem to produce a posterior distribution that quantifies what you should believe about a parameter given everything you knew before plus the data you collected.
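As a minimal sketch of prior-to-posterior updating, assume a binomial outcome with a Beta prior. This conjugate pair is chosen here only because the posterior has a closed form. The function names and the example counts are illustrative and do not come from the text.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Return the Beta posterior parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical example: flat Beta(1, 1) prior, then 7 successes in 10 trials.
post_a, post_b = update_beta(1, 1, successes=7, failures=3)
posterior_mean = beta_mean(post_a, post_b)
print(post_a, post_b, round(posterior_mean, 3))
```

The posterior mean sits between the prior mean (0.5) and the observed proportion (0.7), which is the "prior plus data" compromise described above.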

What is a Bayes factor?

A Bayes factor is the ratio of the marginal likelihood of the data under one model to the marginal likelihood under another, quantifying relative evidence for competing hypotheses. Unlike p-values, Bayes factors can quantify evidence for the null hypothesis, distinguishing 'evidence of absence' from 'absence of evidence.'
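To make the ratio of marginal likelihoods concrete, here is a small sketch for one simple case: binomial data, a point null of theta = 0.5, and an alternative that places a uniform prior on theta. Both the models and the function name `bf01_binomial` are assumptions chosen for illustration; other priors give other Bayes factors.

```python
from math import comb

def bf01_binomial(k, n):
    """Bayes factor for H0: theta = 0.5 versus H1: theta ~ Uniform(0, 1)."""
    marginal_h0 = comb(n, k) * 0.5 ** n   # likelihood at the point null
    marginal_h1 = 1.0 / (n + 1)           # binomial likelihood averaged over a uniform prior
    return marginal_h0 / marginal_h1

# Hypothetical data: 5 successes in 10 trials.
bf01 = bf01_binomial(k=5, n=10)
print(round(bf01, 3))
```

A value above 1 favors the null and a value below 1 favors the alternative, so the same quantity can express support in either direction.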

What is Cronbach's alpha?

Cronbach's alpha is an estimate of internal-consistency reliability based on the average inter-item correlations and the number of items in a scale. Values of .70-.79 are considered acceptable, .80-.89 good, and .90+ excellent. Modern recommendations favor McDonald's omega as a more accurate alternative.
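The coefficient can be sketched in a few lines. This version uses the common variance form, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), rather than the average-correlation form. The item scores are made up for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: list of per-item score lists, one score per respondent in each list."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]      # each respondent's scale total
    item_var_sum = sum(variance(item) for item in items)  # sample variances of the items
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical data: 3 items answered by 4 respondents.
items = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 3, 3, 5]]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```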

What is a confidence interval?

A confidence interval is a range of plausible values for a population parameter, constructed so that the procedure captures the true value at a specified long-run rate (e.g., 95%). The 95% describes the method's long-run coverage, not the probability that any single interval contains the true value. CIs communicate both magnitude and precision, making them preferable to bare p-values.

What is regression analysis?

Regression analysis models an outcome variable as a function of one or more predictors. Simple linear regression fits Y = b0 + b1*X + e by minimizing squared residuals. Multiple regression extends this to several predictors simultaneously, where each coefficient represents the predicted change in Y for a one-unit change in that predictor holding others constant.

What is the Bonferroni correction?

The Bonferroni correction is the simplest method for adjusting alpha when running multiple statistical tests. It divides alpha by the number of tests (alpha/k) to control the family-wise error rate. It is famously conservative; Holm's step-down procedure is uniformly more powerful and should generally be preferred.

What is a mixed model?

A mixed model (multilevel or hierarchical model) is used when data are nested—students within classrooms, repeated measures within people, patients within clinics. It partitions variance across levels using random intercepts and/or random slopes, honoring the nested structure that ordinary regression would ignore, and producing accurate standard errors.

What is statistical significance?

Statistical significance is a decision in null hypothesis testing that the observed data are surprising enough under the null hypothesis to warrant rejecting it. Conventionally this means a p-value below .05. Significance does not equal practical importance and should always be reported alongside effect sizes and confidence intervals.

What is effect size?

Effect size is a standardized index of the magnitude of a phenomenon, independent of sample size. Common effect sizes include Cohen's d and Hedges' g for mean differences, eta-squared for ANOVA, R-squared for regression, and Pearson's r for correlation. Where p-values answer 'is there an effect?', effect sizes answer 'how big is it?'

What is logistic regression?

Logistic regression models a binary outcome (yes/no, success/failure) as a function of predictors using the logit link, returning coefficients on the log-odds scale. Exponentiating coefficients gives odds ratios. It is fit by maximum likelihood and is appropriate when linear regression would predict probabilities outside [0,1].

What is mediation analysis?

Mediation analysis decomposes a total effect of X on Y into a direct effect and an indirect effect transmitted through a mediator M. Modern practice uses bias-corrected bootstrap confidence intervals on the indirect effect (a*b), which are more powerful than the legacy Sobel test and do not require normality.

What is a confounder?

A confounder is a variable that is a common cause of both exposure and outcome (e.g., age affects both smoking and lung cancer). To estimate the causal effect of an exposure, you must adjust for confounders. Confounders are causally distinct from mediators (which lie on the causal pathway) and colliders (which are common effects).

Reference Library

The Statistical Methods Encyclopedia

Q: What is ANOVA?

ANOVA (Analysis of Variance) is a test of whether three or more group means differ on a continuous outcome. The F-statistic is the ratio of between-group variance to within-group variance. A significant F tells you the groups are not all equal, but post-hoc tests like Tukey's HSD are needed to identify which specific groups differ.

Q: What is propensity score matching?

Propensity score matching is a method for estimating causal effects from observational data by matching treated and control units with similar probabilities of receiving treatment. It can only adjust for observed confounders; hidden confounders bias the estimate just as in any non-randomized comparison.

Q: What is meta-analysis?

Meta-analysis is the quantitative synthesis of effect sizes across multiple studies. A well-conducted meta-analysis can yield more precise and generalizable estimates than any single study by pooling data, but it can also propagate the biases of the underlying literature if conducted carelessly.

A scholarly, working reference for the statistical procedures that drive contemporary psychology, sociology, education, criminology, and the broader behavioral sciences. Each entry explains what the method is, when to use it, the assumptions it leans on, and where you can run it in PsyStat Nexus today.

9 Categories

60+ Methods

240+ Modules in PsyStat

~8,000 Words of Reference Content

A. Hypothesis Testing B. Parametric Tests C. Regression D. Non-Parametric E. Bayesian F. Psychometrics G. Multilevel & Mixed H. Causal Inference I. Meta-Analysis


A. Hypothesis Testing & Inference (8 Entries)

The grammar of inferential statistics. These are the conceptual building blocks that govern how we move from a sample to a claim about a population, and how we quantify the uncertainty that necessarily comes with that move.

Null Hypothesis Significance Testing (NHST)

A formal procedure for deciding whether observed data are surprising enough under a "no effect" assumption to warrant rejecting that assumption.

NHST emerged from a fusion of Fisher's significance testing and the Neyman-Pearson decision framework. The analyst specifies a null hypothesis (H₀), typically of "no effect" or "no difference," and computes a test statistic whose sampling distribution under H₀ is known. If the observed statistic falls in the rejection region (defined by an alpha level, conventionally .05), the null is rejected.

NHST is appropriate when you have a clear comparison and a well-defined sampling model, but it has been heavily criticized for encouraging dichotomous "significant vs. not" thinking and because its p-values are routinely misread as the probability that H₀ is true (Wasserstein & Lazar, 2016). Use it as a screening tool, not a verdict.

Worked example: A clinical trial compares a new SSRI to placebo on depression scores. With H₀: mean_drug = mean_placebo and a two-sample t-test yielding p = .003, you reject H₀ at alpha = .05.

PsyStat: T-Test & ANOVA modules | Inference

p-values

The probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true.

A p-value answers a narrow conditional question: given that H₀ is exactly true, how often would chance alone produce a result this extreme or more so? A small p-value means the observed data are unusual under H₀, but it does not directly tell you the probability that H₀ is true, that the effect is real, or that a replication will succeed.

The American Statistical Association's 2016 statement explicitly cautioned that p-values should never be the sole basis for scientific conclusions and that the .05 threshold has no inherent epistemic meaning. Modern best practice reports exact p-values alongside effect sizes and confidence intervals, and treats p as one piece of evidence among many.

Two common misreadings: (1) p = .04 does not mean there is a 96% chance the effect is real; (2) p > .05 does not mean "no effect" — it means the data are insufficient to distinguish the observed effect from zero.

PsyStat: All inferential modules | Inference

Type I and Type II Errors

The two ways a hypothesis test can mislead you: rejecting a true null (Type I, alpha) or failing to reject a false null (Type II, beta).

In the Neyman-Pearson framework, every test trades off these two errors. The Type I error rate (alpha) is the long-run frequency of false positives if you ran the same study many times under a true H₀. Researchers fix this in advance, classically at .05 or .01.

The Type II error rate (beta) is the probability of missing a real effect of a given size. It depends on alpha, the true effect size, the sample size, and the variability of the data. Statistical power, defined as 1 − beta, quantifies your test's sensitivity to detect effects when they exist.

Use case: A diagnostic study using a Type I rate of .01 to avoid falsely flagging healthy patients accepts a higher Type II rate — meaning some sick patients will be missed unless the sample is large. Designing studies is fundamentally about balancing these error rates against the cost of each kind of mistake.

PsyStat: Power Analysis suite | Inference

Confidence Intervals

A range of plausible values for a population parameter, constructed so that the procedure captures the true value at a specified long-run rate (e.g., 95%).

The 95% in a 95% confidence interval is a property of the method, not of any single interval. If you repeated the study many times and constructed a 95% CI each time, about 95% of those intervals would contain the true parameter in the long run. The interval you actually obtained either does or does not contain it, so you cannot say there is a 95% probability the true value falls inside it (that statement describes a Bayesian credible interval).

CIs are increasingly preferred over bare p-values because they communicate both magnitude and precision. A narrow CI suggests a precise estimate; a wide one signals high uncertainty. Geoff Cumming and others have championed "the new statistics" precisely because CIs make estimation, rather than yes/no testing, the central activity.

Worked example: A study reports a mean difference of 4.2 IQ points (95% CI [1.1, 7.3]). The interval excludes zero (so a two-sided test would be significant), but the lower bound suggests the effect could be as small as 1 point.
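The long-run reading of "95%" can be checked by simulation. The sketch below assumes a normal population with known standard deviation and uses a simple z-interval. The population values, sample size, and seed are arbitrary choices, not figures from the example above.

```python
import random
from statistics import NormalDist

def coverage_rate(true_mean=100.0, sigma=15.0, n=25, reps=2000, level=0.95, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5          # known-sigma z interval
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1                          # this interval captured the true mean
    return hits / reps

rate = coverage_rate()
print(rate)
```

Each simulated interval either contains the true mean or it does not. Only the proportion across many repetitions approaches the nominal level.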

PsyStat: Estimation & Bootstrap CI modules | Estimation

Effect Sizes

Standardized indices of the magnitude of a phenomenon, independent of sample size.

Where p-values answer "is there an effect?", effect sizes answer "how big is it?" Common standardized effect sizes include Cohen's d and Hedges' g for mean differences, eta-squared and omega-squared for ANOVA, R² and f² for regression, and Pearson's r for correlation.

Cohen's classic benchmarks (small d = 0.2, medium = 0.5, large = 0.8) are convenient but field-specific; in some areas of social psychology, "small" effects can have substantial practical impact. Always interpret effect sizes against domain norms and against the cost-benefit of the intervention being studied.

APA's Publication Manual now requires effect sizes alongside test statistics. PsyStat reports them automatically with bootstrap or non-central CIs whenever applicable, and offers Hedges' correction for small samples.
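For two independent groups, Cohen's d and Hedges' small-sample correction can be sketched as follows. The correction factor 1 − 3/(4·df − 1) is the usual approximation, and the two score lists are invented for illustration. This is not a description of how PsyStat computes these values.

```python
from statistics import mean, variance

def cohens_d(group1, group2):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5

def hedges_g(group1, group2):
    """Cohen's d shrunk by the approximate small-sample correction factor."""
    df = len(group1) + len(group2) - 2
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d(group1, group2) * correction

# Hypothetical scores.
treated = [2, 3, 4, 5, 6]
control = [1, 2, 3, 4, 5]
d = cohens_d(treated, control)
g = hedges_g(treated, control)
print(round(d, 3), round(g, 3))
```

With only five cases per group the correction is noticeable. It shrinks toward 1 as the sample grows.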

→ Deep dive: Effect Sizes Beyond Cohen's d

PsyStat: Effect Size Calculator (every test) | Reporting

Statistical Power

The probability that a test will correctly reject a false null hypothesis, given a specified true effect size.

Power = 1 − beta. It is determined by four interacting quantities: alpha, the true effect size, the sample size, and the variability of the outcome. Increase any of the first three, or shrink the variability, and power goes up.

Cohen recommended a minimum of .80 power for most psychological research. Studies with low power are not just unlikely to detect real effects — the effects they do detect are systematically inflated, a phenomenon known as the winner's curse or Type M (magnitude) error (Gelman & Carlin, 2014).

Use case: Before running a 2x2 between-subjects experiment expecting a medium-sized interaction (f = .25), an a priori power analysis at alpha = .05 and power = .80 returns N = 128, or 32 per cell. Skipping this step is a common route to underpowered studies whose findings fail to replicate.

Avoid post-hoc "observed power" calculations — they are mathematically redundant with the p-value and provide no additional information.
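The way alpha, effect size, and sample size trade off can be sketched with the normal approximation for a two-group comparison of means. This is a simpler design than the 2x2 interaction in the use case, and the approximation typically lands slightly below what t-based power software reports. The function names are illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

def approx_power(d, n, alpha=0.05):
    """Approximate power with n per group, ignoring the negligible opposite tail."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * (n / 2) ** 0.5 - z_crit)

for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

Halving the expected effect size roughly quadruples the required sample, which is why optimistic effect-size guesses produce underpowered studies.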

→ Deep dive: Power Analysis

PsyStat: Power Analysis (a priori, sensitivity, compromise) | Design

Multiple Comparisons Correction

Statistical adjustments that control the inflated false-positive rate that arises when many tests are run on the same data.

Run 20 independent tests at alpha = .05 when every null is true and you expect, on average, one "significant" result by chance alone. Multiple comparisons procedures protect against this by controlling either the family-wise error rate (FWER), the probability of making any false positive, or the false discovery rate (FDR), the expected proportion of rejected nulls that are actually true (that is, the share of "discoveries" that are false).

The Bonferroni correction is the simplest FWER procedure (alpha / k) but is famously conservative. Holm's step-down is uniformly more powerful and should generally be preferred. The Benjamini-Hochberg (1995) FDR procedure is the standard in genomics, neuroimaging, and high-throughput contexts where some false positives are acceptable in exchange for greater discovery power.

Worked example: An fMRI study testing 50,000 voxels uses BH-FDR at q = .05. After sorting the m p-values from smallest to largest, find the largest rank i for which p_(i) ≤ (i/m) × q; the voxels with ranks 1 through i are declared significant.
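The three procedures named in this entry can be compared on one set of p-values. The sketch below is a plain implementation for illustration, with invented p-values. It also computes the family-wise error rate for 20 independent tests of true nulls at alpha = .05.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm step-down: compare the sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):           # rank 0 is the smallest p-value
        if pvals[i] > alpha / (m - rank):
            break                              # stop at the first failure
        reject[i] = True
    return reject

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject every p-value up to the largest rank under the line (i/m) * q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

pvals = [0.03, 0.001, 0.2, 0.012, 0.007, 0.02]   # hypothetical p-values
fwer_20 = 1 - (1 - 0.05) ** 20                   # chance of at least one false positive
print(sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))
print(round(fwer_20, 3))
```

On these values Bonferroni rejects the fewest hypotheses, Holm rejects at least as many, and BH rejects the most, reflecting the looser error criterion that FDR control uses.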

→ Deep dive: Multiple Comparisons (Bonferroni, Holm, FDR)

PsyStat: Post-Hoc & Multiple Testing module | Inference

One-Tailed vs Two-Tailed Tests

A choice about whether the alternative hypothesis specifies a direction (one-tailed) or only that the parameters differ (two-tailed).

A two-tailed test rejects H₀ if the observed statistic is extreme in either direction. A one-tailed test commits in advance to a specific direction (e.g., the new drug improves outcomes), placing the entire alpha in one tail. This makes it more powerful for detecting effects in the predicted direction but blind to effects in the opposite direction.

One-tailed tests are appropriate when (a) a directional prediction is genuinely justified by theory and (b) an effect in the opposite direction would be treated identically to no effect — both would lead to the same decision. They become inappropriate (and a form of researcher degrees of freedom) if the analyst chooses tailedness after seeing the data.

Best practice: Pre-register your tail choice. When in doubt, use two-tailed: a one-tailed test that turns out to point the "wrong" way produces a counterintuitive non-significant result that is hard to publish honestly.
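The consequence of the tail choice is easy to see for a z statistic. This sketch assumes a standard normal reference distribution, and the value z = 1.8 is arbitrary.

```python
from statistics import NormalDist

def p_two_tailed(z):
    """Probability of a statistic at least this extreme in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def p_one_tailed(z, predicted_sign=1):
    """predicted_sign is +1 if H1 predicts a positive statistic, -1 if negative."""
    return 1 - NormalDist().cdf(predicted_sign * z)

print(round(p_two_tailed(1.8), 4))    # the same z judged against both tails
print(round(p_one_tailed(1.8), 4))    # effect in the predicted direction
print(round(p_one_tailed(-1.8), 4))   # effect in the opposite direction
```

The same statistic crosses .05 only under the one-tailed test, and an equally large effect in the unpredicted direction yields a p-value near 1.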

PsyStat: T-Test & Z-Test modules | Inference

B. Parametric Tests (7 Entries)

The classical workhorses of group-comparison statistics. These methods assume the data are drawn from specific (usually normal) distributions and that variance behaves predictably across groups. When their assumptions hold, they are maximally powerful.

Independent Samples t-Test

A test of whether two unrelated groups differ on the mean of a continuous outcome.

The independent-samples (or "two-sample") t-test compares the means of two groups by dividing the observed mean difference by the standard error of that difference. Two flavors exist: the Student t-test assumes equal variances across groups, while Welch's t-test relaxes that assumption and is generally recommended as the default (Delacre, Lakens, & Leys, 2017).

Assumptions: independence of observations, approximate normality of the outcome within each group (or large enough samples for the Central Limit Theorem to apply), and either equal variances (Student) or no assumption about variance (Welch). Levene's test is sometimes used to check variance equality, but the safer move is to use Welch's by default.

Worked example: Comparing extraversion scores between 80 musicians (M = 4.2) and 75 accountants (M = 3.6), Welch's t(149.3) = 3.85, p < .001, d = 0.62 [0.30, 0.94]. Pair this with effect size and CI for full APA reporting.
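Welch's statistic and its Welch-Satterthwaite degrees of freedom can be computed from summary statistics alone. The worked example does not report group standard deviations, so the SDs below (1.0 and 0.9) are hypothetical and the output will not reproduce the reported t(149.3) = 3.85.

```python
def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t and Welch-Satterthwaite df from means, SDs, and group sizes."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2       # squared standard errors of each mean
    t = (m1 - m2) / (v1 + v2) ** 0.5
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Means and ns from the example; SDs are assumed for illustration.
t, df = welch_t(4.2, 1.0, 80, 3.6, 0.9, 75)
print(round(t, 2), round(df, 1))
```

The fractional degrees of freedom are the signature of Welch's test: they fall between the smaller group's n − 1 and the pooled n1 + n2 − 2.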

PsyStat: T-Test module | Two groups

Paired Samples t-Test

A test of whether the mean of within-subject difference scores is non-zero.

When each subject contributes two measurements — pre/post, left/right, twin pairs — the paired t-test analyzes the differences rather than the raw scores. This dramatically increases power because between-subject variability cancels out: you're testing whether the typical individual changed.

Assumptions: the pairs are independent of each other, and the difference scores are approximately normally distributed. Note that it is the differences, not the raw scores in either condition, that need to be normal.

Worked example: Forty patients are weighed before and after an 8-week intervention. Mean weight loss = 3.2 kg (SD = 2.1), t(39) = 9.64, p < .001, d_z = 1.52. Report the paired Cohen's d_z (mean difference / SD of differences), not d for between-groups comparisons — the two are not interchangeable and confusion is rampant in the literature.
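The figures in the worked example follow directly from the mean and SD of the difference scores. A short sketch of that arithmetic (the function name is illustrative):

```python
def paired_summary(mean_diff, sd_diff, n):
    """Return d_z, the paired t statistic, and its degrees of freedom."""
    d_z = mean_diff / sd_diff        # standardized mean of the difference scores
    t = d_z * n ** 0.5               # same as mean_diff / (sd_diff / sqrt(n))
    return d_z, t, n - 1

d_z, t, df = paired_summary(mean_diff=3.2, sd_diff=2.1, n=40)
print(round(d_z, 2), round(t, 2), df)
```

Because t = d_z × √n, the paired effect size can be recovered from a reported t and n when the SD of the differences is not given.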

PsyStat: Paired T-Test module | Within-subjects

One-Way ANOVA

A test of whether three or more group means differ on a continuous outcome, controlling the family-wise error rate of pairwise comparisons.

One-way ANOVA generalizes the t-test to more than two groups. The F-statistic is the ratio of between-group variance to within-group variance. A significant F tells you the groups are not all equal, but not which ones differ — that requires post-hoc tests like Tukey's HSD, Scheffe, or planned contrasts.

Assumptions: independence of observations, normality of residuals within each group, and homogeneity of variance across groups (the latter is often violated; Welch's ANOVA is a robust alternative). The Brown-Forsythe and Welch corrections handle heteroscedasticity well.

Worked example: Comparing memory recall across four study conditions (massed, spaced, interleaved, control), F(3, 196) = 7.42, p < .001, eta² = .102. Tukey HSD reveals spaced > massed (p = .003) and interleaved > massed (p < .001).
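The variance ratio behind F can be sketched from raw group data, together with the identity that recovers eta² from a reported F and its degrees of freedom. The three small groups are invented. The last line checks the eta² reported in the worked example.

```python
def one_way_anova(groups):
    """Return F, df_between, df_within, and eta squared for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_sq

def eta_sq_from_f(f_stat, df_between, df_within):
    """Recover eta squared from a one-way F and its degrees of freedom."""
    return f_stat * df_between / (f_stat * df_between + df_within)

# Hypothetical data for three groups.
f_stat, df_b, df_w, eta_sq = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2), df_b, df_w, round(eta_sq, 4))
print(round(eta_sq_from_f(7.42, 3, 196), 3))
```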

PsyStat: One-Way ANOVA module | Three+ groups

Repeated Measures ANOVA

A within-subjects extension of ANOVA used when each participant is measured under three or more conditions or time points.

RM-ANOVA gains power over between-subjects designs by partitioning out individual differences. It tests whether the mean response varies across the within-subject factor while accounting for the correlated structure of the data.

The sphericity assumption is unique and important: the variances of the differences between all pairs of conditions must be equal. Mauchly's test assesses this; if violated, apply the Greenhouse-Geisser (conservative) or Huynh-Feldt (less conservative) corrections to the degrees of freedom. Many statisticians now recommend skipping Mauchly entirely and applying Greenhouse-Geisser by default, or moving to a linear mixed model.

Use case: A working memory study measures 30 participants at 4 retention intervals. RM-ANOVA, F(3, 87) = 24.1, p < .001 with Greenhouse-Geisser epsilon = .82, eta_p² = .45. Pairwise comparisons with Bonferroni adjustment localize the decline.
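A sphericity correction works by multiplying both degrees of freedom by epsilon before the p-value is looked up. The sketch below applies that to the use case (30 participants, 4 levels, epsilon = .82) and recomputes partial eta² from the reported F. The function names are illustrative.

```python
def gg_adjusted_df(n_subjects, n_levels, epsilon):
    """Degrees of freedom after multiplying both by the sphericity epsilon."""
    df_effect = (n_levels - 1) * epsilon
    df_error = (n_levels - 1) * (n_subjects - 1) * epsilon
    return df_effect, df_error

def partial_eta_sq(f_stat, df_effect, df_error):
    """Partial eta squared recovered from F and its degrees of freedom."""
    return f_stat * df_effect / (f_stat * df_effect + df_error)

df_effect, df_error = gg_adjusted_df(n_subjects=30, n_levels=4, epsilon=0.82)
print(round(df_effect, 2), round(df_error, 2))
print(round(partial_eta_sq(24.1, 3, 87), 2))
```

Because epsilon scales both degrees of freedom equally, F and partial eta² are unchanged. Only the reference distribution for the p-value shifts.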

PsyStat: Repeated Measures ANOVA module | Within-subjects

Factorial ANOVA

A multi-factor ANOVA that decomposes variance into main effects of each factor and the interactions among them.

A factorial design crosses two or more independent variables (e.g., a 2x3 design with two levels of Condition and three levels of Difficulty). Factorial ANOVA partitions the total variance into main effects (does Condition matter overall? does Difficulty?) and an interaction (does the effect of Condition depend on Difficulty?).

The interaction is usually the scientifically interesting result. If a significant interaction exists, main effects must be interpreted with caution — "Condition matters, but only when Difficulty is high." Follow up with simple-effects analyses to characterize the interaction pattern.

Worked example: A 2 (Reward: present/absent) x 2 (Task: easy/hard) design on persistence yields a significant Reward x Task interaction, F(1, 116) = 8.91, p = .003, eta_p² = .071. Simple effects: reward boosts persistence on hard tasks (p < .001) but not easy ones (p = .42). The interaction tells the actual story.

PsyStat: Factorial ANOVA module | Multi-factor

ANCOVA

A hybrid of ANOVA and regression that compares group means after statistically adjusting for one or more continuous covariates.

ANCOVA increases power and reduces bias by removing variance attributable to a covariate that is correlated with the outcome but, ideally, independent of the grouping variable. The classic use case is adjusting post-test scores for pre-test scores in a randomized experiment.

Critical assumption: homogeneity of regression slopes, meaning the relationship between the covariate and outcome must be the same across groups. If the covariate-by-group interaction is significant, ANCOVA is misleading and you should report the interaction itself. Also, the covariate must be measured before treatment (or at least be unaffected by it); adjusting for a covariate that the treatment has changed removes real treatment variance. A separate caution, Lord's Paradox, applies to non-randomized groups that differ at baseline: there, ANCOVA and change-score analyses can lead to conflicting conclusions.

Use case: An RCT of a tutoring program adjusts post-test math scores for pre-test scores. Adjusted treatment effect = 4.7 points (95% CI [2.1, 7.3]) vs. an unadjusted estimate of 4.9 points — the small change reflects baseline equivalence from successful randomization.

PsyStat: ANCOVA module | Adjusted comparison

MANOVA

An extension of ANOVA to multiple, conceptually related dependent variables analyzed simultaneously.

MANOVA tests whether group means differ on a linear combination of multiple outcomes. It is appropriate when the dependent variables are theoretically related (e.g., subscales of the same construct) and you want to control the Type I error rate that would arise from running separate ANOVAs.

Test statistics include Wilks' Lambda (the most common), Pillai's Trace (most robust to assumption violations), Hotelling's Trace, and Roy's Largest Root. Pillai is generally recommended when assumptions are uncertain.

Assumptions: multivariate normality, homogeneity of variance-covariance matrices (Box's M), and linearity among DVs. A significant MANOVA is typically followed by univariate ANOVAs or descriptive discriminant analysis to interpret which DVs are driving the multivariate effect. Do not interpret MANOVA as a simple "fancier ANOVA" — it answers a fundamentally multivariate question.

PsyStat: MANOVA module | Multivariate

C. Regression (9 Entries)

Regression methods model an outcome variable as a function of one or more predictors. They are the foundation of nearly all modern statistical inference — ANOVA, t-tests, and ANCOVA are all special cases of the linear model.

Simple Linear Regression

Models a continuous outcome as a linear function of a single predictor by minimizing the sum of squared residuals.

Simple regression fits the equation Y = b₀ + b₁X + e, where b₁ is the predicted change in Y for a one-unit change in X. Ordinary Least Squares (OLS) chooses b₀ and b₁ to minimize the sum of squared vertical distances from each point to the line.

Assumptions (LINE): Linearity of the X-Y relationship, Independence of errors, Normality of residuals, and Equality of error variance (homoscedasticity). Diagnostic plots — residuals vs. fitted, Q-Q plot, scale-location — reveal violations more clearly than any single test.

Worked example: Predicting GPA from study hours per week with N = 200, b₁ = 0.08 (95% CI [0.05, 0.11]), R² = .14, F(1, 198) = 32.4, p < .001. Each additional study hour is associated with a 0.08-point GPA increase, accounting for 14% of GPA variance.
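The least-squares solution for one predictor has a closed form, sketched here in plain Python. The five data points are invented and unrelated to the GPA example.

```python
def fit_line(x, y):
    """Return intercept b0, slope b1, and R squared for simple OLS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                    # slope that minimizes squared residuals
    b0 = mean_y - b1 * mean_x         # the line passes through (mean_x, mean_y)
    r_squared = sxy ** 2 / (sxx * syy)
    return b0, b1, r_squared

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r_squared = fit_line(x, y)
print(round(b0, 3), round(b1, 3), round(r_squared, 4))
```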

PsyStat: Linear Regression module | Continuous outcome

Multiple Regression

Extends simple regression to model an outcome as a linear function of multiple predictors simultaneously.

The model Y = b₀ + b₁X₁ + b₂X₂ + ... + e estimates each coefficient as the predicted change in Y for a one-unit change in that predictor holding all other predictors constant. This conditioning interpretation is what gives regression its analytic power — and what makes it dangerous if predictors are highly correlated or causally tangled.

Standardized coefficients (betas) put all predictors on a common scale and are useful for comparing relative importance, but they should not be over-interpreted: a beta of .30 is not "larger" than a beta of .25 in any deep sense. Report unstandardized coefficients with CIs as the primary effect estimates.

Use case: Predicting depression scores from neuroticism, sleep quality, and social support, N = 412. Each is a significant unique predictor; semi-partial r² reveals neuroticism uniquely accounts for 12% of variance, sleep 4%, and support 3%.

PsyStat: Multiple Regression module | Multivariable

Logistic Regression

Models a binary outcome as a function of predictors using the logit link, returning coefficients on the log-odds scale.

When the outcome is dichotomous (yes/no, dropout/retain, recidivate/desist), linear regression is inappropriate — it can predict probabilities outside [0, 1] and violates homoscedasticity. Logistic regression instead models log(p / (1 − p)) as a linear function of predictors, fit by maximum likelihood.

Coefficients are interpretable as log-odds ratios; exponentiating gives odds ratios. An OR of 1.5 means a one-unit increase in X is associated with 50% greater odds of the outcome. Effect sizes include McFadden's pseudo-R², Nagelkerke's R², and the area under the ROC curve (AUC) for predictive accuracy.

Assumptions: independence of observations, linearity of predictors with the logit, no extreme multicollinearity, and adequate sample size (typically 10-20 events per predictor). Hosmer-Lemeshow tests model calibration.

Use case: Predicting graduation (1) vs. dropout (0) from GPA, attendance, and SES; OR for GPA = 2.4 (95% CI [1.8, 3.2]).
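The link between log-odds coefficients, odds ratios, and predicted probabilities can be sketched as follows. The slope is taken as log(2.4) to match the odds ratio in the use case. The intercept and the two GPA values are hypothetical.

```python
from math import exp, log

def inv_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1 / (1 + exp(-log_odds))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

b_gpa = log(2.4)       # coefficient implied by OR = 2.4
intercept = -1.0       # hypothetical value, for illustration only
p_low = inv_logit(intercept + b_gpa * 2.0)    # predicted probability at GPA 2.0
p_high = inv_logit(intercept + b_gpa * 3.0)   # predicted probability at GPA 3.0
odds_ratio = odds(p_high) / odds(p_low)
print(round(p_low, 3), round(p_high, 3), round(odds_ratio, 2))
```

The odds ratio is constant for every one-unit step in the predictor, while the change in probability depends on where along the curve the step is taken.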

PsyStat: Logistic Regression module | Binary outcome

Poisson Regression

A generalized linear model for count outcomes, using a log link and an assumed Poisson distribution.

When the outcome is a count of discrete events (number of drinks per week, hospital visits per year, errors per task), Poisson regression is the natural starting point. Coefficients exponentiate to incidence rate ratios — an IRR of 1.3 means the rate increases by 30% per unit of X.

Critical caveat: Poisson assumes the variance equals the mean (equidispersion). Real count data are often overdispersed, in which case standard errors are too small and tests become anti-conservative. Check the deviance/df ratio — if it is much greater than 1, switch to negative binomial regression or use quasi-Poisson standard errors.

For data with excess zeros (e.g., counts of risky behaviors where many participants report zero), zero-inflated or hurdle models separate the structural zero process from the count process.

Use case: Modeling annual ER visits as a function of chronic condition count and insurance status, with the log of the population at risk as an offset.
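Before fitting any model, a crude first look at overdispersion is the ratio of the sample variance to the sample mean of the raw counts. This is not the deviance/df diagnostic mentioned above, which requires a fitted model. It is only a quick screen, and the counts below are invented. The last lines show how a log-scale coefficient converts to an incidence rate ratio.

```python
from math import exp
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; near 1 under a Poisson model with no predictors."""
    return variance(counts) / mean(counts)

visits = [0, 0, 0, 1, 1, 2, 3, 5, 8, 20]   # hypothetical ER visit counts
ratio = dispersion_ratio(visits)
print(round(ratio, 2))

b = 0.2624                                 # hypothetical coefficient on the log scale
irr = exp(b)                               # incidence rate ratio
print(round(irr, 2))
```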

PsyStat: GLM & Count Models module | Count outcome

Hierarchical Regression

A theory-driven sequential approach in which predictors are entered in pre-specified blocks to assess incremental variance explained.

Unlike "stepwise" regression (which is largely discredited), hierarchical regression leaves the order of entry to the analyst, who decides it in advance based on theory or temporal logic. Each new block's contribution is judged by the change in R² and an associated F-test.

Common pattern: enter demographic controls in Block 1, established predictors in Block 2, and the novel theoretical predictor in Block 3. A significant delta-R² for Block 3 demonstrates incremental validity over and above what was already known.

Worked example: Predicting job performance, Block 1 (age, tenure) R² = .08; Block 2 (cognitive ability) delta-R² = .14, p < .001; Block 3 (emotional intelligence) delta-R² = .03, p = .04. EI provides modest but significant incremental prediction beyond cognitive ability.
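The F-test on a change in R² can be sketched directly from the two R² values. The worked example does not report its sample size, so N = 112 below is a hypothetical value used only to make the arithmetic concrete. The predictor counts follow the blocks as listed (2 + 1 + 1).

```python
def f_change(r2_reduced, r2_full, n, k_full, k_added):
    """F statistic for the R-squared gained by adding k_added predictors."""
    df_error = n - k_full - 1
    f_stat = ((r2_full - r2_reduced) / k_added) / ((1 - r2_full) / df_error)
    return f_stat, k_added, df_error

# Block 3 of the example: R² rises from .22 to .25 with one added predictor.
f_stat, df1, df2 = f_change(r2_reduced=0.22, r2_full=0.25, n=112, k_full=4, k_added=1)
print(round(f_stat, 2), df1, df2)
```

The same delta-R² produces a larger F in a larger sample, so incremental validity claims should be read alongside the size of the increment itself.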

Note: "hierarchical regression" in this sense should not be confused with "hierarchical linear modeling" (HLM), which refers to multilevel models for nested data — see Section G.

PsyStat: Hierarchical Regression module | Theory-driven

Moderation

Tests whether the relationship between a predictor and an outcome depends on the level of a third variable.

Moderation is implemented by adding an interaction term (X * W) to the regression model. A significant interaction indicates the slope of X on Y differs across levels of W. The classic interpretation is "the effect of X depends on W."

Two follow-up techniques characterize the interaction: simple slopes (pick-a-point) compute and test the slope of X at low (-1 SD), mean, and high (+1 SD) values of W; the Johnson-Neyman procedure identifies the regions of W in which the X-Y relationship is significant. Mean-center continuous predictors before computing the interaction term to reduce non-essential multicollinearity (Aiken & West, 1991); centering makes the lower-order coefficients interpretable at the mean of the other variable and leaves the interaction coefficient and its test unchanged.

Use case: Does social support (W) moderate the effect of stress (X) on depression (Y)? Significant Stress x Support interaction (b = -0.14, p = .002). Simple slopes show stress strongly predicts depression at low support (b = 0.43) but not at high support (b = 0.08).
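A simple slope is the coefficient on X plus the interaction coefficient times the chosen value of W. The sketch below reproduces the two slopes in the use case. The slope at mean support (0.255) and the SD of support (1.25) are not reported in the text. They are back-calculated here from the reported numbers, assuming support was mean-centered, and should be read as assumptions.

```python
def simple_slope(b_x, b_interaction, w):
    """Slope of X on Y at a chosen value w of the (centered) moderator."""
    return b_x + b_interaction * w

b_stress = 0.255      # assumed slope of stress at mean support (back-calculated)
b_interaction = -0.14 # Stress x Support coefficient from the use case
sd_support = 1.25     # assumed SD of support (back-calculated)

low = simple_slope(b_stress, b_interaction, -sd_support)
high = simple_slope(b_stress, b_interaction, +sd_support)
print(round(low, 2), round(high, 2))
```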

PsyStat: Moderation & PROCESS-style module | Conditional effects

Mediation

Decomposes a total effect of X on Y into a direct effect and an indirect effect transmitted through a mediator M.

The classic Baron and Kenny (1986) causal-steps approach has been superseded by direct estimation and inference on the indirect effect (a*b), where a is the X → M path and b is the M → Y path controlling for X. Inference uses the bias-corrected bootstrap confidence interval (Preacher & Hayes, 2008), which does not require the indirect effect to be normally distributed and is far more powerful than the legacy Sobel test.

Critically, mediation is a causal claim. Cross-sectional mediation analyses can describe statistical patterns but cannot establish causal mechanism without temporal precedence and (ideally) experimental manipulation of the mediator. The presence of unmeasured confounders of M and Y is the most common threat to validity.

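Use case: Does self-efficacy mediate the effect of training on performance? Indirect effect = 0.21 (95% bootstrap CI [0.09, 0.34]), significant; direct effect = 0.12 (CI [-0.03, 0.27]). This pattern is often labeled "full mediation," but a non-significant direct effect does not show the direct path is zero: the interval still includes values as large as 0.27.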
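Bootstrap inference on a*b can be sketched in plain Python. For brevity this uses the simpler percentile bootstrap rather than the bias-corrected interval described above. The data are simulated with known paths (a = 0.6, b = 0.5), and all names and values are illustrative.

```python
import random

def paths(x, m, y):
    """Return a (slope of M on X) and b (slope of Y on M, controlling for X)."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a, b

def bootstrap_indirect(x, m, y, reps=1000, seed=0):
    """Percentile 95% CI for a*b from case resampling."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]    # resample cases with replacement
        a, b = paths([x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx])
        estimates.append(a * b)
    estimates.sort()
    return estimates[int(0.025 * reps)], estimates[int(0.975 * reps) - 1]

# Simulated data: true indirect effect = 0.6 * 0.5 = 0.3 (hypothetical values).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(300)]
m = [0.6 * xi + rng.gauss(0, 1) for xi in x]
y = [0.5 * mi + 0.1 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

a, b = paths(x, m, y)
ci_low, ci_high = bootstrap_indirect(x, m, y)
print(round(a * b, 3), round(ci_low, 3), round(ci_high, 3))
```

No normality assumption is placed on the product a*b. The interval comes entirely from the resampled distribution, which is typically skewed.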

PsyStat: Mediation & SEM module | Process

Multicollinearity

A condition in which predictors in a regression model are highly correlated with each other, inflating standard errors and destabilizing coefficient estimates.

Multicollinearity does not bias coefficients but it inflates their variance, making individual predictors appear non-significant even when their joint contribution is large. The Variance Inflation Factor (VIF) is the standard diagnostic: a VIF of 1 indicates no inflation, values above 5 warrant attention, and values above 10 are typically considered problematic. The condition number of the design matrix is a more rigorous global indicator.
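The VIF for predictor j is 1 / (1 − R²_j), where R²_j is the R-squared from regressing predictor j on all other predictors. The short sketch below (function name illustrative) shows how the conventional thresholds map onto R²: a VIF of 5 corresponds to R² = .80 and a VIF of 10 to R² = .90.

```python
def vif(r_squared):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has the given R-squared."""
    if not 0 <= r_squared < 1:
        raise ValueError("R-squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R^2 = {r2:.1f} -> VIF = {vif(r2):.1f}")
```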

Solutions include: dropping one of the redundant predictors, combining them via a composite score or PCA, mean-centering interaction components, or using ridge regression (which adds an L2 penalty that stabilizes estimates at the cost of small bias). Lasso and elastic net regularization are popular when many candidate predictors are correlated.

Use case: A model includes height (cm), height (in), and BMI. VIF for the two height variables is > 100 — trivially redundant. Drop one; the model becomes well-conditioned.

PsyStat: Regression Diagnostics module | Diagnostic

Robust Regression

Regression methods that down-weight outlying observations to produce coefficient estimates resistant to outliers and heavy-tailed errors.

OLS is highly sensitive to outliers because it squares residuals — a single extreme point can dominate the fit. Robust regression replaces the squared-error loss with a function that grows more slowly for large residuals, such as Huber's M-estimator (linear beyond a threshold) or the MM-estimator (which combines high breakdown with high efficiency).
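To make the contrast concrete, the sketch below compares the squared-error loss with Huber's loss for a small and a large residual. The threshold of 1.345 (in units of standardized residuals) is a conventional default; the function names are illustrative.

```python
def squared_loss(r):
    return 0.5 * r * r

def huber_loss(r, delta=1.345):
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

for r in (0.5, 10.0):
    print(f"residual {r}: squared = {squared_loss(r):.3f}, huber = {huber_loss(r):.3f}")
```

For the small residual the two losses agree; for the large one the Huber loss grows linearly, so a single extreme point contributes far less to the fit.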

Diagnostics like Cook's distance, leverage (hat values), and DFFITS identify influential cases. Substantial disagreement between a robust fit and an OLS fit is a strong signal that the OLS results are being driven by a small number of points.

Use case: A regression of CEO compensation on firm performance shows OLS b = 0.04 (n.s.), while MM-estimator b = 0.21 (p < .001). The OLS estimate was being pulled toward zero by three founder-CEOs with anomalous compensation structures.

PsyStat: Robust Regression module | Outlier-resistant

D. Non-Parametric Tests (7 entries)

Methods that make minimal assumptions about the underlying distribution of the data, typically operating on ranks or frequencies. Use these when sample sizes are small, distributions are heavily skewed, or the outcome is genuinely ordinal.

Mann-Whitney U Test

A non-parametric alternative to the independent-samples t-test that compares the distributions of two groups using ranks.

Sometimes called the Wilcoxon rank-sum test, Mann-Whitney U pools the data from two groups, ranks the combined values, then asks whether the rank sums differ more than expected by chance. It tests stochastic dominance — whether values in one group tend to be larger than values in the other — rather than literal mean differences.

Assumptions: independence of observations and (for a strict location-shift interpretation, i.e., a difference in medians) similar distribution shapes across groups. With dissimilar shapes it remains a valid test of stochastic dominance.

The effect size is typically reported as r = Z / sqrt(N) or as the rank-biserial correlation. Common language effect size (CLES) — the probability that a randomly chosen value from group A exceeds one from group B — is increasingly preferred for interpretability.
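The link between U, the common language effect size, and the rank-biserial correlation can be sketched directly from pairwise comparisons. The data are hypothetical and the function names illustrative; ties are counted as half a win, which is one common convention.

```python
def mann_whitney_u(a, b):
    """U for group a: number of (a_i, b_j) pairs with a_i > b_j,
    counting ties as half."""
    u = 0.0
    for ai in a:
        for bj in b:
            if ai > bj:
                u += 1.0
            elif ai == bj:
                u += 0.5
    return u

def cles(a, b):
    """Probability that a random value from a exceeds one from b."""
    return mann_whitney_u(a, b) / (len(a) * len(b))

def rank_biserial(a, b):
    return 2 * cles(a, b) - 1

group_a = [7, 8, 6, 9, 5]   # hypothetical ratings
group_b = [4, 6, 3, 5, 2]
print(mann_whitney_u(group_a, group_b), cles(group_a, group_b),
      rank_biserial(group_a, group_b))
```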

Use case: Comparing self-reported pain ratings (1-10) between two surgical techniques with n = 28 per group. The highly skewed distributions argue against a t-test; Mann-Whitney U = 312, p = .04, r = .27.

PsyStat: Non-Parametric module | Two groups

Wilcoxon Signed-Rank Test

A non-parametric alternative to the paired t-test that ranks the absolute differences between paired observations.

The Wilcoxon signed-rank test handles within-subject comparisons when the difference scores are not normally distributed or when the outcome is ordinal. Differences are computed for each pair, ranked by absolute value, and the signs reattached. The test statistic compares the sum of positive ranks to the sum of negative ranks.

Assumption: the distribution of difference scores is symmetric about its median. If symmetry is implausible, the simpler sign test can be used, though it has lower power.

Use case: A study asks 22 participants to rate two product designs on a 7-point scale. The difference scores are clearly non-normal, though roughly symmetric; Wilcoxon V = 178, p = .008. Effect size r = .51 indicates a large preference for Design B.

PsyStat: Non-Parametric module | Within-subjects

Kruskal-Wallis Test

The non-parametric extension of one-way ANOVA, comparing the rank distributions of three or more independent groups.

Kruskal-Wallis ranks all observations across groups, sums the ranks within each group, and compares them to the expected sum under the null hypothesis of identical distributions. The test statistic is approximately chi-square distributed with k − 1 degrees of freedom.

A significant Kruskal-Wallis is typically followed by Dunn's test with Bonferroni or Holm correction to identify which pairs of groups differ. Effect size is reported as epsilon-squared or eta-squared based on the H statistic.

Use case: Comparing time-to-recovery (heavily right-skewed) across three rehab programs, H(2) = 11.4, p = .003, epsilon² = .14. Dunn's test with Holm correction shows that patients in Program C recover faster than those in Program A (p_adj = .002).

PsyStat: Non-Parametric module | Three+ groups

Friedman Test

The non-parametric counterpart to repeated-measures ANOVA, comparing three or more related conditions by ranking observations within each subject.

Friedman ranks each subject's observations across the within-subject conditions and tests whether the average ranks differ across conditions. It avoids the sphericity assumption of RM-ANOVA entirely and is robust to outliers.

Significant Friedman tests are followed by post-hoc procedures such as Nemenyi, Conover-Iman, or pairwise Wilcoxon tests with multiple-comparisons correction. The effect size is Kendall's W, the coefficient of concordance, which ranges from 0 (no agreement) to 1 (perfect agreement among rankings).

Use case: Forty raters evaluate four AI-generated essays on a 5-point quality scale. Friedman chi²(3) = 28.7, p < .001, W = .24. Pairwise Wilcoxon with Holm reveals Essays 2 and 4 are rated higher than Essays 1 and 3.
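Kendall's W can be recovered directly from the Friedman statistic as W = chi² / (N(k − 1)), where N is the number of subjects and k the number of conditions. A minimal sketch using the figures from the use case above (the function name is illustrative):

```python
def kendalls_w(friedman_chi2, n_subjects, k_conditions):
    """Kendall's coefficient of concordance from the Friedman chi-square."""
    return friedman_chi2 / (n_subjects * (k_conditions - 1))

w = kendalls_w(28.7, n_subjects=40, k_conditions=4)
print(round(w, 2))
```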

PsyStat: Non-Parametric module | Within-subjects

Spearman Rank Correlation

A non-parametric measure of the strength and direction of a monotonic relationship between two variables, computed as the Pearson correlation of their ranks.

Spearman's rho ranges from −1 to +1 and equals 1 when the two variables are perfectly monotonically related — i.e., one is a strictly increasing function of the other — even if the relationship is not linear. It is more robust to outliers than Pearson's r and is appropriate for ordinal data.
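The definition "Pearson correlation of the ranks" translates almost literally into code. The sketch below uses average ranks for ties and a strictly increasing but non-linear relationship (y = x³) to show rho reaching 1 while Pearson's r does not. Function names and data are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [xi ** 3 for xi in x]   # monotonic, not linear
print(f"Spearman = {spearman(x, y):.3f}, Pearson = {pearson(x, y):.3f}")
```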

Use Spearman when the bivariate scatter shows a clear monotonic but non-linear pattern, when one or both variables are ordinal, or when extreme outliers are pulling Pearson's r in misleading directions. Kendall's tau is an alternative that handles ties more gracefully and has a more direct probabilistic interpretation.

Use case: Correlating SES rank with happiness rating (1-7), n = 412. Spearman rho = .31 (95% bootstrap CI [.22, .40]), p < .001. Pearson r = .19 was attenuated by ceiling effects in the happiness measure.

PsyStat: Correlation module | Association

Chi-Square Tests

A family of tests comparing observed and expected frequencies in categorical data, used for goodness-of-fit and independence of two categorical variables.

The two main flavors are the chi-square goodness-of-fit test (does the observed distribution of one categorical variable match an expected distribution?) and the chi-square test of independence (are two categorical variables associated in a contingency table?). Both compute Sum[(O − E)² / E] and compare to a chi-square distribution.
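For the test of independence, expected counts are (row total × column total) / N, and the statistic has (r − 1)(c − 1) degrees of freedom. A minimal sketch with a hypothetical 2x2 table (the function name is illustrative):

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table
    of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square_independence([[10, 20], [20, 10]])
print(f"chi2({df}) = {stat:.2f}")
```

The p-value is then the upper-tail probability of a chi-square distribution with `df` degrees of freedom, which requires a statistics library or table and is omitted here.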

Assumptions: independent observations, mutually exclusive categories, and adequate expected cell counts (commonly all E ≥ 5; some advise ≥ 1 with no more than 20% below 5). Yates' continuity correction is sometimes applied to 2x2 tables but is generally over-conservative; Fisher's exact test is preferred for small samples. Effect sizes include Cramer's V for general r x c tables and the phi coefficient for 2x2 tables.

Use case: Testing whether voting preference is independent of region in a 4 x 3 contingency table, chi²(6) = 24.1, p < .001, Cramer's V = .14.

PsyStat: Categorical Tests module | Frequencies

Fisher's Exact Test

An exact test of independence in contingency tables, computed from the hypergeometric distribution rather than relying on the chi-square approximation.

When expected cell counts are small (a common situation in clinical trials of rare events or small-sample social science), the chi-square approximation breaks down. Fisher's exact test conditions on the marginal totals and computes the exact probability of obtaining the observed table or one more extreme.

Originally developed for 2x2 tables, Fisher's exact test extends to larger r x c tables via Monte Carlo or network algorithms. It is always valid and is the gold standard for small samples; with large samples, results converge with the chi-square test.
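For a 2x2 table the computation is small enough to write out. The sketch below sums the hypergeometric probabilities of every table with the same margins that is no more probable than the observed one, which is one common definition of the two-sided p-value. The function name is illustrative; the counts are those of the pilot trial in the use case that follows.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small tolerance so tables tied with the observed one are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Rows: treatment, placebo. Columns: responders, non-responders.
p = fisher_exact_two_sided(8, 4, 2, 9)
print(f"two-sided p = {p:.4f}")
```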

Use case: A pilot drug trial with 8 responders out of 12 in the treatment arm versus 2 of 11 in the placebo arm. The expected count in one cell is below 5, so the chi-square approximation is unreliable. Fisher's exact p = .036 (two-sided), OR = 9.0 (95% CI [1.4, 78.4]). The wide CI reflects the small sample: the test is significant but the precision is poor.

PsyStat: Categorical Tests module | Small samples

E. Bayesian Statistics (6 entries)

An alternative inferential paradigm that treats parameters as random variables described by probability distributions. Bayesian methods quantify what you should believe given the data and your prior knowledge, and have surged in popularity as computational power has made them practical.

Bayes Factors

The ratio of the marginal likelihood of the data under one model to the marginal likelihood under another, quantifying relative evidence for competing hypotheses.

BF₁₀ = P(data | H₁) / P(data | H₀). A BF₁₀ of 10 means the data are 10 times more likely under H₁ than under H₀; BF₀₁ = 1/BF₁₀ reverses the comparison. Unlike p-values, Bayes factors can quantify evidence for the null, distinguishing "evidence of absence" from "absence of evidence."
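A small worked case shows how a Bayes factor can favor the null. Assume k successes in n binomial trials, with H₀ fixing the success probability at 0.5 and H₁ placing a uniform Beta(1, 1) prior on it (a choice made here only because its marginal likelihood has the closed form 1 / (n + 1)). The function name is illustrative.

```python
from math import comb

def bf10_binomial(k, n, theta0=0.5):
    """BF10 for k successes in n trials: H1 uses a uniform prior on the
    success probability; H0 fixes it at theta0."""
    marginal_h1 = 1.0 / (n + 1)  # binomial likelihood integrated over Beta(1, 1)
    marginal_h0 = comb(n, k) * theta0 ** k * (1 - theta0) ** (n - k)
    return marginal_h1 / marginal_h0

for k in (50, 70):
    bf10 = bf10_binomial(k, 100)
    print(f"k = {k}: BF10 = {bf10:.3g}, BF01 = {1 / bf10:.3g}")
```

With 50 successes in 100 trials the data favor H₀ (BF₀₁ well above 1); with 70 they overwhelmingly favor H₁. A different prior under H₁ would give different numbers, which is why the sensitivity analysis matters.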

Jeffreys' (1961) interpretive scale — later refined by Lee and Wagenmakers — calls BFs of 1-3 anecdotal, 3-10 moderate, 10-30 strong, 30-100 very strong, and >100 extreme. These are heuristics, not laws.

Bayes factors require specifying a prior on the effect size under H₁. The JZS (Jeffreys-Zellner-Siow) Cauchy prior with scale .707 is the default in the BayesFactor R package and is generally well-behaved. Sensitivity analysis — reporting BFs across multiple prior scales — is best practice.

Use case: A replication of a marginally significant original study returns BF₀₁ = 8.4: moderate evidence in favor of the null relative to the original effect size hypothesis.

→ Deep dive: Bayesian Methods in Psychology

PsyStat: Bayesian Inference module | Model comparison

Posterior Distributions

The probability distribution of a parameter after combining prior beliefs with observed data via Bayes' theorem.

Bayes' theorem states posterior ∝ likelihood × prior. The posterior is the central object of Bayesian inference: it answers "what should I believe about this parameter, given everything I knew before plus the data I just collected?"
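The simplest concrete instance of posterior ∝ likelihood × prior is the conjugate beta-binomial update: a Beta(a, b) prior on a proportion combined with s successes and f failures gives a Beta(a + s, b + f) posterior. The prior and data below are hypothetical and the function names illustrative.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    return a / (a + b)

# Uniform Beta(1, 1) prior, then 7 successes and 3 failures.
post_a, post_b = beta_binomial_update(1, 1, successes=7, failures=3)
print(f"posterior: Beta({post_a}, {post_b}), mean = {beta_mean(post_a, post_b):.3f}")
```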

Summaries of the posterior include the posterior mean or median (point estimate), the 95% credible interval or highest density interval (HDI) (range containing 95% of posterior probability), and the posterior probability of direction (e.g., P(b > 0 | data)). Unlike a frequentist CI, a 95% credible interval can be interpreted as "there is a 95% probability the parameter lies in this range, given the data and prior."

Use case: A Bayesian regression yields a posterior on the slope with median = 0.34, 95% HDI [0.18, 0.51], P(b > 0 | data) = 0.999. The full posterior — not just the point estimate — is the inference.

→ Deep dive: Bayesian Methods in Psychology

PsyStat: Bayesian Regression module | Inference

Prior Selection

The process of choosing a probability distribution to encode beliefs about a parameter before observing the current data.

Priors range from non-informative (uniform, Jeffreys) to weakly informative (centered on plausible values with broad spread) to informative (sharply concentrated on previous estimates from prior research). The current consensus, championed by Andrew Gelman and others, is that weakly informative priors are usually preferable: they regularize estimates and avoid pathological behavior with small samples without imposing strong substantive commitments.

For regression coefficients on standardized predictors, a Normal(0, 2.5) or Cauchy(0, 2.5) prior is a common weakly informative default. For variance components, Half-Normal or Half-t priors are well-behaved.

Always conduct a sensitivity analysis: refit the model with several plausible priors and report whether substantive conclusions change. If the data are informative, the posterior is robust to reasonable prior choices; if the conclusions do shift, that sensitivity is itself the headline finding.

→ Deep dive: Bayesian Methods in Psychology

PsyStat: Bayesian Inference module | Modeling choice

Markov Chain Monte Carlo (MCMC)

A class of algorithms for sampling from complex posterior distributions when analytical solutions are unavailable.

MCMC constructs a Markov chain whose stationary distribution is the posterior of interest. After a burn-in period, the chain produces samples that can be summarized to estimate posterior means, intervals, and any function of the parameters. Key algorithms include Metropolis-Hastings, Gibbs sampling, and modern gradient-based methods like Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) implemented in Stan and PyMC.

Convergence diagnostics are essential. R-hat (Gelman-Rubin) compares within-chain to between-chain variance; values close to 1.00 (commonly < 1.01) indicate convergence. Effective sample size (ESS) measures how much information the autocorrelated chain provides relative to independent samples; aim for ESS > 400 for stable inference. Trace plots should look like fuzzy caterpillars with no drift or stickiness.
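The within-chain versus between-chain comparison can be sketched with the classic Gelman-Rubin formula. This is a simplified version: current software typically reports a rank-normalized, split-chain variant, which this sketch does not implement. The "chains" here are independent normal draws standing in for sampler output, and the function name is illustrative.

```python
import random

def rhat(chains):
    """Classic (non-split) Gelman-Rubin statistic for equal-length chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    within = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
                 for c, mu in zip(chains, means)) / m
    between = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5

rng = random.Random(0)
# Four chains exploring the same distribution.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Four chains stuck in two different regions.
stuck = [[rng.gauss(shift, 1) for _ in range(1000)] for shift in (0, 0, 3, 3)]

print(f"well-mixed: {rhat(mixed):.3f}, stuck: {rhat(stuck):.3f}")
```

When the chains agree, the between-chain term adds almost nothing and the ratio sits near 1; when they disagree, it is far above the 1.01 threshold.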

Use case: Fitting a multilevel logistic model with crossed random effects via NUTS. After 4 chains x 2,000 iterations (1,000 warmup), all R-hats < 1.005 and bulk ESS > 1,200 — safe to interpret.

PsyStat: Bayesian Modeling backend | Computation

Bayesian Regression

A regression framework that returns full posterior distributions for every coefficient, naturally incorporating uncertainty and prior information.

Bayesian regression uses the same likelihood as classical regression but updates a prior on each coefficient to produce a posterior. Benefits include: principled uncertainty quantification, natural shrinkage (regularization) toward prior means, the ability to incorporate previous research as informative priors, and graceful handling of hierarchical structures and missing data.

Sparse-prior variants like the horseshoe prior (Carvalho, Polson, & Scott, 2010) provide a Bayesian analog to the Lasso, shrinking irrelevant predictors strongly toward zero while leaving important ones nearly unbiased. Hierarchical priors enable partial pooling across groups, an alternative to mixed-effects models in many use cases.

Use case: A small-sample (N = 60) study fits a Bayesian linear model with weakly informative priors. The posterior distribution shows the slope is "probably positive" (P(b > 0 | data) = 0.93) but with substantial uncertainty — a more honest summary than a frequentist p = .07 with no effect-size CI.

PsyStat: Bayesian Regression module | Estimation

Posterior Predictive Checks

A graphical diagnostic in which simulated datasets generated from the posterior are compared to the observed data to assess model adequacy.

If the model captures the structure of the data, datasets simulated from the posterior should look similar to the observed data on key features — the mean, the variance, the proportion of zeros, the maximum, the shape of the histogram. Systematic mismatches reveal model misspecification.

Common implementations include overlaying density plots of observed vs. simulated data (using the bayesplot R package or arviz in Python), comparing test statistics (e.g., the proportion of zeros) between observed and simulated, and computing posterior predictive p-values.

PPCs are especially valuable because they evaluate the model on the scale of the data — "does this thing actually look like my data?" — rather than abstract goodness-of-fit metrics. Gelman has argued they should be the default model-checking tool in applied Bayesian work.

Use case: A Poisson regression PPC reveals the model massively under-predicts zeros in observed counts — clear evidence to switch to a zero-inflated negative binomial.

PsyStat: Bayesian Diagnostics module | Diagnostics

F. Psychometrics & Measurement (6 entries)

The science of measuring psychological constructs. These methods address how well your instruments capture the latent traits you intend to measure, and whether they do so consistently across people, time, and contexts.

Cronbach's Alpha

An estimate of internal-consistency reliability based on the average inter-item correlations and the number of items in a scale.

Cronbach's alpha typically falls between 0 and 1 (it can be negative when items covary negatively), with values commonly interpreted as: .70-.79 acceptable, .80-.89 good, .90+ excellent for established measures. However, alpha rests on the often-violated assumption of tau-equivalence (all items measure the same construct with equal loadings); when that assumption is violated, alpha typically underestimates true reliability.
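For reference, alpha is usually computed from item and total-score variances as alpha = k / (k − 1) × (1 − Σ item variances / variance of total score), where k is the number of items. A minimal sketch with hypothetical responses (function names illustrative):

```python
def variance(v):
    mu = sum(v) / len(v)
    return sum((x - mu) ** 2 for x in v) / (len(v) - 1)

def cronbach_alpha(items):
    """items: list of k lists, one per item, each holding one score
    per respondent (same respondent order in every list)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Hypothetical responses from 5 people to a 3-item scale.
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.3f}")
```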

Modern recommendations (Sijtsma, 2009; McNeish, 2018) favor McDonald's omega, computed from the factor loadings of a confirmatory factor model, as a more accurate reliability estimate. Omega does not assume tau-equivalence and is now the default in many psychometrics packages.

Caveats: alpha is heavily influenced by scale length — adding items mechanically inflates it. A high alpha does not establish unidimensionality (you can get high alpha from a multifactor scale). Always supplement alpha with factor analysis to interpret what is actually being measured.

PsyStat: Psychometrics module | Reliability

Confirmatory Factor Analysis (CFA)

A theory-driven method for testing whether observed indicators reflect a hypothesized latent factor structure.

Whereas exploratory factor analysis (EFA) discovers structure inductively, CFA tests a pre-specified model: which items load on which factors, which loadings are fixed, which residuals are correlated. Estimation is typically by maximum likelihood (or robust ML for non-normal data) within a structural equation modeling framework.

Fit is judged by multiple indices: CFI ≥ .95, TLI ≥ .95, RMSEA ≤ .06 with 90% CI upper bound ≤ .08, SRMR ≤ .08 (Hu & Bentler, 1999). The chi-square test of exact fit is sensitive to sample size and is rarely satisfied in real data with N > 200.

Use case: Testing whether a 20-item Big Five inventory has the expected 5-factor structure. CFA yields chi²(160) = 387, CFI = .94, RMSEA = .055 [.048, .062], SRMR = .051. The 5-factor model fits adequately; modification indices suggest two pairs of items with similar wording have correlated residuals.

PsyStat: SEM/CFA module | Latent variables

Item Response Theory (IRT)

A family of latent-trait models that describe how the probability of a particular response to an item depends on the respondent's ability and item parameters.

IRT shifts the focus from the test score to the underlying latent trait (theta). The Rasch / 1PL model estimates only item difficulty; the 2PL model adds item discrimination; the 3PL model adds a guessing parameter. For polytomous items there are graded-response and partial-credit models.
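The three dichotomous models are nested, which one response function can show: P(correct | theta) = c + (1 − c) / (1 + exp(−a(theta − b))), with a the discrimination, b the difficulty, and c the guessing parameter. Setting c = 0 gives the 2PL; additionally fixing a gives the 1PL. The sketch omits the optional scaling constant (often written D = 1.7) that some parameterizations include; the function name is illustrative.

```python
from math import exp

def irt_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + exp(-a * (theta - b)))

print(irt_3pl(0.0))                  # 1PL-style item at its difficulty
print(irt_3pl(1.0, a=2.0, b=1.0))    # 2PL item at its difficulty
print(irt_3pl(0.0, c=0.2))           # 3PL: guessing raises the floor
```

At theta = b the probability is 0.5 when there is no guessing, and halfway between c and 1 when there is.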

IRT enables key psychometric advances: invariant measurement (item parameters and ability are on the same scale, not test-dependent), item information functions that reveal where each item is most informative, and computerized adaptive testing in which each next item is chosen to maximize information about the current ability estimate.

Use case: A graduate admissions test uses 3PL IRT to score examinees. Item information functions identify items that primarily discriminate around the cutoff score, and CAT delivers a ~30-item test with the same precision as a fixed 80-item test.

PsyStat: IRT module | Latent trait

Measurement Invariance

A property of a measurement instrument indicating that it measures the same construct in the same way across groups or time points.

Without measurement invariance, observed group differences could reflect either substantive differences in the construct or differences in how the instrument operates — an existential threat to cross-cultural research, longitudinal designs, and any group comparison. Invariance is tested in nested CFA models with progressively stricter constraints:

Configural invariance: the same factor structure across groups.
Metric (weak) invariance: equal factor loadings; necessary for comparing covariances and regression slopes.
Scalar (strong) invariance: equal loadings and intercepts; required for meaningful comparison of latent means.
Strict invariance: equal residual variances; rarely required.

A model qualifies as invariant at a level if constraining parameters does not produce a meaningful drop in fit (e.g., delta-CFI < .010, delta-RMSEA < .015; Chen, 2007). When full invariance fails, partial invariance with a few freed parameters often suffices for comparison of those parameters that remain constrained.

PsyStat: Multi-Group CFA module | Cross-group

Test-Retest Reliability

The consistency of scores when the same instrument is administered to the same respondents at two time points.

Test-retest reliability is typically quantified by the Pearson correlation or, more rigorously, the intraclass correlation coefficient (ICC), which captures both consistency and absolute agreement. Acceptable values are context-dependent but generally ICC ≥ .70 for research, ≥ .80 for clinical decisions, ≥ .90 for high-stakes individual decisions.

The retest interval is critical. Too short and memory effects inflate the estimate; too long and genuine change in the trait deflates it. Reporting the interval and a justification for it is essential. For traits expected to be stable over months (e.g., personality), 4-8 weeks is typical; for state measures (mood), much shorter intervals or alternate-form designs are preferred.

Pearson r ignores systematic shifts (e.g., everyone scores higher at Time 2). Absolute-agreement ICCs such as ICC(2,1) penalize such shifts, whereas consistency ICCs such as ICC(3,1) do not. Choose based on whether absolute agreement matters or only consistency of rank order.

PsyStat: Reliability module | Stability

Convergent and Discriminant Validity

Two complementary forms of construct validity. Convergent validity is correlation with measures of the same construct; discriminant validity is the absence of correlation with measures of different constructs.

Campbell and Fiske's (1959) multi-trait multi-method (MTMM) matrix remains the conceptual gold standard: assess multiple constructs with multiple methods and inspect whether same-construct correlations across methods exceed different-construct correlations within methods.

In SEM, convergent validity is often summarized by the Average Variance Extracted (AVE) — the mean of squared loadings on a factor — with values above .50 considered adequate. Discriminant validity is supported when the square root of AVE for each construct exceeds its correlation with other constructs (Fornell-Larcker criterion). The newer HTMT (heterotrait-monotrait ratio) proposed by Henseler et al. (2015) is now widely recommended, with values below .85 indicating adequate discriminant validity.
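AVE and the Fornell-Larcker comparison are simple enough to compute by hand from standardized loadings. The loadings and factor correlation below are hypothetical, and the function names are illustrative.

```python
def ave(loadings):
    """Average variance extracted: mean of squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)

def fornell_larcker_ok(loadings_a, loadings_b, factor_corr):
    """True if sqrt(AVE) of both constructs exceeds their correlation."""
    smallest_root = min(ave(loadings_a), ave(loadings_b)) ** 0.5
    return smallest_root > abs(factor_corr)

scale_a = [0.8, 0.8, 0.7]   # hypothetical standardized loadings
scale_b = [0.8, 0.7, 0.6]
print(f"AVE A = {ave(scale_a):.2f}, AVE B = {ave(scale_b):.2f}")
print(fornell_larcker_ok(scale_a, scale_b, factor_corr=0.5))
```

Here scale B falls just short of the .50 AVE benchmark even though discriminant validity against scale A holds, a reminder that the two criteria answer different questions.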

Use case: A new mindfulness scale correlates r = .68 with an established mindfulness measure (convergent) but only r = .12 with a measure of executive function (discriminant) — supporting that it taps mindfulness specifically rather than general cognitive ability.

PsyStat: Validity module | Construct validity

G. Multilevel & Mixed Models (5 entries)

When data are nested — students within classrooms, repeated measures within people, patients within clinics — ordinary regression underestimates standard errors and inflates Type I error. Multilevel models honor the nested structure by partitioning variance across levels.

Intraclass Correlation (ICC)

The proportion of total variance in an outcome that lies between higher-level units (e.g., classrooms, clinics, families).

The unconditional (null model) ICC = sigma²_between / (sigma²_between + sigma²_within). An ICC of .15 means 15% of the variance in the outcome is between groups and 85% is within groups. ICCs of even .05 can produce substantial inflation of Type I error if ignored, especially with large within-cluster sample sizes.

The design effect = 1 + (n − 1) * ICC, where n is the average cluster size, quantifies how much the effective sample size is reduced by clustering. With ICC = .10 and clusters of size 30, the design effect is 3.9, meaning a sample of 600 carries about as much information as 154 independent observations.
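The arithmetic is easy to reproduce for other cluster sizes and ICCs. A minimal sketch (function names illustrative):

```python
def design_effect(cluster_size, icc):
    return 1 + (cluster_size - 1) * icc

def effective_n(n_total, cluster_size, icc):
    """Number of independent observations carrying the same information."""
    return n_total / design_effect(cluster_size, icc)

deff = design_effect(30, 0.10)
print(f"design effect = {deff:.1f}, effective n = {effective_n(600, 30, 0.10):.0f}")
```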

Whenever you suspect nested data, fit an unconditional multilevel model and report the ICC as a justification for the modeling approach. ICCs above ~.05 typically warrant multilevel analysis.

→ Deep dive: Mixed Models for Nested Data

PsyStat: Multilevel Modeling module | Nested data

Random Intercepts

A multilevel model in which the intercept varies across higher-level units, capturing baseline differences across groups.

The simplest mixed model: y_ij = beta₀ + u_0j + beta₁X_ij + e_ij, where u_0j is a normally distributed random deviation of group j's intercept from the overall mean. The slope of X is fixed across groups; only the baseline level differs.

Random intercepts produce partial pooling: estimates for each group are shrunk toward the grand mean, with the degree of shrinkage decreasing as within-group sample size increases. Small groups borrow strength from the larger sample; large groups are estimated nearly independently. This shrinkage typically improves out-of-sample prediction.

Use case: Modeling student test scores nested in 50 schools, with a random intercept for school. The model estimates an overall intercept of 72.4 and a school-level SD of 6.1 — meaning a school one SD above average has expected scores ~6 points higher.
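Partial pooling can be illustrated for the simplest case, a random-intercept model with no predictors and known variance components. The shrunken estimate for group j is a weighted average of its own mean and the grand mean, with weight tau² / (tau² + sigma² / n_j) on the group mean, where tau² is the between-group variance and sigma² the residual variance. The grand mean (72.4) and school-level SD (6.1) are taken from the use case above; the residual SD of 15 and the observed school mean of 85 are assumptions made for illustration.

```python
def pooling_weight(n_j, tau2, sigma2):
    """Weight placed on group j's own mean (1 = no pooling)."""
    return tau2 / (tau2 + sigma2 / n_j)

def shrunken_mean(group_mean, grand_mean, n_j, tau2, sigma2):
    w = pooling_weight(n_j, tau2, sigma2)
    return w * group_mean + (1 - w) * grand_mean

tau2 = 6.1 ** 2      # between-school variance (from the use case)
sigma2 = 15.0 ** 2   # within-school variance (assumed)

small = shrunken_mean(85.0, 72.4, n_j=5, tau2=tau2, sigma2=sigma2)
large = shrunken_mean(85.0, 72.4, n_j=200, tau2=tau2, sigma2=sigma2)
print(f"n = 5: {small:.1f}, n = 200: {large:.1f}")
```

The same observed mean of 85 is pulled much closer to the grand mean when it rests on 5 students than when it rests on 200.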

→ Deep dive: Mixed Models for Nested Data

PsyStat: Multilevel Modeling module | Varying baseline

Random Slopes

A multilevel extension in which the slope of a level-1 predictor is allowed to vary across higher-level units, capturing heterogeneity in the within-group relationship.

Specifying y_ij = beta₀ + u_0j + (beta₁ + u_1j) X_ij + e_ij means the relationship between X and Y differs from group to group. The variance of u_1j quantifies how much the slope varies across groups, and a covariance between u_0j and u_1j indicates whether high-baseline groups also tend to have steeper or shallower slopes.

Random slopes are essential when the research question concerns variability in effects across contexts — e.g., whether an intervention works equally well in all schools or whether the dose-response relationship is uniform across patients. Cross-level interactions explain that variability with level-2 predictors.

Use case: Studying within-person practice effects on a cognitive task across 100 people. Random intercept and slope reveal substantial individual differences in learning rate (slope SD = 0.22), which a baseline working-memory measure partially explains as a cross-level interaction.

→ Deep dive: Mixed Models for Nested Data

PsyStat: Multilevel Modeling module | Varying effects

Cross-Classified Models

Multilevel models for data nested in two or more non-hierarchical grouping factors that cross each other rather than nest cleanly.

Strictly hierarchical nesting (students in classrooms in schools) is the textbook case, but real data often violate it: students attend a primary school and then a secondary school, residents belong to a neighborhood and visit a particular doctor, raters evaluate multiple targets and each target is evaluated by multiple raters. Cross-classified models include random effects for both grouping factors simultaneously.

Estimation is computationally intensive but routine in modern software (lme4 in R, MixedModels.jl in Julia, brms for Bayesian). Cross-classified random effects are often misdiagnosed as "complicated" interactions in standard regression; the multilevel formulation is more honest about the data structure.

Use case: A judgment study where each of 60 raters evaluates a randomly selected subset of 100 photographs. A cross-classified model with random intercepts for raters and photos correctly partitions variance and produces honest standard errors that a wrongly nested or fully crossed ANOVA would miss.

PsyStat: Multilevel Modeling module | Non-nested

Growth Curve Modeling

A multilevel or latent-variable framework for modeling individual trajectories of change over repeated measurements.

Each individual has their own intercept (initial level) and slope (rate of change) over time, both treated as random effects with normal distributions. The model partitions trajectory variance into mean trajectory parameters (the average growth pattern) and individual differences in those parameters. Polynomial, piecewise, or spline functions extend the model beyond linear change.

Growth curve modeling can be implemented as a multilevel model (with time as a level-1 predictor) or as a latent growth curve model in SEM (with intercept and slope as latent variables). The two are mathematically equivalent for many specifications but differ in extensibility — SEM more easily incorporates time-varying predictors and parallel processes.

Use case: Tracking depression scores at 6 timepoints over a year for 200 patients. The model estimates a mean linear decrease of 1.4 points per month with substantial individual variation (slope SD = 0.9), and treatment group is added as a predictor of slope to test intervention efficacy on rate of recovery.

PsyStat: Growth Modeling module | Longitudinal

H. Causal Inference (6 entries)

A toolkit for moving from association to causation when randomized experiments are impossible. These methods make causal assumptions explicit, encode them in graphs or design choices, and exploit features of the data that would only arise under particular causal structures.

Directed Acyclic Graphs (DAGs)

A graphical formalism for representing causal assumptions as a network of variables connected by directed arrows representing direct causal influences.

Developed and popularized by Judea Pearl, DAGs make causal assumptions visible and testable. Each arrow represents a hypothesized direct causal effect; the absence of an arrow is also a substantive claim — that no direct causal effect exists. The graph then determines, via the backdoor criterion, which variables must be adjusted for to estimate a target causal effect without bias.

DAGs clarify common confusions: a variable can be a confounder (must adjust for it), a mediator (adjusting blocks part of the effect of interest), or a collider (adjusting creates bias by opening a previously closed non-causal path). Berkson's bias is a special case of conditioning on a collider; the Table 2 fallacy, by contrast, arises when the coefficients of adjustment covariates are themselves read as causal effects.
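Collider bias is easy to demonstrate by simulation. In the sketch below X and Y are generated independently, both cause C (X → C ← Y), and selecting on high values of C manufactures a negative association between them. All variable names and parameter values are illustrative.

```python
import random

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

rng = random.Random(7)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]                     # independent of x by construction
c = [xi + yi + rng.gauss(0, 0.5) for xi, yi in zip(x, y)]   # collider: X -> C <- Y

# Conditioning on the collider by keeping only cases with high C.
selected = [(xi, yi) for xi, yi, ci in zip(x, y, c) if ci > 1.0]
r_all = corr(x, y)
r_selected = corr([p[0] for p in selected], [p[1] for p in selected])
print(f"full sample r = {r_all:.2f}, selected on C r = {r_selected:.2f}")
```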

Tools like dagitty.net let researchers draw a DAG and automatically derive the minimal sufficient adjustment set for any causal query. Drawing the DAG before data analysis is a transparency practice that the field is increasingly adopting.

→ Deep dive: Causal Inference in Observational Data

PsyStat: DAG Builder module | Causal modeling

Propensity Score Matching

A method for estimating causal effects from observational data by matching treated and control units with similar probabilities of receiving treatment.

The propensity score, introduced by Rosenbaum and Rubin (1983), is the probability of treatment assignment given observed covariates. Its key result: conditioning on the propensity score is sufficient to remove confounding from those covariates, even though it is a single number rather than a high-dimensional vector. Matching variants include 1:1 nearest-neighbor matching, optimal matching, and kernel matching; inverse probability of treatment weighting (IPTW) uses the same score to weight units rather than match them.

The critical limitation: propensity scores can only adjust for observed confounders. Hidden confounders bias the estimate just as in any non-randomized comparison. Matching does not solve confounding; it only makes the matched sample resemble a randomized trial with respect to the variables you measured.

Always check covariate balance after matching (standardized mean differences below .10 are a common target) and the region of common support (units outside the overlap range cannot be matched and must be discarded, narrowing the population to which the result generalizes).
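
The balance check described above can be sketched as follows. The function name and the age values are hypothetical, and the pooled SD here averages the two group variances, which is one common convention among several:

```python
import statistics

def standardized_mean_difference(treated, control):
    """Difference in means divided by a pooled SD (average of the two variances)."""
    mean_diff = statistics.fmean(treated) - statistics.fmean(control)
    pooled_sd = ((statistics.variance(treated) + statistics.variance(control)) / 2) ** 0.5
    return mean_diff / pooled_sd

# Hypothetical ages in a matched sample (illustrative numbers, not real data).
age_treated = [34, 41, 29, 50, 45, 38]
age_control = [35, 40, 30, 49, 46, 36]

smd_age = standardized_mean_difference(age_treated, age_control)
balanced = abs(smd_age) < 0.10   # the .10 target mentioned in the text
print(f"SMD for age: {smd_age:.3f}, balanced: {balanced}")
```

In practice the same function would be applied to every covariate, before and after matching, so that the improvement in balance is visible.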

→ Deep dive: Causal Inference in Observational Data

PsyStat: Propensity Matching module | Observational

Instrumental Variables

A technique for estimating causal effects when treatment is endogenous, using a third variable that affects the outcome only through its effect on treatment.

An instrument Z must satisfy three conditions: (1) relevance — Z is associated with treatment X; (2) exclusion restriction — Z affects the outcome Y only through X; (3) independence — Z is independent of unmeasured confounders. The classical estimator is two-stage least squares (2SLS): regress X on Z to get predicted X, then regress Y on the predicted X.
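
A minimal simulation of the two stages, assuming a single continuous instrument and a hypothetical data-generating process in which the true effect of X on Y is 2.0 and an unmeasured confounder u biases ordinary regression. All variable names and coefficients are illustrative:

```python
import random
import statistics

random.seed(7)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    return cov(x, y) / cov(x, x)

n = 20_000
true_effect = 2.0
u = [random.gauss(0, 1) for _ in range(n)]   # unmeasured confounder
z = [random.gauss(0, 1) for _ in range(n)]   # instrument, independent of u
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * xi + 3 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

naive = ols_slope(x, y)                      # confounded by u

# Stage 1: regress X on Z and form predicted X.
b1 = ols_slope(z, x)
a1 = statistics.fmean(x) - b1 * statistics.fmean(z)
x_hat = [a1 + b1 * zi for zi in z]

# Stage 2: regress Y on predicted X.
iv_2sls = ols_slope(x_hat, y)

# With one instrument, 2SLS reduces to the ratio cov(Z, Y) / cov(Z, X).
iv_ratio = cov(z, y) / cov(z, x)

print(f"naive OLS: {naive:.2f}  2SLS: {iv_2sls:.2f}  ratio: {iv_ratio:.2f}")
```

The point estimate from the manual two-step fit is correct, but the standard errors printed by an ordinary second-stage regression are not; dedicated IV routines adjust them.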

Under an additional monotonicity assumption (no "defiers"), IV identifies the local average treatment effect (LATE) for "compliers": the units whose treatment status would have changed had the instrument changed. This is not the average treatment effect for the whole population, a subtlety often glossed over in applied work.

Weak instruments (low relevance) produce unstable IV estimates that are biased toward the OLS estimate. A common rule of thumb is a first-stage F-statistic of at least 10, and ideally much higher. Applications such as Mendelian randomization in epidemiology use genetic variants as instruments under explicit causal assumptions.
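
For a single instrument, the first-stage F-statistic can be computed directly from the correlation between Z and X. The sketch below uses simulated data with hypothetical coefficients (1.0 for a strong instrument, 0.02 for a weak one):

```python
import random
import statistics

random.seed(11)

def first_stage_f(z, x):
    """F-statistic for one instrument in the first stage: r^2 (n - 2) / (1 - r^2)."""
    n = len(z)
    mz, mx = statistics.fmean(z), statistics.fmean(x)
    szx = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    szz = sum((a - mz) ** 2 for a in z)
    sxx = sum((b - mx) ** 2 for b in x)
    r2 = szx ** 2 / (szz * sxx)
    return r2 * (n - 2) / (1 - r2)

n = 1_000
z = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
x_strong = [1.0 * zi + e for zi, e in zip(z, noise)]   # instrument strongly shifts treatment
x_weak = [0.02 * zi + e for zi, e in zip(z, noise)]    # instrument barely shifts treatment

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"strong instrument F = {f_strong:.1f}, weak instrument F = {f_weak:.1f}")
```

With several instruments or additional covariates the first-stage F comes from a joint test on the instruments, which this single-instrument shortcut does not cover.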

→ Deep dive: Causal Inference in Observational Data

PsyStat: Causal Inference module | Endogeneity

Regression Discontinuity Design

A quasi-experimental design that estimates causal effects by exploiting a known cutoff that determines treatment assignment.

RDD applies when treatment is assigned based on whether a continuous "running variable" exceeds a threshold — a scholarship awarded above a test score cutoff, a Medicare benefit eligibility starting at age 65, a class-size limit triggering an extra teacher. Units just above and below the cutoff are presumed comparable except for treatment, allowing local causal inference.

Two flavors: sharp RDD (treatment is a deterministic function of the running variable) and fuzzy RDD (the cutoff changes the probability of treatment but not deterministically; estimated via IV with the cutoff as instrument). Estimation is typically by local linear regression within an optimally chosen bandwidth (Imbens & Kalyanaraman, 2012).
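
A minimal sketch of sharp RDD estimation on simulated data. It fits a separate straight line on each side of the cutoff within a fixed bandwidth (a uniform kernel) and takes the gap between the two fitted values at the cutoff. The bandwidth of 0.25 is chosen by hand for illustration, not by the Imbens-Kalyanaraman procedure, and the data-generating values are hypothetical:

```python
import random
import statistics

random.seed(3)

def fit_line(x, y):
    """Return (intercept, slope) of a simple OLS fit of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

cutoff, bandwidth, true_jump = 0.0, 0.25, 2.0
n = 5_000
running = [random.uniform(-1, 1) for _ in range(n)]
outcome = [
    1.0 + 0.5 * r + (true_jump if r >= cutoff else 0.0) + random.gauss(0, 0.5)
    for r in running
]

# Keep only units within the bandwidth, centered at the cutoff and split by side.
left = [(r - cutoff, y) for r, y in zip(running, outcome) if cutoff - bandwidth <= r < cutoff]
right = [(r - cutoff, y) for r, y in zip(running, outcome) if cutoff <= r <= cutoff + bandwidth]

# Because the running variable is centered, each intercept is the fitted outcome at the cutoff.
left_at_cutoff, _ = fit_line(*zip(*left))
right_at_cutoff, _ = fit_line(*zip(*right))
rdd_estimate = right_at_cutoff - left_at_cutoff
print(f"estimated jump at cutoff: {rdd_estimate:.2f}")
```

Rerunning with several bandwidths is a simple way to carry out the robustness check that this design calls for.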

Validity checks: covariates should not jump at the cutoff, the density of the running variable should not show manipulation around the cutoff (McCrary test), and the result should be robust to bandwidth choice. RDD has the strongest internal validity among observational designs — close to a randomized trial — but the result only generalizes to units near the cutoff.

→ Deep dive: Causal Inference in Observational Data

PsyStat: RDD module | Quasi-experiment

Difference-in-Differences

A causal-inference design that estimates a treatment effect by comparing the before-after change in a treated group to the before-after change in an untreated group.

DiD subtracts out two sources of bias simultaneously: time-invariant differences between groups (via the within-group difference) and common time trends (via the comparison to the control group's change). Implemented as a regression with group, time, and group-by-time interaction terms, where the interaction coefficient is the DiD estimate.
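
The arithmetic can be sketched with four hypothetical cell means (illustrative numbers, not real data). The second half shows that the interaction coefficient in the regression parameterization is the same quantity:

```python
# Hypothetical group means (illustrative numbers, not real data).
means = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 13.5,
    ("control", "pre"): 8.0,
    ("control", "post"): 9.5,
}

change_treated = means[("treated", "post")] - means[("treated", "pre")]
change_control = means[("control", "post")] - means[("control", "pre")]
did_estimate = change_treated - change_control

# Same number from the saturated regression
#   y = b0 + b1*group + b2*time + b3*(group*time),
# whose coefficients can be read off the four cell means.
b0 = means[("control", "pre")]
b1 = means[("treated", "pre")] - b0
b2 = means[("control", "post")] - b0
b3 = means[("treated", "post")] - (b0 + b1 + b2)

print(f"DiD from differences: {did_estimate}, interaction coefficient: {b3}")
```

The regression form is preferred in practice because it yields standard errors and accommodates covariates.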

The crucial identifying assumption is parallel trends: in the absence of treatment, the two groups would have evolved similarly. Pre-treatment data should be inspected for parallel trajectories, and event-study plots showing leads and lags around the treatment date are now standard for transparency.

Recent econometric work (Goodman-Bacon, 2021; Callaway & Sant'Anna, 2021) has shown that the standard two-way fixed effects estimator can be badly biased when treatment is staggered across units and effects are heterogeneous. New estimators address this and are increasingly the norm.

Use case: A state raises its minimum wage in 2020. DiD compares wage and employment changes in that state vs. neighboring states pre- and post-2020.

→ Deep dive: Causal Inference in Observational Data

PsyStat: DiD module | Panel data

Confounders, Mediators, and Colliders

Three causally distinct roles a variable can play relative to an exposure-outcome relationship, each requiring different statistical handling.

A confounder is a common cause of both exposure and outcome (e.g., age affects both smoking and lung cancer). To estimate the causal effect of exposure, you must adjust for confounders. A mediator lies on the causal pathway from exposure to outcome (e.g., tar deposition mediates smoking's effect on cancer). Adjusting for it removes part of the very effect you're trying to estimate.

A collider is a common effect of two variables (e.g., a hospital admission caused jointly by injury and illness). Adjusting for a collider opens a non-causal path between its parents and induces spurious association — the famous collider stratification bias. Restricting analysis to a subset based on a collider (e.g., studying only hospitalized patients) creates the same bias.
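
A small simulation makes the hospital example concrete. Injury and illness severity are generated independently, admission depends on both, and the correlation is then computed in the full sample and among admitted patients only. The variable names, admission rule, and threshold are hypothetical:

```python
import random
import statistics

random.seed(1)

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

n = 20_000
injury = [random.gauss(0, 1) for _ in range(n)]
illness = [random.gauss(0, 1) for _ in range(n)]   # independent of injury by construction

# Collider: admission becomes likely when combined severity is high.
admitted = [(a + b + random.gauss(0, 1)) > 1.0 for a, b in zip(injury, illness)]

r_all = pearson_r(injury, illness)
r_admitted = pearson_r(
    [a for a, keep in zip(injury, admitted) if keep],
    [b for b, keep in zip(illness, admitted) if keep],
)
print(f"all units:     r = {r_all:+.3f}")
print(f"admitted only: r = {r_admitted:+.3f}")
```

The full-sample correlation should sit near zero, while the admitted-only correlation comes out negative even though neither variable causes the other.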

The Table 2 fallacy (Westreich & Greenland, 2013) is the practice of interpreting all coefficients in a multivariable regression as causal effects of their respective predictors, when in fact only the target exposure has its causal effect identified by the chosen adjustment set; the other coefficients reflect a confused mix of direct and indirect paths.

→ Deep dive: Causal Inference in Observational Data

PsyStat: DAG Builder module | Conceptual

I. Meta-Analysis (5 Entries)

The quantitative synthesis of effect sizes across studies. A well-conducted meta-analysis can yield more precise and generalizable estimates than any single study, but it can also propagate the biases of the underlying literature if conducted carelessly.

Fixed vs Random Effects Models

Two competing assumptions about the population of studies being synthesized: a single common true effect (fixed) versus a distribution of true effects (random).

A fixed-effect (or common-effect) model assumes that every included study estimates the same true effect, with observed differences arising solely from sampling error. Each study is weighted by the inverse of its variance, giving large studies a dominant role.

A random-effects model assumes the studies' true effects are themselves drawn from a distribution — methodologically diverse studies may genuinely target slightly different populations, contexts, or operationalizations. The model adds a between-study variance component (tau²) and broadens confidence intervals accordingly. The DerSimonian-Laird estimator is classic; REML and Paule-Mandel are now generally preferred.
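
The difference between the two weighting schemes can be sketched in a few lines. The effect sizes, sampling variances, and the value of tau² below are hypothetical; here tau² is supplied rather than estimated:

```python
def pooled_estimate(effects, variances, tau2=0.0):
    """Inverse-variance pooled effect and its standard error.

    tau2 = 0 gives the fixed-effect model; tau2 > 0 adds the between-study
    variance to every study's sampling variance, as in a random-effects model.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    estimate = sum(w * d for w, d in zip(weights, effects)) / total
    return estimate, (1.0 / total) ** 0.5

# Hypothetical effect sizes (d) and sampling variances for four studies.
effects = [0.10, 0.30, 0.50, 0.70]
variances = [0.01, 0.04, 0.04, 0.09]

fixed_est, fixed_se = pooled_estimate(effects, variances)
random_est, random_se = pooled_estimate(effects, variances, tau2=0.05)  # tau2 assumed

print(f"fixed:  {fixed_est:.3f} (SE {fixed_se:.3f})")
print(f"random: {random_est:.3f} (SE {random_se:.3f})")
```

Adding tau² makes the weights more equal across studies, so the largest study dominates less and the standard error grows.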

Random-effects is the safer default for behavioral science meta-analyses, where heterogeneity is the rule rather than the exception. Always report the implied prediction interval alongside the summary effect — it tells readers the range of true effects that future similar studies are likely to find, which is often dramatically wider than the summary CI.
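
A commonly used approximate form of the 95% prediction interval is shown below. The symbols are assumptions for this sketch: μ̂ is the random-effects summary estimate, SE(μ̂) its standard error, τ̂² the estimated between-study variance, and k the number of studies.

```latex
\hat{\mu} \;\pm\; t_{k-2,\,0.975}\,\sqrt{\hat{\tau}^{2} + \mathrm{SE}(\hat{\mu})^{2}}
```

Because τ̂² sits inside the square root, the interval stays wide even when a large number of studies shrinks SE(μ̂) toward zero.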

PsyStat: Meta-Analysis module | Synthesis

Heterogeneity (Q, I², τ²)

Statistics quantifying the variability of effect sizes across studies in a meta-analysis beyond what would be expected from sampling error alone.

Cochran's Q tests whether the observed variation exceeds what sampling error predicts. Its power depends heavily on the number of studies (low with few, and high enough with many to flag trivial heterogeneity as significant), so it should not drive modeling choices on its own. I² (Higgins & Thompson, 2002) expresses the proportion of total variance attributable to between-study heterogeneity rather than sampling error; conventional benchmarks are 25% (low), 50% (moderate), and 75% (high), but these are rough heuristics.

τ² is the variance of true effects across studies; its square root, τ, is on the same scale as the effect size itself. It is arguably the most interpretable heterogeneity statistic because it directly informs the prediction interval. A τ (the standard deviation) of 0.20 on a Cohen's d scale means substantial real-world variability in effects across study contexts: if the average effect were d = 0.50 and true effects were roughly normal, about 95% of them would fall between roughly 0.1 and 0.9.
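
The three statistics can be computed from effect sizes and sampling variances alone. The sketch below uses the DerSimonian-Laird moment estimator for τ² because it has a closed form; the study values are hypothetical:

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q, I^2 (as a proportion), and the DerSimonian-Laird tau^2."""
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum_w
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    return q, i2, tau2

# Hypothetical effect sizes (d) and sampling variances for four studies.
effects = [0.10, 0.30, 0.50, 0.70]
variances = [0.01, 0.04, 0.04, 0.09]

q, i2, tau2 = heterogeneity(effects, variances)
tau = tau2 ** 0.5
print(f"Q = {q:.2f}, I2 = {100 * i2:.0f}%, tau2 = {tau2:.3f}, tau = {tau:.2f}")
```

REML or Paule-Mandel estimates of τ² require iteration and are better left to a dedicated meta-analysis package.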

High heterogeneity is not a flaw to be eliminated but a signal that the question "what is the average effect?" needs to be supplemented with "what moderates it?" — leading to meta-regression and subgroup analyses.

PsyStat: Meta-Analysis module | Heterogeneity

Publication Bias

Systematic distortion in a meta-analytic estimate caused by the selective publication of studies based on their results — typically, studies with significant or large effects.

The classic diagnostic is the funnel plot: effect sizes plotted against precision (or its inverse). Without bias, the plot should be symmetric around the summary effect. Asymmetry — a gap of small, null-finding studies in the bottom of the plot — suggests publication bias. Egger's regression test formalizes this asymmetry, though it has low power with few studies and can confound bias with genuine small-study effects.

Adjustment methods include trim-and-fill (impute hypothetical missing studies to restore symmetry), PET-PEESE (regress effect sizes on their standard errors for PET, or on their squared standard errors for PEESE, and take the intercept as the bias-adjusted effect), and selection models that explicitly parameterize the publication process. p-curve and p-uniform use the distribution of significant p-values to infer evidential value beyond what selective reporting could produce.
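
As an illustration of the regression-based adjustments, the sketch below fits the PET and PEESE regressions by weighted least squares with inverse-variance weights. The five studies are hypothetical and constructed so that the effect equals 0.10 plus the standard error, which means the PET intercept should recover 0.10:

```python
def weighted_line(x, y, w):
    """Weighted least squares fit of y on x; returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical studies in which smaller (noisier) studies report larger effects.
std_errors = [0.05, 0.10, 0.15, 0.20, 0.30]
effects = [0.15, 0.20, 0.25, 0.30, 0.40]       # constructed as 0.10 + 1.0 * SE
weights = [1.0 / se ** 2 for se in std_errors]

# PET: regress effect on SE; the intercept estimates the effect at SE = 0.
pet_intercept, pet_slope = weighted_line(std_errors, effects, weights)

# PEESE: same idea, but the predictor is the sampling variance SE^2.
peese_intercept, _ = weighted_line([se ** 2 for se in std_errors], effects, weights)

print(f"PET intercept: {pet_intercept:.3f}, PEESE intercept: {peese_intercept:.3f}")
```

The naive average of these effects is well above 0.10, so the intercept shows how much of the apparent effect tracks study imprecision.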

No single method is dispositive. Modern best practice runs several adjustment methods and reports their range as a sensitivity analysis. Pre-registration of studies and registries that include null results are the long-term solutions; statistical adjustments are damage control.

→ Deep dive: The Replication Crisis & Pre-Registration

PsyStat: Publication Bias module | Bias correction

Meta-Regression

A regression of study-level effect sizes on study-level moderators, used to explain between-study heterogeneity.

When a meta-analysis shows substantial heterogeneity, meta-regression asks whether systematic study features — sample mean age, dose of intervention, year of publication, country, methodological quality — account for it. Each study contributes one observation, with the moderator value as the predictor.

Important cautions: meta-regression with few studies has very low power; a useful rule of thumb is at least 10 studies per moderator. Analyses are inherently ecological — cross-study correlations cannot establish individual-level causal effects, and confounding at the study level is rampant. The Knapp-Hartung adjustment to standard errors is now standard for meta-regression and should be reported.
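
A minimal sketch of a single-moderator meta-regression, fitted by weighted least squares with weights 1 / (sampling variance + τ²). The ten studies are hypothetical and constructed so that the effect is exactly 0.20 + 0.04 per session; τ² is supplied rather than estimated, and the Knapp-Hartung standard errors are not computed here:

```python
def meta_regression(moderator, effects, variances, tau2):
    """Weighted least squares of effect size on one moderator.

    Weights are 1 / (sampling variance + tau2). Returns (intercept, slope).
    """
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, moderator)) / sw
    my = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, moderator, effects))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, moderator))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical studies: one row per study, moderator = number of sessions.
sessions = [4, 6, 8, 8, 10, 12, 12, 16, 16, 20]
effects = [0.20 + 0.04 * s for s in sessions]
variances = [0.02, 0.03, 0.01, 0.04, 0.02, 0.05, 0.03, 0.02, 0.04, 0.05]

intercept, slope = meta_regression(sessions, effects, variances, tau2=0.01)  # tau2 assumed
print(f"intercept = {intercept:.2f}, slope per session = {slope:.3f}")
```

Real data will not fall on a line, and the slope then describes an association across studies, not an effect of adding sessions for any individual.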

Use case: A meta-analysis of CBT for anxiety finds moderate heterogeneity (I² = 62%). Meta-regression on number of sessions reveals each additional session is associated with d = 0.04 larger effect (95% CI [0.01, 0.07]), explaining roughly 30% of the heterogeneity.

PsyStat: Meta-Regression module | Moderators

Forest Plot

The canonical visualization of a meta-analysis: each study's effect size and confidence interval as a horizontal line, with a summary diamond at the bottom.

Each row of a forest plot represents one study, with a square (sized proportional to the study's weight) marking the point estimate and a horizontal line showing the confidence interval. The summary diamond at the bottom shows the meta-analytic estimate at its center, and its horizontal span marks the confidence interval. A vertical line of "no effect" (e.g., d = 0 or OR = 1) makes it immediately apparent which studies' intervals cross it.

Well-designed forest plots include columns for the study citation, sample sizes, raw event counts (for binary outcomes), the effect size with CI, and the relative weight in the meta-analysis. Adding a prediction interval as a separate row at the bottom (Riley, Higgins, & Deeks, 2011) is best practice for random-effects models because it conveys the range of effects future studies are likely to find.

The forest plot has rightly been called the "single most informative graphic in evidence-based medicine" — a reader can grasp the heterogeneity, the precision of each study, and the direction and magnitude of the synthesis at a glance.

PsyStat: Meta-Analysis Visualization | Visualization

Selected References

Foundational and contemporary works that informed the entries above. Many of these are open-access or available through major university libraries.

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Sage.

Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research. Journal of Personality and Social Psychology, 51(6), 1173-1182.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105.

Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465-480.

Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14(3), 464-504.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology, 30(1), 92-101.

Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.

Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.

Henseler, J., Ringle, C. M., & Sarstedt, M. (2015). A new criterion for assessing discriminant validity in variance-based structural equation modeling. Journal of the Academy of Marketing Science, 43(1), 115-135.

Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11), 1539-1558.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1-55.

Imbens, G. W., & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies, 79(3), 933-959.

McNeish, D. (2018). Thanks coefficient alpha, we'll take it from here. Psychological Methods, 23(3), 412-433.

Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press.

Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects. Behavior Research Methods, 40(3), 879-891.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74(1), 107-120.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129-133.

Westreich, D., & Greenland, S. (2013). The Table 2 fallacy: Presenting and interpreting confounder and modifier coefficients. American Journal of Epidemiology, 177(4), 292-298.