Effect Sizes Beyond Cohen's d

By Moonlit Social Labs · April 16, 2026 · 11 min read

Ask a working researcher to name an effect size and you will, with near certainty, hear "Cohen's d." Ask them to name a second one and the room often gets quiet. That silence is a problem. Cohen's d is a useful summary, but it is one of perhaps a dozen widely used effect-size metrics, each suited to a different design, distribution, or question. Choosing well — and reporting transparently — matters more than most introductory courses suggest.

This guide is intended as a working reference for behavioral and biomedical researchers. We start with why effect sizes matter at all, walk through the most common metrics, and finish with the field-specific benchmarks debate that has reshaped how the discipline interprets "small," "medium," and "large."

Why Effect Sizes Matter (More Than p-Values)

A p-value answers a single, narrow question: assuming the null is true, how surprising is this data? It does not tell you how big the effect is, how clinically meaningful it is, or whether anyone should care. With a large enough sample, trivially small differences become "highly significant." With a small sample, real and important effects are routinely missed.

An effect size answers a different and ultimately more useful question: how large is the thing we found? Reporting effect sizes alongside test statistics has been required by APA-style journals for two decades and is now standard in CONSORT, STROBE, and most major reporting guidelines. The American Statistical Association's 2016 statement on p-values made the same point bluntly: a p-value, divorced from effect size and uncertainty, is barely an analysis at all.

"Statistical significance is not equivalent to scientific, human, or economic significance. The effect size is what determines practical importance." — Wasserstein & Lazar, 2016 (ASA statement)

Cohen's d: The Default, and Its Discontents

Cohen's d is the standardized mean difference between two groups. The standard formulation divides the raw mean difference by the pooled within-group standard deviation:

$$d = \frac{M_1 - M_2}{SD_{\text{pooled}}}$$

It is intuitive, it has been around since the 1960s, and it is the input most meta-analysts expect. But Cohen's d has three well-documented limitations.

First, it is upwardly biased in small samples. The pooled SD is computed from sample standard deviations, which underestimate the population SD when n is small; this inflates d. The bias is under 1% at n = 50 per group, but it grows to roughly 4% at n = 10 per group and is severe in very small pilot studies.

Second, it assumes equal variances. Pooling SDs from groups with very different spreads produces a value that is mathematically defined but conceptually muddled — the standardizer is a weighted average of two things that are not the same.

Third, the popular 0.2 / 0.5 / 0.8 cutpoints are not laws of nature. Cohen himself proposed them as last-resort heuristics for fields that had no empirical benchmarks of their own, and he repeatedly cautioned against treating them as universal. We will return to this in the section on field-specific norms.
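As a minimal sketch, the pooled-SD formula above can be computed from summary statistics alone. The function names are illustrative rather than taken from any particular package, the pooled SD is assumed to use the usual degrees-of-freedom weighting, and the inputs are the treatment and control summaries from the worked comparison later in this article.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled within-group SD, weighting each variance by its degrees of freedom."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference (group 1 minus group 2) using the pooled SD."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)

# Treatment (M = 14.2, SD = 6.1, n = 40) vs. control (M = 18.7, SD = 7.4, n = 40).
d = cohens_d(14.2, 6.1, 40, 18.7, 7.4, 40)
print(round(d, 2))  # -0.66
```

The sign only reflects the order of subtraction; here a negative d means lower depression scores in the treatment group.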

Hedges' g: The Small-Sample Correction

Hedges and Olkin (1985) showed that multiplying Cohen's d by a correction factor J, approximately 1 − 3 / (4df − 1) with df = n₁ + n₂ − 2, yields an approximately unbiased estimator of the population standardized mean difference. The corrected statistic is called Hedges' g.

For large samples, g and d are nearly identical. For small samples they diverge: by the approximation above, g is about 2% smaller than d at n = 20 per group, about 4% smaller at n = 10, and close to 10% smaller at n = 5. Most modern meta-analytic software defaults to g precisely because the bias matters when synthesizing many small studies. If you are running a study with modest n (which is to say, most psychology studies), you should be reporting g, not d.
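To see how quickly the correction fades as samples grow, here is a small sketch that applies the approximate J factor. It assumes two independent groups with df = n₁ + n₂ − 2; the function names are hypothetical, and the exact (gamma-function) form of J is deliberately not used.

```python
def hedges_j(df):
    """Approximate small-sample correction factor J = 1 - 3 / (4*df - 1)."""
    return 1 - 3 / (4 * df - 1)

def hedges_g(d, n1, n2):
    """Cohen's d shrunk by J, using df = n1 + n2 - 2 for two independent groups."""
    return d * hedges_j(n1 + n2 - 2)

# Shrinkage (1 - J) for a few equal per-group sample sizes.
for n in (5, 10, 20, 40):
    shrinkage = 1 - hedges_j(2 * n - 2)
    print(f"n = {n:>2} per group: g is {shrinkage:.1%} smaller than d")

# d = -0.6636 is the unrounded Cohen's d implied by this article's worked comparison.
print(round(hedges_g(-0.6636, 40, 40), 3))  # -0.657
```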

Glass's Delta: When Variances Don't Match

Glass's Δ (delta) sidesteps the equal-variance problem by standardizing on the control group's SD alone:

$$\Delta = \frac{M_{\text{treatment}} - M_{\text{control}}}{SD_{\text{control}}}$$

This is the right choice when the intervention itself changes variability — which it often does. A successful psychotherapy may compress the spread of depression scores, making the treatment group's SD genuinely smaller than the control's. Pooling those SDs would muddy the standardizer; using only the control SD preserves a clean reference scale.

Glass's Δ is also the right choice when one group is a published normative sample and you do not have raw scores to pool. Its drawbacks are that it discards information from the treatment group's variability and tends to have wider confidence intervals than d or g.
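The choice of standardizer is easiest to appreciate numerically. The sketch below uses the summary statistics from this article's worked comparison and divides the same mean difference by the control SD (Glass's Δ) and, purely for contrast, by the treatment SD. The helper name is illustrative.

```python
def standardized_difference(m_treatment, m_control, standardizer_sd):
    """Mean difference divided by whichever SD is chosen as the standardizer."""
    return (m_treatment - m_control) / standardizer_sd

m_t, sd_t = 14.2, 6.1  # treatment group mean and SD
m_c, sd_c = 18.7, 7.4  # control group mean and SD

glass_delta = standardized_difference(m_t, m_c, sd_c)      # control SD only
treatment_scaled = standardized_difference(m_t, m_c, sd_t)  # treatment SD, for contrast

print(round(glass_delta, 2))       # -0.61
print(round(treatment_scaled, 2))  # -0.74
```

The same 4.5-point difference looks about 20% larger when scaled by the narrower treatment SD, which is why the standardizer should always be reported.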

Correlation r and Its Cousins

For continuous bivariate relationships, the Pearson correlation r is itself an effect size — bounded between −1 and 1, scale-free, and easily understood. It converts cleanly to and from d:

r = d / √(d² + 4)   and   d = 2r / √(1 − r²)

For ranked or ordinal data, Spearman's ρ and Kendall's τ are the analogous effect sizes. They are less efficient than r when assumptions of bivariate normality hold, but considerably more robust when they do not. For 2×2 tables, the phi coefficient is mathematically equivalent to Pearson's r on dichotomous variables.
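The two conversion formulas above are exact inverses of each other, which a short sketch can confirm. One assumption worth flagging: the constant 4 in the d-to-r formula presumes equal group sizes, and unequal groups need a different constant. Function names are illustrative.

```python
import math

def d_to_r(d):
    """Convert a standardized mean difference to r (equal group sizes assumed)."""
    return d / math.sqrt(d**2 + 4)

def r_to_d(r):
    """Convert r back to a standardized mean difference."""
    return 2 * r / math.sqrt(1 - r**2)

r = d_to_r(-0.6636)  # unrounded d from this article's worked comparison
print(round(r, 2))          # -0.31
print(round(r_to_d(r), 4))  # -0.6636
```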

Eta-Squared, Partial Eta-Squared, and the Trap

For ANOVA designs, eta-squared (η²) reports the proportion of total variance explained by an effect:

$$\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}$$

It is intuitive but problematic in factorial designs: as you add factors to your model, the denominator grows, mechanically shrinking the η² for any single effect even if the underlying relationship is unchanged. Partial eta-squared (η²p) was introduced to fix this:

$$\eta^2_p = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}$$

By removing variance attributable to other factors from the denominator, partial η² gives a cleaner answer to "how much of the variance left over after controlling for everything else does this factor explain?"

Why partial eta-squared can mislead

Partial η² values from different studies are not directly comparable, because each study's denominator depends on which other factors happened to be in that particular model. SPSS, until quite recently, mislabeled partial η² as η² in its output, which led to a generation of papers reporting inflated effect sizes. If you see a paper reporting "η² = .35" from a 2×3 ANOVA, treat it as suspect: that is almost always partial η², and the classical (non-partial) η² would be considerably smaller. Always state explicitly which version you are reporting.
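The gap between the two statistics is easy to demonstrate with made-up numbers. The sums of squares below are hypothetical, chosen only for round arithmetic, and assume a balanced two-factor design so that the components add up to the total.

```python
def eta_squared(ss_effect, ss_total):
    """Share of total variance attributed to the effect."""
    return ss_effect / ss_total

def partial_eta_squared(ss_effect, ss_error):
    """Share of (effect + error) variance attributed to the effect."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares for factors A, B, their interaction, and error.
ss = {"A": 30.0, "B": 50.0, "AxB": 20.0, "error": 100.0}
ss_total = sum(ss.values())  # 200.0 in a balanced design

eta2_a = eta_squared(ss["A"], ss_total)
partial_eta2_a = partial_eta_squared(ss["A"], ss["error"])
print(round(eta2_a, 3), round(partial_eta2_a, 3))  # 0.15 0.231
```

Factor A's own sum of squares never changed, yet the partial version is about half again as large because factor B and the interaction were dropped from the denominator.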

Omega-Squared: The Less Biased Alternative

Both eta-squared and partial eta-squared are biased upward in small samples, sometimes substantially. Omega-squared (ω²) corrects for this by adjusting both numerator and denominator with degrees-of-freedom terms:

$$\omega^2 = \frac{SS_{\text{effect}} - df_{\text{effect}} \times MS_{\text{error}}}{SS_{\text{total}} + MS_{\text{error}}}$$

Omega-squared is essentially always smaller than eta-squared and approaches it as n grows. Lakens (2013) and many subsequent methodologists recommend ω² (or its partial counterpart, ω²p) as the default reporting metric for ANOVA-style designs. The cost is a slight loss of interpretive simplicity; the gain is a substantially more honest estimate.
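A short sketch of the ω² formula, again on hypothetical ANOVA output (an effect with 1 degree of freedom, an error term with 50), shows the size of the downward adjustment relative to η². The values are invented for illustration and the function name is not from any particular package.

```python
def omega_squared(ss_effect, df_effect, ss_total, ms_error):
    """Omega-squared: eta-squared with a degrees-of-freedom bias adjustment."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Hypothetical ANOVA table entries.
ss_effect, df_effect = 30.0, 1
ss_error, df_error = 100.0, 50
ss_total = 200.0
ms_error = ss_error / df_error  # 2.0

eta2 = ss_effect / ss_total
omega2 = omega_squared(ss_effect, df_effect, ss_total, ms_error)
print(round(eta2, 3), round(omega2, 3))  # 0.15 0.139
```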

Cohen's f: For Power Analysis on ANOVA

Cohen's f is a convenience reparameterization for use in power analysis software like G*Power. It plays the role for ANOVA that d plays for two-group comparisons (with two equal-sized groups, f = d / 2) and is computed from η²:

f = √(η² / (1 − η²))

You will rarely see f reported in a results section, but you will use it constantly when planning sample sizes for ANOVA-based studies. Cohen's tentative benchmarks for f are 0.10 (small), 0.25 (medium), and 0.40 (large). Same caveats apply.
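Because power software asks for f while papers report η², it helps to be able to move in both directions. This is a minimal sketch with illustrative function names; the inverse formula follows algebraically from the definition of f above.

```python
import math

def cohens_f(eta2):
    """Cohen's f from eta-squared: sqrt(eta2 / (1 - eta2))."""
    if not 0 <= eta2 < 1:
        raise ValueError("eta-squared must be in [0, 1)")
    return math.sqrt(eta2 / (1 - eta2))

def eta2_from_f(f):
    """Inverse conversion: eta-squared implied by a given f."""
    return f**2 / (1 + f**2)

# Eta-squared values implied by Cohen's tentative benchmarks for f.
for f in (0.10, 0.25, 0.40):
    print(f"f = {f:.2f}  ->  eta-squared = {eta2_from_f(f):.3f}")
```

Cohen's benchmarks of 0.10, 0.25, and 0.40 correspond to η² of roughly 0.01, 0.06, and 0.14.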

Effect Sizes for Non-Parametric Tests

When your data violate the assumptions of parametric tests — ordinal scales, heavy skew, ceiling and floor effects — you need effect sizes built for ranks rather than means.

Cliff's delta

Cliff's d (sometimes δ) is the probability that a randomly drawn observation from group 1 exceeds one from group 2, minus the probability of the reverse. It ranges from −1 to +1, requires no distributional assumptions, and is robust to outliers and ceiling effects. Romano et al. (2006) suggested |d| < 0.147 = negligible, < 0.33 = small, < 0.474 = medium, otherwise large — but again, treat thresholds as field-dependent.
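The definition above translates almost word for word into code. The sketch below compares every observation in one group with every observation in the other; the two lists of ordinal ratings are hypothetical. This brute-force version is quadratic in sample size, which is fine for typical study sizes.

```python
def cliffs_delta(xs, ys):
    """P(x > y) - P(x < y) over all cross-group pairs; ties contribute zero."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))

# Hypothetical 1-5 ratings from two small groups.
group1 = [3, 4, 4, 5, 5]
group2 = [1, 2, 3, 3, 4]

print(cliffs_delta(group1, group2))  # 0.76
```

Of the 25 cross-group pairs, 20 favor group 1, 1 favors group 2, and 4 are ties, giving (20 − 1) / 25 = 0.76.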

Rank-biserial correlation

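The rank-biserial correlation $r_{rb}$ is the standard effect size for the Mann–Whitney U test, and a matched-pairs version exists for the Wilcoxon signed-rank test. For two independent groups it can be computed directly from U (the sign depends on which group's U is used):

$$r_{rb} = 1 - \frac{2U}{n_1 n_2}$$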

It has the same −1 to +1 interpretation as Pearson's r, but operates on ranks rather than raw values.
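The sketch below computes U by direct pair counting (ties counted as one half) and then applies the formula above. The data are hypothetical, and the helper names are illustrative. It also shows the sign convention: passing one group's U gives the mirror image of passing the other's.

```python
def u_statistic(xs, ys):
    """U for xs: count of (x, y) pairs with x > y, ties counted as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from U: 1 - 2U / (n1 * n2)."""
    return 1 - 2 * u / (n1 * n2)

# Hypothetical 1-5 ratings from two small groups.
group1 = [3, 4, 4, 5, 5]
group2 = [1, 2, 3, 3, 4]
n1, n2 = len(group1), len(group2)

u1 = u_statistic(group1, group2)  # 22.0: pairs where group 1 is higher
u2 = u_statistic(group2, group1)  # 3.0: pairs where group 2 is higher

print(round(rank_biserial(u2, n1, n2), 2))  # 0.76, positive when group 1 tends higher
print(round(rank_biserial(u1, n1, n2), 2))  # -0.76, same magnitude, opposite sign
```

For two independent groups this value has the same magnitude as Cliff's delta computed on the same data.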

Effect Sizes for Categorical Outcomes

When the dependent variable is a count, a category, or a binary event, the right effect sizes look very different from d or η².

Odds ratios and risk ratios

The odds ratio (OR) is the ratio of odds of an event in two groups; the risk ratio (RR) is the ratio of probabilities. Both are bounded below by zero with no upper bound, and both equal 1.0 under the null. ORs are returned by logistic regression and are appropriate for case-control studies; RRs are interpretable for cohort and intervention studies but cannot be estimated from case-control sampling.

Critically, ORs and RRs are not interchangeable. For rare outcomes (event probability under ~10%), OR and RR are numerically close. For common outcomes, the OR lies substantially farther from 1 than the RR, so reading an OR as if it were an RR exaggerates the effect; journalists and even researchers routinely make this mistake. When the base rate is high, prefer RR or absolute risk difference for plain-language interpretation.
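A quick numerical comparison makes the divergence concrete. The common-outcome probabilities (35% vs. 15%) are the remission rates from this article's worked comparison; the rare-outcome pair (3.5% vs. 1.5%) is a hypothetical scenario with the same risk ratio.

```python
def risk_ratio(p1, p2):
    """Ratio of event probabilities in group 1 vs. group 2."""
    return p1 / p2

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    """Ratio of event odds in group 1 vs. group 2."""
    return odds(p1) / odds(p2)

for label, p1, p2 in [("common", 0.35, 0.15), ("rare", 0.035, 0.015)]:
    print(f"{label}: RR = {risk_ratio(p1, p2):.2f}, OR = {odds_ratio(p1, p2):.2f}")
# common: RR = 2.33, OR = 3.05
# rare: RR = 2.33, OR = 2.38
```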

Number Needed to Treat (NNT)

NNT is the number of patients who must receive a treatment for one additional patient to benefit. It is computed as the reciprocal of the absolute risk reduction:

$$\text{NNT} = \frac{1}{p_{\text{control}} - p_{\text{treatment}}}$$

NNT is the most clinically intuitive of all effect sizes; "you need to treat 12 patients to prevent one stroke" is something a physician, a patient, and a policymaker can all reason about. It also has a mirror image: the number needed to harm (NNH) is computed the same way for adverse events. By convention NNT is rounded up to the next whole patient. The catch is that NNT depends heavily on the baseline event rate; the same relative risk reduction yields wildly different NNTs in low- versus high-risk populations.
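The baseline dependence is worth seeing with numbers. In this sketch the function takes the absolute difference of the two event probabilities, so it works whether the event is harmful (a stroke) or beneficial (a remission), and it rounds up to whole patients. The 40% and 4% baseline risks are hypothetical; the 15% vs. 35% pair is the remission rate from this article's worked comparison.

```python
import math

def nnt(p_control, p_treatment):
    """Number needed to treat: reciprocal of the absolute risk difference, rounded up."""
    arr = abs(p_control - p_treatment)
    if arr == 0:
        raise ValueError("no risk difference: NNT is undefined")
    # Round first so floating-point noise (e.g. 5.000000000000001) does not bump the ceiling.
    return math.ceil(round(1 / arr, 9))

print(nnt(0.15, 0.35))  # 5   (remission rates from the worked comparison)

# The same 25% relative risk reduction at two hypothetical baseline risks.
print(nnt(0.40, 0.30))  # 10  (high-risk population)
print(nnt(0.04, 0.03))  # 100 (low-risk population)
```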

Common Language Effect Size (CLES)

McGraw and Wong's (1992) common language effect size translates a standardized difference into a probability statement: what is the probability that a randomly chosen score from group 1 exceeds a randomly chosen score from group 2? For a Cohen's d of 0.5, CLES is approximately 0.64 — meaning if you randomly pair members of the two groups, the higher-mean group will win about 64% of the time.

CLES is exceptional for communicating effects to non-technical audiences. "There is a 64% chance our drug recipient outperforms the placebo recipient" lands harder than "d = 0.5." It is also closely related to the area under the ROC curve and to the probability of superiority in non-parametric contexts.
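Under the assumption that both groups are normally distributed with equal variances, the CLES follows directly from d as Φ(d / √2), where Φ is the standard normal CDF. A minimal sketch using only the Python standard library (the function name is illustrative):

```python
import math
from statistics import NormalDist

def cles_from_d(d):
    """P(random score from group 1 > random score from group 2), given d.

    Assumes normal distributions with equal variances in both groups.
    """
    return NormalDist().cdf(d / math.sqrt(2))

print(round(cles_from_d(0.5), 2))     # 0.64
print(round(cles_from_d(0.6636), 2))  # 0.68 (unrounded d from the worked comparison)
```

When the normality assumption is doubtful, the same probability can be estimated directly from the data by counting pairs, which is what the non-parametric probability of superiority does.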

A Worked Comparison

Suppose a randomized trial compares a new cognitive-behavioral intervention against a wait-list control on the Beck Depression Inventory. The treatment group (n = 40) has M = 14.2, SD = 6.1; the control group (n = 40) has M = 18.7, SD = 7.4. The rate of "clinical remission" (BDI < 10) is 35% in treatment (14 of 40 patients) and 15% in control (6 of 40).

| Metric | Value | Interpretation |
| --- | --- | --- |
| Cohen's d | −0.66 | Standardized mean difference, pooled SD |
| Hedges' g | −0.66 | Small-sample bias correction (~1% shrinkage: −0.657 vs. −0.664 before rounding) |
| Glass's Δ | −0.61 | Standardized on control SD only |
| Pearson r | −0.31 | Point-biserial correlation |
| CLES | 0.68 | 68% chance a treatment recipient has a lower BDI score than a control recipient |
| Risk ratio (remission) | 2.33 | Treatment patients 2.3× more likely to remit |
| Risk difference | 0.20 | 20 percentage points more remit on treatment |
| NNT | 5 | Treat 5 patients to produce 1 additional remission |

Every row above describes the same data. Each tells a slightly different story, suited to a different audience. A meta-analyst wants g; a clinician wants NNT and risk difference; a peer reviewer wants d with its 95% CI; a journalist wants the CLES.

Confidence Intervals on Effect Sizes

An effect size without a confidence interval is not a finished analysis — it is an unmoored point estimate. APA reporting standards have required CIs since 2010, yet a depressing share of papers still report bare effect sizes. The CI tells your reader how precisely you have measured the effect, which is essential for judging replication potential and for downstream meta-analysis.

Most effect sizes have non-symmetric sampling distributions, so CIs should be computed using non-central t, F, or chi-squared distributions rather than the simple symmetric formula estimate ± 1.96 SE. For Cohen's d and Hedges' g, non-central t intervals (Cumming, 2014) are standard. For correlations, the Fisher z transformation is appropriate. For odds ratios, log-transform before computing CIs and then back-transform. PsyStat Nexus and most modern packages handle these transformations automatically.
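Two of the transformations named above, Fisher z for correlations and the log transform for odds ratios, are simple enough to sketch with the standard library. Non-central t intervals for d and g need a statistics package and are not shown. The correlation example (r = 0.30, n = 80) is hypothetical; the 2×2 counts (14 of 40 vs. 6 of 40 remissions) are the ones implied by this article's worked comparison. The odds-ratio interval here is the simple Wald interval on the log scale, one common choice among several.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96

def fisher_z_ci(r, n):
    """95% CI for a Pearson correlation via the Fisher z transformation."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - Z95 * se), math.tanh(z + Z95 * se)

def odds_ratio_ci(a, b, c, d):
    """95% Wald CI for an odds ratio, computed on the log scale.

    a, b: events and non-events in group 1; c, d: the same for group 2.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - Z95 * se), math.exp(log_or + Z95 * se)

lo, hi = fisher_z_ci(0.30, 80)
print(f"r = 0.30, 95% CI [{lo:.2f}, {hi:.2f}]")

lo, hi = odds_ratio_ci(14, 26, 6, 34)
print(f"OR = {(14 * 34) / (26 * 6):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Both intervals come out asymmetric around the point estimate, which is exactly what the symmetric estimate ± 1.96 SE formula cannot reproduce.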

The "Small / Medium / Large" Trap

Cohen's 1988 textbook offered tentative benchmarks (d = 0.2 / 0.5 / 0.8) for fields that had no empirical guidance of their own. He framed them as last-resort defaults and explicitly warned against universal application. Nearly four decades later, those defaults have hardened into something between conventional wisdom and law: a reviewer dismisses d = 0.15 as "small" without asking what is normal in that subfield.

Two recent empirical re-examinations have shifted the conversation. Funder and Ozer (2019) reviewed effect sizes across personality and social psychology and concluded that Cohen's anchors are set too high for those fields: there, r = 0.10 represents a meaningful effect, r = 0.20 is moderate, and r = 0.30 is large. A correlation of 0.40, which sits between "medium" (0.30) and "large" (0.50) on Cohen's own benchmarks for r, is in fact very large for individual-difference research.

Lovakov and Agadullina (2021) ran a parallel exercise on social psychology and proposed similar thresholds: d = 0.15 / 0.36 / 0.65 as small, medium, and large benchmarks empirically derived from the field's own literature. Both papers make the same broader point: benchmarks should be calibrated to the actual distribution of effect sizes in your field, not imported wholesale from a textbook published when behaviorism was still respectable.

| Source | Field | Small | Medium | Large |
| --- | --- | --- | --- | --- |
| Cohen (1988) | General default | d = 0.2 | d = 0.5 | d = 0.8 |
| Funder & Ozer (2019) | Personality / social | r = 0.10 | r = 0.20 | r = 0.30 |
| Lovakov & Agadullina (2021) | Social psychology | d = 0.15 | d = 0.36 | d = 0.65 |
| Hemphill (2003) | Psychological assessment | r < 0.20 | r = 0.20–0.30 | r > 0.30 |

"Effect-size labels are useful for orientation only. The substantive importance of an effect depends on the cost of intervention, the consequences of the outcome, and the prior literature in the relevant domain — never on a single textbook table." — Funder & Ozer (2019), paraphrased

Practical Reporting Checklist

  1. Report at least one effect size for every test. p-values alone are insufficient and have been since 2010.
  2. Report a 95% CI on the effect size, computed with the appropriate non-central distribution.
  3. Specify which version you are reporting. Cohen's d with pooled SD? Hedges' g? Partial eta-squared or eta-squared? Be explicit; do not let SPSS labels speak for you.
  4. Match the metric to the design. Glass's Δ for unequal variances; g for small n; ω² for ANOVA; Cliff's d for non-parametric tests; OR/RR/NNT for binary outcomes.
  5. Interpret against field-specific benchmarks, not Cohen's 1988 defaults. Cite the empirical source for your benchmark.
  6. Report multiple complementary metrics when audiences differ. A clinical paper benefits from both d (for meta-analysis) and NNT (for practice).

Try It in PsyStat Nexus

The Effect Sizes module in PsyStat Nexus computes every metric discussed in this article from raw data, summary statistics, or test output. It returns Cohen's d, Hedges' g, Glass's Δ, Pearson and rank-biserial correlations, eta-squared, partial eta-squared, omega-squared, Cohen's f, Cliff's d, odds ratios, risk ratios, NNT, and CLES — each with appropriate non-central confidence intervals and field-specific benchmark comparisons drawn from Funder & Ozer (2019) and Lovakov & Agadullina (2021).
