← All posts
Methodology

Mixed Models for Nested Data

By Moonlit Social Labs · April 16, 2026 · 14 min read

Most of the data social scientists, clinicians, and education researchers actually collect violates the central assumption of ordinary least squares regression. Students sit in classrooms, classrooms sit in schools, schools sit in districts. Patients are clustered within hospitals, and hospitals within health systems. Daily mood ratings are nested within persons, and persons within experimental conditions. Whenever observations share a context, they share unmeasured causes — and that shared variance breaks the independence assumption that ordinary regression silently relies on.

Mixed-effects models — also called multilevel models, hierarchical linear models (HLM), or random-effects models, depending on which discipline you trained in — are the standard tool for analyzing such data. This piece walks through what they do, why they matter, and where they fail. It is intended as a graduate-level orientation rather than a tutorial: by the end you should be able to read a methods section critically and know whether a mixed model is the right tool for your own design.

1. Why Independence Matters

Ordinary regression assumes that, conditional on the predictors, the residuals are independent and identically distributed. The standard errors that appear in your output are computed under that assumption. When observations are nested, residuals from the same cluster are correlated — two students in the same classroom are more similar than two students drawn at random — and the standard errors are too small. Confidence intervals are too narrow. p-values are too low. You reject null hypotheses you should not reject.

This is not a small correction. In a study of 30 schools with 20 students each, ignoring clustering can inflate the Type I error rate from the nominal 5% to anywhere between 15% and 50%, depending on the magnitude of the cluster effect (Snijders & Bosker, 2012). The problem is not that your point estimate is biased — it usually isn't — but that you have wildly overstated how much information your sample contains.

The intraclass correlation coefficient

The basic diagnostic is the intraclass correlation coefficient (ICC), defined as the proportion of total residual variance that lies between clusters:

ICC = σ²between / (σ²between + σ²within)

An ICC of 0 means clusters are interchangeable; OLS would be fine. An ICC of 0.10 — a typical value in education research — means 10% of the variance is between clusters. That sounds modest, but combined with even a small cluster size it produces a substantial design effect: 1 + (n¯ − 1) × ICC, where n¯ is the average cluster size. With 25 students per classroom and ICC = 0.10, the design effect is 3.4, meaning your effective sample size is roughly one-third of your nominal n. Standard errors should be inflated by √3.4 ≈ 1.85.

Raudenbush and Bryk (2002) make the point that even ICCs as small as 0.05 can produce serious distortions when cluster sizes exceed 50, which is routine in patient-level outcomes within hospitals or employee surveys within firms.

2. Random Effects: The Conceptual Move

The mixed-model fix is to add a random effect for the clustering variable. Conceptually, you allow each cluster to have its own deviation from the overall pattern, but rather than estimating a separate parameter for every cluster (as you would with a fixed-effect dummy), you assume those cluster-specific deviations are draws from a normal distribution with mean zero and variance τ. That single variance parameter replaces what would otherwise be hundreds or thousands of intercepts.

Random intercepts

The simplest model adds a random intercept: each cluster gets its own baseline, but the slopes are common across clusters. In lme4 notation:

lmer(achievement ~ ses + (1 | school), data = d)

This says achievement depends on socioeconomic status, but each school has its own intercept drawn from N(0, τ00). The fixed-effect SE for ses is now correctly inflated to reflect the dependence within schools.

Random slopes

Random intercepts assume the effect of ses is identical across schools. That is rarely true. Some schools may amplify SES gaps; others may compress them. Adding a random slope captures that:

lmer(achievement ~ ses + (1 + ses | school), data = d)

You now estimate the variance in slopes across schools (τ11) and the covariance between intercept and slope (τ01). A negative covariance, for example, would suggest that schools with higher baseline achievement also have flatter SES gradients — a substantively interesting finding.

Crossed vs. nested random effects

Random effects are nested when one grouping is strictly contained within another (students within schools; classroom 7 in school A is a different unit from classroom 7 in school B). They are crossed when units cross multiple groupings independently (subjects responding to a fixed set of items, where every subject sees every item). The syntax differs — (1 | school/classroom) for nesting, (1 | subject) + (1 | item) for crossing — and getting it wrong silently produces incorrect variance partitions. When labels are unique across the higher level (e.g., classrooms have unique IDs across schools), the two specifications are equivalent; when they recycle, they are not.

3. Three-Level and Cross-Classified Designs

Real data are rarely two-level. Students sit in classrooms within schools (three levels), and a student may move classrooms across grades (cross-classified across teachers). The framework extends naturally:

lmer(score ~ poverty + (1 | school/classroom), data = d)

fits a three-level model with random intercepts at both the school and classroom-within-school levels. Variance is now partitioned three ways — between schools, between classrooms within schools, and between students within classrooms — and the corresponding ICCs tell you where most of the action is. In US elementary education, classroom-level variance often dwarfs school-level variance, which has substantive implications: interventions targeting teachers may matter more than school-level reforms.

Cross-classified models handle the case where a unit belongs to multiple non-nested groupings simultaneously. A child may live in neighborhood A and attend school B, and the children in school B come from many neighborhoods:

lmer(outcome ~ x + (1 | school) + (1 | neighborhood), data = d)

These models are computationally heavier and require larger samples at each crossing, but they let you decompose contextual effects that would otherwise be confounded.

4. Growth Curves and Longitudinal Data

Repeated measures within persons are nested data, and mixed models are the modern replacement for repeated-measures ANOVA. The key advantages are tolerance of unbalanced designs (subjects can have different numbers of waves), explicit handling of time as a continuous variable, and graceful accommodation of time-varying covariates.

Singer and Willett (2003) lay out the canonical two-stage logic. At Level 1, each person's outcome trajectory over time is modeled as an individual growth curve:

yij = π0i + π1i · timeij + eij

At Level 2, the person-specific intercepts (π0i) and slopes (π1i) are themselves modeled as functions of time-invariant predictors:

π0i = γ00 + γ01 · treatmenti + u0i

π1i = γ10 + γ11 · treatmenti + u1i

The cross-level interaction γ11 is what most longitudinal studies actually care about: does treatment change the rate of change? Coding time matters: centering at baseline makes the intercept the predicted score at t = 0; centering at the midpoint makes the intercept the average score across waves. Nonlinear growth (quadratic, piecewise, latent basis) is handled by adding terms or restructuring the time variable, but every additional growth parameter is one more random effect that the data must support.

5. Centering: A Decision That Changes Your Question

When a Level-1 predictor varies both within and between clusters — say, daily stress within persons — how you center it changes what the coefficient means.

These are not interchangeable. A within-person effect of stress on mood (−0.4 per stress unit) and a between-person effect (−0.1 per stress unit) tell different scientific stories: one is about state-level reactivity, the other about trait-level adaptation. Reporting only the grand-mean-centered coefficient hides the distinction. Enders and Tofighi (2007) treat this in detail; the short version is that group-mean centering with the cluster mean re-entered at Level 2 is the safest default when within and between effects might differ.

6. Estimation: REML, ML, and Convergence

Two estimators dominate. Maximum likelihood (ML) jointly estimates fixed effects and variance components, but its variance estimates are biased downward in small samples. Restricted maximum likelihood (REML) corrects this bias by integrating out the fixed effects before estimating variances. The practical rule:

Convergence problems are common and informative. A "singular fit" warning from lme4 usually means a variance component has been estimated as zero (or a correlation as ±1). The model is telling you the data do not support the random-effects structure you requested. The fix is rarely to ignore the warning; it is to simplify the random structure, recheck the data for design quirks, or, when the data simply cannot support the model, retreat to a cluster-robust standard error.

7. Generalized Linear Mixed Models

When the outcome is binary, count, ordinal, or otherwise non-Gaussian, generalized linear mixed models (GLMMs) extend the framework via a link function. Logistic GLMMs for binary outcomes, Poisson and negative-binomial GLMMs for counts, and ordinal GLMMs for Likert items are all routine.

glmer(passed ~ hours_studied + (1 | school), family = binomial, data = d)

Two complications matter. First, GLMMs do not have closed-form likelihoods and are estimated by approximations — Laplace approximation, adaptive Gauss-Hermite quadrature, or pseudo-likelihood — which differ in accuracy and speed. Adaptive quadrature is the gold standard but scales poorly with crossed random effects. Second, fixed-effect coefficients in nonlinear GLMMs are conditional on the random effect, not marginal. The cluster-specific log-odds ratio is not the population-averaged log-odds ratio. If your scientific question is about a typical individual within a cluster, the GLMM coefficient is what you want; if it is about a typical cluster's mean response, you may want a marginal model (GEE) instead.

8. Common Pitfalls

9. A Worked Comparison

Consider a small synthetic study: 40 schools, 20 students each (n = 800), with a school-level intervention assigned to half the schools and a continuous student-level outcome. The true intervention effect is 0.30 standard deviations; the true ICC is 0.15.

The point estimate is unchanged, but the OLS confidence interval is roughly half as wide as it should be. In a borderline case — say, a true effect of 0.15 with the same design — OLS would have declared significance and the mixed model would have correctly reported a null result. Multiplied across thousands of published studies in education and clinical research, this is part of why intervention effects so often shrink in replication.

10. Mixed Models, GEE, and Cluster-Robust SEs

Three approaches handle clustering, and they answer subtly different questions.

None of these is uniformly best. With few clusters (< 30–40), cluster-robust SEs are themselves biased and mixed models with small-sample corrections often dominate. With many clusters and a binary outcome where you want a marginal effect, GEE is a defensible default. With variance components or random slopes as your scientific quantity of interest, only mixed models will do.

11. When Mixed Models Are Overkill

Mixed models are not free. They demand careful thinking about levels, distributions, and centering; they fail to converge in ways that take hours to debug; they invite the reader to interrogate every random effect. They are not always the right tool.

The honest framing is this: mixed models are essential when (a) clustering is non-trivial, (b) you have enough clusters to estimate variance components, and (c) you care about either correct inference or the variance structure itself. Outside that envelope, simpler tools may serve you better.

Practical Workflow

Fit an empty (intercept-only) model first to estimate the ICC. If it is meaningful (say, > 0.05) or your cluster sizes are large, proceed with a mixed model. Add fixed effects, then random slopes for any predictor whose effect could plausibly vary across clusters. Watch for singular fits and simplify the random structure principled when warned. Use REML for final reporting, ML for likelihood-ratio comparisons of fixed effects, and a small-sample correction (Kenward–Roger, Satterthwaite) when clusters are few.

Try It in PsyStat Nexus

The HLM module in PsyStat Nexus fits two- and three-level random-intercept and random-slope models, reports ICCs and design effects, runs Kenward–Roger small-sample corrections, and flags singular fits with diagnostic guidance. Growth-curve templates and cross-classified specifications are built in, and GLMM extensions handle binary and count outcomes through Laplace and adaptive quadrature.

Get started free →

References

Related Posts

Methodology
Causal Inference Without Randomization
When you can't randomize, what can you actually claim? A tour of matching, instruments, regression discontinuity, and the assumptions each one rests on.
Methodology
Power Analysis: How Much Data Do You Actually Need?
From smallest effect size of interest to simulation-based power for complex designs — how to plan a study that can actually answer its question.