Mixed Models for Nested Data
Most of the data social scientists, clinicians, and education researchers actually collect violates the central assumption of ordinary least squares regression. Students sit in classrooms, classrooms sit in schools, schools sit in districts. Patients are clustered within hospitals, and hospitals within health systems. Daily mood ratings are nested within persons, and persons within experimental conditions. Whenever observations share a context, they share unmeasured causes — and that shared variance breaks the independence assumption that ordinary regression silently relies on.
Mixed-effects models — also called multilevel models, hierarchical linear models (HLM), or random-effects models, depending on which discipline you trained in — are the standard tool for analyzing such data. This piece walks through what they do, why they matter, and where they fail. It is intended as a graduate-level orientation rather than a tutorial: by the end you should be able to read a methods section critically and know whether a mixed model is the right tool for your own design.
1. Why Independence Matters
Ordinary regression assumes that, conditional on the predictors, the residuals are independent and identically distributed. The standard errors that appear in your output are computed under that assumption. When observations are nested, residuals from the same cluster are correlated — two students in the same classroom are more similar than two students drawn at random — and the standard errors are too small. Confidence intervals are too narrow. p-values are too low. You reject null hypotheses you should not reject.
This is not a small correction. In a study of 30 schools with 20 students each, ignoring clustering can inflate the Type I error rate from the nominal 5% to anywhere between 15% and 50%, depending on the magnitude of the cluster effect (Snijders & Bosker, 2012). The problem is not that your point estimate is biased — it usually isn't — but that you have wildly overstated how much information your sample contains.
The intraclass correlation coefficient
The basic diagnostic is the intraclass correlation coefficient (ICC), defined as the proportion of total residual variance that lies between clusters:
ICC = σ²between / (σ²between + σ²within)
An ICC of 0 means clusters are interchangeable; OLS would be fine. An ICC of 0.10 — a typical value in education research — means 10% of the variance is between clusters. That sounds modest, but combined with even a small cluster size it produces a substantial design effect: 1 + (n¯ − 1) × ICC, where n¯ is the average cluster size. With 25 students per classroom and ICC = 0.10, the design effect is 3.4, meaning your effective sample size is roughly one-third of your nominal n. Standard errors should be inflated by √3.4 ≈ 1.85.
Raudenbush and Bryk (2002) make the point that even ICCs as small as 0.05 can produce serious distortions when cluster sizes exceed 50, which is routine in patient-level outcomes within hospitals or employee surveys within firms.
2. Random Effects: The Conceptual Move
The mixed-model fix is to add a random effect for the clustering variable. Conceptually, you allow each cluster to have its own deviation from the overall pattern, but rather than estimating a separate parameter for every cluster (as you would with a fixed-effect dummy), you assume those cluster-specific deviations are draws from a normal distribution with mean zero and variance τ. That single variance parameter replaces what would otherwise be hundreds or thousands of intercepts.
Random intercepts
The simplest model adds a random intercept: each cluster gets its own baseline, but the slopes are common across clusters. In lme4 notation:
lmer(achievement ~ ses + (1 | school), data = d)
This says achievement depends on socioeconomic status, but each school has its own intercept drawn from N(0, τ00). The fixed-effect SE for ses is now correctly inflated to reflect the dependence within schools.
Random slopes
Random intercepts assume the effect of ses is identical across schools. That is rarely true. Some schools may amplify SES gaps; others may compress them. Adding a random slope captures that:
lmer(achievement ~ ses + (1 + ses | school), data = d)
You now estimate the variance in slopes across schools (τ11) and the covariance between intercept and slope (τ01). A negative covariance, for example, would suggest that schools with higher baseline achievement also have flatter SES gradients — a substantively interesting finding.
Crossed vs. nested random effects
Random effects are nested when one grouping is strictly contained within another (students within schools; classroom 7 in school A is a different unit from classroom 7 in school B). They are crossed when units cross multiple groupings independently (subjects responding to a fixed set of items, where every subject sees every item). The syntax differs — (1 | school/classroom) for nesting, (1 | subject) + (1 | item) for crossing — and getting it wrong silently produces incorrect variance partitions. When labels are unique across the higher level (e.g., classrooms have unique IDs across schools), the two specifications are equivalent; when they recycle, they are not.
3. Three-Level and Cross-Classified Designs
Real data are rarely two-level. Students sit in classrooms within schools (three levels), and a student may move classrooms across grades (cross-classified across teachers). The framework extends naturally:
lmer(score ~ poverty + (1 | school/classroom), data = d)
fits a three-level model with random intercepts at both the school and classroom-within-school levels. Variance is now partitioned three ways — between schools, between classrooms within schools, and between students within classrooms — and the corresponding ICCs tell you where most of the action is. In US elementary education, classroom-level variance often dwarfs school-level variance, which has substantive implications: interventions targeting teachers may matter more than school-level reforms.
Cross-classified models handle the case where a unit belongs to multiple non-nested groupings simultaneously. A child may live in neighborhood A and attend school B, and the children in school B come from many neighborhoods:
lmer(outcome ~ x + (1 | school) + (1 | neighborhood), data = d)
These models are computationally heavier and require larger samples at each crossing, but they let you decompose contextual effects that would otherwise be confounded.
4. Growth Curves and Longitudinal Data
Repeated measures within persons are nested data, and mixed models are the modern replacement for repeated-measures ANOVA. The key advantages are tolerance of unbalanced designs (subjects can have different numbers of waves), explicit handling of time as a continuous variable, and graceful accommodation of time-varying covariates.
Singer and Willett (2003) lay out the canonical two-stage logic. At Level 1, each person's outcome trajectory over time is modeled as an individual growth curve:
yij = π0i + π1i · timeij + eij
At Level 2, the person-specific intercepts (π0i) and slopes (π1i) are themselves modeled as functions of time-invariant predictors:
π0i = γ00 + γ01 · treatmenti + u0i
π1i = γ10 + γ11 · treatmenti + u1i
The cross-level interaction γ11 is what most longitudinal studies actually care about: does treatment change the rate of change? Coding time matters: centering at baseline makes the intercept the predicted score at t = 0; centering at the midpoint makes the intercept the average score across waves. Nonlinear growth (quadratic, piecewise, latent basis) is handled by adding terms or restructuring the time variable, but every additional growth parameter is one more random effect that the data must support.
5. Centering: A Decision That Changes Your Question
When a Level-1 predictor varies both within and between clusters — say, daily stress within persons — how you center it changes what the coefficient means.
- Grand-mean centering subtracts the overall mean. The resulting coefficient is a blend of within-person and between-person effects, weighted by the relative variance at each level.
- Group-mean (or cluster-mean) centering subtracts each person's own mean. The coefficient is now purely the within-person effect: when this person is more stressed than usual, what happens to their mood?
- Adding the cluster mean as a Level-2 predictor — the "contextual" or "Mundlak" approach — lets you estimate within and between effects separately in a single model.
These are not interchangeable. A within-person effect of stress on mood (−0.4 per stress unit) and a between-person effect (−0.1 per stress unit) tell different scientific stories: one is about state-level reactivity, the other about trait-level adaptation. Reporting only the grand-mean-centered coefficient hides the distinction. Enders and Tofighi (2007) treat this in detail; the short version is that group-mean centering with the cluster mean re-entered at Level 2 is the safest default when within and between effects might differ.
6. Estimation: REML, ML, and Convergence
Two estimators dominate. Maximum likelihood (ML) jointly estimates fixed effects and variance components, but its variance estimates are biased downward in small samples. Restricted maximum likelihood (REML) corrects this bias by integrating out the fixed effects before estimating variances. The practical rule:
- Use REML when your interest is in the variance components or you are reporting final estimates.
- Use ML when comparing models with different fixed-effects structures via likelihood ratio tests — REML log-likelihoods are not comparable across different fixed-effect specifications.
Convergence problems are common and informative. A "singular fit" warning from lme4 usually means a variance component has been estimated as zero (or a correlation as ±1). The model is telling you the data do not support the random-effects structure you requested. The fix is rarely to ignore the warning; it is to simplify the random structure, recheck the data for design quirks, or, when the data simply cannot support the model, retreat to a cluster-robust standard error.
7. Generalized Linear Mixed Models
When the outcome is binary, count, ordinal, or otherwise non-Gaussian, generalized linear mixed models (GLMMs) extend the framework via a link function. Logistic GLMMs for binary outcomes, Poisson and negative-binomial GLMMs for counts, and ordinal GLMMs for Likert items are all routine.
glmer(passed ~ hours_studied + (1 | school), family = binomial, data = d)
Two complications matter. First, GLMMs do not have closed-form likelihoods and are estimated by approximations — Laplace approximation, adaptive Gauss-Hermite quadrature, or pseudo-likelihood — which differ in accuracy and speed. Adaptive quadrature is the gold standard but scales poorly with crossed random effects. Second, fixed-effect coefficients in nonlinear GLMMs are conditional on the random effect, not marginal. The cluster-specific log-odds ratio is not the population-averaged log-odds ratio. If your scientific question is about a typical individual within a cluster, the GLMM coefficient is what you want; if it is about a typical cluster's mean response, you may want a marginal model (GEE) instead.
8. Common Pitfalls
- Singular fits ignored. A variance estimated at zero indicates the data cannot identify that random effect. Forcing it in produces unstable standard errors elsewhere.
- Overfit random-effects structures. Barr et al. (2013) famously argued for "keeping it maximal" — fitting the random-effects structure justified by the design — to control Type I error in confirmatory analyses. Subsequent work (Matuschek et al., 2017; Bates et al., 2015) qualified the advice: maximal models often fail to converge or overfit, and parsimonious random structures may have better power without inflating Type I error. The current consensus is to start maximal but be willing to simplify principled when the data cannot support it.
- Treating fixed effects as random (or vice versa). Random effects are appropriate when levels are exchangeable draws from a population and you want to generalize beyond the sampled units. Fixed effects are appropriate when the levels themselves are of interest (the four experimental conditions, the two countries you happen to study). Treating the four experimental conditions as random because there are "only four" is a common error and produces uninformative variance estimates.
- Few clusters. McNeish and Stapleton (2016) document that with fewer than roughly 30 clusters, REML variance estimates are downwardly biased and Wald-based p-values for fixed effects are anti-conservative. Kenward–Roger or Satterthwaite degree-of-freedom corrections, parametric bootstrap, or Bayesian estimation with weakly informative priors are reasonable remedies.
- Reading R² off a mixed model. The marginal and conditional R² (Nakagawa & Schielzeth, 2013) do reasonable jobs, but they are not the same quantity as OLS R², and reporting a single number obscures more than it reveals.
9. A Worked Comparison
Consider a small synthetic study: 40 schools, 20 students each (n = 800), with a school-level intervention assigned to half the schools and a continuous student-level outcome. The true intervention effect is 0.30 standard deviations; the true ICC is 0.15.
- OLS: Estimate = 0.31, SE = 0.07, p < .001. Looks decisive.
- Mixed model with random school intercepts: Estimate = 0.31, SE = 0.13, p = .02. Same point estimate, nearly double the SE.
- Cluster-robust SE on the OLS fit: Estimate = 0.31, SE = 0.12, p = .01.
The point estimate is unchanged, but the OLS confidence interval is roughly half as wide as it should be. In a borderline case — say, a true effect of 0.15 with the same design — OLS would have declared significance and the mixed model would have correctly reported a null result. Multiplied across thousands of published studies in education and clinical research, this is part of why intervention effects so often shrink in replication.
10. Mixed Models, GEE, and Cluster-Robust SEs
Three approaches handle clustering, and they answer subtly different questions.
- Mixed models partition variance and produce conditional (cluster-specific) effects. They give you everything — ICCs, random slopes, BLUPs for individual clusters — at the cost of distributional assumptions on the random effects.
- Generalized estimating equations (GEE) estimate marginal (population-averaged) effects with a working correlation structure and robust standard errors. They are the right choice when the question is "what is the average effect across the population," especially for binary outcomes where the conditional/marginal distinction matters.
- OLS with cluster-robust standard errors is a minimalist option: you keep the OLS point estimates and only correct the inference. It is appropriate when you have many clusters, do not care about variance partitioning, and want a model robust to misspecification of the random-effects distribution.
None of these is uniformly best. With few clusters (< 30–40), cluster-robust SEs are themselves biased and mixed models with small-sample corrections often dominate. With many clusters and a binary outcome where you want a marginal effect, GEE is a defensible default. With variance components or random slopes as your scientific quantity of interest, only mixed models will do.
11. When Mixed Models Are Overkill
Mixed models are not free. They demand careful thinking about levels, distributions, and centering; they fail to converge in ways that take hours to debug; they invite the reader to interrogate every random effect. They are not always the right tool.
- If your ICC is essentially zero and your cluster sizes are small, OLS will give nearly identical inference with less complexity.
- If you have only a handful of clusters (say, 4–8) and the clusters themselves are of substantive interest, fixed-effect dummies are simpler and more interpretable.
- If your sole interest is in fixed-effect inference and you have many clusters, OLS with cluster-robust SEs may be more robust and equally valid.
- If your design is purely cross-sectional with no obvious clustering and you have invented a "random effect" to handle a single nuisance variable, you are probably overcomplicating things.
The honest framing is this: mixed models are essential when (a) clustering is non-trivial, (b) you have enough clusters to estimate variance components, and (c) you care about either correct inference or the variance structure itself. Outside that envelope, simpler tools may serve you better.
Fit an empty (intercept-only) model first to estimate the ICC. If it is meaningful (say, > 0.05) or your cluster sizes are large, proceed with a mixed model. Add fixed effects, then random slopes for any predictor whose effect could plausibly vary across clusters. Watch for singular fits and simplify the random structure principled when warned. Use REML for final reporting, ML for likelihood-ratio comparisons of fixed effects, and a small-sample correction (Kenward–Roger, Satterthwaite) when clusters are few.
Try It in PsyStat Nexus
The HLM module in PsyStat Nexus fits two- and three-level random-intercept and random-slope models, reports ICCs and design effects, runs Kenward–Roger small-sample corrections, and flags singular fits with diagnostic guidance. Growth-curve templates and cross-classified specifications are built in, and GLMM extensions handle binary and count outcomes through Laplace and adaptive quadrature.
References
- Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.
- Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
- Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models. Psychological Methods, 12(2), 121–138.
- McNeish, D. M., & Stapleton, L. M. (2016). The effect of small sample size on two-level model estimates: A review and illustration. Educational Psychology Review, 28(2), 295–314.
- Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133–142.
- Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Sage.
- Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford University Press.
- Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). Sage.