
Causal Inference Without Randomization

By Moonlit Social Labs · April 16, 2026 · 15 min read

"Correlation is not causation." Every undergraduate methods student learns it. Every reviewer wields it. And in a strict logical sense, it is true. But as a guide to research practice, the slogan is almost uselessly vague — because most of the questions that matter to scientists, clinicians, and policymakers are causal questions, and most of the data we have to answer them are observational.

Does smoking cause lung cancer? Does cognitive behavioral therapy reduce relapse? Does raising the minimum wage increase unemployment? Does early childhood adversity alter adult HPA-axis reactivity? In an ideal world we would randomize and find out. In the actual world we cannot — for ethical, financial, logistical, or temporal reasons — and yet the questions persist. The choice is not "randomize or remain agnostic." The choice is whether to reason carefully about causation from the data we can obtain, or to retreat into the false modesty of "we only describe associations" while everyone in the room reads causal claims into the discussion section anyway.

The modern causal inference toolkit, developed over the past five decades by Donald Rubin, Judea Pearl, James Robins, and many others, gives us a way to reason carefully. It does not eliminate the need for assumptions — nothing can. What it does is force the assumptions into the open, give them names, and tell us what would have to be true of the unobserved world for our estimates to be wrong.

The Potential Outcomes Framework

The conceptual foundation of contemporary causal inference is the potential outcomes or Rubin causal model (Rubin, 1974, 2005; Imbens & Rubin, 2015). The idea is deceptively simple. For each unit i and each possible treatment value a, define the potential outcome Y_i(a) as the value that Y would take if unit i received treatment a.

If treatment is binary, every individual has two potential outcomes: Y_i(1), what would happen under treatment, and Y_i(0), what would happen under control. The individual causal effect is the contrast Y_i(1) − Y_i(0). The average causal effect (ATE) in the population is E[Y(1) − Y(0)].

The fundamental problem of causal inference, in Holland's (1986) phrasing, is that we never observe both potential outcomes for the same person. Whichever treatment i actually received, the other outcome is counterfactual — permanently missing data. Causal inference is, at heart, a missing data problem.

Randomization solves the problem by assignment design: when treatment is assigned independently of potential outcomes, the observed mean among the treated is an unbiased estimate of E[Y(1)], and similarly for the control group. Without randomization, we have to assume something analogous — usually conditional exchangeability, also called ignorability or no unmeasured confounding: given a sufficient set of covariates L, treatment assignment is independent of the potential outcomes. This is the assumption that the rest of this article is, in one way or another, about.
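A small simulation can make the contrast concrete. The sketch below is illustrative only: the data-generating process, the single binary confounder L, and the true effect of 2 are all assumptions made up for the example. It compares the naive difference in observed means under randomized assignment and under assignment that depends on L.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: L is a binary confounder.
L = rng.binomial(1, 0.5, size=n)
y0 = 1.0 * L + rng.normal(size=n)   # potential outcome under control
y1 = y0 + 2.0                       # potential outcome under treatment (true ATE = 2)

def naive_difference(a):
    """Difference in observed means between treated and control units."""
    y = np.where(a == 1, y1, y0)    # only one potential outcome is ever observed
    return y[a == 1].mean() - y[a == 0].mean()

# Randomized assignment: A is independent of (Y(0), Y(1)).
a_random = rng.binomial(1, 0.5, size=n)

# Confounded assignment: units with L = 1 are much more likely to be treated.
a_confounded = rng.binomial(1, np.where(L == 1, 0.8, 0.2))

est_random = naive_difference(a_random)
est_confounded = naive_difference(a_confounded)
print(f"randomized: {est_random:.2f}, confounded: {est_confounded:.2f}")
```

Under these assumptions the randomized comparison should land near the true effect of 2, while the confounded comparison should land near 2.6, because treated units are drawn disproportionately from the L = 1 stratum.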

Directed Acyclic Graphs and the do-Calculus

The potential outcomes framework gives us a language for what we want to estimate. Pearl's (2009) graphical framework gives us a language for the assumptions we are willing to make. A directed acyclic graph (DAG) encodes a set of qualitative causal assumptions: each node is a variable, each arrow is a direct causal influence, and the absence of an arrow is itself a claim — that no direct effect exists.

From a DAG plus the rules of d-separation, you can read off conditional independencies that the data should obey if the DAG is correct. More usefully, you can mechanically determine which sets of covariates suffice to identify a causal effect from observational data. Pearl's back-door criterion says: to estimate the effect of A on Y, find a set L that (a) blocks all paths from A to Y that begin with an arrow into A, and (b) contains no descendants of A. Conditioning on such an L identifies the causal effect — in the potential-outcomes language, it makes ignorability hold.
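For a discrete adjustment set L that satisfies the back-door criterion, the identification result can be written out explicitly. The expression below is the standard back-door adjustment (standardization) formula, given here as a reference sketch; for continuous L the sum becomes an integral.

```latex
P\bigl(Y = y \mid \mathrm{do}(A = a)\bigr)
  \;=\; \sum_{l} P(Y = y \mid A = a,\, L = l)\, P(L = l),
\qquad
E[Y(a)] \;=\; \sum_{l} E[Y \mid A = a,\, L = l]\, P(L = l).
```

Every term on the right-hand side is an observational quantity, which is what makes the effect estimable without intervening.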

The do-calculus generalizes this to a complete algebra for translating queries about hypothetical interventions, written P(Y | do(A=a)), into expressions involving only observational quantities, whenever such a translation is possible. When it is not possible, the do-calculus tells you that too — which is itself valuable, because it directs you toward the additional data or design features you would need.

Confounders, Mediators, Colliders

The classification of "third variables" is where DAGs earn their keep, because the textbook advice to "control for everything you can measure" is not merely imperfect — it can make your estimate worse.

A confounder is a common cause of treatment and outcome. Conditioning on a confounder removes a spurious back-door association and brings you closer to the causal effect.

A mediator lies on the causal pathway from treatment to outcome. Conditioning on it blocks the very effect you are trying to estimate — you end up estimating the direct effect rather than the total effect, which may not be what you want.

A collider is a common effect of two variables. Conditioning on a collider opens a previously closed path and induces an association between its parents that does not exist in the population. This is collider bias, the structure behind many forms of selection bias; Berkson's paradox is its classic instance.

If you condition on a collider, you can manufacture a correlation between two genuinely independent causes — or, worse, reverse the sign of a real effect.

A concrete collider example

Suppose academic ability A and athletic ability S are independent in the general population. Both increase the probability of being admitted to an elite university (U = 1). Now sample only admitted students. Within that sample, A and S are negatively correlated: a student who got in despite mediocre academics probably has impressive athletics, and vice versa. The negative correlation is entirely an artifact of conditioning on the collider U. A naive analyst, controlling for "university attended" in a regression, would conclude that academic and athletic ability trade off — a finding generated by the analytic decision, not by any causal mechanism in the world.
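The admissions example is easy to reproduce numerically. The sketch below assumes standard-normal abilities and a hypothetical deterministic admission rule (admit when the sum of the two abilities exceeds 1.5); a probabilistic rule would show the same pattern, only less starkly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

academic = rng.normal(size=n)   # A: academic ability
athletic = rng.normal(size=n)   # S: athletic ability, drawn independently of A

# Hypothetical admission rule: admit when the combined score is high (U = 1).
admitted = (academic + athletic) > 1.5

corr_population = np.corrcoef(academic, athletic)[0, 1]
corr_admitted = np.corrcoef(academic[admitted], athletic[admitted])[0, 1]
print(f"population: {corr_population:+.2f}, admitted only: {corr_admitted:+.2f}")
```

The population correlation should be close to zero, while the correlation among admitted students should be clearly negative, even though nothing in the simulation links the two abilities.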

The same structure underlies many famous puzzles: low-birthweight babies of smoking mothers appear to have lower mortality than low-birthweight babies of non-smoking mothers; hospitalized patients with disease X show inverse correlations between two of its causes; obese cardiovascular patients sometimes appear to live longer than non-obese ones. In each case, conditioning on a downstream selection variable (low birthweight, hospitalization, having the disease at all) opens a collider path. Drawing the DAG before running the regression is the only reliable way to avoid this trap.

Methods for Adjustment Under Ignorability

Propensity score methods

Rosenbaum and Rubin (1983) showed that if treatment is ignorable given L, it is also ignorable given the scalar propensity score e(L) = P(A=1 | L). This is a remarkable dimension reduction: instead of matching or stratifying on a high-dimensional covariate vector, you can match on a single number.

Propensity score matching (PSM) pairs each treated unit with one or more control units that have similar estimated propensity scores, then compares outcomes within matched pairs. Inverse probability of treatment weighting (IPTW) instead reweights the sample so that the treated and control groups have the same covariate distribution: treated units are weighted by 1/e(L) and controls by 1/(1−e(L)). The difference between the weighted outcome means of the two groups then estimates the ATE.

Both methods rely on two assumptions: ignorability (the propensity score captures all confounding) and positivity, which requires that 0 < e(L) < 1 for every covariate stratum of interest. Positivity fails — sometimes silently — when there are subgroups in which treatment never occurs, or never fails to occur. Estimated weights of 50, 200, or 5000 are diagnostic of positivity violations and lead to estimates dominated by a handful of influential observations. The correct response is usually to redefine the target population to one in which positivity actually holds, not to truncate the weights and hope for the best.
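A minimal IPTW sketch follows, under stated assumptions: a single binary confounder L, a simulated true ATE of 2, and a propensity score estimated as the treated share within each stratum of L. Real applications would typically fit a logistic or more flexible model for e(L); the stratum estimate is used here only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical setup: one binary confounder L, true ATE = 2.
L = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, np.where(L == 1, 0.8, 0.2))
Y = 2.0 * A + 1.0 * L + rng.normal(size=n)

# Estimate e(L) = P(A = 1 | L) as the treated share within each stratum.
e_hat = np.where(L == 1, A[L == 1].mean(), A[L == 0].mean())

# Positivity diagnostic: estimated scores must lie strictly inside (0, 1).
assert 0.0 < e_hat.min() and e_hat.max() < 1.0

# IPTW weights: 1/e(L) for treated units, 1/(1 - e(L)) for controls.
w = np.where(A == 1, 1.0 / e_hat, 1.0 / (1.0 - e_hat))

# Normalized weighted outcome means in each arm.
mean_treated = np.sum(w * A * Y) / np.sum(w * A)
mean_control = np.sum(w * (1 - A) * Y) / np.sum(w * (1 - A))
ate_iptw = mean_treated - mean_control
max_weight = w.max()
print(f"IPTW ATE: {ate_iptw:.2f}, largest weight: {max_weight:.1f}")
```

Printing the largest weight alongside the estimate is a cheap habit: in this simulation it stays near 5, whereas weights in the hundreds would signal that some stratum has almost no treated or almost no untreated units.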

PSM has been criticized in recent years (King & Nielsen, 2019) for inducing imbalance when matches are imperfect and for being sensitive to specification of the propensity model. Modern practice favors weighting, full matching, or doubly robust extensions over greedy nearest-neighbor matching.

Instrumental variables

When you suspect unmeasured confounding, propensity score methods are not enough — they only adjust for what you measured. Instrumental variable (IV) methods exploit a different identification strategy: find a variable Z that affects treatment but is unrelated to the outcome except through treatment. Then variation in Y driven by Z's influence on A identifies the causal effect, even in the presence of unmeasured confounders of A and Y (Angrist & Pischke, 2009).

An instrument must satisfy three assumptions:

Relevance. Z has a non-trivial effect on A. This is testable, typically through the first-stage F statistic; weak instruments bias 2SLS toward the confounded OLS estimate and make its standard errors and confidence intervals unreliable.
Exclusion restriction. Z affects Y only through A. This is not testable from the data and must be defended substantively.
Independence (or unconfoundedness of the instrument). Z is independent of the unmeasured confounders that bias the A→Y relationship.

Two-stage least squares (2SLS) is the workhorse estimator. Famous applications include using distance to college as an instrument for educational attainment when estimating returns to schooling (Card, 1995), and the entire field of Mendelian randomization, which uses germline genetic variants as instruments for modifiable exposures. The strategy is sensible because alleles are randomized at meiosis and fixed before exposure occurs, which makes independence plausible (Davey Smith & Ebrahim, 2003). Exclusion still has to be argued case by case, because a variant can affect the outcome through pathways other than the exposure (pleiotropy).

Under heterogeneous treatment effects, and with the additional assumption of monotonicity (the instrument never pushes anyone away from treatment), IV identifies not the ATE but the local average treatment effect (LATE): the effect among compliers, those whose treatment status is shifted by the instrument (Imbens & Angrist, 1994). This is a real estimand, but it is not the same as the population average, and reporting should make the distinction explicit.
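To see the mechanics, here is a hand-rolled 2SLS sketch on simulated data. Everything in it is an assumption for illustration: a binary instrument Z, a continuous treatment A, an unmeasured confounder U, and a constant true effect of 1.5 (so the LATE and ATE coincide). The second-stage standard errors from a manual two-step fit are not valid, which is why the sketch reports point estimates only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

U = rng.normal(size=n)                      # unmeasured confounder of A and Y
Z = rng.binomial(1, 0.5, size=n)            # instrument: moves A, no direct path to Y
A = 0.8 * Z + U + rng.normal(size=n)        # treatment
Y = 1.5 * A + 2.0 * U + rng.normal(size=n)  # true effect of A on Y is 1.5

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Naive OLS of Y on A is biased because U drives both.
beta_ols = ols(np.column_stack([ones, A]), Y)[1]

# Stage 1: regress A on Z and keep the fitted values.
X1 = np.column_stack([ones, Z])
A_hat = X1 @ ols(X1, A)

# Stage 2: regress Y on the stage-1 fitted values.
beta_2sls = ols(np.column_stack([ones, A_hat]), Y)[1]

# First-stage F statistic for a single instrument, from the Z-A correlation.
r = np.corrcoef(Z, A)[0, 1]
first_stage_F = r**2 / (1.0 - r**2) * (n - 2)

print(f"OLS: {beta_ols:.2f}, 2SLS: {beta_2sls:.2f}, first-stage F: {first_stage_F:.0f}")
```

A common rule of thumb treats a first-stage F below about 10 as a warning sign of a weak instrument; the simulated instrument here is far stronger than that.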

Regression discontinuity

When treatment is assigned by a sharp cutoff on a continuous running variable — a test score, an income threshold, a birthdate — regression discontinuity design (RDD) exploits the fact that units just above and just below the cutoff are plausibly comparable on everything except treatment status. The discontinuity in mean outcomes at the cutoff identifies the local causal effect.

In a sharp RDD, treatment is a deterministic function of the running variable: every applicant scoring above the threshold receives the scholarship, none below. In a fuzzy RDD, the cutoff changes the probability of treatment but does not fully determine it — the cutoff is then used as an instrument and the analysis becomes a local 2SLS.

RDD is among the most credible quasi-experimental designs precisely because its assumption — smooth potential outcomes at the cutoff — is local and weak. The principal validity threat is manipulation of the running variable: if applicants can game their score to land just above the threshold, the comparability argument collapses. The McCrary (2008) density test checks for suspicious bunching at the cutoff.
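A sharp RDD estimate can be sketched as two local linear fits, one on each side of the cutoff, evaluated at the cutoff. The simulated outcome, the jump of 1, the bandwidth h = 0.2, and the uniform kernel are assumptions chosen for the example; applied work usually relies on data-driven bandwidths and triangular kernels.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
cutoff = 0.0

x = rng.uniform(-1.0, 1.0, size=n)      # running variable
treated = (x >= cutoff).astype(float)   # sharp design: treatment determined by x
# Smooth (quadratic) trend in x plus a jump of 1 at the cutoff.
y = 0.5 * x + 0.3 * x**2 + 1.0 * treated + rng.normal(scale=0.5, size=n)

def fitted_value_at_cutoff(xs, ys):
    """Intercept of a linear fit of ys on (xs - cutoff): the fitted value at the cutoff."""
    X = np.column_stack([np.ones(xs.size), xs - cutoff])
    return np.linalg.lstsq(X, ys, rcond=None)[0][0]

h = 0.2  # hypothetical bandwidth
left = (x >= cutoff - h) & (x < cutoff)
right = (x >= cutoff) & (x <= cutoff + h)

rdd_estimate = (
    fitted_value_at_cutoff(x[right], y[right])
    - fitted_value_at_cutoff(x[left], y[left])
)
print(f"estimated jump at the cutoff: {rdd_estimate:.2f}")
```

Only observations within h of the cutoff enter the fit, which is the sense in which the estimate is local: it says nothing about units far from the threshold.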

Difference-in-differences

When some units adopt a treatment at a particular point in time and others do not, difference-in-differences (DiD) identifies the causal effect by comparing the change in outcomes for the treated group to the change for the control group. The identifying assumption is parallel trends: in the absence of treatment, the treated and control groups would have evolved in parallel.

Parallel trends is partially probed by inspecting pre-treatment outcome trajectories, but the assumption proper concerns the unobserved counterfactual post-treatment trajectory and so cannot be tested directly. The most defensible DiD applications combine visual evidence of pre-period parallelism with a substantive argument for why post-period divergence would be implausible absent the policy change.
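The two-group, two-period estimator is just a difference of two differences. The sketch below simulates it under assumed parallel trends: the group levels, the common trend of +1, and the effect of 2 are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000  # units per group

# Hypothetical setup: the treated group starts 3 units higher, but both
# groups share a common trend of +1 between periods (parallel trends).
effect = 2.0
pre_control = 10.0 + rng.normal(size=n)
post_control = 10.0 + 1.0 + rng.normal(size=n)
pre_treated = 13.0 + rng.normal(size=n)
post_treated = 13.0 + 1.0 + effect + rng.normal(size=n)

# Difference-in-differences: change among treated minus change among controls.
change_treated = post_treated.mean() - pre_treated.mean()
change_control = post_control.mean() - pre_control.mean()
did = change_treated - change_control

# A post-period-only comparison absorbs the baseline level difference.
post_only = post_treated.mean() - post_control.mean()

print(f"DiD: {did:.2f}, post-only comparison: {post_only:.2f}")
```

The DiD estimate should recover the simulated effect of 2, while the post-only comparison should land near 5, since it mistakes the pre-existing gap of 3 for part of the effect.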

The classical two-period, two-group DiD has been extended to settings where treatment is rolled out across many units at staggered times. It turned out that the standard two-way fixed-effects estimator, when applied to such staggered designs, can be badly biased — sometimes returning the wrong sign — because already-treated units serve as controls for later-treated units, contaminating the comparison (Goodman-Bacon, 2021). The recent literature has produced a new generation of estimators that handle staggered adoption properly: Callaway and Sant'Anna (2021) proposed a framework based on group-time average treatment effects with flexible aggregation; Sun and Abraham (2021) developed an interaction-weighted estimator for event studies. These methods are now standard in empirical economics and should be the default whenever treatment timing varies across units.

G-computation and standardization

G-computation, due to Robins (1986) and a cornerstone of Hernán and Robins's (2020) What If, is in some ways the most direct adjustment method. Fit an outcome model E[Y | A, L]. Then for each individual in the sample, predict their outcome under each treatment value, holding their covariates fixed. Average those predictions to obtain the standardized mean under each treatment, and contrast them.
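For a single time-point treatment, the recipe above can be sketched in a few lines. The simulated data, the linear outcome model with an A×L interaction, and the true ATE of 1 are assumptions for illustration; the outcome model is correctly specified here by construction, which real analyses cannot take for granted.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

L = rng.normal(size=n)                                # measured confounder
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-1.5 * L)))   # treatment depends on L
# The effect of A varies with L; the ATE is 1 + 0.5 * E[L] = 1.
Y = 1.0 * A + 2.0 * L + 0.5 * A * L + rng.normal(size=n)

def design(a, l):
    """Design matrix for the outcome model E[Y | A, L] with an A x L interaction."""
    return np.column_stack([np.ones(l.size), a, l, a * l])

# Step 1: fit the outcome model on the observed data.
beta = np.linalg.lstsq(design(A, L), Y, rcond=None)[0]

# Step 2: predict every unit's outcome under a = 1 and under a = 0,
# holding each unit's covariates fixed, then average (standardize).
mean_y1 = (design(np.ones(n), L) @ beta).mean()
mean_y0 = (design(np.zeros(n), L) @ beta).mean()
ate_gcomp = mean_y1 - mean_y0

naive = Y[A == 1].mean() - Y[A == 0].mean()
print(f"g-computation ATE: {ate_gcomp:.2f}, naive difference: {naive:.2f}")
```

Note that the standardized contrast is not simply the coefficient on A: because of the interaction, the coefficient alone is the effect at L = 0, and averaging predictions over the sample's covariate distribution is what turns the model into a population-level estimate.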

G-computation generalizes far beyond a simple covariate-adjusted regression. It handles time-varying treatments, time-varying confounders that are themselves affected by prior treatment, and complex dynamic regimes — settings in which conventional regression and even standard IPTW fail. The cost is dependence on a correctly specified outcome model, which is why doubly robust extensions are appealing.

Doubly robust estimation: AIPW

The augmented inverse probability weighted (AIPW) estimator combines a propensity model and an outcome model in a way that is consistent if either model is correctly specified. This double robustness property is genuinely useful: you do not have to bet on a single specification, and the estimator achieves the semiparametric efficiency bound when both models are correct.
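Double robustness is easy to demonstrate by deliberately breaking one model at a time. In the sketch below, the simulated data (binary confounder L, true ATE of 2) and the "wrong" models, which simply ignore L, are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

L = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, np.where(L == 1, 0.8, 0.2))
Y = 2.0 * A + 1.0 * L + rng.normal(size=n)   # true ATE = 2

def aipw(y, a, e, m1, m0):
    """AIPW estimate of the ATE from propensity scores e and outcome predictions m1, m0."""
    term1 = m1 + a * (y - m1) / e
    term0 = m0 + (1 - a) * (y - m0) / (1 - e)
    return np.mean(term1 - term0)

def cell_mean(a_val):
    """Correct outcome model: mean of Y within each (A = a_val, L) cell."""
    return np.where(
        L == 1,
        Y[(A == a_val) & (L == 1)].mean(),
        Y[(A == a_val) & (L == 0)].mean(),
    )

# Correctly specified nuisance models.
e_good = np.where(L == 1, A[L == 1].mean(), A[L == 0].mean())
m1_good, m0_good = cell_mean(1), cell_mean(0)

# Deliberately misspecified models that ignore the confounder L.
e_bad = np.full(n, A.mean())
m1_bad = np.full(n, Y[A == 1].mean())
m0_bad = np.full(n, Y[A == 0].mean())

ate_outcome_wrong = aipw(Y, A, e_good, m1_bad, m0_bad)
ate_propensity_wrong = aipw(Y, A, e_bad, m1_good, m0_good)
ate_both_wrong = aipw(Y, A, e_bad, m1_bad, m0_bad)

print(f"outcome model wrong:    {ate_outcome_wrong:.2f}")
print(f"propensity model wrong: {ate_propensity_wrong:.2f}")
print(f"both wrong:             {ate_both_wrong:.2f}")
```

With either single misspecification the estimate should stay near 2; with both models wrong it should drift to roughly the confounded difference in means. Double robustness buys two chances to get a model right, not immunity from getting both wrong.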

The modern incarnation of doubly robust estimation is targeted maximum likelihood estimation (TMLE) and the closely related double/debiased machine learning framework (Chernozhukov et al., 2018). These methods plug flexible machine learning estimators (random forests, gradient-boosted trees, super learners) into the propensity and outcome components, then perform a debiasing step (a targeting update in TMLE; orthogonalized scores with cross-fitting in double machine learning) that recovers valid confidence intervals even though the nuisance models are fit with flexible, slowly converging learners. The result is a principled marriage of modern predictive modeling with the asymptotic guarantees of classical statistics.

Sensitivity Analysis: Quantifying What You Cannot Test

Every observational causal estimate rests on an untestable assumption — usually no unmeasured confounding, or some variant of it. The honest response is not to pretend the assumption is satisfied, nor to abandon the estimate, but to ask: how badly would the assumption have to fail for the result to disappear?

The E-value

VanderWeele and Ding (2017) introduced the E-value, which has become the most widely adopted sensitivity metric in epidemiology and is increasingly used in psychology. The E-value is the minimum strength of association, measured as a risk ratio, that an unmeasured confounder would need to have with both the treatment and the outcome — above and beyond the measured covariates — to fully explain away the observed effect.

An E-value of 1.5 means that an unmeasured confounder would need to be associated with both the treatment and the outcome by a risk ratio of at least 1.5 each to nullify the result. An E-value of 4.0 demands a much more impressive lurking variable. Reporting an E-value reframes the conversation from "could there be unmeasured confounding?" (yes, always) to "is there plausibly an unmeasured confounder this strong?", which is a question reviewers, domain experts, and readers can substantively engage with.
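For a point estimate on the risk-ratio scale, the E-value has a closed form: RR + sqrt(RR × (RR − 1)), applied to the reciprocal when the observed risk ratio is below 1. A minimal sketch follows; the function name `e_value` is a label chosen here, not a reference to any particular package.

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (point estimate), per VanderWeele & Ding (2017)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr   # protective effects: work with the reciprocal
    return rr + math.sqrt(rr * (rr - 1.0))

for observed_rr in (1.2, 2.0, 0.5):
    print(f"RR = {observed_rr}: E-value = {e_value(observed_rr):.2f}")
```

The same formula can be applied to the confidence limit closest to the null, which tells you how strong a confounder would need to be to move the interval to include 1 rather than to move the point estimate all the way there.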

Rosenbaum bounds

For matched designs, Rosenbaum bounds (Rosenbaum, 2002) parameterize the degree to which two matched units could differ in their odds of treatment due to unmeasured factors, summarized by a sensitivity parameter Γ. The bounds report how large Γ can grow before the inferential conclusion (e.g., a significant treatment effect) ceases to hold across all configurations of unmeasured confounding consistent with that Γ. A study robust to Γ = 2 tolerates a doubling of the odds of treatment due to hidden bias; one that breaks at Γ = 1.1 is fragile.
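The simplest case to compute by hand is the sign test on matched pairs. Under hidden bias of at most Γ and no treatment effect, the chance that the treated unit in a pair has the higher outcome is at most Γ/(1+Γ), so the worst-case one-sided p-value is a binomial tail probability. The sketch below assumes pairs with tied outcomes have already been dropped; the counts (100 pairs, 70 favoring treatment) are invented for illustration.

```python
from math import comb

def sign_test_upper_p(n_pairs, n_treated_higher, gamma):
    """Worst-case one-sided sign-test p-value under hidden bias of at most gamma.

    Absent a treatment effect, the probability that the treated unit in a
    matched pair has the higher outcome is bounded above by gamma / (1 + gamma).
    """
    if gamma < 1:
        raise ValueError("gamma must be >= 1")
    p_plus = gamma / (1.0 + gamma)
    # Upper tail of Binomial(n_pairs, p_plus) at the observed count.
    return sum(
        comb(n_pairs, k) * p_plus**k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_treated_higher, n_pairs + 1)
    )

# Hypothetical study: 100 untied pairs, treated unit higher in 70 of them.
for gamma in (1.0, 1.5, 2.0):
    p = sign_test_upper_p(100, 70, gamma)
    print(f"gamma = {gamma}: worst-case p = {p:.4f}")
```

At Γ = 1 this is the ordinary sign test. In this invented example the conclusion survives Γ = 1.5 at the 0.05 level but not Γ = 2, so one would report that the finding is sensitive to hidden bias somewhere between those values.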

Other useful tools include negative-control outcomes and exposures (Lipsitch et al., 2010), tipping-point analyses for individual covariates, and the bias formulas of VanderWeele and Arah (2011) that decompose how a confounder's strength on each side translates into bias.

An Honest Assessment

Causal inference from observational data is not a magic trick that turns associations into causes. It is a disciplined practice of stating assumptions explicitly, deriving estimators that would be valid if those assumptions held, probing the assumptions where possible, and quantifying the consequences of their failure where it is not. Done well, it produces estimates that are credible enough to act on. Done poorly — with a regression that "controls for confounders" chosen by stepwise selection, no DAG, no positivity diagnostics, no sensitivity analysis — it produces something worse than agnosticism, because it dresses up assumption-laden guesses in the borrowed authority of statistical machinery.

The Practical Bottom Line

Draw the DAG before you fit the model. Choose an estimand that matches your question (ATE, ATT, LATE, CATE). Pick an identification strategy that fits the data-generating process you actually believe in. Use a doubly robust estimator when you can. Report an E-value or Rosenbaum bound. Distinguish what your design rules out from what it merely cannot detect. The reader's faith in your conclusion should be a function of how hard you made it for yourself to reach.

Try It in PsyStat Nexus

The Causal Inference module in PsyStat Nexus walks you through observational designs end-to-end: encoding a DAG, selecting a sufficient adjustment set, estimating with IPTW or AIPW, fitting an IV or RDD or DiD specification (with Callaway–Sant'Anna for staggered adoption), and reporting an E-value alongside the point estimate. The point is not to make causal inference easy — it is not — but to make the disciplined version of it as accessible as the undisciplined version.


References

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
Callaway, B., & Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200–230.
Chernozhukov, V., et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1), C1–C68.
Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC.
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701.
Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469), 322–331.
Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175–199.
VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268–274.
