Causality

Natasha Kang

Xiamen University, Chow Institute

March 2026

What is Causality?

  • Economics is full of what-if questions:
    • What would happen if we raised the minimum wage?
    • What if this student had gone to college — would she earn more?
    • What if we reduced class sizes — would test scores improve?
  • Answering these requires distinguishing cause from correlation.

Causal Effects

  • Each question asks: what is the effect of a treatment on an outcome, ceteris paribus (all else equal)?
  • This is the causal effect.
  • The challenge: in practice, ceteris paribus rarely holds — treated and untreated groups differ in ways that also affect the outcome.

What’s the Effect of Hospital on Health?

Group         Sample Size   Mean Health   Std. Error
Hospital            7,774          3.21        0.014
No Hospital        90,049          3.93        0.003
  • Hospital = hospitalized during past 12 months
  • Health status = 1 (poor) to 5 (excellent)
  • What would you conclude from this data?

Are Hospitals Making People Sicker?

  • Maybe, but people who go to the hospital are likely to be less healthy.
    • e.g., pre-existing medical conditions
  • A direct comparison is misleading:
    • Other factors are not held constant.
  • This is the selection problem: the sick are more likely to seek treatment.

Correlation ≠ Causality

  • Statistical analysis uncovers correlations in observed data.
  • The key question is: When does a correlation reflect a causal effect?
  • To answer that, we need a formal framework for defining and reasoning about causality.

Frameworks for Causality

  • Potential Outcomes Framework (Neyman 1923, Rubin 1974)
    • Define causal effects by comparing what would happen under different treatments
    • The primary framework for this course
  • Directed Acyclic Graphs (DAGs) (Pearl 2009)
    • Visualize causal relationships between variables
    • Useful for reasoning about what to control for — and what not to
    • We’ll use DAGs later when we discuss model selection

Potential Outcomes: One Unit, Two Worlds

  • \(Y(1)\): potential outcome if treated
  • \(Y(0)\): potential outcome if untreated

The Potential Outcomes Framework

  • General setup: each treatment level \(d\) has an associated potential outcome: \[d \mapsto Y(d)\]
  • SUTVA (Stable Unit Treatment Value Assumption):
    • Consistency: \(Y = Y(D)\), where \(D\) is the actual assigned treatment.
    • No interference: the potential outcomes for any unit should not be affected by the treatment assigned to other units.
      • i.e. no spillover effects.

Does SUTVA Hold? — No Interference

  • Treatment: Job training program; Outcome: Individual productivity

  • Scenario 1: Workers operate independently at personal workstations, but trained and untrained workers interact informally during breaks.

  • Scenario 2: Workers are assigned to 4-person teams on sequential assembly lines. Each member is fixed to one station for the entire shift.

  • Question: In which scenario(s) is it reasonable to assume no interference?

Does SUTVA Hold? — Consistency

  • A worker assigned to training arrives late and misses half the session.
  • Question: Can their observed outcome be treated as \(Y(1)\) — the potential outcome under treatment?
  • Consistency requires the treatment to be well-defined: \(Y = Y(D)\) only holds if there is a single version of treatment.

Revisiting Hospital and Health

  • Outcome: \(Y\) (health status)
  • Treatment: \(D\) (hospitalization) \[ D = \begin{cases} 1, & \text{if hospitalized} \\ 0, & \text{if not hospitalized} \end{cases} \]
  • Potential outcomes:
    • \(Y(1)\): health status if hospitalized
    • \(Y(0)\): health status if not hospitalized
  • Observed outcome: \[ Y = D \cdot Y(1) + (1 - D) \cdot Y(0) \]
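The switching equation can be made concrete with a tiny made-up example (hypothetical numbers; only one potential outcome surfaces per unit):

```python
import numpy as np

Y1 = np.array([4, 5, 3, 4])   # health if hospitalized (made-up values)
Y0 = np.array([2, 5, 4, 3])   # health if not hospitalized
D  = np.array([1, 0, 0, 1])   # actual treatment received

# Observed outcome via the switching equation Y = D*Y(1) + (1-D)*Y(0)
Y = D * Y1 + (1 - D) * Y0
print(Y)  # → [4 5 4 4]: the other potential outcome stays counterfactual
```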

The Fundamental Problem

  • Ideally, we’d like to know the individual causal effect: \[ Y(1) - Y(0) \]
  • But for any unit, only one potential outcome is ever observed — the other is a counterfactual.
  • This is the Fundamental Problem of Causal Inference (Holland 1986): \(Y(1)\) and \(Y(0)\) cannot both be observed for the same unit at the same time.

Treatment Effect Estimands

  • Since individual effects are unobservable, we target population averages:
  • ATE — Average Treatment Effect: \[\text{ATE} = \mathbb{E}[Y(1) - Y(0)]\]
  • ATT — Average Treatment Effect on the Treated: \[\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid D = 1]\]
  • ATU — Average Treatment Effect on the Untreated: \[\text{ATU} = \mathbb{E}[Y(1) - Y(0) \mid D = 0]\]

Which Estimand Are You Recovering?

  • They are related by:

\[\text{ATE} = p \cdot \text{ATT} + (1-p) \cdot \text{ATU}, \quad p = P(D=1)\]

  • In general, ATT \(\neq\) ATU \(\neq\) ATE.
  • They coincide when treatment effects are homogeneous: \(Y(1) - Y(0) = \tau\) for all units.
  • We’ll see another condition shortly.
  • Which estimand you recover depends on your identification strategy — a theme we will return to throughout the course.
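The weighting identity \(\text{ATE} = p \cdot \text{ATT} + (1-p) \cdot \text{ATU}\) is easy to verify in a simulated population (a hypothetical data-generating process with heterogeneous effects and selection on gains):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Y0 = rng.normal(0, 1, n)
tau = rng.normal(2, 1, n)                      # heterogeneous treatment effects
Y1 = Y0 + tau
D = (tau + rng.normal(0, 1, n) > 2).astype(int)  # higher-benefit units select in

ATE = (Y1 - Y0).mean()
ATT = (Y1 - Y0)[D == 1].mean()
ATU = (Y1 - Y0)[D == 0].mean()
p = D.mean()

# With selection on gains, ATT > ATE > ATU, yet the identity holds exactly
print(ATE, p * ATT + (1 - p) * ATU)
```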

What Can Be Estimated?

  • Can we estimate the following?
    1. \(Y(1) - Y(0)\)
    2. \(\mathbb{E}[Y \mid D=1]\)
    3. \(\mathbb{E}[Y \mid D=0]\)
    4. \(\mathbb{E}[Y(1) \mid D=0]\)
    5. \(\mathbb{E}[Y(0) \mid D=1]\)
    6. \(\mathbb{E}[Y(1) \mid D=1]\)
    7. \(\mathbb{E}[Y(0) \mid D=0]\)

Decomposing Observed Differences

\[ \underbrace{\mathbb{E}[Y \mid D = 1] - \mathbb{E}[Y \mid D = 0]}_{\text{Naive comparison (difference in observed group means)}} \]

\[ = \underbrace{\mathbb{E}[Y(1) \mid D = 1] - \mathbb{E}[Y(0) \mid D = 1]}_{\text{ATT: Average Treatment Effect on the Treated}} \;+\; \underbrace{\mathbb{E}[Y(0) \mid D = 1] - \mathbb{E}[Y(0) \mid D = 0]}_{\text{Selection Bias}} \]

ATT and Selection Bias

  • The decomposition shows: naive comparison = ATT + Selection Bias
  • Selection Bias \(= \mathbb{E}[Y(0) \mid D=1] - \mathbb{E}[Y(0) \mid D=0]\): the difference in baseline outcomes between treated and untreated.
  • In the hospital example: the sick are more likely to seek treatment, so \(\mathbb{E}[Y(0) \mid D=1] < \mathbb{E}[Y(0) \mid D=0]\) — selection bias is negative, and the naive comparison understates the true benefit of hospitalization.
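A small simulation (hypothetical numbers loosely mimicking the hospital example) shows the decomposition at work: the sick select into treatment, selection bias is negative, and the naive comparison understates the benefit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
health0 = rng.normal(4, 0.5, n)          # baseline health Y(0)
health1 = health0 + 0.5                  # hospitals help: constant effect of 0.5
sick = health0 < 3.7                     # the sick select into treatment
D = (sick | (rng.random(n) < 0.05)).astype(int)
Y = np.where(D == 1, health1, health0)   # observed outcome

naive = Y[D == 1].mean() - Y[D == 0].mean()
ATT = (health1 - health0)[D == 1].mean()                  # = 0.5 here
selection = health0[D == 1].mean() - health0[D == 0].mean()  # negative

print(naive, ATT + selection)            # decomposition holds exactly
```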

Random Assignment

  • If treatment \(D\) is randomly assigned: \[ D \;\perp\!\!\!\perp\; (Y(0), Y(1)), \quad 0 < P(D=1) < 1 \]
  • Then \[ \mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0] = \text{ATE} \]
  • Random assignment breaks the link between treatment and potential outcomes — eliminating selection bias.
  • This is the second condition under which ATT = ATU = ATE.
  • Though often infeasible in observational studies, it provides a benchmark for causal inference.
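A quick simulated sanity check (hypothetical values): when treatment is a coin flip independent of the potential outcomes, the naive difference in means recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Y0 = rng.normal(4, 0.5, n)
Y1 = Y0 + 0.5                       # true ATE = 0.5
D = rng.integers(0, 2, n)           # coin-flip assignment, independent of (Y0, Y1)
Y = np.where(D == 1, Y1, Y0)

naive = Y[D == 1].mean() - Y[D == 0].mean()
print(naive)                        # ≈ 0.5: selection bias is gone
```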

Randomized Controlled Trial (RCT)

  • An RCT implements random assignment in practice.
  • Subjects are randomly assigned to:
    • Treatment group: \(D = 1\)
    • Control group: \(D = 0\)
  • With random assignment: \[ \text{ATE} = \mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0] \]

  • But large samples are crucial:

    • By the Law of Large Numbers, idiosyncratic differences average out.
    • Ensures treated and control groups are balanced on both observed and unobserved factors.

From Identification to Estimation

  • Random assignment identifies the causal effect (ATE) as a difference in observed means across groups.
  • But this identification is at the population level.
  • In practice, we rely on sample data, so we must estimate the ATE and quantify uncertainty.
  • This brings us to statistical inference:
    • Estimate population means using sample averages.
    • Assess precision using standard errors.

Statistical Inference: Group Means

  • Suppose we have i.i.d. data from an RCT: \[ \{(Y_i, D_i)\}_{i=1}^n \]

  • For each group \(d \in \{0, 1\}\), define the population mean: \[ \theta_d = \mathbb{E}[Y \mid D = d] \]

  • The sample mean is:

\[ \widehat{\theta}_d = \frac{1}{n_d} \sum_{i: D_i = d} Y_i \]

Asymptotic Distribution

CLT reminder: if \(Z_1, \ldots, Z_n\) are i.i.d. with mean \(\mu\) and variance \(\sigma^2 < \infty\), then: \[ \sqrt{n}(\bar{Z} - \mu) \overset{d}{\longrightarrow} \mathrm{N}(0, \sigma^2) \]

  • Applying CLT within each group (conditional on \(D\)), as \(n_d \to \infty\):

\[ \sqrt{n_d}(\widehat{\theta}_d - \theta_d) \overset{d}{\longrightarrow} \mathrm{N}(0, \sigma_d^2), \quad \sigma_d^2 = \mathrm{Var}(Y \mid D = d) \]

Rescaling to \(\sqrt{n}\)

  • Since \(n_d/n \to p_d\), we have \(\sqrt{n/n_d} \to 1/\sqrt{p_d}\). By Slutsky: \[ \sqrt{n}(\widehat{\theta}_d - \theta_d) = \underbrace{\sqrt{n/n_d}}_{\to\, 1/\sqrt{p_d}} \cdot \underbrace{\sqrt{n_d}(\widehat{\theta}_d - \theta_d)}_{\overset{d}{\to}\, \mathrm{N}(0,\,\sigma_d^2)} \overset{d}{\longrightarrow} \mathrm{N}\!\left(0, \frac{\sigma_d^2}{p_d}\right) \]

Joint Distribution of Group Means

  • Groups are independent under random assignment: \[ \sqrt{n} \begin{pmatrix} \widehat{\theta}_1 - \theta_1 \\ \widehat{\theta}_0 - \theta_0 \end{pmatrix} \overset{d}{\longrightarrow} \mathrm{N} \left( \mathbf{0},\; \begin{pmatrix} \sigma_1^2/p_1 & 0 \\ 0 & \sigma_0^2/p_0 \end{pmatrix} \right) \]

Distribution of the Difference in Means

  • Since the two groups are independent, their difference is also asymptotically normal (sum of independent normals): \[ \sqrt{n}\left[(\widehat{\theta}_1 - \widehat{\theta}_0) - (\theta_1 - \theta_0)\right] \overset{d}{\longrightarrow} \mathrm{N}\left(0,\;\frac{\sigma_1^2}{p_1} + \frac{\sigma_0^2}{p_0} \right) \]
  • Equivalently, for large \(n\):

\[ (\widehat{\theta}_1 - \widehat{\theta}_0) \;\approx\; \mathrm{N}\left( \theta_1 - \theta_0,\;\frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0} \right) \]

Inference for ATE: Test Statistic

  • We want to test \(H_0: \theta_1 - \theta_0 = 0\) vs \(H_1: \theta_1 - \theta_0 \neq 0\)
  • Sample variance and standard error:

\[ \widehat{\sigma}_d^2 = \frac{1}{n_d - 1} \sum_{i: D_i = d} (Y_i - \widehat{\theta}_d)^2, \qquad \widehat{\mathrm{SE}} = \sqrt{\frac{\widehat{\sigma}_1^2}{n_1} + \frac{\widehat{\sigma}_0^2}{n_0}} \]

  • Test statistic (asymptotically standard normal under \(H_0\)):

\[ T = \frac{\widehat{\theta}_1 - \widehat{\theta}_0}{\widehat{\mathrm{SE}}} \;\overset{d}{\longrightarrow}\; \mathrm{N}(0, 1) \quad \text{under } H_0 \]
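The group means, standard error, and test statistic take only a few lines to compute (simulated data; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
Y1 = rng.normal(5.2, 1.0, 400)    # treated outcomes (simulated)
Y0 = rng.normal(5.0, 1.0, 400)    # control outcomes (simulated)

theta1, theta0 = Y1.mean(), Y0.mean()          # group sample means
se = np.sqrt(Y1.var(ddof=1) / len(Y1)          # SE of the difference,
             + Y0.var(ddof=1) / len(Y0))       # using n_d - 1 in each variance
T = (theta1 - theta0) / se                     # test statistic
print(T)
```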

Estimating Causal Effect: An RCT Example

Scenario: Evaluating the effect of a new fertilizer on crop yield through a randomized controlled trial (RCT).

  • Treatment Group: 50 farms use the new fertilizer.
  • Control Group: 50 farms use the standard fertilizer.

Group       Sample Size   Mean Yield   Std. Dev.
Treatment            50   1800 kg/ha   150 kg/ha
Control              50   1650 kg/ha   140 kg/ha

Estimating Causal Effect: Analysis

  • Hypotheses: \[H_0: \mu_T - \mu_C = 0 \quad \text{vs} \quad H_1: \mu_T - \mu_C \neq 0\]

  • Estimated ATE: \[ \widehat{\text{ATE}} = \bar{Y}_T - \bar{Y}_C = 1800 - 1650 = 150 \text{ kg/ha} \]

  • Test statistic: \[ t = \frac{\bar{Y}_T - \bar{Y}_C}{\sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}} = \frac{150}{\sqrt{\frac{150^2}{50} + \frac{140^2}{50}}} = \frac{150}{29.02} \approx 5.17 \]

  • \(t = 5.17 > 1.96\) — reject \(H_0\) at the 5% level.
  • Conclusion: The new fertilizer increases crop yield by approximately 150 kg/ha.
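The arithmetic can be checked directly from the summary table (a quick sketch using only Python's standard library):

```python
import math

n_T, mean_T, sd_T = 50, 1800.0, 150.0   # treatment group summary statistics
n_C, mean_C, sd_C = 50, 1650.0, 140.0   # control group summary statistics

ate_hat = mean_T - mean_C                       # estimated ATE: 150 kg/ha
se = math.sqrt(sd_T**2 / n_T + sd_C**2 / n_C)   # standard error of the difference
t = ate_hat / se

print(round(se, 2), round(t, 2))                # → 29.02 5.17
```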

Covariates and Heterogeneity

  • So far, we’ve focused on the average treatment effect (ATE) across the whole sample.

  • But individuals differ — age, education, baseline health, etc.

  • These observable characteristics are often included in analysis as covariates, to explore differences in treatment effects across subgroups.
  • This motivates studying treatment effect heterogeneity:
    • Who benefits more?
    • Are there subgroups where treatment is less effective?

Conditional Effects: CATE

  • Conditional Average Treatment Effect (CATE): \[ \mathbb{E}[Y(1) - Y(0) \mid W] \]
  • We never observe both \(Y(1)\) and \(Y(0)\). We only observe: \[ \mathbb{E}[Y \mid D=1, W] - \mathbb{E}[Y \mid D=0, W] \]

Identifying CATE

  • Under random assignment \(D \perp\!\!\!\perp (Y(0), Y(1), W)\) with \(0 < P(D=1) < 1\), the CATE is identified: \[ \mathbb{E}[Y(1) - Y(0) \mid W] = \mathbb{E}[Y \mid D = 1, W] - \mathbb{E}[Y \mid D = 0, W] \]
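A sketch of estimating CATEs by subgroup under random assignment (a hypothetical binary covariate \(W\) and made-up effect sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
W = rng.integers(0, 2, n)                # binary covariate (e.g. a subgroup indicator)
Y0 = rng.normal(0, 1, n)
Y1 = Y0 + np.where(W == 1, 2.0, 0.5)     # effect differs across strata
D = rng.integers(0, 2, n)                # random assignment, independent of (Y0, Y1, W)
Y = np.where(D == 1, Y1, Y0)

# Difference in means within each stratum of W identifies the CATE
cate0 = Y[(W == 0) & (D == 1)].mean() - Y[(W == 0) & (D == 0)].mean()
cate1 = Y[(W == 1) & (D == 1)].mean() - Y[(W == 1) & (D == 0)].mean()
print(cate0, cate1)                      # ≈ 0.5 and ≈ 2.0
```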

Testing Covariate Balance

  • Random assignment implies covariate balance — covariates are distributed the same across groups:

\[ W \mid D = 1 \;\overset{d}{=}\; W \mid D = 0 \]

  • Equivalently:

\[ D \mid W \;\overset{d}{=}\; D \]

  • A useful implication: \(D\) is not predictable by \(W\):

\[ \mathbb{E}[D \mid W] = \mathbb{E}[D] \]

  • This can be tested using a regression of \(D\) on \(W\).
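One simple version of the balance check (an illustrative OLS sketch via NumPy; in practice you would also report standard errors or a joint F-test):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
W = rng.normal(0, 1, (n, 3))             # three covariates
D = rng.integers(0, 2, n).astype(float)  # randomized treatment, ignoring W

# OLS of D on W with an intercept; under balance, slopes should be ≈ 0
X = np.column_stack([np.ones(n), W])
beta, *_ = np.linalg.lstsq(X, D, rcond=None)
print(beta[1:])                          # slope coefficients ≈ 0
```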

Causal Diagrams: RCT

  • Nodes represent variables; directed edges represent direct causal effects.
  • In a randomized experiment, \(D\) is assigned independently of all covariates — so \(W \nrightarrow D\).

Causal Diagrams: Confounding

  • In the hospital example, Medical Condition causes both hospitalization and health outcomes — a confounder.

Limitations of RCTs

  • Violations of SUTVA:
    • Spillover effects — e.g. vaccination may protect others via herd immunity.
    • Equilibrium effects — e.g. the earnings return to college may depend on how many people get degrees:
      • In small-scale trials: no effect on market wages.
      • At scale: wages may adjust, reducing the college premium.
  • Ethical concerns:
    • Some treatments cannot be randomly assigned — e.g. smoking, pollution, trauma.
  • Practical constraints:
    • RCTs can be costly, time-consuming, or infeasible in many settings.

When Treatment Isn’t Random

  • In observational studies, treatment is often influenced by factors that also affect the outcome.

  • Do hospitals improve health?
    We observe health outcomes and hospitalization status —
    but sicker individuals are more likely to be hospitalized.

  • This makes it hard to separate the effect of hospitalization from the effect of underlying medical condition.

Recovering Causality

To recover causal effects from observational data, we rely on two key assumptions:

1. Conditional Ignorability (Unconfoundedness)

\[ D \;\perp\!\!\!\perp\; Y(d) \mid X \]

Once we condition on covariates \(X\), treatment is “as if” randomly assigned.
This rules out confounding by observed variables.

2. Overlap (Common Support)

\[ 0 < P(D = 1 \mid X) < 1 \]

For every covariate profile, we observe both treated and untreated individuals.

  • The probability \(p(X) = P(D = 1 \mid X)\) is called the propensity score — it summarizes how likely a unit with covariates \(X\) is to receive treatment.

Without Conditioning: Groups Not Comparable

With Conditioning: Comparable Within Strata

What the Assumptions Let Us Do

  • Under Conditional Ignorability and Overlap, we can equate observed and counterfactual conditional expectations: \[ \mathbb{E}[Y \mid D = d, X] = \mathbb{E}[Y(d) \mid X] \]
  • So the CATE is identified from observed data: \[ \mathbb{E}[Y(1) - Y(0) \mid X] = \mathbb{E}[Y \mid D = 1, X] - \mathbb{E}[Y \mid D = 0, X] \]
  • Averaging over \(X\), we recover the ATE: \[ \text{ATE} = \mathbb{E}\big[\mathbb{E}[Y \mid D = 1, X] - \mathbb{E}[Y \mid D = 0, X]\big] \]
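When \(X\) is discrete, this identification argument translates directly into a stratification estimator (a hypothetical sketch; the data-generating process is made up, with true ATE = 1):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
X = rng.integers(0, 3, n)                     # discrete covariate, 3 strata
p_treat = np.array([0.2, 0.5, 0.8])[X]        # propensity score varies with X
D = (rng.random(n) < p_treat).astype(int)     # treatment confounded by X
Y0 = X.astype(float)                          # baseline outcome depends on X
Y1 = Y0 + 1.0                                 # constant effect: true ATE = 1
Y = np.where(D == 1, Y1, Y0)

naive = Y[D == 1].mean() - Y[D == 0].mean()   # biased: treated have higher X

# Stratify on X, then average stratum differences with weights P(X = x)
ate = sum(
    (X == x).mean()
    * (Y[(X == x) & (D == 1)].mean() - Y[(X == x) & (D == 0)].mean())
    for x in (0, 1, 2)
)
print(naive, ate)                             # naive is far from 1; ate ≈ 1
```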

Summary

  • Causality asks what would happen under a different treatment — not just what is correlated with what.
  • The potential outcomes framework: \(Y(1)\) and \(Y(0)\) exist for every unit, but only one is observed — the Fundamental Problem of Causal Inference.
  • The target estimand matters: ATE, ATT, and ATU differ whenever treatment effects are heterogeneous. Which one you recover depends on your identification strategy.
  • The naive comparison = ATT + selection bias. Selection bias vanishes under random assignment, giving ATT = ATU = ATE.
  • When randomization is infeasible, conditional ignorability and overlap allow causal identification by conditioning on observed covariates \(X\).

What’s Next?

Course Roadmap: Steps in Empirical Analysis

  1. Question: Define the research question or problem.
  2. Econometric Model: Specify the model for data analysis.
  3. Formulate a Hypothesis: Develop a testable hypothesis of interest.
  4. Estimate the Model with Data: Use econometric methods to estimate model parameters.
  5. Inference/Hypothesis Testing: Perform statistical tests for making inferences about the population.

Econometric Model: Linear Regression

  • Why is regression essential?

    • Despite its simplicity, it is highly versatile and widely used in empirical analysis.
    • Bridges theory and evidence, allowing us to test hypotheses and validate economic models.
    • Provides insights into both causal relationships and predictive patterns.
  • Linear regression is used for:

    • Causality: Estimating the average treatment effect.
    • Prediction: Forecasting outcomes based on observables.

Any Questions?