Difference-in-Differences

Natasha Kang

Xiamen University, Chow Institute

May, 2026

6a Recap

Continuous \(X_{it}\). Linear additive potential-outcome (PO) function:

\[ Y_{it}(x) = \alpha_i + \lambda_t + x'\beta + \varepsilon_{it} \]

— linearity and a homogeneous slope \(\beta\) are part of the structural assumption.

Empirical model (\(Y_{it}\) = PO at the realized \(X_{it}\)):

\[ Y_{it} = \alpha_i + \lambda_t + X_{it}'\beta + \varepsilon_{it} \]

Identifying assumption: strict exogeneity of \(\varepsilon_{it}\) \(\Rightarrow\) estimate \(\beta\) by two-way fixed effects (TWFE).

What’s Different in DiD

Setting — binary policy switch:

\[ D_{it} = G_i \cdot \mathbf{1}[t \geq t^*] \]

Potential outcomes. Binary \(D\) gives two POs per unit-period: \(Y_{it}(0)\) and \(Y_{it}(1)\). Write

\[ Y_{it}(1) = Y_{it}(0) + \tau_i \]

with \(\tau_i\) unrestricted — no homogeneous-slope assumption. What the design recovers:

\[ \text{ATT} \;=\; E[\tau_i \mid G_i = 1] \]

— the mean effect on the treated. ATE would require restrictions on \(Y(1)\) too.

Next: Card & Krueger (1994) — DiD with two groups and two periods.

Minimum Wage and Employment: Card & Krueger (1994)

Question: Does raising the minimum wage reduce employment?

Policy: April 1992 — New Jersey raises minimum wage from $4.25 to $5.05. Pennsylvania does not.

Data:

  • 410 fast-food restaurants in NJ and eastern PA — a minimum-wage-intensive industry, with wages tightly tied to the legal floor.
  • Surveyed before (Feb–Mar 1992) and after (Nov–Dec 1992).

Outcome: average employment per restaurant.

Why Naive Comparisons Fail

Cross-sectional — NJ vs. PA, after the hike:

\[ \bar Y_{\text{NJ, after}} \;-\; \bar Y_{\text{PA, after}} \]

Confounded by pre-existing differences between states.

Before-after — NJ alone, before vs. after:

\[ \bar Y_{\text{NJ, after}} \;-\; \bar Y_{\text{NJ, before}} \]

Confounded by time trends — economy-wide changes between Feb and Nov 1992.

DiD combines the two to cancel each one’s bias — formalized below.

The 2×2 Design

Two groups, two periods. Let:

  • \(G_i \in \{0, 1\}\): group (treated = 1, control = 0)
  • \(T_t \in \{0, 1\}\): time (after = 1, before = 0)

The treatment indicator is the product:

\[ D_{it} = G_i \cdot T_t \]

— equals 1 only in the (treated, after) cell.

\[ \begin{array}{l|cc}
 & T = 0 \text{ (before)} & T = 1 \text{ (after)} \\ \hline
G = 1 \text{ (treated)} & D = 0 & D = 1 \\
G = 0 \text{ (control)} & D = 0 & D = 0
\end{array} \]

We observe outcome \(Y_{it}\) for every unit-period.

Potential Outcomes

For each unit-period \((i, t)\), define:

  • \(Y_{it}(0)\) — outcome without treatment.
  • \(Y_{it}(1)\) — outcome with treatment.

The observed outcome equals the PO under realized treatment:

\[ Y_{it} = D_{it} \cdot Y_{it}(1) + (1 - D_{it}) \cdot Y_{it}(0) \]

The fundamental problem: at any \((i, t)\), only one of \(\{Y_{it}(0), Y_{it}(1)\}\) is observed. The other is the counterfactual.

Naive 1: Before-After (Treated Only)

Compare the treated group across periods:

\[ E[Y_{i1} \mid G=1] \;-\; E[Y_{i0} \mid G=1] \]

Substitute \(Y_{i1} = Y_{i1}(1)\) and \(Y_{i0} = Y_{i0}(0)\) for the treated group, then add and subtract \(E[Y_{i1}(0) \mid G=1]\):

\[ \begin{aligned} &= E[Y_{i1}(1) \mid G=1] - E[Y_{i0}(0) \mid G=1] \\ &= \underbrace{E[Y_{i1}(1) - Y_{i1}(0) \mid G=1]}_{\text{ATT}} + \underbrace{E[Y_{i1}(0) - Y_{i0}(0) \mid G=1]}_{\text{time trend (treated)}} \end{aligned} \]

Naive 2: Cross-Section (Post-Period Only)

Compare treated and control after treatment:

\[ E[Y_{i1} \mid G=1] \;-\; E[Y_{i1} \mid G=0] \]

Substitute observed = realized PO, then add and subtract \(E[Y_{i1}(0) \mid G=1]\):

\[ \begin{aligned} &= E[Y_{i1}(1) \mid G=1] - E[Y_{i1}(0) \mid G=0] \\ &= \underbrace{E[Y_{i1}(1) - Y_{i1}(0) \mid G=1]}_{\text{ATT}} + \underbrace{E[Y_{i1}(0) \mid G=1] - E[Y_{i1}(0) \mid G=0]}_{\text{selection bias}} \end{aligned} \]

Combining the Naives

Each naive picks up the ATT plus one bias:

  • Before-after: treated group’s \(Y(0)\)-trend.
  • Cross-section: group-level \(Y(0)\)-difference.

The DiD idea: subtract the control’s before-after change from the treated’s. Let \(\bar Y_{g,t}\) denote the sample mean for group \(g\) in period \(t\):

\[ \hat\tau^{\text{DiD}} \;=\; \bigl(\bar Y_{1,1} - \bar Y_{1,0}\bigr) \;-\; \bigl(\bar Y_{0,1} - \bar Y_{0,0}\bigr) \]

The control’s change estimates the common \(Y(0)\)-trend — wiping out the time-trend bias in Naive 1.

Differencing across groups wipes out the group-level \(Y(0)\)-difference in Naive 2.

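Both bias terms vanish under one assumption, parallel trends (PT): absent treatment, the treated and control groups would have experienced the same average change in \(Y(0)\):

\[ E[Y_{i1}(0) - Y_{i0}(0) \mid G=1] \;=\; E[Y_{i1}(0) - Y_{i0}(0) \mid G=0] \]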
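To make the two biases concrete, here is a minimal Python sketch with made-up population means. Every number is an assumption chosen for illustration, not data from any study. The untreated outcome \(Y(0)\) falls by 2 in both groups, so the control's change is a valid stand-in for the treated group's \(Y(0)\)-trend.

```python
# Hypothetical population means of the untreated potential outcome Y(0).
# Keys are (group g, period t). All numbers are illustrative assumptions.
y0 = {(1, 0): 20.0, (1, 1): 18.0,   # treated group's Y(0) falls by 2
      (0, 0): 23.0, (0, 1): 21.0}   # control's Y(0) also falls by 2
att = 3.0                            # assumed true effect on the treated, post-period

# Observed cell means: only the (treated, after) cell reveals Y(1).
ybar = dict(y0)
ybar[(1, 1)] = y0[(1, 1)] + att

before_after = ybar[(1, 1)] - ybar[(1, 0)]     # ATT + treated Y(0)-trend
cross_section = ybar[(1, 1)] - ybar[(0, 1)]    # ATT + group-level Y(0)-difference
did = (ybar[(1, 1)] - ybar[(1, 0)]) - (ybar[(0, 1)] - ybar[(0, 0)])

print(before_after, cross_section, did)
```

Under these assumed numbers the before-after contrast gives 1 (the ATT of 3 plus a trend of -2), the cross-section gives 0 (the ATT plus a level gap of -3), and DiD returns 3.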

Maintained Assumptions

Beyond PT, DiD inherits one fundamental condition from the PO framework:

SUTVA (stable unit treatment value assumption; lect1):

  • No interference — one unit’s treatment doesn’t affect another’s outcome.
  • Single version of treatment.

Like any causal inference, DiD breaks if SUTVA fails.

What DiD Identifies: ATT

The target: the average treatment effect on the treated in the post-period:

\[ \text{ATT} \;=\; E[Y_{i1}(1) - Y_{i1}(0) \mid G_i = 1] \]

  • \(E[Y_{i1}(1) \mid G_i = 1]\): observed (treated, post).
  • \(E[Y_{i1}(0) \mid G_i = 1]\): counterfactual, reconstructed under PT.

Why ATT, not ATE? PT restricts \(Y(0)\) only. The control’s change estimates the treated group’s \(Y(0)\)-counterfactual — but tells us nothing about how controls would respond to treatment.

Identification: DiD \(\to\) ATT

Apply the law of large numbers (LLN) to the four cell means. The probability limit:

\[ \begin{aligned} \hat\tau^{\text{DiD}} \;\xrightarrow{p}\; &\bigl(E[Y_{i1}(1) \mid G=1] - E[Y_{i0}(0) \mid G=1]\bigr) \\ &- \bigl(E[Y_{i1}(0) \mid G=0] - E[Y_{i0}(0) \mid G=0]\bigr) \end{aligned} \]

Add and subtract \(E[Y_{i1}(0) \mid G=1]\):

\[ \begin{aligned} =\;& \underbrace{E[Y_{i1}(1) - Y_{i1}(0) \mid G=1]}_{\text{ATT}} \\ &+ \underbrace{E[Y_{i1}(0) - Y_{i0}(0) \mid G=1] - E[Y_{i1}(0) - Y_{i0}(0) \mid G=0]}_{=\,0\text{ under PT}} \end{aligned} \]

Under parallel trends: \(\hat\tau^{\text{DiD}} \xrightarrow{p} \text{ATT}\). ✓
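The consistency result can be checked by simulation. The sketch below is only illustrative: the data-generating process, sample size, and seed are all assumptions. It builds two groups with different levels, a common \(Y(0)\)-trend, and heterogeneous unit effects \(\tau_i\), then compares the four-cell DiD with the average effect among the treated.

```python
import random
from statistics import mean

random.seed(0)      # fixed seed so the run is reproducible
n = 20000           # units per group (arbitrary choice)

def simulate_group(g):
    """Return (y_pre, y_post, tau) lists for one group under parallel trends."""
    y_pre, y_post, taus = [], [], []
    for _ in range(n):
        alpha = random.gauss(5.0 + 2.0 * g, 1.0)   # groups differ in levels
        trend = -1.0                                # common Y(0)-trend: PT holds
        tau = random.gauss(1.5, 1.0)                # heterogeneous unit effect
        y0_pre = alpha + random.gauss(0, 1)
        y0_post = alpha + trend + random.gauss(0, 1)
        y_pre.append(y0_pre)
        y_post.append(y0_post + tau * g)            # only treated units reveal Y(1)
        taus.append(tau)
    return y_pre, y_post, taus

pre1, post1, tau1 = simulate_group(1)
pre0, post0, _ = simulate_group(0)

did = (mean(post1) - mean(pre1)) - (mean(post0) - mean(pre0))
att = mean(tau1)    # average effect among the treated units
print(round(did, 3), round(att, 3))
```

The two printed numbers should agree up to sampling noise, even though the groups differ in levels and \(\tau_i\) varies across units.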

Card-Krueger: The 2×2 in Numbers

Average employment per restaurant:

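\[ \begin{array}{lccc}
 & \text{NJ } (G=1) & \text{PA } (G=0) & \text{Difference (NJ} - \text{PA)} \\ \hline
\text{Before} & 20.44 & 23.33 & -2.89 \\
\text{After} & 21.03 & 21.17 & -0.14 \\
\text{Change} & +0.59 & -2.16 & +2.75
\end{array} \]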

\[ \hat\tau^{\text{DiD}} \;=\; (+0.59) \;-\; (-2.16) \;=\; +2.75 \]

  • Employment rose in NJ relative to PA after the wage hike.
  • The design uses only two waves — a strict 2×2 DiD.
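The arithmetic can be reproduced in a few lines of Python. The four inputs are the cell means quoted on this slide; the variable names are illustrative. The sketch also confirms that differencing over time first or across states first gives the same number.

```python
# Cell means quoted above (average employment per restaurant).
means = {("NJ", "before"): 20.44, ("NJ", "after"): 21.03,
         ("PA", "before"): 23.33, ("PA", "after"): 21.17}

# Route 1: difference of the before-after changes.
change_nj = means[("NJ", "after")] - means[("NJ", "before")]
change_pa = means[("PA", "after")] - means[("PA", "before")]
did = change_nj - change_pa

# Route 2: difference of the cross-sectional gaps.
gap_after = means[("NJ", "after")] - means[("PA", "after")]
gap_before = means[("NJ", "before")] - means[("PA", "before")]
did_alt = gap_after - gap_before

print(round(change_nj, 2), round(change_pa, 2), round(did, 2), round(did_alt, 2))
```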

A Surprising Sign

The competitive labor-market intuition predicts \(\hat\tau < 0\):

  • Bind the price (wage) above market-clearing → quantity (employment) falls.

CK find \(\hat\tau > 0\). Two interpretations:

  1. DiD failed: PT was violated, \(\hat\tau\) is biased upward, the “true” effect is negative or zero.
  2. DiD worked: the competitive model is wrong about this labor market.

Much of the subsequent literature supports (2); the next slides explain why.

Why Did Employment Rise? Rejecting Perfect Competition

  • Wage-taking firms pay competitive wage \(w^*\).
  • \(w_{\min} > w^*\) ⇒ employment falls to \(L_{\min}\).
  • CK find the opposite — perfect competition is rejected by the data.

Monopsony Reconciles the Finding

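  • Monopsony: the marginal cost of labor lies above the labor supply curve ⇒ the firm hires at \((L_m, w_m)\), with both employment and the wage below their competitive levels.
  • A minimum wage between \(w_m\) and \(w^*\) makes the marginal cost of labor flat at \(w_{\min}\) up to the supply curve ⇒ employment rises toward the competitive level.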
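A numerical sketch may help fix ideas. It assumes a linear inverse labor supply and a linear marginal revenue product. The functional forms and every parameter value are hypothetical and appear nowhere in the slides. With a wage floor at or above \(w_m\), the firm hires on the short side of the market.

```python
# Hypothetical linear labor market (all numbers are illustrative assumptions).
A_S, B_S = 2.0, 0.1      # inverse labor supply:        w = A_S + B_S * L
A_D, B_D = 10.0, 0.1     # marginal revenue product:  MRP = A_D - B_D * L

def supply(w):
    """Workers willing to work at wage w."""
    return (w - A_S) / B_S

def demand(w):
    """Employment at which MRP equals w."""
    return (A_D - w) / B_D

# Competitive benchmark: supply wage equals MRP.
L_star = (A_D - A_S) / (B_S + B_D)
w_star = A_S + B_S * L_star

# Monopsony: marginal cost of labor A_S + 2*B_S*L equals MRP; wage read off supply.
L_m = (A_D - A_S) / (2 * B_S + B_D)
w_m = A_S + B_S * L_m

def employment_with_floor(w_min):
    """Employment under a wage floor at or above w_m: short side of the market."""
    return min(supply(w_min), demand(w_min))

print(round(L_m, 2), round(w_m, 2), round(L_star, 2), round(w_star, 2))
print(round(employment_with_floor(5.5), 2), round(employment_with_floor(7.0), 2))
```

In this parameterization a floor of 5.5 (between \(w_m\) and \(w^*\)) raises employment above \(L_m\), while a floor of 7.0 (above \(w^*\)) pushes it below the competitive level.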

Where Does Wage-Setting Power Come From?

Workers can’t costlessly switch employers:

  • Search costs.
  • Local commutes — labor markets are small and spatially segmented.
  • Heterogeneous job attributes (schedules, locations, benefits).

Where the literature stands:

  • Manning (2003): widespread evidence of monopsony in low-wage labor markets.
  • Azar–Marinescu–Steinbaum (2022): local labor markets often have few competing employers — direct evidence of concentration.
  • Post-CK minimum-wage research: moderate hikes show small/zero employment effects — consistent with frictional monopsony.

The CK lesson: a credible causal estimate can overturn a textbook prediction.

From Cell Means to Regression

We’ve built DiD as a four-cell calculation. A regression form lets us:

  1. Compute the same number in standard software.
  2. Get standard errors and confidence intervals.
  3. Add covariates for precision or to relax PT.
  4. Generalize naturally to many units and many periods.

The translation: replace the 2×2 cell structure with dummies and an interaction.

The 2×2 DiD Regression

Map the four cell means to a regression with dummies and an interaction:

\[ Y_{it} \;=\; \alpha \;+\; \beta\, G_i \;+\; \gamma\, T_t \;+\; \tau\,(G_i \times T_t) \;+\; U_{it} \]

Reading off the coefficients:

  • \(\alpha\): baseline — control group, pre-period.
  • \(\beta\): group dummy — level shift for treated.
  • \(\gamma\): time dummy — common shift after period.
  • \(\tau\): interaction, the extra shift for the (treated, after) cell. Its OLS estimate is the DiD estimator.

\(\tau\) identifies the ATT under PT — proved on the Identification slide earlier.

Interpretation: One Conditional Mean Per Cell

\[ \begin{array}{l|cc}
 & T = 0 & T = 1 \\ \hline
G = 0 & \alpha & \alpha + \gamma \\
G = 1 & \alpha + \beta & \alpha + \beta + \gamma + \tau
\end{array} \]

Read off:

  • Control change: \((\alpha + \gamma) - \alpha = \gamma\).
  • Treated change: \((\alpha + \beta + \gamma + \tau) - (\alpha + \beta) = \gamma + \tau\).
  • Difference of changes: \(\tau\).

OLS Reproduces the Four-Cell Difference

With only group and time dummies plus their interaction, the OLS estimator is:

\[ \hat\tau^{\text{OLS}} \;=\; (\bar Y_{1,1} - \bar Y_{1,0}) \;-\; (\bar Y_{0,1} - \bar Y_{0,0}) \]

\(\hat\tau^{\text{OLS}}\) equals the hand calculation by construction.

What we gain: SEs, CIs, and an extension framework — without changing the estimate.
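The equivalence can be verified numerically. The sketch below uses a small made-up dataset and a hand-written normal-equations solver (both are illustrative assumptions; in practice one would use a regression package). It fits the saturated 2×2 regression and compares \(\hat\tau^{\text{OLS}}\) with the four-cell difference.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(x_rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical observations (G, T, Y). Cell sizes need not be equal.
data = [(0, 0, 23.0), (0, 0, 24.0), (0, 1, 21.0), (0, 1, 21.5), (0, 1, 20.5),
        (1, 0, 20.0), (1, 0, 21.0), (1, 0, 20.5), (1, 1, 21.0), (1, 1, 21.2)]

x = [[1.0, g, t, g * t] for g, t, _ in data]   # intercept, G, T, G x T
y = [obs[2] for obs in data]
alpha_hat, beta_hat, gamma_hat, tau_hat = ols(x, y)

def cell_mean(g, t):
    vals = [yy for gg, tt, yy in data if (gg, tt) == (g, t)]
    return sum(vals) / len(vals)

tau_cells = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(round(tau_hat, 6), round(tau_cells, 6))
```

Because the regression is saturated in the four cells, the fitted values are the cell means, so the two numbers coincide up to floating-point error.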

Adding Controls — Two Motivations

DiD can be augmented with pre-treatment covariates \(W_i\). Two distinct motivations, with different specifications.

Motivation 1 — Precision.

\[ Y_{it} \;=\; \alpha + \beta G_i + \gamma T_t + \tau(G_i \times T_t) + \boldsymbol\theta' W_i + U_{it} \]

If \(W_i\) predicts \(Y\), including it as a level reduces residual variance ⇒ tighter SE on \(\hat\tau\). Identification unchanged.

Motivation 2 — Conditional PT.

PT may fail unconditionally but hold within strata of \(W_i\). Needs a different spec — next slide.

Conditional PT in 2×2

If \(W_i\) predicts the trend, unconditional PT fails:

\[ E[Y_{i1}(0) - Y_{i0}(0) \mid G=1] \neq E[Y_{i1}(0) - Y_{i0}(0) \mid G=0] \]

PT may still hold given \(W_i\):

\[ E[Y_{i1}(0) - Y_{i0}(0) \mid G=1, W_i] \;=\; E[Y_{i1}(0) - Y_{i0}(0) \mid G=0, W_i] \]

Implementing Conditional PT

Under conditional PT:

  • The \(Y(0)\)-trend can depend on \(W\) (but not on \(G\), given \(W\)).
  • Equivalently: \(W\)’s effect on \(Y\) may shift between pre and post.
  • \(\Rightarrow\) include both \(W_i\) (level) and \(W_i \times T_t\) (interaction) in the regression.

\[ Y_{it} = \alpha + \beta G_i + \gamma T_t + \tau(G_i \times T_t) + \boldsymbol\theta' W_i + \boldsymbol\rho'(W_i \times T_t) + U_{it} \]

  • \(\boldsymbol\theta' W_i\): \(W\)’s level effect.
  • \(\boldsymbol\rho'(W_i \times T_t)\): change in \(W\)’s effect from pre to post.

\(W_i\) must be pre-treatment (lect4c, bad controls).
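A small deterministic example shows why the plain 2×2 fails here. Note that this sketch swaps the regression for a stratified DiD: it runs a DiD within each stratum of a binary \(W\) and averages with the treated group's \(W\) weights. With a binary \(W\) and a homogeneous effect, both routes target the same \(\tau\). All shares, levels, and trends below are made-up assumptions.

```python
# Hypothetical setup: binary covariate W whose Y(0)-trend differs by stratum.
share_w1 = {1: 0.75, 0: 0.25}            # P(W = 1 | G = g): groups differ in W mix
trend = {0: -1.0, 1: -3.0}               # Y(0)-trend by W, identical across G given W
level = {(0, 0): 22.0, (0, 1): 25.0,     # pre-period mean Y(0), keyed by (G, W)
         (1, 0): 19.0, (1, 1): 21.0}
tau = 2.0                                # assumed true ATT (homogeneous here)

def cell(g, w, t):
    """Observed mean outcome for group g, stratum w, period t."""
    return level[(g, w)] + trend[w] * t + tau * g * t

def group_mean(g, t):
    """Observed mean for group g in period t, averaging over its W mix."""
    p = share_w1[g]
    return (1 - p) * cell(g, 0, t) + p * cell(g, 1, t)

# Unconditional DiD: contaminated because the groups have different W mixes.
did_plain = (group_mean(1, 1) - group_mean(1, 0)) - (group_mean(0, 1) - group_mean(0, 0))

# DiD within each W stratum, then averaged with the treated group's W weights.
did_w = {w: (cell(1, w, 1) - cell(1, w, 0)) - (cell(0, w, 1) - cell(0, w, 0))
         for w in (0, 1)}
did_cond = (1 - share_w1[1]) * did_w[0] + share_w1[1] * did_w[1]

print(did_plain, did_cond)
```

Under these assumptions the unconditional DiD returns 1.0 while the stratified version recovers the true effect of 2.0: the gap is the trend difference across strata times the difference in \(W\) shares.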

Unit Fixed Effects for Precision

CK has 410 restaurants. Replace the group dummy \(G_i\) by unit dummies \(\alpha_i\) — one per restaurant — absorbing each restaurant’s baseline:

\[ Y_{it} = \alpha_i + \gamma T_t + \tau(G_i \times T_t) + U_{it} \]

  • \(\alpha_i\) — one intercept per restaurant. Absorbs state membership \(G_i\) and any time-invariant restaurant feature (location, owner, layout).
  • The interaction \(G_i \times T_t\) is unchanged; in a balanced panel the point estimate \(\hat\tau\) is identical.
  • Residual variance shrinks \(\Rightarrow\) tighter SEs.

Identification is still group-level — PT compares treated and control \(Y(0)\)-trends. Unit FE is a regression-spec choice for precision, not a shift to unit-level identification.
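With exactly two periods, the unit-FE regression is numerically the same as regressing each unit's change \(Y_{i1} - Y_{i0}\) on \(G_i\), which reduces to a difference in mean changes. The sketch below checks this against the four-cell DiD on a made-up balanced panel (unit labels and outcomes are hypothetical).

```python
# Hypothetical balanced two-period panel: unit -> (G, Y_before, Y_after).
panel = {"r1": (1, 18.0, 19.5), "r2": (1, 25.0, 25.0), "r3": (1, 20.0, 21.0),
         "r4": (0, 30.0, 28.0), "r5": (0, 15.0, 12.5)}

def avg(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

# Differencing within unit removes alpha_i; regressing the change on G then
# reduces to a difference in mean changes between treated and control units.
delta = {u: (g, y1 - y0) for u, (g, y0, y1) in panel.items()}
tau_fd = (avg(d for g, d in delta.values() if g == 1)
          - avg(d for g, d in delta.values() if g == 0))

# Four-cell DiD from group means, for comparison (index 1 = before, 2 = after).
def cell(g, idx):
    return avg(v[idx] for v in panel.values() if v[0] == g)

tau_cells = (cell(1, 2) - cell(1, 1)) - (cell(0, 2) - cell(0, 1))
print(round(tau_fd, 6), round(tau_cells, 6))
```

The equality relies on the panel being balanced; if some units are observed in only one period, the two calculations can differ.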

Many Periods

Now extend the design: suppose we observe restaurants in multiple months before and after the policy, not just one pre and one post. Replace the single time dummy \(T_t\) by a period dummy for each calendar period:

\[ Y_{it} = \alpha_i + \lambda_t + \tau\, D_{it} + U_{it} \]

  • \(\lambda_t\): one intercept per period (omit one period dummy to avoid the dummy-variable trap).
  • \(D_{it} = G_i \cdot \mathbf{1}[t \geq t^*]\): treatment indicator (1 only when \(i\) is treated and \(t\) is post).

Multiple pre-periods make PT testable — we can check whether the \(Y(0)\)-trend was parallel before treatment. Coming up.
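As a rough illustration of the TWFE regression above, the sketch below estimates \(\tau\) on a made-up balanced panel by the two-way within transformation, \(x_{it} - \bar x_i - \bar x_t + \bar x\), which is valid for balanced panels. The data, unit labels, and \(t^* = 2\) are assumptions.

```python
# Hypothetical balanced panel: 4 units x 4 periods, treatment starts at t* = 2.
t_star = 2
group = {"a": 1, "b": 1, "c": 0, "d": 0}
y = {"a": [10.0, 11.0, 14.0, 15.5], "b": [12.0, 12.5, 16.0, 17.0],
     "c": [20.0, 21.0, 22.0, 23.0], "d": [8.0, 8.5, 10.0, 10.5]}
units, periods = list(y), range(4)
d = {u: [group[u] * int(t >= t_star) for t in periods] for u in units}

def within(x):
    """Two-way within transform: x_it - mean_i - mean_t + grand mean."""
    n, n_t = len(units), len(periods)
    unit_mean = {u: sum(x[u]) / n_t for u in units}
    time_mean = [sum(x[u][t] for u in units) / n for t in periods]
    grand = sum(unit_mean.values()) / n
    return {u: [x[u][t] - unit_mean[u] - time_mean[t] + grand for t in periods]
            for u in units}

# Regress demeaned Y on demeaned D (single regressor, no intercept needed).
y_w, d_w = within(y), within(d)
num = sum(y_w[u][t] * d_w[u][t] for u in units for t in periods)
den = sum(d_w[u][t] ** 2 for u in units for t in periods)
tau_twfe = num / den

# With a single t* and a balanced panel this matches the DiD of pre/post averages.
def avg_cell(g, post):
    vals = [y[u][t] for u in units if group[u] == g
            for t in periods if (t >= t_star) == post]
    return sum(vals) / len(vals)

tau_did = (avg_cell(1, True) - avg_cell(1, False)) - (avg_cell(0, True) - avg_cell(0, False))
print(round(tau_twfe, 6), round(tau_did, 6))
```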

The Idea: Estimate a Treatment “Effect” Each Period

Replace the single treatment indicator \(D_{it}\) by one indicator per period, with \(k = -1\) omitted as the reference:

\[ Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k \cdot \mathbf{1}[t - t^* = k] \cdot G_i + \varepsilon_{it} \]

Notation: \(k = t - t^*\) is event time — periods relative to the treatment date.

  • \(k = 0\): first treated period.
  • \(k = -1\): last pre-treatment period (the reference — see below).
  • \(k < -1\): earlier pre-periods (leads).
  • \(k \geq 0\): post-periods (lags).

Reading Off \(\beta_k\)

For a control unit (\(G_i = 0\)): all interaction terms vanish. \[Y_{it} = \alpha_i + \lambda_t + \varepsilon_{it}\]

For a treated unit at event time \(k \neq -1\): \[Y_{it} = \alpha_i + \lambda_t + \beta_k + \varepsilon_{it}\]

So \(\beta_k\) is the treated-vs.-control gap at event time \(k\), relative to the gap at \(k = -1\).

  • \(\beta_k\) for \(k < -1\): pre-period leads. Should be ≈ 0 if PT holds.
  • \(\beta_k\) for \(k \geq 0\): post-period lags. Dynamic ATT at horizon \(k\).
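The "gap relative to \(k = -1\)" reading can be computed directly. In a balanced panel with a single treatment date, the event-study OLS coefficients coincide with these differences in group-mean gaps, so the sketch below skips the regression. The group means and \(t^* = 3\) are made up.

```python
# Hypothetical group means by calendar period; treatment date t* = 3.
t_star = 3
mean_treated = {0: 10.0, 1: 10.5, 2: 11.0, 3: 13.0, 4: 14.0, 5: 14.5}
mean_control = {0: 8.0, 1: 8.5, 2: 9.0, 3: 9.5, 4: 10.0, 5: 10.5}

# Treated-minus-control gap, indexed by event time k = t - t*.
gap = {t - t_star: mean_treated[t] - mean_control[t] for t in mean_treated}
ref = gap[-1]                      # gap in the reference period k = -1

# beta_k = gap at k minus gap at k = -1; k = -1 itself is normalized to zero.
beta = {k: g - ref for k, g in sorted(gap.items()) if k != -1}

leads = {k: b for k, b in beta.items() if k < -1}   # pre-period coefficients
lags = {k: b for k, b in beta.items() if k >= 0}    # dynamic effects by horizon
print(leads)
print(lags)
```

In this made-up example the leads are exactly zero (parallel pre-trends by construction) and the lags trace out an effect that builds from 1.5 to 2.0.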

Pre-Trend Test

Joint hypothesis on the leads:

\[ H_0: \beta_{-K} = \beta_{-K+1} = \cdots = \beta_{-2} = 0 \]

Tested with a single \(F\)-statistic (or equivalent Wald test). Two possible outcomes:

  1. Rejected: pre-trends are non-flat. PT is suspect.
  2. Not rejected: consistent with PT, but this does not prove PT holds in the post-period.

Always pair the test with a plot — a non-rejection can hide trended-but-noisy leads.

The Event-Study Plot

  • (a) flat leads with CIs covering 0 — supportive of PT.
  • (b) leads drift linearly toward 0; individual CIs all cover 0 — joint F-test likely doesn’t reject, yet the pattern is concerning. PT suspect.

Standard Errors

Treatment varies at the group level (state in CK). Group-level shocks correlate outcomes within group:

  • Serially: over time within unit.
  • Cross-sectionally: across units within group at each \(t\).

Cluster at the group: the level at which treatment is assigned, and across which observations can be treated as independent.

  • lect6a TWFE: clustered at unit — serial correlation within unit.
  • 6b DiD: cluster at group — serial + cross-sectional within group.

Few clusters (e.g., 2 states): wild bootstrap, or aggregate to group-period means.

Exercise: Dropping the Reference Period

In the event-study regression, \(k = -1\) is omitted from the sum:

\[ Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k \cdot \mathbf{1}[t - t^* = k] \cdot G_i + \varepsilon_{it} \]

Q1. If we include every event-time dummy (no reference), what goes wrong with OLS?

Q2. Which \(k\) should we drop by convention, and what do the remaining \(\beta_k\) then measure?

Anticipation

Recall: \(\beta_k\) is measured relative to \(k = -1\). What if \(k = -1\) isn’t a clean baseline — because treated units already shifted behavior in anticipation of \(t^*\)?

Slemrod (1995) — Tax Reform Act of 1986:

  • The top tax rate on capital gains rose from 20% to 28%, effective Jan 1987.
  • Anticipating, high-income taxpayers pulled capital-gains realizations into 1986, locking in the lower rate.
  • Pre-period outcomes reflected anticipation, not a clean no-treatment baseline.
  • Diagnostic: re-estimate with \(k = -2\) as reference. A non-zero \(\hat\beta_{-1}\) then flags anticipation.
  • Remedy: redate the treatment to the announcement date, not implementation.

Staggered Rollout: A Modern Setting

Most modern DiD applications don’t have a single \(t^*\). Treatment rolls out across units at different times:

  • US states adopt a policy in different years.
  • Firms get access to a new technology over a multi-year phase-in.
  • Hospitals join a payment program on different dates.

The treatment indicator generalizes:

\[ D_{it} = \mathbf{1}[t \geq t_i^*] \]

— each unit has its own switch date \(t_i^*\). Untreated units have \(t_i^* = \infty\).

Natural approach: run TWFE with this \(D_{it}\), i.e. \(Y_{it} = \alpha_i + \lambda_t + \tau D_{it} + U_{it}\). But…

Why Naïve TWFE Breaks Under Staggered Rollout

Three units, four periods. Treatment turns on at different times:

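\[ \begin{array}{l|cccc}
\text{Unit} & t=1 & t=2 & t=3 & t=4 \\ \hline
\text{A (never treated)} & 0 & 0 & 0 & 0 \\
\text{B (treated at } t=2\text{)} & 0 & 1 & 1 & 1 \\
\text{C (treated at } t=3\text{)} & 0 & 0 & 1 & 1
\end{array} \]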

Goodman-Bacon (2021): TWFE’s \(\hat\tau\) is implicitly a weighted average of all 2×2 DiDs in the panel. With three units, the implicit comparisons include:

  • A as control for B’s switch ✓ (clean)
  • A as control for C’s switch ✓ (clean)
  • C vs B around C’s switch ✗ (contaminated: uses already-treated B as a control, so any change in B’s effect over time enters the estimate)
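The contaminated comparison is easy to see in a toy example. The sketch below assumes \(Y(0) = 0\) everywhere and an effect that grows by one unit per period since treatment. Both are illustrative assumptions, not part of the Goodman-Bacon result. It computes the 2×2 DiD for C's switch twice, once against A and once against B.

```python
# Hypothetical outcomes for the three units above (periods t = 1..4).
# Y(0) is 0 everywhere; the effect grows with time since treatment: 1, 2, 3, ...
start = {"A": None, "B": 2, "C": 3}     # first treated period; None = never treated

def effect(unit, t):
    g = start[unit]
    return 0.0 if g is None or t < g else float(t - g + 1)

y = {(u, t): effect(u, t) for u in start for t in range(1, 5)}

def did_2x2(treated, control, pre, post):
    """2x2 DiD of `treated` against `control` between periods pre and post."""
    return ((y[(treated, post)] - y[(treated, pre)])
            - (y[(control, post)] - y[(control, pre)]))

true_att_c = effect("C", 3)                     # C's effect in its first treated period
clean = did_2x2("C", "A", pre=2, post=3)        # never-treated control
forbidden = did_2x2("C", "B", pre=2, post=3)    # already-treated control

print(true_att_c, clean, forbidden)
```

Here the clean comparison returns the true effect of 1, while the comparison against B returns 0: B's own effect grew by 1 over the same window and is subtracted off as if it were a trend.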

Modern Estimators for Staggered DiD

Common principle: never use already-treated units as controls. Build treatment effects from clean comparisons, then aggregate.

Callaway-Sant’Anna (2021): let \(\text{ATT}(g, t)\) = treatment effect on units treated at time \(g\), measured at time \(t\). Estimate each one as a clean 2×2 DiD using never-treated A as control.

\[ \begin{array}{l|ccc}
\text{Cohort } g & t=2 & t=3 & t=4 \\ \hline
g = 2 \text{ (B)} & \text{ATT}(2,2) & \text{ATT}(2,3) & \text{ATT}(2,4) \\
g = 3 \text{ (C)} & & \text{ATT}(3,3) & \text{ATT}(3,4)
\end{array} \]

Aggregate the cells (non-negative weights) → overall ATT, or average across cohorts at each event-time \(k = t - g\).
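The cell-by-cell logic can be sketched for the three-unit example (A never treated, B treated at \(t=2\), C at \(t=3\)). This is only a sketch of the idea, not the published estimator or any package's API. The outcome paths, the use of \(g - 1\) as base period, and the equal-weight aggregation are assumptions made for illustration.

```python
# Hypothetical setup: cohorts B (g = 2) and C (g = 3), never-treated A.
# Y(0) follows a common trend; effects grow with time since treatment.
cohort = {"A": None, "B": 2, "C": 3}

def outcome(unit, t):
    g = cohort[unit]
    y0 = 10.0 + 0.5 * t                          # common untreated path
    return y0 if g is None or t < g else y0 + float(t - g + 1)

def att(unit, t):
    """ATT(g, t): change since base period g - 1, net of A's change."""
    g = cohort[unit]
    return ((outcome(unit, t) - outcome(unit, g - 1))
            - (outcome("A", t) - outcome("A", g - 1)))

cells = {(cohort[u], t): att(u, t) for u in ("B", "C") for t in range(cohort[u], 5)}

# Aggregate by event time k = t - g with equal weights across cohorts (an assumption).
by_k = {}
for (g, t), value in cells.items():
    by_k.setdefault(t - g, []).append(value)
event_study = {k: sum(v) / len(v) for k, v in sorted(by_k.items())}
overall = sum(cells.values()) / len(cells)       # simple average of all cells

print(cells)
print(event_study, overall)
```

Every cell uses only A as the control, so growing effects in B never contaminate C's estimates. Other weighting schemes (for example, by cohort size) are equally possible.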

Variants on the same principle: Sun-Abraham (event study), de Chaisemartin-D’Haultfœuille (dCdH; newly-treated vs not-yet-treated), Borusyak-Jaravel-Spiess (BJS; impute the counterfactual).