Difference-in-Differences
Natasha Kang
Xiamen University, Chow Institute
May, 2026
6a Recap
Continuous \(X_{it}\). Linear additive potential-outcome (PO) function:
\[
Y_{it}(x) = \alpha_i + \lambda_t + x'\beta + \varepsilon_{it}
\]
— linearity and a homogeneous slope \(\beta\) are part of the structural assumption.
Empirical model (\(Y_{it}\) = PO at the realized \(X_{it}\) ):
\[
Y_{it} = \alpha_i + \lambda_t + X_{it}'\beta + \varepsilon_{it}
\]
Identifying assumption: strict exogeneity of \(\varepsilon\) \(\Rightarrow\) estimate \(\beta\) by TWFE.
What’s Different in DiD
Setting — binary policy switch:
\[
D_{it} = G_i \cdot \mathbf{1}[t \geq t^*]
\]
Potential outcomes. Binary \(D\) gives two POs per unit-period: \(Y_{it}(0)\) and \(Y_{it}(1)\) . Write
\[
Y_{it}(1) = Y_{it}(0) + \tau_i
\]
with \(\tau_i\) unrestricted (no homogeneous-slope assumption). What the design recovers:
\[
\text{ATT} \;=\; E[\tau_i \mid G_i = 1]
\]
— the mean effect on the treated . ATE would require restrictions on \(Y(1)\) too.
Next : Card & Krueger (1994) — DiD with two groups and two periods.
Minimum Wage and Employment: Card & Krueger (1994)
Question : Does raising the minimum wage reduce employment?
Policy : April 1992 — New Jersey raises minimum wage from $4.25 to $5.05. Pennsylvania does not.
Data :
410 fast-food restaurants in NJ and eastern PA — a minimum-wage-intensive industry, with wages tightly tied to the legal floor.
Surveyed before (Feb–Mar 1992) and after (Nov–Dec 1992).
Outcome : average employment per restaurant.
Why Naive Comparisons Fail
Cross-sectional — NJ vs. PA, after the hike:
\[
\bar Y_{\text{NJ, after}} \;-\; \bar Y_{\text{PA, after}}
\]
Confounded by pre-existing differences between states.
Before-after — NJ alone, before vs. after:
\[
\bar Y_{\text{NJ, after}} \;-\; \bar Y_{\text{NJ, before}}
\]
Confounded by time trends — economy-wide changes between Feb and Nov 1992.
DiD combines the two to cancel each one’s bias — formalized below.
The 2×2 Design
Two groups, two periods. Let:
\(G_i \in \{0, 1\}\) : group (treated = 1, control = 0)
\(T_t \in \{0, 1\}\) : time (after = 1, before = 0)
The treatment indicator is the product:
\[
D_{it} = G_i \cdot T_t
\]
— equals 1 only in the (treated, after) cell.
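\[
\begin{array}{l|cc}
 & T = 0 \text{ (before)} & T = 1 \text{ (after)} \\ \hline
G = 1 \text{ (treated)} & D = 0 & D = 1 \\
G = 0 \text{ (control)} & D = 0 & D = 0
\end{array}
\]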
We observe outcome \(Y_{it}\) for every unit-period.
Potential Outcomes
For each unit-period \((i, t)\) , define:
\(Y_{it}(0)\) — outcome without treatment.
\(Y_{it}(1)\) — outcome with treatment.
The observed outcome equals the PO under realized treatment:
\[
Y_{it} = D_{it} \cdot Y_{it}(1) + (1 - D_{it}) \cdot Y_{it}(0)
\]
The fundamental problem: at any \((i, t)\), only one of \(\{Y_{it}(0), Y_{it}(1)\}\) is observed. The other is the counterfactual.
Naive 1: Before-After (Treated Only)
Compare the treated group across periods:
\[
E[Y_{i1} \mid G=1] \;-\; E[Y_{i0} \mid G=1]
\]
Substitute \(Y_{i1} = Y_{i1}(1)\) and \(Y_{i0} = Y_{i0}(0)\) for the treated group, then add and subtract \(E[Y_{i1}(0) \mid G=1]\) :
\[
\begin{aligned}
&= E[Y_{i1}(1) \mid G=1] - E[Y_{i0}(0) \mid G=1] \\
&= \underbrace{E[Y_{i1}(1) - Y_{i1}(0) \mid G=1]}_{\text{ATT}} + \underbrace{E[Y_{i1}(0) - Y_{i0}(0) \mid G=1]}_{\text{time trend (treated)}}
\end{aligned}
\]
Naive 2: Cross-Section (Post-Period Only)
Compare treated and control after treatment:
\[
E[Y_{i1} \mid G=1] \;-\; E[Y_{i1} \mid G=0]
\]
Substitute observed = realized PO, then add and subtract \(E[Y_{i1}(0) \mid G=1]\) :
\[
\begin{aligned}
&= E[Y_{i1}(1) \mid G=1] - E[Y_{i1}(0) \mid G=0] \\
&= \underbrace{E[Y_{i1}(1) - Y_{i1}(0) \mid G=1]}_{\text{ATT}} + \underbrace{E[Y_{i1}(0) \mid G=1] - E[Y_{i1}(0) \mid G=0]}_{\text{selection bias}}
\end{aligned}
\]
Combining the Naives
Each naive picks up the ATT plus one bias:
Before-after: treated group’s \(Y(0)\) -trend.
Cross-section: group-level \(Y(0)\) -difference.
The DiD idea: subtract the control’s before-after change from the treated’s. Let \(\bar Y_{g,t}\) denote the sample mean for group \(g\) in period \(t\):
\[
\hat\tau^{\text{DiD}} \;=\; \bigl(\bar Y_{1,1} - \bar Y_{1,0}\bigr) \;-\; \bigl(\bar Y_{0,1} - \bar Y_{0,0}\bigr)
\]
The control’s change estimates the common \(Y(0)\) -trend — wiping out the time-trend bias in Naive 1.
Differencing across groups wipes out the group-level \(Y(0)\) -difference in Naive 2.
Both bias terms vanish — under an assumption we name next .
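A small numeric sketch of the two naive decompositions and the DiD fix. Every number below is made up for illustration (none comes from Card and Krueger), and the \(Y(0)\)-trend is set equal across groups by construction, which is what lets both bias terms cancel.

```python
# Hypothetical population cell means of potential outcomes (illustrative only).
att = 3.0             # E[Y_i1(1) - Y_i1(0) | G=1]
trend = 2.0           # Y(0)-trend, assumed identical in both groups
level_treated = 10.0  # E[Y_i0(0) | G=1]
level_control = 7.0   # E[Y_i0(0) | G=0]

# Observed cell means implied by those potential outcomes.
y_treated_before = level_treated
y_treated_after = level_treated + trend + att   # treated, post: Y(1) is observed
y_control_before = level_control
y_control_after = level_control + trend         # control, post: Y(0) is observed

before_after = y_treated_after - y_treated_before    # ATT + time trend
cross_section = y_treated_after - y_control_after    # ATT + selection bias
did = before_after - (y_control_after - y_control_before)

print(f"before-after : {before_after:.1f}  (ATT {att} + trend {trend})")
print(f"cross-section: {cross_section:.1f}  (ATT {att} + level gap {level_treated - level_control})")
print(f"DiD          : {did:.1f}")
```

Changing `trend` for only one group would leave a residual bias in `did`, which is the case the next slide's assumption rules out.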
Parallel Trends
The assumption that makes DiD work:
\[
E[Y_{i1}(0) - Y_{i0}(0) \mid G=1] \;=\; E[Y_{i1}(0) - Y_{i0}(0) \mid G=0]
\]
In words : absent treatment, both groups would experience the same average change in \(Y\) .
A counterfactual statement :
Left side is unobservable — the treated group’s no-treatment trend.
Right side is observable — controls are never treated.
PT equates an unobservable to an observable.
Parallel Trends — In a Picture
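The dashed red line is the missing counterfactual: the treated group's path had it not been treated. PT says it has the same slope as the control line. The vertical gap at \(t=1\) between the treated group's observed outcome and the dashed line is the ATT.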
What Parallel Trends Does NOT Require
Levels can differ. The treated group can start higher or lower than the control.
Treatment effects can be heterogeneous. \(Y_{i1}(1) - Y_{i1}(0)\) can vary across units.
Only the average \(Y(0)\) -trends must match. It’s a statement about one potential outcome.
Compared to lect2a’s independence :
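\[
\begin{array}{l|ll}
 & \text{Independence (lect2a)} & \text{Parallel trends} \\ \hline
\text{Statement} & \{Y(0), Y(1)\} \perp D & E[Y_{i1}(0) - Y_{i0}(0) \mid G] \text{ same across } G \\
\text{Levels differ across groups?} & \text{ruled out} & \text{allowed} \\
\tau_i \text{ correlated with treatment status?} & \text{ruled out} & \text{allowed} \\
\text{Defended by} & \text{randomization} & \text{pre-trends test + theory}
\end{array}
\]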
Maintained Assumptions
Beyond PT, DiD inherits one fundamental condition from the PO framework:
SUTVA (lect1) :
No interference — one unit’s treatment doesn’t affect another’s outcome.
Single version of treatment.
Like any causal inference, DiD breaks if SUTVA fails.
What DiD Identifies: ATT
The target: the average treatment effect on the treated in the post-period:
\[
\text{ATT} \;=\; E[Y_{i1}(1) - Y_{i1}(0) \mid G_i = 1]
\]
\(E[Y_{i1}(1) \mid G_i = 1]\) — observed (treated, post).
\(E[Y_{i1}(0) \mid G_i = 1]\) — counterfactual , reconstructed under PT.
Why ATT, not ATE? PT restricts \(Y(0)\) only. The control’s change estimates the treated group’s \(Y(0)\) -counterfactual — but tells us nothing about how controls would respond to treatment.
Identification: DiD \(\to\) ATT
Apply LLN to the four cell means. The probability limit:
\[
\begin{aligned}
\hat\tau^{\text{DiD}} \;\xrightarrow{p}\; &\bigl(E[Y_{i1}(1) \mid G=1] - E[Y_{i0}(0) \mid G=1]\bigr) \\
&- \bigl(E[Y_{i1}(0) \mid G=0] - E[Y_{i0}(0) \mid G=0]\bigr)
\end{aligned}
\]
Add and subtract \(E[Y_{i1}(0) \mid G=1]\) :
\[
\begin{aligned}
=\;& \underbrace{E[Y_{i1}(1) - Y_{i1}(0) \mid G=1]}_{\text{ATT}} \\
&+ \underbrace{E[Y_{i1}(0) - Y_{i0}(0) \mid G=1] - E[Y_{i1}(0) - Y_{i0}(0) \mid G=0]}_{=\,0\text{ under PT}}
\end{aligned}
\]
Under parallel trends: \(\hat\tau^{\text{DiD}} \xrightarrow{p} \text{ATT}\) . ✓
Card-Krueger: The 2×2 in Numbers
Average employment per restaurant:
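\[
\begin{array}{l|ccc}
 & \text{NJ} & \text{PA} & \text{NJ} - \text{PA} \\ \hline
\text{Before} & 20.44 & 23.33 & -2.89 \\
\text{After} & 21.03 & 21.17 & -0.14 \\ \hline
\text{Change} & +0.59 & -2.16 & +2.75
\end{array}
\]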
\[
\hat\tau^{\text{DiD}} \;=\; (+0.59) \;-\; (-2.16) \;=\; +2.75
\]
Employment rose in NJ relative to PA after the wage hike.
The design uses only two waves — a strict 2×2 DiD.
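The arithmetic can be checked in a few lines. This sketch uses only the four cell means reported on this slide, with NJ as the treated group and PA as the control; the variable names are illustrative.

```python
# Cell means of employment per restaurant, as reported on this slide.
nj_before, nj_after = 20.44, 21.03
pa_before, pa_after = 23.33, 21.17

nj_change = nj_after - nj_before      # treated group's before-after change
pa_change = pa_after - pa_before      # control group's before-after change
did = nj_change - pa_change           # difference of the two changes

# The same number, computed as the change in the cross-state gap.
gap_before = nj_before - pa_before
gap_after = nj_after - pa_after
did_alt = gap_after - gap_before

print(f"NJ change: {nj_change:+.2f}, PA change: {pa_change:+.2f}")
print(f"DiD (difference of changes): {did:+.2f}")
print(f"DiD (change in the gap)    : {did_alt:+.2f}")
```

Both routes give the same estimate because the double difference can be taken in either order.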
A Surprising Sign
The competitive labor-market intuition predicts \(\hat\tau < 0\) :
Bind the price (wage) above market-clearing → quantity (employment) falls.
CK find \(\hat\tau > 0\) . Two interpretations:
(1) DiD failed: PT was violated, \(\hat\tau\) is biased upward, and the “true” effect is negative or zero.
(2) DiD worked: the competitive model is wrong about this labor market.
Subsequent literature has converged on (2). Why? — next.
Why Did Employment Rise? Rejecting Perfect Competition
Wage-taking firms pay competitive wage \(w^*\) .
\(w_{\min} > w^*\) ⇒ employment falls to \(L_{\min}\) .
CK find the opposite — perfect competition is rejected by the data.
Monopsony Reconciles the Finding
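Monopsony: the firm faces an upward-sloping labor supply curve, so the marginal cost of labor lies above the supply curve ⇒ it hires at \((L_m, w_m)\), with both employment and the wage below their competitive levels.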
Min wage between \(w_m\) and \(w^*\) flattens MC ⇒ employment rises toward competitive.
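A minimal numeric sketch of the monopsony argument. It assumes a linear inverse labor supply \(w(L) = a + bL\) and a constant marginal revenue product of labor; the functional forms, the parameter values, and the function names are all illustrative assumptions, not taken from the slides.

```python
# Assumed primitives (illustrative): inverse labor supply w(L) = a + b*L and a
# constant marginal revenue product of labor, mrp.
a, b, mrp = 4.0, 0.25, 9.0

def supply_wage(L):
    """Wage needed to attract L workers."""
    return a + b * L

# Competitive benchmark: wage equals MRP, employment read off the supply curve.
L_comp = (mrp - a) / b
w_comp = mrp

# Monopsony: marginal cost of labor is a + 2*b*L; set it equal to MRP.
L_mono = (mrp - a) / (2 * b)
w_mono = supply_wage(L_mono)

def employment_under_min_wage(w_min):
    """Employment when a wage floor w_min is imposed on the monopsonist."""
    if w_min <= w_mono:      # floor does not bind
        return L_mono
    if w_min <= w_comp:      # MC is flat at w_min up to the supply curve
        return (w_min - a) / b
    return 0.0               # floor above the (constant) MRP: hiring loses money

print(f"monopsony  : L = {L_mono}, w = {w_mono}")
print(f"competitive: L = {L_comp}, w = {w_comp}")
print(f"floor at 8 : L = {employment_under_min_wage(8.0)}")
```

With a floor between `w_mono` and `w_comp`, employment moves from the monopsony level toward the competitive one, which is the sign CK report.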
Where Does Wage-Setting Power Come From?
Workers can’t costlessly switch employers:
Search costs.
Local commutes — labor markets are small and spatially segmented.
Heterogeneous job attributes (schedules, locations, benefits).
Where the literature stands :
Manning (2003): widespread evidence of monopsony in low-wage labor markets.
Azar–Marinescu–Steinbaum (2022): local labor markets often have few competing employers — direct evidence of concentration.
Post-CK minimum-wage research: moderate hikes show small/zero employment effects — consistent with frictional monopsony.
The CK lesson : a credible causal estimate can overturn a textbook prediction.
From Cell Means to Regression
We’ve built DiD as a four-cell calculation. A regression form lets us:
Compute the same number in standard software.
Get standard errors and confidence intervals.
Add covariates for precision or to relax PT.
Generalize naturally to many units and many periods.
The translation: replace the 2×2 cell structure with dummies and an interaction .
The 2×2 DiD Regression
Map the four cell means to a regression with dummies and an interaction:
\[
Y_{it} \;=\; \alpha \;+\; \beta\, G_i \;+\; \gamma\, T_t \;+\; \tau\,(G_i \times T_t) \;+\; U_{it}
\]
Reading off the coefficients:
\(\alpha\) : baseline — control group, pre-period.
\(\beta\) : group dummy — level shift for treated.
\(\gamma\): time dummy, the common shift in the after period.
\(\tau\) : interaction — extra shift for the (treated, after) cell. The DiD estimator.
\(\tau\) identifies the ATT under PT — proved on the Identification slide earlier.
Interpretation: One Conditional Mean Per Cell
\[
\begin{array}{l|cc}
 & T = 0 \text{ (before)} & T = 1 \text{ (after)} \\ \hline
G = 0 & \alpha & \alpha + \gamma \\
G = 1 & \alpha + \beta & \alpha + \beta + \gamma + \tau
\end{array}
\]
Read off:
Control change: \((\alpha + \gamma) - \alpha = \gamma\) .
Treated change: \((\alpha + \beta + \gamma + \tau) - (\alpha + \beta) = \gamma + \tau\) .
Difference of changes: \(\tau\) .
OLS Reproduces the Four-Cell Difference
With only group and time dummies plus their interaction, the OLS estimator is:
\[
\hat\tau^{\text{OLS}} \;=\; (\bar Y_{1,1} - \bar Y_{1,0}) \;-\; (\bar Y_{0,1} - \bar Y_{0,0})
\]
\(\hat\tau^{\text{OLS}}\) equals the hand calculation by construction.
What we gain : SEs, CIs, and an extension framework — without changing the estimate.
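A quick check that the interaction coefficient reproduces the four-cell calculation. The data are simulated (the coefficients 20, 3, -2, 2.5 and the noise level are arbitrary choices), and the regression is fit with NumPy's least-squares routine rather than any particular econometrics package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 2x2 data (hypothetical): 200 units, each observed before and after.
n_units = 200
G = np.repeat(rng.integers(0, 2, size=n_units), 2)   # group, constant within unit
T = np.tile([0, 1], n_units)                         # 0 = before, 1 = after
Y = 20 + 3 * G - 2 * T + 2.5 * G * T + rng.normal(0, 1, size=2 * n_units)

# OLS of Y on a constant, G, T and the interaction G*T.
X = np.column_stack([np.ones_like(Y), G, T, G * T])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
tau_ols = coef[3]

# Hand calculation from the four cell means.
def cell_mean(g, t):
    return Y[(G == g) & (T == t)].mean()

tau_cells = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print(f"OLS interaction: {tau_ols:.6f}")
print(f"cell-mean DiD  : {tau_cells:.6f}")
```

The two numbers agree up to floating-point error because the regression is saturated: four coefficients fit four cell means exactly.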
Adding Controls — Two Motivations
DiD can be augmented with pre-treatment covariates \(W_i\) . Two distinct motivations, with different specifications .
Motivation 1 — Precision.
\[
Y_{it} \;=\; \alpha + \beta G_i + \gamma T_t + \tau(G_i \times T_t) + \boldsymbol\theta' W_i + U_{it}
\]
If \(W_i\) predicts \(Y\) , including it as a level reduces residual variance ⇒ tighter SE on \(\hat\tau\) . Identification unchanged.
Motivation 2 — Conditional PT.
PT may fail unconditionally but hold within strata of \(W_i\) . Needs a different spec — next slide.
Conditional PT in 2×2
If \(W_i\) predicts the trend and is distributed differently across groups, unconditional PT can fail:
\[
E[Y_{i1}(0) - Y_{i0}(0) \mid G=1] \neq E[Y_{i1}(0) - Y_{i0}(0) \mid G=0]
\]
PT may still hold given \(W_i\) :
\[
E[Y_{i1}(0) - Y_{i0}(0) \mid G=1, W_i] \;=\; E[Y_{i1}(0) - Y_{i0}(0) \mid G=0, W_i]
\]
Implementing Conditional PT
Under conditional PT:
The \(Y(0)\) -trend can depend on \(W\) (but not on \(G\) , given \(W\) ).
Equivalently: \(W\) ’s effect on \(Y\) may shift between pre and post.
\(\Rightarrow\) include both \(W_i\) (level) and \(W_i \times T_t\) (interaction) in the regression.
\[
Y_{it} = \alpha + \beta G_i + \gamma T_t + \tau(G_i \times T_t) + \boldsymbol\theta' W_i + \boldsymbol\rho'(W_i \times T_t) + U_{it}
\]
\(\boldsymbol\theta' W_i\) : \(W\) ’s level effect.
\(\boldsymbol\rho'(W_i \times T_t)\) : change in \(W\) ’s effect from pre to post.
\(W_i\) must be pre-treatment (lect4c, bad controls ).
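A noise-free sketch of why the \(W_i \times T_t\) term matters. The design is hypothetical: a binary \(W_i\) that is more common among treated units and whose effect on \(Y\) shifts by `rho` in the post-period. All coefficient values are arbitrary, and the helper `ols` is defined here for illustration.

```python
import numpy as np

# Hypothetical noise-free design: W is binary and more common among the treated.
G = np.repeat([1, 1, 1, 1, 0, 0, 0, 0], 2)     # 4 treated, 4 control units
W = np.repeat([1, 1, 1, 0, 1, 0, 0, 0], 2)     # P(W=1|G=1)=0.75, P(W=1|G=0)=0.25
T = np.tile([0, 1], 8)

tau, rho = 2.0, 1.0        # true ATT; change in W's effect from pre to post
Y = 5 + 1.0 * G + 0.5 * T + tau * G * T + 3.0 * W + rho * W * T

def ols(columns):
    """OLS coefficients of Y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(Y))] + columns)
    coef, *_ = np.linalg.lstsq(X, Y.astype(float), rcond=None)
    return coef

tau_plain = ols([G, T, G * T])[3]              # no covariates
tau_level = ols([G, T, G * T, W])[3]           # W as a level only
tau_cond = ols([G, T, G * T, W, W * T])[3]     # W and W x T

print(f"no covariates : {tau_plain:.3f}")
print(f"W level only  : {tau_level:.3f}")
print(f"W and W x T   : {tau_cond:.3f}")
```

In this design the first two specifications return 2.5 (the true 2.0 plus `rho` times the 0.5 gap in the share with \(W = 1\)), and only the specification with \(W_i \times T_t\) recovers 2.0. Adding a time-invariant \(W_i\) as a level cannot fix a trend difference.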
Unit Fixed Effects for Precision
CK has 410 restaurants. Replace the group dummy \(G_i\) by unit dummies \(\alpha_i\) — one per restaurant — absorbing each restaurant’s baseline:
\[
Y_{it} = \alpha_i + \gamma T_t + \tau(G_i \times T_t) + U_{it}
\]
\(\alpha_i\) — one intercept per restaurant. Absorbs state membership \(G_i\) and any time-invariant restaurant feature (location, owner, layout).
The interaction \(G_i \times T_t\) is unchanged, so in a balanced panel the point estimate \(\hat\tau\) is the same.
Residual variance shrinks \(\Rightarrow\) tighter SEs.
Identification is still group-level — PT compares treated and control \(Y(0)\) -trends. Unit FE is a regression-spec choice for precision, not a shift to unit-level identification.
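With two periods, removing \(\alpha_i\) amounts to first-differencing each unit, so the unit-FE estimate is the gap in mean changes between groups. The sketch below checks that this matches the pooled four-cell estimate in a balanced panel. The data are simulated and every parameter value is an arbitrary assumption.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical balanced panel: each unit has its own fixed baseline alpha_i.
units = []
for i in range(100):
    g = 1 if i < 40 else 0
    alpha_i = random.gauss(20 + 3 * g, 5)            # unit-specific baseline
    y_before = alpha_i + random.gauss(0, 1)
    y_after = alpha_i - 2 + 2.5 * g + random.gauss(0, 1)
    units.append((g, y_before, y_after))

# Pooled 2x2 estimate from the four cell means.
def cell(g, idx):
    return mean(u[idx] for u in units if u[0] == g)

tau_cells = (cell(1, 2) - cell(1, 1)) - (cell(0, 2) - cell(0, 1))

# Unit-FE estimate: first-difference each unit, then compare mean changes.
changes = [(g, after - before) for g, before, after in units]
tau_fe = mean(d for g, d in changes if g == 1) - mean(d for g, d in changes if g == 0)

print(f"four-cell DiD : {tau_cells:.6f}")
print(f"unit-FE (FD)  : {tau_fe:.6f}")
```

The point estimates coincide; what unit fixed effects change is the residual variance, since the large dispersion in `alpha_i` no longer sits in the error term.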
Many Periods
Now extend the design : suppose we observe restaurants in multiple months before and after the policy, not just one pre and one post. Replace the single time dummy \(T_t\) by a period dummy for each calendar period :
\[
Y_{it} = \alpha_i + \lambda_t + \tau\, D_{it} + U_{it}
\]
\(\lambda_t\): one intercept per period (drop one to avoid the dummy-variable trap).
\(D_{it} = G_i \cdot \mathbf{1}[t \geq t^*]\) — treatment indicator (1 only when \(i\) is treated and \(t\) is post).
Multiple pre-periods make PT testable — we can check whether the \(Y(0)\) -trend was parallel before treatment. Coming up.
Can We Test Parallel Trends?
PT is a statement about the post-period counterfactual \(Y_{i1}(0)\) for the treated group — never observed.
We cannot directly test PT in the treatment period.
But with multiple pre-treatment periods , we can ask: did the groups move together before treatment?
If yes, more plausible that they would have continued to move together absent treatment.
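Caveat: parallel pre-trends are supporting evidence, not proof. PT in the post-period concerns a different, unobserved change, so flat pre-trends neither guarantee it nor are strictly implied by it. A coincidental shock at \(t^*\) could break post-period PT even if pre-trends are clean.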
The Idea: Estimate a Treatment “Effect” Each Period
Replace the single treatment indicator \(D_{it}\) by one indicator per period , with \(k = -1\) omitted as the reference:
\[
Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k \cdot \mathbf{1}[t - t^* = k] \cdot G_i + \varepsilon_{it}
\]
Notation : \(k = t - t^*\) is event time — periods relative to the treatment date.
\(k = 0\) : first treated period.
\(k = -1\) : last pre-treatment period (the reference — see below).
\(k < -1\) : earlier pre-periods (leads ).
\(k \geq 0\) : post-periods (lags ).
Reading Off \(\beta_k\)
For a control unit (\(G_i = 0\) ): all interaction terms vanish. \[Y_{it} = \alpha_i + \lambda_t + \varepsilon_{it}\]
For a treated unit at event time \(k \neq -1\) : \[Y_{it} = \alpha_i + \lambda_t + \beta_k + \varepsilon_{it}\]
So \(\beta_k\) is the treated-vs.-control gap at event time \(k\) , relative to the gap at \(k = -1\) .
\(\beta_k\) for \(k < -1\) : pre-period leads. Should be ≈ 0 if PT holds.
\(\beta_k\) for \(k \geq 0\) : post-period lags. Dynamic ATT at horizon \(k\) .
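The "gap relative to the gap at \(k = -1\)" reading can be made concrete with group-by-period means. In a balanced panel with a single \(t^*\), the OLS estimates should coincide with these differences of gaps. The means and effect sizes below are hypothetical.

```python
# Hypothetical group-by-period means; t_star = 3 is the first treated period.
t_star = 3
periods = [0, 1, 2, 3, 4, 5]
control_mean = {0: 10.0, 1: 10.5, 2: 11.0, 3: 11.5, 4: 12.0, 5: 12.5}

# Treated group: 2 units higher in levels, same trend, plus effects after t_star.
effect = {3: 1.0, 4: 2.0, 5: 3.0}
treated_mean = {t: control_mean[t] + 2.0 + effect.get(t, 0.0) for t in periods}

def gap(t):
    """Treated-minus-control gap in calendar period t."""
    return treated_mean[t] - control_mean[t]

reference_gap = gap(t_star - 1)          # event time k = -1

# beta_k = gap at event time k minus the gap at k = -1 (k = -1 itself is omitted).
beta = {t - t_star: gap(t) - reference_gap for t in periods if t - t_star != -1}

for k in sorted(beta):
    label = "lead" if k < 0 else "lag"
    print(f"k = {k:+d} ({label}): beta_k = {beta[k]:.2f}")
```

The level difference of 2 drops out of every \(\beta_k\); the leads are zero because the simulated groups share a trend, and the lags trace out the assumed dynamic effects.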
Pre-Trend Test
Joint hypothesis on the leads:
\[
H_0: \beta_{-K} = \beta_{-K+1} = \cdots = \beta_{-2} = 0
\]
Tested with a single \(F\)-statistic (or an equivalent Wald test). Two possible outcomes:
Rejected — pre-trends are non-flat. PT is suspect.
Not rejected — consistent with PT, but does not prove PT in the post-period.
Always pair the test with a plot — a non-rejection can hide trended-but-noisy leads.
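One common way to form the joint statistic, written out as a sketch. Here \(\hat{\boldsymbol\beta}_{\text{pre}}\) stacks the \(q = K - 1\) estimated leads \(\hat\beta_{-K}, \dots, \hat\beta_{-2}\), and \(\hat V_{\text{pre}}\) is the matching block of the estimated covariance matrix; both symbols are notation introduced here, not from the slides.

\[
W \;=\; \hat{\boldsymbol\beta}_{\text{pre}}'\, \hat V_{\text{pre}}^{-1}\, \hat{\boldsymbol\beta}_{\text{pre}} \;\overset{a}{\sim}\; \chi^2_{q} \quad \text{under } H_0, \qquad F \;=\; W / q .
\]

Because \(W\) scales with \(\hat V_{\text{pre}}^{-1}\), noisy leads inflate \(\hat V_{\text{pre}}\) and shrink the statistic, which is how a visible drift can still fail to reject.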
The Event-Study Plot
(a) flat leads with CIs covering 0 — supportive of PT.
(b) leads drift linearly toward 0; individual CIs all cover 0 — joint F-test likely doesn’t reject, yet the pattern is concerning. PT suspect.
Standard Errors
Treatment varies at the group level (state in CK). Group-level shocks correlate outcomes within group:
Serially : over time within unit.
Cross-sectionally : across units within group at each \(t\) .
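Cluster at the group: the level at which treatment is assigned, and the coarsest level within which errors may be correlated (observations are treated as independent across clusters).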
lect6a TWFE: clustered at unit — serial correlation within unit.
6b DiD: cluster at group — serial + cross-sectional within group.
Few clusters (e.g., 2 states): wild bootstrap, or aggregate to group-period means.
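For reference, the usual cluster-robust "sandwich" variance has the form below (shown without small-sample corrections). \(X_c\) and \(\hat u_c\) stack the regressors and OLS residuals for cluster \(c\), and \(C\) is the number of clusters; this notation is introduced here for illustration.

\[
\hat V_{\text{cluster}} \;=\; (X'X)^{-1} \Bigl( \sum_{c=1}^{C} X_c'\, \hat u_c\, \hat u_c'\, X_c \Bigr) (X'X)^{-1}
\]

The middle term is a sum of only \(C\) cluster-level contributions, which is one way to see why the estimate becomes unreliable when \(C\) is small.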
Exercise: Dropping the Reference Period
In the event-study regression, \(k = -1\) is omitted from the sum:
\[
Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k \cdot \mathbf{1}[t - t^* = k] \cdot G_i + \varepsilon_{it}
\]
Q1. If we include every event-time dummy (no reference), what goes wrong with OLS?
Q2. Which \(k\) should we drop by convention, and what do the remaining \(\beta_k\) then measure?
Q1: perfect collinearity. \(\sum_k \mathbf{1}[t-t^*=k] \cdot G_i = G_i\) , which is time-invariant within unit and thus in span(\(\alpha_i\) ).
Q2: convention is \(k=-1\) , the last pre-treatment period. Remaining \(\beta_k\) measure the treated-vs-control gap at event time \(k\) , relative to the gap at \(k=-1\) (the closest untreated baseline).
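The collinearity in Q1 can be verified numerically by comparing the column count of the design matrix with its rank. The panel below is a hypothetical toy (4 units, 5 periods), and `event_dummies` is a helper defined here for illustration.

```python
import numpy as np

# Hypothetical balanced panel: 4 units (first 2 treated), 5 periods, t_star = 3.
n_units, n_periods, t_star = 4, 5, 3
unit = np.repeat(np.arange(n_units), n_periods)
t = np.tile(np.arange(n_periods), n_units)
G = (unit < 2).astype(float)

unit_fe = (unit[:, None] == np.arange(n_units)).astype(float)       # alpha_i
time_fe = (t[:, None] == np.arange(1, n_periods)).astype(float)     # lambda_t, one dropped

def event_dummies(event_times):
    """Columns 1[t - t_star = k] * G_i for each k in event_times."""
    return np.column_stack([((t - t_star) == k) * G for k in event_times])

all_k = list(range(-t_star, n_periods - t_star))                    # -3, ..., 1
X_all = np.column_stack([unit_fe, time_fe, event_dummies(all_k)])
X_ref = np.column_stack([unit_fe, time_fe, event_dummies([k for k in all_k if k != -1])])

rank_all = np.linalg.matrix_rank(X_all)
rank_ref = np.linalg.matrix_rank(X_ref)
print(f"all k included: {X_all.shape[1]} columns, rank {rank_all}")
print(f"k = -1 dropped: {X_ref.shape[1]} columns, rank {rank_ref}")
```

With every event-time dummy included the matrix is one short of full column rank, because the dummies sum to \(G_i\); dropping \(k = -1\) restores full rank.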
Anticipation
Recall: \(\beta_k\) is measured relative to \(k = -1\) . What if \(k = -1\) isn’t a clean baseline — because treated units already shifted behavior in anticipation of \(t^*\) ?
Slemrod (1995) — Tax Reform Act of 1986:
The top tax rate on long-term capital gains rose (from 20% to 28%), effective Jan 1987, even as the top rate on ordinary income fell.
Anticipating, high-income taxpayers pulled capital-gains realizations into 1986, locking in the lower rate.
Pre-period outcomes reflected anticipation, not a clean no-treatment baseline.
Diagnostic : re-estimate with \(k = -2\) as reference. A non-zero \(\hat\beta_{-1}\) then flags anticipation.
Remedy : redate the treatment to the announcement date, not implementation.
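Changing the reference period only re-normalizes the coefficients, which can be seen in one line. Write \(\beta_k^{(r)}\) for the coefficient at event time \(k\) when \(k = r\) is the omitted period (the superscript is notation introduced here). Since \(\beta_{-1}^{(-1)} = 0\) by normalization,

\[
\beta_k^{(-2)} \;=\; \beta_k^{(-1)} - \beta_{-2}^{(-1)}, \qquad \text{so} \qquad \beta_{-1}^{(-2)} \;=\; -\,\beta_{-2}^{(-1)} .
\]

A non-zero \(\hat\beta_{-1}\) under the \(k = -2\) reference is therefore the same evidence as a non-zero \(\hat\beta_{-2}\) under the original reference, read from a baseline one period further from \(t^*\).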
Staggered Rollout: A Modern Setting
Most modern DiD applications don’t have a single \(t^*\) . Treatment rolls out across units at different times :
US states adopt a policy in different years.
Firms get access to a new technology over a multi-year phase-in.
Hospitals join a payment program on different dates.
The treatment indicator generalizes:
\[
D_{it} = \mathbf{1}[t \geq t_i^*]
\]
— each unit has its own switch date \(t_i^*\) . Untreated units have \(t_i^* = \infty\) .
Natural approach: run TWFE with this \(D_{it}\), that is, \(Y_{it} = \alpha_i + \lambda_t + \tau D_{it} + U_{it}\). But…
Why Naïve TWFE Breaks Under Staggered Rollout
Three units, four periods. Treatment turns on at different times:
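\[
\begin{array}{l|cccc}
D_{it} & t = 1 & t = 2 & t = 3 & t = 4 \\ \hline
\text{A (never treated)} & 0 & 0 & 0 & 0 \\
\text{B (treated at } t = 2) & 0 & 1 & 1 & 1 \\
\text{C (treated at } t = 3) & 0 & 0 & 1 & 1
\end{array}
\]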
Goodman-Bacon (2021) : TWFE’s \(\hat\tau\) is implicitly a weighted average of all 2×2 DiDs in the panel. With three units, the implicit comparisons include:
A as control for B’s switch ✓ (clean)
A as control for C’s switch ✓ (clean)
C vs B around C’s switch ✗ — uses B (already treated) as a control
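A noise-free sketch of what TWFE does on this three-unit panel. The treatment pattern is the one on this slide; the effect sizes, fixed effects, and the helper `double_demean` are hypothetical. The slope is computed by Frisch-Waugh-Lovell: in a balanced panel, partialling out unit and period dummies amounts to double-demeaning \(D_{it}\).

```python
# The three-unit, four-period panel (rows: A, B, C; columns: t = 1..4).
D = {"A": [0, 0, 0, 0], "B": [0, 1, 1, 1], "C": [0, 0, 1, 1]}

# Hypothetical effects: B's effect grows with exposure (1, 2, 3); C's stays at 1.
effect = {"A": [0, 0, 0, 0], "B": [0, 1, 2, 3], "C": [0, 0, 1, 1]}
alpha = {"A": 5.0, "B": 7.0, "C": 6.0}     # unit fixed effects
lam = [0.0, 0.5, 1.0, 1.5]                 # period fixed effects (common trend)

units, n_t = list(D), 4
Y = {i: [alpha[i] + lam[t] + effect[i][t] for t in range(n_t)] for i in units}

def double_demean(M):
    """Subtract unit and period means, add back the grand mean (balanced panel)."""
    grand = sum(sum(M[i]) for i in units) / (len(units) * n_t)
    row = {i: sum(M[i]) / n_t for i in units}
    col = [sum(M[i][t] for i in units) / len(units) for t in range(n_t)]
    return {i: [M[i][t] - row[i] - col[t] + grand for t in range(n_t)] for i in units}

# TWFE slope = regression of Y on the double-demeaned treatment indicator.
D_tilde = double_demean(D)
num = sum(D_tilde[i][t] * Y[i][t] for i in units for t in range(n_t))
den = sum(D_tilde[i][t] ** 2 for i in units for t in range(n_t))
tau_twfe = num / den

treated_cells = [(i, t) for i in units for t in range(n_t) if D[i][t] == 1]
true_att = sum(effect[i][t] for i, t in treated_cells) / len(treated_cells)

# Implicit weight TWFE places on each treated cell's effect (keys use t = 1..4).
weights = {(i, t + 1): D_tilde[i][t] / den for i, t in treated_cells}

print(f"TWFE coefficient        : {tau_twfe:.3f}")
print(f"average effect (treated): {true_att:.3f}")
for cell, w in weights.items():
    print(f"weight on {cell}: {w:.3f}")
```

Under these made-up effects the TWFE coefficient is 1.0 while the average effect over treated cells is 1.6: B's effects at \(t = 3\) and \(t = 4\) receive zero weight, consistent with B serving as a comparison unit around C's switch. In larger panels some of these implicit weights can turn negative.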
Modern Estimators for Staggered DiD
Common principle : never use already-treated units as controls. Build treatment effects from clean comparisons, then aggregate.
Callaway-Sant’Anna (2021) : let \(\text{ATT}(g, t)\) = treatment effect on units treated at time \(g\) , measured at time \(t\) . Estimate each one as a clean 2×2 DiD using never-treated A as control.
\[
\begin{array}{l|ccc}
 & t = 2 & t = 3 & t = 4 \\ \hline
g = 2 \text{ (B)} & \text{ATT}(2,2) & \text{ATT}(2,3) & \text{ATT}(2,4) \\
g = 3 \text{ (C)} & \text{(not yet treated)} & \text{ATT}(3,3) & \text{ATT}(3,4)
\end{array}
\]
Aggregate the cells (non-negative weights) → overall ATT, or average across cohorts at each event-time \(k = t - g\) .
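A simplified sketch of the cell-by-cell logic, not the authors' estimator or any package's implementation. It uses a hypothetical noise-free version of the three-unit panel (A never treated, B first treated at \(t = 2\), C at \(t = 3\)), takes \(g - 1\) as the base period, and aggregates with equal weights purely for simplicity.

```python
# Hypothetical noise-free outcomes (periods t = 1..4).
alpha = {"A": 5.0, "B": 7.0, "C": 6.0}            # unit fixed effects
lam = {1: 0.0, 2: 0.5, 3: 1.0, 4: 1.5}            # common period effects
cohort = {"B": 2, "C": 3}                         # first treated period g
effect = {("B", 2): 1.0, ("B", 3): 2.0, ("B", 4): 3.0, ("C", 3): 1.0, ("C", 4): 1.0}

Y = {(i, t): alpha[i] + lam[t] + effect.get((i, t), 0.0) for i in alpha for t in lam}

def att(unit, g, t, control="A"):
    """Clean 2x2 DiD: period t vs. base period g-1, cohort vs. never-treated."""
    base = g - 1
    return (Y[unit, t] - Y[unit, base]) - (Y[control, t] - Y[control, base])

att_gt = {(g, t): att(unit, g, t) for unit, g in cohort.items() for t in lam if t >= g}

# Two aggregations with non-negative (here: equal) weights.
overall = sum(att_gt.values()) / len(att_gt)
by_event_time = {}
for (g, t), value in att_gt.items():
    by_event_time.setdefault(t - g, []).append(value)
event_study = {k: sum(v) / len(v) for k, v in sorted(by_event_time.items())}

for (g, t), value in att_gt.items():
    print(f"ATT({g},{t}) = {value:.2f}")
print(f"overall ATT = {overall:.2f}")
print(f"by event time: {event_study}")
```

Each cell recovers the effect that was built into the simulation, and the overall average is 1.6, because no already-treated unit ever enters as a control.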
Variants on the same principle: Sun-Abraham (event study), de Chaisemartin-D’Haultfœuille (dCdH; newly-treated vs not-yet-treated), Borusyak-Jaravel-Spiess (BJS; impute counterfactual).