Xiamen University, Chow Institute
May, 2026
Two cross-sectional levers from earlier: conditioning on observed confounders, and instrumental variables.
Both fail when the confounder is unobserved and no valid instrument exists.
Canonical example — returns to schooling: ability is unobserved and correlated with schooling. OLS is biased; no clean instrument is universally available. Need a different route.
When the same unit is observed at multiple periods, we can compare the unit to itself:
Payoff: no instrument needed; no requirement to know what the confounder is.
Requires: assumptions on how \(X_{it}\) and \(\varepsilon_{it}\) co-move within unit over time — shapes the rest of the lecture.
For each unit-period \((i, t)\), the potential outcome \(Y_{it}(x)\) is the value \(Y\) would take if the regressor were set to \(x\):
\[ Y_{it}(x) \;=\; \text{unit } i\text{'s outcome at time } t \text{ when } X_{it} = x \]
Notation:
Cross-section: at a single \(t\), we see only one PO per unit. Every other value of \(x\) is counterfactual — the fundamental problem of causal inference.
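Panel: across periods, the same unit is observed at different realized \(X\) values: \(Y_{it}(X_{it})\) and \(Y_{it'}(X_{it'})\). In a sense, we observe two potential outcomes for unit \(i\), just at different times; comparing them is informative about the effect of \(x\) only once we restrict how \(Y_{it}(\cdot)\) varies with \(t\).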
Assume the PO function is linear and additively separable:
\[ Y_{it}(x) \;=\; \alpha_i \;+\; \lambda_t \;+\; x'\beta \;+\; \varepsilon_{it} \]
Setting \(x = X_{it}\) in the linear PO function:
\[ Y_{it} \;=\; \alpha_i + \lambda_t + X_{it}'\beta + \varepsilon_{it} \]
\(X_{it}\) is allowed to correlate with both \(\alpha_i\) and \(\lambda_t\).
In a cross-section, \(\alpha_i\) would sit inside the error, requiring \(\mathrm{Cov}(X_i, \alpha_i) = 0\).
The parameter we want is \(\beta\) — the causal slope. \(\alpha_i\) and \(\lambda_t\) are unobserved nuisance terms in the regression.
Standard approach — Two-Way Fixed Effects (TWFE), via the within transformation:
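Take averages of both sides of \(Y_{it} = \alpha_i + \lambda_t + X_{it}'\beta + \varepsilon_{it}\) over periods (unit mean), over units (period mean), and over both (grand mean), in a balanced panel:
\[ \bar Y_i = \alpha_i + \bar\lambda + \bar X_i'\beta + \bar\varepsilon_i, \qquad \bar Y_t = \bar\alpha + \lambda_t + \bar X_t'\beta + \bar\varepsilon_t, \qquad \bar Y = \bar\alpha + \bar\lambda + \bar X'\beta + \bar\varepsilon \]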
Combine \(\tilde Y_{it} \equiv Y_{it} - \bar Y_i - \bar Y_t + \bar Y\) — \(\alpha\) and \(\lambda\) terms cancel exactly:
\[ \tilde Y_{it} \;=\; \tilde X_{it}'\beta + \tilde\varepsilon_{it} \]
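The cancellation can be checked numerically. The following is a minimal sketch, assuming a balanced panel stored as an \(N \times T\) array; the helper name `double_demean` and the simulated draws are illustrative and not part of the lecture.

```python
import numpy as np

def double_demean(a):
    """Two-way within transformation of a balanced (N, T) array:
    a_it - unit mean - period mean + grand mean."""
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True)
              + a.mean())

rng = np.random.default_rng(0)
N, T = 5, 4
alpha = rng.normal(size=(N, 1))   # unit effects (hypothetical draws)
lam = rng.normal(size=(1, T))     # period effects (hypothetical draws)

fe_part = alpha + lam             # alpha_i + lambda_t, broadcast to (N, T)
max_abs = np.abs(double_demean(fe_part)).max()
print(max_abs)                    # zero up to floating-point error
```

The check relies on the panel being balanced; with missing cells the simple subtraction of means no longer annihilates both sets of effects.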
Apply OLS to the double-demeaned model:
\[ \hat\beta_{\text{TWFE}} \;=\; \left(\sum_{i,t} \tilde X_{it}\tilde X_{it}'\right)^{-1} \sum_{i,t} \tilde X_{it}\tilde Y_{it} \]
Substitute \(\tilde Y_{it} = \tilde X_{it}'\beta + \tilde\varepsilon_{it}\):
\[ \hat\beta_{\text{TWFE}} - \beta \;=\; \left(\sum_{i,t} \tilde X_{it}\tilde X_{it}'\right)^{-1} \sum_{i,t} \tilde X_{it}\tilde\varepsilon_{it} \]
Whether \(\hat\beta_{\text{TWFE}} \to_p \beta\) depends on the behavior of these two sums. The substantive condition is the identifying assumption — strict exogeneity, next.
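The estimator can be sketched on simulated data. This is an illustration under assumed values (scalar regressor, \(\beta = 1.5\), standard normal draws, balanced panel), not the lecture's own code; it contrasts the double-demeaned slope with pooled OLS when \(X_{it}\) is correlated with both fixed effects.

```python
import numpy as np

def double_demean(a):
    # Two-way within transformation for a balanced (N, T) array.
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True)
              + a.mean())

rng = np.random.default_rng(1)
N, T, beta = 2000, 5, 1.5                    # hypothetical design values
alpha = rng.normal(size=(N, 1))              # unit effects
lam = rng.normal(size=(1, T))                # period effects
X = alpha + lam + rng.normal(size=(N, T))    # X correlated with both FEs
eps = rng.normal(size=(N, T))                # independent of X: strict exogeneity holds
Y = alpha + lam + beta * X + eps

# TWFE: OLS slope on the double-demeaned data (scalar-regressor case).
Xt, Yt = double_demean(X), double_demean(Y)
beta_twfe = (Xt * Yt).sum() / (Xt ** 2).sum()

# Pooled OLS: only the grand mean is removed, so alpha and lambda stay in the error.
Xc, Yc = X - X.mean(), Y - Y.mean()
beta_pooled = (Xc * Yc).sum() / (Xc ** 2).sum()

print(beta_twfe, beta_pooled)
```

In this design the pooled slope is pushed upward because the fixed effects enter both \(X\) and \(Y\) with the same sign, while the within slope stays close to the assumed \(\beta\).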
The identifying assumption for \(\beta\):
\[ E[\varepsilon_{it} \mid X_{i1}, \dots, X_{iT}, \alpha_i, \lambda_t] = 0 \]
The residual \(\varepsilon_{it}\) is mean-zero conditional on the entire \(X\) trajectory and the FEs.
Why “strict”? Unit-demeaning gives
\[ \tilde\varepsilon_{it} \;=\; \varepsilon_{it} - \frac{1}{T}\sum_{s=1}^{T}\varepsilon_{is} \]
— a linear combination of errors at every period. So \(E[\tilde X_{it}\tilde\varepsilon_{it}] = 0\) requires \(E[X_{ir}\varepsilon_{is}] = 0\) at every pair \((r, s)\).
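The natural condition \(E[\varepsilon_{it} \mid X_{it}, \alpha_i, \lambda_t] = 0\) can hold while strict exogeneity fails, namely when \(X\) at other periods correlates with \(\varepsilon_{it}\). Examples: feedback, where a past shock \(\varepsilon_{i,t-1}\) affects the current \(X_{it}\), and anticipation, where \(X_{it}\) responds to expected future shocks.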
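A small simulation makes the gap between the two conditions concrete. It is a sketch under assumed values: a hypothetical feedback parameter `delta` lets last period's shock enter this period's regressor, and only unit effects are included (no period effects) to keep the code short.

```python
import numpy as np

def fe_slope(delta, N=20000, T=3, beta=1.0, seed=2):
    """Within (unit-demeaned) slope when X_it loads on eps_{i,t-1} with weight delta.
    Contemporaneous exogeneity holds for any delta; strict exogeneity needs delta = 0."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=(N, 1))
    eps = rng.normal(size=(N, T + 1))        # column 0 is a pre-sample shock eps_{i0}
    v = rng.normal(size=(N, T))
    X = alpha + delta * eps[:, :-1] + v      # X_it depends on eps_{i,t-1}
    Y = alpha + beta * X + eps[:, 1:]        # Y_it depends on the current shock eps_it
    Xd = X - X.mean(axis=1, keepdims=True)   # unit demeaning
    Yd = Y - Y.mean(axis=1, keepdims=True)
    return (Xd * Yd).sum() / (Xd ** 2).sum()

slope_no_feedback = fe_slope(delta=0.0)      # strict exogeneity holds
slope_feedback = fe_slope(delta=1.0)         # strict exogeneity fails
print(slope_no_feedback, slope_feedback)
```

With feedback the within slope falls noticeably below the assumed \(\beta = 1\) even at large \(N\), because demeaning mixes \(\varepsilon_{it}\) into every period's transformed error. The distortion is of order \(1/T\), so it is most visible in short panels like the \(T = 3\) used here.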
Fixed \(T\), \(N \to \infty\), units i.i.d. The double sum is an average over i.i.d. units:
\[ \frac{1}{NT}\sum_{i,t}\tilde X_{it}\tilde X_{it}' \;=\; \frac{1}{N}\sum_i \;\frac{1}{T}\sum_t \tilde X_{it}\tilde X_{it}' \]
LLN applies to the outer sum (over \(i\)):
By Slutsky: \(\;\;\hat\beta_{\text{TWFE}} \;\xrightarrow{p}\; \beta\).
By a CLT over units (fixed \(T\), \(N \to \infty\)), with \(Q_{\tilde X \tilde X} \equiv E\bigl[\sum_t \tilde X_{it}\tilde X_{it}'\bigr]\):
\[ \sqrt{N}(\hat\beta_{\text{TWFE}} - \beta) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; Q_{\tilde X \tilde X}^{-1}\,\Omega\,Q_{\tilde X \tilde X}^{-1}\right) \]
The “meat” of the sandwich:
\[ \Omega \;\equiv\; \mathrm{Var}\!\left(\frac{1}{\sqrt N}\sum_{i,t} \tilde X_{it}\tilde\varepsilon_{it}\right) \;=\; E\!\left[\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)'\right] \]
(Last equality: i.i.d. across units.) Estimating \(\Omega\) correctly is the next slide.
Recall (Asymptotic Normality slide):
\[ \Omega \;=\; E\!\left[\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)'\right] \]
Expand the outer sum-product:
\[ \Omega \;=\; \underbrace{\sum_{t} E[\tilde\varepsilon_{it}^2\, \tilde X_{it}\tilde X_{it}']}_{\text{same-period (within } (i,t)\text{)}} \;+\; \underbrace{\sum_{t \neq s} E[\tilde\varepsilon_{it}\tilde\varepsilon_{is}\, \tilde X_{it}\tilde X_{is}']}_{\text{cross-period (within unit } i\text{)}} \]
Within a unit, residuals \(\tilde\varepsilon_{i,1}, \dots, \tilde\varepsilon_{i,T}\) are typically serially correlated — shocks persist over time. The cross-period sum is non-zero.
Naive SE treats \((i,t)\) observations as i.i.d. — sets the cross-period sum to zero, captures only the same-period sum:
\[ \hat\Omega_{\text{naive}} \;=\; \frac{1}{N}\sum_{i,t}\hat\varepsilon_{it}^2\, \tilde X_{it}\tilde X_{it}' \]
Under positive serial correlation, this underestimates \(\Omega\), so standard errors are too small and tests over-reject.
Cluster-robust at unit level — keeps both sums:
\[ \hat\Omega_{\text{cluster}} \;=\; \frac{1}{N}\sum_i \Bigl(\sum_t \tilde X_{it}\hat\varepsilon_{it}\Bigr)\Bigl(\sum_t \tilde X_{it}\hat\varepsilon_{it}\Bigr)' \]
with \(\hat\varepsilon_{it} = \tilde Y_{it} - \tilde X_{it}'\hat\beta_{\text{TWFE}}\).
Bertrand, Duflo & Mullainathan (2004 QJE): in panel-FE/DiD settings, ignoring within-unit correlation causes nominal 5% tests to reject 45%+ of the time.
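The difference between the two variance estimates can be seen in a simulation. This is a minimal sketch under assumptions of my own choosing: a scalar regressor, AR(1) processes with coefficient 0.8 for both \(X\) and \(\varepsilon\), both \(\hat\Omega\) estimates scaled by \(1/N\) so they are comparable, and no finite-sample corrections (packaged routines typically apply some).

```python
import numpy as np

def double_demean(a):
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True)
              + a.mean())

def ar1(rng, N, T, rho):
    """N independent stationary AR(1) paths of length T with unit innovation variance."""
    out = np.empty((N, T))
    out[:, 0] = rng.normal(size=N) / np.sqrt(1.0 - rho ** 2)
    for t in range(1, T):
        out[:, t] = rho * out[:, t - 1] + rng.normal(size=N)
    return out

rng = np.random.default_rng(3)
N, T, beta, rho = 500, 10, 1.0, 0.8          # hypothetical design values
X = ar1(rng, N, T, rho)
eps = ar1(rng, N, T, rho)                    # serially correlated within unit
Y = rng.normal(size=(N, 1)) + rng.normal(size=(1, T)) + beta * X + eps

Xt, Yt = double_demean(X), double_demean(Y)
beta_hat = (Xt * Yt).sum() / (Xt ** 2).sum()
score = Xt * (Yt - Xt * beta_hat)            # X-tilde times residual, shape (N, T)

Q_hat = (Xt ** 2).sum() / N                  # bread: average over units of sum_t X-tilde^2
omega_naive = (score ** 2).sum() / N         # same-period terms only
omega_cluster = (score.sum(axis=1) ** 2).sum() / N   # sum over t first, then square

se_naive = np.sqrt(omega_naive / N) / Q_hat
se_cluster = np.sqrt(omega_cluster / N) / Q_hat
print(se_naive, se_cluster)
```

Because both the regressor and the error are persistent here, the cross-period products are positive on average and the clustered standard error comes out larger than the naive one.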
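TWFE includes \(\alpha_i\) and \(\lambda_t\). These absorb any time-invariant unit characteristic (via \(\alpha_i\)) and any period shock common to all units (via \(\lambda_t\)), observed or not.
Implication for the choice of \(X\): \(\beta\) is identified only from variation in \(X_{it}\) that differs across periods within a unit and across units within a period. A regressor that is constant within unit, or common to all units in a period, is collinear with the fixed effects.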
Research question (ADH 2013): Did the rise in Chinese import competition between 1990 and 2007 reduce manufacturing employment in U.S. local labor markets exposed to it?
What is a local labor market?
Commuting zones (CZs) operationalize this:
Shift-share treatment: sum across industries \(j\) of (local share) \(\times\) (national shift):
\[ \text{ChinaShock}_{ct} \;=\; \sum_{j} \underbrace{\frac{L_{cj,t-1}}{L_{c,t-1}}}_{\text{share}_{cj}} \cdot \underbrace{\frac{\Delta M_{j,\text{China}\to\text{US},t}}{L_{j,t-1}}}_{\text{shift}_{jt}} \]
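As arithmetic, the exposure measure is a matrix-vector product of shares and shifts. The toy numbers below are hypothetical (two CZs, three industries, one period) and are not ADH data.

```python
import numpy as np

# Rows: CZs, columns: industries. Hypothetical lagged employment shares.
shares = np.array([[0.5, 0.3, 0.2],
                   [0.1, 0.1, 0.8]])

# Hypothetical national shifts: change in imports per lagged industry worker.
shifts = np.array([2.0, 0.0, 1.0])

# ChinaShock_c = sum_j share_cj * shift_j, one value per CZ.
china_shock = shares @ shifts
print(china_shock)
```

The first CZ is concentrated in the industry with the largest shift, so it receives the larger exposure (about 1.2 versus 1.0) even though the national shifts are the same for both.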
We now write down a yearly TWFE specification for the ADH setup:
\[ L_{ct} = \alpha_c + \lambda_t + \beta \cdot \text{ChinaShock}_{ct} + \varepsilon_{ct} \]
Honest disclosure: ADH (2013) does not run this yearly TWFE. They use stacked long-differences (1990→2000, 2000→2007). The framework below is an adaptation we use to illustrate panel-regression machinery — same identifying logic, different estimation.
Shorthand: \(s_{cj} \equiv \text{share}_{cj}\), \(x_{jt} \equiv \text{shift}_{jt}\), \(C_{ct} \equiv \text{ChinaShock}_{ct} = \sum_j s_{cj}\, x_{jt}\).
Three means (using \(s_{cj}\) fixed in \(t\), \(x_{jt}\) fixed in \(c\)):
\[ \bar C_c = \sum_j s_{cj}\,\bar x_j, \quad \bar C_t = \sum_j \bar s_j\, x_{jt}, \quad \bar C = \sum_j \bar s_j\,\bar x_j \]
Within transformation:
\[ \begin{aligned} \widetilde C_{ct} \;=\; C_{ct} - \bar C_c - \bar C_t + \bar C &= \sum_j \big[\,s_{cj}\,x_{jt} - s_{cj}\,\bar x_j - \bar s_j\,x_{jt} + \bar s_j\,\bar x_j\,\big] \\ &= \sum_j (s_{cj} - \bar s_j)(x_{jt} - \bar x_j) \end{aligned} \]
\(\widetilde C_{ct}\) is the idiosyncratic part of CZ-\(c\)’s exposure in year \(t\) — what’s left after netting out CZ-\(c\)’s average across years and year-\(t\)’s average across CZs.
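The product form of the within-transformed exposure can be verified numerically. This sketch uses randomly drawn shares and shifts (hypothetical, with shares fixed over time and shifts common across CZs, as the derivation assumes).

```python
import numpy as np

rng = np.random.default_rng(4)
n_cz, n_ind, n_years = 6, 4, 5

S = rng.dirichlet(np.ones(n_ind), size=n_cz)    # s_cj, shape (n_cz, n_ind), fixed in t
Xs = rng.normal(size=(n_ind, n_years))          # x_jt, shape (n_ind, n_years), fixed in c
C = S @ Xs                                      # C_ct = sum_j s_cj x_jt

# Left-hand side: two-way within transformation of C.
C_tilde = (C - C.mean(axis=1, keepdims=True)
             - C.mean(axis=0, keepdims=True)
             + C.mean())

# Right-hand side: sum_j (s_cj - sbar_j)(x_jt - xbar_j).
S_dev = S - S.mean(axis=0, keepdims=True)       # deviation from the cross-CZ mean share
X_dev = Xs - Xs.mean(axis=1, keepdims=True)     # deviation from the over-time mean shift
C_tilde_formula = S_dev @ X_dev

print(np.abs(C_tilde - C_tilde_formula).max())  # zero up to floating-point error
```

The identity is purely algebraic, so it holds for any share and shift matrices of conforming shape, not only for the Dirichlet and normal draws used here.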
Where does the variation come from?
By FWL, \(\hat\beta\) is the OLS slope of \(\widetilde L_{ct}\) on \(\widetilde C_{ct}\) across the panel — how much idiosyncratic employment moves per unit of idiosyncratic exposure.
For \(\hat\beta\) to be consistent, both pieces of \(C_{ct} = \sum_j s_{cj} x_{jt}\), the shares \(s_{cj}\) and the shifts \(x_{jt}\), must be uncorrelated with the error \(\varepsilon_{ct}\), given the FE.
ADH defend each separately:
If some unobserved \(U\) affects both the shifter \(x_{jt}\) and the CZ-year shock \(\varepsilon_{ct}\), then \(x_{jt}\) is correlated with \(\varepsilon_{ct}\) — strict exogeneity fails.
Question: what could \(U\) plausibly be? \(\to\) next slide.
\(U\) = US-side demand shocks. When industry \(j\) weakens domestically (shifting tastes, productivity slowdown, etc.):
So \(\Delta M_{j,\text{CN}\to\text{US},t}\) confounds Chinese supply (what we want — productivity, WTO entry, capacity expansion) with US demand (the confounder).
ADH IV: keep the same shift-share construction, but replace the shifter — use Chinese imports to other high-income countries instead of US:
\[ \text{ChinaShock}^{\text{IV}}_{ct} \;=\; \sum_j s_{cj} \cdot \frac{\Delta M_{j,\text{CN}\to\text{other},t}}{L_{j,t-1}} \]
(8 countries: Australia, Denmark, Finland, Germany, Japan, New Zealand, Spain, Switzerland.)
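The logic of replacing the shifter can be illustrated with a stylized simulation. This sketch abstracts from the shift-share aggregation and from fixed effects: each observation has a supply component shared by both import measures and a demand component `U` that enters only the US measure and the outcome error. All coefficient values (`beta`, `kappa`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta, kappa = 50_000, -0.6, 0.5        # hypothetical values

supply = rng.normal(size=n)               # Chinese supply component (what we want)
U = rng.normal(size=n)                    # US-side demand component (the confounder)
X = supply + U                            # exposure built from imports into the US
Z = supply + rng.normal(size=n)           # exposure built from imports into other countries
Y = beta * X + kappa * U + rng.normal(size=n)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

beta_ols = cov(X, Y) / cov(X, X)          # contaminated by U
beta_iv = cov(Z, Y) / cov(Z, X)           # just-identified IV (Wald) estimator
print(beta_ols, beta_iv)
```

The IV slope recovers the assumed \(\beta\) only because, by construction, `Z` shares the supply component with `X` and is independent of `U`. Whether the analogous exclusion holds in the actual data is the substantive question.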
ADH’s preferred specification adds, interacted with the period dummy:
Why pre-period?
We’ll write these as \(W_c^{\text{pre}\prime}\gamma_t\) — interactions of \(W^{\text{pre}}\) with period dummies, with a separate coefficient vector per period. Formalized in the next section.
Problem: CZs with different baseline characteristics may follow different time paths.
Resolution: let \(\lambda_t\) depend on a baseline characteristic \(W_c^{\text{pre}}\).
For continuous or multi-dimensional baseline characteristic \(W_c^{\text{pre}}\):
\[ Y_{ct} \;=\; \alpha_c + \underbrace{\lambda_t + W_c^{\text{pre}\prime}\gamma_t}_{\text{CZ-specific time path}} + X_{ct}'\beta + \varepsilon_{ct} \]
\(W_c^{\text{pre}}\) must be time-invariant (measured pre-treatment).
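A simulation shows what the interaction terms buy. It is a sketch under assumed values: a scalar baseline characteristic `W`, hypothetical period coefficients `gamma`, and a regressor whose time path also depends on `W`. By FWL, the unit and period effects are handled by double-demeaning every variable, including the interaction columns.

```python
import numpy as np

def double_demean(a):
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True)
              + a.mean())

rng = np.random.default_rng(6)
N, T, beta = 2000, 4, 1.0                   # hypothetical design values
t = np.arange(1, T + 1)
W = rng.normal(size=(N, 1))                 # time-invariant baseline characteristic
gamma = 0.4 * t                             # hypothetical gamma_t
X = 0.5 * W * t + rng.normal(size=(N, T))   # X trends differently by W
Y = (rng.normal(size=(N, 1)) + rng.normal(size=(1, T))
     + W * gamma + beta * X + rng.normal(size=(N, T)))

# Plain TWFE: omits the W-specific time path.
Xt, Yt = double_demean(X), double_demean(Y)
beta_plain = (Xt * Yt).sum() / (Xt ** 2).sum()

# TWFE with W x period-dummy interactions for t = 2..T (first period is the reference).
inter = [double_demean(W * (t == s)) for s in t[1:]]
R = np.column_stack([Xt.ravel()] + [z.ravel() for z in inter])
coef, *_ = np.linalg.lstsq(R, Yt.ravel(), rcond=None)
beta_inter = coef[0]

print(beta_plain, beta_inter)
```

Without the interactions, the slope picks up the part of the \(W\)-specific time path that co-moves with \(X\); with them, it stays close to the assumed \(\beta\).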
Period dummies and their \(W^{\text{pre}}\) interactions written out (observation indexed by \((c, t')\) on this slide; \(t\) is the summation index):
\[ Y_{c,t'} = \alpha_c \;+\; \sum_{t=2}^{T} \mathbf{1}[t'=t]\,\lambda_{t} \;+\; \sum_{t=2}^{T} \mathbf{1}[t'=t] \cdot W_c^{\text{pre}\prime}\gamma_{t} \;+\; X_{c,t'}'\beta + \varepsilon_{c,t'} \]
Parameters:
Why drop one period for both \(\lambda\) and \(\gamma\)?
Standalone period dummies: \(\sum_{t=1}^{T} \mathbf{1}[t'=t] = 1\) for all \((c, t')\) — the constant is in span(\(\alpha_c\)). Including all \(T\) → perfect collinearity. Drop \(\lambda_{t_0}\).
\(W^{\text{pre}}\)-interaction dummies: \(\sum_{t=1}^{T} \mathbf{1}[t'=t] \cdot W_c^{\text{pre}} = W_c^{\text{pre}}\) — time-invariant, also in span(\(\alpha_c\)). Drop \(\gamma_{t_0}\).
Convention: drop the same reference period \(t_0\) — typically the last pre-treatment period. Coefficients \(\lambda_{t}, \gamma_{t}\) are then interpreted relative to \(t_0\).
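The two collinearities can be checked by computing the rank of the dummy-variable design matrix. This is a sketch on a small hypothetical balanced panel; the reference period is taken to be the first one, matching sums that start at \(t = 2\).

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 6, 4
unit = np.repeat(np.arange(N), T)            # unit index of each row
period = np.tile(np.arange(T), N)            # period index of each row
W = rng.normal(size=N)[unit]                 # baseline characteristic, constant within unit

D_unit = (unit[:, None] == np.arange(N)).astype(float)     # N unit dummies
D_time = (period[:, None] == np.arange(T)).astype(float)   # T period dummies
D_inter = D_time * W[:, None]                               # T interactions W x 1[t' = t]

# All T period dummies and all T interactions: two redundant columns.
full = np.column_stack([D_unit, D_time, D_inter])
# Drop the reference period from both sets.
dropped = np.column_stack([D_unit, D_time[:, 1:], D_inter[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_dropped = np.linalg.matrix_rank(dropped)
print(full.shape[1], rank_full, dropped.shape[1], rank_dropped)
```

The full design has two more columns than its rank, one for each of the two sums that fall in the span of the unit dummies; after dropping the reference period from both sets, the design has full column rank.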
Pulling everything together — what does ADH actually find?
From their preferred IV spec with \(W^{\text{pre}}\) controls (Table 3, col. 6, stacked long-differences):
A $1,000-per-worker increase in import exposure over a decade \(\Rightarrow\) a 0.596 pp drop in CZ manufacturing employment / working-age population.
Stability across spec columns 1–6 (with vs. without \(W^{\text{pre}}\) controls) is part of ADH’s robustness argument.
When treatment is binary and group-structured:
\[ D_{it} = G_i \cdot \mathbf{1}[t \geq t^*] \]
TWFE specializes to difference-in-differences, the canonical design covered in lect6b.
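The equivalence can be checked numerically in the simplest case. This sketch assumes a balanced panel, a single treatment date shared by all treated units, and hypothetical values for the effect size and group sizes; it compares the TWFE slope on \(D_{it}\) with the difference-in-differences of the four group-period means.

```python
import numpy as np

def double_demean(a):
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True)
              + a.mean())

rng = np.random.default_rng(8)
N, T, t_star, effect = 200, 6, 3, 2.0                     # hypothetical design values
G = (np.arange(N) < N // 2).astype(float)[:, None]        # treated-group indicator G_i
post = (np.arange(T) >= t_star).astype(float)[None, :]    # 1[t >= t*]
D = G * post                                              # D_it = G_i * 1[t >= t*]
Y = (rng.normal(size=(N, 1)) + rng.normal(size=(1, T))
     + effect * D + rng.normal(size=(N, T)))

# TWFE slope on D via the within transformation.
Dt, Yt = double_demean(D), double_demean(Y)
beta_twfe = (Dt * Yt).sum() / (Dt ** 2).sum()

# Difference-in-differences of group-by-period means.
g = G.ravel() == 1
p = post.ravel() == 1
did = ((Y[g][:, p].mean() - Y[g][:, ~p].mean())
       - (Y[~g][:, p].mean() - Y[~g][:, ~p].mean()))

print(beta_twfe, did)
```

The two numbers agree to floating-point precision in this design. The exact match depends on the balanced panel and the common treatment date.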