Panel Regression

Natasha Kang

Xiamen University, Chow Institute

May, 2026

When Cross-Sectional Tools Aren’t Enough

Two cross-sectional levers from earlier:

  • Lect4: condition on observable confounders.
  • Lect5: use exogenous variation via instruments.

Both fail when the confounder is unobserved and no valid instrument exists.

Canonical example — returns to schooling: ability is unobserved and correlated with schooling. OLS is biased; no clean instrument is universally available. Need a different route.

A New Source of Variation

When the same unit is observed at multiple periods, we can compare the unit to itself:

  • \(X_{it}\) takes different values across periods \(t\), for fixed unit \(i\).
  • \(Y_{it}\) moves with it.
  • Anything fixed about unit \(i\) — observed or not — is held constant by construction.

Payoff: no instrument needed; no requirement to know what the confounder is.

Requires: assumptions on how \(X_{it}\) and \(\varepsilon_{it}\) co-move within a unit over time. These assumptions shape the rest of the lecture.

Potential Outcomes

For each unit-period \((i, t)\), the potential outcome \(Y_{it}(x)\) is the value \(Y\) would take if the regressor were set to \(x\):

\[ Y_{it}(x) \;=\; \text{unit } i\text{'s outcome at time } t \text{ when } X_{it} = x \]

Notation:

  • Capital \(X_{it}\) — the realized random regressor.
  • Lowercase \(x\) — a hypothetical value.
  • \(Y_{it} = Y_{it}(X_{it})\) — the observed outcome is the PO evaluated at the realized \(X_{it}\).

Cross-section: at a single \(t\), we see only one PO per unit. Every other value of \(x\) is counterfactual — the fundamental problem of causal inference.

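Panel: across periods, the same unit is observed at different realized \(X\) values: \(Y_{it}(X_{it})\) and \(Y_{it'}(X_{it'})\). In a sense, we observe unit \(i\) under more than one value of \(x\), just at different times. Treating these as comparable requires structure on how \(Y_{it}(x)\) changes with \(t\), which the linear PO assumption supplies.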

A Linear PO Assumption

Assume the PO function is linear and additively separable:

\[ Y_{it}(x) \;=\; \alpha_i \;+\; \lambda_t \;+\; x'\beta \;+\; \varepsilon_{it} \]

  • \(\alpha_i\): unit-fixed term, capturing every time-invariant feature of individual \(i\), observed or unobserved (ability, gender, family background, …).
  • \(\lambda_t\): time-fixed term, capturing every unit-invariant feature of period \(t\) (national price level, common macro shock, period-specific common event).
  • \(\beta\) — the causal slope in \(x\), common across units and periods.
  • \(\varepsilon_{it}\) — idiosyncratic disturbance, varying across \(i\) and \(t\).

The Empirical Model

Setting \(x = X_{it}\) in the linear PO function:

\[ Y_{it} \;=\; \alpha_i + \lambda_t + X_{it}'\beta + \varepsilon_{it} \]

\(X_{it}\) is allowed to correlate with both:

  • \(\alpha_i\) — unit-level fixed factors.
  • \(\lambda_t\) — common time shocks.

In a cross-section, \(\alpha_i\) would sit inside the error, requiring \(\mathrm{Cov}(X_i, \alpha_i) = 0\).

Estimating \(\beta\)

The parameter we want is \(\beta\) — the causal slope. \(\alpha_i\) and \(\lambda_t\) are unobserved nuisance terms in the regression.

Standard approach — Two-Way Fixed Effects (TWFE), via the within transformation:

  • Demean \(Y, X\) by unit and period means.
  • \(\alpha_i\) and \(\lambda_t\) are eliminated from the regression.
  • Leaves an estimating equation in \(\beta\) alone.

The Within Transformation

Take averages of both sides of \(Y_{it} = \alpha_i + \lambda_t + X_{it}'\beta + \varepsilon_{it}\):

  • Unit mean: \(\;\;\bar Y_i \;=\; \alpha_i + \bar\lambda + \bar X_i'\beta + \bar\varepsilon_i\)
  • Period mean: \(\;\;\bar Y_t \;=\; \bar\alpha + \lambda_t + \bar X_t'\beta + \bar\varepsilon_t\)
  • Grand mean: \(\;\;\bar Y \;=\; \bar\alpha + \bar\lambda + \bar X'\beta + \bar\varepsilon\)

Combine: \(\tilde Y_{it} \equiv Y_{it} - \bar Y_i - \bar Y_t + \bar Y\). The \(\alpha\) and \(\lambda\) terms cancel exactly:

\[ \tilde Y_{it} \;=\; \tilde X_{it}'\beta + \tilde\varepsilon_{it} \]
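
The cancellation can be checked numerically. The following is a minimal sketch on a simulated balanced panel; the data-generating process, the seed, and the helper name `double_demean` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 50, 6, 2.0

alpha = rng.normal(size=(N, 1))            # unit effects alpha_i
lam = rng.normal(size=(1, T))              # period effects lambda_t
X = rng.normal(size=(N, T)) + alpha + lam  # X correlated with both effects
eps = rng.normal(size=(N, T))
Y = alpha + lam + beta * X + eps


def double_demean(A):
    """A_it - unit mean - period mean + grand mean (balanced panel)."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True) + A.mean())


Y_t, X_t, eps_t = double_demean(Y), double_demean(X), double_demean(eps)

# The fixed effects themselves are annihilated by the transformation ...
fe_left = np.max(np.abs(double_demean(alpha + lam)))
# ... so the transformed model holds with beta alone.
max_gap = np.max(np.abs(Y_t - (beta * X_t + eps_t)))
print(fe_left, max_gap)
```

Both printed numbers should be zero up to floating-point error. The sketch relies on a balanced panel; with missing unit-periods the simple means no longer cancel the effects exactly.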

TWFE Estimator: Formal Derivation

Apply OLS to the double-demeaned model:

\[ \hat\beta_{\text{TWFE}} \;=\; \left(\sum_{i,t} \tilde X_{it}\tilde X_{it}'\right)^{-1} \sum_{i,t} \tilde X_{it}\tilde Y_{it} \]

Substitute \(\tilde Y_{it} = \tilde X_{it}'\beta + \tilde\varepsilon_{it}\):

\[ \hat\beta_{\text{TWFE}} - \beta \;=\; \left(\sum_{i,t} \tilde X_{it}\tilde X_{it}'\right)^{-1} \sum_{i,t} \tilde X_{it}\tilde\varepsilon_{it} \]

Whether \(\hat\beta_{\text{TWFE}} \to_p \beta\) depends on the behavior of these two sums. The substantive condition is the identifying assumption — strict exogeneity, next.
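
As a rough illustration of the estimator, the sketch below computes \(\hat\beta_{\text{TWFE}}\) for a scalar regressor by double demeaning and compares it with a dummy-variable regression (unit dummies plus \(T-1\) period dummies) and with pooled OLS. The simulated data and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, beta = 200, 5, 1.5

alpha = rng.normal(size=(N, 1))
lam = rng.normal(size=(1, T))
X = 0.8 * alpha + 0.5 * lam + rng.normal(size=(N, T))  # X correlated with FEs
Y = alpha + lam + beta * X + rng.normal(size=(N, T))


def double_demean(A):
    """A_it - unit mean - period mean + grand mean (balanced panel)."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True) + A.mean())


# Within (double-demeaned) estimator
Xt, Yt = double_demean(X).ravel(), double_demean(Y).ravel()
beta_within = (Xt @ Yt) / (Xt @ Xt)

# Dummy-variable regression: X, N unit dummies, T-1 period dummies
unit = np.repeat(np.arange(N), T)      # matches row-major ravel of (N, T)
period = np.tile(np.arange(T), N)
D_unit = (unit[:, None] == np.arange(N)).astype(float)
D_time = (period[:, None] == np.arange(1, T)).astype(float)
Z = np.column_stack([X.ravel(), D_unit, D_time])
coef, *_ = np.linalg.lstsq(Z, Y.ravel(), rcond=None)
beta_lsdv = coef[0]

# Pooled OLS ignores alpha_i and lambda_t
x, y = X.ravel(), Y.ravel()
beta_pooled = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(beta_within, beta_lsdv, beta_pooled)
```

The first two numbers should agree to numerical precision, while pooled OLS is pulled away from \(\beta\) because \(X\) is correlated with the omitted effects in this simulated design.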

Strict Exogeneity

The identifying assumption for \(\beta\):

\[ E[\varepsilon_{it} \mid X_{i1}, \dots, X_{iT}, \alpha_i, \lambda_t] = 0 \]

The error \(\varepsilon_{it}\) is mean-zero conditional on unit \(i\)'s entire \(X\) trajectory and the FEs.

Why “strict”? Unit-demeaning gives

\[ \tilde\varepsilon_{it} \;=\; \varepsilon_{it} - \frac{1}{T}\sum_{s=1}^{T}\varepsilon_{is} \]

— a linear combination of errors at every period. So \(E[\tilde X_{it}\tilde\varepsilon_{it}] = 0\) requires \(E[X_{ir}\varepsilon_{is}] = 0\) at every pair \((r, s)\).
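
A worked case may help. Assuming scalar \(X\), \(T = 2\), and unit-demeaning only (an illustrative special case), \(\tilde X_{i1} = \tfrac{1}{2}(X_{i1} - X_{i2})\) and \(\tilde\varepsilon_{i1} = \tfrac{1}{2}(\varepsilon_{i1} - \varepsilon_{i2})\), so

\[ E[\tilde X_{i1}\tilde\varepsilon_{i1}] \;=\; \tfrac{1}{4}\Bigl( E[X_{i1}\varepsilon_{i1}] - E[X_{i1}\varepsilon_{i2}] - E[X_{i2}\varepsilon_{i1}] + E[X_{i2}\varepsilon_{i2}] \Bigr) \]

The two middle terms pair \(X\) in one period with \(\varepsilon\) in the other. Contemporaneous exogeneity alone says nothing about them.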

When Strict Exogeneity Fails

The natural condition \(E[\varepsilon_{it} \mid X_{it}, \alpha_i, \lambda_t] = 0\) can hold while strict exogeneity fails — when \(X\) at other periods correlates with \(\varepsilon_{it}\):

  • Lagged dependent variable: \(X_{it} = Y_{i,t-1}\) depends on \(\varepsilon_{i,t-1}\) \(\Rightarrow\) \(\hat\beta_{\text{TWFE}}\) inconsistent (Nickell bias).
  • Feedback / Ashenfelter dip: treatment adoption \(X_{i,t+1}\) responds to past outcomes (and hence past errors).
  • Anticipation: future \(X\) enters today’s \(\varepsilon\) via expectations.
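
The lagged-dependent-variable case is easy to see in simulation. The sketch below is an assumed AR(1) panel with unit effects only (no period effects, for brevity); the parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, rho = 5000, 5, 0.5

alpha = rng.normal(size=N)
Y = np.empty((N, T + 1))
# Start each unit from its stationary distribution
Y[:, 0] = alpha / (1 - rho) + rng.normal(size=N) / np.sqrt(1 - rho**2)
for t in range(1, T + 1):
    Y[:, t] = alpha + rho * Y[:, t - 1] + rng.normal(size=N)

Y_lag, Y_cur = Y[:, :-1], Y[:, 1:]   # regressor is last period's outcome

# Unit demeaning (the within transformation without period effects)
yl = Y_lag - Y_lag.mean(axis=1, keepdims=True)
yc = Y_cur - Y_cur.mean(axis=1, keepdims=True)
rho_fe = (yl * yc).sum() / (yl**2).sum()

print(rho, rho_fe)
```

With \(N\) large and \(T\) small, the within estimate sits well below the true \(\rho\). Increasing \(N\) does not help, because the bias comes from the correlation between the demeaned lag and the demeaned error, which shrinks only as \(T\) grows.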

TWFE Estimator: Consistency

Fixed \(T\), \(N \to \infty\), units i.i.d. The double sum is an average over i.i.d. units:

\[ \frac{1}{NT}\sum_{i,t}\tilde X_{it}\tilde X_{it}' \;=\; \frac{1}{N}\sum_i \;\frac{1}{T}\sum_t \tilde X_{it}\tilde X_{it}' \]

LLN applies to the outer sum (over \(i\)):

  • Rank: \(\frac{1}{NT}\sum_{i,t}\tilde X_{it}\tilde X_{it}' \;\to_p\; Q_{\tilde X \tilde X} \;\equiv\; E\!\left[\frac{1}{T}\sum_t \tilde X_{it}\tilde X_{it}'\right]\), positive definite by assumption.
  • Orthogonality: \(\frac{1}{NT}\sum_{i,t}\tilde X_{it}\tilde\varepsilon_{it} \;\to_p\; E\!\left[\frac{1}{T}\sum_t \tilde X_{it}\tilde\varepsilon_{it}\right] \;=\; 0\) under strict exogeneity.

By Slutsky: \(\;\;\hat\beta_{\text{TWFE}} \;\xrightarrow{p}\; \beta\).

TWFE Estimator: Asymptotic Normality

By a CLT over units (fixed \(T\), \(N \to \infty\)):

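\[ \sqrt{N}(\hat\beta_{\text{TWFE}} - \beta) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \frac{1}{T^2}\,Q_{\tilde X \tilde X}^{-1}\,\Omega\,Q_{\tilde X \tilde X}^{-1}\right) \]

(The factor \(1/T^2\) appears because \(Q_{\tilde X \tilde X}\) averages over \(t\) while \(\Omega\) below sums over \(t\).)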

The “meat” of the sandwich:

\[ \Omega \;\equiv\; \mathrm{Var}\!\left(\frac{1}{\sqrt N}\sum_{i,t} \tilde X_{it}\tilde\varepsilon_{it}\right) \;=\; E\!\left[\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)'\right] \]

(Last equality: i.i.d. across units.) Estimating \(\Omega\) correctly is the next slide.

Decomposing \(\Omega\)

Recall (Asymptotic Normality slide):

\[ \Omega \;=\; E\!\left[\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)\Bigl(\sum_t \tilde X_{it}\tilde\varepsilon_{it}\Bigr)'\right] \]

Expand the outer sum-product:

\[ \Omega \;=\; \underbrace{\sum_{t} E[\tilde\varepsilon_{it}^2\, \tilde X_{it}\tilde X_{it}']}_{\text{same-period (within } (i,t)\text{)}} \;+\; \underbrace{\sum_{t \neq s} E[\tilde\varepsilon_{it}\tilde\varepsilon_{is}\, \tilde X_{it}\tilde X_{is}']}_{\text{cross-period (within unit } i\text{)}} \]

Within a unit, residuals \(\tilde\varepsilon_{i,1}, \dots, \tilde\varepsilon_{i,T}\) are typically serially correlated — shocks persist over time. The cross-period sum is non-zero.

Naive vs. Cluster-Robust

Naive SE treats \((i,t)\) observations as i.i.d. — sets the cross-period sum to zero, captures only the same-period sum:

\[ \hat\Omega_{\text{naive}} \;=\; \frac{1}{N}\sum_{i,t}\hat\varepsilon_{it}^2\, \tilde X_{it}\tilde X_{it}' \]

Under positive serial correlation, this typically underestimates \(\Omega\).

Cluster-robust at unit level — keeps both sums:

\[ \hat\Omega_{\text{cluster}} \;=\; \frac{1}{N}\sum_i \Bigl(\sum_t \tilde X_{it}\hat\varepsilon_{it}\Bigr)\Bigl(\sum_t \tilde X_{it}\hat\varepsilon_{it}\Bigr)' \]

with \(\hat\varepsilon_{it} = \tilde Y_{it} - \tilde X_{it}'\hat\beta_{\text{TWFE}}\).
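
A minimal sketch of the comparison, for a scalar regressor on simulated data where both \(X\) and \(\varepsilon\) follow an AR(1) within unit (an assumption made here to generate serial correlation). It uses the finite-sample sandwich form, so the two standard errors differ only in the "meat".

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, beta, phi = 400, 8, 1.0, 0.8


def ar1_panel(n, t_len, phi, rng):
    """Independent stationary AR(1) paths, one row per unit."""
    out = np.empty((n, t_len))
    out[:, 0] = rng.normal(size=n) / np.sqrt(1 - phi**2)
    for t in range(1, t_len):
        out[:, t] = phi * out[:, t - 1] + rng.normal(size=n)
    return out


def double_demean(A):
    """A_it - unit mean - period mean + grand mean (balanced panel)."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True) + A.mean())


X = ar1_panel(N, T, phi, rng)
eps = ar1_panel(N, T, phi, rng)
Y = rng.normal(size=(N, 1)) + rng.normal(size=(1, T)) + beta * X + eps

Xt, Yt = double_demean(X), double_demean(Y)
Sxx = (Xt**2).sum()
b = (Xt * Yt).sum() / Sxx
e = Yt - b * Xt                       # residuals from the demeaned model

score = Xt * e                        # X~_it * e_it, shape (N, T)
meat_naive = (score**2).sum()                 # same-period terms only
meat_cluster = (score.sum(axis=1)**2).sum()   # sum within unit, then square

se_naive = np.sqrt(meat_naive) / Sxx
se_cluster = np.sqrt(meat_cluster) / Sxx
print(se_naive, se_cluster)
```

In this design the cluster-robust standard error comes out larger, since the cross-period products that the naive formula drops are positive on average. No small-sample degrees-of-freedom correction is applied in the sketch.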

Bertrand, Duflo & Mullainathan (2004 QJE): in panel-FE/DiD settings, ignoring within-unit correlation causes nominal 5% tests to reject 45%+ of the time.

Which \(X\)’s Can TWFE Identify?

TWFE includes \(\alpha_i\) and \(\lambda_t\). These absorb:

  • Anything that’s a function of \(i\) alone (constant within unit).
  • Anything that’s a function of \(t\) alone (constant within period).

Implication for the choice of \(X\):

  • Time-invariant within unit (gender, race, baseline characteristics): absorbed by \(\alpha_i\). Cannot be identified in TWFE — its coefficient drops out with the FE.
  • Common across units in each period (national-level policy, macro shock that hits everyone): absorbed by \(\lambda_t\). Cannot be identified.
  • Two-way varying (varies across both \(i\) and \(t\)): identifiable. This is what TWFE can estimate.
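
The three cases can be illustrated by applying the within transformation to each type of regressor. The variables below (`gender`, `macro`, `two_way`) are made-up examples on a small balanced panel.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 30, 5


def double_demean(A):
    """A_it - unit mean - period mean + grand mean (balanced panel)."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True) + A.mean())


# Function of i alone: constant within unit
gender = rng.integers(0, 2, size=(N, 1)).astype(float) * np.ones((1, T))
# Function of t alone: constant within period
macro = np.ones((N, 1)) * rng.normal(size=(1, T))
# Varies across both i and t
two_way = rng.normal(size=(N, T))

left_gender = np.abs(double_demean(gender)).max()
left_macro = np.abs(double_demean(macro)).max()
left_two_way = np.abs(double_demean(two_way)).max()
print(left_gender, left_macro, left_two_way)
```

The first two regressors are wiped out (zero variation remains), so no coefficient on them can be estimated. Only the two-way varying regressor retains variation after the transformation.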

Continuous Treatment: Autor-Dorn-Hanson (ADH) China Shock

Research question (ADH 2013): Did the rise in Chinese import competition between 1990 and 2007 reduce manufacturing employment in U.S. local labor markets exposed to it?

What is a local labor market?

  • A geographic area where workers can take jobs without relocating.
  • The scope where displaced workers search and employers recruit.

Commuting zones (CZs) operationalize this:

  • Groups of counties merged by commuting flows; ~700 cover the U.S.
  • Capture the boundary of where workers actually look for jobs.
  • Differ in pre-1990 industrial mix → variation in exposure to Chinese imports.

ADH: Setup

  • Unit \(c\): commuting zone. Period \(t\): 1990–2007.
  • Outcome \(L_{ct}\): manufacturing employment share.
  • Treatment \(\text{ChinaShock}_{ct}\): continuous, shift-share construction.

Shift-share treatment: sum across industries \(j\) of (local share) \(\times\) (national shift):

\[ \text{ChinaShock}_{ct} \;=\; \sum_{j} \underbrace{\frac{L_{cjt-1}}{L_{ct-1}}}_{\text{share}_{cj}} \cdot \underbrace{\frac{\Delta M_{j,\text{China}\to\text{US},t}}{L_{j,t-1}}}_{\text{shift}_{jt}} \]

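  • share\(_{cj}\): industry \(j\)’s employment share in CZ \(c\) at baseline. The formula lags shares to \(t-1\); in the adaptation used here they are held at their baseline value, so they vary across CZs but are fixed in \(t\).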
  • shift\(_{jt}\): national change in Chinese imports in industry \(j\), per worker. Varies across industries and years; common to all CZs.
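
With shares fixed in \(t\), the exposure measure is a matrix product of a (CZ \(\times\) industry) share matrix and an (industry \(\times\) period) shift matrix. The sketch below uses invented numbers throughout; `L_cj`, `other_emp` and `dM` are hypothetical stand-ins for baseline employment, non-manufacturing employment, and import growth.

```python
import numpy as np

rng = np.random.default_rng(5)
C, J, T = 4, 3, 2                                # CZs, industries, periods

L_cj = rng.uniform(100, 1000, size=(C, J))       # baseline employment by CZ-industry
other_emp = rng.uniform(1000, 5000, size=C)      # employment outside the J industries
L_c = L_cj.sum(axis=1) + other_emp               # total CZ employment
L_j = L_cj.sum(axis=0)                           # national industry employment

shares = L_cj / L_c[:, None]                     # s_cj, shape (C, J)
dM = rng.uniform(0, 5e5, size=(J, T))            # change in imports by industry-period
shifts = dM / L_j[:, None]                       # x_jt, per worker, shape (J, T)

exposure = shares @ shifts                       # ChinaShock_ct, shape (C, T)

# Same thing with explicit sums, as a check on the matrix form
check = np.zeros((C, T))
for c in range(C):
    for t in range(T):
        for j in range(J):
            check[c, t] += shares[c, j] * shifts[j, t]

print(exposure)
```

Each row of `exposure` is one CZ's path over periods. Two CZs facing the same national shifts differ only through their rows of `shares`.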

A Pedagogical TWFE Adaptation

We now write down a yearly TWFE specification for the ADH setup:

\[ L_{ct} = \alpha_c + \lambda_t + \beta \cdot \text{ChinaShock}_{ct} + \varepsilon_{ct} \]

  • \(\alpha_c\): CZ fixed effect — absorbs time-invariant CZ characteristics.
  • \(\lambda_t\): year fixed effect — absorbs national-year shocks common to all CZs.
  • \(\beta\): average causal effect of one extra unit of import exposure on the manufacturing employment share.

Honest disclosure: ADH (2013) does not run this yearly TWFE. They use stacked long-differences (1990→2000, 2000→2007). The framework below is an adaptation we use to illustrate panel-regression machinery — same identifying logic, different estimation.

Within-Transformed Treatment

Shorthand: \(s_{cj} \equiv \text{share}_{cj}\), \(x_{jt} \equiv \text{shift}_{jt}\), \(C_{ct} \equiv \text{ChinaShock}_{ct} = \sum_j s_{cj}\, x_{jt}\).

Three means (using \(s_{cj}\) fixed in \(t\), \(x_{jt}\) fixed in \(c\)):

\[ \bar C_c = \sum_j s_{cj}\,\bar x_j, \quad \bar C_t = \sum_j \bar s_j\, x_{jt}, \quad \bar C = \sum_j \bar s_j\,\bar x_j \]

Within transformation:

\[ \begin{aligned} \widetilde C_{ct} \;=\; C_{ct} - \bar C_c - \bar C_t + \bar C &= \sum_j \big[\,s_{cj}\,x_{jt} - s_{cj}\,\bar x_j - \bar s_j\,x_{jt} + \bar s_j\,\bar x_j\,\big] \\ &= \sum_j (s_{cj} - \bar s_j)(x_{jt} - \bar x_j) \end{aligned} \]
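
The identity can be verified numerically. In this sketch the shares and shifts are arbitrary random draws, used only to check the algebra.

```python
import numpy as np

rng = np.random.default_rng(6)
C, J, T = 7, 4, 5

s = rng.uniform(size=(C, J))       # s_cj, fixed in t
x = rng.normal(size=(J, T))        # x_jt, common to all CZs
Cmat = s @ x                       # C_ct = sum_j s_cj x_jt


def double_demean(A):
    """A_ct - CZ mean - period mean + grand mean (balanced panel)."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True) + A.mean())


lhs = double_demean(Cmat)

s_dev = s - s.mean(axis=0, keepdims=True)   # s_cj - s_bar_j (mean over CZs)
x_dev = x - x.mean(axis=1, keepdims=True)   # x_jt - x_bar_j (mean over periods)
rhs = s_dev @ x_dev

gap = np.abs(lhs - rhs).max()
print(gap)
```

The gap is zero up to floating-point error: double demeaning the product is the same as multiplying the demeaned shares by the demeaned shifts.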

What Identifies \(\beta\)

\[ \widetilde C_{ct} \;=\; \sum_j (s_{cj} - \bar s_j)(x_{jt} - \bar x_j) \]

\(\widetilde C_{ct}\) is the idiosyncratic part of CZ-\(c\)’s exposure in year \(t\) — what’s left after netting out CZ-\(c\)’s average across years and year-\(t\)’s average across CZs.

Where does the variation come from?

  • Across years: in year \(t\), some industries had unusually large national shifts (large \(x_{jt} - \bar x_j\)).
  • Across CZs: CZ \(c\) is over-exposed to certain industries relative to the average CZ (large \(s_{cj} - \bar s_j\)).

By the Frisch-Waugh-Lovell (FWL) theorem, \(\hat\beta\) is the OLS slope of \(\widetilde L_{ct}\) on \(\widetilde C_{ct}\) across the panel: how much idiosyncratic employment moves per unit of idiosyncratic exposure.

What Strict Exogeneity Requires

For \(\hat\beta\) to be consistent, both pieces of \(C_{ct} = \sum_j s_{cj} x_{jt}\) must be uncorrelated with the error:

  • shares \(s_{cj}\) (baseline industry mix) uncorrelated with CZ-year shocks \(\varepsilon_{ct}\),
  • shifts \(x_{jt}\) (national industry shocks) uncorrelated with \(\varepsilon_{ct}\),

given the FE.

ADH defend each separately:

  • Shares: measured at the start of the period (\(t-1\), e.g. 1990), so predetermined relative to later \(\varepsilon_{ct}\).
  • Shifts: are they exogenous? \(\to\) next slide.

Are the Shifts Exogenous?

If some unobserved \(U\) affects both the shifter \(x_{jt}\) and the CZ-year shock \(\varepsilon_{ct}\), then \(x_{jt}\) is correlated with \(\varepsilon_{ct}\) — strict exogeneity fails.

Question: what could \(U\) plausibly be? \(\to\) next slide.

Why the Shifts Aren’t Exogenous

\(U\) = US-side demand shocks. When industry \(j\) weakens domestically (shifting tastes, productivity slowdown, etc.):

  • US firms can’t compete \(\Rightarrow\) imports fill the gap \(\Rightarrow\) raises \(x_{jt}\).
  • Employment in CZs exposed to industry \(j\) falls for non-China reasons \(\Rightarrow\) shows up in \(\varepsilon_{ct}\).

So \(\Delta M_{j,\text{CN}\to\text{US},t}\) confounds Chinese supply (what we want — productivity, WTO entry, capacity expansion) with US demand (the confounder).

ADH: The Instrument

ADH IV: keep the same shift-share construction, but replace the shifter — use Chinese imports to other high-income countries instead of US:

\[ \text{ChinaShock}^{\text{IV}}_{ct} \;=\; \sum_j s_{cj} \cdot \frac{\Delta M_{j,\text{CN}\to\text{other},t}}{L_{j,t-1}} \]

(8 countries: Australia, Denmark, Finland, Germany, Japan, New Zealand, Spain, Switzerland.)

  • Relevance: same Chinese supply surge hits all 8 destinations \(\Rightarrow\) correlated with the US shifter.
  • Exclusion: other countries’ imports are not driven by US-specific demand shocks \(\Rightarrow\) purges the demand confound.
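
The logic of the instrument can be sketched in a stylized simulation. Everything below is an assumed data-generating process, not ADH's data or code: the US shifter is a common supply component plus a US-specific component `u`, the other-country shifter shares only the supply component, and `u` also enters the CZ-level error.

```python
import numpy as np

rng = np.random.default_rng(7)
C, J, T = 300, 30, 20
beta, kappa = -0.6, 1.0

s = rng.dirichlet(np.ones(J), size=C)            # baseline shares, shape (C, J)
supply = rng.normal(size=(J, T))                 # Chinese supply shock
u = rng.normal(size=(J, T))                      # US-specific confounder
x_us = supply + u                                # US shifter
x_other = supply + 0.5 * rng.normal(size=(J, T)) # other-country shifter

exposure = s @ x_us
instrument = s @ x_other
eps = -kappa * (s @ u) + 0.1 * rng.normal(size=(C, T))
L = rng.normal(size=(C, 1)) + rng.normal(size=(1, T)) + beta * exposure + eps


def double_demean(A):
    """A_ct - CZ mean - period mean + grand mean (balanced panel)."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True) + A.mean())


Ct, Zt, Lt = double_demean(exposure), double_demean(instrument), double_demean(L)

beta_ols = (Ct * Lt).sum() / (Ct**2).sum()       # TWFE OLS
beta_iv = (Zt * Lt).sum() / (Zt * Ct).sum()      # TWFE IV, one instrument
print(beta, beta_ols, beta_iv)
```

In this design OLS is biased away from zero because `u` raises measured exposure and lowers employment at the same time. The IV estimate uses only the variation in exposure that moves with the other-country shifter, which is unrelated to `u` by construction.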

ADH: Controls Strengthen Identification

ADH’s preferred specification adds, interacted with the period dummy:

  • Census division \(\times\) period — region-specific time trends.
  • Baseline CZ characteristics \(\times\) period — % college, % female, % manufacturing employment, % foreign-born — measured at a single pre-period baseline.

Why pre-period?

  • Predetermined: values fixed before the shock arrives.
  • Using later values would mix in the outcome of treatment — they’d absorb the very effect we want to estimate (“bad controls”).
  • Same logic as fixing shares \(s_{cj}\) at baseline.

We’ll write these as \(W_c^{\text{pre}\prime}\gamma_t\) — interactions of \(W^{\text{pre}}\) with period dummies, with a separate coefficient vector per period. Formalized in the next section.

Why Add \(W^{\text{pre}}\) Controls?

  • TWFE assumes the time path \(\lambda_t\) is common to all CZs.
  • Strict exogeneity (given \(\alpha_c, \lambda_t\)) then requires the error to be mean-zero given \(X\).

Problem: CZs with different baseline characteristics may follow different time paths.

  • A common \(\lambda_t\) is misspecified.
  • \(\varepsilon_{ct}\) inherits CZ-specific time variation correlated with \(X_{ct}\) — strict exogeneity fails.

Resolution: let \(\lambda_t\) depend on a baseline characteristic \(W_c^{\text{pre}}\).

  • Stratifies the time path by pre-period composition.
  • Relaxes strict exogeneity to a conditional version — exogeneity within strata of \(W^{\text{pre}}\).
  • Formalized next.

General Form: \(W^{\text{pre}}\) Interactions

For continuous or multi-dimensional baseline characteristic \(W_c^{\text{pre}}\):

\[ Y_{ct} \;=\; \alpha_c + \underbrace{\lambda_t + W_c^{\text{pre}\prime}\gamma_t}_{\text{CZ-specific time path}} + X_{ct}'\beta + \varepsilon_{ct} \]

  • \(\gamma_t \in \mathbb{R}^K\): a period-specific coefficient vector on \(W^{\text{pre}}\).
  • \(W_c^{\text{pre}\prime}\gamma_t\) is CZ \(c\)’s time-\(t\) tilt away from the common \(\lambda_t\) — CZs with different \(W^{\text{pre}}\) travel different time paths.

\(W_c^{\text{pre}}\) must be time-invariant (measured pre-treatment).

  • Adding it as a level alone (without time interaction) is absorbed by \(\alpha_c\).
  • Only the time-varying interaction \(W_c^{\text{pre}\prime}\gamma_t\) identifies anything new.

The Fully Expanded Form

Period dummies and their \(W^{\text{pre}}\) interactions written out (observation indexed by \((c, t')\) on this slide; \(t\) is the summation index):

\[ Y_{c,t'} = \alpha_c \;+\; \sum_{t=2}^{T} \mathbf{1}[t'=t]\,\lambda_{t} \;+\; \sum_{t=2}^{T} \mathbf{1}[t'=t] \cdot W_c^{\text{pre}\prime}\gamma_{t} \;+\; X_{c,t'}'\beta + \varepsilon_{c,t'} \]

Parameters:

  • \(\alpha_c\): \(N\) unit FE.
  • \(\lambda_{t}\) for \(t = 2, \ldots, T\): \(T-1\) scalar coefficients on period dummies.
  • \(\gamma_{t}\) for \(t = 2, \ldots, T\): each a \(K\)-vector — \(K(T-1)\) coefficients on (period \(\times W^{\text{pre}}\)) interactions.
  • \(\beta\): causal slope on \(X\).

Dummy Trap and Reference Period

Why drop one period for both \(\lambda\) and \(\gamma\)?

Standalone period dummies: \(\sum_{t=1}^{T} \mathbf{1}[t'=t] = 1\) for all \((c, t')\) — the constant is in span(\(\alpha_c\)). Including all \(T\) → perfect collinearity. Drop \(\lambda_{t_0}\).

\(W^{\text{pre}}\)-interaction dummies: \(\sum_{t=1}^{T} \mathbf{1}[t'=t] \cdot W_c^{\text{pre}} = W_c^{\text{pre}}\) — time-invariant, also in span(\(\alpha_c\)). Drop \(\gamma_{t_0}\).

Convention: drop the same reference period \(t_0\) — typically the last pre-treatment period. Coefficients \(\lambda_{t}, \gamma_{t}\) are then interpreted relative to \(t_0\).
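
A quick rank check makes the collinearity concrete. The sketch builds the dummy columns for a tiny panel with a scalar \(W_c^{\text{pre}}\) (so \(K = 1\)); the sizes and the random \(W\) are illustrative assumptions, and period 0 plays the role of the reference period \(t_0\).

```python
import numpy as np

rng = np.random.default_rng(8)
N, T = 6, 4

unit = np.repeat(np.arange(N), T)
period = np.tile(np.arange(T), N)

D_unit = (unit[:, None] == np.arange(N)).astype(float)      # N unit dummies
D_time = (period[:, None] == np.arange(T)).astype(float)    # all T period dummies
W = rng.normal(size=N)[unit]                                # W_c repeated over t
WxT = D_time * W[:, None]                                   # all T interactions

# Including every period dummy and every interaction
full = np.column_stack([D_unit, D_time, WxT])
# Dropping the reference period from both sets
dropped = np.column_stack([D_unit, D_time[:, 1:], WxT[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_dropped = np.linalg.matrix_rank(dropped)
print(full.shape[1], rank_full, dropped.shape[1], rank_dropped)
```

The full design has two more columns than its rank: one redundancy from the period dummies and one from the interactions. After dropping the reference period from both sets, the number of columns equals the rank.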

ADH: Headline Finding

Pulling everything together — what does ADH actually find?

From their preferred IV spec with \(W^{\text{pre}}\) controls (Table 3, col. 6, stacked long-differences):

A $1,000-per-worker increase in import exposure over a decade \(\Rightarrow\) a 0.596 pp drop in CZ manufacturing employment / working-age population.

Stability across spec columns 1–6 (with vs. without \(W^{\text{pre}}\) controls) is part of ADH’s robustness argument.

What’s Next: DiD as a Special Case

When treatment is binary and group-structured:

\[ D_{it} = G_i \cdot \mathbf{1}[t \geq t^*] \]

  • \(G_i \in \{0, 1\}\): indicator that unit \(i\) belongs to the treated group (treated at some point).
  • \(t^*\): the period when treatment turns on for the treated group.

TWFE specializes to difference-in-differences, the canonical design covered in lect6b.
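
As a preview, the sketch below checks that on a balanced panel with a single adoption date, the TWFE coefficient on \(D_{it}\) equals the difference-in-differences of group-period means. The simulated data, the treatment effect `tau`, and the group sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(9)
N, T, t_star, tau = 200, 6, 3, 2.0
n_treat = N // 2

G = (np.arange(N) < n_treat).astype(float)[:, None]      # treated-group indicator
post = (np.arange(T) >= t_star).astype(float)[None, :]   # 1[t >= t*]
D = G * post

Y = (rng.normal(size=(N, 1)) + rng.normal(size=(1, T))
     + tau * D + 0.5 * rng.normal(size=(N, T)))


def double_demean(A):
    """A_it - unit mean - period mean + grand mean (balanced panel)."""
    return (A - A.mean(axis=1, keepdims=True)
              - A.mean(axis=0, keepdims=True) + A.mean())


Dt, Yt = double_demean(D), double_demean(Y)
b_twfe = (Dt * Yt).sum() / (Dt**2).sum()

# Difference-in-differences of the four group-by-period means
treated, control = Y[:n_treat], Y[n_treat:]
did = ((treated[:, t_star:].mean() - treated[:, :t_star].mean())
       - (control[:, t_star:].mean() - control[:, :t_star].mean()))

print(b_twfe, did)
```

The two numbers coincide up to floating-point error. The equality depends on the balanced panel and the common adoption date assumed here.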