IV — Estimation

Natasha Kang

Xiamen University, Chow Institute

May, 2026

Recap: From Identification to Estimation

Lecture 5a: under exogeneity, the structural parameter is identified by the population Wald ratio:

\[ \beta_1 = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)} \]

We applied this to Angrist (1990) — the draft lottery gave a Wald estimate of -$2,741 (~17% of mean civilian earnings).

This is a statement about population moments. To estimate \(\beta_1\) from data, we replace population moments with sample analogs.

This lecture: develop the asymptotic theory for the just-identified case (the Angrist setting), then extend to 2SLS to handle controls and (later) multiple instruments.

Just-Identified IV: The Sample Estimator

The just-identified case (the Angrist setting from 5a): one \(Z\), one endogenous \(X\), no controls.

Replace population covariances with sample covariances:

\[ \hat\beta_1^{IV} = \frac{\widehat{\text{Cov}}(Z, Y)}{\widehat{\text{Cov}}(Z, X)} \]

This is the just-identified IV estimator: one instrument, one endogenous regressor, no controls.

For binary \(Z\) (e.g., the draft lottery), this simplifies to the Wald estimator:

\[ \hat\beta_1^{IV} = \frac{\bar Y_{Z=1} - \bar Y_{Z=0}}{\bar X_{Z=1} - \bar X_{Z=0}} \]
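The equivalence between the covariance ratio and the Wald ratio for binary \(Z\) can be checked numerically. The sketch below uses a made-up data-generating process (the coefficients, seed, and function names are illustrative assumptions, not Angrist's data) and computes the estimator both ways.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 10_000

# Hypothetical DGP (illustration only): binary Z, confounder U, true beta_1 = 2.
z = rng.integers(0, 2, size=n)
u = rng.normal(size=n)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)   # endogenous: depends on U
y = 1.0 + 2.0 * x + u

def iv_cov_ratio(z, x, y):
    """Just-identified IV: sample Cov(Z, Y) / sample Cov(Z, X)."""
    zc = z - z.mean()
    return (zc @ y) / (zc @ x)

def wald(z, x, y):
    """Wald estimator: ratio of differences in group means."""
    dy = y[z == 1].mean() - y[z == 0].mean()
    dx = x[z == 1].mean() - x[z == 0].mean()
    return dy / dx

beta_cov = iv_cov_ratio(z, x, y)
beta_wald = wald(z, x, y)
print(beta_cov, beta_wald)   # identical up to floating-point error
```

The two numbers agree exactly (not just approximately in large samples) because, for binary \(Z\), both sample covariances are proportional to the corresponding difference in group means with the same factor \(n_1 n_0 / n^2\).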

Does this estimator work? Check consistency next.

Consistency of the IV Estimator

Write the IV estimator as:

\[ \hat\beta_1^{IV} = \beta_1 + \frac{n^{-1}\sum_{i=1}^n (Z_i - \bar{Z}) U_i}{n^{-1}\sum_{i=1}^n (Z_i - \bar{Z}) X_i} \]

Numerator: \(n^{-1}\sum(Z_i - \bar{Z})U_i \xrightarrow{p} \text{Cov}(Z, U) = 0\) by LLN + exogeneity.

Denominator: \(n^{-1}\sum(Z_i - \bar{Z})X_i \xrightarrow{p} \text{Cov}(Z, X) \neq 0\) by LLN + relevance.

By the continuous mapping theorem:

\[ \hat\beta_1^{IV} \xrightarrow{p} \beta_1 + \frac{0}{\text{Cov}(Z, X)} = \beta_1 \]

Regularity: i.i.d. sampling, finite second moments, \(\text{Cov}(Z, X) \neq 0\).
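A small simulation can make the consistency argument concrete. This is a sketch under an assumed data-generating process (true \(\beta_1 = 2\), coefficients chosen only for illustration): as \(n\) grows, the IV estimate settles near 2 while OLS settles near a different value.

```python
import numpy as np

def simulate(n, rng):
    """Hypothetical DGP: Z is exogenous and relevant, X is endogenous via U."""
    z = rng.integers(0, 2, size=n)
    u = rng.normal(size=n)
    x = 0.5 * z + 0.8 * u + rng.normal(size=n)
    y = 1.0 + 2.0 * x + u
    return z, x, y

def iv(z, x, y):
    zc = z - z.mean()
    return (zc @ y) / (zc @ x)

def ols(x, y):
    xc = x - x.mean()
    return (xc @ y) / (xc @ x)

rng = np.random.default_rng(1)
results = {}
for n in (100, 10_000, 1_000_000):
    z, x, y = simulate(n, rng)
    results[n] = (ols(x, y), iv(z, x, y))
    print(n, results[n])   # (OLS, IV) at each sample size
```

In this design the OLS probability limit is \(2 + \text{Cov}(X,U)/\text{Var}(X) = 2 + 0.8/1.7025 \approx 2.47\), so the OLS column should drift toward 2.47 rather than 2.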

Asymptotic Normality

Under the IV conditions, i.i.d. sampling, and finite fourth moments:

\[ \sqrt{n}(\hat\beta_1^{IV} - \beta_1) \xrightarrow{d} N(0, V_{IV}) \]

Derivation sketch: Rewrite

\[ \sqrt{n}(\hat\beta_1^{IV} - \beta_1) = \frac{n^{-1/2}\sum(Z_i - \bar{Z})U_i}{n^{-1}\sum(Z_i - \bar{Z})X_i} \]

By CLT (numerator) + Slutsky (denominator):

\[ V_{IV} = \frac{E[(Z_i - E[Z])^2 U_i^2]}{[\text{Cov}(Z,X)]^2} \]

Note: IV is not unbiased in finite samples — the ratio of two random variables has no closed-form mean. Inference relies on asymptotics.
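In practice \(V_{IV}\) is estimated by plugging in sample analogs. The following is a minimal sketch of that plug-in standard error (the function name and the simulated design are assumptions for illustration); note that the residuals are built with the actual \(X\).

```python
import numpy as np

def iv_with_se(z, x, y):
    """Just-identified IV slope and plug-in SE based on the V_IV formula."""
    n = len(y)
    zc = z - z.mean()
    beta1 = (zc @ y) / (zc @ x)
    beta0 = y.mean() - beta1 * x.mean()
    u_hat = y - beta0 - beta1 * x              # residuals use the actual X
    num = np.mean(zc**2 * u_hat**2)            # analog of E[(Z - E[Z])^2 U^2]
    den = np.mean(zc * x) ** 2                 # analog of Cov(Z, X)^2
    return beta1, np.sqrt(num / den / n)

# Hypothetical DGP in which the population V_IV equals 0.25 / 0.125^2 = 16.
rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, size=n).astype(float)
u = rng.normal(size=n)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

beta_hat, se_hat = iv_with_se(z, x, y)
se_theory = np.sqrt(16.0 / n)
ci = (beta_hat - 1.96 * se_hat, beta_hat + 1.96 * se_hat)
print(beta_hat, se_hat, se_theory, ci)
```

The estimated SE should sit close to the theoretical \(\sqrt{V_{IV}/n}\); the 95% interval is the usual normal-approximation interval and inherits the asymptotic caveat above.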

Angrist’s Setting Is Special

The Angrist (1990) draft lottery is a rare ideal:

  • Identification by randomization: \(Z\) (lottery eligibility) is literally random. Independence, \(Z \perp\!\!\!\perp (Y(z,d), D(z))\), holds by design; no theoretical argument is needed for that part. (Exclusion is still a question: recall 5a’s threats.)
  • \(Z\) is the instrument, not the treatment: non-compliance (deferments, voluntary enlistment) means \(Z \neq D\). That’s why we need IV — to leverage random \(Z\) to identify the effect of endogenous \(D\).

Most empirical questions don’t have a randomized \(Z\). Researchers must construct identification from observational data — finding instruments that are plausibly exogenous, defending the assumption from theory and institutions.

We turn to one such case next: Card (1995).

Card (1995): Background

The question: how much do wages rise with each additional year of schooling — the “return to schooling”?

The data: Card uses the National Longitudinal Survey of Young Men (NLSYM), a panel of US men aged 14–24 in 1966. Headline regressions use the 1976 cross-section, when respondents were 24–34.

  • Outcome \(\ln W_i\): log hourly wage in 1976
  • Regressor \(E_i\): years of completed schooling

Schooling is endogenous — people choose how much education to get, so we can’t naively compare wages across schooling levels.

Why Controls Aren’t Enough

We could try OLS with all the covariates we observe:

\[ \ln W_i = \beta_0 + \beta_1 E_i + \mathbf{W}_i'\boldsymbol{\gamma} + U_i \]

where \(\mathbf{W}_i\) = race, region, parental education, experience (controls).

But \(U_i\) still contains unobservables that drive both schooling and wages:

  • Cognitive ability
  • Motivation, persistence
  • Family environment beyond parental education

So \(\text{Cov}(E_i, U_i \mid \mathbf{W}_i) \neq 0\) — OLS remains biased even after controlling for \(\mathbf{W}\).

An Instrument for Schooling

Card (1995): \(Z = \mathbb{1}\{\text{lived near a 4-year college at age 14}\}\).

  • Relevance: lower cost of attendance → more schooling.
  • Exogeneity: where you grew up shouldn’t directly affect adult wages, conditional on family background controls.

But the bare Wald ratio (just-identified, no controls) doesn’t accommodate \(\mathbf{W}\) in the equation.

We need an estimator that uses \(Z\) to identify the schooling coefficient while including \(\mathbf{W}\) as controls.

Two-Stage Least Squares (2SLS)

Idea: replace \(E\) with the part driven by exogenous variation — then regress \(\ln W\) on that.

Stage 1 (First Stage): regress \(E\) on \((Z, \mathbf{W})\): \[ E_i = \pi_0 + \pi_1 Z_i + \boldsymbol{\delta}'\mathbf{W}_i + V_i \] Take fitted values \(\hat E_i\) — a linear combination of exogenous variables, uncorrelated with \(U\).

Stage 2 (Second Stage): regress \(\ln W\) on \((\hat E, \mathbf{W})\): \[ \ln W_i = \beta_0 + \beta_1 \hat E_i + \boldsymbol{\gamma}'\mathbf{W}_i + \text{error}_i \] The coefficient on \(\hat E\) is the 2SLS estimator \(\hat\beta_1^{2SLS}\).
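The two stages can be written out directly with least squares. This is a sketch on simulated data (all coefficients and variable names are illustrative assumptions, not Card's data); it also cross-checks the result against the just-identified IV formula applied to variables residualized on \(\mathbf{W}\). Only the point estimate is computed here.

```python
import numpy as np

def ols_coefs(X, y):
    """OLS coefficients; X must already contain a constant column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage(y, e, z, W):
    const = np.ones(len(y))
    # Stage 1: regress E on (1, Z, W) and keep the fitted values.
    X1 = np.column_stack([const, z, W])
    e_hat = X1 @ ols_coefs(X1, e)
    # Stage 2: regress Y on (1, E_hat, W); the slope on E_hat is 2SLS.
    X2 = np.column_stack([const, e_hat, W])
    return ols_coefs(X2, y)[1]

def residualize(v, W):
    """Residual of v after regressing on a constant and W."""
    Xw = np.column_stack([np.ones(len(v)), W])
    return v - Xw @ ols_coefs(Xw, v)

# Hypothetical DGP: Z correlated with a control, E endogenous through u.
rng = np.random.default_rng(2)
n = 5_000
W = rng.normal(size=(n, 2))
z = (rng.normal(size=n) + 0.5 * W[:, 0] > 0).astype(float)
u = rng.normal(size=n)
e = 12 + 0.4 * z + W @ np.array([0.5, -0.3]) + 0.7 * u + rng.normal(size=n)
y = 1.0 + 0.1 * e + W @ np.array([0.2, 0.1]) + 0.3 * u

beta_2sls = two_stage(y, e, z, W)
zt, et, yt = (residualize(v, W) for v in (z, e, y))
beta_fwl = (zt @ yt) / (zt @ et)      # IV on residualized variables
print(beta_2sls, beta_fwl)            # equal up to floating-point error
```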

Why 2SLS Works

Decompose \(E = \hat E + V\). The first stage isolates the exogenous part \(\hat E\) (driven by \(Z\) and \(\mathbf{W}\)) from the endogenous part \(V\) (the residual, correlated with \(U\)).

Stage 2 regresses \(\ln W\) on \(\hat E\) — using only the clean variation. The endogenous variation in \(V\) is discarded.

Asymptotic theory carries over: the same LLN + CLT + Slutsky logic from the just-identified case extends to 2SLS — \(\hat\beta_1^{2SLS} \xrightarrow{p} \beta_1\) and is asymptotically normal under analogous conditions.

Worked Example: Card via 2SLS

Setup: Card (1995), \(L = 1\) instrument (proximity), \(k = 1\) endogenous regressor (schooling), with controls \(\mathbf{W}\).

First stage: regress \(E\) on \(Z + \mathbf{W}\):

\[ E_i = \pi_0 + \pi_1 Z_i + \boldsymbol{\delta}'\mathbf{W}_i + V_i \]

Card finds \(\hat\pi_1 \approx 0.32\) — proximity raises schooling by 0.32 years.

Second stage: regress \(\ln W\) on \(\hat E + \mathbf{W}\):

\[ \ln W_i = \beta_0 + \beta_1 \hat E_i + \boldsymbol{\gamma}'\mathbf{W}_i + e_i \]

Result: \(\hat\beta_1^{2SLS} = 0.132\) (SE 0.049), vs. OLS 0.073 (SE 0.004).

The Puzzle: IV \(>\) OLS

Estimator | Card (1995)
\(\hat\beta^{OLS}\) | 0.073
\(\hat\beta^{2SLS}\) | 0.132

IV is almost twice OLS.

OVB would predict the opposite. Ability is omitted; ability raises both schooling and wages \(\Rightarrow\) OLS overstates the return \(\Rightarrow\) \(\hat\beta^{OLS} > \hat\beta^{IV}\).

But \(\hat\beta^{OLS} < \hat\beta^{IV}\) — and the same pattern shows up across education-IV studies. What’s going on?

Explanation 1: ME Attenuates OLS

Measurement error (ME) in reported schooling: \(E_{\text{obs}} = E^* + \varepsilon\), classical (\(\varepsilon \perp\!\!\!\perp E^*, U, Z, \mathbf{W}\)).

By FWL, the OLS slope on \(E_{\text{obs}}\) equals the slope of \(\tilde Y\) on \(\tilde E_{\text{obs}}\) (residuals after partialling out \(\mathbf{W}\)). Since \(\varepsilon \perp \mathbf{W}\): \[ \tilde E_{\text{obs}} = \tilde E^* + \varepsilon. \]

Apply 5a’s bivariate attenuation result to the residualized variables: \[ \hat\beta^{OLS} \xrightarrow{p} \beta \cdot \underbrace{\frac{\text{Var}(\tilde E^*)}{\text{Var}(\tilde E^*) + \text{Var}(\varepsilon)}}_{\text{attenuation factor}\,<\,1}. \]

ME Doesn’t Bias the First Stage

Stage 1: OLS of \(E_{\text{obs}}\) on \((Z, \mathbf{W})\). Population coefficients \((\pi_1, \boldsymbol{\delta})\) are defined by the orthogonality conditions \[ E\!\bigl[\,Z\,(E_{\text{obs}} - \pi_0 - \pi_1 Z - \boldsymbol{\delta}'\mathbf{W})\,\bigr] = 0,\quad E\!\bigl[\,\mathbf{W}\,(\cdot)\,\bigr] = \mathbf{0}. \]

Substitute \(E_{\text{obs}} = E^* + \varepsilon\). Since \(\varepsilon \perp\!\!\!\perp (Z, \mathbf{W})\), \(E[Z\varepsilon] = 0\) and \(E[\mathbf{W}\varepsilon] = \mathbf{0}\) — they cancel: \[ E\!\bigl[\,Z\,(E^* - \pi_0 - \pi_1 Z - \boldsymbol{\delta}'\mathbf{W})\,\bigr] = 0,\quad E\!\bigl[\,\mathbf{W}\,(\cdot)\,\bigr] = \mathbf{0}. \]

These are the moment conditions defining the linear projection of \(E^*\) on \((Z, \mathbf{W})\). So \((\pi_1, \boldsymbol{\delta})\) are identical to the no-ME projection coefficients.

⇒ 2SLS Estimand Unchanged

Since \((\hat\pi_1, \hat{\boldsymbol{\delta}})\) are consistent for the no-ME projection coefficients, the Stage-1 fit \[ \hat E_i \xrightarrow{p} \pi_0 + \pi_1 Z_i + \boldsymbol{\delta}'\mathbf{W}_i = \mathrm{proj}(E^* \mid Z_i, \mathbf{W}_i). \]

Stage 2 receives the same predictor it would have under \(E^*\). By the consistency argument earlier in the lecture, \[ \hat\beta^{2SLS} \xrightarrow{p} \beta, \] identical to the no-ME case.
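The contrast between attenuated OLS and unaffected IV can be seen in a simulation. The sketch below deliberately makes true schooling exogenous (an assumption made only to isolate measurement error) and omits controls; all numbers are hypothetical.

```python
import numpy as np

def slope(x, y):
    xc = x - x.mean()
    return (xc @ y) / (xc @ x)

def iv(z, x, y):
    zc = z - z.mean()
    return (zc @ y) / (zc @ x)

rng = np.random.default_rng(4)
n = 200_000
beta = 0.10

z = rng.integers(0, 2, size=n).astype(float)
e_star = 12 + 1.0 * z + rng.normal(scale=2.0, size=n)   # true schooling
eps = rng.normal(scale=1.0, size=n)                     # classical ME
e_obs = e_star + eps                                    # reported schooling
y = 1.0 + beta * e_star + rng.normal(scale=0.5, size=n)

ols_true = slope(e_star, y)       # infeasible benchmark, no ME
ols_obs = slope(e_obs, y)         # attenuated toward zero
iv_obs = iv(z, e_obs, y)          # uses the mismeasured regressor
attenuation = np.var(e_star) / (np.var(e_star) + np.var(eps))

# First-stage slopes with and without ME share the same probability limit.
pi_star, pi_obs = slope(z, e_star), slope(z, e_obs)
print(ols_true, ols_obs, beta * attenuation, iv_obs, pi_star, pi_obs)
```

Here the population attenuation factor is \(4.25/5.25 \approx 0.81\), so OLS on reported schooling should land near \(0.081\) while IV stays near \(0.10\).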

How Big Is the Noise?

The unconditional noise share \(\text{Var}(\varepsilon)/\text{Var}(E_{\text{obs}})\) is not directly observed. Two empirical strategies:

  • Administrative validation: match self-reports to transcripts or employer records. Discrepancies measure \(\varepsilon\).
  • Twin/sibling cross-reports (Ashenfelter & Krueger 1994): each twin reports their own and their twin’s schooling — under classical ME, discrepancies pin down \(\text{Var}(\varepsilon)\).

Both land at noise share ≈ 10–15% for US years-of-schooling.

Caveat: the with-controls attenuation factor uses \(\text{Var}(\tilde E^*) < \text{Var}(E^*)\). (Why?)

\(\Rightarrow\) With-controls attenuation is more severe than the unconditional noise share suggests.

Even pessimistically — say attenuation factor \(\approx 0.80\) — the implied no-ME OLS would be \(\approx 0.073/0.80 \approx 0.091\), still well short of the IV’s \(0.132\). ME contributes; it doesn’t close the gap.

Explanation 2: LATE \(>\) ATE

Key fact (5a): IV/2SLS identifies the LATE on compliers — those whose treatment is shifted by \(Z\) — not the population ATE.

Card’s compliers: kids whose schooling was swayed by college proximity. By definition, they’re at the cost margin — return \(\approx\) cost.

  • Plausibly liquidity-constrained: cost is what kept them out, not low returns.
  • Compliers sit in the high-return tail of the population \(\Rightarrow\) complier returns exceed the population mean.

\(\Rightarrow \text{LATE}_{\text{Card}} > \text{ATE}\), and \(\hat\beta^{2SLS} > \hat\beta^{OLS}\) even with no OLS bias at all.

The Cost of 2SLS: Variance

2SLS is consistent but less efficient than OLS. Intuitively:

  • OLS identifies the slope from all the variation in \(E\) (after controls).
  • 2SLS identifies it only from the variation in \(E\) driven by \(Z\) — analogous to learning only from compliers (5a).

Less variation used \(\Rightarrow\) wider standard errors. The weaker the instrument’s effect on \(E\) given \(\mathbf{W}\), the smaller the slice 2SLS leverages, and the bigger the precision penalty.

Card illustrates this directly:

Estimator | \(\hat\beta_1\) | SE | 95% CI
OLS | 0.073 | 0.004 | [0.065, 0.081]
2SLS | 0.132 | 0.049 | [0.036, 0.228]

Proximity shifts \(E\) by only ~0.32 years — small relative to \(E\)’s overall variation (SD ~3 years). 2SLS uses only that slice \(\Rightarrow\) SE 12× OLS.

OLS and 2SLS Variances via FWL

By FWL, both slopes on \(E\) reduce to bivariate forms in residualized variables \(\tilde Y, \tilde E, \tilde Z\) (after partialling out \(\mathbf{W}\)).

OLS (homoskedastic): \[ V_{OLS} = \frac{\sigma^2}{\text{Var}(\tilde E)}. \]

2SLS (just-identified, homoskedastic) — the just-id IV asymptotic variance from earlier, applied to residualized variables: \[ V_{2SLS} = \frac{\sigma^2 \cdot \text{Var}(\tilde Z)}{[\,\text{Cov}(\tilde Z, \tilde E)\,]^2}. \]

The Variance Ratio

\[ \frac{V_{2SLS}}{V_{OLS}} = \frac{\text{Var}(\tilde Z)\cdot\text{Var}(\tilde E)}{[\,\text{Cov}(\tilde Z, \tilde E)\,]^2} = \frac{1}{\rho^2_{\tilde Z, \tilde E}} \geq 1. \]

\(\rho^2_{\tilde Z, \tilde E}\) = squared correlation between \(\tilde Z\) and \(\tilde E\) = partial \(R^2\): the share of \(E\)’s variation (net of \(\mathbf{W}\)) that \(Z\) explains.

The “small slice” intuition, made precise: as \(Z\)’s explanatory power for \(E\) given \(\mathbf{W}\) shrinks, the variance ratio explodes.
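As a back-of-envelope reading, the variance ratio can be inverted to ask what partial \(R^2\) Card's reported standard errors would imply. This is only a sketch: it assumes the homoskedastic formulas above apply to the reported SEs, which need not hold for the published estimates.

```python
# Reported standard errors quoted in the lecture (Card 1995).
se_ols, se_2sls = 0.004, 0.049

# Under the homoskedastic formulas, V_2SLS / V_OLS = 1 / rho^2.
var_ratio = (se_2sls / se_ols) ** 2
implied_partial_r2 = 1.0 / var_ratio

def variance_ratio(rho2):
    """Variance penalty of 2SLS relative to OLS for a given partial R^2."""
    return 1.0 / rho2

print(var_ratio, implied_partial_r2)   # about 150 and about 0.0067
for rho2 in (0.5, 0.1, 0.01, 0.001):
    print(rho2, variance_ratio(rho2), variance_ratio(rho2) ** 0.5)
```

The loop shows how quickly the penalty grows: a partial \(R^2\) of 0.01 already inflates the variance by 100 and the standard error by 10.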

OLS vs. IV — Visualization

OLS is centered at the wrong value but precise; IV is centered at the truth but imprecise. IV trades bias for variance.

Multiple Instruments

Researchers often have several candidate instruments for the same regressor. Two reasons this can be valuable:

  • Precision: combining instruments uses more variation in \(E\) → smaller standard errors.
  • Broader complier coverage: different instruments may shift different complier groups, potentially expanding what 2SLS averages over (caveats in two slides).

Terminology (single endogenous regressor): \(L = 1\) instrument = just-identified; \(L > 1\) = overidentified.

2SLS with Multiple Instruments

Same two stages — only Stage 1 changes.

Stage 1: regress \(E\) on all instruments and controls: \[ E_i = \pi_0 + \boldsymbol{\pi}'\mathbf{Z}_i + \boldsymbol{\delta}'\mathbf{W}_i + V_i. \] Take fitted values \(\hat E_i\): the best linear predictor of \(E\) given \((Z_1, \ldots, Z_L)\) and \(\mathbf{W}\).

Stage 2: regress \(\ln W\) on \((\hat E, \mathbf{W})\), identical to the just-identified case: \[ \ln W_i = \beta_0 + \beta_1 \hat E_i + \boldsymbol{\gamma}'\mathbf{W}_i + \text{error}_i. \]

2SLS as a Weighted Average of LATEs

With multiple instruments, 2SLS is asymptotically just-identified IV using the conditional mean \(p(Z) = E[E \mid Z]\) as a scalar (multi-valued) instrument: \[ \beta_1^{2SLS} = \frac{\text{Cov}(\ln W,\, p(Z))}{\text{Var}(p(Z))}. \]

With heterogeneous effects, this is a weighted average of pairwise LATEs across the levels \(p_0 < p_1 < \cdots < p_K\) of \(p(Z)\): \[ \beta_1^{2SLS} = \sum_{k=0}^{K-1} w_k \cdot \text{LATE}(p_k, p_{k+1}). \]

The weights factor as: \[ w_k \;\propto\; \underbrace{\lambda_k}_{\text{instrument-shape factor}} \;\cdot\; \underbrace{[p_{k+1} - p_k]}_{\text{complier mass at margin }k}. \]

Complier Mass at Each Margin

For a binary treatment \(D\) (the LATE setting of 5a), \(p_{k+1} - p_k\) is the rise in \(P(D=1 \mid Z)\) between adjacent levels; under monotonicity, it is the share of compliers who switch into treatment at margin \(k\).

A property of the instrument-treatment relationship. Tells you how much \(D\) moves at each margin.

Leverage Factor \(\lambda_k\)

\(\lambda_k\) depends only on the distribution of \(Z\): \[ \lambda_k = P(Z > z_k) \cdot \{E[Z \mid Z > z_k] - E[Z]\}. \]

A product of two factors moving in opposite directions:

  • \(P(Z > z_k)\) shrinks as the cut rises.
  • \(E[Z\mid Z > z_k] - E[Z]\) grows as the cut rises.

At extreme cuts one factor vanishes. The product peaks at \(z_k = E[Z]\).

No causal content: \(\lambda_k\) depends only on \(Z\)’s distribution. Trim the sample and the weights shift, even though no causal mechanism has changed.
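The shape of \(\lambda_k\) is easy to tabulate for a discrete instrument. The sketch below uses a made-up five-point distribution for \(Z\) and made-up propensities \(p_k\) (both are assumptions for illustration) and applies the formula above at each cut.

```python
import numpy as np

def leverage_factors(values, probs):
    """lambda_k = P(Z > z_k) * (E[Z | Z > z_k] - E[Z]) for each cut z_k."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean_z = values @ probs
    lam = []
    for zk in values[:-1]:                 # no mass above the top value
        above = values > zk
        p_above = probs[above].sum()
        mean_above = (values[above] @ probs[above]) / p_above
        lam.append(p_above * (mean_above - mean_z))
    return np.array(lam)

# Hypothetical distribution of Z with E[Z] = 1.4.
values = [0, 1, 2, 3, 4]
probs = [0.3, 0.3, 0.2, 0.1, 0.1]
lam = leverage_factors(values, probs)
print(lam)                                 # [0.42 0.54 0.42 0.26]

# Hypothetical propensities p_k at each level of Z, giving complier masses.
p = np.array([0.20, 0.30, 0.50, 0.60, 0.65])
w = lam * np.diff(p)
w = w / w.sum()                            # normalized 2SLS weights
print(w)
```

With discrete support the peak falls at the cut just below \(E[Z]\) (here \(z_k = 1\)): moving the cut up past a value below the mean removes a negative term from \(E[(Z - E[Z])\,\mathbb{1}\{Z > z_k\}]\), while moving it past a value above the mean removes a positive one.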

What Variation Identifies \(\beta^{2SLS}\)?

The weight \(w_k \propto \lambda_k \cdot [p_{k+1} - p_k]\) tells us where the identifying variation comes from:

  • Compliers, not the population: only people whose \(D\) moves across some margin of \(p(Z)\) contribute. Always-takers and never-takers are silent.
  • Many complier groups, averaged: each margin \(k\) has its own complier subpopulation. With multiple instruments, \(p(Z)\) has multiple jumps, each indexing a different group. 2SLS pools them.
  • In non-uniform proportions: weights depend on \(\lambda_k\) and \([p_{k+1} - p_k]\), not on how representative each complier group is of the population. A small but high-leverage margin can dominate.

\(\Rightarrow\) \(\beta^{2SLS}\) is a specific weighted aggregation of complier-group-specific effects — not a single well-defined “treatment effect” for the population.

Multiple Instruments \(\neq\) ATE

Two important features of the weights \(w_k\):

  • Not population shares: margins where \(p(Z)\) varies more get more weight, regardless of how representative those compliers are.
  • Coverage may be narrow: if \(p(Z)\) takes few effectively distinct values (e.g., all instruments shift the same margin), the weighted average covers a limited range of compliers.

\(\Rightarrow\) More instruments does not automatically move 2SLS toward ATE.

Example: Angrist & Krueger (1991)

Setting: AK use US Census 1980 data on US-born men born 1930–39 (\(n \approx 313{,}000\)).

  • Outcome \(Y_i\): log weekly wage in 1980
  • Endogenous regressor \(E_i\): years of completed schooling
  • Instrument: quarter of birth (and cohort interactions)

OLS benchmark (with year-of-birth controls): \(\hat\beta^{OLS} = 0.071\) (SE \(0.0003\)).

Why Quarter of Birth?

How can quarter of birth shift schooling?

The mechanism:

  • Districts enroll children in 1st grade based on the calendar year they turn 6.
  • A Q1-born and Q4-born child in the same cohort enter together — but the Q1-born is ~9 months older.
  • Compulsory schooling laws let students drop out at age 16.
  • Q1-borns reach the dropout age with fewer completed grades than Q4-borns.
  • Detrended schooling gap \(\bar E_{\text{Q4}} - \bar E_{\text{Q1}} \approx 0.124\) years.

AK Step 1: Wald IV (Q1 vs the Rest)

AK first run a simple Wald estimator with a single binary instrument: \[ Z_i = \mathbf{1}\{\text{quarter}_i = 1\}. \] This is the just-identified case from 5a — one \(Z\), one endogenous regressor, no controls.

\[ \hat\beta^{IV} \;=\; \frac{\bar Y_{Z=1} - \bar Y_{Z=0}}{\bar E_{Z=1} - \bar E_{Z=0}} \;=\; \frac{-0.0111}{-0.109} \;=\; 0.1020 \quad (\text{SE } 0.0239). \]

Same pattern as Card: IV \(>\) OLS (\(0.1020\) vs \(0.0711\)), the systematic IV-bigger-than-OLS finding across education-IV studies. The SE is also roughly 80 times OLS’s \(0.0003\): IV’s precision cost.

From Wald to 2SLS: Two Changes

AK make two changes to move from the Wald to a more credible specification:

  1. Add cohort controls — year-of-birth dummies enter the wage equation. Schooling and wages drift across cohorts (secular trends, labor-market conditions); the no-controls Wald pools all cohorts.
  2. Replace the single \(Q1\) instrument with \(3 \times 10 = 30\) QoB × cohort interactions: \[ Z_{i,(q,y)} \;=\; \mathbf{1}\{\text{quarter}_i = q\} \cdot \mathbf{1}\{\text{cohort}_i = y\}, \] \(q \in \{1, 2, 3\}\), \(y \in \{1930, \ldots, 1939\}\).

Payoff: cohort-specific first-stage signals add variation to \(\hat E\) → tighter SE.

AK Step 2: 2SLS with 30 Instruments

Specification | Controls | \(\hat\beta_1\) | SE
OLS | none | 0.0709 | 0.0003
OLS | YoB dummies | 0.0711 | 0.0003
Wald (1 IV: \(Z = \mathbf{1}\{\text{Q1}\}\)) | none | 0.1020 | 0.0239
2SLS (30 IVs: QoB × cohort) | YoB dummies | 0.0891 | 0.0161

  • OLS unchanged by YoB controls (0.0709 → 0.0711): cohort confounding is small.
  • IV \(>\) OLS in both Wald and 2SLS — same pattern as Card.
  • 2SLS shrinks SE by ~⅓ (0.0239 → 0.0161): precision gain from multiple instruments.
  • Point estimate moves toward OLS (0.1020 → 0.0891): cohort interactions reweight the LATE.

What Do AK’s IV Estimates Identify?

Both the Wald and the 2SLS aggregate LATEs on kids whose schooling responds to QoB — kids at the dropout margin (compulsory-schooling cutoff at age 16).

  • Wald (Q1 vs rest): pools all cohorts into one \(Q1\)-vs-rest comparison → a single LATE on cohort-pooled compliers.
  • 2SLS (QoB × cohort): each (quarter, cohort) cell contributes its own complier group → weighted average of cohort-specific LATEs.

Both target the same complier population (dropout-margin kids) but aggregate over them differently.

Who is silent in both?

  • Always-takers — kids who go to college regardless of birth quarter.
  • Never-takers — kids who drop out at 16 regardless of quarter.

\(\Rightarrow\) Neither \(\hat\beta^{IV}\) corresponds to the ATE.

A Word of Caution

AK’s \(\hat\beta^{2SLS} = 0.0891\) is the return to compulsory schooling, not the population return to education.

  • Effect on: kids whose schooling is shifted at the dropout margin (compulsory-attendance cutoff at age 16).
  • Right parameter for: policies operating at that margin (e.g., raising the dropout age).
  • Wrong parameter for: policies targeting other groups (college access, gifted students).

Beyond this lecture: the marginal treatment effect (MTE) framework (Heckman-Vytlacil) targets effects for specific subpopulations directly.

Multiple Endogenous Regressors

  • \(\mathbf{X}_i\) (\(k \times 1\)): endogenous regressors.
  • \(\mathbf{Z}_i\) (\(L \times 1\)): instruments.
  • Moment conditions: \(E[\mathbf{Z}_i(Y_i - \mathbf{X}_i'\boldsymbol{\beta})] = \mathbf{0}\).

Identification requires:

Rank condition: \(\mathrm{rank}(E[\mathbf{Z}_i\mathbf{X}_i']) = k\).

The instruments must shift \(X_1, \ldots, X_k\) in linearly independent ways. If two instruments shift \(X_1\) and \(X_2\) identically (e.g., both proxy for the same factor), we can’t separate \(\beta_1\) from \(\beta_2\).

This implies the order condition \(L \geq k\) — a count check, since a matrix with fewer rows than columns can’t have full column rank.

Computing 2SLS with multiple endogenous regressors: run a separate first-stage for each \(X_j\) (using all \(L\) instruments), then plug all fitted values \(\hat X_1, \ldots, \hat X_k\) into the second stage together.
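The recipe above can be sketched in matrix form. In this illustration (simulated data, hypothetical coefficients) the constant is included in both \(\mathbf{X}\) and \(\mathbf{Z}\), so the rank check is applied to the sample analog of \(E[\mathbf{Z}_i\mathbf{X}_i']\) including that column.

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS point estimate: first stages for every column of X, then OLS."""
    Pi = np.linalg.lstsq(Z, X, rcond=None)[0]   # one first stage per column
    X_hat = Z @ Pi                              # all fitted values together
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Hypothetical DGP: k = 2 endogenous regressors, L = 3 excluded instruments.
rng = np.random.default_rng(5)
n = 20_000
z = rng.normal(size=(n, 3))
u = rng.normal(size=n)
x1 = z[:, 0] + 0.5 * z[:, 2] + 0.6 * u + rng.normal(size=n)
x2 = z[:, 1] - 0.5 * z[:, 2] + 0.4 * u + rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + u

const = np.ones(n)
X = np.column_stack([const, x1, x2])    # constant acts as its own instrument
Z = np.column_stack([const, z])

# Sample rank condition: Z'X / n must have full column rank.
rank = np.linalg.matrix_rank(Z.T @ X / n)
beta_hat = tsls(y, X, Z)
print(rank, beta_hat)                   # rank 3; estimates near (1, 0.5, -0.3)
```

A numerical rank check only flags exact or near-exact failures; instruments that shift the regressors in almost the same direction can pass it and still be weak.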

Diagnostics: Before You Trust an IV

Three diagnostic questions for any IV estimate:

  1. Are the instruments valid? Joint exogeneity is mostly defended from theory, but we have one partial empirical check — Sargan/Hansen.
  2. Do we need IV? Is the regressor really endogenous, or would OLS suffice? Check with the Hausman test.
  3. Is the instrument strong enough? Weak instruments distort point estimates and inference at once. Check the first-stage \(F\) (Staiger-Stock).

Big caveat: none of these validates the design. Exogeneity is ultimately a theoretical claim; the tests catch some failures, not all.

Overidentification Test (Sargan/Hansen)

Why care? IV’s identification rests on assumptions — \(Z\) exogenous, excluded — defended from theory, not directly testable. With \(L > k\) instruments, we get one empirical check: if all are valid, they should give consistent estimates.

The test: regress the 2SLS residuals \(\hat U_i = Y_i - \mathbf{X}_i'\hat\beta^{2SLS}\) on the instruments \(\mathbf{Z}_i\). Under \(H_0\) (all instruments valid), residuals should be uncorrelated with \(\mathbf{Z}_i\) — so the \(R^2\) from this regression should be near zero.

Statistic: \(J = n \cdot R^2 \;\xrightarrow{d}\; \chi^2_{L - k}\) under \(H_0\).

Falsification, not validation: rejection signals at least one invalid instrument. Non-rejection means the data don’t contradict joint validity — it isn’t proof.
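The \(n \cdot R^2\) recipe can be sketched as follows. This is the homoskedastic (Sargan) form on simulated data with one endogenous regressor and two instruments; the design, seed, and function names are assumptions. In the second dataset one instrument enters the outcome equation directly, violating exclusion.

```python
import numpy as np

def tsls(y, X, Z):
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

def sargan_j(y, X, Z):
    """J = n * R^2 from regressing 2SLS residuals on all instruments."""
    u_hat = y - X @ tsls(y, X, Z)               # residuals with actual X
    fit = Z @ np.linalg.lstsq(Z, u_hat, rcond=None)[0]
    ss_res = np.sum((u_hat - fit) ** 2)
    ss_tot = np.sum((u_hat - u_hat.mean()) ** 2)
    return len(y) * (1.0 - ss_res / ss_tot)

def simulate(direct_effect, rng, n=5_000):
    """direct_effect != 0 means the second instrument violates exclusion."""
    z = rng.normal(size=(n, 2))
    u = rng.normal(size=n)
    x = z[:, 0] + z[:, 1] + 0.5 * u + rng.normal(size=n)
    y = 1.0 + 0.5 * x + direct_effect * z[:, 1] + u
    const = np.ones(n)
    return y, np.column_stack([const, x]), np.column_stack([const, z])

CRIT_5PCT = 3.84    # chi-square(1) critical value, L - k = 2 - 1 = 1
rng = np.random.default_rng(6)
j_valid = sargan_j(*simulate(0.0, rng))
j_invalid = sargan_j(*simulate(0.5, rng))
print(j_valid, j_invalid, CRIT_5PCT)
```

With heteroskedastic or clustered errors, a robust (Hansen) version of the statistic would be needed instead; that is not shown here.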

Endogeneity Test (Hausman)

Question: do we even need IV? Or is OLS fine?

Idea: under \(H_0: \text{Cov}(X, U) = 0\) (\(X\) exogenous), both OLS and IV are consistent — they should agree. If they don’t, \(X\) is endogenous and OLS is biased.

Test statistic:

\[ H = (\hat\beta^{IV} - \hat\beta^{OLS})' (\hat V^{IV} - \hat V^{OLS})^{-1} (\hat\beta^{IV} - \hat\beta^{OLS}) \xrightarrow{d} \chi^2_k \]

where \(k\) is the number of endogenous regressors. Reject \(\Rightarrow\) \(X\) is endogenous, IV is needed.

Caveat: even with a valid, reasonably strong instrument, the test is often underpowered. IV’s SE is intrinsically much larger than OLS’s (the \(1/\rho^2\) penalty), so substantively large OLS–IV gaps can still fail to reject statistically. The test also relies on IV being valid in the first place — if it isn’t, the comparison is meaningless.

Hausman — Card Numerical Example

Apply the formula to Card’s \(\hat\beta^{OLS} = 0.073\) (SE 0.004) and \(\hat\beta^{IV} = 0.132\) (SE 0.049): \[ H = \frac{(0.132 - 0.073)^2}{0.049^2 - 0.004^2} \approx 1.46 \;<\; \chi^2_{1, 0.05} = 3.84. \] Fail to reject.

But: IV is 80% larger than OLS — a substantively huge gap. The test fails to reject only because IV’s SE is so wide. Hausman is underpowered when IV is imprecise — it can’t detect endogeneity that IV’s precision is too low to resolve.

Don’t lean on Hausman as the primary diagnostic for whether to use IV. It tells you more about IV’s precision than about \(E\)’s endogeneity.
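The arithmetic above, and the power point it illustrates, can be reproduced in a few lines. The second calculation asks how small IV's standard error would have to be for the same OLS–IV gap to reject at 5%; it is a sketch that holds the point estimates and the OLS SE fixed.

```python
def hausman_scalar(b_iv, se_iv, b_ols, se_ols):
    """Scalar Hausman statistic: squared gap over difference in variances."""
    return (b_iv - b_ols) ** 2 / (se_iv ** 2 - se_ols ** 2)

CRIT_5PCT = 3.84    # chi-square(1) critical value quoted in the text

H = hausman_scalar(0.132, 0.049, 0.073, 0.004)

# Largest IV standard error at which the same gap would still reject.
gap = 0.132 - 0.073
se_iv_needed = (gap ** 2 / CRIT_5PCT + 0.004 ** 2) ** 0.5

print(round(H, 2), round(se_iv_needed, 4))   # 1.46 0.0304
```

So the same 0.059 gap would reject only if IV's SE were about 0.030 or smaller, well below the reported 0.049.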

What If the Instrument Is Weak?

Recall: \[ V_{2SLS} = \frac{\sigma^2 \cdot \text{Var}(\tilde Z)}{[\,\text{Cov}(\tilde Z, \tilde E)\,]^2} = \frac{\sigma^2}{\rho^2_{\tilde Z, \tilde E} \cdot \text{Var}(\tilde E)} \]

As \(\rho^2_{\tilde Z, \tilde E} \to 0\) (weak instrument):

  • \(V_{2SLS} \to \infty\): standard errors explode.
  • But the problem is worse than large variance.

Bias and Non-Normality

\(F\) = first-stage \(F\)-statistic for \(H_0: \boldsymbol{\pi} = \mathbf{0}\) in \(X_i = \mathbf{Z}_i'\boldsymbol{\pi} + V_i\). Larger \(F\) ⇒ stronger instrument(s).

As the first stage weakens, \(\hat\beta^{2SLS}\) shifts toward OLS and its distribution becomes non-normal and heavy-tailed. Standard confidence intervals lose their nominal coverage.

The Staiger-Stock Rule

First-stage \(F\)-statistic: joint test of \(H_0: \boldsymbol{\pi} = \mathbf{0}\) in \[ X_i = \mathbf{Z}_i'\boldsymbol{\pi} + \mathbf{W}_i'\boldsymbol{\delta} + V_i. \]

Rule of thumb (Staiger & Stock 1997): \(F < 10\) ⇒ weak instrument. The cutoff is calibrated so that finite-sample 2SLS bias is at most ~10% of OLS bias.

Always report \(F\). The single most important diagnostic for IV.
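The first-stage \(F\) can be computed by comparing the first stage with and without the instruments. The sketch below uses the classical homoskedastic \(F\) on simulated data (the two designs are hypothetical); with robust or clustered errors one would report the corresponding robust statistic instead.

```python
import numpy as np

def ssr(X, v):
    """Sum of squared residuals from regressing v on the columns of X."""
    resid = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return resid @ resid

def first_stage_f(x, Z, W):
    """Homoskedastic F for H0: all instrument coefficients are zero."""
    n = len(x)
    Xr = np.column_stack([np.ones(n), W])      # restricted: controls only
    Xu = np.column_stack([Xr, Z])              # unrestricted: add instruments
    q = Xu.shape[1] - Xr.shape[1]              # number of instruments
    return ((ssr(Xr, x) - ssr(Xu, x)) / q) / (ssr(Xu, x) / (n - Xu.shape[1]))

rng = np.random.default_rng(8)
n = 2_000
W = rng.normal(size=(n, 2))
z = rng.normal(size=n)
v = rng.normal(size=n)
x_strong = 0.50 * z + W @ np.array([0.3, -0.2]) + v   # strong first stage
x_weak = 0.02 * z + W @ np.array([0.3, -0.2]) + v     # nearly irrelevant Z

f_strong = first_stage_f(x_strong, z, W)
f_weak = first_stage_f(x_weak, z, W)
print(f_strong, f_weak)    # compare each with the rule-of-thumb cutoff of 10
```

With a single instrument this \(F\) equals the squared \(t\)-statistic on \(Z\) in the first stage.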

Beyond Staiger–Stock: Weak-IV Inference

When the first-stage \(F\) is in the danger zone (~5–15) and a stronger instrument isn’t available — common with natural-experiment IVs (one lottery, one policy discontinuity) — modern weak-IV inference provides confidence sets that are valid regardless of instrument strength.

  • Anderson–Rubin (AR) confidence set: a confidence set whose coverage is correct regardless of first-stage strength (instrument exogeneity is still required). It approaches the standard 2SLS CI under a strong first stage and widens (possibly becoming unbounded) when the first stage is weak.
  • Lee–McCrary–Moreira–Porter (2022) \(tF\) correction: keep the usual \(\hat\beta_1/\text{SE}\) statistic, but compare it to a critical value \(c(F)\) that depends on the first-stage \(F\), instead of \(1.96\). \(c(F) \approx 1.96\) when the first stage is strong; \(c(F)\) is larger (CI wider) when it is weak.

Practical recommendation: when \(F\) is borderline, report AR (or \(tF\)) intervals alongside standard 2SLS CIs. If they agree, inference is robust; if they diverge, the AR/\(tF\) interval is the trustworthy one.
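An AR confidence set can be built by test inversion: for each candidate \(\beta_0\), regress \(Y - \beta_0 X\) on \(Z\) and keep the values of \(\beta_0\) for which the coefficient on \(Z\) is not significantly different from zero. The sketch below covers the just-identified, homoskedastic case without controls, uses a grid search and the \(\chi^2_1\) critical value 3.84, and runs on simulated data; these are all simplifying assumptions.

```python
import numpy as np

def ar_stat(beta0, y, x, z):
    """Squared t-statistic on Z in a regression of (Y - beta0 * X) on (1, Z)."""
    n = len(y)
    r = y - beta0 * x
    Zm = np.column_stack([np.ones(n), z])
    coef = np.linalg.lstsq(Zm, r, rcond=None)[0]
    resid = r - Zm @ coef
    sigma2 = resid @ resid / (n - 2)
    var_slope = sigma2 / np.sum((z - z.mean()) ** 2)
    return coef[1] ** 2 / var_slope

def ar_confidence_set(y, x, z, grid, crit=3.84):
    """Grid points beta0 that the AR test does not reject at the 5% level."""
    return np.array([b for b in grid if ar_stat(b, y, x, z) <= crit])

# Hypothetical DGP with a reasonably strong instrument and true beta_1 = 2.
rng = np.random.default_rng(9)
n = 2_000
z = rng.integers(0, 2, size=n).astype(float)
u = rng.normal(size=n)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

zc = z - z.mean()
beta_iv = (zc @ y) / (zc @ x)
ar_set = ar_confidence_set(y, x, z, np.linspace(0.0, 4.0, 401))
print(beta_iv, ar_set.min(), ar_set.max())
```

The AR statistic is exactly zero at \(\beta_0 = \hat\beta^{IV}\), so the point estimate always lies inside the set. With a weak first stage the accepted grid points can run to the edge of the grid, which is the finite-grid symptom of an unbounded set.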

Practical Workflow

When reporting IV estimates, always include:

  1. First-stage \(F\)-statistic (Staiger-Stock: \(F > 10\)).
    • When \(F\) is in the danger zone, also report AR (or \(tF\)) CIs alongside standard 2SLS CIs.
  2. Robust (or cluster-robust) standard errors.
  3. OLS estimate alongside IV — for transparency and Hausman comparison.

If overidentified (\(L > k\)):

  1. Sargan/Hansen \(J\)-test for overidentification.
  2. Just-identified estimates for each instrument separately, to check stability.

Defend exogeneity from theory — the most important step. Tests are partial diagnostics; the data alone cannot validate identification.

Computing 2SLS Correctly

A final implementation note — at the software level, not the theory level. If you compute 2SLS literally as two OLS regressions in sequence, the Stage-2 slope is correct, but the Stage-2 reported SE is wrong.

The Stage-2 OLS routine builds residuals using \(\hat E_i\): \[ \hat e_i = Y_i - \hat\beta_0 - \hat\beta^{2SLS}\,\hat E_i - \hat{\boldsymbol{\gamma}}'\mathbf{W}_i. \] But the correct \(V_{2SLS}\) uses residuals with the actual endogenous regressor: \[ \hat U_i = Y_i - \hat\beta_0 - \hat\beta^{2SLS}\, E_i - \hat{\boldsymbol{\gamma}}'\mathbf{W}_i. \]
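The difference between the two residual definitions can be made visible on simulated data. This sketch (hypothetical design, homoskedastic SE formula, no controls) runs the two stages by hand and computes the slope's standard error with each set of residuals.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
z = rng.integers(0, 2, size=n).astype(float)
u = rng.normal(size=n)
e = 0.5 * z + 0.8 * u + rng.normal(size=n)      # endogenous regressor
y = 1.0 + 2.0 * e + u

const = np.ones(n)
Z1 = np.column_stack([const, z])
e_hat = Z1 @ np.linalg.lstsq(Z1, e, rcond=None)[0]   # Stage 1 fitted values
X2 = np.column_stack([const, e_hat])
b = np.linalg.lstsq(X2, y, rcond=None)[0]            # Stage 2 coefficients

def se_slope(resid):
    """Homoskedastic SE of the Stage-2 slope for a given residual vector."""
    sigma2 = resid @ resid / (n - 2)
    return np.sqrt(sigma2 / np.sum((e_hat - e_hat.mean()) ** 2))

se_naive = se_slope(y - X2 @ b)                 # residuals built from E_hat
se_correct = se_slope(y - b[0] - b[1] * e)      # residuals built from actual E
print(b[1], se_naive, se_correct)
```

In this particular design the naive SE is far too large; in other designs it can be too small. The direction depends on \(\hat\beta^{2SLS}\) and on how the first-stage residual covaries with \(U\), which is why the manual two-step SE should not be reported.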

Use the built-in IV command — it computes the correct \(V_{2SLS}\) internally:

  • Stata: ivregress 2sls Y W (X = Z), robust
  • R: ivreg(Y ~ X + W | Z + W, data = ...)
  • Python (linearmodels): IV2SLS(Y, exog, X, Z).fit(), where exog holds W plus a constant

What’s Next

Lecture 6a — Panel Regression:

  • A new dimension of variation: panel data
  • Fixed effects and within-unit identification
  • Strict exogeneity
  • Continuous treatment (ADH China shock)

Lecture 6b — Difference-in-Differences:

  • The canonical panel design with binary, group-structured treatment
  • Parallel trends as the identifying assumption
  • The 2×2 case and its event-study generalization