Xiamen University, Chow Institute
May, 2026
Recap of Lecture 5a: under exogeneity and relevance, the structural parameter is identified by the population Wald ratio:
\[ \beta_1 = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)} \]
We applied this to Angrist (1990) — the draft lottery gave a Wald estimate of -$2,741 (~17% of mean civilian earnings).
This is a statement about population moments. To estimate \(\beta_1\) from data, we replace population moments with sample analogs.
This lecture: develop the asymptotic theory for the just-identified case (the Angrist setting), then extend to 2SLS to handle controls and (later) multiple instruments.
The just-identified case (the Angrist setting from 5a): one \(Z\), one endogenous \(X\), no controls.
Replace population covariances with sample covariances:
\[ \hat\beta_1^{IV} = \frac{\widehat{\text{Cov}}(Z, Y)}{\widehat{\text{Cov}}(Z, X)} \]
This is the just-identified IV estimator: one instrument, one endogenous regressor, no controls.
For binary \(Z\) (e.g., the draft lottery), this simplifies to the Wald estimator:
\[ \hat\beta_1^{IV} = \frac{\bar Y_{Z=1} - \bar Y_{Z=0}}{\bar X_{Z=1} - \bar X_{Z=0}} \]
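As a quick check of this simplification, the sketch below simulates a binary instrument and compares the covariance-ratio form with the difference-in-means (Wald) form. The data-generating process (true slope 2.0, first-stage shift 0.5, the error loadings) is an assumption chosen only for illustration, not taken from Angrist (1990).

```python
import random

random.seed(0)
n = 5000
beta1 = 2.0  # assumed true slope, for illustration only

Z, X, Y = [], [], []
for _ in range(n):
    z = 1.0 if random.random() < 0.5 else 0.0   # binary instrument
    u = random.gauss(0, 1)                       # structural error
    x = 0.5 * z + 0.8 * u + random.gauss(0, 1)   # endogenous: X loads on u
    Z.append(z)
    X.append(x)
    Y.append(1.0 + beta1 * x + u)

def cov(a, b):
    """Sample covariance with a 1/n normalisation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def group_mean(v, g):
    """Mean of v among observations with Z == g."""
    sel = [vi for vi, zi in zip(v, Z) if zi == g]
    return sum(sel) / len(sel)

# Form 1: ratio of sample covariances
beta_cov = cov(Z, Y) / cov(Z, X)

# Form 2: Wald ratio of differences in group means
beta_wald = (group_mean(Y, 1.0) - group_mean(Y, 0.0)) / (
    group_mean(X, 1.0) - group_mean(X, 0.0)
)
print(f"covariance ratio = {beta_cov:.6f}, Wald = {beta_wald:.6f}")
```

The two numbers agree up to floating-point error because, for binary \(Z\), both sample covariances equal \(\bar Z(1-\bar Z)\) times the corresponding difference in group means.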
Does this estimator work? Check consistency next.
Write the IV estimator as:
\[ \hat\beta_1^{IV} = \beta_1 + \frac{n^{-1}\sum_{i=1}^n (Z_i - \bar{Z}) U_i}{n^{-1}\sum_{i=1}^n (Z_i - \bar{Z}) X_i} \]
Numerator: \(n^{-1}\sum(Z_i - \bar{Z})U_i \xrightarrow{p} \text{Cov}(Z, U) = 0\) by LLN + exogeneity.
Denominator: \(n^{-1}\sum(Z_i - \bar{Z})X_i \xrightarrow{p} \text{Cov}(Z, X) \neq 0\) by LLN + relevance.
By the continuous mapping theorem:
\[ \hat\beta_1^{IV} \xrightarrow{p} \beta_1 + \frac{0}{\text{Cov}(Z, X)} = \beta_1 \]
Regularity: i.i.d. sampling, finite second moments, \(\text{Cov}(Z, X) \neq 0\).
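The consistency argument can be illustrated with a small simulation. This is a minimal sketch under an assumed linear model in which \(\text{Cov}(X, U) = 0.8\) and \(\text{Var}(X) = 2.64\), so OLS should settle near \(2 + 0.8/2.64 \approx 2.30\) while IV should approach the assumed true value of 2.

```python
import random

def simulate(n, seed, beta1=2.0):
    """Draw n observations from an assumed linear IV model with an endogenous X."""
    rng = random.Random(seed)
    Z, X, Y = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        u = rng.gauss(0, 1)
        x = z + 0.8 * u + rng.gauss(0, 1)  # Cov(X, U) = 0.8, so OLS is inconsistent
        Z.append(z)
        X.append(x)
        Y.append(beta1 * x + u)
    return Z, X, Y

def cov(a, b):
    """Sample covariance with a 1/n normalisation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

for n in (100, 1_000, 10_000, 100_000):
    Z, X, Y = simulate(n, seed=n)
    b_ols = cov(X, Y) / cov(X, X)
    b_iv = cov(Z, Y) / cov(Z, X)
    print(f"n={n:>6}  OLS={b_ols:.3f}  IV={b_iv:.3f}")
```

As \(n\) grows, the OLS column stabilises at the wrong value and the IV column tightens around the truth, which is the pattern the LLN argument predicts.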
Under the IV conditions, i.i.d. sampling, and finite fourth moments:
\[ \sqrt{n}(\hat\beta_1^{IV} - \beta_1) \xrightarrow{d} N(0, V_{IV}) \]
Derivation sketch: Rewrite
\[ \sqrt{n}(\hat\beta_1^{IV} - \beta_1) = \frac{n^{-1/2}\sum(Z_i - \bar{Z})U_i}{n^{-1}\sum(Z_i - \bar{Z})X_i} \]
By CLT (numerator) + Slutsky (denominator):
\[ V_{IV} = \frac{E[(Z_i - E[Z])^2 U_i^2]}{[\text{Cov}(Z,X)]^2} \]
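Note: IV is not unbiased in finite samples. The estimator is a ratio of two random variables, so its expectation is not the ratio of the expectations, and in the just-identified case the finite-sample mean need not even exist. Inference relies on asymptotics.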
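To make the variance formula concrete, here is a hedged sketch of a plug-in standard error: replace \(U_i\) with the IV residual and the population moments in \(V_{IV}\) with sample analogs, then compare the result with the Monte Carlo spread of the estimator. The function name `iv_with_se` and the simulated design are illustrative assumptions.

```python
import math
import random

def iv_with_se(Z, X, Y):
    """Just-identified IV slope with a plug-in (heteroskedasticity-robust) standard error."""
    n = len(Z)
    zb, xb, yb = sum(Z) / n, sum(X) / n, sum(Y) / n
    s_zx = sum((z - zb) * (x - xb) for z, x in zip(Z, X)) / n
    s_zy = sum((z - zb) * (y - yb) for z, y in zip(Z, Y)) / n
    b1 = s_zy / s_zx
    b0 = yb - b1 * xb
    # Numerator of V_IV: sample mean of (Z - Zbar)^2 * Uhat^2
    num = sum(((z - zb) * (y - b0 - b1 * x)) ** 2 for z, x, y in zip(Z, X, Y)) / n
    return b1, math.sqrt(num / s_zx ** 2 / n)

rng = random.Random(1)
n, reps = 1000, 200
estimates, std_errors = [], []
for rep in range(reps):
    Z = [rng.gauss(0, 1) for _ in range(n)]
    U = [rng.gauss(0, 1) for _ in range(n)]
    X = [z + 0.8 * u + rng.gauss(0, 1) for z, u in zip(Z, U)]
    Y = [2.0 * x + u for x, u in zip(X, U)]  # assumed true slope: 2.0
    b1, se = iv_with_se(Z, X, Y)
    estimates.append(b1)
    std_errors.append(se)

mc_mean = sum(estimates) / reps
mc_sd = math.sqrt(sum((b - mc_mean) ** 2 for b in estimates) / (reps - 1))
avg_se = sum(std_errors) / reps
print(f"Monte Carlo SD = {mc_sd:.4f}, average plug-in SE = {avg_se:.4f}")
```

With a strong instrument, as assumed here, the two numbers should be close. The agreement is expected to deteriorate as the first stage weakens.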
The Angrist (1990) draft lottery is a rare ideal:
Most empirical questions don’t have a randomized \(Z\). Researchers must construct identification from observational data — finding instruments that are plausibly exogenous, defending the assumption from theory and institutions.
We turn to one such case next: Card (1995).
The question: how much do wages rise with each additional year of schooling — the “return to schooling”?
The data: Card uses the National Longitudinal Survey of Young Men (NLSYM), a panel of US men aged 14–24 in 1966. Headline regressions use the 1976 cross-section, when respondents were 24–34.
Schooling is endogenous — people choose how much education to get, so we can’t naively compare wages across schooling levels.
We could try OLS with all the covariates we observe:
\[ \ln W_i = \beta_0 + \beta_1 E_i + \mathbf{W}_i'\boldsymbol{\gamma} + U_i \]
where \(\mathbf{W}_i\) = race, region, parental education, experience (controls).
But \(U_i\) still contains unobservables that drive both schooling and wages:
So \(\text{Cov}(E_i, U_i \mid \mathbf{W}_i) \neq 0\) — OLS remains biased even after controlling for \(\mathbf{W}\).
Card (1995): \(Z = \mathbb{1}\{\text{lived near a 4-year college at age 14}\}\).
But the bare Wald ratio (just-identified, no controls) doesn’t accommodate \(\mathbf{W}\) in the equation.
We need an estimator that uses \(Z\) to identify the schooling coefficient while including \(\mathbf{W}\) as controls.
Idea: replace \(E\) with the part driven by exogenous variation — then regress \(\ln W\) on that.
Stage 1 (First Stage): regress \(E\) on \((Z, \mathbf{W})\): \[ E_i = \pi_0 + \pi_1 Z_i + \boldsymbol{\delta}'\mathbf{W}_i + V_i \] Take fitted values \(\hat E_i\) — a linear combination of exogenous variables, uncorrelated with \(U\).
Stage 2 (Second Stage): regress \(\ln W\) on \((\hat E, \mathbf{W})\): \[ \ln W_i = \beta_0 + \beta_1 \hat E_i + \boldsymbol{\gamma}'\mathbf{W}_i + \text{error}_i \] The coefficient on \(\hat E\) is the 2SLS estimator \(\hat\beta_1^{2SLS}\).
Decompose \(E = \hat E + V\). The first stage isolates the exogenous part \(\hat E\) (driven by \(Z\) and \(\mathbf{W}\)) from the endogenous part \(V\) (the residual, correlated with \(U\)).
Stage 2 regresses \(\ln W\) on \(\hat E\) — using only the clean variation. The endogenous variation in \(V\) is discarded.
Asymptotic theory carries over: the same LLN + CLT + Slutsky logic from the just-identified case extends to 2SLS — \(\hat\beta_1^{2SLS} \xrightarrow{p} \beta_1\) and is asymptotically normal under analogous conditions.
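The two stages can be sketched literally as two OLS regressions. The example below uses only the Python standard library, with a small hand-written `ols` helper (an illustrative assumption, not a library routine) and simulated data in which one unobservable drives both schooling and wages. The coefficient values are made up for the illustration and are not Card's.

```python
import random

def ols(y, rows):
    """OLS coefficients from the normal equations (X'X) b = X'y, by Gauss-Jordan elimination."""
    k = len(rows[0])
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(2)
n, beta1 = 5000, 1.0  # assumed true coefficient on schooling
lnW, E, Z, W = [], [], [], []
for _ in range(n):
    w = rng.gauss(0, 1)                       # observed control
    z = 1.0 if rng.random() < 0.5 else 0.0    # binary instrument
    u = rng.gauss(0, 1)                       # unobservable driving both E and lnW
    e = 1.0 * z + 0.5 * w + 0.8 * u + rng.gauss(0, 1)
    Z.append(z)
    W.append(w)
    E.append(e)
    lnW.append(1.0 + beta1 * e + 0.5 * w + u)

# Naive OLS of lnW on (1, E, W)
b_ols = ols(lnW, [[1.0, e, w] for e, w in zip(E, W)])[1]

# Stage 1: E on (1, Z, W); keep the fitted values
pi = ols(E, [[1.0, z, w] for z, w in zip(Z, W)])
E_hat = [pi[0] + pi[1] * z + pi[2] * w for z, w in zip(Z, W)]

# Stage 2: lnW on (1, E_hat, W); the coefficient on E_hat is the 2SLS estimate
b_2sls = ols(lnW, [[1.0, eh, w] for eh, w in zip(E_hat, W)])[1]
print(f"OLS = {b_ols:.3f}, 2SLS = {b_2sls:.3f}, truth = {beta1}")
```

Only the point estimate from this manual procedure should be trusted. The standard errors a Stage-2 OLS routine would report are not the correct 2SLS standard errors.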
Setup: Card (1995), \(L = 1\) instrument (proximity), \(k = 1\) endogenous regressor (schooling), with controls \(\mathbf{W}\).
First stage: regress \(E\) on \(Z + \mathbf{W}\):
\[ E_i = \pi_0 + \pi_1 Z_i + \boldsymbol{\delta}'\mathbf{W}_i + V_i \]
Card finds \(\hat\pi_1 \approx 0.32\) — proximity raises schooling by 0.32 years.
Second stage: regress \(\ln W\) on \(\hat E + \mathbf{W}\):
\[ \ln W_i = \beta_0 + \beta_1 \hat E_i + \boldsymbol{\gamma}'\mathbf{W}_i + e_i \]
Result: \(\hat\beta_1^{2SLS} = 0.132\) (SE 0.049), vs. OLS 0.073 (SE 0.004).
| Card (1995) | |
|---|---|
| \(\hat\beta^{OLS}\) | 0.073 |
| \(\hat\beta^{2SLS}\) | 0.132 |
IV is almost twice OLS.
OVB would predict the opposite. Ability is omitted; ability raises both schooling and wages \(\Rightarrow\) OLS overstates the return \(\Rightarrow\) \(\hat\beta^{OLS} > \hat\beta^{IV}\).
But \(\hat\beta^{OLS} < \hat\beta^{IV}\) — and the same pattern shows up across education-IV studies. What’s going on?
\(E_{\text{obs}} = E^* + \varepsilon\), classical (\(\varepsilon \perp\!\!\!\perp E^*, U, Z, \mathbf{W}\)).
By FWL, the OLS slope on \(E_{\text{obs}}\) equals the slope of \(\tilde Y\) on \(\tilde E_{\text{obs}}\) (residuals after partialling out \(\mathbf{W}\)). Since \(\varepsilon \perp \mathbf{W}\): \[ \tilde E_{\text{obs}} = \tilde E^* + \varepsilon. \]
Apply 5a’s bivariate attenuation result to the residualized variables: \[ \hat\beta^{OLS} \xrightarrow{p} \beta \cdot \underbrace{\frac{\text{Var}(\tilde E^*)}{\text{Var}(\tilde E^*) + \text{Var}(\varepsilon)}}_{\text{attenuation factor}\,<\,1}. \]
Stage 1: OLS of \(E_{\text{obs}}\) on \((Z, \mathbf{W})\). Population coefficients \((\pi_1, \boldsymbol{\delta})\) are defined by the orthogonality conditions \[ E\!\bigl[\,Z\,(E_{\text{obs}} - \pi_0 - \pi_1 Z - \boldsymbol{\delta}'\mathbf{W})\,\bigr] = 0,\quad E\!\bigl[\,\mathbf{W}\,(\cdot)\,\bigr] = \mathbf{0}. \]
Substitute \(E_{\text{obs}} = E^* + \varepsilon\). Since \(\varepsilon \perp\!\!\!\perp (Z, \mathbf{W})\), \(E[Z\varepsilon] = 0\) and \(E[\mathbf{W}\varepsilon] = \mathbf{0}\) — they cancel: \[ E\!\bigl[\,Z\,(E^* - \pi_0 - \pi_1 Z - \boldsymbol{\delta}'\mathbf{W})\,\bigr] = 0,\quad E\!\bigl[\,\mathbf{W}\,(\cdot)\,\bigr] = \mathbf{0}. \]
These are the moment conditions defining the linear projection of \(E^*\) on \((Z, \mathbf{W})\). So \((\pi_1, \boldsymbol{\delta})\) are identical to the no-ME projection coefficients.
Since \((\hat\pi_1, \hat{\boldsymbol{\delta}})\) are consistent for the no-ME projection coefficients, the Stage-1 fit \[ \hat E_i \xrightarrow{p} \pi_0 + \pi_1 Z_i + \boldsymbol{\delta}'\mathbf{W}_i = \mathrm{proj}(E^* \mid Z_i, \mathbf{W}_i). \]
Stage 2 receives the same predictor it would have under \(E^*\). By the consistency argument earlier in the lecture, \[ \hat\beta^{2SLS} \xrightarrow{p} \beta, \] identical to the no-ME case.
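The contrast between the two results (OLS attenuated, 2SLS unaffected) can be seen in a stripped-down simulation. This sketch assumes no controls and no ability bias, so that classical measurement error is the only problem. The variances are chosen so that the attenuation factor is \(2/(2+0.5) = 0.8\).

```python
import math
import random

def cov(a, b):
    """Sample covariance with a 1/n normalisation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

rng = random.Random(5)
n, beta = 20000, 1.0  # assumed true return, for illustration
Z = [rng.gauss(0, 1) for _ in range(n)]
E_true = [z + rng.gauss(0, 1) for z in Z]                     # Var(E*) = 2
E_obs = [e + rng.gauss(0, math.sqrt(0.5)) for e in E_true]    # Var(eps) = 0.5
Y = [beta * e + rng.gauss(0, 1) for e in E_true]              # outcome depends on true E*

b_ols = cov(E_obs, Y) / cov(E_obs, E_obs)   # attenuated toward zero
b_iv = cov(Z, Y) / cov(Z, E_obs)            # Z is independent of eps, so no attenuation
print(f"OLS on E_obs = {b_ols:.3f} (theory 0.800), IV = {b_iv:.3f} (theory 1.000)")
```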
The unconditional noise share \(\text{Var}(\varepsilon)/\text{Var}(E_{\text{obs}})\) is unobserved directly. Two empirical strategies:
Both land at noise share ≈ 10–15% for US years-of-schooling.
Caveat: the with-controls attenuation factor uses \(\text{Var}(\tilde E^*) < \text{Var}(E^*)\). (Why?)
\(\Rightarrow\) With-controls attenuation is more severe than the unconditional noise share suggests.
Even pessimistically — say attenuation factor \(\approx 0.80\) — the implied no-ME OLS would be \(\approx 0.073/0.80 \approx 0.091\), still well short of the IV’s \(0.132\). ME contributes; it doesn’t close the gap.
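One way to see the caveat, sketched under the classical-ME assumptions above: partialling out \(\mathbf{W}\) removes the share of \(\text{Var}(E^*)\) that \(\mathbf{W}\) explains but leaves \(\text{Var}(\varepsilon)\) untouched. Writing \(R^2_{E^*,\mathbf{W}}\) for the population \(R^2\) from projecting \(E^*\) on \(\mathbf{W}\) (a symbol introduced here for illustration):
\[ \text{Var}(\tilde E^*) = \bigl(1 - R^2_{E^*,\mathbf{W}}\bigr)\,\text{Var}(E^*) \quad\Rightarrow\quad \text{attenuation factor} = \frac{\bigl(1 - R^2_{E^*,\mathbf{W}}\bigr)\,\text{Var}(E^*)}{\bigl(1 - R^2_{E^*,\mathbf{W}}\bigr)\,\text{Var}(E^*) + \text{Var}(\varepsilon)}. \]
As a purely hypothetical illustration, normalise \(\text{Var}(E_{\text{obs}}) = 1\), take a noise share of 0.10 (so \(\text{Var}(E^*) = 0.90\)), and suppose \(R^2_{E^*,\mathbf{W}} = 0.30\):
\[ \frac{0.70 \times 0.90}{0.70 \times 0.90 + 0.10} = \frac{0.63}{0.73} \approx 0.86 \quad\text{versus}\quad 0.90 \text{ without controls}. \]
At that factor the implied no-ME OLS would be about \(0.073/0.86 \approx 0.085\), still below the IV's \(0.132\).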
Key fact (5a): IV/2SLS identifies the LATE on compliers — those whose treatment is shifted by \(Z\) — not the population ATE.
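Card’s compliers: kids whose schooling was swayed by college proximity. They sit at the cost margin, where the marginal return roughly equals the marginal cost of another year. If their costs are higher than average (for example, because of credit constraints), they stop schooling where the marginal return is still high.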
\(\Rightarrow \text{LATE}_{\text{Card}} > \text{ATE}\), and \(\hat\beta^{2SLS} > \hat\beta^{OLS}\) even with no OLS bias at all.
2SLS is consistent but less efficient than OLS. Intuitively:
Less variation used \(\Rightarrow\) wider standard errors. The weaker the instrument’s effect on \(E\) given \(\mathbf{W}\), the smaller the slice 2SLS leverages, and the bigger the precision penalty.
Card illustrates this directly:
| | \(\hat\beta_1\) | SE | 95% CI |
|---|---|---|---|
| OLS | 0.073 | 0.004 | [0.065, 0.081] |
| 2SLS | 0.132 | 0.049 | [0.036, 0.228] |
Proximity shifts \(E\) by only ~0.32 years — small relative to \(E\)’s overall variation (SD ~3 years). 2SLS uses only that slice \(\Rightarrow\) SE 12× OLS.
By FWL, both slopes on \(E\) reduce to bivariate forms in residualized variables \(\tilde Y, \tilde E, \tilde Z\) (after partialling out \(\mathbf{W}\)).
OLS (homoskedastic): \[ V_{OLS} = \frac{\sigma^2}{\text{Var}(\tilde E)}. \]
2SLS (just-identified, homoskedastic) — the just-id IV asymptotic variance from earlier, applied to residualized variables: \[ V_{2SLS} = \frac{\sigma^2 \cdot \text{Var}(\tilde Z)}{[\,\text{Cov}(\tilde Z, \tilde E)\,]^2}. \]
\[ \frac{V_{2SLS}}{V_{OLS}} = \frac{\text{Var}(\tilde Z)\cdot\text{Var}(\tilde E)}{[\,\text{Cov}(\tilde Z, \tilde E)\,]^2} = \frac{1}{\rho^2_{\tilde Z, \tilde E}} \geq 1. \]
\(\rho^2_{\tilde Z, \tilde E}\) = squared correlation between \(\tilde Z\) and \(\tilde E\) = partial \(R^2\): the share of \(E\)’s variation (net of \(\mathbf{W}\)) that \(Z\) explains.
The “small slice” intuition, made precise: as \(Z\)’s explanatory power for \(E\) given \(\mathbf{W}\) shrinks, the variance ratio explodes.
OLS is centered at the wrong value but precise; IV is centered at the truth but imprecise. IV trades bias for variance.
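A few lines of arithmetic show how quickly the \(1/\rho^2\) penalty grows, and what partial \(R^2\) Card's reported standard errors would imply if the homoskedastic just-identified formula held exactly. That last step is an assumption made only for illustration, since reported SEs also reflect different residual-variance estimates. The helper name `variance_ratio` is hypothetical.

```python
def variance_ratio(partial_r2):
    """V_2SLS / V_OLS = 1 / rho^2 under homoskedasticity (just-identified case)."""
    if not 0.0 < partial_r2 <= 1.0:
        raise ValueError("partial R^2 must lie in (0, 1]")
    return 1.0 / partial_r2

for r2 in (0.50, 0.10, 0.01, 0.001):
    ratio = variance_ratio(r2)
    print(f"partial R^2 = {r2:<6} variance ratio = {ratio:8.1f}  SE ratio = {ratio ** 0.5:5.1f}")

# Reverse direction: the partial R^2 implied by the SEs quoted in the table above
se_ols, se_2sls = 0.004, 0.049
implied_r2 = (se_ols / se_2sls) ** 2
print(f"implied partial R^2 = {implied_r2:.4f}")
```

Read this way, an SE ratio of about 12 corresponds to an instrument that explains well under 1% of the residual variation in schooling.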
Researchers often have several candidate instruments for the same regressor. Two reasons this can be valuable:
Terminology (single endogenous regressor): \(L = 1\) instrument = just-identified; \(L > 1\) = overidentified.
Same two stages — only Stage 1 changes.
Stage 1: regress \(E\) on all instruments and controls: \[ E_i = \pi_0 + \boldsymbol{\pi}'\mathbf{Z}_i + \boldsymbol{\delta}'\mathbf{W}_i + V_i. \] Take fitted values \(\hat E_i\): the best linear predictor of \(E\) from \((Z_1, \ldots, Z_L)\) and \(\mathbf{W}\).
Stage 2: regress \(\ln W\) on \((\hat E, \mathbf{W})\) — identical to the just-identified case: \[ \ln W_i = \beta_0 + \beta_1 \hat E_i + \boldsymbol{\gamma}'\mathbf{W}_i + \text{error}_i. \]
With multiple instruments, 2SLS is asymptotically just-identified IV using the conditional mean of schooling given the instruments, \(p(Z) = E[E \mid Z]\) (the outer \(E[\cdot]\) is the expectation, the inner \(E\) is schooling), as a scalar (multi-valued) instrument: \[ \beta_1^{2SLS} = \frac{\text{Cov}(\ln W,\, p(Z))}{\text{Var}(p(Z))}. \]
With heterogeneous effects, this is a weighted average of pairwise LATEs across the levels \(p_0 < p_1 < \cdots < p_K\) of \(p(Z)\): \[ \beta_1^{2SLS} = \sum_{k=0}^{K-1} w_k \cdot \text{LATE}(p_k, p_{k+1}). \]
The weights factor as: \[ w_k \;\propto\; \underbrace{\lambda_k}_{\text{instrument-shape factor}} \;\cdot\; \underbrace{[p_{k+1} - p_k]}_{\text{complier mass at margin }k}. \]
\(p_{k+1} - p_k\) is the rise in \(P(D=1 \mid Z)\) between adjacent levels, where \(D\) denotes a binary treatment standing in for schooling \(E\). Under monotonicity, it is the share of compliers who switch into treatment at margin \(k\).
A property of the instrument-treatment relationship. Tells you how much \(D\) moves at each margin.
\(\lambda_k\) depends only on the sample distribution of \(Z\): \[ \lambda_k = P(Z > z_k) \cdot \{E[Z \mid Z > z_k] - E[Z]\}. \]
A product of two factors moving in opposite directions: as the cut \(z_k\) rises, \(P(Z > z_k)\) falls while \(E[Z \mid Z > z_k] - E[Z]\) grows.
At extreme cuts one factor vanishes. The product peaks at \(z_k = E[Z]\).
No causal content: \(\lambda_k\) depends only on \(Z\)’s distribution. Trim the sample and the weights shift, even though no causal mechanism has changed.
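The shape factor is easy to compute for a discrete instrument. The sketch below evaluates \(\lambda_k\) at each cut for an assumed uniform distribution on \(\{0,1,2,3,4\}\), then again after trimming the top value. The function name `lambda_weights` and both distributions are illustrative assumptions.

```python
def lambda_weights(support, probs):
    """lambda_k = P(Z > z_k) * (E[Z | Z > z_k] - E[Z]) at every cut z_k below the top value.

    `support` must be sorted in ascending order and `probs` must sum to one.
    """
    mean_z = sum(z * p for z, p in zip(support, probs))
    weights = []
    for z_k in support[:-1]:
        upper = [(z, p) for z, p in zip(support, probs) if z > z_k]
        p_above = sum(p for _, p in upper)
        mean_above = sum(z * p for z, p in upper) / p_above
        weights.append(p_above * (mean_above - mean_z))
    return weights

support = [0, 1, 2, 3, 4]
uniform_weights = lambda_weights(support, [0.2] * 5)
print([round(w, 3) for w in uniform_weights])

# Trimming: drop Z = 4 and renormalise. Nothing causal has changed, but the weights move.
trimmed_weights = lambda_weights(support[:4], [0.25] * 4)
print([round(w, 3) for w in trimmed_weights])
```

In the uniform case the weights are largest at the cuts adjacent to the mean (\(E[Z] = 2\)) and smallest at the extreme cuts, matching the hump shape described above.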
The weight \(w_k \propto \lambda_k \cdot [p_{k+1} - p_k]\) tells us where the identifying variation comes from:
\(\Rightarrow\) \(\beta^{2SLS}\) is a specific weighted aggregation of complier-group-specific effects — not a single well-defined “treatment effect” for the population.
Two important features of the weights \(w_k\):
\(\Rightarrow\) More instruments does not automatically move 2SLS toward ATE.
Setting: Angrist and Krueger (1991; AK) use 1980 US Census data on US-born men born 1930–39 (\(n \approx 313{,}000\)).
OLS benchmark (with year-of-birth controls): \(\hat\beta^{OLS} = 0.071\) (SE \(0.0003\)).
How can quarter of birth shift schooling?
The mechanism:
AK first run a simple Wald estimator with a single binary instrument: \[ Z_i = \mathbf{1}\{\text{quarter}_i = 1\}. \] This is the just-identified case from 5a — one \(Z\), one endogenous regressor, no controls.
\[ \hat\beta^{IV} \;=\; \frac{\bar Y_{Z=1} - \bar Y_{Z=0}}{\bar E_{Z=1} - \bar E_{Z=0}} \;=\; \frac{-0.0111}{-0.109} \;=\; 0.1020 \quad (\text{SE } 0.0239). \]
Same pattern as Card: IV \(>\) OLS (\(0.1020\) vs \(0.0711\)), the systematic IV-bigger-than-OLS finding across education-IV studies. The SE is also roughly 80 times OLS’s \(0.0003\): IV’s precision cost.
AK make two changes to move from the Wald to a more credible specification:
Payoff: cohort-specific first-stage signals add variation to \(\hat E\) → tighter SE.
| Specification | Controls | \(\hat\beta_1\) | SE |
|---|---|---|---|
| OLS | none | 0.0709 | 0.0003 |
| OLS | YoB dummies | 0.0711 | 0.0003 |
| Wald (1 IV: \(Z = \mathbf{1}\{\text{Q1}\}\)) | none | 0.1020 | 0.0239 |
| 2SLS (30 IVs: QoB × cohort) | YoB dummies | 0.0891 | 0.0161 |
Both the Wald and the 2SLS aggregate LATEs on kids whose schooling responds to QoB — kids at the dropout margin (compulsory-schooling cutoff at age 16).
Both target the same complier population (dropout-margin kids) but aggregate over them differently.
Who is silent in both?
\(\Rightarrow\) Neither \(\hat\beta^{IV}\) corresponds to the ATE.
AK’s \(\hat\beta^{2SLS} = 0.0891\) is the return to compulsory schooling — not the population return to education.
Beyond this lecture: the marginal treatment effect (MTE) framework (Heckman-Vytlacil) targets effects for specific subpopulations directly.
Identification requires:
Rank condition: \(\mathrm{rank}(E[\mathbf{Z}_i\mathbf{X}_i']) = k\).
The instruments must shift \(X_1, \ldots, X_k\) in linearly independent ways. If two instruments shift \(X_1\) and \(X_2\) identically (e.g., both proxy for the same factor), we can’t separate \(\beta_1\) from \(\beta_2\).
This implies the order condition \(L \geq k\) — a count check, since a matrix with fewer rows than columns can’t have full column rank.
Computing 2SLS with multiple endogenous regressors: run a separate first-stage for each \(X_j\) (using all \(L\) instruments), then plug all fitted values \(\hat X_1, \ldots, \hat X_k\) into the second stage together.
Three diagnostic questions for any IV estimate:
Big caveat: none of these validates the design. Exogeneity is ultimately a theoretical claim; the tests catch some failures, not all.
Why care? IV’s identification rests on assumptions — \(Z\) exogenous, excluded — defended from theory, not directly testable. With \(L > k\) instruments, we get one empirical check: if all are valid, they should give consistent estimates.
The test: regress the 2SLS residuals \(\hat U_i = Y_i - \mathbf{X}_i'\hat\beta^{2SLS}\) on the instruments \(\mathbf{Z}_i\). Under \(H_0\) (all instruments valid), residuals should be uncorrelated with \(\mathbf{Z}_i\) — so the \(R^2\) from this regression should be near zero.
Statistic: \(J = n \cdot R^2 \;\xrightarrow{d}\; \chi^2_{L - k}\) under \(H_0\).
Falsification, not validation: rejection signals at least one invalid instrument. Non-rejection means the data don’t contradict joint validity — it isn’t proof.
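A minimal sketch of the \(n R^2\) statistic for one endogenous regressor, two instruments, and no controls (so \(L - k = 1\)). The simulated design, the hand-written `ols` helper, and the `direct_effect` parameter, which lets the second instrument enter the outcome equation directly, are all assumptions for illustration.

```python
import random

def ols(y, rows):
    """OLS coefficients from the normal equations (X'X) b = X'y, by Gauss-Jordan elimination."""
    k = len(rows[0])
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def fitted(rows, b):
    """Fitted values rows @ b."""
    return [sum(ri * bi for ri, bi in zip(r, b)) for r in rows]

def sargan_j(Y, X, Zs):
    """J = n * R^2 from regressing 2SLS residuals on a constant and the instruments."""
    n = len(Y)
    z_rows = [[1.0, *z] for z in Zs]
    x_hat = fitted(z_rows, ols(X, z_rows))                 # stage 1
    b0, b1 = ols(Y, [[1.0, xh] for xh in x_hat])            # stage 2
    u_hat = [y - b0 - b1 * x for y, x in zip(Y, X)]         # residuals use the actual X
    u_fit = fitted(z_rows, ols(u_hat, z_rows))
    u_bar = sum(u_hat) / n
    ess = sum((f - u_bar) ** 2 for f in u_fit)
    tss = sum((u - u_bar) ** 2 for u in u_hat)
    return n * ess / tss

def simulate(n, direct_effect, seed):
    """direct_effect != 0 lets z2 enter Y directly, violating the exclusion restriction."""
    rng = random.Random(seed)
    Y, X, Zs = [], [], []
    for _ in range(n):
        z1, z2, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        x = z1 + z2 + 0.8 * u + rng.gauss(0, 1)
        Y.append(1.0 * x + direct_effect * z2 + u)
        X.append(x)
        Zs.append((z1, z2))
    return Y, X, Zs

j_valid = sargan_j(*simulate(2000, 0.0, seed=6))
j_invalid = sargan_j(*simulate(2000, 0.5, seed=6))
print(f"J (both valid) = {j_valid:.2f}, J (z2 invalid) = {j_invalid:.2f}, 5% critical value = 3.84")
```

Note that the test would have no power if both instruments were invalid in a way that shifted the two implied estimates equally, which is one reason non-rejection is not proof.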
Question: do we even need IV? Or is OLS fine?
Idea: under \(H_0: \text{Cov}(X, U) = 0\) (\(X\) exogenous), both OLS and IV are consistent — they should agree. If they don’t, \(X\) is endogenous and OLS is biased.
Test statistic:
\[ H = (\hat\beta^{IV} - \hat\beta^{OLS})' (\hat V^{IV} - \hat V^{OLS})^{-1} (\hat\beta^{IV} - \hat\beta^{OLS}) \xrightarrow{d} \chi^2_k \]
where \(k\) is the number of endogenous regressors. Reject \(\Rightarrow\) \(X\) is endogenous, IV is needed.
Caveat: even with a valid, reasonably strong instrument, the test is often underpowered. IV’s SE is intrinsically much larger than OLS’s (the \(1/\rho^2\) penalty), so substantively large OLS–IV gaps can still fail to reject statistically. The test also relies on IV being valid in the first place — if it isn’t, the comparison is meaningless.
Apply the formula to Card’s \(\hat\beta^{OLS} = 0.073\) (SE 0.004) and \(\hat\beta^{IV} = 0.132\) (SE 0.049): \[ H = \frac{(0.132 - 0.073)^2}{0.049^2 - 0.004^2} \approx 1.46 \;<\; \chi^2_{1, 0.05} = 3.84. \] Fail to reject.
But: IV is 80% larger than OLS — a substantively huge gap. The test fails to reject only because IV’s SE is so wide. Hausman is underpowered when IV is imprecise — it can’t detect endogeneity that IV’s precision is too low to resolve.
Don’t lean on Hausman as the primary diagnostic for whether to use IV. It tells you more about IV’s precision than about \(E\)’s endogeneity.
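The scalar version of the statistic is a one-liner, which makes the power problem easy to quantify. The sketch below reproduces the Card calculation and then asks how large the OLS-IV gap would have to be, holding both quoted SEs fixed, before the test rejected at 5%. The function name `hausman_scalar` is hypothetical.

```python
def hausman_scalar(b_iv, se_iv, b_ols, se_ols):
    """Scalar Hausman statistic: squared gap over the difference in estimated variances."""
    var_diff = se_iv ** 2 - se_ols ** 2
    if var_diff <= 0:
        raise ValueError("variance difference must be positive for the statistic to be defined")
    return (b_iv - b_ols) ** 2 / var_diff

CHI2_1_CRIT_5PCT = 3.84  # 5% critical value for chi-square with 1 df, as quoted above

H = hausman_scalar(0.132, 0.049, 0.073, 0.004)
print(f"H = {H:.2f}, reject at 5%: {H > CHI2_1_CRIT_5PCT}")

# Smallest gap that would reject at 5%, holding both SEs fixed
min_gap = (CHI2_1_CRIT_5PCT * (0.049 ** 2 - 0.004 ** 2)) ** 0.5
print(f"smallest rejecting gap = {min_gap:.3f}")
```

With these standard errors the IV estimate would need to exceed OLS by roughly 0.096 log points per year of schooling, more than the OLS estimate itself, before the test rejected.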
Recall: \[ V_{2SLS} = \frac{\sigma^2 \cdot \text{Var}(\tilde Z)}{[\,\text{Cov}(\tilde Z, \tilde E)\,]^2} = \frac{\sigma^2}{\rho^2_{\tilde Z, \tilde E} \cdot \text{Var}(\tilde E)} \]
As \(\rho^2_{\tilde Z, \tilde E} \to 0\) (weak instrument):
\(F\) = first-stage \(F\)-statistic for \(H_0: \boldsymbol{\pi} = \mathbf{0}\) in \(X_i = \mathbf{Z}_i'\boldsymbol{\pi} + V_i\). Larger \(F\) ⇒ stronger instrument(s).
As the first stage weakens, \(\hat\beta^{2SLS}\) shifts toward OLS and its distribution becomes non-normal and heavy-tailed. Standard confidence intervals lose their nominal coverage.
First-stage \(F\)-statistic: joint test of \(H_0: \boldsymbol{\pi} = \mathbf{0}\) in \[ X_i = \mathbf{Z}_i'\boldsymbol{\pi} + \mathbf{W}_i'\boldsymbol{\delta} + V_i. \]
Rule of thumb (Staiger & Stock 1997): \(F < 10\) ⇒ weak instrument. The cutoff is calibrated so that finite-sample 2SLS bias is at most ~10% of OLS bias.
Always report \(F\). The single most important diagnostic for IV.
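For a single instrument and no controls, the first-stage \(F\) is just the squared \(t\)-statistic on \(\hat\pi_1\). The sketch below computes the homoskedastic version on simulated data with an assumed strong (\(\pi_1 = 0.30\)) and an assumed weak (\(\pi_1 = 0.03\)) first stage. With controls or several instruments, a joint \(F\)-test from a regression package would be needed instead.

```python
import random

def first_stage_f(X, Z):
    """Homoskedastic first-stage F for one instrument and no controls (equals t^2 on pi_1)."""
    n = len(X)
    zb, xb = sum(Z) / n, sum(X) / n
    s_zz = sum((z - zb) ** 2 for z in Z)
    pi1 = sum((z - zb) * (x - xb) for z, x in zip(Z, X)) / s_zz
    pi0 = xb - pi1 * zb
    rss = sum((x - pi0 - pi1 * z) ** 2 for x, z in zip(X, Z))
    sigma2 = rss / (n - 2)              # residual variance with 2 estimated parameters
    return pi1 ** 2 / (sigma2 / s_zz)   # pi1^2 / Var_hat(pi1)

def draw(n, pi1, seed):
    """Simulate X = pi1 * Z + V with standard normal Z and V."""
    rng = random.Random(seed)
    Z = [rng.gauss(0, 1) for _ in range(n)]
    X = [pi1 * z + rng.gauss(0, 1) for z in Z]
    return X, Z

f_strong = first_stage_f(*draw(1000, 0.30, seed=3))
f_weak = first_stage_f(*draw(1000, 0.03, seed=3))
for label, f in (("strong", f_strong), ("weak", f_weak)):
    print(f"{label:>6}: F = {f:6.1f}  below rule-of-thumb 10: {f < 10}")
```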
When the first-stage \(F\) is in the danger zone (~5–15) and a stronger instrument isn’t available — common with natural-experiment IVs (one lottery, one policy discontinuity) — modern weak-IV inference provides confidence sets that are valid regardless of instrument strength.
Practical recommendation: when \(F\) is borderline, report Anderson-Rubin (AR) or \(tF\) intervals alongside standard 2SLS CIs. If they agree, inference is robust; if they diverge, the AR/\(tF\) interval is the trustworthy one.
When reporting IV estimates, always include:
If overidentified (\(L > k\)):
Defend exogeneity from theory — the most important step. Tests are partial diagnostics; the data alone cannot validate identification.
A final implementation note — at the software level, not the theory level. If you compute 2SLS literally as two OLS regressions in sequence, the Stage-2 slope is correct, but the Stage-2 reported SE is wrong.
The Stage-2 OLS routine builds residuals using \(\hat E_i\): \[ \hat e_i = Y_i - \hat\beta_0 - \hat\beta^{2SLS}\,\hat E_i - \hat{\boldsymbol{\gamma}}'\mathbf{W}_i. \] But the correct \(V_{2SLS}\) uses residuals with the actual endogenous regressor: \[ \hat U_i = Y_i - \hat\beta_0 - \hat\beta^{2SLS}\, E_i - \hat{\boldsymbol{\gamma}}'\mathbf{W}_i. \]
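The size of the discrepancy can be seen in a small simulation. This is a sketch for the no-controls case with a homoskedastic variance formula, under an assumed data-generating process. Both standard errors share the same denominator, \(\sum_i (\hat E_i - \bar{\hat E})^2\), and differ only in which residuals feed the estimate of \(\sigma^2\).

```python
import math
import random

def fit_line(y, x):
    """Intercept and slope from a bivariate OLS regression of y on x."""
    n = len(y)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sum((xi - xb) ** 2 for xi in x)
    return yb - b1 * xb, b1

rng = random.Random(4)
n = 5000
Z = [rng.gauss(0, 1) for _ in range(n)]
U = [rng.gauss(0, 1) for _ in range(n)]
E = [0.5 * z + 0.8 * u + rng.gauss(0, 1) for z, u in zip(Z, U)]
Y = [1.0 + 1.0 * e + u for e, u in zip(E, U)]  # assumed true slope: 1.0

p0, p1 = fit_line(E, Z)                 # stage 1
E_hat = [p0 + p1 * z for z in Z]
b0, b1 = fit_line(Y, E_hat)             # stage 2: slope is the 2SLS estimate

# Residuals as a Stage-2 OLS routine would build them (with E_hat) vs. the correct ones (with E)
s2_naive = sum((y - b0 - b1 * eh) ** 2 for y, eh in zip(Y, E_hat)) / (n - 2)
s2_correct = sum((y - b0 - b1 * e) ** 2 for y, e in zip(Y, E)) / (n - 2)

eh_bar = sum(E_hat) / n
ssx = sum((eh - eh_bar) ** 2 for eh in E_hat)
se_naive = math.sqrt(s2_naive / ssx)
se_correct = math.sqrt(s2_correct / ssx)
print(f"slope = {b1:.3f}, naive SE = {se_naive:.4f}, correct SE = {se_correct:.4f}")
```

In this particular design the naive SE is too large. In general the error can go in either direction, depending on the sign of the slope and of the covariance between \(U\) and the first-stage residual.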
Use the built-in IV command — it computes the correct \(V_{2SLS}\) internally:
Stata: ivregress 2sls Y W (X = Z), robust
R: ivreg(Y ~ X + W | Z + W, data = ...)
Python: IV2SLS(Y, [W, const], X, Z).fit()
Lecture 6a — Panel Regression:
Lecture 6b — Difference-in-Differences: