Asymptotics & Heteroskedasticity

Natasha Kang

Xiamen University, Chow Institute

March, 2026

Why Large-Sample Theory?

  • In Lecture 3a, inference relied on the normality assumption (MLR.6): \(u \mid X \sim N(0, \sigma^2)\).
  • This gave us exact \(t\)- and \(F\)-distributions for any sample size.
  • But normality is a strong assumption. What if the errors are not normal?
  • Large-sample (asymptotic) theory provides approximate distributions under weaker assumptions, valid when \(n\) is large.

Two Approaches to Inference

| | Exact (finite-sample) | Asymptotic (large-sample) |
|---|---|---|
| Assumptions | MLR.1–MLR.6 (normality) | MLR.1–MLR.5 (no normality) |
| Distributions | Exact \(t\) and \(F\) | Approximate (via CLT) |
| Sample size | Any \(n\) | Requires \(n\) large |

We need two key tools: the Law of Large Numbers and the Central Limit Theorem.

Convergence in Probability

Let \(\theta_n\) be a sequence of random variables indexed by \(n\). We say \(\theta_n\) converges in probability to \(\theta\) if

\[ P(|\theta_n - \theta| \geq \varepsilon) \to 0 \quad \text{for all } \varepsilon > 0 \]

Notation: \(\theta_n \xrightarrow{p} \theta\) or \(\text{plim}\, \theta_n = \theta\).

Informally: as \(n\) grows, \(\theta_n\) is increasingly likely to be close to \(\theta\).

Law of Large Numbers

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) < \infty\). Then:

\[ \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu \]

  • The sample average converges to the population mean.
  • More generally, any sample moment converges to its population counterpart, provided that moment exists: \(\frac{1}{n}\sum_{i=1}^n g(X_i) \xrightarrow{p} E[g(X_i)]\).
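A minimal simulation sketch of the LLN, assuming i.i.d. Exponential(1) draws (so \(\mu = 1\)). The distribution, the tolerance \(\varepsilon = 0.1\), the number of replications, and the function name `tail_probability` are illustrative choices, not part of the lecture. It estimates \(P(|\bar{X}_n - \mu| \geq \varepsilon)\) by Monte Carlo for increasing \(n\):

```python
import random

def tail_probability(n, eps=0.1, reps=2000, seed=0):
    """Monte Carlo estimate of P(|Xbar_n - mu| >= eps) for Exponential(1) data."""
    rng = random.Random(seed)
    mu = 1.0  # population mean of Exponential(1)
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(x_bar - mu) >= eps:
            hits += 1
    return hits / reps

probs = {n: tail_probability(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:5d}   P(|Xbar - 1| >= 0.1) ~ {p:.3f}")
```

The estimated tail probability should shrink toward zero as \(n\) grows, which is what \(\bar{X}_n \xrightarrow{p} \mu\) means.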

Properties of Convergence in Probability

If \(\theta_n \xrightarrow{p} \theta\) and \(\phi_n \xrightarrow{p} \phi\):

  • \(\theta_n + \phi_n \xrightarrow{p} \theta + \phi\)
  • \(\theta_n \phi_n \xrightarrow{p} \theta \phi\)
  • \(\theta_n / \phi_n \xrightarrow{p} \theta / \phi\)   (if \(\phi \neq 0\))

Continuous mapping: if \(g(\cdot)\) is continuous at \(\theta\), then \(g(\theta_n) \xrightarrow{p} g(\theta)\).

Consistency

An estimator \(\hat\theta_n\) is consistent for \(\theta\) if \(\hat\theta_n \xrightarrow{p} \theta\).

  • As \(n\) increases, the estimator concentrates around the true value.
  • Consistency is a minimum requirement for sensible estimators.

“If you can’t get it right as \(n\) goes to infinity, you shouldn’t be in this business.” — C. W. J. Granger

Consistency: Visualized
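A rough numerical picture of consistency, as a sketch under assumed settings: the estimator is the OLS slope in a simple regression, the true coefficients (1 and 2), the skewed error distribution, and the helper names `ols_slope` and `simulate_slopes` are all hypothetical choices made for illustration. For each sample size it reports the mean and spread of the estimates across repeated samples:

```python
import random
import statistics

def ols_slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

def simulate_slopes(n, reps=300, beta0=1.0, beta1=2.0, seed=0):
    """Draw `reps` samples of size n and return the OLS slope from each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Centered Exponential(1) errors: mean zero but skewed (not normal).
        y = [beta0 + beta1 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

results = {}
for n in (25, 100, 400, 1600):
    slopes = simulate_slopes(n)
    results[n] = (statistics.mean(slopes), statistics.pstdev(slopes))
    print(f"n = {n:5d}   mean = {results[n][0]:.3f}   sd = {results[n][1]:.3f}")
```

In this setup the spread should fall roughly by half each time \(n\) quadruples, so the sampling distribution piles up around the true value 2.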

Consistency of OLS

Theorem: Under MLR.1–MLR.4, the OLS estimator \(\hat\beta_j\) is consistent for \(\beta_j\).

Note: we do not need homoskedasticity (MLR.5) or normality (MLR.6).

Consistency of OLS: Proof Sketch (SLR)

Write \(\hat\beta_1 = \beta_1 + \frac{n^{-1}\sum(x_i - \bar{x})u_i}{n^{-1}\sum(x_i - \bar{x})^2}\).

Denominator: \(\frac{1}{n}\sum(x_i - \bar{x})^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 \xrightarrow{p} E[x_i^2] - (E[x_i])^2 = \text{Var}(x_i)\)

Numerator: expand as \(\frac{1}{n}\sum x_i u_i - \bar{x}\cdot\frac{1}{n}\sum u_i\).

  • \(\frac{1}{n}\sum x_i u_i \xrightarrow{p} E[x_i u_i] = 0\)   (LLN + MLR.4)
  • \(\bar{x} \xrightarrow{p} E[x_i]\),   \(\frac{1}{n}\sum u_i \xrightarrow{p} 0\)
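  • By Slutsky: numerator \(\xrightarrow{p} 0 - E[x_i] \cdot 0 = 0\).

Hence \(\hat\beta_1 \xrightarrow{p} \beta_1 + 0 / \text{Var}(x_i) = \beta_1\), provided \(\text{Var}(x_i) > 0\).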

A Weaker Condition for Consistency

The proof only used \(E[u_i] = 0\) and \(E[x_i u_i] = 0\), not the full \(E(u \mid X) = 0\). So we can replace MLR.4 with a weaker assumption:

  • MLR.4: \(E(u \mid x_1, \ldots, x_k) = 0\)   (zero conditional mean)
  • MLR.4’: \(E(u) = 0\) and \(\text{Cov}(x_j, u) = 0\) for all \(j\)   (zero mean and zero correlation)
  • MLR.4 implies MLR.4’, but not vice versa.
  • MLR.4 \(\Rightarrow\) unbiasedness (finite-sample property).
  • MLR.4’ \(\Rightarrow\) consistency (large-sample property).

Consistency vs. Unbiasedness

Example: True model is \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + u\), with \(E(u \mid x) = 0\) and \(x \sim N(0,1)\).

Suppose we estimate the misspecified model: \(y = \alpha_0 + \beta_1 x + v\).

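  • With \(\alpha_0 = \beta_0 + \beta_2 E(x^2)\), the error is \(v = \beta_2 x^2 + u - \beta_2 E(x^2)\), so \(E(v \mid x) = \beta_2(x^2 - 1) \neq 0\) whenever \(\beta_2 \neq 0\).
  • MLR.4 is violated: the unbiasedness argument breaks down, since \(E(\hat\beta_1 \mid X) \neq \beta_1\) in general.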
  • But \(\text{Cov}(x, v) = E[xv] = \beta_2 E[x^3] = 0\) (odd moments of a symmetric distribution vanish).
  • MLR.4’ holds: \(\hat\beta_1\) is consistent.
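The example can be checked numerically. The sketch below assumes hypothetical values \(\beta_0 = 1\), \(\beta_1 = 2\), \(\beta_2 = 1.5\) and standard normal \(u\); it fits the misspecified regression of \(y\) on \(x\) alone and reports the slope for increasing \(n\):

```python
import random

def ols_slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

def misspecified_slope(n, beta0=1.0, beta1=2.0, beta2=1.5, seed=0):
    """Regress y on x only, although the true model also contains x**2."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta0 + beta1 * xi + beta2 * xi ** 2 + rng.gauss(0.0, 1.0) for xi in x]
    return ols_slope(x, y)

estimates = {n: misspecified_slope(n) for n in (50, 5_000, 200_000)}
for n, b in estimates.items():
    print(f"n = {n:7d}   slope = {b:.4f}   (beta1 = 2)")
```

Even though \(x^2\) is omitted, the slope should settle near \(\beta_1 = 2\) as \(n\) grows, because \(x\) is uncorrelated with the omitted term in this design.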

Inconsistency

When \(\text{Cov}(x_j, u) \neq 0\), OLS is inconsistent. In the SLR case:

\[ \text{plim}\, \hat\beta_1 = \beta_1 + \frac{\text{Cov}(x_i, u_i)}{\text{Var}(x_i)} \]

  • The second term is the asymptotic bias.
  • More data does not help: as \(n \to \infty\), \(\hat\beta_1\) converges to the wrong value.

Asymptotic OVB

True model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + v\), with \(E(v \mid x_1, x_2) = 0\).

Misspecified model omits \(x_2\): \(y = \beta_0 + \beta_1 x_1 + u\).

\[ \text{plim}\, \tilde\beta_1 = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} \]

Same structure as finite-sample OVB, but now stated as a probability limit.
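A simulation sketch of the probability limit above, under an assumed design: \(x_2 = 0.5\,x_1 + \text{noise}\) with \(x_1 \sim N(0,1)\), so \(\text{Cov}(x_1, x_2) = 0.5\) and \(\text{Var}(x_1) = 1\). The coefficient values are hypothetical.

```python
import random

def ols_slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

rng = random.Random(0)
n = 100_000
beta0, beta1, beta2 = 0.5, 1.0, 2.0  # hypothetical true coefficients

x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
# x2 = 0.5 * x1 + noise, so Cov(x1, x2) = 0.5 and Var(x1) = 1.
x2 = [0.5 * a + rng.gauss(0.0, 1.0) for a in x1]
y = [beta0 + beta1 * a + beta2 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

short_slope = ols_slope(x1, y)             # regression that omits x2
plim_formula = beta1 + beta2 * 0.5 / 1.0   # beta1 + beta2 * Cov(x1, x2) / Var(x1)

print(f"slope from short regression: {short_slope:.3f}")
print(f"plim from the formula:       {plim_formula:.3f}")
```

With a large sample the short-regression slope should sit close to the formula value rather than to \(\beta_1\): more data sharpens the estimate of the wrong number.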

Convergence in Distribution

A sequence \(W_n\) converges in distribution to \(W\) if:

\[ P(W_n \leq x) \to P(W \leq x) \]

at every point \(x\) where the CDF of \(W\) is continuous.

Notation: \(W_n \xrightarrow{d} W\).

  • This is convergence of CDFs, not of the random variables themselves.
  • Weaker than convergence in probability: \(W_n \xrightarrow{p} W\) implies \(W_n \xrightarrow{d} W\), but not conversely.

Central Limit Theorem

Let \(X_1, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) = \sigma^2 < \infty\). Then:

\[ \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2) \]

  • The centered and scaled sample average is approximately normal in large samples, regardless of the population distribution.
  • This is the foundation for asymptotic inference.
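A minimal Monte Carlo sketch of the CLT, assuming Exponential(1) data (skewed, with \(\mu = 1\) and \(\sigma^2 = 1\)); the sample size, replication count, and the name `scaled_means` are illustrative. It checks that \(\sqrt{n}(\bar{X}_n - \mu)\) has variance near \(\sigma^2\) and puts about 95% of its mass within \(\pm 1.96\sigma\):

```python
import math
import random
import statistics

def scaled_means(n, reps=4000, seed=0):
    """Return sqrt(n) * (Xbar_n - mu) across `reps` Exponential(1) samples."""
    rng = random.Random(seed)
    mu = 1.0
    draws = []
    for _ in range(reps):
        x_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        draws.append(math.sqrt(n) * (x_bar - mu))
    return draws

z = scaled_means(200)
var_hat = statistics.pvariance(z)
share_within = sum(abs(zi) <= 1.96 for zi in z) / len(z)

print(f"simulated variance: {var_hat:.3f}   (sigma^2 = 1)")
print(f"share within +/-1.96: {share_within:.3f}   (normal benchmark: 0.95)")
```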

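Properties of Convergence in Distribution

Slutsky’s theorem: if \(W_n \xrightarrow{d} W\) and \(\theta_n \xrightarrow{p} \theta\), where \(\theta\) is a constant:

  • \(W_n + \theta_n \xrightarrow{d} W + \theta\)
  • \(W_n \theta_n \xrightarrow{d} W\theta\)
  • \(W_n / \theta_n \xrightarrow{d} W / \theta\)   (if \(\theta \neq 0\))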

If \(Z_n \xrightarrow{d} N(0,1)\), then \(Z_n^2 \xrightarrow{d} \chi^2_1\).

These results let us combine convergence in probability (for consistent estimators) with convergence in distribution (from the CLT).

Asymptotic Normality of OLS (SLR)

Theorem: Under MLR.1–MLR.4 and \(\text{Var}(u \mid X) = \sigma^2\) (MLR.5):

\[ \sqrt{n}(\hat\beta_1 - \beta_1) \xrightarrow{d} N\!\left(0, \; \frac{\sigma^2}{\text{Var}(x_i)}\right) \]

  • No normality assumption (MLR.6) needed.
  • The CLT drives the normal approximation.

Proof Sketch

\[ \sqrt{n}(\hat\beta_1 - \beta_1) = \frac{\frac{1}{\sqrt{n}}\sum(x_i - \bar{x})u_i}{\frac{1}{n}\sum(x_i - \bar{x})^2} \]

Denominator: \(\xrightarrow{p} \text{Var}(x_i)\)   (as in the consistency proof)

Numerator: expand as \(\frac{1}{\sqrt{n}}\sum(x_i - E[x_i])u_i + (E[x_i] - \bar{x})\frac{1}{\sqrt{n}}\sum u_i\).

  • First term: \((x_i - E[x_i])u_i\) are i.i.d. with mean zero, so by CLT \(\xrightarrow{d} N(0, E[(x_i - E[x_i])^2 u_i^2])\).
  • Second term \(\xrightarrow{p} 0\).

Proof Sketch (cont.)

Under homoskedasticity (MLR.5):

\[ E[(x_i - E[x_i])^2 u_i^2] = \sigma^2 \text{Var}(x_i) \]

Combining by Slutsky’s theorem:

\[ \sqrt{n}(\hat\beta_1 - \beta_1) \xrightarrow{d} N\!\left(0, \; \frac{\sigma^2}{\text{Var}(x_i)}\right) \]
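The limiting variance can be checked by simulation. This sketch assumes \(x_i \sim \text{Uniform}(0, 2)\), so \(\text{Var}(x_i) = 1/3\), and homoskedastic centered Exponential(1) errors, so \(\sigma^2 = 1\); both choices and the name `scaled_errors` are hypothetical. The theorem then predicts a variance of \(\sigma^2 / \text{Var}(x_i) = 3\) for \(\sqrt{n}(\hat\beta_1 - \beta_1)\):

```python
import math
import random
import statistics

def ols_slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

def scaled_errors(n, reps=2000, beta1=2.0, seed=0):
    """Return sqrt(n) * (slope_hat - beta1) across Monte Carlo samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        x = [rng.uniform(0.0, 2.0) for _ in range(n)]        # Var(x) = 1/3
        u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]   # Var(u) = 1, non-normal
        y = [1.0 + beta1 * xi + ui for xi, ui in zip(x, u)]
        out.append(math.sqrt(n) * (ols_slope(x, y) - beta1))
    return out

draws = scaled_errors(200)
simulated_var = statistics.pvariance(draws)
theoretical_var = 1.0 / (1.0 / 3.0)  # sigma^2 / Var(x)

print(f"simulated variance:   {simulated_var:.2f}")
print(f"theoretical variance: {theoretical_var:.2f}")
```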

From SLR to MLR

In SLR, the asymptotic variance involves \(\text{Var}(x_i)\): the total variation in \(x\).

In MLR, the variation that identifies \(\beta_j\) is the variation in \(x_j\) after partialling out the other regressors.

General result (MLR.1–MLR.5):

\[ \hat\beta_j \overset{a}{\sim} N\!\left(\beta_j, \; \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)}\right) \]

where \(\text{SST}_j = \sum(x_{ij} - \bar{x}_j)^2\) and \(R_j^2\) is the \(R^2\) from regressing \(x_j\) on the other regressors. The denominator \(\text{SST}_j(1 - R_j^2) = \sum \hat{r}_{ij}^2\) is the residual variation in \(x_j\).
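The identity \(\text{SST}_j(1 - R_j^2) = \sum \hat{r}_{ij}^2\) can be verified numerically. The sketch below assumes two regressors, with \(x_1\) generated as \(0.8\,x_2\) plus noise (a hypothetical design), regresses \(x_1\) on \(x_2\), and compares both sides:

```python
import math
import random

rng = random.Random(0)
n = 500
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x1 = [0.8 * b + rng.gauss(0.0, 1.0) for b in x2]  # x1 is correlated with x2

m1 = sum(x1) / n
m2 = sum(x2) / n

# Auxiliary regression of x1 on x2 (with intercept).
slope = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / sum((b - m2) ** 2 for b in x2)
intercept = m1 - slope * m2
fitted = [intercept + slope * b for b in x2]
resid = [a - f for a, f in zip(x1, fitted)]

sst_1 = sum((a - m1) ** 2 for a in x1)        # total variation in x1
sse_1 = sum((f - m1) ** 2 for f in fitted)    # explained variation
r2_1 = sse_1 / sst_1                          # R^2 of the auxiliary regression

lhs = sst_1 * (1.0 - r2_1)
rhs = sum(r ** 2 for r in resid)              # residual variation in x1

print(f"R_1^2 = {r2_1:.3f}")
print(f"SST_1 * (1 - R_1^2) = {lhs:.4f}")
print(f"sum of squared residuals = {rhs:.4f}")
print(f"variance inflation 1 / (1 - R_1^2) = {1.0 / (1.0 - r2_1):.3f}")
```

The last line shows how much larger the variance of \(\hat\beta_1\) is than it would be if \(x_1\) were uncorrelated with \(x_2\) in the sample.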

Asymptotic \(t\)-Test

To test \(H_0: \beta_j = \beta_{j,0}\):

\[ t = \frac{\hat\beta_j - \beta_{j,0}}{\text{se}(\hat\beta_j)} \xrightarrow{d} N(0,1) \quad \text{under } H_0 \]

  • Reject \(H_0\) at level \(\alpha\) if \(|t| > z_{1-\alpha/2}\).
  • Similarly, \(F_{q,\, n-k-1} \to \chi^2_q / q\) as \(n \to \infty\).
  • In practice, software uses \(t_{n-k-1}\) and \(F_{q,\, n-k-1}\), which are more conservative and coincide with the exact results under MLR.6.

The standard error \(\text{se}(\hat\beta_j)\) is called the asymptotic standard error when MLR.6 is not assumed.
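As a sketch, the asymptotic \(t\)-test takes only a few lines with the standard normal distribution from Python's standard library. The function name `asymptotic_t_test` is hypothetical; the inputs in the usage example (.0904 and .0075) are the educ coefficient and its usual standard error from the wage equation later in this lecture.

```python
from statistics import NormalDist

def asymptotic_t_test(beta_hat, se, beta_null=0.0, alpha=0.05):
    """Two-sided test of H0: beta = beta_null using the N(0,1) approximation."""
    std_normal = NormalDist()
    t = (beta_hat - beta_null) / se
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)      # z_{1 - alpha/2}
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(t)))
    return t, z_crit, p_value, abs(t) > z_crit

t, z_crit, p_value, reject = asymptotic_t_test(0.0904, 0.0075)
print(f"t = {t:.2f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

Swapping the normal critical value for a \(t_{n-k-1}\) quantile makes little difference when \(n\) is large.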

Practical Implications

| | CLM (MLR.1–6) | Gauss-Markov (MLR.1–5) |
|---|---|---|
| \(t\) and \(F\) distributions | Exact, any \(n\) | Approximate, large \(n\) |
| Normality assumed? | Yes | No |

  • In large samples, the same \(t\)- and \(F\)-tests are valid either way.
  • The distinction matters most in small samples where the normal approximation may be poor.

Asymptotic Efficiency

Theorem: Under MLR.1–MLR.5, OLS is asymptotically efficient among linear estimators: no other linear, consistent, asymptotically normal estimator has a smaller asymptotic variance.

This is the large-sample analog of the Gauss-Markov theorem (BLUE).

Summary: Finite-Sample vs. Large-Sample

Finite-sample (MLR.1–MLR.6):

  • Unbiasedness (MLR.1–4); variance formulas (MLR.1–5)
  • BLUE / Gauss-Markov (MLR.1–5)
  • Exact \(t\) and \(F\) distributions (MLR.1–6)

Large-sample (MLR.1–MLR.5):

  • Consistency (MLR.1–4)
  • Asymptotic normality, approximate \(t\) and \(F\) (MLR.1–5)
  • Asymptotic efficiency (MLR.1–5)

The key tradeoff: we drop normality (MLR.6), but results are now approximations that require large \(n\).

So Far: Homoskedasticity

All results above assumed MLR.5 (homoskedasticity):

\[ \text{Var}(u \mid X) = \sigma^2 \]

  • The error variance is the same for all values of \(X\).
  • This gave us the variance formula \(\text{Var}(\hat\beta_j \mid X) = \sigma^2 / [\text{SST}_j(1 - R_j^2)]\).

What if MLR.5 fails?

Heteroskedasticity

If the error variance depends on \(X\):

\[ \text{Var}(u \mid X) = \sigma^2(X) \]

the errors are heteroskedastic.

What Breaks, What Doesn’t

Under heteroskedasticity (MLR.5 fails, MLR.1–4 hold):

Still valid:

  • OLS is unbiased and consistent
  • OLS coefficients are asymptotically normal (but with a different variance)

No longer valid:

  • The usual variance formula \(\hat\sigma^2 / [\text{SST}_j(1 - R_j^2)]\)
  • The usual \(t\)- and \(F\)-statistics (which rely on that formula)
  • The BLUE property of OLS (and its asymptotic efficiency)

Why the Usual Standard Errors Fail

The asymptotic variance of \(\sqrt{n}(\hat\beta_1 - \beta_1)\) (SLR) is:

\[ V_1 = \frac{E[(x_i - E[x_i])^2 u_i^2]}{[\text{Var}(x_i)]^2} \]

Under homoskedasticity, \(E[u_i^2 \mid x_i] = \sigma^2\), so we can simplify:

\[ \begin{aligned} E[(x_i - E[x_i])^2 u_i^2] &= E\!\big[(x_i - E[x_i])^2\, E[u_i^2 \mid x_i]\big] \\ &= \sigma^2\, E[(x_i - E[x_i])^2] = \sigma^2 \text{Var}(x_i) \end{aligned} \]

and \(V_1 = \sigma^2 / \text{Var}(x_i)\).

Under heteroskedasticity, \(E[u_i^2 \mid x_i]\) varies with \(x_i\), so \(\sigma^2\) cannot be pulled out. The usual formula understates or overstates the true variance, leading to invalid inference.
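A worked example under an assumed variance function, chosen only because the moments are easy to compute: suppose \(x_i \sim N(0,1)\) and \(\text{Var}(u_i \mid x_i) = \sigma^2 x_i^2\). Then \(E[x_i] = 0\), \(\text{Var}(x_i) = 1\), \(E[x_i^4] = 3\), and

\[ \begin{aligned} E[(x_i - E[x_i])^2 u_i^2] &= E\!\big[x_i^2\, E[u_i^2 \mid x_i]\big] = \sigma^2 E[x_i^4] = 3\sigma^2, \\ V_1 &= \frac{3\sigma^2}{[\text{Var}(x_i)]^2} = 3\sigma^2 . \end{aligned} \]

The homoskedasticity-only formula instead targets \(E[u_i^2] / \text{Var}(x_i) = \sigma^2 E[x_i^2] = \sigma^2\). In this case it understates the asymptotic variance by a factor of 3, and the standard error by a factor of \(\sqrt{3} \approx 1.73\).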

Heteroskedasticity-Robust Standard Errors

Idea: estimate the general \(V_1\) directly, without assuming homoskedasticity.

In SLR:

\[ \hat{V}_1^{HC} = \frac{n^{-1}\sum(x_i - \bar{x})^2 \hat{u}_i^2}{\left(n^{-1}\text{SST}_x\right)^2} \]

  • Each squared residual \(\hat{u}_i^2\) gets its own weight, rather than using a common \(\hat\sigma^2\).
  • In MLR, the same principle applies: the general form is a “sandwich” estimator. Software computes this automatically.

The resulting standard errors, \(\text{se}^{HC}(\hat\beta_1) = \sqrt{\hat{V}_1^{HC} / n}\) in SLR, are called heteroskedasticity-robust (or HC) standard errors (White, 1980).
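A minimal sketch of the SLR calculation, computing both the usual and the robust standard error by hand. The data-generating process is an assumption for illustration: \(x_i \sim N(0,1)\) and \(u_i = x_i e_i\) with \(e_i \sim N(0,1)\), so \(\text{Var}(u_i \mid x_i) = x_i^2\). The robust standard error is taken as \(\sqrt{\hat{V}_1^{HC}/n}\).

```python
import math
import random

rng = random.Random(0)
n = 20_000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Heteroskedastic errors: u = x * e, so Var(u | x) = x**2.
u = [xi * rng.gauss(0.0, 1.0) for xi in x]
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

# OLS fit.
x_bar = sum(x) / n
y_bar = sum(y) / n
sst_x = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sst_x
b0 = y_bar - b1 * x_bar
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# Usual (homoskedasticity-only) standard error: sqrt(sigma_hat^2 / SST_x).
sigma2_hat = sum(r ** 2 for r in resid) / (n - 2)
se_usual = math.sqrt(sigma2_hat / sst_x)

# Robust: V_hat = [n^-1 sum (x - x_bar)^2 u_hat^2] / (n^-1 SST_x)^2, se = sqrt(V_hat / n).
v_hc = (sum((xi - x_bar) ** 2 * r ** 2 for xi, r in zip(x, resid)) / n) / (sst_x / n) ** 2
se_hc = math.sqrt(v_hc / n)

print(f"slope = {b1:.4f}")
print(f"usual se  = {se_usual:.5f}")
print(f"robust se = {se_hc:.5f}   (ratio = {se_hc / se_usual:.2f})")
```

In this design the robust standard error should come out well above the usual one, since the error variance is largest exactly where \((x_i - \bar{x})^2\) is largest.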

Robust Inference

With robust standard errors, we form the \(t\)-statistic as before:

\[ t = \frac{\hat\beta_j - \beta_{j,0}}{\text{se}^{HC}(\hat\beta_j)} \]

  • Under \(H_0\), \(t \xrightarrow{d} N(0,1)\).
  • Robust \(F\)-statistics for joint hypotheses are also available.

Caveat: robust standard errors rely on large-sample theory. In small samples, they can be unreliable.
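A Monte Carlo sketch of what is at stake, under an assumed design (\(x_i \sim N(0,1)\), \(u_i = x_i e_i\) with \(e_i \sim N(0,1)\), \(n = 200\)); the function names are hypothetical. It tests a true null at the nominal 5% level and records how often each \(t\)-statistic rejects:

```python
import math
import random

def t_stats(x, y, beta_null):
    """Return (t_usual, t_robust) for the slope in a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sst_x
    b0 = y_bar - b1 * x_bar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    se_usual = math.sqrt(sum(r ** 2 for r in resid) / (n - 2) / sst_x)
    # sqrt(V_hat / n) simplifies to sqrt(sum (x - x_bar)^2 u_hat^2) / SST_x.
    se_hc = math.sqrt(sum((xi - x_bar) ** 2 * r ** 2 for xi, r in zip(x, resid))) / sst_x
    return (b1 - beta_null) / se_usual, (b1 - beta_null) / se_hc

def rejection_rates(n=200, reps=2000, beta1=2.0, seed=0):
    """Share of samples in which a true H0 is rejected at the nominal 5% level."""
    rng = random.Random(seed)
    z_crit = 1.96  # approximate 97.5th percentile of N(0, 1)
    reject_usual = 0
    reject_robust = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [1.0 + beta1 * xi + xi * rng.gauss(0.0, 1.0) for xi in x]
        t_u, t_r = t_stats(x, y, beta1)  # H0: beta1 equals its true value
        reject_usual += abs(t_u) > z_crit
        reject_robust += abs(t_r) > z_crit
    return reject_usual / reps, reject_robust / reps

rate_usual, rate_robust = rejection_rates()
print(f"rejection rate, usual se:  {rate_usual:.3f}")
print(f"rejection rate, robust se: {rate_robust:.3f}   (nominal: 0.050)")
```

The usual statistic should reject far too often in this design. The robust one should land much closer to 5%, though typically not exactly on it at this sample size, which is the small-sample caveat in action.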

Example: Hourly Wage Equation

\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(.105)\; [.107]}{-1.28} &+ \underset{(.0075)\; [.0078]}{.0904}\; \text{educ} + \underset{(.0052)\; [.0050]}{.0410}\; \text{exper} \\ &- \underset{(.0001)\; [.0001]}{.0007}\; \text{exper}^2 \end{aligned} \]

Usual standard errors in \((\cdot)\), robust in \([\cdot]\).

  • The differences are small here.
  • With strong heteroskedasticity, differences can be substantial.
  • To be safe: always report robust standard errors.

When Does Heteroskedasticity Arise?

  • Intrinsic economic variation: household spending is more volatile for wealthier households; large firms have larger profit fluctuations.
  • Data aggregation: if observations are group averages (e.g., firm-level means of employee data), the error variance is \(\sigma^2 / m_i\), where \(m_i\) is the group size. Larger groups have smaller error variance.
  • Binary outcomes (LPM): when \(y \in \{0,1\}\), the conditional variance is \(\text{Var}(y \mid x) = P(y=1 \mid x)(1 - P(y=1 \mid x))\), which depends on \(x\) by construction.

Heteroskedasticity is the norm in cross-sectional data, not the exception.

Summary

  • Large-sample theory allows valid inference without normality (MLR.6).
  • Consistency: \(\hat\beta_j \xrightarrow{p} \beta_j\) under MLR.1–4.
  • Asymptotic normality: OLS is asymptotically normal under MLR.1–4; MLR.5 simplifies the variance formula.
  • Heteroskedasticity does not bias OLS, but invalidates the usual standard errors and test statistics.
  • Robust standard errors restore valid inference under heteroskedasticity (large samples).

What’s Next

Lecture 4a — Functional Form & Scaling:

  • Logarithmic models and elasticities
  • Quadratic and interaction terms
  • Scaling and units of measurement