Asymptotics & Heteroskedasticity

Natasha Kang

Xiamen University, Chow Institute

March, 2026

Why Large-Sample Theory?

In Lecture 3a, inference relied on the normality assumption (MLR.6): \(u \mid X \sim N(0, \sigma^2)\).
This gave us exact \(t\)- and \(F\)-distributions for any sample size.

But normality is a strong assumption. What if the errors are not normal?
Large-sample (asymptotic) theory provides approximate distributions under weaker assumptions, valid when \(n\) is large.

Two Approaches to Inference

	Exact (finite-sample)	Asymptotic (large-sample)
Assumptions	MLR.1–MLR.6 (normality)	MLR.1–MLR.5 (no normality)
Distributions	Exact \(t\) and \(F\)	Approximate (via CLT)
Sample size	Any \(n\)	Requires \(n\) large

We need two key tools: the Law of Large Numbers and the Central Limit Theorem.

Convergence in Probability

Let \(\theta_n\) be a sequence of random variables indexed by \(n\). We say \(\theta_n\) converges in probability to \(\theta\) if

\[ P(|\theta_n - \theta| \geq \varepsilon) \to 0 \quad \text{for all } \varepsilon > 0 \]

Notation: \(\theta_n \xrightarrow{p} \theta\) or \(\text{plim}\, \theta_n = \theta\).

Informally: as \(n\) grows, \(\theta_n\) is increasingly likely to be close to \(\theta\).

Law of Large Numbers

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) < \infty\). Then:

\[ \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} \mu \]

The sample average converges to the population mean.
More generally, sample moments converge to their population counterparts: \(\frac{1}{n}\sum_{i=1}^n g(X_i) \xrightarrow{p} E[g(X_i)]\), provided \(E|g(X_i)| < \infty\).
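A small simulation can make the LLN concrete. The sketch below is illustrative only: the exponential distribution, the mean \(\mu = 2\), the seed, and the helper name `mean_abs_errors` are assumptions chosen for the example, not part of the lecture.

```python
import numpy as np

def mean_abs_errors(sizes, mu=2.0, seed=0):
    """Return {n: |sample mean - mu|} for i.i.d. exponential draws with mean mu."""
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    return {n: abs(rng.exponential(scale=mu, size=n).mean() - mu) for n in sizes}

# The gap between the sample average and mu is typically much smaller for large n.
errors = mean_abs_errors([100, 10_000, 1_000_000])
for n, err in errors.items():
    print(f"n = {n:>9,}  |mean - mu| = {err:.5f}")
```

The gap need not shrink monotonically in any single run; the LLN says only that large gaps become increasingly unlikely.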

Properties of Convergence in Probability

If \(\theta_n \xrightarrow{p} \theta\) and \(\phi_n \xrightarrow{p} \phi\):

\(\theta_n + \phi_n \xrightarrow{p} \theta + \phi\)
\(\theta_n \phi_n \xrightarrow{p} \theta \phi\)
\(\theta_n / \phi_n \xrightarrow{p} \theta / \phi\) (if \(\phi \neq 0\))

Continuous mapping: if \(g(\cdot)\) is continuous at \(\theta\), then \(g(\theta_n) \xrightarrow{p} g(\theta)\).

Consistency

An estimator \(\hat\theta_n\) is consistent for \(\theta\) if \(\hat\theta_n \xrightarrow{p} \theta\).

As \(n\) increases, the estimator concentrates around the true value.
Consistency is a minimum requirement for sensible estimators.

“If you can’t get it right as \(n\) goes to infinity, you shouldn’t be in this business.” — C. W. J. Granger

Consistency: Visualized
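One way to visualize consistency is to simulate the sampling distribution of an OLS slope at several sample sizes and watch it tighten around the true value. The following is a minimal sketch under assumed settings (true slope 0.5, uniform errors, 1,000 replications); none of these choices come from the lecture.

```python
import numpy as np

def slope_draws(n, reps=1000, beta0=1.0, beta1=0.5, seed=0):
    """Simulate `reps` OLS slope estimates, each from a sample of size n."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(reps, n))
    u = rng.uniform(-1.0, 1.0, size=(reps, n))   # mean-zero, non-normal errors
    y = beta0 + beta1 * x + u
    xd = x - x.mean(axis=1, keepdims=True)       # demean x within each sample
    return (xd * y).sum(axis=1) / (xd ** 2).sum(axis=1)

summary = {}
for n in (25, 100, 400, 1600):
    b1 = slope_draws(n)
    summary[n] = (b1.mean(), b1.std())
    print(f"n = {n:>5}  mean = {summary[n][0]:.4f}  sd = {summary[n][1]:.4f}")
```

The standard deviation across replications should fall roughly in proportion to \(1/\sqrt{n}\), while the mean stays near the true slope.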

Consistency of OLS

Theorem: Under MLR.1–MLR.4, the OLS estimator \(\hat\beta_j\) is consistent for \(\beta_j\).

Note: we do not need homoskedasticity (MLR.5) or normality (MLR.6).

Consistency of OLS: Proof Sketch (SLR)

Write \(\hat\beta_1 = \beta_1 + \frac{n^{-1}\sum(x_i - \bar{x})u_i}{n^{-1}\sum(x_i - \bar{x})^2}\).

Denominator: \(\frac{1}{n}\sum(x_i - \bar{x})^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 \xrightarrow{p} E[x_i^2] - (E[x_i])^2 = \text{Var}(x_i)\)

Numerator: expand as \(\frac{1}{n}\sum x_i u_i - \bar{x}\cdot\frac{1}{n}\sum u_i\).

\(\frac{1}{n}\sum x_i u_i \xrightarrow{p} E[x_i u_i] = 0\) (LLN + MLR.4)
\(\bar{x} \xrightarrow{p} E[x_i]\), \(\frac{1}{n}\sum u_i \xrightarrow{p} 0\)
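By Slutsky: numerator \(\xrightarrow{p} 0\). Since the denominator converges to \(\text{Var}(x_i) > 0\), it follows that \(\hat\beta_1 \xrightarrow{p} \beta_1 + 0/\text{Var}(x_i) = \beta_1\).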

A Weaker Condition for Consistency

The proof used only \(E[u_i] = 0\) and \(E[x_i u_i] = 0\), not the full \(E(u \mid X) = 0\). So we can replace MLR.4 with a weaker assumption:

MLR.4: \(E(u \mid x_1, \ldots, x_k) = 0\) (zero conditional mean)
MLR.4’: \(E(u) = 0\) and \(\text{Cov}(x_j, u) = 0\) for all \(j\) (zero mean and zero correlation)

MLR.4 implies MLR.4’, but not vice versa.
MLR.4 \(\Rightarrow\) unbiasedness (finite-sample property).
MLR.4’ \(\Rightarrow\) consistency (large-sample property).

Consistency vs. Unbiasedness

Example: True model is \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + u\), with \(E(u \mid x) = 0\) and \(x \sim N(0,1)\).

Suppose we estimate the misspecified model: \(y = \alpha_0 + \beta_1 x + v\).

\(v = \beta_2 x^2 + u - \beta_2 E(x^2)\), so \(E(v \mid x) = \beta_2(x^2 - 1) \neq 0\).
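MLR.4 is violated, so the unbiasedness argument fails: conditional on the sampled \(x\) values, \(E(\hat\beta_1 \mid X) \neq \beta_1\) in general.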

But \(\text{Cov}(x, v) = E[xv] = \beta_2 E[x^3] = 0\) (odd moments of a symmetric distribution vanish).
MLR.4’ holds: \(\hat\beta_1\) is consistent.
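The example can be checked numerically. The sketch below simulates the true quadratic model with assumed parameter values (\(\beta_0 = 1\), \(\beta_1 = 2\), \(\beta_2 = 3\), standard normal \(u\)) and fits the misspecified linear model; these values are illustrative and not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, beta2 = 1.0, 2.0, 3.0   # assumed values for illustration
n = 200_000

x = rng.normal(size=n)                 # x ~ N(0, 1), so E[x^2] = 1 and E[x^3] = 0
u = rng.normal(size=n)
y = beta0 + beta1 * x + beta2 * x**2 + u

# OLS of y on a constant and x only (x^2 is omitted).
xd = x - x.mean()
b1 = (xd * y).sum() / (xd ** 2).sum()
a0 = y.mean() - b1 * x.mean()

print(f"slope     = {b1:.3f}  (beta1 = {beta1})")
print(f"intercept = {a0:.3f}  (beta0 + beta2 * E[x^2] = {beta0 + beta2})")
```

With a large sample the slope should sit close to \(\beta_1\), while the intercept absorbs \(\beta_2 E(x^2)\), matching \(\alpha_0 = \beta_0 + \beta_2 E(x^2)\) implied by the definition of \(v\).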

Inconsistency

When \(\text{Cov}(x_j, u) \neq 0\), OLS is inconsistent. In the SLR case:

\[ \text{plim}\, \hat\beta_1 = \beta_1 + \frac{\text{Cov}(x_i, u_i)}{\text{Var}(x_i)} \]

The second term is the asymptotic bias.
More data does not help: as \(n \to \infty\), \(\hat\beta_1\) converges to the wrong value.
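A short simulation illustrates that the estimate settles on the wrong value. In this sketch the data-generating process is an assumption made for the example: \(x = e + 0.5u\) with independent standard normal \(e\) and \(u\), so \(\text{Cov}(x, u) = 0.5\), \(\text{Var}(x) = 1.25\), and the asymptotic bias is \(0.4\).

```python
import numpy as np

def endogenous_slope(n, beta1=1.0, seed=0):
    """OLS slope when the regressor is correlated with the error."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)
    x = rng.normal(size=n) + 0.5 * u    # Cov(x, u) = 0.5, Var(x) = 1.25
    y = 2.0 + beta1 * x + u
    xd = x - x.mean()
    return (xd * y).sum() / (xd ** 2).sum()

plim = 1.0 + 0.5 / 1.25                 # beta1 + Cov(x, u) / Var(x)
estimates = {n: endogenous_slope(n) for n in (1_000, 1_000_000)}
for n, b1 in estimates.items():
    print(f"n = {n:>9,}  slope = {b1:.4f}  (plim = {plim:.1f}, beta1 = 1.0)")
```

Increasing \(n\) reduces the noise around \(1.4\) but does nothing to move the estimate toward \(\beta_1 = 1\).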

Asymptotic OVB

True model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + v\), with \(E(v \mid x_1, x_2) = 0\).

Misspecified model omits \(x_2\): \(y = \beta_0 + \beta_1 x_1 + u\).

\[ \text{plim}\, \tilde\beta_1 = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} \]

Same structure as finite-sample OVB, but now stated as a probability limit.
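The formula follows from the SLR probability limit \(\text{plim}\,\tilde\beta_1 = \text{Cov}(x_1, y)/\text{Var}(x_1)\) after substituting the true model; \(E(v \mid x_1, x_2) = 0\) implies \(\text{Cov}(x_1, v) = 0\), which removes the last term:

\[ \begin{aligned} \text{plim}\,\tilde\beta_1 &= \frac{\text{Cov}(x_1,\, \beta_0 + \beta_1 x_1 + \beta_2 x_2 + v)}{\text{Var}(x_1)} \\ &= \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} + \frac{\text{Cov}(x_1, v)}{\text{Var}(x_1)} = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)} \end{aligned} \]

As a made-up numerical illustration: with \(\beta_1 = 1\), \(\beta_2 = 2\), \(\text{Cov}(x_1, x_2) = 0.3\), and \(\text{Var}(x_1) = 1\), the short regression converges to \(1 + 2(0.3) = 1.6\) rather than \(1\).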

Convergence in Distribution

A sequence \(W_n\) converges in distribution to \(W\) if:

\[ P(W_n \leq x) \to P(W \leq x) \]

at every point \(x\) where the CDF of \(W\) is continuous.

Notation: \(W_n \xrightarrow{d} W\).

This is convergence of CDFs, not of the random variables themselves.
Weaker than convergence in probability.

Central Limit Theorem

Let \(X_1, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) = \sigma^2 < \infty\). Then:

\[ \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2) \]

The centered and scaled sample average is approximately normal in large samples, regardless of the population distribution (as long as the variance is finite).
This is the foundation for asymptotic inference.
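The CLT can be checked by simulation with a clearly non-normal population. The sketch below assumes Exponential(1) draws (so \(\mu = \sigma^2 = 1\)), \(n = 200\), and 20,000 replications; these settings are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20_000
mu = sigma2 = 1.0                       # mean and variance of Exponential(1)

draws = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (draws.mean(axis=1) - mu)   # sqrt(n) * (sample mean - mu)

# If z is approximately N(0, sigma2), about 95% of draws fall within 1.96 sd.
coverage = np.mean(np.abs(z) <= 1.96 * np.sqrt(sigma2))
print(f"variance of z = {z.var():.3f}  (theory: {sigma2})")
print(f"share within +/- 1.96 sd = {coverage:.3f}  (normal: 0.950)")
```

Although each draw is heavily skewed, the scaled average should have variance close to \(\sigma^2\) and tail frequencies close to the normal ones.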

Properties

If \(W_n \xrightarrow{d} W\) and \(\theta_n \xrightarrow{p} \theta\):

\(W_n + \theta_n \xrightarrow{d} W + \theta\)
\(W_n \theta_n \xrightarrow{d} W\theta\)

If \(Z_n \xrightarrow{d} N(0,1)\), then \(Z_n^2 \xrightarrow{d} \chi^2_1\).

These results let us combine convergence in probability (for consistent estimators) with convergence in distribution (from the CLT).

Asymptotic Normality of OLS (SLR)

Theorem: Under MLR.1–MLR.4 and \(\text{Var}(u \mid X) = \sigma^2\) (MLR.5):

\[ \sqrt{n}(\hat\beta_1 - \beta_1) \xrightarrow{d} N\!\left(0, \; \frac{\sigma^2}{\text{Var}(x_i)}\right) \]

No normality assumption (MLR.6) needed.
The CLT drives the normal approximation.

Proof Sketch

\[ \sqrt{n}(\hat\beta_1 - \beta_1) = \frac{\frac{1}{\sqrt{n}}\sum(x_i - \bar{x})u_i}{\frac{1}{n}\sum(x_i - \bar{x})^2} \]

Denominator: \(\xrightarrow{p} \text{Var}(x_i)\) (as in the consistency proof)

Numerator: expand as \(\frac{1}{\sqrt{n}}\sum(x_i - E[x_i])u_i + (E[x_i] - \bar{x})\frac{1}{\sqrt{n}}\sum u_i\).

First term: \((x_i - E[x_i])u_i\) are i.i.d. with mean zero, so by CLT \(\xrightarrow{d} N(0, E[(x_i - E[x_i])^2 u_i^2])\).
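Second term \(\xrightarrow{p} 0\): \(E[x_i] - \bar{x} \xrightarrow{p} 0\) by the LLN, while \(\frac{1}{\sqrt{n}}\sum u_i\) converges in distribution by the CLT, so their product vanishes (Slutsky).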

Proof Sketch (cont.)

Under homoskedasticity (MLR.5):

\[ E[(x_i - E[x_i])^2 u_i^2] = \sigma^2 \text{Var}(x_i) \]

Combining by Slutsky’s theorem:

\[ \sqrt{n}(\hat\beta_1 - \beta_1) \xrightarrow{d} N\!\left(0, \; \frac{\sigma^2}{\text{Var}(x_i)}\right) \]
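The limiting variance \(\sigma^2/\text{Var}(x_i)\) can be compared against a Monte Carlo estimate. This is a sketch under assumed settings: \(x_i \sim N(0, 4)\), homoskedastic but skewed errors (centered exponential with variance 1), \(n = 400\), and 5,000 replications, giving a theoretical variance of \(1/4\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta1 = 400, 5000, 0.5

x = rng.normal(scale=2.0, size=(reps, n))               # Var(x) = 4
u = rng.exponential(scale=1.0, size=(reps, n)) - 1.0    # skewed, mean 0, variance 1
y = 1.0 + beta1 * x + u

xd = x - x.mean(axis=1, keepdims=True)
b1 = (xd * y).sum(axis=1) / (xd ** 2).sum(axis=1)       # one slope per replication
scaled = np.sqrt(n) * (b1 - beta1)

theory_var = 1.0 / 4.0                                   # sigma^2 / Var(x)
print(f"simulated variance = {scaled.var():.4f}  (theory: {theory_var})")
```

The errors are far from normal, yet the variance of \(\sqrt{n}(\hat\beta_1 - \beta_1)\) should land near the theoretical value.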

From SLR to MLR

In SLR, the asymptotic variance involves \(\text{Var}(x_i)\): the total variation in \(x\).

In MLR, the variation that identifies \(\beta_j\) is the variation in \(x_j\) after partialling out the other regressors.

General result (MLR.1–MLR.5):

\[ \hat\beta_j \overset{a}{\sim} N\!\left(\beta_j, \; \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)}\right) \]

where \(\text{SST}_j = \sum(x_{ij} - \bar{x}_j)^2\) and \(R_j^2\) is the \(R^2\) from regressing \(x_j\) on the other regressors. The denominator \(\text{SST}_j(1 - R_j^2) = \sum \hat{r}_{ij}^2\) is the residual variation in \(x_j\).
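The identity \(\text{SST}_j(1 - R_j^2) = \sum \hat{r}_{ij}^2\) and the partialling-out interpretation can be verified numerically. The sketch below uses simulated data with assumed coefficients; variable names such as `r_hat` and `beta1_fwl` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.6 * x2 - 0.3 * x3 + rng.normal(size=n)   # x1 correlated with the others
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

# Regress x1 on a constant, x2, x3; keep fitted values and residuals r_hat.
Z = np.column_stack([np.ones(n), x2, x3])
gamma, *_ = np.linalg.lstsq(Z, x1, rcond=None)
fitted = Z @ gamma
r_hat = x1 - fitted

sst_1 = ((x1 - x1.mean()) ** 2).sum()
r2_1 = np.corrcoef(x1, fitted)[0, 1] ** 2        # R^2 as squared correlation
lhs = sst_1 * (1.0 - r2_1)
rhs = (r_hat ** 2).sum()

# Partialling out: the slope on x1 in the full regression equals
# sum(r_hat * y) / sum(r_hat^2).
X = np.column_stack([np.ones(n), x1, x2, x3])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta1_fwl = (r_hat * y).sum() / rhs

print(f"SST_1 (1 - R_1^2) = {lhs:.6f},  sum r_hat^2 = {rhs:.6f}")
print(f"full-regression slope = {beta_full[1]:.6f},  partialled-out slope = {beta1_fwl:.6f}")
```

Both pairs of numbers should agree up to floating-point error, which is why only the residual variation in \(x_j\) enters the variance of \(\hat\beta_j\).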

Asymptotic \(t\)-Test

To test \(H_0: \beta_j = \beta_{j,0}\):

\[ t = \frac{\hat\beta_j - \beta_{j,0}}{\text{se}(\hat\beta_j)} \xrightarrow{d} N(0,1) \quad \text{under } H_0 \]

Reject \(H_0\) at level \(\alpha\) if \(|t| > z_{1-\alpha/2}\).
Similarly, \(F_{q,\, n-k-1} \to \chi^2_q / q\) as \(n \to \infty\).
In practice, software uses \(t_{n-k-1}\) and \(F_{q,\, n-k-1}\), which are more conservative and coincide with the exact results under MLR.6.

The standard error \(\text{se}(\hat\beta_j)\) is called the asymptotic standard error when MLR.6 is not assumed.
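The decision rule can be written as a few lines of code. This is a minimal sketch using the standard normal critical value; the function name `asymptotic_t_test` and the input numbers (estimate 0.52, standard error 0.11, null value 0.30) are hypothetical.

```python
from statistics import NormalDist

def asymptotic_t_test(estimate, se, null_value=0.0, alpha=0.05):
    """Two-sided test of H0: beta = null_value using the N(0, 1) approximation."""
    t_stat = (estimate - null_value) / se
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)       # z_{1 - alpha/2}
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, critical, abs(t_stat) > critical, p_value

t_stat, critical, reject, p_value = asymptotic_t_test(0.52, 0.11, null_value=0.30)
print(f"t = {t_stat:.3f}, critical value = {critical:.3f}, "
      f"reject = {reject}, p-value = {p_value:.4f}")
```

Replacing the normal critical value with the \(t_{n-k-1}\) quantile gives a slightly larger threshold, and the difference shrinks as \(n\) grows.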

Practical Implications

	CLM (MLR.1–6)	Gauss-Markov (MLR.1–5)
\(t\) and \(F\) distributions	Exact, any \(n\)	Approximate, large \(n\)
Normality assumed?	Yes	No

In large samples, the same \(t\)- and \(F\)-tests are valid either way.
The distinction matters most in small samples where the normal approximation may be poor.

Asymptotic Efficiency

Theorem: Under MLR.1–MLR.5, OLS is asymptotically efficient among linear estimators: no other linear, consistent, asymptotically normal estimator has a smaller asymptotic variance.

This is the large-sample analog of the Gauss-Markov theorem (BLUE).

Summary: Finite-Sample vs. Large-Sample

Finite-sample (MLR.1–MLR.6):

Unbiasedness (MLR.1–4), variance formulas (MLR.1–5)
BLUE / Gauss-Markov (MLR.1–5)
Exact \(t\) and \(F\) distributions (MLR.1–6)

Large-sample (MLR.1–MLR.5):

Consistency (MLR.1–4)
Asymptotic normality, approximate \(t\) and \(F\) (MLR.1–5)
Asymptotic efficiency (MLR.1–5)

The key tradeoff: we drop normality (MLR.6), but results are now approximations that require large \(n\).

So Far: Homoskedasticity

The inference results above (the asymptotic variance, standard errors, and \(t\)- and \(F\)-tests) assumed MLR.5 (homoskedasticity):

\[ \text{Var}(u \mid X) = \sigma^2 \]

The error variance is the same for all values of \(X\).
This gave us the variance formula \(\text{Var}(\hat\beta_j \mid X) = \sigma^2 / [\text{SST}_j(1 - R_j^2)]\).

What if MLR.5 fails?

Heteroskedasticity

If the error variance depends on \(X\):

\[ \text{Var}(u \mid X) = \sigma^2(X) \]

the errors are heteroskedastic.

What Breaks, What Doesn’t

Under heteroskedasticity (MLR.5 fails, MLR.1–4 hold):

Still valid:

OLS is unbiased and consistent
OLS coefficients are asymptotically normal (but with a different variance)

No longer valid:

The usual variance formula \(\hat\sigma^2 / [\text{SST}_j(1 - R_j^2)]\)
The usual \(t\)- and \(F\)-statistics (which rely on that formula)
The BLUE property of OLS (and its asymptotic efficiency)

Why the Usual Standard Errors Fail

The asymptotic variance of \(\hat\beta_1\) (SLR) is:

\[ V_1 = \frac{E[(x_i - E[x_i])^2 u_i^2]}{[\text{Var}(x_i)]^2} \]

Under homoskedasticity, \(E[u_i^2 \mid x_i] = \sigma^2\), so we can simplify:

\[ \begin{aligned} E[(x_i - E[x_i])^2 u_i^2] &= E\!\big[(x_i - E[x_i])^2\, E[u_i^2 \mid x_i]\big] \\ &= \sigma^2\, E[(x_i - E[x_i])^2] = \sigma^2 \text{Var}(x_i) \end{aligned} \]

and \(V_1 = \sigma^2 / \text{Var}(x_i)\).

Under heteroskedasticity, \(E[u_i^2 \mid x_i]\) varies with \(x_i\), so \(\sigma^2\) cannot be pulled out. The usual formula understates or overstates the true variance, leading to invalid inference.
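A Monte Carlo experiment shows how far off the usual standard error can be. The design below is an assumption made for illustration: \(x_i \sim N(0,1)\) and \(u_i = x_i e_i\) with \(e_i \sim N(0,1)\), so \(\text{Var}(u_i \mid x_i) = x_i^2\). In this design \(V_1 = E[x_i^4] = 3\), while the homoskedastic formula targets \(E[u_i^2]/\text{Var}(x_i) = 1\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta1 = 500, 2000, 1.0

x = rng.normal(size=(reps, n))
u = x * rng.normal(size=(reps, n))        # Var(u | x) = x^2: heteroskedastic
y = 1.0 + beta1 * x + u

xd = x - x.mean(axis=1, keepdims=True)
sst = (xd ** 2).sum(axis=1)
b1 = (xd * y).sum(axis=1) / sst
b0 = y.mean(axis=1) - b1 * x.mean(axis=1)
resid = y - b0[:, None] - b1[:, None] * x

# Usual (homoskedasticity-based) standard error in each replication.
sigma2_hat = (resid ** 2).sum(axis=1) / (n - 2)
usual_se = np.sqrt(sigma2_hat / sst)

true_sd = b1.std()                         # actual sampling variability of the slope
mean_usual_se = usual_se.mean()
print(f"sd of slope across replications = {true_sd:.4f}")
print(f"average usual standard error    = {mean_usual_se:.4f}")
```

Here the usual standard error understates the sampling variability by a factor near \(\sqrt{3}\), so confidence intervals built from it would be too narrow.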

Heteroskedasticity-Robust Standard Errors

Idea: estimate the general \(V_1\) directly, without assuming homoskedasticity.

In SLR:

\[ \hat{V}_1^{HC} = \frac{n^{-1}\sum(x_i - \bar{x})^2 \hat{u}_i^2}{\left(n^{-1}\text{SST}_x\right)^2} \]

Each squared residual \(\hat{u}_i^2\) enters individually, weighted by \((x_i - \bar{x})^2\), rather than being replaced by a common \(\hat\sigma^2\).
In MLR, the same principle applies: the general form is a “sandwich” estimator. Software computes this automatically.

The resulting standard errors are called heteroskedasticity-robust (or HC) standard errors (White, 1980).
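The SLR formula for \(\hat{V}_1^{HC}\) translates directly into code. The sketch below implements it with no degrees-of-freedom correction and converts it to a standard error via \(\sqrt{\hat{V}_1^{HC}/n}\); the function name `slr_standard_errors` and the simulated heteroskedastic data (\(u_i = x_i e_i\)) are assumptions for illustration.

```python
import numpy as np

def slr_standard_errors(x, y):
    """Return (slope, usual se, robust se) for a simple regression of y on x."""
    n = x.size
    xd = x - x.mean()
    sst = (xd ** 2).sum()
    b1 = (xd * y).sum() / sst
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    se_usual = np.sqrt((resid ** 2).sum() / (n - 2) / sst)
    v_hc = np.mean(xd ** 2 * resid ** 2) / (sst / n) ** 2   # estimate of V_1
    se_hc = np.sqrt(v_hc / n)
    return b1, se_usual, se_hc

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
y = 1.0 + 1.0 * x + x * rng.normal(size=n)   # Var(u | x) = x^2, so V_1 = 3
b1, se_usual, se_hc = slr_standard_errors(x, y)
print(f"slope = {b1:.4f}, usual se = {se_usual:.4f}, robust se = {se_hc:.4f}")
```

In this design the robust standard error should be close to \(\sqrt{3/n}\), while the usual one is close to \(\sqrt{1/n}\). Statistical packages typically offer variants with small-sample adjustments, which differ slightly from this basic form.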

Robust Inference

With robust standard errors, we form the \(t\)-statistic as before:

\[ t = \frac{\hat\beta_j - \beta_{j,0}}{\text{se}^{HC}(\hat\beta_j)} \]

Under \(H_0\), \(t \xrightarrow{d} N(0,1)\).
Robust \(F\)-statistics for joint hypotheses are also available.

Caveat: robust standard errors rely on large-sample theory. In small samples, they can be unreliable.

Example: Hourly Wage Equation

\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(.105)\; [.107]}{-1.28} &+ \underset{(.0075)\; [.0078]}{.0904}\; \text{educ} + \underset{(.0052)\; [.0050]}{.0410}\; \text{exper} \\ &- \underset{(.0001)\; [.0001]}{.0007}\; \text{exper}^2 \end{aligned} \]

Usual standard errors in \((\cdot)\), robust in \([\cdot]\).

The differences are small here.
With strong heteroskedasticity, differences can be substantial.
To be safe: always report robust standard errors.

When Does Heteroskedasticity Arise?

Intrinsic economic variation: household spending is more volatile for wealthier households; large firms have larger profit fluctuations.
Data aggregation: if observations are group averages (e.g., firm-level means of employee data), the error variance is \(\sigma^2 / m_i\), where \(m_i\) is the group size. Larger groups have smaller error variance.
Binary outcomes (LPM): when \(y \in \{0,1\}\), the conditional variance is \(\text{Var}(y \mid x) = P(y=1 \mid x)(1 - P(y=1 \mid x))\), which depends on \(x\) by construction.

Heteroskedasticity is the norm in cross-sectional data, not the exception.
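The aggregation case can be made concrete with a short derivation. It assumes (an added assumption for this illustration) that the individual errors \(u_{ij}\) within group \(i\) are uncorrelated with common variance \(\sigma^2\):

\[ \text{Var}(\bar{u}_i) = \text{Var}\!\left(\frac{1}{m_i}\sum_{j=1}^{m_i} u_{ij}\right) = \frac{1}{m_i^2}\sum_{j=1}^{m_i}\text{Var}(u_{ij}) = \frac{\sigma^2}{m_i} \]

For example, with \(\sigma^2 = 4\), a group of \(m_i = 4\) has error variance \(1\), while a group of \(m_i = 100\) has error variance \(0.04\). The averaged data are heteroskedastic even though the individual-level errors are not.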

Summary

Large-sample theory allows valid inference without normality (MLR.6).
Consistency: \(\hat\beta_j \xrightarrow{p} \beta_j\) under MLR.1–4.
Asymptotic normality: OLS is asymptotically normal under MLR.1–4; MLR.5 simplifies the variance formula.
Heteroskedasticity does not bias OLS, but invalidates the usual standard errors and test statistics.
Robust standard errors restore valid inference under heteroskedasticity (large samples).

What’s Next

Lecture 4a — Functional Form & Scaling:

Logarithmic models and elasticities
Quadratic and interaction terms
Scaling and units of measurement