Properties of OLS Estimators

Natasha Kang

Xiamen University, Chow Institute

May, 2026

Estimators vs. Estimates

An estimate is a number — the value \(\hat\beta_1\) computed from a particular sample.
An estimator is a random variable — a function of the random sample, whose value changes across samples.

We have discussed the algebraic properties of OLS estimates (they hold in any sample).
Now we study the statistical properties of OLS estimators — properties that hold across repeated sampling:
- Unbiasedness: Is the estimator centered on the true parameter?
- Variance: How much does the estimator vary across samples?

The Gauss-Markov Assumptions: SLR

SLR.1 — Linear in Parameters: \(Y = \beta_0 + \beta_1 X + U\)

SLR.2 — Random Sampling: \(\{(X_i, Y_i) : i = 1, \ldots, n\}\) are i.i.d.
SLR.3 — Variation in \(X\): the sample values of \(X\) are not all equal
SLR.4 — Zero Conditional Mean: \(E[U \mid X] = 0\)
SLR.5 — Homoskedasticity: \(\text{Var}(U \mid X) = \sigma^2\)

SLR.1–SLR.4 are needed for unbiasedness.
SLR.5 is needed for the variance formula and the Gauss-Markov theorem.

Unbiasedness of OLS

Theorem: Under SLR.1–SLR.4,

\[ E[\hat\beta_0] = \beta_0, \qquad E[\hat\beta_1] = \beta_1 \]

Proof sketch: Write \(\hat\beta_1\) in terms of the population error:

\[ \hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \beta_1 + \frac{\sum_{i=1}^n (X_i - \bar{X})U_i}{\sum_{i=1}^n (X_i - \bar{X})^2} \]

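Take conditional expectations given \(X\) (all the sample values \(X_1, \ldots, X_n\)). By SLR.2 and SLR.4, \(E[U_i \mid X] = E[U_i \mid X_i] = 0\), and the weights \((X_i - \bar{X})\) depend only on \(X\), so the second term has conditional mean zero and \(E[\hat\beta_1 \mid X] = \beta_1\). SLR.3 guarantees the denominator is nonzero. By the law of iterated expectations, \(E[\hat\beta_1] = \beta_1\).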
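The second equality in the proof sketch can be spelled out in one line. This is a standard derivation that uses only SLR.1 and the facts \(\sum (X_i - \bar{X}) = 0\) and \(\sum (X_i - \bar{X}) X_i = \sum (X_i - \bar{X})^2\):

\[ \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^n (X_i - \bar{X}) Y_i = \beta_0 \sum_{i=1}^n (X_i - \bar{X}) + \beta_1 \sum_{i=1}^n (X_i - \bar{X}) X_i + \sum_{i=1}^n (X_i - \bar{X}) U_i = \beta_1 \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (X_i - \bar{X}) U_i \]

Dividing both sides by \(\sum_{i=1}^n (X_i - \bar{X})^2\) gives \(\hat\beta_1 = \beta_1\) plus the error term in the proof sketch.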

What Unbiasedness Means (and Doesn’t)

Which of the following are correct interpretations of unbiasedness?

1. Estimated coefficients will equal the true values.
2. On average, estimated coefficients will equal the true values.
3. In a given sample, estimates may differ considerably from the true values.
4. In a given sample, estimates are close to the true values.

Correct: 2 and 3.
Unbiasedness is a property of the procedure, not of any single estimate.
A single estimate can be far from the truth — unbiasedness says the errors average out over many samples.

The Sampling Distribution

We said \(E[\hat\beta_1] = \beta_1\). But expectation over what?

The population and \(\beta_j\) are fixed — there is nothing random about them.
Randomness comes from sampling: each sample gives different data, hence a different \(\hat\beta_j\).

The sampling distribution describes how \(\hat\beta_j\) varies across hypothetical repeated samples.
\(E[\hat\beta_1] = \beta_1\) means this distribution is centered on the truth.

Unbiasedness: Visualized

Each draw from the sampling distribution is one estimate from one sample.
The distribution is centered on \(\beta_1 = 2\) — that is what unbiasedness means.
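The repeated-sampling idea can be imitated on a computer. The sketch below is a minimal Monte Carlo experiment, not the code behind the slide's figure. The data-generating process (standard normal \(X\) and \(U\), \(\beta_0 = 1\), \(n = 50\), 2000 replications, the seed) and the function names are assumptions chosen for illustration; only \(\beta_1 = 2\) comes from the slide.

```python
import random


def ols_slope(x, y):
    """Return the OLS slope from regressing y on x with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(beta0=1.0, beta1=2.0, n=50, reps=2000, seed=0):
    """Draw `reps` independent samples and return the OLS slope from each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # U is drawn independently of X, so E[U | X] = 0 holds by construction.
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
        slopes.append(ols_slope(x, y))
    return slopes


slopes = simulate_slopes()
mean_slope = sum(slopes) / len(slopes)
print("first three estimates:", [round(b, 3) for b in slopes[:3]])
print("mean over all samples:", round(mean_slope, 3))
print("range of estimates:", round(min(slopes), 3), "to", round(max(slopes), 3))
```

Individual estimates scatter on both sides of 2, while their average sits very close to 2. A histogram of `slopes` would approximate the sampling distribution.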

What Can Go Wrong: Violating ZCM

If \(\text{Cov}(X, U) \neq 0\), the second term in

\[ \hat\beta_1 = \beta_1 + \frac{\sum (X_i - \bar{X})U_i}{\sum (X_i - \bar{X})^2} \]

does not vanish in expectation — OLS is biased.

Example: omitting ability from a wage regression on education. If ability raises wages and is positively correlated with education, \(E[\hat\beta_1] > \beta_1\).

Homoskedasticity

SLR.5 states: \(\text{Var}(U \mid X) = \sigma^2\) — the error variance does not depend on \(X\).

Homoskedasticity: the spread of \(Y\) around the regression line is the same at every value of \(X\).
Heteroskedasticity: the spread varies with \(X\) — e.g., income variance increases with education.

Homoskedasticity is needed for the standard variance formulas. If it fails, OLS is still unbiased, but the usual standard errors are wrong.

Homoskedasticity vs. Heteroskedasticity

Variance of OLS Estimators: SLR

Theorem: Under SLR.1–SLR.5, conditional on \(X\):

\[ \text{Var}(\hat\beta_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \]

\[ \text{Var}(\hat\beta_0 \mid X) = \frac{\sigma^2 \cdot \frac{1}{n}\sum_{i=1}^n X_i^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \]

Larger error variance \(\sigma^2\) → less precise estimates.
More spread in \(X\) → more precise estimates.
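The formula for \(\text{Var}(\hat\beta_1 \mid X)\) can be checked by simulation. The sketch below holds one set of \(X\) values fixed, redraws only the errors, and compares the simulated variance of the slope with \(\sigma^2 / \sum (X_i - \bar{X})^2\). All numerical settings (uniform \(X\) on \([0, 10]\), \(\sigma = 2\), \(n = 30\), 5000 replications, the seed) are illustrative assumptions.

```python
import random


def ols_slope(x, y):
    """Return the OLS slope from regressing y on x with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(1)
n, sigma, beta0, beta1 = 30, 2.0, 1.0, 2.0

# Draw X once and keep it fixed: the variance formula is conditional on X.
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
x_bar = sum(x) / n
sst_x = sum((xi - x_bar) ** 2 for xi in x)
theoretical_var = sigma ** 2 / sst_x

reps = 5000
slopes = []
for _ in range(reps):
    # Homoskedastic errors: the same sigma at every value of X (SLR.5).
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = sum(slopes) / reps
simulated_var = sum((b - mean_slope) ** 2 for b in slopes) / (reps - 1)

print("theoretical variance:", round(theoretical_var, 5))
print("simulated variance:  ", round(simulated_var, 5))
```

Doubling `sigma` or shrinking the range of `x` in this sketch makes both numbers grow, which matches the two comparative statics stated above.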

Variance: Visualized

Both estimators are unbiased — both distributions are centered on \(\beta_1 = 2\).
But the low-variance estimator concentrates more tightly around the truth: any single estimate is likely to be closer.

Estimating \(\sigma^2\): Simple Regression

We don’t observe \(\sigma^2\). If we could observe the errors \(U_i\), we’d estimate \(\sigma^2\) by \(\frac{1}{n}\sum U_i^2\).
We only observe residuals \(\hat{U}_i\), and \(\frac{1}{n}\sum \hat{U}_i^2\) is biased — the residuals are “too small” on average because OLS minimizes their sum of squares.

The FOCs impose two restrictions (\(\sum \hat{U}_i = 0\) and \(\sum X_i \hat{U}_i = 0\)), consuming two degrees of freedom.
The unbiased estimator is:

\[ \hat\sigma^2 = \frac{1}{n - 2}\sum_{i=1}^n \hat{U}_i^2 \]
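The effect of the degrees-of-freedom correction is easiest to see in a small sample. The following sketch compares the average of \(\frac{1}{n}\sum \hat{U}_i^2\) with the average of \(\frac{1}{n-2}\sum \hat{U}_i^2\) across simulated samples. The settings (\(n = 10\), \(\sigma^2 = 4\), fixed \(X = 1, \ldots, 10\), normal errors, the seed) are assumptions for illustration.

```python
import random


def ssr_from_ols(x, y):
    """Return the sum of squared OLS residuals from regressing y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


rng = random.Random(2)
n, sigma2, reps = 10, 4.0, 5000
x = [float(i) for i in range(1, n + 1)]  # fixed regressor values

naive, corrected = [], []
for _ in range(reps):
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, sigma2 ** 0.5) for xi in x]
    ssr = ssr_from_ols(x, y)
    naive.append(ssr / n)            # divides by n: biased downward
    corrected.append(ssr / (n - 2))  # divides by n - 2: unbiased

mean_naive = sum(naive) / reps
mean_corrected = sum(corrected) / reps
print("true sigma^2:        ", sigma2)
print("average of SSR/n:    ", round(mean_naive, 3))
print("average of SSR/(n-2):", round(mean_corrected, 3))
```

With \(n = 10\) the naive estimator is centered near \(\sigma^2 (n-2)/n = 3.2\) rather than 4; the gap shrinks as \(n\) grows.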

Gauss-Markov Assumptions: MLR

MLR.1 — Linear in Parameters: \(Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + U\)

MLR.2 — Random Sampling: \(\{(X_{i1}, \ldots, X_{ik}, Y_i) : i = 1, \ldots, n\}\) are i.i.d.
MLR.3 — No Perfect Collinearity: no exact linear relationships among the regressors
MLR.4 — Zero Conditional Mean: \(E[U \mid X_1, \ldots, X_k] = 0\)
MLR.5 — Homoskedasticity: \(\text{Var}(U \mid X_1, \ldots, X_k) = \sigma^2\)

No Perfect Collinearity (MLR.3)

Perfect collinearity: one regressor is an exact linear combination of others — OLS cannot be computed.

Common violations:
- Including a constant variable alongside the intercept
- Including \(\text{expendA}\), \(\text{expendB}\), and \(\text{totexpend} = \text{expendA} + \text{expendB}\)

Not a violation: including \(X\) and \(X^2\). Since \(X^2\) is a nonlinear function of \(X\), it is not a linear combination.
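A small numerical check makes the distinction concrete. With two regressors and an intercept, OLS must invert the matrix of centered sums of squares and cross-products, which is possible only if its determinant \(S_{11} S_{22} - S_{12}^2\) is nonzero. The sketch below uses exact rational arithmetic on made-up data (\(X = 1, \ldots, 5\)); the helper name `centered_moment_det` is hypothetical.

```python
from fractions import Fraction


def centered_moment_det(x1, x2):
    """Return S11 * S22 - S12**2 computed from demeaned x1 and x2.

    This determinant is zero exactly when the demeaned regressors are
    proportional, that is, when x2 is an exact linear function of x1
    (or one of them is constant).
    """
    n = len(x1)
    m1 = Fraction(sum(x1), n)
    m2 = Fraction(sum(x2), n)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    return s11 * s22 - s12 ** 2


x = [1, 2, 3, 4, 5]
x_linear = [2 * v + 3 for v in x]   # exact linear function of x
x_squared = [v ** 2 for v in x]     # nonlinear function of x

det_linear = centered_moment_det(x, x_linear)
det_squared = centered_moment_det(x, x_squared)
print("determinant with 2x + 3:", det_linear)
print("determinant with x^2:   ", det_squared)
```

The first determinant is exactly zero, so the normal equations have no unique solution. The second is positive, so \(X\) and \(X^2\) can both be included, although they may still be highly correlated.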

Unbiasedness in MLR

Theorem: Under MLR.1–MLR.4,

\[ E[\hat\beta_j] = \beta_j, \qquad j = 0, 1, \ldots, k \]

The proof follows the same logic as in SLR, extended to multiple regressors.

Variance of OLS Estimators: MLR

Theorem: Under MLR.1–MLR.5, conditional on \(X\):

\[ \text{Var}(\hat\beta_j \mid X) = \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)}, \qquad j = 1, \ldots, k \]

where \(\text{SST}_j = \sum_{i=1}^n (X_{ij} - \bar{X}_j)^2\) and \(R_j^2\) is the \(R^2\) from regressing \(X_j\) on all other regressors.

Proof: By FWL, \(\hat\beta_j\) is the SLR slope from regressing \(Y\) on \(\hat{r}_{ij}\) (residuals from \(X_j\) on all other \(X\)’s). Applying the SLR variance formula:

\[ \text{Var}(\hat\beta_j \mid X) = \frac{\sigma^2}{\sum_{i=1}^n \hat{r}_{ij}^2} = \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)} \]

where the last equality follows from \(\sum \hat{r}_{ij}^2 = \text{SST}_j(1 - R_j^2)\) by definition of \(R_j^2\). \(\square\)
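The last equality is the definition of \(R_j^2\) rearranged. Writing \(\text{SSR}_j\) (a label introduced here, not in the slides) for the residual sum of squares from the auxiliary regression of \(X_j\) on the other regressors:

\[ R_j^2 = 1 - \frac{\text{SSR}_j}{\text{SST}_j}, \qquad \text{SSR}_j = \sum_{i=1}^n \hat{r}_{ij}^2 \quad \Longrightarrow \quad \sum_{i=1}^n \hat{r}_{ij}^2 = \text{SST}_j (1 - R_j^2) \]

Because the auxiliary regression includes an intercept, the residuals \(\hat{r}_{ij}\) have sample mean zero, so \(\sum \hat{r}_{ij}^2\) plays the role of \(\sum (X_i - \bar{X})^2\) in the SLR variance formula.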

Understanding \(R_j^2\)

\(R_j^2\) is not the \(R^2\) of the main regression of \(Y\) on \(X\)’s.
\(R_j^2\) measures how well \(X_j\) is linearly predicted by the other regressors.

If \(R_j^2\) is high, most of the variation in \(X_j\) is shared with other regressors — little “unique” variation remains to identify \(\beta_j\).
If \(R_j^2 = 0\) (uncorrelated regressors), the MLR formula reduces to the SLR formula.

Three Factors Driving Estimator Precision

\[ \text{Var}(\hat\beta_j \mid X) = \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)} \]

Error variance \(\sigma^2\): more noise → less precision. Can reduce by adding relevant controls.
Total variation \(\text{SST}_j = (n - 1) \cdot S_{X_j}^2\), where \(S_{X_j}^2\) is the sample variance of \(X_j\): more variation in \(X_j\) → more precision. Increases with sample size.
Collinearity \(R_j^2\): higher \(R_j^2\) → less unique variation in \(X_j\) → less precision.
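A few lines of arithmetic show how each factor moves the variance. The numbers below (\(\sigma^2 = 4\), \(\text{SST}_j = 100\), \(R_j^2 = 0.9\)) are made up for illustration, and `slope_variance` is a hypothetical helper that simply evaluates the formula above.

```python
def slope_variance(sigma2, sst_j, r2_j):
    """Evaluate Var(beta_hat_j | X) = sigma^2 / (SST_j * (1 - R_j^2))."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1); R_j^2 = 1 is perfect collinearity")
    return sigma2 / (sst_j * (1.0 - r2_j))


baseline = slope_variance(sigma2=4.0, sst_j=100.0, r2_j=0.0)
noisier = slope_variance(sigma2=8.0, sst_j=100.0, r2_j=0.0)      # double the error variance
more_spread = slope_variance(sigma2=4.0, sst_j=200.0, r2_j=0.0)  # double the variation in X_j
collinear = slope_variance(sigma2=4.0, sst_j=100.0, r2_j=0.9)    # strong collinearity

for label, value in [("baseline", baseline), ("noisier errors", noisier),
                     ("more spread in X_j", more_spread), ("R_j^2 = 0.9", collinear)]:
    print(f"{label:20s} {value:.3f}")
```

The factor \(1 / (1 - R_j^2)\) is often called the variance inflation factor: at \(R_j^2 = 0.9\) it multiplies the baseline variance by about 10.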

Variance Factors: Visualized

In each panel, the dashed line marks \(\beta_1 = 2\). A tighter distribution means more precise estimation.

Multicollinearity

\(R_j^2 = 1\): perfect collinearity (violates MLR.3). \(\hat\beta_j\) cannot be computed, and \(\text{Var}(\hat\beta_j) \to \infty\) as \(R_j^2 \to 1\).
\(R_j^2\) close to 1: multicollinearity — OLS still works, but estimates are imprecise.

No clean solution:
- Dropping correlated variables risks omitted variable bias.
- Increasing sample size helps (raises \(\text{SST}_j\)).
Multicollinearity is a data problem, not a model problem — the data lack enough independent variation to distinguish the separate effects.

Example: Advertising and Sales

\[ \text{sales} = \beta_0 + \beta_1 \, \text{tv} + \beta_2 \, \text{online} + \beta_3 \, \text{print} + U \]

A firm wants to know which advertising channel drives sales.
But firms that spend more on TV also spend more on online and print — all three scale together.

\(R_j^2\) is high for each channel: most of the variation in one channel’s spending is explained by the others.
OLS can estimate the combined effect precisely, but cannot disentangle the individual contributions — standard errors on \(\hat\beta_1, \hat\beta_2, \hat\beta_3\) are large.

This is not a model failure — it reflects a genuine lack of independent variation in the data.
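The pattern in this example can be reproduced in a stripped-down simulation. The sketch below is a simplification with only two channels (TV and online) rather than three, and every number in it is an assumption: both true coefficients equal 1, online spending equals TV spending plus a small independent component, and the errors are standard normal. It compares the sampling spread of \(\hat\beta_1\) with the spread of \(\hat\beta_1 + \hat\beta_2\).

```python
import random


def ols_two_regressors(x1, x2, y):
    """Return (b1, b2): OLS slopes from regressing y on x1, x2 and an intercept."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [a - m1 for a in x1]
    d2 = [b - m2 for b in x2]
    dy = [c - my for c in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def sample_sd(values):
    """Return the sample standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5


rng = random.Random(3)
n, reps = 100, 500
b1_draws, sum_draws = [], []
for _ in range(reps):
    tv = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Online spending moves almost one-for-one with TV spending.
    online = [t + 0.1 * rng.gauss(0.0, 1.0) for t in tv]
    sales = [1.0 * t + 1.0 * o + rng.gauss(0.0, 1.0) for t, o in zip(tv, online)]
    b1, b2 = ols_two_regressors(tv, online, sales)
    b1_draws.append(b1)
    sum_draws.append(b1 + b2)

sd_b1 = sample_sd(b1_draws)
sd_sum = sample_sd(sum_draws)
print("sd of beta1_hat across samples:            ", round(sd_b1, 3))
print("sd of beta1_hat + beta2_hat across samples:", round(sd_sum, 3))
```

The individual coefficient is estimated far less precisely than the sum: the data pin down the combined effect of the two channels but not how to split it.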

Estimating \(\sigma^2\): MLR

With \(k + 1\) parameters estimated, the degrees-of-freedom adjustment gives:

\[ \hat\sigma^2 = \frac{1}{n - k - 1}\sum_{i=1}^n \hat{U}_i^2 \]

Under MLR.1–MLR.5, \(E[\hat\sigma^2] = \sigma^2\) — unbiased.
Replacing \(\sigma^2\) with \(\hat\sigma^2\) in the variance formula and taking the square root gives the standard error, where \(\hat\sigma = \sqrt{\hat\sigma^2}\):

\[ \text{se}(\hat\beta_j) = \frac{\hat\sigma}{\sqrt{\text{SST}_j(1 - R_j^2)}} \]

The Gauss-Markov Theorem

Theorem: Under MLR.1–MLR.5, the OLS estimator is BLUE — Best Linear Unbiased Estimator.

Unbiased: \(E[\hat\beta_j] = \beta_j\)
Linear: \(\hat\beta_j = \sum_{i=1}^n w_{ij} Y_i\) for weights \(w_{ij}\) that depend only on \(X\)
Best: among all linear unbiased estimators, OLS has the smallest variance:

\[ \text{Var}(\hat\beta_j \mid X) \leq \text{Var}(\ddot\beta_j \mid X) \]

for any other linear unbiased estimator \(\ddot\beta_j\).

Gauss-Markov does not say OLS is the best estimator overall — only among those that are both linear and unbiased.
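To see what "best among linear unbiased estimators" means, OLS can be compared with one specific competitor in the simple regression case. The sketch below uses a hypothetical "endpoint" estimator, \((Y_n - Y_1)/(X_n - X_1)\), chosen here only for illustration. Given \(X\), a linear estimator \(\sum w_i Y_i\) is unbiased for the slope exactly when \(\sum w_i = 0\) and \(\sum w_i X_i = 1\), and under random sampling and homoskedasticity its variance is \(\sigma^2 \sum w_i^2\). The data \(X = 1, \ldots, 5\) are made up.

```python
from fractions import Fraction

x = [1, 2, 3, 4, 5]
n = len(x)
x_bar = Fraction(sum(x), n)
sst_x = sum((xi - x_bar) ** 2 for xi in x)

# OLS slope weights: w_i = (x_i - x_bar) / SST_x.
w_ols = [(xi - x_bar) / sst_x for xi in x]

# Endpoint estimator (Y_n - Y_1) / (X_n - X_1): weight on first and last only.
w_end = [Fraction(0)] * n
w_end[0] = Fraction(-1, x[-1] - x[0])
w_end[-1] = Fraction(1, x[-1] - x[0])


def is_unbiased_for_slope(w, x):
    """Check the two conditions sum(w) = 0 and sum(w * x) = 1."""
    return sum(w) == 0 and sum(wi * xi for wi, xi in zip(w, x)) == 1


def variance_factor(w):
    """Return sum(w_i^2); the conditional variance is sigma^2 times this."""
    return sum(wi ** 2 for wi in w)


var_ols = variance_factor(w_ols)
var_end = variance_factor(w_end)
print("OLS unbiased:", is_unbiased_for_slope(w_ols, x), " variance / sigma^2:", var_ols)
print("endpoint unbiased:", is_unbiased_for_slope(w_end, x), " variance / sigma^2:", var_end)
```

Both estimators are linear and unbiased, but the endpoint estimator discards the interior observations and has the larger variance, as the theorem predicts for any such competitor.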

Misspecified Models

What happens when the model is wrong?

Two forms of misspecification:

Over-specification: including irrelevant variables (\(\beta_j = 0\) in the population)
Under-specification: omitting relevant variables (\(\beta_j \neq 0\) but excluded)

Over-Specification

Population: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U\), with \(E[U \mid X_1, X_2, X_3] = 0\)

Estimated model (adds irrelevant \(X_3\), where \(\beta_3 = 0\)):

\[ \hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \hat\beta_3 X_3 \]

What happens to \(\hat\beta_1\) and \(\hat\beta_2\)? Consider both bias and variance.

Bias? MLR.1–MLR.4 still hold for the larger model (with \(\beta_3 = 0\)), so \(\hat\beta_1\) and \(\hat\beta_2\) remain unbiased.
Variance? Including \(X_3\) can increase \(R_1^2\) and \(R_2^2\), raising \(\text{Var}(\hat\beta_1)\) and \(\text{Var}(\hat\beta_2)\).
Over-specification introduces no bias, but it can cost precision.
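A simulation illustrates both conclusions. To keep the code short, the sketch below simplifies the setup: the population has a single relevant regressor \(X_1\) (with \(\beta_1 = 2\)), and the over-specified model adds one irrelevant regressor whose correlation with \(X_1\) is about 0.9. These design choices, the sample size, and the seed are all assumptions.

```python
import random


def ols_slope(x, y):
    """Return the OLS slope from regressing y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return sxy / sum((a - x_bar) ** 2 for a in x)


def ols_first_slope(x1, x2, y):
    """Return the coefficient on x1 from regressing y on x1, x2 and an intercept."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


def mean_and_var(values):
    """Return the sample mean and sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(4)
n, reps, beta1 = 50, 2000, 2.0
correct_draws, overspec_draws = [], []
for _ in range(reps):
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Irrelevant regressor: correlated with x1 but absent from the population model.
    x_irr = [0.9 * a + (1 - 0.81) ** 0.5 * rng.gauss(0.0, 1.0) for a in x1]
    y = [1.0 + beta1 * a + rng.gauss(0.0, 1.0) for a in x1]
    correct_draws.append(ols_slope(x1, y))
    overspec_draws.append(ols_first_slope(x1, x_irr, y))

mean_correct, var_correct = mean_and_var(correct_draws)
mean_overspec, var_overspec = mean_and_var(overspec_draws)
print("correct model:        mean", round(mean_correct, 3), " variance", round(var_correct, 4))
print("over-specified model: mean", round(mean_overspec, 3), " variance", round(var_overspec, 4))
```

Both estimators average close to 2, but the over-specified one is noticeably more variable because the irrelevant regressor absorbs much of the variation in \(X_1\).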

Under-Specification

Population: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U\), with \(E[U \mid X_1, X_2] = 0\)

Estimated model (omits \(X_2\)): \(Y_i = \tilde\beta_0 + \tilde\beta_1 X_{i1} + V_i\)

What is \(V_i\)? Does the short regression satisfy ZCM?

The error in the short regression is \(V_i = \beta_2 X_{i2} + U_i\).
ZCM requires \(E[V \mid X_1] = 0\), i.e.,

\[ E[\beta_2 X_2 + U \mid X_1] = \beta_2 \, E[X_2 \mid X_1] + \underbrace{E[U \mid X_1]}_{= 0 \text{ (by LIE)}} = \beta_2 \, E[X_2 \mid X_1] \]

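This is zero for every value of \(X_1\) only if \(\beta_2 = 0\) or \(E[X_2 \mid X_1] = 0\). If \(E[X_2 \mid X_1]\) is a nonzero constant, the term is absorbed into the intercept and the slope is unaffected. Otherwise, ZCM fails in the short regression, and from our unbiasedness proof we know that means \(\tilde\beta_1\) is biased.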

Under-Specification: The OVB Formula

By how much is \(\tilde\beta_1\) biased?

From Lecture 2b: \(\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1\)
Taking expectations:

\[ E[\tilde\beta_1 \mid X] = \beta_1 + \underbrace{\beta_2 \tilde\delta_1}_{\text{OVB}} \]

Biased whenever \(\beta_2 \neq 0\) and \(\tilde\delta_1 \neq 0\).
Note: this is conditional on \(X\); \(\tilde\delta_1\) is the sample slope of \(X_2\) on \(X_1\).
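The relation \(\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1\) is an algebraic identity, so it holds exactly in every sample, not just on average. The sketch below verifies this on one simulated data set. The data-generating values (\(\beta_1 = 2\), \(\beta_2 = 3\), \(X_2 = 0.5 X_1 + \text{noise}\), \(n = 200\), the seed) are illustrative assumptions.

```python
import random


def ols_slope(x, y):
    """Return the OLS slope from regressing y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return sxy / sum((a - x_bar) ** 2 for a in x)


def ols_two_regressors(x1, x2, y):
    """Return (b1, b2): OLS slopes from regressing y on x1, x2 and an intercept."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(5)
n = 200
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [0.5 * a + rng.gauss(0.0, 1.0) for a in x1]
y = [1.0 + 2.0 * a + 3.0 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

b1_long, b2_long = ols_two_regressors(x1, x2, y)  # long regression: y on x1, x2
b1_short = ols_slope(x1, y)                       # short regression: y on x1 only
delta1 = ols_slope(x1, x2)                        # auxiliary regression: x2 on x1

gap = b1_short - (b1_long + b2_long * delta1)
print("short-regression slope:       ", round(b1_short, 4))
print("long slope + b2_long * delta1:", round(b1_long + b2_long * delta1, 4))
print("difference:                   ", gap)
```

The difference is zero up to floating-point rounding. In this design the short-regression slope exceeds the long-regression slope because both \(\hat\beta_2\) and \(\tilde\delta_1\) are positive.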

Direction of OVB

The sign of the bias \(\beta_2 \times \tilde\delta_1\) depends on:

	\(\text{Corr}(X_1, X_2) > 0\)	\(\text{Corr}(X_1, X_2) < 0\)
\(\beta_2 > 0\)	Positive bias (too large)	Negative bias (too small)
\(\beta_2 < 0\)	Negative bias	Positive bias

OVB Example: Wages and Education

True model: \(\text{wage} = \beta_0 + \beta_1 \, \text{educ} + \beta_2 \, \text{ability} + U\)

Estimated: \(\text{wage} = \tilde\beta_0 + \tilde\beta_1 \, \text{educ} + V\)

Does \(\tilde\beta_1\) overestimate or underestimate the return to education? What do you need to sign the bias?

Sign of \(\beta_2\)? Higher ability → higher wages, so \(\beta_2 > 0\).
Sign of \(\text{Corr}(\text{educ}, \text{ability})\)? Positive, so \(\tilde\delta_1 > 0\).
Bias \(= \beta_2 \tilde\delta_1 > 0\): \(\tilde\beta_1\) overestimates the return to education.

OVB Example: Wages and Experience

True model: \(\text{wage} = \beta_0 + \beta_1 \, \text{educ} + \beta_2 \, \text{exper} + U\)

Estimated: \(\text{wage} = \tilde\beta_0 + \tilde\beta_1 \, \text{educ} + V\)

Same question — does \(\tilde\beta_1\) overestimate or underestimate?

Sign of \(\beta_2\)? More experience → higher wages, so \(\beta_2 > 0\).
Sign of \(\text{Corr}(\text{educ}, \text{exper})\)? Negative (more schooling → later labor market entry), so \(\tilde\delta_1 < 0\).
Bias \(= \beta_2 \tilde\delta_1 < 0\): \(\tilde\beta_1\) underestimates the return to education.

OVB: Beyond Pairwise Correlations

True model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + U\)

Estimated (omits \(X_3\)): \(Y_i = \tilde\beta_0 + \tilde\beta_1 X_{i1} + \tilde\beta_2 X_{i2} + V_i\)

Suppose \(\text{Corr}(X_2, X_3) = 0\). Is \(\tilde\beta_2\) free of OVB?

Not necessarily. The OVB in \(\tilde\beta_2\) depends on the auxiliary regression of \(X_3\) on \((X_1, X_2)\) jointly — not on pairwise correlations.
If \(X_1\) is correlated with both \(X_2\) and \(X_3\), the auxiliary regression can load on \(X_2\) even though \(X_2\) and \(X_3\) are uncorrelated.
Lesson: in multiple regression, checking pairwise correlations with the omitted variable is not enough.
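One concrete design in which this happens, offered as an illustrative assumption rather than the lecture's own example: let \(X_2\) and \(X_3\) be independent standard normals and let \(X_1 = X_2 + X_3 + \text{noise}\) with standard normal noise. Then \(\text{Corr}(X_2, X_3) = 0\), yet the population auxiliary regression of \(X_3\) on \((X_1, X_2)\) has coefficients \(0.5\) on \(X_1\) and \(-0.5\) on \(X_2\). The sketch below checks this in a large simulated sample.

```python
import random


def ols_two_regressors(x1, x2, y):
    """Return (b1, b2): OLS slopes from regressing y on x1, x2 and an intercept."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def sample_corr(a, b):
    """Return the sample correlation between two lists of numbers."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


rng = random.Random(6)
n = 20000
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x3 = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent of x2
x1 = [b + c + rng.gauss(0.0, 1.0) for b, c in zip(x2, x3)]

corr_23 = sample_corr(x2, x3)
# Auxiliary regression of the omitted variable x3 on the included (x1, x2).
delta1, delta2 = ols_two_regressors(x1, x2, x3)

print("sample Corr(X2, X3):        ", round(corr_23, 3))
print("auxiliary coefficient on X1:", round(delta1, 3))
print("auxiliary coefficient on X2:", round(delta2, 3))
```

Even though \(X_2\) and \(X_3\) are uncorrelated, the auxiliary coefficient on \(X_2\) is far from zero, so omitting \(X_3\) biases \(\tilde\beta_2\) by roughly \(-0.5\,\beta_3\) in this design.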

Summary

Under the Gauss-Markov assumptions (MLR.1–MLR.5):
- OLS is unbiased (MLR.1–MLR.4)
- OLS has the smallest variance among linear unbiased estimators (Gauss-Markov theorem)
Variance depends on: error variance \(\sigma^2\), sample variation in \(X_j\), and collinearity \(R_j^2\)
Over-specification: unbiased but inefficient
Under-specification: biased; the OVB formula gives the sign and size of the bias

What’s Next

Lecture 3a — Inference:

Hypothesis testing and confidence intervals
The \(t\)-test and \(F\)-test
Inference under finite-sample assumptions