Properties of OLS Estimators
Natasha Kang
Xiamen University, Chow Institute
May, 2026
Estimators vs. Estimates
- An estimate is a number — the value \(\hat\beta_1\) computed from a particular sample.
- An estimator is a random variable — a function of the random sample, whose value changes across samples.
- We have discussed the algebraic properties of OLS estimates (they hold in any sample).
- Now we study the statistical properties of OLS estimators — properties that hold across repeated sampling:
- Unbiasedness: Is the estimator centered on the true parameter?
- Variance: How much does the estimator vary across samples?
The Gauss-Markov Assumptions: SLR
- SLR.1 — Linear in Parameters: \(Y = \beta_0 + \beta_1 X + U\)
- SLR.2 — Random Sampling: \(\{(X_i, Y_i) : i = 1, \ldots, n\}\) are i.i.d.
- SLR.3 — Variation in \(X\): the sample values of \(X\) are not all equal
- SLR.4 — Zero Conditional Mean: \(E[U \mid X] = 0\)
- SLR.5 — Homoskedasticity: \(\text{Var}(U \mid X) = \sigma^2\)
- SLR.1–SLR.4 are needed for unbiasedness.
- SLR.5 is needed for the variance formula and the Gauss-Markov theorem.
Unbiasedness of OLS
Theorem: Under SLR.1–SLR.4,
\[
E[\hat\beta_0] = \beta_0, \qquad E[\hat\beta_1] = \beta_1
\]
Proof sketch: Write \(\hat\beta_1\) in terms of the population error:
\[
\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} = \beta_1 + \frac{\sum_{i=1}^n (X_i - \bar{X})U_i}{\sum_{i=1}^n (X_i - \bar{X})^2}
\]
Take the conditional expectation given \(X = (X_1, \ldots, X_n)\). By SLR.4 together with random sampling (SLR.2), \(E[U_i \mid X] = 0\) for every \(i\), so \(E[\hat\beta_1 \mid X] = \beta_1\). By the law of iterated expectations, \(E[\hat\beta_1] = \beta_1\).
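The second equality in the expression for \(\hat\beta_1\) can be spelled out as follows. This is a sketch of the standard algebra, using only SLR.1 and the identity \(\sum_{i=1}^n (X_i - \bar{X}) = 0\):
\[
\begin{aligned}
\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})
  &= \sum_{i=1}^n (X_i - \bar{X})\, Y_i \\
  &= \sum_{i=1}^n (X_i - \bar{X})(\beta_0 + \beta_1 X_i + U_i) \\
  &= \beta_1 \sum_{i=1}^n (X_i - \bar{X})\, X_i + \sum_{i=1}^n (X_i - \bar{X})\, U_i \\
  &= \beta_1 \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (X_i - \bar{X})\, U_i
\end{aligned}
\]
Dividing both sides by \(\sum_{i=1}^n (X_i - \bar{X})^2\), which is positive by SLR.3, gives \(\hat\beta_1 = \beta_1\) plus the error term used in the proof.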
What Unbiasedness Means (and Doesn’t)
Which of the following are correct interpretations of unbiasedness?
1. Estimated coefficients will equal the true values.
2. On average, estimated coefficients will equal the true values.
3. In a given sample, estimates may differ considerably from the true values.
4. In a given sample, estimates are close to the true values.
- Correct: 2 and 3.
- Unbiasedness is a property of the procedure, not of any single estimate.
- A single estimate can be far from the truth — unbiasedness says the errors average out over many samples.
The Sampling Distribution
We said \(E[\hat\beta_1] = \beta_1\). But expectation over what?
- The population and \(\beta_j\) are fixed — there is nothing random about them.
- Randomness comes from sampling: each sample gives different data, hence a different \(\hat\beta_j\).
- The sampling distribution describes how \(\hat\beta_j\) varies across hypothetical repeated samples.
- \(E[\hat\beta_1] = \beta_1\) means this distribution is centered on the truth.
Unbiasedness: Visualized
[Figure: sampling distribution of \(\hat\beta_1\) across repeated samples]
- Each draw from the sampling distribution is one estimate from one sample.
- The distribution is centered on \(\beta_1 = 2\) — that is what unbiasedness means.
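A small Monte Carlo experiment can make the repeated-sampling idea concrete. The sketch below is illustrative only: the data-generating process, the intercept \(\beta_0 = 1\), the sample size, the number of replications, and the seed are all assumptions, with \(\beta_1 = 2\) chosen to match the value above.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility
beta0, beta1 = 1.0, 2.0                 # beta1 = 2 as above; beta0 is an assumed value
n, reps = 50, 5000                      # assumed sample size and number of repeated samples

def ols_slope(x, y):
    """OLS slope: sum of cross-deviations over sum of squared x-deviations."""
    xd = x - x.mean()
    return (xd @ (y - y.mean())) / (xd @ xd)

estimates = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)              # a fresh sample of X
    u = rng.normal(size=n)              # errors drawn independently of X, so E[U | X] = 0
    y = beta0 + beta1 * x + u
    estimates[r] = ols_slope(x, y)      # one estimate per sample

print("mean of estimates:", estimates.mean())
print("smallest and largest estimate:", estimates.min(), estimates.max())
```

The average across samples should sit close to 2, while individual estimates are spread around it: each entry of `estimates` is one draw from the sampling distribution.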
What Can Go Wrong: Violating ZCM
- If \(\text{Cov}(X, U) \neq 0\), the second term in
\[
\hat\beta_1 = \beta_1 + \frac{\sum (X_i - \bar{X})U_i}{\sum (X_i - \bar{X})^2}
\]
does not vanish in expectation — OLS is biased.
- Example: omitting ability from a wage regression on education. If ability raises wages and is positively correlated with education, \(E[\hat\beta_1] > \beta_1\).
Homoskedasticity
SLR.5 states: \(\text{Var}(U \mid X) = \sigma^2\) — the error variance does not depend on \(X\).
- Homoskedasticity: the spread of \(Y\) around the regression line is the same at every value of \(X\).
- Heteroskedasticity: the spread varies with \(X\) — e.g., income variance increases with education.
- Homoskedasticity is needed for the standard variance formulas. If it fails, OLS is still unbiased, but the usual standard errors are wrong.
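The last point can be checked by simulation. The sketch below assumes a specific heteroskedastic design, \(\text{Var}(U \mid X) = X^2\), which is not from the lecture, along with illustrative parameter values. It compares the exact conditional variance of the slope with the average of the usual homoskedasticity-based variance estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 1.0, 2.0                  # illustrative values (assumed)
n, reps = 200, 4000

x = rng.normal(size=n)                   # regressor, held fixed across replications
xd = x - x.mean()
sst = xd @ xd
sd_u = np.abs(x)                         # assumed design: sd(U | X) = |X|, so Var(U | X) = X^2

slopes = np.empty(reps)
usual_var = np.empty(reps)
for r in range(reps):
    u = sd_u * rng.normal(size=n)        # heteroskedastic errors with E[U | X] = 0
    y = beta0 + beta1 * x + u
    b1 = (xd @ y) / sst
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    sigma2_hat = (resid @ resid) / (n - 2)
    slopes[r] = b1
    usual_var[r] = sigma2_hat / sst      # variance estimate that assumes homoskedasticity

true_var = (xd**2 @ sd_u**2) / sst**2    # exact Var(b1 | X) under this design

print("mean slope:", slopes.mean())
print("simulated Var(slope):", slopes.var())
print("exact Var(slope | X):", true_var)
print("average usual variance estimate:", usual_var.mean())
```

In this design the mean slope stays close to 2, but the usual variance estimate is well below the exact variance, so standard errors built from it would be too small.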
Homoskedasticity vs. Heteroskedasticity
Variance of OLS Estimators: SLR
Theorem: Under SLR.1–SLR.5, conditional on \(X\):
\[
\text{Var}(\hat\beta_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}
\]
\[
\text{Var}(\hat\beta_0 \mid X) = \frac{\sigma^2 \cdot \frac{1}{n}\sum_{i=1}^n X_i^2}{\sum_{i=1}^n (X_i - \bar{X})^2}
\]
- Larger error variance \(\sigma^2\) → less precise estimates.
- More spread in \(X\) → more precise estimates.
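As a numerical check of the two formulas, the sketch below holds \(X\) fixed, redraws homoskedastic errors many times, and compares the simulated variances of \(\hat\beta_0\) and \(\hat\beta_1\) with the formula values. All parameter values, the uniform design for \(X\), and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 1.0, 2.0, 1.5      # illustrative values (assumed)
n, reps = 40, 20000

x = rng.uniform(0.0, 10.0, size=n)       # X held fixed: the formulas condition on X
xd = x - x.mean()
sst = xd @ xd

var_b1_formula = sigma**2 / sst
var_b0_formula = sigma**2 * np.mean(x**2) / sst

# Draw all replications at once: each row of U is one sample of errors.
U = sigma * rng.normal(size=(reps, n))
Y = beta0 + beta1 * x + U
b1 = (Y @ xd) / sst                      # slope in each replication
b0 = Y.mean(axis=1) - b1 * x.mean()      # intercept in each replication

print("Var(b1): formula", var_b1_formula, "simulated", b1.var())
print("Var(b0): formula", var_b0_formula, "simulated", b0.var())
```

Increasing `sigma` or shrinking the range of `x` in this sketch raises both variances, in line with the two bullets above.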
Variance: Visualized
[Figure: sampling distributions of a low-variance and a high-variance estimator of \(\beta_1\)]
- Both estimators are unbiased — both distributions are centered on \(\beta_1 = 2\).
- But the low-variance estimator concentrates more tightly around the truth: any single estimate is likely to be closer.
Estimating \(\sigma^2\): Simple Regression
- We don’t observe \(\sigma^2\). If we could observe the errors \(U_i\), we’d estimate \(\sigma^2\) by \(\frac{1}{n}\sum U_i^2\).
- We only observe residuals \(\hat{U}_i\), and \(\frac{1}{n}\sum \hat{U}_i^2\) is biased — the residuals are “too small” on average because OLS minimizes their sum of squares.
- The FOCs impose two restrictions (\(\sum \hat{U}_i = 0\) and \(\sum X_i \hat{U}_i = 0\)), consuming two degrees of freedom.
- The unbiased estimator is:
\[
\hat\sigma^2 = \frac{1}{n - 2}\sum_{i=1}^n \hat{U}_i^2
\]
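The effect of the degrees-of-freedom correction is easiest to see in a small sample. The sketch below uses assumed values (\(n = 10\), \(\sigma^2 = 4\), normal errors, a fixed seed) and compares the average of \(\frac{1}{n}\sum \hat{U}_i^2\) with the average of \(\frac{1}{n-2}\sum \hat{U}_i^2\) across many samples.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, sigma2 = 1.0, 2.0, 4.0     # illustrative values (assumed)
n, reps = 10, 20000                      # small n makes the bias visible

x = rng.normal(size=n)                   # X held fixed across replications
xd = x - x.mean()
sst = xd @ xd

U = np.sqrt(sigma2) * rng.normal(size=(reps, n))
Y = beta0 + beta1 * x + U
b1 = (Y @ xd) / sst
b0 = Y.mean(axis=1) - b1 * x.mean()
resid = Y - b0[:, None] - b1[:, None] * x    # residuals, one row per replication
ssr = (resid**2).sum(axis=1)

naive = ssr / n                          # divides by n: biased downward
adjusted = ssr / (n - 2)                 # divides by n - 2: unbiased

print("true sigma^2:", sigma2)
print("mean of SSR / n:      ", naive.mean())     # near sigma2 * (n - 2) / n
print("mean of SSR / (n - 2):", adjusted.mean())  # near sigma2
```

With these values the naive estimator averages about \(4 \times 8/10 = 3.2\), while the adjusted one averages about 4.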
Gauss-Markov Assumptions: MLR
- MLR.1 — Linear in Parameters: \(Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + U\)
- MLR.2 — Random Sampling: \(\{(X_{i1}, \ldots, X_{ik}, Y_i) : i = 1, \ldots, n\}\) are i.i.d.
- MLR.3 — No Perfect Collinearity: no exact linear relationships among the regressors
- MLR.4 — Zero Conditional Mean: \(E[U \mid X_1, \ldots, X_k] = 0\)
- MLR.5 — Homoskedasticity: \(\text{Var}(U \mid X_1, \ldots, X_k) = \sigma^2\)
No Perfect Collinearity (MLR.3)
- Perfect collinearity: one regressor is an exact linear combination of others — OLS cannot be computed.
- Common violations:
- Including a constant variable alongside the intercept
- Including \(\text{expendA}\), \(\text{expendB}\), and \(\text{totexpend} = \text{expendA} + \text{expendB}\)
- Not a violation: including \(X\) and \(X^2\). Since \(X^2\) is a nonlinear function of \(X\), it is not a linear combination.
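One way to see the difference between these cases is to check the column rank of the regressor matrix: OLS needs full column rank. The sketch below uses made-up data for `expendA` and `expendB`; only the variable names come from the example above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
expendA = rng.uniform(1.0, 5.0, size=n)          # hypothetical data
expendB = rng.uniform(1.0, 5.0, size=n)
totexpend = expendA + expendB                    # exact linear combination of the two

const = np.ones(n)
X_bad = np.column_stack([const, expendA, expendB, totexpend])
X_ok = np.column_stack([const, expendA, expendA**2])

# MLR.3 requires full column rank; X'X is invertible only in that case.
rank_bad = np.linalg.matrix_rank(X_bad)
rank_ok = np.linalg.matrix_rank(X_ok)

print("rank with totexpend:", rank_bad, "of", X_bad.shape[1], "columns")
print("rank with X and X^2:", rank_ok, "of", X_ok.shape[1], "columns")
```

The first matrix has four columns but rank three, so its coefficients are not separately identified. The matrix with \(X\) and \(X^2\) has full rank even though the two columns are highly correlated.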
Unbiasedness in MLR
Theorem: Under MLR.1–MLR.4,
\[
E[\hat\beta_j] = \beta_j, \qquad j = 0, 1, \ldots, k
\]
The proof follows the same logic as in SLR, extended to multiple regressors.
Variance of OLS Estimators: MLR
Theorem: Under MLR.1–MLR.5, conditional on \(X\):
\[
\text{Var}(\hat\beta_j \mid X) = \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)}, \qquad j = 1, \ldots, k
\]
where \(\text{SST}_j = \sum_{i=1}^n (X_{ij} - \bar{X}_j)^2\) and \(R_j^2\) is the \(R^2\) from regressing \(X_j\) on all other regressors.
Proof: By FWL, \(\hat\beta_j\) is the SLR slope from regressing \(Y\) on \(\hat{r}_{ij}\) (residuals from \(X_j\) on all other \(X\)’s). Applying the SLR variance formula:
\[
\text{Var}(\hat\beta_j \mid X) = \frac{\sigma^2}{\sum_{i=1}^n \hat{r}_{ij}^2} = \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)}
\]
where the last equality follows from \(\sum \hat{r}_{ij}^2 = \text{SST}_j(1 - R_j^2)\) by definition of \(R_j^2\). \(\square\)
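The formula can be verified numerically against the standard matrix expression \(\sigma^2 (X'X)^{-1}\) for the conditional variance, which is not derived in these notes and is used here only as a cross-check. The design below (three regressors, two of them correlated) is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 200, 1.0                             # illustrative values (assumed)

# Hypothetical design: x1 and x2 are correlated, x3 is unrelated to both.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])

# Matrix form of the conditional variance: sigma^2 * (X'X)^{-1}.
var_matrix = sigma2 * np.linalg.inv(X.T @ X)

def var_by_formula(j):
    """sigma^2 / (SST_j * (1 - R_j^2)) for column j >= 1 of X."""
    xj = X[:, j]
    others = np.delete(X, j, axis=1)             # intercept plus the remaining regressors
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    r = xj - others @ coef                       # residuals r_hat_ij from the auxiliary regression
    sst_j = np.sum((xj - xj.mean())**2)
    r2_j = 1.0 - (r @ r) / sst_j
    return sigma2 / (sst_j * (1.0 - r2_j))

for j in (1, 2, 3):
    print(j, var_by_formula(j), var_matrix[j, j])
```

The two columns of output should agree up to floating-point error for every \(j\).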
Understanding \(R_j^2\)
- \(R_j^2\) is not the \(R^2\) of the main regression of \(Y\) on \(X\)’s.
- \(R_j^2\) measures how well \(X_j\) is linearly predicted by the other regressors.
- If \(R_j^2\) is high, most of the variation in \(X_j\) is shared with other regressors — little “unique” variation remains to identify \(\beta_j\).
- If \(R_j^2 = 0\) (\(X_j\) is uncorrelated in the sample with the other regressors), the MLR formula reduces to the SLR formula \(\sigma^2 / \text{SST}_j\).
Three Factors Driving Estimator Precision
\[
\text{Var}(\hat\beta_j \mid X) = \frac{\sigma^2}{\text{SST}_j(1 - R_j^2)}
\]
- Error variance \(\sigma^2\): more noise → less precision. Adding relevant controls can reduce it.
- Total variation \(\text{SST}_j = (n - 1) \cdot S_{X_j}^2\), where \(S_{X_j}^2\) is the sample variance of \(X_j\): more variation in \(X_j\) → more precision. Increases with sample size.
- Collinearity \(R_j^2\): higher \(R_j^2\) → less unique variation in \(X_j\) → less precision.
Variance Factors: Visualized
[Figure: panels showing how each variance factor changes the sampling distribution of \(\hat\beta_1\)]
- In each panel, the dashed line marks \(\beta_1 = 2\). A tighter distribution means more precise estimation.
Multicollinearity
- \(R_j^2 = 1\): perfect collinearity. MLR.3 fails and \(\hat\beta_j\) cannot be computed; as \(R_j^2 \to 1\), \(\text{Var}(\hat\beta_j \mid X) \to \infty\).
- \(R_j^2\) close to 1: multicollinearity — OLS still works, but estimates are imprecise.
- No clean solution:
- Dropping correlated variables risks omitted variable bias.
- Increasing sample size helps (raises \(\text{SST}_j\)).
- Multicollinearity is a data problem, not a model problem — the data lack enough independent variation to distinguish the separate effects.
Example: Advertising and Sales
\[
\text{sales} = \beta_0 + \beta_1 \, \text{tv} + \beta_2 \, \text{online} + \beta_3 \, \text{print} + U
\]
- A firm wants to know which advertising channel drives sales.
- But firms that spend more on TV also spend more on online and print — all three scale together.
- \(R_j^2\) is high for each channel: most of the variation in one channel’s spending is explained by the others.
- OLS can estimate the combined effect precisely, but cannot disentangle the individual contributions — standard errors on \(\hat\beta_1, \hat\beta_2, \hat\beta_3\) are large.
- This is not a model failure — it reflects a genuine lack of independent variation in the data.
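The contrast between the combined effect and the individual effects can be illustrated with made-up budgets. In the sketch below, the data, the common scale factor, and \(\sigma^2 = 1\) are all assumptions. "Combined effect" is interpreted as \(\beta_1 + \beta_2 + \beta_3\), and the variance comes from the standard matrix expression \(\sigma^2 (X'X)^{-1}\), which is not derived in these notes.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma2 = 300, 1.0                             # illustrative values (assumed)

# Hypothetical budgets: one common scale factor drives all three channels.
scale = rng.normal(loc=10.0, scale=2.0, size=n)
tv = scale + 0.3 * rng.normal(size=n)
online = scale + 0.3 * rng.normal(size=n)
print_ = scale + 0.3 * rng.normal(size=n)        # trailing underscore avoids shadowing print()

X = np.column_stack([np.ones(n), tv, online, print_])
V = sigma2 * np.linalg.inv(X.T @ X)              # Var(beta_hat | X) under MLR.1-MLR.5

sd_individual = np.sqrt(np.diag(V)[1:])          # sd of each channel's coefficient
a = np.array([0.0, 1.0, 1.0, 1.0])               # picks out beta_1 + beta_2 + beta_3
sd_sum = np.sqrt(a @ V @ a)                      # sd of the summed coefficients

print("sd of individual coefficients:", sd_individual)
print("sd of the sum of coefficients:", sd_sum)
```

In this design the sum of the three coefficients is estimated several times more precisely than any single coefficient, because the data vary a lot along the common scale but little in the mix across channels.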
Estimating \(\sigma^2\): MLR
With \(k + 1\) parameters estimated, the degrees-of-freedom adjustment gives:
\[
\hat\sigma^2 = \frac{1}{n - k - 1}\sum_{i=1}^n \hat{U}_i^2
\]
- Under MLR.1–MLR.5, \(E[\hat\sigma^2] = \sigma^2\) — unbiased.
- Replacing \(\sigma\) with \(\hat\sigma = \sqrt{\hat\sigma^2}\) in the variance formula gives the standard error:
\[
\text{se}(\hat\beta_j) = \frac{\hat\sigma}{\sqrt{\text{SST}_j(1 - R_j^2)}}
\]
The Gauss-Markov Theorem
Theorem: Under MLR.1–MLR.5, the OLS estimator is BLUE — Best Linear Unbiased Estimator.
- Unbiased Estimator: \(E[\hat\beta_j] = \beta_j\)
- Linear: \(\hat\beta_j = \sum_{i=1}^n w_{ij} Y_i\) for weights \(w_{ij}\) that depend only on \(X\)
- Best: among all linear unbiased estimators, OLS has the smallest variance:
\[
\text{Var}(\hat\beta_j \mid X) \leq \text{Var}(\ddot\beta_j \mid X)
\]
for any other linear unbiased estimator \(\ddot\beta_j\).
- Gauss-Markov does not say OLS is the best estimator overall — only among those that are both linear and unbiased.
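To see the "Best" part in the simple regression case, OLS can be compared with another linear unbiased estimator of the slope. The competitor below, the slope of the line through the observations with the smallest and largest \(X\), is a hypothetical choice made for illustration, as are the data and \(\sigma^2 = 1\). For any linear estimator \(\sum w_i Y_i\), the conditional variance under SLR.1–SLR.5 is \(\sigma^2 \sum w_i^2\).

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma2 = 30, 1.0                              # illustrative values (assumed)
x = rng.uniform(0.0, 10.0, size=n)               # X held fixed

# OLS slope weights: w_i = (x_i - xbar) / SST.
xd = x - x.mean()
w_ols = xd / (xd @ xd)

# Hypothetical alternative: slope through the two observations
# with the smallest and the largest x.
lo, hi = np.argmin(x), np.argmax(x)
w_alt = np.zeros(n)
w_alt[hi] = 1.0 / (x[hi] - x[lo])
w_alt[lo] = -1.0 / (x[hi] - x[lo])

def is_unbiased(w):
    """sum(w_i * Y_i) is unbiased for the slope iff sum(w) = 0 and sum(w * x) = 1."""
    return bool(np.isclose(w.sum(), 0.0) and np.isclose(w @ x, 1.0))

var_ols = sigma2 * (w_ols @ w_ols)               # equals sigma^2 / SST
var_alt = sigma2 * (w_alt @ w_alt)

print("unbiased:", is_unbiased(w_ols), is_unbiased(w_alt))
print("Var OLS:", var_ols, " Var alternative:", var_alt)
```

Both estimators pass the unbiasedness check, but the alternative discards all interior observations and has a larger variance, as the theorem predicts.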
Misspecified Models
What happens when the model is wrong?
Two forms of misspecification:
- Over-specification: including irrelevant variables (\(\beta_j = 0\) in the population)
- Under-specification: omitting relevant variables (\(\beta_j \neq 0\) but excluded)
Over-Specification
Population: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U\), with \(E[U \mid X_1, X_2, X_3] = 0\)
Estimated model (adds irrelevant \(X_3\), where \(\beta_3 = 0\)):
\[
\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \hat\beta_3 X_3
\]
What happens to \(\hat\beta_1\) and \(\hat\beta_2\)? Consider both bias and variance.
- Bias? MLR.1–MLR.4 still hold for the larger model (with \(\beta_3 = 0\)), so \(\hat\beta_1\) and \(\hat\beta_2\) remain unbiased.
- Variance? Including \(X_3\) can increase \(R_1^2\) and \(R_2^2\), raising \(\text{Var}(\hat\beta_1)\) and \(\text{Var}(\hat\beta_2)\).
- Over-specification introduces no bias but can raise variance.
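A simulation sketch of this case follows. The coefficient values, the sample size, and the way the irrelevant \(X_3\) is built to correlate with \(X_1\) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 100, 5000
beta = np.array([1.0, 2.0, -1.0])                # (beta0, beta1, beta2), illustrative values

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.4 * rng.normal(size=n)         # irrelevant (beta3 = 0) but correlated with x1
X_true = np.column_stack([np.ones(n), x1, x2])
X_over = np.column_stack([X_true, x3])

U = rng.normal(size=(reps, n))
Y = X_true @ beta + U                            # each row is one sample of Y; x3 plays no role

# OLS for all replications at once: each column of Y.T is one right-hand side.
b_true = np.linalg.lstsq(X_true, Y.T, rcond=None)[0]   # shape (3, reps)
b_over = np.linalg.lstsq(X_over, Y.T, rcond=None)[0]   # shape (4, reps)

print("mean of beta1_hat, correct model:", b_true[1].mean())
print("mean of beta1_hat, with x3 added:", b_over[1].mean())
print("var of beta1_hat, correct model: ", b_true[1].var())
print("var of beta1_hat, with x3 added: ", b_over[1].var())
```

Both versions of \(\hat\beta_1\) average close to 2, but adding `x3` raises \(R_1^2\) and inflates the variance of \(\hat\beta_1\) several times over in this design.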
Under-Specification
Population: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U\), with \(E[U \mid X_1, X_2] = 0\)
Estimated model (omits \(X_2\)): \(Y_i = \tilde\beta_0 + \tilde\beta_1 X_{i1} + V_i\)
What is \(V_i\)? Does the short regression satisfy ZCM?
- The error in the short regression is \(V_i = \beta_2 X_{i2} + U_i\).
- ZCM requires \(E[V \mid X_1] = 0\), i.e.,
\[
E[\beta_2 X_2 + U \mid X_1] = \beta_2 \, E[X_2 \mid X_1] + \underbrace{E[U \mid X_1]}_{= 0 \text{ (by LIE)}} = \beta_2 \, E[X_2 \mid X_1]
\]
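- This does not depend on \(X_1\) only if \(\beta_2 = 0\) or \(E[X_2 \mid X_1]\) is constant (a nonzero constant is absorbed into the intercept \(\tilde\beta_0\) and leaves the slope unaffected). Otherwise, ZCM fails in the short regression, the argument behind the unbiasedness proof breaks down, and \(\tilde\beta_1\) is biased.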
Direction of OVB
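The bias in \(\tilde\beta_1\) is \(\beta_2 \times \tilde\delta_1\), where \(\tilde\delta_1\) is the slope from regressing \(X_2\) on \(X_1\), so that \(E[\tilde\beta_1 \mid X] = \beta_1 + \beta_2 \tilde\delta_1\). Its sign depends on the signs of both factors:
| | \(\tilde\delta_1 > 0\) (\(\text{Corr}(X_1, X_2) > 0\)) | \(\tilde\delta_1 < 0\) (\(\text{Corr}(X_1, X_2) < 0\)) |
|---|---|---|
| \(\beta_2 > 0\) | Positive bias (too large) | Negative bias (too small) |
| \(\beta_2 < 0\) | Negative bias | Positive bias |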
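The bias term has an exact in-sample counterpart: the short-regression slope equals \(\hat\beta_1 + \hat\beta_2 \tilde\delta_1\), where \(\hat\beta_1, \hat\beta_2\) come from the long regression and \(\tilde\delta_1\) is the slope from regressing \(X_2\) on \(X_1\). The sketch below checks this identity on simulated data; the coefficient values and the data-generating process are assumptions, set up so that \(\beta_2 > 0\) and \(\tilde\delta_1 > 0\).

```python
import numpy as np

rng = np.random.default_rng(9)
n = 500
beta0, beta1, beta2 = 1.0, 2.0, 1.5              # illustrative values (assumed), beta2 > 0

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)               # positively related to x1, so delta1 > 0
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients of y on an intercept and the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_long = ols(y, x1, x2)                          # (beta0_hat, beta1_hat, beta2_hat)
b_short = ols(y, x1)                             # (beta0_tilde, beta1_tilde)
delta1 = ols(x2, x1)[1]                          # slope from regressing x2 on x1

print("short-regression slope:        ", b_short[1])
print("beta1_hat + beta2_hat * delta1:", b_long[1] + b_long[2] * delta1)
```

The two printed numbers agree up to floating-point error, and in this design the short-regression slope lies above the long-regression slope, which is the positive-bias case.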
OVB Example: Wages and Education
True model: \(\text{wage} = \beta_0 + \beta_1 \, \text{educ} + \beta_2 \, \text{ability} + U\)
Estimated: \(\text{wage} = \tilde\beta_0 + \tilde\beta_1 \, \text{educ} + V\)
Does \(\tilde\beta_1\) overestimate or underestimate the return to education? What do you need to sign the bias?
- Sign of \(\beta_2\)? Higher ability → higher wages, so \(\beta_2 > 0\).
- Sign of \(\text{Corr}(\text{educ}, \text{ability})\)? Positive, so \(\tilde\delta_1 > 0\).
- Bias \(= \beta_2 \tilde\delta_1 > 0\): \(\tilde\beta_1\) overestimates the return to education.
OVB Example: Wages and Experience
True model: \(\text{wage} = \beta_0 + \beta_1 \, \text{educ} + \beta_2 \, \text{exper} + U\)
Estimated: \(\text{wage} = \tilde\beta_0 + \tilde\beta_1 \, \text{educ} + V\)
Same question — does \(\tilde\beta_1\) overestimate or underestimate?
- Sign of \(\beta_2\)? More experience → higher wages, so \(\beta_2 > 0\).
- Sign of \(\text{Corr}(\text{educ}, \text{exper})\)? Negative (more schooling → later labor market entry), so \(\tilde\delta_1 < 0\).
- Bias \(= \beta_2 \tilde\delta_1 < 0\): \(\tilde\beta_1\) underestimates the return to education.
OVB: Beyond Pairwise Correlations
True model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + U\)
Estimated (omits \(X_3\)): \(Y_i = \tilde\beta_0 + \tilde\beta_1 X_{i1} + \tilde\beta_2 X_{i2} + V_i\)
Suppose \(\text{Corr}(X_2, X_3) = 0\). Is \(\tilde\beta_2\) free of OVB?
- Not necessarily. The OVB in \(\tilde\beta_2\) depends on the auxiliary regression of \(X_3\) on \((X_1, X_2)\) jointly — not on pairwise correlations.
- If \(X_1\) is correlated with both \(X_2\) and \(X_3\), the auxiliary regression can load on \(X_2\) even though \(X_2\) and \(X_3\) are uncorrelated.
- Lesson: in multiple regression, checking pairwise correlations with the omitted variable is not enough.
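A population-level example makes this concrete. The sketch below assumes a hypothetical design in which \(X_2\) and \(X_3\) are uncorrelated with unit variance and \(X_1 = X_2 + X_3 + \text{noise}\) with noise variance 1. It computes the population slopes of the auxiliary regression of \(X_3\) on \((X_1, X_2)\) from the implied covariances, then scales them by an assumed \(\beta_3 = 2\).

```python
import numpy as np

# Hypothetical population: X2 and X3 are uncorrelated with unit variance,
# and X1 = X2 + X3 + noise (noise variance 1), so X1 is correlated with both.
cov_x1x2 = np.array([[3.0, 1.0],                 # Var(X1) = 3, Cov(X1, X2) = 1
                     [1.0, 1.0]])                # Cov(X2, X1) = 1, Var(X2) = 1
cov_with_x3 = np.array([1.0, 0.0])               # Cov(X1, X3) = 1, Cov(X2, X3) = 0

# Population slopes from the auxiliary regression of X3 on (X1, X2).
delta = np.linalg.solve(cov_x1x2, cov_with_x3)

beta3 = 2.0                                      # illustrative coefficient on the omitted X3
bias = beta3 * delta                             # bias in (beta1_tilde, beta2_tilde)

print("auxiliary slopes on (X1, X2):", delta)
print("bias in (beta1_tilde, beta2_tilde):", bias)
```

The auxiliary slope on \(X_2\) is \(-0.5\) even though \(\text{Corr}(X_2, X_3) = 0\), so \(\tilde\beta_2\) is biased by \(\beta_3 \times (-0.5)\) in this design.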
Summary
- Under the Gauss-Markov assumptions (MLR.1–MLR.5):
- OLS is unbiased (MLR.1–MLR.4)
- OLS has the smallest variance among linear unbiased estimators (Gauss-Markov theorem)
- Variance depends on: error variance \(\sigma^2\), sample variation in \(X_j\), and collinearity \(R_j^2\)
- Over-specification: unbiased but inefficient
- Under-specification: biased — OVB formula gives the direction
What’s Next
Lecture 3a — Inference:
- Hypothesis testing and confidence intervals
- The \(t\)-test and \(F\)-test
- Inference under finite-sample assumptions