Linear Regression: Estimation
Natasha Kang
Xiamen University, Chow Institute
March, 2026
Where We Are
Last time, we specified the linear regression model and connected it to the population via assumptions, in particular the zero conditional mean (ZCM) assumption.
Now: how do we estimate the population parameters \(\beta_0, \beta_1, \ldots, \beta_k\) from data?
- We have a model (parametric specification) and a sample (data).
- Estimation combines the two: it uses the sample to produce numbers \(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k\) that approximate the unknown population parameters.
The Estimation Problem
We assume in the population:
\[
Y = \beta_0 + \beta_1 X + U
\]
Given a random sample \(\{(X_i, Y_i): i = 1, 2, \ldots, n\}\), how do we choose \(\hat\beta_0\) and \(\hat\beta_1\)?
- We will derive two approaches — least squares and method of moments — and show they give the same answer.
Fitted Values and Residuals
For any candidate values \(b_0\) and \(b_1\), define:
- Fitted value for observation \(i\):
\[
\tilde{Y}_i(b_0, b_1) = b_0 + b_1 X_i
\]
- Residual for observation \(i\):
\[
e_i(b_0, b_1) = Y_i - b_0 - b_1 X_i
\]
- The fitted value is our prediction of \(Y_i\) given \(X_i\).
- The residual measures how far the prediction misses.
Fitting a Line Through the Data
The Least Squares Idea
We want to choose \(b_0\) and \(b_1\) so that the residuals are “small” overall.
Ordinary Least Squares (OLS): minimize the sum of squared residuals:
\[
\min_{b_0, \, b_1} \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right)^2
\]
- Why squared? Squaring penalizes large errors more, avoids cancellation of positive and negative residuals, and yields a smooth (differentiable) objective.
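As a small illustration of the objective, the sketch below evaluates the sum of squared residuals at the least-squares line and at two nearby candidate lines. The data are a hypothetical toy sample (not from the lecture), and `np.polyfit` is used only as a convenient way to obtain the least-squares line.

```python
import numpy as np

# Hypothetical toy sample (not from the lecture).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def ssr(b0, b1):
    """Sum of squared residuals for candidate intercept b0 and slope b1."""
    e = Y - b0 - b1 * X
    return float(np.sum(e ** 2))

# Least-squares line via NumPy (polyfit returns the slope first, then the intercept).
b1_hat, b0_hat = np.polyfit(X, Y, deg=1)

ssr_min = ssr(b0_hat, b1_hat)
# Moving either candidate value away from the minimizer raises the objective.
ssr_shift_intercept = ssr(b0_hat + 0.5, b1_hat)
ssr_shift_slope = ssr(b0_hat, b1_hat + 0.1)

print(ssr_min, ssr_shift_intercept, ssr_shift_slope)
```

Any other pair of candidate values could be tried in `ssr`; as long as \(X\) varies in the sample, none should give a smaller value than the least-squares pair.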
OLS: First-Order Conditions
Setting the partial derivatives to zero:
\[
\begin{aligned}
\frac{\partial}{\partial b_0}: \quad &{-2} \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right) = 0 \\[8pt]
\frac{\partial}{\partial b_1}: \quad &{-2} \sum_{i=1}^n X_i\left(Y_i - b_0 - b_1 X_i\right) = 0
\end{aligned}
\]
Dividing both equations by \(-2n\):
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right) &= 0 \\[6pt]
\frac{1}{n}\sum_{i=1}^n X_i\left(Y_i - b_0 - b_1 X_i\right) &= 0
\end{aligned}
\]
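A sketch of how these two conditions can be solved by substitution, writing \(\bar{X}\) and \(\bar{Y}\) for the sample means and assuming \(X_i\) is not constant in the sample:

\[
\begin{aligned}
\text{First condition:} \quad & \bar{Y} - b_0 - b_1 \bar{X} = 0 \;\Longrightarrow\; b_0 = \bar{Y} - b_1 \bar{X} \\[6pt]
\text{Substitute into the second:} \quad & \sum_{i=1}^n X_i \bigl[(Y_i - \bar{Y}) - b_1 (X_i - \bar{X})\bigr] = 0 \\[6pt]
\text{Solve for } b_1: \quad & b_1 = \frac{\sum_{i=1}^n X_i (Y_i - \bar{Y})}{\sum_{i=1}^n X_i (X_i - \bar{X})} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}
\end{aligned}
\]

The last equality uses \(\sum_{i=1}^n (Y_i - \bar{Y}) = 0\) and \(\sum_{i=1}^n (X_i - \bar{X}) = 0\), so subtracting \(\bar{X}\) from \(X_i\) inside each sum changes nothing.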
Method of Moments
An alternative derivation starts from the population. Recall from Lecture 2a that ZCM implies:
\[
E[U] = 0, \qquad E[XU] = 0
\]
- These are moment conditions — restrictions that must hold if the model is correctly specified.
- Two conditions, two unknowns (\(\beta_0, \beta_1\)) — just enough to pin down the parameters.
- The method of moments idea: choose \(b_0, b_1\) so that the sample analogs of these conditions hold exactly.
MM: Sample Counterparts
Replace population expectations with sample averages, and \(U\) with \(Y_i - b_0 - b_1 X_i\):
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right) &= 0 \\[6pt]
\frac{1}{n}\sum_{i=1}^n X_i\left(Y_i - b_0 - b_1 X_i\right) &= 0
\end{aligned}
\]
These are the same two equations as the OLS first-order conditions — two approaches, one estimator.
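One way to check the equivalence numerically: the two sample moment conditions are linear in \((b_0, b_1)\), so they can be solved as a 2-by-2 linear system and compared with a least-squares fit. The data below are a hypothetical toy sample, and the matrix layout is just one convenient way to write the system.

```python
import numpy as np

# Hypothetical toy sample (not from the lecture).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# The two sample moment conditions, rearranged as a linear system in (b0, b1):
#   b0           + b1 * mean(X)   = mean(Y)
#   b0 * mean(X) + b1 * mean(X^2) = mean(X * Y)
A = np.array([[1.0, X.mean()],
              [X.mean(), np.mean(X ** 2)]])
c = np.array([Y.mean(), np.mean(X * Y)])
b0_mm, b1_mm = np.linalg.solve(A, c)

# Least-squares fit for comparison (polyfit returns the slope first).
b1_ls, b0_ls = np.polyfit(X, Y, deg=1)

print(b0_mm, b1_mm)
print(b0_ls, b1_ls)
```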
The OLS Estimators
The values \(\hat\beta_0, \hat\beta_1\) that solve this system are the OLS estimators:
\[
\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}
\]
\[
\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}
\]
- \(\hat\beta_1\) exists as long as \(\sum (X_i - \bar{X})^2 > 0\) — i.e., there is variation in \(X\).
- The intercept ensures the regression line passes through \((\bar{X}, \bar{Y})\).
- Equivalently: \(\hat\beta_1 = \hat\rho_{XY} \cdot \dfrac{\hat\sigma_Y}{\hat\sigma_X}\), where \(\hat\rho_{XY}\) is the sample correlation.
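The formulas above can be sketched directly in code. The data are a hypothetical toy sample; the sketch computes the slope and intercept from deviations, then checks the correlation form and the fact that the line passes through the point of means.

```python
import numpy as np

# Hypothetical toy sample (not from the lecture).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_dev = X - X.mean()
y_dev = Y - Y.mean()

sxx = np.sum(x_dev ** 2)
if sxx == 0:
    # No variation in X: the slope formula is undefined.
    raise ValueError("X has no sample variation")

beta1_hat = np.sum(x_dev * y_dev) / sxx
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Equivalent form: sample correlation times the ratio of sample standard deviations.
rho_hat = np.corrcoef(X, Y)[0, 1]
beta1_alt = rho_hat * Y.std(ddof=1) / X.std(ddof=1)

# The fitted line evaluated at the mean of X returns the mean of Y.
y_at_xbar = beta0_hat + beta1_hat * X.mean()

print(beta0_hat, beta1_hat, beta1_alt)
```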
Multiple Regression: OLS
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + U
\]
- There are \(k+1\) unknown parameters — we need \(k+1\) equations.
- Both derivations extend naturally from the simple regression case.
MLR: Method of Moments
ZCM in MLR (\(E[U \mid X_1, \ldots, X_k] = 0\)) implies \(k+1\) moment conditions — one for each parameter:
\[
E[U] = 0, \qquad E[X_j U] = 0, \quad j = 1, 2, \ldots, k
\]
Choose \(b_0, \ldots, b_k\) so the sample analogs hold exactly:
\[
\frac{1}{n}\sum_{i=1}^n (Y_i - b_0 - b_1 X_{i1} - \cdots - b_k X_{ik}) = 0
\]
\[
\frac{1}{n}\sum_{i=1}^n X_{ij}(Y_i - b_0 - b_1 X_{i1} - \cdots - b_k X_{ik}) = 0, \quad j = 1, \ldots, k
\]
MLR: Least Squares
Minimize the sum of squared residuals over \((b_0, \ldots, b_k)\):
\[
\min_{b_0, \ldots, b_k} \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_{i1} - \cdots - b_k X_{ik}\right)^2
\]
The FOCs are the same \(k+1\) equations as the MM sample counterparts.
The solutions are the OLS estimators \(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k\).
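A minimal sketch of solving the \(k+1\) equations numerically. Stacking the regressors into a matrix with a leading column of ones is a notational convenience introduced here (the slides do not use matrix notation); the simulated data and coefficient values are assumptions for illustration only.

```python
import numpy as np

# Simulated data with assumed coefficients (illustration only).
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
U = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 1.0 * X2 + U

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones(n), X1, X2])

# The k+1 first-order conditions stacked together read X'(Y - Xb) = 0,
# i.e. the linear system (X'X) b = X'Y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Direct least-squares minimization for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Sample moment conditions evaluated at the solution (should be numerically zero).
moments = X.T @ (Y - X @ beta_normal) / n

print(beta_normal)
print(beta_lstsq)
print(moments)
```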
Fitted Values and Residuals in MLR
\[
\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \hat\beta_2 X_{i2} + \cdots + \hat\beta_k X_{ik}
\]
\[
\hat{U}_i = Y_i - \hat{Y}_i
\]
Interpreting OLS Coefficients
The estimated regression equation:
\[
\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k
\]
- \(\hat\beta_j\) is the partial effect of \(X_j\):
\[
\hat\beta_j = \frac{\Delta \hat{Y}}{\Delta X_j}, \quad \text{holding all other } X_l \ (l \neq j) \ \text{fixed}
\]
- This is a statement about the fitted values — about the regression line, not (yet) about the population.
- Whether \(\hat\beta_j\) estimates a causal effect depends on the assumptions discussed in Lecture 2a.
Example: CEO Salary and ROE
\[
\text{salary} = \beta_0 + \beta_1 \, \text{roe} + U
\]
- salary: annual salary in thousands of dollars
- roe: return on equity (percent)
The fitted regression:
\[
\widehat{\text{salary}} = 963.19 + 18.5 \, \text{roe}
\]
- If ROE increases by one percentage point, salary is predicted to increase by $18,500.
- The intercept: predicted salary when ROE = 0 is $963,190.
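The unit conversion in these two statements can be checked with a few lines of arithmetic. The function name is hypothetical; the coefficients are the ones in the fitted regression above, with salary measured in thousands of dollars.

```python
def predicted_salary(roe):
    """Predicted salary in thousands of dollars, using the fitted line above."""
    return 963.19 + 18.5 * roe

# Intercept: predicted salary (in thousands) when ROE = 0.
at_zero = predicted_salary(0.0)

# Slope: change in predicted salary (in thousands) for a one-point rise in ROE.
# The starting value of 10 is arbitrary; the line is linear, so any value works.
one_point_change = predicted_salary(11.0) - predicted_salary(10.0)

# Multiply by 1,000 to convert from thousands of dollars to dollars.
print(round(at_zero * 1000, 2))
print(round(one_point_change * 1000, 2))
```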
Population vs. Sample Regression
- The PRF (\(E[Y \mid X]\)) is fixed but unknown. The SRF (\(\hat{Y}\)) is our estimate — it varies across samples.
- With a different sample, we would get a different fitted line. How close the SRF is to the PRF is a question about estimator properties (Lecture 2c).
Example: College GPA
\[
\widehat{\text{colGPA}} = 1.286 + 0.453 \, \text{hsGPA} + 0.0091 \, \text{ACT}
\]
- Holding ACT fixed, a one-point higher high school GPA predicts a 0.453-point higher college GPA.
- Holding hsGPA fixed, a one-point higher ACT score predicts a 0.0091-point higher college GPA.
- Predicted college GPA for hsGPA = 3.5, ACT = 24?
\[
\widehat{\text{colGPA}} = 1.286 + 0.453(3.5) + 0.0091(24) = 3.09
\]
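The same prediction, and the "holding hsGPA fixed" comparison, can be reproduced in a short sketch. The function name is hypothetical; the coefficients are those of the fitted equation above.

```python
def predicted_col_gpa(hs_gpa, act):
    """Predicted college GPA from the fitted equation above."""
    return 1.286 + 0.453 * hs_gpa + 0.0091 * act

# Prediction for hsGPA = 3.5 and ACT = 24.
pred = predicted_col_gpa(3.5, 24)

# Partial effect of ACT: raise ACT by one point while holding hsGPA at 3.5.
act_effect = predicted_col_gpa(3.5, 25) - predicted_col_gpa(3.5, 24)

print(round(pred, 2))
print(round(act_effect, 4))
```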
Algebraic Properties of OLS
These properties hold by construction — they follow from the FOCs, not from any assumption about the population.
1. The residuals sum to zero: \(\displaystyle\sum_{i=1}^n \hat{U}_i = 0\)
2. The sample covariance between each regressor and the residuals is zero: \(\displaystyle\sum_{i=1}^n X_{ij}\hat{U}_i = 0, \quad j = 1, \ldots, k\)
3. The point \((\bar{X}_1, \ldots, \bar{X}_k, \bar{Y})\) lies on the regression line.
Exercise: What is the sample covariance between the fitted values \(\hat{Y}_i\) and the residuals \(\hat{U}_i\)? (Hint: use properties 1–2.)
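The three algebraic properties listed above can be checked numerically on any data set, since they do not depend on how the data were generated. The simulated sample and its coefficients below are assumptions for illustration; the exercise itself is left open.

```python
import numpy as np

# Simulated data with assumed coefficients (illustration only).
rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 0.5 + 1.5 * X1 - 0.7 * X2 + rng.normal(size=n)

# OLS fit with an intercept.
X = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta_hat

# Residuals sum to zero.
resid_sum = resid.sum()

# Each regressor is orthogonal to the residuals in the sample.
cross_sums = X[:, 1:].T @ resid

# The fitted equation evaluated at the regressor means returns the mean of Y.
y_at_means = beta_hat @ X.mean(axis=0)

print(resid_sum, cross_sums, y_at_means - Y.mean())
```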
Goodness of Fit: Decomposing Variation
How well does the regression line fit the data? We decompose the total variation in \(Y\):
\[
\underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{\text{SST}} = \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}_{\text{SSE}} + \underbrace{\sum_{i=1}^n \hat{U}_i^2}_{\text{SSR}}
\]
- SST (Total Sum of Squares): total variation in \(Y\) around its mean
- SSE (Explained Sum of Squares): variation in \(Y\) explained by the regression
- SSR (Residual Sum of Squares): unexplained variation
The \(R^2\)
The coefficient of determination:
\[
R^2 = \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}}
\]
- \(R^2\) is the fraction of the total variation in \(Y\) explained by the regression.
- \(0 \leq R^2 \leq 1\) (when an intercept is included).
- Example: \(R^2 = 0.65\) means 65% of the variation in \(Y\) is captured by the model.
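A minimal sketch that computes SST, SSE, and SSR for a simulated sample and confirms that the two expressions for \(R^2\) agree when an intercept is included. The data-generating values are assumptions for illustration.

```python
import numpy as np

# Simulated data with assumed coefficients (illustration only).
rng = np.random.default_rng(2)
n = 150
X1 = rng.normal(size=n)
Y = 2.0 + 0.8 * X1 + rng.normal(size=n)

# OLS fit with an intercept.
X = np.column_stack([np.ones(n), X1])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ beta_hat
resid = Y - fitted

sst = np.sum((Y - Y.mean()) ** 2)       # total sum of squares
sse = np.sum((fitted - Y.mean()) ** 2)  # explained sum of squares
ssr = np.sum(resid ** 2)                # residual sum of squares

r2_explained = sse / sst
r2_residual = 1.0 - ssr / sst

print(sst, sse + ssr)
print(r2_explained, r2_residual)
```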
Exercise: Show that \(R^2\) equals the squared correlation between actual and fitted values:
\[
R^2 = \bigl[\text{Corr}(Y_i, \hat{Y}_i)\bigr]^2
\]
(Hint: write \(Y_i = \hat{Y}_i + \hat{U}_i\) and use the result from the previous exercise.)
\(R^2\): What It Does and Doesn’t Tell You
- \(R^2\) never decreases when an additional regressor is added (SSR can only stay the same or fall).
- But a higher \(R^2\) does not mean the new variable belongs in the model. Two reasons:
  - For causal inference: whether a variable should be included depends on the underlying causal structure, not on fit. Including the wrong controls can distort the coefficient of interest; we will see why in Lecture 4c.
  - For prediction: \(R^2\) measures how well the model fits this particular sample. A model that chases noise in the current data may fit it well but predict poorly on new data, so the improvement in \(R^2\) is spurious.
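To illustrate the first point, the sketch below adds a regressor that is pure noise, generated independently of \(Y\), and compares \(R^2\) before and after. The helper `r_squared` and the simulated data are assumptions introduced here for illustration.

```python
import numpy as np

def r_squared(X, Y):
    """R^2 from an OLS fit of Y on X (X must include a column of ones)."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta_hat
    return 1.0 - np.sum(resid ** 2) / np.sum((Y - Y.mean()) ** 2)

# Simulated data with assumed coefficients (illustration only).
rng = np.random.default_rng(3)
n = 50
X1 = rng.normal(size=n)
Y = 1.0 + 0.5 * X1 + rng.normal(size=n)

# A regressor unrelated to Y by construction.
noise = rng.normal(size=n)

ones = np.ones(n)
r2_base = r_squared(np.column_stack([ones, X1]), Y)
r2_more = r_squared(np.column_stack([ones, X1, noise]), Y)

print(r2_base, r2_more)
```

The second value is at least as large as the first even though `noise` carries no information about `Y`, which is why in-sample fit alone is a poor guide to model choice.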
Low \(R^2\) Is Not a Problem
- In economics, we often care about whether \(X\) has a causal effect on \(Y\) — e.g., does education raise income?
- Many other factors also affect income, so education alone explains little of the total variation in \(Y\) (low \(R^2\)).
- What matters is whether the coefficient truly captures a causal effect — not how much of \(Y\) the model explains overall.
Example: What Deters Crime?
Question: Does the threat of conviction deter criminal activity? Does employment help?
- Data on 2,725 men born in 1960–61. Outcome: number of arrests in 1986 (narr86).
| Variable | Description |
| --- | --- |
| pcnv | proportion of prior arrests leading to conviction |
| ptime86 | months spent in prison in 1986 |
| qemp86 | quarters employed in 1986 |
| avgsen | average sentence length in prior convictions (months) |
Example: What Deters Crime?
\[
\widehat{\text{narr86}} = 0.712 - 0.150 \, \text{pcnv} - 0.034 \, \text{ptime86} - 0.104 \, \text{qemp86} + 0.007 \, \text{avgsen}
\]
\(n = 2{,}725\), \(\; R^2 = 0.042\).
- Higher conviction rates and more quarters employed are associated with fewer arrests — consistent with deterrence and opportunity cost stories.
- But \(R^2 = 0.042\): the regressors explain only about 4% of the sample variation in arrests, so we cannot predict which individuals get arrested.
- That is fine — the question is whether these coefficients reflect causal effects, not whether we can forecast individual behavior.
Regression Through the Origin
Sometimes theory tells us \(E[Y \mid X = 0] = 0\) — e.g., if income is zero, tax owed should be zero. We can impose this by dropping the intercept:
\[
\hat{Y} = \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k
\]
- Without an intercept, the FOC with respect to \(b_0\) is dropped, so \(\sum \hat{U}_i = 0\) is no longer guaranteed: the residuals need not sum to zero.
- As a consequence, the decomposition \(\text{SST} = \text{SSE} + \text{SSR}\) can break down, and the usual \(R^2 = 1 - \text{SSR}/\text{SST}\) can be negative — meaning the model fits worse than a horizontal line at \(\bar{Y}\).
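A small worked case where both points show up. The data are a hypothetical toy sample chosen so that the relationship clearly does not pass through the origin; with a single regressor and no intercept, minimizing \(\sum (Y_i - b_1 X_i)^2\) gives the slope \(\sum X_i Y_i / \sum X_i^2\).

```python
import numpy as np

# Hypothetical toy sample: Y falls as X rises, and Y is far from 0 near X = 0.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([10.0, 9.0, 8.0, 7.0, 6.0])

# Least-squares slope when the intercept is forced to zero.
beta1_origin = np.sum(X * Y) / np.sum(X ** 2)
resid = Y - beta1_origin * X

resid_sum = resid.sum()                 # not zero without an intercept
ssr = np.sum(resid ** 2)
sst = np.sum((Y - Y.mean()) ** 2)
r2_usual = 1.0 - ssr / sst              # the usual formula can go negative

print(beta1_origin, resid_sum, r2_usual)
```

Here the through-origin line fits much worse than a horizontal line at \(\bar{Y}\), so the usual \(R^2\) formula returns a negative number.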
The Frisch-Waugh-Lovell Theorem
Question: In a multiple regression, what exactly does \(\hat\beta_1\) capture?
- The FWL theorem says: \(\hat\beta_1\) from the full regression equals \(\hat\beta_1\) from a two-step procedure that “partials out” the other regressors.
- This gives a precise meaning to “holding other variables fixed.”
FWL: Setup
The OLS sample decomposition:
\[
Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \hat\beta_2 X_{i2} + \cdots + \hat\beta_k X_{ik} + \hat{U}_i
\]
This is an identity — it holds exactly in the sample. We want to understand what \(\hat\beta_1\) captures.
FWL: Step 1 — Partial Out \(X_1\)
Regress \(X_1\) on all the other regressors:
\[
X_{i1} = \hat\pi_0 + \hat\pi_1 X_{i2} + \hat\pi_2 X_{i3} + \cdots + \hat\pi_{k-1} X_{ik} + \hat{R}_{i1}
\]
- The residual \(\hat{R}_{i1}\) is the part of \(X_1\) that cannot be predicted by \(X_2, \ldots, X_k\).
- It captures the “unique” variation in \(X_1\), after removing everything shared with the other regressors.
FWL: Step 2 — Regress \(Y\) on the Residuals
Regress \(Y\) on \(\hat{R}_{i1}\):
\[
Y_i = \hat\alpha + \hat\beta_1 \hat{R}_{i1} + \hat{e}_i
\]
Theorem (Frisch-Waugh-Lovell): The slope \(\hat\beta_1\) from this simple regression is identical to \(\hat\beta_1\) from the full multiple regression.
\[
\hat\beta_1 = \frac{\sum_{i=1}^n \hat{R}_{i1} Y_i}{\sum_{i=1}^n \hat{R}_{i1}^2}
\]
- \(\hat\beta_1\) uses only the variation in \(X_1\) that is orthogonal to the other regressors.
- This is what “holding \(X_2, \ldots, X_k\) fixed” means mechanically in OLS.
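The two-step procedure can be sketched as follows and compared with the full regression. The helper `ols`, the simulated data, and the coefficient values are assumptions for illustration.

```python
import numpy as np

def ols(X, Y):
    """OLS coefficients from regressing Y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat

# Simulated data with correlated regressors (assumed values, illustration only).
rng = np.random.default_rng(4)
n = 300
X2 = rng.normal(size=n)
X1 = 0.6 * X2 + rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(size=n)
ones = np.ones(n)

# Full multiple regression of Y on (1, X1, X2).
beta_full = ols(np.column_stack([ones, X1, X2]), Y)

# Step 1: regress X1 on the other regressors and keep the residuals.
Z = np.column_stack([ones, X2])
R1 = X1 - Z @ ols(Z, X1)

# Step 2: slope from regressing Y on the Step 1 residuals.
beta1_fwl = np.sum(R1 * Y) / np.sum(R1 ** 2)

print(beta_full[1], beta1_fwl)
```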
Short vs. Long Regression
FWL showed what the long regression does mechanically. A natural follow-up: what happens when we leave a variable out?
We use tilde (\(\tilde{}\)) for the short regression and hat (\(\hat{}\)) for the long regression.
Short regression (regress \(Y\) on \(X_1\) only): \(Y_i = \tilde\beta_0 + \tilde\beta_1 X_{i1} + \tilde{U}_i\)
Long regression (regress \(Y\) on \(X_1\) and \(X_2\)): \(Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \hat\beta_2 X_{i2} + \hat{U}_i\)
How do \(\tilde\beta_1\) and \(\hat\beta_1\) relate?
When Does Omitting \(X_2\) Matter?
\[
\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \, \tilde\delta_1
\]
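Here \(\tilde\delta_1\) is the slope from the simple regression of \(X_2\) on \(X_1\). The short and long regressions give different coefficients on \(X_1\) only when both: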
- \(X_2\) is associated with \(Y\) in the long regression (\(\hat\beta_2 \neq 0\))
- \(X_2\) is correlated with \(X_1\) in the sample (\(\tilde\delta_1 \neq 0\))
- If either condition fails, \(\tilde\beta_1 = \hat\beta_1\) — omitting \(X_2\) does not change the coefficient on \(X_1\).
- In Lecture 2c, we take expectations of this identity to determine when omitting a variable leads to statistical bias.
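Because the relation between the short and long coefficients is an algebraic identity, it can be verified on any sample. The sketch below uses simulated data (assumed values) and takes \(\tilde\delta_1\) to be the slope from regressing \(X_2\) on \(X_1\); the helper `ols` is hypothetical.

```python
import numpy as np

def ols(X, Y):
    """OLS coefficients from regressing Y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat

# Simulated data with correlated regressors (assumed values, illustration only).
rng = np.random.default_rng(5)
n = 250
X1 = rng.normal(size=n)
X2 = 0.4 * X1 + rng.normal(size=n)
Y = 1.0 + 1.2 * X1 + 0.8 * X2 + rng.normal(size=n)
ones = np.ones(n)

beta_long = ols(np.column_stack([ones, X1, X2]), Y)   # Y on X1 and X2
beta_short = ols(np.column_stack([ones, X1]), Y)      # Y on X1 only
delta = ols(np.column_stack([ones, X1]), X2)          # X2 on X1

# Right-hand side of the identity: long slope on X1 plus (long slope on X2) * delta_1.
identity_rhs = beta_long[1] + beta_long[2] * delta[1]

print(beta_short[1], identity_rhs)
```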
Example: College GPA
\[
\begin{aligned}
\text{Long}: \quad \widehat{\text{colGPA}} &= 1.286 + 0.453 \, \text{hsGPA} + 0.0091 \, \text{ACT} \\
\text{Short}: \quad \widehat{\text{colGPA}} &= 2.403 + 0.027 \, \text{ACT}
\end{aligned}
\]
- The coefficient on ACT is roughly three times as large in the short regression (0.027 vs. 0.0091).
- Why? Here ACT plays the role of \(X_1\) and the omitted hsGPA plays the role of \(X_2\). hsGPA and ACT are positively correlated (\(\tilde\delta_1 > 0\)), and hsGPA is positively associated with colGPA in the long regression (\(\hat\beta_2 = 0.453 > 0\)). The short regression attributes some of the hsGPA association to ACT.
- Whether this difference reflects “bias” depends on which model is correct — a question we address in Lecture 2c.
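As a back-of-the-envelope check, the short-long identity can be rearranged to find the slope \(\tilde\delta_1\) from regressing hsGPA on ACT that the rounded coefficients above imply (taking ACT as \(X_1\) and hsGPA as \(X_2\)). This is implied arithmetic, not a separately reported estimate:

\[
\tilde\delta_1 = \frac{\tilde\beta_1 - \hat\beta_1}{\hat\beta_2} = \frac{0.027 - 0.0091}{0.453} \approx 0.040
\]

That is, in this sample one additional ACT point goes with roughly 0.04 more points of high school GPA, up to rounding in the reported coefficients.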
Summary
- Two derivations, one estimator: least squares and method of moments both yield OLS.
- Algebraic properties: the residuals sum to zero and are uncorrelated with the regressors, and the regression line passes through the point of means.
- \(R^2\): fraction of variation explained — useful but not the goal of causal analysis.
- Frisch-Waugh-Lovell: \(\hat\beta_j\) uses only the variation in \(X_j\) orthogonal to other regressors.
- Short–long regression: \(\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1\). Omitting a variable changes the estimate only when the omitted variable is correlated with the included one and has a nonzero coefficient in the long regression.
What’s Next
Lecture 2c — Estimator Properties:
- Unbiasedness of OLS under the Gauss-Markov assumptions
- Variance of OLS estimators
- The Gauss-Markov theorem: OLS is BLUE
- Model misspecification and omitted variable bias