Linear Regression: Estimation

Natasha Kang

Xiamen University, Chow Institute

March, 2026

Where We Are

Last time, we specified the linear regression model and connected it to the population through the zero conditional mean (ZCM) assumption.

Now: how do we estimate the population parameters \(\beta_0, \beta_1, \ldots, \beta_k\) from data?

  • We have a model (parametric specification) and a sample (data).
  • Estimation combines the two: it uses the sample to produce numbers \(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k\) that approximate the unknown population parameters.

The Estimation Problem

We assume in the population:

\[ Y = \beta_0 + \beta_1 X + U \]

Given a random sample \(\{(X_i, Y_i): i = 1, 2, \ldots, n\}\), how do we choose \(\hat\beta_0\) and \(\hat\beta_1\)?

  • We will derive two approaches — least squares and method of moments — and show they give the same answer.

Fitted Values and Residuals

For any candidate values \(b_0\) and \(b_1\), define:

  • Fitted value for observation \(i\):

\[ \tilde{Y}_i(b_0, b_1) = b_0 + b_1 X_i \]

  • Residual for observation \(i\):

\[ e_i(b_0, b_1) = Y_i - b_0 - b_1 X_i \]

  • The fitted value is our prediction of \(Y_i\) given \(X_i\).
  • The residual measures how far the prediction misses.

Fitting a Line Through the Data

The Least Squares Idea

We want to choose \(b_0\) and \(b_1\) so that the residuals are “small” overall.

Ordinary Least Squares (OLS): minimize the sum of squared residuals:

\[ \min_{b_0, \, b_1} \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right)^2 \]

  • Why squared? Squaring penalizes large errors more, avoids cancellation of positive and negative residuals, and yields a smooth (differentiable) objective.
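
As a small illustration of the objective, the sketch below evaluates the residuals and the sum of squared residuals for two candidate lines. The data and the candidate values are made up for illustration and are not from the lecture.

```python
# Toy data, made up for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

def residuals(b0, b1):
    """Residuals e_i(b0, b1) = Y_i - b0 - b1 * X_i for a candidate line."""
    return [y - b0 - b1 * x for x, y in zip(X, Y)]

def ssr(b0, b1):
    """Sum of squared residuals: the OLS objective at (b0, b1)."""
    return sum(e ** 2 for e in residuals(b0, b1))

# Two hypothetical candidate lines.
ssr_a = ssr(0.0, 2.0)   # Y = 2X
ssr_b = ssr(1.0, 1.0)   # Y = 1 + X
print(ssr_a, ssr_b)     # the first candidate has the smaller objective
```

OLS searches over all \((b_0, b_1)\) for the pair with the smallest such value.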

OLS: First-Order Conditions

Setting the partial derivatives to zero:

\[ \begin{aligned} \frac{\partial}{\partial b_0}: \quad &{-2} \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right) = 0 \\[8pt] \frac{\partial}{\partial b_1}: \quad &{-2} \sum_{i=1}^n X_i\left(Y_i - b_0 - b_1 X_i\right) = 0 \end{aligned} \]

Dividing by \(-2\) and \(n\):

\[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right) &= 0 \\[6pt] \frac{1}{n}\sum_{i=1}^n X_i\left(Y_i - b_0 - b_1 X_i\right) &= 0 \end{aligned} \]

Method of Moments

An alternative derivation starts from the population. Recall from Lecture 2a that ZCM implies:

\[ E[U] = 0, \qquad E[XU] = 0 \]

  • These are moment conditions — restrictions that must hold if the model is correctly specified.
  • Two conditions, two unknowns (\(\beta_0, \beta_1\)) — just enough to pin down the parameters.
  • The method of moments idea: choose \(b_0, b_1\) so that the sample analogs of these conditions hold exactly.

MM: Sample Counterparts

Replace population expectations with sample averages, and \(U\) with \(Y_i - b_0 - b_1 X_i\):

\[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right) &= 0 \\[6pt] \frac{1}{n}\sum_{i=1}^n X_i\left(Y_i - b_0 - b_1 X_i\right) &= 0 \end{aligned} \]

These are the same two equations as the OLS first-order conditions — two approaches, one estimator.

The OLS Estimators

The values \(\hat\beta_0, \hat\beta_1\) that solve this system are the OLS estimators:

\[ \hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} \]

\[ \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X} \]
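
For reference, one way to solve the two equations is sketched here (a standard derivation, not taken from the slides). The first condition gives the intercept in terms of the slope, and substituting it into the second condition isolates the slope:

\[ \begin{aligned} \frac{1}{n}\sum_{i=1}^n \left(Y_i - b_0 - b_1 X_i\right) = 0 \;&\Longrightarrow\; b_0 = \bar{Y} - b_1 \bar{X} \\[6pt] \sum_{i=1}^n X_i\left[(Y_i - \bar{Y}) - b_1 (X_i - \bar{X})\right] = 0 \;&\Longrightarrow\; b_1 = \frac{\sum_{i=1}^n X_i (Y_i - \bar{Y})}{\sum_{i=1}^n X_i (X_i - \bar{X})} \end{aligned} \]

Because \(\sum (Y_i - \bar{Y}) = 0\) and \(\sum (X_i - \bar{X}) = 0\), we have \(\sum X_i (Y_i - \bar{Y}) = \sum (X_i - \bar{X})(Y_i - \bar{Y})\) and \(\sum X_i (X_i - \bar{X}) = \sum (X_i - \bar{X})^2\), which gives the expression for \(\hat\beta_1\) above.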

  • \(\hat\beta_1\) exists as long as \(\sum (X_i - \bar{X})^2 > 0\) — i.e., there is variation in \(X\).
  • The intercept ensures the regression line passes through \((\bar{X}, \bar{Y})\).
  • Equivalently: \(\hat\beta_1 = \hat\rho_{XY} \cdot \dfrac{\hat\sigma_Y}{\hat\sigma_X}\), where \(\hat\rho_{XY}\) is the sample correlation.
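
The closed-form expressions can be checked numerically. The following is a minimal sketch on simulated data; the sample size, seed, and data-generating values are assumptions chosen for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Simulated sample; the values 1.0 and 2.0 are hypothetical population parameters.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)

x_dev = X - X.mean()
y_dev = Y - Y.mean()

# Closed-form OLS estimators for simple regression.
beta1_hat = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Equivalent form: sample correlation times the ratio of sample standard deviations.
rho_hat = np.corrcoef(X, Y)[0, 1]
beta1_alt = rho_hat * Y.std(ddof=1) / X.std(ddof=1)

# Cross-check against NumPy's degree-1 least-squares fit (returns slope first).
slope_np, intercept_np = np.polyfit(X, Y, 1)

print(beta0_hat, beta1_hat)
```

All three routes to the slope should agree up to floating-point error.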

Multiple Regression: OLS

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + U \]

  • There are \(k+1\) unknown parameters — we need \(k+1\) equations.
  • Both derivations extend naturally from the simple regression case.

MLR: Method of Moments

ZCM in MLR (\(E[U \mid X_1, \ldots, X_k] = 0\)) implies \(k+1\) moment conditions — one for each parameter:

\[ E[U] = 0, \qquad E[X_j U] = 0, \quad j = 1, 2, \ldots, k \]

Choose \(b_0, \ldots, b_k\) so the sample analogs hold exactly:

\[ \frac{1}{n}\sum_{i=1}^n (Y_i - b_0 - b_1 X_{i1} - \cdots - b_k X_{ik}) = 0 \]

\[ \frac{1}{n}\sum_{i=1}^n X_{ij}(Y_i - b_0 - b_1 X_{i1} - \cdots - b_k X_{ik}) = 0, \quad j = 1, \ldots, k \]

MLR: Least Squares

Minimize the sum of squared residuals over \((b_0, \ldots, b_k)\):

\[ \min_{b_0, \ldots, b_k} \sum_{i=1}^n \left(Y_i - b_0 - b_1 X_{i1} - \cdots - b_k X_{ik}\right)^2 \]

The FOCs are the same \(k+1\) equations as the MM sample counterparts.

The solutions are the OLS estimators \(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k\).
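
A minimal numerical sketch of the multiple regression case follows. It stacks the \(k+1\) first-order conditions into a linear system; the matrix notation, the simulated data, and the parameter values are assumptions introduced here for illustration, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 3
X = rng.normal(size=(n, k))
beta_true = np.array([0.5, 1.0, -2.0, 0.3])  # hypothetical (beta_0, ..., beta_k)
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(size=n)

# Design matrix with a leading column of ones for the intercept.
Z = np.column_stack([np.ones(n), X])

# The k+1 FOCs stacked: Z'(Y - Z b) = 0, i.e. (Z'Z) b = Z'Y.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)

# Same minimization solved by NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# Sample moment conditions evaluated at the solution (should be numerically zero).
moments = Z.T @ (Y - Z @ beta_hat) / n
print(beta_hat)
```

The vector `moments` contains the sample analogs of \(E[U] = 0\) and \(E[X_j U] = 0\), so the check illustrates that the least squares solution also satisfies the method of moments equations.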

Fitted Values and Residuals in MLR

  • Fitted values:

\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \hat\beta_2 X_{i2} + \cdots + \hat\beta_k X_{ik} \]

  • Residuals:

\[ \hat{U}_i = Y_i - \hat{Y}_i \]

Interpreting OLS Coefficients

The estimated regression equation:

\[ \hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k \]

  • \(\hat\beta_j\) is the partial effect of \(X_j\):

\[ \hat\beta_j = \frac{\Delta \hat{Y}}{\Delta X_j}, \quad \text{holding all other } X_l \ (l \neq j) \ \text{fixed} \]

  • This is a statement about the fitted values — about the regression line, not (yet) about the population.
  • Whether \(\hat\beta_j\) estimates a causal effect depends on the assumptions discussed in Lecture 2a.

Example: CEO Salary and ROE

\[ \text{salary} = \beta_0 + \beta_1 \, \text{roe} + U \]

  • salary: annual salary in thousands of dollars
  • roe: return on equity (percent)

The fitted regression:

\[ \widehat{\text{salary}} = 963.19 + 18.5 \, \text{roe} \]

  • If ROE increases by one percentage point, salary is predicted to increase by $18,500.
  • The intercept: predicted salary when ROE = 0 is $963,190.
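
The fitted line can be used for prediction directly. A small sketch, where the ROE value of 30 is a hypothetical input chosen for illustration:

```python
def predicted_salary(roe):
    """Predicted salary in thousands of dollars, for ROE measured in percent."""
    return 963.19 + 18.5 * roe

at_zero = predicted_salary(0.0)      # the intercept, in thousands of dollars
at_thirty = predicted_salary(30.0)   # hypothetical firm with ROE = 30
one_point_change = predicted_salary(30.0) - predicted_salary(29.0)
print(at_zero, at_thirty, one_point_change)
```

The one-point difference equals the slope, 18.5 thousand dollars, whatever the starting ROE.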

Population vs. Sample Regression

  • The population regression function (PRF), \(E[Y \mid X]\), is fixed but unknown. The sample regression function (SRF), \(\hat{Y}\), is our estimate of it and varies across samples.
  • With a different sample, we would get a different fitted line. How close the SRF is to the PRF is a question about estimator properties (Lecture 2c).

Example: College GPA

\[ \widehat{\text{colGPA}} = 1.286 + 0.453 \, \text{hsGPA} + 0.0091 \, \text{ACT} \]

  • Holding ACT fixed, a one-point higher high school GPA predicts a 0.453-point higher college GPA.
  • Holding hsGPA fixed, a one-point higher ACT score predicts a 0.0091-point higher college GPA.
  • Predicted college GPA for hsGPA = 3.5, ACT = 24?

\[ \widehat{\text{colGPA}} = 1.286 + 0.453(3.5) + 0.0091(24) = 3.09 \]

Algebraic Properties of OLS

These properties hold by construction — they follow from the FOCs, not from any assumption about the population.

  1. The residuals sum to zero: \(\displaystyle\sum_{i=1}^n \hat{U}_i = 0\)

  2. Each regressor is orthogonal to the residuals: \(\displaystyle\sum_{i=1}^n X_{ij}\hat{U}_i = 0, \quad j = 1, \ldots, k\). Together with property 1, this makes the sample covariance between each regressor and the residuals zero.

  3. The point of sample means \((\bar{X}_1, \ldots, \bar{X}_k, \bar{Y})\) lies on the fitted regression line (a plane or hyperplane when \(k \geq 2\)).
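
These three properties can be confirmed on any data set. The sketch below uses simulated data with two regressors; the data-generating values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250
X = rng.normal(size=(n, 2))
Y = 1.0 + X @ np.array([0.8, -0.5]) + rng.normal(size=n)  # hypothetical model

Z = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
U_hat = Y - Z @ beta_hat

# Property 1: residuals sum to zero.
resid_sum = U_hat.sum()

# Property 2: each regressor is orthogonal to the residuals.
cross_products = X.T @ U_hat

# Property 3: the fitted equation evaluated at the regressor means returns the mean of Y.
y_at_means = beta_hat[0] + X.mean(axis=0) @ beta_hat[1:]

print(resid_sum, cross_products, y_at_means - Y.mean())
```

The printed values are zero up to floating-point error, regardless of how the data were generated.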

Exercise: What is the sample covariance between the fitted values \(\hat{Y}_i\) and the residuals \(\hat{U}_i\)? (Hint: use properties 1–2.)

Goodness of Fit: Decomposing Variation

How well does the regression line fit the data? We decompose the total variation in \(Y\):

\[ \underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{\text{SST}} = \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}_{\text{SSE}} + \underbrace{\sum_{i=1}^n \hat{U}_i^2}_{\text{SSR}} \]

  • SST (Total Sum of Squares): total variation in \(Y\) around its mean
  • SSE (Explained Sum of Squares): variation in \(Y\) explained by the regression
  • SSR (Residual Sum of Squares): unexplained variation

The \(R^2\)

The coefficient of determination:

\[ R^2 = \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\text{SSR}}{\text{SST}} \]

  • \(R^2\) is the fraction of the total variation in \(Y\) explained by the regression.
  • \(0 \leq R^2 \leq 1\) (when an intercept is included).
  • Example: \(R^2 = 0.65\) means 65% of the variation in \(Y\) is captured by the model.
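
A minimal sketch of the decomposition and the two equivalent expressions for \(R^2\), using simulated data (the data-generating values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=n)
Y = 2.0 + 1.5 * X + rng.normal(size=n)  # hypothetical model

Z = np.column_stack([np.ones(n), X])     # intercept included
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
Y_hat = Z @ beta_hat
U_hat = Y - Y_hat

SST = np.sum((Y - Y.mean()) ** 2)        # total variation
SSE = np.sum((Y_hat - Y.mean()) ** 2)    # explained variation
SSR = np.sum(U_hat ** 2)                 # residual variation

r2_explained = SSE / SST
r2_residual = 1.0 - SSR / SST
print(SST, SSE + SSR, r2_explained, r2_residual)
```

With an intercept in the regression, `SST` equals `SSE + SSR` and the two versions of \(R^2\) coincide.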

Exercise: Show that \(R^2\) equals the squared correlation between actual and fitted values:

\[ R^2 = \bigl[\text{Corr}(Y_i, \hat{Y}_i)\bigr]^2 \]

(Hint: write \(Y_i = \hat{Y}_i + \hat{U}_i\) and use the result from the previous exercise.)

\(R^2\): What It Does and Doesn’t Tell You

  • \(R^2\) never decreases when an additional regressor is added (SSR can only stay the same or fall).
  • But a higher \(R^2\) does not mean the new variable belongs in the model. Two reasons:
    • For causal inference: whether a variable should be included depends on the underlying causal structure, not on fit. Including the wrong controls can distort the coefficient of interest — we will see why in Lecture 4c.
    • For prediction: \(R^2\) measures how well the model fits this particular sample. A model that chases noise in the current data may fit it well but predict poorly on new data — the improvement in \(R^2\) is spurious.
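
The first point can be illustrated by adding a regressor that is pure noise by construction. This is a sketch on simulated data; the helper name `r_squared` and all numerical values are hypothetical.

```python
import numpy as np

def r_squared(Z, Y):
    """R^2 = 1 - SSR/SST for an OLS fit of Y on the columns of Z."""
    beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    ssr = np.sum((Y - Z @ beta_hat) ** 2)
    sst = np.sum((Y - Y.mean()) ** 2)
    return 1.0 - ssr / sst

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=n)
Y = 1.0 + 0.5 * X + rng.normal(size=n)
noise = rng.normal(size=n)  # generated independently of Y and X

Z_small = np.column_stack([np.ones(n), X])
Z_big = np.column_stack([Z_small, noise])

r2_small = r_squared(Z_small, Y)
r2_big = r_squared(Z_big, Y)
print(r2_small, r2_big)
```

The larger model cannot have a lower \(R^2\), even though the extra regressor has no relationship with \(Y\) in the population that generated the data.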

Low \(R^2\) Is Not a Problem

  • In economics, we often care about whether \(X\) has a causal effect on \(Y\) — e.g., does education raise income?
  • Many other factors also affect income, so education alone explains little of the total variation in \(Y\) (low \(R^2\)).
  • What matters is whether the coefficient truly captures a causal effect — not how much of \(Y\) the model explains overall.

Example: What Deters Crime?

Question: Does the threat of conviction deter criminal activity? Does employment help?

  • Data on 2,725 men born in 1960–61. Outcome: number of arrests in 1986 (narr86).
  • Regressors:
    • pcnv: proportion of prior arrests leading to conviction
    • ptime86: months spent in prison in 1986
    • qemp86: quarters employed in 1986
    • avgsen: average sentence length in prior convictions (months)

\[ \widehat{\text{narr86}} = 0.712 - 0.150 \, \text{pcnv} - 0.034 \, \text{ptime86} - 0.104 \, \text{qemp86} + 0.007 \, \text{avgsen} \]

\(n = 2{,}725\), \(\; R^2 = 0.042\).

  • Higher conviction rates and more quarters employed are associated with fewer arrests — consistent with deterrence and opportunity cost stories.
  • But \(R^2 = 0.042\): the regressors account for only about 4% of the sample variation in arrests, so the model predicts individual arrests poorly.
  • That is fine — the question is whether these coefficients reflect causal effects, not whether we can forecast individual behavior.

Regression Through the Origin

Sometimes theory tells us \(E[Y \mid X = 0] = 0\) — e.g., if income is zero, tax owed should be zero. We can impose this by dropping the intercept:

\[ \hat{Y} = \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k \]

  • Without an intercept, the FOC with respect to \(b_0\) (\(\sum \hat{U}_i = 0\)) is no longer imposed, so the residuals need not sum to zero.
  • As a consequence, the decomposition \(\text{SST} = \text{SSE} + \text{SSR}\) can break down, and the usual \(R^2 = 1 - \text{SSR}/\text{SST}\) can be negative — meaning the model fits worse than a horizontal line at \(\bar{Y}\).
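
A small sketch of how this can happen, using made-up data in which \(Y\) stays near 10 whatever the value of \(X\). With a single regressor and no intercept, the remaining FOC \(\sum X_i (Y_i - b_1 X_i) = 0\) gives \(\hat\beta_1 = \sum X_i Y_i / \sum X_i^2\).

```python
import numpy as np

# Toy data, made up for illustration: Y hovers around 10 regardless of X.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([10.0, 9.0, 10.0, 11.0, 10.0])

# OLS through the origin with one regressor.
beta1_hat = np.sum(X * Y) / np.sum(X ** 2)
U_hat = Y - beta1_hat * X

resid_sum = U_hat.sum()              # not forced to zero without an intercept
cross_product = np.sum(X * U_hat)    # the FOC for b1 still holds

SST = np.sum((Y - Y.mean()) ** 2)
SSR = np.sum(U_hat ** 2)
r2 = 1.0 - SSR / SST
print(beta1_hat, resid_sum, r2)
```

Here the line through the origin fits far worse than the horizontal line at \(\bar{Y}\), so `r2` comes out negative.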

The Frisch-Waugh-Lovell Theorem

Question: In a multiple regression, what exactly does \(\hat\beta_1\) capture?

  • The FWL theorem says: \(\hat\beta_1\) from the full regression equals \(\hat\beta_1\) from a two-step procedure that “partials out” the other regressors.
  • This gives a precise meaning to “holding other variables fixed.”

FWL: Setup

The OLS sample decomposition:

\[ Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \hat\beta_2 X_{i2} + \cdots + \hat\beta_k X_{ik} + \hat{U}_i \]

This is an identity — it holds exactly in the sample. We want to understand what \(\hat\beta_1\) captures.

FWL: Step 1 — Partial Out \(X_1\)

Regress \(X_1\) on all the other regressors:

\[ X_{i1} = \hat\pi_0 + \hat\pi_1 X_{i2} + \hat\pi_2 X_{i3} + \cdots + \hat\pi_{k-1} X_{ik} + \hat{R}_{i1} \]

  • The residual \(\hat{R}_{i1}\) is the part of \(X_1\) that cannot be predicted by \(X_2, \ldots, X_k\).
  • It captures the “unique” variation in \(X_1\), after removing everything shared with the other regressors.

FWL: Step 2 — Regress \(Y\) on the Residuals

Regress \(Y\) on \(\hat{R}_{i1}\):

\[ Y_i = \hat\alpha + \hat\beta_1 \hat{R}_{i1} + \hat{e}_i \]

Theorem (Frisch-Waugh-Lovell): The slope \(\hat\beta_1\) from this simple regression is identical to \(\hat\beta_1\) from the full multiple regression.

\[ \hat\beta_1 = \frac{\sum_{i=1}^n \hat{R}_{i1} Y_i}{\sum_{i=1}^n \hat{R}_{i1}^2} \]

  • \(\hat\beta_1\) uses only the variation in \(X_1\) that is orthogonal to the other regressors.
  • This is what “holding \(X_2, \ldots, X_k\) fixed” means mechanically in OLS.
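
The two-step procedure can be verified numerically. The sketch below uses simulated data with three regressors; the helper `ols` and all data-generating values are assumptions introduced for illustration.

```python
import numpy as np

def ols(Z, y):
    """OLS coefficients from regressing y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

rng = np.random.default_rng(5)
n = 400
X2 = rng.normal(size=n)
X3 = rng.normal(size=n)
X1 = 0.6 * X2 - 0.4 * X3 + rng.normal(size=n)   # X1 is correlated with X2 and X3
Y = 1.0 + 2.0 * X1 + 1.0 * X2 - 1.0 * X3 + rng.normal(size=n)
ones = np.ones(n)

# Full multiple regression: coefficient on X1 is element 1.
beta_full = ols(np.column_stack([ones, X1, X2, X3]), Y)

# Step 1: regress X1 on the other regressors (with intercept) and keep the residuals.
W = np.column_stack([ones, X2, X3])
R1 = X1 - W @ ols(W, X1)

# Step 2: simple regression of Y on the residuals.
beta_step2 = ols(np.column_stack([ones, R1]), Y)[1]

# Ratio formula from the theorem.
beta_ratio = np.sum(R1 * Y) / np.sum(R1 ** 2)

print(beta_full[1], beta_step2, beta_ratio)
```

All three numbers agree up to floating-point error, which is the content of the theorem for this sample.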

Short vs. Long Regression

FWL showed what the long regression does mechanically. A natural follow-up: what happens when we leave a variable out?

We use tilde (\(\tilde{}\)) for the short regression and hat (\(\hat{}\)) for the long regression.

Short regression (regress \(Y\) on \(X_1\) only): \(Y_i = \tilde\beta_0 + \tilde\beta_1 X_{i1} + \tilde{U}_i\)

Long regression (regress \(Y\) on \(X_1\) and \(X_2\)): \(Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \hat\beta_2 X_{i2} + \hat{U}_i\)

How do \(\tilde\beta_1\) and \(\hat\beta_1\) relate?

The Short–Long Regression Formula

\[ \tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \, \tilde\delta_1 \]

where \(\tilde\delta_1\) is the slope from regressing \(X_2\) on \(X_1\):

\[ X_{i2} = \tilde\delta_0 + \tilde\delta_1 X_{i1} + \tilde{V}_i \]

  • This is an exact algebraic identity — it holds in any sample, with no assumptions about the population.
  • It decomposes \(\tilde\beta_1\) into two parts: the long regression coefficient \(\hat\beta_1\), plus a term \(\hat\beta_2 \tilde\delta_1\) that captures the indirect association between \(X_1\) and \(Y\) running through \(X_2\).
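
Because the formula is an algebraic identity, it can be checked on any sample. A minimal sketch with simulated data (the helper `ols` and the data-generating values are assumptions for illustration):

```python
import numpy as np

def ols(Z, y):
    """OLS coefficients from regressing y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

rng = np.random.default_rng(6)
n = 300
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)              # X2 is correlated with X1
Y = 1.0 + 1.0 * X1 + 2.0 * X2 + rng.normal(size=n)
ones = np.ones(n)

b_long = ols(np.column_stack([ones, X1, X2]), Y)   # (beta0_hat, beta1_hat, beta2_hat)
b_short = ols(np.column_stack([ones, X1]), Y)      # (beta0_tilde, beta1_tilde)
delta = ols(np.column_stack([ones, X1]), X2)       # (delta0_tilde, delta1_tilde)

lhs = b_short[1]
rhs = b_long[1] + b_long[2] * delta[1]
print(lhs, rhs)
```

The two sides match to floating-point precision, with no appeal to how the data were generated.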

When Does Omitting \(X_2\) Matter?

\[ \tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \, \tilde\delta_1 \]

The short and long regressions give different coefficients on \(X_1\) exactly when both of the following hold:

  1. \(X_2\) is associated with \(Y\) in the long regression (\(\hat\beta_2 \neq 0\))
  2. \(X_2\) is correlated with \(X_1\) in the sample (\(\tilde\delta_1 \neq 0\))
  • If either condition fails, \(\tilde\beta_1 = \hat\beta_1\) — omitting \(X_2\) does not change the coefficient on \(X_1\).
  • In Lecture 2c, we take expectations of this identity to determine when omitting a variable leads to statistical bias.

Example: College GPA

\[ \begin{aligned} \text{Long}: \quad \widehat{\text{colGPA}} &= 1.286 + 0.453 \, \text{hsGPA} + 0.0091 \, \text{ACT} \\ \text{Short}: \quad \widehat{\text{colGPA}} &= 2.403 + 0.027 \, \text{ACT} \end{aligned} \]

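  • The coefficient on ACT is about three times as large in the short regression (0.027 vs. 0.0091).
  • Why? Here ACT plays the role of the included regressor \(X_1\) and hsGPA the omitted regressor \(X_2\). hsGPA and ACT are positively correlated (\(\tilde\delta_1 > 0\)), and hsGPA is positively associated with colGPA in the long regression (\(\hat\beta_2 = 0.453 > 0\)). The short regression attributes some of the hsGPA association to ACT.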
  • Whether this difference reflects “bias” depends on which model is correct — a question we address in Lecture 2c.
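
As a worked check, the short–long formula can be applied to the reported coefficients, taking ACT as \(X_1\) and hsGPA as \(X_2\). The slope \(\tilde\delta_1\) from regressing hsGPA on ACT is not reported in the slides; the value below is only what the rounded coefficients imply.

\[ 0.027 = 0.0091 + 0.453 \, \tilde\delta_1 \quad\Longrightarrow\quad \tilde\delta_1 = \frac{0.027 - 0.0091}{0.453} \approx 0.04 \]

That is, in this sample a one-point higher ACT score goes with roughly a 0.04-point higher high school GPA, up to rounding in the reported coefficients.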

Summary

  • Two derivations, one estimator: least squares and method of moments both yield OLS.
  • Algebraic properties: residuals sum to zero, are uncorrelated with regressors, regression line passes through means.
  • \(R^2\): fraction of variation explained — useful but not the goal of causal analysis.
  • Frisch-Waugh-Lovell: \(\hat\beta_j\) uses only the variation in \(X_j\) orthogonal to other regressors.
  • Short–long regression: \(\tilde\beta_1 = \hat\beta_1 + \hat\beta_2 \tilde\delta_1\). Omitting a variable changes the estimate whenever the omitted variable is correlated with the included one in the sample and has a nonzero coefficient in the long regression.

What’s Next

Lecture 2c — Estimator Properties:

  • Unbiasedness of OLS under the Gauss-Markov assumptions
  • Variance of OLS estimators
  • The Gauss-Markov theorem: OLS is BLUE
  • Model misspecification and omitted variable bias