The Linear Regression Model
Natasha Kang
Xiamen University, Chow Institute
March 2026
What Does Regression Do?
- We want to understand how an outcome \(Y\) relates to one or more explanatory variables \(X\).
- Linear regression is the workhorse of applied econometrics.
- Two uses:
- Causal inference: estimate the effect of \(X\) on \(Y\), holding other factors fixed
- Prediction: forecast \(Y\) given observed \(X\)
A Brief History: Galton and “Regression”
- The term regression comes from Francis Galton (1886).
- Galton studied the heights of parents and their adult children.
- He found: children of tall parents tend to be tall — but not as tall as their parents.
- Children of short parents tend to be short — but not as short.
- Galton called this “regression towards mediocrity” — extreme traits tend to be less extreme in the next generation.
- The statistical method he used to study this became known as regression.
Model vs. Reality
Before writing down a regression model, let’s clarify what a model is — and what it isn’t.
In the real world, outcomes are driven by an underlying mechanism — how households respond to policy, how firms set prices, how individuals choose schooling.
- We rarely observe the mechanism directly. Instead, it induces a population distribution over the variables we observe.
- Our goal is to understand features of this population — and for that, we build statistical models.
- A statistical model is a simplified structure we impose — it is not the mechanism, and it is not the full population distribution.
- It captures only the aspects we choose to model, leaving out much of the real-world complexity.
From Mechanism to Model
| Level | What it is |
| --- | --- |
| Mechanism | Structural equations, behavioral rules, counterfactuals (unobservable) |
| Population distribution | The joint distribution of observables \((Y, X)\), induced by the mechanism |
| Statistical model | Our chosen approximating structure |
- Each level is a simplification of the one above.
- To connect our model to the population, we need assumptions — and the assumptions determine what the model can tell us.
The Simple Linear Regression Model
We specify the following linear regression model:
\[
Y = \beta_0 + \beta_1 X + U
\]
- \(Y\): dependent variable (outcome)
- \(X\): independent variable (regressor, explanatory variable)
- \(\beta_0, \beta_1\): unknown parameters
- \(U\): error term (disturbance) — all other factors affecting \(Y\)
Systematic and Unsystematic Components
\[
Y = \underbrace{\beta_0 + \beta_1 X}_{\text{systematic part}} + \underbrace{U}_{\text{unsystematic part}}
\]
- The systematic part \(\beta_0 + \beta_1 X\) captures the relationship between \(X\) and \(Y\) we choose to model.
- The unsystematic part \(U\) captures everything else: unobserved factors, measurement error, inherent randomness.
What Does the Model Target?
\[
Y = \beta_0 + \beta_1 X + U
\]
- The model alone doesn’t tell us what \(\beta_0\) and \(\beta_1\) represent.
- We need statistical assumptions to connect the model to a feature of the population distribution.
- Different assumptions target different features:
- If \(E[U \mid X] = 0\): the model describes the conditional mean of \(Y\) given \(X\)
- If the \(\tau\)-th quantile of \(U \mid X\) is zero: the model describes the conditional \(\tau\)-th quantile
The Zero Conditional Mean Assumption
We focus on the conditional mean. The key assumption is:
\[
E[U \mid X] = 0
\]
In words: conditional on knowing \(X\), the error \(U\) has mean zero.
- Equivalently: the unobserved factors in \(U\) are, on average, unrelated to \(X\).
- Two consequences by the Law of Iterated Expectations:
\[
\begin{aligned}
E[U] &= E\bigl[E[U \mid X]\bigr] = 0 \\[6pt]
\text{Cov}(X, U) &= E[XU] = E\bigl[X \cdot E[U \mid X]\bigr] = 0
\end{aligned}
\]
- Note: \(E[U \mid X] = 0\) is stronger than \(\text{Cov}(X,U) = 0\) — it rules out any nonlinear dependence.
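A quick simulation (a NumPy sketch, not part of the lecture's formal material) makes the last point concrete: take \(X\) standard normal and \(U = X^2 - 1\). Then \(E[U] = 0\) and \(\text{Cov}(X, U) = 0\), yet \(E[U \mid X] = X^2 - 1 \neq 0\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)
u = x**2 - 1  # mean zero and uncorrelated with x, but E[u|x] = x^2 - 1 != 0

print(u.mean())               # ~0: E[U] = 0 holds
print(np.cov(x, u)[0, 1])     # ~0: Cov(X, U) = 0 holds
# Yet conditional on |x| > 2, u is clearly positive on average, so E[U|X] != 0:
print(u[np.abs(x) > 2].mean())
```

The nonlinear dependence is invisible to the covariance but not to the conditional mean, which is exactly why ZCM is the stronger assumption.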
The Population Regression Function
Under \(E[U \mid X] = 0\):
\[
E[Y \mid X] = \beta_0 + \beta_1 X
\]
This is the Population Regression Function (PRF): the conditional mean of \(Y\) given \(X\).
Interpreting the PRF
\[
E[Y \mid X] = \beta_0 + \beta_1 X
\]
- \(\beta_0 = E[Y \mid X = 0]\): the average outcome when \(X = 0\)
- \(\beta_1 = \frac{\partial}{\partial x} E[Y \mid X = x]\): the change in the average \(Y\) per unit increase in \(X\)
- This is a population average statement — not about any individual.
- Individual \(Y_i\) may deviate from the PRF by \(U_i\).
Association Is Not Causation
Under ZCM, the PRF tells us:
“A one-unit increase in \(X\) is associated with a \(\beta_1\)-unit change in \(E[Y \mid X]\).”
- This describes how the conditional mean varies with \(X\) — a feature of the population distribution.
- But does this mean changing \(X\) would cause \(Y\) to change?
- Not necessarily. The PRF is a statistical relationship — it does not tell us what would happen if we intervened on \(X\).
When Is Regression Causal?
\[
Y = \alpha + \tau D + U, \qquad D \in \{0,1\}, \qquad E[U \mid D] = 0
\]
- Under ZCM, \(\tau\) equals the naive comparison: \(\tau = E[Y \mid D=1] - E[Y \mid D=0]\).
- We already know how to decompose this:
\[
\tau = \underbrace{E[Y(1) - Y(0) \mid D=1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}
\]
- ZCM alone: \(\tau\) is a conditional mean difference — not causal.
- If \(E[Y(0) \mid D=1] = E[Y(0) \mid D=0]\): \(\tau = \text{ATT}\).
- If \(D \perp\!\!\!\perp (Y(0), Y(1))\): \(\tau = \text{ATE}\).
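The decomposition can be checked in a simulation (a hypothetical data-generating process, chosen purely for illustration): units self-select into treatment based on \(Y(0)\), so the naive comparison overstates the constant treatment effect of 2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
# Hypothetical potential outcomes: Y(0) varies across units, constant effect of 2
y0 = rng.normal(0, 1, n)
y1 = y0 + 2.0
# Self-selection: units with high Y(0) are more likely to take treatment
d = (y0 + rng.normal(0, 1, n) > 0).astype(int)

y = np.where(d == 1, y1, y0)                  # observed outcome
naive = y[d == 1].mean() - y[d == 0].mean()   # E[Y|D=1] - E[Y|D=0]
att = (y1 - y0)[d == 1].mean()                # = 2 by construction
sel_bias = y0[d == 1].mean() - y0[d == 0].mean()

print(naive, att, sel_bias)  # naive = att + sel_bias, with sel_bias > 0 here
```

The identity naive = ATT + selection bias holds exactly, sample by sample; only with the counterfactuals \(Y(0), Y(1)\) in hand (which a simulation, unlike real data, provides) can the two pieces be separated.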
Statistical and Structural Assumptions
- \(E[U \mid D] = 0\) is a statistical assumption — it connects the model to the population distribution.
- Can sometimes be assessed indirectly: specification tests, robustness checks.
- \(E[Y(0) \mid D\!=\!1] = E[Y(0) \mid D\!=\!0]\) or \(D \perp\!\!\!\perp (Y(0), Y(1))\) are structural assumptions — they connect the population distribution to the mechanism.
- Fundamentally untestable — they involve unobserved counterfactuals.
- \(D \perp\!\!\!\perp (Y(0), Y(1))\) is strong enough to imply \(E[U \mid D] = 0\). (Why?)
Example: Wages and Education
\[
\text{wage} = \beta_0 + \beta_1 \, \text{educ} + U
\]
- Under ZCM, \(\beta_1\) is the average wage difference between workers who differ by one year of education.
- \(U\) includes ability, family background, experience, …
- More educated workers tend to have higher ability.
- If ability is in \(U\) and correlated with education, ZCM fails — \(\beta_1\) does not even describe the conditional mean.
- This motivates multiple regression: include confounders directly in the model.
The Multiple Linear Regression Model
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + U
\]
- \(k\) regressors: \(X_1, X_2, \ldots, X_k\)
- \(\beta_0, \beta_1, \ldots, \beta_k\): unknown parameters
- \(U\): error term — all other factors affecting \(Y\)
Example: Wages, Education, and Experience
\[
\text{wage} = \beta_0 + \beta_1 \, \text{educ} + \beta_2 \, \text{exper} + U
\]
- By including experience, we hold it fixed when comparing workers with different education levels.
- \(\beta_1\): average wage difference between workers with the same experience but one more year of education.
- Without experience in the model, \(\beta_1\) mixes the effect of education with the effect of experience (since the two are correlated).
- Including experience removes this confounding — but other omitted variables (ability, family background, …) may remain in \(U\).
Control for Confounders
A confounder is a variable that affects both the treatment and the outcome — omitting it biases the estimated relationship.
Example: Does school spending improve test scores?
- Family income affects both spending and scores — it is a confounder.
- If we omit it, \(\beta_1\) in \(\text{avgscore} = \beta_0 + \beta_1 \, \text{expend} + U\) mixes the effect of spending with the effect of income.
- Including the confounder as a control:
\[
\text{avgscore} = \beta_0 + \beta_1 \, \text{expend} + \beta_2 \, \text{avginc} + U
\]
Now \(\beta_1\) compares schools with the same average income but different spending.
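A NumPy sketch of this logic (the data-generating process and all coefficients are invented for illustration; least-squares estimation itself is the topic of the next lecture): the short regression of scores on spending picks up part of the income effect, while the long regression recovers the coefficients of the data-generating process.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
# Hypothetical data-generating process (all numbers illustrative):
avginc = rng.normal(50, 10, n)                    # family income (confounder)
expend = 2 + 0.1 * avginc + rng.normal(0, 1, n)   # richer areas spend more
avgscore = 60 + 1.0 * expend + 0.5 * avginc + rng.normal(0, 5, n)

def ols(y, *regressors):
    """Least-squares fit with an intercept; returns the slope coefficients."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

print(ols(avgscore, expend))          # short regression: slope biased upward
print(ols(avgscore, expend, avginc))  # long regression: slopes near (1.0, 0.5)
```

Here the short-regression slope absorbs the income effect because spending and income are correlated; controlling for income removes exactly that channel.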
Zero Conditional Mean in MLR
The key assumption generalizes:
\[
E[U \mid X_1, X_2, \ldots, X_k] = 0
\]
In words: conditional on all included regressors, the error has mean zero.
- This means \(U\) is uncorrelated with every function of \((X_1, \ldots, X_k)\).
- It requires that no relevant variable left in \(U\) is correlated with any \(X_j\) — otherwise we have omitted variable bias.
The PRF in MLR
Under \(E[U \mid X_1, \ldots, X_k] = 0\):
\[
E[Y \mid X_1, \ldots, X_k] = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k
\]
- The PRF is now a hyperplane in \((X_1, \ldots, X_k, Y)\) space.
- Each \(\beta_j\) traces out how the conditional mean of \(Y\) changes as \(X_j\) varies, with all other \(X\)’s fixed.
Recovering Causal Effects with Controls
With multiple regressors, we can control for confounders and revisit the causal question.
\[
Y = \alpha_0 + \tau D + \gamma W + U
\]
- Under \(E[U \mid D, W] = 0\), \(\tau\) is the conditional average predictive effect (CAPE):
\[
\tau = E[Y \mid D=1, W] - E[Y \mid D=0, W]
\]
- This is a statistical quantity — the mean difference between treated and untreated, among units with the same \(W\).
- If additionally \(D \perp\!\!\!\perp (Y(0), Y(1)) \mid W\) (conditional independence), then \(\tau\) identifies the conditional average treatment effect (CATE):
\[
\tau = E[Y(1) \mid W] - E[Y(0) \mid W]
\]
Multivalued Treatment
Many treatments vary in intensity — not just “on” or “off” (e.g., years of schooling, hours of training).
\[
Y = \alpha + \tau S + U, \qquad E[U \mid S] = 0
\]
- \(S\): treatment level (discrete or continuous)
- Under ZCM, \(\tau\) is the average difference in \(Y\) per unit increase in \(S\).
- Let \(Y(s)\) be the potential outcome at treatment level \(s\). Under \(S \perp\!\!\!\perp Y(s)\) for all \(s\):
\[
\tau = E[Y(s) - Y(s-1)]
\]
the average causal effect of a one-unit increase in treatment.
FYI: Matrix Notation
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}
\]
where
\[
\mathbf{y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ \vdots & & & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
\]
- We will use this notation when it is convenient to do so.
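As a sketch, the model in matrix form is one line of NumPy (dimensions and coefficient values chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 2
# Design matrix: n x (k+1), with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
beta = np.array([1.0, 2.0, -0.5])   # (beta_0, beta_1, beta_2)
u = rng.standard_normal(n)
y = X @ beta + u                    # the model y = X beta + u in one line

print(X.shape, y.shape)
```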
Regression: Not Just for Causality
- So far, we’ve focused on using regression to study causal relationships.
- But regression is also widely used for prediction.
- What is the best way to predict \(Y\) using \(X\)?
- This question does not require any causal assumption — only that \((Y, X)\) have a joint distribution.
Optimal Prediction: The Conditional Expectation
The best predictor of \(Y\) given \(X\) — in the mean squared error sense — is:
\[
f^*(X) = E[Y \mid X]
\]
Why? For any predictor \(f(X)\):
\[
E[(Y - f(X))^2] = E[(Y - E[Y \mid X])^2] + E[(E[Y \mid X] - f(X))^2]
\]
The first term does not depend on \(f\). The second term is minimized at \(f(X) = E[Y \mid X]\).
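The decomposition can be verified numerically (a sketch using a simulated example where the CEF is known by construction; the competing predictor \(f\) is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x = rng.standard_normal(n)
eps = rng.normal(0, 1, n)
y = x**2 + eps                 # by construction, E[Y|X] = X^2

cef = x**2                     # the optimal predictor
f = 1 + 0.5 * x                # an arbitrary competing predictor

mse_f = np.mean((y - f) ** 2)
term1 = np.mean((y - cef) ** 2)   # irreducible error, ~Var(eps) = 1
term2 = np.mean((cef - f) ** 2)   # approximation error of f

print(mse_f, term1 + term2)    # the two agree up to sampling error
```

The cross term vanishes on average because \(Y - E[Y \mid X]\) is mean-independent of any function of \(X\), which is the Law of Iterated Expectations at work again.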
Regression as Linear Prediction
- The conditional expectation function (CEF) \(E[Y \mid X]\) can be any function of \(X\) — we may not know its form.
- If we restrict to linear predictors, regression gives the best linear approximation to the CEF — no assumption on \(E[U \mid X]\) needed.
- Under ZCM, the best linear predictor equals the CEF.
- Without ZCM, it is still the best linear approximation.
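A small simulated example (illustrative numbers only): even with a nonlinear CEF \(E[Y \mid X] = X^2\), the best linear predictor is well defined, with slope \(\text{Cov}(X, Y)/\text{Var}(X)\) and intercept \(E[Y] - \text{slope} \cdot E[X]\).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
x = rng.uniform(0, 2, n)
y = x**2 + rng.normal(0, 0.5, n)   # nonlinear CEF: E[Y|X] = X^2

# Best linear predictor coefficients from the joint distribution:
b1 = np.cov(x, y, ddof=0)[0, 1] / x.var()
b0 = y.mean() - b1 * x.mean()
print(b0, b1)   # slope ~ 2, intercept ~ -2/3 for this example
```

The line \(b_0 + b_1 X\) does not equal the CEF anywhere except where the two curves cross, yet it is the MSE-minimizing choice among all linear functions of \(X\).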
Summary
The linear regression model under ZCM:
\[
Y = \beta_0 + \beta_1 X + U, \qquad E[U \mid X] = 0
\]
- \(\beta_1\) measures how \(E[Y \mid X]\) changes with \(X\) — a statistical relationship
- Causal interpretation additionally requires structural assumptions (e.g., random assignment)
- MLR extends this to \(k\) regressors; with controls, we can identify causal effects under conditional independence
- Without ZCM, regression still provides the best linear predictor — minimizing mean squared error among linear functions of \(X\).
Next: how do we estimate \(\beta_0, \beta_1, \ldots, \beta_k\) from data?
What’s Next?
Lecture 2b — Estimation:
- Ordinary Least Squares (OLS): derivation and algebraic properties
- Fitted values, residuals, goodness of fit (\(R^2\))
- Frisch-Waugh-Lovell theorem: partialling out
- OLS in matrix form
Lecture 2c — Estimator Properties:
- Unbiasedness, variance, Gauss-Markov theorem
- When does OLS give a causal estimate?