The Linear Regression Model

Natasha Kang

Xiamen University, Chow Institute

March 2026

What Does Regression Do?

  • We want to understand how an outcome \(Y\) relates to one or more explanatory variables \(X\).
  • Linear regression is the workhorse of applied econometrics.
  • Two uses:
    1. Causal inference: estimate the effect of \(X\) on \(Y\), holding other factors fixed
    2. Prediction: forecast \(Y\) given observed \(X\)

A Brief History: Galton and “Regression”

  • The term regression comes from Francis Galton (1886).
  • Galton studied the heights of parents and their adult children.
  • He found: children of tall parents tend to be tall — but not as tall as their parents.
  • Children of short parents tend to be short — but not as short.
  • Galton called this “regression towards mediocrity” — extreme traits tend to be less extreme in the next generation.
  • The statistical method he used to study this became known as regression.

Model vs. Reality

Before writing down a regression model, let’s clarify what a model is — and what it isn’t.

In the real world, outcomes are driven by an underlying mechanism — how households respond to policy, how firms set prices, how individuals choose schooling.

  • We rarely observe the mechanism directly. Instead, it induces a population distribution over the variables we observe.
  • Our goal is to understand features of this population — and for that, we build statistical models.
  • A statistical model is a simplified structure we impose — it is not the mechanism, and it is not the full population distribution.
  • It captures only the aspects we choose to model, leaving out much of the real-world complexity.

From Mechanism to Model

  Level                    What it is
  Mechanism                Structural equations, behavioral rules, counterfactuals — unobservable
  Population distribution  The joint distribution of observables \((Y, X)\) — induced by the mechanism
  Statistical model        Our chosen approximating structure


  • Each level is a simplification of the one above.
  • To connect our model to the population, we need assumptions — and the assumptions determine what the model can tell us.

The term “data generating process” (DGP) is often used informally to refer to either the mechanism or the population distribution.

The Simple Linear Regression Model

We specify the following linear regression model:

\[ Y = \beta_0 + \beta_1 X + U \]

  • \(Y\): dependent variable (outcome)
  • \(X\): independent variable (regressor, explanatory variable)
  • \(\beta_0, \beta_1\): unknown parameters
  • \(U\): error term (disturbance) — all other factors affecting \(Y\)
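The model can be made concrete with a small simulation. This is a minimal sketch: the parameter values and distributions below are arbitrary illustrative choices, not part of the lecture's setup.

```python
import numpy as np

# Simulate the model Y = beta0 + beta1 * X + U with made-up parameters.
rng = np.random.default_rng(0)
beta0, beta1 = 1.0, 2.0
n = 100_000

X = rng.normal(loc=5.0, scale=2.0, size=n)   # regressor (assumed distribution)
U = rng.normal(loc=0.0, scale=1.0, size=n)   # error term, independent of X
Y = beta0 + beta1 * X + U                    # outcome

# With E[U] = 0 and U independent of X, the mean of Y is close to
# beta0 + beta1 * E[X] = 1 + 2 * 5 = 11.
print(Y.mean())
```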

Systematic and Unsystematic Components

\[ Y = \underbrace{\beta_0 + \beta_1 X}_{\text{systematic part}} + \underbrace{U}_{\text{unsystematic part}} \]

  • The systematic part \(\beta_0 + \beta_1 X\) captures the relationship between \(X\) and \(Y\) we choose to model.
  • The unsystematic part \(U\) captures everything else: unobserved factors, measurement error, inherent randomness.

What Does the Model Target?

\[ Y = \beta_0 + \beta_1 X + U \]

  • The model alone doesn’t tell us what \(\beta_0\) and \(\beta_1\) represent.
  • We need statistical assumptions to connect the model to a feature of the population distribution.
  • Different assumptions target different features:
    • If \(E[U \mid X] = 0\): the model describes the conditional mean of \(Y\) given \(X\)
    • If the \(\tau\)-th quantile of \(U \mid X\) is zero: the model describes the conditional \(\tau\)-th quantile

The Zero Conditional Mean Assumption

We focus on the conditional mean. The key assumption is:

\[ E[U \mid X] = 0 \]

In words: conditional on knowing \(X\), the error \(U\) has mean zero.

  • Equivalently: the unobserved factors in \(U\) are, on average, unrelated to \(X\).
  • Two consequences by the Law of Iterated Expectations:

\[ \begin{aligned} E[U] &= E\bigl[E[U \mid X]\bigr] = 0 \\[6pt] \text{Cov}(X, U) &= E[XU] = E\bigl[X \cdot E[U \mid X]\bigr] = 0 \end{aligned} \]

  • Note: \(E[U \mid X] = 0\) is stronger than \(\text{Cov}(X,U) = 0\) — it rules out any nonlinear dependence.
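A quick simulation makes the gap between the two conditions visible. This is an illustrative example, not from the lecture: with \(X \sim N(0,1)\) and \(U = X^2 - 1\), \(\text{Cov}(X, U) = E[X^3] - E[X]E[X^2] = 0\), yet \(E[U \mid X] = X^2 - 1 \neq 0\).

```python
import numpy as np

# Cov(X, U) = 0 does NOT imply E[U | X] = 0: U depends on X nonlinearly.
rng = np.random.default_rng(0)
n = 200_000

X = rng.normal(size=n)
U = X**2 - 1                       # mean zero and uncorrelated with X

print(np.cov(X, U)[0, 1])          # close to 0
print(U[np.abs(X) > 1.5].mean())   # conditional mean far from 0
```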

The Population Regression Function

Under \(E[U \mid X] = 0\):

\[ E[Y \mid X] = \beta_0 + \beta_1 X \]

This is the Population Regression Function (PRF): the conditional mean of \(Y\) given \(X\).

Interpreting the PRF

\[ E[Y \mid X] = \beta_0 + \beta_1 X \]

  • \(\beta_0 = E[Y \mid X = 0]\): the average outcome when \(X = 0\)
  • \(\beta_1 = \frac{\partial}{\partial x} E[Y \mid X = x]\): the change in the average \(Y\) per unit increase in \(X\)
  • This is a population average statement — not about any individual.
  • Individual \(Y_i\) may deviate from the PRF by \(U_i\).

Association Is Not Causation

Under ZCM, the PRF tells us:

“A one-unit increase in \(X\) is associated with a \(\beta_1\)-unit change in \(E[Y \mid X]\).”

  • This describes how the conditional mean varies with \(X\) — a feature of the population distribution.
  • But does this mean changing \(X\) would cause \(Y\) to change?
  • Not necessarily. The PRF is a statistical relationship — it does not tell us what would happen if we intervened on \(X\).

When Is Regression Causal?

\[ Y = \alpha + \tau D + U, \qquad D \in \{0,1\}, \qquad E[U \mid D] = 0 \]

  • Under ZCM, \(\tau\) equals the naive comparison: \(\tau = E[Y \mid D=1] - E[Y \mid D=0]\).
  • We already know how to decompose this:

\[ \tau = \underbrace{E[Y(1) - Y(0) \mid D=1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}} \]

  • ZCM alone: \(\tau\) is a conditional mean difference — not causal.
  • If \(E[Y(0) \mid D=1] = E[Y(0) \mid D=0]\): \(\tau = \text{ATT}\).
  • If \(D \perp\!\!\!\perp (Y(0), Y(1))\): \(\tau = \text{ATE}\).
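The decomposition can be checked in its sample analogue. The DGP below (units with high \(Y(0)\) are more likely to take treatment) is an arbitrary illustration, not from the lecture; the decomposition itself is an identity, so it holds exactly.

```python
import numpy as np

# Naive comparison = ATT + selection bias, verified in sample means.
rng = np.random.default_rng(0)
n = 100_000

Y0 = rng.normal(size=n)               # potential outcome without treatment
Y1 = Y0 + 1.0                         # constant treatment effect of 1
D = (Y0 + rng.normal(size=n) > 0)     # selection: high-Y(0) units opt in

naive = Y1[D].mean() - Y0[~D].mean()  # E[Y | D=1] - E[Y | D=0]
att = (Y1[D] - Y0[D]).mean()          # ATT (here exactly 1 by construction)
selection_bias = Y0[D].mean() - Y0[~D].mean()

print(naive, att, selection_bias)     # naive = att + selection_bias
```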

Statistical and Structural Assumptions

  • \(E[U \mid D] = 0\) is a statistical assumption — it connects the model to the population distribution.
    • Can sometimes be assessed indirectly: specification tests, robustness checks.
  • \(E[Y(0) \mid D\!=\!1] = E[Y(0) \mid D\!=\!0]\) or \(D \perp\!\!\!\perp (Y(0), Y(1))\) are structural assumptions — they connect the population distribution to the mechanism.
    • Fundamentally untestable — they involve unobserved counterfactuals.
  • \(D \perp\!\!\!\perp (Y(0), Y(1))\) is strong enough to imply \(E[U \mid D] = 0\). (Why?)

Example: Wages and Education

\[ \text{wage} = \beta_0 + \beta_1 \, \text{educ} + U \]

  • Under ZCM, \(\beta_1\) is the average wage difference between workers who differ by one year of education.
  • \(U\) includes ability, family background, experience, …
  • More educated workers tend to have higher ability.
  • If ability is in \(U\) and correlated with education, ZCM fails — \(\beta_1\) does not even describe the conditional mean.
  • This motivates multiple regression: include confounders directly in the model.
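The failure of ZCM through omitted ability can be simulated. All coefficients below are made-up illustrative values; the point is that the short regression slope drifts away from the true education coefficient by the omitted-variable bias term.

```python
import numpy as np

# Ability raises wages directly AND is correlated with education, so the
# regression of wage on educ alone overstates the education coefficient.
rng = np.random.default_rng(0)
n = 100_000

ability = rng.normal(size=n)
educ = 12 + 2 * ability + rng.normal(size=n)   # educ correlated with ability
wage = 1 + 0.08 * educ + 0.5 * ability + rng.normal(scale=0.5, size=n)

slope_short = np.polyfit(educ, wage, 1)[0]
# OVB formula: slope_short -> 0.08 + 0.5 * Cov(educ, ability) / Var(educ)
#                           = 0.08 + 0.5 * 2 / 5 = 0.28
print(slope_short)
```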

The Multiple Linear Regression Model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + U \]

  • \(k\) regressors: \(X_1, X_2, \ldots, X_k\)
  • \(\beta_0, \beta_1, \ldots, \beta_k\): unknown parameters
  • \(U\): error term — all other factors affecting \(Y\)

Example: Wages, Education, and Experience

\[ \text{wage} = \beta_0 + \beta_1 \, \text{educ} + \beta_2 \, \text{exper} + U \]

  • By including experience, we hold it fixed when comparing workers with different education levels.
  • \(\beta_1\): average wage difference between workers with the same experience but one more year of education.
  • Without experience in the model, \(\beta_1\) mixes the effect of education with the effect of experience (since the two are correlated).
  • Including experience removes this confounding — but other omitted variables (ability, family background, …) may remain in \(U\).

Control for Confounders

A confounder is a variable that affects both the treatment and the outcome — omitting it biases the estimated relationship.

Example: Does school spending improve test scores?

  • Family income affects both spending and scores — it is a confounder.
  • If we omit it, \(\beta_1\) in \(\text{avgscore} = \beta_0 + \beta_1 \, \text{expend} + U\) mixes the effect of spending with the effect of income.
  • Including the confounder as a control:

\[ \text{avgscore} = \beta_0 + \beta_1 \, \text{expend} + \beta_2 \, \text{avginc} + U \]

Now \(\beta_1\) compares schools with the same average income but different spending.
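A sketch with made-up coefficients: when average income drives both spending and scores, the short regression is biased, while including income as a control recovers the spending coefficient. (Estimation details come in the next lecture; here least squares is used only to illustrate the point.)

```python
import numpy as np

# Confounded DGP: avginc affects both expend and avgscore.
rng = np.random.default_rng(0)
n = 100_000

avginc = rng.normal(50, 10, size=n)
expend = 2 + 0.1 * avginc + rng.normal(size=n)   # richer districts spend more
avgscore = 10 + 3 * expend + 0.5 * avginc + rng.normal(size=n)

# Long regression: avgscore on [1, expend, avginc] by least squares.
X = np.column_stack([np.ones(n), expend, avginc])
beta = np.linalg.lstsq(X, avgscore, rcond=None)[0]

short_slope = np.polyfit(expend, avgscore, 1)[0]
print(short_slope, beta[1])   # short slope is biased; beta[1] is near 3
```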

Zero Conditional Mean in MLR

The key assumption generalizes:

\[ E[U \mid X_1, X_2, \ldots, X_k] = 0 \]

In words: conditional on all included regressors, the error has mean zero.

  • This means \(U\) is uncorrelated with every function of \((X_1, \ldots, X_k)\).
  • It requires that no relevant variable left in \(U\) is correlated with any \(X_j\) — otherwise we have omitted variable bias.

The PRF in MLR

Under \(E[U \mid X_1, \ldots, X_k] = 0\):

\[ E[Y \mid X_1, \ldots, X_k] = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k \]

  • The PRF is now a hyperplane in \((X_1, \ldots, X_k, Y)\) space.
  • Each \(\beta_j\) traces out how the conditional mean of \(Y\) changes as \(X_j\) varies, with all other \(X\)’s fixed.

Recovering Causal Effects with Controls

With multiple regressors, we can control for confounders and revisit the causal question.

\[ Y = \alpha_0 + \tau D + \gamma W + U \]

  • Under \(E[U \mid D, W] = 0\), \(\tau\) is the conditional average predictive effect (CAPE):

\[ \tau = E[Y \mid D=1, W] - E[Y \mid D=0, W] \]

  • This is a statistical quantity — the mean difference between treated and untreated, among units with the same \(W\).
  • If additionally \(D \perp\!\!\!\perp (Y(0), Y(1)) \mid W\) (conditional independence), then \(\tau\) identifies the conditional average treatment effect (CATE):

\[ \tau = E[Y(1) \mid W] - E[Y(0) \mid W] \]
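An illustrative check with a made-up DGP: treatment probability depends on a binary confounder \(W\), and the treatment effect is a constant 2. The naive comparison is biased upward, but within each value of \(W\) the mean difference recovers the effect.

```python
import numpy as np

# Conditional independence: D depends on Y(0) only through W, so
# comparing treated and untreated WITHIN W removes the bias.
rng = np.random.default_rng(0)
n = 200_000

W = rng.integers(0, 2, size=n)            # binary confounder
D = rng.random(n) < 0.3 + 0.4 * W         # treated more often when W = 1
Y0 = W + rng.normal(size=n)               # Y(0) also depends on W
Y1 = Y0 + 2.0                             # constant effect of 2
Y = np.where(D, Y1, Y0)

naive = Y[D].mean() - Y[~D].mean()
within = [Y[D & (W == w)].mean() - Y[~D & (W == w)].mean() for w in (0, 1)]
print(naive, within)   # naive exceeds 2; each within-W difference is near 2
```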

Multivalued Treatment

Many treatments vary in intensity — not just “on” or “off” (e.g., years of schooling, hours of training).

\[ Y = \alpha + \tau S + U, \qquad E[U \mid S] = 0 \]

  • \(S\): treatment level (discrete or continuous)
  • Under ZCM, \(\tau\) is the average difference in \(Y\) per unit increase in \(S\).
  • Let \(Y(s)\) be the potential outcome at treatment level \(s\). Under \(S \perp\!\!\!\perp Y(s)\) for all \(s\):

\[ \tau = E[Y(s) - Y(s-1)] \]

the average causal effect of a one-unit increase in treatment.
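To see why, sketched in one line: under \(S \perp\!\!\!\perp Y(s)\) for all \(s\) (and \(Y = Y(S)\)),

\[ E[Y \mid S = s] = E[Y(s) \mid S = s] = E[Y(s)] \]

so the linear model's slope satisfies

\[ \tau = E[Y \mid S = s] - E[Y \mid S = s - 1] = E[Y(s)] - E[Y(s-1)] = E[Y(s) - Y(s-1)] \]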

FYI: Matrix Notation

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \]

where

\[ \mathbf{y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ \vdots & & & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \]

  • We will use this notation when it is convenient to do so.

Regression: Not Just for Causality

  • So far, we’ve focused on using regression to study causal relationships.
  • But regression is also widely used for prediction.
  • What is the best way to predict \(Y\) using \(X\)?
  • This question does not require any causal assumption — only that \((Y, X)\) have a joint distribution.

Optimal Prediction: The Conditional Expectation

The best predictor of \(Y\) given \(X\) — in the mean squared error sense — is:

\[ f^*(X) = E[Y \mid X] \]

Why? For any predictor \(f(X)\):

\[ E[(Y - f(X))^2] = E[(Y - E[Y \mid X])^2] + E[(E[Y \mid X] - f(X))^2] \]

The first term does not depend on \(f\). The second term is minimized at \(f(X) = E[Y \mid X]\).
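The cross term in this decomposition vanishes by the Law of Iterated Expectations: conditioning on \(X\), the factor \(E[Y \mid X] - f(X)\) is a function of \(X\) and pulls out of the inner expectation,

\[ E\bigl[(Y - E[Y \mid X])(E[Y \mid X] - f(X))\bigr] = E\Bigl[(E[Y \mid X] - f(X)) \cdot \underbrace{E\bigl[Y - E[Y \mid X] \,\big|\, X\bigr]}_{=\,0}\Bigr] = 0 \]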

Regression as Linear Prediction

  • The CEF \(E[Y \mid X]\) can be any function of \(X\) — we may not know its form.
  • If we restrict to linear predictors, regression gives the best linear approximation to the CEF — no assumption on \(E[U \mid X]\) needed.
  • Under ZCM, the best linear predictor equals the CEF.
  • Without ZCM, it is still the best linear approximation.
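A simulation makes this concrete. Made-up DGP, not from the lecture: the CEF here is \(E[Y \mid X] = X^2\), which is nonlinear, yet the fitted line still converges to the best-linear-predictor coefficients \(\text{Cov}(X,Y)/\text{Var}(X)\) and \(E[Y] - \text{slope} \cdot E[X]\). For \(X \sim \text{Uniform}(0,1)\) these work out to slope \(= (1/4 - 1/6)/(1/12) = 1\) and intercept \(= 1/3 - 1/2 = -1/6\).

```python
import numpy as np

# Nonlinear CEF, linear fit: regression recovers the BLP, not the CEF.
rng = np.random.default_rng(0)
n = 200_000

X = rng.random(n)                          # Uniform(0, 1)
Y = X**2 + rng.normal(scale=0.1, size=n)   # CEF is E[Y | X] = X^2

slope, intercept = np.polyfit(X, Y, 1)
print(slope, intercept)    # close to the BLP values 1 and -1/6
```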

Summary

The linear regression model under ZCM:

\[ Y = \beta_0 + \beta_1 X + U, \qquad E[U \mid X] = 0 \]

  • \(\beta_1\) measures how \(E[Y \mid X]\) changes with \(X\) — a statistical relationship
  • Causal interpretation additionally requires structural assumptions (e.g., random assignment)
  • MLR extends this to \(k\) regressors; with controls, we can identify causal effects under conditional independence
  • Without ZCM, regression still provides the best linear predictor — minimizing mean squared error among linear functions of \(X\).

Next: how do we estimate \(\beta_0, \beta_1, \ldots, \beta_k\) from data?

What’s Next?

Lecture 2b — Estimation:

  • Ordinary Least Squares (OLS): derivation and algebraic properties
  • Fitted values, residuals, goodness of fit (\(R^2\))
  • Frisch-Waugh-Lovell theorem: partialling out
  • OLS in matrix form

Lecture 2c — Estimator Properties:

  • Unbiasedness, variance, Gauss-Markov theorem
  • When does OLS give a causal estimate?