Xiamen University, Chow Institute
April, 2026
| | Prediction | Causality |
|---|---|---|
| Question | “What will happen?” | “What would happen if…?” |
| Focus | Accuracy | Isolation of effects |
| Needs | Correlates | Exogenous variation |
| Metric | MSE | Unbiasedness |
Recall (lect2a): the CEF \(E[y \mid \mathbf{x}]\) is the best predictor of \(y\) given \(\mathbf{x}\) under MSE.
But we don’t know the CEF — we estimate it:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k \]
The prediction error \(y_{\text{new}} - \hat{y}_{\text{new}}\) comes from two sources:
Adding regressors can reduce (1) but worsen (2). How do we navigate this tradeoff?
Example: predicting GDP growth.
\[ \text{GDP}_t = \alpha + \beta_1\, \text{Investment}_t + \beta_2\, \text{Innovation}_t + u_t \]
This is the bias-variance tradeoff — the central idea of predictive model selection.
True relationship: \(y = f(\mathbf{x}) + u\), where \(E[u \mid \mathbf{x}] = 0\) and \(\text{Var}(u \mid \mathbf{x}) = \sigma^2\).
We fit a model \(\hat{y}(\mathbf{x})\) to training data and use it to predict a new observation \((y_{\text{new}}, \mathbf{x}_{\text{new}})\). Our prediction error is:
\[ y_{\text{new}} - \hat{y}(\mathbf{x}_{\text{new}}) = \underbrace{f(\mathbf{x}_{\text{new}}) - \hat{y}(\mathbf{x}_{\text{new}})}_{\text{model error}} + \underbrace{u_{\text{new}}}_{\text{noise}} \]
Since \(u_{\text{new}}\) is independent of \(\hat{y}\) and has mean zero, the cross term vanishes when we square and take expectations:
\[ \text{MSE}_{\text{out}} = E[(y_{\text{new}} - \hat{y}(\mathbf{x}_{\text{new}}))^2] = \underbrace{E[(\hat{y}(\mathbf{x}_{\text{new}}) - f(\mathbf{x}_{\text{new}}))^2]}_{\text{model error}} + \underbrace{\sigma^2}_{\text{irreducible error}} \]
We cannot reduce \(\sigma^2\) — it is the inherent noise in \(y\). The question is how to minimize model error.
Fix a point \(\mathbf{x}\) and consider the model error \(\hat{y}(\mathbf{x}) - f(\mathbf{x})\). The only randomness is in \(\hat{y}\) (which depends on the training sample). Add and subtract \(E[\hat{y}]\):
\[ \hat{y} - f(\mathbf{x}) = \bigl(\hat{y} - E[\hat{y}]\bigr) + \bigl(E[\hat{y}] - f(\mathbf{x})\bigr) \]
Square and take expectations over the training sample. Since \(E[\hat{y}]\) and \(f(\mathbf{x})\) are both non-random (for fixed \(\mathbf{x}\)), the second term is a constant \(b \equiv E[\hat{y}] - f(\mathbf{x})\):
\[ E[(\hat{y} - f(\mathbf{x}))^2] = E[(\hat{y} - E[\hat{y}])^2] + 2b\,\underbrace{E[\hat{y} - E[\hat{y}]]}_{= 0} + b^2 \]
The cross term vanishes, leaving:
\[ E[(\hat{y} - f(\mathbf{x}))^2] = \underbrace{E[(\hat{y} - E[\hat{y}])^2]}_{\text{Variance}} + \underbrace{(E[\hat{y}] - f(\mathbf{x}))^2}_{\text{Bias}^2} \]
Putting it together:
\[ \text{MSE}_{\text{out}} = \underbrace{(E[\hat{y}] - f(\mathbf{x}))^2}_{\text{Bias}^2} + \underbrace{E[(\hat{y} - E[\hat{y}])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}} \]
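The decomposition can be checked numerically. The sketch below is only an illustration under assumed ingredients that are not part of the lecture: a true function \(f(x) = \sin(2\pi x)\), noise standard deviation 0.3, polynomial fits of degree 1, 3 and 7, and a fixed evaluation point \(x_0 = 0.25\). It redraws the training sample many times and computes bias², variance and model error of \(\hat{y}(x_0)\) across draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True regression function (an assumption made for this illustration)."""
    return np.sin(2 * np.pi * x)

sigma = 0.3      # sd of the noise u (assumed)
n_train = 30     # size of each training sample (assumed)
n_reps = 2000    # number of simulated training samples
x0 = 0.25        # fixed point x at which the prediction is evaluated

def decompose(degree):
    """Return (bias^2, variance, model error) of a polynomial fit at x0."""
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        fit = np.polynomial.Polynomial.fit(x, y, degree)
        preds[r] = fit(x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()                        # E[(yhat - E[yhat])^2]
    model_error = np.mean((preds - f(x0)) ** 2)   # E[(yhat - f(x))^2]
    return bias_sq, variance, model_error

results = {d: decompose(d) for d in (1, 3, 7)}
for d, (b2, var, err) in results.items():
    print(f"degree {d}: bias^2={b2:.4f}  variance={var:.4f}  "
          f"MSE_out={err + sigma ** 2:.4f}")
```

Within each row, bias² plus variance reproduces the model error up to floating-point rounding, which is the identity derived above. Comparing rows shows how the two components move as the degree changes.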
An optimal complexity exists — but we cannot observe bias and variance separately. What does overfitting look like in practice?
The degree-12 model has near-zero in-sample error — but terrible out-of-sample performance. We need a way to measure the difference.
\[ \text{MSE}_{\text{in}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
\[ \text{MSE}_{\text{out}} = E[(y_{\text{new}} - \hat{y}_{\text{new}})^2] \]
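The gap between the two quantities can be made concrete with a simulation, where fresh data are available by construction. This is a sketch under assumptions that are not from the lecture: a true function \(\sin(2\pi x)\), noise sd 0.3, 20 training points, and a large fresh sample standing in for the expectation in \(\text{MSE}_{\text{out}}\).

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)   # assumed true function

def draw(n):
    """Draw n observations (x, y) from the assumed data-generating process."""
    x = rng.uniform(0.0, 1.0, n)
    return x, f(x) + rng.normal(0.0, 0.3, n)

x_train, y_train = draw(20)        # small training sample
x_new, y_new = draw(5000)          # large fresh sample approximates MSE_out

def mse_in_out(degree):
    """Fit a polynomial on the training data; return (MSE_in, MSE_out)."""
    fit = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    mse_in = np.mean((y_train - fit(x_train)) ** 2)
    mse_out = np.mean((y_new - fit(x_new)) ** 2)
    return mse_in, mse_out

mse = {d: mse_in_out(d) for d in (1, 3, 12)}
for d, (m_in, m_out) in mse.items():
    print(f"degree {d:2d}: MSE_in={m_in:.4f}  MSE_out={m_out:.4f}")
```

In-sample error can only fall as the degree rises, because the models are nested. Out-of-sample error has no such guarantee.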
We cannot observe \(\text{MSE}_{\text{out}}\) directly. Two practical approaches:
Holdout method: split data into training set and test set.
\(k\)-fold cross-validation: split data into \(k\) subsets (folds).
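A minimal sketch of \(k\)-fold cross-validation, written from scratch with NumPy. The function name `k_fold_mse`, the polynomial model class and the simulated data are all assumptions for illustration. Each fold is held out once, the model is fit on the remaining folds, and the held-out squared errors are averaged.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Estimate out-of-sample MSE of a polynomial fit by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k disjoint index sets
    fold_mse = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False                     # hold this fold out
        fit = np.polynomial.Polynomial.fit(x[train_mask], y[train_mask], degree)
        fold_mse.append(np.mean((y[test_idx] - fit(x[test_idx])) ** 2))
    return float(np.mean(fold_mse))

# Simulated data (assumed data-generating process, for illustration only).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

cv = {d: k_fold_mse(x, y, d) for d in (1, 3, 12)}
for d, value in cv.items():
    print(f"degree {d:2d}: CV estimate of MSE_out = {value:.4f}")
```

The holdout method is the special case with a single train/test split. Averaging over \(k\) folds uses every observation for testing exactly once.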
An alternative: instead of splitting the data, penalize in-sample fit for model complexity. This is the idea behind adjusted \(R^2\), AIC, and BIC.
\(R^2\) never decreases when regressors are added, even useless ones. The adjusted \(R^2\) penalizes for model complexity:
\[ \bar{R}^2 = 1 - \frac{SSR/(n - k - 1)}{SST/(n - 1)} \]
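As a quick check of how the penalty works, the sketch below computes \(R^2\) and \(\bar{R}^2\) with and without three regressors that are pure noise. The data are simulated and the helper name `r2_and_adjusted` is hypothetical; \(k\) counts regressors excluding the intercept, as in the formula above.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """OLS of y on X plus an intercept. Returns (R^2, adjusted R^2)."""
    n, k = X.shape                                # k excludes the intercept
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    ssr = np.sum((y - Z @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ssr / sst
    adj = 1.0 - (ssr / (n - k - 1)) / (sst / (n - 1))
    return r2, adj

rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
junk = rng.normal(size=(n, 3))                    # regressors unrelated to y
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

r2_small, adj_small = r2_and_adjusted(x1[:, None], y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x1, junk]), y)
print(f"x1 only:     R2={r2_small:.3f}  adjusted={adj_small:.3f}")
print(f"x1 + 3 junk: R2={r2_big:.3f}  adjusted={adj_big:.3f}")
```

\(R^2\) cannot fall when the junk regressors are added. \(\bar{R}^2\) can move in either direction, depending on whether the drop in SSR outweighs the lost degrees of freedom.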
Consider explaining R&D spending as a share of sales (\(\textit{rdintens}\)):
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| \(\log(\textit{sales})\) | \(-0.673\) | \(-1.117\) | \(-0.571\) |
| \(\textit{profmarg}\) | | \(0.174\) | \(0.175\) |
| \(\log(\textit{sales})^2\) | | | \(-0.037\) |
| \(R^2\) | \(0.061\) | \(0.148\) | \(0.155\) |
| \(\bar{R}^2\) | \(0.030\) | \(0.091\) | \(0.066\) |
Can we use \(\bar{R}^2\) to choose between \(y\) and \(\log(y)\) as the dependent variable?
No. \(\bar{R}^2\) measures the proportion of variation explained relative to the total variation in the dependent variable. Since \(\text{Var}(y) \neq \text{Var}(\log(y))\), the SST differs across models and the \(\bar{R}^2\) values are not comparable.
A likelihood-based approach: penalize the log-likelihood for model complexity. Here \(k\) is the total number of estimated parameters (including the intercept), unlike in \(\bar{R}^2\), where \(k\) counts only the slope coefficients.
AIC (Akaike):
\[ \text{AIC} = -2 \log \hat{L} + 2k \]
BIC (Bayesian / Schwarz):
\[ \text{BIC} = -2 \log \hat{L} + k \log n \]
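For a linear regression with Gaussian errors, the maximized log-likelihood has the closed form \(\log \hat{L} = -\tfrac{n}{2}\bigl(\log 2\pi + \log(SSR/n) + 1\bigr)\), so both criteria can be computed from the SSR. The sketch below assumes Gaussian errors and counts \(k\) as slopes plus intercept. Whether \(\sigma^2\) is also counted is a convention that shifts every model's value by the same amount. The data and the function name are hypothetical.

```python
import numpy as np

def gaussian_aic_bic(X, y):
    """AIC and BIC of an OLS fit under Gaussian errors; X excludes the intercept."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    ssr = np.sum((y - Z @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1.0)
    k = Z.shape[1]                     # slopes + intercept (sigma^2 not counted)
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + k * np.log(n)
    return aic, bic

# Simulated data whose true relationship is quadratic (an assumption).
rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 1.0 + x + x ** 2 + rng.normal(size=n)

aic_lin, bic_lin = gaussian_aic_bic(x[:, None], y)
aic_quad, bic_quad = gaussian_aic_bic(np.column_stack([x, x ** 2]), y)
print(f"linear:    AIC={aic_lin:.1f}  BIC={bic_lin:.1f}")
print(f"quadratic: AIC={aic_quad:.1f}  BIC={bic_quad:.1f}")
```

Lower values are preferred. Since \(\log n > 2\) once \(n \ge 8\), BIC charges more per parameter than AIC in all but the smallest samples.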
The Regression Equation Specification Error Test (RESET; Ramsey, 1969) checks whether the model misses nonlinearities.
Idea: if the model \(y = \mathbf{x}\boldsymbol{\beta} + u\) is correctly specified, powers of the fitted values \(\hat{y}^2, \hat{y}^3\) should have no additional explanatory power.
Procedure:
Housing prices: \(\log(\textit{price}) = \beta_0 + \beta_1\, \textit{sqrft} + \beta_2\, \textit{bdrms} + u\).
Step 1: Estimate the model. Obtain \(\hat{y}_i\).
Step 2: Auxiliary regression:
\[ \log(\textit{price}_i) = \gamma_0 + \gamma_1\, \textit{sqrft}_i + \gamma_2\, \textit{bdrms}_i + \alpha_2 \hat{y}_i^2 + \alpha_3 \hat{y}_i^3 + u_i \]
Step 3: \(F\)-test of \(H_0: \alpha_2 = \alpha_3 = 0\).
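The three steps can be sketched as follows. The housing data are not reproduced here, so the example uses simulated regressors `x1` and `x2` in their place (an assumption), and it returns only the \(F\) statistic with its degrees of freedom. Obtaining a p-value would need an \(F\) distribution function from a statistics library.

```python
import numpy as np

def ols(Z, y):
    """Least-squares fit; returns (fitted values, sum of squared residuals)."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ beta
    return fitted, np.sum((y - fitted) ** 2)

def reset_f(X, y):
    """RESET F statistic with yhat^2 and yhat^3 added; X excludes the intercept."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    yhat, ssr_r = ols(Z, y)                               # Step 1: original model
    Z_aux = np.column_stack([Z, yhat ** 2, yhat ** 3])    # Step 2: auxiliary regression
    _, ssr_ur = ols(Z_aux, y)
    q = 2                                                 # restrictions: alpha_2 = alpha_3 = 0
    df = n - k - 1 - q                                    # residual df of the auxiliary regression
    f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / df)       # Step 3: F statistic
    return f_stat, q, df

rng = np.random.default_rng(5)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([x1, x2])
u = rng.normal(size=n)
y_linear = 1.0 + 0.5 * x1 - 0.3 * x2 + u          # correctly specified
y_curved = y_linear + 0.8 * x1 ** 2               # linear model misses x1^2

f_ok, q, df = reset_f(X, y_linear)
f_bad, _, _ = reset_f(X, y_curved)
print(f"linear truth: F({q},{df}) = {f_ok:.2f}")
print(f"curved truth: F({q},{df}) = {f_bad:.2f}")
```

A large statistic signals that something nonlinear is missing, but it does not say which term to add.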
Recall (lect3a): to compare a restricted model against an unrestricted model, use the \(F\)-test.
Example: Is a quadratic wage model better than a linear one?
\[ \text{Model 1}: \log(\textit{wage}) = \beta_0 + \beta_1\, \textit{exper} + u \]
\[ \text{Model 2}: \log(\textit{wage}) = \beta_0 + \beta_1\, \textit{exper} + \beta_2\, \textit{exper}^2 + u \]
\[ F = \frac{(SSR_r - SSR_{ur}) / q}{SSR_{ur} / (n - k - 1)} \sim F_{q, n-k-1} \]
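where \(q\) is the number of restrictions and \(k\) is the number of regressors in the unrestricted model. In this example \(q = 1\), since the only restriction is \(\beta_2 = 0\).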
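Because the restricted and unrestricted models share the same dependent variable, they share the same \(SST\), and \(SSR = SST\,(1 - R^2)\) in each. Substituting into the statistic above gives an equivalent form in terms of \(R^2\), which can be convenient when only \(R^2\) values are reported (taking \(k\) to be the number of regressors in the unrestricted model):

\[ F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n - k - 1)} \]

For the wage example, Model 1 is the restricted model and Model 2 the unrestricted one, so \(k = 2\) and the single restriction is \(\beta_2 = 0\). This form is not valid when the two models have different dependent variables.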
| Tool | What it does | Limitation |
|---|---|---|
| Adjusted \(R^2\) | Penalizes for \(k\) | Cannot compare different \(y\)’s |
| AIC / BIC | Penalizes log-likelihood | Requires same \(y\); BIC may underfit |
| \(F\)-test (lect3a) | Compares nested models | Requires nesting |
| RESET | Detects misspecification | Does not identify the fix |
All of these are tools for prediction. For causal questions, the criteria are fundamentally different.
For prediction, we want the model that minimizes MSE: we don’t care why a variable is correlated with the outcome, only that it helps forecast.
For causality, we want to isolate the effect of a treatment \(T\) on an outcome \(Y\). Including the wrong controls can introduce bias rather than remove it.
The question is no longer “does this variable improve fit?” but rather: does conditioning on this variable help identify the causal effect?
A firm wants to estimate the causal effect of a job training program on employee productivity.
Training is randomly assigned. The naive comparison is unbiased:
\[ E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0] = E[Y_{1i} - Y_{0i}] = \text{ATE} \]
Should we also control for post-training test scores? They are correlated with productivity — so they would improve prediction.
No.
Let \(B_i\) be an indicator for a high post-training test score (e.g., above the median). Conditioning on \(B_i\):
\[ \begin{aligned} &E[Y_i \mid B_i = 1, T_i = 1] - E[Y_i \mid B_i = 1, T_i = 0] \\ = \;& E[Y_{1i} \mid B_{1i} = 1] - E[Y_{0i} \mid B_{0i} = 1] \\ = \;& \underbrace{E[Y_{1i} - Y_{0i} \mid B_{1i} = 1]}_{\text{CATE}} + \underbrace{E[Y_{0i} \mid B_{1i} = 1] - E[Y_{0i} \mid B_{0i} = 1]}_{\text{selection bias}} \end{aligned} \]
Post-training test scores depend on both training and innate ability.
Among workers with the same test score:
A Directed Acyclic Graph (DAG) represents the causal structure:
Three fundamental building blocks:
| | Pattern | Role of \(M\) |
|---|---|---|
| Chain | \(X \to M \to Y\) | Mediator |
| Fork | \(X \leftarrow M \to Y\) | Common cause (confounder) |
| Collider | \(X \to M \leftarrow Y\) | Common effect |
In the job training example, Test Score (Post) is a collider:
\[ \text{Training} \to \text{Test Score} \leftarrow \text{Innate Ability} \]
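A small simulation illustrates the bias from conditioning on this collider. Everything numerical here is an assumption made for illustration: a constant treatment effect of 1, standard normal ability and noise, and a test score that rises one-for-one with both training and ability. With a constant effect, the CATE equals the ATE, so any gap in the conditioned comparison is selection bias.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
tau = 1.0                                        # true effect, same for everyone (assumed)

ability = rng.normal(size=n)                     # unobserved innate ability
T = rng.integers(0, 2, size=n)                   # randomly assigned training (0/1)
Y = ability + tau * T + rng.normal(size=n)       # productivity
score = ability + T + 0.5 * rng.normal(size=n)   # post-training test score (collider)
B = score > np.median(score)                     # high-score indicator

def diff_in_means(mask):
    """Treated-minus-control mean of Y within the subsample selected by mask."""
    return Y[mask & (T == 1)].mean() - Y[mask & (T == 0)].mean()

naive = diff_in_means(np.ones(n, dtype=bool))    # uses everyone
conditioned = diff_in_means(B)                   # only workers with B = 1

print(f"true effect:            {tau:.3f}")
print(f"naive comparison:       {naive:.3f}")
print(f"conditioned on B = 1:   {conditioned:.3f}")
```

Among high scorers, untrained workers needed higher ability to reach the same score, so the comparison within \(B_i = 1\) understates the effect even though training was randomized.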
How do we determine when to control for a variable — and when not to?
Let \(Z\) be the set of variables we control for in the regression. A path between \(X\) and \(Y\) is blocked (d-separated) by \(Z\) if:
If all paths between \(X\) and \(Y\) are blocked by \(Z\), then \(X \perp\!\!\!\perp Y \mid Z\): \(X\) and \(Y\) are conditionally independent given \(Z\).
For causal inference: choose \(Z\) so that all non-causal (back-door) paths between treatment and outcome are blocked, without opening collider paths.
Chain: Rain \(\to\) Wet Road \(\to\) Accident
Fork: Parental Education \(\to\) Parental Involvement; Parental Education \(\to\) Child Performance
Collider: Stress \(\to\) Blood Pressure \(\leftarrow\) Family History
Suppose controlling for \(M\) makes the coefficient on \(T\) insignificant. Can we conclude that \(M\) mediates the effect of \(T\) on \(Y\)?
No. Two different causal structures produce the same regression pattern:
In both cases, the coefficient on \(T\) shrinks or vanishes when \(M\) is included. The regression cannot distinguish the two — you need to know the causal structure (the DAG).
This is why theory matters: the decision to control for a variable cannot be made on statistical grounds alone.
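The point can be illustrated by simulating two structures and running the same pair of regressions on each. The choice of structures is an assumption for this sketch: a chain \(T \to M \to Y\), in which \(M\) mediates the entire effect, and a fork \(T \leftarrow M \to Y\), in which \(T\) has no effect on \(Y\) at all. All coefficients are set to 1 and all noise is standard normal.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

def coef_on_T(y, *regressors):
    """OLS with intercept; return the coefficient on the first regressor (T)."""
    Z = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[1]

def noise():
    return rng.normal(size=n)

# Structure A (chain): T -> M -> Y, so M mediates the whole effect of T.
T_a = noise()
M_a = T_a + noise()
Y_a = M_a + noise()

# Structure B (fork): T <- M -> Y, so T has no causal effect on Y.
M_b = noise()
T_b = M_b + noise()
Y_b = M_b + noise()

chain = (coef_on_T(Y_a, T_a), coef_on_T(Y_a, T_a, M_a))
fork = (coef_on_T(Y_b, T_b), coef_on_T(Y_b, T_b, M_b))
print(f"chain: without M = {chain[0]:.3f}, with M = {chain[1]:.3f}")
print(f"fork:  without M = {fork[0]:.3f}, with M = {fork[1]:.3f}")
```

In both designs the coefficient on \(T\) is clearly nonzero without \(M\) and close to zero with it. Yet the total causal effect of \(T\) is 1 in the chain and 0 in the fork.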
Rules of thumb:
Q1: Returns to college education.
Q2: Obesity and mortality.
Q3: Beer tax and traffic fatalities.
Q4: Attendance and final exam score.
Q5: Childhood nutrition and adult height.
The exercises above have no single “right” DAG — reasonable researchers can disagree.
Example. Does parental income affect child test scores…
When multiple stories are plausible, the DAG framework does not give a clear answer — and neither does any other approach.
The value of the DAG is not that it resolves the ambiguity, but that it forces you to state your assumptions explicitly rather than burying them in a regression specification.
Part IV — When Conditioning Isn’t Enough: