Model Selection & Controls

Natasha Kang

Xiamen University, Chow Institute

April, 2026

Two Goals of Econometrics

	Prediction	Causality
Question	“What will happen?”	“What would happen if…?”
Focus	Accuracy	Isolation of effects
Needs	Correlates	Exogenous variation
Metric	MSE	Unbiasedness

A good prediction model may be a bad causal model, and vice versa.
The tools for choosing variables differ depending on the goal.

The Prediction Problem

Recall (lect2a): the CEF \(E[y \mid \mathbf{x}]\) is the best predictor of \(y\) given \(\mathbf{x}\) under MSE.

But we don’t know the CEF — we estimate it:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k \]

The prediction error \(y_{\text{new}} - \hat{y}_{\text{new}}\) comes from two sources:

(1) Model approximation: our linear model may not capture the true CEF.
(2) Estimation uncertainty: \(\hat{\boldsymbol{\beta}}\) is estimated from a finite sample.

Adding regressors can reduce (1) but worsens (2). How do we navigate this tradeoff?

Why the “True” Model Isn’t Always Best

Example: predicting GDP growth.

\[ \text{GDP}_t = \alpha + \beta_1\, \text{Investment}_t + \beta_2\, \text{Innovation}_t + u_t \]

\(\text{Innovation}_t\) is hard to measure — noisy proxy.
Including it adds estimation uncertainty (source 2) to predictions.

A simpler model omitting \(\text{Innovation}_t\) has higher bias (source 1) but lower variance (source 2).
If the variance reduction outweighs the bias increase, the simpler model predicts better.

This is the bias-variance tradeoff — the central idea of predictive model selection.

The Bias-Variance Tradeoff: Setup

True relationship: \(y = f(\mathbf{x}) + u\), where \(E[u \mid \mathbf{x}] = 0\) and \(\text{Var}(u \mid \mathbf{x}) = \sigma^2\).

We fit a model \(\hat{y}(\mathbf{x})\) to training data and use it to predict a new observation \((y_{\text{new}}, \mathbf{x}_{\text{new}})\). Our prediction error is:

\[ y_{\text{new}} - \hat{y}(\mathbf{x}_{\text{new}}) = \underbrace{f(\mathbf{x}_{\text{new}}) - \hat{y}(\mathbf{x}_{\text{new}})}_{\text{model error}} + \underbrace{u_{\text{new}}}_{\text{noise}} \]

Since \(u_{\text{new}}\) has mean zero and is independent of \(\hat{y}\), the cross term vanishes when we square and take expectations:

\[ \text{MSE}_{\text{out}} = E[(\hat{y} - y)^2] = \underbrace{E[(\hat{y} - f(\mathbf{x}))^2]}_{\text{model error}} + \underbrace{\sigma^2}_{\text{irreducible error}} \]

We cannot reduce \(\sigma^2\) — it is the inherent noise in \(y\). The question is how to minimize model error.

Deriving the Bias-Variance Decomposition

Fix a point \(\mathbf{x}\) and consider the model error \(\hat{y}(\mathbf{x}) - f(\mathbf{x})\). The only randomness is in \(\hat{y}\) (which depends on the training sample). Add and subtract \(E[\hat{y}]\):

\[ \hat{y} - f(\mathbf{x}) = \bigl(\hat{y} - E[\hat{y}]\bigr) + \bigl(E[\hat{y}] - f(\mathbf{x})\bigr) \]

Square and take expectations over the training sample. Since \(E[\hat{y}]\) and \(f(\mathbf{x})\) are both non-random (for fixed \(\mathbf{x}\)), the second term is a constant \(b \equiv E[\hat{y}] - f(\mathbf{x})\):

\[ E[(\hat{y} - f(\mathbf{x}))^2] = E[(\hat{y} - E[\hat{y}])^2] + 2b\,\underbrace{E[\hat{y} - E[\hat{y}]]}_{= 0} + b^2 \]

The cross term vanishes, leaving:

\[ E[(\hat{y} - f(\mathbf{x}))^2] = \underbrace{E[(\hat{y} - E[\hat{y}])^2]}_{\text{Variance}} + \underbrace{(E[\hat{y}] - f(\mathbf{x}))^2}_{\text{Bias}^2} \]

The Full Decomposition

Putting it together:

\[ \text{MSE}_{\text{out}} = \underbrace{(E[\hat{y}] - f(\mathbf{x}))^2}_{\text{Bias}^2} + \underbrace{E[(\hat{y} - E[\hat{y}])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}} \]

Bias: how far the model’s average prediction is from the truth. Caused by wrong functional form or omitted variables — the model is systematically off.
Variance: how much the prediction fluctuates across different training samples. Caused by estimation uncertainty — the model chases noise.
Irreducible error: noise in \(y\) itself. No model can eliminate it.
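
A small Monte Carlo sketch can make the three terms concrete. It assumes a hypothetical true function \(f(x) = \sin(2x)\), noise standard deviation 0.5, and polynomial fits of degree 1, 3, and 8; none of these choices come from the lecture. For each degree it redraws the training sample many times, predicts at one fixed point, and reports bias squared, variance, and model error.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true CEF, chosen only for illustration.
    return np.sin(2 * x)

sigma = 0.5      # standard deviation of the noise u (assumed)
n_train = 30     # size of each training sample (assumed)
n_reps = 2000    # number of simulated training samples
x0 = 1.0         # fixed point at which we predict

def decomposition(degree):
    """Return (bias^2, variance, model error) of a polynomial fit at x0."""
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(-2, 2, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)     # OLS on 1, x, ..., x^degree
        preds[r] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()                   # spread across training samples
    model_error = np.mean((preds - f(x0)) ** 2)
    return bias_sq, variance, model_error

results = {d: decomposition(d) for d in (1, 3, 8)}
for d, (b2, var, err) in results.items():
    print(f"degree {d}: bias^2={b2:.4f}  variance={var:.4f}  "
          f"model error={err:.4f}  MSE_out={err + sigma**2:.4f}")
```

In each row, bias squared plus variance reproduces the model error, and adding \(\sigma^2\) gives the out-of-sample MSE at that point. In this setup the low-degree fit should show the larger bias and the high-degree fit the larger variance.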

Bias vs. Variance: Intuition

Simple models (few regressors): high bias, low variance.
- Example: predicting wages with only education — misses a lot, but stable across samples.

Complex models (many regressors): low bias, high variance.
- Example: predicting wages with 50 variables including ZIP code and birth month — flexible, but erratic.

The optimal model balances the two: complex enough to capture the signal, simple enough to not chase noise.

All model selection tools (adjusted \(R^2\), AIC, BIC, cross-validation) are attempts to find this balance.

Visualizing the Tradeoff

An optimal complexity exists — but we cannot observe bias and variance separately. What does overfitting look like in practice?

Overfitting in Action

Left: too simple — misses the curvature (high bias).
Middle: captures the pattern without chasing noise.
Right: passes through every point but oscillates wildly — will predict poorly on new data (high variance).

The degree-12 model has near-zero in-sample error — but terrible out-of-sample performance. We need a way to measure the difference.

In-Sample vs. Out-of-Sample

In-sample MSE: how well the model fits the training data.

\[ \text{MSE}_{\text{in}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Adding regressors never increases in-sample MSE and almost always reduces it, even when the regressors are irrelevant.
This is overfitting: the model fits noise, not signal.

Out-of-sample MSE: how well the model predicts new data. This is the true test.

\[ \text{MSE}_{\text{out}} = E[(y_{\text{new}} - \hat{y}_{\text{new}})^2] \]
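
The gap between the two measures can be sketched numerically. The example below assumes a hypothetical data-generating process (\(y = \sin(2x) + u\) with 20 training points) and fits polynomials of degree 1, 3, and 12. It approximates \(\text{MSE}_{\text{out}}\) with a large fresh sample, which is possible only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n):
    """Draw n observations from a hypothetical DGP: y = sin(2x) + noise."""
    x = rng.uniform(-2, 2, n)
    return x, np.sin(2 * x) + rng.normal(0, 0.5, n)

x_train, y_train = draw(20)      # small training sample
x_test, y_test = draw(5000)      # large fresh sample, stands in for "new data"

mse_in, mse_out = {}, {}
for degree in (1, 3, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    mse_in[degree] = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_out[degree] = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: MSE_in={mse_in[degree]:.3f}  "
          f"MSE_out={mse_out[degree]:.3f}")
```

Because the three models are nested, in-sample MSE cannot rise as the degree grows. Out-of-sample MSE carries no such guarantee, which is why it is the relevant yardstick for prediction.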

Estimating Out-of-Sample MSE

We cannot observe \(\text{MSE}_{\text{out}}\) directly. Two practical approaches:

Holdout method: split data into training set and test set.

Fit the model on the training set.
Evaluate predictions on the test set.
Simple, but it wastes data, and the results depend on the particular split.

\(k\)-fold cross-validation: split data into \(k\) subsets (folds).

For each fold: train on the other \(k-1\) folds, predict the held-out fold.
Average the \(k\) test MSEs.
Uses all data for both training and evaluation. More stable, but computationally heavier.
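
The \(k\)-fold steps above can be sketched in a few lines. The helper name `kfold_mse`, the polynomial model class, and the simulated data are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Estimate out-of-sample MSE of a degree-`degree` polynomial by k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))        # shuffle once, then cut into k folds
    folds = np.array_split(idx, k)
    fold_mses = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)              # the other k-1 folds
        coefs = np.polyfit(x[train], y[train], degree)   # fit on training folds
        resid = y[held_out] - np.polyval(coefs, x[held_out])
        fold_mses.append(np.mean(resid ** 2))            # test MSE on held-out fold
    return float(np.mean(fold_mses))                     # average the k test MSEs

# Hypothetical data: y = sin(2x) + noise.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 100)
y = np.sin(2 * x) + rng.normal(0, 0.5, 100)

cv_scores = {d: kfold_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, "-> selected degree:", best_degree)
```

Using the same `seed` for every candidate keeps the folds identical across models, so differences in the CV score reflect the models rather than the split.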

An alternative: instead of splitting the data, penalize in-sample fit for model complexity. This is the idea behind adjusted \(R^2\), AIC, and BIC.

Adjusted \(R^2\)

\(R^2\) never decreases when a regressor is added, even a useless one. The adjusted \(R^2\) penalizes model complexity:

\[ \bar{R}^2 = 1 - \frac{SSR/(n - k - 1)}{SST/(n - 1)} \]

Here \(k\) is the number of regressors, excluding the intercept. Unlike \(R^2\), \(\bar{R}^2\) can decrease when an added variable does not sufficiently improve fit.
Prefer the model with higher \(\bar{R}^2\).
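
As a quick numerical sketch, the formula can be applied to two hypothetical models. The values of \(n\), SST, and both SSRs below are invented for illustration.

```python
def adjusted_r2(ssr, sst, n, k):
    """Adjusted R^2 for a model with k regressors plus an intercept."""
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))

# Hypothetical numbers: 40 observations, total sum of squares 100.
n, sst = 40, 100.0
ssr_small, k_small = 60.0, 2     # baseline model
ssr_big, k_big = 59.5, 3         # one extra regressor, tiny improvement in fit

r2_small = 1 - ssr_small / sst
r2_big = 1 - ssr_big / sst
adj_small = adjusted_r2(ssr_small, sst, n, k_small)
adj_big = adjusted_r2(ssr_big, sst, n, k_big)

print(f"R^2:          {r2_small:.4f} -> {r2_big:.4f}")
print(f"adjusted R^2: {adj_small:.4f} -> {adj_big:.4f}")
```

With these numbers \(R^2\) rises slightly while \(\bar{R}^2\) falls, because the drop in SSR is too small to offset the lost degree of freedom.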

Example: R&D Intensity and Firm Sales

Consider explaining R&D spending as a share of sales (\(\textit{rdintens}\)):

	Model 1	Model 2	Model 3
\(\log(\textit{sales})\)	\(-0.673\)	\(-1.117\)	\(-0.571\)
\(\textit{profmarg}\)		\(0.174\)	\(0.175\)
\(\log(\textit{sales})^2\)			\(-0.037\)
\(R^2\)	\(0.061\)	\(0.148\)	\(0.155\)
\(\bar{R}^2\)	\(0.030\)	\(0.091\)	\(0.066\)

\(R^2\) increases from Model 2 to Model 3, but \(\bar{R}^2\) decreases — the squared term does not improve fit enough to justify the added parameter.
\(\bar{R}^2\) selects Model 2.

Question

Can we use \(\bar{R}^2\) to choose between \(y\) and \(\log(y)\) as the dependent variable?

No. \(\bar{R}^2\) measures the proportion of variation explained relative to the total variation in the dependent variable. Since \(\text{Var}(y) \neq \text{Var}(\log(y))\), the SST differs across models and the \(\bar{R}^2\) values are not comparable.

Information Criteria

An alternative approach: penalize the log-likelihood for model complexity. Here \(k\) is the total number of estimated parameters (including the intercept).

AIC (Akaike):

\[ \text{AIC} = -2 \log \hat{L} + 2k \]

BIC (Bayesian / Schwarz):

\[ \text{BIC} = -2 \log \hat{L} + k \log n \]

Lower is better. Both trade off fit (\(-2\log\hat{L}\)) against complexity (\(k\)).
BIC penalizes more heavily once \(n \geq 8\) (since \(\log n > 2\) when \(n > e^2 \approx 7.39\)), favoring simpler models.
For linear regression with normal errors: \(-2\log\hat{L} = n\log(SSR/n) + \text{constant}\), where the constant depends only on \(n\).
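
A minimal sketch of both criteria for a linear model with normal errors follows. It assumes the Gaussian log-likelihood evaluated at the maximum-likelihood estimates and counts \(k\) as on this slide (all coefficients, including the intercept). The SSR values and sample size are hypothetical.

```python
import math

def gaussian_neg2_loglik(ssr, n):
    """-2 log L at the MLE for a linear model with normal errors."""
    return n * (math.log(2 * math.pi) + math.log(ssr / n) + 1)

def aic(ssr, n, k):
    return gaussian_neg2_loglik(ssr, n) + 2 * k

def bic(ssr, n, k):
    return gaussian_neg2_loglik(ssr, n) + k * math.log(n)

# Hypothetical comparison with n = 40 observations.
n = 40
small = dict(ssr=60.0, k=3)   # intercept + 2 regressors
big = dict(ssr=56.0, k=4)     # intercept + 3 regressors

aic_small, aic_big = aic(n=n, **small), aic(n=n, **big)
bic_small, bic_big = bic(n=n, **small), bic(n=n, **big)

print(f"AIC: small={aic_small:.2f}  big={aic_big:.2f}")
print(f"BIC: small={bic_small:.2f}  big={bic_big:.2f}")
```

With these invented numbers AIC prefers the larger model while BIC prefers the smaller one, which illustrates the heavier BIC penalty. The two models must be fitted to the same dependent variable and the same sample for the comparison to be meaningful.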

RESET: Testing Functional Form

The Regression Equation Specification Error Test (Ramsey, 1969) checks whether the model misses nonlinearities.

Idea: if the model \(y = \mathbf{x}\boldsymbol{\beta} + u\) is correctly specified, powers of the fitted values \(\hat{y}^2, \hat{y}^3\) should have no additional explanatory power.

Procedure:

Estimate the original model. Obtain \(\hat{y}_i\).
Run the auxiliary regression: \(y_i = \mathbf{x}_i\boldsymbol{\gamma} + \alpha_2 \hat{y}_i^2 + \alpha_3 \hat{y}_i^3 + u_i\).
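Test \(H_0: \alpha_2 = \alpha_3 = 0\) using an \(F\)-test with \(2\) and \(n - k - 3\) degrees of freedom, where \(k\) is the number of regressors in the original model (excluding the intercept).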

Rejection suggests functional form misspecification (e.g., missing quadratics, logs, interactions).
RESET does not tell you what is wrong — only that something is.
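
The procedure can be sketched with plain least squares. The function names and the simulated data are assumptions for illustration. The code returns only the \(F\) statistic; if SciPy is available, `scipy.stats.f.sf` can turn it into a p-value.

```python
import numpy as np

def ols_ssr(X, y):
    """Return (fitted values, sum of squared residuals) from OLS of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, float(np.sum((y - fitted) ** 2))

def reset_f(X, y):
    """RESET F statistic using fitted^2 and fitted^3.

    X must already contain a column of ones.
    """
    n, p = X.shape
    fitted, ssr_r = ols_ssr(X, y)                        # step 1: original model
    X_aux = np.column_stack([X, fitted ** 2, fitted ** 3])
    _, ssr_ur = ols_ssr(X_aux, y)                        # step 2: auxiliary regression
    q = 2                                                # H0: alpha_2 = alpha_3 = 0
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - p - q))   # step 3: F statistic

# Hypothetical data: one outcome is truly linear in x, the other is quadratic.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 3, n)
X = np.column_stack([np.ones(n), x])
y_linear = 1 + 2 * x + rng.normal(0, 1, n)
y_curved = 1 + 2 * x + 1.5 * x ** 2 + rng.normal(0, 1, n)

f_linear = reset_f(X, y_linear)
f_curved = reset_f(X, y_curved)
print(f"F (linear truth) = {f_linear:.2f},  F (quadratic truth) = {f_curved:.2f}")
```

Both statistics are compared with the \(F_{2,\,n-p-2}\) distribution, where \(p\) counts the columns of `X` including the constant. A large value for the quadratic case signals that the linear specification misses something, without saying what.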

RESET: Example

Housing prices: \(\log(\textit{price}) = \beta_0 + \beta_1\, \textit{sqrft} + \beta_2\, \textit{bdrms} + u\).

Step 1: Estimate the model. Obtain \(\hat{y}_i\).

Step 2: Auxiliary regression:

\[ \log(\textit{price}_i) = \gamma_0 + \gamma_1\, \textit{sqrft}_i + \gamma_2\, \textit{bdrms}_i + \alpha_2 \hat{y}_i^2 + \alpha_3 \hat{y}_i^3 + u_i \]

Step 3: \(F\)-test of \(H_0: \alpha_2 = \alpha_3 = 0\).

If \(F = 4.67\) with \(p = 0.012\): reject \(H_0\) at the 5% level. The linear specification is inadequate.
Next step: try \(\textit{sqrft}^2\), \(\log(\textit{sqrft})\), or interactions (lect4a tools).

Comparing Nested Models: \(F\)-test

Recall (lect3a): to compare a restricted model against an unrestricted model, use the \(F\)-test.

Example: Is a quadratic wage model better than a linear one?

\[ \text{Model 1}: \log(\textit{wage}) = \beta_0 + \beta_1\, \textit{exper} + u \]

\[ \text{Model 2}: \log(\textit{wage}) = \beta_0 + \beta_1\, \textit{exper} + \beta_2\, \textit{exper}^2 + u \]

Model 1 is nested in Model 2. Test \(H_0: \beta_2 = 0\).
Here a \(t\)-test suffices (one restriction). For multiple restrictions, use the \(F\)-test:

\[ F = \frac{(SSR_r - SSR_{ur}) / q}{SSR_{ur} / (n - k - 1)} \sim F_{q, n-k-1} \]

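where \(q\) is the number of restrictions, \(k\) is the number of regressors in the unrestricted model (excluding the intercept), and the \(F_{q,\,n-k-1}\) distribution applies under \(H_0\).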
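
The sketch below runs the linear-versus-quadratic comparison on simulated data and checks numerically that, with one restriction, the \(F\) statistic equals the squared \(t\) statistic. The data-generating process and its coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
exper = rng.uniform(0, 40, n)
# Hypothetical wage equation used only to generate data.
log_wage = 1.5 + 0.04 * exper - 0.0007 * exper ** 2 + rng.normal(0, 0.4, n)

def ols(X, y):
    """Return (coefficients, sum of squared residuals) from OLS of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, float(np.sum((y - X @ beta) ** 2))

X_r = np.column_stack([np.ones(n), exper])                 # Model 1 (restricted)
X_ur = np.column_stack([np.ones(n), exper, exper ** 2])    # Model 2 (unrestricted)

_, ssr_r = ols(X_r, log_wage)
beta_ur, ssr_ur = ols(X_ur, log_wage)

q, k = 1, 2                                  # one restriction, two regressors
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))

# t statistic on exper^2 in the unrestricted model.
sigma2_hat = ssr_ur / (n - k - 1)
se_beta2 = np.sqrt(sigma2_hat * np.linalg.inv(X_ur.T @ X_ur)[2, 2])
t = beta_ur[2] / se_beta2

print(f"F = {F:.3f},  t^2 = {t ** 2:.3f}")
```

The two numbers agree up to floating-point error, which is why a \(t\)-test suffices when \(q = 1\).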

Predictive Model Selection: Summary

Tool	What it does	Limitation
Adjusted \(R^2\)	Penalizes for \(k\)	Cannot compare different \(y\)’s
AIC / BIC	Penalizes log-likelihood	Requires same \(y\); BIC may underfit
\(F\)-test (lect3a)	Compares nested models	Requires nesting
RESET	Detects misspecification	Does not identify the fix

All of these are tools for prediction. For causal questions, the criteria are fundamentally different.

From Prediction to Causality

For prediction, we want the model that minimizes MSE — we don’t care why a variable is correlated, only that it helps forecast.

For causality, we want to isolate the effect of a treatment \(T\) on an outcome \(Y\). Including the wrong controls can introduce bias rather than remove it.

The question is no longer “does this variable improve fit?” but rather: does conditioning on this variable help identify the causal effect?

Example: Job Training and Productivity

A firm wants to estimate the causal effect of a job training program on employee productivity.

Training is randomly assigned. The naive comparison is unbiased:

\[ E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0] = E[Y_{1i} - Y_{0i}] = \text{ATE} \]

Should we also control for post-training test scores? They are correlated with productivity — so they would improve prediction.

No.

Bad Control: The Potential Outcomes View

Let \(B_i\) be an indicator for a high post-training test score (e.g., above the median). Conditioning on \(B_i\):

\[ \begin{aligned} &E[Y_i \mid B_i = 1, T_i = 1] - E[Y_i \mid B_i = 1, T_i = 0] \\ = \;& E[Y_{1i} \mid B_{1i} = 1] - E[Y_{0i} \mid B_{0i} = 1] \\ = \;& \underbrace{E[Y_{1i} - Y_{0i} \mid B_{1i} = 1]}_{\text{CATE}} + \underbrace{E[Y_{0i} \mid B_{1i} = 1] - E[Y_{0i} \mid B_{0i} = 1]}_{\text{selection bias}} \end{aligned} \]

The selection bias term is generally non-zero: having a high score under treatment (\(B_{1i} = 1\)) selects different people than having a high score under control (\(B_{0i} = 1\)).

Why Does This Happen?

Post-training test scores depend on both training and innate ability.

Among workers with the same test score:

Those who received training got that score despite possibly lower innate ability.
Those who did not receive training got that score because of higher innate ability.

So within each test-score group, the control group has systematically higher ability than the treated group.
Higher ability \(\to\) higher productivity \(\to\) the treated–control comparison understates the training effect.
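
The selection argument above can be checked in a small simulation. Everything in it is assumed for illustration: a true training effect of 1, normally distributed ability, and a post-training score that depends on both training and ability.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
true_effect = 1.0                                      # assumed training effect

ability = rng.normal(0, 1, n)
training = rng.integers(0, 2, n)                       # random assignment (0 or 1)
score = training + ability + rng.normal(0, 1, n)       # post-training test score
productivity = true_effect * training + 2 * ability + rng.normal(0, 1, n)

def diff_in_means(y, t, mask=None):
    """Treated-minus-control mean of y, optionally within a subsample."""
    if mask is None:
        mask = np.ones(len(y), dtype=bool)
    return y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()

naive = diff_in_means(productivity, training)

high_score = score > np.median(score)                  # B_i = 1
conditioned = diff_in_means(productivity, training, high_score)
ability_gap = diff_in_means(ability, training, high_score)

print(f"naive difference:              {naive:.3f}")
print(f"difference among high scorers: {conditioned:.3f}")
print(f"ability gap among high scorers (treated - control): {ability_gap:.3f}")
```

In this setup the unconditional comparison recovers the assumed effect, while the comparison among high scorers is pulled down by the negative ability gap between treated and control workers.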

Conditioning on test scores created a spurious link between training and ability — even though training was randomly assigned.
DAGs give us a visual language to see when this happens in general.

Directed Acyclic Graphs (DAGs)

A Directed Acyclic Graph (DAG) represents the causal structure:

Nodes = variables
Arrows = causal relationships
Acyclic = no feedback loops

Three fundamental building blocks:

	Pattern	Role of \(M\)
Chain	\(X \to M \to Y\)	Mediator
Fork	\(X \leftarrow M \to Y\)	Common cause (confounder)
Collider	\(X \to M \leftarrow Y\)	Common effect

The Collider Problem

In the job training example, Test Score (Post) is a collider:

\[ \text{Training} \to \text{Test Score} \leftarrow \text{Innate Ability} \]

Without conditioning: Training and Ability are independent (random assignment). No bias.

Conditioning on Test Score: among workers with the same score, Training and Ability become negatively associated — knowing one tells you about the other.
This opens a spurious path from Training to Productivity through Ability.

General rule: conditioning on a collider creates an association between its causes that did not exist before.

How do we determine when to control for a variable — and when not to?

\(d\)-Separation

Let \(Z\) be the set of variables we control for in the regression. A path between \(X\) and \(Y\) is blocked by \(Z\) if at least one intermediate node \(M\) on it is blocked, according to its type:

Chain \(\to M \to\) or fork \(\leftarrow M \to\):
- Path is open by default. Including \(M\) in \(Z\) blocks it.

Collider \(\to M \leftarrow\):
- Path is blocked by default. Including \(M\) (or any descendant of \(M\)) in \(Z\) opens it.

If all paths between \(X\) and \(Y\) are blocked by \(Z\), then \(X \perp\!\!\!\perp Y \mid Z\): \(X\) and \(Y\) are conditionally independent given \(Z\).

For causal inference: choose \(Z\) so that all non-causal (back-door) paths between treatment and outcome are blocked, without opening collider paths.

\(d\)-Separation: Examples

Chain: Rain \(\to\) Wet Road \(\to\) Accident

Conditioning on Wet Road blocks the path from Rain to Accident.
Once you know the road is wet, knowing whether it rained adds nothing about accident risk.

Fork: Parental Involvement \(\leftarrow\) Parental Education \(\to\) Child Performance

Conditioning on Parental Education blocks the spurious association between Involvement and Performance.

Collider: Stress \(\to\) Blood Pressure \(\leftarrow\) Family History

Stress and Family History are marginally independent.
Conditioning on Blood Pressure induces a spurious association: among hypertensive patients, those without family history are more likely to be stressed.
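
The three examples above can be checked mechanically with a small path-based sketch of the blocking rules. The representation (a dictionary mapping each node to its children) and the function names are assumptions. The checker enumerates every path, so it is meant for lecture-sized graphs only.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following arrows."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def simple_paths(dag, start, end):
    """All paths from start to end that ignore arrow direction and repeat no node."""
    neighbours = {}
    for parent, children in dag.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])

def path_is_blocked(dag, path, z):
    """Apply the chain/fork/collider rules to each intermediate node."""
    for a, m, b in zip(path, path[1:], path[2:]):
        is_collider = m in dag.get(a, ()) and m in dag.get(b, ())
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if m not in z and not (descendants(dag, m) & z):
                return True
        elif m in z:
            # A chain or fork node blocks once we condition on it.
            return True
    return False

def d_separated(dag, x, y, z=()):
    z = set(z)
    return all(path_is_blocked(dag, p, z) for p in simple_paths(dag, x, y))

chain = {"Rain": ["Wet Road"], "Wet Road": ["Accident"]}
fork = {"Parental Education": ["Involvement", "Performance"]}
collider = {"Stress": ["Blood Pressure"], "Family History": ["Blood Pressure"]}

print(d_separated(chain, "Rain", "Accident"),
      d_separated(chain, "Rain", "Accident", {"Wet Road"}))
print(d_separated(fork, "Involvement", "Performance"),
      d_separated(fork, "Involvement", "Performance", {"Parental Education"}))
print(d_separated(collider, "Stress", "Family History"),
      d_separated(collider, "Stress", "Family History", {"Blood Pressure"}))
```

Each printed pair shows the status without and then with conditioning: the chain and fork go from open to blocked, while the collider goes from blocked to open.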

Can We Identify Mediators from Regression?

Suppose controlling for \(M\) makes the coefficient on \(T\) insignificant. Can we conclude that \(M\) mediates the effect of \(T\) on \(Y\)?

No. Two different causal structures produce the same regression pattern:

Mediation: \(T \to M \to Y\). Controlling for \(M\) blocks the causal path.
Confounding: \(T \leftarrow M \to Y\). Controlling for \(M\) removes a spurious association.

In both cases, the coefficient on \(T\) shrinks or vanishes when \(M\) is included. The regression cannot distinguish the two — you need to know the causal structure (the DAG).

This is why theory matters: the decision to control for a variable cannot be made on statistical grounds alone.
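
A simulation sketch of the two structures shows why the regression alone cannot tell them apart. The coefficients and sample size are invented. In the mediation design \(T\) affects \(Y\) only through \(M\); in the confounding design \(T\) has no effect on \(Y\) at all.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000

def coef_on_t(y, t, m=None):
    """OLS coefficient on t, with an intercept and optionally controlling for m."""
    cols = [np.ones(len(y)), t] + ([m] if m is not None else [])
    beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return float(beta[1])

# Mediation: T -> M -> Y
t_med = rng.normal(size=n)
m_med = t_med + rng.normal(size=n)
y_med = 2 * m_med + rng.normal(size=n)

# Confounding: T <- M -> Y (no causal effect of T on Y)
m_con = rng.normal(size=n)
t_con = m_con + rng.normal(size=n)
y_con = 2 * m_con + rng.normal(size=n)

results = {
    "mediation": (coef_on_t(y_med, t_med), coef_on_t(y_med, t_med, m_med)),
    "confounding": (coef_on_t(y_con, t_con), coef_on_t(y_con, t_con, m_con)),
}
for name, (without_m, with_m) in results.items():
    print(f"{name:12s} coef on T without M: {without_m:.3f}   with M: {with_m:.3f}")
```

In both designs the coefficient on \(T\) is clearly non-zero without \(M\) and close to zero with it, yet the causal interpretation is opposite: in the first case controlling for \(M\) hides a real effect, and in the second it removes a spurious one.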

The Job Training DAG

Test Score (Post) is a collider: Training \(\to\) Test Score \(\leftarrow\) Ability.
Conditioning on it opens a spurious path: Training \(\to\) Test Score \(\leftarrow\) Ability \(\to\) Productivity.

What About Pre-Training Test Scores?

Test Score (Pre) is part of a chain: Ability \(\to\) Test Score (Pre) \(\to\) Training.
Conditioning on it blocks the back-door path: Training \(\leftarrow\) Test Score (Pre) \(\leftarrow\) Ability \(\to\) Productivity.
This is a good control — it removes confounding.

Good Controls vs. Bad Controls

Good control: a pre-treatment variable that blocks confounding paths without opening new ones.

Bad control: a post-treatment variable that lies on the causal path or is a collider — conditioning on it introduces bias.

Rules of thumb:

Pre-treatment variables that are correlated with both \(T\) and \(Y\) are usually good controls.
Post-treatment variables — outcomes, mediators, or variables affected by treatment — are usually bad controls.
When in doubt, draw the DAG.

Exercises: Good or Bad Control?

Q1: Returns to college education.

Outcome: Earnings. Treatment: College graduation.
Control? Occupation (white collar vs. blue collar).

Q2: Obesity and mortality.

Outcome: Mortality. Treatment: Obesity.
Control? Heart failure, diabetes. (The “obesity paradox.”)

Exercises (cont.)

Q3: Beer tax and traffic fatalities.

Outcome: Fatalities. Treatment: Beer tax.
Control? Per capita beer consumption.

Q4: Attendance and final exam score.

Outcome: Final exam score. Treatment: Attendance.
Consider: ACT, priGPA, termGPA, homework.
Which are good controls? Which are potentially bad?

Q5: Childhood nutrition and adult height.

Outcome: Height. Treatment: Childhood nutrition.
Control? Enlisted in military.

Drawing the DAG Is Hard

The exercises above have no single “right” DAG — reasonable researchers can disagree.

The DAG encodes your theoretical assumptions about the causal structure, not empirical findings.
Different narratives → different graphs → different control strategies.

Example. Does parental income affect child test scores…

…directly? Then school quality is a confounder — control for it.
…only through school quality? Then it is a mediator — do not control.

…But Still Valuable

When multiple stories are plausible, the DAG framework does not give a clear answer — and neither does any other approach.

The value of the DAG is not that it resolves the ambiguity, but that it forces you to state your assumptions explicitly rather than burying them in a regression specification.

Summary

For prediction, more variables are not always better — the bias-variance tradeoff governs model complexity.
Adjusted \(R^2\), AIC/BIC, and RESET help select among models for predictive purposes.
For causality, the question is not “does this variable improve fit?” but “does conditioning help identify the causal effect?”
Bad controls (post-treatment variables, colliders) can introduce bias even in randomized experiments.
Good controls (pre-treatment confounders) block back-door paths and reduce omitted variable bias (OVB).

What’s Next

Part IV — When Conditioning Isn’t Enough:

Measurement error and attenuation bias
Instrumental variables: motivation and estimation
Difference-in-differences