Inference

Natasha Kang

Xiamen University, Chow Institute

March 2026

From Estimation to Inference

  • In Lecture 2, we estimated \(\hat\beta_j\) and studied its properties (unbiasedness, variance).
  • But an estimate alone is just a number — it doesn’t tell us whether the true \(\beta_j\) is zero, positive, or equal to some hypothesized value.
  • Statistical inference uses the sampling distribution of estimators to make statements about population parameters.
  • Two main tools: hypothesis tests and confidence intervals.

Hypothesis Testing

  • A hypothesis is a statement about the unknown population parameter \(\beta_j\).
  • Using data, we assess whether the evidence supports or contradicts this statement.
  • Null hypothesis (\(H_0\)): the claim held to be true unless the data provide sufficient evidence against it.
  • Alternative hypothesis (\(H_1\)): the claim we are trying to establish.
  • The econometrician carries the burden of proof: they must show that the data provide enough evidence to reject \(H_0\) in favor of \(H_1\).

Hypothesis Testing as a Trial

Courtroom | Hypothesis Test
Defendant is innocent | \(H_0\) is true
Defendant is guilty | \(H_1\) is true
Prosecutor presents evidence | Econometrician computes test statistic
Jury decides | Reject or fail to reject \(H_0\)

  • Type I error: wrongful conviction (rejecting \(H_0\) when it is true).
  • Type II error: letting a guilty person go free (failing to reject \(H_0\) when \(H_1\) is true).

The Tradeoff

  • We control Type I error by choosing how much evidence we require to reject \(H_0\).
  • The significance level (\(\alpha\)) is the probability of Type I error we are willing to tolerate:

\[ P(\text{Reject } H_0 \mid H_0 \text{ is true}) = \alpha \]

  • But lowering \(\alpha\) comes at a cost: requiring stronger evidence also makes it harder to reject when \(H_1\) is true (Type II error increases).
  • By convention, \(\alpha\) is set at 5% or 1%.

Steps of Hypothesis Testing

  1. Specify \(H_0\) and \(H_1\).
  2. Choose a significance level \(\alpha\).
  3. Define a decision rule (critical region).
  4. Compute the test statistic and see if it falls in the critical region.

To carry out step 3, we need to know the distribution of the test statistic under \(H_0\).

What Do We Need?

  • We know \(E[\hat\beta_j] = \beta_j\) and \(\text{Var}(\hat\beta_j \mid X)\) under MLR.1–MLR.5.
  • But mean and variance alone don’t determine a distribution — we need more.
  • This requires one more assumption.

The Normality Assumption

MLR.6 — Normality: \(U_i \mid X \sim N(0, \sigma^2)\) for each \(i\)

  • Justification: \(U\) is the sum of many unobserved factors. By the CLT, such sums are approximately normal.
  • Limitations:
    • How many factors? Are they independent? Is the combination additive?
    • Normality may be poor in skewed or heavy-tailed settings (e.g., wages, which are bounded below).
    • For binary or count outcomes, the linear model itself is often problematic for broader reasons.
  • With large samples, we can relax this assumption using asymptotic approximations (Lecture 3b).

Normal Sampling Distribution

Theorem: Under MLR.1–MLR.6 (the Classical Linear Model assumptions), conditional on \(X\):

\[ \hat\beta_j \sim N\left(\beta_j, \, \text{Var}(\hat\beta_j \mid X)\right), \qquad j = 0, 1, \ldots, k \]

  • Denote \(\text{sd}(\hat\beta_j) = \sqrt{\text{Var}(\hat\beta_j \mid X)} = \sigma / \sqrt{\text{SST}_j(1 - R_j^2)}\) for the slope coefficients (\(j = 1, \ldots, k\)). Standardizing:

\[ \frac{\hat\beta_j - \beta_j}{\text{sd}(\hat\beta_j)} \sim N(0, 1) \]

Two-Sided Test: Setup

Consider testing:

\[ H_0: \beta_j = \beta_{j,0} \qquad \text{vs.} \qquad H_1: \beta_j \neq \beta_{j,0} \]

where \(\beta_{j,0}\) is a known value specified under \(H_0\).

  • If \(\sigma^2\) were known, we could use:

\[ Z = \frac{\hat\beta_j - \beta_{j,0}}{\color{red}{\text{sd}(\hat\beta_j)}} \sim N(0, 1) \quad \text{under } H_0 \]

  • Reject \(H_0\) if \(|Z| > z_{1-\alpha/2}\), where the critical value \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\) quantile of \(N(0,1)\), chosen so that \(P(|Z| > z_{1-\alpha/2}) = \alpha\) under \(H_0\).
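
As a quick numerical illustration, the two-sided critical value is just a standard normal quantile. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_critical` is ours and not part of the lecture.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Two-sided critical value z_{1-alpha/2} of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

# Conventional significance levels and the matching rejection thresholds
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: reject H0 if |Z| > {z_critical(alpha):.3f}")
```

At the conventional levels of 10%, 5% and 1% the thresholds are roughly 1.645, 1.960 and 2.576: a smaller \(\alpha\) demands a larger \(|Z|\) before rejecting.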

The \(t\)-Statistic

Since \(\sigma^2\) is unknown, we replace \(\color{red}{\text{sd}(\hat\beta_j)}\) with \(\color{blue}{\text{se}(\hat\beta_j)} = \hat\sigma / \sqrt{\text{SST}_j(1-R_j^2)}\):

\[ t = \frac{\hat\beta_j - \beta_{j,0}}{\color{blue}{\text{se}(\hat\beta_j)}} \]

  • But this substitution has a cost: \(\hat\sigma\) is itself estimated from data, introducing extra randomness. The resulting statistic is not \(N(0,1)\).
  • Gosset (1908), working with small samples at the Guinness brewery under the pseudonym “Student,” showed that it follows a distribution with heavier tails — the \(t\)-distribution.

The \(t\)-Distribution

Theorem: Under the CLM assumptions (MLR.1–MLR.6):

\[ t \sim t_{n-k-1} \quad \text{under } H_0 \]

where \(k\) is the number of slope regressors (excluding the intercept), so \(\text{df} = n - k - 1\).

  • Decision rule: reject \(H_0\) if \(|t| > t_{1-\alpha/2, \, n-k-1}\) (the critical value).
  • This is an exact finite-sample result under MLR.1–MLR.6. In Lecture 3b, we develop large-sample alternatives that do not require normality.

Under the Alternative

What happens to the \(t\)-statistic when \(H_0\) is false?

\[ t = \frac{\hat\beta_j - \beta_{j,0}}{\text{se}(\hat\beta_j)} = \underbrace{\frac{\hat\beta_j - \beta_j}{\text{se}(\hat\beta_j)}}_{\sim \, t_{n-k-1}} + \underbrace{\frac{\beta_j - \beta_{j,0}}{\text{se}(\hat\beta_j)}}_{\text{nonzero shift}} \]

  • The \(t\)-statistic is shifted away from zero — the larger the shift, the more likely we reject.
  • The probability of correctly rejecting \(H_0\) when \(H_1\) is true is called power.
  • Power is higher when:
    • The effect is larger: \(|\beta_j - \beta_{j,0}|\) is big
    • The estimation is more precise: \(\text{se}(\hat\beta_j)\) is small

The Tradeoff, Revisited

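  • Moving the critical value \(c\) outward (lowering \(\alpha\)) shrinks the Type I error probability (the blue area in the figure) but enlarges the Type II error probability (the red area): fewer false rejections, more missed detections.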

Power Under Different Alternatives

  • The further \(\beta_j\) is from \(\beta_{j,0}\), the more the \(H_1\) distribution shifts away from \(H_0\) — and the larger the power (shaded area beyond \(c\)).
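
The tradeoff and the power statements above can be checked numerically. The sketch below is a rough approximation, not the exact finite-sample calculation: it treats the test statistic as normal with unit variance and mean \((\beta_j - \beta_{j,0})/\text{se}(\hat\beta_j)\), which is reasonable only when the degrees of freedom are large. The function name `approx_power` and the effect sizes and standard errors are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(effect, se, alpha=0.05):
    """Approximate power of a two-sided test when the statistic is N(effect/se, 1)."""
    c = STD_NORMAL.inv_cdf(1 - alpha / 2)   # critical value
    shift = effect / se                     # how far H1 moves the statistic
    # P(|Z + shift| > c) = P(Z > c - shift) + P(Z < -c - shift)
    return (1 - STD_NORMAL.cdf(c - shift)) + STD_NORMAL.cdf(-c - shift)

# Moving c outward (lowering alpha) lowers power, i.e. raises the Type II error.
for alpha in (0.10, 0.05, 0.01):
    power = approx_power(effect=0.05, se=0.02, alpha=alpha)
    print(f"alpha = {alpha:.2f}: power = {power:.3f}, Type II error = {1 - power:.3f}")

# A larger effect or a smaller standard error raises power.
for effect in (0.0, 0.02, 0.05, 0.10):
    print(f"effect = {effect:.2f}: power = {approx_power(effect, se=0.02):.3f}")
```

When the effect is zero the "power" equals \(\alpha\) itself, which is the Type I error probability.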

\(p\)-Values

  • Rather than fixing \(\alpha\) and looking up a critical value, we can ask: how strong is the evidence against \(H_0\)?
  • The \(p\)-value is the smallest significance level at which \(H_0\) would be rejected.

\[ p\text{-value} = P(|T| > |t|) \quad \text{where } T \sim t_{n-k-1} \]

  • Decision rule: reject \(H_0\) if \(p\text{-value} < \alpha\).
  • The \(p\)-value summarizes the strength of evidence against \(H_0\) — smaller means stronger evidence.

Testing \(H_0: \beta_j = 0\)

  • The null \(H_0: \beta_j = 0\) is the most commonly tested hypothesis.
  • Software automatically reports the \(t\)-statistic and \(p\)-value for this test.
  • If \(H_0: \beta_j = 0\) is rejected: \(X_j\) is statistically significant.
  • If not rejected: \(X_j\) is statistically insignificant.
  • Caution: failing to reject does not mean \(H_0\) is true — it may mean we lack the precision to detect the effect (low power).

Example: Determinants of College GPA

\[ \text{colGPA} = \beta_0 + \beta_1 \, \text{hsGPA} + \beta_2 \, \text{ACT} + \beta_3 \, \text{skipped} + U \]

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.38955    0.33155   4.191 4.95e-05 ***
hsGPA        0.41182    0.09367   4.396 2.19e-05 ***
ACT          0.01472    0.01056   1.393  0.16578
skipped     -0.08311    0.02600  -3.197  0.00173 **

Test \(H_0: \beta_{\text{skipped}} = 0\) against \(H_1: \beta_{\text{skipped}} \neq 0\) at the 5% level. Do you reject?

  • \(t = -0.08311 / 0.02600 = -3.197\)
  • Critical value at 5% with 137 df: \(\approx 1.98\)
  • \(|t| = 3.197 > 1.98\): reject \(H_0\). Skipping classes is statistically significant.
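
The critical value and the reported \(p\)-value for skipped can be reproduced without tables. The sketch below is pure standard-library Python: it integrates the \(t\) density numerically with Simpson's rule and inverts the CDF by bisection. The helper names (`t_pdf`, `t_cdf`, `t_quantile`) are ours; in practice one would normally call a statistics library (for example `scipy.stats.t`) rather than integrate by hand.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry around 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(p, df):
    """Invert t_cdf by bisection on [-50, 50]."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = 137                      # degrees of freedom used in the example
t_stat = -0.08311 / 0.02600   # estimate / standard error for skipped
crit = t_quantile(1 - 0.05 / 2, df)
p_value = 2 * (1 - t_cdf(abs(t_stat), df))

print(f"t = {t_stat:.3f}, critical value = {crit:.3f}, p-value = {p_value:.5f}")
print("Reject H0 at the 5% level:", abs(t_stat) > crit)
```

The results should land close to the 1.98 critical value and the 0.00173 \(p\)-value shown in the regression output, up to rounding in the reported estimate and standard error.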

One-Sided Tests

When theory suggests a direction, we use a one-sided test — the entire \(\alpha\) is placed in one tail.

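  • For \(H_1: \beta_j > \beta_{j,0}\), reject \(H_0\) if \(t > t_{1-\alpha, \, n-k-1}\); for \(H_1: \beta_j < \beta_{j,0}\), reject if \(t < -t_{1-\alpha, \, n-k-1}\) (the shaded rejection region in the figure).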
  • One-sided tests have more power to detect effects in the hypothesized direction.
  • Important: the direction must be chosen before seeing the data. Choosing after seeing \(\hat\beta_j\) inflates the Type I error.

Example: Wages and Experience

\[ \log(\text{wage}) = \beta_0 + \beta_1 \, \text{educ} + \beta_2 \, \text{exper} + \beta_3 \, \text{tenure} + U \]

             Estimate Std. Error t value
educ          .0920     .0073    12.56
exper         .0041     .0017     2.41
tenure        .0221     .0031     7.13

Test whether experience has a positive effect on wages, after controlling for education and tenure.

  • \(H_0: \beta_{\text{exper}} \leq 0\) vs. \(H_1: \beta_{\text{exper}} > 0\)
  • \(t = 0.0041 / 0.0017 = 2.41\)
  • Critical value at 5% (522 df): \(\approx 1.65\)
  • \(t = 2.41 > 1.65\): reject \(H_0\).
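
To see how much the one-sided alternative matters here, the sketch below computes approximate \(p\)-values for the experience coefficient. It is an approximation under a stated assumption: with 522 degrees of freedom the \(t\) distribution is very close to \(N(0,1)\), so the standard normal is used in its place.

```python
from statistics import NormalDist

t_stat = 2.41   # reported t value for exper

# Assumption: with 522 df, t_{522} is approximated by the standard normal.
p_one_sided = 1 - NormalDist().cdf(t_stat)   # H1: beta_exper > 0, upper tail only
p_two_sided = 2 * p_one_sided                # H1: beta_exper != 0, both tails

print(f"one-sided p-value ~ {p_one_sided:.4f}")
print(f"two-sided p-value ~ {p_two_sided:.4f}")
```

The one-sided \(p\)-value is half the two-sided one, which is why a one-sided test rejects more easily when the estimate falls in the hypothesized direction.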

Testing Other Values of \(\beta_j\)

The \(t\)-test works for any hypothesized value, not just zero.

In the college GPA example, test \(H_0: \beta_{\text{skipped}} = -0.1\) against a two-sided alternative.

\[ t = \frac{-0.08311 - (-0.1)}{0.026} = \frac{0.01689}{0.026} \approx 0.65 \]

  • \(|t| = 0.65 < 1.98\): fail to reject. The data are consistent with each skipped class reducing GPA by 0.1 points.

Confidence Intervals

We know that under the CLM assumptions:

\[ P\!\left( |t| \leq t_{1-\alpha/2, \, n-k-1} \right) = 1 - \alpha \]

Substituting \(t = (\hat\beta_j - \beta_j) / \text{se}(\hat\beta_j)\) and rearranging:

\[ P\!\left( \hat\beta_j - t_{1-\alpha/2, \, n-k-1} \cdot \text{se}(\hat\beta_j) \leq \beta_j \leq \hat\beta_j + t_{1-\alpha/2, \, n-k-1} \cdot \text{se}(\hat\beta_j) \right) = 1 - \alpha \]

The \(100(1-\alpha)\%\) confidence interval for \(\beta_j\):

\[ \hat\beta_j \pm t_{1-\alpha/2, \, n-k-1} \cdot \text{se}(\hat\beta_j) \]

Interpreting Confidence Intervals

  • Interpretation: in repeated sampling, \(100(1-\alpha)\%\) of such intervals will contain the true \(\beta_j\).
  • For a given sample, \(\beta_j\) is either in the interval or not — the probability is 0 or 1.

What is the relationship between a CI and a two-sided test?

  • Fail to reject \(H_0: \beta_j = \beta_{j,0}\) at level \(\alpha\) \(\iff\) \(\beta_{j,0}\) lies inside the \(100(1-\alpha)\%\) CI.
  • The CI is the set of all values that would not be rejected by a two-sided test.

CI Example: College GPA

95% CI for \(\beta_{\text{skipped}}\)?

\[ -0.08311 \pm 1.98 \times 0.026 = [-0.135, -0.032] \]

  • Zero is outside this interval — consistent with rejecting \(H_0: \beta_{\text{skipped}} = 0\).
  • \(-0.1\) is inside this interval — consistent with failing to reject \(H_0: \beta_{\text{skipped}} = -0.1\).
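
The interval and its link to the two-sided test can be verified with a few lines of arithmetic. This is a minimal sketch using the estimate, standard error and critical value quoted above; the function names `confidence_interval` and `rejects` are ours.

```python
beta_hat = -0.08311   # estimate for skipped
se = 0.02600          # its standard error
t_crit = 1.98         # approximate 5% two-sided critical value with 137 df

def confidence_interval(estimate, std_error, critical_value):
    """Return (lower, upper) = estimate -/+ critical_value * std_error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width

def rejects(null_value, estimate, std_error, critical_value):
    """Two-sided t-test decision for H0: beta = null_value."""
    return abs((estimate - null_value) / std_error) > critical_value

lower, upper = confidence_interval(beta_hat, se, t_crit)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")

# Duality: a null value is rejected exactly when it lies outside the interval.
for null_value in (0.0, -0.1):
    inside = lower <= null_value <= upper
    print(f"H0: beta = {null_value}: inside CI = {inside}, "
          f"rejected = {rejects(null_value, beta_hat, se, t_crit)}")
```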

Economic vs. Statistical Significance

Statistical significance tells us whether an effect is detectable — not whether it is important.

Example: 401(k) participation and firm size

\[ \text{prate} = \beta_0 + \beta_1 \, \text{mrate} + \beta_2 \, \text{age} + \beta_3 \, \text{totemp} + U \]

where prate = participation rate (%), mrate = employer match rate, age = plan age (years), totemp = total employees.

             Estimate Std. Error t value Pr(>|t|)
totemp      -1.291e-04  3.666e-05  -3.521 0.000443 ***

  • \(\hat\beta_3\) is highly statistically significant (\(p < 0.001\)).
  • But a 10,000-employee increase reduces participation by only \(10{,}000 \times 0.00013 = 1.3\) percentage points.
  • Is that economically meaningful? That depends on context — statistical significance alone doesn’t answer the question.

Testing Linear Combinations of \(\beta_j\)

Sometimes the hypothesis involves multiple parameters.

Example: Is one year at a junior college worth as much as one year at a university?

\[ \log(\text{wage}) = \beta_0 + \beta_1 \, \text{jc} + \beta_2 \, \text{univ} + \beta_3 \, \text{exper} + U \]

\[ H_0: \beta_1 = \beta_2 \qquad \text{vs.} \qquad H_1: \beta_1 < \beta_2 \]

  • The \(t\)-statistic is:

\[ t = \frac{\hat\beta_1 - \hat\beta_2}{\text{se}(\hat\beta_1 - \hat\beta_2)} \]

  • But \(\text{se}(\hat\beta_1 - \hat\beta_2)\) requires \(\text{Cov}(\hat\beta_1, \hat\beta_2)\), which is not always reported.
  • In practice, software handles this directly (linearHypothesis from the car package in R, test in Stata). But the idea behind it is instructive.
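
For reference, the standard error in the denominator follows from the variance of a difference of two random variables. Here \(\widehat{\text{Cov}}(\hat\beta_1, \hat\beta_2)\) denotes the estimated covariance, taken from the estimated variance-covariance matrix of the OLS coefficients (notation introduced here for illustration):

\[ \text{se}(\hat\beta_1 - \hat\beta_2) = \sqrt{\text{se}(\hat\beta_1)^2 + \text{se}(\hat\beta_2)^2 - 2\,\widehat{\text{Cov}}(\hat\beta_1, \hat\beta_2)} \]

Ignoring the covariance term would generally give the wrong standard error, because \(\hat\beta_1\) and \(\hat\beta_2\) are estimated from the same sample and are typically correlated.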

The Reparametrization Trick

Define \(\theta = \beta_1 - \beta_2\). Then \(H_0: \theta = 0\).

Substitute \(\beta_1 = \theta + \beta_2\) into the regression:

\[ \begin{aligned} \log(\text{wage}) &= \beta_0 + (\theta + \beta_2) \, \text{jc} + \beta_2 \, \text{univ} + \beta_3 \, \text{exper} + U \\ &= \beta_0 + \theta \, \text{jc} + \beta_2 (\text{jc} + \text{univ}) + \beta_3 \, \text{exper} + U \end{aligned} \]

  • Run this transformed regression. The coefficient on jc is \(\hat\theta\), and its standard error is \(\text{se}(\hat\theta)\).
  • Test \(H_0: \theta = 0\) with the usual \(t\)-test — no need for covariance.
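
The algebra behind the trick can be checked on simulated data. The sketch below is a minimal pure-Python illustration, not a replacement for regression software: it fits both regressions by solving the normal equations and confirms that the coefficient on jc in the transformed regression equals \(\hat\beta_1 - \hat\beta_2\) from the original one. The data-generating values, the helper `ols` and the variable name `totcoll` are hypothetical; the standard error of \(\hat\theta\) would be read off the transformed regression's output and is not computed here.

```python
import random

def ols(columns, y):
    """OLS coefficients from the normal equations (X'X) b = X'y; an intercept is added."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]

random.seed(0)
n = 200
jc = [random.uniform(0, 2) for _ in range(n)]
univ = [random.uniform(0, 4) for _ in range(n)]
exper = [random.uniform(0, 30) for _ in range(n)]
# Hypothetical data-generating process, chosen only for illustration
log_wage = [1.5 + 0.07 * j + 0.08 * u + 0.005 * e + random.gauss(0, 0.4)
            for j, u, e in zip(jc, univ, exper)]

# Original regression: log(wage) on jc, univ, exper
b0, b1, b2, b3 = ols([jc, univ, exper], log_wage)

# Transformed regression: log(wage) on jc, (jc + univ), exper
totcoll = [j + u for j, u in zip(jc, univ)]
a0, theta_hat, a2, a3 = ols([jc, totcoll, exper], log_wage)

print(f"b1 - b2   = {b1 - b2:.6f}")
print(f"theta_hat = {theta_hat:.6f}")
```

The coefficient on the combined regressor in the transformed regression also reproduces \(\hat\beta_2\), so nothing about the fit changes; only the parametrization does.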

Multiple Linear Restrictions

  • The \(t\)-test (and reparametrization trick) handles one restriction at a time.
  • When there are multiple restrictions, reparametrization becomes cumbersome — we need a different approach.

Example: Do performance statistics matter for baseball players’ salaries?

\[ \begin{aligned} \log(\text{salary}) = \beta_0 &+ \beta_1 \, \text{years} + \beta_2 \, \text{gamesyr} \\ &+ \beta_3 \, \text{bavg} + \beta_4 \, \text{hrunsyr} + \beta_5 \, \text{rbisyr} + U \end{aligned} \]

where bavg = batting average, hrunsyr = home runs/year, rbisyr = RBIs/year.

\[ H_0: \beta_3 = \beta_4 = \beta_5 = 0 \qquad \text{vs.} \qquad H_1: \text{at least one} \neq 0 \]

Why Not Use Separate \(t\)-Tests?

Can we just test \(\beta_3 = 0\), \(\beta_4 = 0\), and \(\beta_5 = 0\) individually and reject the joint null if any individual test rejects?

  • This is not a valid size-\(\alpha\) test. Even if the \(q\) individual tests were independent, the probability of at least one false rejection would be:

\[ P(\text{reject at least one} \mid H_0) = 1 - (1 - \alpha)^q \]

  • With \(q = 3\) tests at \(\alpha = 0.05\): \(1 - 0.95^3 \approx 0.143\) — nearly three times the nominal level.
  • We need a test designed for joint hypotheses.
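
The size inflation is easy to tabulate. The sketch below evaluates the formula above for several values of \(q\), under the same independence assumption; the function name `familywise_error` is ours.

```python
def familywise_error(alpha, q):
    """P(at least one rejection | H0) for q independent tests, each of size alpha."""
    return 1 - (1 - alpha) ** q

for q in (1, 2, 3, 5, 10):
    print(f"q = {q:2d}: P(reject at least one | H0) = {familywise_error(0.05, q):.3f}")
```

With correlated test statistics the exact number differs, but the "reject if any \(t\)-test rejects" rule still does not deliver a test of size \(\alpha\).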

The \(F\)-Test: Idea

  • Unrestricted model: includes all regressors.
  • Restricted model: imposes \(H_0\) (e.g., drops the variables whose coefficients are set to zero).
  • If \(H_0\) is true, the excluded variables don’t help explain \(Y\) — dropping them should barely worsen the fit.
  • If dropping them worsens the fit substantially, that’s evidence against \(H_0\).
  • How do we measure “worsening the fit”? Compare the sum of squared residuals: \(\text{SSR}_r\) vs. \(\text{SSR}_{ur}\).

The \(F\)-Statistic

\[ F = \frac{(\text{SSR}_r - \text{SSR}_{ur}) / q}{\text{SSR}_{ur} / (n - k - 1)} \]

where \(q\) is the number of restrictions.

  • \(\text{SSR}_r \geq \text{SSR}_{ur}\) always, so \(F \geq 0\).
  • Under \(H_0\): the excluded variables have no effect, so \(\text{SSR}_r\) is only slightly larger than \(\text{SSR}_{ur}\) and \(F\) tends to be small (around 1 on average, not 0).
  • Under \(H_1\): dropping them worsens fit substantially, so \(F\) is large.

Distribution and Decision Rule

Theorem: Under the CLM assumptions and \(H_0\): \(F \sim F_{q, \, n-k-1}\).

Reject \(H_0\) if \(F > c\), where the critical value \(c = F_{1-\alpha, \, q, \, n-k-1}\) is the \((1-\alpha)\) quantile of the \(F_{q, \, n-k-1}\) distribution.

Example: MLB Salaries

Model | SSR
Unrestricted: \(\log(\text{salary})\) on years, gamesyr, bavg, hrunsyr, rbisyr | 183.19
Restricted: \(\log(\text{salary})\) on years, gamesyr | 198.31

\(H_0: \beta_{\text{bavg}} = \beta_{\text{hrunsyr}} = \beta_{\text{rbisyr}} = 0\)   (\(q = 3\), \(n - k - 1 = 347\))

\[ F = \frac{(198.31 - 183.19)/3}{183.19/347} = \frac{5.04}{0.528} = 9.55 \]

  • Critical value at 5% with \((3, 347)\) df: \(2.63\)
  • \(F = 9.55 > 2.63\): reject \(H_0\). Performance statistics are jointly significant.
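
The calculation above can be wrapped in a small function. This is a minimal sketch using the SSR values, degrees of freedom and 5% critical value quoted in the example; the function name `f_statistic` is ours.

```python
def f_statistic(ssr_r, ssr_ur, q, df_ur):
    """F = [(SSR_r - SSR_ur) / q] / [SSR_ur / df_ur], with df_ur = n - k - 1."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)

F = f_statistic(ssr_r=198.31, ssr_ur=183.19, q=3, df_ur=347)
F_CRIT_5PCT = 2.63   # critical value quoted above for (3, 347) df

print(f"F = {F:.2f}")
print("Reject H0 at the 5% level:", F > F_CRIT_5PCT)
```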

The \(R^2\) Form

Since \(\text{SSR} = \text{SST}(1 - R^2)\) and SST cancels:

\[ F = \frac{(R^2_{ur} - R^2_r) / q}{(1 - R^2_{ur}) / (n - k - 1)} \]

  • Useful when software reports \(R^2\) but not SSR.
  • Requires the same dependent variable in both models (otherwise SST does not cancel).
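
The cancellation can be written out in one step. Substituting \(\text{SSR}_r = \text{SST}(1 - R^2_r)\) and \(\text{SSR}_{ur} = \text{SST}(1 - R^2_{ur})\) into the SSR form, with the same SST in both models:

\[ \begin{aligned} F &= \frac{\left[\text{SST}(1 - R^2_r) - \text{SST}(1 - R^2_{ur})\right] / q}{\text{SST}(1 - R^2_{ur}) / (n - k - 1)} \\ &= \frac{(R^2_{ur} - R^2_r) / q}{(1 - R^2_{ur}) / (n - k - 1)} \end{aligned} \]

If the two models had different dependent variables, the two SST terms would differ and this step would fail.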

Joint vs. Individual Significance

  • The \(F\)-test can reject the joint null even when no individual \(t\)-test rejects.
  • Why? Multicollinearity: correlated regressors make individual effects hard to isolate, but their joint contribution can still be large.
  • This is precisely when the \(F\)-test is most valuable.

Overall Significance of a Regression

Testing whether all regressors are jointly significant:

\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \]

  • The restricted model is just the intercept (\(R^2_r = 0\)), so:

\[ F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} \]

  • Most software reports this \(F\)-statistic automatically.

\(t\)-Test vs. \(F\)-Test

Aspect | \(t\)-test | \(F\)-test
Restrictions | Single | One or more
Distribution | \(t_{n-k-1}\) | \(F_{q, \, n-k-1}\)
Direction | One- or two-sided | Rejects in the upper tail only (\(F \geq 0\))

  • For a single restriction tested against a two-sided alternative, the two are equivalent: \(t^2 = F\), and if \(T \sim t_{n-k-1}\) then \(T^2 \sim F_{1,\, n-k-1}\).

Summary

  • Under the CLM assumptions (MLR.1–MLR.6), OLS estimators are normally distributed, enabling exact finite-sample inference.
  • \(t\)-test: tests a single linear restriction using \(t = (\hat\beta_j - \beta_{j,0}) / \text{se}(\hat\beta_j)\).
  • \(F\)-test: tests multiple linear restrictions by comparing restricted and unrestricted model fit.
  • Confidence intervals: the set of values not rejected by a two-sided test.
  • Statistical significance \(\neq\) economic significance — always assess the magnitude of the effect.

What’s Next

Lecture 3b — Asymptotics and Heteroskedasticity:

  • Large-sample properties of OLS (consistency, asymptotic normality)
  • Inference without normality
  • Heteroskedasticity-robust standard errors