Dummy Variables

Natasha Kang

Xiamen University, Chow Institute

April, 2026

Roadmap

  1. Dummy variables as regressors: intercept shifts, slope shifts, multiple categories
  2. Testing for structural differences (Chow test)
  3. The linear probability model

Qualitative Information in Regression

  • So far, all regressors have been quantitative (wages, education years, rooms).
  • But many important factors are qualitative: gender, marital status, region, industry.
  • OLS requires numerical inputs. We encode qualitative information using dummy variables (also called indicator variables):

\[ \text{female}_i = \begin{cases} 1 & \text{if person } i \text{ is a woman} \\ 0 & \text{if person } i \text{ is a man} \end{cases} \]
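
As a minimal sketch of this encoding (the list of labels below is a hypothetical sample, not course data), the dummy can be built from the raw labels in one line:

```python
# Hypothetical raw data: one gender label per person.
gender = ["woman", "man", "woman", "man", "man"]

# female_i = 1 if person i is a woman, 0 otherwise.
female = [1 if g == "woman" else 0 for g in gender]

print(female)  # [1, 0, 1, 0, 0]
```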

A Single Dummy Variable: Intercept Shift

Consider:

\[ \text{wage}_i = \beta_0 + \delta_0\, \text{female}_i + \beta_1\, \text{educ}_i + u_i \]

For men (\(\text{female}_i = 0\)):

\[ E[\text{wage}_i \mid \text{female}_i = 0,\, \text{educ}_i] = \beta_0 + \beta_1\, \text{educ}_i \]

For women (\(\text{female}_i = 1\)):

\[ E[\text{wage}_i \mid \text{female}_i = 1,\, \text{educ}_i] = (\beta_0 + \delta_0) + \beta_1\, \text{educ}_i \]

\(\delta_0\) is the intercept shift: the wage difference between women and men, holding education constant.
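
The intercept shift can be illustrated on simulated data. The sketch below assumes illustrative "true" parameters (they are not estimates from any dataset) and uses plain NumPy least squares; the fitted intercepts for the two groups differ by exactly the estimated dummy coefficient, while the slope is shared.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data; the "true" parameters below are illustrative assumptions.
beta0, delta0, beta1 = 1.0, -2.0, 0.5
female = rng.integers(0, 2, size=n)
educ = rng.uniform(8, 18, size=n)
wage = beta0 + delta0 * female + beta1 * educ + rng.normal(0, 1, size=n)

# OLS of wage on a constant, female, and educ.
X = np.column_stack([np.ones(n), female, educ])
b0_hat, d0_hat, b1_hat = np.linalg.lstsq(X, wage, rcond=None)[0]

# Fitted lines: same slope b1_hat, intercepts differ by d0_hat.
intercept_men = b0_hat
intercept_women = b0_hat + d0_hat
print(intercept_men, intercept_women, b1_hat)
```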

Visualizing the Intercept Shift

The dummy shifts the intercept — the two groups have parallel regression lines.

Example: Gender Wage Gap

Data: wage1

\[ \begin{aligned} \widehat{\text{wage}} = \underset{(0.72)}{-1.57} &\underset{(0.26)}{- 1.81}\; \text{female} + \underset{(0.049)}{0.572}\; \text{educ} \\ &+ \underset{(0.012)}{0.025}\; \text{exper} + \underset{(0.021)}{0.141}\; \text{tenure} \end{aligned} \]

  • \(\hat\delta_0 = -1.81\): on average, a woman earns $1.81 per hour less than a man with the same education, experience, and tenure.
  • The \(t\)-statistic is \(-1.81/0.26 = -6.96\) — highly significant.

Dummies with a Log Dependent Variable

\[ \log(\text{wage}) = \beta_0 + \delta_0\, \text{female} + \beta_1\, \text{educ} + \cdots + u_i \]

From lect4a, the coefficient on a dummy in a log model has a percentage interpretation:

\[ \delta_0 = \log(\text{wage}^F) - \log(\text{wage}^M) = \log\!\left(\frac{\text{wage}^F}{\text{wage}^M}\right) \approx \frac{\text{wage}^F - \text{wage}^M}{\text{wage}^M} \]

For large differences, use the exact formula: \(\%\Delta = 100 \cdot [\exp(\hat\delta_0) - 1]\).

Example: Log Wage Equation

Data: wage1

\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(0.099)}{0.417} &\underset{(0.036)}{- 0.297}\; \text{female} + \underset{(0.007)}{0.080}\; \text{educ} \\ &+ \underset{(0.005)}{0.029}\; \text{exper} \underset{(0.00010)}{- 0.00058}\; \text{exper}^2 \\ &+ \underset{(0.007)}{0.032}\; \text{tenure} \underset{(0.00023)}{- 0.00059}\; \text{tenure}^2 \end{aligned} \]

  • Approximate: women earn \(29.7\%\) less than comparable men.
  • Exact: \(100 \cdot [\exp(-0.297) - 1] = -25.7\%\).
  • The approximation overstates the gap because \(|\hat\delta_0| = 0.297\) is not small.
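
The two percentage figures can be checked directly. This short sketch only redoes the arithmetic with the coefficient reported above:

```python
import math

delta_hat = -0.297  # coefficient on female from the log wage equation above

approx_pct = 100 * delta_hat                 # log approximation
exact_pct = 100 * (math.exp(delta_hat) - 1)  # exact percentage difference

print(round(approx_pct, 1), round(exact_pct, 1))  # -29.7 -25.7
```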

The Base Group and Changing It

Men are the base group — the omitted category. The dummy coefficient measures the difference relative to this group.

What if we use \(\text{male}_i\) instead?

\[ \log(\text{wage}) = \theta_0 + \gamma_0\, \text{male}_i + \beta_1\, \text{educ} + u_i \]

Since \(\text{male}_i = 1 - \text{female}_i\), matching coefficients gives:

\[ \gamma_0 = -\delta_0, \quad \theta_0 = \beta_0 + \delta_0 \]

Changing the base group flips the sign of the dummy coefficient — it does not change the substance.
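
A quick numerical check of these identities, as a sketch on simulated data (the coefficients used to generate the data are assumptions for illustration): regressing on \(\text{male}\) instead of \(\text{female}\) reproduces \(\gamma_0 = -\delta_0\) and \(\theta_0 = \beta_0 + \delta_0\) exactly in the OLS estimates, and leaves the education coefficient unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
female = rng.integers(0, 2, size=n)
male = 1 - female
educ = rng.uniform(8, 18, size=n)
# Simulated log wages; the coefficients are illustrative assumptions.
logwage = 0.5 - 0.3 * female + 0.08 * educ + rng.normal(0, 0.4, size=n)

def ols(*cols):
    """OLS coefficients from regressing logwage on a constant and cols."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.lstsq(X, logwage, rcond=None)[0]

b0, d0, b1 = ols(female, educ)    # base group: men
t0, g0, b1_alt = ols(male, educ)  # base group: women

print(np.isclose(g0, -d0), np.isclose(t0, b0 + d0), np.isclose(b1_alt, b1))
```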

The Dummy Variable Trap

What if we include both \(\text{female}_i\) and \(\text{male}_i\) along with an intercept?

\[ \log(\text{wage}) = \alpha_0 + \delta_0\, \text{female}_i + \gamma_0\, \text{male}_i + \beta_1\, \text{educ} + u_i \]

Since \(\text{female}_i + \text{male}_i = 1\) for all \(i\), the intercept column equals the sum of the two dummy columns, so the three “regressors” (intercept, female, male) are perfectly collinear. The OLS coefficients are then not uniquely determined.

Rule: for a binary qualitative variable, include one dummy and omit the other. The omitted category is the base group.
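
The trap shows up as a rank-deficient design matrix. In the sketch below (all data values are hypothetical), including both dummies with an intercept gives four columns but only rank three, while dropping one dummy restores full column rank:

```python
import numpy as np

female = np.array([1, 0, 1, 0, 0, 1])
male = 1 - female
educ = np.array([12, 16, 14, 10, 12, 18])  # hypothetical values
const = np.ones(len(female))

X_trap = np.column_stack([const, female, male, educ])  # both dummies + intercept
X_ok = np.column_stack([const, female, educ])          # one dummy omitted

print(np.linalg.matrix_rank(X_trap), X_trap.shape[1])  # 3 4
print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])      # 3 3
```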

Multiple Categories

Qualitative variables often have more than two levels (e.g., race, region, education level).

Example: marital status with 3 categories — married, single, divorced.

\[ \text{wage} = \beta_0 + \delta_1\, \text{single}_i + \delta_2\, \text{divorced}_i + \beta_1\, \text{educ} + u_i \]

  • Married is the base group.
  • \(\delta_1\): wage difference between single and married workers, holding education fixed.
  • \(\delta_2\): wage difference between divorced and married workers, holding education fixed.

General rule: a qualitative variable with \(m\) categories requires \(m - 1\) dummy variables.
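
A minimal sketch of the \(m - 1\) rule for the marital-status example (the sample of labels is hypothetical): one 0/1 column is created for every category except the chosen base group.

```python
status = ["married", "single", "divorced", "single", "married"]  # hypothetical sample
categories = ["married", "single", "divorced"]
base = "married"  # omitted category

# One 0/1 column for every category except the base group.
dummies = {
    c: [1 if s == c else 0 for s in status]
    for c in categories if c != base
}

print(dummies)
# {'single': [0, 1, 0, 1, 0], 'divorced': [0, 0, 1, 0, 0]}
```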

Slope Shifts: Dummy–Continuous Interactions

An intercept shift assumes the return to education is the same for men and women. What if it differs?

\[ \log(\text{wage}) = \beta_0 + \delta_0\, \text{female} + \beta_1\, \text{educ} + \delta_1\, (\text{female} \cdot \text{educ}) + u_i \]

For men:

\[ \log(\text{wage}^M) = \beta_0 + \beta_1\, \text{educ} \]

For women:

\[ \log(\text{wage}^F) = (\beta_0 + \delta_0) + (\beta_1 + \delta_1)\, \text{educ} \]

\(\delta_1\) measures the difference in the return to education between women and men.

Visualizing the Slope Shift

With a slope shift, the two groups have different slopes — the lines are no longer parallel.

Example: Returns to Education by Gender

Data: wage1

\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(0.119)}{0.389} &\underset{(0.168)}{- 0.227}\; \text{female} + \underset{(0.008)}{0.082}\; \text{educ} \\ &\underset{(0.0131)}{- 0.0056}\; \text{female} \cdot \text{educ} \\ &+ \underset{(0.005)}{0.029}\; \text{exper} \underset{(0.00011)}{- 0.00058}\; \text{exper}^2 \\ &+ \underset{(0.007)}{0.032}\; \text{tenure} \underset{(0.00024)}{- 0.00059}\; \text{tenure}^2 \end{aligned} \]

  • \(\hat\delta_1 = -0.0056\): the return to education for women is 0.56 percentage points lower than for men.
  • But \(t = -0.0056/0.0131 = -0.43\) — not significant. We cannot reject equal returns to education.
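
The group-specific returns and the \(t\)-statistic follow from simple arithmetic on the reported estimates, sketched here:

```python
# Estimates reported in the equation above.
b_educ, b_inter = 0.082, -0.0056
se_inter = 0.0131

return_men = b_educ              # slope of log(wage) in educ for men
return_women = b_educ + b_inter  # slope for women
t_inter = b_inter / se_inter     # t-statistic on the interaction

print(round(return_men, 4), round(return_women, 4), round(t_inter, 2))
# 0.082 0.0764 -0.43
```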

Interactions Among Dummy Variables

What about interactions between two dummies?

\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(.100)}{.321} &\underset{(.056)}{- .110}\; \text{female} + \underset{(.055)}{.231}\; \text{married} \\ &\underset{(.072)}{- .301}\; \text{female} \cdot \text{married} + \cdots \end{aligned} \]

This creates four groups:

\[ \begin{array}{lcc} & \text{Single} & \text{Married} \\ \hline \text{Men} & 0.321 \text{ (base)} & 0.321 + 0.231 = 0.552 \\ \text{Women} & 0.321 - 0.110 = 0.211 & 0.321 - 0.110 + 0.231 - 0.301 = 0.141 \end{array} \]

The marriage premium is \(0.231\) for men but \(0.231 - 0.301 = -0.070\) for women. Marriage is associated with higher wages for men but lower wages for women.
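
The four group intercepts and the two marriage premiums can be reproduced from the reported coefficients. The helper name `group_intercept` below is hypothetical, introduced only for this sketch:

```python
# Estimates reported in the equation above (other regressors held fixed).
b0, b_female, b_married, b_inter = 0.321, -0.110, 0.231, -0.301

def group_intercept(female, married):
    """Intercept for the group defined by the two 0/1 indicators."""
    return b0 + b_female * female + b_married * married + b_inter * female * married

for female in (0, 1):
    for married in (0, 1):
        print(female, married, round(group_intercept(female, married), 3))

premium_men = group_intercept(0, 1) - group_intercept(0, 0)
premium_women = group_intercept(1, 1) - group_intercept(1, 0)
print(round(premium_men, 3), round(premium_women, 3))  # 0.231 -0.07
```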

Are Two Groups Truly Different?

So far we’ve tested individual dummy coefficients. But what if we want to test whether any coefficient differs across groups, that is, whether the two groups share the same regression function?

Example: do the wage equations for men and women differ in any way — intercept, returns to education, returns to experience, etc.?

This is a joint test that the coefficients on the group dummy (\(\delta_0\)) and on its interactions with each regressor (\(\delta_1, \ldots, \delta_k\)) are all zero:

\[ H_0: \delta_0 = \delta_1 = \cdots = \delta_k = 0 \]

The Chow Test via Dummies

Include a group dummy and its interaction with every regressor:

\[ y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \delta_0 D_i + \delta_1 D_i x_{1i} + \cdots + \delta_k D_i x_{ki} + u_i \]

  • Unrestricted model: the equation above (allows all coefficients to differ).
  • Restricted model: drop all \(\delta\) terms (forces identical coefficients across groups).

Test \(H_0: \delta_0 = \delta_1 = \cdots = \delta_k = 0\) using the \(F\)-statistic:

\[ F = \frac{(SSR_r - SSR_{ur}) / (k+1)}{SSR_{ur} / (n - 2(k+1))} \sim F_{k+1,\; n - 2(k+1)} \]

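This is just the exclusion restriction \(F\)-test from lect3a with \(k+1\) restrictions; nothing new. The \(F_{k+1,\; n - 2(k+1)}\) distribution applies under \(H_0\) with homoskedastic errors. Under heteroskedasticity, use a robust version of the joint test.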
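
A minimal sketch of the dummy-interaction version of the test on simulated data. The data-generating parameters are assumptions chosen so that the groups really differ; the helper `ssr` is a hypothetical name for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 400, 2

# Simulated data; the D = 1 group has a different intercept and slopes by assumption.
D = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, k))
y = (1.0 + x @ np.array([0.5, -0.3])
     + D * (0.4 + x @ np.array([0.2, 0.1]))
     + rng.normal(size=n))

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

X_r = np.column_stack([np.ones(n), x])            # restricted: no delta terms
X_ur = np.column_stack([X_r, D, D[:, None] * x])  # unrestricted: adds D and D * x_j

q = k + 1             # number of restrictions
df = n - 2 * (k + 1)  # residual degrees of freedom in the unrestricted model
F = ((ssr(X_r, y) - ssr(X_ur, y)) / q) / (ssr(X_ur, y) / df)
print(F)
```

The statistic would then be compared with a critical value from the \(F_{k+1,\; n-2(k+1)}\) distribution.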

Equivalence: Separate Regressions

An equivalent approach: estimate separate regressions for each group.

\[ \begin{aligned} y_i &= \beta_{1,0} + \beta_{1,1} x_{1i} + \cdots + \beta_{1,k} x_{ki} + u_i \quad \text{(group 1)} \\ y_i &= \beta_{2,0} + \beta_{2,1} x_{1i} + \cdots + \beta_{2,k} x_{ki} + u_i \quad \text{(group 2)} \end{aligned} \]

Then \(SSR_{ur} = SSR_1 + SSR_2\), and the Chow \(F\)-statistic is:

\[ F = \frac{(SSR_p - SSR_1 - SSR_2) / (k+1)}{(SSR_1 + SSR_2) / (n_1 + n_2 - 2(k+1))} \]

where \(SSR_p\) is the SSR from the pooled (single-equation) regression.

Both approaches — dummy interactions and separate regressions — yield the same \(F\)-statistic.
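
The equivalence can be verified numerically. In this sketch on simulated data (parameters are illustrative assumptions), the sum of the two group-specific SSRs matches the SSR of the fully interacted regression, so the two \(F\)-statistics coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 300, 2
D = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, k))
# Simulated outcome; coefficients are illustrative assumptions.
y = (1.0 + 0.5 * x[:, 0] - 0.3 * x[:, 1]
     + 0.4 * D + 0.2 * D * x[:, 0] + rng.normal(size=n))

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

X = np.column_stack([np.ones(n), x])
ssr_p = ssr(X, y)                                       # pooled regression
ssr_1 = ssr(X[D == 0], y[D == 0])                       # first group only
ssr_2 = ssr(X[D == 1], y[D == 1])                       # second group only
ssr_ur = ssr(np.column_stack([X, D[:, None] * X]), y)   # fully interacted model

df = n - 2 * (k + 1)
F_sep = ((ssr_p - ssr_1 - ssr_2) / (k + 1)) / ((ssr_1 + ssr_2) / df)
F_dum = ((ssr_p - ssr_ur) / (k + 1)) / (ssr_ur / df)
print(np.isclose(F_sep, F_dum))  # True
```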

When the Dependent Variable Is Binary

So far, dummies have been regressors. But the dependent variable can also be binary:

  • Whether an individual is employed or not
  • Whether a borrower defaults on a loan
  • Whether a student is admitted to a university

If \(y_i \in \{0, 1\}\), then \(E[y \mid \mathbf{x}] = P(y = 1 \mid \mathbf{x})\) — the conditional mean is the probability. If we model this probability as linear:

\[ P(y = 1 \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \]

this is the linear probability model (LPM). Estimation is OLS as before.

Example: Labor Force Participation

Data: mroz

\[ \begin{aligned} \widehat{\text{inlf}} = \underset{(.154)}{.586} &\underset{(.0014)}{- .0034}\; \text{nwifeinc} + \underset{(.007)}{.038}\; \text{educ} \\ &+ \underset{(.006)}{.039}\; \text{exper} \underset{(.00018)}{- .00060}\; \text{exper}^2 \\ &\underset{(.002)}{- .016}\; \text{age} \underset{(.034)}{- .262}\; \text{kidslt6} + \underset{(.013)}{.013}\; \text{kidsge6} \end{aligned} \]

  • \(\hat\beta_{\text{kidslt6}} = -0.262\): one additional child under 6 reduces the probability of being in the labor force by 26.2 percentage points.
  • \(\hat\beta_{\text{educ}} = 0.038\): one more year of education increases the probability by 3.8 percentage points.
  • Coefficients measure changes in probability, not in a continuous outcome.
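
To see the probability interpretation in use, the sketch below plugs a hypothetical woman into the fitted equation. Her regressor values are assumptions for illustration, and `predicted_prob` is a hypothetical helper. Adding one child under 6 lowers the predicted probability by exactly the `kidslt6` coefficient.

```python
# Coefficients from the fitted LPM above.
coef = {
    "const": 0.586, "nwifeinc": -0.0034, "educ": 0.038, "exper": 0.039,
    "exper2": -0.00060, "age": -0.016, "kidslt6": -0.262, "kidsge6": 0.013,
}

def predicted_prob(nwifeinc, educ, exper, age, kidslt6, kidsge6):
    """Fitted P(inlf = 1 | x) from the linear probability model."""
    return (coef["const"] + coef["nwifeinc"] * nwifeinc + coef["educ"] * educ
            + coef["exper"] * exper + coef["exper2"] * exper ** 2
            + coef["age"] * age + coef["kidslt6"] * kidslt6
            + coef["kidsge6"] * kidsge6)

# Hypothetical woman: the regressor values are assumptions for illustration.
base = dict(nwifeinc=20, educ=12, exper=10, age=30, kidslt6=0, kidsge6=0)
p0 = predicted_prob(**base)
p1 = predicted_prob(**{**base, "kidslt6": 1})

print(round(p0, 3), round(p1, 3), round(p1 - p0, 3))  # 0.824 0.562 -0.262
```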

Shortcomings of the LPM

The LPM has two main shortcomings:

  • Fitted values are not constrained to \([0, 1]\): predicted probabilities can be negative or exceed 1, which is nonsensical.
  • The marginal effect of each regressor is constant, which may be unrealistic near the boundaries.
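
A small simulation makes the first problem concrete. The sketch assumes a logistic data-generating process (an assumption for illustration, so the true probabilities lie strictly inside \((0, 1)\)) and fits the LPM by OLS; a share of the fitted values still falls outside the unit interval.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(0, 2, size=n)

# Assumed data-generating process: a logistic probability, so the truth is in (0, 1).
p_true = 1 / (1 + np.exp(-1.5 * x))
y = (rng.uniform(size=n) < p_true).astype(float)

# LPM: OLS of the binary outcome on a constant and x.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
p_hat = X @ beta_hat

# Count fitted "probabilities" below 0 or above 1.
n_outside = np.sum((p_hat < 0) | (p_hat > 1))
print(beta_hat, n_outside)
```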

The LPM Is Heteroskedastic by Construction

Since \(y_i \in \{0, 1\}\):

\[ \text{Var}(y \mid \mathbf{x}) = P(y = 1 \mid \mathbf{x})\, [1 - P(y = 1 \mid \mathbf{x})] \]

Under the LPM, \(P(y = 1 \mid \mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}\), so:

\[ \text{Var}(u \mid \mathbf{x}) = (\mathbf{x}'\boldsymbol{\beta})(1 - \mathbf{x}'\boldsymbol{\beta}) \]

The error variance depends on \(\mathbf{x}\): heteroskedasticity is built in, not a special case.

This is an instance of what we saw in lect3b: the model structure itself can generate heteroskedasticity. Always use heteroskedasticity-robust standard errors with the LPM.
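
One way to compute robust standard errors by hand is the White (HC0) sandwich formula, sketched below with NumPy on simulated data. The true LPM used to generate the data is an assumption for illustration, and HC0 is one of several robust variants, not necessarily the one used in the course.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0, 1, size=n)

# Assumed true LPM: P(y = 1 | x) = 0.1 + 0.8 x, which stays inside [0, 1].
p = 0.1 + 0.8 * x
y = (rng.uniform(size=n) < p).astype(float)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

# Usual (homoskedasticity-only) variance estimate.
V_usual = XtX_inv * (u_hat @ u_hat) / (n - X.shape[1])
# White / HC0 sandwich: (X'X)^{-1} [sum_i u_i^2 x_i x_i'] (X'X)^{-1}.
meat = (X * u_hat[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv

se_usual = np.sqrt(np.diag(V_usual))
se_robust = np.sqrt(np.diag(V_robust))
print(se_usual, se_robust)
```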

When Is the LPM Still Useful?

Despite its shortcomings:

  • The LPM works well for values of \(\mathbf{x}\) near the sample mean, where predicted probabilities are away from the boundaries.
  • It is easy to estimate, interpret, and extend with the same tools (OLS, \(t\)-tests, \(F\)-tests).
  • For many policy questions, the marginal effect at the mean is the quantity of interest — and the LPM delivers this directly.

For applications where boundary behavior matters, nonlinear alternatives (logit, probit) constrain probabilities to \([0, 1]\). These require maximum likelihood estimation — beyond the scope of this course.

Summary

  • Dummy variables encode qualitative information as 0/1 regressors. With \(m\) categories, use \(m - 1\) dummies.
  • An intercept dummy shifts the regression line; a dummy–continuous interaction allows different slopes across groups.
  • The Chow test checks whether any coefficient differs across groups: it is an \(F\)-test on the joint significance of the group dummy and all of its interactions.
  • The LPM applies OLS to a binary outcome. Coefficients measure changes in probability. Use robust standard errors because the model is heteroskedastic by construction.

What’s Next

Lecture 4c — Model Selection & Controls:

  • Bias-variance tradeoff and adjusted \(R^2\)
  • Specification testing (RESET)
  • Good controls vs. bad controls