Xiamen University, Chow Institute
April, 2026
\[ \text{female}_i = \begin{cases} 1 & \text{if person } i \text{ is a woman} \\ 0 & \text{if person } i \text{ is a man} \end{cases} \]
Consider:
\[ \text{wage}_i = \beta_0 + \delta_0\, \text{female}_i + \beta_1\, \text{educ}_i + u_i \]
For men (\(\text{female}_i = 0\)):
\[ E[\text{wage}_i \mid \text{female}_i = 0,\, \text{educ}_i] = \beta_0 + \beta_1\, \text{educ}_i \]
For women (\(\text{female}_i = 1\)):
\[ E[\text{wage}_i \mid \text{female}_i = 1,\, \text{educ}_i] = (\beta_0 + \delta_0) + \beta_1\, \text{educ}_i \]
\(\delta_0\) is the intercept shift: the wage difference between women and men, holding education constant.
The dummy shifts the intercept — the two groups have parallel regression lines.
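To see why \(\delta_0\) is the intercept shift, subtract the two conditional means at the same level of education; the \(\beta_0\) and \(\beta_1\, \text{educ}_i\) terms cancel:
\[ E[\text{wage}_i \mid \text{female}_i = 1,\, \text{educ}_i] - E[\text{wage}_i \mid \text{female}_i = 0,\, \text{educ}_i] = (\beta_0 + \delta_0 + \beta_1\, \text{educ}_i) - (\beta_0 + \beta_1\, \text{educ}_i) = \delta_0 \]
The gap does not depend on \(\text{educ}_i\), which is exactly what parallel lines mean.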
Data: wage1
\[ \begin{aligned} \widehat{\text{wage}} = \underset{(0.72)}{-1.57} &\underset{(0.26)}{- 1.81}\; \text{female} + \underset{(0.049)}{0.572}\; \text{educ} \\ &+ \underset{(0.012)}{0.025}\; \text{exper} + \underset{(0.021)}{0.141}\; \text{tenure} \end{aligned} \]
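As a worked reading of these estimates (assuming, as in the standard wage1 data, that wage is measured in dollars per hour): holding educ, exper, and tenure fixed, women are predicted to earn about 1.81 dollars per hour less than men. The \(t\)-statistic on the dummy is
\[ t_{\text{female}} = \frac{-1.81}{0.26} \approx -6.96 \]
so the estimated gap is statistically significant at conventional levels.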
\[ \log(\text{wage}) = \beta_0 + \delta_0\, \text{female} + \beta_1\, \text{educ} + \cdots + u_i \]
From lect4a, the coefficient on a dummy in a log model has a percentage interpretation:
\[ \delta_0 = \log(\text{wage}^F) - \log(\text{wage}^M) = \log\!\left(\frac{\text{wage}^F}{\text{wage}^M}\right) \approx \frac{\text{wage}^F - \text{wage}^M}{\text{wage}^M} \]
For large differences, use the exact formula: \(\%\Delta = 100 \cdot [\exp(\hat\delta_0) - 1]\).
Data: wage1
\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(0.099)}{0.417} &\underset{(0.036)}{- 0.297}\; \text{female} + \underset{(0.007)}{0.080}\; \text{educ} \\ &+ \underset{(0.005)}{0.029}\; \text{exper} \underset{(0.00010)}{- 0.00058}\; \text{exper}^2 \\ &+ \underset{(0.007)}{0.032}\; \text{tenure} \underset{(0.00023)}{- 0.00059}\; \text{tenure}^2 \end{aligned} \]
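The gap between the log-point approximation and the exact formula can be checked numerically. The sketch below is only illustrative: the function names are hypothetical, and the only input taken from the slides is the coefficient \(-0.297\) on female.

```python
import math


def exact_pct_change(delta_hat):
    """Exact percentage difference implied by a dummy coefficient in a log model."""
    return 100.0 * (math.exp(delta_hat) - 1.0)


def approx_pct_change(delta_hat):
    """Log-point approximation: 100 * delta_hat."""
    return 100.0 * delta_hat


delta_female = -0.297  # coefficient on female from the estimates above

print(f"approximate: {approx_pct_change(delta_female):.1f}%")  # -29.7%
print(f"exact:       {exact_pct_change(delta_female):.1f}%")   # -25.7%
```

The approximation overstates the gap by about four percentage points here, which is why the exact formula is preferred for coefficients of this size.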
Men are the base group — the omitted category. The dummy coefficient measures the difference relative to this group.
What if we use \(\text{male}_i\) instead?
\[ \log(\text{wage}) = \theta_0 + \gamma_0\, \text{male}_i + \beta_1\, \text{educ} + u_i \]
Since \(\text{male}_i = 1 - \text{female}_i\), matching coefficients gives:
\[ \gamma_0 = -\delta_0, \quad \theta_0 = \beta_0 + \delta_0 \]
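The matching step can be spelled out by substituting \(\text{female}_i = 1 - \text{male}_i\) into the equation with the female dummy, \(\log(\text{wage}) = \beta_0 + \delta_0\, \text{female}_i + \beta_1\, \text{educ} + u_i\):
\[ \begin{aligned} \log(\text{wage}) &= \beta_0 + \delta_0\,(1 - \text{male}_i) + \beta_1\, \text{educ} + u_i \\ &= \underbrace{(\beta_0 + \delta_0)}_{\theta_0} + \underbrace{(-\delta_0)}_{\gamma_0}\, \text{male}_i + \beta_1\, \text{educ} + u_i \end{aligned} \]
Applied to the estimated log-wage equation reported earlier, the coefficient on male would be \(+0.297\) and the intercept \(0.417 - 0.297 = 0.120\), with all other coefficients unchanged.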
Changing the base group flips the sign of the dummy coefficient — it does not change the substance.
What if we include both \(\text{female}_i\) and \(\text{male}_i\) along with an intercept?
\[ \log(\text{wage}) = \alpha_0 + \delta_0\, \text{female}_i + \gamma_0\, \text{male}_i + \beta_1\, \text{educ} + u_i \]
Since \(\text{female}_i + \text{male}_i = 1\) for all \(i\), the three regressors (intercept, female, male) are perfectly collinear, so OLS cannot be computed. This is the dummy variable trap.
Rule: for a binary qualitative variable, include one dummy and omit the other. The omitted category is the base group.
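A small numerical check makes the collinearity concrete. This is a minimal sketch on a hypothetical five-person sample, with educ left out for brevity; the helper names `gram` and `det3` are illustrative, not part of any library.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def gram(columns):
    """X'X for a design matrix given as a list of columns."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]


# Hypothetical sample of five people (illustrative values only).
female = [1, 0, 1, 1, 0]
male = [1 - f for f in female]   # male_i = 1 - female_i
intercept = [1] * len(female)    # the constant regressor

# The constant equals female + male for every observation.
assert all(c == f + m for c, f, m in zip(intercept, female, male))

xtx = gram([intercept, female, male])
print(det3(xtx))  # 0: X'X is singular, so it cannot be inverted
```

Because \(\mathbf{X}'\mathbf{X}\) has determinant zero, the OLS formula \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\) has no unique solution; dropping either dummy removes the problem.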
Qualitative variables often have more than two levels (e.g., race, region, education level).
Example: marital status with 3 categories — married, single, divorced.
\[ \text{wage} = \beta_0 + \delta_1\, \text{single}_i + \delta_2\, \text{divorced}_i + \beta_1\, \text{educ} + u_i \]
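General rule: a qualitative variable with \(m\) categories requires \(m - 1\) dummy variables. The omitted category (here, married) is the base group, so \(\delta_1\) and \(\delta_2\) measure differences relative to married workers.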
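Constructing the \(m - 1\) dummies is mechanical. The sketch below uses a hypothetical helper, `make_dummies`, and made-up marital-status values; it is one simple way to build the regressors, not a prescribed method.

```python
def make_dummies(values, base):
    """Return one 0/1 dummy per category except the base category."""
    categories = sorted(set(values))
    if base not in categories:
        raise ValueError(f"base category {base!r} not found in data")
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != base
    }


# Hypothetical marital-status data (illustrative values only).
status = ["married", "single", "divorced", "married", "single"]
dummies = make_dummies(status, base="married")

print(sorted(dummies))      # ['divorced', 'single']
print(dummies["single"])    # [0, 1, 0, 0, 1]
print(dummies["divorced"])  # [0, 0, 1, 0, 0]
```

Three categories yield two dummies; observations in the base category are the rows where every dummy equals zero.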
An intercept shift assumes the return to education is the same for men and women. What if it differs?
\[ \log(\text{wage}) = \beta_0 + \delta_0\, \text{female} + \beta_1\, \text{educ} + \delta_1\, (\text{female} \cdot \text{educ}) + u_i \]
For men:
\[ \log(\text{wage}^M) = \beta_0 + \beta_1\, \text{educ} \]
For women:
\[ \log(\text{wage}^F) = (\beta_0 + \delta_0) + (\beta_1 + \delta_1)\, \text{educ} \]
\(\delta_1\) measures the difference in the return to education between women and men.
With a slope shift, the two groups have different slopes — the lines are no longer parallel.
Data: wage1
\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(0.119)}{0.389} &\underset{(0.168)}{- 0.227}\; \text{female} + \underset{(0.008)}{0.082}\; \text{educ} \\ &\underset{(0.0131)}{- 0.0056}\; \text{female} \cdot \text{educ} \\ &+ \underset{(0.005)}{0.029}\; \text{exper} \underset{(0.00011)}{- 0.00058}\; \text{exper}^2 \\ &+ \underset{(0.007)}{0.032}\; \text{tenure} \underset{(0.00024)}{- 0.00059}\; \text{tenure}^2 \end{aligned} \]
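These estimates can be read through the two group equations. The calculation below uses only the coefficients reported above; the evaluation point \(\text{educ} = 12.5\) is an illustrative choice, not a value taken from the slides.
\[ \begin{aligned} \hat\beta_1 &= 0.082 && \text{(return to education, men)} \\ \hat\beta_1 + \hat\delta_1 &= 0.082 - 0.0056 = 0.0764 && \text{(return to education, women)} \\ t_{\hat\delta_1} &= \frac{-0.0056}{0.0131} \approx -0.43 && \text{(test of equal returns)} \\ \hat\delta_0 + \hat\delta_1 \cdot 12.5 &= -0.227 - 0.0056 \cdot 12.5 = -0.297 && \text{(gender gap at educ} = 12.5\text{)} \end{aligned} \]
The interaction term is statistically insignificant, so these estimates give little evidence that the return to education differs by gender. Note that \(\hat\delta_0 = -0.227\) on its own is the estimated gap at \(\text{educ} = 0\); the gap at any other education level is \(\hat\delta_0 + \hat\delta_1\, \text{educ}\).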
What about interactions between two dummies?
\[ \begin{aligned} \widehat{\log(\text{wage})} = \underset{(.100)}{.321} &\underset{(.056)}{- .110}\; \text{female} + \underset{(.055)}{.231}\; \text{married} \\ &\underset{(.072)}{- .301}\; \text{female} \cdot \text{married} + \cdots \end{aligned} \]
This creates four groups:
| | Single | Married |
|---|---|---|
| Men (base) | \(0.321\) | \(0.321 + 0.231 = 0.552\) |
| Women | \(0.321 - 0.110 = 0.211\) | \(0.321 - 0.110 + 0.231 - 0.301 = 0.141\) |
The marriage premium is \(0.231\) for men but \(0.231 - 0.301 = -0.070\) for women. Marriage is associated with higher wages for men but lower wages for women.
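The same four cells can also be read the other way, as a gender gap within each marital status (a worked calculation from the coefficients above):
\[ \begin{aligned} \text{single:} \quad & 0.211 - 0.321 = -0.110 \\ \text{married:} \quad & 0.141 - 0.552 = -0.411 = -0.110 - 0.301 \end{aligned} \]
The interaction coefficient \(-0.301\) is therefore both the difference in the marriage premium between women and men and the difference in the gender gap between married and single workers.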
So far we’ve tested individual dummy coefficients. But what if we want to test whether all coefficients differ across groups?
Example: do the wage equations for men and women differ in any way — intercept, returns to education, returns to experience, etc.?
This is a joint test of all dummy coefficients and their interactions:
\[ H_0: \delta_0 = \delta_1 = \cdots = \delta_k = 0 \]
Include a group dummy and its interaction with every regressor:
\[ y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \delta_0 D_i + \delta_1 D_i x_{1i} + \cdots + \delta_k D_i x_{ki} + u_i \]
Test \(H_0: \delta_0 = \delta_1 = \cdots = \delta_k = 0\) using the \(F\)-statistic, which has \(k+1\) numerator and \(n - 2(k+1)\) denominator degrees of freedom under \(H_0\):
\[ F = \frac{(SSR_r - SSR_{ur}) / (k+1)}{SSR_{ur} / (n - 2(k+1))} \sim F_{k+1,\; n - 2(k+1)} \]
This is just the exclusion restriction \(F\)-test from lect3a — nothing new. Under heteroskedasticity, use a robust version of the joint test.
An equivalent approach: estimate separate regressions for each group.
\[ \begin{aligned} y_i &= \beta_{1,0} + \beta_{1,1} x_{1i} + \cdots + \beta_{1,k} x_{ki} + u_i \quad \text{(group 1)} \\ y_i &= \beta_{2,0} + \beta_{2,1} x_{1i} + \cdots + \beta_{2,k} x_{ki} + u_i \quad \text{(group 2)} \end{aligned} \]
Then \(SSR_{ur} = SSR_1 + SSR_2\), and the Chow \(F\)-statistic is:
\[ F = \frac{(SSR_p - SSR_1 - SSR_2) / (k+1)}{(SSR_1 + SSR_2) / (n_1 + n_2 - 2(k+1))} \]
where \(SSR_p\) is the SSR from the pooled (single-equation) regression.
Both approaches — dummy interactions and separate regressions — yield the same \(F\)-statistic.
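The separate-regressions version of the Chow statistic can be sketched in a few lines. This is a minimal illustration under stated assumptions: a single regressor (\(k = 1\)), made-up data, and hypothetical function names; it uses the homoskedasticity-based formula above, not a robust variant.

```python
def ssr_simple_ols(x, y):
    """SSR from an OLS regression of y on an intercept and one regressor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def chow_f(x1, y1, x2, y2):
    """Chow F-statistic for two groups with one regressor (k = 1)."""
    k = 1
    ssr_ur = ssr_simple_ols(x1, y1) + ssr_simple_ols(x2, y2)  # SSR_1 + SSR_2
    ssr_p = ssr_simple_ols(x1 + x2, y1 + y2)                  # pooled regression
    q = k + 1                                                 # restrictions tested
    df = len(x1) + len(x2) - 2 * (k + 1)                      # n_1 + n_2 - 2(k+1)
    return ((ssr_p - ssr_ur) / q) / (ssr_ur / df)


# Hypothetical data: education and log wage for two groups.
educ_m = [8, 10, 12, 14, 16, 18]
lwage_m = [1.4, 1.7, 1.8, 2.1, 2.2, 2.6]
educ_f = [8, 10, 12, 14, 16, 18]
lwage_f = [1.2, 1.3, 1.5, 1.6, 1.9, 2.0]

f_stat = chow_f(educ_m, lwage_m, educ_f, lwage_f)
df_denom = len(educ_m) + len(educ_f) - 4
print(f"Chow F(2, {df_denom}) = {f_stat:.2f}")
```

If both groups share the same regression line, the pooled SSR equals \(SSR_1 + SSR_2\) and the statistic is zero; the larger the loss of fit from pooling, the larger \(F\).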
So far, dummies have been regressors. But the dependent variable can also be binary:
If \(y_i \in \{0, 1\}\), then \(E[y \mid \mathbf{x}] = P(y = 1 \mid \mathbf{x})\) — the conditional mean is the probability. If we model this probability as linear:
\[ P(y = 1 \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \]
this is the linear probability model (LPM). Estimation is OLS as before.
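The identity \(E[y \mid \mathbf{x}] = P(y = 1 \mid \mathbf{x})\) follows directly from the definition of an expectation for a variable that takes only the values 0 and 1:
\[ E[y \mid \mathbf{x}] = 1 \cdot P(y = 1 \mid \mathbf{x}) + 0 \cdot P(y = 0 \mid \mathbf{x}) = P(y = 1 \mid \mathbf{x}) \]
Under the LPM, each slope is therefore a change in probability: \(\beta_j\) is the change in \(P(y = 1 \mid \mathbf{x})\) from a one-unit increase in \(x_j\), holding the other regressors fixed.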
Data: mroz
\[ \begin{aligned} \widehat{\text{inlf}} = \underset{(.154)}{.586} &\underset{(.0014)}{- .0034}\; \text{nwifeinc} + \underset{(.007)}{.038}\; \text{educ} \\ &+ \underset{(.006)}{.039}\; \text{exper} \underset{(.00018)}{- .00060}\; \text{exper}^2 \\ &\underset{(.002)}{- .016}\; \text{age} \underset{(.034)}{- .262}\; \text{kidslt6} + \underset{(.013)}{.013}\; \text{kidsge6} \end{aligned} \]
The LPM can predict probabilities outside \([0, 1]\), because a linear function of \(\mathbf{x}\) is unbounded.
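Plugging regressor values into the fitted mroz equation shows how this happens. The sketch below hard-codes the coefficients reported above; the two women are hypothetical, and it assumes nwifeinc is measured in thousands of dollars, as in the standard mroz data.

```python
def predict_inlf(nwifeinc, educ, exper, age, kidslt6, kidsge6):
    """Fitted LPM probability of labor-force participation from the estimates above."""
    return (
        0.586
        - 0.0034 * nwifeinc
        + 0.038 * educ
        + 0.039 * exper
        - 0.00060 * exper**2
        - 0.016 * age
        - 0.262 * kidslt6
        + 0.013 * kidsge6
    )


# Two hypothetical women (illustrative values only).
low = predict_inlf(nwifeinc=50, educ=5, exper=0, age=60, kidslt6=0, kidsge6=0)
high = predict_inlf(nwifeinc=0, educ=17, exper=25, age=45, kidslt6=0, kidsge6=0)

print(f"{low:.3f}")   # -0.354
print(f"{high:.3f}")  # 1.112
```

Both fitted values are impossible as probabilities. For regressor values near the middle of the sample, fitted values typically stay inside the unit interval, so the problem is concentrated at the extremes.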
Since \(y_i \in \{0, 1\}\):
\[ \text{Var}(y \mid \mathbf{x}) = P(y = 1 \mid \mathbf{x})\, [1 - P(y = 1 \mid \mathbf{x})] \]
Under the LPM, \(P(y = 1 \mid \mathbf{x}) = \mathbf{x}'\boldsymbol{\beta}\), so:
\[ \text{Var}(u \mid \mathbf{x}) = (\mathbf{x}'\boldsymbol{\beta})(1 - \mathbf{x}'\boldsymbol{\beta}) \]
The error variance depends on \(\mathbf{x}\) — heteroskedasticity is built in, not a special case.
This is an instance of what we saw in lect3b: the model structure itself can generate heteroskedasticity. Always use heteroskedasticity-robust standard errors with the LPM.
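The variance formula follows because \(y^2 = y\) when \(y \in \{0, 1\}\). Writing \(p(\mathbf{x}) = P(y = 1 \mid \mathbf{x})\) as shorthand:
\[ \text{Var}(y \mid \mathbf{x}) = E[y^2 \mid \mathbf{x}] - \left(E[y \mid \mathbf{x}]\right)^2 = p(\mathbf{x}) - p(\mathbf{x})^2 = p(\mathbf{x})\,[1 - p(\mathbf{x})] \]
For example, the variance is \(0.5 \times 0.5 = 0.25\) where \(p(\mathbf{x}) = 0.5\) but only \(0.9 \times 0.1 = 0.09\) where \(p(\mathbf{x}) = 0.9\), so the error variance changes across observations whenever the probability does.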
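Despite its shortcomings, the LPM is easy to estimate by OLS, and each coefficient reads directly as the change in \(P(y = 1 \mid \mathbf{x})\) for a one-unit change in the regressor.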
For applications where boundary behavior matters, nonlinear alternatives (logit, probit) constrain probabilities to \([0, 1]\). These require maximum likelihood estimation — beyond the scope of this course.
Next: Lecture 4c, Model Selection & Controls.