Functional Form & Scaling

Natasha Kang

Xiamen University, Chow Institute

April, 2026

Roadmap

  1. Functional forms: logarithms, quadratics, interactions
  2. Scaling and beta coefficients

Linear in Parameters, Not in Variables

  • The linear regression model is linear in the parameters, not necessarily in the variables.
  • Valid linear regression: \(\;y = \beta_0 + \beta_1 \sqrt{x} + u\)
  • Not a linear regression: \(\;y = \frac{1}{\beta_0 + \beta_1 x} + u\)
  • This means we can transform the variables — logarithms, polynomials, interactions — and still estimate with OLS.
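
A minimal sketch of the point above, assuming hypothetical noiseless data generated from \(y = 1 + 2\sqrt{x}\): once the regressor is transformed, ordinary simple-regression formulas apply unchanged. The helper `ols_simple` is an illustrative name, not something from the lecture.

```python
import math

def ols_simple(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical noiseless data from y = 1 + 2*sqrt(x)
x = [1.0, 4.0, 9.0, 16.0, 25.0]
y = [1.0 + 2.0 * math.sqrt(xi) for xi in x]

# Transform the regressor, then run plain OLS on the transformed variable
z = [math.sqrt(xi) for xi in x]
b0, b1 = ols_simple(z, y)
print(b0, b1)
```

The model is nonlinear in \(x\) but linear in \(\beta_0\) and \(\beta_1\), which is all OLS requires.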

Why Logarithms?

Example: the Cobb-Douglas production function with a multiplicative shock

\[ Y_i = A K_i^{\alpha} L_i^{\gamma} e^{u_i} \]

is nonlinear in the parameters. But taking logs:

\[ \log(Y_i) = \log(A) + \alpha \log(K_i) + \gamma \log(L_i) + u_i \]

This is now linear in the parameters \(\log(A)\), \(\alpha\), and \(\gamma\), so it can be estimated by OLS.
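
As a rough illustration, the sketch below simulates a Cobb-Douglas technology and recovers the exponents by regressing \(\log(Y)\) on \(\log(K)\) and \(\log(L)\). The parameter values, sample size, and input distributions are assumptions chosen for the example, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
A, alpha, gamma = 2.0, 0.3, 0.6

rng = np.random.default_rng(0)
n = 200
K = rng.uniform(1.0, 10.0, size=n)
L = rng.uniform(1.0, 10.0, size=n)
u = rng.normal(0.0, 0.1, size=n)

# Multiplicative model: nonlinear in the parameters
Y = A * K**alpha * L**gamma * np.exp(u)

# After taking logs the model is linear, so OLS applies
X = np.column_stack([np.ones(n), np.log(K), np.log(L)])
coef, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
log_A_hat, alpha_hat, gamma_hat = coef
print(log_A_hat, alpha_hat, gamma_hat)
```

Note that the intercept estimates \(\log(A)\), not \(A\) itself.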

Why Logarithms? (cont.)

Beyond linearizing multiplicative models:

  • Many economic variables (income, prices, firm size) are right-skewed. Extreme values can have high leverage in OLS. Taking logs compresses the scale and reduces the influence of outliers.
  • If \(\log(Y) \mid X\) is approximately normal, the normality assumption on the errors fits the log model better than the level model. (Normality alone does not guarantee that the conditional mean is correctly specified.)

Four Log-Linear Models

We can apply the log transformation to \(y\), to \(x\), to both, or to neither. Each choice changes how \(\beta_1\) is interpreted:

| Model | Specification | Interpretation of \(\beta_1\) |
|---|---|---|
| Level-level | \(y = \beta_0 + \beta_1 x + u\) | \(\Delta y = \beta_1 \Delta x\) |
| Level-log | \(y = \beta_0 + \beta_1 \log(x) + u\) | \(\Delta y \approx (\beta_1/100)\, \%\Delta x\) |
| Log-level | \(\log(y) = \beta_0 + \beta_1 x + u\) | \(\%\Delta y \approx (100\beta_1)\, \Delta x\) |
| Log-log | \(\log(y) = \beta_0 + \beta_1 \log(x) + u\) | \(\%\Delta y \approx \beta_1\, \%\Delta x\) |

Deriving the Log-Level Interpretation

In the model \(\log(y) = \beta_0 + \beta_1 x + u\), consider a change \(\Delta x\):

\[ \log(y + \Delta y) - \log(y) = \beta_1 \Delta x \]

The left side is \(\log(1 + \Delta y / y) \approx \Delta y / y\) for small \(\Delta y / y\).

So:

\[ \frac{\Delta y}{y} \approx \beta_1 \Delta x \quad\Longrightarrow\quad \%\Delta y \approx (100\beta_1)\, \Delta x \]

This approximation relies on \(\log(1+r) \approx r\) for small \(r\).

Deriving the Level-Log Interpretation

In the model \(y = \beta_0 + \beta_1 \log(x) + u\), consider a change \(\Delta x\):

\[ \Delta y = \beta_1 [\log(x + \Delta x) - \log(x)] = \beta_1 \log\!\left(1 + \frac{\Delta x}{x}\right) \]

For small \(\Delta x / x\), using \(\log(1+r) \approx r\):

\[ \Delta y \approx \beta_1 \cdot \frac{\Delta x}{x} = \frac{\beta_1}{100} \cdot \%\Delta x \]
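
To get a feel for how good this approximation is, the sketch below compares \((\beta_1/100)\,\%\Delta x\) with the exact change \(\beta_1 \log(1 + \%\Delta x/100)\) taken from the line above. The coefficient \(\beta_1 = 5\) is a hypothetical value, not an estimate from the lecture.

```python
import math

def level_log_change(beta1, pct_dx):
    """Approximate and exact change in y for a pct_dx percent change in x."""
    approx = (beta1 / 100.0) * pct_dx
    exact = beta1 * math.log(1.0 + pct_dx / 100.0)
    return approx, exact

# beta1 = 5.0 is a hypothetical coefficient used only for illustration
for pct in (1, 10, 50):
    approx, exact = level_log_change(5.0, pct)
    print(pct, round(approx, 4), round(exact, 4))
```

The gap between the two columns widens as the percent change in \(x\) grows.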

Deriving the Log-Log Interpretation

In the model \(\log(y) = \beta_0 + \beta_1 \log(x) + u\), consider a change \(\Delta x\):

\[ \log(y + \Delta y) - \log(y) = \beta_1 [\log(x + \Delta x) - \log(x)] \]

Applying \(\log(1+r) \approx r\) to both sides:

\[ \frac{\Delta y}{y} \approx \beta_1 \cdot \frac{\Delta x}{x} \quad\Longrightarrow\quad \%\Delta y \approx \beta_1\, \%\Delta x \]

\(\beta_1\) is the elasticity of \(y\) with respect to \(x\): the percent change in \(y\) for a one percent change in \(x\).

When the Approximation Breaks Down

  • All three derivations used \(\log(1+r) \approx r\), which is accurate only for small \(r\). What if the change is large?
  • For large changes, use the exact formula. From the log-level model:

\[ \log(y + \Delta y) - \log(y) = \beta_1 \Delta x \quad\Longrightarrow\quad \frac{y + \Delta y}{y} = e^{\beta_1 \Delta x} \]

  • Therefore:

\[ \frac{\Delta y}{y} = e^{\beta_1 \Delta x} - 1 \quad\Longrightarrow\quad \%\Delta y = 100 \cdot \big[\exp(\beta_1 \Delta x) - 1\big] \]

  • Example: If \(\hat\beta_1 = 0.30\) for \(\Delta x = 1\):
    • Approximate: \(30\%\) increase
    • Exact: \(100 \cdot (e^{0.30} - 1) = 34.99\%\) increase
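
The comparison above can be sketched as a small helper that returns both figures for any \(\beta_1\) and \(\Delta x\). The function name and the grid of coefficient values are illustrative assumptions.

```python
import math

def log_level_pct_change(beta1, dx=1.0):
    """Approximate and exact percent change in y in a log-level model."""
    approx = 100.0 * beta1 * dx
    exact = 100.0 * (math.exp(beta1 * dx) - 1.0)
    return approx, exact

for b in (0.05, 0.10, 0.30, 0.60):
    approx, exact = log_level_pct_change(b)
    print(f"beta1={b:.2f}  approx={approx:.2f}%  exact={exact:.2f}%")
```

For positive \(\beta_1 \Delta x\) the approximation understates the exact percent change, and the shortfall grows with the size of the change.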

Example: Education and Wages

\[ \widehat{\log(\text{wage})} = 0.584 + 0.083\; \text{educ} \]

  • Approximate: each additional year of education is associated with an \(8.3\%\) increase in hourly wage.
  • Exact: \(100 \cdot (e^{0.083} - 1) = 8.65\%\).
  • The approximation is good here because \(0.083\) is small.

Example: CEO Salary and Firm Sales

\[ \widehat{\log(\text{salary})} = 4.82 + 0.257\; \log(\text{sales}) \]

  • This is a log-log model: \(\hat\beta_1 = 0.257\) is the estimated elasticity.
  • A \(1\%\) increase in firm sales is associated with a \(0.257\%\) increase in CEO salary.
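
The elasticity reading is a small-change approximation. Following the same logic as the log-level case, exponentiating the log-log equation gives the exact change \(\%\Delta y = 100 \cdot [(1 + \%\Delta x/100)^{\beta_1} - 1]\). A minimal sketch using the reported elasticity of \(0.257\); the function name and the chosen percent changes are assumptions for illustration.

```python
def log_log_pct_change(beta1, pct_dx):
    """Approximate and exact percent change in y in a log-log model."""
    approx = beta1 * pct_dx
    exact = 100.0 * ((1.0 + pct_dx / 100.0) ** beta1 - 1.0)
    return approx, exact

# 0.257 is the estimated elasticity from the CEO salary regression
for pct in (1, 10, 100):
    approx, exact = log_log_pct_change(0.257, pct)
    print(f"%dx={pct:>3}  approx={approx:.3f}%  exact={exact:.3f}%")
```

For a 1% change the two figures are nearly identical; for a doubling of sales they differ noticeably.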

Units of Measurement and Logs

If we change the units of sales, does the estimated elasticity change?

In the CEO salary regression, sales are measured in millions of dollars:

\[ \widehat{\log(\text{salary})} = 4.82 + 0.257\; \log(\text{sales}) \]

Suppose we re-measure sales in thousands of dollars. Then \(\text{sales}_{\text{new}} = 1000 \cdot \text{sales}\), and:

\[ \log(\text{sales}_{\text{new}}) = \log(1000) + \log(\text{sales}) \]

The slope \(\hat\beta_1 = 0.257\) is unchanged — only the intercept shifts.

In log-log models, the elasticity is invariant to units of measurement.
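
A quick numerical check of this invariance, using made-up salary and sales figures (not the data behind the regression above): the slope is identical in both unit systems, and the intercept moves by \(\hat\beta_1 \log(1000)\).

```python
import math

def ols_simple(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data; sales in millions of dollars
sales_m = [50.0, 120.0, 400.0, 900.0, 2500.0, 7000.0]
salary = [600.0, 750.0, 1100.0, 1200.0, 1800.0, 2100.0]
log_salary = [math.log(s) for s in salary]

b0_m, b1_m = ols_simple([math.log(s) for s in sales_m], log_salary)
# Same firms, sales re-measured in thousands of dollars
b0_k, b1_k = ols_simple([math.log(1000.0 * s) for s in sales_m], log_salary)

print(b1_m, b1_k)                              # slopes agree
print(b0_m - b0_k, b1_m * math.log(1000.0))    # intercept shift
```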

Quadratic Models

Consider the specification:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + u \]

The marginal effect of \(x\) on \(y\):

\[ \frac{\partial E[y \mid x]}{\partial x} = \beta_1 + 2\beta_2 x \]

  • The marginal effect is not constant — it depends on \(x\).
  • The sign of \(\beta_2\) determines whether the effect is increasing (\(\beta_2 > 0\)) or decreasing (\(\beta_2 < 0\)) in \(x\).

Example: Experience and Wages

Data: wage1

\[ \widehat{\text{wage}} = \underset{(0.35)}{3.73} + \underset{(0.04)}{0.298}\; \text{exper} \underset{(0.0009)}{- 0.0061}\; \text{exper}^2 \]

Marginal effect:

\[ \frac{\partial \widehat{\text{wage}}}{\partial \text{exper}} = 0.298 - 2(0.0061) \cdot \text{exper} = 0.298 - 0.0122 \cdot \text{exper} \]

  • At \(\text{exper} = 0\): marginal effect \(= \$0.30\)/hour per year.
  • At \(\text{exper} = 10\): marginal effect \(= \$0.18\)/hour per year.
  • Turning point: \(\text{exper}^* = 0.298 / 0.0122 \approx 24.4\).
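
The calculations above can be sketched as two small functions, evaluated at the reported estimates \(0.298\) and \(-0.0061\). The function names and the extra experience levels are illustrative assumptions.

```python
def marginal_effect(b1, b2, x):
    """Marginal effect of x in y = b0 + b1*x + b2*x^2."""
    return b1 + 2.0 * b2 * x

def turning_point(b1, b2):
    """Value of x at which the marginal effect is zero."""
    return -b1 / (2.0 * b2)

b1, b2 = 0.298, -0.0061  # estimates from the wage equation
for exper in (0, 10, 20, 30):
    print(exper, round(marginal_effect(b1, b2, exper), 3))
print(round(turning_point(b1, b2), 1))
```

Past the turning point the fitted marginal effect becomes negative.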

Interpreting the Turning Point

Does this mean returns to experience become negative after about 24 years? Not necessarily. The estimated turning point could reflect omitted variable bias (e.g., omitting education) or functional-form misspecification (e.g., \(\log(\text{wage})\) may be the more appropriate dependent variable).

Quadratics: Housing Prices

A quadratic is a local approximation — should we take it literally everywhere?

\[ \begin{aligned} \widehat{\log(\text{price})} = \underset{(0.57)}{13.39} &\underset{(0.11)}{- 0.902}\; \log(\text{nox}) \underset{(0.04)}{- 0.087}\; \log(\text{dist}) \\ &\underset{(0.17)}{- 0.545}\; \text{rooms} + \underset{(0.01)}{0.062}\; \text{rooms}^2 \underset{(0.006)}{- 0.048}\; \text{stratio} \end{aligned} \]

Marginal effect of rooms:

\[ \frac{\partial \widehat{\log(\text{price})}}{\partial \text{rooms}} = -0.545 + 0.124 \cdot \text{rooms} \]

Quadratics: Housing Prices (cont.)

  • The model predicts that a house with 2 rooms is worth more than one with 4 rooms — nonsensical.
  • The quadratic is a local approximation; it should not be trusted at extreme values.
  • In practice, ~95% of houses in the sample have 5–8 rooms — few observations inform the extremes.
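
To see where the first bullet comes from, here is a short worked calculation using only the reported estimates. Setting the marginal effect of rooms to zero gives the turning point, and differencing the fitted equation between 2 and 4 rooms (other regressors held fixed) gives the predicted gap:

\[ \text{rooms}^* = \frac{0.545}{2(0.062)} = \frac{0.545}{0.124} \approx 4.4 \]

\[ \Delta \widehat{\log(\text{price})} = -0.545\,(4 - 2) + 0.062\,(4^2 - 2^2) = -1.090 + 0.744 = -0.346 \]

Below about 4.4 rooms the fitted relationship slopes downward, so the 4-room house is predicted to be cheaper than the 2-room house.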

Interaction Terms

Quadratic terms capture curvature in a single variable. But the second-order Taylor expansion of \(f(x_1, x_2)\) also introduces cross-terms:

\[ \begin{aligned} y &= f(x_1, x_2) \\ &\approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \cdots \end{aligned} \]

The interaction term \(x_1 x_2\) allows the effect of \(x_1\) to depend on \(x_2\):

\[ \frac{\partial E[y \mid x_1, x_2]}{\partial x_1} = \beta_1 + \beta_3 x_2 \]

Example: Attendance and Exam Performance

Data: attend

\[ \begin{aligned} \widehat{\text{stndfnl}} = \underset{(1.36)}{2.05} &\underset{(0.01)}{- 0.0067}\; \text{atndrte} \underset{(0.48)}{- 1.63}\; \text{priGPA} \\ &\underset{(0.10)}{- 0.128}\; \text{ACT} + \underset{(0.10)}{0.296}\; \text{priGPA}^2 \\ &+ \underset{(0.002)}{0.0045}\; \text{ACT}^2 + \underset{(0.004)}{0.0056}\; \text{priGPA} \cdot \text{atndrte} \end{aligned} \]

Partial effect of attendance (only \(\text{atndrte}\) and its interaction appear):

\[ \frac{\partial \widehat{\text{stndfnl}}}{\partial \text{atndrte}} = -0.0067 + 0.0056 \cdot \text{priGPA} \]

Average Partial Effect

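Because the partial effect is linear in \(\text{priGPA}\), evaluating it at the sample mean \(\overline{\text{priGPA}} = 2.59\) gives the average partial effect: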

\[ \widehat{APE}_{\text{atndrte}} = -0.0067 + 0.0056 \times 2.59 = 0.0078 \]

  • At the average prior GPA, a one-percentage-point increase in attendance is associated with a \(0.0078\) standard deviation increase in final exam score.
  • For a student with \(\text{priGPA} = 2.0\): effect \(= -0.0067 + 0.0056(2.0) = 0.0045\).
  • For a student with \(\text{priGPA} = 3.5\): effect \(= -0.0067 + 0.0056(3.5) = 0.013\).

Attendance matters more for students with higher prior GPA.
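
The three calculations above follow one formula, sketched here as a function of prior GPA. The coefficient defaults are the reported estimates; the function name and the break-even calculation are illustrative additions.

```python
def attendance_effect(pri_gpa, b_atnd=-0.0067, b_inter=0.0056):
    """Estimated partial effect of attendance at a given prior GPA."""
    return b_atnd + b_inter * pri_gpa

for gpa in (2.0, 2.59, 3.5):
    print(gpa, round(attendance_effect(gpa), 4))

# Prior GPA at which the estimated effect of attendance is zero
break_even = 0.0067 / 0.0056
print(round(break_even, 2))
```

The estimated effect is positive for any prior GPA above the break-even value.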

Interpreting the Main Effect

In the model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u\):

\(\beta_1\) gives the effect of \(x_1\) when \(x_2 = 0\).

In the attendance example, \(\hat\beta_1 = -0.0067\) is the estimated effect of attendance for a student with \(\text{priGPA} = 0\), which is not a meaningful value.

We want the effect at a meaningful value of \(x_2\), such as its sample mean.

Mean-Centering the Interaction

Solution: replace \(x_1 x_2\) with \((x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\) in the interaction term only:

\[ y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \beta_3 (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) + u \]

  • The interaction coefficient \(\beta_3\) is unchanged — centering is just a reparameterization.
  • But now \(\alpha_1\) gives the effect of \(x_1\) at \(x_2 = \bar{x}_2\): this is the average partial effect.

In the attendance example, it is enough to center \(\text{priGPA}\): replace \(\text{priGPA} \cdot \text{atndrte}\) with \((\text{priGPA} - \overline{\text{priGPA}}) \cdot \text{atndrte}\). The coefficient on \(\text{atndrte}\) then directly estimates the APE of attendance, and OLS reports its standard error, so we get inference for free.
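
A sketch of the reparameterization on simulated data, assuming NumPy is available. The data-generating values and the helper `ols` are hypothetical; the point is only that the interaction coefficient is unchanged and the centered main effect equals \(\hat\beta_1 + \hat\beta_3 \bar{x}_2\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(80.0, 15.0, size=n)  # hypothetical "attendance-like" regressor
x2 = rng.normal(2.6, 0.5, size=n)    # hypothetical "GPA-like" regressor
y = 1.0 - 0.01 * x1 + 0.5 * x2 + 0.006 * x1 * x2 + rng.normal(0.0, 1.0, size=n)

def ols(columns, y):
    """OLS coefficients (intercept first) for the given regressor columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Raw interaction: coefficient on x1 is the effect at x2 = 0
b0, b1, b2, b3 = ols([x1, x2, x1 * x2], y)

# Centered interaction: coefficient on x1 is the effect at x2 = mean(x2)
m1, m2 = x1.mean(), x2.mean()
a0, a1, a2, a3 = ols([x1, x2, (x1 - m1) * (x2 - m2)], y)

print(b3, a3)            # interaction coefficient is the same
print(a1, b1 + b3 * m2)  # centered main effect equals effect at the mean
```

In the raw model the same quantity would need a separate standard-error calculation for the linear combination \(\hat\beta_1 + \hat\beta_3 \bar{x}_2\).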

Does Scaling Matter?

We’ve seen how transformations change the model. But even without transforming, the choice of units affects the numbers we report.

Consider \(y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat{u}_i\).

Question: If we change the units of \(x\) or \(y\) (e.g., dollars \(\to\) thousands of dollars), what happens to:

  • The OLS estimates \(\hat\beta_0\), \(\hat\beta_1\)?
  • The \(R^2\)?
  • The \(t\)-statistics and \(F\)-statistics?

Scaling the Independent Variable

Adding a constant: \(x_i^* = x_i + c\)

\[ y_i = \hat\beta_0^* + \hat\beta_1^* x_i^* + \hat{u}_i^* = (\hat\beta_0^* + \hat\beta_1^* c) + \hat\beta_1^* x_i + \hat{u}_i^* \]

Matching coefficients: \(\hat\beta_1 = \hat\beta_1^*\) (unchanged), \(\hat\beta_0 = \hat\beta_0^* + \hat\beta_1^* c\).

Multiplying by a constant: \(x_i^* = a\, x_i\)

\[ \hat\beta_1 = a\, \hat\beta_1^* \quad\Longrightarrow\quad \hat\beta_1^* = \hat\beta_1 / a \]

The slope rescales inversely with the unit change.

Scaling the Dependent Variable

Adding a constant: \(y_i^* = y_i + c\)

\[ \hat\beta_0^* = \hat\beta_0 + c, \quad \hat\beta_1^* = \hat\beta_1 \]

Multiplying by a constant: \(y_i^* = a\, y_i\)

\[ \hat\beta_0^* = a\, \hat\beta_0, \quad \hat\beta_1^* = a\, \hat\beta_1 \]

Summary: Effects of Data Scaling

For the model \(y_i = \beta_0 + \beta_1 x_i + u_i\):

| Transformation | Intercept | Slope | \(R^2\) |
|---|---|---|---|
| Independent variable: \(x + c\) | \(\beta_0 - c\beta_1\) | \(\beta_1\) | unchanged |
| Independent variable: \(a \cdot x\) | \(\beta_0\) | \(\beta_1 / a\) | unchanged |
| Dependent variable: \(y + c\) | \(\beta_0 + c\) | \(\beta_1\) | unchanged |
| Dependent variable: \(a \cdot y\) | \(a\beta_0\) | \(a\beta_1\) | unchanged |

  • \(R^2\) is invariant to linear rescaling. The \(t\)-statistics on slope coefficients and the \(F\)-statistic are also invariant (for \(a > 0\); a negative \(a\) only flips the sign of \(t\)). The intercept \(t\)-statistic can change under additive shifts.
  • Slope coefficients and their standard errors rescale in lockstep.
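
The rows of the summary table can be checked numerically. The sketch below uses made-up data and the hypothetical constants \(a = 1000\) and \(c = 10\), and reports the intercept, slope, and \(R^2\) under each transformation.

```python
def ols_simple(x, y):
    """Return (intercept, slope, r_squared) for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy ** 2 / (sxx * syy)

# Hypothetical data and constants
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]
a, c = 1000.0, 10.0

results = {
    "original": ols_simple(x, y),
    "x + c": ols_simple([xi + c for xi in x], y),
    "a * x": ols_simple([a * xi for xi in x], y),
    "y + c": ols_simple(x, [yi + c for yi in y]),
    "a * y": ols_simple(x, [a * yi for yi in y]),
}
for name, (b0, b1, r2) in results.items():
    print(f"{name:>8}: intercept={b0:.4f} slope={b1:.6f} R2={r2:.6f}")
```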

Why This Matters

  • Scaling does not change the substance of results — only their presentation.
  • Choose units that make coefficients easy to interpret:
    • Income in thousands (not raw dollars) so coefficients aren’t tiny.
    • Population in millions rather than raw counts.
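  • Never compare coefficient magnitudes across variables with different units. A “larger” coefficient may simply reflect a larger unit of measurement for that regressor: measuring income in thousands rather than dollars multiplies its coefficient by 1000.

So how do we compare effects across variables?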

Beta Coefficients

Problem: How do we compare the relative importance of regressors measured in different units?

Solution: Standardize all variables to have mean zero and standard deviation one.

Original model:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i \]

Standardized model:

\[ \frac{y_i - \bar{y}}{\hat\sigma_y} = \beta_1^* \frac{x_{i1} - \bar{x}_1}{\hat\sigma_1} + \beta_2^* \frac{x_{i2} - \bar{x}_2}{\hat\sigma_2} + \cdots + \beta_k^* \frac{x_{ik} - \bar{x}_k}{\hat\sigma_k} + u_i^* \]

Beta Coefficients (cont.)

The standardized coefficient:

\[ \beta_j^* = \frac{\hat\sigma_j}{\hat\sigma_y} \hat\beta_j \]

Interpretation: a one standard deviation increase in \(x_j\) is associated with a \(\beta_j^*\) standard deviation change in \(y\), holding all else equal.

  • Beta coefficients are unit-free — they can be compared across variables.
  • No intercept in the standardized regression (all variables are demeaned).
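
A sketch on simulated data, assuming NumPy is available, showing that the coefficients from the standardized regression match the formula \((\hat\sigma_j / \hat\sigma_y)\hat\beta_j\). The regressors, their scales, and the helper `zscore` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(5.0, 2.0, size=n)       # hypothetical regressor, small units
x2 = rng.normal(3000.0, 800.0, size=n)  # hypothetical regressor, large units
y = 10.0 + 1.5 * x1 + 0.004 * x2 + rng.normal(0.0, 3.0, size=n)

def zscore(v):
    """Demean and divide by the sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

# Unstandardized regression (with intercept)
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standardized regression (no intercept: all variables are demeaned)
Z = np.column_stack([zscore(x1), zscore(x2)])
beta_star, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)

# The formula (sd_j / sd_y) * b_j should reproduce the same numbers
sd_x = np.array([x1.std(ddof=1), x2.std(ddof=1)])
from_formula = sd_x / y.std(ddof=1) * b[1:]
print(b[1:])
print(beta_star, from_formula)
```

The raw slopes differ by orders of magnitude because of the units; the standardized ones are on a common scale.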

Example: Pollution and Housing Prices

\[ \begin{aligned} \widehat{\text{price}} = \underset{(5055)}{20{,}871} &\underset{(354)}{- 2{,}706}\; \text{nox} \underset{(33)}{- 154}\; \text{crime} \\ &+ \underset{(394)}{6{,}726}\; \text{rooms} \underset{(188)}{- 1{,}027}\; \text{dist} \underset{(127)}{- 1{,}148}\; \text{stratio} \end{aligned} \]

Standardized regression:

\[ \begin{aligned} \widehat{z_{\text{price}}} = &\underset{(0.04)}{-0.340}\; z_{\text{nox}} \underset{(0.03)}{- 0.143}\; z_{\text{crime}} \\ &+ \underset{(0.03)}{0.514}\; z_{\text{rooms}} \underset{(0.04)}{- 0.235}\; z_{\text{dist}} \underset{(0.03)}{- 0.270}\; z_{\text{stratio}} \end{aligned} \]

  • Rooms has the largest standardized effect in absolute value (\(0.514\)), followed by pollution (\(-0.340\)).
  • Pollution has a larger effect than crime in absolute value (\(0.340\) vs. \(0.143\)), which is hard to see from the unstandardized coefficients alone.

Summary

  • The linear regression model is linear in the parameters, not necessarily in the variables. Transformations (logs, polynomials, interactions) expand what OLS can capture.
  • Logarithmic models give percentage-change interpretations; the log-log coefficient is an elasticity.
  • Quadratic and interaction terms allow marginal effects to vary with \(x\). Always compute the marginal effect — do not interpret individual coefficients in isolation.
  • Scaling changes the presentation of results, not the substance. Use beta coefficients to compare magnitudes across variables.

What’s Next

Lecture 4b — Dummy Variables:

  • Binary regressors: intercept and slope shifts
  • Structural breaks and the Chow test
  • The linear probability model