Xiamen University, Chow Institute
March 2026
Consider \[ y_t = y_{t-1} + \alpha_0 + \epsilon_t, \] where \(\epsilon_t\) is stationary.
This can be written equivalently as \[ y_t = y_0 + \alpha_0 t + \sum_{i=1}^t \epsilon_i . \]
Subtracting the deterministic component \(y_0 + \alpha_0 t\) removes the linear drift, but the accumulated shocks \(\sum_{i=1}^t \epsilon_i\) remain. Therefore, detrending cannot remove a stochastic trend.
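This can be illustrated by simulation: OLS-detrending a random walk with drift removes the linear component exactly, yet the variance of the detrended series still grows with the sample size, because the accumulated shocks remain. A minimal sketch in Python (numpy only; the drift, sample sizes, and replication count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def detrended_var(T, reps=200):
    """Average sample variance of an OLS-detrended random walk with drift."""
    t = np.arange(T)
    X = np.column_stack([np.ones(T), t])             # intercept + linear trend
    vs = []
    for _ in range(reps):
        y = 0.5 * t + np.cumsum(rng.normal(size=T))  # drift plus accumulated shocks
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        vs.append(np.var(y - X @ beta))              # variance of detrended series
    return np.mean(vs)

v_small, v_large = detrended_var(200), detrended_var(2000)
print(v_small, v_large)  # variance of the "detrended" series grows with T
```

If detrending removed the trend entirely, the two variances would be comparable; instead the longer sample has a far larger residual variance, confirming that the stochastic trend survives detrending.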
The appropriate transformation depends on the nature of the trend.
In practice, the nature of the trend is unknown.
This motivates the need for tools that help distinguish deterministic from stochastic trends — namely, unit root testing.
Routine differencing changes the object of analysis: while it restores stationarity, it removes the low-frequency variation that is often central to macroeconomic questions.
As a result:
After differencing, regression coefficients describe how changes in the dependent variable are related to changes (or levels) of the regressors, not how their levels move together over time.
This concern is emphasized in Christopher A. Sims’ critique of routine differencing and pre-testing in macroeconometric practice.
Consider a regression using variables in levels:
\[ y_t = \beta x_t + u_t, \] where \(x_t\) and \(y_t\) are nonstationary.
For example, consider the following two time series observed in levels.
Table: OLS regression in levels
|Term | Estimate| Std. Error| t value| p-value|
|:-----------|--------:|---------:|------:|-------:|
|(Intercept) | -190.03| 7.69| -24.70| <1e-04|
|co2 | 0.60| 0.02| 25.33| <1e-04|
In the late 19th and early 20th centuries, statistics was primarily used to describe and compare social conditions over time.
Governments, churches, and public institutions began collecting long time series on a wide range of social and economic indicators.
As these long time series became available, researchers naturally began using correlation to assess whether different indicators moved together over time.
Correlation was often interpreted as evidence consistent with a causal relationship, and used to inform policy or intervention.
Many of these associations were strikingly large. The difficulty was not that any single relationship was obviously absurd, but that almost everything appeared related to everything else.
Yule (1926) was among the first to articulate this problem clearly.
He noted that when time series exhibit strong persistence, large correlations can arise mechanically, even when the underlying series are unrelated.
Yule referred to this phenomenon as “nonsense correlation.”
As regression methods became standard in applied work (particularly from the 1960s onward), researchers increasingly studied relationships between time series in levels.
A typical specification was:
\[ y_t = \beta x_t + u_t. \]
Regression was often viewed as more informative than correlation, and results were frequently interpreted at face value.
Granger and Newbold (1974) showed that even when \(x_t\) and \(y_t\) are unrelated, regressions in levels can exhibit large \(t\)-statistics, high \(R^2\), and strongly autocorrelated residuals (low Durbin–Watson statistics).
This phenomenon became known as spurious regression.
Table: Spurious regression in levels
|Term | Estimate| Std. Error| t value| p-value|
|:-----------|--------:|----------:|-------:|-------:|
|(Intercept) | -2.790| 0.347| -8.03| 0|
|x | -0.762| 0.039| -19.44| 0|
The problem of spurious regression lies in the residual \(\{\hat{\varepsilon}_t\}\) (ignoring the intercept):
\[ \hat{\varepsilon}_t = y_t - \hat{\beta} x_t . \]
Suppose the variables follow random walks: \[ y_t = \sum_{i=1}^t v_i, \qquad x_t = \sum_{i=1}^t u_i, \] where \(\{u_i\}\) and \(\{v_i\}\) are independent, mean-zero innovations.
Then the residual behaves like \[ \hat{\varepsilon}_t \;\approx\; \sum_{i=1}^t v_i \;-\; \hat{\beta} \sum_{i=1}^t u_i . \]
Table: Regression in first differences
|Term | Estimate| Std. Error| t value| p-value|
|:-----------|--------:|----------:|-------:|-------:|
|(Intercept) | 0.063| 0.069| 0.908| 0.365|
|diff(x) | 0.044| 0.071| 0.616| 0.539|
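Results of this kind are easy to reproduce. The following sketch (numpy only; the sample size, seed, and replication count are arbitrary choices) regresses pairs of independent random walks on each other, in levels and in first differences, and records how often the slope appears significant at the nominal 5% level:

```python
import numpy as np

rng = np.random.default_rng(42)

def slope_tstat(y, x):
    """t-statistic for the slope in an OLS regression of y on x with intercept."""
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (T - 2)                  # residual variance estimate
    XtX_inv = np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(s2 * XtX_inv[1, 1])

T, reps = 200, 500
levels, diffs = [], []
for _ in range(reps):
    x = np.cumsum(rng.normal(size=T))             # two independent random walks
    y = np.cumsum(rng.normal(size=T))
    levels.append(abs(slope_tstat(y, x)) > 1.96)  # levels regression
    diffs.append(abs(slope_tstat(np.diff(y), np.diff(x))) > 1.96)  # differences

print(np.mean(levels), np.mean(diffs))
```

In levels the nominal 5% test rejects far too often, reproducing Granger and Newbold's finding; in first differences the rejection rate stays close to its nominal level.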
So far, we have seen that regressions in levels between unrelated nonstationary variables can be spurious, while regressions in differences discard the long-run information in the levels.
There is, however, an important exception.
If a linear combination of nonstationary variables is stationary, then a regression in levels can be meaningful.
Suppose \(x_t\) and \(y_t\) are both nonstationary.
They are said to be cointegrated if there exists a coefficient \(\beta\) such that
\[ y_t - \beta x_t \]
is stationary.
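A simulated example makes the definition concrete. Under an assumed DGP in which \(x_t\) is a random walk and \(y_t = \beta x_t + \varepsilon_t\) with white-noise error, both series are \(I(1)\), yet the combination \(y_t - \beta x_t\) is stationary:

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta = 1000, 2.0                 # illustrative sample size and coefficient

x = np.cumsum(rng.normal(size=T))   # x_t ~ I(1): a pure random walk
y = beta * x + rng.normal(size=T)   # y_t inherits the stochastic trend of x_t

z = y - beta * x                    # equilibrium error: stationary by construction
print(np.var(y), np.var(z))         # y wanders; z stays near zero
```

Both \(x_t\) and \(y_t\) drift without bound, but the linear combination \(z_t\) has a small, stable variance: exactly the cointegration property.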
Before the statistical consequences of nonstationarity were widely understood, empirical macroeconomics routinely analyzed key variables in levels.
By the late 1970s, this practice faced a tension: economic theory pointed to stable relationships among the levels of key variables, while the spurious regression critique warned that levels regressions between nonstationary series could be meaningless.
Examples of relationships historically analyzed in levels include:
The permanent income hypothesis (Friedman) states that households choose consumption based on long-run (permanent) income, not on short-run income fluctuations.
Income is decomposed as \[ y_t = y_t^{p} + y_t^{tr}, \] where \(y_t^{p}\) is permanent income and \(y_t^{tr}\) is transitory income.
Consumption is decomposed analogously: \[ c_t = c_t^{p} + c_t^{tr}. \]
The key behavioral assumption of PIH is: \[ c_t^{p} = \beta \, y_t^{p}, \] with transitory consumption \(c_t^{tr}\) assumed to be stationary.
This implies \[ c_t = \beta y_t^{p} + c_t^{tr}, \] so that \(c_t\) and \(y_t^{p}\) may be nonstationary, but deviations from their long-run relationship are stable.
In the long run, the money market clears: \[ \text{money supply} = \text{money demand}. \]
The behavioral side of this equilibrium is given by the liquidity theory of money demand: \[ \frac{M_t}{P_t} = L(Y_t, i_t), \] where real money balances demanded depend on real income and interest rates.
In logs, the long-run equilibrium condition can be written as \[ m_t - p_t = \beta_0 + \beta_1 y_t + \beta_2 i_t + u_t. \]
If money supply, prices, and income are \(I(1)\), monetary equilibrium requires the disequilibrium term \[ u_t = (m_t - p_t) - \beta_1 y_t - \beta_2 i_t - \beta_0 \] to be stationary.
This implies cointegration among money, prices, and income.
Let \(r_t^{(n)}\) denote the nominal interest rate on an \(n\)-period bond and \(r_t^{(1)}\) the one-period (short-term) rate.
Term structure theory implies that long and short rates are linked by a stable long-run relationship. In particular, the yield spread \[ s_t^{(n)} \equiv r_t^{(n)} - r_t^{(1)} \] reflects expectations of future short rates and term premia.
Empirically, individual interest rates are often highly persistent and are reasonably characterized as \(I(1)\) processes.
If term structure theory is correct in the long run, the spread \(s_t^{(n)}\) should be stable, implying \[ r_t^{(n)} - r_t^{(1)} \sim I(0). \]
Thus, \(r_t^{(n)}\) and \(r_t^{(1)}\) are cointegrated, with the yield spread as the equilibrium error.
Purchasing power parity posits a long-run relationship between nominal exchange rates and relative price levels.
In levels, absolute PPP implies \[ S_t = \frac{P_t}{P_t^*}, \] where \(S_t\) is the nominal exchange rate (domestic price of foreign currency).
Taking logs, \[
s_t = p_t - p_t^*.
\]
Empirically, exchange rates and price levels are often highly persistent and are reasonably characterized as \(I(1)\) processes.
If PPP holds as a long-run equilibrium condition, the real exchange rate \[ q_t \equiv s_t - (p_t - p_t^*) \] should be stable, implying \[ q_t \sim I(0). \]
Thus, \(s_t\), \(p_t\), and \(p_t^*\) are cointegrated, with the real exchange rate as the equilibrium error.
Let \(x_t = (x_{1t}, \ldots, x_{kt})'\) be a vector of time series.
The components of \(x_t\) are said to be cointegrated of order \((d,b)\), denoted \(x_t \sim CI(d,b)\), if:
Each component of \(x_t\) is integrated of order \(d\).
There exists a nonzero vector \(\beta \in \mathbb{R}^k\) such that \[ \beta' x_t \sim I(d-b), \qquad b>0. \]
The vector \(\beta\) is called a cointegrating vector. In most macroeconomic applications, we focus on the case \[ CI(1,1), \] where individual series are \(I(1)\) but the equilibrium error \(\beta' x_t\) is stationary.
Suppose \(x_t \sim CI(d,b)\) and there exists a cointegrating vector \(\beta \neq 0\) such that \[ \beta' x_t \sim I(d-b). \]
For any nonzero scalar \(c \neq 0\), \[ (c\beta)' x_t = c(\beta' x_t) \sim I(d-b). \]
Cointegrating vectors are not unique. Only the cointegrating space is uniquely defined.
To express the long-run restriction in a convenient form, a normalization is imposed.
A common normalization sets one coefficient equal to one. For example, if \[ \beta_1 y_t + \beta_2 x_t \sim I(0), \] we may normalize on \(y_t\): \[ y_t - \theta x_t \sim I(0), \qquad \theta = -\beta_2 / \beta_1. \]
Different normalizations represent the same equilibrium condition.
Let \(x_t\) be a \(k \times 1\) vector of time series, with each component integrated of order \(d\).
Suppose there exist \(r\) linearly independent vectors \[ \beta_1, \ldots, \beta_r \] such that \[ \beta_i' x_t \sim I(d-b), \qquad i = 1,\ldots,r. \]
Then \(x_t\) is said to have cointegrating rank \(r\).
When \(r = 1\), the cointegrating vector is unique up to scale (normalization).
When \(r > 1\), there are multiple linearly independent cointegrating relationships.
Cointegrating rank is the number of linearly independent stationary relations among a set of nonstationary variables.
Rank can exceed one when multiple long-run relations hold simultaneously in the same system.
For example, in a monetary system:
a long-run money demand relation implies \[ m_t - \beta_0 - \beta_1 p_t - \beta_2 y_t - \beta_3 r_t \sim I(0) \]
a monetary policy feedback rule, where the central bank adjusts nominal money supply in response to nominal GDP, implies \[ m_t - \gamma_0 + \gamma_1 (y_t + p_t) \sim I(0) \]
Writing \[ x_t = (m_t,\; 1,\; p_t,\; y_t,\; r_t)', \] we follow the standard convention of augmenting the stochastic variables with a constant so that intercepts are included in the cointegrating relations. Cointegration itself concerns the stochastic variables \((m_t,p_t,y_t,r_t)\).
These relations correspond to the cointegrating vectors
\[ \beta^{(1)} = (1 ,\; -\beta_0,\; -\beta_1,\; -\beta_2,\; -\beta_3)', \qquad \beta^{(2)} = (1,\; -\gamma_0,\; \gamma_1,\; \gamma_1,\; 0)'. \]
Since the vectors are linearly independent, the cointegrating rank is \(r=2\).
By the 1960s–70s, empirical researchers observed that many macroeconomic time series were highly persistent and moved together over long horizons.
This raised a descriptive question:
Are these variables sharing the same long-run sources of persistence, or does each variable drift independently?
One way to formalize this question is to consider the representation \[ x_t = \Lambda f_t + e_t, \] where \(f_t\) is a \(k \times 1\) vector of common \(I(1)\) factors, \(\Lambda\) is an \(n \times k\) matrix of factor loadings, and \(e_t\) is a stationary idiosyncratic component.
If there exists a nonzero vector \(\beta \in \mathbb{R}^n\) such that \[ \beta' x_t \sim I(0), \] then \[ \beta' \Lambda = 0. \]
That is, \(\beta\) eliminates the \(k\) persistent components. This is possible only if \(k < n\).
The number of such linearly independent \(\beta\)’s is \(n-k\), which corresponds to the cointegrating rank.
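This common-factor view can be checked numerically. In the sketch below (hypothetical loadings, with \(n=3\) series driven by \(k=1\) common random-walk factor), any \(\beta\) satisfying \(\beta'\Lambda = 0\) annihilates the factor and leaves a stationary combination:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 1000
Lam = np.array([1.0, 2.0, 1.0])        # hypothetical loadings: n=3, k=1
f = np.cumsum(rng.normal(size=T))      # common I(1) factor (random walk)
e = rng.normal(size=(3, T))            # stationary idiosyncratic components
x = np.outer(Lam, f) + e               # x_t = Lambda f_t + e_t

beta = np.array([2.0, -1.0, 0.0])      # beta' Lam = 0: beta kills the factor
z = beta @ x                           # = beta' e_t, a stationary combination
print(np.var(x[0]), np.var(z))         # x_1 wanders; z does not
```

With \(n=3\) and \(k=1\), the left null space of \(\Lambda\) is two-dimensional, matching a cointegrating rank of \(n-k=2\); the vector above is one of two linearly independent choices.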
Cointegration is a restriction on long-run behavior.
It implies that while variables may drift over time, they cannot drift arbitrarily far apart.
Equivalently, deviations from the long-run relation must be temporary.
For example, let \[ z_{t-1} \equiv y_{t-1} - \beta x_{t-1} \] denote the equilibrium error.
If \(\{z_t\}\) is stationary, then it is mean reverting: when the system is out of equilibrium, adjustment must occur to restore the long-run relation.
Since adjustment cannot occur in levels, it must occur through changes in the variables.
A dynamically coherent specification is therefore \[ \begin{aligned} \Delta y_t &= \alpha_y \, z_{t-1} + \varepsilon_{y,t}, \\ \Delta x_t &= \alpha_x \, z_{t-1} + \varepsilon_{x,t}, \end{aligned} \] which is called an error correction model (ECM).
In practice, short-run dynamics may involve lags, intercepts, and additional covariates.
A general single-equation ECM can be written as \[ \Delta y_t = \alpha\, z_{t-1} + c + \sum_{i=1}^p \phi_i \, \Delta y_{t-i} + \sum_{j=0}^q \psi_j \, \Delta x_{t-j} + \varepsilon_t, \] where \[ z_{t-1} = y_{t-1} - \beta x_{t-1}. \]
An error correction model is typically implemented in two steps.
Step 1. Estimate the long-run relationship
Estimate the cointegrating relation in levels: \[ y_t = \beta x_t + u_t. \]
Obtain the estimated equilibrium error: \[ \hat z_t = y_t - \hat\beta x_t. \]
Step 2. Estimate short-run dynamics
Estimate the ECM using differenced data: \[ \Delta y_t = \alpha \hat z_{t-1} + \sum_{i=1}^p \phi_i \Delta y_{t-i} + \sum_{j=0}^q \psi_j \Delta x_{t-j} + \varepsilon_t. \]
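The two steps can be sketched on simulated data. Assume, for illustration, that \(x_t\) is a random walk and \(y_t = 1.5\,x_t + u_t\) with an AR(1) equilibrium error \(u_t = 0.5\,u_{t-1} + \varepsilon_t\); for this DGP the implied adjustment coefficient is \(-(1-0.5) = -0.5\):

```python
import numpy as np

rng = np.random.default_rng(3)
T, beta_true = 1000, 1.5

# Illustrative DGP: x_t a random walk, y_t = beta*x_t + u_t with AR(1) error
x = np.cumsum(rng.normal(size=T))
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.normal()
y = beta_true * x + u

# Step 1: estimate the long-run relation in levels
X1 = np.column_stack([np.ones(T), x])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
z_hat = y - X1 @ b1                    # estimated equilibrium error

# Step 2: ECM in differences, with the lagged equilibrium error
dy, dx = np.diff(y), np.diff(x)
X2 = np.column_stack([np.ones(T - 1), z_hat[:-1], dx])
b2, *_ = np.linalg.lstsq(X2, dy, rcond=None)
alpha_hat = b2[1]
print(b1[1], alpha_hat)  # beta near 1.5; alpha negative (error correction)
```

The step-1 estimate of \(\beta\) is very accurate (OLS is superconsistent under cointegration), and the step-2 adjustment coefficient \(\hat\alpha\) is negative, pulling \(y_t\) back toward the long-run relation.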
Spurious regression arises because the regression residual is nonstationary.
Cointegration reverses this logic.
Suppose we estimate the levels regression \[ y_t = \beta x_t + u_t, \] where \(x_t\) and \(y_t\) are nonstationary.
Define the estimated residual \[ \hat u_t \equiv y_t - \hat\beta x_t. \]
The Engle–Granger test implements this idea by testing whether \(\hat u_t\) contains a unit root.
Because \(\hat u_t\) is constructed using an estimated coefficient, the test uses critical values different from the usual Dickey–Fuller case.
Cointegration is a symmetric property:
either \(x_t\) and \(y_t\) share a stationary linear combination, or they do not.
The Engle–Granger procedure is asymmetric by construction because it relies on a single OLS projection.
Consider two independent \(I(1)\) processes: \[ x_t = x_{t-1} + \eta_t, \qquad y_t = y_{t-1} + \xi_t, \] with \(\{\eta_t\}\) and \(\{\xi_t\}\) i.i.d., mean zero, and uncorrelated.
Equivalently, \[ x_t = \sum_{i=1}^t \eta_i, \qquad y_t = \sum_{i=1}^t \xi_i. \]
Estimate the levels regression \[ y_t = \beta x_t + \varepsilon_t \] by OLS.
OLS chooses \(\hat\beta\) to minimize \[ \sum_{t=1}^T (y_t - \beta x_t)^2, \] that is, it projects the entire path of \(y_t\) onto the path of \(x_t\).
The resulting residual is \[ \hat\varepsilon_t = y_t - \hat\beta x_t = \sum_{i=1}^t \xi_i - \hat\beta \sum_{i=1}^t \eta_i = \sum_{i=1}^t (\xi_i - \hat\beta \eta_i). \]
If instead we reverse the roles and estimate \[ x_t = \gamma y_t + \nu_t, \] OLS now projects the path of \(x_t\) onto the path of \(y_t\), producing \[ \hat\nu_t = \sum_{i=1}^t (\eta_i - \hat\gamma \xi_i), \] which is a different stochastic process.
Under no cointegration, any residual constructed from nonstationary regressors is nonstationary in population.
The Engle–Granger test is a finite-sample procedure: it evaluates whether the estimated residual appears sufficiently mean-reverting.
Different OLS projections absorb different amounts of low-frequency variation, so residuals can exhibit different degrees of persistence in finite samples.
As a result, unit root tests applied to these residuals can yield different outcomes, even though the underlying variables are not cointegrated.
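The asymmetry is easy to exhibit on simulated data (independent random walks; seed and sample size are arbitrary): the two projection directions produce residual series with different persistence, even though neither direction is cointegrated:

```python
import numpy as np

rng = np.random.default_rng(5)

def ols_resid(y, x):
    """OLS residual from regressing y on x (with intercept)."""
    X = np.column_stack([np.ones(len(y)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b

def acf1(u):
    """Lag-1 sample autocorrelation."""
    u = u - u.mean()
    return (u[1:] @ u[:-1]) / (u @ u)

T = 500
x = np.cumsum(rng.normal(size=T))  # two independent random walks
y = np.cumsum(rng.normal(size=T))

r_yx = acf1(ols_resid(y, x))       # project y on x
r_xy = acf1(ols_resid(x, y))       # project x on y
print(r_yx, r_xy)                  # both near 1, but not identical
```

Both residual series are highly persistent, consistent with no cointegration, yet their measured persistence differs, which is why residual-based tests can give direction-dependent answers in finite samples.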
Key point