Xiamen University, Chow Institute
April, 2026
In earlier discussions of regression with time series data, we used large-sample arguments for dependent observations.
These arguments rely on stationarity of the underlying time series.
In many economic and financial time series, the data exhibit a clear time trend:
A natural first response is to include a time trend directly in the regression:
\[ y_t = \beta_0 + \beta_1 x_t + \gamma t + u_t . \]
Here:
Including a deterministic trend implies:
Inference proceeds as in standard time series regression. This is, however, a strong modeling assumption.
When the nonstationarity in the data is stochastic rather than deterministic, including a time trend is no longer appropriate.
A common response is to difference the data to remove the stochastic trend.
For example, if \(y_t\) is \(I(1)\), first differencing, \[ \Delta y_t = y_t - y_{t-1}, \] yields an \(I(0)\) series: differencing reduces the order of integration by one.
Consider \[ y_t = \alpha + \gamma t + \epsilon_t, \] where \(\epsilon_t\) is stationary.
Taking first differences, \[ \Delta y_t = \gamma + \epsilon_t - \epsilon_{t-1} = \gamma + (1-L) \epsilon_t \]
Differencing applies an unnecessary \((1-L)\) filter to a stationary component, producing an over-differenced process.
The MA polynomial \((1-L)\) has a unit root, so \(\Delta y_t\) inherits a noninvertible MA component — the formal sense in which the series has been over-differenced.
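The over-differencing result can be checked numerically. The sketch below assumes \(\epsilon_t\) is i.i.d. standard normal and uses made-up values for \(\alpha\) and \(\gamma\); neither choice comes from the text. Under that assumption \(\Delta y_t - \gamma\) is an MA(1) with coefficient \(-1\), whose theoretical lag-1 autocorrelation is \(-1/2\).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
alpha, gamma = 1.0, 0.5            # illustrative values, not from the text
eps = rng.standard_normal(T)       # assumption: white-noise epsilon_t
t = np.arange(1, T + 1)
y = alpha + gamma * t + eps        # trend-stationary series

dy = np.diff(y)                    # equals gamma + eps_t - eps_{t-1}
d = dy - dy.mean()
rho1 = (d[1:] @ d[:-1]) / (d @ d)  # lag-1 sample autocorrelation of dy

print(f"mean of dy = {dy.mean():.3f}, lag-1 autocorrelation = {rho1:.3f}")
```

A lag-1 autocorrelation close to \(-0.5\) in a differenced series is a common informal symptom of over-differencing.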
Consider \[ y_t = y_{t-1} + \alpha_0 + \epsilon_t, \] where \(\epsilon_t\) is stationary.
This can be written equivalently as \[ y_t = y_0 + \alpha_0 t + \sum_{i=1}^t \epsilon_i . \]
Subtracting the deterministic component \(y_0 + \alpha_0 t\) removes the linear drift, but the accumulated shocks \(\sum_{i=1}^t \epsilon_i\) remain. Therefore, detrending cannot remove a stochastic trend.
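A minimal numerical sketch of this decomposition, assuming i.i.d. standard normal shocks and made-up values for \(y_0\) and \(\alpha_0\): the recursion and the closed form agree, and what remains after removing \(y_0 + \alpha_0 t\) is exactly the accumulated shocks.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 400
y0, alpha0 = 0.0, 0.2              # illustrative values, not from the text
eps = rng.standard_normal(T)       # assumption: i.i.d. shocks

# Recursion: y_t = y_{t-1} + alpha0 + eps_t
y = np.empty(T)
prev = y0
for i in range(T):
    prev = prev + alpha0 + eps[i]
    y[i] = prev

t = np.arange(1, T + 1)
closed_form = y0 + alpha0 * t + np.cumsum(eps)
detrended = y - (y0 + alpha0 * t)  # subtract the deterministic part only

print("recursion matches closed form:", np.allclose(y, closed_form))
print("detrended series is the random walk:", np.allclose(detrended, np.cumsum(eps)))
```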
The appropriate transformation depends on the nature of the trend.
In practice, the nature of the trend is unknown.
This motivates the need for tools that help distinguish deterministic from stochastic trends — namely, unit root testing.
Differencing restores stationarity, but it removes the low-frequency variation that is often central to macroeconomic questions.
As a result:
After differencing, regression coefficients describe how changes in the dependent variable are related to changes (or levels) of the regressors, not how their levels move together over time.
This concern is emphasized in Christopher A. Sims’ critique of routine differencing and pre-testing in macroeconometric practice.
Consider a regression using variables in levels:
\[ y_t = \beta x_t + u_t, \] where \(x_t\) and \(y_t\) are nonstationary.
Table: OLS regression in levels
|Term | Estimate| Std.Error| t.stat| p.value|
|:-----------|--------:|---------:|------:|-------:|
|(Intercept) | -190.03| 7.69| -24.70| <1e-04|
|co2 | 0.60| 0.02| 25.33| <1e-04|
The coefficient on \(x_t\) is highly significant — but should we believe it?
In the late 19th and early 20th centuries, statistics was primarily used to describe and compare social conditions over time.
Governments, churches, and public institutions began collecting long time series on:
As these long time series became available, researchers naturally began using correlation to assess whether different indicators moved together over time.
Correlation was often interpreted as evidence consistent with a causal relationship, and used to inform policy or intervention.
These associations were often:
The difficulty was not that any single relationship was obviously absurd, but that almost everything appeared related to everything else.
Yule (1926) was among the first to articulate this problem clearly.
He noted that when time series exhibit strong persistence, large correlations can arise mechanically, even when the underlying series are unrelated.
Yule referred to this phenomenon as “nonsense correlation.”
As regression methods became standard from the 1960s onward, researchers increasingly fit relationships between time series in levels: \[ y_t = \beta x_t + u_t. \]
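Granger and Newbold (1974) showed that even when \(x_t\) and \(y_t\) are unrelated, such regressions can produce apparently significant coefficients and a high \(R^2\), together with strongly autocorrelated residuals.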
This phenomenon became known as spurious regression.
| | Estimate| Std. Error| t value| p-value|
|:-----------|--------:|----------:|-------:|-------:|
|(Intercept) | -2.790| 0.347| -8.03| 0|
|x | -0.762| 0.039| -19.44| 0|
Despite \(x_t\) and \(y_t\) being independent, OLS reports a highly significant coefficient and a large \(R^2\).
A preview: the right-hand column shows a case where the residual is stable — cointegration, defined formally in the next section.
The problem of spurious regression lies in the residual \(\{\hat{u}_t\}\) (ignoring the intercept):
\[ \hat{u}_t = y_t - \hat{\beta} x_t . \]
Suppose the variables follow random walks: \[ y_t = \sum_{i=1}^t \xi_i, \qquad x_t = \sum_{i=1}^t \eta_i, \] where \(\{\eta_i\}\) and \(\{\xi_i\}\) are independent, mean-zero innovations.
The residual then behaves like \[ \hat{u}_t \;\approx\; \sum_{i=1}^t \xi_i \;-\; \hat{\beta} \sum_{i=1}^t \eta_i . \]
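The expression above is a sum of accumulated shocks for any value of \(\hat\beta\), so the residual is itself a random-walk-like process. The sketch below illustrates this under the assumption of standard normal innovations (a choice made here, not in the text) by estimating the AR(1) coefficient of \(\hat u_t\).

```python
import numpy as np

rng = np.random.default_rng(7)
T = 1000
y = np.cumsum(rng.standard_normal(T))   # y_t = sum of xi_i
x = np.cumsum(rng.standard_normal(T))   # x_t = sum of eta_i, independent of y

beta_hat = (x @ y) / (x @ x)            # OLS without intercept, as in the text
u_hat = y - beta_hat * x

# AR(1) coefficient of the residual; values near 1 point to a unit root
rho_hat = (u_hat[1:] @ u_hat[:-1]) / (u_hat[:-1] @ u_hat[:-1])
print(f"beta_hat = {beta_hat:.3f}, residual AR(1) coefficient = {rho_hat:.3f}")
```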
Table: Regression in first differences
| | Estimate| Std. Error| t value| Pr(>\|t\|)|
|:-----------|--------:|----------:|-------:|----------:|
|(Intercept) | 0.063| 0.069| 0.908| 0.365|
|diff(x) | 0.044| 0.071| 0.616| 0.539|
Differencing eliminates the spurious significance — but also any information about the long-run levels.
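A small Monte Carlo sketch of the contrast between levels and differences. The sample size, number of replications, and Gaussian innovations are assumptions made for illustration. Each replication draws two independent random walks and records whether the usual slope t-test rejects at the nominal 5% level.

```python
import numpy as np

def slope_tstat(y, x):
    """OLS of y on a constant and x; returns the slope t-statistic."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se

rng = np.random.default_rng(42)
T, reps = 100, 500                      # illustrative settings
hits_levels = hits_diff = 0
for _ in range(reps):
    x = np.cumsum(rng.standard_normal(T))   # independent random walks
    y = np.cumsum(rng.standard_normal(T))
    hits_levels += abs(slope_tstat(y, x)) > 1.96
    hits_diff += abs(slope_tstat(np.diff(y), np.diff(x))) > 1.96

rate_levels, rate_diff = hits_levels / reps, hits_diff / reps
print(f"rejection rate in levels:      {rate_levels:.2f}")
print(f"rejection rate in differences: {rate_diff:.2f}")
```

In runs of this kind the levels regression typically rejects far more often than 5%, while the regression in differences stays close to the nominal level.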
So far, we have seen that:
There is, however, an important exception.
If a linear combination of nonstationary variables is stationary, then a regression in levels can be meaningful.
Suppose \(x_t\) and \(y_t\) are both nonstationary.
They are said to be cointegrated if there exists a coefficient \(\beta\) such that
\[ y_t - \beta x_t \]
is stationary.
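To make the definition concrete, the sketch below simulates a cointegrated pair. The value \(\beta = 2\) and the AR(1) equilibrium error with coefficient 0.5 are assumptions chosen for illustration. Both series are highly persistent, but the combination \(y_t - \beta x_t\) is not.

```python
import numpy as np

def ar1_coef(s):
    """Least-squares AR(1) coefficient of a series (no intercept)."""
    return (s[1:] @ s[:-1]) / (s[:-1] @ s[:-1])

rng = np.random.default_rng(3)
T, beta = 2000, 2.0                      # beta = 2 is an illustrative value
x = np.cumsum(rng.standard_normal(T))    # I(1) regressor
u = np.zeros(T)
for t in range(1, T):                    # stationary AR(1) equilibrium error
    u[t] = 0.5 * u[t - 1] + rng.standard_normal()
y = beta * x + u                         # y shares the stochastic trend of x

z = y - beta * x                         # candidate cointegrating combination
rho_x, rho_z = ar1_coef(x), ar1_coef(z)
beta_hat = (x @ y) / (x @ x)             # levels OLS without intercept

print(f"AR(1) coefficient of x: {rho_x:.3f}, of y - beta*x: {rho_z:.3f}")
print(f"levels OLS estimate of beta: {beta_hat:.3f}")
```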
Before the statistical consequences of nonstationarity were widely understood, empirical macroeconomics routinely analyzed key variables in levels.
By the late 1970s, this practice faced a tension:
Examples of relationships historically analyzed in levels include:
The permanent income hypothesis (Friedman) states that households choose consumption based on long-run (permanent) income, not on short-run income fluctuations.
Income is decomposed as \[ y_t = y_t^{p} + y_t^{tr}, \] where \(y_t^{p}\) denotes permanent income and \(y_t^{tr}\) denotes transitory income.
Consumption is decomposed analogously: \[ c_t = c_t^{p} + c_t^{tr}. \]
The key behavioral assumption of PIH is: \[ c_t^{p} = \beta \, y_t^{p}, \] with transitory consumption \(c_t^{tr}\) assumed to be stationary.
This implies \[ c_t = \beta y_t^{p} + c_t^{tr}, \] so that \(c_t\) and \(y_t^{p}\) may be nonstationary, but deviations from their long-run relationship are stable.
In the long run, the money market clears: \[ \text{money supply} = \text{money demand}. \]
The behavioral side of this equilibrium is given by the liquidity preference theory of money demand: \[ \frac{M_t}{P_t} = L(Y_t, i_t), \] where real money balances demanded depend on real income and interest rates.
In logs, the long-run equilibrium condition can be written as \[ m_t - p_t = \beta_0 + \beta_1 y_t + \beta_2 i_t + u_t. \]
If money supply, prices, and income are \(I(1)\), monetary equilibrium requires the disequilibrium term \[ u_t = (m_t - p_t) - \beta_1 y_t - \beta_2 i_t - \beta_0 \] to be stationary.
This implies cointegration among money, prices, and income.
Let \(r_t^{(n)}\) denote the nominal interest rate on an \(n\)-period bond and \(r_t^{(1)}\) the one-period (short-term) rate.
Term structure theory implies that long and short rates are linked by a stable long-run relationship. In particular, the yield spread \[ s_t^{(n)} \equiv r_t^{(n)} - r_t^{(1)} \] reflects expectations of future short rates and term premia.
Empirically, individual interest rates are often highly persistent and are reasonably characterized as \(I(1)\) processes.
If term structure theory is correct in the long run, the spread \(s_t^{(n)}\) should be stable, implying \[ r_t^{(n)} - r_t^{(1)} \sim I(0). \]
Thus, \(r_t^{(n)}\) and \(r_t^{(1)}\) are cointegrated, with the yield spread as the equilibrium error.
Purchasing power parity posits a long-run relationship between nominal exchange rates and relative price levels.
In levels, absolute PPP implies \[ S_t = \frac{P_t}{P_t^*}, \] where \(S_t\) is the nominal exchange rate (domestic price of foreign currency).
Taking logs, \[ s_t = p_t - p_t^*. \]
Empirically, exchange rates and price levels are often highly persistent and are reasonably characterized as \(I(1)\) processes.
If PPP holds as a long-run equilibrium condition, the real exchange rate \[ q_t \equiv s_t - (p_t - p_t^*) \] should be stable, implying \[ q_t \sim I(0). \]
Thus, \(s_t\), \(p_t\), and \(p_t^*\) are cointegrated, with the real exchange rate as the equilibrium error.
Let \(x_t = (x_{1t}, \ldots, x_{kt})'\) be a vector of time series.
The components of \(x_t\) are said to be cointegrated of order \((d,b)\), denoted \(x_t \sim CI(d,b)\), if:
Each component of \(x_t\) is integrated of order \(d\).
There exists a nonzero vector \(\beta \in \mathbb{R}^k\) such that \[ \beta' x_t \sim I(d-b), \qquad 0 < b \le d. \]
The vector \(\beta\) is called a cointegrating vector. In most macroeconomic applications, we focus on the case \[ CI(1,1), \] where individual series are \(I(1)\) but the equilibrium error \(\beta' x_t\) is stationary.
Suppose \(x_t \sim CI(d,b)\) and there exists a cointegrating vector \(\beta \neq 0\) such that \[ \beta' x_t \sim I(d-b). \]
For any nonzero scalar \(c\), \[ (c\beta)' x_t = c(\beta' x_t) \sim I(d-b). \]
Cointegrating vectors are not unique. Only the cointegrating space is uniquely defined.
To express the long-run restriction in a convenient form, a normalization is imposed.
A common normalization sets one coefficient equal to one. For example, if \[ \beta_1 y_t + \beta_2 x_t \sim I(0), \] we may normalize on \(y_t\): \[ y_t - \theta x_t \sim I(0), \qquad \theta = -\beta_2 / \beta_1. \]
Different normalizations represent the same equilibrium condition.
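As a worked illustration with made-up coefficients (\(\beta_1 = 2\), \(\beta_2 = -4\), chosen only for this example), normalizing on \(y_t\) gives \[ 2 y_t - 4 x_t \sim I(0) \;\Longrightarrow\; y_t - 2 x_t \sim I(0), \qquad \theta = -\frac{-4}{2} = 2, \] while normalizing on \(x_t\) instead gives \[ x_t - \tfrac{1}{2} y_t \sim I(0). \] Both are scalar multiples of the same vector \((2, -4)'\), so they describe the same long-run relation.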
So far we have considered a single long-run relationship between two variables. With more than two variables, several relationships may hold simultaneously.
Let \(x_t\) be a \(k \times 1\) vector of time series, with each component integrated of order \(d\).
Suppose there exist \(r\) linearly independent vectors \[ \beta_1, \ldots, \beta_r \] such that \[ \beta_i' x_t \sim I(d-b), \qquad i = 1,\ldots,r. \]
Then \(x_t\) is said to have cointegrating rank \(r\), provided no larger set of linearly independent cointegrating vectors exists.
When \(r = 1\), the cointegrating vector is unique up to scale (normalization).
When \(r > 1\), there are multiple linearly independent cointegrating relationships.
Rank can exceed one when multiple long-run relations hold simultaneously in the same system.
For example, in a monetary system:
a long-run money demand relation implies \[ m_t - \beta_0 - \beta_1 p_t - \beta_2 y_t - \beta_3 r_t \sim I(0) \]
a monetary policy feedback rule, where the central bank adjusts nominal money supply in response to nominal GDP, implies \[ m_t - \gamma_0 + \gamma_1 (y_t + p_t) \sim I(0) \]
Writing \[ x_t = (m_t,\; 1,\; p_t,\; y_t,\; r_t)', \] we follow the standard convention of augmenting the stochastic variables with a constant so that intercepts are included in the cointegrating relations. Cointegration itself concerns the stochastic variables \((m_t,p_t,y_t,r_t)\).
These relations correspond to the cointegrating vectors
\[ \beta^{(1)} = (1 ,\; -\beta_0,\; -\beta_1,\; -\beta_2,\; -\beta_3)', \qquad \beta^{(2)} = (1,\; -\gamma_0,\; \gamma_1,\; \gamma_1,\; 0)'. \]
Since the vectors are linearly independent, the cointegrating rank is \(r=2\).
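Linear independence can be checked numerically once parameter values are fixed. The values below are hypothetical and serve only to illustrate the calculation. Because cointegration concerns the stochastic variables, the rank is computed after dropping the entry that multiplies the constant.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
b0, b1, b2, b3 = 0.5, 1.0, 0.8, -0.1
g0, g1 = 0.2, 0.6

# Ordering follows x_t = (m_t, 1, p_t, y_t, r_t)'
beta_1 = np.array([1.0, -b0, -b1, -b2, -b3])
beta_2 = np.array([1.0, -g0, g1, g1, 0.0])

B = np.column_stack([beta_1, beta_2])   # 5 x 2 matrix of cointegrating vectors
B_stoch = np.delete(B, 1, axis=0)       # drop the row for the constant
r = np.linalg.matrix_rank(B_stoch)
print("cointegrating rank implied by the two relations:", r)
```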
By the 1960s–70s, empirical researchers observed that many macroeconomic time series were highly persistent and moved together over long horizons.
This raised a descriptive question:
Are these variables sharing the same long-run sources of persistence, or does each variable drift independently?
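One way to formalize this question is to consider the representation \[ x_t = \Lambda f_t + e_t, \] where \(x_t\) is \(n \times 1\), \(f_t\) is a \(k \times 1\) vector of \(I(1)\) persistent components, \(\Lambda\) is an \(n \times k\) loading matrix, and \(e_t\) is \(I(0)\).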
If there exists a nonzero vector \(\beta \in \mathbb{R}^n\) such that \(\beta' x_t \sim I(0)\), then substituting the factor representation gives \[ \beta' x_t = \beta' \Lambda f_t + \beta' e_t. \]
Since \(\beta' e_t\) is \(I(0)\) and \(f_t\) is \(I(1)\), the sum can be \(I(0)\) only if \[ \beta' \Lambda = 0. \]
That is, \(\beta\) eliminates the \(k\) persistent components. This is possible only if \(k < n\).
The number of such linearly independent \(\beta\)’s is \(n-k\), which corresponds to the cointegrating rank.
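The counting argument can be illustrated with a small simulation. The loading matrix below is hypothetical and is assumed to have full column rank. The vectors satisfying \(\beta' \Lambda = 0\) are recovered from the singular value decomposition of \(\Lambda\), and there are \(n - k\) of them.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, T = 3, 1, 1000
Lam = np.array([[1.0], [0.5], [2.0]])                 # hypothetical n x k loadings
f = np.cumsum(rng.standard_normal((T, k)), axis=0)    # I(1) common component
e = rng.standard_normal((T, n))                       # I(0) idiosyncratic part
x = f @ Lam.T + e                                     # row t holds x_t'

# Columns of U beyond the first k are orthogonal to the columns of Lam
U, s, Vt = np.linalg.svd(Lam, full_matrices=True)
B = U[:, k:]                                          # n x (n - k) basis

z = x @ B                                             # beta' x_t for each column of B
print("number of independent cointegrating vectors:", B.shape[1])
print("beta' Lam is zero:", np.allclose(B.T @ Lam, 0.0))
print("variance of x_1:", x[:, 0].var(), "variance of beta' x:", z.var(axis=0))
```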
We now turn from characterizing cointegration to its dynamic implications.
Cointegration is a restriction on long-run behavior.
It implies that while variables may drift over time, they cannot drift arbitrarily far apart.
Equivalently, deviations from the long-run relation must be temporary.
For example, let \[ z_{t-1} \equiv y_{t-1} - \beta x_{t-1} \] denote the equilibrium error.
If \(\{z_t\}\) is stationary, then it is mean reverting: when the system is out of equilibrium, adjustment must occur to restore the long-run relation.
The equilibrium error evolves from one period to the next as \[ \Delta z_t = \Delta y_t - \beta \, \Delta x_t. \]
So when the system is out of equilibrium (\(z_{t-1} \ne 0\)), the correction can only come through \(\Delta y_t\) and \(\Delta x_t\) — these are the only channels by which the error can move.
A dynamically coherent specification is therefore \[ \begin{aligned} \Delta y_t &= \alpha_y \, z_{t-1} + \varepsilon_{y,t}, \\ \Delta x_t &= \alpha_x \, z_{t-1} + \varepsilon_{x,t}, \end{aligned} \] which is called an error correction model (ECM).
If \(\alpha_y = 0\), then \(y_t\) does not respond to disequilibrium.
All error correction occurs through \(x_t\).
In this case, \(y_t\) is said to be weakly exogenous: it does not participate in error correction.
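Combining the two ECM equations with \(\Delta z_t = \Delta y_t - \beta \, \Delta x_t\) gives \(z_t = (1 + \alpha_y - \beta \alpha_x)\, z_{t-1} + \varepsilon_{y,t} - \beta\, \varepsilon_{x,t}\), so the equilibrium error is a stationary AR(1) when \(|1 + \alpha_y - \beta \alpha_x| < 1\). The sketch below checks this by simulation, using illustrative parameter values and Gaussian shocks that are assumptions rather than values from the text.

```python
import numpy as np

rng = np.random.default_rng(11)
T = 5000
beta, a_y, a_x = 1.0, -0.2, 0.1        # illustrative values, not from the text
y = np.zeros(T)
x = np.zeros(T)
for t in range(1, T):
    z_lag = y[t - 1] - beta * x[t - 1]             # equilibrium error z_{t-1}
    y[t] = y[t - 1] + a_y * z_lag + rng.standard_normal()
    x[t] = x[t - 1] + a_x * z_lag + rng.standard_normal()

z = y - beta * x
rho_hat = (z[1:] @ z[:-1]) / (z[:-1] @ z[:-1])     # AR(1) coefficient of z_t
rho_implied = 1 + a_y - beta * a_x                 # 0.7 for the values above
print(f"implied AR(1) coefficient: {rho_implied:.2f}, estimated: {rho_hat:.3f}")
```

Setting `a_y = 0.0` in this sketch leaves all adjustment to \(x_t\), which corresponds to the weakly exogenous case described above.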
In practice, short-run dynamics may involve lags, intercepts, and additional covariates.
A general single-equation ECM can be written as \[ \Delta y_t = \alpha\, z_{t-1} + c + \sum_{i=1}^p \phi_i \, \Delta y_{t-i} + \sum_{j=0}^q \psi_j \, \Delta x_{t-j} + \varepsilon_t, \] where \[ z_{t-1} = y_{t-1} - \beta x_{t-1}. \]
Before we estimate \(\beta\) by OLS in levels, we need to know whether OLS produces a usable estimate when the regressors are nonstationary.
In a standard stationary regression, OLS converges at rate \(\sqrt{T}\): \[ \hat\beta - \beta = O_p(T^{-1/2}). \]
In a cointegrating regression between \(I(1)\) variables, OLS converges at rate \(T\): \[ \hat\beta - \beta = O_p(T^{-1}). \]
This is super-consistency: the estimator converges much faster because \(\sum_{t=1}^T x_t^2 = O_p(T^2)\) when \(x_t\) is \(I(1)\) (versus \(O_p(T)\) when stationary), and the OLS denominator scales accordingly.
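The two rates can be compared in a small Monte Carlo sketch. The design is an assumption made for illustration: Gaussian errors, a regressor independent of the error, and sample sizes 100 and 400. Quadrupling \(T\) should shrink the typical estimation error by roughly a factor of 4 under rate \(T\) and roughly a factor of 2 under rate \(\sqrt{T}\).

```python
import numpy as np

def median_abs_error(T, integrated, reps, rng, beta=1.0):
    """Median |beta_hat - beta| across simulated samples of length T."""
    errs = np.empty(reps)
    for r in range(reps):
        v = rng.standard_normal(T)
        x = np.cumsum(v) if integrated else v     # I(1) or I(0) regressor
        u = rng.standard_normal(T)                # stationary error
        y = beta * x + u
        errs[r] = abs((x @ y) / (x @ x) - beta)   # OLS without intercept
    return np.median(errs)

rng = np.random.default_rng(2026)
reps = 1000
ratio_i1 = median_abs_error(100, True, reps, rng) / median_abs_error(400, True, reps, rng)
ratio_i0 = median_abs_error(100, False, reps, rng) / median_abs_error(400, False, reps, rng)
print(f"error shrinks by a factor of about {ratio_i1:.1f} with an I(1) regressor")
print(f"error shrinks by a factor of about {ratio_i0:.1f} with an I(0) regressor")
```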
Estimate the long-run relationship.
Estimate the cointegrating relation in levels: \[ y_t = \beta x_t + u_t. \]
Obtain the estimated equilibrium error: \[ \hat z_t = y_t - \hat\beta x_t. \]
Estimate the short-run dynamics.
Fit the ECM using differenced data and the lagged equilibrium error: \[ \Delta y_t = \alpha \hat z_{t-1} + \sum_{i=1}^p \phi_i \Delta y_{t-i} + \sum_{j=0}^q \psi_j \Delta x_{t-j} + \varepsilon_t. \]
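The two steps can be sketched on simulated data. The data-generating process and its parameter values (\(\beta = 2\), \(\alpha = -0.3\), \(\psi_0 = 0.5\)) are hypothetical, and the second step uses \(p = 0\) and \(q = 0\) for brevity.

```python
import numpy as np

rng = np.random.default_rng(8)
T = 2000
beta, alpha, psi = 2.0, -0.3, 0.5      # illustrative "true" values
x = np.cumsum(rng.standard_normal(T))
y = np.zeros(T)
y[0] = beta * x[0]
for t in range(1, T):
    z_lag = y[t - 1] - beta * x[t - 1]
    y[t] = y[t - 1] + alpha * z_lag + psi * (x[t] - x[t - 1]) + rng.standard_normal()

# Step 1: levels regression (no intercept, as in the text) and its residual
beta_hat = (x @ y) / (x @ x)
z_hat = y - beta_hat * x

# Step 2: regress dy_t on z_hat_{t-1} and dx_t
dy, dx = np.diff(y), np.diff(x)
X = np.column_stack([z_hat[:-1], dx])  # z_hat[:-1] lines up with z_hat_{t-1}
coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
alpha_hat, psi_hat = coef
print(f"beta_hat = {beta_hat:.3f}, alpha_hat = {alpha_hat:.3f}, psi_hat = {psi_hat:.3f}")
```

A negative `alpha_hat` indicates that \(y_t\) moves to close the gap when it is above its long-run relation with \(x_t\).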
Assume both \(x_t\) and \(y_t\) are \(I(1)\).
Because \(\hat\beta\) is super-consistent, the estimated residual \(\hat u_t = y_t - \hat\beta x_t\) inherits the stationarity properties of the true equilibrium error.
The Engle–Granger test inverts the spurious-regression logic: if \(x_t\) and \(y_t\) are cointegrated, \(\hat u_t\) should be stationary.
Suppose we estimate the levels regression \[ y_t = \beta x_t + u_t, \] and define the residual \(\hat u_t \equiv y_t - \hat\beta x_t\).
The Engle–Granger test checks whether \(\hat u_t\) contains a unit root by running an ADF regression on the residuals: \[ \Delta \hat u_t = \rho\, \hat u_{t-1} + \sum_{i=1}^{p} \phi_i \Delta \hat u_{t-i} + e_t, \] and testing \(H_0: \rho = 0\) (unit root in residuals, i.e., no cointegration).
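Because \(\hat u_t\) is constructed using a coefficient that OLS chooses to make the residual as small as possible, the residual tends to look more stationary than the true error would under the null. The test therefore uses its own critical values, which are more negative than the usual Dickey–Fuller tables.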
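A minimal sketch of the residual-based test statistic, written with NumPy only. Two choices here are assumptions rather than part of the text: a constant is included in the levels regression, and the comparison value of about \(-3.34\) is the approximate asymptotic 5% critical value commonly tabulated for two variables with a constant. Published tables should be consulted for actual inference.

```python
import numpy as np

def eg_adf_tstat(y, x, p=1):
    """t-statistic on rho in the ADF regression of the levels residual."""
    X = np.column_stack([np.ones_like(x), x])        # constant added (assumption)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ b
    du = np.diff(u)                                  # du[j] = u[j+1] - u[j]
    dep = du[p:]                                     # Delta u_t
    cols = [u[p:-1]]                                 # u_{t-1}
    cols += [du[p - i:-i] for i in range(1, p + 1)]  # Delta u_{t-i}
    Z = np.column_stack(cols)
    g, *_ = np.linalg.lstsq(Z, dep, rcond=None)
    e = dep - Z @ g
    s2 = e @ e / (len(dep) - Z.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[0, 0])
    return g[0] / se

rng = np.random.default_rng(21)
T = 500
x = np.cumsum(rng.standard_normal(T))
u = np.zeros(T)
for t in range(1, T):                                # stationary AR(1) error
    u[t] = 0.5 * u[t - 1] + rng.standard_normal()
y_coint = 1.0 + 2.0 * x + u                          # cointegrated with x
y_indep = np.cumsum(rng.standard_normal(T))          # unrelated random walk

CRIT_5PCT = -3.34                                    # approximate, see lead-in
t_coint = eg_adf_tstat(y_coint, x)
t_indep = eg_adf_tstat(y_indep, x)
print(f"cointegrated pair: t = {t_coint:.2f}, reject H0: {t_coint < CRIT_5PCT}")
print(f"independent pair:  t = {t_indep:.2f}, reject H0: {t_indep < CRIT_5PCT}")
```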
Cointegration is a symmetric property: either \(x_t\) and \(y_t\) share a stationary linear combination, or they do not.
The Engle–Granger procedure is asymmetric by construction because it relies on a single OLS projection.
Consider two independent \(I(1)\) processes: \[ x_t = x_{t-1} + \eta_t, \qquad y_t = y_{t-1} + \xi_t, \] where \(\{\eta_t\}\) and \(\{\xi_t\}\) are each i.i.d. with mean zero and independent of one another.
Equivalently, \[ x_t = \sum_{i=1}^t \eta_i, \qquad y_t = \sum_{i=1}^t \xi_i. \]
Estimate the levels regression \[ y_t = \beta x_t + \varepsilon_t \] by OLS.
OLS chooses \(\hat\beta\) to minimize \[ \sum_{t=1}^T (y_t - \beta x_t)^2, \] that is, it projects the entire path of \(y_t\) onto the path of \(x_t\).
The resulting residual is \[ \hat u_t = y_t - \hat\beta x_t = \sum_{i=1}^t \xi_i - \hat\beta \sum_{i=1}^t \eta_i = \sum_{i=1}^t (\xi_i - \hat\beta \eta_i). \]
If instead we reverse the roles and estimate \[ x_t = \gamma y_t + \nu_t, \] OLS now projects the path of \(x_t\) onto the path of \(y_t\), producing \[ \hat\nu_t = \sum_{i=1}^t (\eta_i - \hat\gamma \xi_i), \] which is a different stochastic process.
In finite samples, \(\hat\gamma \ne 1/\hat\beta\), since each OLS minimizes a different sum of squares.
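For regressions without an intercept, the link between the two projections is exact: \(\hat\beta \hat\gamma = (\sum x_t y_t)^2 / (\sum x_t^2 \sum y_t^2)\), the uncentered \(R^2\). This is below one unless the two series are exactly proportional, so \(\hat\gamma = 1/\hat\beta\) holds only with a perfect fit. The sketch below verifies the identity on simulated random walks; the Gaussian innovations are an assumption.

```python
import numpy as np

rng = np.random.default_rng(13)
T = 300
x = np.cumsum(rng.standard_normal(T))
y = np.cumsum(rng.standard_normal(T))      # independent of x

beta_hat = (x @ y) / (x @ x)               # projection of y on x
gamma_hat = (x @ y) / (y @ y)              # projection of x on y
r2_uncentered = (x @ y) ** 2 / ((x @ x) * (y @ y))

print(f"beta_hat * gamma_hat = {beta_hat * gamma_hat:.4f}")
print(f"uncentered R^2       = {r2_uncentered:.4f}")
```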
The Engle–Granger approach estimates a single cointegrating relation, here between two variables.
For systems with multiple variables and potentially multiple cointegrating relations: