Cointegration and Error Correction Models

Natasha Kang

Xiamen University, Chow Institute

April, 2026

Regression with Time Series Data: What Was Assumed

In earlier discussions of regression with time series data, we used large-sample arguments for dependent observations.

These arguments rely on stationarity of the underlying time series.

In many economic and financial time series, the data exhibit a clear time trend:

output and income
price levels
population
asset prices

Regression with a Deterministic Time Trend

A natural first response is to include a time trend directly in the regression:

\[ y_t = \beta_0 + \beta_1 x_t + \gamma t + u_t . \]

Here:

\(t\) captures systematic time evolution
\(x_t\) explains variation around the trend

Including a deterministic trend implies:

the mean of \(y_t\) changes smoothly over time
the slope of the trend is fixed

Inference proceeds as in standard time series regression. This is, however, a strong modeling assumption.

If the Trend Is Not Deterministic

When the nonstationarity in the data is stochastic rather than deterministic, including a time trend is no longer appropriate.

A common response is to difference the data to remove the stochastic trend.

For example, if \(y_t\) is \(I(1)\):

levels are nonstationary
first differences are stationary

Differencing, \[ \Delta y_t = y_t - y_{t-1}, \] reduces the order of integration by one.

Can We Remove a Deterministic Trend by Differencing?

Consider \[ y_t = \alpha + \gamma t + \epsilon_t, \] where \(\epsilon_t\) is stationary.

Taking first differences, \[ \Delta y_t = \gamma + \epsilon_t - \epsilon_{t-1} = \gamma + (1-L) \epsilon_t \]

Differencing applies an unnecessary \((1-L)\) filter to a stationary component, producing an over-differenced process.

The MA polynomial \((1-L)\) has a unit root, so \(\Delta y_t\) inherits a noninvertible MA component — the formal sense in which the series has been over-differenced.
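The over-differencing result can be checked numerically. The sketch below assumes, more strongly than the text, that \(\epsilon_t\) is Gaussian white noise; the values of `alpha`, `gamma`, and `T` are illustrative choices, not taken from the slides. In that case \(\Delta y_t\) is an MA(1) with coefficient \(-1\), whose lag-1 autocorrelation is \(-1/2\).

```python
import random

random.seed(0)

T = 5000
alpha, gamma = 1.0, 0.5          # hypothetical intercept and trend slope

# Trend-stationary series: y_t = alpha + gamma * t + eps_t, eps_t white noise
y = [alpha + gamma * t + random.gauss(0.0, 1.0) for t in range(T)]

# First difference: dy_t = gamma + eps_t - eps_{t-1}
dy = [y[t] - y[t - 1] for t in range(1, T)]

def autocorr(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    den = sum(d * d for d in dev)
    return num / den

mean_dy = sum(dy) / len(dy)      # should sit close to gamma
rho1 = autocorr(dy, 1)           # should sit close to -0.5

print(f"mean of differenced series: {mean_dy:.3f}")
print(f"lag-1 autocorrelation:      {rho1:.3f}")
```

A strongly negative first autocorrelation in a differenced series is a common informal symptom that the original series did not need differencing.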

Can We Remove a Stochastic Trend by Detrending?

Consider \[ y_t = y_{t-1} + \alpha_0 + \epsilon_t, \] where \(\epsilon_t\) is stationary.

This can be written equivalently as \[ y_t = y_0 + \alpha_0 t + \sum_{i=1}^t \epsilon_i . \]

Subtracting the deterministic component \(y_0 + \alpha_0 t\) removes the linear drift, but the accumulated shocks \(\sum_{i=1}^t \epsilon_i\) remain. Therefore, detrending cannot remove a stochastic trend.
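A small simulation illustrates the point. This is a sketch under assumed values (Gaussian white-noise shocks, drift `alpha0 = 0.2`, \(y_0 = 0\)); the helper names are hypothetical. It removes a fitted linear trend from a random walk with drift and compares the persistence of the detrended series with that of the first difference.

```python
import random

random.seed(1)

T = 2000
alpha0 = 0.2                      # hypothetical drift

# Random walk with drift: y_t = y_{t-1} + alpha0 + eps_t, starting at y_0 = 0
y = [0.0]
for _ in range(T):
    y.append(y[-1] + alpha0 + random.gauss(0.0, 1.0))

def ols_line(values):
    """Intercept and slope from regressing `values` on t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return v_mean - slope * t_mean, slope

def autocorr1(series):
    """Sample lag-1 autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / sum(d * d for d in dev)

a, b = ols_line(y)
detrended = [v - (a + b * t) for t, v in enumerate(y)]
diffed = [y[t] - y[t - 1] for t in range(1, len(y))]

rho_detrended = autocorr1(detrended)   # stays near 1: accumulated shocks remain
rho_diffed = autocorr1(diffed)         # near 0: differencing removes them

print(f"lag-1 autocorrelation, detrended:   {rho_detrended:.3f}")
print(f"lag-1 autocorrelation, differenced: {rho_diffed:.3f}")
```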

Choosing the Right Transformation

The appropriate transformation depends on the nature of the trend.

If the trend is deterministic:
- detrending is sufficient
- differencing is inappropriate and leads to over-differencing
If the trend is stochastic:
- detrending fails
- differencing may be required to restore stationarity

In practice, the nature of the trend is unknown.

This motivates the need for tools that help distinguish deterministic from stochastic trends — namely, unit root testing.

A Critique of Differencing

Differencing restores stationarity, but it removes the low-frequency variation that is often central to macroeconomic questions.

As a result:

permanent movements in levels are removed
long-run comovement between variables is no longer visible
relationships in levels cannot be identified

After differencing, regression coefficients describe how changes in the dependent variable are related to changes in (or levels of) the regressors, not how their levels move together over time.

This concern is emphasized in Christopher A. Sims’ critique of routine differencing and pre-testing in macroeconometric practice.

What If We Do Not Difference?

Consider a regression using variables in levels:

\[ y_t = \beta x_t + u_t, \] where \(x_t\) and \(y_t\) are nonstationary.

Example: Two Nonstationary Series in Levels

OLS Regression in Levels



Table: OLS regression in levels

|Term        | Estimate| Std.Error| t.stat| p.value|
|:-----------|--------:|---------:|------:|-------:|
|(Intercept) |  -190.03|      7.69| -24.70|  <1e-04|
|co2         |     0.60|      0.02|  25.33|  <1e-04|

The coefficient on the regressor \(x_t\) (here, co2) is highly significant, but should we believe it?

Can We Trust Statistical Association in Levels?

In the late 19th and early 20th centuries, statistics was primarily used to describe and compare social conditions over time.

Governments, churches, and public institutions began collecting long time series on:

church attendance, as a measure of religiosity and moral behavior
drunkenness or alcohol consumption, as indicators of social disorder
poverty, population, and mortality, as measures of social welfare

As these long time series became available, researchers naturally began using correlation to assess whether different indicators moved together over time.

Correlation was often interpreted as evidence consistent with a causal relationship, and used to inform policy or intervention.

Why the Patterns Looked Real

These associations were often:

large in magnitude
stable over time
regarded as strong and non-negligible by the statistical standards of the time

The difficulty was not that any single relationship was obviously absurd, but that almost everything appeared related to everything else.

Yule’s Diagnosis: “Nonsense Correlation”

Yule (1926) was among the first to articulate this problem clearly.

He noted that when time series exhibit strong persistence, large correlations can arise mechanically, even when the underlying series are unrelated.

Yule referred to this phenomenon as “nonsense correlation.”

Spurious Regression

As regression methods became standard from the 1960s onward, researchers increasingly fit relationships between time series in levels: \[ y_t = \beta x_t + u_t. \]

Granger and Newbold (1974) showed that even when \(x_t\) and \(y_t\) are unrelated, such regressions can produce:

large \(t\)-statistics
high \(R^2\)
persistent residuals

This phenomenon became known as spurious regression.

Illustration: Independent Persistent Series

Spurious Regression: OLS Output



|            | Estimate| Std. Error| t value| p-value|
|:-----------|--------:|----------:|-------:|-------:|
|(Intercept) |   -2.790|      0.347|   -8.03|       0|
|x           |   -0.762|      0.039|  -19.44|       0|

Despite \(x_t\) and \(y_t\) being independent, OLS reports a highly significant coefficient and a large \(R^2\).

Illustration: Cointegrated vs Independent Series

A preview: the right-hand column shows a case where the residual is stable — cointegration, defined formally in the next section.

Spurious \(R^2\)

Residual Persistence in Spurious Regression

Nonstationary Residuals

The problem of spurious regression lies in the residual \(\{\hat{u}_t\}\) (ignoring the intercept):

\[ \hat{u}_t = y_t - \hat{\beta} x_t . \]

Suppose the variables follow random walks: \[ y_t = \sum_{i=1}^t \xi_i, \qquad x_t = \sum_{i=1}^t \eta_i, \] where \(\{\eta_i\}\) and \(\{\xi_i\}\) are independent, mean-zero innovations.

The Residual Inherits a Unit Root

The residual then behaves like \[ \hat{u}_t \;\approx\; \sum_{i=1}^t \xi_i \;-\; \hat{\beta} \sum_{i=1}^t \eta_i . \]

The residual process is nonstationary (contains a unit root component).
This violates the assumptions underlying standard OLS inference.

Illustration: Same Data, After Differencing



Table: Regression in first differences

|            | Estimate| Std. Error| t value| p-value|
|:-----------|--------:|----------:|-------:|-------:|
|(Intercept) |    0.063|      0.069|   0.908|              0.365|
|diff(x)     |    0.044|      0.071|   0.616|              0.539|

Differencing eliminates the spurious significance — but also any information about the long-run levels.
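The contrast between levels and differences can be reproduced in a small Monte Carlo. The sketch below is not the code behind the tables above; it assumes independent Gaussian random walks, a sample size of 100, and 500 replications, and it counts how often a nominal 5% two-sided \(t\)-test rejects a zero slope.

```python
import math
import random

random.seed(2)

def slope_t_stat(x, y):
    """t-statistic on the slope in an OLS regression of y on a constant and x."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((a - xm) ** 2 for a in x)
    sxy = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    beta = sxy / sxx
    rss = sum((b - ym - beta * (a - xm)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta / se

def random_walk(n):
    """Driftless Gaussian random walk of length n."""
    level, path = 0.0, []
    for _ in range(n):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

T, reps = 100, 500
reject_levels = reject_diffs = 0
for _ in range(reps):
    x, y = random_walk(T), random_walk(T)      # independent by construction
    dx = [x[t] - x[t - 1] for t in range(1, T)]
    dy = [y[t] - y[t - 1] for t in range(1, T)]
    reject_levels += abs(slope_t_stat(x, y)) > 1.96
    reject_diffs += abs(slope_t_stat(dx, dy)) > 1.96

rate_levels = reject_levels / reps
rate_diffs = reject_diffs / reps
print(f"rejection rate in levels:      {rate_levels:.2f}")
print(f"rejection rate in differences: {rate_diffs:.2f}")
```

In levels the test rejects far more often than 5% even though the series are unrelated; in differences the rejection rate returns to roughly its nominal size.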

When Is a Levels Regression Not Spurious?

So far, we have seen that:

regressions in levels with nonstationary data are unreliable
differencing removes the problem, but also removes long-run information

There is, however, an important exception.

If a linear combination of nonstationary variables is stationary, then a regression in levels can be meaningful.

Cointegration

Suppose \(x_t\) and \(y_t\) are both nonstationary.

They are said to be cointegrated if there exists a coefficient \(\beta\) such that

\[ y_t - \beta x_t \]

is stationary. Intuitively:

the variables may wander individually
but they do not drift arbitrarily far apart
deviations from the long-run relationship are stable
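To make the definition concrete, the following sketch simulates a cointegrated pair under assumed values: \(x_t\) is a Gaussian random walk, the equilibrium error is an AR(1) with coefficient `phi = 0.5`, and `beta = 2.0`. None of these numbers come from the slides. It compares the sample variance of the levels with that of the combination \(y_t - \beta x_t\).

```python
import random

random.seed(3)

T = 4000
beta = 2.0        # hypothetical cointegrating coefficient
phi = 0.5         # AR(1) coefficient of the stationary equilibrium error

x, y = [], []
x_level, z_level = 0.0, 0.0
for _ in range(T):
    x_level += random.gauss(0.0, 1.0)                  # x_t wanders (random walk)
    z_level = phi * z_level + random.gauss(0.0, 1.0)   # z_t is stationary
    x.append(x_level)
    y.append(beta * x_level + z_level)                 # y_t = beta * x_t + z_t

def variance(series):
    """Sample variance (dividing by n)."""
    m = sum(series) / len(series)
    return sum((v - m) ** 2 for v in series) / len(series)

spread = [yt - beta * xt for xt, yt in zip(x, y)]      # y_t - beta * x_t

var_x, var_y, var_spread = variance(x), variance(y), variance(spread)
print(f"sample variance of x:          {var_x:.1f}")
print(f"sample variance of y:          {var_y:.1f}")
print(f"sample variance of y - beta x: {var_spread:.2f}")
```

The variance of the spread stays near its population value \(1/(1-\phi^2)\), while the sample variances of the individual series grow with the sample length.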

Cointegration: Economic Motivation

Before the statistical consequences of nonstationarity were widely understood, empirical macroeconomics routinely analyzed key variables in levels.

By the late 1970s, this practice faced a tension:

many macroeconomic time series appeared to be nonstationary
differencing restored statistical validity
but economic analysis often focused on long-run relationships in levels

Examples of relationships historically analyzed in levels include:

consumption and income, studied as long-run co-moving aggregates
money supply and the price level, central to monetary equilibrium analysis
nominal interest rates at different maturities, where yield spreads are often stable
exchange rates and relative prices, motivated by long-run parity conditions

Consumption and Income (Permanent Income Hypothesis)

The permanent income hypothesis (Friedman) states that households choose consumption based on long-run (permanent) income, not on short-run income fluctuations.

Income is decomposed as \[ y_t = y_t^{p} + y_t^{tr}, \] where:

\(y_t^{p}\) is permanent income
\(y_t^{tr}\) is transitory income

Consumption is decomposed analogously: \[ c_t = c_t^{p} + c_t^{tr}. \]

PIH: Behavioral Assumption and Cointegration

The key behavioral assumption of PIH is: \[ c_t^{p} = \beta \, y_t^{p}, \] with transitory consumption \(c_t^{tr}\) assumed to be stationary.

This implies \[ c_t = \beta y_t^{p} + c_t^{tr}, \] so that \(c_t\) and \(y_t^{p}\) may be nonstationary, but deviations from their long-run relationship are stable.

Monetary Equilibrium (Money, Prices, and Income)

In the long run, the money market clears: \[ \text{money supply} = \text{money demand}. \]

The behavioral side of this equilibrium is given by the liquidity preference theory of money demand: \[ \frac{M_t}{P_t} = L(Y_t, i_t), \] where real money balances demanded depend on real income and interest rates.

Monetary Equilibrium as Cointegration

In logs, the long-run equilibrium condition can be written as \[ m_t - p_t = \beta_0 + \beta_1 y_t + \beta_2 i_t + u_t. \]

If money supply, prices, and income are \(I(1)\), monetary equilibrium requires the disequilibrium term \[ u_t = (m_t - p_t) - \beta_1 y_t - \beta_2 i_t - \beta_0 \] to be stationary.

This implies cointegration among money, prices, income, and (if it is also \(I(1)\)) the interest rate.

Term Structure of Interest Rates

Let \(r_t^{(n)}\) denote the nominal interest rate on an \(n\)-period bond and \(r_t^{(1)}\) the one-period (short-term) rate.

Term structure theory implies that long and short rates are linked by a stable long-run relationship. In particular, the yield spread \[ s_t^{(n)} \equiv r_t^{(n)} - r_t^{(1)} \] reflects expectations of future short rates and term premia.

Empirically, individual interest rates are often highly persistent and are reasonably characterized as \(I(1)\) processes.

If term structure theory is correct in the long run, the spread \(s_t^{(n)}\) should be stable, implying \[ r_t^{(n)} - r_t^{(1)} \sim I(0). \]

Thus, \(r_t^{(n)}\) and \(r_t^{(1)}\) are cointegrated, with the yield spread as the equilibrium error.

Purchasing Power Parity (PPP)

Purchasing power parity posits a long-run relationship between nominal exchange rates and relative price levels.

In levels, absolute PPP implies \[ S_t = \frac{P_t}{P_t^*}, \] where \(S_t\) is the nominal exchange rate (domestic price of foreign currency).

Taking logs, \[ s_t = p_t - p_t^*. \]

Empirically, exchange rates and price levels are often highly persistent and are reasonably characterized as \(I(1)\) processes.

PPP as Cointegration

If PPP holds as a long-run equilibrium condition, the real exchange rate \[ q_t \equiv s_t - (p_t - p_t^*) \] should be stable, implying \[ q_t \sim I(0). \]

Thus, \(s_t\), \(p_t\), and \(p_t^*\) are cointegrated, with the real exchange rate as the equilibrium error.

Cointegration (Engle and Granger, 1987)

Let \(x_t = (x_{1t}, \ldots, x_{kt})'\) be a vector of time series.

The components of \(x_t\) are said to be cointegrated of order \((d,b)\), denoted \(x_t \sim CI(d,b)\), if:

Each component of \(x_t\) is integrated of order \(d\).
There exists a nonzero vector \(\beta \in \mathbb{R}^k\) such that \[ \beta' x_t \sim I(d-b), \qquad 0 < b \le d. \]

The vector \(\beta\) is called a cointegrating vector. In most macroeconomic applications, we focus on the case \[ CI(1,1), \] where individual series are \(I(1)\) but the equilibrium error \(\beta' x_t\) is stationary.

Non-Uniqueness of Cointegrating Vectors

Suppose \(x_t \sim CI(d,b)\) and there exists a cointegrating vector \(\beta \neq 0\) such that \[ \beta' x_t \sim I(d-b). \]

For any nonzero scalar \(c \neq 0\), \[ (c\beta)' x_t = c(\beta' x_t) \sim I(d-b). \]

Cointegrating vectors are not unique. Only the cointegrating space is uniquely defined.

Normalization of Cointegrating Vectors

To express the long-run restriction in a convenient form, a normalization is imposed.

A common normalization sets one coefficient equal to one. For example, if \[ \beta_1 y_t + \beta_2 x_t \sim I(0), \] we may normalize on \(y_t\): \[ y_t - \theta x_t \sim I(0), \qquad \theta = -\beta_2 / \beta_1. \]

Different normalizations represent the same equilibrium condition.
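As a purely illustrative case (the numbers are hypothetical), suppose \[ 2 y_t - 3 x_t \sim I(0). \] Normalizing on \(y_t\) divides by 2, and normalizing on \(x_t\) divides by \(-3\): \[ y_t - 1.5\, x_t \sim I(0), \qquad x_t - \tfrac{2}{3}\, y_t \sim I(0). \] The vectors \((2,-3)'\), \((1,-1.5)'\), and \((-\tfrac{2}{3},1)'\) are scalar multiples of one another, so they span the same one-dimensional cointegrating space.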

Cointegrating Rank

So far we have considered a single long-run relationship between two variables. With more than two variables, several relationships may hold simultaneously.

Let \(x_t\) be a \(k \times 1\) vector of time series, with each component integrated of order \(d\).

Suppose there exist \(r\) linearly independent vectors \[ \beta_1, \ldots, \beta_r \] such that \[ \beta_i' x_t \sim I(d-b), \qquad i = 1,\ldots,r. \]

Then \(x_t\) is said to have cointegrating rank \(r\).

When \(r = 1\), the cointegrating vector is unique up to scale (normalization).

When \(r > 1\), there are multiple linearly independent cointegrating relationships.

Cointegrating Rank: Special Cases

\(r = 0\): no cointegration
\(r = 1\): a single long-run equilibrium relationship
\(1 < r < k\): multiple long-run equilibrium restrictions
\(r = k\): every direction in \(\mathbb{R}^k\) is a cointegrating combination, so each \(x_{it}\) is itself \(I(0)\) (applies when \(d=b=1\))

Cointegrating Rank: Interpretation

Rank can exceed one when multiple long-run relations hold simultaneously in the same system.

For example, in a monetary system:

a long-run money demand relation implies \[ m_t - \beta_0 - \beta_1 p_t - \beta_2 y_t - \beta_3 r_t \sim I(0) \]
a monetary policy feedback rule, where the central bank adjusts nominal money supply in response to nominal GDP, implies \[ m_t - \gamma_0 + \gamma_1 (y_t + p_t) \sim I(0) \]

Cointegrating Rank: Monetary System Example

Writing \[ x_t = (m_t,\; 1,\; p_t,\; y_t,\; r_t)', \] we follow the standard convention of augmenting the stochastic variables with a constant so that intercepts are included in the cointegrating relations. Cointegration itself concerns the stochastic variables \((m_t,p_t,y_t,r_t)\).

These relations correspond to the cointegrating vectors

\[ \beta^{(1)} = (1 ,\; -\beta_0,\; -\beta_1,\; -\beta_2,\; -\beta_3)', \qquad \beta^{(2)} = (1,\; -\gamma_0,\; \gamma_1,\; \gamma_1,\; 0)'. \]

Since the vectors are linearly independent (for example, whenever \(\beta_3 \neq 0\), because only \(\beta^{(1)}\) loads on \(r_t\)), the cointegrating rank is \(r=2\).

Cointegration and Common Stochastic Trends

By the 1960s–70s, empirical researchers observed that many macroeconomic time series were highly persistent and moved together over long horizons.

This raised a descriptive question:

Are these variables sharing the same long-run sources of persistence, or does each variable drift independently?

A Factor Representation

One way to formalize this question is to consider the representation \[ x_t = \Lambda f_t + e_t, \] where:

\(x_t\) is an \(n \times 1\) vector of observed variables
\(f_t\) is a \(k \times 1\) vector of persistent (\(I(1)\)) components
\(\Lambda\) is an \(n \times k\) loading matrix of full column rank
\(e_t\) is an \(n \times 1\) stationary component

Common Trends and Cointegrating Rank

If there exists a nonzero vector \(\beta \in \mathbb{R}^n\) such that \(\beta' x_t \sim I(0)\), then substituting the factor representation, \[ \beta' x_t = \beta' \Lambda f_t + \beta' e_t. \]

Since \(\beta' e_t\) is \(I(0)\) and \(f_t\) is \(I(1)\), the sum can be \(I(0)\) only if \[ \beta' \Lambda = 0. \]

That is, \(\beta\) eliminates the \(k\) persistent components. This is possible only if \(k < n\).

The number of such linearly independent \(\beta\)’s is \(n-k\), which corresponds to the cointegrating rank.
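A small worked case, with hypothetical loadings, shows the count. Take \(n = 3\) observed variables driven by \(k = 1\) common trend, with \[ \Lambda = (1,\; 2,\; 3)'. \] The condition \(\beta' \Lambda = 0\) reads \(\beta_1 + 2\beta_2 + 3\beta_3 = 0\), which is one linear restriction on three coefficients. Two linearly independent solutions are \[ \beta^{(1)} = (2,\; -1,\; 0)', \qquad \beta^{(2)} = (3,\; 0,\; -1)', \] and every other solution is a linear combination of these. The cointegrating rank is therefore \(n - k = 3 - 1 = 2\).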

Cointegration and Error Correction Models

We now turn from characterizing cointegration to its dynamic implications.

Cointegration is a restriction on long-run behavior.

It implies that while variables may drift over time, they cannot drift arbitrarily far apart.

Equivalently, deviations from the long-run relation must be temporary.

For example, let \[ z_{t-1} \equiv y_{t-1} - \beta x_{t-1} \] denote the equilibrium error.

If \(\{z_t\}\) is stationary, then it is mean reverting: when the system is out of equilibrium, adjustment must occur to restore the long-run relation.

The Error Correction Model

The equilibrium error evolves from one period to the next as \[ \Delta z_t = \Delta y_t - \beta \, \Delta x_t. \]

So when the system is out of equilibrium (\(z_{t-1} \ne 0\)), the correction can only come through \(\Delta y_t\) and \(\Delta x_t\) — these are the only channels by which the error can move.

A dynamically coherent specification is therefore \[ \begin{aligned} \Delta y_t &= \alpha_y \, z_{t-1} + \varepsilon_{y,t}, \\ \Delta x_t &= \alpha_x \, z_{t-1} + \varepsilon_{x,t}, \end{aligned} \] which is called an error correction model (ECM).

\(z_{t-1}\) measures the extent of disequilibrium
\(\alpha_y\) and \(\alpha_x\) are speed-of-adjustment parameters, describing the direction and strength of adjustment
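The adjustment coefficients determine whether the equilibrium error is in fact mean reverting. Substituting the two ECM equations into \(\Delta z_t = \Delta y_t - \beta\, \Delta x_t\) gives \[ \Delta z_t = (\alpha_y - \beta \alpha_x)\, z_{t-1} + (\varepsilon_{y,t} - \beta\, \varepsilon_{x,t}), \] or equivalently \[ z_t = (1 + \alpha_y - \beta \alpha_x)\, z_{t-1} + (\varepsilon_{y,t} - \beta\, \varepsilon_{x,t}). \] If the shocks are white noise (an assumption made here for illustration), \(z_t\) is an AR(1) process, and it is stationary when \[ |1 + \alpha_y - \beta \alpha_x| < 1, \qquad \text{that is,} \qquad -2 < \alpha_y - \beta \alpha_x < 0. \] For instance, with the hypothetical values \(\beta = 1\), \(\alpha_y = -0.2\), and \(\alpha_x = 0.1\), the autoregressive coefficient is \(1 - 0.2 - 0.1 = 0.7\).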

Weak Exogeneity

If \(\alpha_y = 0\), then \(y_t\) does not respond to disequilibrium.

All error correction occurs through \(x_t\).

In this case, \(y_t\) is said to be weakly exogenous: it does not participate in error correction.
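The weakly exogenous case can be simulated directly. The sketch below assumes Gaussian white-noise shocks and the hypothetical values `beta = 1.0`, `alpha_y = 0.0`, `alpha_x = 0.3`; it treats \(\beta\) as known and recovers each adjustment coefficient by regressing the corresponding difference on \(z_{t-1}\).

```python
import random

random.seed(4)

T = 5000
beta = 1.0                     # hypothetical cointegrating coefficient (treated as known)
alpha_y, alpha_x = 0.0, 0.3    # y ignores disequilibrium; x moves to close the gap

x, y = [0.0], [0.0]
for _ in range(T):
    gap = y[-1] - beta * x[-1]                          # equilibrium error z_{t-1}
    y_next = y[-1] + alpha_y * gap + random.gauss(0.0, 1.0)
    x_next = x[-1] + alpha_x * gap + random.gauss(0.0, 1.0)
    y.append(y_next)
    x.append(x_next)

def slope_no_intercept(regressor, response):
    """OLS slope in a regression without a constant."""
    return (sum(r * v for r, v in zip(regressor, response))
            / sum(r * r for r in regressor))

z_lag = [y[t] - beta * x[t] for t in range(T)]          # z_{t-1} for t = 1, ..., T
dy = [y[t + 1] - y[t] for t in range(T)]
dx = [x[t + 1] - x[t] for t in range(T)]

alpha_y_hat = slope_no_intercept(z_lag, dy)             # should be close to 0
alpha_x_hat = slope_no_intercept(z_lag, dx)             # should be close to 0.3

print(f"estimated alpha_y: {alpha_y_hat:.3f}")
print(f"estimated alpha_x: {alpha_x_hat:.3f}")
```

With these values the equilibrium error has autoregressive coefficient \(1 + \alpha_y - \beta\alpha_x = 0.7\), so the system is cointegrated even though only \(x_t\) adjusts.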

Error Correction Model: General Form

In practice, short-run dynamics may involve lags, intercepts, and additional covariates.

A general single-equation ECM can be written as \[ \Delta y_t = \alpha\, z_{t-1} + c + \sum_{i=1}^p \phi_i \, \Delta y_{t-i} + \sum_{j=0}^q \psi_j \, \Delta x_{t-j} + \varepsilon_t, \] where \[ z_{t-1} = y_{t-1} - \beta x_{t-1}. \]

\(z_{t-1}\) captures long-run disequilibrium
differenced terms capture short-run dynamics
the constant allows for deterministic drift
the ECM combines long-run restrictions with short-run flexibility

Super-Consistency of the Cointegrating Regression

Before we estimate \(\beta\) by OLS in levels, we need to know whether OLS produces a usable estimate when the regressors are nonstationary.

In a standard stationary regression, OLS converges at rate \(\sqrt{T}\): \[ \hat\beta - \beta = O_p(T^{-1/2}). \]

In a cointegrating regression between \(I(1)\) variables, OLS converges at rate \(T\): \[ \hat\beta - \beta = O_p(T^{-1}). \]

This is super-consistency: the estimator converges much faster because \(\sum_{t=1}^T x_t^2 = O_p(T^2)\) when \(x_t\) is \(I(1)\) (versus \(O_p(T)\) when stationary), and the OLS denominator scales accordingly.
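The rate can be seen in a small Monte Carlo. This is a sketch under simplifying assumptions not made in the text: \(x_t\) is a Gaussian random walk, the equilibrium error is white noise independent of \(x_t\), and `beta = 2.0` is an arbitrary true value. It tracks the median of \(|\hat\beta - \beta|\) as the sample length is quadrupled.

```python
import random
import statistics

random.seed(5)

beta = 2.0   # hypothetical true cointegrating coefficient

def abs_error(T):
    """|beta_hat - beta| from one simulated levels regression of length T."""
    level = sxy = sxx = 0.0
    for _ in range(T):
        level += random.gauss(0.0, 1.0)                 # x_t: random walk
        y_t = beta * level + random.gauss(0.0, 1.0)     # y_t = beta * x_t + u_t
        sxy += level * y_t                              # running sum of x_t * y_t
        sxx += level * level                            # running sum of x_t^2
    return abs(sxy / sxx - beta)                        # OLS without intercept

reps = 400
median_error = {T: statistics.median(abs_error(T) for _ in range(reps))
                for T in (100, 400, 1600)}

for T, err in median_error.items():
    print(f"T = {T:5d}   median |beta_hat - beta| = {err:.5f}")

ratio = median_error[100] / median_error[1600]
print(f"error ratio, T=100 vs T=1600: {ratio:.1f}")
```

Under a \(\sqrt{T}\) rate, multiplying the sample length by 16 would shrink the error by a factor of about 4; under rate \(T\) the factor is about 16.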

Engle–Granger Two-Step: Step 1

Estimate the long-run relationship.

Estimate the cointegrating relation in levels: \[ y_t = \beta x_t + u_t. \]

Obtain the estimated equilibrium error: \[ \hat z_t = y_t - \hat\beta x_t. \]

Engle–Granger Two-Step: Step 2

Estimate the short-run dynamics.

Fit the ECM using differenced data and the lagged equilibrium error: \[ \Delta y_t = \alpha \hat z_{t-1} + \sum_{i=1}^p \phi_i \Delta y_{t-i} + \sum_{j=0}^q \psi_j \Delta x_{t-j} + \varepsilon_t. \]
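The two steps can be sketched on simulated data. The example assumes a simple data-generating process that is not taken from the slides: \(x_t\) is a Gaussian random walk and \(y_t\) error-corrects toward \(\beta x_t\) with `beta = 1.5` and `alpha = -0.3`. For brevity the second step omits the lagged and contemporaneous difference terms (\(p = 0\), no \(\Delta x\) regressors), which is harmless here only because the simulated shocks are independent.

```python
import random

random.seed(6)

T = 3000
beta, alpha = 1.5, -0.3      # hypothetical long-run coefficient and adjustment speed

# Simulate: x_t is a random walk; y_t adjusts toward beta * x_t
x, y = [0.0], [0.0]
for _ in range(T):
    gap = y[-1] - beta * x[-1]                          # z_{t-1}
    x.append(x[-1] + random.gauss(0.0, 1.0))
    y.append(y[-1] + alpha * gap + random.gauss(0.0, 1.0))

def slope_no_intercept(regressor, response):
    """OLS slope in a regression without a constant."""
    return (sum(r * v for r, v in zip(regressor, response))
            / sum(r * r for r in regressor))

# Step 1: levels regression y_t = beta * x_t + u_t, keep the residual
beta_hat = slope_no_intercept(x, y)
z_hat = [yt - beta_hat * xt for xt, yt in zip(x, y)]

# Step 2: regress dy_t on the lagged estimated equilibrium error
dy = [y[t] - y[t - 1] for t in range(1, T + 1)]
alpha_hat = slope_no_intercept(z_hat[:-1], dy)

print(f"step 1, beta_hat:  {beta_hat:.3f}")
print(f"step 2, alpha_hat: {alpha_hat:.3f}")
```

A negative `alpha_hat` indicates that \(y_t\) falls when it is above its long-run relation with \(x_t\), which is the direction of adjustment that restores equilibrium.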

Cointegration Test: Engle–Granger

Assume both \(x_t\) and \(y_t\) are \(I(1)\).

Because \(\hat\beta\) is super-consistent, the estimated residual \(\hat u_t = y_t - \hat\beta x_t\) inherits the stationarity properties of the true equilibrium error.

The Engle–Granger test inverts the spurious-regression logic: if \(x_t\) and \(y_t\) are cointegrated, \(\hat u_t\) should be stationary.

Suppose we estimate the levels regression \[ y_t = \beta x_t + u_t, \] and define the residual \(\hat u_t \equiv y_t - \hat\beta x_t\).

If \(\hat u_t\) is nonstationary, the regression is spurious
If \(\hat u_t\) is stationary, \(x_t\) and \(y_t\) are cointegrated

Engle–Granger Test Procedure

The Engle–Granger test checks whether \(\hat u_t\) contains a unit root by running an ADF regression on the residuals: \[ \Delta \hat u_t = \rho\, \hat u_{t-1} + \sum_{i=1}^{p} \phi_i \Delta \hat u_{t-i} + e_t, \] and testing \(H_0: \rho = 0\) (unit root in residuals, i.e., no cointegration).

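Because \(\hat u_t\) is constructed using an estimated coefficient, and OLS chooses \(\hat\beta\) to make the residuals look as stationary as possible, the test statistic is biased toward rejection. The test therefore uses its own critical values, which are more negative than the usual Dickey–Fuller tables and depend on the number of variables in the regression.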
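The residual-based statistic can be computed by hand. The sketch below makes several simplifying assumptions: the first-step regression includes an intercept, the ADF regression uses no augmentation lags (\(p = 0\)), and the data are simulated with hypothetical parameter values. It only computes the statistic; a decision still requires the appropriate Engle–Granger critical value (roughly \(-3.3\) at the 5% level for two variables with an intercept in commonly used tables).

```python
import math
import random

random.seed(7)

def ols_residuals(x, y):
    """Residuals from an OLS regression of y on a constant and x."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    slope = (sum((a - xm) * (b - ym) for a, b in zip(x, y))
             / sum((a - xm) ** 2 for a in x))
    return [(b - ym) - slope * (a - xm) for a, b in zip(x, y)]

def df_t_stat(u):
    """t-statistic on rho in  du_t = rho * u_{t-1} + e_t  (no augmentation lags)."""
    lagged = u[:-1]
    du = [u[t] - u[t - 1] for t in range(1, len(u))]
    suu = sum(v * v for v in lagged)
    rho = sum(v * d for v, d in zip(lagged, du)) / suu
    rss = sum((d - rho * v) ** 2 for v, d in zip(lagged, du))
    se = math.sqrt(rss / (len(du) - 1) / suu)
    return rho / se

def random_walk(n):
    """Driftless Gaussian random walk of length n."""
    level, path = 0.0, []
    for _ in range(n):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

T = 500
x = random_walk(T)

# Cointegrated case: y = 2 x + stationary AR(1) error (hypothetical values)
z, y_coint = 0.0, []
for xt in x:
    z = 0.5 * z + random.gauss(0.0, 1.0)
    y_coint.append(2.0 * xt + z)

# Non-cointegrated case: y is an independent random walk
y_indep = random_walk(T)

stat_coint = df_t_stat(ols_residuals(x, y_coint))
stat_indep = df_t_stat(ols_residuals(x, y_indep))
print(f"DF statistic, cointegrated pair: {stat_coint:.2f}")
print(f"DF statistic, independent pair:  {stat_indep:.2f}")
```

A statistic far below the critical value, as in the cointegrated case, rejects the null of a unit root in the residuals.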

A Limitation of the Engle–Granger Approach

Cointegration is a symmetric property: either \(x_t\) and \(y_t\) share a stationary linear combination, or they do not.

The Engle–Granger procedure is asymmetric by construction because it relies on a single OLS projection.

Consider two independent \(I(1)\) processes: \[ x_t = x_{t-1} + \eta_t, \qquad y_t = y_{t-1} + \xi_t, \] with \(\{\eta_t\}\) and \(\{\xi_t\}\) i.i.d., mean zero, and uncorrelated.

Asymmetry of the Engle–Granger Procedure

Equivalently, \[ x_t = \sum_{i=1}^t \eta_i, \qquad y_t = \sum_{i=1}^t \xi_i. \]

Estimate the levels regression \[ y_t = \beta x_t + \varepsilon_t \] by OLS.

OLS chooses \(\hat\beta\) to minimize \[ \sum_{t=1}^T (y_t - \beta x_t)^2, \] that is, it projects the entire path of \(y_t\) onto the path of \(x_t\).

The resulting residual is \[ \hat u_t = y_t - \hat\beta x_t = \sum_{i=1}^t \xi_i - \hat\beta \sum_{i=1}^t \eta_i = \sum_{i=1}^t (\xi_i - \hat\beta \eta_i). \]

Reversing the Regression

If instead we reverse the roles and estimate \[ x_t = \gamma y_t + \nu_t, \] OLS now projects the path of \(x_t\) onto the path of \(y_t\), producing \[ \hat\nu_t = \sum_{i=1}^t (\eta_i - \hat\gamma \xi_i), \] which is a different stochastic process.

In finite samples, \(\hat\gamma \ne 1/\hat\beta\), since each OLS minimizes a different sum of squares.

Why test outcomes may differ

Under no cointegration, any residual constructed from nonstationary regressors is nonstationary in population.
The Engle–Granger test is applied to a finite sample: it evaluates whether the estimated residual appears sufficiently mean-reverting.
Different OLS projections absorb different amounts of low-frequency variation, so residuals can exhibit different degrees of persistence in finite samples.
As a result, unit root tests applied to these residuals can yield different outcomes, even though the underlying variables are not cointegrated.
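The asymmetry is easy to reproduce. The following sketch assumes two independent Gaussian random walks of length 200 and regressions without an intercept, as in the expressions above; the function names are hypothetical. It runs the levels regression in both directions and applies the same unit-root statistic to each residual series.

```python
import math
import random

random.seed(8)

def random_walk(n):
    """Driftless Gaussian random walk of length n."""
    level, path = 0.0, []
    for _ in range(n):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

def slope_no_intercept(regressor, response):
    """OLS slope in a regression without a constant."""
    return (sum(r * v for r, v in zip(regressor, response))
            / sum(r * r for r in regressor))

def df_t_stat(u):
    """t-statistic on rho in  du_t = rho * u_{t-1} + e_t  (no augmentation lags)."""
    lagged = u[:-1]
    du = [u[t] - u[t - 1] for t in range(1, len(u))]
    suu = sum(v * v for v in lagged)
    rho = sum(v * d for v, d in zip(lagged, du)) / suu
    rss = sum((d - rho * v) ** 2 for v, d in zip(lagged, du))
    se = math.sqrt(rss / (len(du) - 1) / suu)
    return rho / se

T = 200
x, y = random_walk(T), random_walk(T)

beta_hat = slope_no_intercept(x, y)      # regression of y on x
gamma_hat = slope_no_intercept(y, x)     # regression of x on y

u_hat = [yt - beta_hat * xt for xt, yt in zip(x, y)]
nu_hat = [xt - gamma_hat * yt for xt, yt in zip(x, y)]

# beta_hat * gamma_hat = (sum xy)^2 / (sum x^2 * sum y^2), which is below 1
# unless x and y are exactly proportional, so gamma_hat differs from 1 / beta_hat.
product = beta_hat * gamma_hat
stat_y_on_x = df_t_stat(u_hat)
stat_x_on_y = df_t_stat(nu_hat)

print(f"beta_hat * gamma_hat:      {product:.3f}")
print(f"DF statistic, y on x:      {stat_y_on_x:.2f}")
print(f"DF statistic, x on y:      {stat_x_on_y:.2f}")
```

The two statistics are computed from different residual series, so they generally differ, and with a value near the critical threshold the choice of dependent variable can change the test outcome.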

Where We Landed

Problem: regressions in levels with trending data produce spurious results — large \(t\)-statistics, high \(R^2\), persistent residuals.
Partial fix: differencing restores stationarity but eliminates the long-run information that is often the object of interest.
Resolution: cointegration — when a stationary linear combination exists, a levels regression recovers the long-run relationship.
Operational tool: the ECM combines long-run restrictions with short-run dynamics; the Engle–Granger test checks whether cointegration holds.

Looking Ahead

The Engle–Granger approach handles one cointegrating relation between two variables.

For systems with multiple variables and potentially multiple cointegrating relations:

Vector Error Correction Models (VECM) generalize the ECM to systems
The Johansen procedure tests for cointegrating rank by estimating the number of common stochastic trends
Next: Vector Autoregressions and VECM