Vector Autoregressions

Natasha Kang

Xiamen University, Chow Institute

May, 2026

Recall: ADL Models

We have studied dynamic relationships using autoregressive distributed lag (ADL) models.

\[ y_t = \alpha + \sum_{j=1}^p \phi_j y_{t-j} + \sum_{k=0}^q \beta_k x_{t-k} + u_t. \]

The model captures:

persistence (lags of \(y_t\))
dynamic effects of \(x_t\) through its lags
gradual adjustment over time

The Exogeneity Assumption

For consistent estimation and interpretation:

the regressors must be weakly exogenous
feedback from \(y_t\) to \(x_t\) is ruled out

But ruling out feedback is often untenable in macro, where:

policy reacts to economic conditions
variables evolve jointly
feedback is pervasive

A Simple Control-System Analogy

Consider a thermostat regulating room temperature:

\(y_t\): room temperature
\(x_t\): thermostat setting
\(u_t\): unobserved shocks (weather, insulation, people entering)

The thermostat adjusts its setting in response to past temperature:

\[ x_t = \gamma (y_{t-1} - y^\ast), \]

where \(y^\ast\) is the target temperature.

Why Exogeneity Fails

Suppose the room temperature is generated by \[ y_t = \alpha y_{t-1} + \beta x_t + u_t. \]

Since \(x_t = \gamma(y_{t-1} - y^\ast)\) depends on \(y_{t-1}\), which itself depends on \(u_{t-1}\):

\[ \mathrm{Cov}(x_t, u_{t-1}) \neq 0. \]

Exogeneity fails because the thermostat responds to the state of the system.
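A small simulation can make the failure concrete. The sketch below is only illustrative: the parameter values are assumptions, and the target \(y^\ast\) is set to zero so that temperature is measured as a deviation from target. It simulates the two equations above and compares the sample correlation of \(x_t\) with the lagged and the current shock.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
# Illustrative parameter values (assumptions, not taken from the slides).
alpha, beta, gamma, y_star = 0.5, -0.3, 1.0, 0.0

u = rng.standard_normal(T)   # unobserved shocks u_t
y = np.zeros(T)              # room temperature (deviation from target)
x = np.zeros(T)              # thermostat setting

for t in range(1, T):
    x[t] = gamma * (y[t - 1] - y_star)             # feedback rule
    y[t] = alpha * y[t - 1] + beta * x[t] + u[t]   # temperature equation

# Pair x_t with the lagged shock u_{t-1} and with the current shock u_t.
corr_lagged = np.corrcoef(x[2:], u[1:-1])[0, 1]
corr_current = np.corrcoef(x[2:], u[2:])[0, 1]
print(f"corr(x_t, u_(t-1)) = {corr_lagged:.3f}")
print(f"corr(x_t, u_t)     = {corr_current:.3f}")
```

In this design the regressor is strongly correlated with the past shock while being roughly uncorrelated with the current one, which is the pattern the covariance statement describes.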

Feedback in Macro: Policy and Inflation

A simple bivariate policy–inflation system:

\[ \begin{aligned} i_t &= \rho\, i_{t-1} + \phi\, \pi_{t-1} + u_{it} \\ \pi_t &= \theta\, \pi_{t-1} - \lambda\, i_{t-1} + u_{\pi t}. \end{aligned} \]

the policy rate responds to past inflation (Taylor-type rule)
inflation responds to past policy rates (monetary transmission)
each equation has its own shock

Neither variable is naturally “independent” of the other. There is no principled way to choose one as the regressor and the other as the regressand — feedback runs both ways.

Large-Scale Macroeconometric Models

Macroeconomists historically tried to capture this kind of feedback through large-scale macroeconometric models.

These models:

consist of many simultaneous equations
specify behavioral relationships for each variable (e.g. consumption, investment, etc.)
explicitly allow feedback across the system

For example, a model might include equations such as: \[ \begin{aligned} C_t &= f(Y_t, T_t, W_{t-1}, C_{t-1}) \\ I_t &= g(Y_t, r_t, I_{t-1}) \\ M_t &= h(Y_t, P_t, M_{t-1}) \end{aligned} \]

Difficulties with Simultaneous-Equation Systems

To identify such a system, you must impose exclusion restrictions — assumptions that some variables do not enter certain equations.

Exclusions are hard to justify. Why does \(r_t\) belong in investment but not in consumption? Intertemporal optimization puts it in both. In practice, exclusions reflect modeling convenience more than theory.
Exclusions choose the transmission. Omitting \(r_t\) from \(C_t\) forces every effect of a rate shock on \(C\) to travel through \(Y_t\) — ruling out, by assumption, any direct consumer response to rates.

If we can’t defend the exclusions, we can’t defend the system.

Symmetric Dynamic System

The alternative: treat all variables symmetrically — each gets its own equation, and each equation can include every other variable.

The simplest case is a bivariate system:

\[ \begin{aligned} y_t &= b_{10} - b_{12} z_t + \gamma_{11} y_{t-1} + \gamma_{12} z_{t-1} + \varepsilon_{yt}, \\ z_t &= b_{20} - b_{21} y_t + \gamma_{21} y_{t-1} + \gamma_{22} z_{t-1} + \varepsilon_{zt}. \end{aligned} \]

Compact Vector Form

Let \[ x_t = \begin{bmatrix} y_t \\ z_t \end{bmatrix}, \qquad \varepsilon_t = \begin{bmatrix} \varepsilon_{yt} \\ \varepsilon_{zt} \end{bmatrix}. \]

The system can be written compactly as \[ B x_t = \Gamma_0 + \Gamma_1 x_{t-1} + \varepsilon_t, \] where

\[ B = \begin{bmatrix} 1 & b_{12} \\ b_{21} & 1 \end{bmatrix}, \qquad \Gamma_0 = \begin{bmatrix} b_{10} \\ b_{20} \end{bmatrix}, \qquad \Gamma_1 = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{bmatrix}. \]

Structural Restrictions and Reduced Form

The matrix \(B\) summarizes contemporaneous structural restrictions.

If \(B\) is invertible, premultiplying by \(B^{-1}\) gives the reduced-form representation: \[ x_t = A_0 + A_1 x_{t-1} + u_t, \qquad A_0 = B^{-1}\Gamma_0, \quad A_1 = B^{-1}\Gamma_1, \quad u_t = B^{-1}\varepsilon_t. \]

The reduced form is invariant to the choice of contemporaneous structural restrictions, and can be estimated by OLS.

Sims (1980) proposed estimating this reduced form directly, arguing that many structural models relied on “incredible restrictions.”
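The mapping from structural to reduced form can be sketched numerically. All parameter values below are hypothetical, and the structural shocks are assumed uncorrelated. Premultiplying the structural system by \(B^{-1}\) gives the reduced-form intercept, lag matrix, and innovation covariance.

```python
import numpy as np

# Hypothetical structural parameters, chosen only for illustration.
b12, b21 = 0.4, -0.3
B = np.array([[1.0, b12],
              [b21, 1.0]])
Gamma0 = np.array([0.5, 1.0])
Gamma1 = np.array([[0.6, 0.1],
                   [0.2, 0.5]])
Sigma_eps = np.diag([1.0, 0.5])          # uncorrelated structural shocks (assumption)

B_inv = np.linalg.inv(B)
A0 = B_inv @ Gamma0                      # reduced-form intercept
A1 = B_inv @ Gamma1                      # reduced-form lag matrix
Sigma_u = B_inv @ Sigma_eps @ B_inv.T    # Var(u_t) with u_t = B^{-1} eps_t

print("A0 =", A0)
print("A1 =\n", A1)
print("Sigma_u =\n", Sigma_u)
```

Even with a diagonal structural covariance, `Sigma_u` has a nonzero off-diagonal entry: the reduced-form innovations mix the structural shocks.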

Stability of the VAR(1)

Consider the VAR(1): \[ x_t = A_0 + A_1 x_{t-1} + u_t. \]

Iterating backward, \[ x_t = \Big(\sum_{i=0}^{k} A_1^i\Big) A_0 + \sum_{i=0}^{k} A_1^i u_{t-i} + A_1^{k+1} x_{t-k-1}. \]

Stability condition: all eigenvalues of \(A_1\) lie strictly inside the unit circle.

Stability Implies Stationarity

If all eigenvalues of \(A_1\) lie strictly inside the unit circle, then:

\(A_1^{k+1} x_{t-k-1} \to 0\) as \(k \to \infty\) (the effect of initial conditions vanishes)
The series \(\sum_{i=0}^{\infty} A_1^i u_{t-i}\) converges in mean square
\(\sum_{i=0}^{\infty} A_1^i = (I-A_1)^{-1}\)

Hence, the stationary solution is \[ x_t = \mu + \sum_{i=0}^{\infty} A_1^i u_{t-i}, \qquad \mu = (I-A_1)^{-1}A_0. \]

Equivalently, using lag operators, \[ x_t - \mu = (I-A_1L)^{-1} u_t. \]

Under the stability condition, \(\{x_t\}\) is covariance-stationary.
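The stability condition and the mean formula are easy to check numerically. The matrices below are assumed values for illustration. The sketch computes the eigenvalues of \(A_1\), the mean \(\mu\), and a truncated version of the sum \(\sum_i A_1^i\).

```python
import numpy as np

# Illustrative VAR(1) coefficients (assumptions).
A0 = np.array([1.0, 0.5])
A1 = np.array([[0.5, 0.2],
               [0.1, 0.4]])

eigvals = np.linalg.eigvals(A1)
is_stable = bool(np.all(np.abs(eigvals) < 1))

# Unconditional mean mu = (I - A1)^{-1} A0
mu = np.linalg.solve(np.eye(2) - A1, A0)

# Truncated geometric series sum_{i=0}^{199} A1^i
partial_sum = np.zeros((2, 2))
power = np.eye(2)
for _ in range(200):
    partial_sum += power
    power = power @ A1        # after the loop, power = A1^200

print("eigenvalue moduli:", np.abs(eigvals))
print("stable:", is_stable)
print("mu:", mu)
```

After the loop, `power` holds \(A_1^{200}\), which is numerically zero, and `partial_sum` agrees with \((I-A_1)^{-1}\).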

Generalization to VAR(\(p\))

Consider the VAR(\(p\)): \[ x_t = A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t. \]

Define the stacked state vector \[ X_t = \begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{bmatrix}. \]

VAR(\(p\)) Companion Form

The system can be written in companion form

\[ X_t = \mathscr{A} X_{t-1} + \mathscr{U}_t, \]

where

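\[ \mathscr{A} = \begin{bmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I & 0 & \cdots & 0 & 0 \\ 0 & I & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I & 0 \end{bmatrix}, \qquad \mathscr{U}_t = \begin{bmatrix} u_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \]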

Stability of VAR(\(p\))

The VAR(\(p\)) is stable if all eigenvalues of the companion matrix \(\mathscr{A}\) lie strictly inside the unit circle:

\[ |\lambda_i(\mathscr{A})| < 1 \quad \text{for all } i. \]

Eigenvalues and Polynomial Roots

In ARMA models, stability was characterized via the roots of a polynomial. For VAR(\(p\)), the companion-matrix eigenvalues give an equivalent characterization.

Eigenvalues \(\lambda\) of \(\mathscr{A}\) solve \(\det(\lambda I - \mathscr{A}) = 0\). Expanding using the block structure, \[ \det(\lambda I - \mathscr{A}) = \det(\lambda^p I - A_1 \lambda^{p-1} - \cdots - A_p). \]

Factoring out \(\lambda^p\) and setting \(z = 1/\lambda\): \[ \det(I - A_1 z - \cdots - A_p z^p) = 0. \]

So the eigenvalues of \(\mathscr{A}\) are reciprocals of the roots of the characteristic polynomial: \(|\lambda|<1 \Leftrightarrow |z|>1\).
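The reciprocal relationship can be verified numerically. The sketch below uses an assumed VAR(2); the helper names `companion` and `char_det` are hypothetical. It builds the companion matrix and evaluates the characteristic determinant at \(z = 1/\lambda\) for each eigenvalue.

```python
import numpy as np

def companion(A_list):
    """Stack VAR(p) coefficient matrices into the mp x mp companion matrix."""
    m = A_list[0].shape[0]
    p = len(A_list)
    top = np.hstack(A_list)                                   # [A_1 ... A_p]
    bottom = np.hstack([np.eye(m * (p - 1)), np.zeros((m * (p - 1), m))])
    return np.vstack([top, bottom])

# Illustrative VAR(2) coefficients (assumptions).
A1 = np.array([[0.5, 0.1],
               [0.0, 0.4]])
A2 = np.array([[0.2, 0.0],
               [0.1, 0.1]])

F = companion([A1, A2])
lam = np.linalg.eigvals(F)
stable = bool(np.all(np.abs(lam) < 1))

def char_det(z):
    """det(I - A1 z - A2 z^2), the VAR characteristic polynomial at z."""
    return np.linalg.det(np.eye(2) - A1 * z - A2 * z**2)

# Each nonzero eigenvalue lambda should give a root z = 1/lambda.
residuals = [abs(char_det(1 / l)) for l in lam]
print("eigenvalue moduli:", np.abs(lam))
print("stable:", stable)
print("max |det| at z = 1/lambda:", max(residuals))
```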

Estimation of VAR Models

A VAR(\(p\)) is estimated as a system of linear projections: \[ x_t = A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t. \]

Estimation is typically carried out by equation-by-equation ordinary least squares.

VAR Inference Conditions

If:

the VAR is stable, so \(\{x_t\}\) is stationary
the reduced-form specification is correct (innovations are orthogonal to lagged regressors)
the dimension \((m,p)\) (number of variables and lag length) is fixed relative to sample size \(T\)

then:

OLS estimators are consistent
standard asymptotic theory applies

Is Equation-by-Equation OLS Efficient?

Equation-by-equation OLS estimates each VAR equation in isolation. VAR innovations are typically correlated across equations (\(\Sigma_u = \mathrm{Var}(u_t)\) not diagonal) — which usually makes GLS more efficient than OLS.

For the reduced-form VAR, though, GLS coincides with OLS — because every equation shares the same regressors, GLS’s usual advantage disappears.

The next slides make this precise.

SUR Representation

Each VAR equation, stacked over \(t = 1, \ldots, T\), gives \(y_i = Z\beta_i + u_i\), with \(Z \in \mathbb{R}^{T \times mp}\) the lagged-regressor matrix and \(\beta_i \in \mathbb{R}^{mp}\) equation \(i\)’s coefficient vector.

Stacking across the \(m\) equations:

\[ \underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}}_{y \,\in\, \mathbb{R}^{mT}} = \underbrace{\begin{bmatrix} Z & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & Z \end{bmatrix}}_{X \,\in\, \mathbb{R}^{mT \times m^2 p}} \underbrace{\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_m \end{bmatrix}}_{\beta \,\in\, \mathbb{R}^{m^2 p}} + \underbrace{\begin{bmatrix} u_1 \\ \vdots \\ u_m \end{bmatrix}}_{u \,\in\, \mathbb{R}^{mT}}. \]

Each equation shares the same regressor matrix \(Z\).

Kronecker Product

The block-diagonal \(X\) above can be written compactly using the Kronecker product.

For matrices \(A\) (\(p \times q\)) and \(B\), \(A \otimes B\) replaces each scalar entry \(a_{ij}\) of \(A\) with the block \(a_{ij}B\):

\[ A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1q} B \\ \vdots & & \vdots \\ a_{p1} B & \cdots & a_{pq} B \end{bmatrix}. \]

SUR in Kronecker Form

Since \(I_m\) has 1s on the diagonal, the block-diagonal \(X\) is just \(I_m \otimes Z\):

\[ X = I_m \otimes Z. \]

Because \(u_t\) is i.i.d. across time with \(\mathrm{Var}(u_t) = \Sigma_u\) (an \(m \times m\) contemporaneous covariance), the stacked error has

\[ \mathrm{Var}(u) = \Sigma_u \otimes I_T \;\in\; \mathbb{R}^{mT \times mT}. \]

So the SUR system is compactly \(y = (I_m \otimes Z)\beta + u\) with \(\mathrm{Var}(u) = \Sigma_u \otimes I_T\).

Useful Kronecker properties (we’ll need these):

\((A \otimes B)' = A' \otimes B'\)
\((A \otimes B)(C \otimes D) = (AC) \otimes (BD)\)
\((A \otimes B)^{-1} = A^{-1} \otimes B^{-1}\)
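These identities can be checked with `numpy.kron`. The matrices below are arbitrary invertible examples chosen for illustration. The last lines confirm that \(I_m \otimes Z\) reproduces the block-diagonal regressor matrix.

```python
import numpy as np

# Arbitrary small invertible matrices (assumptions for illustration).
A = np.array([[2.0, 1.0], [0.0, 1.0]])
B = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 2.0]])
C = np.array([[1.0, -1.0], [2.0, 0.0]])
D = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])

transpose_ok = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
mixed_ok = np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
inverse_ok = np.allclose(np.linalg.inv(np.kron(A, B)),
                         np.kron(np.linalg.inv(A), np.linalg.inv(B)))
print(transpose_ok, mixed_ok, inverse_ok)

# I_m (x) Z puts one copy of Z in each diagonal block.
Z = np.arange(15.0).reshape(5, 3)
X = np.kron(np.eye(2), Z)
print(X.shape)
```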

GLS Setup

We want the efficient estimator for this stacked system. Work generically first: \(y = X\beta + u\) with \(\mathrm{Var}(u) = \Omega\) positive-definite (later we take \(\Omega = \Sigma_u \otimes I_T\)).

OLS minimizes the unweighted sum of squared residuals:

\[ S_{\mathrm{OLS}}(\beta) = (y - X\beta)'(y - X\beta). \]

Equal weighting is optimal when \(\Omega = \sigma^2 I\), but suboptimal when \(\Omega\) is general:

observations with larger variance carry less information
correlated observations carry overlapping information

Weighted Least Squares

For general \(\Omega\), minimize a weighted sum of squared residuals instead:

\[ S_{\mathrm{GLS}}(\beta) = (y - X\beta)' \Omega^{-1} (y - X\beta). \]

First-order condition (derivative in \(\beta\) set to zero):

\[ -2\, X' \Omega^{-1} (y - X\beta) = 0. \]

Solving the normal equations yields the GLS estimator:

\[ \widehat\beta_{\mathrm{GLS}} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y. \]

Why \(\Omega^{-1}\)? The Pre-Whitening View

The \(\Omega^{-1}\) weighting isn’t arbitrary — it comes from transforming the model so errors become i.i.d.

\(\Omega\) is symmetric positive-definite, so it has a symmetric square root \(\Omega^{1/2}\) with \(\Omega^{1/2}\Omega^{1/2} = \Omega\). Premultiply by \(\Omega^{-1/2}\):

\[ \Omega^{-1/2} y = \Omega^{-1/2} X \beta + \Omega^{-1/2} u, \qquad \mathrm{Var}(\Omega^{-1/2} u) = I. \]

Applying OLS to the transformed system gives the same formula, \((X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y\) — pre-whitening and weighted LS are the same estimator.
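The equivalence can be confirmed on synthetic data. In this sketch the data are random and \(\Omega\) is an assumed AR(1)-type correlation matrix, used only because it is symmetric positive-definite. The symmetric inverse square root is built from the eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# An arbitrary symmetric positive-definite Omega (assumption).
idx = np.arange(n)
Omega = 0.7 ** np.abs(idx[:, None] - idx[None, :])

# Direct GLS formula
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Pre-whitening: Omega^{-1/2} from the eigendecomposition, then plain OLS
w, V = np.linalg.eigh(Omega)
Omega_inv_half = V @ np.diag(w ** -0.5) @ V.T
beta_white, *_ = np.linalg.lstsq(Omega_inv_half @ X, Omega_inv_half @ y, rcond=None)

print(beta_gls)
print(beta_white)
```

The two coefficient vectors agree up to floating-point error.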

Applying GLS to the SUR System

Start with \(X = I_m \otimes Z\) and \(\Omega = \Sigma_u \otimes I_T\).

Transpose and inverse: \[ X' = I_m \otimes Z', \qquad \Omega^{-1} = \Sigma_u^{-1} \otimes I_T. \]

Mixed-product gives \[ X'\Omega^{-1} = (I_m \otimes Z')(\Sigma_u^{-1} \otimes I_T) = \Sigma_u^{-1} \otimes Z', \]

\[ X'\Omega^{-1}X = (\Sigma_u^{-1} \otimes Z')(I_m \otimes Z) = \Sigma_u^{-1} \otimes (Z'Z). \]

Inverse property: \[ (X'\Omega^{-1}X)^{-1} = \Sigma_u \otimes (Z'Z)^{-1}. \]

Why OLS Is Efficient

Combining the pieces in the GLS formula:

\[ \widehat\beta_{\mathrm{GLS}} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y = \big(\Sigma_u \otimes (Z'Z)^{-1}\big)\big(\Sigma_u^{-1} \otimes Z'\big) y = \big(I_m \otimes (Z'Z)^{-1}Z'\big) y. \]

The block-diagonal form means \((Z'Z)^{-1}Z'\) is applied equation by equation — which is equation-by-equation OLS. So

\[ \widehat\beta_{\mathrm{GLS}} = \widehat\beta_{\mathrm{OLS}}. \]

Implication: with identical regressors across equations, SUR gains nothing — equation-by-equation OLS is efficient for reduced-form VARs.
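A numerical check of this result, under assumed inputs: `Z` and `Y` are random placeholders rather than actual VAR data, and `Sigma_u` is an arbitrary non-diagonal covariance. The sketch forms the stacked GLS estimator with Kronecker products and compares it with equation-by-equation OLS.

```python
import numpy as np

rng = np.random.default_rng(3)
T, m, k = 50, 2, 3                  # k plays the role of mp
Z = rng.standard_normal((T, k))     # common regressor matrix
Y = rng.standard_normal((T, m))     # column i holds y_i
Sigma_u = np.array([[1.0, 0.8],
                    [0.8, 2.0]])    # assumed non-diagonal covariance

# Stack equation by equation: y = [y_1; y_2], X = I_m (x) Z
y = Y.flatten(order="F")
X = np.kron(np.eye(m), Z)
Omega_inv = np.kron(np.linalg.inv(Sigma_u), np.eye(T))

beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equation-by-equation OLS: one coefficient column per equation
beta_ols_eq, *_ = np.linalg.lstsq(Z, Y, rcond=None)
beta_ols = beta_ols_eq.flatten(order="F")

print(np.max(np.abs(beta_gls - beta_ols)))
```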

Dimensionality in VAR Models

The reduced-form VAR(\(p\)) \(x_t = \nu + A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t\) contains

\[ m^2 p + m \]

parameters (\(m\) = number of variables).

parameter count grows quadratically in \(m\) — the curse of dimensionality
increasing lag length rapidly exhausts degrees of freedom
finite-sample estimation error can be substantial

Rule of Thumb for Sample Size

Each equation in a VAR(\(p\)) involves approximately \(mp\) slope coefficients.

Empirical practice typically requires \[ T \gtrsim 5\text{–}10 \times mp \] for reliable estimation.

As a result, standard VARs are usually restricted to small systems in macroeconomic applications.
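To see how quickly these numbers grow, here is a short sketch of the parameter count and the rule-of-thumb sample range. The function names are hypothetical, and the 5 to 10 multiplier is simply the heuristic stated above.

```python
def var_param_count(m, p):
    """Number of coefficients in a VAR(p) with intercept: m^2 p + m."""
    return m * m * p + m

def min_sample_range(m, p):
    """Rule-of-thumb range for T: 5 to 10 observations per slope coefficient."""
    per_equation = m * p
    return 5 * per_equation, 10 * per_equation

for m in (2, 5, 10):
    print(m, var_param_count(m, p=4), min_sample_range(m, p=4))
```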

Lag-Length Selection

Because dimensionality grows with \(p\), lag length must be chosen carefully. Standard criteria trade off fit against parameterization:

Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Hannan–Quinn Criterion (HQ)

In typical macroeconomic systems:

BIC tends to select the most parsimonious lag length
AIC tends to select longer lags
HQ lies between the two

In practice, report sensitivity: estimate the VAR under more than one criterion and check whether conclusions depend on \(p\).
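A minimal sketch of lag selection on simulated data follows. The criteria are written in one common form, \(\log|\widehat\Sigma_u(p)|\) plus a penalty per parameter of \(2/n\) (AIC), \(\log n / n\) (BIC), or \(2\log\log n / n\) (HQ). Conventions differ across textbooks and software, so the exact form here is an assumption. All candidate models are fitted on the same sample so the criteria are comparable. The function names are hypothetical.

```python
import numpy as np

def simulate_var1(A, T, rng, burn=200):
    """Simulate a zero-mean VAR(1) with standard normal innovations."""
    m = A.shape[0]
    x = np.zeros((T + burn, m))
    for t in range(1, T + burn):
        x[t] = A @ x[t - 1] + rng.standard_normal(m)
    return x[burn:]

def resid_cov(x, p, p_max):
    """OLS VAR(p) with intercept on the common sample t = p_max, ..., T-1."""
    T, m = x.shape
    Y = x[p_max:]
    lags = [x[p_max - j:T - j] for j in range(1, p + 1)]
    Z = np.hstack([np.ones((T - p_max, 1))] + lags)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    return resid.T @ resid / (T - p_max)

def criteria(x, p_max):
    T, m = x.shape
    n = T - p_max
    out = {}
    for p in range(1, p_max + 1):
        logdet = np.linalg.slogdet(resid_cov(x, p, p_max))[1]
        k = m * m * p + m                      # parameter count
        out[p] = {"AIC": logdet + 2 * k / n,
                  "BIC": logdet + k * np.log(n) / n,
                  "HQ": logdet + 2 * k * np.log(np.log(n)) / n}
    return out

rng = np.random.default_rng(4)
A = np.array([[0.5, 0.2], [0.1, 0.4]])         # illustrative coefficients
x = simulate_var1(A, T=500, rng=rng)
crit = criteria(x, p_max=4)
chosen = {name: min(crit, key=lambda p: crit[p][name])
          for name in ("AIC", "BIC", "HQ")}
print(chosen)
```

Because the three penalties are ordered for this sample size, the selected lag lengths satisfy BIC ≤ HQ ≤ AIC on any given dataset.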

Forecasting in a VAR

Forecasts are computed recursively from the VAR coefficients:

\[ \widehat{x}_{t+h|t} = A_1 \widehat{x}_{t+h-1|t} + \cdots + A_p \widehat{x}_{t+h-p|t}, \qquad h \ge 1, \]

with \(\widehat{x}_{s|t} = x_s\) for \(s \le t\).

Forecast Error Variance

For a VAR(1) with \(\mathrm{Var}(u_t) = \Sigma_u\), the FEV satisfies

\[ \Omega_1 = \Sigma_u, \qquad \Omega_h = A_1\,\Omega_{h-1}A_1' + \Sigma_u, \quad h \ge 2. \]

Under stability, \(\Omega_h\) grows with \(h\) and converges to the unconditional variance of \(x_t\).

Under (asymptotic) Gaussianity, a \(100(1-\alpha)\%\) CI for \(x_{i,t+h}\) is

\[ \widehat{x}_{i,t+h|t} \pm z_{1-\alpha/2} \sqrt{[\Omega_h]_{ii}}. \]

Example: Forecasting in a Stable VAR (Simulated Data)

We simulate a stable bivariate VAR(1) process and illustrate multi-step forecasts and forecast confidence intervals.

Data-generating process is stationary by construction
Forecasts are produced recursively from the VAR
Confidence intervals reflect uncertainty from future shocks
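One way to set up such an example is sketched below. The coefficient and covariance values are assumptions. To keep the sketch short, the true \(A_1\) and \(\Sigma_u\) are used in place of estimates, so the intervals ignore estimation uncertainty. The value 1.96 is the usual approximate normal quantile for a 95% interval.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative stable VAR(1) (assumed values).
A1 = np.array([[0.5, 0.2],
               [0.1, 0.4]])
Sigma_u = np.array([[1.0, 0.3],
                    [0.3, 0.5]])
chol = np.linalg.cholesky(Sigma_u)

# Simulate x_t = A1 x_{t-1} + u_t with Var(u_t) = Sigma_u
T = 300
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = A1 @ x[t - 1] + chol @ rng.standard_normal(2)

H = 12
z = 1.96
forecasts = np.zeros((H, 2))
fev = np.zeros((H, 2, 2))
x_hat, omega = x[-1], np.zeros((2, 2))
for h in range(H):
    x_hat = A1 @ x_hat                     # recursive point forecast
    omega = A1 @ omega @ A1.T + Sigma_u    # forecast error variance recursion
    forecasts[h], fev[h] = x_hat, omega

half_width = z * np.sqrt(np.diagonal(fev, axis1=1, axis2=2))
lower, upper = forecasts - half_width, forecasts + half_width

# Unconditional variance G solves G = A1 G A1' + Sigma_u
vec_G = np.linalg.solve(np.eye(4) - np.kron(A1, A1), Sigma_u.flatten(order="F"))
G = vec_G.reshape(2, 2, order="F")
print("FEV at h = 12:\n", fev[-1])
print("unconditional variance:\n", G)
```

The interval widths increase with the horizon and level off as the forecast error variance approaches the unconditional variance.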

Nonstationary Variables: Why Not Just Difference?

When \(x_t\) is \(I(1)\), differencing restores stationarity — but it removes long-run relationships between variables.

A VAR in differences cannot capture equilibrium comovement (Sims, Stock, and Watson, 1990).

We need a framework that preserves both short-run dynamics and long-run equilibrium: cointegration and the VECM.

From VAR Levels to a Differenced Form

Start from a VAR(\(p\)) in levels: \[ x_t = A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t. \]

Subtract \(x_{t-1}\) from both sides: \[ \Delta x_t = \Big(A_1 + \cdots + A_p - I\Big)x_{t-1} + \sum_{i=1}^{p-1} \Big(-A_{i+1}-\cdots-A_p\Big)\Delta x_{t-i} + u_t. \]

Rewriting as a VECM

Define \[ \Pi := \sum_{i=1}^p A_i - I, \qquad \Gamma_i := -\sum_{j=i+1}^p A_j. \]

Then the VAR can always be written as \[ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta x_{t-i} + u_t. \]
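The rewriting is purely algebraic, which a quick numerical check makes visible. The sketch below draws arbitrary VAR(3) coefficients (an assumption; stability plays no role in the identity), maps them to \(\Pi\) and \(\Gamma_i\), and confirms that both forms imply the same \(\Delta x_t\). The helper `var_to_vecm` is a hypothetical name.

```python
import numpy as np

def var_to_vecm(A_list):
    """Map VAR(p) level coefficients to (Pi, [Gamma_1, ..., Gamma_{p-1}])."""
    m = A_list[0].shape[0]
    p = len(A_list)
    Pi = sum(A_list) - np.eye(m)
    # Gamma_i = -(A_{i+1} + ... + A_p); A_list[i] is A_{i+1} in 1-based notation
    Gammas = [-sum(A_list[i:]) for i in range(1, p)]
    return Pi, Gammas

rng = np.random.default_rng(6)
A_list = [0.3 * rng.standard_normal((2, 2)) for _ in range(3)]
Pi, Gammas = var_to_vecm(A_list)

past = rng.standard_normal((3, 2))      # rows: x_{t-1}, x_{t-2}, x_{t-3}
u = rng.standard_normal(2)

# Levels form
x_t = sum(A @ x_lag for A, x_lag in zip(A_list, past)) + u

# Differenced form
dx_lags = past[:-1] - past[1:]           # rows: dx_{t-1}, dx_{t-2}
dx_vecm = Pi @ past[0] + sum(G @ d for G, d in zip(Gammas, dx_lags)) + u

print(np.max(np.abs((x_t - past[0]) - dx_vecm)))
```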

Cointegration Hypothesis

Suppose \(x_t\) is \(I(1)\) and cointegrated with rank \(r<m\).

That is, there exists \(\beta \in \mathbb{R}^{m\times r}\) such that \[ \beta' x_t \ \text{is stationary}. \]

Why Π Must Have Reduced Rank

In the VECM form, \[ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta x_{t-i} + u_t, \] the left-hand side is stationary, as are \(\Delta x_{t-i}\) and \(u_t\). Hence \(\Pi x_{t-1}\) must be stationary.

Since \(x_{t-1}\) is nonstationary, \(\Pi x_{t-1}\) can be stationary only if it is a linear combination of \(\beta' x_{t-1}\): \[ \mathrm{rank}(\Pi)=r<m \quad\Rightarrow\quad \Pi=\alpha\beta'. \]

The Vector Error-Correction Model

Substituting \(\Pi=\alpha\beta'\) gives the VECM:

\[ \Delta x_t = \alpha \beta' x_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta x_{t-i} + u_t. \]

short-run dynamics: \(\Gamma_i \Delta x_{t-i}\)
long-run equilibrium errors: \(\beta' x_{t-1}\)
adjustment to equilibrium: \(\alpha \in \mathbb{R}^{m \times r}\)

A Bivariate VECM Example

Let \[ x_t = \begin{bmatrix} y_t \\ z_t \end{bmatrix}, \qquad y_t,\; z_t \text{ are } I(1). \]

Suppose there exists a scalar \(\theta\) such that \[ \beta' x_t = y_t - \theta z_t \quad \text{is stationary}, \qquad \beta'=(1,\,-\theta). \]

Then \(y_t\) and \(z_t\) are cointegrated (rank \(r=1\)).

Bivariate VECM: Error-Correction Equations

Define the equilibrium error \(e_{t-1} \equiv y_{t-1} - \theta z_{t-1}\). The VECM \[ \Delta x_t = \alpha e_{t-1} + \sum_{i=1}^{p-1}\Gamma_i \Delta x_{t-i} + u_t, \qquad \alpha \in \mathbb{R}^{2\times 1}, \]

implies two error-correction equations: \[ \begin{aligned} \Delta y_t &= \alpha_1 e_{t-1} + \text{short-run dynamics} + u_{1t}, \\ \Delta z_t &= \alpha_2 e_{t-1} + \text{short-run dynamics} + u_{2t}. \end{aligned} \]
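A simulated version of this system, under assumed values \(\theta = 2\), \(\alpha = (-0.5,\ 0)'\), and no short-run dynamics, shows the contrast between the wandering levels and the stationary equilibrium error.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 2000
theta = 2.0                        # assumed cointegrating coefficient
alpha = np.array([-0.5, 0.0])      # only y adjusts; z is a pure random walk
beta = np.array([1.0, -theta])

x = np.zeros((T, 2))               # columns: y_t, z_t
for t in range(1, T):
    e_prev = beta @ x[t - 1]       # equilibrium error y_{t-1} - theta * z_{t-1}
    x[t] = x[t - 1] + alpha * e_prev + rng.standard_normal(2)

e = x @ beta
print("sample variance of y_t:", x[:, 0].var())
print("sample variance of z_t:", x[:, 1].var())
print("sample variance of e_t:", e.var())
```

With these values the equilibrium error follows a stationary AR(1) with coefficient \(1 + \alpha_1 - \theta\alpha_2 = 0.5\), so its sample variance stays bounded while the variances of the levels grow with \(T\).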

Estimation of the VECM Parameters

Consider the VECM \[ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{p-1}\Gamma_i \Delta x_{t-i} + u_t, \qquad \Pi=\alpha\beta', \qquad \mathrm{rank}(\Pi)=r. \]

What needs to be estimated:

short-run parameters \(\Gamma_i\)
long-run matrix \(\Pi\), subject to rank restriction \(\mathrm{rank}(\Pi)=r<m\)

Why OLS Fails for the VECM

OLS treats \(\Pi\) as unrestricted
unrestricted OLS produces a full-rank estimate of \(\Pi\)

Therefore, unrestricted OLS alone does not deliver a VECM with cointegrating rank \(r\). The rank restriction must be imposed directly in the estimation procedure.

Johansen: Partial Out Short-Run Dynamics

Start from the VECM: \[ \Delta x_t=\Pi x_{t-1}+\sum_{i=1}^{p-1}\Gamma_i \Delta x_{t-i}+u_t. \]

Idea: partial out short-run dynamics so the remaining relation is purely long-run.

Johansen: Partial Out (Residualized Variables)

Regress \(\Delta x_t\) and \(x_{t-1}\) on \(\{\Delta x_{t-1},\dots,\Delta x_{t-p+1}\}\) and keep residuals: \[ \begin{aligned} R_{0t} &= \Delta x_t - \mathbb{E}[\Delta x_t \mid \Delta x_{t-1},\dots,\Delta x_{t-p+1}],\\ R_{1t} &= x_{t-1} - \mathbb{E}[x_{t-1} \mid \Delta x_{t-1},\dots,\Delta x_{t-p+1}]. \end{aligned} \]

The model becomes the long-run regression \[ R_{0t}=\Pi R_{1t}+u_t. \]

This follows from the Frisch–Waugh–Lovell Theorem.

Johansen: Gaussian MLE Setup

Assume the VECM errors are Gaussian: \[ u_t \sim N(0,\Sigma_u), \quad t=1,\dots,T. \]

By FWL, the partialled-out long-run regression has the same error: \[ R_{0t} = \Pi R_{1t} + u_t, \qquad u_t \sim N(0,\Sigma_u). \]

Johansen: Log-Likelihood

The (conditional) log-likelihood is

\[ \ell(\Pi,\Sigma_u) = -\frac{T}{2}\log|\Sigma_u| -\frac{1}{2}\sum_{t=1}^T (R_{0t}-\Pi R_{1t})'\Sigma_u^{-1}(R_{0t}-\Pi R_{1t}) + \text{const}. \]

Maximizing the likelihood is equivalent to minimizing the weighted sum of squared residuals.

Profile Out \(\Sigma_u\)

For fixed \(\Pi\), maximizing the Gaussian log-likelihood over \(\Sigma_u\) gives \[ \widehat\Sigma_u(\Pi) = \tfrac{1}{T}\sum_{t=1}^T (R_{0t} - \Pi R_{1t})(R_{0t} - \Pi R_{1t})'. \]

Substituting back, the quadratic term becomes \(\tfrac{1}{2}\,\mathrm{tr}\bigl(\widehat\Sigma_u(\Pi)^{-1}\, T\,\widehat\Sigma_u(\Pi)\bigr) = \tfrac{Tm}{2}\), a constant, so \[ \ell(\Pi) = -\tfrac{T}{2}\log\bigl|\widehat\Sigma_u(\Pi)\bigr| + \text{const}, \] and the MLE problem becomes \[ \min_{\Pi:\,\mathrm{rank}(\Pi)=r}\;\bigl|\widehat\Sigma_u(\Pi)\bigr|. \]

Imposing \(\Pi = \alpha\beta'\); Profile Out \(\alpha\)

Expand the residual covariance: \[ \widehat\Sigma_u(\Pi) = S_{00} - \Pi S_{10} - S_{01}\Pi' + \Pi S_{11}\Pi', \qquad S_{ij} = \tfrac{1}{T}\textstyle\sum_t R_{it}R_{jt}'. \]

Substituting \(\Pi = \alpha\beta'\) and completing the square in \(\alpha\), \[ \widehat\alpha(\beta) = S_{01}\beta(\beta'S_{11}\beta)^{-1}. \]

Plugging \(\widehat\alpha\) back in leaves a concentrated objective in \(\beta\) alone: \[ \widehat\Sigma_u(\beta) = S_{00} - S_{01}\beta(\beta'S_{11}\beta)^{-1}\beta'S_{10}. \]

From β to a Generalized Eigenvalue Problem

Two determinant identities, \(|A-B|=|A|\,|I-A^{-1}B|\) and Sylvester’s \(|I+UV|=|I+VU|\), plus the normalization \(\beta'S_{11}\beta = I_r\), give \(\bigl|\widehat\Sigma_u(\beta)\bigr| = |S_{00}|\,\bigl|I_r - \beta' A \beta\bigr|\) with \(A := S_{10}S_{00}^{-1}S_{01}\). Minimizing this determinant has the same solution as \[ \max_{\beta}\;\bigl|\beta' A \beta\bigr| \quad\text{s.t.}\quad \beta'S_{11}\beta = I_r. \]

The FOC (Lagrangian + Jacobi’s formula, then diagonalize) is the generalized eigenvalue problem \[ S_{10}S_{00}^{-1}S_{01}\,v = \lambda\,S_{11}\,v. \] \(\widehat\beta_r\) = the top-\(r\) eigenvectors.

Remark. The eigenvalues \(\widehat\lambda_i\) are the squared canonical correlations between \(R_{0t}\) and \(R_{1t}\) — Johansen is canonical correlation analysis (CCA) between residualized differences and levels.

Why: CCA’s eigenproblem \(\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\,b = \rho^2\,b\) becomes \(S_{11}^{-1}S_{10}S_{00}^{-1}S_{01}\,v = \lambda\,v\) once \(X = R_{0t},\ Y = R_{1t}\) — identical to Johansen, so \(\widehat\lambda_i = \widehat\rho_i^{\,2}\).
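The steps from partialling out to the eigenproblem can be sketched with NumPy alone. This is a simplified illustration, not a full Johansen implementation: it has no deterministic terms, the simulated system (with \(\theta = 2\)) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def residualize(Y, Z):
    """Residuals from OLS of each column of Y on Z (Z may have zero columns)."""
    if Z.shape[1] == 0:
        return Y
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return Y - Z @ coef

def johansen_eigen(x, p):
    """Eigenvalues and S11-normalized eigenvectors for a VAR(p) in levels."""
    dx = np.diff(x, axis=0)                       # dx[k] = x[k+1] - x[k]
    Y0 = dx[p - 1:]                               # Delta x_t
    Y1 = x[p - 1:-1]                              # x_{t-1}
    lagged = [dx[p - 1 - i:dx.shape[0] - i] for i in range(1, p)]
    Z = np.hstack(lagged) if lagged else np.empty((Y0.shape[0], 0))
    R0, R1 = residualize(Y0, Z), residualize(Y1, Z)
    n = R0.shape[0]
    S00, S01, S11 = R0.T @ R0 / n, R0.T @ R1 / n, R1.T @ R1 / n
    # Solve S11^{-1} S10 S00^{-1} S01 v = lambda v
    M = np.linalg.solve(S11, S01.T @ np.linalg.solve(S00, S01))
    lam, V = np.linalg.eig(M)
    order = np.argsort(lam.real)[::-1]
    lam, V = lam.real[order], V.real[:, order]
    V = V / np.sqrt(np.einsum("ij,jk,ki->i", V.T, S11, V))   # v' S11 v = 1
    return lam, V, S01, n

# Simulated cointegrated pair: y_t - 2 z_t stationary (assumed DGP).
rng = np.random.default_rng(9)
T, theta = 2000, 2.0
alpha_true, beta_true = np.array([-0.5, 0.0]), np.array([1.0, -theta])
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = x[t - 1] + alpha_true * (beta_true @ x[t - 1]) + rng.standard_normal(2)

lam, beta_hat, S01, n = johansen_eigen(x, p=2)
r = 1
beta_r = beta_hat[:, :r]
alpha_r = S01 @ beta_r                            # alpha-hat given the normalization
beta_normalized = beta_r[:, 0] / beta_r[0, 0]     # coefficient on y set to 1
print("eigenvalues:", lam)
print("normalized beta:", beta_normalized)
```

In this design one eigenvalue is clearly away from zero and the other is close to zero, and the normalized eigenvector is close to \((1, -2)'\).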

CCA in Variational Form

The remark just said \(\widehat\lambda_i = \widehat\rho_i^{\,2}\) — eigenvalues are squared canonical correlations. Make this operational.

Canonical direction vectors. Pick \(a,\,b \in \mathbb{R}^m\) to form scalar series \[ a' R_{0t}\ \text{(combined residual differences)}, \qquad b' R_{1t}\ \text{(combined residual levels)}. \]

Variational identity. \[ \widehat\lambda_i \;=\; \max_{a,\,b}\;\widehat{\mathrm{corr}}^{\,2}\!\bigl(a' R_{0t},\, b' R_{1t}\bigr) \;=\; \max_{a,\,b}\;\frac{(a'S_{01}\,b)^2}{(a'S_{00}\,a)\,(b'S_{11}\,b)}. \] For \(i>1\), the max is taken orthogonal to previous pairs (CCA deflation).

What Does a Large \(\widehat\lambda_i\) Mean?

Fix a direction \(b\) with \(b'R_{1t}\sim I(1)\). (\(R_{0t}\sim I(0)\) always.) What does the squared sample correlation along \((a, b)\) converge to? \[ \frac{(a'S_{01}\,b)^2}{(a'S_{00}\,a)\,(b'S_{11}\,b)}, \qquad S_{ij} = \tfrac{1}{T}\!\sum_{t} R_{it} R_{jt}'. \]

Three asymptotic rates.

\(a' S_{00}\, a \;\xrightarrow{p}\; a'\Sigma_{00}\,a \;=\; O_p(1)\) (\(I(0)\times I(0)\))
\(b' S_{11}\, b \;=\; O_p(T)\) (\(I(1)\times I(1)\): \(\mathrm{Var}(b'R_{1t}) = O(t)\), sample average inflates as \(T\))
\(a' S_{01}\, b \;=\; O_p(1)\) (\(I(0)\times I(1)\), unit-root FCLT: \(\xrightarrow{d} \int_0^1 W\,dB\))

Ratio collapses: \[ \frac{O_p(1)}{O_p(1)\cdot O_p(T)} \;=\; O_p(T^{-1}) \;\xrightarrow{p}\; 0. \] Any \(I(1)\)-direction washes out. So \(\widehat\lambda_i = \max_{a,b}(\cdot)\) can stay bounded above 0 only if some \(b\) gives \(b'R_{1t}\sim I(0)\).
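The \(O_p(T^{-1})\) rate can be seen in a small Monte Carlo. The sketch below is an illustration under assumptions: a scalar random walk, its own increments as the \(I(0)\) series, and 200 replications per sample size. It averages the squared sample correlation between an \(I(0)\) increment and the lagged \(I(1)\) level.

```python
import numpy as np

def mean_sq_corr(T, reps, rng):
    """Average squared (uncentered) correlation between e_t and w_{t-1}."""
    vals = np.empty(reps)
    for i in range(reps):
        e = rng.standard_normal(T)
        w = np.cumsum(e)              # I(1) series
        a, b = e[1:], w[:-1]          # I(0) increment vs lagged I(1) level
        vals[i] = (a @ b) ** 2 / ((a @ a) * (b @ b))
    return vals.mean()

rng = np.random.default_rng(8)
results = {T: mean_sq_corr(T, reps=200, rng=rng) for T in (100, 400, 1600)}
for T, val in results.items():
    print(f"T = {T:5d}   mean corr^2 = {val:.5f}   T * mean = {T * val:.3f}")
```

The average squared correlation shrinks roughly in proportion to \(1/T\), so the last column stays of similar size across sample sizes.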

Cointegration as the Escape Hatch

A non-vanishing \(\widehat\lambda_i\) requires a \(\beta_i\) such that \(\beta_i'x_{t-1}\) is stationary. Then \(\beta_i' S_{11}\,\beta_i = O_p(1)\) rather than \(O_p(T)\), and \(\widehat\lambda_i\) stays bounded away from 0.

\[ \widehat\lambda_i \;\not\xrightarrow{p}\; 0 \;\;\Longleftrightarrow\;\; \beta_i'x_{t-1}\sim I(0) \;\;\Longleftrightarrow\;\; \beta_i \text{ cointegrates } x_t. \]

How Eigenvalues Encode rank(Π)

Each non-zero \(\lambda_i\) of \(M := \Sigma_{11}^{-1}\Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01}\) flags a cointegrating direction. To get a count, chain three facts:

1. # non-zero \(\lambda_i = \mathrm{rank}(M)\) — by definition.

2. \(\mathrm{rank}(M) = \mathrm{rank}(\Pi)\) — both reduce to \(\mathrm{rank}(\Sigma_{01})\), since \(\Pi = \Sigma_{01}\Sigma_{11}^{-1}\) and the middle factor \(\Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01}\) in \(M\) is a Gram matrix.

3. \(\mathrm{rank}(\Pi)\) = # cointegrating relations — from the VECM setup.

\(\Longrightarrow\) # non-zero \(\widehat\lambda_i\) = # cointegrating relations.

Eigenvalues and Cointegrating Rank

A single \(\widehat\lambda_i\) describes one direction. Their collective pattern encodes the cointegrating rank \(r\):

\(r = 0\) (no cointegration; \(\Pi = 0\), \(x_t\) pure \(I(1)\)). All \(\widehat\lambda_i = O_p(T^{-1}) \to 0\). Every direction has the variance mismatch.

\(0 < r < m\) (\(r\) cointegrating relations). Exactly \(r\) eigenvalues stay \(O_p(1)\); the remaining \(m-r\) vanish.

\(r = m\) (\(x_t\) already stationary). All \(\widehat\lambda_i\) are \(O_p(1)\).

\(\Longrightarrow\) # of non-vanishing eigenvalues = cointegrating rank.

At finite \(T\), sampling noise keeps every \(\widehat\lambda_i > 0\) — the rank tests formalize “how large is large enough.”

From Eigenvalues to a Test

In population, the rank dictionary cleanly separates non-vanishing from vanishing eigenvalues. At finite \(T\), sampling noise keeps every \(\widehat\lambda_i > 0\) — we need a hypothesis test.

Hypothesis. For each candidate \(r\), \[ H_0:\mathrm{rank}(\Pi) \le r \quad\text{vs.}\quad H_1:\mathrm{rank}(\Pi) > r. \] Under \(H_0\): \(\widehat\lambda_{r+1}, \ldots, \widehat\lambda_m\) should all vanish. Under \(H_1\): at least \(\widehat\lambda_{r+1}\) remains \(O_p(1)\).

Direct route. Reject if \(\widehat\lambda_{r+1}\) is “too big.” Two gaps:

cutoff — what null distribution for raw \(\widehat\lambda\)?
combining info — \(\widehat\lambda_{r+1}\) alone, or all of \(\widehat\lambda_{r+1}, \ldots, \widehat\lambda_m\)?

LR framework. Standard hypothesis-testing machinery — provides a null distribution and a principled way to aggregate across eigenvalues.

The Log-Likelihood Ratio

For nested models with maximized likelihoods \(L_0 \le L_1\) (restricted vs. unrestricted), \[ \mathrm{LR} \;=\; -2\log\frac{L_0}{L_1} \;=\; -2(\ell_0 - \ell_1) \;\ge\; 0. \]

Large \(\mathrm{LR}\) → restricted model fits much worse → reject \(H_0\).

Wilks’ theorem (regular problems): under \(H_0\), \[ \mathrm{LR} \;\xrightarrow{d}\; \chi^2_q, \qquad q = \#\,\text{restrictions}. \]

Johansen’s case is nonstandard. Because \(x_t\) is \(I(1)\), the limit is a functional of Brownian motion, not \(\chi^2\). Critical values are tabulated (Osterwald-Lenum 1992; MacKinnon–Haug–Michelis 1999).

Johansen’s LR Rank Test

For each \(r\), test \(H_0:\mathrm{rank}(\Pi)\le r\) vs. \(H_1:\mathrm{rank}(\Pi) > r\). Substituting the profile log-likelihood \(\ell(\Pi) = -\tfrac{T}{2}\log|\widehat\Sigma_u(\Pi)| + \text{const}\), \[ \mathrm{LR}(r) \;=\; -2(\ell_r - \ell_m) \;=\; -T\log\frac{|\widehat\Sigma_m|}{|\widehat\Sigma_r|}, \] where \(\widehat\Sigma_m, \widehat\Sigma_r\) are the residual covariances under the unrestricted and rank-\(r\) models.

LR in Eigenvalue Form

The determinant ratio reduces to a product over the Johansen eigenvalues: \[ \frac{|\widehat\Sigma_m|}{|\widehat\Sigma_r|} \;=\; \prod_{i=r+1}^m (1 - \widehat\lambda_i), \qquad \mathrm{LR}(r) \;=\; -T\sum_{i=r+1}^m \log(1 - \widehat\lambda_i). \]

Two tests are built from this: the trace test and the maximum eigenvalue test.

Trace Test (Joint LR Test)

Compares the rank-\(r\) model to the unrestricted model: \[ \mathrm{LR}_{\text{trace}}(r) = -T\sum_{i=r+1}^m \log(1-\hat\lambda_i). \]

This tests \[ H_0:\ \mathrm{rank}(\Pi)\le r \quad \text{vs.} \quad H_1:\ \mathrm{rank}(\Pi)>r. \]

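Under the null, the population eigenvalues \(\lambda_{r+1},\dots,\lambda_m\) are all zero, so the estimates \(\hat\lambda_{r+1},\dots,\hat\lambda_m\) should all be close to zero.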

Maximum Eigenvalue Test

Compares the rank-\(r\) model to the rank-\((r+1)\) model: \[ \mathrm{LR}_{\max}(r,r+1) = -T\log(1-\hat\lambda_{r+1}). \]

This tests \[ H_0:\ \mathrm{rank}(\Pi)=r \quad \text{vs.} \quad H_1:\ \mathrm{rank}(\Pi)=r+1. \]

Sequential use. Starting from \(r=0\): test, and if rejected, increment \(r\) and test again. The selected rank is the first \(r\) for which \(H_0\) is not rejected.
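Given a set of estimated eigenvalues, both statistics are simple to compute. In the sketch below the eigenvalues and sample size are hypothetical numbers, and the function name is an assumption. Critical values are not included; they must come from the tabulated nonstandard distributions.

```python
import numpy as np

def johansen_stats(lam, T):
    """Trace and max-eigenvalue statistics for r = 0, ..., m-1."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]   # descending order
    terms = -T * np.log(1.0 - lam)
    trace = np.array([terms[r:].sum() for r in range(lam.size)])
    max_eig = terms.copy()          # entry r uses the (r+1)-th largest eigenvalue
    return trace, max_eig

lam_hat = [0.35, 0.04, 0.005]       # hypothetical eigenvalues
trace, max_eig = johansen_stats(lam_hat, T=200)
for r in range(len(lam_hat)):
    print(f"r = {r}: trace = {trace[r]:.2f}, max-eig = {max_eig[r]:.2f}")
```

In the sequential procedure, each row would be compared with its critical value, starting from \(r = 0\) and stopping at the first non-rejection.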

Connecting Back to the Eigenvalue Magnitude

For small \(\widehat\lambda_i\) (the relevant range under the null of no further cointegration), Taylor expansion gives \[ -\log(1-\widehat\lambda_i) \;=\; \widehat\lambda_i + \tfrac{1}{2}\widehat\lambda_i^2 + O(\widehat\lambda_i^3). \]

Both Johansen statistics reduce to magnitude-based forms:

Trace: \(\mathrm{LR}_{\text{trace}}(r) \;\approx\; T\sum_{i=r+1}^m \widehat\lambda_i\).
Max-eigenvalue: \(\mathrm{LR}_{\max}(r,r+1) \;\approx\; T\,\widehat\lambda_{r+1}\).

The “log” was never a separate strategy — it is the likelihood-correct version of “look at the eigenvalues.” The intuition built up earlier is what the test ultimately uses.

Trace vs. Max-Eigen: Practical Guidance

Why they differ.

Trace sums contributions across \(\widehat\lambda_{r+1}, \dots, \widehat\lambda_m\) — diffuse evidence across many small eigenvalues accumulates → tends to pick larger \(r\).
Max-eigen uses only \(\widehat\lambda_{r+1}\) — needs one clearly large next eigenvalue → tends to pick smaller \(r\).

So what.

Trace has more power when several weak cointegrating relations are present.
Max-eigen has more power against one strong relation and is easier to interpret step-by-step.
In small samples, trace over-rejects more.

What to do.

Report both.
If they agree → done.
If they disagree → prefer max-eigen (more conservative), and probe robustness via lag length and deterministic specification.

Engle–Granger vs. Johansen

Both estimate \(\beta\), which is identified only up to a normalization (for \(r > 1\), up to a nonsingular \(r \times r\) transformation). The difference is when the normalization enters.

Engle–Granger (single-equation)

OLS on \(y_t = \theta' z_t + e_t\), then ECM with \(\widehat e_{t-1}\)
normalization is baked into estimation — picking \(y\) as LHS fixes its coefficient at \(1\)
different LHS choices give different finite-sample \(\widehat\theta\) (asymmetric)
cannot handle \(r > 1\)

Johansen (system VECM)

system estimator for the cointegration space \(\mathrm{span}(\beta)\)
normalization is post-estimation — rotate eigenvectors after fitting; rank tests are invariant
handles \(r = 0, 1, \dots, m-1\); LR rank tests built in

Both require lag length \(p\) and deterministic terms (constant/trend) to be specified up front.

From Rank to VECM Structure

With \(r\) fixed by the rank tests, both VECM components come from the eigenproblem: \[ \widehat\beta_r = \text{top-}r\ \text{eigenvectors (normalized: } \widehat\beta_r'\,S_{11}\,\widehat\beta_r = I_r\text{)}, \qquad \widehat\alpha_r = S_{01}\,\widehat\beta_r. \]

The estimated VECM \[ \Delta x_t = \widehat\alpha\,\widehat\beta'\,x_{t-1} + \cdots + u_t \] has two structural objects to interrogate:

\(\beta\) — long-run equilibrium relations,
\(\alpha\) — adjustment coefficients (which variables respond to disequilibrium).

Restrictions on either are testable hypotheses with economic content. We turn to \(\alpha\) next — its restrictions read as statements about the error-correction role of each variable (e.g., “is variable \(i\) weakly exogenous to the long-run equilibrium?”).

Testing Adjustment in a VECM: Restrictions on \(\alpha\)

In the VECM \[ \Delta x_t = \alpha\beta' x_{t-1} + \cdots + u_t, \] \(\alpha \in \mathbb{R}^{m\times r}\) collects adjustment coefficients: row \(i\) describes how \(\Delta x_{i,t}\) responds to the \(r\) equilibrium errors.

Linear hypotheses on \(\alpha\) take the form \[ H_0:\ R\,\mathrm{vec}(\alpha) = q, \qquad R \in \mathbb{R}^{k \times mr},\ q \in \mathbb{R}^{k}, \] where \(\mathrm{vec}(\alpha) \in \mathbb{R}^{mr}\) stacks the columns of \(\alpha\), and \(k\) = # independent restrictions = degrees of freedom of the LR/Wald \(\chi^2\) test.

Conditional on the cointegration rank \(r\), \(\alpha\) is the coefficient on the stationary regressor \(\widehat\beta_r' x_{t-1},\) so standard regression / MLE asymptotics apply.

Examples of Restrictions on \(\alpha\)

No adjustment for variable \(i\) (weak exogeneity for \(\beta\)): \[ H_0:\ \alpha_{i\cdot}=0 \quad (r\ \text{restrictions}) \]
Only relation 1 adjusts variable \(i\): \[ H_0:\ \alpha_{i2}=\cdots=\alpha_{ir}=0 \]
Same adjustment for variables \(i\) and \(j\): \[ H_0:\ \alpha_{i\cdot}=\alpha_{j\cdot} \]
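To connect these examples with the general form \(R\,\mathrm{vec}(\alpha) = q\), the sketch below builds \(R\) for the weak-exogeneity hypothesis \(\alpha_{i\cdot} = 0\). The numbers in `alpha` are placeholders and the function name is hypothetical. With column stacking, \(R = I_r \otimes e_i'\) picks out row \(i\) of \(\alpha\).

```python
import numpy as np

def weak_exogeneity_R(m, r, i):
    """R with R @ vec(alpha) equal to row i of alpha (i is 0-based)."""
    e_i = np.zeros((1, m))
    e_i[0, i] = 1.0
    return np.kron(np.eye(r), e_i)             # shape (r, m*r)

m, r = 3, 2
alpha = np.arange(1.0, m * r + 1).reshape(m, r)   # placeholder values
vec_alpha = alpha.flatten(order="F")              # stack the columns of alpha
R = weak_exogeneity_R(m, r, i=2)

print(R)
print(R @ vec_alpha)     # equals alpha[2, :]; H0 sets this vector to q = 0
```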

Example: A Small Open Economy

\(x_t = (y_t,\ p_t,\ p_t^{\ast})'\) — domestic output, domestic price, foreign price.

Question. Do prices satisfy PPP in the long run, and which side adjusts to disequilibrium — domestic or foreign?

Procedure.

Johansen rank tests → estimate \(\widehat r\).
Recover \(\widehat\beta\) from the top eigenvector(s).
LR test whether \(\widehat\beta\) matches the PPP form \((0,1,-1)'\).
Test restrictions on \(\alpha\) for adjustment dynamics.

Suppose stages 1–3 deliver \(\widehat r = 1\) and \(\widehat\beta \approx (0,1,-1)'\). The VECM components: \[ \beta = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha_y \\ \alpha_p \\ \alpha_{p^\ast} \end{bmatrix}, \quad \Pi = \alpha\beta' = \begin{bmatrix} 0 & \alpha_y & -\alpha_y \\ 0 & \alpha_p & -\alpha_p \\ 0 & \alpha_{p^\ast} & -\alpha_{p^\ast} \end{bmatrix}. \]
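The structure of \(\Pi\) can be checked numerically. The adjustment coefficients below are hypothetical values, not estimates.

```python
import numpy as np

beta = np.array([[0.0], [1.0], [-1.0]])        # PPP form (0, 1, -1)'
alpha = np.array([[0.05], [-0.20], [0.10]])    # hypothetical adjustment coefficients

Pi = alpha @ beta.T
rank = np.linalg.matrix_rank(Pi)
print(Pi)
print("rank(Pi) =", rank)
```

The first column of `Pi` is zero because output does not enter the cointegrating relation, and the rank equals \(r = 1\).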

Small Open Economy: A Meaningful Restriction

Small open economy assumption: foreign prices do not adjust to domestic disequilibrium: \[ H_0:\ \alpha_{p^\ast}=0. \]

Interpretation:

deviations from PPP are corrected through domestic adjustment
\(p_t^{\ast}\) is weakly exogenous for the cointegrating relation \(\beta\)

Diagnostics

Once a VAR or VECM is estimated, check that the specification is adequate:

residual autocorrelation: multivariate portmanteau (Ljung–Box-type) test on the residuals
stability: all eigenvalues of the companion matrix inside the unit circle
distributional checks: normality or at least symmetry of residuals for inference

Failure of these checks typically points back to an under-specified lag length or an omitted structural break.

Summary: What VARs and VECMs Do

We studied VARs and VECMs as reduced-form representations of multivariate time series.

They allow us to:

model joint dynamics without imposing exogeneity
estimate long-run relations via cointegration
produce forecasts and forecast uncertainty

Summary: What They Do Not Do

The shocks \(u_t\) in a VAR or VECM are:

statistical innovations
generally correlated across equations
not economically interpretable without further identifying assumptions

Next unit: how additional identifying assumptions transform reduced-form innovations into structural shocks with economic meaning.