Xiamen University, Chow Institute
May, 2026
We have studied dynamic relationships using autoregressive distributed lag (ADL) models.
\[ y_t = \alpha + \sum_{j=1}^p \phi_j y_{t-j} + \sum_{k=0}^q \beta_k x_{t-k} + u_t. \]
The model captures:
For consistent estimation and interpretation:
But ruling out feedback is often untenable in macro, where:
Consider a thermostat regulating room temperature:
The thermostat adjusts its setting in response to past temperature:
\[ x_t = \gamma (y_{t-1} - y^\ast), \]
where \(y^\ast\) is the target temperature.
Suppose the room temperature is generated by \[ y_t = \alpha y_{t-1} + \beta x_t + u_t. \]
Since \(x_t = \gamma(y_{t-1} - y^\ast)\) depends on \(y_{t-1}\), which itself depends on \(u_{t-1}\):
\[ \mathrm{Cov}(x_t, u_{t-1}) \neq 0. \]
Strict exogeneity fails because the thermostat responds to the past state of the system, so the regressor is correlated with past shocks.
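A small simulation can make the feedback visible. The sketch below uses illustrative parameter values that are assumptions, not values from the lecture, and compares the sample correlation of \(x_t\) with \(u_{t-1}\) and with \(u_t\).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20_000
# Illustrative (assumed) values: y_t = alpha*y_{t-1} + beta*x_t + u_t
alpha, beta, gamma, y_star = 0.9, 1.0, -0.5, 20.0

u = rng.standard_normal(T)
y = np.zeros(T)
x = np.zeros(T)
y[0] = y_star
for t in range(1, T):
    x[t] = gamma * (y[t - 1] - y_star)            # thermostat rule
    y[t] = alpha * y[t - 1] + beta * x[t] + u[t]  # room temperature

# Sample correlations of the regressor with lagged and current shocks
corr_lagged = np.corrcoef(x[2:], u[1:-1])[0, 1]   # corr(x_t, u_{t-1})
corr_current = np.corrcoef(x[2:], u[2:])[0, 1]    # corr(x_t, u_t)
print(f"corr(x_t, u_(t-1)) = {corr_lagged:.3f}")
print(f"corr(x_t, u_t)     = {corr_current:.3f}")
```

With these values the lagged correlation is strongly negative while the contemporaneous one is close to zero: the regressor is predetermined but not strictly exogenous.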
A simple bivariate policy–inflation system:
\[ \begin{aligned} i_t &= \rho\, i_{t-1} + \phi\, \pi_{t-1} + u_{it} \\ \pi_t &= \theta\, \pi_{t-1} - \lambda\, i_{t-1} + u_{\pi t}. \end{aligned} \]
Neither variable is naturally “independent” of the other. There is no principled way to choose one as the regressor and the other as the regressand — feedback runs both ways.
Macroeconomists historically tried to capture this kind of feedback through large-scale macroeconometric models.
These models:
For example, a model might include equations such as: \[ \begin{aligned} C_t &= f(Y_t, T_t, W_{t-1}, C_{t-1}) \\ I_t &= g(Y_t, r_t, I_{t-1}) \\ M_t &= h(Y_t, P_t, M_{t-1}) \end{aligned} \]
To identify such a system, you must impose exclusion restrictions — assumptions that some variables do not enter certain equations.
Exclusions are hard to justify. Why does \(r_t\) belong in investment but not in consumption? Intertemporal optimization puts it in both. In practice, exclusions reflect modeling convenience more than theory.
Exclusions choose the transmission. Omitting \(r_t\) from \(C_t\) forces every effect of a rate shock on \(C\) to travel through \(Y_t\) — ruling out, by assumption, any direct consumer response to rates.
If we can’t defend the exclusions, we can’t defend the system.
The alternative: treat all variables symmetrically — each gets its own equation, and each equation can include every other variable.
The simplest case is a bivariate system:
\[ \begin{aligned} y_t &= b_{10} - b_{12} z_t + \gamma_{11} y_{t-1} + \gamma_{12} z_{t-1} + \varepsilon_{yt}, \\ z_t &= b_{20} - b_{21} y_t + \gamma_{21} y_{t-1} + \gamma_{22} z_{t-1} + \varepsilon_{zt}. \end{aligned} \]
Let \[ x_t = \begin{bmatrix} y_t \\ z_t \end{bmatrix}, \qquad \varepsilon_t = \begin{bmatrix} \varepsilon_{yt} \\ \varepsilon_{zt} \end{bmatrix}. \]
The system can be written compactly as \[ B x_t = \Gamma_0 + \Gamma_1 x_{t-1} + \varepsilon_t, \] where
\[ B = \begin{bmatrix} 1 & b_{12} \\ b_{21} & 1 \end{bmatrix}, \qquad \Gamma_0 = \begin{bmatrix} b_{10} \\ b_{20} \end{bmatrix}, \qquad \Gamma_1 = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{bmatrix}. \]
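The matrix \(B\) collects the contemporaneous structural coefficients; identifying restrictions are restrictions on \(B\).
If \(B\) is invertible, the system implies a reduced-form representation: \[ x_t = A_0 + A_1 x_{t-1} + u_t, \qquad A_0 = B^{-1}\Gamma_0,\quad A_1 = B^{-1}\Gamma_1,\quad u_t = B^{-1}\varepsilon_t. \]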
The reduced form is invariant to the choice of contemporaneous structural restrictions, and can be estimated by OLS.
Sims (1980) proposed estimating this reduced form directly, arguing that many structural models relied on “incredible restrictions.”
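As a rough numerical illustration of the structural-to-reduced-form mapping, the sketch below premultiplies by \(B^{-1}\). All numbers are hypothetical, and the structural shocks are assumed to be uncorrelated with unit variance.

```python
import numpy as np

# Hypothetical structural parameters (illustrative values only)
B = np.array([[1.0, 0.5],
              [0.2, 1.0]])
Gamma0 = np.array([1.0, 0.5])
Gamma1 = np.array([[0.5, 0.1],
                   [0.2, 0.4]])
Sigma_eps = np.eye(2)                    # assumed: uncorrelated structural shocks

B_inv = np.linalg.inv(B)
A0 = B_inv @ Gamma0                      # reduced-form intercept
A1 = B_inv @ Gamma1                      # reduced-form lag matrix
Sigma_u = B_inv @ Sigma_eps @ B_inv.T    # Var(u_t) with u_t = B^{-1} eps_t

print("A1 =\n", A1)
print("Sigma_u =\n", Sigma_u)
```

Even with uncorrelated structural shocks, the reduced-form innovations are correlated across equations whenever \(B\) has non-zero off-diagonal entries.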
Consider the VAR(1): \[ x_t = A_0 + A_1 x_{t-1} + u_t. \]
Iterating backward, \[ x_t = \Big(\sum_{i=0}^{k} A_1^i\Big) A_0 + \sum_{i=0}^{k} A_1^i u_{t-i} + A_1^{k+1} x_{t-k-1}. \]
Stability condition: all eigenvalues of \(A_1\) lie strictly inside the unit circle.
If all eigenvalues of \(A_1\) lie strictly inside the unit circle, then:
Hence, the stationary solution is \[ x_t = \mu + \sum_{i=0}^{\infty} A_1^i u_{t-i}, \qquad \mu = (I-A_1)^{-1}A_0. \]
Equivalently, using lag operators, \[ x_t - \mu = (I-A_1L)^{-1} u_t. \]
Under the stability condition, \(\{x_t\}\) is covariance-stationary.
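The stability check and the stationary mean are easy to verify numerically. The sketch below uses an assumed \(A_0\) and \(A_1\), checks the eigenvalues, and compares the closed-form \(\mu\) with a truncated version of the backward-iterated sum.

```python
import numpy as np

# Illustrative (assumed) VAR(1) coefficients
A0 = np.array([1.0, 0.5])
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])

eigvals = np.linalg.eigvals(A1)
is_stable = bool(np.all(np.abs(eigvals) < 1))

# Closed-form mean: mu = (I - A1)^{-1} A0
mu = np.linalg.solve(np.eye(2) - A1, A0)

# Truncated sum (I + A1 + ... + A1^k) A0, which should approach mu
partial = np.zeros(2)
power = np.eye(2)
for _ in range(200):
    partial += power @ A0
    power = power @ A1        # after the loop, power = A1^200

print("eigenvalue moduli:", np.abs(eigvals), "stable:", is_stable)
print("mu:", mu, "truncated sum:", partial)
```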
Consider the VAR(\(p\)): \[ x_t = A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t. \]
Define the stacked state vector \[ X_t = \begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{bmatrix}. \]
The system can be written in companion form
\[ X_t = \mathscr{A} X_{t-1} + \mathscr{U}_t, \]
where
\[ \mathscr{A} = \begin{bmatrix} A_1 & A_2 & \cdots & A_p \\ I & 0 & \cdots & 0 \\ & \ddots & \ddots & \vdots \\ 0 & & I & 0 \end{bmatrix}. \]
The VAR(\(p\)) is stable if all eigenvalues of the companion matrix \(\mathscr{A}\) lie strictly inside the unit circle:
\[ |\lambda_i(\mathscr{A})| < 1 \quad \text{for all } i. \]
In ARMA models, stability was characterized via the roots of a polynomial. For a VAR(\(p\)), the eigenvalue condition on the companion matrix is equivalent to a root condition on the matrix lag polynomial.
Eigenvalues \(\lambda\) of \(\mathscr{A}\) solve \(\det(\lambda I - \mathscr{A}) = 0\). Expanding using the block structure, \[ \det(\lambda I - \mathscr{A}) = \det(\lambda^p I - A_1 \lambda^{p-1} - \cdots - A_p). \]
Factoring out \(\lambda^p\) and setting \(z = 1/\lambda\): \[ \det(I - A_1 z - \cdots - A_p z^p) = 0. \]
So the eigenvalues of \(\mathscr{A}\) are reciprocals of the roots of the characteristic polynomial: \(|\lambda|<1 \Leftrightarrow |z|>1\).
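The equivalence can be checked numerically for a small example. The sketch below builds the companion matrix of an assumed VAR(2) (the helper name `companion` and the coefficient values are illustrative) and confirms that each \(z = 1/\lambda\) makes the determinant vanish.

```python
import numpy as np

def companion(A_list):
    """Stack VAR(p) lag matrices A_1..A_p into the (mp x mp) companion matrix."""
    p = len(A_list)
    m = A_list[0].shape[0]
    top = np.hstack(A_list)
    if p == 1:
        return top
    bottom = np.hstack([np.eye(m * (p - 1)), np.zeros((m * (p - 1), m))])
    return np.vstack([top, bottom])

# Illustrative (assumed) VAR(2) coefficients
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
A2 = np.array([[0.3, 0.0], [0.0, 0.3]])

comp = companion([A1, A2])
lam = np.linalg.eigvals(comp)
is_stable = bool(np.all(np.abs(lam) < 1))

# Each z = 1/lambda should solve det(I - A1 z - A2 z^2) = 0
dets = [np.linalg.det(np.eye(2) - A1 / l - A2 / l**2) for l in lam]

print("eigenvalue moduli:", np.abs(lam), "stable:", is_stable)
print("max |det| at z = 1/lambda:", max(abs(d) for d in dets))
```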
A VAR(\(p\)) is estimated as a system of linear projections: \[ x_t = A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t. \]
Estimation is typically carried out by equation-by-equation ordinary least squares.
If:
then:
Equation-by-equation OLS estimates each VAR equation in isolation. VAR innovations are typically correlated across equations (\(\Sigma_u = \mathrm{Var}(u_t)\) not diagonal) — which usually makes GLS more efficient than OLS.
For the reduced-form VAR, though, GLS coincides with OLS — because every equation shares the same regressors, GLS’s usual advantage disappears.
The next slides make this precise.
Each VAR equation, stacked over \(t = 1, \ldots, T\), gives \(y_i = Z\beta_i + u_i\), with \(Z \in \mathbb{R}^{T \times mp}\) the lagged-regressor matrix and \(\beta_i \in \mathbb{R}^{mp}\) equation \(i\)’s coefficient vector.
Stacking across the \(m\) equations:
\[ \underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}}_{y \,\in\, \mathbb{R}^{mT}} = \underbrace{\begin{bmatrix} Z & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & Z \end{bmatrix}}_{X \,\in\, \mathbb{R}^{mT \times m^2 p}} \underbrace{\begin{bmatrix} \beta_1 \\ \vdots \\ \beta_m \end{bmatrix}}_{\beta \,\in\, \mathbb{R}^{m^2 p}} + \underbrace{\begin{bmatrix} u_1 \\ \vdots \\ u_m \end{bmatrix}}_{u \,\in\, \mathbb{R}^{mT}}. \]
Each equation shares the same regressor matrix \(Z\).
The block-diagonal \(X\) above can be written compactly with the Kronecker product.
For matrices \(A\) (\(p \times q\)) and \(B\), \(A \otimes B\) replaces each scalar entry \(a_{ij}\) of \(A\) with the block \(a_{ij}B\):
\[ A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1q} B \\ \vdots & & \vdots \\ a_{p1} B & \cdots & a_{pq} B \end{bmatrix}. \]
Since \(I_m\) has 1s on the diagonal, the block-diagonal \(X\) is just \(I_m \otimes Z\):
\[ X = I_m \otimes Z. \]
Because \(u_t\) is i.i.d. across time with \(\mathrm{Var}(u_t) = \Sigma_u\) (an \(m \times m\) contemporaneous covariance), the stacked error has
\[ \mathrm{Var}(u) = \Sigma_u \otimes I_T \;\in\; \mathbb{R}^{mT \times mT}. \]
So the SUR system is compactly \(y = (I_m \otimes Z)\beta + u\) with \(\mathrm{Var}(u) = \Sigma_u \otimes I_T\).
Useful Kronecker properties (we’ll need these):
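The identities usually meant here are the transpose rule \((A\otimes B)' = A'\otimes B'\), the inverse rule \((A\otimes B)^{-1} = A^{-1}\otimes B^{-1}\), and the mixed-product rule \((A\otimes B)(C\otimes D) = AC\otimes BD\). A quick numerical check with `numpy.kron`, on arbitrary matrices chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                 # invertible, det = 5
B = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])            # invertible, det = 18
C = rng.standard_normal((2, 2))
D = rng.standard_normal((3, 3))

transpose_ok = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
inverse_ok = np.allclose(np.linalg.inv(np.kron(A, B)),
                         np.kron(np.linalg.inv(A), np.linalg.inv(B)))
mixed_ok = np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# The block-diagonal regressor matrix: m = 3 copies of a 5 x 2 matrix Z
Z = rng.standard_normal((5, 2))
X = np.kron(np.eye(3), Z)

print(transpose_ok, inverse_ok, mixed_ok, X.shape)
```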
We want the efficient estimator for this stacked system. Work generically first: \(y = X\beta + u\) with \(\mathrm{Var}(u) = \Omega\) positive-definite (later we take \(\Omega = \Sigma_u \otimes I_T\)).
OLS minimizes the unweighted sum of squared residuals:
\[ S_{\mathrm{OLS}}(\beta) = (y - X\beta)'(y - X\beta). \]
Equal weighting is optimal when \(\Omega = \sigma^2 I\), but suboptimal when \(\Omega\) is general:
For general \(\Omega\), minimize a weighted sum of squared residuals instead:
\[ S_{\mathrm{GLS}}(\beta) = (y - X\beta)' \Omega^{-1} (y - X\beta). \]
First-order condition (derivative in \(\beta\) set to zero):
\[ -2\, X' \Omega^{-1} (y - X\beta) = 0. \]
Solving the normal equations yields the GLS estimator:
\[ \widehat\beta_{\mathrm{GLS}} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y. \]
The \(\Omega^{-1}\) weighting isn’t arbitrary — it comes from transforming the model so errors become i.i.d.
\(\Omega\) is symmetric positive-definite, so it has a symmetric square root \(\Omega^{1/2}\) with \(\Omega^{1/2}\Omega^{1/2} = \Omega\). Premultiply by \(\Omega^{-1/2}\):
\[ \Omega^{-1/2} y = \Omega^{-1/2} X \beta + \Omega^{-1/2} u, \qquad \mathrm{Var}(\Omega^{-1/2} u) = I. \]
Applying OLS to the transformed system gives the same formula, \((X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y\) — pre-whitening and weighted LS are the same estimator.
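The equivalence between pre-whitening and the weighted formula can be confirmed on random data. In this sketch \(\Omega\) is an arbitrary positive-definite matrix built for illustration, and \(\Omega^{-1/2}\) comes from its eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 40, 3
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# An arbitrary symmetric positive-definite Omega (illustrative)
M = rng.standard_normal((n, n))
Omega = M @ M.T + n * np.eye(n)

# Route 1: the weighted least-squares formula
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Route 2: symmetric inverse square root from Omega = V diag(w) V'
w, V = np.linalg.eigh(Omega)
Omega_inv_half = V @ np.diag(w ** -0.5) @ V.T

# OLS on the transformed (pre-whitened) data
beta_white, *_ = np.linalg.lstsq(Omega_inv_half @ X, Omega_inv_half @ y, rcond=None)

print(beta_gls)
print(beta_white)
```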
Start with \(X = I_m \otimes Z\) and \(\Omega = \Sigma_u \otimes I_T\).
Transpose and inverse: \[ X' = I_m \otimes Z', \qquad \Omega^{-1} = \Sigma_u^{-1} \otimes I_T. \]
Mixed-product gives \[ X'\Omega^{-1} = (I_m \otimes Z')(\Sigma_u^{-1} \otimes I_T) = \Sigma_u^{-1} \otimes Z', \]
\[ X'\Omega^{-1}X = (\Sigma_u^{-1} \otimes Z')(I_m \otimes Z) = \Sigma_u^{-1} \otimes (Z'Z). \]
Inverse property: \[ (X'\Omega^{-1}X)^{-1} = \Sigma_u \otimes (Z'Z)^{-1}. \]
Combining the pieces in the GLS formula:
\[ \widehat\beta_{\mathrm{GLS}} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y = \big(\Sigma_u \otimes (Z'Z)^{-1}\big)\big(\Sigma_u^{-1} \otimes Z'\big) y = \big(I_m \otimes (Z'Z)^{-1}Z'\big) y. \]
The block-diagonal form means \((Z'Z)^{-1}Z'\) is applied equation by equation — which is equation-by-equation OLS. So
\[ \widehat\beta_{\mathrm{GLS}} = \widehat\beta_{\mathrm{OLS}}. \]
Implication: with identical regressors across equations, SUR gains nothing — equation-by-equation OLS is efficient for reduced-form VARs.
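The result can be checked by brute force: build the stacked system explicitly and compare GLS with equation-by-equation OLS. The dimensions and \(\Sigma_u\) below are assumptions chosen to keep the example small; `k` plays the role of \(mp\).

```python
import numpy as np

rng = np.random.default_rng(3)
T, m, k = 50, 2, 3                      # k = regressors per equation
Z = rng.standard_normal((T, k))
Sigma_u = np.array([[1.0, 0.8],
                    [0.8, 2.0]])        # correlated errors across equations

# Simulate Y (T x m); rows of U have covariance Sigma_u
beta_true = rng.standard_normal((k, m))
U = rng.standard_normal((T, m)) @ np.linalg.cholesky(Sigma_u).T
Y = Z @ beta_true + U

# Equation-by-equation OLS: column i holds beta_i
beta_ols = np.linalg.solve(Z.T @ Z, Z.T @ Y)

# GLS on the stacked system y = (I_m kron Z) beta + u
y = Y.flatten(order="F")                # stacks (y_1', ..., y_m')'
X = np.kron(np.eye(m), Z)
Omega_inv = np.kron(np.linalg.inv(Sigma_u), np.eye(T))
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

print(np.max(np.abs(beta_gls - beta_ols.flatten(order="F"))))
```

The printed difference should be at the level of floating-point rounding, regardless of how strongly the errors are correlated.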
The reduced-form VAR(\(p\)) \(x_t = \nu + A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t\) contains
\[ m^2 p + m \]
parameters (\(m\) = number of variables).
Each equation in a VAR(\(p\)) involves \(mp\) slope coefficients plus an intercept.
Empirical practice typically requires \[ T \gtrsim 5\text{–}10 \times mp \] for reliable estimation.
As a result, standard VARs are usually restricted to small systems in macroeconomic applications.
Because dimensionality grows with \(p\), lag length must be chosen carefully. Standard criteria trade off fit against parameterization:
In typical macroeconomic systems:
In practice, report sensitivity: estimate the VAR under more than one criterion and check whether conclusions depend on \(p\).
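A minimal sketch of such a sensitivity check, assuming the two criteria are AIC and BIC in their common log-determinant form (\(\log|\widehat\Sigma_p| + \text{penalty}\cdot k/n\) with \(k = m^2p + m\)); the criteria intended in the lecture may differ. The helper names and the simulated VAR(2) are illustrative.

```python
import numpy as np

def fit_var(x, p, start):
    """OLS fit of a VAR(p) with intercept on observations t = start..T-1.
    Returns the ML residual covariance and the number of mean parameters."""
    T, m = x.shape
    Y = x[start:]
    lags = [x[start - j:T - j] for j in range(1, p + 1)]
    Z = np.hstack([np.ones((T - start, 1))] + lags)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    return resid.T @ resid / (T - start), m * m * p + m

def information_criteria(x, p_max):
    """AIC and BIC for p = 1..p_max on a common estimation sample."""
    n = x.shape[0] - p_max
    out = {}
    for p in range(1, p_max + 1):
        sigma, k = fit_var(x, p, start=p_max)
        logdet = np.linalg.slogdet(sigma)[1]
        out[p] = {"aic": logdet + 2 * k / n, "bic": logdet + k * np.log(n) / n}
    return out

# Simulate an assumed stable VAR(2)
rng = np.random.default_rng(4)
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
A2 = np.array([[0.3, 0.0], [0.0, 0.3]])
T = 500
x = np.zeros((T, 2))
for t in range(2, T):
    x[t] = A1 @ x[t - 1] + A2 @ x[t - 2] + rng.standard_normal(2)

ic = information_criteria(x, p_max=4)
p_aic = min(ic, key=lambda p: ic[p]["aic"])
p_bic = min(ic, key=lambda p: ic[p]["bic"])
print("AIC picks p =", p_aic, "| BIC picks p =", p_bic)
```

Holding the estimation sample fixed across \(p\) keeps the criteria comparable; if the two choices disagree, downstream results can be reported for both.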
Forecasts are computed recursively from the VAR coefficients:
\[ \widehat{x}_{t+h|t} = A_1 \widehat{x}_{t+h-1|t} + \cdots + A_p \widehat{x}_{t+h-p|t}, \qquad h \ge 1, \]
with \(\widehat{x}_{s|t} = x_s\) for \(s \le t\).
For a VAR(1) with \(\mathrm{Var}(u_t) = \Sigma_u\), the forecast error variance (FEV) \(\Omega_h = \mathrm{Var}(x_{t+h} - \widehat{x}_{t+h|t})\) satisfies
\[ \Omega_1 = \Sigma_u, \qquad \Omega_h = A_1\,\Omega_{h-1}A_1' + \Sigma_u. \]
Under stability, \(\Omega_h\) grows with \(h\) and converges to the unconditional variance of \(x_t\).
Under (asymptotic) Gaussianity, and ignoring parameter estimation uncertainty, a \(100(1-\alpha)\%\) forecast interval for \(x_{i,t+h}\) is
\[ \widehat{x}_{i,t+h|t} \pm z_{1-\alpha/2} \sqrt{[\Omega_h]_{ii}}. \]
We simulate a stable bivariate VAR(1) process and illustrate multi-step forecasts and forecast confidence intervals.
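One possible version of this exercise is sketched below. The coefficient values are assumptions, and for brevity the forecasts use the true \(A_1\) and \(\Sigma_u\) rather than estimates, so the intervals ignore estimation uncertainty.

```python
import numpy as np

rng = np.random.default_rng(5)
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])            # eigenvalues 0.6 and 0.3 (stable)
Sigma_u = np.array([[1.0, 0.3],
                    [0.3, 0.5]])
T, H = 300, 12

# Simulate the VAR(1)
chol = np.linalg.cholesky(Sigma_u)
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = A1 @ x[t - 1] + chol @ rng.standard_normal(2)

# Recursive point forecasts and forecast error variances
forecasts = np.zeros((H, 2))
fev = np.zeros((H, 2, 2))
prev_x, prev_fev = x[-1], np.zeros((2, 2))
for h in range(H):
    prev_x = A1 @ prev_x                              # x_hat_{t+h|t}
    prev_fev = A1 @ prev_fev @ A1.T + Sigma_u         # Omega_h
    forecasts[h], fev[h] = prev_x, prev_fev

z = 1.96                                              # approx. 97.5% normal quantile
half_width = z * np.sqrt(np.diagonal(fev, axis1=1, axis2=2))
lower, upper = forecasts - half_width, forecasts + half_width

# Unconditional variance G solves G = A1 G A1' + Sigma_u:
# vec(G) = (I - A1 kron A1)^{-1} vec(Sigma_u)
G = np.linalg.solve(np.eye(4) - np.kron(A1, A1),
                    Sigma_u.flatten(order="F")).reshape(2, 2, order="F")

print("interval half-widths by horizon:\n", half_width)
print("Omega_H:\n", fev[-1], "\nunconditional variance:\n", G)
```

The half-widths widen with the horizon and level off as \(\Omega_h\) approaches the unconditional variance, while the point forecasts decay toward the (zero) mean.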
When \(x_t\) is \(I(1)\), differencing restores stationarity — but it removes long-run relationships between variables.
A VAR in differences cannot capture equilibrium comovement (Sims, Stock, and Watson, 1990).
We need a framework that preserves both short-run dynamics and long-run equilibrium: cointegration and the VECM.
Start from a VAR(\(p\)) in levels: \[ x_t = A_1 x_{t-1} + \cdots + A_p x_{t-p} + u_t. \]
Subtract \(x_{t-1}\) from both sides: \[ \Delta x_t = \Big(A_1 + \cdots + A_p - I\Big)x_{t-1} + \sum_{i=1}^{p-1} \Big(-A_{i+1}-\cdots-A_p\Big)\Delta x_{t-i} + u_t. \]
Define \[ \Pi := \sum_{i=1}^p A_i - I, \qquad \Gamma_i := -\sum_{j=i+1}^p A_j. \]
Then the VAR can always be written as \[ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta x_{t-i} + u_t. \]
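The reparameterization is purely algebraic, which a short check can confirm. The helper `var_to_vecm` and the VAR(3) coefficients below are illustrative assumptions; the test compares \(\Delta x_t\) computed from the levels form and from the error-correction form with \(u_t = 0\).

```python
import numpy as np

def var_to_vecm(A_list):
    """Map VAR(p) lag matrices A_1..A_p to (Pi, [Gamma_1, ..., Gamma_{p-1}])."""
    m = A_list[0].shape[0]
    Pi = sum(A_list) - np.eye(m)
    # Gamma_i = -(A_{i+1} + ... + A_p); A_list[i:] holds A_{i+1}..A_p
    Gammas = [-sum(A_list[i:]) for i in range(1, len(A_list))]
    return Pi, Gammas

# Illustrative (assumed) VAR(3) coefficients
A = [np.array([[0.6, 0.1], [0.0, 0.5]]),
     np.array([[0.2, 0.0], [0.1, 0.2]]),
     np.array([[0.1, 0.0], [0.0, 0.1]])]
Pi, Gammas = var_to_vecm(A)

rng = np.random.default_rng(6)
x_tm3, x_tm2, x_tm1 = rng.standard_normal((3, 2))   # arbitrary history

# Levels form (u_t = 0), then difference
dx_var = A[0] @ x_tm1 + A[1] @ x_tm2 + A[2] @ x_tm3 - x_tm1
# Error-correction form
dx_vecm = Pi @ x_tm1 + Gammas[0] @ (x_tm1 - x_tm2) + Gammas[1] @ (x_tm2 - x_tm3)

print(dx_var, dx_vecm)
```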
Suppose \(x_t\) is \(I(1)\) and cointegrated with rank \(r<m\).
That is, there exists \(\beta \in \mathbb{R}^{m\times r}\) such that \[ \beta' x_t \ \text{is stationary}. \]
In the VECM form, \[ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta x_{t-i} + u_t, \] the left-hand side is stationary, as are \(\Delta x_{t-i}\) and \(u_t\). Hence \(\Pi x_{t-1}\) must be stationary.
Since \(x_{t-1}\) is nonstationary, \(\Pi x_{t-1}\) can be stationary only if it is a linear combination of \(\beta' x_{t-1}\): \[ \mathrm{rank}(\Pi)=r<m \quad\Rightarrow\quad \Pi=\alpha\beta'. \]
Substituting \(\Pi=\alpha\beta'\) gives the VECM:
\[ \Delta x_t = \alpha \beta' x_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta x_{t-i} + u_t. \]
Let \[ x_t = \begin{bmatrix} y_t \\ z_t \end{bmatrix}, \qquad y_t,\; z_t \text{ are } I(1). \]
Suppose there exists a scalar \(\theta\) such that \[ \beta' x_t = y_t - \theta z_t \quad \text{is stationary}, \qquad \beta'=(1,\,-\theta). \]
Then \(y_t\) and \(z_t\) are cointegrated (rank \(r=1\)).
Define the equilibrium error \(e_{t-1} \equiv y_{t-1} - \theta z_{t-1}\). The VECM \[ \Delta x_t = \alpha e_{t-1} + \sum_{i=1}^{p-1}\Gamma_i \Delta x_{t-i} + u_t, \qquad \alpha \in \mathbb{R}^{2\times 1}, \]
implies two error-correction equations: \[ \begin{aligned} \Delta y_t &= \alpha_1 e_{t-1} + \text{short-run dynamics} + u_{1t}, \\ \Delta z_t &= \alpha_2 e_{t-1} + \text{short-run dynamics} + u_{2t}. \end{aligned} \]
Consider the VECM \[ \Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{p-1}\Gamma_i \Delta x_{t-i} + u_t, \qquad \Pi=\alpha\beta', \qquad \mathrm{rank}(\Pi)=r. \]
What needs to be estimated:
Therefore, the VECM cannot be estimated by unrestricted OLS: an unrestricted estimate of \(\Pi\) will generally have full rank. The rank restriction must be imposed directly in the estimation procedure.
Start from the VECM: \[ \Delta x_t=\Pi x_{t-1}+\sum_{i=1}^{p-1}\Gamma_i \Delta x_{t-i}+u_t. \]
Idea: partial out short-run dynamics so the remaining relation is purely long-run.
Regress \(\Delta x_t\) and \(x_{t-1}\) on \(\{\Delta x_{t-1},\dots,\Delta x_{t-p+1}\}\) and keep residuals: \[ \begin{aligned} R_{0t} &= \Delta x_t - \mathbb{E}[\Delta x_t \mid \Delta x_{t-1},\dots,\Delta x_{t-p+1}],\\ R_{1t} &= x_{t-1} - \mathbb{E}[x_{t-1} \mid \Delta x_{t-1},\dots,\Delta x_{t-p+1}]. \end{aligned} \]
The model becomes the long-run regression \[ R_{0t}=\Pi R_{1t}+u_t. \]
This follows from the Frisch–Waugh–Lovell Theorem.
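A sketch of the partialling-out step for \(p = 2\), using sample least-squares projections in place of the conditional expectations. The data are two simulated random walks, chosen only to have something to regress; the check is that the residual regression reproduces the coefficient on \(x_{t-1}\) from the full regression.

```python
import numpy as np

rng = np.random.default_rng(7)
T, m = 400, 2
x = np.cumsum(rng.standard_normal((T, m)), axis=0)   # illustrative I(1) data
dx = np.diff(x, axis=0)                              # dx[k] = x[k+1] - x[k]

# Align rows for p = 2 (one lagged difference)
dX = dx[1:]      # Delta x_t
lev = x[1:-1]    # x_{t-1}
dL = dx[:-1]     # Delta x_{t-1}

def residuals(Y, W):
    """Residuals from the least-squares projection of Y on W."""
    coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
    return Y - W @ coef

R0 = residuals(dX, dL)
R1 = residuals(lev, dL)

# Long-run regression on residuals: R0 = R1 Pi' + error
Pi_fwl = np.linalg.lstsq(R1, R0, rcond=None)[0].T

# Full regression of Delta x_t on [x_{t-1}, Delta x_{t-1}]
full = np.linalg.lstsq(np.hstack([lev, dL]), dX, rcond=None)[0]
Pi_full = full[:m].T

print(Pi_fwl)
print(Pi_full)
```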
Assume the VECM errors are Gaussian: \[ u_t \sim N(0,\Sigma_u), \quad t=1,\dots,T. \]
By FWL, the partialled-out long-run regression has the same error: \[ R_{0t} = \Pi R_{1t} + u_t, \qquad u_t \sim N(0,\Sigma_u). \]
The (conditional) log-likelihood is
\[ \ell(\Pi,\Sigma_u) = -\frac{T}{2}\log|\Sigma_u| -\frac{1}{2}\sum_{t=1}^T (R_{0t}-\Pi R_{1t})'\Sigma_u^{-1}(R_{0t}-\Pi R_{1t}) + \text{const}. \]
Maximizing the likelihood is equivalent to minimizing the weighted sum of squared residuals.
For fixed \(\Pi\), maximizing the Gaussian log-likelihood over \(\Sigma_u\) gives \[ \widehat\Sigma_u(\Pi) = \tfrac{1}{T}\sum_{t=1}^T (R_{0t} - \Pi R_{1t})(R_{0t} - \Pi R_{1t})'. \]
Substituting back, the trace term collapses to a constant, so \[ \ell(\Pi) = -\tfrac{T}{2}\log\bigl|\widehat\Sigma_u(\Pi)\bigr| + \text{const}, \] and the MLE problem becomes \[ \min_{\Pi:\,\mathrm{rank}(\Pi)=r}\;\bigl|\widehat\Sigma_u(\Pi)\bigr|. \]
Expand the residual covariance: \[ \widehat\Sigma_u(\Pi) = S_{00} - \Pi S_{10} - S_{01}\Pi' + \Pi S_{11}\Pi', \qquad S_{ij} = \tfrac{1}{T}\textstyle\sum_t R_{it}R_{jt}'. \]
Substituting \(\Pi = \alpha\beta'\) and completing the square in \(\alpha\), \[ \widehat\alpha(\beta) = S_{01}\beta(\beta'S_{11}\beta)^{-1}. \]
Plugging \(\widehat\alpha\) back in leaves a concentrated objective in \(\beta\) alone: \[ \widehat\Sigma_u(\beta) = S_{00} - S_{01}\beta(\beta'S_{11}\beta)^{-1}\beta'S_{10}. \]
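Two determinant identities, \(|C-D|=|C|\,|I-C^{-1}D|\) and Sylvester’s \(|I+UV|=|I+VU|\), plus the normalization \(\beta'S_{11}\beta = I_r\), reduce the problem to \[ \min_{\beta}\;\bigl|I_r - \beta' A \beta\bigr| \quad\text{s.t.}\quad \beta'S_{11}\beta = I_r, \qquad A := S_{10}S_{00}^{-1}S_{01}. \] The minimizer also solves \(\max_{\beta}|\beta' A \beta|\) under the same constraint, so the problem amounts to making \(\beta'A\beta\) as large as possible.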
The FOC (Lagrangian + Jacobi’s formula, then diagonalize) is the generalized eigenvalue problem \[ S_{10}S_{00}^{-1}S_{01}\,v = \lambda\,S_{11}\,v. \] \(\widehat\beta_r\) = the top-\(r\) eigenvectors.
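The eigenproblem is short to code. The sketch below simulates an assumed bivariate VECM with \(p = 1\) (so there are no short-run terms to partial out), solves the generalized eigenproblem through the equivalent standard problem for \(S_{11}^{-1}S_{10}S_{00}^{-1}S_{01}\), and recovers \(\widehat\beta\), \(\widehat\alpha\), and \(\widehat\Pi\). The true values \(\alpha = (-0.5, 0)'\) and \(\beta = (1,-1)'\) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
T = 2000
alpha_true = np.array([[-0.5], [0.0]])
beta_true = np.array([[1.0], [-1.0]])
Pi_true = alpha_true @ beta_true.T

x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = x[t - 1] + Pi_true @ x[t - 1] + rng.standard_normal(2)

R0 = np.diff(x, axis=0)        # Delta x_t
R1 = x[:-1]                    # x_{t-1}
n = R0.shape[0]
S00, S01, S11 = R0.T @ R0 / n, R0.T @ R1 / n, R1.T @ R1 / n
S10 = S01.T

# S10 S00^{-1} S01 v = lambda S11 v  <=>  eigenproblem of S11^{-1} S10 S00^{-1} S01
M = np.linalg.solve(S11, S10 @ np.linalg.solve(S00, S01))
eigvals, eigvecs = np.linalg.eig(M)
order = np.argsort(eigvals.real)[::-1]
eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

r = 1
beta_hat = eigvecs[:, :r]
# Rescale columns so that beta' S11 beta = I_r
beta_hat = beta_hat / np.sqrt(np.diag(beta_hat.T @ S11 @ beta_hat))
alpha_hat = S01 @ beta_hat
Pi_hat = alpha_hat @ beta_hat.T

beta_normalized = beta_hat[:, 0] / beta_hat[0, 0]   # first entry set to 1
print("eigenvalues:", eigvals)
print("beta (normalized):", beta_normalized)
print("Pi_hat:\n", Pi_hat)
```

The sign and scale of the raw eigenvector are arbitrary; only \(\widehat\Pi = \widehat\alpha\widehat\beta'\) and the normalized \(\widehat\beta\) are comparable with the true values.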
Remark. The eigenvalues \(\widehat\lambda_i\) are the squared canonical correlations between \(R_{0t}\) and \(R_{1t}\) — Johansen is canonical correlation analysis (CCA) between residualized differences and levels.
Why: CCA’s eigenproblem \(\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\,b = \rho^2\,b\) becomes \(S_{11}^{-1}S_{10}S_{00}^{-1}S_{01}\,v = \lambda\,v\) once \(X = R_{0t},\ Y = R_{1t}\) — identical to Johansen, so \(\widehat\lambda_i = \widehat\rho_i^{\,2}\).
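The link to canonical correlations holds for any pair of data blocks, which makes it easy to check in isolation. The sketch below uses generic simulated \(X\) and \(Y\) (not VECM residuals) and computes the canonical correlations two ways: from the eigenproblem above and from the singular values of the whitened cross-covariance.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 500
X = rng.standard_normal((n, 2))
Y = X @ np.array([[0.8, 0.0], [0.3, 0.1]]) + rng.standard_normal((n, 2))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

# Route 1: eigenvalues of Syy^{-1} Syx Sxx^{-1} Sxy
M = np.linalg.solve(Syy, Sxy.T @ np.linalg.solve(Sxx, Sxy))
lam = np.sort(np.linalg.eigvals(M).real)[::-1]

# Route 2: canonical correlations = singular values of Lx^{-1} Sxy Ly^{-T},
# where Sxx = Lx Lx' and Syy = Ly Ly' (Cholesky factors)
Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
rho = np.linalg.svd(K, compute_uv=False)

print("eigenvalues:          ", lam)
print("squared canonical corr:", rho ** 2)
```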
The remark just said \(\widehat\lambda_i = \widehat\rho_i^{\,2}\) — eigenvalues are squared canonical correlations. Make this operational.
Canonical direction vectors. Pick \(a,\,b \in \mathbb{R}^m\) to form scalar series \[ a' R_{0t}\ \text{(combined residual differences)}, \qquad b' R_{1t}\ \text{(combined residual levels)}. \]
Variational identity. \[ \widehat\lambda_i \;=\; \max_{a,\,b}\;\widehat{\mathrm{corr}}^{\,2}\!\bigl(a' R_{0t},\, b' R_{1t}\bigr) \;=\; \max_{a,\,b}\;\frac{(a'S_{01}\,b)^2}{(a'S_{00}\,a)\,(b'S_{11}\,b)}. \] For \(i>1\), the max is taken orthogonal to previous pairs (CCA deflation).
Fix a direction \(b\) with \(b'R_{1t}\sim I(1)\). (\(R_{0t}\sim I(0)\) always.) What does the squared sample correlation along \((a, b)\) converge to? \[ \frac{(a'S_{01}\,b)^2}{(a'S_{00}\,a)\,(b'S_{11}\,b)}, \qquad S_{ij} = \tfrac{1}{T}\!\sum_{t} R_{it} R_{jt}'. \]
Three asymptotic rates.
Ratio collapses: \[ \frac{O_p(1)}{O_p(1)\cdot O_p(T)} \;=\; O_p(T^{-1}) \;\xrightarrow{p}\; 0. \] Any \(I(1)\)-direction washes out. So \(\widehat\lambda_i = \max_{a,b}(\cdot)\) can stay bounded above 0 only if some \(b\) gives \(b'R_{1t}\sim I(0)\).
A non-vanishing \(\widehat\lambda_i\) requires a \(\beta_i\) such that \(\beta_i'x_{t-1}\) is stationary. Then \(\beta_i' S_{11}\,\beta_i = O_p(1)\) rather than \(O_p(T)\), and \(\widehat\lambda_i\) stays bounded away from 0.
\[ \widehat\lambda_i \;\not\xrightarrow{p}\; 0 \;\;\Longleftrightarrow\;\; \beta_i'x_{t-1}\sim I(0) \;\;\Longleftrightarrow\;\; \beta_i \text{ cointegrates } x_t. \]
Each non-zero \(\lambda_i\) of \(M := \Sigma_{11}^{-1}\Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01}\) flags a cointegrating direction. To get a count, chain three facts:
1. # non-zero \(\lambda_i = \mathrm{rank}(M)\) — by definition.
2. \(\mathrm{rank}(M) = \mathrm{rank}(\Pi)\) — both reduce to \(\mathrm{rank}(\Sigma_{01})\), since \(\Pi = \Sigma_{01}\Sigma_{11}^{-1}\) and the middle factor \(\Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01}\) in \(M\) is a Gram matrix.
3. \(\mathrm{rank}(\Pi)\) = # cointegrating relations — from the VECM setup.
\(\Longrightarrow\) # non-zero \(\widehat\lambda_i\) = # cointegrating relations.
A single \(\widehat\lambda_i\) describes one direction. Their collective pattern encodes the cointegrating rank \(r\):
\(r = 0\) (no cointegration; \(\Pi = 0\), \(x_t\) pure \(I(1)\)). All \(\widehat\lambda_i = O_p(T^{-1}) \to 0\). Every direction has the variance mismatch.
\(0 < r < m\) (\(r\) cointegrating relations). Exactly \(r\) eigenvalues stay \(O_p(1)\); the remaining \(m-r\) vanish.
\(r = m\) (\(x_t\) already stationary). All \(\widehat\lambda_i\) are \(O_p(1)\).
\(\Longrightarrow\) # of non-vanishing eigenvalues = cointegrating rank.
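The three cases can be reproduced in a small simulation. This sketch assumes \(m = 2\), \(p = 1\), and illustrative \(\Pi\) matrices of rank 0, 1, and 2; the helper names are hypothetical.

```python
import numpy as np

def johansen_eigenvalues(x):
    """Eigenvalues of S11^{-1} S10 S00^{-1} S01 for a VECM with p = 1."""
    R0, R1 = np.diff(x, axis=0), x[:-1]
    n = R0.shape[0]
    S00, S01, S11 = R0.T @ R0 / n, R0.T @ R1 / n, R1.T @ R1 / n
    M = np.linalg.solve(S11, S01.T @ np.linalg.solve(S00, S01))
    return np.sort(np.linalg.eigvals(M).real)[::-1]

def simulate(Pi, T, rng):
    """Simulate Delta x_t = Pi x_{t-1} + u_t with standard normal u_t."""
    m = Pi.shape[0]
    x = np.zeros((T, m))
    for t in range(1, T):
        x[t] = x[t - 1] + Pi @ x[t - 1] + rng.standard_normal(m)
    return x

rng = np.random.default_rng(10)
T = 2000
Pi_r0 = np.zeros((2, 2))                          # r = 0: two random walks
Pi_r1 = np.array([[-0.5, 0.5], [0.0, 0.0]])       # r = 1: y - z is stationary
Pi_r2 = -0.5 * np.eye(2)                          # r = 2: x_t is stationary

lam_r0 = johansen_eigenvalues(simulate(Pi_r0, T, rng))
lam_r1 = johansen_eigenvalues(simulate(Pi_r1, T, rng))
lam_r2 = johansen_eigenvalues(simulate(Pi_r2, T, rng))

print("r = 0:", lam_r0)
print("r = 1:", lam_r1)
print("r = 2:", lam_r2)
```

In each case the number of eigenvalues that are visibly away from zero should match the rank of the \(\Pi\) used to generate the data.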
At finite \(T\), sampling noise keeps every \(\widehat\lambda_i > 0\) — the rank tests formalize “how large is large enough.”
In population, the rank dictionary cleanly separates non-vanishing from vanishing eigenvalues. In a finite sample every \(\widehat\lambda_i\) is strictly positive, so deciding which ones count as zero requires a hypothesis test.
Hypothesis. For each candidate \(r\), \[ H_0:\mathrm{rank}(\Pi) \le r \quad\text{vs.}\quad H_1:\mathrm{rank}(\Pi) > r. \] Under \(H_0\): \(\widehat\lambda_{r+1}, \ldots, \widehat\lambda_m\) should all vanish. Under \(H_1\): at least \(\widehat\lambda_{r+1}\) remains \(O_p(1)\).
Direct route. Reject if \(\widehat\lambda_{r+1}\) is “too big.” Two gaps:
LR framework. Standard hypothesis-testing machinery — provides a null distribution and a principled way to aggregate across eigenvalues.
For nested models with maximized likelihoods \(L_0 \le L_1\) (restricted vs. unrestricted), \[ \mathrm{LR} \;=\; -2\log\frac{L_0}{L_1} \;=\; -2(\ell_0 - \ell_1) \;\ge\; 0. \]
Large \(\mathrm{LR}\) → restricted model fits much worse → reject \(H_0\).
Wilks’ theorem (regular problems): under \(H_0\), \[ \mathrm{LR} \;\xrightarrow{d}\; \chi^2_q, \qquad q = \#\,\text{restrictions}. \]
Johansen’s case is nonstandard. Because \(x_t\) is \(I(1)\), the limit is a functional of Brownian motion, not \(\chi^2\). Critical values are tabulated (Osterwald-Lenum 1992; MacKinnon–Haug–Michelis 1999).
For each \(r\), test \(H_0:\mathrm{rank}(\Pi)\le r\) vs. \(H_1:\mathrm{rank}(\Pi) > r\). Substituting the profile log-likelihood \(\ell(\Pi) = -\tfrac{T}{2}\log|\widehat\Sigma_u(\Pi)| + \text{const}\), \[ \mathrm{LR}(r) \;=\; -2(\ell_r - \ell_m) \;=\; -T\log\frac{|\widehat\Sigma_m|}{|\widehat\Sigma_r|}, \] where \(\widehat\Sigma_m, \widehat\Sigma_r\) are the residual covariances under the unrestricted and rank-\(r\) models.
The determinant ratio reduces to a product over the Johansen eigenvalues: \[ \frac{|\widehat\Sigma_m|}{|\widehat\Sigma_r|} \;=\; \prod_{i=r+1}^m (1 - \widehat\lambda_i), \qquad \mathrm{LR}(r) \;=\; -T\sum_{i=r+1}^m \log(1 - \widehat\lambda_i). \]
Two tests are built from this: the trace test and the maximum eigenvalue test.
The trace test compares the rank-\(r\) model to the unrestricted model: \[ \mathrm{LR}_{\text{trace}}(r) = -T\sum_{i=r+1}^m \log(1-\hat\lambda_i). \]
This tests \[ H_0:\ \mathrm{rank}(\Pi)\le r \quad \text{vs.} \quad H_1:\ \mathrm{rank}(\Pi)>r. \]
The null requires the population eigenvalues \(\lambda_{r+1},\dots,\lambda_m\) to be zero, so the estimates \(\hat\lambda_{r+1},\dots,\hat\lambda_m\) should all be small.
The maximum eigenvalue test compares the rank-\(r\) model to the rank-\((r+1)\) model: \[ \mathrm{LR}_{\max}(r,r+1) = -T\log(1-\hat\lambda_{r+1}). \]
This tests \[ H_0:\ \mathrm{rank}(\Pi)=r \quad \text{vs.} \quad H_1:\ \mathrm{rank}(\Pi)=r+1. \]
Sequential use. Starting from \(r=0\): test, and if rejected, increment \(r\) and test again. The selected rank is the first \(r\) for which \(H_0\) is not rejected.
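Given the eigenvalues sorted in decreasing order, both statistics and the sequential rule take only a few lines. In this sketch the function names, eigenvalues, and sample size are hypothetical; critical values are deliberately left as an input, since they must be taken from the published tables for the chosen deterministic specification.

```python
import numpy as np

def trace_stat(eigvals, T, r):
    """-T * sum_{i=r+1}^m log(1 - lambda_i); eigvals sorted in decreasing order."""
    return -T * np.sum(np.log(1.0 - np.asarray(eigvals)[r:]))

def max_eig_stat(eigvals, T, r):
    """-T * log(1 - lambda_{r+1})."""
    return -T * np.log(1.0 - np.asarray(eigvals)[r])

def select_rank(stats, critical_values):
    """Sequential rule: the first r whose statistic does not exceed its
    critical value; returns m if every null is rejected."""
    for r, (s, c) in enumerate(zip(stats, critical_values)):
        if s <= c:
            return r
    return len(stats)

# Hypothetical eigenvalues and sample size, for illustration only
lam, T = [0.40, 0.01], 200
trace_stats = [trace_stat(lam, T, r) for r in range(len(lam))]
max_stats = [max_eig_stat(lam, T, r) for r in range(len(lam))]
print("trace:  ", trace_stats)
print("max-eig:", max_stats)
```

The trace statistic at \(r\) is the sum of the maximum-eigenvalue statistics from \(r\) onward, so the two coincide at \(r = m-1\).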
For small \(\widehat\lambda_i\) (the relevant range under the null of no further cointegration), Taylor expansion gives \[ -\log(1-\widehat\lambda_i) \;=\; \widehat\lambda_i + \tfrac{1}{2}\widehat\lambda_i^2 + O(\widehat\lambda_i^3). \]
Both Johansen statistics reduce to magnitude-based forms:
The “log” was never a separate strategy — it is the likelihood-correct version of “look at the eigenvalues.” The intuition built up earlier is what the test ultimately uses.
Why they differ.
So what.
What to do.
Engle–Granger and Johansen both estimate a \(\beta\) that is identified only up to rotation, so a normalization is needed. The difference is when the normalization enters.
Engle–Granger (single-equation)
Johansen (system VECM)
Both require lag length \(p\) and deterministic terms (constant/trend) to be specified up front.
With \(r\) fixed by the rank tests, both VECM components come from the eigenproblem: \[ \widehat\beta_r = \text{top-}r\ \text{eigenvectors (normalized: } \widehat\beta_r'\,S_{11}\,\widehat\beta_r = I_r\text{)}, \qquad \widehat\alpha_r = S_{01}\,\widehat\beta_r. \]
The estimated VECM \[ \Delta x_t = \widehat\alpha\,\widehat\beta'\,x_{t-1} + \cdots + u_t \] has two structural objects to interrogate:
Restrictions on either are testable hypotheses with economic content. We turn to \(\alpha\) next — its restrictions read as statements about the error-correction role of each variable (e.g., “is variable \(i\) weakly exogenous to the long-run equilibrium?”).
In the VECM \[ \Delta x_t = \alpha\beta' x_{t-1} + \cdots + u_t, \] \(\alpha \in \mathbb{R}^{m\times r}\) collects adjustment coefficients: row \(i\) describes how \(\Delta x_{i,t}\) responds to the \(r\) equilibrium errors.
Linear hypotheses on \(\alpha\) take the form \[ H_0:\ R\,\mathrm{vec}(\alpha) = q, \qquad R \in \mathbb{R}^{k \times mr},\ q \in \mathbb{R}^{k}, \] where \(\mathrm{vec}(\alpha) \in \mathbb{R}^{mr}\) stacks the columns of \(\alpha\), and \(k\) = # independent restrictions = degrees of freedom of the LR/Wald \(\chi^2\) test.
Conditional on the cointegration rank \(r\), \(\alpha\) is the coefficient on the stationary regressor \(\widehat\beta_r' x_{t-1}\), so standard regression / MLE asymptotics apply.
No adjustment for variable \(i\) (weak exogeneity for \(\beta\)): \[ H_0:\ \alpha_{i\cdot}=0 \quad (r\ \text{restrictions}) \]
Only relation 1 adjusts variable \(i\): \[ H_0:\ \alpha_{i2}=\cdots=\alpha_{ir}=0 \]
Same adjustment for variables \(i\) and \(j\): \[ H_0:\ \alpha_{i\cdot}=\alpha_{j\cdot} \]
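Writing these hypotheses as \(R\,\mathrm{vec}(\alpha) = q\) mostly comes down to indexing: with column stacking, \(\alpha_{ij}\) sits at position \((j-1)m + i\) of \(\mathrm{vec}(\alpha)\). The sketch below builds \(R\) for the first and third hypotheses; the helper names are hypothetical and indices are 0-based, as usual in Python.

```python
import numpy as np

def vec(a):
    """Stack the columns of a matrix into one vector."""
    return a.flatten(order="F")

def weak_exogeneity_R(m, r, i):
    """R such that R @ vec(alpha) = alpha[i, :]; H0: R vec(alpha) = 0 (r restrictions)."""
    R = np.zeros((r, m * r))
    for j in range(r):
        R[j, j * m + i] = 1.0      # alpha[i, j] is entry j*m + i of vec(alpha)
    return R

def equal_adjustment_R(m, r, i, k):
    """R such that R @ vec(alpha) = alpha[i, :] - alpha[k, :]."""
    return weak_exogeneity_R(m, r, i) - weak_exogeneity_R(m, r, k)

# Check on an arbitrary 3 x 2 alpha (m = 3 variables, r = 2 relations)
alpha = np.arange(6, dtype=float).reshape(3, 2)
R_we = weak_exogeneity_R(m=3, r=2, i=2)
R_eq = equal_adjustment_R(m=3, r=2, i=0, k=1)

print(R_we @ vec(alpha))   # equals alpha[2, :]
print(R_eq @ vec(alpha))   # equals alpha[0, :] - alpha[1, :]
```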
\(x_t = (y_t,\ p_t,\ p_t^{\ast})'\) — domestic output, domestic price, foreign price.
Question. Do prices satisfy PPP in the long run, and which side adjusts to disequilibrium — domestic or foreign?
Procedure.
Suppose stages 1–3 deliver \(\widehat r = 1\) and \(\widehat\beta \approx (0,1,-1)'\). The VECM components: \[ \beta = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha_y \\ \alpha_p \\ \alpha_{p^\ast} \end{bmatrix}, \quad \Pi = \alpha\beta' = \begin{bmatrix} 0 & \alpha_y & -\alpha_y \\ 0 & \alpha_p & -\alpha_p \\ 0 & \alpha_{p^\ast} & -\alpha_{p^\ast} \end{bmatrix}. \]
Small open economy assumption: foreign prices do not adjust to domestic disequilibrium: \[ H_0:\ \alpha_{p^\ast}=0. \]
Interpretation:
Once a VAR or VECM is estimated, check that the specification is adequate:
Failure of these checks typically points back to an under-specified lag length or an omitted structural break.
We studied VARs and VECMs as reduced-form representations of multivariate time series.
They allow us to:
The shocks \(u_t\) in a VAR or VECM are:
Next unit: how additional identifying assumptions transform reduced-form innovations into structural shocks with economic meaning.