Unit Roots, Integration, ARIMA Models
Xiamen University, Chow Institute
March, 2026
A time series \(\{x_t\}\) is nonstationary if its probabilistic behavior changes over time, so that:
Equivalently:
The statistical properties of the process are not stable over time.
Nonstationarity can arise from different mechanisms:
Stochastic nonstationarity Instability driven by the accumulation of random shocks (e.g. unit roots)
Deterministic nonstationarity Time-varying deterministic components (e.g. trends)
Other forms Structural breaks, regime changes, changing volatility
We will focus on stochastic nonstationarity here.
Consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]
The associated autoregressive polynomial is \[ \Phi(z) = 1 - \rho z. \]
When \(\rho = 1\), we have \(\Phi(1) = 0\).
In this case, the autoregressive polynomial has a root equal to one. This is called a unit root.
\[ x_t = x_{t-1} + \varepsilon_t, \]
Iterating forward, \[ x_t = x_0 + \sum_{j=1}^t \varepsilon_j. \]
Shocks enter the level cumulatively.
A one-time shock \(\varepsilon_t\) affects the process at all future horizons: \[ x_{t+h} = x_t + \sum_{j=1}^h \varepsilon_{t+j}, \qquad h \ge 0. \]
There is no decay in the effect of past innovations.
If \(\mathbb{E}[\varepsilon_t]=0\), then \[ \mathbb{E}[x_t]=\mathbb{E}[x_0], \qquad \mathrm{Var}(x_t)=\mathrm{Var}(x_0)+t\sigma^2 \to \infty. \]
The mean is constant, but dispersion grows without bound.
When a unit root process is AR(1), it is also referred to as a random walk.
Unit roots raise a natural question: how far is a series from stationarity?
A time series \(\{x_t\}\) is integrated of order \(d\), written \(x_t \sim I(d)\), if:
where \(\Delta = 1 - L\) is the difference operator.
ARIMA stands for Autoregressive Integrated Moving Average.
If a process is integrated of order \(d\), \[ x_t \sim I(d), \] then \[ \Delta^d x_t \;\text{is stationary}. \]
An ARIMA\((p,d,q)\) model assumes that the differenced series admits an ARMA representation: \[ \Phi(L)\,\Delta^d x_t = \Theta(L)\,\varepsilon_t. \]
When \(d=0\), ARIMA reduces to ARMA.
In ARIMA models, the order of integration \(d\) captures the presence of unit roots in the data.
This requires assessing whether a time series contains zero, one, or multiple unit roots.
In its simplest form, consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]
Testing for a unit root amounts to testing: \[ H_0:\ \rho = 1 \qquad \text{vs.} \qquad H_1:\ \rho < 1. \]
Consider the unit root model \[ x_t = x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]
Iterating forward, \[ x_t = x_0 + \sum_{j=1}^t \varepsilon_j. \]
A unit root process is a partial sum of shocks.
By CLT,
\[ n^{-1/2} \sum_{i=1}^n \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]
This describes the behavior of the partial sum at a fixed point in time.
Let \[ t = \lfloor rn \rfloor, \qquad r \in [0,1]. \]
Then, \[ \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]
Consider the scaled process \[ n^{-1/2} x_{\lfloor rn \rfloor} = n^{-1/2} x_0 + \left( \frac{\lfloor rn \rfloor}{n} \right)^{1/2} \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i. \]
As \(n \to \infty\), the initial condition vanishes: \[ n^{-1/2} x_0 \to 0. \]
Therefore,
\[ n^{-1/2} x_{\lfloor rn \rfloor} \;\xrightarrow{d}\; r^{1/2} N(0,\sigma^2) = \sigma W(r), \qquad r \in [0,1], \]
where \(W(r)\) is a Brownian motion.
A unit root process converges to Brownian motion after appropriate scaling.
Recall that the OLS estimator in the AR(1) regression \[ y_t = \rho y_{t-1} + \varepsilon_t \] satisfies \[ \hat\rho - \rho = \frac{\sum_{t=1}^n y_{t-1}\varepsilon_t}{\sum_{t=1}^n y_{t-1}^2}. \]
Under the unit root null hypothesis (\(\rho = 1\)), the following sample quantities satisfy:
\[ \begin{aligned} n^{-2}\sum_{t=1}^n y_{t-1}^2 &\;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)^2\,dr, \\[6pt] n^{-1}\sum_{t=1}^n y_{t-1}\varepsilon_t &\;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)\,dW(r), \end{aligned} \] where \(W(r)\) is standard Brownian motion.
Under the null hypothesis \(\rho = 1\), \[ n(\hat\rho - 1) \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)}{\int_0^1 W(r)^2\,dr}. \]
The corresponding \(t\) statistic is \[ t_n = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} = \frac{\hat\rho - 1} {\hat\sigma\,(\sum_{t=1}^n y_{t-1}^2)^{-1/2}} \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)} {\left\{\int_0^1 W(r)^2\,dr\right\}^{1/2}}. \]
The unit root hypothesis is \[ H_0:\ \gamma = 0 \quad (\text{equivalently, } \rho = 1). \]
The DF regression can include deterministic terms:
| Specification | Regression |
|---|---|
| No deterministics | \(\Delta y_t = \gamma y_{t-1} + \varepsilon_t\) |
| Intercept | \(\Delta y_t = \alpha_0 + \gamma y_{t-1} + \varepsilon_t\) |
| Intercept + trend | \(\Delta y_t = \alpha_0 + \alpha_1 t + \gamma y_{t-1} + \varepsilon_t\) |
The test statistic follows the Dickey–Fuller distribution (not standard normal) in all three cases, with different critical values.
In practice, the AR(1) specification is too restrictive. For a general AR(\(p\)) model, reparameterize in first differences: \[ \Delta y_t = \gamma y_{t-1} + \sum_{i=1}^{p-1} \beta_i \Delta y_{t-i} + \varepsilon_t, \] where \(\gamma = \sum_{i=1}^p a_i - 1\).
The lagged differences absorb higher-order dynamics; the unit root test remains \(H_0: \gamma = 0\) with the same DF critical values.
ADF tests for a single unit root. But a series may be \(I(0)\), \(I(1)\), or \(I(2)\).
Dickey–Pantula (1987) provides a size-controlled sequential procedure:
Test \(I(2)\) vs. \(I(1)\): regress \(\Delta^2 y_t\) on \(\Delta y_{t-1}\) (plus lags of \(\Delta^2 y_t\)). If the null is not rejected, conclude \(I(2)\).
Test \(I(1)\) vs. \(I(0)\): if step 1 rejects, add \(y_{t-2}\) to the regression and test its coefficient. Failure to reject \(\Rightarrow I(1)\); rejection \(\Rightarrow I(0)\).
The procedure tests from the most differenced hypothesis down, which controls size at each step. In practice, \(d \in \{0,1,2\}\) covers nearly all cases.
We test for a unit root using the Dickey–Fuller test, which is a left-tailed test.
Null hypothesis (unit root): \[ H_0:\ \rho = 1 \]
Alternative hypothesis (stationarity): \[ H_1:\ \rho < 1 \]
Rejection occurs when the test statistic is more negative than the Dickey–Fuller critical value.
ADF test (levels, with intercept)
Test statistic (tau): -2.025
5% critical value: -2.87
ADF test (first differences)
Test statistic (tau): -7.63
5% critical value: -2.87
The PACF cuts off after lag 1, suggesting an AR(1) for the differenced series, i.e., an ARIMA(1,1,0) model.
The ACF shows significant spikes at lags 1 and 2, which is also consistent with an MA(2) for the differenced series, i.e., an ARIMA(0,1,2) model.
Given a set of candidate ARIMA\((p,d,q)\) models, we select using information criteria:
\[ \text{AIC} = -2\log L(\hat\theta) + 2k, \qquad \text{BIC} = -2\log L(\hat\theta) + k\log T, \] where \(k\) is the number of estimated parameters (depends on \(p\), \(q\), and whether deterministic terms are included).
n <- length(dgdp)
aic <- c(fit_ar1$aic, fit_ma2$aic)
bic <- aic + (log(n) - 2) * c(length(fit_ar1$coef), length(fit_ma2$coef))
cat(sprintf("%-15s %10s %10s\n", "Model", "AIC", "BIC"))
cat(sprintf("%-15s %10.2f %10.2f\n", "ARIMA(1,1,0)", aic[1], bic[1]))
cat(sprintf("%-15s %10.2f %10.2f\n", "ARIMA(0,1,2)", aic[2], bic[2]))Model AIC BIC
ARIMA(1,1,0) -1743.73 -1740.13
ARIMA(0,1,2) -1726.60 -1719.39
Both criteria favor ARIMA(1,1,0).
Model diagnostics assess whether the assumptions of the ARIMA model are plausibly satisfied by the data.
Residual time series plot → checks stability, outliers, variance changes
ACF of residuals → checks remaining serial correlation
Ljung–Box test → formal test of joint residual autocorrelation \[ H_0:\ \rho_1 = \rho_2 = \cdots = \rho_h = 0,\] where \(\rho_k\) denotes the autocorrelation of the model residuals at lag \(k\).
Normal Q–Q plot → assesses whether residuals are approximately normal → systematic deviations, especially in the tails, matter
Diagnostics do not test whether the model is “true” — they assess whether the model is adequate.
Once an ARIMA model has been estimated and passes basic diagnostics, it can be used to generate forecasts of future observations.
Forecasting answers a different question than estimation:
Forecasts are model-based extrapolations, not statements of truth.
All time series forecasting relies on a crucial assumption:
Past statistical regularities remain informative about the future.
In ARIMA models, this means:
Given data observed up to time \(t\) and estimated parameters \(\hat\theta\), an ARIMA model implies a model-based forecast distribution for future observations: \[ y_{t+h} \;\sim\; \mathbb{P}_{\hat\theta}\!\left(\,\cdot \mid \mathscr{F}_t \right). \]
At time \(t\), the information set \(\mathscr{F}_t\) is observed. The fitted parameters \(\hat\theta\) define the forecasting model. Randomness arises from future shocks implied by the model.
From this model-based forecast distribution, we obtain:
a point forecast, defined as the model-based conditional mean \[ \hat{y}_{t+h\mid t} \;\equiv\; \mathbb{E}_{\hat\theta}\!\left(y_{t+h}\mid \mathscr{F}_t\right), \]
prediction intervals, constructed from quantiles of the model-based conditional distribution \[ y_{t+h}\mid \mathscr{F}_t \;\sim\; \mathbb{P}_{\hat\theta}. \]
Formally, a \((1-\alpha)\) prediction interval satisfies \[ \Pr_{\hat\theta}\!\left( y_{t+h}\in[\ell_h,u_h]\mid \mathscr{F}_t \right)=1-\alpha, \] where \(\ell_h\) and \(u_h\) are the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the model-based forecast distribution.
We continue with the ARIMA model fitted to log real GDP. Let \[ y_t = \log(\text{GDP}_t). \]
The fitted model is \[ \Delta y_t = \phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\sigma^2). \]
Equivalently, \[ y_t = y_{t-1} + \phi\,\Delta y_{t-1} + \varepsilon_t. \]
The point forecast at horizon \(h\) is defined as the conditional mean under the fitted model:
\[ \hat y_{t+h\mid t} \;\equiv\; \mathbb E_{\hat\theta}\!\left( y_{t+h} \mid \mathscr{F}_t \right), \] where \(\hat\theta=(\hat\phi,\hat\sigma^2)\) denotes the estimated model parameters.
One-Step-Ahead Forecast
From the model equation, \[ y_{t+1} = y_{t} + \phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations given \(\mathscr{F}_t\) and \(\hat\theta\), and using \(\mathbb E_{\hat\theta}(\varepsilon_{t+1}\mid\mathscr{F}_t)=0\), we obtain the one-step-ahead point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\phi\,\Delta y_t. \]
Point forecasts summarize the center of the forecast distribution, but forecasting also requires quantifying uncertainty.
Consider the fitted ARIMA\((1,1,0)\) model \[ \Delta y_t = \hat\phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\hat\sigma^2). \]
At time \(t\), the information set \(\mathscr F_t\) and the estimated parameters \(\hat\theta=(\hat\phi,\hat\sigma^2)\) are treated as fixed.
Under this conditioning, randomness arises only from future innovations \[ \varepsilon_{t+1}, \varepsilon_{t+2}, \ldots \]
From the fitted model, \[ y_{t+1} = y_t + \hat\phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations yields the point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\phi\,\Delta y_t. \]
Therefore, the one-step-ahead forecast error is \[ y_{t+1} - \hat y_{t+1\mid t} = \varepsilon_{t+1}. \]
Two-step-ahead forecast error
Iterating the model forward one more step, the two-step-ahead forecast error satisfies \[ y_{t+2}-\hat y_{t+2\mid t} = (1+\hat\phi)\,\varepsilon_{t+1} + \varepsilon_{t+2}. \]
This illustrates how forecast uncertainty accumulates and propagates with the forecast horizon.
Under the fitted ARIMA\((1,1,0)\) model, forecast errors are linear combinations of future innovations: \[ y_{t+h}-\hat y_{t+h\mid t} = \sum_{j=1}^h w_{h,j}\,\varepsilon_{t+j}, \] where the weights \(w_{h,j}\) depend on \(\hat\phi\).
Since the fitted model assumes \[ \varepsilon_{t+j} \sim \text{i.i.d. }(0,\hat\sigma^2), \] the model-based forecast distribution of \(y_{t+h}\) is centered at \(\hat y_{t+h\mid t}\), with variance increasing in the forecast horizon \(h\).
Under the maintained Gaussian assumption, the model-based forecast error is normally distributed: \[ y_{t+h} - \hat y_{t+h\mid t} \;\sim\; N\!\left( 0,\; \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) \right). \]
Therefore, a \((1-\alpha)\) model-based prediction interval for \(y_{t+h}\) is \[ \hat y_{t+h\mid t} \;\pm\; z_{1-\alpha/2} \sqrt{ \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) }, \] where \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal distribution.
Out-of-sample evaluation (h = 10 quarters)
RMSE: 0.0189
MAE: 0.0142
Forecasting ultimately concerns realized outcomes, not models.
A forecast made at time \(t\), \[ \hat y_{t+h\mid t}, \] is evaluated against what actually occurs, \[ y_{t+h}. \]
The realized forecast error is \[ e_{t+h} = y_{t+h} - \hat y_{t+h\mid t}. \]
This realized error combines all sources of forecast uncertainty:
A single realized forecast error \(e_{t+h}\) is informative, but forecast evaluation requires averaging over repeated realizations.
The standard criterion is the mean squared forecast error (MSFE): \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right], \] where the expectation is taken unconditionally under the true data-generating process.
To understand the sources of forecast error, write \[ y_{t+h} - \hat y_{t+h\mid t} = \underbrace{ \bigl(y_{t+h}-y_{t+h}^\ast\bigr) }_{\text{unpredictable shocks}} + \underbrace{ \bigl(y_{t+h}^\ast-\hat y_{t+h\mid t}\bigr) }_{\text{forecast construction error}}, \] where \[ y_{t+h}^\ast \equiv \mathbb{E}\!\left(y_{t+h}\mid \mathscr F_t\right) \] denotes the ideal conditional forecast under the true data-generating process.
Squaring and taking expectations yields the familiar decomposition: \[ \mathrm{MSFE}(h) = \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}-y_{t+h}^\ast\right)^2 \right] }_{\text{irreducible uncertainty}} + \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right] }_{\text{bias and estimation error}}. \]
The second term in the MSFE decomposition, \[ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right], \] captures systematic forecast error.
To obtain a bias–variance decomposition, we work conditionally on the information set \(\mathscr F_t\).
\[ \mathbb E\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \mid \mathscr F_t \right] = \underbrace{ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 }_{\text{forecast bias}^2} + \underbrace{ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) }_{\text{forecast variance}}. \]
Interpretation
This decomposition highlights a fundamental bias–variance tradeoff in forecasting: simpler models may be biased but stable, while flexible models may reduce bias at the cost of higher variance.
Averaging over forecasting situations (i.e. over \(\mathscr F_t\)) yields the unconditional contribution to the MSFE: \[ \mathbb{E}\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \right] = \mathbb{E}\!\left[ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 \right] + \mathbb{E}\!\left[ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) \right]. \]
The MSFE is a population object: \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right]. \]
In practice, we observe one realized path. Expectations are therefore replaced by out-of-sample averages.
This leads to the root mean squared forecast error (RMSE): \[ \widehat{\mathrm{RMSE}}(h) = \sqrt{ \frac{1}{T} \sum_{t=1}^T \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 }. \]
Another commonly used criterion is the mean absolute error (MAE): \[ \widehat{\mathrm{MAE}}(h) = \frac{1}{T} \sum_{t=1}^T \left| y_{t+h} - \hat y_{t+h\mid t} \right|. \]
MAE penalizes errors linearly, rather than quadratically.
As a result:
The MSFE and RMSE are defined in terms of forecast errors, not the level of the time series.
For forecast evaluation to be meaningful, we implicitly require that:
This is exactly what ARIMA models are designed to deliver.
ARIMA models handle univariate nonstationary series by differencing.
What they leave out: