Unit Roots, Integration, ARIMA Models
Xiamen University, Chow Institute
March, 2026
A time series \(\{x_t\}\) is nonstationary if its probabilistic behavior changes over time. Equivalently, the statistical properties of the process (its mean, variance, and autocovariances, for example) are not stable over time.
Nonstationarity can arise from different mechanisms:
Stochastic nonstationarity
Instability driven by the accumulation of random shocks (e.g. unit roots)
Deterministic nonstationarity
Time-varying deterministic components (e.g. trends)
Other forms
Structural breaks, regime changes, changing volatility
We will focus on stochastic nonstationarity here.
Consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]
The associated autoregressive polynomial is \[ \Phi(z) = 1 - \rho z. \]
When \[ \rho = 1, \] we have \[ \Phi(1) = 0. \]
In this case, the autoregressive polynomial has a root equal to one. This is called a unit root.
\[ x_t = x_{t-1} + \varepsilon_t. \]
Iterating forward, \[ x_t = x_0 + \sum_{j=1}^t \varepsilon_j. \]
Shocks enter the level cumulatively.
A one-time shock \(\varepsilon_t\) affects the process at all future horizons: \[ x_{t+h} = x_t + \sum_{j=1}^h \varepsilon_{t+j}, \qquad h \ge 0. \]
There is no decay in the effect of past innovations.
If \(\mathbb{E}[\varepsilon_t]=0\), then \[ \mathbb{E}[x_t]=\mathbb{E}[x_0], \qquad \mathrm{Var}(x_t)=\mathrm{Var}(x_0)+t\sigma^2 \to \infty. \]
The mean is constant, but dispersion grows without bound.
An AR(1) process with a unit root is called a random walk.
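The variance formula above is easy to confirm by simulation. The sketch below (plain Python, no external packages) generates many independent random walks and checks that the cross-sectional variance at time \(t\) is approximately \(t\sigma^2\):

```python
import random
import statistics

# Simulate many independent random walks x_t = x_{t-1} + eps_t with x_0 = 0,
# then check that the cross-sectional variance at time t grows like t * sigma^2.
random.seed(0)
n_paths, horizon, sigma = 5000, 100, 1.0

values_half = []   # x_t at t = 50, across paths
values_full = []   # x_t at t = 100, across paths
for _ in range(n_paths):
    x = 0.0
    for t in range(1, horizon + 1):
        x += random.gauss(0.0, sigma)
        if t == horizon // 2:
            values_half.append(x)
    values_full.append(x)

var_half = statistics.pvariance(values_half)   # should be close to 50
var_full = statistics.pvariance(values_full)   # should be close to 100
print(var_half, var_full)
```

The variance roughly doubles between \(t=50\) and \(t=100\), as \(\mathrm{Var}(x_t) = t\sigma^2\) predicts.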
Unit roots raise a natural question: how far is a series from stationarity?
A time series \(\{x_t\}\) is integrated of order \(d\), written \(x_t \sim I(d)\), if \(\Delta^d x_t\) is stationary while \(\Delta^{d-1} x_t\) is not, where \(\Delta = 1 - L\) is the difference operator.
ARIMA stands for Autoregressive Integrated Moving Average.
If a process is integrated of order \(d\), \[ x_t \sim I(d), \] then \[ \Delta^d x_t \;\text{is stationary}. \]
An ARIMA\((p,d,q)\) model assumes that the differenced series admits an ARMA representation: \[ \Phi(L)\,\Delta^d x_t = \Theta(L)\,\varepsilon_t. \]
When \(d=0\), ARIMA reduces to ARMA.
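The mechanics of integration and differencing can be checked exactly: if \(u_t\) is a stationary series and \(x_t\) is its partial sum (so \(x_t \sim I(1)\) when \(u_t \sim I(0)\)), then \(\Delta x_t\) recovers \(u_t\). A minimal Python sketch:

```python
import random
import itertools

# If u_t is stationary and x_t = sum_{j<=t} u_j, then Delta x_t = u_t exactly.
random.seed(1)
phi = 0.5                            # AR(1) coefficient of the stationary input
u = [random.gauss(0.0, 1.0)]
for _ in range(199):
    u.append(phi * u[-1] + random.gauss(0.0, 1.0))

x = list(itertools.accumulate(u))    # integrate once: x_t = u_1 + ... + u_t

def diff(series):
    """First difference: Delta x_t = x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]

dx = diff(x)
# Delta x_t = u_t for t >= 2 (one observation is lost to differencing)
max_gap = max(abs(a - b) for a, b in zip(dx, u[1:]))
print(max_gap)
```

Each application of \(\Delta\) loses one observation, which is why an ARIMA\((p,d,q)\) fit on \(n\) observations effectively uses \(n-d\) differenced data points.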
In ARIMA models, the order of integration \(d\) captures the presence of unit roots in the data.
This requires assessing whether a time series contains zero, one, or multiple unit roots.
In its simplest form, consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]
Testing for a unit root amounts to testing: \[ H_0:\ \rho = 1 \qquad \text{vs.} \qquad H_1:\ \rho < 1. \]
Consider the unit root model \[ y_t = y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]
Iterating forward, \[ y_t = y_0 + \sum_{j=1}^t \varepsilon_j. \]
A unit root process is a partial sum of shocks.
By the central limit theorem (CLT),
\[ n^{-1/2} \sum_{i=1}^n \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]
This describes the behavior of the partial sum at a fixed point in time.
Let \[ t = \lfloor rn \rfloor, \qquad r \in [0,1]. \]
Then, for fixed \(r\), \[ \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]
Consider the scaled process \[ n^{-1/2} y_{\lfloor rn \rfloor} = n^{-1/2} y_0 + \left( \frac{\lfloor rn \rfloor}{n} \right)^{1/2} \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i. \]
As \(n \to \infty\), the initial condition vanishes: \[ n^{-1/2} y_0 \to 0. \]
Therefore,
\[ n^{-1/2} y_{\lfloor rn \rfloor} \;\xrightarrow{d}\; \sigma W(r), \qquad r \in [0,1], \]
where \(W(r)\) is a standard Brownian motion, so that \(\sigma W(r) \sim N(0, r\sigma^2)\) for each fixed \(r\).
A unit root process converges to Brownian motion after appropriate scaling.
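The scaling result can be illustrated numerically: for a driftless random walk, the variance of \(n^{-1/2} y_{\lfloor rn \rfloor}\) should be close to \(r\sigma^2\). A quick Python check (pure stdlib):

```python
import random
import statistics

# Check the FCLT scaling: Var(n^{-1/2} y_{floor(rn)}) should be near r * sigma^2.
random.seed(2)
n, n_paths, sigma = 400, 4000, 1.0
r_values = (0.25, 1.0)

scaled = {r: [] for r in r_values}
for _ in range(n_paths):
    y = 0.0
    for t in range(1, n + 1):
        y += random.gauss(0.0, sigma)
        for r in r_values:
            if t == int(r * n):
                scaled[r].append(y / n ** 0.5)

var_quarter = statistics.pvariance(scaled[0.25])  # should be near 0.25
var_full = statistics.pvariance(scaled[1.0])      # should be near 1.00
print(var_quarter, var_full)
```

This matches \(\mathrm{Var}(\sigma W(r)) = r\sigma^2\) at \(r = 1/4\) and \(r = 1\).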
Recall that the OLS estimator in the AR(1) regression \[ y_t = \rho y_{t-1} + \varepsilon_t \] satisfies \[ \hat\rho - \rho = \frac{\sum_{t=1}^n y_{t-1}\varepsilon_t}{\sum_{t=1}^n y_{t-1}^2}. \]
Under the unit root null hypothesis (\(\rho = 1\)), the following sample quantities satisfy:
\[ n^{-2}\sum_{t=1}^n y_{t-1}^2 \;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)^2\,dr, \]
\[ n^{-1}\sum_{t=1}^n y_{t-1}\varepsilon_t \;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)\,dW(r), \]
where \(W(r)\) is standard Brownian motion.
Under the null hypothesis \(\rho = 1\), \[ n(\hat\rho - 1) \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)}{\int_0^1 W(r)^2\,dr}. \]
The corresponding \(t\) statistic is \[ t_n = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} = \frac{\hat\rho - 1} {\hat\sigma\,(\sum_{t=1}^n y_{t-1}^2)^{-1/2}} \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)} {\left\{\int_0^1 W(r)^2\,dr\right\}^{1/2}}. \]
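The nonstandard limit above is easy to visualize by Monte Carlo. The sketch below (plain Python) simulates \(n(\hat\rho - 1)\) under the unit root null and confirms that its distribution is shifted to the left, unlike the symmetric normal limit from stationary AR(1) asymptotics; note that the sign of \(\int_0^1 W\,dW = (W(1)^2 - 1)/2\) is negative with probability \(P(|Z|<1) \approx 0.68\):

```python
import random
import statistics

# Simulate the Dickey-Fuller coefficient statistic n*(rho_hat - 1)
# under the null rho = 1, using the OLS formula
#   rho_hat - 1 = sum(y_{t-1} * Dy_t) / sum(y_{t-1}^2).
random.seed(3)
n, reps = 250, 2000

stats = []
for _ in range(reps):
    y = [0.0]
    for _ in range(n):
        y.append(y[-1] + random.gauss(0.0, 1.0))
    num = sum(y[t - 1] * (y[t] - y[t - 1]) for t in range(1, n + 1))
    den = sum(y[t - 1] ** 2 for t in range(1, n + 1))
    stats.append(n * (num / den))

med = statistics.median(stats)
frac_negative = sum(s < 0 for s in stats) / reps
print(med, frac_negative)
```

The skewed null distribution is why Dickey–Fuller tests use their own tabulated critical values rather than normal or \(t\) tables.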
So far, we considered the simplest Dickey–Fuller regression: \[ \Delta y_t = \gamma y_{t-1} + \varepsilon_t. \]
Dickey–Fuller allows additional deterministic terms:
(1) With intercept \[ \Delta y_t = \alpha_0 + \gamma y_{t-1} + \varepsilon_t. \]
(2) With intercept and linear trend \[ \Delta y_t = \alpha_0 + \alpha_1 t + \gamma y_{t-1} + \varepsilon_t. \]
In these regressions, the unit root is tested by \[ H_0:\ \gamma = 0. \]
Joint tests involving deterministic terms are possible, but empirical applications typically focus on testing for a unit root.
Recall that for an AR(1) process, the Dickey–Fuller regression is \[ \Delta y_t = \gamma y_{t-1} + \varepsilon_t. \]
In practice, time series often exhibit higher-order dynamics. Consider a general AR(\(p\)) model: \[ y_t = a_1 y_{t-1} + a_2 y_{t-2} + \cdots + a_p y_{t-p} + \varepsilon_t. \]
By rewriting the AR(\(p\)) model in first differences, we obtain \[ \Delta y_t = \gamma y_{t-1} + \sum_{i=1}^{p-1} \beta_i \Delta y_{t-i} + \varepsilon_t, \]
where \[ \gamma = \sum_{i=1}^p a_i - 1, \qquad \beta_i = - \sum_{j=i+1}^p a_j. \]
This reparameterization leads to the Augmented Dickey–Fuller regression, which preserves the Dickey–Fuller test for a unit root.
The null hypothesis remains \[ H_0:\ \gamma = 0. \]
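The reparameterization is an exact algebraic identity, which can be verified numerically. The sketch below simulates an AR(2) and checks that the ADF form with \(\gamma = a_1 + a_2 - 1\) and \(\beta_1 = -a_2\) reproduces the same innovations:

```python
import random

# Verify the ADF reparameterization on an AR(2) example:
#   y_t = a1 y_{t-1} + a2 y_{t-2} + eps_t
# is identical to
#   Dy_t = gamma y_{t-1} + beta1 Dy_{t-1} + eps_t
# with gamma = a1 + a2 - 1 and beta1 = -a2.
random.seed(4)
a1, a2 = 0.6, 0.3                     # a stationary AR(2) (a1 + a2 < 1)
gamma, beta1 = a1 + a2 - 1, -a2

y = [0.0, 0.0]                        # two pre-sample values
eps = []
for _ in range(200):
    e = random.gauss(0.0, 1.0)
    eps.append(e)
    y.append(a1 * y[-1] + a2 * y[-2] + e)

# Recover eps_t from the reparameterized equation and compare.
max_gap = 0.0
for t in range(2, len(y)):
    dy_t = y[t] - y[t - 1]
    dy_lag = y[t - 1] - y[t - 2]
    e_implied = dy_t - gamma * y[t - 1] - beta1 * dy_lag
    max_gap = max(max_gap, abs(e_implied - eps[t - 2]))
print(max_gap)
```

Because the two forms are algebraically identical, the implied innovations agree to machine precision; the ADF regression changes the parameterization, not the model.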
The Augmented Dickey–Fuller (ADF) regression tests for the presence of a single unit root, that is, whether the autoregressive polynomial contains a factor \((1-L)\).
More generally, a time series may contain zero, one, or more unit roots.
A common informal approach proceeds by applying ADF tests to \(y_t\), then to \(\Delta y_t\), and so on, until the null is rejected.
While intuitive, this sequential use of ADF does not provide a size-controlled procedure for testing multiple unit roots.
Dickey and Pantula (1987) propose a formal procedure for testing multiple unit roots.
The procedure is based on testing sequentially from the highest suspected order of integration downward, using regressions in the highest-order differences.
This approach provides a theoretically justified way to distinguish between different numbers of unit roots.
To distinguish between \(I(2)\), \(I(1)\), and \(I(0)\), Dickey and Pantula (1987) propose a sequential testing procedure based on second differences.
Estimate \[ \Delta^2 y_t = \alpha_0 + \beta_1 \Delta y_{t-1} + \sum_{i=1}^p c_i \Delta^2 y_{t-i} + u_t, \]
and test \[ H_0:\ \beta_1 = 0. \]
If \(H_0\) is rejected, estimate \[ \Delta^2 y_t = \alpha_0 + \beta_1 \Delta y_{t-1} + \beta_2 y_{t-2} + \sum_{i=1}^p c_i \Delta^2 y_{t-i} + u_t, \]
and test \[ H_0:\ \beta_2 = 0. \]
The validity of the Dickey–Pantula procedure rests on a few assumptions:
The data admit a finite-order AR representation (or a sufficiently accurate AR approximation)
Innovations are weakly dependent with finite variance
Deterministic terms (intercept, trend) are included consistently across steps
In practice, attention is typically limited to \(d \in \{0,1,2\}\).
We test for a unit root using the Dickey–Fuller test, which is a left-tailed test.
Null hypothesis (unit root): \[ H_0:\ \rho = 1 \]
Alternative hypothesis (stationarity): \[ H_1:\ \rho < 1 \]
Rejection occurs when the test statistic is more negative than the Dickey–Fuller critical value.
ADF test (levels, with intercept)
Test statistic (tau): -2.025
5% critical value: -2.87
ADF test (first differences)
Test statistic (tau): -7.63
5% critical value: -2.87
ARIMA(1,1,0)
Call:
arima(x = dgdp, order = c(1, 0, 0), include.mean = TRUE)
Coefficients:
ar1 intercept
0.3575 0.0079
s.e. 0.0566 0.0008
sigma^2 estimated as 7.951e-05: log likelihood = 897.78, aic = -1789.55
Call:
arima(x = gdp, order = c(1, 1, 0), xreg = t)
Coefficients:
ar1 t
0.3575 0.0079
s.e. 0.0566 0.0008
sigma^2 estimated as 7.951e-05: log likelihood = 897.77, aic = -1789.55
ARIMA(0,1,2)
Call:
arima(x = dgdp, order = c(1, 0, 0))
Coefficients:
ar1 intercept
0.3575 0.0079
s.e. 0.0566 0.0008
sigma^2 estimated as 7.951e-05: log likelihood = 897.78, aic = -1789.55
Call:
arima(x = dgdp, order = c(0, 0, 2))
Coefficients:
ma1 ma2 intercept
0.3054 0.2271 0.0079
s.e. 0.0593 0.0566 0.0008
sigma^2 estimated as 7.805e-05: log likelihood = 900.27, aic = -1792.54
Model diagnostics assess whether the assumptions of the ARIMA model are plausibly satisfied by the data.
Common diagnostic tools:
Residual time series plot
→ checks stability, outliers, variance changes
ACF of residuals
→ checks remaining serial correlation
Ljung–Box test
→ formal test of joint residual autocorrelation \[ H_0:\ \rho_1 = \rho_2 = \cdots = \rho_h = 0, \] where \(\rho_k\) denotes the autocorrelation of the model residuals at lag \(k\).
Normal Q–Q plot
→ assesses whether residuals are approximately normal
→ systematic deviations, especially in the tails, matter
Diagnostics do not test whether the model is “true”; they assess whether the model is adequate.
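The Ljung–Box statistic \(Q = n(n+2)\sum_{k=1}^h \hat\rho_k^2/(n-k)\) can be computed from scratch; under \(H_0\) it is approximately \(\chi^2\) with \(h\) degrees of freedom (fewer when applied to residuals of an estimated model). A minimal Python sketch on simulated white-noise "residuals":

```python
import random

# Ljung-Box statistic Q = n (n + 2) * sum_{k=1}^{h} rho_hat_k^2 / (n - k),
# computed on simulated white noise standing in for model residuals.
random.seed(5)
n, h = 500, 10
resid = [random.gauss(0.0, 1.0) for _ in range(n)]

mean = sum(resid) / n
c0 = sum((r - mean) ** 2 for r in resid) / n   # sample variance (lag 0)

def acf(series, k, mu, gamma0, size):
    """Sample autocorrelation at lag k."""
    ck = sum((series[t] - mu) * (series[t - k] - mu)
             for t in range(k, size)) / size
    return ck / gamma0

Q = n * (n + 2) * sum(acf(resid, k, mean, c0, n) ** 2 / (n - k)
                      for k in range(1, h + 1))
print(Q)
```

For genuine white noise, \(Q\) should be an unremarkable draw from a \(\chi^2_{10}\) distribution; a large \(Q\) would signal leftover serial correlation.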
Once an ARIMA model has been estimated and passes basic diagnostics, it can be used to generate forecasts of future observations.
Forecasting answers a different question than estimation: not which parameters best describe the past, but what the fitted model implies about observations not yet seen.
Forecasts are model-based extrapolations, not statements of truth.
All time series forecasting relies on a crucial assumption:
Past statistical regularities remain informative about the future.
In ARIMA models, this means the estimated dynamics of the (differenced) series are assumed to continue to hold beyond the sample.
Given data observed up to time \(t\) and estimated parameters \(\hat\theta\), an ARIMA model implies a model-based forecast distribution for future observations: \[ x_{t+h} \;\sim\; \mathbb{P}_{\hat\theta}\!\left(\,\cdot \mid \mathscr{F}_t \right). \]
At time \(t\), the information set \(\mathscr{F}_t\) is observed. The fitted parameters \(\hat\theta\) define the forecasting model. Randomness arises from future shocks implied by the model.
From this model-based forecast distribution, we obtain:
a point forecast, defined as the model-based conditional mean \[ \hat{x}_{t+h\mid t} \;\equiv\; \mathbb{E}_{\hat\theta}\!\left(x_{t+h}\mid \mathscr{F}_t\right), \]
prediction intervals, constructed from quantiles of the model-based conditional distribution \[ x_{t+h}\mid \mathscr{F}_t \;\sim\; \mathbb{P}_{\hat\theta}. \]
Formally, a \((1-\alpha)\) prediction interval satisfies \[ \Pr_{\hat\theta}\!\left( x_{t+h}\in[\ell_h,u_h]\mid \mathscr{F}_t \right)=1-\alpha, \] where \(\ell_h\) and \(u_h\) are the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the model-based forecast distribution.
We continue with the ARIMA model fitted to log real GDP. Let \[ y_t = \log(\text{GDP}_t). \]
The fitted model is \[ \Delta y_t = \mu + \phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\sigma^2). \]
Equivalently, \[ y_t = y_{t-1} + \mu + \phi\,\Delta y_{t-1} + \varepsilon_t. \]
The point forecast at horizon \(h\) is defined as the conditional mean under the fitted model:
\[ \hat y_{t+h\mid t} \;\equiv\; \mathbb E_{\hat\theta}\!\left( y_{t+h} \mid \mathscr{F}_t \right), \] where \(\hat\theta=(\hat\mu,\hat\phi,\hat\sigma^2)\) denotes the estimated model parameters.
One-Step-Ahead Forecast
From the model equation, \[ y_{t+1} = y_t + \mu + \phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations given \(\mathscr{F}_t\) and \(\hat\theta\), and using \[ \mathbb E_{\hat\theta}(\varepsilon_{t+1}\mid\mathscr{F}_t)=0, \] we obtain the one-step-ahead point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\mu + \hat\phi\,\Delta y_t. \]
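The one-step forecast formula is a single line of arithmetic. A minimal sketch, using illustrative placeholder values for \(\hat\mu\), \(\hat\phi\), and the last two observations (not the estimates reported earlier in these notes):

```python
# One-step-ahead point forecast for an ARIMA(1,1,0)-with-drift model:
#   y_hat_{t+1|t} = y_t + mu_hat + phi_hat * (y_t - y_{t-1}).
# All numbers below are hypothetical, for illustration only.
mu_hat, phi_hat = 0.005, 0.36        # placeholder drift and AR(1) estimates
y_t, y_tm1 = 9.850, 9.842            # placeholder last two log-GDP values

dy_t = y_t - y_tm1                   # most recent observed growth
y_hat_next = y_t + mu_hat + phi_hat * dy_t
print(y_hat_next)
```

The forecast adds the estimated drift to the current level, plus a fraction \(\hat\phi\) of the most recent growth, reflecting the persistence in \(\Delta y_t\).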
Point forecasts summarize the center of the forecast distribution, but forecasting also requires quantifying uncertainty.
Consider the fitted ARIMA\((1,1,0)\) model \[ \Delta y_t = \hat\mu + \hat\phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\hat\sigma^2). \]
At time \(t\), the information set \(\mathscr F_t\) and the estimated parameters \(\hat\theta=(\hat\mu,\hat\phi,\hat\sigma^2)\) are treated as fixed.
Under this conditioning, randomness arises only from future innovations \[ \varepsilon_{t+1}, \varepsilon_{t+2}, \ldots \]
One-step-ahead forecast error
From the fitted model, \[ y_{t+1} = y_t + \hat\mu + \hat\phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations yields the point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\mu + \hat\phi\,\Delta y_t. \]
Therefore, the one-step-ahead forecast error is \[ y_{t+1} - \hat y_{t+1\mid t} = \varepsilon_{t+1}. \]
Two-step-ahead forecast error
Iterating the model forward one more step, the two-step-ahead forecast error satisfies \[ y_{t+2}-\hat y_{t+2\mid t} = (1+\hat\phi)\,\varepsilon_{t+1} + \varepsilon_{t+2}. \]
This illustrates how forecast uncertainty accumulates and propagates with the forecast horizon.
Under the fitted ARIMA\((1,1,0)\) model, forecast errors are linear combinations of future innovations: \[ y_{t+h}-\hat y_{t+h\mid t} = \sum_{j=1}^h w_{h,j}\,\varepsilon_{t+j}, \] where the weights \(w_{h,j}\) depend on \(\hat\phi\).
Since the fitted model assumes \[ \varepsilon_{t+j} \sim \text{i.i.d. }(0,\hat\sigma^2), \] the model-based forecast distribution of \(y_{t+h}\) is centered at \(\hat y_{t+h\mid t}\), with variance increasing in the forecast horizon \(h\).
Under the maintained Gaussian assumption, the model-based forecast error is normally distributed: \[ y_{t+h} - \hat y_{t+h\mid t} \;\sim\; N\!\left( 0,\; \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) \right). \]
Therefore, a \((1-\alpha)\) model-based prediction interval for \(y_{t+h}\) is \[ \hat y_{t+h\mid t} \;\pm\; z_{1-\alpha/2} \sqrt{ \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) }, \] where \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal distribution.
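For the ARIMA\((1,1,0)\) model the weights work out to geometric sums, \(w_{h,j} = 1 + \hat\phi + \cdots + \hat\phi^{\,h-j}\), consistent with the one- and two-step errors derived above. A Python sketch with placeholder parameter values (not the earlier estimates) that builds the \(h\)-step variance and a 95% interval:

```python
# Model-based forecast-error variance and prediction interval for ARIMA(1,1,0),
# using the innovation weights w_{h,j} = 1 + phi + ... + phi^(h-j).
# Parameter values and the point forecast are illustrative placeholders.
phi_hat, sigma2_hat = 0.36, 8.0e-05
y_hat_h = 9.88                        # some h-step point forecast (made up)
h = 4

def weight(phi, h, j):
    """Weight on eps_{t+j} in the h-step forecast error, j = 1, ..., h."""
    return sum(phi ** i for i in range(h - j + 1))

weights = [weight(phi_hat, h, j) for j in range(1, h + 1)]
var_h = sigma2_hat * sum(w ** 2 for w in weights)

z = 1.959963984540054                 # 97.5% standard normal quantile
lower = y_hat_h - z * var_h ** 0.5
upper = y_hat_h + z * var_h ** 0.5
print(weights, var_h, (lower, upper))
```

At \(h=2\) the weights reduce to \((1+\hat\phi,\,1)\), matching the two-step error derived above, and every weight is at least one, so the interval widens with the horizon.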
Out-of-sample evaluation (h = 10 quarters)
RMSE: 0.0189
MAE: 0.0142
Forecasting ultimately concerns realized outcomes, not models.
A forecast made at time \(t\), \[ \hat y_{t+h\mid t}, \] is evaluated against what actually occurs, \[ y_{t+h}. \]
The realized forecast error is \[ e_{t+h} = y_{t+h} - \hat y_{t+h\mid t}. \]
This realized error combines all sources of forecast uncertainty: future shocks, parameter estimation error, and possible model misspecification.
A single realized forecast error \(e_{t+h}\) is informative, but forecast evaluation requires averaging over repeated realizations.
The standard criterion is the mean squared forecast error (MSFE): \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right], \] where the expectation is taken unconditionally under the true data-generating process.
To understand the sources of forecast error, write \[ y_{t+h} - \hat y_{t+h\mid t} = \underbrace{ \bigl(y_{t+h}-y_{t+h}^\ast\bigr) }_{\text{unpredictable shocks}} + \underbrace{ \bigl(y_{t+h}^\ast-\hat y_{t+h\mid t}\bigr) }_{\text{forecast construction error}}, \] where \[ y_{t+h}^\ast \equiv \mathbb{E}\!\left(y_{t+h}\mid \mathscr F_t\right) \] denotes the ideal conditional forecast under the true data-generating process.
Squaring and taking expectations yields the familiar decomposition: \[ \mathrm{MSFE}(h) = \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}-y_{t+h}^\ast\right)^2 \right] }_{\text{irreducible uncertainty}} + \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right] }_{\text{bias and estimation error}}. \]
The second term in the MSFE decomposition, \[ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right], \] captures systematic forecast error.
To obtain a bias–variance decomposition, we work conditionally on the information set \(\mathscr F_t\).
\[ \mathbb E\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \mid \mathscr F_t \right] = \underbrace{ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 }_{\text{forecast bias}^2} + \underbrace{ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) }_{\text{forecast variance}}. \]
Interpretation
This decomposition highlights a fundamental bias–variance tradeoff in forecasting: simpler models may be biased but stable, while flexible models may reduce bias at the cost of higher variance.
Averaging over forecasting situations (i.e. over \(\mathscr F_t\)) yields the unconditional contribution to the MSFE: \[ \mathbb{E}\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \right] = \mathbb{E}\!\left[ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 \right] + \mathbb{E}\!\left[ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) \right]. \]
The MSFE is a population object: \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right]. \]
In practice, we observe one realized path. Expectations are therefore replaced by out-of-sample averages.
This leads to the root mean squared forecast error (RMSE): \[ \widehat{\mathrm{RMSE}}(h) = \sqrt{ \frac{1}{T} \sum_{t=1}^T \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 }. \]
Another commonly used criterion is the mean absolute error (MAE): \[ \widehat{\mathrm{MAE}}(h) = \frac{1}{T} \sum_{t=1}^T \left| y_{t+h} - \hat y_{t+h\mid t} \right|. \]
MAE penalizes errors linearly, rather than quadratically.
As a result, MAE is less sensitive to occasional large errors than RMSE, which weights them quadratically.
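Both criteria are one-liners to compute. A sketch with made-up realized forecast errors standing in for \(y_{t+h} - \hat y_{t+h\mid t}\):

```python
# Out-of-sample RMSE and MAE from a set of realized forecast errors.
# The error values below are illustrative, not actual results.
errors = [0.012, -0.008, 0.021, -0.015, 0.004, -0.019, 0.007, 0.010]

n = len(errors)
rmse = (sum(e ** 2 for e in errors) / n) ** 0.5
mae = sum(abs(e) for e in errors) / n
print(rmse, mae)
```

Note that RMSE is never smaller than MAE (by Jensen's inequality), with equality only when all absolute errors are identical; a large gap between the two signals a few dominant errors.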
The MSFE and RMSE are defined in terms of forecast errors, not the level of the time series.
For forecast evaluation to be meaningful, we implicitly require that the forecast errors themselves are (approximately) stationary.
This is exactly what ARIMA models are designed to deliver.
Implication
RMSE is a valid evaluation criterion for ARIMA forecasts even when the underlying series is nonstationary, because forecast errors inherit stability from the differenced model.
ARIMA models were developed as practical forecasting tools, not as structural or causal models.
Historically, the Box–Jenkins methodology built ARIMA models through an iterative cycle of identification, estimation, and diagnostic checking, with forecasting as the end goal.
ARIMA models are therefore best understood as reduced-form statistical descriptions of a series' own dynamics, selected for predictive performance.
This explains why their coefficients carry no structural or causal interpretation, and why the models are judged by forecast accuracy rather than by economic theory.