Nonstationary Time Series Models

Unit Roots, Integration, ARIMA Models

Natasha Kang

Xiamen University, Chow Institute

March 2026

What Do We Mean by Nonstationarity?

A time series \(\{x_t\}\) is nonstationary if its probabilistic behavior changes over time, so that:

  • its distribution is not invariant to time shifts, or
  • key features such as moments or dependence structure are time-dependent

Equivalently:

The statistical properties of the process are not stable over time.


Sources of Nonstationarity

Nonstationarity can arise from different mechanisms:

  • Stochastic nonstationarity
    Instability driven by the accumulation of random shocks (e.g. unit roots)

  • Deterministic nonstationarity
    Time-varying deterministic components (e.g. trends)

  • Other forms
    Structural breaks, regime changes, changing volatility


We will focus on stochastic nonstationarity here.

A Canonical Example of a Unit Root Process

Consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]


The associated autoregressive polynomial is \[ \Phi(z) = 1 - \rho z. \]


When \(\rho = 1\), we have \(\Phi(1) = 0\).


In this case, the autoregressive polynomial has a root equal to one. This is called a unit root.

Dynamics of a Unit Root Process

\[ x_t = x_{t-1} + \varepsilon_t. \]


Accumulation of shocks

Iterating forward, \[ x_t = x_0 + \sum_{j=1}^t \varepsilon_j. \]

Shocks enter the level cumulatively.


Permanent effects

A one-time shock \(\varepsilon_t\) affects the process at all future horizons: \[ x_{t+h} = x_t + \sum_{j=1}^h \varepsilon_{t+j}, \qquad h \ge 1. \]

There is no decay in the effect of past innovations.


No stable long-run mean

If \(\mathbb{E}[\varepsilon_t]=0\), then \[ \mathbb{E}[x_t]=\mathbb{E}[x_0], \qquad \mathrm{Var}(x_t)=\mathrm{Var}(x_0)+t\sigma^2 \to \infty. \]

The mean is constant, but dispersion grows without bound.
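The growing dispersion is easy to check by simulation. The sketch below (Python with NumPy, used purely for illustration; all values are made up) generates many independent random walks and verifies that the cross-sectional variance at date \(t\) grows like \(t\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, reps = 1.0, 200, 2000

# Many independent random walks: x_t = x_{t-1} + eps_t, with x_0 = 0.
eps = rng.normal(0.0, sigma, size=(reps, n))
x = np.cumsum(eps, axis=1)

# Cross-sectional variance at date t should be close to t * sigma^2.
var_t50 = x[:, 49].var()    # roughly 50
var_t200 = x[:, 199].var()  # roughly 200
```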

Random Walk

An AR(1) process with a unit root (\(\rho = 1\)) is also referred to as a random walk.

Order of Integration

Unit roots raise a natural question: how far is a series from stationarity?


A time series \(\{x_t\}\) is integrated of order \(d\), written \(x_t \sim I(d)\), if:

  • \(\Delta^d x_t\) is stationary, and
  • \(\Delta^{d-1} x_t\) is not,

where \(\Delta = 1 - L\) is the difference operator.


Interpretation:

  • \(I(0)\): stationary process
  • \(I(1)\): first differences are stationary
  • \(I(d)\): stationarity achieved after \(d\) differences
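As an illustrative sketch (Python/NumPy assumed), differencing a simulated \(I(1)\) series once recovers a stationary \(I(0)\) series:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = rng.normal(size=10_000)   # stationary I(0) shocks

x = np.cumsum(eps)              # x_t = accumulated shocks: an I(1) random walk
dx = np.diff(x)                 # Delta x_t: first differencing undoes the cumulation

# The difference recovers the shocks exactly (up to the dropped first one),
# so Delta x_t is I(0) even though x_t itself wanders.
recovered = np.allclose(dx, eps[1:])
```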

ARIMA Models

ARIMA stands for Autoregressive Integrated Moving Average.


If a process is integrated of order \(d\), \[ x_t \sim I(d), \] then \[ \Delta^d x_t \;\text{is stationary}. \]


An ARIMA\((p,d,q)\) model assumes that the differenced series admits an ARMA representation: \[ \Phi(L)\,\Delta^d x_t = \Theta(L)\,\varepsilon_t. \]


  • \(d\): order of integration
  • \(p,q\): short-run dynamics of the stationary component


When \(d=0\), ARIMA reduces to ARMA.
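For intuition, an ARIMA(1,1,0) path can be simulated by generating a stationary AR(1) and cumulating it once; differencing then recovers the AR(1) dynamics. A minimal sketch (Python/NumPy; \(\phi = 0.5\) is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(2)
phi, n = 0.5, 5000              # hypothetical AR(1) coefficient

# Simulate the stationary component: Delta x_t = phi * Delta x_{t-1} + eps_t
eps = rng.normal(size=n)
d = np.empty(n)
d[0] = eps[0]
for t in range(1, n):
    d[t] = phi * d[t - 1] + eps[t]

x = np.cumsum(d)                # integrate once: x_t follows an ARIMA(1,1,0)
dx = np.diff(x)                 # differencing recovers the AR(1) series

# Lag-1 sample autocorrelation of the differenced series should be near phi.
rho1 = np.corrcoef(dx[:-1], dx[1:])[0, 1]
```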

Testing for Unit Roots

In ARIMA models, the order of integration \(d\) captures the presence of unit roots in the data.


This requires assessing whether a time series contains zero, one, or multiple unit roots.


In its simplest form, consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]


Testing for a unit root amounts to testing: \[ H_0:\ \rho = 1 \qquad \text{vs.} \qquad H_1:\ \rho < 1. \]

Unit Root Asymptotics

Consider the unit root model \[ x_t = x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]


Iterating forward, \[ x_t = x_0 + \sum_{j=1}^t \varepsilon_j. \]

A unit root process is a partial sum of shocks.


By CLT,

\[ n^{-1/2} \sum_{i=1}^n \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]

This describes the behavior of the partial sum at a fixed point in time.

Unit Root Asymptotics

Let \[ t = \lfloor rn \rfloor, \qquad r \in [0,1]. \]


Then, \[ \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]

Consider the scaled process \[ n^{-1/2} x_{\lfloor rn \rfloor} = n^{-1/2} x_0 + \left( \frac{\lfloor rn \rfloor}{n} \right)^{1/2} \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i. \]


As \(n \to \infty\), the initial condition vanishes, \(n^{-1/2} x_0 \to 0\), and \(\lfloor rn \rfloor / n \to r\).

Therefore, for each fixed \(r\),

\[ n^{-1/2} x_{\lfloor rn \rfloor} \;\xrightarrow{d}\; N(0, r\sigma^2) \;\overset{d}{=}\; \sigma W(r), \qquad r \in [0,1], \]

where \(W(r)\) is standard Brownian motion. By the functional central limit theorem (Donsker's theorem), the convergence holds not just pointwise in \(r\) but jointly as a process.


A unit root process converges to Brownian motion after appropriate scaling.
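A quick Monte Carlo check of this scaling (Python/NumPy, for illustration only): the variance of \(n^{-1/2} x_{\lfloor rn \rfloor}\) should be close to \(r\sigma^2\), matching the marginal distribution of \(\sigma W(r)\):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, reps = 1.0, 1000, 4000

eps = rng.normal(0.0, sigma, size=(reps, n))
x = np.cumsum(eps, axis=1)      # random walks with sigma^2 = 1

# Variance of the scaled process at fraction r of the sample:
# should be close to r * sigma^2 for each r.
scaled_var = {r: (x[:, int(r * n) - 1] / np.sqrt(n)).var()
              for r in (0.25, 0.5, 1.0)}
```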

Asymptotics of the OLS Estimator (Unit Root Case)

Recall that the OLS estimator in the AR(1) regression \[ y_t = \rho y_{t-1} + \varepsilon_t \] satisfies \[ \hat\rho - \rho = \frac{\sum_{t=1}^n y_{t-1}\varepsilon_t}{\sum_{t=1}^n y_{t-1}^2}. \]


Under the unit root null hypothesis (\(\rho = 1\)), the following sample quantities satisfy:

\[ n^{-2}\sum_{t=1}^n y_{t-1}^2 \;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)^2\,dr, \]

\[ n^{-1}\sum_{t=1}^n y_{t-1}\varepsilon_t \;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)\,dW(r), \]

where \(W(r)\) is standard Brownian motion.

Limit Distribution of \(\hat\rho\) (Dickey–Fuller)

Under the null hypothesis \(\rho = 1\), \[ n(\hat\rho - 1) \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)}{\int_0^1 W(r)^2\,dr}. \]


The corresponding \(t\) statistic is \[ t_n = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} = \frac{\hat\rho - 1} {\hat\sigma\,(\sum_{t=1}^n y_{t-1}^2)^{-1/2}} \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)} {\left\{\int_0^1 W(r)^2\,dr\right\}^{1/2}}. \]


  • \(\hat\rho\) is super-consistent
  • the \(t\) statistic is not asymptotically normal
  • its limiting distribution is known as the Dickey–Fuller distribution
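The Dickey–Fuller distribution is easy to visualize by Monte Carlo. The sketch below (Python/NumPy, illustrative) simulates \(n(\hat\rho - 1)\) under the null and confirms its pronounced leftward skew:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 3000

stats = np.empty(reps)
for i in range(reps):
    eps = rng.normal(size=n)
    y = np.cumsum(eps)                        # random walk: the null rho = 1
    ylag, ycur = y[:-1], y[1:]
    rho_hat = (ylag @ ycur) / (ylag @ ylag)   # OLS slope, no intercept
    stats[i] = n * (rho_hat - 1.0)

# The simulated distribution is skewed to the left: both mean and median
# are negative, unlike an asymptotically normal statistic centered at zero.
med = np.median(stats)
q05 = np.quantile(stats, 0.05)                # near the tabulated -8.1
```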

Dickey–Fuller Test

So far, we considered the simplest Dickey–Fuller regression: \[ \Delta y_t = \gamma y_{t-1} + \varepsilon_t. \]


The Dickey–Fuller framework allows for additional deterministic terms:


(1) With intercept \[ \Delta y_t = \alpha_0 + \gamma y_{t-1} + \varepsilon_t. \]

(2) With intercept and linear trend \[ \Delta y_t = \alpha_0 + \alpha_1 t + \gamma y_{t-1} + \varepsilon_t. \]


In these regressions, the unit root is tested by \[ H_0:\ \gamma = 0. \]

Joint tests involving deterministic terms are possible, but empirical applications typically focus on testing for a unit root.

Augmented Dickey–Fuller

Recall that for an AR(1) process, the Dickey–Fuller regression is \[ \Delta y_t = \gamma y_{t-1} + \varepsilon_t. \]

In practice, time series often exhibit higher-order dynamics. Consider a general AR(\(p\)) model: \[ y_t = a_1 y_{t-1} + a_2 y_{t-2} + \cdots + a_p y_{t-p} + \varepsilon_t. \]


By rewriting the AR(\(p\)) model in first differences, we obtain \[ \Delta y_t = \gamma y_{t-1} + \sum_{i=1}^{p-1} \beta_i \Delta y_{t-i} + \varepsilon_t, \]

where \[ \gamma = \sum_{i=1}^p a_i - 1, \qquad \beta_i = - \sum_{j=i+1}^p a_j. \]


This reparameterization leads to the Augmented Dickey–Fuller regression, which preserves the Dickey–Fuller test for a unit root.


The null hypothesis remains \[ H_0:\ \gamma = 0. \]
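The algebra of this reparameterization can be verified numerically. A small sketch (Python/NumPy, with hypothetical AR(3) coefficients) checks that the ADF form reproduces the original recursion exactly:

```python
import numpy as np

a = np.array([0.5, 0.2, 0.1])   # hypothetical AR(3) coefficients a_1, a_2, a_3
p = len(a)

gamma = a.sum() - 1.0                                      # gamma = sum(a_i) - 1
beta = np.array([-a[i + 1:].sum() for i in range(p - 1)])  # beta_i = -sum_{j>i} a_j

# Simulate the AR(3) recursion y_t = a1 y_{t-1} + a2 y_{t-2} + a3 y_{t-3} + e_t
rng = np.random.default_rng(5)
eps = rng.normal(size=200)
y = np.zeros(203)
for t in range(3, 203):
    y[t] = a[0] * y[t - 1] + a[1] * y[t - 2] + a[2] * y[t - 3] + eps[t - 3]

dy = np.diff(y)   # dy[k] = Delta y_{k+1}

# ADF form: Delta y_t = gamma y_{t-1} + beta_1 Delta y_{t-1} + beta_2 Delta y_{t-2} + e_t
max_err = max(
    abs(dy[t - 1] - (gamma * y[t - 1] + beta[0] * dy[t - 2]
                     + beta[1] * dy[t - 3] + eps[t - 3]))
    for t in range(3, 203)
)
# max_err is zero up to floating-point rounding
```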

Multiple Unit Roots

The Augmented Dickey–Fuller (ADF) regression tests for the presence of a single unit root, that is, whether the autoregressive polynomial contains a factor \((1-L)\).


More generally, a time series may contain zero, one, or more unit roots.


A common informal approach proceeds by applying ADF tests to \(y_t\), then to \(\Delta y_t\), and so on, until the null is rejected.


While intuitive, this sequential use of ADF does not provide a size-controlled procedure for testing multiple unit roots.

A Formal Approach: Dickey–Pantula Testing

Dickey and Pantula (1987) propose a formal procedure for testing multiple unit roots.


The procedure is based on:

  • Nested hypotheses on the number of unit roots
  • Regressions constructed so that each null hypothesis has a known asymptotic distribution
  • A sequential testing strategy with correct size control


This approach provides a theoretically justified way to distinguish between different numbers of unit roots.

Testing for Multiple Unit Roots (Dickey–Pantula)

To distinguish between \(I(2)\), \(I(1)\), and \(I(0)\), Dickey–Pantula (1987) propose a sequential testing procedure based on second differences.


Step 1: Test for two unit roots (\(I(2)\))

Estimate \[ \Delta^2 y_t = \alpha_0 + \beta_1 \Delta y_{t-1} + \sum_{i=1}^p c_i \Delta^2 y_{t-i} + u_t, \]

and test \[ H_0:\ \beta_1 = 0. \]

  • Failure to reject \(\Rightarrow\ y_t \sim I(2)\).


Step 2: Test for one unit root (\(I(1)\))

If \(H_0\) is rejected, estimate \[ \Delta^2 y_t = \alpha_0 + \beta_1 \Delta y_{t-1} + \beta_2 y_{t-2} + \sum_{i=1}^p c_i \Delta^2 y_{t-i} + u_t, \]

and test \[ H_0:\ \beta_2 = 0. \]

  • Failure to reject \(\Rightarrow\ y_t \sim I(1)\).
  • Rejection \(\Rightarrow\ y_t \sim I(0)\).

Remarks on Dickey–Pantula Testing

The Dickey–Pantula procedure is designed to:

  • test nested hypotheses on the number of unit roots in sequence:
    first \(d \ge 2\), then \(d \ge 1\), concluding \(d = 0\) only after both nulls are rejected,
  • while controlling size at each step of the sequence.


Key assumptions

  • The data admit a finite-order AR representation (or a sufficiently accurate AR approximation)

    • any stationary ARMA process admits an infinite-order AR representation, and
    • truncating this representation at a sufficiently large lag order provides a good approximation to the true short-run dynamics.
  • Innovations are weakly dependent with finite variance

  • Deterministic terms (intercept, trend) are included consistently across steps


Practical interpretation

  • Dickey–Pantula determines the order of integration \(d\) needed for ARIMA modeling
  • Once \(d\) is selected, ARMA dynamics are modeled on the differenced series


In practice, attention is typically limited to \(d \in \{0,1,2\}\).

Example: ARIMA Modeling of U.S. Real GDP (FRED)

Augmented Dickey–Fuller Test (U.S. Real GDP)

We test for a unit root using the Dickey–Fuller test, which is a left-tailed test.

Null hypothesis (unit root): \[ H_0:\ \rho = 1 \]

Alternative hypothesis (stationarity): \[ H_1:\ \rho < 1 \]

Rejection occurs when the test statistic is more negative than the Dickey–Fuller critical value.

ADF test (levels, with intercept)
 Test statistic (tau):  -2.025 
 5% critical value:   -2.87 

 ADF test (first differences)
 Test statistic (tau):  -7.63 
 5% critical value:   -2.87 

In levels, \(-2.025 > -2.87\): the unit root null is not rejected. In first differences, \(-7.63 < -2.87\): the null is rejected. The series is therefore treated as \(I(1)\).

ACF and PACF of \(\Delta \log(\text{GDP})\)

Estimation

ARIMA(1,1,0)

fit <- arima(dgdp, order = c(1, 0, 0), include.mean = TRUE)  # AR(1) on the differenced series = ARIMA(1,1,0) in levels
fit

Call:
arima(x = dgdp, order = c(1, 0, 0), include.mean = TRUE)

Coefficients:
         ar1  intercept
      0.3575     0.0079
s.e.  0.0566     0.0008

sigma^2 estimated as 7.951e-05:  log likelihood = 897.78,  aic = -1789.55


t <- 1:length(gdp)
fit110 <- arima(gdp, order = c(1, 1, 0), xreg = t)
fit110 

Call:
arima(x = gdp, order = c(1, 1, 0), xreg = t)

Coefficients:
         ar1       t
      0.3575  0.0079
s.e.  0.0566  0.0008

sigma^2 estimated as 7.951e-05:  log likelihood = 897.77,  aic = -1789.55


  • What’s the average growth rate?

Estimation

ARIMA(0,1,2)

arima(dgdp, order = c(1, 0, 0)) # AR(1)
arima(dgdp, order = c(0, 0, 2)) # MA(2)
ARMAtoMA(ar=.3575, ma=0, 5) # prints psi-weights

Call:
arima(x = dgdp, order = c(1, 0, 0))

Coefficients:
         ar1  intercept
      0.3575     0.0079
s.e.  0.0566     0.0008

sigma^2 estimated as 7.951e-05:  log likelihood = 897.78,  aic = -1789.55

Call:
arima(x = dgdp, order = c(0, 0, 2))

Coefficients:
         ma1     ma2  intercept
      0.3054  0.2271     0.0079
s.e.  0.0593  0.0566     0.0008

sigma^2 estimated as 7.805e-05:  log likelihood = 900.27,  aic = -1792.54
[1] 0.357500000 0.127806250 0.045690734 0.016334438 0.005839561

Diagnostics

Model diagnostics assess whether the assumptions of the ARIMA model are plausibly satisfied by the data.

  • the residuals are serially uncorrelated
  • the residuals have constant variance
  • the residuals have mean zero
  • the residuals are normally distributed (not required, but good to have, esp. for MLE)


Common diagnostic tools:

  • Residual time series plot
    → checks stability, outliers, variance changes

  • ACF of residuals
    → checks remaining serial correlation

  • Ljung–Box test
    → formal test of joint residual autocorrelation \[ H_0:\ \rho_1 = \rho_2 = \cdots = \rho_h = 0,\] where \(\rho_k\) denotes the autocorrelation of the model residuals at lag \(k\).

  • Normal Q–Q plot
    → assesses whether residuals are approximately normal
    → systematic deviations, especially in the tails, matter


Diagnostics do not test whether the model is “true” — they assess whether the model is adequate.
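The Ljung–Box statistic itself is simple to compute. A self-contained sketch (Python/NumPy; the 5% critical value 18.31 for \(\chi^2_{10}\) is a standard tabulated constant, and the series here are simulated stand-ins for model residuals):

```python
import numpy as np

def ljung_box(resid, h):
    """Ljung-Box Q statistic for joint residual autocorrelation up to lag h."""
    n = len(resid)
    r = resid - resid.mean()
    denom = r @ r
    acf = np.array([(r[k:] @ r[:-k]) / denom for k in range(1, h + 1)])
    return n * (n + 2) * np.sum(acf**2 / (n - np.arange(1, h + 1)))

rng = np.random.default_rng(6)
white = rng.normal(size=500)          # adequate model: white-noise residuals
q_white = ljung_box(white, h=10)      # typically below the 5% cutoff 18.31

ar = np.empty(500)                    # inadequate model: autocorrelated residuals
ar[0] = white[0]
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + white[t]
q_ar = ljung_box(ar, h=10)            # far above 18.31: reject H0
```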


Forecasting with ARIMA Models

Once an ARIMA model has been estimated and passes basic diagnostics, it can be used to generate forecasts of future observations.


Forecasting answers a different question than estimation:

  • Estimation: What statistical dependence structure fits the data?
  • Forecasting: What does the fitted model imply about the future?


Forecasts are model-based extrapolations, not statements of truth.

A Key Assumption Behind Forecasting

All time series forecasting relies on a crucial assumption:


Past statistical regularities remain informative about the future.


In ARIMA models, this means:

  • the order of integration remains unchanged,
  • the short-run dynamics are stable over the forecast horizon.

What Does an ARIMA Forecast Deliver?

Given data observed up to time \(t\) and estimated parameters \(\hat\theta\), an ARIMA model implies a model-based forecast distribution for future observations: \[ x_{t+h} \;\sim\; \mathbb{P}_{\hat\theta}\!\left(\,\cdot \mid \mathscr{F}_t \right). \]

At time \(t\), the information set \(\mathscr{F}_t\) is observed. The fitted parameters \(\hat\theta\) define the forecasting model. Randomness arises from future shocks implied by the model.


From this model-based forecast distribution, we obtain:

  • a point forecast, defined as the model-based conditional mean \[ \hat{x}_{t+h\mid t} \;\equiv\; \mathbb{E}_{\hat\theta}\!\left(x_{t+h}\mid \mathscr{F}_t\right), \]

  • prediction intervals, constructed from quantiles of the model-based conditional distribution \[ x_{t+h}\mid \mathscr{F}_t \;\sim\; \mathbb{P}_{\hat\theta}. \]


Formally, a \((1-\alpha)\) prediction interval satisfies \[ \Pr_{\hat\theta}\!\left( x_{t+h}\in[\ell_h,u_h]\mid \mathscr{F}_t \right)=1-\alpha, \] where \(\ell_h\) and \(u_h\) are the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the model-based forecast distribution.

Forecasting U.S. Real GDP

We continue with the ARIMA model fitted to log real GDP. Let \[ y_t = \log(\text{GDP}_t). \]

The fitted model is \[ \Delta y_t = \mu + \phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\sigma^2). \]

Equivalently, \[ y_t = y_{t-1} + \mu + \phi\,\Delta y_{t-1} + \varepsilon_t. \]


Point Forecast

The point forecast at horizon \(h\) is defined as the conditional mean under the fitted model:

\[ \hat y_{t+h\mid t} \;\equiv\; \mathbb E_{\hat\theta}\!\left( y_{t+h} \mid \mathscr{F}_t \right), \] where \(\hat\theta=(\hat\mu,\hat\phi,\hat\sigma^2)\) denotes the estimated model parameters.


One-Step-Ahead Forecast

From the model equation, \[ y_{t+1} = y_t + \mu + \phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations given \(\mathscr{F}_t\) and \(\hat\theta\), and using \[ \mathbb E_{\hat\theta}(\varepsilon_{t+1}\mid\mathscr{F}_t)=0, \] we obtain the one-step-ahead point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\mu + \hat\phi\,\Delta y_t. \]
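The one-step recursion extends to any horizon by iterating the mean equation with future shocks set to zero. A minimal sketch (Python; \(\hat\phi = 0.3575\) is taken from the fitted output, while the short level series is hypothetical; \(\hat\mu = 0.0079(1-\hat\phi)\) because R's arima() reports the mean of \(\Delta y_t\), not the intercept \(\mu\)):

```python
import numpy as np

def arima110_forecast(y, mu, phi, h):
    """Point forecasts y_hat_{t+1|t}, ..., y_hat_{t+h|t} from the fitted
    ARIMA(1,1,0) mean equation: Delta y_t = mu + phi * Delta y_{t-1} + eps_t."""
    d_prev = y[-1] - y[-2]              # last observed Delta y_t
    level = y[-1]
    out = []
    for _ in range(h):
        d_next = mu + phi * d_prev      # E[Delta y_{t+j} | F_t], shocks set to 0
        level += d_next
        out.append(level)
        d_prev = d_next
    return np.array(out)

phi_hat = 0.3575                        # from the fitted output
mu_hat = 0.0079 * (1 - phi_hat)         # drift implied by a mean growth of 0.0079
y = np.array([10.00, 10.01, 10.03])     # hypothetical log-GDP levels
fc = arima110_forecast(y, mu_hat, phi_hat, h=2)
# fc[0] equals y[-1] + mu_hat + phi_hat * (y[-1] - y[-2])
```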

Forecast Uncertainty

Point forecasts summarize the center of the forecast distribution, but forecasting also requires quantifying uncertainty.

Consider the fitted ARIMA\((1,1,0)\) model \[ \Delta y_t = \hat\mu + \hat\phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\hat\sigma^2). \]

At time \(t\), the information set \(\mathscr F_t\) and the estimated parameters \(\hat\theta=(\hat\mu,\hat\phi,\hat\sigma^2)\) are treated as fixed.

Under this conditioning, randomness arises only from future innovations \[ \varepsilon_{t+1}, \varepsilon_{t+2}, \ldots \]


One-step-ahead forecast error

From the fitted model, \[ y_{t+1} = y_t + \hat\mu + \hat\phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations yields the point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\mu + \hat\phi\,\Delta y_t. \]

Therefore, the one-step-ahead forecast error is \[ y_{t+1} - \hat y_{t+1\mid t} = \varepsilon_{t+1}. \]

Two-step-ahead forecast error

Iterating the model forward one more step, the two-step-ahead forecast error satisfies \[ y_{t+2}-\hat y_{t+2\mid t} = (1+\hat\phi)\,\varepsilon_{t+1} + \varepsilon_{t+2}. \]

This illustrates how forecast uncertainty accumulates and propagates with the forecast horizon.

Prediction Intervals

Under the fitted ARIMA\((1,1,0)\) model, forecast errors are linear combinations of future innovations: \[ y_{t+h}-\hat y_{t+h\mid t} = \sum_{j=1}^h w_{h,j}\,\varepsilon_{t+j}, \] where the weights \(w_{h,j}\) depend on \(\hat\phi\).

Since the fitted model assumes \[ \varepsilon_{t+j} \sim \text{i.i.d. }(0,\hat\sigma^2), \] the model-based forecast distribution of \(y_{t+h}\) is centered at \(\hat y_{t+h\mid t}\), with variance increasing in the forecast horizon \(h\).


Under the maintained Gaussian assumption, the model-based forecast error is normally distributed: \[ y_{t+h} - \hat y_{t+h\mid t} \;\sim\; N\!\left( 0,\; \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) \right). \]

Therefore, a \((1-\alpha)\) model-based prediction interval for \(y_{t+h}\) is \[ \hat y_{t+h\mid t} \;\pm\; z_{1-\alpha/2} \sqrt{ \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) }, \] where \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal distribution.
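For the ARIMA(1,1,0) these weights have a closed form, \(w_{h,j} = 1 + \hat\phi + \cdots + \hat\phi^{\,h-j}\), so the interval is easy to compute. A sketch (Python; \(\hat\phi = 0.3575\) and \(\hat\sigma \approx 0.0089\) are taken from the fitted output):

```python
import numpy as np

def arima110_halfwidth(phi, sigma, h, z=1.96):
    """Half-width of the 95% prediction interval at horizon h for an
    ARIMA(1,1,0): error = sum_j w_{h,j} eps_{t+j}, w_{h,j} = sum_{k=0}^{h-j} phi^k."""
    j = np.arange(1, h + 1)
    w = (1 - phi ** (h - j + 1)) / (1 - phi)   # geometric partial sums
    return z * sigma * np.sqrt(np.sum(w**2))

phi_hat, sigma_hat = 0.3575, 0.0089
hw = [arima110_halfwidth(phi_hat, sigma_hat, h) for h in (1, 2, 3)]
# Interval widths grow with the horizon: hw[0] < hw[1] < hw[2]
```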

Out-of-Sample Forecast

Out-of-sample evaluation (h = 10 quarters)
RMSE: 0.0189 
MAE:  0.0142 

Forecast Evaluation

Forecasting ultimately concerns realized outcomes, not models.


A forecast made at time \(t\), \[ \hat y_{t+h\mid t}, \] is evaluated against what actually occurs, \[ y_{t+h}. \]


The realized forecast error is \[ e_{t+h} = y_{t+h} - \hat y_{t+h\mid t}. \]


This realized error combines all sources of forecast uncertainty:

  • future shocks,
  • parameter estimation error,
  • model misspecification.

Expected Squared Forecast Error

A single realized forecast error \(e_{t+h}\) is informative, but forecast evaluation requires averaging over repeated realizations.


The standard criterion is the mean squared forecast error (MSFE): \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right], \] where the expectation is taken unconditionally under the true data-generating process.


To understand the sources of forecast error, write \[ y_{t+h} - \hat y_{t+h\mid t} = \underbrace{ \bigl(y_{t+h}-y_{t+h}^\ast\bigr) }_{\text{unpredictable shocks}} + \underbrace{ \bigl(y_{t+h}^\ast-\hat y_{t+h\mid t}\bigr) }_{\text{forecast construction error}}, \] where \[ y_{t+h}^\ast \equiv \mathbb{E}\!\left(y_{t+h}\mid \mathscr F_t\right) \] denotes the ideal conditional forecast under the true data-generating process.


Squaring and taking expectations yields the familiar decomposition: \[ \mathrm{MSFE}(h) = \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}-y_{t+h}^\ast\right)^2 \right] }_{\text{irreducible uncertainty}} + \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right] }_{\text{bias and estimation error}}. \]

Bias–Variance Decomposition (Forecasting)

The second term in the MSFE decomposition, \[ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right], \] captures systematic forecast error.


To obtain a bias–variance decomposition, we work conditionally on the information set \(\mathscr F_t\).

\[ \mathbb E\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \mid \mathscr F_t \right] = \underbrace{ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 }_{\text{forecast bias}^2} + \underbrace{ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) }_{\text{forecast variance}}. \]

Interpretation

  • Bias reflects systematic misspecification relative to the true conditional mean
  • Variance reflects sensitivity of the forecast to sampling and parameter estimation

This decomposition highlights a fundamental bias–variance tradeoff in forecasting: simpler models may be biased but stable, while flexible models may reduce bias at the cost of higher variance.


Averaging over forecasting situations (i.e. over \(\mathscr F_t\)) yields the unconditional contribution to the MSFE: \[ \mathbb{E}\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \right] = \mathbb{E}\!\left[ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 \right] + \mathbb{E}\!\left[ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) \right]. \]

Empirical Forecast Evaluation

The MSFE is a population object: \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right]. \]


In practice, we observe one realized path. Expectations are therefore replaced by out-of-sample averages.


This leads to the root mean squared forecast error (RMSE): \[ \widehat{\mathrm{RMSE}}(h) = \sqrt{ \frac{1}{T} \sum_{t=1}^T \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 }. \]


Another commonly used criterion is the mean absolute error (MAE): \[ \widehat{\mathrm{MAE}}(h) = \frac{1}{T} \sum_{t=1}^T \left| y_{t+h} - \hat y_{t+h\mid t} \right|. \]

MAE penalizes errors linearly, rather than quadratically.


As a result:

  • RMSE is more sensitive to large mistakes,
  • MAE is more robust to outliers.
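A tiny numerical illustration (Python; the error values are made up) of how a single large miss moves RMSE much more than MAE:

```python
import numpy as np

errors = np.array([0.01, -0.01, 0.02, -0.015, 0.01])   # hypothetical forecast errors

rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# Append one large miss and recompute both criteria.
errors_out = np.append(errors, 0.10)
rmse_out = np.sqrt(np.mean(errors_out ** 2))
mae_out = np.mean(np.abs(errors_out))

# RMSE inflates by a larger factor than MAE because it squares the outlier.
rmse_ratio = rmse_out / rmse
mae_ratio = mae_out / mae
```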

Remark on Forecast Evaluation

The MSFE and RMSE are defined in terms of forecast errors, not the level of the time series.


For forecast evaluation to be meaningful, we implicitly require that:

  • the distribution of forecast errors is stable over time, and
  • sample averages of forecast errors converge to population averages.


What This Means in Practice

  • The level series \(y_t\) itself need not be stationary (e.g. \(y_t\) may be \(I(1)\) or \(I(d)\)).
  • What matters is that the forecast error process \[ e_{t+h} = y_{t+h} - \hat y_{t+h\mid t} \] is stationary and ergodic.

This is exactly what ARIMA models are designed to deliver.


Implication

RMSE is a valid evaluation criterion for ARIMA forecasts even when the underlying series is nonstationary, because forecast errors inherit stability from the differenced model.

ARIMA Models: A Historical Perspective on Forecasting

ARIMA models were developed as practical forecasting tools, not as structural or causal models.


Historically (Box–Jenkins):

  • the goal was to summarize dependence patterns in the data,
  • assess adequacy through diagnostics,
  • and generate short- to medium-horizon forecasts.


ARIMA models are therefore best understood as:

  • descriptive models of time-series dependence,
  • designed for prediction, not interpretation,
  • evaluated by forecast performance, not parameter meaning.


This explains why:

  • diagnostics focus on residual behavior,
  • prediction intervals are model-based,
  • and forecast evaluation relies on realized errors.