Nonstationary Time Series Models

Unit Roots, Integration, ARIMA Models

Natasha Kang

Xiamen University, Chow Institute

March 2026

What Do We Mean by Nonstationarity?

A time series \(\{x_t\}\) is nonstationary if its probabilistic behavior changes over time, so that:

  • its distribution is not invariant to time shifts, or
  • key features such as moments or dependence structure are time-dependent

Equivalently:

The statistical properties of the process are not stable over time.


Sources of Nonstationarity

Nonstationarity can arise from different mechanisms:

  • Stochastic nonstationarity
    Instability driven by the accumulation of random shocks (e.g. unit roots)

  • Deterministic nonstationarity
    Time-varying deterministic components (e.g. trends)

  • Other forms
    Structural breaks, regime changes, changing volatility


We will focus on stochastic nonstationarity here.

A Canonical Example of a Unit Root Process

Consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]


The associated autoregressive polynomial is \[ \Phi(z) = 1 - \rho z. \]


When \(\rho = 1\), we have \(\Phi(1) = 0\).


In this case, the autoregressive polynomial has a root equal to one. This is called a unit root.

Dynamics of a Unit Root Process

\[ x_t = x_{t-1} + \varepsilon_t. \]


Accumulation of shocks

Iterating forward, \[ x_t = x_0 + \sum_{j=1}^t \varepsilon_j. \]

Shocks enter the level cumulatively.


Permanent effects

A one-time shock \(\varepsilon_t\) affects the process at all future horizons: \[ x_{t+h} = x_t + \sum_{j=1}^h \varepsilon_{t+j}, \qquad h \ge 1. \]

There is no decay in the effect of past innovations.


No stable long-run mean

If \(\mathbb{E}[\varepsilon_t]=0\), then \[ \mathbb{E}[x_t]=\mathbb{E}[x_0], \qquad \mathrm{Var}(x_t)=\mathrm{Var}(x_0)+t\sigma^2 \to \infty. \]

The mean is constant, but dispersion grows without bound.
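The growing dispersion is easy to check by simulation. The sketch below (Python with NumPy, used purely for illustration; all values are made up) generates many independent random walks and verifies that the cross-sectional variance at date \(t\) grows like \(t\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, reps = 1.0, 200, 2000

# Many independent random walks: x_t = x_{t-1} + eps_t, with x_0 = 0.
eps = rng.normal(0.0, sigma, size=(reps, n))
x = np.cumsum(eps, axis=1)

# Cross-sectional variance at date t should be close to t * sigma^2.
var_t50 = x[:, 49].var()    # roughly 50
var_t200 = x[:, 199].var()  # roughly 200
```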

Random Walk

An AR(1) process with a unit root (\(\rho = 1\)) is also referred to as a random walk.

Order of Integration

Unit roots raise a natural question: how far is a series from stationarity?


A time series \(\{x_t\}\) is integrated of order \(d\), written \(x_t \sim I(d)\), if:

  • \(\Delta^d x_t\) is stationary, and
  • \(\Delta^{d-1} x_t\) is not,

where \(\Delta = 1 - L\) is the difference operator.


Interpretation:

  • \(I(0)\): stationary process
  • \(I(1)\): first differences are stationary
  • \(I(d)\): stationarity achieved after \(d\) differences
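As an illustrative sketch (Python/NumPy assumed), differencing a simulated \(I(1)\) series once recovers a stationary \(I(0)\) series:

```python
import numpy as np

rng = np.random.default_rng(1)
eps = rng.normal(size=10_000)   # stationary I(0) shocks

x = np.cumsum(eps)              # x_t = accumulated shocks: an I(1) random walk
dx = np.diff(x)                 # Delta x_t: first differencing undoes the cumulation

# The difference recovers the shocks exactly (up to the dropped first one),
# so Delta x_t is I(0) even though x_t itself wanders.
recovered = np.allclose(dx, eps[1:])
```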

ARIMA Models

ARIMA stands for Autoregressive Integrated Moving Average.


If a process is integrated of order \(d\), \[ x_t \sim I(d), \] then \[ \Delta^d x_t \;\text{is stationary}. \]


An ARIMA\((p,d,q)\) model assumes that the differenced series admits an ARMA representation: \[ \Phi(L)\,\Delta^d x_t = \Theta(L)\,\varepsilon_t. \]


  • \(d\): order of integration
  • \(p,q\): short-run dynamics of the stationary component


When \(d=0\), ARIMA reduces to ARMA.
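For intuition, an ARIMA(1,1,0) path can be simulated by generating a stationary AR(1) and cumulating it once; differencing then recovers the AR(1) dynamics. A minimal sketch (Python/NumPy; \(\phi = 0.5\) is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(2)
phi, n = 0.5, 5000              # hypothetical AR(1) coefficient

# Simulate the stationary component: Delta x_t = phi * Delta x_{t-1} + eps_t
eps = rng.normal(size=n)
d = np.empty(n)
d[0] = eps[0]
for t in range(1, n):
    d[t] = phi * d[t - 1] + eps[t]

x = np.cumsum(d)                # integrate once: x_t follows an ARIMA(1,1,0)
dx = np.diff(x)                 # differencing recovers the AR(1) series

# Lag-1 sample autocorrelation of the differenced series should be near phi.
rho1 = np.corrcoef(dx[:-1], dx[1:])[0, 1]
```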

Testing for Unit Roots

In ARIMA models, the order of integration \(d\) captures the presence of unit roots in the data.


This requires assessing whether a time series contains zero, one, or multiple unit roots.


In its simplest form, consider the AR(1) model \[ x_t = \rho x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]


Testing for a unit root amounts to testing: \[ H_0:\ \rho = 1 \qquad \text{vs.} \qquad H_1:\ \rho < 1. \]

Unit Root Asymptotics

Consider the unit root model \[ x_t = x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d.}(0,\sigma^2). \]


Iterating forward, \[ x_t = x_0 + \sum_{j=1}^t \varepsilon_j. \]

A unit root process is a partial sum of shocks.


By CLT,

\[ n^{-1/2} \sum_{i=1}^n \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]

This describes the behavior of the partial sum at a fixed point in time.

Unit Root Asymptotics

Let \[ t = \lfloor rn \rfloor, \qquad r \in [0,1]. \]


Then, \[ \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i \;\xrightarrow{d}\; N(0,\sigma^2). \]

Consider the scaled process \[ n^{-1/2} x_{\lfloor rn \rfloor} = n^{-1/2} x_0 + \left( \frac{\lfloor rn \rfloor}{n} \right)^{1/2} \lfloor rn \rfloor^{-1/2} \sum_{i=1}^{\lfloor rn \rfloor} \varepsilon_i. \]


As \(n \to \infty\), the initial condition vanishes, \(n^{-1/2} x_0 \to 0\), and \(\lfloor rn \rfloor / n \to r\).

Therefore, for each fixed \(r\),

\[ n^{-1/2} x_{\lfloor rn \rfloor} \;\xrightarrow{d}\; N(0, r\sigma^2) \;\overset{d}{=}\; \sigma W(r), \qquad r \in [0,1], \]

where \(W(r)\) is standard Brownian motion. By the functional central limit theorem (Donsker's theorem), the convergence holds not just pointwise in \(r\) but jointly as a process.


A unit root process converges to Brownian motion after appropriate scaling.
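A quick Monte Carlo check of this scaling (Python/NumPy, for illustration only): the variance of \(n^{-1/2} x_{\lfloor rn \rfloor}\) should be close to \(r\sigma^2\), matching the marginal distribution of \(\sigma W(r)\):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, reps = 1.0, 1000, 4000

eps = rng.normal(0.0, sigma, size=(reps, n))
x = np.cumsum(eps, axis=1)      # random walks with sigma^2 = 1

# Variance of the scaled process at fraction r of the sample:
# should be close to r * sigma^2 for each r.
scaled_var = {r: (x[:, int(r * n) - 1] / np.sqrt(n)).var()
              for r in (0.25, 0.5, 1.0)}
```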

Asymptotics of the OLS Estimator (Unit Root Case)

Recall that the OLS estimator in the AR(1) regression \[ y_t = \rho y_{t-1} + \varepsilon_t \] satisfies \[ \hat\rho - \rho = \frac{\sum_{t=1}^n y_{t-1}\varepsilon_t}{\sum_{t=1}^n y_{t-1}^2}. \]


Under the unit root null hypothesis (\(\rho = 1\)), the following sample quantities satisfy:

\[ n^{-2}\sum_{t=1}^n y_{t-1}^2 \;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)^2\,dr, \]

\[ n^{-1}\sum_{t=1}^n y_{t-1}\varepsilon_t \;\xrightarrow{d}\; \sigma^2 \int_0^1 W(r)\,dW(r), \]

where \(W(r)\) is standard Brownian motion.

Limit Distribution of \(\hat\rho\) (Dickey–Fuller)

Under the null hypothesis \(\rho = 1\), \[ n(\hat\rho - 1) \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)}{\int_0^1 W(r)^2\,dr}. \]


The corresponding \(t\) statistic is \[ t_n = \frac{\hat\rho - 1}{\hat\sigma_{\hat\rho}} = \frac{\hat\rho - 1} {\hat\sigma\,(\sum_{t=1}^n y_{t-1}^2)^{-1/2}} \;\xrightarrow{d}\; \frac{\int_0^1 W(r)\,dW(r)} {\left\{\int_0^1 W(r)^2\,dr\right\}^{1/2}}. \]


  • \(\hat\rho\) is super-consistent
  • the \(t\) statistic is not asymptotically normal
  • its limiting distribution is known as the Dickey–Fuller distribution
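The Dickey–Fuller distribution is easy to visualize by Monte Carlo. The sketch below (Python/NumPy, illustrative) simulates \(n(\hat\rho - 1)\) under the null and confirms its pronounced leftward skew:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 3000

stats = np.empty(reps)
for i in range(reps):
    eps = rng.normal(size=n)
    y = np.cumsum(eps)                        # random walk: the null rho = 1
    ylag, ycur = y[:-1], y[1:]
    rho_hat = (ylag @ ycur) / (ylag @ ylag)   # OLS slope, no intercept
    stats[i] = n * (rho_hat - 1.0)

# The simulated distribution is skewed to the left: both mean and median
# are negative, unlike an asymptotically normal statistic centered at zero.
med = np.median(stats)
q05 = np.quantile(stats, 0.05)                # near the tabulated -8.1
```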

Dickey–Fuller Test

So far, we considered the simplest Dickey–Fuller regression: \[ \Delta y_t = \gamma y_{t-1} + \varepsilon_t. \]


The Dickey–Fuller framework allows for additional deterministic terms:


(1) With intercept \[ \Delta y_t = \alpha_0 + \gamma y_{t-1} + \varepsilon_t. \]

(2) With intercept and linear trend \[ \Delta y_t = \alpha_0 + \alpha_1 t + \gamma y_{t-1} + \varepsilon_t. \]


In these regressions, the unit root is tested by \[ H_0:\ \gamma = 0. \]

Joint tests involving deterministic terms are possible, but empirical applications typically focus on testing for a unit root.

Augmented Dickey–Fuller

Recall that for an AR(1) process, the Dickey–Fuller regression is \[ \Delta y_t = \gamma y_{t-1} + \varepsilon_t. \]

In practice, time series often exhibit higher-order dynamics. Consider a general AR(\(p\)) model: \[ y_t = a_1 y_{t-1} + a_2 y_{t-2} + \cdots + a_p y_{t-p} + \varepsilon_t. \]


By rewriting the AR(\(p\)) model in first differences, we obtain \[ \Delta y_t = \gamma y_{t-1} + \sum_{i=1}^{p-1} \beta_i \Delta y_{t-i} + \varepsilon_t, \]

where \[ \gamma = \sum_{i=1}^p a_i - 1, \qquad \beta_i = - \sum_{j=i+1}^p a_j. \]


This reparameterization leads to the Augmented Dickey–Fuller regression, which preserves the Dickey–Fuller test for a unit root.


The null hypothesis remains \[ H_0:\ \gamma = 0. \]
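The algebra of this reparameterization can be verified numerically. A small sketch (Python/NumPy, with hypothetical AR(3) coefficients) checks that the ADF form reproduces the original recursion exactly:

```python
import numpy as np

a = np.array([0.5, 0.2, 0.1])   # hypothetical AR(3) coefficients a_1, a_2, a_3
p = len(a)

gamma = a.sum() - 1.0                                      # gamma = sum(a_i) - 1
beta = np.array([-a[i + 1:].sum() for i in range(p - 1)])  # beta_i = -sum_{j>i} a_j

# Simulate the AR(3) recursion y_t = a1 y_{t-1} + a2 y_{t-2} + a3 y_{t-3} + e_t
rng = np.random.default_rng(5)
eps = rng.normal(size=200)
y = np.zeros(203)
for t in range(3, 203):
    y[t] = a[0] * y[t - 1] + a[1] * y[t - 2] + a[2] * y[t - 3] + eps[t - 3]

dy = np.diff(y)   # dy[k] = Delta y_{k+1}

# ADF form: Delta y_t = gamma y_{t-1} + beta_1 Delta y_{t-1} + beta_2 Delta y_{t-2} + e_t
max_err = max(
    abs(dy[t - 1] - (gamma * y[t - 1] + beta[0] * dy[t - 2]
                     + beta[1] * dy[t - 3] + eps[t - 3]))
    for t in range(3, 203)
)
# max_err is zero up to floating-point rounding
```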

Multiple Unit Roots

The Augmented Dickey–Fuller (ADF) regression tests for the presence of a single unit root, that is, whether the autoregressive polynomial contains a factor \((1-L)\).


More generally, a time series may contain zero, one, or more unit roots.


A common informal approach proceeds by applying ADF tests to \(y_t\), then to \(\Delta y_t\), and so on, until the null is rejected.


While intuitive, this sequential use of ADF does not provide a size-controlled procedure for testing multiple unit roots.

A Formal Approach: Dickey–Pantula Testing

Dickey and Pantula (1987) propose a formal procedure for testing multiple unit roots.


The procedure is based on:

  • Nested hypotheses on the number of unit roots
  • Regressions constructed so that each null hypothesis has a known asymptotic distribution
  • A sequential testing strategy with correct size control


This approach provides a theoretically justified way to distinguish between different numbers of unit roots.

Testing for Multiple Unit Roots (Dickey–Pantula)

To distinguish between \(I(2)\), \(I(1)\), and \(I(0)\), Dickey–Pantula (1987) propose a sequential testing procedure based on second differences.


Step 1: Test for two unit roots (\(I(2)\))

Estimate \[ \Delta^2 y_t = \alpha_0 + \beta_1 \Delta y_{t-1} + \sum_{i=1}^p c_i \Delta^2 y_{t-i} + u_t, \]

and test \[ H_0:\ \beta_1 = 0. \]

  • Failure to reject \(\Rightarrow\ y_t \sim I(2)\).


Step 2: Test for one unit root (\(I(1)\))

If \(H_0\) is rejected, estimate \[ \Delta^2 y_t = \alpha_0 + \beta_1 \Delta y_{t-1} + \beta_2 y_{t-2} + \sum_{i=1}^p c_i \Delta^2 y_{t-i} + u_t, \]

and test \[ H_0:\ \beta_2 = 0. \]

  • Failure to reject \(\Rightarrow\ y_t \sim I(1)\).
  • Rejection \(\Rightarrow\ y_t \sim I(0)\).

Remarks on Dickey–Pantula Testing

The Dickey–Pantula procedure is designed to:

  • test nested hypotheses on the number of unit roots in sequence:
    first \(d \ge 2\), then \(d \ge 1\), concluding \(d = 0\) only after both nulls are rejected,
  • while controlling size at each step of the sequence.


Key assumptions

  • The data admit a finite-order AR representation (or a sufficiently accurate AR approximation)

    • any stationary ARMA process admits an infinite-order AR representation, and
    • truncating this representation at a sufficiently large lag order provides a good approximation to the true short-run dynamics.
  • Innovations are weakly dependent with finite variance

  • Deterministic terms (intercept, trend) are included consistently across steps


Practical interpretation

  • Dickey–Pantula determines the order of integration \(d\) needed for ARIMA modeling
  • Once \(d\) is selected, ARMA dynamics are modeled on the differenced series


In practice, attention is typically limited to \(d \in \{0,1,2\}\).

Example: ARIMA Modeling of U.S. Real GDP (FRED)

Augmented Dickey–Fuller Test (U.S. Real GDP)

We test for a unit root using the Dickey–Fuller test, which is a left-tailed test.

Null hypothesis (unit root): \[ H_0:\ \rho = 1 \]

Alternative hypothesis (stationarity): \[ H_1:\ \rho < 1 \]

Rejection occurs when the test statistic is more negative than the Dickey–Fuller critical value.

ADF test (levels, with intercept)
 Test statistic (tau):  -2.025 
 5% critical value:   -2.87 

 ADF test (first differences)
 Test statistic (tau):  -7.63 
 5% critical value:   -2.87 

In levels, \(-2.025 > -2.87\): the unit root null is not rejected. In first differences, \(-7.63 < -2.87\): the null is rejected. The series is therefore treated as \(I(1)\).

ACF and PACF of \(\Delta \log(\text{GDP})\)

Estimation

ARIMA(1,1,0)

fit <- arima(dgdp, order = c(1, 0, 0), include.mean = TRUE)  # AR(1) on the differenced series = ARIMA(1,1,0) in levels
fit

Call:
arima(x = dgdp, order = c(1, 0, 0), include.mean = TRUE)

Coefficients:
         ar1  intercept
      0.3575     0.0079
s.e.  0.0566     0.0008

sigma^2 estimated as 7.951e-05:  log likelihood = 897.78,  aic = -1789.55


t <- 1:length(gdp)
fit110 <- arima(gdp, order = c(1, 1, 0), xreg = t)
fit110 

Call:
arima(x = gdp, order = c(1, 1, 0), xreg = t)

Coefficients:
         ar1       t
      0.3575  0.0079
s.e.  0.0566  0.0008

sigma^2 estimated as 7.951e-05:  log likelihood = 897.77,  aic = -1789.55


  • What’s the average growth rate?

Estimation

ARIMA(0,1,2)

arima(dgdp, order = c(1, 0, 0)) # AR(1)
arima(dgdp, order = c(0, 0, 2)) # MA(2)
ARMAtoMA(ar=.3575, ma=0, 5) # prints psi-weights

Call:
arima(x = dgdp, order = c(1, 0, 0))

Coefficients:
         ar1  intercept
      0.3575     0.0079
s.e.  0.0566     0.0008

sigma^2 estimated as 7.951e-05:  log likelihood = 897.78,  aic = -1789.55

Call:
arima(x = dgdp, order = c(0, 0, 2))

Coefficients:
         ma1     ma2  intercept
      0.3054  0.2271     0.0079
s.e.  0.0593  0.0566     0.0008

sigma^2 estimated as 7.805e-05:  log likelihood = 900.27,  aic = -1792.54
[1] 0.357500000 0.127806250 0.045690734 0.016334438 0.005839561

Diagnostics

Model diagnostics assess whether the assumptions of the ARIMA model are plausibly satisfied by the data.

  • the residuals are serially uncorrelated
  • the residuals have constant variance
  • the residuals have mean zero
  • the residuals are normally distributed (not required, but good to have, esp. for MLE)


Common diagnostic tools:

  • Residual time series plot
    → checks stability, outliers, variance changes

  • ACF of residuals
    → checks remaining serial correlation

  • Ljung–Box test
    → formal test of joint residual autocorrelation \[ H_0:\ \rho_1 = \rho_2 = \cdots = \rho_h = 0,\] where \(\rho_k\) denotes the autocorrelation of the model residuals at lag \(k\).

  • Normal Q–Q plot
    → assesses whether residuals are approximately normal
    → systematic deviations, especially in the tails, matter


Diagnostics do not test whether the model is “true” — they assess whether the model is adequate.
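The Ljung–Box statistic itself is simple to compute. A self-contained sketch (Python/NumPy; the 5% critical value 18.31 for \(\chi^2_{10}\) is a standard tabulated constant, and the series here are simulated stand-ins for model residuals):

```python
import numpy as np

def ljung_box(resid, h):
    """Ljung-Box Q statistic for joint residual autocorrelation up to lag h."""
    n = len(resid)
    r = resid - resid.mean()
    denom = r @ r
    acf = np.array([(r[k:] @ r[:-k]) / denom for k in range(1, h + 1)])
    return n * (n + 2) * np.sum(acf**2 / (n - np.arange(1, h + 1)))

rng = np.random.default_rng(6)
white = rng.normal(size=500)          # adequate model: white-noise residuals
q_white = ljung_box(white, h=10)      # typically below the 5% cutoff 18.31

ar = np.empty(500)                    # inadequate model: autocorrelated residuals
ar[0] = white[0]
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + white[t]
q_ar = ljung_box(ar, h=10)            # far above 18.31: reject H0
```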


Forecasting with ARIMA Models

Once an ARIMA model has been estimated and passes basic diagnostics, it can be used to generate forecasts of future observations.


Forecasting answers a different question than estimation:

  • Estimation: What statistical dependence structure fits the data?
  • Forecasting: What does the fitted model imply about the future?


Forecasts are model-based extrapolations, not statements of truth.

A Key Assumption Behind Forecasting

All time series forecasting relies on a crucial assumption:


Past statistical regularities remain informative about the future.


In ARIMA models, this means:

  • the order of integration remains unchanged,
  • the short-run dynamics are stable over the forecast horizon.

What Does an ARIMA Forecast Deliver?

Given data observed up to time \(t\) and estimated parameters \(\hat\theta\), an ARIMA model implies a model-based forecast distribution for future observations: \[ x_{t+h} \;\sim\; \mathbb{P}_{\hat\theta}\!\left(\,\cdot \mid \mathscr{F}_t \right). \]

At time \(t\), the information set \(\mathscr{F}_t\) is observed. The fitted parameters \(\hat\theta\) define the forecasting model. Randomness arises from future shocks implied by the model.


From this model-based forecast distribution, we obtain:

  • a point forecast, defined as the model-based conditional mean \[ \hat{x}_{t+h\mid t} \;\equiv\; \mathbb{E}_{\hat\theta}\!\left(x_{t+h}\mid \mathscr{F}_t\right), \]

  • prediction intervals, constructed from quantiles of the model-based conditional distribution \[ x_{t+h}\mid \mathscr{F}_t \;\sim\; \mathbb{P}_{\hat\theta}. \]


Formally, a \((1-\alpha)\) prediction interval satisfies \[ \Pr_{\hat\theta}\!\left( x_{t+h}\in[\ell_h,u_h]\mid \mathscr{F}_t \right)=1-\alpha, \] where \(\ell_h\) and \(u_h\) are the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the model-based forecast distribution.

Forecasting U.S. Real GDP

We continue with the ARIMA model fitted to log real GDP. Let \[ y_t = \log(\text{GDP}_t). \]

The fitted model is \[ \Delta y_t = \mu + \phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\sigma^2). \]

Equivalently, \[ y_t = y_{t-1} + \mu + \phi\,\Delta y_{t-1} + \varepsilon_t. \]


Point Forecast

The point forecast at horizon \(h\) is defined as the conditional mean under the fitted model:

\[ \hat y_{t+h\mid t} \;\equiv\; \mathbb E_{\hat\theta}\!\left( y_{t+h} \mid \mathscr{F}_t \right), \] where \(\hat\theta=(\hat\mu,\hat\phi,\hat\sigma^2)\) denotes the estimated model parameters.


One-Step-Ahead Forecast

From the model equation, \[ y_{t+1} = y_t + \mu + \phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations given \(\mathscr{F}_t\) and \(\hat\theta\), and using \[ \mathbb E_{\hat\theta}(\varepsilon_{t+1}\mid\mathscr{F}_t)=0, \] we obtain the one-step-ahead point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\mu + \hat\phi\,\Delta y_t. \]
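The one-step recursion extends to any horizon by iterating the mean equation with future shocks set to zero. A minimal sketch (Python; \(\hat\phi = 0.3575\) is taken from the fitted output, while the short level series is hypothetical; \(\hat\mu = 0.0079(1-\hat\phi)\) because R's arima() reports the mean of \(\Delta y_t\), not the intercept \(\mu\)):

```python
import numpy as np

def arima110_forecast(y, mu, phi, h):
    """Point forecasts y_hat_{t+1|t}, ..., y_hat_{t+h|t} from the fitted
    ARIMA(1,1,0) mean equation: Delta y_t = mu + phi * Delta y_{t-1} + eps_t."""
    d_prev = y[-1] - y[-2]              # last observed Delta y_t
    level = y[-1]
    out = []
    for _ in range(h):
        d_next = mu + phi * d_prev      # E[Delta y_{t+j} | F_t], shocks set to 0
        level += d_next
        out.append(level)
        d_prev = d_next
    return np.array(out)

phi_hat = 0.3575                        # from the fitted output
mu_hat = 0.0079 * (1 - phi_hat)         # drift implied by a mean growth of 0.0079
y = np.array([10.00, 10.01, 10.03])     # hypothetical log-GDP levels
fc = arima110_forecast(y, mu_hat, phi_hat, h=2)
# fc[0] equals y[-1] + mu_hat + phi_hat * (y[-1] - y[-2])
```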

Forecast Uncertainty

Point forecasts summarize the center of the forecast distribution, but forecasting also requires quantifying uncertainty.

Consider the fitted ARIMA\((1,1,0)\) model \[ \Delta y_t = \hat\mu + \hat\phi\,\Delta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. }(0,\hat\sigma^2). \]

At time \(t\), the information set \(\mathscr F_t\) and the estimated parameters \(\hat\theta=(\hat\mu,\hat\phi,\hat\sigma^2)\) are treated as fixed.

Under this conditioning, randomness arises only from future innovations \[ \varepsilon_{t+1}, \varepsilon_{t+2}, \ldots \]


One-step-ahead forecast error

From the fitted model, \[ y_{t+1} = y_t + \hat\mu + \hat\phi\,\Delta y_t + \varepsilon_{t+1}, \] taking conditional expectations yields the point forecast \[ \hat y_{t+1\mid t} = y_t + \hat\mu + \hat\phi\,\Delta y_t. \]

Therefore, the one-step-ahead forecast error is \[ y_{t+1} - \hat y_{t+1\mid t} = \varepsilon_{t+1}. \]

Two-step-ahead forecast error

Iterating the model forward one more step, the two-step-ahead forecast error satisfies \[ y_{t+2}-\hat y_{t+2\mid t} = (1+\hat\phi)\,\varepsilon_{t+1} + \varepsilon_{t+2}. \]

This illustrates how forecast uncertainty accumulates and propagates with the forecast horizon.

Prediction Intervals

Under the fitted ARIMA\((1,1,0)\) model, forecast errors are linear combinations of future innovations: \[ y_{t+h}-\hat y_{t+h\mid t} = \sum_{j=1}^h w_{h,j}\,\varepsilon_{t+j}, \] where the weights \(w_{h,j}\) depend on \(\hat\phi\).

Since the fitted model assumes \[ \varepsilon_{t+j} \sim \text{i.i.d. }(0,\hat\sigma^2), \] the model-based forecast distribution of \(y_{t+h}\) is centered at \(\hat y_{t+h\mid t}\), with variance increasing in the forecast horizon \(h\).


Under the maintained Gaussian assumption, the model-based forecast error is normally distributed: \[ y_{t+h} - \hat y_{t+h\mid t} \;\sim\; N\!\left( 0,\; \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) \right). \]

Therefore, a \((1-\alpha)\) model-based prediction interval for \(y_{t+h}\) is \[ \hat y_{t+h\mid t} \;\pm\; z_{1-\alpha/2} \sqrt{ \mathrm{Var}_{\hat\theta}\!\left(y_{t+h}-\hat y_{t+h\mid t}\right) }, \] where \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal distribution.
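For the ARIMA(1,1,0) these weights have a closed form, \(w_{h,j} = 1 + \hat\phi + \cdots + \hat\phi^{\,h-j}\), so the interval is easy to compute. A sketch (Python; \(\hat\phi = 0.3575\) and \(\hat\sigma \approx 0.0089\) are taken from the fitted output):

```python
import numpy as np

def arima110_halfwidth(phi, sigma, h, z=1.96):
    """Half-width of the 95% prediction interval at horizon h for an
    ARIMA(1,1,0): error = sum_j w_{h,j} eps_{t+j}, w_{h,j} = sum_{k=0}^{h-j} phi^k."""
    j = np.arange(1, h + 1)
    w = (1 - phi ** (h - j + 1)) / (1 - phi)   # geometric partial sums
    return z * sigma * np.sqrt(np.sum(w**2))

phi_hat, sigma_hat = 0.3575, 0.0089
hw = [arima110_halfwidth(phi_hat, sigma_hat, h) for h in (1, 2, 3)]
# Interval widths grow with the horizon: hw[0] < hw[1] < hw[2]
```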

Out-of-Sample Forecast

Out-of-sample evaluation (h = 10 quarters)
RMSE: 0.0189 
MAE:  0.0142 

Forecast Evaluation

Forecasting ultimately concerns realized outcomes, not models.


A forecast made at time \(t\), \[ \hat y_{t+h\mid t}, \] is evaluated against what actually occurs, \[ y_{t+h}. \]


The realized forecast error is \[ e_{t+h} = y_{t+h} - \hat y_{t+h\mid t}. \]


This realized error combines all sources of forecast uncertainty:

  • future shocks,
  • parameter estimation error,
  • model misspecification.

Expected Squared Forecast Error

A single realized forecast error \(e_{t+h}\) is informative, but forecast evaluation requires averaging over repeated realizations.


The standard criterion is the mean squared forecast error (MSFE): \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right], \] where the expectation is taken unconditionally under the true data-generating process.


To understand the sources of forecast error, write \[ y_{t+h} - \hat y_{t+h\mid t} = \underbrace{ \bigl(y_{t+h}-y_{t+h}^\ast\bigr) }_{\text{unpredictable shocks}} + \underbrace{ \bigl(y_{t+h}^\ast-\hat y_{t+h\mid t}\bigr) }_{\text{forecast construction error}}, \] where \[ y_{t+h}^\ast \equiv \mathbb{E}\!\left(y_{t+h}\mid \mathscr F_t\right) \] denotes the ideal conditional forecast under the true data-generating process.


Squaring and taking expectations yields the familiar decomposition: \[ \mathrm{MSFE}(h) = \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}-y_{t+h}^\ast\right)^2 \right] }_{\text{irreducible uncertainty}} + \underbrace{ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right] }_{\text{bias and estimation error}}. \]

Bias–Variance Decomposition (Forecasting)

The second term in the MSFE decomposition, \[ \mathbb{E}\!\left[ \left(y_{t+h}^\ast-\hat y_{t+h\mid t}\right)^2 \right], \] captures systematic forecast error.


To obtain a bias–variance decomposition, we work conditionally on the information set \(\mathscr F_t\).

\[ \mathbb E\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \mid \mathscr F_t \right] = \underbrace{ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 }_{\text{forecast bias}^2} + \underbrace{ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) }_{\text{forecast variance}}. \]

Interpretation

  • Bias reflects systematic misspecification relative to the true conditional mean
  • Variance reflects sensitivity of the forecast to sampling and parameter estimation

This decomposition highlights a fundamental bias–variance tradeoff in forecasting: simpler models may be biased but stable, while flexible models may reduce bias at the cost of higher variance.


Averaging over forecasting situations (i.e. over \(\mathscr F_t\)) yields the unconditional contribution to the MSFE: \[ \mathbb{E}\!\left[ (y^*_{t+h}-\hat y_{t+h\mid t})^2 \right] = \mathbb{E}\!\left[ \left( \mathbb E[\hat y_{t+h\mid t}\mid \mathscr F_t]-y^*_{t+h} \right)^2 \right] + \mathbb{E}\!\left[ \mathrm{Var}(\hat y_{t+h\mid t}\mid \mathscr F_t) \right]. \]

Empirical Forecast Evaluation

The MSFE is a population object: \[ \mathrm{MSFE}(h) = \mathbb{E}\!\left[ \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 \right]. \]


In practice, we observe one realized path. Expectations are therefore replaced by out-of-sample averages.


This leads to the root mean squared forecast error (RMSE): \[ \widehat{\mathrm{RMSE}}(h) = \sqrt{ \frac{1}{T} \sum_{t=1}^T \left( y_{t+h} - \hat y_{t+h\mid t} \right)^2 }. \]


Another commonly used criterion is the mean absolute error (MAE): \[ \widehat{\mathrm{MAE}}(h) = \frac{1}{T} \sum_{t=1}^T \left| y_{t+h} - \hat y_{t+h\mid t} \right|. \]

MAE penalizes errors linearly, rather than quadratically.


As a result:

  • RMSE is more sensitive to large mistakes,
  • MAE is more robust to outliers.
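A tiny numerical illustration (Python; the error values are made up) of how a single large miss moves RMSE much more than MAE:

```python
import numpy as np

errors = np.array([0.01, -0.01, 0.02, -0.015, 0.01])   # hypothetical forecast errors

rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# Append one large miss and recompute both criteria.
errors_out = np.append(errors, 0.10)
rmse_out = np.sqrt(np.mean(errors_out ** 2))
mae_out = np.mean(np.abs(errors_out))

# RMSE inflates by a larger factor than MAE because it squares the outlier.
rmse_ratio = rmse_out / rmse
mae_ratio = mae_out / mae
```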

Remark on Forecast Evaluation

The MSFE and RMSE are defined in terms of forecast errors, not the level of the time series.


For forecast evaluation to be meaningful, we implicitly require that:

  • the distribution of forecast errors is stable over time, and
  • sample averages of forecast errors converge to population averages.


What This Means in Practice

  • The level series \(y_t\) itself need not be stationary (e.g. \(y_t\) may be \(I(1)\) or \(I(d)\)).
  • What matters is that the forecast error process \[ e_{t+h} = y_{t+h} - \hat y_{t+h\mid t} \] is stationary and ergodic.

This is exactly what ARIMA models are designed to deliver.


Implication

RMSE is a valid evaluation criterion for ARIMA forecasts even when the underlying series is nonstationary, because forecast errors inherit stability from the differenced model.

ARIMA Models: A Historical Perspective on Forecasting

ARIMA models were developed as practical forecasting tools, not as structural or causal models.


Historically (Box–Jenkins):

  • the goal was to summarize dependence patterns in the data,
  • assess adequacy through diagnostics,
  • and generate short- to medium-horizon forecasts.


ARIMA models are therefore best understood as:

  • descriptive models of time-series dependence,
  • designed for prediction, not interpretation,
  • evaluated by forecast performance, not parameter meaning.


This explains why:

  • diagnostics focus on residual behavior,
  • prediction intervals are model-based,
  • and forecast evaluation relies on realized errors.