Estimation and Inference

Natasha Kang

Xiamen University, Chow Institute

March, 2026

The Logic of Estimation and Inference

Estimation Problem for ARMA Models

In the population, an ARMA(\(p,q\)) process satisfies
\[ \phi(L)x_t = \theta(L) w_t, \qquad w_t \sim wn(0,\sigma_w^2), \] with unknown parameters
\[ \phi_1,\ldots,\phi_p,\quad \theta_1,\ldots,\theta_q,\quad \sigma_w^2. \]


We observe only a finite sample \[ (x_1,\ldots,x_n), \] and the shocks \(\{w_t\}\) are unobserved.


Estimation consists of constructing \[ (\hat\phi_1,\ldots,\hat\phi_p,\; \hat\theta_1,\ldots,\hat\theta_q,\; \hat\sigma_w^2) \] using the observed data.

Roadmap for Estimation and Inference

  1. Define estimators
    • OLS for AR models
    • (Quasi-)MLE for MA and ARMA models
  2. Ask large-sample questions
    • Do estimators converge to the true parameters?
    • What is their sampling distribution?
  3. Introduce required conditions
    • stationarity (well-defined population moments)
    • ergodicity and dependence conditions for LLN and CLT

These concepts are introduced only as needed to justify inference.

Estimation of AR Models via OLS

Consider an AR(\(p\)) model \[ x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t. \]

This model has a regression representation:

  • regressand: \(x_t\)
  • regressors: \((x_{t-1},\ldots,x_{t-p})\)
  • error term: \(w_t\)

We can estimate \((\phi_1,\ldots,\phi_p)\) by ordinary least squares

\[ (\hat\phi_1,\ldots,\hat\phi_p) = \arg\min_{\phi_1,\ldots,\phi_p} \sum_{t=p+1}^n \left( x_t - \phi_1 x_{t-1} - \cdots - \phi_p x_{t-p} \right)^2. \]

Example: AR(1)

\[ x_t = \phi x_{t-1} + w_t. \]

The OLS estimator of \(\phi\) is \[ \hat\phi = \arg\min_{\phi} \sum_{t=2}^n (x_t - \phi x_{t-1})^2. \]

Solving the minimization problem yields \[ \hat\phi = \frac{\sum_{t=2}^n x_{t-1} x_t} {\sum_{t=2}^n x_{t-1}^2}. \]
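The closed-form estimator is easy to check on simulated data. A minimal sketch (the choices \(\phi = 0.6\), \(n = 5000\), and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate one AR(1) path: x_t = phi * x_{t-1} + w_t, with w_t ~ N(0, 1)
phi, n = 0.6, 5000
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

# Closed-form OLS estimator: sample cross-moment over sample second moment
phi_hat = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)
print(phi_hat)  # close to the true phi = 0.6
```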

Properties of a good estimator?

  • unbiasedness (finite-sample)
  • consistency
  • asymptotic normality

Finite-Sample Unbiasedness?

Substituting \(x_t = \phi x_{t-1} + w_t\) into the OLS formula gives the decomposition \[ \hat\phi = \phi + \frac{\sum_{t=2}^n x_{t-1} w_t} {\sum_{t=2}^n x_{t-1}^2}. \]

Question: \[ \mathbb{E}(\hat\phi) \stackrel{?}{=} \phi. \]


For \(\hat\phi\) to be unbiased, we would need strict exogeneity:

\[ \mathbb{E}[w_t \mid x_1, x_2, \ldots, x_n] = 0, \] which cannot hold in dynamic models: future values \(x_{t+1}, \ldots, x_n\) are built from \(w_t\), so conditioning on them carries information about \(w_t\).
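The resulting finite-sample bias is easy to see by Monte Carlo; a sketch (the values \(\phi = 0.6\), \(n = 30\), and the replication count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Average the OLS estimate over many short AR(1) samples
phi, n, reps = 0.6, 30, 20000
x = np.zeros((reps, n))
w = rng.normal(size=(reps, n))
for t in range(1, n):
    x[:, t] = phi * x[:, t - 1] + w[:, t]

num = np.sum(x[:, :-1] * x[:, 1:], axis=1)
den = np.sum(x[:, :-1] ** 2, axis=1)
estimates = num / den

# The mean estimate sits noticeably below the true phi = 0.6
print(estimates.mean())
```

The bias shrinks at rate \(1/n\), which is why we settle for consistency rather than finite-sample unbiasedness.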

Consistency

Since finite-sample unbiasedness fails, we focus on consistency: \[ \hat\phi \xrightarrow{p} \phi \quad \text{as } n \to \infty. \]

Using the decomposition \[ \hat\phi - \phi = \frac{\sum_{t=2}^n x_{t-1} w_t} {\sum_{t=2}^n x_{t-1}^2}, \] consistency depends on the behavior of the numerator and denominator as sample size grows.

Denominator: Stabilization

We need \[ \frac{1}{n}\sum_{t=2}^n x_{t-1}^2 \;\xrightarrow{p}\; \mathbb{E}(x_{t-1}^2) > 0. \]

Numerator: Vanishing Term

We need \[ \frac{1}{n}\sum_{t=2}^n x_{t-1} w_t \;\xrightarrow{p}\; \mathbb{E}(x_{t-1} w_t) = 0. \]
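Both limits can be illustrated on simulated AR(1) data (\(\phi = 0.6\) and the sample sizes are illustrative; with \(\sigma_w^2 = 1\), \(\mathbb{E}(x_{t-1}^2) = 1/(1-\phi^2) = 1.5625\)):

```python
import numpy as np

rng = np.random.default_rng(2)
phi = 0.6

# As n grows, the denominator average stabilizes at E(x^2) = 1/(1 - phi^2),
# while the numerator average vanishes, since E(x_{t-1} w_t) = 0
for n in (100, 10_000, 200_000):
    w = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + w[t]
    denom = np.mean(x[:-1] ** 2)
    numer = np.mean(x[:-1] * w[1:])
    print(n, round(denom, 4), round(numer, 4))
```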

Do These Population Moments Exist?

The consistency argument involves population quantities such as \[ \mathbb{E}(x_{t-1}^2), \qquad \mathbb{E}(x_{t-1} w_t). \]

These expectations are only meaningful if the process \(\{x_t\}\) admits time-invariant moments.

This requires stationarity of \(\{x_t\}\). In an AR(1) model,

\[ w_t := x_t - \phi x_{t-1}, \] so stationarity of \(\{x_t\}\) implies joint stationarity of \((x_t, w_t)\), and these population moments are well defined.

When Do Sample Averages Represent Population Moments?

In time series analysis, we observe a single realization \((x_1,\ldots,x_n)\) of a stochastic process.

Sample averages such as \[ \frac{1}{n}\sum_{t=2}^n x_{t-1}^2, \qquad \frac{1}{n}\sum_{t=2}^n x_{t-1} w_t \] are time averages, computed along one observed path.


The corresponding population quantities, \[ \mathbb{E}(x_{t-1}^2), \qquad \mathbb{E}(x_{t-1} w_t), \] are ensemble averages: expectations taken across hypothetical repetitions of the process at a fixed time.


For estimation to work, time averages must coincide with ensemble averages. This is exactly what ergodicity guarantees.
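The distinction can be made concrete by computing both kinds of average for a simulated AR(1) (all numerical choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
phi = 0.6

def ar1_path(n):
    """One realization of x_t = phi * x_{t-1} + w_t, started at zero."""
    w = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + w[t]
    return x

# Time average: mean of x_t^2 along ONE long realization
time_avg = np.mean(ar1_path(100_000) ** 2)

# Ensemble average: mean of x_t^2 at a FIXED t across many realizations
t_fix, reps = 50, 5000  # t_fix large enough that the zero start has washed out
ensemble_avg = np.mean([ar1_path(t_fix + 1)[t_fix] ** 2 for _ in range(reps)])

# Both approximate E(x_t^2) = 1/(1 - phi^2) = 1.5625
print(time_avg, ensemble_avg)
```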

When Is a Time Series Ergodic?

Ergodicity is not automatic. Whether it holds depends on the way dependence propagates over time.


For the linear time series models studied in this course:

  • Causal AR and ARMA processes are ergodic
  • A sufficient condition is that the process admits a representation \[ x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j}, \qquad \sum_{j=0}^{\infty} |\psi_j| < \infty. \]

In linear ARMA models, causality implies absolute summability, which in turn ensures ergodicity.
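For instance, a causal AR(1) with \(|\phi| < 1\) has \(\psi_j = \phi^j\), so

\[ \sum_{j=0}^{\infty} |\psi_j| = \sum_{j=0}^{\infty} |\phi|^j = \frac{1}{1-|\phi|} < \infty, \]

and the sufficient condition holds.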

Law of Large Numbers for Ergodic Time Series

Let \(\{x_t\}\) be a stationary and ergodic process, and let \(g(\cdot)\) be a function such that \(\mathbb{E}|g(x_t)| < \infty\).

Then, \[ \frac{1}{n}\sum_{t=1}^n g(x_t) \;\xrightarrow{p}\; \mathbb{E}[g(x_t)]. \]


This result justifies replacing population moments by sample averages in estimation.

Applying LLN to the AR(1) Estimator

Recall \[ \hat\phi = \frac{\frac{1}{n}\sum x_{t-1}x_t} {\frac{1}{n}\sum x_{t-1}^2}. \]

Under stationarity and ergodicity, \[ \frac{1}{n}\sum x_{t-1}^2 \;\xrightarrow{p}\; \mathbb{E}(x_{t-1}^2) > 0, \qquad \frac{1}{n}\sum x_{t-1}w_t \;\xrightarrow{p}\; \mathbb{E}(x_{t-1}w_t) = 0, \] so by the continuous mapping theorem \(\hat\phi \xrightarrow{p} \phi\).

Asymptotic Normality

\[ \hat\phi = \frac{\sum_{t=2}^n x_{t-1}x_t}{\sum_{t=2}^n x_{t-1}^2} = \phi + \frac{\sum_{t=2}^n x_{t-1}w_t}{\sum_{t=2}^n x_{t-1}^2}. \]


To get a sampling distribution, scale the error: \[ \sqrt{n}(\hat\phi-\phi) = \frac{\frac{1}{\sqrt{n}}\sum_{t=2}^n x_{t-1}w_t} {\frac{1}{n}\sum_{t=2}^n x_{t-1}^2}. \]
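That \(\sqrt{n}\) is the right scale can be checked by simulation: the spread of \(\sqrt{n}(\hat\phi-\phi)\) stays roughly constant as \(n\) grows (the values of \(\phi\), the sample sizes, and the replication count below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
phi, reps = 0.6, 5000

for n in (100, 400, 1600):
    # Simulate reps independent AR(1) paths of length n at once
    x = np.zeros((reps, n))
    w = rng.normal(size=(reps, n))
    for t in range(1, n):
        x[:, t] = phi * x[:, t - 1] + w[:, t]
    phi_hat = np.sum(x[:, :-1] * x[:, 1:], axis=1) / np.sum(x[:, :-1] ** 2, axis=1)
    scaled = np.sqrt(n) * (phi_hat - phi)
    # The standard deviation of the scaled error is stable across n
    print(n, scaled.std())
```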

CLT for the Numerator

We already used ergodicity to show \[ \frac{1}{n}\sum_{t=2}^n x_{t-1}^2 \;\xrightarrow{p}\; \mathbb{E}(x_{t-1}^2). \]


Now consider the numerator: \[ \frac{1}{\sqrt{n}}\sum_{t=2}^n x_{t-1} w_t. \]


In an AR model, \[ \mathbb{E}(w_t \mid x_{t-1}, x_{t-2}, \ldots) = 0, \] so the sequence \(\{x_{t-1} w_t\}\) is a martingale difference sequence.



From LLN to CLT

A CLT for the partial sum \[ \frac{1}{\sqrt{n}}\sum_{t=2}^n x_{t-1} w_t \] requires two ingredients:

  1. Variance accumulation at rate \(n\), so that \(\sqrt{n}\) is the correct normalization.
  2. Tail control, so that no small number of terms dominates the normalized sum.

In the AR model, the martingale difference structure of \(\{x_{t-1} w_t\}\) makes the summands uncorrelated, so the variance of the partial sum grows linearly in \(n\), which delivers (1). Together with finite second moments, this is sufficient to rule out domination by extreme terms and deliver (2).

Under these conditions, \[ \frac{1}{\sqrt{n}}\sum_{t=2}^n x_{t-1} w_t \;\xrightarrow{d}\; N\!\left(0,\; \mathbb{E}(x_{t-1}^2 w_t^2)\right). \]
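Combining this with the denominator limit via Slutsky's theorem pins down the asymptotic distribution of \(\hat\phi\) in the AR(1) case. If \(\mathbb{E}(w_t^2 \mid x_{t-1}, x_{t-2}, \ldots) = \sigma_w^2\) (e.g. i.i.d. innovations), then \(\mathbb{E}(x_{t-1}^2 w_t^2) = \sigma_w^2\,\mathbb{E}(x_{t-1}^2)\), and hence

\[ \sqrt{n}(\hat\phi - \phi) \;\xrightarrow{d}\; N\!\left(0,\; \frac{\mathbb{E}(x_{t-1}^2 w_t^2)}{[\mathbb{E}(x_{t-1}^2)]^2}\right) = N\!\left(0,\; \frac{\sigma_w^2}{\mathbb{E}(x_{t-1}^2)}\right) = N(0,\; 1-\phi^2), \]

using \(\mathbb{E}(x_{t-1}^2) = \sigma_w^2/(1-\phi^2)\) for a stationary AR(1).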

Key Takeaway: Estimation and Inference for AR Models

For an AR(\(p\)) model with white-noise innovations, \[ x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t, \qquad w_t \sim wn(0,\sigma_w^2), \] the model admits a regression representation.


When standard regression tools apply

  • Stationarity ensures population moments exist
  • Ergodicity allows time averages to estimate those moments
  • Innovations are unpredictable given the past: \[ \mathbb{E}(w_t \mid x_{t-1}, x_{t-2}, \ldots)=0 \]

These conditions replace independence in time-series settings.


Implications

  • OLS is consistent
  • OLS is asymptotically normal
  • Standard variance formulas apply

Estimation for MA and ARMA Models

So far, we have seen that OLS works for AR models because they admit a regression representation with observed regressors.


What changes for MA and ARMA models?

  • The model involves unobserved shocks
  • There is no regression with observed regressors

For example, an MA(1) model: \[ x_t = w_t + \theta w_{t-1}, \] depends on latent innovations \(\{w_t\}\).


Implication for estimation

  • Ordinary least squares is not feasible
  • Estimation must be based on the joint distribution implied by the model
  • This leads naturally to maximum likelihood estimation (MLE)

MLE for MA and ARMA Models: Basic Idea

In MA and ARMA models, the shocks \(\{w_t\}\) are unobserved. Estimation therefore relies on the likelihood of the observed data \((x_1,\ldots,x_n)\).


Likelihood as a joint density

The likelihood is the joint density \[ f(x_1,\ldots,x_n), \] which can always be written as a product of conditional densities: \[ f(x_1)\, f(x_2\mid x_1)\, \cdots\, f(x_n\mid x_1,\ldots,x_{n-1}). \]


A working distributional assumption

Assume the innovations are Gaussian and independent: \[ w_t \stackrel{i.i.d.}{\sim} N(0,\sigma^2). \]

This assumption yields a tractable likelihood.


Likelihood-based estimation

The parameters \((\phi_1,\ldots,\phi_p,\;\theta_1,\ldots,\theta_q,\;\sigma_w^2)\) index the joint density above.

MLE selects parameter values that maximize the (log-)likelihood evaluated at the observed data.

Example: MA(1) Model

\[ x_t = w_t + \theta w_{t-1}, \qquad w_t \stackrel{i.i.d.}{\sim} N(0,\sigma^2), \quad t=1,\ldots,n. \]


Recursion equation

For any candidate \((\theta,\sigma^2)\) and any initial value \(w_0\), the model implies the recursion \[ w_t = x_t - \theta w_{t-1}, \qquad t=1,\ldots,n, \] so each \(w_t\) is a function of \((x_1,\ldots,x_t)\) and \(w_0\).


Likelihood

Under \(w_t \stackrel{i.i.d.}{\sim} N(0,\sigma^2)\), \[ \begin{aligned} f(x_1,\ldots,x_n \mid w_0;\theta,\sigma^2) &= \prod_{t=1}^n f(x_t \mid x_{t-1},\ldots,x_1,w_0;\theta,\sigma^2)\\ &= \prod_{t=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{w_t^2}{2\sigma^2}\right), \end{aligned} \] where each \(w_t\) is obtained from the recursion.
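The recursion and the conditional likelihood translate directly into code. A sketch that concentrates \(\sigma^2\) out of the likelihood and maximizes over \(\theta\) by grid search (the true values \(\theta = 0.5\), \(\sigma^2 = 1\), the sample size, and the grid are all illustrative; a numerical optimizer would normally replace the grid):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate an MA(1): x_t = w_t + theta * w_{t-1}
theta_true, n = 0.5, 2000
w = rng.normal(size=n + 1)
x = w[1:] + theta_true * w[:-1]

def innovations(theta, x, w0=0.0):
    """Recover w_t = x_t - theta * w_{t-1} recursively, conditioning on w_0."""
    w = np.empty(len(x))
    prev = w0
    for t in range(len(x)):
        w[t] = x[t] - theta * prev
        prev = w[t]
    return w

def neg_loglik(theta, x):
    """Conditional Gaussian -log L, with sigma^2 concentrated out as mean(w_t^2)."""
    w = innovations(theta, x)
    sigma2 = np.mean(w ** 2)
    return 0.5 * len(x) * (np.log(2 * np.pi * sigma2) + 1.0)

# Maximize the conditional likelihood over theta by grid search
grid = np.linspace(-0.95, 0.95, 381)
theta_hat = grid[np.argmin([neg_loglik(th, x) for th in grid])]
sigma2_hat = np.mean(innovations(theta_hat, x) ** 2)
print(theta_hat, sigma2_hat)  # near the true values (0.5, 1.0)
```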


Conditional vs. Exact Likelihood

The expression above is the conditional likelihood: it treats the initial innovation \(w_0\) as fixed.

The exact likelihood integrates over the model-implied distribution of \(w_0\): \[ f(x_1,\ldots,x_n;\theta,\sigma^2) = \int f(x_1,\ldots,x_n \mid w_0;\theta,\sigma^2)\, f(w_0)\,dw_0. \]

Both approaches are valid; under invertibility, the influence of \(w_0\) dies out geometrically, and the two likelihoods are asymptotically equivalent.

In practice, the conditional likelihood is often used for computational simplicity.

Example: ARMA(1,1)

\[ x_t = \phi x_{t-1} + w_t + \theta w_{t-1}, \qquad w_t \stackrel{i.i.d.}{\sim} N(0,\sigma^2). \]


Likelihood (conditional)

Given \((\phi,\theta,\sigma^2)\) and an initial value \(w_0\), we have \[ w_t = x_t - \phi x_{t-1} - \theta w_{t-1}, \qquad t=1,\ldots,n. \]


Under \(w_t \stackrel{i.i.d.}{\sim} N(0,\sigma^2)\), the conditional likelihood is \[ f(x_1,\ldots,x_n \mid w_0;\phi,\theta,\sigma^2) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{w_t^2}{2\sigma^2} \right). \]
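The ARMA(1,1) recursion and conditional likelihood can be sketched the same way (conditioning on \(x_1\) and \(w_0 = 0\); the parameter values in the check are illustrative):

```python
import numpy as np

def arma11_neg_loglik(phi, theta, sigma2, x, w0=0.0):
    """Conditional Gaussian -log L for ARMA(1,1), conditioning on x_1 and w_0."""
    w = np.zeros(len(x))
    w[0] = w0
    for t in range(1, len(x)):
        w[t] = x[t] - phi * x[t - 1] - theta * w[t - 1]  # the recursion above
    resid = w[1:]  # likelihood contributions from t = 2, ..., n
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2)

# Quick check on simulated data: the true parameters attain a lower -log L
rng = np.random.default_rng(5)
n = 2000
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + w[t] + 0.3 * w[t - 1]
print(arma11_neg_loglik(0.5, 0.3, 1.0, x) < arma11_neg_loglik(0.0, 0.0, 1.0, x))  # True
```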

Statistical Inference with MLE

Once the likelihood is specified, inference is based on the large-sample behavior of the MLE.


Large-sample properties of MLE

In addition to stationarity and ergodicity of the process, and under standard likelihood regularity conditions:

  • the MLE is consistent
  • the MLE is asymptotically normal


Quasi–Maximum Likelihood (QMLE)

The Gaussian likelihood can be used even if the true innovations are not Gaussian.

  • The resulting estimator is called the QMLE
  • QMLE remains consistent under correct specification of the conditional mean and variance
  • QMLE is asymptotically normal under weak dependence, with a sandwich-form asymptotic variance when the innovations are not Gaussian