Last week I attended a short course on Markov chain Monte Carlo (MCMC) methods in statistics. It was given by Professors Gareth Roberts and Omiros Papaspiliopoulos at the Department of Mathematical Sciences at the University of Copenhagen. Both did a splendid job teaching the material and engaging the participants between and after lectures. Part of the course was to solve a small problem to get a practical feel for the methods that we had discussed from a theoretical perspective. I was fortunate to work with a superb group consisting of Massimiliano Tamborrino (KU Math), Nina Munkholt (KU Math), Nina Lange (CBS), and Martin Jonsson (Lund). We produced the following short report on how to apply the Metropolis-Hastings algorithm to generate samples from the analytically intractable smoothing distribution of a non-Gaussian state-space model, which is a central problem for inference in this class of models. The model under consideration was a (very simple) univariate Gaussian random walk observed with i.i.d. $t_{3}$-distributed measurement errors. We considered a few different proposals, and in the end we solved the problem by using a random-walk proposal. Other (and much more efficient) solutions exist, but I think that this one is neat because it works quite well and also illustrates some of the aspects that must be considered when choosing a proposal for the Metropolis-Hastings algorithm.

We consider for $t=1,\,\ldots,\,T$ the state-space model,

\begin{align}

y_{t}

%

&= x_{t} + \tau z_{t} \, , \quad z_{t} \sim t_{\nu} \label{eq_y} \\

%

%

x_{t}

%

&= x_{t-1} + \sigma h_{t} \, , \quad h_{t} \sim N(0, 1) \label{eq_x} \, ,

\end{align}

with $x_{0} = 0$ fixed, and $\tau = \sigma \sqrt{1/\nu}$.
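
To make the setup concrete, here is a minimal simulator for this model. The function name `simulate_ssm` and the default seed are my own; the parameters $T$, $\nu$, and $\sigma$ follow the text, with $\tau = \sigma\sqrt{1/\nu}$ as defined above.

```python
import numpy as np

# Sketch of a simulator for the state-space model above: a Gaussian random
# walk x_t observed with scaled t_nu measurement noise. The function name
# and defaults are illustrative, not part of the original report.
def simulate_ssm(T=100, nu=3.0, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    tau = sigma * np.sqrt(1.0 / nu)   # tau = sigma * sqrt(1/nu), as in the text
    h = rng.standard_normal(T)        # h_t ~ N(0, 1)
    x = sigma * np.cumsum(h)          # x_t = x_{t-1} + sigma * h_t, with x_0 = 0
    z = rng.standard_t(nu, size=T)    # z_t ~ t_nu
    y = x + tau * z                   # y_t = x_t + tau * z_t
    return x, y

x, y = simulate_ssm()
```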

The equations in (\ref{eq_y}) and (\ref{eq_x}) admit the observation and transition densities with respect to the Lebesgue measure, $p\left( y_{t} \mid x_{t} \right)$ and $p\left( x_{t} \mid x_{t-1} \right)$, respectively. For notational convenience, we define $z_{1:k} \defeq [ \begin{array}{ccc} z_{1} & \ldots & z_{k} \end{array} ]^{\prime}$ for a sequence of $k$ reals.

We wish to sample from the so-called smoothing distribution,

\begin{align}

p\left( x_{1:T} \mid y_{1:T} \right) \, , \label{eq_smoothing_distribution}

\end{align}

which does not have a closed-form expression due to the non-Normality of the observation density. However, note that the smoothing distribution (\ref{eq_smoothing_distribution}) is known up to proportionality; that is,

\begin{align}

p\left( x_{1:T} \mid y_{1:T} \right)

%

&= \frac{p\left( y_{1:T} \mid x_{1:T} \right) p\left( x_{1:T} \right)}{p\left( y_{1:T} \right)} \notag \\

%

&\propto p\left( y_{1:T} \mid x_{1:T} \right) p\left( x_{1:T} \right) \, ,

\end{align}

where

\begin{align}

p\left( y_{1:T} \mid x_{1:T} \right)

%

= \prod_{t=1}^{T} p\left( y_{t} \mid x_{t} \right)

%

\quad \text{and} \quad

%

p\left( x_{1:T} \right)

%

= \prod_{t=1}^{T} p\left( x_{t} \mid x_{t-1} \right) \, . \label{eq_factorizations}

\end{align}

To sample from the smoothing distribution, we employ the Metropolis-Hastings algorithm of Metropolis et al. (1953) and Hastings (1970), which runs as specified below.

Initialization: Choose $x^{(0)}$, $q\left( x , x^{\prime} \right)$, and $N$.

For $i=1,\,\ldots,\,N$:

- Sample $u^{(i)} \sim U(0,1)$.
- Sample $\tilde{x}^{(i)} \sim q\left( x^{(i-1)} ,\, \cdot \right)$.
- Evaluate the acceptance probability

\begin{align}

\alpha^{(i)}

%

\defeq \min\left\{ \frac{\pi_{u}(\tilde{x}^{(i)})}{\pi_{u}(x^{(i-1)})} \frac{q\left(\tilde{x}^{(i)} ,\, x^{(i-1)} \right)}{q\left( x^{(i-1)} ,\, \tilde{x}^{(i)} \right)} ,\, 1 \right\} \, , \label{eq_acceptance_prob}

\end{align}

and set $x^{(i)} \defeq \tilde{x}^{(i)}$ if $u^{(i)} < \alpha^{(i)}$, else set $x^{(i)} \defeq x^{(i-1)}$.

The Metropolis-Hastings algorithm allows sampling from any density known up to proportionality, denoted $\pi(x)$ and referred to as the `target density’. As prerequisites, one must pick 1) a transition kernel, denoted $q(x ,\, x^{\prime})$ and referred to as the `proposal density’, 2) an initial value of $x$, and 3) the number of iterations to execute, denoted $N$. Observe that the number of iterations $N$ is not equivalent to the effective number of samples from $\pi(x)$!
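
The loop above can be sketched in a few lines. This is a generic version under my own naming conventions (`log_pi_u`, `propose`, `log_q` are illustrative); working with log densities avoids numerical underflow when the target is a product of many terms, as in our smoothing problem.

```python
import numpy as np

# Generic Metropolis-Hastings sketch: log_pi_u is the unnormalized log
# target, propose draws from q(x, .), and log_q evaluates log q(a, b).
def metropolis_hastings(log_pi_u, propose, log_q, x0, N, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    chain = np.empty((N,) + x.shape)
    for i in range(N):
        x_prop = propose(x, rng)  # x~ ~ q(x^(i-1), .)
        # log of pi_u(x~)/pi_u(x) * q(x~, x)/q(x, x~); accepting when
        # log u < log alpha is equivalent to accepting with prob min(1, alpha)
        log_alpha = (log_pi_u(x_prop) - log_pi_u(x)
                     + log_q(x_prop, x) - log_q(x, x_prop))
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
        chain[i] = x
    return chain

# Toy usage: sample N(0, 1) with a symmetric random-walk proposal, for
# which the q-ratio cancels (log_q can return a constant).
chain = metropolis_hastings(
    log_pi_u=lambda x: -0.5 * x**2,
    propose=lambda x, rng: x + rng.normal(scale=1.0),
    log_q=lambda a, b: 0.0,
    x0=0.0, N=5000)
```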

For the model under consideration we set the (unnormalized) target density to be the unnormalized smoothing density,

\begin{align}

\pi_{u}(x_{1:T})

%

&\defeq p\left( y_{1:T} \mid x_{1:T} \right) p\left( x_{1:T} \right) \, ,

\end{align}

which can easily be evaluated using the factorizations in (\ref{eq_factorizations}).
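
For numerical stability it is preferable to evaluate the log of $\pi_{u}$, i.e. the sum of the log observation and log transition densities. A sketch, with names (`log_pi_u`, argument order) of my own choosing and the $t_{\nu}$ log-density written out explicitly:

```python
import math
import numpy as np

# Unnormalized log smoothing density log pi_u(x_{1:T}) via the
# factorizations above: sum of log p(y_t | x_t) and log p(x_t | x_{t-1}).
def log_pi_u(x, y, nu=3.0, sigma=0.01):
    tau = sigma * np.sqrt(1.0 / nu)
    # observation term: (y_t - x_t)/tau ~ t_nu, so use the scaled t log-density
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(tau))
    u = (y - x) / tau
    log_obs = np.sum(const - 0.5 * (nu + 1) * np.log1p(u**2 / nu))
    # transition term: x_t | x_{t-1} ~ N(x_{t-1}, sigma^2), with x_0 = 0
    dx = np.diff(np.concatenate(([0.0], x)))
    log_trans = np.sum(-0.5 * math.log(2 * math.pi * sigma**2)
                       - dx**2 / (2 * sigma**2))
    return log_obs + log_trans
```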

The choice of proposal density is not trivial, and several alternatives could be considered. It is common to consider variations of the random-walk MH (RWMH) or the independent MH (IMH). The RWMH is characterized by the proposal density,

\begin{align}

q(x_{1:T}^{(i-1)} ,\, x_{1:T}^{(i)})

&

\defeq N( x_{1:T}^{(i-1)} ,\, \Omega) \, ,

\end{align}

where $\Omega$ is an appropriately chosen covariance matrix. A simple approach is to choose $\Omega = \omega I_{T}$, where $\omega$ is a scaling parameter that can be tuned to obtain a given acceptance rate. The IMH is characterized by

\begin{align}

q(x_{1:T}^{(i)})

&

\defeq N( \mu ,\, \Omega ) \, ,

\end{align}

where $\mu$ and $\Omega$ are appropriately chosen parameters. It should be noted that distributions other than the normal can be used instead, and that the expression for the acceptance probability in (\ref{eq_acceptance_prob}) may simplify, depending on the choice of proposal. In particular, for a symmetric proposal such as the RWMH above, the ratio of the $q$-terms equals one and cancels.

We are given $T=100$ observations generated from the model in (\ref{eq_y})--(\ref{eq_x}), and told that $\nu = 3$ and $\sigma = 0.01$. The observation sequence $y_{1:T}$ is plotted in Figure 2(a).

First, we tried to implement the IMH, setting the mean and variance equal to the moments obtained from the Kalman smoother (matching the moments of a Gaussian SSM to the non-Gaussian SSM under consideration). This MH sampler yields an acceptance rate of at best 1–2 pct., so we abandoned further work on this solution.

Second, we implemented the RWMH, setting the covariance matrix to $\Omega \defeq \omega I_{T}$, with scaling $\omega = 0.001$. The scaling was chosen by hand to obtain an acceptance rate of approximately 30 pct., which is close to the optimum for sampling problems of this dimension. We ran $51000$ iterations of the RWMH algorithm, discarding the first $1000$ iterations as a `burn-in’ sample, to allow the chain to converge to its stationary distribution. The result is $T=100$ chains of length $N=50000$. We have plotted the trace plots for four of these in Figure 1 to illustrate what they look like.
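
The RWMH run can be sketched as follows. Since $\Omega = \omega I_{T}$ gives a symmetric Gaussian proposal, the $q$-ratio cancels and only the target ratio enters the acceptance step. The data here are simulated in place so the snippet runs on its own, and the iteration count is kept smaller than the report's $51000$ for brevity; all variable names are my own.

```python
import math
import numpy as np

# Self-contained RWMH sketch for the smoothing problem: Omega = omega * I_T.
rng = np.random.default_rng(1)
T, nu, sigma, omega = 100, 3.0, 0.01, 0.001
tau = sigma * math.sqrt(1.0 / nu)
x_true = sigma * np.cumsum(rng.standard_normal(T))   # simulated hidden states
y = x_true + tau * rng.standard_t(nu, size=T)        # simulated observations

def log_pi_u(x):
    # unnormalized log target: t_nu observation kernel + Gaussian RW kernel
    u = (y - x) / tau
    log_obs = np.sum(-0.5 * (nu + 1) * np.log1p(u**2 / nu))
    dx = np.diff(np.concatenate(([0.0], x)))
    return log_obs - np.sum(dx**2) / (2 * sigma**2)

n_iter, burn_in = 6000, 1000                         # report used 51000 / 1000
x = np.zeros(T)
lp = log_pi_u(x)
chain = np.empty((n_iter, T))
accepted = 0
for i in range(n_iter):
    # symmetric proposal N(x, omega * I_T): q-ratio cancels in alpha
    x_prop = x + math.sqrt(omega) * rng.standard_normal(T)
    lp_prop = log_pi_u(x_prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        x, lp = x_prop, lp_prop
        accepted += 1
    chain[i] = x
chain = chain[burn_in:]                              # discard burn-in
accept_rate = accepted / n_iter
```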

The chains in Figure 1 are clearly not i.i.d. draws, and therefore the effective sample size is less than the $50000$ iterations. We note that the effective sample size can be calculated from the chain autocorrelations, but we have not had time to implement it.
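
For completeness, here is one simple way such a calculation could look: $\mathrm{ESS} = N / (1 + 2\sum_{k \geq 1} \rho_{k})$, with the sum over autocorrelations truncated at the first non-positive lag. This is a sketch of a common rule of thumb, not the report's implementation.

```python
import numpy as np

# Effective sample size of a scalar chain: N / (1 + 2 * sum of positive-lag
# autocorrelations), truncating the sum at the first non-positive value.
def effective_sample_size(chain):
    c = np.asarray(chain, dtype=float)
    n = len(c)
    centered = c - c.mean()
    # autocovariances at lags 0..n-1 from the full cross-correlation
    acov = np.correlate(centered, centered, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:      # simple truncation rule
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

# Usage: a persistent AR(1) chain has far fewer effective samples than an
# i.i.d. sequence of the same length.
rng = np.random.default_rng(0)
iid = rng.standard_normal(2000)
ar = np.empty(2000)
ar[0] = 0.0
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
ess_iid = effective_sample_size(iid)
ess_ar = effective_sample_size(ar)
```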

Having sampled the chains for the hidden state sequence, we can produce an estimate of the expected state, conditional on the sequence of observations,

\begin{align}

\ex\left[ x_{1:T} \mid y_{1:T} \right]

%

&= \int x_{1:T} \, p\left( x_{1:T} \mid y_{1:T} \right) \, \kd x_{1:T} \notag \\

%

&\approx \frac{1}{N} \sum_{i=1}^{N} x_{1:T}^{(i)} \, ,

\end{align}

since the chain iterations $x_{1:T}^{(i)}$ have been sampled from the smoothing density $p\left( x_{1:T} \mid y_{1:T} \right)$. We observe that as $N\to\infty$, a law of large numbers applies such that the estimate converges in probability to the expectation. To assess the uncertainty of this estimator, we could run the MH sampler another $M$ times, compute the state estimates, and take their standard deviation. Note moreover that computing the centered second-order moment of the chains yields a distinctly different statistic, because it says something about the smoothing distribution itself rather than about the estimator of the expectation. The estimated expected state is displayed in Figure 2(a), and the resulting estimated measurement error in Figure 2(b). The estimated errors include some large absolute realizations, which is consistent with a $t_{3}$-distribution.
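
In code, the estimate is simply the chain average over iterations, and the per-element chain spread is the distinct second-moment statistic mentioned above. Assuming `chain` is an $(N, T)$ array of post-burn-in draws of $x_{1:T}$; a placeholder array is built here only so the snippet runs on its own, and no real results are implied.

```python
import numpy as np

# Placeholder (N, T) array standing in for post-burn-in MH draws of x_{1:T}.
rng = np.random.default_rng(0)
N, T = 5000, 100
chain = 0.01 * rng.standard_normal((N, T))

# Monte Carlo estimate of E[x_{1:T} | y_{1:T}]: average over iterations.
x_hat = chain.mean(axis=0)

# Per-element spread of the chain: describes the smoothing distribution
# itself, not the Monte Carlo uncertainty of x_hat (which would require
# repeated independent runs of the sampler).
smoothing_sd = chain.std(axis=0)
```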

The mixing properties of the chains produced by the RWMH sampler can be assessed further by inspecting the ACFs. Figure 3(a) displays the ACF corresponding to each chain. We note that at lag $250$, the ACFs still lie in the interval from $0.2$ to $0.8$, which indicates that the chains are generally mixing slowly, and in particular that some chains mix much more slowly than others.

As a final point, we have considered whether the mixing of a given chain is related to the size of the corresponding measurement error. The hypothesis is that a large (absolute) error would tend to over-reject proposals at the `optimal’ scaling factor, thereby producing more persistent chains for the elements with large measurement errors. We investigate this graphically by plotting the ACFs at lag $250$ against the log of the absolute value of the measurement error, cf. Figure 3(b). There is a clear positive dependence between the two, which supports the hypothesis that applying the same scaling factor to all elements of the hidden state sequence may result in poorer mixing for the elements with larger measurement errors. A way to address this could be to allow the scaling to vary element by element, but this would of course require thinking carefully about the scaling for each element.