Last week I attended a short course on Markov chain Monte Carlo (MCMC) methods in statistics. It was given by Professors Gareth Roberts and Omiros Papaspiliopoulos at the Department of Mathematical Sciences at University of Copenhagen. Both did a splendid job teaching the material and engaging the participants between and after lectures. Part of the course was to solve a small problem to get a practical feel of the methods that we discussed from a theoretical perspective. I was fortunate to work with a superb group consisting of Massimiliano Tamborrino (KU Math), Nina Munkholt (KU Math), Nina Lange (CBS), and Martin Jonsson (Lund). We produced the following short report on how to apply the Metropolis-Hastings algoritm to generate samples from the analytically intractable smoothing distribution of a non-Gaussian state-space model, which is a central problem for inference in this class of models. The model under consideration was a (very simple) univariate Gaussian random walk observed with i.i.d. -distributed measurement errors. We considered a few different proposals, and in the end we solved the problem by using a random walk proposal. Other (and much more efficient) solutions can be thought of, but I think that this one is neat because it work quite well, and also illustrates some of the aspects that must be considered when choosing a proposal for the Metropolis-Hastings algorithm.
with fixed, and .
The equations in (1) admit the observation and transition densities with respect to the Lebesgue measure, and , respectively. For notational convenience, we define for a sequence of reals.
which does not have a closed-form expression due to the non-Normality of the observation density. However, note that the smoothing distribution (2) is known up to proportionality; that is,
To sample from the smoothing distribution, we employ the Metropolis-Hastings algorithm of Metropolis et al. (1953) and Hastings (1970), which runs as specified below.
Initialization: Choose , , and .
The Metropolis-Hastings algorithm allows sampling from any density known up to proportionality, denoted and referred to as the `target density’. It is prerequisite to pick 1) a transition kernel, denoted and referred to as the `proposal density’, 2) some initial value of , and 3) how many iterations to execute, denoted . Observe that the number of iterations is not equivalent to the effective number of samples from !
For the model under consideration we set the (unnormalized) target density to be the unnormalized smoothing density,
which can easily be evaluated using the factorizations in (4).
The choice of proposal density is not trivial, and several alternatives could be considered. It is common to consider variations of the random-walk MH or the independent MH. The RWMH is characterized by the proposal density,
where is an appropriately chosen variance. A simple approach could be to choose , where is a scaling parameter that can be chosen to obtain a given level of acceptance probability. The IMH is characterized by
where and are appropriately chosen parameters. It should be noted that other distributions than the normal can be used instead, and that the expression for the acceptance probability in (5) may be simplified, depending on the choice of proposal.
We are given observations generated from the model in (1), and told that and . The observation sequence is plotted in Figure 2(a).
First, we tried to implement the IMH, setting the mean and variance equal to the moments obtained from the Kalman smoother (matching the moments of the Gaussian SSM to the non-Gaussian SSM under consideration). This MH sampler yields an acceptance rate of at best 1–2 pct., so we abandoned further work on this solution.
Second, we have implemented the RWMH, setting the covariance matrix to , with scaling . The scaling was chosen by hand to obtain an acceptance rate of approximately 30 pct., which is close to the optimum for sampling problems of this dimension. We ran iterations of the RWMH algorithm, discarding the first iterations as a `burn-in’ sample, to allow the RW proposal to converge. The result is chains of length . We have plotted the trace plots for four of these in Figure 1 to illustrate what these look like.
The chains in Figure 1 are clearly not IID draws, and therefore the effective sample size is less than the iterations. We note that the effective sample size can be calculated, but we have not had time to implement it.
Having sampled the chains for the hidden state sequence, we can produce an estimate of the expected state, conditional on the sequence of observations,
since the chain iterations have been sampled from the smoothing density . We observe that as , a law of large numbers applies such that the estimate converges in probability to the expectation. To assess the estimation uncertainty of this estimator we could run the MH sampler an other times, compute the state estimates and their standard deviation. Note moreover that computing the centered second order moment of the chains yields a distinctly different statistic, because this says something about the smoothing distribution rather than the estimator of the expectation. The estimated expected state is displayed in Figure 2(a), and the resulting estimated measurement error in Figure 2(b). The estimated errors include some large absolute realizations, which is consistent with a -distribution.
The mixing properties of the chains produced by the RWMH sampler can be assessed further by inspecting the ACF’s. Figure 3(a) displays the ACF’s corresponding to each chain. We note that at lag , the ACF’s are still in the interval from to , which indicates that the chains are generally mixing slowly, but in particular that some chains are mixing much slower than others.
As a final point, we have considered whether the mixing of a given chain is related to the size of the corresponding measurement error. The hypothesis is that a large (absolute) error would tend to over-reject proposals at the `optimal’ scaling factor, therefore producing more persistent chains for the elements with large measurement errors. We investigate this graphically by plotting the ACF’s at lag 250 against the log of the absolute value of the measurement error, cf. Figure 3(b). There is a clear positive dependence between the two, which supports the hypothesis that applying the same scaling factor to all elements of the hidden state sequence might result in poorer mixing for the elements with larger measurement errors. A way to solve this problem could be to allow the scaling to vary element-by-element, but this would of course require that one thinks carefully about the scaling for each element.