Sampling from the Smoothing Distribution of a State-Space Model with a Random Walk Metropolis-Hastings Algorithm

Last week I attended a short course on Markov chain Monte Carlo (MCMC) methods in statistics. It was given by Professors Gareth Roberts and Omiros Papaspiliopoulos at the Department of Mathematical Sciences at the University of Copenhagen. Both did a splendid job teaching the material and engaging the participants between and after lectures. Part of the course was to solve a small problem to get a practical feel for the methods that we discussed from a theoretical perspective. I was fortunate to work with a superb group consisting of Massimiliano Tamborrino (KU Math), Nina Munkholt (KU Math), Nina Lange (CBS), and Martin Jonsson (Lund). We produced the following short report on how to apply the Metropolis-Hastings algorithm to generate samples from the analytically intractable smoothing distribution of a non-Gaussian state-space model, which is a central problem for inference in this class of models. The model under consideration was a (very simple) univariate Gaussian random walk observed with i.i.d. t_{3}-distributed measurement errors. We considered a few different proposals, and in the end we solved the problem by using a random walk proposal. Other (and much more efficient) solutions can be thought of, but I think that this one is neat because it works quite well, and also illustrates some of the aspects that must be considered when choosing a proposal for the Metropolis-Hastings algorithm.

We consider for t=1,\,\ldots,\,T the state-space model,

(1)   \begin{align*} y_{t} &= x_{t} + \tau z_{t} \, , \quad z_{t} \sim t_{\nu}  \\ x_{t} &= x_{t-1} + \sigma h_{t} \, , \quad h_{t} \sim N(0, 1) \, , \end{align*}

with x_{0} = 0 fixed, and \tau = \sigma \sqrt{1/\nu}.

The equations in (1) admit the observation and transition densities with respect to the Lebesgue measure, p\left( y_{t} \mid x_{t} \right) and p\left( x_{t} \mid x_{t-1} \right), respectively. For notational convenience, we define z_{1:k} \defeq [ \begin{array}{ccc} z_{1} & \ldots & z_{k} \end{array} ]^{\prime} for a sequence of k reals.

We wish to sample from the so-called smoothing distribution,

(2)   \begin{align*} p\left( x_{1:T} \mid y_{1:T} \right) \, ,  \end{align*}

which does not have a closed-form expression due to the non-Normality of the observation density. However, note that the smoothing distribution (2) is known up to proportionality; that is,

(3)   \begin{align*} p\left( x_{1:T} \mid y_{1:T} \right) % &= \frac{p\left( y_{1:T} \mid x_{1:T} \right) p\left( x_{1:T} \right)}{p\left( y_{1:T} \right)} \notag \\ % &\propto p\left( y_{1:T} \mid x_{1:T} \right) p\left( x_{1:T} \right) \, , \end{align*}


(4)   \begin{align*} p\left( y_{1:T} \mid x_{1:T} \right) % = \prod_{t=1}^{T} p\left( y_{t} \mid x_{t} \right) % \quad \text{and} \quad % p\left( x_{1:T} \right) % = \prod_{t=1}^{T} p\left( x_{t} \mid x_{t-1} \right) \, .  \end{align*}

To sample from the smoothing distribution, we employ the Metropolis-Hastings algorithm of Metropolis et al. (1953) and Hastings (1970), which runs as specified below.

Initialization: Choose x^{(0)}, q\left( x , x^{\prime} \right), and N.

For i=1,\,\ldots,\,N:

  1. Sample u^{(i)} \sim U(0,1).
  2. Sample \tilde{x}^{(i)} \sim q\left( x^{(i-1)} ,\, \cdot \right).
  3. Evaluate the acceptance probability

    (5)   \begin{align*} \alpha^{(i)} % \defeq \min\left\{ \frac{\pi_{u}(\tilde{x}^{(i)})}{\pi_{u}(x^{(i-1)})} \frac{q\left(\tilde{x}^{(i)} ,\, x^{(i-1)} \right)}{q\left( x^{(i-1)} ,\, \tilde{x}^{(i)} \right)} ,\, 1 \right\} \, ,  \end{align*}

    and set x^{(i)} \defeq \tilde{x}^{(i)} if u^{(i)} < \alpha^{(i)}, else set x^{(i)} \defeq x^{(i-1)}.

The Metropolis-Hastings algorithm allows sampling from any density known up to proportionality, denoted \pi(x) and referred to as the ‘target density’. As prerequisites, one must pick 1) a transition kernel, denoted q(x ,\, x^{\prime}) and referred to as the ‘proposal density’, 2) an initial value of x, and 3) the number of iterations to execute, denoted N. Observe that the number of iterations N is not equivalent to the effective number of samples from \pi(x)!

For the model under consideration we set the (unnormalized) target density to be the unnormalized smoothing density,

(6)   \begin{align*} \pi_{u}(x_{1:T}) % &\defeq p\left( y_{1:T} \mid x_{1:T} \right) p\left( x_{1:T} \right) \, , \end{align*}

which can easily be evaluated using the factorizations in (4).
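To make this concrete, the log of the unnormalized target can be evaluated directly from the two factorizations. The following Python sketch is not part of the original report; the function names are ours, and it assumes \nu = 3, \sigma = 0.01, and \tau = \sigma \sqrt{1/\nu} as above, dropping additive constants in 2\pi:

```python
import math

NU, SIGMA = 3.0, 0.01                     # parameter values from the exercise
TAU = SIGMA * math.sqrt(1.0 / NU)

def log_t_pdf(z, nu):
    """Log-density of the Student-t distribution with nu degrees of freedom."""
    c = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
         - 0.5 * math.log(nu * math.pi))
    return c - (nu + 1.0) / 2.0 * math.log1p(z * z / nu)

def log_target(x, y):
    """log pi_u(x_{1:T}) of eq. (6), up to an additive constant in 2*pi."""
    lp, x_prev = 0.0, 0.0                 # x_0 = 0 is fixed
    for xt, yt in zip(x, y):
        # Gaussian transition density p(x_t | x_{t-1})
        lp += -0.5 * ((xt - x_prev) / SIGMA) ** 2 - math.log(SIGMA)
        # t-distributed observation density p(y_t | x_t)
        lp += log_t_pdf((yt - xt) / TAU, NU) - math.log(TAU)
        x_prev = xt
    return lp
```

Since only ratios of \pi_{u} enter the acceptance probability in (5), the dropped constants are irrelevant.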

The choice of proposal density is not trivial, and several alternatives could be considered. It is common to consider variations of the random-walk MH (RWMH) or the independent MH (IMH). The RWMH is characterized by the proposal density,

(7)   \begin{align*} q(x_{1:T}^{(i-1)} ,\, x_{1:T}^{(i)}) & \defeq N( x_{1:T}^{(i-1)} ,\, \Omega) \, , \end{align*}

where \Omega is an appropriately chosen variance. A simple approach could be to choose \Omega = \omega I_{T}, where \omega is a scaling parameter that can be chosen to obtain a given level of acceptance probability. The IMH is characterized by

(8)   \begin{align*} q(x_{1:T}^{(i)}) & \defeq N( \mu ,\, \Omega ) \, , \end{align*}

where \mu and \Omega are appropriately chosen parameters. It should be noted that other distributions than the normal can be used instead, and that the expression for the acceptance probability in (5) may be simplified, depending on the choice of proposal.
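Note, for instance, that the RWMH proposal in (7) is symmetric, q(x ,\, x^{\prime}) = q(x^{\prime} ,\, x), so the proposal ratio in (5) cancels and the acceptance probability reduces to

\begin{align*} \alpha^{(i)} = \min\left\{ \frac{\pi_{u}(\tilde{x}^{(i)})}{\pi_{u}(x^{(i-1)})} ,\, 1 \right\} \, , \end{align*}

whereas for the IMH the proposal ratio q(\tilde{x}^{(i)}) / q(x^{(i-1)}) must be retained.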

We are given T=100 observations generated from the model in (1), and told that \nu = 3 and \sigma = 0.01. The observation sequence y_{1:T} is plotted in Figure 2(a).

First, we tried to implement the IMH, setting the mean and variance equal to the moments obtained from the Kalman smoother (matching the moments of the Gaussian SSM to the non-Gaussian SSM under consideration). This MH sampler yields an acceptance rate of at best 1–2 pct., so we abandoned further work on this solution.

Second, we have implemented the RWMH, setting the covariance matrix to \Omega \defeq \omega I_{T}, with scaling \omega = 0.001. The scaling was chosen by hand to obtain an acceptance rate of approximately 30 pct., which is close to optimal for sampling problems of this dimension. We ran 51000 iterations of the RWMH algorithm, discarding the first 1000 iterations as a ‘burn-in’ sample, to allow the chain to approach its stationary distribution. The result is T=100 chains of length N=50000. We have plotted the trace plots for four of these in Figure 1 to illustrate what they look like.
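One sweep of such a sampler takes only a few lines. The Python sketch below is our own illustration, not the report's code: to keep it self-contained, the log-target is a stand-in product of standard normals rather than the smoothing density, and all names are hypothetical.

```python
import math, random

random.seed(1)

def log_target(x):
    # stand-in target: independent standard normals; in the application this
    # would be the unnormalized log smoothing density of eq. (6)
    return -0.5 * sum(xi * xi for xi in x)

def rwmh(x0, omega, n_iter):
    """Random-walk Metropolis-Hastings with proposal N(x, omega * I)."""
    x, lp = list(x0), log_target(x0)
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = [xi + math.sqrt(omega) * random.gauss(0.0, 1.0) for xi in x]
        lp_prop = log_target(prop)
        # symmetric proposal: the acceptance ratio is just the target ratio
        if math.log(random.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
            accepted += 1
        chain.append(list(x))
    return chain, accepted / n_iter

chain, acc_rate = rwmh([0.0] * 10, omega=0.5, n_iter=5000)
```

In practice the scaling `omega` would be tuned, as in the text, until `acc_rate` is near the desired level.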


The chains in Figure 1 are clearly not i.i.d. draws, and therefore the effective sample size is less than the 50000 iterations. We note that the effective sample size can be estimated from the chain autocorrelations, but we have not had time to implement it.
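For reference, a standard estimator is ESS = N / (1 + 2 \sum_{k} \hat{\rho}_{k}), with the sum over empirical autocorrelations truncated at the first non-positive lag. The Python sketch below is ours, not the report's, and is applied to a persistent AR(1) series as a stand-in for one of the T chains:

```python
import random

random.seed(0)

def autocorr(x, k):
    """Empirical lag-k autocorrelation of the sequence x."""
    n = len(x)
    m = sum(x) / n
    var = sum((xi - m) ** 2 for xi in x)
    return sum((x[i] - m) * (x[i + k] - m) for i in range(n - k)) / var

def effective_sample_size(x):
    """ESS = N / (1 + 2 * sum of rho_k), truncated at the first rho_k <= 0."""
    s = 0.0
    for k in range(1, len(x) // 2):
        rho = autocorr(x, k)
        if rho <= 0.0:
            break
        s += rho
    return len(x) / (1.0 + 2.0 * s)

# persistent AR(1) series as a stand-in for an MCMC trace
trace = [0.0]
for _ in range(1999):
    trace.append(0.9 * trace[-1] + random.gauss(0.0, 1.0))
ess = effective_sample_size(trace)
```

For a strongly autocorrelated chain like this one, the estimated ESS is a small fraction of the nominal sample size.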

Having sampled the chains for the hidden state sequence, we can produce an estimate of the expected state, conditional on the sequence of observations,

(9)   \begin{align*} \ex\left[ x_{1:T} \mid y_{1:T} \right] % &= \int x_{1:T} \, p\left( x_{1:T} \mid y_{1:T} \right) \, \kd x_{1:T} \notag \\ % &\approx \frac{1}{N} \sum_{i=1}^{N} x_{1:T}^{(i)} \, , \end{align*}

since the chain iterations x_{1:T}^{(i)} have been sampled from the smoothing density p\left( x_{1:T} \mid y_{1:T} \right). We observe that as N\to\infty, a law of large numbers applies such that the estimate converges in probability to the expectation. To assess the estimation uncertainty of this estimator, we could run the MH sampler another M times, compute the state estimates, and calculate their standard deviation. Note moreover that computing the centered second-order moment of the chains yields a distinctly different statistic, because it says something about the smoothing distribution rather than about the estimator of the expectation. The estimated expected state is displayed in Figure 2(a), and the resulting estimated measurement errors in Figure 2(b). The estimated errors include some large absolute realizations, which is consistent with a t_{3}-distribution.


The mixing properties of the chains produced by the RWMH sampler can be assessed further by inspecting the ACFs. Figure 3(a) displays the ACF corresponding to each chain. We note that at lag 250, the ACFs are still in the interval from 0.2 to 0.8, which indicates that the chains are generally mixing slowly, and in particular that some chains are mixing much more slowly than others.


As a final point, we have considered whether the mixing of a given chain is related to the size of the corresponding measurement error. The hypothesis is that a large (absolute) error would tend to over-reject proposals at the ‘optimal’ scaling factor, thereby producing more persistent chains for the elements with large measurement errors. We investigate this graphically by plotting the ACFs at lag 250 against the log of the absolute value of the measurement error, cf. Figure 3(b). There is a clear positive dependence between the two, which supports the hypothesis that applying the same scaling factor to all elements of the hidden state sequence may result in poorer mixing for the elements with larger measurement errors. A way to solve this problem could be to allow the scaling to vary element by element, but this would of course require thinking carefully about the scaling for each element.

Posted in Econometrics | Leave a comment

Forecasting an ARCH(1) Process

Volatility models are workhorses of modern finance; notable examples include the GARCH/ARCH-type models. They have won their popularity for the ability to capture the tendency of the volatility of asset returns to be persistently high or persistently low (volatility clustering), and exhibit “fat tails” (excess kurtosis). In particular, volatility models are used for financial risk management, such as the value-at-risk of a portfolio of stocks.

However, as we explore in the following, forecasting the volatility of returns is not the same as forecasting the distribution of returns. Therefore, although we can forecast the volatility of returns, we cannot use it to determine the distribution of the returns. In turn, one should be very careful when calculating forecasted tail probabilities, such as forecasting the value-at-risk.

We consider the ARCH(1) process, which is the simplest volatility model around. For a given period of time, say t=1, \, 2, \, \ldots, \, T, the ARCH(1) process states that the returns, x_{t}, are generated by

(1)   \begin{align*}   x_{t}  %   &= h_{t} z_{t}  \\ % %   h_{t}^{2}  %   &= \sigma^{2} + \alpha x_{t-1}^{2} \, , \end{align*}

for x_{0} fixed, z_{t} \sim i.i.d. \, N(0, \, 1), \sigma^{2} > 0, and 0 \leq \alpha < 1. Notice that the upper bound on \alpha ensures covariance stationarity. We are interested in obtaining the k-step forecast of the volatility, which we define as the conditional standard deviation of the returns,

(2)   \begin{align*}   h_{T+k} %   &\defeq \sqrt{\ex\left[ x_{T+k}^{2} \mid \mathcal{F}_{T} \right]} \, , \quad k \in \mathbb{N} \, , \end{align*}

where \mathcal{F}_{T} \defeq \sigma\left( x_{1}, \, x_{2}, \, \ldots, x_{T} \right) denotes the natural filtration.

We obtain the forecast for k=1 simply by substituting the expressions from the equations in (1) and working with the squared volatility, such that

(3)   \begin{align*}   h_{T+1}^{2}   &= \ex\left[ x_{T+1}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \ex\left[ \left( h_{T+1} z_{T+1} \right)^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \ex\left[ h_{T+1}^{2} z_{T+1}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \ex\left[ \left( \sigma^{2} + \alpha x_{T}^{2} \right) z_{T+1}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \sigma^{2} + \alpha \ex\left[ x_{T}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \sigma^{2} + \alpha x_{T}^{2} \, , \end{align*}

where we have used that x_{T} is \mathcal{F}_{T}-measurable and that \ex\left[ z_{T+1}^{2} \right] = 1.

Likewise, we obtain the forecast for k=2,

(4)   \begin{align*}   h_{T+2}^{2}   &= \ex\left[ x_{T+2}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \ex\left[ \left( h_{T+2} z_{T+2} \right)^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \ex\left[ h_{T+2}^{2} z_{T+2}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \ex\left[ \left( \sigma^{2} + \alpha x_{T+1}^{2} \right) z_{T+2}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \sigma^{2} + \alpha \ex\left[ x_{T+1}^{2} \mid \mathcal{F}_{T} \right] \notag \\   &= \sigma^{2} + \alpha \left( \sigma^{2} + \alpha x_{T}^{2} \right) \notag \\   &= \sigma^{2} + \alpha \sigma^{2} + \alpha^{2} x_{T}^{2} \, . \end{align*}

If we continue this exercise for k=3, \, 4, \, \ldots, then by induction we can obtain the forecast for k \geq 1,

(5)   \begin{align*}   h_{T+k}^{2}   &= \sigma^{2} \sum_{i=0}^{k-1} \alpha^{i} + \alpha^{k} x_{T}^{2} \, .  \end{align*}

Observe that this finding holds for any distribution of the innovations z_{t} with unit variance, \ex z_{t}^{2} = 1, and that as k \to \infty the forecast in (5) tends to the stationary variance \var\left[ x_{t} \right] = \frac{\sigma^{2}}{1-\alpha}. On a side note, we can also forecast the mean by a similar recursion, but it is of little interest for the ARCH(1) process, as it is zero for every k.
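The closed form \sigma^{2} \sum_{i=0}^{k-1} \alpha^{i} + \alpha^{k} x_{T}^{2} agrees with iterating the one-step recursion h^{2} \mapsto \sigma^{2} + \alpha h^{2}, and converges to the stationary variance; a quick Python check (our own sketch, with the parameter values used in the figure discussed below):

```python
def variance_forecast(sigma2, alpha, x_T, k):
    """E[x_{T+k}^2 | F_T] by iterating h2 -> sigma2 + alpha * h2."""
    h2 = x_T ** 2
    for _ in range(k):
        h2 = sigma2 + alpha * h2
    return h2

def variance_forecast_closed(sigma2, alpha, x_T, k):
    """Closed form: sigma2 * sum_{i=0}^{k-1} alpha^i + alpha^k * x_T^2."""
    return sigma2 * sum(alpha ** i for i in range(k)) + alpha ** k * x_T ** 2

sigma2, alpha, x_T = 0.5, 0.8, -0.035
stationary = sigma2 / (1.0 - alpha)   # the k -> infinity limit
```
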

However, when considering the above forecasts of the volatility, it is crucial to realize that the distribution of the process at time T+k, for k \geq 2, is no longer Gaussian! Consequently, the mean and variance of x_{T+k} conditional on x_{T} do not characterize the distribution at T+k. The figure below illustrates this; it shows the forecasted densities for an ARCH(1) process with \sigma^{2} = 0.5 and \alpha = 0.8, starting in x_{T} = -0.035, for k=1, \, 2, \, \ldots, \, 8.

The density forecasts clearly illustrate how the tails of the distribution get “fatter” the further into the future we forecast. This reflects that the forecasts, as k \to \infty, will tend to the stationary distribution of the ARCH(1) process, which has fat tails.

The mathematical reason why the distribution of the forecasted returns is not Gaussian is quite straightforward. If we consider the transition density for the ARCH(1) process,

(6)   \begin{align*}   f\left( x_{t} \mid x_{t-1} \right)   &= \frac{1}{\sqrt{2 \pi h_{t}^{2}}} \exp \left\{-\frac{x_{t}^{2}}{2 h_{t}^{2}} \right\} \, ,   \quad h_{t}^{2} = \sigma^{2} + \alpha x_{t-1}^{2} \, ,  \end{align*}

it is clear that the variance h_{t}^{2} is a non-linear function of x_{t-1} (it is quadratic). If x_{t-1} is known, then the density of x_{t} is evidently Gaussian; as was the case with x_{T+1} conditional on x_{T}. Contrariwise, considering the density of x_{t} conditional on x_{t-2}, which can be written in terms of the integral

(7)   \begin{align*}   f\left( x_{t} \mid x_{t-2} \right) %   &= \int f\left( x_{t} \mid x_{t-1} \right) f\left( x_{t-1} \mid x_{t-2} \right) \, \kd x_{t-1} \, ,  \end{align*}

we see that the density function f\left( x_{t} \mid x_{t-2} \right) cannot be Gaussian. It is well-known that integrating out the mean parameter of a Gaussian density against a Gaussian mixing distribution yields another Gaussian density, but mixing over the variance parameter generally does not. Moreover, nonlinear transformations of Gaussian variables are not Gaussian. Here, the variance h_{t}^{2} is a quadratic transformation of the Gaussian variable x_{t-1}, so its distribution is non-Gaussian, in turn rendering the distribution of x_{t} conditional on x_{t-2} non-Gaussian. In fact, there is not even a closed-form solution to the integral in (7). If we apply this finding to the transition density f\left( x_{T+k} \mid x_{T} \right) for k \geq 1, it is clear that the forecasted densities are no longer Gaussian beyond k=1.

Concluding, if the objective is to forecast the value-at-risk, or another tail-dependent statistic, then it is insufficient to forecast the volatility (and the mean, if specified). Instead, the distribution must be approximated by some form of Monte Carlo simulation or bootstrapping. The simplest way is to simulate a large number, say N, of realizations of length k of the ARCH(1) process starting in x_{T}, and produce a histogram of the realizations at horizon k. The above figure was generated this way, setting N=10,000 and applying a kernel density estimator to the realizations. There are, however, other and more efficient methods, such as efficient importance sampling.
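A brute-force version of this simulation takes only a few lines. The Python sketch below is our own illustration, using the parameter values from the figure and a hypothetical tail threshold of -2:

```python
import math, random

random.seed(42)

def simulate_step_k(sigma2, alpha, x_T, k):
    """Draw one realization of x_{T+k} by iterating the ARCH(1) recursion (1)."""
    x = x_T
    for _ in range(k):
        h = math.sqrt(sigma2 + alpha * x * x)
        x = h * random.gauss(0.0, 1.0)
    return x

sigma2, alpha, x_T, k, N = 0.5, 0.8, -0.035, 8, 10000
draws = [simulate_step_k(sigma2, alpha, x_T, k) for _ in range(N)]

# Monte Carlo estimate of the forecasted tail probability P(x_{T+k} < -2 | F_T)
tail_prob = sum(d < -2.0 for d in draws) / N
```

A kernel density estimator applied to `draws` would reproduce a density forecast like the one in the figure; a Gaussian with the same variance forecast would generally misstate such tail probabilities.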

Posted in Econometrics, Finance | Leave a comment

The Kronecker Product and Vectorized Outer Product of Square Matrices

I was recently discussing with my colleague, Rasmus Søndergaard, whether it would be possible to rewrite the Kronecker product of two square matrices in terms of the outer product of the vectorized matrices (which we call the “vectorized outer product” for brevity). We were unable to find any results on this in the literature, e.g. Magnus and Neudecker (2007) and Lütkepohl (1996), but were able to establish the following relation for the simple case of two two-dimensional matrices.

Consider the two 2 \times 2 matrices

(1)   \begin{gather*}   A \defeq  %   \left[\begin{array}{cc}      a_{11} & a_{12} \\      a_{21} & a_{22}    \end{array}\right] \quad \text{and} \quad %     B \defeq  %   \left[\begin{array}{cc}      b_{11} & b_{12} \\      b_{21} & b_{22}    \end{array}\right] \, , \notag \end{gather*}

the Kronecker product of which is

(2)   \begin{gather*}   A \otimes B =  %   \left[\begin{array}{cccc}     a_{11}b_{11} & a_{11}b_{12} & a_{12}b_{11} & a_{12}b_{12} \\      a_{11}b_{21} & a_{11}b_{22} & a_{12}b_{21} & a_{12}b_{22} \\      a_{21}b_{11} & a_{21}b_{12} & a_{22}b_{11} & a_{22}b_{12} \\      a_{21}b_{21} & a_{21}b_{22} & a_{22}b_{21} & a_{22}b_{22} \\    \end{array}\right] \, ,  \end{gather*}

and the vectorized outer product of which is

(3)   \begin{gather*}   \text{vec}(A) \text{vec}(B)^{\prime} =  %   \left[\begin{array}{cccc}     a_{11}b_{11} & a_{11}b_{21} & a_{11}b_{12} & a_{11}b_{22} \\       a_{21}b_{11} & a_{21}b_{21} & a_{21}b_{12} & a_{21}b_{22} \\       a_{12}b_{11} & a_{12}b_{21} & a_{12}b_{12} & a_{12}b_{22} \\       a_{22}b_{11} & a_{22}b_{21} & a_{22}b_{12} & a_{22}b_{22} \\     \end{array}\right] \, .  \end{gather*}

It is evident that both A \otimes B and \text{vec}(A) \text{vec}(B)^{\prime} contain all possible combinations of the elements of A with the elements of B, but that the order in which these are arranged differs.

We observed that it is possible, by vectorization, to relate the elements of A \otimes B and \text{vec}(A) \text{vec}(B)^{\prime} by a matrix,

(4)   \begin{align*}   \left[\begin{array}{c}     a_{11}b_{11} \\     a_{11}b_{21} \\     a_{21}b_{11} \\     a_{21}b_{21} \\     a_{11}b_{12} \\      a_{11}b_{22} \\      a_{21}b_{12} \\      a_{21}b_{22} \\      a_{12}b_{11} \\      a_{12}b_{21} \\      a_{22}b_{11} \\      a_{22}b_{21} \\      a_{12}b_{12} \\      a_{12}b_{22} \\      a_{22}b_{12} \\      a_{22}b_{22} \\    \end{array}\right] %   &= %   \left[\begin{array}{cccc}     \begin{array}{cccc}       1 & 0 & 0 & 0 \\        0 & 0 & 0 & 0 \\        0 & 1 & 0 & 0 \\        0 & 0 & 0 & 0 \\      \end{array} & %     \begin{array}{cccc}       0 & 0 & 0 & 0 \\        1 & 0 & 0 & 0 \\        0 & 0 & 0 & 0 \\        0 & 1 & 0 & 0 \\      \end{array} & %     \mathbf{0} &  %         \mathbf{0} \\  %     \mathbf{0} &  %     \mathbf{0} &  %     \begin{array}{cccc}       1 & 0 & 0 & 0 \\        0 & 0 & 0 & 0 \\        0 & 1 & 0 & 0 \\        0 & 0 & 0 & 0 \\      \end{array} &  %     \begin{array}{cccc}       0 & 0 & 0 & 0 \\        1 & 0 & 0 & 0 \\        0 & 0 & 0 & 0 \\        0 & 1 & 0 & 0 \\      \end{array} \\  %     \begin{array}{cccc}       0 & 0 & 1 & 0 \\        0 & 0 & 0 & 0 \\        0 & 0 & 0 & 1 \\        0 & 0 & 0 & 0 \\      \end{array} &  %     \begin{array}{cccc}       0 & 0 & 0 & 0 \\        0 & 0 & 1 & 0 \\        0 & 0 & 0 & 0 \\        0 & 0 & 0 & 1 \\      \end{array} &  %     \mathbf{0} &  %     \mathbf{0} \\  %     \mathbf{0} &  %     \mathbf{0} &  %     \begin{array}{cccc}       0 & 0 & 1 & 0 \\        0 & 0 & 0 & 0 \\        0 & 0 & 0 & 1 \\        0 & 0 & 0 & 0 \\      \end{array} &  %     \begin{array}{cccc}       0 & 0 & 0 & 0 \\        0 & 0 & 1 & 0 \\        0 & 0 & 0 & 0 \\        0 & 0 & 0 & 1 \\      \end{array} \\    \end{array}\right] %     \left[\begin{array}{c}     a_{11}b_{11} \\      a_{21}b_{11} \\      a_{12}b_{11} \\      a_{22}b_{11} \\      a_{11}b_{21} \\      a_{21}b_{21} \\      a_{12}b_{21} \\      a_{22}b_{21} \\      a_{11}b_{12} \\      a_{21}b_{12} \\      
a_{12}b_{12} \\      a_{22}b_{12} \\      a_{11}b_{22} \\       a_{21}b_{22} \\       a_{12}b_{22} \\       a_{22}b_{22} \\                 \end{array}\right] \, , \end{align*}

where \mathbf{0} denotes the 4 \times 4 zero matrix.

If we denote the matrix by K, such that

(5)   \begin{align*}   \text{vec}\left( A \otimes B \right) = K \text{vec}\left( \text{vec}\left( A \right)\text{vec}\left( B \right)^{\prime} \right) \, , \end{align*}

we see that the Kronecker product is related to the vectorized outer product of the matrices by

(6)   \begin{align*}   A \otimes B = \text{vec}^{-1}\left( K \text{vec}\left( \text{vec}\left( A \right)\text{vec}\left( B \right)^{\prime} \right) \right) \, , \end{align*}

where \text{vec}^{-1} denotes the operator that maps the vectorized matrix back into its matrix form. It should be noted that the mapping between a matrix and its vectorized form is bijective once the dimensions of the matrix are fixed, so \text{vec}^{-1} is well-defined here, with A \otimes B of dimension n^{2} \times n^{2} for n \times n inputs.

This relation holds for the 2 \times 2 case, and we conjecture that it is possible to generalize it to square matrices of any dimension. Intuitively speaking, the Kronecker product and vectorized outer product of the matrices will contain all possible combinations of the matrix elements, regardless of the dimension of the matrices; however, whether there is a systematic way of determining the matrix K for higher dimensions remains unclear.
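The relation can at least be checked numerically for the 2 \times 2 case. The Python sketch below is our own (pure lists, no libraries): `find_K` recovers the permutation by matching entries of \text{vec}(A \otimes B) against entries of \text{vec}(\text{vec}(A)\text{vec}(B)^{\prime}), which works for generic matrices whose entry products are distinct, and the same K is then verified on a second, independent pair:

```python
import random

random.seed(7)

def vec(M):
    """Column-major vectorization (stack the columns of M)."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]

def kron(A, B):
    """Kronecker product of A and B (p x q), as lists of lists."""
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(q * len(A[0]))] for i in range(p * len(A))]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def find_K(A, B):
    """Permutation K with vec(A kron B) = K vec(vec(A) vec(B)'),
    found by matching the (generically distinct) entry values."""
    lhs, rhs = vec(kron(A, B)), vec(outer(vec(A), vec(B)))
    return [[1 if li == rj else 0 for rj in rhs] for li in lhs]

rand2x2 = lambda: [[random.random() for _ in range(2)] for _ in range(2)]
A, B = rand2x2(), rand2x2()
K = find_K(A, B)          # determined once, from one generic pair
A2, B2 = rand2x2(), rand2x2()
```

That the K found from one pair also maps the second pair correctly is consistent with K being a fixed permutation that depends only on the dimensions, not on the matrix entries.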


  • J. R. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd ed., Wiley, 2007.
  • H. Lütkepohl, Handbook of Matrices, Wiley, 1996.
Posted in Econometrics | Leave a comment

An Example of Kalman Filtering: The Surface Air Temperature Anomaly

The Kalman filter is a much used statistical method, and rightly so; its applications range far and wide. In the following we will go through the simplest application that I can think of: how to draw a smooth line through a series of observations. This exercise is instructive for understanding the recursions that make up the Kalman filter. After working through the details of the Kalman filter, you might want to consider other practical aspects, such as parameter estimation. For now, however, we will leave any other issues aside to avoid complicating the presentation.

For the purpose of this exercise we consider annual measurements of the so-called “surface air temperature anomaly” in the United States from 1880 to 2011. The US surface air temperature anomaly is the surface air temperature relative to the 1951-1980 mean. The data is part of the GISTEMP study and is publicly available. I have plotted the observations over time in the figure below.

Contiguous 48 U.S. Surface Air Temperature Anomaly (Celsius)


One question we could have in mind when inspecting the data is whether there is any visual indication of an upward trend in the anomaly, as suggested by global warming. This of course does not constitute a scientific test of global warming in any way, but I digress.

Upon inspecting the plot above, it is not immediately clear whether an upward trend is present or not; the observations are all over the place. To get a better visual impression, we need to somehow filter the noisy observations to extract a clear signal. There are many ways to approach this; the authors of the original study chose to calculate a 5-year centered moving average. The resulting signal of this approach is less noisy than the observations, but has the significant drawback that it does not yield any signal values for 2010 and 2011. Furthermore, it is rather inflexible in terms of how smooth we want the resulting signal to be.

To apply the Kalman filter, we must choose a state-space model (for our purpose, the dynamics of the signal and the noise) which is i) linear and ii) Gaussian. If either of these requirements is violated, the Kalman filter should generally be replaced by a different kind of filter. Linear and Gaussian state-space models are of the form

(1)    \begin{align*}   Y_{t} &= A X_{t} + W_{t} \\   X_{t} &= \Phi X_{t-1} + V_{t} \, , \end{align*}

for t=1, \, 2, \, \ldots, T, where X_{0} is kept fixed, W_{t} \sim i.i.d. \, N(0,\Omega_{W}) and V_{t} \sim i.i.d. \, N(0,\Omega_{V}). We call the first equation the “observation equation”, since the sequence of Y_{t}'s denotes our observations. The second equation is generally referred to as the “signal equation” or “state equation”, since the sequence of X_{t}'s denotes the unobserved signal. Generally, we allow Y_{t} and/or X_{t} to be of dimension higher than one (here upper case denotes vectors and matrices).

Inspired by Durbin and Koopman (2012), we apply the so-called “local level model”, which is simply a one-dimensional random walk plus Gaussian noise. Thus, for t=1, \, 2, \, \ldots, T,

(2)    \begin{align*}   y_{t} &= x_{t} + w_{t} \\   x_{t} &= x_{t-1} + v_{t} \, , \end{align*}

where x_{0} is fixed, w_{t} \sim i.i.d. \, N(0,\omega^{2}_{w}) and v_{t} \sim i.i.d. \, N(0,\omega^{2}_{v}). The idea behind choosing this particular model (alternatives exist) is that the signal, x_{t}, is equally likely to go up as to go down (this follows from setting \Phi = 1). In other words, we are not imposing any direction on the signal. Additionally, we state that the signal is observed through i.i.d. Gaussian measurement errors without undergoing any kind of transformation; e.g. A \neq 1 would have resulted in a scaling of the signal.

The purpose of applying the Kalman filter to a linear and Gaussian state-space model, such as the local level model, is to obtain an estimate of the unobserved signal x_{t}, given observations y_{t} for t=1, \, 2, \, \ldots, T.

For the general linear and Gaussian state-space model above, it can be shown (fairly straightforwardly) that the conditional distribution of X_{t} given Y_{1}, \, \ldots, \, Y_{t} is Gaussian with mean \hat{X}_{t|t} and variance \hat{\Sigma}_{t|t}, where these are given by the recursions, initialized by \hat{X}_{0|0} and \hat{\Sigma}_{0|0},

(3)    \begin{align*}   \hat{X}_{t|t-1} &= \Phi \hat{X}_{t-1|t-1} \\   \hat{\Sigma}_{t|t-1} &= \Phi \hat{\Sigma}_{t-1|t-1} \Phi^{\prime} + \Omega_{V} \\   K_{t} &= \hat{\Sigma}_{t|t-1} A^{\prime}\left( A \hat{\Sigma}_{t|t-1} A^{\prime} + \Omega_{W} \right)^{-1} \\   \hat{X}_{t|t} &= \hat{X}_{t|t-1} + K_{t}\left( Y_{t} - A \hat{X}_{t|t-1} \right) \\   \hat{\Sigma}_{t|t} &= \left( I - K_{t}A \right) \hat{\Sigma}_{t|t-1} \, , \end{align*}

where the first two equations are referred to as the “prediction step”, the last two equations as the “filtering step”, and the K_{t} as the “Kalman gain”. This result is a consequence of some of the nice properties of the Gaussian distribution; namely, if we condition one Gaussian variable on another, then the conditional distribution is still Gaussian. In practice, we then use the mean \hat{X}_{t|t} as the signal, and the variance \hat{\Sigma}_{t|t} as an indication of the noise polluting the signal. There are more sophisticated intuitions to the meanings of the moments, but we will leave those aside.

For the local level model that we are interested in applying, the above recursions simplify somewhat to

(4)    \begin{align*}   \hat{x}_{t|t-1} &= \hat{x}_{t-1|t-1} \\   \hat{\sigma}_{t|t-1}^{2} &= \hat{\sigma}_{t-1|t-1}^{2} + \omega_{v}^{2} \\   k_{t} &= \hat{\sigma}_{t|t-1}^{2} \left( \hat{\sigma}_{t|t-1}^{2} + \omega_{w}^{2} \right)^{-1} \\   \hat{x}_{t|t} &= \hat{x}_{t|t-1} + k_{t}\left( y_{t} - \hat{x}_{t|t-1} \right) \\   \hat{\sigma}_{t|t}^{2} &= \left( 1 - k_{t} \right) \hat{\sigma}_{t|t-1}^{2} \, , \end{align*}

where we initialize by \hat{x}_{0|0} and \hat{\sigma}_{0|0}^{2}, admittedly ad hoc.

The main insight to be had from these recursions is that the signal at time t-1 is updated by first computing the prediction step and then adding the prediction error y_{t} - \hat{x}_{t|t-1} scaled by the Kalman gain k_{t}. Notice that the Kalman gain is large when the predicted variance is large relative to the measurement error variance.

For our application we must choose values for \omega^{2}_{w} and \omega^{2}_{v}, which determine how much noise we want to filter and how much we want the signal to “jump” from year to year, respectively. In practice, these values should be estimated by an appropriate statistical method, e.g. by maximum likelihood. After fiddling around, I arrived at the values \omega^{2}_{w} = 0.5 and \omega^{2}_{v} = 0.01, reflecting that I suppose the typical measurement error is within 2*\sqrt{0.5} \approx 1.41 of the observation and that the signal typically should not move more than 2*\sqrt{0.01} = 0.2 from period to period. Finally, we initialize by \hat{x}_{0|0} = y_{1} and \hat{\sigma}_{0|0}^{2} = 0.1, which is also chosen in an ad hoc manner.
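The recursions in (4) translate directly into code. Since the anomaly series itself is not reproduced here, the Python sketch below (our own) runs the filter on a synthetic random walk plus noise with parameter values of the same flavor as those above:

```python
import random

random.seed(3)

def local_level_filter(y, omega_w2, omega_v2, x0, s0):
    """Kalman filter for the local level model, following eqs. (4)."""
    x, s = x0, s0
    filtered = []
    for yt in y:
        # prediction step
        x_pred, s_pred = x, s + omega_v2
        # Kalman gain and filtering step
        k = s_pred / (s_pred + omega_w2)
        x = x_pred + k * (yt - x_pred)
        s = (1 - k) * s_pred
        filtered.append(x)
    return filtered

# synthetic stand-in for the anomaly series: random walk plus noise
truth, y = [0.0], []
for _ in range(132):                       # 132 annual observations
    truth.append(truth[-1] + random.gauss(0.0, 0.1))
    y.append(truth[-1] + random.gauss(0.0, 0.7))

signal = local_level_filter(y, omega_w2=0.5, omega_v2=0.01, x0=y[0], s0=0.1)
```

The filtered `signal` wiggles far less from period to period than the raw observations, which is exactly the smoothing effect exploited in the plot below.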

Kalman filtered temperature data

Contiguous 48 U.S. Surface Air Temperature Anomaly (Celcius) with signal

Calculating the signal via the above recursions yields the above plot, which reveals what seems like an upward-sloping movement in the level of the yearly anomaly. At this point we might wonder whether this is simply a result of the chosen values for \omega^{2}_{w} and \omega^{2}_{v}, and whether we could produce a completely different picture by simply selecting a different pair. Fortunately, this is not the case, provided that you do not set the parameters to meaningless values (I suggest you go here to download the Ox program and investigate this claim yourself!).

As I mentioned previously, “calibrating” the parameters of a state-space model by hand is not common in practice, although it is sufficient for the purpose of drawing a smooth line through a series of observations as a descriptive statistic. Estimating the parameters via maximum likelihood is quite simple if you are familiar with the likelihood principle. The Kalman filter produces the log-likelihood value of the parameters used for the filtration as a by-product, which can in turn be maximized via numerical optimization routines such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) or Nelder-Mead (simplex). This might be the topic of another post.
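To sketch how the likelihood falls out of the filter: by the prediction error decomposition, with v_{t} = y_{t} - \hat{x}_{t|t-1} and F_{t} = \hat{\sigma}_{t|t-1}^{2} + \omega_{w}^{2}, the Gaussian log-likelihood is \sum_{t} -\tfrac{1}{2}\left( \log 2\pi + \log F_{t} + v_{t}^{2}/F_{t} \right). A minimal Python sketch (ours, with hypothetical names), adding one line to each pass of the filter:

```python
import math

def local_level_loglik(y, omega_w2, omega_v2, x0, s0):
    """Gaussian log-likelihood of the local level model via the prediction
    error decomposition, accumulated alongside the filter recursions (4)."""
    x, s, ll = x0, s0, 0.0
    for yt in y:
        # prediction step
        x_pred, s_pred = x, s + omega_v2
        # prediction error and its variance
        F = s_pred + omega_w2
        v = yt - x_pred
        ll += -0.5 * (math.log(2 * math.pi) + math.log(F) + v * v / F)
        # filtering step
        k = s_pred / F
        x = x_pred + k * v
        s = (1 - k) * s_pred
    return ll
```

This function could then be handed to a numerical optimizer over (\omega_{w}^{2}, \omega_{v}^{2}), in the spirit described above.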


  • J. Durbin and S. J. Koopman, Time Series Analysis by State Space Methods, 2nd ed., Oxford University Press, 2012.
Posted in Econometrics | Leave a comment

Conky Configuration

Conky is a light-weight system monitor for Linux and BSD. I think it can be compared with Rainmeter for Windows and Geektool for OS X (without having ever used either). Unfortunately, Conky does not come with a very useful default configuration. It is up to the user to decide both what to display and how to display it. This task takes a bit of time and energy, but can be well worth the investment. Recently, I have put a bit of effort into crafting a Conky configuration that suits my needs.

Here you can see the end result on my blue desktop background. Most importantly, I want Conky to show me my hardware utilization rates, but also sundry system information and network stats. The network section displays information conditional on which interfaces are running. This is a neat feature, which I have not seen applied before—although I am certain I am not the first to do it.

I have put the conkyrc file up on my server for free use. Disclaimer: The Conky configuration is provided as is; use with caution. No rights reserved.

The conkyrc is available here.

If you have not previously used Conky, here is how to get it up and running. If your system does not come with Conky preinstalled, then look here. Assuming that you have Conky installed, do the following:

1. Create a new directory in your /home, and name it .conky (the “.” makes it a hidden directory).

2. Download the conkyrc file and place it in the .conky directory.

3. Create a text file in your .conky directory, name it, and copy/paste the following script to it:

#!/bin/sh -e
wget -q -O - |
grep -Eo '\<[[:digit:]]{1,3}(\.[[:digit:]]{1,3}){3}\>'

This script retrieves your external IP from

4. If you want Conky to initialize when you log in, then create a text file in your /home directory, name it .startconky, and copy/paste the below script to it:

#!/bin/sh -e
sleep 15
conky -c ~/.conky/conkyrc_hetland
exit 0

Next, ensure that .startconky is executed upon login. In Ubuntu, for example, you can add a new entry to Startup Applications, name it “Conky”, and have it execute ~/.startconky.

5. If your system differs from mine in the number of CPU cores, HDD partitions, etc., you will have to edit the conkyrc file to fit your system. If you have worked with simple scripts before, it should be fairly evident what to change. If you want to customize it beyond that, I recommend looking at all the available variables.

Posted in How To's

A Cointegrated VAR Model with ARCH Innovations

This post outlines a result for the equivalence of a “random coefficient Granger-Johansen representation” and a cointegrated VAR model with ARCH innovations. I find this interesting, because I was under the impression that this particular type of equivalence relationship was unique to the linear model.

Consider a cointegrated VAR(1) model with p variables and r cointegrating relations. That is, the p-dimensional process X_{t} satisfies, for t=1, \ldots, T,

(1)    \begin{align*} \Delta X_{t} &= \alpha \beta^{\prime} X_{t-1} + \epsilon_{t} \, , \end{align*}

for initial values X_{0} fixed, \epsilon_{t} \sim NID(0, \Omega_{\epsilon}) and \alpha, \beta \in \mathbb{R}^{p \times r} with full rank, where the characteristic polynomial has exactly (p-r) roots at z=1, and the remaining roots are larger than one in absolute value, |z|>1.

For the cointegrated VAR(1) model, the Granger-Johansen representation, cf. Theorem 4.2 of Johansen (1996), becomes particularly simple:

(2)    \begin{align*} X_{t} &= C_{\Sigma} \sum_{i=1}^{t} \epsilon_{i} + C_{\xi} \xi_{t} + C_{0} \\ \xi_{t} &= ( I_{r} + \beta^{\prime} \alpha ) \xi_{t-1} + \nu_{t} \, \end{align*}

with the following definitions

(3)    \begin{gather*} C_{\Sigma} = \beta_{\perp}(\alpha_{\perp}^{\prime}\beta_{\perp})^{-1}\alpha_{\perp}^{\prime} \, , \quad C_{\xi} = \alpha (\beta^{\prime}\alpha)^{-1} \, , \quad C_{0} = C_{\Sigma} X_{0} \, , \quad \xi_{t} = \beta^{\prime} X_{t} \, , \quad \nu_{t} = \beta^{\prime} \epsilon_{t} \, , \end{gather*}

where \perp denotes the orthogonal complement, e.g. \beta^{\prime} \beta_{\perp}=0. These definitions may seem superfluous; bear with me, the idea is to—in the future—make this model fit into a bigger picture.
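For concreteness, the loading matrices in (3) are simple to compute numerically. The following Python sketch uses illustrative values of \alpha and \beta (my own choices, with p=3 and r=1) and checks the defining properties of C_{\Sigma} and C_{\xi}:

```python
import numpy as np
from scipy.linalg import null_space

# Illustrative alpha and beta for p = 3 variables and r = 1 cointegrating relation
alpha = np.array([[-0.5], [0.25], [0.0]])
beta = np.array([[1.0], [-1.0], [0.0]])

# Orthogonal complements: orthonormal bases for the null spaces of alpha' and beta'
alpha_perp = null_space(alpha.T)
beta_perp = null_space(beta.T)

# Loading matrices of the Granger-Johansen representation in (3)
C_sigma = beta_perp @ np.linalg.inv(alpha_perp.T @ beta_perp) @ alpha_perp.T
C_xi = alpha @ np.linalg.inv(beta.T @ alpha)

# Sanity checks: beta' C_sigma = 0, C_sigma alpha = 0, and C_sigma + C_xi beta' = I_p
print(np.allclose(beta.T @ C_sigma, 0.0))
print(np.allclose(C_sigma @ alpha, 0.0))
print(np.allclose(C_sigma + C_xi @ beta.T, np.eye(3)))
```

The choice of basis for the orthogonal complements does not matter: C_{\Sigma} is invariant to it, which is why `null_space` is a convenient way to obtain \alpha_{\perp} and \beta_{\perp}.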

As is evident from the Granger-Johansen representation, the cointegrating relations \xi_{t} exhibit autoregressive dynamics in the cointegrated VAR model. However, one can easily imagine other types of dynamics, which may also be relevant to empirical applications, e.g. the autoregressive coefficient could be time-varying. For instance, suppose the autoregressive dynamics are replaced with random coefficient autoregressive dynamics, cf. Nicholls and Quinn (1982), such that:

(4)    \begin{align*} \xi_{t} &= \Phi_{t} \xi_{t-1} + \nu_{t} \, , \end{align*}

where vec(\Phi_{t}) \sim NID(vec(\Phi), \Omega_{\Phi}), \xi_{t} \in \mathbb{R}^{r}, and, for simplicity, \Phi = I_{r} + \beta^{\prime} \alpha. It is then fairly straightforward to show that the “random coefficient Granger-Johansen representation” implies a cointegrated VAR(1) model with ARCH innovations restricted to the cointegrating space. That is, for t=1, \ldots, T

(5)    \begin{align*} \Delta X_{t} &= \alpha \beta^{\prime} X_{t-1} + H_{t}^{1/2} z_{t}\\ H_{t} &= \Omega_{\epsilon} + C_{\xi} \left( \xi_{t-1}^{\prime} \otimes I_{r} \right) \Omega_{\Phi} \left( \xi_{t-1}^{\prime} \otimes I_{r} \right)^{\prime} C_{\xi}^{\prime} \, \end{align*}

where z_{t} \sim NID(0,I_{p}) and the remaining quantities are defined as above; \otimes denotes the Kronecker product.
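To sketch why (4) implies (5) (my own derivation outline, assuming \eta_{t} = \Phi_{t} - \Phi is independent of \epsilon_{t}): difference the representation (2), and use C_{\xi} \beta^{\prime} \alpha = \alpha together with the identity C_{\Sigma} + C_{\xi} \beta^{\prime} = I_{p} and \nu_{t} = \beta^{\prime} \epsilon_{t}, which gives

```latex
\begin{align*}
\Delta X_{t}
&= C_{\Sigma} \epsilon_{t} + C_{\xi} \Delta \xi_{t}
 = C_{\Sigma} \epsilon_{t}
 + C_{\xi} \left( \beta^{\prime} \alpha + \eta_{t} \right) \xi_{t-1}
 + C_{\xi} \beta^{\prime} \epsilon_{t} \\
&= \alpha \beta^{\prime} X_{t-1} + \epsilon_{t}
 + C_{\xi} \left( \xi_{t-1}^{\prime} \otimes I_{r} \right) \mathrm{vec}( \eta_{t} ) \, ,
\end{align*}
```

where the last term uses \mathrm{vec}(\eta_{t} \xi_{t-1}) = (\xi_{t-1}^{\prime} \otimes I_{r}) \mathrm{vec}(\eta_{t}). Conditional on \xi_{t-1}, the innovation is then Gaussian with covariance H_{t} as in (5).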

If you happen to be familiar with univariate random coefficient autoregressive models, aka scalar-RCA models, then this result may seem familiar. There is a similar result that a scalar-RCA model is observationally equivalent to some univariate ARCH model. What I found particularly interesting is that this also holds in the multivariate case, and that it fits nicely into a cointegrated VAR setting as shown above. When I started working on the “random coefficient Granger-Johansen representation”, I did not expect to find an equivalent VAR representation.
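To see the result in action, the ARCH form in (5) is straightforward to simulate. The following Python sketch uses illustrative parameter values of my own, with p=2 and r=1 (so the Kronecker product collapses to a scalar factor):

```python
import numpy as np

rng = np.random.default_rng(42)
T, p, r = 500, 2, 1

# Illustrative parameter values (my own choices)
alpha = np.array([[-0.5], [0.25]])
beta = np.array([[1.0], [-1.0]])
omega_eps = 0.1 * np.eye(p)        # Omega_epsilon
omega_phi = 0.05 * np.eye(r * r)   # Omega_Phi
C_xi = alpha @ np.linalg.inv(beta.T @ alpha)

X = np.zeros((T + 1, p))           # initial value X_0 = 0
for t in range(1, T + 1):
    xi = beta.T @ X[t - 1]         # cointegrating relation xi_{t-1} = beta' X_{t-1}
    # Conditional covariance H_t from (5): ARCH restricted to the cointegrating space
    A = C_xi @ np.kron(xi[None, :], np.eye(r))
    H = omega_eps + A @ omega_phi @ A.T
    X[t] = X[t - 1] + alpha @ (beta.T @ X[t - 1]) + rng.multivariate_normal(np.zeros(p), H)
```

With these values, I_{r} + \beta^{\prime}\alpha = 0.25, so the cointegrating relation is stationary while the common trend wanders, and the conditional variance of the innovations rises with the magnitude of \xi_{t-1}.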


  • S. Johansen, Likelihood-Based Inference in Cointegrated Vector Autoregressive Models (Advanced Texts in Econometrics), Oxford University Press, 1996.
  • D. F. Nicholls and B. G. Quinn, Random Coefficient Autoregressive Models: An Introduction, New York: Springer, 1982.
Posted in Econometrics

Setting WordPress Up for Scientific Typesetting

WordPress does not support many of the typesetting features I use daily: displaying mathematical formulas and symbols, generating bibliographies, and showing code with syntax highlighting.

Since I have just gone through the process of setting up WordPress to accommodate scientific typesetting, I thought it would be helpful to write up a short summary of how I went about it.

I am used to typesetting in LaTeX, so being able to write formulas the way I am used to is at the top of my list, and I am guessing yours too. For this purpose I use MathJax, which is an “open source JavaScript display engine for mathematics that works in all modern browsers”. In fact, I already had it on my server as part of my MediaWiki setup. To use MathJax with WordPress, all I needed was the Mathjax Latex plugin. With this plugin I can produce math like the following:

(1)    \begin{align*} x_{t} &= \mu + \sigma_{t} z_{t} \\ \sigma^{2}_{t} &= \sigma^{2} + \alpha x_{t-1}^{2} \end{align*}

Next on the list is an easy solution for managing references. I use a .bib file (via Mendeley) to manage my references, so I searched for a solution that would let me build on that within WordPress. It turns out that plugins with this functionality are few and far between; however, I did find one that fits the bill: papercite. This plugin can produce bibliographies from my BibTeX library with minimal effort:

  • H. Hansen and S. Johansen, “Some Tests for Parameter Constancy in Cointegrated VAR-Models,” The Econometrics Journal, vol. 2, iss. 2, pp. 306-333, 1999.
  • S. F. Yap and G. C. Reinsel, “Estimation and Testing for Unit Roots in a Partially Nonstationary Vector Autoregressive Moving Average Model,” Journal of the American Statistical Association, vol. 90, iss. 429, pp. 253-267, 1995.
  • R. F. Engle, “Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation,” Econometrica, vol. 50, iss. 4, pp. 987-1007, 1982.

Finally, I want to be able to display code in an easy and readable manner. For this there were a number of plugins, but one seemed to outshine the lot: SyntaxHighlighter Evolved. It supports a long list of languages and can be customized with themes. Moreover, like MathJax, it is built on a general-purpose JavaScript package, so I may be able to use it with my MediaWiki as well! The code below is presented with SyntaxHighlighter Evolved:

for (decl t=1; t<T; ++t)
   Q[t][] = Q[t-1][] * phi[t][]' + eps[t][];

Lastly, I also want a WordPress theme with the same visual characteristics as a sheet of paper, i.e. not disproportionately wide, single column, black text on white, etc. I ended up selecting “Swedish Greys” by Nordic Themepark, but there were many good options to choose from.

Posted in Uncategorized
