Inferring the error variance in Metropolis-Hastings MCMC

One of the great joys of working with two talented post-docs in the research group – Mike Stout and Mudassar Iqbal – as well as a great collaboration with Theodore Kypraios, is that they are often one step ahead of me and I am playing catch-up. Recently, Theo has discussed with them how to estimate the error variance associated with the data used in Metropolis-Hastings MCMC simulations.

The starting point, usually, is that we have some data, let us say $y_i$ for $i=1, \cdots, n$, and a model – usually, in our case, a dynamical system – which we are trying to fit to the data. For any given set of parameters $\theta$, our model will provide estimates for the data points that we will call $\hat y_i$. Now, assuming uniform Gaussian errors, our likelihood function $L(\theta)$ looks like:

$L(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi\sigma^2}}e^{-\frac{1}{2}(\frac{y_i - \hat y_i}{\sigma})^2}$

where $\sigma^2$ is the error variance associated with the data. Now, when I first started using MCMC, I naively thought that we could use values for $\sigma^2$ provided by our experimental collaborators, and so we could use different values of $\sigma^2$ according to how confident our collaborators were in the measurements, equipment etc. What I found in practice was that these values rarely worked (in terms of convergence of the Markov chain) and we have had to make up error variances using trial and error.

So I was delighted when I heard that Theo had briefed both Mike and Mudassar about a method for estimating the error variance as part of the MCMC. Since I have not tried it before, I thought I would give it a go. I am posting the theory and some of my simulations, which are helpful results.

Theory

The theory behind estimating $\sigma^2$ is as follows. First, set

$\tau = \frac{1}{\sigma^2}$

We can then re-write the likelihood, now for the model parameters $\theta$ and also the unknown value $\tau$, as

$L(\bf{\theta}, \tau) = \frac{\tau^{(n/2)}}{\sqrt{2 \pi}^n}e^{-\frac{\tau}{2}\sum_{i=1}^n(y_i - \hat y_i)^2}$

Now observe that this has the functional form of a Gamma distribution for $\tau$, as the p.d.f. for a Gamma distribution is given by:

$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$

So if we set a prior distribution for $\tau$ as a Gamma distribution with parameters $\alpha$ and $\beta$, then the conditional posterior distribution for $\tau$ is given by:

$p(\tau | \theta) \propto \tau^{(n/2)+ \alpha - 1}e^{-\tau(\frac{1}{2}\sum_{i=1}^n(y_i - \hat y_i)^2+\beta)}$

We observe that this is itself a Gamma distribution, with parameters $\alpha \prime = \alpha + n/2$ and $\beta \prime = \beta + \frac{1}{2} \sum_{i=1}^n (y_i - \hat y_i)^2$. Thus the parameter $\tau$ can be sampled with a Gibbs step as part of the MCMC simulation (usually using Metropolis-Hastings steps for the other parameters).

Simulations

The simulations I have run are with a toy model that I use a great deal for teaching. Consider a constitutively-expressed protein that is produced at constant rate $k$ and degrades (or dilutes) at constant rate $\gamma$ per protein. A differential equation for protein concentration $P$ is given by:

$\frac{dP}{dt} = k - \gamma P$

This ODE has the closed form solution:

$P = \frac{k}{\gamma} + (P_0 - \frac{k}{\gamma}) e^{-\gamma t}$

where $P_0$ is the concentration of protein at $t=0$. For the purposes of MCMC estimation, mixing is improved by setting $P_1 = \frac{k}{\gamma}$ so that the closed form solution is:

$P = P_1 + (P_0 - P_1) e^{-\gamma t}$

Some data I have used for teaching purposes comes from the paper Kim, J.M. et al. 2006. Thermal injury induces heat shock protein in the optic nerve head in vivo. Investigative ophthalmology and visual science 47: 4888-94. The data is quantitative Western blots of Hsp70 in the optic nerve of rats, as induced by laser damage. (Apologies for the unpleasantness of the experiment):

 Time / hours Protein / au 3 1100 6 1400 12 1700 18 2100 24 2150

The aim is to use a Metropolis-Hastings MCMC, together with a Gibbs step for the $\tau$ parameter, to fit the data. The issue that immediately arises is how to set the parameters $\alpha$ and $\beta$. This may seem arbitrary, but it is already better than choosing a value for $\sigma^2$, as the Gamma distribution will exploring of that parameter. For my first simulation, I thought that $\sigma = 100$ would be sensible (this turned out to be a remarkably good choice, as we will see). So I set $\alpha = 0.01$ and $\beta = 100$ and lo and behold, the whole MCMC worked beautifully. (Incidentally, I used independent Gaussian proposals for the other three parameters, with standard deviations of 100 for the $P_0$ and $P_1$ and standard deviation of 0.01 for $\gamma$. These parameters were forced to be positive – Darren Wilkinson has an excellent post on doing that correctly. Use of log-normal proposals in this case leads to very poor mixing, with the chain taking some large excursions for the $P_1$ and $\gamma$ parameters).

The median parameter values are $P_0 = 786$, $P_1 = 2526$, $\gamma = 0.0686$ and $\tau = 0.000122$. The latter corresponds to $\sigma = 90.6$. With these values, we can see a good fit to the data: below are plotted the data points (in red), the best fit (with median parameter values) in blue, and model predictions from a random sample of 50 parameter sets from the posterior distribution in black.

Considerations

However, some questions obviously arise: how sensitive is this procedure to choices of $\alpha$ and $\beta$? I will confess: I use Bayesian approaches fairly reluctantly, being more comfortable with classical frequentist statistics. What I like about Bayesian approaches are firstly the description of unknown parameters with a probability distribution, and secondly the availability of highly effective computer algorithms (i.e. MCMC). What makes me uncomfortable is the potential for introducing bias through the prior distributions. So I have carried out some investigations with different values of $\alpha$ and $\beta$. In particular, I wanted to know: (i) what happens if I keep the mean (equal to $\alpha / \beta$) the same but vary the parameters? (ii) what happens if I vary the mean of the distribution? The table below summarizes positive results:

 alpha beta P0 P1 gamma sigma 0.01 100 786 2526 0.0686 90.6 1 10000 747 2428 0.0795 98.0 0.0001 1 797 2533 0.0681 96.3 0.1 10 822 2603 0.0623 97.9 0.001 1000 760 2455 0.0758 94.8 1 1 792 2539 0.0676 64.9 0 0 805 2565 0.0653 98.3

As you can see (please ignore the last line for now), the results are robust to a very wide range of $\alpha$ and $\beta$, even producing a good estimate for $\sigma$ when that estimate is a long way from the mean of the prior distribution. But then we can make the following observation. Consider the sum of squares for a ‘best-fit’ model, for example using the parameters for the first row (this is 12748). So as long as $\alpha \ll n/2$ and $\beta \ll 12748/2$, the prior will introduce very little bias. But if you try to use values of $\alpha$ and especially $\beta$ very much larger than an estimated sum of squares from well-fitted model parameters, then things might go wrong. For example, when I set $\alpha = 1$ and $\beta = 10^6$ then my MCMC did not converge properly.

This leads to my final point, and the final row in the table. Would it be possible to remove prior bias altogether? If you look at the marginal posterior for $\tau$, we observe that if we set $\alpha = \beta = 0$, we obtain a Gamma distribution, whose mean is precisely the error variance, as, in this case,

$\frac{\beta \prime}{\alpha \prime} = \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{n}$

The algorithm should work perfectly well sampling from this Gamma distribution, and indeed it does, producing comparable results to when an informative prior is used.

Conclusions

In summary, I am happy to conclude that this method is good for estimating error variance. Clear advantages are:

1. It is simple to implement and fairly fast to run – adding a Gibbs step is no big deal.
2. It is clearly preferable to making up a fixed number for the error variance – which was what we were doing before.
3. The prior parameters allow you to make use of information you might have from experimental collaborators on likely errors in the data.
4. The level of bias from the priors is relatively low, and can be eliminated altogether.

Some thoughts on Thomas et al. 2012. Directional Migration of Recirculating Lymphocytes through Lymph Nodes via Random Walks.

This article [1] was published just over 4 months ago by Benny Chain’s laboratory in UCL, based on work carried out for Niclas Thomas’s PhD. It is a rare privilege to read an article that so clearly relates itself to work that I carried out – indeed work carried out during my own PhD. Therefore I have decided to post some thoughts on this (and related) articles, which I will also post on the PLoS ONE web site.

Thomas et al.’s work contains new data on the distribution of lymphocyte transit times, together with rigorous fitting of mathematical models to their data. Importantly, they show that their data can be fitted by a random walk model that allows for motion orthogonal to the main direction of motion (i.e. through the lymph node tissue). This random walk model, although implemented in one dimension, is intended to reflect a three dimensional motion in which the cells move either along the main axis of motion, or in dimensions perpendicular to it.

There were two models for lymphocyte recirculation that I proposed for my PhD, both of which were implemented in a one dimensional domain (along the lymph node). The first was a convection-diffusion model [2], which could be thought of as a biased random walk, although could be sufficiently general to include other mechanisms. The second was a model that proposed that lymphocyte migration was halted due to encounters with dendritic or other cell types [3, 4]. Both models could explain the available data on lymphocyte recirculation transit times.

Following my own PhD, two photon microscopy technology developed to the point where the motion of individual lymphocytes could be tracked [5, 6, 7]. This has led to the discovery that lymphocyte migration is essentially random, and that the hypothesis I set forward in [3, 4] is false. Thomas et al.’s work, following work by Textor et al. [7], has shown that indeed the distribution of lymphocyte recirculation times can be explained well by a three-dimensional diffusion model.

So, how do I feel to have had some of my research empirically falsified? Well, actually, it makes me rather happy! Now don’t get me wrong: I would much rather that the two photon microscopy had shown T cells moving along the lymph nodes, stopping at dendritic cells for a while, and then moving along again, as my work of [3, 4] proposed. That is not the case. I take solace in two things. First, the convection-diffusion model of [2] is essentially correct in its most naive form. But more significantly, I can hear Karl Popper cheering me on from the side-lines: my hypotheses were good science, even if incorrect. The important point to learn from [2, 3, 4] is that the observed distribution of lymphocyte transit times is non-trivial: it demands a mechanistic explanation, and, at that time, no such explanations had been proposed. The dendritic cell hypothesis was perfectly plausible (in fact, I came up with it through discussions with Benny Chain and David Katz, with whom Benny Chain shared a lab) – and worthy of experimental testing. (Parenthetically, this was in the bad old days of “mathematical biology”, when theoreticians generally worked independently of experimentalists. I remember once, at that time, a theoretician proudly stating at a conference that no experimental data could falsify their model. Thankfully things are much better today under the “systems biology” paradigm, and Thomas et al’s article is an excellent example of experiment and theory working so well together).

To be honest, the thing that annoys me a little is that I didn’t think to check myself, during my PhD, whether three dimensional diffusion could explain the distribution of recirculation times (and full credit to Niclas Thomas and others for investigating this!). At the time, I was unhappy with the value of the one-dimensional diffusion coefficient that my first model needed to fit the data: based on a Brownian motion calculation for T-cell diffusivity, I thought that it was two orders too fast to be realistic as random motion, and so looked to other explanations. Indeed, this was discussed at my PhD viva, and my examiners (an eminent modelling-friendly immunologist and an eminent mathematical biologist) made me correct my thesis to include the Brownian motion calculation. But the assumptions behind the calculation were completely wrong, for reasons that my examiners and I should have known. First, on biological grounds, the calculations were based on Brownian motion of T-cells – when of course we knew that T-cells move actively – and even then there was some data available on speed of such movement (e.g. from Tim Springer’s lab). Second, the calculations were based on a 1-D diffusion – when of course we know that 3D diffusion is qualitatively and quantitatively different (e.g. the recurrence / transience of 1D / 3D random walks taught in undergraduate Markov chain courses). I had even written a 3D diffusion simulator (for another part of my PhD) which could have easily been used to test the hypothesis. Hind-sight is a wonderful thing!

All-in-all, I congratulate Niclas Thomas on this work. I think it is wonderful and enhances our knowledge of this extremely important field.

1. Thomas N, Matejovicova L, Srikusalanukul W, Shawe-Taylor J, Chain B. 2012. Directional Migration of Recirculating Lymphocytes through Lymph Nodes via Random Walks. PLoS ONE 7(9): e45262.

2. Stekel, D.J., Parker, C.E. and Nowak, M.A. 1997. A model of lymphocyte recirculation. Immunology Today, 18:216-21.

3. Stekel, D.J. 1997. The role of inter-cellular adhesion in the recirculation of T lymphocytes. Journal of Theoretical Biology, 186:491-501.

4. Stekel, D.J. 1998. The simulation of density-dependent effects in the recirculation of T lymphocytes. Scandinavian journal of immunology47:426-30.

5. Miller MJ, Hejazi AS, Wei SH, Cahalan MD, Parker I. 2004. T cell repertoire scanning is promoted by dynamic dendritic cell behavior and random t cell motility in the lymph node. Proceedings of the National Academy of Sciences of the United States of America 101: 998-1003.

6. Beltman JB, Mare AFM, Lynch JN, Miller MJ, de Boer RJ. 2007. Lymph node topology dictates t cell migration behavior. The Journal of Experimental Medicine 204: 771–780.

7. Textor J, Peixoto A, Henrickson SE, Sinn M, von Andrian UH, Westermann, J. 2011. Defining the quantitative limits of intravital two-photon lymphocyte tracking. Proceedings of the National Academy of Sciences of the United States of America 108: 12401–6.