# Inferring the error variance in Metropolis-Hastings MCMC

One of the great joys of working with two talented post-docs in the research group – Mike Stout and Mudassar Iqbal – as well as a great collaboration with Theodore Kypraios, is that they are often one step ahead of me and I am playing catch-up. Recently, Theo has discussed with them how to estimate the error variance associated with the data used in Metropolis-Hastings MCMC simulations.

The starting point, usually, is that we have some data, let us say $y_i$ for $i=1, \cdots, n$, and a model – usually, in our case, a dynamical system – which we are trying to fit to the data. For any given set of parameters $\theta$, our model will provide estimates for the data points that we will call $\hat y_i$. Now, assuming uniform Gaussian errors, our likelihood function $L(\theta)$ looks like: $L(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi\sigma^2}}e^{-\frac{1}{2}(\frac{y_i - \hat y_i}{\sigma})^2}$

where $\sigma^2$ is the error variance associated with the data. Now, when I first started using MCMC, I naively thought that we could use values for $\sigma^2$ provided by our experimental collaborators, and so we could use different values of $\sigma^2$ according to how confident our collaborators were in the measurements, equipment etc. What I found in practice was that these values rarely worked (in terms of convergence of the Markov chain) and we have had to make up error variances using trial and error.

So I was delighted when I heard that Theo had briefed both Mike and Mudassar about a method for estimating the error variance as part of the MCMC. Since I have not tried it before, I thought I would give it a go. Below are the theory and some of my simulations, in the hope that the results are helpful.

#### Theory

The theory behind estimating $\sigma^2$ is as follows. First, set $\tau = \frac{1}{\sigma^2}$

We can then re-write the likelihood, now for the model parameters $\theta$ and also the unknown value $\tau$, as $L(\bf{\theta}, \tau) = \frac{\tau^{(n/2)}}{\sqrt{2 \pi}^n}e^{-\frac{\tau}{2}\sum_{i=1}^n(y_i - \hat y_i)^2}$

Now observe that this has the functional form of a Gamma distribution for $\tau$, as the p.d.f. for a Gamma distribution is given by: $f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$

So if we set a prior distribution for $\tau$ as a Gamma distribution with parameters $\alpha$ and $\beta$, then the conditional posterior distribution for $\tau$ is given by: $p(\tau | \theta) \propto \tau^{(n/2)+ \alpha - 1}e^{-\tau(\frac{1}{2}\sum_{i=1}^n(y_i - \hat y_i)^2+\beta)}$

We observe that this is itself a Gamma distribution, with parameters $\alpha \prime = \alpha + n/2$ and $\beta \prime = \beta + \frac{1}{2} \sum_{i=1}^n (y_i - \hat y_i)^2$. Thus the parameter $\tau$ can be sampled with a Gibbs step as part of the MCMC simulation (usually using Metropolis-Hastings steps for the other parameters).
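As an illustration of the Gibbs step, here is a minimal Python sketch, assuming NumPy is available. The function name `sample_tau` and the residual values are made up for the example and are not taken from the simulations in this post. Note that NumPy's Gamma sampler takes a shape and a scale, so the rate $\beta \prime$ has to be inverted.

```python
import numpy as np

def sample_tau(residuals, alpha, beta, rng):
    """Draw tau = 1/sigma^2 from its Gamma(alpha', beta') conditional posterior."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    alpha_post = alpha + n / 2.0                      # alpha' = alpha + n/2
    beta_post = beta + 0.5 * np.sum(residuals ** 2)   # beta' = beta + SS/2
    # NumPy parameterises the Gamma by shape and scale, so scale = 1 / rate.
    return rng.gamma(shape=alpha_post, scale=1.0 / beta_post)

rng = np.random.default_rng(0)
residuals = np.array([-10.0, 25.0, -60.0, 80.0, -40.0])  # hypothetical y_i - yhat_i
draws = np.array([sample_tau(residuals, 0.01, 100.0, rng) for _ in range(20000)])

# The sample mean should be close to the theoretical mean alpha'/beta'.
expected_mean = (0.01 + 2.5) / (100.0 + 0.5 * np.sum(residuals ** 2))
print(draws.mean(), expected_mean)
```

In a full sampler this draw would be made once per iteration, using the residuals from the current values of the other parameters.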

#### Simulations

The simulations I have run are with a toy model that I use a great deal for teaching. Consider a constitutively-expressed protein that is produced at constant rate $k$ and degrades (or dilutes) at constant rate $\gamma$ per protein. A differential equation for protein concentration $P$ is given by: $\frac{dP}{dt} = k - \gamma P$

This ODE has the closed form solution: $P = \frac{k}{\gamma} + (P_0 - \frac{k}{\gamma}) e^{-\gamma t}$

where $P_0$ is the concentration of protein at $t=0$. For the purposes of MCMC estimation, mixing is improved by setting $P_1 = \frac{k}{\gamma}$ so that the closed form solution is: $P = P_1 + (P_0 - P_1) e^{-\gamma t}$

Some data I have used for teaching purposes comes from the paper Kim, J.M. et al. 2006. Thermal injury induces heat shock protein in the optic nerve head in vivo. Investigative ophthalmology and visual science 47: 4888-94. The data is quantitative Western blots of Hsp70 in the optic nerve of rats, as induced by laser damage. (Apologies for the unpleasantness of the experiment):

| Time / hours | Protein / au |
| --- | --- |
| 3 | 1100 |
| 6 | 1400 |
| 12 | 1700 |
| 18 | 2100 |
| 24 | 2150 |

The aim is to use a Metropolis-Hastings MCMC, together with a Gibbs step for the $\tau$ parameter, to fit the data. The issue that immediately arises is how to set the parameters $\alpha$ and $\beta$. This may seem arbitrary, but it is already better than choosing a fixed value for $\sigma^2$, as the Gamma distribution allows that parameter to be explored. For my first simulation, I thought that $\sigma = 100$ would be sensible (this turned out to be a remarkably good choice, as we will see). So I set $\alpha = 0.01$ and $\beta = 100$, which gives a prior mean for $\tau$ of $\alpha / \beta = 10^{-4} = 1/100^2$, and lo and behold, the whole MCMC worked beautifully. (Incidentally, I used independent Gaussian proposals for the other three parameters, with standard deviations of 100 for $P_0$ and $P_1$ and a standard deviation of 0.01 for $\gamma$. These parameters were forced to be positive; Darren Wilkinson has an excellent post on doing that correctly. Use of log-normal proposals in this case leads to very poor mixing, with the chain taking some large excursions for the $P_1$ and $\gamma$ parameters.)

The median parameter values are $P_0 = 786$, $P_1 = 2526$, $\gamma = 0.0686$ and $\tau = 0.000122$. The latter corresponds to $\sigma = 90.6$. With these values, we can see a good fit to the data: below are plotted the data points (in red), the best fit (with median parameter values) in blue, and model predictions from a random sample of 50 parameter sets from the posterior distribution in black.

#### Considerations

However, some questions obviously arise: how sensitive is this procedure to choices of $\alpha$ and $\beta$? I will confess: I use Bayesian approaches fairly reluctantly, being more comfortable with classical frequentist statistics. What I like about Bayesian approaches are firstly the description of unknown parameters with a probability distribution, and secondly the availability of highly effective computer algorithms (i.e. MCMC). What makes me uncomfortable is the potential for introducing bias through the prior distributions. So I have carried out some investigations with different values of $\alpha$ and $\beta$. In particular, I wanted to know: (i) what happens if I keep the prior mean (equal to $\alpha / \beta$) the same but vary the parameters? (ii) what happens if I vary the mean of the distribution? The table below summarizes the results:

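| $\alpha$ | $\beta$ | $P_0$ | $P_1$ | $\gamma$ | $\sigma$ |
| --- | --- | --- | --- | --- | --- |
| 0.01 | 100 | 786 | 2526 | 0.0686 | 90.6 |
| 1 | 10000 | 747 | 2428 | 0.0795 | 98.0 |
| 0.0001 | 1 | 797 | 2533 | 0.0681 | 96.3 |
| 0.1 | 10 | 822 | 2603 | 0.0623 | 97.9 |
| 0.001 | 1000 | 760 | 2455 | 0.0758 | 94.8 |
| 1 | 1 | 792 | 2539 | 0.0676 | 64.9 |
| 0 | 0 | 805 | 2565 | 0.0653 | 98.3 |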

As you can see (please ignore the last line for now), the results are robust to a very wide range of $\alpha$ and $\beta$, even producing a good estimate for $\sigma$ when that estimate is a long way from the mean of the prior distribution. But then we can make the following observation. Consider the sum of squares for a ‘best-fit’ model, for example using the parameters for the first row (this is 12748). So as long as $\alpha \ll n/2$ and $\beta \ll 12748/2$, the prior will introduce very little bias. But if you try to use values of $\alpha$ and especially $\beta$ very much larger than an estimated sum of squares from well-fitted model parameters, then things might go wrong. For example, when I set $\alpha = 1$ and $\beta = 10^6$ then my MCMC did not converge properly.
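To make this comparison concrete, the sketch below recomputes the sum of squares at the first-row parameter values and reports what fraction of $\alpha \prime$ and $\beta \prime$ comes from the prior. It assumes NumPy; the helper `prior_share` is a hypothetical name introduced only for this illustration. Because the sum of squares is evaluated at one fixed parameter set, the fractions are a rough guide rather than an exact measure of bias.

```python
import numpy as np

# Data from the table in the Simulations section.
t = np.array([3.0, 6.0, 12.0, 18.0, 24.0])
y = np.array([1100.0, 1400.0, 1700.0, 2100.0, 2150.0])

def model(t, p0, p1, gamma):
    """Closed-form solution P = P1 + (P0 - P1) exp(-gamma t)."""
    return p1 + (p0 - p1) * np.exp(-gamma * t)

# Sum of squares at the median parameter values from the first table row.
ss = np.sum((y - model(t, 786.0, 2526.0, 0.0686)) ** 2)
n = y.size

def prior_share(alpha, beta):
    """Fractions of alpha' and beta' that are contributed by the prior."""
    return alpha / (alpha + n / 2.0), beta / (beta + ss / 2.0)

print(f"sum of squares = {ss:.0f}")
for alpha, beta in [(0.01, 100.0), (1.0, 1.0), (1.0, 1.0e6)]:
    a_share, b_share = prior_share(alpha, beta)
    print(f"alpha={alpha:g}, beta={beta:g}: "
          f"prior share of alpha'={a_share:.3f}, of beta'={b_share:.3f}")
```

With only $n = 5$ data points, $n/2 = 2.5$, so $\alpha = 1$ is not small compared with the data contribution; this may be part of the reason the $\alpha = \beta = 1$ row gives a noticeably lower $\sigma$ than the others.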

This leads to my final point, and the final row in the table. Would it be possible to remove prior bias altogether? If you look at the conditional posterior for $\tau$, we observe that if we set $\alpha = \beta = 0$, we obtain a Gamma distribution whose mean is precisely the reciprocal of the mean squared residual, which is the natural estimate of the error variance, as, in this case, $\frac{\beta \prime}{\alpha \prime} = \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{n}$

The algorithm should work perfectly well sampling from this Gamma distribution, and indeed it does, producing comparable results to when an informative prior is used.
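For readers who want to experiment, here is a minimal sketch of the whole procedure in Python with NumPy. It is not the code used to produce the results above, and it makes several assumptions that are not from the post: the starting values, a joint random-walk update of all three parameters, flat priors within the bounds in `upper`, and the burn-in length. The proposal standard deviations are the ones quoted in the Simulations section. Passing `alpha=0.0, beta=0.0` gives the prior-free case.

```python
import numpy as np

t = np.array([3.0, 6.0, 12.0, 18.0, 24.0])
y = np.array([1100.0, 1400.0, 1700.0, 2100.0, 2150.0])

def sum_sq(theta):
    """Sum of squared residuals for theta = (P0, P1, gamma)."""
    p0, p1, gamma = theta
    return np.sum((y - (p1 + (p0 - p1) * np.exp(-gamma * t))) ** 2)

def run_chain(alpha, beta, n_iter=20000, seed=1):
    rng = np.random.default_rng(seed)
    theta = np.array([800.0, 2500.0, 0.07])   # assumed starting values
    step = np.array([100.0, 100.0, 0.01])     # proposal SDs quoted in the post
    upper = np.array([1.0e4, 1.0e4, 1.0])     # assumed upper bounds (flat priors)
    tau = 1.0e-4                              # assumed start, i.e. sigma = 100
    ss = sum_sq(theta)
    out = np.empty((n_iter, 4))
    for i in range(n_iter):
        # Metropolis step: Gaussian random-walk proposal for (P0, P1, gamma).
        prop = theta + step * rng.standard_normal(3)
        # Proposals outside the allowed region are simply rejected.
        if np.all(prop > 0) and np.all(prop < upper):
            ss_prop = sum_sq(prop)
            if np.log(rng.uniform()) < -0.5 * tau * (ss_prop - ss):
                theta, ss = prop, ss_prop
        # Gibbs step: tau | theta ~ Gamma(alpha + n/2, beta + SS/2).
        tau = rng.gamma(shape=alpha + y.size / 2.0, scale=1.0 / (beta + 0.5 * ss))
        out[i] = (*theta, tau)
    return out

chain = run_chain(alpha=0.01, beta=100.0)
kept = chain[5000:]                           # assumed burn-in of 5000 iterations
p0_med, p1_med, gamma_med, tau_med = np.median(kept, axis=0)
print(p0_med, p1_med, gamma_med, 1.0 / np.sqrt(tau_med))
```

The numbers this prints will not necessarily match the table, since the details of the original sampler are not reproduced here.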

#### Conclusions

In summary, I am happy to conclude that this method is good for estimating error variance. Clear advantages are:

1. It is simple to implement and fairly fast to run – adding a Gibbs step is no big deal.
2. It is clearly preferable to making up a fixed number for the error variance – which was what we were doing before.
3. The prior parameters allow you to make use of information you might have from experimental collaborators on likely errors in the data.
4. The level of bias from the priors is relatively low, and can be eliminated altogether.

# PhD opportunities at the University of Nottingham

The University of Nottingham and the Rothamsted Research Institute are now advertising for 42 fully funded four-year PhD places in their Doctoral Training Partnership. For applicants with a maths, physics or computing background interested in mathematical / computational biology, there are opportunities in all three themes to become involved in world-leading bioscience research. There are three projects on which I would be a second / third supervisor.

1. Bayesian Inference for Dynamical Systems: From Parameter Estimation to Experimental Design with Theodore Kypraios (maths) as main supervisor. This project will be entirely mathematical / computational.
2. The role of a novel zinc uptake system (C1265-7) in uropathogenic E. coli, with Jon Hobman as main supervisor. This project will be mostly experimental, but could involve a mathematical modelling component should the student be interested.
3. Tunable zinc responsive bacterial promoters for controlled gene expression in E. coli, with Phil Hill as main supervisor. This project will be mostly experimental, but could involve a mathematical modelling component should the student be interested.

# Speaking at Workshop: Recent Advances in Statistical Inference for Mathematical Biology

Today I will be presenting at the Mathematical Biosciences Institute at Ohio State University, which this week is hosting the workshop Recent Advances in Statistical Inference for Mathematical Biology. I will be giving a talk about Hiroki’s work (abstract here and below), while Dorota will be presenting a poster about her work.

I am very excited about this workshop as it is the first to my knowledge to bring together mathematical modelling with statistical inference. In my view, this marriage is crucial to the future development of mathematical biology as a field.

Title:

Inferring the gap between mechanism and phenotype in dynamical models of gene regulation

Abstract:

Dynamical (differential equation) models in molecular biology are often cast in terms of biological mechanisms such as transcription, translation and protein-protein and protein-DNA interactions. However, most molecular biological measurements are at the phenotypic level, such as levels of gene or protein expression in wild type and chemically or genetically perturbed systems. Mechanistic parameters are often difficult or impossible to measure. We have been combining dynamical models with statistical inference as a means to integrate phenotypic data with mechanistic hypotheses. In doing so we are able to identify key parameters that determine system behaviour, and parameters with insufficient evidence to estimate, and thus make informed predictions for further experimental work. We are also able to use inferred parameters to build stochastic and multi-scale models to investigate behaviour at single-cell level. We apply these ideas to two systems in microbiology: global gene regulation in the antibiotic-resistance bearing RK2 plasmids, and zinc uptake and efflux regulation in Escherichia coli.

# Teaching maths to biologists – report from HE Academy event

Yesterday I attended an event at the University of Reading run by the Bioscience centre of the Higher Education Academy. At that event, I heard talks from or had informal discussions with academics teaching mathematics to biology undergraduates / postgraduates at a number of institutions, including Abertay, Anglia Ruskin, Bath, Cambridge, Cardiff, Liverpool and Reading. Several very interesting key points emerged that will help inform my maths teaching next year, especially to the first year undergraduates.

1. All institutions are facing the same issues, regardless of ‘status’. Specifically:
   1. General recognition of the quite separate issues of teaching basic maths skills to all u/g biologists and the teaching of higher level skills to biologists to become involved in Systems Biology research.
   2. The skills required by all undergraduate biologists around units, concentrations, powers, logarithms, exponentials, basic algebra (manipulating equations) and basic numeracy (is the answer plausible).
   3. The range of background / abilities of students coming into university study. This is linked to a wide range of school experience, from students with no more maths teaching after a ‘C’ in GCSE maths through to students with an ‘A’ in A-level maths, and everything in between.
2. The importance of gathering the right data and evidence. This includes:
   1. Information on background of students, including numbers of students with GCSE, AS and A2 maths, and grades of those students.
   2. Feedback on different elements of the teaching, specifically how helpful the students are finding lectures, practicals, worksheets, on-line materials and so forth.
   3. Formative assessment during the course of the term to identify students who are struggling with particular elements and direct (often limited) tutorial resource to those students.
3. The importance of blended approaches, specifically:
   1. The findings from Liverpool that the students found workshops and tutorials far more valuable than the lectures: they had six 3 hour workshops AND sign-in tutorials.
   2. The findings from Abertay that a system of regular on-line tests, with extra tutorials if not meeting goals and electronic nagging, massively improved results.
   3. The use in Bath of on-line tests with 100% pass marks but many attempts allowed to improve learning of key concepts.
4. Helpful ideas about approaches, three C’s from Anglia Ruskin:
   1. Context: very important to include biological context – embedding the mathematics in biological problems. Reading used the analogy of teaching people to make hammers and screwdrivers without telling them about nails and screws.
   2. Confidence: very important to build students’ confidence, even if this means giving very high marks (doesn’t matter as in most universities 1st year marks only count for progression).
   3. Continuity: need to link both with school-level work and with material in other 1st, 2nd and 3rd year modules: this is a challenge for all module leaders/lecturers.
5. A large number of on-line resources, which I have not yet looked at, including:
   1. Bionrich
   2. Essential maths for medics and vets
   3. mathtutor
   4. NuMBers
   5. SUMS
   6. biomathtutor
   7. mathstore
   8. And, quite differently, StarLogo TNG

All-in-all, a highly successful and interesting day, very timely given my first-year teaching, and I look forward to embedding some of these ideas, practices and resources in next year’s running of the module.