# First use of bioRxiv: Reconstructing Promoter Activity From Lux Bioluminescent Reporters

Today I have made a new publication foray and submitted a manuscript to bioRxiv. This is the main paper to have come out of work on our BBSRC Lux grant. We are yet to find a peer-review home – but one of our co-authors has already had a conversation with someone who wants to use the method – so it was time to put the manuscript out there while we continue with the peer-review process. R code and Biomodels submission will follow. The manuscript details are:

# New publication: A Bayesian approach to analyzing phenotype microarray data enables estimation of microbial growth parameters

Delighted to say that our first paper of 2016 is published. Matthias’s Biolog paper is now online-ready with the Journal of Bioinformatics and Computational Biology. This is also the first output from our Biolog grant – with a second paper detailing our newer software and analysis being planned.

Gerstgrasser M, Nicholls S, Stout M, Smart K, Powell C, Kypraios T and Stekel DJ. 2016. A Bayesian approach to analyzing phenotype microarray data enables estimation of microbial growth parameters. J Bioinform Comput Biol. DOI: 10.1142/S0219720016500074

### Abstract

Biolog phenotype microarrays (PMs) enable simultaneous, high throughput analysis of cell cultures in different environments. The output is high-density time-course data showing redox curves (approximating growth) for each experimental condition. The software provided with the Omnilog incubator/reader summarizes each time-course as a single datum, so most of the information is not used. However, the time courses can be extremely varied and often contain detailed qualitative (shape of curve) and quantitative (values of parameters) information. We present a novel, Bayesian approach to estimating parameters from Phenotype Microarray data, fitting growth models using Markov Chain Monte Carlo (MCMC) methods to enable high throughput estimation of important information, including length of lag phase, maximal “growth” rate and maximum output. We find that the Baranyi model for microbial growth is useful for fitting Biolog data. Moreover, we introduce a new growth model that allows for diauxic growth with a lag phase, which is particularly useful where Phenotype Microarrays have been applied to cells grown in complex mixtures of substrates, for example in industrial or biotechnological applications, such as worts in brewing. Our approach provides more useful information from Biolog data than existing, competing methods, and allows for valuable comparisons between data series and across different models.
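As a rough illustration of the kind of growth model fitted here: the Baranyi model mentioned in the abstract has a well-known closed form for log cell density (Baranyi & Roberts, 1994). This is not the paper's code, just a hedged Python sketch using the standard parameterisation; names are illustrative, and the paper's diauxic extension is not reproduced.

```python
import numpy as np

def baranyi(t, y0, ymax, mu_max, h0):
    """Baranyi-Roberts growth curve (log cell density).

    y0, ymax : initial and maximal log-density
    mu_max   : maximal specific growth rate
    h0       : lag parameter (h0 = mu_max * lag time)
    """
    t = np.asarray(t, dtype=float)
    # adjusted time A(t), which encodes the lag phase
    A = t + np.log(np.exp(-mu_max * t) + np.exp(-h0)
                   - np.exp(-mu_max * t - h0)) / mu_max
    # logistic-style saturation towards ymax
    return y0 + mu_max * A - np.log(
        1.0 + (np.exp(mu_max * A) - 1.0) / np.exp(ymax - y0))
```

The curve starts at `y0`, delays exponential growth by roughly `h0 / mu_max`, and saturates at `ymax`.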

# New Publication: Analysis of Occludin Trafficking, Demonstrating Continuous Endocytosis, Degradation, Recycling and Biosynthetic Secretory Trafficking

We are delighted that our work with Josh Rappoport‘s laboratory, supported by the Birmingham-Nottingham Strategic Collaboration Fund, has led to successful publication. Well done to all, especially Sarah Fletcher who carried out all the experimental work, Mudassar Iqbal who did the model data fitting, and Sara Jabbari who has helped out both Mudassar with the modelling and Sarah with sorting out journal requirements.

Fletcher, S.J., Iqbal, M., Jabbari, S., Stekel, D.J. and Rappoport, J.Z. 2014. Analysis of Occludin Trafficking, Demonstrating Continuous Endocytosis, Degradation, Recycling and Biosynthetic Secretory Trafficking. PLoS ONE DOI: 10.1371/journal.pone.0111176.

### Abstract

Tight junctions (TJs) link adjacent cells and are critical for maintenance of apical-basolateral polarity in epithelial monolayers. The TJ protein occludin functions in disparate processes, including wound healing and Hepatitis C Virus infection. Little is known about steady-state occludin trafficking into and out of the plasma membrane. Therefore, we determined the mechanisms responsible for occludin turnover in confluent Madin-Darby canine kidney (MDCK) epithelial monolayers. Using various biotin-based trafficking assays, we observed continuous and rapid endocytosis of plasma membrane localised occludin (the majority internalised within 30 minutes). By 120 minutes a significant reduction in internalised occludin was observed. Inhibition of lysosomal function attenuated the reduction in occludin signal post-endocytosis and promoted co-localisation with the late endocytic system. Using a similar method, we demonstrated that ~20% of internalised occludin was transported back to the cell surface. Consistent with these findings, significant co-localisation between internalised occludin and recycling endosomal compartments was observed. We then quantified the extent to which occludin synthesis and transport to the plasma membrane contributes to plasma membrane occludin homeostasis, finding that inhibition of protein synthesis led to decreased plasma membrane localised occludin. Significant co-localisation between occludin and the biosynthetic secretory pathway was demonstrated. Thus, under steady-state conditions occludin undergoes turnover via a continuous cycle of endocytosis, recycling and degradation, with degradation compensated for by biosynthetic exocytic trafficking. We developed a mathematical model to describe the endocytosis, recycling and degradation of occludin, utilising experimental data to provide quantitative estimates for the rates of these processes.

# Inferring the error variance in Metropolis-Hastings MCMC

One of the great joys of working with two talented post-docs in the research group – Mike Stout and Mudassar Iqbal – as well as a great collaboration with Theodore Kypraios, is that they are often one step ahead of me and I am playing catch-up. Recently, Theo has discussed with them how to estimate the error variance associated with the data used in Metropolis-Hastings MCMC simulations.

The starting point, usually, is that we have some data, let us say $y_i$ for $i=1, \cdots, n$, and a model – usually, in our case, a dynamical system – which we are trying to fit to the data. For any given set of parameters $\theta$, our model will provide estimates for the data points that we will call $\hat y_i$. Now, assuming uniform Gaussian errors, our likelihood function $L(\theta)$ looks like:

$L(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi\sigma^2}}e^{-\frac{1}{2}(\frac{y_i - \hat y_i}{\sigma})^2}$
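For concreteness, the log of this likelihood is simple to code; a minimal Python sketch (the function name is illustrative):

```python
import numpy as np

def log_likelihood(y, y_hat, sigma):
    """Log-likelihood of data y given model predictions y_hat,
    assuming i.i.d. Gaussian errors with standard deviation sigma."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = y.size
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum((y - y_hat)**2) / sigma**2)
```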

where $\sigma^2$ is the error variance associated with the data. Now, when I first started using MCMC, I naively thought that we could use values for $\sigma^2$ provided by our experimental collaborators, and so we could use different values of $\sigma^2$ according to how confident our collaborators were in the measurements, equipment etc. What I found in practice was that these values rarely worked (in terms of convergence of the Markov chain) and we have had to make up error variances using trial and error.

So I was delighted when I heard that Theo had briefed both Mike and Mudassar on a method for estimating the error variance as part of the MCMC. Since I had not tried it before, I thought I would give it a go. I am posting the theory and some of my simulations, as the results are helpful.

#### Theory

The theory behind estimating $\sigma^2$ is as follows. First, set

$\tau = \frac{1}{\sigma^2}$

We can then re-write the likelihood, now for the model parameters $\theta$ and also the unknown value $\tau$, as

$L(\theta, \tau) = \frac{\tau^{n/2}}{(2\pi)^{n/2}}e^{-\frac{\tau}{2}\sum_{i=1}^n(y_i - \hat y_i)^2}$

Now observe that this has the functional form of a Gamma distribution for $\tau$, as the p.d.f. for a Gamma distribution is given by:

$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$

So if we set a prior distribution for $\tau$ as a Gamma distribution with parameters $\alpha$ and $\beta$, then the conditional posterior distribution for $\tau$ is given by:

$p(\tau | \theta) \propto \tau^{(n/2)+ \alpha - 1}e^{-\tau(\frac{1}{2}\sum_{i=1}^n(y_i - \hat y_i)^2+\beta)}$

We observe that this is itself a Gamma distribution, with parameters $\alpha' = \alpha + n/2$ and $\beta' = \beta + \frac{1}{2} \sum_{i=1}^n (y_i - \hat y_i)^2$. Thus the parameter $\tau$ can be sampled with a Gibbs step as part of the MCMC simulation (usually using Metropolis-Hastings steps for the other parameters).
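The Gibbs update for $\tau$ is then a single draw from this Gamma distribution. A minimal Python sketch (function name illustrative; note that NumPy parameterises the Gamma by shape and scale, so the rate $\beta'$ enters as `scale = 1/beta_prime`):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_tau(y, y_hat, alpha, beta, rng=rng):
    """Gibbs draw of the precision tau ~ Gamma(alpha', beta'),
    with alpha' = alpha + n/2 and beta' = beta + SS/2."""
    resid = np.asarray(y, float) - np.asarray(y_hat, float)
    alpha_prime = alpha + resid.size / 2.0
    beta_prime = beta + 0.5 * np.sum(resid**2)
    # NumPy uses the shape/scale parameterisation: scale = 1/rate
    return rng.gamma(shape=alpha_prime, scale=1.0 / beta_prime)
```

The conditional posterior mean of $\tau$ is $\alpha'/\beta'$, so repeated draws should average to that value.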

#### Simulations

The simulations I have run are with a toy model that I use a great deal for teaching. Consider a constitutively-expressed protein that is produced at constant rate $k$ and degrades (or dilutes) at constant rate $\gamma$ per protein. A differential equation for protein concentration $P$ is given by:

$\frac{dP}{dt} = k - \gamma P$

This ODE has the closed form solution:

$P = \frac{k}{\gamma} + (P_0 - \frac{k}{\gamma}) e^{-\gamma t}$

where $P_0$ is the concentration of protein at $t=0$. For the purposes of MCMC estimation, mixing is improved by setting $P_1 = \frac{k}{\gamma}$ so that the closed form solution is:

$P = P_1 + (P_0 - P_1) e^{-\gamma t}$
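This closed form is cheap to evaluate at every MCMC iteration; a small Python helper, with illustrative names:

```python
import numpy as np

def protein(t, P0, P1, gamma):
    """Closed-form solution P(t) = P1 + (P0 - P1) * exp(-gamma * t),
    where P1 = k / gamma is the steady-state concentration."""
    return P1 + (P0 - P1) * np.exp(-gamma * np.asarray(t, dtype=float))
```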

Some data I have used for teaching purposes comes from the paper Kim, J.M. et al. 2006. Thermal injury induces heat shock protein in the optic nerve head in vivo. Investigative ophthalmology and visual science 47: 4888-94. The data is quantitative Western blots of Hsp70 in the optic nerve of rats, as induced by laser damage. (Apologies for the unpleasantness of the experiment):

| Time / hours | Protein / au |
|---|---|
| 3 | 1100 |
| 6 | 1400 |
| 12 | 1700 |
| 18 | 2100 |
| 24 | 2150 |

The aim is to use a Metropolis-Hastings MCMC, together with a Gibbs step for the $\tau$ parameter, to fit the data. The issue that immediately arises is how to set the parameters $\alpha$ and $\beta$. This may seem arbitrary, but it is already better than choosing a fixed value for $\sigma^2$, as the Gamma distribution allows the MCMC to explore that parameter. For my first simulation, I thought that $\sigma = 100$ would be sensible (this turned out to be a remarkably good choice, as we will see). So I set $\alpha = 0.01$ and $\beta = 100$, and lo and behold, the whole MCMC worked beautifully. (Incidentally, I used independent Gaussian proposals for the other three parameters, with standard deviations of 100 for $P_0$ and $P_1$ and a standard deviation of 0.01 for $\gamma$. These parameters were forced to be positive – Darren Wilkinson has an excellent post on doing that correctly. Use of log-normal proposals in this case leads to very poor mixing, with the chain taking some large excursions in the $P_1$ and $\gamma$ parameters.)
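The scheme just described can be sketched in Python as follows. This is not the original code: starting values and step sizes are my own guesses, and the priors on $P_0$, $P_1$ and $\gamma$ are taken as flat on the positive reals, so a proposal with any negative component is rejected outright (it has zero prior density).

```python
import numpy as np

# Hsp70 data from Kim et al. 2006, as tabulated above
t_data = np.array([3.0, 6.0, 12.0, 18.0, 24.0])
y_data = np.array([1100.0, 1400.0, 1700.0, 2100.0, 2150.0])

def model(t, P0, P1, gamma):
    return P1 + (P0 - P1) * np.exp(-gamma * t)

def run_mcmc(n_iter=20000, alpha=0.01, beta=100.0, seed=0):
    rng = np.random.default_rng(seed)
    n = y_data.size
    # starting values and proposal standard deviations (assumed, not from the post)
    P0, P1, gamma = 800.0, 2500.0, 0.07
    tau = 1e-4
    steps = np.array([100.0, 100.0, 0.01])
    ss = np.sum((y_data - model(t_data, P0, P1, gamma)) ** 2)
    samples = np.empty((n_iter, 4))
    for i in range(n_iter):
        # Metropolis-Hastings step: symmetric Gaussian proposal;
        # flat prior on the positive orthant => reject negatives outright
        prop = np.array([P0, P1, gamma]) + steps * rng.standard_normal(3)
        if np.all(prop > 0.0):
            ss_prop = np.sum((y_data - model(t_data, *prop)) ** 2)
            if np.log(rng.uniform()) < -0.5 * tau * (ss_prop - ss):
                (P0, P1, gamma), ss = prop, ss_prop
        # Gibbs step: tau ~ Gamma(alpha + n/2, rate = beta + SS/2)
        tau = rng.gamma(alpha + n / 2.0, 1.0 / (beta + 0.5 * ss))
        samples[i] = P0, P1, gamma, tau
    return samples
```

Discarding the first half of the chain as burn-in, the posterior medians of the four columns should land in the same region as the values reported in the text, though the exact numbers depend on the seed and chain length.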

The median parameter values are $P_0 = 786$, $P_1 = 2526$, $\gamma = 0.0686$ and $\tau = 0.000122$. The latter corresponds to $\sigma = 90.6$. With these values, we can see a good fit to the data: below are plotted the data points (in red), the best fit (with median parameter values) in blue, and model predictions from a random sample of 50 parameter sets from the posterior distribution in black.

#### Considerations

However, an obvious question arises: how sensitive is this procedure to the choices of $\alpha$ and $\beta$? I will confess: I use Bayesian approaches fairly reluctantly, being more comfortable with classical frequentist statistics. What I like about Bayesian approaches is, firstly, the description of unknown parameters with a probability distribution, and secondly the availability of highly effective computer algorithms (i.e. MCMC). What makes me uncomfortable is the potential for introducing bias through the prior distributions. So I have carried out some investigations with different values of $\alpha$ and $\beta$. In particular, I wanted to know: (i) what happens if I keep the mean (equal to $\alpha / \beta$) the same but vary the parameters? (ii) what happens if I vary the mean of the distribution? The table below summarizes positive results:

| $\alpha$ | $\beta$ | $P_0$ | $P_1$ | $\gamma$ | $\sigma$ |
|---|---|---|---|---|---|
| 0.01 | 100 | 786 | 2526 | 0.0686 | 90.6 |
| 1 | 10000 | 747 | 2428 | 0.0795 | 98.0 |
| 0.0001 | 1 | 797 | 2533 | 0.0681 | 96.3 |
| 0.1 | 10 | 822 | 2603 | 0.0623 | 97.9 |
| 0.001 | 1000 | 760 | 2455 | 0.0758 | 94.8 |
| 1 | 1 | 792 | 2539 | 0.0676 | 64.9 |
| 0 | 0 | 805 | 2565 | 0.0653 | 98.3 |

As you can see (please ignore the last line for now), the results are robust to a very wide range of $\alpha$ and $\beta$, even producing a good estimate for $\sigma$ when that estimate is a long way from the mean of the prior distribution. We can then make the following observation. Consider the sum of squares for a ‘best-fit’ model, for example using the parameters in the first row (this is 12748). As long as $\alpha \ll n/2$ and $\beta \ll 12748/2$, the prior will introduce very little bias. But if you use values of $\alpha$, and especially $\beta$, very much larger than the sum of squares estimated from well-fitted model parameters, then things can go wrong. For example, when I set $\alpha = 1$ and $\beta = 10^6$, my MCMC did not converge properly.

This leads to my final point, and the final row in the table. Would it be possible to remove prior bias altogether? If we look at the conditional posterior for $\tau$ and set $\alpha = \beta = 0$ (an improper prior), we still obtain a Gamma distribution, and the reciprocal of its mean is precisely the usual estimate of the error variance, since, in this case,

$\frac{\beta'}{\alpha'} = \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{n}$

The algorithm should work perfectly well sampling from this Gamma distribution, and indeed it does, producing comparable results to when an informative prior is used.

#### Conclusions

In summary, I am happy to conclude that this method is good for estimating error variance. Clear advantages are:

1. It is simple to implement and fairly fast to run – adding a Gibbs step is no big deal.
2. It is clearly preferable to making up a fixed number for the error variance – which was what we were doing before.
3. The prior parameters allow you to make use of information you might have from experimental collaborators on likely errors in the data.
4. The level of bias from the priors is relatively low, and can be eliminated altogether.

# Job opportunity: two-month postdoctoral position in mathematical modelling / inference

## Research Associate/ Fellow

**Reference:** SCI10709
**Closing Date:** Friday, 8th February 2013
**Job Type:** Research & Teaching
**Department:** School of Biosciences – Division of Agricultural & Environmental Science, Multidisciplinary Centre for Integrative Biology
**Salary:** £24,766 to £29,541 per annum, depending on skills and experience (minimum £27,854 per annum with relevant PhD)

This full-time post is available on a fixed term contract for a period of two months.

Applications are invited to join a highly motivated multi-disciplinary team of research scientists working at the Universities of Nottingham and Birmingham. The successful candidate will join a jointly funded project to carry out modelling of occludin trafficking during epithelial polarization and wound healing. The post could be located either in the School of Biosciences at the University of Nottingham’s Sutton Bonington Campus, or at the School of Biosciences at the University of Birmingham.

The work will include (i) developing mathematical models (using ODEs) to describe the turnover of occludin protein in the cell, as well as the kinetic trafficking of occludin between cellular compartments; (ii) estimating model parameter values from experimentally derived data using Markov chain Monte Carlo approaches; and (iii) iteratively improving the model, with cycles of model and data comparison, in order to provide greater certainty about the important mechanisms that can explain the experimental data. Other duties will include contributing to publication of this research in peer-reviewed journals, contributing to the writing of research grant applications, and generally collaborating across disciplines and institutions.

The successful candidate must have a PhD or equivalent in mathematical modelling, statistics or a related area. Research experience within a mathematical biology or systems biology research area would be desirable but not essential. Candidates must be able to demonstrate excellent mathematical ability, especially in the areas of ordinary differential equations and statistical analysis of data; experience of applying these skills to biological research would be desirable. Candidates must also be able to evidence excellent computing skills in a suitable environment (e.g. R or Matlab). Excellent English language oral and written communication skills are also essential. This post will require the person appointed to be able to work independently and as part of a multi-disciplinary team, and to be motivated, flexible and willing to learn.

Full details, including how to apply, can be found on the University of Nottingham’s vacancy system.

Informal enquiries may be addressed to Dr Dov Stekel, email: dov.stekel@nottingham.ac.uk or Dr Josh Rappoport, email: j.rappoport@bham.ac.uk.

# Recent Advances in Statistical Inference for Mathematical Biology – report on MBI workshop

Today saw the end of the workshop at MBI on Recent Advances in Statistical Inference for Mathematical Biology. It has been a very enjoyable and thought-provoking workshop – definitely well worth the visit. My own talk received a good number of questions and plenty of interesting discussion. It was definitely at the more ‘applied’ end of the talks given; many of the talks described new methodologies and it is these that were particularly useful.

Perhaps the most interesting theme to emerge from this workshop is the work on identifiability or estimability of parameters: it is the four talks most focussed on this topic that I will review very briefly below. The difference between the two terms is that non-identifiability of parameters is a structural issue: no amount of additional data could help. Non-estimability is a feature of the model and the data together: the parameters cannot be estimated from the data at hand, but perhaps with different data they could be. This issue has certainly become an important concern in our own work: situations in which the Markov chain is unable to provide meaningful estimates for one or more parameters. On one level, this is useful; indeed it is one of the reasons why we are using these approaches: if we cannot estimate two parameters but could estimate (say) their ratio, then we want to know that, and the joint posterior distributions give that information. But in other cases it is holding us back: we have inference schemes that do not converge for one or more parameters, limiting our capacity to make scientific inductions, and we need good methods both to diagnose a problem and to suggest sensible resolutions.

Two talks discussed approaches to simulation based on the geometric structure of the likelihood space. Mark Transtrum’s talk considered Riemannian geometric approaches to search optimization. The solution space often describes a manifold in data coordinates that can have a small number of ‘long’ dimensions and many ‘narrow’ dimensions. The issue he was addressing is the long canyons of ‘good’ solutions that are difficult for a classical MCMC or optimization scheme to follow. Interestingly, this analysis leads to the classical Levenberg-Marquardt algorithm, which allows optimal and rapid searching along the long dimensions – and Mark described an improvement to the algorithm. In discussions afterwards, he also mentioned that following geodesics along the narrow dimensions to the manifold boundary can help identify combinations of parameters that cannot be estimated well from the data. Mark’s paper is Transtrum, M.K. et al. 2011. Phys. Rev. E. 83, 036701.

Similar work was described by Ben Calderhead, on inference for models with oscillatory dynamics, which lead to difficult multi-modal likelihood functions. The approach was also to use a Riemannian-manifold MCMC, combined with running chains at a ladder of parallel temperatures that give different weights to the (difficult) likelihood relative to the (smooth) prior. The aim again is to follow difficult ridges in the solution space, while also being able to escape and explore other regions. Ben’s methodological paper is Girolami, M. and Calderhead, B. 2011. J. Roy. Stat. Soc. 73: 123-214.

A very different approach was described by Subhash Lele. Here, the issue is diagnosing estimability and convergence of a chain using a simple observation: if you imagine ‘cloning’ the data, i.e. repeating the inference using two or more copies ($N$, say) of your original data, then the more copies of the data you use, the more the posterior will converge to the maximum likelihood estimate; fewer copies weight the prior more. This means that if all is working well: (i) as $N$ increases, the variance of the posterior should decrease; (ii) if you start with different priors, then as $N$ increases, the posteriors should become more similar. If these do not happen, then you have a problem. The really nice thing about this approach is that it is very easy to explain and implement: methods based on Riemannian geometry are not for the faint-hearted and can only really be used by people with a strong mathematical background; data cloning methods are far more accessible! Subhash’s papers on data cloning can be downloaded from his web site.

Finally, another approach to identifiability was described by Clemens Kreutz. He described ways of producing confidence intervals by following individual parameters and re-optimizing over the remaining ones. Although more computationally intensive, this looks useful for producing more reliable estimates of both parameter and model-fit variability. Clemens’s work is available at http://arxiv.org/abs/1107.0013.

There were many more applied talks too, which I very much enjoyed, covering a range of interesting applications and data. Barbel Finkenstadt gave a talk that included, in part, work carried out by Dafyd Jenkins, and I was filled with an up-welling of pride to see him doing so well! I also particularly appreciated Richard Boys’s honest attempt to build an inference scheme with a messy model and messy data, obtaining mixed results.

All-in-all, an enjoyable and interesting week, well worth the trip, and I look forward to following up on some interesting new methodologies.

# New Publication: Global transcription regulation of RK2 plasmids: a case study in the combined use of dynamical mathematical models and statistical inference for integration of experimental data and hypothesis exploration.

Today we have published a new article in BMC Systems Biology:

Herman, D., Thomas, C.M. and Stekel, D.J. 2011. Global transcription regulation of RK2 plasmids: a case study in the combined use of dynamical mathematical models and statistical inference for integration of experimental data and hypothesis exploration. BMC Systems Biology 2011, 5:119.

This is Dorota’s first published research article, so big congratulations are due to her: a very well-deserved achievement! The paper abstract is:

#### Background

IncP-1 plasmids are broad host range plasmids that have been found in clinical and environmental bacteria. They often carry genes for antibiotic resistance or catabolic pathways. The archetypal IncP-1 plasmid RK2 is a well-characterized biological system, with a fully sequenced and annotated genome and wide range of experimental measurements. Its central control operon, encoding two global regulators KorA and KorB, is a natural example of a negatively self-regulated operon. To increase our understanding of the regulation of this operon, we have constructed a dynamical mathematical model using Ordinary Differential Equations, and employed a Bayesian inference scheme, Markov Chain Monte Carlo (MCMC) using the Metropolis-Hastings algorithm, as a way of integrating experimental measurements and a priori knowledge. We also compared MCMC and Metabolic Control Analysis (MCA) as approaches for determining the sensitivity of model parameters.

#### Results

We identified two distinct sets of parameter values, with different biological interpretations, that fit and explain the experimental data. This allowed us to highlight the proportion of repressor protein as dimers as a key experimental measurement defining the dynamics of the system. Analysis of joint posterior distributions led to the identification of correlations between parameters for protein synthesis and partial repression by KorA or KorB dimers, indicating the necessary use of joint posteriors for correct parameter estimation. Using MCA, we demonstrated that the system is highly sensitive to the growth rate but insensitive to repressor monomerization rates in their selected value regions; the latter outcome was also confirmed by MCMC. Finally, by examining a series of model refinements for partial repression by KorA or KorB dimers alone, we showed that a model including partial repression by KorA and KorB was most compatible with existing experimental data.

#### Conclusions

We have demonstrated that the combination of dynamical mathematical models with Bayesian inference is valuable in integrating diverse experimental data and identifying key determinants and parameters for the IncP-1 central control operon. Moreover, we have shown that Bayesian inference and MCA are complementary methods for identification of sensitive parameters. We propose that this demonstrates generic value in applying this combination of approaches to systems biology dynamical modelling.