The estimation of tuberculosis transmission parameters using ABC and MCMC methods

(1)

Faculty of Technology

Degree Programme in Technomathematics and Technical Physics

Alina Malyutina

THE ESTIMATION OF TUBERCULOSIS TRANSMISSION PARAM- ETERS USING ABC AND MCMC METHODS

Examiners: Professor Heikki Haario Professor Jukka Corander

(2)

Lappeenranta University of Technology Faculty of Technology

Degree Programme in Technomathematics and Technical Physics Alina Malyutina

THE ESTIMATION OF TUBERCULOSIS TRANSMISSION PARAM- ETERS USING ABC AND MCMC METHODS

Master’s thesis 2014

54 pages, 20 figures, 6 tables

Examiners: Professor Heikki Haario Professor Jukka Corander

Keywords: Bayesian inference, Approximate Bayesian computation, Marcov chain Monte Carlo, likelihood-free, tuberculosis strains

The aim of this work is to apply approximate Bayesian computation in combination with Marcov chain Monte Carlo methods in order to estimate the parameters of tuberculosis transmission. The methods are applied to San Francisco data and the results are compared with the outcomes of previous works. Moreover, a method- ological idea with the aim to reduce computational time is also described. Despite the fact that this approach is proved to work in an appropriate way, further analysis is needed to understand and test its behaviour in different cases. Some related suggestions to its further enhancement are described in the corresponding chapter.

(3)

First, I would like to express sincere gratitude to my research supervisor Heikki Haario for this interesting topic and his guidance on every stage of the process starting from the very beginning. I am eternally grateful for every his advice, idea and support.

Besides, I am also very grateful to Jukka Corander for giving a good overview of the topic and providing with relevant literature throughout the research.

Then, I am very thankful to the Department of Mathematics at Lappeenranta Uni- versity of Technology for giving me the opportunity to study and financial support.

And, finally, I want to say thank you to my family and my friend Roman Filimonov for their support during the whole year. They made me feeling comfortable and calm even if I was far away from home.

Lappeenranta, October 10, 2014.

Alina Malyutina

(4)

1 INTRODUCTION

1.1 Background

Tuberculosis (TB) is a disease caused by strains of mycobacteria that can spread from person to person through the air. Typically, this type of disease is associated with an infection of lungs but it can affect almost every part of the human body.

Furthermore, one person can be infected with different strains of mycobacterium tuberculosis.

The tubercle bacillus itself was discovered in 1882 by the microbiologist Robert Koch. He conducted research at a time when TB infection was the the reason for every seventh death in Europe. Due to the unawareness of efficient treatment, those who got the disease were isolated in private hospitals allowing medical scientists to supervise and study the main features of TB.

Nowadays, the patients with TB can be treated using drug therapy. Depending on the stage of the disease, the surgery can be also applied, otherwise the patient can have a drug treatment even at home. In this case, the curing ensures that nobody of a family or household members can get the infection from the patient.

The feasibility to recover from TB is high in a case if the infection is found on early stages. Unfortunately, at initial stage, due to the incubation period of the disease, the person does not know that he can be infectious to people. Once he meets a susceptible person, his strains can be transmitted to the person , as well as he can get another strain himself. Even if there is a rate of clearance of infection for each particular strain, this can lead to the epidemics.

Furthermore, there exists so called drug-resistant type of TB that provides a big challenge for doctors, since it needs a longer treatment and requires more expensive drugs. Then, totally drug-resistant TB causes serious public health issue in many developing countries as there is still no efficient medical treatment that can cope with it. The reason for that is the resistance to all common types of medication used in other types of TB. This case of TB infection was first observed in Italy, in 2003, but now there are totally drug-resistant TB epidemics than can be found all over the world, for example, an outbreak of mycobacterium tuberculosis was also diagnosed in Samara, Russia, 2013.

(8)

1.2 Research problem, objectives and delimitation

The above mentioned facts are the reasons for the growing interest, especially in the statistics community, to introduce methods for providing the understanding of the TB transmission features. One way of doing it is via defining transmission pathways of infectious pathogens from molecular sequence data. This data can allow people to follow the progress of the infection using special techniques.

In this study, the problem of estimating transmission parameters is studied. The data for conducting research is taken from Tanaka et al. [14]. The main goals during the research can be described as:

• the development of the insight understanding of TB infection and its transmission behaviour.

• the familiarization with models and methods that are commonly used for solving this kind of issues.

• the implementation of the methods that are considered to be good enough to provide realistic results.

• the improvement of the methods in terms of time consumption and reliability if possible.

While working with large sets of data and simulating the process, it is a common to try doing it multiple times and for an appropriately large amount of time. As this research was limited in time, amount of simulations was reduced. This can be the reason for the visible differences while comparing the results with other works but this fact does not influence the main tendency substantially.

1.3 Research methodology

In order to imitate the process of the strain distribution among the community, continuous-time Markov chain model is applied. This model, which is parametrized with a certain amount of parameters, describes the dynamics of the strains using transition probabilities for a small time increment.

(9)

When the model is determined, the next step is to perform the simulations according to it. In this work, approximate Bayesian computational (ABC) methods in combination with Marcov chain Monte Carlo (MCMC) methods are applied to obtain the model parameter estimation as they provide a better efficiency than other methods.

In order to get the whole understanding of the ABC-MCMC methods that are applied to solve the main problem, the ABC and MCMC methods were studied separately, analyzed and compared using simple examples. All the simulations were done using Matlab software.

1.4 Organization of the study

This work is divided into several sections. The first section provides an introduction to the problem, its relevance and the ways of solving which are actually used in the work. Furthermore, the objectives and delimitation are also pointed out.

The second chapter provides an insight to the MCMC methods, their base ideas and different types. In order to make the MCMC idea clear the Bayesian parameter estimation theory is also described.

The third section is dedicated to the ABC methods and approaches that are used in case if the likelihood function is not known or too complicated to calculate.

Moreover, the concept of the ABC-MCMC is also described in the chapter.

The forth chapter contains the main part of the work. One can find here the problem and data description, the model specification, the algorithm of the solving approach and the idea of the method improvement.

The remaining part of the diploma is devoted to the discussion, conclusions and summary.

(10)

2 Bayesian estimation and MCMC

2.1 Bayesian parameter estimation

According to the Bayesian parameter estimation, the unknown parameter α is treated as a random variable and the target is to find their posterior distribution.

The posterior density gives the probability for having value α given measurements y and according to the Bayes formula, it can be found using the following ratio:

π(α|y) = l(y|α)p(α) R l(y|α)p(α)dα

The prior distribution p(α) includes all available information about estimated parameters, such as their bounds. This information is a significant part of Bayesian inference since it is combined with the new data distribution with the aim to match better the posterior distribution. Yet, there are two main problems in choosing the prior: what information should be included in prior and how it will affect the posterior distribution. The latter can be studied by conducting a sensitivity analysis that indicates a comparison of the resulting posterior distributions under different choices of prior.

However, different choices of prior distribution can have a minor impact on the inference in cases of big size data or well-identified parameters. In other cases, the situation is the opposite: prior distribution becomes essential if data size is small or there exists only indirect information about the parameters.

The likelihood function, known as likelihood, is a function that depends on the parameters of the statistical model. Thus, the likelihood l(y|α)defines the probability of observing measurementsy if the parameter value in the model equals to α.

The likelihood is mostly used in statistical methods of parameter value estimation from the set of measurements. However, the likelihood function and the probability density function are not the same since the latter provides the density function of the parameter, meanwhile, the likelihood function states for the function of observing the measurements with a concrete parameter value. Thus, the likelihood is always a function of the model parameter. It is also defined up to a coefficient of proportionality.

(11)

One of the useful properties of the likelihood is the following: if the data y_i are independent then the likelihood function can be determined as the product:

l(y|α) =

n

Y

i=1

l(y_i|α)

The most common application of the likelihood function is its presence in posterior distribution during the Bayesian estimation procedure, that will be described later. However, it also can be used in order to obtain the estimators of the model parameters in the maximum likelihood method.

2.2 ML method

The maximum likelihood (ML) method is a statistical method which is used for estimating model parameters from sample data. In other words, if there is a statistical model and empirical data, then, applying ML method, one can obtain the estimators of the model parameters. The estimation is done by maximizing the probability of getting the particular sample that can be actually obtained from the given statistical model.

Suppose that there is a sampley=(y₁, y₂, ..., y_n) ofnmeasurements, for simplicity it is assumed that they are independent and identically distributed (iid) observations.

Thus, one can write:

y=f(x,α) +

wherex= (x1, x2, ..., xk)denotes the measurement conditions andα= (α1, α2, ..., αk) corresponds to a vector of unknown parameters, is a measurement noise that is, for simplicity, assumed to be Gaussian, ∼ N(0, σ²I) . The task is to find the vector of unknown parameters α) by maximizing the likelihood functionl(y|α)).

Since=y - f(x,α) ),y_i plays the role of a random variable andf(x_i,α))is stated for a mean. Having assumed this and noting that there are n iid measurements, one can obtain the likelihood function of the form:

l(y|α) =

n

Y

i=1

l(y_i|α) = ( 1 σ√

2π)

n

e⁻^2σ¹²^Pⁿⁱ⁼¹^[yⁱ⁻^f(xⁱ^,^α)]²

Taking everything above into consideration, one can describe the algorithm of estimating model parameters using ML method in the following way:

(12)

• As a first step, collect the data (responses) from an experiment or simulations and define the model.

• Construct the likelihood function using parameters of the model and choosing the distribution of measurement noise (if it is known).

• Find the maximum value of the likelihood function and corresponding values of parameters using some optimization routines.

• Check how well the found parameters fit the model.

Even if the algorithm does not look like to be complicated the most challenging part for using ML method may be that the distribution of measurement noise is not known. This fact makes the process of constructing likelihood function difficult or impossible, even if it is not prohibited to take a guess due to some assumptions one can proceed with, yet there are no guarantees that the likelihood will be properly chosen.

Let us consider one example of estimating parameters using ML method. Assume that there is a model:

y=α₁(1−e^−α²^x) +,

where a vector of space variables is denoted asx and a vector of responses asy, the measurement error is ∼N(0,σ²I), σ= 0.1:

x=

0 2 4 6 8 τ

y=

0.076 0.258 0.369 0.492 0.559 τ

,

Having performed all the steps in the algorithm, the estimators for the unknown parameters were obtained and the graph was drawn to check them (see Figure 1).

(13)

Figure 1: Comparison of the given data with one that were obtained using ML method.

But as we know that the data are noisy, a natural question is how acceptable the estimated valuesαb = (α₁, α₂, ..., α_k)are. One way to answer the question is to apply Markov chain Monte Carlo methods to the problem.

2.3 MCMC methods for parameter estimation

Monte Carlo methods are a class of computational techniques which are based on running the simulation repeatedly until the desired amount of the sampled statistics is collected. These methods have become widely used for solving such problems as numerical optimization and integration in different areas of science.

In this work, a special class of Monte Carlo methods, the Markov chain Monte Carlo (MCMC) methods is used.

MCMC methods are a class of algorithms with the aim to sample from probability distributions in order to construct a Markov chain that is assumed to approximate the desired distribution. In other words, the goal of MCMC is to produce a sequence of random samples (α1,α2, ...,αN) whose distribution tends to the posterior distribution with the increase of N. In most cases the construction of a Markov chain itself is not as challenging as a determination of the steps that one should take to achieve the convergence to the target distribution with an acceptable error.

(14)

The Markov chain term is used because the samples are generated in such a way that every new point in the sequence is independent from every other point except the previous one. Furthermore, it has been shown in Markov chain theory that the resulting sample from the MCMC calculation approaches the target distribution.

One of the most widely used MCMC methods is the random walk Metropolis algorithm. The main idea of the Metropolis method is the selection of the samples using a proposal distribution and an accept - reject technique. The key steps of the Metropolis algorithm can be presented as following:

• Initialization step: the starting pointα1 is chosen.

• Generation step: a new candidateαb is produced from the proposal distribution that can depend on the previous point of the chain α_n.

• Selection step: the candidate is accepted with the probability:

β(αn,α) =b min(1, π(α)b π(α_n))

Here π denotes the posterior density. If the point is rejected, one should proceed with the previous point of the chain. The procedure is to be repeated from the step 2 until the desired conditions are satisfied.

Therefore, the algorithm itself is not complicated, but there are some assumptions that need to be pointed out. First, it is assumed that the proposal distribution is symmetric. This means, that the probability of changing from the current point to the proposed one is the same as the changing in the opposite way. However, there are modifications of the Metropolis algorithm in a way that a non-symmetric distribution can be also used. However, in this work, the proposal distribution is assumed to be symmetric.

Second, one of the most critical points of the algorithm is how to select the proposal distribution. At least two items should be covered: its relative ease in sampling from, and similarity with the target distribution. Besides, it can be easily seen after the simulations that inefficient implementation of the algorithm is mainly caused by the proposal distribution which is selected in an inappropriate way: if it is too small, the candidates will be mostly accepted during the third step of the algorithm, but they will be too close to the previous point. This will cause the smaller movement of the chain and as a result, bigger amount of steps will be needed to cover the distribution.

(15)

On the other hand, if the proposal distribution is too large, the probability of the rejection will be much bigger and the points will be accepted too rarely.

In this work the proposal distribution is selected to be multivariate Gaussian distribution as examples are made in a multidimensional continuous parameter spaces.

The current point in the sequence provides a center point to the Gaussian proposal distribution. This idea is the base for the slight modification of the Metropolis method called random walk Metropolis algorithm (Solonen et al. [12]).

In order to understand the basics of the random walk Metropolis algorithm the already mentioned model is considered:

y=f(x,α) +

Here, again, the measurement noise is assumed to be Gaussian, independent and identically distributed. Using the formula of posterior density and assuming a uni- form prior (p(α)∼1), the posterior density is obtained in a form:

π(α|y)∼l(y|α)p(α)∼e⁻^2σ¹²^SS(α)

where SS(α) = Pn

i=1[y_i −f(x_i,α)]². Having assumed this, the acceptance ratio which is used in the second step of the Metropolis algorithm can be rewritten:

β(α_n,α) =b min(1, π(α)b

π(α_n)) = min(1, e⁻^2σ¹²^[SS(^α)−SS(α^b ⁿ^)])

Taking into account all the assumptions that were made above and noting that there is a multivariate Gaussian proposal distribution with covariance matrixC and initial point α_n=α₀, one can proceed with the following algorithm:

• Generation step: a new candidate αi is produced from the proposal distribution N(α_n,C) and SS(α)b is computed.

β(α_n,α) =b min(1, e⁻^2σ¹²^[SS(^α)−SS(α^b ⁿ^)]) If the candidate is accepted:

– α_n :=αb

(16)

– SS(α_n) := SS(α)b

Otherwise, repeat α_n in the chain.

• The procedure is to be repeated from the step 1 until the desired conditions are satisfied.

This version of the Metropolis algorithm is used in the work.

Even in random walk Metropolis algorithm the problem of the choice of the proposal distribution remains, but now the covariance matrix is responsible for the size and the shape of the proposal distribution. Fortunately, research has been done concerning this issue and some ways of approximating the covariance matrix were proposed. One approach is based on the linearization of the model. According to it, the covariance matrix can be approximated via the Jacobian matrix J:

C=σ²(J^τJ)⁻¹,

where [J]_ij = ^∂f^(x_∂ⁱ^,α)

j and αis, typically, the value obtained via ML method.

Besides, the scaling is very significant. It was found that for Gaussian targets the scaling factor is s_d= 2.4²/d, whered is dimension of the problem:

C=s_dσ²(J^τJ)⁻¹

The distribution of parameters for the example was also obtained using the random walk Metropolis algorithm, where the covariance matrix was chosen to be either scaled identity matrix or Jacobian-based with the already mentioned scaling factor (see Figures 2,3). It can be seen from the figures, that the MCMC chain obtained using Jacobian covariance matrix is more stable and the acceptance rate here is higher.

(17)

(a) The distribution of parameters (b) 2D plot of parameters

Figure 2: The distribution of the parameters (a) and 2D plot (b) with covariance matrix that was approximated via the Jacobian using MCMC.

Figure 3: The distribution of the parameters (a) and 2D plot (b) with a scaled identity covariance matrix using MCMC.

However, often modelling problems are high-dimensional. For example, the number of unknown parameters depends on the numerical problem discretization. These issues can lead to the preclusion of the fixed standard MCMC methods making its adaptive version more preferable.

(18)

2.4 Adaptive version of Metropolis algorithm

As discussed in the last section, covariance matrix is responsible for the shape and size of the proposal distribution. Even if there are some ways of improving it via the linearization of the model and scaling, it will not always lead to efficient sampling, though it can provide a good starting point. A possible reason for such a turn can be a bad parameter identification which often results in a singular Jacobian matrix.

Fortunately, there is an improved version of MCMC, called adaptive Metropolois (AM) that will be discussed in this section.

The AM random walk with Gaussian increments was proposed in Haario et al.

[4]. The idea behind the adaptive version of Metropolis is that covariance matrix is updated during the sampling procedure. Thus the current covariance matrix is replaced by empirical covariance matrix of the points that were already sampled.

One significant remark is that the algorithm is no longer Markovian since each individual chosen point depends not only on the previous point of the chain, but on the whole history of the sequence due to the covariance adaptation.

In the AM algorithm , the proposal distribution follows the same idea as in random walk Metropolis algorithm, it is chosen to be Gaussian, centered at the current point of the chain. At the beginning, the proposal covariance matrix C₀ is chosen to be strictly positive definite and based on the prior knowledge of the process.

It is recommended to apply Jacobian based version for the starting covariance, though it can be enough to use diagonal C₀ for simple cases. Then, during the sampling procedure, the covariance matrix is adapted and taken to be the empirical one of the already chosen points. That is, if one have already proceed with the sequence (α₁,α₂, ...,α_n−1), then the next candidate will be sampled according to the covariance matrixC_n=s_dCov(α₁,α₂, ...,α_n−1)+s_dεI_d, wheres_dis the scaling factor andεis a positive regularization parameter that can be used to ensure positive definiteness of the covariance. Practically,ε may be chosen to be 0 or a small value.

Taking all the notations into account, the covariance matrix can be updated in the (AM) algorithm according to:

Cn=







C₀, n≤k C_n = s_d Cov(α₁,α₂, ...,αn−1) + s_d ε I_d, n > k

Here, k is a positive time index that determines the length of the non-adaptation period from the start of the algorithm.

(19)

Furthermore, it is important to mention that it is not recommended to update the covariance at every step of the procedure, it is enough to do it during the process several times. The length of the initial non-adaptation period, specified by k is totally free, but one should know that the larger it is - the longer it takes for the adaptation to start.

Basically, the AM algorithm can be presented in a following way:

• Initialization step: the length of the chain N, the starting pointα₁ and initial covariance C₁ are chosen.

• Procedure cycle for i = 1,2,...,N:

– Metropolis step with proposal distributionN(αi,Ci) – Adaptation of covariance: C_ik+1 =Cov(α₁,α₂, ...,α_i)

Following the algorithm one can obtain the distribution of the example parameters (see Figure 4).

Figure 4: The distribution of the parameters (a) and 2D plot (b) obtained using adaptive MCMC.

2.5 Bootstrap methods

Bootstrapping is a statistical method for estimation of nearly any statistic which is based on resampling. Typically, this method involves a straightforward procedure,

(20)

but it should be repeated a certain amount of time (usually quite large). This often causes heavy computer calculations.

In case of regression models, the procedure of resampling simply means changing the places of individual cases in a given data set. The idea is based on the following scheme:

• If there is an existing data in a form of x= (x₁, x₂, ..., x_k), y= (y₁, y₂, ..., y_k), the next step is to generate a new data setbx,byby resampling the existing one.

The transformation can be performed by choosing k indices from 1, ..., k by resampling and taking corresponding data points with respect to that indices.

• Create a fit to the model based on the generated in the previous step data.

• Return to the step 1 , until a desired number of estimators is obtained.

Even if the procedure is quite simple, the method has some disadvantages. First, as was mentioned in the beginning, this method involves the repeated calls of the optimization routine. This fact makes it computationally demanding. Second, it is heavily dependent on the success of the optimization step. Moreover, the bootstrapping procedure is inefficient if the number of measurements is small, since the resampled data will often have only a few points making the problem of the parameter estimation ill-posed. However, this situation can be improved using the procedure of bootstrapping with residuals. It has the following algorithm:

• If there is an existing data in a form of x= (x₁, x₂, ..., x_k), y= (y₁, y₂, ..., y_k), the next step is to find the LSQ estimate of the fit αb and then compute the residuals:

res_i =y_i−f(x_i,α)b

• Then, generate a new data set resc , by resampling the existing one. The transformation can be performed by choosing indices from 1, ..., k and taking corresponding data points with respect to that indices.

• Create a new data according to:

y_i =f(x_i,α) +b resd_i

• Use the new data to compute the model fit.

(21)

• Return to the step 2 , until a desired number of estimators is obtained.

In the example, the number of the measurements is not big enough to proceed with the ordinary bootstrap procedure. Taking this into account, bootstrapping with residuals was chosen to perform the calculations for the example that is used in this work (see Figure 5). However, only a finite number of distinct combinations of data can be created, which impacts the results for small number of data points.

Figure 5: The comparison between parameter distribution obtained via MCMC and bootstrap method.

3 ABC and likelihood-free methods

3.1 Introduction to ABC methods

Approximate Bayesian computation (ABC) involves a class of computational methods based on Bayesian statistics. These methods are also known as likelihood-free methods and they can provide quite satisfactory ways of simulating the problem without using a likelihood . In this chapter, a concept of the original ABC and some extensions of it will be presented.

It is important to point out that the computational issue in a case of an unknown likelihood function l(α|y) will rise considerably. The reasons for such an issue is that the likelihood function can be very demanding to calculate or it is not available in a form as a function of the unknown parameter. Furthermore, one might not

(22)

know the normalizing constant of the likelihood.

This situation is common in epidemiology analysis. Specifically, the likelihood is called intractable since it can be presented as uncomputable multidimensional integral. Some solutions were proposed to solve the problem but it was difficult to calibrate them (Cucala et al. [2]). There exist different conditions under which the Bayesian inference faces a partially known likelihood function in a form l(α|y) = l₁(α|y)l₂(α) where l₂ is unknown. In such a case, the exact simulations from its posterior distribution can be impossible. In order to handle the problem, variational Bayes solutions and Laplace evaluations were advanced but all of them need various model simplifications.

The ABC methodology was first mentioned by Rubin [10]. It presents an almost automated resolution of the problems with models that are considered to be intractable but still can be simulated from (Marin et al. [6]). Thus, in the paper, Rubin produced the algorithm and described it as a special case of an accept-reject procedure. In this procedure, the parametersαto be estimated are generated from the prior distribution π(α). The parameters will be accepted conditionally on the responses from them being identical to a given sample, that is written as y. Addi- tionally, it was assumed thaty has values from a finite (countable) set D. Thus, the algorithm was presented as :

• Generate α^∗ from the prior distribution π(·).

• Generate h from the likelihood f(·|α^∗).

• Return to the step 1 , untilh =y.

• Set α_i =α^∗.

• Repeat the whole process until a desired length of the chain will be obtained.

Initially, ABC was introduced by Tavare et al [15] and then was discussed by Pritchard et al. [9]. The author extended the algorithm for continuous sample spaces. The algorithm reads as follows:

• Generate α^∗ from the prior distribution π(·).

• Generate h from the likelihood f(·|α^∗).

• Return to the step 1 , untilρ[S(h)]≤ε.

(23)

• Set α_i =α^∗.

• Repeat the whole process until a desired length of the chain will be obtained.

New elements that appear in the algorithm can be listed as:

−S (h) = (S₁(h), S₂(h), ..., S_n(h)), a vector of summary statistics.

−ε >0,a tolerance value.

−ρ >0, a distance defined on D.

In the Euclidean spaceRⁿ, it is more common to use the Euclidean distance (2-norm distance) in order to obtain the distance between two points:

ρ(x, y)=(

n

X

i=1

|x_i−y_i |²)¹²

This distance was used to obtain the distribution of the parameters from the example (see Figure 6).

Figure 6: The comparison between parameter distribution obtained via MCMC and ABC methods with different ε.

The Minkowski distance of order p (p-norm distance) can also be used:

ρ(x, y)=(

n

X

i=1

|x_i−y_i |^p)¹^p

(24)

The main idea of ABC is to produce a good approximation to the posterior distribution by using a representative summary statisticS with a small toleranceε. Thus the goal is:

π_ε(α|y)≈π(α|y)

Unfortunately, this class of methods often suffers from a low acceptance rate due to the rejection procedure. This is one of the main reasons to use this approach in combination with MCMC algorithm.

3.2 Likelihood - free MCMC

Typically, MCMC is a preferred choice for generating observations from a posterior distribution. However, the method is based on knowing the likelihood function which can be not always known due to some reasons. For example, there are complex probability models used in population genetics where it is impossible to find the likelihood function analytically or it can be computationally prohibitive. In such a case a likelihood-free MCMC method can be applied. Although, there are different ways of defining the accept-reject step in the algorithm. In this work, some of the approaches will be considered.

As stated in the previous chapter, if there are given measurements y and a vector of unknowns α to be estimated, the posterior density is the probability for having valuesαgiven measurementsy and according to the Bayes formula, it can be found using the following ratio:

π(α|y) = l(y|α)p(α) R l(y|α)p(α)dα

where l(y|α) states for the likelihood and p(α) denotes the prior density.

Thus, if the likelihood is not known the posterior density can not be found because the accept-reject step ratio that is used in ordinary MCMC is unknown. One possible way to cope with the problem is a change of the accept-reject criterion in the algorithm. For instance, if statistical model of the problem is known and defined as M,p(α)corresponds to prior distribution then according to Marjoram et al. [7] the algorithm can be modified to:

(25)

• Generation step: a new candidate α_i is produced from the proposal distribution Q(αi−1|α_i).

• Simulation step: a new data set y⁰ is simulated from the model M with parameters αi.

• Checking step: if y⁰ = y move forward. Otherwise, go to step 1.

β(αi−1,α_i) =min(1, p(αi)Q(αi|αi−1) p(αi−1)Q(αi−1|α_i)) If the candidate is accepted:

– α_i−1 :=α_i

Otherwise, repeat αi−1 in the chain.

It was proved that the posterior distribution obtained using this algorithm is a stationary distribution of the posterior.

However, one can use ABC and MCMC in combination. This method results in higher acceptance rate than for plain ABC or likelihood-free MCMC.

3.3 The combination of MCMC and ABC

Characteristically, the ABC methods suffer from the low acceptance rate due to rejection sampling algorithm. In such a situation, the combination with MCMC algorithm tends to improve the efficiency of the ABC-based approximations.

The algorithm starts, like a typical MCMC, with the choice of the current parameter value and this value is generated using the proposal distribution instead of prior like in ABC case. Naturally, this method can be treated as an extended space of ordinary MCMC, as it involves collecting values for Markov chain in the same way with the only difference that corresponds to the accept-reject step.

The comparison between simulated and given data can be performed in different ways during step 3 in the likelihood-free MCMC. It is not a common case that the

(26)

simulated and given data are considered to be completely equal since it can not occur in many cases due to model specification or a nature of the data. One can rather apply a condition on the Eucledian distance:

(

n

X

i=1

|y⁰_i−y_i |²)¹² ≤ε

Besides, the concept of summary statistics is also used like in ABC. Thus, one can apply a condition on the Eucledian distance for normed summary statistics S₁,S₁,...S_n :

(

n

X

i=1

|S_i⁰−S_i |²)¹² ≤ε

Generally, the combination of ABC and MCMC reads as follows:

• Generation step: a new candidateα_iis produced from the proposal distribution Q(αi−1|α_i).

• Simulation step: a new data set y⁰ is simulated from the model M with parameters α_i and summary statistics S⁰ are calculated based on it.

• Selection step: the candidate is accepted if both conditions are satisfied:

ρ ( S⁰,S) ≤ ε

u < β(αi−1,α_i) = min(1, p(α_i)Q(α_i|αi−1) p(αi−1)Q(αi−1|α_i))

where u ∼ U(0,1), S is a vector of summary statistics from the given data, ε is a small threshold and ρ is Eucledian distance. If the candidate is accepted:

– αi−1 :=α_i

Otherwise, repeat αi−1 in the chain.

In this work, the main problem is solved using this particular combination of MCMC and ABC.

(27)

4 ABC in estimation of TB transmission parame- ters

4.1 Introduction to the problem and data description

As was mentioned in the introduction part, TB is a directly transmitted disease caused by the bacterium known as Mycobacterium tuberculosis. It can be said that one-third of the living people has been infected with it, but in many cases, the infection with the bacterium does not cause TB disease, making them being asymp- tomatic [17]. However, TB is the most common reason for death from infectious disease after HIV infections. A considerable amount of TB cases is registered in developing countries such as African and Asian countries [18].

Different studies have been made to analyze the behavior of TB transmission in populations. In this work, the method from Marjoram et al. [7] that was applied San Francisco data of Small et al.[11] is considered. In the work, TB transmission was studied by genotyping strains of M. tuberculosis isolated from patients.

These typing methods contain insertion sequence (IS) typing in order to study the diversity of tuberculosis strains. Thus, the IS6110 marker was injected to clinical isolates of M. tuberculosis as it transposes fast enough to provide with recognition of TB types (for more details see Cave et al. [1]). Then, a DNA fingerprint based on IS6110 is created, making it possible to define the number of mutation events.

One approach to study TB transmission is to form clusters of TB cases that have identical genotypes. Here, clusters represent a group of genotypes that have same population size (see details in 4.3 chapter). The only assumption is that TB cases within one cluster have emerged through recent transmission, as contrasted to reac- tivation in Small et al. [11].

ABC-MCMC method and stochastic model of disease transmission is used to estimate the key characteristics of the process applied to a published data set of TB genotypes from San Francisco.

(28)

4.2 The model description

A continuous - time stochastic model is used to describe the process. It is similar to the one that was studied in Tavare [16] with the only difference in the probability of new genotypes occurrence: here, it is proportional to the number of genotype cases.

There are three main parameters in the model:

φ= (α, δ, θ),

where

• α - "birth" rate per case per year that is responsible for the emergence of new infections

• β - "death" rate per case per year that models the death or recovery of the person (host)

• θ - "mutation" rate per case per year that corresponds to the replacement of one TB type with another within a host

It is important to note an assumption that if the mutation occurs, it creates a rise to genotype that has never appeared before. It is assumed that all genotypes have the same epidemiological properties. Moreover, the probability of each of three events is assumed to be proportional to the number of cases and the rates per host do not change over time, making the process time homogeneous.

There are several notations that are needed to describe the algorithm:

• X_i(t)- the amount of genotype i cases

• G(t) - the number of genotypes that have existed up to timet

• N(t) - total number of cases up to timet:

N(t) =

G(t)

X

i=1

X_i(t)

• P_i,x(t) - the probability of having x cases of genotype i up to time t:

P_i,x(t) = P(X_i(t) = x)

(29)

• P¯_n(t) - the probability that total number of cases up to timet is equal to n:

P¯n(t) = P(N(t) =n)

• P˜_g(t) - the probability that the number of genotypes that have existed up to time t is equal to g:

P˜_g(t) =P(G(t) =g)

To start the process, the population is initialized with a single infection only. Taking into account all the assumptions above, the time evolution P_i,x(t) is described via differential equation:

dP_i,x(t)

dt =−(α+δ+θ)xP_i,x(t)+α(x−1)Pi,x−1(t)+(δ+θ)(x+1)P_i,x+1(t), f or x= 1,2,3, ...

Boundary conditions:

dPi,0(t)

dt = (δ+θ)P_i,1(t) , f or i= 1,2, ...G(t) Initial conditions:

P_1,0(0) = 1

P_i,0(0) = 1 , f or i= 2,3,4, ...

Pi,x(0) = 0 , in other cases

The evolution of P¯_n(t) follows the process described as such:

P¯n(t)

dt =−(α+δ)nP¯_n(t) +α(n−1) ¯Pn−1(t) + (δ)(n+ 1) ¯P_n+1(t) , f or n= 1,2,3, ...

P¯₀

dt = (δ) ¯P₁(t) Initial conditions:

P¯₁(0) = 1

P¯_n(0) = 0, f or n 6= 1

The probability P˜_g(t)changes according to the following equation:

(30)

dP˜_g(t)

dt =−θN(t) ˜P_g(t) +θN(t) ˜P_g−1(t), f or g = 2,3,4, ...

dP˜1(t)

dt =−θN(t) ˜P₁(t) , G(0) = 1 Initial conditions:

P_g,1(t_g) = 1

P_g,x(t_g) = 0 , f or x6= 1

where t_g - the time when genotype g appears first

It can be seen that this model is the extension of the standard linear birth-death process.

Basically, we simulate a discretized version of the continuous - time stochastic model that was described. The time between the events here is distributed exponentially but it will not be simulated since the time experienced by the infectious population is not of the interest. Assuming that parameters α, δ and θ are fixed, the birth- death-mutation process is, therefore, described as:

• Initialization step: set X1(0) = 1, Xi(0) = 0 f or i6= 1, N(0) = 1, G(0) = 1

• Select the genotype of the next event with a probability proportional to the amount of the genotype in the sample, let it be genotype i

• Choose an event to emerge according to the probabilities of birth, death or mutation:

P(birth|event) = α α+δ+θ P(death|event) = δ

α+δ+θ P(mutation|event) = θ

α+δ+θ

• Select next actions according to the chosen event:

- birth: X_i(t) = X_i(t) + 1 - death: X_i(t) = X_i(t)−1

- mutation: X_i(t) =X_i(t)−1 , G(t) = G(t) + 1 , X_G(t)(t) = 1

(31)

• Continue the process from step 2 until N(t) = N_stop

The stopping valueN_stop should be chosen such that it reflects the size of the population and represents an appropriate level of diversity. Previous research has shown that samples of size Nstop = 10³ gave too high acceptance rates during the simulation while N_stop = 10⁵ is considered to be big enough for true infectious population sizes (Tanaka et al.[14]). Thus, the value was chosen to be N_stop= 10⁴.

4.3 The estimation of the process and summary statistics

As was mentioned in the beginning of the chapter, the approach of finding the estimates is based on organising the clusters. Usually, the structure of a cluster is denoted by n^k, with:

k= #{x_i, x_i =n},

i.e., n^k indicates there arek genotypes x_i of size n and x states for the genotypes.

The resulting configuration of clusters obtained in this work is compared with the real data of Small et al. [11]:

30¹ 23¹ 15¹ 10¹ 8¹ 5² 4⁴ 3¹³ 2²⁰ 1²⁸²

The estimation of the process is based on approximate Bayesian inference and MCMC (Marjoram et al. [7]) . The goal of it is to estimate the birth, death and mutation rates through their distribution obtained via Markov chain. The criteria MCMC is based on the evaluation of a function depending on summary statistics.

Here, two summary statistics are used: the number of distinct genotypes g: g=X

i

k_i,

and the gene diversity H (Tanaka et al.[14]):

H = 1−X

(ni/n)², where n_i stands for clusters sizes and n=P

in_i.

According to the real data, the number of distinct genotypesg is equal to 326, while the number of isolates n is 473.

It was shown by Ewens [3] thatg is a sufficient statistic, but in this study the model is somewhat different. That is why it was decided to consider H also as it related to heterozygosity in randomly mating diploid populations (Tanaka et al.[14]).

(32)

The real life situation is imitated as follows: we simulate the population to the size N_stop then take a subset of the same size n as that with real data.

If φ= (α, δ, θ), the detailed description of the algorithm reads as:

1. Initialization step : set t=0 and define φ₀

2. Generate a new set of parameter valuesφ^∗ ∼q(φ|φ_t)

In this work, the proposal density was chosen to be multivariate Normal N (φ_t,P ) with covariance matrix:

X=







0.5² 0.225 0 0.225 0.5² 0

0 0 0.015²







3. Simulate birth-death-mutation process, obtain a subset of size n from it and obtain summary statistics g^∗ and H^∗.

4. If the following conditions are satisfied 1

n|g−g^∗|+|H−H^∗|<

and

u < min{1,p(φ^∗)q(φ_t|φ^∗) p(φ_t)q(φ^∗|φ_t)

where is a small threshold and u ∼ U(0,1) then set φ_t+1 = φ^∗. Otherwise, set φ_t+1 =φ_t.

5. Set t=t+1 and return to step 2.

One of the key challenges here is how to specify the prior distributionp(α, δ, θ). The only available information in the literature for the San Francisco data corresponds to the mutation parameter of some genetic markers (Tanaka et al. [14]). That is the reason why this information was included in the prior and it is chosen to be:

p(α, δ, θ) =







p(θ), if 0< δ < α 0, otherwise

Having analyzed the research that was conducted in order to estimate the mutation rate using IS6110 fingerprints, the prior distribution of the mutation parameterp(θ)

(33)

is selected to be N(0.198,0.06735²) (Tanaka et al. [14]). The birth and death parameters are selected, in a rational way , to be positive andα > δ. It can be seen that the prior distribution does not integrate to one as the normalising constant is absent but in the MCMC sampler it does not make sense since the normalising constants cancel out in the fraction.

However, the randomness of the process often results in unstable calculations since, for example, the subset is taken once only and the summary statistics are calculated based on it. One way to get more efficient summary statistics is to take the subset different times and calculate summary statistics as average over obtained. This strategy was also used in this work: the subset was taken 100 times and both summary statistics were calculated as the average among 100.

4.4 The concept of early rejection

In order to get a realistic result, the simulations are performed a large amount of times and as a consequence the computational time is very large. With the aim to speed up the process, a concept of early rejection can be used.

The idea behind early rejection is not new, and has been considered for MCMC in Solonen et al. [13] (they first used the expression of early rejection) and more recently in Picchini [8]. Here, another approach is introduced that is based on cluster structure.

The idea is following: rather than simulating the whole process every time and rejecting it using summary statistics, one can reject the sample earlier, during collecting first 10⁴ cases. The problem here is the randomness of the process that can decrease considerably the efficiency of any method to be applied and the choice of early rejection criterion that can cope with this issue and reduce the computational time simultaneously.

In order to find out the way of early rejection criterion of theX sample, the process was simulated and the clusters with the accepted points were saved and analyzed.

By clusters one should assume the number of cases that have the same genotype in a sample. Data from the clusters were collected and illustrated (see Figure 7).

One can notice typical behaviour of the function. This can lead to an intuitive approach - to fit the function of the last (accepted) point of the chain and compare

(34)

with the current one.

Figure 7: The illustration of cluster that corresponds to an accepted point.

Because of the fluctuations in the cluster function and the randomness of the process, it is impossible to find a function that can fit them perfectly, but the main tendency can be characterized using the approximationy=ax^−b . Thus, to explore how well the fitting works the fminsearch Matlab routine was used to find a, b parameters for each individual point and the cost function was calculated as:

n

X

i=1

|y_i−ax^−b_i |

where n is the number of clusters and y is the vector that states for the amount of genotypes within a particular case.

It was decided to use the sum of absolute values of differences. Besides, it was decided to neglect those genotypes that appeared in obtained sample 1 or 2 times as, in fact, they can affect the fitting in a bad way and their absence do not change considerably the main result. The natural reason for such a turn is following: in every case, it is more likely to have a big amount of the genotypes that can be met in a sample 1 and 2 times with respect to others. Thus, the clusters were transformed from original (see Table 1) in the described way (see Table 2).

(35)

Genotypes 1 2 3 4 5 6 7 8 9 10 Number of cases 1136 378 218 126 91 60 44 28 12 9 Table 1: The example of clusters structure before the transformation.

Genotypes 3 4 5 6 7 8 9 10 11 12

Number of cases 218 126 91 60 44 28 12 9 8 8 Table 2: The example of clusters structure after the transformation.

Having performed the procedure of obtaining the fitting for both parameters, the values of the a,b parameters for each cluster were obtained as well as the distribution of the cost function values (see Figure 8). The clusters for the figure were collected during the simulation of the process and estimation of the model parameters. Thus, they represent some of the accepted points of the chain.

Figure 8: The distribution of the function parameters andcost function values.

(36)

Even if the values of the cost function is in the interval (50 , 140), one can still accept such a big difference as the process is random. However, the fitting follows the actual behaviour of the function (see Figure 9):

Figure 9: Example of fitting to the cluster (zoomed view).

As can be seen from the picture, even if the fitting reflects the main tendency of the clusters distribution, the difference still exists due to the up and down spikes of the fitted function.

The next step is to provide an early rejection criterion in order to avoid the subset sampling. Referring to the main algorithm, it means that one should wait until the number of cases reaches 10⁴ value and the clusters are organised. After that a criterion should be applied to find out whether the X sample is effective for taking subsets from or it could be rejected even at that stage.

The early rejection procedure with the assumption that the last accepted point in the algorithm was φ_t−1 = (αt−1, δt−1, θt−1), the fitting was also applied to approximate the behaviour of its clusters and the parameters at−1, bt−1 and Ft−1 (cost function value) were obtained, was chosen to be like this:

• the clusters of size 1 and 2 are dropped away from the distribution.

• the sumS_t is calculated according to the formula:

S_t=

n

X

i=1

|y_i−at−1x^−b_i ^t−1 |

(37)

where x-genotypes that appeared at least 3 times, y is the amount of the corresponding genotypes andat−1, bt−1 - the obtained values of the point φ_t−1.

• If S_t < ξFt−1, then the sample is allowed to go to the next step of the algorithm, otherwise the sample is rejected, a new sample is generated from a neighbourhood of the same point φ_t−1 and the procedure starts again from the step 1.

Here, ξ >1, its range will be determined after a more detailed analysis.

It is important to mention that in case if the sample will not be rejected and, moreover, the corresponding point will be accepted, new parametersa_t, b_t and F_t will be fitted

The algorithm was applied to the problem and the results were analyzed. Accord- ing to them the approach is working in an appropriate way and the samples were properly rejected in most cases.

Going further, it was decided to improve the algorithm in a following way: instead of checking a sample when it is already complete one can apply the early rejection criterion on earlier stages - during the completion process of the sample. In this case the early rejection procedure should be modified to account for the expected properties of the sample. This means that one should predict the behaviour of the sample in further sampling. In this work, the way out was decided to be in this form : at every iteration when the number of cases in the sample is increased on 1000 the new early rejection criterion is checked. Basically, the idea of it is the same as was in previous case, but now one should take into account that the sample is not complete yet (like in previous case where the criterion was applied when the sample reached its threshold value). Thus, to cope with the issue the following solution is applied:

• Assuming that there is the situation when the length of the X sample equals to 1000*p, where p=1,...10 and all the previous notations are valid

– organize the genotype clusters starting from those that appeared at least 3 times in the current state of the sample

– collect the value of the sum to the val array:

val(p) =

n

X

i=1

|y_i−at−1x^−b_i ^t−1 |

(38)

where y is the amount of the genotypes in the clusters obtained in previous point of the chain

– if (p>2) then:

∗ check the early rejection criterion:

val(1)−(10−p)(max(dif f(val)))> ξFt−1

if this condition is satisfied the sample is rejected,a new sample is generated from a neighbourhood of the same point φ_t−1 and the procedure starts again

Here, diff corresponds to the difference between the adjacent elements of the array:

dif f(val)[i] =val(i−1)−val(i)

This algorithm was applied to the problem and analyzed with respect to different values of ξ and ways in which a and b parameters are updated. The first way is just to follow the algorithm, meanwhile the second is to save the parameters and recalculate them as the mean values of already existing (see Figures 10,11,12,13)

Figure 10: The distribution of thecost function values both rejected on early stages (red) and allowed for further sampling (blue). The value is 0.0025. This is the case when (a) and (b) parameters are decided to be chosen as the mean values of already accepted points parameters and ξ is chosen to be 1.5.

(39)

Figure 11: The distribution of thecost function values both rejected on early stages (red) and allowed for further sampling (blue). The value is 0.0025, ξ is chosen to be 1.5.

Figure 12: The distribution of thecost function values both rejected on early stages (red) and allowed for further sampling (blue). The value is 0.0025. This is the case when (a) and (b) parameters are decided to be chosen as the mean values of already accepted points parameters and ξ is chosen to be 2.

(40)

Figure 13: The distribution of thecost function values both rejected on early stages (red) and allowed for further sampling (blue). The value is 0.0025, ξ is chosen to be 2.

At the same time, the proportion of the rejected points among all of the unaccepted points and percentage of the rejected points by mistake with respect to all the accepted points were obtained (see Table 3 ).

ξ Averaging of a and b % of rejected %of rejected by mistake

1.5 + 64.18 5

1.5 - 65.43 27

2 + 46.88 1

2 - 47.04 2.9

Table 3: The choice of parameters and their influence in the rejection procedure.

Thus, the early rejection algorithm was applied to solve the main problem. Accord- ing to the pictures and table, it has shown quite satisfactory results. Furthermore, showing at what stage the parameter set is rejected is also of special interest. By the term stage it is meant at what time or, more precisely, at what value was parameter p in the algorithm when the sample was rejected. As the early rejection criterion was checked seven times starting when the sample size is representative (p=4) the stages are within the interval [4,10] (see Figures 14,15,16,17).

(41)

Figure 14: The distribution of the stages when the sample was rejected during the early rejection procedure. The value is 0.0025. This is the case when (a) and(b) parameters are decided to be chosen as the mean values of already accepted points parameters and ξ is chosen to be 1.5.

Figure 15: The distribution of the stages when the sample was rejected during the early rejection procedure. The value is 0.0025, ξ is chosen to be 1.5.

(42)

Figure 16: The distribution of the stages when the sample was rejected during the early rejection procedure. The value is 0.0025. This is the case when (a) and(b) parameters are decided to be chosen as the mean values of already accepted points parameters and ξ is chosen to be 2.

Figure 17: The distribution of the stages when the sample was rejected during the early rejection procedure. The value is 0.0025, ξ is chosen to be 2.

However, there are some parts that can be clarified:

(43)

• first, it can be seen from the algorithm that the early rejection criterion is checked only in case if the length of theX sample is at least 3000. The reason for such decision is that this length was considered to be enough big to analyze the further behaviour of the sample, meanwhile the smaller length can lead to a rough forecast.

• second, by the term forecast it is meant that one is trying to predict the final value of the sum (when the X sample is already collected). Surely, the prediction is not that much accurate, especially on the earlier stages. However, if the sample is not rejected on early stages, the forecast is becoming more accurate, and, finally, on the last step the approximate value will be exact.

• third, as it was mentioned in the beginning of this work, the algorithm is still under development as more time is needed to test and check what values should be used, for example, for ξ.

5 RESULTS

As was mentioned in previous sections, the ABC-MCMC method was applied to the problem. In order to check the results, they were compared with the outcomes of Tanaka et al. [14] (see Table 4) where the threshold value of the cost function was decided to be 0.0025 ( = 0.0025 ).

Parameter Description Mean Median

α−δ Net transmission rate 0.69 0.68 log(2)/(α−δ) Doubling time 1.08 1.02

α/δ Reproductive value 19.04 3.43

Table 4: Posterior estimates of compound parameters from San Francisco data, Tanaka et al. [14] .

Using the same threshold value, the results of this study were collected (see Table 5) and the method with the averaging of summary statistics was also applied (see Table 6).

(44)

Parameter Description Mean Median α−δ Net transmission rate 0.65 0.67 log(2)/(α−δ) Doubling time 1.10 1.04

Table 5: Posterior estimates of compound parameters from San Francisco data obtained in this work.

Parameter Description Mean Median

α−δ Net transmission rate 0.67 0.66 log(2)/(α−δ) Doubling time 1.13 1.03

Table 6: Posterior estimates of compound parameters from San Francisco data obtained in this work(with average summary statistics).

Moreover, different threshold values were chosen in order to understand their influence on the TB parameters distribution. Thus, the effects of the tolerance parameter, in terms of sampler accuracy, were also analyzed (see Figures 18,19,20).

Figure 18: Posterior densities of net transmission rate α − δ, doubling time log(2)/(α− δ), and reproductive value α/δ as dependent on algorithm tolerance . The values of are 0.025 (shaded dashed lines), 0.015 (shaded solid lines), 0.005

(45)

(dashed solid lines), and 0.0025 (thick solid lines). The picture is the outcome of Tanaka et al. [14]

Figure 19: Posterior densities of net transmission rate α − δ, doubling time log(2)/(α−δ), and reproductive value α/δ as dependent on algorithm tolerance . The figure corresponds to this work and was obtained using ABC-MCMC method.

Figure 20: Posterior densities of net transmission rate α − δ, doubling time log(2)/(α−δ), and reproductive value α/δ as dependent on algorithm tolerance . The figure corresponds to this work and was obtained using ABC-MCMC method with summary statistics.

The estimation of tuberculosis transmission parameters using ABC and MCMC methods

Contents

1 INTRODUCTION

1.1 Background

1.2 Research problem, objectives and delimitation

1.3 Research methodology

1.4 Organization of the study

2 Bayesian estimation and MCMC

2.1 Bayesian parameter estimation

2.2 ML method

2.3 MCMC methods for parameter estimation

2.4 Adaptive version of Metropolis algorithm

2.5 Bootstrap methods

3 ABC and likelihood-free methods

3.1 Introduction to ABC methods

3.2 Likelihood - free MCMC

3.3 The combination of MCMC and ABC

4 ABC in estimation of TB transmission parame- ters

4.1 Introduction to the problem and data description

4.2 The model description

4.3 The estimation of the process and summary statistics

4.4 The concept of early rejection

5 RESULTS