
A continuous-time stochastic model is used to describe the process. It is similar to the one studied in Tavare [16], the only difference being the probability of occurrence of new genotypes: here, it is proportional to the number of cases of the genotype.

There are three main parameters in the model:

φ= (α, δ, θ),

where

• α - the "birth" rate per case per year, responsible for the emergence of new infections

• δ - the "death" rate per case per year, modelling the death or recovery of the person (host)

• θ - the "mutation" rate per case per year, corresponding to the replacement of one TB type with another within a host

It is important to note the assumption that each mutation gives rise to a genotype that has never appeared before. It is also assumed that all genotypes have the same epidemiological properties. Moreover, the probability of each of the three events is assumed to be proportional to the number of cases, and the rates per host do not change over time, making the process time-homogeneous.

There are several notations that are needed to describe the algorithm:

• Xi(t) - the number of cases of genotype i

• G(t) - the number of genotypes that have existed up to time t

• N(t) - the total number of cases up to time t

• P̄n(t) - the probability that the total number of cases up to time t equals n:

P̄n(t) = P(N(t) = n)

• P̃g(t) - the probability that the number of genotypes that have existed up to time t equals g:

P̃g(t) = P(G(t) = g)

To start the process, the population is initialized with a single infection. Taking into account all the assumptions above, the time evolution of Pi,x(t), the probability that genotype i has x cases at time t, is described by the differential equation:

dPi,x(t)/dt = −(α + δ + θ) x Pi,x(t) + α (x − 1) Pi,x−1(t) + (δ + θ)(x + 1) Pi,x+1(t), for x = 1, 2, 3, ...

Boundary conditions:

dPi,0(t)/dt = (δ + θ) Pi,1(t), for i = 1, 2, ..., G(t)

Initial conditions:

P1,1(0) = 1

Pi,0(0) = 1, for i = 2, 3, 4, ...

Pi,x(0) = 0, in all other cases

The evolution of P̄n(t) follows the process described as:

dP̄n(t)/dt = −(α + δ) n P̄n(t) + α (n − 1) P̄n−1(t) + δ (n + 1) P̄n+1(t), for n = 1, 2, 3, ...

Boundary conditions:

dP̄0(t)/dt = δ P̄1(t)

Initial conditions:

P̄1(0) = 1

P̄n(0) = 0, for n ≠ 1
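As a sanity check, the master equation for P̄n(t) can be integrated numerically on a truncated state space. The sketch below is a minimal illustration, not part of the original work: the truncation level n_max and the rate values are arbitrary assumptions, and probability leaking past n_max is simply ignored.

```python
import numpy as np
from scipy.integrate import solve_ivp

def birth_death_rhs(t, P, alpha, delta, n_max):
    """Right-hand side of dPbar_n/dt for the linear birth-death process,
    truncated at n = n_max (mass leaking past n_max is dropped)."""
    dP = np.zeros_like(P)
    dP[0] = delta * P[1]                      # boundary: absorption at n = 0
    for n in range(1, n_max):
        dP[n] = (-(alpha + delta) * n * P[n]
                 + alpha * (n - 1) * P[n - 1]
                 + delta * (n + 1) * P[n + 1])
    dP[n_max] = (-(alpha + delta) * n_max * P[n_max]
                 + alpha * (n_max - 1) * P[n_max - 1])
    return dP

alpha, delta, n_max = 0.5, 0.2, 200           # assumed rates and truncation level
P0 = np.zeros(n_max + 1)
P0[1] = 1.0                                   # start from a single case
sol = solve_ivp(birth_death_rhs, (0.0, 2.0), P0, args=(alpha, delta, n_max),
                rtol=1e-8, atol=1e-10)
P_final = sol.y[:, -1]
print(P_final.sum())                          # ~1: probability is conserved
print(P_final @ np.arange(n_max + 1))         # ~exp((alpha-delta)*t), the known mean
```

For this linear process the mean E[N(t)] = e^((α−δ)t), which gives a useful check on the truncation.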

The probability P̃g(t) changes according to a corresponding differential equation for dP̃g(t)/dt, where tg denotes the time when genotype g first appears.

It can be seen that this model is an extension of the standard linear birth-death process.

Basically, we simulate a discretized version of the continuous-time stochastic model described above. The time between events is exponentially distributed, but it is not simulated, since the time experienced by the infectious population is not of interest. Assuming that the parameters α, δ and θ are fixed, the birth-death-mutation process is described as follows:

• Initialization step: set X1(0) = 1, Xi(0) = 0 for i ≠ 1, N(0) = 1, G(0) = 1

• Select the genotype of the next event with probability proportional to the number of cases of that genotype in the sample; let it be genotype i

• Choose the type of event according to the probabilities of birth, death and mutation, proportional to α, δ and θ respectively

• Update the state according to the chosen event:

- birth: Xi(t) = Xi(t) + 1

- death: Xi(t) = Xi(t) − 1

- mutation: Xi(t) = Xi(t) − 1, G(t) = G(t) + 1, XG(t)(t) = 1

• Continue the process from step 2 until N(t) = Nstop
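The steps above can be sketched as follows. This is a minimal illustration; the parameter values are arbitrary assumptions, and event times are not tracked, in line with the discretized view described above.

```python
import random

def simulate_bdm(alpha, delta, theta, n_stop, seed=0):
    """Simulate the birth-death-mutation process until the total
    number of cases reaches n_stop (or the population dies out)."""
    rng = random.Random(seed)
    X = [1]                                  # X[i]: cases of genotype i+1; one initial case
    N = 1
    while 0 < N < n_stop:
        # genotype of the next event, proportional to its case count
        i = rng.choices(range(len(X)), weights=X)[0]
        # event type, proportional to the per-case rates
        event = rng.choices(("birth", "death", "mutation"),
                            weights=(alpha, delta, theta))[0]
        if event == "birth":
            X[i] += 1
            N += 1
        elif event == "death":
            X[i] -= 1
            N -= 1
        else:                                # mutation: a brand-new genotype of size 1
            X[i] -= 1
            X.append(1)
    return X

X = simulate_bdm(alpha=1.0, delta=0.3, theta=0.2, n_stop=1000)
print(sum(X), len(X))                        # total cases and G(t), genotypes ever seen
```

Note that a mutation leaves N unchanged, and extinct genotypes stay in the list with count zero, so len(X) matches G(t), the number of genotypes that have ever existed.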

The stopping value Nstop should be chosen such that it reflects the size of the population and represents an appropriate level of diversity. Previous research has shown that samples of size Nstop = 10^3 gave too high acceptance rates during the simulation, while Nstop = 10^5 is considered big enough for true infectious population sizes (Tanaka et al. [14]). Thus, the value was chosen to be Nstop = 10^4.

4.3 The estimation of the process and summary statistics

As was mentioned in the beginning of the chapter, the approach to finding the estimates is based on organising the clusters. Usually, the structure of a cluster is denoted by n^k, with

k = #{xi : xi = n},

i.e., n^k indicates that there are k genotypes xi of size n, where the xi denote the genotype sizes.

The resulting configuration of clusters obtained in this work is compared with the real data of Small et al. [11]:

30^1 23^1 15^1 10^1 8^1 5^2 4^4 3^13 2^20 1^282

The estimation of the process is based on approximate Bayesian inference and MCMC (Marjoram et al. [7]). The goal is to estimate the birth, death and mutation rates through their distributions obtained via a Markov chain. The MCMC acceptance criterion is based on the evaluation of a function of summary statistics.

Here, two summary statistics are used: the number of distinct genotypes g,

g = Σi ki,

and the gene diversity H (Tanaka et al. [14]):

H = 1 − Σi (ni/n)^2,

where the ni stand for the cluster sizes and n = Σi ni.

According to the real data, the number of distinct genotypes g is equal to 326, while the number of isolates n is 473.
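Both statistics can be computed directly from the cluster configuration quoted above (a small check, using the Small et al. [11] data as listed in the text):

```python
# clusters as (size n, count k) pairs: 30^1 23^1 15^1 10^1 8^1 5^2 4^4 3^13 2^20 1^282
clusters = [(30, 1), (23, 1), (15, 1), (10, 1), (8, 1),
            (5, 2), (4, 4), (3, 13), (2, 20), (1, 282)]

g = sum(k for _, k in clusters)              # number of distinct genotypes
n = sum(size * k for size, k in clusters)    # total number of isolates
H = 1.0 - sum(k * (size / n) ** 2 for size, k in clusters)  # gene diversity

print(g, n)        # 326 473, matching the values quoted in the text
print(H)
```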

It was shown by Ewens [3] that g is a sufficient statistic, but in this study the model is somewhat different. That is why it was decided to also consider H, as it is related to heterozygosity in randomly mating diploid populations (Tanaka et al. [14]).

The real-life situation is imitated as follows: the population is simulated to the size Nstop, and then a subset of the same size n as the real data is taken.

For φ = (α, δ, θ), the detailed description of the algorithm reads:

1. Initialization step: set t = 0 and define φ0

2. Generate a new set of parameter values φ ∼ q(φ|φt). In this work, the proposal density was chosen to be a multivariate normal N(φt, Σ)

3. Simulate the birth-death-mutation process, obtain a subset of size n from it, and compute the summary statistics g and H

4. If the acceptance conditions on the summary statistics are satisfied, the proposed point is accepted

One of the key challenges here is how to specify the prior distribution p(α, δ, θ). The only information available in the literature for the San Francisco data concerns the mutation parameter of some genetic markers (Tanaka et al. [14]). That is why this information was included in the prior, which is chosen to be:

p(α, δ, θ) = N(θ; 0.198, 0.06735^2) · 1(α > δ > 0)

Having analyzed the research conducted to estimate the mutation rate using IS6110 fingerprints, the prior distribution of the mutation parameter p(θ) is selected to be N(0.198, 0.06735^2) (Tanaka et al. [14]). The birth and death parameters are constrained, in a rational way, to be positive with α > δ. Note that the prior distribution does not integrate to one, as the normalising constant is absent, but in the MCMC sampler this does not matter, since the normalising constants cancel out in the acceptance ratio.
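The prior described above can be written down explicitly. The sketch below is under the stated assumptions only: a normal density on the mutation rate with the quoted mean and standard deviation, flat on the birth and death rates subject to α > δ > 0; the density is improper, which, as noted, is harmless in the Metropolis ratio.

```python
import math

def log_prior(alpha, delta, theta, mu=0.198, sigma=0.06735):
    """Unnormalised log prior: N(mu, sigma^2) on the mutation rate theta,
    flat on (alpha, delta) subject to alpha > delta > 0."""
    if not (alpha > delta > 0.0):
        return -math.inf                     # outside the support
    return -0.5 * ((theta - mu) / sigma) ** 2  # Gaussian kernel, constants dropped

print(log_prior(1.0, 0.5, 0.198))            # maximal at the prior mode
print(log_prior(0.3, 0.5, 0.198))            # -inf: violates alpha > delta
```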

However, the randomness of the process often results in unstable calculations since, for example, the subset is taken only once and the summary statistics are calculated from it. One way to obtain more efficient summary statistics is to take the subset several times and average the summary statistics over the draws. This strategy was used in this work: the subset was taken 100 times and both summary statistics were calculated as averages over the 100 draws.
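This averaging step can be sketched as follows. This is a minimal illustration: `population` is assumed to hold the per-genotype case counts produced by the simulation, the subset size 473 matches the real data, and the toy population at the bottom is invented purely for the demonstration.

```python
import random

def averaged_summaries(population, n_subset=473, n_repeats=100, seed=0):
    """Draw n_repeats subsets of size n_subset (without replacement, at the
    level of individual cases) and average g and H over the draws."""
    rng = random.Random(seed)
    # expand per-genotype counts into one label per case
    cases = [i for i, count in enumerate(population) for _ in range(count)]
    g_sum = h_sum = 0.0
    for _ in range(n_repeats):
        subset = rng.sample(cases, n_subset)
        sizes = {}
        for label in subset:
            sizes[label] = sizes.get(label, 0) + 1
        g_sum += len(sizes)                                        # distinct genotypes
        h_sum += 1.0 - sum((s / n_subset) ** 2 for s in sizes.values())  # gene diversity
    return g_sum / n_repeats, h_sum / n_repeats

# toy population: 500 genotypes of size 10 each (assumed, for illustration only)
g_bar, h_bar = averaged_summaries([10] * 500)
print(g_bar, h_bar)
```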

4.4 The concept of early rejection

In order to obtain a realistic result, the simulations are performed a large number of times, and as a consequence the computational time is very large. With the aim of speeding up the process, the concept of early rejection can be used.

The idea behind early rejection is not new; it has been considered for MCMC in Solonen et al. [13] (who first used the expression "early rejection") and more recently in Picchini [8]. Here, another approach is introduced, based on the cluster structure.

The idea is the following: rather than simulating the whole process every time and rejecting it using summary statistics, one can reject the sample earlier, during the collection of the first 10^4 cases. The difficulties here are the randomness of the process, which can considerably decrease the efficiency of any method applied, and the choice of an early rejection criterion that copes with this issue while simultaneously reducing the computational time.

In order to devise an early rejection criterion for the X sample, the process was simulated, and the clusters corresponding to the accepted points were saved and analyzed. By a cluster one should understand the number of cases that have the same genotype in a sample. Data from the clusters were collected and illustrated (see Figure 7).

One can notice a typical behaviour of the function. This suggests an intuitive approach: fit a function to the clusters of the last accepted point of the chain and compare it with the current one.

Figure 7: The illustration of a cluster that corresponds to an accepted point.

Because of the fluctuations in the cluster function and the randomness of the process, it is impossible to find a function that fits them perfectly, but the main tendency can be characterized using the approximation y = a x^(−b). Thus, to explore how well the fitting works, the fminsearch Matlab routine was used to find the a, b parameters for each individual point, and the cost function was calculated as:

Σ_{i=1}^{n} |yi − a xi^(−b)|,

where n is the number of clusters and yi is the number of genotypes that appear xi times in the sample.

It was decided to use the sum of absolute values of the differences. Besides, it was decided to neglect those genotypes that appeared in the obtained sample only 1 or 2 times since, in fact, they can harm the fitting, while their absence does not considerably change the main result. The natural reason for this is the following: in every case, it is much more likely to observe genotypes that appear 1 or 2 times in a sample than genotypes of larger sizes. Thus, the clusters were transformed from the original form (see Table 1) in the described way (see Table 2).

Genotype size:        1    2    3   4   5  6  7  8  9  10
Number of genotypes:  1136 378 218 126 91 60 44 28 12 9

Table 1: An example of the cluster structure before the transformation.

Genotype size:        3   4   5  6  7  8  9  10 11 12
Number of genotypes:  218 126 91 60 44 28 12 9  8  8

Table 2: An example of the cluster structure after the transformation.
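The fitting step can be sketched as follows. This is an illustration in Python rather than the Matlab fminsearch used in the thesis: SciPy's Nelder-Mead simplex search plays the role of fminsearch, and the data are those of Table 2.

```python
import numpy as np
from scipy.optimize import minimize

# cluster data after the transformation (Table 2): sizes and genotype counts
x = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=float)
y = np.array([218, 126, 91, 60, 44, 28, 12, 9, 8, 8], dtype=float)

def cost(params):
    """Sum of absolute deviations between the data and y = a * x**(-b)."""
    a, b = params
    return np.abs(y - a * x ** (-b)).sum()

# Nelder-Mead simplex search, analogous to Matlab's fminsearch
res = minimize(cost, x0=[200.0, 1.0], method="Nelder-Mead")
a_fit, b_fit = res.x
print(a_fit, b_fit, cost(res.x))
```

The sum of absolute deviations, rather than squares, mirrors the cost function chosen in the text and is less sensitive to the spikes in the cluster counts.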

Having performed the fitting procedure, the values of the a, b parameters for each cluster were obtained, as well as the distribution of the cost function values (see Figure 8). The clusters for the figure were collected during the simulation of the process and the estimation of the model parameters; thus, they represent some of the accepted points of the chain.

Figure 8: The distribution of the function parameters and cost function values.

Even though the values of the cost function lie in the interval (50, 140), one can still accept such a large difference, since the process is random. Moreover, the fitting follows the actual behaviour of the function (see Figure 9):

Figure 9: Example of fitting to the cluster (zoomed view).

As can be seen from the picture, even though the fitting reflects the main tendency of the cluster distribution, a difference remains due to the up and down spikes around the fitted function.

The next step is to provide an early rejection criterion in order to avoid the subset sampling. Referring to the main algorithm, this means that one waits until the number of cases reaches 10^4 and the clusters are organised. After that, a criterion is applied to decide whether the X sample is suitable for taking subsets from, or whether it can be rejected already at that stage.

Assume that the last accepted point in the algorithm was φt−1 = (αt−1, δt−1, θt−1), that the fitting was applied to approximate the behaviour of its clusters, and that the parameters at−1, bt−1 and the cost function value Ft−1 were obtained. The early rejection procedure is then the following:

• the clusters of size 1 and 2 are dropped from the distribution;

• the sum St is calculated according to the formula:

St = Σ_{i=1}^{n} |yi − at−1 xi^(−bt−1)|,

where the xi are the sizes of genotypes that appeared at least 3 times, yi is the number of the corresponding genotypes, and at−1, bt−1 are the values obtained for the point φt−1;

• if St < ξFt−1, the sample is allowed to proceed to the next step of the algorithm; otherwise the sample is rejected, a new sample is generated from a neighbourhood of the same point φt−1, and the procedure starts again from step 1.

Here ξ > 1; its range will be determined after a more detailed analysis.
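A sketch of this check follows. It is illustrative only: `a_prev`, `b_prev` and `F_prev` stand for at−1, bt−1 and Ft−1 from the last accepted point, and the cluster data at the bottom are hypothetical values invented for the demonstration.

```python
def early_reject(sizes, counts, a_prev, b_prev, F_prev, xi=1.5):
    """Return True if the sample should be rejected early.

    sizes[i] is a genotype size (>= 3; clusters of size 1 and 2 are assumed
    already dropped) and counts[i] the number of genotypes of that size."""
    s_t = sum(abs(y - a_prev * x ** (-b_prev)) for x, y in zip(sizes, counts))
    return s_t >= xi * F_prev                # reject unless S_t < xi * F_{t-1}

# hypothetical current clusters and last-accepted fit (assumed values)
sizes = [3, 4, 5, 6, 7, 8]
counts = [210, 120, 85, 62, 40, 30]
print(early_reject(sizes, counts, a_prev=3400.0, b_prev=2.5, F_prev=80.0))
```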

It is important to mention that if the sample is not rejected and, moreover, the corresponding point is accepted, new parameters at, bt and Ft are fitted.

The algorithm was applied to the problem and the results were analyzed. According to them, the approach works in an appropriate way, and the samples were properly rejected in most cases.

Going further, it was decided to improve the algorithm in the following way: instead of checking a sample when it is already complete, one can apply the early rejection criterion at earlier stages, during the completion of the sample. In this case the early rejection procedure has to be modified to account for the expected properties of the sample; that is, one should predict the behaviour of the sample under further sampling. In this work the following solution is applied: every time the number of cases in the sample increases by 1000, the new early rejection criterion is checked. The idea is the same as in the previous case, but now one should take into account that the sample is not yet complete (unlike the previous case, where the criterion was applied once the sample reached its threshold size). Thus, the following procedure is used:

• Assume that the length of the X sample equals 1000·p, where p = 1, ..., 10, and that all the previous notations are valid:

– organize the genotype clusters, keeping those genotypes that appeared at least 3 times in the current state of the sample;

– collect the value of the sum into the val array:

val(p) = Σ_{i=1}^{n} |yi − at−1 xi^(−bt−1)|,

where yi is the number of genotypes in the clusters and at−1, bt−1 are obtained from the previous point of the chain;

– if p > 2, check the early rejection criterion:

val(p) − (10 − p) · max(diff(val)) > ξFt−1;

if this condition is satisfied, the sample is rejected, a new sample is generated from a neighbourhood of the same point φt−1, and the procedure starts again. Here diff corresponds to the differences between adjacent elements of the array:

diff(val)[i] = val(i−1) − val(i)
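The staged check can be sketched as follows. This is illustrative only: `val` is assumed to hold the partial cost sums collected every 1000 cases, `F_prev` stands for Ft−1, and the example values are invented for the demonstration.

```python
def staged_early_reject(val, p, F_prev, xi=1.5):
    """Check the staged criterion at stage p (sample length 1000*p).

    val[0..p-1] hold the partial sums val(1)..val(p); the forecast
    val(p) - (10 - p) * max(diff(val)) bounds the final sum from below
    when the sums decrease, so exceeding xi * F_prev triggers rejection."""
    if p <= 2:
        return False                          # too early for a reliable forecast
    diffs = [val[i - 1] - val[i] for i in range(1, p)]
    forecast = val[p - 1] - (10 - p) * max(diffs)
    return forecast > xi * F_prev

# hypothetical decreasing partial sums (assumed values)
val = [400.0, 320.0, 280.0, 250.0]
print(staged_early_reject(val, p=4, F_prev=80.0, xi=1.5))
```

Note that at p = 10 the forecast reduces to val(10) itself, consistent with the remark later in the text that the approximate value becomes exact at the last step.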

This algorithm was applied to the problem and analyzed with respect to different values of ξ and different ways in which the a and b parameters are updated. The first way is simply to follow the algorithm, while the second is to store the parameters and recalculate them as the mean values of those already obtained (see Figures 10, 11, 12, 13).

Figure 10: The distribution of the cost function values, both rejected at early stages (red) and allowed for further sampling (blue). The tolerance value ε is 0.0025. This is the case when the a and b parameters are chosen as the mean values of the parameters of already accepted points, and ξ is chosen to be 1.5.

Figure 11: The distribution of the cost function values, both rejected at early stages (red) and allowed for further sampling (blue). The tolerance value ε is 0.0025, and ξ is chosen to be 1.5.

Figure 12: The distribution of the cost function values, both rejected at early stages (red) and allowed for further sampling (blue). The tolerance value ε is 0.0025. This is the case when the a and b parameters are chosen as the mean values of the parameters of already accepted points, and ξ is chosen to be 2.

Figure 13: The distribution of the cost function values, both rejected at early stages (red) and allowed for further sampling (blue). The tolerance value ε is 0.0025, and ξ is chosen to be 2.

At the same time, the proportion of rejected points among all unaccepted points, and the percentage of points rejected by mistake with respect to all accepted points, were obtained (see Table 3).

ξ     Averaging of a and b   % of rejected   % of rejected by mistake
1.5   +                      64.18           5
1.5   −                      65.43           27
2     +                      46.88           1
2     −                      47.04           2.9

Table 3: The choice of parameters and their influence on the rejection procedure.

Thus, the early rejection algorithm was applied to the main problem. According to the figures and the table, it has shown quite satisfactory results. Furthermore, the stage at which a parameter set is rejected is also of special interest. By the term stage it is meant at what time or, more precisely, at what value of the parameter p in the algorithm the sample was rejected. As the early rejection criterion was checked seven times, starting when the sample size is representative (p = 4), the stages lie within the interval [4, 10] (see Figures 14, 15, 16, 17).

Figure 14: The distribution of the stages at which the sample was rejected during the early rejection procedure. The tolerance value ε is 0.0025. This is the case when the a and b parameters are chosen as the mean values of the parameters of already accepted points, and ξ is chosen to be 1.5.

Figure 15: The distribution of the stages at which the sample was rejected during the early rejection procedure. The tolerance value ε is 0.0025, and ξ is chosen to be 1.5.

Figure 16: The distribution of the stages at which the sample was rejected during the early rejection procedure. The tolerance value ε is 0.0025. This is the case when the a and b parameters are chosen as the mean values of the parameters of already accepted points, and ξ is chosen to be 2.

Figure 17: The distribution of the stages at which the sample was rejected during the early rejection procedure. The tolerance value ε is 0.0025, and ξ is chosen to be 2.

However, some points should be clarified:

• first, it can be seen from the algorithm that the early rejection criterion is only checked once the length of the X sample is at least 3000. The reason for this decision is that this length was considered big enough to analyze the further behaviour of the sample, while a smaller length could lead to a rough forecast;

• second, by the term forecast it is meant that one tries to predict the final value of the sum (when the X sample is fully collected). Naturally, the prediction is not very accurate, especially at the earlier stages. However, if the sample is not rejected at the early stages, the forecast becomes more accurate and, finally, at the last step the approximate value is exact;

• third, as was mentioned in the beginning of this work, the algorithm is still under development, as more time is needed to test which values should be used, for example, for ξ.

5 RESULTS

As was mentioned in the previous sections, the ABC-MCMC method was applied to the problem. In order to check the results, they were compared with the outcomes of Tanaka et al. [14] (see Table 4), where the threshold value of the cost function was chosen to be 0.0025 (ε = 0.0025).

Parameter        Description            Mean    Median
α − δ            Net transmission rate  0.69    0.68
log(2)/(α − δ)   Doubling time          1.08    1.02
α/δ              Reproductive value     19.04   3.43

Table 4: Posterior estimates of compound parameters from the San Francisco data, Tanaka et al. [14].

Using the same threshold value, the results of this study were collected (see Table 5), and the method with averaging of the summary statistics was also applied (see Table 6).

Parameter        Description            Mean    Median
α − δ            Net transmission rate  0.65    0.67
log(2)/(α − δ)   Doubling time          1.10    1.04
α/δ              Reproductive value     9.09    3.10

Table 5: Posterior estimates of compound parameters from the San Francisco data obtained in this work.

Parameter        Description            Mean    Median
α − δ            Net transmission rate  0.67    0.66
log(2)/(α − δ)   Doubling time          1.13    1.03
α/δ              Reproductive value     11.17   3.94

Table 6: Posterior estimates of compound parameters from the San Francisco data obtained in this work (with averaged summary statistics).

Moreover, different threshold values were chosen in order to understand their influence on the distribution of the TB parameters. Thus, the effects of the tolerance parameter on the sampler accuracy were also analyzed (see Figures 18, 19, 20).

Figure 18: Posterior densities of the net transmission rate α − δ, doubling time log(2)/(α − δ), and reproductive value α/δ as dependent on the algorithm tolerance ε. The values of ε are 0.025 (shaded dashed lines), 0.015 (shaded solid lines), 0.005 (dashed lines), and 0.0025 (thick solid lines). The figure is the outcome of Tanaka et al. [14].

Figure 19: Posterior densities of the net transmission rate α − δ, doubling time log(2)/(α − δ), and reproductive value α/δ as dependent on the algorithm tolerance ε. The figure corresponds to this work and was obtained using the ABC-MCMC method.

Figure 20: Posterior densities of the net transmission rate α − δ, doubling time log(2)/(α − δ), and reproductive value α/δ as dependent on the algorithm tolerance ε. The figure corresponds to this work and was obtained using the ABC-MCMC method with averaged summary statistics.

6 CONCLUSIONS

In this work the problem of estimating TB transmission parameters was studied.

The ABC-MCMC method was applied to San Francisco data that was taken from
