
reassignment that leads to the maximal increase in the marginal likelihood (equation (2.9)).

iv In a random order, split each cluster into m subclusters using the complete linkage clustering algorithm as described in operator (iii), where $m = \min(20, \lceil |s_c|/5 \rceil)$ and $|s_c|$ is the total number of data items in the cluster. Then try to reassign each subcluster to another cluster; choose the split and reassignment that leads to the maximal increase in the marginal likelihood (equation (2.9)). A sketch of this splitting step is given after the algorithm.

Output: an estimate of the best partition $\hat{S}$, leading to the highest marginal likelihood $p(x^{(N)} \mid \hat{S})$.
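The splitting step of operator (iv), referenced above, can be sketched with SciPy's complete-linkage clustering on pairwise Hamming distances. The helper below is illustrative; the function and variable names (`split_cluster`, `cluster_items`) are not from the thesis, and sequences are assumed to be integer-encoded rows of a NumPy array.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def split_cluster(cluster_items: np.ndarray) -> list[np.ndarray]:
    """Split the rows of `cluster_items` (one sequence per row, bases encoded
    as integers 0-3) into m = min(20, ceil(|s_c|/5)) subclusters."""
    size = cluster_items.shape[0]
    m = min(20, int(np.ceil(size / 5)))
    if m <= 1 or size < 2:
        return [np.arange(size)]
    # Condensed pairwise Hamming distance matrix between the sequences.
    dist = pdist(cluster_items, metric="hamming")
    # Complete-linkage hierarchical clustering, cut into m subclusters.
    tree = linkage(dist, method="complete")
    labels = fcluster(tree, t=m, criterion="maxclust")
    return [np.where(labels == k)[0] for k in np.unique(labels)]
```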

The above greedy stochastic algorithm uses heuristics to visit the high probability areas in the partition space $\mathcal{S}$. Operator i exchanges data items between clusters to optimize the current partition; operator ii merges similar clusters together to reduce the number of clusters; operators iii and iv split out heterogeneous data items to optimize the current partition. Although the algorithm does not guarantee global optimality of the solution, it searches the high probability areas very efficiently according to our intensive experiments. In practice, we have a wealth of empirical evidence that the estimated partition tends to be biologically meaningful and more sensible than alternative estimates based on standard methods for Bayesian computation, such as the Gibbs sampler or the Metropolis-Hastings algorithm with completely random proposals.
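To make the accept-if-improved mechanics concrete, here is a minimal sketch of operator (i) within the greedy search, assuming a function `log_marginal_likelihood(partition, data)` that evaluates equation (2.9) on the log scale; the function name and signature are placeholders, not from the thesis, and the merge/split operators (ii)-(iv) would be applied in the same way.

```python
import numpy as np

def greedy_reassign(data: np.ndarray, partition: np.ndarray,
                    log_marginal_likelihood, rng=np.random.default_rng()):
    """One sweep of operator (i): try to move each item to another cluster
    and keep only moves that increase the log marginal likelihood."""
    best = log_marginal_likelihood(partition, data)
    for i in rng.permutation(len(partition)):      # visit items in random order
        current = partition[i]
        for c in np.unique(partition):             # try every other cluster
            if c == current:
                continue
            proposal = partition.copy()
            proposal[i] = c
            value = log_marginal_likelihood(proposal, data)
            if value > best:                       # accept only improving moves
                partition, best = proposal, value
    return partition, best
```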

Note that distances between samples other than the Hamming distance could also be utilized here, depending on the specific scenario. In practice, the marginal likelihood needs to be calculated on a log scale to avoid numerical underflow, since the values are in general extremely small.
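As a sketch of the log-scale computation, the Dirichlet-multinomial marginal of a single column of a single cluster can be evaluated with log-gamma functions so that the extremely small probabilities never underflow. The names `counts` (base counts of the column) and `alpha` (Dirichlet hyperparameters) are assumptions for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_column_marginal(counts: np.ndarray, alpha: np.ndarray) -> float:
    """Log Dirichlet-multinomial marginal likelihood of one column:
    log Gamma(sum a) - log Gamma(sum a + n) + sum_j [log Gamma(a_j + n_j) - log Gamma(a_j)]."""
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))
```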

2.2 Supervised classification

Supervised classification differs from unsupervised classification (clustering) in that it requires training data, which contain K classes (or groups). The primary aim is often to assign the unlabeled data items to one of the K classes; however, other purposes are also frequently considered, where one typically calculates some functionals of the data assigned to each class.

An example of this is the relative contribution of each known source (class) to the population of unlabeled samples.

In the context of bacterial population genomics, a popular application of supervised classification is to classify a new sample to a selected taxonomic rank (usually species) based on its sequence information. The training data are usually a sequence database, which is a large collection of sequences and their labels from various biological projects. The test data usually contain sequences produced in a new biological project, where assigning labels to the sequences is very important in understanding the data.

In this subsection, the training data are denoted by $z^{(M)}$, where $M = \{1, 2, \cdots, m\}$. The training data are assumed to be divided into $K$ classes based on some auxiliary knowledge or an earlier unsupervised analysis; the class labels are denoted by $T = \{T_1, T_2, \cdots, T_m\}$, where $T_i \in \{1, 2, \cdots, K\}$. The test data are denoted by $x^{(N)}$, where $N = \{1, 2, \cdots, n\}$.

Note that the test data items have the same features as the training data items. The aim is to assign a label $S_i$ to each test data item $x_i$ based on its resemblance to the observations within each group of the training data. The joint labeling of the test data is denoted by $S = \{S_1, S_2, \cdots, S_n\}$, where $S_i \in \{1, 2, \cdots, K\}$.

A typical assumption is that a test data item is generated from one of the underlying distributions of the K classes in the training data, where the parameters of the underlying distributions are learned from the training data. Thus the test data items are independent given the known parameters, such that the labeling of one data item will not affect the others.

Under this assumption, the labeling of one data item is based solely on the information in the training data and does not borrow any information from other test data items. When the training data are very sparse, there is consequently a high risk of mislabeling the test data.

The strategy adopted in this thesis, however, does not assume independence of the test data items. Instead, we label all test data items simultaneously, such that the labeling of one test data item also borrows statistical strength from other test data items. Corander et al. [23] provide a detailed discussion of two classifiers based on the above two classification principles and of another, marginalized classifier.
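Schematically, the two principles can be contrasted as follows; the factorized form is a generic independent classifier written for illustration, not a formula from the thesis, while the joint form is equation (2.13).

```latex
% Independent labeling: each test item is classified on its own,
% using only the training data (illustrative form).
\hat{S}_i \;=\; \arg\max_{c \in \{1,\dots,K\}} \; p(x_i \mid z^{(M)}, T, S_i = c),
\qquad i = 1, \dots, n.

% Joint labeling adopted in this thesis: all test items are labeled
% simultaneously, cf. equation (2.13).
\hat{S} \;=\; \arg\max_{S} \; p(S \mid x^{(N)}, z^{(M)}, T).
```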

The posterior probability of the joint labeling $S$ is given by

$$
p(S \mid x^{(N)}, z^{(M)}, T)
  = \frac{p(x^{(N)} \mid z^{(M)}, T, S)\, p(S \mid z^{(M)}, T)\, p(z^{(M)}, T)}
         {p(x^{(N)}, z^{(M)}, T)}
  \propto p(x^{(N)} \mid z^{(M)}, T, S)\, p(S \mid z^{(M)}, T), \tag{2.12}
$$

where $p(z^{(M)}, T)$ and $p(x^{(N)}, z^{(M)}, T)$ are constants with respect to $S$. The aim is to seek a joint labeling $\hat{S}$ of the test data items which maximizes the posterior probability, i.e.

$$
\hat{S} = \arg\max_{S}\; p(S \mid x^{(N)}, z^{(M)}, T). \tag{2.13}
$$

2.2.1 Prior

To place a prior distribution on $S$, we only need to know the number of classes $K$ in the training data. Hence it is reasonable to assume that the prior distribution of $S$ is independent of the training data $z^{(M)}$. Since each test data item could be placed in any of the $K$ classes, the prior of $S$ equals

$$
p(S \mid z^{(M)}, T) = p(S \mid T) = \frac{1}{K^{n}}. \tag{2.14}
$$
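Since this prior is constant in $S$ for a fixed number of test items $n$, maximizing the posterior in (2.13) is equivalent to maximizing the marginal likelihood alone; written out:

```latex
% Plugging the uniform prior (2.14) into (2.12):
\hat{S} \;=\; \arg\max_{S} \; p(x^{(N)} \mid z^{(M)}, T, S)\,\frac{1}{K^{n}}
        \;=\; \arg\max_{S} \; p(x^{(N)} \mid z^{(M)}, T, S).
```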

2.2.2 Likelihood

The marginal likelihood in equation (2.12) does not have an explicit form, so we need to introduce the nuisance parameter $\theta$ to calculate it. The nuisance parameter here is the same as that defined in the previous section (equation (2.5)). With the help of $\theta$, the likelihood in equation (2.12) can be written as follows:

$$
p(x^{(N)} \mid z^{(M)}, T, S)
  = \int_{\Theta} p(x^{(N)} \mid \theta, z^{(M)}, T, S)\, p(\theta \mid z^{(M)}, T, S)\, d\theta
$$
$$
  = \int_{\Theta} p(x^{(N)} \mid \theta, S)\, p(\theta \mid z^{(M)}, T)\, d\theta, \tag{2.15}
$$

where we implicitly assume that the test data depend on the training data only through the nuisance parameter $\theta$.

The first term in the integral of equation (2.15) is the likelihood of generating the test data $x^{(N)}$ given the nuisance parameter $\theta$ and partition $S$, which has an expression identical to equation (2.5). The second term is the posterior probability of $\theta$ given the training data $z^{(M)}$ and partition $T$.

Again we assume the same Dirichlet prior as that in equation (2.8) for $\theta$, which leads to the posterior

$$
p(\theta \mid z^{(M)}, T) \propto p(z^{(M)}, T \mid \theta)\, p(\theta), \tag{2.16}
$$

where the right-hand side depends on the training data $z^{(M)}$ only through the counts $m_{cij}$, the number of occurrences of the $j$th base in the $i$th column of class $c$. We can easily observe that the posterior of $\theta_{ci\cdot}$ is a Dirichlet distribution, $\mathrm{Dir}(m_{ci1} + \lambda_1, \cdots, m_{ci4} + \lambda_4)$, where $\lambda_1, \cdots, \lambda_4$ denote the hyperparameters of the Dirichlet prior in equation (2.8).

By plugging equations (2.5) and (2.16) into (2.15), we obtain an explicit expression for $p(x^{(N)} \mid z^{(M)}, T, S)$, given in equation (2.17), where the last integration follows from the properties of the product Dirichlet distribution.
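For reference, a sketch of what the integral in (2.15) evaluates to under the product Dirichlet form, writing $n_{cij}$ for the count of base $j$ in column $i$ among the test items assigned to class $c$ and $d$ for the number of columns; this notation is assumed here and may differ from the exact form of equation (2.17) in the thesis.

```latex
p(x^{(N)} \mid z^{(M)}, T, S)
  = \prod_{c=1}^{K} \prod_{i=1}^{d}
      \frac{\Gamma\!\big(\sum_{j=1}^{4}(m_{cij}+\lambda_j)\big)}
           {\Gamma\!\big(\sum_{j=1}^{4}(n_{cij}+m_{cij}+\lambda_j)\big)}
      \prod_{j=1}^{4}
      \frac{\Gamma\!\big(n_{cij}+m_{cij}+\lambda_j\big)}
           {\Gamma\!\big(m_{cij}+\lambda_j\big)}
```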

2.2.3 Posterior & Inference

Equations (2.14) and (2.17) provide explicit expressions for the prior and the marginal likelihood in equation (2.12). Thus we are able to compare the posterior probabilities of any two labelings $S$ and $S'$ of the test data.

The inference is almost the same as in the unsupervised classification, except that we set $K_{\max} = K$. Therefore, the test data items can be assigned to at most $K$ classes and at least one class. The algorithm chooses the partition $\hat{S}$ which maximizes the posterior (equation (2.12)).
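A minimal sketch of evaluating the log-scale objective for a candidate joint labeling is given below, assuming integer-encoded sequences (0-3 for the four bases) and a common Dirichlet hyperparameter vector `alpha`; the names, the encoding, and the helper `base_counts` are assumptions for illustration, not from the thesis.

```python
import numpy as np
from scipy.special import gammaln

def base_counts(seqs: np.ndarray, labels: np.ndarray, K: int) -> np.ndarray:
    """Per-class, per-column, per-base counts: shape (K, d, 4)."""
    d = seqs.shape[1]
    counts = np.zeros((K, d, 4))
    for c in range(K):
        for j in range(4):
            counts[c, :, j] = (seqs[labels == c] == j).sum(axis=0)
    return counts

def log_marginal(test_seqs, S, train_counts, alpha=np.ones(4)):
    """Log of the product-Dirichlet marginal for the joint labeling S of the
    test data, given training counts m_{cij} (shape (K, d, 4))."""
    n_counts = base_counts(test_seqs, S, train_counts.shape[0])   # n_{cij}
    post = train_counts + alpha                                   # m_{cij} + lambda_j
    return np.sum(gammaln(post.sum(axis=2)) - gammaln((post + n_counts).sum(axis=2))
                  + np.sum(gammaln(post + n_counts) - gammaln(post), axis=2))
```

With this objective in place, the greedy search of the previous section can be reused directly, with the number of available classes fixed to K.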