
4.4.1 Maximally parsimonious reconstructions

In the previous section, we described several unpleasant asymptotic properties of maximum parsimony: it is not always consistent, and it cannot distinguish between a large number of candidate phylogenies. However, the results presented so far provide neither a clear picture of how serious these issues are in practice nor tangible guidelines on when the results given by MP can be trusted.

Article IV attempts to address these concerns with a series of experiments performed on simulated phylogenetic data. In the study, we generated a large number of phylogenetic data sets with up to twelve taxa and 256 characters. Each data set was generated for a random unrooted binary tree drawn from the Yule–Harding distribution [23], which is known to be closer to the observed distribution of phylogenies in biology than the uniform distribution [4]. Once the tree was fixed, the edge probabilities p_e were assigned according to the Jukes–Cantor model [29] with the substitution rate μ = 4q/(3 − 4q), where q ∈ (0, 0.75) is a fixed parameter, and the edge lengths t were drawn independently from the exponential distribution with unit mean; the parameter q can then be interpreted as the expected value of p_e. In the experiments, the values q = 0.08, 0.16, 0.24, …, 0.48 were considered.
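To make the data-generating mechanism concrete, the following is a minimal Python sketch of how an edge probability may be drawn under this setup (this is an illustration, not the experimental code of Article IV; the function name is ours):

```python
import math
import random

def edge_probability(q, rng):
    """Draw one edge probability p_e: edge length t ~ Exp(1), substitution
    rate mu = 4q / (3 - 4q), and the Jukes-Cantor map
    p_e = (3/4) * (1 - exp(-mu * t))."""
    mu = 4 * q / (3 - 4 * q)
    t = rng.expovariate(1.0)          # exponential with unit mean
    return 0.75 * (1.0 - math.exp(-mu * t))

# Sanity check: E[p_e] should equal q for any q in (0, 0.75).
rng = random.Random(0)
q = 0.24
samples = [edge_probability(q, rng) for _ in range(200_000)]
mean_p = sum(samples) / len(samples)
# mean_p is close to 0.24
```

A short calculation confirms the claim: E[p_e] = (3/4)(1 − E[e^{−μt}]) = (3/4)·μ/(1 + μ) = q when μ = 4q/(3 − 4q).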

Notably, the edge lengths were drawn separately for each character, so the synthetic data follows the no-common-mechanism model discussed in Section 4.2. This choice of model can be motivated by two arguments: firstly, it is the most natural model to use here, since the phylogeny reconstructed by MP is then the maximum likelihood solution; and secondly, it has been argued that the extreme flexibility inherent in the no-common-mechanism model allows it to better describe complex evolutionary phenomena [14]. Since it has also been shown that biologically inspired models, when applicable, tend to outperform the no-common-mechanism model [28], our choice of model has the additional benefit that we may reasonably conjecture that the results show MP at its best and hence should be treated as a kind of upper bound.

Based on the experiments with simulated data, we were able to approximate various probabilistic quantities related to the performance of the maximum parsimony method. The most important of them is the so-called probability of success for a data-generating model, parametrized by the triplet (n, m, q) with interpretations as described above; it is the probability that the phylogeny reconstructed by MP is the true one, with the additional assumption that if there are multiple maximally parsimonious phylogenies, one of them is picked uniformly at random. Denoting the true tree by T and the set of maximally parsimonious trees by 𝒯, it can be shown that the probability of success equals

\[
\mathbb{E}\left[ \frac{\chi(T \in \mathcal{T})}{|\mathcal{T}|} \right],
\]

where χ(·) is an indicator function. By the law of large numbers, this quantity may be approximated by averaging over a large number of independent trials.
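The Monte Carlo estimator suggested by this formula can be sketched as follows (a toy Python illustration with trees encoded as opaque labels; this is not the experimental code of Article IV):

```python
def success_probability(trials):
    """Estimate Pr(success) = E[ chi(T in MP set) / |MP set| ]: with random
    tie-breaking, a trial in which the true tree is one of k maximally
    parsimonious trees contributes 1/k to the average."""
    total = 0.0
    for true_tree, mp_trees in trials:
        if true_tree in mp_trees:
            total += 1.0 / len(mp_trees)
    return total / len(trials)

# Toy data: a two-way tie containing the true tree, a unique correct
# reconstruction, and a complete miss.
trials = [("T1", {"T1", "T2"}), ("T2", {"T2"}), ("T3", {"T1"})]
p_hat = success_probability(trials)
# p_hat = (1/2 + 1 + 0) / 3 = 0.5
```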

The estimated probabilities of success, as displayed in Table 1 of Article IV, give quite clear indications of the boundaries that determine whether the reconstructed phylogenies are accurate or wrong. If a success probability of 0.90 is taken as a reasonable value to aim for, then 128 characters are enough even for twelve taxa, provided that the expected edge probability takes the value q = 0.08, the smallest one used in the study. As the value of q is increased, the necessary number of characters quickly exceeds the scenarios used in the study; for q = 0.48, even 256 characters for five taxa give a success probability of only about 0.70.

Interestingly, it appears that at least for five or six taxa, MP actually performs better for q = 0.16 than for q = 0.08. This is explained by the fact that the probability of success is not a monotonic function of q: constant or almost-constant characters provide little phylogenetic information, so very small values of q are not optimal either.

Figure 1 of Article IV illustrates the same probabilities as a heatmap. As a sanity check, it can be seen from the figure that the required increase in the number of characters (as a function of n) appears to roughly match the logarithmic lower bound discussed in Section 4.3.2.

In a similar manner, other quantities were estimated as well: in particular, the article provides estimates of the probability that the true phylogeny is maximally parsimonious, the probability that there is a unique maximally parsimonious tree, and the expected inverse of the number of maximally parsimonious trees. The reader is referred to Article IV for a discussion on what conclusions may be drawn from these quantities.

4.4.2 Skewness of the tree length distribution

The maximum parsimony method can be used to produce an estimated phylogeny for all possible data sets, even those that are random and thus do not correspond to any true phylogeny. Let us assume that, for a given data set, we know that MP is applicable (e.g., by means of the results of Article IV, as discussed above). If we have no external knowledge suggesting that the data at hand has a phylogenetic structure, how are we to know that the result given by MP is plausible? In other words, how can we ensure that MP is not just hallucinating structure where there is none?

One answer lies in the distribution obtained by computing the lengths of all trees in U(n), as we briefly hinted already in Section 4.3.2. Hillis [25] proposed using the skewness (the third standardized moment) of this distribution, as given by the formula

\[
\gamma_1 = \frac{\sum_{T \in U(n)} \bigl(\ell(T) - \mu\bigr)^3}{|U(n)|\,\sigma^3},
\]

in which ℓ(T) denotes the length of the tree T, and μ and σ denote the mean and the standard deviation of the ℓ(T) over all T ∈ U(n). Symmetric distributions have a skewness of zero, while negative and positive values have the natural interpretation of quantifying the asymmetry present in the shape of the distribution.
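The formula translates directly into code; the following is a small illustrative Python sketch (the function name is ours):

```python
import math

def tree_length_skewness(lengths):
    """Skewness (third standardized moment) of a tree length distribution;
    `lengths` holds one tree length per tree in U(n)."""
    n = len(lengths)
    mu = sum(lengths) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in lengths) / n)
    return sum((x - mu) ** 3 for x in lengths) / (n * sigma ** 3)

# A symmetric distribution has zero skewness; a left-skewed one is negative.
sym = tree_length_skewness([1, 2, 3, 4, 5])      # 0.0
neg = tree_length_skewness([1, 4, 5, 5, 5])      # negative value
```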

The crucial empirical observation of Hillis was that random data tends to induce tree length distributions with small (slightly negative) skewness, while phylogenetic data results in a markedly more negative skewness. This enables one to construct a statistical test for rejecting the null hypothesis that a given data set is random: by generating a large number of random data sets and computing the skewness of the resulting tree length distributions, one can calculate a critical value that is smaller than, for instance, 99 percent of the resulting γ1's. Then, if a new data set produces a skewness value that is more negative than the critical value, we can be fairly confident that the data is not random. This is, of course, dependent on the data generation mechanism used for calibrating the skewness test, but the approach has previously been shown to work well for up to eight taxa [25, 27].
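The calibration step can be sketched as follows. Note that the stand-in skewness values below are purely hypothetical (drawn around a slightly negative mean, as random data tends to give small negative skewness) and serve only to illustrate the mechanics of the test:

```python
import random

def critical_value(skewness_samples, alpha=0.01):
    """Critical value c such that a fraction alpha of the calibration
    skewnesses fall below c: reject the 'random data' hypothesis when a
    new gamma_1 is more negative than c."""
    s = sorted(skewness_samples)
    k = max(0, int(alpha * len(s)) - 1)
    return s[k]

# Hypothetical calibration over 1000 "random data set" skewnesses.
rng = random.Random(1)
calib = [rng.gauss(-0.1, 0.05) for _ in range(1000)]
c = critical_value(calib, alpha=0.01)
# A markedly more negative skewness, say -0.5, would then be rejected
# as incompatible with random data, since -0.5 < c.
```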

Our work in Article IV extends the skewness test to data sets with up to twelve taxa. This can reasonably be viewed as the limit of what is possible with current theoretical knowledge and computing hardware. Namely, for n = 13, there are |U(13)| ≈ 1.37 · 10^10 possible phylogenies, so even if we had 1000 computing cores available and each of them required just one millisecond to compute the tree length of a single phylogeny, it would take us 159 days to compute the tree length distributions for 1000 random data sets. Although one might consider a naïve tree sampling strategy, this would probably not be useful due to the long tails exhibited by the distributions. Therefore, new theoretical results would be needed in order to extend the scope of the skewness test to more than twelve taxa.
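This back-of-the-envelope calculation can be verified directly, using the standard fact that the number of unrooted binary trees on n labeled taxa is (2n − 5)!!:

```python
# |U(n)| = (2n - 5)!! for unrooted binary trees on n labeled taxa.
def num_unrooted_trees(n):
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

u13 = num_unrooted_trees(13)           # 13_749_310_575, i.e. about 1.37e10
ms_per_tree, cores, data_sets = 1, 1000, 1000
seconds = u13 * ms_per_tree * data_sets / cores / 1000  # /1000: ms -> s
days = seconds / 86_400
# days evaluates to roughly 159
```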

In the article, we provide critical values for the 95% and 99% confidence limits. The values are seen to be quite similar to the ones provided by Hillis [25] for n ≤ 8, though not exactly the same; the differences arise from the method for generating random data and the number of characters used. We also evaluated the skewness test on the simulated phylogenetic data sets described in Section 4.4.1 and found that it was able to reject the null hypothesis of the data being random much more often than MP was able to recover the true phylogeny. This is not a surprising result per se, because reconstruction is a much more difficult task, but it shows that the tree length distribution contains useful information beyond the identities of the maximally parsimonious trees, and that the skewness test is able to capture some of that information.

Chapter 5 Conclusions

In this thesis, I have given an introduction to the topic of model selection in machine learning, described a broadly (though not universally) applicable framework for describing fundamental concepts such as consistency, and discussed a number of specialized model selection methods tailored for solving the subset selection problem in linear regression and the phylogenetic reconstruction problem in computational biology. Interspersed within the chapters, I have also summarized the main results of the four original articles by me and my coauthors.

In linear regression, I concentrated primarily on methods based on sequential prediction. New results presented in the original articles include the consistency of the Sequentially Normalized Least Squares (SNLS) method and the introduction of a new consistent method that can be seen as a hybrid of SNLS and the classic Predictive Least Squares (PLS) method. In addition to the asymptotic results, I demonstrated how the various methods fare when the number of data points available is small, and I presented an application of linear regression model selection to the field of image processing.

As for phylogenetic reconstruction, I summarized the major asymptotic results concerning the Maximum Parsimony (MP) method for selecting an evolutionary model that describes the ancestral relationships of a given phylogenetic data set. I provided additional combinatorial arguments for why the results given by the method should be treated with some suspicion, and presented empirical results highlighting the boundaries that determine whether the resultant reconstructions are plausible or not.

While the results of this thesis provide answers to many questions concerning model selection in linear regression and phylogenetics, it is clear that much is still left unanswered and there is plenty of room for improvement, both in the methods themselves and in our understanding of their properties in various scenarios. This is, of course, as it should be: science is never finished, and answers to questions induce new questions. Thus it is natural that I conclude by listing questions that I would like to see answered in the future.

For the SNLS criterion, is Assumption 3.3 regarding fourfold products necessary? Does it hold automatically for polynomial regression or other restricted classes of design matrices?

Can the rate of convergence in Theorem 3.5 be improved by introducing additional assumptions? For instance, one might try setting a fixed lower bound for the absolute values of non-zero coefficients.

What happens if the PLS/SNLS hybrid method is wrapped in an optimization problem over the scale parameter λ²? The new method is unlikely to have a closed-form solution, but is the optimization problem convex? Is the resulting method consistent?

For maximum parsimony, what are the distributional properties of the tree-length distribution under the N_r model or some other phylogenetic model? Can the relationship between the number of characters and the probability of success be quantified analytically?

Is it feasible to extend the skewness test to thirteen or more taxa by sampling from the tree-length distribution? Should other methods of calibration be considered?

Moreover, some of the methods used in the thesis might be extended to situations beyond those discussed so far. For instance, the denoising algorithm proposed in Article III could be generalized to use context-dependent windows, and some of the computational methods used in Article IV for analyzing the maximum parsimony method could be applied to other reconstruction methods as well.

References

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267–281. Akadémiai Kiadó, Budapest, 1973.

[2] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automat. Control, 19(6):716–723, 1974. doi: 10.1109/TAC.1974.1100705.

[3] H. Akaike. A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math., 30(Part A):9–14, 1978. doi: 10.1007/BF02480194.

[4] D. J. Aldous. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Statist. Sci., 16(1):23–34, 2001. doi: 10.1214/ss/998929474.

[5] S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statist. Surv., 4:40–79, 2010. doi: 10.1214/09-SS054.

[6] Y. Berra and D. Kaplan. When you come to a fork in the road, take it!: Inspiration and wisdom from one of baseball's greatest heroes. Hachette Books, New York, NY, USA, 2001.

[7] G. E. P. Box. Robustness in the strategy of scientific model building. In R. B. Launer and G. N. Wilkinson, editors, Robustness in statistics. Academic Press, 1979.

[8] L. L. Cavalli-Sforza and A. W. F. Edwards. Phylogenetic analysis. Models and estimation procedures. Am. J. Hum. Genet., 19(3 Pt 1):233–257, 1967.

[9] G. J. Chaitin. On the simplicity and speed of programs for computing infinite sets of natural numbers. J. ACM, 16(3):407–422, 1969. doi: 10.1145/321526.321530.

[10] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process., 16(8):2080–2095, 2007. doi: 10.1109/TIP.2007.901238.

[11] A. P. Dawid. Statistical theory: The prequential approach. J. Roy. Statist. Soc. Ser. A, 147(2):278–292, 1984. doi: 10.2307/2981683.

[12] A. W. F. Edwards and L. L. Cavalli-Sforza. The reconstruction of evolution. Heredity, 18:553, 1963. doi: 10.1038/hdy.1963.66.

[13] P. L. Erdős, M. A. Steel, L. A. Székely, and T. J. Warnow. Constructing big trees from short sequences. In P. Degano, R. Gorrieri, and A. Marchetti-Spaccamela, editors, Automata, Languages and Programming, volume 1256 of Lecture Notes in Computer Science, pages 827–837. Springer, Berlin, Germany, 1997. doi: 10.1007/3-540-63165-8_235.

[14] J. S. Farris. Parsimony and explanatory power. Cladistics, 24(5):825–847, 2008. doi: 10.1111/j.1096-0031.2008.00214.x.

[15] W. Feller. An introduction to probability theory and its applications. John Wiley & Sons, New York, NY, USA, 3rd edition, 1968.

[16] J. Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Biol., 27(4):401–410, 1978. doi: 10.1093/sysbio/27.4.401.

[17] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Sunderland, MA, USA, 2004.

[18] W. M. Fitch. Toward defining the course of evolution: Minimum change for a specific tree topology. Systematic Zoology, 20(4):406–416, 1971. doi: 10.2307/2412116.

[19] L. R. Foulds and R. L. Graham. The Steiner problem in phylogeny is NP-complete. Adv. in Appl. Math., 3(1):43–49, 1982. doi: 10.1016/S0196-8858(82)80004-3.

[20] A. Graybeal. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol., 47(1):9–17, 1998. doi: 10.1080/106351598260996.

[21] P. D. Grünwald. The minimum description length principle. The MIT Press, Cambridge, MA, USA, 2007.

[22] I. Guyon, A. Saffari, G. Dror, and G. C. Cawley. Model selection: Beyond the Bayesian/frequentist divide. J. Mach. Learn. Res., 11:61–87, 2010.

[23] E. F. Harding. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. in Appl. Probab., 3(1):44–77, 1971. doi: 10.2307/1426329.

[24] M. D. Hendy and D. Penny. Branch and bound algorithms to determine minimal evolutionary trees. Math. Biosci., 59(2):277–290, 1982. doi: 10.1016/0025-5564(82)90027-X.

[25] D. M. Hillis. Discriminating between phylogenetic signal and random noise in DNA sequences. In M. M. Miyamoto and J. Cracraft, editors, Phylogenetic Analysis of DNA Sequences, chapter 13, pages 278–294. Oxford University Press, New York, NY, USA, 1991.

[26] D. M. Hillis. Inferring complex phylogenies. Nature, 383:130–131, 1996. doi: 10.1038/383130a0.

[27] J. P. Huelsenbeck. Tree-length distribution skewness: An indicator of phylogenetic information. Systematic Zoology, 40(3):257–270, 1991. doi: 10.1093/sysbio/40.3.257.

[28] J. P. Huelsenbeck, M. E. Alfaro, and M. A. Suchard. Biologically inspired phylogenetic models strongly outperform the no common mechanism model. Syst. Biol., 60(2):225–232, 2011. doi: 10.1093/sysbio/syq089.

[29] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In M. N. Munro, editor, Mammalian protein metabolism, volume III, pages 21–132. Academic Press, New York, 1969.

[30] R. E. Kass and A. E. Raftery. Bayes factors. J. Amer. Statist. Assoc., 90(430):773–795, 1995. doi: 10.1080/01621459.1995.10476572.

[31] J. Kim. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol., 45(3):363–374, 1996. doi: 10.2307/2413570.

[32] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Probl. Inf. Transm., 1(1):3–11, 1965.

[33] K. L. Lange, R. J. A. Little, and J. M. G. Taylor. Robust statistical modeling using the t distribution. J. Amer. Statist. Assoc., 84(408):881–896, 1989. doi: 10.2307/2290063.

[34] E. E. Leamer. Specification searches: Ad hoc inference with nonexperimental data. Wiley, New York, NY, USA, 1978.

[35] P. Lemey, M. Salemi, and A.-M. Vandamme, editors. The Phylogenetic Handbook. Cambridge University Press, Cambridge, UK, 2nd edition, 2009.

[36] A. D. R. McQuarrie and C.-L. Tsai. Regression and time series model selection. World Scientific, Singapore, 1998.

[37] J. Määttä and T. Roos. Robust sequential prediction in linear regression with Student's t-distribution. In Proc. 14th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2016), Fort Lauderdale, FL, USA, Jan 2016.

[38] J. Määttä and T. Roos. Maximum parsimony and the skewness test: A simulation study of the limits of applicability. PLoS ONE, 11(4):1–21, 2016. doi: 10.1371/journal.pone.0152656.

[39] J. Määttä, S. Siltanen, and T. Roos. A fixed-point image denoising algorithm with automatic window selection. In Proc. 5th European Workshop on Visual Information Processing (EUVIP 2014), Paris, France, Dec 2014. doi: 10.1109/EUVIP.2014.7018393.

[40] J. Määttä, D. F. Schmidt, and T. Roos. Subset selection in linear regression using sequentially normalized least squares: Asymptotic theory. Scand. J. Stat., 2015. doi: 10.1111/sjos.12181. In press.

[41] P. A. Murtaugh. In defense of P values. Ecology, 95(3):611–617, 2014. doi: 10.1890/13-0590.1.

[42] J. Neyman. Molecular studies of evolution: A source of novel statistical problems. In S. S. Gupta and J. Yackel, editors, Statistical Decision Theory and Related Topics, pages 1–27. Academic Press, New York, NY, USA, 1971.

[43] R. J. O'Hara and P. M. W. Robinson. Computer-assisted methods of stemmatic analysis. Occasional Papers of the Canterbury Tales Project, 1:53–74, 1993.

[44] J. Piironen and A. Vehtari. Comparison of Bayesian predictive methods for model selection. ArXiv e-prints, 2015. URL http://arxiv.org/abs/1503.08650.

[45] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978. doi: 10.1016/0005-1098(78)90005-5.

[46] J. Rissanen. A predictive least-squares principle. IMA J. Math. Control Inform., 3(2–3):211–222, 1986. doi: 10.1093/imamci/3.2-3.211.

[47] J. Rissanen, T. Roos, and P. Myllymäki. Model selection by sequentially normalized least squares. J. Multivariate Anal., 101(4):839–849, 2010. doi: 10.1016/j.jmva.2009.12.009.

[48] T. Roos and Y. Zou. Analysis of textual variation by latent tree structures. In 11th IEEE International Conference on Data Mining (ICDM), pages 567–576, Dec 2011. doi: 10.1109/ICDM.2011.24.

[49] N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4(4):406–425, 1987.

[50] G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.

[51] G. A. F. Seber and A. J. Lee. Linear regression analysis. Wiley, Hoboken, NJ, USA, 2nd edition, 2003.

[52] Y. M. Shtar'kov. Universal sequential coding of individual messages. Problemy Peredachi Informatsii, 23(3):3–17, 1987.

[53] R. J. Solomonoff. A preliminary report on a general theory of inductive inference. Report V-131, Zator Company, Cambridge, MA, 1960.

[54] M. Steel. Sufficient conditions for two tree reconstruction techniques to succeed on sufficiently long sequences. SIAM J. Discrete Math., 14(1):36–48, 2001. doi: 10.1137/S0895480198343571.

[55] M. Steel. Some statistical aspects of the maximum parsimony method. In R. DeSalle, W. Wheeler, and G. Giribet, editors, Molecular Systematics and Evolution: Theory and Practice, volume 92 of Experientia Supplementum, pages 125–139. Birkhäuser Basel, Switzerland, 2002. doi: 10.1007/978-3-0348-8114-2_9.

[56] M. Steel and D. Penny. Maximum parsimony and the phylogenetic information in multistate characters. In V. A. Albert, editor, Parsimony, Phylogeny, and Genomics, chapter 9, pages 163–178. Oxford University Press, Oxford, UK, 2006.

[57] S. M. Stigler. Gauss and the invention of least squares. Ann. Statist., 9(3):465–474, 1981. doi: 10.1214/aos/1176345451.

[58] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. R. Stat. Soc. Ser. B. Stat. Methodol., 39(1):44–47, 1977.

[59] C. Tuffley and M. Steel. Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bull. Math. Biol., 59(3):581–607, 1997. doi: 10.1007/BF02459467.

[60] A. Vehtari and J. Ojanen. A survey of Bayesian predictive methods for model assessment, selection and comparison. Statist. Surv., 6:142–228, 2012. doi: 10.1214/12-SS102.

[61] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861.

[62] C. Z. Wei. On predictive least squares principles. Ann. Statist., 20(1):1–42, 1992. doi: 10.1214/aos/1176348511.

[63] E. Wit, E. van den Heuvel, and J.-W. Romeijn. 'All models are wrong...': An introduction to model uncertainty. Stat. Neerl., 66(3):217–236, 2012. doi: 10.1111/j.1467-9574.2012.00530.x.

[64] S. Zhu and M. Steel. Does random tree puzzle produce Yule–Harding trees in the many-taxon limit? Math. Biosci., 243(1):109–116, 2013. doi: 10.1016/j.mbs.2013.02.003.