
Normalized compression distance is a universal similarity metric. This means that it can be used to cluster and classify many different kinds of measurement data without any knowledge of the feature-based similarity of the samples. However, normalized compression distance is defined in terms of a purely theoretical construct called the normal compressor, which does not exist in practice, so general-purpose compressor programs are substituted for it in actual use. This substitution introduces problems because real compressors have implementation features, such as a finite block size, that hinder the computation of normalized compression distance for certain kinds of inputs.
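For reference, the normalized compression distance of two strings x and y is computed from compressed lengths as NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C(s) denotes the length of s after compression and xy is the concatenation of x and y. A minimal sketch of this computation using Python's standard-library compressors follows; it is an illustration, not the exact code used in this work, and PPM is omitted because the standard library has no PPM implementation.

import bz2
import lzma
import zlib

def ncd(x: bytes, y: bytes, compress) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

compressors = {
    "gzip (DEFLATE)": lambda d: zlib.compress(d, 9),
    "bzip2": lambda d: bz2.compress(d, 9),
    "LZMA": lambda d: lzma.compress(d),
}

x = b"ACGT" * 2000
y = x.replace(b"G", b"C")  # a crudely "mutated" copy for demonstration
for name, c in compressors.items():
    print(f"{name}: NCD = {ncd(x, y, c):.3f}")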

The purpose of this work was to create a cohesive framework that can be used to evaluate different compressor programs when computing the normalized compression distance.

To facilitate this, the framework contains two different data generation models with tunable parameters. The first one generates a random string of a given length, and the normalized compression distance is then calculated against a mutated copy of the original string. In this model it is possible to adjust the length of the input string and the mutation probability within the mutated copy of the string. The second model contains a deterministic Markov chain whose consecutive states are encoded as a string. A second string is generated for comparison so that at each step the chain may get mutated to some other random state with a given probability. In this case, the length parameter controls the length of the Markov chain. Together these models also make it possible to indirectly observe how well the compressors fulfill the first and second properties of a normal compressor, namely idempotency and monotonicity.
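As an illustration, the two models could be sketched as follows. The function and parameter names, the alphabet, and the single-character state encoding are assumptions made for the sake of the example, not the exact setup of this work.

import random

ALPHABET = "ACGT"  # assumed symbol set; any finite alphabet would do

def random_string_pair(length, p_mut, rng):
    # Model 1: a uniformly random string and a copy in which each
    # symbol is independently replaced with probability p_mut.
    original = [rng.choice(ALPHABET) for _ in range(length)]
    mutated = [rng.choice(ALPHABET) if rng.random() < p_mut else s
               for s in original]
    return "".join(original), "".join(mutated)

def markov_pair(length, p_mut, n_states, rng):
    # Model 2: a deterministic chain (a fixed next-state table) whose
    # states are encoded as a string, plus a copy that at each step may
    # jump to a random state with probability p_mut. The one-digit
    # state encoding assumes n_states <= 10.
    nxt = [rng.randrange(n_states) for _ in range(n_states)]
    s = t = 0
    clean, noisy = [], []
    for _ in range(length):
        s = nxt[s]
        t = rng.randrange(n_states) if rng.random() < p_mut else nxt[t]
        clean.append(str(s))
        noisy.append(str(t))
    return "".join(clean), "".join(noisy)

rng = random.Random(42)
x, y = random_string_pair(10_000, 0.05, rng)
a, b = markov_pair(10_000, 0.05, 8, rng)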

This framework was used to evaluate the performance of the gzip, bzip2, PPM, and LZMA compressors with different values of the mutation probability and input length. This was achieved by running the individual computations on a Techila Grid Engine system in use at Tampere University of Technology. The results clearly show that gzip can effectively compute the normalized compression distance only for input lengths of less than 32 kilobytes, which is due to gzip's internal look-ahead buffer. The bzip2 compressor has a similar problem in relation to its block size, which is 900 kilobytes by default. PPM and LZMA, on the other hand, seem to be able to handle any input lengths within the tested range. However, PPM was the slowest of the tested compressors, whereas LZMA proved considerably faster than PPM. In addition, LZMA provided the widest usable range of mutation probabilities, and its NCD values were the most linear in this range.

Therefore, LZMA should be the default choice of compressor when computing the normalized compression distance without any knowledge of the inputs' feature-based similarities.
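The gzip limitation is straightforward to reproduce. For a normal compressor, idempotency implies NCD(x, x) ≈ 0 for any string x, but with a DEFLATE-based compressor the value climbs towards 1 once the input no longer fits in the 32-kilobyte sliding window. A minimal sketch, using zlib as a stand-in for gzip since both implement DEFLATE:

import os
import zlib

def ncd(x, y, compress):
    cx, cy, cxy = len(compress(x)), len(compress(y)), len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def deflate(d):
    return zlib.compress(d, 9)  # DEFLATE with its 32 KiB sliding window

for n in (8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024):
    x = os.urandom(n)  # incompressible input makes the effect stark
    print(f"{n // 1024:4d} KiB: NCD(x, x) = {ncd(x, x, deflate):.3f}")

Below the window size, the second copy of x in the concatenation xx is found as a back-reference and NCD(x, x) stays near zero; above it, the match distance exceeds what DEFLATE can encode, the second copy compresses as if it were fresh data, and the value is pushed towards 1.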

A further extension of this research would be to extend the evaluation framework to cover the other two properties of a normal compressor, symmetry and distributivity. In addition, it appears that no research has been performed to date on the use of lossy compressors for computing the normalized compression distance. This might yield interesting new results about when to use a lossy approach and when to use a lossless one.
