
[Figure 21 plot: recall rate (y-axis) versus Dz (x-axis), titled "Effect of dimensionality reduction".]

Figure 21: The effect of dimensionality reduction (Dz) on recall rate with the 544-dimensional shape data set. Only one curve used (Nc = 1).

Footnote 5: http://cs.uef.fi/sipu/datasets/

[Figure 22 plot: recall rate (y-axis) versus Dz (x-axis).]

Figure 22: The effect of dimensionality reduction (Dz) on recall rate with the 544-dimensional shape data set. Seven curves used (Nc = 7).

Figures 21 and 22 show the effect of dimensionality reduction on the quality of results in the ZNN algorithm. Dz ranges from 2 to 32 because the z-order curve is defined for a minimum of two dimensions and 32 was the maximum value allowed by our implementation.

A window width of W = 100 was used in this experiment. In Figure 21 only one curve was used (Nc = 1) and in Figure 22 seven curves (Nc = 7). The experiment was repeated 50 times for each Dz while the other parameters remained constant.

It can be seen from the results that the recall rate decreases rapidly when Dz < 10, so a sufficiently large Dz is needed to provide good results. However, when Dz > 10, there are only small differences in the quality of results. Additionally, as Dz grows, the program execution time also grows. We therefore suspect that there is an optimal value for Dz smaller than D, but more research is needed to determine this.

It can also be seen from Figures 21 and 22 that there is a significant amount of variation in the quality of results between program runs. The variation originates from three kinds of randomization in the algorithm: (1) random shifting of points, (2) random division into subvectors in dimensionality reduction and (3) random ordering of dimensions in bit interleaving.
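To make these sources of variation concrete, the following minimal C++ sketch generates one randomized z-order projection. It is illustrative only, not the thesis implementation: the names RandomizedProjection, zValue and bitsPerDim are invented here, and coordinates are assumed to be unsigned integers that still fit in bitsPerDim bits after shifting. The sketch covers randomizations (1) and (3); randomization (2) is part of the dimensionality reduction step and is sketched separately in Section 7.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// One randomized z-order projection: a random shift vector (1) and a random
// permutation of the dimensions (3), followed by naive bit interleaving.
struct RandomizedProjection {
    std::vector<uint32_t> shift; // (1) random shift added to every point
    std::vector<size_t> perm;    // (3) random ordering of dimensions

    RandomizedProjection(size_t dims, std::mt19937& rng) : shift(dims), perm(dims) {
        std::uniform_int_distribution<uint32_t> dist(0, (1u << 16) - 1);
        for (auto& s : shift) s = dist(rng);
        std::iota(perm.begin(), perm.end(), size_t{0});
        std::shuffle(perm.begin(), perm.end(), rng);
    }

    // z-value of a point: interleave the bits of the shifted coordinates,
    // visiting dimensions in permuted order. Only the 64 most significant
    // interleaved bits are kept, which suffices for ordering in this sketch.
    uint64_t zValue(const std::vector<uint32_t>& p, int bitsPerDim) const {
        uint64_t z = 0;
        int out = 0;
        for (int b = bitsPerDim - 1; b >= 0 && out < 64; --b)
            for (size_t d = 0; d < perm.size() && out < 64; ++d, ++out) {
                uint32_t c = p[perm[d]] + shift[perm[d]];
                z = (z << 1) | ((c >> b) & 1u);
            }
        return z;
    }
};

Running several such projections with different random seeds yields the different orderings whose quality varies between runs, as seen in the figures.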

7 Discussion

Constructing the kNN graph has many important practical applications in a wide variety of fields. Many good solutions have been developed for this problem. Typical fast solutions like k-d trees work well in the simple cases where the number of dimensions is relatively low (D < 10). For higher numbers of dimensions these methods begin to fail and provide no significant improvement over the trivial brute force method.

Many kNN-graph construction methods have been developed specifically for higher numbers of dimensions, but they can still be slow or inaccurate, and their results vary unpredictably depending on the data set.

We introduced a new method called the ZNP algorithm, which uses space filling curves and neighborhood propagation. The performance of this algorithm was compared against two other leading fast kNN-graph construction methods using three different medium to high dimensional data sets. The results show that with k ∈ {10, 20} our method outperforms the existing methods on all of the tested data sets, providing moderate (27–46%) speed improvements for the same quality. With k ∈ {1, 5} our method also performed well, although no single method outperformed the others.

Our method provides speed improvements in kNN graph construction for medium to high dimensional data sets, but the speed still remains too slow for several practical applications, especially for high dimensional (D > 500) data. Therefore, it remains an open question whether an efficient solution exists for data sets where the number of intrinsic dimensions is very high. We believe that the current method based on the z-order curve is not yet at its full potential and that further research could provide significant improvement.

Future research

Several kNN-graph construction methods use a one dimensional proximity preserving projection. Our method and [7] do this using space filling curves. Zhang et al. [8] use vectors of locality sensitive hash values which are projected onto random lines. In all of these methods, one projection can preserve the proximity well for some points but not for others. Therefore, multiple different projections are often used. There are major differences between these algorithms, but they all face a common problem: how to generate a set of one dimensional projections that best preserve the proximity of points.

In all of the surveyed algorithms some kind of randomization is used to produce different proximity preserving projections. In our method, randomization was used because it is an easy way to generate variation for the z-order curve. However, it is not an optimal way, and some deterministic way of constructing the one dimensional mappings might be more efficient. Figure 21 shows that there is a large variance in the quality of results depending on which random projection is used. This suggests that it would be possible to improve the quality of the results if better projections could be chosen in a computationally efficient way.

We used a combination of the z-order curve and a simple dimensionality reduction technique to create a set of proximity preserving one dimensional mappings. The dimensionality reduction technique we used did not take into account properties of the data such as the data distribution. It might be possible to improve our method by using a more advanced dimensionality reduction technique such as principal component analysis (PCA) that adapts to the internal structure of the data. The problem with PCA, however, is that it is computationally expensive. To the best of our knowledge the fastest exact method for PCA takes O(D²h + D²N) time, where h is the number of target dimensions [57]. So doing a full principal component analysis would likely increase the computation time more than it would save. For our purposes, however, the accuracy of PCA is not very important. The axes chosen for projection need only provide better proximity preserving qualities than our current method, which projects subvectors onto the diagonal axis. Therefore some approximation method might be a good solution. For example, a trivial sampling technique for the calculation of principal axes might work.
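As a reference point for such improvements, the following C++ sketch shows the simple reduction described above under stated assumptions: the D dimensions are divided at random into Dz subvectors (randomization (2) from the previous section) and each subvector is projected onto its diagonal axis, which amounts to summing its components. The function name and the modulo bucketing scheme are illustrative, not the thesis code.

#include <cstddef>
#include <vector>

// Reduce a D-dimensional point to Dz dimensions: perm is a random
// permutation of 0..D-1 defining the random division into subvectors, and
// each output coordinate is the sum of one subvector's components, i.e. its
// projection onto the diagonal axis (up to a constant factor).
std::vector<float> reduceToDiagonal(const std::vector<float>& point,
                                    const std::vector<std::size_t>& perm,
                                    std::size_t Dz) {
    std::vector<float> reduced(Dz, 0.0f);
    for (std::size_t i = 0; i < perm.size(); ++i)
        reduced[i % Dz] += point[perm[i]];
    return reduced;
}

A PCA-style variant would replace the implicit all-ones (diagonal) direction with an estimated principal axis per subvector, for example computed from a small random sample of the data.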

To increase the difference between multiple proximity preserving one dimensional projections, we used random shifting of points and random permutation of dimensions. Rotations have also been used successfully in a k-d tree based method [36] for a similar purpose. Rotation could also be used with space filling curves, but it is unclear whether it would bring enough benefit to justify the O(D) → O(D·log(D)) per-point increase in time complexity.
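The O(D·log(D)) figure can be made concrete with a standard construction that is not from the thesis: a random diagonal sign flip followed by a fast Walsh-Hadamard transform, as used in fast Johnson-Lindenstrauss style methods. A minimal C++ sketch, assuming D is a power of two (pad with zeros otherwise) and that signs holds random ±1 values:

#include <cmath>
#include <cstddef>
#include <vector>

// Pseudo-random rotation in O(D log D): multiply by a random ±1 diagonal,
// then apply the in-place fast Walsh-Hadamard transform and rescale so the
// overall map is orthonormal.
void fastRotate(std::vector<float>& x, const std::vector<int>& signs) {
    const std::size_t D = x.size();
    for (std::size_t i = 0; i < D; ++i) x[i] *= signs[i];
    for (std::size_t len = 1; len < D; len <<= 1)
        for (std::size_t i = 0; i < D; i += len << 1)
            for (std::size_t j = i; j < i + len; ++j) {
                const float a = x[j], b = x[j + len];
                x[j] = a + b;
                x[j + len] = a - b;
            }
    const float scale = 1.0f / std::sqrt(static_cast<float>(D));
    for (auto& v : x) v *= scale;
}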

Our algorithm ordered the data points by calculating and storing a z-value for each point and then sorting the points according to the z-values. Another approach, used in [7], is to calculate the z-values only when points are compared by the sorting algorithm. This results in O(N·log(N)) z-value calculations instead of the O(N) in our method. The benefit is that the memory requirements are reduced by O(ND) bytes because the values are not stored. In addition, their method does not calculate the full z-values. Instead, for each compared pair of points, it iteratively calculates the z-values from the most significant bit towards the least significant bit until a difference between the points is found. In their experiments, this method was faster than calculating the full z-values. Therefore it might be possible to use this method to provide a further speedup to our algorithm. However, their method was tested only with low dimensional (D ≤ 3) data sets and it is not clear whether it would give the same speed benefits for high dimensional data sets.
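The comparison-time idea of [7] rests on the observation that the z-order of two points is decided by the single most significant interleaved bit in which they differ, and that bit lives in the dimension whose coordinates have the highest differing bit. A minimal C++ sketch of such a comparator for unsigned integer coordinates (the shift and permutation are ignored for clarity; zLess is an illustrative name):

#include <cstddef>
#include <cstdint>
#include <vector>

// Compare two points in z-order without computing z-values: find the
// dimension where the coordinates differ in the highest bit position, using
// the classic msb test msb(a) < msb(b) iff a < b && a < (a ^ b), then order
// by that dimension's coordinates.
bool zLess(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b) {
    std::size_t dim = 0;
    uint32_t maxXor = 0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        const uint32_t x = a[d] ^ b[d];
        if (maxXor < x && maxXor < (x ^ maxXor)) { maxXor = x; dim = d; }
    }
    return a[dim] < b[dim];
}

Sorting with std::sort(points.begin(), points.end(), zLess) then orders the points along the curve while never materializing or storing the z-values themselves.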

The recall rate we used to measure the quality of results has been used in many studies of the same subject [11, 12, 8]. However, there are other quality factors that could be considered. For example, for some applications it may be more important that the first nearest neighbor in the graph is correct than the kth nearest neighbor. The recall rate does not reflect this, as it depends only on the total number of errors.
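For reference, the recall rate as used here is the fraction of true k-nearest-neighbor edges that the approximate graph recovers. A minimal C++ sketch, assuming both graphs are stored as per-point neighbor id lists of length k (names are illustrative):

#include <cstddef>
#include <unordered_set>
#include <vector>

// recall rate = (approximate neighbors that are true neighbors) / (k * N)
double recallRate(const std::vector<std::vector<int>>& approx,
                  const std::vector<std::vector<int>>& exact) {
    std::size_t hits = 0, total = 0;
    for (std::size_t i = 0; i < exact.size(); ++i) {
        const std::unordered_set<int> truth(exact[i].begin(), exact[i].end());
        for (int id : approx[i]) hits += truth.count(id);
        total += exact[i].size();
    }
    return total ? static_cast<double>(hits) / total : 0.0;
}

A rank-weighted variant that penalizes errors at small n more heavily would capture the first-neighbor concern mentioned above.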

There is some evidence in our experiments that the errors in the results produced by our ZNN algorithm are more likely to occur at small values of n for the nth nearest neighbor (n ∈ [1, k]) than in the NNDES algorithm. In the NNDES algorithm, the errors were more likely to occur for large n. However, our experiments on this were not extensive and more research should be done to find conclusive evidence.

8 Conclusion

In this thesis we have analyzed different methods for high dimensional kNN graph construction, and introduced a new method that solves this problem using a combination of space filling curves and neighborhood propagation. We compared our method with two other methods, Nearest Neighbor Descent [12] and Recursive Lanczos Bisection [11], using 14- to 544-dimensional data sets and the Euclidean distance metric.

Experiments showed that our method is, in most cases, faster than the other tested methods. It performs well especially for large values of k and for the 14-dimensional corel data set consisting of 662,317 points. In this case, it is 46% faster than the second best method. Our method also performs well with higher dimensional data sets: with the 544-dimensional shape data set it was 36% faster than the second best method.

However, because their implementations were not available, we were unable to compare our method against two other methods [13, 8] which have reported similar performance increases in comparison to Nearest Neighbor Descent, which in our experiments was the second best method. One limitation of our method is that it has so far been confirmed to work only with the Euclidean distance metric, while the other methods work with a larger variety of distance functions [12, 13, 8]. Therefore the performance of our method in comparison to other fast methods still remains somewhat unclear. We also believe that our current solution is not optimal and could be further improved by the ideas discussed in Section 7.

9 References

[1] X. Zhu, J. Lafferty, and R. Rosenfeld, Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science, 2005.

[2] P. Fränti, O. Virmajoki, and V. Hautamäki, “Fast agglomerative clustering using a k-nearest neighbor graph,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1875–1881, 2006.

[3] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, “Fast approximate nearest-neighbor search with k-nearest neighbor graph,” in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1312, 2011.

[4] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.

[5] J. Philbin, J. Sivic, and A. Zisserman, “Geometric latent dirichlet allocation on a matching graph for large-scale image datasets,” International Journal of Computer Vision, vol. 95, no. 2, pp. 138–153, 2011.

[6] V. Hautamäki, I. Kärkkäinen, and P. Fränti, “Outlier Detection Using k-Nearest Neighbour Graph,” in ICPR (3), pp. 430–433, 2004.

[7] M. Connor and P. Kumar, “Fast construction of k-nearest neighbor graphs for point clouds,” IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 4, pp. 599–608, 2010.

[8] Y.-M. Zhang, K. Huang, G. Geng, and C.-L. Liu, “Fast kNN Graph Construction with Locality Sensitive Hashing,” in Machine Learning and Knowledge Discovery in Databases, pp. 660–674, Springer, 2013.

[9] V. Garcia, E. Debreuve, and M. Barlaud, “Fast k nearest neighbor search using GPU,” in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–6, IEEE, 2008.

[10] J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An Algorithm for Finding Best Matches in Logarithmic Expected Time,” ACM Trans. Math. Softw., vol. 3, pp. 209–226, Sept. 1977.

[11] J. Chen, H.-r. Fang, and Y. Saad, “Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection,” The Journal of Machine Learning Research, vol. 10, pp. 1989–2012, 2009.

[12] W. Dong, C. Moses, and K. Li, “Efficient k-nearest neighbor graph construction for generic similarity measures,” in Proceedings of the 20th International Conference on World Wide Web, pp. 577–586, ACM, 2011.

[13] J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li, “Scalable k-NN graph construction for visual descriptors,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1106–1113, IEEE, 2012.

[14] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[15] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When Is "Nearest Neighbor" Meaningful?,” in Int. Conf. on Database Theory, pp. 217–235, 1999.

[16] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the Surprising Behavior of Distance Metrics in High Dimensional Spaces,” in Proceedings of the 8th International Conference on Database Theory, ICDT ’01, (London, UK), pp. 422–434, Springer-Verlag, 2001.

[17] D. Dobkin and R. J. Lipton, “Multidimensional searching problems,” SIAM Journal on Computing, vol. 5, no. 2, pp. 181–186, 1976.

[18] M. I. Shamos, Computational geometry. PhD thesis, Yale University, 1978.

[19] P. M. Vaidya, “An O(n log n) algorithm for the all-nearest-neighbors problem,” Discrete & Computational Geometry, vol. 4, no. 1, pp. 101–115, 1989.

[20] E. Pielou, “A single mechanism to account for regular, random and aggregated populations,” The Journal of Ecology, pp. 575–584, 1960.

[21] J. McNames, “A fast nearest-neighbor algorithm based on a principal axis search tree,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 964–976, 2001.

[22] T. Seidl and H.-P. Kriegel, “Optimal multi-step k-nearest neighbor search,” in ACM SIGMOD Record, vol. 27, pp. 154–165, ACM, 1998.

[23] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “iDistance: An Adaptive B+-tree Based Indexing Method for Nearest Neighbor Search,” ACM Trans. Database Syst., vol. 30, pp. 364–397, June 2005.

[24] B. Yao, F. Li, and P. Kumar, “K nearest neighbor queries and knn-joins in large relational databases (almost) for free,” in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pp. 4–15, IEEE, 2010.

[25] E. Chavez, K. Figueroa, and G. Navarro, “A fast algorithm for the all k nearest neighbors problem in general metric spaces,” 1997.

[26] L. A. Piegl and W. Tiller, “Algorithm for finding all k nearest neighbors,” Computer-Aided Design, vol. 34, no. 2, pp. 167–172, 2002.

[27] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” Computer, vol. 32, no. 8, pp. 68–75, 1999.

[28] O. Virmajoki and P. Fränti, “Divide-and-conquer algorithm for creating neighborhood graph for clustering,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 1, pp. 264–267, IEEE, 2004.

[29] R. Paredes and E. Chávez, “Using the k-nearest neighbor graph for proximity searching in metric spaces,” in String Processing and Information Retrieval, pp. 127–138, Springer, 2005.

[30] A. Hinneburg, C. C. Aggarwal, and D. A. Keim, “What is the nearest neighbor in high dimensional spaces?,” in Proc. of the 26th VLDB Conference, Cairo, Egypt, 2000.

[31] N. Kouiroukidis and G. Evangelidis, “The Effects of Dimensionality Curse in High Dimensional kNN Search,” in Informatics (PCI), 2011 15th Panhellenic Conference on, pp. 41–45, Sept 2011.

[32] P. Indyk and R. Motwani, “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality,” in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, (New York, NY, USA), pp. 604–613, ACM, 1998.

[33] R. F. Sproull, “Refinements to nearest-neighbor searching in k-dimensional trees,” Algorithmica, vol. 6, no. 1-6, pp. 579–589, 1991.

[34] S. Arya, D. M. Mount, and O. Narayan, “Accounting for boundary effects in nearest-neighbor searching,” Discrete & Computational Geometry, vol. 16, no. 2, pp. 155–176, 1996.

[35] C. Silpa-Anan and R. Hartley, “Optimised KD-trees for fast image descriptor matching,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8, IEEE, 2008.

[36] P. W. Jones, A. Osipov, and V. Rokhlin, “Randomized approximate nearest neighbors algorithm,” Proceedings of the National Academy of Sciences, vol. 108, no. 38, pp. 15679–15686, 2011.

[37] M. F. Mokbel and W. G. Aref, “Irregularity in high-dimensional space-filling curves,” Distributed and Parallel Databases, vol. 29, no. 3, pp. 217–238, 2011.

[38] H. Tropf and H. Herzog, “Multidimensional Range Search in Dynamically Balanced Trees,” Angewandte Informatik, no. 2, pp. 71–77, 1981.

[39] J. A. Orenstein and T. H. Merrett, “A class of data structures for associative searching,” in Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pp. 181–190, ACM, 1984.

[40] C. Faloutsos and S. Roseman, “Fractals for Secondary Key Retrieval,” in Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’89, (New York, NY, USA), pp. 247–252, ACM, 1989.

[41] N. Megiddo and U. Shaft, “Efficient nearest neighbor indexing based on a collection of space filling curves,” Tech. Rep., IBM Almaden Research Center, San Jose, CA, 1997.

[42] J. A. Shepherd, X. Zhu, and N. Megiddo, “Fast indexing method for multidimensional nearest-neighbor search,” in Electronic Imaging ’99, pp. 350–355, International Society for Optics and Photonics, 1998.

[43] S. Liao, M. Lopez, and S. Leutenegger, “High dimensional similarity search with space filling curves,” in Data Engineering, 2001. Proceedings. 17th International Conference on, pp. 615–622, 2001.

[44] R. Dafner, D. Cohen-Or, and Y. Matias, “Context-based Space Filling Curves,” in Computer Graphics Forum, vol. 19, pp. 209–218, Wiley Online Library, 2000.

[45] T. Bially, “Space-filling curves: Their generation and their application to bandwidth reduction,” IEEE Transactions on Information Theory, vol. 15, no. 6, pp. 658–664, 1969.

[46] I. Gargantini, “An Effective Way to Represent Quadtrees,” Commun. ACM, vol. 25, pp. 905–910, Dec. 1982.

[47] C. Faloutsos and Y. Rong, “DOT: A Spatial Access Method Using Fractals,” in Proceedings of the Seventh International Conference on Data Engineering, (Washington, DC, USA), pp. 152–159, IEEE Computer Society, 1991.

[48] H.-L. Chen and Y.-I. Chang, “All-nearest-neighbors finding based on the Hilbert curve,” Expert Systems with Applications, vol. 38, no. 6, 2011.

[49] M. Mokbel, W. Aref, and I. Kamel, “Analysis of Multi-Dimensional Space-Filling Curves,” GeoInformatica, vol. 7, no. 3, pp. 179–209, 2003.

[50] G. Peano, “Sur une courbe, qui remplit toute une aire plane,” Mathematische Annalen, vol. 36, no. 1, pp. 157–160, 1890.

[51] G. M. Morton, A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, 1966.

[52] S.-X. Li and M. H. Loew, “The quadcode and its arithmetic,” Communications of the ACM, vol. 30, no. 7, pp. 621–626, 1987.

[53] H.-L. Chen and Y.-I. Chang, “Neighbor-finding Based on Space-filling Curves,” Information Systems, vol. 30, pp. 205–226, May 2005.

[54] J. Lawder, The application of space-filling curves to the storage and retrieval of multi-dimensional data. PhD thesis, University of London, UK, 2000.

[55] R. Raman and D. S. Wise, “Converting to and from dilated integers,” IEEE Transactions on Computers, vol. 57, no. 4, pp. 567–573, 2008.

[56] S.-W. Ra and J. Kim, “A fast mean-distance-ordered partial codebook search algorithm for image vector quantization,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 9, pp. 576–579, 1993.

[57] A. Sharma and K. K. Paliwal, “Fast principal component analysis using fixed-point algorithm,” Pattern Recognition Letters, vol. 28, no. 10, pp. 1151–1155, 2007.

Appendix 1: Source code for z-value generation

lookup_table = new LONG*[dimensions]; // allocate one lookup table per dimension for the bit interleaving