
We tested the viability of the pure voting algorithm on the MNIST data set. The results are shown in Fig. 2. In the light of these results it appears that the pure voting algorithm is unable to attain much more than 70% accuracy. For example, using 500 trees yields 72% accuracy in 10-NN search. Other approximate methods can attain well over 90% accuracy with reasonable running time, and the comparisons between different methods discussed in Sec. 5.4 also suggest that pure voting is clearly less accurate than the other methods.

[Figure 2: accuracy on 10-NN (y-axis) against query time in ms (x-axis), one curve per leaf size n0 = 32, 128, 512, 2048, 4096.]

Figure 2: Accuracy of the pure voting method on the MNIST data set. Values of T of 100, 150, 250 and 500 were used. Linear search took 72.9 ms.

However, even though the accuracy is not very high, a more nuanced analysis than we have been able to undertake would be necessary to fully determine the potential of the pure voting method. The MNIST data set is neither ultrahigh-dimensional nor big, and it is under those conditions that the benefits of the pure voting method, such as the ability to discard the data set after tree construction, are expected to materialize fully. Also, in certain use cases, e.g. in recommendation systems, the accuracy delivered by pure voting may well suffice.

While pure voting did not reach high accuracy, a detailed analysis of the effect of random vector sparseness and of suitable parameter values is in order, because voting is used as a part of the mixture method.


Figure 3: The effect of random vector sparseness on latency time and accuracy. The tests were performed with the MNIST data set. A leaf size parameter n0 of 4096 was used for the accuracy estimation. The query time shown in the figure on the left does not contain the vote counting phase, which is unaffected by vector sparseness. The fact that the query time does not increase when the number of trees is increased from 50 to 100 using dense vectors is an anomaly.

The issue of random vector sparseness was analyzed with the MNIST data set. We used three settings for generating a random vector: dense vectors with standard normal components; sparse vectors with components −1 and 1, each with probability 1/6 (and 0 otherwise); and very sparse vectors with components −1 and 1, each with probability 1/(2√784), so that a component is nonzero with probability 1/√784 ≈ 3.57%, where 784 is the dimensionality of the MNIST data. The results are shown in Fig. 3.
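To make the three settings concrete, the following sketch generates one random vector of each type. It is a minimal illustration in Python with NumPy; the function name and interface are our own, not the implementation used in the experiments.

    import numpy as np

    def random_projection_vector(dim, kind="dense", rng=None):
        """Draw one random projection vector of length dim.

        kind="dense":       i.i.d. standard normal components.
        kind="sparse":      components -1, 0, 1 with probabilities 1/6, 2/3, 1/6.
        kind="very_sparse": components -1 and 1 each with probability
                            1/(2*sqrt(dim)), 0 otherwise; for MNIST dim = 784,
                            so a component is nonzero with probability ~3.57%.
        """
        rng = rng if rng is not None else np.random.default_rng()
        if kind == "dense":
            return rng.standard_normal(dim)
        if kind == "sparse":
            return rng.choice([-1.0, 0.0, 1.0], size=dim, p=[1 / 6, 2 / 3, 1 / 6])
        if kind == "very_sparse":
            a = 1.0 / np.sqrt(dim)  # probability of a nonzero component
            return rng.choice([-1.0, 0.0, 1.0], size=dim, p=[a / 2, 1.0 - a, a / 2])
        raise ValueError(f"unknown kind: {kind!r}")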

It is clear that, at least for these data, sparse vectors deliver equally good or better results in less time than dense vectors. In fact, in all test runs very sparse vectors gave the best accuracy. In this sense using very sparse vectors is an example of a trade-off of the most desirable kind: one in which nothing at all is traded for a significant gain. It is notable that the data used in the test runs is only 784-dimensional, and compared to dense vectors, very sparse vectors are expected to be faster the higher the dimensionality of the data. Because tree depth increases with the size of the data, the gain would also be greater with larger data sets and smaller leaf sizes. Nevertheless, even with these settings the gain is evident.

For the pure voting method, the parameter values of interest are the maximum leaf size n0 and the number of trees T, which were also evaluated with the MNIST data. According to the results in table 1, the accuracy is essentially the same for leaf sizes ranging from 256 to 4096. While in the limiting case of using a tree of depth one the results are clearly weaker than with smaller n0, decreasing the tree depth for example from 7 (corresponding to n0 = 256) to 3 (n0 = 4096) makes effectively no difference to the accuracy.

                 Leaf size n0
T       4     32    256   512   1024  2048  4096  8192  16384
50      0.23  0.34  0.39  0.40  0.40  0.38  0.37  0.31  0.19
100     0.34  0.46  0.51  0.52  0.52  0.52  0.49  0.44  0.34
150     0.40  0.53  0.57  0.58  0.59  0.59  0.56  0.52  0.43
200     0.45  0.56  0.63  0.62  0.63  0.63  0.61  0.57  0.48
500     0.60  0.68  0.73  0.73  0.72  0.72  0.71  0.69  0.63

Table 1: The effect of the n0 and T parameters on the query accuracy of the pure voting method on MNIST data.

As expected, holding n0 constant, increasing T yields better accuracy. However, a larger T implies a higher computational cost, which means that using more trees trades running time for accuracy.

We further analyzed the running time of voting using the news data set. In these analyses we used parameter settings that are suitable for use with the mixture method. The results are shown in table 2, where the running time of pure voting is broken into three steps: querying the trees, finding the number of votes each data point has received, and finally finding the points with the most votes. It can be seen that the cost of finding the most voted points remains essentially constant, as our implementation always performs a pass over an array with n elements, which dominates the running time of this step.
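The three steps can be sketched as follows. This is a hypothetical Python sketch: the tree objects with their leaf_indices method and n_points attribute are assumptions for illustration, not the actual implementation.

    import numpy as np

    def voting_query(trees, q, k):
        """Return the k most voted points for query q, in the three steps
        whose running times are reported in table 2."""
        n = trees[0].n_points                 # size of the data set
        # Step 1: query every tree for the leaf that q falls into.
        leaves = [tree.leaf_indices(q) for tree in trees]
        # Step 2: count the votes, i.e. how many leaves each point appears in.
        votes = np.zeros(n, dtype=np.int32)
        for leaf in leaves:
            votes[leaf] += 1
        # Step 3: find the k most voted points; this requires a pass over
        # all n entries of the vote array, which dominates the cost.
        return np.argpartition(-votes, k)[:k]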

Tree queries take less time when n0 is higher, because the larger the maximum leaf size, the flatter the trees (in table 2 this effect is most obvious in the rows where T is 500). However, the increased time required by counting the votes swamps this effect when we consider the total running time of voting.

Furthermore, it is clear that the voting step, that is, counting the votes and finding the most voted points, imposes a considerable cost in running time. For example, when 500 trees and a maximum leaf size parameter of 1024 are used, around two thirds of the running time consists of voting, whereas querying the trees requires only a third of the running time.

It is, however, important to note that the cost of vote counting is not proportional to the dimensionality of the data, whereas the cost of tree queries is also proportional to the dimensionality. Thus the higher the dimensionality, the less difference the cost of the final voting step makes.
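A back-of-the-envelope cost model makes this explicit (our own sketch based on the description above, not an analysis taken from the implementation). Writing $d_{\mathrm{nz}}$ for the number of nonzero components of a random vector,

$$
t_{\mathrm{trees}} = O\bigl(T \, d_{\mathrm{nz}} \log(n / n_0)\bigr),
\qquad
t_{\mathrm{count}} = O(T n_0),
\qquad
t_{\mathrm{top}} = O(n),
$$

since each of the $T$ trees is traversed to a depth of about $\log_2(n / n_0)$, every level evaluates one sparse projection of the query point, each tree contributes a leaf of at most $n_0$ votes, and the final step scans the $n$-element vote array. Only the first term depends on the dimensionality, so its share of the total running time grows as the dimensionality grows.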

n0    T    Vote counting (ms)  Finding most voted (ms)  Tree queries (ms)  Zero-vote share
128   100  0.38                4.94                     1.76               0.95
128   150  0.48                4.77                     2.57               0.93
128   250  0.56                5.07                     4.13               0.89
128   500  1.30                5.04                     10.22              0.79
256   50   0.48                5.82                     1.21               0.95
256   100  0.82                5.11                     1.81               0.91
256   150  0.93                5.07                     2.67               0.87
256   250  1.27                5.31                     4.34               0.79
256   500  2.57                5.17                     8.80               0.63
512   50   0.86                5.53                     0.91               0.91
512   100  1.10                5.16                     1.61               0.83
512   150  1.57                5.18                     2.41               0.75
512   250  2.52                5.21                     3.98               0.62
512   500  4.96                5.27                     8.19               0.39
1024  50   1.11                5.11                     0.69               0.82
1024  100  2.19                5.22                     1.42               0.68
1024  150  3.02                5.23                     2.14               0.56
1024  250  4.87                5.30                     3.66               0.39
1024  500  9.61                5.34                     7.36               0.16

Table 2: The running time of voting using the news data set, broken into the vote counting, most-voted search and tree query steps; the last column gives the share of data points that received no votes. We searched for 500 nearest neighbors, which is a typical setting when voting is used as a part of the mixture method.

Table 2 also shows the share of data points that received no votes at all. If almost all points received zero votes, a possible optimization would be to avoid traversing n items when finding the most voted points, by using some suitable data structure. However, it appears that with parameter configurations that allow for high accuracy (table 3 shows accuracy results for the news data set using the mixture method), this is not the case.
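One form such a data structure could take is a hash map that stores counts only for points that received at least one vote. The following is a hypothetical sketch of that optimization, which we did not implement:

    from collections import Counter

    def top_voted_sparse(leaves, k):
        """Count votes only for points that occur in at least one leaf,
        avoiding the pass over a dense n-element vote array. This pays off
        only when the vast majority of points receive zero votes."""
        votes = Counter()
        for leaf in leaves:                # one update per (tree, point) pair
            votes.update(leaf)
        return [point for point, _ in votes.most_common(k)]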

5.3 Mixture method

Figure 4 shows the performance of the mixture method on the MNIST data set for 10-NN search. The algorithm attains on average 98% accuracy, with the computation time remaining at around 3.5% of that required by linear search. This is a very encouraging result, especially given the small size and medium dimensionality of the data set. We also performed tests on the news data set; these results are shown in table 3.

[Figure 4: accuracy on 10-NN (y-axis) against query time in ms (x-axis), one curve per maximum leaf size n0 = 32, 64, 128, 256, 1024.]

Figure 4: Accuracy of the mixture method on MNIST data with different settings for n0. Values of T of 50, 100, 250 and 500 were used. Linear search took 70.4 ms.

The results are very similar in the sense that while linear search took about 700 milliseconds (as opposed to 70 on MNIST data), again around 98% accuracy was attained for 10-NN search in about 3.5% of the linear search time. For the news data set we also analyzed a number of different values of k. Table 3 shows that, as expected, the problem becomes more difficult as k grows, but this deterioration of the results with increasing k is relatively modest.

In testing the implementation we were concerned with three parameters: the maximum leaf size n0, the number of trees T and the final search set size S. Of these, n0 and T concern only the voting step of the algorithm, and were analyzed in the context of the pure voting method. The results shown in Fig. 4 and table 3 also confirm that, as was the case with pure voting, n0 has a relatively modest effect on the quality of the results, whereas larger values of T clearly improve them.
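Combining the three parameters, a mixture-method query can be sketched as voting followed by an exact search over the S most voted candidates. This is again a hypothetical Python sketch that reuses the voting_query function above; data is assumed to be an n × d NumPy array of the data points.

    import numpy as np

    def mixture_query(trees, data, q, k, S):
        """Sketch of a mixture-method query: prune the data set down to the
        S most voted candidates, then run an exact k-NN search over them."""
        candidates = voting_query(trees, q, S)        # final search set of size S
        diffs = data[candidates] - q                  # distances to S points only
        dists = np.einsum("ij,ij->i", diffs, diffs)   # squared Euclidean norms
        return candidates[np.argsort(dists)[:k]]      # k nearest of the candidates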

The size of the final search set is a central question for the algorithm. To establish the best final search set size we analyze two questions: how the number of nearest neighbors found behaves as a function of the search set size, holding the other parameters constant, and how big the final search set should be for all the nearest neighbors to be included on average.

(a) Nearest neighbor

            Leaf size n0
T      128          256          512          1024
50     0.66 (8.2)   0.71 (8.3)   0.74 (8.2)   0.80 (8.3)
100    0.80 (10.5)  0.81 (10.1)  0.87 (9.9)   0.89 (10.0)
250    0.90 (17.0)  0.94 (16.5)  0.96 (15.5)  0.98 (15.7)
500    0.96 (27.3)  0.98 (26.1)  0.99 (24.7)  1.00 (24.3)

(b) 10 nearest neighbors

            Leaf size n0
T      128          256          512          1024
50     0.47 (8.3)   0.56 (8.3)   0.57 (8.2)   0.65 (8.2)
100    0.66 (10.6)  0.67 (10.3)  0.76 (10.2)  0.80 (10.1)
250    0.81 (16.9)  0.86 (16.1)  0.90 (15.6)  0.93 (15.0)
500    0.91 (27.1)  0.94 (25.7)  0.96 (24.0)  0.98 (23.6)

(c) 15 nearest neighbors

            Leaf size n0
T      128          256          512          1024
50     0.43 (8.3)   0.53 (8.4)   0.54 (8.2)   0.61 (8.2)
100    0.62 (10.6)  0.64 (10.3)  0.73 (10.1)  0.77 (10.0)
250    0.78 (16.7)  0.83 (16.1)  0.88 (15.4)  0.91 (15.1)
500    0.89 (27.1)  0.93 (25.5)  0.95 (24.0)  0.97 (23.6)

(d) 25 nearest neighbors

            Leaf size n0
T      128          256          512          1024
50     0.39 (8.3)   0.48 (8.4)   0.49 (8.2)   0.56 (8.2)
100    0.57 (10.6)  0.59 (10.3)  0.68 (10.1)  0.73 (10.0)
250    0.73 (16.9)  0.80 (16.2)  0.85 (15.4)  0.89 (15.1)
500    0.86 (27.1)  0.91 (25.5)  0.93 (24.0)  0.96 (23.7)

Table 3: k-nearest-neighbor search accuracy on the news data set for various k. The final search set parameter S = 500 was used. Latency time in milliseconds is in parentheses. Linear search took 711 milliseconds.


In Fig. 5a the change in accuracy is depicted as a function of the final search set size. It is clear from the figure that with very small search sets a large number of the actual nearest neighbors are left out, and increasing the search set size drastically improves the results. However, diminishing returns are encountered relatively quickly, in this case when around 1% of the data set is searched in the final search step. Because the tree queries and vote counting impose a heavy overhead, the initial improvement in the results comes virtually at no computational cost. Figure 5a also shows that using a larger number of trees is a better way to improve accuracy than increasing S, as long as S is not unreasonably small.

Figure 5b shows how big a search set is needed for finding all the nearest neighbors, as a function of T. Notably, we have here analyzed the average over 1000 query points, which means that selecting a search set size of, for example, 400 will not lead to perfect accuracy, even though the figure suggests that on average a search set size of 300 suffices for finding all the nearest neighbors. This is because some individual query points may require a very large search set even when the average required over all points in the test set is much lower.

Figure 5: The effect of S on the accuracy of the mixture method: (a) accuracy as a function of S; (b) the S required for perfect accuracy. The tests were run with the MNIST data set using a maximum leaf size of 4096.

These results also show that while the pure voting method proved rather inaccurate, it is very effective in pruning the search set. For example, with 75 trees perfect accuracy is obtained by searching on average 312.4 points, or around 1% of the whole data set. With a larger number of desired nearest neighbors the required search set is of course bigger, but the trend is very similar: with 25 nearest neighbors perfect accuracy would still be obtained by a search over 812 points. Thus the low accuracy of pure voting is explained by its inability to single out the very nearest points from within the small set of points that receive the most votes, rather than by a complete failure of voting.