Distance measures for classification of numerical features
Georgy Minaev, Robert Piché, Ari Visa
Tampere University of Technology, Finland
June 14, 2018
1 List of functions
All tables list the distance functions. The formula $d = 1 - s$ is used to transform a similarity $s$ into a distance. Throughout, $\sum$ denotes $\sum_{i=1}^{N}$, where $N$ is the vector length. Each entry gives the id and name of the distance, the formula $d(P,Q)$, comments (including any additional variables used in the formula), and the source.
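As a small illustration of the $d = 1 - s$ convention, the following minimal MATLAB sketch (the vectors P and Q are hypothetical examples, not from the source) converts the cosine similarity of entry 50 below into a distance:

    P = [1 2 3];  Q = [2 2 1];                              % example feature vectors
    s = sum(P .* Q) / (sqrt(sum(P.^2)) * sqrt(sum(Q.^2)));  % cosine similarity
    d = 1 - s;                                              % similarity transformed to a distance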
Table 1: Power (p,r) distances [1, chapter 17.2]

1. Power (p,r) Euclidean distance: $(\sum |P_i - Q_i|^p)^{1/r}$, with $p = 2$, $r = 2$. [1]
2. Power (p,r) Manhattan distance: $(\sum |P_i - Q_i|^p)^{1/r}$, with $p = 1$, $r = 1$. [1]
3. Power (p,r) Chebyshev distance: $(\sum |P_i - Q_i|^p)^{1/r}$, with $p = \infty$, $r = \infty$; implementation: d = max(|P - Q|). [1]
4. Power (p,r) squared Euclidean distance: $(\sum |P_i - Q_i|^p)^{1/r}$, with $p = 2$, $r = 1$; $d_4 = d_1^2$. [1]
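Entries 1-4 share one formula, so they can be computed with a single helper. The sketch below is a minimal MATLAB illustration (the function name power_dist is ours, not from the source); the Chebyshev case is handled as in the implementation note of entry 3:

    function d = power_dist(P, Q, p, r)
    % Power (p,r) distance between two equal-length vectors P and Q.
    if isinf(p) && isinf(r)
        d = max(abs(P - Q));            % entry 3: Chebyshev limit
    else
        d = sum(abs(P - Q).^p)^(1/r);   % entries 1, 2 and 4
    end
    end

For example, power_dist(P, Q, 2, 2) gives the Euclidean distance (entry 1) and power_dist(P, Q, 2, 1) its square (entry 4).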
Table 2: Distances on distribution laws [1, chapter 14.2]

5. Fidelity distance: $1 - \sum \sqrt{P_i Q_i}$. [1]
6. Matusita distance: $\sqrt{\sum (\sqrt{P_i} - \sqrt{Q_i})^2}$; $d_6 = \sqrt{d_7}$. [1]
7. Squared-chord distance: $\sum (\sqrt{P_i} - \sqrt{Q_i})^2$; $d_7 = d_6^2$. [1]
8. Harmonic mean distance: $1 - 2\sum P_i Q_i/(P_i + Q_i)$. [1]
9. Bhattacharya 1 distance: $(\arccos \sum \sqrt{P_i Q_i})^2$; twice this is the Fisher distance. [1]
10. Bhattacharya 2 distance: $-\ln \sum \sqrt{P_i Q_i}$. [1]
11. Pearson $\chi^2$ distance: $\sum (P_i - Q_i)^2/Q_i$; aka $\chi^2$-distance; quasi-distance, non-symmetric. [1]
12. Neyman $\chi^2$ distance: $\sum (P_i - Q_i)^2/P_i$; quasi-distance, non-symmetric. [1]
13. Probabilistic symmetric $\chi^2$ distance: $2\sum (P_i - Q_i)^2/(P_i + Q_i)$. [1]
14. Separation quasi-distance: $\max_i (1 - P_i/Q_i)$; quasi-distance, non-symmetric. [1]
15. Kullback-Leibler distance: $\sum P_i \ln(P_i/Q_i)$; aka relative entropy, information deviation, information gain, KL-distance; quasi-distance, non-symmetric. [1]
16. Skew divergence: $KL(P, \alpha Q + (1-\alpha)P)$, with $\alpha = 0.1$, where KL is the Kullback-Leibler distance; quasi-distance, non-symmetric. [1]
17. K-divergence distance: $KL(P, \alpha Q + (1-\alpha)P)$, with $\alpha = 1/2$, where KL is the Kullback-Leibler distance; quasi-distance, non-symmetric. [1]
18. Jeffrey divergence: $\sum (P_i - Q_i)\ln(P_i/Q_i)$; aka J-divergence, divergence distance, KL2-divergence. [1]
19. Jensen-Shannon divergence: $\alpha\,KL(P, Z) + (1-\alpha)\,KL(Q, Z)$, with $Z_i = \alpha P_i + (1-\alpha)Q_i$ and $\alpha = 0.1$, where KL is the Kullback-Leibler distance; non-symmetric. [1]
20. Topsøe distance: $\sum (P_i \ln(P_i/Z_i) + Q_i \ln(Q_i/Z_i))$, with $Z_i = \frac{1}{2}(P_i + Q_i)$; aka information statistics. [1]
21. Taneja distance: $\sum Z_i \ln(Z_i/\sqrt{P_i Q_i})$, with $Z_i = \frac{1}{2}(P_i + Q_i)$. [1]
22. Resistor-average distance: $(1/KL(P,Q) + 1/KL(Q,P))^{-1}$, where KL is the Kullback-Leibler distance. [1]
23. Chernoff distance: $\max_t(-\ln \sum P_i^t Q_i^{1-t})$, with $t = 0.1$ used here, $0 \le t < 1$; aka Rényi cross-entropy; non-symmetric. [1]
24. Rényi distance: $\frac{1}{1-t}\ln \sum Q_i (P_i/Q_i)^t$, with $t = 0.1$, $0 \le t < 1$; quasi-distance; for $t = 1$ it becomes the Kullback-Leibler distance; non-symmetric. [1]
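Several entries of Table 2 are built on the Kullback-Leibler distance of entry 15. A minimal MATLAB sketch, assuming P and Q are strictly positive probability vectors (the helper names kl_dist and skew_div are ours):

    function d = kl_dist(P, Q)
    % Kullback-Leibler distance (entry 15); assumes all components are > 0.
    d = sum(P .* log(P ./ Q));
    end

    function d = skew_div(P, Q, alpha)
    % Skew divergence (entry 16): KL(P, alpha*Q + (1-alpha)*P).
    d = kl_dist(P, alpha*Q + (1 - alpha)*P);
    end

With alpha = 0.1 this reproduces entry 16; with alpha = 1/2 it gives the K-divergence of entry 17.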
Table 3: Similarities and distances for numerical data [1, chapter 17.1]

25. Ruzicka distance: $\sum |P_i - Q_i| / \sum \max(P_i, Q_i)$; aka Soergel, Tanimoto. [1]
26. Roberts distance: $1 - \sum\bigl((P_i + Q_i)\frac{\min(P_i, Q_i)}{\max(P_i, Q_i)}\bigr) / \sum (P_i + Q_i)$. [1]
27. Intersection distance: $1 - \sum \min(P_i, Q_i) / \min(\sum P_i, \sum Q_i)$. [1]
28. Motyka distance: $1 - \sum \min(P_i, Q_i) / \sum (P_i + Q_i)$; aka Sørensen distance, Czekanowski distance; $d_{28} = \frac{1}{2} + \frac{1}{2} d_{29}$. [1]
29. Bray-Curtis distance: $1 - \frac{2}{N(\bar{P} + \bar{Q})}\sum \min(P_i, Q_i)$; aka Renkonen % similarity or percentage similarity. [1]
30. Canberra distance: $\sum \frac{|P_i - Q_i|}{|P_i| + |Q_i|}$. [1]
31. Kulczynski 1 distance: $\sum |P_i - Q_i| / \sum \min(P_i, Q_i)$. [1]
32. Kulczynski 2 distance: $1 - \frac{1}{2N}\bigl(\frac{1}{\bar{P}} + \frac{1}{\bar{Q}}\bigr)\sum \min(P_i, Q_i)$. [1]
33. Baroni-Urbani-Buser distance: $1 - \frac{\sum \min(P_i, Q_i) + \sqrt{\sum \min(P_i, Q_i)\,\sum(\max(P) - \max(P_i, Q_i))}}{\sum \max(P_i, Q_i) + \sqrt{\sum \min(P_i, Q_i)\,\sum(\max(P) - \max(P_i, Q_i))}}$. [1]
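As an example from Table 3, the Ruzicka (entry 25) and Canberra (entry 30) distances translate directly into vectorized MATLAB; a minimal sketch, assuming P and Q are nonnegative vectors of equal length:

    d_ruzicka  = sum(abs(P - Q)) / sum(max(P, Q));        % entry 25
    d_canberra = sum(abs(P - Q) ./ (abs(P) + abs(Q)));    % entry 30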
Table 4: Relatives of Euclidean distance [1, chapter 17.2]

34. Penrose size distance: $\sqrt{N}\sum |P_i - Q_i|$. [1]
35. Mean character distance: $\frac{1}{N}\sum |P_i - Q_i|$; aka Gower distance. [1]
36. Lorentzian distance: $\sum \ln(1 + |P_i - Q_i|)$. [1]
37. Penrose shape distance: $\sqrt{\sum ((P_i - \bar{P}) - (Q_i - \bar{Q}))^2}$. [1]
38. Clark distance: $\bigl(\frac{1}{N}\sum ((P_i - Q_i)/(|P_i| + |Q_i|))^2\bigr)^{1/2}$. [1]
39. Meehl distance: $\sum_{i=1}^{N-1}(P_i - Q_i - P_{i+1} + Q_{i+1})^2$. [1]
40. Hellinger distance: $\sqrt{2\sum (\sqrt{P_i/\bar{P}} - \sqrt{Q_i/\bar{Q}})^2}$. [1]
41. Whittaker index of association distance: $\frac{1}{2}\sum |P_i/\bar{P} - Q_i/\bar{Q}|$. [1]
42. Symmetric $\chi^2$ distance: $\sqrt{\sum \frac{\bar{P} + \bar{Q}}{N(\bar{P}\bar{Q})^2}\cdot\frac{(P_i\bar{Q} - Q_i\bar{P})^2}{P_i + Q_i}}$; aka chi-distance. [1]
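Most entries of Table 4 involve the vector means $\bar{P}$ and $\bar{Q}$. A minimal MATLAB sketch for the mean character (entry 35) and Penrose shape (entry 37) distances, assuming P and Q are vectors of equal length:

    N = numel(P);
    d_mean_char = sum(abs(P - Q)) / N;                             % entry 35
    d_penrose   = sqrt(sum(((P - mean(P)) - (Q - mean(Q))).^2));   % entry 37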
Table 5: Spectra distances

43. Spearman Correlation Coefficient, Pearson Correlation Coefficient: $1 - \frac{\sum (Q_i - \bar{Q})(P_i - \bar{P})}{\sqrt{\sum (Q_i - \bar{Q})^2 \sum (P_i - \bar{P})^2}}$. [3, 4, 5]
44. Similarity Index (SI): $\sqrt{\frac{1}{N}\sum \bigl(\frac{P_i - Q_i}{Q_i}\times 100\bigr)^2}$; non-symmetric. [6]
45. Improved Similarity Index: $\sqrt{\frac{1}{N}\sum \bigl(\frac{P_i - Q_i}{P_i + Q_i}\times 100\bigr)^2}$. [6]
46. Absolute Value Distance: $\bigl(1 + \frac{\sum |Q_i - P_i|}{\sum P_i}\bigr)^{-1}$; non-symmetric. [7]
47. Dot-Product (cosine): $\frac{(\sum Q_i P_i)^2}{\sum Q_i^2 \sum P_i^2}$. [6]
48. Spectral Contrast Angle: $\frac{\sum Q_i P_i}{\sqrt{\sum Q_i^2 \sum P_i^2}}$; $d_{48} = \sqrt{d_{47}}$. [6]
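The correlation-based distance of entry 43 follows directly from the definition; a minimal MATLAB sketch of the Pearson form (centering, then a normalized dot product):

    Pc = P - mean(P);  Qc = Q - mean(Q);                          % centered vectors
    d_corr = 1 - sum(Pc .* Qc) / sqrt(sum(Pc.^2) * sum(Qc.^2));   % entry 43

The Spearman variant is obtained by applying the same formula to the ranks of P and Q.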
Table 6: Distances from Cha

49. Wave Hedges distance: $\sum \frac{|P_i - Q_i|}{\max(P_i, Q_i)}$. [2]
50. Cosine distance: $1 - \frac{\sum P_i Q_i}{\sqrt{\sum P_i^2}\sqrt{\sum Q_i^2}}$. [2]
51. Jaccard distance: $\frac{\sum (P_i - Q_i)^2}{\sum P_i^2 + \sum Q_i^2 - \sum P_i Q_i}$. [2]
52. Dice distance: $\frac{\sum (P_i - Q_i)^2}{\sum P_i^2 + \sum Q_i^2}$. [2]
53. Inner Product distance: $1 - \sum P_i Q_i$. [2]
54. Divergence distance: $2\sum \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}$. [2]
55. Additive symmetric $\chi^2$ distance: $\sum \frac{(P_i - Q_i)^2 (P_i + Q_i)}{P_i Q_i}$. [2]
56. Jensen difference: $\sum \bigl[\frac{1}{2}(P_i \ln P_i + Q_i \ln Q_i) - \frac{P_i + Q_i}{2}\ln\frac{P_i + Q_i}{2}\bigr]$. [2]
57. Kumar-Johnson distance: $\sum \frac{(P_i^2 - Q_i^2)^2}{2(P_i Q_i)^{3/2}}$. [2]
58. Avg($L_1$, $L_\infty$) distance: $\frac{1}{2}\bigl(\sum |P_i - Q_i| + \max_i |P_i - Q_i|\bigr)$; $d_{58} = \frac{1}{2}(d_3 + d_2)$. [2]
59. Vicis-Wave Hedges distance: $\sum \frac{|P_i - Q_i|}{\min(P_i, Q_i)}$. [2]
60. Vicis-Symmetric $\chi^2$ 1 distance: $\sum \frac{(P_i - Q_i)^2}{\min(P_i, Q_i)^2}$. [2]
61. Vicis-Symmetric $\chi^2$ 2 distance: $\sum \frac{(P_i - Q_i)^2}{\min(P_i, Q_i)}$. [2]
62. Vicis-Symmetric $\chi^2$ 3 distance: $\sum \frac{(P_i - Q_i)^2}{\max(P_i, Q_i)}$. [2]
63. max-Symmetric $\chi^2$ distance: $\max\bigl(\sum \frac{(P_i - Q_i)^2}{P_i}, \sum \frac{(P_i - Q_i)^2}{Q_i}\bigr)$. [2]
64. min-Symmetric $\chi^2$ distance: $\min\bigl(\sum \frac{(P_i - Q_i)^2}{P_i}, \sum \frac{(P_i - Q_i)^2}{Q_i}\bigr)$. [2]
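As a worked example from Table 6, the Jaccard (entry 51) and Dice (entry 52) distances differ only in their denominators; a minimal MATLAB sketch:

    num       = sum((P - Q).^2);
    d_jaccard = num / (sum(P.^2) + sum(Q.^2) - sum(P .* Q));   % entry 51
    d_dice    = num / (sum(P.^2) + sum(Q.^2));                 % entry 52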
Table 7: Distances to compare compressed data

65. Hausdorff distance: $\max_{x \in P}\min_{y \in Q} d(x, y)$, where $d(x, y)$ is the Euclidean distance between points $x$ and $y$; implementation: [PP, QQ] = meshgrid(P, Q); $\max_i \min_j |PP_{i,j} - QQ_{i,j}|$; ultrametric, Hausdorff metric. [8]
66. Edit distance: see reference; implementation (MATLAB 2017b): tol = 0.1; edr(P, Q, tol); editing metric. [9]
67. Levenshtein distance: see reference; implementation: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/; editing metric. [1]
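The meshgrid hint of entry 65 expands into the following minimal MATLAB sketch for scalar-valued vectors P and Q; it computes the directed max-min form written in the $d(P,Q)$ column (the full Hausdorff metric takes the maximum of the two directions):

    [PP, QQ] = meshgrid(P, Q);          % PP(i,j) = P(j), QQ(i,j) = Q(i)
    D = abs(PP - QQ);                   % pairwise distances |P(j) - Q(i)|
    d_hausdorff = max(min(D, [], 1));   % nearest Q for each element of P, then the worst case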
References

[1] M. M. Deza and E. Deza, Encyclopedia of Distances, Springer, Berlin, Heidelberg, 2009.
[2] S.-H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, no. 4, pp. 300-307, 2007.
[3] R. C. Saraiva, L. Lovisolo, RF Fingerprinting Location Techniques, in S. A. Zekavat, R. M. Buehrer, eds., Handbook of Position Location: Theory, Practice and Advances, Hoboken, NJ, USA: Wiley, 2011.
[4] G. Minaev, A. Visa, R. Piché, Comprehensive Survey of Similarity Measures for Ranked Based Location Fingerprinting Algorithm, International Conference on Indoor Positioning and Indoor Navigation (IPIN), Sept. 2017.
[5] Y. Xie, Y. Wang, A. Nallanathan, L. Wang, An Improved K-Nearest-Neighbor Indoor Localization Method Based on Spearman Distance, IEEE Signal Processing Letters, vol. 23, no. 3, pp. 351-355, 2016.
[6] K. Wan, I. Vidavsky, M. L. Gross, Comparing similar spectra: from similarity index to spectral contrast angle, Journal of the American Society for Mass Spectrometry, vol. 13, pp. 85-88, 2002, URL: http://www.sciencedirect.com/science/article/pii/S1044030501003270.
[7] S. E. Stein, D. R. Scott, Optimization and testing of mass spectral library search algorithms for compound identification, Journal of the American Society for Mass Spectrometry, vol. 5, pp. 859-866, 1994, URL: http://www.sciencedirect.com/science/article/pii/1044030594870098.
[8] M. Atallah, A linear time algorithm for the Hausdorff distance between convex polygons, Inf. Process. Lett., vol. 17, pp. 207-209, 1983.
[9] R. A. Wagner, M. J. Fischer, The string-to-string correction problem, J. ACM, vol. 21, pp. 168-173, 1974, URL: http://doi.acm.org/10.1145/321796.321811.