
Enhancing Nearest Neighbor Based Entropy Estimator for High Dimensional Distributions via Bootstrapping Local Ellipsoid

Chien Lu, Jaakko Peltonen

Tampere University, Kalevantie 4, 33100 Tampere, Finland

chien.lu@tuni.fi, jaakko.peltonen@tuni.fi

Abstract

An improved, ellipsoid-based kNN entropy estimator for random samples from high-dimensional distributions is developed. We argue that the inaccuracy of the classical kNN estimator in high dimensional spaces results from the local uniformity assumption, and the proposed method mitigates this assumption with two crucial extensions: a local ellipsoid-based volume correction and a correction acceptance testing procedure. Relevant theoretical contributions are provided, and several experiments ranging from simple to complicated cases show that the proposed estimator can effectively reduce the bias, especially in high dimensionalities, outperforming current state-of-the-art alternative estimators.

Introduction

The differential entropy of a continuous-valued random variable X is defined as

H(X) = -\int p(x) \log p(x)\, dx,   (1)

where p(x) is the probability density function of X. Entropy has been an important numerical quantity in Statistics, Machine Learning and other disciplines such as Physics. It provides a summary measurement of the degree of uncertainty of a system. Entropy is also related to other important information theoretic measures, including the Kullback-Leibler divergence (Kullback and Leibler 1951) and mutual information (Cover and Thomas 2006), which is defined for random variables X and Y as

I(X, Y) = \int_Y \int_X p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy = -H(Y|X) + H(Y).   (2)

Classical kNN Estimator

In theory, to obtain the entropy of a system, the underlying probability distribution must be known; that is, the analytical form of the probability density function (PDF) needs to be available. However, in real-world cases, it is common that the underlying PDF is not available and only a set of samples is observed. This raises the research problem of estimating the entropy from the observed data only; in particular, flexible estimation approaches that do not assume the distribution to lie in a particular parametric model family are known as non-parametric entropy estimation. Estimators such as the k-nearest neighbor estimator (kNN; Kozachenko and Leonenko 1987 and Goria et al. 2005) and the kernel density estimator (KDE; Silverman 2018) or hybrid methods (Orava 2011) have been proposed. Note that there is another important group of approaches, called ensemble estimators (Sricharan, Wei, and Hero 2013; Moon, Sricharan, and Hero 2017; Moon et al. 2018), which use weighted combinations of different estimators. This work focuses on the kNN estimator and approaches related to it.

Assume there are N i.i.d. D-dimensional samples x_1, \ldots, x_N \sim P, where the probability density function p is unknown. The classical kNN estimator is written as

H(X) \approx -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i) \approx \psi(N) - \psi(k) + \log(c_D) + \frac{D}{N} \sum_{i=1}^{N} \log \varepsilon_i   (3)

where \psi denotes the digamma function, \varepsilon_i denotes the Euclidean distance from x_i to its k-th nearest neighbor, c_D = \pi^{D/2} / \Gamma(1 + D/2), and \psi(N) - \psi(k) is the correction term.
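To make Equation (3) concrete, the following is a minimal Python sketch of the classical Kozachenko-Leonenko estimator; the function name and the use of scipy and scikit-learn utilities are our own illustrative choices and not the authors' implementation (which is in R).

```python
import numpy as np
from scipy.special import digamma, gammaln
from sklearn.neighbors import NearestNeighbors

def knn_entropy(x, k=25):
    """Classical kNN (Kozachenko-Leonenko) entropy estimate, Equation (3)."""
    n, d = x.shape
    # Distance from each point to its k-th nearest neighbor (the point itself is its own 0-th neighbor).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x)
    dist, _ = nn.kneighbors(x)
    eps = dist[:, -1]  # epsilon_i: distance to the k-th nearest neighbor
    # log c_D = log volume of the unit D-ball.
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(1 + d / 2)
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))
```

As a sanity check, for samples from a standard 2-dimensional Gaussian the estimate should be close to the analytic value log(2πe) ≈ 2.84 nats.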

The classical estimator has been widely applied in many different research problems. It has also been adopted to non-parametrically estimate the mutual information (Kraskov, Stögbauer, and Grassberger 2004) of Equation (2) and the KL divergence (Pérez-Cruz 2008; Wang, Kulkarni, and Verdú 2009).

Bias of the classical kNN

Although the classical kNN estimator has shown promising applicability, it has been found to be biased especially in higher dimensional cases (Noh et al. 2014; Chauveau and Vandekerkhove 2014). Theoretical bounds relating the convergence to the data dimension include (Gao, Oh, and Viswanath 2018; Singh and Póczos 2016b; 2014; 2016a); for example, Gao, Oh, and Viswanath (2018) showed that the bias of the classical kNN estimator is \tilde{O}(N^{-1/D}), where \tilde{O} denotes limiting behavior up to polylogarithmic factors in N.

One explanation of the bias is that the local uniformity assumption of the classical kNN cannot always hold (Lombardi and Pant 2016; Gao, Ver Steeg, and Galstyan 2015; Gao, Steeg, and Galstyan 2015; Gao, Oh, and Viswanath 2016; Lord, Sun, and Bollt 2018), especially when the dimension is high. In other words, the hyperspherical structure (the ε-ball) is not capable of capturing the twisted shape of the underlying distribution around the sample; such a situation can have an increasingly strong effect on the bias as the dimensionality grows.

The rest of the paper is organized as follows. We first describe our proposed method and its bias analysis, including the corresponding correction term and the bias bound. We then review related work based on local approximations, and describe comparison experiments in both entropy estimation and mutual information estimation. Lastly, we provide conclusions.

Method

We now introduce our novel entropy estimation method, called the kNN estimator with Ellipsoidal Correction (EC-kNN); the resulting method is directly applicable both to entropy estimation and to mutual information estimation.

The idea is to construct a local ellipsoid instead of an ε-ball, so that the samples inside the local ellipsoid can be assumed to be more uniformly distributed and hence better fit the uniformity assumption of kNN based entropy estimation. The proposed method comprises two parts: the first part is the local ellipsoidal correction, where a local ellipsoid is learned by performing a local principal component analysis (PCA); the second part is an acceptance testing procedure which is performed in a bootstrapping manner. Note that the local-PCA approach has been adopted by other similar works, but the bootstrap step is a significant novelty in our approach.

The classical kNN estimator of Equation (3) can be rewritten as

H(X) \approx \frac{1}{N} \sum_{i=1}^{N} H_{kNN}(x_i)   (4)

where H_{kNN}(x_i) is an approximation of -\log p(x_i) based on a D-dimensional ball of radius \varepsilon_i encompassing k neighbors, with

H_{kNN}(x_i) = \psi(N) - \psi(k) + \log(V_i)   (5)

and the \log(V_i) term denotes the logarithm of the volume V_i of the ball, computed as

\log(V_i) = \log(c_D) + D \log \varepsilon_i   (6)

where c_D = \pi^{D/2} / \Gamma(1 + D/2) as before.

Our aim is to generalize the above by a logarithmic volume correction term \log(\Delta \tilde{V}_i), corresponding to the difference between the logarithmic volume of a local ellipsoid and that of the ε-ball. Furthermore, as a result of the local ellipsoid, the number of neighbors k may slightly change around each point x_i, yielding an individual k_i for each point.

Theorem 1. The correction term of the ellipsoid-based estimator is \psi(N) - \psi(k), which is the same as for the classical estimator.

Proof. Let E(x_i, r(x_i)) denote the ellipsoid around x_i defined by axes r(x_i) = [r_1(x_i), r_2(x_i), \ldots, r_D(x_i)], where r_1(x_i) and r_D(x_i) are the longest and shortest axes respectively. For any choice of r(x_i), the ellipsoid E(x_i, r(x_i)) can also be defined by introducing a Mahalanobis matrix M_{x_i} such that

\forall \xi \in E(x_i, r(x_i)): \quad \sqrt{(\xi - x_i)^\top M_{x_i} (\xi - x_i)} \le r_1(x_i).   (7)

In this definition the matrix controls the shape of the ellipsoid and r_1(x_i) controls its overall size.

Let p(r_1(x_i)) denote the probability mass inside the ellipsoid controlled by r_1(x_i) and M_{x_i}; for brevity the notation p(r_1(x_i)) shows the dependence on r_1(x_i) only. The probability mass is

p(r_1(x_i)) = \int_{\xi \in E(x_i, r(x_i))} p(\xi)\, d\xi = \int_{\sqrt{(\xi - x_i)^\top M_{x_i} (\xi - x_i)} \le r_1(x_i)} p(\xi)\, d\xi.   (8)

Similar to the work of Kraskov, Stögbauer, and Grassberger (2004), let \mu(r_1(x_i)) be the probability density of r_1(x_i); then \mu(r_1(x_i))\, dr_1(x_i) is the probability that the Mahalanobis distance from x_i to its k-th nearest neighbor lies in the interval [r_1(x_i), r_1(x_i) + dr_1(x_i)]. It can be written as

\mu(r_1(x_i))\, dr_1(x_i) = \frac{(N-1)!}{1!\,(k-1)!\,(N-k-1)!} \times \frac{dp(r_1(x_i))}{dr_1(x_i)}\, dr_1(x_i) \times p(r_1(x_i))^{k-1} \times \left(1 - p(r_1(x_i))\right)^{N-k-1}   (9)

by using the trinomial formula. The correction term can be obtained by taking the expectation of -\log p(r_1(x_i)):

E[-\log p(r_1(x_i))] = \int_0^{\infty} \mu(r_1(x_i)) \left(-\log p(r_1(x_i))\right) dr_1(x_i) = \psi(N) - \psi(k).   (10)

Under the assumption that inside E(x_i, r(x_i)) the probability density is a constant p(x_i), such that p(x_i) \times V_{E_i} \approx p(r_1(x_i)) and \log V_{E_i} = \log(V_i) + \log(\Delta \tilde{V}_i),

E[-\log p(x_i)] \approx \psi(N) - \psi(k) + \log V_{E_i} = \psi(N) - \psi(k) + \log(V_i) + \log(\Delta \tilde{V}_i).   (11)

The resulting entropy estimator EC-kNN is then of the form

H(X) \approx \frac{1}{N} \sum_{i=1}^{N} H_{EC\text{-}kNN}(x_i)   (12)

where

H_{EC\text{-}kNN}(x_i) = \psi(N) - \psi(k_i) + \log(V_i) + \log(\Delta \tilde{V}_i)   (13)

and \log(V_i) is defined by Equation (6). However, since the local ellipsoid is learned from a small number of local data points, we must guard against potential over-correction to the shape learned from this limited amount of data; we do this by an acceptance testing procedure, and apply the logarithmic correction term \log(\Delta \tilde{V}_i) only if the test is successful.

Next, we describe the local ellipsoidal correction for computing the correction term \Delta \tilde{V}_i, and then the bootstrap based acceptance testing.

Local Ellipsoidal Correction

The ε-ball in the classical kNN entropy estimator treats all directions equally. In a high-dimensional space the underlying probability distribution around a data point x_i may be curved or stretched so that, due to the shape of the distribution, the density along some directions of the high-dimensional space may be larger than along other directions; moreover, the density may experience larger changes along some directions than along others. In such a situation, it can still be a reasonable approximation to assume the density to be approximately uniform in a small hyper-region, but it is no longer a good assumption that the hyper-region would be an ε-ball. We instead aim to learn an ellipsoid as a better representation of a local hyper-region with approximately uniform density.

To learn a local ellipsoid structure, we propose to use a local PCA based approach as follows. We first find the k nearest neighbors of x_i by smallest Euclidean distance as usual. We then compute the sample covariance matrix of the resulting neighborhood of x_i (including x_i itself) and rotate the neighborhood to a new coordinate system by projecting the data onto the eigenvectors of the obtained covariance matrix. The eigenvectors are used as the axes of the local ellipsoid. The widths of the local ellipsoid along each axis are then computed by searching for the maximum distance from x_i to the neighbor points along each coordinate axis (see Figure 1).

After performing the local PCA, the sum of the logarithms of the ratios of each of the other axes to the longest axis of the estimated ellipsoid is taken as the logarithmic volume correction term; additionally, the number of neighbors is re-counted based on the ellipsoid.

In detail, since the volume of an ellipsoid with axes r_1(x_i), r_2(x_i), \ldots, r_D(x_i) is

\frac{\pi^{D/2}}{\Gamma(1 + D/2)} \prod_{d=1}^{D} r_d(x_i),   (14)

and we take the longest axis r_1(x_i), the distance from the center to the farthest point along the longest axis, to represent the original radius, the correction term is obtained as

\Delta \hat{V}(x_i, X) = \prod_{d=1}^{D} \frac{r_d(x_i)}{r_1(x_i)}.   (15)

Algorithm 1 shows the computation of the correction term.

Figure 1: (a) The ε-ball of the classical kNN estimator. (b) Local ellipsoid learned via local PCA. Lines extending from the center denote the local axes found by local PCA; the widths r_d of the lines denote the distance from the central point to the furthest neighbor along each axis, and the width of the ellipsoid along the axis is defined based on that distance. Note that as a result, due to the curvature of the ellipsoid boundary, the furthest neighbor along an axis may not lie inside the ellipsoid.

In the complete entropy estimation algorithm described in the next section, we denote the correction term as \Delta \tilde{V}_i.

Algorithm 1 Local ellipsoid-based volume correction
Require: x_i: sample point; X = {x_1, \ldots, x_N}: all sample points; D: dimension; k: number of neighbors
Ensure: \Delta \hat{V}(x_i, X): local ellipsoid-based volume correction
1: Find the k-th nearest neighbor of x_i and compute the distance \varepsilon_i
2: Perform a PCA on the set of the k + 1 points
3: Project the k + 1 points to the new coordinate system and obtain the new center by averaging all the projected points
4: Find the lengths r_1, \ldots, r_D of the axes of the projected ellipsoid by computing, along each PCA projection axis, the maximum distance from the projected points to the center
return \Delta \hat{V}(x_i, X) = \prod_{d=1}^{D} r_d / r_1
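As an illustration of Algorithm 1, the following is a minimal Python sketch of the local ellipsoid-based volume correction under our reading of the text; the function name local_volume_correction and the use of numpy's eigendecomposition are our own choices, not the authors' R implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_volume_correction(x_i, X, k=25):
    """Sketch of Algorithm 1: local ellipsoid-based volume correction Delta V_hat(x_i, X)."""
    # The k+1 local points: x_i together with its k nearest neighbors.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(np.asarray(x_i).reshape(1, -1))
    local = X[idx[0]]
    # Local PCA: eigenvectors of the neighborhood covariance give the ellipsoid axes.
    cov = np.cov(local, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    projected = local @ eigvecs
    center = projected.mean(axis=0)
    # Axis lengths: maximum distance from the projected points to the center along each axis.
    r = np.abs(projected - center).max(axis=0)
    r_sorted = np.sort(r)[::-1]          # r_1 is the longest axis
    return np.prod(r_sorted / r_sorted[0])
```

Since every ratio r_d / r_1 is at most 1, the correction factor is at most 1, so the logarithmic correction is non-positive and can only reduce the classical per-point estimate.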

Bootstrap Testing for Correction Acceptance

A nonuniform empirical distribution of a finite set of data samples along different coordinate axes can also arise from random sampling variation alone. That is, a nonuniform distribution of the data subset in an ε-ball around a data point can be observed even when the underlying distribution is uniform there. Therefore, it can happen that the volume corrections for some points are unnecessary, and an over-correction problem can occur if the correction is performed for every sample point.


Hence, we introduce a bootstrap style acceptance testing procedure to amend the potential over-correction issue. For each sample point, a reference correction \Delta \hat{V}_u is generated, representing a rough estimate of the correction that can occur due to random sampling alone, under a uniformity assumption of the underlying density in the ε-ball around the sample point. The reference correction is then used in the acceptance testing procedure as an acceptance threshold.

The reference correction \Delta \hat{V}_u is simply a volume correction obtained from a set of k + 1 samples generated uniformly inside the ε-ball around x_i. The details of generating \Delta \hat{V}_u are shown in Algorithm 2.

Since the data point x_i has itself been randomly generated from an underlying distribution, the center point of the ε-ball is itself a random variable. Therefore, in order to simulate a true random configuration, the randomness of whether x_i is the center of the ε-ball should also be taken into account. Hence, after simulating the k + 1 points inside the ε-ball, the center of the ball is not necessarily the original x_i but is instead redefined by choosing the point among the k + 1 points which is closest to the original point.

The proposed kNN estimator with ellipsoidal correction (EC-kNN) combines the above-proposed Algorithm 1 and Algorithm 2. For each sample point, the volume correction is performed with a bootstrap acceptance test: the correction term \Delta \tilde{V}_i and the reference correction term \Delta \hat{V}_u are generated, and the bootstrap acceptance test is conducted by comparing their values. The correction is accepted if \Delta \tilde{V}_i < \Delta \hat{V}_u; otherwise the algorithm uses the result of the classical kNN estimator.

Note that, in high dimensional cases, it can happen that when the number of neighbors inside the new ellipsoid is counted, there are no points of the finite data set inside the ellipsoid. When this issue is encountered, the axes of the ellipsoid are increased slightly until there is at least one data point inside the ellipsoid. In the proposed algorithm, every axis is lengthened by multiplying r_d by a small ratio (e.g., 1.01), and the volume correction is changed accordingly. By definition, a point x_i = [x_{i,1}, \ldots, x_{i,D}]^\top is inside the ellipsoid if

\sum_{d=1}^{D} \frac{(x_{i,d} - c_d)^2}{r_d^2} \le 1   (16)

where c = [c_1, \ldots, c_D]^\top is the center of the ellipsoid.

After going through every data point in the data set, the algorithm produces the final corrected result by averaging the corrected entropy values over all data points. Note that no hyper-parameter setting is required, and since the bootstrap acceptance test depends on the randomly generated value \Delta \hat{V}_u, the result of the computation can be slightly different every time.
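The following is a small sketch of the ellipsoid-membership test of Equation (16) and of the axis-lengthening loop described above, assuming the points have already been expressed in the ellipsoid's coordinate frame and that the point set is non-empty; the names inside_ellipsoid and grow_until_nonempty and the loop structure are our own illustrative choices.

```python
import numpy as np

def inside_ellipsoid(points, center, r):
    """Equation (16): True for points whose sum_d (x_d - c_d)^2 / r_d^2 is at most 1."""
    return (((points - center) ** 2) / r ** 2).sum(axis=1) <= 1.0

def grow_until_nonempty(points, center, r, ratio=1.01):
    """Lengthen every axis by a small ratio until at least one data point lies inside
    the ellipsoid, as described for the high-dimensional corner case above."""
    r = np.asarray(r, dtype=float).copy()
    while not inside_ellipsoid(points, center, r).any():
        r *= ratio  # the volume correction would then be adjusted accordingly (see Algorithm 3)
    return r
```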

The final algorithm, including the local correction computation, reference correction computation, bootstrap testing, enlargement of ellipsoids, and computation of the entropy estimate, is summarized as Algorithm 3.

Algorithm 2 Reference correction
Require: D: dimension; k: number of neighbors; ε: radius
Ensure: \Delta \hat{V}_u: acceptance variable
1: Generate random samples U = {u_1, \ldots, u_{k+1}} uniformly from a D-dimensional ε-ball
2: Select the point u_j which is closest to the origin among the k + 1 random samples
return \Delta \hat{V}_u = \Delta \hat{V}(u_j, U) using Algorithm 1
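Below is a minimal Python sketch of the reference correction of Algorithm 2 and of the acceptance decision, reusing the local_volume_correction helper sketched earlier (our own helper, not the authors' implementation); uniform sampling inside the ε-ball is done by the standard trick of scaling a random direction by a radius drawn as U^(1/D), which is our own implementation choice.

```python
import numpy as np

def sample_uniform_ball(n, d, radius, rng):
    """Draw n points uniformly from a d-dimensional ball of the given radius."""
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / d)
    return directions * radii

def reference_correction(d, k, eps, rng):
    """Sketch of Algorithm 2: reference correction Delta V_hat_u."""
    u = sample_uniform_ball(k + 1, d, eps, rng)
    # Re-center on the sampled point closest to the origin of the ball.
    u_j = u[np.argmin(np.linalg.norm(u, axis=1))]
    return local_volume_correction(u_j, u, k=k)

# Acceptance test: apply the ellipsoidal correction only if it is smaller than
# what random sampling alone could produce under local uniformity, e.g.
# rng = np.random.default_rng()
# accept = local_volume_correction(x_i, X, k) < reference_correction(D, k, eps_i, rng)
```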

Bias analysis

Theorem 2. The bias E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}_{r(x)}(x) \right| \right] is bounded.

Proof. By the mean value theorem (Lebesgue 1910), for any 0 < a < b we have (\log(b) - \log(a)) / (b - a) = 1/c for some c in the open interval (a, b). Applying this inside the expectation E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}(x) \right| \right] we get the bound

E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}(x) \right| \right] \le M\, E_{X \sim p}\left[ \left| p(x) - \hat{p}(x) \right| \right]   (18)

where \hat{p}(x) is an arbitrary estimator of p(x), and M = \sup_{X \sim p} \max(1/p(x), 1/\hat{p}(x)), which is finite if p and \hat{p} are both bounded above zero inside the support of p. We make use of the result that E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right] is bounded for the classical estimator (Singh and Póczos 2016b).

Consider the classical estimator \hat{p}_{\varepsilon(x)}(x), which assumes the density is uniform inside the ball. In the asymptotic case of increasing data the estimate becomes the integral of the density over the ball,

\hat{p}_{\varepsilon(x)}(x) = \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(\xi)\, d\xi

where, by an abuse of notation, V_B denotes the volume of the ball. The bias of the estimator then becomes

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(x) - p(\xi)\, d\xi \right| \right].   (19)

Similarly, for the case of the proposed estimator, let r(x) denote the axes of the ellipsoid E(x, r(x)) around x, so that

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{r(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_E} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right].   (20)

Algorithm 3 EC-kNN
Require: X = {x_1, \ldots, x_N}: all sample points; D: dimension; k: number of neighbors
Ensure: \hat{H}_X: corrected entropy estimate
1: for each x_i do
2:   Find the k nearest neighbors, compute the distance \varepsilon_i and the volume V_i of the \varepsilon_i-ball
3:   Compute the empirical volume correction \Delta \tilde{V}_i = \Delta \hat{V}(x_i, X) (17) using Algorithm 1
4:   Compute the reference volume correction \Delta \hat{V}_u using Algorithm 2
5:   if \Delta \tilde{V}_i < \Delta \hat{V}_u then
6:     Find the number of samples inside the ellipsoid
7:     while no samples are inside the ellipsoid do: lengthen every axis by multiplying each r_d with a small ratio (1.01), and adjust \Delta \tilde{V}_i accordingly
8:     end while
9:     H_{EC\text{-}kNN}(x_i) = \psi(N) - \psi(k_i) + \log(V_i) + \log(\Delta \tilde{V}_i)
10:  else
11:    H_{EC\text{-}kNN}(x_i) = \psi(N) - \psi(k_i) + \log(V_i)
12:  end if
13: end for
return \hat{H}(X) = \frac{1}{N} \sum_{i=1}^{N} H_{EC\text{-}kNN}(x_i)

Since by construction the ellipsoid is contained within the corresponding ball, (20) becomes

E_{X \sim p}\left[ \left| \frac{V_B}{V_E} \frac{1}{V_B} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right]
\le C \cdot E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right]
\le C \cdot \left( E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right] + E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x)) \setminus E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right] \right)
= C \cdot E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right]   (21)

where C = \sup_{X \sim p} V_B / V_E is bounded above by any suitable regularization keeping the ellipsoids at non-zero volume, and the right-hand term, which is the bias of the classical estimator, is known to be bounded; thus the ellipsoid case is bounded as well.

Theorem 3. The bound of E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}_{r(x)}(x) \right| \right] is less than or equal to the bound of E_{X \sim p}\left[ \left| \log p(x) - \log \hat{p}_{\varepsilon(x)}(x) \right| \right].

Proof. We continue from (18). Assume that the probability density p is second-order differentiable. Then, with a Taylor expansion, the density bias term of the classical estimator can be approximated as

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(x) - p(\xi)\, d\xi \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{2 V_B} \int_{\xi \in B(x, \varepsilon(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right| \right] + B   (22)

where, again by an abuse of notation, V_B denotes the volume of the ball and B denotes the remainder term of the Taylor series; we assume B is small enough to be neglected. Similarly, for the case of the proposed estimator, let r(x) again denote the axes of the ellipsoid E(x, r(x)) around x,

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{r(x)}(x) \right| \right] = E_{X \sim p}\left[ \left| \frac{1}{V_E} \int_{\xi \in E(x, r(x))} p(x) - p(\xi)\, d\xi \right| \right]
= E_{X \sim p}\left[ \frac{1}{V_E} \int_{\xi \in E(x, r(x))} \nabla p(x)^\top (\xi - x)\, d\xi \right] + E_{X \sim p}\left[ \frac{1}{2 V_E} \int_{\xi \in E(x, r(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right] + E   (23)

where V_E denotes the volume of the ellipsoid and, again, the remainder E is assumed negligible. Due to the symmetry of the ellipsoid, E_{X \sim p}\left[ \frac{1}{V_E} \int_{\xi \in E(x, r(x))} \nabla p(x)^\top (\xi - x)\, d\xi \right] = 0.

Hence,

E_{X \sim p}\left[ \left| p(x) - \hat{p}_{r(x)}(x) \right| \right] \approx E_{X \sim p}\left[ \frac{1}{2 V_E} \int_{\xi \in E(x, r(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right]
= E_{X \sim p}\left[ \frac{1}{2 V_E} \int \cdots \int_{\sum_{d=1}^{D} (\xi_d - x_d)^2 / r_d(x)^2 \le 1} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right]
= E_{X \sim p}\left[ \frac{\prod_d r_d(x)}{2 V_E} \int \cdots \int_{u^\top u \le 1} (r(x) u)^\top H(x) (r(x) u)\, du \right]
\le E_{X \sim p}\left[ \frac{\varepsilon(x)^D}{2 V_B} \int \cdots \int_{u^\top u \le 1} (\varepsilon(x) u)^\top H(x) (\varepsilon(x) u)\, du \right]
= E_{X \sim p}\left[ \frac{1}{2 V_B} \int_{\xi \in B(x, \varepsilon(x))} (\xi - x)^\top H(x) (\xi - x)\, d\xi \right]
\approx E_{X \sim p}\left[ \frac{1}{V_B} \int_{\xi \in B(x, \varepsilon(x))} p(x) - p(\xi)\, d\xi \right]
= E_{X \sim p}\left[ \left| p(x) - \hat{p}_{\varepsilon(x)}(x) \right| \right]   (24)

where \approx denotes the Taylor approximation and we use the fact that r_d(x) \le \varepsilon(x) for all r_d(x) \in r(x). With the proven inequality, the mean value theorem indicates that the asymptotic bias bound of the proposed estimator is less than or equal to the bias bound of the classical estimator.

Related Works

Local PCA approximation

Several approaches based on local PCA approximation (Gao, Ver Steeg, and Galstyan 2015; Lord, Sun, and Bollt 2018) have been developed. Gao et al. proposed the LNC (Local Nonuniformity Correction) estimator, which comprises a local PCA based correction and a test procedure. However, the testing procedure depends heavily on an arbitrary "small value" α, and how to determine α is not addressed, whereas our proposed method requires no hyper-parameter for the testing procedure.

Likewise, Lord et al. proposed a similar approach that re-centers the local points and applies SVD to estimate the local volume. Nonetheless, there is no testing or filtering procedure in their method, which might lead to over-correction. Moreover, in their paper only the estimation of mutual information is performed; it is possible that errors resulting from over-correction cancel out during that computation. Also note that neither of the above-mentioned works applied their methods to realistically high dimensional cases: the highest dimension used to evaluate LNC is D = 3, whereas in Lord et al.'s work the highest dimension is D = 4.

Local Gaussian approximation

Approaches based on local Gaussian approximation have been proposed (Gao, Steeg, and Galstyan 2015; Lord, Sun, and Bollt 2018). One approach, kNN-bw (bandwidth) (Gao, Oh, and Viswanath 2016), utilizes a local Gaussian kernel to determine the bandwidth of the kNN estimator; as this method is the only one in this group with a public implementation from its authors, we chose it as the representative for this group of approaches.

Simulation Experiments

We carry out several experiments comparing our estimator EC-kNN to other methods in two tasks, entropy estimation and mutual information estimation.¹ For both tasks we compare to state-of-the-art alternatives, using implementations from their respective authors with default values. We also compare to the baseline kNN estimator using several values of k, whereas for our method EC-kNN, k is simply fixed to 25 in every dimension, which already turns out to work well.

We next describe the data distributions used in the experiments, and then describe the entropy estimation and mutual information tasks along with their respective results.

¹ An R implementation can be found in the online repository https://github.com/hummmblelu/eckNN

Designed Cases

To verify the usability of the proposed approach, three cases are set up, ranging from simple to complicated; details are as follows, with a data-generation sketch given after the list.

1. Symmetric Gaussian. Samples are generated from a D-dimensional Gaussian distribution N(\mu, \Sigma). The mean vector \mu is set to the zero vector

\mu = [0, 0, \ldots, 0]   (25)

and the covariance matrix \Sigma is generated as

\Sigma = s s^\top + \mathrm{diag}(1, \ldots, 1)   (26)

where s \sim N(0, \mathrm{diag}(3, 3, \ldots, 3)). This generation of \Sigma creates random dependence between dimensions.

2. Asymmetric Gaussian. Similar to the previous case, the mean vector \mu is set to the zero vector, but the covariance matrix \Sigma is generated as

\Sigma = s s^\top + \mathrm{diag}(1, \ldots, D)   (27)

where s \sim N(0, \mathrm{diag}(3, 3, \ldots, 3)). When the dimensionality D grows, the variation increases accordingly.

3. Mixture Gaussian. A mixture of two Gaussian distributions is selected to evaluate the performance of the proposed method in a scenario more complicated than the previous two cases. The probability density function of a mixture of two Gaussian distributions is

p(x) = \pi\, p(x \mid \mu_1, \Sigma_1) + (1 - \pi)\, p(x \mid \mu_2, \Sigma_2)   (28)

where \mu_1 and \mu_2 are the mean vectors, \Sigma_1 and \Sigma_2 are the covariance matrices of the two Gaussian distributions, and \pi is the mixture weight. Here we set \pi to 0.4,

\mu_1 = [0, 0, \ldots, 0]   (29)
\mu_2 = [10, 10, \ldots, 10]   (30)
\Sigma_1 = s_1 s_1^\top + \mathrm{diag}(1, \ldots, 1)   (31)
\Sigma_2 = s_2 s_2^\top + \mathrm{diag}(1, \ldots, D)   (32)

where s_1 and s_2 are sampled from N(0, \mathrm{diag}(3, 3, \ldots, 3)).
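The following is a minimal sketch of how the three designed cases can be simulated, under our reading of Equations (25)-(32); the function name make_case and the use of numpy's random generator are our own illustrative choices.

```python
import numpy as np

def make_case(case, d, n, rng):
    """Draw n samples from one of the designed cases ('symmetric', 'asymmetric',
    'mixture') in dimension d, following Equations (25)-(32)."""
    def random_cov(diag):
        # s ~ N(0, diag(3, ..., 3)), so each component has variance 3.
        s = rng.normal(scale=np.sqrt(3.0), size=(d, 1))
        return s @ s.T + np.diag(diag)

    if case == "symmetric":
        return rng.multivariate_normal(np.zeros(d), random_cov(np.ones(d)), size=n)
    if case == "asymmetric":
        return rng.multivariate_normal(np.zeros(d), random_cov(np.arange(1, d + 1)), size=n)
    # Mixture of two Gaussians with mixture weight pi = 0.4.
    cov1 = random_cov(np.ones(d))
    cov2 = random_cov(np.arange(1, d + 1))
    z = rng.uniform(size=n) < 0.4
    x1 = rng.multivariate_normal(np.zeros(d), cov1, size=n)
    x2 = rng.multivariate_normal(np.full(d, 10.0), cov2, size=n)
    return np.where(z[:, None], x1, x2)
```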

Entropy Estimation

We compare our approach (EC-kNN) with the classical kNN estimator (Kozachenko and Leonenko 1987) and the kNN-bw estimator (Gao, Oh, and Viswanath 2016)² on the three cases. The ground truth entropy of the first two cases can be obtained analytically as

H(X) = \frac{1}{2} \log \det(2 \pi e \Sigma);   (33)

for the mixture Gaussian case, since the entropy cannot be obtained analytically, Monte Carlo estimation with 100000 samples drawn from the distribution is employed.
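As a sketch of how such ground truth values can be computed, the analytic Gaussian entropy of Equation (33) and a Monte Carlo estimate for the mixture case are shown below; the function names are our own, and the mixture estimator assumes scipy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_entropy(cov):
    """Equation (33): differential entropy of N(mu, cov) in nats."""
    return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * cov)[1]

def mixture_entropy_mc(mu1, cov1, mu2, cov2, pi=0.4, n=100000, rng=None):
    """Monte Carlo estimate of the mixture entropy, -E[log p(x)] over mixture samples."""
    rng = rng or np.random.default_rng(0)
    z = rng.uniform(size=n) < pi
    x = np.where(z[:, None],
                 rng.multivariate_normal(mu1, cov1, size=n),
                 rng.multivariate_normal(mu2, cov2, size=n))
    p = (pi * multivariate_normal.pdf(x, mu1, cov1)
         + (1 - pi) * multivariate_normal.pdf(x, mu2, cov2))
    return -np.mean(np.log(p))
```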

For each case, each dimension and each method, we repeat the experiment 10 times, simulating 1000 samples and computing the estimate to compare with the ground truth entropy. The average root mean square error (RMSE) is taken as the performance measure. The results can be found in Figure 2. EC-kNN outperforms both the classical kNN and kNN-bw, especially in high dimensionalities.

² We directly use the program from https://github.com/wgao9/lnn; input parameters are set to the default values.


Figure 2: Performance comparisons for entropy estimation (average RMSE). (a) Symmetric Gaussian case; (b) asymmetric Gaussian case; (c) mixture Gaussian case. EC-kNN outperforms the other approaches in all cases when dimensionality becomes higher.

Figure 3: Performance comparisons for mutual information estimation. In order to fit the LNC dotted line (grey) into the plot, we apply a logarithmic scale on the y axis above approximately 50. (a) Symmetric Gaussian case; (b) asymmetric Gaussian case; (c) mixture Gaussian case. The proposed EC-kNN outperforms the other approaches in all cases when dimensionality becomes higher.

Mutual Information Estimation

As a multivariate extension of mutual information for a random vector X = [X_1, \ldots, X_D], where X_d denotes the d-th component random variable, we adopt the definition

\sum_{d=1}^{D} H(X_d) - H(X).   (34)

This quantity is also called the total correlation or multi-information (Van de Cruys 2011), and it is also used in one of the above-mentioned works (Gao, Ver Steeg, and Galstyan 2015).
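Estimating Equation (34) then only requires applying an entropy estimator to the joint sample and to each marginal; a minimal sketch, reusing the knn_entropy helper sketched earlier (our own helper, not the authors' implementation):

```python
import numpy as np

def total_correlation(x, entropy_estimator, **kwargs):
    """Equation (34): sum of marginal entropies minus the joint entropy."""
    marginals = sum(entropy_estimator(x[:, [d]], **kwargs) for d in range(x.shape[1]))
    return marginals - entropy_estimator(x, **kwargs)

# Example usage (hypothetical): total_correlation(samples, knn_entropy, k=25)
```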

The ground truth value of mutual information for the first two cases can be obtained analytically using Equation (33). For the third case (mixture Gaussian), the ground truth value is obtained using Monte Carlo integration with 100000 samples. We compare our approach with the kNN estimator, the KSG estimator, the LNC³ estimator and the kNN-bw estimator.

Again, for each case, each dimension and each method, we repeat the experiment 10 times, simulating 1000 samples and computing the estimate to compare with the ground truth mutual information. The average root mean square error (RMSE) is taken as the performance measure. The results can be found in Figure 3. EC-kNN again outperforms both the classical kNN and the other alternatives, especially in high dimensionalities.

³ We use the program from https://github.com/BiuBiuBiLL/NPEET_LNC for the computations of KSG and LNC; input parameters are set to the default values.

Conclusions and Discussions

The contribution of this paper is a novel approach for entropy estimation, the EC-kNN estimator, which reduces the bias of the kNN estimator. The proposed EC-kNN comprises local PCA learning and a bootstrap style correction acceptance procedure, which together address the bias resulting from the uniformity assumption.

The advantage of EC-kNN has been shown to be prominent especially when the data set is high-dimensional and complicated. The experiments imply that the local ellipsoidal correction and the bootstrap type acceptance procedure can properly capture the local uniformity region.

Our approach suggests several interesting directions for future work. Our method with a fixed value of k already outperformed the alternatives, and performance with an optimized k could be even better. Secondly, we proved a bias bound here; additional bounds on the variance of the estimator would also be valuable.

Acknowledgments

The work was supported by Academy of Finland decisions 312395 and 313748.


References

Chauveau, D., and Vandekerkhove, P. 2014. The nearest neighbor entropy estimate: an adequate tool for adaptive MCMC evaluation. Technical report, HAL. https://hal.archives-ouvertes.fr/hal-0106808.

Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory, 2nd edition. Wiley-Interscience: NJ.

Gao, W.; Oh, S.; and Viswanath, P. 2016. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation. In Advances in Neural Information Processing Systems, 2460–2468.

Gao, W.; Oh, S.; and Viswanath, P. 2018. Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory.

Gao, S.; Steeg, G. V.; and Galstyan, A. 2015. Estimating mutual information by local Gaussian approximation. In Uncertainty in Artificial Intelligence, 278–285.

Gao, S.; Ver Steeg, G.; and Galstyan, A. 2015. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, 277–286.

Goria, M. N.; Leonenko, N. N.; Mergel, V. V.; and Novi Inverardi, P. L. 2005. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics 17(3):277–297.

Kozachenko, L., and Leonenko, N. N. 1987. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii 23(2):9–16.

Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E 69(6):066138.

Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22(1):79–86.

Lebesgue, H. 1910. Sur l'intégration des fonctions discontinues. In Annales scientifiques de l'École normale supérieure, volume 27, 361–450.

Lombardi, D., and Pant, S. 2016. Nonparametric k-nearest-neighbor entropy estimator. Physical Review E 93(1):013310.

Lord, W. M.; Sun, J.; and Bollt, E. M. 2018. Geometric k-nearest neighbor estimation of entropy and mutual information. Chaos: An Interdisciplinary Journal of Nonlinear Science 28(3):033114.

Moon, K.; Sricharan, K.; Greenewald, K.; and Hero, A. 2018. Ensemble estimation of information divergence. Entropy 20(8):560.

Moon, K. R.; Sricharan, K.; and Hero, A. O. 2017. Ensemble estimation of mutual information. In 2017 IEEE International Symposium on Information Theory (ISIT), 3030–3034. IEEE.

Noh, Y.-K.; Sugiyama, M.; Liu, S.; Plessis, M. C.; Park, F. C.; and Lee, D. D. 2014. Bias reduction and metric learning for nearest-neighbor estimation of Kullback-Leibler divergence. In Artificial Intelligence and Statistics, 669–677.

Orava, J. 2011. K-nearest neighbour kernel density estimation, the choice of optimal k. Tatra Mountains Mathematical Publications 50(1):39–50.

Pérez-Cruz, F. 2008. Kullback-Leibler divergence estimation of continuous distributions. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, 1666–1670. IEEE.

Silverman, B. W. 2018. Density Estimation for Statistics and Data Analysis. Routledge.

Singh, S., and Póczos, B. 2014. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, 3032–3040.

Singh, S., and Póczos, B. 2016a. Analysis of k-nearest neighbor distances with application to entropy estimation. arXiv preprint arXiv:1603.08578.

Singh, S., and Póczos, B. 2016b. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, 1217–1225.

Sricharan, K.; Wei, D.; and Hero, A. O. 2013. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory 59(7):4374–4388.

Van de Cruys, T. 2011. Two multivariate generalizations of pointwise mutual information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, 16–20. Association for Computational Linguistics.

Wang, Q.; Kulkarni, S. R.; and Verdú, S. 2009. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory 55(5):2392–2405.
